ML questions

Yunrui Li
8 min read · Jul 16, 2021

This post focuses on ML fundamentals.

There are two steps to prepare for an ML interview.

Step 1: Understand the fundamentals and the theory so that we can answer these questions well in the interview.

Step 2: Implement the code so that we can explain things in detail!

Q1: Could you explain dropout?

Dropout[1]
Dropout in training mode[2]
Dropout in testing mode[3]; note that in PyTorch, test mode is simply an identity function
Code snippet implemented with torch[4]
Code snippet implemented with numpy[5]

Note:

  1. Dropout is applied to the output of the previous layer, which is then multiplied by the weights and added to the bias to form the input of the current layer; that input is passed through the activation function to produce the output of the current layer. In short, dropout is applied before the activation.
  2. Dropout behaves differently during training and testing. During training, dropout acts as a regularizer (it shrinks the weights), which helps prevent overfitting. During testing we need more stable results, so the implementation does not drop anything: if we randomly zeroed out some neurons, the output would be unstable, i.e. the same test input might sometimes produce output a and sometimes output b, which a real system cannot accept because users would think the model is inaccurate. One "compensation" scheme is to multiply each neuron's weights by the keep probability at test time, so that on average the test-time behavior roughly matches the training-time behavior.
  3. The dropout probability is a hyper-parameter that needs to be tuned.
  4. During testing, the module simply computes an identity function: whatever comes in goes out unchanged, which is equivalent to multiplying by an identity matrix. That is why we rescale the kept values during training, so that in theory dropout does not change the expected value of its input.
  5. Given dropout probability p, let us compute the expected value of the output:
    (1 - p)·x + p·0 = (1 - p)·x
    Explanation: with probability 1 - p the unit is kept and outputs x, and with probability p it is dropped and outputs 0, so the expected output of this layer is (1 - p)x. If we divide the kept values by (1 - p) during training (rescaling) and do nothing at test time, simply outputting x, then the expected output is the same x during both training and testing, which matches the requirement that test-time and training-time behavior be roughly the same (see the sketch below).
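A minimal NumPy sketch of inverted dropout under the convention above, where p is the dropout probability (this is an illustration, not the exact snippets referenced in [4] and [5]):

```python
import numpy as np

def dropout(x, p=0.5, training=True):
    """Inverted dropout: during training, zero each unit with probability p
    and rescale the kept units by 1/(1 - p); at test time, identity."""
    if not training or p == 0.0:
        return x  # test mode: pass the input through unchanged
    mask = (np.random.rand(*x.shape) > p).astype(x.dtype)  # keep with prob 1 - p
    return x * mask / (1.0 - p)  # rescale so the expected output stays x

x = np.ones((2, 4))
print(dropout(x, p=0.5, training=True))   # some entries zeroed, kept ones doubled
print(dropout(x, p=0.5, training=False))  # returned unchanged
```

PyTorch's torch.nn.Dropout(p) behaves the same way: in training mode it zeroes elements with probability p and rescales the rest by 1/(1 - p); in eval mode it is the identity function.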

[1].https://www.jianshu.com/p/15ce88117075

[2].http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.388.797&rep=rep1&type=pdf

[3].https://zhuanlan.zhihu.com/p/38200980

[4].https://colab.research.google.com/notebooks/intro.ipynb?utm_source=scs-index#scrollTo=2LEv1Cgu_O0g

Q2: Could you explain k-nearest neighbor (k-NN)?

Q3: Could you explain ROC curve?

Q4: Could you explain bagging? [5]

Bagging, short for bootstrap aggregating, is designed to improve the stability and accuracy of ML algorithms. It reduces variance and helps avoid overfitting.

Given a dataset, instead of training one classifier on the entire dataset, you sample with replacement to create different datasets, called bootstraps, and train a classification or regression model on each of these bootstraps. Sampling with replacement ensures each bootstrap is independent of its peers.

If the problem is classification, the final prediction is decided by the majority vote of all models. For example, if 10 classifiers vote SPAM and 6 models vote NOT SPAM, the final prediction is SPAM.

If the problem is regression, the final prediction is the average of all models’ predictions.

A random forest is an example of bagging. A random forest is a collection of decision trees constructed with both bagging and feature randomness: each tree can pick only from a random subset of features when splitting.

Due to its ensembling nature, random forests correct for decision trees’ overfitting to their training set.
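As a rough illustration of the two previous paragraphs (assuming scikit-learn is available; the dataset and hyper-parameters here are made up for the example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: each tree sees a bootstrap sample; predictions are combined by majority vote.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bag.fit(X_train, y_train)
print("bagged trees:", bag.score(X_test, y_test))

# Random forest = bagging + feature randomness: each split considers only a
# random subset of the features.
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)
print("random forest:", rf.score(X_test, y_test))
```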

Q5: Could you explain boosting? [5]

Boosting is a family of iterative ensemble algorithms that convert weak learners to strong ones. Each learner in this ensemble is trained on the same set of samples but the samples are weighted differently among iterations. Thus, future weak learners focus more on the examples that previous weak learners misclassified.

  1. You start by training the first weak classifier on the original dataset.
  2. Samples are reweighted based on how well the first classifier classifies them, e.g. misclassified samples are given higher weight.
  3. Train the second classifier on this reweighted dataset. Your ensemble now consists of the first and the second classifiers.
  4. Samples are weighted based on how well the ensemble classifies them.
  5. Train the third classifier on this reweighted dataset. Add the third classifier to the ensemble.
  6. Repeat for as many iterations as needed.
  7. Form the final strong classifier as a weighted combination of the existing classifiers; classifiers with smaller training errors get higher weights (see the sketch below).
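A minimal sketch of this loop, which is essentially AdaBoost with decision stumps (assuming scikit-learn; the dataset and number of rounds are arbitrary, and the exact reweighting formula is discussed under the AdaBoost question below):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
y = np.where(y == 1, 1, -1)                    # binary labels in {-1, +1}

n_rounds = 10
w = np.full(len(y), 1.0 / len(y))              # step 1: uniform sample weights
learners, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)           # train on the reweighted data
    pred = stump.predict(X)
    err = w[pred != y].sum() / w.sum()         # weighted error of this learner
    alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
    w *= np.exp(-alpha * y * pred)             # upweight misclassified samples
    w /= w.sum()                               # renormalize the weights
    learners.append(stump)
    alphas.append(alpha)

# Final strong classifier: sign of the alpha-weighted votes
scores = sum(a * h.predict(X) for a, h in zip(alphas, learners))
print("training accuracy:", np.mean(np.sign(scores) == y))
```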

AdaBoost and XGBoost are famous examples of boosting frameworks.

A common follow-up question is the difference and connection between GBDT and XGBoost. One important answer: GBDT only uses first-order derivative information of the cost function during training, while XGBoost takes a second-order Taylor expansion of the cost function and can use both first- and second-order derivatives.

Why do we give misclassified samples a higher weight?

1. By AdaBoost's definition, the error is the (weighted) sum over the misclassified samples:

It is defined this way because the classifier in the next round will try to drive this error down.

2. In an actual implementation, we do the re-weighting by changing the loss function.

How do we make an ML algorithm treat training data with different weights differently? By changing the loss function: data with a higher weight incurs a higher loss when it is misclassified. This way, the algorithm will learn to get those samples right in order to reduce the overall loss.
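One way to write this down (notation mine: w_i is the weight of sample i and ℓ is the per-sample loss):

```latex
L(f) = \sum_{i=1}^{N} w_i \,\ell\big(y_i, f(x_i)\big)
```

Samples with a larger w_i contribute more to the total loss, so the learner is pushed to get them right first.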

Q6: Why does ensembling improve the stability and accuracy of ML algorithms? [6]

Think of the error produced by each base model in the ensemble as a random variable X:

For a refresher on random variables, see the CS229 notes [7][8].

For a random variable X, we usually care about its expected value and standard deviation.

A random variable has n possible values x, each with a corresponding probability p
The variance is the expected value of the squared deviation (x - m)², where m is the mean

If two random variables are independent, their joint probability distribution equals the product of the two marginal probability distributions. Otherwise they are correlated, and we compute the covariance between them.

Two random variables are either independent or correlated.

Independent and identically distributed (i.i.d.)

Thinking about it from the bias-variance perspective:

Now, if we imagine that each random variable (X) is the error of a given model, we can see that as the number of models used grows (making the second term vanish) and the correlation between models decreases (making the first term vanish and recovering the i.i.d. setting), the overall variance of the ensemble's error goes down.

Variance of the errors when they are correlated with each other (the realistic situation); setting ρ = 0 gives the expression below
Variance of the errors when they are independent of each other
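The formulas behind those two captions are, for n base models whose errors each have variance σ² and pairwise correlation ρ (a standard result; the notation in the original figures may differ):

```latex
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{n}\,\sigma^{2}
  \;\xrightarrow{\ \rho\,=\,0\ }\;
  \frac{\sigma^{2}}{n}
```

As n grows the second term vanishes, and as ρ shrinks the first term vanishes, which is exactly the argument in the paragraph above.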

We use a model's bias and variance to describe its accuracy on the training set and its ability to resist overfitting.

The randomness of sampling introduces randomness into the models.

In both the bagging and boosting frameworks, by computing the expectation and variance of the base models' errors X, we can derive the expectation and variance of the overall model.

For boosting,

  1. the bias of the base model keeps decreasing
  2. the model is built additively

Each iteration re-weights the samples according to the previous iteration's predictions, so as the iterations go on the error keeps shrinking and the bias of the model keeps decreasing. However, because each model depends strongly on the previous one, the variance increases slightly.

Q7: What are some of the fundamental differences between bagging and boosting algorithms?

We can look at the differences from several angles:

  1. How the training data is generated

An ensemble uses some mechanism to generate different training data for the base learners, so that they differ from one another.

Bagging does this by re-sampling, while boosting does it by re-weighting.

  2. The relationship between the classifiers in the ensemble
  3. How the classifiers in the ensemble are trained
  4. The characteristics of the base learners / weak learners
difference between bagging and boosting[4]

The base learners used in bagging are low bias, high variance.

The base learners used in boosting are high bias, low variance.

For example:

Random forest uses decision trees as base learners; a decision tree is a complex model, so the base learner in bagging tends to overfit, i.e. low bias and high variance (unstable).

AdaBoost uses decision stumps as base learners; a decision stump is a very simple model, just a single-split classifier, so the base learner in boosting tends to underfit, i.e. high bias and low variance (stable).

Q8: What is AdaBoost?

weight: every training example has a weight. Initially they are all equal, 1/N, where N = total number of training examples.

weighted error: the error rate normalized by the sample weights; the numerator is the total weight of the misclassified training examples.

alpha: the performance of each classifier, calculated from the negative log-odds of the weighted error.

re-weight rule (how to re-weight the training data based on the previous weak learner)

See the figure below to understand why the re-weight rule takes the form weight * exp(-y_hat * h * alpha).

Basically,

for a misclassified example, we multiply its weight by a value d_t, which increases its original weight;

for a correctly classified example, we divide its weight by d_t, which decreases its original weight.

After some derivation, this d_t turns out to be a function of alpha.

https://www.youtube.com/watch?v=tH9FH1DH5n0&t=2558s&ab_channel=Hung-yiLee

model output: additive modeling, weighted by alpha, then take the sign function (since this example is binary classification, the sign function maps the output into {-1, 1}).

alpha

sign function (the function that returns the sign of its input)
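Putting the pieces above together in one place (a standard formulation with binary labels y_i ∈ {-1, +1} and weak learners h_t; the notation may differ slightly from the original figures):

```latex
\varepsilon_t = \frac{\sum_{i:\,h_t(x_i)\neq y_i} w_i}{\sum_i w_i},\qquad
\alpha_t = \tfrac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t},\qquad
w_i \leftarrow w_i\,e^{-\alpha_t\,y_i\,h_t(x_i)},\qquad
H(x) = \operatorname{sign}\Big(\sum_{t=1}^{T}\alpha_t\,h_t(x)\Big)
```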

In fact, training AdaBoost is equivalent to minimizing an exponential loss.

Why AdaBoost can theoretically guarantee a training error close to 0 when T is large enough (because the error bound is a product of factors between 0 and 1):
https://zhuanlan.zhihu.com/p/58883095

Q9: What is gradient boosting?

Gradient Boosting produces a prediction model typically from weak decision trees. It builds the model in a stage-wise fashion as other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

The way it optimizes the loss is gradient descent!

So, gradient boosting is also called forward stage-wise additive modeling.

The final model is the function obtained after T iterations.

The core idea of gradient boosting is borrowed from gradient descent: at each step it trains a new weak learner on the negative gradient of the current model's loss function, and then adds the trained weak learner to the existing model in an additive fashion. Gradient boosting that uses decision trees as the weak learners is called GBDT. [9]

In gradient boosting, each weak learner is fit to the negative gradient of a loss function with respect to y_hat, the previous model's output.

Why is this a boosting framework?
The negative gradient is also called the "response" or "pseudo residual", and as the name suggests it is a concept close to the residual. Intuitively, the larger the residual r = y − f(x), the further the previous learner f(x) is from the true value y; by fitting the residual or the negative gradient, the next learner corrects the places where the previous learners made the largest mistakes.

In practice, we fix the learning rate to keep things simple, rather than searching at each step for the best learning rate that makes the loss as small as possible.

Comparing with the Taylor series, we can observe that
we build a weak classifier to fit the negative gradients (mimicking the Taylor expansion); the resulting update is exactly the same as gradient descent. XGBoost uses second-order information to get a more precise approximation.

pseudo code of gradient boosting framework
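In symbols, the framework in that pseudo code can be summarized as follows (a reconstruction, since the original pseudo code was an image): at iteration m, compute the pseudo residuals, fit a weak learner h_m to them, and take a gradient step in function space with the fixed learning rate ν mentioned above:

```latex
r_{im} = -\left[\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F = F_{m-1}},
\qquad
F_m(x) = F_{m-1}(x) + \nu\, h_m(x)
```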

Why are the gradients in the gradient boosting framework also called pseudo residuals?

Hence for a general loss function, gradient boosting requires that we fit the gradients at each step; for mean squared loss, the gradient just happens to be the residuals. So whilst it seems somewhat intuitive, it is also slightly misleading to think of gradient boosting as fitting residuals (or even pseudo residuals, as the gradients are often named); it is actually gradient descent in function space [11].

Gradient descent vs. gradient boosting

Gradient descent takes the derivative of the loss with respect to the model parameters and updates the parameters directly; gradient boosting takes the derivative of the loss with respect to the model's output and adds a new weak learner that fits this negative gradient, i.e. it performs gradient descent in function space.

Q10: Could you explain GBDT?

We take a decision stump, a decision tree with one level, as the weak learner of gradient boosting.

3 steps:

  1. Train a base model that outputs a prediction y_hat
  2. Compute the residual error
  3. Sequentially construct decision trees with the same features but different labels (we take the previous residual error as the label)

Why do we take the residual error as the label for each weak learner?

In gradient boosting, the negative gradient is used as the measure of how badly the base learner of the previous round erred; in the next round, the learner corrects the previous round's mistakes by fitting the negative gradient. The key question here is: why does fitting the negative gradient correct the previous round's errors?

In short, gradient boosting is a gradient descent algorithm that can be generalized simply by changing the loss function and how its gradient is computed.

pseudo algorithm of gradient boosting decision tree[1]

Step 1: Find the initial value of the model. This initial value is determined by minimizing the loss function we defined, which is a function of the label and y_hat. To find the minimum of this function, we find the point where its first-order derivative with respect to the model output is zero.
Note:

  1. gamma is nothing but the predicted value y_hat (the model output is y_hat, the predicted value)
  2. In the GBDT framework, one advantage is that the loss function is flexible, which lets us tackle various ML tasks such as classification, regression and ranking. In other words, the loss function can be any (differentiable) loss function such as MSE [3]. In machine learning, the loss function is also referred to as the objective function.
  3. The first-order derivative with respect to y_hat is the gradient.
  4. The residual is basically the gradient, i.e. the first-order derivative with respect to function(x), where function(x) = y_hat = predicted_value. After a little derivation we find that, for squared loss, the residual is exactly the negative gradient, so we can compute the gradient with a simple subtraction.
step 1: initialize the model with a constant value (y_hat = 60)
step 2: compute the pseudo residuals

Note:
r11: residual of example 1 under model 1 (we will create M models)
r21: residual of example 2 under model 1
r31: residual of example 3 under model 1

The f with subscript m-1, i.e. f_(m-1)(x_i), in step 3 is just the output of the previous model (which is already a fixed value, 60 in the example below)

what step 3 looks like when we take squared loss as our loss function
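A minimal sketch of these steps for regression with squared loss (assuming scikit-learn; the data, the learning rate and the number of rounds are made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression data
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Step 1: initialize the model with a constant (the mean minimizes squared loss)
F = np.full_like(y, y.mean())
nu, trees = 0.1, []                            # nu = fixed learning rate

for m in range(100):
    residuals = y - F                          # Step 2: pseudo residuals
                                               # (= negative gradient of squared loss)
    tree = DecisionTreeRegressor(max_depth=1)  # decision stump as weak learner
    tree.fit(X, residuals)                     # Step 3: fit the residuals as labels
    F += nu * tree.predict(X)                  # add the new weak learner to the ensemble
    trees.append(tree)

print("training MSE:", np.mean((y - F) ** 2))
```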


What problems is gradient boosting good for? [5]

Q11: Why does gradient descent work in theory?

It can be derived from a Taylor series:

We take a Taylor expansion of the loss around the current point for a small step, to find which small step makes the next loss smallest. Theoretically this step has to be infinitesimally small so that we can ignore the higher-order terms and keep only the first-order derivative. The derivation below shows that the direction of this step is exactly the opposite of the current gradient.
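Sketching that derivation (the standard first-order argument; η is a small step size):

```latex
L(\theta + \Delta\theta) \approx L(\theta) + \nabla L(\theta)^{\top}\Delta\theta
\quad\Longrightarrow\quad
\Delta\theta = -\eta\,\nabla L(\theta)
```

For a fixed small step length, the inner product ∇L(θ)ᵀΔθ is most negative when Δθ points in the direction opposite to the gradient, so moving that way decreases the loss the fastest.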

Q12: Could you explain linear regression?

For regression problems, mean squared loss / mean squared error (MSE) is a commonly used objective function. The method of solving for the model by minimizing the mean squared error is called the least squares method.

For simple linear regression, the optimal solution can be obtained in closed form by setting the first derivative to zero.
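For instance, with design matrix X and target vector y, setting the gradient of the squared loss to zero gives the normal equation (assuming XᵀX is invertible):

```latex
\hat{\theta} = \arg\min_{\theta}\,\lVert y - X\theta\rVert^{2}
\quad\Longrightarrow\quad
X^{\top}X\,\hat{\theta} = X^{\top}y
\quad\Longrightarrow\quad
\hat{\theta} = \left(X^{\top}X\right)^{-1}X^{\top}y
```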

What are the basic assumptions to be made for linear regression?

Question Reference:

https://huyenchip.com/ml-interviews-book/contents/8.1.2-questions.html
Question list:

Reference:

[1]. https://www.youtube.com/watch?v=Nol1hVtLOSg&t=391s&ab_channel=KrishNaik
[2].https://www.cnblogs.com/shiyublog/p/10992609.html
[3].https://www.cnblogs.com/shiyublog/p/10992609.html

[4].https://chih-sheng-huang821.medium.com/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-ensemble-learning%E4%B9%8Bbagging-boosting%E5%92%8Cadaboost-af031229ebc3

[5].https://huyenchip.com/ml-interviews-book/contents/8.1.1-overview:-basic-algorithm.html

[6].https://www.zhihu.com/question/29036379

[7].https://stanford.edu/~shervine/l/zh-tw/teaching/cs-229/refresher-probabilities-statistics

[8].http://math1.ck.tp.edu.tw/%E6%9E%97%E4%BF%A1%E5%AE%89/%E5%AD%B8%E8%A1%93%E7%A0%94%E7%A9%B6/%E4%B8%8A%E8%AA%B2%E8%AC%9B%E7%BE%A9/99%E8%AA%B2%E7%B6%B1/%E6%95%B8%E7%94%B2%E4%B8%8A/1-1%E9%9A%A8%E6%A9%9F%E7%9A%84%E6%84%8F%E7%BE%A9.pdf

[9].https://mp.weixin.qq.com/s?__biz=MzI5NDMzMjY1MA==&mid=2247485110&idx=1&sn=86bcdb38f51fc5f82236b35c349ada4b&scene=21#wecha%E2%80%A6

[10]. Hung-yi Lee's lecture notes, https://sakura-gh.github.io/ML-notes/

[11]. https://datascience.stackexchange.com/questions/31609/why-do-we-use-gradients-instead-of-residuals-in-gradient-boosting


Yunrui Li

I'm a Taiwanese expat. I worked in Singapore as a data scientist after graduating in Taiwan, and I currently work in Amsterdam as a machine learning engineer.