Implement Commonly Asked ML Algorithms in Interviews from Scratch

Yunrui Li
4 min read · Sep 15, 2020


This article is meant to make sure you don’t stumble on the simple stuff during an interview. For example, as a machine learning engineer, you might get an interview question like this: could you explain how K-means works?

  • Supervised Training
  1. KNN

2. Logistic Regression

In logistic regression, we predict probabilities by applying the sigmoid function to the output of a linear model. During training, the cost function is usually the cross-entropy loss.
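The description above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names (`sigmoid`, `predict_proba`, `cross_entropy`, `fit_logistic`), the learning rate, the epoch count, and the toy data are all assumptions chosen for the example.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Sigmoid applied to the linear model w . x + b."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def cross_entropy(w, b, X, y):
    """Mean binary cross-entropy over the dataset."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(w, b, xi)
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(X)

def fit_logistic(X, y, lr=0.1, epochs=1000):
    """Full-batch gradient descent on the cross-entropy loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # For sigmoid + cross-entropy, the gradient w.r.t. the
            # linear output is simply (prediction - label).
            err = predict_proba(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Hypothetical toy data: small values are class 0, large values are class 1.
X_toy = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y_toy = [0, 0, 0, 1, 1, 1]
w_fit, b_fit = fit_logistic(X_toy, y_toy)
```

After training, `predict_proba(w_fit, b_fit, [0.0])` should fall below 0.5 and `predict_proba(w_fit, b_fit, [5.0])` above it, and the loss should be lower than at the all-zero starting point.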

3. Linear Regression

In linear regression, we predict continuous values by fitting a linear function to the training data. During training, the cost function is usually the mean squared error (MSE).
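For a single feature, the line that minimizes the mean squared error has a closed-form solution, which makes for a compact sketch. The helper names `fit_line` and `mse` and the toy numbers below are assumptions made for illustration; gradient descent on the same cost would reach the same line.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

def mse(xs, ys, slope, intercept):
    """Mean squared error of the line on the given data."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical noisy samples scattered around y = 2x + 1.
xs_toy = [0.0, 1.0, 2.0, 3.0, 4.0]
ys_toy = [1.1, 2.9, 5.2, 6.8, 9.0]
slope_fit, intercept_fit = fit_line(xs_toy, ys_toy)
```

Because the fit minimizes the MSE, no other line (including the "true" `y = 2x + 1`) can score a lower error on these samples.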


A perceptron is a very simple binary classifier with a linear decision boundary. It is composed of a weighted sum of the inputs (as in linear regression) followed by an activation function. During training, the cost function is usually a hinge-style loss that penalizes only misclassified examples. In fact, a neural network with a single neuron computes the same weighted sum as linear regression; the only difference is that the neural network post-processes the weighted input with an activation function [7].
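A minimal sketch of the classic perceptron learning rule follows, assuming labels in {-1, +1} and a step activation. The names `perceptron_predict` and `fit_perceptron`, the default learning rate, and the AND-style toy data are assumptions for this example. The weights change only when an example is misclassified, which matches a loss that is zero for correctly classified points.

```python
def perceptron_predict(w, b, x):
    """Step activation applied to the weighted input."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def fit_perceptron(X, y, lr=1.0, epochs=50):
    """Perceptron learning rule; labels must be -1 or +1."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score <= 0:  # misclassified or exactly on the boundary
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
                mistakes += 1
        if mistakes == 0:  # every example is classified correctly
            break
    return w, b

# Hypothetical linearly separable data: logical AND with -1/+1 labels.
X_and = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_and = [-1, -1, -1, 1]
w_and, b_and = fit_perceptron(X_and, y_and)
```

On linearly separable data such as this, the loop stops early once a full pass produces no mistakes.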

4. Decision Tree

A decision tree is a supervised machine learning algorithm. The idea is to build a binary tree that splits the data. Unlike logistic regression, there is no need to compute gradients to determine what the best tree looks like. Instead, the algorithm uses a greedy search: at each node it tries every feature and, for each feature, every unique value as a candidate threshold, and keeps the split with the highest information gain. Entropy is the impurity measure normally used for this [8].
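The greedy search for a single node can be sketched as below. The function names (`entropy`, `information_gain`, `best_split`) and the toy data are assumptions for illustration; a full tree would apply `best_split` recursively to the left and right subsets until a stopping rule is met.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = len(parent)
    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - children

def best_split(X, y):
    """Try every feature and every unique value as a threshold."""
    best = (None, None, 0.0)  # (feature index, threshold, gain)
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue  # a split must put samples on both sides
            gain = information_gain(y, left, right)
            if gain > best[2]:
                best = (j, t, gain)
    return best

# Hypothetical data: feature 0 separates the classes, feature 1 does not.
X_tree = [[2.0, 7.0], [3.0, 1.0], [10.0, 6.0], [11.0, 2.0]]
y_tree = [0, 0, 1, 1]
feature, threshold, gain = best_split(X_tree, y_tree)
```

Here the search picks feature 0 with threshold 3.0, which separates the two classes perfectly and therefore recovers the full 1 bit of parent entropy as gain.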

5. Random Forest

6. AdaBoost (Boosting)

7. Naive Bayes

For data science roles, interviewers tend to like probability questions, so make sure you know Naive Bayes well.

Bayes’ Theorem: P(A|B) = P(B|A) * P(A) / P(B)

  • Unsupervised Training
  1. k-means clustering

2. PCA (dimension reduction)

Fundamental Statistics Concepts [6]:

  • Events can be independent, dependent, or mutually exclusive.
  • Probability is the likelihood of an event occurring.
  • Joint probability is the likelihood of more than one event occurring at the same time, P(A and B). It is the probability of the intersection of two or more events, written as P(A∩B). If event A and event B are independent of each other, P(A∩B) = P(A) * P(B).
  • If event A and event B are mutually exclusive, P(A∩B) = 0.
  • Conditional probability is the probability of event B given that event A has already occurred. It is denoted by P(B|A) and equals P(A∩B) / P(A) when P(A) > 0. If A and B are independent, P(B|A) = P(B).
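These rules can be checked by counting outcomes of a single fair six-sided die. The sketch below uses exact fractions; the helper names `prob` and `cond_prob` and the chosen events are assumptions made for the example.

```python
from fractions import Fraction

OUTCOMES = set(range(1, 7))  # one roll of a fair six-sided die

def prob(event):
    """P(event) by counting equally likely outcomes."""
    return Fraction(len(event & OUTCOMES), len(OUTCOMES))

def cond_prob(b, a):
    """P(B|A) = P(A ∩ B) / P(A); requires P(A) > 0."""
    return prob(a & b) / prob(a)

even = {2, 4, 6}
odd = {1, 3, 5}
at_most_two = {1, 2}
greater_than_three = {4, 5, 6}

# Independent: the joint probability factorizes.
joint_independent = prob(even & at_most_two)    # 1/6
product = prob(even) * prob(at_most_two)        # 1/2 * 1/3 = 1/6

# Mutually exclusive: the events cannot happen together.
joint_exclusive = prob(even & odd)              # 0

# Dependent: knowing the roll is greater than 3 changes P(even).
p_even_given_gt3 = cond_prob(even, greater_than_three)  # 2/3, not 1/2
```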

Classic Textbook Questions that might be asked during ML interview:

  • Generative models vs Discriminative models[5]
  • Can you explain what batch normalisation is?
  • Can you explain what bias and variance errors are?

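It’s a way to decompose a machine learning model’s error into bias (error from overly simple assumptions), variance (error from sensitivity to the particular training set), and irreducible noise.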

Basic terms we use a lot in ML algorithms:

  • Epochs: the number of complete passes over the entire training dataset that the ML algorithm has made. For example, one epoch means the entire training dataset has been passed through the algorithm exactly once. [1][2]
  • Batch size: because the entire dataset is often too big to feed into memory at once, which would raise a memory error, we divide it into several batches. Batch size is the number of training examples in a single batch.
  • Iterations: the number of batches needed to complete one epoch. For example, if the training dataset has 2000 examples and the batch size is 500, one epoch takes 4 iterations. If the batch size equals the whole training dataset, the number of iterations equals the number of epochs.
  • Gradient: in machine learning, this normally refers to the derivative of the cost function with respect to the parameters, such as weights and biases.
  • Cost function: tells us how good our model is at making predictions for a given set of parameters (weights and biases, for example). For gradients to exist, this function should be differentiable with respect to the parameters.
  • Gradient descent: an iterative optimization algorithm for finding a minimum of the cost. There are several optimizers built on it, such as SGD and Adam. [3]
  • Learning rate: an important hyperparameter of most ML algorithms. It determines how far we move in the negative gradient direction at each step. A smaller learning rate makes training slow but more likely to settle into a minimum. A larger learning rate makes training fast, but it may jump around and never settle into the minimum.
  • Adaptive learning rate: algorithms such as AdaGrad (which accumulates historical squared gradients) and Adam adjust the effective learning rate automatically during training, separately for each parameter, based on the history of its gradients. Put simply, we want a larger LR at the beginning of training and a smaller LR toward the end, and one LR cannot fit all parameters, so different parameters get different LRs. Reducing the LR by a fixed factor every few epochs is a related but separate technique, usually called learning rate scheduling.
  • Normalizing: an important data pre-processing step that puts all input features in the same range, which speeds up and stabilizes gradient-based training [4]. For example, we can normalize our features so that they all lie in the range -1 to 1. In practice, min-max scaling is a common choice.
  • Regularization: a way to prevent overfitting and reduce the variance of a neural network by penalizing large weights, for example with L1 or L2 penalties (the L2 penalty is often called weight decay), which makes the network more robust to noise.
  • Euclidean distance: the square root of the sum of the squared differences between the coordinates of two points.
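Several of the terms above (epochs, batch size, iterations, gradient, cost function, learning rate) come together in mini-batch gradient descent. The sketch below fits a line with an MSE cost in plain Python. The function name `minibatch_gd`, the hyperparameter values, and the synthetic data are assumptions for illustration; the 2000 examples and batch size of 500 mirror the iterations example in the list.

```python
import random

def minibatch_gd(xs, ys, lr=0.5, epochs=100, batch_size=500):
    """Fit y = w * x + b by mini-batch gradient descent on the MSE cost."""
    w, b = 0.0, 0.0
    iterations = 0
    for _ in range(epochs):                          # one epoch = one full pass
        for start in range(0, len(xs), batch_size):  # one iteration per batch
            xb = xs[start:start + batch_size]
            yb = ys[start:start + batch_size]
            m = len(xb)
            # Gradient of the batch MSE with respect to w and b.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xb, yb)) / m
            grad_b = sum(2 * (w * x + b - y) for x, y in zip(xb, yb)) / m
            # Step in the negative gradient direction, scaled by the learning rate.
            w -= lr * grad_w
            b -= lr * grad_b
            iterations += 1
    return w, b, iterations

# Hypothetical noise-free data on the line y = 3x + 1, with x in [0, 1).
rng = random.Random(0)
xs_gd = [rng.random() for _ in range(2000)]
ys_gd = [3 * x + 1 for x in xs_gd]
w_gd, b_gd, n_iter = minibatch_gd(xs_gd, ys_gd)
```

With 2000 examples and a batch size of 500, each epoch takes 4 iterations, so 100 epochs perform 400 parameter updates in total.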










[9]. Stanford Machine Learning cheat sheet:



Yunrui Li

I’m a Taiwanese expat. I worked in Singapore as a data scientist after graduating in Taiwan, and I currently work in Amsterdam as a machine learning engineer.