Problem Formulation

How does this compare with a search ranking system?

Why do we usually choose implicit data over explicit data to build a recommendation system?

- The amount of training data we can collect to build the recommendation system.
- Explicit data suffers from MNAR (missing not at random) during data collection.
- People rarely rate movies they dislike, so we won't get much information about the kinds of movies a user does not like.
- Also, movies with fewer ratings have less impact on the recommendation process.

What is a data stream?

It is a constant stream of data, for example Twitter messages or online news articles.

Usually, a data-stream problem exposes an **add function** that keeps adding data **from the stream into the data structure you choose** each time it is called. So, initially, we need a good data structure as a container to store the past stream.

- Sorting as the brute-force solution: takes **O(n log n)** every time `add` is called, which is not efficient.
- Binary search + insertion: takes **O(log n) + O(n) = O(n)**, linear time, per `add` call, which is better.
- Think about a heap as the optimal solution: takes **O(log n)** per `add` call…
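The heap approach can be sketched as follows, using the k-th-largest-in-a-stream problem as a concrete instance. The class name `StreamContainer` is hypothetical; the point is that `add` costs only O(log k) because the min-heap never grows past k elements.

```python
import heapq


class StreamContainer:
    """Keep the k largest values seen so far from a data stream.

    A minimal sketch: the heap root is always the k-th largest value,
    and each call to add() costs O(log k).
    """

    def __init__(self, k):
        self.k = k
        self.heap = []  # min-heap holding at most k elements

    def add(self, val):
        # Push the new value, then shrink back to k elements.
        heapq.heappush(self.heap, val)
        if len(self.heap) > self.k:
            heapq.heappop(self.heap)
        # The root is the k-th largest value seen so far.
        return self.heap[0]
```

For example, after adding 4, 5, 8 with k = 3, adding 2 leaves the heap unchanged and the k-th largest is still 4.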

There is a common pattern in split-string/array problems. Usually, the naive solution generates all possible splits, which takes exponential running time, while the optimal solution is, in most cases, to split greedily. This can be tricky, so I try to summarize what I learned in this post.

- 1221. Split a String in Balanced Strings

- 659. Split Array into Consecutive Subsequences
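For the first problem above (1221), the greedy idea can be sketched directly: cut the string every time the running count of 'R' and 'L' characters balances out, which is provably optimal for this problem.

```python
def balanced_string_split(s):
    """Greedily split s into the maximum number of balanced substrings
    (equal counts of 'L' and 'R'), as in problem 1221."""
    balance = 0  # +1 for 'R', -1 for 'L'
    count = 0
    for ch in s:
        balance += 1 if ch == 'R' else -1
        if balance == 0:  # a balanced prefix ends here, so cut greedily
            count += 1
    return count
```

For instance, "RLRRLLRLRL" splits greedily into "RL", "RRLL", "RL", "RL", giving 4.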

This is a very classical problem, the so-called K-th problem.

Basically, given a user, our problem boils down to finding the most recent K items (or the top K items under some priority) from a data stream.

**Sorting+get top k data points**

So, the naive approach is to use **sorting**: find all tweets from the users that this user follows, sort them by creation time in descending order, and return the top K.

In this way, the advantage is that it is intuitive and easy to implement. But the disadvantage is that it is not efficient in terms of time complexity, which is **O(N log N)**. …
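A heap improves this to O(N log K): instead of sorting all N tweets, keep only the K most recent in a bounded min-heap. A minimal sketch, assuming tweets are `(created_at, tweet_id)` tuples:

```python
import heapq


def top_k_recent(tweets, k):
    """Return the k most recent tweets in O(N log k) instead of the
    O(N log N) full sort. Tweets are (created_at, tweet_id) tuples,
    so the largest timestamps are the most recent."""
    return heapq.nlargest(k, tweets)
```

`heapq.nlargest` internally maintains a size-k min-heap over the stream of tweets, which is exactly the bounded-heap pattern from the data-stream section above.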

Case study: you’re asked to design a search relevance system for a search engine.

**Problem Clarifications**:

3 aspects:

- scope
- scale
- personalisation or not?

**Metrics**:

- Online metrics:
- Offline metrics:

relevance comes from ground truth

Note:

- In order to support personalisation, we may need the user's historical search data and know whether he/she is a logged-in user.
- NDCG: https://zhuanlan.zhihu.com/p/136199536
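Since NDCG is the key offline metric here, a small sketch of how it is computed may help (standard textbook definition, not specific to any library): DCG discounts each relevance score by the log of its rank, and NDCG normalizes by the DCG of the ideal ordering.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance scores:
    rank i contributes rel_i / log2(i + 2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(ranked, ideal):
    """NDCG = DCG of the system's ranking / DCG of the ideal ranking,
    so a perfect ranking scores exactly 1.0."""
    denom = dcg(sorted(ideal, reverse=True))
    return dcg(ranked) / denom if denom > 0 else 0.0
```

A perfectly ordered result list gives NDCG = 1.0; any swap that places a less relevant document higher pushes the score below 1.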

**Architectural components:**

Besides metrics such as AUC, precision, and recall, we also need to ensure that we meet the capacity and performance requirements at the same time.

And, *performance* and *capacity* are the most important things to think about when designing an ML system. A **performance**-based SLA ensures that we return the results within a given time frame (e.g. 500 ms) for 99% of queries. **Capacity** refers to the load that our system can handle, e.g., the system can support 1000 QPS (queries per second).

- Metrics can be divided along several axes:
    - **component-wise** vs **end-to-end**
    - **offline** vs **online**
    - **performance** vs **capacity**

…

This article is to make sure that when you use a heap library in a real interview and the interviewer asks you about its details, you won't fail.

A **priority queue (PQ)** is implemented using a **binary heap**.

A binary heap:

- is a **complete binary tree**;
- can be a **max heap** or a **min heap**, where parent and child follow the relationship shown in the figure;
- is represented by an array.

There are mainly two methods:

1. Insert, using heapify up.

- Step 1: Append the inserted value at the end of the heap array.
- Step 2: Heapify up until it reaches the right position.

2. Delete

- Step 1: Swap the first value with the last value in the heap array.
- Step 2: Pop the last element (the old root) off the heap array.
- Step 3: Heapify down from the root until it reaches the right position.
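The two methods above can be sketched as a minimal array-backed min-heap (a from-scratch illustration, not the library implementation):

```python
class MinHeap:
    """Array-backed binary min-heap sketching the two operations above:
    insert via heapify-up, delete via heapify-down.
    Parent of index i is (i - 1) // 2; children are 2i + 1 and 2i + 2."""

    def __init__(self):
        self.a = []

    def insert(self, val):
        # Step 1: append the inserted value at the end of the heap array.
        self.a.append(val)
        # Step 2: heapify up until it reaches the right position.
        i = len(self.a) - 1
        while i > 0 and self.a[i] < self.a[(i - 1) // 2]:
            parent = (i - 1) // 2
            self.a[i], self.a[parent] = self.a[parent], self.a[i]
            i = parent

    def delete_min(self):
        # Step 1: swap the root with the last element.
        self.a[0], self.a[-1] = self.a[-1], self.a[0]
        # Step 2: pop the old root off the end of the array.
        top = self.a.pop()
        # Step 3: heapify down from the root until it is in place.
        i, n = 0, len(self.a)
        while True:
            left, right, smallest = 2 * i + 1, 2 * i + 2, i
            if left < n and self.a[left] < self.a[smallest]:
                smallest = left
            if right < n and self.a[right] < self.a[smallest]:
                smallest = right
            if smallest == i:
                break
            self.a[i], self.a[smallest] = self.a[smallest], self.a[i]
            i = smallest
        return top
```

In practice you would use Python's `heapq`, but walking through this version covers exactly the interviewer follow-ups the note warns about.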

2-D matrix problems:

1. Think of each cell as a point on the x-y coordinate plane, represented by its index.

2. For a binary matrix, consider sparse-matrix techniques: store only the indices whose value is 1.
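The second tip can be sketched in a couple of lines: represent the binary matrix as a set of (row, col) indices, which saves space when 1s are rare and makes membership checks O(1).

```python
def to_sparse(matrix):
    """Store a binary matrix as the set of (row, col) indices whose
    value is 1 -- the usual sparse representation for binary matrices."""
    return {(i, j)
            for i, row in enumerate(matrix)
            for j, val in enumerate(row)
            if val == 1}
```

For example, `[[1, 0], [0, 1]]` becomes `{(0, 0), (1, 1)}`, and checking whether a cell is set is just a set lookup.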

- 48. Rotate Image

- 5. Longest Palindromic Substring

I'm a Taiwanese expat. I worked in Singapore as a data scientist after graduating in Taiwan, and I currently work in Amsterdam as a machine learning engineer.