The Complete Magazine on Open Source

CodeSport

In this month’s column, we feature a set of interview questions on data science.

Since quite a few of our student readers have asked me to feature data science interview questions, let’s look at a few of them in this month’s column.

1. Since most data science practitioners use Python extensively, it is worth brushing up on your Python coding skills when preparing for data science interviews. You should also understand, and be able to explain, the advantages and disadvantages of an interpreted language such as Python compared to a natively compiled language like C.
Frequently, a candidate is asked to write Python code to generate the prime numbers that lie between 1 and 10,000. You are first asked to write a single-threaded version of the code. You are then asked whether you can come up with a multi-threaded version that is faster, given that the code runs on a multi-core machine. This is where the tricky part of the question lies. If you have written multi-threaded Python code to compute all the prime numbers between 1 and 10,000 and you run it on a 4-core machine, how much faster do you expect it to be compared to the single-threaded version? The clue is to think about CPython’s Global Interpreter Lock. As a bonus question, the interviewer could ask, “Given that your favourite language is still Python, and hence you would not like to rewrite the multi-threaded version in C, how can you make sure that your multi-threaded Python code is actually faster than the single-threaded code on the multi-core machine?”
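
One possible direction for the bonus part of question (1) is sketched below. It is not the only answer. It assumes CPython with the Global Interpreter Lock enabled, where threads running pure-Python, CPU-bound code do not execute in parallel, and it sidesteps the lock by using separate processes from the standard library’s concurrent.futures module. The helper names (is_prime, primes_in_range, split_range, primes_parallel) are my own and purely illustrative.

```python
from concurrent.futures import ProcessPoolExecutor
from math import isqrt


def is_prime(n):
    """Trial division by odd numbers up to the integer square root of n."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for d in range(3, isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True


def primes_in_range(bounds):
    """Single-threaded version: primes in the half-open range [lo, hi)."""
    lo, hi = bounds
    return [n for n in range(lo, hi) if is_prime(n)]


def split_range(lo, hi, parts):
    """Split [lo, hi) into at most `parts` contiguous chunks (assumes hi > lo)."""
    step = -(-(hi - lo) // parts)  # ceiling division
    return [(s, min(s + step, hi)) for s in range(lo, hi, step)]


def primes_parallel(lo, hi, workers=4):
    """Run each chunk in its own process; every process has its own GIL."""
    chunks = split_range(lo, hi, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(primes_in_range, chunks)  # preserves chunk order
    return [p for chunk in results for p in chunk]
```

In a script, primes_parallel(1, 10001) should be called from under an if __name__ == "__main__": guard, since some platforms start worker processes by re-importing the main module. For a range as small as 1 to 10,000, the cost of starting processes may well outweigh the gain, which is itself a good point to raise in the interview.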

2. You are given a dataset containing the weekly rainfall recorded in Bengaluru over the last 20 years. Each record contains the week, the date, and the rainfall in millimetres for that week. Your first task is to identify the outliers in the data. How would you do this?
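
As a starting point for question (2), here is a minimal sketch of one common rule of thumb, the interquartile range (IQR) test, which flags values lying more than 1.5 times the IQR outside the quartiles. The sample readings are made up for illustration and the function name iqr_outliers is hypothetical. Since rainfall is strongly seasonal, you would probably want to apply such a test per season or per calendar week rather than to all 20 years at once.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _median, q3 = quantiles(values, n=4)  # three cut points for quartiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical weekly rainfall readings in millimetres (not real data).
weekly_rainfall_mm = [0, 2, 5, 3, 4, 6, 1, 2, 250, 3, 4, 5]
print(iqr_outliers(weekly_rainfall_mm))
```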

3. You are given a set of documents and asked to cluster them based on document-level similarities. You are familiar with clustering algorithms such as ‘kmeans’, so you are confident that you can do this. However, the interviewer now adds a second part to the problem: the clusters must be labelled. How would you generate labelled clusters? What are the ways of generating meaningful labels for the clusters? Can the clusters be labelled after the clustering process, before it, or during the process itself? Describe your solution and explain why you made that particular choice.

4. Consider the problem mentioned in (3). Given a set of documents, you first need to calculate the similarity between the documents. Let us assume that you decided to use a bag-of-words model. What are the similarity measures that you can use? Under what circumstances would you use the cosine similarity measure and when would you select the Jaccard similarity measure?
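
To make the contrast in question (4) concrete, here is a small sketch of both measures over a naive bag-of-words model. It assumes whitespace tokenisation and lower-casing only, which is far simpler than what a real pipeline would use. Cosine similarity works on term counts, while Jaccard similarity looks only at which terms are present.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the two term-count vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def jaccard_similarity(doc_a, doc_b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Same vocabulary, different term frequencies.
print(jaccard_similarity("rain rain rain today", "rain today"))
print(cosine_similarity("rain rain rain today", "rain today"))
```

The two documents in the usage lines share exactly the same vocabulary, so Jaccard treats them as identical, whereas cosine similarity still registers the difference in term frequencies.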

5. You are given a document corpus consisting of more than 100,000 documents, with each document containing at least 1,000 words. You decide to construct a Term Document Frequency matrix M, where each row represents a term in your vocabulary and each column represents a document. Given a document D, you are asked to find all documents that are similar to D. How would you go about it? Given that M is a high-dimensional sparse matrix, how would you perform dimensionality reduction on M?

6. Consider the problem mentioned in (5). Given the document corpus, you are given a query and need to retrieve the top K documents that are most relevant to that query. How would you leverage your solution for (5) to solve this problem?

7. With elections around the corner in many Indian states, you have been asked to predict whether a particular candidate will win the election. You are given details on past election results for that constituency, the popularity of the candidates and their competitors, the amount of money spent by each candidate, and the time spent on the election campaign by each of the candidates in the past elections and in the current election. Is this a clustering or classification problem? If it is a classification problem, what kind of regression model would you use to solve this problem?

8. What is a maximum margin classifier? Can you give an example of one? What is the advantage of using a maximum margin classifier?

9. What is Non-negative Matrix Factorization (NMF)? Which data science problems typically use it? Can you implement an algorithm for NMF in polynomial time? If not, how would you solve this problem?

10. Are you familiar with gradient descent methods? Can you explain them in detail? Will gradient descent methods always converge to the global minimum (or maximum)? If yes, explain how. If not, explain with an example why they would not converge.
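
A small experiment can help you reason about question (10). The sketch below runs plain gradient descent on a one-dimensional, non-convex function of my own choosing, f(x) = x^4 - 3x^2 + x, which has two minima. The learning rate and step count are arbitrary illustrative values.

```python
def f(x):
    """A non-convex function with a global minimum near x = -1.30
    and a shallower local minimum near x = 1.13."""
    return x**4 - 3 * x**2 + x


def grad(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x0, lr=0.01, steps=1000):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


left = gradient_descent(-2.0)   # settles in the deeper minimum
right = gradient_descent(2.0)   # settles in the shallower minimum
print(left, f(left))
print(right, f(right))
```

Both runs end at a point where the gradient is essentially zero, but only the run started from the left reaches the lower of the two minima. The outcome depends on the starting point.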

11. What is a confusion matrix? When would you use it to report your results?

12. Given a classifier, can you define the terms ‘sensitivity’ and ‘specificity’ of your system? Let us assume that you have built a classifier which can predict whether a tumour is benign or malignant based on certain tumour characteristics. Your intention is to build a classifier which can detect all truly malignant tumours, even if some of the benign tumours get falsely reported as malignant. Which of the two metrics, sensitivity or specificity, should you try to improve in this case? Now, assume that you have been asked to use a classifier to predict whether a student will pass the examination or not. Your system should be designed such that it does not report a failing student as passing, even if it mis-predicts certain passing students as potential failures. Which of the two metrics should you try to improve in that case?
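
For questions (11) and (12), it may help to see how the two metrics fall out of the four cells of a binary confusion matrix. The sketch below assumes labels encoded as 1 (positive) and 0 (negative); the function names and sample labels are illustrative only. Note that which metric you should favour in a given scenario depends on which class you decide to treat as positive.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """True positive rate: the fraction of actual positives that were detected."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: the fraction of actual negatives that were cleared."""
    return tn / (tn + fp)


# Hypothetical labels: 1 = malignant, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(sensitivity(tp, fn), specificity(tn, fp))
```

In this made-up sample, every malignant case is caught at the price of two benign cases being flagged.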

13. What is meant by Area Under Curve (AUC) of Receiver Operating Characteristic (ROC)? When would you use this metric as opposed to using the standard accuracy metric for your classifier?
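
One way to build intuition for question (13) is the rank interpretation of AUC: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by brute force over all positive/negative pairs, which is fine for small examples but not how you would do it on large data. The function name roc_auc is my own.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))

# An imbalanced sample: a model that gives every example the same score
# and always predicts 0 is 90 per cent accurate here, yet cannot rank at all.
print(roc_auc([0] * 9 + [1], [0.2] * 10))
```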

14. What is meant by linear separability? If you are told that your problem data is not linearly separable, what classifier would you choose to use and why?

15. You have designed a linear regression model using certain features. Now you are asked to add two additional features, both categorical, to your model in order to improve its performance. Would you add these features to your existing linear regression model, or would you design a different model to handle the categorical features? If you would add them to the existing model, explain how. If you would design a different model, explain why.
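
If you are leaning towards keeping the linear model in question (15), one commonly used technique is one-hot (dummy) encoding, where each category becomes a 0/1 column. The sketch below is a bare-bones version with a hypothetical ‘region’ feature. It drops the first category by default, which is a usual way to avoid perfectly collinear columns when the model also has an intercept.

```python
def one_hot(values, drop_first=True):
    """Encode a categorical column as 0/1 indicator columns.

    Returns the kept category names and one row of indicators per value.
    """
    categories = sorted(set(values))
    kept = categories[1:] if drop_first else categories
    rows = [[1.0 if v == c else 0.0 for c in kept] for v in values]
    return kept, rows


# Hypothetical categorical feature.
region = ["north", "south", "east", "south"]
columns, encoded = one_hot(region)
print(columns)
print(encoded)
```

With the first category dropped, a row of all zeros stands for ‘east’, the baseline category.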

16. What is hinge loss? Which type of classifiers use hinge loss? What is the loss function used in linear regression and logistic regression?
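
The loss functions in question (16) are short enough to write down directly. The sketch below assumes class labels of -1 and +1 and a raw (unsquashed) model score for the two classification losses. Hinge loss is the one associated with maximum margin classifiers such as Support Vector Machines, squared error is the usual choice for linear regression, and the logistic (log) loss is the one minimised by logistic regression.

```python
from math import exp, log


def hinge_loss(y, score):
    """Zero once the example is on the correct side with a margin of at least 1."""
    return max(0.0, 1.0 - y * score)


def squared_loss(y, prediction):
    """Squared error for a real-valued target."""
    return (y - prediction) ** 2


def logistic_loss(y, score):
    """Log loss written for labels in {-1, +1}; never exactly zero."""
    return log(1.0 + exp(-y * score))


for score in (-0.5, 0.5, 2.0):
    print(score, hinge_loss(1, score), logistic_loss(1, score))
```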

17. In the last few years, there has been considerable interest in deep learning and in neural networks. What are deep neural networks?

18. Can you explain some of the different neural networks such as the feed forward neural network, convolutional neural network, recursive neural network and recurrent neural network? What is the advantage of deep neural network based classification systems over simple ML based classifiers such as Support Vector Machines?

19. You have designed a deep learning based classification system to classify images of different kinds, such as people, buildings, animals, etc. Now you are given a new problem of classifying emails as either spam or non-spam, and are asked to design a deep learning based classification system for it. Would you be able to reuse the deep neural network system you built earlier? If yes, explain the changes that would need to be made to your earlier solution for classifying email text as either spam or non-spam. If your answer is no, explain why the same system cannot be reused.

20. Neural networks were described in computer science a few decades back. Why have they become so popular in the context of machine learning only in the last 10 years?

If you are interested in learning more about deep learning and, in particular, how deep learning can be applied in the context of natural language processing, there is an ongoing course from Stanford University titled ‘Deep Learning for Natural Language Processing’, offered by Richard Socher and available at http://cs224d.stanford.edu/. It is an exhaustive and informative course. However, it is important to have done a preliminary machine learning course before embarking on a deep learning course.

If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, happy programming!