Last month, we discussed a set of questions on data management and cloud computing. We continue our discussion of computer science questions this month, focusing on machine learning and natural language processing. Let us get started with some tough questions to whet your appetite.
1. In natural language processing, the last couple of years have seen word embeddings become extremely popular in various NLP tasks such as named entity recognition, sentiment analysis, semantic similarity detection, etc. Most of you would be familiar with Google's word2vec package (https://code.google.com/p/word2vec/). What are word embeddings? Can you explain briefly the two major algorithms in word2vec, namely the skip-gram method and the continuous bag of words (CBOW) method? Here is a link to an excellent article that provides much of the historical background on word embeddings: http://gavagai.se/blog/2015/09/30/a-brief-history-of-word-embeddings/.
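To make the difference between the two objectives concrete, here is a small sketch (not the word2vec implementation itself; the function names are mine) showing how training examples are generated from a toy sentence: skip-gram predicts each context word from the centre word, while CBOW predicts the centre word from its whole context window.

```python
# Illustrative sketch of the two word2vec training objectives.
# These helpers only build the (input, target) training examples;
# the actual embedding training is a separate neural network step.

def skipgram_pairs(tokens, window=2):
    """Skip-gram: each (centre word -> one context word) is a training pair."""
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: the full context window predicts the centre word."""
    examples = []
    for i, centre in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        examples.append((context, centre))
    return examples

sentence = "the cat sat on the mat".split()
print(skipgram_pairs(sentence, window=1)[:3])
print(cbow_examples(sentence, window=1)[1])
```

Note how skip-gram produces many small (centre, context) pairs per position, whereas CBOW produces one example per position with the whole context as input.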
2. Over the last few years, we have also seen the rise of deep learning methods in computer vision, pattern recognition, etc. While deep learning achieved considerable popularity with Google's work (http://www.wired.com/2012/06/google-x-neural-network/) on using deep learning algorithms to automatically label cat images, there are still open questions on why deep learning is effective in computer vision, speech recognition and various other pattern-recognition tasks. Can you explain how deep learning is different from traditional methods of machine learning? If you are interested in knowing more about deep learning and how to implement it, I would like to suggest the online tutorial by Andrej Karpathy titled, Hacker's Guide to Neural Networks.
4. What are Support Vector Machines (SVMs)? Let us assume that you are given a set of data points, some marked red and some blue. You are asked to write a classifier which can separate the blue and red points. You have written your favourite SVM-based classifier for this task. Now you are given Figure 1 as input.
What is the separation boundary your classifier would find for this input? Can you use an SVM-based classifier to separate this input? If yes, explain how. If not, explain why.
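A hint for this question: when the two classes cannot be separated by a straight line in the original space, an SVM can still separate them via the kernel trick, which implicitly maps points into a higher-dimensional space. The sketch below makes such a mapping explicit for a made-up 1-D case (red points near the origin, blue points far from it); it assumes a radially separated layout, which may or may not match your Figure 1.

```python
# Sketch of the idea behind the kernel trick, with an explicit feature map.
# In 1-D, red (|x| < 1) and blue (|x| > 1) points cannot be separated by a
# single threshold on x, but after mapping x -> (x, x^2) a linear boundary
# (a threshold on the second coordinate) separates them perfectly.

def feature_map(x):
    # phi(x) = (x, x^2); hypothetical mapping chosen for this toy layout
    return (x, x * x)

red = [-0.5, 0.2, 0.4]    # |x| < 1
blue = [-2.0, 1.5, 3.0]   # |x| > 1

# In the mapped space, x^2 < 1 for every red point and x^2 > 1 for every
# blue point, so the line "second coordinate = 1" is a separating boundary.
assert all(feature_map(x)[1] < 1 for x in red)
assert all(feature_map(x)[1] > 1 for x in blue)
print("linearly separable after mapping")
```

Kernel SVMs (e.g., with the RBF kernel) achieve the same effect without ever computing the mapped coordinates explicitly.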
5. What is a convex function? What is the significance of convex functions in machine learning algorithms?
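As a starting point for this question, recall the defining inequality: f is convex if f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) for all x, y and t in [0, 1]. The sketch below spot-checks this inequality numerically on a grid; passing the check does not prove convexity, but it is a handy way to build intuition.

```python
# Numerical spot-check of the convexity inequality on a grid of points.
# Illustrative only: a finite check cannot prove convexity in general.

def looks_convex(f, points, steps=10):
    for x in points:
        for y in points:
            for k in range(steps + 1):
                t = k / steps
                # convexity: f(tx + (1-t)y) <= t f(x) + (1-t) f(y)
                if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + 1e-9:
                    return False
    return True

grid = [i / 10 for i in range(-30, 31)]
print(looks_convex(lambda x: x * x, grid))   # x^2 is convex -> True
print(looks_convex(lambda x: x ** 3, grid))  # x^3 is not convex -> False
```

The significance for machine learning follows from this definition: a convex loss function has no spurious local minima, so gradient-based optimisation is guaranteed to find the global minimum.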
6. In supervised machine learning problems, what is the process of cross-validation? How does cross-validation help prevent a model from being over-fitted to training data?
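To make the mechanics of k-fold cross-validation concrete, here is a minimal sketch of the split logic (indices only; you would plug in your own train and evaluate functions): the data is divided into k folds, and each fold serves exactly once as the validation set while the remaining folds form the training set.

```python
# Minimal k-fold cross-validation split over sample indices.
# Each sample lands in exactly one validation fold.

def kfold_splits(n_samples, k):
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        val = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, val

for train, val in kfold_splits(10, 5):
    print(val)  # each index appears in exactly one validation fold
```

Because every model is scored on data it never saw during training, the averaged validation score exposes over-fitting that a score on the training data alone would hide.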
7. In machine learning, can you explain the difference between model parameters and hyper-parameters? Do all models have hyper-parameters? Can you explain what the hyper-parameter would be for the standard linear regression model of the following form:
wᵀx = y
where x is the feature vector for the data, y is the scalar variable that needs to be predicted and w is the model parameter vector?
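A hint for thinking about this question: plain least squares fits w directly from the data, so it has only model parameters. Regularised variants such as ridge regression introduce a quantity, lambda, that the practitioner must choose (typically via cross-validation) rather than learn from the training data; that is what makes it a hyper-parameter. The sketch below shows this for the 1-D case, where the closed form stays simple.

```python
# 1-D ridge regression, to contrast a model parameter (w, learned from data)
# with a hyper-parameter (lam, chosen by the practitioner).

def ridge_fit_1d(xs, ys, lam):
    # closed form in 1-D: w = sum(x*y) / (sum(x*x) + lambda)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # made-up data, roughly y = 2x

w_ols = ridge_fit_1d(xs, ys, lam=0.0)   # lambda = 0 recovers least squares
w_reg = ridge_fit_1d(xs, ys, lam=30.0)  # larger lambda shrinks w toward 0
print(round(w_ols, 2), round(w_reg, 2))
```

Setting lam=0 recovers ordinary least squares, which is why the unregularised model in the question can be said to have no hyper-parameter.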
8. You are running a randomised clinical trial, where you are asked to determine whether a newly discovered drug for cancer is effective. You have a control group of patients who do not get the new drug and a target group of patients who get the drug. Let us assume that the clinical effectiveness of the new drug is measured in terms of the reduction in the number of deaths in the control and target groups for six months following the start of the clinical trial. Can you apply the concept of A/B testing for this situation? If not, explain why.
9. Consider the same example as above. You find that 20 people out of 400 died in the control group during the trial period, whereas 17 people out of 400 people died in the target group (who received the new drug). Does this establish the clinical effectiveness of the new drug against cancer? Can you explain your answer in terms of the statistical significance of the results?
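As a hint for reasoning about this question: one standard way of assessing whether the difference between 20/400 and 17/400 could be due to chance is a two-proportion z-test. The sketch below computes the test statistic using the normal approximation; it is illustrative only, and a real clinical trial would use more careful methods (exact tests, confidence intervals, pre-registered analysis plans).

```python
import math

# Two-proportion z-test (normal approximation) for comparing death rates
# in the control group (x1/n1) and the target group (x2/n2).

def two_proportion_z(x1, n1, x2, n2):
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)          # pooled proportion under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

z = two_proportion_z(20, 400, 17, 400)
print(round(z, 2))  # well below 1.96, the usual two-sided 5% threshold
```

Since the statistic falls far short of the conventional 1.96 cut-off, a difference of this size in samples of 400 is quite consistent with chance alone.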
10. Let's consider Question (8) again. While Questions (8) and (9) make the implicit assumption that the difference in the number of deaths seen in the control and target groups is due to the new drug being administered to the target group, is this a correct assumption to make? If yes, substantiate why. If not, give an example to explain your answer.
11. What are the confounding factors in statistical experiments? How does one account for confounding factors when conducting randomised experiments?
12. A well-known phrase used in machine learning and statistics is: 'Correlation is not the same as causation'. Can you give an example to illustrate this statement?
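A classic example worth keeping in mind while answering: ice cream sales and drowning incidents are strongly correlated, yet neither causes the other; both are driven by a common cause, hot weather. The sketch below simulates this with entirely made-up numbers and measures the resulting correlation.

```python
import math, random

# Simulated confounding: temperature drives both variables, producing a
# strong correlation between them despite there being no causal link.

random.seed(1)
temperature = [random.uniform(10, 40) for _ in range(200)]
ice_cream = [2.0 * t + random.gauss(0, 3) for t in temperature]
drownings = [0.5 * t + random.gauss(0, 2) for t in temperature]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(ice_cream, drownings)
print(round(r, 2))  # high correlation, yet no causal link between the two
```

Banning ice cream would not reduce drownings; controlling for the confounder (temperature) is what reveals the true structure.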
13. There are three kinds of analytics — prescriptive analytics, descriptive analytics and predictive analytics. Can you explain each of these terms and describe how they are different from each other?
14. There are a number of data analytics frameworks which can scale to analyse vast amounts of Web-scale data. Most of you would be familiar with Hadoop frameworks. Of late, there has been considerable interest in the Apache Spark framework as an alternative to Hadoop. Can you explain the major differences between the Hadoop and Apache Spark platforms? What would be your preferred choice for running a purely batch analytics application? And what would be your choice for running a real-time analytics application?
15. Most of you would be familiar with recommender systems, which can be built using either collaborative filtering or knowledge-based recommendation approaches. Can you explain the difference between the two approaches?
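To anchor the comparison, here is a minimal sketch of one user-based collaborative filtering step on a made-up ratings matrix (0 means unrated): find the user most similar to the target by cosine similarity, then recommend items that neighbour rated highly but the target has not rated. A knowledge-based system would instead match explicit item attributes against stated user requirements, with no ratings matrix at all.

```python
import math

# Toy user-based collaborative filtering: similarity over rating vectors.
# Data is entirely hypothetical; 0 denotes an unrated item.

ratings = {
    "alice": [5, 4, 0, 0],
    "bob":   [5, 5, 0, 4],
    "carol": [1, 0, 5, 4],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def recommend(target, ratings):
    others = {u: r for u, r in ratings.items() if u != target}
    neighbour = max(others, key=lambda u: cosine(ratings[target], ratings[u]))
    # recommend items the neighbour liked that the target has not rated
    return [i for i, r in enumerate(ratings[neighbour])
            if ratings[target][i] == 0 and r >= 4]

print(recommend("alice", ratings))  # item indices suggested for alice
```

Note that the recommendation here is driven purely by behavioural similarity between users; no knowledge of what the items actually are is used anywhere.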
16. What is the cold start problem in recommender systems? How is it handled in collaborative filtering systems? Does this problem occur in knowledge-based recommender systems?
17. We are all familiar with search engines displaying ads that are related to our search query on the Web pages we are browsing. For example, if you were searching for information on Samsung mobile phones, the search engine will display ads for different mobile phones and mobile phone accessories. Can you explain how search engines decide which ads to display for each user? For example, consider two different users, A and B, who are both looking for the same product. Are they likely to see the same set of ads from the search engine? If you think not, explain how the search engines would decide what ads to display for each user, even though their search query term was identical.
18. Given that one word can have different meanings (this is known as polysemy), how do information retrieval systems figure out what is the specific information that is being searched for? For example, consider a user who searches for 'windows'. How would the search engine determine whether he is looking for Microsoft Windows or glass windows? Blindly displaying search results for both meanings is likely to impact the user experience. If you were the designer of the search engine, explain how you would attack this problem. Consider the following related problem. When the user types in the search query 'automobile', how does the search engine include Web pages that talk about cars as well?
19. Consider the display of search query results by search engines. Given a search query term, how do search engines rank the relevant results? What factors are taken into account in ranking the search results? What is meant by the 'filter bubble' in search engines, and how would you avoid the problem?
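One of the many signals that feed into ranking (alongside query relevance, freshness and personalisation) is the link structure of the Web. A classic link-based signal is PageRank, sketched below by power iteration on a tiny made-up link graph; real search engines combine hundreds of such signals.

```python
# PageRank by power iteration on a hypothetical three-page link graph.
# Each page distributes its rank equally among the pages it links to.

links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

def pagerank(links, damping=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iters):
        # each page keeps a (1 - d)/n base share, plus damped link inflow
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            share = rank[p] / len(outs)
            for q in outs:
                new[q] += damping * share
        rank = new
    return rank

rank = pagerank(links)
print({p: round(r, 3) for p, r in sorted(rank.items())})
```

Page 'c' ends up ranked highest because it receives links from both other pages; notice that ranks always sum to one, since they form a probability distribution over pages.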
20. While traditionally, search engines have been tasked with the goal of returning Web pages relevant to the search query, they are increasingly being used to answer user questions. Google uses its knowledge graph to provide fact-checked information about well-known personalities along with traditional search results. For example, if you type the question, 'What is the first book by J.K. Rowling?', you will get a text box containing the relevant snippet as the answer along with standard search results. How would you design a search engine that can provide direct answers to questions?
If you have any favourite programming questions/software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, happy programming!