The Complete Magazine on Open Source


, / 127 0

In this month’s column, we feature a set of computer science interview questions.

Last month, we had discussed a set of questions on data science, algorithms and machine learning. We continue this discussion in this month’s column, focusing on machine learning and natural language processing.

1. We are all familiar with spoken language recognition systems, which can perform speech-to-text (STT) recognition. Systems such as Apple’s Siri, Google Now and Microsoft’s Cortona have to deal with a number of complexities as different speakers have widely varying accents and use different styles of pronunciation. Let us consider a person who is reciting the poem: ‘Twinkle, Twinkle, Little Star, How I wonder what you are’. Let us assume that the STT system has, so far, correctly identified all the words spoken, except for the last word ‘are’. You are given access to a large corpus of English words on which the STS system can be trained. How would you design an algorithm that can recognise the poem completely, assuming that it has been able to correctly recognise all previous words in that poem spoken so far?

2. Can you explain what are unigrams, bigrams and trigrams? How can you use these n-grams to build a language model which can predict the next word in a sentence spoken in English? Though English has a huge vocabulary, let us just assume that we are dealing with a limited vocabulary of 1,000,000 words. For this vocabulary size, how many parameters need to be estimated for the language model?

3. Let us again consider the STS system mentioned in Question (1), above. We are given a sentence spoken in English and have been asked to convert it to text. For example, consider the sentence, “I took a flight to New York.” Let us assume that the name ‘New York’ did not occur at all in the training corpus you have used to train the STS system. However, the training corpus did contain the sentence, “I took a flight to Paris.” Can you explain how your language model will recognise the name ‘New York’?

4. You are given the task of writing a program that can recognise automobile number plates. The number plate is of the form XXNN-XXNNNN, where X stands for an upper case English letter of the alphabet (between A-Z) and N stands for a single digit between 0-9.

a. You have access to a regular expression package available from your favourite programming language. So all you need is to provide the appropriate regular expression to the regular expression library. What is the regular expression that you would come up with?
b. You are told that you can’t use any pre-existing regular expression library. You are asked to write C code to recognise whether an input is a valid number plate expression or not.

5. We are familiar with spelling and word completion suggestions that appear as we type a search query in various search engines and the suggestions they provide when we misspell a phrase. Can you design a system which takes the user input as a character stream and provide on-the-fly spelling correction suggestions and auto-completion of phrases for the English language? Now, if you are asked to implement this system for a different language like Hindi, is it possible to adapt your existing system to the new language? If yes, explain how you would do that? If not, explain why it is not possible.

6. We are all familiar with Wikipedia articles. Let us assume that you are given access to all the Wikipedia articles that were created in the last two years. You are asked to find out what subjects are popular in Wikipedia, i.e., articles on those subjects get edited often. What NLP technique would you use to solve this problem?

7. Let us consider the following two sentences:

a. Today was a bank holiday.
b. He was fishing by the river bank.

Though the word ‘bank’ appears in both the sentences, their meaning is quite different. In the first sentence, ‘bank’ stands for a financial institution whereas in the second sentence, it stands for the ‘sandy terrain along the side of the river’. Assume that you have access to an English Lexicon like WordNet (, which can provide synonyms for each English word. Can you explain the NLP technique that you would use to identify what the intended meaning of the word ‘bank’ is, in each of the above two sentences?

8. We are all familiar with various news collection services on the Web such as Google News, Yahoo News, etc. All these services offer a feature called the ‘Personalised News Section’ in which they try to show a few news items that they think may be of personal interest to you. For instance, if you are interested in following a particular movie actor or a writer, then the personalised news contents would feature any news related to your favourite actor. Can you explain how these online news services identify the news items that may be of interest to you?

9. When we enter a search phrase on the Web search engine, we get a ranked set of results that best match our query, as determined by the search engine. For instance, if you enter the phrase ‘Open Source For You’ in the Google search bar, you get 410,000 results matching it, and the first page of search results contains the 20 top most matches. Given that we have 410,000 results matching the search query, can you come up with an algorithm that does a ranking of these results? Can you identify the typical features that may be used by a search engine to rank its results?

10. Let us again consider a search engine which accepts a query phrase from the user. Instead of returning all Web pages that match the query phrase, you are asked to design a special search engine that returns only Web pages containing software code that best matches the search phrase. The code search engine takes two inputs—the first is the query phrase delimited using double quotes, and the second is the programming language for the code results which should be returned by the search engine. For example, if the user enters ‘Insertion Sort’ ‘C++’, then the code search engine should return Web pages which contain the code for ‘insertion sort’ written using the programming language C++.

a. For the first question, let us assume that you have a code search engine that can return all Web pages containing code for the search phrase specified. For example, it can return code pages that contain code for the insertion sort, written in Python, Java, Scala, Fortran, etc. Now the problem is reduced to that of filtering those code pages which are not in the specified programming language, namely C++. Can you come up with an algorithm for solving this simplified problem?
b. Now, how would you design a code search engine that will return all Web pages containing code for the search phrase in any programming language?

11. Given a listing of any search results from a search engine corresponding to a user-supplied query phrase, can you define what is meant by the terms ‘precision’ and ‘recall’ for that search engine?

12. Most of you would be familiar with supervised and unsupervised training techniques in machine learning. Let us assume that you have created a machine learning model using a supervised learning technique in one domain. For instance, you have created a machine learning model to do sentiment analysis of user reviews for books. Now you are given the problem of doing sentiment analysis of user reviews for restaurants. Would you be able to reuse your current model or do you need to construct a new model for the new domain?

13. As a follow-up to question 12, what is transfer learning? When is transfer learning effective?

14. What is the principle of distributional semantics in natural language processing? Can you mention a couple of text mining techniques that employ the principle of distributional semantics?

15. Consider a machine learning problem in which you are asked to predict the mortality of a patient admitted to the ICU. You are provided data of the last 24 hours related to his various vital parameters (which are numeric time series data), lab reports and clinical progress reports (which are text documents). Given that your data set contains both text and numeric data, how would you select features for the classifier that you are building to predict the patient’s mortality?

If you have any favourite programming questions or software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, happy programming!