The Complete Magazine on Open Source



In this month’s column, we feature a set of computer science interview questions.

Last month, we discussed a set of questions on operating systems. Let’s continue with computer science questions, this time focusing on data management and cloud computing.

1. We are all familiar with transactional databases for storing structured data. However, you are asked to design a data management solution in which the data store needs to hold large documents. The documents are answer sheets of students in a school record management system. Each student is uniquely identified by a student ID, and the data store needs to store all the answer sheets of each student, for all the subjects throughout an academic year, with the ability to store the past three years’ records as well. What kind of database would you use? Can you explain the database schema that you would employ?
2. Spatial applications are being developed in large numbers these days. Consider a taxi fleet management company, where the trajectories of the taxis need to be stored in a database. Spatial applications should be able to query the database for a specific trajectory or segments of a trajectory. It should also be possible to answer queries of the form: given a set of location points, retrieve all the taxis whose trajectories overlap those points. What kind of database would you use for storing trajectory data? For speeding up query performance, what kind of indexing would you perform for the trajectory data?
3. Assume that you are asked to design the medical data management solution for a big hospital. Medical data generated in the ICU is among the most critical in a hospital, where different sensors generate physiological data, also known as vitals. This is time series data in which various physiological signals like heart rate, blood pressure and oxygen level are recorded at periodic time intervals of, say, one minute, for each patient in the ICU. What kind of database would you use to store time series data?
4. Assume that you are asked to design the data management solution of a small scale social network that is becoming popular. The social network has around 100,000 members and is expected to grow by 10X over the next year. The members are identified by their unique member IDs. Members can have other members as friends. Each member has a unique profile along with a message history, which includes the messages he/she has sent to other members of the network. What kind of database would you use to represent the members of the social network and their friends?
5. Assume that you are asked to redesign the social network site mentioned in Problem 4 as a dating site. The new constraint on the friendship relationship among the members is that they can only become friends with a member of the opposite sex. How would you implement the dating site’s social network? One of the frequent queries that needs to be optimised for the dating site is: Given two members (X,Y) where X and Y are of the opposite sex, do they have any common connections who can introduce them to each other? How would you design your data management solution such that you can speed up this query?
6. Most of you would be familiar with Docker, which is an open source project built on top of Linux containers. It allows applications to be isolated from each other. Can you explain how Docker differs from virtual machines in the way it provides isolation to applications?
7. You are given a 16 core, 32GB RAM physical machine running Linux. You need to host a Web server and an application server on this physical server. You need to choose between deploying separate virtual machines for the Web server and the app server, and hosting them in two different Docker containers. What would you choose and can you explain the rationale behind your choice?
8. In Problem 7, you are now told that you don’t need to host an application server, but you need to host as many Web servers as possible on the physical machine. Would you now use a virtual machine to host each Web server instance or would you use a Docker container for each Web server? Can you explain the reason for your choice?
9. Most of you would be familiar with cloud computing services such as Amazon EC2. Amazon also provides cloud storage through Amazon Simple Storage Service, popularly known as Amazon S3. This object based cloud storage service can be accessed over HTTP, and your files can be stored as objects in it. What are the advantages of an object based storage service as opposed to a file based storage service?
10. Consider Question 9 and imagine you are running a database server instance on Amazon EC2. How would you host your application data on the cloud? If you are asked to choose between object based cloud storage and block based cloud storage, what would you pick? Can you provide the reasons for your choice?
11. Assume that you have been asked to move your on-premise enterprise payroll application onto the cloud. You have decided on the cloud compute and storage vendor. Your application has 20TB of data, which needs to be hosted on the cloud. How would you move this data from your on-premise storage to the cloud? What issues do you need to consider and what precautions do you need to take?
12. Consider that you have stored your critical enterprise data on cloud storage. How do you protect against any data loss due to a failure of the cloud storage system?
13. We are all familiar with the use of RAID in enterprise storage for redundancy so as to avoid data loss. There are two ways of preventing data loss in enterprise storage. One is to have multiple copies of the data. The other is to store the data in erasure coded format so that in the event of a disk failure, the data can be reconstructed from the erasure coded format. What are the pros and cons of these two choices? For cloud storage, would you opt for storing the data in erasure coded format or by keeping multiple copies of the data? Explain the reasons for your choice.
14. Your organisation has been asked to evaluate how using cloud compute and storage compares with using on-premise compute and storage for its IT needs. Which of your enterprise applications would you consider moving to the cloud and why?
15. Assume that you have moved your organisation’s payroll data onto the cloud. However, your organisation has four offices and all of them could be updating the payroll database at the same time. How does the cloud storage system synchronise these multiple updates from different geographical locations onto the same data object? While traditional database or file locking would take care of this issue in on-premise data storage, can you use the same mechanism to ensure synchronised, ordered updates from multiple geographical locations onto the same data object? If yes, explain how. If not, explain why traditional file or database locking mechanisms would not serve the purpose.
16. Your enterprise applications have widely varying compute demands, over a period of time. For instance, during month-ends, there is excessive demand for compute to run batch jobs. Similarly, for a couple of hours each night, there is a spike in compute resource usage, again due to batch jobs. How would you handle such dynamic resource demand variations if you need to move your enterprise application onto the cloud?
17. Assume that data redundancy on cloud storage is achieved by maintaining three copies of each piece of data. When a user-initiated data change occurs, how are the three copies updated? Would the user-initiated write return success to the user only after all three copies are updated? Or would it return success right after any one of the copies is updated successfully? What are the pros and cons of these two choices?
18. Assume that you have bought cloud compute and cloud storage and have moved your enterprise application onto the cloud. Consider two scenarios—one in which the same vendor provides both compute and storage on the cloud, and another in which the cloud compute and cloud storage come from two different vendors. In each of these scenarios, how do you ensure the locality of data for your cloud hosted application?
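As a starting point for thinking about Question 13, the storage cost side of the trade-off can be worked out with simple arithmetic. The sketch below is only an illustration, not a full answer: the function names are hypothetical, and the 10 data + 4 parity layout is an assumed example of an erasure code in which any k of the k + m fragments are enough to rebuild the data.

```python
from fractions import Fraction


def replication_profile(copies):
    """Return (raw storage per unit of user data, copies that can be lost)."""
    return Fraction(copies), copies - 1


def erasure_profile(k, m):
    """Return (raw storage per unit of user data, fragments that can be lost).

    Assumes k data fragments plus m parity fragments, where any k
    fragments are sufficient to reconstruct the original data.
    """
    return Fraction(k + m, k), m


# Hypothetical configurations chosen purely for illustration.
rep_overhead, rep_losses = replication_profile(3)
ec_overhead, ec_losses = erasure_profile(10, 4)

print(f"3 copies : {float(rep_overhead):.2f}x raw storage, survives {rep_losses} losses")
print(f"10+4 code: {float(ec_overhead):.2f}x raw storage, survives {ec_losses} losses")
```

The numbers capture only capacity and fault tolerance. A fuller answer would also weigh the cost of repair, since rebuilding a lost fragment under this kind of code means reading k surviving fragments, whereas a lost replica can be restored from a single remaining copy.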
If you have any favourite programming questions or software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, happy programming!