The Complete Magazine on Open Source

The Importance of Data Modelling in MongoDB

MongoDB scales with relative ease because of its architecture. The key component of this architecture is the data model, which is based on schema-less documents and collections. When adopting MongoDB in any application, it is important to base that adoption on sound data modelling.

MongoDB is one of the most popular NoSQL databases around, and its popularity continues to grow rapidly. More developers have started using it, and an increasing number of applications are built on it. As organisations adopt MongoDB as an integral architectural component of their solutions, it is important that they build it on the solid foundation of data modelling.
Data modelling is the first step in using any database, be it relational or NoSQL. It refers to the process of iteratively creating a database design that meets the application’s needs. It involves the analysis and depiction of an application's data entities and their relationships.

Traditional databases are primarily relational in nature, where the database schema is defined at design time based on data objects. The data structure is static in nature, and data should follow the rules to be stored and retrieved using this static schema. If data does not follow the database schema, it cannot be stored in the database. Moreover, database structures are normalised and decomposed into multiple smaller tables to avoid data repetitions and implement the get-what-you-want pattern of data retrieval. These tables have relationships built into them to enforce their consistency, integrity, atomicity and durability.
On the other hand, NoSQL databases are generally schema-less. Although there is a database schema defined to start with, data does not have to follow the schema strictly. NoSQL databases accept data of any structure within their collections (tables in the relational world), irrespective of whether it matches their schema or not. In other words, they are flexible about enforcing a schema on data. They do not have inbuilt relationships between their collections; rather, applications need to implement the necessary logic for atomicity and consistency of data between collections. Inbuilt constructs to join data from multiple collections are also limited; MongoDB, for instance, only gained a left outer join with the $lookup aggregation stage in version 3.2.
MongoDB provides two different ways to create relationships between data objects – References and Embedment.

Data Modelling in the NoSQL world
Data modelling is as important in the NoSQL world as it is in the relational world. There are important facets of applications that cannot be realised without implementing proper and optimised data models. Data modelling is not an afterthought. It is an important process in the application planning and design phases. Some of the important reasons for data modelling are listed below.

Scalability: Scalability refers to the ability of an application to cope with an increase in workload caused by an increase in traffic. Applications should be designed to perform well when their usage increases. From the NoSQL perspective, it means that collections and data entities should be modelled based on the current and future demand for the application. There should not be non-availability or degradation of performance due to an increase in the number of users or transactions in the database.

Performance: Database modelling is about trade-offs between security, availability, scalability and performance. The right balance between these architectural blocks helps in creating the optimal database design for an application. A data model helps in understanding the read and write needs of an application, and in identifying which data is updated frequently and which remains static most of the time. It helps in creating an application that performs as expected as the workload increases over time.

Application needs: Different applications have different demands from the database. Some are read-intensive while others are write-centric. Applications can be OLTP based or meant for reporting. Some applications have multiple facets built into them. Different data modelling strategies should be used based on the nature of the application. Reporting applications are read-heavy and writes should not reduce their performance, while transactional systems should be able to read and write with equal ease.

Data consistency: Data modelling helps in reducing redundant data and in understanding relationships, their update patterns and taxonomy. It helps in storing only the required data in its correct form.

Capacity: Each document occupies an allocated amount of space and, generally, performance falls when that space has to be re-allocated as the document grows. Data modelling helps in creating optimally sized documents by reducing redundant data in each document. It also helps in identifying the overall capacity needs of the database.

MongoDB support for data modelling
MongoDB provides advanced constructs for enabling data modelling. It provides References and Embedment for defining the structure and relationships of documents. It is important to understand these two constructs before getting into data modelling.

References
In relational databases, relationships are defined based on matching data contained in columns of different tables. In MongoDB, these relationships are defined based on semantics. The MongoDB engine does not enforce a relationship; it is completely up to the application to implement and respect it while reading and writing data in collections. References store the relationships between data by including links or references from one document to another. Applications can resolve these references to access the related data. Data models based on References are also known as Normalised data models. The example illustrated next shows a Users document that references a placeOfBirth document.

{
"_id": "Ritesh",
"name": "Ritesh Modi",
"placeofBirth": "RiteshPOB"
}

{
"PlaceofBirth": {
"_id": "RiteshPOB",
"street": "123 xyz street",
"city": "xyzcity",
"state": "xyzState",
"zip": "12345"
}
}

A References relationship should be used:

  • To implement one-to-many relationships between documents.
  • To implement many-to-many relationships between documents.
  • If the referenced entities are updated frequently.
  • If the referenced entities grow indefinitely.
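
How an application might resolve such a reference can be sketched in Python. This is only a minimal sketch under stated assumptions: plain dictionaries stand in for the two collections, the referenced document is stored flat with its own `_id`, and the names `users`, `places_of_birth` and `resolve_place_of_birth` are hypothetical, not part of any MongoDB API.

```python
# In-memory stand-ins for two collections, keyed by _id (illustrative only).
users = {
    "Ritesh": {"_id": "Ritesh", "name": "Ritesh Modi", "placeofBirth": "RiteshPOB"},
}
places_of_birth = {
    "RiteshPOB": {
        "_id": "RiteshPOB",
        "street": "123 xyz street",
        "city": "xyzcity",
        "state": "xyzState",
        "zip": "12345",
    },
}


def resolve_place_of_birth(user_id):
    """Follow the reference stored in a user document (hypothetical helper).

    The database does not enforce the link, so a missing user or a
    dangling reference has to be handled by the application; here
    both cases yield None.
    """
    user = users.get(user_id)
    if user is None:
        return None
    return places_of_birth.get(user["placeofBirth"])


place = resolve_place_of_birth("Ritesh")
print(place["city"])  # prints: xyzcity
```

The point of the sketch is that two lookups are needed and that the second one can fail independently of the first, which is exactly the responsibility a References relationship hands to the application.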

Embedment
Embedded relationships in documents refer to storing related documents within an original document. The related data is part of the schema of the embedding document. In effect, the entire data is stored together within a single document, with related data stored as an array or sub-object. Data models based on Embedment are also known as De-Normalised data models. The example illustrated next shows the placeOfBirth entity embedded in the Users document.

{
"_id": "Ritesh",
"name": "Ritesh Modi",
"placeOfBirth": {
"street": "123 xyz Street",
"city": "xyzcity",
"state": "xyzState",
"zip": "12345"
}
}

Embedded documents should be used when:

  • There is a contained relationship between entities.
  • The embedded entity is an integral part of the document.
  • The embedded entities are not updated frequently.
  • The embedded entities do not grow indefinitely.
  • Relationships range from one to a few, between embedding and embedded entities.

Important considerations for MongoDB data modelling
While designing database document structure and data models in MongoDB, special consideration should be given to the following aspects for deploying highly scalable, performance-centric and efficient databases. It is to be noted that these are not mutually exclusive and should be evaluated in combination with each other.

Data usage: While designing a data model, emphasis should be laid on the patterns that the applications will use to access the data. The patterns refer to reading, writing, updating and deletion of data. Some applications are completely read-centric (like a reporting application), while others are write-centric (like an e-commerce application). Some are a combination of both. In some applications, one feature is read-heavy while others are write-heavy. It is even possible that, within a single document, some data is frequently updated while other data remains static. Based on these patterns, appropriate strategies should be devised using relationships, indexes, growth in document size and atomicity. Documents with Embedded relationships perform better than documents with References relationships when both pieces of data are needed in the same read.
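
The read-cost difference between the two relationship types can be illustrated with a small Python sketch. It assumes a dictionary-backed stand-in for a collection; the class `CountingCollection` and its `find_by_id` method are hypothetical names invented for this illustration, and the lookup counter is only a rough proxy for round trips to the database.

```python
class CountingCollection:
    """Dictionary-backed stand-in for a collection that counts lookups."""

    def __init__(self, documents):
        self._by_id = {doc["_id"]: doc for doc in documents}
        self.lookups = 0

    def find_by_id(self, doc_id):
        self.lookups += 1
        return self._by_id.get(doc_id)


embedded_users = CountingCollection([
    {"_id": "Ritesh", "name": "Ritesh Modi",
     "placeOfBirth": {"city": "xyzcity", "zip": "12345"}},
])
referenced_users = CountingCollection([
    {"_id": "Ritesh", "name": "Ritesh Modi", "placeofBirth": "RiteshPOB"},
])
places = CountingCollection([
    {"_id": "RiteshPOB", "city": "xyzcity", "zip": "12345"},
])

# Embedded model: one lookup returns the user and the place of birth.
user = embedded_users.find_by_id("Ritesh")
embedded_city = user["placeOfBirth"]["city"]

# Referenced model: a second lookup is needed to follow the reference.
user = referenced_users.find_by_id("Ritesh")
referenced_city = places.find_by_id(user["placeofBirth"])["city"]

embedded_cost = embedded_users.lookups
referenced_cost = referenced_users.lookups + places.lookups
print(embedded_cost, referenced_cost)  # prints: 1 2
```

Both models return the same city, but the referenced model pays for an extra lookup every time the two pieces of data are read together.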

Atomicity: Atomicity in database parlance means that operations either succeed or fail as a single unit. If there are multiple sub-operations within a parent transaction, the parent operation will fail if any of its sub-operations fail. In MongoDB, a write operation is atomic at the level of a single document. Even if one operation attempts to affect multiple documents or collections, each document is modified as a separate operation. In versions before MongoDB 4.0, which introduced multi-document transactions, there is no support from the database engine to roll back a part of the operations if some sub-operations fail. The application should implement the logic for affecting multiple documents or collections.

If data that would otherwise be spread over multiple collections must be updated together, Embedded relationships should be used, because the entire data is then available within a single document. There is no risk that only a part of the operation will succeed. References relationships can be used when it does not matter if some sub-operations fail.
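
The risk of a partial update can be sketched with an in-memory simulation in Python. The sketch assumes that replacing a single document is all-or-nothing, while writes to two separate documents are independent; the functions `update_embedded` and `update_referenced`, and the `fail_between` flag used to simulate a crash, are hypothetical and exist only for illustration.

```python
import copy


def update_embedded(user, new_name, new_city):
    """Build the new version of one document; it replaces the old one in a single write."""
    updated = copy.deepcopy(user)
    updated["name"] = new_name
    updated["placeOfBirth"]["city"] = new_city
    return updated


def update_referenced(users, places, user_id, new_name, new_city, fail_between=False):
    """Two separate writes; a failure in between leaves the data out of step."""
    users[user_id]["name"] = new_name            # write 1 succeeds
    if fail_between:
        raise RuntimeError("simulated crash before the second write")
    place_id = users[user_id]["placeofBirth"]
    places[place_id]["city"] = new_city          # write 2 may never happen


users = {"Ritesh": {"_id": "Ritesh", "name": "Ritesh Modi", "placeofBirth": "RiteshPOB"}}
places = {"RiteshPOB": {"_id": "RiteshPOB", "city": "xyzcity"}}

try:
    update_referenced(users, places, "Ritesh", "R. Modi", "newcity", fail_between=True)
except RuntimeError:
    pass  # no rollback happens: the first write stays applied

partial = (users["Ritesh"]["name"], places["RiteshPOB"]["city"])
print(partial)  # prints: ('R. Modi', 'xyzcity')

embedded = {"_id": "Ritesh", "name": "Ritesh Modi", "placeOfBirth": {"city": "xyzcity"}}
embedded = update_embedded(embedded, "R. Modi", "newcity")
```

After the simulated crash the name has changed but the city has not, which is the inconsistent state the application would have to detect and repair. The embedded variant has no such intermediate state because both values travel in one document.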

Document structure: Document structure plays a crucial role in data modelling. The application is written based on the structure of documents. The documents can be designed using the References or Embedment relationship.

Document growth: MongoDB allocates a fixed amount of space to a document when it is first stored. When the document size exceeds the space allocated for it, MongoDB’s storage engine relocates the document on the disk. With MongoDB 3.0.0, however, the default use of power-of-two-sized allocations minimises the occurrences of such re-allocations and allows for the effective reuse of the freed record space.
When using Embedded documents, it should be carefully analysed if the sub-object can grow out of bounds. If it can, there is the possibility of performance degradation when the size of the document crosses its limit. In such cases, References relationship should be used to ensure that growth in document size stays within limits.
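
The arithmetic behind power-of-two allocation can be sketched as follows. This is a simplified illustration, not the storage engine's actual code: the function names are hypothetical, the 32-byte `minimum` is an assumption of the sketch, and any special handling of very large documents is ignored.

```python
def power_of_two_allocation(document_size, minimum=32):
    """Round a document size in bytes up to the next power of two.

    `minimum` is an assumed smallest record size for this sketch.
    """
    if document_size <= 0:
        raise ValueError("document_size must be positive")
    allocated = minimum
    while allocated < document_size:
        allocated *= 2
    return allocated


def needs_relocation(old_size, new_size):
    """A document moves only if it outgrows the record allocated for it."""
    return new_size > power_of_two_allocation(old_size)


print(power_of_two_allocation(100))  # prints: 128
print(needs_relocation(100, 120))    # prints: False
print(needs_relocation(100, 200))    # prints: True
```

A 100-byte document receives a 128-byte record in this model, so growing to 120 bytes costs nothing, while growing to 200 bytes forces a move. Freed records also come in a small set of sizes, which is what makes them easy to reuse.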

Indexing: Indexes are especially useful in improving performance while retrieving data. They also return data in sorted order, which removes the need for applications to sort it explicitly. Collections that are frequently read should have indexes on the fields that are searched most often. While indexes are beneficial during read operations, they slow down write operations, because every index has to be updated along with the data. Indexes should therefore be built on fields that are updated infrequently and queried frequently. Another drawback is that indexes consume additional storage space, so they should be considered carefully before being implemented.
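
The read/write trade-off of an index can be illustrated with a toy Python structure. This is a sketch only: the class `IndexedCollection` is hypothetical, it keeps a sorted list where a real database would use a tree-based index, and it indexes a single field.

```python
import bisect


class IndexedCollection:
    """Toy collection with one index kept as a sorted list of (value, position)."""

    def __init__(self, field):
        self.field = field
        self.documents = []
        self.index = []

    def insert(self, document):
        self.documents.append(document)
        position = len(self.documents) - 1
        # Extra work on every write: the index must stay sorted.
        bisect.insort(self.index, (document[self.field], position))

    def find(self, value):
        """Binary search on the index instead of scanning every document."""
        start = bisect.bisect_left(self.index, (value, -1))
        results = []
        for key, position in self.index[start:]:
            if key != value:
                break
            results.append(self.documents[position])
        return results

    def sorted_by_field(self):
        """The index already holds the documents in sorted order."""
        return [self.documents[position] for _, position in self.index]


books = IndexedCollection("year")
books.insert({"name": "Ubuntu Linux", "year": 2017})
books.insert({"name": "Windows server 2016", "year": 2016})

print([book["name"] for book in books.sorted_by_field()])
# prints: ['Windows server 2016', 'Ubuntu Linux']
print(books.find(2017))
# prints: [{'name': 'Ubuntu Linux', 'year': 2017}]
```

Reads benefit twice, through the binary search and the ready-made sort order, while every insert pays for keeping the index sorted and the index itself takes up memory alongside the documents.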

Sharding: Sharding is a database load balancing technique fully supported by MongoDB. It refers to horizontal partitioning of data into multiple MongoDB instances, with each instance holding specific and unique data. Each instance is referred to as a ‘shard’ and hosts a portion of overall collection data. Sharding is typically employed with large datasets in collections with heavy operations on them.
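
The idea of horizontal partitioning can be sketched with a hash-based router in Python. This is a deliberately simplified illustration: MongoDB's own routing works on ranges of shard key values and is considerably more involved, and the functions `shard_for` and `distribute`, as well as the choice of `authorid` as the shard key, are assumptions made for this sketch.

```python
import hashlib


def shard_for(shard_key_value, shard_count):
    """Map a shard key value to a shard number using a stable hash."""
    digest = hashlib.sha256(str(shard_key_value).encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def distribute(documents, shard_key, shard_count):
    """Place each document on exactly one shard, chosen by its shard key."""
    shards = [[] for _ in range(shard_count)]
    for document in documents:
        shards[shard_for(document[shard_key], shard_count)].append(document)
    return shards


documents = [{"_id": n, "authorid": "author%d" % (n % 10)} for n in range(100)]
shards = distribute(documents, "authorid", 3)

for number, shard in enumerate(shards):
    print("shard", number, "holds", len(shard), "documents")
```

Every document lands on exactly one shard, and documents that share a shard key value always land on the same shard, so a query that includes the shard key needs to visit only one instance.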

Strategies for MongoDB data modelling
The strategies described below combine the cardinality of a relationship (one-to-one, one-to-many or many-to-many) with the two constructs introduced earlier, Embedment and References.

One-to-one with Embedded relationship: In this strategy, one data entity is embedded into another data entity, where both the entities have a one-to-one relationship with each other.

An example of a one-to-one Embedded relationship, between the user and the details of his place of birth, is illustrated here:

{
"_id": "Ritesh",
"name": "Ritesh Modi",
"placeOfBirth": {
"street": "123 xyz Street",
"city": "xyzcity",
"state": "xyzState",
"zip": "12345"
}
}

A one-to-one Embedded relationship should be used when:

  • Both the name and place of birth are retrieved together frequently.
  • Both the name and place of birth are updated together.
  • Place of birth sub-entity is not growing.

One-to-one with References relationship: In this strategy, one data entity references another data entity, where both the entities have a one-to-one relationship with each other.
An example of a one-to-one Referenced relationship, between the user and the details of his place of birth, is illustrated here:

{
"_id": "Ritesh",
"name": "Ritesh Modi",
"placeofBirth": "RiteshPOB"
}

{
"PlaceofBirth": {
"_id": "RiteshPOB",
"street": "123 xyz Street",
"city": "xyzcity",
"state": "xyzState",
"zip": "12345"
}
}

One-to-one Referenced relationships should be used when:

  • The name and place of birth are not usually retrieved together.
  • Both the name and place of birth are updated using different operations.
  • Place of birth sub-entity is not growing.

One-to-many with Embedded relationship: In this strategy, multiple data entities are embedded into another data entity, where they have a one-to-many relationship with each other.
An example of a one-to-many Embedded relationship, between an author and the books he has authored, is illustrated here:

{
"_id": "Ritesh",
"name": "Ritesh Modi",
"booksAuthored": [
{
"name": "Windows server 2016",
"publisher": "Self Publishing",
"year": "2016",
"price": "30"
},
{
"name": "Ubuntu Linux",
"publisher": "Self Publishing",
"year": "2017",
"price": "40"
}
]
}

One-to-many Embedded relationships should be used when:

  • Both the author and the books published are retrieved together frequently.
  • Both the author and the books published are updated together.
  • The books authored sub-entity is not growing out of bounds, i.e., there is a one-to-few relationship between entities.

One-to-many with References relationship: In this strategy, collections are referenced where they have a one-to-many relationship with each other.
An example of a one-to-many Referenced relationship between authors and books published is illustrated here:

{
"_id": "Ritesh",
"name": "Ritesh Modi"
}


{
"_id": "bookid",
"authorid": "Ritesh",
"books": [
{
"name": "Windows server 2016",
"publisher": "Self Publishing",
"year": "2016",
"price": "30"
},
{
"name": "Ubuntu Linux",
"publisher": "Self Publishing",
"year": "2017",
"price": "40"
}
]
}

One-to-many Referenced relationships should be used when:

  • The author and books published are not usually retrieved together.
  • The author and books published are updated at different times in different operations.
  • Books authored can grow out of bounds.
  • Repetition of data has to be avoided.
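
A one-to-many Referenced model can be sketched in Python as follows. The sketch uses a variation in which every book is its own document carrying an `authorid` field, which is an assumption made here so that the list of books can grow without any single document growing. The lists stand in for collections, and `books_by_author` and `add_book` are hypothetical helpers.

```python
authors = [{"_id": "Ritesh", "name": "Ritesh Modi"}]
books = [
    {"_id": "book1", "authorid": "Ritesh", "name": "Windows server 2016", "year": "2016"},
    {"_id": "book2", "authorid": "Ritesh", "name": "Ubuntu Linux", "year": "2017"},
]


def books_by_author(author_id):
    """Resolve the 'many' side by matching the reference field."""
    return [book for book in books if book["authorid"] == author_id]


def add_book(book):
    """Adding a book touches only the books collection.

    The author check is done by the application, because the
    database does not enforce the reference.
    """
    if not any(author["_id"] == book["authorid"] for author in authors):
        raise ValueError("unknown author: %s" % book["authorid"])
    books.append(book)


add_book({"_id": "book3", "authorid": "Ritesh", "name": "A third title", "year": "2018"})
print([book["name"] for book in books_by_author("Ritesh")])
# prints: ['Windows server 2016', 'Ubuntu Linux', 'A third title']
```

The author document is never rewritten when a book is added, so it stays small however many books are published. The price is the extra query needed to read an author together with the books.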

Many-to-many with Embedded relationship: It is not advisable to implement the many-to-many strategy with Embedded relationships, as it leads to unnecessary repetition of data and applications have to write complex logic to perform updates and retrievals.

Many-to-many with References relationship: In this strategy, collections are Referenced where they have a many-to-many relationship with each other.
An example of a many-to-many Referenced relationship between a bank account and account holders is illustrated here. The bank account document uses arrays to reference the account holders, and the account holders’ document refers to the bank account using arrays.

{
"_id": 123456789,
"accountNumber": "123456789",
"accountName": "ManytoManyAccount",
"accountHolders": [ 111111111, 222222222 ]
}

{
"_id": 111111111,
"name": "Ritesh",
"age": 25,
"phone": "9999999999",
"email": "[email protected]",
"account_id": [ 123456789 ]

}

{
"_id": 222222222,
"name": "Sohan",
"age": 30,
"phone": "8888888888",
"email": "[email protected]",
"account_id": [ 123456789]

}

Many-to-many Referenced relationships should be used when:

  • Many-to-many relationships exist between entities.
  • Both collections can grow out of bounds.
  • Repetition of data has to be avoided.
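
Because both sides of a many-to-many Referenced relationship store the link, the application has to keep the two arrays in step. The following Python sketch shows one way to resolve the references and to detect links recorded on only one side. Lists stand in for collections, the array field is spelled `accountHolders` here, and `holders_of` and `dangling_links` are hypothetical helpers, not MongoDB features.

```python
accounts = [
    {"_id": 123456789, "accountName": "ManytoManyAccount",
     "accountHolders": [111111111, 222222222]},
]
holders = [
    {"_id": 111111111, "name": "Ritesh", "account_id": [123456789]},
    {"_id": 222222222, "name": "Sohan", "account_id": [123456789]},
]


def holders_of(account_id):
    """Resolve the holder references stored on an account document."""
    account = next(a for a in accounts if a["_id"] == account_id)
    wanted = set(account["accountHolders"])
    return [holder["name"] for holder in holders if holder["_id"] in wanted]


def dangling_links():
    """Return (account_id, holder_id) pairs recorded on only one side."""
    from_accounts = {(a["_id"], h) for a in accounts for h in a["accountHolders"]}
    from_holders = {(acc, h["_id"]) for h in holders for acc in h["account_id"]}
    return from_accounts ^ from_holders


print(holders_of(123456789))  # prints: ['Ritesh', 'Sohan']
print(dangling_links())       # prints: set()
```

A consistency check of this kind is worth running after any update that adds or removes a holder, since the two writes involved are separate operations.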