It’s understandable that those new to MongoDB, a so-called “schema-free” database, might assume that they no longer need to be concerned with the art and science of data modeling. In reality, however, data modeling is just as important in MongoDB as in other databases. Indeed, because some of the modeling principles for MongoDB are less well understood, the data modeling process arguably needs more attention.
Data in a MongoDB database is represented within JSON (JavaScript Object Notation) objects. JSON documents are built up from a small set of very simple constructs: values, objects, and arrays.
- Arrays consist of lists of values enclosed by square brackets (“[” and “]”) and separated by commas (“,”).
- Objects consist of zero or more name/value pairs in the format “name”: value, enclosed by braces (“{” and “}”) and separated by commas (“,”).
- Values can be Unicode strings, standard format numbers (possibly including scientific notation), Booleans, null, arrays, or objects.
The last few words in the definition above are very important: because values may include objects or arrays, which themselves contain values, a JSON structure can represent an arbitrarily complex and nested set of information. In particular, arrays can be used to represent repeating groups of documents that would require a separate table in a relational database. For instance, the line items for an order might be included as a nested array of documents within an order document.
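As a rough illustration of this nesting, the sketch below builds an order document as a plain Python dictionary and serializes it with the standard `json` module. The field names (`order_id`, `customer`, `line_items`, and so on) are hypothetical and chosen only for this example; no MongoDB driver is involved.

```python
import json

# Hypothetical order document: every field name here is an assumption
# made for illustration, not something prescribed by MongoDB.
order = {
    "order_id": 1001,                 # number value
    "customer": "Acme Pty Ltd",       # string value
    "shipped": False,                 # Boolean value
    "line_items": [                   # array of embedded documents
        {"sku": "WIDGET-1", "quantity": 2, "unit_price": 9.5},
        {"sku": "GADGET-7", "quantity": 1, "unit_price": 24.0},
    ],
}


def order_total(doc):
    """Sum quantity * unit_price over the embedded line items."""
    return sum(item["quantity"] * item["unit_price"] for item in doc["line_items"])


# The whole order, children included, serializes as a single JSON document.
print(json.dumps(order, indent=2))
print(order_total(order))
```

Because the line items travel with the order, computing a total needs nothing beyond the one document.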
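This “embedding” strategy has several advantages. Retrieving an item together with all its children is very quick, and an update to the parent and its children can be applied atomically, because a write to a single MongoDB document is atomic. When parents and children live in separate documents, the same guarantee requires a multi-document transaction (available since MongoDB 4.0).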
However, not everything in a complex application can be represented within a single document. For instance, it would be inadvisable to include every single attribute of a product within every line item for every order. The duplication of product details would cause a massive increase in storage, and the overhead of updating the redundant information (changing the default price for a product, for instance) would be overwhelming.
Moreover, MongoDB documents are limited to a maximum size of 16MB. It’s therefore important not to embed unbounded arrays within a document, because an array that grows indefinitely will eventually exceed this 16MB limit.
The alternative to the “document embedding strategy” is to “link” documents using the MongoDB equivalent of a foreign key. For instance, a document could contain an array of references to the identifiers of documents in a different collection. This is not so different from standard relational foreign keys.
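A minimal sketch of the linking idea follows, with two collections modeled as Python dictionaries keyed by `_id`. The collection contents, the field names, and the `resolve_line_items` helper are all hypothetical; in a real application the lookup would be a query against the second collection.

```python
# Two hypothetical "collections", modeled as dictionaries keyed by _id.
products = {
    "p1": {"_id": "p1", "name": "Widget", "default_price": 9.5},
    "p2": {"_id": "p2", "name": "Gadget", "default_price": 24.0},
}

# The order stores only references (product_id), not the product details.
order = {
    "_id": "o1",
    "line_items": [
        {"product_id": "p1", "quantity": 2},
        {"product_id": "p2", "quantity": 1},
    ],
}


def resolve_line_items(order_doc, product_collection):
    """Follow each product_id reference, as an application-side join would."""
    resolved = []
    for item in order_doc["line_items"]:
        product = product_collection[item["product_id"]]
        resolved.append(
            {
                "name": product["name"],
                "quantity": item["quantity"],
                "price": product["default_price"],
            }
        )
    return resolved


# Changing a default price touches one product document, not every order.
products["p1"]["default_price"] = 10.0
items = resolve_line_items(order, products)
print(items)
```

The price change is made once and is visible through every order that references the product, at the cost of an extra lookup per line item.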
Because MongoDB offers only limited support for joins between collections, the linking strategy can result in complicated code and poor performance, especially in the absence of suitable indexes. A compromise solution is to embed the most recently created children in the parent, and link to older children. This has the advantage of allowing the “first page” of a parent-child query to be satisfied with a single I/O operation, without risking a violation of the 16MB limit.
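The compromise can be sketched as follows, using a blog post and its comments as a hypothetical parent and children. The cap of three embedded children, the field names, and the `add_comment` helper are assumptions for illustration; the `comment_archive` list stands in for a separate collection.

```python
MAX_EMBEDDED = 3  # assumed cap on embedded children; tune per application


def add_comment(post, archive, comment):
    """Embed the newest comment; move the oldest out once the cap is exceeded."""
    post["recent_comments"].insert(0, comment)  # newest first
    while len(post["recent_comments"]) > MAX_EMBEDDED:
        oldest = post["recent_comments"].pop()
        # The archived copy links back to its parent through post_id.
        archive.append({"post_id": post["_id"], **oldest})


post = {"_id": "post1", "title": "Data modeling", "recent_comments": []}
comment_archive = []  # stands in for a separate collection of older comments

for n in range(1, 6):
    add_comment(post, comment_archive, {"seq": n, "text": f"comment {n}"})

print([c["seq"] for c in post["recent_comments"]])
print([c["seq"] for c in comment_archive])
```

The embedded array never holds more than `MAX_EMBEDDED` entries, so the parent document stays bounded however many children are created, while the newest children are still available from the parent alone.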
Choosing the correct data model in MongoDB requires a sound understanding of the data structures and the nature of the queries that will be executed. In many ways, it requires more thought than the initial modeling of a relational schema, which can be driven by the mostly deterministic rules of third normal form.
MongoDB allows you to evolve schemas over time, which in some respects makes it easier to correct deficiencies in early data models. But it’s still often very tricky to change existing document layouts in a production application, and it’s usually far better to get the model at least approximately right in the first iteration. MongoDB developers should invest time early in the development process to establish a sound data model.