danielwertheim

danielwertheim


notes from a passionate developer

Share


Sections


Tags


Disclaimer

This is a personal blog. The opinions expressed here represent my own and not those of my employer, nor current or previous. All content is published "as is", without warranty of any kind and I don't take any responsibility and can't be liable for any claims, damages or other liabilities that might be caused by the content.

MongoDb for .Net folks - a quick theoretical peak what schema-less means

Now it’s time to have a quick peak on this whole “schema-less” thing that NoSQL products offer. We will of course focus on MongoDB when we compare a document-oriented DB against a more traditional, relational DB commonly found in RDBMS like SQL Server.

Previous articles are:

Schema, MongoDB vs RDBMS

You are probably familiar with some of the parts building up a schema in a relational database of a RDBMS, like SQL Server. Lets just see how the terminology maps to MongoDB world:

Building blocks, RDBMS vs MongoDb

Table --> Collection
Row --> Document
Column --> Field
Join --> Embedded document
Join --> Link/Reference
Index --> Index

Table vs Collection

The Collection is a named bag where you put documents in. It’s created on the fly (upon first use). Collections are schema-less, and being schema-less, there’s nothing keeping you from mixing entities in one certain collection. You could have a collection named Cars and let it hold: Car, Driver, Fruit, ... with completely different members in their documents:

Car: {registrationNo: 'ABC123', make: 'VOLVO', model: 'XC90'}

vs

Driver: {socSecNo: 12345, firstname: 'Daniel', lastname: 'Wertheim', license: 'AB'}

VS

Fruit: {type: 'APPLE', name: 'Royal Gala', color: 'Red', season: 'Autumn'}

So in MongoDB there’s NO underlying scheme stating: this collection has the fields: registrationNo, make, model, socSecNo, firstname, lastname, license, type, name, color, season, with information that e.g ‘firstname’ is ‘nvarchar(50) NULL’. You can of course try and mimic this behavior in a RDBMS, by having key-value tables, like I have in SisoDb.

Schema-less, does that mean I don’t need to care about schemas?

No it doesn't! In theory, perhaps, but I would say NO! The way you design your documents should be with regards taken to how you will consume them. What is of most importance for the current context? Write- or read performance? Are you working in a statically typed language like C# and you wan’t to have pre-made classes defining your storage model? If, then mixing Cars and Fruits in the same collection will enforce you to have some sort of application logic when you query the collection. Will document D1 be a Car or a Fruit?

Since you are writing the next “killer app”, you will probably have lots of data. You will then have to select an index for documents in the collection and most likely also a sharding-key. Having documents with different schema in the same collection could cause you trouble, since the sharding-key needs to be in both.

I also thing you should be thinking of maintenance. There’s no schema that you can inspect. So what if your collection would contain documents with very different contracts? The person maintaining your app might not realize this if he just inspects some documents in a certain collection.

Row vs Document & Join vs Embedded document

It’s probably easy to think of both row and document, as representations of an instance of one certain entity. In a traditional normalized DB in a RDBMS, this would only hold true for dead simple entities that has no relations; that is, all data for the entity is contained in one row. Let say we have the entity Order, where one Order can contain many OrderLine instances. In a relational DB this would be expressed by an Order- and an OrderLine-table, with a relation between them, where the OrderLine-table would have a foreign-key to a certain row in the Order-table. In a document oriented DB you would most likely embed the Order-lines within the Order; hence we would store the data seeing the Order as an aggregate containing Order lines. So in MongoDB, you wouldn't have to perform a join to build up the complete result of one certain Order.

Relational DB
Order {#Id, Date, CustomerNo, ...}
--<1:N> --
OrderLine {#OrderId, #ArticleNo, Quantity, Price, Tax }

Document DB:
Order {#_id, date, customerNo, ..., lines: [
    {articleNo, quantity, prixe, tax},
    {articleNo, quantity, prixe, tax}]}

So, lets say you need to get all orders that contains a certain article number. When your query will hit the database it will look something like this in a relational database:

select [columns of interest]
from dbo.Order as O
inner join dbo.OrderLine as Ol
    on Ol.OrderId = O.Id
where Ol.ArticleNo = @articleNo;

In MongoDB this would look like:

db.orders.find({'lines.articleNo'=articleNo})

So, what if we would add a Customer into our model? Please, bear with me. I’ll elaborate on this below, under: “Join vs Link/Reference“.

Index vs Index

The concepts of indexes exists in MongoDB as well. The query above could be made faster by applying an index on articleNo:

db.ensureIndex({'lines.articleNo': 1})

We will look into indexing in a future articles. For now, it’s enough to realize indexes exists in MongoDB as well and affects the design of your documents. Also note the difference compared to an RDBMS. You would have to ensure indexes exists for getting an effective join as well as an index on the ArticleNo.

Join vs Link/Reference

Above we dealt a bit if “Join vs Embed document”, which looked a bit of how documents can use embedding to create aggregates and thereby remove the requirements of joins. If you are coming from a RDBMS world you are probably twisting in your chair right now. Thinking about: duplication of data, referential constraints, etc. You can of course separate data and store them in different documents and include _id values that you would use to perform application level links/references. So now you have two techniques: embedding vs linking.

As always: “it depends on usage of the data”. I would say that data in a document-oriented world is stored in a use-case centric way rather then as normalized grid of small data units. A traditional example of this is. If we have a blog-post which in turn could have comments, which in turn could have replies, you could store this as one big document holding blog post information as well as a tree of comments:

blogpost: {
    _id: ..., title: ..., content: ..., ts: ..., author: ...,
    comments: [
        {title: ..., ts: ..., author: ..., comment: ...},
        {title: ..., ts: ..., author: ..., comment: ...,
            replies:[
                {title: ..., ts: ..., author: ...},
                {title: ..., ts: ..., author: ...}
            ]
        }
    ]
}

If you tend to always query and display a single blog-post as a whole unit this could be fine. But what if you get a massive amount of comments? You might hit the document size limit of 16Mb. Alternatives? Linking. You could store the blog-post id in a big document containing all the comments:

blogpost: {
    _id: 'fooId', title: ..., content: ..., ts: ..., author: ...}

comments: {_id: ..., blogpostId: 'fooId', replies: [
        {title: ..., ts: ..., author: ..., comment: ...},
        {title: ..., ts: ..., author: ..., comment: ...}]
}

But again, you might hit the size limit of this new comments document. Or perhaps you want to be able to work with individual comments. So, split it apart.

blogpost: {
    _id: 'fooId', title: ..., content: ..., ts: ..., author: ...,
    comments: ['commentId1', 'commentId2', ...]
}

comment: {_id: 'commentId1', blogpostId: 'fooId'}
comment: {_id: 'commentId2', blogpostId: 'fooId'}

Since we have introduced _ids on both sides we can navigate in both directions. We can work with the blog-post and we can work with single comments as well as list all comments by blog-post id. But, but, but! Again. Will this suite your needs? In this case perhaps. But in similar models the extra queries needed to build up the same “blog-post view” with the post and all comments might hurt your performance. The design should always be context specific. You might also have situations were you actually need to mix, choosing a relational DB and a document-oriented DB? You could also duplicate data and create views specific to certain use-cases. One view holding parts of the blog-post, like: title, ts, top10comments.

A more clear example would be to continue the Order sample. We would probably have a document called Customer; which wouldn't be embedded in the Order, but parts of the customer would. At least one unique value, so that you could link to the Customer in your application.

order: {_id: ..., customerId: ...m customerNo: ..., ...}

Summary

We have had a quick look at the concept of being “schema-less”. Using MongoDB as an example, I would say, your applications still has a conceptual schema, but your database has no schema enforcing conformity in your collections/sets. And I would like to end this with a traditional quote:

“With great powers comes great responsibility”

That’s all. Lets hope I've time to make a more “non theoretical” entry about this soon.

//Daniel

View Comments