Introduction to NoSQL in Cosmos DB

In this chapter, we will start our journey toward developing applications that work with a globally distributed, massively scalable, and multi-model database service provided by Microsoft: Azure Cosmos DB. We will focus on a high-level technical overview of this innovative database service.

Modern applications that take advantage of Azure and other cloud platforms usually require working with massive amounts of data that might be organized in different ways. In addition, these applications require elastic scale out of storage and throughput. We might start with a few gigabytes, but we can end up with many petabytes within months. Our application can start with most of its clients in California, but it might expand to clients in Germany, Switzerland, and Norway in the near future. Of course, our application will be continuously evolving, and new requirements will force us to store more data for each operation it performs. In this chapter, we will understand why Cosmos DB is an excellent candidate to be used as a database service in these kinds of applications.

In this chapter, we will cover the following:

  • Making the paradigm shift to the NoSQL way
  • Learning about the main features of Cosmos DB
  • Understanding the supported NoSQL data models
  • Using the appropriate API for each data model
  • Diving deep into the Cosmos DB resource model
  • Understanding the system topology
  • Learning about the resource hierarchy for each container

Making the paradigm shift to the NoSQL way

During the last decade or so, the most popular databases have been relational database management systems. Hence, there is a huge number of developers who know how to build an application that requires the persisting and querying of data by creating tables and relationships in relational databases, such as Microsoft SQL Server.

However, when we work with C# and .NET Core, we work with object-oriented programming. LINQ makes it possible to easily query objects by adding functional programming features to C#, but we need to add a complexity layer between our application and the relational databases: an Object-Relational Mapping (ORM) solution, such as Entity Framework or NHibernate.

We have entities in our C# application, and the ORM maps these entities and their relationships to the tables. This way, we can create an instance of an entity and persist it in the underlying tables in the relational database system. The ORM translates the operations into the necessary SQL for the relational database system to insert new rows in the appropriate tables.

Of course, we could continue explaining the different operations and how the objects in our application, the ORM, and the relational database make them happen. However, our goal is to start making the paradigm shift between the usual way of working (with relational databases) and the new way of working (with NoSQL databases).

The first version of an application that works with C#, an ORM, and a relational database is not a problem. However, when new requirements arrive and we need to add new properties to an existing object or relate it to another object, we have to perform migrations to make it possible to use the new objects in the different operations and persist them in the underlying relational database. We have to make changes in different places. We need to edit the classes that define the properties that an entity has to persist, we have to make sure that the ORM mappings are updated, and we need to ensure that the underlying relational database has the new columns in the necessary tables. Hence, we make changes in the code, in the ORM, and in the underlying database schema.

Whenever it is necessary to deploy a new version of the application, we have to make sure that the migration process is executed and that the underlying database schema has the version required by the C# code and the ORM configuration. The migration process makes the necessary changes in the tables and relationships so that they match the ORM mappings. Hence, a single property that needs to be added to an object and persisted generates a cascade of changes in different parts of our application. Of course, there are many ways of automating the necessary tasks. However, the fact that the tasks are automated doesn't mean that they aren't required.

Now, let's start thinking about the way in which we are going to work with a Cosmos DB NoSQL document database. We can start writing our first version of an application with C# and .NET Core, work with object-oriented programming, and use the provided methods in the Cosmos DB .NET Core SDK to persist the created objects in the schema-agnostic document database. There is no ORM between our application code and the NoSQL database service. We work with objects, we persist them, we retrieve them, and we query them. We only need to specify serialization and deserialization settings if necessary, but we don't have to worry about mapping an object to tables and their relationships.

Now, seriously, what else do we need to do to create our first version of the application? We have to learn how to work with the Cosmos DB .NET Core SDK, as well as the necessary tools for interacting with and managing a Cosmos DB database, in order to be ready for the first version of the application. We also have to understand the SQL dialect, which allows us to work against a Cosmos DB document database, in addition to many scalability and provisioning strategies.

A NoSQL database makes it easier to start working with a first version of an application compared to the process required with the traditional ORM and relational database management system combination. We work with documents, we store documents, we retrieve documents, we query documents. We don't require complex mappings and translations. We can work with object-oriented code without adding complex middleware such as an ORM.

Whenever it is necessary to deploy a new version of an application that is working against a Cosmos DB NoSQL document database, we don't have to worry about migration processes. If we need to add a single property to an object, we just add it and persist it in the schema-agnostic document database. There is no need to run any script that makes changes to the existing documents. We can continue working with the documents with a different schema, as they will be able to coexist with the new documents that have a new schema.

What about queries that work only with the new property? No problem—we can use properties or keys that don't exist in all the persisted documents; the schema-agnostic features support this scenario. For example, we can start running queries that check whether the value of the new property matches some specific criteria after persisting an object that has the new property.
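The coexistence of old and new documents can be illustrated with a minimal sketch. It uses plain Python dictionaries as a stand-in for a schema-agnostic collection; it does not use the Cosmos DB SDK, and the `collection` list, the `loyaltyTier` property, and the `query_by_property` helper are hypothetical names chosen only for illustration.

```python
# Hypothetical in-memory stand-in for a schema-agnostic collection.
# The first two documents were persisted by version 1 of the application;
# the third one was persisted by version 2, which added "loyaltyTier".
collection = [
    {"id": "1", "name": "Alice", "city": "San Diego"},
    {"id": "2", "name": "Bruno", "city": "Berlin"},
    {"id": "3", "name": "Chloe", "city": "Oslo", "loyaltyTier": "gold"},
]


def query_by_property(documents, key, expected):
    """Return the documents in which `key` exists and equals `expected`.

    Documents that lack the property are skipped instead of raising an error.
    """
    return [doc for doc in documents if key in doc and doc[key] == expected]


gold_customers = query_by_property(collection, "loyaltyTier", "gold")
print([doc["id"] for doc in gold_customers])
```

No migration step was needed for the two older documents: the query simply does not match them.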

You might be wondering, "Why haven't I been working with NoSQL for the last 10 years?" There is a simple answer to this question: storage costs were higher and relational database management systems made it easy to optimize storage use while providing great database features. However, things have changed, and nowadays, we have new options. Cosmos DB provides a NoSQL database service that allows us to get up and running very quickly. We will learn how to create a first version and a second version of an application to simplify the paradigm shift to the NoSQL way of working with Cosmos DB.

Obviously, as always happens, relational databases will still be great for thousands of scenarios. However, be sure that the time you invest in learning Cosmos DB features will allow you to use its services in an application in which you thought that the only choice was a traditional relational database.

Learning about the main features of Cosmos DB

Cosmos DB extends the database service that was known as Azure DocumentDB. However, it is very important to note that Cosmos DB adds a huge number of features to the services offered by its predecessor. In fact, Cosmos DB is continuously adding new features and has quickly become one of the most innovative services found in Azure that targets mission-critical applications at a global scale.

Cosmos DB is a NoSQL database service included in Azure. NoSQL definitely means not only SQL in the case of this database service, because Cosmos DB provides a SQL API that allows us to query documents by using SQL in one of the possible models that the database service supports. Cosmos DB is a multi-model database service, and therefore it supports different non-relational models, which we will analyze later.

Let's perform a bottom-to-top analysis to have a better understanding of this database service. The following are three main features that Cosmos DB provides that establish pillars for supporting additional features:

  • Partitioning
  • Replication
  • Resource governance

Partitioning makes it possible for Cosmos DB to provide an elastic scale out of storage and throughput by distributing the data in multiple logical and underlying physical partitions. We can start with something very small and grow elastically and seamlessly to something very large, increasing both storage and throughput as required. For example, we can start with a total storage size measured in gigabytes and end up with petabytes. We can start with small throughput requirements per second and end up with huge throughput requirements per second.
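To make the idea of distributing data across partitions more concrete, here is a minimal sketch. It assumes that each item carries a partition key value and that a hash of that value selects a partition. The SHA-256 hash, the modulo step, and the names `partition_for` and `distribute` are assumptions made for illustration; they are not the algorithm that Cosmos DB actually uses.

```python
import hashlib


def partition_for(partition_key_value: str, partition_count: int) -> int:
    """Map a partition key value to a partition index (illustrative scheme only)."""
    digest = hashlib.sha256(partition_key_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def distribute(items, key_name, partition_count):
    """Group items by the partition that their partition key value hashes to."""
    partitions = {index: [] for index in range(partition_count)}
    for item in items:
        partitions[partition_for(item[key_name], partition_count)].append(item)
    return partitions


# Hypothetical sample data: 20 orders that belong to 5 customers.
orders = [{"id": str(n), "customerId": f"customer-{n % 5}"} for n in range(20)]
partitions = distribute(orders, "customerId", partition_count=4)

for index, stored in partitions.items():
    print(index, sorted({order["customerId"] for order in stored}))
```

All items that share a partition key value land in the same partition, so adding partitions spreads both the stored data and the request load.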

Replication makes it possible to deliver turnkey global distribution and replicate data through any number of regions in which Cosmos DB is available. The number of regions is continuously increasing and there are no limitations on the number of regions to which we can replicate data. For example, we can have a Cosmos DB database service working with the West US, East US, Brazil South, Japan East, and Japan West regions. The following diagram shows icons with sample regions in which a Cosmos DB database can be replicated (at the time of writing this book).

The hexagons represent the regions in which a database can be replicated:

Cosmos DB offers five consistency models so that we can select the one that best balances write performance against the consistency we need. This way, we can manage the trade-off between performance and consistency. We will analyze them in detail later.

Resource governance makes it possible to provide high availability. Cosmos DB can provide 99.99% (also known as four nines) of availability in a single region and 99.999% (also known as five nines) of availability in multiple regions. Availability is one of the most important aspects of a database. Cosmos DB provides high availability in a transparent and automatic way that doesn't require manual changes in the configuration; that is, we don't need to make changes or redeploy and we can continue using the same endpoint.
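A quick calculation helps to put the two availability figures in perspective. The following sketch only applies arithmetic to the percentages quoted above, assuming a 365-day year; it is not a restatement of any service-level agreement terms, and the function name `max_downtime_minutes` is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def max_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


four_nines = max_downtime_minutes(99.99)
five_nines = max_downtime_minutes(99.999)

print(round(four_nines, 2), "minutes per year at 99.99%")
print(round(five_nines, 2), "minutes per year at 99.999%")
```

Under these assumptions, four nines corresponds to roughly 52.6 minutes of downtime per year, and five nines to roughly 5.3 minutes per year.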

Of course, one of the key aspects of a database service is performance. Cosmos DB provides the necessary features for achieving predictable performance. The database service implements resource governance at a very fine level of granularity and on a per-request basis. This way, the database service guarantees a pre-configured desired throughput as well as the latency for each individual request. Hence, capacity planning is really straightforward.

Understanding the supported NoSQL data models

There are many flavors of NoSQL database. The following are the four most common types of NoSQL database:

  • Key/value: This is a persistent dictionary. It is best for when we know the key and we need to retrieve the associated value for the key.
  • Column, wide-column, or column-family: This organizes related data into columns instead of the typical organization in rows. It is best for when we need to query across specific columns in the database.
  • Document: This allows persisting JSON objects (documents), which can include nested objects or arrays of other objects.
  • Graph: This allows you to persist edges and nodes with their properties. It is best for when we need to store and navigate through complex relationships.

The following diagram outlines each of the four explained flavors of NoSQL database to make it easy to understand the typical data they persist:

Cosmos DB uses a schema-agnostic data store on top of the previously explained main features that provide a core platform. Cosmos DB can efficiently project this data store to the four previously listed NoSQL data models. Thus, the database service allows us to select the most appropriate NoSQL data model based on our needs, and we can take full advantage of partitioning, replication, and resource governance with any of them.
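The four flavors can be compared side by side with a small sketch that represents each of them as an ordinary Python data structure. All the sample values and the helper names `column` and `neighbors` are hypothetical; the goal is only to show the typical shape of the data, not how Cosmos DB stores it.

```python
# Key/value: a persistent dictionary; we retrieve a value when we know its key.
key_value_store = {"session:42": "cart=3 items"}

# Wide-column: each row key maps to column families that hold sparse columns.
wide_column_store = {
    "customer:1": {"profile": {"name": "Alice"}, "orders": {"count": 3}},
    "customer:2": {"profile": {"name": "Bruno"}},
}

# Document: a JSON-like object with nested objects and arrays of objects.
document = {
    "id": "1",
    "name": "Alice",
    "address": {"city": "San Diego", "state": "CA"},
    "orders": [{"sku": "A-100", "quantity": 2}, {"sku": "B-200", "quantity": 1}],
}

# Graph: nodes and edges, each with its own properties.
nodes = {"alice": {"label": "customer"}, "bruno": {"label": "customer"},
         "a100": {"label": "product"}}
edges = [
    {"from": "alice", "to": "a100", "label": "bought"},
    {"from": "bruno", "to": "a100", "label": "bought"},
    {"from": "alice", "to": "bruno", "label": "referred"},
]


def column(family, name):
    """Read one column across all the rows that have it (wide-column access)."""
    return {row_key: families[family][name]
            for row_key, families in wide_column_store.items()
            if family in families and name in families[family]}


def neighbors(node_id, edge_label):
    """Follow the outgoing edges with the given label (graph navigation)."""
    return [edge["to"] for edge in edges
            if edge["from"] == node_id and edge["label"] == edge_label]


print(key_value_store["session:42"])
print(column("profile", "name"))
print(document["orders"][0]["sku"])
print(neighbors("alice", "bought"))
```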

Using the appropriate API for each data model

Under the hood, Cosmos DB stores data in a format called atom-record-sequence (ARS), which is highly optimized for partitioning and replication. Hence, no matter which NoSQL data model and API we use, the data ends up stored in this internal format.

Cosmos DB provides support for five different APIs with SDKs for many programming languages and platforms. Based on the data model we use with our database, we must use a specific API to interact with the Cosmos DB database service. The following table summarizes the five APIs that are available based on the four data models:

NoSQL database type                      Available APIs
Key/value                                Table API
Column, wide-column, or column-family    Cassandra API
Document                                 SQL API, MongoDB API
Graph                                    Gremlin API

Based on the information provided in the previous table, if we work with a document database, we can work with either the SQL API or the MongoDB API. If we are migrating an existing application that works with MongoDB to Cosmos DB, we can take advantage of the use of the MongoDB API to migrate the application to the new database service. If we are building an application from scratch, we might consider the use of the SQL API, which provides a Cosmos DB dialect of SQL to work against a document database. We will cover both scenarios in this book. We will work with the SQL API with .NET and C#, and we will work with the MongoDB API with Node.js.

The following diagram shows graphics that represent each of the four explained flavors of NoSQL database and the APIs that can be used for each of them:

Diving deep into the Cosmos DB resource model

First, we must understand the Cosmos DB resource model, which is used by all supported NoSQL data models and some APIs. When we provision a new Cosmos DB account, we will be provided with a URI and an endpoint that represents the account and allows clients to establish a connection. At the time we provision the account, we must select the API that we want to use, and this selection will determine the type of NoSQL database that we will be creating, among other things, which we will learn about later. The following list shows the available APIs with the names used in the Azure portal and the type of NoSQL database that each of them will end up creating:

  • SQL: Document
  • MongoDB: Document
  • Cassandra: Wide-column
  • Azure Table: Key/value
  • Gremlin (graph): Graph

Once we have an account provisioned, we can create a new database that will use the API that was selected for the account. An account can have many databases of the same NoSQL type that use the same API.

The following diagram shows the generalized hierarchy of elements that belong to a Cosmos DB account:

Each database will have a set of containers whose name will be different based on the NoSQL database type and API. In fact, based on the NoSQL database type, the containers will be projected in a different way to the underlying data storage. The following list specifies the container name for each NoSQL database type:

  • Document: Collection
  • Graph: Graph
  • Key/value: Table
  • Wide-column: Table

For example, when we work with a document database with either the SQL API or the MongoDB API, we will organize documents into containers known as collections. Whenever we create a new collection, we are able to provision the desired throughput, which we can then scale up or down on demand. We will also be able to specify a hint for how we want to distribute the data on the underlying partition sets. We will analyze each of these topics in detail later, as we want to stay focused on the Cosmos DB resource model for now.
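The account, database, and container levels described so far can be pictured with a minimal object model. This is only a sketch: the class names, the `create_database`, `create_container`, and `scale` methods, the placeholder endpoint, and the unit-less throughput numbers are all hypothetical, and none of them belongs to a Cosmos DB SDK.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Container:
    name: str
    throughput: int  # hypothetical unit-less value for the provisioned throughput
    items: List[dict] = field(default_factory=list)

    def scale(self, new_throughput: int) -> None:
        """Scale the provisioned throughput up or down on demand."""
        if new_throughput <= 0:
            raise ValueError("throughput must be positive")
        self.throughput = new_throughput


@dataclass
class Database:
    name: str
    containers: Dict[str, Container] = field(default_factory=dict)

    def create_container(self, name: str, throughput: int) -> Container:
        container = Container(name, throughput)
        self.containers[name] = container
        return container


@dataclass
class Account:
    endpoint: str
    api: str  # selected once, when the account is provisioned
    databases: Dict[str, Database] = field(default_factory=dict)

    def create_database(self, name: str) -> Database:
        # Every database in the account uses the API selected for the account.
        database = Database(name)
        self.databases[name] = database
        return database


# Placeholder endpoint and names, used for illustration only.
account = Account(endpoint="https://example-account.invalid", api="SQL")
database = account.create_database("Store")
customers = database.create_container("Customers", throughput=400)
customers.items.append({"id": "1", "name": "Alice"})
customers.scale(1000)
print(account.api, list(account.databases), customers.throughput)
```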

Once we have a container provisioned, we can start storing data on it. One of the latest enhancements added in 2018 for this database service was the introduction of a multi-master capability. When we enable this feature, Cosmos DB allows us to write to our Cosmos DB containers in multiple regions at the same time with a latency of less than 10 milliseconds at the 99th percentile when we consume the Cosmos DB service within the Azure network. The multi-master feature makes it possible to use the provisioned throughput for databases and containers in all the available regions.

Each container will have a set of items whose names will be different based on the NoSQL database type and API. As is the case with the containers, based on the NoSQL database type, the items will be projected in a different way to the underlying data storage. The following list specifies the item name for each NoSQL database type:

  • Document: Documents
  • Graph: Vertexes and edges
  • Key/value: Entities
  • Wide-column: Rows
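The naming rules listed in this section can be collected in a small lookup, which is handy as a cheat sheet. The dictionaries below only restate the lists from this section; the function name `resource_names` is hypothetical.

```python
# API name as shown in the Azure portal -> type of NoSQL database it creates.
API_TO_DATABASE_TYPE = {
    "SQL": "Document",
    "MongoDB": "Document",
    "Cassandra": "Wide-column",
    "Azure Table": "Key/value",
    "Gremlin (graph)": "Graph",
}

# Type of NoSQL database -> name used for its containers.
CONTAINER_NAME = {
    "Document": "Collection",
    "Graph": "Graph",
    "Key/value": "Table",
    "Wide-column": "Table",
}

# Type of NoSQL database -> name used for the items in a container.
ITEM_NAME = {
    "Document": "Documents",
    "Graph": "Vertexes and edges",
    "Key/value": "Entities",
    "Wide-column": "Rows",
}


def resource_names(api: str):
    """Return (database type, container name, item name) for an account API."""
    database_type = API_TO_DATABASE_TYPE[api]
    return database_type, CONTAINER_NAME[database_type], ITEM_NAME[database_type]


for api in API_TO_DATABASE_TYPE:
    print(api, "->", resource_names(api))
```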

The following diagram shows the generalized hierarchy of elements that belong to a Cosmos DB account with the appropriate names based on the NoSQL database type on the right-hand side:

In addition, there are other container-level resources for server-side programmability that enable multi-record transactions within the partition key. We can write these resources in ECMAScript 2015 JavaScript:

  • Stored procedures
  • Triggers
  • User-defined functions, also known as UDFs

When we work with document databases, stored procedures allow us to operate on any document in the collection in which the stored procedure is defined.

We can write triggers that will be executed when specific operations are performed on a document. We can define pre-triggers, which are executed before the operation is performed; and post-triggers, which are executed after the operation is performed.
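The order in which pre-triggers and post-triggers run can be simulated with a short sketch. Real Cosmos DB triggers are written in JavaScript and run inside the database service; the Python code below only mimics the sequence of events on an in-memory list, and every name in it (`insert_with_triggers`, `normalize_name`, `record_insert`) is hypothetical.

```python
def insert_with_triggers(collection, document, pre_triggers=(), post_triggers=()):
    """Run the pre-triggers, perform the insert, and then run the post-triggers."""
    for trigger in pre_triggers:
        # A pre-trigger sees the document before it is written and may change it.
        document = trigger(document)
    collection.append(document)
    for trigger in post_triggers:
        # A post-trigger runs after the write and sees the stored document.
        trigger(collection, document)
    return document


def normalize_name(document):
    """Pre-trigger: tidy up the name before the document is persisted."""
    return {**document, "name": document["name"].strip().title()}


audit_log = []


def record_insert(collection, document):
    """Post-trigger: keep a simple audit trail of what was inserted."""
    audit_log.append(f"inserted {document['id']}; collection size is now {len(collection)}")


collection = []
stored = insert_with_triggers(
    collection,
    {"id": "1", "name": "  alice "},
    pre_triggers=[normalize_name],
    post_triggers=[record_insert],
)
print(stored)
print(audit_log)
```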

We can declare user-defined functions to extend the Cosmos DB query language's grammar and provide functions that implement custom business logic.

If a version conflict occurs on a resource for any operation, the conflicting resource will be persisted in a conflict feed within the container.

The following diagram shows the generalized container-level resources that belong to any Cosmos DB container:

The following diagram shows the generalized collection-level resources that belong to a Cosmos DB collection for a document database that uses either the SQL API or the MongoDB API:

The following diagram illustrates the way Cosmos DB projects the data stored in the ARS format to the appropriate individual item for the different supported NoSQL database types and APIs:

It is very important to understand the Cosmos DB resource model and the name used to identify each element, because we will be working with its different components throughout this book, as well as the different examples.

Understanding the system topology

Now that we understand the basics of the Cosmos DB resource model, we will analyze the system topology that is hidden behind the scenes and makes it possible to run the database service at a global scale. The following diagram illustrates the system topology, starting at a Cosmos DB account on Earth, covering up to the fault domains. At the time I was writing this book, Azure didn't have any Moon or Mars regions enabled for Cosmos DB:

As previously explained, Cosmos DB is available in many Azure regions around the world. Each Azure region has many data centers. Each data center has many big racks, known as stamps, deployed in it. The stamps are divided into fault domains that have server infrastructures.

The following diagram illustrates the system topology for each fault domain:

There are clusters with hundreds of servers deployed to many fault domains. The replica sets are deployed to many fault domains to provide an infrastructure that is highly resilient and continues working without issues when hardware failures occur. Each cluster has a database replica with the following elements:

  • Resource governor for throughput and latency guarantees
  • Transport layer for replication
  • Admission control for security (authentication and authorization)
  • Database engine to run operations, queries, and indexing

Learning about the resource hierarchy for each container

The following diagram shows the resource hierarchy for each container. For example, as previously learned, in a document database, the container is a collection:

The containers are the logical resources that are exposed to the APIs as collections, graphs, or tables. Each container has partition sets, which are composed of database replicas. The database service hosts four replicas per region. This way, whenever there are either hardware or software updates, they are completely transparent to us and we can continue working with the remaining replicas.

Resource partitions provide resource-governed coordination primitives. The following diagram shows a replica set in detail. Notice that each replica set hosts an instance of the database engine:

The database service is always online and available. The software and hardware updates on the Azure side happen under the hood for one out of four replicas per region while the remaining replicas continue working. Hence, we don't have to worry about availability due to operating system or database engine updates.
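The effect of updating one replica at a time can be shown with a tiny simulation. It assumes four replicas and a strictly sequential update, as described above; the function name `rolling_update` and the version labels are hypothetical, and the real update orchestration on the Azure side is certainly more involved.

```python
def rolling_update(replica_count=4):
    """Update one replica at a time and record how many replicas keep serving."""
    versions = ["v1"] * replica_count
    serving_counts = []
    for index in range(replica_count):
        # The replica at `index` is out of service while it is being updated;
        # the remaining replicas continue to serve requests.
        serving = [i for i in range(replica_count) if i != index]
        serving_counts.append(len(serving))
        versions[index] = "v2"
    return versions, serving_counts


versions, serving_counts = rolling_update()
print(versions)
print(serving_counts)
```

At every step of the simulated update, three of the four replicas remain in service.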

Test your knowledge

Let's see whether you can answer the following questions correctly:

  1. In a single region, Cosmos DB can provide:
    1. 99.99% (also known as four nines) of availability
    2. 99.999% (also known as five nines) of availability
    3. 99.9999% (also known as six nines) of availability
  2. In multiple regions, Cosmos DB can provide:
    1. 99.99% (also known as four nines) of availability
    2. 99.999% (also known as five nines) of availability
    3. 99.9999% (also known as six nines) of availability
  3. Which of the following database types supported by Cosmos DB can work with either the SQL API or the MongoDB API?
    1. Key/value
    2. Column-family
    3. Document
  4. The name for the container in a document database in Cosmos DB is:
    1. Key/value
    2. Document container
    3. Collection
  5. The name for the item in a document database in Cosmos DB is:
    1. Value
    2. Document
    3. Row

Summary

In this chapter, we learned about the three main features of Cosmos DB that establish pillars for supporting additional features: partitioning, replication, and resource governance. We covered the four NoSQL data models supported by Cosmos DB and saw how they relate to the five available APIs.

Then, we learned about the different elements of the Cosmos DB resource model, allowing us to have a clear understanding of how to work with this database service. We understood the system topology that provides support to Cosmos DB at a global scale and we analyzed the resource hierarchy for each container. We now know the name for each element that we will have to use to develop applications that work with Cosmos DB and to manage this innovative database service.

Now that we understand the basics of Cosmos DB, we will provision a Cosmos DB account with the SQL API and we will start working with a document database, its collections and documents, which are the topics we are going to discuss in the next chapter.