Together with our colleagues at the University of Hamburg, we — that is Felix Gessert, Wolfram Wingerath, Steffen Friedrich and Norbert Ritter — presented an overview over the NoSQL landscape at SummerSOC’16 last month. Here is the written gist.
Tags: system-design
via Pocket https://ift.tt/3wgswBq original site
June 10, 2021 at 07:09PM
Comments
from: github-actions[bot] on: 6/10/2021
NoSQL Databases: a Survey and Decision Guidance by Felix Gessert
NoSQL Databases: a Survey and Decision Guidance
[](https://medium.com/@ felix.gessert?source=post_page—–ea7823a822d——————————–)
[Felix Gessert](https://medium.com/@ felix.gessert?source=post_page—–ea7823a822d——————————–)
Aug 15, 2016 · 26 min read
(At the bottom of this page, you find a BibTeX reference to cite this article.)
Together with our colleagues at the University of Hamburg, we — that is Felix Gessert, Wolfram Wingerath, Steffen Friedrich and Norbert Ritter — presented an overview over the NoSQL landscape at SummerSOC’16 last month. Here is the written gist. We give our best to convey the condensed NoSQL knowledge we gathered building Baqend.
TL;DR
Today, data is generated and consumed at unprecedented scale. This has lead to novel approaches for scalable data management subsumed under the term “NoSQL” database systems to handle the ever-increasing data volume and request loads. However, the heterogeneity and diversity of the numerous existing systems impede the well-informed selection of a data store appropriate for a given application context. Therefore, this article gives a top-down overview of the field: Instead of contrasting the implementation specifics of individual representatives, we propose a comparative classification model that relates functional and non-functional requirements to techniques and algorithms employed in NoSQL databases. This NoSQL Toolbox allows us to derive a simple decision tree to help practitioners and researchers filter potential system candidates based on central application requirements.
1. Introduction
Traditional relational database management systems (RDBMSs) provide powerful mechanisms to store and query structured data under strong consistency and transaction guarantees and have reached an unmatched level of reliability, stability and support through decades of development. In recent years, however, the amount of useful data in some application areas has become so vast that it cannot be stored or processed by traditional database solutions. User-generated content in social networks or data retrieved from large sensor networks are only two examples of this phenomenon commonly referred to as Big Data. A class of novel data storage systems able to cope with Big Data are subsumed under the term NoSQL databases, many of which offer horizontal scalability and higher availability than relational databases by sacrificing querying capabilities and consistency guarantees. These trade-offs are pivotal for service-oriented computing and as-a-service models, since any stateful service can only be as scalable and fault-tolerant as its underlying data store.
There are dozens of NoSQL database systems and it is hard to keep track of where they excel, where they fail or even where they differ, as implementation details change quickly and feature sets evolve over time. In this article, we therefore aim to provide an overview of the NoSQL landscape by discussing employed concepts rather than system specificities and explore the requirements typically posed to NoSQL database systems, the techniques used to fulfil these requirements and the trade-offs that have to be made in the process. Our focus lies on key-value, document and wide-column stores, since these NoSQL categories cover the most relevant techniques and design decisions in the space of scalable data management.
In Section 2, we describe the most common high-level approaches towards categorizing NoSQL database systems either by their data model into key-value stores, document stores and wide-column stores or by the safety-liveness trade-offs in their design (CAP and PACELC). We then survey commonly used techniques in more detail and discuss our model of how requirements and techniques are related in Section 3 , before we give a broad overview of prominent database systems by applying our model to them in Section 4 . A simple and abstract decision model for restricting the choice of appropriate NoSQL systems based on application requirements concludes the paper in Section 5.
2. High-Level System Classification
In order to abstract from implementation details of individual NoSQL systems, high-level classification criteria can be used to group similar data stores into categories. In this section, we introduce the two most prominent approaches: data models and CAP theorem classes.
)
2.1 Different Data Models
The most commonly employed distinction between NoSQL databases is the way they store and allow access to data. Each system covered in this paper can be categorised as either key-value store, document store or wide-column store.
)
Figure 1: Key-value stores offer efficient storage and retrieval of arbitrary values.
2.1.1 Key-Value Stores. A key-value store consists of a set of key-value pairs with unique keys. Due to this simple structure, it only supports get and put operations. As the nature of the stored value is transparent to the database, pure key-value stores do not support operations beyond simple CRUD (Create, Read, Update, Delete). Key-value stores are therefore often referred to as schemaless: Any assumptions about the structure of stored data are implicitly encoded in the application logic (schema-on-read) and not explicitly defined through a data definition language (schema-on-write).
The obvious advantages of this data model lie in its simplicity. The very simple abstraction makes it easy to partition and query the data, so that the database system can achieve low latency as well as high throughput. However, if an application demands more complex operations, e.g. range queries, this data model is not powerful enough. Figure 1 illustrates how user account data and settings might be stored in a key-value store. Since queries more complex than simple lookups are not supported, data has to be analyzed inefficiently in application code to extract information like whether cookies are supported or not (cookies: false).
2.1.2 Document Stores. A document store is a key-value store that restricts values to semi-structured formats such as JSON documents. This restriction in comparison to key-value stores brings great flexibility in accessing the data. It is not only possible to fetch an entire document by its ID, but also to retrieve only parts of a document, e.g. the age of a customer, and to execute queries like aggregation, query-by-example or even full-text search.
)
Figure 2: Document stores are aware of the internal structure of the stored entity and thus can support queries.
3.1.3 Wide-Column Stores Wide-column stores inherit their name from the image that is often used to explain the underlying data model: a relational table with many sparse columns. Technically, however, a wide-column store is closer to a distributed multi-level sorted map: The first-level keys identify rows which themselves consist of key-value pairs. The first-level keys are called row keys, the second-level keys are called column keys. This storage scheme makes tables with arbitrarily many columns feasible, because there is no column key without a corresponding value. Hence, null values can be stored without any space overhead. The set of all columns is partitioned into so-called column families to colocate columns on disk that are usually accessed together. On disk, wide-column stores do not colocate all data from each row, but instead values of the same column family and from the same row. Hence, an entity (a row) cannot be retrieved by one single lookup as in a document store, but has to be joined together from the columns of all column families. However, this storage layout usually enables highly efficient data compression and makes retrieving only a portion of an entity very efficient. The data are stored in lexicographic order of their keys, so that data that are accessed together are physically co-located, given a careful key design. As all rows are distributed into contiguous ranges (so-called tablets) among different tablet servers, row scans only involve few servers and thus are very efficient.
)
Figure 3: Data in a wide-column store.
Bigtable, which pioneered the wide-column model, was specifically developed to store a large collection of webpages as illustrated in Figure 3. Every row in the webpages table corresponds to a single webpage. The row key is a concatenation of the URL components in reversed order and every column key is composed of the column family name and a column qualifier, separated by a colon. There are two column families: the “contents” column family with only one column holding the actual webpage and the “anchor” column family holding links to each webpage, each in a separate column. Every cell in the table (i.e. every value accessible by the combination of row and column key) can be versioned by timestamps or version numbers. It is important to note that much of the information of an entity lies in the keys and not only in the values .
2.2 Consistency-Availability Trade-Offs: CAP and PACELC
Another defining property of a database apart from how the data are stored and how they can be accessed is the level of consistency that is provided. Some databases are built to guarantee strong consistency and serializability (ACID), while others favour availability (BASE). This trade-off is inherent to every distributed database system and the huge number of different NoSQL systems shows that there is a wide spectrum between the two paradigms. In the following, we explain the two theorems CAP and PACELC according to which database systems can be categorised by their respective positions in this spectrum.
CAP. Like the famous FLP Theorem, the CAP Theorem, presented by Eric Brewer at PODC 2000 and later proven by Gilbert and Lynch, is one of the truly influential impossibility results in the field of distributed computing, because it places an ultimate upper bound on what can possibly be accomplished by a distributed system. It states that a sequentially consistent read/write register that eventually responds to every request cannot be realised in an asynchronous system that is prone to network partitions. In other words, it can guarantee at most two of the following three properties at the same time:
- Consistency (C): Reads and writes are always executed atomically and are strictly consistent (linearizable). Put differently, all clients have the same view on the data at all times.
- Availability (A): Every non-failing node in the system can always accept read and write requests by clients and will eventually return with a meaningful response, i.e. not with an error message.
- Partition-tolerance (P): The system upholds the previously displayed consistency guarantees and availability in the presence of message loss between the nodes or partial system failure.
Brewer argues that a system can be both available and consistent in normal operation, but in the presence of a system partition, this is not possible: If the system continues to work in spite of the partition, there is some non-failing node that has lost contact to the other nodes and thus has to decide to either continue processing client requests to preserve availability (AP, eventual consistent systems) or to reject client requests in order to uphold consistency guarantees (CP). The first option violates consistency, because it might lead to stale reads and conflicting writes, while the second option obviously sacrifices availability. There are also systems that usually are available and consistent, but fail completely when there is a partition (CA), for example single-node systems. It has been shown that the CAP-theorem holds for any consistency property that is at least as strong as causal consistency, which also includes any recency bounds on the permissible staleness of data (Δ-atomicity). Serializability as the correctness criterion of transactional isolation does not require strong consistency. However, similar to consistency, serializability can also not be achieved under network partitions.
The classification of NoSQL systems as either AP, CP or CA vaguely reflects the individual systems’ capabilities and hence is widely accepted as a means for high-level comparisons. However, it is important to note that the CAP Theorem actually does not state anything on normal operation; it merely tells us whether a system favors availability or consistency in the face of a network partition. In contrast to the FLP-Theorem, the CAP theorem assumes a failure model that allows arbitrary messages to be dropped, reordered or delayed indefinitely. Under the weaker assumption of reliable communication channels (i.e. messages always arrive but asynchronously and possibly reordered) a CAP-system is in fact possible using the Attiya, Bar-Noy, Dolev algorithm, as long as a majority of nodes are up. (Therefore, consensus as used for coordination in many NoSQL systems either natively (e.g. in Megastore) or through coordination services like Chubby and Zookeeper is even harder to achieve with high availability than strong consistency, see FLP Theorem.)
PACELC. This lack of the CAP Theorem is addressed in an article by Daniel Abadi in which he points out that the CAP Theorem fails to capture the trade-off between latency and consistency during normal operation, even though it has proven to be much more influential on the design of distributed systems than the availability-consistency trade-off in failure scenarios. He formulates PACELC which unifies both trade-offs and thus portrays the design space of distributed systems more accurately. From PACELC, we learn that in case of a Partition, there is an Availability-Consistency trade-off; Else, i.e. in normal operation, there is a Latency-Consistency trade-off.
This classification basically offers two possible choices for the partition scenario (A/C) and also two for normal operation (L/C) and thus appears more fine-grained than the CAP classification. However, many systems cannot be assigned exclusively to one single PACELC class and one of the four PACELC classes, namely PC/EL, can hardly be assigned to any system.
3. Techniques
Every significantly successful database is designed for a particular class of applications, or to achieve a specific combination of desirable system properties. The simple reason why there are so many different database systems is that it is not possible for any system to achieve all desirable properties at once. Traditional SQL databases such as PostgreSQL have been built to provide the full functional package: a very flexible data model, sophisticated querying capabilities including joins, global integrity constraints and transactional guarantees. On the other end of the design spectrum, there are key-value stores like Dynamo that scale with data and request volume and offer high read and write throughput as well as low latency, but barely any functionality apart from simple lookups.
In this section, we highlight the design space of distributed database systems, concentrating on sharding, replication, storage management and query processing. We survey the available techniques and discuss how they are related to different functional and non-functional properties (goals) of data management systems. In order to illustrate what techniques are suitable to achieve which system properties, we provide the NoSQL Toolbox (Figure 4) where each technique is connected to the functional and non-functional properties it enables (positive edges only).
Figure 4: The NoSQL Toolbox: It connects the techniques of NoSQL databases with the desired functional and non-functional system properties they support.
3.1 Sharding
Several distributed relational database systems such as Oracle RAC or IBM DB2 pureScale rely on a shared-disk architecture where all database nodes access the same central data repository (e.g. a NAS or SAN). Thus, these systems provide consistent data at all times, but are also inherently difficult to scale. In contrast, the (NoSQL) database systems focused in this paper are built upon a shared-nothing architecture, meaning each system consists of many servers with private memory and private disks that are connected through a network. Thus, high scalability in throughput and data volume is achieved by sharding (partitioning) data across different nodes (shards) in the system. There are three basic distribution techniques: range-sharding, hash-sharding and entity-group sharding. To make efficient scans possible, the data can be partitioned into ordered and contiguous value ranges by range-sharding. However, this approach requires some coordination through a master that manages assignments. To ensure elasticity, the system has to be able to detect and resolve hotspots automatically by further splitting an overburdened shard.
Range sharding is supported by wide-column stores like BigTable, HBase or Hypertable and document stores, e.g. MongoDB, RethinkDB, Espresso and DocumentDB. Another way to partition data over several machines is hash-sharding where every data item is assigned to a shard server according to some hash value built from the primary key. This approach does not require a coordinator and also guarantees the data to be evenly distributed across the shards, as long as the used hash function produces an even distribution. The obvious disadvantage, though, is that it only allows lookups and makes scans unfeasible. Hash sharding is used in key-value stores and is also available in some wide-coloumn stores like Cassandra or Azure Tables.
The shard server that is responsible for a record can be determined as serverid = hash(id)%servers, for example. However, this hashing scheme requires all records to be reassigned every time a new server joins or leaves, because it changes with the number of shard servers (servers). Consequently, it infeasible to use in elastic systems like Dynamo, Riak or Cassandra, which allow additional resources to be added on-demand and again be removed when dispensable. For increased flexibility, elastic systems typically use consistent hashing where records are not directly assigned to servers, but instead to logical partitions which are then distributed across all shard servers. Thus, only a fraction of the data have to be reassigned upon changes in the system topology. For example, an elastic system can be downsized by offloading all logical partitions residing on a particular server to other servers and then shutting down the now idle machine. For details on how consistent hashing is used in NoSQL systems, see the Dynamo paper.
Entity-group sharding is a data partitioning scheme with the goal of enabling single-partition transactions on co-located data. The partitions are called entity-groups and either explicitly declared by the application (e.g. in G-Store andMegaStore) or derived from transactions’ access patterns (e.g. in Relational Cloud and Cloud SQL Server). If a transaction accesses data that spans more than one group, data ownership can be transferred between entity-groups or the transaction manager has to fallback to more expensive multi-node transaction protocols.
3.2 Replication
In terms of CAP, conventional RDBMSs are often CA systems run in single-server mode: The entire system becomes unavailable on machine failure. And so system operators secure data integrity and availability through expensive, but reliable high-end hardware. In contrast, NoSQL systems like Dynamo, BigTable or Cassandra are designed for data and request volumes that cannot possibly be handled by one single machine, and therefore they run on clusters consisting of thousands of servers. (Low-end hardware is used, because it is substantially more cost-efficient than high-end hardware.) Since failures are inevitable and will occur frequently in any large-scale distributed system, the software has to cope with them on a daily basis . In 2009, Google fellow Jeff Dean stated that a typical new cluster at Google encounters thousands of hard drive failures, 1,000 single-machine failures, 20 rack failures and several network partitions due to expected and unexpected circumstances in its first year alone. Many more recent cases of network partitions and outages in large cloud data centers have been reported . Replication allows the system to maintain availability and durability in the face of such errors. But storing the same records on different machines (replica servers) in the cluster introduces the problem of synchronization between them and thus a trade-off between consistency on the one hand and latency and availability on the other.
Gray et al. propose a two-tier classification of different replication strategies according to when updates are propagated to replicas and where updates are accepted. There are two possible choices on tier one (“when”): Eager(synchronous) replication propagates incoming changes synchronously to all replicas before a commit can be returned to the client, whereas lazy (asynchronous) replication applies changes only at the receiving replica and passes them on asynchronously. The great advantage of eager replication is consistency among replicas, but it comes at the cost of higher write latency due to the need to wait for other replicas and impaired availability. Lazy replication is faster, because it allows replicas to diverge; as a consequence, stale data might be served. On the second tier (“where”), again, two different approaches are possible: Either a master-slave (primary copy) scheme is pursued where changes can only be accepted by one replica (the master) or, in a update anywhere (multi-master) approach, every replica can accept writes. In master-slave protocols, concurrency control is not more complex than in a distributed system without replicas, but the entire replica set becomes unavailable, as soon as the master fails. Multi-master protocols require complex mechanisms for prevention or detection and reconciliation of conflicting changes. Techniques typically used for these purposes are versioning, vector clocks, gossiping and read repair (e.g. in Dynamo) and convergent or commutative datatypes (e.g. in Riak).
Basically, all four combinations of the two-tier classification are possible. Distributed relational systems usually perform eager master-slave replication to maintain strong consistency. Eager update anywhere replication as for example featured in Google’s Megastore suffers from a heavy communication overhead generated by synchronisation and can cause distributed deadlocks which are expensive to detect. NoSQL database systems typically rely on _lazy_replication, either in combination with the master-slave (CP systems, e.g. HBase and MongoDB) or the update anywhere approach (AP systems, e.g. Dynamo and Cassandra). Many NoSQL systems leave the choice between latency and consistency to the client, i.e. for every request, the client decides whether to wait for a response from any replica to achieve minimal latency or for a certainly consistent response (by a majority of the replicas or the master) to prevent stale data.
An aspect of replication that is not covered by the two-tier scheme is the distance between replicas. The obvious advantage of placing replicas near one another is low latency, but close proximity of replicas might also reduce the positive effects on availability; for example, if two replicas of the the same data item are placed in the same rack, the data item is not available on rack failure in spite of replication. But more than the possibility of mere temporary unavailability, placing replicas nearby also bears the peril of losing all copies at once in a disaster scenario. An alternative technique for latency reduction is used in Orestes, where data is cached close to applications using web caching infrastructure and cache coherence protocols.
Geo-replication can protect the system against complete data loss and improve read latency for distributed access from clients. Eager geo-replication, as implemented in Megastore, Spanner, MDCC and Mencius achieve strong consistency at the cost of higher write latencies (typically 100ms to 600ms). With lazy geo-replication as in Dynamo, PNUTS, Walter, COPS, Cassandra and BigTable recent changes may be lost, but the system performs better and remains available during partitions. Charron-Bost et al. (Chapter 12) and Öszu and Valduriez (Chapter 13) provide a comprehensive discussion of database replication.
3.3 Storage Management
For best performance, database systems need to be optimized for the storage media they employ to serve and persist data. These are typically main memory (RAM), solid-state drives (SSDs) and spinning disk drives (HDDs) that can be used in any combination. Unlike RDBMSs in enterprise setups, distributed NoSQL databases avoid specialized shared-disk architectures in favor of shared-nothing clusters based on commodity servers (employing commodity storage media). Storage devices are typically visualized as a “storage pyramid” (see Figure 5 or Hellerstein et al.). There is also a set of transparent caches (e.g. L1-L3 CPU caches and disk buffers, not shown in the Figure), that are only implicitly leveraged through well-engineered database algorithms that promote data locality. The very different cost and performance characteristics of RAM, SSD and HDD storage and the different strategies to leverage their strengths (storage management) are one reason for the diversity of NoSQL databases. Storage management has a spatial dimension (where to store data) and a temporal dimension (when to store data). Update-in-place and append-only-IO are two complementary spatial techniques of organizing data; in-memory prescribes RAM as the location of data, whereas logging is a temporal technique that decouples main memory and persistent storage and thus provides control over when data is actually persisted.
)
Figure 5: The storage pyramid and its role in NoSQL systems.
In their seminal paper “the end of an architectural era”, Stonebraker et al. have found that in typical RDBMSs, only 6.8% of the execution time is spent on “useful work”, while the rest is spent on:
- buffer management (34.6%), i.e. caching to mitigate slower disk access
- latching (14.2%), to protect shared data structures from race conditions caused by multi-threading
- locking (16.3%), to guarantee logical isolation of transactions
- logging (1.9%), to ensure durability in the face of failures
- hand-coded optimizations (16.2%)
This motivates that large performance improvements can be expected if RAM is used as primary storage (in-memory databases). The downside are high storage costs and lack of durability — a small power outage can destroy the database state. This can be solved in two ways: The state can be replicated over n in-memory server nodes protecting against n-1 single-node failures (e.g. HStore, VoltDB) or by logging to durable storage (e.g. Redis or SAP Hana). Through logging, a random write access pattern can be transformed to a sequential one comprised of received operations and their associated properties (e.g. redo information). In most NoSQL systems, the commit rule for logging is respected, which demands every write operation that is confirmed as successful to be logged and the log to be flushed to persistent storage. In order to avoid the rotational latency of HDDs incurred by logging each operation individually, log flushes can be batched together (group commit) which slightly increases the latency of individual writes, but drastically improves throughput.
SSDs and more generally all storage devices based on NAND flash memory differ substantially from HDDs in various aspects: “(1) asymmetric speed of read and write operations, (2) no in-place overwrite — the whole block must be erased before overwriting any page in that block, and (3) limited program/erase cycles” (Min et al., 2012). Thus, a database system’s storage management must not treat SSDs and HDDs as slightly slower, persistent RAM, since random writes to an SSD are roughly an order of magnitude slower than sequential writes. Random reads, on the other hand, can be performed without any performance penalties. There are some database systems (e.g. Oracle Exadata, Aerospike) that are explicitly engineered for these performance characteristics of SSDs. In HDDs, both random reads and writes are 10–100 times slower than sequential access. Logging hence suits the strengths of SSDs and HDDs which both offer a significantly higher throughput for sequential writes.
For in-memory databases, an update-in-place access pattern is ideal: It simplifies the implementation and random writes to RAM are essentially equally fast as sequential ones, with small differences being hidden by pipelining and the CPU-cache hierarchy. However, RDBMSs and many NoSQL systems (e.g. MongoDB) employ an update-in-place update pattern for persistent storage, too. To mitigate the slow random access to persistent storage, main memory is usually used as a cache and complemented by logging to guarantee durability. In RDBMSs, this is achieved through a complex buffer pool which not only employs cache-replace algorithms appropriate for typical SQL-based access patterns, but also ensures ACID semantics. NoSQL databases have simpler buffer pools that profit from simpler queries and the lack of ACID transactions. The alternative to the buffer pool model is to leave caching to the OS through virtual memory (e.g. employed in MongoDB’s MMAP storage engine). This simplifies the database architecture, but has the downside of giving less control over which data items or pages reside in memory and when they get evicted. Also read-ahead (speculative reads) and write-behind (write buffering) transparently performed with OS buffering lack sophistication as they are based on file system logics instead of database queries.
Append-only storage (also referred to as log-structuring) tries to maximize throughput by writing sequentially. Although log-structured file systems have a long research history, append-only I/O has only recently been popularized for databases by BigTable’s use of Log-Structured Merge (LSM) trees consisting of an in-memory cache, a persistent log and immutable, periodically written storage files. LSM trees and variants like Sorted Array Merge Trees (SAMT) and Cache-Oblivious Look-ahead Arrays (COLA) have been applied in many NoSQL systems (Cassandra, CouchDB, LevelDB, Bitcask, RethinkDB, WiredTiger, RocksDB, InfluxDB, TokuDB). Designing a database to achieve maximum write performance by always writing to a log is rather simple, the difficulty lies in providing fast random and sequential reads. This requires an appropriate index structure that is either permanently updated as a copy-on-write (COW) data structure (e.g. CouchDB’s COW B-trees) or only periodically persisted as an immutable data structure (e.g. in BigTable-style systems). An issue of all log-structured storage approaches is costly garbage collection (compaction) to reclaim space of updated or deleted items.
In virtualized environments like Infrastructure-as-a-Service clouds many of the discussed characteristics of the underlying storage layer are hidden.
Table 1: A direct comparison of functional requirements, non-functional requirements and techniques among MongoDB, Redis, HBase, Riak, Cassandra and MySQL according to our NoSQL Toolbox.
3.4 Query Processing
The querying capabilities of a NoSQL database mainly follow from its distribution model, consistency guarantees and data model. Primary key lookup, i.e. retrieving data items by a unique ID, is supported by every NoSQL system, since it is compatible to range- as well as hash-partitioning. Filter queries return all items (or projections) that meet a predicate specified over the properties of data items from a single table. In their simplest form, they can be performed as filtered full-table scans. For hash-partitioned databases this implies a scatter-gather pattern where each partition performs the predicated scan and results are merged. For range-partitioned systems, any conditions on the range attribute can be exploited to select partitions.
To circumvent the inefficiencies of O(n) scans, secondary indexes can be employed. These can either be local secondary indexes that are managed in each partition or global secondary indexes that index data over all partitions. As the global index itself has to be distributed over partitions, consistent secondary index maintenance would necessitate slow and potentially unavailable commit protocols. Therefore in practice, most systems only offer eventual consistency for these indexes (e.g. Megastore, Google AppEngine Datastore, DynamoDB) or do not support them at all (e.g. HBase, Azure Tables). When executing global queries over local secondary indexes the query can only be targeted to a subset of partitions if the query predicate and the partitioning rules intersect. Otherwise, results have to be assembled through scatter-gather. For example, a user table with range-partitioning over an age field can service queries that have an equality condition on age from one partition whereas queries over names need to be evaluated at each partition. A special case of global secondary indexing is full-text search, where selected fields or complete data items are fed into either a database-internal inverted index (e.g. MongoDB) or to an external search platform such as ElasticSearch or Solr (Riak Search, DataStax Cassandra).
Query planning is the task of optimizing a query plan to minimize execution costs. For aggregations and joins, query planning is essential as these queries are very inefficient and hard to implement in application code. The wealth of literature and results on relational query processing is largely disregarded in current NoSQL systems for two reasons. First, the key-value and wide-column model are centered around CRUD and scan operations on primary keys which leave little room for query optimization. Second, most work on distributed query processing focuses on OLAP (online analytical processing) workloads that favor throughput over latency whereas single-node query optimization is not easily applicable for partitioned and replicated databases. However, it remains an open research challenge to generalize the large body of applicable query optimization techniques especially in the context of document databases. (Currently only RethinkDB can perform general Θ-joins. MongoDB’s aggregation framework has support for left-outer equi-joins in its aggregation framework and CouchDB allows joins for pre-declared map-reduce views.)
In-database analytics can be performed either natively (e.g. in MongoDB, Riak, CouchDB) or through external analytics platforms such as Hadoop, Spark and Flink (e.g. in Cassandra and HBase). The prevalent native batch analytics abstraction exposed by NoSQL systems is MapReduce. (An alternative to MapReduce are generalized data processing pipelines, where the database tries to optimize the flow of data and locality of computation based on a more declarative query language, e.g. MongoDB’s aggregation framework.) Due to I/O, communication overhead and limited execution plan optimization, these batch- and micro-batch-oriented approaches have high response times.Materialized views are an alternative with lower query response times. They are declared at design time and continuously updated on change operations (e.g. in CouchDB and Cassandra). However, similar to global secondary indexing, view consistency is usually relaxed in favor of fast, highly-available writes, when the system is distributed. As only few database systems come with built-in support for ingesting and querying unbounded streams of data,near-real-time analytics pipelines commonly implement either the Lambda Architecture or the Kappa Architecture: The former complements a batch processing framework like Hadoop MapReduce with a stream processor such as Storm (see for example Summingbird) and the latter exclusively relies on stream processing and forgoes batch processing altogether.
4. System Case Studies
In this section, we provide a qualitative comparison of some of the most prominent key-value, document and wide-column stores. We present the results in strongly condensed comparisons and refer to the documentations of the individual systems for in-detail information. The proposed NoSQL Toolbox (see Figure 4) is a means of abstraction that can be used to classify database systems along three dimensions: functional requirements, non-functional requirements and the techniques used to implement them. We argue that this classification characterizes many database systems well and thus can be used to meaningfully contrast different database systems: Table 1 shows a direct comparison of MongoDB, Redis, HBase, Riak, Cassandra and MySQL in their respective default configurations. A more verbose comparison of central system properties is presented in the large comparison Table 2 at the end of this article.
The methodology used to identify the specific system properties consists of an in-depth analysis of publicly available documentation and literature on the systems. Furthermore, some properties had to be evaluated by researching the open-source code bases, personal communication with the developers as well as a meta-analysis of reports and benchmarks by practitioners.
For detailed descriptions see the slides from our ICDE 2016 Tutorial, which goes over many details of the different NoSQL systems:
The comparison elucidates how SQL and NoSQL databases are designed to fulfill very different needs: RDBMSs provide an unmatched level of functionality whereas NoSQL databases excel on the non-functional side through scalability, availability, low latency and/or high throughput. However, there are also large differences among the NoSQL databases. Riak and Cassandra, for example, can be configured to fulfill many non-functional requirements, but are only eventually consistent and do not feature many functional capabilities apart from data analytics and, in case of Cassandra, conditional updates. MongoDB and HBase, on the other hand, offer stronger consistency and more sophisticated functional capabilities such as scan queries and — only MongoDB: — filter queries, but do not maintain read and write availability during partitions and tend to display higher read latencies. Redis, as the only non-partitioned system in this comparison apart from MySQL, shows a special set of trade-offs centered around the ability to maintain extremely high throughput at low-latency using in-memory data structures and asynchronous master-slave replication.
Figure 6: A decision tree for mapping requirements to (NoSQL) database systems.
5. Conclusions
Choosing a database system always means to choose one set of desirable properties over another. To break down the complexity of this choice, we present a binary decision tree in Figure 6 that maps trade-off decisions to example applications and potentially suitable database systems. The leaf nodes cover applications ranging from simple caching (left) to Big Data analytics (right). Naturally, this view on the problem space is not complete, but it vaguely points towards a solution for a particular data management problem.
The first split in the tree is along the access pattern of applications: They either rely on fast lookups only (left half) or require more complex querying capabilities (right half). The fast lookup applications can be distinguished further by the data volume they process: If the main memory of one single machine can hold all the data, a single-node system like Redis or Memcache probably is the best choice, depending on whether functionality (Redis) or simplicity (Memcache) is favored. If the data volume is or might grow beyond RAM capacity or is even unbounded, a multi-node system that scales horizontally might be more appropriate. The most important decision in this case is whether to favor availability (AP) or consistency (CP) as described by the CAP theorem. Systems like Cassandra and Riak can deliver an always-on experience, while systems like HBase, MongoDB and DynamoDB deliver strong consistency.
The right half of the tree covers applications requiring more complex queries than simple lookups. Here, too, we first distinguish the systems by the data volume they have to handle according to whether single-node systems are feasible (HDD-size) or distribution is required (unbounded volume). For common OLTP (online transaction processing) workloads on moderately large data volumes, traditional RDBMSs or graph databases like Neo4J are optimal, because they offer ACID semantics. If, however, availability is of the essence, distributed systems like MongoDB, CouchDB or DocumentDB are preferrable.
If the data volume exceeds the limits of a single machine, the choice of the right system depends on the prevalent query pattern: When complex queries have to be optimised for latency, as for example in social networking applications, MongoDB is very attractive, because it facilitates expressive ad-hoc queries. HBase and Cassandra are also useful in such a scenario, but excel at throughput-optimised Big Data analytics, when combined with Hadoop.
In summary, we are convinced that the proposed top-down model is an effective decision support to filter the vast amount of NoSQL database systems based on central requirements. The NoSQL Toolbox furthermore provides a mapping from functional and non-functional requirements to common implementation techniques to categorize the constantly evolving NoSQL space.
Table 2: A qualitative comparison of MongoDB, HBase, Cassandra, Riak and Redis.
Don’t want to miss our next post on NoSQL topics? Get it conveniently delivered to your inbox by joining our newsletter.
If you want to cite this article, please use this DBLP reference:
@ article{DBLP:journals/ife/GessertWFR17,
author = {Felix Gessert and
Wolfram Wingerath and
Steffen Friedrich and
Norbert Ritter},
title = {NoSQL database systems: a survey and decision guidance},
journal = {Computer Science - R{\&}D},
volume = {32},
number = {3-4},
pages = {353–365},
year = {2017},
doi = {10.1007/s00450-016-0334-3}
}