What are the 3 main reasons why relational databases have been the most popular in industry for the last 30 years?

In 2010, talk of a "big data" trend reached a fever pitch. "Big data" centers around the notion that organizations are now (or soon will be) dealing with managing and extracting information from databases that are growing into the multi-petabyte range.

This dramatic amount of data has caused developers to seek new approaches that tend to avoid SQL queries and instead process data in a distributed manner. These so-called "NoSQL" databases, such as Cassandra and MongoDB, are built to scale easily and handle massive amounts of data in a highly fluid manner.

And while I am a staunch supporter of the NoSQL approach, there is often a point where all of this data needs to be aggregated and parsed for different reasons, in a more traditional SQL data model.

It occurred to me recently that I've heard very little from the relational database (RDBMS) side of the house when it comes to dealing with big data. To that end, I recently caught up via e-mail with EnterpriseDB CEO Ed Boyajian, whose company provides services, support, and training around the open-source relational database PostgreSQL.

Boyajian stressed four points:

1. Relational databases can process ad-hoc queries

Production applications sometimes require only primary key lookups, but reporting queries often need to filter or aggregate based on other columns. Document databases and distributed key-value stores sometimes don't support this at all, or they may support it only if an index on the relevant column has been defined in advance.
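
To make the distinction concrete, here is a minimal sketch, using Python's built-in sqlite3 module and a hypothetical orders table, of the difference between a primary-key lookup and an ad-hoc reporting query that filters and aggregates on a non-key column with no index prepared in advance.

```python
# Minimal sketch (hypothetical schema) of a primary-key lookup versus an
# ad-hoc reporting query on a non-key column.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, region TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer, region, total) VALUES (?, ?, ?)",
    [("alice", "EU", 120.0), ("bob", "US", 80.0), ("carol", "EU", 45.5)],
)

# Primary-key lookup: the access pattern a key-value store also handles well.
print(conn.execute("SELECT * FROM orders WHERE id = ?", (2,)).fetchone())

# Ad-hoc reporting query filtering and aggregating on non-key columns, with
# no index defined in advance -- the planner simply falls back to a scan.
print(conn.execute(
    "SELECT region, COUNT(*), SUM(total) FROM orders "
    "WHERE total > 50 GROUP BY region"
).fetchall())
```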

2. SQL reduces development time and improves interoperability

SQL is, and will likely remain, one of the most popular and successful computer languages of all time. SQL-aware development tools, reporting tools, monitoring tools, and connectors are available for just about every combination of operating system, platform, and database under the sun, and nearly every programmer or IT professional has at least a passing familiarity with SQL syntax.

Even for the types of relatively simple queries that are likely to be practical on huge data stores, writing an SQL query is typically simpler and faster than writing an algorithm to compute the desired answer, as is often necessary for data stores that do not include a query language.
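
As a rough illustration of that point, the sketch below (made-up data, sqlite3 again) contrasts a one-line SQL aggregate with the hand-written loop an application would need against a store that offers no query language.

```python
# Rough sketch contrasting a declarative SQL aggregate with the equivalent
# aggregation an application must hand-code when there is no query language.
import sqlite3
from collections import defaultdict

readings = [("sensor-a", 21.5), ("sensor-b", 19.0), ("sensor-a", 22.5)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, temp REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", readings)

# Declarative: one statement; the database plans and runs the aggregation.
sql_avg = dict(conn.execute("SELECT sensor, AVG(temp) FROM readings GROUP BY sensor"))

# Imperative: the application re-implements grouping and averaging by hand.
sums, counts = defaultdict(float), defaultdict(int)
for sensor, temp in readings:
    sums[sensor] += temp
    counts[sensor] += 1
manual_avg = {sensor: sums[sensor] / counts[sensor] for sensor in sums}

assert sql_avg == manual_avg
```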

3. Relational databases are mature, battle-tested technology

Nearly all of the major relational databases on the market today have been around for 10 years or more and have very stable code bases. They are known to be relatively bug-free, and their failure modes are well understood. Experienced DBAs can use proven techniques to maximize uptime and be confident of successful recovery in case of failure.

4. Relational databases conform to widely accepted standards

Migrating between two relational databases isn't a walk in the park, but most of the systems available today offer broadly similar capabilities, so many applications can be migrated with fairly straightforward changes. When they can't, products and services to simplify the process are available from a variety of vendors.

Document databases and distributed key-value stores have different interfaces, offer different isolation and durability guarantees, and accept very different types of queries. Changing between such different systems promises to be challenging.

Ed also provided an amusing analogy that perhaps illustrates how the differing types of databases (RDBMS, NoSQL and everything in between) relate to each other. You be the judge.

"An RDBMS is like a car. Nearly everybody has one and you can get almost everywhere in it. A key-value store is like an Indy car. It's faster than a regular car, but it has some limitations that make it less than ideal for a trip to the grocery store. And a column-oriented database is a helicopter. It can do many of the same things that a car can do, but it's unwieldy for some things that a car can do easily, and on the flip side excels at some things that a car can't do at all."

Ultimately, users care more about the data than they do about their database. Managing and manipulating the data to meet their specific needs should always trump any specific technology approach.

What are the 3 main reasons why relational databases have been the most popular in industry for the last 30 years?

It seems like a question a child would ask: “Why are things the way they are?”

It is tempting to answer, “because that’s the way things have always been.” But that would be a mistake. Every tool, system, and practice we encounter was designed at some point in time. They were made in particular ways for particular reasons. And those designs often persist like relics long after the rationale behind them has disappeared. They live on – sometimes for better, sometimes for worse.

A famous example is the QWERTY keyboard, devised by inventor Christopher Latham Sholes in the 1870s. According to the common account, Sholes’s intent with the QWERTY layout was not to make typists faster but to slow them down, as the levers in early typewriters were prone to jam. In a way it was an optimization. A slower typist who never jammed would produce more than a faster one who did.

New generations of typewriters soon eliminated the jamming that plagued earlier models. But the old QWERTY layout remained dominant over the years despite the efforts of countless would-be reformers.

It’s a classic example of a network effect at work. Once sufficient numbers of people adopted QWERTY, their habits reinforced themselves. Typists expected QWERTY, and manufacturers made more QWERTY keyboards to fulfill the demand. The more QWERTY keyboards manufacturers created, the more people learned to type on a QWERTY keyboard and the stronger the network effect became.

Psychology also played a role. We’re primed to like familiar things. Sayings like “better the devil you know” and “If it ain’t broke, don’t fix it,” reflect a principle called the Mere Exposure effect, which states that we tend to gravitate to things we’ve experienced before simply because we’ve experienced them. Researchers have found this principle extends to all aspects of life: the shapes we find attractive, the speech we find pleasant, the geography we find comfortable. The keyboard we like to type on.

To that list I would add the software designs we use to build applications. Software is flexible. It ought to evolve with the times. But it doesn’t always. We are still designing infrastructure for the hardware that existed decades ago, and in some places the strain is starting to show.

The Rise And Fall Of Hadoop

Hadoop offers a good example of how this process plays out. Hadoop, you may recall, is an open-source framework for distributed computing based on white papers published by Google in the early 2000s. At the time, RAM was relatively expensive, magnetic disks were the main storage medium, network bandwidth was limited, files and datasets were large, and it was more efficient to bring compute to the data than the other way around. On top of that, Hadoop expected servers to live in a certain place – in a particular rack or data center.

A key innovation of Hadoop was the use of commodity hardware rather than specialized, enterprise-grade servers. That remains the rule today. But between the time Hadoop was designed and the time it was deployed in real-world applications, other ‘facts on the ground’ changed. Spinning disks gave way to SSD flash memory. The price of RAM decreased and RAM capacity increased exponentially. Dedicated servers were replaced with virtualized instances. Network throughput expanded. Software began moving to the cloud.

To give some idea of the pace of change, in 2003 a typical server would have boasted 2 GB of RAM and a 50 GB hard drive operating at 100 MB/sec, and the network connection could transfer 1 Gb/sec. By 2013, when Hadoop came to market, the server would have 32 GB of RAM, a 2 TB hard drive transferring data at 150 MB/sec, and a network that could move 10 Gb/sec.

Hadoop was built for a world that no longer existed, and its architecture was already deprecated by the time it came to market. Developers quickly left it behind and moved to Spark (2009), Impala (2013), and Presto (2013) instead. In that short time, Hadoop spawned several public companies and received breathless press. It made a substantial – albeit brief – impact on the tech industry even though, by the time it was most famous, it was already obsolete.

Hadoop was conceived, developed, and abandoned within a decade as hardware evolved out from under it. So it might seem incredible that software could last fifty years without significant change, and that a design conceived in the era of mainframes and green-screen monitors could still be with us today. Yet that’s exactly what we see with relational databases.

The Uncanny Persistence Of RDBMS

The persistence in question is that of the Relational Database Management System, or RDBMS for short. By technological standards, RDBMS design is quite old, much older than Hadoop, originating in the 1970s and 1980s. The relational database predates the Internet. It comes from a time before widespread networking, before cheap storage, before the ability to spread workloads across multiple machines, before widespread use of virtual machines, and before the cloud.

To put the age of RDBMS in perspective, the popular open-source PostgreSQL was first released in 1995, before Google existed, and it is built on top of a project that started around 1986. So this design is really old. The ideas behind it made sense at the time, but many things have changed since then, including the hardware, the use cases, and the very topology of the network.

Here again, the core design of RDBMS assumes that throughput is low, RAM is expensive, and large disks are cost-prohibitive and slow.

Given those factors, RDBMS designers came to certain conclusions. They decided storage and compute should be concentrated in one place with specialized hardware and a great deal of RAM. They also realized it would be more efficient for the client to communicate with a remote server than to store and process results locally.

RDBMS architectures today still embody these old assumptions about the underlying hardware. The trouble is those assumptions aren’t true anymore. RAM is cheaper than anyone in the 1960s could have imagined. Flash SSDs are inexpensive and incredibly responsive, with latency of around 50 microseconds, compared with roughly 10 milliseconds for the old spinning disks. Network latency hasn’t changed as much – still around 1 millisecond – but bandwidth is 100 times greater.

The result is that even now, in the age of containers, microservices, and the cloud, most RDBMS architectures treat the cloud as a virtual datacenter. And that’s not just a charming reminder of the past. It has serious implications for database cost and performance. Both are much worse than they need to be because they are subject to design decisions made 50 years ago in the mainframe era.

Obsolete Assumption: Databases Need Reliable Storage

One of the reasons relational databases are slower than their NoSQL counterparts is that they invest heavily in keeping data safe. For instance, they avoid caching at the disk layer and enforce ACID semantics, writing to disk immediately and holding other requests until the current one has finished. The underlying assumption is that with these precautions in place, if problems crop up, the administrator can always take the disk to forensics and recover the missing data.

But there’s little need for that now – at least with databases operating in the cloud. Take Amazon Web Services as an example. Its standard Elastic Block Storage system makes backups automatically and replicates freely. Traditional RDBMS architectures assume they are running on a single server with a single point of storage failure, so they go to great lengths to ensure data is stored correctly. But when you’re running multiple servers in the cloud – as you do – if there’s a problem with one you just fail over to one of the healthy servers.

RDBMSs go to great lengths to support data durability. But with the modern preference for instant failover, all that effort is wasted. These days you’ll fail over to a replicated server instead of waiting a day to bring the one that crashed back online. Yet RDBMSs persist in putting redundancy on top of redundancy. Business and technical requirements often demand this capability even though it’s no longer needed – a good example of how practices and expectations can reinforce obsolete design patterns.
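
As a hedged sketch of the trade-off described above (not a recommendation), the example below relaxes per-transaction disk durability in PostgreSQL through its synchronous_commit setting, on the assumption that replication and failover already protect the data. It assumes the psycopg2 driver and a reachable server; the connection string and the events table are hypothetical.

```python
# Sketch: relax per-transaction durability when replication/failover, not the
# local disk, is the real safety net. Connection details and the "events"
# table are hypothetical; requires a PostgreSQL server and psycopg2.
import psycopg2

conn = psycopg2.connect("dbname=app user=app host=localhost")
with conn.cursor() as cur:
    # With synchronous_commit off, COMMIT returns before the WAL is flushed
    # to disk, trading a small window of possible loss for lower latency.
    cur.execute("SET synchronous_commit TO off")
    cur.execute("INSERT INTO events (payload) VALUES (%s)", ("example",))
conn.commit()
conn.close()
```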

Obsolete Assumption: Your Storage Is Slower Than Your Network

The client/server model made a lot of sense in the pre-cloud era. If your network was relatively fast (which it was) and your disk was relatively slow (which it also was), it was better to run hot data on a tricked-out, specialized server that received queries from remote clients.

For that reason, relational databases originally assumed they had reliable physical disks attached. Once this equation changed, and local SSDs could find data faster than it could be moved over the network, it made more sense for applications to read data locally. But at the moment we can’t do this, because it’s not how databases work.

This makes it very difficult to scale an RDBMS, even with relatively small datasets, and makes performance with large datasets much worse than it would be with local drives. This in turn makes solutions more complex and expensive, for instance by requiring a caching layer to deliver speed that could be obtained more cheaply and easily with fast local storage.

Obsolete Assumption: RAM Is Scarce

RAM used to be very expensive. Only specialized servers had lots of it, so that is what databases ran on. Much of classic RDBMS design revolved around moving data between disk and RAM.

But here again, the cloud makes that a moot point. AWS gives you tremendous amounts of RAM for a pittance. But most people running traditional databases can’t actually use it. It’s not uncommon to see application servers with 8 GB of RAM, while the software running on them can only access 1 GB, which means roughly 90 percent of the capacity is wasted.

That matters because there’s a lot you can do with RAM. Databases don’t only store data. They also do processing jobs. If you have a lot of RAM on the client, you can use it for caching, or you can use it to hold replicas, which can do a lot of the processing normally done on the server side. But you don’t do any of that right now because it violates the design of RDBMS.
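
As a small sketch of that idea, the example below uses spare client-side RAM as a read cache in front of a relational database. The in-memory sqlite3 table and the get_user helper are illustrative stand-ins for a remote RDBMS and whatever data-access function an application actually uses.

```python
# Sketch of using spare client-side RAM as a read cache. The in-memory
# sqlite3 data and get_user() are illustrative, not a prescribed design.
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

@lru_cache(maxsize=100_000)  # bounded by client RAM rather than server RAM
def get_user(user_id: int):
    # Only cache misses reach the database; repeat lookups are served locally.
    return conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()

print(get_user(1))  # hits the database
print(get_user(1))  # served from client-side RAM
```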

How (And Why) To Smash A Relic

Saving energy takes energy. But software developers often choose not to spend it. After all, as the inventor of Perl liked to say, laziness is one of the three great virtues of a programmer. We’d rather build on top of existing knowledge than invent new systems from scratch.

But there is a cost to taking design principles for granted, even in a technology as foundational as the RDBMS. We like to think that technology always advances. RDBMS reminds us that some patterns persist because of inertia. They become so familiar we don’t see them anymore. They are relics hiding in plain sight.

Once you do spot them, the question is what to do about them. Some things persist for a reason. Maturity does matter. You need to put on your accountant’s hat and do a hard-headed ROI analysis. If your design is based on outdated assumptions, is it holding you back? Is it costing you more money than it would take to modernize? Could you actually achieve a positive return?

It’s a real possibility. Amazon created a whole new product – the Aurora database – by rethinking the core assumptions behind RDBMS storage abstraction.

You might not go that far. But where there is at least a prospect of positive ROI, change becomes strategic. And that’s your best sign that tearing down your own design is worth the cost of building something new in its place.

Avishai Ish-Shalom is developer advocate at ScyllaDB.

Oracle brought the first commercial relational database to market in 1979, followed by DB2, SAP Sybase ASE, and Informix. In the 1980s and '90s, relational databases grew increasingly dominant, delivering rich indexes to make any query efficient.

What are the 3 basic relationships in a relational database?

There are 3 basic types of relationships in a relational database: one-to-one, one-to-many, and many-to-many.
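
As a minimal schema sketch, assuming hypothetical tables and using Python's built-in sqlite3 module, here is one common way each of the three relationship types is modeled:

```python
# Illustrative schema (hypothetical tables) for the three relationship types.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One-to-one: each user has at most one profile, enforced by a UNIQUE FK.
CREATE TABLE users    (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE profiles (id INTEGER PRIMARY KEY,
                       user_id INTEGER UNIQUE REFERENCES users(id),
                       bio TEXT);

-- One-to-many: one user can place many orders.
CREATE TABLE orders   (id INTEGER PRIMARY KEY,
                       user_id INTEGER REFERENCES users(id),
                       total REAL);

-- Many-to-many: users and groups linked through a junction table.
CREATE TABLE groups      (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE memberships (user_id  INTEGER REFERENCES users(id),
                          group_id INTEGER REFERENCES groups(id),
                          PRIMARY KEY (user_id, group_id));
""")
```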

Why do companies use relational databases?

A relational database model ensures that all users always see the same data. This improves understanding across a business, because everyone works from the same information, and it means nobody makes business decisions based on out-of-date information.

Why relational databases are important in modern business?

A relational database's main benefit is the ability to connect data from different tables to create useful information. This approach helps organizations of all sizes and industries decipher relationships between different sets of data from various departments and turn them into meaningful insights.
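
For a brief illustration of that benefit, the sketch below joins two hypothetical tables (again via Python's sqlite3) to produce combined information that neither table holds on its own:

```python
# Brief sketch of joining two tables to produce combined information
# (hypothetical data, using sqlite3 for a self-contained example).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE employees   (id INTEGER PRIMARY KEY, name TEXT,
                          dept_id INTEGER REFERENCES departments(id));
INSERT INTO departments VALUES (1, 'Sales'), (2, 'Engineering');
INSERT INTO employees   VALUES (1, 'Alice', 2), (2, 'Bob', 1);
""")

# The JOIN connects rows from the two tables on the shared key.
for row in conn.execute("""
    SELECT e.name, d.name
    FROM employees e JOIN departments d ON d.id = e.dept_id
"""):
    print(row)
```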