The Story of Apache HBase: From Humble Beginnings to Worldwide Ubiquity

Apache HBase has cemented itself as one of the most versatile open source NoSQL databases in the world, but its origins date back to 2005 within a small startup trying to transform the search industry.

HBase was created out of necessity – to meet the scalability demands of a fledgling company named Powerset that aimed to revolutionize search through breakthroughs in natural language processing and semantic intelligence.

Powerset's vision was to enable users to query search engines naturally using full sentences and conversational questions rather than just keywords. This posed immense data scalability challenges in a pre-cloud era dominated by traditional relational databases.

To make this possible, its engineers had to build an entirely custom platform that could store and make sense of a firehose of unstructured natural language data.

They turned to Google's published Bigtable paper for inspiration. Much like Google's need to manage web pages and clicks, Powerset needed a flexible, distributed database that could scale massively across commodity servers.

And thus HBase was born in 2006 – an open source, column-oriented database architected for the cloud before cloud was even a concept.
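
To make the "column-oriented" idea a little more concrete, here is a minimal Python sketch of the data model the Bigtable paper describes: a sparse, sorted map indexed by row key, column key, and timestamp. The class ToyTable and its method names are hypothetical, chosen only for illustration; this is not the HBase client API, and the sketch ignores distribution, persistence, and column-family configuration entirely.

```python
class ToyTable:
    """Toy single-process table: (row key, "family:qualifier", timestamp) -> value.

    Hypothetical illustration only; not how HBase is implemented.
    """

    def __init__(self):
        # row key -> {column -> [(timestamp, value), ...], newest first}
        self._rows = {}

    def put(self, row, column, value, timestamp):
        """Store one versioned cell. Rows are sparse: only written columns exist."""
        versions = self._rows.setdefault(row, {}).setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, row, column):
        """Return the newest value for a cell, or None if it was never written."""
        versions = self._rows.get(row, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, start_row, stop_row):
        """Yield (row, latest cells) for start_row <= row < stop_row, in key order."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                cells = {col: vs[0][1] for col, vs in self._rows[row].items()}
                yield row, cells


table = ToyTable()
table.put("com.example/a", "content:html", "<v1>", timestamp=1)
table.put("com.example/a", "content:html", "<v2>", timestamp=2)
table.put("com.example/b", "anchor:text", "hi", timestamp=1)

for row, cells in table.scan("com.example/", "com.example0"):
    print(row, cells)
```

Because rows are kept in key order, related rows (here, pages from one site) sit next to each other and can be read with a single range scan. That ordering is also what lets a real system cut a table into contiguous key ranges and spread them across servers.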

From Closed Source to Open Sourced

In the beginning, HBase was entirely proprietary software running on over 500 servers that supported 30 TB of Powerset's ever-growing linguistic data.

When Microsoft acquired Powerset for over $100 million in 2008, the original creators made an insightful decision. Instead of keeping HBase restricted, they open sourced the project to let the broader developer ecosystem benefit from their pioneering work.

This decision marked the start of HBase's journey to ubiquity. It became a Hadoop subproject in 2008 and graduated to a top-level project under the Apache Software Foundation in 2010, ensuring its longevity and community-driven progress for years to come.

But at that time, HBase was still raw, lacking real-world hardening and missing critical features expected of an enterprise-grade database. Transforming from a closed source research prototype to a robust open source project required significant work.

Early Growing Pains

With an eager community of developers and emerging startups now kicking the tires on HBase, its architectural and operational shortcomings became evident.

The early releases were notoriously unstable for production environments. Real-world workloads exposed issues around memory management, compaction stalls, node failures, and more.

Several promising startups actually crashed HBase clusters completely by running operational loads the database wasn't fully prepared for at the time. This delivered wake-up calls to the committer community.

In response, the next few years focused intently on augmenting HBase into a hardened database truly ready for primetime. The community rallied to add capabilities expected of any serious database – snapshotting, access controls, change data capture, and vast improvements to overall system reliability.

Maturing Through Real-World Pressure

There was no replacement for real-world pressure in accelerating HBase's maturity through the crucial late 2000s and early 2010s.

Major early adopters like Facebook, Yahoo, and Adobe placed big bets on HBase to handle massive production workloads. Their breakthrough applications organically exposed holes that had to be plugged.

Facebook initially adopted HBase to store messaging data across 75 million active users on its fledgling platform. But scalability bottlenecks arose as data grew exponentially, forcing a later migration to other systems.

Yahoo chose HBase in 2012 to be the foundational database driving its web analytics pipeline. This generated hundreds of thousands of operations per second across a mammoth cluster that provided insights into Yahoo's content consumption.

Adobe leveraged HBase heavily to create a customer profile management repository, a critical system of record for all audience data and attributes, as part of its customer intelligence cloud.

These pioneering customers at enormous scale really put HBase through the wringer – pinpointing shortcomings, sharing feedback and contributing improvements to the ecosystem.

Over time, HBase's architecture and tooling evolved significantly to comfortably sustain such workloads for the long haul. Innovations included native backup tools, optimized compaction, built-in replication, automatic sharding of tables, rack awareness and more.

Mature 1.0 Release But Competition Looms

Nearly a decade after initial work started, HBase finally hit the major 1.0 milestone in 2015. This release reflected its graduation from early-stage NoSQL database to serious enterprise-grade contender.

Architecturally, HBase reached a level of stability on par with leading databases. Capabilities expected of any transactional system were now comprehensive – high availability, disaster recovery, access controls, encryption, consistent indexing.

The external ecosystem flourished in sync, with commercial vendors extending HBase for operational BI uses, hardware vendors preconfiguring HBase appliances, and specialized support firms providing 24×7 managed services. Adoption also accelerated within Asia, with Alibaba and Xiaomi running some of the largest known clusters.

However, the database market evolved feverishly alongside HBase's own ascent. Competition emerged from document stores like MongoDB focused on developer velocity. Cassandra offered blazing fast write throughput and operational simplicity. Eventually, managed cloud services like DynamoDB provided turnkey NoSQL.

So while HBase's growth trajectory kept soaring, it still had to fight for mindshare amidst the crowded Big Data technology landscape now full of compelling alternatives pitching differentiated capabilities.

Streamlined Architecture for the Cloud Native Era

In response, recent years have focused on reinventing HBase for the new world order – one dominated by Kubernetes, containers, and a vast open source data technology menagerie.

The flagship HBase 2.0 release in 2018 emphasized architectural enhancements for easier modernization. It introduced cleaner separation between computation and storage for improved flexibility and operational simplicity.

HBase 3.0 built further upon this foundation to radically overhaul the storage layer, allowing for running nodes as containers and having storage separately managed. Online cluster rebalancing was made seamless through master-mobility architecture. Dramatic improvements reduced maintenance overhead via the HBCK2 cluster repair tool.

These changes future-proof HBase for the cloud native world demanding pluggability with adjacent data systems, ease of deployment on containers, storage dissociation from computing, and efficient resource elasticity.

Still Going Strong In Its Teens

Well into its teens, HBase continues enjoying broad adoption today from thousands of organizations, including marquee names like Facebook, Netflix, Yahoo, Adobe, Flipkart, Didi Chuxing and Xiaomi running enormous clusters handling billions of rows.

After years of hardening, stabilization and architectural advances, HBase prevails as a highly versatile open source database uniquely bridging OLTP and OLAP systems for modern big data apps – proving its resilience despite maturation of the wider NoSQL ecosystem.

Just like the natural language search capabilities it helped realize, HBase's historical significance rests in pioneering big data databases as we know them today, effectively demonstrating how NoSQL systems running on commodity hardware can transform scalability economics.

As one of the most proven, battle-tested open source databases around, HBase's best years still likely lie ahead as the world's data explosion continues.