Megastore: Providing Scalable, Highly Available Storage for Interactive Services


INTRODUCTION

Megastore blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS. It uses synchronous replication to achieve high availability and a consistent view of the data. It provides fully serializable ACID semantics over distant replicas with low enough latencies to support interactive applications.

We accomplish this by taking a middle ground in the RDBMS vs. NoSQL design space: we partition the data-store and replicate each partition separately, providing full ACID semantics within partitions, but only limited consistency guarantees across them.

We believe Megastore is the largest deployed system that uses Paxos to replicate primary user data across datacenters on every write. Megastore has been widely deployed within Google for several years. It handles more than three billion write and 20 billion read transactions daily and stores nearly a petabyte of primary data across many global datacenters.

The key contributions of this paper are:

  1. the design of a data model and storage system that allows rapid development of interactive applications where high availability and scalability are built-in from the start;
  2. an implementation of the Paxos replication and consensus algorithm optimized for low-latency operation across geographically distributed datacenters to provide high availability for the system;
  3. a report on our experience with a large-scale deployment of Megastore at Google.

2 Toward Availability and Scale

  • For availability, we implemented a synchronous, fault-tolerant log replicator optimized for long-distance links.
  • For scale, we partitioned data into a vast space of small databases, each with its own replicated log stored in a per-replica NoSQL datastore.

2.1 Replication

Common strategies for wide-area replication:

  • Asynchronous master/slave: a master node replicates write-ahead log entries to at least one slave. Log appends are acknowledged at the master in parallel with transmission to slaves. The master can support fast ACID transactions but risks downtime or data loss during failover to a slave. A consensus protocol is required to mediate mastership.
  • Synchronous master/slave: a master waits for changes to be mirrored to slaves before acknowledging them, allowing failover without data loss. Master and slave failures require timely detection by an external system.
  • Optimistic replication: any member of a homogeneous replica group can accept mutations, which are asynchronously propagated through the group. Availability and latency are excellent. However, the global mutation ordering is not known at commit time, so transactions are impossible.

2.2 Partitioning and Locality

2.2.1 Entity Groups

To scale throughput and localize outages, we partition our data into a collection of entity groups, each independently and synchronously replicated over a wide area. The underlying data is stored in a scalable NoSQL datastore in each datacenter (see Figure 1).

Entities within an entity group are mutated with single-phase ACID transactions (for which the commit record is replicated via Paxos). Operations across entity groups could rely on expensive two-phase commit, but typically leverage Megastore's asynchronous messaging. A transaction in a sending entity group places one or more messages in a queue; transactions in receiving entity groups atomically consume those messages and apply the ensuing mutations.

Note that we use asynchronous messaging between logically distant entity groups, not between physically distant replicas. All network traffic between datacenters comes from replicated operations, which are synchronous and consistent.

Indexes local to an entity group obey ACID semantics; those across entity groups have looser consistency. See Figure 2 for the various operations on and between entity groups.

2.2.2 Selecting Entity Group Boundaries

The entity group defines the a priori grouping of data for fast operations. Boundaries that are too fine-grained force excessive cross-group operations, but placing too much unrelated data in a single group serializes unrelated writes, which degrades throughput.

The following examples show ways applications can work within these constraints:

  • Email
  • Blogs
  • Maps

2.2.3 Physical Layout

We use Google's Bigtable for scalable fault-tolerant storage within a single datacenter, allowing us to support arbitrary read and write throughput by spreading operations across multiple rows.

We minimize latency and maximize throughput by letting applications control the placement of data: through the selection of Bigtable instances and specification of locality within an instance.

To minimize latency, applications try to keep data near users and replicas near each other. They assign each entity group to the region or continent from which it is accessed most. Within that region, they assign a triplet or quintuplet of replicas to datacenters with isolated failure domains.

3 A Tour of Megastore

Megastore maps this architecture onto a feature set carefully chosen to encourage the rapid development of scalable applications. This section motivates the tradeoffs and describes the developer-facing features that result.

3.1 API Design Philosophy

ACID transactions simplify reasoning about correctness, but it is equally important to be able to reason about performance. Megastore emphasizes cost-transparent APIs with runtime costs that match application developers' intuitions.

Normalized relational schemas rely on joins at query time to service user operations. This is not the right model for Megastore applications for several reasons:

  • High-volume interactive workloads benefit more from predictable performance than from an expressive query language.
  • Reads dominate writes in our target applications, so it pays to move work from read time to write time.
  • Storing and querying hierarchical data is straightforward in key-value stores like Bigtable.

With this in mind, we designed a data model and schema language to offer fine-grained control over physical locality. Hierarchical layouts and declarative denormalization help eliminate the need for most joins. Queries specify scans or lookups against particular tables and indexes.

Joins, when required, are implemented in application code. We provide an implementation of the merge phase of the merge join algorithm, in which the user provides multiple queries that return primary keys for the same table in the same order; we then return the intersection of keys for all the provided queries.
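
As a concrete illustration of that merge phase, here is a minimal Python sketch (not Megastore's actual implementation or API): each input stream yields primary keys for the same table in ascending order, and the intersection is produced in a single pass.

```python
from typing import Iterable, Iterator, List, Tuple

def merge_intersection(sorted_key_streams: List[Iterable[Tuple]]) -> Iterator[Tuple]:
    """Yield the primary keys that appear in every input stream.

    Each stream must yield keys for the same table in ascending order,
    mirroring the contract placed on the user-provided queries.
    """
    iterators = [iter(stream) for stream in sorted_key_streams]
    if not iterators:
        return
    sentinel = object()
    current = [next(it, sentinel) for it in iterators]
    while sentinel not in current:
        largest = max(current)
        if all(key == largest for key in current):
            yield largest                               # key present in every stream
            current = [next(it, sentinel) for it in iterators]
        else:
            # advance only the streams that are still behind the largest key
            current = [key if key >= largest else next(it, sentinel)
                       for key, it in zip(current, iterators)]

# Example: two queries over the Photo table returning (user_id, photo_id) keys.
common = list(merge_intersection([[(1, 10), (1, 12), (1, 15)],
                                  [(1, 12), (1, 15), (1, 20)]]))
assert common == [(1, 12), (1, 15)]
```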

We also have applications that implement outer joins with parallel queries. This typically involves an index lookup followed by parallel index lookups using the results of the initial lookup. We have found that when the secondary index lookups are done in parallel and the number of results from the first lookup is reasonably small, this provides an effective stand-in for SQL-style joins.
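
A hedged sketch of that pattern follows, using Python's concurrent.futures; `initial_lookup` and `secondary_lookup` are hypothetical placeholders for the application's own index queries, not Megastore calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple

def outer_join_via_parallel_lookups(
    initial_lookup: Callable[[object], List[object]],
    secondary_lookup: Callable[[object], Optional[object]],
    key: object,
) -> List[Tuple[object, Optional[object]]]:
    """Index lookup, then parallel secondary lookups over its results."""
    left_rows = initial_lookup(key)                 # e.g. one index scan
    with ThreadPoolExecutor(max_workers=8) as pool:
        # fan out one secondary index lookup per result of the first query
        right_rows = list(pool.map(secondary_lookup, left_rows))
    # outer join: keep every left row, even when no match was found (None)
    return list(zip(left_rows, right_rows))
```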

3.2 Data Model

Megastore defines a data model that lies between the abstract tuples of an RDBMS and the concrete row-column storage of NoSQL. As in an RDBMS, the data model is declared in a schema and is strongly typed. Each schema has a set of tables, each containing a set of entities, which in turn contain a set of properties. Properties are named and typed values. The types can be strings, various flavors of numbers, or Google's Protocol Buffers. They can be required, optional, or repeated (allowing a list of values in a single property). All entities in a table have the same set of allowable properties. A sequence of properties is used to form the primary key of the entity, and the primary keys must be unique within the table. Figure 3 shows an example schema for a simple photo storage application.

Megastore tables are either entity group root tables or child tables. Each child table must declare a single distinguished foreign key referencing a root table, illustrated by the ENTITY GROUP KEY annotation in Figure 3. Thus each child entity references a particular entity in its root table (called the root entity). An entity group consists of a root entity along with all entities in child tables that reference it. A Megastore instance can have several root tables, resulting in different classes of entity groups.

In the example schema of Figure 3, each user's photo collection is a separate entity group. The root entity is the User, and the photos are child entities. Note that the Photo.tag field is repeated, allowing multiple tags per photo without the need for a sub-table.
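
To make the root/child relationship concrete, the sketch below models the Figure 3 entities in plain Python. The user_id, photo_id, time, and repeated tag properties follow the text; the User.name field and the sample values are illustrative assumptions, and this is not Megastore's schema language.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class User:
    """Entity group root: one User entity per photo collection."""
    user_id: int                 # primary key, and the entity group key
    name: str                    # illustrative extra property

@dataclass
class Photo:
    """Child entity whose ENTITY GROUP KEY (user_id) references a User root."""
    user_id: int                 # foreign key to the root; first primary key component
    photo_id: int                # remaining primary key component
    time: int                    # used by the PhotosByTime local index
    tag: List[str] = field(default_factory=list)  # repeated property: many tags, no sub-table

# One entity group = one User root plus every Photo that references it.
alice = User(user_id=1, name="Alice")
alices_photos = [
    Photo(user_id=1, photo_id=10, time=1_700_000_000, tag=["beach", "sunset"]),
    Photo(user_id=1, photo_id=11, time=1_700_000_100, tag=["family"]),
]
```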

3.2.1 Pre-Joining with Keys

While traditional relational modeling recommends that all primary keys take surrogate values, Megastore keys are chosen to cluster entities that will be read together. Each entity is mapped into a single Bigtable row; the primary key values are concatenated to form the Bigtable row key, and each remaining property occupies its own Bigtable column.

Note how the Photo and User tables in Figure 3 share a common user_id key prefix. The IN TABLE User directive instructs Megastore to colocate these two tables into the same Bigtable, and the key ordering ensures that Photo entities are stored adjacent to the corresponding User. This mechanism can be applied recursively to speed queries along arbitrary join depths. Thus, users can force a hierarchical layout by manipulating the key order.
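
A hedged sketch of this key concatenation, assuming a simple big-endian encoding for integer key components (the real encoding is not described in this note): because Photo keys start with the same user_id bytes, Photo rows sort immediately after their User row.

```python
def row_key(*components) -> bytes:
    """Illustrative order-preserving encoding of a composite primary key.

    Integer components are packed big-endian so that byte-wise row-key order
    matches numeric key order.
    """
    return b"".join(c.to_bytes(8, "big") if isinstance(c, int) else c.encode("utf-8")
                    for c in components)

# User and Photo rows share the user_id prefix, so within the same Bigtable
# a User row sorts immediately before its own Photo rows:
user_row  = row_key(1)        # User(user_id=1)
photo_row = row_key(1, 10)    # Photo(user_id=1, photo_id=10)
assert photo_row.startswith(user_row) and user_row < photo_row
```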

Schemas declare keys to be sorted ascending or descending, or avert sorting altogether: the SCATTER attribute instructs Megastore to prepend a two-byte hash to each key. Encoding monotonically increasing keys this way prevents hotspots in large data sets that span Bigtable servers.
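
A minimal sketch of that SCATTER behavior; the choice of hash function is an assumption, since only the two-byte prefix is specified in the text.

```python
import hashlib

def scattered_key(key: bytes) -> bytes:
    """Prepend a two-byte hash of the key so monotonically increasing keys
    spread across Bigtable servers instead of piling onto a single tablet.

    The concrete hash function here is an assumption.
    """
    return hashlib.sha256(key).digest()[:2] + key

# Consecutive timestamps no longer produce adjacent row keys:
k1 = scattered_key((1_700_000_000).to_bytes(8, "big"))
k2 = scattered_key((1_700_000_001).to_bytes(8, "big"))
```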

3.2.2 Indexes

Secondary indexes can be declared on any list of entity properties, as well as fields within protocol buffers. We distinguish between two high-level classes of indexes: local and global (see Figure 2). A local index is treated as a separate index for each entity group. It is used to find data within an entity group. In Figure 3, PhotosByTime is an example of a local index. The index entries are stored in the entity group and are updated atomically and consistently with the primary entity data.

A global index spans entity groups. It is used to find entities without knowing in advance the entity groups that contain them. The PhotosByTag index in Figure 3 is global and enables discovery of photos marked with a given tag, regardless of owner. Global index scans can read data owned by many entity groups but are not guaranteed to reflect all recent updates.
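
A hypothetical sketch of how an application might resolve such a global-index lookup; `global_index` and `fetch_photo` are stand-ins for the storage layer, not Megastore calls, and the re-check reflects the staleness caveat above.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

def photos_by_tag(
    tag: str,
    global_index: Dict[str, List[Tuple[int, int]]],
    fetch_photo: Callable[[int, int], Optional[dict]],
) -> Iterator[dict]:
    """Resolve a global-index lookup into entities across entity groups.

    global_index maps a tag to (user_id, photo_id) primary keys across all
    owners; fetch_photo reads one Photo from its owning entity group.
    Because the index spans entity groups, its entries may lag recent
    updates, so each fetched entity is re-checked before being returned.
    """
    for user_id, photo_id in global_index.get(tag, []):
        photo = fetch_photo(user_id, photo_id)
        if photo is not None and tag in photo.get("tag", []):
            yield photo
```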

Megastore offers additional indexing features:

  • Storing clause (akin to a materialized view)
  • Repeated indexes
  • Inline indexes

3.2.3 Mapping to Bigtable

The Bigtable column name is a concatenation of the Megastore table name and the property name, allowing entities from different Megastore tables to be mapped into the same Bigtable row without collision. Figure 4 shows how data from the example photo application might look in Bigtable.

Within the Bigtable row for a root entity, we store the transaction and replication metadata for the entity group, including the transaction log. Storing all metadata in a single Bigtable row allows us to update it atomically through a single Bigtable transaction.

Each index entry is represented as a single Bigtable row; the row key of the cell is constructed using the indexed property values concatenated with the primary key of the indexed entity. For example, the PhotosByTime index row keys would be the tuple (user_id, time, primary key) for each photo. Indexing repeated fields produces one index entry per repeated element. For example, the primary key for a photo with three tags would appear in the PhotosByTag index thrice.
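
The sketch below builds those index-entry keys as Python tuples (standing in for the byte-encoded Bigtable row keys) to show how a repeated field yields one entry per element.

```python
def photos_by_time_entries(photo: dict) -> list:
    """One local-index entry: the indexed properties plus the entity's primary key."""
    primary_key = (photo["user_id"], photo["photo_id"])
    return [(photo["user_id"], photo["time"]) + primary_key]

def photos_by_tag_entries(photo: dict) -> list:
    """Repeated field: one index entry per element of the `tag` list."""
    primary_key = (photo["user_id"], photo["photo_id"])
    return [(tag,) + primary_key for tag in photo["tag"]]

photo = {"user_id": 1, "photo_id": 10, "time": 1_700_000_000,
         "tag": ["beach", "sunset", "family"]}
assert len(photos_by_tag_entries(photo)) == 3   # three tags -> three index rows
```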

3.3 Transactions and Concurrency Control

Each Megastore entity group functions as a mini-database that provides serializable ACID semantics. A transaction writes its mutations into the entity group's write-ahead log, then the mutations are applied to the data.

Bigtable provides the ability to store multiple values in the same row/column pair with different timestamps. We use this feature to implement multiversion concurrency control (MVCC): when mutations within a transaction are applied, the values are written at the timestamp of their transaction. Readers use the timestamp of the last fully applied transaction to avoid seeing partial updates. Readers and writers don't block each other, and reads are isolated from writes for the duration of a transaction.
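
A toy, in-memory sketch of this MVCC pattern, assuming transactions are applied in timestamp order; it is not Bigtable's API, just the visibility rule described above.

```python
import collections

class MvccStore:
    """Toy multiversion store mimicking the Bigtable-timestamp pattern above.

    Every (row, column) cell keeps all of its versions; readers pin the
    timestamp of the last fully applied transaction, so they never observe
    a transaction whose mutations are only partially applied.
    """
    def __init__(self):
        self._cells = collections.defaultdict(list)  # (row, col) -> [(ts, value)]
        self.last_applied_ts = 0

    def apply_transaction(self, ts, mutations):
        """Apply one committed transaction's mutations at its timestamp ts."""
        for (row, col), value in mutations.items():
            self._cells[(row, col)].append((ts, value))
        # only after every mutation is written does the transaction become visible
        self.last_applied_ts = max(self.last_applied_ts, ts)

    def read(self, row, col, read_ts=None):
        """Return the newest value at or below read_ts (default: last applied)."""
        read_ts = self.last_applied_ts if read_ts is None else read_ts
        for ts, value in reversed(self._cells[(row, col)]):
            if ts <= read_ts:
                return value
        return None
```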

3.3.1 Queues

Queues provide transactional messaging between entity groups. They can be used for cross-group operations, to batch multiple updates into a single transaction, or to defer work. A transaction on an entity group can atomically send or receive multiple messages in addition to updating its entities. Each message has a single sending and receiving entity group; if they differ, delivery is asynchronous (see Figure 2).

Queues offer a way to perform operations that affect many entity groups. For example, consider a calendar application in which each calendar has a distinct entity group, and we want to send an invitation to a group of calendars. A single transaction can atomically send invitation queue messages to many distinct calendars. Each calendar receiving the messages processes the invitation in its own transaction, which updates the invitee's state and deletes the message.
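
A toy sketch of that calendar example, with in-memory inboxes standing in for Megastore's declared queues; the per-entity-group transactional guarantees are only simulated here, not implemented.

```python
from collections import defaultdict, deque

# One inbox per entity group (here, per calendar), as declaring a queue provides.
inboxes = defaultdict(deque)
calendars = defaultdict(lambda: {"invitations": []})

def send_invitations(sender_calendar_id, invitee_ids, event):
    """Sender-side transaction: enqueue one invitation message per invitee.

    In Megastore the sends commit atomically with the sender's own updates;
    this toy version just appends to in-memory inboxes.
    """
    for invitee in invitee_ids:
        inboxes[invitee].append({"from": sender_calendar_id, "event": event})

def process_inbox(calendar_id):
    """Receiver-side transaction: consume each message and update the invitee's state."""
    while inboxes[calendar_id]:
        message = inboxes[calendar_id].popleft()                        # delete the message...
        calendars[calendar_id]["invitations"].append(message["event"])  # ...and apply it

send_invitations("alice", ["bob", "carol"], event="launch review")
process_inbox("bob")
```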

There is a long history of message queues in full-featured RDBMSs. Our support is notable for its scale: declaring a queue automatically creates an inbox on each entity group, giving us millions of endpoints.

3.3.2 Two-Phase Commit

Megastore supports two-phase commit for atomic updates across entity groups. Since these transactions have much higher latency and increase the risk of contention, we generally discourage applications from using the feature in favor of queues. Nevertheless, they can be useful in simplifying application code for unique secondary key enforcement.
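
For reference, a textbook two-phase commit sketch (not Megastore's implementation) showing the two synchronous rounds that account for the extra latency and the contention window mentioned above.

```python
class Participant:
    """Toy participant standing in for one entity group in a cross-group update."""
    def __init__(self, name: str):
        self.name = name
        self.data = {}
        self._staged = None

    def prepare(self, mutation: dict) -> bool:
        self._staged = mutation      # vote yes by (notionally) staging the mutation
        return True

    def commit(self) -> None:
        self.data.update(self._staged)
        self._staged = None

    def abort(self) -> None:
        self._staged = None

def two_phase_commit(participants, mutations) -> bool:
    """Two synchronous rounds (prepare, then commit) across all participants;
    the extra round trips and the window in which groups stay blocked are why
    queues are usually preferred."""
    if all(p.prepare(mutations[p.name]) for p in participants):   # round 1
        for p in participants:                                    # round 2
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```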
