The Google File System
ABSTRACT 摘要
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
我们已经设计并实现了谷歌文件系统,一个用于大型分布式数据密集型应用的可扩展分布式文件系统。它实现了在廉价的商品硬件上运行时提供容错机制,并为大量客户端提供了高聚合性能。
While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.
虽然和先前的分布式文件系统拥有许多相同的目标,但这次我们的设计是基于观察当前和预期的应用程序工作负载和技术环境得到的,因此它出现了违背先前的一些文件系统假设的情况。这让我们重新审视传统的选择和探索截然不同的设计点。
The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.
文件系统已经成功满足我们的存储需求。它作为存储平台被广泛部署在谷歌内部,用于生成和处理业务所产生的数据以及需要大量数据集的研发工作。迄今为止最大的集群在1000多台机器的数千个磁盘上提供了数百TB的存储空间,并由数百个客户端同时访问。
In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.
在本文中,我们介绍了为支持分布式应用程序而设计的文件系统接口扩展,讨论了我们设计的许多方面,并报告了微观基准测试和实际使用的测量结果。
1. INTRODUCTION 简介
We have designed and implemented the Google File System (GFS) to meet the rapidly growing demands of Google's data processing needs. GFS shares many of the same goals as previous distributed file systems such as performance, scalability, reliability, and availability. However, its design has been driven by key observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system design assumptions. We have reexamined traditional choices and explored radically different points in the design space.
我们设计并实现了谷歌文件系统(GFS),以满足快速增长的数据处理需求。GFS与先前的分布式文件系统有许多相同的目标,如性能、可扩展性、可靠性和可用性。然而,这次我们的设计是基于观察当前和预期的应用程序工作负载和技术环境关键点得到的,因此它出现了违背先前的一些文件系统假设的情况。我们重新审视了传统的选择,并探索了设计中的根本不同点。
First, component failures are the norm rather than the exception. The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts and is accessed by a comparable number of client machines. The quantity and quality of the components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. We have seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies. Therefore, constant monitoring, error detection, fault tolerance, and automatic recovery must be integral to the system.
第一,组件故障是常态而非意外。文件系统由成百上千台由廉价硬件构成的存储机器组成,并由相当多数量的客户端访问。组件的数量和质量直接决定了文件系统的可用性,因为任何给定的时间内都有可能发生某些组件无法工作,某些组件无法从它们当前的失效状态中恢复。我们已经遇到了各种各样的问题,例如:应用程序错误、操作系统错误、人为失误以及由硬盘,内存,连接器,网络和电源故障引起的问题。因此,持续监控、错误检测、容错机制和自动恢复是文件系统必须考虑的内容。
Second, files are huge by traditional standards. Multi-GB files are common. Each file typically contains many application objects such as web documents. When we are regularly working with fast growing data sets of many TBs comprising billions of objects, it is unwieldy to manage billions of approximately KB-sized files even when the file system could support it. As a result, design assumptions and parameters such as I/O operation and block sizes have to be revisited.
第二,按照普通标准衡量,文件是海量的,这些文件通常体积高达数GB。每个文件一般会包含许多应用程序对象,如web文档。当我们处理数十亿个对象组成的高达好几TB并不断快速增长的数据集时,即使文件系统能够支持,也难以管理数十亿个KB大小的文件块。因此,设计的假设条件和参数,例如IO操作和Block块大小都需要重新考虑。
Third, most files are mutated by appending new data rather than overwriting existing data. Random writes within a file are practically non-existent. Once written, the files are only read, and often only sequentially. A variety of data share these characteristics. Some may constitute large repositories that data analysis programs scan through. Some may be data streams continuously generated by running applications. Some may be archival data. Some may be intermediate results produced on one machine and processed on another, whether simultaneously or later in time. Given this access pattern on huge files, appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.
第三,绝大多数的文件修改是通过在文件结尾追加新数据,而非覆盖现有数据的方式。文件的随机写操作在实践中几乎不存在。一旦写入完成后,对文件的操作就只有读,而且通常是顺序读取,大量的数据符合这些特征。比如:数据分析程序扫描的超大数据集;正在运行的应用程序连续生成的数据流;存档数据;由一台机器生成并在另外一台机器上处理的中间数据,这些中间数据的处理可能是同时进行也可能是后续处理的。对于这种海量文件的访问模式,客户端对数据块缓存是没有意义的,数据的追加操作是保证性能优化和原子性的主要考量因素。
Fourth, co-designing the applications and the file system API benefits the overall system by increasing our flexibility. For example, we have relaxed GFS’s consistency model to vastly simplify the file system without imposing an onerous burden on the applications. We have also introduced an atomic append operation so that multiple clients can append concurrently to a file without extra synchronization between them. These will be discussed in more detail later in the paper.
第四,应用程序和文件系统API的协同设计提高了整个系统的灵活性。例如,我们放宽了GFS的一致性模型的要求,减轻了文件系统对应用程序的苛刻要求,极大地简化了GFS的设计。我们引入了原子性的追加操作,从而保证多个客户端可以对同一个文件进行追加操作,不需要额外的同步操作来保证数据一致性。这些将在本文后面进行更加详细的讨论。
Multiple GFS clusters are currently deployed for different purposes. The largest ones have over 1000 storage nodes, over 300 TB of disk storage, and are heavily accessed by hundreds of clients on distinct machines on a continuous basis.
当前我们部署了多个GFS集群,服务不同的应用。最大的集群拥有超过1000个存储节点,提供超过300TB的磁盘存储,被不同机器上的数百个客户端连续不断的频繁访问。
2. DESIGN OVERVIEW 设计概述
2.1 Assumptions 假设
In designing a file system for our needs, we have been guided by assumptions that offer both challenges and opportunities. We alluded to some key observations earlier and now lay out our assumptions in more detail.
设计GFS的过程中我们做了很多的假设,它们既意味着挑战,也带来了机遇。我们之前已经提到了一些关键点,现在将详细阐述这些假设。
The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.
系统是构建在很多廉价的组件之上,组件失效是常态。系统必须持续监控自身状态、侦测错误,拥有容错机制和恢复失效组件的能力。
The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.
系统存储一定数量的大文件,我们预期会有几百万个文件,每个文件的大小通常在100MB或者更大。GB级别的文件也是普遍存在的,需要被有效管理。系统也必须支持小文件,但是不需要针对小文件做专门的优化。
The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. In large streaming reads, individual operations typically read hundreds of KBs, more commonly 1MB or more. Successive operations from the same client often read through a contiguous region of a file. A small random read typically reads a few KBs at some arbitrary offset. Performance-conscious applications often batch and sort their small reads to advance steadily through the file rather than go back and forth.
系统工作负载主要包括两种读操作:大型流式读取和小型随机读取。在大型流式读取中,通常一次读取数百KB的数据,更常见的是读取1MB或者更多的数据,来自同一个客户端的连续操作通常读取文件的连续区域。小规模的随机读取通常是在文件特定的偏移位置上读取几KB数据。如果应用程序对性能非常关注,通常的做法是把小规模的随机读取操作合并、排序,之后按照顺序批量读取,这就避免了在文件中前后来回的移动读取位置。
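As a concrete illustration of that batch-and-sort pattern, here is a minimal Go sketch (not from the paper); the ReadRequest type and the particular offsets and sizes are hypothetical stand-ins for whatever read API an application actually uses.

```go
package main

import (
	"fmt"
	"sort"
)

// ReadRequest is a hypothetical small random read: a byte offset and a
// length within one file.
type ReadRequest struct {
	Offset int64
	Length int
}

// batchReads sorts pending small reads by offset so the file is swept once,
// front to back, instead of seeking back and forth.
func batchReads(reqs []ReadRequest) []ReadRequest {
	sort.Slice(reqs, func(i, j int) bool { return reqs[i].Offset < reqs[j].Offset })
	return reqs
}

func main() {
	reqs := []ReadRequest{
		{Offset: 9 << 20, Length: 4096},
		{Offset: 1 << 20, Length: 8192},
		{Offset: 5 << 20, Length: 2048},
	}
	for _, r := range batchReads(reqs) {
		fmt.Printf("read %d bytes at offset %d\n", r.Length, r.Offset)
	}
}
```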
The workloads also have many large, sequential writes that append data to files. Typical operation sizes are similar to those for reads. Once written, files are seldom modified again. Small writes at arbitrary positions in a file are supported but do not have to be efficient.
系统的工作负载还包括许多大规模顺序追加写入操作。一般情况下,每次写入操作数据大小和大规模读操作数据大小差不多。数据一旦写入,文件就很少会被修改。系统支持小规模的随机位置写入操作,但是效率没必要很高。
The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Our files are often used as producer-consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential. The file may be read later, or a consumer may be reading through the file simultaneously.
系统必须高效的实现多客户端并行追加到同一个文件里的明确的语义。我们的文件通常用于生产者消费者队列或者多路合并。几百个机器运行的生产者,同时对同一个文件进行追加操作。用最小的同步开销实现原子性的追加操作是必不可少的。文件可以在稍后读取,或者是消费者在一个文件进行追加操作的同时读取这个文件。
High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write.
稳定的网络带宽远比低延迟更重要。我们的目标程序绝大部分要求能够高速率的、大批量的处理数据,极少有程序对单一的读写操作有严格的响应时间要求。
2.2 Interface 接口
GFS provides a familiar file system interface, though it does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by pathnames. We support the usual operations to create, delete, open, close, read, and write files.
GFS提供了一套类似传统文件系统的API接口函数,虽然并不是严格按照POSIX等标准API的形式实现的。文件以分层目录的形式组织,用路径来标识。我们支持常用的操作,如创建新文件、删除文件、打开文件、关闭文件、读和写文件。
Moreover, GFS has snapshot and record append operations. Snapshot creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append. It is useful for implementing multi-way merge results and producer-consumer queues that many clients can simultaneously append to without additional locking. We have found these types of files to be invaluable in building large distributed applications. Snapshot and record append are discussed further in Sections 3.4 and 3.3 respectively.
此外,GFS提供了快照和记录追加操作。快照以很低的成本创建一个文件或者目录树的拷贝。记录追加操作允许多个客户端同时对一个文件进行数据追加操作,同时保证每个客户端的追加操作都是原子性的。这对于实现多路结果合并,以及“生产者-消费者”队列非常有用,多个客户端可以在不需要额外的同步锁的情况下,同时对同一个文件追加数据。我们发现这些类型的文件对于构建大型分布式应用是非常重要的。快照和记录追加操作将在3.4和3.3节分别讨论。
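To make the shape of that interface concrete, the Go sketch below lists the operations named in this section as a hypothetical client interface; the paper does not define the exact signatures, so the method names, the Handle type, and the error conventions here are assumptions.

```go
package main

// Client is a hypothetical view of the GFS client library's surface,
// covering the operations described in Section 2.2. Options and error
// details are omitted; the real API is not specified at this level.
type Client interface {
	Create(path string) error
	Delete(path string) error
	Open(path string) (Handle, error)
	Close(h Handle) error
	Read(h Handle, offset int64, buf []byte) (int, error)
	Write(h Handle, offset int64, data []byte) (int, error)

	// Snapshot copies a file or directory tree at low cost.
	Snapshot(src, dst string) error

	// RecordAppend appends data atomically at an offset chosen by GFS and
	// returns that offset, so many clients can append concurrently without
	// extra locking.
	RecordAppend(h Handle, data []byte) (offset int64, err error)
}

// Handle is a hypothetical open-file handle.
type Handle struct{ id uint64 }

func main() {}
```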
2.3 Architecture 架构
A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients, as shown in Figure 1. Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable.
一个GFS集群包含一个单独Master节点和多台块服务器(Chunkserver),如图一所示,并同时被多个客户端访问。集群中的机器通常是普通的Linux机器,运行着用户级别的服务进程。我们可以很容易的把块服务器和客户端放在同一台机器上,前提是机器资源允许,并且我们能够接受不可靠的应用程序代码带来的稳定性降低的风险。
Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers. By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace.
GFS存储的文件会被切割为固定大小的块(chunk)。master节点在创建每个块(chunk)的时候会给它们分配一个不变的且全局唯一的句柄标识。chunkserver节点把不同的块(chunk)以linux文件的形式存储在本地磁盘上,并根据指定的块标识和字节范围来读写块数据。出于可靠性的考虑,每个块会复制到多个chunkserver节点上。GFS默认存储三个副本,不过用户可以为不同的文件指定不同的副本机制。
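A minimal sketch of the chunkserver side of this arrangement, under the assumption (ours, for illustration) that each chunk replica is a local Linux file named after its 64-bit handle; chunkPath and readChunkRange are hypothetical helpers, not GFS code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

const chunkSize = 64 << 20 // 64 MB logical chunk capacity

// chunkPath maps a 64-bit chunk handle to a local Linux file.
// The directory layout is hypothetical.
func chunkPath(dir string, handle uint64) string {
	return filepath.Join(dir, fmt.Sprintf("%016x.chunk", handle))
}

// readChunkRange serves a read addressed by (chunk handle, byte range),
// which is how clients specify chunk data.
func readChunkRange(dir string, handle uint64, offset int64, length int) ([]byte, error) {
	if offset+int64(length) > chunkSize {
		return nil, fmt.Errorf("range exceeds chunk boundary")
	}
	f, err := os.Open(chunkPath(dir, handle))
	if err != nil {
		return nil, err
	}
	defer f.Close()
	buf := make([]byte, length)
	n, err := f.ReadAt(buf, offset)
	return buf[:n], err
}

func main() {
	data, err := readChunkRange("/tmp/chunks", 0x1a2b, 0, 1024)
	fmt.Println(len(data), err)
}
```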
The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.
master节点管理所有文件系统的元数据,包括:命名空间、访问控制信息、文件和块(chunk)的映射信息以及当前块(chunk)的位置信息。master节点还管理系统范围内的活动,例如:块租用管理(chunk lease management)、孤儿块(orphaned chunks)的垃圾回收以及chunkserver节点之间的数据迁移。master节点使用心跳信息(heartbeat message)周期性的与chunkserver节点通讯——向其发送指令并收集chunkserver服务器的状态。
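The paper does not give a wire format for these HeartBeat exchanges; the structs below are a hypothetical sketch of the kind of information that could flow each way (chunkserver state up, master instructions down).

```go
package main

// HeartbeatReport is a hypothetical chunkserver -> master message: the
// chunks this server currently holds plus coarse usage data the master can
// use for placement and rebalancing decisions.
type HeartbeatReport struct {
	ServerAddr    string
	ChunkHandles  []uint64
	DiskFreeBytes uint64
}

// HeartbeatReply is a hypothetical master -> chunkserver message carrying
// instructions, e.g. chunks whose replicas are no longer needed and may be
// garbage-collected, or chunks to copy from another server.
type HeartbeatReply struct {
	DeletableChunks []uint64
	ReplicateFrom   map[uint64]string // chunk handle -> source server
}

func main() {}
```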
GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application. Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers. We do not provide the POSIX API and therefore need not hook into the Linux vnode layer.
连接到不同应用程序的GFS客户端实现了文件系统API。应用程序通过操作GFS客户端与GFS的master节点和chunkserver节点通讯,完成数据的读取和写入操作。客户端和master节点通信获取元数据,所有与数据相关的操作都是由客户端直接和chunkserver节点进行交互的。我们不提供POSIX标准的API功能,因此GFS API调用不需要深入到linux vnode级别。
Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Not having them simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however.) Chunkservers need not cache file data because chunks are stored as local files and so Linux’s buffer cache already keeps frequently accessed data in memory.
无论是客户端还是chunkserver都不会缓存文件数据。客户端缓存数据带来的好处很小,因为大部分程序要么以流的方式读取巨大的文件,要么工作集太大无法被缓存。不考虑缓存相关问题简化了客户端和整个系统的设计和实现(不过,客户端会缓存元数据)。chunkserver不需要缓存文件数据的原因是,块(chunk)被存储为本地文件,linux操作系统的文件系统缓存会把经常访问的数据缓存在内存中。
2.4 Single Master 单一master节点
Having a single master vastly simplifies our design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. However, we must minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never read and write file data through the master. Instead, a client asks the master which chunkservers it should contact. It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.
单一master节点的策略大大简化了我们的设计。单一master节点可以通过全局信息精确定位块(chunk)位置以及进行复制决策。然而,我们必须尽量减少对master节点的读写,避免master节点成为系统的性能瓶颈。客户端并不通过master节点读写文件数据,相反,客户端向master节点询问它应该联系哪个chunkserver。客户端将元数据信息(询问master的结果)缓存一段时间,后续操作将直接和chunkserver进行数据读写操作。
Let us explain the interactions for a simple read with reference to Figure 1. First, using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file. Then, it sends the master a request containing the file name and chunk index. The master replies with the corresponding chunk handle and locations of the replicas. The client caches this information using the file name and chunk index as the key.
让我们利用图1(figure 1)解释一下一次简单读取的流程。首先,客户端把程序指定的文件名和字节偏移量,根据固定的块大小(chunksize),转换成文件的块索引(chunkindex)。然后,客户端向master节点发送一个包含文件名和块索引的请求,master节点返回相应的chunk handle和副本位置信息。客户端会将文件名和块索引作为key缓存起来。
The client then sends a request to one of the replicas, most likely the closest one. The request specifies the chunk handle and a byte range within that chunk. Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened. In fact, the client typically asks for multiple chunks in the same request and the master can also include the information for chunks immediately following those requested. This extra information sidesteps several future client-master interactions at practically no extra cost.
然后,客户端发送请求到其中的一个副本处,大多数情况下选择最近的。请求包含了chunk handle和该副本的字节范围。在后续对这个块(chunk)的读取操作中,客户端不需要再和master节点通讯,除非缓存的元数据信息过期或者文件被重新打开。实际上,客户端通常会在一次请求中查询多个块(chunk)信息,master节点的回应也可能包含了紧跟着这些被请求的块(chunk)后面的块(chunk)信息,这些额外的信息在没有任何代价的情况下,避免了客户端和master节点未来可能会发生的几次通讯。
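Putting the Figure 1 steps together, here is a client-side read-path sketch under stated assumptions: the cache layout, the askMaster placeholder, and the replica addresses are all hypothetical; only the offset-to-chunk-index arithmetic and the caching by (file name, chunk index) come directly from the text.

```go
package main

import "fmt"

const chunkSize = 64 << 20 // the fixed chunk size (Section 2.5)

// chunkLocations is what the master returns for one (file name, chunk index)
// request: the chunk handle and the replicas' chunkserver addresses.
type chunkLocations struct {
	Handle   uint64
	Replicas []string
}

// client caches master replies keyed by file name and chunk index, so
// further reads of the same chunk need no master interaction until the
// entry expires or the file is reopened (expiry is omitted here).
type client struct {
	cache map[string]chunkLocations
}

// askMaster stands in for the RPC that sends (file name, chunk index) to the
// master; the handle and addresses returned here are made up.
func (c *client) askMaster(file string, index int64) chunkLocations {
	return chunkLocations{Handle: 42, Replicas: []string{"cs1:7777", "cs2:7777"}}
}

// locate translates an application-level (file name, byte offset) into a
// chunk index, then resolves it through the cache or the master.
func (c *client) locate(file string, offset int64) (chunkLocations, int64) {
	index := offset / chunkSize // step 1: byte offset -> chunk index
	key := fmt.Sprintf("%s/%d", file, index)
	loc, ok := c.cache[key]
	if !ok {
		loc = c.askMaster(file, index) // step 2: one request to the master
		c.cache[key] = loc
	}
	return loc, offset % chunkSize // chunk-relative offset for the chunkserver
}

func main() {
	c := &client{cache: map[string]chunkLocations{}}
	loc, off := c.locate("/logs/web-00", 200<<20) // byte 200 MB falls in chunk index 3
	// step 3: the client would now read the byte range from the closest replica
	fmt.Println(loc.Handle, loc.Replicas[0], off)
}
```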
2.5 Chunk Size 块大小
Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed. Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.
块大小(chunksize)是设计的关键参数之一。我们选择了64MB,这个尺寸比一般文件系统的块大小大的多。每个块副本都作为一个普通的linux文件存储在chunkserver上,并仅在需要时进行扩展。惰性空间分配策略避免了因内部碎片造成的空间浪费,对于这样大的块大小来说,(内部分段fragment)可能是最大的一个缺陷了。
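A small, hypothetical illustration of what lazy allocation means in practice: writing 1 MB into a fresh chunk file leaves roughly a 1 MB Linux file on disk, even though the chunk's logical capacity is 64 MB, so the large chunk size does not by itself waste space.

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Create a fresh chunk file and write 1 MB of data into it.
	f, err := os.CreateTemp("", "chunk-*.chunk")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	if _, err := f.Write(make([]byte, 1<<20)); err != nil {
		panic(err)
	}
	f.Close()

	// The file on disk is ~1 MB, not 64 MB: the chunk is extended only as
	// data is actually written.
	info, err := os.Stat(f.Name())
	if err != nil {
		panic(err)
	}
	fmt.Printf("logical capacity: %d bytes, on-disk size: %d bytes\n", 64<<20, info.Size())
}
```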
A large chunk size offers several important advantages. First, it reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. Even for small random reads, the client can comfortably cache all the chunk location information for a multi-TB working set. Second, since on a large chunk, a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Third, it reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.
选择较大的块大小有几个重要的优点。第一,它减少了客户端和master节点的通讯需求,因为对同一个块进行读写操作只需要向master节点请求一次块的位置信息。这种方式对于降低我们工作负载来说效果显著,因为我们的应用程序通常是顺序读写大型文件,即使是小规模的随机读,客户端也可以轻松缓存数TB工作数据集的所有块位置信息。第二,采取较大的块大小,客户端能够对一个块进行多次操作,这样就可以通过与chunkserver保持较长时间的TCP连接来减少网络负载。第三,采取较大的块大小减少了master节点需要保存的元数据的数量,这样我们就可以把元数据全部放在内存中,在2.6.1节我们会讨论元数据全部放在内存中带来的其他优点。
On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages. A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file. In practice, hot spots have not been a major issue because our applications mostly read large multi-chunk files sequentially.
另一方面,即使配合惰性空间分配,采用较大的块大小也有缺陷。一个小文件包含较少的块,甚至只有一个块,如果许多客户端同时请求这一个小文件,存储这个小文件的块的chunkserver就可能会变成热点。在实践中,由于我们的应用程序主要是顺序读取大型的包含多个块的文件,热点还不是主要的问题。
However, hot spots did develop when GFS was first used by a batch-queue system: an executable was written to GFS as a single-chunk file and then started on hundreds of machines at the same time. The few chunkservers storing this executable were overloaded by hundreds of simultaneous requests. We fixed this problem by storing such executables with a higher replication factor and by making the batch queue system stagger application start times. A potential long-term solution is to allow clients to read data from other clients in such situations.
然而,当GFS首次被批处理队列系统使用时,热点问题还是产生了:一个可执行文件在GFS上保存为一个单块(single-chunk)文件,之后这个可执行文件在数百台机器上同时启动,存放这个可执行文件的chunkservers被数百个客户端的并发请求访问导致系统局部过载。我们通过使用更大的副本因子来保存这个可执行文件,并使批处理队列系统错开应用程序启动时间,从而解决了这个问题。一个可能的长效解决方案是,在这种情况下,允许客户端从其它客户端读取数据。
2.6 Metadata 元数据
The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas. All metadata is kept in the master’s memory. The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log stored on the master’s local disk and replicated on remote machines. Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in the event of a master crash. The master does not store chunk location information persistently. Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.
master服务器(注意区分逻辑节点和服务器的概念,此处探讨的是服务器的行为,如存储、内存等)存储3种主要类型的元数据:文件和块的命名空间、文件和块的映射关系、每个块副本的位置信息。所有元数据都保存在master服务器的内存中。前两种类型的元数据(命名空间和映射关系)同时也会以操作日志的形式记录在master服务器的本地磁盘上,日志文件会被复制到其他远程服务器上。采用操作日志的方式,我们能够简单可靠的更新master服务器的状态,不需要担心master服务器崩溃导致数据不一致的问题。master服务器不会持久保存块位置信息,相反,master服务器在启动时或者有新的chunkserver加入时,会向各个chunkserver轮询他们所存储的块信息。
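The following Go sketch shows one plausible (hypothetical) shape for these three tables in the master's memory; the comments note which parts are made durable through the operation log and which are rebuilt from chunkserver reports.

```go
package main

// masterState is a hypothetical sketch of the master's in-memory metadata.
type masterState struct {
	// (1) File and chunk namespaces, with access control information,
	// keyed by pathname. Mutations to this table are recorded in the
	// operation log and replicated remotely.
	namespace map[string]*fileMeta

	// (3) Current replica locations per chunk handle. NOT persisted:
	// rebuilt by polling chunkservers at startup and kept fresh via
	// HeartBeat messages.
	locations map[uint64][]string
}

type fileMeta struct {
	// (2) The mapping from this file to its chunks, in order. Also
	// recorded in the operation log.
	chunkHandles []uint64
}

func main() {}
```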
2.6.1 In-Memory Data Structures 内存中的数据结构
Since metadata is stored in memory, master operations are fast. Furthermore, it is easy and efficient for the master to periodically scan through its entire state in the background. This periodic scanning is used to implement chunk garbage collection, re-replication in the presence of chunkserver failures, and chunk migration to balance load and disk space usage across chunkservers. Sections 4.3 and 4.4 will discuss these activities further.
因为元数据保存在内存中,所以master服务器的操作速度非常快。并且,master服务器可以在后台简单而高效的周期性扫描master节点保存的全部状态信息。此周期性扫描也用于实现块垃圾回收、在chunkserver出现故障时重新复制数据、通过块迁移平衡chunkserver之间的负载和磁盘空间。4.3节和4.4节将深入讨论这些行为。
One potential concern for this memory-only approach is that the number of chunks and hence the capacity of the whole system is limited by how much memory the master has. This is not a serious limitation in practice. The master maintains less than 64 bytes of metadata for each 64 MB chunk. Most chunks are full because most files contain many chunks, only the last of which may be partially filled. Similarly, the file namespace data typically requires less than 64 bytes per file because it stores file names compactly using prefix compression.
这种将元数据全部保存在内存的方法的潜在问题是,存储块的数量和整个系统的负载能力都受限于master服务器内存大小。在实践中,这并不是一个严重的问题。master服务器只需要不到64字节的元数据就能管理64MB的块,由于大多数文件都包含多个块,因为绝大多数的块都能够放满,只有最后一个块是部分填充的。同样的,每个文件在命名空间中的数据大小通常在64字节以下,因为它使用前缀压缩算法保存文件名。
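Some back-of-the-envelope arithmetic (our illustrative numbers, not measurements from the paper) shows why this limit is not pressing: at under 64 bytes of metadata per 64 MB chunk, even a petabyte of chunk data needs only on the order of a gigabyte of master memory for chunk metadata.

```go
package main

import "fmt"

func main() {
	const (
		chunkSize     = 64 << 20 // 64 MB per chunk
		bytesPerChunk = 64       // metadata per chunk (upper bound from the paper)
		totalData     = 1 << 50  // 1 PB of stored chunk data, an illustrative figure
	)
	chunks := totalData / chunkSize
	fmt.Printf("%d chunks -> about %d MB of chunk metadata\n",
		chunks, chunks*bytesPerChunk>>20)
	// 1 PB / 64 MB = 16,777,216 chunks; times 64 B = about 1 GB of metadata.
}
```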
If necessary to support even larger file systems, the cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility we gain by storing the metadata in memory.
如果需要支持更大的文件系统,为master服务器增加额外内存,即可将全部元数据存储在内存中。相对于一个简洁,高效,可靠,灵活的文件系统,增加额外内存而付出的成本是一个很小的代价。
2.6.2 Chunk Locations 块位置信息
The master does not keep a persistent record of which chunkservers have a replica of a given chunk. It simply polls chunkservers for that information at startup. The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages.
master服务器并不持久化保存哪个chunkserver上面指定的块副本信息,master服务器只是在启动时轮询chunkserver获取这些信息。master服务器能够保证它持有的信息始终是最新的,因为它控制所有块的位置分配,并且通过周期性心跳监控chunkserver服务器的状态。
We initially attempted to keep chunk location information persistently at the master, but we decided that it was much simpler to request the data from chunkservers at startup, and periodically thereafter. This eliminated the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on. In a cluster with hundreds of servers, these events happen all too often.
最初,我们试图把块的位置信息持久保存在master服务器上,但后来我们发现在启动时轮询chunkserver,之后定期轮询的方式更简单。这种设计简化了chunkserver加入集群、离开集群、更名、失效以及重启的时候,master服务器和chunkserver之间数据同步的问题。在一个拥有数百台服务器的时候,这类事件会频繁发生。
Another way to understand this design decision is to realize that a chunkserver has the final word over what chunks it does or does not have on its own disks. There is no point in trying to maintain a consistent view of this information on the master because errors on a chunkserver may cause chunks to vanish spontaneously (e.g., a disk may go bad and be disabled) or an operator may rename a chunkserver.
可以从另外一个角度理解这个设计决策:只有chunkserver才能最终决定一个块是否在它的硬盘上,在master服务器上维护块位置信息的全局视图是没有意义的,因为chunkserver的错误可能会导致块自动消失(例如,磁盘损坏或者无法访问),亦或者操作人员可能会对chunkserver重命名。
2.6.3 Operation Log 操作日志
The operation log contains a historical record of critical metadata changes. It is central to GFS. Not only is it the only persistent record of metadata, but it also serves as a logical time line that defines the order of concurrent operations. Files and chunks, as well as their versions (see Section 4.5), are all uniquely and eternally identified by the logical times at which they were created.
操作日志保存了关键元数据的变更记录。这对GFS非常重要,不仅仅是因为操作日志是元数据唯一的持久化存储记录,它也作为判断同步操作顺序的逻辑时间基线(注:即通过逻辑日志的序号作为操作发生的逻辑时间,类似于事务系统中的LSN)。所有的文件和块,连同他们的版本(见4.5),都是由他们的逻辑时间唯一,永久标识。
Since the operation log is critical, we must store it reliably and not make changes visible to clients until metadata changes are made persistent. Otherwise, we effectively lose the whole file system or recent client operations even if the chunks themselves survive. Therefore, we replicate it on multiple remote machines and respond to a client operation only after flushing the corresponding log record to disk both locally and remotely. The master batches several log records together before flushing thereby reducing the impact of flushing and replication on overall system throughput.
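A sketch of that write path, under assumptions: each record carries a logical sequence number (the logical timeline mentioned above), records are batched, and a batch is flushed to the local disk and to remote replicas before the waiting client operations are acknowledged. The record layout, the text encoding, and replicateToRemoteMasters are hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// logRecord is a hypothetical operation-log entry. Seq is the logical time
// that orders concurrent mutations and identifies file/chunk versions.
type logRecord struct {
	Seq     uint64
	Payload []byte // an encoded namespace or file-to-chunk mutation
}

type opLog struct {
	nextSeq uint64
	pending []logRecord // batched records not yet flushed
	file    *os.File
}

func (l *opLog) append(payload []byte) uint64 {
	l.nextSeq++
	l.pending = append(l.pending, logRecord{Seq: l.nextSeq, Payload: payload})
	return l.nextSeq
}

// commit flushes the batch locally and to remote replicas; only after both
// succeed would the master reply to the waiting client operations.
func (l *opLog) commit() error {
	for _, r := range l.pending {
		if _, err := fmt.Fprintf(l.file, "%d %x\n", r.Seq, r.Payload); err != nil {
			return err
		}
	}
	if err := l.file.Sync(); err != nil { // flush to local disk
		return err
	}
	if err := replicateToRemoteMasters(l.pending); err != nil { // hypothetical remote copy
		return err
	}
	l.pending = l.pending[:0]
	return nil
}

// replicateToRemoteMasters is a placeholder for shipping the batch to the
// remote machines that hold log replicas.
func replicateToRemoteMasters(batch []logRecord) error { return nil }

func main() {
	f, err := os.CreateTemp("", "oplog-*")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	l := &opLog{file: f}
	l.append([]byte("create /logs/web-00"))
	l.append([]byte("add chunk 42 to /logs/web-00"))
	fmt.Println("commit:", l.commit())
}
```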