Feb 12, 2025
Have you ever wondered how Google manages to handle so much data? Think of cloud storage, ever-growing inboxes, and photos and videos shot in 4K and 8K at ever-higher quality. In the early 2000s a single picture was around 10 KB; today even a modest-quality photo easily runs to 5 MB.
We are generating enormous amounts of data, so the question is: how did Google keep up with this load when the internet first started to boom?
With this rapid rise in data, Google needed a storage system built for its own use case. It designed its own filesystem and published a paper on the Google File System (GFS) in 2003.
Unlike traditional file systems, GFS was engineered to handle files ranging from gigabytes to terabytes in size, support concurrent access by thousands of clients, and withstand frequent hardware failures inherent in data centers. This blog delves into the technical architecture, design philosophy, and operational mechanics of GFS, providing a comprehensive exploration of its inner workings.
According to Google:
"We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients."
Before diving into the details, it would make sense to understand what a distributed file system (DFS) is and why we need one.
A Distributed File System (DFS), not the search algorithm that we learn in graphs, is a system where data is stored across multiple computers but appears to the user as a single unified storage. Unlike a local file system on a personal computer, a DFS allows access to files from different machines over a network using authentication credentials.
You might ask: why do we even need a distributed file system? Think of it this way: a modern personal computer usually ships with how much disk space? 1 TB? But what if you are a video editor with far more footage than that single disk can hold? One viable option is an external hard disk, but that comes with risks: if the disk fails, all the data is lost. This is where distributed file systems come in. They offer several advantages:
- Scalability: storage grows by adding more machines instead of bigger disks.
- Fault tolerance: data is replicated, so a single failing machine does not lose your files.
- Shared access: many clients on different machines can read and write the same files over the network.
Enough context; now let's see how the Google File System provides all this functionality. While GFS shares much with earlier distributed file systems, Google favored certain choices and design principles that suit its own scaling needs and application workloads.
Let's start by explaining a few terms that have been heavily used in the paper. After that, we will understand how these interact with each other, the complete flow, and how the design choices have influenced the scale.
Let's start with the Master. It is the brain of GFS. Think of the master as the one who keeps all the information about which file is present on what machine. Since it is a distributed file system, there has to be a mapping of the files to the chunks, and the chunks to the machines, or else how will one know which file is present where in this vast sea of the internet? A better term for this would be the coordinator or manager since it manages metadata (file-to-chunk mappings, chunk locations, access control).
It maintains critical metadata such as:

File-to-Chunk Mapping
```json
{
  "data/user/file": {
    "0": "1378493021847",
    "1": "9847562304123",
    "2": "5647382910567",
    "3": "2309487561023",
    "4": "6789234501892"
  }
}
```
Chunk-to-Chunkserver Mapping
```json
{
  "1378493021847": ["CS1"],
  "9847562304123": ["CS1"],
  "5647382910567": ["CS3"],
  "2309487561023": ["CS2"],
  "6789234501892": ["CS3"]
}
```
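To make these two mappings concrete, here is a minimal, hypothetical sketch of how a master could resolve a (file, byte offset) lookup using them. The dictionaries mirror the JSON above; the `lookup` function and its structure are illustrative assumptions, not the actual GFS implementation.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per chunk

# File path -> chunk index -> chunk handle (as in the first mapping above).
file_to_chunks = {
    "data/user/file": {0: "1378493021847", 1: "9847562304123",
                       2: "5647382910567", 3: "2309487561023",
                       4: "6789234501892"},
}

# Chunk handle -> chunkservers holding it (as in the second mapping above).
chunk_locations = {
    "1378493021847": ["CS1"], "9847562304123": ["CS1"],
    "5647382910567": ["CS3"], "2309487561023": ["CS2"],
    "6789234501892": ["CS3"],
}

def lookup(path, byte_offset):
    """Translate (file, byte offset) into (chunk handle, chunkserver list)."""
    chunk_index = byte_offset // CHUNK_SIZE        # which 64 MB slice of the file?
    handle = file_to_chunks[path][chunk_index]     # file -> chunk handle
    return handle, chunk_locations[handle]         # chunk handle -> servers

print(lookup("data/user/file", 150 * 1024 * 1024))  # byte 150 MB falls in chunk index 2
```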
All this information is kept in memory, and the parts that must survive a crash (the namespace and file-to-chunk mappings) are also persisted to disk; chunk locations are simply re-learned from the chunkservers themselves. You might ask: why keep the data in memory? Isn't it volatile? And isn't memory limited?
To answer the first question: fast lookups. Clients request metadata frequently, so keeping it in memory ensures low latency, reduces disk I/O, and speeds up operations.
Now for the volatility part: yes, memory is volatile, but the master persists metadata in two ways to prevent data loss:
1. An operation log that records every metadata change (file creation, chunk allocation, and so on) and is flushed to disk before the change is acknowledged.
2. Periodic checkpoints of the full metadata state, which keep the operation log from growing without bound.
So, once the master fails or is restarted, it loads metadata from the last checkpoint and replays the log. The checkpoint is in a compact B-tree-like form that can be directly mapped into memory and used for namespace lookup without extra parsing.
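To make that recovery step concrete, here is a hedged sketch of "load the checkpoint, then replay the log". The record format, file paths, and helper names are made up for illustration; the real master uses a compact B-tree-like checkpoint, not JSON.

```python
import json

def recover(checkpoint_path, log_path):
    # 1. Start from the last persisted checkpoint of the metadata.
    with open(checkpoint_path) as f:
        metadata = json.load(f)

    # 2. Replay every logged mutation recorded after that checkpoint.
    with open(log_path) as f:
        for line in f:
            op = json.loads(line)          # e.g. {"op": "create", "path": ...}
            apply_operation(metadata, op)  # hypothetical helper, defined below
    return metadata

def apply_operation(metadata, op):
    if op["op"] == "create":
        metadata[op["path"]] = {}
    elif op["op"] == "add_chunk":
        metadata[op["path"]][op["index"]] = op["handle"]
```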
But this still doesn't answer the question—if we have hundreds of TBs of data, how are we going to fit everything in memory?
Note that we are not storing the files themselves in memory, only the metadata: the mapping of chunks to the machines they live on. Let's break down the memory usage of GFS metadata:
Each chunk requires approximately 64 bytes of metadata (again: metadata, not the file contents), which includes the chunk handle, the locations of its replicas, and its version number. Using this figure, we can extrapolate how much metadata the master can manage for a given amount of RAM.
Memory Size (RAM) | Number of Chunks | Total Data Stored (64 MB per chunk) | Approx. Number of Files (16 MB avg. file size)
---|---|---|---
1 GB | ~16 million | ~1 PB | ~64 million
4 GB | ~64 million | ~4 PB | ~256 million
8 GB | ~128 million | ~8 PB | ~512 million
16 GB | ~256 million | ~16 PB | ~1 billion
32 GB | ~512 million | ~32 PB | ~2 billion
For example, a 1 TB file split into 64 MB chunks generates only ~16,000 chunk handles. Even with petabytes of data, the master's in-memory metadata (storing ~64 bytes per chunk) rarely exceeded a few gigabytes. This design ensured that the master could serve metadata requests quickly without becoming a bottleneck.
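The numbers in the table fall out of simple arithmetic. Here is a back-of-envelope sketch, assuming ~64 bytes of master metadata per 64 MB chunk (both figures from the paper); the function name and 16 MB average file size are just the assumptions used in the table.

```python
CHUNK_SIZE = 64 * 1024**2        # 64 MB per chunk
METADATA_PER_CHUNK = 64          # ~64 bytes of in-memory metadata per chunk

def capacity_for_ram(ram_bytes, avg_file_size=16 * 1024**2):
    chunks = ram_bytes // METADATA_PER_CHUNK   # chunks the master can track
    data = chunks * CHUNK_SIZE                 # file data those chunks can hold
    files = data // avg_file_size              # rough file count at 16 MB average
    return chunks, data, files

chunks, data, files = capacity_for_ram(1 * 1024**3)   # 1 GB of master RAM
print(f"{chunks:,} chunks -> {data / 1024**5:.1f} PB -> {files:,} files")
```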
To get an intuition for how big these MBs, GBs, and TBs really are, you can refer to my earlier post on visualizing the storage of a modern computer.
Chunkservers are the machines that store the actual file data. Chunks are stored as plain Linux files, and chunkservers serve read/write requests for them. Each chunk is replicated across multiple chunkservers (three replicas by default) for fault tolerance, so even if one chunkserver dies, two other servers still hold that chunk. Think of chunkservers as distributed nodes responsible for storing 64 MB chunks of data on their local disks.
These chunkservers periodically report their status to the GFS master through what is called a heartbeat. The master tracks which servers are alive and manages replication: if a server dies, the master checks which chunks that server held and initiates re-replication so each chunk gets back to its target replica count. A small sketch of this idea follows.
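This is an illustrative sketch, not the real protocol: the timeout value, data structures, and server names are assumptions, but it shows how missed heartbeats translate into a list of chunks that need new homes.

```python
import time

HEARTBEAT_TIMEOUT = 60  # seconds without a heartbeat before a server counts as dead (assumed value)

last_heartbeat = {"CS1": time.time(), "CS2": time.time(), "CS3": time.time()}
chunks_on_server = {"CS1": ["1378493021847"], "CS2": ["2309487561023"],
                    "CS3": ["5647382910567", "6789234501892"]}

def check_heartbeats(now):
    """Return chunk handles that need re-replication because their host looks dead."""
    to_replicate = []
    for server, last_seen in last_heartbeat.items():
        if now - last_seen > HEARTBEAT_TIMEOUT:
            # The server missed its heartbeat: every chunk it held is now
            # under-replicated and must be copied to a healthy server.
            to_replicate.extend(chunks_on_server.get(server, []))
    return to_replicate
```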
Clients are the applications that add, read, delete, or update data. In more formal terms: they interface with GFS via a client library that communicates with the master and the chunkservers.
How Clients Read a File Using ChunkServers
Note: The master is never in the data path! It only provides metadata, while chunkservers handle the actual data transfer.
Before explaining its architecture, let's first look at the assumptions the Google researchers made when building such a system:
- Hardware fails constantly: the system runs on many inexpensive commodity machines, so component failures are the norm, not the exception.
- Files are huge: multi-GB files are common, and design parameters (like the 64 MB chunk size) are tuned for them.
- Reads are mostly large streaming reads, plus some small random reads.
- Writes are mostly large sequential appends; files are rarely overwritten in place.
- Many clients may append to the same file concurrently (for example, producer-consumer logs), so append semantics matter.
- High sustained bandwidth matters more than low latency for individual operations.
Critics challenged some of these choices, particularly the single-master design and the relaxed consistency model. But the researchers were building a filesystem for their own needs, and these trade-offs underscore GFS's philosophy: optimize for the common case, not the edge case.
A GFS cluster consists of a single master and multiple chunkservers, and it is accessed by multiple clients, as shown in the image below.
GFS Architecture
There is a lot going on in this picture. We can see that the GFS client talks to both the GFS master and the GFS chunkservers. One important thing to note is the legend (at the far right): data transfer happens only between the client and the chunkservers. Interesting, isn't it? The master is kept out of the data path; its involvement in reads and writes is minimized so that it does not become a bottleneck.
So how does the client know where to get the data it needs? A client asks the master which chunkservers it should contact for it to get the required data.
Clients never read or write file data through the master. Instead, a client asks the master which chunkservers it should contact. The client caches this answer for a limited time and interacts with the chunkservers directly for subsequent operations.
Let us understand a basic read flow. Say we want to read a log file named server_logs.txt, which is 300 MB in size. Since GFS stores files in 64 MB chunks, this file is split into 5 chunks:
Chunk Index | Chunk ID | Chunk Servers |
---|---|---|
0 | 1378493021847 | CSA, CSB, CSC |
1 | 9847562304123 | CSB, CSD, CSE |
2 | 5647382910567 | CSC, CSF, CSG |
3 | 8759632145876 | CSD, CSH, CSI |
4 | 3198457219846 | CSE, CSJ, CSK |
CSi denotes the chunkservers. Let's understand the "Read" process through scenarios:
Scenario 1: A simple read
In this scenario, the client wants to read the data at byte 200 MB of server_logs.txt. Here's how it works, step by step (a short code sketch after the steps ties them together):
1. The client determines which chunk holds its data. Since the file is divided into 5 chunks of 64 MB, the breakup is: chunk 0 covers bytes 0-64 MB, chunk 1 covers 64-128 MB, chunk 2 covers 128-192 MB, chunk 3 covers 192-256 MB, and chunk 4 covers 256-300 MB. Byte 200 MB therefore falls in chunk index 3.
2. The client contacts the GFS master with the file name (server_logs.txt) and chunk index (3), and the master responds with two things: the chunk ID (8759632145876) and the chunkservers holding its replicas (CSD, CSH, CSI). Note: at this point, the client caches this information to avoid repeatedly asking the master.
3. With the chunk ID and the chunkserver locations in hand, the client picks the nearest chunkserver (e.g., CSD) and requests the specific byte range within the chunk.
4. The chunkserver receives the request and sends back the requested data.
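Here is a hedged, client-side sketch of those four steps. The two RPC helpers are placeholders that stand in for the real master/chunkserver protocol, and the cache is deliberately simplistic.

```python
CHUNK_SIZE = 64 * 1024**2
metadata_cache = {}   # (filename, chunk_index) -> (chunk_handle, [chunkservers])

def ask_master(filename, chunk_index):
    # Placeholder for the real RPC to the GFS master.
    return ("8759632145876", ["CSD", "CSH", "CSI"])

def read_from_chunkserver(server, handle, chunk_offset, length):
    # Placeholder for the real RPC that streams bytes from a chunkserver.
    return b"\x00" * length

def read(filename, offset, length):
    chunk_index = offset // CHUNK_SIZE             # 200 MB // 64 MB -> chunk index 3
    key = (filename, chunk_index)
    if key not in metadata_cache:                  # only talk to the master on a cache miss
        metadata_cache[key] = ask_master(filename, chunk_index)
    handle, servers = metadata_cache[key]
    chunk_offset = offset % CHUNK_SIZE             # position inside the 64 MB chunk
    # The bytes come straight from a (nearby) chunkserver; the master is bypassed.
    return read_from_chunkserver(servers[0], handle, chunk_offset, length)

data = read("server_logs.txt", 200 * 1024**2, 4096)
```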
Scenario 2: The chunk has moved
It is possible that when the client contacted a chunkserver to read a chunk, the data had already been moved to some other chunkserver. But how can that happen?
It can happen if the chunkserver crashed: the master learns about the crash through the absence of heartbeats and initiates re-replication, moving the data to other chunkservers, while the client is still holding a stale cached location.
But how does the client handle this case? Here are the steps the client takes (sketched in code below):
1. Its read request to the cached chunkserver fails (or times out) because the chunk is no longer there.
2. The client invalidates its cached metadata for that chunk.
3. It asks the master again for the chunk's current locations.
4. It retries the read against one of the new replicas.
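Continuing the hypothetical sketch from scenario 1 (it reuses that snippet's read() helper, metadata_cache, and CHUNK_SIZE), the retry could look like this. ChunkNotFoundError is a made-up error type standing in for whatever failure the chunkserver actually returns.

```python
class ChunkNotFoundError(Exception):
    """Raised (hypothetically) when a chunkserver no longer holds the requested chunk."""

def read_with_retry(filename, offset, length, retries=3):
    for _ in range(retries):
        try:
            return read(filename, offset, length)          # read() from the sketch above
        except ChunkNotFoundError:
            # The cached location is stale: drop it so the next attempt
            # asks the master for the chunk's current replicas.
            metadata_cache.pop((filename, offset // CHUNK_SIZE), None)
    raise IOError("chunk unavailable after retries")
```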
Scenario 3: Reading stale data
It is possible that the chunk the client wanted to read was updated or modified by other clients, in which case the client might read outdated data. GFS handles this with chunk version numbers: whenever the master grants a new lease on a chunk it bumps the chunk's version, so replicas that missed a mutation are detected as stale and are never handed out by the master. A client may still briefly read from a stale replica it has cached, but the window is limited by the cache timeout, and since most files are append-only, the usual effect is reading a premature end of the chunk rather than corrupted data.
Scenario 4: Reading while another client writes
This is somewhat similar to scenario 3. Here the problem is that, say, Client A is writing to server_logs.txt, modifying chunk 3, while at the same time Client B tries to read from chunk 3.
How should GFS handle this scenario? Should it block the read altogether, or return a possibly outdated result? This is where we lose our grip on strict consistency.
Writes in GFS are mostly appends (typically for logs) and do not overwrite existing data. If Client B reads from chunk 3 before Client A finishes writing, it may see partial data; if the write is complete, Client B gets the latest version of the chunk.
Now let's look at the write flow. This is where we go deep into file creation, chunk splitting, and how consistency plays a role.
We can consider a "write" to be a process that changes the state of the system. Such a state change is called a mutation, and the GFS paper defines it as follows:
"A mutation is an operation that changes the contents or metadata of a chunk such as a write or an append operation."
Now, if one changes the contents or metadata of a chunk, all its replicas must also be updated. But how do we ensure that? How do we guarantee consistency, fault tolerance, and efficiency at the same time?
Let's take a simple example of the log file system, where new logs are continuously added to the end of a file.
Imagine a distributed logging system where multiple servers (here clients) are writing logs to a file server_logs.txt in GFS.
Currently, assume the file is 300 MB. This means it is broken into 5 chunks: Chunk 1 → Chunk 5.
Since each chunk is 64 MB, the last chunk (Chunk 5) holds only 44 MB (4 × 64 MB = 256 MB, and 300 − 256 = 44 MB), which leaves 20 MB of free space for new log entries.
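A quick sanity check of that layout (sizes in MB, nothing GFS-specific here):

```python
# Chunk layout for a 300 MB file stored as 64 MB chunks.
file_size_mb, chunk_mb = 300, 64
full_chunks, last_chunk_used = divmod(file_size_mb, chunk_mb)
print(full_chunks, last_chunk_used, chunk_mb - last_chunk_used)  # 4 full chunks, 44 MB used, 20 MB free
```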
Write Flow
Now, let's say a client wants to append a new log entry (500 KB) to server_logs.txt. How will it do it?
1. The client asks the master which chunk is the last chunk of the file and which replica currently holds the lease for it (the primary).
2. The master replies with the chunk's handle, the primary (say Server D), and the secondaries (say Servers H and I). The client caches this.
3. The client pushes the 500 KB of data to all three replicas.
4. The primary assigns the offset for the append, writes the data locally, and forwards the write order to the secondaries.
5. Once the secondaries acknowledge, the primary reports success back to the client.
Now you might ask: instead of just 500 KB, what if we want to append 50 MB of data? The current chunk has only 20 MB free, so the record cannot fit. In that case the primary pads out the remainder of the chunk, the master allocates a new chunk 6, and the append is retried there; the client refreshes its cached metadata so it now points to chunk 6. (Strictly speaking, the paper also caps a single record append at a quarter of the chunk size, i.e. 16 MB, so a 50 MB append would itself be split into several smaller appends.) A sketch of the primary's decision follows.
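This is an illustrative sketch of that primary-side decision, not GFS source code; the function name, return values, and sizes are assumptions chosen to match the scenario above.

```python
CHUNK_SIZE = 64 * 1024**2
MAX_APPEND = CHUNK_SIZE // 4     # the paper caps one record append at 1/4 of the chunk size

def record_append(chunk_used, record):
    """Primary-side decision for an append into the file's last chunk."""
    if len(record) > MAX_APPEND:
        raise ValueError("record too large for a single append; split it first")
    if chunk_used + len(record) <= CHUNK_SIZE:
        return ("APPENDED", chunk_used)          # success, appended at this offset
    # Not enough room: pad the rest of the chunk so all replicas stay aligned,
    # and tell the client to retry on a freshly allocated chunk.
    return ("RETRY_ON_NEW_CHUNK", None)

print(record_append(44 * 1024**2, b"x" * (500 * 1024)))  # the 500 KB entry fits
```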
If the primary fails, its lease simply expires, and the master grants a new lease to one of the existing replicas, making it the new primary. The client then retries the operation with the new primary.
If a secondary fails, the primary won't get its acknowledgment. Because a mutation is only considered successful once every replica has applied it, GFS must handle this failure carefully.
Let's say the primary is D, the secondaries are H and I, and H fails. The primary does not get an ack from H and detects the failure; through the missing acks and missed heartbeats, the master learns that H has failed, marks it unavailable, and stops directing new writes to it.
But since Server I did acknowledge, the data is already present on Servers D and I. Strictly speaking, GFS does not use a majority rule: the paper requires every replica in the chunk's current replica set to apply a mutation before it counts as successful. So the primary reports the failed attempt, and once the master removes H from the replica set, the retried write succeeds on D and I, the two remaining replicas.
Now, all the chunks that were on H need to be re-replicated, so the master initiates re-replication. In our case, say chunk 3 was on server H; the master must bring this under-replicated chunk back to its target replica count. It chooses a new chunkserver, say Server M, and copies the latest chunk data from Server D or I to restore the replication level.
Note: Until re-replication is complete, new writes go only to Servers D and I. Once it is done, the master adds Server M as a new replica, and future writes go to Servers D, I, and M. A small sketch of this re-replication step follows.
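This sketch only shows the replica-selection logic under the assumptions of this scenario; the copy itself is left as a hypothetical RPC in a comment.

```python
def re_replicate(chunk_handle, replicas, dead_server, healthy_servers):
    """Pick a new home for a chunk whose replica died (illustrative only)."""
    live = [s for s in replicas if s != dead_server]               # e.g. ["CSD", "CSI"]
    target = next(s for s in healthy_servers if s not in live)     # e.g. "CSM"
    # copy_chunk(source=live[0], dest=target, chunk=chunk_handle)  # hypothetical RPC
    return live + [target]

print(re_replicate("8759632145876", ["CSD", "CSH", "CSI"], "CSH",
                   ["CSD", "CSI", "CSM"]))   # -> ['CSD', 'CSI', 'CSM']
```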
Does the client need to retry the write? According to the paper, yes: if any replica fails to apply the mutation, the primary reports an error and the GFS client library retries the mutation. The retry is handled inside the client library, so the application itself does not have to do anything, and once the master has dropped H from the replica set (or re-replication has produced a healthy replacement), the retried append succeeds.
Let's recap what we learned so far.
When a client needs to read data, it contacts the master server with the file name and the specific byte offset. The master looks up the file's metadata and returns the chunk identifier along with the list of chunk servers holding that chunk. The client caches this information and then reads directly from one of those chunk servers, bypassing the master for the actual data transfer.
For writing (especially appends), the client first contacts the master to get the current chunk's details. The master selects a primary chunkserver for the target chunk (granting it a lease) and returns the locations of the replicas. The client then pushes the data to the primary and the secondaries (often pipelined along a chain of chunkservers to reduce load). The primary determines the proper offset for the append, writes the data, and instructs the replicas to do the same at that offset. After receiving acknowledgments from all replicas, the primary confirms the write's success back to the client.
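As a final recap, here is a hedged end-to-end sketch of that append path. Every helper is a placeholder for an RPC in the real system; the values they return are taken from the scenario above, and the sketch only shows the order of the steps, not the actual wire protocol.

```python
def master_get_last_chunk(filename):
    # Placeholder: the real master also grants the primary its lease here.
    return ("8759632145876", "CSD", ["CSH", "CSI"])

def push_data(server, handle, data):
    pass  # Placeholder: data is really pipelined chunkserver-to-chunkserver.

def primary_commit(primary, handle, secondaries):
    return 44 * 1024**2   # the offset the primary chose for this append

def append(filename, data):
    # 1. Ask the master for the last chunk, its primary (lease holder), and replicas.
    handle, primary, secondaries = master_get_last_chunk(filename)
    # 2. Push the data to every replica (control flow and data flow are decoupled).
    for server in [primary, *secondaries]:
        push_data(server, handle, data)
    # 3. The primary picks the offset, applies the write, and forwards the same
    #    order to the secondaries; it replies only after they all acknowledge.
    return primary_commit(primary, handle, secondaries)

offset = append("server_logs.txt", b"new log line\n")
```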
This design ensures that data is quickly accessed and reliably stored, even in the face of hardware failures. I hope you get an overall view of how the Google File System read and write flow works. We'll discuss the rest of the paper in the next article.