Feb 12, 2025
Have you ever wondered how Google manages to handle so much data? Think of cloud storage, ever-growing inboxes, and photos and videos shot in 4K and 8K at ever-higher quality. In the early 2000s a single picture was around 10 KB; today even a modest-quality photo easily runs to 5 MB.
We are generating enormous amounts of data, so the question is: how did Google keep up with this load when the internet first started to boom?
With this rapid rise in data, Google needed a storage system built for its own use case. It designed its own filesystem and published a paper on the Google File System (GFS) in 2003.
Unlike traditional file systems, GFS was engineered to handle files ranging from gigabytes to terabytes in size, support concurrent access by thousands of clients, and withstand frequent hardware failures inherent in data centers. This blog delves into the technical architecture, design philosophy, and operational mechanics of GFS, providing a comprehensive exploration of its inner workings.
According to Google:
"We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients."
Before diving into the details, it would make sense to understand what a distributed file system (DFS) is and why we need one.
A Distributed File System (DFS), not the search algorithm that we learn in graphs, is a system where data is stored across multiple computers but appears to the user as a single unified storage. Unlike a local file system on a personal computer, a DFS allows access to files from different machines over a network using authentication credentials.
You might ask: why do we even need a distributed file system? Think of it this way: a modern personal computer usually ships with how much disk space? 1 TB? But what if you are a video editor with far more footage than that single disk can hold? One viable option is an external hard disk, but that comes with risks: if the disk fails, all the data is lost. This is where distributed file systems come in. They offer several advantages:
- Scalability: storage grows by adding more machines instead of bigger disks.
- Fault tolerance: data is replicated, so a single failing machine does not lose your files.
- Shared access: many clients on different machines can read and write the same files over the network.
Enough context; now let's see how the Google File System provides all this functionality. While GFS shares much with earlier distributed file systems, Google favored certain choices and design principles that suit its own scaling needs and application workloads.
Let's start by explaining a few terms that have been heavily used in the paper. After that, we will understand how these interact with each other, the complete flow, and how the design choices have influenced the scale.
Let's start with the Master. It is the brain of GFS. Think of the master as the one who keeps all the information about which file is present on what machine. Since it is a distributed file system, there has to be a mapping of the files to the chunks, and the chunks to the machines, or else how will one know which file is present where in this vast sea of the internet? A better term for this would be the coordinator or manager since it manages metadata (file-to-chunk mappings, chunk locations, access control).
It maintains critical metadata such as:

File-to-Chunk Mapping
```json
{
  "data/user/file": {
    "0": "1378493021847",
    "1": "9847562304123",
    "2": "5647382910567",
    "3": "2309487561023",
    "4": "6789234501892"
  }
}
```
Chunk-to-Chunkserver Mapping
```json
{
  "1378493021847": ["CS1"],
  "9847562304123": ["CS1"],
  "5647382910567": ["CS3"],
  "2309487561023": ["CS2"],
  "6789234501892": ["CS3"]
}
```
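To make these two mappings concrete, here is a minimal, hypothetical sketch of how a master could resolve a (file, byte offset) lookup using them. The dictionaries mirror the JSON above; the `lookup` function and its structure are illustrative assumptions, not the actual GFS implementation.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per chunk

# File path -> chunk index -> chunk handle (as in the first mapping above).
file_to_chunks = {
    "data/user/file": {0: "1378493021847", 1: "9847562304123",
                       2: "5647382910567", 3: "2309487561023",
                       4: "6789234501892"},
}

# Chunk handle -> chunkservers holding it (as in the second mapping above).
chunk_locations = {
    "1378493021847": ["CS1"], "9847562304123": ["CS1"],
    "5647382910567": ["CS3"], "2309487561023": ["CS2"],
    "6789234501892": ["CS3"],
}

def lookup(path, byte_offset):
    """Translate (file, byte offset) into (chunk handle, chunkserver list)."""
    chunk_index = byte_offset // CHUNK_SIZE        # which 64 MB slice of the file?
    handle = file_to_chunks[path][chunk_index]     # file -> chunk handle
    return handle, chunk_locations[handle]         # chunk handle -> servers

print(lookup("data/user/file", 150 * 1024 * 1024))  # byte 150 MB falls in chunk index 2
```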
All this information is kept in memory, and the parts that must survive a crash (the namespace and file-to-chunk mappings) are also persisted to disk; chunk locations are simply re-learned from the chunkservers themselves. You might ask: why keep the data in memory? Isn't it volatile? And isn't memory limited?
To answer the first question: fast lookups. Clients request metadata frequently, so keeping it in memory ensures low latency, reduces disk I/O, and speeds up operations.
Now for the volatility part: yes, memory is volatile, but the master persists metadata in two ways to prevent data loss:
1. An operation log that records every metadata change (file creation, chunk allocation, and so on) and is flushed to disk before the change is acknowledged.
2. Periodic checkpoints of the full metadata state, which keep the operation log from growing without bound.
So, once the master fails or is restarted, it loads metadata from the last checkpoint and replays the log. The checkpoint is in a compact B-tree-like form that can be directly mapped into memory and used for namespace lookup without extra parsing.
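To make that recovery step concrete, here is a hedged sketch of "load the checkpoint, then replay the log". The record format, file paths, and helper names are made up for illustration; the real master uses a compact B-tree-like checkpoint, not JSON.

```python
import json

def recover(checkpoint_path, log_path):
    # 1. Start from the last persisted checkpoint of the metadata.
    with open(checkpoint_path) as f:
        metadata = json.load(f)

    # 2. Replay every logged mutation recorded after that checkpoint.
    with open(log_path) as f:
        for line in f:
            op = json.loads(line)          # e.g. {"op": "create", "path": ...}
            apply_operation(metadata, op)  # hypothetical helper, defined below
    return metadata

def apply_operation(metadata, op):
    if op["op"] == "create":
        metadata[op["path"]] = {}
    elif op["op"] == "add_chunk":
        metadata[op["path"]][op["index"]] = op["handle"]
```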
But this still doesn't answer the question—if we have hundreds of TBs of data, how are we going to fit everything in memory?
Note that we are not storing the files themselves in memory, only the metadata: the mapping of chunks to the machines they live on. Let's break down the memory usage of GFS metadata:
Each chunk requires approximately 64 bytes of metadata (again: metadata, not the file contents), which includes the chunk handle, the locations of its replicas, and its version number. Using this figure, we can extrapolate how much metadata the master can manage for a given amount of RAM.
Memory Size (RAM) | Number of Chunks | Total Data Stored (64 MB per chunk) | Approx. Number of Files (16 MB avg. file size)
---|---|---|---
1 GB | ~16 million | ~1 PB | ~64 million
4 GB | ~64 million | ~4 PB | ~256 million
8 GB | ~128 million | ~8 PB | ~512 million
16 GB | ~256 million | ~16 PB | ~1 billion
32 GB | ~512 million | ~32 PB | ~2 billion
For example, a 1 TB file split into 64 MB chunks generates only ~16,000 chunk handles. Even with petabytes of data, the master's in-memory metadata (storing ~64 bytes per chunk) rarely exceeded a few gigabytes. This design ensured that the master could serve metadata requests quickly without becoming a bottleneck.
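The numbers in the table fall out of simple arithmetic. Here is a back-of-envelope sketch, assuming ~64 bytes of master metadata per 64 MB chunk (both figures from the paper); the function name and 16 MB average file size are just the assumptions used in the table.

```python
CHUNK_SIZE = 64 * 1024**2        # 64 MB per chunk
METADATA_PER_CHUNK = 64          # ~64 bytes of in-memory metadata per chunk

def capacity_for_ram(ram_bytes, avg_file_size=16 * 1024**2):
    chunks = ram_bytes // METADATA_PER_CHUNK   # chunks the master can track
    data = chunks * CHUNK_SIZE                 # file data those chunks can hold
    files = data // avg_file_size              # rough file count at 16 MB average
    return chunks, data, files

chunks, data, files = capacity_for_ram(1 * 1024**3)   # 1 GB of master RAM
print(f"{chunks:,} chunks -> {data / 1024**5:.1f} PB -> {files:,} files")
```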
To get an intuition for how big these MBs, GBs, and TBs really are, you can refer to my earlier post on visualizing the storage of a modern computer.
Chunkservers are the machines that store the actual file data. Chunks are stored as plain Linux files, and chunkservers serve read/write requests for them. Each chunk is replicated across multiple chunkservers (three replicas by default) for fault tolerance, so even if one chunkserver dies, two other servers still hold that chunk. Think of chunkservers as distributed nodes responsible for storing 64 MB chunks of data on their local disks.
These chunkservers periodically report their status to the GFS master through what is called a heartbeat. The master tracks which servers are alive and manages replication: if a server dies, the master checks which chunks that server held and initiates re-replication so each chunk gets back to its target replica count. A small sketch of this idea follows.
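This is an illustrative sketch, not the real protocol: the timeout value, data structures, and server names are assumptions, but it shows how missed heartbeats translate into a list of chunks that need new homes.

```python
import time

HEARTBEAT_TIMEOUT = 60  # seconds without a heartbeat before a server counts as dead (assumed value)

last_heartbeat = {"CS1": time.time(), "CS2": time.time(), "CS3": time.time()}
chunks_on_server = {"CS1": ["1378493021847"], "CS2": ["2309487561023"],
                    "CS3": ["5647382910567", "6789234501892"]}

def check_heartbeats(now):
    """Return chunk handles that need re-replication because their host looks dead."""
    to_replicate = []
    for server, last_seen in last_heartbeat.items():
        if now - last_seen > HEARTBEAT_TIMEOUT:
            # The server missed its heartbeat: every chunk it held is now
            # under-replicated and must be copied to a healthy server.
            to_replicate.extend(chunks_on_server.get(server, []))
    return to_replicate
```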
Clients are the applications that add, read, delete, or update data. In more formal terms: they interface with GFS via a client library that communicates with the master and the chunkservers.
How Clients Read a File Using ChunkServers
Note: The master is never in the data path! It only provides metadata, while chunkservers handle the actual data transfer.
Before explaining its architecture, let's first look at the assumptions the Google researchers made when building such a system:
- Hardware fails constantly: the system runs on many inexpensive commodity machines, so component failures are the norm, not the exception.
- Files are huge: multi-GB files are common, and design parameters (like the 64 MB chunk size) are tuned for them.
- Reads are mostly large streaming reads, plus some small random reads.
- Writes are mostly large sequential appends; files are rarely overwritten in place.
- Many clients may append to the same file concurrently (for example, producer-consumer logs), so append semantics matter.
- High sustained bandwidth matters more than low latency for individual operations.
Critics challenged some of these choices, particularly the single-master design and the relaxed consistency model. But the researchers were building a filesystem for their own needs, and these trade-offs underscore GFS's philosophy: optimize for the common case, not the edge case.
A GFS cluster consists of a single master and multiple chunkservers, and it is accessed by multiple clients, as shown in the image below.
GFS Architecture
There is a lot going on in this picture. We can see that the GFS client talks to both the GFS master and the GFS chunkservers. One important thing to note is the legend (at the far right): data transfer happens only between the client and the chunkservers. Interesting, isn't it? The master is kept out of the data path; its involvement in reads and writes is minimized so that it does not become a bottleneck.
So how does the client know where to get the data it needs? A client asks the master which chunkservers it should contact for it to get the required data.
Clients never read or write file data through the master. Instead, a client asks the master which chunkservers it should contact. The client caches this answer for a limited time and interacts with the chunkservers directly for subsequent operations.
Let us understand a basic read flow. Say we want to read a log file named server_logs.txt, which is 300 MB in size. Since GFS stores files in 64 MB chunks, this file is split into 5 chunks:
Chunk Index | Chunk ID | Chunk Servers |
---|---|---|
0 | 1378493021847 | CSA, CSB, CSC |
1 | 9847562304123 | CSB, CSD, CSE |
2 | 5647382910567 | CSC, CSF, CSG |
3 | 8759632145876 | CSD, CSH, CSI |
4 | 3198457219846 | CSE, CSJ, CSK |
CSi denotes the chunkservers. Let's understand the "Read" process through scenarios:
Scenario 1: A simple read
In this scenario, the client wants to read the data at byte 200 MB of server_logs.txt. Here's how it works, step by step (a short code sketch after the steps ties them together):
1. The client determines which chunk holds its data. Since the file is divided into 5 chunks of 64 MB, the breakup is: chunk 0 covers bytes 0-64 MB, chunk 1 covers 64-128 MB, chunk 2 covers 128-192 MB, chunk 3 covers 192-256 MB, and chunk 4 covers 256-300 MB. Byte 200 MB therefore falls in chunk index 3.
2. The client contacts the GFS master with the file name (server_logs.txt) and chunk index (3), and the master responds with two things: the chunk ID (8759632145876) and the chunkservers holding its replicas (CSD, CSH, CSI). Note: at this point, the client caches this information to avoid repeatedly asking the master.
3. With the chunk ID and the chunkserver locations in hand, the client picks the nearest chunkserver (e.g., CSD) and requests the specific byte range within the chunk.
4. The chunkserver receives the request and sends back the requested data.
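Here is a hedged, client-side sketch of those four steps. The two RPC helpers are placeholders that stand in for the real master/chunkserver protocol, and the cache is deliberately simplistic.

```python
CHUNK_SIZE = 64 * 1024**2
metadata_cache = {}   # (filename, chunk_index) -> (chunk_handle, [chunkservers])

def ask_master(filename, chunk_index):
    # Placeholder for the real RPC to the GFS master.
    return ("8759632145876", ["CSD", "CSH", "CSI"])

def read_from_chunkserver(server, handle, chunk_offset, length):
    # Placeholder for the real RPC that streams bytes from a chunkserver.
    return b"\x00" * length

def read(filename, offset, length):
    chunk_index = offset // CHUNK_SIZE             # 200 MB // 64 MB -> chunk index 3
    key = (filename, chunk_index)
    if key not in metadata_cache:                  # only talk to the master on a cache miss
        metadata_cache[key] = ask_master(filename, chunk_index)
    handle, servers = metadata_cache[key]
    chunk_offset = offset % CHUNK_SIZE             # position inside the 64 MB chunk
    # The bytes come straight from a (nearby) chunkserver; the master is bypassed.
    return read_from_chunkserver(servers[0], handle, chunk_offset, length)

data = read("server_logs.txt", 200 * 1024**2, 4096)
```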
Scenario 2: The chunk has moved
It is possible that when the client contacted a chunkserver to read a chunk, the data had already been moved to some other chunkserver. But how can that happen?
It can happen if the chunkserver crashed: the master learns about the crash through the absence of heartbeats and initiates re-replication, moving the data to other chunkservers, while the client is still holding a stale cached location.
But how does the client handle this case? Here are the steps the client takes (sketched in code below):
1. Its read request to the cached chunkserver fails (or times out) because the chunk is no longer there.
2. The client invalidates its cached metadata for that chunk.
3. It asks the master again for the chunk's current locations.
4. It retries the read against one of the new replicas.
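Continuing the hypothetical sketch from scenario 1 (it reuses that snippet's read() helper, metadata_cache, and CHUNK_SIZE), the retry could look like this. ChunkNotFoundError is a made-up error type standing in for whatever failure the chunkserver actually returns.

```python
class ChunkNotFoundError(Exception):
    """Raised (hypothetically) when a chunkserver no longer holds the requested chunk."""

def read_with_retry(filename, offset, length, retries=3):
    for _ in range(retries):
        try:
            return read(filename, offset, length)          # read() from the sketch above
        except ChunkNotFoundError:
            # The cached location is stale: drop it so the next attempt
            # asks the master for the chunk's current replicas.
            metadata_cache.pop((filename, offset // CHUNK_SIZE), None)
    raise IOError("chunk unavailable after retries")
```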
Scenario 3: Reading stale data
It is possible that the chunk the client wanted to read was updated or modified by other clients, in which case the client might read outdated data. GFS handles this with chunk version numbers: whenever the master grants a new lease on a chunk it bumps the chunk's version, so replicas that missed a mutation are detected as stale and are never handed out by the master. A client may still briefly read from a stale replica it has cached, but the window is limited by the cache timeout, and since most files are append-only, the usual effect is reading a premature end of the chunk rather than corrupted data.
Scenario 4: Reading while another client writes
This is somewhat similar to scenario 3. Here the problem is that, say, Client A is writing to server_logs.txt, modifying chunk 3, while at the same time Client B tries to read from chunk 3.
How should GFS handle this scenario? Should it block the read altogether, or return a possibly outdated result? This is where we lose our grip on strict consistency.
Writes in GFS are mostly appends (typically for logs) and do not overwrite existing data. If Client B reads from chunk 3 before Client A finishes writing, it may see partial data; if the write is complete, Client B gets the latest version of the chunk.
Now let's look at the write flow. This is where we go deep into file creation, chunk splitting, and how consistency plays a role.
We can consider a "write" to be a process that changes the state of the system. Such a state change is called a mutation, and the GFS paper defines it as follows:
"A mutation is an operation that changes the contents or metadata of a chunk such as a write or an append operation."
Now, if one changes the contents or metadata of a chunk, all its replicas must also be updated. But how do we ensure that? How do we guarantee consistency, fault tolerance, and efficiency at the same time?
Let's take a simple example of the log file system, where new logs are continuously added to the end of a file.
Imagine a distributed logging system where multiple servers (here clients) are writing logs to a file server_logs.txt in GFS.
Currently, assume the file is 300 MB. This means it is broken into 5 chunks: Chunk 1 → Chunk 5.
Since each chunk is 64 MB, the last chunk (Chunk 5) holds only 44 MB (4 × 64 MB = 256 MB, and 300 − 256 = 44 MB), which leaves 20 MB of free space for new log entries.
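A quick sanity check of that layout (sizes in MB, nothing GFS-specific here):

```python
# Chunk layout for a 300 MB file stored as 64 MB chunks.
file_size_mb, chunk_mb = 300, 64
full_chunks, last_chunk_used = divmod(file_size_mb, chunk_mb)
print(full_chunks, last_chunk_used, chunk_mb - last_chunk_used)  # 4 full chunks, 44 MB used, 20 MB free
```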
Write Flow
Now, let's say a client wants to append a new log entry (500 KB) to server_logs.txt. How will it do it?
1. The client asks the master which chunk is the last chunk of the file and which replica currently holds the lease for it (the primary).
2. The master replies with the chunk's handle, the primary (say Server D), and the secondaries (say Servers H and I). The client caches this.
3. The client pushes the 500 KB of data to all three replicas.
4. The primary assigns the offset for the append, writes the data locally, and forwards the write order to the secondaries.
5. Once the secondaries acknowledge, the primary reports success back to the client.
Now you might ask: instead of just 500 KB, what if we want to append 50 MB of data? The current chunk has only 20 MB free, so the record cannot fit. In that case the primary pads out the remainder of the chunk, the master allocates a new chunk 6, and the append is retried there; the client refreshes its cached metadata so it now points to chunk 6. (Strictly speaking, the paper also caps a single record append at a quarter of the chunk size, i.e. 16 MB, so a 50 MB append would itself be split into several smaller appends.) A sketch of the primary's decision follows.
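This is an illustrative sketch of that primary-side decision, not GFS source code; the function name, return values, and sizes are assumptions chosen to match the scenario above.

```python
CHUNK_SIZE = 64 * 1024**2
MAX_APPEND = CHUNK_SIZE // 4     # the paper caps one record append at 1/4 of the chunk size

def record_append(chunk_used, record):
    """Primary-side decision for an append into the file's last chunk."""
    if len(record) > MAX_APPEND:
        raise ValueError("record too large for a single append; split it first")
    if chunk_used + len(record) <= CHUNK_SIZE:
        return ("APPENDED", chunk_used)          # success, appended at this offset
    # Not enough room: pad the rest of the chunk so all replicas stay aligned,
    # and tell the client to retry on a freshly allocated chunk.
    return ("RETRY_ON_NEW_CHUNK", None)

print(record_append(44 * 1024**2, b"x" * (500 * 1024)))  # the 500 KB entry fits
```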
If the primary fails, its lease simply expires, and the master grants a new lease to one of the existing replicas, making it the new primary. The client then retries the operation with the new primary.
If a secondary fails, the primary won't get its acknowledgment. Because a mutation is only considered successful once every replica has applied it, GFS must handle this failure carefully.
Let's say the primary is D, the secondaries are H and I, and H fails. The primary does not get an ack from H and detects the failure; through the missing acks and missed heartbeats, the master learns that H has failed, marks it unavailable, and stops directing new writes to it.
But since Server I did acknowledge, the data is already present on Servers D and I. Strictly speaking, GFS does not use a majority rule: the paper requires every replica in the chunk's current replica set to apply a mutation before it counts as successful. So the primary reports the failed attempt, and once the master removes H from the replica set, the retried write succeeds on D and I, the two remaining replicas.
Now, all the chunks that were on H need to be re-replicated, so the master initiates re-replication. In our case, say chunk 3 was on server H; the master must bring this under-replicated chunk back to its target replica count. It chooses a new chunkserver, say Server M, and copies the latest chunk data from Server D or I to restore the replication level.
Note: Until re-replication is complete, new writes go only to Servers D and I. Once it is done, the master adds Server M as a new replica, and future writes go to Servers D, I, and M. A small sketch of this re-replication step follows.
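This sketch only shows the replica-selection logic under the assumptions of this scenario; the copy itself is left as a hypothetical RPC in a comment.

```python
def re_replicate(chunk_handle, replicas, dead_server, healthy_servers):
    """Pick a new home for a chunk whose replica died (illustrative only)."""
    live = [s for s in replicas if s != dead_server]               # e.g. ["CSD", "CSI"]
    target = next(s for s in healthy_servers if s not in live)     # e.g. "CSM"
    # copy_chunk(source=live[0], dest=target, chunk=chunk_handle)  # hypothetical RPC
    return live + [target]

print(re_replicate("8759632145876", ["CSD", "CSH", "CSI"], "CSH",
                   ["CSD", "CSI", "CSM"]))   # -> ['CSD', 'CSI', 'CSM']
```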
Does the client need to retry the write? According to the paper, yes: if any replica fails to apply the mutation, the primary reports an error and the GFS client library retries the mutation. The retry is handled inside the client library, so the application itself does not have to do anything, and once the master has dropped H from the replica set (or re-replication has produced a healthy replacement), the retried append succeeds.
Let's recap what we learned so far.
When a client needs to read data, it contacts the master server with the file name and the specific byte offset. The master looks up the file's metadata and returns the chunk identifier along with the list of chunk servers holding that chunk. The client caches this information and then reads directly from one of those chunk servers, bypassing the master for the actual data transfer.
For writing (especially appends), the client first contacts the master to get the current chunk's details. The master selects a primary chunkserver for the target chunk (granting it a lease) and returns the locations of the replicas. The client then pushes the data to the primary and the secondaries (often pipelined along a chain of chunkservers to reduce load). The primary determines the proper offset for the append, writes the data, and instructs the replicas to do the same at that offset. After receiving acknowledgments from all replicas, the primary confirms the write's success back to the client.
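As a final recap, here is a hedged end-to-end sketch of that append path. Every helper is a placeholder for an RPC in the real system; the values they return are taken from the scenario above, and the sketch only shows the order of the steps, not the actual wire protocol.

```python
def master_get_last_chunk(filename):
    # Placeholder: the real master also grants the primary its lease here.
    return ("8759632145876", "CSD", ["CSH", "CSI"])

def push_data(server, handle, data):
    pass  # Placeholder: data is really pipelined chunkserver-to-chunkserver.

def primary_commit(primary, handle, secondaries):
    return 44 * 1024**2   # the offset the primary chose for this append

def append(filename, data):
    # 1. Ask the master for the last chunk, its primary (lease holder), and replicas.
    handle, primary, secondaries = master_get_last_chunk(filename)
    # 2. Push the data to every replica (control flow and data flow are decoupled).
    for server in [primary, *secondaries]:
        push_data(server, handle, data)
    # 3. The primary picks the offset, applies the write, and forwards the same
    #    order to the secondaries; it replies only after they all acknowledge.
    return primary_commit(primary, handle, secondaries)

offset = append("server_logs.txt", b"new log line\n")
```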
This design ensures that data is quickly accessed and reliably stored, even in the face of hardware failures. I hope you get an overall view of how the Google File System read and write flow works. We'll discuss the rest of the paper in the next article.