Understanding How Git Stores Files in a Repository

Overview

Git is one of the most popular version control systems, and one reason for its efficiency is how it stores data. Where older systems such as Subversion record per-file changes, Git records a snapshot of the project’s entire file tree at each commit. In this post, we'll dive into the inner workings of Git, focusing on how it stores files and metadata in a repository.

We’ll cover the following key topics:

  1. Overview of Git’s storage model: snapshots, not differences.
  2. How files are stored in Git using objects: blobs, trees, commits.
  3. The role of the .git directory and its contents.
  4. Git object model: Hashing, compression, and de-duplication.
  5. Understanding the relationship between working directory, staging area, and Git’s internal objects.
  6. Inspecting Git objects with low-level Git commands.

By the end, you’ll have a comprehensive understanding of how Git internally manages your files and why its storage model is so efficient.

1. Git's Storage Model: Snapshots, Not Differences

Most version control systems (VCS), like Subversion (SVN), work by tracking changes (or diffs) in files over time. They store the differences between file versions. Git, however, operates differently. Instead of storing differences, Git stores snapshots of the entire project’s file system each time you make a commit.

Every time you commit, Git records what every tracked file looks like at that moment. If a file hasn’t changed since the last commit, Git doesn’t store it again. Instead, the new snapshot refers to the identical content already stored in the repository.

This model gives Git its speed and flexibility because it can quickly navigate between versions, branches, or changes without recalculating diffs each time.
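The idea can be sketched in a few lines of Python. This is a toy model, not Git's actual code: the names store, snapshot, and changed_paths are made up for illustration, and a plain SHA-1 of the content stands in for Git's real object IDs.

```python
import hashlib

# Toy content-addressed store: object id -> file content.
store = {}

def snapshot(files):
    """Record a snapshot of `files` (path -> bytes) and return path -> id."""
    tree = {}
    for path, content in files.items():
        object_id = hashlib.sha1(content).hexdigest()
        store.setdefault(object_id, content)  # kept once per distinct content
        tree[path] = object_id
    return tree

def changed_paths(old, new):
    """Compare two snapshots by id only; no file content is read."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))

first = snapshot({"README": b"hello\n", "main.py": b"print('v1')\n"})
second = snapshot({"README": b"hello\n", "main.py": b"print('v2')\n"})

print(changed_paths(first, second))  # ['main.py']
print(len(store))                    # 3
```

Both snapshots list every file, yet the unchanged README is stored only once, and finding what changed is a comparison of ids rather than of file contents.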

2. How Files Are Stored in Git: Objects and Types

To understand how Git stores files, we need to explore the Git object model. Git stores content in the form of objects in the .git/objects directory. These objects form the building blocks of the repository, and each object is identified by a hash of its contents: SHA-1 by default, with SHA-256 being introduced as an alternative.

There are four main types of Git objects:

  1. Blob – Represents file content.
  2. Tree – Represents directories (file structures).
  3. Commit – Represents a snapshot of the repository at a specific point.
  4. Tag – Represents tags for specific commits (not our focus in this post).

Let’s break down each object type:

2.1 Blob (Binary Large Object)

A blob in Git represents the content of a file. It is the simplest type of Git object, containing only the file's raw data (no metadata like the file name or permissions). Git writes the blob when you stage the file with git add, and only if no blob with identical content already exists; the commit then refers to that blob through a tree.

Git doesn’t store multiple copies of the same file. If the content is identical, Git will reuse the same blob object for that file across different commits.

To see blob objects, you can run the following command in a Git repository:

git ls-tree <commit-hash>

This lists the top-level files (blobs) and directories (trees) in the commit. Add the -r option to recurse into subdirectories.

2.2 Tree

A tree object represents a directory and contains pointers to both blobs (files) and other tree objects (subdirectories). For each entry it records a name, a file mode (for example, whether a file is executable), and the hash of the blob or tree it points to.

Each tree object corresponds to a directory in your project. For example, if you have a project with multiple directories, Git will create multiple tree objects to represent the directory structure.

The tree object effectively serves as a snapshot of the file system at a particular point in time, including all the files and subdirectories.
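To make the tree format concrete, here is a minimal Python sketch that serializes a tree and computes its ID the way a SHA-1 repository does. The helper names object_id and tree_body are made up for this post, and the sorting is simplified (Git applies a special ordering rule to directory names).

```python
import hashlib

def object_id(obj_type, body):
    """SHA-1 over the header '<type> <size>' + NUL byte, followed by the body."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

def tree_body(entries):
    """Serialize (mode, name, hex_id) entries into a tree object's body.

    Each entry is '<mode> <name>' + NUL + the 20 raw bytes of the id.
    Simplification: entries are sorted by plain name only.
    """
    body = b""
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1]):
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(hex_id)
    return body

readme_id = object_id("blob", b"hello world\n")
root = tree_body([("100644", "README", readme_id)])
print(object_id("tree", root))

# A tree with no entries always hashes to the same well-known id.
print(object_id("tree", tree_body([])))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Note that the file name lives in the tree entry, not in the blob, which is why two files with identical content can share one blob.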

2.3 Commit

A commit object ties together the entire project history. It points to a tree object (the snapshot of the file system at the time of the commit) and stores important metadata, such as:

  • The author and committer, each with a timestamp.
  • The commit message.
  • The parent commit or commits (none for the first commit, two or more for a merge).

Each commit references a root tree, which in turn references blobs and other trees. Following the parent links from one commit to the next is what forms the history.

Example of the structure of a commit:

Commit
└── Tree (Root Directory)
    ├── Blob (File 1)
    ├── Blob (File 2)
    └── Tree (Subdirectory)
        ├── Blob (File 3)
        └── Blob (File 4)

Each commit represents a complete snapshot of the file structure at the time of that commit, which makes it efficient for Git to roll back or check out a specific version.
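A commit object is itself just a small piece of text. The sketch below builds that text in Python and hashes it as a SHA-1 repository would. The function names and the author identity are hypothetical, and the tree ID used is the well-known ID of an empty tree, chosen only so the example is self-contained.

```python
import hashlib

def object_id(obj_type, body):
    """SHA-1 over the header '<type> <size>' + NUL byte, followed by the body."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

def commit_body(tree_id, parent_ids, author, committer, message):
    """Lay out a commit: header lines, a blank line, then the message."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parent_ids]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
who = "A U Thor <author@example.com> 1700000000 +0000"  # hypothetical identity

root = commit_body(EMPTY_TREE, [], who, who, "Initial commit")
root_id = object_id("commit", root)

# The second commit records the first one's id as its parent.
child = commit_body(EMPTY_TREE, [root_id], who, who, "Second commit")
child_id = object_id("commit", child)

print(child.decode())
```

Because the parent's ID is part of the child's content, changing any earlier commit changes the ID of every commit that follows it.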

3. The .git Directory: Git’s Heart and Brain

The .git directory is the central location for all Git's internal data. It contains everything Git needs to track changes and store history for your repository. Let’s explore its contents:

3.1 .git/objects/

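This directory holds all Git objects (blobs, trees, commits, tags). Each object is stored as a compressed file named by its hash: the first two hex characters become a subdirectory and the remaining characters the file name (e.g., a blob for a file might be stored as .git/objects/ab/cdef123...). Git may later combine many objects into packfiles under .git/objects/pack/ to save space.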

3.2 .git/refs/

The refs directory contains pointers to the heads of branches and tags. For example, when you switch to a branch, Git looks at the reference in .git/refs/heads/ to find the commit it points to.

3.3 .git/index

The index (also called the staging area) is a key part of how Git stages changes before committing. The index is a binary file that lists the paths, and the blob hashes for them, that will make up the next commit.

3.4 .git/HEAD

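The HEAD file is a reference that points to the current branch or commit you have checked out. It normally names a branch (for example, ref: refs/heads/main); in a detached HEAD state it holds a commit hash directly. It helps Git keep track of where you are in the project’s history.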
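How HEAD and refs fit together can be sketched in Python. The function resolve_head is a hypothetical helper written for this post, and it is deliberately simplified: it only reads loose ref files and ignores .git/packed-refs, where a real repository may also keep refs. The demo builds a throwaway directory with a placeholder commit ID instead of touching a real repository.

```python
import tempfile
from pathlib import Path

def resolve_head(git_dir):
    """Return (branch or None, commit id or None) for the given .git directory."""
    git_dir = Path(git_dir)
    head = (git_dir / "HEAD").read_text().strip()
    if not head.startswith("ref: "):
        return None, head  # detached HEAD: the file holds a commit id directly
    ref = head[len("ref: "):]
    ref_file = git_dir / ref
    # Simplification: a missing file may mean an unborn branch or a packed ref.
    commit_id = ref_file.read_text().strip() if ref_file.is_file() else None
    prefix = "refs/heads/"
    branch = ref[len(prefix):] if ref.startswith(prefix) else ref
    return branch, commit_id

with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp)
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "HEAD").write_text("ref: refs/heads/main\n")
    # Placeholder id, not a real commit.
    (git_dir / "refs" / "heads" / "main").write_text("0123456789abcdef0123456789abcdef01234567\n")
    print(resolve_head(git_dir))
```

On a real repository you would pass the path to its .git directory.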

4. Git Object Model: Hashing, Compression, and De-duplication

Git’s use of hashing, compression, and de-duplication is what makes its storage model so efficient. Here's a closer look:

4.1 SHA-1 Hashing (or SHA-256)

Git uses cryptographic hashing (SHA-1 by default, moving towards SHA-256) to identify every object in the repository. The hash is computed from the object's type, size, and content, which makes the object store content-addressable: the same content always yields the same identifier.

  • Every file (blob), directory (tree), and commit gets its own unique SHA-1 hash.
  • If two files have the same content, Git will store only one copy of the file, because the content’s hash will be the same.
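You can reproduce a blob's ID outside of Git. The following Python sketch assumes a SHA-1 repository; blob_id is a name chosen for this post, not a Git API.

```python
import hashlib

def blob_id(content: bytes) -> str:
    """Hash 'blob <size>' + NUL + content, the way Git identifies a blob."""
    header = b"blob " + str(len(content)).encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

print(blob_id(b""))               # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(blob_id(b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

# Identical content gives an identical id, regardless of the file's name.
print(blob_id(b"hello world\n") == blob_id(b"hello world\n"))  # True
```

These values should match what git hash-object reports for files with the same content in a SHA-1 repository.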

4.2 Object Compression

Git stores all objects in a compressed form to save disk space. When a file is staged, Git compresses its content (together with a short header) before writing it as a blob object. Git uses zlib compression, which offers a good balance of speed and space.
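The sketch below shows, under simplifying assumptions, how an object could be written to and read back from disk: header plus content, zlib-compressed, in a file whose path is derived from the hash. The helpers write_loose_object and read_loose_object are hypothetical names, and the code writes to a temporary directory rather than a real .git/objects.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path

def write_loose_object(objects_dir, obj_type, body):
    """Compress '<type> <size>' + NUL + body and store it under its hash."""
    raw = f"{obj_type} {len(body)}".encode() + b"\x00" + body
    hex_id = hashlib.sha1(raw).hexdigest()
    path = Path(objects_dir) / hex_id[:2] / hex_id[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(raw))
    return hex_id

def read_loose_object(objects_dir, hex_id):
    """Decompress a stored object and split it into (type, body)."""
    path = Path(objects_dir) / hex_id[:2] / hex_id[2:]
    raw = zlib.decompress(path.read_bytes())
    header, _, body = raw.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type, body

with tempfile.TemporaryDirectory() as tmp:
    content = b"hello world\n" * 100
    oid = write_loose_object(tmp, "blob", content)
    on_disk = (Path(tmp) / oid[:2] / oid[2:]).stat().st_size
    print(f"{len(content)} bytes of content, {on_disk} bytes on disk")
    print(read_loose_object(tmp, oid)[0])  # blob
```

Repetitive content like this compresses very well; real-world savings depend on the file.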

4.3 Object De-duplication

One of Git’s core principles is avoiding redundant data storage. If the content of a file hasn’t changed across commits, Git reuses the existing blob for that file instead of creating a new one. This reduces the amount of data that needs to be stored.

5. Working Directory, Staging Area, and Commit History: Their Roles in Storing Files

To understand how Git stores files, it’s crucial to differentiate between the working directory, the staging area, and the commit history.

5.1 Working Directory

The working directory is the actual state of the files on your file system. These are the files you see and work with in your project folder. Changes made here are not recorded by Git until you stage and commit them.

5.2 Staging Area (Index)

The staging area (or index) is an intermediate space where changes are recorded before committing. When you run git add, Git updates the staging area with the changes you've made to the working directory. This gives you the flexibility to stage only a subset of your changes before committing.

  • The staging area keeps track of what will go into the next commit.
  • It is stored in .git/index as a binary file.

5.3 Commit History

Once changes are staged and you run git commit, Git creates a new commit object. The commit points to a tree object representing the state of the files at the time of the commit. The commit is added to your commit history, allowing you to revisit or roll back to that point in the future.
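The interplay of the three areas can be modeled with a toy class. ToyRepo and its methods are invented for this post and leave out almost everything real Git does (trees, parents, file modes); the point is only to show which step touches which area.

```python
import hashlib

class ToyRepo:
    def __init__(self):
        self.working = {}  # path -> bytes: files as they are on disk
        self.index = {}    # path -> blob id: what the next commit will contain
        self.objects = {}  # blob id -> bytes: the object store
        self.commits = []  # each commit: a message plus a frozen copy of the index

    def add(self, path):
        """Stage a file: write its blob now and record the id in the index."""
        content = self.working[path]
        blob = hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()
        self.objects[blob] = content
        self.index[path] = blob

    def commit(self, message):
        """Snapshot the index, not the working directory."""
        self.commits.append({"message": message, "tree": dict(self.index)})
        return len(self.commits) - 1

repo = ToyRepo()
repo.working["a.txt"] = b"one\n"
repo.working["b.txt"] = b"two\n"
repo.add("a.txt")
first = repo.commit("Add a.txt")
print(sorted(repo.commits[first]["tree"]))  # ['a.txt']

repo.working["a.txt"] = b"one, edited\n"    # edited but not staged
second = repo.commit("Nothing new staged")
print(repo.commits[first]["tree"] == repo.commits[second]["tree"])  # True
```

The unstaged b.txt and the unstaged edit to a.txt never reach a commit, because commit only looks at the index.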

6. Inspecting Git Objects with Low-Level Git Commands

You can explore Git's internal workings with some low-level commands. Let’s take a look at a few useful commands to inspect the objects in a repository:

6.1 git cat-file

The git cat-file command lets you inspect the content and type of Git objects.

To inspect a specific object, such as a commit:

git cat-file -p <commit-hash>

This will display the content of the commit, including metadata (author, commit message) and the tree object it points to.
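To show what that output consists of, here is a small Python sketch that splits a commit's text into its header fields and message. parse_commit is a hypothetical helper, the sample commit is made up, and multi-line headers such as signatures are not handled.

```python
def parse_commit(body: bytes):
    """Split a commit object's body into (headers, message)."""
    header_text, _, message = body.decode().partition("\n\n")
    headers = {"parent": []}  # a commit can have zero, one, or several parents
    for line in header_text.split("\n"):
        key, _, value = line.partition(" ")
        if key == "parent":
            headers["parent"].append(value)
        else:
            headers[key] = value
    return headers, message

# Made-up sample in the layout that cat-file -p prints for a commit.
sample = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author A U Thor <author@example.com> 1700000000 +0000\n"
    b"committer A U Thor <author@example.com> 1700000000 +0000\n"
    b"\n"
    b"Initial commit\n"
)

headers, message = parse_commit(sample)
print(headers["tree"])    # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
print(headers["parent"])  # []
print(message.strip())    # Initial commit
```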

6.2 git ls-tree

To list the files and directories (blobs and trees) in a specific commit or tree, use:

git ls-tree <commit-hash>

This command gives you a look at the tree structure and how blobs are stored in a particular commit.
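Under the hood, a tree object is a compact binary list. The following Python sketch decodes such a list into rows resembling the command's output. parse_tree is a name made up for this post, and the sample bytes are assembled by hand with placeholder IDs rather than read from a repository.

```python
def parse_tree(body: bytes):
    """Decode a tree body into (mode, type, hex id, name) tuples."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\x00", space)
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode()
        hex_id = body[nul + 1:nul + 21].hex()  # 20 raw bytes for SHA-1
        kind = {"40000": "tree", "160000": "commit"}.get(mode, "blob")
        entries.append((mode.zfill(6), kind, hex_id, name))
        pos = nul + 21
    return entries

# Hand-built sample: one file and one subdirectory (ids are placeholders).
sample = (
    b"100644 README\x00" + bytes.fromhex("3b18e512dba79e4c8300dd08aeb37f8e728b8dad")
    + b"40000 src\x00" + bytes.fromhex("4b825dc642cb6eb9a060e54bf8d69288fbee4904")
)

for mode, kind, hex_id, name in parse_tree(sample):
    print(f"{mode} {kind} {hex_id}\t{name}")
```

Directories are stored with mode 40000 and displayed with a leading zero, which is why the sketch pads the mode to six digits.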

6.3 git rev-parse

The git rev-parse command helps you get the SHA-1 hash of references (branches, tags, HEAD, etc.) or verify the state of your repository. For example:

git rev-parse HEAD

This command will return the hash of the current commit you have checked out.

Conclusion

Understanding how Git stores files and manages history is key to mastering Git's internals and making the most of its powerful version control features. Git’s object model of blobs, trees, and commits, combined with its storage of snapshots rather than diffs, makes it both efficient and flexible.

Key Takeaways:

  1. Git stores data as snapshots, not differences.
  2. Git’s core objects are blobs (files), trees (directories), and commits (snapshots).
  3. The .git directory contains all the internal data Git needs to manage your repository.
  4. Git uses cryptographic hashing, compression, and de-duplication to efficiently manage data.
  5. The working directory, staging area, and commit history each play a distinct role in how Git stores and tracks files.

By diving deep into Git’s internal storage model, you gain a better understanding of how to use Git effectively, resolve issues, and appreciate its power.
