Last time, when I wrote about Git history and why it sometimes disappears, the article concluded with the concise statement that this is simply the way Git works. Today I’d like to talk about Git internals in a more specific way (but still quite briefly).
So why does Git history disappear when feature branches are squash-committed?
To answer this question, we have to check how Git stores its data. First of all, the data is stored in the hidden .git directory inside the Git repository. Each time you clone a repo, this folder is created, and it stores the whole history of the repository from its very beginning. When we open the .git folder, we’ll see names that may be familiar, like “refs”, “HEAD” or “index”.
Maybe I will talk about them in the future, but for now, to make this investigation as easy as possible, we will start from a new repository.
- Create a new folder, NewRepo
- Open Git Bash in that folder and execute the command git init
- Open the .git folder: you will see that there are not as many items as in a cloned, existing repository. The rest of the files will be created later (or never, like smartgit.config for example). Anyway, the most important ones (config, description, HEAD) are there.
- Open the .git/objects folder. Except for two empty subdirectories, info and pack, there is nothing else. Yet.
So, with the new repository, let’s create a new text file and again examine .git/objects. Just type:
>hello.txt echo Hello world!
A new file, hello.txt, will appear in the repository. We can now stage it with git add hello.txt. After staging, a new file named index appears in the .git folder, and a new folder named cd is added to .git/objects. It contains a file named 0875583aabe89ee197ea133980a9085d08e497. This leads to the first important observation:
Git identifies objects by their hashes
The basic types of objects inside Git are blobs (representing files) and trees (representing directories). When we staged hello.txt, its SHA-1 checksum was computed using this formula:
echo -e 'blob 13\0Hello world!' | shasum
#cd0875583aabe89ee197ea133980a9085d08e497 *-
As we can see, Git takes the prefix “blob ”, then the length of the newly created file in bytes, followed by a null character (\0), and then the content of the file. What’s important: it does not use the filename! It also doesn’t use a timestamp, so the hash for the same file content will be identical on your machine and mine. For commits, it won’t be.
Then the first two characters of the checksum, cd, are used to name the folder, and the other 38 characters name the file placed inside this new folder (this is done for performance reasons, to avoid storing all the objects inside one big folder).
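The formula can also be reproduced outside of Git. Below is a minimal Python sketch, assuming hello.txt holds exactly the 13 bytes of Hello world! plus a trailing newline. The helper names git_blob_id and loose_object_path are made up for this illustration and are not part of Git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Hash file content the way Git hashes a blob: header, NUL, content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """First 2 hex characters name the folder, the remaining 38 name the file."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


if __name__ == "__main__":
    # Assumption: the file content is 'Hello world!' followed by one newline.
    blob_id = git_blob_id(b"Hello world!\n")
    print(blob_id)
    print(loose_object_path(blob_id))
```

Note that the file name is never passed in. Two files with identical content map to the same blob, and you can compare the printed path with the folder and file found in .git/objects.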
When we open the blob in Notepad++, we’ll see some gibberish, because Git stores objects zlib-compressed. To examine its real content, we have to use the command:
git cat-file -p cd0875583aabe89ee197ea133980a9085d08e497
#Hello world!
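The gibberish is the zlib-compressed form of the header plus content described earlier. Here is a small Python sketch of what reading such a file involves. The function name read_loose_object is hypothetical, and the sample bytes are compressed in the script itself so that it runs without a repository.

```python
import zlib


def read_loose_object(raw: bytes) -> tuple[str, bytes]:
    """Decompress a loose object and split it into (type, content)."""
    data = zlib.decompress(raw)
    header, _, content = data.partition(b"\0")
    kind, _, size = header.partition(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return kind.decode("ascii"), content


if __name__ == "__main__":
    # Stand-in for the bytes of the file under .git/objects/cd/:
    # we build and compress the object ourselves for the demo.
    raw = zlib.compress(b"blob 13\0Hello world!\n")
    print(read_loose_object(raw))
```

To try it on a real object, read the file in binary mode and pass the bytes to the function.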
Now that we know something about blobs, let’s move on to trees.
How to structure blobs? Trees to the rescue.
Let’s add another file in the same root repository folder, and one more in a nested directory. We’ll do this in the same manner as before:
>AdditionalFile.txt echo Additional text #add new file
mkdir NestedFolder #create nested folder
>NestedFolder/NestedFile.txt echo Another text in nested file #create file inside it
git add . #stage new files
git diff --name-only --cached #see staged changes
git write-tree #create tree, hash will be displayed
git cat-file -p 95f41e6911324fe360d5b2d2853c3853698843b0 #examine new tree using the hash produced by previous command
#100644 blob 3e8a292c23c691ef7e14e09e8de251d057e524b3 AdditionalFile.txt
#040000 tree 21bbea5c01158520d17c9443aa79ef3fceb945b4 NestedFolder
#100644 blob cd0875583aabe89ee197ea133980a9085d08e497 hello.txt
What’s interesting here is that we can finally see file names! It’s not the blobs that hold them, but the trees. What’s more, our new tree contains a reference to another tree, the one representing NestedFolder.
git cat-file -p 21bbea #examine nested tree
#100644 blob db4280127e4c8fc54586f60264fb2fb46261fcb9 NestedFile.txt
So, a nice tree structure has been created.
Out of curiosity, I also checked the formula used to compute a tree’s hash. It’s more complicated than the one for blobs. All I can tell is that it also starts with a fixed prefix, “tree SIZE\0”, but then all of the tree’s content is appended, so the full form is not as simple as before.
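For the curious, here is a hedged Python sketch of that fuller form. It assumes the usual on-disk layout of a tree: each entry is the mode, a space, the name, a null byte, and then the 20 raw bytes of the referenced object’s SHA-1 (not the 40-character hex text). It also assumes that directories are stored with mode 40000, although cat-file prints it as 040000. The function names are invented for this example.

```python
import hashlib


def tree_body(entries: list[tuple[str, str, str]]) -> bytes:
    """Serialize (mode, name, hex object id) entries into a raw tree body."""
    def sort_key(entry):
        mode, name, _ = entry
        # Entries are ordered by name bytes; directories sort as if they ended in '/'.
        return (name + "/" if mode == "40000" else name).encode("utf-8")

    body = b""
    for mode, name, object_id in sorted(entries, key=sort_key):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(object_id)  # 20 raw bytes
    return body


def git_tree_id(entries: list[tuple[str, str, str]]) -> str:
    """Hash a tree: 'tree SIZE', NUL, then the serialized entries."""
    body = tree_body(entries)
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    # Entries copied from the cat-file listings in this article.
    nested = git_tree_id([
        ("100644", "NestedFile.txt", "db4280127e4c8fc54586f60264fb2fb46261fcb9"),
    ])
    root = git_tree_id([
        ("100644", "AdditionalFile.txt", "3e8a292c23c691ef7e14e09e8de251d057e524b3"),
        ("40000", "NestedFolder", nested),
        ("100644", "hello.txt", "cd0875583aabe89ee197ea133980a9085d08e497"),
    ])
    print(nested)
    print(root)
```

If the layout assumptions hold, the two printed IDs can be compared with the hashes that git write-tree and git cat-file reported for NestedFolder and the root tree.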
Blob->Tree->Commit
It won’t surprise you that commits are quite similar to trees and blobs. Let’s create one and examine it.
git commit -m 'First commit' #simplest command to commit previously staged content
git log #to check how the master looks like now - it has only this new commit
#commit 5b0b6ec97366a66446410e984c5e52a3ea1698ef (HEAD -> master)
#Author: Pawel Szczygielski pawel.szczygielski@
#Date: Wed Feb 10 05:21:25 2021 +0100
As we can see, the commit is identified by an ID (a SHA-1 hash) and contains information about the author and the creation date (which is why your hashes will differ from mine). Let’s check whether the commit contains anything else.
git cat-file -p 5b0b6ec97 #again cat-file allows to see more, notice shortened commit's hash
#tree 95f41e6911324fe360d5b2d2853c3853698843b0 - that's something more!
#author Pawel Szczygielski pawel.szczygielski@
#committer Pawel Szczygielski pawel.szczygielski@ 1612935449 +0100
We see that our commit contains a reference to the tree, which in turn contains references to the files and folders that we added. This means that, given a commit’s ID, we can see the work that was committed.
There are other entities in Git, but for today’s purposes it suffices to know about blobs, trees and commits.
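To illustrate why commit hashes differ between machines while blob hashes do not, here is a simplified Python sketch of how a commit could be serialized and hashed. It assumes a plain commit with identical author and committer and no extra headers. The person, e-mail and function names are placeholders, not values from this repository.

```python
import hashlib


def commit_body(tree_id, who, timestamp, tz, message, parent_id=None) -> bytes:
    """Build the text of a simple commit object (same author and committer)."""
    lines = [f"tree {tree_id}"]
    if parent_id is not None:
        lines.append(f"parent {parent_id}")  # the very first commit has no parent
    lines.append(f"author {who} {timestamp} {tz}")
    lines.append(f"committer {who} {timestamp} {tz}")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")


def git_commit_id(body: bytes) -> str:
    """Hash a commit: 'commit SIZE', NUL, then the commit text."""
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    tree = "95f41e6911324fe360d5b2d2853c3853698843b0"
    who = "Jane Doe <jane@example.com>"  # placeholder identity
    first = git_commit_id(commit_body(tree, who, 1612935449, "+0100", "First commit"))
    later = git_commit_id(commit_body(tree, who, 1612935450, "+0100", "First commit"))
    print(first)
    print(later)  # same tree and message, one second later: a different ID
```

The tree ID is just one line of text inside the commit, so the same content committed by a different person or at a different moment gets a different commit hash.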
What do commits really contain?
Examination of just one commit is not enough. Let’s prepare the next one.
>hello.txt echo 'Hello world!, version 2' #change hello.txt
git commit -am 'Hello world v.2' #stage and commit using one command
git cat-file -p fec88254 #examine commit to obtain tree ID
git cat-file -p 9e01b9 #examine tree
#100644 blob 3e8a292c23c691ef7e14e09e8de251d057e524b3 AdditionalFile.txt
#040000 tree 21bbea5c01158520d17c9443aa79ef3fceb945b4 NestedFolder
#100644 blob f36de3e282f7963bb4f87268e4bff76aebe40032 hello.txt
git cat-file -p 21bbea5
#100644 blob db4280127e4c8fc54586f60264fb2fb46261fcb9 NestedFile.txt
Let’s compare the trees from both commits:
Both trees contain references to the same sub-tree representing NestedFolder. They also reference the same blob for AdditionalFile.txt. BUT the blob representing hello.txt is different.
In case this is not clear: trees reuse existing objects whenever possible. In other words, when I committed for the second time, Git noticed that only one file had changed, and a new blob was created only for that file. The other files and folders were untouched, so tree No. 2 could reference the existing objects.
By the way, the .git/objects folder now contains 9 items: 4 blobs, 3 trees and 2 commits. Every step we took is reflected as a file in the .git folder.
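The reuse follows directly from addressing objects by their hash. Here is a toy Python model of it, where a dictionary stands in for .git/objects. The names store, put and write_tree are invented for this sketch, and the file contents are my guess at what the commands in this article produced.

```python
import hashlib

store = {}  # hypothetical in-memory stand-in for .git/objects


def put(kind: str, body: bytes) -> str:
    """Store an object under its hash; identical content lands on the same key."""
    data = kind.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\0" + body
    object_id = hashlib.sha1(data).hexdigest()
    store[object_id] = kind
    return object_id


def write_tree(directory: dict) -> str:
    """Recursively store a {name: bytes or nested dict} snapshot as blobs and trees."""
    entries = []
    for name, value in directory.items():
        raw_name = name.encode("utf-8")
        if isinstance(value, dict):
            entries.append((raw_name + b"/", b"40000", raw_name, write_tree(value)))
        else:
            entries.append((raw_name, b"100644", raw_name, put("blob", value)))
    body = b""
    for _, mode, raw_name, object_id in sorted(entries, key=lambda e: e[0]):
        body += mode + b" " + raw_name + b"\0" + bytes.fromhex(object_id)
    return put("tree", body)


if __name__ == "__main__":
    first = {
        "hello.txt": b"Hello world!\n",
        "AdditionalFile.txt": b"Additional text\n",
        "NestedFolder": {"NestedFile.txt": b"Another text in nested file\n"},
    }
    second = {**first, "hello.txt": b"Hello world!, version 2\n"}
    tree_1, tree_2 = write_tree(first), write_tree(second)
    print(tree_1 != tree_2)  # True: the root trees differ
    print(len(store))        # 7: four blobs and three trees, nothing duplicated
```

Writing the second snapshot adds only two objects, the new hello.txt blob and the new root tree. Everything else is already in the store under the same key.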
find .git/objects -type f
#.git/objects/21/bbea5c01158520d17c9443aa79ef3fceb945b4
#.git/objects/3e/8a292c23c691ef7e14e09e8de251d057e524b3
#.git/objects/5b/0b6ec97366a66446410e984c5e52a3ea1698ef
#.git/objects/95/f41e6911324fe360d5b2d2853c3853698843b0
#.git/objects/9e/01b9c7abd9b77fce06c2c615e9c1e743f27ee1
#.git/objects/cd/0875583aabe89ee197ea133980a9085d08e497
#.git/objects/db/4280127e4c8fc54586f60264fb2fb46261fcb9
#.git/objects/f3/6de3e282f7963bb4f87268e4bff76aebe40032
#.git/objects/fe/c882548f73aa0a7961326744a0eeda1772d8d6
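The listing shows only hashes, not object types. One possible way to get a per-type count is to decompress each file and read the first word of its header, as in the Python sketch below. It assumes all objects are still loose (nothing has been packed yet), and count_loose_objects is a name made up for this example.

```python
import os
import zlib
from collections import Counter


def count_loose_objects(objects_dir: str) -> Counter:
    """Count loose objects per type by reading each object's header."""
    counts = Counter()
    for folder in sorted(os.listdir(objects_dir)):
        path = os.path.join(objects_dir, folder)
        # Object folders have 2-character names; this skips 'info' and 'pack'.
        if len(folder) != 2 or not os.path.isdir(path):
            continue
        for file_name in os.listdir(path):
            with open(os.path.join(path, file_name), "rb") as handle:
                data = zlib.decompress(handle.read())
            kind = data.split(b" ", 1)[0].decode("ascii")
            counts[kind] += 1
    return counts


if __name__ == "__main__":
    objects_dir = os.path.join(".git", "objects")
    if os.path.isdir(objects_dir):
        print(dict(count_loose_objects(objects_dir)))
```

Run from the root of the repository, it prints how many blobs, trees and commits are stored as loose files.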
What next?
We now know how Git stores changes. From the previous posts we also know that when we split a file just by extracting part of it, the history of changes stays with the bigger file, and the smaller, extracted file will not have it.
This is a Git feature, and we could try to tweak the behaviour of the git blame command using the -M and -C parameters. But in practice it’s troublesome. It’s much easier to just split files in a proper way and not rely on behaviours that cannot be easily explained.
In one of the next posts, we will try to use today’s knowledge to see why history is gone even if we split files correctly.