Getting to know Git

“git - the stupid content tracker”

– git manual page

Git has taken over source code control. Linus Torvalds has made a convincing case for why you should use it, and the world has been converted. There is a good intro guide for simple git workflows, but I found that a little insight into how git models versions helps me understand the various git commands and workflows.

Under the hood git is a directed graph of source code tree versions. Each version of the source code tree is stored as a tree of cryptographic SHA-1 hashes of the contents of the individual files, which are then combined into one summary hash of the entire source code tree. The SHA-1 keys for individual files map to blobs, which are the compressed contents of the files. In theory two different contents could hash to the same key, but a collision is extremely unlikely, so the hash provides a convenient key for git to keep track of the tree.
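To make the idea of content as a key concrete, here is a minimal sketch using git's low-level hash-object and cat-file commands. The throwaway directory and the variable names (demo_dir, blob_id, same_id) are my own choices for illustration, not part of any workflow described above.

```shell
# Work in a throwaway repository (the location is arbitrary).
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init --quiet

# hash-object -w stores the content as a compressed blob and prints its key.
blob_id=$(echo "Hello World" | git hash-object -w --stdin)
echo "$blob_id"

# The key alone is enough to get back the object type and the original content.
git cat-file -t "$blob_id"
git cat-file -p "$blob_id"

# Hashing the same content again yields the same key (nothing new is stored here).
same_id=$(echo "Hello World" | git hash-object --stdin)
echo "$same_id"
```

Because the key depends only on the content, two files with identical contents share one blob no matter what they are called.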

Using the SHA-1 hashes of files as keys makes it cheap to create a new branch: git can virtually copy the entire tree by updating only the hashes of the files that have changed rather than the whole tree. The hashes of entire trees provide a convenient identifier for the different tree versions and let git reference them through “pointers” like HEAD, branch names, or tags. The individual versions are connected by commits, branches and merges to form a directed acyclic graph that represents the full history of changes. Keeping track of the full graph also helps git merge better than some other popular version control systems, because it knows the revisions where things branched.
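The pointer idea can be checked directly with git rev-parse, which resolves a name to the hash it points at. This is a small sketch in a scratch repository; the branch name experiment and the demo user identity are assumptions made up for the example.

```shell
# Scratch repository with a single commit.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init --quiet
git config user.name "demo user"
git config user.email demo@example.com
echo "Hello World" > README
git add README
git commit --quiet -m "Initial version of README"

# HEAD resolves to the hash of the current commit.
head_id=$(git rev-parse HEAD)

# Creating a branch only writes a new pointer; no files are copied.
git branch experiment
branch_id=$(git rev-parse experiment)

# Both names resolve to the same hash until one of them moves.
echo "$head_id"
echo "$branch_id"
```

The two printed hashes are identical, which is why branching stays cheap regardless of how large the source tree is.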

git's representation of your source code

In this framework git commit stores any files that have changed, recalculates the SHA-1 hashes, and stores a pointer to the previous version of the code via the hash of that previous version. A tag created with git tag is just a pointer to a particular version of the tree via its hash identifier. A merge joins two source code trees and calculates the new hash of the merged source code tree. Understanding how git models things under the hood helps me better understand the myriad commands that have been developed, including the tradeoffs of using the rebase option. Almost all of the commands manipulate the graph in some way.
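One way to see what a commit and a tag actually hold is to print the raw commit object with git cat-file -p. The sketch below assumes a scratch repository with two commits; the tag name v0.1 and the variable names are hypothetical and only there for illustration.

```shell
# Scratch repository with two commits.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init --quiet
git config user.name "demo user"
git config user.email demo@example.com
echo "Hello World" > README
git add README
git commit --quiet -m "Initial version of README"
echo "Hello file one" > file.one
git add file.one
git commit --quiet -m "Committing file one."

# The raw commit lists the hash of its tree and the hash of its parent commit.
git cat-file -p HEAD

# The parent line is the hash of the previous commit.
parent_id=$(git cat-file -p HEAD | sed -n 's/^parent //p')
previous_id=$(git rev-parse HEAD~1)

# A lightweight tag is just another name that resolves to a commit hash.
git tag v0.1
tag_id=$(git rev-parse v0.1)
echo "$tag_id"
```

Note that the first commit in a repository has no parent line at all, which is what makes it the root of the graph.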

what git pull --rebase option is doing

Example using rebase for linear history

mkdir test-project
cd test-project
git init
git config user.name "user one"
git config user.email user_one@example.com
echo "Hello World" > README
git add README
git commit -m "Initial version of README"

# clone into another repository
cd ..
mkdir test-clone
cd test-clone
git clone ../test-project .
git config user.name "user two"
git config user.email user_two@example.com
echo "Hello file two" > file.two
git add file.two
git commit -m "Committing file two." file.two

# move back to test-project
cd ../test-project
echo "Hello file one" > file.one
git add file.one
git commit -m "Committing file one." file.one

cd ../test-clone
# assumes the default branch created by git init is named master
git pull --rebase origin master

# Observe that the history is simply linear:
# --rebase tells git to replay the local commits
# on top of the commits fetched by the pull
git --no-pager log --graph --oneline
* 2c35eff Committing file two.
* e04183c Committing file one.
* 4bebe27 Initial version of README

Example without --rebase to see the merge commit

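# same as above, except instead of git pull --rebase do the following
# --no-rebase forces a merge on newer git versions that refuse to guess;
# --no-commit stops before git records its own merge commit
git pull --no-rebase --no-commit
git commit -m "Merging from origin"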

# The log now shows a merge commit with two parents
git --no-pager log --graph --oneline
*   0e122d1 Merging from origin
|\  
| * dbcfc51 Committing file one.
* | 357cb10 Committing file two.
|/  
* 9b30513 Initial version of README
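The difference between the two histories comes down to how many parents the newest commit has. The following sketch reproduces the diverging history inside a single scratch repository, using a local branch named side in place of the second clone (my own simplification), and then counts the parent lines of the resulting commit.

```shell
# Scratch repository with one shared starting commit.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init --quiet
git config user.name "demo user"
git config user.email demo@example.com
echo "Hello World" > README
git add README
git commit --quiet -m "Initial version of README"

# Diverge: one commit on a side branch, one on the starting branch.
start_branch=$(git symbolic-ref --short HEAD)
git checkout --quiet -b side
echo "Hello file one" > file.one
git add file.one
git commit --quiet -m "Committing file one."
git checkout --quiet "$start_branch"
echo "Hello file two" > file.two
git add file.two
git commit --quiet -m "Committing file two."

# Join the two lines of history with a merge commit.
git merge --quiet -m "Merging from side" side

# A merge commit carries two parent lines; an ordinary commit carries one.
parent_count=$(git cat-file -p HEAD | grep -c '^parent ')
echo "$parent_count"
git --no-pager log --graph --oneline
```

Running the same parent count after a rebase instead of a merge would report a single parent, which matches the straight line in the first log output.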