Recently I work with the legacy code. It’s complex, released in many different versions, and definitely needs to be refactored. But how to do refactoring from Git’s point of view? To make the history clear and neat? I will try to tell you how we do this.
The system is developed by many programmers. It consists of probably a few millions of lines of code. Its size is not important – it’s big and old. This leads to another problem – average knowledge about the system and domain is low. Which means that code history is extremely important. The reason why the line of code was added is described in a related ticket/bug. Usually before correcting it, one has to read this description. Without history, work would be even harder.
Squash commits
We have 6 or 7 main release branches (including the master containing the recent version). Bugs have to be fixed in different versions, and I guess that even 30 feature branches are merged to corresponding release branches daily.
To avoid all commits from feature branches being presented in the history, squash merging has been introduced – probably from the beginning of the Git repository. This limits the number of irrelevant noise in history and it works well in general.
DevOps and Pull Requests
We also use Azure DevOps with all its goodies. Most important for this post is the Pull Request (PR) feature. When the bug is fixed on the bug branch, PR is created. Then teammates could do a code review, testers can test the fixed application and if everything is ok, PR can be completed.
Completion means that the bug branch is merged back to the main release branch. And merging policy states that it should be done by squash merging.
This works like a charm in all standard cases. But there are at least two exceptions:
- Additional refactoring
- Additional refactoring with file splitting
Where this procedure won’t work. Let me show why.
Additional refactoring – problem
As I mentioned, the system is of type „legacy” and requires to be refactored. Refactoring such complicated code is not easy and very limited in most cases. Usually, we refactor the code gradually. The first step is the simplest and contains everything that ReSharper offers out of the box – names could be changed, unused code removed, comments limited or deleted, things like that.
This is simple, but we need it to do more radical changes. As we have thousands of classes, each with hundreds or thousands of lines of code, the amount of work even for such simple refactoring is huge and all team members should do this as often as possible. This means that each time bugfix is done, refactoring of the adjacent code should also occur.
Now, imagine that you fix a bug in the class with 5000 lines. The fix is, for example, 3-liner. The refactoring affects almost 5000 lines. How to do a code review? The reviewer would not know what is the fix, and what is refactoring. But this is not a very big problem, because the author can be asked about it. What if we examine the commit log and see the fix-commit that contains 5000 lines? What was the actual fix and what just refactoring? This situation breaks the very important rule of single responsibility (SOLID applies not only to the code).
Additional refactoring – solution
Fortunately, we’ve found the solution to this problem. The simplest would be to do a fix, merge to main, start a new branch with refactoring. But this has a serious drawback: even though refactoring doesn’t change functionality, some bugs can be introduced. It would be good to test it also. As we have limited testing capabilities, more efficient is to test fix and refactoring at the same time.
The better from our perspective is this:
The developer starts the fix branch, prepares a fix and then starts a refactoring branch from the fix branch. Performs refactoring, builds it and passes to the testers. They test (fix+refactoring) and if everything is OK, then two merges are performed – one with a plain fix, second with fix + refactoring. Detailed procedure steps are depicted in the picture below:
The solution isn’t complete though
This works well if refactoring doesn’t contain file splitting/merging. As you may know, squash merges can clean history related to newly extracted files. I wrote about it some time ago. That means, we can’t use the procedure for these more complex scenarios (not true, as we’ll see at the end). Fortunately, DevOps PRs can be completed with four different types of merge:
We have policies set that prevent from merging of a type different than „Squash”. But what to choose in this more complicated case? The answer is not easy.
4 merge types
Squash merge
Takes all commits from a feature branch, squashes them together and cherry-picks to the main branch. In Git’s console, this may be achieved with the command git merge –squash FeatureBranch
Standard merge
Default in many cases. It takes the last commit from the feature branch, the last from the main branch and the common ancestor of both commits. These three commits are used to three-way merge and the resulting commit is added to the main. Corresponding Git’s command is git merge FeatureBranch
Rebase and fast forward merge
In short, it’s a kind of automated cherry-picking. It takes all commits from the feature branch and applies them to the main branch. In the console, you would just write the command git cherry-pick (commonParentSHA)..FeatureBranch
Semi-linear merge
A mix of rebasing and standard merge. Commits from the feature branch are rebased onto the main branch. Then the rebased commits are merged to the main. It works like 4 commands chain: git checkout FeatureBranch, git rebase main, git checkout main, git merge FeatureBranch –no-ff. People like it because it keeps all commits from the feature branch (so, progress is clearly visible), and they are not rebased directly to the main, but the „merge path” is visible.
Which to choose?
Let’s go back to our original problem. We have our two-branching refactor strategy and we want to split files. Which option should be chosen? We know that definitely not the squash merge. It’s a good time to do an experiment and see what will happen. I prepared the repository that looks like this:
Let’s imagine that bug has been discovered. English language code should not be en_US but en_GB. Additionally, the great idea of placing supported language names together with their codes turned out not to be so great. Now we want to split the file to store languages and codes separately.
During bug fixing, someone added one commit to the master. Now the repo looks like in the picture below. BugFix and BugFix_Refactoring branches were made in the way I described in the previous section. As we can see, two new files split from SupportedLanguages have full history info. We want to preserve this information after the BugFix merged back to master.
Squash merge
After squash merge repository looks like in the picture below. Newly extracted files have a history of the commits. Note that firstly I squash merged BugFix branch, then BugFix_Refactoring, according to our procedure I described above. After merging, I added a new commit to clearly show how the exemplary extracted file’s history looks. Later this commit will be added after BugFix merge to show a slightly more complicated scenario – for squash commits it doesn’t matter, the history line is straight as an arrow.
We have linear history on the master (note the gap) and a full history of the newly extracted file changes. So why not use squash merging? Because if you delete the feature branch, all history is gone (the gap in the history indicated that it may end like that):
So, the one option is to use squash merge but leave feature branches alive. This, unfortunately, would leave us with thousands of abandoned feature branches containing tons of invalid, not-compiling commits. The main history will be straight and clear, but there is a need to deal with all those abandoned branches.
Standard merge
I reverted the repo to the initial state and squash merged BugFix, but I used a standard, three-way merge to join BugFix_Refactoring with the master. The picture below shows the repository and history of the new file after the merge and deletion of all side branches.
That also looks quite good. Ideally would be not to have commit „en_US code changed to en_GB (bug fixed)” on the side branch because it contains the same change as commit Merge branch 'BugFix’ on master.
The pros of this approach are that we have a history of the file and we don’t have invalid commits, because developers know that their work from the feature branch will be visible forever in the master’s history. Also, the Russian language added commit doesn’t affect the history of our new files. The Russian was added to our files only in the merge commit – which is great from one perspective (but may produce problems when the number of conflicts is big and something may be incidentally omitted).
Rebase and fast forward merge
At first glance, we can see that issue that I noticed previously disappears here. Merge branch 'BugFix’ commit is not duplicated anymore on the „side branch”. The branches history is also more linear.
BUT, there are „buts”. Because Russian language added commit was added right after Merge branch 'BugFix’, the rebasing process is harder to perform (many conflicts may occur). Additionally, commits after rebase differs from the original commits. Take a look at bottom part of the picture above. The Russian language is now added also in commits from BugFix_Refactoring branch, even though they were created before this language was added. History is not accurate then, which may be a problem in complex cases. Remember that this testing environment is simple, opposite to real development environments.
So rebasing gives us a more linear history that may be inaccurate. Is it something that we’re looking for? Maybe there is something better?
Semi-linear merge
Semi-linear and rebase merges differs by the last command using. To do both we have to:
- Rebase topic branch to the main branch
- Checkout main branch
- Merge topic branch
But the last merge is of fast-forward type for rebasing, and non-fast-forward (–no-ff option) for semi-linear merge. This time our branches look like:
Again we see the full history of the file (note that new commits also contain entries with the Russian language – which may be a bit misleading while reading history), the structure is slightly different, but we got rid of the unnecessary commit that was left during the standard merge.
Comparison of the merges
History lines
All history lines contain commits from feature branches. Two rebase merges though cleaned history a bit and haven’t duplicated commit that introduced an actual fix for the bug, so they are better in this sense. Rebase with fast forward places some feature commits directly onto the main branch while standard merge and no fast forward don’t do this. I think that the history line looks best after the third option (semi-linear in Azure DevOps).
Easiness
Unfortunately, history look is not the only parameter when we want to evaluate different merge types. Very important is the easiness of use. And I must say that rebase type merges are not good with that. Standard merge is much easier to be completed. In simple cases, as the one shown above, there is no problem, but try to rebase something more complicated (to be shown later).
Ultimate comparison – hard case
To be absolutely sure, I examined something more complicated. I prepared quite a complex refactoring branch and wanted to merge them in three different ways. The branch looks like this:
And the results of different merges:
Unfortunately, here I cannot say that semi-linear merge is the best option in terms of „cleanness” – „Fix proposal” commit is still there. And rebasing was a pain, even if I used SmartGit to do this. I can’t imagine how to deal with it with console tools only. Of course, I’m not a Git-master, but I doubt that so complex rebase would be nice and easy even for such a ninja. After a long time of rebasing and making all commits correct, I don’t want to repeat that again. If I would choose now, only standard merge or squash merge can be considered. Time and frustration wasting for rebasing complex branches is too big and overwhelms the benefits.
Squash to the rescue
So, rebasing is hard, the standard merge is ugly. But what if we could do a squash merge and have a history? As you remember it was possible when the feature branch was not deleted after squash merge. Is it the same with more complex branches? Of course:
As we can see, squash merge looks super-clean. It also preserves the history, if the feature branch is not deleted. So maybe instead of trying complicated merges that are not easy and don’t give good results, we should stick to the squash merging? We could agree that refactoring branches will be named in a special way that indicates they shouldn’t be deleted and that’s it. Potential pitfalls:
- fragility – if the branches will be deleted, the history will be gone
- many abandoned branches (if named correctly, may be OK)
But huge benefits:
- linear history
- very easy to perform
- robust, simple procedure
I know that before I have doubts, but when I performed rebasing I totally changed my mind. I think that we should stick to squash merging. Maybe you have a different opinion and can share your thoughts?