$ git add -p Localizable.strings

When using git I’m a believer in small, meaningful commits – as my colleague says: “Little commits, pushed often”. In many cases though, this doesn’t necessarily match how I – or I’m sure others – tend to work. While debugging a problem you may find yourself fixing other things or making tweaks which are important but not necessarily related. Personally I may find myself looking at a diff, ready to commit, realising that the changes within are part of separate stories.

I use the term “story” in a narrative, rather than agile sense. Meaningful, well thought-out commits can tell a tale about the engineering process, sharing valuable knowledge of decisions or trade-offs as they were made. A well written commit message is far more valuable than documented code. If the name of a method or its parameters don’t betray its purpose, then the naming is bad. If the method body doesn’t itself describe the functionality then it probably needs refactoring. Documentation goes out of date, but commit messages are always valid as they are by their very nature tied to the state of the code. They make git-blame useful for more than just “who broke this”.

git add -p

All of this to say that I’m a big fan of interactive staging in git. Being able to select and stage individual hunks, or even lines, is wonderful and I use it all the time. It’s also a great way to review every single change that is going into the repository to ensure it is still necessary - a sort-of personal code review. Did you leave an #import in that you’re no longer using? You’re far more likely to notice it while interactively staging.
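To make hunk selection concrete, here is a small sketch in a throwaway repository. The file name numbers.txt and the demo identity are made up for illustration, and the y and n answers are piped in only so the example runs unattended; normally you would type them at the prompt.

```shell
# Throwaway repository with a 20-line file (all names are illustrative only).
repo=$(mktemp -d)
git -C "$repo" init -q
seq 1 20 > "$repo/numbers.txt"
git -C "$repo" add numbers.txt
git -C "$repo" -c user.name=demo -c user.email=demo@example.invalid commit -q -m 'Add numbers'

# Two unrelated edits, far enough apart to land in separate hunks.
{ echo one; seq 2 19; echo twenty; } > "$repo/numbers.txt"

# Answer "y" to the first hunk and "n" to the second.
printf 'y\nn\n' | git -C "$repo" add -p numbers.txt > /dev/null

staged=$(git -C "$repo" diff --cached)    # holds only the first edit
unstaged=$(git -C "$repo" diff)           # still holds the second edit
```

Committing at this point would record the first edit on its own, leaving the second in the working copy for a separate commit.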

Localizable.strings

Unfortunately not everything can be staged interactively. With binary assets it becomes an all-or-nothing affair and you end up being greeted by this:

$ git diff en.lproj/Localizable.strings
diff --git a/en.lproj/Localizable.strings b/en.lproj/Localizable.strings
index ff37e30..535a260 100644
Binary files a/en.lproj/Localizable.strings and b/en.lproj/Localizable.strings differ

But wait. Why is Localizable.strings¹, a pure text file, showing as binary? Apple’s documentation on String Resources explains:

Note: It is recommended that you save strings files using the UTF-16 encoding, which is the default encoding for standard strings files. It is possible to create strings files using other property-list formats, including binary property-list formats and XML formats that use the UTF-8 encoding, but doing so is not recommended.

The problem is that the core git diff tool doesn’t handle UTF-16 data, so any tools that depend on it – for instance git add -p – will just give up.
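One way to see why git gives up: as far as I understand its heuristic, git treats content as binary when it finds a NUL byte near the start, and UTF-16 stores every ASCII character with one. A minimal sketch, using a made-up strings entry and a hypothetical count_nuls helper:

```shell
tmp=$(mktemp -d)
printf '"greeting" = "Hello";\n' > "$tmp/utf8.strings"
iconv -f utf-8 -t utf-16 "$tmp/utf8.strings" > "$tmp/utf16.strings"

# Count the NUL bytes in a file (hypothetical helper for this demo).
count_nuls() { tr -cd '\000' < "$1" | wc -c | tr -d ' '; }

nuls_utf8=$(count_nuls "$tmp/utf8.strings")
nuls_utf16=$(count_nuls "$tmp/utf16.strings")
echo "NUL bytes: utf-8=$nuls_utf8 utf-16=$nuls_utf16"
```

The UTF-8 copy contains no NUL bytes at all, while the UTF-16 copy has one for every ASCII character.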

Now, sometime in recent history Xcode started supporting UTF-8 encoded .strings, converting them to UTF-16 at compile time. Depending on your situation you may be able to convert all of your project .strings files to UTF-8, commit them and get on with your life.
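If you go that route, the conversion itself might look something like the sketch below. It assumes every .strings file under the project is currently UTF-16 with a BOM; the project directory and its single entry are invented so the example is self-contained.

```shell
# Stand-in project with one UTF-16 strings file (illustrative content).
project=$(mktemp -d)
mkdir -p "$project/en.lproj"
printf '"greeting" = "Hello";\n' | iconv -f utf-8 -t utf-16 > "$project/en.lproj/Localizable.strings"

# Rewrite each strings file in place as UTF-8, via a temporary file.
find "$project" -name '*.strings' | while IFS= read -r path; do
    iconv -f utf-16 -t utf-8 "$path" > "$path.tmp" && mv "$path.tmp" "$path"
done

converted=$(cat "$project/en.lproj/Localizable.strings")
```

Running it twice would mangle files that are already UTF-8, so treat it as a one-off migration rather than something to leave in a build script.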

Unfortunately, although Xcode now supports UTF-8, lots of localisation tooling including the venerable genstrings still operates on UTF-16. If we use these tools we have a problem. In my case, the localisation service we use deprecated their old utility (which allows specifying the encoding), replacing it with one which only outputs UTF-16.

.gitattributes

I discovered that others had encountered the same problem, but the solution only provided readable diff output. With a .gitattributes file it is possible to associate certain attributes with files matching a naming pattern when performing git operations. In this case the diff attribute points at a shell command which git diff runs over both the working copy and the repository copy. The output from these commands is diffed instead of the raw file contents. This can yield many interesting results but isn’t quite what I’m looking for.
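For reference, the diff-only approach can be sketched with git's textconv setting. The driver name utf16strings is my own invention, and the fixed utf-16 source encoding is an assumption about the files involved.

```shell
repo=$(mktemp -d)
git -C "$repo" init -q

# Route *.strings through a diff driver (hypothetical name) that converts to UTF-8.
printf '*.strings diff=utf16strings\n' > "$repo/.gitattributes"
git -C "$repo" config diff.utf16strings.textconv 'iconv -f utf-16 -t utf-8'

printf '"greeting" = "Hello";\n' | iconv -f utf-8 -t utf-16 > "$repo/Localizable.strings"
git -C "$repo" add .gitattributes Localizable.strings
git -C "$repo" -c user.name=demo -c user.email=demo@example.invalid commit -q -m 'Add strings'

# Change the file, still as UTF-16, and ask for a diff.
printf '"greeting" = "Hi";\n' | iconv -f utf-8 -t utf-16 > "$repo/Localizable.strings"
readable=$(git -C "$repo" diff)
```

The diff now shows text rather than "Binary files differ", but the repository still stores UTF-16, so this only changes what is displayed.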

I’d never heard of .gitattributes before so I decided to dig a little deeper and it wasn’t long before I found something far more promising. Along with diff there is an attribute named filter. Filter follows a similar idea to diff except that it provides two commands instead of one – clean and smudge. Git operations which move content between the working copy and the repository are piped through these filter commands. clean runs any time a working copy is going to be committed (when changes are added to the index) and smudge is used whenever content is being loaded into the working copy (during a checkout or reset operation).

Filter attributes are perfect for our needs. We can keep the repository copy of the strings file UTF-8 encoded and the working copy as UTF-16 with a filter attribute converting between them. Because git filters the working copy before diffing against the repository, we get the benefits of being able to incrementally stage files while working with a UTF-16 representation of the data.

That’s the theory, let’s take a look at how it works in practice.

First the easy bit. Create a .gitattributes file in your repository and add the following to configure all strings files to be handled with the utf16 filter.

*.strings filter=utf16

Next we need to actually define the filter. This is done in the git-config. We have this configured local to the repository as part of our bootstrap, ensuring that UTF-16 doesn’t get committed by accident. It would work just as well at the global level if you work on lots of different projects. The relevant section of .git/config looks like this.

[filter "utf16"]
	clean = iconv -sc -f $(file -b --mime-encoding %f | sed -e s/be//) -t utf-8
	smudge = iconv -sc -f utf-8 -t $(([ -f %f ] && (file -b --mime-encoding %f | sed -e s/be//)) || echo \"utf-16\")

iconv does most of the magic here. If you haven’t come across it before it’s a utility to transform text between encodings. Let’s break down the clean command first.

Clean

The purpose of the clean command is to convert the working copy representation into a format that will be stored in the repository by cleaning it up for storage. In this case we want to transform the UTF-16 file on disk into UTF-8 which git’s tools are able to cope with. At its simplest this can be accomplished with iconv -f utf-16 -t utf-8. This works perfectly if you already have UTF-16 files in your working copy, but there are several cases where you might have a UTF-8 file on disk instead (e.g. after a fresh clone). In this case iconv will blindly read the UTF-8 as UTF-16 and happily present you with a wall of Traditional Chinese characters!

We always want to convert to UTF-8, but the representation we’re reading from can vary, so we use file -b --mime-encoding %f to identify the encoding of the file on disk and use that for the -f parameter instead. %f will be substituted with the filename by git, -b prevents the filename being prepended to the output and --mime-encoding outputs only the encoding, for instance utf-16be. This is almost what we want - and it will work - but it will include the BOM in the UTF-8 output, because with an explicit byte order iconv treats the BOM as an ordinary character. I didn’t really want this, so by removing the trailing be with sed I end up with utf-16. UTF-8 files always output utf-8 so there’s no problem there.

Finally, -sc ensures that errors are silenced and unrecognised characters are discarded.
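The effect of the sed step and the BOM difference can be checked in isolation. This is a small sketch with a hand-built big-endian sample; the hex helper is a made-up name for this demo, and the byte values in the comments are what I would expect from the common iconv implementations.

```shell
# What the sed step does to the names reported by `file`.
from_be=$(echo utf-16be | sed -e s/be//)     # utf-16
from_utf8=$(echo utf-8 | sed -e s/be//)      # utf-8

# A tiny big-endian UTF-16 input: BOM (FE FF), "A", newline.
sample=$(mktemp)
printf '\376\377\000A\000\n' > "$sample"

# Hypothetical helper: print stdin as a compact hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

bom_kept=$(iconv -f utf-16be -t utf-8 "$sample" | hex)      # efbbbf410a
bom_dropped=$(iconv -f utf-16 -t utf-8 "$sample" | hex)     # 410a
```

The leading ef bb bf in the first result is the BOM carried over into UTF-8; the second conversion consumes it while working out the byte order.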

Smudge

The smudge command is almost the inverse of the clean command with one important difference. When the file doesn’t yet exist on disk - possible if you’re switching between branches or going through history - the file command fails with output which then breaks iconv. We get around this by first checking whether the file exists and defaulting to utf-16 for the encoding if it doesn’t.
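To watch both halves work together, here is a round trip in a throwaway repository. It is only a sketch: the filter commands are simplified stand-ins with fixed encodings rather than the detecting versions above, and the strings entry and demo identity are invented.

```shell
repo=$(mktemp -d)
git -C "$repo" init -q
printf '*.strings filter=utf16\n' > "$repo/.gitattributes"

# Simplified stand-ins for the real commands: fixed encodings, no detection.
git -C "$repo" config filter.utf16.clean 'iconv -f utf-16 -t utf-8'
git -C "$repo" config filter.utf16.smudge 'iconv -f utf-8 -t utf-16'

# The working copy is UTF-16, as localisation tooling would leave it.
printf '"greeting" = "Hello";\n' | iconv -f utf-8 -t utf-16 > "$repo/Localizable.strings"
git -C "$repo" add .gitattributes Localizable.strings
git -C "$repo" -c user.name=demo -c user.email=demo@example.invalid commit -q -m 'Add strings'

# clean: the blob stored in the repository is plain UTF-8.
stored=$(git -C "$repo" cat-file -p HEAD:Localizable.strings)

# smudge: deleting the file and checking it out again recreates it as UTF-16.
rm "$repo/Localizable.strings"
git -C "$repo" checkout -q -- Localizable.strings
restored=$(iconv -f utf-16 -t utf-8 "$repo/Localizable.strings")
```

Both stored and restored should read back as the original entry, which is what lets git diff and git add -p treat the file as text.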

Setting it up

Although the .gitattributes file exists in the repository, the filter commands exist in the git configuration which is individual to each machine and copy of the repository, within .git/config. As the project tooling depends on the filter behaviour being present I added an additional step to our bootstrap scripts which adds the filter commands to the local git config.

update_git_config ()
{
    git config --local filter.utf16.clean 'iconv -sc -f $(file -b --mime-encoding %f | sed -e s/be//) -t utf-8'
    git config --local filter.utf16.smudge 'iconv -sc -f utf-8 -t $(([ -f %f ] && (file -b --mime-encoding %f | sed -e s/be//)) || echo "utf-16")'
}

You could include the filters in your global config to ensure they’re always available (I haven’t personally done this yet), but given how others would depend on this behaviour I think it’s best to ensure it will be present in all working copies.
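A bootstrap script could also verify that the filter really is in place before anyone commits. The check below is a sketch of that idea; check_utf16_filter is a hypothetical helper, not part of our actual tooling, and the filter commands are shortened for the demo.

```shell
# Hypothetical helper: succeed only if both halves of the utf16 filter
# are set in the local config of the repository at "$1".
check_utf16_filter () {
    git -C "$1" config --local --get filter.utf16.clean > /dev/null &&
    git -C "$1" config --local --get filter.utf16.smudge > /dev/null
}

repo=$(mktemp -d)
git -C "$repo" init -q

if check_utf16_filter "$repo"; then before=configured; else before=missing; fi

# Shortened filter commands, standing in for the bootstrap step.
git -C "$repo" config --local filter.utf16.clean 'iconv -f utf-16 -t utf-8'
git -C "$repo" config --local filter.utf16.smudge 'iconv -f utf-8 -t utf-16'

if check_utf16_filter "$repo"; then after=configured; else after=missing; fi
echo "before=$before after=$after"    # before=missing after=configured
```

Failing the bootstrap when the check reports missing is one way to stop UTF-16 slipping into the repository from a freshly cloned copy.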

If you’ve found this post useful or have any suggestions please send me an email at this domain. Feel free to put anything before the @ - be creative!

¹ Localizable.strings is a file used to store locale-specific strings in Cocoa apps.