When using git I’m a believer in small, meaningful commits – as my colleague says: “Little commits, pushed often”. In many cases though, this doesn’t necessarily match how I – or I’m sure others – tend to work. While debugging a problem you may find yourself fixing other things or making tweaks which are important but not necessarily related. Personally I may find myself looking at a diff, ready to commit, realising that the changes within are part of separate stories.
I use the term “story” in a narrative, rather than agile sense. Meaningful, well thought-out commits can tell a tale about the engineering process, sharing valuable knowledge of decisions or trade-offs as they were made. A well written commit message is far more valuable than documented code. If the name of a method or its parameters don’t betray its purpose, then the naming is bad. If the method body doesn’t itself describe the functionality then it probably needs refactoring. Documentation goes out of date, but commit messages are always valid as they are by their very nature tied to the state of the code. They make git-blame useful for more than just “who broke this”.
git add -p
All of this to say that I’m a big fan of interactive staging in git. Being able
to select and stage individual hunks, or even lines is wonderful and I use it
all the time. It’s also a great way to review every single change that is going
into the repository to ensure it is still necessary - a sort-of personal
code-review. Did you leave an #import in that you’re no longer using? You’re
far more likely to notice it while interactively staging.
Localizable.strings
Unfortunately not everything can be staged interactively. With binary assets it becomes an all-or-nothing affair: git reports the file as binary and offers no hunks to pick from.
But wait. Why is Localizable.strings [1], a pure text file, showing as binary? Apple’s documentation on String Resources explains:
Note: It is recommended that you save strings files using the UTF-16 encoding, which is the default encoding for standard strings files. It is possible to create strings files using other property-list formats, including binary property-list formats and XML formats that use the UTF-8 encoding, but doing so is not recommended.
The problem is that the core git diff tool doesn’t handle UTF-16 data, so any
tools that depend on it – for instance git add -p – will just give up.
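To see the behaviour for yourself, this sketch commits a UTF-16 strings file to a scratch repository and then changes it. The file contents are invented for the example; the point is only that git diff falls back to a binary notice instead of showing the changed line.

```shell
cd "$(mktemp -d)"
git init -q
git config user.email "you@example.com"
git config user.name "Example"

# Write a UTF-16 file, the encoding Apple's documentation recommends.
printf '"greeting" = "Hello";\n' | iconv -f utf-8 -t utf-16 > Localizable.strings
git add Localizable.strings
git commit -q -m "Add strings"

printf '"greeting" = "Hello there";\n' | iconv -f utf-8 -t utf-16 > Localizable.strings

# The last line of output reads:
# Binary files a/Localizable.strings and b/Localizable.strings differ
git diff
```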
Now, sometime in recent history Xcode started supporting UTF-8 encoded
.strings files, converting them to UTF-16 at compile time. Depending on your
situation you may be able to convert all of your project .strings files to
UTF-8, commit them and get on with your life.
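If that route is open to you, the conversion itself can be a short loop. This is only a sketch: the function name is made up, it only touches files that the file utility reports as UTF-16, and it overwrites them in place, so try it on a clean working copy first.

```shell
# Hypothetical helper: re-encode every UTF-16 .strings file under a directory as UTF-8.
convert_strings_to_utf8() {
    find "$1" -name '*.strings' -type f | while IFS= read -r path; do
        # Leave anything that is not UTF-16 untouched.
        case "$(file -b --mime-encoding "$path")" in
            utf-16*) ;;
            *) continue ;;
        esac
        # Write to a temporary file first so a failed conversion keeps the original.
        if iconv -f utf-16 -t utf-8 "$path" > "$path.tmp"; then
            mv "$path.tmp" "$path"
        else
            rm -f "$path.tmp"
            echo "skipped $path" >&2
        fi
    done
}
```

Running convert_strings_to_utf8 . from the project root and committing the result would be the whole migration.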
Unfortunately, although Xcode now supports UTF-8, lots of localisation tooling,
including the venerable genstrings, still operates on UTF-16. If we use these
tools we have a problem. In my case, the localisation service we use deprecated
their old utility (which allowed specifying the encoding), replacing it with
one which only outputs UTF-16.
.gitattributes
I discovered that others had encountered the same problem, but the
solution only provided readable diff output. With a .gitattributes file it is
possible to associate certain attributes with files matching a naming pattern
when performing git operations. In this case a diff attribute names a diff
driver whose textconv command, a shell command, is executed over both the
working copy and repository copy when running git diff. The output from these
commands is diffed instead of the raw file contents. This can yield many
interesting results but isn’t quite what I’m looking for.
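For reference, that diff-only approach looks roughly like this. The driver name utf16text is my own invention and the sketch assumes iconv is on the PATH. It makes git diff readable, but it does nothing for interactive staging.

```shell
cd "$(mktemp -d)"
git init -q
git config user.email "you@example.com"
git config user.name "Example"

# Route .strings files through a diff driver (the name "utf16text" is arbitrary).
echo '*.strings diff=utf16text' > .gitattributes
# git appends the path of the file to convert; the command writes text to stdout.
git config diff.utf16text.textconv 'iconv -f utf-16 -t utf-8'

printf '"greeting" = "Hello";\n' | iconv -f utf-8 -t utf-16 > Localizable.strings
git add .gitattributes Localizable.strings
git commit -q -m "Add strings"

printf '"greeting" = "Hello there";\n' | iconv -f utf-8 -t utf-16 > Localizable.strings

# The diff now shows the changed line rather than a binary notice.
git diff
```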
I’d never heard of .gitattributes before so I decided to dig a little deeper
and it wasn’t long before I found something far more promising. Along with
diff there is an attribute named filter. Filter follows a similar idea to
diff except that it provides two commands instead of one – clean and smudge.
Git operations which move content between the working copy and the repository
are piped through these filter commands. clean runs any time working copy
content is going to be committed (when changes are added to the index) and
smudge is used whenever content is being loaded into the working copy (during
a checkout or reset operation).
Filter attributes are perfect for our needs. We can keep the repository copy of the strings file UTF-8 encoded and the working copy as UTF-16 with a filter attribute converting between them. Because git filters the working copy before diffing against the repository, we get the benefits of being able to incrementally stage files while working with a UTF-16 representation of the data.
That’s the theory, let’s take a look at how it works in practice.
First the easy bit. Create a .gitattributes file in your repository and add a
line that configures all strings files to be handled with the utf16 filter.
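A minimal sketch of that step, assuming the filter is named utf16 as in the text and that every file ending in .strings should pass through it. The scratch repository exists only so the example can be run anywhere; in a real project you would append the line from the repository root.

```shell
cd "$(mktemp -d)" && git init -q   # scratch repository for the example

# Associate the utf16 filter with every .strings file.
echo '*.strings filter=utf16' >> .gitattributes

# Ask git which filter it would apply to a given path.
git check-attr filter -- Localizable.strings
```

The last command should report utf16 as the filter for Localizable.strings, even before the filter itself has been defined.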
Next we need to actually define the filter. This is done in the git config. We
have this configured local to the repository as part of our bootstrap, ensuring
that UTF-16 doesn’t get committed by accident. It would work just as well at the
global level if you work on lots of different projects. The relevant section of
.git/config defines the filter’s clean and smudge commands.
iconv does most of the magic here. If you haven’t come across it before, it’s a
utility to transform text between encodings. Let’s break down the clean command
first.
Clean
The purpose of the clean command is to convert the working copy representation
into a format that will be stored in the repository by cleaning it up for
storage. In this case we want to transform the UTF-16 file on disk into UTF-8
which git’s tools are able to cope with. At its simplest this can be
accomplished with iconv -f utf-16 -t utf-8. This works perfectly if you
already have UTF-16 files in your working copy, but there are several cases
where you might have a UTF-8 file on disk instead (e.g. after a fresh clone). In
this case iconv will blindly read the UTF-8 as UTF-16 and happily present you
with a wall of Traditional Chinese characters!
We always want to convert to UTF-8, but the representation we’re reading
from can vary so we use file -b --mime-encoding %f to identify the encoding
of the file on disk and use that for the -f parameter instead. %f will be
substituted with the filename by git, -b prevents the filename being
prepended to the output and --mime-encoding makes file output only the
encoding, for instance utf-16be. This is almost what we want - and it will
work - but it will include the BOM in the UTF-8 output. I didn’t really want
this, so by removing the trailing be with sed I end up with utf-16. UTF-8
files always output utf-8 so there’s no problem there.
Finally, -sc ensures that errors are silenced (-s) and unrecognised characters
are discarded (-c).
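Put together, the clean step described above might look like the following shell function. It is a sketch rather than the exact filter command: strings_clean is a hypothetical name, its path argument stands in for git’s %f, and it strips a trailing le as well as be on the assumption that file may report either byte order.

```shell
# Hypothetical stand-in for the clean command: content arrives on stdin, and the
# file on disk ($1, playing the role of %f) is inspected only to detect the encoding.
strings_clean() {
    # "utf-16be" or "utf-16le" becomes "utf-16" so that iconv consumes the BOM.
    encoding=$(file -b --mime-encoding "$1" | sed -e 's/utf-16[bl]e/utf-16/')
    iconv -f "$encoding" -t utf-8 -sc
}

# Example: a UTF-16 working copy comes out as UTF-8 without a BOM.
cd "$(mktemp -d)"
printf '"greeting" = "Hello";\n' | iconv -f utf-8 -t utf-16 > Localizable.strings
strings_clean Localizable.strings < Localizable.strings
```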
Smudge
The smudge command is almost the inverse of the clean command with one
important difference. When the file doesn’t yet exist on disk - possible if
you’re switching between branches or going through history - the file command
fails with output which then breaks iconv. We get around this by first
checking whether the file exists and defaulting to utf-16 for the encoding if
it doesn’t.
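The same idea for smudge, again as a hedged sketch: strings_smudge is a hypothetical name, and I am assuming that the encoding detected on disk is used as the conversion target, with utf-16 as the fallback when there is nothing to inspect.

```shell
# Hypothetical stand-in for the smudge command: UTF-8 from the repository arrives
# on stdin; $1 plays the role of %f and may not exist yet.
strings_smudge() {
    if [ -f "$1" ]; then
        encoding=$(file -b --mime-encoding "$1" | sed -e 's/utf-16[bl]e/utf-16/')
    else
        encoding=utf-16   # nothing on disk to inspect, so fall back to the default
    fi
    iconv -f utf-8 -t "$encoding" -sc
}

# Example: Localizable.strings does not exist here, so the output is UTF-16.
cd "$(mktemp -d)"
printf '"greeting" = "Hello";\n' | strings_smudge Localizable.strings > out.strings
file -b --mime-encoding out.strings
```

The byte order that iconv picks for plain utf-16 varies between platforms, which is why the clean side cannot assume one.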
Setting it up
Although the .gitattributes file exists in the repository, the filter commands
exist in the git configuration which is individual to each machine and copy of
the repository, within .git/config. As the project tooling depends on the
filter behaviour being present I added an additional step to our bootstrap
scripts which adds the filter commands to the local git config.
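A bootstrap step along these lines could look like the sketch below. The function name and the exact command strings are illustrative reconstructions of the behaviour described above rather than a verbatim script. Git substitutes %f with the quoted path before handing each command to the shell.

```shell
# Hypothetical bootstrap helper: registers the utf16 filter in one repository's
# local config (.git/config). Pass the path to the repository.
install_utf16_filter() {
    # Working copy -> repository: detect the on-disk encoding, emit UTF-8.
    git -C "$1" config --local filter.utf16.clean \
        'iconv -f "$(file -b --mime-encoding %f | sed -e "s/utf-16[bl]e/utf-16/")" -t utf-8 -sc'
    # Repository -> working copy: match the file on disk, or UTF-16 if it is absent.
    git -C "$1" config --local filter.utf16.smudge \
        'iconv -f utf-8 -t "$([ -f %f ] && file -b --mime-encoding %f | sed -e "s/utf-16[bl]e/utf-16/" || echo utf-16)" -sc'
}
```

A bootstrap script would then call install_utf16_filter . from the repository root. Running it again simply overwrites the two settings, so it is safe to repeat.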
You could include the filters in your global config to ensure they’re always available (I haven’t personally done this yet), but given how others would depend on this behaviour I think it’s best to ensure it will be present in all working copies.
If you’ve found this post useful or have any suggestions please send me an email
at this domain. Feel free to put anything before the @ - be creative!
[1] Localizable.strings is a file used to store locale-specific strings in Cocoa apps.