Word diff: Git as wdiff alternative
The diff utility to show differences between two files was created in 1974 as part of Unix. It has been incredibly useful and popular ever since, and hasn’t changed much since 1991 when it gained the ability to output the now standard “unified context diff” format.
The comparison diff makes is per line. If anything on a given line changes, the unified context format shows the previous version of that line with - at the beginning, and the following line starts with + followed by the new version.
For example, see this Dockerfile that had two lines changed:
$ diff -u Dockerfile.old Dockerfile
--- Dockerfile.old 2022-01-05 22:16:21 -0700
+++ Dockerfile 2022-01-05 23:08:55 -0700
@@ -2,7 +2,7 @@
WORKDIR /usr/src/app
-# Bundle app source
+# Bundle entire source
COPY . .
-RUN /usr/src/app/test.sh
+RUN /usr/src/app/start.sh
That works well for visually reviewing changes to many types of files that developers typically work with.
It can also serve as input to the patch program, which dates to 1985 and is still in wide use as a counterpart to diff. With patch we can apply changes to a file and avoid the need to send an entire new file or apply changes by hand (which is very prone to error).
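As a minimal sketch of that round trip, the commands below create a patch with diff and apply it with patch. The scratch directory, the two-line file contents, and the names change.patch and Dockerfile.patched are made up for illustration:

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two tiny stand-ins for the old and new versions of a file.
printf '%s\n' 'COPY . .' 'RUN /usr/src/app/test.sh' > Dockerfile.old
printf '%s\n' 'COPY . .' 'RUN /usr/src/app/start.sh' > Dockerfile

# diff exits with status 1 when the files differ, which is expected here.
diff -u Dockerfile.old Dockerfile > change.patch || true

# Apply the patch to a copy of the old file, naming the target explicitly.
cp Dockerfile.old Dockerfile.patched
patch Dockerfile.patched < change.patch

# cmp is silent and exits 0 when the two files are identical.
cmp Dockerfile Dockerfile.patched && echo "patched copy matches the new file"
```

Only change.patch needs to travel to the other machine; the recipient already has the old file.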
But let’s leave that aside and focus on humans reading diff output.
Diffing paragraphs
For a file containing paragraphs of prose each on their own long lines, it can look like the lines change completely when we change only a few words. This is often the case with HTML destined to be displayed in a web browser, email text, or the Markdown source for this blog post itself.
Consider this file with one long line of sample text gathered from pangrams and typing exercises:
$ cat -n paragraph.txt
1 The quick brown fox jumped over the lazy dog's back 1234567890 times. Now is the time for all good men to come to the aid of the party. Waltz, bad nymph, for quick jigs vex. Glib jocks quiz nymph to vex dwarf. Sphinx of black quartz, judge my vow. How vexingly quick daft zebras jump! The five boxing wizards jump quickly. Jackdaws love my big sphinx of quartz. Pack my box with five dozen liquor jugs.
If we change even one character of that, diff will show that the one line changed, but we have to painstakingly scan the entire long line to determine what exactly changed and where. That is not very useful.
We can wrap, or split up, the lines into multiple lines of a maximum of 75 characters with the classic Unix tool fmt:
$ fmt paragraph.txt | tee wrapped.txt
The quick brown fox jumped over the lazy dog's back 1234567890
times. Now is the time for all good men to come to the aid of the
party. Waltz, bad nymph, for quick jigs vex. Glib jocks quiz nymph
to vex dwarf. Sphinx of black quartz, judge my vow. How vexingly
quick daft zebras jump! The five boxing wizards jump quickly.
Jackdaws love my big sphinx of quartz. Pack my box with five dozen
liquor jugs.
With those shorter lines, changes will be easier to see than with one long line, but it is still hard to pick out small changes:
$ diff -u wrapped.txt wrapped2.txt
--- wrapped.txt 2022-01-05 23:22:46 -0700
+++ wrapped2.txt 2022-01-05 23:38:14 -0700
@@ -1,7 +1,7 @@
The quick brown fox jumped over the lazy dog's back 1234567890
-times. Now is the time for all good men to come to the aid of the
+times. Now is thy time for all good men to come to the aid of the
party. Waltz, bad nymph, for quick jigs vex. Glib jocks quiz nymph
to vex dwarf. Sphinx of black quartz, judge my vow. How vexingly
-quick daft zebras jump! The five boxing wizards jump quickly.
+quick daft zebras jump! Tho five boxing wizards jump quickly.
Jackdaws love my big sphinx of quartz. Pack my box with five dozen
liquor jugs.
And much worse, every line changes when we make a significant edit early in the text and reflow the paragraph to fit our maximum line length. Here is what happens after adding “an amazing count of” in the first line and re-wrapping the lines:
$ diff -u wrapped.txt wrapped3.txt
--- wrapped.txt 2022-01-05 23:22:46 -0700
+++ wrapped3.txt 2022-01-06 08:05:33 -0700
@@ -1,7 +1,7 @@
-The quick brown fox jumped over the lazy dog's back 1234567890
-times. Now is the time for all good men to come to the aid of the
-party. Waltz, bad nymph, for quick jigs vex. Glib jocks quiz nymph
-to vex dwarf. Sphinx of black quartz, judge my vow. How vexingly
-quick daft zebras jump! The five boxing wizards jump quickly.
-Jackdaws love my big sphinx of quartz. Pack my box with five dozen
-liquor jugs.
+The quick brown fox jumped over the lazy dog's back an amazing count
+of 1234567890 times. Now is thy time for all good men to come to
+the aid of the party. Waltz, bad nymph, for quick jigs vex. Glib
+jocks quiz nymph to vex dwarf. Sphinx of black quartz, judge my
+vow. How vexingly quick daft zebras jump! Tho five boxing wizards
+jump quickly. Jackdaws love my big sphinx of quartz. Pack my box
+with five dozen liquor jugs.
That gives no aid to a human proofreader!
Word diff to the rescue
In 1992 François Pinard wrote the word-based diff program wdiff, which is now part of the GNU project. It solves this problem.
Here is how it shows us our example changing two words:
$ wdiff wrapped.txt wrapped2.txt
The quick brown fox jumped over the lazy dog's back 1234567890
times. Now is [-the-] {+thy+} time for all good men to come to the aid of the
party. Waltz, bad nymph, for quick jigs vex. Glib jocks quiz nymph
to vex dwarf. Sphinx of black quartz, judge my vow. How vexingly
quick daft zebras jump! [-The-] {+Tho+} five boxing wizards jump quickly.
Jackdaws love my big sphinx of quartz. Pack my box with five dozen
liquor jugs.
Words removed are by default marked with [-…-] and words added with {+…+}.
It even knows how to accommodate word changes appearing on different lines! Trying it out on our example with the reflowed paragraph:
$ wdiff wrapped.txt wrapped3.txt
The quick brown fox jumped over the lazy dog's back {+an amazing count
of+} 1234567890 times. Now is [-the-] {+thy+} time for all good men to come to
the aid of the party. Waltz, bad nymph, for quick jigs vex. Glib
jocks quiz nymph to vex dwarf. Sphinx of black quartz, judge my
vow. How vexingly quick daft zebras jump! [-The-] {+Tho+} five boxing wizards
jump quickly. Jackdaws love my big sphinx of quartz. Pack my box
with five dozen liquor jugs.
So this is very nice, although wdiff often isn’t available by default on the various systems we find ourselves on, and it is perhaps a bit worrisome that the wdiff software has not been updated since 2014.
Too bad this word-diffing feature is not part of standard diff!
A familiar friend
That’s ok because you probably already have a wdiff alternative available on your computer: Git! More specifically, git diff --word-diff.
Maybe you already use that feature when working with your local clones of Git repositories, to look at what changed in the commit history or local edits. Did you know that git diff can act as a complete replacement of the standalone diff tool? Yes, git diff can also compare two arbitrary files that are not part of a Git repository when given the --no-index option!
And Git can usually tell you mean --no-index without you typing that explicitly: it assumes the option when you run the command outside any Git working tree, or when at least one of the paths points outside the working tree you are in. In those cases you can just type:
$ git diff --word-diff <path1> <path2>
for any two file paths and it will work.
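When both files sit inside the same Git working tree, git diff treats the two paths as a filter on tracked changes instead, so spelling out --no-index is the safe habit. A small sketch, with a scratch directory and file contents made up for illustration:

```shell
# Scratch directory with two throwaway files to compare.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'Now is the time for all good men' > old.txt
printf '%s\n' 'Now is thy time for all good men' > new.txt

# --no-index compares two paths on disk. In that mode git diff exits with
# status 1 when the files differ, so "|| true" keeps a strict shell from
# treating the difference as a failure.
# --color=never keeps the markers plain even on an interactive terminal.
result=$(git diff --no-index --word-diff --color=never old.txt new.txt || true)
printf '%s\n' "$result"
```

The changed line in the output reads: Now is [-the-]{+thy+} time for all good men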
Trying this out with our sample paragraph:
$ git diff --word-diff wrapped.txt wrapped2.txt
diff --git a/wrapped.txt b/wrapped2.txt
index b1c5775..59ff315 100644
--- a/wrapped.txt
+++ b/wrapped2.txt
@@ -1,7 +1,7 @@
The quick brown fox jumped over the lazy dog's back 1234567890
times. Now is [-the-]{+thy+} time for all good men to come to the aid of the
party. Waltz, bad nymph, for quick jigs vex. Glib jocks quiz nymph
to vex dwarf. Sphinx of black quartz, judge my vow. How vexingly
quick daft zebras jump! [-The-]{+Tho+} five boxing wizards jump quickly.
Jackdaws love my big sphinx of quartz. Pack my box with five dozen
liquor jugs.
It uses the same word deletion and insertion markers as wdiff, but to make them easier for our eyes to spot, by default git diff also shows them in different colors when output is going to an interactive terminal. You can disable the coloring with the additional option --color=never.
Use git diff --word-diff=color for a pretty view using only color to show the changes, without the [-…-] and {+…+} markers. This may be more readable when your input text is full of punctuation confusingly similar to the markers, and is useful if you want to copy from the terminal without any extra surrounding characters:
$ git diff --word-diff=color wrapped.txt wrapped2.txt
diff --git a/wrapped.txt b/wrapped2.txt
index b1c5775..59ff315 100644
--- a/wrapped.txt
+++ b/wrapped2.txt
@@ -1,7 +1,7 @@
The quick brown fox jumped over the lazy dog's back 1234567890
times. Now is thethy time for all good men to come to the aid of the
party. Waltz, bad nymph, for quick jigs vex. Glib jocks quiz nymph
to vex dwarf. Sphinx of black quartz, judge my vow. How vexingly
quick daft zebras jump! TheTho five boxing wizards jump quickly.
Jackdaws love my big sphinx of quartz. Pack my box with five dozen
liquor jugs.
There is also the option git diff --word-diff=porcelain for an ugly but more easily code-parseable format useful for output sent as input to scripts:
$ git diff --word-diff=porcelain wrapped.txt wrapped2.txt
diff --git a/wrapped.txt b/wrapped2.txt
index b1c5775..59ff315 100644
--- a/wrapped.txt
+++ b/wrapped2.txt
@@ -1,7 +1,7 @@
The quick brown fox jumped over the lazy dog's back 1234567890
~
times. Now is
-the
+thy
time for all good men to come to the aid of the
~
party. Waltz, bad nymph, for quick jigs vex. Glib jocks quiz nymph
~
to vex dwarf. Sphinx of black quartz, judge my vow. How vexingly
~
quick daft zebras jump!
-The
+Tho
five boxing wizards jump quickly.
~
Jackdaws love my big sphinx of quartz. Pack my box with five dozen
~
liquor jugs.
~
I have not needed that yet, but it is good to know about: if I ever do need to parse word diff output, this format will make the job easier and more reliable.
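As a rough sketch of what such parsing could look like, the awk filter below lists removed and added words from the porcelain format. The file names, sample text, and the "removed:"/"added:" labels are made up for illustration, and the filter handles only the simple case shown here:

```shell
# Scratch directory with two throwaway one-line files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'Now is the time' > old.txt
printf '%s\n' 'Now is thy time' > new.txt

# git diff --no-index exits 1 when the files differ; "|| true" hides that
# from strict shells. awk then keeps only the word records inside hunks.
changes=$(
  { git diff --no-index --word-diff=porcelain --color=never old.txt new.txt || true; } |
    awk '
      /^diff --git / { in_hunk = 0; next }   # file header: no hunk body yet
      /^@@/          { in_hunk = 1; next }   # hunk header: word records follow
      in_hunk && substr($0, 1, 1) == "-" { print "removed: " substr($0, 2) }
      in_hunk && substr($0, 1, 1) == "+" { print "added: " substr($0, 2) }
    '
)
printf '%s\n' "$changes"
```

Waiting for the first @@ line before looking at prefixes keeps the --- and +++ file header lines from being mistaken for removed or added words.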
Customize word break definition
Other kinds of files can present challenges for readability in diff output.
For example, consider trying to see small changes in the classic Unix /etc/passwd text “database”, which has one user record per line and within each record line uses : to delimit fields.
First we’ll try traditional line diff:
$ git diff passwd passwd.mangled
diff --git a/passwd b/passwd.mangled
index 981736c..6531f10 100644
--- a/passwd
+++ b/passwd.mangled
@@ -24,22 +24,22 @@ polkitd:x:996:991:User for polkitd:/:/sbin/nologin
rtkit:x:172:172:RealtimeKit:/proc:/sbin/nologin
pulse:x:171:171:PulseAudio System Daemon:/var/run/pulse:/sbin/nologin
chrony:x:995:988::/var/lib/chrony:/sbin/nologin
-abrt:x:173:173::/etc/abrt:/sbin/nologin
+abrt:x:173:1730::/etc/abrt:/sbin/nologin
colord:x:994:987:User for colord:/var/lib/colord:/sbin/nologin
rpcuser:x:29:29:RPC Service User:/var/lib/nfs:/sbin/nologin
sshd:x:74:74:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
vboxadd:x:993:1::/var/run/vboxadd:/sbin/nologin
dnsmasq:x:985:985:Dnsmasq DHCP and DNS server:/var/lib/dnsmasq:/sbin/nologin
-tcpdump:x:72:72::/:/sbin/nologin
+tcpdump:x:72:72::/:/bin/bash
systemd-timesync:x:984:984:systemd Time Synchronization:/:/sbin/nologin
pipewire:x:983:983:PipeWire System Daemon:/var/run/pipewire:/sbin/nologin
gluster:x:982:982:GlusterFS daemons:/run/gluster:/sbin/nologin
-radvd:x:75:75:radvd user:/:/sbin/nologin
-saslauth:x:981:76:Saslauthd user:/run/saslauthd:/sbin/nologin
+radvd:x:76:75:radvd user:/:/sbin/nologin
+saslauth:x:981:76:Saslauthd user:/ran/saslauthd:/sbin/nologin
usbmuxd:x:113:113:usbmuxd user:/:/sbin/nologin
setroubleshoot:x:980:979::/var/lib/setroubleshoot:/sbin/nologin
openvpn:x:979:978:OpenVPN:/etc/openvpn:/sbin/nologin
-nm-openvpn:x:978:977:Default user for running openvpn spawned by NetworkManager:/:/sbin/nologin
+mm-openvpn:x:978:977:Default user for running openvpn spawned by NetworkManager:/:/sbin/nologin
qemu:x:107:107:qemu user:/:/sbin/nologin
gdm:x:42:42::/var/lib/gdm:/sbin/nologin
apache:x:48:48:Apache:/usr/share/httpd:/sbin/nologin
It’s not too hard to “eyeball” changes there if they add or remove characters and thus affect the line lengths. But a line with only a change to a single character isn’t as easy.
Since blank space is not the relevant separator in this file, standard word diff doesn’t help and in some cases is worse than line diff:
$ git diff --word-diff passwd passwd.mangled
diff --git a/passwd b/passwd.mangled
index 981736c..6531f10 100644
--- a/passwd
+++ b/passwd.mangled
@@ -24,22 +24,22 @@ polkitd:x:996:991:User for polkitd:/:/sbin/nologin
rtkit:x:172:172:RealtimeKit:/proc:/sbin/nologin
pulse:x:171:171:PulseAudio System Daemon:/var/run/pulse:/sbin/nologin
chrony:x:995:988::/var/lib/chrony:/sbin/nologin
[-abrt:x:173:173::/etc/abrt:/sbin/nologin-]{+abrt:x:173:1730::/etc/abrt:/sbin/nologin+}
colord:x:994:987:User for colord:/var/lib/colord:/sbin/nologin
rpcuser:x:29:29:RPC Service User:/var/lib/nfs:/sbin/nologin
sshd:x:74:74:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
vboxadd:x:993:1::/var/run/vboxadd:/sbin/nologin
dnsmasq:x:985:985:Dnsmasq DHCP and DNS server:/var/lib/dnsmasq:/sbin/nologin
[-tcpdump:x:72:72::/:/sbin/nologin-]{+tcpdump:x:72:72::/:/bin/bash+}
systemd-timesync:x:984:984:systemd Time Synchronization:/:/sbin/nologin
pipewire:x:983:983:PipeWire System Daemon:/var/run/pipewire:/sbin/nologin
gluster:x:982:982:GlusterFS daemons:/run/gluster:/sbin/nologin
[-radvd:x:75:75:radvd-]{+radvd:x:76:75:radvd+} user:/:/sbin/nologin
saslauth:x:981:76:Saslauthd [-user:/run/saslauthd:/sbin/nologin-]{+user:/ran/saslauthd:/sbin/nologin+}
usbmuxd:x:113:113:usbmuxd user:/:/sbin/nologin
setroubleshoot:x:980:979::/var/lib/setroubleshoot:/sbin/nologin
openvpn:x:979:978:OpenVPN:/etc/openvpn:/sbin/nologin
[-nm-openvpn:x:978:977:Default-]{+mm-openvpn:x:978:977:Default+} user for running openvpn spawned by NetworkManager:/:/sbin/nologin
qemu:x:107:107:qemu user:/:/sbin/nologin
gdm:x:42:42::/var/lib/gdm:/sbin/nologin
apache:x:48:48:Apache:/usr/share/httpd:/sbin/nologin
Another venerable program similar to wdiff that is still maintained is dwdiff. In its self-description we read something intriguing:
It is different from wdiff in that it allows the user to specify what should be considered whitespace …
That sounds useful. But dwdiff is still a separate program and is even less common than wdiff. Can the versatile git diff help us here too?
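Yes! git diff has the option --word-diff-regex, which takes a regular expression that defines what counts as a word, instead of the default of treating each run of non-whitespace characters as a word. Anything between matches is treated like whitespace. That is the inverse of how dwdiff lets us specify the delimiters, but it serves the same purpose. The man page explanation notes: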
For example, --word-diff-regex=. will treat each character as a word and, correspondingly, show differences character by character.
It also notes that --word-diff is assumed and can be omitted when using --word-diff-regex.
So let’s try that:
$ git diff --word-diff-regex=. passwd passwd.mangled
diff --git a/passwd b/passwd.mangled
index 981736c..6531f10 100644
--- a/passwd
+++ b/passwd.mangled
@@ -24,22 +24,22 @@ polkitd:x:996:991:User for polkitd:/:/sbin/nologin
rtkit:x:172:172:RealtimeKit:/proc:/sbin/nologin
pulse:x:171:171:PulseAudio System Daemon:/var/run/pulse:/sbin/nologin
chrony:x:995:988::/var/lib/chrony:/sbin/nologin
abrt:x:173:173{+0+}::/etc/abrt:/sbin/nologin
colord:x:994:987:User for colord:/var/lib/colord:/sbin/nologin
rpcuser:x:29:29:RPC Service User:/var/lib/nfs:/sbin/nologin
sshd:x:74:74:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
vboxadd:x:993:1::/var/run/vboxadd:/sbin/nologin
dnsmasq:x:985:985:Dnsmasq DHCP and DNS server:/var/lib/dnsmasq:/sbin/nologin
tcpdump:x:72:72::/:/[-s-]bin/[-nologin-]{+bash+}
systemd-timesync:x:984:984:systemd Time Synchronization:/:/sbin/nologin
pipewire:x:983:983:PipeWire System Daemon:/var/run/pipewire:/sbin/nologin
gluster:x:982:982:GlusterFS daemons:/run/gluster:/sbin/nologin
radvd:x:7[-5-]{+6+}:75:radvd user:/:/sbin/nologin
saslauth:x:981:76:Saslauthd user:/r[-u-]{+a+}n/saslauthd:/sbin/nologin
usbmuxd:x:113:113:usbmuxd user:/:/sbin/nologin
setroubleshoot:x:980:979::/var/lib/setroubleshoot:/sbin/nologin
openvpn:x:979:978:OpenVPN:/etc/openvpn:/sbin/nologin
[-n-]{+m+}m-openvpn:x:978:977:Default user for running openvpn spawned by NetworkManager:/:/sbin/nologin
qemu:x:107:107:qemu user:/:/sbin/nologin
gdm:x:42:42::/var/lib/gdm:/sbin/nologin
apache:x:48:48:Apache:/usr/share/httpd:/sbin/nologin
That’s quite good, at least to my eyes.
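There is also a middle ground between whitespace-delimited words and single characters: a regex that treats each colon-separated field as one word. This is only a sketch; the regex [^:]+ and the two-line sample files are my own illustration, not something taken from the man page:

```shell
# Scratch directory with a two-record sample in passwd format.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' \
  'rtkit:x:172:172:RealtimeKit:/proc:/sbin/nologin' \
  'tcpdump:x:72:72::/:/sbin/nologin' > passwd.sample
printf '%s\n' \
  'rtkit:x:172:172:RealtimeKit:/proc:/sbin/nologin' \
  'tcpdump:x:72:72::/:/bin/bash' > passwd.sample.mangled

# [^:]+ makes every run of non-colon characters a word, so a change is
# reported per field. "|| true" absorbs the exit status 1 that git diff
# --no-index returns when the files differ.
fields=$(git diff --no-index --color=never --word-diff-regex='[^:]+' \
  passwd.sample passwd.sample.mangled || true)
printf '%s\n' "$fields"
```

Here the whole login shell field is marked as replaced, [-/sbin/nologin-]{+/bin/bash+}, which may be easier to read than scattered single-character markers when a field changes substantially.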
On the web
GitHub, GitLab, and Bitbucket do a good job of showing readable diffs for most common cases: line-oriented, but within each line also word or character diffs highlighted via color. Open up each of the following links to see how they present a few of our earlier examples as commit diffs:
- Change 2 letters in prose paragraph: 🔗 in GitHub, 🔗 in GitLab, 🔗 in Bitbucket
- Change a few characters in /etc/passwd: 🔗 in GitHub, 🔗 in GitLab, 🔗 in Bitbucket
But GitHub and GitLab both break down on a reflowed paragraph, while Bitbucket shows a sensible diff of what logically changed, including spaces to newlines and vice versa:
- Insert words early in paragraph and reflow: 🔗 in GitHub, 🔗 in GitLab, 🔗 in Bitbucket
It appears that GitLab may soon gain proper cross-line word diff ability as seen in the project’s issue “Add word-diff option to commits view”, which states its “Problem to solve” as:
When working with markdown (or any type of prose/text in general), the “classic” git-diff (intended for code) is of limited use.
Exactly right.
IDEs
Visual Studio Code (VS Code) handles the above cases well out of the box for uncommitted changes in the current Git clone, and the GitLens extension helps it do the same for showing past commit diffs.
IntelliJ IDEA handles both cases well by default.
For those left behind
To make the most of Git, you’ll want a fairly recent version, since new features are being added all the time. If you’re working on a server using the popular but aging CentOS 7 which comes with the ancient Git 1.8.3, you can follow our simple tutorial to upgrade to Git 2.34.1 or newer on CentOS 7.
Enjoy!
Reference
- diff on Wikipedia, including history and samples of original, context, and unified context diff output
- patch on Wikipedia
- git-diff man page
- wdiff
- dwdiff
- Pangrams on Wikipedia, the source of our sample prose here