Saturday, October 19, 2013

SourceDiff

I wanted a new challenge, so I wrote a side-by-side diff tool.

My goal was to make a simple diff tool with a clean user interface. A hundred commits later, I've started using it in my own git workflow.

I decided to build it using web technologies, mostly because I haven't seen many good side-by-side diff tools in the browser. In the end, though, I also wanted a desktop version so that I could integrate with git and other source control tools. For this, I used Adobe Air to build a thin executable shell that accepts command-line arguments.

Here's a screen shot of the Air client:



It's not the most fancy diff tool out there, but again, I built it mostly for the experience. I used the well-known Longest Common Subsequence problem as the basis of my diff algorithm. First, I look for and ignore lines that are common to the beginning or end of each file. This is simply an optimization step that takes advantage of the fact that usually a large portion of the code is left unchanged. Then, I apply the LCS algorithm on a line-by-line basis. That is, initially I only find inserted or deleted lines. If I find an insert and delete on the same line, then I report it as a modified line and proceed to find the character differences for these lines. In other words, for each modified line, I apply the LCS algorithm again to find which characters changed. From there, I improve the character diffs by cleaning up what Neal Fraser calls Semantic Chaff.

Diff tools are hard to get right. If an edit consists mostly of small changes, diff tools as a rule do a great job. However, as soon as you start doing large refactors or moving things around, most tools report a mess of changes and are less than helpful. Some diff tools try to address this by being more language-aware. Some tools go so far as to parse the code and only report semantic changes at a higher level.

I like what these new tools are trying to accomplish, but at the end of the day, a text diff is usually sufficient and provides a simple, precise record of everything that changed.

2 comments:

  1. you perhaps might have looked harder. https://github.com/cemerick/jsdifflib
    http://cemerick.github.io/jsdifflib/demo.html

    ReplyDelete
  2. Yes, I've seen that project. Like I said in the post, this was mainly just a learning experience for me.

    Furthermore, SourceDiff highlights character differences for modified lines, something that is missing from jsdifflib.

    ReplyDelete