In this blog entry you will find a working Javascript implementation of a "diff" (difference) algorithm for comparing two text files.
I decided to implement this, as I needed something for comparing VBA source code files without having to compare them using Word's comparison facilities.
I was going to use John Resig's excellent algorithm, but I wanted something that would provide char level analysis (John's is at the word level, and I didn't feel up to modifying it), so I implemented the simple (if not very performant) algorithms found on the wikipedia entry for the Longest Common Subsequence problem.
function LCSLength(X[1..m], Y[1..n])
C = array(0..m, 0..n)
for i := 0..m
C[i,0] = 0
for j := 0..n
C[0,j] = 0
for i := 1..m
for j := 1..n
if X[i] = Y[j]
C[i,j] := C[i-1,j-1] + 1
else:
C[i,j] := max(C[i,j-1], C[i-1,j])
return C[m,n]
function printDiff(C[0..m,0..n], X[1..m], Y[1..n], i, j)
if i > 0 and j > 0 and X[i] = Y[j]
printDiff(C, X, Y, i-1, j-1)
print " " + X[i]
else
if j > 0 and (i = 0 or C[i,j-1] C[i-1,j])
printDiff(C, X, Y, i, j-1)
print "+ " + Y[j]
else if i > 0 and (j = 0 or C[i,j-1] < C[i-1,j])
printDiff(C, X, Y, i-1, j)
print "- " + X[i]
else
print ""
My code is not designed to handle large files, but I have attempted to optimise a little by calculating the, Longest Common Subsequence (LCS) first using "lines" as the atoms to compare, and then, where there are differences, using "words" as the atoms, and then, where there are differences, using "chars" as the atoms.
NOTE: The word and char levels are not used however, if the differences between the original and revised text are large (i.e. the LCS is small), to improve readability.
The main problem with the LCS algorithm is that it seeks the most efficient way of transforming A into B. I have found that visually this may not reflect the intention of the actual amendments made, and also there may be more than one LCS, some of which may be visually better or worse than others. However, my main purpose is simply to see that there are amendments at all, so it is not a major issue, and I just go with the first LCS identified.
The program provides a single method which returns an html text file with the differences marked using <ins>
and <del>
.
Note that the html text file still has the CR, LF, and TAB characters, so you will need to use either wrap the output in a <pre>
tag or replace those characters with something like a <br>
and
.
The method is var html = diffButty(<original : String>, <revised : String>)
.
The code can be found here : 045-DiffButty1.js
You can browse the code in outline here: Outline
A working demo can be found here.
I am releasing this code under the MIT (X11) Licence.
Sorry, comments have been suspended. Too much offensive comment spam is causing the site to be blocked by firewalls (which ironically therefore defeats the point of posting spam in the first place!). I don't get that many comments anyway, so I am going to look at a better way of managing the comment spam before reinstating the comments.