<HTML>
<HEAD>
<TITLE>Computer Science Logo Style vol 2 ch 2: Example: Finding File Differences</TITLE>
</HEAD>
<BODY>
<CITE>Computer Science Logo Style</CITE> volume 2:
<CITE>Advanced Techniques</CITE> 2/e Copyright (C) 1997 MIT
<H1>Example: Finding File Differences</H1>
<TABLE width="100%"><TR><TD>
<IMG SRC="../csls2.jpg" ALT="cover photo">
<TD><TABLE>
<TR><TD align="right"><CITE><A HREF="http://www.cs.berkeley.edu/~bh/">Brian
Harvey</A><BR>University of California, Berkeley</CITE>
<TR><TD align="right"><BR>
<TR><TD align="right"><A HREF="../pdf/v2ch02.pdf">Download PDF version</A>
<TR><TD align="right"><A HREF="../v2-toc2.html">Back to Table of Contents</A>
<TR><TD align="right"><A HREF="../v2ch1/v2ch1.html"><STRONG>BACK</STRONG></A>
chapter thread <A HREF="https://people.eecs.berkeley.edu/~bh/v2ch3/v2ch3.html"><STRONG>NEXT</STRONG></A>
<TR><TD align="right"><A HREF="https://mitpress.mit.edu/books/computer-science-logo-style-second-edition-volume-2">MIT
Press web page for <CITE>Computer Science Logo Style</CITE></A>
</TABLE></TABLE>
<HR><P>Program file for this chapter: <A HREF="diff.lg"><CODE>diff</CODE></A>
<P>As an example of a practical program that manipulates data files, this
chapter is about comparing two similar files to find the differences
between them. This program is most often useful in comparing a current
version of a file with an earlier version to see what has changed.
On the next page is an example of two input files and the program's output.
The output shows only those lines that differ between the two files,
indicating changed lines, insertions, and deletions. When several
consecutive lines are different, they are grouped together as a single
reported change. (To demonstrate all of the program's capabilities, I've
used short input files with more lines different than identical, and so the
program's report is longer than the input files themselves. In a more
realistic example, the input files might be hundreds of lines long, with only
a few differences, so it would be easier to read the program's output than
to examine the files directly.)
<P>
<CENTER><STRONG>Input Files</STRONG></CENTER>
<PRE> Text1 Text2
My goal in this series of books My goal in this series of books
is to make the goals and methods is to make the goals and methods
of a serious computer scientist of a mad computer scientist
accessible, at an introductory accessible, at an introductory
level, to people who are level, to people who are
interested in computer interested in playing computer
programming but are not computer games.
science majors. If you're an
If you're an adult or teenaged hobbyist,
adult or teenaged hobbyist, you're definitely part of this
or a teacher who wants to use the audience.
computer as an educational tool, And I hope you appreciate
you're definitely part of this the privilege!
audience.
</PRE>
<P>
<CENTER><STRONG>Output File</STRONG></CENTER>
<P><PRE>DIFF results:
< File 1 = Text1
> File 2 = Text2
==========
CHANGE 3-3 3-3
< of a serious computer scientist
---
> of a mad computer scientist
==========
CHANGE 6-8 6-7
< interested in computer
< programming but are not computer
< science majors.
---
> interested in playing computer
> games.
==========
DELETE 11-12 10
< or a teacher who wants to use the
< computer as an educational tool,
==========
INSERT 15 12-13
> and I hope you appreciate
> the privilege!
==========
</PRE>
<P>I've called this program <CODE>diff</CODE> because it was inspired by a similar
program of that name in the Unix operating system. The format of the
report that my <CODE>diff</CODE> generates is similar to that of the Unix version.
In particular, I've followed the Unix <CODE>diff</CODE> convention that when a line
from one of the input files appears in the report, it is marked by either
a "<CODE><</CODE>" character if it's from the first file or a "<CODE>></CODE>" character
if it's from the second file.
<P>The numbers in the lines that begin with <CODE>CHANGE</CODE>, <CODE>INSERT</CODE>, or
<CODE>DELETE</CODE> are line numbers, counting from one, in each of the two files.
For example, the line
<P><PRE>CHANGE 6-8 6-7
</PRE>
<P>indicates that lines 6 through 8 in the first file were replaced
by lines 6 through 7 in the second file. (The program considers a change
to be finished when it finds two consecutive identical lines in the two
files. In this case, lines 9 and 10 of the first file are identical to
lines 8 and 9 of the second file.)
<P>The <CODE>diff</CODE> procedure takes three inputs. The first two are names of
the two input files; the third is either a name for an output file or an
empty list, in which case the program's results are printed on the screen.
For example, to see the results of my sample run, I'd say
<P><PRE>diff "Text1 "Text2 []
</PRE>
<P>I picked this project partly because it requires switching between
two input files, so you can see how the program uses <CODE>setread</CODE>
repeatedly.
<P><H2>Program Overview</H2>
<P><A HREF="v2ch2.html#diff"><CODE>Diff</CODE></A>
reads lines from the two input files in alternation. As long
as the corresponding lines are equal, the program just moves on to the
next pair of lines. (Procedure <A HREF="v2ch2.html#diff.same"><CODE>diff.same</CODE></A>
handles this process.) When
a difference is found, the program's operation becomes more complicated. It
must remember all the lines that it reads from both files until it finds two
consecutive equal pairs of lines. (Procedures <A HREF="v2ch2.html#diff.differ"><CODE>diff.differ</CODE></A>
and <A HREF="v2ch2.html#diff.found"><CODE>diff.found</CODE></A> do this.)
<P>Life would be simple if the differences between the two files were only
changes within a line, without adding or removing entire lines. (This
would be a realistic assumption if, for example, the second file had been
created by applying a spelling correction program to the first file.
Individual words would then be different, but each line of the second
file would correspond to one line of the first.) In that case, the
structure of <CODE>diff.differ</CODE> could be similar to that of <CODE>diff.same</CODE>:
Read a line from each file, compare the two, and report the pairs that are
different.
<P>But in practice, a change to a paragraph may make the file longer or
shorter. It may turn out, as in my sample run, that three lines from the
first file correspond to only two lines from the second one. If that's the
case, then there's no guarantee that the equal lines that mark the end of a
change will be at the same line <EM>numbers</EM> in the two files. (In the
sample, line 9 of the first file matches line 8 of the second.) Whenever
the program reads a line from one file, therefore, it must compare that
line to <EM>every</EM> line that it's read from the other file since the
two started being different. Therefore, <CODE>diff.differ</CODE> must <EM>
remember</EM> all of the lines that it reads from both files.
<P>Finally, when two pairs of equal lines are found, the program must
report the difference that it's detected. That's the job of
procedure <A HREF="v2ch2.html#report"><CODE>report</CODE></A>. Once the
change has been reported, the program continues in
<CODE>diff.same</CODE> until another difference is found.
<P>The program finishes its work when the ends of both input files have been
reached.
<P><H2>The File Information Block Abstract Data Type</H2>
<P>For each of the two input files, the program must remember several kinds of
information. The <CODE>report</CODE> procedure must know which is file number 1
and which file number 2, in order to print the lines with the correct
starting character. The name of each file is needed as the
input to <CODE>setread</CODE>. The current line
number is needed in order to report the location
within each file of a changed section. As I've just explained, there is a
collection of <EM>pending</EM> lines during the examination of a change;
we'll see shortly that another collection of <EM>saved</EM> lines is used for
another purpose.
<P>To keep all the information for a file collected together, <CODE>diff</CODE> uses
an abstract data type called a "file information block," or FIB, that is
implemented as an array with five members. The array is made by a
constructor procedure <CODE>makefile</CODE>, and there are selectors for four of
the five components: <CODE>which</CODE>, <CODE>filename</CODE>, <CODE>linenum</CODE>, and <CODE>
lines</CODE>. For the fifth component, the saved lines, instead of a selector for
the entire collection the program uses a selector <CODE>popsaved</CODE> that
outputs a single line each time it's invoked. (This will make more sense
when you read about saved lines in the next section.)
<P>
The procedures within this program use these two FIBs as inputs
instead of just the filenames. To read from one of the files, for
example, the program will say
<P><PRE>setread filename :fib1
</PRE>
<P><H2>Saving and Re-Reading Input Lines</H2>
<P>One further detail complicates the program. Suppose that a change is found
in which the two groups of differing lines are of different lengths. For
example, suppose three lines in the first file turn into six lines in the
second file, like this:
<P><PRE>Original line 1 Original line 1
<U>Original line 2</U> <U>Changed line 2</U>
<U>Original line 3</U> <U>Changed line 3</U>
<U>Original line 4</U> <U>New line 3.1</U>
Original line 5 <U>New line 3.2</U>
Original line 6 <U>New line 3.3</U>
Original line 7 <U>Changed line 4</U>
Original line 8 Original line 5
Original line 9 Original line 6
</PRE>
<P>The program has been reading lines alternately from the two
files. It has just read the line saying "Original line 6" from the second
file, and that's the second consecutive match with a line previously read
from the first file. So the program is ready to report a change from lines
2-4 of the first file to lines 2-7 of the second.
<P>The trouble is that the program has already read three lines of the first
file (the last three lines shown above) that have to be compared to lines
that haven't yet been read from the second file. Suppose that the files
continue as follows:
<P><PRE>Original line 10 Original line 7
</PRE>
<P>We can't just say, "Okay,
we've found the end of a difference, so now we can go back to <CODE>diff.same</CODE>
and read lines from the two files." If we did that, we'd read
"Original line 10" from file 1, but "Original
line 7" from file 2, and we'd think there is a difference when really the
two files are in agreement.
<P>To solve this problem we must arrange for <CODE>diff.same</CODE> to <EM>re-read</EM>
the three unused lines from file 1. Logo allows a programmer to re-read part
of a file by changing the current <EM>position</EM> within the file (this
ability is called <EM>random access</EM>), but in this program I found
it easier to <EM>buffer</EM> the lines by saving them in a list and then,
the next time the program wants to read a line from the file, using one of
the saved lines instead.
<P>
<PRE>to readline :fib
if savedp :fib [output popsaved :fib]
setread filename :fib
output readword
end
</PRE>
<P>The first instruction of this procedure says, "If there are any
saved lines for this file, remove the first one from the list and output
it." Otherwise, if there are no saved lines, then the procedure directs the
Logo reader to the desired file (using <CODE>setread</CODE>) and uses <CODE>readword</CODE>
to read a line. Because <CODE>popsaved</CODE> removes a line from the list of
saved lines, eventually the saved lines will be used up and then the program
will continue reading from the actual file.
<P><H2>Skipping <A NAME="diff.same">Equal</A> Lines</H2>
<P>Here is the procedure that skips over identical pairs of lines:
<P><PRE>to diff.same :fib1 :fib2
local [line1 line2]
do.while [make "line1 getline :fib1
make "line2 getline :fib2
if and listp :line1 listp :line2 [stop] ; Both files ended.
] [equalp :line1 :line2]
addline :fib1 :line1 ; Difference found.
addline :fib2 :line2
diff.differ :fib1 :fib2
end
to getline :fib
nextlinenum :fib
output readline :fib
end
</PRE>
<P>Most of the names you don't recognize are selectors and <EM>mutators</EM>
for the FIB abstract data type. (A mutator is a procedure that changes the
value of an existing datum, such as <CODE>setitem</CODE> for arrays.) One new
Berkeley Logo primitive used here is <CODE>do.while</CODE>. It takes two inputs,
an instruction list and an expression whose value is <CODE>true</CODE> or <CODE>
false</CODE>. <CODE>Do.while</CODE> first carries out the instructions in the first
input. Then it evaluates the predicate expression. If it's <CODE>true</CODE>,
then <CODE>do.while</CODE> repeats the process, carrying out the instructions and
evaluating the predicate, until the predicate becomes <CODE>false</CODE>. In this
case, the idea is "Keep reading lines as long as <CODE>:line1</CODE>
and <CODE>:line2</CODE> are equal."
<P><CODE>Getline</CODE> reads a line, either from the file or from the saved lines, and
also adds one to the current line number by invoking <CODE>nextlinenum</CODE>:
<P><PRE>to nextlinenum :fib
setitem 3 :fib (item 3 :fib)+1
end
</PRE>
<P>This is a typical mutator; I won't show the others until the
complete program listing at the end of the chapter.
<P><H2>Comparing and <A NAME="diff.differ">Remembering</A> Unequal Lines</H2>
<P><PRE>to diff.differ :fib1 :fib2
local "line
make "line readline :fib1
addline :fib1 :line
ifelse memberp :line lines :fib2 ~
[diff.found :fib1 :fib2] ~
[diff.differ :fib2 :fib1]
end
</PRE>
<P><CODE>Diff.differ</CODE> reads a line (perhaps a saved line) from one of the files,
adds it to the collection of pending lines (not saved lines!) for that file,
then looks to see whether a line equal to this one is pending in the other
file. If so, then we may have found the end of the changed area, and
<CODE>diff.found</CODE> is called to make sure there is a second pair of equal
lines following these two. If not, we must read a line from the other file;
this is accomplished by a recursive call to <CODE>diff.differ</CODE> with the two
inputs in reversed order. What was <CODE>:fib1</CODE> this time will be <CODE>:fib2</CODE>
in the recursive call, and vice versa. (This is why the FIB data type
must include a record of which is the original file 1 and file 2.)
<P>The reason that <CODE>diff.differ</CODE> uses <CODE>readline</CODE>
rather than <CODE>getline</CODE> to read from the input files is that it
doesn't advance the line
number. When dealing with a difference between the files, we are keeping a
range of lines from each file, not just a single line. The line number that
the program keeps in the FIB is that of the <EM>first</EM>
different line; the line number of the last different line will be computed
by the <CODE>report</CODE> procedure later.
<P><PRE>to <A NAME="diff.found">diff.found</A> :fib1 :fib2
local "lines
make "lines member2 (last butlast lines :fib1) ~
(last lines :fib1) ~
(lines :fib2)
ifelse emptyp :lines ~
[diff.differ :fib2 :fib1] ~
[report :fib1 :fib2 (butlast butlast lines :fib1)
(firstn (lines :fib2) (count lines :fib2)-(count :lines))]
end
</PRE>
<P><CODE>Diff.found</CODE> is called when the last line read from file 1 matches some
line pending from file 2. Its job is to find out whether the last <EM>
two</EM> lines from file 1 match two consecutive lines from file 2. Most of
the work is done by the straightforward helper procedure <CODE>member2</CODE>, which
works this way:
<P><PRE>> <U>show member2 "and "joy [she's my pride and joy etcetera]</U>
[and joy etcetera]
> <U>show member2 "pride "joy [she's my pride and joy etcetera]</U>
[]
</PRE>
<P>If the first two inputs are consecutive members of the third, then
<CODE>member2</CODE> outputs the portion of its third input starting from the point
at which the first input was found. If not, then <CODE>member2</CODE> outputs the
empty list.
<P>If <CODE>member2</CODE>'s output is empty, we continue reading lines from the
two files by invoking <CODE>diff.differ</CODE>. If not, then we've found the end
of a change, and we invoke <CODE>report</CODE> to print the results. The first two
inputs to <CODE>report</CODE> are the two files; the third and fourth are the
corresponding sets of unequal lines. The unequal lines from file 1 are all
but the last two, the ones we just matched; the unequal lines from
file 2 are all but the ones that <CODE>member2</CODE> output. Helper procedure
<CODE>firstn</CODE> is used to select those lines.
<P><H2>Reporting a <A NAME="report">Difference</A></H2>
<P>The <CODE>report</CODE> procedure is somewhat lengthy, but mostly because
differences in which one of the sets of lines is empty are reported
specially (as an insertion or a deletion, rather than as a change).
<P><PRE>to report :fib1 :fib2 :lines1 :lines2
local [end1 end2 dashes]
if equalp (which :fib1) 2 [report :fib2 :fib1 :lines2 :lines1 stop]
print "==========
make "end1 (linenum :fib1)+(count :lines1)-1
make "end2 (linenum :fib2)+(count :lines2)-1
make "dashes "false
ifelse :end1 < (linenum :fib1) [
print (sentence "INSERT :end1+1 (word (linenum :fib2) "- :end2))
] [ifelse :end2 < (linenum :fib2) [
print (sentence "DELETE (word (linenum :fib1) "- :end1) :end2+1)
] [
print (sentence "CHANGE (word (linenum :fib1) "- :end1)
(word (linenum :fib2) "- :end2))
make "dashes "true
]]
process :fib1 "|< | :lines1 :end1
if :dashes [print "---]
process :fib2 "|> | :lines2 :end2
diff.same :fib1 :fib2
end
to process :fib :prompt :lines :end
foreach :lines [type :prompt print ?]
savelines :fib butfirst butfirst chop :lines (lines :fib)
setlines :fib []
setlinenum :fib :end+2
end
</PRE>
<P>Here's how to read <CODE>report</CODE>: The first step is to ensure that the files
are in the proper order, so that <CODE>:fib1</CODE> is file number 1. (If not,
<CODE>report</CODE> invokes itself with its inputs reordered.) The next step is to
compute the ending line number for each changed section; it's the starting
line number (found in the file data structure) plus the number of unmatched
lines, minus one. <CODE>Report</CODE> then prints a header, choosing <CODE>
INSERT</CODE>, <CODE>DELETE</CODE>, or <CODE>CHANGE</CODE> as appropriate. Finally, it
invokes <CODE>process</CODE> once for each file.
<P><CODE>Process</CODE> prints the unmatched lines, with the appropriate file indicator
(<CODE><</CODE> or <CODE>></CODE>). Then it takes whatever pending lines were not
included in the unmatched group and transfers them to the saved lines, so
that they will be read again. (As a slight efficiency improvement, <CODE>
process</CODE> skips over the two lines that we know matched two lines in the
other file; there's no need to read those again.) The set of pending lines
is made empty, since no file difference is pending. Finally, the line
number in the file structure is increased to match the position following
the two lines that ended the difference.
<P>If <CODE>process</CODE> confuses you, look back at the example I gave earlier, when
I first talked about saving and re-reading lines. In that example, the lines
from "Original line 7" to "Original line 9" in the first file are the
ones that must be moved from the list of pending lines to the list of saved
lines. (No lines will be moved in the second file, since that one had the
longer set of lines in this difference, six lines instead of three.)
<P>By the way, in the places where the program adds or subtracts one or two
in a line number calculation, I didn't work those out in advance. I wrote
the program without them, looked at the wrong results, and then figured out
how to correct them!
<P>
<TABLE width="100%"><TR><TD><A HREF="../v2-toc2.html">(back to Table of Contents)</A>
<TD align="right"><A HREF="../v2ch1/v2ch1.html"><STRONG>BACK</STRONG></A>
chapter thread <A HREF="https://people.eecs.berkeley.edu/~bh/v2ch3/v2ch3.html"><STRONG>NEXT</STRONG></A>
</TABLE>
<P><H2>Program Listing</H2>
<P>I've discussed the most important parts of this program, but not all of the
helper procedures. If you want to understand the program fully, you can
read this complete listing:
<P><P>
<P><PRE>
to <A NAME="diff">diff</A> :file1 :file2 :output
local "caseignoredp
make "caseignoredp "false
openread :file1
openread :file2
if not emptyp :output [openwrite :output]
setwrite :output
print [DIFF results:]
print sentence [< File 1 =] :file1
print sentence [> File 2 =] :file2
diff.same (makefile 1 :file1) (makefile 2 :file2)
print "==========
setread []
setwrite []
close :file1
close :file2
if not emptyp :output [close :output]
end
;; Skip over identical lines in the two files.
to diff.same :fib1 :fib2
local [line1 line2]
do.while [make "line1 getline :fib1
make "line2 getline :fib2
if and listp :line1 listp :line2 [stop] ; Both files ended.
] [equalp :line1 :line2]
addline :fib1 :line1 ; Difference found.
addline :fib2 :line2
diff.differ :fib1 :fib2
end
;; Remember differing lines while looking for ones that match.
to diff.differ :fib1 :fib2
local "line
make "line readline :fib1
addline :fib1 :line
ifelse memberp :line lines :fib2 ~
[diff.found :fib1 :fib2] ~
[diff.differ :fib2 :fib1]
end
to diff.found :fib1 :fib2
local "lines
make "lines member2 (last butlast lines :fib1) ~
(last lines :fib1) ~
(lines :fib2)
ifelse emptyp :lines ~
[diff.differ :fib2 :fib1] ~
[report :fib1 :fib2 (butlast butlast lines :fib1)
(firstn (lines :fib2) (count lines :fib2)-(count :lines))]
end
to member2 :line1 :line2 :lines
if emptyp butfirst :lines [output []]
if and equalp :line1 first :lines equalp :line2 first butfirst :lines ~
[output :lines]
output member2 :line1 :line2 butfirst :lines
end
to firstn :stuff :number
if :number = 0 [output []]
output fput (first :stuff) (firstn butfirst :stuff :number-1)
end
;; Read from file or from saved lines.
to getline :fib
nextlinenum :fib
output readline :fib
end
to readline :fib
if savedp :fib [output popsaved :fib]
setread filename :fib
output readword
end
;; Matching lines found, now report the differences.
to report :fib1 :fib2 :lines1 :lines2
local [end1 end2 dashes]
if equalp (which :fib1) 2 [report :fib2 :fib1 :lines2 :lines1 stop]
print "==========
make "end1 (linenum :fib1)+(count :lines1)-1
make "end2 (linenum :fib2)+(count :lines2)-1
make "dashes "false
ifelse :end1 < (linenum :fib1) [
print (sentence "INSERT :end1+1 (word (linenum :fib2) "- :end2))
] [ifelse :end2 < (linenum :fib2) [
print (sentence "DELETE (word (linenum :fib1) "- :end1) :end2+1)
] [
print (sentence "CHANGE (word (linenum :fib1) "- :end1)
(word (linenum :fib2) "- :end2))
make "dashes "true
]]
process :fib1 "|< | :lines1 :end1
if :dashes [print "-----]
process :fib2 "|> | :lines2 :end2
diff.same :fib1 :fib2
end
to process :fib :prompt :lines :end
foreach :lines [type :prompt print ?]
savelines :fib butfirst butfirst chop :lines (lines :fib)
setlines :fib []
setlinenum :fib :end+2
end
to chop :counter :stuff
if emptyp :counter [output :stuff]
output chop butfirst :counter butfirst :stuff
end
;; Constructor, selectors, and mutators for File Information Block (FIB)
;; Five elements: file number, file name, line number,
;; differing lines, and saved lines for re-reading.
to makefile :number :name
local "file
make "file array 5 ; Items 4 and 5 will be empty lists.
setitem 1 :file :number
setitem 2 :file :name
setitem 3 :file 0
output :file
end
to which :fib
output item 1 :fib
end
to filename :fib
output item 2 :fib
end
to linenum :fib
output item 3 :fib
end
to nextlinenum :fib
setitem 3 :fib (item 3 :fib)+1
end
to setlinenum :fib :value
setitem 3 :fib :value
end
to addline :fib :line
setitem 4 :fib (lput :line item 4 :fib)
end
to setlines :fib :value
setitem 4 :fib :value
end
to lines :fib
output item 4 :fib
end
to savelines :fib :value
setitem 5 :fib (sentence :value item 5 :fib)
end
to savedp :fib
output not emptyp item 5 :fib
end
to popsaved :fib
local "result
make "result first item 5 :fib
setitem 5 :fib (butfirst item 5 :fib)
output :result
end
</PRE><P>
<P><A HREF="../v2-toc2.html">(back to Table of Contents)</A>
<P><A HREF="../v2ch1/v2ch1.html"><STRONG>BACK</STRONG></A>
chapter thread <A HREF="https://people.eecs.berkeley.edu/~bh/v2ch3/v2ch3.html"><STRONG>NEXT</STRONG></A>
<P>
<ADDRESS>
<A HREF="../index.html">Brian Harvey</A>,
<CODE>bh@cs.berkeley.edu</CODE>
</ADDRESS>
</BODY>
</HTML>