<HTML> <HEAD> <TITLE>Computer Science Logo Style vol 2 ch 2: Example: Finding File Differences</TITLE> </HEAD> <BODY> <CITE>Computer Science Logo Style</CITE> volume 2: <CITE>Advanced Techniques</CITE> 2/e Copyright (C) 1997 MIT <H1>Example: Finding File Differences</H1> <TABLE width="100%"><TR><TD> <IMG SRC="../csls2.jpg" ALT="cover photo"> <TD><TABLE> <TR><TD align="right"><CITE><A HREF="http://www.cs.berkeley.edu/~bh/">Brian Harvey</A><BR>University of California, Berkeley</CITE> <TR><TD align="right"><BR> <TR><TD align="right"><A HREF="../pdf/v2ch02.pdf">Download PDF version</A> <TR><TD align="right"><A HREF="../v2-toc2.html">Back to Table of Contents</A> <TR><TD align="right"><A HREF="../v2ch1/v2ch1.html"><STRONG>BACK</STRONG></A> chapter thread <A HREF="https://people.eecs.berkeley.edu/~bh/v2ch3/v2ch3.html"><STRONG>NEXT</STRONG></A> <TR><TD align="right"><A HREF="https://mitpress.mit.edu/books/computer-science-logo-style-second-edition-volume-2">MIT Press web page for <CITE>Computer Science Logo Style</CITE></A> </TABLE></TABLE> <HR><P>Program file for this chapter: <A HREF="diff.lg"><CODE>diff</CODE></A> <P>As an example of a practical program that manipulates data files, this chapter is about comparing two similar files to find the differences between them. This program is most often useful in comparing a current version of a file with an earlier version to see what has changed. On the next page is an example of two input files and the program's output. The output shows only those lines that differ between the two files, indicating changed lines, insertions, and deletions. When several consecutive lines are different, they are grouped together as a single reported change. (To demonstrate all of the program's capabilities, I've used short input files with more lines different than identical, and so the program's report is longer than the input files themselves. In a more realistic example, the input files might be hundreds of lines long, with only a few differences, so it would be easier to read the program's output than to examine the files directly.) <P> <CENTER><STRONG>Input Files</STRONG></CENTER> <PRE> Text1 Text2 My goal in this series of books My goal in this series of books is to make the goals and methods is to make the goals and methods of a serious computer scientist of a mad computer scientist accessible, at an introductory accessible, at an introductory level, to people who are level, to people who are interested in computer interested in playing computer programming but are not computer games. science majors. If you're an If you're an adult or teenaged hobbyist, adult or teenaged hobbyist, you're definitely part of this or a teacher who wants to use the audience. computer as an educational tool, And I hope you appreciate you're definitely part of this the privilege! audience. </PRE> <P> <CENTER><STRONG>Output File</STRONG></CENTER> <P><PRE>DIFF results: < File 1 = Text1 > File 2 = Text2 ========== CHANGE 3-3 3-3 < of a serious computer scientist --- > of a mad computer scientist ========== CHANGE 6-8 6-7 < interested in computer < programming but are not computer < science majors. --- > interested in playing computer > games. ========== DELETE 11-12 10 < or a teacher who wants to use the < computer as an educational tool, ========== INSERT 15 12-13 > and I hope you appreciate > the privilege! ========== </PRE> <P>I've called this program <CODE>diff</CODE> because it was inspired by a similar program of that name in the Unix operating system. The format of the report that my <CODE>diff</CODE> generates is similar to that of the Unix version. In particular, I've followed the Unix <CODE>diff</CODE> convention that when a line from one of the input files appears in the report, it is marked by either a "<CODE><</CODE>" character if it's from the first file or a "<CODE>></CODE>" character if it's from the second file. <P>The numbers in the lines that begin with <CODE>CHANGE</CODE>, <CODE>INSERT</CODE>, or <CODE>DELETE</CODE> are line numbers, counting from one, in each of the two files. For example, the line <P><PRE>CHANGE 6-8 6-7 </PRE> <P>indicates that lines 6 through 8 in the first file were replaced by lines 6 through 7 in the second file. (The program considers a change to be finished when it finds two consecutive identical lines in the two files. In this case, lines 9 and 10 of the first file are identical to lines 8 and 9 of the second file.) <P>The <CODE>diff</CODE> procedure takes three inputs. The first two are names of the two input files; the third is either a name for an output file or an empty list, in which case the program's results are printed on the screen. For example, to see the results of my sample run, I'd say <P><PRE>diff "Text1 "Text2 [] </PRE> <P>I picked this project partly because it requires switching between two input files, so you can see how the program uses <CODE>setread</CODE> repeatedly. <P><H2>Program Overview</H2> <P><A HREF="diff.html#diff"><CODE>Diff</CODE></A> reads lines from the two input files in alternation. As long as the corresponding lines are equal, the program just moves on to the next pair of lines. (Procedure <A HREF="diff.html#diff.same"><CODE>diff.same</CODE></A> handles this process.) When a difference is found, the program's operation becomes more complicated. It must remember all the lines that it reads from both files until it finds two consecutive equal pairs of lines. (Procedures <A HREF="diff.html#diff.differ"><CODE>diff.differ</CODE></A> and <A HREF="diff.html#diff.found"><CODE>diff.found</CODE></A> do this.) <P>Life would be simple if the differences between the two files were only changes within a line, without adding or removing entire lines. (This would be a realistic assumption if, for example, the second file had been created by applying a spelling correction program to the first file. Individual words would then be different, but each line of the second file would correspond to one line of the first.) In that case, the structure of <CODE>diff.differ</CODE> could be similar to that of <CODE>diff.same</CODE>: Read a line from each file, compare the two, and report the pairs that are different. <P>But in practice, a change to a paragraph may make the file longer or shorter. It may turn out, as in my sample run, that three lines from the first file correspond to only two lines from the second one. If that's the case, then there's no guarantee that the equal lines that mark the end of a change will be at the same line <EM>numbers</EM> in the two files. (In the sample, line 9 of the first file matches line 8 of the second.) Whenever the program reads a line from one file, therefore, it must compare that line to <EM>every</EM> line that it's read from the other file since the two started being different. Therefore, <CODE>diff.differ</CODE> must <EM> remember</EM> all of the lines that it reads from both files. <P>Finally, when two pairs of equal lines are found, the program must report the difference that it's detected. That's the job of procedure <A HREF="diff.html#report"><CODE>report</CODE></A>. Once the change has been reported, the program continues in <CODE>diff.same</CODE> until another difference is found. <P>The program finishes its work when the ends of both input files have been reached. <P><H2>The File Information Block Abstract Data Type</H2> <P>For each of the two input files, the program must remember several kinds of information. The <CODE>report</CODE> procedure must know which is file number 1 and which file number 2, in order to print the lines with the correct starting character. The name of each file is needed as the input to <CODE>setread</CODE>. The current line number is needed in order to report the location within each file of a changed section. As I've just explained, there is a collection of <EM>pending</EM> lines during the examination of a change; we'll see shortly that another collection of <EM>saved</EM> lines is used for another purpose. <P>To keep all the information for a file collected together, <CODE>diff</CODE> uses an abstract data type called a "file information block," or FIB, that is implemented as an array with five members. The array is made by a constructor procedure <CODE>makefile</CODE>, and there are selectors for four of the five components: <CODE>which</CODE>, <CODE>filename</CODE>, <CODE>linenum</CODE>, and <CODE> lines</CODE>. For the fifth component, the saved lines, instead of a selector for the entire collection the program uses a selector <CODE>popsaved</CODE> that outputs a single line each time it's invoked. (This will make more sense when you read about saved lines in the next section.) <P> The procedures within this program use these two FIBs as inputs instead of just the filenames. To read from one of the files, for example, the program will say <P><PRE>setread filename :fib1 </PRE> <P><H2>Saving and Re-Reading Input Lines</H2> <P>One further detail complicates the program. Suppose that a change is found in which the two groups of differing lines are of different lengths. For example, suppose three lines in the first file turn into six lines in the second file, like this: <P><PRE>Original line 1 Original line 1 <U>Original line 2</U> <U>Changed line 2</U> <U>Original line 3</U> <U>Changed line 3</U> <U>Original line 4</U> <U>New line 3.1</U> Original line 5 <U>New line 3.2</U> Original line 6 <U>New line 3.3</U> Original line 7 <U>Changed line 4</U> Original line 8 Original line 5 Original line 9 Original line 6 </PRE> <P>The program has been reading lines alternately from the two files. It has just read the line saying "Original line 6" from the second file, and that's the second consecutive match with a line previously read from the first file. So the program is ready to report a change from lines 2-4 of the first file to lines 2-7 of the second. <P>The trouble is that the program has already read three lines of the first file (the last three lines shown above) that have to be compared to lines that haven't yet been read from the second file. Suppose that the files continue as follows: <P><PRE>Original line 10 Original line 7 </PRE> <P>We can't just say, "Okay, we've found the end of a difference, so now we can go back to <CODE>diff.same</CODE> and read lines from the two files." If we did that, we'd read "Original line 10" from file 1, but "Original line 7" from file 2, and we'd think there is a difference when really the two files are in agreement. <P>To solve this problem we must arrange for <CODE>diff.same</CODE> to <EM>re-read</EM> the three unused lines from file 1. Logo allows a programmer to re-read part of a file by changing the current <EM>position</EM> within the file (this ability is called <EM>random access</EM>), but in this program I found it easier to <EM>buffer</EM> the lines by saving them in a list and then, the next time the program wants to read a line from the file, using one of the saved lines instead. <P> <PRE>to readline :fib if savedp :fib [output popsaved :fib] setread filename :fib output readword end </PRE> <P>The first instruction of this procedure says, "If there are any saved lines for this file, remove the first one from the list and output it." Otherwise, if there are no saved lines, then the procedure directs the Logo reader to the desired file (using <CODE>setread</CODE>) and uses <CODE>readword</CODE> to read a line. Because <CODE>popsaved</CODE> removes a line from the list of saved lines, eventually the saved lines will be used up and then the program will continue reading from the actual file. <P><H2>Skipping <A NAME="diff.same">Equal</A> Lines</H2> <P>Here is the procedure that skips over identical pairs of lines: <P><PRE>to diff.same :fib1 :fib2 local [line1 line2] do.while [make "line1 getline :fib1 make "line2 getline :fib2 if and listp :line1 listp :line2 [stop] ; Both files ended. ] [equalp :line1 :line2] addline :fib1 :line1 ; Difference found. addline :fib2 :line2 diff.differ :fib1 :fib2 end to getline :fib nextlinenum :fib output readline :fib end </PRE> <P>Most of the names you don't recognize are selectors and <EM>mutators</EM> for the FIB abstract data type. (A mutator is a procedure that changes the value of an existing datum, such as <CODE>setitem</CODE> for arrays.) One new Berkeley Logo primitive used here is <CODE>do.while</CODE>. It takes two inputs, an instruction list and an expression whose value is <CODE>true</CODE> or <CODE> false</CODE>. <CODE>Do.while</CODE> first carries out the instructions in the first input. Then it evaluates the predicate expression. If it's <CODE>true</CODE>, then <CODE>do.while</CODE> repeats the process, carrying out the instructions and evaluating the predicate, until the predicate becomes <CODE>false</CODE>. In this case, the idea is "Keep reading lines as long as <CODE>:line1</CODE> and <CODE>:line2</CODE> are equal." <P><CODE>Getline</CODE> reads a line, either from the file or from the saved lines, and also adds one to the current line number by invoking <CODE>nextlinenum</CODE>: <P><PRE>to nextlinenum :fib setitem 3 :fib (item 3 :fib)+1 end </PRE> <P>This is a typical mutator; I won't show the others until the complete program listing at the end of the chapter. <P><H2>Comparing and <A NAME="diff.differ">Remembering</A> Unequal Lines</H2> <P><PRE>to diff.differ :fib1 :fib2 local "line make "line readline :fib1 addline :fib1 :line ifelse memberp :line lines :fib2 ~ [diff.found :fib1 :fib2] ~ [diff.differ :fib2 :fib1] end </PRE> <P><CODE>Diff.differ</CODE> reads a line (perhaps a saved line) from one of the files, adds it to the collection of pending lines (not saved lines!) for that file, then looks to see whether a line equal to this one is pending in the other file. If so, then we may have found the end of the changed area, and <CODE>diff.found</CODE> is called to make sure there is a second pair of equal lines following these two. If not, we must read a line from the other file; this is accomplished by a recursive call to <CODE>diff.differ</CODE> with the two inputs in reversed order. What was <CODE>:fib1</CODE> this time will be <CODE>:fib2</CODE> in the recursive call, and vice versa. (This is why the FIB data type must include a record of which is the original file 1 and file 2.) <P>The reason that <CODE>diff.differ</CODE> uses <CODE>readline</CODE> rather than <CODE>getline</CODE> to read from the input files is that it doesn't advance the line number. When dealing with a difference between the files, we are keeping a range of lines from each file, not just a single line. The line number that the program keeps in the FIB is that of the <EM>first</EM> different line; the line number of the last different line will be computed by the <CODE>report</CODE> procedure later. <P><PRE>to <A NAME="diff.found">diff.found</A> :fib1 :fib2 local "lines make "lines member2 (last butlast lines :fib1) ~ (last lines :fib1) ~ (lines :fib2) ifelse emptyp :lines ~ [diff.differ :fib2 :fib1] ~ [report :fib1 :fib2 (butlast butlast lines :fib1) (firstn (lines :fib2) (count lines :fib2)-(count :lines))] end </PRE> <P><CODE>Diff.found</CODE> is called when the last line read from file 1 matches some line pending from file 2. Its job is to find out whether the last <EM> two</EM> lines from file 1 match two consecutive lines from file 2. Most of the work is done by the straightforward helper procedure <CODE>member2</CODE>, which works this way: <P><PRE>> <U>show member2 "and "joy [she's my pride and joy etcetera]</U> [and joy etcetera] > <U>show member2 "pride "joy [she's my pride and joy etcetera]</U> [] </PRE> <P>If the first two inputs are consecutive members of the third, then <CODE>member2</CODE> outputs the portion of its third input starting from the point at which the first input was found. If not, then <CODE>member2</CODE> outputs the empty list. <P>If <CODE>member2</CODE>'s output is empty, we continue reading lines from the two files by invoking <CODE>diff.differ</CODE>. If not, then we've found the end of a change, and we invoke <CODE>report</CODE> to print the results. The first two inputs to <CODE>report</CODE> are the two files; the third and fourth are the corresponding sets of unequal lines. The unequal lines from file 1 are all but the last two, the ones we just matched; the unequal lines from file 2 are all but the ones that <CODE>member2</CODE> output. Helper procedure <CODE>firstn</CODE> is used to select those lines. <P><H2>Reporting a <A NAME="report">Difference</A></H2> <P>The <CODE>report</CODE> procedure is somewhat lengthy, but mostly because differences in which one of the sets of lines is empty are reported specially (as an insertion or a deletion, rather than as a change). <P><PRE>to report :fib1 :fib2 :lines1 :lines2 local [end1 end2 dashes] if equalp (which :fib1) 2 [report :fib2 :fib1 :lines2 :lines1 stop] print "========== make "end1 (linenum :fib1)+(count :lines1)-1 make "end2 (linenum :fib2)+(count :lines2)-1 make "dashes "false ifelse :end1 < (linenum :fib1) [ print (sentence "INSERT :end1+1 (word (linenum :fib2) "- :end2)) ] [ifelse :end2 < (linenum :fib2) [ print (sentence "DELETE (word (linenum :fib1) "- :end1) :end2+1) ] [ print (sentence "CHANGE (word (linenum :fib1) "- :end1) (word (linenum :fib2) "- :end2)) make "dashes "true ]] process :fib1 "|< | :lines1 :end1 if :dashes [print "---] process :fib2 "|> | :lines2 :end2 diff.same :fib1 :fib2 end to process :fib :prompt :lines :end foreach :lines [type :prompt print ?] savelines :fib butfirst butfirst chop :lines (lines :fib) setlines :fib [] setlinenum :fib :end+2 end </PRE> <P>Here's how to read <CODE>report</CODE>: The first step is to ensure that the files are in the proper order, so that <CODE>:fib1</CODE> is file number 1. (If not, <CODE>report</CODE> invokes itself with its inputs reordered.) The next step is to compute the ending line number for each changed section; it's the starting line number (found in the file data structure) plus the number of unmatched lines, minus one. <CODE>Report</CODE> then prints a header, choosing <CODE> INSERT</CODE>, <CODE>DELETE</CODE>, or <CODE>CHANGE</CODE> as appropriate. Finally, it invokes <CODE>process</CODE> once for each file. <P><CODE>Process</CODE> prints the unmatched lines, with the appropriate file indicator (<CODE><</CODE> or <CODE>></CODE>). Then it takes whatever pending lines were not included in the unmatched group and transfers them to the saved lines, so that they will be read again. (As a slight efficiency improvement, <CODE> process</CODE> skips over the two lines that we know matched two lines in the other file; there's no need to read those again.) The set of pending lines is made empty, since no file difference is pending. Finally, the line number in the file structure is increased to match the position following the two lines that ended the difference. <P>If <CODE>process</CODE> confuses you, look back at the example I gave earlier, when I first talked about saving and re-reading lines. In that example, the lines from "Original line 7" to "Original line 9" in the first file are the ones that must be moved from the list of pending lines to the list of saved lines. (No lines will be moved in the second file, since that one had the longer set of lines in this difference, six lines instead of three.) <P>By the way, in the places where the program adds or subtracts one or two in a line number calculation, I didn't work those out in advance. I wrote the program without them, looked at the wrong results, and then figured out how to correct them! <P> <TABLE width="100%"><TR><TD><A HREF="../v2-toc2.html">(back to Table of Contents)</A> <TD align="right"><A HREF="../v2ch1/v2ch1.html"><STRONG>BACK</STRONG></A> chapter thread <A HREF="https://people.eecs.berkeley.edu/~bh/v2ch3/v2ch3.html"><STRONG>NEXT</STRONG></A> </TABLE> <P><H2>Program Listing</H2> <P>I've discussed the most important parts of this program, but not all of the helper procedures. If you want to understand the program fully, you can read this complete listing: <P><P> <P><PRE> to <A NAME="diff">diff</A> :file1 :file2 :output local "caseignoredp make "caseignoredp "false openread :file1 openread :file2 if not emptyp :output [openwrite :output] setwrite :output print [DIFF results:] print sentence [< File 1 =] :file1 print sentence [> File 2 =] :file2 diff.same (makefile 1 :file1) (makefile 2 :file2) print "========== setread [] setwrite [] close :file1 close :file2 if not emptyp :output [close :output] end ;; Skip over identical lines in the two files. to diff.same :fib1 :fib2 local [line1 line2] do.while [make "line1 getline :fib1 make "line2 getline :fib2 if and listp :line1 listp :line2 [stop] ; Both files ended. ] [equalp :line1 :line2] addline :fib1 :line1 ; Difference found. addline :fib2 :line2 diff.differ :fib1 :fib2 end ;; Remember differing lines while looking for ones that match. to diff.differ :fib1 :fib2 local "line make "line readline :fib1 addline :fib1 :line ifelse memberp :line lines :fib2 ~ [diff.found :fib1 :fib2] ~ [diff.differ :fib2 :fib1] end to diff.found :fib1 :fib2 local "lines make "lines member2 (last butlast lines :fib1) ~ (last lines :fib1) ~ (lines :fib2) ifelse emptyp :lines ~ [diff.differ :fib2 :fib1] ~ [report :fib1 :fib2 (butlast butlast lines :fib1) (firstn (lines :fib2) (count lines :fib2)-(count :lines))] end to member2 :line1 :line2 :lines if emptyp butfirst :lines [output []] if and equalp :line1 first :lines equalp :line2 first butfirst :lines ~ [output :lines] output member2 :line1 :line2 butfirst :lines end to firstn :stuff :number if :number = 0 [output []] output fput (first :stuff) (firstn butfirst :stuff :number-1) end ;; Read from file or from saved lines. to getline :fib nextlinenum :fib output readline :fib end to readline :fib if savedp :fib [output popsaved :fib] setread filename :fib output readword end ;; Matching lines found, now report the differences. to report :fib1 :fib2 :lines1 :lines2 local [end1 end2 dashes] if equalp (which :fib1) 2 [report :fib2 :fib1 :lines2 :lines1 stop] print "========== make "end1 (linenum :fib1)+(count :lines1)-1 make "end2 (linenum :fib2)+(count :lines2)-1 make "dashes "false ifelse :end1 < (linenum :fib1) [ print (sentence "INSERT :end1+1 (word (linenum :fib2) "- :end2)) ] [ifelse :end2 < (linenum :fib2) [ print (sentence "DELETE (word (linenum :fib1) "- :end1) :end2+1) ] [ print (sentence "CHANGE (word (linenum :fib1) "- :end1) (word (linenum :fib2) "- :end2)) make "dashes "true ]] process :fib1 "|< | :lines1 :end1 if :dashes [print "-----] process :fib2 "|> | :lines2 :end2 diff.same :fib1 :fib2 end to process :fib :prompt :lines :end foreach :lines [type :prompt print ?] savelines :fib butfirst butfirst chop :lines (lines :fib) setlines :fib [] setlinenum :fib :end+2 end to chop :counter :stuff if emptyp :counter [output :stuff] output chop butfirst :counter butfirst :stuff end ;; Constructor, selectors, and mutators for File Information Block (FIB) ;; Five elements: file number, file name, line number, ;; differing lines, and saved lines for re-reading. to makefile :number :name local "file make "file array 5 ; Items 4 and 5 will be empty lists. setitem 1 :file :number setitem 2 :file :name setitem 3 :file 0 output :file end to which :fib output item 1 :fib end to filename :fib output item 2 :fib end to linenum :fib output item 3 :fib end to nextlinenum :fib setitem 3 :fib (item 3 :fib)+1 end to setlinenum :fib :value setitem 3 :fib :value end to addline :fib :line setitem 4 :fib (lput :line item 4 :fib) end to setlines :fib :value setitem 4 :fib :value end to lines :fib output item 4 :fib end to savelines :fib :value setitem 5 :fib (sentence :value item 5 :fib) end to savedp :fib output not emptyp item 5 :fib end to popsaved :fib local "result make "result first item 5 :fib setitem 5 :fib (butfirst item 5 :fib) output :result end </PRE><P> <P><A HREF="../v2-toc2.html">(back to Table of Contents)</A> <P><A HREF="../v2ch1/v2ch1.html"><STRONG>BACK</STRONG></A> chapter thread <A HREF="https://people.eecs.berkeley.edu/~bh/v2ch3/v2ch3.html"><STRONG>NEXT</STRONG></A> <P> <ADDRESS> <A HREF="../index.html">Brian Harvey</A>, <CODE>bh@cs.berkeley.edu</CODE> </ADDRESS> </BODY> </HTML>