about summary refs log tree commit diff stats
path: root/js/games/nluqo.github.io/~bh/ssch22/files.html
blob: 8e85ab17ea02472f8ed44a82d7a26c60c85555d2 (plain) (blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
<P>

<P><A NAME="cabinets"></A>
<P><CENTER><IMG SRC="../ss-pics/files.jpg" ALT="figure: files"></CENTER>
<HTML>
<HEAD>
<TITLE>Simply Scheme: Introducing Computer Science ch 22: Files</TITLE>
</HEAD>
<BODY>
<HR>
<CITE>Simply Scheme:</CITE>
<CITE>Introducing Computer Science</CITE> 2/e Copyright (C) 1999 MIT
<H2>Chapter 22</H2>
<H1>Files</H1>

<TABLE width="100%"><TR><TD>
<IMG SRC="../simply.jpg" ALT="cover photo">
<TD><TABLE>
<TR><TD align="right"><CITE><A HREF="http://www.cs.berkeley.edu/~bh/">Brian
Harvey</A><BR>University of California, Berkeley</CITE>
<TR><TD align="right"><CITE><A HREF="http://ccrma.stanford.edu/~matt">Matthew
Wright</A><BR>University of California, Santa Barbara</CITE>
<TR><TD align="right"><BR>
<TR><TD align="right"><A HREF="../pdf/ssch22.pdf">Download PDF version</A>
<TR><TD align="right"><A HREF="../ss-toc2.html">Back to Table of Contents</A>
<TR><TD align="right"><A HREF="../ssch21/functions-implement.html"><STRONG>BACK</STRONG></A>
chapter thread <A HREF="../ssch23/vectors.html"><STRONG>NEXT</STRONG></A>
<TR><TD align="right"><A HREF="http://mitpress.mit.edu/0262082810">MIT
Press web page for <CITE>Simply Scheme</CITE></A>
</TABLE></TABLE>

<HR>


<P>We learned in Chapter 20 how to read from the keyboard and write to the
screen.  The same procedures (<A NAME="g1"></A><CODE>read</CODE>, <CODE><A NAME="g2"></A><CODE>read-line</CODE></CODE>,
<A NAME="g3"></A><CODE>display</CODE>, <A NAME="g4"></A><CODE>show</CODE>, <CODE><A NAME="g5"></A><CODE>show-line</CODE></CODE>, and
<A NAME="g6"></A><CODE>newline</CODE>) can also be used to read and write <A NAME="g7"></A><A NAME="g8"></A>data files on
the disk.

<P><H2>Ports</H2>

<P>Imagine a complicated program that reads a little bit of data at a time from
a lot of different files.  For example, soon we will write a program to
merge two files the way we merged two sentences in <CODE>mergesort</CODE> in Chapter
15.  In order to make this work, each invocation of <CODE>read</CODE> must
specify which file to read from this time.  Similarly, we might want to
direct output among several files.

<P>Each of the input/output procedures can take an extra argument to specify a
file:

<P><PRE>(show '(across the universe) file1)
(show-line '(penny lane) file2)
(read file3)
</PRE>

<P>What are <CODE>file1</CODE> and so on?  You might think that the natural
thing would be for them to be words, that is, the names of files.

<P>It happens not to work that way.  Instead, before you can use a file, you
have to <EM>open</EM> it.  If you want to read a file, the system has to
check that the file exists.  If you want to write a file, the system has to
create a new, empty file.  The Scheme procedures that open a file return a
<EM>port,</EM> which is what Scheme uses to remember the file you opened.
Ports are useful only as arguments to the input/output procedures.  Here's
an example:

<P><PRE>&gt; (let ((port (<A NAME="g9"></A>open-output-file &quot;songs&quot;)))
    (<A NAME="g10"></A>show '(all my loving) port)
    (show '(ticket to ride) port)
    (show '(martha my dear) port)
    (<A NAME="g11"></A>close-output-port port))
</PRE>

<P>(<CODE>Close-output-port</CODE>, like <CODE>define</CODE>, has an unspecified
return value that we're not going to include in the examples.)

<P>We've created a file named <CODE>songs</CODE> and put three expressions, each on
its own line, in that file.  Notice that nothing appeared on the screen when
we called <CODE>show</CODE>.  Because we used a port argument to <CODE>show</CODE>, the
output went into the file.  Here's what's in the file:

<P><PRE>(ALL MY LOVING)
(TICKET TO RIDE)
(MARTHA MY DEAR)
</PRE>

<P>The example illustrates two more details about using files that we haven't
mentioned before: First, the name of a file must be given in double-quote
marks.  Second, when you're finished using a file, you have to <EM>close</EM>
the port associated with it.  (This is very important.  On some systems,
if you forget to close the port, the file disappears.)

<P>The file is now permanent.  If we were to exit from Scheme, we could read
the file in a word processor or any other program.  But let's read it using
Scheme:

<P><PRE>(define in (<A NAME="g12"></A>open-input-file &quot;songs&quot;))

&gt; (read in)
(ALL MY LOVING)

&gt; (read in)
(TICKET TO RIDE)

&gt; (read in)
(MARTHA MY DEAR)

&gt; (<A NAME="g13"></A>close-input-port in)
</PRE>

<P>(In this illustration, we've used a global variable to hold the
port because we wanted to show the results of reading the file step by
step.  In a real program, we'd generally use a <CODE>let</CODE> structure like the
one we used to write the file.  Now that we've closed the port, the variable
<CODE>in</CODE> contains a port that can no longer be used.)

<P><H2>Writing Files for People to Read</H2>

<P>
A file full of sentences in parentheses is a natural representation for
information that will be used by a Scheme program, but it may seem awkward
if the file will be read by human beings.  We could use <CODE>show-line</CODE>
instead of <CODE>show</CODE> to create a file, still with one song title per line,
but without the parentheses:

<P><PRE>&gt; (let ((port (open-output-file &quot;songs2&quot;)))
    (show-line '(all my loving) port)
    (show-line '(ticket to ride) port)
    (show-line '(martha my dear) port)
    (close-output-port port))
</PRE>

<P>The file <CODE>songs2</CODE> will contain

<P><PRE>ALL MY LOVING
TICKET TO RIDE
MARTHA MY DEAR
</PRE>

<P>What should we do if we want to read this file back into Scheme?  We must
read the file a line at a time, with <CODE>read-line</CODE>.  In effect, <CODE>read-line</CODE> treats the breaks between lines as if they were parentheses:

<P><PRE>(define in (open-input-file &quot;songs2&quot;))

&gt; (read-line in)
(ALL MY LOVING)

&gt; (close-input-port in)
</PRE>

<P>(Notice that we don't have to read the entire file before closing
the port.  If we open the file again later, we start over again from the
beginning.)

<P> 
As far as Scheme is concerned, the result of writing the file with <CODE>show-line</CODE> and reading it with <CODE>read-line</CODE> was the same as that of
writing it with <CODE>show</CODE> and reading it with <CODE>read</CODE>.  The difference
is that without parentheses the file itself is more &quot;user-friendly&quot; for
someone who reads it outside of Scheme.<A NAME="text1" HREF="files.html#ft1">[1]</A>

<P><H2>Using a File as a Database</H2>

<P>It's not very interesting merely to read the file line by line.  Instead,
let's use it as a very small database in which we can look up songs by
number.  (For only three songs, it would be more realistic and more
convenient to keep them in a list and look them up with <CODE>list-ref</CODE>.
Pretend that this file has 3000 songs, too many for you to want to keep
them all at once in your computer's memory.)

<P><PRE>(define (<A NAME="g14"></A>get-song n)
  (let ((port (open-input-file &quot;songs2&quot;)))
    (skip-songs (- n 1) port)
    (let ((answer (read-line port)))
      (close-input-port port)
      answer)))

(define (<A NAME="g15"></A>skip-songs n port)
  (if (= n 0)
      'done
      (begin (read-line port)
	     (skip-songs (- n 1) port))))

&gt; (get-song 2)      
(TICKET TO RIDE)
</PRE>

<P>When we invoke <CODE>read-line</CODE> in <CODE>skip-songs</CODE>, we pay no
attention to the value it returns.  Remember that the values of all but the
last expression in a sequence are discarded.  <CODE>Read</CODE> and <CODE>read-line</CODE>
are the first procedures we've seen that have both a useful return value and
a useful side effect&mdash;moving forward in the file.

<P><CODE>Skip-songs</CODE> returns the word <CODE>done</CODE> when it's finished.  We don't
do anything with that return value, and there's no particular reason why we
chose that word.  But every Scheme procedure has to return <EM>something,</EM>
and this was as good as anything.

<P>What if we asked for a song number greater than three?  In other words, what
if we read beyond the end of the file?  In that case, <CODE>read</CODE> will return
a special value called an <EM><A NAME="g16"></A><A NAME="g17"></A>end-of-file object.</EM> The only
useful thing to do with that value is to test for it.  Our next sample
program reads an entire file and prints it to the screen:

<P><PRE>(define (<A NAME="g18"></A>print-file name)
  (let ((port (open-input-file name)))
    (print-file-helper port)
    (close-input-port port)
    'done))

(define (print-file-helper port)             ;; first version
  (let ((stuff (read-line port)))
    (if (<A NAME="g19"></A>eof-object? stuff)
	'done
	(begin (show-line stuff)
	       (print-file-helper port)))))

&gt; (print-file &quot;songs&quot;)
ALL MY LOVING
TICKET TO RIDE
MARTHA MY DEAR
DONE
</PRE>

<P>Did you notice that each recursive call in <CODE>print-file-helper</CODE> has
exactly the same argument as the one before?  How does the problem get
smaller?  (Up to now, recursive calls have involved something like the <CODE>butfirst</CODE> of an old argument, or one less than an old number.)  When we're
reading a file, the sense in which the problem gets smaller at each
invocation is that we're getting closer to the end of the file.  You don't
<CODE>butfirst</CODE> the port; reading it makes the unread portion of the file
smaller as a side effect.

<P><H2>Transforming the Lines of a File</H2>

<P>Often we want to transform a file one line at a time.  That is, we want to
copy lines from an input file to an output file, but instead of copying the
lines exactly, we want each output line to be a <EM>function</EM> of the
corresponding input line.  Here are some examples: We have a file full of
text and we want to <EM>justify</EM> the output so that every line is exactly
the same length, as in a book.  We have a file of students' names and
grades, and we want a summary with the students' total and average scores.
We have a file with people's first and last names, and we want to rearrange
them to be last-name-first.

<P>We'll write a procedure <A NAME="g20"></A><CODE>file-map</CODE>, analogous to <CODE>map</CODE> but for files.
It will take three arguments:  The first will be a procedure whose domain and
range are sentences; the second will be the name of the input file; the
third will be the name of the output file.

<P>Of course, this isn't exactly like the way <CODE>map</CODE> works&mdash;if it were
exactly analogous, it would take only two arguments, the procedure and the
<EM>contents</EM> of a file.  But one of the important features of files is
that they let us handle amounts of information that are too big to fit all
at once in the computer's memory.  Another feature is that once we write a
file, it's there permanently, until we erase it.  So instead of having a
<CODE>file-map</CODE> <EM>function</EM> that returns the contents of the new file,
we have a procedure that writes its result to the disk.

<P>
<PRE>(define (<A NAME="g21"></A>file-map fn inname outname)
  (let ((inport (open-input-file inname))
	(outport (open-output-file outname)))
    (file-map-helper fn inport outport)
    (close-input-port inport)
    (close-output-port outport)
    'done))

(define (<A NAME="g22"></A>file-map-helper fn inport outport)
  (let ((line (read-line inport)))
    (if (eof-object? line)
	'done
	(begin (show-line (fn line) outport)
	       (file-map-helper fn inport outport)))))
</PRE>

<P>Compare this program with the earlier <CODE>print-file</CODE> example.  The two are
almost identical.  One difference is that now the output goes to a file instead
of to the screen; the other is that we apply the function <CODE>fn</CODE> to each
line before doing the output.  But that small change vastly increases the
generality of our program.  We've performed our usual trick of generalizing a
<A NAME="g23"></A>
pattern by adding a procedure argument, and instead of a program that
carries out one specific task (printing the contents of a file), we have a
tool that can be used to create many programs.

<P>We'll start with an easy example: putting the last name first in a file
full of names.  That is, if we start with an input file named <CODE>dddbmt</CODE>
that contains

<P><PRE>David Harmon
Trevor Davies
John Dymond
Michael Wilson
Ian Amey
</PRE>

<P>we want the output file to contain

<P><PRE>Harmon, David
Davies, Trevor
Dymond, John
Wilson, Michael
Amey, Ian
</PRE>

<P>Since we are using <CODE>file-map</CODE> to handle our progress through the file,
all we have to write is a procedure that takes a sentence (one name) as its
argument and returns the same name but with the last word moved to the front
and with a comma added:
<PRE>(define (<A NAME="g24"></A>lastfirst name)
  (se (word (last name) &quot;,&quot;) (bl name)))
</PRE>

<P>We use <CODE>butlast</CODE> rather than <CODE>first</CODE> in case someone in
the file has a middle name.

<P>To use this procedure we call <CODE>file-map</CODE> like this:

<P><PRE>&gt; (file-map lastfirst &quot;dddbmt&quot; &quot;dddbmt-reversed&quot;)
DONE
</PRE>

<P> Although you don't see the results on the screen, you can 

<P><PRE>&gt; (print-file &quot;dddbmt-reversed&quot;)
</PRE>

<P>to see that we got the results we wanted.

<P>Our next example is averaging grades.  Suppose the file <CODE>grades</CODE>
contains this text:

<P><PRE>John 88 92 100 75 95
Paul 90 91 85 80 91
George 85 87 90 72 96
Ringo 95 84 88 87 87
</PRE>

<P>The output we want is:

<P><PRE>John total: 450 average: 90
Paul total: 437 average: 87.4
George total: 430 average: 86
Ringo total: 441 average: 88.2
</PRE>

<P>Here's the program:

<P><PRE>(define (<A NAME="g25"></A>process-grades line)
  (se (first line)
      &quot;total:"
      (accumulate + (bf line))
      &quot;average:"
      (/ (accumulate + (bf line))
	 (count (bf line)))))

&gt; (file-map process-grades &quot;grades&quot; &quot;results&quot;)
</PRE>

<P>As before, you can

<P><PRE>&gt; (print-file &quot;results&quot;)
</PRE>

<P>to see that we got the results we wanted.

<P> 
<H2>Justifying Text</H2>

<P>Many word-processing programs <EM>justify</EM> text; that is, they insert
extra space between words so that every line reaches exactly to the right
margin.  We can do that using <CODE>file-map</CODE>.

<P>Let's suppose we have a file <CODE>r5rs</CODE>, written in some text editor, that
looks like this:

<P><PRE>Programming languages should be designed not by
piling feature on top of feature, but by
removing the weaknesses and restrictions that
make additional features appear necessary.
Scheme demonstrates that a very small number of
rules for forming expressions, with no
restrictions on how they are composed, suffice
to form a practical and efficient programming
language that is flexible enough to support most
of the major programming paradigms in use today.
</PRE>

<P>(This is the first paragraph of the <EM>Revised<SUP><SMALL>5</SMALL></SUP> Report on
the Algorithmic Language Scheme,</EM> edited by <A NAME="g26"></A>William Clinger
and <A NAME="g27"></A>Jonathan Rees.)

<P>Here is what the result should be if we justify our <CODE>r5rs</CODE> text:

<P><PRE>Programming languages should be  designed  not  by
piling  feature  on  top  of   feature,   but   by
removing  the  weaknesses  and  restrictions  that
make   additional   features   appear   necessary.
Scheme demonstrates that a very  small  number  of
rules   for   forming   expressions,    with    no
restrictions on how  they  are  composed,  suffice
to form  a  practical  and  efficient  programming
language that is flexible enough to  support  most
of the major programming paradigms in  use  today.
</PRE>

<P>The tricky part is that ordinarily we don't control the spaces that appear
when a sentence is printed.  We just make sure the words are right, and we
get one space between words automatically.  The solution used in this
program is that each line of the output file is constructed as a single long
word, including space characters that we place explicitly within it.  (Since
<CODE>show-line</CODE> requires a sentence as its argument, our procedure will
actually return a one-word sentence.  In the following program, <CODE>pad</CODE>
constructs the word, and <CODE>justify</CODE> makes a one-word sentence containing
it.)

<P>This program, although short, is much harder to understand than most of
our short examples.  There is no big new idea involved; instead, there are a
number of unexciting but necessary details.  How many spaces between words?
Do some words get more space than others?  The program structure is messy
because the problem itself is messy.  Although it will be hard to read and
understand, this program is a more realistic example of input/output
programming than the cleanly structured examples we've shown until now.

<P><CODE>Justify</CODE> takes two arguments, the line of text (a sentence) and a
number indicating the desired width (how many characters).  Here's the
algorithm:  First the program computes the total number of characters the
sentence would take up without adding extras.  That's the job of <CODE>char-count</CODE>, which adds up the lengths of all the words, and adds to that
the <EM>n</EM>&minus;1 spaces between words.  <CODE>Extra-spaces</CODE> subtracts that length
from the desired line width to get the number of extra spaces we need.

<P>The hard part of the job is done by <CODE>pad</CODE>.  It's invoked with three
arguments: the part of the line not yet processed, the number of
opportunities there are to insert extra spaces in that part of the line
(that is, the number of words minus one), and the number of extra spaces that
the program still needs to insert.  The number of extra spaces to insert
<EM>this time</EM> is the integer quotient of the number <CODE>pad</CODE> wants to
insert and the number of chances it'll have.  That is, if there are five
words on the line, there are four places where <CODE>pad</CODE> can insert extra
space.  If it needs to insert nine spaces altogether, then it should insert
9/4 or two spaces at the first opportunity.  (Are you worried about the
remainder?  It will turn out that <CODE>pad</CODE> doesn't lose any spaces because
it takes the quotient over again for each word break.  The base case is that
the number of remaining word breaks (the divisor) is one, so there will be
no remainder, and all the leftover extra spaces will be inserted at the last
word break.)

<P><PRE>(define (<A NAME="g28"></A>justify line width)
  (if (&lt; (count line) 2)
      line
      (se (pad line
	       (- (count line) 1)
	       (extra-spaces width (char-count line))))))

(define (<A NAME="g29"></A>char-count line)
  (+ (accumulate + (every count line))      ; letters within words
     (- (count line) 1)))                   ; plus spaces between words

(define (<A NAME="g30"></A>extra-spaces width chars)
  (if (&gt; chars width)
      0                                     ; none if already too wide
      (- width chars)))

(define (<A NAME="g31"></A>pad line chances needed)
  (if (= chances 0)                         ; only one word in line
      (first line)
      (let ((extra (quotient needed chances)))
	(word (first line)
	      (spaces (+ extra 1))
	      (pad (bf line) (- chances 1) (- needed extra))))))

(define (<A NAME="g32"></A>spaces n)
  (if (= n 0)
      &quot;"
      (word &quot; &quot; (spaces (- n 1)))))
</PRE>

<P>Because <CODE>justify</CODE> takes two arguments, we have to decide what line width
we want to give it.  Here's how to make each line take 50 characters:

<P><PRE>&gt; (file-map (lambda (sent) (justify sent 50)) &quot;r5rs&quot; &quot;r5rs-just&quot;)
</PRE>

<P><H2>Preserving Spacing of Text from Files</H2>

<P>If we try to print the file <CODE>r5rs-just</CODE> from the previous section using
<CODE>print-file</CODE>, it'll look exactly like <CODE>r5rs</CODE>.  That's because <CODE>read-line</CODE> doesn't preserve consecutive spaces in the lines that it reads.
<CODE>Read-line</CODE> cares only where each word (consisting of non-space
characters) begins and ends; it pays no attention to how many spaces come
between any two words.  The lines

<P><PRE>All     My          Loving
</PRE>

<P>and

<P><PRE>All My Loving
</PRE>

<P>are the same, as far as <CODE>read-line</CODE> tells you.

<P>For situations in which we do care about spacing, we have another way to
read a line from a file.  The procedure <A NAME="g33"></A><CODE>read-string</CODE> reads all of the
characters on a line, returning a single word that contains all of them,
spaces included:<A NAME="text2" HREF="files.html#ft2">[2]</A>

<P>
<PRE>&gt; (define inport (open-input-file &quot;r5rs-just&quot;))

&gt; (read-string inport)
&quot;Programming languages should be  designed  not  by"

&gt; (read-string inport)
&quot;piling  feature  on  top  of   feature,   but   by"

&gt; (close-input-port inport)
</PRE>

<P>
<P>We can use <CODE>read-string</CODE> to rewrite <CODE>print-file</CODE> so that it makes
an exact copy of the input file:

<P><PRE>(define (<A NAME="g34"></A>print-file-helper port)
  (let ((stuff (read-string port)))
    (if (eof-object? stuff)
	'done
	(begin (show stuff)
	       (print-file-helper port)))))
</PRE>

<P>(We only had to change the helper procedure.)

<P> 
<H2>Merging Two Files</H2>

<P>Suppose you have two files of people's names.  Each file has been sorted
in alphabetical order.  You want to combine them to form a single file,
still in order.  (If this sounds unrealistic, it isn't.  Programs that sort
very large amounts of information can't always fit it all in memory at
once, so they read in as much as fits, sort it, and write a file.  Then
they read and sort another chunk.  At the end of this process, the program
is left with several sorted partial files, and it has to merge those
files to get the overall result.)

<P>The algorithm for merging files is exactly the same as the one we used for
merging sentences in the <CODE>mergesort</CODE> program of Chapter 15.  The
only difference is that the items to be sorted come from reading ports
instead of from <CODE>first</CODE>ing a sentence.

<P><PRE>(define (<A NAME="g35"></A>filemerge file1 file2 outfile)
  (let ((p1 (open-input-file file1))
	(p2 (open-input-file file2))
	(outp (open-output-file outfile)))
    (filemerge-helper p1 p2 outp (read-string p1) (read-string p2))
    (close-output-port outp)
    (close-input-port p1)
    (close-input-port p2)
    'done))

(define (<A NAME="g36"></A>filemerge-helper p1 p2 outp line1 line2)
  (cond ((eof-object? line1) (merge-copy line2 p2 outp))
	((eof-object? line2) (merge-copy line1 p1 outp))
	((before? line1 line2)
	 (show line1 outp)
	 (filemerge-helper p1 p2 outp (read-string p1) line2))
	(else (show line2 outp)
	      (filemerge-helper p1 p2 outp line1 (read-string p2)))))

(define (<A NAME="g37"></A>merge-copy line inp outp)
  (if (eof-object? line)
      #f
      (begin (show line outp)
	     (merge-copy (read-string inp) inp outp))))
</PRE>

<P>You might think, comparing <CODE>filemerge-helper</CODE> with such earlier examples
as <CODE>print-file-helper</CODE> and <CODE>file-map-helper</CODE>, that it would make
more sense for <CODE>filemerge-helper</CODE> to take just the three ports as
arguments and work like this:

<P><PRE>(define (filemerge-helper p1 p2 outp)        ;; wrong
  (let ((line1 (read-string p1))
	(line2 (read-string p2)))
    (cond ((eof-object? line1) (merge-copy p2 outp))
	  ((eof-object? line2) (merge-copy p1 outp))
	  ((before? line1 line2)
	   (show line1 outp)
	   (filemerge-helper p1 p2 outp))
	  (else (show line2 outp)
		(filemerge-helper p1 p2 outp)))))
</PRE>

<P>Unfortunately, this won't work.  Suppose that the first line of <CODE>file2</CODE>
comes before the first line of <CODE>file1</CODE>.  This program correctly writes
the first line of <CODE>file2</CODE> to the output file, as we expect.  But what
about the first line of <CODE>file1</CODE>?  Since we called <CODE>read-string</CODE> on
<CODE>file1</CODE>, we've &quot;gobbled&quot;<A NAME="text3" HREF="files.html#ft3">[3]</A> that line, but we're not yet ready to write it to the output.

<P>In each invocation of <CODE>filemerge-helper</CODE>, only one line is written to
the output file, so unless we want to lose information, we'd better read
only one line.  This means that we can't call <CODE>read-string</CODE> twice on each
recursive call.  One of the lines has to be handed down from one invocation
to the next.  (That is, it has to be an argument to the recursive call.)
Since we don't know in advance <EM>which</EM> line to keep, the easiest
solution is to hand down both lines.

<P>Therefore, <CODE>filemerge-helper</CODE> also takes as arguments the first line of
each file that hasn't yet been written to the output.  When we first call
<CODE>filemerge-helper</CODE> from <CODE>filemerge</CODE>, we read the first line of each
file to provide the initial values of these arguments.  Then, on each
recursive call, <CODE>filemerge-helper</CODE> calls <CODE>read-string</CODE> only once.

<P><H2>Writing Files for Scheme to Read</H2>

<P>You may be thinking that the three file-reading procedures we've shown, <CODE>read</CODE>, <CODE>read-line</CODE>, and <CODE>read-string</CODE>, have been getting better and
better.  <CODE>Read</CODE> ignores case and forces you to have parentheses in your
file.  <CODE>Read-line</CODE> fixes those problems, but it loses spacing information.
<CODE>Read-string</CODE> can read anything and always gets it right.

<P>But there's a cost to the generality of <CODE>read-string</CODE>; it can read any
file, but it loses <EM>structure</EM> information.  For example, when we
processed a file of people's names with <CODE>file-map</CODE>, we used this
function:

<P><PRE>(define (lastfirst name)
  (se (word (last name) &quot;,&quot;) (bl name)))
</PRE>

<P>It's easy to break a name into its components if you have the name
in the form of a sentence, with the words separated already.  But if we had
read each line with <CODE>read-string</CODE>, <CODE>last</CODE> of a line would have been
the last letter, not the last name.

<P>The <CODE>lastfirst</CODE> example illustrates why you might want to use <CODE>read-line</CODE> rather than <CODE>read-string</CODE>: <CODE>Read-line</CODE> &quot;understands&quot;
spaces.  Here's an example in which the even more structured <CODE>read</CODE> is
appropriate.  We have a file of Beatles songs and the albums on which they
appear:

<P><PRE>((love me do) (please please me))
((do you want to know a secret?) (please please me))
((think for yourself) (rubber soul))
((your mother should know) (magical mystery tour))
</PRE>

<P>Each line of this file contains two pieces of information: a song
title and an album title.  If each line contained only the words of the two
titles, as in

<P><PRE>love me do please please me
</PRE>

<P>how would we know where the song title stops and the album title
starts?  The natural way to represent this grouping information is to use
the mechanism Scheme provides for grouping, namely, list structure.

<P>If we use <CODE>read-line</CODE> to read the file, we'll lose the list structure;
it will return a sentence containing words like <CODE>&quot;((love&quot;</CODE>.  <CODE>Read</CODE>,
however, will do what we want.

<P>How did we create this file in the first place?  We just used one <CODE>show</CODE>
per line of the file, like this:

<P><PRE>&gt; (show '((love me do) (please please me)) port)
</PRE>

<P>But what about the movie soundtracks?  We're going to have to come to terms
with the apostrophe in &quot;A Hard Day's Night.&quot;

<P>The straightforward solution is to put <CODE>day's</CODE> in a string:

<P><PRE>(show '((and i love her) (a hard &quot;day's&quot; night)) port)
</PRE>

<P>The corresponding line in the file will look like this:

<P><PRE>((AND I LOVE HER) (A HARD day's NIGHT))
</PRE>

<P>This result is actually even worse than it looks, because when we
try to <CODE>read</CODE> the line back, the <CODE>'s</CODE> will be expanded into <CODE>(quote s)</CODE> in most versions of Scheme.  Using a string made it possible for
us to get an apostrophe into Scheme.  If the word <CODE>day's</CODE> were inside
quotation marks in the file, then <CODE>read</CODE> would understand our intentions.

<P>Why aren't there double quotes in the file?  All of the printing
procedures we've seen so far assume that whatever you're printing is
intended to be read by people.  Therefore, they try to minimize
distracting notation such as double-quote marks.  But, as we've
discovered, if you're writing a file to be read by Scheme, then you
do want enough notation so that Scheme can tell what the original
object was.

<P><CODE>Write</CODE> is a printing procedure just like <CODE>display</CODE>, except that it
<A NAME="spwrite"></A>
includes quote marks around strings:<A NAME="text4" HREF="files.html#ft4">[4]</A>

<P><PRE>&gt; (write '(a hard &quot;day's&quot; night))
(A HARD &quot;day's&quot; NIGHT)
</PRE>

<P>Once we're using strings, and since we're not extracting individual words
from the titles, we might as well represent each title as one string:

<P><PRE>&gt; (write '(&quot;And I Love Her&quot; &quot;A Hard Day's Night&quot;) port)
</PRE>

<P><H2>Pitfalls</H2>

<P>One pitfall crucial to avoid when using files is that if there is an
error in your program, it might blow up and return you to the Scheme prompt
without closing the open files.  If you fix the program and try to run it
again, you may get a message like &quot;file busy&quot; because the operating system
of your computer may not allow you to open the same file on two ports at
once.  Even worse, if you exit from Scheme without closing all your ports,
on some computers you may find that you have unreadable files thereafter.

<P>To help cope with this problem, we've provided a procedure <CODE><A NAME="g39"></A><CODE>close-all-ports</CODE></CODE> that can be invoked to close every port that you've
opened since starting Scheme.  This procedure works only in our modified
Scheme, but it can help you out of trouble while you're learning.

<P>Be sure you don't open or close a file within a recursive procedure,
if you intend to do it only once.  That's why most of the programs in
this chapter have the structure of a procedure that opens files, calls a
recursive helper, and then closes the files.

<P>As we explained in the <CODE>filemerge</CODE> example, you can't read the same
line twice.  Be sure your program remembers each line in a variable as long
as it's needed.

<P><H2>Exercises</H2>

<P><B>22.1</B>&nbsp;&nbsp;Write a <CODE><A NAME="g40"></A>concatenate</CODE> procedure that takes two arguments: a list
of names of input files, and one name for an output file.  The procedure
should copy all of the input files, in order, into the output file.


<P>
<B>22.2</B>&nbsp;&nbsp;Write a procedure to count the number of lines in a file.  It should
take the filename as argument and return the number.


<P>
<B>22.3</B>&nbsp;&nbsp;Write a procedure to count the number of words in a file.  It should
take the filename as argument and return the number.


<P>
<B>22.4</B>&nbsp;&nbsp;Write a procedure to count the number of characters in a file, including
space characters.  It should take the filename as argument and return the
number.


<P>
<B>22.5</B>&nbsp;&nbsp;Write a procedure that copies an input file to an output file but
eliminates multiple consecutive copies of the same line.  That is, if
the input file contains the lines

<P><PRE>John Lennon
Paul McCartney
Paul McCartney
George Harrison


Paul McCartney
Ringo Starr
</PRE>

<P>then the output file should contain

<P><PRE>John Lennon
Paul McCartney
George Harrison

Paul McCartney
Ringo Starr
</PRE>

<P>
<B>22.6</B>&nbsp;&nbsp;Write a <CODE><A NAME="g41"></A>lookup</CODE> procedure that takes as arguments a filename and
a word.  The procedure should print (on the screen, not into another file)
only those lines from the input file that include the chosen word.


<P>
<B>22.7</B>&nbsp;&nbsp;Write a <CODE><A NAME="g42"></A>page</CODE> procedure that takes a filename as argument and
prints the file a screenful at a time.  Assume that a screen can fit 24
lines; your procedure should print 23 lines of the file and then a prompt
message, and then wait for the user to enter a (probably empty) line.  It
should then print the most recent line from the file again (so that the user
will see some overlap between screenfuls) and 22 more lines, and so on until
the file ends.


<P>
<B>22.8</B>&nbsp;&nbsp;A common operation in a database program is to <EM>join</EM> two
databases, that is, to create a new database combining the information from the
two given ones.  There has to be some piece of information in common between
the two databases.  For example, suppose we have a class roster database in
which each record includes a student's name, student ID number, and computer
account name, like this:
<A NAME="join"></A>

<P><PRE>((john alec entwistle) 04397 john)
((keith moon) 09382 kmoon)
((peter townshend) 10428 pete)
((roger daltrey) 01025 roger)
</PRE>

<P>We also have a grade database in which each student's grades
are stored according to computer account name:

<P><PRE>(john 87 90 76 68 95)
(kmoon 80 88 95 77 89)
(pete 100 92 80 65 72)
(roger 85 96 83 62 74)
</PRE>

<P>We want to create a combined database like this:

<P><PRE>((john alec entwistle) 04397 john 87 90 76 68 95)
((keith moon) 09382 kmoon 80 88 95 77 89)
((peter townshend) 10428 pete 100 92 80 65 72)
((roger daltrey) 01025 roger 85 96 83 62 74)
</PRE>

<P>in which the information from the roster and grade databases has
been combined for each account name.

<P>Write a program <CODE><A NAME="g43"></A>join</CODE> that takes five arguments: two input
filenames, two numbers indicating the position of the item within each
record that should overlap between the files, and an output filename.  For
our example, we'd say

<P><PRE>&gt; (join &quot;class-roster&quot; &quot;grades&quot; 3 1 &quot;combined-file&quot;)
</PRE>

<P>In our example, both files are in alphabetical order of computer account
name, the account name is a word, and the same account name never appears
more than once in each file.  In general, you may assume that these
conditions hold for the item that the two files have in common.  Your
program should <EM>not</EM> assume that every item in one file also appears
in the other.  A line should be written in the output file only for the
items that do appear in both files.  

<P>

<HR>
<A NAME="ft1" HREF="files.html#text1">[1]</A> Another difference, not
apparent in this example, is that <CODE>show</CODE> and <CODE>read</CODE> can handle
structured lists.  <CODE>Show-line</CODE> can print a structured list, leaving off
only the outermost parentheses, but <CODE>read-line</CODE> will treat any
parentheses in the file as ordinary characters; it always returns a
sentence.<P>
<A NAME="ft2" HREF="files.html#text2">[2]</A> Like all the input and output primitives, <CODE>read-string</CODE> can be invoked with or without a port argument.<P>
<A NAME="ft3" HREF="files.html#text3">[3]</A> Computer programmers really talk
this way.<P>
<A NAME="ft4" HREF="files.html#text4">[4]</A> There are other kinds of data
that <A NAME="g38"></A><CODE>write</CODE> prints differently from <CODE>display</CODE>, but we don't
use them in
this book.  The general rule is that <CODE>display</CODE> formats the output for
human readers, while <CODE>write</CODE> ensures that Scheme can reread the
information unambiguously.  <CODE>Show</CODE> and <CODE>show-line</CODE> are extensions
that we wrote using <CODE>display</CODE>.  We could have written <CODE>show-in-write-format</CODE>,
for example, but happened not to need it.<P>
<P><A HREF="../ss-toc2.html">(back to Table of Contents)</A><P>
<A HREF="../ssch21/functions-implement.html"><STRONG>BACK</STRONG></A>
chapter thread <A HREF="../ssch23/vectors.html"><STRONG>NEXT</STRONG></A>

<P>
<ADDRESS>
<A HREF="../index.html">Brian Harvey</A>, 
<CODE>bh@cs.berkeley.edu</CODE>
</ADDRESS>
</BODY>
</HTML>