1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
|
<P>
<P><A NAME="bongard"></A><CENTER><IMG SRC="../ss-pics/patterns.jpg" ALT="figure: patterns"></CENTER><P><CENTER>In each set, how do the ones on the left differ from the ones on the
right?
</CENTER><P>
<P><HTML>
<HEAD>
<TITLE>Simply Scheme: Introducing Computer Science ch 16: Example: Pattern Matcher</TITLE>
</HEAD>
<BODY>
<HR>
<CITE>Simply Scheme:</CITE>
<CITE>Introducing Computer Science</CITE> 2/e Copyright (C) 1999 MIT
<H2>Chapter 16</H2>
<H1>Example: Pattern Matcher</H1>
<TABLE width="100%"><TR><TD>
<IMG SRC="../simply.jpg" ALT="cover photo">
<TD><TABLE>
<TR><TD align="right"><CITE><A HREF="http://www.cs.berkeley.edu/~bh/">Brian
Harvey</A><BR>University of California, Berkeley</CITE>
<TR><TD align="right"><CITE><A HREF="http://ccrma.stanford.edu/~matt">Matthew
Wright</A><BR>University of California, Santa Barbara</CITE>
<TR><TD align="right"><BR>
<TR><TD align="right"><A HREF="../pdf/ssch16.pdf">Download PDF version</A>
<TR><TD align="right"><A HREF="../ss-toc2.html">Back to Table of Contents</A>
<TR><TD align="right"><A HREF="../ssch15/poker.html"><STRONG>BACK</STRONG></A>
chapter thread <A HREF="../ssch17/part5.html"><STRONG>NEXT</STRONG></A>
<TR><TD align="right"><A HREF="http://mitpress.mit.edu/0262082810">MIT
Press web page for <CITE>Simply Scheme</CITE></A>
</TABLE></TABLE>
<HR>
<P>It's time for another extended example in which we use the Scheme tools
we've been learning to accomplish something practical. We'll start by
describing how the program will work before we talk about how to implement
it.
<P>You can load our program into Scheme by typing
<P><PRE>(load "match.scm")
</PRE>
<P><H2>Problem Description</H2>
<P>A <EM><A NAME="g1"></A><A NAME="g2"></A>pattern matcher</EM> is a commonly used procedure whose job is
to compare a sentence to a range of possibilities. An example may make this
clear:
<P><PRE>> (match '(* me *) '(love me do))
#T
> (match '(* me *) '(please please me))
#T
> (match '(* me *) '(in my life))
#F
</PRE>
<P>The first argument, <CODE>(* me *)</CODE>, is a <EM>pattern.</EM> In the
pattern, each asterisk (<CODE>*</CODE>) means "any number of words, including no
words at all." So the entire pattern matches any sentence that contains the
word "me" anywhere within it. You can think of <CODE>match</CODE> as a more
general form of <CODE>equal?</CODE> in the sense that it compares two sentences
and tells us whether they're the same, but with a broader meaning of "the
same."
<P>Our pattern matcher will accept patterns more complicated than this first
example. There are four <EM>special characters</EM> that indicate
unspecified parts of a pattern, depending on the number of words that should
be allowed:
<P><P><TABLE><TR><TH align="right" valign="top"><CODE>?</CODE><TD> <TD valign="top">At most one word.
<TR><TH align="right" valign="top"><CODE>!</CODE><TD> <TD valign="top">Exactly one word.
<TR><TH align="right" valign="top"><CODE>&</CODE><TD> <TD valign="top">At least one word.
<TR><TH align="right" valign="top"><CODE>*</CODE><TD> <TD valign="top">Any number of words.
</TABLE>
<P><P>These characters are meant to be somewhat mnemonic. The question
mark means "maybe there's a word." The exclamation point means "precisely
one word!" (And it's vertical, just like the digit 1, sort of.) The
ampersand, which ordinarily means "and," indicates that we're matching a
word and maybe more. The asterisk doesn't have any mnemonic value, but it's
what everyone uses for a general matching indicator anyway.
<P>We can give a <EM>name</EM> to the collection of words that match an
unspecified part of a pattern by including in the pattern a word that starts
with one of the four special characters and continues with the name. If the
match succeeds, <CODE>match</CODE> will return a sentence containing these names
and the corresponding values from the sentence:
<P><PRE>> (match '(*start me *end) '(love me do))
(START LOVE ! END DO !)
> (match '(*start me *end) '(please please me))
(START PLEASE PLEASE ! END !)
> (match '(mean mr mustard) '(mean mr mustard))
()
> (match '(*start me *end) '(in my life))
FAILED
</PRE>
<P>In these examples, you see that <CODE>match</CODE> doesn't really return
<CODE>#t</CODE> or <CODE>#f</CODE>; the earlier set of examples showed a simplified
picture. In the first of the new examples, the special pattern word
<CODE>*start</CODE> is allowed to match any number of words, as indicated by the
asterisk. In this case it turned out to match the single word "love."
<CODE>Match</CODE> returns a result that tells us which words of the sentence match
the named special words in the pattern. (We'll call one of these special
pattern words a <EM>placeholder.</EM>) The exclamation points in the
returned value are needed to separate one match from another. (In the second
example, the name <CODE>end</CODE> was matched by an empty set of words.) In the
third example, the match was successful, but since there were no
placeholders the returned sentence was empty. If the match is unsuccessful,
<CODE>match</CODE> returns the word <CODE>failed</CODE>.<A NAME="text1" HREF="match#ft1">[1]</A>
<P>If the same placeholder name appears more than once in the pattern, then it
must be matched by the same word(s) in the sentence each time:
<P><PRE>> (match '(!twice !other !twice) '(cry baby cry))
(TWICE CRY ! OTHER BABY !)
> (match '(!twice !other !twice) '(please please me))
FAILED
</PRE>
<P>Some patterns might be matchable in more than one way. For example, the
invocation
<P><PRE>> (match '(*front *back) '(your mother should know))
</PRE>
<P>might return any of five different correct answers:
<P><PRE>(FRONT YOUR MOTHER SHOULD KNOW ! BACK !)
(FRONT YOUR MOTHER SHOULD ! BACK KNOW !)
(FRONT YOUR MOTHER ! BACK SHOULD KNOW !)
(FRONT YOUR ! BACK MOTHER SHOULD KNOW !)
(FRONT ! BACK YOUR MOTHER SHOULD KNOW !)
</PRE>
<P>We arbitrarily decide that in such cases the first placeholder
should match as many words as possible, so in this case <CODE>match</CODE> will
actually return the first of these answers.
<P>Before continuing, you might want to look at the first batch of exercises at
the end of this chapter, which are about using the pattern matcher. (The
rest of the exercises are about the implementation, which we'll discuss
next.)
<P><H2>Implementation: When Are Two Sentences Equal?</H2>
<P>Our approach to implementation will be to start with something we already
know how to write: a predicate that tests whether two sentences are exactly
equal. We will add capabilities one at a time until we reach our goal.
<P>Suppose that Scheme's primitive <CODE>equal?</CODE> function worked only for words
and not for sentences. We could write an equality tester for sentences,
like this:
<P><PRE>(define (<A NAME="g3"></A>sent-equal? sent1 sent2)
(cond ((empty? sent1)
(empty? sent2))
((empty? sent2) #f)
((equal? (first sent1) (first sent2))
(sent-equal? (bf sent1) (bf sent2)))
(else #f)))
</PRE>
<P>Two sentences are equal if each word in the first sentence is
equal to the corresponding word in the second. They're unequal if one
sentence runs out of words before the other.
<P>Why are we choosing to accept Scheme's primitive word comparison but rewrite
the sentence comparison? In our pattern matcher, a placeholder in the
pattern corresponds to a group of words in the sentence. There is no kind
of placeholder that matches only part of a word. (It would be possible to
implement such placeholders, but we've chosen not to.) Therefore, we will
never need to ask whether a word is "almost equal" to another word.
<P><H2>When Are Two Sentences Nearly Equal?</H2>
<P>Pattern matching is just a more general form of this <CODE>sent-equal?</CODE>
procedure. Let's write a very simple pattern matcher that knows only about
the "!" special character and doesn't let us name the words
that match the exclamation points in the pattern. We'll call this one <CODE>match?</CODE> with a question mark because it returns just true or false.
<P><PRE>(define (match? pattern sent) ;; first version: ! only
(cond ((empty? pattern)
(empty? sent))
((empty? sent) #f)
<U>((equal? (first pattern) '!)</U>
<U>(match? (bf pattern) (bf sent)))</U>
((equal? (first pattern) (first sent))
(match? (bf pattern) (bf sent)))
(else #f)))
</PRE>
<P>This program is exactly the same as <CODE>sent-equal?</CODE>, except for
the highlighted <CODE>cond</CODE> clause. We are still comparing each word of the
pattern with the corresponding word of the sentence, but now an exclamation
mark in the pattern matches <EM>any</EM> word in the sentence. (If <CODE>first</CODE> of <CODE>pattern</CODE> is an exclamation mark, we don't even look at <CODE>first</CODE> of <CODE>sent</CODE>.)
<P>Our strategy in the next several sections will be to expand the pattern
matcher by implementing the remaining special characters, then finally
adding the ability to name the placeholders. For now, when we say something
like "the <CODE>*</CODE> placeholder," we mean the placeholder consisting of the
asterisk alone. Later, after we add named placeholders, the same procedures
will implement any placeholder that begins with an asterisk.
<P><H2>Matching with Alternatives</H2>
<P>The <CODE>!</CODE> matching is not much harder than <CODE>sent-equal?</CODE>, because
it's still the case that one word of the pattern must match one word of the
sentence. When we introduce the <CODE>?</CODE> option, the structure of the
program must be more complicated, because a question mark in the pattern
might or might not be paired up with a word in the sentence. In other
words, the pattern and the sentence might match without being the same
length.
<P><PRE>(define (match? pattern sent) ;; second version: ! and ?
(cond ((empty? pattern)
(empty? sent))
<U>((equal? (first pattern) '?)</U>
<U>(if (empty? sent)</U>
<U>(match? (bf pattern) '())</U>
<U>(or (match? (bf pattern) (bf sent))</U>
<U>(match? (bf pattern) sent))))</U>
((empty? sent) #f)
((equal? (first pattern) '!)
(match? (bf pattern) (bf sent)))
((equal? (first pattern) (first sent))
(match? (bf pattern) (bf sent)))
(else #f)))
</PRE>
<P>Note that the new <CODE>cond</CODE> clause comes <EM>before</EM> the check
to see if <CODE>sent</CODE> is empty. That's because <CODE>sent</CODE> might be empty and
a pattern of <CODE>(?)</CODE> would still match it. But if the sentence is empty,
we know that the question mark doesn't match a word, so we just have to make
sure that the <CODE>butfirst</CODE> of the pattern contains nothing but question
marks. (We don't have a predicate named <CODE>all-question-marks?</CODE>; instead,
we use <CODE>match?</CODE> recursively to make this test.)
<P>In general, a question mark in the pattern has to match either one word or
zero words in the sentence. How do we decide? Our rule is that each
placeholder should match as many words as possible, so we prefer to match
one word if we can. But allowing the question mark to match a word might
prevent the rest of the pattern from matching the rest of the sentence.
<P>Compare these two examples:
<P><PRE>> (match? '(? please me) '(please please me))
#T
> (match? '(? please me) '(please me))
#T
</PRE>
<P>In the first case, the first thing in the pattern is a question mark and the
first thing in the sentence is "please," and they match. That leaves
"please me" in the pattern to match "please me" in the sentence.
<P>In the second case, we again have a question mark as the first thing in the
pattern and "please" as the first thing in the sentence. But this time,
we had better not use up the "please" in the sentence, because that will
only leave "me" to match "please me." In this case the question mark has
to match no words.
<P>To you, these examples probably look obvious. That's because you're a human
being, and you can take in the entire pattern and the entire sentence all at
once. Scheme isn't as smart as you are; it has to compare words one pair at
a time. To Scheme, the processing of both examples begins with question
mark as the first word of the pattern and "please" as the first word of
the sentence. The pattern matcher has to consider both cases.
<P>How does the procedure consider both cases? Look at the invocation of <CODE>or</CODE> by the <CODE>match?</CODE> procedure. There are two alternatives; if either
turns out true, the match succeeds. One is that we try to match the
question mark with the first word of the sentence just as we matched <CODE>!</CODE> in our earlier example—by making a recursive call on the <CODE>butfirst</CODE>s of the pattern and sentence. If that returns true, then the
question mark matches the first word.
<P>The second alternative that can make the match succeed is a recursive call
to <CODE>match?</CODE> on the <CODE>butfirst</CODE> of the pattern and the <EM>entire</EM>
sentence; this corresponds to matching the <CODE>?</CODE> against
nothing.<A NAME="text2" HREF="match#ft2">[2]</A>
<P>Let's trace <CODE>match?</CODE> so that you can see how these two cases are
handled differently by the program.
<P><PRE>> (trace match?)
> (match? '(? please me) '(please please me))
<SMALL><CODE>(match? (? please me) (please please me))
| (match? (please me) (please me)) Try matching <CODE>?</CODE> with <CODE>please</CODE>.
| | (match? (me) (me))
| | | (match? () ())
| | | #t It works!
| | #t
| #t
#t
</CODE></SMALL>#T
</PRE>
<P><PRE>> (match? '(? please me) '(please me))
<SMALL><CODE>(match? (? please me) (please me))
| (match? (please me) (me)) Try matching <CODE>?</CODE> with <CODE>please</CODE>.
| #f It doesn't work.
| (match? (please me) (please me)) This time, match <CODE>?</CODE> with nothing.
| | (match? (me) (me))
| | | (match? () ())
| | | #t
| | #t
| #t
#t
</CODE></SMALL>#T
</PRE>
<P><H2>Backtracking</H2>
<P>The program structure that allows for two alternative routes to success has
more profound implications than you may think at first.
<P>When <CODE>match?</CODE> sees a question mark in the pattern, it has to decide
whether or not to "use up" a word of the sentence by matching it with the
question mark. You might wonder, "How does the question mark decide
whether to take a word?" The answer is that the decision isn't made "by
the question mark"; there's nothing about the particular word that the
question mark might match that helps with the decision! Instead, the
decision depends on matching what comes to the right of the question mark.
<P>Compare this situation with the <CODE>keep</CODE> recursive pattern. There, too,
the procedure makes a decision about the first word of a sentence, and each
alternative leads to a recursive call for the <CODE>butfirst</CODE>:
<P><PRE>(cond ((empty? sent) '())
((some-test? (first sent))
(se (first sent) <U>(recursive-call (bf sent))</U>))
(else <U>(recursive-call (bf sent))</U>))
</PRE>
<P>The difference is that in the <CODE>keep</CODE> pattern the choice
between alternatives <EM>can</EM> be made just by looking at the immediate
situation—the single word that might or might not be chosen; the decision
doesn't depend on anything in the rest of the problem. As a result, the
choice has already been made before any recursive call happens. Therefore,
only one of the recursive calls is actually made, to make choices about the
remaining words in the sentence.
<P>In <CODE>match?</CODE>, by contrast, any particular invocation can't make its
choice until it knows the result of a recursive invocation. The result
from the recursive call determines the choice made by the caller.
<P>Here's a model that might help you think about this kind of recursion. <CODE>Match?</CODE> sees a question mark in the pattern. It makes a <EM>tentative</EM>
decision that this question mark should match the first word of the
sentence, and it uses a recursive invocation to see whether that decision
allows the rest of the problem to be solved. If so, the tentative choice
was correct. If not, <CODE>match?</CODE> tries an alternative decision that the
question mark doesn't match a word. This alternative is still tentative;
another recursive call is needed to see if the rest of the pattern can
succeed. If not, the overall match fails.
<P>This structure is called <EM>backtracking.</EM>
<P>What if there are two question marks in the pattern? Then there are <EM>four</EM> ways to match the overall pattern. Both question marks can match a
word, or only the first question mark, or only the second, or neither. A
pattern with several placeholders leads to even more alternatives. A
pattern with three question marks will have eight alternatives. (All three
match words, the first two do but the third doesn't, and so on.) A pattern
with 10 question marks will have 1024 alternatives. How can <CODE>match?</CODE>
try all these alternatives? The procedure seems to make only one two-way
choice; how can it accomplish a four-way or many-way decision?
<P>The secret is the same as the usual secret of recursion: Most of the work
is done in recursive calls. We take a leap of faith that recursive
invocations will take care of the decisions concerning question marks later
in the pattern. Think about it using the backtracking model. Let's suppose
there are 10 question marks in the pattern. When <CODE>match?</CODE> encounters
the leftmost question mark, it makes a tentative decision to match the
question mark with a word of the sentence. To test whether this choice can
work, <CODE>match?</CODE> invokes itself recursively on a pattern with nine
question marks. By the leap of faith, the recursive invocation will examine
512 ways to match question marks with words—half of the total number. If
one of these 512 works, we're finished. If not, the original <CODE>match?</CODE>
invocation changes its tentative choice, deciding instead <EM>not</EM> to
match its question mark (the leftmost one) with a word of the sentence.
Another recursive call is made based on that decision, and that recursive
call checks out the remaining 512 possibilities.
<P>By the way, the program doesn't always have to try all of the different
combinations of question marks matching or not matching words separately.
For example, if the problem is
<P><PRE>(match? '(a b ? ? ? ?) '(x y z w p q))
</PRE>
<P>then the very first comparison discovers that <CODE>a</CODE> is different
from <CODE>x</CODE>, so none of the 16 possible arrangements about question marks
matching or not matching words will make a difference.
<P>Here are some traced examples involving patterns with two question marks, to
show how the result of backtracking depends on the individual problem.
<P><PRE>> (match? '(? ? foo) '(bar foo))
<SMALL><CODE>(match? (? ? foo) (bar foo))
| (match? (? foo) (foo))
| | (match? (foo) ())
| | #f
| | (match? (foo) (foo))
| | | (match? () ())
| | | #t
| | #t
| #t
#t
</CODE></SMALL>#T
</PRE>
<P>In this first example, the first question mark tries to match the
word <CODE>bar</CODE>, but it can't tell whether or not that match will succeed
until the recursive call returns. In the recursive call, the second
question mark tries to match the word <CODE>foo</CODE>, and fails. Then the
second question mark tries again, this time matching nothing, and succeeds.
Therefore, the first question mark can report success; it never has to try a
recursive call in which it doesn't match a word.
<P>In our second example, each question mark will have to try both alternatives,
matching and then not matching a word, before the overall match succeeds.
<P><PRE>> (match? '(? ? foo bar) '(foo bar))
<SMALL><CODE>(match? (? ? foo bar) (foo bar))
| (match? (? foo bar) (bar))
| | (match? (foo bar) ())
| | #f
| | (match? (foo bar) (bar))
| | #f
| #f
| (match? (? foo bar) (foo bar))
| | (match? (foo bar) (bar))
| | #f
| | (match? (foo bar) (foo bar))
| | | (match? (bar) (bar))
| | | | (match? () ())
| | | | #t
| | | #t
| | #t
| #t
#t
</CODE></SMALL>#T
</PRE>
<P>The first question mark tries to match the word <CODE>foo</CODE> in the
sentence, leaving the pattern <CODE>(? foo bar)</CODE> to match <CODE>(bar)</CODE>. The
second question mark will try both matching and not matching a word, but
neither succeeds. Therefore, the first question mark tries again, this time
not matching a word. The second question mark first tries matching <CODE>foo</CODE>, and when that fails, tries not matching anything. This last attempt is
successful.
<P>In the previous example, every question mark's first attempt failed. The
following example illustrates the opposite case, in which every question
mark's first attempt succeeds.
<P><PRE>> (match? '(? ? baz) '(foo bar baz))
<SMALL><CODE>(match? (? ? baz) (foo bar baz))
| (match? (? baz) (bar baz))
| | (match? (baz) (baz))
| | | (match? () ())
| | | #t
| | #t
| #t
#t
</CODE></SMALL>#t
</PRE>
<P>The first question mark matches <CODE>foo</CODE>; the second matches <CODE>bar</CODE>.
<P>If the sentence is shorter than the pattern, we may end up trying to match a
pattern against an empty sentence. This is much easier than the general
problem, because there aren't two alternatives; a question mark has no word
in the sentence to match.
<P><PRE>> (match? '(? ? foo) '())
<SMALL><CODE>(match? (? ? foo) ())
| (match? (? foo) ())
| | (match? (foo) ())
| | #f
| #f
#f
</CODE></SMALL>#f
</PRE>
<P>Each question mark knows right away that it had better not try to
match a word, so we never have to backtrack.
<P><H2>Matching Several Words</H2>
<P>The next placeholder we'll implement is <CODE>*</CODE>. The order in which we're
implementing these placeholders was chosen so that each new version
increases the variability in the number of words a placeholder can match.
The <CODE>!</CODE> placeholder was very easy because it always matches exactly one
word; it's hardly different at all from a non-placeholder in the pattern.
Implementing <CODE>?</CODE> was more complicated because there were two
alternatives to consider. But for <CODE>*</CODE>, we might match any number of
words, up to the entire rest of the sentence.
<P>Our strategy will be a generalization of the <CODE>?</CODE> strategy: Start with
a "greedy" match, and then, if a recursive call tells us that the
remaining part of the sentence can't match the rest of the pattern, try a
less greedy match.
<P>The difference between <CODE>?</CODE> and <CODE>*</CODE> is that <CODE>?</CODE> allows only two
possible match lengths, zero and one. Therefore, these two cases can be
checked with two explicit subexpressions of an <CODE>or</CODE> expression. In the
more general case of <CODE>*</CODE>, any length is possible, so we can't check every
possibility separately. Instead, as in any problem of unknown size, we use
recursion. First we try the longest possible match; if that fails because
the rest of the pattern can't be matched, a recursive call tries the
next-longest match. If we get all the way down to an empty match for the
<CODE>*</CODE> and still can't match the rest of the pattern, then we return <CODE>#f</CODE>.
<P><PRE>(define (match? pattern sent) ;; third version: !, ?, and *
(cond ((empty? pattern)
(empty? sent))
((equal? (first pattern) '?)
(if (empty? sent)
(match? (bf pattern) '())
(or (match? (bf pattern) (bf sent))
(match? (bf pattern) sent))))
<U>((equal? (first pattern) '*)</U>
<U>(*-longest-match (bf pattern) sent))</U>
((empty? sent) #f)
((equal? (first pattern) '!)
(match? (bf pattern) (bf sent)))
((equal? (first pattern) (first sent))
(match? (bf pattern) (bf sent)))
(else #f)))
(define (*-longest-match pattern-rest sent)
(*-lm-helper pattern-rest sent '()))
(define (*-lm-helper pattern-rest sent-matched sent-unmatched)
(cond ((match? pattern-rest sent-unmatched) #t)
((empty? sent-matched) #f)
(else (*-lm-helper pattern-rest
(bl sent-matched)
(se (last sent-matched) sent-unmatched)))))
</PRE>
<P>If an asterisk is found in the pattern, <CODE>match?</CODE> invokes <CODE>*-longest-match</CODE>, which carries out this backtracking approach.
<P>The real work is done by <CODE>*-lm-helper</CODE>, which has three arguments.
The first argument is the still-to-be-matched part of the pattern, following
the <CODE>*</CODE> placeholder that we're trying to match now. <CODE>Sent-matched</CODE>
is the part of the sentence that we're considering as a candidate to match
the <CODE>*</CODE> placeholder. <CODE>Sent-unmatched</CODE> is the remainder of the
sentence, following the words in <CODE>sent-matched</CODE>; it must match
<CODE>pattern-rest</CODE>.
<P>Since we're trying to find the longest possible match, <CODE>*-longest-match</CODE>
chooses the entire sentence as the first attempt for <CODE>sent-matched</CODE>.
Since <CODE>sent-matched</CODE> is using up the entire sentence, the initial value
of <CODE>sent-unmatched</CODE> is empty. The only job of <CODE>*-longest-match</CODE> is
to invoke <CODE>*-lm-helper</CODE> with these initial arguments. On each
recursive invocation, <CODE>*-lm-helper</CODE> shortens <CODE>sent-matched</CODE> by one
word and accordingly lengthens <CODE>sent-unmatched</CODE>.
<P>Here's an example in which the <CODE>*</CODE> placeholder tries to match four
words, then three words, and finally succeeds with two words:
<P><PRE>> (trace match? *-longest-match *-lm-helper)
> (match? '(* days night) '(a hard days night))
<SMALL><CODE>(match? (* days night) (a hard days night))
| (*-longest-match (days night) (a hard days night))
| | (*-lm-helper (days night) (a hard days night) ())
| | | (match? (days night) ())
| | | #f
| | | (*-lm-helper (days night) (a hard days) (night))
| | | | (match? (days night) (night))
| | | | #f
| | | | (*-lm-helper (days night) (a hard) (days night))
| | | | | (match? (days night) (days night))
| | | | | | (match? (night) (night))
| | | | | | | (match? () ())
| | | | | | | #t
| | | | | | #t
| | | | | #t
| | | | #t
| | | #t
| | #t
| #t
#t
</CODE></SMALL>#t
</PRE>
<P><H2>Combining the Placeholders</H2>
<P>We have one remaining placeholder, <CODE>&</CODE>, which is much like <CODE>*</CODE>
except that it fails unless it can match at least one word. We could,
therefore, write a <CODE>&-longest-match</CODE> that would be identical to <CODE>*-longest-match</CODE> except for the base case of its helper procedure. If <CODE>sent-matched</CODE> is empty, the result is <CODE>#f</CODE> even if it would be possible
to match the rest of the pattern against the rest of the sentence. (All we
have to do is exchange the first two clauses of the <CODE>cond</CODE>.)
<P><PRE>(define (&-longest-match pattern-rest sent)
(&-lm-helper pattern-rest sent '()))
(define (&-lm-helper pattern-rest sent-matched sent-unmatched)
(cond ((empty? sent-matched) #f)
((match? pattern-rest sent-unmatched) #t)
(else (&-lm-helper pattern-rest
(bl sent-matched)
(se (last sent-matched) sent-unmatched)))))
</PRE>
<P>When two procedures are so similar, that's a clue that perhaps
they could be combined into one. We could look at the bodies of these two
procedures to find a way to combine them textually. But instead, let's step
back and think about the meanings of the placeholders.
<P>The reason that the procedures <CODE>*-longest-match</CODE> and <CODE>&-longest-match</CODE> are so similar is that the two placeholders have almost
identical meanings. <CODE>*</CODE> means "match as many words as possible";
<CODE>&</CODE> means "match as many words as possible, but at least one." Once
we're thinking in these terms, it's plausible to think of <CODE>?</CODE> as
meaning "match as many words as possible, but at most one." In fact,
although this is a stretch, we can also describe <CODE>!</CODE> similarly: "Match
as many words as possible, but at least one, and at most one."
<P><TABLE><TR><TH>Placeholder <TH>Minimum size
<TH>Maximum size
<TR><TD><CODE>*</CODE><TD>0<TD>no limit
<TR><TD><CODE>&</CODE><TD>1<TD>no limit
<TR><TD><CODE>?</CODE><TD>0<TD>1
<TR><TD><CODE>!</CODE><TD>1<TD>1
</TABLE>
<P>We'll take advantage of this newly understood similarity to
simplify the program by using a single algorithm for all placeholders.
<P>How do we generalize <CODE>*-longest-match</CODE> and <CODE>&-longest-match</CODE> to
handle all four cases? There are two kinds of generalization involved.
We'll write a procedure <CODE>longest-match</CODE> that will have the same
arguments as <CODE>*-longest-match</CODE>, plus two others, one for for the minimum
size of the matched text and one for the maximum.
<P>We'll specify the minimum size with a formal parameter <CODE>min</CODE>. (The
corresponding argument will always be 0 or 1.) <CODE>Longest-match</CODE> will
pass the value of <CODE>min</CODE> down to <CODE>lm-helper</CODE>, which will use it to
reject potential matches that are too short.
<P>Unfortunately, we can't use a number to specify the maximum size, because
for <CODE>*</CODE> and <CODE>&</CODE> there is no maximum. Instead, <CODE>longest-match</CODE>
has a formal parameter <CODE>max-one?</CODE> whose value is <CODE>#t</CODE> only for <CODE>?</CODE> and <CODE>!</CODE>.
<P>Our earlier, special-case versions of <CODE>longest-match</CODE> were written for
<CODE>*</CODE> and <CODE>&</CODE>, the placeholders for which <CODE>max-one?</CODE> will be
false. For those placeholders, the new <CODE>longest-match</CODE> will be just
like the earlier versions. Our next task is to generalize <CODE>longest-match</CODE> so that it can handle the <CODE>#t</CODE> cases.
<P>Think about the meaning of the <CODE>sent-matched</CODE> and <CODE>sent-unmatched</CODE>
parameters in the <CODE>lm-helper</CODE> procedures. <CODE>Sent-matched</CODE> means
"the longest part of the sentence that this placeholder is still allowed to
match," while <CODE>sent-unmatched</CODE> contains whatever portion of the
sentence has already been disqualified from being matched by the placeholder.
<P>Consider the behavior of <CODE>*-longest-match</CODE> when an asterisk is at the
beginning of a pattern that we're trying to match against a seven-word
sentence. Initially, <CODE>sent-matched</CODE> is the entire seven-word sentence,
and <CODE>sent-unmatched</CODE> is empty. Then, supposing that doesn't work, <CODE>sent-matched</CODE> is a six-word sentence, while <CODE>sent-unmatched</CODE> contains
the remaining word. This continues as long as no match succeeds until, near
the end of <CODE>longest-match</CODE>'s job, <CODE>sent-matched</CODE> is a one-word
sentence and <CODE>sent-unmatched</CODE> contains six words. At this point, the
longest possible match for the asterisk is a single word.
<P>This situation is where we want to <EM>start</EM> in the case of the <CODE>?</CODE>
and <CODE>!</CODE> placeholders. So when we're trying to match one of these
placeholders, our initialization procedure won't use the entire sentence as
the initial value of <CODE>sent-matched</CODE>; rather, the initial value of <CODE>sent-matched</CODE> will be a one-word sentence, and <CODE>sent-unmatched</CODE> will
contain the rest of the sentence.
<P><PRE>(define (longest-match pattern-rest sent min max-one?) ;; first version
(cond ((empty? sent)
(and (= min 0) (match? pattern-rest sent)))
(max-one?
(lm-helper pattern-rest (se (first sent)) (bf sent) min))
(else (lm-helper pattern-rest sent '() min))))
(define (lm-helper pattern-rest sent-matched sent-unmatched min)
(cond ((< (length sent-matched) min) #f)
((match? pattern-rest sent-unmatched) #t)
((empty? sent-matched) #f)
(else (lm-helper pattern-rest
(bl sent-matched)
(se (last sent-matched) sent-unmatched)
min))))
</PRE>
<P>Now we can rewrite <CODE>match?</CODE> to use <CODE>longest-match</CODE>. <CODE>Match?</CODE>
will delegate the handling of all placeholders to a subprocedure <CODE>match-special</CODE> that will invoke <CODE>longest-match</CODE> with the correct values
for <CODE>min</CODE> and <CODE>max-one?</CODE> according to the table.
<P><PRE>(define (match? pattern sent) ;; fourth version
(cond ((empty? pattern)
(empty? sent))
<U>((special? (first pattern))</U>
<U>(match-special (first pattern) (bf pattern) sent))</U>
((empty? sent) #f)
((equal? (first pattern) (first sent))
(match? (bf pattern) (bf sent)))
(else #f)))
(define (special? wd) ;; first version
(member? wd '(* & ? !)))
(define (match-special placeholder pattern-rest sent) ;; first version
(cond ((equal? placeholder '?)
(longest-match pattern-rest sent 0 #t))
((equal? placeholder '!)
(longest-match pattern-rest sent 1 #t))
((equal? placeholder '*)
(longest-match pattern-rest sent 0 #f))
((equal? placeholder '&)
(longest-match pattern-rest sent 1 #f))))
</PRE>
<P><H2>Naming the Matched Text</H2>
<P>So far we've worked out how to match the four kinds of placeholders and
return a true or false value indicating whether a match is possible. Our
program is almost finished; all we need to make it useful is the
facility that will let us find out <EM>which</EM> words in the sentence
matched each placeholder in the pattern.
<P>We don't have to change the overall structure of the program in order to
make this work. But most of the procedures in the pattern matcher will have
to be given an additional argument, the database of placeholder names
and values that have been matched so far.<A NAME="text3" HREF="match#ft3">[3]</A> The formal parameter <CODE>known-values</CODE> will hold
this database. Its value will be a sentence containing placeholder names
followed by the corresponding words and an exclamation point to separate the
entries, as in the examples earlier in the chapter. When we begin the
search for a match, we use an empty sentence as the initial <CODE>known-values</CODE>:
<P><PRE>(define (match pattern sent)
(match-using-known-values pattern sent '()))
(define (match-using-known-values pattern sent known-values)
…)
</PRE>
<P>As <CODE>match-using-known-values</CODE> matches the beginning of a
pattern with the beginning of a sentence, it invokes itself recursively with
an expanded <CODE>known-values</CODE> containing each newly matched placeholder.
For example, in evaluating
<P><PRE>(match '(!twice !other !twice) '(cry baby cry))
</PRE>
<P>the program will call <CODE>match-using-known-values</CODE> four times:
<P><TABLE><TR><TH><CODE>pattern</CODE><TH><CODE>sent</CODE>
<TH><CODE>known-values</CODE>
<TR><TD><CODE>(!twice !other !twice) </CODE>
<TD><CODE>(cry baby cry) </CODE>
<TD><CODE>()</CODE>
<TR><TD><CODE>(!other !twice)</CODE>
<TD><CODE>(baby cry)</CODE>
<TD><CODE>(twice cry !)</CODE>
<TR><TD><CODE>(!twice)</CODE>
<TD><CODE>(cry)</CODE>
<TD><CODE>(twice cry ! other baby !)</CODE>
<TR><TD><CODE>()</CODE>
<TD><CODE>()</CODE>
<TD><CODE>(twice cry ! other baby !)</CODE>
</TABLE>
<P>In the first invocation, we try to match <CODE>!twice</CODE> against some part of
the sentence. Since <CODE>!</CODE> matches exactly one word, the only possibility
is to match the word <CODE>cry</CODE>. The recursive invocation, therefore, is made
with the first words of the pattern and sentence removed, but with the match
between <CODE>twice</CODE> and <CODE>cry</CODE> added to the database.
<P>Similarly, the second invocation matches <CODE>!other</CODE> with <CODE>baby</CODE> and
causes a third invocation with shortened pattern and sentence but a longer
database.
<P>The third invocation is a little different because the pattern contains the
placeholder <CODE>!twice</CODE>, but the name <CODE>twice</CODE> is already in the
database. Therefore, this placeholder can't match whatever word happens to
be available; it must match the same word that it matched before. (Our
program will have to check for this situation.) Luckily, the sentence does
indeed contain the word <CODE>cry</CODE> at this position.
<P>The final invocation reaches the base case of the recursion, because the
pattern is empty. The value that <CODE>match-using-known-values</CODE> returns is
the database in this invocation.
<P><H2>The Final Version</H2>
<P>We're now ready to show you the final version of the program. The program
structure is much like what you've seen before; the main difference is the
database of placeholder names and values. The program must add entries to
this database and must look for database entries that were added earlier.
Here are the three most important procedures and how they are changed
from the earlier version to implement this capability:
<P><P><TABLE><TR><TH align="right" valign="top">•<TD> <TD valign="top"><CODE>match-using-known-values</CODE>, essentially the same as what was
formerly named <CODE>match?</CODE> except for bookkeeping details.
</TABLE><TABLE><TR><TH align="right" valign="top">•<TD> <TD valign="top"><CODE>match-special</CODE>, similar to the old version, except that it must
recognize the case of a placeholder whose name has already been seen. In
this case, the placeholder can match only the same words that it matched
before.
</TABLE><TABLE><TR><TH align="right" valign="top">•<TD> <TD valign="top"><CODE>longest-match</CODE> and <CODE>lm-helper</CODE>, also similar to the old
versions, except that they have the additional job of adding to the database the
name and value of any placeholder that they match.
</TABLE>
<P>Here are the modified procedures. Compare them to the previous
versions.
<P><PRE>(define (<A NAME="g4"></A>match pattern sent)
(match-using-known-values pattern sent '()))
(define (<A NAME="g5"></A>match-using-known-values pattern sent known-values)
(cond ((empty? pattern)
(if (empty? sent) known-values 'failed))
((special? (first pattern))
(let ((placeholder (first pattern)))
(match-special (first placeholder)
(bf placeholder)
(bf pattern)
sent
known-values)))
((empty? sent) 'failed)
((equal? (first pattern) (first sent))
(match-using-known-values (bf pattern) (bf sent) known-values))
(else 'failed)))
(define (<A NAME="g6"></A>match-special howmany name pattern-rest sent known-values)
(let ((old-value (lookup name known-values)))
(cond ((not (equal? old-value 'no-value))
(if (length-ok? old-value howmany)
(already-known-match
old-value pattern-rest sent known-values)
'failed))
((equal? howmany '?)
(longest-match name pattern-rest sent 0 #t known-values))
((equal? howmany '!)
(longest-match name pattern-rest sent 1 #t known-values))
((equal? howmany '*)
(longest-match name pattern-rest sent 0 #f known-values))
((equal? howmany '&)
(longest-match name pattern-rest sent 1 #f known-values)))))
(define (<A NAME="g7"></A>longest-match name pattern-rest sent min max-one? known-values)
(cond ((empty? sent)
(if (= min 0)
(match-using-known-values pattern-rest
sent
(add name '() known-values))
'failed))
(max-one?
(lm-helper name pattern-rest (se (first sent))
(bf sent) min known-values))
(else (lm-helper name pattern-rest
sent '() min known-values))))
(define (<A NAME="g8"></A>lm-helper name pattern-rest
sent-matched sent-unmatched min known-values)
(if (< (length sent-matched) min)
'failed
(let ((tentative-result (match-using-known-values
pattern-rest
sent-unmatched
(add name sent-matched known-values))))
(cond ((not (equal? tentative-result 'failed)) tentative-result)
((empty? sent-matched) 'failed)
(else (lm-helper name
pattern-rest
(bl sent-matched)
(se (last sent-matched) sent-unmatched)
min
known-values))))))
</PRE>
<P>We haven't listed all of the minor procedures that these procedures invoke.
A complete listing is at the end of the chapter, but we hope that you have
enough confidence about the overall program structure to be able to assume
these small details will work. In the next few paragraphs we discuss some
of the ways in which the procedures shown here differ from the earlier
versions.
<P>In the invocation of <CODE>match-special</CODE> we found it convenient to split the
placeholder into its first character, the one that tells how many words can
be matched, and the butfirst, which is the name of the placeholder.
<P>What happens if <CODE>match-special</CODE> finds that the name is already in the
database? In this situation, we don't have to try multiple possibilities
for the number of words to match (the usual job of <CODE>longest-match</CODE>); the
placeholder must match exactly the words that it matched before. In this
situation, three things must be true in order for the match to succeed:
<A NAME="threematchconditions"></A>
(1) The first words of the <CODE>sent</CODE> argument must match the old value
stored in the database. (2) The partial <CODE>pattern</CODE> that remains after
this placeholder must match the rest of the <CODE>sent</CODE>. (3) The old value
must be consistent with the number of words permitted by the <CODE>howmany</CODE>
part of the placeholder. For example, if the pattern is
<P><PRE>(*stuff and !stuff)
</PRE>
<P>and the database says that the placeholder <CODE>*stuff</CODE> was
matched by three words from the sentence, then the second placeholder <CODE>!stuff</CODE> can't possibly be matched because it accepts only one word. This
third condition is actually checked first, by <CODE>length-ok?</CODE>, and if we
pass that hurdle, the other two conditions are checked by <CODE>already-known-match</CODE>.
<P>The only significant change to <CODE>longest-match</CODE> is that it invokes
<CODE>add</CODE> to compute an expanded database with the newly found match added,
and it uses the resulting database as an argument to <CODE>match-using-known-values</CODE>.
<P><H2>Abstract Data Types</H2>
<P>As you know, a database of known values is represented in this program as a
sentence in which the entries are separated by exclamation points. Where is
this representation accomplished in the program you've seen? There's
nothing like
<P><PRE>…(sentence old-known-values name value '!) …</PRE>
<P>anywhere in the procedures we've shown. Instead, the program makes
reference to the database of known values through two procedure calls:
<P><PRE>(lookup name known-values) ; in match-special
(add name matched known-values) ; in longest-match
</PRE>
<P>Only the procedures <CODE>lookup</CODE> and <CODE>add</CODE> manipulate the
database of known values:
<P><PRE>(define (<A NAME="g9"></A>lookup name known-values)
(cond ((empty? known-values) 'no-value)
((equal? (first known-values) name)
(get-value (bf known-values)))
(else (lookup name (skip-value known-values)))))
(define (<A NAME="g10"></A>get-value stuff)
(if (equal? (first stuff) '!)
'()
(se (first stuff) (get-value (bf stuff)))))
(define (<A NAME="g11"></A>skip-value stuff)
(if (equal? (first stuff) '!)
(bf stuff)
(skip-value (bf stuff))))
(define (<A NAME="g12"></A>add name value known-values)
(if (empty? name)
known-values
(se known-values name value '!)))
</PRE>
<P>These procedures are full of small details. For example, it's a little
tricky to extract the part of a sentence from a name to the next exclamation
point. It's convenient that we could write the more important procedures,
such as <CODE>longest-match</CODE>, without filling them with these details.
As far as <CODE>longest-match</CODE> knows, <CODE>lookup</CODE> and <CODE>add</CODE> could be
Scheme primitive procedures. In effect we've created a new data type, with
<CODE>add</CODE> as its constructor and <CODE>lookup</CODE> as its selector.
<P>Types such as these, that are invented by a programmer and aren't part of the
Scheme language itself, are called <EM><A NAME="g13"></A><A NAME="g14"></A>abstract data types.</EM>
<A NAME="g15"></A>
<A NAME="g16"></A>
Creating an abstract data type means drawing a barrier between an idea about
some kind of information we want to model in a program and the particular
mechanism that we use to represent the information. In this case, the
information is a collection of name-value associations, and the particular
mechanism is a sentence with exclamation points and so on. The pattern
matcher doesn't think of the database as a sentence. For example, it would
be silly to translate the database into Pig Latin or find its acronym.
<P>Just as we distinguish the <EM>primitive</EM> procedures that Scheme knows
all along from the <EM>compound</EM> procedures that the Scheme programmer
defines, we could use the names "primitive data type" for types such as
numbers and Booleans that are built into Scheme and "compound data type"
for ones that the programmer invents by defining selectors and constructors.
But "compound data type" is a bit of a pun, because it also suggests a data
type built out of smaller pieces, just as a compound expression is built of
smaller expressions. Perhaps that's why the name "abstract data type" has
become generally accepted. It's connected to the idea of abstraction
that we introduced earlier, because in order to create an abstract data type,
we must specify the selectors and constructors and give names to those
patterns of computation.
<P><H2>Backtracking and <CODE>Known-Values</CODE></H2>
<P>What happens to the database in cases that require backtracking, where
a particular recursive call might be "on the wrong track"? Let's trace
<CODE>match-using-known-values</CODE> and see what happens. (We'll use the
little-people model to discuss this example, and so we're annotating each
invocation in the trace with the name of its little person.)
<P><PRE>> (trace match-using-known-values)
> (match '(*start me *end) '(love me do))<SMALL><CODE>
(match-using-known-values (*start me *end) (love me do) ()) Martha
| (match-using-known-values (me *end) () (start love me do !)) Mercutio
| failed
| (match-using-known-values (me *end) (do) (start love me !)) Masayuki
| failed
| (match-using-known-values (me *end) (me do) (start love !)) Mohammad
| | (match-using-known-values (*end) (do) (start love !)) Mae
| | | (match-using-known-values () () (start love ! end do !)) Merlin
| | | (start love ! end do !)
| | (start love ! end do !)
| (start love ! end do !)
(start love ! end do !)
</CODE></SMALL>(START LOVE ! END DO !)
</PRE>
<P>Martha, the first little person shown, has an empty <CODE>known-values</CODE>. She
makes three attempts to match <CODE>*start</CODE> with parts of the sentence. In
each case, a little person is hired with the provisional match in his or her
<CODE>known-values</CODE>. (Actually, Martha does not directly hire Mercutio and
the others. Martha hires a <CODE>match-special</CODE> little person, who in turn
hires a <CODE>longest-match</CODE> specialist, who hires an <CODE>lm-helper</CODE>
specialist, who hires Mercutio. But that added complexity isn't important
for the point we're focusing on right now, namely, how backtracking can
work. Pretend Martha hires Mercutio.)
<P>If you don't use the little-people model, but instead think about the
program as if there were just one <CODE>known-values</CODE> variable, then the
backtracking can indeed be very mysterious. Once a provisional match is
added to the database, how is it ever removed? The answer is that it doesn't
work that way. There isn't a "the" database. Instead, each little person
has a separate database. If an attempted match fails, the little person who
reports the failure just stops working. For example, Martha hires Mercutio
to attempt a match in which the name <CODE>start</CODE> has the value <CODE>love me
do</CODE>. Mercutio is unable to complete the match, and reports failure. It is
Martha, not Mercutio, who then hires Masayuki to try another value for <CODE>start</CODE>. Martha's database hasn't changed, so Martha gives Masayuki a
database that reflects the new trial value but not the old one.
<P>Not every hiring of a little person starts from an empty database. When a
match is partially successful, the continuation of the same attempt must
benefit from the work that's already been done. So, for example, when
Mohammad hires Mae, and when Mae hires Merlin, each of them passes on an
extended database, not an empty one. Specifically, Mae gives Merlin the
new match of the name <CODE>end</CODE> with the value <CODE>do</CODE>, but also the match
of <CODE>start</CODE> with <CODE>love</CODE> that she was given by Mohammad.
<P>So as you can see, we don't have to do anything special to keep track of our
database when we backtrack; the structure of the recursion takes care of
everything for free.
<P><H2>How We Wrote It</H2>
<P>For explanatory purposes we've chosen to present the pieces of this program
in a different order from the one in which we actually wrote them. We <EM>did</EM> implement the easy placeholders (<CODE>!</CODE> and <CODE>?</CODE>) before the
harder ones. But our program had provision for a database of names from the
beginning.
<P>There is no "right" way to approach a programming problem. Our particular
approach was determined partly by our past experience. Each of us had
written similar programs before, and we had preconceived ideas about the
easy and hard parts. You might well start at a different point. For
example, here is an elegant small program we'd both been shown by friends:
<P><PRE>(define (match? pattern sent)
(cond ((empty? pattern) (empty? sent))
((empty? sent)
(and (equal? (first pattern) '*) (match? (bf pattern) sent)))
((equal? (first pattern) '*)
(or (match? pattern (bf sent))
(match? (bf pattern) sent)))
(else (and (equal? (first pattern) (first sent))
(match? (bf pattern) (bf sent))))))
</PRE>
<P>What's appealing about this is the funny symmetry of taking the
<CODE>butfirst</CODE> of the pattern <EM>or</EM> of the sentence. That's not
something you'd naturally think of, probably, but once you've worked out how
it can work, it affects your preconceptions when you set out to write a
pattern matcher yourself.
<P>Based on that inspiration, we might well have started with the hard cases
(such as <CODE>*</CODE>), with the idea that once they're in place, the easy cases
won't change the program structure much.
<P><H2>Complete Program Listing</H2>
<P><P><P><PRE>
(define (match pattern sent)
(match-using-known-values pattern sent '()))
(define (match-using-known-values pattern sent known-values)
(cond ((empty? pattern)
(if (empty? sent) known-values 'failed))
((special? (first pattern))
(let ((placeholder (first pattern)))
(match-special (first placeholder)
(bf placeholder)
(bf pattern)
sent
known-values)))
((empty? sent) 'failed)
((equal? (first pattern) (first sent))
(match-using-known-values (bf pattern) (bf sent) known-values))
(else 'failed)))
(define (special? wd)
(member? (first wd) '(* & ? !)))
(define (match-special howmany name pattern-rest sent known-values)
(let ((old-value (lookup name known-values)))
(cond ((not (equal? old-value 'no-value))
(if (length-ok? old-value howmany)
(already-known-match
old-value pattern-rest sent known-values)
'failed))
((equal? howmany '?)
(longest-match name pattern-rest sent 0 #t known-values))
((equal? howmany '!)
(longest-match name pattern-rest sent 1 #t known-values))
((equal? howmany '*)
(longest-match name pattern-rest sent 0 #f known-values))
((equal? howmany '&)
(longest-match name pattern-rest sent 1 #f known-values)))))
(define (length-ok? value howmany)
(cond ((empty? value) (member? howmany '(? *)))
((not (empty? (bf value))) (member? howmany '(* &)))
(else #t)))
(define (already-known-match value pattern-rest sent known-values)
(let ((unmatched (chop-leading-substring value sent)))
(if (not (equal? unmatched 'failed))
(match-using-known-values pattern-rest unmatched known-values)
'failed)))
(define (chop-leading-substring value sent)
(cond ((empty? value) sent)
((empty? sent) 'failed)
((equal? (first value) (first sent))
(chop-leading-substring (bf value) (bf sent)))
(else 'failed)))
(define (longest-match name pattern-rest sent min max-one? known-values)
(cond ((empty? sent)
(if (= min 0)
(match-using-known-values pattern-rest
sent
(add name '() known-values))
'failed))
(max-one?
(lm-helper name pattern-rest (se (first sent))
(bf sent) min known-values))
(else (lm-helper name pattern-rest
sent '() min known-values))))
(define (lm-helper name pattern-rest
sent-matched sent-unmatched min known-values)
(if (< (length sent-matched) min)
'failed
(let ((tentative-result (match-using-known-values
pattern-rest
sent-unmatched
(add name sent-matched known-values))))
(cond ((not (equal? tentative-result 'failed)) tentative-result)
((empty? sent-matched) 'failed)
(else (lm-helper name
pattern-rest
(bl sent-matched)
(se (last sent-matched) sent-unmatched)
min
known-values))))))
;;; Known values database abstract data type
(define (lookup name known-values)
(cond ((empty? known-values) 'no-value)
((equal? (first known-values) name)
(get-value (bf known-values)))
(else (lookup name (skip-value known-values)))))
(define (get-value stuff)
(if (equal? (first stuff) '!)
'()
(se (first stuff) (get-value (bf stuff)))))
(define (skip-value stuff)
(if (equal? (first stuff) '!)
(bf stuff)
(skip-value (bf stuff))))
(define (add name value known-values)
(if (empty? name)
known-values
(se known-values name value '!)))
</PRE><P>
<P><H2>Exercises about Using the Pattern Matcher</H2>
<P><B>16.1</B> Design and test a pattern that matches any sentence containing the word <CODE>C</CODE> three times (not necessarily next to each other).
<P>
<B>16.2</B> Design and test a pattern that matches a sentence consisting of two copies
of a smaller sentence, such as <CODE>(a b a b)</CODE>.
<P>
<B>16.3</B> Design and test a pattern that matches any sentence of no more than three
words.
<P>
<B>16.4</B> Design and test a pattern that matches any sentence of at least three words.
<P>
<B>16.5</B> Show sentences of length 2, 3, and 4 that match the pattern
<P><PRE>(*x *y *y *x)
</PRE>
<P>For each length, if no sentence can match the pattern, explain why
not.
<P>
<B>16.6</B> Show sentences of length 2, 3, and 4 that match the pattern
<P><PRE>(*x *y &y &x)
</PRE>
<P>For each length, if no sentence can match the pattern, explain why
not.
<P>
<B>16.7</B> List <EM>all</EM> the sentences of length 6 or less, starting with <CODE>a b a</CODE>, that match the pattern
<P><PRE>(*x *y *y *x)
</PRE>
<P>
<H2>Exercises about Implementation</H2>
<P><B>16.8</B> Explain how <CODE>longest-match</CODE> handles an empty sentence.
<P>
<B>16.9</B> Suppose the first <CODE>cond</CODE> clause in <CODE>match-using-known-values</CODE> were
<P><PRE>((empty? pattern) known-values)
</PRE>
<P>Give an example of a pattern and sentence for which the modified
program would give a different result from the original.
<P>
<B>16.10</B> What happens if the sentence argument—not the pattern—contains
the word <CODE>*</CODE> somewhere?
<P>
<B>16.11</B> For each of the following examples, how many <CODE>match-using-known-values</CODE>
little people are required?
<P><PRE>(match '(from me to you) '(from me to you))
(match '(*x *y *x) '(a b c a b))
(match '(*x *y *z) '(a b c a b))
(match '(*x hey *y bulldog *z) '(a hey b bulldog c))
(match '(*x a b c d e f) '(a b c d e f))
(match '(a b c d e f *x) '(a b c d e f))
</PRE>
<P>In general, what can you say about the characteristics that make a
pattern easy or hard to match?
<P>
<B>16.12</B> Show a pattern with the following two properties: (1) It has at least two
placeholders. (2) When you match it against any sentence, every invocation
of <CODE>lookup</CODE> returns <CODE>no-value</CODE>.
<P>
<B>16.13</B> Show a pattern and a sentence that can be used as arguments to <CODE>match</CODE>
so that <CODE>lookup</CODE> returns <CODE>(the beatles)</CODE> at some point during the
match.
<P>
<B>16.14</B> Our program can still match patterns with unnamed placeholders.
How would it affect the operation of the program if these unnamed
placeholders were added to the database? What part of the program keeps
them from being added?
<P>
<B>16.15</B> Why don't <CODE>get-value</CODE> and <CODE>skip-value</CODE> check for an empty argument
as the base case?
<P>
<B>16.16</B> Why didn't we write the first <CODE>cond</CODE> clause in <CODE>length-ok?</CODE> as
the following?
<P><PRE>((and (empty? value) (member? howmany '(? *))) #t)
</PRE>
<P>
<B>16.17</B> Where in the program is the initial empty database of known values
established?
<P>
<B>16.18</B> For the case of matching a placeholder name that's already been matched in
this pattern, we said on page <A HREF="match.html#threematchconditions">there</A> that three conditions
must be checked. For each of the three, give a pattern and sentence that
the program would incorrectly match if the condition were not checked.
<P>
<B>16.19</B> What will the following example do?
<P><PRE>(match '(?x is *y !x) '(! is an exclamation point !))
</PRE>
<P>Can you suggest a way to fix this problem?
<P>
<B>16.20</B> Modify the pattern matcher so that a placeholder of the form <CODE>*15x</CODE> is like <CODE>*x</CODE> except that it can be matched only by exactly 15
words.
<P><PRE>> (match '(*3front *back) '(your mother should know))
(FRONT YOUR MOTHER SHOULD ! BACK KNOW !)
</PRE>
<P>
<B>16.21</B> Modify the pattern matcher so that a <CODE>+</CODE> placeholder (with or without a
name attached) matches only a number:
<P><PRE>> (match '(*front +middle *back) '(four score and 7 years ago))
(FRONT FOUR SCORE AND ! MIDDLE 7 ! BACK YEARS AGO !)
</PRE>
<P>The <CODE>+</CODE> placeholder is otherwise like <CODE>!</CODE>—it must match
exactly one word.
<P>
<B>16.22</B> Does your favorite text editor or word processor have a search command that
allows you to search for patterns rather than only specific strings of
characters? Look into this and compare your editor's capabilities with
that of our pattern matcher.
<P>
<P>
<HR>
<A NAME="ft1" HREF="match#text1">[1]</A> Why not return the
sentence if successful or <CODE>#f</CODE> otherwise? That would be fine in most
versions of Scheme, but as we mentioned earlier, the empty sentence <CODE>()</CODE>
is the same as the false value <CODE>#f</CODE> in some dialects. In those
Schemes, a successfully matched pattern with no named placeholders, for
which the program should return an empty sentence, would be
indistinguishable from an unmatched pattern.<P>
<A NAME="ft2" HREF="match#text2">[2]</A> Actually, since <CODE>or</CODE> is a special form, Scheme avoids
the need to try the second alternative if the first one succeeds.<P>
<A NAME="ft3" HREF="match#text3">[3]</A> The word <EM>database</EM>
has two possible meanings in computer science, a broad meaning and a narrow
one. The broad meaning, which we're using here, is a repository of
information to which the program periodically adds new items for later
retrieval. The narrow meaning is a collection of information that's
manipulated by a <EM>database program,</EM> which provides facilities for
adding new information, modifying existing entries, selecting entries that
match some specified criterion, and so on. We'll see a database program
near the end of the book.<P>
<P><A HREF="../ss-toc2.html">(back to Table of Contents)</A><P>
<A HREF="../ssch15/poker.html"><STRONG>BACK</STRONG></A>
chapter thread <A HREF="../ssch17/part5.html"><STRONG>NEXT</STRONG></A>
<P>
<ADDRESS>
<A HREF="../index.html">Brian Harvey</A>,
<CODE>bh@cs.berkeley.edu</CODE>
</ADDRESS>
</BODY>
</HTML>
|