<P><HTML> <HEAD> <TITLE>Computer Science Logo Style vol 2 ch 11: Example: Cryptographer's Helper</TITLE> </HEAD> <BODY> <CITE>Computer Science Logo Style</CITE> volume 2: <CITE>Advanced Techniques</CITE> 2/e Copyright (C) 1997 MIT <H1>Example: Cryptographer's Helper</H1> <TABLE width="100%"><TR><TD> <IMG SRC="../csls2.jpg" ALT="cover photo"> <TD><TABLE> <TR><TD align="right"><CITE><A HREF="http://www.cs.berkeley.edu/~bh/">Brian Harvey</A><BR>University of California, Berkeley</CITE> <TR><TD align="right"><BR> <TR><TD align="right"><A HREF="../pdf/v2ch11.pdf">Download PDF version</A> <TR><TD align="right"><A HREF="../v2-toc2.html">Back to Table of Contents</A> <TR><TD align="right"><A HREF="../v2ch10/v2ch10.html"><STRONG>BACK</STRONG></A> chapter thread <A HREF="https://people.eecs.berkeley.edu/~bh/v2ch12/v2ch12.html"><STRONG>NEXT</STRONG></A> <TR><TD align="right"><A HREF="https://mitpress.mit.edu/books/computer-science-logo-style-second-edition-volume-2">MIT Press web page for <CITE>Computer Science Logo Style</CITE></A> </TABLE></TABLE> <HR><P>Program file for this chapter: <A HREF="crypto.lg"><CODE>crypto</CODE></A> <P> <P>A <EM>cryptogram</EM> is a kind of word puzzle, like a crossword puzzle. Instead of definitions, though, a cryptogram gives you the actual words of a quotation, but with each letter replaced with a different letter. For example, each letter A in the original text might be replaced with an F. Here is a sample cryptogram: <P> LB RA, BT YBL LB RA: LJGL CQ LJA FUAQLCBY: KJALJAT 'LCQ YBRXAT CY LJA DCYP LB QUSSAT LJA QXCYWQ GYP GTTBKQ BS BULTGWABUQ SBTLUYA, BT LB LGHA GTDQ GWGCYQL G QAG BS LTBURXAQ, GYP RM BIIBQCYW AYP LJAD? <P> <P>The punctuation marks and the spaces between words are the same in this cryptogram as they are in the original ("clear") text. <P> A cryptogram is a kind of secret code. The formal name for this particular kind of code is a <EM>simple substitution cipher.</EM> Strictly speaking, a <EM>code</EM> is a method of disguising a message that uses a dictionary of arbitrarily chosen replacements for each possible word. A foreign language is like a code. A <EM>cipher</EM> is a method in which a uniform algorithm or formula is used to translate each word. A <EM>substitution</EM> cipher is one in which every letter (or sometimes every pair of letters, or some such grouping) is replaced by a disguised equivalent. A <EM> simple</EM> substitution cipher is one in which each letter has a single equivalent replacement, which is used throughout the message. (A more complicated substitution cipher might be something like this: the first letter A in the message is replaced with F, the second A is replaced with G, the third with H, and so on.) <P>Years ago, Arthur Conan Doyle and Edgar Allen Poe were able to write mystery stories in which simple substitution ciphers were used by characters who really wanted to keep a message secret. Today, partly because of those stories, too many people know how to "break" such ciphers for them to be of practical use. Instead, these ciphers are used as word puzzles. <P>The technique used for decoding a cryptogram depends on the fact that some letters are more common than others. The letter A is much more common in English words than the letter Z. If, in a cryptogram, the letter F occurs many times, it's more likely to represent a letter like A in the original text than a letter like Z. <P>The most commonly used letter in English is E, by a wide margin. T is in second place, with A and O nearly tied for third. I, N, and R are also very commonly used. These rankings apply to <EM>large</EM> texts. In the usual short cryptogram, the most frequent letter doesn't necessarily represent E. But the letter that represents E will probably be among the two or three most frequent. <P>Before reading further, you might want to try to solve the cryptogram shown above. Make a chart of the number of times each letter appears, then use that information to make guesses about which letter is which. As you're working on it, make a note of what other kinds of information are helpful to you. <P>This project is a program to help you solve cryptograms. The program doesn't solve the puzzle all by itself; it doesn't know enough about English vocabulary. But it does some of the more boring parts of the job automatically, and can make good guesses about some of the letters. <P>The top-level procedure is <CODE>crypto</CODE> . It takes one input, a list whose members are the words of the cryptogram. Since these lists are long and easy to make mistakes in, you'll probably find it easier to type the cryptogram into the Logo editor rather than directly at a question mark prompt. You might make the list be the value of a variable, then use that variable as the input to <CODE>crypto</CODE>. (The program file for this project includes four such variables, named <CODE>cgram1</CODE> through <CODE> cgram4</CODE>, with sample cryptograms.) <P><CODE>Crypto</CODE> begins by going through the coded text, letter by letter. It keeps count of how often each letter is used. You can keep track of this counting process because the program draws a <EM>histogram</EM> on the screen as it goes. A histogram is a chart like the one at the top of the next page. <HR> <P> <PRE> L B L AB L AB L AB L AB L AB L Q AB L Q Y AB G L Q T Y AB G L Q T Y AB G L Q T Y ABC G L Q T Y ABC G J L Q T Y ABC G J L Q TU Y ABC G J L QRSTU Y ABC G J L PQRSTU W Y ABCD G J L PQRSTU WXY ABCD G IJKL PQRSTU WXY ABCD FGHIJKLM PQRSTU WXY </PRE> <P> Histogram <HR> <P><PRE><U>A-17-E</U> <U>B-18- </U> C-08- D-03- E F-01- G-11-A H-01- I-02- J-07-H K-02- <U>L-19-T</U> M-01- N O P-04- Q-13- R-05- S-05- T-11- U-06- V W-04- X-03- Y-12- Z <U>A</U>BCD<U>E</U>FG<U>H</U>IJKLMNOPQRS<U>T</U>UVWXYZ <U>LB RA, BT YBL LB RA: LJGT CQ LJA</U> T E, T T E: THAT THE <U>FUAQLCBY: KJALJAT 'LCQ YBRXAT CY LJA</U> E T : HETHE 'T E THE <U>DCYP LB QUSSAT LJA QXCYWQ GYP GTTBKQ</U> T E THE A A <U>BS BULTGWABUQ SBTLUYA, BT LB LGHA</U> T A E T E, T TA E <U>GTDQ GWGCYQL G QAG BS LTBURXAQ, GYP</U> A A A T A EA T E , A <U>RM BIIBQCYW AYP LJAD?</U> E THE ? </PRE> <P> Screen display <HR> <P>A histogram is a kind of graph, but it's different from the <EM>continuous</EM> graphs you use in algebra. Histograms are used to show quantities of <EM>discrete</EM> things, like letters of the alphabet. <P>The main reason the program draws the histogram is that it needs to know the frequencies of occurrence of the letters for later use. When I first wrote the program, it counted the letters without printing anything on the screen. Since this counting is a fairly slow process, it got boring waiting for the program to finish. The histogram display is a sort of video thumb-twiddling to keep you occupied while the program is creating an invisible histogram inside itself. <P>By the way, since there are only 24 lines on the screen, the top part of the histogram may be invisible if the cryptogram is long enough to use some letters more than 24 times. <P>The shape of this histogram is pretty typical. A few letters are used many times, while most letters are clumped down near the bottom. In this case, A, B, and L stand out. You might guess that they represent the most commonly used letters: E, T, and either A or O. But you need more information to be able to guess which is which. <P> After it finishes counting letters, the program presents a screen display like the one shown above. The information provided in this display comes in three parts. At the top is an alphabetical list of the letters in the cryptogram. For each letter, the program displays the number of times that letter occurs in the enciphered text. For example, the letter P occurs four times. The letter that occurs most frequently is highlighted by showing it in reverse video characters, represented here with underlined characters. In this example, the most frequently used letter is L, with 19 occurrences. Letters with occurrence counts within two of the maximum are also highlighted. In the example, A with 17 and B with 18 are highlighted. If a letter does not occur in the cryptogram at all, no count is given. In the example, there is no E in the enciphered text. <P>The top part of the display shows one more piece of information: if either the program or the person using it has made a guess as to the letter that a letter represents, that guess is shown after the frequency count. For example, here the program has guessed that the letter L in the cryptogram represents the letter T in the clear text. (You can't tell from the display that this guess was made by the program rather than by the person using it. I just happen to know that that's what happened in this example!) <P>The next section of the display is a single line showing all the letters of the alphabet. In this line, a letter is highlighted if a guess has been made linking some letter in the cryptogram with that letter in the clear text. In other words, this line shows the linkages in the reverse direction from what is shown in the top section of the display. For example, I just mentioned that L in the cryptogram corresponds to T in the clear text. In the top part of the display, we can find L in alphabetical order, and see that it has a T linked to it. But in the middle part of the display, we find <EM>T</EM>, not L, in alphabetical order, and discover that <EM>something</EM> is linked to it. (It turns out that we don't usually have to know which letter corresponds to T.) <P>Here is the purpose of that middle section of the display: Suppose I am looking at the second word of the cryptogram, RA. We've already guessed that A represents E, so this word represents something-E. Suppose I guess that this word is actually HE. This just happens to be the first two-letter word I think of that ends in E. So I'd like to try letting R represent H. Now I look in the middle section of the display, and I see that H is already highlighted. So some other letter, not R, already represents H. I have to try a different guess. <P>The most important part of the display is the bottom section. Here, lines of cryptogram alternate with their translation into clear text, based on the guesses we've made so far. The cryptogram lines are highlighted, just to make it easy to tell which lines are which. The program ensures that each word entirely fits on a single line; there is no wrapping to a new line within a single word. <P>There is room on the screen for eight pairs of lines. If the cryptogram is too big to fit in this space, only a portion of it will be visible at any time. In a few paragraphs I'll talk about moving to another section of the text. <P>The program itself is very limited in its ability to guess letters. For the most part, you have to do the guessing yourself when you use it. There are three guessing rules in the program: <P> <P><P> <OL><LI> The most frequently occurring single-letter word is taken to represent A. <LI> Another single-letter word, if there is one, is taken to represent I. <LI> The most frequently occurring three-letter word is taken to represent THE, but only if its last letter is one of the ones highlighted in the top part of the display. <P></OL> <P> <P>In the example, the only single-letter word in the cryptogram is G, in the next-to-last line. The program, following rule 1, has guessed that G represents A. Rule 2 did not apply, because there is no second single-letter word. The most frequently used three-letter word is LJA, which occurs three times. The last letter of that word, A, is highlighted in the top section because it occurs 17 times. Therefore, the program guesses that L represents T, J represents H, and A represents E. <P>Of course you understand that these rules are not infallible; they're just guesses. (A fancy name for a rule that works most of the time is a <EM>heuristic.</EM> A rule that works all the time is called an <EM>algorithm.</EM>) For example, the three-letter word GYP appears twice in the cryptogram, only once less often than LJA. Maybe GYP is really THE. However, the appearance of the word THAT in the translation of the first line is a pretty persuasive confirmation that the program's rules have worked out correctly in this case. <P>If you didn't solve the cryptogram on your own, at my first invitation, you might want to take another look at it, based on the partial solution you now have available. Are these four letters (A, E, I, and T) enough to let you guess the rest? It's a quotation you'll probably recognize. <P>Once this display is on the screen, you can make further guesses by typing to the program. For example, suppose you decide that the last word of the cryptogram, LJAD, represents THEM. Then you want to guess that D represents M. To do that, type the letters D and M in that order. Don't use the RETURN key. Your typing will not be echoed on the screen. Instead, three things will happen. First, the entry in the top section of the display that originally said <P><PRE>D-03- </PRE> <P>will be changed to say <P><PRE>D-03-M </PRE> <P>Second, the letter M will be highlighted in the alphabet in the second section of the display. Finally, the program will type an M underneath every D in the cryptogram text. <P>If you change your mind about a guess, you can just enter a new guess about the same cryptogram letter. For example, if you decide that LJAD is really THEY instead of THEM, you could just type D and Y. Alternatively, if you decide a guess was wrong but you don't have a new guess, type the cryptogram letter (D in this example) and then the space bar. <P>If you guess that D represents M, and then later you guess that R also represents M, the program will complain at you by beeping or by flashing the screen, depending on what your computer can do. If you meant that R should represent M <EM>instead of</EM> D representing M, you must first undo the latter guess by typing D, space bar, R, and M. <P>The process of redisplaying the clear text translation of the cryptogram after each guess takes a fairly long time, since the program has to look up each letter individually. Therefore, the program is written so that you don't have to wait for this redisplay to finish before guessing another letter representation. As soon as you type any key on the keyboard, the program stops retyping the clear text. Whatever key you typed is taken as the first letter of a two-letter guess command. <P>If the cryptogram is too long to fit on the screen, there are three other things you can type to change which part of the text is visible. Typing a plus sign (+) eliminates the first four lines of the displayed text (that is, four lines of cryptogram and four corresponding lines of cleartext) and brings in four new lines at the end. Typing a minus sign (-) moves backwards, eliminating the four lines nearest the bottom of the screen and bringing back four earlier lines at the top. These <EM>windowing</EM> commands have no effect if you are already seeing the end of the text (for +) or the beginning of the text (for -). <P>The third command provided for long cryptograms is the atsign (@) character. This is most useful after you've figured out all of the letter correspondences. It clears the screen and displays only the clear text, without the letter frequencies, table of correspondences, or the enciphered text. This display allows 23 lines of clear text to fit on the screen instead of only eight. If you don't have the solution exactly right, you can type any character to return to the three-part display and continue guessing. <P>The program never stops; even after you have made guesses for all the letters, you might find an error and change your mind about a guess. When you're done, you stop the program with control-C or command-period or whatever your computer requires. <P>In the complete listing at the end of this chapter, there are a few cryptograms for you to practice with. They are excerpted from one of my favorite books, <EM>Compulsory Miseducation</EM> by Paul Goodman. <P><H2>Program Structure</H2> <P>There are about 50 procedures in this program. These procedures can be roughly divided into several purposes: <P> <UL> <LI>initialization <LI>frequency counting and displaying the histogram <LI>guessing letters automatically <LI>reading user commands <LI>keeping track of guesses <LI>top section of display (frequencies) <LI>middle section of display (alphabet) <LI>bottom section of display (cryptogram text and cleartext) <LI>windowing and full-text displays <LI>data abstraction and other helper procedures </UL> <P>The diagram on the next page shows superprocedure/subprocedure relationships within the main categories. (Helper procedures aren't shown, to make the diagram more readable.) The bottom half of the diagram has the procedures that are concerned primarily with presenting information on the screen. <CODE>Redisplay</CODE>, near the center of the diagram, is called whenever the entire screen display must be redrawn: when the initialization part of the program is finished, and whenever the user chooses a new portion of the text to display. When the display changes slightly, because a new guess is made, procedures such as <CODE>fixtop</CODE>, <CODE>light</CODE>, and <CODE>dark</CODE> are used instead of redrawing everything. <P> <CENTER><IMG SRC="cryptoflow.gif" ALT="figure: cryptoflow"></CENTER> <P><CODE>Bind</CODE> is the most important procedure, because it records and displays each new guess. As the diagram shows, it invokes several subprocedures to update the display; more importantly, it changes the values of several variables to keep track of the new guess. There is also a similar procedure <CODE>qbind</CODE> that's used when a guess is made by the program rather than by the user. (The "Q" stands for either "quick" or "quiet," since this version never has to undo an old guess, omits some error checking, and can't beep because there are no errors in automatic guesses.) If you ignore initialization and displaying information, the entire structure of the program is that <CODE>crypto</CODE> calls <CODE>parseloop</CODE>, which repeatedly calls <CODE>parsekey</CODE>, which calls <CODE>bind</CODE> to record a guess. <P>Unfortunately, it's not so easy in practice to divide up the procedures into groups, with a single purpose for each group. Several procedures carry out two tasks at once. For example, <CODE>light</CODE> and <CODE>dark</CODE> have those names because they switch individual letters between normal and inverse video in the alphabet display in the middle part of the screen. But those procedures also set variables to remember that a particular cleartext letter has or hasn't been guessed, so they are also carrying out part of <CODE>bind</CODE>'s job, keeping track of guesses. <P><H2>Guided Tour of Global Variables</H2> <P><CODE>Crypto</CODE> uses many global variables to hold the information it needs. This includes information about individual letters, about words, and about the text as a whole. <P>There are several sets of 26 variables, one for each letter of the alphabet. For these variables, the last letter of the variable name is the letter about which the variable holds information. In the table that follows, the italic <EM>x</EM> in each name represents any letter. <P><TABLE><TR><TH align="right" valign="top"><EM>x</EM><TD> <TD valign="top">Cleartext letter that is guessed to match <EM>x</EM> in the cryptogram. <TR><TH align="right" valign="top"><CODE>bound</CODE><EM>x</EM><TD> <TD valign="top"><CODE>True</CODE> if <EM>x</EM> appears in the <EM>cleartext</EM> as guessed so far; <CODE>false</CODE> otherwise. <TR><TH align="right" valign="top"><CODE>cnt</CODE><EM>x</EM><TD> <TD valign="top">Count of how many times <EM>x</EM> appears in the cryptogram. <TR><TH align="right" valign="top"><CODE>posn</CODE><EM>x</EM><TD> <TD valign="top">Screen cursor position where the frequency count and guess for <EM>x</EM> is shown in the top part of the display. </TABLE> <P>These variables are set up initially by <CODE>initvars</CODE>, except for the <CODE>posn</CODE> variables, which are set by <CODE>showrow</CODE>. The variables with single-letter names start out with a space character as their value. This choice allows <CODE>showclear</CODE> to use <CODE>thing :letter</CODE> as the thing to type for every letter in the cryptogram. If no guess has been made for a letter, it will be displayed as a blank space in the partially-decoded version of the text. <P>Here are the variables that have to do with <EM>words</EM> in the cryptogram text. These variables are needed for the part of the program that automatically makes guesses, by looking for words that might represent A, I, and THE in the cleartext. In the following variable names, <EM>y</EM> represents either a one-letter word or a three-letter word in the cryptogram text. <P><TABLE><TR><TH align="right" valign="top"><CODE>count.single</CODE> <TD> <TD valign="top">The number of occurrences of the most frequent one-letter word. <TR><TH align="right" valign="top"><CODE>count.triple</CODE> <TD> <TD valign="top">The number of occurrences of the most frequent three-letter word. <TR><TH align="right" valign="top"><CODE>list.single</CODE> <TD> <TD valign="top">List of one-letter words in the cryptogram text. <TR><TH align="right" valign="top"><CODE>list.triple</CODE> <TD> <TD valign="top">List of three-letter words in the cryptogram text. <TR><TH align="right" valign="top"><CODE>max.single</CODE> <TD> <TD valign="top">The most frequent one-letter word in the cryptogram text. <TR><TH align="right" valign="top"><CODE>max.triple</CODE> <TD> <TD valign="top">The most frequent three-letter word in the cryptogram text. <TR><TH align="right" valign="top"><CODE>single</CODE><EM>y</EM> <TD> <TD valign="top">The number of occurrences of the one-letter word <EM>y.</EM> <TR><TH align="right" valign="top"><CODE>triple</CODE><EM>y</EM> <TD> <TD valign="top">The number of occurrences of the three-letter word <EM>y.</EM> </TABLE> <P>These variables are used only during the initial histogram counting, to keep track of which one-letter word and which three-letter word are the most frequent in each category. Once the most frequently occurring words have been determined, the actual count is no longer important. <P>Finally, there are some variables that contain information about the text as a whole: <P><TABLE><TR><TH align="right" valign="top"><CODE>fulltext</CODE> <TD> <TD valign="top">The complete cryptogram text. <TR><TH align="right" valign="top"><CODE>text</CODE> <TD> <TD valign="top">The part of the cryptogram that is displayed on the screen right now. <TR><TH align="right" valign="top"><CODE>moretext</CODE> <TD> <TD valign="top">The part of the text that should be displayed after a <CODE>+</CODE> command. <TR><TH align="right" valign="top"><CODE>textstack</CODE> <TD> <TD valign="top">A list of old values of <CODE>text</CODE>, to be restored if the <CODE>-</CODE> command is used. <TR><TH align="right" valign="top"><CODE>maxcount</CODE> <TD> <TD valign="top">The number of occurrences of the most frequently used letter. </TABLE> <P><CODE>:Maxcount</CODE> is used to know which letters should be highlighted in the top section of the display. <CODE>:Text</CODE> is used by <CODE>showcode</CODE> and <CODE>showclear</CODE> to maintain the bottom section of the display. <CODE> Fulltext</CODE>, <CODE>moretext</CODE>, and <CODE>textstack</CODE> are part of the windowing feature. At first, <CODE>text</CODE> is equal to <CODE>fulltext</CODE>, and <CODE> textstack</CODE> is empty. <CODE>Moretext</CODE> contains the portion of the text starting on the fifth line that is displayed, providing there is some text at the end of the cryptogram that didn't fit on the screen. If the end of the text is visible, then <CODE>moretext</CODE> is empty. Here is what happens if you type the plus sign: <P><PRE>to moretext if emptyp :moretext [beep stop] push "textstack :text make "text :moretext redisplay "true end </PRE> <P>If <CODE>:moretext</CODE> is empty, there is no more text to display, and the procedure stops with a complaint. Otherwise, we want to remember what is now in <CODE>:text</CODE> in case of a later <CODE>-</CODE> command, and we want to change the value of <CODE>text</CODE> to the version starting four lines later that is already in <CODE>:moretext</CODE>. <P>In the solitaire project, I used a lot of <CODE>local</CODE> instructions in the top-level procedures to avoid creating global variables. In this project, I didn't bother. There's no good reason why I was lazier here than there; you can decide for yourself whether you think it's worth the effort. <P><H2>What's In a Name?</H2> <P>In revising this program for the second edition, I was struck by the ways in which bad choices of procedure or variable names had made it needlessly hard to read. Changing names was one of the three main ways in which I changed the program. (The other two were an increased use of data abstraction and the introduction of iteration tools to eliminate some helper procedures.) <P>I'll start with a simple example. As I've mentioned, when I first wrote the program it didn't draw the histogram on the screen during the initial counting of letter frequencies. Since the top part of the screen display is primarily a presentation of those frequencies, I thought of that top part as the program's "histogram" even though it doesn't have the form of a real histogram. That's why, in the first edition, the procedures that maintain the top part of the display were called <CODE>showhist</CODE>, <CODE>fixhist</CODE>, and so on; when I added the <CODE>histogram</CODE> and <CODE>histlet</CODE> procedures that draw the real histogram, it was hard to keep track of which "<CODE>hist</CODE>" names were part of the initial histogram and which were part of the letter frequency chart at the top of the program's normal screen display. I've now changed <CODE>showhist</CODE> to <CODE>showtop</CODE>, <CODE>fixhist</CODE> to <CODE>fixtop</CODE>, and so on. The procedures with <CODE>hist</CODE> in their names are about the real histogram, and the ones with <CODE>top</CODE> in their names are about the frequency chart. <P>Here's another example. In several parts of the program, I had to determine whether a character found in the cryptogram text is a letter or a punctuation mark. The most straightforward way to do this would be an explicit check against all the letters in the alphabet: <P><PRE>to letterp :char output memberp :char "ABCDEFGHIJKLMNOPQRSTUVWXYZ end </PRE> <P>But comparing the character against each of the 26 letters would be quite slow. Instead, I took advantage of the fact that there happen to be variables in the program named after each letter. That is, there's a variable <CODE>A</CODE>, a variable <CODE>B</CODE>, and so on, but there aren't variables named after punctuation characters. Therefore, I could use the Logo primitive <CODE>namep</CODE> to see whether or not the character I'm considering is a variable name, and if so, it must be a letter. The first edition version of <CODE>crypto</CODE> is full of instructions of the form <P><PRE>if namep :char ... </PRE> <P>This is clever and efficient, but not at all self-documenting. Someone reading the program would have no way to tell that I'm using <CODE>namep</CODE> to find out whether a character is a letter. The solution was to add an instruction to the initialization in <CODE>crypto</CODE>: <P><PRE>copydef "letterp "namep </PRE> <P>The <CODE>copydef</CODE> primitive is used to give a new name to an existing procedure. (The old name continues to work.) The existing procedure can be either primitive or user-defined. The new name is not saved by the <CODE>save</CODE> command; that's why <CODE>crypto</CODE> performs the <CODE>copydef</CODE> instruction each time. <P>Probably the worst example of bad naming was in the <CODE>tally</CODE> procedure. This procedure has a complicated job; it must keep track of the most common one-letter and three-letter words, in preparation for the program's attempts to make automatic guesses for A, I, and THE. Here is the version in the first edition: <P><PRE>to tally :type :word local "this make "this word :type :word if not memberp :word list. :type ~ [setlist. :type fput :word list. :type make :this 0] make :this sum 1 thing :this make "this thing :this if :this > (count. :type) ~ [setcount. :type :this make (word "max. :type) :word] end </PRE> <P>The input named <CODE>type</CODE> is either the word <CODE>single</CODE> or the word <CODE>triple</CODE>. One thing that makes this procedure hard to read is the local variable named <CODE>this</CODE>. What a vague name! This what? Is it this word, or this letter, or this word length, or this guess? To make things worse, partway through the procedure I recycled the same name to hold a different value. At first, <CODE>:this</CODE> is a word that will be used as the name of a variable, counting the number of times a given word appears. For example, if the word YBL appears in the cryptogram, then <CODE>tally</CODE> will create a variable named <CODE>tripleybl</CODE> whose value will be the number of times that YBL occurs in the text. The value of <CODE>this</CODE> will be the word <CODE>tripleybl</CODE>, so the expression <CODE>thing :this</CODE> represents the actual number. Then, near the end of the procedure, I used the instruction <P><PRE>make "this thing :this </PRE> <P>From then on, <CODE>:this</CODE> is the number itself, not the variable name! It's really hard to read a procedure in which the same name is used to mean different things in different instructions. <P>Here's the new version: <P><PRE>to tally :type :word localmake "countvar word :type :word if not memberp :word list. :type ~ [setlist. :type fput :word list. :type make :countvar 0] localmake "count (thing :countvar)+1 make :countvar :count if :count > (count. :type) ~ [setcount. :type :count setmax. :type :word] end </PRE> <P>The name <CODE>this</CODE> is gone. Instead, I've first created a local variable named <CODE>countvar</CODE> whose value is the name of the count variable. Then I create another local variable named <CODE>count</CODE> that contains the actual count. These names are much more descriptive of the purposes of the two variables. <P>Another change in the new version is a more consistent use of data abstraction. The original version used the constructor <CODE>setlist.</CODE> and the selector <CODE>list.</CODE> to refer to the list of all known cryptogram words of the appropriate length (the variable <CODE>list.single</CODE> or <CODE>list.triple</CODE>), but used the instruction <P><PRE>make (word "max. :type) :word </PRE> <P>to construct the variable containing the most frequently appearing word of that length. The new version uses a constructor named <CODE>setmax.</CODE> that's analogous to the <CODE>setlist.</CODE> constructor. <P>Rethinking the names of procedures can reorganize your ideas about how to group the procedures into categories. For example, in the first edition I was upset about the fact that <CODE>historgram</CODE>, whose job is to count letter frequencies and draw the histogram of those counts, also invokes prepare.guess, whose job is to count <EM>word</EM> frequencies in preparation for automatic guessing. <P><CENTER><IMG SRC="histflow.gif" ALT="figure: histflow"></CENTER> <P>The reason for this mixture of tasks is efficiency. To prepare the histogram, the program must extract the letters (omitting punctuation) from each word of the text, and count them. To prepare for guessing words, the program must extract the letters from each word, and count the occurrences of the letters-only words. I could have done these things separately: <P><PRE>to histogram :text foreach :text [foreach (filter "letterp ?) "histlet] end to count.words :text foreach :text [prepare.guess (filter "letterp ?)] end </PRE> <P>But it seemed better to scan the words of the text just once, and extract the letters from each word just once: <P><PRE>to histogram :text foreach :text [localmake "word filter "letterp ? foreach :word "histlet prepare.guess :word] end </PRE> <P>But the punch line of this story is that I could avoid the confusing jump between boxes--the feeling of mixing two tasks--merely by changing the name of the <CODE>histogram</CODE> procedure to something neutral like <CODE>preprocess</CODE>. Then the structure would be <P><CENTER><IMG SRC="preprocess.gif" ALT="figure: preprocess"></CENTER> <P>Now we have one initialization procedure that includes invocations for two separate kinds of preprocessing. It's not really the program structure that is inappropriate, but only using the name <CODE>histogram</CODE> for a procedure whose job includes more than the creation of the histogram. <P> <P><H2>Flag Variables</H2> <P>Procedure <CODE>redisplay</CODE> has the job of redrawing the entire screen when there is a major change to what should be shown, like moving to a different window in the cryptogram text. <P><PRE>to redisplay :flag cleartext showtop alphabet showcode :text if :flag [showclear :text] end </PRE> <P>The input to <CODE>redisplay</CODE> is a <EM>flag variable.</EM> It must have the value <CODE>true</CODE> or <CODE>false</CODE>. (The name comes from the flags on mailboxes, which are either up or down to indicate whether or not there is mail in the box.) It's there because <CODE>redisplay</CODE> has two slightly different jobs to do at two different points in the program. First, <CODE>redisplay</CODE> is invoked by <CODE>crypto</CODE>, the top-level procedure, to draw the screen initially. At this time, no letters have been guessed yet. Therefore, it is not necessary to invoke <CODE>showclear</CODE> (which indicates the guessed letters in the bottom part of the display). <CODE>Crypto</CODE> executes the instruction <P><PRE>redisplay "false </PRE> <P>to avoid that unnecessary work. <CODE>Redisplay</CODE> is also invoked by <CODE>moretext</CODE>, <CODE>lesstext</CODE>, and <CODE>showclear</CODE>. Each of these procedures uses the instruction <P><PRE>redisplay "true </PRE> <P>to include <CODE>showcode</CODE>. If the flag variable weren't used, there would have to be two different versions of <CODE>redisplay</CODE>. <P>I used the latter technique in the procedures <CODE>bind</CODE> and <CODE> qbind</CODE>. These could also have been one procedure with a flag variable input. The advantage of the technique used in <CODE>redisplay</CODE> is that it makes the program easier to read by reducing the number of procedures, and keeping similar purposes together. The advantage of using two procedures is that it's a little faster, because you don't have to test the flag variable with <CODE>if</CODE>. <P>A flag variable is somewhat analogous to a <EM>predicate,</EM> a procedure that always outputs <CODE>true</CODE> or <CODE>false</CODE>. The advantage of using these particular values for flag variables is that they're easy to test; you can say <P><PRE>if :flag [do.something] </PRE> <P>whereas, if you used some other pair of values like <CODE>yes</CODE> and <CODE>no</CODE>, you'd have to say <P><PRE>if equalp :flag "yes [do.something] </PRE> <P>Some people like to give flag variables names ending with <CODE>p</CODE>, as in the convention for predicates. (The special variable <CODE>redefp</CODE> that controls redefinition of primitives in some versions of Logo, including Berkeley Logo, is an example.) I'm somewhat uncomfortable with that practice because to me it raises a confusion about whether a particular word is the name of a variable or the name of a procedure. I'd rather put <CODE>flag</CODE> in the names of flag variables. <P>The 26 <CODE>bound</CODE><EM>x</EM> variables in this program are also flag variables; each is <CODE>true</CODE> if the corresponding letter has been guessed as the cleartext half of a binding. They don't have "flag" in their names, but their names aren't used directly in most of the program anyway. Instead they are hidden behind data abstraction procedures. <CODE>Setbound</CODE> and <CODE>setunbound</CODE> are used to set any such variable <CODE>true</CODE> or <CODE>false</CODE>, respectively; the selector <CODE>boundp</CODE> alerts you by the P in its name that it's a predicate. <P><H2>Iteration Over Letters</H2> <P>One of the ways in which I simplified the program for this edition was to replace some recursive helper procedures with invocations of <CODE> foreach</CODE>. At several points in the program, some action must be taken for each letter in a word, or for each word in the text. <P>Another kind of iteration problem that was not so easily solved by the standard higher order procedures in Berkeley Logo was one in which some action must be taken, not for each letter in a word, but for each letter in the alphabet, or for some subset of the alphabet, as in the case of <CODE>showrow</CODE>, which displays one row of the top part of the screen, with information about five consecutive letters. Of course these problems could be solved with instructions like <P><PRE>foreach "ABCDEFGHIJKLMNOPQRSTUVWXYZ [...] </PRE> <P>but that seemed unaesthetic to me. I wanted to be able to specify the starting and ending letters, as in this example: <P><PRE>to alphabet setcursor [6 6] forletters "A "Z [ifelse boundp ? [invtype ?] [type ?]] end </PRE> <P>(The job of <CODE>alphabet</CODE> is to generate the middle part of the screen display, which is all of the letters of the alphabet, in order, with each letter in inverse video if that letter has been guessed as part of the cleartext.) <P>The difficulty in implementing <CODE>forletters</CODE> is to get from one letter to the next. How does a program know that the letter after A is B? Here is my solution: <P><PRE>to forletters :from :to :action for [lettercode [ascii :from] [ascii :to]] ~ [apply :action (list char :lettercode)] end </PRE> <P>The operation <CODE>ascii</CODE> takes a letter (or other character) as input. Its output is the number that represents that letter in the computer's memory. Most computers use the same numbers to represent characters; this standard representation is called ASCII, for American Standard Code for Information Interchange. (It's pronounced "ask E.") By using <CODE>ascii</CODE> to translate the starting and ending letters into numeric codes, I've transformed the problem into one that can be solved using the standard <CODE>for</CODE> tool that allows an action to be carried out for each number in a given range. <P>But in the template input to <CODE>forletters</CODE>, I want the question mark to represent a letter, not its numeric code. <CODE>Char</CODE> is the inverse operation to <CODE>ascii</CODE>. Given a number that is part of the ASCII sequence, <CODE>char</CODE> outputs the character that that number represents. For example: <P><PRE>?<U>print ascii "A</U> 65 ?<U>print char 65</U> A </PRE> <P><CODE>Forletters</CODE> applies the template input to the character corresponding to the number in the <CODE>lettercode</CODE> variable controlled by the <CODE>for</CODE>. <P>Adding 1 to an ASCII code to get the code for the next letter depends on the fact that the numbers representing the letters are in sequence. Fortunately, this is true of ASCII. A is 65, B is 66, C is 67, and so on. Not all computer representations for characters have this property. The code that was used in the days of punched cards had the slash (/) character in between R and S! <P>By the way, the lower case letters have different ASCII codes from the capitals. In this program I've used the primitive operation <CODE>uppercase</CODE> to translate every character that the program reads into upper case, just to be sure that each letter has only one representation. <P><H2>Computed Variable Names</H2> <P> <P>Another programming technique that is heavily used in this project is the use of <CODE>word</CODE> to compute variable names dynamically. Ordinarily, you assign a value to a variable named <CODE>var</CODE> with an instruction like <P><PRE>make "var 87 </PRE> <P>and you look at the value of the variable with the expression <P><PRE>:var </PRE> <P>But in this project, there are variables for each letter, with names like <CODE>posna</CODE>, <CODE>posnb</CODE>, <CODE>posnc</CODE>, and so on. To assign a value to these variables, the program doesn't use 26 separate instructions like <P><PRE>make "posna [0 0] </PRE> <P>(Each of these variables contains a list of screen coordinates for use with <CODE>setcursor</CODE> to find the corresponding letter in the top part of the display.) Instead, the procedure <CODE>showrow</CODE>, which draws that section of the display, contains the instruction <P><PRE>forletters :from :to [setposn ? cursor onetop ?] </PRE> <P><CODE>Setposn</CODE> is a data abstraction procedure: <P><PRE>to setposn :letter :thing make (word "posn :letter) :thing end </PRE> <P>When the variable <CODE>letter</CODE> contains the letter <CODE>a</CODE>, the <CODE>make</CODE> instruction has the same effect as if it were <P><PRE>make "posna :thing </PRE> <P>Similarly, the dots notation (<CODE>:posna</CODE>) isn't used to examine the values of these variables. Instead, <CODE>thing</CODE> is invoked explicitly: <P><PRE>to posn :letter output thing (word "posn :letter) end </PRE> <P>Another point to consider is that I could have used a different approach altogether, instead of using <CODE>word</CODE> to piece together a variable name. For instance, I could have used property lists: <P><PRE>to setposn :letter :thing pprop "posn :letter :thing end to posn :letter output gprop "posn :letter end </PRE> <P>As it happens, I first wrote this project in Atari 800 Logo, which didn't have property list primitives. So the question didn't arise for me. In a version of Logo that does support property lists, I see no <EM>stylistic</EM> reason to prefer one approach over the other. It's entirely a question of which is more efficient. Which is faster, searching through a list of 26 times 2 members (times 2 because each property has a name and a value) or concatenating strings with <CODE>word</CODE> to generate the name of a variable that can then be examined quickly? I'd have to experiment to find out. Alternatively, instead of using <CODE>posn</CODE> as the name of a property list and the letters as names of properties, I could reverse the two roles. That would give me more lists, but shorter lists. <P>What <EM>is</EM> a stylistic issue is that using procedures like <CODE>posn</CODE> and <CODE>setposn</CODE> to isolate the storage mechanism from the rest of the program makes the latter easier to read. <P> <P><H2>Further Explorations</H2> <P>I have three suggestions about how to extend this project. The first is to put in more rules by which the program can make guesses automatically. For example, a three-letter word that isn't THE might be AND. Sequences of letters within a word can also be tallied; TH is a common two-letter sequence, for example. A double letter in the cryptogram is more likely to represent OO than HH. <P>If you have many rules in the program, there will be situations in which two rules lead to contradictory guesses. One solution is just to try the most reliable rule first, and ignore a new guess if it conflicts with an old one. (<CODE>Qbind</CODE> applies this strategy by means of the instruction <P><PRE>if letterp thing :from [stop] </PRE> <P>which avoids adding a guess to the data base if the cryptogram letter is already bound to a cleartext letter.) <P>Another solution would be to let the rules "vote" about guesses. If the program had many rules, it might happen that three rules suggest that F represents E, while two rules suggest that W represents E. In this case, three rules outvote two rules, and the program would guess that F represents E. <P>The second direction for exploration in this program is to try to make it more efficient. For example, every time you make a guess, <CODE>showclear</CODE> is invoked to redisplay the partially decoded text. Much of this redisplay is unnecessary, since most of the guesses haven't changed. How can you avoid the necessity to examine every letter of the cryptogram text? One possibility would be to keep a list, for every letter in the text, of the screen positions in which that letter appears. Then when a new guess is made, the program could just type the corresponding cleartext letter at exactly those positions. The cost of this technique would be a lot of storage space for the lists of positions, plus a slower version of <CODE>showcode</CODE>, which would have to create these position lists. <P>The third direction for further exploration is to find out about more complicated ciphers. For example, suppose you started with a simple substitution cipher, but every time the letter A appeared in the cleartext you shifted the corresponding cryptogram letters by one. That is, if E is initially represented by R, the first time an A appears you'd start using S to represent E. The second time A appears you'd switch to T representing E. And so on. The effect of this technique would be that a particular cleartext letter is no longer represented by a single cryptogram letter all the way through. Therefore, you can't just count the frequencies of the cryptogram letters and assume that frequently-used letters represent E and T. How could you possibly decipher such a message? <P> <TABLE width="100%"><TR><TD><A HREF="../v2-toc2.html">(back to Table of Contents)</A> <TD align="right"><A HREF="../v2ch10/v2ch10.html"><STRONG>BACK</STRONG></A> chapter thread <A HREF="https://people.eecs.berkeley.edu/~bh/v2ch12/v2ch12.html"><STRONG>NEXT</STRONG></A> </TABLE> <P><H2>Program Listing</H2> <P><P> <P><PRE> to crypto :text make "text map "uppercase :text make "fulltext :text make "moretext [] make "textstack [] if not procedurep "letterp [copydef "letterp "namep] forletters "A "Z "initvars make "maxcount 0 initcount "single initcount "triple cleartext histogram :text redisplay "false if or guess.single guess.triple [showclear :text] parseloop end ;; Initialization to initcount :type setlist. :type [] setcount. :type 0 end to initvars :letter setcnt :letter 0 make :letter "| | setunbound :letter end ;; Histogram to histogram :text foreach :text [localmake "word filter "letterp ? foreach :word "histlet prepare.guess :word] end to histlet :letter localmake "cnt 1+cnt :letter setcursor list (index :letter) (nonneg 24-:cnt) type :letter setcnt :letter :cnt if :maxcount < :cnt [make "maxcount :cnt] end ;; Guessing letters to prepare.guess :word if equalp count :word 1 [tally "single :word] if equalp count :word 3 [tally "triple :word] end to tally :type :word localmake "countvar word :type :word if not memberp :word list. :type ~ [setlist. :type fput :word list. :type make :countvar 0] localmake "count (thing :countvar)+1 make :countvar :count if :count > (count. :type) ~ [setcount. :type :count setmax. :type :word] end to guess.single if emptyp (list. "single) [output "false] if emptyp butfirst (list. "single) ~ [qbind first (list. "single) "A output "true] qbind (max. "single) "A qbind (ifelse equalp first (list. "single) (max. "single) [last (list. "single)] [first (list. "single)]) ~ "I output "true end to guess.triple if emptyp (list. "triple) [output "false] if :maxcount < (3+cnt last (max. "triple)) ~ [qbind first (max. "triple) "T qbind first butfirst (max. "triple) "H qbind last (max. "triple) "E output "true] output "false end ;; Keyboard commands to parseloop forever [parsekey uppercase readchar] end to parsekey :char if :char = "@ [fullclear stop] if :char = "+ [moretext stop] if :char = "- [lesstext stop] if not letterp :char [beep stop] bind :char uppercase readchar end ;; Keeping track of guesses to bind :from :to if not equalp :to "| | [if not letterp :to [beep stop] if boundp :to [beep stop]] if letterp thing :from [dark thing :from] make :from :to fixtop :from if letterp :to [light :to] showclear :text end to qbind :from :to if letterp thing :from [stop] make :from :to fixtop :from light :to end ;; Maintaining the display to redisplay :flag cleartext showtop alphabet showcode :text if :flag [showclear :text] end ;; Top section of display (letter counts and guesses) to showtop setcursor [0 0] showrow "A "E showrow "F "J showrow "K "O showrow "P "T showrow "U "Y showrow "Z "Z end to showrow :from :to forletters :from :to [setposn ? cursor onetop ?] print [] end to onetop :letter localmake "count cnt :letter if :count = 0 [type word :letter "| | stop] localmake "text (word :letter "- twocol :count "- thing :letter) ifelse :maxcount < :count+3 [invtype :text] [type :text] type "| | end to twocol :number if :number > 9 [output :number] output word 0 :number end to fixtop :letter setcursor posn :letter onetop :letter end ;; Middle section of display (guessed cleartext letters) to alphabet setcursor [6 6] forletters "A "Z [ifelse boundp ? [invtype ?] [type ?]] end to light :letter setcursor list 6+(index :letter) 6 invtype :letter setbound :letter end to dark :letter setcursor list 6+(index :letter) 6 type :letter setunbound :letter end ;; Bottom section of display (coded text) to showcode :text make "moretext [] showcode1 8 0 :text end to showcode1 :row :col :text if emptyp :text [make "moretext [] stop] if :row > 22 [stop] if and equalp :row 16 equalp :col 0 [make "moretext :text] if (:col+count first :text) > 37 [showcode1 :row+2 0 :text stop] codeword :row :col first :text showcode1 :row (:col+1+count first :text) butfirst :text end to codeword :row :col :word setcursor list :col :row invtype :word end ;; Bottom section of display (cleartext) to showclear :text showclear1 8 0 :text 2 end to showclear1 :row :col :text :delta if emptyp :text [stop] if :row > 23 [stop] if keyp [stop] if (:col+count first :text) > 37 ~ [showclear1 :row+:delta 0 :text :delta stop] clearword :row :col first :text showclear1 :row (:col+1+count first :text) butfirst :text :delta end to clearword :row :col :word setcursor list :col :row+1 foreach :word [ifelse letterp ? [type thing ?] [type ?]] end ;; Windowing commands to fullclear cleartext showclear1 0 0 :fulltext 1 print [] invtype [type any char to redisplay] ignore readchar redisplay "true end to moretext if emptyp :moretext [beep stop] push "textstack :text make "text :moretext redisplay "true end to lesstext if emptyp :textstack [beep stop] make "text pop "textstack redisplay "true end ;; Iteration tool for letters to forletters :from :to :action for [lettercode [ascii :from] [ascii :to]] ~ [apply :action (list char :lettercode)] end ;; Data abstraction (constructors and selectors) to setbound :letter make word "bound :letter "true end to setunbound :letter make word "bound :letter "false end to boundp :letter output thing word "bound :letter end to setcnt :letter :thing make (word "cnt :letter) :thing end to cnt :letter output thing (word "cnt :letter) end to setposn :letter :thing make (word "posn :letter) :thing end to posn :letter output thing (word "posn :letter) end to setcount. :word :thing make (word "count. :word) :thing end to count. :word output thing (word "count. :word) end to setlist. :word :thing make (word "list. :word) :thing end to list. :word output thing (word "list. :word) end to setmax. :word :thing make (word "max. :word) :thing end to max. :word output thing (word "max. :word) end ;; Miscellaneous helpers to index :letter output (ascii :letter)-(ascii "A) end to beep tone 440 15 end to invtype :text type standout :text end to nonneg :number output ifelse :number < 0 [0] [:number] end ;; Sample cryptograms make "cgram1 [Dzynufqyjulli, jpqhq ok yr hoxpj qnzeujory qceqwj xhrtoyx zw oyjr u trhjptpolq trhln. oynqqn, rzh qceqkkogq eryeqhy tojp whrvlqfk rd qnzeujory uj whqkqyj kofwli fquyk jpuj jpq |xhrty-zwk| nr yrj pugq kzep u trhln. u nqeqyj qnzeujory uofk uj, whqwuhqk drh, u frhq trhjptpolq dzjzhq, tojp u noddqhqyj erffzyoji kwohoj, noddqhqyj reezwujoryk, uyn frhq hqul zjoloji jpuy ujjuoyoyx kjujzk uyn kuluhi.] make "cgram2 [Lvo vfkp lfzj md opaxflimn iz lm gitokflo fnp zlkonblvon f hmalv'z inilifliuo, fnp fl lvo zfyo liyo lm zoo lm il lvfl vo jnmwz wvfl iz noxozzfkh lm xmco wilv lvo mnbminb fxliuilioz fnp xaglako md zmxiolh, zm lvfl viz inilifliuo xfn to kogoufnl. il iz ftzakp lm lvinj lvfl lviz lfzj xfn to fxxmycgizvop th zm yaxv zillinb in f tms dfxinb dkmnl, yfnicagflinb zhytmgz fl lvo pikoxlimn md pizlfnl fpyinizlkflmkz. lviz iz kflvok f wfh lm kobiyonl fnp tkfinwfzv.] make "cgram3 [Pcodl hbdcx qxdrdlh yihcodr, hbd rzbiier gxd lih ziyqdhdlh hi hdgzb gwhbdlhcz echdxgzf, xdgnclp gr g ydglr ia ecudxghcil gln zwehcoghcil. gln c niwuh hbgh yirh ia wr jbi rdxciwref xdgn gln jxchd hbd dlpecrb eglpwgpd dodx edgxldn ch uf hbd xiwhd ia "xwl, rqih, xwl" hi rcegr ygxldx.] make "cgram4 [Jw btn xnsgsyp ejke gfebbcg, dtyjbn fbccsksg, ryu fbccsksg nswcsfpsu pes usgjns, wnssuba, ryu wtptns bw pes qbtyk, pesns zbtcu ls yb knrujyk, yb psgpjyk svfsxp rg r psrfejyk aspebu, ryu yb lcrfilbrnu dtykcsg. jy wrfp, zs rns ksppjyk cbfigpsx gfesutcjyk ryu knrujyk pb pes xbjyp bw pbnptns.] </PRE><P> <P> <P><A HREF="../v2-toc2.html">(back to Table of Contents)</A> <P><A HREF="../v2ch10/v2ch10.html"><STRONG>BACK</STRONG></A> chapter thread <A HREF="https://people.eecs.berkeley.edu/~bh/v2ch12/v2ch12.html"><STRONG>NEXT</STRONG></A> <P> <ADDRESS> <A HREF="../index.html">Brian Harvey</A>, <CODE>bh@cs.berkeley.edu</CODE> </ADDRESS> </BODY> </HTML>