diff options
Diffstat (limited to 'js/games/nluqo.github.io/~bh/v3ch1/v3ch1.html')
-rw-r--r-- | js/games/nluqo.github.io/~bh/v3ch1/v3ch1.html | 1977 |
1 files changed, 1977 insertions, 0 deletions
diff --git a/js/games/nluqo.github.io/~bh/v3ch1/v3ch1.html b/js/games/nluqo.github.io/~bh/v3ch1/v3ch1.html new file mode 100644 index 0000000..5f94725 --- /dev/null +++ b/js/games/nluqo.github.io/~bh/v3ch1/v3ch1.html @@ -0,0 +1,1977 @@ +<HTML> +<HEAD> +<TITLE>Computer Science Logo Style vol 3 ch 1: Automata Theory</TITLE> +</HEAD> +<BODY> +<CITE>Computer Science Logo Style</CITE> volume 3: +<CITE>Beyond Programming</CITE> 2/e Copyright (C) 1997 MIT +<H1>Automata Theory</H1> + +<TABLE width="100%"><TR><TD> +<IMG SRC="../csls3.jpg" ALT="cover photo"> +<TD><TABLE> +<TR><TD align="right"><CITE><A HREF="http://www.cs.berkeley.edu/~bh/">Brian +Harvey</A><BR>University of California, Berkeley</CITE> +<TR><TD align="right"><BR> +<TR><TD align="right"><A HREF="../pdf/v3ch01.pdf">Download PDF version</A> +<TR><TD align="right"><A HREF="../v3-toc2.html">Back to Table of Contents</A> +<TR><TD align="right"><A HREF="../v3ch0/v3ch0.html"><STRONG>BACK</STRONG></A> +chapter thread <A HREF="../v3ch2/v3ch2.html"><STRONG>NEXT</STRONG></A> +<TR><TD align="right"><A HREF="https://mitpress.mit.edu/books/computer-science-logo-style-second-edition-volume-3">MIT +Press web page for <CITE>Computer Science Logo Style</CITE></A> +</TABLE></TABLE> + +<HR><P>Program file for this chapter: <A HREF="fsm.lg"><CODE>fsm</CODE></A> + + + +<P>As I explained in the preface to the first volume, one of my purposes in +writing this series of books has been to urge computer hobbyists away from +the view of computer expertise as the knowledge of obscure characteristics +of some particular computer--how to program it in machine language, what +magic numbers can be found where in its memory, how to overcome the copy +protection schemes on its disks, and so on. The trouble with this sort of +machine-specific expertise is that it becomes obsolete when your favorite +computer does. From my point of view, one of the virtues of Logo as a +programming language is that its high level data structures direct your +attention away from questions about what goes where in memory, +allowing you to focus instead on a more abstract description of your problem. + +<P>Automata theory is a further step in abstracting your attention away from any +particular kind of computer or particular programming language. In automata +theory we consider a <EM>mathematical model</EM> of computing. Such a model +strips the computational machinery--the "programming language"--down to +the bare minimum, so that it's easy to manipulate these +theoretical machines +(there are several such models, for different purposes, as you'll soon see) +mathematically to prove things about their capabilities. For the most part, +these mathematical models are not used for practical programming problems. +Real programming languages are much more convenient to use. But the very +flexibility that makes real languages easier to use also makes them harder +to talk about in a formal way. The stripped-down theoretical machines are +designed to be examined mathematically. + +<P>What's a mathematical model? You'll see one shortly, called a +"finite-state machine." + +<P>The point of this study is that the mathematical models are, in some +important ways, <EM>equivalent</EM> to real computers and real programming +languages. What this means is that any problem that can be solved on a real +computer can be solved using these models, and vice versa. Anything we can +prove about the models sheds light on the real problems of computer +programming as well. + +<P>The questions asked in automata theory include these: Are there any +problems that no computer can solve, no matter how much time and memory it +has? Is it possible to <EM>prove</EM> that a particular computer program +will actually solve a particular problem? If a computer can use two +different external storage devices (disks or tapes) at the same time, +does that extend the range of problems it can solve compared to a +machine with only one such device? + +<P>There is also a larger question lurking in the background of automata +theory: Does the human mind solve problems in the same way that a +computer does? Are people subject to the same limitations as computers? +Automata theory does not actually answer this question, but the insights +of automata theory can be helpful in trying to work out an answer. +We'll have more to say about this in the chapter on artificial intelligence. + +<P><H2>What is a Computation?</H2> + +<P>What kinds of problems can we give to our abstract computers? In +automata theory we want to focus our attention on computation itself, +not on details of input and output devices. So we won't try creating +a mathematical model of a video game. + +<P>We will play a game, though. In this game the computer has a rule in mind. +You type in strings of letters, using only the letters <CODE>A</CODE>, <CODE>B</CODE>, and +<CODE>C</CODE>. The computer tells you whether each string follows the rule or +not. Your job is to guess the rule. For example, suppose you have done +these experiments: + +<P><TABLE><TR><TH align="left"><U>accepted</U> +<TH align="left"> <U>rejected</U> +<TR><TD><CODE>ABC</CODE><TD> <CODE>CBA</CODE> +<TR><TD><CODE>AAA</CODE><TD> <CODE>BBB</CODE> +<TR><TD><CODE>ABCABCABC</CODE><TD> <CODE>BCABCABC</CODE> +<TR><TD><CODE>A</CODE><TD> <CODE>BBBBBBB</CODE> +<TR><TD><CODE>ACCCCCCCCC</CODE><TD> <CODE>CAAAAAAAAA</CODE> +</TABLE> + +<P>You might guess, from these examples, that the rule is +"The string must begin with <CODE>A</CODE>." Once you've made a guess you can +test it out by trying more examples. + +<P>The program to play the game is called <CODE>game</CODE>. It takes one input, +a number from 1 to 10. I've provided ten different rules. Rules 1 to 3 +should be pretty easy to guess; rules 8 to 10 should be nearly impossible. +(Don't feel too frustrated if you don't get them.) + +<P>A string can be any length, including length zero (the empty string). Each +time you type a letter the program lets you know whether the string you've +typed so far obeys the rule. The program indicates whether the string is +accepted or rejected by displaying the word <CODE>accept</CODE> or <CODE>reject</CODE> on +the screen. In particular, as soon +as you start <CODE>game</CODE> the program will tell you whether or not the empty +string is accepted by this rule. If you type the string <CODE>ABC</CODE> you'll +really be testing three strings: <CODE>A</CODE>, <CODE>AB</CODE>, and <CODE>ABC</CODE>. You +should type one letter at a time to make sure the program has a chance to +respond to it before going on to the next letter. To start over again with +a different string, press the Return key. + +<P>You should stop reading now and try the game. In the following paragraphs +I'm going to talk about some of the answers, so this is your last +chance. After you've figured out at least some of the rules, come +back to the book. + +<P><H2>Finite-State Machines</H2> + +<P>The point of studying this game is that we're going to look at a way to design +a special-purpose abstract computer that understands one particular +rule. We can then ask questions about how much information the computer +needs to handle the job. + +<P>You've seen the word <EM>state</EM> before in connection with the Logo +turtle. Its state includes its position and its heading. So one turtle +state might be "position <CODE>[17 82]</CODE>, heading <CODE>90</CODE>." In principle, +the turtle has an <EM>infinite</EM> number of possible states, because its +position and heading don't have to be integers. Its position might be <CODE> +[14.142 14.142]</CODE>, for instance. + +<P>Anything that holds information can be in different states. As another +example, an on-off light switch has two states. Some lamps have four states: +off, low, medium, and high. A computer, too, has a certain number of states. +The state of a computer includes all the information in its memory at some +particular time. + +<P>A machine that has only a limited number of states, like the example of the +light switch, is called a <EM>finite-state machine.</EM> For almost all of +this chapter we'll be dealing with finite-state machines. You might think +that that imposes a very severe limit on the kinds of computations we can +do. But note that in the game I asked you to play, a rule can accept an +infinite number of possible strings and reject an infinite number of +others. The accepted or rejected strings can be of any length. (Some rules +restrict the length of a string, but others allow any length at all.) In +some sense, a finite-state machine can still perform infinitely varied +computations. + +<P>Consider the third game as an example. The rule is "Accept +any string that starts with <CODE>AB</CODE>." Here is a picture of a finite-state +machine that implements that rule: + +<P><CENTER><IMG SRC="https://people.eecs.berkeley.edu/~bh/v3ch1/fsm3.gif" ALT="figure: fsm3"></CENTER> + +<P>Each numbered circle represents a state. This machine has three +states. The <CODE>start</CODE> arrow indicates that the machine starts out in state +1. State 3 is shown with a <EM>double</EM> circle to indicate that it is an +<EM>accepting</EM> state. In other words, when the machine is in state 3 the +screen says <CODE>accept</CODE>. The other two states are not +accepting states. +Every time you type a character the machine switches from one state to +another. The arrow from state 1 to state 2 has an <CODE>A</CODE> next to its +tail. This indicates that when the machine is in state 1, an input of <CODE> +A</CODE> switches it to state 2. (The arrow from state 3 to itself has the +three letters <CODE>ABC</CODE> at its tail. This is a shorthand notation for +three separate arrows with heads and tails in the same place, one for each +letter.) + +<P>This picture is actually incomplete. For a full description of the machine +we have to indicate what happens on any input in any state. In other words, +each circle should have <EM>three</EM> arrows coming out from it, one each +for <CODE>A</CODE>, <CODE>B</CODE>, and <CODE>C</CODE>. I've chosen to adopt the convention that +every machine has an unmarked state called <CODE>reject</CODE>. Any missing arrow +goes to that state; once the machine is in the reject state it stays +there forever. Here, then, is the complete diagram of the machine for game +3: + +<P><CENTER><IMG SRC="fsm3-reject.gif" ALT="figure: fsm3-reject"></CENTER> + +<P>From now on I won't draw the <CODE>reject</CODE> state, but you should +remember that it's really part of the machine description. So this machine +requires four states, not three. + +<P>If the first input letter isn't <CODE>A</CODE>, the machine goes to the <CODE>reject</CODE> +state. If the first letter is <CODE>A</CODE>, the machine goes to state 2. Then, +if the second letter is <CODE>B</CODE>, the machine ends up in state 3 and accepts +the string <CODE>AB</CODE>. This is the shortest acceptable string. + +<P>Each of the three arrows from state 3 loops right back into state +3 itself. (Remember, although only one arrow appears in the picture, +it is labeled with three letters, so officially it represents three +arrows.) This means that once the machine is in state 3 it stays +there no matter what inputs it gets. Any string that starts <CODE>AB</CODE> is +acceptable. + +<P>Here is a machine for game number 2: + +<P><CENTER><IMG SRC="fsm2.gif" ALT="figure: fsm2"></CENTER> + +<P>In this machine the <EM>start state</EM> is also an <EM>accepting +state.</EM> (Every machine has exactly one start state, but it may have any +number of accepting states.) This machine never gets into the <CODE>reject</CODE> +state. That doesn't mean it doesn't reject any strings; all odd-length +strings are rejected in state 2. But a rejected string can redeem itself by +adding another input character, so state 2 allows a return to the accepting +state 1. + +<P>Here is a machine for game number 5. (Notice that I'm saying "a +machine" and not "the machine"; it is always possible to design +other machines that would follow the same rule.) + +<P><CENTER><IMG SRC="fsm5.gif" ALT="figure: fsm5"></CENTER> + +<P>You probably had more trouble discovering rule 5 than rule +2, and it takes longer to say the rule in English words. But the +<EM>machines</EM> for the two rules are almost identical. (Remember, +though, that the rule-5 machine really has a third state, the <CODE>reject</CODE> +state, which is not shown in the diagram.) + +<P>Here are machines for rules 7 and 9. With these machines as hints, +can you figure out the rules? Go back to the <CODE>game</CODE> program to test +your hypotheses. + +<P><CENTER><IMG SRC="fsm79.gif" ALT="figure: fsm79"></CENTER> + +<P>You should also get some practice translating in the other direction, from +English rules to machine diagrams. Here are a couple to work on: Rule 4 is +"To be accepted a string must be composed of doubled letters (<CODE>AA</CODE>, +<CODE>BB</CODE>, and <CODE>CC</CODE>) strung together." Rule 8 is "To be accepted a +string must contain an even number of <CODE>A</CODE>s." + +<P><H2>Nondeterministic Machines</H2> + + +<P>Here is rule 6: "To be accepted a string must begin with <CODE>A</CODE> and end +with <CODE>C</CODE>." Strings accepted by this rule include <CODE>AC</CODE> (the shortest +possible), <CODE>ABC</CODE>, <CODE>AACC</CODE>, <CODE>ACAC</CODE>, <CODE>ABCABC</CODE>, and so on. +Between the initial <CODE>A</CODE> and the final <CODE>C</CODE> an accepted string can +have any combination of <CODE>A</CODE>s, <CODE>B</CODE>s, and <CODE>C</CODE>s. It's natural to +think of the string as having three parts: a fixed beginning, a variable +middle, and a fixed end. The three parts of the input strings can be +represented conveniently with three states in the machine, like this: + +<P><CENTER><IMG SRC="fsm6nd.gif" ALT="figure: fsm6nd"></CENTER> + +<P>The machine starts in state 1. To be accepted a string must start +with <CODE>A</CODE>. Therefore, an <CODE>A</CODE> arrow leads from state 1 to state 2. Any +other input at state 1 leads to the <CODE>reject</CODE> state. + +<P>Once the machine is in state 2 it is ready for the middle part of +the string. In this middle part any combination of letters is allowed. +Therefore, there are three arrows from state 2 to itself, one for +every possible letter. + +<P>Finally, a <CODE>C</CODE> arrow leads from state 2 to state 3, signaling the end +of an acceptable string. A string must end with <CODE>C</CODE> to be accepted. + +<P>There is a problem with this machine: There are <EM>two</EM> <CODE>C</CODE> arrows +leading out from state 2. One is a loop back into state 2; the other +goes on to state 3. This situation reflects the fact that <CODE>C</CODE> serves +two different functions in this rule: <CODE>C</CODE> is an optional part of the +middle section of the string, and it's also the required final input +in the string. + +<P>A machine with two arrows from the same state for the same input is +called a <EM>nondeterministic</EM> machine. Here is how such a machine +could work: Whenever there are two choices for the machine's +current state and input, the machine clones itself. One of the copies +follows each arrow. From then on, if <EM>either</EM> machine is in an +accepting state the string is accepted. + +<P>Nondeterministic finite-state machines are more complicated +than deterministic ones. Does the added complexity "buy" any added +power? In other words, can a nondeterministic machine solve problems +that a deterministic machine can't? It turns out that the answer +to this question is no. Deterministic machines are just as powerful +as nondeterministic ones. This is an important theorem in automata +theory. I'm not going to prove it formally in this book, but to illustrate +the theorem, here is a deterministic machine that carries out game +6: + +<P><CENTER><IMG SRC="https://people.eecs.berkeley.edu/~bh/v3ch1/fsm6d.gif" ALT="figure: fsm6d"></CENTER> + +<P>This machine is similar to the nondeterministic version. It has the +same number of states and some of the connections are identical. +State 3 is more complicated, though. Also, in this machine, it is +no longer the case that each state of the machine corresponds exactly +to one of three parts of the input string. Specifically, when the machine is +in state 3 the string may or may not be finished. + +<P><H2>Representing Machines as Logo Lists</H2> + +<P>The <CODE>game</CODE> program uses finite-state machines to represent the rules +by which it accepts or rejects strings. (The machines must be deterministic +for the program to work.) Logo programs can't read circles and arrows, +so a machine is represented as a list. What information is actually +contained in an FSM diagram? The diagram shows that there are a certain +number of states (the circles), that there are certain transitions from one +state to another (the arrows), that one particular state is the start state +(the start arrow), and that certain states are accepting ones (the double +circles). As in any programming project, I could have chosen many different +ways to represent that information in the program. + +<P>In the particular representation I chose, the list form of a machine +has three members. The first member is the number of the start state. +The second member is a list of arrows; each arrow is itself a list, +as I'll explain in a moment. The third member of a machine list is +a list of the accepting states of the machine. For example, here +is the machine for game 3 again, in both forms: + +<P><CENTER><IMG SRC="https://people.eecs.berkeley.edu/~bh/v3ch1/fsm3.gif" ALT="figure: fsm3"></CENTER> +<PRE>[1 [[1 A 2] [2 B 3] [3 ABC 3]] [3]] +</PRE> + +<P>The number <CODE>1</CODE> is the start state; the list <CODE>[3]</CODE> is the +list of accepting states. (This machine happens to have only one accepting +state.) Everything else is the list of arrows. Each arrow is also a list +with three members: the initial state (the tail of the arrow), the input +letter or letters, and the final state (the head of the arrow). The first arrow in +this machine is + +<P><PRE>[1 A 2] +</PRE> + +<P>This is the arrow from state 1 to state 2 associated with +the input <CODE>A</CODE>. + +<P>The list <CODE>[3 ABC 3]</CODE> in this machine represents three arrows, using +the same shorthand as in the circle-and-arrow diagrams. I could equally +well have represented these arrows separately: + +<P><PRE>[1 [[1 A 2] [2 B 3] <U>[3 A 3] [3 B 3] [3 C 3]</U>] [3]] +</PRE> + +<P>As in the circle-and-arrow diagrams, I haven't explicitly represented the +transitions to the <CODE>reject</CODE> state in these lists. The program is +written so that if it doesn't find a transition for the current state and +input in the list of transitions, it goes into state number −1, its +representation for the <CODE>reject</CODE> state. + +<P>Here are some more machine lists: + +<P><PRE>Game 2: [1 [[1 ABC 2] [2 ABC 1]] [1]] +Game 7: [1 [[1 AB 1] [1 C 2] [2 C 1]] [1]] +Game 9: [1 [[1 AB 1] [1 C 2] [2 A 3] [2 B 1] [3 A 1]] [1]] +</PRE> + +<P>At this point you should stop and play with the program. Make up +your own rules. The procedure <CODE>fsm</CODE> takes a machine list as input +and accepts strings based on that machine. (<CODE>Game</CODE> uses <CODE>fsm</CODE> +with particular +machines as inputs.) Try out your new rules to make sure you've designed +the machines correctly. Then get a friend to play with your rules. +If both of you are reading this book together you can have a competition. +(It's easy to design rules that are impossible to guess because there +are too many details. If you have a competition you should limit +yourselves to three states per machine.) + +<P>You might be interested in reading through the +<A HREF="v3ch1.html#fsm"><CODE>fsm</CODE></A> program, which +simulates a finite-state machine when given the machine's description +as its input. It's a pretty simple program. If you think of a machine's +state diagram as a kind of "wiring diagram" that might be used +in building a real version of that particular machine, this Logo program +is a kind of <EM>universal</EM> finite-state machine implementation. + +<P><H2>Text Editors: a Use for Acceptors</H2> + +<P>It may seem to you that accepting or rejecting strings isn't much +like what you usually do with computers. You may wonder how this +mathematical model is related to real computer programming. There +are two answers to this question. One is that it's possible to design +finite-state machines that have more versatile outputs than simply +yes or no. I'll give an example shortly. But the other answer is +that there are real situations in which accepting or rejecting a string +of symbols does come up in practical computation. + +<P>One example is in the implementation of programming languages. When +you say that + +<P><PRE>print 2+2 +</PRE> + +<P>is a legal Logo instruction but + +<P><PRE>print 2+ +</PRE> + +<P>is illegal, you're doing a more complicated version of what +a finite-state acceptor does. + +<P>The <EM>search</EM> command in a good text editor uses +finite-state machines. Most text editors have a command that allows +you to look through a file for a particular string of characters. Fancier +editors allow searching not just for one particular string, but for any +string that follows a rule the user can provide. The editor finds a string that +matches the rule using a finite-state machine. Of course, people who use +editors don't have to specify their search rules in the <EM>form</EM> of a +finite-state machine! The editing program accepts search rules in a simpler +form and translates them into FSM form. Here is an example, the notation +used by <EM>ed,</EM> a standard editor in the Unix operating system. + +<P>A string like + +<P><PRE>spaghetti +</PRE> + +<P>just matches an identical string in the file you're editing. A +slightly more interesting case is + +<P><PRE>[Ss]paghetti +</PRE> + +<P>which matches either "Spaghetti" or "spaghetti" +(with a capital or lower case "s"). The square brackets indicate that +<EM>any</EM> of the letters inside the brackets will be accepted. +In the expression + +<P><PRE>[Ss]paghet*i +</PRE> + +<P>the asterisk matches <EM>any number</EM> (including zero) of the +letter before it (in this case, the letter <CODE>t</CODE>). This example +would match any of these: + +<P><PRE>Spaghei +Spaghettttti +spaghetti +spagheti +</PRE> + +<P>You might use this in a search command if you're a bad speller! +The bracket and asterisk can be combined; + +<P><PRE>C[AD]*R +</PRE> + +<P>will match any of + +<P><PRE>CAR +CDR +CADDR +CR +</PRE> + +<P>Or you could use + +<P><PRE>M[is]*p*i +</PRE> + +<P>to match the name of a famous river. + +<P>Some of the rules from the game I presented earlier can be represented as +<EM>ed</EM> search strings according to these rules. In the first game the +machine accepted any string made up of <CODE>A</CODE>s and <CODE>B</CODE>s. The +corresponding <EM>ed</EM> expression is + +<P><PRE>[AB]* +</PRE> + +<P>The third game called for strings beginning with the sequence +<CODE>AB</CODE>, followed by whatever you like. This can be represented as + +<P><PRE>AB[ABC]* +</PRE> + +<P>Game 10, which I'm sure you didn't solve, accepts any string that +includes the sequence <CODE>ABCBA</CODE> within it. In <EM>ed</EM> terms, that's + +<P><PRE>[ABC]*ABCBA[ABC]* +</PRE> + +<P>I haven't given you a complete description of the <EM>ed</EM> search rules. +I included this much only because I want you to see how a "real" program +uses the idea of finite-state machines. But in the remaining part of this +chapter I want to use a different notation based on Logo words and lists. + +<P><H2>Regular Expressions</H2> + +<P>The notation I'm about to describe allows an acceptance rule, like the rules +in the <CODE>game</CODE> program or the rules for <EM>ed</EM> searches, to be +represented in Logo. The representation of such a rule is called a <EM> +regular expression.</EM> I'm going to tell you some rules for what a regular +expression can look like. Don't be confused: Any particular regular +expression is a rule that accepts strings of letters. I'm giving you rules +that accept regular expressions--rules about rules. As a rough analogy, +"one player is <CODE>X</CODE> and the other is <CODE>O</CODE>" is a rule about the +specific game Tic Tac Toe; "each player should have a fair chance to win" +is a rule about what kinds of game rules are acceptable. + +<P> <EM>Alphabet rule.</EM> Any symbol in a machine's +alphabet is a regular expression. We represent the symbol as a one-letter +Logo word. In our guessing game the alphabet contains three symbols: <CODE> +A</CODE>, <CODE>B</CODE>, and <CODE>C</CODE>. So + +<P><PRE>B +</PRE> + +<P>is a regular expression. + +<P> +<EM>Concatenation rule.</EM> A list whose members are regular expressions +represents those expressions one after another. For example, since <CODE>A</CODE> +is a regular expression and <CODE>B</CODE> is a regular expression, + +<P><PRE>[A B] +</PRE> + +<P>is a regular expression representing the string <CODE>AB</CODE>. (Notice +that the Logo word <CODE>AB</CODE> does <EM>not</EM> represent that string; the +alphabet rule requires that each letter be represented as a separate word.) + +<P> +<EM>Alternatives rule.</EM> A list whose first member is the word <CODE>or</CODE> and +whose remaining members are regular expressions represents any string that +matches <EM>any</EM> of those expressions. For example, + +<P><PRE>[OR [A A] B] +</PRE> + +<P>matches either the sequence <CODE>AA</CODE> or the single symbol <CODE>B</CODE>. +As a convenience, a Logo word containing more than one letter (other +than the word <CODE>or</CODE>) is taken as an abbreviation for the <CODE>or</CODE>ing +of the individual letters. For example, <CODE>ABC</CODE> is equivalent to +<CODE>[OR A B C]</CODE>. + +<P> +<EM>Repetition rule.</EM> A list containing exactly two members, in which the +first is the asterisk (<CODE>*</CODE>) symbol and the second is a regular +expression, represents a string containing any number (including zero) of +substrings that match the regular expression. So + +<P><PRE>[* [OR [A A] B]] +</PRE> + +<P>matches any of these: + +<P><TABLE> +<TR><TD><CODE>B</CODE> +<TR><TD><CODE>BB</CODE> +<TR><TD><CODE>BAAB</CODE> +<TR><TD><CODE>AAAAAA</CODE> +<TR><TD><CODE>AABAA</CODE> +<TR><TD> (the empty string) +<TR><TD><CODE>AABBBBBAA</CODE> +</TABLE> + +<P>The number of consecutive <CODE>A</CODE>s must be even for a string of +<CODE>A</CODE>s and <CODE>B</CODE>s to match this expression. + +<P>These four rules constitute the definition of a regular expression. +It's a <EM>recursive definition.</EM> Just as the effect of a recursive +Logo procedure is defined in terms of a simpler case of the same procedure, +a complex regular expression is defined in terms of simpler ones. + +<P>Here are the ten game rules from the beginning of this chapter in +the form of regular expressions: + +<P><P> +<TABLE><TR><TH align="right">1.<TD> <CODE>[* AB]</CODE> +<TR><TH align="right">2.<TD> <CODE>[* [ABC ABC]]</CODE> +<TR><TH align="right">3.<TD> <CODE>[A B [* ABC]]</CODE> +<TR><TH align="right">4.<TD> <CODE>[* [OR [A A] [B B] [C C]]]</CODE> +<TR><TH align="right">5.<TD> <CODE>[* [ABC B]]</CODE> +<TR><TH align="right">6.<TD> <CODE>[A [* ABC] C]</CODE> +<TR><TH align="right">7.<TD> <CODE>[* [OR A B [C C]]]</CODE> +<TR><TH align="right">8.<TD> <CODE>[[* BC] [* [A [* BC] A [* BC]]]]</CODE> +<TR><TH align="right">9.<TD> <CODE>[[* AB] [* [C [OR B [A A]]] [* AB]]]</CODE> +<TR><TH align="right">10.<TD> <CODE>[[* ABC] A B C B A [* ABC]]</CODE> +</TABLE> + +<P><TD> <P> + +<P>You should go through these examples carefully, making sure +you understand how the regular expression represents the same idea +as the English description or the machine diagram you saw earlier. + +<P><H2>Rules That Aren't Regular</H2> + +<P>You may be thinking that <EM>any</EM> rule for accepting or rejecting +strings of symbols can be represented as a regular expression. But +that's not so. For example, consider the rules for recognizing ordinary +arithmetic expressions: + +<P><TABLE><TR><TH align="left"><U>accepted</U> +<TH align="left"> <U>rejected</U> +<TR><TD><CODE>2+3</CODE><TD> <CODE>23+</CODE> +<TR><TD><CODE>2*(3+4)</CODE><TD> <CODE>2*)3+4(</CODE> +<TR><TD><CODE>-5</CODE><TD> <CODE>/6</CODE> +</TABLE> + +<P>Think for a moment just about the matter of +balancing parentheses. +Sometimes you have parentheses within parentheses, as in + +<P><PRE>((3+4)/(5+6)) +</PRE> + +<P>How would you represent this part of the arithmetic expression +rule in the form of a regular expression? You can't just say something like + +<P><PRE>[[* (] something-or-other [* )]] +</PRE> + +<P>to mean "any number of open parentheses, something, then +any number of close parentheses." That would allow strings like + +<P><PRE>(((7))))) +</PRE> + +<P>But this string should be rejected because it has too many close +parentheses. You're not allowed to use a close parenthesis unless you've +already used a matching open parenthesis. You can have any number of nested +parentheses you want as long as they're balanced. + +<P>It is possible to invent other kinds of formal notation, more powerful than +regular expressions, that will allow us to give the rules for well-formed +arithmetic expressions. In this section I'll introduce briefly a formal +notation called <EM>production rules</EM> that's powerful enough to describe +arithmetic expressions. For now, in this chapter, I don't want to discuss +production rules in great detail; my only reason for introducing them at +this point is to give you a sense of how regular expressions fit into a +larger universe of possible formal systems. In the following sections I'll +have more to say about regular expressions and finite-state machines. But +we'll return to production rules in Chapters 5 and 6, in which we'll need +formal notations with which to discuss more interesting languages than the +<CODE>A</CODE>-<CODE>B</CODE>-<CODE>C</CODE> language of this chapter. (In Chapter 5 we'll be +talking about Pascal; in Chapter 6 we'll take on English.) + +<P>The key ingredient that's missing from regular expression notation is a way +to <EM>name</EM> a kind of sub-expression so that the name can be used in +defining more complex expressions. In particular, a sub-expression name can +be used in its own definition to allow a <EM>recursive</EM> definition of the +rule. + +<P>A production rule has the form + +<P><PRE>name : expansion +</PRE> + +<P>Each rule is a definition of the name on the left in terms of +smaller units, analogous to the definition of a Logo procedure in terms +of subprocedures. The expansion part of the rule is a string of symbols +including both members of the "alphabet" of the system (like the alphabet +that underlies a regular expression language) and names defined by +production rules. + +<P>As an example, here is a set of production rules that defines the language of +arithmetic expressions. These rules take into account the "order of +operations" for arithmetic that you learned in elementary school: +multiplication before addition. An expression like + +<P><PRE>2/3+1/6 +</PRE> + +<P>is ordinarily interpreted as the sum of two <EM>terms,</EM> namely +two thirds and one sixth. An expression can be a single term, a sum of +terms, or a difference of terms. Here's how that idea is expressed in a set +of production rules; I'll discuss the notation in more detail in a moment. + +<P><CENTER><IMG SRC="xx1.gif"></CENTER> + +<P>The vertical bars separate alternatives. The first rule, the one +that defines <CODE>expr</CODE>, contains three alternatives. First, an expression +can be just a single term. Second, an expression can be a smaller +expression plus a term. Third, an expression can be a smaller expression +minus a term. The symbols inside boxes are the members of the alphabet of +the arithmetic expression language. (I've put them in boxes to make it +easier not to confuse them with the punctuation characters--the colons and +vertical bars--that are part of the production rule notation itself.) + +<P>Do you see how parentheses fit in? If a string like <CODE>4+5</CODE> is an +expression, then <CODE>(4+5)</CODE> is a factor, so <CODE>3*(4+5)</CODE> is a term, +and so on. Since a factor is a kind of term, and a term is a kind of +expression, the factor <CODE>(4+5)</CODE> can be considered an expression, and +so it too can be put inside parentheses. So <CODE>((4+5))</CODE> is also +acceptable as a factor. + +<P><H2>Regular Expressions and Finite-State Machines</H2> + +<P>I've hinted at something that I haven't actually made explicit: Regular +expressions are equivalent to finite-state machines. In other words, +if you can express a rule as a regular expression, you can design +a finite-state machine that carries out the rule. If you can't write +a regular expression for the rule, you can't design a finite-state +machine either. + +<P>You may be thinking, "so what?" I've introduced two different formal +notations, finite-state machines and regular expressions, and now I'm +telling you that the two are equivalent. So why didn't I just pick one in +the first place and forget about the other? I have a general answer and a +specific answer to these questions. + +<P>The general answer is that comparing different formal systems is what +automata theory is all about. By the end of this book you'll have been +introduced to half a dozen or so different formal systems. Some are more +powerful than others. The bare assertion that one formal system is +equivalent to another, or more powerful than another, isn't very interesting; +but if we can understand the <EM>reasons</EM> behind those assertions then +we may be able to put the knowledge to work in practical situations. At the +very end of this book, in Chapter 6, we'll talk about a particular formal +system that's often used in artificial intelligence programs to recognize +English sentences. By then you should know enough about formal systems to +be able to understand why that particular one is a good choice. + +<P>The specific answer is that finite-state machines and regular expressions +are <EM>different</EM> from each other in an interesting way. A finite-state +machine is an <EM>algorithm,</EM> a sequence of steps, or a procedure that can +be followed to test whether some string matches a given rule. It says, +"start here, then if this happens do this, then..." just like a procedure +in Logo or most other programming languages. (But we've seen that a +finite-state machine is like a procedure written in a restricted programming +language that isn't as flexible as Logo.) A regular expression, though, is +<EM>not</EM> a sequence of steps. It's more like a description of the <EM> +result</EM> that we want, leaving open the precise recipe for how to get there. +People often pose problems in a similar way. They call the plumber and say, +"the drain in my bathtub is backing up." Part of the plumber's expertise is +to be able to translate that <EM>declarative</EM> problem statement into a +<EM>procedural</EM> form, a sequence of steps needed to clear up the problem. +An early stumbling block in artificial intelligence research was the seeming +gulf between the procedural knowledge embodied in a computer program and the +declarative knowledge needed for human-like behavior. Recently people have +invented <EM>declarative programming languages</EM> (the best known +is Prolog, but any commercial spreadsheet program is also +in this category) that +allow the user to state a problem in declarative form. The programming +language interpreter then automatically translates this problem statement +into a sequence of steps for the computer to perform. + +<P>Writing a Prolog interpreter raises many issues beyond the scope of this +book. But we can take a smaller step in the realm of translation from +a declarative notation to a procedural one. I've written a Logo program, +listed at the end of the chapter, that translates from a regular +expression to an equivalent finite-state machine. Its top-level procedure, +<CODE>machine</CODE>, takes a regular expression as input and outputs a machine +list in the format I showed earlier. + +<P><H2>How to Translate</H2> + +<P>The general claim that regular expressions are equivalent in power +to finite-state machines is called Kleene's Theorem, named after the +mathematician Stephen C. Kleene, its discoverer. You can find a proof +of this theorem in any textbook on automata theory. I'm not going +to give a proof here, but I'll indicate how the translation is done +in my program. The same kinds of ideas are used in the proof. + +<P>Remember that there are four parts to the definition of a regular expression. +The alphabet rule provides the fundamental building blocks; the +concatenation, alternatives, and repetition rules build large regular +expressions recursively out of smaller ones. The translation process +follows the same pattern: We start with a procedure to build a trivial +two-state machine that only accepts a single letter, then we add three +rules for combining smaller machines into a large machine. In the following +paragraphs I'll show how each rule is reflected in the +<A HREF="v3ch1.html#machine"><CODE>machine</CODE></A> program. + +<P>This construction process often produces machines with more states than +necessary. The <CODE>machine</CODE> program eliminates redundant states as its +final step. + +<P>The alphabet rule says that any member of the machine's alphabet is +a regular expression. In the program, a symbol can be any one-letter word other +than <CODE>*</CODE>. The symbol <CODE>X</CODE> is translated into the machine + +<P><PRE>[1 [[1 X 2]] [2]] +</PRE> + +<P>(You'll see that the program works by combining little machines +into bigger ones. Every time the program has to invent a new machine +state it uses the next free number. So the state numbers might not +be 1 and 2 in a real example.) The procedure +<A HREF="v3ch1.html#ndletter"><CODE>ndletter</CODE></A> handles this +rule. + +<P>Next comes the <EM>concatenation rule.</EM> The regular expression + +<P><PRE>[A B] +</PRE> + +<P>matches a string with two parts; the +first substring matches the <CODE>A</CODE> and the second matches the <CODE> +B</CODE>. In this simple example each "substring" matches only a single +letter. In a more complicated concatenation like + +<P><PRE>[[OR A C] [* B]] +</PRE> + +<P>there are different choices for each substring. For example, that +regular expression is matched by the string + +<P><PRE>CBBB +</PRE> + +<P>in which the letter <CODE>C</CODE> matches the first part of the +expression and the substring <CODE>BBB</CODE> matches the second part. + +<P>To translate a regular expression of this kind (a concatenation) into a +finite-state machine, we begin by recursively translating the subexpressions +into smaller machines. Then we have to "splice" the two machines together. +Procedure <A HREF="v3ch1.html#ndconcat"><CODE>ndconcat</CODE></A> does this splicing. + +<P>We'll begin with the simplest possible example. Suppose we want to +translate the regular expression + +<P><PRE>[A B] +</PRE> + +<P>We have already translated the two symbols <CODE>A</CODE> and <CODE>B</CODE> into +machines: + +<P><CENTER><IMG SRC="concat-before.gif" ALT="figure: concat-before"></CENTER> + +<P>The combined machine must start at the start state of the first +component machine, state 1. The combined machine should be in an accepting +state when <EM>both</EM> component machines have been satisfied; in other +words, the accepting states of the combined machine should be those of the +<EM>second</EM> component machine. In this case that means only state 4. + +<P>To splice the component machines together we must add transitions (arrows) +between them. Specifically, whenever the first component machine gets into +an accepting state, the combined machine should follow the same transitions +that apply to the start state of the second component machine. In this +case, when the combined machine gets into state 2 (the accepting state of +the first component machine) it should follow the same transitions that +apply to state 3 (the start state of the second machine). There is only one +such transition, a <CODE>B</CODE> arrow into state 4. That means we must add the +arrow + +<P><PRE>[2 B 4] +</PRE> + +<P>to the combined machine. + +<P><CENTER><IMG SRC="concat-after.gif" ALT="figure: concat-after"></CENTER> + +<P>State 3, although it is still in the machine, is now useless. +There is no way for the machine to get into state 3. Later +in the translation process another procedure removes +such "orphaned" states from the machine. + +<P>As a slightly more complicated example, consider the translation of the +regular expression + +<P><PRE>[[OR A C] [* B]] +</PRE> + +<P>We start by supposing that we've already translated the two +subexpressions separately: + +<P><CENTER><IMG SRC="concat2-before.gif" ALT="figure: concat2-before"></CENTER> + +<P>(We haven't yet discussed the alternatives rule or the repetition +rule, so I haven't yet explained how these subexpressions are translated. +For now, please just take on faith that this picture is correct. We'll get +to those other rules shortly.) + +<P>The start state of the combined machine is the start state of the first +component, state 1. At every accepting state of the first machine we must +duplicate the transitions from the start state of the second machine. In +this example the start state of the second machine has only the transition + +<P><PRE>[4 B 5] +</PRE> + +<P>but there are two accepting states in the first machine, so we +must add two new arrows: + +<P><PRE>[2 B 5] [3 B 5] +</PRE> + +<P>A final detail is that in this example the start state of the +second component machine, state 4, is an accepting state. That means that +the second substring can be empty. Therefore the accepting states of the +first component machine should also be accepting states of the combined +machine. Here is the result: + +<P><CENTER><IMG SRC="concat2-after.gif" ALT="figure: concat2-after"></CENTER> + +<P>Again, state 4 is now an "orphan" and will be eliminated +later in the program. + +<P>The <EM>alternatives rule</EM> combines two machines in parallel, so to speak, +rather than in series. It works by inventing a new state +that becomes the start state of the combined machine. Arrows leaving +from the new state duplicate the arrows from the start states of the +component machines. Procedure <A HREF="v3ch1.html#ndor"><CODE>ndor</CODE></A> handles this rule. + +<P>As an example, here is the translation process for + +<P><PRE>[OR A B] +</PRE> + +<P>(or its abbreviation <CODE>AB</CODE>). We start with two separate machines: + +<P><CENTER><IMG SRC="https://people.eecs.berkeley.edu/~bh/v3ch1/alter-before.gif" ALT="figure: alter-before"></CENTER> + +<P>We combine them by inventing a new state 5: + +<P><CENTER><IMG SRC="alter-after.gif" ALT="figure: alter-after"></CENTER> + +<P>I haven't explained all the details of the construction process. +For example, should the new state, state 5, be an accepting state? In this +example it shouldn't be. See if you can think of a case where it might be; +then read the program listing to see the exact algorithm that makes this +decision. Again, this construction process may leave unused states for +later cleanup. + +<P>A much more serious problem is that an <CODE>or</CODE> construction is likely to +produce a nondeterministic machine. For example, here is the machine +for + +<P><PRE>[OR [A B] [A C]] +</PRE> + +<P><CENTER><IMG SRC="or-nondet.gif" ALT="figure: or-nondet"></CENTER> + +<P>Like the unused states, the problem of nondeterminism is +left for the end of the program, when procedure <CODE>determine</CODE> translates +the nondeterministic machine into a deterministic one. +(The concatenation rule can also make nondeterministic machines, although +it's not as likely.) + +<P>The final case to be considered is the <EM>repetition rule.</EM> This rule +acts on only one smaller machine, not two machines as in the previous +two cases. The rule doesn't require any new states. It has two effects. +One is to add the start state to the list of accepting states. The +second is to add arrows from the (old) accepting states that mimic +the arrows from the start state. (This is exactly like the splicing +of two machines in the concatenation rule, but in this case we concatenate +a single machine with itself!) Procedure <A HREF="v3ch1.html#ndmany"><CODE>ndmany</CODE></A> +makes this transformation. +It, too, can result in a nondeterministic machine. + +<P>Here is an example of the rule: + +<P><CENTER><IMG SRC="repetition.gif" ALT="figure: repetition"></CENTER> + +<P>These four rules are combined by <A HREF="v3ch1.html#nondet"><CODE>nondet</CODE></A>, +a procedure whose input is +a regular expression and whose output is a (possibly nondeterministic) machine. + +<P><PRE>to nondet :regexp +if and (wordp :regexp) (equalp count :regexp 1) [output ndletter :regexp] +if wordp :regexp [output ndor reduce "sentence :regexp] +if equalp first :regexp "or [output ndor butfirst :regexp] +if equalp first :regexp "* [output ndmany last :regexp] +output ndconcat :regexp +end +</PRE> + +<P>The top-level procedure <A HREF="v3ch1.html#machine"><CODE>machine</CODE></A> +does a little initialization +and then does its work with the instruction + +<P><PRE>output optimize determine nondet :regexp +</PRE> + +<P>That is, first it creates what may be a +nondeterministic machine, +then if necessary it translates that into a deterministic one (eliminating +orphan states in the process), then it gets +rid of any redundant states that may have been created. + +<P><H2>Making the Machine Deterministic</H2> + +<P>In the first volume of this series we explored the techniques of depth-first +and breadth-first tree traversal. Given a tree structure, these algorithms +allow us to "visit" every node of the tree once. + +<P>A finite state machine can be viewed as a structure almost like a tree. +The machine's start state corresponds to the root node; the states that +can be reached by an arrow from a given state are the children of that state. +But there is one important difference between trees and machines: In a tree, +every node (except for the root node) has exactly one parent. The tree +search algorithms depend on that fact to ensure that each node is visited +only once. In a machine, arrows from several different states can lead to +the same state, so a state may have several "parents." The technical name +for an arbitrary collection of nodes with connections among them is a +<EM>graph.</EM> If the connections are one-way, as in the finite +state machine diagrams, it's called a <EM>directed graph.</EM> + +<P>Searching a graph is just like searching a tree, except that we have to +keep track of which nodes we've already visited, to avoid examining the +same node twice. Procedure <A HREF="v3ch1.html#determine"><CODE>determine</CODE></A> +creates a list named <CODE>states</CODE>, initially +empty, into which each state number is added as the +program examines that state. The depth first traversal of the machine +is carried out by procedure <A HREF="v3ch1.html#nd.traverse"><CODE>nd.traverse</CODE></A>; +although this procedure looks different from the +<A HREF="../v1ch14/v1ch14.html#depth.first"><CODE>depth.first</CODE></A> +procedure in Volume 1, it +uses the same basic algorithm. Given a state as input, it processes +that state and invokes itself recursively for all of the children of +that state--the states reachable by arrows from the input state. Unlike +<CODE>depth.first</CODE>, though, <CODE>nd.traverse</CODE> is an operation. It +outputs a new list of moves (arrows) for the deterministic version of +the machine. + +<P>What does it mean to process a state? <CODE>Nd.traverse</CODE> first checks +whether this state has already been processed; if so, it outputs an empty +list, because this state will contribute no new moves to the machine. +Otherwise, it remembers this state number as having been processed, finds +all the moves starting from this state, and calls +<A HREF="v3ch1.html#check.nd"><CODE>check.nd</CODE></A> to look for +nondeterminism. <CODE>Check.nd</CODE> takes the first available arrow whose tail +is at the state we're processing, and looks for all arrows with the same +tail and with the same letter.<SUP>*</SUP> +The local variable <CODE>heads</CODE> will contain a list of all the state numbers +reachable by these arrows. (The state numbers are sorted into increasing +order, and duplicates eliminated. If the machine has two completely +identical arrows, that doesn't result in nondeterminism.) + +<P><SMALL><BLOCKQUOTE><SMALL><SUP>*</SUP>By the way, <CODE>nondet</CODE> always +creates arrows with only a single letter; if two or more letters lead from +the same state to the same state, a separate arrow is created for each of +them. This makes for a longer machine list, but makes steps like this +one--looking for two arrows with the same letter--easier. Once the +deterministic machine has been created, procedure <CODE>optimize</CODE> will combine +such arrows into the abbreviated form with +more than one letter per arrow.</SMALL></BLOCKQUOTE></SMALL> + +<P>There are +three cases for what <CODE>check.nd</CODE> must do. First, if there is +only one state number in <CODE>:heads</CODE>, then there is no nondeterminism for +this letter, and <CODE>check.nd</CODE> includes the arrow from the original +machine as part of the deterministic machine. Second, if there is more +than one state number, <CODE>check.nd</CODE> looks to see if we've already seen +the same combination of result states. If so, then we've already created +a new state equivalent to that combination of old states, and <CODE>check.nd</CODE> +creates a new arrow pointing to that existing new state. Finally, the third +case is that this combination of states is one we haven't seen before. In +that case, <CODE>check.nd</CODE> must create a new state, with arrows duplicating +those from all of the original states. + +<P>In other words, if there are arrows + +<P><PRE>[[3 B 4] [3 B 7]] +</PRE> + +<P>then <CODE>check.nd</CODE> will invent a new state that is an "alias" +for "four-and-seven." If the same machine also contains arrows + +<P><PRE>[[8 C 4] [8 C 7]] +</PRE> + +<P>then <CODE>check.nd</CODE> will use the <EM>same</EM> alias state for this +pair, not inventing a new one. The new +state is given arrows matching those of all its component states (4 +and 7 in this example). The new state might itself contain a nondeterministic +branch, but that's okay because the new state will eventually be processed +as we continue to traverse the machine graph. + +<P>You might think that this process could go on forever: that each new +state <CODE>check.nd</CODE> invents will turn out to include nondeterminism, which +will require yet another new state to resolve. Fortunately, that +doesn't happen; the process does always end eventually. (In the next +chapter we'll see what the limit is on the number of necessary states +for the deterministic machine.) + +<P>Because <CODE>determine</CODE> uses a graph traversal algorithm to examine +the original machine's states, it will never find "orphan" states +that can't be reached by arrows from some other state. That's why +the process of making the machine deterministic also eliminates orphan +states, with no extra effort. + +<P><H2>Eliminating Redundant States</H2> + +<P>The machines produced by <CODE>determine</CODE> are runnable, but often ugly; +they contain many more states than necessary. Procedure +<A HREF="v3ch1.html#optimize"><CODE>optimize</CODE></A> +eliminates many redundancies and also combines arrows with the same +head and tail but with different letters. +First it goes through the machine's arrow list, +creating a list for each state representing the exits from that state: + +<P><CENTER><IMG SRC="optimize.gif" ALT="figure: optimize"></CENTER> +<PRE> [[* [A B]] C] +State 1: [[A 2] [C 4]] +State 2: [[B 3]] +State 3: [[A 2] [C 4]] +State 4: [] +</PRE> + +<P>In this machine, states 1 and 3 have the same exit list. +(In these lists, each arrow is represented with only two members; the +arrow's tail is not included. That's because states 1 and 3 would <EM> +not</EM> have identical lists if the tails were included. State 1's list +would be + +<P><PRE>[[1 A 2] [1 C 4]] +</PRE> + +<P>and state 3's list would have arrows starting with 3. In the +program, the two-member form of an arrow is called a <EM>stub.</EM>) + +<P>The program must be careful about the order in which it puts stubs +in each state's list, so it doesn't end up with + +<P><PRE>[[C 4] [A 2]] +</PRE> + +<P>for one of the states. That's why <A HREF="v3ch1.html#stub.add"><CODE>stub.add</CODE></A> +takes trouble +to insert each stub in a well-defined position, rather than just adding +each new stub at the beginning or end of the list. It's also in +<CODE>stub.add</CODE> that arrows connecting the same two states but with different +letters are combined into a single arrow. + +<P>Since states 1 and 3 also agree in their acceptingness (namely they aren't +accepting states), they can be combined into one state. <CODE> +Optimize.state</CODE> can replace every reference to state 3 with a reference to +state 1. + +<P><H2>A Finite-State Adder</H2> + + +<P>I promised earlier to show you a use for finite-state machines other +than accepting or rejecting strings. In this section I'll fulfill +that promise by designing a machine to add two numbers. We'll represent +the numbers in binary notation, in which each digit represents a power +of 2 instead of a power of 10. + +<P>If you've come across binary numbers before, you can skip this paragraph. +Just as the ordinary notation for numbers is based on the ten digits 0 to 9, +binary notation is based on <EM>two</EM> digits, 0 and 1. In ordinary +("decimal") notation, the amount that each digit contributes to the total +depends on where it is in the number. For example, in the number 247, the +digit 2 contributes two hundred, not just two, because it's in the third +position counting from the right. Each digit's contribution is the value of +the digit itself multiplied by a power of ten: +<P><CENTER>2×10<SUP><SMALL>2</SMALL></SUP> ++ 4×10<SUP><SMALL>1</SMALL></SUP> ++ 7×10<SUP><SMALL>0</SMALL></SUP></CENTER><P> +(10<SUP><SMALL>2</SMALL></SUP> is 100; 10<SUP><SMALL>1</SMALL></SUP> is 10; 10<SUP><SMALL>0</SMALL></SUP> is just 1.) In binary, the +contribution of each digit is multiplied by a power of <EM>two,</EM> so the +binary number 10101 represents +<P><CENTER>1×2<SUP><SMALL>4</SMALL></SUP> ++ 0×2<SUP><SMALL>3</SMALL></SUP> ++ 1×2<SUP><SMALL>2</SMALL></SUP> ++ 0×2<SUP><SMALL>1</SMALL></SUP> ++ 1×2<SUP><SMALL>0</SMALL></SUP></CENTER><P> +which is 16+4+1 (2<SUP><SMALL>4</SMALL></SUP>+2<SUP><SMALL>2</SMALL></SUP>+2<SUP><SMALL>0</SMALL></SUP>) or 21. Computers use binary notation +because it's easy to build electrical circuits in which each wire is either +on or off. In Chapter 2 we'll talk about an example. Right now I want to +show something different--not an actual electronic machine but an abstract +machine based on the ideas we've been using in this chapter. + +<P>The machine will add two binary numbers, one digit position at a time, just +the way you add multi-digit numbers yourself. When you see a problem like +<P><CENTER><TABLE><TR><TD align="right">376 +<TR><TD align="right"><U>+572</U></TABLE></CENTER><P> +you start at the right and say, "6 plus 2 is 8; 7 plus 7 is 14, which is +4 carry 1; 1 plus 3 plus 5 is 9." The finite-state adder works the same +way except that the digits are always 0 or 1. + +<P>The machine will add any numbers, but to explain how it works I want to +consider a specific example. Let's say we want to add 52 and 21. (By the +way, I didn't pick these numbers because they name card games, but because +the pattern of digits in their binary forms is convenient for the +explanation I want to give.) 52 in binary is 110100 (32+16+4) and 21 is +10101 (16+4+1). I'm going to write these one above the other, with a couple +of extra zeros on the left to leave room for a possible carry: + +<P><PRE> 0 0 1 1 0 1 0 0 + 0 0 0 1 0 1 0 1 +</PRE> + +<P>Remember how a finite-state machine works: At a given moment it's +in some <EM>state,</EM> then it reads some <EM>input</EM> symbol and goes to +another state. In this problem, since we have two numbers to add, the most +natural way to think about it would be to give the machine <EM>two</EM> inputs +at a time. This idea doesn't quite fit in with the formal definition of a +finite-state machine, but we can let the machine's "alphabet" consist of +<EM>pairs</EM> of digits, so something like <CODE>01</CODE> would be a single input. +(By the way, the word <EM>bit</EM> is commonly used as an abbreviation for +"binary digit.") Just as you added vertical pairs of digits (first 6 and 2, +then 7 and 7, and so on) in the earlier example, we'll use vertical pairs of +bits as the inputs to the finite-state adder, starting from the right end. +So the first input will be 01, then 00, then 11, then 00, then 11 again, +then 10, and then 00 twice. From now on, in this section, when you see +something like 10 you should remember that it is a <EM>single</EM> input to +the finite-state machine, a single symbol, not two in a row. +(In the diagram below, an arrow labeled <CODE>01/10</CODE> represents two +arrows, one for the input <CODE>01</CODE> and one for the input <CODE>10</CODE>. These +two arrows will always go to the same state because 0+1=1+0.) + +<P>We need to make one change in the notation used in machine diagrams. +We no longer want to mark each state as accepting (double circle) +or rejecting (single circle). Instead, each state produces an <EM> +output</EM> that can be any arbitrary symbol. In this machine the outputs +will be 0 or 1, representing the binary digits of the sum. Inside +each state circle, instead of just a state number you'll see something +like "3/1"; this means that it's state number 3 and that the output +from that state is 1. + +<P>Here is the machine: + +<P><CENTER><IMG SRC="fsm-add.gif" ALT="figure: fsm-add"></CENTER> + +<P>State 1, the start state, has no output. When the machine is in start +state it hasn't seen any digits of the addends yet, so it can't compute +a digit of the sum. States 2 and 4 output a zero digit, while states +3 and 5 output a one. (Like the inputs, the number that the machine +outputs is presented with its rightmost bit first. The machine works +this way for the same reason that <EM>you</EM> add numbers from right +to left: That's the direction in which a "carry" digit moves from +one column to another.) + +<P>Why are there <EM>two</EM> zero-output states and <EM>two</EM> one-output +states? The reason is that the machine is in state 4 or 5 when there +is a carry into the next digit of the sum. + +<P>Let's trace through my example. We start in state 1. The first input +symbol is 01, representing a 0 in the rightmost (binary) digit of 52 +and a 1 in the rightmost digit of 21. The machine enters state 3 +and outputs a 1. + +<P>The next input is 00 because both numbers have zero as the second +digit. The machine enters state 2 and outputs 0. + +<P>The next input is 11. The machine enters state 4 and outputs 0. Being in +state 4 means that there is a carry from this position into the next. + +<P>You can finish the example yourself. The sum should be 01001001, +or 73. + +<P><H2>Counting and Finite-State Machines</H2> + +<P>Earlier we saw that you can't write a regular expression for a rule +that requires balanced parentheses. Since regular expressions are +equivalent to finite-state machines, you won't be surprised to learn +that finite-state machines can't count. + +<P> +Actually, they can count up to a point; it's just that each finite-state +machine can only count up to a fixed limit. For example, here is +a finite-state machine that accepts strings of balanced parentheses +up to four deep: + +<P><CENTER><IMG SRC="https://people.eecs.berkeley.edu/~bh/v3ch1/fourdeep.gif" ALT="figure: fourdeep"></CENTER> + +<P>This machine will accept strings like these: + +<P><PRE> () (()) ()() + (()()) (((()))) (((())))(()) +</PRE> + +<P>There is no limit to the <EM>length</EM> of the string this +machine can handle. For example, it will accept this: + +<P><PRE> ()()()()()()()()()()()()()() +</PRE> + +<P>But there can be no more than four parentheses open <EM> +at once</EM>; the machine will reject + +<P><PRE> ((((())))) +</PRE> + +<P>Even this limited counting ability of finite-state machines is of great +practical value. Real computers, after all, are finite-state +machines. Any real computer has a finite amount of memory, and this +memory can be in a finite number of states. But the number is quite +huge. If a real computer includes a parenthesis-counting program +that is limited to, say, 20,000 levels of parentheses, nobody will +ever notice that the limit isn't infinite. + +<P>(The number of states in a real computer may be even larger than you're +thinking. Each bit of computer memory isn't a state. Instead, if +a computer has <EM>n</EM> bits of memory it has 2<SUP><SMALL>n</SMALL></SUP> states! For +example, a computer with three bits of memory can use those bits to +represent <EM>eight</EM> states: + +<P><PRE>0 0 0 +0 0 1 +0 1 0 +0 1 1 +1 0 0 +1 0 1 +1 1 0 +1 1 1 +</PRE> + +<P>The number of possible states in a typical large computer +is greater than the number of atoms in the galaxy.) + +<P>In a moment I'm going to talk about a theoretical model of a machine +with infinite memory. You might wonder why it pays to study such +machines, since any real machine has to be limited in memory. The +answer has to do with my example about the 20,000 levels of parentheses. +It is theoretically possible to write a regular expression for such +strings. To show you how it's done, here is a regular expression +for up to three levels: + +<P><CENTER><IMG SRC="xx2.gif"></CENTER> + +<P>(I've drawn boxes around the actual alphabet-rule symbols just to +make it a little easier for you to distinguish between the parentheses, +which are symbols in the input strings, and the brackets, which are part of +the glue that holds a regular expression together.) + +<P>There is no theoretical problem about extending this regular expression +to allow up to 20,000 parentheses. But a machine based on this technique +would be very large and complicated. Instead, it makes more sense to +<EM>pretend</EM> that the computer has an infinite amount of memory available +and use a formal system (like the production rules I mentioned briefly) +that depends on an infinite memory. Such a formal system leads to +a shorter, more elegant expression of the balanced parentheses rule. +In practice, we can provide enough memory for any of the strings our +program will actually meet. + +<P><H2>Turing Machines</H2> + +<P>One way we might explore infinite machines is to imagine that they're +represented by state diagrams, like those of finite-state machines, +but with an infinite number of states. For example, here is a picture +of an infinite-capacity parenthesis counter: + +<P><CENTER><IMG SRC="waydeep.gif" ALT="figure: waydeep"></CENTER> + +<P>The trouble with this idea is that it's hard to model precisely what's +meant by that row of dots on the right. There's no way we can have +a <EM>complete</EM> formal description of an infinitely complex machine. + +<P>Instead, consider what <EM>you</EM> do when you have to solve a problem +too complex to fit in your own memory. You don't build yourself a +bigger brain; you get a pencil and some paper. That is, you use <EM> +external</EM> storage. You can imagine that your brain is a finite-state +machine but that it has access to an infinite supply of paper. + +<P>Of course, this division of a problem's necessary information into +a finite <EM>internal</EM> state and an infinite <EM>external</EM> memory +also models actual computers in an obvious way. The internal state +in the model represents the internal memory of the computer, while +the external memory in the model represents such storage devices as +disks and tapes. To solve a bigger problem you don't have to buy +a bigger computer; you just have to switch floppy disks occasionally. + +<P>You might think that the mathematical model I'm talking about was +based on the analogy with real computers and that my story about the finite-state +brain is just a coincidence. But in fact this model was invented +by Alan M. Turing in 1936, before there were any computers! It was +<EM>human</EM> problem-solving that Turing wanted to model with a machine. + +<P>What is a Turing machine? Start by imagining a finite-state machine +with different possible outputs, like the adder we saw earlier. Attached +to this machine is a tape of limitless length. Symbols from some +alphabet, like the machine's input and output symbols, can be written +on the tape. There is a reading and writing mechanism, like the heads +of a magnetic tape recorder, that can travel along the tape. + +<P>Just as each state of the machine can have an output associated with +it, each state can also take action to affect the tape: It can move +the head by one symbol in either direction and it can change the +symbol at the head's current position. + +<P>In fact, we simplify the formal description of the machine by +using the tape as the input/output device as well as the storage device. +That is, we can start with some sequence of symbols already written +on the tape. That sequence serves as the input to the machine; state +transitions are based on the symbol under the tape head, not a symbol +from some other source. Likewise, the output from a state (if any) +is written on the tape. + +<P>Somewhat analogous to the concept of an accepting state in our earlier +examples, a Turing machine can have <EM>halting</EM> states. When the +machine enters a halting state it stops its operation. There are +no more state transitions. The output from the machine is whatever +sequence of symbols it leaves written on the tape. + +<P><H2>Turing's Thesis</H2> + + +<P>Turing invented his abstract machine because he was trying to formalize +the idea of an <EM>effective procedure</EM>: What does it mean to specify +a technique for solving some problem well enough that we can be sure +it will really work? As an analogy, think about directions for driving +from here to somewhere. Someone can hand you a piece of paper with +a list of instructions like "When you get to the gas station +on the left, turn right." But sometimes you get directions that +weren't careful enough. There may be a sharp right turn and a mild +right turn available near the gas station. There may be a fork in +the road before you even get to the gas station. + +<P>Turing's idea is that any problem for which there is <EM>any</EM> effective +procedure can be modeled by a Turing machine. The phrase "any effective +procedure" is taken to include the workings of the human mind. If +Turing is right, any problem that a person can solve can be programmed +for a computer. + +<P>This claim isn't something that can be proved or disproved mathematically +because there is no prior formal definition of "effective procedure" +to which Turing machines can be compared. Also, it may be that the +idea of a procedure somehow doesn't cover all the different kinds +of thinking that people do. Maybe it's true, for example, that computers +are potentially as powerful as people at solving problems, but "solving +problems" might not turn out to be an appropriate description of +what's going on when we feel emotions. If that turned +out to be true, we <EM>should</EM> expect a computer to become the world's +chess champion someday, but we <EM>shouldn't</EM> expect one to become +the world's champion poet. + +<P>But this possible conclusion has been attacked from both sides. Some +people think that emotions really <EM>are</EM> a matter of computable +procedures. Kenneth Colby's program called Parry attempts to model +the behavior of a paranoid human being by manipulating variables +for emotions like anger and fear. On the other hand, some people +think that even chess doesn't fall within the realm of things that +people do by carrying out effective procedures in Turing's sense. +A chess master, these people say, doesn't analyze the chess board +in a step-by-step fashion like a computer. He looks at the board +as a single, whole entity, and the important features just spring +to mind by some process that is still rather mysterious. + +<P>What <EM>is</EM> known is that several other mathematical models of effective +procedures have been shown to be equivalent to Turing machines, in +the same sense in which regular expressions are equivalent to finite-state +machines. For example, all of the popular programming languages are +Turing-equivalent. There's no such thing as a computation that can +be done in Logo but not in Pascal, or vice versa. (Of course, one +language may be more <EM>convenient</EM> than another for a given problem.) + +<P><H2>The Halting Theorem</H2> + + +<P>I'm not going to get into specific examples of Turing machine programming +here. That would take too much space for a single chapter; if you're +interested you should pursue the topic in a book on automata theory. +But I want to give one example of the theoretical value of Turing machines. + +<P>You've undoubtedly had the experience of writing a Logo program with a bug +that causes an "infinite loop"--you run the program and it just sits there +forever, when instead it's supposed to compute and print some results. +That's a frustrating kind of bug because you're never quite sure if the +program is really broken or if it's just very slow. Maybe if you waited +another minute it would come up with the answer. Wouldn't it be great if, +when you started the program running, Logo could print an error message like +<CODE>This program has an infinite loop</CODE>, just as it does for other errors? + +<P>It turns out that infinite loops can't, in general, be detected +automatically. Certainly <EM>some</EM> infinite loops are very easy to spot, +and we can write programs that catch certain categories of infinite loop. +But we can't write a program that's <EM>guaranteed</EM> to catch infinite +loops in programs, in Logo or any other Turing-equivalent language. The +fact that it's impossible is called the <EM>halting theorem.</EM> + +<P>It's a little tricky understanding just what the halting theorem says because +it involves Turing machines that manipulate Turing machines as data, which +is a kind of self-reference akin to recursion. Self-reference is always hard +to talk about and can lead to paradoxes like the classic "This statement +is false." (Is the sentence in quotes true or false? If it's true, then it +must be false, because it says so. But if it's false, and it <EM>says</EM> +it's false, it must really be true!) So let's proceed carefully. + +<P>The data recorded on a Turing machine's tape is a string of symbols. +Generally we choose the symbols to represent something meaningful; for +example, a string of digits can represent a number. Earlier in this chapter +we used strings of symbols like + +<P><PRE>[1 [[1 A 2] [2 B 3] [3 A 2]] [1 3]] +</PRE> + +<P>to represent a finite-state machine. There's no reason we +couldn't put <EM>that</EM> string of symbols on the tape of a Turing machine +as its input. For example, we could build a Turing machine that would work +like my <CODE>fsm</CODE> program, simulating the finite-state machine that it found +written on its tape when it started. + +<P>Letting a Turing machine simulate a finite-state machine doesn't raise +questions of self-reference. But a Turing machine, too, is a formal +structure; it, too, can be represented as a string of symbols. + +<P>Because a representation of a Turing machine can be the input to another +Turing machine, we can design Turing machines that answer questions about +Turing machines. For example, we can write a <EM>universal</EM> Turing +machine, one that simulates any Turing machine the way <CODE>fsm</CODE> simulates +any finite-state machine. + +<P>A universal Turing machine (a Turing machine simulator) sort of half-solves +the halting problem. Suppose we want to know whether a given machine will +halt after it is started with a given input. (This is like asking whether a +certain Logo procedure will terminate if it's invoked with a particular +input.) We can use the universal Turing machine to simulate the one we're +interested in. If the machine <EM>does</EM> halt, we'll find out about it. +But if the machine in question <EM>doesn't</EM> halt, then the simulator +won't halt either. We'll still have the problem we had in the first +place--how can we be sure it won't finally halt if we give it another +minute? + +<P>To solve the halting problem, what we need is a Turing machine that accepts +a representation of any Turing machine as input, just like the universal +Turing machine. But this one has to be guaranteed to halt, even if the +input machine wouldn't halt. That's what the halting theorem says we can't +do. + +<P><H2>Proving the Halting Theorem in Logo</H2> + +<P>What makes it possible to raise the question of whether a Turing machine can +decide whether another Turing machine would halt for a given input tape is +the fact that one Turing machine's "program" can be represented as data +for another Turing machine. This is also true of Logo procedures. In +particular, the higher-order procedures like <CODE>map</CODE> and <CODE>filter</CODE> +manipulate other procedures by accepting their names as inputs. +We can, therefore, use Logo +procedures to illustrate the proof of the halting theorem. + +<P>We'll consider a Logo procedure with an input as analogous to a Turing +machine with its input tape. We want to prove that there can't be a Logo +procedure that could tell whether such a procedure stops for a given input. +The technique we use is called <EM>proof by contradiction.</EM> In this +technique we assume that there <EM>is</EM> such a procedure, then show that +this assumption leads to a paradox. + +<P>So let's imagine that someone has written a Logo predicate <CODE>haltp</CODE> that +takes two inputs: the name of a procedure and an input value for that +procedure. <CODE>Haltp</CODE> will output <CODE>true</CODE> if the procedure it's testing +would eventually stop, given the specified input; <CODE>haltp</CODE> outputs <CODE> +false</CODE> if the procedure it's testing would get into an infinite loop, like a +recursive procedure without a stop rule. (In practice, if you think about +your own experience debugging programs, it's easy to tell if a procedure +doesn't have a stop rule at all, but not so easy to be sure that the stop +rule will always eventually be satisfied. Think about a Pig Latin program +given a word of all consonants as input. We want + +<P><PRE>to piglatin :word +if memberp first :word [a e i o u] [output word :word "ay] +output piglatin word bf :word first :word +end + +? <U>print haltp "piglatin "salami</U> +true +? <U>print haltp "piglatin "mxyzptlk</U> +false +</PRE> + +<P>Remember that <CODE>haltp</CODE> itself must <EM>always</EM> stop, even +in the case where <CODE>piglatin</CODE> wouldn't stop.) + +<P>Now consider this Logo procedure: + +<P><PRE>to try :proc +if haltp :proc :proc [loop] +end + +to loop +loop +end +</PRE> + +<P>Since <CODE>haltp</CODE> works, we're assuming, on <EM>any</EM> Logo +procedure with one input, it must work on <CODE>try</CODE> in particular. +What happens if we say + +<P><PRE>? <U>try "try</U> +</PRE> + +<P>Does this stop or loop? Suppose it stops. <CODE>try</CODE> begins its +work by evaluating the expression + +<P><PRE>haltp "try "try +</PRE> + +<P>Since we've said <CODE>try</CODE> will stop, given <CODE>try</CODE> as input, +<CODE>haltp</CODE> will output <CODE>true</CODE>. It follows, from the definition of <CODE> +try</CODE>, that <CODE>try</CODE> will invoke <CODE>loop</CODE> and will <EM>not</EM> stop. +Similarly, if we start with the assumption that <CODE>try</CODE> will loop, then +<CODE>haltp</CODE> must output <CODE>false</CODE> and so, from the definition of <CODE> +try</CODE>, you can see that <CODE>try</CODE> <EM>will</EM> stop. Whatever value <CODE> +haltp</CODE> outputs turns out to be incorrect. + +<P>It was the assumption that we could write an infallible <CODE>haltp</CODE> that led +us into this contradiction, so that assumption must be wrong. We can't write +a Logo procedure that will automatically detect infinite loops in our +programs. A similar proof could be made in any language in which one program +can manipulate another program as data--that is, in any Turing-equivalent +language. + +<P> +<TABLE width="100%"><TR><TD><A HREF="../v3-toc2.html">(back to Table of Contents)</A> +<TD align="right"><A HREF="../v3ch0/v3ch0.html"><STRONG>BACK</STRONG></A> +chapter thread <A HREF="../v3ch2/v3ch2.html"><STRONG>NEXT</STRONG></A> +</TABLE> + +<P><H2>Program Listing</H2> + +<P><P> +<P><PRE> +;;; Finite State Machine Interpreter (<A NAME="fsm">FSM</A>) + +to game :which +fsm thing word "mach :which +end + +to fsm :machine +cleartext +setcursor [0 3] +localmake "start startpart :machine +localmake "moves movepart :machine +localmake "accept acceptpart :machine +fsm1 :start +end + +to fsm1 :here +ifelse memberp :here :accept [accept] [reject] +fsm1 (fsmnext :here readchar) +end + +to fsmnext :here :input +blank +if memberp :input (list char 13 char 10) ~ + [print ifelse memberp :here :accept ["| ACCEPT|] ["| REJECT|] + output :start] +type :input +catch "error [output last find [fsmtest :here :input ?] :moves] +output -1 +end + +to fsmtest :here :input :move +output and (equalp :here arrowtail :move) ~ + (memberp :input arrowtext :move) +end + +;; Display machine state + +to accept +display "accept +end + +to reject +display "reject +end + +to blank +display "| | +end + +to display :text +localmake "oldpos cursor +setcursor [15 1] +type :text +setcursor :oldpos +end + +;; Data abstraction for machines + +to startpart :machine +output first :machine +end + +to movepart :machine +output first bf :machine +end + +to acceptpart :machine +output last :machine +end + +to make.machine :start :moves :accept +output (list :start :moves :accept) +end + +;; Data abstraction for arrows + +to arrowtail :arrow +output first :arrow +end + +to arrowtext :arrow +output first butfirst :arrow +end + +to arrowhead :arrow +output last :arrow +end + +to make.arrow :tail :text :head +output (list :tail :text :head) +end + +;; Machine descriptions for the guessing game + +make "mach1 [1 [[1 AB 1]] [1]] +make "mach2 [1 [[1 ABC 2] [2 ABC 1]] [1]] +make "mach3 [1 [[1 A 2] [2 B 3] [3 ABC 3]] [3]] +make "mach4 [1 [[1 A 2] [1 B 3] [1 C 4] [2 A 1] [3 B 1] [4 C 1]] [1]] +make "mach5 [1 [[1 ABC 2] [2 B 1]] [1]] +make "mach6 [1 [[1 A 2] [2 AB 2] [2 C 3] [3 AB 2] [3 C 3]] [3]] +make "mach7 [1 [[1 AB 1] [1 C 2] [2 C 1]] [1]] +make "mach8 [1 [[1 A 2] [1 BC 1] [2 A 1] [2 BC 2]] [1]] +make "mach9 [1 [[1 AB 1] [1 C 2] [2 A 3] [2 B 1] [3 A 1]] [1]] +make "mach10 [1 [[1 A 2] [1 BC 1] [2 A 2] [2 B 3] [2 C 1] + [3 A 2] [3 B 1] [3 C 4] [4 A 2] [4 B 5] [4 C 1] + [5 A 6] [5 BC 1] [6 ABC 6]] + [6]] + + +;;; Regular Expression to FSM Translation (<A NAME="machine">MACHINE</A>) + +to machine :regexp +localmake "nextstate 0 +output optimize determine nondet :regexp +end + +;; First step: make a possibly <A NAME="nondet">nondeterministic</A> machine + +to nondet :regexp +if and (wordp :regexp) (equalp count :regexp 1) ~ + [output ndletter :regexp] +if wordp :regexp [output ndor reduce "sentence :regexp] +if equalp first :regexp "or [output ndor butfirst :regexp] +if equalp first :regexp "* [output ndmany last :regexp] +output ndconcat :regexp +end + +;; <A NAME="ndletter">Alphabet</A> rule + +to ndletter :letter +localmake "from newstate +localmake "to newstate +output (make.machine :from + (list (make.arrow :from :letter :to)) + (list :to)) +end + +;; <A NAME="ndconcat">Concatenation</A> rule + +to ndconcat :exprs +output reduce "string (map "nondet :exprs) +end + +to string :machine1 :machine2 +output (make.machine (startpart :machine1) + (sentence (movepart :machine1) + (splice acceptpart :machine1 :machine2) + (movepart :machine2)) + (stringa (acceptpart :machine1) + (startpart :machine2) + (acceptpart :machine2))) +end + +to stringa :accept1 :start2 :accept2 +if memberp :start2 :accept2 [output sentence :accept1 :accept2] +output :accept2 +end + +;; <A NAME="ndor">Alternatives</A> rule + +to ndor :exprs +localmake "newstart newstate +localmake "machines (map "nondet :exprs) +localmake "accepts map.se "acceptpart :machines +output (make.machine :newstart + (sentence map.se "movepart :machines + map.se "or.splice :machines) + ifelse not emptyp find [memberp (startpart ?) + (acceptpart ?)] + :machines + [fput :newstart :accepts] + [:accepts]) +end + +to or.splice :machine +output map [newtail ? :newstart] (arrows.from.start :machine) +end + +;; <A NAME="ndmany">Repetition</A> rule + +to ndmany :regexp +localmake "machine nondet :regexp +output (make.machine (startpart :machine) + sentence (movepart :machine) + (splice (acceptpart :machine) :machine) + fput (startpart :machine) (acceptpart :machine)) +end + +;; <A NAME="splice">Generate</A> moves from a bunch of given states (:accepts) duplicating +;; the moves from the start state of some machine (:machine). +;; Used for concatenation rule to splice two formerly separate machines; +;; used for repetition rule to "splice" a machine to itself. + +to splice :accepts :machine +output map.se [copy.to.accepts ?] (arrows.from.start :machine) +end + +to arrows.from.start :machine +output filter [equalp startpart :machine arrowtail ?] movepart :machine +end + +to copy.to.accepts :move +output map [newtail :move ?] :accepts +end + +to newtail :arrow :tail +output make.arrow :tail (arrowtext :arrow) (arrowhead :arrow) +end + +;; <A NAME="newstate">Make</A> a new state number + +to newstate +make "nextstate :nextstate+1 +output :nextstate +end + +;; Second step: Turn nondeterministic FSM into a <A NAME="determine">deterministic</A> one +;; Also eliminates "orphan" (unreachable) states. + +to determine :machine +localmake "moves movepart :machine +localmake "accepts acceptpart :machine +localmake "states [] +localmake "join.state.list [] +localmake "newmoves nd.traverse (startpart :machine) +output make.machine (startpart :machine) ~ + :newmoves ~ + filter [memberp ? :states] :accepts +end + +to <A NAME="nd.traverse">nd.traverse</A> :state +if memberp :state :states [output []] +make "states fput :state :states +localmake "newmoves (check.nd filter [equalp arrowtail ? :state] :moves) +output sentence :newmoves map.se "nd.traverse (map "arrowhead :newmoves) +end + +to <A NAME="check.nd">check.nd</A> :movelist +if emptyp :movelist [output []] +localmake "letter arrowtext first :movelist +localmake "heads sort map "arrowhead ~ + filter [equalp :letter arrowtext ?] :movelist +if emptyp butfirst :heads ~ + [output fput first :movelist + check.nd filter [not equalp :letter arrowtext ?] + :movelist] +localmake "check.heads member :heads :join.state.list +if not emptyp :check.heads ~ + [output fput make.arrow :state :letter first butfirst :check.heads ~ + check.nd filter [not equalp :letter arrowtext ?] + :movelist] +localmake "join.state newstate +make "join.state.list fput :heads fput :join.state :join.state.list +make "moves sentence :moves ~ + map [make.arrow :join.state + arrowtext ? + arrowhead ?] ~ + filter [memberp arrowtail ? :heads] :moves +if not emptyp find [memberp ? :accepts] :heads ~ + [make "accepts sentence :accepts :join.state] +output fput make.arrow :state :letter :join.state ~ + check.nd filter [not equalp :letter arrowtext ?] :movelist +end + +to sort :list +if emptyp :list [output []] +output insert first :list sort butfirst :list +end + +to insert :value :sorted +if emptyp :sorted [output (list :value)] +if :value = first :sorted [output :sorted] +if :value < first :sorted [output fput :value :sorted] +output fput first :sorted insert :value butfirst :sorted +end + +;; Third step: <A NAME="optimize">Combine</A> redundant states. +;; Also combines arrows with same head and tail: +;; [1 A 2] [1 B 2] -> [1 AB 2]. + +to optimize :machine +localmake "stubarray array :nextstate +foreach (movepart :machine) "array.save +localmake "states sort fput (startpart :machine) ~ + map "arrowhead movepart :machine +localmake "start startpart :machine +foreach reverse :states [optimize.state ? ?rest] +output (make.machine :start + map.se [fix.arrows ? item ? :stubarray] :states + filter [memberp ? :states] acceptpart :machine) +end + +to array.save :move +setitem (arrowtail :move) :stubarray ~ + stub.add (arrow.stub :move) (item (arrowtail :move) :stubarray) +end + +to <A NAME="stub.add">stub.add</A> :stub :stublist +if emptyp :stublist [output (list :stub)] +if (stub.head :stub) < (stub.head first :stublist) ~ + [output fput :stub :stublist] +if (stub.head :stub) = (stub.head first :stublist) ~ + [output fput make.stub letter.join (stub.text :stub) + (stub.text first :stublist) + stub.head :stub + butfirst :stublist] +output fput first :stublist (stub.add :stub butfirst :stublist) +end + +to letter.join :this :those +if emptyp :those [output :this] +if beforep :this first :those [output word :this :those] +output word (first :those) (letter.join :this butfirst :those) +end + +to optimize.state :state :others +localmake "candidates ~ + filter (ifelse memberp :state acceptpart :machine + [[memberp ? acceptpart :machine]] + [[not memberp ? acceptpart :machine]]) ~ + :others +localmake "mymoves item :state :stubarray +localmake "twin find [equalp (item ? :stubarray) :mymoves] :candidates +if emptyp :twin [stop] +make "states remove :state :states +if equalp :start :state [make "start :twin] +foreach :states ~ + [setitem ? :stubarray + (cascade [emptyp ?2] + [stub.add (change.head :state :twin first ?2) + ?1] + filter [not equalp stub.head ? :state] + item ? :stubarray + [butfirst ?2] + filter [equalp stub.head ? :state] + item ? :stubarray)] +end + +to change.head :from :to :stub +if not equalp (stub.head :stub) :from [output :stub] +output list (stub.text :stub) :to +end + +to fix.arrows :state :stublist +output map [stub.arrow :state ?] :stublist +end + +;; Data abstraction for "stub" arrow (no tail) + +to arrow.stub :arrow +output butfirst :arrow +end + +to make.stub :text :head +output list :text :head +end + +to stub.text :stub +output first :stub +end + +to stub.head :stub +output last :stub +end + +to stub.arrow :tail :stub +output fput :tail :stub +end +</PRE><P> + +<P><A HREF="../v3-toc2.html">(back to Table of Contents)</A> +<P><A HREF="../v3ch0/v3ch0.html"><STRONG>BACK</STRONG></A> +chapter thread <A HREF="../v3ch2/v3ch2.html"><STRONG>NEXT</STRONG></A> + +<P> +<ADDRESS> +<A HREF="../index.html">Brian Harvey</A>, +<CODE>bh@cs.berkeley.edu</CODE> +</ADDRESS> +</BODY> +</HTML> |