author Thomas E. Dickey <dickey@invisible-island.net> 2012-02-20 02:08:17 -0500
committer Thomas E. Dickey <dickey@invisible-island.net> 2012-02-20 02:08:17 -0500
commit bc0fa578036583231edb567b328b4f69ce6860fe
tree 99b322070bf62270218a0d80257a1f50bbefe147 /docs/CRAWL.announce
parent bb5fd6e44e480f571bcb713788cc50eea44095e5
snapshot of project "lynx", label v2-8-8dev_11
Diffstat (limited to 'docs/CRAWL.announce')
-rw-r--r-- docs/CRAWL.announce 131
1 files changed, 131 insertions, 0 deletions
diff --git a/docs/CRAWL.announce b/docs/CRAWL.announce
new file mode 100644
index 00000000..e734bcba
--- /dev/null
+++ b/docs/CRAWL.announce
@@ -0,0 +1,131 @@
+The TRAVERSAL code from old versions of Lynx has been upgraded by David
+Mathog (mathog@seqaxp.bio.caltech.edu) so that it works again.  It is
+now enabled via a command line switch (-traversal) instead of via a
+compilation symbol that created a separate Lynx executable, as in those
+previous versions, and it can be used in conjunction with a -crawl
+switch to make Lynx a front end for a Web Crawler.
+ 
+
+Usage:
+
+   lynx [-traversal] [-realm] [-crawl] ["startpage"]
+
+
+Added switches are:
+
+  -traversal      Follow all http links derived from startpage that are
+                  on the same server as startpage.  If startpage isn't
+                  specified then the traversal begins with the default
+                  startfile or WWW_HOME.
+
+  -realm	  Further restrict http links to ones in the same realm
+                  (having a matching base URI) as the startpage (e.g.,
+		  http://host/~user/ will restrict the traversal to that
+		  user's public html tree).
+
+  -crawl          With [-traversal] outputs each unique hypertext page
+                  as an lnk########.dat file in the format specified
+                  below.  With [-dump] outputs only the startpage, in
+		  the same format, to stdout.
+
+
+Note on startpage:
+
+                  If a startpage is specified and contains any uppercase
+                  characters, on VMS it should be enclosed in double-quotes.
+                  Without double-quotes, the startpage is converted to all
+                  lowercase on VMS.  Both the code that extracts the access
+                  and host fields from startpage (to check that links are
+                  not on another server) and the comparisons with already
+                  traversed links are case sensitive, so the startpage
+                  might be treated as a new link if it is encountered
+                  with uppercase letters.
+
+
+Files created and/or used with the -traversal switch, based on definitions
+in userdefs.h:
+
+TRAVERSE_FILE (traverse.dat):
+                  Contains a list of all URLs that were traversed.  Note
+                  that if a URL appears in this file it will not be 
+                  traversed again (important if runs are started and 
+                  stopped).  Placing an entry in this file BEFORE the
+                  run will block traversal of that URL.  Unlike in reject.dat,
+                  a final * has no effect here (see below).  Note that Lynx
+		  internal client-side image MAP URLs will be included in
+		  this file (e.g., LYNXIMGMAP:http://server/foo.html#map1),
+		  in addition to the "real" (external) http URLs.
+
+TRAVERSE_FOUND_FILE (traverse2.dat):
+                  Contains a list of all URLs that were traversed, in the
+                  order encountered or re-encountered (but not re-traversed)
+                  during a traversal run, and the TITLEs of the documents
+                  (separated from the URLs by TABs).  A URL and TITLE may be
+                  present in this list many times.  To simplify the list,
+                  on VMS use:  sort/nodups traverse2.dat;1 ;2
+		  Note that the URLs and TITLEs of the Lynx internal
+		  client-side image MAP pseudo-documents will not be
+		  included in this file, though "traversed", but only the
+		  http URLs and TITLEs derived from the MAP's AREA tag
+		  HREFs that were traversed.
+
+TRAVERSE_REJECT_FILE (reject.dat):
+                  Contains a list of URLs that have been rejected from the
+                  traversal.  Once a URL has been entered in this list, it
+                  will not be traversed.  URLs that end in a * will cause
+		  rejection of all URLs that match up to the character before
+		  the *. So for instance, to reject all htbin references on a
+		  site put this line in the reject.dat file BEFORE starting
+		  the run:  http://www.site.wherever:8000/htbin*
+
+TRAVERSE_ERRORS (traverse.errors):
+		  A list of links that could not be accessed or had an
+		  unknown status returned by the http server.  If the
+		  owner of the document containing the link is known via
+		  a LINK REV="made" HREF="mailto:foo" in it and
+		  MAIL_SYSTEM_ERROR_LOGGING was set true in userdefs.h
+		  or lynx.cfg (not recommended!!!), a message about the
+		  problem will be mailed to the owner as well. 
+
+
+Files created during traversals if the -crawl switch is included with the
+-traversal switch:
+
+lnk########.dat   Numbered output files containing the contents of traversed
+		  hypertext documents in text format.  All hypertext links
+		  within the document have been stripped, and the URL and
+		  TITLE of the document are recorded as the first two lines,
+		  e.g., for the seqaxp.bio.caltech.edu home page the first
+		  two lines will be:
+
+                  THE_URL:http://seqaxp.bio.caltech.edu:8000/
+                  THE_TITLE:SAF Web server home page
+
+                  The VMSIndex software is being adapted to use this
+		  information to extract the corresponding URL and TITLE
+		  for use in indexing the lnk########.dat files, e.g.:
+
+                  $ build_index -
+                    /url=(text="THE_URL:") -
+                    /topic=(text="THE_TITLE:",EXCLUDE) -
+                    /output=INDEX_NAME -
+                    lnk*.dat
+
+		  A clever person should be able to figure out a way to
+		  index the lnk########.dat files on Unix as well.
+
+		  If you want the hypertext links in the document to be
+		  numbered, include the -number_links switch.  By default,
+		  this will cause the list of References (URLs for the
+		  numbered links) to be appended as well.  If you want
+		  numbered links but not the References list, include the
+		  -nolist switch as well.
+
+		  Note that any client-side image MAP pseudo documents
+		  that were "traversed" will not have lnk########.dat
+		  output files created for them, but output files will
+		  be created for "real" documents that were traversed
+		  based on the HREFs of the MAP's AREA tags.
+
+This functionality is still under development.  Feedback and suggestions
+are welcome.
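
As a rough illustration of how the lnk########.dat files described above might be indexed on Unix, here is a minimal Python sketch. It is not part of Lynx. The function names read_crawl_header and build_index, the latin-1 encoding, and the assumption that THE_URL: and THE_TITLE: begin at the start of the first two lines are all hypothetical choices made for this example.

```python
from pathlib import Path

URL_PREFIX = "THE_URL:"
TITLE_PREFIX = "THE_TITLE:"


def read_crawl_header(path):
    """Return (url, title) taken from the first two lines of a crawl file."""
    # latin-1 is an assumption; it decodes any byte sequence without error.
    with open(path, encoding="latin-1") as fh:
        first = fh.readline().rstrip("\r\n")
        second = fh.readline().rstrip("\r\n")
    if not first.startswith(URL_PREFIX) or not second.startswith(TITLE_PREFIX):
        raise ValueError(f"{path}: missing THE_URL:/THE_TITLE: header")
    return first[len(URL_PREFIX):], second[len(TITLE_PREFIX):]


def build_index(directory="."):
    """Map each lnk*.dat file name in directory to its (url, title) pair."""
    index = {}
    for path in sorted(Path(directory).glob("lnk*.dat")):
        index[path.name] = read_crawl_header(path)
    return index


if __name__ == "__main__":
    # Print one tab-separated line per crawl file: name, URL, TITLE.
    for name, (url, title) in build_index().items():
        print(f"{name}\t{url}\t{title}")
```

Run from the directory holding the crawl output, this prints one line per file. A real indexer would also need to process the document text that follows the two header lines.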