PERL MINI-TUTORIAL.


Copyright © 2001 G. William Moore, MD, PhD.
http://www.medparse.com/whatperl.htm

G. William Moore, MD, PhD [1,2,3],

From: Pathology and Laboratory Medicine Service (113), Baltimore VA Maryland Health Care System [1], Baltimore, MD.
Department of Pathology, University of Maryland School of Medicine [2], Baltimore, MD.
Department of Pathology, The Johns Hopkins Medical Institutions [3], Baltimore, MD.



0. TABLE OF CONTENTS.


1. ABSTRACT.
2. INTRODUCTION.
3. INTERNET BROWSER.
4. WORLDWIDE WEB.
5. TEXT FILES.
6. HYPERTEXT MARKUP LANGUAGE.
7. FILE TRANSFER PROTOCOL.
8. PERL.



1. ABSTRACT.





2. INTRODUCTION.



      The INTERNET is a worldwide system of computers, connected together by telephone lines and other high-speed connections. The Internet began about three decades ago as a secure network of a few top-of-the-line computers, sponsored by the U.S. Department of Defense, DEFENSE ADVANCED RESEARCH PROJECTS AGENCY (DARPA). As more universities and defense contractors became attached to this system, the network became more public and less secure for defense secrets. Today the Internet is completely public.

      The features of the Internet with special interest for this tutorial are:
(1) the WORLDWIDE WEB (WWW);
(2) HYPERTEXT MARKUP LANGUAGE (HTML);
(3) FILE TRANSFER PROTOCOL (FTP);
(4) COMMON GATEWAY INTERFACE (CGI); and
(5) PRACTICAL EXTRACTION AND REPORTING LANGUAGE (PERL).
      The WORLDWIDE WEB (WWW) is a collection of individual WEBSITES, which contain an initial document, or HOME PAGE, that in turn is linked to many subsidiary documents. The power of the web derives from the fact that:
(1) the documents may be richly formatted, using various typesetting commands, as well as multimedia segments, such as images, cinematic video clips, audio clips, etc.; and

(2) linkages may be made to related documents anywhere else on the Internet.
Technically, the great advantage of the web is that your Internet server sends you a desired web page, and then ignores you until your computer requests another web page. While you are looking through your current web page, your Internet server is taking care of many other clients. Thus, the worldwide web operates on the principle that humans think more slowly than computers.

      HYPERTEXT MARKUP LANGUAGE (HTML) is virtually the universal language of webpages on the worldwide web. A simple HTML page consists of ordinary ASCII text (see below), sandwiched between a standard header and standard trailer. You can either write HTML from a specialized editor (we recommend DREAMWEAVER); or else you can write HTML from a raw text editor.

      FILE TRANSFER PROTOCOL (FTP) is a service for transferring large documents across the Internet. To use FTP, you need an Internet Service Provider dialup account and an FTP program. Many serviceable FTP programs are available as freeware on the Internet. My favorite is: WS-FTP.

      The COMMON GATEWAY INTERFACE (CGI) is a method by which a web server hands a client's request to a program running on the server, and returns the program's output to the client. The most popular operating system for common gateway interface programs is UNIX. One of the most popular programming languages for the common gateway interface is PERL.

      PRACTICAL EXTRACTION AND REPORTING LANGUAGE (PERL) is one of the most popular programming languages on the Internet. The advantages of PERL are:
(1) Many basic PERL modules are cost-free. When you pay for a PERL module, you should expect a lot of additional value.

(2) PERL is ubiquitous. There are about 4000 Internet Service Providers that offer PERL version 5 or higher. The ubiquity of PERL means that ISPs are competing to sell you PERL-associated products, so that the prices are kept very low.

(3) PERL is well-documented. There are oodles of PERL books, aimed at everyone from rank beginners to very sophisticated programmers. Much of this necessary documentation about PERL is cost-free. When you pay, you are paying for a nice presentation or a well-organized book.

(4) It is easy to learn a little PERL, and you can start writing your own simple programs in one day. However, PERL has a great deal of depth as a programming language.
A programming language is the language in which you may set up complex calculations for your computer to perform, such as recognizing passwords, accepting credit-card information, calculating statistics, etc.

      PERL is typically run on the server computer, in the UNIX operating system, and the results are transmitted to the client computer in the form of a webpage. Java and JavaScript are typically run on the client computer, and may be run in your Internet browser, even if the dialup line is disconnected. Java is very powerful, but subject to security breaches except in the hands of the experts. It is difficult to learn even a little Java. JavaScript is secure but almost useless, because it can't access large files, for security reasons.

      With sufficient study, you can do many sophisticated operations in PERL. You can download complete instruction manuals and serviceable interpreters (i.e., computer language translators) for PERL from the Internet. PERL is virtually the creation of a single person, Larry Wall.



3. INTERNET BROWSER.



      An INTERNET BROWSER is a software system which resides on your (client) computer, and conducts the dialogue between your computer and your Internet Service Provider (ISP). Popular Internet browsers include: Netscape and Microsoft Internet Explorer. Many ISP-vendors, such as America Online and NetCom, provide proprietary software when they first sell you a service contract. Other ISPs, such as Erols, simply give you a registered copy of a major browser, such as Netscape.

      For Netscape and Microsoft Internet Explorer, operating on Microsoft Windows, the operating instructions are fairly self-explanatory. In order to print the webpage which is currently displayed on your screen, simply go to the File option in the upper left-hand corner, and select Print. If you have displayed a webpage on your browser, and you wish to print it later when you have time, then go to the File option and select Save As. If you have created your own webpage file on your hard disk drive, and you wish to see what it will look like on the web, then go to the File option and select Open.



4. WORLDWIDE WEB.



      HOW THE WORLDWIDE WEB WORKS. Your client computer sends a request to a server, and the server responds by sending back a webpage. Although your telephone line is busy the whole time, and your server is always passively paying attention, in case you send another request, the server is only working actively for you in the few moments when it is accepting your request and returning the webpage. The time which you spend staring at the webpage on your home computer does not consume any significant resources from your server. In effect, this is like a roomful of amateur chess players playing simultaneously with a grand master chess player. The grand master goes from board to board, spending an instant with each amateur player. This is an extremely cost-effective way for (fast) servers to interact with (slow) clients. In older server-client systems, the server paid complete attention to a single client for an entire session, a much less efficient arrangement.

      The two most popular Internet browsers, namely Netscape and Microsoft Internet Explorer, are fairly serviceable systems for simple typesetting, printing, and testing new web pages. Both browsers are in Version 5 or better. In these systems, it is not necessary to have an active dialup in order to display a page on the browser. To display a webpage file stored on your hard disk drive, simply go to the File option in the upper left-hand corner, and select Open; to print the page currently displayed, select Print. If you have displayed a webpage on your browser, and you wish to print it later when you have time, then go to the File option in the upper left-hand corner, and select Save As. This trick can also be used in order to save email messages while your dialup is active, then print them out later when the telephone is disconnected. This is particularly valuable if your organization only owns a single telephone line, and you do not want your computer to interfere with telephone voice communications.



5. TEXT FILES.



      ASCII is the AMERICAN STANDARD CODE FOR INFORMATION INTERCHANGE. So-called seven-bit ASCII, which is almost universally standardized, is a system of characters corresponding to the numbers between 0 and 127. For example, A=65, B=66,..., Z=90, and a=97, b=98,..., z=122. The INTERNATIONAL ORGANIZATION FOR STANDARDIZATION (ISO) uses the same numbering system. Seven-bit ASCII is also known as VANILLA ASCII, because it is plain, ordinary ASCII, with no frills, no special alphabets, etc.
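
      As a small illustration of this numbering, the following PERL sketch converts between characters and their ASCII numbers with the built-in functions ord and chr. The subroutine name ascii_entry is hypothetical, chosen only for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: format one "character=number" entry.
sub ascii_entry {
    my ($char) = @_;
    return $char . "=" . ord($char);
}

# Print the entries quoted in the text: A, B, Z, a, b, z.
print join(", ", map { ascii_entry($_) } qw(A B Z a b z)), "\n";

# chr goes the other way, from number to character.
print "65 is ", chr(65), "\n";
```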

      EIGHT-BIT ASCII, which is much less standardized than 7-bit ASCII, is the system of characters corresponding to the numbers between 0 and 255. Different national language groups assign their special characters to the numbers between 128 and 255, but there are not enough slots for all the different alphabets (French accents, Spanish tildes, German umlauts, Greek, Cyrillic, Japanese, and Korean alphabets). There is no universal agreement on how to assign the characters in eight-bit ASCII.
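
      Because the numbers between 128 and 255 are not universally agreed upon, it can be useful to check whether a text stays within seven-bit ASCII. The following PERL sketch shows one possible check; the subroutine name is_seven_bit and the sample strings are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if every character of the string
# has a number in the seven-bit range 0..127.
sub is_seven_bit {
    my ($text) = @_;
    return $text !~ /[^\x00-\x7F]/ ? 1 : 0;
}

print is_seven_bit("plain text") ? "seven-bit\n" : "not seven-bit\n";

# Character number 233 (hexadecimal E9) lies outside seven-bit ASCII;
# which letter it displays depends on the eight-bit assignment in use.
print is_seven_bit("caf\xE9") ? "seven-bit\n" : "not seven-bit\n";
```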

      WORD-PROCESSORS first became popular in the early 1980s. At that time, different private companies introduced different, incompatible formats for their word-processor texts, in order to lock in their customer-base. The result is a virtual tower of Babel in word-processors. The closest thing to standardization is an option on most word-processors to output their texts in seven-bit ASCII.

      HYPERTEXT MARKUP LANGUAGE (HTML) is virtually the universal language of webpages on the worldwide web. A simple HTML page consists of ordinary ASCII text, sandwiched between a standard header and standard trailer.

      HOW DO I MAKE A SIMPLE ASCII FILE? In Windows 95, 98, or NT, click on START, then click on Programs, then click on Accessories, then click on Notepad. Notepad is also available on Windows 3.x. In Notepad, a simple ASCII file has a name ending with .txt.

      HOW DO I MAKE A SIMPLE HTML FILE? Make any vanilla (seven-bit) ASCII text-file, using Windows Notepad, or some other suitable text editor. Then insert it into the following boiler-plate:


<html><head><title> PLACE YOUR TITLE HERE</title></head>
<body>PLACE YOUR MAIN TEXT HERE
<br><br><br></body></html>
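
      The same sandwiching can be done by a short PERL script instead of by hand. The sketch below is only one way to do it; the subroutine name make_html and the sample title and text are assumptions, not part of the original boiler-plate.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: sandwich a title and main text inside the boiler-plate.
sub make_html {
    my ($title, $text) = @_;
    return "<html><head><title> $title</title></head>\n"
         . "<body>$text\n"
         . "<br><br><br></body></html>\n";
}

# Print the finished page; redirect the output to a file ending in .htm to save it.
print make_html("MY TITLE", "MY MAIN TEXT");
```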




6. HYPERTEXT MARKUP LANGUAGE.



      HYPERTEXT MARKUP LANGUAGE (HTML) IS NOT A PROGRAMMING LANGUAGE. For all intents and purposes, HTML is a typesetting language. On the worldwide web, HTML issues the formatting instructions for text information (also: image, audio, and cinematic video) that appears on the web page. If you want programming on the worldwide web, then you have two choices: programming on the server; or programming on the client.

      The Internet Worldwide Web (WWW) may be viewed as a collection of WEBPAGES. You can read my introductory guide to the WWW at URL:
http://www.medparse.com/whatnett.htm
The WWW is sponsored by the WORLDWIDE WEB CONSORTIUM, a non-profit, international organization, described at URL:
http://www.w3.org/
A STATIC WEBPAGE is a collection of text, images, and audio, that is located at a UNIFORM RESOURCE LOCATOR (URL). A static webpage provides information, but does not perform calculations or alter data. However, a static webpage can LAUNCH a computer program at your website. For example, the static webpage that describes the short-sentence parser that we will build together is located at URL:
http://www.medparse.com/
Most static webpages on the worldwide web are written in a MARKUP LANGUAGE, which controls such features as font style, font size, color, placement of images, and foreign language characters. The most popular markup language is HYPERTEXT MARKUP LANGUAGE (HTML), which is a scaled-down (simplified) version of the Standard Generalized Markup Language, described at URL:
http://www.w3.org/MarkUp/SGML.html


      A DYNAMIC WEBPAGE is created on-the-fly by a computer program, or COMPUTER SCRIPT. The highway between an HTML page and a script is the COMMON GATEWAY INTERFACE (CGI). If you click on the SUBMIT button for the www.medparse.com home webpage, then a computer program will calculate a parse for the input sentence, using a script written in PRACTICAL EXTRACTION AND REPORTING LANGUAGE (PERL). There are many other languages for writing such scripts besides Perl, including C, M, Python, and Java. The advantages of Perl are: cost-free, easy-to-learn, very powerful. When you have your FTP access set up, I will send you a Perl module.

      Most webpage names begin with the prefix "www", but as you see, the owner of the website has the option to use a different prefix. Every HOME WEBPAGE has an up-to-twelve-digit number, supplied by the INTERNET SERVICE PROVIDER (ISP), and more commonly known as the INTERNET PROTOCOL (IP) ADDRESS. The ISP number for www.medparse.com is: xxx.xxx.xxx.xxx. You can reach any URL through its ISP number, as for example:
http://xxx.xxx.xxx.xxx
An international registry of ISP numbers and URL names is maintained by the company, NETWORK SOLUTIONS, INC., which was originally sponsored by the U. S. National Science Foundation.

      On your file-transfer-protocol (FTP) software, this ISP number is also known as the HOSTNAME. The USERID is: turk. For security reasons, I will send you the password in a separate mailing. This is an Internet site that you and I can share in our work together. We will be as close as if we lived in the same city!

      As soon as you have your FTP capability in place, you can start depositing files at the website. The home webpage file is named: index.htm . I have another webpage, a brief description of medical ontologies, in webpage file: whatonto.htm . You can reach this file at URL:
http://www.medparse.com/whatonto.htm


      You should become familiar with the basics of writing RAW HTML, in a text-editor such as NOTEPAD. The reason for learning this is that Internet web programming works by an exchange of HTML files. The initial HTML page that you see on the screen contains a SUBMIT button, which launches a command-sequence to a program in the container, or BIN, for COMMON GATEWAY INTERFACE programs, or CGI-BIN. When you click on the SUBMIT button, the command-sequence launches a CGI program, which, in turn, returns a new HTML page to your browser. You can write a simple PERL program after an afternoon of study.

      The simplest HTML page is the following blank page:
<html> <body> </body> </html>
The PERL command for writing this HTML page back to your browser is:
print " <html> <body> </body> </html> ";
The entire PERL program for writing this HTML page back to your browser is:
#!/usr/bin/perl
 print "Content-type: text/html\n\n";
 print " <html> <body> </body> </html> ";
 exit;
      The next step in writing an HTML page is to give it some content. The HTML page may be divided into a HEAD and BODY. Within the HEAD is the TITLE. Thus:
<html> <head> <title> PLACE TITLE HERE </title> </head>
<body> PLACE CONTENT HERE </body> </html>


      There are a few tips for composing your HTML page:
LINE BREAK IS: <br>
SOLID LINE IS: <hr>
BULLET IS: <li>
BOLDFACE IS: <b> FOLLOWED BY: </b>
ITALIC IS: <i> FOLLOWED BY: </i>
UNDERLINE IS: <u> FOLLOWED BY: </u>
STRIKETHROUGH IS: <s> FOLLOWED BY: </s>
You should find a good HTML book to get the details.
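
      To try these tags from a PERL program, a sketch such as the following may help. The subroutine name wrap_tag is hypothetical; it simply surrounds a piece of text with a paired start-tag and end-tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: wrap text in a paired tag, e.g. <b>...</b>.
sub wrap_tag {
    my ($tag, $text) = @_;
    return "<$tag>$text</$tag>";
}

print "Content-type: text/html\n\n";
print "<html> <body>\n";
print wrap_tag("b", "Boldface"), " <br>\n";     # line break after boldface
print wrap_tag("i", "Italic"), " <br>\n";       # line break after italic
print wrap_tag("u", "Underline"), " <hr>\n";    # solid line after underline
print "</body> </html>\n";
```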

      Borrowing HTML code from other webpages is the best way to add interesting new tricks to your HTML page. That is, surf the net, find some stuff that you like, DISPLAY THE HTML SOURCE CODE on your browser, and borrow the relevant code. Just so that you don't break any copyrights or offend anybody, you should disguise your own code enough so that its source is not immediately recognizable. If you are a real gentleman and scholar, then you could make a hyperlink to the source where you got the interesting coding idea.

      An HTML page is a static entity. In order to dynamically collect information from a web-user, and return an individualized response, the web-user must launch a cgi-program from the initial HTML page. The cgi-program receives the command-sequence launched by the web-user, and composes a new HTML page on-the-fly as its programmed response. For example, the following HTML page launches a command-sequence to the program PERLHLLO.CGI in the cgi-bin subdirectory of the account.
<html> <body>
<form name="sender" method="get" action="http://www.medparse.com/cgi-bin/perlhllo.cgi">
<br><input type="submit" name="bx" value="SUBMIT"> <br>
</form>
</body> </html>
The job for program PERLHLLO.CGI is to intercept this command sequence and return a customized HTML page back to the web-user.
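
      To make the interception step concrete: with method="get", a web server following the CGI convention passes the form data to the program in the environment variable QUERY_STRING, which for the form above would be bx=SUBMIT. The following is a minimal sketch under that assumption, not the actual PERLHLLO.CGI; the subroutine name parse_query is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a query string such as "bx=SUBMIT" into
# name/value pairs, decoding "+" as a space and %XX as character number XX.
sub parse_query {
    my ($query) = @_;
    my %field;
    foreach my $pair (split /&/, $query) {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $value = "" unless defined $value;
        $value =~ tr/+/ /;
        $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
        $field{$name} = $value;
    }
    return %field;
}

# Read the command-sequence; it is empty when the script is run by hand.
my %field = parse_query(defined $ENV{QUERY_STRING} ? $ENV{QUERY_STRING} : "");
my $button = exists $field{bx} ? $field{bx} : "(nothing)";
$button =~ s/[^A-Za-z0-9 ().]//g;   # keep only harmless characters before echoing

# Return a customized HTML page to the web-user.
print "Content-type: text/html\n\n";
print "<html> <body> You pressed: $button <br> </body> </html>\n";
```

      Stripping unexpected characters before echoing the value is a precaution, since anything typed into a URL can reach the script.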



7. FILE TRANSFER PROTOCOL.



      All necessary files for FTP and PERL are stored in BINARY mode in the \cgi-bin subdirectory. I suggest that you make a parallel C:\cgi-bin subdirectory in your personal computer hard drive, and transfer the following PERL and WS_FTP files in BINARY MODE:
 PERL.EXE
 PERL100.DLL
 PERLGLOB.EXE
 PERLHLLO.CGI

 WS_FTP.EXE
 WS_FTP.EXT
 WS_FTP.INI
 WS_FTP.HLP
 WS_FTP.TXT
The executable files (PERL.EXE and WS_FTP.EXE) only function in an MS-DOS directory, with WINDOWS 95 or WINDOWS 98 running. (WINDOWS 3.x does not have all the requisite DLL files.) The four PERL*.* files allow you to execute PERL5 programs in MS-DOS. The TURK account has its own PERL5 interpreter, which executes PERL5 programs in its UNIX environment. The five WS_FTP files comprise a file-transfer program. As a first step, download these files into an MS-DOS directory (I suggest c:\turk), with WINDOWS 95 or WINDOWS 98 running.

      THE FIRST THING THAT YOU ABSOLUTELY NEED TO UNDERSTAND, and which will drive you crazy if you don't keep track of correctly, is that the LINE DELIMITER in MS-DOS (i.e., the operating system of your personal computer) is CARRIAGERETURNLINEFEED, i.e., ASCII 13 followed by ASCII 10; whereas the line delimiter in UNIX (i.e., the operating system of the ISP account, and most other inexpensive Internet accounts) is LINEFEED ONLY, i.e., ASCII 10 only.

      Whenever you transfer a file between MS-DOS and UNIX, you MUST pay attention to this difference. You have two options: you can either transfer files in ASCII MODE or in BINARY MODE. In BINARY MODE, the file is transferred in either direction byte-by-byte, with absolutely no modification. In ASCII MODE, the file is modified during transfer, as follows: from MS-DOS to UNIX, the extraneous CARRIAGERETURNs are stripped; from UNIX to MS-DOS, each LINEFEED in UNIX is supplemented with a preceding CARRIAGERETURN. If you make one of these file-transfers incorrectly, the resulting file will be unreadable or unexecutable or both. Since UNIX preceded MS-DOS historically, and is more widely available to the world at large through the Internet, Microsoft is to blame for introducing this fundamental incompatibility in its MS-DOS operating system. However, we are now stuck with this incompatibility forever.
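
      The modification made in ASCII MODE can be imitated in a few lines of PERL, which may help when a file has been transferred in the wrong mode. This is a sketch of the idea only, not of how any FTP program is implemented; the subroutine names dos_to_unix and unix_to_dos are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: MS-DOS to UNIX, strip the CARRIAGERETURN (ASCII 13)
# from each CARRIAGERETURNLINEFEED pair, leaving LINEFEED (ASCII 10) only.
sub dos_to_unix {
    my ($text) = @_;
    $text =~ s/\x0D\x0A/\x0A/g;
    return $text;
}

# Hypothetical helper: UNIX to MS-DOS, put a CARRIAGERETURN in front of
# each LINEFEED that does not already have one.
sub unix_to_dos {
    my ($text) = @_;
    $text =~ s/(?<!\x0D)\x0A/\x0D\x0A/g;
    return $text;
}

my $dos  = "line one\x0D\x0Aline two\x0D\x0A";
my $unix = dos_to_unix($dos);
print "MS-DOS length: ", length($dos), "\n";
print "UNIX length:   ", length($unix), "\n";
```

      When applying these subroutines to real files under MS-DOS, the filehandles should be put into binmode, so that PERL itself does not translate the line delimiters a second time.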

      I keep track of my file-transfers as follows: all *.HTM and *.CGI files are transferred in ASCII MODE. All other files are transferred in BINARY MODE. Furthermore, I name all my files with a mainname of at most eight characters and an extensionname of exactly three letters, all lower case. This way, there are no name incompatibilities between operating systems. I never deviate from these rules, so I never forget what I'm doing.
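
      These rules are simple enough to check mechanically. The PERL sketch below is one possible check; the subroutine names is_safe_name and transfer_mode are hypothetical, and allowing digits and underscores in the mainname is an assumption not stated in the rules above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if the mainname has 1 to 8 characters and the
# extensionname has exactly 3 letters, all lower case.
sub is_safe_name {
    my ($name) = @_;
    return $name =~ /^[a-z0-9_]{1,8}\.[a-z]{3}$/ ? 1 : 0;
}

# Hypothetical helper: *.HTM and *.CGI go in ASCII MODE, all others in BINARY MODE.
sub transfer_mode {
    my ($name) = @_;
    return $name =~ /\.(htm|cgi)$/i ? "ASCII" : "BINARY";
}

foreach my $name (qw(index.htm perlhllo.cgi perl.exe LongFileName.html)) {
    print "${name}: ", (is_safe_name($name) ? "name OK" : "rename"),
          ", transfer in ", transfer_mode($name), " MODE\n";
}
```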



8. PERL.



      PERL (PRACTICAL EXTRACTION AND REPORTING LANGUAGE) is the hottest programming language on the Internet. It is easy to learn a little bit of PERL, once you have launched your first program. PERL is the best scripting-language for handling text-strings, text-files, and for collecting text-information from websites. The most important property of a PERL script is that it can be launched from one HTML page and return a response in the form of another HTML page.

      The most efficient strategy for writing PERL programs is to debug the PERL program in your local MS-DOS environment, and then move the program into the CGI-BIN subdirectory for final testing. DON'T FORGET: YOU SHOULD ALWAYS MOVE PERL/CGI PROGRAMS IN ASCII MODE. I suggest writing the following program in NOTEPAD in your MS-DOS environment, as file PERLHLLO.CGI
#!/usr/bin/perl
 print "Content-type: text/html\n\n";
 print "<html><head><title> Hello, World. ";
 print "</title></head>\n";
 print "<body><center> Hello, World. </center><br></body></html>\n";
 exit;
You may test program PERLHLLO.CGI in the c:\cgi-bin directory in your MS-DOS environment, with the command:
perl PERLHLLO.CGI
The program will print the Content-type line, a blank line, and then an HTML-ready file as follows:
<html><head><title> Hello, World. </title></head>
<body><center> Hello, World. </center><br></body></html>


      The next step is to move program PERLHLLO.CGI into the CGI-BIN subdirectory on the Internet for final testing. I will tell you how to do it with WS_FTP, although any File Transfer Protocol (FTP) program is OK. However, I only know how to launch UNIX security commands in WS_FTP, so if you use a different FTP program, you'll have to figure out how to do this yourself. Once the program is in place, the following HTML page launches it:
<html><head><title> Launch Program. </title></head>
<body><center> Launch Program.
<form name="sender" method="get"
action="http://www.medparse.com/cgi-bin/perlhllo.cgi">
<input type="submit" name="bx" value="SUBMIT"></form>
</center><br></body></html>
The same result is obtained at URL:
http://www.medparse.com/cgi-bin/perlhllo.cgi


      The WORLDWIDE WEB (WWW) consists of a decentralized collection of STATIC WEBPAGES, each residing on a SERVER, supported by an INTERNET SERVICE PROVIDER (ISP). Ordinary users may rent basic services from an ISP in the Baltimore/Washington area for as little as $15 per month.

      Each webpage has a unique name, or UNIFORM RESOURCE LOCATOR (URL), consisting of numbers and letters and some punctuation marks. A single webpage displays a fixed body of textual, image, and audio information. In this discussion, the webpage is written in the HYPERTEXT MARKUP LANGUAGE (HTML). Within very narrow limits, a static webpage does not process information entered by the user. The user may seek additional information by LINKING to another static webpage. Alternatively, the user may enter data or commands into a static webpage, and LAUNCH these data, typically by clicking a SUBMIT BUTTON. The launch data are sent across the COMMON GATEWAY INTERFACE (CGI) to a computer program, or COMPUTER SCRIPT. The computer script processes the launch data, and returns a new webpage, written in HTML, to the original user. For the discussion that follows, all computer scripts are written in the PRACTICAL EXTRACTION AND REPORTING LANGUAGE (PERL).

      Thus, the entire functionality of the worldwide web can be viewed as static webpages which either link to one another, or else launch data that create new webpages. This worldwide, interlocking system of information packages was first envisioned by VANNEVAR BUSH, a U. S. scientist and futurist, active in the 1940s at the dawn of the computer age.

      This report will be confined to webpages written in HTML and computer scripts written in PERL. We will walk through a set of instructional HTML pages and PERL scripts which may be visited directly in this email. For each example, say perlxxxx, the webpage may be viewed by clicking on URL:
http://www.medparse.com/cgi-bin/perlxxxx.cgi
You may examine the HTML document by clicking VIEW SOURCE DOCUMENT on your browser. You may view the corresponding PERL script at URL:
http://www.medparse.com/perlxxxx.cgi
The following instructional PERL scripts are available.
 perlxxxx.cgi
 perlnull.cgi
 perlread.cgi
 perlwrit.cgi
 perljoin.cgi
 perlsplt.cgi
 perlrado.cgi
 perltext.cgi
 perlchkb.cgi
 perlwhil.cgi
 perlasci.cgi
 perlasoc.cgi
 perlaray.cgi
 perltime.cgi
 perlescp.cgi
      Each example-script is designed to illustrate a particular feature of HTML or PERL.

      An HTML file consists of text, images, and audio, whose size, positions, fonts, colors, etc., are delineated by TAGS. Typically, these tags are PAIRED, i.e., the functionality begins with a START-TAG and concludes with an END-TAG. A start-tag consists of <...> where ... is the TAGNAME. An end-tag consists of a matching </...> Some tags are SINGLE.

      The simplest HTML file is:
<html>...</html>
This file consists of the HTML START-TAG and the HTML END-TAG.

      Typically, an HTML file contains a HEAD and a BODY. The head contains the title, optional indexing information, and sometimes a small program that controls graphical functions on the page. The BODY contains the main textual and image information of the document. Thus, most HTML documents have the following stereotypical structure:
<html> <head> <title> PLACE YOUR TITLE HERE. </title> </head> <body> PLACE YOUR TEXT HERE. </body> </html>
The entire PERL script that generates this HTML page is:
#!/usr/bin/perl
 print "Content-type: text/html\n\n";
 print " <html> <head> ";
 print " <title> PLACE YOUR TITLE HERE. </title> </head> ";
 print " <body> PLACE YOUR TEXT HERE. </body> </html> ";
 exit;
You may display the webpage generated by this PERL script by clicking on URL:
http://www.medparse.com/cgi-bin/perlxxxx.cgi




Last Updated: October 17, 2001, by G. William Moore, MD, PhD.