UNIQUENESS OF MEDICAL DATA MINING.
DRAFT COPY ONLY.
8/29/2009.
Krzysztof J Cios, PhD. [1]
G. William Moore, MD, PhD. [2,3,4]
http://www.netautopsy.org/uniqmddm.htm



From the Department of Computer Science and Engineering, University of Colorado at Denver, Campus Box 109, 1200 Larimer Street, Denver, CO 80217-3364 [1]; Pathology and Laboratory Medicine Service, Veterans Affairs Maryland Health Care System, Baltimore, Maryland [2]; Department of Pathology, University of Maryland Medical System, Baltimore, Maryland [3]; and Department of Pathology, The Johns Hopkins Medical Institutions, Baltimore, Maryland [4].

Cios KJ, Moore GW.
Uniqueness of medical data mining.
Artif Intell Med. 2002 Sep-Oct;26(1-2):1-24.
PMID: 12234714.
Full Text of Article: http://www.netautopsy.org/uniqmddm.htm
Last tested: August 29, 2009.

Send comments and correspondence to: krys.cios@cudenver.edu, George.Moore4@va.gov


Related Publications:
G. William Moore, MD, PhD. Curriculum Vitae: http://www.netautopsy.org/gwmcv.htm
Anatomic Pathology Data Mining: http://www.netautopsy.org/apdmchap.htm
Uniqueness, Medical Data Mining: http://www.netautopsy.org/uniqmddm.htm
Anatomic Pathology Procedure Manual: http://www.netautopsy.org/axsop/axsop.htm
Autopsy Index Procedure Manual, incl. Free Perl Open-Source Code: http://www.netautopsy.org/axsop/jharispm.htm
Data Mining Medical Time Events: http://www.netautopsy.org/timemine.htm
Data Mining Medical Time Events, Free Perl Open-Source Code: http://www.netautopsy.org/timeperl.htm
Data Mining Medical Time Events, Procedure Manual: http://www.netautopsy.org/timeproc.htm
Data Mining Medical Time Events, Spreadsheet: http://www.netautopsy.org/timemine.xls
SNOMED Translator: http://www.netautopsy.org/autocode.htm
SNOMED Epidemiology Tool: http://www.netautopsy.org/snomedsp.htm
Johns Hopkins Autopsy Resource: http://www.netautopsy.org/protoiad.htm
Natural Language Processing: http://www.netautopsy.org/natlngpr.htm
Modal Logic, Medical Ethics: http://www.netautopsy.org/modlthry.htm
Modal Logic, Medical Ethics, Free Perl Open-Source Code: http://www.netautopsy.org/modllive.htm
Data Mining, Anatomic Pathology: http://www.netautopsy.org/apdmchap.htm
Surgical Pathology Linguistic Inventory: http://www.netautopsy.org/vhpsapsx.htm
Spreadsheet Order Logic: http://www.netautopsy.org/ordrlogc.htm
Resource Description Framework (RDF), Pathology Images: http://www.julesberman.info/spec2img.htm
Resource Description Framework (RDF), Specimen Common Data Elements: http://www.netautopsy.org/scderdfh.htm
Resource Description Framework (RDF), Mucosal Surfaces: http://www.netautopsy.org/mucordfh.htm
Resource Description Framework (RDF), Lymphoid Infiltrates: http://www.netautopsy.org/lympapaa.htm
Proof in Pathology: http://www.netautopsy.org/apihonfl.htm
Automated Edge Detection, Pathology Images: http://www.netautopsy.org/ascpedge.htm
Fractal Dimensions in Pathology: http://www.netautopsy.org/ascpfrac.htm
Image Segmentation and Analysis, incl. Free Visual Basic Open Source Code: http://www.netautopsy.org/ascpisap.htm
Basal Cell Carcinoma, Histologic Discontinuities: http://www.netautopsy.org/basalcel.htm
DNA Analysis, Cardiac Myxoma: http://www.netautopsy.org/camyxoma.htm
Cell Death, Preneoplasia: http://www.netautopsy.org/celdeath.htm
Clear Cell Dysplasia, Bladder: http://www.netautopsy.org/clearcel.htm
Maintaining Patient Confidentiality: http://www.netautopsy.org/confiden.htm
Elevated PSA, African-American Males: http://www.netautopsy.org/elevpsal.htm
Elevated PSA, African-American Males: http://www.netautopsy.org/zmopapsa.htm
Bibliography, Staged Human Embryos: http://www.netautopsy.org/embrbibl.htm
Image Segmentation, Analysis, Pathology (ISAP): http://www.netautopsy.org/isapwlcm.htm
Johns Hopkins Autopsy Resource: http://www.netautopsy.org
Bibliography, Johns Hopkins Autopsy Resource: http://www.netautopsy.org/jharpubl.htm
Autopsy Report Words, Johns Hopkins Autopsy Resource: http://www.netautopsy.org/jharaurw.htm
Zipf Distribution, Johns Hopkins Autopsy Resource: http://www.netautopsy.org/jharzipf.htm
DNA Flow Cytometry, Keratoacanthoma: http://www.netautopsy.org/keratflw.htm
Dysplasia, Atypical Liver Nodule: http://www.netautopsy.org/lvrdyspl.htm
Dysplasia, Atypical Liver Nodule, Letter: http://www.netautopsy.org/livelet.htm
Cell Simulation, Polyclonal Tumors: http://www.netautopsy.org/monoclon.htm
SNOMED-Encoded Surgical Pathology Databases: http://www.netautopsy.org/snomedsp.htm
Practice Guidelines, Autopsy Pathology: http://www.netautopsy.org/pracguid.htm
Internet Autopsy Database: http://www.netautopsy.org/protoiad.htm
Internet-based Quality Improvement: http://www.netautopsy.org/qimpmopa.htm
Spontaneous Regression, Preneoplasia, Simulation: http://www.netautopsy.org/sponregr.htm
Unfunded Research, Pathologists, Internists, Surgeons: http://www.netautopsy.org/unfunded.htm
Tumor classification, molecular analysis: PubMed Entry
Tumor taxonomy: PubMed Entry
Developmental Neoplasm Lineage: http://www.julesberman.info/devclass.htm
Biomedical Informatics: http://www.jbpub.com/catalog/9780763741358/
Perl Programming: http://www.jbpub.com/catalog/9780763743338/
Perl Programming: http://www.jbpub.com/catalog/9780763757588/
Ruby Programming: http://www.jbpub.com/catalog/9780763750909/
Ruby Programming: http://www.jbpub.com/catalog/9780763757571/
Neoplasms, Development, Diversity: http://www.jbpub.com/catalog/9780763755706/
Precancer: http://www.jbpub.com/catalog/9780763777845/


0. DISCLAIMER.

DISCLAIMER. United States Government Work, uncopyrighted, public-domain, DRAFT COPY ONLY. This document does not necessarily represent the views or policies of any United States Government agency. This document is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and non-infringement. In no event shall the authors be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of, or in connection with the document or the use or other dealings made with the document.


1. ABSTRACT.

This article addresses the special features of data mining with medical data. Researchers in other fields may not be aware of the particular constraints and difficulties of the privacy-sensitive, heterogeneous, but voluminous data of medicine. Ethical and legal aspects of medical data mining are discussed, including data ownership, fear of lawsuits, expected benefits, and special administrative issues. The mathematical understanding of estimation and hypothesis formation in medical data may be fundamentally different from that in other data collection activities. Medicine is primarily a patient-care activity, and serves only secondarily as a research resource; almost the only justification for collecting medical data is to benefit the individual patient. Finally, medical data have a special status based upon their applicability to all people; their urgency (including life-or-death); and a moral obligation to be used for beneficial purposes.


2. INTRODUCTION.

Human medical data are at once the most rewarding and difficult of all biological data to mine and analyze. Humans are the most closely-watched species on earth. They can relate detailed histories of past traumas, pain, discomfort, hallucinations, etc., that are difficult or impossible to collect from animal models. Some three-quarters of a billion persons living in North America and Europe have at least some of their medical information collected in electronic form, at least transiently. These subjects generate volumes of data that an animal experimentalist can only dream of. On the other hand, there are ethical, legal, and social constraints on data collection and distribution that do not apply to non-human species, and that limit the intellectual conclusions that may be drawn.

Medical data are heavily image-based and text-based; the datasets are heterogeneous, replete with missing values; medicine lacks a formal, mathematical structure for organizing data; and there are significant issues of ownership, fairness, and privacy. As computers and the Internet increase our abilities to create and distribute human data, there is an open challenge to manage and exploit this incredible research potential for human betterment.

The major points of uniqueness of medical data may be organized under four general headings:
HETEROGENEITY OF MEDICAL DATA.
ETHICAL/LEGAL/SOCIAL ISSUES.
STATISTICAL PHILOSOPHY.
SPECIAL STATUS OF MEDICINE.


3. HETEROGENEITY OF MEDICAL DATA.



COMPLEXITY OF MEDICAL DATA.
PHYSICIAN'S INTERPRETATION.
RECALL/PRECISION MEASURES.
POOR MATHEMATICAL CHARACTERIZATION.


4. ETHICAL/LEGAL/SOCIAL ISSUES.



Because medical data are collected on human subjects, there is an enormous ethical and legal tradition designed to prevent abuse of patients and misuse of data.
DATA OWNERSHIP.
FEAR OF LAWSUITS.
CONFIDENTIALITY OF HUMAN DATA.
ENCRYPTION.
MINIMAL RISK.
TYPES OF IDENTIFICATION: ANONYMOUS, ANONYMIZED, DE-IDENTIFIED, IDENTIFIED.
NEED FOR IDENTIFICATION: LOOKBACK POSSIBLE. PREVENT DUPLICATES.
EXPECTED BENEFITS.
TWO YEAR WINDOW.
SAUL LEGAL PAPER. ADMINISTRATIVE ISSUES.


5. STATISTICAL PHILOSOPHY.


      There is an emerging doctrine that data mining methods themselves, especially statistics, and the basic assumptions underlying these methods, may be fundamentally different for medical data.
MYSTERIES OF STATISTICS.
BOSON/FERMION.
AMBUSH STATISTICS.
NATURAL HISTORY OF DISEASE.
FAIRNESS.
CANONICAL FORM ISSUE.
MACHINE TRANSLATION, SPOILED MEAT.
DATA MINING AS A SUPERSET OF STATISTICS.


6. SPECIAL STATUS OF MEDICINE.


      Medicine has a special status in Western science, philosophy, and daily life. The outcomes of medicine are life-or-death, and they apply to everybody. Medicine is more than a pleasure or a convenience.

Among all the professions, medicine has the longest apprenticeship. Most medical specialists in the USA require at least eleven years of training after high school graduation, and some surgical subspecialties require up to sixteen. In the USA, medical care costs consume one-seventh of the Gross Domestic Product. Physicians represent about 0.2% of the U. S. population; their incomes are in the top 2%; and the average physician causes seven times his/her income to be spent on services ordered.

The average citizen has high expectations of medicine and its practitioners. A sick person is expected to recover. Physicians are expected to be ethical, caring, and not too greedy. Medicine is a popular subject for television dramas. When medicine fails, the desire for legal revenge is intense and punitive. Medical information about the individual patient is considered highly private, and the general public is extremely fearful about disclosure [refs.]. We all want the benefits of medical research conducted on other patients, but we are cautious about letting our own information out for similar purposes. When medical data are published, it is expected that they will maintain the dignity of the individual patient, and that they will be used for socially beneficial purposes [refs.].

It has been suggested that scientific truths are fundamentally amoral; they can be used for good or evil [refs.]. Yet although Western medicine is based upon science, there are certain tests that may not be performed, and certain conclusions that may not be drawn, because of medicine's special status. As we shall see in this report, this view pervades our attitudes about research in medical science, as well as our attitudes about medical diagnosis and treatment.


7. COMPLEXITY OF MEDICAL DATA.

More and more medical procedures employ imaging as a preferred diagnostic tool. Thus, there is a need to develop methods for efficient mining in databases of images, which is not only different, but also more difficult, than mining in numerical databases. As an example, imaging techniques like SPECT, MRI, PET, and collection of ECG or EEG signals, can generate gigabytes of data per day. A single cardiac SPECT procedure of one patient may contain dozens of 2D images. In addition, medical databases are always heterogeneous. For instance, an image of the patient's organ will almost always be accompanied by other clinical information, as well as the physician's interpretation (clinical impression; diagnosis). This heterogeneity requires high capacity data storage devices, as well as new tools to analyze such data. It is obviously very difficult for an unaided human to process gigabytes of records, although dealing with images is relatively easy for humans, who are able to recognize patterns, grasp basic trends in data, and form rational decisions. The information becomes less useful as we are faced with difficulties of retrieving it, and making it available in an easily comprehensible format. Visualization techniques will play an increasing role in this setting, since images are the easiest for humans to comprehend, and they can provide a great deal of information in a single visualization of the results.


8. IMPORTANCE OF PHYSICIAN'S INTERPRETATION.

The physician's interpretation of images, signals, or any other clinical data, is written in unstructured free-text English, that is very difficult to standardize and thus difficult to mine. Even specialists from the same discipline cannot agree on unambiguous terms to be used in describing a patient's condition. Not only do they use different names (synonyms) to describe the same disease, but they render the task even more daunting by using different grammatical constructions to describe relationships among medical entities.


9. DATA OWNERSHIP.

Data ownership is an open question in medical data mining. The corpus of medical data potentially available for mining is enormous. Some thousands of terabytes (quadrillions of bytes) are now generated annually in North America and Europe. However, these data are buried in heterogeneous databases, and scattered throughout the medical care establishment, without any common format or principles of organization. The question of ownership of patient information is unsettled, and the object of recurrent, highly-publicized lawsuits and congressional inquiries. Do individual patients own data collected on themselves? Do their physicians own the data? Do their insurance providers own the data? Some HMOs now refuse to pay for patient participation in clinical treatment protocols that are deemed experimental. If insurance providers do not own their insurees' data, can they refuse to pay for the collection and storage of the data? Some might argue that the ownership of human data, and therefore the ability to process and sell such data, is unseemly. If so, then how should the data managers, who organize and mine the data, be compensated? Or should this incredibly rich resource for the potential betterment of humankind be left unmined?


10. FEAR OF LAWSUITS.
A fourth unique feature of medical data mining is a fear of lawsuits directed against physicians and other medical providers. Medical care in the U.S.A., for those who can afford it, is the best in the world. However, U. S. medical care is some 30% more expensive than that elsewhere in North America and Europe, where quality is comparable; and U. S. medicine also has the most litigious malpractice climate in the world. Some have argued that the 30% surcharge on U. S. medical care, about one thousand dollars per capita annually, is mostly medicolegal: either direct legal costs, or else the overhead of "defensive medicine", i.e., unnecessary tests ordered by physicians to cover themselves in potential future lawsuits. In this tense climate, physicians and other medical data-producers are understandably reluctant to hand over their data to data miners. Data miners could browse these data for untoward events. Apparent anomalies in the medical history of an individual patient might trigger an investigation. In many cases, the appearance of malpractice might be a data-omission or data-transcription error; and not all bad outcomes in medicine are necessarily the result of negligent provider behavior. However, an investigation inevitably consumes the time and emotional energy of medical providers. For exposing themselves to this risk, what reward do the providers receive in return?


11. PRIVACY AND SECURITY OF HUMAN DATA.


      A fifth feature is privacy and security concerns. For instance, pending U. S. Federal rules suggest that there should be unrestricted usage of medical data of patients deceased for at least two years; but for live patients, all data must be encrypted, so that it is impossible to identify a person, or to go back and decrypt the data. At stake is not only a potential breach of patient confidentiality, with the possibility of ensuing legal action; but also erosion of the physician-patient relationship, in which the patient is extraordinarily candid with the physician in the expectation that such private information will never be made public. Thus, the encryption of data must be irreversible. A related privacy issue may apply if, for example, crucial diagnostic information were to be discovered on live-patient data, such that a patient could be treated if we could only go back and inform the patient about the diagnosis and possible cure. According to the Health Insurance Portability and Accountability Act of 1996 (HIPAA) legislation, this is unfortunately not possible. Another issue is data security in data handling, and particularly in data transfer. Before the data are encrypted, only authorized persons should have access to the data. Since transferring the data electronically via the Internet is insecure, the data must be carefully encrypted even for transfers within one medical institution from one unit to another.


12. POOR MATHEMATICAL CHARACTERIZATION OF MEDICAL DATA.
A sixth unique feature of medical data mining is that the underlying data structures of medicine are poorly characterized mathematically, as compared to many areas of the physical sciences. Physical scientists collect data which they can substitute into formulas, equations, and models that reasonably reflect the relationships among their data. On the other hand, the conceptual structure of medicine consists of word-descriptions and pictures, with very few formal constraints on the vocabulary, the composition of images, or the allowable relationships among basic concepts. The fundamental entities of medicine, such as inflammation, ischemia, or neoplasia, are just as real to a physician as entities such as mass, length, or force are to a physical scientist; but medicine has no comparable formal structure into which a data miner can organize information, such as might be modeled by clustering, regression models, or sequence analysis. In its defense, medicine must contend with hundreds of distinct anatomic locations and thousands of diseases. Until now, the sheer magnitude of this concept space was insurmountable. Furthermore, there is some suggestion that the logic of medicine may be fundamentally different from the logic of the physical sciences (Moore et al, 1979; Moore et al, 1979; Moore and Hutchins, 1980, 1981a, b). However, it may now happen that fast computers and the newer tools of data mining and knowledge discovery may overcome this prior obstacle.


13. HUMAN MEDICINE NOT PRIMARILY A RESEARCH RESOURCE.
Human medicine is primarily a patient-care activity, and serves only secondarily as a research resource. Generally the only justification for collecting data in medicine, or refusal to collect certain data, is to benefit the individual patient. Some patients might consent to be involved in research projects that do not benefit them directly, but such data collection is typically very small-scale, narrowly focused, and highly regulated by legal and ethical considerations. On the other hand, humans are the most closely-studied species in the world, and enormous quantities of data are generated as an incidental byproduct of patient care. In Europe and North America, much of this information is now primarily generated on electronic media. These data include observations that cannot be gained from animal studies, such as visual and auditory sensations, the perception of pain, and recollection of possibly relevant prior traumas and exposures. By contrast, most animal studies are short-term, and therefore cannot track long-term disease processes of medical interest, such as preneoplasia or atherosclerosis. Furthermore, most animal studies are small-scale, and thus cannot detect or follow rare disease processes. Finally, with human data there is no issue of having to extrapolate animal observations to the human species.


14. RECALL AND PRECISION.
Nearly all diagnoses and treatments in medicine are imprecise, and are subject to rates of error. The usual paradigm for measuring this error in medicine is SENSITIVITY and SPECIFICITY.

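      SENSITIVITY measures how often you find what you are looking for: the fraction of gold-standard-positive cases that the test reports as positive. It is associated with a variety of related terms: FALSE-NEGATIVE. RECALL. TYPE II ERROR. BETA ERROR. ERROR OF OMISSION. ALTERNATIVE HYPOTHESIS.

      SPECIFICITY measures how often you correctly rule out what you are not looking for: the fraction of gold-standard-negative cases that the test reports as negative. It is associated with a variety of related terms: FALSE-POSITIVE. TYPE I ERROR. ALPHA ERROR. ERROR OF COMMISSION. NULL HYPOTHESIS. PRECISION, which measures how often what you find is what you are looking for, is also governed by false positives, but strictly it is a different quantity from specificity.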

      In medical SENSITIVITY/SPECIFICITY studies, there is a TEST-UNDER-STUDY and a GOLD-STANDARD, as follows:

                 TEST-UNDER-STUDY:
                    | YES |  NO |
               ___________________
                YES |  TP |  FN |
GOLD-STANDARD: ___________________
                NO  |  FP |  TN |
               ___________________
where TP = TRUE POSITIVE; TN = TRUE NEGATIVE; FP = FALSE POSITIVE; and FN = FALSE NEGATIVE.
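
The quantities in the table can be turned into rates with a few lines of arithmetic. The following is a minimal sketch in Python; the function names and the counts are invented for illustration and do not come from any study.

```python
def sensitivity(tp, fn):
    """Fraction of gold-standard-positive cases that the test reports as positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of gold-standard-negative cases that the test reports as negative."""
    return tn / (tn + fp)


def precision(tp, fp):
    """Fraction of test-positive cases that are gold-standard-positive."""
    return tp / (tp + fp)


# Hypothetical counts for a 2x2 table (assumption, illustration only).
tp, fn, fp, tn = 90, 10, 30, 870

print("sensitivity (recall):", sensitivity(tp, fn))
print("specificity:", specificity(tn, fp))
print("precision:", precision(tp, fp))
```

Note that sensitivity and specificity are each computed within one row of the table (one gold-standard class), whereas precision is computed within the test-positive column, so precision changes with the prevalence of disease in the study population.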

      Often, the test-under-study is a proposed, inexpensive new test, whereas the gold-standard is either a more expensive test, regarded as definitive, or else a COMPLETE MEDICAL WORKUP of the patient, regarded as definitive. Sometimes, this complete medical workup may be slightly different for each patient, and therefore considered subjective by some referees. In medical settings, the question must be formulated as a yes-no question, which is sometimes fairly challenging, considering all the twists-and-turns in a medical history.

      Reluctance to use these scientific measures of error in medicine is due to many factors: the expectation that the results will not look very good, and will therefore not be publishable; the difficulty in formulating the YES/NO questions to be tested; and the burdensome, expensive, and sometimes imprecise process of evaluating the GOLD STANDARD for each case in the study.

MYSTERIES OF STATISTICS.


      When I was a college undergraduate, first introduced to the mysteries of statistics, two things puzzled me in a very fundamental way: LEAST SQUARES and the NULL HYPOTHESIS. The convoluted reasoning for justifying statistical arguments involving these two concepts never entirely satisfied me as a graduate student, and satisfies me even less in the medical applications where they are used.

      LEAST SQUARES. Why not LEAST CUBES? Why not LEAST ABSOLUTE-VALUES? The answer, as far as I can tell, is based not upon some deep understanding of distributions of biomedical data, but rather upon the mathematical convenience of taking a partial first derivative of sums-of-squares, as compared to sums-of-cubes or sums-of-absolute-values.
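
The choice of criterion is not cosmetic: it changes the answer. As a small sketch (with invented numbers and hypothetical function names), the value that minimizes the sum of squared deviations of a sample is its mean, whereas the value that minimizes the sum of absolute deviations is its median, and the two can differ greatly when one observation is extreme.

```python
from statistics import mean, median


def sum_of_squares(center, values):
    """Least-squares criterion for a single summary value."""
    return sum((v - center) ** 2 for v in values)


def sum_of_absolute_values(center, values):
    """Least-absolute-values criterion for a single summary value."""
    return sum(abs(v - center) for v in values)


# Hypothetical sample with one extreme observation (illustration only).
values = [1.0, 2.0, 3.0, 4.0, 100.0]
m, med = mean(values), median(values)

print("mean:", m, "median:", med)
print("sum of squares at mean / median:",
      sum_of_squares(m, values), sum_of_squares(med, values))
print("sum of absolute values at mean / median:",
      sum_of_absolute_values(m, values), sum_of_absolute_values(med, values))
```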

      NULL HYPOTHESIS. Why not test ACCEPTING THE ALTERNATIVE HYPOTHESIS, rather than going through the logical contortions of REJECTING THE NULL HYPOTHESIS? I find this standard formulation in statistics very difficult to explain to my medical colleagues who ask me to interpret scientific papers containing t-tests, analysis of variance, etc. The simple answer is that there is only one null hypothesis, but an infinity of alternative hypotheses. However, this still represents a mathematical convenience, rather than something that makes sense to physicians.

BOSON/FERMION CONCEPT.


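      In particle physics, FERMIONS are particles of which at most one can occupy a given state; in this sense a FERMION has a cardinality of ONE. Any number of BOSONS may occupy the same state, so a BOSON may have a multiple cardinality.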

      The trick that I am proposing for public medical databases which must pass federal de-identification guidelines is that each record in a public database MAY CORRESPOND TO UP TO TEN INDIVIDUAL PATIENTS. This means that, even if you guess that a particular public record might correspond to a particular patient, you still do not know whether this record corresponds to other patients, as well. The identity of a given record and whether this record corresponds to more than one patient, can only be obtained through negotiation with the contributing institution's Institutional Review Board (IRB). I use this trick for the Johns Hopkins Autopsy Resource.

      The PROBLEM with this trick is that many statistical calculations, where SIGNIFICANCE is based, in part, on SAMPLE SIZE, must be reformulated at a fundamental level. This could provide employment for idle theoretical statisticians!
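
One way to see the difficulty is to ask what the sample size actually is. The sketch below is only one naive reading of the problem, under the assumption that every patient behind a record shares that record's values; the function names and the numbers are hypothetical. It shows that the same public database supports a tenfold range of possible patient counts, and therefore a range of standard errors for an observed proportion.

```python
import math

MAX_PATIENTS_PER_RECORD = 10  # from the text: a record may stand for up to ten patients


def patient_count_bounds(n_records, max_per_record=MAX_PATIENTS_PER_RECORD):
    """Smallest and largest number of patients consistent with n_records."""
    return n_records, n_records * max_per_record


def proportion_standard_error(p, n):
    """Usual binomial standard error of a proportion p observed in n subjects."""
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical database: 400 public records, 20% carrying some diagnosis.
n_min, n_max = patient_count_bounds(400)
print("patients:", n_min, "to", n_max)
print("standard error if n = n_min:", proportion_standard_error(0.2, n_min))
print("standard error if n = n_max:", proportion_standard_error(0.2, n_max))
```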

TWO-YEAR WINDOW ON LEGAL COMPLIANCE.


      According to now-pending guidelines of the NATIONAL BIOETHICS ADVISORY COMMISSION (NBAC), as soon as guidelines for research use of DE-IDENTIFIED MEDICAL INFORMATION (DIMI) are finalized, there will be a two-year window for academic institutions to justify their procedures, and bring them into compliance. The clock is ticking. Let's get busy!

AMBUSH STATISTICS.


      The usual statistical tests used in medicine are based upon a concept of AMBUSH. That is, in theory, you set up a clinical trial with a pre-determined null hypothesis and PREDETERMINED SAMPLE SIZE, and run the trial until the sample size is reached. (I know, I know, many people cheat.) You are not allowed to interrupt the study if you seem to have reached the numerical value for statistical significance before the PREDETERMINED SAMPLE SIZE is reached. This is based upon A PRIORI, as opposed to A POSTERIORI, reasoning.
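
The reason for the rule can be illustrated by simulation. The following sketch assumes a deliberately simple setting that is not taken from any real trial: observations are standard normal with no true effect, significance is judged by a two-sided z-test at the conventional 1.96 cutoff, and all parameter values are arbitrary. It compares testing once at the predetermined sample size with testing after every new observation and stopping at the first significant look.

```python
import math
import random

Z_CRITICAL = 1.96  # conventional two-sided 5% cutoff for a z-test


def false_positive_rates(n_trials=2000, n_max=100, first_look=10, seed=0):
    """Simulate null-effect trials; return (fixed-size rate, peeking rate)."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_trials):
        total = 0.0
        significant_at_some_look = False
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)
            z = abs(total) / math.sqrt(n)
            if n >= first_look and z > Z_CRITICAL:
                significant_at_some_look = True
        # z now holds the statistic at the predetermined sample size n_max.
        if z > Z_CRITICAL:
            fixed_hits += 1
        if significant_at_some_look:
            peeking_hits += 1
    return fixed_hits / n_trials, peeking_hits / n_trials


fixed_rate, peeking_rate = false_positive_rates()
print("false-positive rate, single test at predetermined size:", fixed_rate)
print("false-positive rate, stopping at first significant look:", peeking_rate)
```

Because the final look is included among the interim looks, the peeking rate can never be lower than the fixed-size rate; in this kind of simulation it is typically several times higher.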

      There is a similar line of reasoning in neural nets, with the paradigm of TRAINING SET / TEST SET. When you have used up the training set to train your neural net, you can no longer run a test on that training set: that is cheating. Rather, you must AMBUSH a different set, or test set. In medical data, the problem is that you use up the training set, and you can't use it again. However, medical data are not like animal experiments: you can't kill a few more subjects, do the experiment again, and create another training set. You are, ethically, required to use the same patients over and over. Working through some of these intellectual problems may provide future employment for theoretical statisticians.
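
Since the same patients recur in the data, one practical precaution is to assign whole patients, rather than individual records, to the training set or the test set. The sketch below is a minimal illustration with hypothetical field names (`patient_id`, `visit`) and an arbitrary salt; it uses a hash of the patient identifier so that the assignment is reproducible and a patient never appears on both sides.

```python
import hashlib


def assign_split(patient_id, test_fraction=0.2, salt="demo-salt"):
    """Deterministically assign a patient to 'train' or 'test'."""
    digest = hashlib.sha256((salt + ":" + patient_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def split_records(records, test_fraction=0.2):
    """Split records so that all records of one patient land in the same set."""
    train, test = [], []
    for record in records:
        target = test if assign_split(record["patient_id"], test_fraction) == "test" else train
        target.append(record)
    return train, test


# Hypothetical records: 50 patients with 4 visits each.
records = [{"patient_id": f"P{i % 50}", "visit": i} for i in range(200)]
train, test = split_records(records)
print(len(train), "training records;", len(test), "test records")
```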

NEED FOR IDENTIFICATION.


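      Identification is needed for two purposes: to prevent duplication of records for the same patient, and to permit look-back to the original record, in order to verify correctness or to obtain additional information.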

FORMS OF DE-IDENTIFICATION.


      ANONYMOUS DATA. These are data that were collected so that the patient-identification was removed at the time the information was collected. For example, a piece of tissue is taken from an autopsy with a certain disease, to serve as control tissue-blocks in the histology laboratory.

      ANONYMIZED DATA. These are data that are collected initially WITH the patient-identification, which is subsequently, irrevocably removed. That is, there is NEVER a possibility of going back to the patient's record and obtaining additional information.

      DE-IDENTIFIED DATA. These are data that are collected initially with the patient-identification, which is subsequently encoded or encrypted.

      IDENTIFIED DATA. This can only be justified under significant review by the institution, federal guidelines, etc., with patient signing a consent form.
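
The difference between ANONYMIZED and DE-IDENTIFIED data can be sketched in a few lines. This is an illustration under stated assumptions, not a compliant procedure: the field names are hypothetical, and a keyed hash (HMAC-SHA256) stands in for the "encoded or encrypted" identifier. With a keyed hash, the same patient always receives the same pseudonym, which allows duplicates to be detected, while look-back requires the key holder to recompute pseudonyms from the original identifiers.

```python
import hashlib
import hmac

IDENTIFYING_FIELDS = ("patient_id", "name")  # hypothetical field names


def pseudonym(patient_id, secret_key):
    """Keyed one-way code for a patient identifier (HMAC-SHA256, hex)."""
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize(record):
    """ANONYMIZED: identifiers are removed, with no way back to the patient."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}


def de_identify(record, secret_key):
    """DE-IDENTIFIED: identifiers are replaced by a pseudonym tied to a secret key."""
    output = anonymize(record)
    output["pseudonym"] = pseudonym(record["patient_id"], secret_key)
    return output


# Hypothetical record and key (illustration only; a real key must be kept secret).
key = b"held-only-by-the-contributing-institution"
record = {"patient_id": "000123", "name": "DOE, JOHN", "diagnosis": "COLON ADENOCARCINOMA"}
print(anonymize(record))
print(de_identify(record, key))
```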

EXPECTED BENEFITS.


      Any use of patient data, even de-identified data, must be justified to the Institutional Review Board (IRB) as having SOME expected benefits. Even de-identified data cannot be used for frivolous purposes. Legally and ethically, you can't perform data analysis "just for the fun of it".

CANONICAL FORM.


      You know what CANONICAL FORM is in mathematics. The trouble is that, in biomedicine, even simple concepts have no canonical form. For example, what is the canonical form for even a simple idea, such as ADENOCARCINOMA OF COLON, METASTATIC TO LIVER? What about more complex ideas? If there is not a canonical form for equivalent ideas in biomedicine, then how are statistical tables constructed, when they depend upon equivalent concepts being tabulated together? See the Spoiled Meat Paradigm (below).

NATURAL HISTORY OF DISEASE.


      The distribution of results in a medical database is biased, in large part, by the NATURAL HISTORY OF DISEASE. In other words, different categories in a statistical distribution may not be perfectly random, but constrained by the fact that certain sequences of medical events are either common or unlikely. How should this be taken into account in statistical tests?

MINIMAL RISK.


      The emerging U. S. federal paradigm for using medical data for research purposes is MINIMAL RISK. That is, if one uses only data that are collected in the ordinary diagnosis and treatment of patients, and there is no change in patient management as a result of the research, including no pressure on the patient to accept or refuse certain management, then the only risk of using such data is the loss of confidentiality to the patient. This is called MINIMAL RISK data, and may be possible to use in research projects with a simple exemption from the INSTITUTIONAL REVIEW BOARD (IRB).

ENCRYPTION.


      Encryption methods for de-identified medical information must be justified to the INSTITUTIONAL REVIEW BOARD (IRB). For internet distribution, data which enter the database only once, as in the Johns Hopkins Autopsy Resource, are pretty safe. On the other hand, data from multiple institutions are only as secure as the procedures from the least-secure contributing institution. Also, data from a single institution, in which there are multiple updates of the database over time, are less secure from a determined attacker. Candidate methods include double-brokered encryption, one-time-pad encryption, and public-private encryption.
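
Of the methods named, the one-time pad is the simplest to sketch. The example below is a minimal illustration, not a vetted implementation; the function names are hypothetical. It assumes the pad is truly random, exactly as long as the message, kept secret, and never reused. The same XOR operation both encrypts and decrypts.

```python
import secrets


def make_pad(length):
    """Generate a random pad of the given length in bytes."""
    return secrets.token_bytes(length)


def xor_with_pad(data, pad):
    """XOR data with a pad of equal length; applying it twice restores the data."""
    if len(pad) != len(data):
        raise ValueError("pad must be exactly as long as the data")
    return bytes(d ^ p for d, p in zip(data, pad))


# Hypothetical identifier to protect (illustration only).
message = b"PATIENT 000123"
pad = make_pad(len(message))
ciphertext = xor_with_pad(message, pad)
recovered = xor_with_pad(ciphertext, pad)
print(recovered == message)
```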

FAIRNESS.


      FAIRNESS is required in statistical evaluations between competing hypotheses. Jules is fond of saying that many would-be grantsmen formulate their ideas in terms of FAIRNESS FOR ME, rather than FAIRNESS FOR EVERYBODY. Fairness in a statistical test may be constrained both by the natural history of the diseases involved, and by what mineable data are excluded by ethical and legal considerations.

SPOILED MEAT PARADIGM.


      The medically significant information in many records (namely, the pathology report!) is often written in free text. For example, the following are all equivalent statements:
ADENOCARCINOMA OF COLON , METASTATIC TO LIVER .
COLON ADENOCARCINOMA , METASTATIC TO LIVER .
COLONIC ADENOCARCINOMA , METASTATIC TO LIVER .
LARGE BOWEL ADENOCARCINOMA , METASTATIC TO LIVER .
COLON'S ADENOCARCINOMA , METASTATIC TO LIVER .
ADENOCARCINOMA OF COLON , WITH METASTASIS TO LIVER .
COLON ADENOCARCINOMA , WITH METASTASIS TO LIVER .
COLONIC ADENOCARCINOMA , WITH METASTASIS TO LIVER .
LARGE BOWEL ADENOCARCINOMA , WITH METASTASIS TO LIVER .
COLON'S ADENOCARCINOMA , WITH METASTASIS TO LIVER .
ADENOCARCINOMA OF COLON , WITH LIVER METASTASIS .
COLON ADENOCARCINOMA , WITH LIVER METASTASIS .
COLONIC ADENOCARCINOMA , WITH LIVER METASTASIS .
LARGE BOWEL ADENOCARCINOMA , WITH LIVER METASTASIS .
COLON'S ADENOCARCINOMA , WITH LIVER METASTASIS .
However, the following means something completely different, and has an entirely different diagnosis, treatment, prognosis, epidemiology, etc.:
ADENOCARCINOMA OF LIVER , METASTATIC TO COLON .
What is to guarantee that the computer translator that is used in the data mining algorithm gets it right?
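
For this one family of phrases, a canonical form can be sketched by hand. The toy normalizer below is an assumption-laden illustration, not a general translator: the synonym table and the phrase patterns are written only for the statements listed above, and the canonical form is taken to be a triple of morphology, primary site, and metastatic site.

```python
import re

# Toy synonym table covering only the site terms used in the examples above.
SITE_SYNONYMS = {
    "COLON": "COLON",
    "COLONIC": "COLON",
    "COLON'S": "COLON",
    "LARGE BOWEL": "COLON",
    "LIVER": "LIVER",
}

PRIMARY_PATTERNS = [r"^ADENOCARCINOMA OF (.+)$", r"^(.+) ADENOCARCINOMA$"]
METASTASIS_PATTERNS = [
    r"^METASTATIC TO (.+)$",
    r"^WITH METASTASIS TO (.+)$",
    r"^WITH (.+) METASTASIS$",
]


def _extract_site(phrase, patterns):
    """Return the canonical site named in a phrase, or raise ValueError."""
    for pattern in patterns:
        match = re.match(pattern, phrase)
        if match and match.group(1) in SITE_SYNONYMS:
            return SITE_SYNONYMS[match.group(1)]
    raise ValueError("unrecognized phrase: " + phrase)


def canonical_form(diagnosis):
    """Map a free-text diagnosis to (morphology, primary site, metastatic site)."""
    text = " ".join(diagnosis.upper().replace(".", " ").split())
    primary, metastasis = [part.strip() for part in text.split(",")]
    return (
        "ADENOCARCINOMA",
        _extract_site(primary, PRIMARY_PATTERNS),
        _extract_site(metastasis, METASTASIS_PATTERNS),
    )


print(canonical_form("LARGE BOWEL ADENOCARCINOMA , WITH LIVER METASTASIS ."))
print(canonical_form("ADENOCARCINOMA OF LIVER , METASTATIC TO COLON ."))
```

Even in this tiny case, every new synonym or grammatical construction needs another table entry or pattern, which is the point of the paradigm.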

      The term SPOILED MEAT PARADIGM in the field of computer translation is based upon stories about the early Georgetown University computer translation system between Russian and English [reference: Hutchins. Machine Translation.]. You are more of a Slavic language expert than I, but I understand that, as with German, SPIRIT is ambiguous for the religious spirit as well as for the alcoholic spirit; and FLESH is ambiguous for religiously fallible humanity as well as animal flesh (meat). According to the story, the following passage (Matthew 26:40-41), recounting Jesus's annoyance with St. Peter about the disciples falling asleep while Jesus was fasting and praying in Gethsemane, prior to His crucifixion, was translated from English to Russian, then back to English again:
Matthew 26:40. And He cometh unto the disciples, and findeth them asleep, and saith unto Peter, What, could ye not watch with me one hour?
Matthew 26:41. Watch and pray, that ye enter not into temptation; THE SPIRIT INDEED IS WILLING, BUT THE FLESH IS WEAK.
The return translation to English was: THE VODKA IS STRONG, BUT THE MEAT IS SPOILED.

COMPUTER TRANSLATION.
NAGAO'S PRINCIPLES (my emphasis) [Nagao, 1992].
  • Machine translation is typically composed of the following three steps: analysis of a SOURCE LANGUAGE SENTENCE; TRANSFER (word selection and structural mapping) from one language to another; and generation of a TARGET LANGUAGE SENTENCE.

  • Natural language can be regarded as a huge set of exceptional expressions, and as many such expressions as possible must be collected in the dictionary. It is an endless job [see also, Goethe, Langenscheidt].

  • One of the difficulties of translation, whether it is done by a human or a machine, is that the translation of an input sentence is not unique.

  • Current translation systems can analyze and translate sentences composed of less than ten words, but almost always fail to analyze and translate the sentences of more than thirty words. A reason for such failure is the ambiguity mentioned above. This is inevitable because sometimes even a human cannot understand the meaning of a long sentence at the first reading.

  • Grammatical rules in machine translation can be regarded as production rules that find out a specific linguistic structure and transform it to another linguistic structure.

  • One of the most promising [methods] involves storing many pairs of example phrases and their translations to compare an input partial phrase with these examples, and to extract the most similar example phrase. Then, the translation of the input phrase is done in reference to the translation of the example phrase. This principle is called TRANSLATION BY ANALOGY.

          The key quotation in this article is: "Natural language can be regarded as a huge set of exceptional expressions, and as many such expressions as possible must be collected in the dictionary. It is an endless job."

          That is, computer translation is driven by the exception-table. One of Joel Saltz's graduate students (alas, I do not remember his name) laid this thought upon me at the APIII-1999 meeting in Pittsburgh, a fact well-known to anyone who has ever contended with a significant bolus of heterogeneous text for translation. As far as I am concerned, the central intellectual question of Jules's RFA is whether the project will succeed, because a large majority of sentences in a surgical pathology report have a repetitive syntax (Zipf's Law); or whether the project will fail, because nearly every sentence is a new exception, and requires yet another lexicon entry. That is, will the project be driven by the left end of the Zipf distribution, or by the right end? You can't really know this until you have done a serious, statistical study of the entire Zipf distribution. This is the main reason why I have dug in my heels on getting a complete copy of the surgical pathology database.

          My personal belief is that surgical pathology reports are a sufficiently narrow domain that left-Zipf will permit a significant economy in translation; as compared to endless human coding, which is the conventional wisdom of SNOMEDers at the College of American Pathologists; and the translator will more-or-less succeed. Furthermore, we can quantitatively demonstrate the relative success of our translator with Zipf distributions.
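
A minimal sketch of the kind of Zipf study meant here, assuming the reports have already been split into sentences. The function name `zipf_table` and the toy corpus are invented for illustration; they are not part of any existing translator.

```python
from collections import Counter

def zipf_table(sentences):
    """Return (rank, sentence, count, cumulative_share) rows, most frequent first."""
    # Normalize case and spacing so trivially different copies count together.
    counts = Counter(" ".join(s.upper().split()) for s in sentences)
    total = sum(counts.values())
    rows, running = [], 0
    for rank, (sentence, count) in enumerate(counts.most_common(), start=1):
        running += count
        rows.append((rank, sentence, count, running / total))
    return rows

# Toy corpus, invented for illustration only.
corpus = [
    "NO EVIDENCE OF MALIGNANCY .",
    "NO EVIDENCE OF MALIGNANCY .",
    "NO EVIDENCE OF MALIGNANCY .",
    "CHRONIC INFLAMMATION .",
    "CHRONIC INFLAMMATION .",
    "COLON ADENOCARCINOMA , METASTATIC TO LIVER .",
]
table = zipf_table(corpus)
for rank, sentence, count, share in table:
    print(rank, count, f"{share:.2f}", sentence)
```

If the cumulative share climbs quickly over the first few ranks, the left end of the distribution dominates and a small lexicon covers most sentences; a long tail of sentences seen only once points the other way.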

          Another key quotation is as follows: "One of the difficulties of translation, whether it is done by a human or a machine, is that the translation of an input sentence is not unique."

          That is, for an ideal computer-translator, there needs to be a unique expression, or CANONICAL FORM, in the target language, that encapsulates all the equivalent forms of the same concept. In mathematics, this means that the canonical form for ONE-HALF is 1/2, and there is an ALGORITHM for reducing the infinity of equivalent expressions, or ALIASES, namely 2/4, 3/6, 4/8, 5/10, ... , down to 1/2.
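
The reduction of fraction aliases to a canonical form can be sketched in a few lines of Python. The function name `canonical_fraction` is an assumption introduced for this illustration.

```python
from math import gcd

def canonical_fraction(numerator: int, denominator: int) -> tuple:
    """Reduce a fraction to lowest terms, the canonical form of all its aliases."""
    if denominator == 0:
        raise ValueError("denominator must be nonzero")
    divisor = gcd(numerator, denominator)
    # Keep the sign on the numerator so that each value has exactly one form.
    if denominator < 0:
        divisor = -divisor
    return numerator // divisor, denominator // divisor

aliases = [(2, 4), (3, 6), (4, 8), (5, 10)]
canonical = {canonical_fraction(n, d) for n, d in aliases}
print(canonical)
```

Every alias in the list collapses to the single pair (1, 2), which is the property an ideal target language should have for each concept.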

          The importance of canonical form finally dawned upon the English-language dictionary-writers of the eighteenth century, who realized that you couldn't write a dictionary without consistent orthography. (The Germans didn't GET IT until the late nineteenth century.) So long to the Chauceroid orthography of yesteryear, where there are five ways to spell the same word within the same poetical quatrain. Apparently U. S. Federal clerk-typists at the Baltimore VAMC are still living in the Chaucer era, judging from how many different ways they can spell ATHEROSCLEROSIS in the same autopsy report.

          Likewise, as you well know, in a laboratory information system (LIS), there has to be a canonical form for a patient's name, as well as any number of aliases.

          It is my belief that there are only a few fundamental syntactic concepts in surgical pathology, and that these can be reduced to a canonical form. For example, all the following are equivalent expressions:
    ADENOCARCINOMA OF COLON , METASTATIC TO LIVER .
    COLON ADENOCARCINOMA , METASTATIC TO LIVER .
    COLONIC ADENOCARCINOMA , METASTATIC TO LIVER .
    LARGE BOWEL ADENOCARCINOMA , METASTATIC TO LIVER .
    COLON'S ADENOCARCINOMA , METASTATIC TO LIVER .
    ADENOCARCINOMA OF COLON , WITH METASTASIS TO LIVER .
    COLON ADENOCARCINOMA , WITH METASTASIS TO LIVER .
    COLONIC ADENOCARCINOMA , WITH METASTASIS TO LIVER .
    LARGE BOWEL ADENOCARCINOMA , WITH METASTASIS TO LIVER .
    COLON'S ADENOCARCINOMA , WITH METASTASIS TO LIVER .
    ADENOCARCINOMA OF COLON , WITH LIVER METASTASIS .
    COLON ADENOCARCINOMA , WITH LIVER METASTASIS .
    COLONIC ADENOCARCINOMA , WITH LIVER METASTASIS .
    LARGE BOWEL ADENOCARCINOMA , WITH LIVER METASTASIS .
    COLON'S ADENOCARCINOMA , WITH LIVER METASTASIS .
    (I don't expect that you would ever encounter COLON'S ADENOCARCINOMA or LARGE BOWEL ADENOCARCINOMA in an actual surgical pathology report, but the expressions are perfectly meaningful.) Obviously, if you switch COLON and LIVER, the expressions mean something entirely different, with an entirely different pathophysiology, prognosis, treatment, epidemiology, etc., etc. The computer-translator MUST be able to get this detail correct, and there must be RECALL/PRECISION studies to prove it.
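
A recall/precision study compares the reports that the translator assigns to a concept against the reports that a human reviewer assigns to it. The sketch below assumes both are available as sets of report identifiers; the identifiers and the function name `precision_recall` are hypothetical.

```python
def precision_recall(predicted: set, truth: set):
    """Precision: share of predicted reports that are correct.
    Recall: share of truly matching reports that were found."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical report identifiers for "colon primary, metastatic to liver".
translator_says = {"R1", "R2", "R3", "R4"}
reviewer_says = {"R1", "R2", "R5"}
precision, recall = precision_recall(translator_says, reviewer_says)
print(precision, recall)
```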

          My candidate for canonical form is:
    <M> ADENOCARCINOMA <T> COLON <M> METASTASIS <T> LIVER </T> </M> </T> </M>
    where M=morphology/disease and T=topography/bodysite. This expression is XML-compatible. M, T, and all the component words have UMLS Concept-Unique-Identifiers (CUIs). Even if an official committee settles upon a different XML expression for the nationwide database, I'll bet that I can easily translate the above expression into the committee's final form.
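
As a rough illustration only, the equivalent expressions listed above can be mapped to this candidate canonical form with a small alias table and two patterns. Everything named here (`SITE_ALIASES`, `PRIMARY`, `SPREAD`, `canonicalize`) is an assumption made for the sketch. It is not the translator discussed in this text, and it handles only the syntax of the example expressions.

```python
import re

# Hypothetical alias table: each surface form maps to one preferred site term.
SITE_ALIASES = {
    "COLON": "COLON",
    "COLONIC": "COLON",
    "COLON'S": "COLON",
    "LARGE BOWEL": "COLON",
    "LIVER": "LIVER",
}

# The two halves of the phrase, as they appear in the example expressions.
PRIMARY = re.compile(r"^(?:ADENOCARCINOMA OF (?P<of>.+)|(?P<pre>.+) ADENOCARCINOMA)$")
SPREAD = re.compile(
    r"^(?:METASTATIC TO (?P<a>.+)|WITH METASTASIS TO (?P<b>.+)|WITH (?P<c>.+) METASTASIS)$"
)

def canonicalize(phrase: str) -> str:
    """Map one free-text variant to the candidate canonical expression."""
    # Normalize case and spacing, and drop the trailing period.
    text = " ".join(phrase.upper().rstrip(" .").split())
    primary_part, spread_part = (part.strip() for part in text.split(","))
    primary = PRIMARY.match(primary_part)
    spread = SPREAD.match(spread_part)
    if primary is None or spread is None:
        raise ValueError(f"unrecognized syntax: {phrase!r}")
    # An unknown site raises KeyError: a new exception for the lexicon.
    primary_site = SITE_ALIASES[primary.group("of") or primary.group("pre")]
    spread_site = SITE_ALIASES[spread.group("a") or spread.group("b") or spread.group("c")]
    return (
        f"<M> ADENOCARCINOMA <T> {primary_site} "
        f"<M> METASTASIS <T> {spread_site} </T> </M> </T> </M>"
    )

print(canonicalize("LARGE BOWEL ADENOCARCINOMA , WITH LIVER METASTASIS ."))
print(canonicalize("ADENOCARCINOMA OF LIVER , METASTATIC TO COLON ."))
```

The two printed expressions differ only in which site sits inside the outer topography tag, which is exactly the COLON/LIVER distinction the translator must preserve.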

    DATA MINING VERSUS STATISTICS.



    Although data mining has a great deal in common with statistics, since both strive toward discovering some structure in data, data mining also draws heavily from many other disciplines, most notably machine learning and database technology. Data mining differs from statistics in that it must deal with heterogeneous data fields, not just heterogeneous numbers, as is the case in statistics. The best example is medical data that may contain images, signals like ECG, clinical information like temperature, cholesterol levels, urinalysis data, etc., as well as the physician's interpretation written in unstructured English. More success stories in data mining are due to advancements in database technology than to advancements in data mining algorithms. It is only after a subset of data is selected from a large database that most data mining algorithms can actually manage this reduced data set.

    TYPES OF DATA MINING.



    DIRECTED MINING. The physician is interested in acquiring some particular information, for instance: find regions of the left ventricle ...

          HYPOTHESIS TESTING AND REFINEMENT. The user provides some working hypotheses, and expects the system to validate them, or to modify them and suggest other, more refined hypotheses.

          UNDIRECTED OR PURE MINING. This is the most general scenario, in which there are no constraints on the system, and at the same time no prior expectations of what the user will discover, or what type of discovery might be of interest. It is also the most difficult one to perform. Little has been done in this area.

          HUGE VOLUME OF DATA. Because of the sheer size of databases, it is unlikely that any of the data mining methods will succeed with raw data. Data mining methods may require extracting a sample from the database, in the hope that results obtained in this manner are representative for the entire database. Data reduction can be achieved in two ways: sampling in the data space, where some records are selected, often randomly, and used afterwards for data mining; and sampling in the feature space, where only some features of each data record are selected. Again, for a large number of features, the selection can be performed in a random manner.
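
The two kinds of sampling can be sketched as follows, assuming each record is a dictionary with the same field names. The field names and function names are invented for illustration.

```python
import random

def sample_records(records, n_records, rng):
    """Sampling in the data space: keep a random subset of the records."""
    return rng.sample(records, n_records)

def sample_features(records, n_features, rng):
    """Sampling in the feature space: keep the same random fields in every record."""
    feature_names = sorted(records[0])
    kept = rng.sample(feature_names, n_features)
    return [{name: record[name] for name in kept} for record in records]

rng = random.Random(0)  # fixed seed so that the sketch is repeatable
records = [
    {"age": 40 + i, "temp_c": 37.0, "cholesterol": 180 + i, "ecg_id": i}
    for i in range(100)
]
subset = sample_records(records, 10, rng)
reduced = sample_features(subset, 2, rng)
print(len(subset), sorted(reduced[0]))
```

Random selection is the simplest choice; whether the sample is representative of the whole database still has to be checked separately.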

          DYNAMIC NATURE OF DATA. Databases are constantly updated, either by adding, say, new SPECT images (for the same or a new patient), or by replacement of the existing ones (say, a SPECT had to be repeated because of technical problems). This requires methods that are able to incrementally update the knowledge learned so far.

          INCOMPLETE OR IMPRECISE DATA. The information collected in a database can be either incomplete or imprecise. Fuzzy sets and rough sets were developed explicitly for the purpose of addressing this problem.

          NOISY DATA. It is very difficult for any data collection technique to entirely eliminate noise. This implies that data mining methods should be made less sensitive to noise, or care should be taken that the amount of noise in data to be collected in the future will be approximately the same as that in the current data.

          MISSING ATTRIBUTE VALUES. A missing value may have been accidentally not entered, or may represent an unknown value. Missing values create a problem for most data mining methods, since most methods require a fixed dimension (number of features) for each data object.

          One approach to remedy this problem is to substitute missing values with most likely values; another approach is to replace the unknown value with all possible values for that attribute. Still another approach is intermediate: specify a likely RANGE of values, instead of only one most likely, or ALL possible values, some of which might be vanishingly unlikely. The trick is to specify the range in an unbiased manner.
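
The three approaches can be sketched side by side, with `None` standing for a missing value. Two simplifications are assumptions of this sketch, not recommendations: "all possible values" is approximated by all values actually observed, and the range is taken from the observed quartiles, which is not claimed to be unbiased.

```python
from collections import Counter

def impute_most_likely(values):
    """Replace each missing value with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    most_likely = Counter(observed).most_common(1)[0][0]
    return [most_likely if v is None else v for v in values]

def impute_all_possible(values):
    """Replace each missing value with every distinct observed value."""
    observed = sorted({v for v in values if v is not None})
    return [observed if v is None else [v] for v in values]

def impute_range(values, lower_q=0.25, upper_q=0.75):
    """Replace each missing value with a (low, high) range of observed values."""
    observed = sorted(v for v in values if v is not None)
    low = observed[int(lower_q * (len(observed) - 1))]
    high = observed[int(upper_q * (len(observed) - 1))]
    return [(low, high) if v is None else (v, v) for v in values]

# Invented cholesterol values; None marks a value that was never recorded.
cholesterol = [180, 200, None, 200, 240, None, 190]
print(impute_most_likely(cholesterol))
print(impute_all_possible(cholesterol)[2])
print(impute_range(cholesterol)[2])
```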

          The missing value problem is widely encountered in medical databases, since most medical data are collected as a byproduct of patient care activities, rather than for organized research protocols, where exhaustive data collection can be enforced. In the emerging federal paradigm of MINIMAL RISK investigations, there is a preference for data mining solely from byproduct data. Thus in a large medical database, almost every patient is lacking values for some feature, and almost every feature is lacking values for some patient.

          REDUNDANT, INSIGNIFICANT, OR INCONSISTENT DATA. The data set may contain redundant, insignificant, or inconsistent data objects and/or attributes. We speak about inconsistent data when the same data item is categorized as belonging to more than one category.
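
Inconsistency in this sense is easy to detect mechanically. A minimal sketch, assuming the data arrive as (item, category) pairs; the specimen identifiers and the function name `find_inconsistent` are hypothetical.

```python
from collections import defaultdict

def find_inconsistent(pairs):
    """Return every item that has been assigned to more than one category."""
    categories = defaultdict(set)
    for item, category in pairs:
        categories[item].add(category)
    return {item: cats for item, cats in categories.items() if len(cats) > 1}

pairs = [
    ("specimen-1", "benign"),
    ("specimen-2", "malignant"),
    ("specimen-1", "malignant"),   # conflicts with the first entry
    ("specimen-2", "malignant"),   # redundant, but not inconsistent
]
conflicts = find_inconsistent(pairs)
print(conflicts)
```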

          SUMMARIZATION. The goal is to characterize the data in terms of a small number of features/attributes in an aggregated form.

          CLUSTERING OR SEGMENTATION. The key objective is to find natural groupings (clusters) in high-dimensional data. Objects are clustered together if they are similar to one another (according to some measure), and at the same time are dissimilar from objects in other clusters. The key concern in clustering is how to incorporate domain knowledge into the mechanisms of clustering. Without that focus and at least partial human supervision, one can easily end up with clustering problems that are computationally infeasible.

          REGRESSION MODELS. These models originate from regression analysis, and its applied field, known as system identification. Regression is the analysis of dependencies of attribute values on the values of other attributes in the same object, and generation of a model that can predict attribute values for new objects.

          CLASSIFICATION. This term has its origins in pattern recognition. The task is to build a classifier, which, given a set of classes, would determine class membership for a new object. The classifiers can be regarded as a special case of regression models.

          CONCEPT DESCRIPTION. The goal is to create understandable descriptions of concepts, or categories. Machine learning, conceptual clustering, genetic algorithms, and fuzzy sets are the principal methods used for achieving this goal.

          DEPENDENCY ANALYSIS. This analysis is concerned with the determination of relationships (dependencies) among fields in a database.

          LINK ANALYSIS, OR ASSOCIATIONS. The task is to discover associations between attributes and objects, such that the presence of one pattern would imply presence of another pattern. These associations can involve attributes of the same object, or attributes of different objects. If it is performed over time, it is called sequence analysis (see next).

          SEQUENCE ANALYSIS. This analysis is geared toward problems of modeling sequential data. Methods include time series analysis, time series models, and temporal neural networks.

          PREDICTION. The task, given the prediction model and a new data object, is to predict a specific value for an attribute of the object. Prediction can be used in hypothesis testing.

          EXPLORATORY DATA ANALYSIS. Graphical models are often used to perform exploratory data analysis, i.e., to harness human visual recognition power to gain some insight into the data, such as discovering new patterns.

          VISUALIZATION. Visualization is very important for making the discovered knowledge comprehensible to humans; although it is the least developed of these areas, there is a growing demand for visualization techniques.

          Administrative Procedures: this category includes administrative policies and procedures; technical implementation comes in the later sections. Requirements in this category are:

          o Certification (policies to evaluate and certify that appropriate security measures are in place)

          o Chain of Trust Partner Agreements (contracts between the organization and any outside parties given access to individually identifiable health information, requiring the outside parties to protect the data)

          o Contingency Plans (for response to emergencies; must include applications and data criticality analysis, a data backup plan, a disaster recovery plan, an emergency mode operation plan, and testing and revision procedures)

          o Formal Mechanism for Processing Records (to limit risks due to processing issues)

          o Information Access Control (must include policies for the authorization, establishment, and modification of access privileges)

          o Internal Audit (an ongoing review of access records, etc., to identify possible security violations)

          o Personnel Security (the organization must ensure supervision of personnel performing technical systems maintenance activities by authorized, knowledgeable persons; maintain access authorization records; ensure that operating, and in some cases, maintenance personnel have proper access; employ personnel clearance procedures; employ personnel security policy/procedures; and ensure that system users, including technical maintenance personnel, are trained in system security)

          o Security Configuration Management (coordinated and integrated procedures for system security; must include documentation, hardware/software installation and maintenance review and testing for security features, inventory procedures, security testing, and virus checking)

          o Security Incident Procedures (must include both reporting and response procedures)

          o Security Management Process (must include risk analysis, risk management, a sanction policy, and a security policy)

          o Termination Procedures (to be performed when an employee leaves or loses access to the data; must include changing combination locks, removal from access lists, removal of user account(s), and turning in of keys, tokens, or cards that allow access)

          o Training (security training for all staff; must include awareness training for all personnel, including management, periodic security reminders, user education concerning virus protection, user education in importance of monitoring login success/failure, and how to report discrepancies, and user education in password management)

          Physical Safeguards: this category includes requirements to control and monitor physical access to systems and storage devices. Requirements in this category are:

          o Assigned Security Responsibility (responsibility for security measures and personnel conduct regarding security has to be assigned to a specific person or organization)

          o Media Controls (procedures covering the transport of data and software into and out of a facility; must include controls on access to media, accountability, data backup, data storage, and disposal)

          o Physical Access Controls (procedures for preventing unauthorized physical access to a facility while ensuring access for those who should have it; must include disaster recovery, emergency mode operation, equipment control, a facility security plan, procedures for verifying access authorizations prior to physical access, maintenance records, need-to-know procedures for personnel access, sign-in for visitors and escort if appropriate, and testing and revision)

          o Policy/Guideline on Workstation Use (how to use workstations so as to maximize security of information, e.g. log off after use)

          o Secure Workstation Location (placement of workstations so as to minimize the risk of unauthorized access to information)

          o Security Awareness Training (for all personnel, including contractors and the like)

          Technical Security Services: this category includes requirements for technical features to control access to information and permit auditing of access. Requirements in this category are:

          o Access Control (must include a procedure for emergency access, plus either context-based, role-based, or user-based control of access; encryption is optional)

          o Audit Controls (recording and examination of system activity)

          o Authorization Control (for obtaining consent to access information; must use either role-based or user-based control)

          o Data Authentication (the ability to confirm that data has not been changed or deleted; possible methods include checksums or digital signatures, etc.)

          o Entity Authentication (the ability to confirm that a person or organization accessing data is who or what they claim to be; must include automatic log off and unique user identification, as well as at least one of the following implementation features: a biometric identification system, a password system, a personal identification number, telephone callback, or a token system which uses a physical device for user identification)
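
The Data Authentication item in the list above names checksums as one possible method. The sketch below assumes SHA-256 from the Python standard library as the checksum; the standards themselves do not prescribe an algorithm, and the record contents are invented.

```python
import hashlib
import hmac

def checksum(data: bytes) -> str:
    """Return a SHA-256 checksum of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

def is_unchanged(data: bytes, recorded_checksum: str) -> bool:
    """Confirm that the data still match the checksum recorded earlier."""
    return hmac.compare_digest(checksum(data), recorded_checksum)

record = b"ADENOCARCINOMA OF COLON , METASTATIC TO LIVER ."
recorded = checksum(record)          # stored when the record is filed
altered = record.replace(b"COLON", b"LIVER", 1)
print(is_unchanged(record, recorded), is_unchanged(altered, recorded))
```

A bare checksum only detects change when the recorded value itself is kept out of an attacker's reach; a digital signature, the other method named in the list, removes that dependence.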

          Technical Security Mechanisms: protecting information as it travels over internal or external networks. There is only one requirement in this category:

          o Communications/Network Controls (must include integrity controls and message authentication, plus either access controls or encryption; if data travels over a network, implementation must also include alarms, audit trails, entity authentication, and event reporting)

          While the Security standards are fairly detailed, and unusual in that government has not made a practice of specifying mandatory standards for information security practices in the past, they are generally in line with current thinking on good security practices, and allow a fair amount of flexibility in their implementation.

          2.1.5. The HIPAA Privacy Standards. The Privacy standards (45 CFR §164) are intended to ensure that individually identifiable health information is only provided to those who should have it, while ensuring that those who need to have it can get access to it. They are generally directed towards those who maintain and collect medical data as part of their core business, i.e., potential suppliers, rather than pure users, of data. As a result, most of the privacy regulations are not of direct relevance to researchers, who are accustomed to anonymizing data before publication and do not generally pass raw data on to others. They will, however, affect what data researchers can obtain from health care organizations, and may lead such organizations to be more reluctant to supply data sets because they will have to make the effort to anonymize them first.

          Permission: In general, one must obtain permission from the individuals involved to use individually identifiable health information. There are, however, exemptions. One that may be relevant to researchers is for public health authorities or government agencies conducting or those acting at their direction. There is also an exemption for research use, but the request for the data must be reviewed either by an Institutional Review Board (established in accordance with the federal regulations for such boards) or a Privacy Board (for which the standards establish criteria) before the organization with the data is permitted to disclose it. In order to get approval to use identifiable health information without informed consent from all subjects, the researcher will have to show that:

          (1) The use or disclosure of protected health information involves no more than minimal risk to the subjects;

          (2) The waiver will not adversely affect the rights and welfare of the subjects;

          (3) The research could not practicably be conducted without the waiver;

          (4) Whenever appropriate, the subjects will be provided with additional pertinent information after participation;

          (5) The research could not practicably be conducted without access to and use of the protected health information;

          (6) The research is of sufficient importance so as to outweigh the intrusion of the privacy of the individual whose information is subject to the disclosure;

          (7) There is an adequate plan to protect the identifiers from improper use and disclosure; and

          (8) There is an adequate plan to destroy the identifiers at the earliest opportunity consistent with conduct of the research, unless there is a health or research justification for retaining the identifiers. (45 CFR §164.510[j])

          Need for Protected Information: If the data do not have to be individually identifiable in order to perform the research, the holder of the data (assuming the holder is an organization subject to the standards) must anonymize it before the researcher gets it. This is true whether consent is obtained or not; covered organizations are required to avoid divulging any more than is actually necessary to accomplish the intended purpose.

    This is actually beneficial to researchers in that, if they don't have individually-identifiable health information in their possession, they don't have to worry about the Security standards. Researchers should carefully weigh the perceived need for information such as zip codes, which may be necessary for epidemiological studies, for example, but which, in combination with other information, could be regarded as making the data non-anonymized.
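
One way to weigh that need is to count how many records share each combination of potentially identifying fields, with and without the zip code. This group-size check is a technique brought in here for illustration, not a test taken from the standards; the records and field names are invented.

```python
from collections import Counter

def smallest_group(records, fields):
    """Size of the smallest group of records sharing the same values for the fields."""
    groups = Counter(tuple(record[f] for f in fields) for record in records)
    return min(groups.values())

records = [
    {"zip": "21201", "birth_year": 1950, "sex": "M"},
    {"zip": "21201", "birth_year": 1950, "sex": "M"},
    {"zip": "21218", "birth_year": 1950, "sex": "M"},
    {"zip": "21287", "birth_year": 1962, "sex": "F"},
    {"zip": "21205", "birth_year": 1962, "sex": "F"},
]
with_zip = smallest_group(records, ["zip", "birth_year", "sex"])
without_zip = smallest_group(records, ["birth_year", "sex"])
print(with_zip, without_zip)
```

A smallest group of one means that at least one patient is singled out by that combination of fields. In this toy data that happens only when the zip code is included.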

    Disclosure: Researchers do not ordinarily provide raw data including personal identifiers to other organizations; it is not, therefore, worthwhile to detail the extensive regulations on disclosure here. Researchers who are part of organizations to which HIPAA directly applies should be aware that the privacy standards place a number of restrictions on disclosure. Transfers of identifiable health data to colleagues at other institutions will be possible only with approval from their institution's Privacy Officer (a position mandated by the standards) and only if certain criteria are met; the other institution will also have to sign an agreement to protect the data's confidentiality. Research organizations that are not directly governed by HIPAA, but receive data from organizations that are, will likely have to sign agreements barring them from making such disclosures.

    REFERENCES.



    1. Hutchins WJ.
    Machine Translation: Past, Present, Future.
    Ellis Horwood Series in Computers and Their Applications.
    ASIN: 0135435218.


    2. Nagao M.
    Machine Translation.
    In: Shapiro SC, ed. Encyclopedia of Artificial Intelligence. Volume 2. M-Z.
    New York: Wiley-Interscience. 1992;2:898-902.


    3. Moore GW, Berman JJ.
    Anatomic Pathology Data Mining.
    Chapter 4. In: Cios KJ. Medical Data Mining and Knowledge Discovery. Berlin: Springer Verlag. 2000;4:61-107.
    ISBN: 3-7908-1340-0, 502 pages.
    Published within the series: "Studies in Fuzziness and Soft Computing", Physica-Verlag Heidelberg, a Springer-Verlag Company.
    Full Text of Article: http://www.netautopsy.org/apdmchap.htm

    ADDITIONAL READING.





    4. Anderson RE, Smith RD, Benson ES.
    The accelerated graying of American pathology.
    Hum Pathol. 1991;22:210-214.


    5. Arcidi JM jr, Moore GW, Hutchins GM.
    Hepatic morphology in cardiac dysfunction. A clinicopathologic study of 1000 autopsied patients.
    Am J Pathol. 1981;104:159-166.


    6. Association of Directors of Anatomic and Surgical Pathology (ADASP).
    Recommendations for the reporting of specimens containing laryngeal neoplasms.
    Mod Pathol. 1997;10:384-386.


    7. Berman JJ, Moore GW, Hutchins GM.
    Maintaining patient confidentiality in the public domain Internet Autopsy Database (IAD).
    Proc AMIA 20th Annu Fall Symp. 1996;20:328-332.


    8. Bubendorf L, Kononen J, Koivisto P, Schraml P, Moch H, Gasser TC, Willi N, Mihatsch MJ, Sauter G, Kallioniemi OP.
    Survey of gene amplifications during prostate cancer progression by high-throughput fluorescence in situ hybridization on tissue microarrays.
    Cancer Res. 1999;59:803-806.


    9. Bubley GJ, Carducci M, Dahut W, Dawson N, Daliani D, Eisenberger M, Figg WD, Freidlin B, Halabi S, Hudes G, Hussain M, Kaplan R, Myers C, Oh W, Petrylak DP, Reed E, Roth B, Sartor O, Scher H, Simons J, Sinibaldi V, Small EJ, Smith MR, Trump DL, Vollmer R, Wilding G.
    Eligibility and Response Guidelines for Phase II Clinical Trials in Androgen-Independent Prostate Cancer: Recommendations From the Prostate-Specific Antigen Working Group.
    J Clin Oncol. 1999;17(11):3461-3467.


    10. Carter JR, Nash NP, Cechner RL, Platt RD.
    Proposal for a national autopsy data bank. A potential major contribution of pathologists to the health care of the nation.
    Am J Clin Pathol. 1981;76(Suppl):597-617.


    11. Choi SS, Kang YS, Kim UJ, Lee KH, Shin HS.
    Chromosomal localization of ESTs obtained from human fetal liver via BAC-mediated FISH mapping.
    Mol Cells. 1999;9:403-409.


    12. Colby TV, Koss MN, Travis WD.
    Armed Forces Institute of Pathology Atlas of Tumor Pathology. Tumors of the Lower Respiratory Tract. Electronic Fascicle version 2.0.
    Armed Forces Institute of Pathology. 1995.


    13. Compton CC.
    Pathology Report in Colon Cancer: What Is Prognostically Important?
    Dig Dis. 1999;17:67-79.


    14. Connell PP, Rotmensch J, Waggoner SE, Mundt AJ.
    Race and clinical outcome in endometrial carcinoma.
    Obstet Gynecol. 1999;94:713-720.

    15. Cote RA, Rothwell DJ, Beckett RS, Palotay JL, Brochu L.
    SNOMED International. The Systematized Nomenclature of Human and Veterinary Medicine.
    College of American Pathologists. 1993.

    16. Degregorio WA.
    The Complete Book of U.S. Presidents. Fifth Edition.
    Barricade Books. 1997.

    17. Fedorowicz J.
    A Zipfian model of an automatic bibliographic system: An application to MEDLINE.
    J Am Soc Info Sci. 1982;33:223-232.


    18. Giere W.
    Foundations of clinical data automation in cooperative programs.
    Proc 5th Ann Symp Comp Applic Med Care. 1981;5:1142-1148.

    19. Grizzle WE, Aamodt R, Clausen K, LiVolsi V, Pretlow TG, Qualman S.
    Providing human tissues for research: how to establish a program.
    Arch Pathol Lab Med. 1998;122:1065-1076.

    20. Hahn U, Romacker M, Schulz S.
    How knowledge drives understanding -- matching medical ontologies with the needs of medical language processing.
    Artif Intell Med. 1999;15:25-51.

    21. Hruban R, Westra WH, Phelps TH.
    Surgical Pathology Dissection.
    Springer Verlag. 1996.

    22. Hutchins GM.
    Autopsy. Performance and Reporting.
    College of American Pathologists. 1990.

    23. Hutchins GM, Meuli M, Meuli-Simmen C, Jordan MA, Heffez DS, Blakemore KJ.
    Acquired spinal cord injury in human fetuses with myelomeningocele.
    Pediatr Pathol Lab Med. 1996;16:701-712.

    24. Hutchins GM, Berman JJ, Moore GW, Hanzlick R, and the Autopsy Committee of the College of American Pathologists.
    Practice Guidelines for Autopsy Pathology.
    Arch Pathol Lab Med. 1999;123:1085-1092.

    25. Junor EJ, Hole DJ, McNulty L, Mason M, Young J.
    Specialist gynaecologists and survival outcome in ovarian cancer: a Scottish national study of 1866 patients.
    Br J Obstet Gynaecol. 1999;106:1130-1136.

    26. Kao GF, Moore GW.
    Dermatopathology False Negative Terms in Unified Medical Language System (UMLS).
    Arch Pathol Lab Med. 2000;124: (in press).

    27. Klausner RD.
    The Nation's Investment in Cancer Research: A Budget Proposal for Fiscal Year 2001.
    National Cancer Institute. 1999:51-55.

    28. Kononen J, Bubendorf L, Kallioniemi A, Barlund M, Schraml P, Leighton S, Torhorst J, Mihatsch MJ, Sauter G, Kallioniemi OP.
    Tissue microarrays for high-throughput molecular profiling of tumor specimens.
    Nat Med. 1998;4:844-847.

    29. Kurman RJ, Malkasian GD jr, Sedlis A, Solomon D.
    From Papanicolaou to Bethesda: the rationale for a new cervical cytology classification.
    Obstet Gynecol. 1991;77:779-782.

    30. Light R.
    Presenting XML.
    Sams.net Publishing. 1997.

    31. Lilienfeld DE, Stolley PD.
    Foundations of Epidemiology. Fifth Edition.
    Oxford University Press. 1994.

    32. Mapp TJ, Hardcastle JD, Moss SM, Robinson MH.
    Randomized clinical trial: Survival of patients with colorectal cancer diagnosed in a randomized controlled trial of faecal occult blood screening.
    Br J Surg. 1999;86:1286-1291.

    33. Moch H, Schraml P, Bubendorf L, Mirlacher M, Kononen J, Gasser T, Mihatsch MJ, Kallioniemi OP, Sauter G.
    High-throughput tissue microarray analysis to evaluate genes uncovered by cDNA microarray screening in renal cell carcinoma.
    Am J Pathol. 1999;154:981-986.

          34. Moore GW, Hutchins GM.
    Consistency versus completeness in medical decision making: Application to 155 patients autopsied after coronary artery bypass graft surgery.
    Proc 6th Annu Symp Comput Appl Med Care. 1982;6:805-811.

          35. Moore, G.W., Boitnott, J.K., Miller, R.E., Eggleston, J.C. and Hutchins, G.M. 1988. Integrated anatomic pathology reporting system using natural language diagnoses. Modern Pathol. 1:44-50.

          36. Moore GW, Miller RE, Hutchins GM.
    Indexing by MeSH titles of natural language pathology phrases identified on first encounter using the Barrier Word Method.
    In: Scherrer JR, Cote RA, Mandil SH, eds. Computerized Natural Medical Language Processing for Knowledge Representation.
    North-Holland. 1989:29-39.

          37. Moore, G.W. and Berman, J.J. 1994. Performance Analysis of Manual and Automated Systematized Nomenclature of Medicine (SNOMED) Coding. Am J Clin Pathol 101:253-256.

          38. Moore GW, Berman JJ, Hanzlick RL, Buchino JJ, Hutchins GM. A prototype internet autopsy database: 1625 consecutive fetal and neonatal autopsy facesheets spanning twenty years. Arch Pathol Lab Med. 1996;120:782-785.

          39. Moulton, G. 1999. Surveillance data take on a new statistical dimension. J Natl Cancer Inst. 91:671-673.

          40. Mullick, F. 1997. The Center for Environmental Pathology and Toxicology at the Armed Forces Institute of Pathology. Hum Pathol. 52: 752-753.

          41. Naber, S.P., Smith, L.L. Jr. and Wolfe, H.J. 1992. Role of the frozen tissue bank in molecular pathology. Diagnostic Molecular Pathology. 1:73-79.

          42. Nelson, S.J., Olson, N.E., Fuller, L., Tuttle, M.S., Cole, W.G. and Sherertz, D.D. 1995. Identifying concepts in medical knowledge. Medinfo. 8:33-36.

          43. Nelson, R.L., Persky, V. and Turyk, M. 1999. Carcinoma in situ of the colorectum: SEER trends by race, gender, and total colorectal cancer. J Surg Oncol. 1999 Jun;71(2):123-129.

          44. Payne, C. 1995. Developing a standard dataset for the NHS. Version 3 of Read Codes addresses many difficulties. BMJ. 311:951.

          45. Peery, T.M. 1978. The autopsy data bank. A proposal for pathologists to contribute to the health care of the nation. Am J Clin Pathol 69 (Suppl): 258-259.

    46. Read JD, Benson TJR.
    Comprehensive coding.
    Brit J Health Care Computing. 1986; 3:22-25.

    47. Rosai J, Ackerman LV.
    Ackerman's Surgical Pathology. Eighth Edition.
    C.V. Mosby. 1996.

    48. Schmitt AO, Specht T, Beckmann G, Dahl E, Pilarsky CP, Hinzmann B, Rosenthal A.
    Exhaustive mining of EST libraries for genes differentially expressed in normal and tumour tissues.
    Nucleic Acids Res. 1999;27:4251-4260.

    49. Schneier B.
    Applied Cryptography, Second Edition. Protocols, Algorithms, and Source Code in C.
    New York: John Wiley & Sons. 1996.

    50. Simpson A.
    HTML Publishing Bible, Windows 95 Edition.
    IDG Books Worldwide, Inc. 1996.

    51. Smith RD, Benson ES, Anderson RE.
    Some characteristics of the community practice of pathology in the United States.
    Arch Pathol Lab Med. 1989;113:1335-1342.

          52. Tersmette KWF, Scott AF, Moore GW, Matheson NW, Miller RE.
    Barrier word method for detecting molecular biology multiple word terms.
    Proc 12th Annu Symp Comput Appl Med Care. 1988;12.

          53. U. S. Centers for Disease Control and Prevention.
    Manual of Procedures for the Reporting of Nationally Notifiable Disease to CDC.
    Centers for Disease Control and Prevention. 1995.

    54. U. S. Code of Federal Regulations.
    45 CFR Subtitle A (10-1-95 Edition), part 46.101 (b) (4).
    U. S. Department of Health and Human Services. Office of the Secretary. 1995.

    55. U. S. Code of Federal Regulations. 45 CFR Parts 160 - 164. Standards for Privacy of Individually Identifiable Health Information; Proposed Rule.
    Department of Health and Human Services. Office of the Secretary.
    Federal Register. 1999;64:59917-60065.
    http://aspe.hhs.gov/admnsimp/
    Last tested: August 29, 2009.

          56. U. S. Health Insurance Portability and Accountability Act. (HIPAA, Kennedy-Kassebaum Bill, H.R. 3103 of 104th U. S. Congress). 1996.
    U. S. Government Documents at URL:
    http://thomas.loc.gov
    Last tested: August 29, 2009.

    58. Vigorita VJ, Moore GW, Hutchins GM.
    Absence of correlation between coronary arterial atherosclerosis and severity or duration of diabetes mellitus of adult onset.
    Am J Cardiol. 1980;46:535-542.

    59. Wagner BM.
    The future of environmental and toxicologic pathology.
    Human Pathol. 1996;27:1003-1004.

    60. Zhang Q.
    Easy entry of Chinese character set symbols.
    Proc 5th Ann Symp Comp Appl Med. 1981;5:143-149.

    61. Zipf GK.
    Human Behavior and The Principle of Least Effort. An Introduction to Human Ecology.
    Reading, MA: Addison-Wesley Press. 1949;:19-55.

    61. Collins KA, Hutchins GM, eds.
    Autopsy Performance and Reporting. Second Edition.
    Northfield, IL: College of American Pathologists. 2003. Voice: 800-323-4040.
    ISBN 0-930304-78-0, 397 pages.

    62. Moore GW.
    Computer-based indexing. Chapter 32.
    In: Collins KA, Hutchins GM, eds. Autopsy Performance and Reporting. Second Edition.
    Northfield, IL: College of American Pathologists. 2003;2:313-323. Voice: 800-323-4040.
    ISBN 0-930304-78-0, 397 pages.

    63. Ludwig J.
    Handbook of Autopsy Practice. Third Edition.
    Totowa, NJ: Humana Press. 2002.
    ISBN 1-58829-169-3, 592 pages.

    64. Hutchins GM, Berman JJ, Moore GW, Hanzlick R.
    Practice guidelines for autopsy pathology: autopsy reporting. Autopsy Committee of the College of American Pathologists.
    Arch Pathol Lab Med. 1999 Nov;123(11):1085-1092.
    PMID: 10539932
    PubMed Entry
    Full Text.
    Last tested: August 29, 2009.

    65. Hutchins GM, Autopsy Committee of the College of American Pathologists.
    Practice guidelines for autopsy pathology. Autopsy performance. Autopsy Committee of the College of American Pathologists.
    Arch Pathol Lab Med. 1994 Jan;118(1):19-25.
    PMID: 8285830.
    PubMed Entry
    Last tested: August 29, 2009.

    66. Hutchins GM, Autopsy Committee of the College of American Pathologists.
    Practice guidelines for autopsy pathology. Autopsy reporting. Autopsy Committee of the College of American Pathologists.
    Arch Pathol Lab Med. 1995 Feb;119(2):123-30.
    PMID: 7848058.
    PubMed Entry
    Last tested: August 29, 2009.

    67. Hutchins GM, ed.
    Autopsy Performance and Reporting.
    Northfield, IL: College of American Pathologists. 1990.

    68. Powers JM, the Autopsy Committee of the College of American Pathologists.
    Practice guidelines for autopsy pathology. Autopsy procedures for brain, spinal cord, and neuromuscular system.
    Arch Pathol Lab Med 119:777-783, 1995.

    69. Moore GW, Berman JJ, Hanzlick RL, Buchino JJ, Hutchins GM.
    A prototype Internet autopsy database. 1625 consecutive fetal and neonatal autopsy facesheets spanning 20 years.
    Arch Pathol Lab Med. 1996 Aug;120(8):782-5.
    PMID: 8718907.
    PubMed Entry
    Last tested: August 29, 2009.

    70. Hutchins GM.
    Whither the Autopsy? ...To regional autopsy centers.
    Arch Pathol Lab Med 1996; 120:718-719.

    71. Hutchins GM, McLendon W W. (eds.).
    College of American Pathologists Conference XXIX on Restructuring Autopsy Practice for Health Care Reform.
    Arch Pathol Lab Med. 1996;120:733-781.

    72. Berman JJ, Moore GW, Hutchins GM.
    Maintaining patient confidentiality in the public domain Internet Autopsy Database (IAD).
    Proc AMIA Annu Fall Symp. 1996:328-332.
    PMID: 8947682.
    PubMed Entry
    Last tested: August 29, 2009.

    73. Berman JJ, Moore GW, Hutchins GM.
    Internet autopsy database.
    Hum Pathol. 1997 Apr;28(4):393-394.
    PMID: 9104935.
    PubMed Entry
    Last tested: August 29, 2009.

    74. O'Grady G.
    Death of the teaching autopsy.
    BMJ. 2003 Oct 4;327:802-803.
    Curriculum pressures and a decline in hospital autopsy rates have reduced the opportunity for medical students to learn from autopsy findings.

    75. Moore GW, Hutchins GM.
    The persistent importance of autopsies.
    Mayo Clin Proc. 2000 Jun;75(6):557-558.

    76. Underwood J.
    Resuscitating the autopsy.
    BMJ. 2003 Oct 4;327:803-804.

    77. Anderson RE, Hill RB.
    The current status of the autopsy in academic medical centers in the United States.
    Am J Clin Pathol 1989;92:S31-37.

    78. Lundberg GD.
    Medicine without the autopsy.
    Arch Pathol Lab Med 1984;108:449-454.

    79. Charlton R.
    Autopsy and medical education: a review.
    J Roy Soc Med. 1994;87:232-236.

    80. Welsh TS, Kaplan J.
    The role of postmortem examination in medical education.
    Mayo Clin Proc. 1998;73:802-805.

    81. Hartmann HRF, Sebastian M.
    An argument for the attendance of clinicians at autopsy.
    Arch Pathol Lab Med 1984;108:522-523.

    82. Hill RB, Anderson RE.
    The uses and value of autopsy in medical education as seen by pathology educators.
    Acad Med. 1991;66:97-100.

    83. Galloway M.
    The role of the autopsy in medical education.
    Hosp Med 1999;60:756-758.

    84. Sanchez H, Ursell P.
    Use of autopsy cases for integrating and applying the first two years of medical education.
    Acad Med. 2001;76:530-531.

    85. Benbow EW.
    Medical students' views on necropsies.
    J Clin Pathol 1990;43:969-976.

    86. Goldman L.
    Diagnostic advances vs the value of the autopsy.
    Arch Pathol Lab Med 1984;108:501-505.

    87. Kirch W, Schafii C.
    Misdiagnosis at a university hospital in four medical eras.
    Medicine 1996;75:29-40.

    88. Kingsford DPW.
    A review of diagnostic inaccuracy.
    Med Sci Law. 1995;35:347-351.

    89. Botega NJ, Metze K, Marques E, Cruvinel A, Moraes AV, Augusto L, et al.
    Attitudes of medical students to necropsy.
    J Clin Pathol. 1997;50:64-66.

    90. Sherwood SJ, Start RD, Birdi KS, Cotton DWK, Bunce D.
    How do clinicians learn to request permission for autopsies?
    Med Educ. 1995;29:231-234.

    91. Sanner MA.
    In perspective of the declining autopsy rate: attitudes of the public.
    Arch Pathol Lab Med 1994;118:878-883.

    92. Geller SA.
    Religious attitudes and the autopsy.
    Arch Pathol Lab Med 1984;108:494-496.

    97. Manning CD, Schütze H.
    Foundations of Statistical Natural Language Processing.
    Cambridge, MA: The MIT Press. 2000.
    ISBN: 0262133601.
    Foundations of Statistical Natural Language Processing companion website:
    http://www-nlp.stanford.edu/fsnlp/intro/
    Last tested: August 29, 2009.

    98. Li W.
    Comprehensive bibliography regarding Zipf's Law, a central principle of natural language processing, has been assembled by Prof. Wentian Li:
    http://www.nslij-genetics.org/wli/zipf/
    Last tested: August 29, 2009.

    99. Moore GW, Miller RE.
    Inventory of medical natural language:
    http://www.netautopsy.org/vhpsapsx.htm
    Last tested: August 29, 2009.

    100. Moore GW.
    Anatomic Pathology Natural Language Processing.
    http://www.netautopsy.org/natlngpr.htm
    Last tested: August 29, 2009.

    101. Berman JJ, Moore GW.
    Laboratory Digital Imaging Project (LDIP).
    http://www.julesberman.info/spec2img.htm
    Last tested: August 29, 2009.

    102. Association for Pathology Informatics (API).
    http://www.pathologyinformatics.org
    Last tested: August 29, 2009.

    103. Code of Federal Regulations. Title 45. Public Welfare. Department of Health and Human Services. Part 46. Protection of Human Subjects.
    http://www.hhs.gov/ohrp/humansubjects/guidance/45cfr46.htm
    Last tested: August 29, 2009.

    104. Frequently-asked Questions about Research Using Human Specimens, Cell Lines or Data.
    http://grants.nih.gov/grants/policy/hs/faqs_specimens.htm
    Last tested: August 29, 2009.

    108. Section 508 of the U. S. Federal Rehabilitation Act.
    http://www.jimthatcher.com/webcourse1.htm
    Last tested: August 29, 2009.

    109. The official government site for Section 508 of the U. S. Federal Rehabilitation Act:
    http://www.section508.gov/
    Last tested: August 29, 2009.

    110. Surveillance, Epidemiology, and End Results (SEER).
    Cancer Statistics Review, 1975-2005.
    http://seer.cancer.gov/csr/1975_2005/

    111. Surveillance, Epidemiology, and End Results (SEER).
    Statistics for 2001-2005. Rates per 100,000 persons.
    Statistics in the text are calculated assuming a U. S. population of 300,000,000.
    http://seer.cancer.gov/csr/1975_2005/results_single/sect_01_table.04_2pgs.pdf
    Last tested: August 29, 2009.

    112. Berman JJ, with Moore GW.
    Precancer: The Beginning and the End of Cancer.
    Boston, Toronto, London, Singapore: Jones and Bartlett. 2009 Aug 11.
    ISBN-13: 978-0-7637-7784-5, 200 pages.
    http://www.jbpub.com/catalog/9780763777845/
    Last tested: August 29, 2009.
    The central idea of this book is straightforward: precancer is an entity biologically distinct from cancer that is recognizable and potentially treatable before it takes its final, irrevocable step to invasive cancer. Several precancers are well known: dysplasia of the uterine cervix in women, adenomatous polyp in the colon, and actinic keratosis in sun-exposed areas of the skin, all of which can eventually progress to cancer. Cervical dysplasia screening (Pap smears) has decreased cervical cancer rates by 70-90% in all countries that have universal programs for detection and followup. The late George N. Papanicolaou, MD, PhD, saved more lives in the twentieth century than any other pathologist. Since cervical cancer typically attacks 20-year-old women, this prevention program gives them an additional 50 years of life expectancy. Since cancer is typically a disease of middle-aged and older patients, other precancer programs would not have as large an effect on life expectancy. However, much valuable territory in precancer remains unexplored.

    113. Moore GW, Berman JJ.
    Cell growth simulations predicting polyclonal origins for 'monoclonal' tumors.
    Cancer Lett. 1991 Nov;60(2):113-119.
    PMID: 1933835.
    PubMed Entry
    Full Text of Article, including public-domain open-source code: http://www.netautopsy.org/monoclon.htm
    Last tested: August 29, 2009.

    114. Berman JJ, Moore GW.
    Spontaneous regression of residual tumour burden: prediction by Monte Carlo simulation.
    Anal Cell Pathol. 1992 Sep;4(5):359-368.
    PMID: 1445794.
    PubMed Entry
    Full Text of Article: http://www.netautopsy.org/sponregr.htm
    Last tested: August 29, 2009.

    115. Berman JJ, Moore GW.
    The role of cell death in the growth of preneoplastic lesions: a Monte Carlo simulation model.
    Cell Prolif. 1992 Nov;25(6):549-557.
    PMID: 1457604.
    PubMed Entry
    Full Text of Article: http://www.netautopsy.org/celdeath.htm
    Last tested: August 29, 2009.

    116. Moore GW, Berman JJ.
    Anatomic Pathology Data Mining.
    In: Cios KJ, ed. Medical Data Mining and Knowledge Discovery.
    2001. XVIII, 502 pp. 98 figs., 98 tabs. Hardcover.
    ISBN: 3-7908-1340-0.
    Copyright Springer-Verlag: Berlin/Heidelberg 1999.
    http://www.netautopsy.org/apdmchap.htm
    Last tested: August 29, 2009.

    117. Berman JJ.
    Tumor classification: molecular analysis meets Aristotle.
    BMC Cancer. 2004 Mar 17;4:10.
    PMID: 15113444
    PubMed Entry
    Aristotle (384-322 BCE), Greek philosopher.
    This article is among the all-time most-viewed articles in BMC Cancer, and, as of September 2008, has been downloaded about 15,000 times from BiomedCentral.
    Last tested: August 29, 2009.

    118. Berman JJ.
    Tumor taxonomy for the developmental lineage classification of neoplasms.
    BMC Cancer. 2004 Nov 30;4(1):88.
    PMID: 15571625.
    PubMed Entry
    Last tested: August 29, 2009.

    119. Berman JJ.
    Modern classification of neoplasms: reconciling differences between morphologic and molecular approaches.
    BMC Cancer 2005, 5:100.
    PMID: 16092965
    PubMed Entry
    Last tested: August 29, 2009.

    120. Berman JJ.
    Developmental Lineage Classification and Taxonomy of Neoplasms.
    http://www.julesberman.info/devclass.htm
    Last tested: August 29, 2009.

    121. Berman JJ.
    Doublet method for very fast autocoding.
    BMC Med Inform Decis Mak. 2004 Sep 15;4:16.
    PMID: 15369595
    PubMed Entry
    Last tested: August 29, 2009.

    122. Berman JJ, Moore GW.
    Implementing an RDF schema for pathology images.
    http://www.julesberman.info/spec2img.htm
    Last tested: August 29, 2009.

    123. Berman JJ.
    Biomedical Informatics.
    Boston, Toronto, London, Singapore: Jones & Bartlett Publishers; 1 edition (October 18, 2006)
    ISBN-10: 0763741353, 459 pages.
    ISBN-13: 978-0763741358, 459 pages.
    http://www.jbpub.com/catalog/9780763741358/
    http://www.julesberman.info/
    Last tested: August 29, 2009.

    124. Berman JJ.
    Perl Programming for Medicine and Biology.
    Boston, Toronto, London, Singapore: Jones & Bartlett Publishers; 1 edition (April 6, 2007)
    ISBN-10: 076374333X, 407 pages.
    ISBN-13: 978-0763743338, 407 pages.
    http://www.jbpub.com/catalog/9780763743338/
    http://www.julesberman.info/
    Last tested: August 29, 2009.

    125. Berman JJ.
    Perl: The Programming Language.
    Boston, Toronto, London, Singapore: Jones & Bartlett Publishers. 2009.
    ISBN: 9780763757588, 52 pages.
    http://www.jbpub.com/catalog/9780763757588/
    http://www.julesberman.info/
    Last tested: August 29, 2009.

    126. Berman JJ.
    Ruby Programming for Medicine and Biology.
    Boston, Toronto, London, Singapore: Jones & Bartlett Pub; 1 edition (September 13, 2007)
    ISBN-10: 0763750905, 378 pages.
    ISBN-13: 978-0763750909, 378 pages.
    http://www.jbpub.com/catalog/9780763750909/
    http://www.julesberman.info/
    Last tested: August 29, 2009.

    127. Berman JJ.
    Ruby: The Programming Language.
    Boston, Toronto, London, Singapore: Jones & Bartlett Publishers. 2009.
    ISBN: 9780763757571, 46 pages.
    http://www.jbpub.com/catalog/9780763757571/
    Last tested: August 29, 2009.

    128. Berman JJ.
    Neoplasms: Principles of Development and Diversity.
    Boston, Toronto, London, Singapore: Jones & Bartlett Publishers. 2008 Oct 1.
    ISBN: 9780763755706, 464 pages.
    http://www.jbpub.com/catalog/9780763755706/
    Last tested: August 29, 2009.

    131. Jha AK, DesRoches CM, Campbell EG, Donelan K, Rao SR, Ferris TG, Shields A, Rosenbaum S, Blumenthal D.
    Use of electronic health records in U.S. hospitals.
    N Engl J Med. 2009 Apr 16;360(16):1628-1638. Epub 2009 Mar 25.
    PMID: 19321858.
    PubMed Entry
    Last tested: August 29, 2009.

    132. Longman P.
    Best Care Anywhere: Why VA Health Care is Better Than Yours.
    Sausalito, CA: Polipoint Press. 2007 Jan 1.
    ISBN-10: 0977825302, 159 pages.
    ISBN-13: 978-0977825301, 159 pages.

    133. ActivePerl.
    A current version of a Perl interpreter (at this writing, up to version 5.10) can be downloaded from the ActivePerl website (search for Active Perl Download on google.com). For the MS-DOS version of ActivePerl, the source code and input data files are placed in folder c:\Perl\eg\.
    http://www.google.com/search?hl=en&q=activeperl+download&aq=f&oq=&aqi=g6
    Last tested: August 29, 2009.

    134. Murff HJ, Kannry J.
    Physician satisfaction with two order entry systems.
    J Am Med Inform Assoc. 2001 Sep-Oct;8(5):499-509.
    PMID: 11522770.
    PubMed Entry
    Last tested: August 29, 2009.

    135. Weir CR, Crockett R, Gohlinghorst S, McCarthy C.
    Does user satisfaction relate to adoption behavior?: an exploratory analysis using CPRS implementation.
    Proc AMIA Symp. 2000:913-917.
    PMID: 11080017.
    PubMed Entry
    Last tested: August 29, 2009.

    136. Lovis C, Payne TH.
    Extending the VA CPRS electronic patient record order entry system using natural language processing techniques.
    Proc AMIA Symp. 2000:517-521.
    PMID: 11079937
    PubMed Entry
    Last tested: August 29, 2009.

    140. Orwant J, Hietaniemi J, Macdonald J.
    Mastering Algorithms with Perl.
    Cambridge: O'Reilly. 1999.
    ISBN 1-56592-398-7, 684 pages.

    141. Till D.
    Teach yourself PERL 5 in 21 days, Second Edition.
    Indianapolis: SAMS Publishing, 1996.
    SAMS Publishing, 201 West 103rd Street, Indianapolis, IN 46290.

    143. Medsphere.org:
    Community gathering place for VistA
    http://medsphere.org/index.jspa
    Medsphere.org is a community gathering place where healthcare administrators, clinicians, developers, and enthusiasts can interact, share, and collaborate in the largest ecosystem in healthcare.
    Last tested: August 29, 2009.

    144. Schwartz RL, Phoenix T, Foy BD.
    Learning Perl, 5th Edition (Paperback)
    Publisher: O'Reilly Media, Inc.; 5th edition (July 7, 2008)
    ISBN-10: 0596520107, 348 pages.
    ISBN-13: 978-0596520106, 348 pages.

    145. Schwartz RL, Phoenix T, Foy BD.
    Intermediate Perl (Paperback)
    Publisher: O'Reilly Media, Inc.; 2nd edition (March 8, 2006)
    ISBN-10: 0596102062, 278 pages.
    ISBN-13: 978-0596102067, 278 pages.



    Last updated: August 29, 2009, by G. William Moore, MD, PhD.