From the Pathology and Laboratory Medicine Service,
Veterans Affairs Maryland Health Care System, Baltimore, Maryland [1];
Department of Pathology, University of Maryland Medical System,
Baltimore, Maryland [2]; and
Department of Pathology, The Johns Hopkins Medical Institutions,
Baltimore, Maryland [3].
Send comments and correspondence to:
George.Moore4@va.gov
Related Publications:
http://www.netautopsy.org/ascpedge.htm
http://www.netautopsy.org/ascpfrac.htm
http://www.netautopsy.org/ascpisap.htm
http://www.netautopsy.org/autocode.htm
http://www.netautopsy.org/basalcel.htm
http://www.netautopsy.org/camyxoma.htm
http://www.netautopsy.org/celdeath.htm
http://www.netautopsy.org/clearcel.htm
http://www.netautopsy.org/clearrev.htm
http://www.netautopsy.org/snomedsp.htm
http://www.netautopsy.org/protoiad.htm
http://www.netautopsy.org/natlngpr.htm
http://www.netautopsy.org/axsop/axsop.htm
Moore GW, Berman JJ.
Anatomic Pathology Data Mining.
Chapter 4. In: Cios KJ.
Medical Data Mining and Knowledge Discovery.
Berlin: Springer Verlag. 2000;4:61-107.
ISBN: 3-7908-1340-0, 502 pages.
Published within the series: "Studies in Fuzziness and Soft Computing",
Physica-Verlag Heidelberg, a Springer-Verlag Company.
Full Text of Article:
http://www.netautopsy.org/apdmchap.htm
1. DISCLAIMER.
United States Government Work, uncopyrighted,
public-domain, DRAFT COPY ONLY. This document does not necessarily
represent the views or policies of any United States Government agency.
This document is provided "as is", without warranty of any kind, express
or implied, including but not limited to the warranties of merchantability,
fitness for a particular purpose and non-infringement. In no event shall the
authors be liable for any claim, damages or other liability, whether in an
action of contract, tort or otherwise, arising from, out of, or in connection
with the document or the use or other dealings made with the document.
ABSTRACT
Pathology is the study of disease. Anatomic pathology is that area
of pathology that studies the gross anatomy and microanatomy (histology)
of diseased organs, in order to render specific diagnoses, and to acquire
new knowledge related to disease biology. One of the chief functions
of the anatomic pathologist is to issue diagnostic reports on tissue samples
(biopsies) taken from suspected lesions. Pathology reports are needed
to guide treatment of the individual patient. The aggregate collection
of free-text reports contains a wealth of information related to almost
every serious human disease. The goal of data mining in anatomic pathology
is to extract research value from collections of pathology reports.
Intended uses for anatomic pathology data mining include: epidemiology;
human tissue resources linked to clinical data; trial and outcomes analysis;
and monitoring the quality of patient care. Issues facing anatomic pathology
data mining include: security and patient confidentiality; data integrity;
and standards for pathology reports. In the future, with improved
data mining technology, we can anticipate wide use of anatomic pathology
reports in support of medical research.
4.1 Understanding the Problem Domain
4.1.1 Objectives of Data Mining in Anatomic Pathology
In the course of providing patient care, anatomic pathologists issue
diagnoses in the form of free-text pathology reports. Almost every new
diagnosis of cancer, and many diagnoses of non-neoplastic diseases,
results in a pathology report, issued from a hospital or an independent
pathology laboratory. The main, and until now essentially the only,
purpose of most of these reports has been to provide a correct
and timely diagnosis for the individual patient. The goal of data mining
in anatomic pathology is to extract additional value from these reports,
beyond their initial, designated purpose.
Over the past decade, with the growth of computers and word processors,
nearly every anatomic pathology report exists for a period of time
as a computer-readable document. Approximately 40 million pathology
reports are issued each year in the United States
(American Board of Medical Specialists, Appendix 1;
Smith et al, 1989;
Anderson et al, 1991).
If even a small percentage of these reports could be analyzed
with modern data analysis techniques, the benefits to society
could be enormous. Currently this potential data-source
is essentially untapped, due to technical, legal, and social obstacles,
none of which is insurmountable. The principal areas for potential
application of anatomic pathology data mining include:
epidemiologic studies; as a link to archival tissue specimens
for research use; for monitoring and improving patient treatments;
and for the development of new diagnostic tests.
4.1.2 Intended Uses for Anatomic Pathology Data Mining
4.1.2.1 Epidemiology
Epidemiology is the study of the distribution and determinants of disease
in specified populations, and the application of this knowledge
to the control of health problems
(Lilienfeld and Stolley, 1994).
Almost every new diagnosis of cancer, issued in the USA or in other technically advanced medical environments, results in a pathology report being generated, and an accompanying tissue sample being processed and stored. Thus, in the field of cancer epidemiology, anatomic pathology reports represent an enormous, but at this time largely untapped resource, for determining the epidemiologic distribution of cancer diagnoses, by sex, by age, by geographic location, and by a variety of environmental and occupational risk factors.
Currently, 48 U. S. states have a reporting requirement for cancer diagnoses. That is, each physician or medical institution that renders a report containing the patient's original cancer diagnosis (i.e., the biopsy upon which the patient and physician initially document the patient's cancer) has a legal obligation to report the diagnosis to state authorities. State agencies then submit this information to the Centers for Disease Control and Prevention (CDCP) (1995). In addition, many cancer diagnoses are collected by other agencies, including Surveillance, Epidemiology, and End-Results (SEER)
(Lilienfeld and Stolley, 1994;
U. S. Centers for Disease Control and Prevention, 1995;
Nelson et al, 1999;
Moulton, 1999;
Surveillance, Epidemiology, and End-Results, Appendix 1),
and the American College of Surgeons (Appendix 1),
which maintains the National Cancer Database.
Once collected, the information is evaluated by a team of experts
in medicine, public health, and statistics, who do not necessarily
foresee the needs of all possible end-recipients who might eventually
use the information. For example, SEER excludes non-melanoma
skin cancer cases. Non-melanoma skin cancer (represented primarily
by squamous carcinoma and basal cell carcinoma) accounts for well over
one million cases per year in the USA, the same order of magnitude
as the sum of cancer incidences from all other organ sites combined.
In addition, the SEER data do not include precancerous lesions,
including adenomas of colon, dysplasias of cervix, esophagus, bronchus,
oral mucosa, liver, etc. SEER also excludes atypical nevi, a precursor
lesion for malignant melanoma. The U. S. National Cancer Institute
has recognized the importance of studying the precursor lesions
that precede invasive cancer
(Klausner, 1999).
However, there is no national database that collects the pathology reports for all newly-occurring pathologic lesions in the USA. At present, all pathology databases and registries are designed to fulfill a specific objective, and are not publicly available for data mining projects. It would seem that for many potentially important cancer studies, national databases are inadequate, and data mining efforts with pre-existing archives residing in large medical institutions and clinical service laboratories may become the most practical way of supporting exploratory data mining efforts.
From time immemorial, it has been commonplace for privileged experts, who have access to biomedical data, to claim that the mechanics of releasing these data into a public venue are impractical and expensive. With the rise of the Internet, large biomedical databases have been made available, at no cost, to the general public (e.g., the U. S. National Library of Medicine's
PubMed
(Appendix 1),
the U. S. National Human Genome Research Institute's
GenBank,
and the U. S. National Cancer Institute's
Cancer Genome Anatomy Project (Appendix 1),
and summary SEER data
(Surveillance, Epidemiology, and End-Results,
Appendix 1)).
The most important obstacles against using laboratory-based collections
of pathology reports for data mining projects are:
Inadequate and inconsistent data-representation.
Reports often have missing information, and contain errors in grammar and spelling. Further, there is currently no standard format for pathology reports, so that data collected from different institutions cannot be consistently merged.
Pathology data are always highly sensitive
for the individuals involved, and the confidentiality of these data
is protected by law.
Although data miners may have strong incentives to acquire pathology data,
there are at present no incentives for the data-holders to dispense data
to researchers. On the contrary, there are numerous disincentives to sharing pathology data, including a bewildering array of still-unresolved legal, ethical and administrative issues
(U. S. Code of Federal Regulations, 1995;
U. S. Health Insurance Portability and Accountability Act, 1996;
U. S. Code of Federal Regulations, 1999;
U. S. Office of Protection from Research Risks, Appendix 1;
U. S. National Bioethics Advisory Commission, Appendix 1;
Berman et al, 1996).
If these issues were addressed and settled acceptably, researchers could,
in principle, have access to every record of every pathology diagnosis
rendered. This information would have enormous benefit to medical
researchers, epidemiologists, public health policy experts,
and, ultimately, all patients.
4.1.2.2 Data Mining for Human Tissue Resources.
Pathology informatics is inseparable from the field of tissue banking
(U. S. National Cancer Institute,
NCI Cooperative Human Tissue Network,
and
NCI Cooperative Breast Cancer Tissue Resource, Appendix 1;
Naber et al, 1992;
Grizzle et al, 1998).
This relationship is exemplified by the pathology report's specimen
accession number, which is used to identify the tissue specimen,
the microscope slides prepared from the tissue specimen, and
the pathology report. One unique number immutably links tissue to data.
Data linked to tissue samples is one of the most important resources
in the area of biomedical research that deals with translating
basic research discoveries into clinical practice.
Furthermore, tissue banking can reduce the need for animals in clinical
research. Research using human tissues can sometimes spare researchers
the time and costs involved in developing and validating animal models
for human disease. In addition, retrospective studies using existing
collections of pathology data and human tissue specimens may sometimes
obviate the need for costly prospective clinical studies, and reduce
the time required to convert laboratory discoveries into actual
patient benefits.
Pathology informatics provides a mechanism whereby researchers can test
new markers and reagents on a wide variety of human tissues, collected
from large archives maintained by pathology departments. Since the
specimens are human tissues obtained from live patients and archived
for future use, the tests performed are directly applicable to patients
(i.e., do not require extrapolation from an animal model). Since each
specimen is associated with a diagnosis as well as clinical and demographic
information, researchers can answer questions pertaining to specific clinical
conditions and outcomes. The task of the pathology informatician, more often
than not, includes the identification and retrieval of archived tissue,
selected for specified criteria related to diagnosis, stage of disease,
patient demographics, treatment status, and clinical history.
Tissue microarrays are a new tool that will have a wide range of uses,
including validation studies, as well as development of new prognostic
markers and new diagnostic tests
(Moch et al, 1999;
Bubendorf et al, 1999;
Kononen et al, 1998).
A tissue microarray is a composite tissue-block,
containing as many as one thousand small sections of tissue. Each microscope
slide cut from this block can be stained in a single procedure.
The advantage of a tissue microarray is that it permits a scientist
to perform the equivalent of hundreds of experiments at once, on a single
slide, using only a small amount of reagent. Since one tissue microarray
block can be used to produce up to several hundred near-identical glass
slides, different laboratories can compare their results obtained
from slides all obtained from the same tissue-block (i.e., from the
same set of tissues). For instance, a researcher might have a small amount
of monoclonal antibody raised against a putative marker for prostate cancer
of low malignant potential (i.e., a prognostically favorable variant
of prostate cancer). To test the value of the new marker, she performs
an immunostain procedure on a tissue microarray that contains
prostate cancer sections selected from hundreds of patients.
For each tissue on the microarray, she measures the immunostaining
intensity of the marker. In order to validate the stain, she will need
pathologic and clinical information for each patient. At this point,
the tissue microarray serves as a pathology database, consisting
of a patient-record for each tissue specimen, that contains staining
information, staging and Gleason scores for the tissue, clinical course
of the patient, recurrence data, treatment data, demographics, etc.
The results at different laboratories can be compared using a set
of tissue microarray glass slides all prepared from the same tissue-block.
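The patient-record described above can be sketched as a small data structure. This is only an illustration in Python: the class name `CoreRecord`, its field names, and the sample values are assumptions made for this example, not part of any actual laboratory system.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class CoreRecord:
    """One tissue core on a microarray, with its linked patient data.

    All field names are hypothetical, chosen for illustration.
    """
    accession_number: str          # links the core to its pathology report
    gleason_score: Optional[int]   # None when not recorded
    recurred: Optional[bool]       # clinical follow-up; None when unknown
    stain_intensity: float         # measured immunostain intensity (arbitrary units)


def mean_intensity_by_recurrence(cores: List[CoreRecord]) -> Dict[bool, float]:
    """Average stain intensity for recurrent and non-recurrent cases.

    Cores with unknown recurrence status are skipped.
    """
    groups: Dict[bool, List[float]] = {}
    for core in cores:
        if core.recurred is None:
            continue
        groups.setdefault(core.recurred, []).append(core.stain_intensity)
    return {status: mean(values) for status, values in groups.items()}


# Made-up example data, not real patient records.
cores = [
    CoreRecord("S99-0001", 6, False, 2.0),
    CoreRecord("S99-0002", 8, True, 0.5),
    CoreRecord("S99-0003", 7, True, 1.5),
    CoreRecord("S99-0004", 6, None, 3.0),
]
summary = mean_intensity_by_recurrence(cores)
```

A real validation study would need many more cores and a proper statistical test; the point here is only that each core carries both the measured stain value and the clinical data needed to interpret it.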
Each tissue microarray requires an informatics effort, in which
large archives of pathology data are mined. The desired result will be the
creation of an array of tissues and associated data, all tailored to the
goals of the researcher. Once the microarray is complete and stained,
a second informatics phase begins, involving the association of images
and measured data derived from each of the sampled tissues included
in the microarray. In the case of microarray blocks that have several
hundred slides parceled to multiple investigators, all the test result
data collected from each investigator will ideally be merged into a large
database that includes clinical data and test data for each of the
thousand samples included in each tissue microarray. This second
informatics phase will require methodology to store and access terabytes
of image and other data associated with a single microarray block.
The third informatics phase is the data analysis phase, in which conclusions
and new hypotheses are generated from the collected tissue microarray
database. This will spawn a major data mining effort involving the
development of novel informatics methodology to evaluate image data
with data linked to textual and conceptual biologic and medical information.
The Cancer Genome Anatomy Project (CGAP)
(U. S. National Cancer Institute, Appendix 1;
Clark et al, 1993;
Schmitt et al, 1999)
of the U. S. National Institutes of Health
consists of an ever-growing wealth of genetic information related to human tumors. Data include Expressed Sequence Tag (EST) databases
(Clark et al, 1993;
Schmitt et al, 1999;
Choi et al, 1999;
Hawkins et al, 1999),
gene microarray data, and data related to oncogenes and tumor suppressors found in human tumors.
Data miners are welcomed at this important public resource. It is now possible to link together data mined from pathology reports, and associated clinical data, with publicly available CGAP data. This data ensemble, in turn, can be linked to genetic data held in laboratory databases, where genetic studies were performed on clinical specimens. Trends and associations found by data mining databases that contain clinical, pathologic, and genetic data from related specimens, have enormous potential value. As a hypothetical example, suppose a laboratory with access to clinical specimens and related data has found a mutated gene present in a morphologic and clinical variant of a lung carcinoma. A search through the CGAP database might indicate the presence of the same mutation sequence in a specific set of tumors. This finding might initiate further studies, leading to the identification of a cancer gene, and also to new clinical approaches in the treatment of these cancers.
4.1.2.3 Trial and Outcome Analysis
Much of medical practice has developed as an empiric art. Treatment protocols are often accepted as standard of care, with no objective evidence. Radical mastectomies for breast cancer, hiatal hernia repairs, laser endarterectomies for advanced atherosclerotic disease, bone marrow transplants for advanced breast carcinoma, and tonsillectomies for throat infections, are all examples of medical procedures whose uses have been reduced or eliminated as the result of well-designed clinical studies showing general ineffectiveness, or effectiveness restricted to a small subset of the diseased population.
The medical community and the public are demanding
that the practice of medicine be evidence-based
(Bubley, 1999;
Junor et al, 1999;
Connell et al, 1999;
Compton, 1999;
Mapp et al, 1999).
This means that the validity of a treatment or test must be based
on well-designed studies yielding statistically significant results.
Unfortunately, such studies are very time-consuming and expensive.
Considering the enormous number of old, new, and developing medical tests
and procedures, it would be impossible to implement expensive
prospective studies in every case.
However, retrospective studies using collections of pathology data,
linked to clinical data and specimens, may sometimes yield evidence
related to the value of medical procedures, treatments, and tests.
Data and specimens can be analyzed in large numbers, especially when
variables such as gender, age, ethnicity, and clinical presentation,
treatment and outcome are known.
4.1.2.4 Monitoring and Improving Patient Care
There is an enormous potential for monitoring and improving individual patient care with mined anatomic pathology data. Currently the only universal purpose for issuing an anatomic pathology report is to provide a correct and timely diagnosis for the individual patient. However, as aggregate data, anatomic pathology reports can identify potential risks to the individual patient, as well as managerial and administrative problems in medical institutions.
This fact is recognized by national regulatory bodies, such as
the College of American Pathologists (CAP)
and the Joint Commission on Accreditation of Healthcare Organizations (JCAHO)
(Appendix 1),
which currently inspect most of the pathology laboratories in the USA.
For example, a new anatomic pathology diagnosis, such as malignant melanoma
of the skin with a positive surgical margin (indicating that not all
of the malignancy was removed by the procedure), should trigger
two consequences, documented in the records of the pathology laboratory.
First, there should be evidence that the clinician caring for the patient has been notified of an urgent diagnosis in a timely manner. In principle, every clinician should read his/her own pathology reports on a periodic basis, and should pursue delinquent reports. However, for very serious diagnoses, there is a recognized value in having a back-up notification mechanism. Second, the original urgent diagnosis of malignant melanoma should eventually be followed by a second specimen, documented in pathology laboratory records, namely, the definitive excision specimen. The absence of such a specimen in laboratory records after a predetermined period of time, should trigger a documented attempt to verify that the patient has:
undergone the appropriate procedure at another institution;
undergone another, acceptable treatment;
refused treatment;
been considered too ill to receive the usual treatment.
While the CAP and JCAHO require that each accredited institution should have such policies in place, every physician who has been in practice for a few years knows of an occasional instance where these policies have failed for the individual patient. Data mining procedures could automatically generate lists of problem cases that might require additional medical attention. Through an automated process, pathology departments could alert clinicians to potential problem cases.
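As a minimal sketch of such an automated list, the following Python fragment flags urgent diagnoses that have no later definitive specimen on record after a waiting period. The `Specimen` fields, the 90-day interval, and the sample records are all assumptions made for illustration; an actual laboratory would set its own criteria and would still need to check for treatment elsewhere, refusal of treatment, and so on.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class Specimen:
    patient_id: str      # hypothetical patient identifier
    received: date       # date the laboratory received the specimen
    urgent: bool         # True for a diagnosis that needs definitive follow-up
    definitive: bool     # True for a definitive excision specimen


def overdue_followups(specimens: List[Specimen], today: date,
                      max_wait: timedelta) -> List[Specimen]:
    """Return urgent specimens that have no later definitive specimen
    for the same patient once max_wait has elapsed."""
    overdue = []
    for s in specimens:
        # Skip non-urgent cases and cases still within the waiting period.
        if not s.urgent or today - s.received <= max_wait:
            continue
        followed_up = any(
            t.definitive
            and t.patient_id == s.patient_id
            and t.received > s.received
            for t in specimens
        )
        if not followed_up:
            overdue.append(s)
    return overdue


# Made-up example records.
records = [
    Specimen("P1", date(1999, 1, 5), urgent=True, definitive=False),
    Specimen("P1", date(1999, 2, 1), urgent=False, definitive=True),
    Specimen("P2", date(1999, 1, 10), urgent=True, definitive=False),
    Specimen("P3", date(1999, 5, 20), urgent=True, definitive=False),
]
flagged = overdue_followups(records, today=date(1999, 6, 1),
                            max_wait=timedelta(days=90))
```

In this made-up data only the second patient is flagged: the first has a later excision specimen, and the third is still within the waiting period.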
4.1.3 Economic Issues in Anatomic Pathology Data Mining
Pathology departments offering new services, such as informatics and specimen-related support, should be compensated for providing those services. As an example, one major reason for the long and dramatic decline in autopsy rates has been the inability of most pathology departments to find mechanisms for adequate monetary compensation for autopsies.
In addition to specimen-handling costs, the preparation of reports
in a manner that supports informatics initiatives will also require
incentives. Third parties might pay a bonus for reports that
are well-coded using standard terminologies that permit more accurate
billing, better data analysis, improved outcome studies, etc.
Advocacy groups might be motivated to seek support for efforts
to standardize pathology reports when they come to understand the potential value of pathology data mining. It has also been suggested that laboratory information system vendors might invest in informatics initiatives that are likely to add value to their services. It is a commonly held perception that pathology informatics could add significant value to pathology reports, as well as facilitate better research support. Some examples of improved reporting include: posting images on reports; coding data to make reports more useful to third parties; using The Bethesda System and other reporting standards in a more consistent manner
(Hutchins, 1990;
Hutchins et al, 1999;
Rosai and Ackerman, 1996;
Hruban et al, 1996;
Association of Directors of Anatomic and Surgical Pathology, 1997;
Kurman et al, 1991);
and permanently and securely associating tissue specimens
with pathology data.
4.1.4 Commercial Uses of Mined Anatomic Pathology Data.
Pathology data and archived tissues have enormous potential value
to commercial entities. Imagine the case of an ambitious genomics
company that has just cloned a gene found in an
EST (expressed sequence tag)
gene library obtained from a prostate cancer cell line, and which was not
found in EST libraries derived from normal cells or from other types of cancer cells. The company knows that if a prostate-specific gene can be isolated, it might potentially be useful as a diagnostic test for prostate cancer. The gene may even be exploitable for prostate cancer treatment, if an antisense version of the gene could be produced that interferes with production of the normal gene in vivo. The company would need information and tissues related to prostate cancer patients, in order to locate prostate cancer cases representing the spectrum of human disease, including precancerous lesions and lesions of varying clinical stage and prognosis. The company would use the tissues to help develop a gene assay and to test for the presence of the gene in a large number of prostate specimens, as well as in specimens from normal tissues and cancer tissues from other organs. Since bioinformatics companies do not have their own pathology departments, they would need to obtain contractual services from a pathology laboratory. The laboratories would need to be large, with high-quality pathology and clinical data, and with well-characterized, retrievable tissues. The laboratory would need to provide its service in such a way that ethical and legal standards of operation are maintained. The company would need to compensate the laboratory to an extent commensurate with the market value of the service.
The pharmaceutical industry, genomics companies, privately supported laboratories, and federally supported research and development laboratories, all have a growing need for data and tissues. Even academic researchers in need of pathology data and tissues must compete with commercial laboratories in producing a satisfactory incentive for pathology laboratories to provide them with data and tissues.
4.1.5 Past and Current Efforts to Standardize Pathology Data.
Efforts to standardize anatomic pathology data have proceeded
on three fronts: standardized coding; standard formatting of anatomic
pathology reports; and standards for publishing collections of reports.
The three dominant coding systems in anatomic pathology are:
SNOMED, Read, and UMLS.
The Systematized Nomenclature of Human and Veterinary Medicine
(SNOMED International, or SNOMED III)
(Appendix 1;
Cote et al, 1993;
Moore and Berman, 1994)
is a copyrighted product of the College of American Pathologists
(Appendix 1),
and one of the recognized standard nomenclatures in medicine.
SNOMED is used in most pathology laboratories that employ
standardized coding systems. The Read Classification
is employed by the National Health Service in Great Britain
(Read and Benson, 1986;
Payne, 1995),
and is owned by the British Crown. SNOMED, Read, MeSH,
and over fifty other classifications are subsumed within the
Unified Medical Language System (UMLS) Metathesaurus
of the U. S. National Library of Medicine (Appendix 1).
The UMLS is by far the largest of all medical concept systems
(Hahn et al, 1999),
and is the best tool for research studies in controlled medical vocabularies.
UMLS serves as an indexing tool for PubMed (Appendix 1), a collection of over nine million medical citations available on the Internet to the general public. It has been estimated that only 1-3% of medical concepts in general medical text are not present in UMLS (Humphries, Appendix 1; Kao and Moore, 2000). UMLS provides a uniform, integrated distribution format for over fifty biomedical vocabularies and classifications. The 1999 UMLS Metathesaurus contains 625,530 biomedical concepts and 1,362,823 different concept-names. Each Concept Unique Identifier (CUI) consists of the letter C followed by seven decimal digits, with an accompanying synonym-name. Different synonym-names for the same concept share the same CUI. UMLS is updated annually and made available to registered users, with a complete listing of CUIs and synonym-names. Since the UMLS is available cost-free to researchers worldwide for research projects, it makes the most sense to develop pathology data mining applications in UMLS.
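The CUI format and the many-to-one relation between synonym-names and CUIs can be illustrated with a short Python sketch. The CUI values below are placeholders invented for this example, not actual Metathesaurus identifiers, and the tiny synonym table and the function names are likewise assumptions.

```python
import re
from typing import Dict, Optional

# A CUI is the letter C followed by exactly seven decimal digits.
CUI_PATTERN = re.compile(r"C\d{7}")


def is_valid_cui(cui: str) -> bool:
    """Check only the format of a CUI, not whether it exists in UMLS."""
    return CUI_PATTERN.fullmatch(cui) is not None


# Placeholder CUIs (invented); several synonym-names map to one CUI.
SYNONYMS: Dict[str, str] = {
    "basal cell carcinoma": "C9999901",
    "carcinoma, basal cell": "C9999901",
    "squamous cell carcinoma": "C9999902",
}


def lookup_cui(term: str) -> Optional[str]:
    """Return the CUI for a synonym-name, ignoring case and extra spaces."""
    normalized = " ".join(term.lower().split())
    return SYNONYMS.get(normalized)
```

A lookup of either "Basal cell carcinoma" or "Carcinoma, Basal Cell" returns the same placeholder CUI, while a term absent from the table returns `None`.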
One of the major limitations in coding systems for anatomic pathology
today is that there is no agreed-upon syntax for an anatomic pathology
report, and no agreed-upon method for translating an anatomic pathology
report written in plain English into a given coding system for publication.
It is our perception that no two hospital systems encode their reports
with the same set of coding rules. Consequently, there are no examples
of different hospital systems merging their encoded pathology datasets
into an aggregate database.
4.1.6 Data Integrity Issues.
Currently nearly every anatomic pathology report exists for at least
a while as an electronic file; but beyond this, there are few standards.
Patient identification, provider identification, date/time conventions,
name of tissue source, name of diagnosis, and even word processor
formatting conventions, are all idiosyncratic to the pathology laboratory
that issues the report. Even the presence or absence of certain pieces of information is not guaranteed.
The most unstructured part of the pathology report, and perhaps the part in greatest need of accurate data mining, is the microscopic diagnosis (Moore and Berman, 1994). At a minimum, this field should contain the disease name and bodysite for each anatomic diagnosis. Almost equally important, the sentences must be appropriately separated from one another. At this time, there is virtually no control on data-integrity in the microscopic diagnosis, beyond the fastidiousness of the pathologist and the competence of the typist. Sentences may not contain complete information. Sentences may be written with convoluted grammatical constructions, including run-on sentences and vaguely placed negations. There may be spelling errors, idiosyncratic abbreviations, or ambiguous terms. The following examples are potentially ambiguous sentences. In most cases, the meaning is more-or-less apparent to a trained health professional, but might easily be misinterpreted by a data mining program.
Squamous cell carcinoma. (Which bodysite?)
Liver showing metastatic adenocarcinoma, portal lymph node showing reactive hyperplasia. (Run-on sentence.)
Liver showing metastatic adenocarcinoma No tumor present in portal lymph node. (Ambiguous negation.)
Skin with sqamous cell carcinoma. (Spelling error.)
Skin with BCC. (Idiosyncratic abbreviation: BCC=basal cell carcinoma.)
Cervical soft tissue with metastatic adenocarcinoma. (Neck? Uterine cervix?).
RLL. (Right lobe of lung? Right lower lung? Right lower lobe of lung? Right lower lid of eye?)
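A data mining program could at least flag sentences of this kind for human review. The Python sketch below is a deliberately simple heuristic, assuming tiny hand-made word lists (`BODY_SITES`, `ABBREVIATIONS`, `AMBIGUOUS_TERMS`) that are invented for this example; a practical system would draw on a controlled vocabulary instead. It does not attempt to detect run-on sentences, negation, or spelling errors.

```python
import re
from typing import List

# Tiny illustrative word lists (assumptions, not a real vocabulary).
BODY_SITES = {"liver", "skin", "lung", "lymph node", "soft tissue"}
ABBREVIATIONS = {"BCC", "RLL"}
AMBIGUOUS_TERMS = {"cervical"}


def flag_sentence(sentence: str) -> List[str]:
    """Return a list of reasons why a diagnosis sentence may be ambiguous."""
    flags: List[str] = []
    lowered = sentence.lower()

    # Flag sentences that name no bodysite from the known list.
    has_site = any(
        re.search(r"\b" + re.escape(site) + r"\b", lowered)
        for site in BODY_SITES
    )
    if not has_site:
        flags.append("no recognized bodysite")

    # Flag known abbreviations and terms with more than one meaning.
    for word in re.findall(r"[A-Za-z]+", sentence):
        if word in ABBREVIATIONS:
            flags.append("abbreviation: " + word)
        elif word.lower() in AMBIGUOUS_TERMS:
            flags.append("ambiguous term: " + word.lower())
    return flags
```

Applied to the examples above, "Squamous cell carcinoma." is flagged for a missing bodysite, "Skin with BCC." for its abbreviation, and "RLL." for both.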
4.1.7 Ethical and Legal Issues
Current U. S. federal regulations governing the protection of human subjects are contained in Title 45 of the Code of Federal Regulations, Part 46 (also known as the Common Rule, or 45CFR46) (Appendix 1). Although different U. S. states have passed their own laws that restrict the uses of medical information, 45CFR46 is the most general and comprehensive document on this subject, detailing the functions and operations of Institutional Review Boards (IRBs), and applicable to all research funded by or regulated by any U. S. Federal Department or Agency. The National Bioethics Advisory Commission (NBAC) (Appendix 1) has recently expressed its views in a document that provides researchers, IRB members, and federal agencies with the NBAC's interpretation of 45CFR46 (Appendix 1).
Several provisions of 45CFR46 directly relate to pathology informatics initiatives. Under 45CFR46, a subject must be living in order to be a human subject protected under Common Rule. Consequently, autopsy data and tissues may be used freely by the research community.
In addition, records and tissues that have been irreversibly unlinked from patient identifiers are specifically exempted from 45CFR46 restrictions (so-called Exemption 4). Research involving anonymous tissue and data can proceed without IRB review, but this exemption only applies to data and tissues that can never be traced back to the patient, either by the pathologist who contributes materials to the researcher, or by the researcher who generates new data based on evaluation of the provided material (e.g., data mining), or through laboratory examinations of the tissue.
The practice of removing all links between a patient's data and the patient's identity is called anonymization. The anonymization process relieves the researcher from the regulatory burdens and the IRB review process required by 45CFR46. It has been our experience that many researchers believe that anonymizing data is a foolproof method for protecting the interests of both patient and researcher. It would seem reasonable that in an anonymized study, the patient would certainly be no worse off than if the study had not been performed at all.
Nonetheless, ethicists have made some cogent arguments against research using anonymized tissues or data. In our view, the most compelling argument against research using anonymized tissues arises in the instance in which life-saving information is discovered during the course of the research. Imagine the researcher's predicament when a treatable disease is diagnosed on an anonymized tissue sample. By definition, anonymized tissue samples have had all identifying links to the patient removed. There would be no way of identifying the patient who should be notified of research findings.
From the researcher's perspective, anonymized tissues and data make it impossible to validate research results. When the unique linkage between tissue (or data) and the patient is removed, there is no way of verifying data. In an anonymized database, one cannot be certain that the data haven't been biased by multiple submissions on a single patient. If one suspects that a particular laboratory may have contaminated its contributions to the archive, there is no way of identifying those samples. Suppose that particular data-records are incomplete. Anonymized data do not permit the researcher to identify and contact a subject, in order to verify or complete a data-record. In the absence of any mechanism to verify data-records, how can one accept the conclusions drawn from the data?
Draft NBAC recommendations consider a mechanism whereby certain types of studies might be approved by the IRB without meeting Exemption 4. Under opinions emerging from NBAC, IRBs may categorize research proposals by the risks they impose on human subjects. In general, research that utilizes in-place data and tissues (i.e., materials collected in the regular course of patient treatment and not collected as part of a research protocol), would be considered minimal risk research projects. The only risk to the patient would be the loss of confidentiality of medical record data. In these cases, the IRB may examine the proposal to determine whether the researchers have a plan that assures that patient confidentiality will be well-protected. The IRB, at that point, might address the question of whether the potential value to society of the proposed research outweighs the minimal risk to the patient for whom confidentiality might be breached. If the research meets these requirements, then the IRB may grant approval for the research without requiring data anonymization or obtaining informed consent (in the case of retrospective data and tissues).
How can patient confidentiality be protected when the data are not anonymized? The authors have previously demonstrated the use of a doubly-encrypted brokered tissue database (Berman et al, 1996). In this model, providers of pathology data encrypt patient identifiers and send their data to a database administrator. The database administrator then performs a second encryption on the identifiers, before releasing the data to the scientific community. Data miners could freely use the data in such a pathology database, without access to the identity of the patients included in the database. Suppose, for some reason, a data miner needs to collect additional data on certain patient records (e.g., survival or treatment data). The researcher would send a request for additional information to the database administrator, along with the relevant records, each containing doubly-encrypted patient identifiers. The database administrator would perform a single decryption of the patient identifiers, and forward the request to the institution that contributed the data-record. The provider IRB then reviews the request, and determines whether it would be legal and ethical to perform the final decryption step linking the data-record with the patient. This decision might involve obtaining patient consent. If the final decryption is approved, then the patient is identified. The additional information is returned to the database administrator, after tagging the data-record with the re-encrypted identifier. The database administrator then sends the records to the researcher, after tagging the data with the doubly-encrypted identifier. Throughout the process, the database administrator and the researcher never learn the identity of the patient. It is interesting that the field of cryptography provides the solution (double-encryption) to the greatest legal and ethical obstacle against progress in the field of pathology informatics.
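The round trip described above can be sketched in a few lines. This is a minimal illustration, not the implementation of Berman et al (1996). Each "encryption" layer is modeled here as a private table of random tokens, standing in for a real cipher. The class name TokenLayer, the patient identifier, and the follow-up field are hypothetical.

```python
import secrets


class TokenLayer:
    """One 'encryption' layer, modeled as a private random-token lookup table."""

    def __init__(self):
        self._forward = {}  # identifier -> token
        self._reverse = {}  # token -> identifier

    def encrypt(self, identifier):
        # Reuse the same token for the same identifier so records stay linkable.
        if identifier not in self._forward:
            token = secrets.token_hex(8)
            self._forward[identifier] = token
            self._reverse[token] = identifier
        return self._forward[identifier]

    def decrypt(self, token):
        return self._reverse[token]


provider = TokenLayer()  # table held only by the contributing institution
broker = TokenLayer()    # table held only by the database administrator

# Contribution: the provider encrypts, then the administrator encrypts again.
once = provider.encrypt("patient-0001")  # hypothetical patient identifier
twice = broker.encrypt(once)
released_record = {"id": twice, "diagnosis": "basal cell carcinoma"}

# Follow-up request: the researcher sends back the doubly-encrypted identifier.
back_to_once = broker.decrypt(released_record["id"])  # administrator's single decryption
# Only after approval by the provider IRB is the final layer removed.
patient = provider.decrypt(back_to_once)

# The answer travels back through both layers before reaching the researcher.
followup = {"id": broker.encrypt(provider.encrypt(patient)), "survival_months": 60}
```

In this sketch the researcher sees only the value stored in released_record["id"], and the administrator sees only the singly-encrypted token, so neither can recover the patient identifier without the provider's table.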
Of some interest is the fact that 45CFR46 applies to all federally funded research involving human subjects. However, research conducted with absolutely no federal tie-in is not covered by 45CFR46 regulations. Today, medical insurers and health maintenance organizations freely use data collected on patients for a variety of data mining activities, that may eventually impact negatively on certain groups of patients. Consider the medical insurer who employs data mining techniques to identify persons at high risk for cancer or other chronic diseases, for the purpose of removing those patients from insurance enrollment. Is this an ethical use of data mining?
4.2 Understanding the Data.
4.2.1 Size of Potential Anatomic Pathology Data Domain.
On August 1, 1998, there were 17,974 board-certified pathologists
in the USA (American Board of Medical Specialties, Appendix 1),
who spend an estimated 42.5% of their time in the practice
of diagnostic surgical pathology (Smith et al, 1989; Anderson et al, 1991).
Other activities that occupy a pathologist include: teaching, research,
administration, diagnostic cytopathology, autopsy pathology, and
clinical pathology (laboratory medicine). A full-time surgical pathologist,
who does nothing but practice diagnostic surgical pathology,
is expected to issue approximately 5,300 surgical pathology reports per year.
Autopsies account for only a tiny percentage of anatomic pathology reports.
Cytopathology reports are predominantly screening procedures, and usually
do not contain the final diagnoses required for data mining investigations.
Hence, the pathology establishment in the USA issues approximately
17,974 x 5,300 x 42.5% = 40 million
surgical pathology reports annually, or about one surgical pathology report per year for each six persons in the USA.
In general, large medical centers and large health-care networks, such as the Veterans Affairs medical centers and Kaiser Permanente, maintain archival computerized surgical pathology reports indefinitely. However, many community hospitals maintain computerized reports for two years (as required by pathology laboratory regulatory bodies), and then transfer surgical pathology reports to an inactive storage medium, such as microfiche, in which reports are essentially unrecoverable for data mining purposes. This downloading policy made economic sense about a decade ago when computer storage was relatively expensive. The persistence of this policy in this modern era of cheap computer storage represents administrative inertia, and the absence of financial and social incentives to convert laboratories toward more computerized medical record management.
Each surgical pathology report is approximately one kilobyte in size, i.e., one double-spaced typewritten page, of which approximately 100 bytes form the free-text microscopic diagnosis. The other 900 bytes consist of accession numbers, patient identifiers, date/time stamps, and the gross description text. Thus each year, an estimated 40,000,000,000 bytes (40 Gigabytes, 40 GB) of surgical pathology reports are issued in the USA, an amount of data small enough to fit on a single hard-disk drive.
Academic medical centers and Veterans Affairs medical centers generate an estimated 30% of all reports (Anderson et al, 1991), with a decade of legacy data, for a total of 12 GB of surgical pathology report information. Community hospitals generate 70% of all reports, with two years of legacy data, for a total of 5.6 GB of surgical pathology report information. Thus we estimate that there is a total of 17.6 GB of legacy surgical pathology report information potentially available for data mining in the USA, with 4 GB of new information generated annually. These figures are consistent with counting only the approximately 100-byte free-text microscopic diagnosis in each report (40 million reports x 100 bytes = 4 GB per year); counted as full one-kilobyte reports, each figure would be ten times larger.
Clinical pathology (laboratory medicine) involves medical laboratory tests on fluids taken from the patient's body (blood, urine, etc.). Each byte of anatomic pathology data corresponds to approximately 150 bytes of clinical pathology data. Thus each year, six terabytes (six trillion bytes) of clinical pathology data are generated in U. S. medical centers. While most of these data are numerical in character, the data often require non-numerical contextual information about the patient in order to be meaningfully interpreted in data mining investigations.
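The estimates in this section can be checked with a short calculation. The sketch below uses only the figures quoted in the text. The variable names are ours. The legacy totals are reproduced on the assumption that they count the roughly 100-byte diagnosis text rather than the full one-kilobyte report.

```python
GB = 1e9
TB = 1e12

pathologists = 17_974
reports_per_fulltime_year = 5_300
surgical_fraction = 0.425

# Reports issued per year, before rounding to 40 million.
reports_per_year = pathologists * reports_per_fulltime_year * surgical_fraction
rounded_reports = 40_000_000

# Annual volume for full reports (about 1 kilobyte each) and for the
# free-text microscopic diagnosis alone (about 100 bytes each).
annual_full_gb = rounded_reports * 1_000 / GB
annual_diagnosis_gb = rounded_reports * 100 / GB

# Legacy data, assuming the diagnosis-text basis (an interpretation, see above).
legacy_academic_gb = 0.30 * annual_diagnosis_gb * 10   # a decade of retention
legacy_community_gb = 0.70 * annual_diagnosis_gb * 2   # two years of retention
legacy_total_gb = legacy_academic_gb + legacy_community_gb

# Clinical pathology: about 150 bytes per byte of anatomic pathology data.
clinical_tb = annual_full_gb * GB * 150 / TB

print(f"{reports_per_year:,.0f} reports per year")
print(f"{annual_full_gb:.1f} GB per year (full reports)")
print(f"{legacy_total_gb:.1f} GB legacy diagnosis text")
print(f"{clinical_tb:.1f} TB clinical pathology data per year")
```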
4.2.2 Description of an Anatomic Pathology Report.
Every pathology report begins as pieces of human tissue, submitted to a pathology laboratory with accompanying paperwork. In the simplest case, there is one piece of tissue, obtained from one surgical procedure, arriving in one container, from one appropriately identified patient, with one accompanying page of paper that contains matching identifiers, and a relevant medical history in plain English, with correct spelling and grammar. Roughly half the specimens received in a typical pathology department conform to this description. Furthermore, this model can serve as the basis for understanding the more complex situations, in which an accessioned case arrives in the pathology laboratory with multiple specimens from one or more procedures performed on the patient.
When the specimen and paperwork arrive in the laboratory, an accession clerk verifies the paperwork, and assigns a unique accession number to the specimen-and-paperwork ensemble, known as an accession. A sample surgical pathology report is illustrated in Table 1. The format for this pathology report is United States Government Tissue Examination Form, Standard Form 515. All U. S. Government installations use the same form, and academic and community hospital tissue examination forms contain essentially the same information, although details differ. Different institutions enforce the completion of these forms with different degrees of strictness. For example, at our institution, a form with a patient-name not recognized in the hospital database, a form without a physician-name or unsigned by a physician, or a form without a patient-history, is not accepted, and an effort is made to resolve any problems with the submitting physician. These procedures are minimum requirements set by the College of American Pathologists (Appendix 1).
There are four general classes of data in an anatomic pathology report:
Assigned numbers (accession number, procedure number, etc.).
Date/time stamps (date obtained, date received, date released).
Person (patient, submitting physician, pathologist).
Clinicopathologic information (brief clinical history, gross description, microscopic diagnosis).
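As a rough illustration, the four classes of data listed above might be grouped in a single record type. The field names below are hypothetical and do not reproduce the layout of Standard Form 515 or of any particular LIS. All values in the example are invented.

```python
from dataclasses import dataclass, fields


@dataclass
class PathologyReport:
    # 1. Assigned numbers
    accession_number: str
    procedure_number: str
    # 2. Date/time stamps
    date_obtained: str
    date_received: str
    date_released: str
    # 3. Persons
    patient: str
    submitting_physician: str
    pathologist: str
    # 4. Clinicopathologic information (free text)
    clinical_history: str
    gross_description: str
    microscopic_diagnosis: str


# Invented example values, for illustration only.
report = PathologyReport(
    accession_number="S99-00001",
    procedure_number="1",
    date_obtained="1999-03-01",
    date_received="1999-03-01",
    date_released="1999-03-03",
    patient="patient-0001",
    submitting_physician="physician-0042",
    pathologist="pathologist-0007",
    clinical_history="Skin lesion, left cheek.",
    gross_description="Ellipse of skin, 1.2 x 0.6 cm.",
    microscopic_diagnosis="Skin, left cheek, excision: Basal cell carcinoma.",
)
```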
Accession numbers are assigned by the pathology laboratory, and are used to keep track of what specimens arrived in the laboratory. This accession number assignment is ideally carried out in one physical location, using a computerized Laboratory Information System (LIS) with a parallel offline accessioning system (logbook). The LIS assigns the accession number, which should be sequential and non-duplicated. If specimens are accepted and accession numbers are assigned at more than one physical location, then great care must be taken to keep numbering assignments at all the physical locations in synchrony. This requirement is trivial as long as the LIS is always functioning everywhere, but may become very convoluted when different accession areas have dyssynchronous periods of computer downtime.
Date/time stamps include: date obtained, date received, date released. The LIS should not accept date/time information that is inconsistent, such as a specimen obtained at a date/time later than specimen received. On the other hand, reality is unpredictably complex, and there must be mechanisms to override apparent inconsistencies in the usual sequence-of-events in a pathology report. For example, what if the exact date of a particular event is not known? A patient may be unable to recall when a particular symptom first appeared; and for some older or immigrant patients, even the birthdate may be uncertain.
There must be a formalism for managing inexact dates. In the VISTA computer system used by Veterans Affairs Medical Centers, date/time is denoted by seven decimal digits, followed by a decimal point, followed by six decimal digits. The first digit denotes century (0=1700, 1=1800, 2=1900, etc.). Thus, U. S. Veteran George Washington (born 1732) has a century digit of 0; and U. S. Veteran George Bush (born 1924) has a century digit of 2 (DeGregorio, 1997). Digits two and three denote year; digits four and five denote month (01=January, etc.); and digits six and seven denote day. The first and second post-decimal digits denote hour (24-hour clock); the third and fourth post-decimal digits denote minute; and the fifth and sixth post-decimal digits denote second. Missing values in the date/time are denoted with zero. Thus, an event happening during an unknown month in 1999 is denoted 2990000.000000; and an event happening at an unknown time on August 27, 1908 (birthdate of U. S. President Lyndon B. Johnson (DeGregorio, 1997)) is denoted 2080827.000000. The VISTA date/time numbering system is ideally suited for managing patient confidentiality in a public data mine resource. All events can be rounded off to year-of-occurrence, simply by replacing all digits past the third digit with zero; or to decade-of-occurrence, by replacing all digits past the second digit with zero.
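The date/time layout described above can be sketched as follows. This is an illustration of the format as stated in the text, not code taken from VISTA. The function names are ours, and the value is kept as a string so that the zero-filled digits survive.

```python
def vista_datetime(year, month=0, day=0, hour=0, minute=0, second=0):
    """Encode a date/time in the seven-digit, point, six-digit layout.

    Unknown components are passed as zero, as the text describes.
    """
    if not 1700 <= year <= 2699:
        raise ValueError("the century digit covers the years 1700 through 2699")
    century = year // 100 - 17  # 0=1700, 1=1800, 2=1900, ...
    return (f"{century}{year % 100:02d}{month:02d}{day:02d}"
            f".{hour:02d}{minute:02d}{second:02d}")


def round_to_year(stamp):
    """Replace all digits past the third digit with zero."""
    return stamp[:3] + "0000.000000"


def round_to_decade(stamp):
    """Replace all digits past the second digit with zero."""
    return stamp[:2] + "00000.000000"


unknown_month_1999 = vista_datetime(1999)
lbj_birthdate = vista_datetime(1908, 8, 27)
print(unknown_month_1999)            # 2990000.000000
print(lbj_birthdate)                 # 2080827.000000
print(round_to_year(lbj_birthdate))  # 2080000.000000
```

One consequence of the zero convention, visible in the sketch, is that a rounded value cannot be told apart from a genuinely unknown one: 2080000.000000 may mean "rounded to 1908" or "month and day unknown".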
Each person named on the report (patient, submitting physician, pathologist) must match up to a person on a list, who can be reached as necessary for purposes of notification, billing, and followup. It is critical that the paperwork reliably identifies the patient and the submitting physician, and unambiguously links the paperwork to the specimen. Correct identification is not easy to achieve, but is not an idle luxury. A misidentification could mean the failure to assign a serious diagnosis to a patient, or assigning a diagnosis to the incorrect patient. Patient injury and legal action could ensue.
Unusual names must be spelled correctly, and common names must be distinguished for different actual patients. For example, a thousand-bed U. S. hospital can expect to add one new Mary Smith every week to its patient identification system. The U. S. Social Security Number is not an acceptable method for distinguishing among all the Mary Smiths, since there is a known error rate in the social security system (about one percent), and the immediate consequence of a misidentified patient is far more serious than that of a misplaced social security payment. It is not acceptable to assign Mary Smith a new identification number each time her physician submits a new specimen to the pathology laboratory, because the certain knowledge of a patient's prior diagnoses is critical toward understanding each subsequent diagnosis. Finally, uncommon names with unusual spellings must be spelled consistently for each entry into the LIS, and aliases (i.e., different names used by the same patient) must be known and managed appropriately by the LIS.
Clinicopathologic information includes: brief clinical history, gross description, microscopic diagnosis, and typically appears on a pathology report as free-text. This free-text is the most unstructured part of the pathology report, and is difficult both to enter correctly and to recover satisfactorily in data mining investigations. Clinical histories are received from other departments, so that one makes the effort to clarify an unintelligible clinical history only if the clinical information is perceived as critical to the final diagnosis on the pathology report. In many cases, medical histories replete with misspellings, non sequiturs, and missing information are accepted. Gross description and microscopic diagnosis are under the administrative control of the pathology laboratory, but there is typically little motivation to achieve standards of spelling and punctuation beyond those necessary for a presentable report.
The microscopic diagnosis field contains the final result of the anatomic pathology examination. The microscopic diagnosis is referred to by a number of different names, including diagnostic impression, diagnosis, microscopic evaluation, and even report. The lack of consistent terminology for common data elements poses yet another obstacle for data mining efforts using stored pathology reports. In our experience, approximately half the microscopic diagnosis fields from anatomic pathology reports must be copyedited before the reports are optimally suitable for data mining investigations. In some cases, this copyediting might consist of little more than inserting appropriate sentence terminators. In other cases, major revision might be necessary to correct misspellings, and to reconstruct sentences into a grammatically correct and unambiguous text.
4.2.3 Converting Parts of a Report into Object Domains
4.2.3.1 Pathology Report Database
In the Laboratory Information System (LIS), the report database is separated into fields, and certain minimum standards are enforced by the system. For example, the date/time that the specimen was removed from the patient must precede temporally the date/time that the specimen was received in the pathology laboratory; and in turn, the date/time that the specimen was received in the pathology laboratory must precede the date/time that a final report was issued by the pathology laboratory. The name of the patient must correspond to an identifier recognized by a third-party-payer (insurance carrier); the name of the submitting physician must correspond to an identifier to whom the report can be sent (in order to receive payment); and the name of the attending pathologist must correspond to an identifier who can be disciplined if the report is tardy. The advantage of such a LIS is a level of managerial control over workloads, turnaround times, and billing that a simple word processor system cannot offer. An important byproduct of this enforced structure is that such systems contain data that are usable in data mining projects. However, in the commercial setting, there is always a tradeoff between strict enforcement of standards and offending the customer, so that optimal data standards are not always achieved.
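The temporal ordering rule can be sketched as a small check. This is an illustration only, not the behavior of any particular LIS. The function name and the override flag are hypothetical, the flag standing in for the override mechanism that real-world exceptions require.

```python
from datetime import datetime


def check_sequence(obtained, received, released, override=False):
    """Return a list of ordering problems; raise unless they are overridden."""
    problems = []
    if obtained > received:
        problems.append("specimen obtained after it was received")
    if received > released:
        problems.append("report released before specimen was received")
    if problems and not override:
        raise ValueError("; ".join(problems))
    return problems


obtained = datetime(1999, 3, 1, 9, 30)
received = datetime(1999, 3, 1, 11, 0)
released = datetime(1999, 3, 3, 16, 45)

# A consistent report passes with no problems recorded.
ok = check_sequence(obtained, received, released)

# An inconsistent report is rejected unless an authorized user overrides it,
# in which case the problems are returned so that they can be logged.
flagged = check_sequence(received, obtained, released, override=True)
```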
In complex cases, the submitting physician may submit multiple pieces of tissue from the same patient, taken during a single outpatient visit or inpatient encounter. For example, Table 1 describes two containers taken at the same encounter from the patient, and consecutively numbered by the submitting physician. The final pathology report must reproduce the numbering produced by the submitting physician. All discrepancies in the actual containers received and their descriptions on accompanying paperwork must be resolved in the final report.
There are two other natural divisions in a multi-container anatomic pathology accession: procedure and bodysite. Many multi-part pathology reports do not unambiguously reflect these divisions; and from our conversations with colleagues, we are convinced that some of our colleagues do not clearly understand these divisions. However, a correct assignment of these divisions is essential for making sense of the data in any anatomic pathology data mine resource, and for many quality assurance surveys.
For example, suppose that a surgeon performs a total laryngectomy on a patient with previously diagnosed squamous cell carcinoma of the left true vocal cord. The patient has hard, palpable lymph nodes (probable metastatic cancer) on both sides of the neck, so the surgeon performs a bilateral radical neck dissection in the same operative session. The surgeon also notices two small, irregular black macules (i.e., flat skin discolorations), one over the right clavicle, the other over the left scapula, and removes them for diagnosis. The laryngectomy specimen arrives in the pathology laboratory in container #1. Ten surgical margin specimens arrive in containers #2 through #11. The right radical neck dissection arrives in container #12, and the left radical neck dissection arrives in container #13. The right clavicular skin excision arrives in container #14, and the left scapular skin excision arrives in container #15.
These fifteen containers separate logically into five surgical procedures, namely, laryngectomy, right radical neck dissection, left radical neck dissection, right clavicular skin excision, and left scapular skin excision. The laryngectomy procedure subdivides into at least sixteen bodysites; the right and left radical neck dissections subdivide into nine bodysites apiece; and each of the skin excisions and their surgical margins subdivide into five bodysites apiece (Rosai and Ackerman, 1996; Hruban et al, 1996; Association of Directors of Anatomic and Surgical Pathology, 1997). The structure of the mined database must capture these divisions and subdivisions, in order to be optimally useful to epidemiologists and tissue resource specialists.
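The laryngectomy example can be written down as a small nested structure. This is a sketch of one possible representation, not a proposed standard. It assumes that the ten surgical margin containers belong to the laryngectomy procedure, which the text implies but does not state outright. The bodysite numbers are the counts given above.

```python
# Procedure -> containers received from the submitting surgeon.
procedures = {
    "laryngectomy": list(range(1, 12)),  # container #1 plus margins #2-#11 (assumed)
    "right radical neck dissection": [12],
    "left radical neck dissection": [13],
    "right clavicular skin excision": [14],
    "left scapular skin excision": [15],
}

# Number of bodysites per procedure, as stated in the text
# ("at least sixteen" for the laryngectomy).
bodysite_counts = {
    "laryngectomy": 16,
    "right radical neck dissection": 9,
    "left radical neck dissection": 9,
    "right clavicular skin excision": 5,
    "left scapular skin excision": 5,
}

# Reverse index: which procedure does a given container belong to?
container_to_procedure = {
    container: procedure
    for procedure, containers in procedures.items()
    for container in containers
}

print(container_to_procedure[7])  # laryngectomy
```

A mined database that stores only the flat list of fifteen containers loses exactly the grouping that the reverse index restores.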
Furthermore, the relationship between one specimen and another specimen taken from the same patient at a different time may be indeterminate. For example, for many skin cancers, a particular report may correspond to a new lesion on the patient or to the recurrence of a previously diagnosed lesion, taken either at the same hospital or at another institution. All too often, the bodysite designation provided by the submitting physician does not provide enough anatomical detail even for a human expert to make this distinction from the content of the report alone.
Multiple cancer lesions from the same patient might represent metastases or multiple independent lesions. Sometimes, the submitting physician does not include relevant clinical or radiologic information that might resolve this question.
Sometimes the submitting physician does not even accurately identify the site of origin of a specimen, omitting such details as left versus right, medial versus lateral, or superior versus inferior. Sometimes the anatomical orientation is resolved in such a convoluted manner that only an expert can reasonably understand what is intended. In any event, no commercially available LIS enforces the correct disambiguations from carelessly composed pathology reports. Ensuring that reports can be automatically parsed into data elements that accurately represent the concepts included in the free-text report is one of the greatest impediments (and challenges) to the advancement of pathology informatics.
4.2.3.2 Parsing Free-Text into Sentences
The introduction of sentence boundaries into the free-text of an anatomic pathology report is a surprisingly necessary and complicated undertaking.
For example, this text report:
Liver showing metastatic adenocarcinoma
Portal lymph node showing reactive hyperplasia.
might easily be misinterpreted by an autocoder (computer translator) as including the diagnosis: Metastatic adenocarcinoma portal lymph node. After all, the disease-concept, Metastatic adenocarcinoma, and the bodysite-concept, Portal lymph node, are not separated by any intervening words or punctuation.
One could insist that the pathologist punctuate the report as:
Liver showing metastatic adenocarcinoma.
Portal lymph node showing reactive hyperplasia.
This insistence is no help in processing legacy (retrospective) anatomic pathology reports. Different pathologists have different punctuation styles. Even the two authors of this chapter, who practiced pathology together for eight years, could never reach agreement on a consistent style of punctuation of their department's anatomic pathology reports. Furthermore, uniform punctuation standards may not be enforceable even in the future, unless it is demonstrated that the pathologist is rewarded, either financially or by substantially improved case management, for correctly punctuating her reports, and that this goal can be achieved with a minimum of fuss.
It should be noted that the period-character (ASCII 46) is neither an unambiguous sentence-terminator, nor is it the only possible sentence terminator. The period is also used as a decimal-point (4.2, 10.17), for honorifics (Dr., Mr., Ms.), and (often ambiguously) in abbreviations (AIDS., A.I.D.S., A. I. D. S.), all of which may occur in mid-sentence of a pathology report. For example, the abbreviation MS. may be used legitimately in a free-text medical document to denote: female-honorific, Mississippi, Microsoft(R), multiple sclerosis, mitral stenosis, morphine sulphate, millisecond, microsecond. Conversely, colon, semicolon, question-mark, and exclamation-point may also serve as de facto sentence-terminators for some sentences appearing in a pathology report.
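A minimal sentence splitter along these lines is sketched below. It is an illustration under stated assumptions, not the authors' parser. It assumes that a line break always ends a sentence. It also assumes a deliberately tiny list of honorifics, and it treats dotted single letters (A.I.D.S.) as never ending a sentence. All names are ours.

```python
import re

# Hypothetical, deliberately small list; a working list would be far longer.
ABBREVIATIONS = {"dr.", "mr.", "ms.", "mrs."}
DOTTED_INITIALS = re.compile(r"(?:[A-Za-z]\.)+")  # A.I.D.S. or a lone "A."
TERMINATORS = ".;:?!"


def ends_sentence(token):
    """Decide whether a whitespace-delimited token closes a sentence."""
    if token.lower() in ABBREVIATIONS or DOTTED_INITIALS.fullmatch(token):
        return False
    # A decimal such as 4.2 ends in a digit, so it is never a terminator.
    return token[-1] in TERMINATORS


def split_sentences(text):
    sentences = []
    for line in text.splitlines():
        current = []
        for token in line.split():
            current.append(token)
            if ends_sentence(token):
                sentences.append(" ".join(current))
                current = []
        if current:  # line ended without punctuation: assume a boundary
            sentences.append(" ".join(current))
    return sentences


report = ("Liver showing metastatic adenocarcinoma\n"
          "Portal lymph node showing reactive hyperplasia.")
for sentence in split_sentences(report):
    print(sentence)
```

The sketch separates the two diagnoses in the unpunctuated example, but it would wrongly split text that has been hard-wrapped in mid-sentence, and it cannot resolve an abbreviation such as MS. that legitimately ends a sentence. Both cases call for the pre-editing discussed in the following sections.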
4.2.3.3 Discovering Terms for Translation into Coded Nomenclature
The process of autocoding (computer-translating) a large collection of legacy free-text reports is interactive and iterative. One begins with the entire text-file (source-text), and a computerized coding system, such as UMLS. One concludes with a database that maps each text-sentence into either a syntactically acceptable sequence of codes recognized by the autocoder, or else into an error message. The goal of the autocoder is to minimize the number of text-sentences that generate an error-message.
Both the source-text and the coding system can be modified and enriched. The source-text may be pre-edited, with sentence boundaries corrected, and with misspellings and ambiguities resolved.
If the original source-text is relatively well-formed, then pre-editing may be targeted at those few sentences that fail a grammatical parser. One may discover additional, as yet unforeseen sentence structures that must be added to the autocoder. The coding system may be enriched to include new synonyms encountered in the source text. For example, in the UMLS Metathesaurus coding system, C0007117 is the Concept Unique Identifier (CUI) for Basal Cell Carcinoma. However, pre-editing the source-text might uncover Basal Cell Epithelioma or Basalioma as synonyms for C0007117. A very ambitious researcher might even collect a list of false-negative concepts, present in the source-text but absent in the coding system, and petition the administrators of the coding system to include the additional codes and synonyms in a future revision of the coding system (Kao and Moore, 2000).
Concepts within the source-text are further discoverable by the Barrier Word Method (Tersmette et al, 1988; Moore GW et al, 1989; Nelson et al, 1995). In the barrier word method, natural-language medical text is regarded as a sequence of medical concepts linked together with grammatical objects, or barrier words, which serve to delimit sentence boundaries or phrase boundaries. Barrier words include: all punctuation, all numerals, nearly all one-letter and two-letter words, articles, prepositions, and common verbs and modifiers. Medical concepts are one-word or multiple-word terms, consisting of medically significant component words (keywords).
For example, consider the source-text appearing in Table 3 (Colby et al, 1995). In this example, the barrier words are shown in lower case, and the KEYWORDS ARE SHOWN IN UPPER CASE. In general, each SEQUENCE OF KEYWORDS uninterrupted by barrier words should point to one or more UMLS Metathesaurus CUIs. Thus the entire message contained in this table, not including punctuation, may be translated into a sequence of CUIs, namely, C0206704, C0396473, C0225355,....
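The barrier word method can be sketched in a few lines. The barrier word list below is a small hypothetical sample, not the list used in the cited studies. As in the text, punctuation, numerals, and one-letter and two-letter words are treated as barriers, although the text allows exceptions among the short words.

```python
import re

# Hypothetical sample of articles, prepositions, common verbs and modifiers.
BARRIER_WORDS = {
    "the", "and", "with", "from", "for", "are", "not", "seen",
    "showing", "shows", "consistent", "present",
}


def extract_terms(text):
    """Return each run of keywords uninterrupted by barrier words, in upper case."""
    terms, current = [], []
    # Tokens are either runs of letters or runs of punctuation/digits.
    for token in re.findall(r"[A-Za-z]+|[^A-Za-z\s]+", text):
        is_barrier = (
            not token.isalpha()              # punctuation and numerals
            or len(token) <= 2               # one-letter and two-letter words
            or token.lower() in BARRIER_WORDS
        )
        if is_barrier:
            if current:
                terms.append(" ".join(current).upper())
                current = []
        else:
            current.append(token)
    if current:
        terms.append(" ".join(current).upper())
    return terms


print(extract_terms("Portal lymph node showing reactive hyperplasia."))
```

Each term returned is a candidate for lookup against the synonym-names of the coding system. A term that fails to match may need to be split further or added to the synonym table.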
4.2.3.4 Preparing the Zipf Distribution of Phrases and Words
It is helpful to prepare a descending-order frequency distribution, or Zipf Distribution, of single words occurring in the source-text. Zipf's Law (Zipf, 1949) states that a few hundred common words (typically, barrier words) account for over half the word-occurrences in a large text-database, an observation confirmed in large text-databases in English (Fedorowicz, 1982; Moore et al, 1988), German (Giere, 1981), and Chinese (Zhang, 1981). Very high frequency words and terms have almost no recall value in a computerized indexing system, and should not be indexed. High frequency words that are not barrier words provide the data miner with a snapshot of the common concepts in the data source. Greater attention should be focused on matching these words to UMLS synonym-names, since failure to match common concepts will result in poor overall data mining performance. Many low frequency words are highly specific for indexing and data mining purposes. These words are also important to match to UMLS synonym-names, particularly if one of the goals of data mining is to identify unusual disease conditions and outcomes.
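A descending-order frequency table of this kind is simple to prepare. The sketch below assumes that a word is any run of letters and folds everything to lower case. The function names and the three-sentence sample are ours.

```python
import re
from collections import Counter


def zipf_distribution(text):
    """Return (word, count) pairs in descending order of frequency."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common()


def coverage(distribution, top_n):
    """Fraction of all word-occurrences accounted for by the top_n words."""
    total = sum(count for _, count in distribution)
    return sum(count for _, count in distribution[:top_n]) / total


sample = ("Liver showing metastatic adenocarcinoma. "
          "Portal lymph node showing reactive hyperplasia. "
          "Liver showing cirrhosis.")
distribution = zipf_distribution(sample)

for word, count in distribution[:3]:
    print(count, word)
print(f"top 2 words cover {coverage(distribution, 2):.0%} of occurrences")
```

On a real archive, the head of this table would be compared with the barrier word list, and the most frequent remaining words checked first against the synonym-names of the coding system.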
4.2.3.5 Translating Source Terms into UMLS Codes
As a first approximation, one can obtain an exact match between keyword terms in the source-text and UMLS Metathesaurus synonym-names, in order to achieve a preliminary mapping into UMLS. For example, Large Cell Carcinoma maps exactly into the official UMLS Metathesaurus CUI, C0206704. Next, some non-matches may be mapped as more complete forms. For example, Nuclei is not an official UMLS Metathesaurus synonym-name, but Cell Nucleus maps into the UMLS Metathesaurus CUI, C0007610. Additional non-exact matches may be mapped into obvious synonym-names. For example, Cluster and Clusters are not official UMLS Metathesaurus synonym-names, but the synonym-name, Aggregate, maps into the UMLS Metathesaurus CUI, C0205418.
Misspellings in the source text may be managed either by declaring a popular misspelling as a synonym-name, or by correcting the misspelling in the source text. For example, abcess is a popular misspelling for abscess (C0000833) in many surgical pathology reports. Some orthographic purists would shudder at the prospect of placing abcess in a synonym-name table, but this action may be a necessary concession to practicality. Popular abbreviations may be matched to UMLS codes in a comparable manner, such as C.O.L.D. for Chronic-Obstructive-Lung-Disease (C0024117). However, this action can eventually lead to a jungle of unresolved ambiguities, as for example: COLD for Chronic-Obstructive-Lung-Disease (C0024117); Cold-Coryza (C0009443); and Cold-Temperature (C0009264).
As a general rule, it is preferable for an expert copyeditor to annotate an obvious ambiguity in the source text rather than to relegate this step to the autocoder. Thus for example, Adnexa is ambiguous, but the annotations, Skin-Adnexa (C0221943), Uterine-Adnexa (C0001575), or Ocular-Adnexa (C0229243), are unambiguous, and would be properly translated by the autocoder.
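The mapping steps of this section can be sketched with a small synonym table. The table below is a hypothetical local table built only from the examples in the text, and it is not an extract of the UMLS Metathesaurus. The function name and the status labels are ours.

```python
# Lower-case synonym-name -> list of candidate CUIs (from the examples above).
SYNONYMS = {
    "large cell carcinoma": ["C0206704"],
    "cell nucleus": ["C0007610"],
    "nuclei": ["C0007610"],            # local addition, mapped to the fuller form
    "aggregate": ["C0205418"],
    "cluster": ["C0205418"],           # local additions for an obvious synonym
    "clusters": ["C0205418"],
    "abscess": ["C0000833"],
    "abcess": ["C0000833"],            # popular misspelling accepted as a synonym
    "basal cell carcinoma": ["C0007117"],
    "basal cell epithelioma": ["C0007117"],
    "basalioma": ["C0007117"],
    "cold": ["C0024117", "C0009443", "C0009264"],     # unresolved ambiguity
    "adnexa": ["C0221943", "C0001575", "C0229243"],   # unresolved ambiguity
    "skin adnexa": ["C0221943"],
    "uterine adnexa": ["C0001575"],
    "ocular adnexa": ["C0229243"],
}


def autocode(term):
    """Map one source term to a CUI, or report why it cannot be mapped."""
    key = term.lower().replace("-", " ").strip()
    codes = SYNONYMS.get(key, [])
    if not codes:
        return ("ERROR", "no matching synonym-name")
    if len(codes) > 1:
        return ("AMBIGUOUS", codes)
    return ("OK", codes[0])


for term in ["Large Cell Carcinoma", "abcess", "Adnexa", "Uterine-Adnexa"]:
    print(term, autocode(term))
```

Terms reported as AMBIGUOUS or ERROR are the ones that a copyeditor would annotate in the source-text, or that would prompt an addition to the synonym table, before the next pass of the autocoder.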
Another compelling reason to initially disambiguate the source-text in a pre-editing step is our belief that the real goal of preparing source-text for pathology data mining should be recoding or retranslation. This is the view that the anatomic pathology report should initially be written in plain English: unambiguous, orthographically and syntactically correct, short sentences. Then, as the philosophy or goals of the autocoder evolve and improve, the entire set of anatomic pathology reports can be re-autocoded, and re-mined.
4.3 Preparation of the Data
4.3.1 Data Ownership
Pathologists, patients, and laboratories all lay claim to ownership of tissue specimens and pathology report data. When a pathologist renders a diagnosis on a microscopic slide (which contains a stained sample of tissue that the pathologist has carefully selected and described), she creates a medicolegal document that often creates a drastic change in the life of the patient from that point onward. For many patients, the notification of cancer (or absence of cancer) on a pathology report may represent one of the most important moments of their lives. The pathologist's signature appears on each report, and the pathologist must be prepared to accept the legal and social consequences of any errors or inaccuracies in her rendered diagnosis. Pathologists spend a good portion of their professional life archiving tissues and contributing their reports to the pathology database. It is no wonder that pathologists tend to think of pathology specimens and reports as their professional property. This belief, long held by pathologists, is currently under attack by patients and patient-advocates, who believe that the tissues obtained from a patient and the report rendered on the tissues belong to the patient.
Legal theory treats ownership in relation to commerce. The owner of an item is considered to be the person who has the right to sell the item. Since it is almost universally held that human tissue has special status and should not be sold or bartered, legal precedent demurs from assigning ownership of human tissue. Instead, current practices provide specific rights to use human tissues to several parties. Patients have the right to know the content of pathology reports rendered on their biopsied tissues. In certain cases, such as when a patient seeks treatment at an institution different from the institution that rendered the original diagnosis, or when the patient chooses to have the pathologic material reviewed by a different laboratory or pathologist, the pathologist must transfer those materials according to the patient's request. In this instance, the original pathologist concedes that her fiduciary responsibility to the patient is more compelling than her rights to own the specimen.
Institutions rendering pathology diagnoses have certain rights and obligations regarding patient information. The Health Insurance Portability and Accountability Act of 1996 (HIPAA, Kennedy-Kassebaum Bill, H.R. 3103 of 104th Congress (Appendix 1)) requires standards to be established for the transfer of information regarding health claims. Institutions may also be required to establish registries, particularly cancer registries. This involves compiling data, including synopses of pathology data, and contributing the collected data to a centralized registry, such as the registry operated by the Centers for Disease Control and Prevention (CDCP). In addition, institutions may on occasion need to acquire and transfer reports and tissues, in order to comply with legal subpoenas filed in the interest of patients.
The pathologist who uses a patient's tissues and associated data in a research study is likely to claim ownership. If she can satisfy herself that the study is designed in such a way that the patient's welfare is protected, and that the fiduciary responsibilities to the patient are satisfied, then she will most likely argue that she has the right to use her acquired specimens and data without obtaining the patient's consent.
Currently, regulations guiding the use of tissues and data for research purposes abound. A number of states have enacted laws that place special restrictions on genetic research. However, the current U. S. Federal legislation that is directly germane to the use of human tissues and associated data for research involving federally funded institutions or researchers is found in Part 46 of Title 45 of the U. S. Code of Federal Regulations (USCFR) (45 CFR 46) (1995). These regulations are used by researchers, by Institutional Review Boards, and by the U. S. Federal Office of Protection from Research Risks (OPRR) (Appendix 1), to ensure the safety of individuals entered into research projects. Some suggestions for the ethical conduct of research and for the interpretation of 45 CFR 46 have been prepared by the U. S. National Bioethics Advisory Commission (NBAC) (Appendix 1).
In general, in research involving pathology data and tissues that are stored in pathology archives, but which were obtained in the standard course of patient treatment (i.e., that were not obtained to satisfy a research protocol), the risk to the patient is the risk that the patient's confidentiality will be breached. All physicians have a fiduciary obligation to ensure that information related to a patient's medical condition can never be used to harm the patient.
In response to the HIPAA legislation, the Department of Health and Human Services has issued proposals for Standards for Privacy of Individually Identifiable Health Information. The Proposed Rules appeared in the Federal Register, and are designated as 45 CFR Parts 160 to 164 (Appendix 1). In the Proposed Rule, the concept of deidentification is discussed. In 45 CFR 46 (Common Rule), researchers who use medical information can be excluded from regulation, if the medical data are completely stripped of all patient identifiers (anonymization) as well as all links to patient identifiers. Because anonymization is irreversible, discoveries, inconsistencies or errors found in anonymized data cannot be relinked back to a patient. Consequently, studies using anonymized data cannot be used to inform patients of any discovered risks to their health; cannot have data inconsistencies rectified; and cannot be checked for the accuracy of the original data included in the dataset.
De-identified data are data for which the recipient of the data cannot determine the patient's identity, but the institution that provided the data can, if needed, relink the patient's identity to the data (Appendix 1; Berman et al, 1996). For instance, if an institution encrypts a patient's name and gives the data to a third party, then the third party does not know the identity of the patient (i.e., the data are anonymized for the recipient). But if the third party contacts the institution and suggests that they have information that the patient may need to know, then the institution can take the encrypted patient identifier from the third party, decrypt the patient's identifier, and contact the patient. Although the mechanism for patient deidentification is unspecified in the Proposed Rule, it is clear that institutions will need to develop deidentification methods that are reasonable for their own operational circumstances, and that such methods will require approval from Institutional Review Boards before de-identified medical data can be shared with researchers.
De-identification methods may offer institutions a way of sharing encrypted, masked, or otherwise modified data for research efforts, in which the value to public health is deemed to outweigh the minimal risk to patient confidentiality.
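The relinking scenario described above, in which a recipient cannot identify the patient but the providing institution can, may be sketched as follows. This is only an illustration under stated assumptions: the Proposed Rule leaves the mechanism unspecified, and this sketch swaps in a keyed hash (HMAC-SHA256) plus a relinking table kept inside the institution, in place of reversible encryption. The class name Institution and its methods are hypothetical.

```python
import hashlib
import hmac
import secrets


class Institution:
    """Holds the secret key and the relinking table; neither leaves the institution."""

    def __init__(self):
        self._key = secrets.token_bytes(32)   # secret known only to the institution
        self._relink = {}                     # pseudonym -> original identifier

    def deidentify(self, patient_id, record):
        """Return a copy of the record carrying a pseudonym in place of the identifier."""
        pseudonym = hmac.new(
            self._key, patient_id.encode("utf-8"), hashlib.sha256
        ).hexdigest()
        self._relink[pseudonym] = patient_id
        return {"pseudonym": pseudonym, **record}

    def reidentify(self, pseudonym):
        """Relink a pseudonym returned by a third party; None if it is unknown."""
        return self._relink.get(pseudonym)


institution = Institution()
shared = institution.deidentify("123-45-6789", {"diagnosis_cui": "C0007137"})
# The third party sees only the pseudonym and the coded diagnosis.
# If the third party reports a finding, the institution can relink:
original = institution.reidentify(shared["pseudonym"])
```

In this sketch the recipient cannot recompute or reverse the pseudonym without the key, while the institution can relink it on request, subject to whatever approval its Institutional Review Board requires.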
4.3.2 Data Requirements
4.3.2.1 What Constitutes Acceptable Anatomic Pathology Data?
Every surgical pathology report is identified by a unique case accession number (usually assigned when the patient's specimen is received in the pathology laboratory), and a patient identifier (typically the patient's name and a unique identifying code assigned by the hospital, or the patient's social security number). Many institutions prefer the social security number, because it is presumed to be unique, and lasts for a lifetime. We have heard unverified anecdotal stories of the re-use of social security numbers from deceased patients. If this is true, then it implies that social security numbers might be unique among the set of living persons, but that some living persons may share the social security number of persons who are deceased. In any event, it can be safely assumed that social security numbers provided by patients to hospitals may be transcribed with human data-entry errors. The most common data-entry error associated with remembering or transcribing sequences of digits is transposition of consecutive digits. For long sequences of digits, error rates can exceed 10%. It is a certainty that hospital databases contain incorrect social security numbers, and that some of these numbers are non-unique (i.e., erroneously match one individual to another individual).
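As an illustration of the transposition error described above, the following Python sketch tests whether two identifiers differ only by a swap of two consecutive digits. The function name is hypothetical, and the check is a simple screening aid for suspect pairs, not a method taken from the source.

```python
def is_adjacent_transposition(first, second):
    """True if the two identifiers differ only by one swap of consecutive digits.

    Non-digit characters such as hyphens are ignored before comparison.
    """
    a = "".join(ch for ch in first if ch.isdigit())
    b = "".join(ch for ch in second if ch.isdigit())
    if len(a) != len(b) or a == b:
        return False
    # Positions at which the two digit strings disagree.
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    return (
        len(diffs) == 2
        and diffs[1] == diffs[0] + 1
        and a[diffs[0]] == b[diffs[1]]
        and a[diffs[1]] == b[diffs[0]]
    )


print(is_adjacent_transposition("123-45-6789", "123-54-6789"))
```

A data miner might use such a check to flag pairs of records whose other demographics agree but whose identifiers differ by a single transposition.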
Relevant demographic material might include: date of birth, gender, ethnicity, and date of death when applicable. Additional clinical information is often obtained in the form of other pathology reports for a given patient, clinical history that precedes a diagnostic report, and clinical history, including treatment, that results from a rendered pathology diagnosis. Finally, there is the pathology diagnosis itself, which may be recorded as free-text, and translated (either manually or automatically) into a standardized coding system.
For a well-formed pathology report, the laboratory information system should be able to mark up the report with document tags that serve as a roadmap for a data mining system. The Hypertext Markup Language (HTML) is a document formatting language that permits Internet web designers to attach markup tags to elements of text (e.g., font size, formatting information, color selection, etc.). The Extensible Markup Language (XML) permits document elements to be tagged with markers that describe the actual content of data elements in the text (Simpson, 1996; Light, 1997).
Table 2 shows an XML file, with an embedded document type definition, for the surgical pathology report listed in Table 1. The significant elements in the pathology report (accession number, lab identifier, time obtained, etc.) are marked by XML tags. These tags, in turn, may have certain required properties (date/time, size in cm, bodysite-name, UMLS-CUI, etc.), and may be iterated.
The open challenge for anatomic pathology data miners is to develop a structure for XML tags that is acceptable to our colleagues, and then to translate large collections of pathology reports into corresponding XML files.
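To suggest how content tags serve as a roadmap for a data mining program, the following Python sketch reads a small tagged fragment with the standard library module xml.etree.ElementTree. The fragment reuses tag names and values from Table 2 but is a simplified excerpt assumed for illustration; the function name concept_and_cuis is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified excerpt using tag names from Table 2 (assumed for illustration).
FRAGMENT = """
<diagnosis>
  <diagnosis_number> 1 </diagnosis_number>
  <disease_concept> SQUAMOUS CELL CARCINOMA </disease_concept>
  <disease_concept_cui> C0280324, C0007137 </disease_concept_cui>
</diagnosis>
"""


def concept_and_cuis(xml_text):
    """Return the disease concept text and its list of CUIs from one diagnosis element."""
    root = ET.fromstring(xml_text)
    concept = root.findtext("disease_concept", default="").strip()
    cui_text = root.findtext("disease_concept_cui", default="")
    cuis = [cui.strip() for cui in cui_text.split(",") if cui.strip()]
    return concept, cuis


concept, cuis = concept_and_cuis(FRAGMENT)
print(concept, cuis)
```

Because the tags name the content directly, the program needs no knowledge of the report's page layout to find the diagnosis and its codes.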
4.3.2.2 Security and Confidentiality
There are two levels of security that must be addressed in any data mining project for anatomic pathology:
1. Concealment of the patient's name.
2. Concealment of the patient's detailed medical history.
At the first level of security, the patient's exact identifiers must never be disclosed in a public venue. These identifiers include (U. S. Code of Federal Regulations, Appendix 1): name; address, including street address, city, county, zip code, or equivalent geocodes; names of relatives and employers; birth date; voice telephone and fax numbers; email addresses; social security number; medical record number; health plan beneficiary number; account number; certificate/license number; any vehicle or other device serial number; web URL; Internet Protocol (IP) address; finger or voice prints; photographic images; and any other unique identifying number, characteristic, or code (whether generally available in the public realm or not) that one has reason to believe may be available to an anticipated recipient of the information. The methods for achieving simple encryption of the patient's identifiers are well-understood, commercially available, and if used properly, are essentially unbreakable (Schneier, 1996). These include the one-time-pad method and public-key (public-private key pair) encryption methods.
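The one-time-pad method mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: it assumes the pad is generated from the operating system's secure random source, is as long as the message, is used only once, and is stored separately from the ciphertext. The function names are hypothetical.

```python
import secrets


def otp_encrypt(plaintext):
    """Encrypt bytes with a freshly generated one-time pad; return (ciphertext, pad)."""
    pad = secrets.token_bytes(len(plaintext))          # random pad, same length as message
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, pad))
    return ciphertext, pad


def otp_decrypt(ciphertext, pad):
    """Recover the plaintext by XOR-ing the ciphertext with the same pad."""
    return bytes(c ^ k for c, k in zip(ciphertext, pad))


identifier = "VETERAN,JOHN Q.".encode("utf-8")
ciphertext, pad = otp_encrypt(identifier)
recovered = otp_decrypt(ciphertext, pad).decode("utf-8")
```

The security of the method rests entirely on the pad: whoever holds the pad can recover the identifier, and a pad that is reused or poorly generated forfeits the protection.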
At the second level of security, one must recognize that a detailed medical record, even when purged of the patient's exact identifiers, might still contain information sufficiently detailed to identify the patient. For example, suppose that the hypothetical autopsy report in Table 4 had been deposited in a public autopsy resource. Then public knowledge regarding this U. S. President's medical history might be sufficient to identify the subject of this hypothetical autopsy report. Furthermore, there might be additional, previously private information in the report that would embarrass or otherwise harm the family. Even if it were legal to disclose this report, such wholesale disclosures of personal information would erode public confidence in the doctor-patient relationship, cause future patients to conceal parts of their medical history, and eventually damage the overall quality of medical care.
On the other hand, had the public autopsy report identified the patient's occupation solely as Politics (C0032382), seventh decade at death, autopsied in the 1970s, and contained no exact numeric information regarding the dates of birth and of various diagnoses, then it might represent the autopsy report of thousands of middle-aged men.
This example is further mitigated by an important qualification. Anyone who campaigns and serves on a job as public as the President of the United States must reckon with a loss of privacy, and with possible exposure of private aspects of one's life. Should a public medical resource be deemed culpable for revealing additional facts about an already-public person?
A patient's detailed medical history may be concealed by a distinction between public demographics and private demographics (Carter et al, 1981; Peery, 1978). Public demographics must include enough information to have value for epidemiologic studies, but not so much as to disclose the identity of the patient. All other demographics could be withheld from the public, and should be disclosed only under subpoena or with IRB approval.
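The coarsening of exact dates into public demographics, as in the example above (seventh decade at death, autopsied in the 1970s), can be sketched as follows. The dates are those of the hypothetical report in Table 4. The function names and field names are hypothetical, and the sketch assumes that the decade of autopsy can be taken from the date of death.

```python
from datetime import date


def decade_of_life(birth, death):
    """Ordinal decade of life at death (e.g. age 64 falls in the 7th decade)."""
    had_birthday = (death.month, death.day) >= (birth.month, birth.day)
    age = death.year - birth.year - (0 if had_birthday else 1)
    return age // 10 + 1


def calendar_decade(when):
    """Calendar decade as a label, e.g. 1973 -> '1970s'."""
    return f"{when.year // 10 * 10}s"


def public_demographics(birth, death, sex, race):
    """Keep only coarse values; exact dates stay in the private demographics."""
    return {
        "decade_of_life": decade_of_life(birth, death),
        "decade_of_autopsy": calendar_decade(death),  # assumption: autopsy near date of death
        "sex": sex,
        "race": race,
    }


record = public_demographics(date(1908, 8, 27), date(1973, 1, 22), "Male", "Caucasian")
print(record)
```

The exact dates never enter the public record, so the same public line would describe every patient of that sex and race who died in the same decade of life and calendar decade.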
4.4 Data Mining Example
4.4.1 The Johns Hopkins Autopsy Resource
Over the past three decades, there have been proposals in the anatomic pathology literature for inter-institutional sharing of pathology data (Carter et al, 1981; Peery, 1978; Wagner, 1996; Mullick, 1997; Moore et al, 1996). The Johns Hopkins Autopsy Resource (JHAR) (Appendix 1) is an Internet website, founded as an institutional database in 1980, and posted publicly in 1995, that lists over 50,000 autopsy facesheets, on patients born over a span of two centuries. An autopsy facesheet is the summary of final diagnoses, typically appearing as the first page in an autopsy report. The JHAR corresponds to an estimated one million tissue blocks, predominantly formalin-fixed and paraffin-embedded, which may be obtained as part of collaborative research investigations. Over 1300 publications in scholarly journals have resulted from the cases listed in the JHAR, and all citations, many with PubMed hyperlinks, are available on the website. Studies based upon data mining the JHAR include case reports (Hutchins et al, 1996), large autopsy case series (Arcidi et al, 1981; Vigorita et al, 1980; Moore and Hutchins, 1982), and even linguistic studies of medical text (Moore et al, 1988; Moore et al, 1989).
For example, in one of the autopsy studies based upon cases listed in the JHAR, the investigators sought to determine whether the severity and duration of adult-onset diabetes mellitus (AODM) is correlated with the severity of coronary atherosclerosis observed at autopsy. Clinical and autopsy findings were studied in 185 patients with AODM, who ranged in age from 37 to 91 years, and had had a clinical diagnosis of AODM established between a few days and up to 50 years before death. The JHAR was used to identify age-sex matched control subjects autopsied during a similar period. In comparison with age-sex matched controls, patients with AODM showed significantly more coronary artery disease, more diffuseness of coronary disease, more coronary collateralization, more vessels involved by atherosclerosis, and more myocardial infarcts. On the other hand, the progression of this atherosclerotic disease was unrelated to duration or severity of AODM (Vigorita et al, 1980).
The authors of that study cautioned against drawing conclusions from large datasets without a thorough understanding of the contained data (Moore and Hutchins, 1982). For example, severity of AODM was significantly correlated with short stature at autopsy. Further investigation of this initially interesting observation revealed that each patient's height at autopsy was measured from the head to the termination of the lower extremities. It turned out that many severe diabetics were double amputees!
Studies such as this one underscore the value of data mining efforts, in which valuable conclusions could be drawn quickly and inexpensively from existing patient records, radiographs, and microscope slides alone, without the expense and ethical problems inherent in designing a prospective clinicopathologic investigation.
In order to achieve patient confidentiality in the JHAR, each autopsy facesheet consists of a demographic line, followed by diagnoses. The only public demographic information provided is: age in decades, race, sex, decade of autopsy, and a key-number, which can be used to decrypt the patient identification, if necessary. Confidentiality is protected by the double-brokered encryption system of patient identifiers, which requires the participation of both the JHAR administrator and officials of the Department of Pathology of The Johns Hopkins Medical Institutions to re-identify the individual patients (Berman et al, 1996). As an additional security measure, the key-number may correspond to multiple patients, with the number-of-patients for a given key-number known only to the JHAR administration. Diagnoses in the original autopsy facesheet have been stripped of names of persons, locations, and institutions; and diagnoses have been autocoded into generic medical language as well as into UMLS codes in XML format. The only mechanism for obtaining additional information regarding an individual JHAR autopsy facesheet is to correspond with the database administrator, who forwards the correspondence to the appropriate official at The Johns Hopkins Medical Institutions. The Johns Hopkins Medical Institutions responds in accordance with policies set by its Institutional Review Board.
4.5 Conclusion
Data mining in anatomic pathology is an emerging field. In the past, data mining efforts consisted of small projects conducted with the existing electronic records within a given institution. Five developments over the past several years have enormously expanded the potential for anatomic pathology data mining:
1. The accumulation of millions of pathology records in electronic form. Approximately 40 million new anatomic pathology reports are created and stored each year in the USA.
2. The emergence of legislative guidelines that ensure patient confidentiality, while establishing ethical and legal paradigms by which researchers can acquire pathology data for legitimate research needs.
3. The availability of comprehensive common medical terminologies (including SNOMED, Read, and UMLS), that will support translation of diagnostic information into coded terms that can be aggregated and queried.
4. Community acceptance of standard document structures, such as XML. Electronic reports will contain data, along with information that describes the data, using community-standard data tags (metadata). Such standardized reports will remove the greatest technical impediment to sharing pathology data, namely, unclassifiable data.
5. The availability of Internet technology that allows the rapid and secure transfer of medical data. The most promising technology for data mining is the distributed network query. Through this mechanism, a client query is sent to multiple institutional databases, and a reply is created by a middleware agent that merges the responses from each of the cooperating institutions into a single database.
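The distributed network query named in the fifth development can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the institutions are in-memory lists standing in for remote databases, the record fields are hypothetical, and a real middleware agent would also handle network transport, authentication, and confidentiality.

```python
def query_institution(records, matches):
    """Stand-in for one institution answering a query against its own database."""
    return [record for record in records if matches(record)]


def distributed_query(institutions, matches):
    """Middleware sketch: send the query to every institution and merge the replies."""
    merged = []
    for name, records in institutions.items():
        for record in query_institution(records, matches):
            # Tag each reply with its source so the merged result stays traceable.
            merged.append({"institution": name, **record})
    return merged


# Hypothetical de-identified records; C0027051 is the CUI listed in Table 4
# for Myocardial Infarct.
INSTITUTIONS = {
    "Hospital A": [
        {"cui": "C0027051", "decade_of_autopsy": "1970s"},
        {"cui": "C0007137", "decade_of_autopsy": "1980s"},
    ],
    "Hospital B": [
        {"cui": "C0027051", "decade_of_autopsy": "1990s"},
    ],
}

hits = distributed_query(INSTITUTIONS, lambda record: record["cui"] == "C0027051")
print(hits)
```

In this arrangement each institution keeps custody of its own records and answers only the query, while the client sees a single merged reply.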
Access to pathology data is the rate-limiting factor to advancement in the field of pathology data mining. The next few years will be critical to the development of this new and promising field of research.
Table 1. Surgical Pathology Sample Report.
U. S. Government Standard Form 515
MEDICAL RECORD | SURGICAL PATHOLOGY
PATHOLOGY REPORT
Laboratory: BALTIMORE VAMHCS Accession No. BSP 99 8888
Submitted by: J SURGEON MD Date obtained: Jan 14, 1999
Specimen (Received Jan 15, 1999 10:32):
1. LARYNGECTOMY.
2. LEFT RADICAL NECK DISSECTION.
Brief Clinical History:
SQUAMOUS CARCINOMA, LEFT TRUE CORD.
Preoperative Diagnosis:
SQUAMOUS CARCINOMA, LEFT TRUE CORD.
Operative Findings:
SAME.
Postoperative Diagnosis:
SAME.
Surgeon/physician: J SURGEON MD
Gross description:
PATIENT IDENTIFICATION AGREES WITH REQUISITION AND TWO CONTAINERS.
1. THE SPECIMEN IS RECEIVED FRESH, LABELED WITH THE PATIENT'S NAME,
AND ADDITIONALLY LABELED "LARYNGECTOMY".
THE SPECIMEN CONSISTS OF A LARYNGECTOMY RESECTION, MEASURING
10.5 X 5.5 X 3.5 CM. THE LARYNX IS EDEMATOUS. THE LARYNX IS OPENED
POSTERIORLY, TO REVEAL AN IRREGULARITY OF APPARENT TUMOR, ON THE SURFACE
OF THE LEFT TRUE VOCAL CORD, MEASURING 3.0 X 1.5 CM. THE TUMOR DOES NOT
APPEAR TO INVOLVE THE SUBGLOTTIS, NOR THE ANTERIOR COMMISSURE. THE SUPERIOR,
INFERIOR, ANTERIOR, AND POSTERIOR MARGINS ARE GROSSLY UNINVOLVED BY TUMOR.
REPRESENTATIVE SECTIONS OF TUMOR ARE SUBMITTED, AS WELL AS THE SURGICAL
MARGINS, AS FOLLOWS:
SUMMARY OF SECTIONS:
1-1, 1 PIECE. TRACHEAL MARGIN.
1-2, 1 PIECE. BASE OF TONGUE MARGIN.
1-3, 1 PIECE. RIGHT PYRIFORM SINUS MARGIN.
1-4, 1 PIECE. LEFT PYRIFORM SINUS MARGIN.
1-5, 1 PIECE. ANTERIOR SOFT TISSUE MARGIN.
1-6, 1 PIECE. POSTERIOR SOFT TISSUE MARGIN.
1-7, 1 PIECE. LESION OF THE LEFT TRUE CORD.
1-8, 1 PIECE. LESION OF THE LEFT TRUE CORD.
1-9, 1 PIECE. LESION OF THE LEFT TRUE CORD.
1-10, 1 PIECE. EPIGLOTTIS.
2. THE SPECIMEN IS RECEIVED FRESH, LABELED WITH THE PATIENT'S NAME,
AND ADDITIONALLY LABELED "LEFT RADICAL NECK DISSECTION". THE SPECIMEN
CONSISTS OF A LEFT RADICAL NECK DISSECTION, MEASURING 25.0 X 15.0 X 5.0 CM.
THE SPECIMEN IS DIVIDED INTO LEVELS 1, 2, 3, 4, AND 5. IN LEVEL 1,
THE SALIVARY GLAND AND ONE PROBABLE LYMPH NODE ARE SUBMITTED.
IN LEVEL 2, SIX PROBABLE LYMPH NODES ARE SUBMITTED.
IN LEVEL 3, TWO PROBABLE LYMPH NODES ARE SUBMITTED.
IN LEVEL 4, ELEVEN PROBABLE LYMPH NODES ARE SUBMITTED.
IN LEVEL 5, FIVE PROBABLE LYMPH NODES ARE SUBMITTED.
REPRESENTATIVE SECTIONS ARE SUBMITTED, AS FOLLOWS:
SUMMARY OF SECTIONS:
1-1, 1 PIECE. LEVEL 1.
2-1, 5 PIECES. LEVEL 2.
3-1, 5 PIECES. LEVEL 2.
4-1, 4 PIECES. LEVEL 3.
5-1, 3 PIECES. LEVEL 3.
6-1, 6 PIECES. LEVEL 3.
7-1, 5 PIECES. LEVEL 4.
8-1, 5 PIECES. LEVEL 4.
9-1, 4 PIECES. LEVEL 5.
Microscopic exam/diagnosis:
1. SQUAMOUS CELL CARCINOMA OF LEFT TRUE CORD, WELL-DIFFERENTIATED, INVASIVE.
SURGICAL MARGINS OF RESECTION ARE FREE OF TUMOR.
2. RADICAL NECK DISSECTION. SALIVARY GLAND WITH NO EVIDENCE OF MALIGNANCY.
ELEVEN OF TWENTY-THREE LYMPH NODES WITH METASTATIC SQUAMOUS CELL CARCINOMA,
AS FOLLOWS.
LEVEL I: SALIVARY GLAND AND ONE LYMPH NODE WITH NO EVIDENCE OF MALIGNANCY.
LEVEL II: THREE OF FIVE LYMPH NODES WITH METASTATIC SQUAMOUS CELL CARCINOMA.
LEVEL III: ONE OF TWO LYMPH NODES WITH METASTATIC SQUAMOUS CELL CARCINOMA.
LEVEL IV: SEVEN OF TEN LYMPH NODES WITH METASTATIC SQUAMOUS CELL CARCINOMA.
LEVEL V: FIVE LYMPH NODES WITH NO EVIDENCE OF MALIGNANCY.
JOHN Q PATHOLOGIST MD xyz| Date Jan 16, 1999
VETERAN,JOHN Q. STANDARD FORM 515
ID:123-45-6789 SEX:M DOB:12/01/1940 AGE:58 LOC:ENT J SURGEON
Table 2. XML File for Surgical Pathology Sample Report *
<?xml version="1.0" ?>
<!DOCTYPE path_report
[
<!ELEMENT path_report (accession)>
<!ELEMENT accession (lab_identifier, time, submission,
pathologist, patient, procedure)>
<!ELEMENT lab_identifier (#PCDATA)>
<!ELEMENT time (time_obtained, time_received, time_reported,
time_amended, time_supplemented)>
<!ELEMENT time_obtained (#PCDATA)>
<!ELEMENT time_received (#PCDATA)>
<!ELEMENT time_reported (#PCDATA)>
<!ELEMENT time_amended (#PCDATA)>
<!ELEMENT time_supplemented (#PCDATA)>
<!ELEMENT submission (submitting_physician, submitting_service)>
<!ELEMENT submitting_physician (#PCDATA)>
<!ELEMENT submitting_service (#PCDATA)>
<!ELEMENT pathologist (#PCDATA)>
<!ELEMENT patient (patient_name, patient_identifier, date_of_birth,
patient_gender, patient_ethnicity)>
<!ELEMENT patient_name (#PCDATA)>
<!ELEMENT patient_identifier (#PCDATA)>
<!ELEMENT date_of_birth (#PCDATA)>
<!ELEMENT patient_gender (#PCDATA)>
<!ELEMENT patient_ethnicity (#PCDATA)>
<!ELEMENT procedure (specimen)>
<!ELEMENT procedure_cui (specimen)>
<!ELEMENT specimen (unique_specimen_identifier,container)>
<!ELEMENT unique_specimen_identifier (#PCDATA)>
<!ELEMENT container (container_number,label,gross,diagnosis)>
<!ELEMENT container_number (#PCDATA)>
<!ELEMENT label (#PCDATA)>
<!ELEMENT gross (gross_description,lesion_size)>
<!ELEMENT gross_description (#PCDATA)>
<!ELEMENT lesion_size (#PCDATA)>
<!ELEMENT diagnosis (diagnosis_number, disease_concept,
disease_modifiers, comment)>
<!ELEMENT diagnosis_number (#PCDATA)>
<!ELEMENT disease_concept (#PCDATA)>
<!ELEMENT disease_concept_cui (#PCDATA)>
<!ELEMENT disease_modifiers (#PCDATA)>
<!ELEMENT disease_modifiers_cui (#PCDATA)>
<!ELEMENT comment (#PCDATA)>
]>
<path_report>
<accession> BSP 99 8888
<lab_identifier> BALTIMORE VAMC SURGICAL PATH </lab_identifier>
<time>
<time_obtained> Jan 14, 1999 14:18 </time_obtained>
<time_received> Jan 15, 1999 10:32 </time_received>
<time_reported> Jan 18, 1999 09:18 </time_reported>
<time_amended></time_amended>
<time_supplemented></time_supplemented>
</time>
<submission>
<submitting_physician> J SURGEON MD </submitting_physician>
<submitting_service> SURGERY </submitting_service>
</submission>
<pathologist> JOHN Q PATHOLOGIST MD </pathologist>
<patient>
<patient_name> VETERAN,JOHN Q. </patient_name>
<patient_identifier> 123-45-6789 </patient_identifier>
<date_of_birth> 12/01/1940 </date_of_birth>
<patient_gender> M </patient_gender>
<patient_ethnicity> WHITE </patient_ethnicity>
</patient>
<procedure> LARYNGECTOMY AND LEFT NECK DISSECTION
<procedure_cui> C0023065, C0205091, C0034542
</procedure_cui>
<specimen> LARYNX
<specimen_cui> C0023078 </specimen_cui>
<unique_specimen_identifier> 9876543
</unique_specimen_identifier>
<container>
<container_number> 1 </container_number>
<label> LARYNGECTOMY </label>
<gross>
<gross_description> THE SPECIMEN IS RECEIVED FRESH,
LABELED WITH THE PATIENT'S NAME, AND ADDITIONALLY LABELED "LARYNGECTOMY".
THE SPECIMEN CONSISTS OF.... </gross_description>
<lesion_size> 3 cm </lesion_size>
</gross>
<diagnosis>
<diagnosis_number> 1 </diagnosis_number>
<disease_concept> SQUAMOUS CELL CARCINOMA </disease_concept>
<disease_concept_cui> C0280324, C0007137
</disease_concept_cui>
<disease_modifiers> WELL DIFFERENTIATED SQUAMOUS CARCINOMA
OF LARYNX </disease_modifiers>
<disease_modifiers_cui> C0205615 </disease_modifiers_cui>
<diagnosis_number>2</diagnosis_number>
<disease_concept> MARGINS FREE OF TUMOR
</disease_concept>
<disease_concept_cui> C0332648 </disease_concept_cui>
<disease_modifiers></disease_modifiers>
<comment> </comment>
</diagnosis>
</container>
</specimen>
<specimen> LEFT NECK
<unique_specimen_identifier> 9876544
</unique_specimen_identifier>
<container>
<container_number> 2 </container_number>
<label> LEFT </label>
<gross>
<gross_description> THE SPECIMEN IS RECEIVED FRESH,
LABELED WITH THE PATIENT'S NAME, AND ADDITIONALLY LABELED
"LEFT RADICAL NECK DISSECTION". THE SPECIMEN CONSISTS OF....
</gross_description>
</gross>
<diagnosis>
<diagnosis_number>1</diagnosis_number>
<disease_concept> METASTATIC SQUAMOUS CARCINOMA
</disease_concept>
<disease_concept_cui> C0334246, C0280399
</disease_concept_cui>
<disease_modifiers></disease_modifiers>
<disease_modifiers_cui></disease_modifiers_cui>
<comment> METASTATIC SQUAMOUS CARCINOMA
FOUND IN 11 OF 23 EXAMINED LYMPH NODES </comment>
<diagnosis_number>2</diagnosis_number>
<disease_concept> NO EVIDENCE OF MALIGNANCY
</disease_concept>
<disease_concept_cui> C0391857 </disease_concept_cui>
<disease_modifiers></disease_modifiers>
<disease_modifiers_cui></disease_modifiers_cui>
<comment> SALIVARY GLAND </comment>
</diagnosis>
</container>
</specimen>
</procedure>
</accession>
</path_report>
* This sample XML format can be expanded to associate UMLS (CUI) descriptor tags with every text-containing tag.
Table 3. Barrier Word Method, Illustrated by Sample Source Text from the Armed Forces Institute of Pathology Electronic Fascicles.
Sample Legend-Text from the Armed Forces Institute of Pathology Electronic Fascicles (Colby et al, 1995). Barrier words are displayed in lower case, and KEYWORDS ARE DISPLAYED IN UPPER CASE. Each sequence of keywords uninterrupted by barrier words maps to one or more CUIs in the UMLS Metathesaurus.
LARGE CELL CARCINOMA . BRONCHIAL WASH CYTOLOGY SPECIMEN
shows CLUSTERS of NEOPLASTIC CELLS with LARGE NUCLEI ,
PROMINENT NUCLEOLI , and ABUNDANT CYTOPLASM .
___________________________________________________________
LARGE CELL CARCINOMA ................... C0206704
BRONCHIAL WASH ......................... C0396473
CYTOLOGY SPECIMEN ...................... C0225355
shows .................................. C0332265
CLUSTERS ............................... C0205418
of ..................................... C0332285
NEOPLASTIC ............................. C0027651
CELLS .................................. C0007634
with ................................... C0332287
LARGE .................................. C0205164
NUCLEI ................................. C0007610
PROMINENT .............................. C0205402
NUCLEOLI ............................... C0007609
and .................................... C0332287
ABUNDANT ............................... C0205172
CYTOPLASM .............................. C0010834
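The barrier word method illustrated in Table 3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the barrier word list and the keyword dictionary contain only the entries shown in Table 3, punctuation is treated as a barrier, and keyword runs are coded by greedy longest-phrase matching, which is one plausible reading of "one or more CUIs" and not necessarily the authors' procedure. The function names are hypothetical.

```python
BARRIER_WORDS = {"shows", "of", "with", "and"}   # barrier words appearing in Table 3
PUNCTUATION = {".", ","}

# Keyword phrases and CUIs copied from Table 3.
KEYWORD_CUIS = {
    "LARGE CELL CARCINOMA": "C0206704", "BRONCHIAL WASH": "C0396473",
    "CYTOLOGY SPECIMEN": "C0225355", "CLUSTERS": "C0205418",
    "NEOPLASTIC": "C0027651", "CELLS": "C0007634", "LARGE": "C0205164",
    "NUCLEI": "C0007610", "PROMINENT": "C0205402", "NUCLEOLI": "C0007609",
    "ABUNDANT": "C0205172", "CYTOPLASM": "C0010834",
}


def keyword_runs(text):
    """Split text into runs of keywords, breaking at barrier words and punctuation."""
    runs, current = [], []
    for token in text.split():
        if token.lower() in BARRIER_WORDS or token in PUNCTUATION:
            if current:
                runs.append(" ".join(current))
                current = []
        else:
            current.append(token.upper())
    if current:
        runs.append(" ".join(current))
    return runs


def code_run(run):
    """Map one keyword run to (phrase, CUI) pairs, preferring the longest phrase."""
    words, coded, start = run.split(), [], 0
    while start < len(words):
        for end in range(len(words), start, -1):
            phrase = " ".join(words[start:end])
            if phrase in KEYWORD_CUIS:
                coded.append((phrase, KEYWORD_CUIS[phrase]))
                start = end
                break
        else:
            coded.append((words[start], None))   # no dictionary entry for this word
            start += 1
    return coded


SAMPLE = ("LARGE CELL CARCINOMA . BRONCHIAL WASH CYTOLOGY SPECIMEN shows CLUSTERS "
          "of NEOPLASTIC CELLS with LARGE NUCLEI , PROMINENT NUCLEOLI , "
          "and ABUNDANT CYTOPLASM .")
for run in keyword_runs(SAMPLE):
    print(run, code_run(run))
```

In this sketch the run BRONCHIAL WASH CYTOLOGY SPECIMEN has no single dictionary entry, so it is coded as the two phrases BRONCHIAL WASH and CYTOLOGY SPECIMEN, in agreement with the table.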
Table 4. Hypothetical Autopsy Report
Male. Caucasian. 1.91 m. 95.5 kg. (DeGregorio, 1997).
b. 8/27/1908. d. 1/22/1973.
Occupation: U. S. Congressman, U. S. Senator, U. S. President.
Status post: Appendectomy. (C0003611)
Status post: Cholecystectomy. (C0008320)
History of: Renal Calculi. (C0022650)
Myocardial Infarct, 1955. (C0027051)
Myocardial Infarct, April, 1972. (C0027051)
Myocardial Infarct, January 22, 1973. (C0027051)
Marked Generalized Atherosclerosis. (C0205082, C0205046, C0205246)
Appendix 1. Internet References.
American Board of Medical Specialties.
http://www.abms.org/
American College of Surgeons.
http://www.facs.org/
College of American Pathologists.
http://www.cap.org/
Humphreys BL. NLM/AHCPR Large Scale Vocabulary Test.
http://www.cpri.org/events/meetings/terminology/blh/sld001.htm
Johns Hopkins Autopsy Resource.
http://www.netautopsy.org/
Joint Commission on Accreditation of Healthcare Organizations.
http://www.jcaho.org/
Surveillance, Epidemiology, and End-Results (SEER):
http://www-seer.ims.nci.nih.gov
Systematized Nomenclature of Human and Veterinary Medicine.
http://www.snomed.org/
U. S. Code of Federal Regulations, 45 CFR Parts 160 - 164.
http://aspe.hhs.gov/admnsimp/
U. S. Health Insurance Portability and Accountability Act of 1996.
http://thomas.loc.gov
U. S. Human Genome Research Institute, GenBank.
http://www.ncbi.nlm.nih.gov/Genbank
U. S. National Bioethics Advisory Commission.
http://bioethics.gov/general.html
U. S. National Cancer Institute, Cancer Genome Anatomy Project.
http://www.ncbi.nlm.nih.gov/CGAP
U. S. National Cancer Institute, Cooperative Breast Cancer Tissue Resource.
http://www-cbctr.ims.nci.nih.gov/FAQ.html
U. S. National Cancer Institute, Cooperative Human Tissue Network.
http://www-chtn.ims.nci.nih.gov/
U. S. National Library of Medicine, PubMed.
http://www.ncbi.nlm.nih.gov/PubMed/
U. S. National Library of Medicine, Unified Medical Language System.
http://www.nlm.nih.gov/research/umls/
U. S. Office of Protection from Research Risks (OPRR).
http://grants.nih.gov/grants/oprr/oprr.htm
References
Anderson, R.E., Smith, R.D. and Benson, E.S. 1991. The accelerated graying of American pathology. Hum Pathol 22:210-214.
Arcidi, J.M. Jr., Moore, G.W. and Hutchins, G.M. 1981. Hepatic morphology in cardiac dysfunction. A clinicopathologic study of 1000 autopsied patients. Am J Pathol 104:159-166.
Association of Directors of Anatomic and Surgical Pathology. 1997. Recommendations for the reporting of specimens containing laryngeal neoplasms. Mod Pathol. 10:384-386.
Berman, J.J., Moore, G.W. and Hutchins, G.M. 1996. Maintaining patient confidentiality in the public domain Internet Autopsy Database (IAD). Proc AMIA Annu Fall Symp. 328-332.
Bubendorf, L., Kononen, J., Koivisto, P., Schraml, P., Moch, H., Gasser, T.C., Willi, N., Mihatsch, M.J., Sauter, G. and Kallioniemi, O.P. 1999. Survey of gene amplifications during prostate cancer progression by high-throughput fluorescence in situ hybridization on tissue microarrays. Cancer Res. 59:803-806.
Bubley, G.J., Carducci, M., Dahut, W., Dawson, N., Daliani, D., Eisenberger, M., Figg, W.D., Freidlin, B., Halabi, S., Hudes, G., Hussain, M., Kaplan, R., Myers, C., Oh, W., Petrylak, D.P., Reed, E., Roth, B., Sartor, O., Scher, H., Simons, J., Sinibaldi, V., Small, E.J., Smith, M.R., Trump, D.L., Vollmer, R. and Wilding, G. 1999. Eligibility and Response Guidelines for Phase II Clinical Trials in Androgen-Independent Prostate Cancer: Recommendations From the Prostate-Specific Antigen Working Group. J Clin Oncol. 17:3461-3467.
Carter, J.R., Nash, N.P., Cechner, R.L. and Platt, R.D. 1981. Proposal for a national autopsy data bank. A potential major contribution of pathologists to the health care of the nation. Am J Clin Pathol. 76(Suppl):597-617.
Choi, S.S., Kang, Y.S., Kim, U.J., Lee, K.H. and Shin, H.S. 1999. Chromosomal localization of ESTs obtained from human fetal liver via BAC-mediated FISH mapping. Mol Cells. 9:403-409.
Colby, T.V., Koss, M.N. and Travis, W.D. 1995. Armed Forces Institute of Pathology Atlas of Tumor Pathology. Tumors of the Lower Respiratory Tract. Electronic Fascicle version 2.0. Armed Forces Institute of Pathology.
Compton, C.C. 1999. Pathology Report in Colon Cancer: What Is Prognostically Important? Dig Dis. 17:67-79.
Connell, P.P., Rotmensch, J., Waggoner, S.E. and Mundt, A.J. 1999. Race and clinical outcome in endometrial carcinoma. Obstet Gynecol. 94:713-720.
Cote, R.A., Rothwell, D.J., Beckett, R.S., Palotay, J.L. and Brochu, L. 1993. SNOMED International. The Systematized Nomenclature of Human and Veterinary Medicine. College of American Pathologists.
Degregorio, W.A. 1997. The Complete Book of U.S. Presidents. Fifth Edition. Barricade Books.
Fedorowicz, J. 1982. A Zipfian model of an automatic bibliographic system: An application to MEDLINE. J Am Soc Info Sci 33:223-232.
Giere, W. 1981. Foundations of clinical data automation in cooperative programs. Proc 5th Ann Symp Comp Applic Med Care. 1142-1148.
Grizzle, W.E., Aamodt, R., Clausen, K., LiVolsi, V., Pretlow, T.G. and Qualman, S. 1998. Providing human tissues for research: how to establish a program. Arch Pathol Lab Med 122:1065-1076.
Hahn, U., Romacker, M. and Schulz, S. 1999. How knowledge drives understanding -- matching medical ontologies with the needs of medical language processing. Artif Intell Med 15:25-51.
Hruban, R., Westra, W.H. and Phelps, T.H. 1996. Surgical Pathology Dissection. Springer Verlag.
Hutchins, G.M. 1990. Autopsy. Performance and Reporting. College of American Pathologists.
Hutchins, G.M., Meuli, M., Meuli-Simmen, C., Jordan, M.A., Heffez, D.S. and Blakemore, K.J. 1996. Acquired spinal cord injury in human fetuses with myelomeningocele. Pediatr Pathol Lab Med. 16:701-712.
Hutchins, G.M., Berman, J.J., Moore, G.W., Hanzlick, R. and the Autopsy Committee of the College of American Pathologists. 1999. Practice Guidelines for Autopsy Pathology. Arch Pathol Lab Med. 123:1085-1092.
Junor, E.J., Hole, D.J., McNulty, L., Mason, M. and Young, J. 1999. Specialist gynaecologists and survival outcome in ovarian cancer: a Scottish national study of 1866 patients. Br J Obstet Gynaecol. 106:1130-1136.
Kao, G.F. and Moore, G.W. 2000. Dermatopathology False Negative Terms in Unified Medical Language System (UMLS). Arch Pathol Lab Med. 124: (in press).
Klausner, R.D. 1999. The Nation's Investment in Cancer Research: A Budget Proposal for Fiscal Year 2001. National Cancer Institute. 51-55.
Kononen, J., Bubendorf, L., Kallioniemi, A., Barlund, M., Schraml, P., Leighton, S., Torhorst, J., Mihatsch, M.J., Sauter, G. and Kallioniemi, O.P. 1998. Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nat Med. 4:844-847.
Kurman, R.J., Malkasian, G.D. jr., Sedlis, A. and Solomon, D. 1991. From Papanicolaou to Bethesda: the rationale for a new cervical cytology classification. Obstet Gynecol 77:779-782.
Light, R. 1997. Presenting XML. Sams.net Publishing.
Lilienfeld, D.E. and Stolley, P.D. 1994. Foundations of Epidemiology. Fifth Edition. Oxford University Press.
Mapp, T.J., Hardcastle, J.D., Moss, S.M. and Robinson, M.H. 1999. Randomized clinical trial: Survival of patients with colorectal cancer diagnosed in a randomized controlled trial of faecal occult blood screening. Br J Surg. 86:1286-1291.
Moch, H., Schraml, P., Bubendorf, L., Mirlacher, M., Kononen, J., Gasser, T., Mihatsch, M.J., Kallioniemi, O.P. and Sauter, G. 1999. High-throughput tissue microarray analysis to evaluate genes uncovered by cDNA microarray screening in renal cell carcinoma. Am J Pathol. 154:981-986.
Moore, G.W. and Hutchins, G.M. 1982. Consistency versus completeness in medical decision making: Application to 155 patients autopsied after coronary artery bypass graft surgery. Proc 6th Annu Symp Comput Appl Med Care. 805-811.
Moore, G.W., Boitnott, J.K., Miller, R.E., Eggleston, J.C. and Hutchins, G.M. 1988. Integrated anatomic pathology reporting system using natural language diagnoses. Mod Pathol 1:44-50.
Moore, G.W., Miller, R.E. and Hutchins, G.M. 1989. Indexing by MeSH titles of natural language pathology phrases identified on first encounter using the Barrier Word Method. In: Scherrer JR, Cote RA, Mandil SH, eds. Computerized Natural Medical Language Processing for Knowledge Representation. North-Holland. 29-39.
Moore, G.W. and Berman, J.J. 1994. Performance Analysis of Manual and Automated Systematized Nomenclature of Medicine (SNOMED) Coding. Am J Clin Pathol 101:253-256.
Moore, G.W., Berman, J.J., Hanzlick, R.L., Buchino, J.J. and Hutchins, G.M. 1996. A prototype internet autopsy database: 1625 consecutive fetal and neonatal autopsy facesheets spanning twenty years. Arch Pathol Lab Med. 120:782-785.
Moulton, G. 1999. Surveillance data take on a new statistical dimension. J Natl Cancer Inst. 91:671-673.
Mullick, F. 1997. The Center for Environmental Pathology and Toxicology at the Armed Forces Institute of Pathology. Hum Pathol. 52: 752-753.
Naber, S.P., Smith, L.L. jr. and Wolfe, H.J. 1992. Role of the frozen tissue bank in molecular pathology. Diagnostic Molecular Pathology. 1:73-79.
Nelson, S.J., Olson, N.E., Fuller, L., Tuttle, M.S., Cole, W.G. and Sherertz, D.D. 1995. Identifying concepts in medical knowledge. Medinfo. 8:33-36.
Nelson, R.L., Persky, V. and Turyk, M. 1999. Carcinoma in situ of the colorectum: SEER trends by race, gender, and total colorectal cancer. J Surg Oncol. 71:123-129.
Payne, C. 1995. Developing a standard dataset for the NHS. Version 3 of Read Codes addresses many difficulties. BMJ. 311:951.
Peery, T.M. 1978. The autopsy data bank. A proposal for pathologists to contribute to the health care of the nation. Am J Clin Pathol 69 (Suppl): 258-259.
Read, J.D. and Benson, T.J.R. 1986. Comprehensive coding. Brit J Health Care Computing 3:22-25.
Rosai, J. and Ackerman, L.V. 1996. Ackerman's Surgical Pathology. Eighth Edition. C.V. Mosby.
Schmitt, A.O., Specht, T., Beckmann, G., Dahl, E., Pilarsky, C.P., Hinzmann, B. and Rosenthal, A. 1999. Exhaustive mining of EST libraries for genes differentially expressed in normal and tumour tissues. Nucleic Acids Res. 27:4251-4260.
Schneier, B. 1996. Applied Cryptography, Second Edition. Protocols, Algorithms, and Source Code in C. John Wiley & Sons.
Simpson, A. 1996. HTML Publishing Bible, Windows 95 Edition. IDG Books Worldwide, Inc.
Smith, R.D., Benson, E.S. and Anderson, R.E. 1989. Some characteristics of the community practice of pathology in the United States. Arch Pathol Lab Med. 113:1335-1342.
Tersmette, K.W.F., Scott, A.F., Moore, G.W., Matheson, N.W. and Miller, R.E. 1988. Barrier word method for detecting molecular biology multiple word terms. Proc 12th Annu Symp Comput Appl Med Care.
U. S. Centers for Disease Control and Prevention. 1995. Manual of Procedures for the Reporting of Nationally Notifiable Disease to CDC. CDC.
U. S. Code of Federal Regulations. 1995. 45 CFR Subtitle A (10-1-95 Edition), part 46.101 (b) (4). U. S. Department of Health and Human Services. Office of the Secretary.
U. S. Code of Federal Regulations. 1999. 45 CFR Parts 160 - 164. Standards for Privacy of Individually Identifiable Health Information; Proposed Rule. Department of Health and Human Services. Office of the Secretary. Federal Register. 64:59917-60065.
http://aspe.hhs.gov/admnsimp/
U. S. Health Insurance Portability and Accountability Act. 1996. (HIPAA, Kennedy-Kassebaum Bill, H.R. 3103 of 104th U. S. Congress).
U. S. Government Documents at URL: http://thomas.loc.gov
U. S. National Bioethics Advisory Commission (NBAC). 1995. Executive Order 12975, October 3, 1995. Federal Register. 60:52063-52065.
http://bioethics.gov/general.html
Vigorita, V.J., Moore, G.W. and Hutchins, G.M. 1980. Absence of correlation between coronary arterial atherosclerosis and severity or duration of diabetes mellitus of adult onset. Am J Cardiol. 46:535-542.
Wagner, B.M. 1996. The future of environmental and toxicologic pathology. Hum Pathol. 27:1003-1004.
Zhang, Q. 1981. Easy entry of Chinese character set symbols. Proc 5th Ann Symp Comp Appl Med. 5:143-149.
Zipf, G.K. 1949. Human Behavior and the Principle of Least Effort. An Introduction to Human Ecology. Addison-Wesley Press, Reading, MA. 19-55.
Cios KJ, Moore GW.
Uniqueness of medical data mining.
Artif Intell Med. 2002 Sep-Oct;26(1-2):1-24.
PMID: 12234714.
Full Text of Article:
http://www.netautopsy.org/uniqmddm.htm
MEDICAL DATA MINING AND KNOWLEDGE DISCOVERY.
TABLE OF CONTENTS.
1 Medical Data Mining and Knowledge Discovery: An Introduction 1
Krzysztof J. Cios
1.1 Unique Features of Medical Data Mining 1
1.2 Defining Data Mining and Knowledge Discovery 2
1.3 Fundamental Issues of Data Mining and Knowledge Discovery 6
1.4 Data Mining Models 8
1.5 Knowledge Discovery Process 10
1.5.1 Understanding medical problem domain 11
1.5.2 Understanding the data 11
1.5.3 Preparation of the data 12
1.5.4 Data mining 12
1.5.5 Evaluation of the discovered knowledge 13
1.5.6 Using the discovered knowledge 13
1.6 Sources of Further Information 14
References 14
2 Legal Policy and Security Issues in the Handling of Medical Data 17
Joseph M. Saul, J.D.
2.1 General Considerations 17
2.2 The United States 19
2.2.1 The Health Insurance Portability and Accountability Act (HIPAA) 19
2.2.2 The Common Rule 27
2.2.3 State Medical Records Laws 28
2.3 The European Union 29
3 Medical Natural Language Understanding as a Supporting Technology for Data Mining in Healthcare 32
Werner Ceusters
3.1 Understanding the problem domain 32
3.1.1 Introduction 32
3.1.2 The many faces of language engineering 33
3.2 Understanding the data 39
3.2.1 Types of knowledge 39
3.2.2 Linguistic and conceptual knowledge 40
3.2.3 Linguistic semantics 41
3.2.4 Conceptual and linguistic ontologies 42
3.3 Preparation of the data for subsequent data mining 44
3.3.1 Technologies required 44
3.3.2 Preparing neurosurgical reports for data mining purposes: the MultiTale approach 45
3.4 Data mining 47
3.5 Evaluation 54
3.5.1 Discussion 56
3.6 Conclusion 57
References 58
4 Anatomic Pathology Data Mining 61
G. William Moore, Jules J. Berman
4.1 Understanding the Problem Domain 62
4.1.1 Objectives of Data Mining in Anatomic Pathology 62
4.1.2 Intended Uses for Anatomic Pathology Data Mining 62
4.1.3 Economic Issues in Anatomic Pathology Data Mining 68
4.1.4 Commercial Uses of Mined Anatomic Pathology Data 69
4.1.5 Past, Current Efforts to Standardize Pathology Data 70
4.1.6 Data Integrity Issues 71
4.1.7 Ethical and Legal Issues 72
4.2 Understanding the Data 75
4.2.1 Size of Potential Anatomic Pathology Data Domain 75
4.2.2 Description of an Anatomic Pathology Report 76
4.2.3 Converting Parts of a Report into Object Domains 79
4.3 Preparation of the Data 85
4.3.1 Data Ownership 85
4.3.2 Data Requirements 87
4.4 Data Mining Example 90
4.4.1 The Johns Hopkins Autopsy Resource 90
4.5 Conclusion 92
Appendix 1. Internet References. 103
References 105
5 A Data Clustering and Visualization Methodology for Epidemiological Pathology Discoveries 109
Doron Shalvi, Nicholas DeClaris
5.1 Introduction 109
5.2 Understanding the problem domain 110
5.2.1 Objectives 110
5.2.2 Project Plan 111
5.3 Understanding the data 112
5.3.1 Data collection 112
5.3.2 Data description 112
5.3.3 Data quality 113
5.4 Preparation of the data 117
5.4.1 Data selection 117
5.4.2 Data preprocessing 117
5.5 Data Mining 119
5.5.1 Self-Organizing Map (SOM) 120
5.5.2 Data Visualization 124
5.5.3 Adaptive Resonance Theory (ART) 133
5.6 Evaluation of the discovered knowledge 136
5.7 Using the discovered knowledge 138
5.8 Conclusions 139
5.9 Acknowledgements 140
5.10 References 141
6 Mining Structure-Function Associations in a Brain Image Database 143
Vasileios Megalooikonomou, Edward H. Herskovits
6.1 Understanding the problem domain 143
6.2 Understanding the data 144
6.3 Preparation of the data 146
6.4 Data mining 146
6.4.1 Atlas-based analysis 147
6.4.2 Voxel-based analysis 148
6.4.3 Results from mining BRAID 149
6.5 Evaluation of the discovered knowledge 154
6.5.1 The Lesion-Deficit Simulator 155
6.5.2 Results from the evaluation of the mining system 159
6.6 Using the discovered knowledge 166
References 168
7 ADRIS: an Automatic Diabetic Retinal Image Screening system 171
Kheng Guan Go