Text data mining: A case study using the internal tobacco industry documents

Martha C Michel, MS, Graduate Group in Biological and Medical Informatics, University of California, San Francisco, P.O. Box 0613, 3333 California St. - Laurel Heights, San Francisco, CA 94143, 415-502-8183, martham@itsa.ucsf.edu and Lisa A. Bero, PhD, Institute for Health Policy Studies, Department of Clinical Pharmacy, and Center for Tobacco Control Research and Education, University of California, San Francisco, 530 Parnassus Avenue, Suite 366, Box 1390, Library, San Francisco, CA 94143.

Text data mining is a new informatics research method with applications for public health data collections. Using statistical algorithms and user-guided analysis, text data mining can help a researcher derive new information and knowledge from a corpus and potentially discover “nuggets” of information that would otherwise go unnoticed. It can also be used for exploratory data analysis to generate new hypotheses for researching a set of documents. Text data mining has the potential to help users find new connections in mountains of text.

Within the academic literature, there is an ongoing debate over what text mining is and what terminology should be used to describe it. Like numerical data mining, text data mining should be approached with caution: the analyst must pay attention to the increased potential for spurious results.

We will provide several examples of text data mining relevant to public health, drawn from the internal tobacco industry documents. As a result of the Master Settlement Agreement, 6.9 million documents have been released on the Internet at the UCSF Legacy Tobacco Documents Library. We will identify software packages that are useful for text data mining and describe the use of SAS Text Miner to discover new relationships in the internal tobacco documents using various statistical algorithms, including neural networks, support vector machines, and clustering. As the case study of the internal tobacco industry documents demonstrates, text data mining has the potential to be useful for large collections of public health reports and other text.

Learning Objectives:

Understand the definition and purpose of text mining.
Identify existing software that can be used to text mine public health document corpora.
Define and describe three advantages and disadvantages of text mining, as illustrated by the internal tobacco industry documents.

Keywords: Tobacco Policy, Public Health Informatics

Presenting author's disclosure statement:
I do not have any significant financial interest/arrangement or affiliation with any organization/institution whose products or services are being discussed in this session.

Tobacco Issues Update Poster Session

The 132nd Annual Meeting (November 6-10, 2004) of APHA