Paper
Friday, July 15, 2005
Applying Data Mining Methods to Classify Smoking Cessation Status
Mollie R. Poynton, MSN, APRN, BC, College of Nursing, University of Utah, Salt Lake City, UT, USA
Learning Objective #1: Identify issues relevant to the classification of health behaviors, specifically smoking cessation status, using data mining methods
Learning Objective #2: Identify and characterize the process of Knowledge Discovery in Databases (KDD), commonly known as data mining
Knowledge Discovery in Databases (KDD) is the application of data mining methods to detect interesting patterns embedded in data. However, the utility of data mining methods for modeling health behavior patterns relevant to nursing, including smoking cessation, has not been established. This pilot study explored the process of applying data mining methods, including backpropagation neural networks (BPNNs), to model and classify smoking cessation status using data from the 2001 National Health Interview Survey (NHIS). BPNN algorithms are well suited to modeling complex relationships, such as those expected in a high-dimensional health survey data set. The data set required extensive pre-processing before BPNN algorithms could be applied, and feature subset selection was performed. Automated pattern search, the “data mining” step of the KDD process, was conducted using multiple methods, including BPNNs. Weights and thresholds were adjusted iteratively to aid pattern discovery. Models created during the KDD process were validated using 10-fold cross-validation and evaluated using performance metrics. The steps of pre-processing, feature subset selection, and data mining were revisited in the iterative approach characteristic of KDD. This study served as the pilot for a larger study encompassing cancer control data, and it established the potential to create predictive models of smoking cessation status based on health survey data. Models created using KDD and data mining methods may have both scientific and clinical implications. Discovered patterns may generate hypotheses for future scientific research and may be immediately useful in clinical informatics applications, such as decision support systems and mass customization of health behavior interventions.
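The 10-fold cross-validation step mentioned in the abstract can be illustrated with a minimal sketch. It is not the study's implementation: the function names (`k_fold_indices`, `cross_validate`, `fit_majority`, `predict_majority`), the labels "former" and "current", and the toy data are all assumptions made for illustration, and a majority-class baseline stands in for a BPNN to keep the example short.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Split range(n_samples) into k disjoint, near-equal test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding over the shuffled list gives folds whose sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]


def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Return per-fold accuracy for a model given by fit/predict callables."""
    accuracies = []
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        # Train only on the k-1 folds that are not held out.
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return accuracies


# Hypothetical stand-in classifier: always predicts the majority training label.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    return model


# Synthetic toy data (not NHIS data): 70 "former" and 30 "current" smokers.
X = [[i] for i in range(100)]
y = ["former"] * 70 + ["current"] * 30

fold_acc = cross_validate(X, y, fit_majority, predict_majority, k=10)
mean_acc = sum(fold_acc) / len(fold_acc)
print(f"mean accuracy over {len(fold_acc)} folds: {mean_acc:.2f}")
```

Any classifier exposing comparable fit and predict callables could be substituted for the baseline. Comparing a candidate model against the majority-class accuracy is one simple way to judge whether discovered patterns add predictive value.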