Analyzing Influenza-Like Illness With Unsupervised Machine Learning

Friday, 28 July 2017

Monique J. Bouvier, MSN1
Mary Barger, PhD, MPH1
David D'Ambrosio, PhD2
John C. Arnold, MD3
(1)Hahn School of Nursing and Bob and Betty Institute for Nursing Research, University of San Diego, San Diego, CA, USA
(3)Department of Pediatrics, Naval Medical Center San Diego, San Diego, CA, USA

Disclosure: The views expressed herein are those of the authors and do not reflect the official policy or position of the Department of the Navy, Department of Defense, or the United States Government.


The World Health Organization reports that acute respiratory infections continue to be the leading cause of global infectious disease morbidity and mortality, with almost 4 million deaths annually. An acute respiratory infection is caused by an infectious agent, bacterial or viral, with a wide spectrum of symptom presentation. More than 200 viruses can cause influenza-like illness (ILI), a sub-type of acute respiratory infection, and research into the symptomatic differences between virus types is still developing. However, symptom experience is highly subjective, so it is difficult to determine which of these viruses is causing the ILI without laboratory viral testing.

Symptoms are experiences stimulating changes in a person’s feelings and biopsychosocial factors; therefore, biological, psychological, and social factors may contribute to a person’s symptom experience. Several ILI symptom studies have examined the symptom experience in order to predict the diagnosis of influenza from the other causes of ILI, but did not yield satisfactory results. Other studies examining influenza symptom severity used dichotomous or linear sum analysis with few looking at symptoms over time.

Unsupervised machine learning is an approach that identifies patterns in datasets with minimal human input. Clustering is a common method of unsupervised machine learning where data are grouped together based on similarity. Recently, several studies have used clustering in medical applications such as predicting the recurrence of breast cancer or detection of Alzheimer’s disease.

Studies examining patient-reported ILI symptoms to predict virus type are limited, especially for virus types other than influenza. Most research focuses on the influenza virus and its symptoms, and not on the other common viruses identified as sources of ILI. The purpose of this study was to determine whether symptom presentation over the course of influenza-like illness (ILI) can predict virus type using unsupervised machine learning. Additionally, we sought to identify sub-populations with similar symptom experience.


A secondary analysis was performed on data from a prospective longitudinal study conducted by the Acute Respiratory Infection Consortium. The data were collected from 2009 to 2014 at five US military medical institutions. The population comprised otherwise healthy active duty military members, dependents, and retirees, aged 0-65 years, who presented to the clinic with influenza-like illness symptoms. A nasopharyngeal sample was collected for virus identification. Subjects reported their symptom severity at enrollment and at follow-up visits on days 3, 7, and 28, using an instrument designed for this study. The instrument had subjects rate their symptom severity on a 4-point nominal scale for 20 symptoms associated with influenza-like illness.

The sample used for this study was limited to subjects with complete symptom severity data on days 0, 3, and 7, and no viral co-infections. The unsupervised machine learning approach, k-means clustering, was used to analyze the symptom data. For the initial analysis, subjects were clustered by their individual symptom severity scores for all visits. The clusters were examined to identify whether any of them represented a specific virus or group of viruses. Because it was unknown how the different viruses’ symptoms were expressed, clustering was run with k values from 5 to 10. The secondary analyses clustered subjects with a specific viral diagnosis (influenza A, rhinovirus, or coronavirus) by symptom expression. The subject attributes of sex, military status, age, BMI, smoking, and ethnicity were compared among the clusters to identify how specific groups may experience the specified virus.
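The clustering step described above can be illustrated with a minimal sketch. It is not the study's code: it uses a plain Lloyd's-algorithm k-means written with the Python standard library, and the data are randomly generated stand-ins. The 0-3 coding of the 4-point scale, the layout of 20 symptoms by 3 visits per subject, and the names `kmeans` and `subjects` are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm. Returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    # Start from k distinct subjects chosen at random.
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = [-1] * len(rows)
    for _ in range(n_iter):
        # Assignment step: each subject joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(r, centroids[c]))
            for r in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(squared_distance(r, centroids[lab]) for r, lab in zip(rows, labels))
    return labels, centroids, inertia


# Synthetic stand-in data (NOT study data): 120 hypothetical subjects, each
# with 20 symptoms rated 0-3 at 3 visits, flattened into one 60-value row.
data_rng = random.Random(42)
N_SUBJECTS, N_SYMPTOMS, N_VISITS = 120, 20, 3
subjects = [
    [data_rng.randint(0, 3) for _ in range(N_SYMPTOMS * N_VISITS)]
    for _ in range(N_SUBJECTS)
]

# Run the clustering once for each k from 5 to 10, as in the abstract.
results = {}
for k in range(5, 11):
    labels, centroids, inertia = kmeans(subjects, k)
    results[k] = (labels, inertia)
    print(f"k={k}: within-cluster sum of squares = {inertia:.1f}")
```

In a real analysis, the cluster labels for each k would then be cross-tabulated against the laboratory-confirmed virus type to see whether any cluster is dominated by one virus.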


The initial analysis was unable to predict virus type based on the individual symptom severity scores using a variety of scoring approaches. Only k=7 clustering revealed some promising differences, but detailed analysis showed that the clusters were not significantly different (p>0.05) from the overall population, with the exception of one cluster, which had a higher coronavirus percentage than the overall population. However, this cluster contained eight virus types in total, so it is not specific enough for diagnostic purposes.

The secondary analyses of subject attributes for the rhinovirus (n=101), influenza A (n=107), and coronavirus (n=51) groups generated favorable results. At least one symptom cluster in each group yielded a statistically significant difference based on subject attributes using one-way ANOVA or chi-square testing. The clustered rhinovirus data exhibited statistically significant differences (p<.001) in five of the six attributes: sex, BMI, age, smoking history, and military status. The clustered influenza A data showed a statistically significant difference in the attribute sex (p<.001), and approached significance in the military status attribute. The clustered coronavirus data showed statistically significant differences only in sex (p<.001), which was expected as the data set was well distributed. Overall, patients in the different virus clusters experienced symptoms differently compared to the total population for that virus type.
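As an illustration of the chi-square comparison of a categorical attribute across clusters, the following sketch computes a Pearson chi-square test of independence using only the Python standard library. The contingency table is made up (it is not study data), the function names are hypothetical, and no continuity correction is applied. The one-way ANOVA used for continuous attributes such as age or BMI is not shown.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    z = x / 2.0
    # Closed-form starting points: Q(1, z) for even df, Q(1/2, z) for odd df.
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while 2 * a < df:
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1.0
    return q


def chi_square_independence(table):
    """Pearson chi-square test on a rows-by-columns table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, chi2_sf(stat, df)


# Hypothetical counts (NOT study data): rows are symptom clusters,
# columns are the two levels of a categorical attribute such as sex.
table = [
    [20, 10],
    [10, 20],
    [15, 15],
]
stat, df, p_value = chi_square_independence(table)
print(f"chi-square = {stat:.2f}, df = {df}, p = {p_value:.4f}")
```

When software prints a p-value as .000, the value is below the display precision rather than zero, which is why such results are conventionally reported as p<.001.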


Although virus type could not be predicted based on physical symptom score, some differences in symptoms among virus types were anecdotally observed. Additionally, the results showed that people with the same virus infection experience physical symptoms differently. Moreover, the secondary analyses reinforced that a person’s attributes may result in different physical symptom presentation. For that reason, future research should consider using a symptom severity instrument that measures more than physical symptoms and captures psychological, environmental, and other aspects as conceptualized by the symptom management theory.

Nurses track and follow the care of a patient closely, and are typically the first to see a change in a patient’s status. It is important for nurses to be able to recognize a change in symptom severity and how different attributes may affect symptom presentation. A slight change in a patient’s symptom presentation can be the beginning of a worsening of the illness.

Additionally, unsupervised machine learning could become a useful technique for identifying patterns in research with large data sets. It could open new avenues of patient data analysis and may reveal knowledge and factors that are not obvious using traditional statistical approaches. This novel approach needs to be tested against other approaches to more fully understand its usefulness.