Exploring text classification methods in oncological medical notes using machine learning and deep learning
Description
Advances in preventive and personalized medicine, together with technological improvements that let patients interact more easily with their healthcare information, have increased the volume of healthcare data being gathered. A relevant part of these data is recorded in an unstructured format, as natural-language free text, which makes it harder for Clinical Decision Support Systems (CDSS) to process. Consequently, healthcare professionals become overwhelmed trying to stay up to date with each patient’s healthcare information, because they need more time to gather and analyze it manually. Furthermore, defining an oncology diagnosis and its treatment plan is a complex decision-making process because it is affected by a broad range of parameters. The main objective of this research is to apply several text classification methods to non-synthetic oncology clinical notes corpora to help with this decision-making process. First, the corpora were obtained from an oncology EHR system, from three different oncology clinics. Two corpus versions were created: the per-clinical-event version, with one record per medical note; and the per-patient version, with one record per patient containing his or her medical notes. Then, these corpora were preprocessed to improve the performance of the classifiers. As the last step, several machine learning methods and one deep learning text classification method were trained on these corpora, with each patient’s diagnosis as enriched data. The following machine learning and deep learning classification methods were applied: Multilayer Perceptron (MLP) neural network, Logistic Regression, Decision Tree classifier, Random Forest classifier, K-nearest neighbors (KNN) classifier, and Long Short-Term Memory (LSTM).
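As a rough illustration of how the two corpus versions relate, the sketch below derives a per-patient corpus from per-clinical-event records by joining each patient's notes into one record. The record layout, the sample notes, the function name `build_per_patient_corpus`, and the assumption of a single diagnosis per patient are all hypothetical; the abstract does not describe the actual data schema.

```python
from collections import defaultdict

# Hypothetical per-clinical-event records: (patient_id, note_text, diagnosis).
# The real corpus schema is not given in the abstract.
per_event_corpus = [
    ("p1", "patient reports breast lump", "breast"),
    ("p2", "psa elevated on follow-up", "prostate"),
    ("p1", "biopsy confirms carcinoma", "breast"),
]


def build_per_patient_corpus(records):
    """Merge all notes of each patient into a single labelled record."""
    notes = defaultdict(list)
    labels = {}
    for patient_id, text, diagnosis in records:
        notes[patient_id].append(text)
        # Assumption: each patient has exactly one diagnosis label.
        labels.setdefault(patient_id, diagnosis)
    return [(pid, " ".join(texts), labels[pid]) for pid, texts in notes.items()]


per_patient_corpus = build_per_patient_corpus(per_event_corpus)
```

In this sketch the per-patient version has fewer but longer records, so each training example carries more context about a patient.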
An additional experiment with an MLP classifier was performed to evaluate the influence of the preprocessing step on the results. It found that the classifier’s mean accuracy rose from 26.1% to 86.7% with the per-clinical-event corpus, and to 93.9% with the per-patient corpus. The best-performing classifier was the MLP with 2 hidden layers (800 and 500 neurons), which achieved 93.90% accuracy, a macro F1 score of 93.61%, and a weighted F1 score of 93.99%. The experiments were performed on a dataset of 3,308 medical notes from a small oncology clinic.
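The macro and weighted F1 scores reported above differ only in how per-class F1 values are averaged: macro F1 gives every class equal weight, while weighted F1 weights each class by its number of true examples. The following is a minimal sketch of that computation in plain Python. The function name `f1_scores` and the toy labels are illustrative assumptions and are not taken from the study's data or code.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro F1, weighted F1) for two equal-length label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true examples per class
    per_class = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return macro, weighted


# Toy, made-up labels for illustration only.
y_true = ["breast", "breast", "breast", "prostate"]
y_pred = ["breast", "breast", "prostate", "prostate"]
macro_f1, weighted_f1 = f1_scores(y_true, y_pred)
```

When classes are imbalanced, the two averages diverge, which is why reporting both gives a fuller picture of performance across diagnoses.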