What if big data could diagnose cancer before your doctor does?

That was the aim of a study conducted by a lab in the Netherlands. The researchers developed a data-driven pre-processing pipeline based on a dataset of over 260,000 electronic medical records (EMRs). Unlike a hypothesis-driven model, a data-driven model analyses all the available data and can therefore be trained to detect various diseases. EMRs hold a tremendous amount of data: medication prescriptions, lab results, information from consultations, and more. The pipeline was evaluated on the prediction of colorectal cancer (CRC), the third most common cancer, which is difficult to detect because of its non-specific symptoms.

As shown in the figure below, the pipeline, written in Python, consists of four steps. First, lab measurements are compared to a set reference value and to other measurements taken from the same patient over a period of time. Second, the dataset is enriched with additional information from the web; for example, a medication is linked to its side effects. Third, patterns are extracted from the succession or co-occurrence of certain events. Finally, the pre-processed data is converted into one input vector per person that machine learning algorithms can read.

Figure: Skeleton of the pre-processing pipeline
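
To make the four steps more concrete, here is a minimal Python sketch of what such a pipeline could look like. It is not the authors' code: the `Event` class, the function names, the reference range, and the side-effect table are all hypothetical placeholders chosen for illustration, and the numbers are not clinical guidance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    day: int                       # days since the start of the observation window
    kind: str                      # "lab", "medication", "side_effect", ...
    code: str                      # e.g. "hemoglobin"
    value: Optional[float] = None  # only set for raw lab measurements


# Hypothetical lookup tables (illustrative values only).
REFERENCE_RANGES = {"hemoglobin": (12.0, 17.5)}
SIDE_EFFECTS = {"iron_supplement": ["constipation"]}


def abstract_labs(events):
    """Step 1: compare labs to a reference range and to the previous value."""
    out, last = [], {}
    for e in sorted(events, key=lambda ev: ev.day):
        if e.kind != "lab" or e.code not in REFERENCE_RANGES:
            out.append(e)
            continue
        low, high = REFERENCE_RANGES[e.code]
        level = "low" if e.value < low else "high" if e.value > high else "normal"
        out.append(Event(e.day, "lab", f"{e.code}_{level}"))
        if e.code in last and e.value != last[e.code]:
            trend = "falling" if e.value < last[e.code] else "rising"
            out.append(Event(e.day, "lab", f"{e.code}_{trend}"))
        last[e.code] = e.value
    return out


def enrich(events):
    """Step 2: link each medication to its known side effects."""
    out = list(events)
    for e in events:
        if e.kind == "medication":
            out += [Event(e.day, "side_effect", s) for s in SIDE_EFFECTS.get(e.code, [])]
    return out


def temporal_patterns(events):
    """Step 3: record succession (a->b) and same-day co-occurrence (a&b)."""
    patterns = set()
    for a in events:
        for b in events:
            if a.code == b.code:
                continue
            if a.day < b.day:
                patterns.add(f"{a.code}->{b.code}")
            elif a.day == b.day and a.code < b.code:
                patterns.add(f"{a.code}&{b.code}")
    return patterns


def to_vector(events, patterns, vocabulary):
    """Step 4: one binary input vector per person."""
    present = {e.code for e in events} | patterns
    return [1 if feature in present else 0 for feature in vocabulary]


# A toy patient record (invented for this example).
patient = [
    Event(0, "lab", "hemoglobin", 13.0),
    Event(30, "medication", "iron_supplement"),
    Event(60, "lab", "hemoglobin", 11.0),
]
events = enrich(abstract_labs(patient))
patterns = temporal_patterns(events)
vocabulary = [
    "hemoglobin_low",
    "hemoglobin_falling",
    "constipation",
    "hemoglobin_normal->hemoglobin_low",
    "hemoglobin_high",
]
vector = to_vector(events, patterns, vocabulary)
print(vector)
```

In a real setting the vocabulary would be derived from the whole dataset rather than written by hand, and the resulting vectors would be passed to a classifier.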

After pre-processing, the performance of the model was tested with established machine learning techniques. It was compared to two known benchmarks for CRC diagnosis: (1) age and gender and (2) the Bristol and Birmingham hypothesis-driven algorithm. The present model showed significantly higher precision in CRC detection, with a score of 89.1% (versus 83.6% for (1) and 86.4% for (2)). It also identified familiar symptoms of CRC, such as anemia and constipation, and pinpointed conditions associated with a higher risk of CRC, such as diabetes and hypertension.
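
As a reminder of what a precision score measures, here is a small sketch that assumes the term is used in its standard classification sense, that is, the share of predicted positives that are true positives. The labels and predictions below are invented toy data and do not reproduce the study's results.

```python
def precision(y_true, y_pred):
    """Precision = true positives / (true positives + false positives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Toy ground truth and predictions (1 = CRC, 0 = no CRC), invented for illustration.
y_true        = [1, 0, 1, 1, 0, 0, 1, 0]
model_pred    = [1, 0, 1, 1, 0, 1, 1, 0]  # hypothetical data-driven model
baseline_pred = [1, 1, 1, 0, 0, 1, 1, 0]  # hypothetical age-and-gender baseline

model_precision = precision(y_true, model_pred)
baseline_precision = precision(y_true, baseline_pred)
print(f"model: {model_precision:.2f}, baseline: {baseline_precision:.2f}")
```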

As this study demonstrates, this kind of pre-processing pipeline has great potential to enhance disease prediction and detection.

Reference:

Büchner, F.L., Hoogendoorn, M., Kop, R., Moons, L.M.G., Numans, M.E., Slottje, P., ten Teije, A. (2016). Predictive modeling of colorectal cancer using a dedicated pre-processing pipeline on routine electronic medical records. Computers in Biology and Medicine, 76, 30-38. doi: 10.1016/j.compbiomed.2016.06.019
