Belloni P., Boccuzzo G., Guzzinati S., Italiano I., Scarpa B., Rossi C.R., Rugge M., Zorzi M.

Data Science & Social Research 2019 - 2nd international conference on data science and social research

Milan, 4-5 February 2019



Healthcare record systems store valuable information, and over 40% of it is estimated to be unstructured, in the form of free clinical text [1]. The Veneto Cancer Registry provides a collection of pathology records that refer to cases of melanoma and contain free text, in particular the diagnosis written by a pathologist and the results of the microscopic and macroscopic analysis of the cancerous tissue. The aim of this research is to extract from the free text the size of the primary tumor, the involvement of lymph nodes, the presence of metastases, the cancer stage and the morphology of the tumor. We pursue this goal with text mining techniques based on a statistical approach. Because extracting information from free text can be framed as a statistical classification problem, we apply several data mining models to extract these variables from the text. A gold standard for these variables is available: the clinical records have already been assessed case by case by an expert. It is therefore possible to evaluate the quality of the information that the models extract from the clinical text by comparing the output of our procedure with the gold standard.
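As a rough illustration of framing extraction as classification and scoring it against a gold standard, the following minimal sketch trains a bag-of-words multinomial Naive Bayes classifier and compares its predictions with expert labels. Everything in it is an assumption made for illustration: the abstract does not name the models used, the report snippets and the labels `N+` / `N0` (lymph nodes involved or not) are invented and are not registry data, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and split on whitespace; a real pipeline would need more care.
    return text.lower().split()


def train_naive_bayes(texts, labels):
    # Count documents per class and word occurrences per class.
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, text):
    # Return the class with the highest log-posterior (Laplace smoothing).
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token in vocab:  # ignore words never seen in training
                smoothed = (word_counts[label][token] + 1) / (total_words + len(vocab))
                score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(predicted, gold):
    # Share of records where the extracted value matches the expert's value.
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Invented toy reports with hypothetical expert labels (not registry data).
train_texts = [
    "lymph node positive for melanoma",
    "metastatic deposits in sentinel node",
    "lymph node negative for melanoma",
    "sentinel node free of tumor",
]
train_labels = ["N+", "N+", "N0", "N0"]

test_texts = ["sentinel lymph node positive", "node free of melanoma"]
gold_labels = ["N+", "N0"]

model = train_naive_bayes(train_texts, train_labels)
predictions = [predict(model, text) for text in test_texts]
print(predictions, accuracy(predictions, gold_labels))
```

The same pattern would apply to each target variable in turn (tumor size, lymph node involvement, metastases, stage, morphology), with one classifier per variable and the expert-assessed value as the label. Accuracy is only one possible measure of agreement with the gold standard.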

[1] H. Dalianis, Clinical Text Mining: Secondary Use of Electronic Patient Records. Springer, 2018.