Andreotti Alessandra, Baracco Maddalena, Carpin Eva, Dal Cin Antonella, Fiore Anna Rita, Guzzinati Stefano, Memo Laura, Zorzi Manuel.

48th Annual Meeting of the Group for Cancer Epidemiology and Registration in Latin Language Countries (GRELL). 

Lausanne, 15-17 May 2024

 

Abstract

Background

In pathology reports, information on biological markers is often contained in a free text field related to the pathological diagnosis of cancer.

Aim

To train models that extract and predict the values of female breast cancer biological markers reported in pathology reports, by applying Text Mining (TM) and Machine Learning (ML) methods.

Methods

We first implemented a TM algorithm to extract the relevant text from the diagnosis field; we then applied a supervised Support Vector Machine (SVM) ML algorithm to predict the values of the following biological markers: estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2) and the proliferation marker Ki-67.
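The text-extraction step can be illustrated with a minimal sketch. The abstract does not describe the actual TM rules, so the patterns, the 30-character window, the function names and the English sample report below are all hypothetical; a real pipeline would work on the original report language and would feed the extracted text to the SVM rather than parse values directly.

```python
import re

# Hypothetical patterns: the actual TM rules are not given in the abstract.
MARKER_PATTERNS = {
    "ER": r"\b(?:ER|estrogen receptors?)\b",
    "PR": r"\b(?:PR|PgR|progesterone receptors?)\b",
    "HER2": r"\bHER-?2\b",
    "Ki-67": r"\bKi-?67\b",
}


def extract_snippets(diagnosis: str, window: int = 30) -> dict:
    """Return, for each marker, the text following its first mention."""
    snippets = {}
    for marker, pattern in MARKER_PATTERNS.items():
        match = re.search(pattern, diagnosis, flags=re.IGNORECASE)
        if match:
            tail = diagnosis[match.end(): match.end() + window]
            snippets[marker] = tail.strip(" :,;")
    return snippets


def parse_markers(diagnosis: str) -> dict:
    """Read a percentage (ER, PR, Ki-67) or a 0-3+ score (HER2) from each snippet."""
    values = {}
    for marker, text in extract_snippets(diagnosis).items():
        if marker == "HER2":
            m = re.search(r"\b[0-3]\+?(?!\d)", text)
            values[marker] = m.group(0) if m else None
        else:
            m = re.search(r"(\d{1,3})\s*%", text)
            values[marker] = int(m.group(1)) if m else None
    return values


# Invented example report, for illustration only.
report = "Invasive ductal carcinoma. ER: 90%, PR: 15%. HER2 score 2+. Ki-67: 25%."
print(parse_markers(report))
```

Rule-based parsing of this kind breaks easily on free text with varied wording, which is one reason to place a supervised classifier after the extraction step.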

Results

The data used to train the models were extracted from the Veneto Cancer Registry (VCR) and comprise 9,807 anonymized pathology reports relating to 4,029 patients with breast cancer diagnosed between 2017 and 2020, including the Gold Standard (GS, i.e. data recorded manually by VCR registrars). The reports come from 7 Pathology Services of the Veneto Region. The weighted F1 score for the exact biomarker values ranges from 87.1% for Ki-67 to 91.6% for HER2. By comparison, the score for the thresholds defined by the Italian Oncology Association to identify the cancer phenotypes is higher, ranging from 95.4% for HER2 to 99.6% for ER.
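As an illustration of the metric, the weighted F1 score is the per-class F1 averaged with weights proportional to each class's frequency in the GS. The sketch below uses invented values and a hypothetical cutoff of 10 (not the Italian Oncology Association thresholds, which the abstract does not list) to show why scores on thresholded values tend to be higher than on exact values.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support in y_true."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)


def binarise(values, cutoff):
    """Map numeric marker values to positive/negative at a given cutoff."""
    return ["pos" if v >= cutoff else "neg" for v in values]


# Invented example: one exact value is mispredicted (80 vs 70).
gold = [90, 80, 0, 15, 90]
pred = [90, 70, 0, 15, 90]

exact_score = weighted_f1(gold, pred)
# Hypothetical cutoff of 10, for illustration only.
threshold_score = weighted_f1(binarise(gold, 10), binarise(pred, 10))
print(exact_score, threshold_score)
```

In this example the mispredicted value falls on the same side of the cutoff as the GS value, so it counts as an error for the exact score but not for the thresholded one.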

Conclusion & Discussion

The predictive performance of the ML models, measured by the weighted F1 score, is high: at least 87.1% for exact values and at least 95.4% for the phenotype thresholds. An important next step will be to apply our models to the pathology reports of patients affected by breast cancer in 2021 and to compare the models’ predictions with the corresponding GS. Furthermore, these models will be applied to the reports of the other 15 Pathology Services of the Veneto Region and subsequently verified on a sampling basis by the VCR registrars.