Abstract 3P
Background
ChatGPT is a web-based chatbot built on a large language model and tuned with machine learning and supervised techniques to mimic human conversation. It has attracted scientific attention and raised the question of whether it can serve as a tool in medical decision-making.
Methods
We tasked ChatGPT 4.0 with creating a multidisciplinary team (MDT) chat and provided it with clinical data from patients (pts) diagnosed with hormone receptor (HR)-positive, human epidermal growth factor receptor 2-negative early breast cancer (eBC) with intermediate clinico-pathological risk. These pts were candidates for the Oncotype DX® genomic test. Our goal was to compare our MDT recommendations with those generated by ChatGPT and to assess the consistency of its responses.
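As an illustration of how such response consistency can be quantified, a minimal sketch follows, using the standard variation-ratio definition (the proportion of responses differing from the mode); the response data are hypothetical and not taken from the study.

```python
from collections import Counter

# Nine hypothetical ChatGPT answers for one patient (illustrative data,
# not study data): each repeated chat either recommends the genomic
# test ("test") or not ("no test").
responses = ["test", "test", "no test", "test", "test",
             "test", "no test", "test", "test"]

# Modal (most frequent) recommendation across the nine chats.
mode, mode_count = Counter(responses).most_common(1)[0]

# Variation ratio: share of responses that differ from the mode;
# 0 means perfectly repeatable, values near 1 mean highly inconsistent.
variation_ratio = 1 - mode_count / len(responses)

print(mode, round(variation_ratio, 3))  # test 0.222
```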
Results
We gathered data from 100 consecutive pts: median age 57 years, evenly split between stages I and II, 35 premenopausal. Supplying clinical details (age, stage, menopausal status, HR expression, grading, Ki-67, comorbidities), we asked ChatGPT to assess the need for Oncotype DX®. Each case was presented 9 times in separate chats to test repeatability, yielding a modal vector with a mean variation ratio of 0.181. Only in 31 pts did ChatGPT consistently recommend a genomic test; taking its most frequent advice for each patient, it recommended the test for 61 pts. Next, we provided the Recurrence Scores of these 61 pts and asked for chemotherapy (CT) recommendations. The mean variation ratio of the responses was 0.069, and Cohen's kappa coefficient for inter-rater agreement between ChatGPT's and the actual CT recommendations was 0.62. For endocrine therapy, ChatGPT considered only menopausal status, not clinical risk: tamoxifen if premenopausal, an aromatase inhibitor if postmenopausal. When asked for concurrent CT and genomic test advice, its responses were inconsistent, offering CT to almost all pts regardless of its genomic testing recommendation.
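For reference, a mean variation ratio of 0.181 corresponds to the modal answer being given, on average, in about 0.819 × 9 ≈ 7.4 of the 9 chats per patient. Cohen's kappa is the standard chance-corrected agreement statistic; its general definition (not anything study-specific) is

\[
\kappa = \frac{p_o - p_e}{1 - p_e},
\]

where \(p_o\) is the observed agreement between ChatGPT's and the MDT's CT recommendations and \(p_e\) is the agreement expected by chance. By the common Landis–Koch convention, \(\kappa = 0.62\) falls at the lower end of the "substantial" band (0.61–0.80).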
Conclusions
ChatGPT is a generative model that produces output approximating the statistical distribution of its training data, but it lacks reasoning abilities. Its low repeatability, together with suboptimal inter-rater agreement, means it cannot yet replace an MDT. Effective clinical integration requires identifying the areas where ChatGPT's knowledge is genuinely beneficial.
Clinical trial identification
Editorial acknowledgement
Legal entity responsible for the study
The authors.
Funding
Has not received any funding.
Disclosure
A. Fontana: Non-Financial Interests, Institutional, Invited Speaker: MSD. All other authors have declared no conflicts of interest.