Abstract
Our study evaluates the diagnostic accuracy of ChatGPT-4o in classifying various skin lesions, highlighting its limitations in distinguishing squamous cell carcinoma from basal cell carcinoma using dermatoscopic images.
JMIR Dermatol 2025;8:e67299. doi:10.2196/67299
Introduction
Squamous cell carcinoma (SCC) and basal cell carcinoma (BCC) are prevalent skin cancers that can cause significant local tissue damage and disfigurement, as well as mortality in cases of aggressive SCC [1,2]. With rising incidence, early and accurate diagnosis is essential for appropriate treatment [3]. Differentiating SCC and BCC from other common skin lesions, such as actinic keratoses (AK), benign keratoses (BK), and melanocytic nevi, can be challenging [4]. As artificial intelligence (AI) becomes increasingly integrated into clinical practice and increasingly accessible, concerns arise about its ability to provide accurate diagnostic assessments [5,6]. We assessed the ability of ChatGPT to distinguish images of SCC and BCC from other lesions.
Methods
OpenAI’s application programming interface was used to query ChatGPT-4 Omni (ChatGPT-4o) to assess its performance in classifying dermatoscopic images from the HAM10000 database [7]: 200 images each of SCC, BCC, BK, and melanocytic nevi, and 150 images of AK. Images were verified by histopathology (>50%), follow-up examination, expert consensus, or in vivo confocal microscopy. Two standardized prompts were used (an illustrative query sketch follows Prompt 2):
Prompt 1
This is an image on the Step 1 examination, and the multiple-choice question is as follows: Based on the image, does the patient have (A) Nevus, (B) Actinic Keratosis (AK), (C) Benign Keratosis (BK), or (D) BCC, or (E) SCC. Only output (A), (B), (C), (D) or (E).
Prompt 2
This is an image from a patient. Based on the image, does the patient have (A) Nevus, (B) AK, (C) BK, (D) BCC, or (E) SCC. Only output (A), (B), (C), or (D) or (E).
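The following is a minimal sketch of how such a query can be issued through OpenAI’s Python SDK; the model string, file name, and helper function are assumptions for illustration, not details reported in this study.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_1 = (
    "This is an image on the Step 1 examination, and the multiple-choice "
    "question is as follows: Based on the image, does the patient have "
    "(A) Nevus, (B) Actinic Keratosis (AK), (C) Benign Keratosis (BK), "
    "or (D) BCC, or (E) SCC. Only output (A), (B), (C), (D) or (E)."
)

def classify_image(path: str) -> str:
    # Encode the dermatoscopic image as a base64 data URI for the vision input.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model identifier for ChatGPT-4o
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT_1},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()

print(classify_image("ISIC_0024306.jpg"))  # hypothetical HAM10000 file name
```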
The key metrics calculated were accuracy, sensitivity, and specificity. The sole exclusion criterion was any dermatoscopic image that ChatGPT refused to classify; these images were excluded from all calculations.
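A minimal sketch of how these one-vs-rest metrics can be computed after dropping refusals; the label codes and refusal sentinel are illustrative assumptions.

```python
def one_vs_rest_metrics(y_true, y_pred, cls, refused="REFUSED"):
    """Accuracy, sensitivity, and specificity for one class vs. the rest."""
    # Exclude images the model refused to classify, per the study's criterion.
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if p != refused]
    tp = sum(t == cls and p == cls for t, p in pairs)
    tn = sum(t != cls and p != cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Toy example: "E" = SCC; the one refusal is dropped from the denominator.
y_true = ["E", "D", "A", "E"]
y_pred = ["D", "D", "A", "REFUSED"]
print(one_vs_rest_metrics(y_true, y_pred, "E"))
```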
The study did not employ further prompt engineering to enhance ChatGPT’s performance because the goal was to evaluate its diagnostic accuracy using straightforward, unrefined prompts that reflect real-world use by patients and clinicians. The use of simple prompts also highlights the model’s sensitivity to variations in language, underscoring the unpredictability and variability of these AI systems.
Results
For Prompt 1, ChatGPT classified nevi with an accuracy of 79.3% (95% CI 76.7%‐81.9%), sensitivity of 0.844, and specificity of 0.758. The accuracy for classifying BCC was 77.8% (95% CI 75.2%‐80.4%), with low sensitivity (0.081) and high specificity (0.959). The accuracy for classifying SCC was 66.1% (95% CI 63.1%‐69.1%), with sensitivity of 0.477 and specificity of 0.711 (Table 1).
For Prompt 2, SCC accuracy increased to 72.8% (95% CI 70.0%‐75.6%), but sensitivity dropped to 0.245. Nevi accuracy declined slightly to 72.8%, while SCC specificity improved to 0.857 (Table 2).
Table 1. Classification performance by class for Prompt 1.

| Class | Sample size | Accuracy (95% CI) | Sensitivity | Specificity | F1 score |
| --- | --- | --- | --- | --- | --- |
| Actinic keratosis | 149 | 73.0% (70.2‐75.8) | 0.356 | 0.802 | 0.294 |
| Basal cell carcinoma | 198 | 77.8% (75.2‐80.4) | 0.081 | 0.959 | 0.132 |
| Nevus | 199 | 79.3% (76.7‐81.9) | 0.844 | 0.758 | 0.649 |
| Benign keratosis | 200 | 74.4% (71.6‐77.2) | 0.090 | 0.939 | 0.138 |
| Squamous cell carcinoma | 199 | 66.1% (63.1‐69.1) | 0.477 | 0.711 | 0.373 |
Table 2. Classification performance by class for Prompt 2.

| Class | Sample size | Accuracy (95% CI) | Sensitivity | Specificity | F1 score |
| --- | --- | --- | --- | --- | --- |
| Actinic keratosis | 149 | 72.9% (70.1‐75.7) | 0.423 | 0.774 | 0.329 |
| Basal cell carcinoma | 200 | 79.5% (76.9‐82.1) | 0.070 | 0.987 | 0.125 |
| Nevus | 200 | 72.8% (70.0‐75.6) | 0.890 | 0.664 | 0.580 |
| Benign keratosis | 200 | 73.7% (70.9‐76.5) | 0.180 | 0.885 | 0.223 |
| Squamous cell carcinoma | 200 | 72.8% (70.0‐75.6) | 0.245 | 0.857 | 0.275 |
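The reported intervals are consistent with a normal-approximation (Wald) confidence interval on the pooled accuracy. A quick check, assuming approximately 945 classified images under Prompt 1 (the sum of the Table 1 sample sizes), reproduces the SCC interval above:

```python
from math import sqrt

def wald_ci(acc, n, z=1.96):
    # 95% normal-approximation CI for a proportion.
    half = z * sqrt(acc * (1 - acc) / n)
    return acc - half, acc + half

lo, hi = wald_ci(0.661, 945)  # SCC accuracy under Prompt 1, assumed n
print(f"{lo:.1%} to {hi:.1%}")  # ~63.1% to 69.1%
```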
Discussion
ChatGPT-4o struggled to differentiate between SCC and BCC. Nevus classification was the most accurate, with high F1 scores and minimal false-positive results, demonstrating proficiency in identifying less ambiguous lesions. The model showed significant bias in SCC classification, frequently misclassifying SCC as BCC with a high rate of false-positive results, as illustrated in the confusion-matrix sketch below. This aligns with previous research observing that SCC is often mistaken for BCC, particularly when features such as pigmentation or rolled borders overlap [8]. ChatGPT’s performance worsened with Prompt 2, where SCC was frequently misclassified as AK. Previous authors noted that AI performs comparably to dermatologists in binary choices, but our study further highlights the difficulty AI faces in multiclass differentiation [9].
Prompt 1 was designed to emulate a standardized examination scenario, leveraging ChatGPT’s ability to respond to structured, multiple-choice questions within a controlled academic framework. This approach was necessary because ChatGPT restricts responses to direct health-related inquiries, necessitating creative prompt construction to elicit diagnostic outputs. In contrast, Prompt 2 adopted more generic phrasing reflective of a patient inquiry, to evaluate how conversational language might influence diagnostic accuracy. This design choice was informed by the observation that variations in prompt language can significantly impact AI-generated outputs.
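Misclassification patterns such as the SCC-to-BCC confusion noted above are easiest to inspect in a confusion matrix; a brief sketch using scikit-learn, with purely illustrative labels rather than the study’s outputs:

```python
from sklearn.metrics import confusion_matrix

labels = ["nevus", "AK", "BK", "BCC", "SCC"]
# Illustrative predictions only; real labels would come from the API queries.
y_true = ["SCC", "SCC", "SCC", "BCC", "nevus", "AK"]
y_pred = ["BCC", "BCC", "AK", "BCC", "nevus", "BK"]

# Row i, column j counts images of true class i predicted as class j,
# so SCC-to-BCC errors appear in row "SCC", column "BCC".
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```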
Limitations include the use of a single dataset, which may not represent the diversity of skin lesions encountered in clinical settings, and the lack of consideration of variations in image quality. Future work should focus on expanding training data diversity and improving the handling of varied imaging scenarios to enhance diagnostic accuracy. We concur with Labkoff et al that precautions such as training clinicians on the limitations of AI systems and implementing standardized protocols to validate AI-generated diagnoses before acting on them would help ensure safe and effective integration into clinical workflows [10].
Conflicts of Interest
None declared.
References
1. Peris K, Fargnoli MC, Garbe C, et al. Diagnosis and treatment of basal cell carcinoma: European consensus-based interdisciplinary guidelines. Eur J Cancer. Sep 2019;118:10-34. [CrossRef] [Medline]
2. Schmults CD, Blitzblau R, Aasi SZ, et al. NCCN Guidelines® insights: squamous cell skin cancer, version 1.2022. J Natl Compr Canc Netw. Dec 2021;19(12):1382-1394. [CrossRef] [Medline]
3. Urban K, Mehrmal S, Uppal P, Giesey RL, Delost GR. The global burden of skin cancer: a longitudinal analysis from the Global Burden of Disease Study, 1990-2017. JAAD Int. Mar 2021;2:98-108. [CrossRef] [Medline]
4. Ahnlide I. Aspects of skin cancer diagnosis in clinical practice. Lund University; 2015. URL: https://lucris.lub.lu.se/ws/portalfiles/portal/3030914/8167764.pdf [Accessed 2025-01-07]
5. O’Hern K, Yang E, Vidal NY. ChatGPT underperforms in triaging appropriate use of Mohs surgery for cutaneous neoplasms. JAAD Int. Sep 2023;12:168-170. [CrossRef] [Medline]
6. Daneshjou R, Vodrahalli K, Novoa RA, et al. Disparities in dermatology AI performance on a diverse, curated clinical image set. Sci Adv. Aug 12, 2022;8(32):eabq6147. [CrossRef] [Medline]
7. Scarlat A. Melanoma dataset. URL: https://www.kaggle.com/datasets/drscarlat/melanoma [Accessed 2025-01-07]
8. Ryu TH, Kye H, Choi JE, Ahn HH, Kye YC, Seo SH. Features causing confusion between basal cell carcinoma and squamous cell carcinoma in clinical diagnosis. Ann Dermatol. Feb 2018;30(1):64-70. [CrossRef] [Medline]
9. Escalé-Besa A, Vidal-Alaball J, Miró Catalina Q, Gracia VHG, Marin-Gomez FX, Fuster-Casanovas A. The use of artificial intelligence for skin disease diagnosis in primary care settings: a systematic review. Healthcare (Basel). Jun 13, 2024;12(12):1192. [CrossRef] [Medline]
10. Labkoff S, Oladimeji B, Kannry J, et al. Toward a responsible future: recommendations for AI-enabled clinical decision support. J Am Med Inform Assoc. Nov 1, 2024;31(11):2730-2739. [CrossRef] [Medline]
Abbreviations
AI: artificial intelligence
AK: actinic keratoses
BCC: basal cell carcinoma
BK: benign keratoses
SCC: squamous cell carcinoma
Edited by Robert Dellavalle; submitted 10.10.24; peer-reviewed by Chenxu Wang, Shaniko Kaleci; final revised version received 15.01.25; accepted 16.01.25; published 21.03.25.
Copyright © Nitin Chetla, Matthew Chen, Joseph Chang, Aaron Smith, Tamer Rajai Hage, Romil Patel, Alana Gardner, Bridget Bryer. Originally published in JMIR Dermatology (http://derma.jmir.org), 21.3.2025.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Dermatology, is properly cited. The complete bibliographic information, a link to the original publication on http://derma.jmir.org, as well as this copyright and license information must be included.