Performance of Artificial Intelligence Imaging Models in Detecting Dermatological Manifestations in Higher Fitzpatrick Skin Color Classifications

Background: The performance of deep-learning image recognition models is below par when applied to images with Fitzpatrick classification skin types 4 and 5.
Objective: The objective of this research was to assess whether image recognition models perform differently when differentiating between dermatological diseases in individuals with darker skin color (Fitzpatrick skin types 4 and 5) than when differentiating between the same dermatological diseases in Caucasians (Fitzpatrick skin types 1, 2, and 3) when both models are trained on the same number of images.
Methods: Two image recognition models were trained, validated, and tested. The goal of each model was to differentiate between melanoma and basal cell carcinoma. Open-source images of melanoma and basal cell carcinoma were acquired from the Hellenic Dermatological Atlas, the Dermatology Atlas, the Interactive Dermatology Atlas, and DermNet NZ.
Results: The image recognition models trained and validated on images with light skin color had higher sensitivity, specificity, positive predictive value, negative predictive value, and F1 score than the image recognition models trained and validated on images of skin of color for differentiation between melanoma and basal cell carcinoma.
Conclusions: A higher number of images of dermatological diseases in individuals with darker skin color than images of dermatological diseases in individuals with light skin color would need to be gathered for artificial intelligence models to perform equally well.


Background
In dermatology, artificial intelligence (AI) is poised to improve the efficiency and accuracy of traditional diagnostic approaches, including visual examination, skin biopsy, and histopathologic examination [1]. Deep-learning image recognition models have had success in differentiating between dermatological diseases using images of light-skinned individuals. However, when these models are tested on images of people with skin of color, the performance drops [2]. It is thought that the primary reason for this difference is the lack of available images of dermatological diseases in individuals with darker skin color (Fitzpatrick classification of skin types 4 and 5) [3]. This raises a further question: even when the same number of images is available, do image recognition models have a harder time differentiating between dermatological diseases in individuals with Fitzpatrick skin types 4 and 5 than in individuals with skin types 1, 2, and 3?

Objective
The objective of this research was to assess whether image recognition models perform differently when differentiating between dermatological diseases in individuals of color (Fitzpatrick skin types 4 and 5) than when differentiating between the same dermatological diseases in Caucasians (Fitzpatrick skin types 1, 2, and 3) when both models are trained on an equal number of images.

Methods
Open-source images of melanoma and basal cell carcinoma (BCC) were acquired from the Hellenic Dermatological Atlas [4], the Dermatology Atlas [5], the Interactive Dermatology Atlas [6], and DermNet NZ [7]. Two image recognition models were trained, validated, and tested using methodology as described previously [8]. TensorFlow [9], an open-source software library by Google, was used as the deep-learning framework to retrain Inception, version 3 (v3), a deep convolutional neural network. This neural network consists of a hierarchy of computational layers, each of which has an input and an output. All layers except the final layer of this neural network were pretrained with more than 1.2 million images. The final layer of the neural network was retrained with the gathered dermatological images. During the retraining process, the neural network underwent both a training and a validation step. In the training step, the input images were used to train the neural network. In the validation step, naïve (previously unseen) images were used to iteratively assess training accuracy [10].
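The idea of retraining only the final layer can be illustrated with a minimal sketch. This is not the TensorFlow/Inception v3 pipeline used in the study: it is plain Python, and it assumes that the frozen, pretrained layers have already turned each image into a fixed feature vector. The feature vectors here are synthetic toy values, and the function names (`make_toy_features`, `train_final_layer`, `accuracy`) and the label coding (1 = melanoma, 0 = BCC) are hypothetical, chosen only for illustration.

```python
import math
import random


def predict_proba(weights, bias, x):
    """Probability that feature vector x belongs to the positive class."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train_final_layer(features, labels, epochs=200, lr=0.5):
    """Fit a logistic 'final layer' on fixed feature vectors.

    The features stand in for the outputs of the frozen pretrained layers;
    only the weights and bias below are updated (batch gradient descent).
    """
    n = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = predict_proba(weights, bias, x) - y
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def accuracy(weights, bias, features, labels):
    """Fraction of examples classified correctly at a 0.5 threshold."""
    correct = sum(
        (predict_proba(weights, bias, x) >= 0.5) == (y == 1)
        for x, y in zip(features, labels)
    )
    return correct / len(labels)


def make_toy_features(n_per_class, seed):
    """Synthetic 2-D 'features' (assumption: NOT real image data)."""
    rng = random.Random(seed)
    features, labels = [], []
    for label, center in ((1, 1.0), (0, -1.0)):  # 1 = melanoma, 0 = BCC
        for _ in range(n_per_class):
            features.append([rng.gauss(center, 0.5), rng.gauss(center, 0.5)])
            labels.append(label)
    return features, labels


# Training step on one set, validation step on separate, unseen examples.
train_x, train_y = make_toy_features(50, seed=0)
val_x, val_y = make_toy_features(20, seed=1)
weights, bias = train_final_layer(train_x, train_y)
validation_accuracy = accuracy(weights, bias, val_x, val_y)
```

The split between `train_x` and `val_x` mirrors the training and validation steps described above: the validation examples never influence the weight updates and serve only to check how well the retrained layer generalizes.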
After the model had been retrained (trained and validated), a testing step was performed in which test images were supplied by the user and the results were statistically analyzed. For each test image, the program outputs the probability of each dermatological manifestation, expressed as a percentage. R software (R Foundation for Statistical Computing) [11] was used to perform the statistical analysis. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score were calculated for each dermatological manifestation. The F1 score is the harmonic mean of the sensitivity (recall) and the PPV (precision).
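The metrics named above follow directly from the counts of true and false positives and negatives. The sketch below is a Python illustration only (the study's analysis was performed in R); the function name `binary_metrics`, the 0.5 decision threshold, and the example labels and probabilities are assumptions made for illustration and are not taken from the study.

```python
def binary_metrics(true_labels, positive_probs, positive="melanoma", threshold=0.5):
    """Compute sensitivity, specificity, PPV, NPV, and F1 for one class.

    true_labels    : the true diagnosis for each test image
    positive_probs : the model's probability (0 to 1) of the positive class
    threshold      : assumed cut-off for calling an image positive
    """
    tp = fp = tn = fn = 0
    for truth, prob in zip(true_labels, positive_probs):
        predicted_positive = prob >= threshold
        actual_positive = truth == positive
        if predicted_positive and actual_positive:
            tp += 1
        elif predicted_positive and not actual_positive:
            fp += 1
        elif not predicted_positive and actual_positive:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # recall
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),          # precision
        "npv": ratio(tn, tn + fn),
        # Harmonic mean of sensitivity and PPV, written in count form.
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical example: 4 melanoma and 4 BCC test images.
example_truth = ["melanoma"] * 4 + ["bcc"] * 4
example_probs = [0.9, 0.8, 0.7, 0.4, 0.3, 0.3, 0.2, 0.1]
example_metrics = binary_metrics(example_truth, example_probs)
```

Because the model has only two classes, the metrics for BCC can be obtained from the same function by passing `positive="bcc"` together with the BCC probabilities.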
The goal of each model was to differentiate between melanoma and BCC.
The first model was:
Areas under the receiver operating characteristic curve (AUCs) for melanoma and BCC were calculated to determine the performance of the two models.
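As an illustration of what the AUC measures, it can be computed as the probability that a randomly chosen positive image receives a higher score than a randomly chosen negative image, with ties counted as one half. The sketch below is a Python illustration under that definition, not the study's R code; the function name `roc_auc` and the example values are hypothetical.

```python
def roc_auc(true_labels, positive_scores, positive="melanoma"):
    """AUC via pairwise comparison of positive and negative scores.

    Equivalent to the area under the ROC curve: the fraction of
    (positive, negative) pairs in which the positive image scores higher,
    counting ties as half a win.
    """
    pos = [s for t, s in zip(true_labels, positive_scores) if t == positive]
    neg = [s for t, s in zip(true_labels, positive_scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical example: two melanoma and two BCC test images.
example_auc = roc_auc(["melanoma", "melanoma", "bcc", "bcc"], [0.9, 0.4, 0.6, 0.1])
```

Under this definition, a model whose scores carry no information about the diagnosis has an AUC of 0.5, and a model that ranks every positive image above every negative image has an AUC of 1.0.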

Results
When asked to differentiate between melanoma and BCC, the image recognition model trained and validated on images of light skin color had higher sensitivity, specificity, PPV, NPV, and F1 score than the image recognition model trained and validated on images of skin of color (Table 1). The average AUC for the two light skin color image recognition models was 0.598, compared to 0.500 for the skin of color image recognition models (Table 1 and Figure 1); an AUC of 0.500 corresponds to chance-level discrimination.

Limitations
The number of images available was limited for Fitzpatrick skin types 4 and 5; as such, both the light skin color and skin of color models were trained under the same constraint on the number of images. A larger sample size would be needed to test whether the results are reproducible.

Conclusion
When the same number of images is used for training, validation, and testing, the AI model that was provided with images of melanoma and BCC in Fitzpatrick classification skin types 1, 2, and 3 performed better than the AI model that was provided with images of melanoma and BCC in skin types 4 and 5. This may be because dermatological diseases can have more variability in presentation in individuals with darker skin; additionally, cutaneous manifestations may not be as easily distinguished from the surrounding skin in darker-skinned individuals. As such, a larger number of images of skin of color with dermatological diseases than images of light skin color with dermatological diseases would need to be gathered for the AI models to perform equally well.