Assessing the Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists: Retrospective Comparative Study

Background: Convolutional neural networks (CNNs) are a type of artificial intelligence that shows promise as a diagnostic aid for skin cancer. However, the majority are trained using retrospective image data sets with varying image capture standardization.
Objective: The aim of our study was to use CNN models with the same architecture, trained on image sets acquired with either the same image capture device and technique (standardized) or with varied devices and capture techniques (nonstandardized), and test variability in performance when classifying skin cancer images in different populations.
Methods: In all, 3 CNNs with the same architecture were trained. CNN nonstandardized (CNN-NS) was trained on 25,331 images taken from the International Skin Imaging Collaboration (ISIC) using different image capture devices. CNN standardized (CNN-S) was trained on 177,475 MoleMap images taken with the same capture device, and CNN standardized number 2 (CNN-S2) was trained on a subset of 25,331 standardized MoleMap images (matched for number and classes of training images to CNN-NS). These 3 models were then tested on 3 external test sets: 569 Danish images, the publicly available ISIC 2020 data set consisting of 33,126 images, and The University of Queensland (UQ) data set of 422 images. Primary outcome measures were sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC). Teledermatology assessments available for the Danish data set were used to determine model performance compared to teledermatologists.
Results: When tested on the 569 Danish images, CNN-S achieved an AUROC of 0.861 (95% CI 0.830-0.889) and CNN-S2 achieved an AUROC of 0.831 (95% CI 0.798-0.861; standardized models), with both outperforming CNN-NS (nonstandardized model; P=.001 and P=.009, respectively), which achieved an AUROC of 0.759 (95% CI 0.722-0.794). When tested on 2 additional data sets (ISIC 2020 and UQ), CNN-S (P<.001 and P<.001, respectively) and CNN-S2 (P=.08 and P=.35, respectively) still outperformed CNN-NS. When the CNNs were matched to the mean sensitivity and specificity of the teledermatologists on the Danish data set, the models’ resultant sensitivities and specificities were surpassed by the teledermatologists. However, when compared to CNN-S, the differences were not statistically significant (sensitivity: P=.10; specificity: P=.053). Performance across all CNN models as well as teledermatologists was influenced by image quality.
Conclusions: CNNs trained on standardized images had improved performance and, therefore, greater generalizability in skin cancer classification when applied to unseen data sets. This finding is an important consideration for future algorithm development, regulation, and approval.


Introduction
Skin cancer (melanoma and keratinocyte cancer) is the most common type of cancer in fair-skinned populations, with the overall incidence and prevalence increasing worldwide [1]. In an effort to improve current prevention and detection practices, artificial intelligence (AI) has shown promise, at least in experimental settings.
In recent years, advances in machine learning and deep learning have led to increases in the research and exploration of potential applications in dermatology [2][3][4][5][6]. These advancements have led to the production of systems that can diagnose skin conditions through image analysis. With the help of clinical and dermoscopic images for training, convolutional neural networks (CNNs) have been able to compete with and even outperform experienced dermatologists when diagnosing and classifying skin cancer [7][8][9][10][11].
Although these models perform well, they are often tested on images that they have already seen or that come from the same data set on which the models were trained, leading to an inflation in their performance [12]. When tested on externally sourced images, the performance of these models is reduced significantly, highlighting the models' poor generalizability [13].
Generalizability is an important factor that deserves careful consideration when assessing dermatology models. Generalizability refers to how well a model can apply the concepts it has learned from the available training data and implement these same concepts to data it has not seen before.
The method for collecting dermatology image data sets can be defined as nonstandardized or standardized. Nonstandardized image collection refers to images taken using multiple image capture devices and techniques. This method exposes the model to variation in image quality parameters, such as sharpness, brightness, polarization, magnification, color, and distance from lesion (for macroscopic images). Standardized image collection refers to images taken with the same image capture device and technique, resulting in a greater uniformity of images across a data set. The extent to which uniformity (or lack thereof) of training images affects the performance of the resultant CNN model is unknown.
Dermatology image data sets are generally not standardized; they are often collected retrospectively and contain images captured with a variety of techniques and technologies. Theoretically, this variety increases the adaptability of the model and its ability to handle noisy and poorer quality data, thus increasing generalizability. However, with standardized image data sets, there is an expectation for greater consistency in image quality and, therefore, greater performance of the model. When considering the eventual implementation of a CNN model in a clinical setting, it is vital that the model's performance is impacted minimally by changes to the environment and patient demographic and variation in the presentation of disease. Identifying the factors that affect generalizability will increase the effectiveness of AI model implementation in practice. This retrospective comparative study assessed the generalizability of CNN models trained on standardized and nonstandardized images.

Test Sets, Study Population, and Image Selection
In this study, we compared the performance of CNNs trained on standardized and nonstandardized images when classifying skin cancer as malignant or benign on 3 separate external data sets.

Ethics Approval
This retrospective comparative study was approved by the Monash University Human Ethics Committee (Project ID 28130).

Architecture and Training of CNN Models
In all, 3 CNN models with the same architecture were trained on International Skin Imaging Collaboration (ISIC) 2019 [14][15][16][17] and MoleMap (MoleMap NZ Limited) [2] data sets. Model architecture used ImageNet pretrained ResNet-50 as a backbone (Figure 1) combined with a transformer [18,19]. The ResNet-50 backbone was incorporated because of the trade-off between accuracy and complexity. A transformer was also added to the model to overcome the limitation of CNNs in learning global image context. The same 3 CNN models were then additionally trained with a ResNet (Table 1).
CNN-NS was trained on 25,331 nonstandardized images from the ISIC 2019 data set, taken using different image capture devices. We define nonstandardized images as images that are taken using multiple image capture technologies (Figure 2). CNN-S was trained on 177,475 standardized, teledermatologist-verified, clinical, and dermoscopic MoleMap images. This data set includes a total of 65 skin conditions organized into a 3-level hierarchical semantic tree (Table 1). This model was trained on standardized images taken using the same camera (DermLite FOTO System). CNN-S2 was trained on 25,331 standardized, teledermatologist-verified, and dermoscopic MoleMap images consisting of 8 skin conditions (Table 1). CNN-NS and CNN-S2 were trained on the same number of images and skin conditions, only differing in the standardization of the images the models were trained on.
Figure 2. Examples of standardized and nonstandardized images. Images A and B are nonstandardized images, taken using different image capture devices. Images C and D are standardized images, taken using the same image capture device.

Assessment of CNN Performance
CNN performance was assessed using 3 separate test data sets that were not used in model training.

Test Set 1
The Danish data set was provided by the Department of Dermatology and Allergy Centre, Odense University Hospital and collected between January 9 and October 31, 2018 [20]. General practitioners from 50 practices across southern Denmark were trained for 1 hour with the image capture equipment required to take images of lesions that are suspicious for malignant melanoma and nonmelanoma skin cancer. A total of 600 images collected from 519 Danish patients, predominantly with Fitzpatrick skin types II and III, were used. The "ground truth" diagnosis was achieved by histopathology, follow-up, or a single face-to-face evaluation (308 of the 600 lesions in the original data set were only seen once face-to-face). Images containing clinical features that could not be identified were removed from the data set, leaving 569 images. Lesion classification can be seen in Table 2.
The 569 images were taken using an iPhone 6 smartphone (Apple Inc) and a handyscope (FotoFinder Systems GmbH) with an overview, a close-up, and a dermoscopic image being taken of the lesions.
In total, 4 dermatologists were involved in the face-to-face and teledermatology evaluations of the 519 patients. The quality of the images was rated as "poor," "fair," or "good" by 3 allocators. Images were assigned to the different categories when there was agreement between 2 or more allocators.
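The agreement rule for image quality can be illustrated with a minimal sketch. The function name `consensus_quality`, the image identifiers, and the handling of images on which no 2 allocators agree (returned here as `None`) are assumptions made for illustration; the study does not describe how such cases were handled.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_quality(ratings: Sequence[str], min_agree: int = 2) -> Optional[str]:
    """Return the rating given by at least `min_agree` allocators, else None."""
    label, count = Counter(ratings).most_common(1)[0]
    return label if count >= min_agree else None


# Hypothetical ratings from 3 allocators per image.
examples = {
    "img_001": ("good", "good", "fair"),
    "img_002": ("poor", "fair", "good"),  # no 2 allocators agree
    "img_003": ("fair", "fair", "fair"),
}
consensus = {name: consensus_quality(r) for name, r in examples.items()}
print(consensus)
```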

Test Set 2
The University of Queensland (UQ) data set contained 422 dermoscopic images provided by The University of Queensland, Diamantina Institute, Dermatology Research Centre and captured using the EOS Rebel T6i camera (Canon) and ATBM master automated mole-mapping system (FotoFinder Systems GmbH) between 2016 and 2020, with all lesions diagnosed through histopathology (Table 2).

Test Set 3
The ISIC 2020 data set contained 33,126 dermoscopic images provided by the ISIC and collected from 3 continents between 1998 and 2020 [21]. The 33,126 images in the ISIC 2020 test set contained 59 images that overlap with the 25,331 images in the ISIC 2019 data set used for the training of CNN-NS.
All 3 test sets were imbalanced, with the Danish data set containing 411 benign and 158 malignant images, the UQ data set containing 257 benign and 165 malignant images, and the ISIC 2020 data set containing 27,131 benign and 5995 malignant images, which is reflective of the breakdown seen in a clinical setting. As the classification is binary, the imbalance had no effect on the study. Lesion classification can be seen in Table 2.
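As a quick arithmetic check of the class breakdown reported above, the following sketch computes the size and malignant fraction of each test set from the stated counts. The dictionary layout and the function name `malignant_fraction` are illustrative only.

```python
# Benign and malignant counts as reported for the 3 external test sets.
test_sets = {
    "Danish": {"benign": 411, "malignant": 158},
    "UQ": {"benign": 257, "malignant": 165},
    "ISIC 2020": {"benign": 27131, "malignant": 5995},
}


def malignant_fraction(counts: dict) -> float:
    """Share of images in a test set that are labeled malignant."""
    total = counts["benign"] + counts["malignant"]
    return counts["malignant"] / total


for name, counts in test_sets.items():
    total = sum(counts.values())
    print(f"{name}: n={total}, malignant={malignant_fraction(counts):.1%}")
```

The totals reproduce the reported test set sizes (569, 422, and 33,126 images).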

Statistical Analysis
Statistical analysis was performed using Python software (version 3.8.13; Python Software Foundation) and Stata statistical software (version SE 17; StataCorp). The primary outcome measures were sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC) for the binary classification of lesions.
For each input image, the CNNs provided a score between 0 and 1 representing the probability that the input image is malignant. In binary classifications, thresholds are applied to the CNN models to establish the point at which an input image is labeled malignant. This threshold is variable and allows for the manipulation of the sensitivity and specificity of the models.
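The effect of the threshold can be illustrated with a small sketch on toy data. The scores, labels, and function names below are hypothetical and are not taken from the study; the sketch assumes that a score at or above the threshold is labeled malignant, and it shows one simple way to pick a threshold that reaches a target sensitivity.

```python
from typing import Sequence, Tuple


def sens_spec(scores: Sequence[float], labels: Sequence[int],
              threshold: float) -> Tuple[float, float]:
    """Sensitivity and specificity when score >= threshold is called malignant (label 1)."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def threshold_for_sensitivity(scores: Sequence[float], labels: Sequence[int],
                              target: float) -> float:
    """Highest observed score that, used as threshold, gives sensitivity >= target."""
    for t in sorted(set(scores), reverse=True):
        if sens_spec(scores, labels, t)[0] >= target:
            return t
    return min(scores)  # fallback; the lowest score always gives sensitivity 1.0


# Toy malignancy probabilities and ground truth (1 = malignant, 0 = benign).
scores = [0.05, 0.20, 0.35, 0.40, 0.60, 0.70, 0.85, 0.95]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

print(sens_spec(scores, labels, 0.5))
t = threshold_for_sensitivity(scores, labels, target=1.0)
print(t, sens_spec(scores, labels, t))
```

Lowering the threshold in this toy example raises sensitivity at the cost of specificity, which is the trade-off used when a model is aligned to a reference reader.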
The performance was assessed by aligning the sensitivity and specificity of the CNN models to those of the teledermatologists and by calculating the AUROC. AUROC allows for the direct comparison of different models regardless of the threshold applied. The DeLong nonparametric test was used to evaluate the statistical difference between AUROC values resulting from the same data set. Additionally, the 95% CI for the AUROC was computed using 2000 stratified bootstrap replicates. The McNemar test was used to compare the sensitivities and specificities of the CNN models. The 1-sample, 2-tailed t test was used to compare the mean sensitivities and specificities of the teledermatologists against the sensitivities and specificities of the CNN models. P values <.05 were considered to indicate statistically significant differences.
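A minimal sketch of the AUROC and a stratified bootstrap CI is given below, using only the Python standard library. It is not the study's analysis code: the toy scores, the function names, and the use of a percentile interval are assumptions, and the DeLong and McNemar tests are not implemented here. "Stratified" is taken to mean that malignant and benign scores are resampled separately, so each replicate keeps the original class sizes.

```python
import random
from typing import Sequence, Tuple


def auroc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Probability that a malignant score exceeds a benign score (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_bootstrap_ci(pos: Sequence[float], neg: Sequence[float],
                            n_boot: int = 2000, alpha: float = 0.05,
                            seed: int = 0) -> Tuple[float, float]:
    """Percentile CI for the AUROC from class-wise resampling with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        auroc(rng.choices(pos, k=len(pos)), rng.choices(neg, k=len(neg)))
        for _ in range(n_boot)
    )
    lower = stats[round((alpha / 2) * n_boot)]
    upper = stats[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy scores: model outputs for malignant and benign lesions (hypothetical).
malignant_scores = [0.35, 0.70, 0.85, 0.95]
benign_scores = [0.05, 0.20, 0.40, 0.60]

point = auroc(malignant_scores, benign_scores)
ci_low, ci_high = stratified_bootstrap_ci(malignant_scores, benign_scores)
print(point, ci_low, ci_high)
```

The pairwise AUROC above is quadratic in the number of images, which is fine for a toy example but would be slow for a test set the size of ISIC 2020; a rank-based implementation would be preferable there.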

CNN Performance on Test Set 2
When tested on the externally sourced UQ test set of 422 images, CNN-NS performed well with an AUROC of 0.850 (95% CI 0.812-0.887). CNN-S outperformed CNN-NS when tested on the UQ image set, with an AUROC of 0.876 (95% CI 0.842-0.911), although the difference was not statistically significant (P=.08; Figure 4). CNN-S2 also achieved a slightly greater AUROC (0.864, 95% CI 0.828-0.900) compared to CNN-NS, though this was not statistically significant (P=.35). Among the standardized models, CNN-S had the greatest AUROC (0.876 vs 0.864), though the difference was not statistically significant (P=.23).

CNN Performance on Test Set 3
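When tested on the publicly available ISIC 2020 test set of 33,126 images, the performance of CNN-NS was reduced, with an AUROC of 0.763 (95% CI 0.743-0.783). CNN-S significantly outperformed CNN-NS on this test set (P<.001).
When the models were matched to the mean sensitivity and specificity of the teledermatologists on the Danish data set (test set 1), the sensitivity and specificity of CNN-S did not differ significantly from those of the teledermatologists (sensitivity: P=.10; specificity: P=.053; Table 3). However, both CNN-S2 and CNN-NS had significantly lower specificity and CNN-NS had significantly lower sensitivity when compared to the teledermatologists (Table 3).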

Principal Findings
Our results provide evidence that models trained on standardized images outperform and, hence, achieve greater generalizability than models trained on nonstandardized images. In recent years, advances in machine learning have led to the development of models that can compete with and even outperform dermatologists in the classification of skin cancer [7][8][9][10][11]. Although these models have been shown to perform well when tested on a subset of images from their training data set, the generalizability of these models to images taken in different clinical settings and with different devices is unknown.
The impact that varying image acquisition devices and techniques have on CNN model performance in dermatology has not been explored in the literature to date; however, the lack of imaging standardization in dermatology has been highlighted. The collection, transfer, and storage of clinical and dermoscopic images are not standardized in dermatology, which has implications for the creation of data sets for machine learning, the reproducibility of imaging, and accessibility to relevant metadata for the images [22,23].
The standardized models (CNN-S and CNN-S2) consistently outperformed the nonstandardized model (CNN-NS) on all test sets. The statistical significance was directly affected by the number of images in the 3 test sets, with fewer images in test set 2 resulting in a nonsignificant difference in performance. Larger test sets will have a more accurate measure of model performance, and this finding would need to be considered when reporting validation results.
The ISIC holds an annual challenge that invites contestants to create a model that is trained and tested on images provided by the ISIC. In the AI community, the model that wins the ISIC challenge often holds a reputation as one of the best available. However, if tested on external data, the same performance is not guaranteed. If models are both trained and tested on the same set of images, then they are prone to overfitting and thus poorer generalizability. The quality of a model should therefore be judged on its performance on multiple external data sets from varying population groups.
Several studies have looked at the performance of CNN models compared to the performance of dermatologists. These models perform comparably to and can even outperform dermatologists when classifying skin cancers. However, it is important to note that the images used in test sets are often taken from the same data sets used in the training of the models [7][8][9][10][11]. It is important when comparing models to dermatologists that the CNN is externally validated. This validation provides a clearer indication of the performance of the models in comparison to dermatologists and their ability to generalize to external data sets.
In our study, when tested on test set 1, the teledermatologists outperformed all models. Interestingly, CNN-S was trained on Australian and New Zealand patients and generalized well to the Danish images. There was no statistical difference between the sensitivity and specificity of the teledermatologists and the matched sensitivity and specificity of CNN-S. It is important to note that the Danish teledermatologists were predominantly trained on Danish skin and had access to metadata and multiple image viewpoints for a single lesion, which the models did not have access to. Previous studies have shown that the addition of metadata and inclusion of both macroscopic and dermoscopic images of a lesion can improve the performance of the model [24,25]. Therefore, incorporating these features into future models will be important and may level the playing field when assessing performance against teledermatologists' clinical assessment.
The Danish images used in our study were taken by general practitioners who were required to undertake training to use the image capture technology. However, there were some issues with the quality of the images: some lesions were not centered, several lesions may be present within a single image, and parts of lesions were not included within the image frame. As the image quality improved, the diagnostic performance of all models and teledermatologists also increased. This finding highlights the influence that image capture techniques and image quality can have on CNN models and teledermatologists' diagnostic ability. This finding is also a consideration when designing models for integration into web-based tools or mobile apps with consumers as end users, as the quality of images taken by consumers on their smartphones will vary, especially in the absence of training.

Limitations
Our study has several limitations. First, the MoleMap data set used to train our 2 standardized CNN models was labeled by dermatologists; however, only very few images were biopsy proven. Given that histopathology is the gold standard for diagnosis, some of these images may have been mislabeled, which could have an impact on the performance of the models. Second, in test set 1 with 569 images, we only had access to 221 biopsy-proven images. The remaining 348 images in the test set 1 were labeled by dermatologists, which allows for the possibility of mislabeling. Third, the quality of the images in the training data sets (ISIC and MoleMap) and the type of image modality may have played a part in the performance of the models rather than the standardization of the images. It is important to consider that the quality of the camera used in the standardized MoleMap data set is less variable than the nonstandardized ISIC 2019 data set, which may have led to a discrepancy in the performance. CNN-S was trained on a combination of dermoscopic and macroscopic images, whereas CNN-NS and CNN-S2 were trained only on dermoscopic images. This combination of image modalities may have had an influence on the strength of the CNN-S model. Additionally, the models are complex, making it difficult to understand the process behind their decision-making (ie, a black box). This is an important limitation of the models and of this study and will be addressed through the incorporation of explainable AI techniques in our future models. Finally, in test set 1, the number of lesions in each group becomes small when divided into images of poor, fair, and good quality. In future studies, it would be better to evaluate a larger data set split among the quality groups to more confidently assess the relationship between image quality and CNN performance.

Conclusion
In this study, CNN models trained on standardized images based on dermoscopic and macroscopic modalities performed better than a CNN model with the same architecture trained on nonstandardized images when tested on external image data sets. This finding has important implications for model generalizability in the binary classification of skin cancer. In test set 1, image quality also had a direct impact on the performance of the models. For future algorithm training, development, and registration, it is important that model generalizability is considered through the evaluation of model performance on external image data sets.

Acknowledgments
AIO had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. AIO is supported by an Australian Government Research Training Program Scholarship.
ZG is supported by the NVIDIA Artificial Intelligence Fellowship for access to the computational resources. He is also supported by the Monash-Airdoc Research Centre collaboration.