Automatic skin lesion recognition has been shown to be effective in increasing access to reliable dermatology evaluation; however, most existing algorithms rely solely on images. Many diagnostic rules, including the 3-point checklist, which encode human knowledge and reflect the diagnostic process of human experts, are not considered by artificial intelligence algorithms.
In this paper, we aimed to develop a semisupervised model that can not only integrate the dermoscopic features and scoring rule from the 3-point checklist but also automate the feature-annotation process.
We first trained the semisupervised model on a small, annotated data set with disease and dermoscopic feature labels and sought to improve the classification accuracy by integrating the 3-point checklist through a ranking loss function. We then used a large, unlabeled data set with only disease labels, from which the trained algorithm learned to automatically classify skin lesions and their features.
After adding the 3-point checklist to our model, its performance for melanoma classification improved from a mean of 0.8867 (SD 0.0191) to 0.8943 (SD 0.0115) under 5-fold cross-validation. The trained semisupervised model can automatically detect the 3 dermoscopic features of the 3-point checklist, with best accuracies of 0.80 (area under the curve [AUC] 0.8380), 0.89 (AUC 0.9036), and 0.76 (AUC 0.8444), in some cases outperforming human annotators.
Our proposed semisupervised learning framework can help with the automatic diagnosis of skin disease based on its ability to detect dermoscopic features and automate the label-annotation process. The framework can also help combine semantic knowledge with a computer algorithm to arrive at a more accurate and more interpretable diagnostic result, which can be applied to broader use cases.
Skin cancer is one of the most common cancers worldwide, with steadily increasing incidence rates of melanoma and nonmelanoma cancers [
The automated classification of dermoscopic images through convolutional neural networks (CNNs) has emerged as a reliable supplement to visual skin examination by on-site specialists in the detection of skin cancer [
Artificial intelligence (AI) algorithms, however, have some weaknesses. One weakness is the lack of interpretability and transparency regarding how the computer arrived at its output, making it difficult for dermatologists to trust the diagnostic results [
Another limitation of AI algorithms is that a majority rely solely on images as inputs, whereas in a clinical setting, more information can be obtained through, for instance, palpation of the lesion and clinical data on age and family history [
Recent studies have focused on attempts to combine semantic knowledge with the algorithm to arrive at a more accurate diagnosis [
In this experiment, we chose the 3-point checklist for melanoma and melanocytic nevus as an illustration of diagnostic rules and disease class. The 3-point checklist is easy to interpret and is highly sensitive for the diagnosis of melanoma by nonexpert clinicians [
Combining diagnostic rules such as the 3-point checklist with a classification algorithm can improve both patient access to care and diagnostic accuracy. The proposed algorithms have several potential application scenarios, including the following: (1) they can automatically classify skin disease images and generate feature labels, listing the criteria used to categorize suspicious lesions, to improve trust in and acceptance of teledermoscopy; (2) they can help medical students learn to identify the features in dermoscopic images: given the algorithm's detailed evaluation of each criterion in the 3-point checklist, students can use the checklist to learn the fundamental parameters used to differentiate a benign nevus from a melanoma; and (3) they can automate the process of feature annotation, so that fewer human annotators need to be involved, enabling the secondary use of enormous imaging data resources, such as the ISIC archive.
All images in the labeled and unlabeled data sets come from the ISIC archive. “Label” here refers to the 3-point checklist feature labels; both the “labeled” and “unlabeled” data sets contain disease type information. For the small, labeled data set, we selected an even distribution of melanoma and melanocytic nevus dermoscopic images from ISIC 2019 to annotate with the 3-point checklist features. The large, unlabeled data set came mainly from ISIC 2020, which contributed 584 melanoma and 5193 melanocytic nevus dermoscopic images. To balance the data set, we added 4062 melanoma images from ISIC 2019, excluding the images in the small, labeled data set. We divided each data set into training and validation sets in an 80/20 ratio and used 5-fold cross-validation; that is, the data set was divided equally into 5 subsets, each serving in turn as the validation set while the remaining 4 were used for training. We annotated an additional 400 images as a holdout testing set.
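As an illustration, the 5-fold rotation can be set up with scikit-learn's StratifiedKFold; the file list and label variables below are placeholders, not the actual ISIC file names.

```python
from sklearn.model_selection import StratifiedKFold

# Placeholder file list and labels; 0 = melanocytic nevus, 1 = melanoma
image_paths = [f"img_{i:04d}.jpg" for i in range(100)]
disease_labels = [i % 2 for i in range(100)]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(image_paths, disease_labels)):
    # Each subset serves as the validation set exactly once (an 80/20 split per fold)
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val")
```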
The 3-point checklist is easy to interpret and is highly sensitive for the diagnosis of melanoma versus melanocytic nevus. Our algorithm evaluated dermoscopic images of pigmented lesions based on the 3-point checklist, indicating the presence or absence of (1) asymmetry, (2) atypical pigment network, and (3) blue-white structures. If any one of these features was detected in a skin lesion image, 1 point was added to that image's score, giving a scoring range of 0 to 3 per image. These automated 3-point classification outputs can aid a provider's decision to biopsy a lesion or to refer the patient to a specialist for a more thorough evaluation.
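The scoring rule itself reduces to a simple sum, as in this minimal sketch:

```python
def three_point_score(asymmetry: bool, atypical_network: bool, blue_white: bool) -> int:
    """One point per checklist feature present; the total ranges from 0 to 3."""
    return int(asymmetry) + int(atypical_network) + int(blue_white)

# A lesion showing asymmetry and blue-white structures scores 2,
# which under the checklist flags it as suspicious for melanoma.
print(three_point_score(True, False, True))  # -> 2
```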
Number of images for skin disease categories for labeled and unlabeled data sets.
| Disease | Unlabeled data set | Labeled data set |
| --- | --- | --- |
| Melanoma | 4646 | 450 |
| Melanocytic nevus | 5193 | 450 |
| Total | 9839 | 900 |
There are 3 features in the 3-point checklist: atypical network, asymmetry, and blue-white structure. For each feature detected, 1 point is added to the score for that image. The higher the score (usually 2 or higher), the higher the risk of melanoma; if the score is lower than 1, according to the 3-point checklist, the lesion is more likely to be benign. Our experiment was built on a gold standard whereby each image was rigorously reviewed by at least 2 annotators. If consensus was reached, the resulting diagnosis was annotated; if not, a third annotator evaluated the image again. We divided the annotation into 2 steps. First, the 3 annotators attended training sessions to develop consensus annotation guidelines. We provided the annotators with a small image set annotated by domain experts to annotate and evaluate, and during this phase they were allowed to discuss their differing interpretations. After interrater agreement reached at least 70%, we moved to the second step, in which they annotated images independently. We divided the whole image data set into 3 subsets and assigned each annotator 2 subsets so that every image had at least 2 annotation results. Our final Cohen kappa score for interrater agreement in the second step was 0.64, which indicates substantial agreement. If any image received differing annotation results, we brought in the third annotator, who had not previously been assigned to that image, and took a majority vote. Overall, this was a very time-consuming process.
Because the training data set came from 3 data sources, image resolution varied: a lesion could occupy the entire image or only one corner of it. Hence, we developed a rule to crop and resize all the training images, which improved the performance of our model.
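Since the exact cropping rule is not reproduced here, the following sketch assumes one plausible variant: a center crop to a square followed by resizing to a fixed resolution.

```python
from PIL import Image

def crop_and_resize(path, size=224):
    """Center-crop to a square, then resize to a fixed resolution.
    The paper's exact cropping rule is unspecified; this is one plausible choice."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    return img.crop((left, top, left + side, top + side)).resize((size, size))
```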
Because of different imaging sources and illumination conditions, the color of the dermoscopic images varied considerably. It was therefore important to calibrate image color in the preprocessing stage to reduce possible bias for the deep neural network. Catarina et al [
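One widely used calibration for dermoscopy is the Shades of Gray color constancy algorithm; whether this exact variant was applied here is an assumption, and the sketch below is only illustrative.

```python
import numpy as np

def shades_of_gray(img, p=6):
    """Shades of Gray color constancy with Minkowski norm p (p=6 is a common choice).
    img: H x W x 3 RGB array in [0, 255]."""
    img = img.astype(np.float64)
    # Per-channel illuminant estimate via the Minkowski p-norm of pixel values
    illum = np.power(np.mean(np.power(img, p), axis=(0, 1)), 1.0 / p)
    illum = illum / np.linalg.norm(illum)        # unit-norm illuminant estimate
    corrected = img / (illum * np.sqrt(3.0))     # a neutral illuminant maps to unity gain
    return np.clip(corrected, 0, 255).astype(np.uint8)
```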
Contrast-Limited Adaptive Histogram Equalization (CLAHE) was used to improve image contrast. Unlike standard histogram equalization, it operates on several distinct sections of the image and uses them to redistribute the lightness values, which improves local contrast and enhances the edges of objects in the image.
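A minimal sketch using OpenCV, applying CLAHE to the lightness channel; the clip limit and tile grid below are common defaults, not values reported in the paper.

```python
import cv2

def apply_clahe(bgr):
    """Apply CLAHE to the L channel in LAB space so lesion color is preserved."""
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)
```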
We proposed a semisupervised learning framework for the prediction of skin disease that uses a small set of labeled images and a larger set of unlabeled images. The labeled data set contains 900 images that were labeled with disease tags and the 3-point checklist annotation, while the unlabeled data set contains 9839 images that have only disease tags. The architecture of the proposed classification model is presented in
Architecture of the proposed semisupervised learning framework. EMA: exponential moving average; ResNet: residual neural network.
The supervised learning consists of 2 tasks that are jointly learned during training: one is the classification of the skin disease, and the other is the classification of each feature in the 3-point checklist. Using the 3-point checklist, each feature is given a binary score of 0 or 1 in the training phase, indicating whether it exists in the image. A total score of 2 or higher suggests that the lesion is more likely to be malignant. We incorporated the traditional cross-entropy loss to optimize the skin disease classification and used a ranking loss to represent the 3-point checklist knowledge. The hyperparameters for our training models are as follows: a batch size of 128, stochastic gradient descent optimizer, and ReduceLROnPlateau learning rate decay (mode=“min,” factor=0.5, threshold=0.01, patience=7, verbose=True).
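A sketch of this supervised setup in PyTorch follows; the scheduler settings and batch size match the values stated above, while the learning rate, momentum, and fused 5-logit output head are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(num_classes=2 + 3)   # 2 disease logits + 3 checklist-feature logits
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, threshold=0.01, patience=7)
# loader = DataLoader(train_set, batch_size=128, shuffle=True)  # batch size as stated

disease_ce = nn.CrossEntropyLoss()     # skin disease classification
feature_bce = nn.BCEWithLogitsLoss()   # per-feature presence/absence

def supervised_loss(images, disease_y, feature_y):
    logits = model(images)
    d_logits, f_logits = logits[:, :2], logits[:, 2:]
    return disease_ce(d_logits, disease_y) + feature_bce(f_logits, feature_y.float())
```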
Image annotation requires not only extensive time investment but also domain expertise of human annotators. Inspired by the research of Tarvainen and Valpola [
Using the ranking loss, we enforce the model to learn a predefined diagnostic rule: samples with higher scores are more likely to have melanoma. The ranking loss is computed from each pair of samples in a batch. We denote by $p_i$ the predicted melanoma probability and by $s_i$ the 3-point checklist score of sample $i$.
Then, the cross-entropy loss function for disease classification can be represented in its standard form as $L_{\mathrm{CE}} = -\sum_{c} y_c \log \hat{y}_c$, where $y$ is the one-hot disease label and $\hat{y}$ the vector of predicted class probabilities.
We compute the ranking loss over every pair $(i, j)$ in a batch with $s_i > s_j$, penalizing the pair whenever $p_i$ does not exceed $p_j$.
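A minimal PyTorch sketch of this pairwise constraint, assuming a hinge (margin) formulation; the margin value is an illustrative choice, and the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def checklist_ranking_loss(melanoma_logits, scores, margin=0.1):
    """Pairwise ranking over a batch: if s_i > s_j, the predicted melanoma
    probability p_i should exceed p_j by at least the margin."""
    p = torch.sigmoid(melanoma_logits)        # p_i: predicted melanoma probability
    diff_p = p.unsqueeze(1) - p.unsqueeze(0)  # p_i - p_j for all pairs (i, j)
    mask = (scores.unsqueeze(1) > scores.unsqueeze(0)).float()  # pairs with s_i > s_j
    hinge = F.relu(margin - diff_p)           # penalize pairs ranked in the wrong order
    return (hinge * mask).sum() / mask.sum().clamp(min=1.0)
```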
The EMA model behaves as the teacher model on the unlabeled data. This method constrains the model to behave similarly to its past versions during the update, so it can potentially find flatter local minima and avoid singularity points where a small update would result in a large change in the model's behavior. The mean-teacher strategy proved useful in previous works, and the consistency cost is defined as $J(\theta) = \mathbb{E}_{x}\left[\lVert f(x, \theta') - f(x, \theta) \rVert^{2}\right]$, where the teacher parameters $\theta'$ are updated as an EMA of the student parameters $\theta$: $\theta'_t = \alpha \theta'_{t-1} + (1 - \alpha)\theta_t$.
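A sketch of the EMA update and a mean squared consistency cost; the smoothing coefficient of 0.99 is a typical mean-teacher value, not one reported in the paper.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.99):
    """Teacher update: theta'_t = alpha * theta'_(t-1) + (1 - alpha) * theta_t."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

def consistency_cost(student_logits, teacher_logits):
    """Mean squared distance between student and teacher predictions on unlabeled images."""
    return torch.mean((torch.softmax(student_logits, dim=1)
                       - torch.softmax(teacher_logits, dim=1)) ** 2)
```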
Finally, the ranking loss, disease supervised loss, feature supervised loss (FSL), and consistency loss were added together to train the model.
Our models were built on the state-of-the-art ResNet architecture. We tried ResNet-18, ResNet-50, ResNet-152, and ResNeXt50_32x4d, and there was no significant difference in classification accuracy among them. To facilitate the training process, we used a relatively light architecture, ResNet-18, as our baseline.
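One way to realize the two jointly trained heads on a ResNet-18 backbone; this two-head layout is a sketch consistent with the architecture described, not the authors' exact code.

```python
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class ThreePointNet(nn.Module):
    """ResNet-18 backbone with separate disease and checklist-feature heads."""
    def __init__(self, pretrained=True):
        super().__init__()
        weights = ResNet18_Weights.DEFAULT if pretrained else None
        self.backbone = resnet18(weights=weights)
        dim = self.backbone.fc.in_features      # 512 for ResNet-18
        self.backbone.fc = nn.Identity()
        self.disease_head = nn.Linear(dim, 2)   # melanoma vs melanocytic nevus
        self.feature_head = nn.Linear(dim, 3)   # asymmetry, atypical network, blue-white

    def forward(self, x):
        z = self.backbone(x)
        return self.disease_head(z), self.feature_head(z)
```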
The first task was to test whether the model's classification accuracy increases after adding human knowledge, transformed and represented in the ranking loss format. Many state-of-the-art CNN architectures have been developed for image recognition tasks, some of which have achieved strong performance on skin lesion recognition with ISIC data sets. In a 2021 paper published by Yiming Zhang et al [
As can be seen from the table, the pretrained baseline model reached the same level of accuracy on the large 9000-image data set. After adding the human knowledge of the 3-point checklist rule, the average accuracy improved further on this basis.
The previous experiment was based on human-annotated, 3-point feature labels. The entire process, from recruiting annotators to finally reaching agreement, took more than 2 months. Hence, we developed the semisupervised model to automate the feature-annotation process. We combined the generated features as human knowledge to test whether such knowledge can help to improve the disease classification accuracy.
To evaluate the performance of the 3-point feature classification for our semisupervised model, we calculated the testing accuracy and area under the receiver operating characteristic curve (AUC) on a separate holdout testing data set that contains 100 images with annotated 3-point features and disease type. We tested the performance for feature and disease classification on the models shown in
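For instance, accuracy and AUC for one checklist feature on the holdout set can be computed with scikit-learn; the arrays below are placeholders.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

# Placeholder holdout predictions for one checklist feature
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.2, 0.8, 0.6, 0.3, 0.9, 0.4, 0.7, 0.5])

auc = roc_auc_score(y_true, y_prob)                        # threshold-free ranking quality
acc = accuracy_score(y_true, (y_prob >= 0.5).astype(int))  # accuracy at a 0.5 cutoff
print(f"accuracy {acc:.2f}, AUC {auc:.4f}")
```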
As seen in
Five-fold cross validation results for the disease classification task.
| Model | Five-fold accuracy, mean (SD) |
| --- | --- |
| MobileNetV3 (Pretrain=True) | 0.8733 (0.0113) |
| DenseNet (Pretrain=True) | 0.8856 (0.0114) |
| Baseline (ResNet-18, Pretrain=True) | 0.8867 (0.0191) |
| Baseline + Human Knowledge (RL^a) | 0.8943 (0.0115) |
^a RL: ranking loss.
Results for semisupervised model for disease or feature classification tasks with different loss functions—disease supervised loss (DSL), feature supervised loss (FSL), and consistency loss (CL).
| Model | Asymmetry, accuracy (AUC^a) | Atypical network, accuracy (AUC) | Blue-white structure, accuracy (AUC) | Disease, accuracy (AUC) |
| --- | --- | --- | --- | --- |
| CL | 0.51 (0.5760) | 0.53 (0.5021) | 0.54 (0.5620) | 0.54 (0.5648) |
| DSL | 0.51 (0.5480) | 0.76 (0.6480) | 0.58 (0.5285) | 0.76 (0.8690) |
| FSL | 0.80 (0.8380) | 0.89 (0.9036) | 0.74 (0.8036) | 0.51 (0.5339) |
| FSL+CL | 0.68 (0.7816) | 0.87 (0.8752) | 0.75 (0.8137) | 0.53 (0.5402) |
| DSL+FSL | 0.76 (0.7892) | 0.86 (0.8602) | 0.76 (0.8133) | 0.74 (0.8418) |
| DSL+CL | 0.53 (0.5448) | 0.79 (0.4340) | 0.47 (0.5943) | 0.77 (0.8389) |
| DSL+FSL+CL | 0.73 (0.8036) | 0.85 (0.8474) | 0.76 (0.8444) | 0.79 (0.8402) |
| DSL+FSL^b+CL | 0.75 (0.7932) | 0.88 (0.8752) | 0.71 (0.7951) | 0.69 (0.7971) |
^a AUC: area under the receiver operating characteristic curve.
^b We emphasized the weight of the “Asymmetry” feature in the loss function.
Annotators in this study were medical students without expert training in dermatology; they evaluated images based solely on tutorials from web-based resources and textbooks. Without designated training using example images, each annotator initially had a different idea of what each feature looked like. Preliminary agreement scores might have been higher if annotators had been given reference images from which to learn the dermoscopic features. This finding highlights the potential value of our algorithm as an educational tool: if medical students can evaluate a dermoscopic image and check their 3-point annotation against the algorithm's validated output, they can develop their ability to visually identify each dermoscopic feature.
During the image-annotation process, annotators faced several uncertainties. First, the vague definition of dermoscopic features, especially “atypical network,” posed an issue, as each annotator had a different idea of what it looked like, resulting in initially low agreement scores. We addressed this concern by proposing an ontology that integrates domain knowledge on dermoscopic features and represents the features in a more standardized, computer-readable format.
Another source of uncertainty in analyzing the images was the use of different screens with various color-display settings. One common error was the inability to properly characterize blue structures when night-light or blue-light filters were activated; because such options can be engaged automatically on a schedule, they can silently introduce annotation errors. The use of different screens led to initial disagreement among the annotators but can be corrected by proper calibration and by ensuring that no color filter is active.
One limitation of this study was that most of the images were of White skin, which raises the question of whether the algorithm can effectively detect melanoma in skin of color. Training the algorithm to identify lesions across a wider range of skin colors would be valuable in helping to screen a larger population of patients at risk of melanoma. Another limitation was that image quality could have been degraded by shadows, hairs, reflections, and noise, leading to inadequate lesion analysis, as discussed in an earlier study [
For the first task, after incorporating the human knowledge of the 3-point checklist, the model with weights loaded from the large data set improved its classification accuracy from an average of 0.8867 to 0.8943. This shows that the ranking loss has a positive impact on classification accuracy. We plan to continue expanding the human knowledge used, developing more complicated diagnostic rules to test their impact on computer algorithms.
For the feature- and disease-classification task that used semisupervised architecture, interesting findings were discovered in
We also noticed that, during the human annotation of the 3-point checklist, the atypical network had the lowest interrater agreement among the 3 annotators. For the computer feature-classification task, however, the atypical network had the highest classification accuracy. This suggests that the algorithm has the advantage of learning certain image features that can be challenging even for human experts, and it shows that human intelligence and AI can complement each other.
Because our image data set is from the ISIC archives, we also compared the performance of our algorithm with the winner of the ISIC 2020 leaderboard [
We plan to implement more fine-tuned model architectures trained from scratch so that a more advanced ensemble can be built by integrating the submodel architectures. Our current experimental setting, with these disease classes and the rules of the 3-point checklist, is only a demonstration of how the human thinking process can be integrated into the structure of CNNs. As dermatology thrives, numerous diagnostic rules are being developed; we plan to summarize these diagnostic rules and the dermoscopic features they mention, together with their relationships to skin diseases, into an ontology, further accelerating the automation of clinical decision support by computer algorithms. With our trained algorithm, we can already automate the 3-point checklist annotation process and apply it to a wider range of image databases.
This study is distinctive because it combines the semantic knowledge of the 3-point checklist with a computer algorithm (a CNN) to arrive at a more accurate and more interpretable diagnosis; the CNN classification is thus based on more information than the image pixels alone. Because the image-annotation process consumes so much time and labor, vast imaging data sets remain unexploited. Our proposed semisupervised learning framework can help automate the annotation process, enabling the reuse of many skin-imaging data sets, which also benefits the robustness and domain adaptation of the deep-learning model.
AI: artificial intelligence
AUC: area under the receiver operating characteristic curve
CNN: convolutional neural network
EMA: exponential moving average
FSL: feature supervised loss
ISIC: International Skin Imaging Collaboration
XZ conducted the experiments and led the writing of the manuscript. ZX and YX helped with the design of the model and the writing of methodology. IB, MK, and CS conducted the annotation and contributed to the writing of the manuscript from the clinician’s perspective. LG and CT supervised the project. All authors participated in the design of this study.
This work was supported by UTHealth Innovation for Cancer Prevention Research Training Program Pre-doctoral Fellowship (Cancer Prevention and Research Institute of Texas Grant No. RP160015 and No. RP210042).
None declared.