Automatic skin lesion recognition has been shown to be effective in increasing access to reliable dermatology evaluation; however, most existing algorithms rely solely on images. Many diagnostic rules, including the 3-point checklist, which encode human knowledge and reflect the diagnostic process of human experts, are not considered by artificial intelligence algorithms.
In this paper, we aimed to develop a semisupervised model that can not only integrate the dermoscopic features and scoring rule from the 3-point checklist but also automate the feature-annotation process.
We first trained the semisupervised model on a small, annotated data set with disease and dermoscopic feature labels and sought to improve the classification accuracy by integrating the 3-point checklist through a ranking loss function. We then used a large, unlabeled data set with only disease labels, from which the trained algorithm learned to automatically classify skin lesions and their features.
After adding the 3-point checklist to our model, its performance for melanoma classification improved from a mean of 0.8867 (SD 0.0191) to 0.8943 (SD 0.0115) under 5-fold cross-validation. The trained semisupervised model can automatically detect the 3 dermoscopic features of the 3-point checklist, with best accuracies of 0.80 (area under the curve [AUC] 0.8380), 0.89 (AUC 0.9036), and 0.76 (AUC 0.8444), in some cases outperforming human annotators.
Our proposed semisupervised learning framework can help with the automatic diagnosis of skin disease based on its ability to detect dermoscopic features and automate the label-annotation process. The framework can also help combine semantic knowledge with a computer algorithm to arrive at a more accurate and more interpretable diagnostic result, which can be applied to broader use cases.
Skin cancer is one of the most common cancers worldwide, with steadily increasing incidence rates of melanoma and nonmelanoma cancers [
The automated classification of dermoscopic images through convolutional neural networks (CNNs) has emerged as a reliable supplement to visual skin examination by on-site specialists in the detection of skin cancer [
Artificial intelligence (AI) algorithms, however, have some weaknesses. One weakness is the lack of interpretability and transparency regarding how the computer arrived at its output, making it difficult for dermatologists to trust the diagnostic results [
Another limitation of AI algorithms is that a majority rely solely on images as inputs, whereas in a clinical setting, more information can be obtained through, for instance, palpation of the lesion and clinical data on age and family history [
Recent studies have focused on attempts to combine semantic knowledge with the algorithm to arrive at a more accurate diagnosis [
In this experiment, we chose the 3-point checklist for melanoma and melanocytic nevus as an illustration of diagnostic rules and disease class. The 3-point checklist is easy to interpret and is highly sensitive for the diagnosis of melanoma by nonexpert clinicians [
Combining diagnostic rules such as the 3-point checklist with a classification algorithm can improve both patient access to care and diagnostic accuracy. The proposed algorithms have several potential application scenarios, including the following: (1) they can automatically classify skin disease images and generate feature labels, listing the criteria used to categorize suspicious lesions, to improve trust in and acceptance of teledermoscopy; (2) they can help medical students learn to identify the features in dermoscopic images: given the algorithm's detailed evaluation of each criterion in the 3-point checklist, students can use the checklist to learn the fundamental parameters used to differentiate a benign nevus from a melanoma; and (3) they can automate the process of feature annotation, so that fewer human annotators need to be involved, enabling the secondary use of enormous imaging data resources, such as the ISIC archive.
All images in the labeled and unlabeled data sets come from the ISIC archive. “Label” here refers to the 3-point checklist feature labels; both the “labeled” and “unlabeled” data sets contain disease type information. For the small, labeled data set, we selected an even distribution of melanoma and melanocytic nevus dermoscopic images from ISIC 2019 to annotate with the 3-point checklist features. The large, unlabeled data set came mainly from ISIC 2020, which contributed 584 melanoma and 5193 melanocytic nevus dermoscopic images. To balance the data set, we added 4062 melanoma images from ISIC 2019, excluding the images in the small, labeled data set. We divided each data set into training and validation sets in an 80/20 ratio and used 5-fold cross-validation; that is, the data set was divided equally into 5 subsets, each serving in turn as the validation set while the remaining 4 were used for training. We annotated an additional 400 images as a holdout testing set.
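As an illustration, the 5-fold rotation can be set up with scikit-learn's StratifiedKFold; the file list and label variables below are placeholders, not the actual ISIC file names.

```python
from sklearn.model_selection import StratifiedKFold

# Placeholder file list and labels; 0 = melanocytic nevus, 1 = melanoma
image_paths = [f"img_{i:04d}.jpg" for i in range(100)]
disease_labels = [i % 2 for i in range(100)]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(image_paths, disease_labels)):
    # Each subset serves as the validation set exactly once (an 80/20 split per fold)
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val")
```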
The 3-point checklist is easy to interpret and is highly sensitive for the diagnosis of melanoma versus melanocytic nevus. Our algorithm evaluated dermoscopic images of pigmented lesions based on the 3-point checklist, indicating the presence or absence of (1) asymmetry, (2) atypical pigment network, and (3) blue-white structures. If any one of these features was detected in a skin lesion image, 1 point was added to that image's score, giving a scoring range of 0 to 3 per image. These automated 3-point classification outputs can aid a provider's decision to biopsy a lesion or to refer the patient to a specialist for a more thorough evaluation.
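The scoring rule itself reduces to a simple sum, as in this minimal sketch:

```python
def three_point_score(asymmetry: bool, atypical_network: bool, blue_white: bool) -> int:
    """One point per checklist feature present; the total ranges from 0 to 3."""
    return int(asymmetry) + int(atypical_network) + int(blue_white)

# A lesion showing asymmetry and blue-white structures scores 2,
# which under the checklist flags it as suspicious for melanoma.
print(three_point_score(True, False, True))  # -> 2
```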
Number of images for skin disease categories for labeled and unlabeled data sets.
| Disease | Unlabeled data set | Labeled data set |
| --- | --- | --- |
| Melanoma | 4646 | 450 |
| Melanocytic nevus | 5193 | 450 |
| Total | 9839 | 900 |
There are 3 features in the 3-point checklist: atypical network, asymmetry, and blue-white structure. For each feature detected, 1 point is added to the score for that image. The higher the score (usually 2 or higher), the higher the risk of melanoma; if the score is lower than 1, according to the 3-point checklist, the lesion is more likely to be benign. Our experiment was built on a gold standard whereby each image was rigorously reviewed by at least 2 annotators. If consensus was reached, the resulting diagnosis was annotated; if not, a third annotator evaluated the image again. We divided the annotation into 2 steps. First, the 3 annotators attended training sessions to develop consensus annotation guidelines. We provided the annotators with a small image set annotated by domain experts to annotate and evaluate, and during this phase they were allowed to discuss their differing interpretations. After interrater agreement reached at least 70%, we moved to the second step, in which they annotated images independently. We divided the whole image data set into 3 subsets and assigned each annotator 2 subsets so that every image had at least 2 annotation results. Our final Cohen kappa score for interrater agreement in the second step was 0.64, which indicates substantial agreement. If any image received differing annotation results, we brought in the third annotator, who had not previously been assigned to that image, and took a majority vote. Overall, this was a very time-consuming process.
Because the training data set came from 3 data sources, image resolution varied: a lesion could occupy the entire image or only one corner of it. Hence, we developed a rule to crop and resize all the training images, which improved the performance of our model.
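Since the exact cropping rule is not reproduced here, the following sketch assumes one plausible variant: a center crop to a square followed by resizing to a fixed resolution.

```python
from PIL import Image

def crop_and_resize(path, size=224):
    """Center-crop to a square, then resize to a fixed resolution.
    The paper's exact cropping rule is unspecified; this is one plausible choice."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    return img.crop((left, top, left + side, top + side)).resize((size, size))
```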
Because of different imaging sources and illumination conditions, the color of the dermoscopic images varied considerably. It was therefore important to calibrate image color in the preprocessing stage to reduce possible bias for the deep neural network. Catarina et al [
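One widely used calibration for dermoscopy is the Shades of Gray color constancy algorithm; whether this exact variant was applied here is an assumption, and the sketch below is only illustrative.

```python
import numpy as np

def shades_of_gray(img, p=6):
    """Shades of Gray color constancy with Minkowski norm p (p=6 is a common choice).
    img: H x W x 3 RGB array in [0, 255]."""
    img = img.astype(np.float64)
    # Per-channel illuminant estimate via the Minkowski p-norm of pixel values
    illum = np.power(np.mean(np.power(img, p), axis=(0, 1)), 1.0 / p)
    illum = illum / np.linalg.norm(illum)        # unit-norm illuminant estimate
    corrected = img / (illum * np.sqrt(3.0))     # a neutral illuminant maps to unity gain
    return np.clip(corrected, 0, 255).astype(np.uint8)
```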
Contrast-Limited Adaptive Histogram Equalization (CLAHE) was used to improve image contrast. Unlike standard histogram equalization, it operates on several distinct sections of the image and uses them to redistribute the lightness values, which improves local contrast and enhances the edges of objects in the image.
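A minimal sketch using OpenCV, applying CLAHE to the lightness channel; the clip limit and tile grid below are common defaults, not values reported in the paper.

```python
import cv2

def apply_clahe(bgr):
    """Apply CLAHE to the L channel in LAB space so lesion color is preserved."""
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)
```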
We proposed a semisupervised learning framework for the prediction of skin disease that uses a small set of labeled images and a larger set of unlabeled images. The labeled data set contains 900 images that were labeled with disease tags and the 3-point checklist annotation, while the unlabeled data set contains 9839 images that have only disease tags. The architecture of the proposed classification model is presented in
Architecture of the proposed semisupervised learning framework. EMA: exponential moving average; ResNet: residual neural network.
The supervised learning consists of 2 tasks that are jointly learned during training: one is the classification of the skin disease, and the other is the classification of each feature in the 3-point checklist. Using the 3-point checklist, each feature is given a binary score of 0 or 1 in the training phase, indicating whether it exists in the image. A total score of 2 or higher suggests that the lesion is more likely to be malignant. We incorporated the traditional cross-entropy loss to optimize the skin disease classification and used a ranking loss to represent the 3-point checklist knowledge. The hyperparameters for our training models are as follows: a batch size of 128, stochastic gradient descent optimizer, and ReduceLROnPlateau learning rate decay (mode=“min,” factor=0.5, threshold=0.01, patience=7, verbose=True).
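A sketch of this supervised setup in PyTorch follows; the scheduler settings and batch size match the values stated above, while the learning rate, momentum, and fused 5-logit output head are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(num_classes=2 + 3)   # 2 disease logits + 3 checklist-feature logits
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, threshold=0.01, patience=7)
# loader = DataLoader(train_set, batch_size=128, shuffle=True)  # batch size as stated

disease_ce = nn.CrossEntropyLoss()     # skin disease classification
feature_bce = nn.BCEWithLogitsLoss()   # per-feature presence/absence

def supervised_loss(images, disease_y, feature_y):
    logits = model(images)
    d_logits, f_logits = logits[:, :2], logits[:, 2:]
    return disease_ce(d_logits, disease_y) + feature_bce(f_logits, feature_y.float())
```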
Image annotation requires not only extensive time investment but also domain expertise of human annotators. Inspired by the research of Tarvainen and Valpola [
Using the ranking loss, we enforce the model to learn a predefined diagnostic rule: samples with higher scores are more likely to have melanoma. The ranking loss is computed from each pair of samples in a batch. We denote by $p_i$ the predicted melanoma probability and by $s_i$ the 3-point checklist score of sample $i$.
Then, the cross-entropy loss function for disease classification can be represented in its standard form as $L_{\mathrm{CE}} = -\sum_{c} y_c \log \hat{y}_c$, where $y$ is the one-hot disease label and $\hat{y}$ the vector of predicted class probabilities.
We compute the ranking loss over every pair $(i, j)$ in a batch with $s_i > s_j$, penalizing the pair whenever $p_i$ does not exceed $p_j$.
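A minimal PyTorch sketch of this pairwise constraint, assuming a hinge (margin) formulation; the margin value is an illustrative choice, and the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def checklist_ranking_loss(melanoma_logits, scores, margin=0.1):
    """Pairwise ranking over a batch: if s_i > s_j, the predicted melanoma
    probability p_i should exceed p_j by at least the margin."""
    p = torch.sigmoid(melanoma_logits)        # p_i: predicted melanoma probability
    diff_p = p.unsqueeze(1) - p.unsqueeze(0)  # p_i - p_j for all pairs (i, j)
    mask = (scores.unsqueeze(1) > scores.unsqueeze(0)).float()  # pairs with s_i > s_j
    hinge = F.relu(margin - diff_p)           # penalize pairs ranked in the wrong order
    return (hinge * mask).sum() / mask.sum().clamp(min=1.0)
```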
The EMA model behaves as the teacher model on the unlabeled data. This method constrains the model to behave similarly to its past versions during the update, so it can potentially find flatter local minima and avoid singularity points where a small update would result in a large change in the model's behavior. The mean-teacher strategy proved useful in previous works, and the consistency cost is defined as $J(\theta) = \mathbb{E}_{x}\left[\lVert f(x, \theta') - f(x, \theta) \rVert^{2}\right]$, where the teacher parameters $\theta'$ are updated as an EMA of the student parameters $\theta$: $\theta'_t = \alpha \theta'_{t-1} + (1 - \alpha)\theta_t$.
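A sketch of the EMA update and a mean squared consistency cost; the smoothing coefficient of 0.99 is a typical mean-teacher value, not one reported in the paper.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.99):
    """Teacher update: theta'_t = alpha * theta'_(t-1) + (1 - alpha) * theta_t."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

def consistency_cost(student_logits, teacher_logits):
    """Mean squared distance between student and teacher predictions on unlabeled images."""
    return torch.mean((torch.softmax(student_logits, dim=1)
                       - torch.softmax(teacher_logits, dim=1)) ** 2)
```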
Finally, the ranking loss, disease supervised loss, feature supervised loss (FSL), and consistency loss were added together to train the model.
Our models were built on the state-of-the-art ResNet architecture. We tried ResNet-18, ResNet-50, ResNet-152, and ResNeXt50_32x4d, and there was no significant difference in classification accuracy among them. To facilitate the training process, we used a relatively light architecture, ResNet-18, as our baseline.
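One way to realize the two jointly trained heads on a ResNet-18 backbone; this two-head layout is a sketch consistent with the architecture described, not the authors' exact code.

```python
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class ThreePointNet(nn.Module):
    """ResNet-18 backbone with separate disease and checklist-feature heads."""
    def __init__(self, pretrained=True):
        super().__init__()
        weights = ResNet18_Weights.DEFAULT if pretrained else None
        self.backbone = resnet18(weights=weights)
        dim = self.backbone.fc.in_features      # 512 for ResNet-18
        self.backbone.fc = nn.Identity()
        self.disease_head = nn.Linear(dim, 2)   # melanoma vs melanocytic nevus
        self.feature_head = nn.Linear(dim, 3)   # asymmetry, atypical network, blue-white

    def forward(self, x):
        z = self.backbone(x)
        return self.disease_head(z), self.feature_head(z)
```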
The first task was to test whether the model's classification accuracy increases after adding human knowledge, transformed and represented in the ranking loss format. Many state-of-the-art CNN architectures have been developed for image recognition tasks, some of which have achieved strong performance on skin lesion recognition with ISIC data sets. In a 2021 paper published by Yiming Zhang et al [
As can be seen from the table, the pretrained baseline model reached the same level of accuracy on the large 9000-image data set. After adding the human knowledge of the 3-point checklist rule, the average accuracy improved further on this basis.
The previous experiment was based on human-annotated, 3-point feature labels. The entire process, from recruiting annotators to finally reaching agreement, took more than 2 months. Hence, we developed the semisupervised model to automate the feature-annotation process. We combined the generated features as human knowledge to test whether such knowledge can help to improve the disease classification accuracy.
To evaluate the performance of the 3-point feature classification for our semisupervised model, we calculated the testing accuracy and area under the receiver operating characteristic curve (AUC) on a separate holdout testing data set that contains 100 images with annotated 3-point features and disease type. We tested the performance for feature and disease classification on the models shown in
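For instance, accuracy and AUC for one checklist feature on the holdout set can be computed with scikit-learn; the arrays below are placeholders.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

# Placeholder holdout predictions for one checklist feature
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.2, 0.8, 0.6, 0.3, 0.9, 0.4, 0.7, 0.5])

auc = roc_auc_score(y_true, y_prob)                        # threshold-free ranking quality
acc = accuracy_score(y_true, (y_prob >= 0.5).astype(int))  # accuracy at a 0.5 cutoff
print(f"accuracy {acc:.2f}, AUC {auc:.4f}")
```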
As seen in
Five-fold cross validation results for the disease classification task.
| Model | Five-fold accuracy, mean (SD) |
| --- | --- |
| MobileNetV3 (Pretrain=True) | 0.8733 (0.0113) |
| DenseNet (Pretrain=True) | 0.8856 (0.0114) |
| Baseline (ResNet-18, Pretrain=True) | 0.8867 (0.0191) |
| Baseline + Human Knowledge (RL^a) | 0.8943 (0.0115) |
^a RL: ranking loss.
Results for semisupervised model for disease or feature classification tasks with different loss functions—disease supervised loss (DSL), feature supervised loss (FSL), and consistency loss (CL).
| Model | Asymmetry, accuracy (AUC^a) | Atypical network, accuracy (AUC) | Blue-white structure, accuracy (AUC) | Disease, accuracy (AUC) |
| --- | --- | --- | --- | --- |
| CL | 0.51 (0.5760) | 0.53 (0.5021) | 0.54 (0.5620) | 0.54 (0.5648) |
| DSL | 0.51 (0.5480) | 0.76 (0.6480) | 0.58 (0.5285) | 0.76 (0.8690) |
| FSL | 0.80 (0.8380) | 0.89 (0.9036) | 0.74 (0.8036) | 0.51 (0.5339) |
| FSL+CL | 0.68 (0.7816) | 0.87 (0.8752) | 0.75 (0.8137) | 0.53 (0.5402) |
| DSL+FSL | 0.76 (0.7892) | 0.86 (0.8602) | 0.76 (0.8133) | 0.74 (0.8418) |
| DSL+CL | 0.53 (0.5448) | 0.79 (0.4340) | 0.47 (0.5943) | 0.77 (0.8389) |
| DSL+FSL+CL | 0.73 (0.8036) | 0.85 (0.8474) | 0.76 (0.8444) | 0.79 (0.8402) |
| DSL+FSL^b+CL | 0.75 (0.7932) | 0.88 (0.8752) | 0.71 (0.7951) | 0.69 (0.7971) |
^a AUC: area under the receiver operating characteristic curve.
^b We emphasized the weight of the “Asymmetry” feature in the loss function.
Annotators in this study were medical students without expert training in dermatology; they evaluated images based solely on tutorials from web-based resources and textbooks. Without designated training using example images, each annotator initially had a different idea of what each feature looked like. Preliminary agreement scores might have been higher if annotators had been given reference images from which to learn the dermoscopic features. This finding highlights the potential value of our algorithm as an educational tool: if medical students can evaluate a dermoscopic image and check their 3-point annotation against the algorithm's validated output, they can develop their ability to visually identify each dermoscopic feature.
During the image-annotation process, annotators faced several uncertainties. First, the vague definition of dermoscopic features, especially “atypical network,” posed an issue, as each annotator had a different idea of what it looked like, resulting in initially low agreement scores. We addressed this concern by proposing an ontology that integrates domain knowledge on dermoscopic features and represents the features in a more standardized, computer-readable format.
Another source of uncertainty in analyzing the images was the use of different screens with various color-display settings. One common error was the inability to properly characterize blue structures when night-light or blue-light filters were activated; because such options can be engaged automatically on a schedule, they can silently introduce annotation errors. The use of different screens led to initial disagreement among the annotators but can be corrected by proper calibration and by ensuring that no color filter is active.
One limitation of this study was that most of the images were of White skin, which raises the question of whether the algorithm can effectively detect melanoma in skin of color. Training the algorithm to identify lesions across a wider range of skin colors would be valuable in helping to screen a larger population of patients at risk of melanoma. Another limitation was that image quality could have been degraded by shadows, hairs, reflections, and noise, leading to inadequate lesion analysis, as discussed in an earlier study [
For the first task, after incorporating the human knowledge of the 3-point checklist, the model with weights loaded from the large data set improved its classification accuracy from an average of 0.8867 to 0.8943. This shows that the ranking loss has a positive impact on classification accuracy. We plan to continue expanding the human knowledge used, developing more complicated diagnostic rules to test their impact on computer algorithms.
For the feature- and disease-classification task that used semisupervised architecture, interesting findings were discovered in
We also noticed that, during the human annotation of the 3-point checklist, the atypical network had the lowest interrater agreement among the 3 annotators. For the computer feature-classification task, however, the atypical network had the highest classification accuracy. This suggests that the algorithm has the advantage of learning certain image features that can be challenging even for human experts, and it shows that human intelligence and AI can complement each other.
Because our image data set is from the ISIC archives, we also compared the performance of our algorithm with the winner of the ISIC 2020 leaderboard [
We plan to implement more fine-tuned model architectures trained from scratch so that a more advanced ensemble can be built by integrating the submodel architectures. Our current experimental setting, with these disease classes and the rules of the 3-point checklist, is only a demonstration of how the human thinking process can be integrated into the structure of CNNs. As dermatology thrives, numerous diagnostic rules are being developed; we plan to summarize these diagnostic rules and the dermoscopic features they mention, together with their relationships to skin diseases, into an ontology, further accelerating the automation of clinical decision support by computer algorithms. With our trained algorithm, we can already automate the 3-point checklist annotation process and apply it to a wider range of image databases.
This study is distinctive because it combines the semantic knowledge of the 3-point checklist with a computer algorithm (a CNN) to arrive at a more accurate and more interpretable diagnosis; the CNN classification is thus based on more information than the image pixels alone. Because the image-annotation process consumes so much time and labor, vast imaging data sets remain unexploited. Our proposed semisupervised learning framework can help automate the annotation process, enabling the reuse of many skin-imaging data sets, which also benefits the robustness and domain adaptation of the deep-learning model.
AI: artificial intelligence
AUC: area under the receiver operating characteristic curve
CNN: convolutional neural network
EMA: exponential moving average
FSL: feature supervised loss
ISIC: International Skin Imaging Collaboration
XZ conducted the experiments and led the writing of the manuscript. ZX and YX helped with the design of the model and the writing of methodology. IB, MK, and CS conducted the annotation and contributed to the writing of the manuscript from the clinician’s perspective. LG and CT supervised the project. All authors participated in the design of this study.
This work was supported by UTHealth Innovation for Cancer Prevention Research Training Program Pre-doctoral Fellowship (Cancer Prevention and Research Institute of Texas Grant No. RP160015 and No. RP210042).
None declared.