Introduction

Skin cancer results in approximately 91,000 deaths annually1. Early detection and regular monitoring are crucial in improving the quality of diagnosis, ensuring accurate treatment planning, and reducing skin cancer mortality rates2. A common detection method involves a dermatologist examining skin images to identify ambiguous clinical patterns of lesions that are often not visible to the naked eye. Dermoscopy, a widely used technique, helps dermatologists differentiate between malignant and benign lesions by eliminating surface reflections on the skin, thereby improving the accuracy of skin cancer diagnosis3.

Figure 1

Challenges in skin lesion segmentation using dermoscopic images. First row: (a) minor variation in the lesion and skin color, (b) low contrast between lesion and skin, (c) occlusion of lesions due to hair, and (d) artifacts from image acquisition. Second row: a few examples from the ISIC skin lesion dataset4 used in this paper.

Table 1 Related work on skin lesion segmentation with CNN and GAN-based approaches.

Skin lesion segmentation, the task of differentiating foreground lesions from the background, has received considerable attention for over a decade due to its high clinical applicability. Computer-aided diagnostic algorithms for automated skin lesion segmentation could aid clinicians in precise treatment and diagnosis, strategic planning, and cost reduction. However, automated skin lesion segmentation is challenging due to several factors7, such as (1) large variance in shape, texture, color, geographical conditions, and fuzzy boundaries, (2) the presence of artifacts such as hair and blood vessels, and (3) poor contrast between background skin and cancerous lesions, in addition to artifacts from image acquisition, as shown in Fig. 1.

Prior work

Pixel-level skin lesion segmentation algorithms can be divided into approaches built upon (a) classical image processing and (b) deep learning-based architectures. Deep learning-based methods can be further classified into Convolutional Neural Network (CNN) and Generative Adversarial Network (GAN) approaches based on the network topology. A brief review of prior works in these categories is presented in Table 1. The performance of classical image processing approaches depends heavily on post-processing (such as thresholding, clustering, and hole filling), hyperparameter tuning, and manual feature selection. Manually tuning these parameters can be expensive and can result in poor generalizability. Lately, deep learning-based approaches have surpassed many classical image processing-based approaches, mainly due to the wide availability of large labeled datasets and compute resources. Deep convolutional neural network (DCNN) based methods gained considerable popularity for skin lesion segmentation prior to the introduction of Transformer and GAN-based approaches in the field of medical imaging23,24,25,26,27.

Figure 2

Flowchart of the proposed framework. The generator module is an encoder-decoder network. The discriminator classifies the segmentation result as real or fake.

The success of prior DCNN-based approaches in skin lesion segmentation is primarily based on supervised methods that rely on large labeled datasets to extract features related to the image’s spatial characteristics and deep semantic maps. However, gathering a large dataset with finely annotated images is time-consuming and expensive. To address this challenge, Goodfellow et al.28 introduced Generative Adversarial Networks (GANs), which have gained popularity in various applications, including medical image synthesis, where finely annotated data are scarce. Several recent and relevant GAN-based approaches in skin lesion analysis from the literature are listed in Table 1. Unsupervised learning-based algorithms that can handle large datasets with precision and high performance without requiring ground truth labels carry significant promise in addressing real-world problems such as computer-aided medical image analysis.

In our work, we address the challenges of skin lesion segmentation by utilizing generative adversarial networks (GANs)28, which can generate accurate segmentation masks with minimal or no supervision. GANs work by training a generator and discriminator to compete against each other, where the generator tries to create realistic images, and the discriminator tries to differentiate between real and generated images (Fig. 2). However, designing an effective GAN for segmentation takes considerable time, as the performance is highly dependent on the architecture and choice of the loss function. Our study aims to optimize all three components (generator, discriminator, and loss function) for better segmentation results. The choice of the loss function is critical for the success of any deep learning architecture, and our approach takes this into account29.

Proposed work

We propose two GAN frameworks for skin lesion segmentation. The first is Efficient-GAN (EGAN), which focuses on precision and learns in an unsupervised manner, making it data-efficient. It uses an encoder-decoder-based generator, a patchGAN30-based discriminator, and a smoothing-based loss function. The generator architecture uses a squeeze-and-excitation-based compound-scaled encoder and a lateral-connection-based asymmetric decoder. This architecture captures dense features to generate fine-grained segmentation maps, and the discriminator distinguishes between synthetic and original labels. We also implement a morphology-based smoothing loss function to capture fuzzy boundaries more effectively.

Although deep learning methods provide high precision for lesion segmentation, they are computationally expensive, making them impractical for real-world applications with limited resources, such as dermatoscopy machines. This presents a challenge in contexts where high-resource devices are unavailable to dermatologists. To address this issue, various devices like MoleScope II, DermLite, and HandyScope, which attach a special lens to a smartphone, have been developed for lesion analysis with low computational resources. To create a more practical model for such real-time applications, we propose Mobile-GAN (MGAN), a lightweight unsupervised model consisting of an Inverted Residual block31 with Atrous Spatial Pyramid Pooling32. With this model, we aim to achieve good segmentation performance in terms of the Jaccard score with lower resource strain. With only 2.2M parameters (as opposed to 27M parameters in EGAN), the model can run at 13 frames per second, increasing the potential impact of computer vision-based approaches in day-to-day clinical practice.

Results

Performance of CNN-based models

We implemented and analyzed the results of several CNN and GAN-based approaches for this task. Table 2 summarizes the evaluation of CNN and GAN-based approaches on the unseen test dataset. We started with one of the most popular architectures in medical imaging segmentation, UNet33. Since this architecture is a simple stack of convolutional layers, the original UNet provided a baseline performance on the ISIC 2018 dataset. We then strategically conducted several experiments using deeper encoders like ResNet, MobileNet, and EfficientNet together with asymmetric decoders (described in the Methods section). The concatenation of low-level features is selective, based on the number of output feature maps, rather than linking every block from the encoder as in the traditional UNet. Adding a batch normalization layer after each convolutional layer also helped achieve better performance. For a detailed evaluation of CNN-based methods, we also experimented with DeepLabV3+32 and Feature Pyramid Network (FPN)34 decoders in combination with the various encoders described above, and these modifications led to improved performance. The results on the ISIC 2018 test set from our experimentation, i.e., obtained by running the authors’ code to train the proposed models, are marked with \(*\) in Table 2.
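For illustration, such encoder-decoder baselines can be assembled with the segmentation_models_pytorch library; this is an implementation assumption for the sketch only, and the exact training configurations are not shown.

import segmentation_models_pytorch as smp

# UNet-style decoder paired with a deeper, ImageNet-pretrained encoder
unet = smp.Unet(encoder_name="efficientnet-b4", encoder_weights="imagenet",
                in_channels=3, classes=1)

# The same encoder families can be paired with alternative decoders
deeplab = smp.DeepLabV3Plus(encoder_name="resnet50",
                            encoder_weights="imagenet", classes=1)
fpn = smp.FPN(encoder_name="mobilenet_v2",
              encoder_weights="imagenet", classes=1)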

Performance of GAN-based models

Table 2 also lists several results from the recent literature on this dataset for completeness of comparison. Models trained by us were submitted to the evaluation server for a fair evaluation. We then compare the results of various GAN-based approaches, as shown in Table 2. We observe that a well-designed generative adversarial network (GAN) improves performance compared to CNN-based techniques for medical image segmentation. This demonstrates GANs’ ability to overcome the main challenge in this domain, namely the lack of large labeled training datasets. Our proposed EGAN approach outperforms all other approaches in terms of the Dice coefficient. A few works8,9,35,36 report better performance than our results; however, these works created and used an independent test split from the ISIC training data and did not use the actual ISIC test data.

Table 2 Results of CNN and GAN-based approaches including our proposed algorithms (MGAN and EGAN) on the ISIC 2018 test dataset.

Performance of lightweight models

We designed a lightweight generator model called MGAN, based on DeepLabV3+ and MobileNetV2, which achieves results comparable to our EGAN model in terms of the Dice coefficient with significantly fewer parameters and faster inference times. Table 3 compares various mobile architectures based on the Jaccard Index, the number of parameters (in millions), and inference speed on the test dataset for a patch size of \(512 \times 512\). As shown in Table 3, MGAN has 2.2M parameters and provides an inference speed of 13 FPS. Even though SLSNet reports a higher performance in terms of the Jaccard Index, that result was evaluated on an independent validation test set.

Table 3 Comparison of various Mobile networks at the task of skin lesion segmentation on the ISIC 2018 dataset.
Figure 3

Visualization of the learned feature maps of the proposed EGAN architecture.

Figure 4

Comparison of the segmentation by various CNN and GAN-based approaches. Each column serially depicts the input image, label, output of various CNN-based approaches, and output of proposed MGAN and EGAN. Ground truth and segmented lesions are marked with green and red curves respectively.

Visualization of the learned representations

One of the criticisms of deep neural networks, which can make valuable and skillful predictions, is that they are generally opaque, i.e., it is unclear how or why a particular prediction or decision is made. To address this concern, we utilized the internal structures of convolutional neural networks operating on 2D image data to investigate the representations learned by our unsupervised model. Figure 4 displays the segmentation results for visual interpretation. The proposed GAN framework demonstrates robust segmentation performance even in the presence of non-skin objects or artifacts in the image. We assessed and visualized the 2D filter weights of the model to explore the features it learned. Additionally, we investigated the activation layers of the model to understand precisely which features the model recognized for a given input image, and we visualized the results in Fig. 3. Since the model has numerous convolutional layers in each architectural block, we selected the output of the seven encoder blocks (Block1-Block7) and four output feature maps from the decoder (D1-D4) for visualization.

Discussion

This paper has three main findings. First, we proposed a novel unsupervised adversarial learning-based framework (EGAN), based on Generative Adversarial Networks (GANs), to accurately segment skin lesions in a fine-grained manner. In data-scarce applications such as skin lesion segmentation, the success of GANs relies on the quality of the generator, discriminator, and loss function used. One of the main challenges in the field of medical imaging is the availability of large annotated datasets, the collection of which is a tedious, time-consuming, and costly task. To address this data-efficiency challenge, we trained our model in an unsupervised manner, allowing the generator module to capture features effectively and segment the lesion without supervision. Our patchGAN-based discriminator penalized the adversarial network by differentiating between labels and predictions. The patchGAN-based architecture is already powerful enough to distinguish real from fake labels, so no further enhancement of the discriminator was needed. In skin lesion segmentation, capturing contextual information around the segmentation boundary is crucial for improving performance8. To address this, we implemented the morphology-based smoothing loss to capture fuzzy lesion boundaries, resulting in a highly discriminative GAN that considers contextual information and segmented boundaries. The performance-oriented EGAN approach outperforms prior works, achieving a Dice coefficient of 90.1% on the ISIC 2018 test dataset when trained with adversarial learning and the morphology-based smoothing loss function, compared to 88.4% when using the dice loss alone. Our evaluation on the ISIC 2018 dataset demonstrates significantly improved performance compared to existing models in the literature. Furthermore, the proposed framework’s potential can be extended to other medical imaging applications.

Second, we proposed a lightweight segmentation framework (MGAN) that achieves comparable results while being much less computationally expensive, with an order of magnitude fewer training parameters and significantly faster inference time. The MGAN approach is suitable for real-time applications, making it a viable solution for deployment at the edge, for instance, in low-compute-resource contexts. Our proposed framework thus includes two generative models, EGAN and MGAN, designed to balance performance and efficiency. Integrating models like MGAN with dermoscopy devices has the potential to revolutionize the future of dermatology, enabling more efficient, accurate, real-time segmentation and accessible care for patients with skin lesions.

Third, our approach enables visualizing the learned representations of the model to interpret its predictions. This is especially crucial for algorithm-in-the-loop clinical applications such as skin lesion segmentation, where the decisions of automated segmentation methods could be considered by clinicians in the context of the features learned by the model.

Limitations: Although our model achieved promising performance on the ISIC 2018 dataset, its performance could not be evaluated on other datasets. We explored different datasets such as Derm7pt43, Diverse Dermatology Images44, and Fitzpatrick 17k45, among others, to assess the generalizability of the proposed approach; however, segmentation masks for them were not available at the time of writing this paper. While segmentation masks were available for the PH2 dataset46, we could not access the dataset. Deep learning models are computationally intensive and require significant resources. The EGAN model is computationally heavy for deployment in real-time clinical applications, which can limit its use in resource-constrained environments or on devices with limited processing capabilities. In such scenarios, models such as MGAN could be utilized.

Methods

The skin lesion GAN-based segmentation framework we propose in this work is shown in Fig. 2. The framework contains three main components: (1) the generator, which consists of an encoder to extract feature maps and a decoder to generate segmentation maps without supervision and adapt to variations in contrast and artifacts; (2) the discriminator, which distinguishes between the reference label and the segmentation output; and (3) appropriate loss functions to prevent overfitting, achieve excellent convergence, and accurately capture fuzzy lesion boundaries.

Dataset

The proposed segmentation approach was evaluated using the ISIC 2018 dataset, a standard skin lesion analysis dataset. This dataset contains 2594 images with corresponding ground truth, of which 20% (514 images) were used for validation. The images in the dataset vary in size and aspect ratio and contain lesions with different appearances in various skin areas. Some sample images from the dataset are shown in Fig. 1. To ensure a fair evaluation, the results on the test set were uploaded to the online server of the ISIC 20184 portal.

Figure 5

The architecture of the proposed generator in the EGAN architecture.

Generative adversarial network

Goodfellow et al.28 first introduced Generative Adversarial Networks (GANs) to generate synthetic data. Labeling clinical data is a tricky and time-consuming task requiring a specialist, and several medical imaging applications lack adequately annotated data. Inspired by this, the proposed work leverages an unsupervised GAN for skin lesion segmentation. We first briefly review the generator and discriminator concepts. An adversarial network comprises a generator (G) and a discriminator (D). The generator maps a random vector \(\gamma\) from the source domain space \(\alpha\) to generate the desired output in the target domain \(\beta\) and tries to fool the discriminator, while D learns to classify whether \(\beta\) is real (reference ground truth) or fake (generated by G). To learn the generator’s distribution \(p_{G}\) over the data \(\alpha\), an input noise distribution \(P_{\gamma }(\gamma )\) is defined and mapped to the data space as \(G(\gamma ; \theta _{G})\), where G is a differentiable function with parameters \(\theta _{G}\). \(D(\alpha )\) is the probability that \(\alpha\) came from the data rather than from \(p_{G}\).

The adversarial training is represented by the following equation28, which is a min-max game between G and D:

$$\begin{aligned} \min _{G}\max _{D} V(D,G)&= E_{\alpha \sim P_{data}(\alpha )} [\log D_{\theta _{D}}(\alpha )]\\&+ E_{\gamma \sim P_{\gamma }(\gamma )}[\log (1 - D_{\theta _{D}}(G_{\theta _{G}}(\gamma )))] \end{aligned}$$
(1)

where V is a function of the discriminator (D) and the generator (G), \(\gamma\) is drawn from the input noise distribution \(P_{\gamma }(\gamma )\), true samples are drawn from \(P_{data}(\alpha )\), and \(\theta _{G}\) and \(\theta _{D}\) are the generator and discriminator parameters, respectively.
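For concreteness, the following is a minimal sketch of one alternating update of Eq. (1), assuming PyTorch; G and D stand for any generator and discriminator modules, and in our segmentation setting the generator’s input is the dermoscopic image rather than pure noise.

import torch
import torch.nn.functional as F

def adversarial_step(G, D, image, real_mask, opt_G, opt_D):
    """One alternating update of the min-max game in Eq. (1)."""
    fake_mask = G(image)

    # Discriminator step: maximize log D(real) + log(1 - D(fake))
    d_real, d_fake = D(real_mask), D(fake_mask.detach())  # detach: G frozen
    loss_D = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: fool the discriminator, i.e. maximize log D(fake)
    d_fake = D(fake_mask)
    loss_G = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()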

Segmentation framework

Generally, segmentation frameworks consist of an encoder-decoder architecture. The encoder module is the feature extraction block that captures spatial information within the image; it reduces the spatial size, i.e., the dimension of the input, and decreases the feature map resolution to capture high-level features. The decoder recovers the spatial information by upsampling the feature maps extracted by the encoder layers and produces the output segmentation map. We propose to modify the encoder-decoder design to capture dense feature maps rather than using a traditional encoder, and to change the decoder accordingly, as shown in Fig. 5. Including a squeeze-and-excitation-based compound-scaled encoder significantly improves the results.

Design of encoder

The advancement of CNN designs depends on the accessibility of infrastructure and, subsequently, on scaling the model in terms of width (w), depth (d), or resolution (r) to achieve further significant performance improvements as more resources become available. Instead of performing this scaling manually and arbitrarily, Tan et al.47 proposed a novel systematic and automatic scaling approach by introducing a compound coefficient. The compound coefficient \(\phi\) efficiently scales the network’s depth, width, and resolution with a proper arrangement of scaling factors, per the following equation:

$$\begin{aligned} & w: \text {network width} = \beta ^{\phi } \\ & d: \text {network depth} = \alpha ^{\phi } \\ & r: \text {input resolution} = \gamma ^{\phi } \\ & \text {satisfying } \alpha \cdot \beta ^{2} \cdot \gamma ^{2} \approx 2 \\ & \text {with } \alpha \ge 1,\; \beta \ge 1,\; \gamma \ge 1 \end{aligned}$$
(2)

The encoder is built using the above equation, following Baheti et al.40, and consists of seven building blocks. The basic building block of this encoder is the mobile inverted bottleneck convolution (MBConv) with squeeze-and-excitation functions48, as shown in Fig. 5b. Swish activation is used in each encoder block, further enhancing performance.
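For illustration, the snippet below evaluates Eq. (2) for a given \(\phi\); the base coefficients are the EfficientNet-B0 values reported by Tan et al.47 and are quoted here as an assumption, not a detail of our encoder.

# Base scaling factors alpha (depth), beta (width), gamma (resolution);
# these are the EfficientNet-B0 coefficients from Tan et al.47
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # satisfies alpha * beta^2 * gamma^2 ~ 2

def compound_scale(phi: int, base_depth: float = 1.0,
                   base_width: float = 1.0, base_resolution: int = 224):
    """Return (depth, width, resolution) multipliers for coefficient phi."""
    return (base_depth * ALPHA ** phi,
            base_width * BETA ** phi,
            base_resolution * GAMMA ** phi)

# Example: phi = 4 roughly corresponds to an EfficientNet-B4-sized encoder
print(compound_scale(4))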

Design of decoder

The encoder downsamples the input image to a smaller resolution and captures contextual information. The decoder block, also called the upsampling path, comprises several convolutional layers that progressively upsample the feature maps obtained from the encoder. Conventional segmentation frameworks like UNet33 have symmetric encoder and decoder architectures; in contrast, the proposed architecture combines the compound-scaled squeeze-and-excitation-based encoder with an asymmetric decoder. The output features from the encoder are expanded in the decoder blocks using bilinear upsampling, and the low-level features from the encoder are combined with the higher-level feature maps of matching size from the decoder to generate a more precise segmentation output.
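A minimal PyTorch sketch of one such decoder block follows, with illustrative channel sizes rather than the exact EGAN configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """One asymmetric decoder block: bilinear upsample, fuse skip, convolve."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))

    def forward(self, x, skip):
        # Expand deep features to the resolution of the encoder skip features
        x = F.interpolate(x, size=skip.shape[2:], mode="bilinear",
                          align_corners=False)
        x = torch.cat([x, skip], dim=1)  # combine low- and high-level features
        return self.conv(x)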

Figure 6

The architecture of the lightweight and efficient segmentation network MGAN. This architecture is based on an inverted residual network and atrous spatial pyramid pooling. The inverted residual block is shown above the encoder.

Design of lightweight segmentation framework

To develop a lightweight segmentation architecture for the generator, we leverage the power of MobileNetV231 and DeepLabV3+32, with its atrous spatial pyramid pooling (ASPP) module, as shown in Fig. 6. MobileNetV2 uses depthwise separable convolutions and inverted residual blocks as its basic building modules, shown in Fig. 6 above the encoder. MobileNetV2 is modified such that the output stride, i.e., the ratio of the input image resolution to the output feature map resolution, is 8. It has fewer computations and parameters and is thus suitable for real-time applications. The ASPP block uses a variety of dilation rates, i.e., 1, 6, 12, and 18, to generate multi-scale feature maps, which are then integrated by concatenation. This feature map is upsampled and integrated with a low-level intermediate feature map from the contracting path, i.e., the encoder, to generate fine-grained segmentation output. The feature extractor consists of a sequence of inverted residual blocks, as shown in Fig. 6, with the stride of the latter blocks set to one. Images of size 512 \(\times\) 512 \(\times\) 3 are fed as input to the MGAN architecture.
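A minimal PyTorch sketch of such an ASPP module with the dilation rates stated above is shown below; the channel widths are illustrative assumptions.

import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Parallel atrous convolutions at rates 1, 6, 12, 18, fused by concat."""
    def __init__(self, in_ch, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch,
                      kernel_size=1 if r == 1 else 3,
                      padding=0 if r == 1 else r,
                      dilation=r)
            for r in rates)
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        # Each branch preserves spatial size; concatenation integrates scales
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))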

Discriminator

In our architecture, we have a generator and a discriminator. The discriminator supervises the generator to produce precise masks that match the original ground truth. We have implemented a patchGAN-based approach to achieve this, classifying each \(m \times n\) patch of the mask as real or fake relative to the ground truth. The discriminator consists of five Conv2D layers with a kernel size of 4 \(\times\) 4 and a stride of 2 \(\times\) 2, with 64, 128, 256, 512, and 1 feature maps in the respective layers. LeakyReLU activation with an alpha value of 0.2 is used after each Conv2D layer, with the last layer using sigmoid activation. The patch-based discriminator has an output size (\(m \times n\)) of 16 \(\times\) 16, where each pixel is linked to a patch of the input probability maps with a size of 94 \(\times\) 94. The discriminator classifies each patch as either fake or real. This learning strategy enforces the predicted label to be similar to the ground truth. The number of parameters is the same as proposed in patchGAN30.
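The stated specification translates directly into a short PyTorch sketch; the padding of 1 is our assumption, chosen because it reproduces the 16 \(\times\) 16 output map and the 94 \(\times\) 94 receptive field quoted above.

import torch.nn as nn

def patch_discriminator(in_ch=1):
    """Five 4x4 stride-2 convolutions (64, 128, 256, 512, 1 feature maps).

    With a 512 x 512 input this yields a 16 x 16 patch map, and each
    output pixel sees a 94 x 94 receptive field in the input.
    """
    layers, ch = [], in_ch
    for ch_out in (64, 128, 256, 512):
        layers += [nn.Conv2d(ch, ch_out, kernel_size=4, stride=2, padding=1),
                   nn.LeakyReLU(0.2, inplace=True)]
        ch = ch_out
    layers += [nn.Conv2d(ch, 1, kernel_size=4, stride=2, padding=1),
               nn.Sigmoid()]   # per-patch real/fake probability
    return nn.Sequential(*layers)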

We apply the following adversarial strategy so that each generated label aligns with the ground truth labels. A min-max two-player game alternately updates the generator and discriminator networks with adversarial learning. The discriminator loss is given by:

$$\begin{aligned} L_{D}(x,y) = -\sum _{x,y} \Big [ \gamma \log (D(I_S)) + (1 - \gamma ) \log (1 - D(I_T)) \Big ] \end{aligned}$$
(3)

where \(x, y\) are the pixel locations of the input; \(D(I_S)\) is the discriminator output for source domain images (\(I_S\)), i.e., label images; \(D(I_T)\) is the discriminator output for target domain images (\(I_T\)), i.e., predicted images; and \(\gamma\) is the probability of the predicted pixel, with \(\gamma = 1\) when the prediction is from the ground truth, i.e., the source domain, and \(\gamma = 0\) when the prediction is from the generator’s segmented mask, i.e., the target domain.

Loss function

We implement a morphology-based smoothing loss to improve skin lesion segmentation and to guide the network to capture the lesion’s smoothness and fuzzy boundaries. The network’s loss function includes the dice coefficient loss \((L_{DL})\) as well as the morphology-based smoothing loss \((L_{SL})\). The dice coefficient loss assesses the overlap between the ground truth and the prediction and is given by:

$$\begin{aligned} L_{DL}({\widehat{v}},v) = 1 - \frac{ 2\sum _{i \in \omega } \widehat{{v}_{i}}\cdot v_{i}}{\sum _{i \in \omega } \widehat{{v}_{i}}^{2} + {\sum _{i \in \omega } v_{i}^{2}}} \end{aligned}$$
(4)

where \(\omega\) is the set of all pixels in the input image, and v and \({\widehat{v}}\) are the original mask and the predicted probability map, respectively.
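A direct transcription of Eq. (4) into PyTorch could look as follows; the small eps term is our addition to avoid division by zero on empty masks.

import torch

def dice_loss(v_hat: torch.Tensor, v: torch.Tensor, eps: float = 1e-7):
    """Dice coefficient loss of Eq. (4); eps guards against empty masks."""
    intersection = (v_hat * v).sum()
    denom = (v_hat ** 2).sum() + (v ** 2).sum()
    return 1.0 - 2.0 * intersection / (denom + eps)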

The morphology-based smoothing loss encourages the network to make smooth predictions within the nearest-neighbor area49. It is a pairwise interaction of binary labels, written as:

$$\begin{aligned} L_{SL}({\widehat{y}},y) = \sum _{i \in \Omega }\sum _{j \in \mathbb {N}^{\iota }} B(i,j) \times y_{i} \times \left| {\widehat{y}}_{i} - {\widehat{y}}_{j} \right| \quad \text {where } B_{i,j} = \left\{ \begin{matrix} 1 & \text {if } y_{i} = y_{j} \\ 0 & \text {otherwise} \end{matrix}\right. \end{aligned}$$

where \(\mathbb {N}^{\iota }\) is the four-connected neighborhood of pixels, and y and \({\widehat{y}}\) denote the ground truth and prediction probability maps, respectively. The four-connected-neighbor smoothing loss encourages the pixels j surrounding a center pixel i to produce similar prediction probabilities when they share the same ground-truth class (\(B_{i,j} = 1\)).
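A minimal sketch of this loss in PyTorch, assuming batched (B, 1, H, W) probability maps and forming the four-neighbor pairs by one-pixel shifts, is shown below.

import torch

def smoothing_loss(y_hat: torch.Tensor, y: torch.Tensor):
    """Four-neighbor pairwise smoothing loss on (B, 1, H, W) tensors."""
    loss = y_hat.new_zeros(())
    for dh, dw in ((0, 1), (1, 0)):      # right and down neighbor pairs
        yi = y[..., : y.shape[-2] - dh, : y.shape[-1] - dw]
        yj = y[..., dh:, dw:]
        pi = y_hat[..., : y.shape[-2] - dh, : y.shape[-1] - dw]
        pj = y_hat[..., dh:, dw:]
        same_class = (yi == yj).float()  # B(i, j): 1 where labels agree
        loss = loss + (same_class * yi * (pi - pj).abs()).sum()
    return loss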

The combined loss function is written as:

$$\begin{aligned} L_{{\widehat{y}},y} = L_{DL}({\widehat{y}},y) + L_{SL}({\widehat{y}},y) \end{aligned}$$
(5)
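Reusing the two sketches above, the combined objective of Eq. (5) is then simply:

def total_loss(y_hat, y):
    # Eq. (5): sum of the dice loss and the smoothing loss sketched above
    return dice_loss(y_hat, y) + smoothing_loss(y_hat, y)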

Thus, the complete framework optimizes this combined loss function by training the network iteratively49.