Practice Point: A new artificial intelligence (AI) program correctly detects melanoma more frequently than do primary care providers, but it also generates more false alarms.

EBM Pearl: The area under the receiver operating characteristic curve (AUROC) is a metric that describes how well a model can discriminate between binary classifications across all thresholds in a given population.

 

Artificial Intelligence (AI), of a sort, has been around for years in medicine, starting with machines reading EKGs. But now radiologists, pathologists, and even dermatologists are starting to see how visual pattern recognition by AI may change their fields in the near future. The British Journal of Dermatology published a report from the co-founder of a Swedish-based company about an AI program they developed to help primary care providers (PCPs) sort out which suspicious skin lesions were most likely to be melanomas.

The AI program in question was trained on dermoscopic images of melanomatous and non-melanomatous lesions. The PCPs in question were asked to take dermoscopic photographs of skin lesions they believed were potentially melanomas and had already decided to either biopsy or refer to a dermatologist for a second opinion, and only then ask for the AI program's assessment of the lesions.

For those of you who may not know, dermoscopy involves using a special high-powered back-lit magnifying glass (currently costing between $800 and $1,400) to zoom in on suspicious skin lesions to help identify which ones should be biopsied. Most dermatologists use dermoscopes regularly, but few primary care providers (in our experience) are entirely comfortable with the procedure.

In this study, 253 lesions (on 223 patients) had dermoscopic AI analysis followed by either biopsy or a document from a consulting dermatologist stating that a biopsy was not needed.

Out of these, 119 lesions were assessed as benign by dermatologists and were not biopsied. The PCPs thought that 20 percent of the lesions were highly suspicious for melanoma and the remaining 80 percent were considered suspicious enough to refer for consultation or biopsy but were not classified as highly suspicious.

The AI program analyzed the lesions in terms of either the presence or absence of dermoscopic evidence for melanoma, and 44 percent were classified as having dermoscopic evidence of melanoma.

Of the 134 lesions biopsied, 11 were invasive melanoma and 10 were melanoma in situ. The AI program stated one melanoma in situ did not show “any evidence of melanoma” on dermoscopic appearance. The PCPs also classified this “missed” lesion as low likelihood, but still suspicious enough to either biopsy or send for consultation. Another 8 lesions that turned out to be melanoma were flagged as having evidence of melanoma by the AI program, placing them into a higher risk category compared to the PCPs’ assessment. Researchers took all this information and calculated an AUROC value.
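To see what these counts imply, the arithmetic below is an illustrative back-of-the-envelope calculation based only on the figures reported above; it is not the study's own analysis:

```python
# Illustrative arithmetic from the figures reported in the text (not the study's analysis).
invasive = 11
in_situ = 10
melanomas = invasive + in_situ     # 21 biopsy-confirmed melanomas in total
missed_by_ai = 1                   # one melanoma in situ called "no evidence of melanoma"
flagged_by_ai = melanomas - missed_by_ai

# Sensitivity among biopsy-confirmed melanomas: flagged / total
sensitivity = flagged_by_ai / melanomas
print(f"AI sensitivity among biopsied melanomas: {sensitivity:.1%}")  # 95.2%
```

Note this only covers the biopsied lesions; melanomas among the 119 lesions judged benign and not biopsied would be invisible to this calculation.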

AUROC is an acronym for Area Under the Receiver Operating Characteristic Curve, a method to evaluate how well a given test is able to correctly rank samples into a binary category.

  • An AUROC value is calculated for a given population by plotting the test’s sensitivity (the true positive rate: how often are true cases detected?) against 1 minus its specificity (the false positive rate: how often are false alarms raised?) across all possible thresholds, then measuring the area under that curve.
  • In general, an AUROC value of 0.5 means the test is no better than chance at discriminating between two outcomes, while a value of 1 means the test always discriminates perfectly between the two outcomes in a given population.
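The bullets above have a useful pairwise interpretation: the AUROC equals the probability that a randomly chosen positive case is scored higher than a randomly chosen negative case. A minimal sketch in Python, using made-up labels and scores (not data from the study):

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison: the probability that a randomly
    chosen positive scores higher than a randomly chosen negative
    (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical example: 1 = melanoma, 0 = benign
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1]
print(auroc(labels, scores))  # 11 of 12 positive/negative pairs ranked correctly ~ 0.92
```

A perfect ranking (every melanoma scored above every benign lesion) would give 1.0; scoring everything identically gives 0.5, no better than a coin flip.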

For the AI program, the categories for AUROC were “likely to be melanoma based on appearance” or “not likely to be melanoma [based on appearance].” The AI had an astounding AUROC value of 0.96 (95% CI 0.93-0.98) for differentiating melanomas from other skin lesions. It was even better for detecting invasive melanomas.

Despite the impressive AUROC value, this study reveals a potential problem with algorithm-based diagnosis. The researchers set the sensitivity of the AI program fairly high, and, almost by definition, the specificity dropped as a result. (In population terms, this translates to fewer false negatives but more false positives.) That is, an additional 72 biopsies would have been performed in order to find an additional 8 melanomas.
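This sensitivity/specificity trade-off is a property of the chosen threshold, not of the model itself. The sketch below uses hypothetical risk scores (not the study's data) to show how lowering the decision threshold raises sensitivity while dragging specificity down:

```python
def sens_spec(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(y == 1 and s >= threshold for y, s in zip(labels, scores))
    fn = sum(y == 1 and s < threshold for y, s in zip(labels, scores))
    tn = sum(y == 0 and s < threshold for y, s in zip(labels, scores))
    fp = sum(y == 0 and s >= threshold for y, s in zip(labels, scores))
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical risk scores: 1 = melanoma, 0 = benign
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.5, 0.4, 0.2, 0.15, 0.1]

for t in (0.65, 0.25):
    sens, spec = sens_spec(labels, scores, t)
    print(f"threshold {t}: sensitivity {sens:.0%}, specificity {spec:.0%}")
```

With the strict threshold (0.65), half the melanomas are missed but few benign lesions are flagged; with the lenient threshold (0.25), every melanoma is caught at the cost of flagging half the benign lesions, which is the "more biopsies per melanoma found" trade described above.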

Recall that about half of the melanomas identified were in situ, which have a rate of progression to malignant melanoma that ranges from 5 to 50 percent. The additional biopsies prompted by the AI program’s false positives may well be worth it. But it’s misleading to say that the AI program was significantly more accurate at diagnosing melanoma than the PCPs were. More precisely, the AI was simply programmed to be more cautious.

 

For more information, see the topic Melanoma in DynaMedex.

Reference: Br J Dermatol. 2024 Jun 20;191(1):125-133