The guidelines, published in JAMA Dermatology on Dec. 1, 2021, contain a broad range of recommendations stakeholders should consider when developing and assessing image-based AI algorithms in dermatology. The recommendations are divided into categories of data, technique, technical assessment, and application. ISIC is “an academia and industry partnership designed to facilitate the application of digital skin imaging to help reduce melanoma mortality,” and is organized into different working groups, including the AI working group, according to its website.
“Our goal with these guidelines was to create higher-quality reporting of dataset and algorithm characteristics for dermatology AI,” first author Roxana Daneshjou, MD, PhD, a clinical scholar in the department of dermatology at Stanford (Calif.) University, said in an interview. “We hope these guidelines also aid regulatory bodies around the world when they are assessing algorithms to be used in dermatology.”
Recommendations for data
The authors recommended that datasets used by AI algorithms have image descriptions and details on image artifacts. “For photography, these include the type of camera used; whether images were taken under standardized or varying conditions; whether they were taken by professional photographers, laymen, or health care professionals; and image quality,” they wrote. They also recommended that developers include in an image description the type of lighting used and whether the photo contains pen markings, hair, tattoos, injuries, surgical effects, or other “physical perturbations.”
Exchangeable image file format (EXIF) data obtained from the camera, preprocessing procedures such as color normalization, and postprocessing of images, such as filtering, should also be disclosed. In addition, developers should disclose and justify the inclusion of any images within a dataset that were created by an algorithm. Any public images used in the datasets should have references, and privately used images should be made public where possible, the authors said.
The ISIC working group guidelines also provided recommendations for patient-level metadata. Each image should include the patient’s geographic location and the medical center they visited, as well as their age, sex and gender, ethnicity and/or race, and skin tone. Dr. Daneshjou said this was one area where she and her colleagues found a lack of transparency in the datasets behind dermatology AI algorithms in a recent review. “We found that many AI papers provided sparse details about the images used to train and test their algorithms,” Dr. Daneshjou explained. “For example, only 7 out of 70 papers had any information about the skin tones in the images used for developing and/or testing AI algorithms. Understanding the diversity of images used to train and test algorithms is important because algorithms that are developed on images of predominantly white skin likely won’t work as well on Black and brown skin.”
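As a concrete illustration of the kind of record the checklist asks for, the sketch below combines the image-level and patient-level fields described above into a single metadata structure. The field names and types are illustrative assumptions, not a schema prescribed by the ISIC working group.

```python
# Illustrative sketch only: field names and types are assumptions, not a schema
# prescribed by the ISIC working group. The intent is to show the kind of
# image-level and patient-level metadata the checklist asks developers to report.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ImageRecord:
    # Image-level details
    image_id: str
    camera_model: Optional[str]        # from EXIF data, if available
    standardized_capture: bool         # standardized vs. varying conditions
    photographer_type: str             # e.g., "professional", "layperson", "clinician"
    lighting: Optional[str]
    artifacts: List[str]               # e.g., ["pen markings", "hair", "tattoo"]
    preprocessing: List[str]           # e.g., ["color normalization"]
    postprocessing: List[str]          # e.g., ["filtering"]
    synthetic: bool                    # True if the image was created by an algorithm

    # Patient-level metadata
    geographic_location: Optional[str]
    medical_center: Optional[str]
    age: Optional[int]
    sex_and_gender: Optional[str]
    race_ethnicity: Optional[str]
    skin_tone: Optional[str]           # however skin tone is recorded, if at all
```

Marking fields as optional makes explicit which metadata are missing for a given image, which ties into the next recommendation on acknowledging incomplete metadata.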
The guideline authors also asked algorithm developers to describe the limitations imposed when patient-level metadata is incomplete or unavailable. In addition, “we ask that algorithm developers comment on potential biases of their algorithms,” Dr. Daneshjou said. “For example, an algorithm based only on telemedicine images may not capture the full range of diseases seen within an in-person clinic.”
When describing their AI algorithm, developers should detail their reasoning for the dataset size and partitions, inclusion and exclusion criteria for images, and use of any external samples for test sets. “Authors should consider any differences between the image characteristics used for algorithm development and those that might be encountered in the real world,” the guidelines stated.
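One practical pitfall when partitioning a dataset is leakage: images of the same patient ending up in both the development and test sets. The guidelines do not prescribe a particular splitting method; the sketch below simply shows one way to enforce a patient-level split with scikit-learn, using illustrative variable names.

```python
# Minimal sketch of a patient-level dataset partition, so that images from the
# same patient never appear in both the development and test sets. Variable
# names are illustrative assumptions.
from sklearn.model_selection import GroupShuffleSplit

def split_by_patient(image_ids, patient_ids, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) with no patient crossing the split."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_fraction, random_state=seed)
    train_idx, test_idx = next(splitter.split(image_ids, groups=patient_ids))
    return train_idx, test_idx
```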
Recommendations for technique
How the images in a dataset are labeled is a unique challenge in developing AI algorithms for dermatology, the authors noted. Developers should use histopathological diagnosis in their labeling where possible; labels based on dermatologist assessment alone can introduce label noise.
“Many of the AI algorithms in dermatology use supervised learning, which requires labeled examples to help the algorithm ‘learn’ features for discriminating between lesions. We found that some papers use consensus labeling – dermatologists providing a label – to label skin cancers; however, the standard for diagnosing skin cancer is using histopathology from a biopsy,” she said. “Dermatologists can biopsy seven to eight suspected melanomas before discovering a true melanoma, so dermatologist labeling of skin cancers is prone to label noise.”
ISIC’s guidelines stated a gold standard of labeling for dermatologic images is one area that still needs future research, but currently, “diagnoses, labels and diagnostic groups used in data repositories as well as public ontologies” such as ICD-11, AnatomyMapper, and SNOMED-CT should be included in dermatologic image datasets.
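One way to keep label provenance auditable is to record, for each image, how its diagnosis label was derived and which ontology concepts it maps to. The structure below is a hypothetical sketch; the field names are assumptions, and the ontology codes are left unfilled rather than invented here.

```python
# Hypothetical sketch: a label record that keeps provenance (histopathology vs.
# dermatologist consensus) alongside ontology mappings. Field names are
# assumptions; no real ICD-11 or SNOMED CT codes are shown.
from dataclasses import dataclass
from typing import Optional

@dataclass
class LesionLabel:
    diagnosis: str                        # diagnosis label used in the repository
    label_source: str                     # "histopathology" or "dermatologist_consensus"
    num_raters: Optional[int] = None      # relevant for consensus labels
    icd11_code: Optional[str] = None      # ICD-11 mapping, if available
    snomed_ct_code: Optional[str] = None  # SNOMED CT concept, if available

# Consensus-only labels can then be flagged as a potential source of label noise:
label = LesionLabel(diagnosis="melanoma", label_source="dermatologist_consensus", num_raters=3)
needs_histopathology = label.label_source != "histopathology"
```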
AI developers should also provide a detailed description of their algorithm, which includes methods, workflows, and mathematical formulas, as well as the generalizability of the algorithm across more than one dataset.
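As a rough illustration of reporting generalizability, performance can be computed separately on each available dataset rather than on a single pooled test set. The sketch below assumes a scikit-learn-style model with a predict_proba method and is not drawn from the guidelines themselves.

```python
# Illustrative sketch: evaluate one trained model on several datasets and report
# a per-dataset metric. `model` and the dataset contents are assumptions.
from sklearn.metrics import roc_auc_score

def external_validation(model, datasets):
    """datasets: mapping of dataset name -> (features, binary_labels)."""
    results = {}
    for name, (features, labels) in datasets.items():
        scores = model.predict_proba(features)[:, 1]  # assumed probability of malignancy
        results[name] = roc_auc_score(labels, scores)
    return results
```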
Recommendations for technical assessment
“Another important recommendation is that algorithm developers should provide a way for algorithms to be publicly evaluable by researchers,” Dr. Daneshjou said. “Many dermatology AI algorithms do not share either their data or their algorithm. Algorithm sharing is important for assessing reproducibility and robustness.”
Google’s recently announced AI-powered dermatology assistant tool, for example, “has made claims about its accuracy and ability to diagnose skin disease at a dermatologist level, but there is no way for researchers to independently test these claims,” she said. Other options like Model Dermatology, developed by Seung Seog Han, MD, PhD, of the Dermatology Clinic in Seoul, South Korea, and colleagues, offer an application programming interface “that allows researchers to test the algorithm,” Dr. Daneshjou said. “This kind of openness is key for assessing algorithm robustness.”
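For outside researchers, independent testing through such an interface might look like the sketch below. The endpoint URL, request format, and response field are invented placeholders; they do not describe Model Dermatology's actual API or any other vendor's interface.

```python
# Hypothetical sketch of independent testing through an HTTP API. The endpoint,
# request format, and response field are assumptions, not any vendor's real API.
import requests
from sklearn.metrics import accuracy_score

API_URL = "https://example.org/derm-ai/classify"  # placeholder endpoint

def classify_image(path):
    with open(path, "rb") as f:
        response = requests.post(API_URL, files={"image": f}, timeout=30)
    response.raise_for_status()
    return response.json()["predicted_label"]     # assumed response field

def independent_accuracy(image_paths, true_labels):
    predictions = [classify_image(p) for p in image_paths]
    return accuracy_score(true_labels, predictions)
```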
Developers should also note in their algorithm explanations how performance markers and benchmarks would translate to the proposed clinical application. “In this context,” the use case (the setting in which the AI application is being used) “should be clearly described – who are the intended users and under what clinical scenario are they using the algorithm,” the authors wrote.
Recommendations for application
The guidelines note that the use case for the model should also be described by the AI developers. “Our checklist includes delineating use cases for algorithms and describing what use cases may be within the scope of the algorithm versus which use cases are out of scope,” Dr. Daneshjou said. “For example, an algorithm developed to provide decision support to dermatologists, with a human in the loop, may not be accurate enough to release directly to consumers.”
As the goal of AI algorithms in dermatology is eventual implementation for clinicians and patients, the authors asked developers to consider shortcomings and potential harms of the algorithm during implementation. “Ethical considerations and impact on vulnerable populations should also be considered and discussed,” they wrote. An algorithm “suggesting aesthetic medical treatments may have negative effects given the biased nature of beauty standards,” and “an algorithm that diagnoses basal cell carcinomas but lacks any pigmented basal cell carcinomas, which are more often seen in skin of color, will not perform equitably across populations.”
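Whether an algorithm performs equitably can be probed by stratifying test metrics by skin tone, as in this hypothetical sketch. It assumes the test set carries the skin-tone metadata discussed earlier; the record keys are illustrative.

```python
# Hypothetical sketch: compute sensitivity separately for each skin-tone group
# in the test set. The record keys are illustrative assumptions.
from collections import defaultdict

def sensitivity_by_skin_tone(records):
    """records: iterable of dicts with 'skin_tone', 'true_label', 'predicted_label'."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0})
    for r in records:
        if r["true_label"] == "malignant":
            group = counts[r["skin_tone"]]
            if r["predicted_label"] == "malignant":
                group["tp"] += 1
            else:
                group["fn"] += 1
    return {tone: c["tp"] / (c["tp"] + c["fn"])
            for tone, c in counts.items() if c["tp"] + c["fn"] > 0}
```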
Prior to implementing an AI algorithm, the ISIC working group recommended developers perform prospective clinical trials for validation. Checklists and guidelines like SPIRIT-AI and CONSORT-AI “provide guidance on how to design clinical trials to test AI algorithms,” Dr. Daneshjou said.
After implementation, “I believe we need additional research into how we monitor algorithms after they are deployed clinically,” Dr. Daneshjou said. “Currently there are no [Food and Drug Administration]–approved AI algorithms in dermatology; however, there are several applications that have CE mark in Europe, and there are no mechanisms for postmarket surveillance there.”
'Timely' recommendations
Commenting on the ISIC working group guidelines, Justin M. Ko, MD, MBA, director and chief of medical dermatology for Stanford Health Care, who was not involved with the work, said that the recommendations are timely and provide “a framework for a ‘common language’ around AI datasets specifically tailored to dermatology.” Dr. Ko, chair of the American Academy of Dermatology’s Ad Hoc Task Force on Augmented Intelligence, noted the work by Dr. Daneshjou and colleagues “is consistent with and builds further details” on the position statement released by the AAD AI task force in 2019.
“As machine-learning capabilities and commercial efforts continue to mature, it becomes increasingly important that we are able to ‘look under the hood,’ and evaluate all the critical factors that influence development of these capabilities,” he said in an interview. “A standard set of reporting guidelines not only allows for transparency in evaluating data and performance of models and algorithms, but also forces the consideration of issues of equity, fairness, mitigation of bias, and clinically meaningful outcomes.”
One concern is the impact of AI algorithms on societal or health systems, he noted, which is brought up in the guidelines. “The last thing we would want is the development of robust AI systems that exacerbate access challenges, or generate patient anxiety/worry, or drive low-value utilization, or adds to care team burden, or create a technological barrier to care, or increases inequity in dermatologic care,” he said.
In developing AI algorithms for dermatology, a “major practical issue” is how performance on paper will translate to real-world use, Dr. Ko explained. The ISIC guidelines, he said, “provide a critical step in empowering clinicians, practices, and our field to shape the advent of the AI and augmented intelligence tools and systems to promote and enhance meaningful clinical outcomes, and augment the core patient-clinician relationship and ensure they are grounded in principles of fairness, equity and transparency.”
This research was funded by awards and grants to individual authors from the Charina Fund, a Google Research Award, Melanoma Research Alliance, National Health and Medical Research Council, National Institutes of Health/National Cancer Institute, National Science Foundation, and the Department of Veterans Affairs. The authors disclosed relationships with governmental entities, pharmaceutical companies, technology startups, medical publishers, charitable trusts, consulting firms, dermatology training companies, providers of medical devices, manufacturers of dermatologic products, and other organizations related to the paper in the form of supplied equipment, having founded a company; receiving grants, patents, or personal fees; holding shares; and medical reporting. Dr. Ko reported that he serves as a clinical advisor for Skin Analytics, and has an ongoing research collaboration with Google.