What 400 Flawed Healthcare AI Models Can Teach Us
The hundreds of flaws in AI models built to help tackle COVID-19 could be viewed merely as a consequence of fast-moving efforts to stop a crisis. Yet the academics calling out these flaws want their alarm to register. Their warnings are precisely what more business leaders and policymakers need to hear as the U.S. increasingly adopts AI for medical and commercial use.
Casey Ross recently reported in STAT how the pandemic kicked off a flurry of model building. Everyone wanted to make a positive contribution and help alleviate concerns emanating from the crisis. They asked: How can we use machine learning to detect COVID-19? How can we predict who is likely to be severely ill? And can we build models that will be robust through new variants of the virus? They reported their efforts to build on the work of others and learn from the AI community.
A year later, researchers at the University of Cambridge examined these models and found that every one of the more than 400 they studied was fatally flawed, including those published in leading scientific journals.
What’s a fatal flaw?
Researchers found two general types of flaws. The first had to do with data. Too often, model makers used small data sets that didn’t represent the universe of patients that the models were intended to serve.
The second type of flaw had to do with limited information disclosure. Modelers failed to disclose sources of data, techniques they used to model data and potential for bias in either the input data or the algorithms used to train their models.
Ross notes that the practice of not disclosing sources of data isn’t limited to just these COVID-19 models. Forty-five percent of medical AI products approved by the U.S. Food and Drug Administration between 2012 and 2020 did not disclose the amount of data they used to validate their product’s accuracy.
Why flaws matter
Teams that build AI have such good tools at their fingertips in 2021 that many can access pre-coded algorithms and start training models on data right away. That’s remarkable progress.
But good, robust models that are rigorous and defensible are still difficult to build and take time. If the input data isn’t good, then a model’s output won’t be sound either. Beyond that, the human errors noted by Cambridge researchers, such as using the same data for training and validation, are simply indefensible.
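To make the training-versus-validation point concrete, here is a minimal sketch of a hold-out split in plain Python. It is an illustration only, not the method used by any of the teams discussed: the function name, the 20 percent validation fraction and the patient identifiers are all assumptions made for the example. Splitting on patient identifiers rather than on individual records is also an assumption, chosen so that the same patient cannot land in both sets.

```python
import random


def split_train_validation(records, validation_fraction=0.2, seed=0):
    """Shuffle records and split them into disjoint training and validation lists."""
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible and auditable.
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical patient identifiers, invented for this example.
patient_ids = list(range(100))
train_ids, validation_ids = split_train_validation(patient_ids)

# The check that matters: no patient appears in both sets.
assert set(train_ids).isdisjoint(validation_ids)
```

A model scored only on the validation identifiers is being tested on patients it never saw during training, which is the minimum a reader should expect before trusting a reported accuracy figure.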
I see several reasons why the proliferation of these types of AI models is worrisome. Inaccurate, untraceable models can quickly lead to poor patient care and poor health and cost outcomes. The U.S. health system, or any health system, simply cannot afford a broad erosion of public trust in using AI technologies for patient care.
In an operational sense, flawed data science may lead to expensive mistakes, such as unwarranted clinical trials that could easily waste three to five years of research time. We may end up losing effectiveness and efficiency, the very things that these models are supposed to improve.
Shift your mindset on where to apply safety checks
Already, the genuinely good data science is barely distinguishable from the noise. So, what’s the solution?
Consider this: We have protections for consumers from flaws in other services. Before you eat at a restaurant, for example, you want to know if someone inspected it and deemed it to be sanitary. Before you ride an elevator, you assume that it has passed inspectors’ safety checks. When you buy a condo, you expect that the builder followed building safety codes.
The only reason we’re comfortable in any of these scenarios is that common standards for safety have helped build public trust over time. Today’s AI models can follow best practices, but they aren’t subject to any common set of standards, although there are some good working proposals to change this.
Expect the regulatory model to look different soon
In the U.S., the regulatory model for determining if physical medical products are up to rigorous quality standards doesn’t wholly transfer to medical AI. Unlike a drug or a standard medical device, AI systems constantly change as they’re fed new data.
It’s simply not scalable for an outsider to check how data may change a model or whether the algorithms that produce decisions or predictions consistently yield expected results.
Former FDA chief Scott Gottlieb recognized this in 2017 when the agency began plotting a future regulatory model. The FDA continues to study what it will take to pre-certify manufacturers and software providers based on a culture of quality and commitments to monitoring the real-world performance of AI in the market.
The philosophy of certifying the company, and not just each app, is like the restaurant model. Once you certify the restaurant, you don’t have to check each dish. This trusted, yet scalable model can minimize risk and drive a corporate culture toward responsible AI.
Beyond basic disclosure
In addition to transparency on data sources and modeling methods, consider the following:
Potential bias: Share what you’ve done to avoid creating or reinforcing bias. Describe the controls you’ve put in place for bias in input data as well as algorithm design. Communicate how AI system users can flag issues related to bias.
Reviews before a model goes into the wild: Was your AI system subject to a review board before you released it into the wild? For medical AI products, document the requirement for regulatory approvals and the state of those approvals. Modelers in larger organizations may have an AI center of excellence to help them meet their organization’s highest standards.
Mechanisms that test quality over time: The validity and usefulness of any model can change over time and as people and data interact. Share what checks you’ve put in place to ensure that your model remains valid over time and across populations.
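As one illustration of a mechanism that tests quality over time, the sketch below compares a model’s accuracy in each reporting period against the accuracy measured at release and flags any period that falls too far below it. Everything here is an assumption made for the example: the function names, the 0.05 tolerance, the baseline of 0.9 and the two small batches of predictions and labels are invented, not drawn from any real system. The same check could be run per patient population instead of per month.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the observed labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def flag_degraded_periods(batches, baseline_accuracy, tolerance=0.05):
    """Return the names of periods whose accuracy fell more than
    `tolerance` below the accuracy measured at release."""
    flagged = []
    for name, predictions, labels in batches:
        if accuracy(predictions, labels) < baseline_accuracy - tolerance:
            flagged.append(name)
    return flagged


# Invented data: (period, model predictions, labels observed later).
monthly_batches = [
    ("month-1", [1, 0, 1, 1, 0, 1, 0, 0, 1, 1], [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]),
    ("month-2", [1, 1, 1, 0, 0, 1, 0, 1, 1, 1], [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]),
]

degraded = flag_degraded_periods(monthly_batches, baseline_accuracy=0.9)
print(degraded)
```

In this made-up data the first month matches the baseline and the second does not, so only the second is flagged. A production check would need larger samples and a metric suited to the clinical question, but even a simple rule like this gives users something specific to disclose and audit.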
It’s always good to be skeptical and question how a system derives its decisions and predictions. Yet, without these practices for transparency, we risk putting AI into the world that can quickly lead to disillusionment.