Matt Jones, lead data science strategist at Tessella, says old data and models no longer reflect the world in front of us, and explains how to build new ones, based on Tessella’s white paper COVID-19: Effective Use of Data and Modelling to Deliver Rapid Responses.
Models are critical for our response to COVID-19, underpinning diagnostic tools, track and trace apps, understanding of lifesaving interventions, hospital capacity planning, and so on.
Such models need to be developed quickly, using very recent data. Some need to incorporate the biological functions of the virus and the body’s response; others need data on how people behave during lockdown. Such data is being collected, but it relies on new and sometimes ad hoc capture mechanisms.
For example, an AI to diagnose bacterial pneumonia has years of well-documented lung scans to be trained on. A similar AI to diagnose COVID-19 infected lungs has just a few months’ data, which may have been captured and filed in a hurry by people struggling to get a grip on what they were looking for. There is less of it, and it is less well labelled.
There is a need for speed. But there is also a real risk we rush ahead, try to make do with old models or inadequate data, and end up having to start again. This takes longer than getting it right in the first place, as the UK’s track and trace programme has found.
Spending time getting the data right up front, and building new models from scratch, is more likely to deliver reliable results at speed. Here are four areas to consider.
1. Curating the data
Data going into models needs to be proven and of good quality.
Models of disease spread, hospital capacity, or disease manifestations cannot just use last year’s data. They need data on the new reality, which is rapidly evolving and poorly understood.
Even in normal times, AI can fail because of poor data curation. For example, many diagnostic image recognition tools have taught themselves to spot a label in the data that was not removed (e.g. a circle drawn around an infected area), rather than the disease indicator itself.
Data must come from a trusted source and it needs subject matter experts to check it for errors, bias and confounding elements, before it is made available to modellers. There must also be a consistent taxonomy for naming data, and metadata should be used to add context.
Once captured, data must be stored in IT systems that allow modellers to access it easily.
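As a minimal sketch of what this curation step can look like in practice, the Python below validates incoming records against an agreed naming convention, flags entries a subject matter expert should review, and attaches provenance metadata. The field names, allowed labels and rules are hypothetical, chosen purely for illustration.

```python
import pandas as pd

# Hypothetical agreed taxonomy: required fields and allowed diagnosis labels.
REQUIRED_FIELDS = {"patient_id", "scan_date", "diagnosis_label"}
ALLOWED_LABELS = {"covid_positive", "covid_negative", "indeterminate"}

def curate(raw: pd.DataFrame, source: str) -> pd.DataFrame:
    """Validate raw records and attach provenance metadata before release."""
    df = raw.rename(columns=str.lower)  # enforce a consistent naming convention

    missing = REQUIRED_FIELDS - set(df.columns)
    if missing:
        raise ValueError(f"Missing required fields: {missing}")

    # Coerce dates; unparseable entries become NaT rather than bad values.
    df["scan_date"] = pd.to_datetime(df["scan_date"], errors="coerce")

    # Flag, rather than silently drop, records an expert should check.
    df["needs_review"] = (
        df["scan_date"].isna() | ~df["diagnosis_label"].isin(ALLOWED_LABELS)
    )

    # Metadata adds context: where the data came from and when it was curated.
    df["source"] = source
    df["curated_at"] = pd.Timestamp.now(tz="UTC")
    return df
```

The key design choice is that questionable records are flagged for human review rather than discarded, keeping experts in the loop while the data is still being understood.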
2. Choosing the right models
Some models may be repurposed, but many need to be built from the ground up.
There is no single rule. A model to analyse whether different medical interventions were effective in reducing death rates will look different from a model which diagnoses infection from scans, which in turn will look different from a model which verifies antibody tests.
Understand the type of problem. Is it classification or regression, supervised or unsupervised, predictive, statistical, physics-based, etc?
Screen data to understand what is possible. Perform rapid and agile early explorations using simple techniques to spot the correlations that will guide your plan. From this analysis, identify candidate modelling techniques (e.g. empirical, physical, stochastic, hybrid) before narrowing down to the most suitable model for the problem.
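A rapid early screen does not need to be sophisticated. Something like the sketch below, which simply ranks numeric features by their correlation with an outcome of interest, is often enough to guide the choice of candidate techniques; the dataframe and column names here are placeholders, not a real dataset.

```python
import pandas as pd

def screen_features(df: pd.DataFrame, target: str) -> pd.Series:
    """Rank numeric features by absolute correlation with the target.

    A crude first pass: strong correlations suggest empirical or
    statistical models may suffice; weak ones hint that more structure
    (physical or hybrid models) will be needed.
    """
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)

# e.g. screen_features(admissions_df, target="days_in_icu")
```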
‘Most powerful’ is not the same as ‘most suitable’. Techniques such as machine learning need lots of well-understood data and so are ill-suited to most COVID-19 challenges at this stage. Approaches such as Bayesian uncertainty quantification may be better where only limited trusted data is available.
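To see why Bayesian approaches cope better with scarce data, consider the sketch below: a conjugate Beta-Binomial update estimating, say, the sensitivity of an antibody test from a few dozen lab results. It returns a credible interval rather than a bare point estimate, making the uncertainty explicit. The numbers are invented for illustration.

```python
from scipy import stats

# Invented example: 47 of 60 known-positive samples returned a positive result.
detected, known_positives = 47, 60

# Beta(1, 1) is a flat prior; Beta-Binomial conjugacy gives the posterior
# in closed form, with no large training set required.
prior_a, prior_b = 1, 1
posterior = stats.beta(prior_a + detected,
                       prior_b + (known_positives - detected))

lo, hi = posterior.ppf([0.025, 0.975])  # 95% credible interval
print(f"Estimated sensitivity: {posterior.mean():.2f} "
      f"(95% credible interval {lo:.2f} to {hi:.2f})")
```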
3. Ensuring your answers are trusted
The best model in the world is useless if its users don’t trust it.
Trust requires more than just a working model. Over-complicated or frustrating user interfaces, privacy and ethical concerns, or models which break after a few months, all undermine trust and slow uptake.
So does a lack of explainability. If people can see from their app that they spent an hour talking to an infected person, they are likely to take the result seriously and isolate. If they get an alert with no context, they may decide the model is being oversensitive and dismiss it.
All these must be considered when designing models, especially those used by non-experts, to ensure they are trusted and usable.
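One practical way to build in that explainability is to have the alert carry its own context. The sketch below shows a hypothetical alert payload whose fields exist so the message can explain itself; the structure and wording are illustrative only, not taken from any real app.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ExposureAlert:
    """A hypothetical alert payload that carries its own explanation."""
    risk_level: str
    exposure_date: date
    contact_minutes: int

    def message(self) -> str:
        # Context a user can act on, rather than a bare "isolate now" alert.
        return (
            f"{self.risk_level.title()} risk: on {self.exposure_date:%d %b} you "
            f"spent {self.contact_minutes} minutes near someone who later "
            "tested positive. Please self-isolate and book a test."
        )

print(ExposureAlert("high", date(2020, 6, 12), 62).message())
```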
4. Deploying models at scale
Models must work in the real world. Usually that involves engineering the final model into a piece of software and integrating it into a mobile or web app, or a piece of technology such as a diagnostics machine.
It may mean wrapping models in software (‘containers’) which translates incoming and outgoing data into a common format, allowing the model to slot into an IT ecosystem. It will require allocating the power and compute capacity the application demands. It means planning for ongoing maintenance, support, and retraining.
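As a sketch of what that wrapping can look like, the snippet below exposes a hypothetical trained model behind a small web service that accepts and returns JSON, the sort of common format that lets it slot into a wider IT ecosystem. The model file, field names and prediction call are assumptions, standing in for whatever the real system uses.

```python
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("covid_triage_model.joblib")  # hypothetical trained model

class PatientInputs(BaseModel):
    # Placeholder fields; a real service would mirror the model's features.
    age: int
    oxygen_saturation: float
    days_symptomatic: int

@app.post("/predict")
def predict(inputs: PatientInputs) -> dict:
    """Translate incoming JSON into model features and return the insight."""
    features = [[inputs.age, inputs.oxygen_saturation, inputs.days_symptomatic]]
    risk = float(model.predict_proba(features)[0][1])
    return {"risk_of_deterioration": round(risk, 3)}
```

Packaging a service like this in a container then makes the compute and scaling decisions explicit, and gives the maintenance and retraining plan a well-defined unit to manage.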
If all goes well, the user is presented with a clear interface. They enter the relevant inputs – symptoms, scans, hospital bed capacity, intubation timing, etc. The model runs and presents the resulting insight in an easy-to-understand way that the user is comfortable acting upon.
Bringing it all together for rapid results
Time can be saved by staying laser-focussed on your end objective, which reduces the effort needed to find and manage relevant data.
But beyond that, rigour is needed. Speed isn’t about cutting corners; it is about doing things right first time. That means putting the right people on the right task: data experts to handle data, modellers to build models, software engineers to deploy them. Critically, it means giving COVID-19 experts – whether doctors, researchers or public health professionals – the tools and the time to focus on the best responses to the pandemic.