Improving decisions on what to focus on in research using large datasets

Theme: Translational data science

Workstream: Large, complex datasets

Research using de-personalised data from electronic health records is increasingly common. 

Electronic health records cover many individuals and hold extremely detailed information about each one. This gives researchers a great deal of information to work with. 

But there are drawbacks too. With so much information, some factors might appear to be causing a disease when they aren't, because a third factor influences both the apparent cause and the disease. Such a third factor is known as a confounding variable. 

To work out whether a factor is a confounder, statisticians use models of the causal relationships between variables to guide their thinking. However, these approaches tend not to cope well with very large amounts of data. 
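
To make this concrete, here is a minimal simulation, a sketch in Python with invented variable names and effect sizes rather than anything from the project, in which a confounder creates the appearance of an effect that is not really there, and adjusting for the confounder makes it disappear:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# A confounder (e.g. age) influences both a suspected cause and the disease.
confounder = rng.normal(size=n)
suspected_cause = 0.8 * confounder + rng.normal(size=n)
disease_risk = 1.2 * confounder + rng.normal(size=n)  # no effect of the suspected cause

# Looked at naively, the suspected cause and the disease appear clearly associated...
print(np.corrcoef(suspected_cause, disease_risk)[0, 1])   # roughly 0.48

# ...but the association vanishes once we adjust for the confounder
# (here by removing the part of each variable explained by the confounder).
cause_resid = suspected_cause - np.polyval(
    np.polyfit(confounder, suspected_cause, 1), confounder)
disease_resid = disease_risk - np.polyval(
    np.polyfit(confounder, disease_risk, 1), confounder)
print(np.corrcoef(cause_resid, disease_resid)[0, 1])      # close to zero
```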

In large datasets, values can also be recorded in the wrong category or be missing entirely. 

This project will develop an automated, probability-based framework to identify which factors to include in an analysis. 
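
The summary does not describe how the framework will work, so the sketch below is only a loose illustration of one probability-based way of choosing factors, not the project's actual method: it greedily adds whichever candidate factor most improves a simple model's fit, judged by the Bayesian Information Criterion (BIC). All function and variable names here are invented.

```python
import numpy as np

def bic(X, y):
    """BIC of an ordinary least squares fit: lower means a better-justified model."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return k * np.log(n) - 2 * log_lik

def select_factors(candidates, y):
    """Greedy forward selection: keep adding the factor that most lowers BIC."""
    cols, chosen = [np.ones(len(y))], []      # start from an intercept-only model
    remaining = dict(candidates)
    best = bic(np.column_stack(cols), y)
    while remaining:
        scores = {name: bic(np.column_stack(cols + [col]), y)
                  for name, col in remaining.items()}
        name = min(scores, key=scores.get)
        if scores[name] >= best:              # no candidate improves the model
            break
        best = scores[name]
        cols.append(remaining.pop(name))
        chosen.append(name)
    return chosen
```

Run on data like the simulation above, with the confounder and a stream of unrelated noise offered as candidates, select_factors picks out the confounder and leaves the noise behind.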

The researchers will compare different approaches and examine how they perform as the number of data points increases. They will also investigate how each method behaves when some data are missing. 
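
The summary does not name the approaches to be compared, so, purely as a sketch with invented numbers, a comparison of this shape might look like the following: simulate data in which the true effect is zero, vary the sample size, delete a fraction of records at random to mimic missing data, and record what a crude (unadjusted) analysis and a confounder-adjusted analysis each estimate.

```python
import numpy as np

def simulate(n, rng, missing_frac=0.0):
    """Confounded data in which the exposure truly has NO effect on the outcome."""
    confounder = rng.normal(size=n)
    exposure = 0.8 * confounder + rng.normal(size=n)
    outcome = 1.2 * confounder + rng.normal(size=n)
    if missing_frac:                          # drop records completely at random
        keep = rng.random(n) > missing_frac
        confounder, exposure, outcome = confounder[keep], exposure[keep], outcome[keep]
    return confounder, exposure, outcome

def exposure_effect(exposure, outcome, adjust_for=()):
    """OLS estimate of the exposure's effect, optionally adjusting for other factors."""
    X = np.column_stack([np.ones(len(outcome)), exposure, *adjust_for])
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return beta[1]

rng = np.random.default_rng(1)
for n in (1_000, 10_000, 100_000):
    for frac in (0.0, 0.3):
        c, x, y = simulate(n, rng, missing_frac=frac)
        crude = exposure_effect(x, y)                      # ignores the confounder
        adjusted = exposure_effect(x, y, adjust_for=(c,))  # adjusts for it
        print(f"n={n:>7,} missing={frac:.0%}: "
              f"crude={crude:+.3f} adjusted={adjusted:+.3f}")
```

In this particular set-up the crude estimate stays biased no matter how large the sample grows, while the adjusted estimate stays near the true value of zero even after records are dropped at random; other patterns of missingness would behave differently.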

This will help manage the problem of confounding variables when using large datasets. 

This PhD project is being undertaken by Emma Tarmey as lead researcher, with Kate Tilling, Jonathan Sterne, Paul Madley-Dowd and Rhian Daniel (Cardiff University) providing supervision.