What models can and cannot do

Guest blog post by David Byrne

Models have been widely deployed in scientific discussion of the likely course of the COVID-19 pandemic to explore the potential impact of different policy interventions. However, any model is a necessary simplification of the system it describes.

Covid-19 is a biological intervention in the complex social systems – the plural is very important – which include human social interactions and policy interventions within existing social relations and institutional structures. These systems have emergent properties. That is not to say that modelling is useless but its use is necessarily limited.

Existing models are basically modifications of traditional epidemiological models of infectious disease transmission with parameters changed to reflect different timings and degrees of social interaction in response to social distancing and lock down regulatory interventions. There is a real problem of scale. Have these models got the scale of the system right before any attempt at description, let alone prediction, is attempted?

Communicable disease public health doctors have consistently made the point that what we have is not one nationwide outbreak but a set of local outbreaks which is why isolation of cases, tracing of contacts and isolation of those contacts is such an important part of the public health armoury. Most modelling seemed to deal only with the national scale and looked at the impact of policies like lockdown at that level although I am aware that local modelling is being attempted. Few models have examined in detail the impact of case isolation, tracing and contact isolation despite this having been a successful strategy in South Korea and elsewhere. An exception is  Kretzschmar, et al (2020). Plainly this is a very important set of interventions to consider.

Although an alternative form of approach has no immediate predictive capacity it is absolutely necessary to develop it in order to learn from this experience, for similar outbreaks will happen again. That approach is case-based process tracing and systematic case-comparison to establish what has worked better. That needs setting up now and whilst data is essential for developing it, and modelling can play a retrospective role if done at the right scale and with full incorporation of structural elements, it is not the only or even the best way to establish wha has worked where.

How might we  learn from this first wave of COVID-19 in order to find out what approaches have worked or not and in what contexts they have worked or not? Note the emphasized plural. Interventions have been interventions in different local complex systems and have themselves been complex. At national or even sub-national scales (where sub-national governments as with provinces in Canada have had appropriate powers) they have combined public health regulatory regimes (again note the plural) – different regimes in different places – with different levels of curative intervention depending on resources and even perhaps (on some limited evidence) different curative approaches, particularly in relation to the diagnostic anticipation and intervening prevention of cytokine storms. On this see the interviews with Chinese Intensive Care Physicians here: https://www.newscientist.com/article/mg24632783-600-wuhans-covid-19-crisis-intensive-care-doctors-share-their-stories/.

The first thing we need to know is just what has been done in different places alongside descriptions of the spatial and temporal contexts in which those things were done. We need careful process tracing and that means that we need good recording of what things were done in reasonable detail. This is a norm of any complex engineering production process but health systems are weak at full case recording other than those insurance based systems which generate financial records for costing. There have been attempts to improve this in non-insurance based systems but at present they are not fully developed. That kind of recording might be useful at the level of the individual patient and might provide the basis of a new wave of learning algorithm-based data mining to guide intervention, but it does not take account of institutional interventions at higher levels. It will be useful, indeed essential, in establishing treatment protocols – the sheer uselessness and inappropriateness of Randomized Controlled Trials other than for vaccine testing in a pandemic is obvious. It will not guide overall health system management.

There are well established tools in evaluation which can deal with the issue of post hoc exploration of what has worked in different contexts this time to guide policy and practice for next time. These are inherently mixed method in that they require the construction of narratives of what has been done, a mix of descriptive quantitative and qualitative specification of the contexts in which things have been done, and the use of data generated from those account to establish the multiple forms of intervention which have worked to different degrees. Equifinality rules OK!  The same outcome – control over the impact of the disease – can be generated in different and multiple ways. We need comparative process tracing based exploration of the multiple and complex ways in which systems have generated different outcomes.

This is precisely the set of problems addressed by  CECAN – a multi research council and UK government department funded investigation into the problems of Evaluating Complex Interventions Across the NEXUS (food, environment, water and energy). A range of approaches have been developed for this purpose:  CECAN’s website provides a full listing, https://www.cecan.ac.uk/.

Developed outside CECAN but interacting with it has been the very interesting approach of Dynamic Pattern Synthesis devised by Phil Haynes (see Haynes 2019).

This combines exploratory cluster analyses with Qualitative Comparative Analysis to explore how policy and practice systems have come to the outcomes they have reached.

Fundamental to this way of finding “what works” is a combination of qualitative materials and quantitative data. QCA – which is one tool but a good one with an established literature of effective use – requires the interpretation of qualitative narrative accounts of process to yield quantitative descriptions of interventions alongside quantitative descriptions of context. For an example of how this can be done see: Blackman et al.(2013).  Note that the level of measurement is often simply binary or at best ordinal specification of the attributes of the systems and of the interventions made within them.

Demanding documentation during crises is a hard thing to do but the construction of narratives, preferably on an ongoing real time basis but if necessary by careful historical investigation is absolutely necessary. We must always be able to say what has been done and if we can’t then we won’t learn what needs to be done.


Blackman et al.(2013). “Using Qualitative Comparative Analysis to understand complex policy problems.” Evaluation 19(2):126-140.

Haynes, Phil (2019). Social Synthesis – Finding Dynamic Patterns in Complex Social Systems. Abingdon: Routledge.

Kretzschmar, Mirjam E, Ganna Rozhnova, and Michiel E van Boven (2020). “Isolation and contact tracing can tip the scale to containment of COVID-19 in populations with social distancing.”


Picture This!

Guest blog post by R. Allan Reese

From the start, Downing Streets’ daily COVID press conferences have included various graphs slightly amended each day.  In mid April on the Allstat list, I described the presentation and labelling of these graphs as “Boilerplate Excel” and was duly reprimanded for “slagging the people concerned off behind their backs” with “destructive criticism”.  That was not my intention, nor do I accept that criticising a presentation equates with being derogatory about the author. I stand by the assertion that an equivalent lack of attention to spelling, grammar or punctuation would not be condoned in a PR organisation.  The Downing Street presentations were not prepared by hard-pressed, front-line health staff, but by media-savvy folk around the PM.  I wrote to the press office but received no response.

The basis for my criticisms comes from an approach I call Graphical Interpretation of Data (GID), expounded for example in various articles in Significance, freely available online.  Number 10’s daily sequences of graphs and data are available at https://www.gov.uk/government/collections/slides-and-datasets-to-accompany-coronavirus-press-conferences.

It could be argued that changing a style of presentation risks accusations of “spin” if this detracts from the day-to-day comparison.  However, some changes were made mid-stream. Initially, the daily numbers of deaths were plotted on a log scale labelled obscurely “5, 100, 2, 5, 1000, 2, 5, 10k” (Figure 1). The Daily Telegraph published a redrawn version labelling the grid line below the 5 with “0”.  The label “5” actually meant 50 deaths to allow the trajectory for each country to be aligned from the day 50 deaths were reported.  This is all very confusing.

Fig 1: 30 March. Early line graph with enigmatic Y labels and poor linkage to key.

I wrote to the Telegraph about this, and their presentations improved, as did Number 10’s, with labels 50, 100, 200, 500, etc.  The Metro commented on 1 April that the log scale made the growth in number of deaths appear less steep.  They quoted David Spiegelhalter that each presentation has its “advantages and disadvantages” and “there is no ‘right’ way”. However, less mathematically-minded readers would surely see the choice and the changes as spin.

On 8 April these presentations switched to a linear scale with a scale labelled “2K, 4K, 6K …”, thus avoiding showing “real” numbers or a disturbing axis title “Thousands of deaths”.  I described the use of ‘K’ as “nerdy”, especially as K in IT means a power of 2, not 1000.  It is notable that the daily format of the press conferences was a speech by a minister who then handed over to a scientist or medic to describe the graphs, reinforcing the attitude that graphs are for “boffins” – they might be over your head, dear simple reader.

Within GID one often has to guess at the intention of the author: Was the choice of notation accidental or deliberate?  Whom was this graph designed to inform?  I think we have to assume the direct audience are journalists who then interpret the graphs and data for their readership.  On the other hand, some features are so clearly defaults in spreadsheet graph production (e.g. text written horizontally or vertically), that I stand by the assertion that these presentations were handed out without further consideration or editing.

Downing Street’s daily “Global comparison of deaths” compares countries using a line chart. Initially the lines were just colour-coded with a separate key. Then the country names were written at the end of each line. Because each country’s “Day 0” was a different date, all the lines were different lengths.  Because there were ten lines, some were difficult to identify, as some colours were very similar and there was no redundancy (variation in other line characteristics).  The intended message appeared to be that the UK was buried in the middle, on a similar trajectory to the rest of Europe, with the US far worse (nearly three times as many deaths), while China and South Korea had fared much better.

It’s pretty obvious that crude numbers of deaths is a poor comparator, and there is much confusion between numbers of deaths and death rates. BBC’s More or Less (22 April) discussed this and identified the problem that converting using deaths per million population flung San Marino and Andorra to the top.  But you have the same problem calculating rates for many statistics by London boroughs: Westminster may come out top because so few people live there but many people commute.  The GID approach is to draw a graph (of numbers or rates), consider what message you wish to put across, and to revise the graph to clarify and emphasise.

Cristl Donnelly, on More or Less, suggested a better comparison would be to look at excess deaths in each country. This could also be standardised for population size, but might also allow a division into excess deaths from COVID and excess collateral deaths due to non-availability of other health services.

Another graph showed the number of deaths reported daily.  In the first weeks this was for hospitals only, but from mid-April it showed Daily COVID-19 Deaths in All Settings.  Note this was not necessarily the day the person died. Once the “peak” was passed, it was stressed in most presentations that there was a strong weekend effect with greater delays in reporting and hence a jump up each Monday. As a result the bar graph looks quite chaotic. A 7-day rolling average line clarified the general trend, but no visual effect was used to indicate weekends and the dates were labelled at 3-day intervals. Surely a good presentation would demonstrate the periodicity? (Fig 2)

Fig 2: 30 April. The bars show large day-to-day fluctuations while the smoothing line gives a clear, and more comforting, pattern. Which are weekends?

The other graph I draw attention to is the daily “New UK Cases”, based on the number of positive (PCR for antigen) tests reported that day (Fig 3). Initially this was constrained by test availability. By the third week of April a large excess of laboratory capacity over sampling numbers was reported.  According to the rubric, “there are likely many more cases than currently recorded here”, predominantly because sampling was restricted to hospital patients and staff, then extended to wider NHS staff and care workers, but (at the time of writing) not to the wider population.

Fig 3: 19 April.  The numbers written on bars were subsequently dropped as a separate data file was available. Without knowledge of the number of negative tests, it’s hard to evaluate any trend.

Showing the number of positives against an increasing number of daily tests, but not showing the number of tests, disguises any trend in prevalence. It would help if the number of tests or the proportion positive were also reported; these might be split to show the proportions in groups showing symptoms (expected high) and those tested as contacts (hopefully, lower).  Such comparisons were further hindered by gerrymandering the number of tests in late April to claim to have reached the arbitrary target of 100,000 tests “on” 30 April.

Among the problems with this chart are: the dates are written vertically with no indication of weekends or other divisions that might aid interpretation; the actual numbers are written on the bars, again vertically and hard to compare; for two thirds of the period shown the number varied between 4K and 6K and the largely overdrawn grid gives no assistance for comparison; most of the bars are split into two sections, linked to an enigmatic key (Pillar 1 and 2) which requires further recourse to the rubric for an explanation.

The split between “pillars” had me for one puzzled. It derived from the Secretary of State’s plan for five pillars of activity, but at various times the spokesmen distinguished the groups either by the targets for sampling (patients and hospital staff showing symptoms versus wider NHS staff and households) or by the place of testing (PHE versus commercial labs). I failed to find on the website any clear definitions to discriminate between “critical” and “key” workers.  By the end of April this graph had become quite impossible to learn from: the number of cases detected by NHS labs was going down despite PHE opening its Lighthouse labs but the number from other mass testing (private) labs appeared to increase each day. Hence, it appeared to say nothing about the national prevalence and, since there was no effective treatment, offer no assistance to individual patients.

My interpretation is that this is a case of “reporting the data” out of a sense of duty or as a totem to show the approach is “scientific”.  The layout obscures any visible trend except to show Pillar 2 as increasing over its range. High counts on 5 and 8 April are balanced by lows values on 4, 6 and 7 April, so a smoothing line would make the graph far easier to understand.  Having so many written numbers makes this far more of a table than a graph; it’s no good on a screen, especially during a presentation, but you can print it and turn it sideways. Or you could opt for horizontal bars with time running down the screen; this layout fits less well on a landscape screen, but could be split into a row of panes by week.

For screen use, one could easily angle the dates, add grid lines or background shading for weekends and Easter, round the numbers and omit the commas, move the “6K” gridline to label the actual maximum and add a 5000 gridline, change “0K” to “0”, and make the key one-stage.  As the interest is always in the latest figures, one could move the Y labels to the right-hand end.  I would reconsider the colours: the orange is intrusive and a “warning” tint, and the blue is quite dark. Lighter, more neutral colours for the bars would go better with an overlain smoother.

None of this is rocket science or takes much time or resource, but it does show one has thought about the graph and the audience. It shows competence and consideration.