The natural history of numbers

I have made a few facetious comments in this blog about the tendency for ecologists to spend more time staring at spreadsheets than engaging directly with the organisms and habitats they are trying to understand.   There is, of course, a balance that needs to be struck.   We can learn a lot from analysing big datasets that would not have occurred to a biologist who spent all his or her time in the field.  And, I have to admit, somewhat grudgingly, there is a beauty to the numerical landscapes that becomes apparent when a trained eye is brought to bear on data.

I’ve been involved in a project for the European Commission which has been trying to find good ways of converting the ecological objectives that we’ve established for the Water Framework Directive into targets for the pressures that lead to ecosystem degradation.   The key principle behind this work is summarised in the graph below: if the relationship between the biology (expressed as an Ecological Quality Ratio, EQR) and a pressure (in this case, the phosphorus concentration in a river or lake) can be expressed as a regression line then we can read off the phosphorus concentration that relates to any point on the biological scale.   (Note that there are many other ways of deriving a threshold phosphorus concentration, but this simple approach will suffice for now.)


Relationship between biology (expressed as an Ecological Quality Ratio, EQR) and phosphorus concentration for a hypothetical dataset.  The blue line indicates the least squares regression line, the horizontal green line is the position of the putative good/moderate status boundary and the vertical green line is the phosphorus concentration at this boundary position.  Coefficient of determination, r2= 0.89 (rarely achieved in real datasets!)
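For readers who like to see the arithmetic, the read-off can be sketched in a few lines of code. This is a Python sketch on made-up data; the slope, the noise level and the boundary EQR of 0.6 are all invented for illustration, not values from the project:

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up data: EQR declines as phosphorus rises.
phosphorus = rng.uniform(10, 200, 100)                     # e.g. ug/L total P
eqr = 1.0 - 0.004 * phosphorus + rng.normal(0, 0.05, 100)

# Least-squares regression of EQR on phosphorus.
slope, intercept = np.polyfit(phosphorus, eqr, 1)

# Read off the phosphorus concentration at a putative
# good/moderate boundary (EQR = 0.6, purely illustrative).
boundary_eqr = 0.6
boundary_p = (boundary_eqr - intercept) / slope
```

Inverting the fitted line like this is exactly the simple approach described above; note that the uncertainty in boundary_p inherits the uncertainty of both regression coefficients, which is one reason real derivations are more involved.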

This is fine if you have a strong relationship between your explanatory and response variables and you are confident that there is a causal relationship between them. Unfortunately, neither of these criteria is fulfilled in most of the datasets we’ve looked at; in particular, it is rare for the biota in rivers to be so strongly controlled by a single pressure. This means that, when trying to establish thresholds, we also need to think about how a second pressure might interact with the factor we’re trying to control. If this second pressure has an independent effect on the biota, then some sites that would have had high EQRs had we considered phosphorus alone will now be influenced by this second pressure, and the EQR at these sites will fall below the regression line we’ve just established. When we plot the relationship between EQR and phosphorus taking this second pressure into account, our data no longer fit a neat straight line but instead form a “wedge” shape, because at many sites the second pressure overrules the effect of phosphorus. If you were tempted to put a simple regression line through this new cloud of data, you would see the coefficient of determination, r2, drop from 0.89 to 0.35. Note, too, how the change in slope means that the position of the phosphorus boundary also falls. More worryingly, we know that, for this hypothetical dataset, the new line does not represent a causal relationship between biology and phosphorus. That’s no good if you want to use the relationship to set phosphorus targets and, indeed, you now also need to think about how to manage this second pressure.


The same relationship as that shown in the previous graph, but this time with an interaction from a second pressure.  The blue line is the regression line established when phosphorus alone was considered, and the red line is the regression between EQR and phosphorus in the presence of this second pressure.

My purpose in this post is not to talk about the dark arts of setting targets for nutrient concentrations that will support healthy ecosystems but, rather, to talk about data landscapes.  Once we saw and started to understand the meaning of “wedge”-shaped data, we started to see similar patterns occurring in all sorts of other situations.   The previous paragraph and graph, for example, assumed that the factor that confounded the biology-phosphorus relationship was detrimental to the biology, but some factors can mitigate the effect of phosphorus, giving an inverted wedge, as in the next diagram.  Once again, the blue line shows the regression line that would have been fitted if this was a pure biology versus phosphorus relationship.


The same relationship, but this time with a second factor that mitigates the effect of phosphorus.  Note how the original relationship now defines the lower, rather than the upper, edge of the wedge.

Wedge-shaped data crop up in other situations as well. The next graph shows the number of diatom species I recorded in a study of Irish streams, and there is a distinct “edge” to the cloud of data points. At low pH (acid conditions), I rarely found more than 10-15 species of diatom whereas, under circumneutral conditions, I sometimes found 10-15 species but could also find 30 or more. Once again, we are probably looking at a situation where, although pH does exert a pressure on the diatom assemblage, lots of other factors do too, so we only see the effect of pH when its influence is strong (pH < 5).


The number of diatom species recorded across a pH gradient in Irish streams.  Unpublished data.

In this case, the practical problem is that the link between species number and pH is weak, so it is hard to derive useful information from the number of species alone. It would be dangerous to conclude, for example, that the ecology at a site was impacted by acidification on the strength of a single sample. On the other hand, if you visited the site several times and always recorded low species numbers, then you have a pretty good indication that there is a problem (not necessarily low pH; toxic metals would have a similar effect). Whether such a pattern is spotted will depend on how often a site is visited, and the sad reality is that sampling frequencies in the UK are now much lower than in the past.
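The value of repeat visits can be put in rough numbers. Suppose, purely for illustration, that a single sample from a healthy circumneutral site shows a low species count with probability q, and that visits are independent; the chance that every one of k visits shows a low count then shrinks geometrically:

```python
# Probability that k independent visits to a healthy site ALL show a
# low species count, if any single visit does so with probability q.
# (q = 0.3 is an invented figure, purely for illustration.)
def prob_all_low(q: float, k: int) -> float:
    return q ** k

one_visit = prob_all_low(0.3, 1)    # 0.3 -- a single sample is ambiguous
four_visits = prob_all_low(0.3, 4)  # 0.0081 -- four low counts in a row
                                    # would be hard to explain by chance
```

The same logic is why infrequent sampling is so corrosive: with one visit a year, the evidence needed to separate a genuinely impacted site from an unlucky sample takes years to accumulate.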

However, this post is not supposed to be about the politics of monitoring (evidence-based policy is so much easier when you don’t collect enough uncomfortable evidence) but about the landscapes that we see in our data, and what these can tell us about the processes at work.   Just as a field biologist can look up from the stream they are sampling and gain a sense of perspective by contemplating the topography of the surrounding land, so we should also be aware of the topography of our data before blithely ploughing ahead with statistical analyses.


With Geoff Phillips and Heliana Teixeira – fellow explorers of data landscapes in our project to encourage consistent nutrient boundaries across the European Union.

Comparing comparisons …

Several of the speakers at the DNAqua-net meeting in Essen described work that, essentially, produced a molecular genetic-based “mirror” of current assessment procedures. That is what we have done, and it is a sensible first step because it helps us to understand how the data produced by Next Generation Sequencing (NGS) relate to our current understanding, based on traditional ecological methods. The obvious way to make such a comparison is to generate both “old” and “new” indices from samples collected at a range of sites spread out along an environmental gradient, and then to look at the relationship between these. A scatter plot gives you a good visual indication of the nature of the relationship whilst the correlation coefficient indicates its strength. All well and good, but consider the two plots below. These are based on artificial data that I generated in such a way that both had a Pearson’s correlation coefficient of about 0.95, indicating a strong relationship between the two variables. However, the two plots differ in one important respect: points on the left-hand plot are scattered around the diagonal line (indicating slope = 1, i.e. both indices give the same outcome) whilst those on the right-hand plot are mostly below this line.

The work that we have done over the past ten years or so means that we are fairly confident that we understand the performance of our traditional indices and, more importantly, that we can relate these to the concepts of ecological status set out in the Water Framework Directive. This means that we need to be able to translate these concepts across to any new indices that might replace our existing approaches, and the right-hand plot indicates one potential problem: at high values, in particular, the new method consistently under-estimates condition compared with the old method. Note, however, that this has not been picked up by the correlation coefficient, which is the same for both comparisons. In this post, therefore, I want to suggest a better way of comparing two indices.

I made some comparisons of this nature in a paper that I wrote a few years ago, and one of the peer reviewers suggested that, rather than use a correlation coefficient, I should use Lin’s concordance correlation coefficient, which measures the relationship between two variables in terms of their deviation from a 1:1 line. This approach is widely used in pharmacology and epidemiology to ensure that drugs give equivalent performance to any that they might replace and there is, as a result, a function for performing this calculation in epiR, an R package of statistical methods for epidemiologists. Having downloaded and installed this package, the calculation is straightforward:

The standard Pearson’s correlation coefficient can be computed from a base function in R as:

> cor.test(x,y)

And then we load the epiR library:

> library(epiR)

before calculating Lin’s concordance correlation coefficient as:

> epi.ccc(x,y)

If we calculate this coefficient of concordance on the data used to generate each of the plots above, we see that it is 0.95 for the left-hand plot (i.e. very similar to Pearson’s correlation coefficient) but only 0.74 for the right-hand plot: quite a different result.
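epi.ccc() does the work in R; for anyone curious about what it computes, Lin’s coefficient is just the covariance penalised by any shift or scaling away from the 1:1 line. Below is a Python sketch on synthetic data that mimics the two plots (the 0.7 scaling and the noise level are invented, not taken from real datasets):

```python
import numpy as np

def lin_ccc(x, y):
    """Lin's concordance correlation coefficient:
    2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2).
    It equals Pearson's r only when the points sit on the 1:1 line."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

rng = np.random.default_rng(1)
old = rng.uniform(0, 1, 200)         # the "traditional" index
noise = rng.normal(0, 0.05, 200)

agree = old + noise                  # scattered around the 1:1 line
biased = 0.7 * old + noise           # under-estimates at high values

# Pearson's r is high for both comparisons; only the concordance
# coefficient penalises the biased one.
r_agree = np.corrcoef(old, agree)[0, 1]
r_biased = np.corrcoef(old, biased)[0, 1]
ccc_agree = lin_ccc(old, agree)
ccc_biased = lin_ccc(old, biased)
```

The penalty terms in the denominator are the key design choice: any difference in means or variances between the two indices drags the coefficient below Pearson’s r, which is exactly the behaviour we want when asking whether a new index can *replace* an old one rather than merely track it.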

Having identified a deviation from a 1:1 relationship, discussion can spin off in several directions. For the diatoms, for example, we are recognising that data produced by NGS are fundamentally different from those produced by traditional methods, and that the number of “reads” associated with a cell does not necessarily align with our traditional counting unit of the frustule (cell wall) or valve. It is a product partly of cell size, partly of the number of chloroplasts and partly, I suspect, of a variety of environmental factors that we have not yet started to investigate. The NGS data are not “wrong”, but they are different, and using these data without cognisance of the problem might lead to an erroneous conclusion about the state of a site. So we then have to think about how to rectify this problem, which might involve applying correction factors so that “traditional” indices can continue to be used, or deriving new NGS-specific coefficients, which is the approach we have adopted in the UK. Both approaches have pros and cons, but that is a subject for another day …


Kelly, M.G. & Ector, L. (2012). Effect of streamlining taxa lists on diatom-based indices: implications for intercalibrating ecological status. Hydrobiologia 695: 253-263.

Lin, L.I.-K. (1989). A concordance correlation coefficient to evaluate reproducibility. Biometrics 45: 255-268.

The madness that is “British values”


This post is a slight diversion from the core business of my blog, but bear with me because some of the themes will resonate with issues that I have been writing about over the past couple of years.

Recently, a local school, Durham Free School, made national headlines after an excoriating Ofsted report. The Ofsted inspection was one of a number of inspections called at short notice on faith schools in the region, and there seemed to have been a particular focus on determining whether or not the school taught “British values”. The inspectors commented that the school was “…failing to prepare students for life in modern Britain. Some students hold discriminatory views of other people who have different faiths, values or beliefs from themselves”. If true, this would be damning; but, as my job for the past 20 years has involved developing objective measures to determine the success of policy (albeit for the environment rather than education), I was curious to see just how Ofsted inspectors arrived at this conclusion.

The report itself is not very illuminating on methods: the inspectors “…spoke to students in lessons, at break and during lunchtimes. They also spoke formally to two groups of students on the first day of the inspection.” What, in particular, I wondered, did they ask before arriving at their conclusion about these “discriminatory values”? I’ve read the whole report, I’ve searched the Ofsted website and I’ve looked at Ofsted’s publication Inspecting Schools: A Handbook for Inspectors. The Handbook explains that Inspectors should consider how well management and leadership ensure that the curriculum “… promotes tolerance and respect for people of all faiths, genders, ages, disability and sexual orientation…” but there is nothing that explains how such an evaluation should be performed. The Inspectors, I conclude, simply reported their opinion based on the conversations they had with this small sample of pupils.

Let’s look at this process from a statistical perspective: the Inspectors’ opinion is, in effect, a test of the hypothesis that “students hold discriminatory views”, which could be re-cast as a null hypothesis: “students do not hold discriminatory views”. The Inspectors reached their opinion via the conversations mentioned above (no mention of whether there was a set form of questions, whether students were interviewed as a group or individually, or whether closed or open questions were used). The outcome is cited in the report in absolute terms but, in reality, it is a probability based on the outcomes of the interviews. And, because they only interviewed a sample of students, there will be uncertainties associated with this outcome. The Inspectors might have reached the wrong conclusion: in statistical terms, they may have rejected the null hypothesis on the basis of their sample even though most of the students, in fact, hold non-discriminatory views – a “Type I error”.

That the Inspectors concluded that “some students” held these views is, perhaps, worrying in itself. We could argue, in support of the Inspectors, that there should be zero tolerance of discrimination of any kind. Yet this then raises the question of whether the limited sampling programme deployed by the Inspectors is sufficiently sensitive to detect discrimination in every school where it occurs (i.e. the risk of retaining the null hypothesis when it should have been rejected – a “Type II error”). On the other hand, if Ofsted published detailed guidelines on how such evaluations were to be performed (guidance on sample size, types of questions and so on), and the Inspectors at Durham Free School had given more details of the sample size on which they based their judgements, then perhaps we would be in a better position to evaluate the credibility of their judgements. The reality, I suspect, is that the risk of a wrong outcome will be high because the sample size was small. The two inspectors had just two days to evaluate all aspects of the teaching and governance of the school. Some topics that deserved detailed scrutiny were, inevitably, evaluated in a superficial manner as a result.
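The Type II error point can be made concrete with some back-of-envelope arithmetic. If a fraction p of students hold a particular view and the inspectors speak to n students, the chance that no interviewed student expresses it is (1 − p)^n, assuming candid and independent answers, which is itself a generous assumption. All the numbers below are invented for illustration:

```python
# Chance that a sample of n interviews misses a view held by a
# fraction p of the student body (candid, independent answers assumed;
# all figures are invented for illustration).
def prob_missed(p: float, n: int) -> float:
    return (1 - p) ** n

# If 5% of students held such views and the Inspectors spoke to ~20
# students (two groups plus chance conversations), the inspection
# would hear nothing about a third of the time.
miss_20 = prob_missed(0.05, 20)
```

The same arithmetic cuts both ways: a sample this small can easily miss a real problem, and a couple of unrepresentative answers can just as easily manufacture one.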

The core of the problem was summarised neatly in an editorial in The Independent today which places the blame squarely on Michael Gove, the previous Education Secretary, for putting “British values” onto the list of criteria that Ofsted were required to inspect.   The problem, The Independent comments “is that no one can say exactly what it means, which gives inspectors enormous leeway to decide whether or not a school is teaching the said value correctly.”   Political ideology, at some point, has to be translated to practical action and the success or otherwise of policy depends on being able to make judgements consistently across the entire country.   If Ofsted are unable to convert Michael Gove’s rhetoric into transparent and fair measures, then they should resist being drawn into an arena where objectivity comes second to political grandstanding.