Comparing comparisons …

Several of the speakers at the DNAqua-net meeting in Essen described work that, essentially, produced a molecular genetic-based “mirror” of current assessment procedures.  That is what we have done, and it is a sensible first step because it helps us to understand how the data produced by Next Generation Sequencing (NGS) relate to our current understanding, based on traditional ecological methods.  The obvious way to make such a comparison is to generate both “old” and “new” indices from samples collected from a range of sites spread out along an environmental gradient, and then to look at the relationship between them.  A scatter plot gives a good visual indication of the nature of the relationship, whilst the correlation coefficient indicates its strength.  All well and good, but consider the two plots below.  These are based on artificial data that I generated in such a way that both had a Pearson’s correlation coefficient of about 0.95, indicating a strong relationship between the two variables.  However, the two plots differ in one important respect: points on the left-hand plot are scattered around the diagonal line (indicating slope = 1, i.e. both indices give the same outcome) whilst those on the right-hand plot are mostly below this line.
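Artificial data of this kind are easy to simulate in R (this is a sketch of the general idea, not the exact data behind the plots): scaling the second index and its noise by the same factor pulls the points below the 1:1 line without changing Pearson’s correlation at all.

```r
set.seed(42)
x     <- runif(50, 0, 10)        # sites spread along an environmental gradient
noise <- rnorm(50, sd = 0.95)    # scatter around the underlying relationship

y1 <- x + noise                  # left-hand plot: scattered around the 1:1 line
y2 <- 0.7 * (x + noise)          # right-hand plot: same shape, squashed below it

cor(x, y1)   # about 0.95
cor(x, y2)   # identical: rescaling y does not change Pearson's r
```

Because `y2` is just `0.7 * y1`, the two correlation coefficients are exactly equal, which is precisely why Pearson’s r cannot see the problem.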

The work that we have done over the past ten years or so means that we are fairly confident that we understand the performance of our traditional indices and, more importantly, that we can relate these to the concepts of ecological status set out in the Water Framework Directive.  This means that we need to be able to translate those concepts across to any new indices that might replace our existing approaches, and the right-hand plot indicates one potential problem: at high values, in particular, the new method consistently underestimates condition compared with the old method.  Note, however, that this is not picked up by the correlation coefficient, which is the same for both comparisons.  In this post, I want to suggest a better way of comparing two indices.

I made some comparisons of this nature in a paper that I wrote a few years ago, and one of the peer reviewers suggested that, rather than use a correlation coefficient, I should use Lin’s concordance correlation coefficient, which measures agreement between two variables in terms of their deviation from a 1:1 relationship.  This approach is widely used in pharmacology and epidemiology to ensure that new drugs give equivalent performance to any that they might replace and there is, as a result, a function for performing this calculation within epiR, a package of statistical methods for epidemiologists written for R.  Having downloaded and installed this package, the calculation is straightforward:

The standard Pearson’s correlation coefficient can be computed with a function from base R:

> cor.test(x,y)

And then we load the epiR library:

> library(epiR)

before calculating Lin’s concordance correlation coefficient as:

> epi.ccc(x, y)

The function returns a list; the rho.c element contains the point estimate of the coefficient along with its 95% confidence limits.

If we calculate this concordance coefficient on the data used to generate each of the plots above, we see that it is 0.95 for the left-hand plot (i.e. very similar to Pearson’s correlation coefficient) but only 0.74 for the right-hand plot: quite a different result.
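epi.ccc does the work for you, but Lin’s coefficient is simple enough to compute from first principles, which makes it clear why it penalises the right-hand plot: the denominator grows whenever the two means or variances drift apart.  A minimal base-R sketch (the function name is my own):

```r
lin_ccc <- function(x, y) {
  # Lin (1989): 2*covariance / (var_x + var_y + squared difference in means),
  # using n (not n - 1) in the variance and covariance terms
  sxy <- mean((x - mean(x)) * (y - mean(y)))
  sx2 <- mean((x - mean(x))^2)
  sy2 <- mean((y - mean(y))^2)
  2 * sxy / (sx2 + sy2 + (mean(x) - mean(y))^2)
}

x <- c(1, 2, 3, 4, 5)
lin_ccc(x, x)         # perfect agreement: 1
lin_ccc(x, 0.7 * x)   # Pearson's r is still exactly 1, but concordance is about 0.74
```

Pearson’s coefficient only asks whether the points fall on some straight line; Lin’s also asks whether that line is the 1:1 line.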

Having identified a deviation from a 1:1 relationship, discussion can spin off in several directions.  For the diatoms, for example, we are recognising that data produced by NGS are fundamentally different from those produced by traditional methods, and that the number of “reads” associated with a cell does not necessarily align with our traditional counting unit of the frustule (cell wall) or valve.  It is a product partly of cell size, partly of the number of chloroplasts and partly, I suspect, of a variety of environmental factors that we have not yet started to investigate.  The NGS data are not “wrong”, but they are different, and using them without cognisance of the problem might lead to an erroneous conclusion about the state of a site.  So we then have to think about how to rectify this problem, which might involve applying correction factors so that “traditional” indices can continue to be used, or deriving new NGS-specific coefficients, which is the approach we have adopted in the UK.  Both approaches have pros and cons but that is a subject for another day …


Kelly, M.G. & Ector, L. (2012). Effect of streamlining taxa lists on diatom-based indices: implications for intercalibrating ecological status. Hydrobiologia 695: 253–263.

Lin, L.I.-K. (1989). A concordance correlation coefficient to evaluate reproducibility. Biometrics 45: 255–268.


The madness that is “British values”


This post is a slight diversion from the core business of my blog, but bear with me because some of the themes will resonate with issues that I have been writing about over the past couple of years.

Recently, a local school, Durham Free School, made national headlines after an excoriating Ofsted report.  The Ofsted inspection was one of a number of inspections called at short notice on faith schools in the region, and there seemed to be a particular focus on determining whether or not the school taught “British values”.  The inspectors commented that the school was “…failing to prepare students for life in modern Britain. Some students hold discriminatory views of other people who have different faiths, values or beliefs from themselves”.  If true, this would be damning but, as my job for the past 20 years has involved developing objective measures to determine the success of policy (albeit for the environment rather than education), I was curious to see just how Ofsted inspectors arrived at this conclusion.

The report itself is not very illuminating on methods: the inspectors “…spoke to students in lessons, at break and during lunchtimes. They also spoke formally to two groups of students on the first day of the inspection.”  What, in particular, I wondered, did they ask before arriving at their conclusion about these “discriminatory values”?  I’ve read the whole report, I’ve searched the Ofsted website and I’ve looked at Ofsted’s publication Inspecting Schools: A Handbook for Inspectors.  The Handbook explains that Inspectors should consider how well management and leadership ensure that the curriculum “… promotes tolerance and respect for people of all faiths, genders, ages, disability and sexual orientation…” but there is nothing that explains how such an evaluation should be performed.  The Inspectors, I conclude, simply reported their opinion based on the conversations they had with this small sample of pupils.

Let’s look at this process from a statistical perspective: the Inspectors’ opinion is, in effect, a test of the hypothesis that “students hold discriminatory views”, which could be re-cast as a null hypothesis: “students do not hold discriminatory views”.  The Inspectors reached their opinion via the conversations mentioned above (no mention of whether there was a set form of questions, whether students were interviewed as a group or individually, or whether closed or open questions were used).  The outcome is cited in the report in absolute terms but, in reality, it is a probability based on the outcomes of the interviews.  And, because they only interviewed a sample of students, there will be uncertainties associated with this outcome.  The Inspectors might have reached the wrong conclusion: in statistical terms, they may have rejected the null hypothesis on the basis of their sample even though most of the students, in fact, hold non-discriminatory views – a “Type 1 error”.

That the Inspectors concluded that “some students” held these views is, perhaps, worrying in itself. We could argue, in support of the Inspectors, that there should be zero tolerance of discrimination of any kind. Yet this then raises the question of whether the limited sampling programme deployed by the Inspectors is sufficiently sensitive to detect discrimination in every school where it occurs (i.e. the risk of retaining the null hypothesis when it should have been rejected – a “Type 2 error”).  On the other hand, if Ofsted published detailed guidelines on how such evaluations were to be performed (guidance on sample size, types of questions and so on), and the Inspectors at Durham Free School had given more details of the sample size on which they based their judgements, then perhaps we would be in a better position to evaluate the credibility of those judgements. The reality, I suspect, is that the risk of a wrong outcome will be high because the sample size was small. The two Inspectors had just two days to evaluate all aspects of the teaching and governance of the school. Some topics that deserved detailed scrutiny were, inevitably, evaluated in a superficial manner as a result.
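To put rough numbers on that risk: if a proportion p of students hold a particular view, and the Inspectors talk to a random sample of n students, the chance that the sample contains at least one such student is 1 − (1 − p)^n.  A quick sketch in R (the proportions and sample sizes here are invented purely for illustration):

```r
# Chance that a random sample of n students includes at least one student
# holding a view held by a proportion p of the whole school
p_detect <- function(p, n) 1 - (1 - p)^n

p_detect(0.05, 10)   # about 0.40: two small group interviews could easily miss it
p_detect(0.05, 60)   # about 0.95: a far larger sample is needed for real confidence
```

On these (made-up) figures, an inspection that speaks formally to a handful of students has more chance of missing a minority view than of finding it.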

The core of the problem was summarised neatly in an editorial in The Independent today, which places the blame squarely on Michael Gove, the previous Education Secretary, for putting “British values” onto the list of criteria that Ofsted were required to inspect.  The problem, The Independent comments, “is that no one can say exactly what it means, which gives inspectors enormous leeway to decide whether or not a school is teaching the said value correctly.”  Political ideology, at some point, has to be translated into practical action, and the success or otherwise of policy depends on being able to make judgements consistently across the entire country.  If Ofsted are unable to convert Michael Gove’s rhetoric into transparent and fair measures, then they should resist being drawn into an arena where objectivity comes second to political grandstanding.