Several of the speakers at the DNAqua-net meeting in Essen described work that, essentially, produced a molecular genetic-based “mirror” of current assessment procedures. That is what we have done and it is a sensible first step because it helps us to understand how the data produced by Next Generation Sequencing (NGS) relate to our current understanding, based on traditional ecological methods. The obvious way to make such a comparison is to generate both “old” and “new” indices from samples collected from a range of sites spread out along an environmental gradient, and then to look at the relationship between these. A scatter plot gives you a good visual indication of the nature of the relationship whilst the correlation coefficient indicates its strength.
All well and good, but consider the two plots below. These are based on artificial data that I generated in such a way that both had a Pearson’s correlation coefficient of about 0.95, indicating a very strong relationship between the two variables. However, the two plots differ in one important respect: points on the left-hand plot are scattered around the diagonal line (slope = 1, i.e. both indices give the same outcome) whilst those on the right-hand plot are mostly below this line.
The work that we have done over the past ten years or so means that we are fairly confident that we understand the performance of our traditional indices and, more importantly, that we can relate these to the concepts of ecological status as set out in the Water Framework Directive. This means that we need to be able to translate these concepts across to any new indices that might replace our existing approaches. The right-hand plot indicates one potential problem: at high values, in particular, the new method consistently under-estimates condition compared with the old method. Note, however, that this has not been picked up by the correlation coefficient, which is the same for both comparisons. In this post, I want to suggest a better way of comparing two indices.
I made some comparisons of this nature in a paper that I wrote a few years ago and one of the peer reviewers suggested that, rather than use a correlation coefficient, I should use Lin’s concordance correlation coefficient, which measures the relationship between two variables in terms of their deviation from a 1:1 relationship. This is an approach widely used in pharmacology and epidemiology to ensure that drugs give equivalent performance to any that they might replace and there is, as a result, a command for performing this calculation within a library of statistical methods for epidemiologists written for R: epiR. Once this library has been downloaded and installed, the calculation is straightforward:
The standard Pearson’s correlation coefficient can be computed with a base function in R, such as cor(). We then load the epiR library:
> library(epiR)
before calculating Lin’s concordance correlation coefficient with the library’s epi.ccc() function.
If we calculate this coefficient of concordance on the data used to generate each of the plots above we see that it is 0.95 for the left-hand plot (i.e. very similar to Pearson’s correlation coefficient) but only 0.74 for the right-hand plot: quite a different result.
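To make the contrast concrete, here is a minimal sketch in base R. It is not the data or the code behind the plots above: the variable names, the seed and the simulation settings are all invented for illustration, so the numbers it prints will not match the 0.95 and 0.74 quoted above exactly. The function lin_ccc() is a hand-written version of Lin's (1989) formula, included so that the sketch runs without epiR.

```r
# Lin's concordance correlation coefficient, written out by hand.
# Variances and covariance use the 1/n form, following Lin (1989).
lin_ccc <- function(x, y) {
  n    <- length(x)
  s_xy <- sum((x - mean(x)) * (y - mean(y))) / n
  s_x2 <- sum((x - mean(x))^2) / n
  s_y2 <- sum((y - mean(y))^2) / n
  # The (mean(x) - mean(y))^2 term penalises a shift away from the 1:1 line;
  # unequal variances (a slope other than 1) reduce the value as well.
  2 * s_xy / (s_x2 + s_y2 + (mean(x) - mean(y))^2)
}

set.seed(1)  # arbitrary seed, only so that repeated runs agree

# Hypothetical "old" index values along an environmental gradient
old_index <- runif(100, min = 0, max = 10)

# Case 1: the "new" index scatters around the 1:1 line
new_agree <- old_index + rnorm(100, sd = 0.9)

# Case 2: the "new" index falls increasingly below the 1:1 line at high values
new_biased <- 0.6 * old_index + rnorm(100, sd = 0.55)

pearson_agree  <- cor(old_index, new_agree)
pearson_biased <- cor(old_index, new_biased)
ccc_agree      <- lin_ccc(old_index, new_agree)
ccc_biased     <- lin_ccc(old_index, new_biased)

print(round(c(pearson_agree = pearson_agree, ccc_agree = ccc_agree,
              pearson_biased = pearson_biased, ccc_biased = ccc_biased), 2))

# With epiR installed, epi.ccc(old_index, new_biased)$rho.c should report
# the same estimate, along with a confidence interval.
```

Pearson's coefficient should come out similar for the two cases, whereas the concordance coefficient should drop noticeably only for the second, because it is sensitive to both the offset and the change in slope.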
Having identified a deviation from a 1:1 relationship, discussion can spin off in several directions. For the diatoms, for example, we are recognising that data produced by NGS are fundamentally different to those produced by traditional methods and that the number of “reads” associated with a cell does not necessarily align with our traditional counting unit of the frustule (cell wall) or valve. It is a product partly of cell size, partly of the number of chloroplasts and partly, I suspect, of a variety of environmental factors that we have not yet started to investigate. The NGS data are not “wrong”, but they are different, and using these data without cognisance of the problem might lead to an erroneous conclusion about the state of a site. So we then have to think about how to rectify this problem, which might involve applying correction factors so that “traditional” indices can continue to be used or deriving new NGS-specific coefficients, which is the approach we have adopted in the UK. Both approaches have pros and cons but that is a subject for another day …
Kelly, M.G. & Ector, L. (2012) Effect of streamlining taxa lists on diatom-based indices: implications for intercalibrating ecological status. Hydrobiologia 695: 253-263.
Lin, L.I.-K. (1989) A concordance correlation coefficient to evaluate reproducibility. Biometrics 45: 255-268.