The natural history of numbers

I have made a few facetious comments in this blog about the tendency for ecologists to spend more time staring at spreadsheets than engaging directly with the organisms and habitats they are trying to understand. There is, of course, a balance that needs to be struck. Analysing big datasets can teach us things that would not have occurred to a biologist who spent all his or her time in the field. And, I have to admit, somewhat grudgingly, there is a beauty to the numerical landscapes that becomes apparent when a trained eye is brought to bear on data.

I’ve been involved in a project for the European Commission which has been trying to find good ways of converting the ecological objectives that we’ve established for the Water Framework Directive into targets for the pressures that lead to ecosystem degradation.   The key principle behind this work is summarised in the graph below: if the relationship between the biology (expressed as an Ecological Quality Ratio, EQR) and a pressure (in this case, the phosphorus concentration in a river or lake) can be expressed as a regression line then we can read off the phosphorus concentration that relates to any point on the biological scale.   (Note that there are many other ways of deriving a threshold phosphorus concentration, but this simple approach will suffice for now.)


Relationship between biology (expressed as an Ecological Quality Ratio, EQR) and phosphorus concentration for a hypothetical dataset. The blue line indicates the least squares regression line, the horizontal green line is the position of the putative good/moderate status boundary and the vertical green line is the phosphorus concentration at this boundary position. Coefficient of determination, r² = 0.89 (rarely achieved in real datasets!)
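For readers who like to see the arithmetic, here is a minimal sketch of the "read off the boundary" step. The numbers are invented for illustration, and I have assumed that phosphorus is log-transformed before fitting and that the good/moderate boundary sits at an EQR of 0.6; neither assumption comes from the project itself.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def pressure_at_boundary(intercept, slope, boundary_eqr):
    """Invert the fitted line: the x value at which the predicted EQR equals the boundary."""
    return (boundary_eqr - intercept) / slope


# Hypothetical values, for illustration only
log_p = [1.0, 1.2, 1.5, 1.8, 2.0, 2.3, 2.5]       # log10 of phosphorus (ug/L), assumed
eqr = [0.95, 0.90, 0.78, 0.66, 0.60, 0.45, 0.40]  # invented EQR values
boundary_eqr = 0.6                                # assumed good/moderate boundary

intercept, slope = fit_line(log_p, eqr)
boundary_log_p = pressure_at_boundary(intercept, slope, boundary_eqr)
boundary_p = 10 ** boundary_log_p                 # back-transform to ug/L

print(f"EQR = {intercept:.2f} + {slope:.2f} * log10(P)")
print(f"Phosphorus at the boundary: {boundary_p:.0f} ug/L")
```

The horizontal green line in the graph corresponds to `boundary_eqr`, and the vertical green line to `boundary_p`.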

This is fine if you have a strong relationship between your explanatory and response variables and you are confident that there is a causal relationship between them. Unfortunately, neither of these criteria is fulfilled in most of the datasets we’ve looked at; in particular, it is rare for the biota in rivers to be so strongly controlled by a single pressure. This means that, when trying to establish thresholds, we also need to think about how a second pressure might interact with the factor we’re trying to control. If this second pressure has an independent effect on the biota, then some sites that would have had high EQRs if we just considered phosphorus will now be influenced by the second pressure, and the EQR at these sites will fall below the regression line we’ve just established. When we plot the relationship between EQR and phosphorus taking this second pressure into account, our data no longer fit a neat straight line but form a “wedge” shape, due to the many sites where the second pressure overrules the effect of phosphorus. If you were tempted to put a simple regression line through this new cloud of data, you would see the coefficient of determination, r², drop from 0.89 to 0.35. Note, too, how the change in slope means that the position of the phosphorus boundary also falls. More worryingly, we know that, for this hypothetical dataset, the new line does not represent a causal relationship between biology and phosphorus. That’s no good if you want to use the relationship to set phosphorus targets and, indeed, you now also need to think about how to manage this second pressure.


The same relationship as that shown in the previous graph, but this time with an interaction from a second pressure.  The blue line is the regression line established when phosphorus alone was considered, and the red line is the regression between EQR and phosphorus in the presence of this second pressure.
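A wedge like this is easy to conjure up for yourself. The sketch below is one possible way of doing it, not the method used to draw the graphs: it assumes that the second pressure sets an independent ceiling on the EQR at each site, so that the observed EQR is whichever is lower, the phosphorus-driven value or that ceiling. All coefficients, ranges and the random seed are invented, so it will not reproduce the 0.89 and 0.35 quoted above.

```python
import random


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y, intercept, slope):
    """Coefficient of determination for the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


rng = random.Random(1)   # fixed seed so the sketch is repeatable
n_sites = 200            # hypothetical number of sites

# Phosphorus alone: a tight straight-line relationship (invented coefficients)
log_p = [rng.uniform(1.0, 2.5) for _ in range(n_sites)]
eqr_one = [1.3 - 0.35 * x + rng.gauss(0, 0.03) for x in log_p]

# Assumed second pressure: an independent ceiling on EQR at each site
ceiling = [rng.uniform(0.1, 1.1) for _ in range(n_sites)]
eqr_two = [min(e, c) for e, c in zip(eqr_one, ceiling)]

a1, b1 = fit_line(log_p, eqr_one)
a2, b2 = fit_line(log_p, eqr_two)
r2_one = r_squared(log_p, eqr_one, a1, b1)
r2_two = r_squared(log_p, eqr_two, a2, b2)

print(f"one pressure:  slope = {b1:.2f}, r2 = {r2_one:.2f}")
print(f"two pressures: slope = {b2:.2f}, r2 = {r2_two:.2f}")
```

Because the ceiling can only pull sites downwards, the original line survives as the upper edge of the cloud, which is exactly the wedge in the graph.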

My purpose in this post is not to talk about the dark arts of setting targets for nutrient concentrations that will support healthy ecosystems but, rather, to talk about data landscapes.  Once we saw and started to understand the meaning of “wedge”-shaped data, we started to see similar patterns occurring in all sorts of other situations.   The previous paragraph and graph, for example, assumed that the factor that confounded the biology-phosphorus relationship was detrimental to the biology, but some factors can mitigate the effect of phosphorus, giving an inverted wedge, as in the next diagram.  Once again, the blue line shows the regression line that would have been fitted if this was a pure biology versus phosphorus relationship.


The same relationship, but this time with a second factor that mitigates the effect of phosphorus. Note how the original relationship now defines the lower, rather than the upper, edge of the wedge.

Wedge-shaped data crop up in other situations as well. The next graph shows the number of diatom species I recorded in a study of Irish streams, and there is a distinct “edge” to the cloud of data points. At low pH (acid conditions), I rarely found more than 10-15 species of diatom whereas, at circumneutral pH, I sometimes found 10-15 species but could also find 30 or more. Once again, we are probably looking at a situation where, although pH does exert a pressure on the diatom assemblage, lots of other factors do too, so we only see the effect of pH when its influence is strong (below pH 5).


The number of diatom species recorded across a pH gradient in Irish streams.  Unpublished data.
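One crude way to trace the "edge" of a cloud like this is to slice the pH gradient into bins and take the highest species count in each. This is my own quick sketch rather than anything we did in the study, and the values below are made up for illustration; they are not the Irish stream data.

```python
def upper_edge_by_bin(x, y, bin_edges):
    """Return (bin_low, bin_high, max_y) for every bin that contains data."""
    edge = []
    for low, high in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = [yi for xi, yi in zip(x, y) if low <= xi < high]
        if in_bin:
            edge.append((low, high, max(in_bin)))
    return edge


# Made-up values, for illustration only
ph = [4.2, 4.5, 4.8, 5.3, 5.6, 5.9, 6.4, 6.6, 6.9, 7.2, 7.4]
n_species = [6, 9, 12, 10, 18, 14, 12, 25, 31, 15, 34]

for low, high, top in upper_edge_by_bin(ph, n_species, [4, 5, 6, 7, 8]):
    print(f"pH {low}-{high}: up to {top} species")
```

The maximum is sensitive to a single unusual sample, so with a real dataset a high percentile within each bin might be a safer description of the edge.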

In this case, the practical problem is that the link between species number and pH is weak, so it is hard to derive useful information from the number of species alone. It would be dangerous to conclude, for example, that the ecology at a site was impacted by acidification on the strength of a single sample. On the other hand, if you visited the site several times and always recorded low species numbers, then you would have a pretty good indication that there was a problem (not necessarily low pH; toxic metals would have a similar effect). Whether such a pattern is spotted will depend on how often a site is visited, and the sad reality is that sampling frequencies in the UK are now much lower than in the past.

However, this post is not supposed to be about the politics of monitoring (evidence-based policy is so much easier when you don’t collect enough uncomfortable evidence) but about the landscapes that we see in our data, and what these can tell us about the processes at work.   Just as a field biologist can look up from the stream they are sampling and gain a sense of perspective by contemplating the topography of the surrounding land, so we should also be aware of the topography of our data before blithely ploughing ahead with statistical analyses.


With Geoff Phillips and Heliana Teixeira – fellow-explorers of data landscapes in our project to encourage consistent nutrient boundaries across the European Union.
