The Imitation Game

About a year ago, I made a dire prediction about the future of diatom taxonomy in the new molecular age (see “Murder on the barcode express …”). A year on, I thought I would return to this topic from a different angle, using the “Turing Test” in Artificial Intelligence as a metaphor. The Turing Test (or “Imitation Game”) was devised by Alan Turing in 1950 as a test of a machine’s ability to exhibit intelligent behaviour indistinguishable from that of a human (encapsulated as “can machines do what we [as thinking entities] can do?”).

My primary focus over the past few years has not been the role of molecular biology in taxonomy, but rather the application of taxonomic information to decision-making by catchment managers. So my own Imitation Game is not going to ask whether computers will ever identify microscopic algae as well as humans. Instead, it asks whether they can, as well as a human biologist, give catchment managers the information they need to make a rational judgement about the condition of a river and the steps needed to improve or maintain that condition.

One of the points that I made in the earlier post is that current approaches based on light microscopy are already highly reductionist: a human analyst makes a list of species and their relative abundances, which are processed using standardised metrics to assign a site to a status class. In theory, there is the potential for the human analysts to then add value to that assignment through their interpretations. The extent to which that happens will vary from country to country but there are two big limitations: first, our knowledge of the ecology of diatoms is meagre (see earlier post) and, second, diatoms represent only a small part of the total diversity of microscopic algae and protists present in any river. That latter point, in particular, is spurring some of us to start exploring the potential of molecular methods to capture this lost information but, at the same time, we expect to encounter even larger gaps in existing taxonomic knowledge than is the case for diatoms.
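To make that reductionist pipeline concrete, here is a minimal sketch in Python of an abundance-weighted metric of the general kind described above. The species scores, the class boundaries and the function names are all hypothetical placeholders invented for illustration; they are not the values used in any national method.

```python
# Illustrative sketch only: every score and boundary here is a made-up placeholder.
HYPOTHETICAL_SCORES = {  # 1 = associated with low pressure, 5 = high pressure
    "Achnanthidium minutissimum": 2.0,
    "Ulnaria ulna": 3.0,
    "Gomphonema sp.": 4.0,
}

# (minimum quality ratio, status class), highest class first.
HYPOTHETICAL_BOUNDARIES = [
    (0.8, "high"),
    (0.6, "good"),
    (0.4, "moderate"),
    (0.2, "poor"),
    (0.0, "bad"),
]


def weighted_average_index(counts, scores):
    """Abundance-weighted mean score; taxa without a score are ignored."""
    scored = {taxon: n for taxon, n in counts.items() if taxon in scores}
    total = sum(scored.values())
    if total == 0:
        raise ValueError("sample contains no scored taxa")
    return sum(n * scores[taxon] for taxon, n in scored.items()) / total


def status_class(index, best=1.0, worst=5.0):
    """Rescale the index to a 0-1 quality ratio and look up its class."""
    ratio = (worst - index) / (worst - best)
    for minimum, label in HYPOTHETICAL_BOUNDARIES:
        if ratio >= minimum:
            return label
    return "bad"


sample = {"Achnanthidium minutissimum": 60, "Ulnaria ulna": 30, "Gomphonema sp.": 10}
index = weighted_average_index(sample, HYPOTHETICAL_SCORES)
print(index, status_class(index))
```

With these invented numbers the sample scores 2.5 and lands in the "good" class. Everything the analyst saw down the microscope has been collapsed into one number and one label, which is exactly the point at which interpretation could add value.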

One very relevant question is whether this will even be perceived as a problem by the high-ups. There is a very steep fall-off in technical understanding as one moves up through the management tiers of environmental regulators. That’s inevitable (see “The human ecosystem of environmental management…”) but a consequence is that their version of the Imitation Game will be played to different rules from that of the Environment Agency’s Poor Bloody Infantry whose game, in turn, will not be the same as that of academic taxonomists and ecologists. So we’ll have to consider each of these versions separately.

Let’s start with the two extreme positions: the traditional biologist’s desire to retain a firm grip on Linnaean taxonomy versus the regulator’s desire for molecular methods to imitate (if not better) the condensed nuggets of information that are the stock-in-trade of ecological assessment. If the former’s Imitation Game consists of using molecular methods to capture the diversity of microalgae at least as well as human specialists, then we run immediately into a new conundrum: humans are, actually, not very good at doing this, and molecular taxonomy is one of the reasons we know this to be true. Paper after paper has shown us the limitations of taxonomic concepts developed during the era of morphology-based taxonomy. In the case of diatoms we are now in the relatively healthy position of a synergy between molecular and morphological taxonomy, but the outcomes usually indicate far more diversity than we are likely to be able to catalogue using formal Linnaean taxonomy, so this is not a plausible option in the short to medium term.

If we play to a set of views that is interested primarily in the end-product, and is less interested in how this is achieved, then it is possible that taxonomy-free approaches, such as those advocated by Jan Pawlowski and colleagues, would be as effective as methods that use traditional taxonomy. As no particular expertise is required to collect a phytobenthos sample, and the molecular and computing skills required are generic rather than specific to microalgae, the entire process could by-pass anyone with specialist understanding altogether. The big advantages are that it overcomes the limitations of a dependence on libraries of barcodes of known species and, as a result, that it does not need to be limited to particular algal groups. It also has the greatest potential to be streamlined and, so, is likely to be the cheapest way to generate usable information. However, two big assumptions are built into this version of the Imitation Game: first, that there is absolutely no added value from knowing what species are present in a sample and, second, that it is, actually, legal. The second point relates to the requirement in the Water Framework Directive to assess “taxonomic composition”, so we also need to ask whether a list of “operational taxonomic units” (OTUs) meets this requirement.

In between these two extremes, we have a range of options whereby there is some attempt to align molecular barcode data with taxonomy, but stopping short of trying to catalogue every species present. Maybe the OTUs are aggregated to division, class, order or family rather than to genus or species? That should be enough to give some insights into the structure of the microbial world (and be enough to stay legal!) and would also bring some advantages. Several of my posts from this summer have been about the strange behaviour of rivers during a heatwave and, having commented on the prominence and diversity of green algae during this period, it would be foolish to ignore a method that would pick up fluctuations between algal groups better than our present methods. On the other hand, I’m concerned that an approach that only requires a match to a high-level taxonomic group will enable bioinformaticians and statisticians to go fishing for correlations with environmental variables without needing a strong conceptual framework behind their explorations.

My final version of the Imitation Game is the one played by the biologists in the laboratories around the country who are simultaneously generating the data used for national assessments and providing guidance on specific problems in their own local areas. Molecular techniques may be able to generate the data but can they explain the consequences? Let’s assume that a method in the near future aggregates algal barcodes into broad groups (greens, blue-greens, diatoms and so on) and that some metrics derived from these offer correlations with environmental pressures as strong as or stronger than those that are currently obtained. The green algae are instructive in this regard: they encompass an enormous range of diversity from microscopic single cells such as Chlamydomonas and Ankistrodesmus through colonial forms (Pediastrum) and filaments, up to large thalli such as Ulva. Even amongst the filamentous forms, some are signs of a healthy river whilst others can be a nuisance, smothering the stream bed with knock-on consequences for other organisms. A biologist, surely, wants to know whether the OTUs represent single cells or filaments, and that will require discrimination of orders at least, but in some cases this level of taxonomic detail will not be enough. The net alga, Hydrodictyon (discussed in my previous post), is in the same family as Pediastrum, so we will need to be able to discriminate separate genera in this case to offer the same level of insight as a traditional biologist can provide. We’ll also need to discriminate blue-green algae (Cyanobacteria) at least to order if we want to know whether we are dealing with forms that are capable of nitrogen fixation – a key attribute for anyone offering guidance on their management.
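The loss of resolution described above can be sketched in a few lines of Python. The OTU table, read counts and function name are hypothetical, invented purely to show what happens when the same reads are aggregated at different ranks; only the placement of Pediastrum and Hydrodictyon in one family comes from the discussion above.

```python
from collections import defaultdict

# Hypothetical OTU table: read counts plus whatever lineage could be assigned.
OTUS = [
    {"id": "OTU1", "reads": 500,
     "lineage": {"division": "Chlorophyta", "family": "Hydrodictyaceae",
                 "genus": "Pediastrum"}},
    {"id": "OTU2", "reads": 300,
     "lineage": {"division": "Chlorophyta", "family": "Hydrodictyaceae",
                 "genus": "Hydrodictyon"}},
    {"id": "OTU3", "reads": 150, "lineage": {"division": "Bacillariophyta"}},
    {"id": "OTU4", "reads": 50, "lineage": {}},
]


def aggregate(otus, rank):
    """Sum reads per taxon at the chosen rank and return proportions."""
    totals = defaultdict(int)
    for otu in otus:
        # OTUs with no assignment at this rank are pooled as "unassigned".
        totals[otu["lineage"].get(rank, "unassigned")] += otu["reads"]
    grand_total = sum(totals.values())
    return {name: reads / grand_total for name, reads in totals.items()}


for rank in ("division", "family", "genus"):
    print(rank, aggregate(OTUS, rank))
```

At division and family level the colonial Pediastrum and the net-forming Hydrodictyon are pooled into a single figure; only the genus-level table separates them, and at the cost of a larger "unassigned" fraction. That trade-off is the one a local biologist would have to weigh.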

The primary practical role of Linnaean taxonomy, for an ecologist, is to organize data about the organisms present at a site and to create links to accumulated knowledge about the taxa present.    For many species of microscopic algae, as I stressed in “Murder on the barcode express …”, that accumulated knowledge does not amount to very much; but there are exceptions.  There are 8790 records on Google Scholar for Cladophora glomerata, for example, and 2160 for Hydrodictyon reticulatum.  That’s a lot of wisdom to ignore, especially for someone who has to answer the “so what” questions that follow any preliminary assessment of the taxa present at a site.  But, equally, there is a lot that we don’t know and molecular methods might well help us to understand this.   There will be both gains and losses as we move into this new era but, somehow, blithely casting aside hard-won knowledge seems to be a retrograde step.

Let’s end on a subversive note: I started out by asking whether “machines” (as a shorthand for molecular technology) can do the same as humans but the drive for efficiency over the last decade has seen a “production line” ethos creeping into ecological assessment.   In the UK this has been particularly noticeable since about 2010, when public sector finances were squeezed.   From that point on, the “value added” elements of informed biologists interpreting data from catchments they knew intimately started to be eroded away.   I’ve described three versions of the Imitation Game and suggested three different outcomes.  The reality is that the winners and losers will depend upon who makes the rules.  It brings me back to another point that I have made before (see “Ecology’s Brave New World …”): that problems will arise not because molecular technologies are being used in ecology, but due to how they are used.   It is, in the final analysis, a question about the structure and values of the organisations involved.

References

Apothéloz-Perret-Gentil, L., Cordonier, A., Straub, F., Iseli, J., Esling, P. & Pawlowski, J. (2017). Taxonomy-free molecular diatom index for high-throughput eDNA monitoring. Molecular Ecology Resources 17: 1231-1242.

Turing, A. (1950). Computing machinery and intelligence. Mind 59: 433-460.

As if through a glass darkly …

Life used to be so easy: I stared down my microscope, named the diatoms I could see, counted them and, from these data, made an evaluation of the quality of the ecosystem that I was studying.   Along with the majority of my fellow diatomists, I conveniently ignored the fact that I was looking at dead cell walls rather than living organisms.   My work on molecular barcodes as an alternative to traditional microscopy has been revelatory as I try to reconcile these two types of data.   At one level, what I see down the microscope is a benchmark for what I should expect to see in my barcode output.  Yet, at the same time, the differences between the two types of data show up the limitations of traditional data – and the assumptions that underpin the ways that we work.

Take a look at the plate below which shows two of the most common diatoms in UK rivers: Ulnaria ulna is one of the largest that I encounter regularly whilst Achnanthidium minutissimum is often one of the most abundant in my samples, particularly when the level of human pressure is relatively low.  When we analyse samples with the light microscope, we record individuals, so both of these score “1” in my data book despite the fact that U. ulna is about 100x larger (by volume) than A. minutissimum.

Specimens of Ulnaria ulna (top) and Achnanthidium minutissimum (bottom).  Both are from cultures used for obtaining sequences for the reference library for our molecular barcoding project.   Scale bar: 10 µm.   Photographs: Shinya Sato, Royal Botanic Gardens, Edinburgh.
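A small worked example shows how much the choice of unit matters. The 100:1 volume ratio is the approximate figure quoted above; the counts are invented for illustration, and the names in the code are my own.

```python
# Approximate relative cell volumes (the ~100x ratio quoted in the text).
RELATIVE_VOLUME = {"Ulnaria ulna": 100.0, "Achnanthidium minutissimum": 1.0}

# Invented tally from a hypothetical light-microscope count.
counts = {"Ulnaria ulna": 10, "Achnanthidium minutissimum": 90}


def proportions(weights):
    """Convert a dict of non-negative weights into proportions summing to 1."""
    total = sum(weights.values())
    return {taxon: value / total for taxon, value in weights.items()}


by_count = proportions(counts)
by_volume = proportions(
    {taxon: n * RELATIVE_VOLUME[taxon] for taxon, n in counts.items()}
)

for taxon in counts:
    print(f"{taxon}: {by_count[taxon]:.1%} of cells, {by_volume[taxon]:.1%} of volume")
```

In this made-up sample U. ulna is 10% of the individuals but roughly 92% of the cell volume, so a dataset that counts heads and one that tracks biomass would tell quite different stories about the same slide.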

When we analyse a sample using Next Generation Sequencing (NGS), we count not cell walls but copies of the rbcL gene, which provides the blueprint for Rubisco, a key photosynthetic enzyme. As I write, there is no clear understanding of how the number of rbcL copies relates to the number of individuals. We know that each chloroplast within a cell will have at least one copy of this gene, and usually several. There is also some evidence that larger chloroplasts have more copies of the gene than smaller ones and there is also likely to be a measure of environmental control. The key message that I try to get across in my talks is that NGS data are different to the data we are used to gathering using microscopy. These differences do not mean that they are wrong, just that we need to leave some of our preconceptions behind before starting to interpret this new type of data.

However, we could also argue that counting the number of copies of the gene for an important photosynthesis enzyme should be giving us a better insight into the contribution of a species to primary productivity than counting the number of cell walls. In other words (whisper this …), rbcL might not just be different, it might be better, especially if our purpose is to understand the contribution the various species in the biofilm make to primary productivity in stream ecosystems. At the moment there are plenty of problems with the NGS-based method, not least the fact that we often cannot assign half the copies of the rbcL gene in a sample to a species, but the situation is improving all the time …

Some recent work pushes this a little further. Jodi Young and colleagues at Princeton University have demonstrated large variation in the kinetics of Rubisco in diatoms, and in their carbon-concentrating mechanisms (see “Concentrating on carbon …” for more about these). Although their work is focussed on marine phytoplankton, the variation within Rubisco and carbonic anhydrases could go some way to explaining the sensitivity of diatoms to inorganic carbon (see “Ecology in the Hard Rock Café …”). In other words, rbcL is not an irrelevant DNA sequence, as the term “barcode” may imply (in contrast to barcodes based on the ITS region, for example): it is deeply implicated in the reasons why a species lives in a particular place.

And yet, and yet, and yet … The same could be argued for morphology, up to a point at least. The shape of a Gomphonema or a Navicula also helps us to understand the organism’s relationship with its environment. The problem is that modern taxonomists tend to focus on a much finer level of detail – on the arrangement and structure of the various pores on the silica frustule, for example – and offer few insights into what these minute differences mean in terms of the ecophysiology of the organisms. Even at the whole-cell scale, information on habit, which is linked to form (Gomphonema tending to live on stalks or short mucilage pads secreted from their foot poles for at least part of their life-cycle, for example), is rarely incorporated into assessment systems. The move from using light microscopy to using NGS, in other words, means replacing a familiar but imperfect system with one that we are still learning to understand. Both offer unique information, and the gains from using one approach rather than the other will be offset by losses of insight.

That leaves us with two big challenges over the next couple of years, as UK diatom-based assessments move from light microscopy to NGS. The first is to work harder to understand what NGS outputs are actually telling us about the environment over and above the minimalist ecological status indices that spew out of our “black box” computer programs. The second is to maintain an understanding of the properties of whole organisms and how these interact with one another and with their environments. I guess I should add a third challenge to this pair: persuading middle managers who have at best a sketchy understanding of diatoms and phytobenthos, and already-stretched budgets, that any of this matters …

References

Badger, M.R. & Price, G.D. (1994). The role of carbonic anhydrase in photosynthesis. Annual Review of Plant Physiology and Plant Molecular Biology 45: 369-392.

Young, J.N. & Hopkinson, B.M.M. (2017).  The potential for co-evolution of CO2-concentrating mechanisms and Rubisco in diatoms.  Journal of Experimental Botany doi: 10.1093/jxb/erx130.

Young, J.N., Heureux, A.M.C., Sharwood, R.E., Rickaby, R.E.M., Morel, F.M.M. & Whitney, S.M. (2016).  Large variations in the Rubisco kinetics of diatoms reveals diversity among their carbon-concentrating mechanisms.  Journal of Experimental Botany 67: 3445-3456.