In ecology and evolution we work with models all of the time. Perhaps, if this were more widely known, we would find our class sizes swelling. Of course, students expecting long blonde hair, white teeth or rippling six packs might be a bit disappointed when they realised that models mean something quite different in science… Models are simplified versions of reality: we make some assumptions that allow us to predict the outcome of a process. Usually models are quantitative.

For example, one of my PhD students, Emily Fountain, is working on one of the most endangered insects in New Zealand, the Canterbury knobbled weevil. It would be helpful if we knew just how many individuals are left in their final remaining population. Capture and mark some members of the population, look at how many of these marked individuals are later recaptured, add in some assumptions about the life history of the weevils and the sampling process, and you have a model that allows Emily to predict the likely population size for the area. Emily is in the process of using these models, so hopefully I can tell you more about this population in a future blog.

One area in which models are very important is when we use DNA to reconstruct the evolutionary history of species. The methods used to obtain evolutionary trees from DNA sequence data usually require a model of how nucleotides change over time. Such change is influenced by the structure of the nucleotide molecules, their position within a gene, which gene region they sit in, and so on. Many models have been developed to take these different factors into account.
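The weevil example above is the classic mark–recapture idea, and its simplest version is the Lincoln–Petersen estimator. Here is a minimal sketch (the numbers are invented for illustration, not Emily's actual data, and the simple version assumes a closed population and equally catchable weevils):

```python
def lincoln_petersen(marked_first, caught_second, recaptured):
    """Estimate population size as N ~ (M * C) / R.

    marked_first:  animals caught and marked on the first visit (M)
    caught_second: animals caught on the second visit (C)
    recaptured:    of those, how many already carried marks (R)
    """
    if recaptured == 0:
        raise ValueError("no recaptures: cannot estimate population size")
    return (marked_first * caught_second) / recaptured

# If we mark 30 weevils, later catch 25, and find 5 carrying marks,
# the model predicts roughly (30 * 25) / 5 = 150 weevils in total.
print(lincoln_petersen(30, 25, 5))  # → 150.0
```

The logic is simply that the fraction of marked animals in the second catch (5/25) should mirror the fraction of the whole population we marked the first time (30/N).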
A group of Lincoln University researchers led by Rupert Collins, including Rob Cruickshank, Karen Armstrong and Laura Boykin, have been messing around with models. Rupert and his colleagues are especially interested in DNA barcoding. Basically, all species should be identifiable if you look at their DNA, and if you look at one gene region in particular – COI (the ‘barcode gene’) – there is usually enough information present to confirm what species you are looking at.

The main interest in DNA barcoding at Lincoln is for bioprotection. Let’s say that some grapes turn up in the Port of Auckland with insect larvae crawling around in them. There is usually no way to tell what species they are by looking at the larvae – they’re all small, white and wriggly. We can’t tell whether this is a harmless species or something that could do unspeakable harm to our fruit industry. By looking at their barcode DNA we can quickly and easily match these mysterious larvae to a species. Well, in theory at least. There are a few complications, and one of them is with the models. When analysing the DNA data, a model is selected to construct the tree more accurately and find the right match. The default model for DNA barcoding is known as the K2P model, which assumes that changes within a chemical class of nucleotide (purine to purine, or pyrimidine to pyrimidine) happen at a different rate from changes between the classes, and that all four nucleotides are present in equal proportions. Rupert has published a paper, “Barcoding’s next top model”, in Methods in Ecology and Evolution that tests whether K2P really is the top model.
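To make the K2P assumptions concrete, they can be sketched as a substitution-rate matrix: one rate for transitions (changes within the purines A/G or within the pyrimidines C/T), another for transversions (changes between the classes), with all four bases assumed equally common. The rate values below are arbitrary illustrations, not fitted parameters:

```python
# A sketch of the K2P model as a substitution-rate matrix.
# alpha = transition rate (A<->G, C<->T), beta = transversion rate.
alpha, beta = 2.0, 0.5
bases = ["A", "G", "C", "T"]
purines = {"A", "G"}  # C and T are the pyrimidines

def rate(i, j):
    """Instantaneous rate of change from base i to base j."""
    if i == j:
        return -(alpha + 2 * beta)  # diagonal: each row must sum to zero
    # same chemical class = transition; different class = transversion
    return alpha if (i in purines) == (j in purines) else beta

Q = [[rate(i, j) for j in bases] for i in bases]
for base, row in zip(bases, Q):
    print(base, row)
```

The whole model is just these two parameters (plus the equal-frequencies assumption), which is exactly why it is so easy to adopt as a default.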
Rupert and his colleagues examined 14 data sets containing over 14,000 barcodes in total, from a variety of species (butterflies, birds and bats), and for each data set compared 22 different models that can be used in DNA analyses, to see which provided the best results. Most of these models were much more sophisticated than K2P. Interestingly, most models, including K2P itself, were almost never selected as the best. Two related models were usually selected instead: HKY, which differs from K2P only in allowing unequal nucleotide frequencies, and the simpler F81, which also allows unequal frequencies but assumes a single substitution rate. However, almost all of the models gave similar results; there wasn’t a lot between any of them. Of greatest interest, when the researchers didn’t use a model at all, the outcome was often as good as, or sometimes better than, the analyses with models. Rupert and his colleagues conclude that simply leaving DNA barcoding analyses to a default model is not a good idea. So we’re certainly better off without a kitset model, and maybe we’re better off without those pesky models after all.
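For the curious, the "no model" analysis corresponds to the uncorrected p-distance: simply the proportion of sites at which two sequences differ. A toy comparison (with made-up 100-bp sequences, not data from the paper) shows why, at the shallow divergences typical of barcoding, it tracks the K2P-corrected distance so closely:

```python
import math

def p_distance(s1, s2):
    """'No model': the raw proportion of differing sites."""
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)

def k2p_distance(s1, s2):
    """K2P-corrected distance, shown here only for comparison.
    P = proportion of transitions, Q = proportion of transversions."""
    purine = lambda x: x in "AG"
    n = len(s1)
    P = sum(a != b and purine(a) == purine(b) for a, b in zip(s1, s2)) / n
    Q = sum(a != b and purine(a) != purine(b) for a, b in zip(s1, s2)) / n
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))

# Two toy sequences about 2% divergent, roughly typical for COI
# comparisons between species:
s1 = "ACGT" * 25
s2 = "GAGT" + "ACGT" * 24  # one transition (A->G), one transversion (C->A)
print(round(p_distance(s1, s2), 4))    # 0.02
print(round(k2p_distance(s1, s2), 4))  # 0.0203 -- almost identical
```

The correction only matters when many substitutions have piled up on top of each other at the same sites; barcode comparisons between close species rarely get that far.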