Searching for gene mutations: a search engine for microbial DNA data


Dr Zamin Iqbal, Research Group Leader at EMBL-EBI, spoke to SciTech Europa about BIGSI – a search engine for microbial DNA data.

Researchers at the European Molecular Biology Laboratory’s (EMBL) European Bioinformatics Institute (EBI) have created a new search engine, called BIGSI, which allows scientists to search public microbial DNA data for specific genes and mutations. This could help researchers monitor the spread of antibiotic resistance genes, and understand how bacteria and viruses evolve and adapt.

The EBI team have combined their knowledge of bacterial genetics and web search algorithms to build a DNA search engine for microbial DNA data. The search engine could enable researchers and public health agencies to use genome sequencing data to monitor the spread of antibiotic resistance genes. By making this vast amount of data discoverable, the search engine could also allow researchers to learn more about bacteria and viruses.
SciTech Europa asked Dr Zamin Iqbal, Research Group Leader at EMBL-EBI, about how and why BIGSI was developed and how it might evolve.

Why was the creation of BIGSI necessary, and how did you approach it?

The sequencing of the human genome was a milestone in biology, and we hear a lot these days about how sequencing of our DNA can be used for medical, forensic and other purposes. However, viewed from the perspective of all biology on our planet, human genomes are very pedestrian.

Our DNA varies relatively little from person to person, and the dominant process is simply inheritance of DNA by a child from their parents. The side-effect of these two things is that many computational methods are focussed on human genomes, and they can make a lot of assumptions about how similar those genomes are.

Another way of looking at it is to imagine you were writing some kind of program to compress and store genomes. Human genomes compress really well. By contrast, bacteria have been around for billions of years and now include millions of species, and their genomes are much more variable. What is more, the differences between species are huge, and even within species, bacterial genomes are super-flexible – pieces of DNA can move between unrelated bacteria.

As a result, what you have is much more of a fluid mosaic of DNA. And actually, the ‘right way’ to look at bacterial genomes depends on your question. You might be interested in getting a sense of how something is passed down from a parent to a child; or you might be interested in the hitchhikers who hang out in one species for a bit and then move on somewhere else.

That is essentially the background to the development of BIGSI: bacterial DNA is very variable, and we might want to look at it in multiple ways. Then, we can ask ourselves: what happens when scientists sequence genomes? They conduct whatever experiment they’re working on and then add the data to one of the global archives, so that other scientists can check what they have done and potentially reproduce the research. As a result, we have accumulated a huge amount of data and, as sequencing prices have dropped, this has accelerated. But now, there is so much data that no-one actually knows what is in there.

At a personal level, I spent half of my professional life building fundamental computational methods for studying genomes and species. The other half I spent working on applying them to healthcare problems. We have worked, for instance, on TB diagnostics from sequence data (see: ), and on developing methods to run them on handheld devices without slow culture (see: ).

I run the sequence analysis for a global consortium that is now analysing 100,000 TB genomes and measuring drug resistance in half of them in order to get a better understanding of how to predict resistance from genome sequence (see: ; our first paper, showing that the sequence data is good enough to replace phenotyping for the four first-line TB drugs, can be accessed here: ). We have also started moving those things out into the field (see: ).

The bottom line is that sequencing-based diagnostics are going to become a reality. And if that happens, we are going to see huge volumes of sequencing from people who are not necessarily doing science, but who just want an answer to their diagnostic question. If we can make it easy for them to archive their data, then suddenly we generate an unprecedented treasure trove of data.

As an example: I might have a TB patient, and I would like to know if anyone has seen that strain of M. tuberculosis before, or whether anyone has seen this resistance mutation before, or whether I have a cluster of very similar strains which might be an outbreak. All of these questions are mediated by sequence search.

While we have been working on diagnostics over the last four years, Phelim Bradley (first author, and then my PhD student) and I have also been working on methods to store and query large numbers of genomes. When we started, there was absolutely no way to do this for microbial DNA data on this scale (some people had done things for humans, but the problem is different because human genomes are so similar). I specifically wanted to make it useful for my TB goals, and also for all the other things that occur in bacteria – plasmids, mobile genes, drug resistance genes, etc.

The key to making BIGSI work was finding a way to build a search index that could cope with the diversity of microbial DNA data. Can you explain more about this?

One way to look at this is that we are trying to build something of a Google index. Normally with Google, you type in a term and it looks for webpages associated with those words and gives you what it thinks you will like most at the top. You have keywords, and you find webpages containing them. In our case, we can break a long DNA string (such as a bacterial gene, which might be 2,000 bases, or characters, long) into the ‘words’ within it. Thus, ‘AAGCTC’ might be broken into ‘AAG’, ‘AGC’, ‘GCT’ and so on (except we use slightly longer words).
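
The word-splitting idea can be sketched in a few lines of Python. This is only an illustration: the function name `kmers` and the word length `k=3` are chosen here to match the toy example above, whereas the real index uses longer words.

```python
def kmers(sequence: str, k: int = 3) -> list[str]:
    """Return every overlapping length-k 'word' in a DNA string, in order."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Slide a window of width k along the sequence, one base at a time.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmers("AAGCTC"))
# ['AAG', 'AGC', 'GCT', 'CTC']
```

A sequence shorter than `k` simply yields no words.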

Then, we search for which documents (genome sequence data files) contain those words. In our case, we may have a significant number of search terms, and sometimes we may want to use them all.
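
As a rough sketch of that search step, assume each genome data file has already been reduced to the set of words it contains. A query can then be scored by the fraction of its words found in each file. The names `search`, `datasets` and `threshold` are illustrative and not taken from BIGSI; a threshold of 1.0 corresponds to requiring every query word.

```python
def kmers(sequence: str, k: int = 3) -> set[str]:
    """Set of overlapping length-k words in a DNA string."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def search(query: str, datasets: dict[str, set[str]], k: int = 3,
           threshold: float = 1.0) -> dict[str, float]:
    """Return datasets in which at least `threshold` of the query words occur."""
    words = kmers(query, k)
    if not words:
        return {}
    hits = {}
    for name, dataset_words in datasets.items():
        fraction = len(words & dataset_words) / len(words)
        if fraction >= threshold:
            hits[name] = fraction
    return hits


# Toy "archive": two made-up sequences, each stored as its set of 3-letter words.
datasets = {
    "sample_A": kmers("TTAAGCTCGG"),
    "sample_B": kmers("AAGCAAGG"),
}
print(search("AAGCTC", datasets))                 # every word must match
# {'sample_A': 1.0}
print(search("AAGCTC", datasets, threshold=0.5))  # half of the words are enough
# {'sample_A': 1.0, 'sample_B': 0.5}
```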

Google indexes 10^12 documents containing around 10^8 words and, as it indexes more webpages, the lexicon essentially stays the same. Human languages don’t change that fast. For us, however, we indexed of the order of 10^6 documents, and found they contained 10^10 unique words, and every new genome we add also adds numerous new words. In particular, we have only sequenced a tiny fraction of bacterial life, and there is much more variation to come.

The methods that existed before were dependent on having a full list of all the words in all the datasets, and scaled with that list. That works with humans, but we had to get away from that limitation; we needed to scale with the number of datasets, not datasets times the total number of words. We therefore developed a way to store each dataset as a compressed, fixed-length string of 0s and 1s. Put together, that makes a large table which grows in one dimension only: more samples add more columns.
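
A toy version of such a table can be written as follows. It is a sketch under stated assumptions, not the BIGSI implementation: it assumes each dataset's bit string is a Bloom filter (each word is hashed to a few positions in a fixed-length bit array), and the sizes, hashing scheme and class name `ToyBitIndex` are made up for illustration. Each row of the table is held as one Python integer whose bits are the columns, so adding a sample adds one column, and a query only needs to AND together the rows its words hash to.

```python
import hashlib


class ToyBitIndex:
    """Illustrative table of per-sample bit strings, stored row by row."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3, k: int = 3):
        self.num_bits = num_bits      # fixed length of every sample's bit string
        self.num_hashes = num_hashes  # positions set per word (assumed value)
        self.k = k                    # word length (toy value)
        self.rows = [0] * num_bits    # rows[r] has bit c set if sample c has bit r set
        self.samples: list[str] = []  # column number -> sample name

    def _positions(self, word: str) -> list[int]:
        # Derive several bit positions per word from seeded SHA-256 digests.
        positions = []
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.num_bits)
        return positions

    def _words(self, sequence: str) -> set[str]:
        return {sequence[i:i + self.k] for i in range(len(sequence) - self.k + 1)}

    def add(self, name: str, sequence: str) -> None:
        """Append one column; the number of rows never changes."""
        column = len(self.samples)
        self.samples.append(name)
        for word in self._words(sequence):
            for row in self._positions(word):
                self.rows[row] |= 1 << column

    def search(self, query: str) -> list[str]:
        """Samples that may contain every query word (false positives possible)."""
        words = self._words(query)
        if not words:
            return []
        candidates = (1 << len(self.samples)) - 1  # start with all columns set
        for word in words:
            for row in self._positions(word):
                candidates &= self.rows[row]
        return [name for col, name in enumerate(self.samples)
                if (candidates >> col) & 1]


index = ToyBitIndex()
index.add("sample_A", "TTAAGCTCGG")
index.add("sample_B", "AAGCAAGG")
print(index.search("AAGCTC"))  # always includes sample_A, which contains every word
```

A sample that truly contains all the query words is always reported. Because different words can hash to the same positions, a sample may occasionally be reported in error, so hits from a structure like this are best treated as candidates to confirm against the underlying data.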

Where do you feel the biggest application areas will be for BIGSI?

In its raw form, BIGSI is just a triage system that allows you to scan through the global archive and see if there is something useful. A colleague had a drug resistance plasmid from a Salmonella outbreak that he had never seen before. He put it into BIGSI and immediately found it in several E. coli strains, meaning that it had jumped the species barrier.

We give an example in our paper to show how you can look for things (we used plasmids as the example) and then see which species they are in, to get a sense of how far they have spread.

Another example is for bringing new antibiotics to the market. When testing a new antibiotic, pharmaceutical companies will look to see if resistance evolves rapidly in the lab. If they find any mutations which are candidates for mediating resistance, BIGSI allows them to check if there is pre-existing resistance to their drug out there in the world already.

I believe that a number of services will be built on top of the BIGSI engine moving forwards. I am currently working on one for TB that allows you to look for related strains, drug resistance mutations and so on.

Other obvious things are automated ‘typing’ (microbiologists and epidemiologists classify bacteria according to their sequence, so they essentially have lookup lists checking for certain sequences), and global surveillance of drug resistance (especially as well-controlled structured surveys start to emerge, adding good metadata for where the samples were taken).

How do you plan to ensure that those working in these areas are aware of BIGSI and its potential?

We have built a demo, and we have indexed a snapshot of the global data. I work at the European Bioinformatics Institute (EBI), whose mission is to make the world’s biological data available and useful. We are therefore now in the process of building a live system that keeps up with the data as it comes in. When we indexed our snapshot for the paper in December 2016, there were 450,000 bacterial and viral datasets (we indexed both). Now, we have indexed about 1.2 million so far, and once we have absorbed the backlog, keeping up with inflow shouldn’t be a problem. At the same time, we can start to build some services on top, so people can access them via the web or via computer programs.

EBI has over 3 million scientific users per month (see: ), and so we have a direct route to a lot of users already.

Given that the amount of sequenced microbial DNA data is doubling every two years, will BIGSI continue to evolve to incorporate this? What challenges will this involve?

Yes, it will, and BIGSI was built with this in mind. The system is set up to be incrementally augmented. We also have a background project which we started 18 months ago to work on a second version of BIGSI which is much more efficient, and that is progressing well, so I am reasonably confident we can keep ahead of the curve.

Moving forwards, is it likely that genome sequencing will become more widespread and utilised in various sectors outside of the basic research environment?

Yes! Clinical tests, food standard tests, tests when you land in Australia, tests when you leave quarantined farms, etc.

How, then, do you see BIGSI developing in the future to be able to cater for this? And, in a more general sense, what are your hopes for the future of BIGSI?

BIGSI is not the end of the line; it is just the beginning. Once you’ve shown something is possible, someone smarter always finds a better way of doing things, and there are a lot of people out there much smarter than us. Now that people see it is an issue, and they can download and look at the data and understand how it is different to the English language and to human genomes, new ideas will flourish. We have ideas and will push them, and we look forward to the field itself flourishing and learning from others.

Personally, I want us at EBI, the home of the archives, to provide BIGSI and, in the future, better-than-BIGSI services, ideally with layers of services on top.

Dr Zamin Iqbal
Group Leader
Tweet @emblebi
