Professor Sören Auer discusses the Big Data Europe project, as well as some of the main challenges in the Big Data space, particularly in the life sciences sector.
The BigDataEurope project holds that the growing digitisation and networking process within our society has a large influence on all aspects of everyday life. Large amounts of data are being constantly produced, and when these are analysed and interlinked they have the potential to create new knowledge and intelligent solutions for both the economy and society. Big Data can make important contributions to the technical progress in our societal key sectors and help shape business. What is needed are innovative technologies, strategies and competencies for the beneficial use of Big Data to address societal needs.
In the healthcare and life sciences space, Big Data Europe worked with the OpenPHACTS Discovery Platform, which was developed to reduce barriers to drug discovery in industry, academia and for small businesses.
OpenPHACTS contains the data sources already being used by those working in this area, but integrates and links them together so that the relationships between compounds, targets, pathways, diseases and tissues are easily discernible. Data sources include ChEBI, ChEMBL, SureChEMBL, ChemSpider, ConceptWiki, DisGeNET, DrugBank, Gene Ontology, neXtProt, and UniProt.
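The kind of integration described above can be pictured as a small linked-data graph: records from separate sources become connected once they share common identifiers. The sketch below is purely illustrative; only the source names (ChEMBL, UniProt, DisGeNET) come from the text, while the compounds, predicates and the `related` helper are invented for the example.

```python
# Illustrative sketch of linked-data integration. Each triple records a
# relationship and the (real) source it hypothetically came from; the
# entity names and predicates themselves are invented.
triples = [
    # (subject, predicate, object, source)
    ("compound:X1", "inhibits", "target:P100", "ChEMBL"),
    ("target:P100", "encoded_by", "gene:G7", "UniProt"),
    ("gene:G7", "associated_with", "disease:D3", "DisGeNET"),
]

def related(entity, triples):
    """Follow links transitively from an entity across all sources."""
    found, frontier = set(), {entity}
    while frontier:
        nxt = {o for s, _, o, _ in triples if s in frontier} - found
        found |= nxt
        frontier = nxt
    return found

# Starting from a compound, the chain compound -> target -> gene -> disease
# becomes discernible even though each link lives in a different source.
print(related("compound:X1", triples))
```

Because the three sources share identifiers for targets and genes, a relationship that no single source contains (compound to disease) emerges from the combined graph.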
SciTech Europa Quarterly asked Big Data Europe’s co-ordinator, Professor Sören Auer (who is now Director of the Leibniz Information Centre for Science and Technology (TIB) and Professor of Data Science and Digital Libraries at the Leibniz University of Hannover) about the main achievements of the BigDataEurope project and some of the challenges in the Big Data space, particularly in the life sciences sector.
How would you describe the way in which the sheer amount of data being generated across sectors has developed in recent years and what do you think have been the biggest challenges to emerge alongside this?
The Big Data problem is characterised by the three ‘Vs’: volume, velocity, and variety. One particular challenge, perhaps, concerns variety, or the heterogeneity of different data modalities, of different data strategies, of different understandings of the data, of different governance, different licensing schemes, and so on. There is a lot of heterogeneity, especially when we look into domains such as the interlinking and connecting of data from different stakeholders and different organisations.
To determine requirements for Big Data use, many interviews need to be conducted with stakeholders and with the relevant communities, and while this is something we did within the BigDataEurope project, it is an element that has not been sufficiently addressed by Big Data efforts before or since. Indeed, while we made significant progress in the area of variety within the project, this is an area that needs more attention, as we are far from having solved this issue yet.
A particular challenge is developing a common understanding of the meaning of the data we hold in distributed settings. Furthermore, in those scenarios where there are large amounts of homogeneously structured data, the use cases which benefit from Big Data have already been developed and exploited. In the coming years, therefore, the biggest potential will lie in identifying where we want to integrate data from different sources and so create value out of the heterogeneity, and that can be done using vocabularies and ontologies.
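One way to picture how a shared vocabulary creates that common understanding: each source keeps its own field names, and a per-source mapping translates records into agreed vocabulary terms. This is a minimal sketch; all field and term names are invented for illustration.

```python
# Minimal sketch of vocabulary-based integration: two heterogeneous
# sources describe the same kind of measurement with different field
# names, and a mapping per source lifts records into one shared schema.
# All names here are invented for the example.
SHARED_VOCAB = {"name", "measured_value", "unit"}

mappings = {
    "source_a": {"label": "name", "val": "measured_value", "uom": "unit"},
    "source_b": {"title": "name", "reading": "measured_value", "units": "unit"},
}

def to_shared(record, source):
    """Translate a source-specific record into shared vocabulary terms."""
    mapped = {mappings[source][k]: v for k, v in record.items()}
    assert set(mapped) <= SHARED_VOCAB  # every term must be in the vocabulary
    return mapped

a = to_shared({"label": "temp", "val": 21.5, "uom": "C"}, "source_a")
b = to_shared({"title": "temp", "reading": 70.7, "units": "F"}, "source_b")
# Both records now share one schema and can be compared or merged.
```

Real vocabularies and ontologies add far more (term definitions, hierarchies, constraints), but the core move is the same: agree on the terms once, then map each source onto them.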
Do you think that this interoperability issue can be solved across disciplines, too?
Yes, we do indeed need to solve this across sectors. We need more flexibility, and we need to evolve the standards and/or interoperability frameworks – things like vocabularies, ontologies, and taxonomies, for example.
In the life sciences, many of these fortunately already exist because many of the communities here discovered ten or fifteen years ago that they needed an enhanced level of interoperability. In other domains, there are challenges inherent in bringing the different communities together. A good example of where it works, however, is the schema.org initiative, which is an interoperability vocabulary for data which is exchanged on the web. Indeed, some 20-30% of web pages now contain mark-up according to this scheme, which is driven by Google and other major search engines.
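As a concrete illustration of the schema.org mark-up mentioned above: it is typically embedded in web pages as a small JSON-LD block inside a `<script type="application/ld+json">` tag. The property names below (`@context`, `@type`, `headline`, `author`) are real schema.org terms; the values are invented for the example.

```python
import json

# A minimal schema.org mark-up snippet in JSON-LD form, the kind of
# structured annotation search engines read from web pages. Property
# names are genuine schema.org vocabulary; the values are placeholders.
markup = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Big Data and interoperability",
    "author": {"@type": "Person", "name": "Example Author"},
}

# This string would be placed inside a <script type="application/ld+json">
# element in the page's HTML.
print(json.dumps(markup, indent=2))
```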
We need to have similar initiatives in other areas and, of course, they need to be linked and to be interoperable with each other. To return to the life sciences, initiatives such as the NCBO BioPortal and OpenPHACTS and many others in the industrial internet space or within what is known as ‘Industry 4.0’ are helping.
We have also begun another initiative parallel to the BigDataEurope project, which is called the ‘Industrial Data Space’, and which is working to develop interoperability standards along different value chains, between companies and enterprises, and we need a similar effort in other areas as well.
Are there other innovative technologies, strategies and competencies which need to be developed in order to add value to Big Data and so address societal challenges?
One particular strategy could be of use here, and that is to organise Big Data using knowledge graphs. Google does this, and it is a great way of professionally integrating a vast amount of data.
We are now seeing efforts emerging in different companies and organisations to integrate data using knowledge graphs and using different data. A knowledge graph can help to interpret or understand how to link data from different sources, and this can be done inside companies or organisations or, indeed, across different organisations.
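The cross-organisation case can be sketched very simply: if two organisations model their data as triples and use shared identifiers, integrating their knowledge graphs is essentially a union, and links spanning both organisations emerge. All identifiers below are invented for the example.

```python
# Sketch of cross-organisation knowledge-graph integration. Each graph is
# a set of (subject, predicate, object) triples; merging is a set union,
# provided the organisations agree on identifiers. All names are invented.
org_a = {("customer:42", "bought", "product:P9")}
org_b = {("product:P9", "manufactured_by", "supplier:S1")}

graph = org_a | org_b  # integration reduces to union when identifiers align

def facts_about(entity, graph):
    """Return every triple in which the entity appears."""
    return {t for t in graph if entity in (t[0], t[2])}

# product:P9 now links the two organisations' data: the customer's
# purchase and the supplier relationship sit in one connected graph.
print(facts_about("product:P9", graph))
```

The hard part in practice is exactly what the interview stresses: getting different organisations to agree on the identifiers and vocabulary in the first place.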
Of course, some aspects also need further attention, such as data governance and data security. For example, the recent EU GDPR is on the one hand a burden, but on the other it can help companies to identify where they store personal data and how it is described, and then allow them to systematically manage not only personal data but distributed, heterogeneous, interlinked knowledge in general in order to master digitisation.
In the health sector, areas such as microscopy, for example, stand to benefit hugely from advances in data handling and repositories of super-resolution data etc. What challenges and opportunities do you feel exist in regard to ensuring Big Data can fulfil its potential in such areas?
Regarding microscopy, handling the extremely large number of images being produced can be a significant challenge. This is perhaps not as difficult, however, as integrating data to ensure interoperability. We are seeing this, for example, with sensor data from fitness trackers and other biomedical sensor devices, which needs to be understood in conjunction with longitudinal data so that you can track how the data evolves and when it was produced, and then provide statistical and analytical insights for the individual.
Nevertheless, for microscopy data in isolation, the major challenge centres on the technological hurdles of managing the large amounts of data being generated. These, however, are relatively easy to solve by adding more hardware or computational power, compared with integrating the context of the information to create value out of it.
The area with the biggest potential, but also the biggest challenges, in the health sector involves patient data. Here, there are issues around personal data which require additional attention, as current Big Data frameworks don’t support this very well. Such data needs more metadata attached to it, which is then traced over its lifetime. You must also take into account for what purposes certain data can be used with the consent of the patient.
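One way to make that idea concrete is to attach consent metadata directly to each record and check it on every use, keeping an auditable trace over the data's lifetime. This is a hedged sketch of the principle, not any existing framework's API; the record fields and purpose labels are invented for illustration.

```python
from dataclasses import dataclass, field

# Sketch: patient data carrying its own consent metadata. Every access is
# checked against the purposes the patient agreed to and logged, so the
# use of the data can be traced over its lifetime. All field names and
# purpose labels are invented for the example.
@dataclass
class PatientRecord:
    patient_id: str
    values: dict
    consented_purposes: set = field(default_factory=set)
    access_log: list = field(default_factory=list)

def use(record, purpose):
    """Return the data only if the patient consented to this purpose."""
    if purpose not in record.consented_purposes:
        raise PermissionError(f"no consent for purpose: {purpose}")
    record.access_log.append(purpose)  # keep an auditable trace
    return record.values

r = PatientRecord("p1", {"heart_rate": 62}, {"clinical_care"})
use(r, "clinical_care")   # allowed: purpose is covered by consent
# use(r, "marketing")     # would raise PermissionError: no consent given
```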
Preventing bias and generating responsible data, especially for health applications, are important areas. This means being able to detect, for example, bias in the data, and the development of algorithms and Big Data frameworks and technologies which are able to detect and handle biases is crucial.
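A very simple form of the bias detection mentioned above is comparing outcome rates between groups in the data (a statistical-parity style check). The toy records and the idea of flagging a large gap are invented for illustration; real bias auditing involves many more criteria than this one number.

```python
# Sketch of a minimal bias check: compare the rate of positive outcomes
# between two groups in a dataset. A large gap does not prove bias, but
# it flags the data for closer review. The records are invented.
records = [
    {"group": "A", "positive": True},
    {"group": "A", "positive": True},
    {"group": "A", "positive": False},
    {"group": "B", "positive": True},
    {"group": "B", "positive": False},
    {"group": "B", "positive": False},
]

def positive_rate(records, group):
    """Fraction of records in the group with a positive outcome."""
    rows = [r for r in records if r["group"] == group]
    return sum(r["positive"] for r in rows) / len(rows)

gap = abs(positive_rate(records, "A") - positive_rate(records, "B"))
print(f"parity gap: {gap:.2f}")
```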
The technology also needs to be transparent and must make the analysis explainable; if you use analysed data in advisory or decision support systems, then it is necessary to provide evidence about how the algorithm reached its conclusion.
Thinking of the life sciences in a more general sense, what is the role of the OpenPHACTS Discovery Platform and what do you feel will be the biggest benefits of this moving forwards?
OpenPHACTS is a pre-competitive effort on the part of pharmaceutical companies, which we contributed to via the BigDataEurope project by helping them to improve their technology platform. Generally, such pre-competitive collaboration efforts between companies and research institutes are very important and need to be continued and strengthened, and we need similar activities to better bring together businesses and pharmaceutical research organisations so that they can be linked both to each other and to related nearby areas and domains.
Do you feel that more needs to be done to make data more open? How can issues such as those concerning intellectual property be solved here?
There are some areas where we still haven’t leveraged the potential of open data, and this is perhaps the result of an increasing trend on the part of companies to keep certain types of data closed and private in order to retain their competitive advantage. However, in certain areas it is also beneficial to collaborate with other players and so avoid duplicating work, and companies need to realise this.
OpenPHACTS was a great example of this, but we need more such initiatives alongside a change in mentality, and that requires more awareness and education inside different organisations so that good decisions can be made in terms of what kind of data should be made open or where it is more beneficial to collaborate. We want companies developing new products and services to use the Big Data Value Forum for this purpose, and it is fully understandable that certain types of data will be closed.
Traditionally, data has been kept as closed as possible, or closed by default, but this is changing, and in the future organisations and governments will need to learn where to draw the line between more openness and keeping some data closed where necessary. That process will continue over the coming years, maybe even decades. It is clear that things are changing all the time: sometimes certain types of data offer certain advantages and so become a commodity, and as other organisations come to hold similar data, that borderline has to be shifted again.
Looking back at BigDataEurope, what were your biggest achievements? And do you have any ambitions to continue this work moving forwards?
We have released an open source software framework which realises some of these ideas, supporting the integration of heterogeneous data from different sources, and which is also helping to raise awareness of the importance of this in different communities.
Meanwhile, there are a number of projects coming out of BigDataEurope in different areas that we are involved in. One such project, BigMedilytics, is a Big Data project in the life sciences and healthcare domain where we will apply the technologies developed in BigDataEurope on a larger scale. Philips, for example, and many other companies and organisations in the healthcare sector are involved, and we are working together to move the technologies, which in BigDataEurope we only developed as prototypes, towards deployed uses in real industrial settings. A similar project is ‘Boost 4.0’, which does similar work in the automotive and other manufacturing industries.
Professor Sören Auer
Leibniz Information Centre for Science and Technology (TIB)
Professor of Data Science and Digital Libraries
Leibniz University of Hannover