Opening up big data
The increasing use of ICT for business, leisure and public services is leading to the accumulation of mountains of structured and, in many cases, unstructured data. But this so-called ‘big data’ should be seen as an opportunity, not a problem. EU research and efforts to promote open data are helping to make sense of this resource and put it to good use.
‘Open data’ is the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. Inspired by the open source software (non-proprietary) and open access (academic publishing) movements, open data is broadly taken to mean the liberal movement, use, re-use or electronic distribution of data.
An important part of this ‘big data’ movement is the use for the wider benefit of society of the non-personal information that citizens share with their governments and public services. Open government data is a tremendous resource that has yet to be fully tapped. “Government collects a vast quantity of high-quality data as part of its ordinary working activities. If this data is made open, it can have huge potential benefits,” notes the Open Government Data (OGD) website, run by the Open Government Working Group.
According to Rufus Pollock of the Open Knowledge Foundation, opening up the data enables companies, individuals and the non-profit sector to build useful ‘apps’ and services, and it promotes democracy, government participation, transparency and accountability. “Why not open up the data that’s already there and is already being collected?” he says.
But there are numerous challenges - legal, technical, social and market-related - that need to be faced before the many benefits of open (government) data can be effectively transferred to citizens.
This way forward …
According to reports from the recent Future Internet Assembly (FIA) in Aalborg, Denmark, “Trends like ‘big data’ and the ‘internet of things’ (IoT), including ‘people as sensors’, are showing how citizens/entrepreneurs/innovators can develop new services and apps for the benefit of smart cities.” FIA presenter Reinhard Scholl of the International Telecommunication Union (ITU) said good examples include New York City’s Open Data initiative, Amsterdam’s Smart City program, Catalonia’s Open Data Gencat and the Commission’s Open Cities challenge.
Best practices from the USA, according to Scholl, include MIT’s ‘Trash Track’ experiment, which used sensors to monitor where rubbish ends up. And Oakland’s data-driven ‘crime spotting’ service, he said, is helping the city improve security.
Public-sector information (PSI) is the single largest source of information in Europe, according to the European Commission’s DG Connect, and includes digital maps, meteorological, legal, traffic, financial, economic and other data. Most of this raw data could be re-used or integrated into new products and services for everyday use, such as car navigation systems, weather forecasts, financial and insurance services.
“Re-use of public-sector information means using it in new ways by adding value to it, combining information from different sources, making mash-ups and new applications, both for commercial and non-commercial purposes. Public-sector information has great economic potential,” explains the Commission on its dedicated PSI web-page.
EU research, adapting to change
The research landscape has also moved to accommodate the rapid changes taking place in data collection, processing and handling. For example, projects funded under the FP7 ‘Technologies for information management’ activities, as part of the ‘Content and knowledge’ theme, have targeted a range of research domains encompassing online content, interactive and social media; reasoning and information exploitation; and knowledge discovery and management.
One initiative, the EU-funded ‘Emerging, collective intelligence for personal, organisational and social use’ (Weknowit) project, has developed a platform that turns vast amounts of user-generated content from an information-overload problem into a new ‘collective intelligence’ with a range of applications, from handling emergencies to enhanced city tourism. The project has filed for several patents, and a handful of products and results are destined for public or commercial release.
“Using a wide variety of tools, the Weknowit platform transforms large-scale and poorly structured information into meaningful topics, entities, points of interest, social connections and events,” says project coordinator Yiannis Kompatsiaris of the Informatics and Telematics Institute (CERTH-ITI), Multimedia Knowledge Lab in Greece. To do this, the partners developed a middleware application that can be deployed on servers to process incoming data and route it effectively.
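The kind of processing described above can be illustrated with a deliberately simplified sketch. This is not Weknowit code; it is a hypothetical mini-pipeline, assuming a keyword-based topic classifier, that shows how a stream of raw user posts might be grouped into meaningful topics and how a burst of posts on one topic can be flagged as a candidate event:

```python
from collections import defaultdict

# Hypothetical topic lexicon (illustrative keywords, not from the project)
TOPIC_KEYWORDS = {
    "emergency": {"fire", "flood", "evacuate", "smoke"},
    "tourism": {"museum", "festival", "concert", "beach"},
}

def classify(post):
    """Assign a post to the first topic whose keywords it mentions."""
    words = set(post.lower().split())
    for topic, keys in TOPIC_KEYWORDS.items():
        if words & keys:
            return topic
    return "other"

def aggregate(posts, burst_threshold=3):
    """Group posts by topic; a burst of posts on one topic is a candidate event."""
    by_topic = defaultdict(list)
    for p in posts:
        by_topic[classify(p)].append(p)
    events = [t for t, ps in by_topic.items()
              if t != "other" and len(ps) >= burst_threshold]
    return dict(by_topic), events
```

A real platform would of course replace the keyword lexicon with statistical entity extraction and clustering, but the overall shape - classify incoming items, group them, detect bursts - is the same.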
They also developed several tools within the project case studies, including an emergency response scenario and a consumer social group scenario, and partners created a dozen more tools for specific tasks. Meanwhile, partners CERTH-ITI, Yahoo! and Koblenz University are further collaborating on real-time aspects of social media information extraction as well as looking at applications in the news sector and large events, such as film festivals.
Open data for science, too
Better use of structured data also benefits scientific research more directly, thanks to advances in cloud and grid computing, or supercomputing. European investment in e-infrastructure, which harnesses the ‘unused’ capacity of computers distributed all over the world, means researchers can process and analyse bigger data sets than ever before, revealing possible answers to some of science’s biggest questions, from quantum physics to climate change modelling.
For example, biologists studying a specific problem could create a virtual research environment (VRE) for collaborating across the grid, processing information from one source in Estonia and analysing it with data-mining software tools from another in, say, Portugal.
Going one step further, an EU-funded project called ‘Data infrastructures ecosystem for science’ (D4Science-II) has created an interoperable framework for e-infrastructures: an ecosystem in which data, computing and software resources belonging to different e-infrastructures can be shared regardless of location, technology, format, language, protocol or workflow. This ecosystem has supported VREs in fields such as high-energy physics, biodiversity, and fisheries and aquaculture resources. It has helped open up new areas of research between them and is being extended to new domains.
For instance, D4Science-II supported the Aquamaps marine species mapping study. Aquamaps helps scientists to cross-reference marine biodiversity with records of fish catches to get a clearer picture of where fish stocks are most at risk. This is a huge data- and number-crunching exercise made possible thanks to European funding for e-infrastructure and its open data policy and research initiatives.
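The cross-referencing Aquamaps performs can be pictured, in miniature, as a join between two data sets. The sketch below is purely illustrative (invented species, cells and numbers, not Aquamaps data or code): it combines observed species abundance per grid cell with recorded catches, and flags cells where catch pressure is high relative to abundance:

```python
# Illustrative data: (species, grid cell) -> relative observed abundance (0-1)
occurrences = {
    ("cod", "N1"): 0.9,
    ("cod", "N2"): 0.2,
    ("tuna", "S1"): 0.5,
}

# Illustrative data: (species, grid cell) -> annual catch, tonnes
catches = {
    ("cod", "N1"): 100,
    ("cod", "N2"): 400,
    ("tuna", "S1"): 50,
}

def at_risk(threshold=500):
    """Flag (species, cell) pairs whose catch-to-abundance ratio
    exceeds the threshold, i.e. heavy fishing where stocks look thin."""
    flagged = []
    for key, abundance in occurrences.items():
        pressure = catches.get(key, 0) / max(abundance, 1e-9)
        if pressure > threshold:
            flagged.append(key)
    return flagged
```

The real exercise runs this kind of join over millions of records across distributed e-infrastructures, which is precisely why the shared computing resources matter.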
“Cooperation across e-infrastructures opens up entirely new possibilities and areas of research. We can analyse scientific data against economic statistics, for example, to get an entirely new perspective that was not available before,” says Donatella Castelli, a D4Science-II partner at the Institute of Information Science and Technology (Alessandro Faedo) of Italy’s National Research Council.
Open access publishing
While public organisations are opening up their data to researchers, it might seem ironic that the results of such research might end up inaccessible in expensive journals. In an effort to promote more open access publishing online - especially for publicly funded research - the European Commission has made open access publishing mandatory for around 20% of FP7 projects.
In addition, when projects publish results across a range of traditional journals, as well as some open access publications, knowledge is fragmented and the output of a project becomes harder to measure. The EU-supported project ‘Open access infrastructure for research in Europe’ (Openaire) set out to change that, with the vision of making everything accessible to everyone.
The Openaire team recognised early on that better technology is only half the battle in overcoming research and data fragmentation: “A significant part of the project focuses on promoting open access in the FP7 community,” says Natalia Manola, the project’s manager, “advocating open access publishing so projects can fully contribute to Europe’s knowledge infrastructure.”
With the help of projects like Openaire and its follow-up Openaireplus, open access publishing can boost Europe’s economy and innovation levels, according to Manola. If you are an employee of a small firm or a teacher, subscriptions to high-end scientific journals can be prohibitive, which means valuable research is locked away in silos. “With open access, anybody [can] use it how they want - it is the best way to make the most of publicly-funded research,” she concludes.
Related to this, the nuclear research organisation CERN has spearheaded an EU-funded project on the ‘Study of open access publishing’ (SOAP) in search of sustainable business models to promote scholarly publishing. The team documented over 4000 journals and, following some analysis, the SOAP team found that some 8% of the worldwide production of scientific papers, or about 120,000 articles per year out of the estimated industry figure of 1.5 million, is currently published as open access. They concluded that a ‘hybrid open access’ model (partially subscription based) is the most viable option, especially for scientific and research publishing.
“By enhancing viable open access models, European researchers - and indeed the world - will gain from the exchange of knowledge and have access to vast material,” according to a CORDIS report, ‘Open access to mountains of research’, on SOAP.
While opening up publicly owned data, combining data sets and open access publishing of results all have their advantages for science, monetising structured data commercially is a more complex challenge. Some newly launched EU projects are looking into it.
The EU-funded project ‘Commercially empowered linked open data ecosystems in research’ (CODE) is an SME-led initiative focused on the digital content and languages side of the big data equation. Linked open data, or LOD, shows enormous potential as the next big evolutionary step of the internet, according to the CODE team. But this potential remains largely untapped due to missing usage and monetisation strategies.
CODE, which only began work this year, is developing a robust ecosystem for commercialising LOD based on a value-creation chain among traditional (eg, data provider and consumer) and non-traditional (eg, data analyst) roles in data marketplaces. Early results look promising.
Recognising that we live our lives more and more online, partners in the EU-backed project ‘Linguistically motivated semantic aggregation engines’ (Limosine) are meanwhile looking to leverage language and semantic search technology to improve this online experience.
“Information is accumulated on a wide range of human activities, from science and facts, to personal content, opinions and trends,” notes the project team. Limosine’s multilingual web opinion-mining system means the internet can move away from current document-centric search towards greater semantic aggregation. In other words, getting refined search results faster through smarter tools which better understand and even predict what you are looking for.
For example, if you search “dog’s breakfast” using today’s search standards you get results about British idiom or Canadian theatre, when a non-native English speaker may have been looking more literally for a healthy alternative to feed their pet instead of cereal! Semantic search tools may be able to contextualise the query based on your previous searches or other gathered evidence.
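One way to picture this kind of contextualisation is as a scoring problem: each candidate sense of an ambiguous query is compared against the user’s recent searches, and the best-matching sense wins. The following is a hypothetical sketch (the sense inventory and keywords are invented for illustration, not Limosine’s method):

```python
# Hypothetical sense inventory for an ambiguous query
SENSES = {
    "dog's breakfast": {
        "idiom": {"british", "slang", "meaning", "phrase", "idiom"},
        "pet food": {"dog", "pet", "feed", "food", "nutrition"},
    }
}

def disambiguate(query, history):
    """Pick the sense whose keywords overlap most with the user's
    recent search history; return None for unknown queries."""
    senses = SENSES.get(query)
    if not senses:
        return None
    context = set(" ".join(history).lower().split())
    return max(senses, key=lambda s: len(senses[s] & context))
```

A genuine semantic engine would use multilingual language models and opinion mining rather than keyword overlap, but the principle of weighing candidate interpretations against gathered evidence is the same.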
Meanwhile, projects like LIVE+GOV bring together “reality sensing, mining and augmentation for mobile citizen-government dialogue”. The project is developing an ‘m-government’ solution that allows citizens to express their needs to government through mobile sensing technologies already in smartphones, alongside established mobile e-participation formats.
Oiling the European economy
Eventually, public data generated by all administrations in Europe should become automatically re-usable, stimulating innovation and entrepreneurship, which in turn feed into new applications and services, both fixed and mobile.
“Just as oil was likened to black gold, data takes on a new importance and value in the digital age,” commented Neelie Kroes, Vice-President of the European Commission responsible for the Digital Agenda, at the launch of the EU’s Open Data Strategy in December. This open data package informs the new PSI Directive, which is now before the European Council and Parliament.
Public-sector information already generates some EUR 32 billion of economic activity each year. The new package stands to more than double that to around EUR 70 billion, which according to Kroes is a “badly needed boost to our economy”. She applauded the UK, Denmark and France on their open data initiatives and said that the new strategy will “radically shake up” how EU institutions and most public authorities in Europe share data.
Kroes called on governments not to wait for this package to become law: “You can give your data away now and generate revenue and jobs, and even save money from the better information and decisions that will flow.” She encouraged the private sector to open their data to generate new services. “Data is gold … Let’s start mining it!” she urged.
CORDIS Features, formerly ICT Results