Tutorial provisional program
LIP6@Paris, May 29-31, 2013
Day 1 (Wednesday, May 29, 2013): Open Data Applications on the Web
- 08:30-09:00 – Welcome
- 09:00-10:00 – Big Data and the emerging Web of Data, Vassilis Christophides (FORTH) & Dan Vodislav (ETIS)
- 10:00-11:00 – Active Citizenship and Collaborative Data Analysis (part 1), Ioana Manolescu (INRIA)
- 11:00-11:15 – Coffee-break
- 11:15-13:15 – Active Citizenship and Collaborative Data Analysis (part 2), Ioana Manolescu (INRIA)
- 13:15-14:45 – Lunch
- 14:45-16:45 – Data-intensive Journalism, Damien Cirotteau (Rue89)
- 16:45-17:00 – Coffee-break
- 17:00-19:00 – The Open Data Eco-system, François Bancilhon (Data Publica)
Day 2 (Thursday, May 30, 2013): Open Data Management
- 09:00-11:00 – Centralized and Distributed SPARQL Query Processing (part 1), Martin Theobald (Max Planck Institute, Saarbrücken)
- 11:00-11:15 – Coffee-break
- 11:15-12:15 – Centralized and Distributed SPARQL Query Processing (part 2), Martin Theobald (Max Planck Institute, Saarbrücken)
- 12:15-13:15 – Linked Data Management on the Cloud (part 1), Zoi Kaoudi (INRIA), Ioana Manolescu (INRIA)
- 13:15-14:45 – Lunch
- 14:45-16:45 – Linked Data Management on the Cloud (part 2), Zoi Kaoudi (INRIA), Ioana Manolescu (INRIA)
- 16:45-17:00 – Coffee-break
- 17:00-19:00 – Data and Knowledge Evolution (part 1), Giorgos Flouris (FORTH-ICS, Heraklion)
Day 3 (Friday, May 31, 2013): Entity Resolution and Reasoning
- 09:00-11:00 – Reasoning on Web Data Semantics (part 1), Marie-Christine Rousset (LIG Grenoble)
- 11:00-11:15 – Coffee-break
- 11:15-12:15 – Reasoning on Web Data Semantics (part 2), Marie-Christine Rousset (LIG Grenoble)
- 12:15-13:15 – Entity resolution (part 1), Melanie Herschel (LRI/INRIA)
- 13:15-14:45 – Lunch
- 14:45-16:45 – Entity resolution (part 2), Melanie Herschel (LRI/INRIA)
- 16:45-17:00 – Coffee-break
- 17:00-18:00 – Data and Knowledge Evolution (part 2), Giorgos Flouris (FORTH-ICS, Heraklion)
- 18:00-19:00 – Concluding Discussions
Abstracts
Centralized and Distributed SPARQL Query Processing
Martin Theobald
With the recent advent of the Linked Open Data movement, more and more data sets are being published in the Resource Description Framework (RDF). Storing large, interlinked RDF collections and efficiently processing queries formulated in the SPARQL query language have thus become an increasingly important field of research, in the database area as well. The tutorial will cover current trends and state-of-the-art techniques for the scalable management of RDF data from a relational-database perspective. The presentation will focus on both centralized and distributed database approaches for storing, indexing, and optimizing queries over very large RDF repositories consisting of many billions of triples.
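As a minimal illustration of the relational view of RDF (our own sketch, using the rdflib Python library as a convenient stand-in for a real triple store, not the tutorial's material): a centralized store can be thought of as one large triples table, and a SPARQL query as a set of patterns joined over it.

```python
# A minimal sketch (assumes the rdflib library): store a few RDF triples
# and evaluate a SPARQL query over them. Conceptually, a centralized RDF
# store keeps one large triples table; SPARQL engines scan and join it.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.alice, EX.knows, EX.bob))
g.add((EX.bob, EX.name, Literal("Bob")))

# Find the names of everyone alice knows: the two triple patterns
# amount to a self-join of the triples table on ?p.
query = """
PREFIX ex: <http://example.org/>
SELECT ?name WHERE {
    ex:alice ex:knows ?p .
    ?p ex:name ?name .
}
"""
for row in g.query(query):
    print(row.name)   # prints: Bob
```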
Data and Knowledge Evolution
Giorgos Flouris
The web is gradually evolving from a collection of documents to a collection of data. More and more information becomes publicly available in the form of Linked Open Data (LOD), i.e., as open and interlinked data that provide significant added value via the emergence of unexpected new knowledge or data connections. In this context, managing the evolution of such data is becoming all the more important. Even though the problem of data and knowledge evolution has been around since the early years of knowledge representation research, and problems related to the dynamics of web documents have also been studied, the recent LOD trend has given rise to new dimensions of the problem. Apart from the standard challenges, i.e., "what is the semantics of evolution?" and "how can I efficiently compute the ideal evolution result?", new questions arise, driven by the distributed, uncontrolled and dynamic nature of LOD. Such questions include "how does the evolution of remote datasets affect my data?", "how can I detect changes in remote datasets?", "how can I efficiently propagate changes from one dataset to another?", "how can I preserve the integrity and quality of my data in a dynamic and interlinked environment?", and others. In this tutorial, we will provide an overview of the fields that address these problems, namely evolution, repair, and change detection.
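As a toy illustration of the change-detection question (our own sketch, not part of the tutorial material): the lowest-level "delta" between two versions of an RDF dataset can be computed as plain set differences over triples; change-detection frameworks build higher-level, semantics-aware change descriptions on top of such deltas.

```python
# A minimal sketch of low-level change detection between two versions of
# an RDF dataset, represented here as sets of (subject, predicate, object)
# tuples. Real frameworks group such low-level deltas into higher-level,
# semantics-aware changes.
v1 = {
    ("ex:alice", "ex:worksFor", "ex:FORTH"),
    ("ex:alice", "ex:name", '"Alice"'),
}
v2 = {
    ("ex:alice", "ex:worksFor", "ex:INRIA"),   # affiliation changed
    ("ex:alice", "ex:name", '"Alice"'),
}

added   = v2 - v1   # triples present only in the new version
deleted = v1 - v2   # triples present only in the old version

print("added:", added)
print("deleted:", deleted)
```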
Reasoning on Web Data Semantics
Marie-Christine Rousset
Providing efficient and high-level services for integrating, querying and managing Web data raises many difficult challenges, because data are becoming ubiquitous, multi-form, multi-source and multi-scale. Data semantics is probably one of the keys to attacking those challenges in a principled way. Much effort has been devoted in the Semantic Web community to describing the semantics of information through ontologies.
In this tutorial, we will show that description logics provide a good model for specifying ontologies over Web data (described in RDF), but that restrictions are necessary in order to obtain scalable algorithms for checking data consistency and answering conjunctive queries. We will show that the DL-Lite family has good properties for combining ontological reasoning and data management at large scale, and is therefore a good candidate for a Semantic Web data model.
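To give a flavour of why the DL-Lite family scales (a standard textbook rewriting example, ours rather than the tutorial's): a conjunctive query is rewritten, using the ontology, into a union of conjunctive queries that can be evaluated directly over the data, with no ontology reasoning at query time.

```latex
% A standard DL-Lite rewriting example (our illustration).
% One TBox axiom and one atomic query:
\[
  \mathit{PhDStudent} \sqsubseteq \mathit{Student}
  \qquad\qquad
  q(x) \leftarrow \mathit{Student}(x)
\]
% The perfect rewriting of q w.r.t. the TBox is a union of conjunctive
% queries, evaluable directly on the stored triples:
\[
  q'(x) \leftarrow \mathit{Student}(x) \;\vee\; \mathit{PhDStudent}(x)
\]
```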
Entity Resolution
Melanie Herschel
Entity resolution has a long history in database research and applications; its goal is to identify multiple representations of the same real-world object, despite differences between these representations. As more and more data sets become available on the Web, interest in linking them has increased, for instance within the Linked Open Data movement. To automatically detect links between data sets, new entity resolution techniques are being developed for such Web data.
The overall goal of this tutorial is to give an overview of the techniques underlying effective and efficient entity resolution for the various kinds of data published on the Web. After an introduction to traditional entity resolution for data stored in relational tables, the tutorial presents techniques devised to perform entity resolution on hierarchical data (e.g., XML) and graph data (e.g., RDF). Our presentation focuses both on methods that reach high entity resolution quality and on techniques that improve efficiency in order to scale to large amounts of data.
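As a minimal, self-contained illustration of the relational case (our own sketch; the block key, similarity measure and threshold are arbitrary choices, not the tutorial's), a classic pipeline combines blocking, to avoid comparing all pairs, with a pairwise similarity measure such as Jaccard similarity over name tokens:

```python
# A minimal entity-resolution sketch for relational-style records:
# blocking (group by a cheap key) + pairwise Jaccard similarity on tokens.
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "Melanie Herschel",  "city": "Paris"},
    {"id": 2, "name": "M. Herschel",       "city": "Paris"},
    {"id": 3, "name": "Ioana Manolescu",   "city": "Paris"},
]

def tokens(s):
    return set(s.lower().replace(".", "").split())

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Blocking: only records sharing a block key are compared, cutting the
# quadratic number of comparisons down to pairs within each block.
blocks = defaultdict(list)
for r in records:
    blocks[r["city"]].append(r)

matches = []
for block in blocks.values():
    for r1, r2 in combinations(block, 2):
        if jaccard(tokens(r1["name"]), tokens(r2["name"])) >= 0.3:
            matches.append((r1["id"], r2["id"]))

print(matches)   # [(1, 2)]: the two Herschel records are linked
```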
The Open Data Eco-system
François Bancilhon
The Open Data movement is a fairly recent phenomenon, launched in 2009 by the Obama administration. The original idea is that public data gathered, maintained and used by public organizations should be made available for access and re-use by citizens and companies. By access we mean the general public's ability to understand the data. By re-use we mean the ability given to companies to process and use this data for business purposes. Born in Anglo-Saxon countries, the movement is spreading to all western democracies. It has also recently extended from the public to the private sector: some companies have discovered the benefits of opening part of their data, for purposes ranging from communication and transparency to developing a virtuous eco-system around themselves. The open data phenomenon has several facets: legal, business, technological, political, etc. In this presentation, I will give an overview of Open Data, with some focus on the technology aspects.
Active Citizenship and Collaborative Data Analysis: a Standard-Compliant Framework for Fact-Checking the Web
Ioana Manolescu
My talk considers a form of collaborative analysis of Web data which can be loosely defined as "fact-checking the Web", by reference to the many electronic tools and platforms for analyzing, confirming, or disproving claims made by public figures in recent elections in France and the US. Such claims are transcribed in Web pages, social media, etc., then commented on, dissected, and analyzed, separating truths from half-truths and outright lies. This analysis typically involves users from a variety of backgrounds and viewpoints, increasing the likelihood that independent opinions are gathered, and users back their statements with references to online content that stands as evidence. I argue that such an informed Web of open, evidence-backed facts is a necessary ingredient of democracy itself: what is voting worth without the knowledge on which to base it?
The talk presents XR, a W3C standard-compliant framework developed within our group at INRIA to support collaborative analysis and enrichment of structured documents (XML) with annotations (RDF). I discuss the XR data model and query language, the architecture of a recently developed XR query evaluation engine, and how XR can be harnessed to support collaborative fact-checking on data and knowledge gathered from the Web. XR relies on open and free standards, making it possible to integrate and enrich data and information from a variety of sources.
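To fix intuitions (a purely schematic sketch of the general XML+RDF annotation idea; this is NOT XR's actual syntax or API, which the talk presents, and the node-addressing scheme below is our own invention): RDF triples can annotate XML document nodes by taking node-denoting URIs as subjects.

```python
# Schematic only: RDF triples whose subjects are URIs denoting nodes of
# an XML document. Hypothetical addressing scheme and vocabulary.
from rdflib import Graph, Literal, Namespace, URIRef

ANN = Namespace("http://example.org/annotation/")

# Hypothetical URI identifying the third <claim> element of an article.
claim_node = URIRef("http://example.org/articles/42.xml#/article/claim[3]")

g = Graph()
g.add((claim_node, ANN.checkedBy, URIRef("http://example.org/users/alice")))
g.add((claim_node, ANN.verdict, Literal("half-true")))
g.add((claim_node, ANN.evidence, URIRef("http://example.org/sources/report-2012")))

print(g.serialize(format="turtle"))
```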
Linked Data Management on the Cloud
Zoi Kaoudi, Ioana Manolescu
The W3C’s Resource Description Framework (or RDF, in short) is a promising candidate to deliver many of the original semi-structured data promises: flexible structure, optional schema, and rich, flexible Uniform Resource Identifiers as a basis for information sharing. Moreover, RDF is uniquely positioned to benefit from the efforts of scientific communities studying databases, knowledge representation, and Web technologies. Many RDF data collections are being published, ranging from scientific data to general-purpose ontologies and open government data, in particular within the Linked Data movement. Managing such large volumes of RDF data is challenging due to their sheer size, their heterogeneity, and the further complexity brought by RDF reasoning. To tackle the size challenge, distributed storage architectures are needed. Cloud computing is an emerging paradigm massively adopted in many applications for the scalability, fault-tolerance and elasticity features it provides. This tutorial discusses the problems involved in efficiently handling massive amounts of RDF data in a cloud environment. We provide the necessary background, analyze and classify existing solutions, and discuss open problems and perspectives.
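As a toy illustration of one common distribution strategy (our own sketch, not a specific system from the tutorial): hash-partitioning triples by subject places all triples about a given subject on the same node, so subject-rooted ("star") query patterns can be answered locally without shuffling data between nodes.

```python
# A minimal sketch of subject-based hash partitioning of RDF triples
# across cloud nodes. With this placement, all triples about one subject
# live on one node, so subject-centric ("star") joins need no shuffling.
import hashlib

NUM_NODES = 4

def node_for(subject: str) -> int:
    # Stable hash so the same subject always maps to the same node.
    digest = hashlib.md5(subject.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES

triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:worksFor", "ex:INRIA"),
    ("ex:bob", "ex:name", '"Bob"'),
]

partitions = {n: [] for n in range(NUM_NODES)}
for s, p, o in triples:
    partitions[node_for(s)].append((s, p, o))

for n, part in partitions.items():
    print(f"node {n}: {part}")
```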
Data-intensive Journalism
Damien Cirotteau
Data has always been one of the main building blocks of news production, but the open data movement has changed the rules of the game. Never before have we had so much data, available so quickly. Meanwhile, the chain of production and the time to market of news have shortened.
We will explore examples of data-driven journalism and discuss best practices for producing such content:
- how developers, designers and journalists work together,
- what new skills journalists need and how to work with them,
- how data streams should be structured and disseminated to be efficiently used by newsrooms.