Séminaire ETIS : Vassilis Christophides
Content syndication has become a popular means for timely delivery of frequently undated information on the Web. It essentially enhances traditional pull-oriented searching and browsing of web pages with push-oriented protocols in which information publishers deliver brief summaries of the content they publish on the Web, called news items, while information consumers subscribe to a number of feeds seen as information channels and get informed about the addition of recent items. Today, web syndication technologies such as RSS or Atom are used in a wide variety of applications spreading from large-scale news broadcasting to medium-scale information sharing in scientific and professional communities. However, they exhibit serious limitations for coping with information overload in Web 2.0 since they imply a tight coupling between feed producers with consumers while they do not facilitate users in finding news items with interesting content. In this work, we are proposing an extension of existing web syndication systems with content-based filtering and tracking facilities. In this way, users can express their information interests as keyword-based queries (rather than subscribing a priori to an entire channel) which will be matched at real time against the streams of news items originating from different feeds.
To efficiently check whether all keywords of a subscription also appear in an incoming news item (i.e., board match semantics) we need to index the subscriptions. Two are the main indexing schemes proposed in the literature, namely, Count-based (explicitly counting the number of contained keywords) and Tree-based (implicitly counting the number of contained keyword). Unfortunately, the majority of data structures employed in these indexes concern structured subscriptions under the form of sets of attribute-value pairs. In our work, we are interested in indexing unstructured subscriptions formed by sets of keywords according to both Count and Tree-based schemes and study their behavior for critical parameters of realistic web syndication workloads. More precisely, we rely on an ordering of vocabulary terms to build a Trie (for tree-based index) that exploits the covering relationships between subscriptions compared to the flat search space implied by an Inverted File (for count-based index). Then, we experimentally investigate (a) how the morphology of the two indexes is affected by different workload parameters, i.e. the vocabulary distribution, the size of the vocabulary, and the size of the subscriptions and b) how the Trie and Inverted File morphology impacts the scalability and performance of the two indexes for realistic characteristics of subscriptions and news items.
The main conclusions drawn from our study are the following. Trie depth is bounded by the size of the indexed subscriptions (in realistic settings is small from 2 up to 6 terms), while its width by the size of the vocabulary of indexed terms (in realistic settings is very large up to 1,500,000 terms). To cope with large vocabulary sizes exhibited in reality we have considered path compression on Trie nodes featuring a unique child. The total number of Trie nodes as well as their morphology is then affected by the vocabulary terms' distribution (usually power laws). When the actual terms’ frequency in subscriptions follows the initial Trie ordering, subscription factorization will be important resulting to a left deed structured Trie. In the opposite, factorization becomes less important resulting to a right shallow Trie where path compression is now more intense. In particular, when the reverse ordering is followed a great number of subscriptions will be stored at the sub-Tries of the low ranked terms (i.e., the most frequent ones) and thus the pruning of the search space during matching will be significant. Besides Trie morphology, matching time is also affected by the size of the incoming news items (in realistic settings lies between 5 and 50 terms). In this context, Trie outperforms in matching time from 1 up to 3 orders of magnitude the Inverted File-based index at the expanse of double memory requirements while the Trie throughput rate when indexing with 10,000,000 subscriptions achieves ~500 items/sec (vs 34 news items/sec for Inverted File).
Université de Cergy-Pontoise, St-Martin 1, salle 562 (5ème étage, le Rectangle).