Twitter Indexing Demo
Objective
The demo offers a real-time search of the Twitter stream. The key feature we are demonstrating here is that as soon as a tweet is posted, it is indexed and made available for search.
Try it here!
Architecture
This demo is a web interface to an underlying infrastructure that indexes streams of data (e.g. tweets) in real time using Terrier, with the index distributed across the computing nodes of a cluster. This infrastructure will be the backbone of the SMART search engine.
This is done with the recently emerging Storm framework, which provides a distributed processing environment similar to MapReduce but can handle streams of data in real time. We use it to distribute the workload of indexing the tweet stream with Terrier across multiple machines in a cluster. Terrier has been enhanced to use real-time, in-memory indices, so that as soon as a tweet is posted and received, it is indexed and made available for search. Typically, on-disk inverted indices are compressed. We studied whether compression is still appropriate for in-memory indices and found that it is: it increases retrieval speed as well as the number of documents that can be indexed in a fixed amount of space. In particular, we use Elias-Gamma coding for docid deltas and unary coding for term frequencies (unary is suitable because a term usually occurs once, or at most twice, in a tweet).
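To make the compression scheme concrete, here is a minimal sketch of how a single posting list could be encoded with Elias-Gamma docid deltas and unary term frequencies. It is an illustration only, not Terrier's implementation: the function names, the use of bit strings instead of packed bytes, and the convention that docids start at 0 (so the first delta is docid + 1) are all assumptions made for readability.

```python
from typing import List, Tuple


def unary_encode(n: int) -> str:
    """Unary code for n >= 1: (n - 1) one-bits followed by a terminating zero."""
    if n < 1:
        raise ValueError("unary coding is defined here for n >= 1")
    return "1" * (n - 1) + "0"


def gamma_encode(n: int) -> str:
    """Elias-Gamma code for n >= 1: floor(log2 n) zeros, then n in binary."""
    if n < 1:
        raise ValueError("Elias-Gamma coding is defined for n >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def encode_postings(postings: List[Tuple[int, int]]) -> str:
    """Encode (docid, tf) pairs, sorted by strictly increasing docid."""
    bits = []
    previous = -1  # assumption: docids start at 0, so every delta is >= 1
    for docid, tf in postings:
        bits.append(gamma_encode(docid - previous))
        bits.append(unary_encode(tf))
        previous = docid
    return "".join(bits)


def decode_postings(bits: str) -> List[Tuple[int, int]]:
    """Invert encode_postings."""
    postings = []
    pos = 0
    previous = -1
    while pos < len(bits):
        # Elias-Gamma: count leading zeros, then read that many + 1 bits.
        zeros = 0
        while bits[pos] == "0":
            zeros += 1
            pos += 1
        delta = int(bits[pos:pos + zeros + 1], 2)
        pos += zeros + 1
        # Unary: count one-bits up to the terminating zero.
        tf = 1
        while bits[pos] == "1":
            tf += 1
            pos += 1
        pos += 1  # skip the terminating zero
        previous += delta
        postings.append((previous, tf))
    return postings


if __name__ == "__main__":
    example = [(3, 1), (7, 2), (8, 1)]
    encoded = encode_postings(example)
    print(encoded, len(encoded), "bits")
    print(decode_postings(encoded))
```

In this sketch a term frequency of 1 costs a single bit and a frequency of 2 costs two bits, which is why unary coding suits short documents such as tweets.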
When a query is issued, results are aggregated from the different index "shards" (currently 5 shards, one for each of 5 distributed computing nodes). Once a tweet exceeds a certain age (the age threshold is a parameter), it is removed from the search results. Currently, the search demo uses a baseline ranking model for tweets. We have previously deployed more advanced and effective ranking models, as described in our paper entitled "University of Glasgow at TREC 2011: Experiments with Terrier in Crowdsourcing, Microblog, and Web Tracks".
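The query-time aggregation described above can be sketched as follows. This is a hypothetical illustration, not the demo's code: the `Hit` record, the `merge_shard_results` function, and the `max_age_seconds` parameter are invented names, and the sketch assumes that scores from different shards are directly comparable and that old tweets are filtered when results are merged (the actual system may instead drop them from the index).

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Hit:
    tweet_id: str
    score: float      # retrieval score assigned by the shard
    posted_at: float  # posting time, in seconds since the epoch


def merge_shard_results(
    shard_results: Iterable[List[Hit]],
    now: float,
    max_age_seconds: float,
    k: int,
) -> List[Hit]:
    """Return the k highest-scoring hits across all shards, skipping old tweets."""
    fresh = (
        hit
        for hits in shard_results
        for hit in hits
        if now - hit.posted_at <= max_age_seconds
    )
    # nlargest returns the hits in descending score order.
    return heapq.nlargest(k, fresh, key=lambda hit: hit.score)


if __name__ == "__main__":
    shards = [
        [Hit("a", 2.0, 100.0), Hit("b", 0.5, 990.0)],
        [Hit("c", 1.5, 950.0)],
        [Hit("d", 3.0, 995.0)],
    ]
    for hit in merge_shard_results(shards, now=1000.0, max_age_seconds=60.0, k=2):
        print(hit.tweet_id, hit.score)
```

Here tweet "a" has the second-highest score but is dropped because it is older than the threshold, so the merged top two come from the remaining shards.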
Future features
- Enhanced user interface to visualise the results.
- Refined ranking models for tweets on top of the Storm framework, e.g. by deploying learning to rank.
- Indexing streams from sensors.
- Aggregating results from both social and sensor streams.