The Insight Data Engineering Fellows Program is a free 7-week professional training program where you can build cutting-edge big data platforms and transition to a career in data engineering at top teams like Facebook, Uber, Slack and Squarespace.
During the seven-week Insight Data Engineering Fellows Program, recent grads and experienced software engineers learn the latest open source technologies by building a data platform to handle large, real-time datasets. Ryan Walker (now a Data Engineer at Casetext) discusses his project of building a streaming search platform.
On average, Twitter users worldwide generate about 6,000 tweets per second. Obviously, there is much interest in extracting real-time signal from this rich but noisy stream of data. More generally, there are many open and interesting problems in using high-velocity streaming text sources to track real-time events. In this post, I describe the key components of a platform that will allow for near real-time search of a streaming text data source such as the Twitter firehose.
Such a platform can have many applications far beyond monitoring Twitter. For example, a network of speech-to-text monitors could transcribe radio and television feeds and pass the transcriptions to the platform. When key phrases or features are found in the feeds, the platform could be configured to trigger real-time event management. This application is potentially relevant to finance, marketing, and other domains that depend on real-time information processing.
Streaming Search on demand
The key data structure for solving a traditional text search problem is an inverted index built from the collection of documents you want to be able to query. In its simplest form, an inverted index is just a map whose keys are the unique terms that appear in the documents. The value associated with a particular term is a list of all the documents that contain that term.
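To make the structure concrete, here is a minimal sketch of such a map in Python. The whitespace tokenizer, the function names, and the three sample documents are assumptions made for illustration; a real search engine would also normalize punctuation, apply stemming, and so on.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_inverted_index(documents):
    """Map each unique term to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "llama pajamas are cozy",
    2: "the llama ate grass",
    3: "red pajamas",
}
index = build_inverted_index(docs)
print(index["llama"])    # ids of documents containing "llama"
print(index["pajamas"])  # ids of documents containing "pajamas"
```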
After the index has been built, users can submit queries to run against it. For example, suppose a query should return all the documents that contain both words in the phrase “llama pajamas”. The query engine splits the input phrase into the tokens “llama” and “pajamas”, then checks the inverted index to get the list of all documents that contain the word “llama” and the list of all documents that contain the word “pajamas”. The engine then returns the intersection of these two lists, i.e. the documents that are present in both.
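The lookup-and-intersect step might look roughly like the following. The small hard-coded index and the function name `search_all_terms` are hypothetical; the sketch only assumes that the index maps each term to a list of document ids.

```python
def search_all_terms(index, phrase):
    """Return the ids of documents that contain every token in the phrase."""
    terms = phrase.lower().split()
    if not terms:
        return []
    # One posting set per token; a token missing from the index matches nothing.
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical inverted index: term -> ids of documents containing the term.
index = {
    "llama": [1, 2],
    "pajamas": [1, 3],
    "red": [3],
}
result = search_all_terms(index, "llama pajamas")
print(result)
```

Only document 1 appears in both posting lists, so it is the only id returned for this sample index.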
In the streaming case, documents arrive at a very fast rate (e.g. an average of 6,000 per second in the case of Twitter), and with this kind of velocity and volume it is impractical to build the inverted document index in real time. Moreover, the goal is not to create a static index of tweets; rather, it is to scan the tweets as they arrive and determine whether they match a registered query. Here’s where we can play a clever trick: instead of building our inverted index from the documents, we can build the index from the queries themselves.
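As a rough illustration of this inversion, the sketch below indexes registered queries by their terms and then checks each arriving document against that index. The class name `QueryIndex`, its methods, and the rule that a query matches when all of its terms occur in the document are assumptions for illustration, not a description of the platform's actual implementation.

```python
from collections import defaultdict


class QueryIndex:
    """Inverted index built from registered queries instead of documents."""

    def __init__(self):
        self.term_to_queries = defaultdict(set)  # term -> ids of queries using it
        self.query_terms = {}                    # query id -> set of its terms

    def register(self, query_id, phrase):
        """Index a query by each of its terms."""
        terms = set(phrase.lower().split())
        self.query_terms[query_id] = terms
        for term in terms:
            self.term_to_queries[term].add(query_id)

    def match(self, document):
        """Return the ids of registered queries whose terms all occur in the document."""
        doc_terms = set(document.lower().split())
        # Only queries sharing at least one term with the document are candidates.
        candidates = set()
        for term in doc_terms:
            candidates |= self.term_to_queries.get(term, set())
        return sorted(q for q in candidates if self.query_terms[q] <= doc_terms)


queries = QueryIndex()
queries.register("q1", "llama pajamas")
queries.register("q2", "grass")

# Each arriving document is checked once and never stored.
print(queries.match("my llama wears pajamas"))
print(queries.match("the llama ate grass"))
```

The work per document depends on the number of terms in that document and the queries that share them, not on how many documents have already streamed past.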