12.12.13

Classify Social Data With DataSift’s DIY Machine Learning

DataSift just released an intelligence engine called Vedo, which allows developers to automatically classify social data in order to run businesses better.

[Image: Flickr user Joshua Conley]

DataSift aggregates social data from sources like Twitter, Facebook, and blogs and lets you query it via a single API, but making effective use of that data has remained a big problem for many businesses, until today.


“A lot of the trouble that you have when you try to integrate social data into an actual business,” says DataSift’s CTO and founder, Nick Halstead, “is what is the pivot point to connect the social data to the business behavior? So you need to know what the discussion actually means rather than just finding that somebody is talking about you.”

Every business is different and therefore may use the same data in a different way. Are you an airline which wants to automatically classify new customer conversations as queries, complaints, or urgent requests? Or a B2B business which wants to determine the profession and seniority level of Twitter followers in order to identify leads? DataSift’s professional services team often builds custom tools like Dell’s Social Net Advocacy for this very reason. Companies with in-house data scientists might do the same using DataSift’s data.

DataSift’s new intelligence engine, Vedo, allows developers with no data science background to build custom models of customers and conversations and use them to automatically classify the real-time social data available via DataSift’s API. Vedo can even build those models for you using Machine Learning.

A classification model automatically sorts new examples of an object like a tweet or Twitter profile into a number of classes. DataSift created 30 classification models and released the code so that they can be used by developers as examples. One of DataSift’s sample models takes every new message sent to an airline’s customer service account and categorizes it as a “query,” “rant,” or “rave” based on the words used in the tweet.

Another example distinguishes Twitter bots from humans. “Tweets always come from a source like TweetDeck,” says Halstead. “There’s about 80,000 of them. If you see a tweet coming from TweetDeck you know for sure that a human sent it. There’s no way to fake that easily. TweetDeck users, on average, follow a daily 24-hour volume curve. We built a model to identify bots by knowing when they are tweeting at high volumes out of sync with that 24-hour cycle.”
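The idea Halstead describes can be sketched in a few lines of Python. This is only an illustration, not DataSift's model: the reference curve, the `min_tweets` and `threshold` values, and all function names are assumptions made up for the example. It builds an hourly share-of-volume curve for an account and measures how far it sits from a reference human curve.

```python
def hourly_profile(hours):
    """Turn a list of tweet hours (0-23) into a 24-bin share-of-volume curve."""
    counts = [0] * 24
    for hour in hours:
        counts[hour % 24] += 1
    total = sum(counts) or 1  # avoid dividing by zero for an empty list
    return [count / total for count in counts]


def divergence(profile, reference):
    """Total variation distance: 0 means identical curves, 1 means no overlap."""
    return 0.5 * sum(abs(p - r) for p, r in zip(profile, reference))


def looks_automated(hours, reference, min_tweets=50, threshold=0.4):
    """Flag high-volume accounts whose curve is far from the reference.

    min_tweets and threshold are arbitrary illustrative values.
    """
    if len(hours) < min_tweets:
        return False
    return divergence(hourly_profile(hours), reference) > threshold


# Hypothetical reference curve: humans tweet evenly between 08:00 and 23:59.
REFERENCE = [0.0] * 8 + [1 / 16] * 16

human_hours = list(range(8, 24)) * 5   # 80 tweets spread over the day
bot_hours = [3] * 100                  # 100 tweets, all at 3 a.m.
```

With these made-up inputs, `looks_automated(bot_hours, REFERENCE)` flags the account and `looks_automated(human_hours, REFERENCE)` does not. A real reference curve would have to be measured from known human sources such as the TweetDeck traffic Halstead mentions.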

DataSift’s query language, CSDL (Curated Stream Definition Language), is used to develop custom classification models. For example, you can create a rule like “tweets which contain the word ‘help.’” You can then add a tag to the rule, like “query,” which indicates that all tweets which match the rule are queries. Tweets which match a different rule might be classed as complaints.

The entire set of rules, which can be arbitrarily complex, forms the classifier model. Each element of a rule may also have a score attached to it, e.g., “help” might get a score of +10, while “can” gets a score of +1 when classifying a query. The rule with the highest score indicates the class of the tweet. Once completed, the classification model is uploaded to DataSift, where it can then be used to classify new or historical social data.
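To make the scoring idea concrete, here is a minimal Python sketch. It is not CSDL and does not use DataSift's API. The `RULES` table, its words and weights (apart from the “help” +10 and “can” +1 from the example above), and the `classify` function are all hypothetical.

```python
import re

# Hypothetical rules: one rule per class, each mapping words to scores.
RULES = {
    "query": {"help": 10, "can": 1, "how": 5},
    "complaint": {"delayed": 8, "lost": 8, "worst": 10},
}


def classify(text, rules=RULES):
    """Score the text against every rule and return the highest-scoring class.

    Returns (label, scores); label is None when no rule matches at all.
    """
    words = set(re.findall(r"[a-z']+", text.lower()))
    scores = {
        label: sum(score for word, score in weights.items() if word in words)
        for label, weights in rules.items()
    }
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores
```

Calling `classify("Can you help me rebook?")` scores the “query” rule at 11 and the “complaint” rule at 0, so the tweet is classed as a query. A tweet that matches no words is left unclassified.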


Vedo also allows you to programmatically train and use a Bayesian classifier, a machine learning algorithm which automatically learns a classification model. Instead of building a set of rules manually, you take a set of tweets or whatever social data you want to classify, and manually attach a label, like “query” or “complaint,” to each one.

These labeled examples are called a training set. Each tweet will have a number of features like the words contained in the tweet. A Bayesian classifier builds a probabilistic model of feature values using the training examples. If, for example, the word “help” appears much more frequently in queries than complaints, then the classifier will consider it more likely that a new unlabeled tweet is a query if it contains the word “help.” The model the Bayesian classifier learns from the training set can be used to predict the class of new tweets based on their feature values. While the accuracy of a classification model built manually depends on the domain knowledge of the developer, that of a Bayesian classifier depends on how representative the training set is of the true population of tweets.
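The following Python sketch shows how such a classifier can learn from labeled tweets. It assumes a naive Bayes model over word counts with add-one smoothing, which is one common form of Bayesian text classifier. The article does not say which variant Vedo uses, and the training tweets and function names here are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest posterior log-probability."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_logp = None, -math.inf
    for label, count in label_counts.items():
        logp = math.log(count / total)  # prior: share of training examples
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                logp += math.log((word_counts[label][word] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Made-up training set for illustration only.
TRAINING_SET = [
    ("can you help me change my flight", "query"),
    ("please help, how do I check in", "query"),
    ("my flight was delayed again, terrible service", "complaint"),
    ("you lost my bag and nobody cares", "complaint"),
]
model = train(TRAINING_SET)
```

Because “help” appears only in the query examples, `predict(model, "need help with my booking")` comes out as a query, while `predict(model, "my bag was lost")` comes out as a complaint. With only four training tweets the model is a toy; as the paragraph above notes, real accuracy depends on how representative the training set is.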

DataSift also takes care of the low-level text processing required prior to running Machine Learning. “You can’t even start doing text processing in Chinese or Japanese until you have word-chunked the characters because Chinese and Japanese have no white spaces,” says Halstead.

“We have spent the last year reading Japanese and Chinese, lexically parsing it, inserting fake spaces where the whitespace should be (between words) before we try and look for a word. You can’t do Machine Learning against Chinese or Japanese unless you have done the chunking.”
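The chunking step can be sketched with forward maximum matching, a simple dictionary-based segmentation technique. This is an assumption chosen for illustration. The article does not describe DataSift's parser, and the tiny dictionary and the `segment` function below are hypothetical.

```python
def segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position, take the longest
    dictionary word that fits, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Tiny made-up dictionary; a real one would hold many thousands of words.
DICTIONARY = {"我", "喜欢", "机器", "学习", "机器学习"}

# Insert the "fake spaces" between the words that were found.
print(" ".join(segment("我喜欢机器学习", DICTIONARY)))
```

Once the text has been split this way, each chunk can be treated as a word feature, just like whitespace-separated words in English.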

DataSift, which just raised a new funding round of $42 million, sees analyzing enterprise social data as the next frontier. “In Q1, we will open up a public API where you can insert any unstructured data into DataSift,” says Halstead.

“McKinsey says that 70% of the world’s business data is unstructured. Instead of looking at public social data you can suck in your employee discussions and see what’s going on inside your business. We already ingest Yammer data. Every day DataSift is working out what topics people are discussing. What big pushes are going on? Are the employees really getting behind it?”