Date: 2014-11-18

If you’ve ever had to merge two Excel spreadsheets, then you know how challenging it can be to work with data from multiple sources. At LinkedIn, where tables with hundreds of millions of records are far from unheard of, simply merging those giant datasets for routine queries began to take up massive amounts of time and resources.

“It slowed down the queries and even made them infeasible to execute in our Hadoop environment,” says LinkedIn engineer Srinivas Vemuri, who worked on the number-crunching pipeline for XLNT, LinkedIn’s A/B testing platform. “The size of the intermediate output of the join was explosive.”

That led engineers working on XLNT to carefully craft a suite of Java code that pulls in the necessary data from across the company, according to a blog post by Vemuri and fellow engineer Maneesh Varshney.

“Written completely in Java and built using several novel primitives, the new system proved effective in handling joins and aggregations on hefty datasets which allowed us to successfully launch [the framework called] XLNT,” the engineers wrote earlier this month.

The code divides the rows of each table into manageably sized blocks, guaranteeing that rows from different tables referring to the same user land in corresponding blocks. That lets many useful statistics, such as tallies and averages, be computed block by block without ever storing the whole merged dataset in memory, which made it possible to generate A/B test results in a reasonable amount of time.
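To make the block-by-block idea concrete, here is a minimal Python sketch. It is not LinkedIn's code, which the engineers say is written in Java. The table layouts, the function names `partition` and `blockwise_mean_clicks`, and the use of `user_id % num_blocks` to assign rows to blocks are all assumptions made for illustration.

```python
from collections import defaultdict


def partition(rows, num_blocks):
    """Split rows into blocks by user_id (the first field of each row).

    Applying the same rule to every table means rows for one user
    always end up at the same block index in each table.
    """
    blocks = [[] for _ in range(num_blocks)]
    for row in rows:
        blocks[row[0] % num_blocks].append(row)
    return blocks


def blockwise_mean_clicks(assignments, events, num_blocks=4):
    """Average clicks per user for each test variant, one block pair at a time.

    assignments: (user_id, variant) rows   -- hypothetical layout
    events:      (user_id, clicks) rows    -- hypothetical layout
    """
    a_blocks = partition(assignments, num_blocks)
    e_blocks = partition(events, num_blocks)

    # Running totals per variant: [click_sum, user_count].
    totals = defaultdict(lambda: [0, 0])

    for a_block, e_block in zip(a_blocks, e_blocks):
        # Only this pair of blocks is joined in memory.
        clicks_by_user = defaultdict(int)
        for user_id, clicks in e_block:
            clicks_by_user[user_id] += clicks
        for user_id, variant in a_block:
            totals[variant][0] += clicks_by_user.get(user_id, 0)
            totals[variant][1] += 1

    return {variant: s / n for variant, (s, n) in totals.items()}


assignments = [(1, "A"), (2, "B"), (3, "A"), (4, "B"), (5, "A")]
events = [(1, 3), (2, 1), (3, 2), (1, 1), (4, 5)]
result = blockwise_mean_clicks(assignments, events)
```

Because sums and counts can be added up across blocks, the final averages do not depend on how many blocks are used. Only one pair of blocks needs to be held in memory at a time.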

“However, we soon became victims of our own success,” wrote the engineers. “We were faced with extended requirements as well as new use cases from other domains of data analytics. Adding to the challenge was maintaining the Java code and in some cases, rewriting large portions to accommodate various applications.”

The company decided to build a general-purpose tool on that block-by-block principle, creating an open-source framework called Cubert. Varshney, Vemuri, and other LinkedIn staff described the principles in detail in a conference paper published in September.