Up until a decade ago, most companies made do with simple statistics and offline reporting, relying on traditional database management systems (DBMSs) to meet their basic business intelligence needs. This model prevailed in a time when data was small and analysis was simple.
But data has gone from being scarce to superabundant, and now companies want to leverage this wealth of information in order to make smarter business decisions. This data explosion has given rise to a host of new analytics platforms aimed at flexible processing in the cloud. Well-known systems like Hadoop and Spark are built upon the MapReduce paradigm and fulfill a role beyond the capabilities of traditional DBMSs. These systems, however, are engineered for deployment on hundreds or thousands of cheap commodity machines, whereas non-tech companies like banks or retailers rarely operate clusters larger than a few dozen nodes. Analytics platforms, then, should no longer be built specifically to accommodate the bottlenecks of large cloud deployments, focusing instead on small clusters with more reliable hardware.
Furthermore, computational complexity is rapidly increasing, as companies seek to incorporate advanced data mining and probabilistic models into their business intelligence repertoire. Users commonly express these types of tasks as a workflow of user-defined functions (UDFs), and they want the ability to compose jobs in their favorite programming language. Yet, existing analytics systems fail to adequately serve this new generation of highly complex, UDF-centric jobs, especially when companies have limited resources or require sub-second response times. So what is the next logical step?
It’s time for a new breed of systems. In particular, a platform geared toward modern analytics needs the ability to (1) concisely express complex workflows, (2) optimize specifically for UDFs, and (3) leverage the characteristics of the underlying hardware. To meet these requirements, the Database Group at Brown University is developing Tupleware, a parallel high-performance UDF processing system that considers the data, computations, and hardware together to produce results as efficiently as possible.
Concisely Express Workflows
Existing systems based upon the MapReduce paradigm require users to write hundreds of lines of code to express even basic analytics tasks. Tupleware takes a new approach that merges traditional SQL with functional programming to obtain the best of both worlds: we retain the optimization potential and familiarity of SQL while incorporating the flexibility and expressiveness of functional languages. Furthermore, Tupleware users are not bound to a single programming language. By building upon the popular LLVM compiler framework, the system can integrate UDFs written in any language that has an LLVM compiler, even mixing languages to compose a single job. C/C++, Python, Ruby, Haskell, Julia, R, and many other languages already have LLVM compilers, and we expect other languages to adopt LLVM in the near future.
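To make this concrete, a workflow in such a merged SQL-plus-functional style might read like the following Python sketch. The `Workflow` class and its `map`/`filter`/`reduce` methods are hypothetical names chosen only for illustration; they are not Tupleware's actual API.

```python
# Hypothetical sketch of a functional-style analytics workflow.
# The Workflow class and its methods are illustrative inventions,
# not Tupleware's real interface.

class Workflow:
    def __init__(self, data):
        self.data = list(data)

    def map(self, udf):                 # apply a UDF to every tuple
        self.data = [udf(t) for t in self.data]
        return self

    def filter(self, pred):             # keep tuples satisfying a UDF predicate
        self.data = [t for t in self.data if pred(t)]
        return self

    def reduce(self, udf, init):        # fold the remaining tuples into one value
        acc = init
        for t in self.data:
            acc = udf(acc, t)
        return acc

# A small job expressed as a chain of UDFs:
total = (Workflow([1.0, 4.0, 9.0, 16.0])
         .map(lambda x: x ** 0.5)      # user-defined transformation
         .filter(lambda x: x >= 2.0)   # user-defined predicate
         .reduce(lambda a, x: a + x, 0.0))
print(total)                           # prints 9.0
```

The point of the chained style is that an entire multi-stage job fits in a handful of lines, while each stage remains an ordinary function the user writes in their own language.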
Optimize for UDFs
Since the advent of DBMS research, a considerable amount of work has been devoted to the problem of SQL query optimization, but relatively little has been done to optimize custom UDF workflows. All traditional systems treat UDFs as black boxes, and thus they can never make informed decisions about how best to execute a given workflow. On the other hand, Tupleware combines ideas from the database and compiler communities by performing UDF introspection, which allows the system to reason about the expected behavior of individual UDFs in order to achieve optimal performance. Thus, our system can optimize workflows across UDF boundaries, seamlessly integrating user-specified computations with the overarching control flow.
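One concrete payoff of looking inside UDFs is classic loop fusion: two stages that a black-box system would execute as separate passes over the data can be combined into a single pass with no intermediate result. The Python sketch below illustrates the general compiler technique, not Tupleware's internal code; the function names are invented for the example.

```python
# Two UDF stages, first executed black-box style as separate passes.

def scale(t):      # first UDF: normalize a raw value
    return t / 100.0

def score(t):      # second UDF: threshold the normalized value
    return 1.0 if t > 0.5 else 0.0

data = [30, 80, 55, 10]

# Black-box execution: the intermediate list is fully materialized.
two_passes = [score(v) for v in [scale(t) for t in data]]

# After introspection, the stages can be fused into one pass,
# eliminating the intermediate result entirely.
def fused(t):
    return score(scale(t))

one_pass = [fused(t) for t in data]

assert one_pass == two_passes   # same answer, half the data traffic
print(one_pass)                 # prints [0.0, 1.0, 1.0, 0.0]
```

A system that treats `scale` and `score` as opaque can never perform this rewrite; one that can see their bodies applies it mechanically.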
Leverage Underlying Hardware
Modern analytics requires a variety of CPU-intensive computations. Whereas other systems fail to fully utilize the available computing resources, Tupleware optimizes for the low-level characteristics of the underlying hardware, including SIMD units, memory bandwidth, CPU caches, and branch prediction. In a process called program synthesis, our system translates workflows directly into compact and highly optimized distributed executables. This approach is built for maximum performance per node and avoids all of the overhead inherent to traditional systems.
Our initial benchmarks, based on common machine learning algorithms, demonstrate the superior performance of Tupleware relative to alternative analytics platforms. In particular, we compare our system to the industry standard Hadoop, the in-memory analytics framework Spark, and a commercial column DBMS (System X). Tupleware outperforms these systems by up to three orders of magnitude in both distributed and single-machine setups.
[Figure: Speedup over other systems (TO = timed out, FAIL = memory failure)]
Tupleware is a research project under development by the Database Group at Brown University. For more information, please visit our website.