As machine learning improves, the appeal of making ground-breaking (and money-making) discoveries in our mountains of data is only increasing. But medical researchers, marketing executives, and professionals of all kinds aren’t data scientists, and they often can’t afford to hire them. This means that daunting questions surrounding data analysis remain: how do you start, which machine learning tools do you use, and how do you trust the results?
A new solution from researchers at Brown University and MIT reduces the uncertainty around all three of those questions. After five years of research and evolution, their latest creation is Northstar, a cloud-based interactive data system accessed through a web app. It’s domain-independent, particularly well-suited for large data sets with many features, and it lets users pose specific questions about their data or simply explore it at will. Northstar is usable by anyone, the researchers say, with no data science expertise necessary and no need for users to limit themselves to a particular machine learning tool.
And the feature that may prove the most valuable to the non-scientist? Statistical safeguards that warn users about false conclusions.
“The gateway to our work for people who aren’t scientists,” says Professor Eli Upfal of Brown, “is the user interface, because it shows them that anyone can use our machine learning tools.” Upfal’s co-authors include Tim Kraska (Associate Professor at MIT and Adjunct Associate Professor at Brown), Emanuel Zgraggen (a Postdoctoral Associate at MIT who also worked on the system while earning his PhD at Brown), Brown CS doctoral students Benedetto Buratti, Lorenzo De Stefani, Philipp Eichmann, and Leonhard Spiegelberg, and Zeyuan Shang, an MIT doctoral student.
After launching the app and loading one or more datasets, users are presented with what the researchers describe as a blank canvas. Using a finger or digital pen, they touch the datasets box to choose a dataset, which automatically populates the attributes box with the dataset’s features: for a medical researcher, these might include heart hypertensive, heart ischemic, and metabolic. Touching one or more of these causes a graph to appear, and when users drag and drop two graphs close to each other, a yellow line appears between them. These features are now linked, with yellow bars appearing on the graphs to show each attribute in relation to the other.
“If you’re not a scientist,” says Emanuel Zgraggen, “you need to see the data to spot patterns. We provide something very unusual: a simple touch interface, all the room you need, and the freedom to work in a way that makes sense for you. Nothing is static, so you can play around, drill down, find interesting details, and then bring them together for analysis.”
Machine learning enters the picture at a touch of the operators box, which connects to a library of algorithms that users can employ for predictive analysis. For example, our medical researcher can evaluate the likelihood of heart failure based on certain conditions, but other users might want to classify images (“Does this photo have a car in it?”) or find the best way to compress audio recordings effectively.
But which algorithm? Left to their own devices, users would need to pick one, apply it, look at the results, repeat the process with other algorithms, and then compare. None of this is necessary with Northstar. When the user touches AutoML, the system automatically asks itself which algorithm to use and then which parameters to apply. A blank box appears, ready for the user to drag and drop heart ischemic or other features into. Thanks to the approximation engine (Interactive Data Exploration Accelerator, or IDEA), results appear in less than a second, but even while that’s happening, the system is running other algorithms in parallel threads and using encoded rules to pick the best one.
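To make the idea of trying several algorithms in parallel and keeping the best one concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Northstar’s implementation: the two toy classifiers, the `pick_best` helper, the accuracy-on-a-holdout selection rule, and the synthetic data are all hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
import random

# Hypothetical candidate "algorithms": each takes training rows of
# (feature, label) and returns a prediction function.
def majority_class(train):
    """Baseline: always predict the most common label."""
    labels = [y for _, y in train]
    top = max(set(labels), key=labels.count)
    return lambda x: top

def threshold_rule(train):
    """Predict 1 when the feature exceeds the midpoint of the class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > cut else 0

def accuracy(model, rows):
    """Fraction of rows the model labels correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)

def pick_best(candidates, train, holdout):
    """Fit each candidate in its own thread; return (score, name) of the best."""
    def fit_and_score(item):
        name, fit = item
        return accuracy(fit(train), holdout), name
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(fit_and_score, candidates.items()))
    return max(scores)

# Synthetic data (an assumption for this sketch): label-1 rows tend to
# have a higher feature value than label-0 rows.
rng = random.Random(0)
rows = []
for _ in range(400):
    y = rng.randint(0, 1)
    rows.append((rng.gauss(2.0 if y else 0.0, 1.0), y))
train, holdout = rows[:300], rows[300:]

candidates = {"majority_class": majority_class, "threshold_rule": threshold_rule}
best_score, best_name = pick_best(candidates, train, holdout)
print(best_name, best_score)
```

Scoring on rows the candidates never saw during fitting is one common way to compare them fairly; a real system would also search over each algorithm’s parameters.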
“It’s interactive regardless of your data size,” says Tim Kraska. “We know that’s crucial: people lose engagement and explore less if there’s a delay of even 500 milliseconds, so we provide low latency that lets you stay on task without waiting for things to load.”
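One general way to stay interactive regardless of data size is to show an estimate computed from a sample and refine it as more rows are scanned. The Python sketch below illustrates that idea only; it does not describe how IDEA works internally, and the `progressive_mean` function, its batch size, and the synthetic values are assumptions made for the example.

```python
import math
import random

def progressive_mean(values, batch_size=1000, z=1.96):
    """Yield (rows_seen, estimate, margin) after each batch of a shuffled scan.

    The margin is an approximate 95% confidence half-width based on the
    normal approximation; it shrinks as more rows are included.
    """
    order = list(values)
    random.Random(0).shuffle(order)  # random order so early batches are unbiased
    n, total, total_sq = 0, 0.0, 0.0
    for start in range(0, len(order), batch_size):
        for v in order[start:start + batch_size]:
            n += 1
            total += v
            total_sq += v * v
        mean = total / n
        variance = max(total_sq / n - mean * mean, 0.0)
        yield n, mean, z * math.sqrt(variance / n)

# Synthetic values standing in for one numeric attribute (hypothetical).
rng = random.Random(1)
data = [rng.gauss(120, 15) for _ in range(20000)]

steps = list(progressive_mean(data))
for rows_seen, estimate, margin in steps[:3]:
    print(f"{rows_seen} rows: {estimate:.2f} +/- {margin:.2f}")
```

The first estimate is available after a single batch, and each later batch tightens the margin, so a user sees a usable answer immediately while the exact one is still being computed.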
Northstar’s AutoML has already achieved improvements over multiple solutions that previously represented the state of the art. In a recent test, it performed better than Microsoft’s Azure products, and it has consistently ranked among the best systems in the DARPA D3M Automatic Machine Learning project. A new datamart feature can even suggest other datasets that might be helpful for users. For instance, a researcher doing demographic studies of incarceration and poverty might be interested in analyzing education levels at the same time.
“After a great deal of interest from multiple industries,” says Kraska, “a business venture is now in the works to unfold Northstar’s potential within companies by adding the necessary industry-specific functionality and customer support. We just started our pilot program and are looking forward to seeing how Northstar can transform the way people analyze their data and arrive at actionable insights.”
But for those using Northstar on their own, what sort of results can we expect from turning a novice loose with the chemistry set of machine learning? As a very simple example, our medical researcher might see a link between smoking cigarettes and being left-handed. Interesting, but only until she realizes that she’s looking at ten people in a dataset of more than a million. It’s clearly an anomaly, but without the benefit of data science expertise or review by a quality assurance team, it’s easy to be deceived.
To prevent this, Northstar adds a safety net for the non-scientist. As always, without any programming needed by the user, it creates a clear visualization that attempts to quantify the quality of all results. Statistically meaningful results are highlighted in green; results that are less certain appear in red.
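The article does not say which statistical tests Northstar applies, so the following Python sketch shows only one textbook way such a green-or-red flag could be produced: a two-proportion z-test with a Bonferroni correction for the number of comparisons a user has made. The function names, the 0.05 threshold, and the counts are all hypothetical.

```python
import math

def two_proportion_p_value(hits_a, n_a, hits_b, n_b):
    """Two-sided p-value for the hypothesis that both groups share one rate.

    Uses the normal approximation, which is rough for very small groups;
    an exact test would be preferable there.
    """
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (hits_a / n_a - hits_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

def flag(p_value, tests_run, alpha=0.05):
    """Bonferroni correction: split alpha across every comparison made so far."""
    return "green" if p_value < alpha / tests_run else "red"

# Hypothetical counts. A tiny group with a large apparent difference:
# 4 of 10 people (40%) versus 200,000 of 1,000,000 (20%).
small_p = two_proportion_p_value(4, 10, 200_000, 1_000_000)

# A large group with a small difference:
# 21,000 of 100,000 (21%) versus 200,000 of 1,000,000 (20%).
large_p = two_proportion_p_value(21_000, 100_000, 200_000, 1_000_000)

# Assume the user has explored 20 comparisons during the session.
print("small group:", flag(small_p, tests_run=20))
print("large group:", flag(large_p, tests_run=20))
```

The striking 40 percent rate in the ten-person group is not enough evidence to rule out chance, while the modest one-point difference in the large group is. Dividing the threshold by the number of comparisons reflects the fact that the more a user explores, the more likely it is that some pattern looks impressive by accident.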
“We want you to know how confident you should be in the answers you find,” says Benedetto Buratti. “Statistical significance is embedded in the output, so you don’t jump to conclusions that aren’t supported by the data.”
“People talk about democratizing data science,” says Eli Upfal, “but very few have looked at the problem of drawing false conclusions. We want everyone to use our system, but Northstar isn’t just about producing results; it’s about giving you confidence that those results are significant.”
For more information, contact Brown CS Communication Outreach Specialist Jesse C. Polhemus.