2017 Poster Sessions: Weakening the Supervision Bottleneck

Student Names: Alex Ratner, Paroma Varma
Advisor: Christopher Ré
Research Areas: Computer Systems
Abstract:
Today's state-of-the-art machine learning models require massive labeled training sets, which usually do not exist for real-world applications. Instead, we present a newly proposed machine learning paradigm, data programming, and a system built around it, Snorkel, in which the developer focuses on writing a set of labeling functions: scripts that programmatically label data. The resulting labels are noisy, but we model the labeling process as a generative model, essentially learning which labeling functions are more accurate than others, and then use this model to train an end discriminative model (for example, a deep neural network in TensorFlow). Under certain conditions, we show that this method has the same asymptotic scaling of generalization error as directly supervised approaches. Empirically, we find that by modeling the noisy training set creation process in this way, we can take potentially low-quality labeling functions from the user and use them to train high-quality end models. We see this approach, and the extensions we will present, as a general framework for many weak supervision techniques and, at a higher level, as a new programming model for weakly supervised machine learning systems.
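To make the paradigm concrete, below is a minimal, illustrative sketch of what labeling functions and their combination might look like. The function names, the abstain convention, and the accuracy-weighted vote are our illustrative assumptions for this sketch, not Snorkel's actual API; in particular, the real system learns the per-labeling-function accuracies with a generative model from unlabeled data, whereas here we take them as given.

```python
import numpy as np

# Label conventions (assumed for this sketch): 1 = positive, -1 = negative,
# 0 = abstain (the labeling function declines to vote on this example).
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0

# --- Labeling functions: simple scripts that programmatically label data ---

def lf_contains_causes(example):
    # Heuristic: the word "causes" suggests a positive relation mention.
    return POSITIVE if "causes" in example.lower() else ABSTAIN

def lf_explicit_negation(example):
    # Heuristic: explicit negation suggests a negative label.
    return NEGATIVE if "does not cause" in example.lower() else ABSTAIN

def lf_too_long(example):
    # Heuristic: very long sentences are often spurious matches.
    return NEGATIVE if len(example.split()) > 50 else ABSTAIN

labeling_functions = [lf_contains_causes, lf_explicit_negation, lf_too_long]

def apply_lfs(examples):
    """Build the (num_examples x num_LFs) matrix of noisy votes."""
    return np.array([[lf(x) for lf in labeling_functions] for x in examples])

def probabilistic_labels(L, accuracies):
    """Combine noisy votes into soft labels via an accuracy-weighted vote.

    `accuracies` stands in for the per-LF accuracy parameters that the
    generative model would learn; here they are fixed purely for illustration.
    """
    # Weight each LF's vote by the log-odds of its assumed accuracy.
    weights = np.log(accuracies / (1 - accuracies))
    scores = L @ weights  # abstains (0) contribute nothing to the vote
    return 1 / (1 + np.exp(-scores))  # P(y = positive | votes)

examples = [
    "Smoking causes lung cancer.",
    "The study found the drug does not cause headaches.",
]
L = apply_lfs(examples)
probs = probabilistic_labels(L, accuracies=np.array([0.9, 0.8, 0.6]))
print(probs)  # soft labels used to train the end discriminative model
```

The resulting probabilistic labels, rather than hard majority votes, are what get passed to the end discriminative model as training signal.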

Bios:
Alex Ratner is a third-year PhD student advised by Chris Ré at the Stanford InfoLab, where he works on new machine learning paradigms for settings where limited or no hand-labeled training data is available, motivated in particular by information extraction problems in domains like genomics, clinical diagnostics, and political science. He co-leads the development of the Snorkel framework for lightweight information extraction (snorkel.stanford.edu).

Paroma is a second-year Electrical Engineering PhD student advised by Professor Chris Ré. She is primarily interested in practical methods for machine learning, focusing on the problem of creating high-quality training data. She has studied how generative models can be used for this purpose and is currently looking at how weak supervision can be used to efficiently label image data. She is supported by the NSF and SGF Fellowships.