
Show HN: Programmatic – a REPL for creating labeled data https://ift.tt/dzHNJq9

Hey HN, I'm Jordan, cofounder of Humanloop (YC S20), and I'm excited to show you Programmatic: an annotation tool for building large labeled datasets for NLP without manual annotation.

Programmatic is like a REPL for data annotation. You:

1. Write simple rules/functions that can approximately label the data
2. Get near-instant feedback across your entire corpus
3. Iterate and improve your rules

Finally, it uses a Bayesian label model [1] to convert these noisy annotations into a single, large, clean dataset, which you can then use for training machine learning models. You can programmatically label millions of datapoints in the time taken to hand-label hundreds.

What we do differently from weak supervision packages like Snorkel/skweak [1] is to focus on UI to give near-instantaneous feedback. We love these packages, but when we tried to iterate on labeling functions we had to write a ton of boilerplate code and wrestle with pandas to understand what was going on. Building a dataset programmatically requires you to grok the impact of labeling rules on a whole corpus of text. We've been told that the exploration tools and feedback make the process feel game-like and even fun (!!).

We built it because we see that getting labeled data remains a blocker for businesses using NLP today. We have a platform for active learning (see our Launch HN [2]), but we wanted to give software engineers and data scientists a way to build the datasets they need themselves and to make the best use of subject-matter experts' time.

The package is free and you can install it now as a pip package [3]. It supports NER / span extraction tasks at the moment, and document classification will be added soon. To help improve it, we'd love to hear your feedback or any successes/failures you've had with weak supervision in the past.
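To make the write-rules, get-feedback, iterate loop concrete, here is a minimal sketch in plain Python. It is not Programmatic's API: the function names (`lf_money`, `lf_org_suffix`, `lf_known_orgs`, `coverage`, `majority_vote`), the toy corpus, and the label names are all hypothetical. It also swaps the Bayesian label model for a simple majority vote over exact spans, which is only a stand-in and ignores how accurate each rule is.

```python
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple

# A span is (start, end, label) in character offsets. Hypothetical format.
Span = Tuple[int, int, str]
LabelingFunction = Callable[[str], List[Span]]


def lf_money(text: str) -> List[Span]:
    """Label dollar amounts such as $1,200 or $3.50 as MONEY."""
    pattern = r"\$\d+(?:,\d{3})*(?:\.\d+)?"
    return [(m.start(), m.end(), "MONEY") for m in re.finditer(pattern, text)]


def lf_org_suffix(text: str) -> List[Span]:
    """Label capitalized names ending in Inc, Ltd or Corp as ORG."""
    pattern = r"\b(?:[A-Z][a-z]+ )+(?:Inc|Ltd|Corp)\b"
    return [(m.start(), m.end(), "ORG") for m in re.finditer(pattern, text)]


def lf_known_orgs(text: str) -> List[Span]:
    """Label names from a small, made-up gazetteer as ORG."""
    spans: List[Span] = []
    for name in ("Acme Corp", "Globex"):
        for m in re.finditer(re.escape(name), text):
            spans.append((m.start(), m.end(), "ORG"))
    return spans


def coverage(lfs: List[LabelingFunction], corpus: List[str]) -> Dict[str, float]:
    """Fraction of documents on which each rule fires at least once."""
    return {
        lf.__name__: sum(1 for doc in corpus if lf(doc)) / len(corpus)
        for lf in lfs
    }


def majority_vote(lfs: List[LabelingFunction], text: str) -> List[Span]:
    """Keep the most-voted label for each exact span.

    A stand-in for a real label model, which would also weigh rule accuracy.
    """
    votes: Dict[Tuple[int, int], Counter] = {}
    for lf in lfs:
        for start, end, label in lf(text):
            votes.setdefault((start, end), Counter())[label] += 1
    return sorted(
        (start, end, counts.most_common(1)[0][0])
        for (start, end), counts in votes.items()
    )


corpus = [
    "Acme Corp paid $1,200 to Globex.",
    "The invoice from Initech Inc was $3.50.",
    "No entities here.",
]
LFS = [lf_money, lf_org_suffix, lf_known_orgs]

if __name__ == "__main__":
    print(coverage(LFS, corpus))  # per-rule feedback across the whole corpus
    for doc in corpus:
        print(doc, majority_vote(LFS, doc))
```

Re-running `coverage` after each rule change is the crude, no-UI version of the feedback loop: a rule that fires on very few documents, or that disagrees with the others on the same span, is the next one to look at.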
[1]: We use an HMM for NER tasks and Naive Bayes for classification, following the two approaches given in the papers below:

Pierre Lison, Jeremy Barnes, and Aliaksandr Hubin. "skweak: Weak Supervision Made Easy for NLP." https://ift.tt/rCsUQqy (2021)

Alex Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, Chris Ré. "Data Programming: Creating Large Training Sets, Quickly." https://ift.tt/NpztrfE (NIPS 2016)

[2]: Our Launch HN for our main active learning platform, Humanloop: https://ift.tt/puJhGLo

[3]: You can install it directly here: https://ift.tt/OqgB267... https://ift.tt/T1xHpaS

April 8, 2022 at 04:35PM
