Building better ML training data with transparency and control

The Challenge

As demand for online learning surged, the team needed to scale text annotations fast. Their existing tools, including Amazon SageMaker and Prodigy, couldn’t handle the volume or provide the visibility required to manage quality at scale.

The Approach

They switched to Databrewery for its collaborative text annotation tools, permission controls, and built-in QA. This gave them the ability to manage large teams, maintain consistency, and monitor data quality in real time.

The Outcome

With Databrewery, the team delivered hundreds of thousands of annotations quickly and efficiently. They scaled without losing control, tracking both productivity and data quality across every project.

This education-focused team had previously relied on a mix of tools, including Amazon SageMaker Ground Truth, Prodigy, and their own internal platform, to handle text data labeling. But as usage of their platform exploded during the pandemic, the volume of text annotations required to support their ML models quickly outgrew that setup. What once worked for smaller-scale needs was no longer viable.

They turned to Databrewery to help them handle the surge. Much of their raw data came from screenshots of students’ answers, so they first had to extract and clean OCR output before annotation could even begin. To improve quality, they built a custom comparison system directly inside Databrewery to measure OCR accuracy across sources. The platform’s flexibility gave them the control they needed to manage complex workflows, including permissions, dataset tracking, and advanced QA through consensus review.
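The core idea behind comparing OCR accuracy across sources can be sketched as scoring each engine's output against a hand-verified reference transcription. This is a minimal illustration, not the team's actual system; the engine names and texts below are hypothetical, and it uses Python's standard-library `difflib` as the similarity measure:

```python
from difflib import SequenceMatcher

def ocr_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio between two OCR transcriptions (0.0 to 1.0)."""
    # Normalize whitespace and case so cosmetic differences don't dominate the score.
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

def best_ocr_source(candidates: dict, reference: str) -> str:
    """Return the OCR source whose output most closely matches the reference."""
    return max(candidates, key=lambda engine: ocr_similarity(candidates[engine], reference))

# Hypothetical outputs from two OCR engines for the same student screenshot.
outputs = {
    "engine_a": "The mitochondria is the powerhouse of the cell",
    "engine_b": "The rnitochondria 1s the powerhouse of the ce11",
}
gold = "The mitochondria is the powerhouse of the cell"
print(best_ocr_source(outputs, gold))  # engine_a: it matches the reference exactly
```

In practice a team might swap in a word-error-rate metric, but any pairwise similarity score supports the same routing decision: pick the cleanest source before annotation begins.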

Three months in, the team had already noticed a major shift. “With other platforms, we submitted responses and that was it; we had no control or visibility after the fact,” said Maya Krishnan, Lead Data Manager. “With Databrewery, we can monitor label volume, fix mistakes, rerun QA, and track how productive our labelers are. It’s night and day.”
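The consensus review used for QA can be sketched as a majority vote across annotators, with ambiguous items routed back for expert review. This is a simplified illustration under assumed rules (simple majority with a configurable threshold), not a description of Databrewery's internal mechanics:

```python
from collections import Counter
from typing import Optional

def consensus_label(votes: list, threshold: float = 0.5) -> Optional[str]:
    """Return the majority label if its vote share exceeds the threshold,
    otherwise None to flag the item for expert review."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > threshold else None

# Three annotators agree -> label is accepted automatically.
print(consensus_label(["positive", "positive", "negative"]))  # positive
# A tie fails the threshold -> item is escalated to a reviewer.
print(consensus_label(["positive", "negative"]))              # None
```

Raising the threshold trades annotation throughput for label quality, which is the kind of dial the real-time quality monitoring described above would inform.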

The AI models powering their tools now play a more active role in guiding the annotation process, speeding up expert reviews and improving question-answer matching. This smoother, more responsive data pipeline is helping them deliver better recommendations and drive stronger student learning outcomes, all while cutting down on time and manual effort.