Accelerating ML breakthroughs with cleaner, smarter training data

The Challenge

Working with historical data like old census records comes with its own set of complications: inconsistent formats, unclear handwriting, and lots of ambiguity. The team knew that improving their machine learning models wouldn’t be enough on its own. To make real progress, they needed to shift their focus from just building models to improving the quality of the data that fed them. But with outdated tools and slow processes, labeling was becoming a bottleneck that slowed down their entire MLOps pipeline.

The Approach

To remove the roadblocks, the team turned to Databrewery. With its advanced annotation platform, they were able to automate repetitive tasks, organize complex label relationships, and streamline collaboration with domain experts. Databrewery Boost gave them access to high-quality labelers who understood the unique demands of working with sensitive historical data, removing the need for constant internal supervision.

The Outcome

With Databrewery in place, the team was able to move faster without sacrificing quality. Training data was clearer, models improved faster, and cross-functional collaboration became effortless. What once required weeks of back-and-forth could now be completed in days. As a result, the team unlocked deeper insights from their data and pushed forward ML projects that had previously been stuck.

Data Labeling

Combining deep historical records, family trees, and DNA samples, the team works on extracting structured genealogical insights from highly unstructured, complex data. They’ve invested heavily in neural networks and transformer-based models, but as projects grew in size and complexity, it became clear that model performance was limited by one thing: the quality and speed of their training data. To move faster, they shifted toward a more data-centric approach and began rethinking how labeling fit into their MLOps pipeline.
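
The case study doesn't detail the team's extraction stack, but a minimal sketch of this kind of structured extraction, using the open-source Hugging Face transformers library and the public dslim/bert-base-NER checkpoint as stand-ins (both assumptions, not the team's actual models), might look like this:

```python
# Minimal sketch: pulling structured entities out of a transcribed census line.
# The team's models are proprietary; "dslim/bert-base-NER" is a public
# stand-in checkpoint used here purely for illustration.
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

record = "Head of household: John Ward, age 42, born 1878 in County Cork, Ireland."
for entity in ner(record):
    # Each aggregated entity carries a label (PER, LOC, ...), the matched
    # text span, and a confidence score.
    print(f"{entity['entity_group']:>4}  {entity['word']:<15} {entity['score']:.2f}")
```

A genealogy-specific model would also need labels for relationships, occupations, and dates, which is exactly where the quality and speed of training data becomes the limiting factor.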

Previously, data scientists owned much of the labeling work from end to end. Even when domain experts were brought in to help, the process was clunky. Experts had deep familiarity with historical documents, but not always with how machine learning models consumed labeled data. This created friction. The team needed a way to connect both worlds, expert judgment and machine learning, in a workflow that encouraged iteration, not slowdowns.

That’s where Databrewery came in. Using its Annotate platform, the team integrated model-assisted labeling, dynamic annotation relationships, and a flexible image and text editor that allowed everyone, engineers and historians alike, to work in sync. With Databrewery, they sourced high-quality annotators who could keep up with both the scale and complexity of the work without requiring constant supervision.
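
Databrewery's SDK isn't shown in the case study, so the sketch below is a hypothetical illustration of the model-assisted labeling loop it describes: model predictions are converted into editable pre-labels that human annotators correct rather than draw from scratch. The PreLabel shape and the commented-out upload call are illustrative assumptions, not a documented API.

```python
# Hypothetical sketch of model-assisted pre-labeling. The PreLabel shape and
# the upload call are illustrative stand-ins, not Databrewery's documented SDK.
from dataclasses import dataclass

@dataclass
class PreLabel:
    asset_id: str       # image or document being annotated
    label: str          # predicted class, e.g. "surname"
    bbox: tuple[float, float, float, float]  # x, y, width, height in pixels
    confidence: float   # model score; low-confidence labels get reviewed first

def to_prelabels(asset_id: str, predictions: list[dict]) -> list[PreLabel]:
    """Convert raw model output into editable pre-labels for annotators."""
    return [
        PreLabel(asset_id, p["label"], tuple(p["bbox"]), p["score"])
        for p in predictions
    ]

# client.upload_prelabels(project_id, to_prelabels(...))  # assumed endpoint
```

The design point is that annotators start from the model's best guess, so each labeling pass doubles as a review of model behavior.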

“Before switching platforms, we could build models quickly, but the data pipeline was a bottleneck,” said Priya Mehta, Senior Applied Scientist.

“Once we brought in a platform that let us collaborate with labelers in real time, everything changed. We now move at the pace our models demand.”

One major shift came from the ability to leave real-time feedback: dropping comments directly on specific image regions, asking questions, and clarifying edge cases without delay. This back-and-forth unlocked faster decision-making and reduced errors that often stemmed from ambiguous labeling specs. “Other platforms felt like a black box; we’d wait for all the labels to come back before we could even start giving feedback,” Mehta added.
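
The case study doesn't specify how these region comments are represented, but anchored feedback generally boils down to a small payload tying a remark to coordinates on an asset. A hypothetical sketch, with all field names assumed:

```python
# Hypothetical shape of a region-anchored comment; field names are assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class RegionComment:
    asset_id: str                       # the image the comment is pinned to
    region: tuple[int, int, int, int]   # x, y, width, height of the flagged area
    text: str                           # the question or clarification
    author: str
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

comment = RegionComment(
    asset_id="census_1901_page_014",
    region=(412, 230, 180, 36),
    text="Handwriting ambiguous here; is this surname 'Ward' or 'Word'?",
    author="historian@example.com",
)
```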

To keep momentum, the team also leaned into Databrewery’s analytics and QA tools, which made it easier to evaluate label performance and fine-tune processes as they scaled. They set up strong review systems to keep accuracy high even as new, previously unlabeled data entered the pipeline, and focused on high-value tasks like handwriting recognition and data extraction from historical PDFs. As a result, training cycles got shorter, models got more accurate, and the entire workflow became more collaborative and efficient.
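
As one concrete illustration of the handwriting-recognition task, here is a minimal sketch that renders PDF pages to images and runs an open-source handwriting model over them. The pdf2image package and the public microsoft/trocr-base-handwritten checkpoint are stand-ins chosen for illustration; the team's actual pipeline isn't described in the text.

```python
# Minimal sketch of handwriting recognition on a scanned PDF.
# pdf2image and microsoft/trocr-base-handwritten are assumed stand-ins,
# not the team's actual stack.
from pdf2image import convert_from_path
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# Render each PDF page to a PIL image (requires poppler installed locally).
pages = convert_from_path("census_1901.pdf", dpi=300)

for i, page in enumerate(pages):
    # TrOCR expects a cropped line of handwriting; a real pipeline would run
    # line segmentation first. Here the whole page stands in for one region.
    pixel_values = processor(images=page.convert("RGB"), return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(f"page {i + 1}: {text}")
```

A production pipeline would segment each page into individual text lines before decoding, since TrOCR is trained on single-line crops; the per-region outputs would then feed the structured-extraction step described above.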