Enhancing language learning through precise, scalable AI data labeling

The Challenge

A company needed to manage diverse data sources such as audio, transcriptions, and video to train its AI models on the nuances of language. The challenge grew as its user base expanded, demanding more accurate data labeling to capture tone, context, and regional dialects. With this influx of data, the company had to maintain high labeling standards while processing a massive volume of information.

The Approach

Databrewery’s intuitive tooling and post-training labeling services provided a streamlined solution, enabling the company’s internal team and external annotators to collaborate effectively. The platform’s tools for quality control, labeler feedback, and seamless data management helped the company maintain accuracy without compromising on speed.

The Outcome

As a result, the company reduced data labeling time by almost 50%, which accelerated feature releases and language expansions. Its model accuracy also improved by up to 35%, letting the team roll out improvements faster.

Fast-growing language learning app revolutionizes education with AI-driven personalized learning experiences

A rapidly expanding language learning app has gained recognition for utilizing AI to deliver personalized and engaging lessons to millions of users globally. The company’s goal is to break down language barriers by applying advanced technology to help users master new languages quickly and efficiently.

The app functions like a personal tutor, supporting multiple languages including English and Spanish, with plans to add more. At its core, the product experience is driven by speech recognition technology and large language models (LLMs), enabling seamless interactions that make language learning both effective and intuitive. These AI technologies provide real-time feedback that helps users improve their language skills as they practice.

Foundations of AI Innovation

The company was founded in 2016 with a vision to develop an AI-powered language tutor specifically for English learners. At that time, the technology wasn’t advanced enough to fully support their ambitious goals. While deep learning was beginning to gain momentum, they foresaw the rapid growth of AI advancements and strategically planned around this future potential.

A significant challenge from the start was managing the complexity of audio data, which needed to account for a wide range of accents, dialects, and phonetic nuances. This often required labeled data for the training of custom speech models. Scaling these efforts, however, meant sourcing vast quantities of high-quality, accurately labeled data to consistently train and improve their AI models. This is where Databrewery played a pivotal role.

Managing Data Integrity at Scale

The company faced a significant challenge in delivering precise and nuanced language lessons due to the complexity of its data. Its AI models had to grasp intricate language elements, including tone, context, and regional accents. As the user base grew, so did the volume of data, making it harder to ensure consistent, high-quality labeling across vast datasets.

The team, made up of machine learning engineers focused on speech and language, leverages cutting-edge LLMs and APIs within a well-integrated software stack. They focus specifically on improving speech recognition systems across different product experiences. A key feature of their offering is the pronunciation lessons, where users repeat sentences and get real-time feedback. As learners advance, the tasks become more challenging, requiring the system to handle blanked-out words or questions that prompt specific responses. These systems need to be highly customizable and fine-tuned for accuracy.
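To make the pronunciation-lesson idea concrete, the following is a minimal sketch of how a target sentence might be compared word by word with a recognized transcript. It is an illustration only, not the company's or Databrewery's implementation: the function name word_feedback and the "ok"/"missed" statuses are hypothetical, and a real system would work from acoustic and phonetic signals rather than plain text.

```python
from difflib import SequenceMatcher


def word_feedback(target: str, recognized: str) -> list[tuple[str, str]]:
    """Mark each target word 'ok' or 'missed' given a recognized transcript.

    Hypothetical helper for illustration; compares lowercase word sequences only.
    """
    target_words = target.lower().split()
    heard_words = recognized.lower().split()

    # Align the two word sequences; autojunk is disabled so that
    # frequently repeated words are never treated as noise.
    matcher = SequenceMatcher(a=target_words, b=heard_words, autojunk=False)

    status = ["missed"] * len(target_words)
    for block in matcher.get_matching_blocks():
        # Each block covers target_words[block.a : block.a + block.size].
        for i in range(block.a, block.a + block.size):
            status[i] = "ok"
    return list(zip(target_words, status))


if __name__ == "__main__":
    for word, state in word_feedback("the cat sat down", "the cat down"):
        print(f"{word}: {state}")
```

In this sketch a learner who skips a word sees only that word flagged, which is the kind of per-word signal a lesson could surface as feedback.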

The team sought to use Databrewery as their main platform for annotating and assessing their speech models, ensuring that the training data could help the systems improve.

Previously, the process was labor-intensive, relying on contractors and multiple spreadsheets. The scattered nature of the data pipeline made efficient management difficult. With Databrewery, the company was able to centralize its efforts, focusing on key areas like phonetics and pronunciation. Its end goal is to use LLMs to simulate an AI language tutor that offers real-time feedback, providing the quality of a private human tutor at scale.

Choosing the Right Platform for Post-Training Data Labeling

After evaluating multiple options, the company selected Databrewery as their partner to optimize and elevate their data labeling process. Databrewery provided a user-friendly and robust platform that facilitated seamless collaboration between their internal team and external annotators. With features that supported quality control, real-time labeler feedback, and efficient data management, Databrewery became integral to the company’s data operations. The platform significantly enhanced the overall data pipeline, streamlining workflows and boosting operational efficiency.

Accelerating AI Development with Databrewery

Streamlined Data Labeling Process: By incorporating automation, the company cut data labeling time by nearly 50%, which let it introduce new languages and features sooner.
Improved Data Integrity: Databrewery’s quality control systems and real-time feedback mechanisms kept data labeling highly accurate. This, in turn, improved the AI’s ability to interpret accents, tonal shifts, and contextual subtleties in conversations.
Support for Scalable Growth: As the company expanded to new markets, Databrewery’s flexible platform and post-training services became essential for managing a growing dataset. The ability to onboard a broad range of annotators ensured that data quality was maintained even as the dataset grew.
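
One common way to monitor labeling quality of the kind described above is to measure how often two annotators agree beyond what chance would produce. The sketch below computes Cohen's kappa for two annotators who labeled the same items. It is an assumption-laden illustration, not a description of Databrewery's quality control features: the function name, the label values, and the choice of kappa as the metric are all hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same label independently,
    # based on how often each annotator used each label.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical accent labels from two annotators on four audio clips.
    annotator_1 = ["us", "uk", "us", "uk"]
    annotator_2 = ["us", "uk", "uk", "uk"]
    print(cohens_kappa(annotator_1, annotator_2))
```

A score near 1 suggests the annotators apply the guidelines consistently, while a low score can point to items or label definitions that need clearer instructions.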