High-Quality STEM Training Data to Improve Multimodal Reasoning in Large Language Models

The Challenge

A leading AI lab wanted to find out where their large language model (LLM) was falling short in K–12 STEM subjects. To do this, they needed a trusted and diverse group of STEM experts to review model outputs and create new, domain-specific training data to improve accuracy and reasoning across both text and images.

The Approach

Databrewery’s labeling services, powered by the Brewforce network, built a dedicated team of STEM professionals with advanced degrees in chemistry, biology, engineering, and related fields. These experts created original multimodal prompts—combining text and images—and provided accurate answers to test and improve the model’s responses.

The Outcome

The expert team consistently produced unique, domain-specific data that revealed the model’s weaknesses in STEM reasoning. This allowed the lab to directly improve model performance in key areas. Databrewery now plays a central role in supplying high-quality STEM data for the lab’s real-time loss training workflows.

Multimodal Reasoning

Identifying LLM Gaps in K–12 STEM Using Domain-Specific Multimodal Data

A leading AI lab set out to identify weak areas in their large language model’s performance across K–12 STEM subjects. They needed a dependable team that could work within their real-time loss workflow, creating advanced multimodal prompts that combined image and text to highlight exactly where the model was falling short.

The challenge was sourcing a large, qualified team of STEM experts with deep knowledge across technical fields like biology, physics, engineering, and earth sciences. Creating original, domain-specific image and text pairs at this level required both subject matter expertise and consistency at scale.

Building Domain-Specific Multimodal STEM Datasets to Improve AI Reasoning

The AI lab needed a dependable data partner with deep expertise in STEM and multimodal content. Databrewery, powered by the Brewforce network, stepped in with a global pool of domain experts, multilingual capabilities, and broad experience across specialized fields.

Thanks to a fast calibration cycle and full project support, Databrewery quickly assembled a qualified team of STEM experts to take on the task.

This wasn’t a standard labeling job. The project required generating original, complex prompts that combined both image and text inputs across multiple STEM fields and grade levels. The goal was to stretch the model’s reasoning capabilities with content that wasn’t easily found online. Once the prompts were finalized, accurate answer sets were created to help improve model training.
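
To make the deliverable concrete, a record in such a dataset might pair an image reference with a text question and an expert-verified answer. The Python sketch below is purely illustrative; the schema, field names, and types are assumptions, not the lab’s actual format.

```python
from dataclasses import dataclass

@dataclass
class MultimodalPrompt:
    """One expert-authored prompt pairing an image with a text question.

    Hypothetical schema for illustration only; every field name here is
    an assumption, not the lab's actual data format.
    """
    prompt_id: str
    subject: str           # e.g. "physics", "biology", "earth science"
    grade_band: str        # e.g. "K-5", "6-8", "9-12"
    image_path: str        # diagram, graph, or photo the question refers to
    question_text: str     # the text portion of the multimodal prompt
    reference_answer: str  # expert-verified answer used for training
    rationale: str = ""    # optional step-by-step reasoning from the expert
```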

Due to the complexity, Databrewery carefully vetted hundreds of applicants and selected 150 highly skilled domain specialists—each holding a PhD or Master’s degree in fields like math, physics, and engineering. These experts were tasked with producing non-trivial, domain-specific content that could expose the model’s limitations and guide its learning.

“Designing these prompts wasn’t just about knowledge—it was about strategy. With my background in engineering education, I could craft multimodal questions that genuinely challenged the model’s reasoning. It pushed me to rethink how AI interprets STEM problems and where human insight is still essential.” – Priya N., M.Tech in Mechanical Engineering and STEM Curriculum Designer

After building the prompts and answers, the team tested whether the model consistently struggled with them. Only prompts that caused repeated model failures were marked as 'winning labels' and included in the final dataset.
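
As a rough sketch of that filtering step, the logic might look like the following. The `ask_model` and `grade_answer` helpers are hypothetical stand-ins for the lab’s inference and grading pipeline, and the trial count and failure threshold are assumed values the case study does not specify.

```python
def is_winning_label(prompt, ask_model, grade_answer,
                     trials: int = 5, fail_threshold: float = 0.8) -> bool:
    """Keep a prompt only if the model fails it repeatedly.

    `ask_model` and `grade_answer` are hypothetical placeholders for
    the lab's inference and answer-grading steps; `trials` and
    `fail_threshold` are illustrative defaults, not source values.
    """
    failures = 0
    for _ in range(trials):
        response = ask_model(prompt.image_path, prompt.question_text)
        if not grade_answer(response, prompt.reference_answer):
            failures += 1
    # A prompt "wins" when the model fails it in most trials.
    return failures / trials >= fail_threshold


# Only prompts that consistently fail enter the final dataset, e.g.:
# winning = [p for p in prompts if is_winning_label(p, ask_model, grade_answer)]
```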

Boosting Domain-Specific LLM Performance Through Expert Feedback and Multimodal Tools

The AI lab needed to strengthen its real-time loss training workflow but lacked a consistent team of qualified experts to review model outputs and deliver actionable feedback. This workflow was critical to improving the LLM, enabling ongoing evaluation, identifying weak spots, and bringing human insight directly into the loop.

Databrewery’s multimodal chat editor played a key role in this process. It allowed experts to give direct feedback, follow clear labeling guidelines, and evaluate how the model responded to the prompts they created.

Through multiple review cycles, Databrewery delivered a high-quality multimodal dataset that helped improve the model’s accuracy on complex STEM tasks. With a trusted team of experts in place, the AI lab now runs an efficient feedback loop that continuously surfaces domain-specific issues and drives focused model improvement.