In a windowless basement in San Francisco, a quiet revolution in neuroscience and artificial intelligence has just taken place. While most AI companies are scraping the internet for text and images, a startup named Conduit has spent the last six months mining a very different resource: the human mind.
Conduit has successfully assembled what is believed to be the largest neuro-language dataset in history—roughly 10,000 hours of non-invasive neural recordings collected from thousands of unique individuals. This massive undertaking aims to bridge the gap between biological thought and digital text, potentially allowing machines to decode what you intend to say seconds before you actually speak.
The “Underground” Data Factory
The project was conducted with an intensity that matches its ambition. Operating out of a basement studio, Conduit managed to record data from thousands of volunteers in just half a year. To achieve this scale, the startup developed a rigorous rotation system in which participants cycled through compact recording booths for two-hour sessions, sometimes keeping the facility running for up to 20 hours a day.
The goal was not to study the brain in a clinical vacuum but to capture the neural signatures of natural language generation. The startup is training AI models to decode "semantic content" from brain activity in the fleeting moments before a person speaks or types, with the ultimate objective of translating human thought directly into text.
Chatting with AI to Map the Mind
One of the most interesting aspects of Conduit’s approach was their shift away from traditional scientific rigidity. Early in the project, participants were asked to perform rigid, repetitive tasks. However, the team quickly realized that this “boring” approach drained participant energy and resulted in “flatter” data.
To fix this, Conduit pivoted to a method that feels distinctly modern: conversing with an LLM. Participants sat in booths and engaged in open-ended, free-flowing conversations with a Large Language Model, either by speaking or using a simplified keyboard. This change not only improved engagement but also produced richer language output, allowing the sensors to capture high-quality data aligned with natural thought patterns.
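The core of such a collection setup is pairing each word a participant produces with the neural activity recorded just before it. The sketch below is purely illustrative and not Conduit's actual pipeline; the 250 Hz sampling rate, channel count, and the `window_before` helper are all assumptions made for the example.

```python
import numpy as np

# Hypothetical session data: a continuous multi-channel neural recording
# plus (timestamp, word) pairs logged from the participant's LLM chat.
fs = 250                                 # sampling rate in Hz (assumption)
neural = np.random.randn(fs * 60, 8)     # 60 s of 8-channel data (placeholder)
utterances = [(2.0, "hello"), (5.5, "how"), (6.1, "are"), (6.6, "you")]

def window_before(neural, fs, t, pre_s=1.0):
    """Slice the neural window ending at time t, i.e. the moments
    immediately before the participant speaks or types a word."""
    end = int(t * fs)
    start = max(0, end - int(pre_s * fs))
    return neural[start:end]

# Each pair is (pre-speech neural window, word): one training example
# for a decoder that predicts upcoming language from brain activity.
pairs = [(window_before(neural, fs, t), word) for t, word in utterances]
```

Free-flowing conversation makes these pairs richer than repetitive tasks would: the words are varied, self-generated, and embedded in natural thought, which is exactly the signal a thought-to-text decoder needs.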
Custom Hardware: The 4-Pound Headset
When Conduit began this journey, they discovered that commercial hardware simply wasn’t up to the task. Off-the-shelf headsets couldn’t capture the density of signals required for such a complex decoding feat.
The solution was to build their own. The team engineered custom, 3D-printed rigs weighing approximately four pounds. These heavy "training headsets" are packed with best-in-class sensors, combining EEG (electroencephalography) and fNIRS (functional near-infrared spectroscopy) to maximize signal coverage. While these bulky devices are strictly for data collection, the company plans to design lighter, wearable "inference" headsets in the future, once they understand exactly which signals are most critical.

Battling the “Hum” with Scale
Collecting delicate brain signals in a city like San Francisco presented a massive technical hurdle: electrical noise. The standard 60 Hz hum from the power grid threatened to corrupt the EEG data. Initially, the team went to extreme lengths to mitigate this, wrapping equipment in rubber and even shutting off the building’s main power, running the entire lab on battery packs.
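The standard signal-processing answer to mains hum is a notch filter that suppresses a narrow band around the interference frequency. The following is a minimal frequency-domain sketch using NumPy on synthetic data; it is not Conduit's pipeline, and the sampling rate and bandwidth are assumptions.

```python
import numpy as np

fs = 250.0                       # EEG sampling rate in Hz (assumption)
t = np.arange(0, 2.0, 1 / fs)
# Synthetic "EEG": 10 Hz brain rhythm plus 60 Hz mains hum.
eeg = np.sin(2 * np.pi * 10 * t) + 0.8 * np.sin(2 * np.pi * 60 * t)

def notch_60hz(x, fs, width=2.0):
    """Zero out a narrow band around 60 Hz in the frequency domain."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    spec[np.abs(freqs - 60.0) < width] = 0.0   # kill the mains band
    return np.fft.irfft(spec, n=len(x))

clean = notch_60hz(eeg, fs)      # hum removed, 10 Hz rhythm preserved
```

Filtering like this trades away any genuine brain signal near 60 Hz, which is one reason a model that learns to ignore the hum from sheer data volume, as described next, is an appealing alternative.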
However, as the dataset grew, Conduit made a fascinating discovery about AI scaling laws. Once they surpassed 4,000 to 5,000 hours of data, the AI models began to generalize across different people and setups. The sheer volume of data made the models robust enough to ignore the electrical noise, allowing the team to return to normal power and focus on speed. By the end of the project, they had cut the cost of collecting usable data by 40%.
What’s Next?
With data collection largely complete, Conduit has shifted its focus to the heavy lifting of model training. If successful, their “thought-to-text” models could fundamentally change how humans interact with technology, moving us away from keyboards and touchscreens toward a future where our devices understand our intent the moment we form it.
Key Takeaways
- Conduit, a U.S.-based startup, has created the largest known neuro-language dataset with 10,000 hours of brain recordings.
- The dataset aims to bridge the gap between biological thought and digital text, enabling machines to decode intent before speech.
- Participants engaged in free-flowing conversations with an LLM to generate richer, more natural language data.
- Custom-built, 4-pound headsets combining EEG and fNIRS were developed for dense signal capture.
- Scaling the dataset past 4,000–5,000 hours helped AI models generalize across people and setups, overcoming electrical noise and cutting the cost of usable data collection by 40%.