The challenge with healthcare data
By Bob de Vos and Isabell Trinh
“AI needs to be trained with large amounts of data…” We’ve probably all heard that. But what if that takes too long or isn’t possible in the first place?
In healthcare, getting high-quality, labelled data for AI applications is a real challenge. Health data is often neither standardized nor accessible, and patient samples are notoriously difficult to obtain. As a result, building a good training dataset is both time- and labour-intensive — and therefore expensive.
At the same time, AI holds huge promise for making healthcare more affordable and accessible, a goal that’s only becoming more urgent as our population ages.
So how can we fulfil that promise?
In our last post, we introduced our Large Spectral Model (LSM) — our take on a foundation model for spectral data, inspired by how Large Language Models (LLMs) work with text. Just as LLMs learn the structure and meaning of language, our LSM captures the underlying patterns in spectral data.
This broad understanding is what makes few-shot learning so powerful. When a model already knows the “language” of spectra, it doesn’t need thousands of examples to learn something new; a handful can be enough to adapt it to a new task.
To make that more tangible, consider the word “spectrotype”, a term used to describe the unique spectral fingerprint of a sample, like a bacterial strain. Even if you’ve never seen the word before, you might recognize “spectrum” and “type” and intuitively guess that it refers to a type defined by its spectrum. Prior knowledge helps you make sense of the unfamiliar.

In the same way, once a model is familiar with the language of spectra, it can be adapted with just a few examples to perform specialized tasks such as identifying or classifying new spectrotypes. When we use our LSM as a foundation, we can give spectral models a head start and enable them to recognize unfamiliar patterns with minimal data.
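
To make that concrete, here is a minimal sketch of what such few-shot adaptation can look like in practice: freeze a pretrained backbone and train only a small classification head on a handful of labelled spectra. Everything in it is illustrative — `SpectralEncoder` is a hypothetical stand-in for a pretrained foundation model like our LSM (whose actual architecture isn’t shown here), and the channel count, embedding size, and five-examples-per-class setup are arbitrary assumptions, not our real configuration.

```python
# A minimal sketch of few-shot adaptation with a frozen, pretrained backbone.
# "SpectralEncoder" is a hypothetical placeholder, not the actual LSM.
import torch
import torch.nn as nn

class SpectralEncoder(nn.Module):
    """Hypothetical stand-in for a pretrained spectral foundation model:
    maps a raw spectrum to a fixed-size embedding."""
    def __init__(self, n_channels: int = 1024, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_channels, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Freeze the backbone: its general "language of spectra" stays fixed.
encoder = SpectralEncoder()
for p in encoder.parameters():
    p.requires_grad = False

# Only a small task-specific head is trained, e.g. for 3 new spectrotypes.
head = nn.Linear(128, 3)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Few-shot training set: 5 labelled spectra per class (random data here).
x_few = torch.randn(15, 1024)       # 15 spectra, 1024 channels each
y_few = torch.arange(3).repeat(5)   # class labels 0, 1, 2

for _ in range(100):                # a few passes over the tiny dataset
    optimizer.zero_grad()
    logits = head(encoder(x_few))   # frozen embedding -> trainable head
    loss = loss_fn(logits, y_few)
    loss.backward()
    optimizer.step()
```

Because only the small head is trained, a few labelled spectra per class can suffice: the frozen backbone already supplies the broad understanding, and the new examples only have to pin down the task-specific decision boundary.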