Chenyue Wendy Hu
Photo courtesy of Jeff Fitlow and Rice University
A newly developed algorithm for “big data” could have a significant impact on clinical trials, according to researchers.
The algorithm, called progeny clustering, was the only method to successfully reveal “clinically meaningful” groupings of proteomic data from patients with acute myeloid leukemia.
And the algorithm is currently being used in a hospital study to identify optimal treatment for children with leukemia.
Details on progeny clustering have been published in Scientific Reports.
The authors noted that clustering is important for its ability to reveal information in complex sets of data like medical records.
“Doctors who design clinical trials need to know how to group patients so they receive the most appropriate treatment,” said author Amina Qutub, PhD, of Rice University in Houston, Texas. “First, they need to estimate the optimal number of clusters in their data.”
The more accurate the clusters, the more personalized the treatment can be, Dr Qutub said. She added that separating groups by a single data point would be easy, but when separating patients by the types of proteins in their bloodstreams, for example, it becomes more difficult.
“That’s the kind of data that’s become prevalent everywhere in biology, and it’s good to have,” Dr Qutub said. “We want to know hundreds of features about a single person. The problem is identifying how to use all that data.”
Progeny clustering provides a way to ensure the number of clusters is as accurate as possible, Dr Qutub said. The algorithm extracts characteristics about patients from a data set, then mixes and matches them randomly to create artificial populations, the “progeny” of the parent data. The characteristics appear in roughly the same ratios in the progeny as they do among the parents.
These characteristics, called dimensions, can be anything: as simple as hair color or place of birth, or as detailed as blood cell count or the proteins expressed by tumor cells. For even a small population, each individual may have hundreds or thousands of dimensions.
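The mixing and matching described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' implementation: the function name make_progeny, the sample records, and the choice to draw every dimension independently from the whole parent set are all assumptions made for the example, and the published method may differ in its details.

```python
import random


def make_progeny(parents, n_progeny, seed=None):
    """Create artificial records by drawing each dimension independently
    from a randomly chosen parent (hypothetical helper, for illustration)."""
    rng = random.Random(seed)
    dimensions = list(parents[0].keys())
    progeny = []
    for _ in range(n_progeny):
        # Each dimension of the child comes from a separately chosen parent,
        # so values are mixed and matched across individuals.
        child = {dim: rng.choice(parents)[dim] for dim in dimensions}
        progeny.append(child)
    return progeny


# Made-up parent data: four individuals, three dimensions each.
parents = [
    {"hair_color": "brown", "birthplace": "Houston", "protein_a": 0.82},
    {"hair_color": "black", "birthplace": "Austin", "protein_a": 0.40},
    {"hair_color": "brown", "birthplace": "Dallas", "protein_a": 0.65},
    {"hair_color": "blond", "birthplace": "Houston", "protein_a": 0.31},
]

progeny = make_progeny(parents, n_progeny=1000, seed=0)

# Half of the parents have brown hair; the share among the progeny
# should land close to that ratio.
brown_share = sum(p["hair_color"] == "brown" for p in progeny) / len(progeny)
print(f"brown hair: parents 0.50, progeny {brown_share:.2f}")
```

Because each value is drawn from the parents at random, a characteristic that appears in half of the parents turns up in about half of the progeny, while the combinations of characteristics in any one artificial record are new.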
By creating progeny with the same dimensions as the parent data, the algorithm increases the size of the data set. With this additional data, the distinct patterns become more apparent, allowing the algorithm to optimize the number of clusters that warrant attention from doctors and researchers.
Dr Qutub said this technique is just as reliable as state-of-the-art clustering evaluation algorithms, but at a fraction of the computational cost. In lab tests, progeny clustering compared favorably to other popular methods.
And it was the only method to provide clinically meaningful groupings in an acute myeloid leukemia reverse-phase protein array data set.
Progeny clustering also allows researchers to determine the ideal number of clusters in small populations, Dr Qutub noted.
The algorithm was used to design an ongoing trial involving leukemia patients at Texas Children’s Hospital.
“Progeny clustering allowed them to design a robust clinical trial, even though that trial did not involve a large number of children,” Dr Qutub said. “It meant they didn’t have to wait to enroll more.”
Dr Qutub added that the algorithm could apply to any data set.
“We could just as easily use it for a population of voters to see who should get campaign materials from a candidate,” she said. “Progeny clustering has a lot of possible applications.”
Dr Qutub and her colleagues plan to make the algorithm available for free on her lab’s website.