Big data often appears overwhelming, with thousands of variables and millions of points. Images, documents, biological systems, and more are rendered into vast digital spaces. But Wanli Qiao, an associate professor in the George Mason University Department of Statistics, delights in the challenge of finding the hidden “shapes” within this data.
Qiao’s research focuses on what he describes as the geometry of data: the idea that even the most complicated datasets often follow simpler, underlying structures. Those structures are not always visible, but they govern how data clusters and behaves, and they can even reveal meaning.
“Data today can be incredibly complex,” Qiao said, referring to data with a vast number of features or variables describing each item. “But what really matters is often the low-dimensional structure underneath.” He gives the example of a digital image of a familiar object, such as a car. The image may contain hundreds of thousands of pixels, forming a complex distribution of light. If the camera shifts, the light changes and so does the image, but the essential object, the car, does not. The data appears complex, but the underlying structure is much simpler. Qiao’s work seeks to find and understand those structures, ensuring that the mathematical tools used to detect them are reliable.
In 2025, alongside collaborator Ery Arias-Castro (UC San Diego), he published two papers in the Annals of Statistics, one of the field’s most selective journals. Together, the papers pose a central question in modern statistics: how to make sense of increasingly complex data while maintaining rigorous guarantees that the conclusions are sound.
The first paper tackles what statisticians call “distributional data.” Unlike traditional datasets, where each observation is a single point or vector, distributional data treats each observation as a full distribution. A document, for example, can be represented by the frequency of words it contains, which captures the substance of its language.
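As a rough illustration of the word-frequency idea, the sketch below turns a short document into a distribution over its words. The function name `word_distribution` and the sample sentence are made up for this example; the article does not describe how the paper itself represents documents.

```python
from collections import Counter


def word_distribution(text):
    """Map each word to its relative frequency in the text (hypothetical helper)."""
    words = text.lower().split()
    counts = Counter(words)
    total = sum(counts.values())
    # Dividing by the total turns raw counts into probabilities that sum to 1.
    return {word: count / total for word, count in counts.items()}


doc = "the car passed the other car"
dist = word_distribution(doc)
for word, prob in sorted(dist.items()):
    print(f"{word}: {prob:.3f}")
```

Each document becomes one such distribution, so a collection of documents becomes a collection of distributions rather than a collection of single points.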
These representations are powerful, but they live in high-dimensional spaces, which makes them hard to visualize or analyze directly. Qiao’s contribution is to develop methods for embedding this complex data into lower-dimensional representations without losing the data’s essential structure. The goal is to compress the data in a way that preserves meaning.
The paper also establishes the mathematical foundations behind Qiao’s methods. In statistics, this means proving that the approach is “consistent”: as more data becomes available, the method converges toward the correct answer.
The second paper turns to a different but related problem: how to identify natural groupings within data. Clustering is a fundamental task in data analysis, used to group similar items together. Many standard methods require researchers to decide in advance how many groups to look for. Qiao’s work instead focuses on a method known as hill-climbing clustering, which avoids that assumption.
The approach imagines the data as a kind of landscape, where regions of high density form peaks. Each data point “climbs” the landscape until it reaches a peak, and points that arrive at the same peak are grouped together. “It’s based on the geometry of the data,” Qiao said. “You let the structure define the clusters.”
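The climbing picture can be sketched in a few lines. The example below uses mean shift with a Gaussian kernel on one-dimensional data, which is one common hill-climbing scheme and not necessarily the algorithm analyzed in the paper. The function names, the bandwidth, and the sample points are all assumptions chosen for illustration.

```python
import math


def mean_shift_1d(points, bandwidth=1.0, steps=200, tol=1e-8):
    """Move each point uphill on a Gaussian kernel density estimate until it settles."""
    modes = []
    for x in points:
        for _ in range(steps):
            # Nearby data points get large weights, distant ones almost none.
            weights = [math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points]
            # The weighted mean lies in the direction of increasing density.
            new_x = sum(w * p for w, p in zip(weights, points)) / sum(weights)
            if abs(new_x - x) < tol:
                break
            x = new_x
        modes.append(x)
    return modes


def label_by_mode(modes, merge_tol=1e-3):
    """Give points the same label when their climbs ended at (nearly) the same peak."""
    peaks, labels = [], []
    for m in modes:
        for i, peak in enumerate(peaks):
            if abs(m - peak) < merge_tol:
                labels.append(i)
                break
        else:
            # No existing peak is close enough, so this climb found a new one.
            peaks.append(m)
            labels.append(len(peaks) - 1)
    return labels


points = [0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1]
labels = label_by_mode(mean_shift_1d(points, bandwidth=0.5))
print(labels)
```

Note that the number of clusters is never passed in. It falls out of how many distinct peaks the climbs reach, although the bandwidth still controls how smooth the landscape is.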
Qiao’s research contribution is not just the method itself, but also a proof that it works. The paper establishes that the algorithm reliably recovers the true underlying structure, even when only a limited sample of data is available.
Separately, in collaboration with Amarda Shehu, a professor in the Department of Computer Science, Qiao applies similar ideas to the study of protein folding, where the geometry of molecular structures determines biological function. There, the challenge is to understand complex systems by uncovering the shapes that define them.
For Qiao, the goal is not just to make sense of data, but to make that understanding reliable.