Changing the odds for understanding data in the AI era

Ray Bai is a recent addition to the George Mason University Statistics Department, joining as an assistant professor in 2025. But he’s not new to the field, which has undergone a profound transformation. His work sits at the intersection of statistics and AI, a space that didn’t even exist just a few years ago but that now defines his research and the discipline’s future.  
 
“A lot of AI tools are really about learning probability distributions,” said Bai. “In generative AI, you’re transforming random noise into something structured, like an image or text. That’s very closely connected to problems we’ve always studied in statistics.”
 
After graduating from Cornell University, he followed a common route into finance, working at State Street Bank in Massachusetts. During this time, he found himself drawn to the underlying mathematics, a realization that led him back to graduate school, first for a master’s degree in applied mathematics at the University of Massachusetts Amherst, and eventually to a PhD in statistics at the University of Florida. He worked in industry again briefly between earning his master’s and his PhD, but the academic environment called him back.

Bai with PhD student Shijie Wang after Wang's successful dissertation defense. Photo provided. 

Since then, he has followed a steady progression through academia, including time at the University of South Carolina, before arriving at George Mason. His move to the Washington, D.C., region was partly professional and partly personal, offering both a new institutional home and proximity to family in the Northeast. 

Along the way, Bai adapted to a quickly evolving field. During his graduate years, “big data” dominated the conversation; now AI is all the rage. One example is the growing reliance on pre-trained models. Instead of building a statistical model from scratch for each new dataset, researchers can now start with models trained on massive datasets and fine-tune them for specific problems. The approach reduces both computational cost and data requirements.

“You can tweak the existing model on your data, which can speed things up in a few ways,” he said. “First, you can use the previously trained model as a starting point or initialization in your new model, and then your new model learns faster. You might also have fewer data requirements if you can fine-tune a model that was already previously fit, so you don’t need to use as much data on the new problem.”
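The warm-start idea Bai describes can be illustrated with a toy sketch (not his actual research code): a simple logistic regression is "pre-trained" on a large dataset, and its weights are then used as the initialization for a small new dataset, where a modest budget of updates goes much further than starting from scratch. All names and data here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, w_init, lr=0.1, steps=200):
    """Plain gradient descent for logistic regression, from a given start."""
    w = w_init.copy()
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w

def loss(X, y, w):
    """Average negative log-likelihood."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

# "Pre-training": plenty of data from a related problem
w_true = np.array([2.0, -1.0, 0.5])
X_big = rng.normal(size=(5000, 3))
y_big = (sigmoid(X_big @ w_true) > rng.random(5000)).astype(float)
w_pre = fit_logistic(X_big, y_big, np.zeros(3), steps=500)

# "Fine-tuning": only 20 updates on a small new dataset,
# warm-started from the pre-trained weights vs. cold-started from zero
X_small = rng.normal(size=(100, 3))
y_small = (sigmoid(X_small @ w_true) > rng.random(100)).astype(float)
w_warm = fit_logistic(X_small, y_small, w_pre, steps=20)
w_cold = fit_logistic(X_small, y_small, np.zeros(3), steps=20)

# With the same small budget, the warm start lands closer to a good fit
print(loss(X_small, y_small, w_warm) < loss(X_small, y_small, w_cold))
```

The same logic scales up to modern deep models, where the pre-trained initialization carries far more structure than three weights.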

At the same time, the nature of data has changed. Traditional methods often assumed that data was static and centrally stored, but increasingly, that assumption no longer holds. “We now deal with streaming data, like financial markets or online activity, where the data is constantly updating,” he said. “And we also deal with decentralized data, where information is distributed across different locations and can’t be combined for privacy reasons.” 
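A classic example of statistics adapted to streaming data is a running estimate that absorbs each new observation without ever storing the full stream. The sketch below uses Welford's online algorithm for the mean and variance; it is a standard textbook method, shown here purely to illustrate the idea, not a method attributed to Bai's work.

```python
class RunningStats:
    """Welford's online algorithm: update mean and variance one observation
    at a time, so the data stream never needs to be stored or revisited."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of everything seen so far."""
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")

# Observations arriving one at a time, as in a market feed or activity log
stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)

print(stats.mean)      # 5.0
print(stats.variance)  # ~4.571
```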

In healthcare, for example, hospitals may want to collaborate on research without sharing sensitive patient data. New statistical approaches allow models to be trained across these decentralized systems, updating iteratively without moving the underlying data. 
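The decentralized setup can be sketched with a minimal federated-averaging loop, a widely used scheme for exactly this situation (the article does not specify which method Bai studies, so this is illustrative). Three hypothetical hospitals each run a gradient step on their own private data, and a server averages only the resulting model weights; the raw data never moves.

```python
import numpy as np

rng = np.random.default_rng(1)

def local_step(w, X, y, lr=0.1):
    """One gradient step of least-squares regression on one site's own data."""
    grad = X.T @ (X @ w - y) / len(y)
    return w - lr * grad

# Hypothetical: three hospitals, each holding (X, y) that never leaves the site
w_true = np.array([1.0, -2.0])
sites = []
for _ in range(3):
    X = rng.normal(size=(200, 2))
    y = X @ w_true + rng.normal(scale=0.1, size=200)
    sites.append((X, y))

# Federated averaging: each round, sites update locally and share only weights
w_global = np.zeros(2)
for _ in range(100):
    local_ws = [local_step(w_global, X, y) for X, y in sites]
    w_global = np.mean(local_ws, axis=0)  # the server averages the updates

print(w_global)  # close to w_true, learned without pooling any patient data
```

Real federated systems add privacy safeguards and handle unequal, shifting data across sites, but the iterate-locally-then-average structure is the core idea.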

Beyond his research, Bai has also built an audience through his blog, which he began in 2020 while navigating the academic job market. The posts focus on topics like graduate admissions and academic career advice for junior researchers. The blog is also a professional tool, increasing his visibility in the field and leading to invitations to speak on panels and share his insights more broadly. 

Bai hopes to continue his work by mentoring graduate students, pursuing large grants, and pushing the boundaries of his research.