15/12/2025
Bigger datasets aren’t always better
MIT researchers developed a way to identify the smallest dataset that guarantees optimal solutions to complex problems.
Determining the least expensive path for a new subway line underneath a metropolis like New York City is a colossal planning challenge — involving thousands of potential routes through hundreds of city blocks, each with uncertain construction costs. Conventional wisdom suggests extensive field studies across many locations would be needed to determine the costs associated with digging below certain city blocks.
Because these studies are costly to conduct, a city planner would want to perform as few as possible while still gathering the most useful data for making an optimal decision.
With almost countless possibilities, how would they know where to start?
A new algorithmic method developed by MIT researchers could help. Their mathematical framework provably identifies the smallest dataset that guarantees finding the optimal solution to a problem, often requiring fewer measurements than traditional approaches suggest.
In the case of the subway route, this method considers the structure of the problem (the network of city blocks, construction constraints, and budget limits) and the uncertainty surrounding costs. The algorithm then identifies the minimum set of locations where field studies would guarantee finding the least expensive route. The method also identifies how to use this strategically collected data to find the optimal decision.
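To make the idea concrete, here is a minimal toy sketch in Python. It is not the researchers' algorithm. It is a brute-force search that assumes a hypothetical instance with three candidate routes and a small finite set of possible costs per city block. The route names, block labels, cost values, and function names are all invented for illustration. A set of blocks counts as "sufficient" here if, for every possible outcome of measuring those blocks, one route is optimal under all cost scenarios consistent with that outcome.

```python
from itertools import combinations, product

# Hypothetical toy instance: each candidate route is the set of blocks it digs under.
ROUTES = {
    "north": ("a", "b"),
    "center": ("a", "c"),
    "south": ("d",),
}

# Assumed uncertainty set: each block's construction cost is one of these values.
POSSIBLE_COSTS = {
    "a": (2, 3),
    "b": (1, 4),
    "c": (2, 3),
    "d": (6, 7),
}


def scenarios():
    """Yield every combination of block costs allowed by the uncertainty set."""
    blocks = sorted(POSSIBLE_COSTS)
    for values in product(*(POSSIBLE_COSTS[b] for b in blocks)):
        yield dict(zip(blocks, values))


def optimal_routes(costs):
    """Return the set of routes with the lowest total cost in one scenario."""
    totals = {name: sum(costs[b] for b in blocks) for name, blocks in ROUTES.items()}
    best = min(totals.values())
    return {name for name, total in totals.items() if total == best}


def is_sufficient(measured):
    """Check whether measuring these blocks always pins down an optimal route."""
    common = {}  # observation -> routes optimal in every consistent scenario
    for costs in scenarios():
        observation = tuple(costs[b] for b in measured)
        best = optimal_routes(costs)
        common[observation] = common.get(observation, best) & best
    return all(common.values())


def smallest_sufficient_set():
    """Brute-force search over block subsets, smallest first."""
    blocks = sorted(POSSIBLE_COSTS)
    for size in range(len(blocks) + 1):
        for measured in combinations(blocks, size):
            if is_sufficient(measured):
                return measured


if __name__ == "__main__":
    print(smallest_sufficient_set())
```

On this toy instance the search returns the single block `b`. Block `a` is shared by the two routes it appears in, so its cost cannot change their ranking, and the `south` route is never strictly cheaper than the best alternative, so measuring `d` adds nothing. Brute force of this kind only works for tiny examples; the point is simply that the structure of the problem, not its size, determines which measurements matter.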
This framework applies to a broad class of structured decision-making problems under uncertainty, such as supply chain management or electricity network optimization.
“Data are one of the most important aspects of the AI economy. Models are trained on more and more data, consuming enormous computational resources. But most real-world problems have structure that can be exploited. We’ve shown that with careful selection, you can guarantee optimal solutions with a small dataset, and we provide a method to identify exactly which data you need,” says Asu Ozdaglar, MathWorks Professor and head of the MIT Department of Electrical Engineering and Computer Science (EECS), deputy dean of the MIT Schwarzman College of Computing, and a principal investigator in the Laboratory for Information and Decision Systems (LIDS).
Ozdaglar, co-senior author of a paper on this research, is joined by co-lead authors Omar Bennouna, an EECS graduate student, and his brother Amine Bennouna, a former MIT postdoc who is now an assistant professor at Northwestern University; and co-senior author Saurabh Amin, co-director of the Operations Research Center, a professor in the MIT Department of Civil and Environmental Engineering, and a principal investigator in LIDS. The research will be presented at the Conference on Neural Information Processing Systems.
An optimality guarantee
Much of the recent work in operations research focuses on how to best use data to make decisions, but this assumes these data already exist.