Major League Baseball can sometimes seem like an exercise in statistics as much as it is a sport. Batting averages, stolen bases, strikeouts, runs batted in – almost every aspect of the game can be recorded as a number. For over a century, teams and fans have used this huge body of data to measure players’ and teams’ performance – and value – in minute detail through a statistical process called sabermetrics.
Researchers at Penn State have developed a new method of analysis that draws on recent advances in machine learning to offer an even more accurate picture of how individual players impact the game.
Sabermetrics uses 121 kinds of statistics to quantify players’ success or failure in batting, pitching and fielding, and how many games teams win or lose as a result. Creatively crunched, these numbers can guide decisions that can make the difference between winning and losing.
In 2002, the Oakland A’s took an innovative approach by focusing only on how likely a player was to get on base. As a result, they were able to acquire players who had been undervalued by traditional analysis – and clinch that year’s American League West championship, a story told in the book and movie Moneyball.
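The measure behind that approach is usually on-base percentage (OBP). The article does not say exactly how the A's computed it, so the sketch below assumes the standard definition: hits, walks and hit-by-pitches divided by at-bats, walks, hit-by-pitches and sacrifice flies. Both stat lines are invented for illustration.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """Standard OBP: times on base divided by the plate appearances that count."""
    denominator = at_bats + walks + hit_by_pitch + sacrifice_flies
    if denominator == 0:
        return 0.0  # no qualifying plate appearances yet
    return (hits + walks + hit_by_pitch) / denominator


# Hypothetical season lines (not real players).
slugger_avg = 150 / 550   # batting average = hits / at-bats
patient_avg = 130 / 500

slugger_obp = on_base_percentage(hits=150, walks=30, hit_by_pitch=2,
                                 at_bats=550, sacrifice_flies=5)
patient_obp = on_base_percentage(hits=130, walks=95, hit_by_pitch=5,
                                 at_bats=500, sacrifice_flies=4)

print(f"slugger: AVG {slugger_avg:.3f}, OBP {slugger_obp:.3f}")
print(f"patient: AVG {patient_avg:.3f}, OBP {patient_obp:.3f}")
```

The second hitter has the lower batting average but reaches base more often, which is the kind of player a batting-average-first evaluation would tend to undervalue.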
But “when you simply describe a game as counting statistics, you really lose a lot of information about how the game actually happened,” said Connor Heaton, a Ph.D. student at Penn State’s College of Information Sciences and Technology (IST).
In sabermetrics, a single is a single regardless of what else is going on, like whether runners were on base or where the ball ended up. Recording games as a series of discrete events without any context doesn’t fully capture a player’s impact, Heaton said.
Heaton’s model draws on recent work in natural language processing (NLP), specifically a sequential modeling technique called masked language modeling, which helps computers infer the meaning of words from the surrounding context. In baseball, Heaton said, a similar process, masked gamestate modeling, can be used to infer the meaning of game events based on context and the impact they have on the game.
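The masking idea can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the event labels, the `[MASK]` token and the masking probability are all assumptions. Some events in a sequence are hidden, and a model would then be trained to recover them from the events around them.

```python
import random

MASK = "[MASK]"  # assumed placeholder token


def mask_sequence(events, mask_prob=0.15, seed=0):
    """Hide a random subset of events.

    Returns the masked sequence plus a dict mapping each hidden
    position to the original event a model would be asked to predict.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, event in enumerate(events):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = event
        else:
            masked.append(event)
    return masked, targets


# Hypothetical event labels for one plate appearance and the next.
events = ["ball", "called_strike", "foul", "single", "ball", "strikeout"]
masked, targets = mask_sequence(events, mask_prob=0.3, seed=7)
print(masked)
print(targets)
```

Training then amounts to scoring the model on how well it fills in the hidden positions, so it has to learn what each event means in context.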
Heaton also leveraged the idea of self-supervised contrastive learning, a family of methods used in computer vision to draw conclusions from unlabeled data. The idea is that two views of the same image should produce similar outputs, and that those outputs should differ from the outputs for the other images in a batch.
“We adapted that to baseball, and said that the same player at two close points in time should have a similar impact on the game,” Heaton said.
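One common way to turn that intuition into a training signal is an InfoNCE-style loss. The article does not say which loss the authors used, so treat the sketch below as an assumption: the vectors, the temperature and the function names are all illustrative. The loss is small when each player's vector at one point in time is closest to the same player's vector at a nearby point in time.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(view_a, view_b, temperature=0.1):
    """Average InfoNCE loss: row i of view_a should match row i of view_b."""
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        log_denominator = math.log(sum(math.exp(x) for x in logits))
        losses.append(log_denominator - logits[i])  # -log softmax of the match
    return sum(losses) / len(losses)


# Toy 2-D "forms" for two players at an earlier and a later point in time.
early = [[1.0, 0.0], [0.0, 1.0]]
later_consistent = [[0.9, 0.1], [0.1, 0.9]]   # each player resembles himself
later_swapped = [[0.1, 0.9], [0.9, 0.1]]      # each player resembles the other

loss_consistent = info_nce(early, later_consistent)
loss_swapped = info_nce(early, later_swapped)
print(loss_consistent < loss_swapped)
```

Minimizing a loss like this pulls a player's nearby-in-time representations together while pushing them away from those of other players in the same batch.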
Heaton and his co-author, IST professor Prasenjit Mitra, trained their model on data from the Statcast system, which uses 12 high-speed cameras at every MLB stadium to record information on pitching, hitting and fielding. There were three kinds of data in all. First, they used the Python package pybaseball to collect pitch-by-pitch data for the 2015-2019 seasons and season-by-season data from 1995-2019, a total of 5,000 games and 4.6 million pitches.
Pitch-by-pitch data included game number, at-bat number and pitch number. The second kind of data covered the result of each pitch in terms of changes to the “gamestate”: ball-strike count, base occupancy, number of outs and score. Various combinations of these four elements could lead to one of 325 possible gamestate changes.
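To make the idea of a gamestate change concrete, here is a minimal sketch. The field names and the way a change is encoded are assumptions for illustration and do not reproduce the authors' 325-class vocabulary. It shows two singles that a box score would count identically but that change the game in different ways.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GameState:
    balls: int
    strikes: int
    bases: tuple        # (first, second, third) occupancy as booleans
    outs: int
    batting_score: int  # runs for the team at bat


def describe_change(before, after):
    """Summarize how one pitch moved the game from `before` to `after`."""
    return {
        "balls": after.balls - before.balls,
        "strikes": after.strikes - before.strikes,
        "bases": (before.bases, after.bases),
        "outs": after.outs - before.outs,
        "runs": after.batting_score - before.batting_score,
    }


# A single with the bases empty: batter reaches first, nobody scores.
empty_before = GameState(1, 2, (False, False, False), 1, 3)
empty_after = GameState(0, 0, (True, False, False), 1, 3)

# A single with a runner on second: batter reaches first, runner scores.
runner_before = GameState(1, 2, (False, True, False), 1, 3)
runner_after = GameState(0, 0, (True, False, False), 1, 4)

change_empty = describe_change(empty_before, empty_after)
change_runner = describe_change(runner_before, runner_after)
print(change_empty["runs"], change_runner["runs"])
```

Both plays go into the books as a single, yet only the second one changes the score, which is the context that counting statistics leave out.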
The third type of input was statistics from traditional sabermetrics, describing each pitcher, each batter and their past encounters. They ran the analysis on two A600 GPU workstations in Heaton’s office.
The result, described in a paper that was picked as a finalist at the MIT Sloan Sports Analytics Conference, was a measurement of each player’s short-term impact on games called “player form.” A form, represented as a 64-element vector, captures a player’s skill as part of a larger sequence of events instead of a collection of events in isolation. Expressed as an embedding, a point in a low-dimensional space, “it provides much more nuance into the exact way in which the good players impact the game,” said Heaton.
Heaton and Mitra tested the technique on MLB games from 2015 through 2019. When combined with traditional sabermetrics, their approach was able to predict the winner of a game with almost 60% accuracy.
Forms also seemed to do a better job teasing apart exactly how good players impact the game. One statistic used to evaluate players’ value is called “wins above replacement” (WAR) – a measurement of how much they help their team win compared to a hypothetical replacement player with more pedestrian skills (and lower cost).
“In analyzing sabermetric-based embeddings, one could reasonably conclude that in order to have a high WAR rating, a player would need to hit a lot of home runs,” Heaton said.
“The form-based embeddings, on the other hand, provide a much more holistic interpretation, suggesting a variety of ways in which players can bring high value to their team.”
The authors have made the code and data publicly available on GitHub. They hope to use the methodology to model how events within the same game relate to each other, and what impact other team members such as managers might have on game outcomes.
Stat-heavy baseball is an obvious starting point, Heaton says, but their approach could also be useful in other sports like cricket, basketball or hockey. Beyond sports, it could potentially be applied in healthcare, for example allowing medical providers to describe patients and their health visits at different points in time.
In the meantime, “it’s definitely fun that I can watch a baseball game and say it’s for research purposes,” Heaton said.
Julian Smith is a contributing writer. He is the executive editor of Atellan Media and author of Aloha Rodeo and Smokejumper, published by HarperCollins. He writes about green tech, sustainability, adventure, culture and history.
© 2022 Nutanix, Inc. All rights reserved.