About the Transformer-Based Baseball Model

1. Introduction

In baseball, the ability to predict the outcome of a pitch can fundamentally transform how teams approach pitch sequencing and defensive positioning. Traditional methods often rely on historical statistics, which may overlook the dynamic and streaky nature of the game. By modeling baseball as a sequence of pitches, this research leverages transformer-based models to predict pitch outcomes, akin to how language models interpret sentences (Vaswani et al., 2023). Building on the success of learning player embeddings using transformers (Heaton et al., 2020), we investigate whether transformer models can accurately predict the result of a pitch. Success in this endeavor could enhance pitch sequencing strategies and optimize defensive positioning, providing coaches with advanced tools for real-time tactical adjustments.

2. Methods

This study employs a transformer-based model trained on MLB Statcast data spanning from 2015 to 2023 (Statcast Dataset, n.d.). Each pitch in the dataset is characterized by features such as pitch type, speed, game context, and outcome. The model is designed for single pitch result prediction, where the attributes of an upcoming pitch are known, and the objective is to predict its outcome.

During inference, the upcoming pitch is appended to the previous 400 pitches seen by the current batter to form a single input sequence. The transformer processes this sequence and outputs a probability distribution over nine hit-location zones and ten pitch result types (see Table 1). Each hit-location zone corresponds to an area of the field where a batted ball is most likely to be fielded. Training uses a semi-masked token prediction task, inspired by masked language modeling (Devlin et al., 2019), in which only the result features of the final pitch in the sequence are partially masked.
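To make the sequence construction concrete, here is a minimal sketch of how an inference input might be assembled. It is an illustration under stated assumptions, not the study's actual code: the field names `result` and `hit_zone`, the `<MASK>` token, and the dictionary representation of a pitch are all hypothetical, and the sketch masks every result field of the final pitch, which is only one reading of "semi-masked".

```python
from typing import Dict, List

CONTEXT_LEN = 400                        # context length stated in the text
RESULT_FIELDS = ("result", "hit_zone")   # hypothetical names for the result features
MASK = "<MASK>"                          # hypothetical mask token

Pitch = Dict[str, object]


def build_sequence(history: List[Pitch], upcoming: Pitch) -> List[Pitch]:
    """Return the batter's most recent CONTEXT_LEN pitches followed by the
    upcoming pitch, whose result features are replaced by the mask token."""
    context = history[-CONTEXT_LEN:]     # shorter histories are used in full
    target = dict(upcoming)              # copy so the caller's dict is untouched
    for field in RESULT_FIELDS:
        target[field] = MASK
    return context + [target]
```

The known attributes of the upcoming pitch (type, speed, game context) stay visible to the model; only the fields it is asked to predict are hidden.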

Feature | Description
Pitch Type | Classification of the pitch (e.g., Changeup, Curveball)
Speed | Velocity of the pitch in mph
Game Context | Situational factors such as count, inning, and score differential
Outcome | Result of the pitch (e.g., Strike, Ball, Hit)

4. Interactive Tool

This tool demonstrates how a user might interact with the baseball model described above. In practice, the model could be used in live game situations to call optimal pitches and inform defensive positioning. For any given moment in a game, the model would take the current batter's previous 400 pitches seen as context and cycle through every combination of pitch type and location available to the current pitcher. From the resulting probability distributions, the pitch that best optimizes the user's chosen outcome or hit location can be selected. For example, a user might choose the pitch type and location with the highest probability of a Strike, or the pitch with the highest probability of a field out combined with a high density of infield hit locations, to optimize for a double play.

This tool is not as comprehensive as the method just described, but it allows users to perform a manual version of that method, simulate at-bats between a wide range of players, or simply experiment to gain insights from the model.

The tool assumes that each simulated pitch takes place at the end of the 2023 MLB season. That is, the model retrieves the batter's 400 most recent pitches seen as of the last day of the 2023 season. The tool populates pitch features such as Velocity and Spin Rate by calculating average feature values by pitch type for each pitcher. For more info, please contact declankneita2025@u.northwestern.edu.
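The two mechanical steps described in this section, filling in pitch features with per-pitch-type averages and searching over pitch type and location combinations, can be sketched as follows. This is a rough illustration rather than the tool's implementation: the `predict` callable is a hypothetical stand-in for the transformer (assumed to return a mapping from outcome name to probability), and the field name `pitch_type` and the function names are assumptions.

```python
from collections import defaultdict
from itertools import product
from statistics import mean
from typing import Callable, Dict, Hashable, Iterable, List, Tuple


def average_features_by_pitch_type(
    pitches: Iterable[Dict[str, object]], features: List[str]
) -> Dict[str, Dict[str, float]]:
    """Average each listed feature per pitch type over one pitcher's pitches."""
    grouped: Dict[str, Dict[str, List[float]]] = defaultdict(lambda: defaultdict(list))
    for pitch in pitches:
        for feature in features:
            grouped[pitch["pitch_type"]][feature].append(pitch[feature])
    return {
        pitch_type: {feature: mean(values) for feature, values in by_feature.items()}
        for pitch_type, by_feature in grouped.items()
    }


def best_pitch(
    pitch_types: Iterable[str],
    locations: Iterable[Hashable],
    predict: Callable[[str, Hashable], Dict[str, float]],  # hypothetical model wrapper
    target: str,
) -> Tuple[str, Hashable]:
    """Return the (pitch type, location) pair with the highest predicted
    probability of the target outcome; missing outcomes count as 0."""
    candidates = product(pitch_types, locations)
    return max(candidates, key=lambda c: predict(c[0], c[1]).get(target, 0.0))
```

A call such as `best_pitch(["FF", "CH"], range(1, 10), predict, "Strike")` would return the combination the stand-in model rates most likely to produce a Strike. Optimizing a compound goal, such as a field out with a high infield hit-location density, would need a different scoring function in place of the single-outcome lookup.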

References