Back in August, I wrote a post about my Baseball Century Experiment. I haven’t had much of a chance to do actual work on it, but in the months since, I’ve done a lot of reading, particularly on the science of sabermetrics, and I now have a plan, at least, for moving forward with this little experiment. I explained some of my reasons for doing this in my original post, and a few people pointed out that there was already software out there that does this, so why reinvent the wheel? I have a few thoughts on this:
The first and, for me, most important reason is to learn. Sure, there is software out there that can do this, but it exposes only the results. I’m interested in the internal mechanics of how such a piece of software might work. This helps me in three ways:
- It allows me to make a deeper exploration of baseball by implementing it as a simulation myself.
- It allows me to dive deeper into a development package–in this case, Mathematica–that I want to know better.
- It allows me to tinker in ways that I could not do with off-the-shelf software.
Second, I’ve looked at the software that is out there. The top-of-the-line option appears to be Out of the Park Baseball (OOTP). Not only did I look at this, but I bought a copy for my Mac and played around with it a bit. It gets to some of what I am looking to do, but not all of it. I’m not (at the moment) interested in human management in the game. I’m currently more interested in simulating human management through some basic game AI. That is part of the fun for me.
Third, I’m not interested in developing the kind of elaborate interface that OOTP has. My simulation will be entirely text-based. My ideal output and presentation layer would be something akin to WolframAlpha, for baseball, where you could type in some natural language queries and get a boatload of results, charts, graphs, numbers, etc. But at the simplest level, I’m satisfied with producing text-based box scores, play-by-plays, rosters, lineups, standings, etc.
Fourth, I’m not interested in using real players. Part of the point is to think of this as almost an alternate history to baseball. Fictional players, randomly generated, moving through careers based on statistically valid simulations.
The ultimate goal of my initial experiment is to be able to simulate 100 continuous seasons of baseball, and then look at the resulting numbers to see who the leaders are. Did anyone ever hit .400 in a season? Did anyone break a 56-game hitting streak? Who is the home run king, and what is the record? Did any pitcher throw a perfect game?
My approach to all of this is starting very simple and layering on more and more complexity. Over the last several weeks, I have drawn up a plan for how I will approach this. It looks something like this:
1. Develop a simple player generator
Since I’m not using real players, I need a way of bootstrapping players. One of the tools I will need to create, therefore, is a player generator. As with all the tools I’ll need to develop, my plan is to start simple and layer on complexity over time. The simple version of the tool will generate names, positions, and some basic stats for the players. My present approach for generating the stats will be to assume a standard bell curve for a statistic and randomize the stats based on a normal distribution. The probabilities of such a distribution would produce an appropriate proportion of “average” players relative to “superstars” and to players who don’t perform so well. Put another way, there would be a lot of values (say, batting averages) that have small deviations from the mean. There would be very few that are far better or far worse.
It’s not a perfect solution, but it allows me to bootstrap some basic statistics in a fair way without the need to borrow from real player numbers.
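To make the bell-curve idea concrete, here is a minimal sketch of a player generator. It is written in Python rather than Mathematica, purely for illustration, and every name and number in it is an assumption: the name lists, the `Player` fields, and especially the league mean of .260 and standard deviation of .025 are placeholders, not values taken from any real league.

```python
import random
from dataclasses import dataclass

# All names and numbers below are illustrative assumptions, not real data.
POSITIONS = ["P", "C", "1B", "2B", "3B", "SS", "LF", "CF", "RF"]
FIRST_NAMES = ["Abe", "Buck", "Cal", "Dutch", "Eddie", "Hank"]
LAST_NAMES = ["Archer", "Bishop", "Calloway", "Dunn", "Everett", "Finch"]
LEAGUE_MEAN_AVG = 0.260  # assumed league-wide mean batting average
LEAGUE_SD_AVG = 0.025    # assumed standard deviation


@dataclass
class Player:
    name: str
    position: str
    batting_avg: float


def generate_player(rng):
    """Create one fictional player with a normally distributed batting average."""
    avg = rng.gauss(LEAGUE_MEAN_AVG, LEAGUE_SD_AVG)
    avg = min(max(avg, 0.0), 1.0)  # clamp so the value stays a valid average
    name = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
    return Player(name, rng.choice(POSITIONS), round(avg, 3))


def generate_pool(size, seed=None):
    """Create a pool of `size` players; pass a seed for repeatable output."""
    rng = random.Random(seed)
    return [generate_player(rng) for _ in range(size)]
```

Calling `generate_pool(500, seed=1)` would give a repeatable pool of 500 players. Most batting averages land close to the assumed mean, and only a handful fall far above or below it, which is the behavior described above.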
2. Develop a simple team generator
The team generator in this instance is a way of picking out the players needed to create a roster of n people, with all of the necessary slots filled (so many pitchers, so many fielding positions, etc.) from the pool of available people. In a more complicated version, the team generator would be a kind of AI scout or GM, looking at what is available and getting the best that it could. But that is way down the line. Right now, I’m simply looking to be able to create teams out of the players generated in #1.
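A simple team generator along these lines might look like the sketch below (again in Python, for illustration only). The roster shape in `ROSTER_SLOTS` and the dictionary layout of a player are assumptions, and the selection is purely random, with none of the scouting logic a later version would add.

```python
import random

# Hypothetical roster shape: how many players to carry at each position.
ROSTER_SLOTS = {"P": 10, "C": 2, "1B": 2, "2B": 2, "3B": 2,
                "SS": 2, "LF": 2, "CF": 2, "RF": 1}


def build_team(pool, slots, rng):
    """Fill every roster slot at random from `pool`.

    Players are assumed to be dicts with a "position" key. Chosen players
    are removed from `pool` so that the next team cannot pick them again.
    """
    roster = []
    for position, count in slots.items():
        # Indexes of the players still available at this position.
        candidates = [i for i, p in enumerate(pool) if p["position"] == position]
        if len(candidates) < count:
            raise ValueError(f"not enough players available at {position}")
        picked = set(rng.sample(candidates, count))
        roster.extend(pool[i] for i in sorted(picked))
        # Drop the picked players from the shared pool in place.
        pool[:] = [p for i, p in enumerate(pool) if i not in picked]
    return roster
```

Because the pool shrinks with each call, calling `build_team` repeatedly on the same pool produces teams that never share a player, and it fails loudly once a position runs dry.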
3. Develop a lineup generator
Again, we are talking simple here. In a more complex version, the lineup generator would be part of the manager AI function. For now, I’m looking to produce the best possible lineup with the data available. In its simplest terms, this is likely a two-part problem:
- Identify a team player for every position.
- For each position, sort the players by OBP (on-base percentage) and then choose the best OBP for the given position.
At this point, the pitcher almost doesn’t matter.
In future versions, I’ll probably also look at some more advanced sabermetric statistics, but this is good enough for now.
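The two-part lineup problem can be sketched in a few lines of Python. This is only an illustration under some assumptions: players are dictionaries with `position` and `obp` keys, the pitcher is left out of the default position list, and the function name `pick_lineup` is made up for this example.

```python
# The eight non-pitching positions; leaving out the pitcher is an assumption.
FIELD_POSITIONS = ["C", "1B", "2B", "3B", "SS", "LF", "CF", "RF"]


def pick_lineup(roster, positions=FIELD_POSITIONS):
    """Return a dict mapping each position to the roster's best-OBP player there.

    Players are assumed to be dicts with "position" and "obp" keys.
    """
    lineup = {}
    for position in positions:
        # Part 1: find every player on the team who plays this position.
        candidates = [p for p in roster if p["position"] == position]
        if not candidates:
            raise ValueError(f"no player available at {position}")
        # Part 2: sort by on-base percentage, best first, and take the top one.
        ranked = sorted(candidates, key=lambda p: p["obp"], reverse=True)
        lineup[position] = ranked[0]
    return lineup
```

This only decides who starts at each position. The batting order is a separate question that the sketch does not attempt to answer.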
4. Simulate a match-up using simple BLOOP methodology
BLOOP is a method of simulation that sabermetricians have used quite a bit. It’s fairly simple. It involves calculating the probability of various outcomes based on the hitter’s stats. A more sophisticated version normalizes these calculations based on the pitcher the hitter is facing and across the league, but for now, I’m keeping things simple.
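As a rough illustration of the general idea (turning a hitter’s stats into outcome probabilities and then sampling from them), here is a small Python sketch. It is not a faithful implementation of BLOOP itself: the stat keys, the set of outcomes, and the decision to ignore hit-by-pitch, sacrifices, and the pitcher entirely are all simplifying assumptions.

```python
import random


def outcome_probabilities(stats):
    """Convert a hitter's counting stats into per-plate-appearance probabilities.

    `stats` is assumed to hold PA (plate appearances), H (hits), 2B, 3B, HR,
    and BB (walks). Everything that is not a hit or a walk counts as an out.
    """
    pa = stats["PA"]
    singles = stats["H"] - stats["2B"] - stats["3B"] - stats["HR"]
    events = {
        "BB": stats["BB"],
        "1B": singles,
        "2B": stats["2B"],
        "3B": stats["3B"],
        "HR": stats["HR"],
    }
    events["OUT"] = pa - sum(events.values())
    return {outcome: count / pa for outcome, count in events.items()}


def simulate_plate_appearance(probs, rng):
    """Draw one outcome at random, weighted by its probability."""
    outcomes = list(probs)
    weights = [probs[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]
```

For a made-up hitter with 600 plate appearances, 150 hits (30 doubles, 5 triples, 20 home runs), and 60 walks, 390 of the 600 plate appearances end in an out, so the probability of an out comes to 0.65. Repeated calls to `simulate_plate_appearance` would produce outs at roughly that rate.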