Tek Forward and RetroSheet Data

We had been looking for a dataset we could use to teach our “Intro to Database Development” class, and we found it in RetroSheet.org. They have amassed an incredible amount of data related to baseball going back to the late 1800s. From ballparks, teams, players, and umpires, to play-by-play for just about every single game, the wealth of information is amazing.

For us, it offers several opportunities:

  • It comes in CSV files, allowing us to teach how to load this type of file either through Postgres tools or Python (see the sketch after this list).

  • It has different files for ballparks, teams, player bios, and play-by-play, allowing us to teach how to:

    • Create tables and indexes

    • Clean up data

    • Standardize data

    • Create constraints between tables and within columns

    • Create scripts to load data from raw tables into new tables.
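
To make the first bullet concrete, here is a minimal sketch of loading one of these CSV files into an all-character staging table with the psql client. The table layout and column names are placeholders based on what we expect in Ballpark.csv, not a finished schema.

```sql
-- Minimal sketch: stage a RetroSheet CSV in Postgres with every column as text.
-- Column names are placeholders based on what we expect in Ballpark.csv.
CREATE TABLE ballpark_raw (
    parkid     text,
    name       text,
    aka        text,
    city       text,
    state      text,
    start_date text,
    end_date   text,
    league     text,
    notes      text
);

-- \copy is a psql meta-command, so the file only needs to exist on the client.
-- Drop HEADER if the file turns out not to have a header row.
\copy ballpark_raw FROM 'Ballpark.csv' WITH (FORMAT csv, HEADER true);
```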

Although RetroSheet has several files we can use, for our purposes we are going to concentrate on the following four:

  • Ballpark.csv

  • Teams.csv

  • Biofile0.csv

  • Plays.csv

Each section below will deal with a single file/table and how we are using it to teach. Hope you enjoy it!

Plays.csv

RetroSheet has several files available. For our purposes, we are going to use plays.csv, a file containing more than 16 million rows, each one a play in a game. A row in this file contains everything that happened in a single play. A play is defined as:

“A continuous action from the moment the pitcher throws the ball until it becomes dead, or the pitcher regains possession of the ball while in the pitching position.”

A row also captures the where, who, and what of the play. This includes, but is not limited to, where the game was played, who was batting, running, pitching, or on base, what the count was, and so on. To answer the questions presented to us by our followers, we must learn how to dissect this data in ways that can be used.

You can find the column definition for plays.csv here.

What are we doing with it?

Nothing. Well, basically nothing. The raw data is loaded into a table we call plays_raw, in which every column is character. We then create a new table with columns of the correct data types (date, number, character) and lengths, moving only the data necessary to answer the questions our followers are asking. As data is moved to the new table, each column is loaded with only “valid” values.
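
Here is a minimal sketch of that raw-to-typed move. The plays_raw table comes from the text above, but the raw column names and the handful of target columns shown are assumptions for illustration; check them against the column definitions linked earlier.

```sql
-- Sketch of the raw-then-typed move for plays. plays_raw is the all-character
-- staging table described above; the raw column names used here (gid, inning,
-- batter, pitcher, site) are guesses at the layout.
CREATE TABLE plays (
    game_id    varchar(12) NOT NULL,
    game_date  date        NOT NULL,
    inning     smallint    NOT NULL,
    batter_id  varchar(8)  NOT NULL,
    pitcher_id varchar(8)  NOT NULL,
    park_id    varchar(5)
);

INSERT INTO plays (game_id, game_date, inning, batter_id, pitcher_id, park_id)
SELECT pr.gid,
       -- RetroSheet game ids embed the date (e.g. NYA192104130), so the
       -- middle eight characters become a real date column.
       to_date(substr(pr.gid, 4, 8), 'YYYYMMDD'),
       pr.inning::smallint,
       pr.batter,
       pr.pitcher,
       pr.site
FROM plays_raw pr
WHERE substr(pr.gid, 4, 8) ~ '^\d{8}$'   -- keep rows with a plausible date
  AND pr.inning ~ '^\d+$';               -- keep rows with a numeric inning
```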

Ballpark.csv

Because plays.csv contains ballpark codes, we need to be able to decipher those codes. Enter ballpark.csv. Ballpark.csv contains information about all ballparks where Major League Baseball has been played, home and abroad, including location. Simple enough.

You can find the column definition for ballpark.csv here.

What are we doing with it?

As with the plays.csv data, we create a ballpark_raw table, all with character columns, and then move the data to a new table using correct data types and lengths for the columns. It is with this table that we begin to look into standardizing columns like state and country, laying the foundation to use these standardizations in other tables.
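
One way to back that standardization, sketched below under assumed names, is a small reference table of state codes that ballpark, and later the teams and players tables, can point to with a foreign key.

```sql
-- Sketch of the standardization idea: a small reference table of state and
-- province codes that ballpark, and later teams and players, can reference.
-- All table and column names here are assumptions for illustration.
CREATE TABLE states (
    state_code   char(2) PRIMARY KEY,   -- 'NY', 'ON', ...
    state_name   text    NOT NULL,
    country_code char(2) NOT NULL       -- 'US', 'CA', ...
);

CREATE TABLE ballpark (
    park_id      varchar(5) PRIMARY KEY,
    park_name    text       NOT NULL,
    city         text,
    state_code   char(2) REFERENCES states (state_code),
    country_code char(2)
);

-- Trim and upper-case the raw state value so it lines up with the reference
-- table before the foreign key is checked.
INSERT INTO ballpark (park_id, park_name, city, state_code)
SELECT parkid, name, city, NULLIF(upper(trim(state)), '')
FROM ballpark_raw;

-- Derive the country from the standardized state code.
UPDATE ballpark b
SET country_code = s.country_code
FROM states s
WHERE b.state_code = s.state_code;
```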

Teams.csv / CurrentTeams.csv

Like ballpark codes, plays.csv contains team IDs, and we need to be able to determine which teams are involved. Teams.csv contains all teams that have ever played a game in the Major Leagues. CurrentTeams.csv contains a subset of those teams, but it also includes division and franchise information.

What are we doing with it?

A few things. First, we are going to combine what data we can from the two files into one table, using correct data types and lengths. Second, as with the ballpark data, we will standardize states and add a country code.
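
A sketch of that combination step might look like the following. The staging tables (teams_raw, currentteams_raw) and their columns are assumed layouts, not the actual file headers.

```sql
-- Sketch of combining the two staging tables into one typed teams table.
-- teams_raw / currentteams_raw and their columns are assumed layouts.
CREATE TABLE team (
    team_id    varchar(3) PRIMARY KEY,
    league     char(2),
    city       text,
    nickname   text,
    first_year smallint,
    last_year  smallint,
    division   text,          -- only known for current teams
    franchise  varchar(3)     -- only known for current teams
);

INSERT INTO team (team_id, league, city, nickname, first_year, last_year,
                  division, franchise)
SELECT t.teamid,
       t.league,
       t.city,
       t.nickname,
       NULLIF(t.first_year, '')::smallint,
       NULLIF(t.last_year, '')::smallint,
       c.division,
       c.franchise
FROM teams_raw t
LEFT JOIN currentteams_raw c ON c.teamid = t.teamid;   -- keep historical teams too
```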

Biofile0.csv

Players, managers, coaches, and umpires all reside in this file. For us, it will be all about the players, of course.

What are we doing with it?

Because for us it is all about the players, we are going to move only players to a new table, with correct data types and lengths. And of course, we will then standardize states and add a country code.
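
As a sketch of that filter-and-move step, something like the following would work, assuming biofile_raw is the staging table and that a non-empty playing-debut column is what marks a person as a player; the column names are guesses at the layout.

```sql
-- Sketch of moving only players into a typed table. biofile_raw and its
-- column names are assumptions; here a non-empty playing-debut date is what
-- marks a row as a player rather than a manager, coach, or umpire.
CREATE TABLE player (
    player_id     varchar(8) PRIMARY KEY,
    last_name     text NOT NULL,
    first_name    text,
    birth_date    date,
    birth_state   text,     -- standardized later, as with ballparks
    birth_country text,     -- replaced later by a country code
    play_debut    date
);

INSERT INTO player (player_id, last_name, first_name, birth_date,
                    birth_state, birth_country, play_debut)
SELECT b.id,
       b.last_name,
       b.first_name,
       -- adjust the format string to whatever the file actually uses
       to_date(NULLIF(b.birth_date, ''), 'MM/DD/YYYY'),
       NULLIF(trim(b.birth_state), ''),
       NULLIF(trim(b.birth_country), ''),
       to_date(NULLIF(b.play_debut, ''), 'MM/DD/YYYY')
FROM biofile_raw b
WHERE NULLIF(b.play_debut, '') IS NOT NULL;   -- keep only people who played
```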