Major League Baseball 2019 Analysis

3 Analysis Overview

This project was a little different from the others: I sourced my own data, developed a hypothesis, tested it using regression and clustering, and presented the results of the exploratory analysis in a Tableau dashboard.

I have a passion for baseball, so I chose to source data from Major League Baseball and explore some common variables in new ways.

Data:

The data for this project was sourced from Kaggle, link here. It consists of batting statistics from Major League Baseball's 2019 season. All data was cleaned and analyzed in Python. The code for the project can be viewed via my GitHub link.

Skills/Tools

Python
Jupyter Notebooks
Geographical Visualizations in Python
Regression
Sourcing Data
Unsupervised Machine Learning: Clustering
Time Series
Data Dashboards

Research Questions asked throughout analysis:

In which region were the most runs scored?

In which region did the most strikeouts occur?

How did weather impact the number of pitches seen throughout the season?

Under which sky conditions were the most hits recorded?

What impact does wind speed have on total runs?

What impact does temperature have on runs?

Project Steps

1

The first step of this project was to source and clean my data. I knew that I wanted to analyze baseball data. Once I found the 2019 season data, I imported it into a Jupyter notebook and began to clean it (basic consistency checks, removing duplicates, handling missing values, checking data types). Here, I also outlined any ethical limitations of the data, began to explore some of the variables, and created a data profile. As always, the code can be found via my GitHub link. Figure 1 shows a few snapshots of the data profile.

Figure 1
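The cleaning checks listed in this step can be sketched roughly as follows. This is a minimal illustration on made-up rows, not the project's actual notebook: the column names (team, runs, wind_speed, sky) and the choice to fill missing wind speeds with the median are assumptions made for the example.

```python
import pandas as pd

# Hypothetical rows standing in for the 2019 batting file.
# Column names and values are assumptions for illustration only.
raw = pd.DataFrame({
    "team": ["ARI", "ARI", "BOS", "BOS", "CHC"],
    "runs": ["5", "5", "3", None, "7"],
    "wind_speed": [8.0, 8.0, 12.0, 4.0, None],
    "sky": ["Cloudy", "Cloudy", "Sunny", "sunny ", "Overcast"],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Consistency check: normalize stray whitespace and casing in categories.
    out["sky"] = out["sky"].str.strip().str.title()
    # Remove exact duplicate rows.
    out = out.drop_duplicates()
    # Data types: runs arrived as text, so convert it to numbers.
    out["runs"] = pd.to_numeric(out["runs"], errors="coerce")
    # Missing values: drop rows with no run count (assumed policy) ...
    out = out.dropna(subset=["runs"]).copy()
    out["runs"] = out["runs"].astype(int)
    # ... and fill missing wind speeds with the median (assumed policy).
    out["wind_speed"] = out["wind_speed"].fillna(out["wind_speed"].median())
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned.dtypes)
print(cleaned)
```

Whether to drop or fill a missing value depends on the column; the policies above are only one reasonable choice.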

3

I tested my hypothesis using a supervised machine learning technique: regression. I split the data into a training set and a test set, fit the model, and found that wind had no correlation with the runs scored. After this, I applied an unsupervised machine learning technique, clustering, to the variables and calculated descriptive statistics using the groupby function. Figure 3 shows the scatterplot after clustering as well as the regression test.

Figure 3
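As a rough sketch of the workflow described in this step (train/test split, linear regression, clustering, then grouped descriptive statistics), the example below uses scikit-learn and pandas on synthetic data. The library choice, the k-means algorithm, the number of clusters, and the column names wind_speed and total_runs are all assumptions; the project's real code is in the author's GitHub and may differ.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: runs are generated independently of wind,
# so the fitted slope should be close to zero. Not real MLB numbers.
rng = np.random.default_rng(42)
n = 200
games = pd.DataFrame({
    "wind_speed": rng.uniform(0, 20, n),
    "total_runs": rng.poisson(9, n),
})

# Supervised step: hold out 30% of the games as a test set.
X = games[["wind_speed"]]
y = games["total_runs"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
slope = model.coef_[0]
r2 = r2_score(y_test, model.predict(X_test))
print(f"slope={slope:.3f}, test R^2={r2:.3f}")

# Unsupervised step: k-means on both variables (3 clusters is an assumption).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
games["cluster"] = kmeans.fit_predict(games[["wind_speed", "total_runs"]])

# Descriptive statistics per cluster via groupby.
summary = games.groupby("cluster")[["wind_speed", "total_runs"]].agg(
    ["mean", "median", "count"]
)
print(summary)
```

A slope near zero and a test R^2 near zero are what "no correlation" looks like in this setup. The features are left unscaled here for brevity; scaling them before k-means is a common refinement.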

Additional Time Series Analysis

It is important to note that I also conducted an additional time series analysis during this project. Because the baseball data covers only 2019, a time series analysis of it would not have been beneficial, so I ran the analysis on additional data sourced through an API. This was a key learning point. The code for this procedure can be viewed in my GitHub at this link.

2

In this next step, I explored some of the columns and variables in the data and developed a hypothesis based on them: wind speeds greater than 10 will contribute to more runs. I added a column for the state/location of each game in order to create geographical visualizations of the data. I completed the geospatial analysis with a choropleth map built with the relevant Python libraries. Figure 2 shows the choropleth map.

Figure 2
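The two data-preparation ideas in this step, flagging games by the wind threshold in the hypothesis and adding a state column to support a map, might look something like the sketch below. The column names, the team-to-state lookup, and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical game rows; names and values are illustrative only.
games = pd.DataFrame({
    "home_team": ["ARI", "BOS", "CHC", "ARI", "BOS"],
    "wind_speed": [4, 12, 15, 9, 11],
    "total_runs": [9, 7, 10, 6, 8],
})

# Assumed lookup from home team to two-letter state code.
TEAM_STATE = {"ARI": "AZ", "BOS": "MA", "CHC": "IL"}
games["state"] = games["home_team"].map(TEAM_STATE)

# Split games at the hypothesis threshold (wind speed greater than 10).
games["wind_group"] = games["wind_speed"].gt(10).map(
    {True: "over 10", False: "10 or under"}
)

# Average runs on each side of the threshold.
runs_by_wind = games.groupby("wind_group")["total_runs"].mean()

# State-level totals: one row per state, ready to feed a choropleth map.
runs_by_state = games.groupby("state", as_index=False)["total_runs"].sum()

print(runs_by_wind)
print(runs_by_state)
```

A table shaped like runs_by_state, with one row per state code and one value column, is the usual input for a choropleth. Plotly Express, for example, accepts it through px.choropleth(runs_by_state, locations="state", locationmode="USA-states", color="total_runs", scope="usa"); that is one option, not necessarily the library this project used.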

4

The last step of this project was to create a data dashboard to present my findings on the variables in the data set. I built the dashboard in Tableau Public as a storyboard. Link to the storyboard here. I recreated many of the Python visuals in Tableau, created interactive visuals, and noted the data limitations in this step as well. I also recommended further steps based on the analysis. Figure 4 shows a few snapshots of the storyboard.

Figure 4

Final Results

My findings are viewable in my Tableau Public Link. The presentation comprises the following sections:

Project description
Weather impact
Further analysis
Unlikely findings
Results

I found that Arizona had the highest average temperatures during the season. More runs were scored on cloudy days. There is a weak positive relationship between total runs scored and temperature: teams that played in warmer weather scored slightly more runs. There is no relationship between wind speed and total runs. The Northeast and Southwest had the most runs scored relative to total pitches seen, and the Midwest had the highest number of plate appearances as well as doubles. There is a correlation between pitches seen and strikeouts. The more hits an opponent has, the more likely a team is to score runs.

Recommendations

I recommend that further analysis be conducted with a larger data set. This analysis is limited to 2019 and could be repeated for additional years. Without more data, it is hard to say whether the same trends would occur in other years.

Challenges

I faced several challenges throughout this project. One of the major challenges was sourcing the data. The website that generated this batting data no longer does so, and it would have been nice to be able to combine more years. This project was also less structured, which made identifying variables a bit harder. As always, I used new Python concepts and code.

In the future, I plan to continue practicing my Python and Tableau skills. Understanding more advanced Python techniques such as clustering will continue to take practice.
