Wednesday 31 May 2017

From Baseball Savant to Tableau

Baseball Savant is a great site: very user friendly, with interesting, good-quality stats and nice visualizations. So why do I still feel the need to "have this data" myself? As many will have noticed, the download function times out and is limited to a set number of records, somewhere in the range of 30,000. There is also no way that I am aware of to link "live" to the data. No intention here to say a bad thing about Baseball Savant, as what they offer is great; it just doesn't always meet my needs. Combine that with a desire to learn more about R and cloud data storage, and we have a challenge: freely querying Baseball Savant data without limitations, and being able to use it from a cloud environment in tools like Tableau for analysis and visualization.

Part 1: downloading the data from Baseball Savant (I'll refer to it as BBS from here on)
As I mentioned, there are limitations on what you can download from BBS, and as I didn't want to download the data in chunks by hand (like per player), I looked for a way to automate the process. Look no further than Bill Petti's awesome R package, baseballr. Among many other things, Bill's package lets you download BBS data in chunks and then use R code to combine it all together.

I am very new to R / RStudio, and have just started making my way through Analyzing Baseball Data with R, which also has a great website: https://baseballwithr.wordpress.com/. That being said, I do have coding experience in other languages, so I was able to hack together the following rather quickly:

setwd("<file path") #change this to the workdirectory you want

rm(list=ls())           #removes all datasets, vars etc from the environment

for (y in c(2015, 2016, 2017))                      #loop through the years
{
  yearpart <- y
  for (m in 3:10)                                   #loop through the months (March through October)
  {
    monthpart <- m

    for (d in 1:31)                                 #loop through the days of the month
    {
      daypart <- d
      startpartD <- paste(yearpart, "-", monthpart, "-", daypart, sep = "")
      endpartD <- startpartD

      scrapebb <- try(scrape_statcast_savant_batter_all(startpartD, endpartD), silent = TRUE)
        #try() lets the loop continue when there is no play data for that day (or the date doesn't exist). Here I'm using Bill's package

      Filetext <- paste("bbsavant_B_", startpartD, "_", endpartD, ".csv", sep = "")
        #combine variables and text for the file name

      if (!inherits(scrapebb, "try-error")) {
        write.csv(scrapebb, file = Filetext)        #only write the csv when the scrape actually returned data
      }
      print(Filetext)
      remove(list = c("d", "daypart", "endpartD", "Filetext", "startpartD"))    #removes all daily vars from the environment
    }
    remove(list = c("m", "monthpart"))              #removes all monthly vars from the environment
  }
}

I know this is not the most efficient way to do this, but it works. It creates daily data sets of all batter data. Replace 'scrape_statcast_savant_batter_all' with 'scrape_statcast_savant_pitcher_all' and 'bbsavant_B_' with 'bbsavant_P_' to get and write pitcher data. The data is very similar, but there are some small differences.
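
If you don't want to maintain two copies of the loop, the batter/pitcher switch can also be wrapped in a small helper. This is a minimal sketch, assuming the same two baseballr functions; the helper name and its arguments are my own invention:

#A sketch only: scrape a single day for batters ("B") or pitchers ("P") and write it to a csv
library(baseballr)

scrape_savant_day <- function(day, type = c("B", "P")) {
  type <- match.arg(type)
  scraper <- if (type == "B") scrape_statcast_savant_batter_all else scrape_statcast_savant_pitcher_all
  res <- try(scraper(day, day), silent = TRUE)                 #skip days without play data
  if (!inherits(res, "try-error") && is.data.frame(res) && nrow(res) > 0) {
    write.csv(res, file = paste("bbsavant_", type, "_", day, "_", day, ".csv", sep = ""))
  }
  invisible(res)
}

#Example: one day of pitcher data
#scrape_savant_day("2016-7-4", type = "P")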

From here I created folders per year for Pitchers and Batters and moved the appropriate daily exports into the right folders (a quick sketch of scripting that move follows). Per folder I then ran the script shown after the sketch to combine the daily data files into a single yearly data set:
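
Moving the daily files could itself be scripted; a minimal sketch in base R, assuming the bbsavant_B_ / bbsavant_P_ file names produced by the loop above (the folder names are just an example):

#A sketch only: create a folder per type/year and move the matching daily csv files into it
for (y in c(2015, 2016, 2017)) {
  for (type in c("B", "P")) {
    folder <- paste(type, y, sep = "_")                                  #e.g. "B_2015"
    dir.create(folder, showWarnings = FALSE)
    files <- list.files(pattern = paste("^bbsavant_", type, "_", y, "-", sep = ""))
    file.rename(files, file.path(folder, files))                         #move the files into their folder
  }
}

And the combine script: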

setwd("<file path>") #change this to the workdirectory you nwat

#Make sure to set the setWD() and clear the objects from workspace

file_list <- list.files(pattern="\\.csv$") #get a list of the csv files in the directory ("\\.csv$" matches only files ending in .csv)

#Merging the Files into a Single Dataframe
for (file in file_list){
  
  # if the merged dataset doesn't exist yet, create it from the first file;
  # otherwise read the next file and append it (the else prevents the first file from being added twice)
  if (!exists("dataset")){
    dataset <- read.table(file, header=TRUE, sep=",")
  } else {
    temp_dataset <- read.table(file, header=TRUE, sep=",")
    dataset <- rbind(dataset, temp_dataset)
    rm(temp_dataset)
  }
  
}
write.csv(dataset, "combined_<year>_<Batters/Pitchers>.csv")
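
For what it's worth, the same merge can also be written more compactly; a sketch under the same assumptions (all csv files in the working directory belong together and share the same columns):

#Compact alternative: read every csv in the folder and bind all rows together in one go
file_list <- list.files(pattern = "\\.csv$")
dataset <- do.call(rbind, lapply(file_list, read.csv))
write.csv(dataset, "combined_<year>_<Batters/Pitchers>.csv", row.names = FALSE)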

Part 2: querying the data in a place I can access from anywhere (aka Cloud)
Over the last year I tried numerous ways to store and access data from the cloud, and Google Sheets worked best for me. However, when combining a few large data sources, queries and even the Sheet files themselves become slow, really too slow to work with. I didn't want to set up a server environment at home, so a hosted cloud option appeared to be the way to go. To store the data I was automatically thinking of a database, and since I am familiar with Postgres I wanted to set up a Postgres database and an admin tool to run queries.

Oh yeah, one more small requirement: free, or very low cost. This proved a hard requirement to meet. I think Amazon AWS is fairly affordable for hosting a Postgres database on a VM, but since I have been a long-time user of the Google ecosystem I checked that out first (after initially looking at other options but not finding anything applicable) and created a Postgres database there. Although that allowed me to store and query data, it felt somewhat over-complicated, and I got distracted by some of Google's other offerings, specifically Google BigQuery:

"BigQuery is Google's fully managed, petabyte scale, low cost enterprise data warehouse for analytics. BigQuery is serverless. There is no infrastructure to manage and you don't need a database administrator, so you can focus on analyzing data to find meaningful insights using familiar SQL."

That sounds like my thing. I have large datasets that I want to query and analyze quickly, and the querying will be fairly simple. I can use GBQ for free for a year, which should give me enough time to decide if it works for me. Even better, Tableau connects natively to GBQ and Google Sheets, while Google also offers a new product called Data Studio for data visualization (currently in beta, and limited compared to Tableau, but it appears to be actively and continuously improved).

So the Google Platform is my environment of choice, at least for the free year.

Loading the combined BBS yearly datasets is super easy, and querying uses typical SQL, which also allows me to link with other data sources, like Crunchtimebaseball's fantastic MLB player map. I haven't figured out yet whether it is possible to have a Google Sheet pull Crunchtimebaseball's data as a csv on a weekly basis, and then let GBQ refresh directly from that Google Sheet, but it looks promising.
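
To illustrate the kind of join I have in mind, here is a minimal sketch using the bigrquery R package; the dataset, table and column names are placeholders for my own setup, not something to copy verbatim:

#A sketch only: join the Statcast data with the Crunchtimebaseball player map in GBQ, straight from R
library(bigrquery)

sql <- "
SELECT s.player_name, s.events, m.mlb_pos
FROM [statcast.combined_2016_Batters] s
JOIN [statcast.crunchtime_player_map] m ON s.batter = m.mlb_id
WHERE s.events = 'home_run'
"

result <- query_exec(sql, project = "<your GCP project id>")   #runs the query and returns a data frame
head(result)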

So in summary I now have BBS data in a cloud environment that I can query and combine with other data sources, from anywhere.

Part 3: analyzing and visualizing data in Tableau or other tools
As mentioned, Tableau Desktop connects natively to GBQ, but unfortunately this connector is not available in Tableau Public. Perhaps there is a way to build your own connector, or it might just be a matter of time before it becomes available. For now, in GBQ you can create a dataset with a query and save it as a view, or export it to Google Sheets, which Tableau Public can connect to. And lastly, Google's Data Studio can use the GBQ data natively and looks promising.

I'll keep posting updates as I learn more about the Google Cloud platform and how it works with Tableau.

Note that I only downloaded 2015 and 2016 from Baseball Savant, as the people at BBS mention that the data prior to 2015 was not yet fully mature and reliable.

UPDATE: I tried loading the GBQ data into Data Studio and it works very easily. Unfortunately, embedding the result here is currently not supported.
