1 Introduction

The Lahman package in R contains a plethora of baseball data. The package is used extensively in the book, Analyzing Baseball Data with R, by Max Marchi and Jim Albert (2013). This tutorial will use a subset of data from the Lahman package to expose you to some basic descriptive statistical functions and data subsetting within the R Markdown environment.

I have extracted a subset of the data and stored it in an Rdata file. To read in the data correctly, save the Rmd file in a folder with the rest of your course work and place the file baseball.Rdata in the same folder. In RStudio, go to Session > Set Working Directory > To Source File Location. You may then run the code below to load the data.

load("baseball.Rdata")
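
If the call to load() fails, the file is probably not in the working directory. The sketch below shows how save() and load() fit together, using a temporary file and a small stand-in object. The names demo_ba, tmp_file, and demo_env are made up for illustration and are not part of baseball.Rdata.

```r
# Hypothetical stand-in object; the real baseball.Rdata holds other objects
demo_ba <- c("1974" = 0.269, "1975" = 0.309)

# Write the object to a temporary .Rdata file
tmp_file <- tempfile(fileext = ".Rdata")
save(demo_ba, file = tmp_file)

# Load into a separate environment so the global workspace is untouched;
# load() invisibly returns the names of the objects it restored
demo_env <- new.env()
loaded_names <- load(tmp_file, envir = demo_env)
loaded_names

# Before loading the course file, check that R can see it
file.exists("baseball.Rdata")
getwd()
```

When file.exists() returns FALSE, compare the folder printed by getwd() with the folder that holds baseball.Rdata.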

2 Lahman data

2.1 Accessing the data

To see the objects loaded from the baseball data set, use the function ls(), which lists all the objects in the current R environment.

You may see other objects from previous instances of work in R.

ls()
 [1] "batting_stats"  "CarltonFiskBA"  "CarltonFiskHR"  "CarltonFiskRBI"
 [5] "JimRiceBA"      "JimRiceHR"      "JimRiceRBI"     "TedWilliamsBA" 
 [9] "TedWilliamsHR"  "TedWilliamsRBI"

To access the content of an object in R use the object’s name. Keep in mind that R is case sensitive. Thus, we need to type an object’s name exactly as it appears.

Above we see the object JimRiceBA. Run the code below to see the contents of JimRiceBA.

JimRiceBA
 1974  1975  1976  1977  1978  1979  1980  1981  1982  1983  1984  1985 
0.269 0.309 0.282 0.320 0.315 0.325 0.294 0.284 0.309 0.305 0.280 0.291 
 1986  1987  1988  1989 
0.324 0.277 0.264 0.234 
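
JimRiceBA prints with years above the values because it is a named numeric vector. As a small sketch, the code below builds a similar vector from the first four seasons shown above and illustrates case sensitivity. The name rice_ba is a stand-in chosen for this example and is not an object in baseball.Rdata.

```r
# First four seasons, copied from the JimRiceBA output above
rice_ba <- c(0.269, 0.309, 0.282, 0.320)
names(rice_ba) <- 1974:1977   # names are stored as character strings

rice_ba            # prints the years above the values
rice_ba["1975"]    # look up one season by its name

# R is case sensitive: only the exact name refers to the object
exists("rice_ba")
exists("Rice_BA")
```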

Baseball abbreviations

Abbreviation   Meaning
BA             Batting Average
HR             Home Runs
RBI            Runs Batted In

These are some measures of a batter’s success.

2.2 Descriptive statistics

The names of many functions in R are self-explanatory. To compute the minimum, maximum, and mean of Jim Rice’s season batting averages we can use the corresponding functions given below.

min(JimRiceBA)
[1] 0.234
max(JimRiceBA)
[1] 0.325
mean(JimRiceBA)
[1] 0.292625

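To find the year in which Jim Rice had his lowest batting average and the year in which he had his highest batting average, we can make use of the functions which.min() and which.max(), respectively. Each function returns the position of the extreme value in the vector, and the output labels that position with the corresponding year.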

which.min(JimRiceBA)
1989 
  16 
which.max(JimRiceBA)
1979 
   6 
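
The output above mixes a year (the name) with a position (the value). The sketch below separates the two. The vector rice_ba is a stand-in rebuilt from the JimRiceBA values printed earlier, so the code runs without baseball.Rdata.

```r
# Stand-in for JimRiceBA, rebuilt from the printed output
rice_ba <- setNames(
  c(0.269, 0.309, 0.282, 0.320, 0.315, 0.325, 0.294, 0.284,
    0.309, 0.305, 0.280, 0.291, 0.324, 0.277, 0.264, 0.234),
  1974:1989
)

best <- which.max(rice_ba)
names(best)     # the year, as a character string: "1979"
unname(best)    # the position in the vector: 6
rice_ba[best]   # the batting average in that season

# which.max() reports only the first maximum; this form returns every tied year
names(rice_ba)[rice_ba == max(rice_ba)]
```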

Let’s examine how Rice’s batting average changed throughout his career. First, we compute year-over-year differences, then view the results. Second, we will look at which year he had the largest increase and which year he had the largest decrease.

# compute differences
JimRiceBA_diffs <- diff(JimRiceBA, lag = 1)
JimRiceBA_diffs
  1975   1976   1977   1978   1979   1980   1981   1982   1983   1984 
 0.040 -0.027  0.038 -0.005  0.010 -0.031 -0.010  0.025 -0.004 -0.025 
  1985   1986   1987   1988   1989 
 0.011  0.033 -0.047 -0.013 -0.030 
# find years
which.max(JimRiceBA_diffs)
1975 
   1 
which.min(JimRiceBA_diffs)
1987 
  13 

The # symbol was used to add comments. R does not execute anything following #. Use # for code documentation to explain to others why you are doing what you are doing with your code. Good code documentation is also beneficial for your future self.
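
To see what diff() computed above, the sketch below reproduces the year-over-year differences by subtracting each season from the next. The vector rice_ba is a stand-in rebuilt from the JimRiceBA values printed earlier, and manual_diffs is a name chosen for this example.

```r
# Stand-in for JimRiceBA, rebuilt from the printed output
rice_ba <- setNames(
  c(0.269, 0.309, 0.282, 0.320, 0.315, 0.325, 0.294, 0.284,
    0.309, 0.305, 0.280, 0.291, 0.324, 0.277, 0.264, 0.234),
  1974:1989
)

# Drop the first season, drop the last season, and subtract element-wise
manual_diffs <- rice_ba[-1] - rice_ba[-length(rice_ba)]

# The result agrees with diff(); all.equal() allows for rounding error
all.equal(manual_diffs, diff(rice_ba, lag = 1))

# Each difference is labeled with the later of the two seasons
names(manual_diffs)[1]
```

The result has one fewer element than the input, because the first season has no previous season to compare against.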

2.3 Summary statistics with two variables

Recall that correlation measures the strength and direction of the linear relationship between two quantitative variables. Let’s look at the correlation between each pair of available variables for Jim Rice: batting average, home runs, and RBIs.

cor(JimRiceBA, JimRiceHR)
[1] 0.7576232
cor(JimRiceBA, JimRiceRBI)
[1] 0.7942272
cor(JimRiceHR, JimRiceRBI)
[1] 0.9305225
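
By default, cor() computes the Pearson correlation coefficient. As a rough sketch, the code below computes it from the definition and compares the result with cor(). The vectors hr and rbi contain made-up values for illustration only; they are not Jim Rice’s statistics.

```r
# Made-up illustrative values, not taken from the Lahman data
hr  <- c(10, 22, 25, 39, 46)
rbi <- c(40, 85, 102, 114, 139)

# Deviations of each value from its mean
hr_dev  <- hr - mean(hr)
rbi_dev <- rbi - mean(rbi)

# Pearson correlation: sum of cross-products over the root of the
# product of the sums of squares
r_manual <- sum(hr_dev * rbi_dev) / sqrt(sum(hr_dev^2) * sum(rbi_dev^2))

all.equal(r_manual, cor(hr, rbi))

# The order of the arguments does not matter
all.equal(cor(hr, rbi), cor(rbi, hr))
```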

To view a simple scatterplot of Rice’s RBIs versus his home runs, with home runs on the x-axis, we can use the plot() function.

plot(JimRiceHR, JimRiceRBI)

To learn more about Jim Rice click here.

2.4 Exercises

Ted Williams, 1939

Answer parts 1-9 below. Use a separate code chunk for each part that requires code. You will examine data on Ted Williams.

To remind yourself of the variable names use the function ls().

  1. Use the length function to determine how many seasons Ted Williams played.

  2. In which season did Ted Williams have his highest batting average?

  3. Plot Williams’ batting average over time. To put the years on the x-axis, use names(TedWilliamsBA).

  4. What was Williams’ highest batting average?

  5. What was Williams’ career mean batting average?

  6. What was the correlation between Williams’ home runs and RBIs? Was it higher than Jim Rice’s correlation?

  7. What was the largest absolute change in Williams’ RBIs year-over-year?

  8. Why does Ted Williams not have any statistics for 1943-1945? Was he hurt?

  9. Which of the three players (Fisk, Rice, Williams) was most consistent year-over-year with regard to batting average? How did you define consistency?
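
A note on the plotting hint: names() returns the years as character strings, so it can help to convert them to numbers before plotting. The sketch below shows the idea on a stand-in vector rebuilt from the JimRiceBA values printed earlier. The names rice_ba and years are chosen for this example, and the axis labels are only suggestions.

```r
# Stand-in for JimRiceBA, rebuilt from the printed output
rice_ba <- setNames(
  c(0.269, 0.309, 0.282, 0.320, 0.315, 0.325, 0.294, 0.284,
    0.309, 0.305, 0.280, 0.291, 0.324, 0.277, 0.264, 0.234),
  1974:1989
)

# names() gives character strings; convert them to numeric years
years <- as.numeric(names(rice_ba))
class(names(rice_ba))
class(years)

# Points joined by lines, with the season on the x-axis
plot(years, rice_ba, type = "b",
     xlab = "Season", ylab = "Batting average")
```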

3 R Markdown

3.1 Exercises

Create the R Markdown file that produced the HTML file ica-01-10-19. All formatting should match, but you may replace my name with your own name. Below are some helpful hints.

  1. The YAML header should be:
---
title: "Console only"
author: "Shawn Santo"
date: "January 10, 2019"
output: 
  html_document:
    toc: true
    number_sections: true
    toc_float: true
    df_print: paged
---
  2. In section 3.1 the longley data frame is shown, but the code is not. You will need to use an appropriate chunk option to get your HTML file to match.

  3. To create the final plot, use the code below. You will need to install each package before you can load it with the library function.

library(ggplot2)
library(rvg)
library(ggiraph)

longley$tooltip <- paste("GNP: ", longley$GNP, sep = "")

gg_point_1 <- ggplot(longley, 
                     aes(x = Year, y = Employed, tooltip = tooltip)) + 
              geom_point_interactive(size = 4, color = "darkblue") + 
              theme_bw()

# girafe() supersedes the deprecated ggiraph() function in current ggiraph releases
girafe(ggobj = gg_point_1)

4 References

  1. Lahman, S. (2017) Lahman’s Baseball Database, 1871-2016, Main page, http://www.seanlahman.com/baseball-archive/statistics/

  2. Wikipedia, Ted Williams, https://en.wikipedia.org/wiki/Ted_Williams