Krister Martinez

Getting Started with R, Part 2

Let us continue getting started with R as we begin discussing important statistical concepts in sports analytics.

Case-scenario 1

This is outfielder Luis Robert's fourth season with the Chicago White Sox. If he hit 11, 13, and 12 home runs
during his first three seasons, how many does he need this season for his overall average to be at least 20?

Solution

Given that $x_1 = 11$, $x_2 = 13$, $x_3 = 12$,

we want to find $x_4$ such that the mean (average) number of home runs satisfies $\bar{x} \ge 20$.

Notice that in this case $n = 4$.

Setting the mean equal to 20 gives $20 \times 4 = 11 + 13 + 12 + x_4$,

so $x_4 = 80 - 36 = 44$: when $x_4 = 44$, the home-run average will be exactly 20.

# Home-runs so far
HR_before <- c(11, 13, 12)
# Average Number of Home-runs per season wanted
wanted_HR <- 20
# Number of seasons
n_seasons <- 4
# Needed Home-runs on season 4
x_4 <- n_seasons*wanted_HR - sum(HR_before)
# Minimum number of Home-runs needed by Robert
x_4

## [1] 44

According to the calculations above, Robert must hit at least 44 home runs this season to reach an average of at least 20 home runs per season.

We can confirm this by using the mean() function in R:

# Robert's performance
Robert_HRs <- c(11, 13, 12,44)
# Find mean
mean(Robert_HRs)

## [1] 20

# Find standard deviation
sd(Robert_HRs)

## [1] 16.02082

# Find the maximum number of home-runs during the four seasons period
max(Robert_HRs)

## [1] 44

# Find the minimum number of home-runs during the four seasons period
min(Robert_HRs)

## [1] 11

We can also use the summary() function to find basic statistics, including the median!

summary(Robert_HRs)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   11.75   12.50   20.00   20.75   44.00
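The rearrangement used in this case, solving n × target = sum(previous) + x_n for x_n, can be wrapped in a small reusable function. The following is a minimal sketch; the function name `needed_value` and its argument names are hypothetical and not part of the original activity.

```r
# Hypothetical helper: value needed in the final period so that the
# overall mean over all periods equals `target_mean`.
needed_value <- function(previous, target_mean) {
  n <- length(previous) + 1        # total number of periods, including the new one
  n * target_mean - sum(previous)  # solve n * target_mean = sum(previous) + x_n
}

# Robert's first three seasons and a target average of 20
x_4 <- needed_value(c(11, 13, 12), 20)
x_4

# Check: appending the result gives the target mean
mean(c(11, 13, 12, x_4))
```

Any value larger than the one returned pushes the average above the target, which is why the result is read as a minimum.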

Question 1

Now, you must complete the problem below which represents a similar case scenario. You may use the steps that we executed in Case-scenario 1 as a template for your solution.

This is outfielder Juan Soto's sixth season in the majors. If he received 79, 108, 41, 145, and 135 walks during his first five seasons, how many does he need this season for his average number of walks per season to be at least 100?

Solution

# Walks so far
W_before <- c(79, 108, 41, 145, 135)
# Average Number of Walks per season wanted
wanted_W <- 100
# Number of seasons
n_seasons <- 6
# Needed Walks on season 6
x_6 <- n_seasons * wanted_W - sum(W_before)
# Minimum number of Walks needed by Juan
cat("The minimum number of walks needed in season 6 is:", x_6, "\n\n")

## The minimum number of walks needed in season 6 is: 92

# Juan's performance (walks), including the 92 walks needed in season 6
Juan_walks <- c(79, 108, 41, 145, 135, 92)

# Find mean
mean_Juan_walks <- mean(Juan_walks)
cat("Mean of Juan's walks:", mean_Juan_walks, "\n\n")

## Mean of Juan's walks: 100

# Find standard deviation
sd_Juan_walks <- sd(Juan_walks)
cat("Standard deviation of Juan's walks:", sd_Juan_walks, "\n\n")

## Standard deviation of Juan's walks: 38.20995

# Find the maximum number of walks during the six-season period
max_Juan_walks <- max(Juan_walks)
cat("Maximum number of walks during the six-season period:", max_Juan_walks, "\n\n")

## Maximum number of walks during the six-season period: 145

# Find the minimum number of walks during the six-season period
min_Juan_walks <- min(Juan_walks)
cat("Minimum number of walks during the six-season period:", min_Juan_walks, "\n\n")

## Minimum number of walks during the six-season period: 41

# Summary of Juan's walks
cat("Summary of Juan's walks:\n")

## Summary of Juan's walks:

print(summary(Juan_walks))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   41.00   82.25  100.00  100.00  128.25  145.00

Case-scenario 2

The average salary of 10 baseball players is 72,000 dollars a week and the average salary of 4 soccer players is 84,000. Find the mean salary of all 14 professional players.

Solution

We can find the combined mean by multiplying each group's mean by its group size, adding the results, and dividing by the total number of players.

Let $n_1 = 10$ denote the number of baseball players and $y_1 = 72000$ their mean salary. Let $n_2 = 4$ denote the number of soccer players and $y_2 = 84000$ their mean salary. Then the mean salary of all 14 individuals is $\bar{y} = \frac{n_1 y_1 + n_2 y_2}{n_1 + n_2}$.

We can compute this in R as follows:

n_1 <- 10
n_2 <- 4
y_1 <- 72000
y_2 <- 84000
# Mean salary overall
salary_ave <-  (n_1*y_1 + n_2*y_2)/(n_1+n_2)
salary_ave

## [1] 75428.57
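The same combined mean can also be obtained with base R's weighted.mean() function, using the group sizes as weights. This is a minimal sketch; the vector names `group_means` and `group_sizes` are made up for illustration.

```r
# Group means and group sizes from Case-scenario 2
group_means <- c(baseball = 72000, soccer = 84000)
group_sizes <- c(baseball = 10, soccer = 4)

# Weighted mean: sum(means * sizes) / sum(sizes)
combined_mean <- weighted.mean(group_means, w = group_sizes)
combined_mean

# For contrast, the unweighted mean of the two group means ignores the
# group sizes and gives a different (incorrect) overall average
mean(group_means)
```

Because there are more baseball players than soccer players, the combined mean sits closer to 72,000 than the simple midpoint of the two group means does.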

Question 2

The average salary of 7 basketball players is 102,000 dollars a week and the average salary of 9 NFL players is 91,000. Find the mean salary of all 16 professional players.

Solution

# Input values
b_1 <- 7
nfl_2 <- 9
s_b_w_1 <- 102000
s_nfl_w_2 <- 91000

# Mean salary overall
salary_ave <- (b_1 * s_b_w_1 + nfl_2 * s_nfl_w_2) / (b_1 + nfl_2)

# Function to format numbers as currency
format_currency <- function(x) {
  paste0("$", formatC(x, format = "f", big.mark = ",", digits = 0))
}

# Format the average salary as currency
formatted_salary_ave <- format_currency(salary_ave)

# Print the formatted average salary
cat("Mean salary overall:", formatted_salary_ave, "\n")

## Mean salary overall: $95,813

Case-scenario 3

The frequency distribution below lists the number of active players in the Barclays Premier League and the time left in their contract.

[Image: Case-scenario 3 frequency distribution table]

Find the mean, the median, and the standard deviation.
What percentage of the data lies within one standard deviation of the mean?
What percentage of the data lies within two standard deviations of the mean?
What percent of the data lies within three standard deviations of the mean?
Draw a histogram to illustrate the data.

Solution

The allcontracts.csv file contains the contract lengths of all the players. We can read this file in R using the read.table() function.

The code chunk below reads data from a CSV file into a data frame and then extracts one column from the data set.
Let's break down the code:
read.table(): This function reads a file into a data frame.
"allcontracts.csv": The name of the file to be read. This file should be in CSV (Comma-Separated Values) format.
header = TRUE: This argument specifies that the first row of the CSV file contains column names.
sep = ",": This argument specifies that the fields in the file are separated by commas.

contract_length$years: This syntax extracts the column named years from the data frame contract_length.
The extracted column is assigned to a new variable named contract_years, which now holds the data from the years column of the contract_length data frame.

contract_length <- read.table("allcontracts.csv", header = TRUE, sep = ",")
contract_years <- contract_length$years

Make comments about the code we just ran above.
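For comma-separated files, read.csv() is a convenience wrapper around read.table() that already defaults to header = TRUE and sep = ",". The sketch below illustrates this on a small temporary file; the file contents and the `player` column are made up for illustration and are not the contents of allcontracts.csv.

```r
# Write a small, made-up CSV to a temporary file (illustrative values only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("player,years", "A,1", "B,3", "C,5"), tmp)

# Read it both ways
via_table <- read.table(tmp, header = TRUE, sep = ",")
via_csv   <- read.csv(tmp)

# Compare the two data frames
all.equal(via_table, via_csv)

# Extract one column as a vector with the $ operator
years <- via_csv$years
years

# Clean up the temporary file
unlink(tmp)
```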

To find the mean and the standard deviation

# Mean 
contracts_mean  <- mean(contract_years)
contracts_mean

## [1] 3.458918

# Median
contracts_median <- median(contract_years)
contracts_median

## [1] 3

# Find number of observations
contracts_n <- length(contract_years)
contracts_n

## [1] 499

# Find standard deviation
contracts_sd <- sd(contract_years)
contracts_sd

## [1] 1.69686

What percentage of the data lies within one standard deviation of the mean?

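# Proportion of observations below (mean + 1 standard deviation).
# This comparison is one-sided; a strict "within one standard deviation"
# check would use abs((contract_years - contracts_mean)/contracts_sd) < 1.
contracts_w1sd <- sum((contract_years - contracts_mean)/contracts_sd < 1)/ contracts_n
contracts_w1sd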

## [1] 0.8416834

# Difference from the empirical rule (0.68)
contracts_w1sd - 0.68

## [1] 0.1616834
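"Within one standard deviation of the mean" is a two-sided condition: a value qualifies only if it is neither too far above nor too far below the mean. The sketch below contrasts a one-sided and a two-sided check on a small made-up vector (not the contract data); the names `x` and `z` are illustrative.

```r
# Illustrative values only; not the contract data
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 9)

# Standardized distance of each value from the mean
z <- (x - mean(x)) / sd(x)

# One-sided: counts every value below mean + 1 sd, including very small ones
one_sided <- mean(z < 1)

# Two-sided: counts only values between mean - 1 sd and mean + 1 sd
two_sided <- mean(abs(z) < 1)

one_sided
two_sided
```

Taking mean() of a logical vector gives the proportion of TRUE values, so it is equivalent to sum(condition) / length(x).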

What percentage of the data lies within two standard deviations of the mean?

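## Proportion of observations below (mean + 2 standard deviations).
## This comparison is one-sided; a strict "within 2 sd" check would wrap the
## standardized value in abs().
contracts_w2sd <- sum((contract_years - contracts_mean)/ contracts_sd < 2)/contracts_n
contracts_w2sd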

## [1] 1

## Difference from empirical 
contracts_w2sd - 0.95

## [1] 0.05

What percent of the data lies within three standard deviations of the mean?

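## Within 3 sd (abs() makes the check two-sided; the result is unchanged here
## because a contract length cannot fall 3 standard deviations, about 5.1 years,
## below the mean of about 3.46 years)
contracts_w3sd <- sum(abs((contract_years - contracts_mean)/ contracts_sd) < 3)/contracts_n
contracts_w3sd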

## [1] 1

# Difference from empirical 
contracts_w3sd - 0.9973

## [1] 0.0027
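The reference values 0.68, 0.95, and 0.9973 used in these comparisons are rounded probabilities from the normal distribution. As a brief sketch, they can be reproduced with pnorm(), the standard normal cumulative distribution function; the variable names are illustrative.

```r
# Number of standard deviations on each side of the mean
k <- 1:3

# P(-k < Z < k) for a standard normal variable Z
empirical_rule <- 2 * pnorm(k) - 1
round(empirical_rule, 4)
# → 0.6827 0.9545 0.9973
```

Data that are skewed or take only a few distinct values, such as whole numbers of years, need not follow these percentages closely.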

# Create histogram
hist(contract_years,xlab = "Years Left in Contract",col = "green",border = "red", xlim = c(0,8), ylim = c(0,225),
   breaks = 5)
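When breaks is a single number, hist() treats it only as a suggestion and picks "pretty" break points, so the number of bars can differ from the number requested. A minimal sketch, using made-up values rather than the contract data, of supplying explicit break points and inspecting the resulting counts without drawing the plot:

```r
# Illustrative values only; not the contract data
x <- c(1, 1, 2, 2, 2, 3, 3, 4, 5, 7)

# Explicit one-year-wide bins from 0 to 8; plot = FALSE returns the
# histogram object without drawing it
h <- hist(x, breaks = seq(0, 8, by = 1), plot = FALSE)

h$breaks   # bin edges actually used
h$counts   # number of observations in each bin
```

With the default right = TRUE, each bin includes its right edge, so a value of 2 is counted in the bin from 1 to 2.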

Question 3

Use the skills learned in Case-scenario 3 on one of the following data sets. You may choose only one data set. They are both available in Canvas.

doubles_hit.csv and triples_hit.csv

Solution

I will be using doubles_hit.csv

doubles_hit <- read.table("doubles_hit.csv", header = TRUE, sep = ",")
dh <- doubles_hit$doubles_hit
dh

##   [1] 37  4  6  7  9 25 18 11  8 13 15  1 30 30  6 23 14 26 33 23 34 32  9  4 23
##  [26] 34 19 29 15 27 18 35  7  7 19  4 38  2 16 15 26 15  3 19 24 33 34 33 38 29
##  [51] 19 18  7  7 30 15 31 12 17 21 11  9 35  1 27 27 27 10 35 34 13  5 40 40 11
##  [76] 40 29 23 37 22 29 15 24 25 40 14  1 47 49 45 42 40 40 46 41 42 43 45 48 46

To find the mean and the standard deviation

# Mean 
dh_mean  <- mean(dh)
cat("Mean of doubles hits:", dh_mean, "\n\n")

## Mean of doubles hits: 23.55

# Median
dh_median <- median(dh)
cat("Median of doubles hits:", dh_median, "\n\n")

## Median of doubles hits: 23.5

# Find number of observations
dh_n <- length(dh)
cat("Number of observations:", dh_n, "\n\n")

## Number of observations: 100

# Find standard deviation
dh_sd <- sd(dh)
cat("Standard deviation of doubles hits:", dh_sd, "\n\n")

## Standard deviation of doubles hits: 13.37371

What percentage of the data lies within one standard deviation of the mean?

# abs() keeps only values within one standard deviation on either side of the mean
dh_w1sd <- sum(abs((dh - dh_mean)/dh_sd) < 1)/ dh_n
cat("Percentage of observations within one standard deviation of the mean:", dh_w1sd * 100, "%\n\n")

## Percentage of observations within one standard deviation of the mean: 58 %

difference_from_empirical <- dh_w1sd - 0.68
cat("Difference from the empirical rule (68%):", difference_from_empirical * 100, "%\n")

## Difference from the empirical rule (68%): -10 %

What percentage of the data lies within two standard deviations of the mean?

# Calculate the proportion of doubles within 2 standard deviations (abs() makes the check two-sided)
dh_w2sd <- sum(abs((dh - dh_mean)/ dh_sd) < 2)/dh_n

# Print the proportion with a descriptive label
cat("Proportion of double hits within 2 standard deviations:", dh_w2sd, "\n\n")

## Proportion of double hits within 2 standard deviations: 1

# Calculate the difference from the empirical rule (0.95)
difference_from_empirical <- dh_w2sd - 0.95

# Print the difference with a descriptive label
cat("Difference from the empirical rule (0.95):", difference_from_empirical, "\n")

## Difference from the empirical rule (0.95): 0.05

What percent of the data lies within three standard deviations of the mean?

# Calculate the proportion of values within 3 standard deviations
dh_w3sd <- sum(abs((dh - dh_mean) / dh_sd) < 3) / dh_n

# Calculate the difference from the empirical rule
difference_empirical <- dh_w3sd - 0.9973

cat("Proportion of values within 3 standard deviations:", dh_w3sd, "\n\n")

## Proportion of values within 3 standard deviations: 1

cat("Difference from the empirical rule (0.9973):", difference_empirical, "\n")

## Difference from the empirical rule (0.9973): 0.0027

# Create a histogram and store the results
hist_data <- hist(dh,
     main = "Histogram of Doubles Hits",
     xlab = "Number of Doubles Hits",
     ylab = "Frequency",
     col = "blue",
     border = "black",
     xlim = c(0, 50),
     ylim = c(0, 28),
     breaks = 5
)

# Add labels to each bar
text(hist_data$mids, hist_data$counts, labels = hist_data$counts, pos = 3, cex = 0.8, col = "black")

The end, Thank you

In Class Activity 5
