# Install the required package
install.packages("readxl")
## Installing package into 'C:/Users/manue/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'readxl' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\manue\AppData\Local\Temp\Rtmp8W9OQP\downloaded_packages

Manuel Madalena

Let us continue getting started with R as we start discussing important statistical concepts in Sports Analytics.

Case-scenario 1

This is the fourth season of outfielder Luis Robert with the Chicago White Sox. If he hit 11, 13, and 12 home runs during the first three seasons, how many does he need this season for his overall average to be at least 20?

Solution

Given that $x_1 = 11$, $x_2 = 13$, $x_3 = 12$, we want to find $x_4$ such that the mean (average) number of home runs satisfies $\bar{x} \geq 20$.

Notice that in this case $n = 4$.

According to the information above: $20 \times 4 = 11 + 13 + 12 + x_4$

so when $x_4 = 44$, the home-run average will be exactly 20.

Home-runs so far

HR_before <- c(11, 13, 12)
HR_before
## [1] 11 13 12

Average Number of Home-runs per season wanted

wanted_HR <- 20
wanted_HR
## [1] 20

Number of seasons

n_seasons <- 4
n_seasons
## [1] 4

Needed Home-runs on season 4

x_4 <- n_seasons*wanted_HR - sum(HR_before)
x_4
## [1] 44

Minimum number of Home-runs needed by Robert

x_4
## [1] 44

According to the calculations above, Robert must hit at least 44 home runs this season to reach an average of at least 20 home runs per season.
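The same rearrangement of the mean formula works for any target average. A minimal sketch, assuming a hypothetical helper name `needed_for_target` that is not part of the original worksheet:

```r
# Hypothetical helper: value needed in the next season so that the
# overall mean reaches `target`.
needed_for_target <- function(previous, target) {
  n_total <- length(previous) + 1    # seasons already played plus the next one
  n_total * target - sum(previous)   # solve target = (sum(previous) + x) / n_total for x
}

needed_for_target(c(11, 13, 12), target = 20)
```

For the home-run data this returns 44, the same value as x_4.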

We can confirm this using the mean() function in R.

Robert's performance

Robert_HRs <- c(11, 13, 12, 44)
Robert_HRs
## [1] 11 13 12 44

Find mean

mean(Robert_HRs)
## [1] 20

Find standard deviation

sd(Robert_HRs)
## [1] 16.02082
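To see what sd() computes, the sample standard deviation can be rebuilt step by step. This is a sketch for illustration; the names `deviations` and `sd_by_hand` are assumptions introduced here.

```r
Robert_HRs <- c(11, 13, 12, 44)

n <- length(Robert_HRs)
deviations <- Robert_HRs - mean(Robert_HRs)       # distance of each season from the mean
sd_by_hand <- sqrt(sum(deviations^2) / (n - 1))   # sample sd divides by n - 1, not n

sd_by_hand
all.equal(sd_by_hand, sd(Robert_HRs))             # TRUE when both agree
```

The single 44-home-run season sits far from the other three values, which is why the standard deviation is so large relative to the mean.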

Find the maximum number of home-runs during the four seasons period

max(Robert_HRs)
## [1] 44

Find the minimum number of home-runs during the four seasons period

min(Robert_HRs)
## [1] 11

We can also use the summary() function to find basic statistics, including the median!

summary(Robert_HRs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   11.75   12.50   20.00   20.75   44.00
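The quartiles reported by summary() can also be requested directly with quantile(). A short sketch, assuming the default quantile type (type 7), which is also what summary() uses by default:

```r
Robert_HRs <- c(11, 13, 12, 44)

# First quartile, median, and third quartile
quartiles <- quantile(Robert_HRs, probs = c(0.25, 0.5, 0.75))
quartiles

# The median is barely affected by the 44-home-run season, unlike the mean
median(Robert_HRs)
mean(Robert_HRs)
```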

Question 1

Now, you must complete the problem below which represents a similar case scenario. You may use the steps that we executed in Case-scenario 1 as a template for your solution.

This is the sixth season of outfielder Juan Soto in the majors. If he received 79, 108, 41, 145, and 135 walks during the first five seasons, how many does he need this season for his overall number of walks per season to be at least 100?

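# Walks in the first five seasons
walks_before <- c(79, 108, 41, 145, 135)

# Target average number of walks per season
wanted_walks <- 100

# Number of seasons, including the current one
n_seasons2 <- 6

# Walks needed in season 6
x_6 <- n_seasons2*wanted_walks - sum(walks_before)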
x_6
## [1] 92

Case-scenario 2

The average salary of 10 baseball players is 72,000 dollars a week and the average salary of 4 soccer players is 84,000. Find the mean salary of all 14 professional players.

Solution

The combined mean is a weighted mean: multiply each group's mean by its group size, add the results, and divide by the total number of players.

Let $n_1 = 10$ denote the number of baseball players and $y_1 = 72000$ their mean salary. Let $n_2 = 4$ denote the number of soccer players and $y_2 = 84000$ their mean salary. Then the mean salary of all 14 individuals is:

$$\bar{y} = \frac{n_1 y_1 + n_2 y_2}{n_1 + n_2}$$

We can compute this in R as follows:

n_1 <- 10
n_2 <- 4
y_1 <- 72000
y_2 <- 84000

Mean salary overall

salary_ave <-  (n_1*y_1 + n_2*y_2)/(n_1+n_2)
salary_ave
## [1] 75428.57
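Base R also provides weighted.mean(), which performs the same calculation. A brief sketch, where the vector names `group_means` and `group_sizes` are assumptions introduced for illustration:

```r
group_means <- c(baseball = 72000, soccer = 84000)   # mean weekly salary per group
group_sizes <- c(baseball = 10, soccer = 4)          # number of players per group

# Each group mean is weighted by how many players it represents
combined <- weighted.mean(group_means, w = group_sizes)
combined

# A plain mean of the two averages ignores group size and gives a different answer
mean(group_means)
```

The plain mean of the two averages is 78,000, which overstates the combined salary because the smaller soccer group would count as much as the larger baseball group.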

Question 2

The average salary of 7 basketball players is 102,000 dollars a week and the average salary of 9 NFL players is 91,000. Find the mean salary of all 16 professional players.

p1 <- 7
p2 <- 9
s1 <- 102000
s2 <- 91000

Mean Salary

salary_ave2 <- (p1*s1 + p2*s2)/(p1+p2)
salary_ave2
## [1] 95812.5

Case-scenario 3

The frequency distribution below lists the number of active players in the Barclays Premier League and the time left in their contract.

Years   Number of players
6       28
5       72
4       201
3       109
2       56
1       34

  1. Find the mean, the median, and the standard deviation.

  2. What percentage of the data lies within one standard deviation of the mean?

  3. What percentage of the data lies within two standard deviations of the mean?

  4. What percent of the data lies within three standard deviations of the mean?

  5. Draw a histogram to illustrate the data.

Solution

The allcontracts.csv file contains the contract length of every player. We can read this file in R using the read.csv() function.

install.packages("writexl")
## Installing package into 'C:/Users/manue/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'writexl' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\manue\AppData\Local\Temp\Rtmp8W9OQP\downloaded_packages
getwd()
## [1] "C:/Users/manue/OneDrive/Desktop"
contract_length <- read.csv("allcontracts.csv", header = TRUE, sep = ",")
contract_years<-contract_length$years

R reads the allcontracts.csv file from the working directory and makes its contents available for analysis in RStudio.
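After reading a CSV file, it helps to check that the expected column arrived intact before computing statistics. The sketch below is self-contained: it writes a small temporary file as a hypothetical stand-in for allcontracts.csv, and the names `tmp_file`, `demo`, and `demo_years` are assumptions used only here.

```r
# Hypothetical stand-in for allcontracts.csv: one column named `years`
tmp_file <- tempfile(fileext = ".csv")
write.csv(data.frame(years = c(6, 5, 3, 6, 5, 1)), tmp_file, row.names = FALSE)

demo <- read.csv(tmp_file, header = TRUE)

str(demo)            # column names and types
anyNA(demo$years)    # TRUE would signal missing values to handle first

demo_years <- demo$years
length(demo_years)   # number of observations read
```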

contract_years
##   [1] 6 5 3 6 5 1 5 5 4 1 6 4 4 1 6 1 5 3 2 1 3 2 5 5 1 4 3 5 5 4 6 6 3 1 4 4 4
##  [38] 4 5 2 1 1 4 3 6 2 3 3 3 6 4 4 5 3 4 1 6 5 4 2 2 2 4 3 5 3 2 6 1 5 1 3 5 5
##  [75] 3 5 5 4 6 2 3 2 5 6 1 1 5 2 1 4 6 3 1 4 1 2 3 2 5 5 4 1 6 6 2 6 6 5 2 1 2
## [112] 5 4 3 4 6 1 6 3 3 3 1 2 4 1 3 1 6 2 2 1 1 2 4 3 2 2 1 4 1 1 6 2 3 4 1 4 5
## [149] 4 6 5 3 3 5 4 3 3 4 4 1 2 2 1 1 4 4 2 3 1 4 4 5 5 1 1 4 1 1 2 5 6 6 1 3 3
## [186] 4 3 1 5 2 3 6 3 2 3 1 3 4 6 5 1 5 6 2 5 6 1 6 5 2 1 6 5 6 2 3 5 2 4 5 5 3
## [223] 6 1 2 2 6 3 4 2 6 2 4 3 4 2 1 5 6 6 6 5 4 5 4 6 6 2 3 3 1 3 5 2 2 6 5 2 6
## [260] 5 3 1 4 5 6 3 5 2 6 4 6 3 3 6 2 5 3 4 6 6 3 5 4 5 3 6 3 6 2 3 2 5 4 5 3 6
## [297] 3 3 4 6 4 3 3 1 1 2 4 6 6 6 1 1 5 4 5 6 2 3 1 1 4 6 2 1 3 4 2 2 1 3 5 1 3
## [334] 2 1 3 1 4 5 6 4 6 6 4 1 5 1 2 4 6 2 3 4 1 2 3 1 2 1 5 3 2 5 1 6 1 5 4 2 4
## [371] 4 2 1 5 1 4 2 4 2 6 1 4 2 2 4 3 4 6 5 4 6 4 5 6 6 4 2 6 6 4 4 1 5 5 6 2 2
## [408] 5 6 5 3 4 1 1 1 3 3 5 6 4 2 5 4 2 3 1 4 2 1 2 1 1 2 6 4 2 4 5 4 3 3 1 3 4
## [445] 4 5 2 2 6 3 2 6 4 5 2 2 3 3 3 4 1 1 6 4 3 1 6 5 2 3 2 5 3 1 4 1 6 3 1 5 3
## [482] 2 4 1 3 3 1 5 5 4 4 2 1 2 4 5 4 6 6

To find the mean and the standard deviation

Mean

contracts_mean<-mean(contract_years)
contracts_mean<-round(contracts_mean, digits = 1)
contracts_mean
## [1] 3.5

Median

contracts_median <- median(contract_years)
contracts_median
## [1] 3

Find number of observations

contracts_n <- length(contract_years)
contracts_n
## [1] 499

Find standard deviation

contracts_sd <- sd(contract_years)
contracts_sd
## [1] 1.69686

What percentage of the data lies within one standard deviation of the mean?

# abs() keeps only observations less than one sd from the mean on either side
contracts_w1sd <- sum(abs(contract_years - contracts_mean)/contracts_sd < 1)/ contracts_n
contracts_w1sd
## [1] 0.6693387

This is the proportion of observations within one standard deviation of the mean.

Difference from the empirical rule (68%)

contracts_w1sd - 0.68
## [1] -0.01066132

What percentage of the data lies within two standard deviations of the mean?

Within 2 sd

contracts_w2sd <- sum(abs(contract_years - contracts_mean)/ contracts_sd < 2)/contracts_n
contracts_w2sd
## [1] 1

Difference from the empirical rule (95%)

contracts_w2sd - 0.95
## [1] 0.05

What percentage of the data lies within three standard deviations of the mean?

Within 3 sd

contracts_w3sd <- sum(abs(contract_years - contracts_mean)/ contracts_sd < 3)/contracts_n
contracts_w3sd
## [1] 1

Difference from the empirical rule (99.73%)

contracts_w3sd - 0.9973
## [1] 0.0027
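The three checks share one pattern, so they can be wrapped in a small function. This is a sketch under stated assumptions: `prop_within` is a hypothetical helper name, the data vector `x` is made up for illustration, and the function uses the unrounded mean.

```r
# Hypothetical helper: proportion of observations strictly less than
# k standard deviations away from the mean, on either side
prop_within <- function(x, k) {
  z <- (x - mean(x)) / sd(x)   # standardized distance from the mean
  mean(abs(z) < k)             # abs() counts both tails; mean() of TRUE/FALSE is a proportion
}

x <- c(1, 2, 3, 3, 4, 4, 4, 5, 6, 6)   # made-up example data
within_k <- sapply(1:3, function(k) prop_within(x, k))
within_k

# Differences from the empirical rule values 68%, 95%, and 99.73%
within_k - c(0.68, 0.95, 0.9973)
```

Taking the absolute value matters: without it, every observation far below the mean would also be counted as "within" the interval.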

Draw a histogram

Create histogram

hist(contract_years,xlab = "Years Left in Contract",col = "green",border = "red", xlim = c(0,8), ylim = c(0,225),
   breaks = 6)
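Because contract length takes only whole-number values, one option is to place the bin edges at half-integers so that each bar covers exactly one value. A sketch with made-up data; `plot = FALSE` is used only to inspect the counts without drawing.

```r
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 6)   # made-up whole-number data

# Bin edges at 0.5, 1.5, ..., 6.5 give one bin per integer value
h <- hist(x, breaks = seq(0.5, 6.5, by = 1), plot = FALSE)
h$counts

# The bin counts match a simple frequency table of the values
table(x)
```

With a single number such as `breaks = 6`, R treats the value as a suggestion and may choose different edges, so explicit edges give more predictable bars for discrete data.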

boxplot(contract_years,main = "Years Left in Contract",ylab = "Years")

boxplot(contract_years,main = "Years Left in Contract",ylab = "Years",col = "lightblue",border = "darkblue",horizontal = FALSE)
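The numbers behind a boxplot can be inspected with boxplot.stats() from the grDevices package. A small sketch with made-up data, assuming the default whisker coefficient of 1.5:

```r
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 6)   # made-up example data

bp <- grDevices::boxplot.stats(x)
bp$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out     # observations flagged as outliers (empty if none)
```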

Question 3

Use the skills learned in Case-scenario 3 on one of the following data sets. You may choose only one data set. Both are available in Canvas.

doubles_hit.csv and triples_hit.csv

doubles_hit.csv

Solution

The doubles_hit.csv file contains the number of doubles hit by each player. We can read this file in R using the read.csv() function.

doubles<-read.csv("doubles_hit.csv", header = TRUE, sep = ",")
doubles_hit<-doubles$doubles_hit

R reads the doubles_hit.csv file from the working directory and makes its contents available for analysis in RStudio.

doubles_hit
##   [1] 37  4  6  7  9 25 18 11  8 13 15  1 30 30  6 23 14 26 33 23 34 32  9  4 23
##  [26] 34 19 29 15 27 18 35  7  7 19  4 38  2 16 15 26 15  3 19 24 33 34 33 38 29
##  [51] 19 18  7  7 30 15 31 12 17 21 11  9 35  1 27 27 27 10 35 34 13  5 40 40 11
##  [76] 40 29 23 37 22 29 15 24 25 40 14  1 47 49 45 42 40 40 46 41 42 43 45 48 46

To find the mean and the standard deviation

Mean

doubles_hit_mean<-mean(doubles_hit)
doubles_hit_mean
## [1] 23.55

Median

doubles_hit_median <- median(doubles_hit)
doubles_hit_median
## [1] 23.5

Find number of observations

doubles_hit_n <- length(doubles_hit)
doubles_hit_n
## [1] 100

Find standard deviation

doubles_hit_sd <- sd(doubles_hit)
doubles_hit_sd
## [1] 13.37371

What percentage of the data lies within one standard deviation of the mean?

# abs() keeps only observations less than one sd from the mean on either side
doubles_hit_w1sd <- sum(abs(doubles_hit - doubles_hit_mean)/doubles_hit_sd < 1)/ doubles_hit_n
doubles_hit_w1sd
## [1] 0.58

This is the proportion of observations within one standard deviation of the mean.

Difference from the empirical rule (68%)

doubles_hit_w1sd - 0.68
## [1] -0.1

What percentage of the data lies within two standard deviations of the mean?

Within 2 sd

doubles_hit_w2sd <- sum(abs(doubles_hit - doubles_hit_mean)/ doubles_hit_sd < 2)/doubles_hit_n
doubles_hit_w2sd
## [1] 1

Difference from the empirical rule (95%)

doubles_hit_w2sd - 0.95
## [1] 0.05

What percentage of the data lies within three standard deviations of the mean?

Within 3 sd

doubles_hit_w3sd <- sum(abs(doubles_hit - doubles_hit_mean)/ doubles_hit_sd < 3)/doubles_hit_n
doubles_hit_w3sd
## [1] 1

Difference from the empirical rule (99.73%)

doubles_hit_w3sd - 0.9973
## [1] 0.0027

Draw a histogram

Create histogram

hist(doubles_hit, xlab="Number of Doubles", col="blue", border="lightblue", xlim = c(0,60), ylim = c(0,20), breaks=7)

boxplot(doubles_hit,main="Boxplot of Doubles Hit by Player", ylab="Doubles", col = "lightblue", border = "black")