Solution

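Given that $x_1 = 11$, $x_2 = 13$, $x_3 = 12$,

we want to find $x_4$ such that the mean (average) number of home-runs satisfies $\bar{x} \geq 20$.

Notice that in this case $n = 4$.

Setting the mean equal to 20 gives $20 \times 4 = 11 + 13 + 12 + x_4$, that is, $80 = 36 + x_4$,

so when $x_4 = 44$ the home-runs average will be exactly 20, and any larger value pushes it above 20.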

# Home-runs so far
HR_before <- c(11, 13, 12)
# Average Number of Home-runs per season wanted
wanted_HR <- 20
# Number of seasons
n_seasons <- 4
# Needed Home-runs on season 4
x_4 <- n_seasons*wanted_HR - sum(HR_before)
# Minimum number of Home-runs needed by Robert
x_4

## [1] 44

According to the calculations above, Robert must hit at least 44 home-runs this season for his average number of home-runs per season to be at least 20.

We can confirm this with the mean() function in R:

# Robert's performance
Robert_HRs <- c(11, 13, 12,44)
# Find mean
mean(Robert_HRs)

## [1] 20

# Find standard deviation
sd(Robert_HRs)

## [1] 16.02082

# Find the maximum number of home-runs during the four seasons period
max(Robert_HRs)

## [1] 44

# Find the minimum number of home-runs during the four seasons period
min(Robert_HRs)

## [1] 11

We can also use the summary() function to find basic statistics, including the median!

summary(Robert_HRs)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   11.75   12.50   20.00   20.75   44.00
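The calculation in this case scenario can be wrapped in a small helper for reuse. This is a minimal sketch: the function name needed_for_mean and its arguments are hypothetical (they are not part of the original notebook), and ceiling() is added on the assumption that counts such as home-runs must be whole numbers.

```r
# Hypothetical helper: smallest whole-number value for the upcoming season
# that brings the overall mean up to at least `target`.
needed_for_mean <- function(previous, target) {
  n_total <- length(previous) + 1   # seasons including the upcoming one
  ceiling(n_total * target - sum(previous))
}

# Robert's first three seasons, with a target average of 20
robert_needed <- needed_for_mean(c(11, 13, 12), 20)
robert_needed
```

For Robert's data this reproduces the value of 44 found above.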

Question 1

Now, you must complete the problem below which represents a similar case scenario. You may use the steps that we executed in Case-scenario 1 as a template for your solution.

This is the sixth season of outfielder Juan Soto in the majors. If during the first five seasons he received 79, 108, 41, 145, and 135 walks, how many does he need this season for his overall number of walks per season to be at least 100?

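sum(c(79, 108, 41, 145, 135))

## [1] 508

# Walks in the first five seasons
x <- sum(c(79, 108, 41, 145, 135))
# Walks still needed to reach 6 seasons * 100 walks = 600 in total
y <- 600 - x
y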

## [1] 92

# Average Number of walks per season wanted
wanted_walks <- 100
# Number of seasons
n_seasons <- 6
# Needed walks in season 6
x_6 <- n_seasons*wanted_walks - x
# Minimum number of walks needed by Juan Soto
x_6

## [1] 92

In his sixth season in the majors, Juan Soto needs at least 92 walks for his overall number of walks per season to be at least 100.
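As in Case-scenario 1, the answer can be checked with mean(). The sketch below uses hypothetical variable names (soto_walks, soto_mean, one_fewer) and also shows that one walk fewer would leave the average just under 100.

```r
# Soto's walks in his first five seasons plus the 92 computed for season six
soto_walks <- c(79, 108, 41, 145, 135, 92)
soto_mean <- mean(soto_walks)
soto_mean

# With 91 walks instead of 92, the average falls below the target
one_fewer <- mean(c(79, 108, 41, 145, 135, 91))
one_fewer < 100
```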

Case-scenario 2

The average salary of 10 baseball players is 72,000 dollars a week and the average salary of 4 soccer players is 84,000. Find the mean salary of all 14 professional players.

Solution

We can find the combined mean by weighting each group's mean by its group size: multiply each mean by the number of players in its group, add the results, and divide by the total number of players.

Let $n_1 = 10$ denote the number of baseball players and $y_1 = 72000$ their mean salary. Let $n_2 = 4$ denote the number of soccer players and $y_2 = 84000$ their mean salary. Then the mean salary of all 14 individuals is $(n_1 y_1 + n_2 y_2)/(n_1 + n_2)$.

We can compute this in R as follows:

n_1 <- 10
n_2 <- 4
y_1 <- 72000
y_2 <- 84000
# Mean salary overall
salary_ave <-  (n_1*y_1 + n_2*y_2)/(n_1+n_2)
salary_ave

## [1] 75428.57
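Base R also provides weighted.mean(), which performs the same computation. The sketch below uses hypothetical variable names and contrasts the result with the unweighted average of the two group means, which ignores the group sizes.

```r
group_means <- c(72000, 84000)   # baseball, soccer
group_sizes <- c(10, 4)

# Combined mean: each group mean weighted by its number of players
combined <- weighted.mean(group_means, w = group_sizes)
combined

# Unweighted average of the two means, shown only for comparison
naive <- mean(group_means)
naive
```

The weighted result agrees with salary_ave above, while the unweighted average comes out higher because it gives the 4 soccer players the same weight as the 10 baseball players.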

Question 2

The average salary of 7 basketball players is 102,000 dollars a week and the average salary of 9 NFL players is 91,000. Find the mean salary of all 16 professional players.

n_1 <- 7
n_2 <- 9
y_1 <- 102000
y_2 <- 91000
# Mean salary overall
salary_ave <-  (n_1*y_1 + n_2*y_2)/(n_1+n_2)
salary_ave

## [1] 95812.5

Case-scenario 3

The frequency distribution below lists the number of active players in the Barclays Premier League and the time left in their contract.

Years   Number of players
6       28
5       72
4       201
3       109
2       56
1       34

1. Find the mean, the median, and the standard deviation.
2. What percentage of the data lies within one standard deviation of the mean?
3. What percentage of the data lies within two standard deviations of the mean?
4. What percentage of the data lies within three standard deviations of the mean?
5. Draw a histogram to illustrate the data.

Solution

The allcontracts.csv file contains the contract lengths of all the players. We can read this file in R using the read.table() function with sep = ",", which behaves like read.csv() for a plain comma-separated file.

contract_length <- read.table("allcontracts.csv", header = TRUE, sep = ",")
contract_years <- contract_length$years

In the code above, we loaded the allcontracts.csv data set into the contract_length data frame and pulled its “years” column into the contract_years vector. We will use this vector for our calculations and for exploratory data analysis (EDA).
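If the CSV file is not at hand, a frequency table like the one in the problem statement can be expanded into a vector of individual observations with rep(). This is only a sketch with hypothetical variable names, and statistics computed this way need not match the CSV-based output shown in this solution, since the file contents are not reproduced here.

```r
# Frequency table from the problem statement
years_left <- 6:1
n_players  <- c(28, 72, 201, 109, 56, 34)

# One entry per player: each value of years_left repeated n_players times
contract_years_tbl <- rep(years_left, times = n_players)

length(contract_years_tbl)   # total number of players in the table
mean(contract_years_tbl)
median(contract_years_tbl)
sd(contract_years_tbl)
```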

To find the mean, the median, and the standard deviation

# Mean 
contracts_mean  <- mean(contract_years)
contracts_mean

## [1] 3.458918

# Median
contracts_median <- median(contract_years)
contracts_median

## [1] 3

# Find number of observations
contracts_n <- length(contract_years)
# Find standard deviation
contracts_sd <- sd(contract_years)
contracts_sd

## [1] 1.69686

The mean is 3.46, the median is 3, and the standard deviation is 1.70.

What percentage of the data lies within one standard deviation of the mean?

contracts_w1sd <- sum((contract_years - contracts_mean)/contracts_sd < 1)/ contracts_n
# Percentage of observation within one standard deviation of the mean
contracts_w1sd

## [1] 0.8416834

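About 84% of the data lies below one standard deviation above the mean. Because the comparison above is one-sided (it does not take the absolute value of the standardized distance), observations far below the mean are also counted, so 84% is an upper bound on the share of data within one standard deviation of the mean.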

## Difference from empirical 
contracts_w1sd - 0.68

## [1] 0.1616834
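A two-sided check wraps the standardized distance in abs(). The sketch below is an illustration under stated assumptions: share_within is a hypothetical helper name, and demo is made-up data (the integers 1 to 10), not the contract data.

```r
# Hypothetical helper: proportion of x strictly within k standard deviations
# of the mean, counting deviations on both sides.
share_within <- function(x, k) {
  z <- (x - mean(x)) / sd(x)
  mean(abs(z) < k)
}

demo <- 1:10   # made-up data for illustration

# One-sided comparison: counts every value below mean + 1 sd
one_sided <- mean((demo - mean(demo)) / sd(demo) < 1)
# Two-sided comparison: counts only values between mean - 1 sd and mean + 1 sd
two_sided <- share_within(demo, 1)

c(one_sided = one_sided, two_sided = two_sided)
```

On this demo vector the one-sided proportion is 0.8 and the two-sided proportion is 0.6, which illustrates how the one-sided version can overstate the share within one standard deviation.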

What percentage of the data lies within two standard deviations of the mean?

## Within 2 sd
contracts_w2sd <- sum((contract_years - contracts_mean)/ contracts_sd < 2)/contracts_n
contracts_w2sd

## [1] 1

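The comparison above is one-sided, but the lower bound, 3.46 - 2 × 1.70 ≈ 0.07, lies below 1, so 100% of the data lies within two standard deviations of the mean provided every contract has at least one year left.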

## Difference from empirical 
contracts_w2sd - 0.95

## [1] 0.05

What percent of the data lies within three standard deviations of the mean?

## Within 3 sd 
contracts_w3sd <- sum((contract_years - contracts_mean)/ contracts_sd < 3)/contracts_n
contracts_w3sd

## [1] 1

100% of the data lies within three standard deviations of the mean. The lower bound, 3.46 - 3 × 1.70 ≈ -1.63, is negative, so no contract length can fall below it.

## Difference from empirical 
contracts_w3sd - 0.9973

## [1] 0.0027
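The reference values 0.68, 0.95, and 0.9973 used in the comparisons above are the shares of a normal distribution within one, two, and three standard deviations of its mean. A short sketch with pnorm() (variable names are hypothetical) recovers them:

```r
k <- 1:3
# Probability that a standard normal variable falls between -k and k
normal_share <- pnorm(k) - pnorm(-k)
round(normal_share, 4)
```

Rounded to four decimals these are 0.6827, 0.9545, and 0.9973, so the 0.68 and 0.95 used above are themselves rounded figures.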

Draw a histogram

# Create histogram
hist(contract_years,xlab = "Years Left in Contract",col = "green",border = "red", xlim = c(0,8), ylim = c(0,225),
   breaks = 5)

We can see that the distribution is asymmetric, with the largest group of contracts in the 1-2 year bin on the left. The mean (3.46) exceeding the median (3) is consistent with a tail toward longer contracts.
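In hist(), a single number passed to breaks is only a suggestion, so the bins may not line up with whole years. For integer-valued data, one option is to pass explicit break points centred on each integer. The sketch below assumes demo data built from the frequency table in the problem statement (not the CSV) and uses hypothetical variable names.

```r
# Demo data expanded from the problem's frequency table (years 1 to 6)
demo_years <- rep(1:6, times = c(34, 56, 109, 201, 72, 28))

# Break points at 0.5, 1.5, ..., 6.5 give one bar per whole year
h <- hist(demo_years,
          breaks = seq(0.5, 6.5, by = 1),
          xlab = "Years Left in Contract",
          main = "Contract length (demo data)")

# Bar heights, one per year from 1 to 6
h$counts
```

With these breaks each bar height equals the table frequency for that year, which makes the plot easier to compare against the table.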

Question 3

Use the skills learned in Case-scenario 3 on the following data set: doubles_hit.csv

Solution

The doubles_hit.csv file contains the number of doubles hit by each player. We can read this file in R using read.table() with sep = ",", as we did above with the allcontracts data set.

doubles <- read.table("doubles_hit.csv", header = TRUE, sep = ",")
doubles_hit <- doubles$doubles_hit

We pulled the doubles_hit column from the doubles data frame we loaded.
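For a plain comma-separated file, read.table() with header = TRUE and sep = "," and read.csv() with its defaults should give the same data frame. The sketch below checks this on a temporary file; the four values written to it are made up for illustration and are not taken from doubles_hit.csv.

```r
# Write a small made-up CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(doubles_hit = c(12, 30, 7, 25)), tmp, row.names = FALSE)

# Read it back both ways
via_table <- read.table(tmp, header = TRUE, sep = ",")
via_csv   <- read.csv(tmp)

# Compare the two data frames
all.equal(via_table, via_csv)

unlink(tmp)   # remove the temporary file
```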

To find the mean, median, and the standard deviation

# Mean 
doubles_mean  <- mean(doubles_hit)
doubles_mean

## [1] 23.55

# Median
doubles_median <- median(doubles_hit)
doubles_median

## [1] 23.5

# Find number of observations
doubles_n <- length(doubles_hit)
# Find standard deviation
doubles_sd <- sd(doubles_hit)
doubles_sd

## [1] 13.37371

The mean (23.55) is nearly the same as the median (23.5), and the standard deviation is 13.4.

What percentage of the data lies within one standard deviation of the mean?

doubles_w1sd <- sum((doubles_hit - doubles_mean)/doubles_sd < 1)/ doubles_n
# Percentage of observation within one standard deviation of the mean
doubles_w1sd

## [1] 0.79

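79% of the data lies below one standard deviation above the mean. Since the comparison above is one-sided, the share of data within one standard deviation of the mean can be no larger than this.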

## Difference from empirical 
doubles_w1sd - 0.68

## [1] 0.11

What percentage of the data lies within two standard deviations of the mean?

## Within 2 sd
doubles_w2sd <- sum((doubles_hit - doubles_mean)/doubles_sd < 2)/ doubles_n
# Percentage of observation within two standard deviation of the mean
doubles_w2sd

## [1] 1

100% of the data lies within two standard deviations of the mean.

## Difference from empirical 
doubles_w2sd - 0.95

## [1] 0.05

What percent of the data lies within three standard deviations of the mean?

## Within 3 sd 
doubles_w3sd <- sum((doubles_hit - doubles_mean)/ doubles_sd < 3)/doubles_n
doubles_w3sd

## [1] 1

100% of the data lies within three standard deviations of the mean.

## Difference from empirical 
doubles_w3sd - 0.9973

## [1] 0.0027

summary(doubles_hit)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   12.75   23.50   23.55   34.00   49.00
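The summary output also gives a quick way to check the two-standard-deviation result without the raw data: if both the minimum and the maximum fall inside the interval from mean - 2 sd to mean + 2 sd, every observation does. The sketch below plugs in the rounded values printed above; the variable names are hypothetical.

```r
# Values copied from the printed output above
doubles_mean_val <- 23.55
doubles_sd_val   <- 13.37371
doubles_min      <- 1
doubles_max      <- 49

# Interval covering two standard deviations on each side of the mean
lower_2sd <- doubles_mean_val - 2 * doubles_sd_val
upper_2sd <- doubles_mean_val + 2 * doubles_sd_val
c(lower_2sd, upper_2sd)

# TRUE when the whole range of the data sits inside the interval
all_within_2sd <- doubles_min > lower_2sd && doubles_max < upper_2sd
all_within_2sd
```

Here the interval runs from about -3.2 to about 50.3, which contains both 1 and 49, so the 100% figure for two (and therefore three) standard deviations holds on both sides of the mean.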

Draw a histogram

# Create histogram
hist(doubles_hit,xlab = "Doubles Hit",col = "blue",border = "orange", xlim = c(0,55), ylim = c(0,40),
   breaks = 5)

We can see that our data set isn’t perfectly Gaussian or symmetric: the bar heights fall off toward the last quartile, which holds the largest doubles-hit values.

R Notebook 2 - Sports Analytics

Getting Started with R, Part 2
