In-class activity #5: Introduction to R Part 2

##Case-scenario 1

This is the fourth season of outfielder Luis Robert with the Chicago White Sox. If he hit 11, 13, and 12 home runs during his first three seasons, how many does he need this season for his overall average to be at least 20?

##Solution

Given that $x_1 = 11$, $x_2 = 13$, $x_3 = 12$, we want to find $x_4$ such that the mean (average) number of home runs satisfies $\bar{x} \geq 20$.

Notice that in this case $n = 4$.

According to the information above, $20 \times 4 = 11 + 13 + 12 + x_4$, so when $x_4 = 44$ the home-run average will be 20.

# Home-runs so far
HR_before <- c(11, 13, 12)
# Average Number of Home-runs per season wanted
wanted_HR <- 20
# Number of seasons
n_seasons <- 4
# Needed Home-runs on season 4
x_4 <- n_seasons*wanted_HR - sum(HR_before)
# Minimum number of Home-runs needed by Robert
x_4

## [1] 44

According to the calculations above, Robert must hit at least 44 home runs this season to reach an average of at least 20 home runs per season.

We can confirm this with the mean() function in R:

# Robert's performance
Robert_HRs <- c(11, 13, 12,44)
# Find mean
mean(Robert_HRs)

## [1] 20

# Find standard deviation
sd(Robert_HRs)

## [1] 16.02082

# Find the maximum number of home runs during the four-season period
max(Robert_HRs)

## [1] 44

# Find the minimum number of home runs during the four-season period
min(Robert_HRs)

## [1] 11

We can also use the summary() function to find basic statistics, including the median!

summary(Robert_HRs)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   11.75   12.50   20.00   20.75   44.00
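
The quartiles in that summary can also be requested directly. The sketch below assumes only base R's quantile() with its default interpolation (type 7), which is why the first quartile, 11.75, is not one of the observed values.

```r
# Robert's four seasons, as in the example above
Robert_HRs <- c(11, 13, 12, 44)

# Quartiles using quantile()'s default interpolation (type 7)
quartiles <- quantile(Robert_HRs, probs = c(0.25, 0.50, 0.75))
quartiles

# The median is the middle quartile
median(Robert_HRs)
```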

##Question 1

Now, you must complete the problem below, which represents a similar case scenario. You may use the steps that we executed in Case-scenario 1 as a template for your solution.

This is the sixth season of outfielder Juan Soto in the majors. If he received 79, 108, 41, 145, and 135 walks during his first five seasons, how many does he need this season for his overall number of walks per season to be at least 100?

# Walks so far
walks_before <- c(79, 108, 41, 145, 135)
# Average number of walks per season wanted
wanted_walks <- 100
# Number of seasons
n_seasons <- 6
# Needed walks in season 6
x_6 <- n_seasons*wanted_walks - sum(walks_before)
# Minimum number of walks needed by Juan Soto
x_6

## [1] 92

It means that Juan Soto needs at least 92 walks in his 6th season to have an overall average of at least 100 walks per season.
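
Both problems follow the same pattern, so the calculation could be wrapped in a small helper. This is only a minimal sketch: the function name `needed_final_season()` and its arguments are made up for illustration and are not part of the activity.

```r
# Hypothetical helper: total needed in the final season so that the
# average over all seasons (previous ones plus one more) reaches `target`
needed_final_season <- function(previous, target) {
  n_seasons <- length(previous) + 1
  n_seasons * target - sum(previous)
}

# Case-scenario 1: Luis Robert's home runs
needed_final_season(c(11, 13, 12), target = 20)

# Question 1: Juan Soto's walks
needed_final_season(c(79, 108, 41, 145, 135), target = 100)
```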

##Case-scenario 2

The average salary of 10 baseball players is 72,000 dollars a week and the average salary of 4 soccer players is 84,000. Find the mean salary of all 14 professional players.

##Solution

We can find the combined mean by weighting each group's mean by its group size, adding the results, and dividing by the total number of people.

Let $n_1 = 10$ denote the number of baseball players and $y_1 = 72000$ their mean salary. Let $n_2 = 4$ denote the number of soccer players and $y_2 = 84000$ their mean salary. Then the mean salary of all 14 individuals is:

$$\bar{y} = \frac{n_1 y_1 + n_2 y_2}{n_1 + n_2}$$

We can compute this in R as follows:

n_1 <- 10
n_2 <- 4
y_1 <- 72000
y_2 <- 84000
# Mean salary overall
salary_ave <-  (n_1*y_1 + n_2*y_2)/(n_1+n_2)
salary_ave

## [1] 75428.57

The mean salary of all 14 professional players (baseball and soccer players) is approx. $75,428.57.
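
The same combined mean can be obtained with base R's weighted.mean(), using the group sizes as weights. This sketch is an alternative to the formula above, not the method used in the activity; the vector names are made up for illustration.

```r
# Group means and group sizes from the case scenario
group_means <- c(baseball = 72000, soccer = 84000)
group_sizes <- c(baseball = 10, soccer = 4)

# weighted.mean() computes sum(w * x) / sum(w)
salary_combined <- weighted.mean(group_means, w = group_sizes)
salary_combined

# A plain average of the two means ignores group size and gives a different value
mean(group_means)
```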

##Question 2

The average salary of 7 basketball players is 102,000 dollars a week and the average salary of 9 NFL players is 91,000. Find the mean salary of all 16 professional players.

w_1 <- 7
w_2 <- 9
z_1 <- 102000
z_2 <- 91000
# Mean salary overall
salary_aver <-  (w_1*z_1 + w_2*z_2)/(w_1+w_2)
salary_aver

## [1] 95812.5

The mean salary of all 16 professional players (basketball and NFL players) is $95,812.50.

##Case-scenario 3

The frequency distribution below lists the number of active players in the Barclays Premier League and the time left in their contract.

| Years | Number of players |
|-------|-------------------|
| 6     | 28                |
| 5     | 72                |
| 4     | 201               |
| 3     | 109               |
| 2     | 56                |
| 1     | 34                |

Find the mean, the median, and the standard deviation.
What percentage of the data lies within one standard deviation of the mean?
What percentage of the data lies within two standard deviations of the mean?
What percent of the data lies within three standard deviations of the mean?
Draw a histogram to illustrate the data.
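
A frequency table like this one can be expanded into one value per player with rep(), after which the usual functions apply. The sketch below uses only the counts printed in the table; the name `contract_years_tbl` is made up for illustration, and its results need not match those computed from a separate data file.

```r
# Frequency table from the case scenario
years   <- c(6, 5, 4, 3, 2, 1)
players <- c(28, 72, 201, 109, 56, 34)

# Expand to one value per player
contract_years_tbl <- rep(years, times = players)

length(contract_years_tbl)   # total number of players in the table
mean(contract_years_tbl)
median(contract_years_tbl)
sd(contract_years_tbl)
```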

##Solution

The allcontracts.csv file contains the contract lengths of all the players. We can read this file in R using the read_csv() function from the readr package.

library(readr)
allcontracts <- read_csv("~/Miami Dade College/2024/Summer 2024/CAP4936 Special Topics in Data Analytics/Assigm 5/allcontracts.csv")

## New names:
## Rows: 499 Columns: 4
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," dbl
## (2): years, ...4 lgl (2): ...2, ...3
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`

View(allcontracts)

contract_years <- allcontracts$years

Make comments about the code we just ran above.
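
The same read-then-extract pattern can be tried without the course file. The sketch below uses base R's read.csv() on a small temporary file whose contents are made up for illustration; `toy_contracts` and `toy_years` are hypothetical names.

```r
# Write a tiny made-up CSV to a temporary file (illustration only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("years", "3", "5", "1", "4"), tmp)

# Read it back; each column of the file becomes a column of the data frame
toy_contracts <- read.csv(tmp)

# Pull one column out as a plain vector with `$`
toy_years <- toy_contracts$years
toy_years
```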

To find the mean and the standard deviation

# Mean 
contracts_mean  <- mean(contract_years)
contracts_mean

## [1] 3.458918

The mean time left on the players' current contracts is 3.46 years, or about 3 years and 5.5 months.

# Median
contracts_median <- median(contract_years)
contracts_median

## [1] 3

This means that at least half of the players in the Barclays Premier League have three years or less left on their contracts, and at least half have three years or more.

# Find number of observations
contracts_n <- length(contract_years)
contracts_n

## [1] 499

# Find standard deviation
contracts_sd <- sd(contract_years)
contracts_sd

## [1] 1.69686

The dataset records contract lengths for 499 players. The standard deviation of 1.70 years describes how far contract lengths typically fall from the mean.
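
To see what sd() computes, the sample standard deviation can be worked out by hand. This sketch uses the four home-run totals from Case-scenario 1, since they are small enough to check manually; `manual_sd` is a name made up for illustration.

```r
# Four home-run totals from Case-scenario 1
x <- c(11, 13, 12, 44)

n <- length(x)
deviations <- x - mean(x)

# Sample standard deviation: square root of the sum of squared
# deviations divided by (n - 1)
manual_sd <- sqrt(sum(deviations^2) / (n - 1))
manual_sd

# Matches the built-in function
all.equal(manual_sd, sd(x))
```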

What percentage of the data lies within one standard deviation of the mean?

contracts_w1sd <- sum((contract_years - contracts_mean)/contracts_sd < 1)/ contracts_n
# Proportion of observations less than one standard deviation above the mean
# (one-sided; a two-sided check would compare abs(...) < 1)
contracts_w1sd

## [1] 0.8416834

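Approx. 84.17% of players have contract lengths less than one standard deviation above the mean. Because the comparison in the code is one-sided, this figure also counts players far below the mean, so it can overstate the share within one standard deviation on both sides.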

## Difference from empirical 
contracts_w1sd - 0.68

## [1] 0.1616834

The proportion is about 0.16 higher than the 0.68 that the empirical rule predicts for normally distributed data.
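
A two-sided check takes the absolute value of the standardized distance, so that values far below the mean are excluded as well. The sketch below is one way to write it; `prop_within()` is a hypothetical helper and the data are made up for illustration.

```r
# Hypothetical helper: proportion of values within k standard deviations
# of the mean, on either side
prop_within <- function(x, k) {
  z <- (x - mean(x)) / sd(x)   # standardized distances from the mean
  mean(abs(z) < k)             # share of values with |z| < k
}

# Made-up data for illustration
toy <- c(2, 4, 4, 4, 5, 5, 7, 9)

prop_within(toy, 1)   # two-sided check

# One-sided version for comparison: also counts values far below the mean
mean((toy - mean(toy)) / sd(toy) < 1)
```

On this toy vector the one-sided version gives the larger proportion, because the value 2 lies more than one standard deviation below the mean but still passes the one-sided test.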

What percentage of the data lies within two standard deviations of the mean?

## Within 2 sd
contracts_w2sd <- sum((contract_years - contracts_mean)/ contracts_sd < 2)/contracts_n
contracts_w2sd

## [1] 1

Every contract length in the dataset satisfies this condition, lying less than two standard deviations above the mean.

## Difference from empirical 
contracts_w2sd - 0.95

## [1] 0.05

The difference of 0.05 shows that the observed proportion exceeds the 0.95 predicted by the empirical rule.

What percent of the data lies within three standard deviations of the mean?

## Within 3 sd 
contracts_w3sd <- sum((contract_years - contracts_mean)/ contracts_sd < 3)/contracts_n
contracts_w3sd

## [1] 1

This means all the players are within three standard deviations of the mean.

## Difference from empirical 
contracts_w3sd - 0.9973

## [1] 0.0027

The difference of 0.0027 shows that the observed proportion is slightly higher than the 0.9973 predicted by the empirical rule.
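
The reference values 0.68, 0.95, and 0.9973 come from the normal distribution. As a quick check, they can be reproduced with base R's pnorm(); the name `empirical_rule` is made up for illustration.

```r
# Probability that a normal variable falls within k standard deviations
# of its mean, for k = 1, 2, 3
k <- 1:3
empirical_rule <- pnorm(k) - pnorm(-k)
round(empirical_rule, 4)
```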

Draw a histogram

hist(contract_years, xlab = "Years Left in Contract", col = "green", border = "red",
     xlim = c(0, 8), ylim = c(0, 225), breaks = 5)

The histogram peaks between 0 and 2 years, suggesting that contracts are concentrated at the shorter end, with around two years remaining.
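
When `breaks` is a single number, hist() treats it only as a suggestion for the number of bins. For whole-number data, explicit breaks at the half-integers give each value its own bar. The sketch below assumes data rebuilt from the frequency table in the case scenario rather than the CSV file, so its shape need not match the plot described above; `years_tbl` is a made-up name.

```r
# Data rebuilt from the frequency table in the case scenario
# (assumption: the table, not the CSV file, is the source here)
years_tbl <- rep(c(6, 5, 4, 3, 2, 1), times = c(28, 72, 201, 109, 56, 34))

# Breaks at half-integers give each whole number of years its own bar
h <- hist(years_tbl,
          breaks = seq(0.5, 6.5, by = 1),
          xlab = "Years Left in Contract",
          main = "Contract length (from frequency table)")

# Bar heights, in order from 1 to 6 years
h$counts
```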

##Question 3

Use the skills learned in Case-scenario 3 on one of the following datasets. You may choose only one dataset. They are both available in Canvas.

library(readr)
doubles_hit <- read_csv("C:/Users/axeli/OneDrive/Documentos/Miami Dade College/2024/Summer 2024/CAP4936 Special Topics in Data Analytics/Assigm 5/doubles_hit.csv")

## Rows: 100 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (1): doubles_hit
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View(doubles_hit)

doubles_hit <- read.csv("doubles_hit.csv", header = TRUE, sep = ",")
doubles_hits <- doubles_hit$doubles_hit

#View(doubles_hit)

# Mean 
doubles_mean  <- mean(doubles_hits)
doubles_mean

## [1] 23.55

The average number of doubles hit across the dataset is 23.55.

# Median 
doubles_median  <- median(doubles_hits)
doubles_median

## [1] 23.5

The median of 23.5 indicates that half of the observed values are at or below 23.5 doubles and half are at or above it.

# Find number of observations
doubles_n <- length(doubles_hits)
# Find standard deviation
doubles_sd <- sd(doubles_hits)
doubles_sd

## [1] 13.37371

Observations typically deviate from the mean (doubles_mean = 23.55) by approx. 13.37 doubles.

What percentage of the data lies within one standard deviation of the mean?

doubles_w1sd <- sum((doubles_hits - doubles_mean)/doubles_sd < 1)/ doubles_n
# Proportion of observations less than one standard deviation above the mean
# (one-sided; a two-sided check would compare abs(...) < 1)
doubles_w1sd

## [1] 0.79

A proportion of 0.79 means that 79% of the observations lie less than one standard deviation above the mean. The comparison is one-sided, so it also counts observations far below the mean.

## Difference from empirical 
doubles_w1sd - 0.68

## [1] 0.11

The difference of 0.11 is the amount by which the observed proportion exceeds the 0.68 predicted by the empirical rule.

What percentage of the data lies within two standard deviations of the mean?

## Within 2 sd
doubles_w2sd <- sum((doubles_hits - doubles_mean)/ doubles_sd < 2)/doubles_n
doubles_w2sd

## [1] 1

## Difference from empirical 
doubles_w2sd - 0.95

## [1] 0.05

What percent of the data lies within three standard deviations of the mean?

doubles_w3sd <- sum((doubles_hits - doubles_mean)/ doubles_sd < 3)/doubles_n
doubles_w3sd

## [1] 1

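All observations fall below both the two- and three-standard-deviation bounds, so both proportions equal 1. For two standard deviations this exceeds the 0.95 predicted by the empirical rule by 0.05, which suggests that the doubles counts are slightly more concentrated around the mean than normally distributed data would be.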

## Difference from empirical 
doubles_w3sd - 0.9973

## [1] 0.0027

# Create histogram
hist(doubles_hits, xlab = "Number of Doubles Hit", col = "green", border = "red",
     xlim = c(0, 50), ylim = c(0, 28), breaks = 5)

The histogram shows that the most frequent number of doubles falls between 15 and 20. Very high counts (over 30 doubles) are uncommon, since hitting the ball that well that often is really hard in this sport.

In-class activity #5: Introduction to R Part 2 - Axel Paredes