Case-scenario 1
This is the fourth season of outfielder Luis Robert with the Chicago
White Sox.
If during the first three seasons he hit 11, 13, and 12 home runs,
how many does he need this season for his overall average to be at
least 20?
# Solution
#Given that x1=11, x2=13, x3=12,
#we want to find x4
#such that the mean (average) number of home-runs is x̄ >= 20
#Notice that in this case n=4
#According to the information above: 20×4 = 11+13+12+x4
#so when x4=44, the home-runs average will be exactly 20
#Answer Here
#Home-runs so far
HR_before <- c(11, 13, 12) # Store the three previous season totals in the vector HR_before
#Average number of home-runs per season wanted
wanted_HR <- 20 # Target average
#Number of seasons
n_seasons <- 4 # Three past seasons plus the current one
#Home-runs needed in season 4
x_4 <- n_seasons*wanted_HR - sum(HR_before) # Required four-season total minus the home-runs already hit
# 4*20 - (11+13+12)
#The result will be 44
#Minimum number of home-runs needed by Robert
#Print the contents of the variable x_4 (R is case-sensitive, so x_4 and X_4 are different names)
x_4
[1] 44
According to the calculations above, Robert must hit at least 44 home
runs this season to reach an average of at least 20 home runs per
season.
We can confirm this with the mean() function in R.
#Robert's performance
Robert_HRs <- c(11, 13, 12, 44) # Store all four season totals in the vector Robert_HRs
#These are the contents of HR_before followed by x_4
#Find mean
#Use the function mean() to find the average of the contents of Robert_HRs
#The result should be 20, which confirms the calculation in the previous code chunk
mean(Robert_HRs)
[1] 20
#Find standard deviation
#In this code we are using the function sd() to find the standard deviation of the contents of Robert_HRs
sd(Robert_HRs)
[1] 16.02082
#Find the maximum number of home-runs during the four seasons period.
#In this code we are using the function max() to find the highest value in the contents of Robert_HRs
max(Robert_HRs)
[1] 44
#Find the minimum number of home-runs during the four seasons period
#In this code we are using the function min() to find the lowest value in the contents of Robert_HRs
min(Robert_HRs)
[1] 11
#We can also use the summary() function to find basic statistics, including the median!
#In This code we use the function summary() to find the following statistical values in the variable Robert_HRs
#Minimum (Min): The smallest value in the vector.
#1st Quartile (1st Qu): The value below which 25% of the data falls (also known as Q1 or the lower quartile).
#Median (Median or 50th percentile): The middle value in the sorted vector (Q2).
#Mean (Mean): The average value of the vector.
#3rd Quartile (3rd Qu): The value below which 75% of the data falls (Q3 or the upper quartile).
#Maximum (Max): The largest value in the vector.
#NA's: The count of missing (NA) values. summary() shows this entry only when the vector contains missing values.
summary(Robert_HRs)
Min. 1st Qu. Median Mean 3rd Qu. Max.
11.00 11.75 12.50 20.00 20.75 44.00
Question 1
Now, you must complete the problem below which represents a similar
case scenario. You may use the steps that we executed in Case-scenario 1
as a template for your solution.
This is the sixth season of outfielder Juan Soto in the majors. If
during the first five seasons he received 79, 108, 41, 145, and 135 walks,
how many does he need this season for his overall number of walks per
season to be at least 100?
#Walks so far
Walks_before <- c(79, 108, 41, 145, 135) # Store the five previous season totals in the vector Walks_before
#Average number of walks per season wanted
wanted_Walks <- 100 # Target average
#Number of seasons
n_seasons <- 6 # Five past seasons plus the current one
#Walks needed in season 6
x_6 <- n_seasons*wanted_Walks - sum(Walks_before) # Required six-season total minus the walks already received
#6*100 - (79+108+41+145+135)
#600 - 508
#The result will be 92
#Minimum number of walks needed by Juan
x_6
[1] 92
Case-scenario 2
The average salary of 10 baseball players is 72,000 dollars a week
and the average salary of 4 soccer players is 84,000. Find the mean
salary of all 14 professional players.
Solution
We can find the combined mean by multiplying each group's mean by its
group size, adding the two products, and dividing by the total number
of people.
Let $n_1 = 10$ denote the number of baseball players and $y_1 = 72000$
their mean salary. Let $n_2 = 4$ denote the number of soccer players and
$y_2 = 84000$ their mean salary. Then the mean salary of all 14 individuals is:
$$\bar{y} = \frac{n_1 y_1 + n_2 y_2}{n_1 + n_2}$$
We can compute this in R as follows:
n_1 <- 10 # Declare the variable n_1 and assign 10 (Baseball Players)
n_2 <- 4 # Declare the variable n_2 and assign 4 (Soccer Players)
y_1 <- 72000 # Declare the variable y_1 and assign 72000 (mean salary of the baseball players)
y_2 <- 84000 # Declare the variable y_2 and assign 84000 (mean salary of the soccer players)
#Mean salary overall
salary_ave <- (n_1*y_1 + n_2*y_2)/(n_1+n_2) # Declare the variable salary_ave and assign the result of the following calculation
#(10*72000 + 4*84000) / (10+4)
#(720000 + 336000) / 14
# The answer will be 75428.57
salary_ave #We are printing the contents of the variable salary_ave
[1] 75428.57
Question 2
The average salary of 7 basketball players is 102,000 dollars a week
and the average salary of 9 NFL players is 91,000. Find the mean salary
of all 16 professional players.
n_1 <- 7 # Declare the variable n_1 and assign 7 (basketball players)
n_2 <- 9 # Declare the variable n_2 and assign 9 (NFL players)
y_1 <- 102000 # Declare the variable y_1 and assign 102000 (mean salary of the basketball players)
y_2 <- 91000 # Declare the variable y_2 and assign 91000 (mean salary of the NFL players)
#Mean salary overall
salary_ave <- (n_1*y_1 + n_2*y_2)/(n_1+n_2) # Declare the variable salary_ave and assign the result of the following calculation
#(7*102000 + 9*91000) / (7+9)
#(714000 + 819000) / 16
# The answer will be 95812.5
salary_ave #We are printing the contents of the variable salary_ave
[1] 95812.5
Case-scenario 3
The frequency distribution below lists the number of active players
in the Barclays Premier League and the time left in their contract.
Years   Number of Players
6       28
5       72
4       201
3       109
2       56
1       34
Find the mean, the median, and the standard deviation.
What percentage of the data lies within one standard deviation of the
mean?
What percentage of the data lies within two standard deviations of
the mean?
What percent of the data lies within three standard deviations of the
mean?
Draw a histogram to illustrate the data.
Solution
#The allcontracts.csv file contains the contract lengths of all the players.
#We can read this file in R using read.table() with sep = "," (read.csv() is a shortcut for the same kind of file).
contract_length <- read.table("C:/Users/erlan/OneDrive/Desktop/School/CAP 4936 Sports Analytics/allcontracts.csv", header = TRUE, sep = ",")
#This code declares the variable contract_length
#and uses read.table() to read allcontracts.csv from the given path, treating the first row as the header and the comma as the separator
contract_years <- contract_length$years
#In The above line of code we are extracting the values stored in the column named years from the data frame or list contract_length and assigning them to a new variable named contract_years
Make comments about the code we just ran above.
contract_length: This variable holds a data frame that contains all
the data read from “allcontracts.csv”. Each row corresponds to a record
from the CSV file, and each column represents a different attribute
(such as contract details).
contract_years: This variable is a vector that contains only the
values from the years column of the contract_length data frame. Each
element in contract_years represents the length of a contract in years
as specified in the CSV file.
After executing this code, we can further analyze or manipulate the
data stored in contract_length and contract_years using various R
functions and operations. For example, you could calculate summary
statistics like mean, median, or maximum contract length using functions
like mean(contract_years), median(contract_years), or
max(contract_years).
To find the mean and the standard deviation
#Mean
#In This code we declare the variable contracts_mean
#Then use the function mean() to calculate the average of the contents in the variable contract_years
#Then assign the results to contracts_mean
contracts_mean <- mean(contract_years)
#in this code we are printing the contents of the variable contracts_mean
contracts_mean
[1] 3.458918
#Median
#In This code we declare the variable contracts_median
#Then use the function median() to calculate the median of the contents in the variable contract_years
#Then assign the results to contracts_median
contracts_median <- median(contract_years)
#in this code we are printing the contents of the variable contracts_median
contracts_median
[1] 3
#Find number of observations
#In this code we declare the variable contracts_n
#Then use the function length() to calculate the size of the contents in the variable contract_years and assign to contracts_n
contracts_n <- length(contract_years)
contracts_n # Printing the results. In this case 499
[1] 499
#Find standard deviation
#In this code we declare the variable contracts_sd
#Then use the function sd() to calculate the standard deviation of the contents in the variable contract_years and assign to contracts_sd
contracts_sd <- sd(contract_years)
contracts_sd # Printing the results. In this case 1.69686
[1] 1.69686
#What percentage of the data lies within one standard deviation of the mean?
#Declare the variable contracts_w1sd and assign the result of the following calculation:
#standardize each value, (contract_years - contracts_mean)/contracts_sd, count how many standardized values are less than 1,
#then divide that count by contracts_n
#Note: the comparison is one-sided, so it counts every observation below mean + 1 sd, including any that lie more than 1 sd below the mean.
#Wrapping the difference in abs() would restrict the count to observations within 1 sd on both sides.
contracts_w1sd <- sum((contract_years - contracts_mean)/contracts_sd < 1)/ contracts_n
#Proportion of observations counted by the comparison above
contracts_w1sd # Printing the results. In this case 0.8416834
[1] 0.8416834
#Difference from empirical
#Subtracting 0.68 from the contents of the variable contracts_w1sd to find the Difference from empirical. In this case the result is 0.1616834
contracts_w1sd - 0.68
[1] 0.1616834
What percentage of the data lies within two standard deviations of
the mean?
#Within 2 sd
#Declare the variable contracts_w2sd and assign the result of the following calculation:
#standardize each value, (contract_years - contracts_mean)/contracts_sd, count how many standardized values are less than 2,
#then divide that count by contracts_n
#Note: the comparison is one-sided (everything below mean + 2 sd); wrapping the difference in abs() would count only values within 2 sd on both sides
contracts_w2sd <- sum((contract_years - contracts_mean)/ contracts_sd < 2)/contracts_n
contracts_w2sd # Printing the results. In this case 1
[1] 1
#Difference from empirical
#Subtracting 0.95 from the contents of the variable contracts_w2sd to find the Difference from empirical. In this case the result is 0.05
contracts_w2sd - 0.95
[1] 0.05
What percent of the data lies within three standard deviations of the
mean?
#Within 3 sd
#Declare the variable contracts_w3sd and assign the result of the following calculation:
#standardize each value, (contract_years - contracts_mean)/contracts_sd, count how many standardized values are less than 3,
#then divide that count by contracts_n
#Note: the comparison is one-sided (everything below mean + 3 sd); wrapping the difference in abs() would count only values within 3 sd on both sides
contracts_w3sd <- sum((contract_years - contracts_mean)/ contracts_sd < 3)/contracts_n
contracts_w3sd # Printing the results. In this case 1
[1] 1
#Difference from empirical
#Subtracting 0.9973 from the contents of the variable contracts_w3sd to find the Difference from empirical. In this case the result is 0.0027
contracts_w3sd - 0.9973
[1] 0.0027
Draw a histogram
#Create histogram
#Use the function hist() to create a histogram using the contents of contract_years
#With x-axis label Years Left in Contract
#Coloring the bars with green and the border with red
# xlim = c(0,8): limits for the x-axis
# ylim = c(0,225): limits for the y-axis
# breaks = 5: suggested number of bins; hist() treats a single number as a suggestion and may pick a nearby "pretty" value
hist(contract_years,xlab = "Years Left in Contract",col = "green",border = "red", xlim = c(0,8), ylim = c(0,225),
breaks = 5)

Question 3
Use the skills learned in case scenario number 3 on one of the following
data sets. You may choose only one dataset. They are both available in
Canvas.
doubles_hit.csv and triples_hit.csv
#The triples_hit.csv file contains all the players’ triples.
#We can read this file in R using read.table() with sep = "," (read.csv() is a shortcut for the same kind of file).
triples_data <- read.table("C:/Users/erlan/OneDrive/Desktop/School/CAP 4936 Sports Analytics/triples_hit.csv", header = TRUE, sep = ",")
#This code declares the variable triples_data
#and uses read.table() to read triples_hit.csv from the given path, treating the first row as the header and the comma as the separator
triples_hit <- triples_data$triples_hit
#In The above line of code we are extracting the values stored in the column named triples_hit from the data frame or list triples_data and assigning them to a new variable named triples_hit
Make comments about the code we just ran above.
triples_data: This variable holds a data frame that contains all the data
read from “triples_hit.csv”. Each row corresponds to a record from the
CSV file, and each column represents a different attribute.
triples_hit: This variable is a vector that contains only the values
from the triples_hit column of the triples_data data frame.
After executing this code, we can further analyze or manipulate the
data stored in triples_data and triples_hit using various R functions and
operations. For example, you could calculate summary statistics like
mean, median, or maximum triples hit using functions like
mean(triples_hit), median(triples_hit), or max(triples_hit).
To find the mean and the standard deviation
#Mean
#In This code we declare the variable triple_mean
#Then use the function mean() to calculate the average of the contents in the variable triples_hit
#Then assign the results to triple_mean
triple_mean <- mean(triples_hit)
#in this code we are printing the contents of the variable triple_mean
triple_mean
[1] 4.96
#Median
#This code declares the variable triple_median,
#uses the function median() to calculate the median of the contents of the variable triples_hit,
#and assigns the result to triple_median
triple_median <- median(triples_hit)
#Print the contents of the variable triple_median
triple_median
[1] 5
#Find number of observations
#In this code we declare the variable triple_n
#Then use the function length() to calculate the size of the contents in the variable triples_hit and assign to triple_n
triple_n <- length(triples_hit)
triple_n # Printing the results. In this case 100
[1] 100
#Find standard deviation
#In this code we declare the variable triple_sd
#Then use the function sd() to calculate the standard deviation of the contents in the variable triples_hit and assign to triple_sd
triple_sd <- sd(triples_hit)
triple_sd # Printing the results. In this case 2.884721
[1] 2.884721
What percentage of the data lies within one standard deviation of the
mean?
#What percentage of the data lies within one standard deviation of the mean?
#Declare the variable triple_w1sd and assign the result of the following calculation:
#standardize each value, (triples_hit - triple_mean)/triple_sd, count how many standardized values are less than 1,
#then divide that count by triple_n
#Note: the comparison is one-sided, so it counts every observation below mean + 1 sd, including any that lie more than 1 sd below the mean.
#Wrapping the difference in abs() would restrict the count to observations within 1 sd on both sides.
triple_w1sd <- sum((triples_hit - triple_mean)/triple_sd < 1)/ triple_n
#Proportion of observations counted by the comparison above
triple_w1sd # Printing the results. In this case 0.88
[1] 0.88
#Difference from empirical
#Subtracting 0.68 from the contents of the variable triple_w1sd to find the difference from the empirical rule. In this case the result is 0.2
triple_w1sd - 0.68
[1] 0.2
What percentage of the data lies within two standard deviations of
the mean?
#Within 2 sd
#Declare the variable triple_w2sd and assign the result of the following calculation:
#standardize each value, (triples_hit - triple_mean)/triple_sd, count how many standardized values are less than 2,
#then divide that count by triple_n
#Note: the comparison is one-sided (everything below mean + 2 sd); wrapping the difference in abs() would count only values within 2 sd on both sides
triple_w2sd <- sum((triples_hit - triple_mean)/triple_sd < 2)/ triple_n
#Proportion of observations counted by the comparison above
triple_w2sd # Printing the results. In this case 0.93
[1] 0.93
#Difference from empirical
#Subtracting 0.95 from the contents of the variable triple_w2sd to find the difference from the empirical rule. In this case the result is -0.02
triple_w2sd - 0.95
[1] -0.02
What percent of the data lies within three standard deviations of the
mean?
#Within 3 sd
#Declare the variable triple_w3sd and assign the result of the following calculation:
#standardize each value, (triples_hit - triple_mean)/triple_sd, count how many standardized values are less than 3,
#then divide that count by triple_n
#Note: the comparison is one-sided (everything below mean + 3 sd); wrapping the difference in abs() would count only values within 3 sd on both sides
triple_w3sd <- sum((triples_hit - triple_mean)/triple_sd < 3)/ triple_n
#Proportion of observations counted by the comparison above
triple_w3sd # Printing the results. In this case 0.98
[1] 0.98
#Difference from empirical
#Subtracting 0.9973 from the contents of the variable triple_w3sd to find the difference from the empirical rule. In this case the result is -0.0173
triple_w3sd - 0.9973
[1] -0.0173
Draw a histogram
#Create histogram
#Use the function hist() to create a histogram using the contents of triples_hit
#With x-axis label Players Triples
#Coloring the bars with green and the border with red
#The axis limits are left to the hist() defaults: the xlim = c(0,8) and ylim = c(0,225) values used for the contract data would cut off the larger triples counts
# breaks = 5: suggested number of bins; hist() treats a single number as a suggestion and may pick a nearby "pretty" value
hist(triples_hit,xlab = "Players Triples",col = "green",border = "red",
breaks = 5)

---
title: "Activity_5"
output: html_notebook
---

**Case-scenario 1**

This is the fourth season of outfielder Luis Robert with the Chicago White Sox.

If during the first three seasons he hit 11, 13, and 12 home runs, how many does he need this season for his overall average to be at least 20?


```{r}
# Solution

#Given that x1=11, x2=13, x3=12,
#we want to find x4
#such that the mean (average) number of home-runs is x̄ >= 20

#Notice that in this case n=4

#According to the information above: 20×4 = 11+13+12+x4

#so when x4=44, the home-runs average will be exactly 20

#Answer Here

#Home-runs so far

HR_before <- c(11, 13, 12) # Store the three previous season totals in the vector HR_before

#Average number of home-runs per season wanted

wanted_HR <- 20 # Target average

#Number of seasons

n_seasons <- 4 # Three past seasons plus the current one

#Home-runs needed in season 4

x_4 <- n_seasons*wanted_HR - sum(HR_before) # Required four-season total minus the home-runs already hit
# 4*20 - (11+13+12)
#The result will be 44

#Minimum number of home-runs needed by Robert
#Print the contents of the variable x_4 (R is case-sensitive, so x_4 and X_4 are different names)
x_4
```
According to the calculations above,
Robert must hit at least 44 home runs this season
to reach an average of at least 20 home runs per season.
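
The same arithmetic can be wrapped in a small function. This is only a sketch: `needed_for_average()` and its arguments `previous` and `target` are hypothetical names that do not appear in the activity.

```r
# Hypothetical helper (not part of the original activity): value needed in
# the next season so that the overall mean reaches `target`.
needed_for_average <- function(previous, target) {
  n_total <- length(previous) + 1      # past seasons plus the current one
  n_total * target - sum(previous)     # required total minus what is already there
}

# Robert's first three seasons with a target average of 20
needed_for_average(c(11, 13, 12), 20)  # 4*20 - 36 = 44
```

A result of zero or less would mean the target average is already secured by the earlier seasons.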

We can confirm this with the mean() function in R.

```{r}
#Robert's performance

Robert_HRs <- c(11, 13, 12, 44) # Store all four season totals in the vector Robert_HRs
#These are the contents of HR_before followed by x_4

#Find mean
#Use the function mean() to find the average of the contents of Robert_HRs
#The result should be 20, which confirms the calculation in the previous code chunk
mean(Robert_HRs)
```
```{r}
#Find standard deviation
#In this code we are using the function sd() to find the standard deviation of the contents of Robert_HRs
sd(Robert_HRs)
```
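
As a rough illustration of what sd() computes, the sample standard deviation can be rebuilt step by step. The names `hrs`, `deviations`, `sample_var`, and `manual_sd` are introduced here only for the sketch.

```r
# Sample standard deviation computed by hand (same values as Robert_HRs)
hrs <- c(11, 13, 12, 44)

deviations <- hrs - mean(hrs)                        # distance of each season from the mean
sample_var <- sum(deviations^2) / (length(hrs) - 1)  # divide by n - 1, not n
manual_sd  <- sqrt(sample_var)

manual_sd                       # about 16.02
all.equal(manual_sd, sd(hrs))   # TRUE: matches the built-in function
```

The single 44-home-run season accounts for most of the squared deviations, which is why the standard deviation is so large compared with the first three seasons.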
```{r}
#Find the maximum number of home-runs during the four seasons period.
#In this code we are using the function max() to find the highest value in the contents of Robert_HRs
max(Robert_HRs)
```
```{r}
#Find the minimum number of home-runs during the four seasons period
#In this code we are using the function min() to find the lowest value in the contents of Robert_HRs
min(Robert_HRs)
```
```{r}
#We can also use the summary() function to find basic statistics, including the median!

#In This code we use the function summary() to find the following statistical values in the variable Robert_HRs

#Minimum (Min): The smallest value in the vector.
#1st Quartile (1st Qu): The value below which 25% of the data falls (also known as Q1 or the lower quartile).
#Median (Median or 50th percentile): The middle value in the sorted vector (Q2).
#Mean (Mean): The average value of the vector.
#3rd Quartile (3rd Qu): The value below which 75% of the data falls (Q3 or the upper quartile).
#Maximum (Max): The largest value in the vector.
#NA's: The count of missing (NA) values. summary() shows this entry only when the vector contains missing values.

summary(Robert_HRs)
```
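
The quartiles reported by summary() can be checked with quantile(), which by default interpolates between neighbouring sorted values. The sketch below reuses the four season totals under the assumed name `hrs`.

```r
# Quartiles of the four season totals, using quantile() with its default method
hrs <- c(11, 13, 12, 44)

sort(hrs)                                   # 11 12 13 44
quantile(hrs, probs = c(0.25, 0.5, 0.75))   # 11.75 12.50 20.75
```

The first quartile, 11.75, sits three quarters of the way from 11 to 12, and the third quartile, 20.75, sits one quarter of the way from 13 to 44, so with only four values the quartiles need not be observed values.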
**Question 1**

Now, you must complete the problem below which represents a similar case scenario. 
You may use the steps that we executed in Case-scenario 1 as a template for your solution.

This is the sixth season of outfielder Juan Soto in the majors.
If during the first five seasons he received 79, 108, 41, 145, and 135 walks,
how many does he need this season for his overall number of walks per season to be at least 100?

```{r}
#Walks so far

Walks_before <- c(79, 108, 41, 145, 135) # Store the five previous season totals in the vector Walks_before

#Average number of walks per season wanted

wanted_Walks <- 100 # Target average

#Number of seasons

n_seasons <- 6 # Five past seasons plus the current one

#Walks needed in season 6

x_6 <- n_seasons*wanted_Walks - sum(Walks_before) # Required six-season total minus the walks already received
#6*100 - (79+108+41+145+135)
#600 - 508
#The result will be 92

#Minimum number of walks needed by Juan

x_6
```
**Case-scenario 2**

The average salary of 10 baseball players is 72,000 dollars a week and the average salary of 4 soccer players 
is 84,000. Find the mean salary of all 14 professional players.

**Solution**

We can find the combined mean by multiplying each group's mean by its group size, adding the two products, and dividing by the total number of people.

Let $n_1 = 10$ denote the number of baseball players and $y_1 = 72000$ their mean salary. Let $n_2 = 4$ denote the number of soccer players and $y_2 = 84000$ their mean salary. Then the mean salary of all 14 individuals is:

$$\bar{y} = \frac{n_1 y_1 + n_2 y_2}{n_1 + n_2}$$

We can compute this in R as follows:

```{r}
n_1 <- 10 # Declare the variable n_1 and assign 10 (Baseball Players)
n_2 <- 4 # Declare the variable n_2 and assign 4 (Soccer Players)
y_1 <- 72000 # Declare the variable y_1 and assign 72000 (mean salary of the baseball players)
y_2 <- 84000 # Declare the variable y_2 and assign 84000 (mean salary of the soccer players)
#Mean salary overall
salary_ave <- (n_1*y_1 + n_2*y_2)/(n_1+n_2) # Declare the variable salary_ave and assign the result of the following calculation
#(10*72000 + 4*84000) / (10+4)
#(720000 + 336000) / 14
# The answer will be 75428.57

salary_ave #We are printing the contents of the variable salary_ave
```
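
Base R also provides weighted.mean(), which performs the same calculation when the group sizes are passed as weights. The vector names `group_sizes` and `group_means` below are assumptions made for this sketch.

```r
# Combined mean of the two groups, using the group sizes as weights
group_sizes <- c(baseball = 10, soccer = 4)
group_means <- c(baseball = 72000, soccer = 84000)

combined <- weighted.mean(group_means, w = group_sizes)
combined             # about 75428.57, same as salary_ave

# For contrast: the plain average of the two means ignores the group sizes
mean(group_means)    # 78000
```

The weighted result is closer to 72,000 than the plain average is, because the baseball group is larger.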
**Question 2**

The average salary of 7 basketball players is 102,000 dollars a week and the average salary of 9 NFL players is 91,000. 
Find the mean salary of all 16 professional players.

```{r}
n_1 <- 7 # Declare the variable n_1 and assign 7 (basketball players)
n_2 <- 9 # Declare the variable n_2 and assign 9 (NFL players)
y_1 <- 102000 # Declare the variable y_1 and assign 102000 (mean salary of the basketball players)
y_2 <- 91000 # Declare the variable y_2 and assign 91000 (mean salary of the NFL players)
#Mean salary overall
salary_ave <- (n_1*y_1 + n_2*y_2)/(n_1+n_2) # Declare the variable salary_ave and assign the result of the following calculation
#(7*102000 + 9*91000) / (7+9)
#(714000 + 819000) / 16
# The answer will be 95812.5


salary_ave #We are printing the contents of the variable salary_ave
```
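
Written out by hand, with $n_1 = 7$, $y_1 = 102000$, $n_2 = 9$, and $y_2 = 91000$, the calculation in the chunk above reads:

$$\bar{y} = \frac{7 \times 102000 + 9 \times 91000}{7 + 9} = \frac{714000 + 819000}{16} = \frac{1533000}{16} = 95812.5$$

The sums in the numerator and the denominator are each completed before the division, which is what the parentheses in the R expression enforce.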
**Case-scenario 3**

The frequency distribution below lists the number of active players in the Barclays Premier League and the time left in their contract.

| Years | Number of Players |
|-------|-------------------|
| 6     | 28                |
| 5     | 72                |
| 4     | 201               |
| 3     | 109               |
| 2     | 56                |
| 1     | 34                |

Find the mean, the median, and the standard deviation.

What percentage of the data lies within one standard deviation of the mean?

What percentage of the data lies within two standard deviations of the mean?

What percent of the data lies within three standard deviations of the mean?

Draw a histogram to illustrate the data.

**Solution**

```{r}
#The allcontracts.csv file contains the contract lengths of all the players.
#We can read this file in R using read.table() with sep = "," (read.csv() is a shortcut for the same kind of file).

contract_length <- read.table("C:/Users/erlan/OneDrive/Desktop/School/CAP 4936 Sports Analytics/allcontracts.csv", header = TRUE, sep = ",")
#This code declares the variable contract_length
#and uses read.table() to read allcontracts.csv from the given path, treating the first row as the header and the comma as the separator
contract_years <- contract_length$years
#In The above line of code we are extracting the values stored in the column named years from the data frame or list contract_length and assigning them to a new variable named contract_years
```

Make comments about the code we just ran above.

contract_length: This variable holds a data frame that contains all the data read from "allcontracts.csv". 
Each row corresponds to a record from the CSV file, and each column represents a different attribute (such as contract details).

contract_years: This variable is a vector that contains only the values from the years column of the contract_length data frame. 
Each element in contract_years represents the length of a contract in years as specified in the CSV file.


After executing this code, we can further analyze or manipulate the data stored in contract_length and contract_years using various R functions and operations. For example, you could calculate summary statistics like mean, median, or maximum contract length using functions like mean(contract_years), median(contract_years), or max(contract_years).
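
If allcontracts.csv is not at hand, the frequency table from the problem statement can be expanded into one value per player with rep(). This is a sketch under that assumption; `years`, `players`, and `table_years` are names introduced here. The table lists 500 players, so statistics computed this way need not match those computed from the CSV file.

```r
# Rebuild raw observations from the frequency table in the problem statement
years   <- 6:1
players <- c(28, 72, 201, 109, 56, 34)

table_years <- rep(years, times = players)   # one entry per player

length(table_years)   # 500
mean(table_years)
median(table_years)
sd(table_years)
```

table(table_years) returns the original counts, which is a quick way to confirm the expansion.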

To find the mean and the standard deviation

```{r}
#Mean 
#In This code we declare the variable contracts_mean
#Then use the function mean() to calculate the average of the contents in the variable contract_years
#Then assign the results to contracts_mean
contracts_mean  <- mean(contract_years)
#in this code we are printing the contents of the variable contracts_mean
contracts_mean

```
```{r}
#Median
#In This code we declare the variable contracts_median
#Then use the function median() to calculate the median of the contents in the variable contract_years
#Then assign the results to contracts_median
contracts_median <- median(contract_years)
#in this code we are printing the contents of the variable contracts_median
contracts_median
```
```{r}
#Find number of observations
#In this code we declare the variable contracts_n
#Then use the function length() to calculate the size of the contents in the variable contract_years and assign to contracts_n
contracts_n <- length(contract_years)
contracts_n # Printing the results. In this case 499
#Find standard deviation
#In this code we declare the variable contracts_sd
#Then use the function sd() to calculate the standard deviation of the contents in the variable contract_years and assign to contracts_sd
contracts_sd <- sd(contract_years)
contracts_sd # Printing the results. In this case 1.69686
```

```{r}
#What percentage of the data lies within one standard deviation of the mean?
#Declare the variable contracts_w1sd and assign the result of the following calculation:
#standardize each value, (contract_years - contracts_mean)/contracts_sd, count how many standardized values are less than 1,
#then divide that count by contracts_n
#Note: the comparison is one-sided, so it counts every observation below mean + 1 sd, including any that lie more than 1 sd below the mean.
#Wrapping the difference in abs() would restrict the count to observations within 1 sd on both sides.
contracts_w1sd <- sum((contract_years - contracts_mean)/contracts_sd < 1)/ contracts_n
#Proportion of observations counted by the comparison above
contracts_w1sd  # Printing the results. In this case 0.8416834
```
```{r}
#Difference from empirical 
#Subtracting 0.68 from the contents of the variable contracts_w1sd to find the Difference from empirical. In this case the result is 0.1616834
contracts_w1sd - 0.68
```
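
The comparison used for contracts_w1sd only checks the upper side of the mean. A two-sided version can be sketched as below; `prop_within()` and the toy vector `example` are hypothetical and are not taken from the data files.

```r
# Hypothetical helper: share of values within k standard deviations of the
# mean on BOTH sides, using abs() on the standardized values.
prop_within <- function(x, k) {
  z <- (x - mean(x)) / sd(x)   # standardized values
  mean(abs(z) < k)             # proportion with |z| below k
}

# Toy data with one unusually low value
example <- c(1, 25, 26, 27, 28, 29, 30, 31, 32, 33)

prop_within(example, 1)                                              # two-sided
sum((example - mean(example)) / sd(example) < 1) / length(example)   # one-sided
```

On this toy vector the one-sided count includes the low outlier and the two-sided count excludes it, so the two proportions differ. Whether they differ for contract_years depends on the values in the file.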
What percentage of the data lies within two standard deviations of the mean?

```{r}
#Within 2 sd
#Declare the variable contracts_w2sd and assign the result of the following calculation:
#standardize each value, (contract_years - contracts_mean)/contracts_sd, count how many standardized values are less than 2,
#then divide that count by contracts_n
#Note: the comparison is one-sided (everything below mean + 2 sd); wrapping the difference in abs() would count only values within 2 sd on both sides
contracts_w2sd <- sum((contract_years - contracts_mean)/ contracts_sd < 2)/contracts_n

contracts_w2sd # Printing the results. In this case 1
```
```{r}
#Difference from empirical 
#Subtracting 0.95 from the contents of the variable contracts_w2sd to find the Difference from empirical. In this case the result is 0.05
contracts_w2sd - 0.95
```
What percent of the data lies within three standard deviations of the mean?

```{r}
#Within 3 sd
#Declare the variable contracts_w3sd and assign the result of the following calculation:
#standardize each value, (contract_years - contracts_mean)/contracts_sd, count how many standardized values are less than 3,
#then divide that count by contracts_n
#Note: the comparison is one-sided (everything below mean + 3 sd); wrapping the difference in abs() would count only values within 3 sd on both sides
contracts_w3sd <- sum((contract_years - contracts_mean)/ contracts_sd < 3)/contracts_n

contracts_w3sd # Printing the results. In this case 1
```
```{r}
#Difference from empirical 
#Subtracting 0.9973 from the contents of the variable contracts_w3sd to find the Difference from empirical. In this case the result is 0.0027
contracts_w3sd - 0.9973
```
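
The reference values 0.68, 0.95, and 0.9973 subtracted in the chunks above are the shares of a normal distribution that fall within 1, 2, and 3 standard deviations of its mean. They can be reproduced with pnorm(); `k` and `normal_share` are names assumed for this sketch.

```r
# Share of a normal distribution within k standard deviations of the mean
k <- 1:3
normal_share <- pnorm(k) - pnorm(-k)

round(normal_share, 4)   # 0.6827 0.9545 0.9973
```

The empirical rule rounds the first two values to 0.68 and 0.95. Data that are skewed or take only a few distinct values, such as whole numbers of years, can deviate noticeably from these shares.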
**Draw a histogram**

```{r}
#Create histogram
#Use the function hist() to create a histogram using the contents of contract_years
#With x-axis label Years Left in Contract
#Coloring the bars with green and the border with red
# xlim = c(0,8): limits for the x-axis
# ylim = c(0,225): limits for the y-axis
# breaks = 5: suggested number of bins; hist() treats a single number as a suggestion and may pick a nearby "pretty" value

hist(contract_years,xlab = "Years Left in Contract",col = "green",border = "red", xlim = c(0,8), ylim = c(0,225),
     breaks = 5)
```
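
For whole-number data, one option is to pass explicit break points at the half-integers so that each bar covers exactly one value. The sketch below rebuilds the data from the frequency table in the problem statement rather than from the CSV file; `table_years`, `bins`, and `h` are assumed names, and the plot title is made up for the example.

```r
# Histogram with one bin per whole number of years
table_years <- rep(6:1, times = c(28, 72, 201, 109, 56, 34))
bins <- seq(0.5, 6.5, by = 1)          # edges at 0.5, 1.5, ..., 6.5

h <- hist(table_years, breaks = bins, plot = FALSE)
h$counts                               # 34 56 109 201 72 28

hist(table_years, breaks = bins,
     xlab = "Years Left in Contract", main = "Contract length",
     col = "green", border = "red")
```

With a vector of break points, hist() uses those edges exactly, whereas a single number such as breaks = 5 is only a suggestion.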


**Question 3**

Use the skills learned in case scenario number 3 on one of the following data sets.
You may choose only one dataset. They are both available in Canvas.

doubles_hit.csv and triples_hit.csv

```{r}
#The triples_hit.csv file contains all the players’ triples.
#We can read this file in R using read.table() with sep = "," (read.csv() is a shortcut for the same kind of file).

triples_data <- read.table("C:/Users/erlan/OneDrive/Desktop/School/CAP 4936 Sports Analytics/triples_hit.csv", header = TRUE, sep = ",")
#This code declares the variable triples_data
#and uses read.table() to read triples_hit.csv from the given path, treating the first row as the header and the comma as the separator
triples_hit <- triples_data$triples_hit
#In The above line of code we are extracting the values stored in the column named triples_hit from the data frame or list triples_data and assigning them to a new variable named triples_hit

```
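As a small illustration of the reading step, the sketch below writes a tiny made-up CSV to a temporary file and reads it back with both `read.csv()` and `read.table()`. The file contents and the names `demo_path`, `demo_csv`, and `demo_tbl` are assumptions for this example only; they are not the course data.

```{r}
# Write a tiny, made-up CSV to a temporary file (hypothetical players and values)
demo_path <- tempfile(fileext = ".csv")
writeLines(c("player,triples_hit", "A,3", "B,0", "C,7"), demo_path)

# read.csv() uses header = TRUE and sep = "," by default
demo_csv <- read.csv(demo_path)

# The read.table() call spells those two arguments out explicitly
demo_tbl <- read.table(demo_path, header = TRUE, sep = ",")

# Compare the two data frames
all.equal(demo_csv, demo_tbl)

# Extract one column as a vector, as done with triples_data$triples_hit
demo_csv$triples_hit
```

For a simple comma-separated file like this one, either function should be usable; `read.csv()` is just shorter to type.
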

Make comments about the code we just ran above.

triples_data: This variable holds a data frame that contains all the data read from "triples_hit.csv". 
Each row corresponds to a record from the CSV file, and each column represents a different attribute.

triples_hit: This variable is a vector that contains only the values from the triples_hit column of the triples_data data frame. 


After executing this code, we can further analyze or manipulate the data stored in triples_data and triples_hit using various R functions and operations. For example, we could calculate summary statistics such as the mean, median, or maximum number of triples with mean(triples_hit), median(triples_hit), or max(triples_hit).
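
The summary functions mentioned above can be tried on a short vector first. This is only a sketch: `demo_triples` is a hypothetical vector invented for the example, not the contents of triples_hit.csv.

```{r}
# Hypothetical triples counts for five players (illustration only)
demo_triples <- c(0, 1, 2, 3, 9)

mean(demo_triples)    # arithmetic average: 15 / 5 = 3
median(demo_triples)  # middle value of the sorted vector: 2
max(demo_triples)     # largest value: 9
summary(demo_triples) # minimum, quartiles, median, mean, and maximum in one call
```

Because of the single large value, the mean (3) sits above the median (2), which is the usual pattern for right-skewed counts.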

To find the mean, the median, the number of observations, and the standard deviation

```{r}
#Mean 
#In This code we declare the variable triple_mean
#Then use the function mean() to calculate the average of the contents in the variable triples_hit
#Then assign the results to triple_mean
triple_mean  <- mean(triples_hit)
#in this code we are printing the contents of the variable triple_mean
triple_mean
```
```{r}
#Median
#In this code we declare the variable triple_median
#Then use the function median() to calculate the median of the contents in the variable triples_hit
#Then assign the result to triple_median
triple_median <- median(triples_hit)
#In this code we are printing the contents of the variable triple_median
triple_median
```
```{r}
#Find number of observations
#In this code we declare the variable triple_n
#Then use the function length() to calculate the size of the contents in the variable triples_hit and assign to triple_n
triple_n <- length(triples_hit)
triple_n # Printing the results. In this case 100
#Find standard deviation
#In this code we declare the variable triple_sd
#Then use the function sd() to calculate the standard deviation of the contents in the variable triples_hit and assign to triple_sd
triple_sd <- sd(triples_hit)
triple_sd # Printing the results. In this case 2.884721
```

What percentage of the data lies within one standard deviation of the mean?

```{r}
#What percentage of the data lies within one standard deviation of the mean?
#Declare the variable triple_w1sd and assign the result of the following calculation
#Compute each z-score, (triples_hit - triple_mean) / triple_sd, and count with sum() how many are below 1 (for 1 standard deviation)
#Then divide that count by triple_n to get a proportion
triple_w1sd <- sum((triples_hit - triple_mean)/triple_sd < 1)/ triple_n
#Percentage of observation within one standard deviation of the mean
triple_w1sd # Printing the results. In this case 0.88
```
```{r}
#Difference from empirical 
#Subtracting 0.68 from the contents of the variable triple_w1sd to find the difference from the empirical rule. In this case the result is 0.2
triple_w1sd - 0.68
```
What percentage of the data lies within two standard deviations of the mean?

```{r}
#Within 2 sd
#Declare the variable triple_w2sd and assign the result of the following calculation
#Compute each z-score, (triples_hit - triple_mean) / triple_sd, and count with sum() how many are below 2 (for 2 standard deviations)
#Then divide that count by triple_n to get a proportion
triple_w2sd <- sum((triples_hit - triple_mean)/triple_sd < 2)/ triple_n
#Percentage of observations within two standard deviations of the mean
triple_w2sd # Printing the results. In this case 0.93

```

```{r}
#Difference from empirical 
#Subtracting 0.95 from the contents of the variable triple_w2sd to find the difference from the empirical rule. In this case the result is -0.02
triple_w2sd - 0.95
```
What percent of the data lies within three standard deviations of the mean?

```{r}
#Within 3 sd 
#Declare the variable triple_w3sd and assign the result of the following calculation
#Compute each z-score, (triples_hit - triple_mean) / triple_sd, and count with sum() how many are below 3 (for 3 standard deviations)
#Then divide that count by triple_n to get a proportion
triple_w3sd <- sum((triples_hit - triple_mean)/triple_sd < 3)/ triple_n
#Percentage of observations within three standard deviations of the mean

triple_w3sd # Printing the results. In this case 0.98
```
```{r}
#Difference from empirical 
#Subtracting 0.9973 from the contents of the variable triple_w3sd to find the difference from the empirical rule. In this case the result is -0.0173
triple_w3sd - 0.9973
```
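
The three checks above repeat the same calculation with a different cutoff. One way to collect them in a single table is sketched below, assuming a made-up vector (`demo_hits` is hypothetical, not triples_hit.csv) and a two-sided test with `abs()`. The reference shares 0.68, 0.95, and 0.9973 are the ones used in this document.

```{r}
# Hypothetical sample for illustration only (not triples_hit.csv)
demo_hits <- c(0, 0, 1, 1, 1, 2, 2, 3, 4, 12)
demo_hits_z <- (demo_hits - mean(demo_hits)) / sd(demo_hits)

k <- 1:3                            # number of standard deviations
empirical <- c(0.68, 0.95, 0.9973)  # reference shares from the empirical rule

# Share of observations with |z| < k, computed once per value of k
observed <- vapply(k, function(kk) mean(abs(demo_hits_z) < kk), numeric(1))

# One row per band: observed share and its difference from the reference share
data.frame(sd_band = k, observed = observed, difference = observed - empirical)
```

With real data, `demo_hits` would be replaced by the vector under study, and the table shows at a glance how far each band is from the empirical rule.
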
**Draw a histogram**

```{r}
#Create histogram
#Use the function hist() to create a histogram using the contents of triples_hit
#With X axis lable Players Triples
#Coloring the bars with green and the border with red
# 0,8 Limits for the x-axis
# 0,225 Limits for the y-axis
# breaks = 5 Number of bins (breaks) in the histogram
hist(triples_hit,xlab = "Players Triples",col = "green",border = "red", xlim = c(0,8), ylim = c(0,225),
     breaks = 5)

```
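
Hard-coded axis limits only fit the data they were chosen for. As a hedged alternative, the sketch below derives the limits from the bins themselves by calling `hist()` once with `plot = FALSE`. The vector `demo_counts` is made up for the example and is not the contents of triples_hit.csv.

```{r}
# Hypothetical counts for illustration only (not triples_hit.csv)
demo_counts <- c(0, 0, 1, 1, 1, 2, 2, 3, 4, 12)

# Compute the bins without drawing, to learn the bin edges and bar heights
demo_bins <- hist(demo_counts, breaks = 5, plot = FALSE)

# Draw the histogram with limits taken from the computed bins
hist(demo_counts,
     breaks = 5,                           # suggested number of bins
     xlab = "Players Triples",
     col = "green", border = "red",
     xlim = range(demo_bins$breaks),       # span of the bin edges actually used
     ylim = c(0, max(demo_bins$counts)))   # tallest bar sets the upper limit
```

Note that `breaks = 5` is treated by `hist()` as a suggestion, so the number of bars drawn may differ from 5.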













