Let us continue getting started with R as we start discussing important statistical concepts.
Case-scenario 1
Kate must take four quizzes in a math class. If her scores on the first three quizzes are 71, 69, and 79, what score does she need on the final quiz for her overall mean to be at least 70?
Solution
Given that \(x_1 = 71, x_2 = 69, x_3 = 79\)
we want to find \(x_4\) such that the mean (average) grade satisfies \(\bar{x} \geq 70\).
Notice that in this case \(n = 4\).
According to the information above: \(70 \times 4 = 71 + 69 + 79 + x_4\)
so when \(x_4 = 61\), the quiz average will be 70.
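Written out step by step, solving the equation above for \(x_4\) gives:

\[
x_4 = 4 \times 70 - (71 + 69 + 79) = 280 - 219 = 61
\]

Any score above 61 raises the mean above 70, so 61 is the minimum Kate needs.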
# Grades so far
grades_before <- c(71, 69, 79)
# Average quiz grade wanted
wanted_grade <- 70
# Number of quizzes
n_quizzes <- 4
# Needed grade on quiz 4
x_4 <- n_quizzes*wanted_grade - sum(grades_before)
# Minimum grade needed by Kate
x_4
[1] 61
According to the calculations above, Kate must score 61 or better on the final quiz to get an average quiz grade of at least 70.
We can confirm this with the mean() function in R:
# Quiz grades
kate_grades <- c(71, 69, 79, 61)
# Find mean
mean(kate_grades)
[1] 70
# Find standard deviation
sd(kate_grades)
[1] 7.393691
# Find maximum grade
max(kate_grades)
[1] 79
# Find minimum grade
min(kate_grades)
[1] 61
We can also use the summary() function to find basic statistics, including the median!
summary(kate_grades)
Min. 1st Qu. Median Mean 3rd Qu. Max.
61 67 70 70 73 79
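The steps of Case-scenario 1 can also be wrapped in a small reusable function. The sketch below is only an illustration: the name `needed_score` and its arguments are hypothetical, not part of the original exercise, and it assumes exactly one quiz remains.

```r
# Hypothetical helper (not part of the original exercise): minimum score
# needed on the one remaining quiz so the overall mean reaches target_mean.
needed_score <- function(scores_so_far, target_mean, n_total) {
  n_total * target_mean - sum(scores_so_far)
}

# Kate's case: three scores so far, four quizzes in total, target mean of 70
kate_needed <- needed_score(c(71, 69, 79), target_mean = 70, n_total = 4)
kate_needed

# Check: the mean of all four scores should equal the target
mean(c(71, 69, 79, kate_needed))
```

The same call with different inputs covers any similar problem, which avoids retyping the formula each time.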
Next, I would like you to explain in detail every single task we completed above. In addition, let us deal with a similar case scenario and complete every single task we execute in Case-scenario 1.
Frank must take six quizzes in a Physics class. If his scores on the first five quizzes are 41, 69, 63, 94, and 99, what score does he need on the final quiz for his overall mean to be at least 70?
# Create a vector named quiz_grades, using c() to combine the five scores so far.
quiz_grades <- c(41, 69, 63, 94, 99)
# least_grade holds 70, the minimum overall mean Frank needs.
least_grade <- 70
# nbr_quizzes holds the total number of quizzes Frank must take.
nbr_quizzes <- 6
# needed_grade is the minimum score required on the last quiz: the number of quizzes (nbr_quizzes) times the target mean (least_grade), minus the sum of the five previous scores (quiz_grades).
needed_grade <- nbr_quizzes * least_grade - sum(quiz_grades)
# Display the minimum grade needed.
needed_grade
[1] 54
# Create another vector, all_grades, holding all six grades, including needed_grade.
all_grades <- c(41, 69, 63, 94, 99, 54)
# Display the vector all_grades.
all_grades
[1] 41 69 63 94 99 54
# Confirm needed_grade with mean(), which divides the total of all grades by 6. If the result were not 70, the calculation above would be wrong.
mean(all_grades)
[1] 70
# sd() returns the sample standard deviation of all_grades, a measure of how spread out the values are around the mean.
sd(all_grades)
[1] 22.64509
# max() returns the largest value in all_grades.
max(all_grades)
[1] 99
# min() returns the smallest value in all_grades.
min(all_grades)
[1] 41
# summary() reports the minimum, 1st quartile, median, mean, 3rd quartile, and maximum of all_grades.
summary(all_grades)
Min. 1st Qu. Median Mean 3rd Qu. Max.
41.00 56.25 66.00 70.00 87.75 99.00
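To see what sd() computes, the sample standard deviation can be reproduced from its definition: the square root of the sum of squared deviations from the mean, divided by n - 1. A minimal sketch, in which the names `deviations` and `manual_sd` are introduced only for illustration:

```r
all_grades <- c(41, 69, 63, 94, 99, 54)

n <- length(all_grades)

# Distance of each grade from the mean
deviations <- all_grades - mean(all_grades)

# Sample standard deviation: note the n - 1 in the denominator
manual_sd <- sqrt(sum(deviations^2) / (n - 1))
manual_sd

# Built-in function, for comparison
sd(all_grades)
```

Both values should agree, which shows that sd() uses the n - 1 (sample) denominator rather than n.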
Now let us go back to Case-scenario 1
Another useful function is quantile(), which finds sample quantiles such as the quartiles:
# the 25% quantile (first quartile)
quantile(kate_grades, 1/4)
25%
67
# the 75% quantile (third quartile)
quantile(kate_grades, 3/4)
75%
73
# the function IQR finds the interquartile range
# IQR(x) = quantile(x, 3/4) - quantile(x, 1/4)
IQR(kate_grades)
[1] 6
Make comments about the output and run a similar query using Frank_grades
# quantile() returns the sample quantile for a given probability. With probability 1/4 (25%), it sorts the values and returns the first quartile, the point below which about a quarter of the data falls.
# For all_grades, the 25% quantile is 56.25.
quantile(all_grades, 1/4)
25%
56.25
# With probability 3/4 (75%), quantile() returns the third quartile, the point below which about three quarters of the sorted data falls.
# For all_grades, the 75% quantile is 87.75.
quantile(all_grades, 3/4)
75%
87.75
# IQR() returns the interquartile range: the difference between the third quartile, quantile(x, 3/4), and the first quartile, quantile(x, 1/4). It measures the spread of the middle half of the data.
# IQR(x) = quantile(x, 3/4) - quantile(x, 1/4)
IQR(all_grades)
[1] 31.5
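The value 56.25 is not one of Frank's grades because quantile() interpolates between neighbouring sorted values. The sketch below reproduces R's default rule (type 7) by hand. The helper `manual_quantile` is hypothetical and written only to illustrate the calculation.

```r
all_grades <- c(41, 69, 63, 94, 99, 54)

# Hypothetical helper illustrating R's default quantile rule (type 7):
# find the fractional position h in the sorted data, then interpolate
# linearly between the two neighbouring values.
manual_quantile <- function(x, p) {
  sorted <- sort(x)
  h <- 1 + (length(sorted) - 1) * p
  lower <- floor(h)
  upper <- ceiling(h)
  sorted[lower] + (h - lower) * (sorted[upper] - sorted[lower])
}

q1 <- manual_quantile(all_grades, 1/4)
q3 <- manual_quantile(all_grades, 3/4)

# First quartile, third quartile, and their difference (the IQR)
c(q1, q3, q3 - q1)
```

For the first quartile, the position is 1 + 5 × 0.25 = 2.25, so the result lies a quarter of the way from the 2nd sorted value (54) to the 3rd (63), which gives 56.25.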
Case-scenario 2
The average salary of 10 men is 72,000 and the average salary of 4 women is 84,000. Find the mean salary of all 14 people.
Solution
We can find the combined mean by weighting each group mean by its group size: multiply each mean by the number of people in that group, add the products, and divide by the total number of people.
Let \(n_1 = 10\) denote the number of men and \(y_1 = 72000\) their mean salary. Let \(n_2 = 4\) denote the number of women and \(y_2 = 84000\) their mean salary. Then the mean salary of all 14 individuals is: \(\frac{n_1 y_1 + n_2 y_2}{n_1 + n_2}\)
We can compute this in R as follows:
n_1 <- 10
n_2 <- 4
y_1 <- 72000
y_2 <- 84000
# Mean salary overall
salary_ave <- (n_1*y_1 + n_2*y_2)/(n_1+n_2)
salary_ave
[1] 75428.57
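The same calculation is available through R's built-in weighted.mean() function, with the group sizes as weights. In the sketch below, the vector names `group_sizes`, `group_means`, and `salary_ave_wm` are introduced only for illustration.

```r
# Group sizes act as the weights; group means are the values being averaged
group_sizes <- c(men = 10, women = 4)
group_means <- c(men = 72000, women = 84000)

# Weighted mean: sum(group_means * group_sizes) / sum(group_sizes)
salary_ave_wm <- weighted.mean(group_means, w = group_sizes)
salary_ave_wm

# For contrast, the unweighted mean of the two group means ignores group size
mean(group_means)
```

The unweighted mean of the two group means is 78000, which overstates the overall mean because the smaller group (the women) has the higher mean salary.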
Solve a similar problem by changing number of men and women as well as the average income for each group. Make comments about the output.
# n_1 is a variable holding the number of men
n_1 <- 20
# n_2 is a variable holding the number of women
n_2 <- 10
# y_1 is a variable holding the men's mean salary
y_1 <- 200000
# y_2 is a variable holding the women's mean salary
y_2 <- 150000
# salary_ave holds the mean salary of men and women together: the number of men (n_1) times the men's mean salary (y_1), plus the number of women (n_2) times the women's mean salary (y_2), divided by the total number of people (n_1 + n_2)
salary_ave <- (n_1*y_1 + n_2*y_2)/(n_1+n_2)
# Display salary_ave, the mean salary of all 30 individuals: 183,333.3
salary_ave
[1] 183333.3
Case-scenario 3
The frequency distribution below lists the results of a test given in Professor Wang’s String theory class.
| Score | Frequency |
|-------|-----------|
| 10 | 5 |
| 9 | 10 |
| 8 | 6 |
| 7 | 8 |
| 6 | 3 |
| 5 | 2 |
Find the mean, the median, and the standard deviation of the scores.
What percentage of the data lies within one standard deviation of the mean?
What percentage of the data lies within two standard deviations of the mean?
What percentage of the data lies within three standard deviations of the mean?
Draw a histogram to illustrate the data.
Solution
The allScores.csv file contains all the students’ scores in the quiz. We can read this file in R using the read.csv() function; the code below uses the closely related read.table() with header = TRUE and sep = "," (hint: first create a csv file with 6 rows and 2 columns)
getwd()
[1] "E:/School/Summer 2021/Security and Data Governance/Scripts"
scores <- read.table("allScores.csv", header = TRUE, sep = ",")
WangScores <- scores$Score
WangScores
[1] 10 9 8 7 6 5 10 10 10 10 9 9 9 9 9 9 9 9 9 8 8 8 8 8 7 7 7 7 7 7 7 6 6 5
Make comments about the code we just ran above.
# I was getting an error because R could not find the dataset, so I used getwd() to get my working directory and then put the dataset in that directory.
# read.table() reads allScores.csv into the data frame scores. header = TRUE treats the first line of the file as the column names, and sep = "," says the values are separated by commas.
# The variable WangScores holds the Score column of the scores data frame.
[1] 10 9 8 7 6 5 10 10 10 10 9 9 9 9 9 9 9 9 9 8 8 8 8 8 7 7 7 7 7 7 7 6 6 5
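If allScores.csv is not at hand, an equivalent file can be rebuilt from the frequency table with rep(). The sketch below is one possible way to do this, under two assumptions: the file has a single column named Score (as the code above suggests), and row order does not matter. It writes to a temporary file so nothing in the working directory is overwritten, and the names `score_values`, `frequencies`, `expanded`, `csv_path`, and `scores_check` are introduced only for illustration.

```r
# Frequency table from the problem statement
score_values <- c(10, 9, 8, 7, 6, 5)
frequencies  <- c(5, 10, 6, 8, 3, 2)

# Repeat each score as many times as it occurs
expanded <- data.frame(Score = rep(score_values, times = frequencies))

# Write to a temporary csv file and read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(expanded, csv_path, row.names = FALSE)
scores_check <- read.csv(csv_path)

# Number of observations and the recovered frequency table
length(scores_check$Score)
table(scores_check$Score)
```

The rows come out grouped by score rather than in the order shown above, but the mean, median, and standard deviation do not depend on the order.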
1. To find the mean, the median, the number of observations, and the standard deviation
The mean = 8
The median = 8
The number of observations = 34
The Standard Deviation = 1.435481
# Mean
Scores_mean <- mean(WangScores)
Scores_mean
[1] 8
# Median
Scores_median <- median(WangScores)
Scores_median
[1] 8
# Find number of observations
Scores_n <- length(WangScores)
Scores_n
[1] 34
# Find standard deviation
Scores_sd <- sd(WangScores)
Scores_sd
[1] 1.435481
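The same statistics can be computed directly from the frequency table, without expanding it into 34 individual scores. A minimal sketch, in which the variable names are introduced only for illustration:

```r
# Frequency table from the problem statement
score_values <- c(10, 9, 8, 7, 6, 5)
frequencies  <- c(5, 10, 6, 8, 3, 2)

# Number of observations is the sum of the frequencies
n_obs <- sum(frequencies)

# Mean: each score weighted by how often it occurs
freq_mean <- sum(frequencies * score_values) / n_obs

# Sample standard deviation: weighted squared deviations over n - 1
freq_sd <- sqrt(sum(frequencies * (score_values - freq_mean)^2) / (n_obs - 1))

c(n = n_obs, mean = freq_mean, sd = freq_sd)
```

This should reproduce the 34 observations, the mean of 8, and the standard deviation of about 1.435481 found above.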
Next we compare the data with the empirical rule, also referred to as the Three Sigma Rule or the 68-95-99.7 Rule.
- What percentage of the data lies within one standard deviation of the mean? # About 70.6% of the scores lie within one standard deviation (1.435481) of the mean 8, close to the 68% the empirical rule predicts.
# abs() makes the check two-sided, so scores far below the mean are not counted as "within"
scores_w1sd <- sum(abs(WangScores - Scores_mean)/Scores_sd < 1)/ Scores_n
# Proportion of observations within one standard deviation of the mean
scores_w1sd
[1] 0.7058824
## Difference from empirical
scores_w1sd - 0.68
[1] 0.02588235
- What percentage of the data lies within two standard deviations of the mean? # About 94.1% of the scores lie within two standard deviations of the mean 8, close to the 95% the empirical rule predicts.
## Within 2 sd
scores_w2sd <- sum(abs(WangScores - Scores_mean)/ Scores_sd < 2)/Scores_n
scores_w2sd
[1] 0.9411765
## Difference from empirical
scores_w2sd - 0.95
[1] -0.008823529
- What percentage of the data lies within three standard deviations of the mean? # All of the scores (100%) lie within three standard deviations of the mean 8, compared with the 99.7% the empirical rule predicts.
## Within 3 sd
scores_w3sd <- sum(abs(WangScores - Scores_mean)/ Scores_sd < 3)/Scores_n
scores_w3sd
[1] 1
## Difference from empirical
scores_w3sd - 0.9973
[1] 0.0027
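Explain the implications of the results obtained in this problem. # The scores cluster around the mean of 8, and every score lies within three standard deviations of it. The observed proportions are roughly in line with the empirical rule, but they cannot match 68% and 95% exactly because the test has only six distinct score values.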
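A two-sided check of the empirical rule needs the absolute value of the standardized score, so that values far below the mean are not counted as within range. The sketch below wraps that check in a hypothetical helper, `prop_within`, and rebuilds WangScores from the frequency table so the example is self-contained.

```r
# Rebuild the 34 scores from the frequency table (order does not matter here)
WangScores <- rep(c(10, 9, 8, 7, 6, 5), times = c(5, 10, 6, 8, 3, 2))

# Hypothetical helper: proportion of x within k standard deviations of its mean
prop_within <- function(x, k) {
  z <- (x - mean(x)) / sd(x)   # standardized scores
  mean(abs(z) < k)             # share of values with |z| below k
}

# Proportions within 1, 2, and 3 standard deviations
within_sd <- sapply(1:3, function(k) prop_within(WangScores, k))
names(within_sd) <- c("1 sd", "2 sd", "3 sd")
within_sd
```

Taking mean() of a logical vector gives the proportion of TRUE values, which is the same as summing the TRUE values and dividing by the number of observations.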
# Create histogram
hist(WangScores)
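Because the scores are whole numbers, the default bins chosen by hist() may not give each score its own bar. One option, sketched below, is to centre one bin on each score by passing explicit breaks. The title and axis label are illustrative choices, not part of the original exercise.

```r
# Rebuild the 34 scores from the frequency table
WangScores <- rep(c(10, 9, 8, 7, 6, 5), times = c(5, 10, 6, 8, 3, 2))

# Breaks at 4.5, 5.5, ..., 10.5 give one bin per integer score
score_breaks <- seq(4.5, 10.5, by = 1)

wang_hist <- hist(WangScores, breaks = score_breaks,
                  main = "Test scores", xlab = "Score")

# Bin counts, in order of increasing score (5 through 10)
wang_hist$counts
```

With these breaks the bar heights should match the Frequency column of the table, read from score 5 up to score 10.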

In addition, create a similar query, but this time for Frank_Scores.
- Draw a histogram, explain the output, and create a similar histogram for Frank_Scores.
1. Create a variable for Frank_Scores
Frank_Scores <- c(41,69,63,94,99,54)
Frank_Scores
[1] 41 69 63 94 99 54
2. Find the mean, the median, the number of observations, and the standard deviation of Frank_Scores
The mean is 70
The median is 66
The number of observations is 6
The Standard Deviation is 22.64509
# Mean
Scores_mean <- mean(Frank_Scores)
Scores_mean
[1] 70
# Median
Scores_median <- median(Frank_Scores)
Scores_median
[1] 66
# Find number of observations
Scores_n <- length(Frank_Scores)
Scores_n
[1] 6
# Find standard deviation
Scores_sd <- sd(Frank_Scores)
Scores_sd
[1] 22.64509
3. Calculate the proportion of scores within one standard deviation of the mean: count the scores for which the absolute difference between Frank_Scores and the mean, divided by the standard deviation, is less than 1, then divide that count by the number of observations Scores_n.
The empirical rule (68-95-99.7) predicts about 68%; here only 50% of the scores (3 of 6) lie within one standard deviation of the mean 70.
scores_w1sd <- sum(abs(Frank_Scores - Scores_mean)/Scores_sd < 1)/ Scores_n
# Proportion of observations within one standard deviation of the mean
scores_w1sd
[1] 0.5
## Difference from empirical
scores_w1sd - 0.68
[1] -0.18
4. Calculate the proportion of scores within two standard deviations of the mean in the same way, using < 2 in place of < 1.
The empirical rule (68-95-99.7) predicts about 95%; here all of the scores (100%) lie within two standard deviations of the mean 70.
## Within 2 sd
scores_w2sd <- sum(abs(Frank_Scores - Scores_mean)/ Scores_sd < 2)/Scores_n
scores_w2sd
[1] 1
## Difference from empirical
scores_w2sd - 0.95
[1] 0.05
5. Calculate the proportion of scores within three standard deviations of the mean in the same way, using < 3.
The empirical rule (68-95-99.7) predicts about 99.7%; here all of the scores (100%) lie within three standard deviations of the mean 70.
## Within 3 sd
scores_w3sd <- sum(abs(Frank_Scores - Scores_mean)/ Scores_sd < 3)/Scores_n
scores_w3sd
[1] 1
## Difference from empirical
scores_w3sd - 0.9973
[1] 0.0027
# Create histogram
hist(Frank_Scores)

The histogram of Frank_Scores does not show much, because it is built from only six observations.
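With so few observations, fixed-width bins and a plot of the individual scores can make the picture easier to read. The sketch below shows one possible approach. The bin width of 10, the title, and the axis labels are illustrative choices, not part of the original exercise.

```r
Frank_Scores <- c(41, 69, 63, 94, 99, 54)

# Histogram with bins of width 10 covering 40 to 100
frank_hist <- hist(Frank_Scores, breaks = seq(40, 100, by = 10),
                   main = "Frank's quiz scores", xlab = "Score")

# Number of scores in each bin, from (40,50] up to (90,100]
frank_hist$counts

# With six points, plotting each score individually can be more informative
stripchart(Frank_Scores, method = "stack", pch = 19, xlab = "Score")
```

The bin counts show two scores in the 90s, none between 70 and 90, and the rest at 69 or below, which matches the large standard deviation found earlier.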
