Let us continue getting started with R as we start discussing
important statistical concepts.
Solution
Given that \(x_1 = 71, x_2 = 69, x_3 = 79\), we want to find \(x_4\) such that the mean (average) grade is \(\bar{x} \geq 70\).
Notice that in this case \(n = 4\).
According to the information above: \(70 \times 4 = 71 + 69 + 79 + x_4\), so \(x_4 = 280 - 219 = 61\). When \(x_4 = 61\), the quiz average will be exactly 70.
# Grades so far
grades_before <- c(71, 69, 79)
# Average quiz grade wanted
wanted_grade <- 70
# Number of quizzes
n_quizzes <- 4
# Needed grade on quiz 4
x_4 <- n_quizzes*wanted_grade - sum(grades_before)
# Minimum grade needed by Kate
x_4
[1] 61
We create a variable, "grades_before", storing Kate's current grades, and
another, "wanted_grade", storing the overall mean grade she wants. We also
create a variable storing the number of quizzes. We then multiply the
number of quizzes by the wanted grade and subtract the sum of the earlier
grades, which gives the score needed on quiz 4. Working through this
equation was satisfying; it was very practical and made me feel somewhat
more confident in my math skills.
According to the calculations above, Kate must score 61 or better on
the final quiz to get an average quiz grade of at least 70.
We can confirm this by using the mean() function in R:
# Quiz grades
kate_grades <- c(71, 69, 79,61)
# Find mean
mean(kate_grades)
[1] 70
# Find standard deviation
sd(kate_grades)
[1] 7.393691
# Find maximum grade
max(kate_grades)
[1] 79
# Find minimum grade
min(kate_grades)
[1] 61
We can see that Kate's mean grade is 70 when her minimum required final
score of 61 is included. The standard deviation of the four grades is
7.39, her maximum grade is 79, and her minimum grade is 61.
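To see where the 7.39 comes from, the sample standard deviation can be rebuilt by hand. This is a minimal sketch; the names `deviations` and `manual_sd` are introduced here for illustration and are not part of the original exercise.

```r
# Kate's four quiz grades, as above
kate_grades <- c(71, 69, 79, 61)
n <- length(kate_grades)

# Distance of each grade from the mean
deviations <- kate_grades - mean(kate_grades)

# Sample standard deviation: divide the squared deviations by n - 1, not n
manual_sd <- sqrt(sum(deviations^2) / (n - 1))
manual_sd

# Should agree with the built-in sd()
all.equal(manual_sd, sd(kate_grades))
```

The division by n - 1 rather than n is what sd() does in R, which is why the two values match.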
#We can also use the summary() function to find basic
statistics, including the median!
summary(kate_grades)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     61      67      70      70      73      79
#Next, I would like you to explain in detail every single task we
completed above. In addition, let us deal with a similar case scenario
and complete every single task we execute in Case-scenario 1.
We see that the minimum and maximum of Kate's grades are listed as
the first and last values of the summary. We also see the 1st and 3rd
quartiles, which we compute directly with quantile() in a later step,
and the median in the middle of the list.
Frank must take six quizzes in a Physics class. If his scores on the
first five quizzes are 41, 69, 63, 94, and 99, what score does he need on
the final quiz for his overall mean to be at least 70?
Frnk_Grds_bfr <- c(41,69,63,94,99)
fn_Quizzes <- 6
x_6 <- wanted_grade*fn_Quizzes - sum(Frnk_Grds_bfr)
x_6
[1] 54
Frnk_Grds <- c(41,69,63,94,99,54)
mean(Frnk_Grds)
[1] 70
sd(Frnk_Grds)
[1] 22.64509
max(Frnk_Grds)
[1] 99
min(Frnk_Grds)
[1] 41
#We can see that Frank needs a grade of 54 on his final quiz to
receive an overall mean score of 70.
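Kate's and Frank's problems use the same arithmetic, so it can be wrapped in a small function. The sketch below assumes exactly one quiz remains; `needed_score` is a hypothetical helper name, not something from the original exercise.

```r
# Hypothetical helper: score needed on the one remaining quiz
# so that the overall mean reaches `target`
needed_score <- function(previous, target) {
  n_total <- length(previous) + 1   # quizzes taken so far plus the last one
  n_total * target - sum(previous)
}

kate_needed  <- needed_score(c(71, 69, 79), 70)
frank_needed <- needed_score(c(41, 69, 63, 94, 99), 70)
kate_needed
frank_needed
```

The two calls reproduce the values found above, 61 for Kate and 54 for Frank.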
Now let us go back to Case-scenario 1.
Another useful function is quantile(), which finds quantiles such as the 25% and 75% points:
# the 25%
quantile(kate_grades, 1/4)
25%
67
# the 75%
quantile(kate_grades, 3/4)
75%
73
# the function IQR finds the interquartile range
# IQR(x) = quantile(x, 3/4) - quantile(x, 1/4)
IQR(kate_grades)
[1] 6
quantile(kate_grades, 2/4)
50%
70
Using the quantile() function we can select which quantiles we want
displayed; here we display the first and third quartiles, which
summary() also reports by default. The IQR() function returns the
difference between the third and first quartiles, which is 6 grade
points (73 - 67). I also included the second quartile (the median) of
kate_grades at the end, which is 70: 3 points above the first quartile
and 3 points below the third.
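quantile() also accepts a vector of probabilities, so all three quartiles can be requested in one call. This is a small sketch; `quartiles` and `iqr_manual` are names chosen here for illustration.

```r
kate_grades <- c(71, 69, 79, 61)

# Ask for the 25%, 50% and 75% points at once
quartiles <- quantile(kate_grades, probs = c(0.25, 0.5, 0.75))
quartiles

# The interquartile range is the distance between the outer two
iqr_manual <- unname(quartiles["75%"] - quartiles["25%"])
iqr_manual

# Should agree with the built-in IQR()
all.equal(iqr_manual, IQR(kate_grades))
```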
Make comments about the output and run a similar query using
Frank_grades
# the 25%
quantile(Frnk_Grds, 1/4)
25%
56.25
# the 75%
quantile(Frnk_Grds, 3/4)
75%
87.75
# the function IQR finds the interquartile range
# IQR(x) = quantile(x, 3/4) - quantile(x, 1/4)
IQR(Frnk_Grds)
[1] 31.5
Case-scenario 2
The average salary of 10 men is 72,000 and the average salary of 4
women is 84,000. Find the mean salary of all 14 people.
Solution
We can find the combined mean by multiplying each group's mean by its
group size, adding the two totals, and dividing by the total number of
people.
Let \(n_1 = 10\) denote the number of men, and \(y_1 = 72000\) their mean
salary. Let \(n_2 = 4\) denote the number of women and \(y_2 = 84000\) their mean
salary. Then the mean salary of all 14 individuals is: \(\frac{n_1 y_1 + n_2 y_2}{n_1 + n_2}\)
We can compute this in R as follows:
n_1 <- 10
n_2 <- 4
y_1 <- 72000
y_2 <- 84000
# Mean salary overall
salary_ave <- (n_1*y_1 + n_2*y_2)/(n_1+n_2)
salary_ave
[1] 75428.57
We can see that the average salary of these 14 men and women is
$75,428.57. A very interesting and simple equation.
Solve a similar problem by changing number of men and women as well
as the average income for each group. Make comments about the
output.
n_1 <- 12
n_2 <- 6
y_1 <- 99000
y_2 <- 78000
# Mean salary overall
salary_ave <- (n_1*y_1 + n_2*y_2)/(n_1+n_2)
salary_ave
[1] 92000
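After I changed the group sizes to 12 men and 6 women (18 people in
total) and their mean salaries to 99k and 78k, the overall mean salary
changed to 92k. It sits closer to the men's mean because there are twice
as many men as women.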
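The same combined mean can be obtained with base R's weighted.mean(), using the group sizes as weights. The sketch below reuses the figures from the first salary problem; the vector names `group_sizes` and `group_means` are introduced here for illustration.

```r
# Group sizes act as the weights for the group means
group_sizes <- c(men = 10, women = 4)
group_means <- c(men = 72000, women = 84000)

combined <- weighted.mean(group_means, w = group_sizes)
combined

# Same calculation written out by hand
manual <- sum(group_sizes * group_means) / sum(group_sizes)
all.equal(combined, manual)
```

Writing it this way makes it easy to add a third group later without changing the formula.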
Case-scenario 3
The frequency distribution below lists the results of a test given in
Professor Wang’s String theory class.
| Score | Frequency |
|-------|-----------|
| 10 | 5 |
| 9 | 10 |
| 8 | 6 |
| 7 | 8 |
| 6 | 3 |
| 5 | 2 |
1. Find the mean, the median and the standard deviation of the scores.
2. What percentage of the data lies within one standard deviation of the mean?
3. What percentage of the data lies within two standard deviations of the mean?
4. What percentage of the data lies within three standard deviations of the mean?
5. Draw a histogram to illustrate the data.
Solution
The allScores.csv file contains all the students' scores
in the quiz. We can read this file in R using the
read.csv() function (hint: first create a csv file with 6
rows and 2 columns)
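One way to get from the 6-row frequency table to a file with one score per student is to expand the table with rep(). This is only a sketch under assumptions: the column name `Score` matches the `scores$Score` used in this exercise, the `Frequency` column name is assumed, and the file is written to a temporary path rather than to allScores.csv.

```r
# Frequency table from the problem statement
freq_table <- data.frame(
  Score     = c(10, 9, 8, 7, 6, 5),
  Frequency = c(5, 10, 6, 8, 3, 2)
)

# Repeat each score as many times as it occurs: one row per student
all_scores <- data.frame(
  Score = rep(freq_table$Score, times = freq_table$Frequency)
)

# Write to a temporary file (hypothetical path) and read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(all_scores, csv_path, row.names = FALSE)
scores_check <- read.csv(csv_path)

nrow(scores_check)
mean(scores_check$Score)
```

The order of the rows differs from the file used in this exercise, but the mean, median and standard deviation do not depend on row order.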
getwd()
[1] "C:/Users/antho/OneDrive/Documents/School/4.DataSecurity&Governance"
scores <- read.table("allScores.csv", header = TRUE, sep = ",")
WangScores <- scores$Score
WangScores
[1] 10 9 8 7 6 5 10 10 10 10 9 9 9 9 9 9 9 9 9 8 8 8 8 8 7 7 7 7 7 7 7 6 6 5
View(scores)
View(WangScores)
Make comments about the code we just ran above.
Here we find the mean (8), the median (8), and the standard
deviation (1.44)
# Mean
Scores_mean <- mean(WangScores)
Scores_mean
[1] 8
# Median
Scores_median <- median(WangScores)
Scores_median
[1] 8
# Find number of observations
Scores_n <- length(WangScores)
# Find standard deviation
Scores_sd <- sd(WangScores)
Scores_sd
[1] 1.435481
We can see that both the mean and the median of Professor Wang's scores
are 8 and the standard deviation is 1.435
#2. What percentage of the data lies within one standard deviation of
the mean?
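# abs() measures distance on both sides of the mean; without it every score below the mean is counted
scores_w1sd <- sum(abs(WangScores - Scores_mean)/Scores_sd < 1)/ Scores_n
# Proportion of observations within one standard deviation of the mean
scores_w1sd
[1] 0.7058824
## Difference from empirical
scores_w1sd - 0.68
[1] 0.02588235
We can see that about 71% of the scores (24 of 34) fall within one
standard deviation of the mean, close to the 68% suggested by the
empirical rule.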
#3. What percentage of the data lies within two standard deviations
of the mean?
## Within 2 sd
scores_w2sd <- sum(abs(WangScores - Scores_mean)/ Scores_sd < 2)/Scores_n
scores_w2sd
[1] 0.9411765
## Difference from empirical
scores_w2sd - 0.95
[1] -0.008823529
We can see that about 94% of the scores (32 of 34) fall within two
standard deviations of the mean; only the two scores of 5 lie farther
away.
#4. What percent of the data lies within three standard deviations of
the mean?
## Within 3 sd
scores_w3sd <- sum(abs(WangScores - Scores_mean)/ Scores_sd < 3)/Scores_n
scores_w3sd
[1] 1
## Difference from empirical
scores_w3sd - 0.9973
[1] 0.0027
We can see that 100% of the scores fall within three standard
deviations of the mean
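The three questions repeat one calculation with a different multiple of the standard deviation, so they can share a helper. The sketch below is an illustration only: `prop_within` is a hypothetical name, the scores are rebuilt from the frequency table rather than read from allScores.csv, and abs() is used so that distance is measured on both sides of the mean.

```r
# Hypothetical helper: share of x lying strictly within k sample
# standard deviations of the mean, on either side
prop_within <- function(x, k) {
  mean(abs(x - mean(x)) < k * sd(x))
}

# Scores rebuilt from the frequency table in the problem statement
wang <- rep(c(10, 9, 8, 7, 6, 5), times = c(5, 10, 6, 8, 3, 2))

# Proportions within 1, 2 and 3 standard deviations
within_123 <- sapply(1:3, function(k) prop_within(wang, k))
within_123

# Differences from the empirical-rule values
within_123 - c(0.68, 0.95, 0.9973)
```

Taking mean() of a logical vector gives the proportion of TRUE values, which replaces the sum(...) / n pattern.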
Explain the implications of the results obtained in this problem. In
addition, create a similar query but this time addressing
Frank_Scores
# Mean
FrScores_mean <- mean(Frnk_Grds)
FrScores_mean
[1] 70
# Median
FrScores_median <- median(Frnk_Grds)
FrScores_median
[1] 66
# Find number of observations
FrScores_n <- length(Frnk_Grds)
# Find standard deviation
FrScores_sd <- sd(Frnk_Grds)
FrScores_sd
[1] 22.64509
Frscores_w1sd <- sum(abs(Frnk_Grds - FrScores_mean)/FrScores_sd < 1)/ FrScores_n
# Proportion of observations within one standard deviation of the mean
Frscores_w1sd
[1] 0.5
## Difference from empirical
Frscores_w1sd - 0.68
[1] -0.18
Frscores_w2sd <- sum(abs(Frnk_Grds - FrScores_mean)/FrScores_sd < 2)/ FrScores_n
# Proportion of observations within two standard deviations of the mean
Frscores_w2sd
[1] 1
## Difference from empirical
Frscores_w2sd - 0.95
[1] 0.05
Frscores_w3sd <- sum(abs(Frnk_Grds - FrScores_mean)/FrScores_sd < 3)/ FrScores_n
# Proportion of observations within three standard deviations of the mean
Frscores_w3sd
[1] 1
## Difference from empirical
Frscores_w3sd - 0.9973
[1] 0.0027
- Draw a histogram
# Create histogram
hist(WangScores)

Explain the output and create a similar histogram for Frank_Scores.
We can see that a histogram is produced that shows the distribution of
Professor Wang's scores, most of which fall between 7 and 9. I don't find
this histogram visually pleasing, and it is somewhat misleading, since
there are 5 bars in the chart yet 6 distinct score values, ranging from 5
to 10.
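The five bars are most likely a side effect of the default bin edges: with breaks at 5, 6, ..., 10 there are only five intervals, and the first one holds both the 5s and the 6s. One possible fix, sketched below, is to centre a bin on each whole score. The half-integer `score_breaks` and the plot labels are choices made here, not part of the original exercise, and the scores are rebuilt from the frequency table.

```r
# Scores rebuilt from the frequency table in the problem statement
wang <- rep(c(10, 9, 8, 7, 6, 5), times = c(5, 10, 6, 8, 3, 2))

# Bin edges halfway between whole scores: one bar per score value
score_breaks <- seq(4.5, 10.5, by = 1)

h <- hist(wang,
          breaks = score_breaks,
          main   = "Test scores, one bar per score",
          xlab   = "Score")

# Bar heights for scores 5 through 10
h$counts
```

With these breaks the bar heights match the frequency table directly, which makes the chart easier to check against the data.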
hist(Frnk_Grds)

hist(Frnk_Grds_bfr)

Notice in the two graphs above that once we add the minimum grade
Frank requires to his earlier grades, the distribution becomes flat.
Maybe we can change this by assuming that Frank's 6th grade would be the
mean (73) of his original grades and not the minimum required (54).
mean(Frnk_Grds_bfr)
[1] 73.2
Frnk_Grds_bfr
[1] 41 69 63 94 99
Frnk_Grds_Bst <- c(41,69,63,94,99,73)
Frnk_Grds_Bst
[1] 41 69 63 94 99 73
After we make this change we can see in the graph below that the
distribution leans more toward the higher values than it did with the
low value.
hist(Frnk_Grds_Bst)

