1a. Produce descriptive statistics for GPA_year1. Produce a histogram.
In one or two sentences, summarize what this variable looks like: its
centrality, dispersion, etc.
summary(hw5$GPA_year1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.850 2.500 2.437 3.100 4.300
sd(hw5$GPA_year1)
## [1] 0.8923705
ggplot(hw5, aes(x = GPA_year1)) +
geom_histogram(alpha = 0.5, position = "identity", bins = 10, color = "black")
Minimum = 0.000, 1st Qu. = 1.850, Median = 2.500, Mean = 2.437,
3rd Qu. = 3.100, Maximum = 4.300, Standard deviation = 0.8923705
The distribution is slightly left-skewed: the mean (2.437) sits just
below the median (2.500). Most of the data is centered around a
first-year GPA of about 2.4-2.6, with a standard deviation of about 0.89.
1b. The variable sex contains the gender (M/F) of the student. Use
the function table to determine how many males and females are in the
data
table(hw5$sex)
##
## F M
## 27381 16981
There are 27381 females and 16981 males in the data.
1c. Let µx be the population mean GPA for male students in their
first year, and µy be the population mean GPA for female students in
their first year. We are interested in testing the following:
H0 : µx = µy
H1 : µx ≠ µy
That is, we want to test if, in the population, males and females
have, on average, the same GPA in their first year of college based on
the sample of students we have. Calculate the following test
statistic:
t = (Xn - Yn) / sqrt(s2x/nx + s2y/ny)
where Xn is the sample average GPA of males in their first year, Yn
is the sample average GPA of females in their first year, s2x and s2y
are the squared sample standard deviations of each group, and nx and ny
are the numbers of observations in each group. What is the value of the
test statistic?
f_mean<-mean(hw5$GPA_year1[hw5$sex=="F"]) #2.43772
sd(hw5$GPA_year1[hw5$sex=="F"]) #0.8717473
## [1] 0.8717473
summary(hw5$GPA_year1[hw5$sex=="F"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.860 2.500 2.438 3.080 4.300
m_mean<-mean(hw5$GPA_year1[hw5$sex=="M"]) #2.436937
sd(hw5$GPA_year1[hw5$sex=="M"]) #0.9246819
## [1] 0.9246819
summary(hw5$GPA_year1[hw5$sex=="M"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.820 2.510 2.437 3.140 4.300
(2.436937 - 2.43772) / sqrt((((0.9246819^2)/16981)+((0.8717473^2)/27381)))
## [1] -0.08859657
Males = X, Females = Y
t = (Xn - Yn) / sqrt(s2x/nx + s2y/ny)
t = (2.436937 - 2.43772) /
sqrt((0.9246819^2)/16981 + (0.8717473^2)/27381)
t = -0.08859657
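The hand calculation above can be wrapped in a small helper. This is only a sketch, not part of the assignment: the function name welch_t and its argument names are hypothetical, and the inputs are the rounded group means, standard deviations, and counts reported above.

```r
# Hypothetical helper: two-sample t statistic with unequal variances,
# computed from summary statistics only (no raw data needed).
welch_t <- function(mean_x, mean_y, sd_x, sd_y, n_x, n_y) {
  se <- sqrt(sd_x^2 / n_x + sd_y^2 / n_y)  # standard error of the difference in means
  (mean_x - mean_y) / se
}

# Males = X, Females = Y, using the rounded values reported above
t_1c <- welch_t(2.436937, 2.43772, 0.9246819, 0.8717473, 16981, 27381)
print(t_1c)  # agrees with the hand calculation, about -0.0886
```

Writing the statistic as a function makes it easy to reuse for other grouping variables by swapping in their summary statistics.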
1d. Given that nx and ny are large, we can use the normal
approximation and conclude that our test statistic satisfies: t ∼ N(0,
1). Compute the quantile of the standard normal distribution such that
above that number, only 2.5% of the mass is found. Call this number t95.
Hint: use the function qnorm. The testing procedure is the following:
If |t| > t95 ⟹ Reject H0
If |t| ≤ t95 ⟹ Do not reject H0
Given the test statistic you found in part 1c, can we reject the
null hypothesis?
qnorm(0.975, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
## [1] 1.959964
t95 = qnorm(0.975) ≈ 1.96
We cannot reject the null hypothesis: |t| = 0.0886, which is less than
t95 = 1.96.
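The decision rule can be sketched as a short function. The name reject_h0 and the alpha argument are hypothetical, introduced only for illustration; the rule itself is the two-sided comparison of |t| against the 97.5% standard normal quantile described in the question.

```r
# Hypothetical helper: two-sided test decision under the normal approximation.
reject_h0 <- function(t_stat, alpha = 0.05) {
  t_crit <- qnorm(1 - alpha / 2)  # 1.959964 when alpha = 0.05
  abs(t_stat) > t_crit            # TRUE means "reject H0"
}

decision_1d <- reject_h0(-0.08859657)
print(decision_1d)  # FALSE: do not reject H0
```

Taking the absolute value matters here: the comparison is between |t| and t95, not between the signed statistic and t95.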
1e. Calculate the p-value associated with the test statistic from
part 1d. Hint: use the function pnorm.
pval_1e <- 2*pnorm(-0.08859657) # two-sided p-value, about 0.9294
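A slightly more general version of this p-value calculation is sketched below. The helper name two_sided_p is hypothetical; it uses -abs(t) so that the same expression works whether the test statistic is negative or positive.

```r
# Hypothetical helper: two-sided p-value under the N(0, 1) approximation.
two_sided_p <- function(t_stat) {
  2 * pnorm(-abs(t_stat))  # total area in both tails beyond |t|
}

p_1e <- two_sided_p(-0.08859657)
print(round(p_1e, 4))
```

Because the normal distribution is symmetric, two_sided_p(t) and two_sided_p(-t) return the same value.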
1f. Use the function t.test to conduct the same test. Compare the
test statistic, the p-value, and the decision to reject or not with your
responses for parts 1d-1e. If there are differences, explain why they
arise.
t.test(hw5$GPA_year1[hw5$sex=="M"],hw5$GPA_year1[hw5$sex=="F"])
##
## Welch Two Sample t-test
##
## data: hw5$GPA_year1[hw5$sex == "M"] and hw5$GPA_year1[hw5$sex == "F"]
## t = -0.088598, df = 34380, p-value = 0.9294
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.01810542 0.01653940
## sample estimates:
## mean of x mean of y
## 2.436937 2.437720
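t = -0.088598
p = 0.9294
We still do not reject the null hypothesis: |t| = 0.0886 is well below
1.96 and the p-value of 0.9294 is far above 0.05. The t.test results are
essentially the same as in parts 1d-1e. The tiny difference in the test
statistic (-0.088598 vs. -0.08859657) most likely comes from plugging
rounded means and standard deviations into the hand calculation.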
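t.test reports a Welch test, which uses a t distribution with degrees of freedom from the Welch-Satterthwaite formula rather than the standard normal. The sketch below recomputes those degrees of freedom from the rounded summary statistics in part 1c; the helper name welch_df is hypothetical.

```r
# Hypothetical helper: Welch-Satterthwaite degrees of freedom
# from group standard deviations and sample sizes.
welch_df <- function(sd_x, sd_y, n_x, n_y) {
  a <- sd_x^2 / n_x
  b <- sd_y^2 / n_y
  (a + b)^2 / (a^2 / (n_x - 1) + b^2 / (n_y - 1))
}

df_1f <- welch_df(0.9246819, 0.8717473, 16981, 27381)
print(df_1f)  # close to the df = 34380 reported by t.test

# With this many degrees of freedom the t and normal p-values nearly coincide
p_t    <- 2 * pt(-abs(-0.08859657), df = df_1f)
p_norm <- 2 * pnorm(-abs(-0.08859657))
print(c(t_dist = p_t, normal = p_norm))
```

With tens of thousands of degrees of freedom, the t distribution is practically indistinguishable from N(0, 1), which is why the two approaches agree here.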
1g. Produce two overlapping histograms for the variable GPA_year1,
one for males and one for females. Place two vertical lines
corresponding to the sample mean of each group. Are the above
statistical tests consistent with what you can observe in the
histograms?
ggplot(hw5, aes(x = GPA_year1, fill = sex)) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 10, color = "white") +
  # linewidth replaces the size aesthetic for lines (size was deprecated in ggplot2 3.4.0)
  geom_vline(aes(xintercept = m_mean, color = "Male GPA Mean"), linetype = "dashed", linewidth = 1) +
  geom_vline(aes(xintercept = f_mean, color = "Female GPA Mean"), linetype = "dashed", linewidth = 1.5) +
  labs(title = "Male vs. Female GPA Year One", x = "GPA Score", y = "Number of People")
Yes. The two histograms have nearly the same shape and the two mean
lines almost coincide, which is consistent with not rejecting the null
hypothesis: on average, males and females have similar GPAs during their
first year in college.
1h. Interpret the results using your intuition. Did you find
evidence in favor or against the claim that males and females have the
same GPA in their first year of college in the population? Do the
results corroborate your expectations? Try to explain why the results
are the way they are (it does not have to be the correct
explanation).
I found no evidence against the claim that males and females have the
same GPA during their first year. Both groups had almost the same mean
GPA (2.437 for males and 2.438 for females), and their other summary
statistics are very similar as well. I think the only reason some of the
numbers are a bit different is that there are many more females than
males in the sample, but that didn't skew the results as much as I
expected.
2. The variable bpl_north_america takes the value of 1 when the
student was born in North America and 0 otherwise. Repeat parts 1b-1h,
using bpl_north_america instead of sex, to understand if students who
are born in North America have systematically different GPAs in their
first year compared to students who are not born in North America.
Enumerate your answers from 2b to 2h.
2b. The variable bpl_north_america takes the value of 1 when the
student was born in North America and 0 otherwise. Use the function
table to determine how many students are in each group.
table(hw5$bpl_north_america)
##
## 0 1
## 5729 38633
There are 38633 students born in North America and 5729 students born
elsewhere in the data.
2c. Let µx be the population mean GPA for students born in North
America in their first year, and µy be the population mean GPA for
students born otherwise in their first year. We are interested in
testing the following:
H0 : µx = µy
H1 : µx ≠ µy
That is, we want to test if, in the population, students born in
North America and those born elsewhere have, on average, the same GPA in
their first year of college based on the sample of students we have.
Calculate the following test statistic:
t = (Xn - Yn) / sqrt(s2x/nx + s2y/ny)
where Xn is the sample average GPA of students born in North America
in their first year, Yn is the sample average GPA of students born
elsewhere in their first year, s2x and s2y are the squared sample
standard deviations of each group, and nx and ny are the numbers of
observations in each group. What is the value of the test statistic?
NorAm_mean<-mean(hw5$GPA_year1[hw5$bpl_north_america=="1"]) #2.440751
sd(hw5$GPA_year1[hw5$bpl_north_america=="1"]) #0.8828401
## [1] 0.8828401
summary(hw5$GPA_year1[hw5$bpl_north_america=="1"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.860 2.500 2.441 3.100 4.300
OW_mean<-mean(hw5$GPA_year1[hw5$bpl_north_america=="0"]) #2.414959
sd(hw5$GPA_year1[hw5$bpl_north_america=="0"]) #0.9539338
## [1] 0.9539338
(2.440751 - 2.414959) / sqrt((((0.8828401^2)/38633)+((0.9539338^2)/5729)))
## [1] 1.92771
2d. Given that nx and ny are large, we can use the normal
approximation and conclude that our test statistic satisfies: t ∼ N(0,
1). Compute the quantile of the standard normal distribution such that
above that number, only 2.5% of the mass is found. Call this number t95.
Hint: use the function qnorm. The testing procedure is the following:
If |t| > t95 ⟹ Reject H0
If |t| ≤ t95 ⟹ Do not reject H0
Given the test statistic you found in part 2c, can we reject the
null hypothesis?
qnorm(0.975, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) #1.96
## [1] 1.959964
We cannot reject the null hypothesis: |t| = 1.92771, which is less
than t95 = 1.96 (although only narrowly).
2e. Calculate the p-value associated with the test statistic from
part 2d. Hint: use the function pnorm.
pval_2e <- 2 * (1 - pnorm(1.92771)) #0.05389121
2f. Use the function t.test to conduct the same test. Compare the
test statistic, the p-value, and the decision to reject or not with your
responses for parts 2d-2e. If there are differences, explain why they
arise.
t.test(hw5$GPA_year1[hw5$bpl_north_america=="1"],hw5$GPA_year1[hw5$bpl_north_america=="0"])
##
## Welch Two Sample t-test
##
## data: hw5$GPA_year1[hw5$bpl_north_america == "1"] and hw5$GPA_year1[hw5$bpl_north_america == "0"]
## t = 1.9277, df = 7258.1, p-value = 0.05393
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.0004362443 0.0520195881
## sample estimates:
## mean of x mean of y
## 2.440751 2.414959
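The test statistics agree (1.92771 by hand vs. 1.9277 from t.test).
The p-values differ slightly (0.05389 vs. 0.05393) because part 2e uses
the standard normal distribution, while t.test uses a t distribution
with df = 7258.1, which has slightly heavier tails. The decision is the
same either way: we do not reject the null hypothesis at the 5% level.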
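The gap between a normal-based p-value and the one t.test reports can be checked directly. This sketch takes the test statistic from part 2c and the degrees of freedom printed by t.test as given; the variable names are hypothetical.

```r
t_2c  <- 1.92771  # test statistic from part 2c
df_2f <- 7258.1   # degrees of freedom reported by t.test in part 2f

p_normal <- 2 * pnorm(-abs(t_2c))           # normal approximation, as in part 2e
p_welch  <- 2 * pt(-abs(t_2c), df = df_2f)  # t distribution, as t.test uses

print(c(normal = p_normal, welch = p_welch))
```

The t-based p-value comes out slightly larger because the t distribution puts a little more mass in the tails than N(0, 1), and the gap shrinks as the degrees of freedom grow.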
2g. Produce two overlapping histograms for the variable GPA_year1,
one for North American born students and one for other students.
Place two vertical lines corresponding to the sample mean of each group.
Are the above statistical tests consistent with what you can observe in
the histograms?
hw5$bpl_north_america<-as.factor(hw5$bpl_north_america)
ggplot(hw5, aes(x = GPA_year1, fill = bpl_north_america)) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 10, color = "white") +
  # linewidth replaces the size aesthetic for lines (size was deprecated in ggplot2 3.4.0)
  geom_vline(aes(xintercept = NorAm_mean, color = "North America Mean"), linetype = "dashed", linewidth = 1.5) +
  geom_vline(aes(xintercept = OW_mean, color = "Otherwise Mean"), linetype = "dashed", linewidth = 1) +
  labs(title = "North American Born vs. Otherwise Students GPA Year One", x = "GPA Score", y = "Number of People")
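Yes. The two histograms have similar shapes and the mean lines are
close together (2.441 vs. 2.415), which is consistent with not rejecting
the null hypothesis at the 5% level: on average, students born in North
America and students born elsewhere have similar GPAs during their first
year in college.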
2h. Interpret the results using your intuition. Did you find
evidence in favor or against the claim that students born in North
American or Otherwise have the same GPA in their first year of college
in the population? Do the results corroborate your expectations? Try to
explain why the results are the way they are (it does not have to be the
correct explanation).
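The sample means differ by about 0.03 points (2.441 for students born
in North America vs. 2.415 for others), the standard deviations differ
by about 0.07 points, and the North American group is almost 7 times the
size of the other group (38633 vs. 5729 students). With t = 1.93 and a
p-value of about 0.054, the difference falls just short of significance
at the 5% level, so I did not find convincing evidence against the claim
that the two groups have the same mean GPA, although the result is
borderline.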
3. The variable age_at_entry contains the age in years at which
students entered college. Use the function table to tabulate the
variable. What values does it take? Create a new variable named
very_young that takes the value of 1 if the student was 18 years old or
younger when he/she entered college and 0 if not. Use this new variable
very_young instead of the variable sex to repeat parts 1b-1h, trying to
understand if very young students have similar GPAs in their first year
of college to students who are older when they enter college. Enumerate
your answers from 3b to 3h.
table(hw5$age_at_entry)
##
## 17 18 19 20 21
## 2297 14380 24071 2901 713
hw5$very_young<-ifelse(hw5$age_at_entry<=18, 1, 0)
The variable age_at_entry takes the values 17, 18, 19, 20, and 21.
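As a quick sanity check on the indicator logic, the sketch below applies the same ifelse rule to a small made-up vector. The ages are hypothetical and not taken from hw5; the as.integer form is shown only as an equivalent alternative.

```r
# Hypothetical ages, not taken from hw5
age <- c(17, 18, 19, 20, 21)

very_young_ifelse <- ifelse(age <= 18, 1, 0)  # same rule as used for hw5$very_young
very_young_int    <- as.integer(age <= 18)    # equivalent: TRUE/FALSE coerced to 1/0

print(very_young_ifelse)
print(table(very_young_int))
```

Both forms flag ages 17 and 18 as 1 and everything else as 0, so the choice between them is a matter of style.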
3b. Use the function table to determine how many students are in each
age category in the data (1 means 18 or younger, i.e. 17 or 18; 0 means
19 or older, i.e. 19, 20, or 21).
table(hw5$very_young)
##
## 0 1
## 27685 16677
There are 16677 students who were 17 or 18 when they entered college
and 27685 who were 19, 20, or 21.
3c. Let µx be the population mean GPA for very young students (18 or
younger, very_young = 1) in their first year, and µy be the population
mean GPA for older students (19 or older, very_young = 0) in their first
year. We are interested in testing the following:
H0 : µx = µy
H1 : µx ≠ µy
That is, we want to test if, in the population, very young and older
students have, on average, the same GPA in their first year of college
based on the sample of students we have. Calculate the following test
statistic:
t = (Xn - Yn) / sqrt(s2x/nx + s2y/ny)
where Xn is the sample average GPA of very young students in their
first year, Yn is the sample average GPA of older students in their
first year, s2x and s2y are the squared sample standard deviations of
each group, and nx and ny are the numbers of observations in each group.
What is the value of the test statistic?
young_mean<-mean(hw5$GPA_year1[hw5$very_young==1]) #2.522095
sd(hw5$GPA_year1[hw5$very_young==1]) #0.8881439
## [1] 0.8881439
old_mean<-mean(hw5$GPA_year1[hw5$very_young==0]) #2.386413
sd(hw5$GPA_year1[hw5$very_young==0]) #0.8910477
## [1] 0.8910477
Young = X, Old = Y
t = (Xn - Yn) / sqrt(s2x/nx + s2y/ny)
t = (2.522095 - 2.386413) /
sqrt((0.8881439^2)/16677 + (0.8910477^2)/27685)
t = 15.56613
3d. Given that nx and ny are large, we can use the normal
approximation and conclude that our test statistic satisfies: t ∼ N(0,
1). Compute the quantile of the standard normal distribution such that
above that number, only 2.5% of the mass is found. Call this number t95.
Hint: use the function qnorm. The testing procedure is the following:
If |t| > t95 ⟹ Reject H0
If |t| ≤ t95 ⟹ Do not reject H0
Given the test statistic you found in part 3c, can we reject the
null hypothesis?
qnorm(0.975, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) #1.96
## [1] 1.959964
We can reject the null hypothesis: |t| = 15.56613, which is far larger
than t95 = 1.96.
3e. Calculate the p-value associated with the test statistic from
part 3d. Hint: use the function pnorm.
pval_3e <- 2*(1 - pnorm(15.56613)) # prints 0; the true p-value is positive but too small for 1 - pnorm() to resolve
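When |t| is this large, 1 - pnorm(t) evaluates to exactly 0 in double-precision arithmetic because pnorm(t) rounds to 1. The sketch below contrasts that with asking pnorm for the upper tail directly through its lower.tail argument; the variable names are hypothetical.

```r
t_3c <- 15.56613  # test statistic from part 3c

p_naive <- 2 * (1 - pnorm(t_3c))                # pnorm(t_3c) rounds to 1, so this is exactly 0
p_tail  <- 2 * pnorm(t_3c, lower.tail = FALSE)  # computes the upper tail area directly

print(p_naive)
print(p_tail)  # tiny but strictly positive
```

Either way the conclusion is the same (reject H0), but the second form avoids reporting a p-value of literally zero.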
3f. Use the function t.test to conduct the same test. Compare the
test statistic, the p-value, and the decision to reject or not with your
responses for parts 3d-3e. If there are differences, explain why they
arise.
t.test(hw5$GPA_year1[hw5$very_young=="1"], hw5$GPA_year1[hw5$very_young=="0"])
##
## Welch Two Sample t-test
##
## data: hw5$GPA_year1[hw5$very_young == "1"] and hw5$GPA_year1[hw5$very_young == "0"]
## t = 15.566, df = 35228, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1185967 0.1527659
## sample estimates:
## mean of x mean of y
## 2.522095 2.386413
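The 95 percent confidence interval in this output can be approximately reproduced from the summary statistics in part 3c. This is a sketch under stated assumptions: it uses the rounded means and standard deviations reported earlier and takes df = 35228 from the t.test output rather than recomputing it; the variable names are hypothetical.

```r
# Rounded summary statistics from part 3c
mean_x <- 2.522095; sd_x <- 0.8881439; n_x <- 16677  # very_young == 1
mean_y <- 2.386413; sd_y <- 0.8910477; n_y <- 27685  # very_young == 0
df_3f  <- 35228  # degrees of freedom as printed by t.test

se     <- sqrt(sd_x^2 / n_x + sd_y^2 / n_y)  # standard error of the difference
margin <- qt(0.975, df = df_3f) * se         # half-width of the 95% interval
ci_3f  <- c(lower = (mean_x - mean_y) - margin,
            upper = (mean_x - mean_y) + margin)
print(ci_3f)  # close to the interval 0.1185967 to 0.1527659 reported by t.test
```

The interval lies entirely above 0, which is another way of seeing that the null hypothesis of equal means is rejected at the 5% level.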
3g. Produce two overlapping histograms for the variable GPA_year1,
one for students aged 18 or younger (1) and one for students aged 19 or
older (0). Place two vertical lines corresponding to the sample mean of
each group. Are the above statistical tests consistent with what you can
observe in the histograms?
hw5$very_young<-as.factor(hw5$very_young)
ggplot(hw5, aes(x = GPA_year1, fill = very_young)) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 10, color = "white") +
  # linewidth replaces the size aesthetic for lines (size was deprecated in ggplot2 3.4.0)
  geom_vline(aes(xintercept = young_mean, color = "<=18 years old (1) GPA Mean"), linetype = "dashed", linewidth = 1) +
  geom_vline(aes(xintercept = old_mean, color = ">=19 years old (0) GPA Mean"), linetype = "dashed", linewidth = 1.5) +
  labs(title = "Younger vs. Older Entering College GPA Year One", x = "GPA Score", y = "Number of People")
Yes. The graph is consistent with rejecting the null hypothesis: on
average, students who were 19 or older when entering college (0) do not
have the same GPA as students who were 18 or younger (1) during their
first year in college. Their mean is a bit lower, which you can see from
the gap between the dashed lines (for the other variables the lines were
nearly on top of each other).
3h. Interpret the results using your intuition. Did you find
evidence in favor or against the claim that very young and older
students have the same GPA in their first year of college in the
population? Do the results corroborate your expectations? Try to explain
why the results are the way they are (it does not have to be the correct
explanation).
I found evidence against the claim. The means for these two groups
differ by about 0.14 points (2.522 vs. 2.386), and the difference is
highly statistically significant (t = 15.57, p < 2.2e-16). These results
don't necessarily corroborate my initial expectations: I didn't think
there would be much of a difference. But, if I think about it more,
people who go to college early or on time (at 17-18 years of age) most
likely excelled or did well in high school and went straight to college.
We can also assume they had the means to go straight to college, and
students with higher household incomes usually had more help throughout
their educational experiences to do well in school and therefore excel
in college. Those who go to college a few years after high school may
not have excelled in school, could have been held back, or did not have
the means to go. That could mean the opposite of the above: if they had
a lower household income, they could have had less support throughout
their school years (a school district that was not as well resourced,
not much support at home, etc.).