1a. Produce descriptive statistics for GPA year1. Produce a histogram. In one or two sentences, summarize what this variable looks like, its centrality, dispersion, etc.

summary(hw5$GPA_year1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.850   2.500   2.437   3.100   4.300
sd(hw5$GPA_year1)
## [1] 0.8923705
ggplot(hw5, aes(x = GPA_year1)) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 10, color = "black")

#Minimum = 0 # 1st Qu. = 1.850 # Median = 2.500
# Mean = 2.437 # 3rd Qu.= 3.100
# Maximum = 4.300 # Standard deviation= 0.8923705

This variable is slightly left-skewed (the mean, 2.437, is below the median, 2.500). Most of the data is centered around a first-year GPA of about 2.4 to 2.6, with a standard deviation of about 0.89 and a range from 0 to 4.3.

1b. The variable sex contains the gender (M/F) of the student. Use the function table to determine how many males and females are in the data

table(hw5$sex)
## 
##     F     M 
## 27381 16981

There are 27381 females and 16981 males in this data.

1c. Let µx be the population mean GPA for male students in their first year, and µy be the population mean GPA for female students in their first year. We are interested in testing the following:

H0 : µx = µy

H1 : µx ≠ µy

That is, we want to test if, in the population, males and females have, on average, the same GPA in their first year of college based on the sample of students we have. Calculate the following test statistic:

t = (Xn - Yn) / sqrt((s2x / nx) + (s2y / ny))

Where Xn is the sample average GPA of males in their first year, Yn is the sample average GPA of females in their first year, s2 is the squared standard deviation for each group, and n is the number of observations for each group. What is the value of the test statistic?

f_mean<-mean(hw5$GPA_year1[hw5$sex=="F"]) #2.43772
sd(hw5$GPA_year1[hw5$sex=="F"]) #0.8717473
## [1] 0.8717473
summary(hw5$GPA_year1[hw5$sex=="F"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.860   2.500   2.438   3.080   4.300
m_mean<-mean(hw5$GPA_year1[hw5$sex=="M"]) #2.436937
sd(hw5$GPA_year1[hw5$sex=="M"]) #0.9246819
## [1] 0.9246819
summary(hw5$GPA_year1[hw5$sex=="M"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.820   2.510   2.437   3.140   4.300
(2.436937 - 2.43772) / sqrt((((0.9246819^2)/16981)+((0.8717473^2)/27381)))
## [1] -0.08859657

Males=X Females=Y

t = (Xn - Yn) / sqrt((s2x / nx) + (s2y / ny))

t = (2.436937 - 2.43772) / sqrt((0.9246819^2 / 16981) + (0.8717473^2 / 27381))

t= -0.08859657
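
The hand calculation above can be wrapped in a small helper so the same formula can be reused for other groupings. This is only a minimal sketch: the function name `welch_t` and its argument names are illustrative and not part of the assignment, and the inputs are the rounded summary statistics reported above rather than the raw hw5 data.

```r
# Hypothetical helper (not from the assignment): two-sample test statistic
# computed from summary statistics, allowing unequal variances.
welch_t <- function(mean_x, mean_y, sd_x, sd_y, n_x, n_y) {
  se <- sqrt(sd_x^2 / n_x + sd_y^2 / n_y)  # standard error of the difference
  (mean_x - mean_y) / se
}

# Males = X, Females = Y, using the values reported in part 1c
t_1c <- welch_t(2.436937, 2.43772, 0.9246819, 0.8717473, 16981, 27381)
print(t_1c)
```

Working from summary statistics keeps the sketch self-contained; with the data loaded, the same function could be fed `mean()`, `sd()`, and `length()` of each group instead of typed-in numbers.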

1d. Given that nx and ny are large, we can use the normal approximation and conclude that our test statistic satisfies: t ∼ N (0, 1). Compute the quantile of the standard normal distribution such that above that number, only 2.5% of the mass is found. Call this number t95. Hint: use the function qnorm. The testing procedure is the following:

If |t| > t95 ⇒ Reject H0

If |t| ≤ t95 ⇒ Do not reject H0

Given the test statistic you found in part 1c, can we reject the null hypothesis?

qnorm(0.975, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
## [1] 1.959964

t95 = 1.96

We cannot reject the null hypothesis: |t| = 0.08859657, which is less than t95 = 1.96.
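
The decision rule compares the absolute value of the test statistic with the critical value, so the sign of t does not matter. A short sketch of that rule follows; the helper name `reject_h0` is an assumption made for illustration, not something defined in the assignment.

```r
# Critical value: the 97.5th percentile of the standard normal distribution
t95 <- qnorm(0.975)

# Hypothetical helper: TRUE means "reject H0" at the 5% level (two-sided)
reject_h0 <- function(t_stat, crit = t95) {
  abs(t_stat) > crit
}

reject_1d <- reject_h0(-0.08859657)  # test statistic from part 1c
print(reject_1d)
```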

1e. Calculate the p-value associated with the test statistic from part 1d. Hint: use the function pnorm.

pval_1e <- 2 * pnorm(-abs(-0.08859657)) # two-sided p-value, about 0.929
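
A two-sided p-value doubles the area in one tail beyond |t|. The sketch below shows one way to write this so that it works for positive and negative statistics alike; `two_sided_p` is a hypothetical helper name, and the calculation assumes the N(0, 1) approximation from part 1d.

```r
# Hypothetical helper: two-sided p-value under the N(0, 1) approximation.
# Using -abs() makes the result the same for positive and negative statistics.
two_sided_p <- function(t_stat) {
  2 * pnorm(-abs(t_stat))
}

p_1e <- two_sided_p(-0.08859657)  # test statistic from part 1c
print(p_1e)
```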

1f. Use the function t.test to conduct the same test. Compare the test statistic, the p-value, and the decision to reject or not with your responses for parts 1d-1e. If there are differences, explain why they arise.

t.test(hw5$GPA_year1[hw5$sex=="M"],hw5$GPA_year1[hw5$sex=="F"])
## 
##  Welch Two Sample t-test
## 
## data:  hw5$GPA_year1[hw5$sex == "M"] and hw5$GPA_year1[hw5$sex == "F"]
## t = -0.088598, df = 34380, p-value = 0.9294
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.01810542  0.01653940
## sample estimates:
## mean of x mean of y 
##  2.436937  2.437720

t= -0.088598

p= 0.9294

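We still would not reject the null hypothesis, since |t| = 0.088598 is still less than 1.96 and the p-value (0.9294) is far above 0.05. The values are essentially the same as in parts 1d-1e. Any tiny differences arise because t.test works from the unrounded data and uses a t distribution with 34380 degrees of freedom rather than the standard normal, and the two are nearly identical at this sample size.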
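
The degrees of freedom in the t.test output come from the Welch-Satterthwaite approximation. As a rough sketch (the helper name `welch_df` is an assumption, and the inputs are the rounded standard deviations from part 1c), the value can be reconstructed and the resulting t critical value compared with the normal one:

```r
# Hypothetical helper: Welch-Satterthwaite degrees of freedom
# computed from each group's standard deviation and sample size.
welch_df <- function(sd_x, sd_y, n_x, n_y) {
  a <- sd_x^2 / n_x
  b <- sd_y^2 / n_y
  (a + b)^2 / (a^2 / (n_x - 1) + b^2 / (n_y - 1))
}

df_1f  <- welch_df(0.9246819, 0.8717473, 16981, 27381)  # Males = X, Females = Y
crit_t <- qt(0.975, df = df_1f)  # critical value from the t distribution
crit_z <- qnorm(0.975)           # critical value from the standard normal

print(c(df = df_1f, crit_t = crit_t, crit_z = crit_z))
```

With this many degrees of freedom the two critical values differ only far past the decimal places that matter here, which is why the hand calculation and t.test lead to the same decision.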

1g. Produce two overlapping histograms for the variable GPA year1—one for males and one for females. Place two vertical lines corresponding to the sample mean of each group. Are the above statistical tests consistent with what you can observe in the histograms?

ggplot(hw5, aes(x = GPA_year1, fill = sex)) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 10, color = "white") +
  geom_vline(aes(xintercept = m_mean, color = "Male GPA Mean"), linetype = "dashed", linewidth = 1) +
  geom_vline(aes(xintercept = f_mean, color = "Female GPA Mean"), linetype = "dashed", linewidth = 1.5) +
  labs(title = "Male vs. Female GPA Year One", x = "GPA Score", y = "Number of People")

## Yes, the graph is consistent with not rejecting the null hypothesis: the two histograms have nearly the same shape and the two mean lines sit on top of each other, so, on average, males and females have similar GPAs during their first year in college.

1h. Interpret the results using your intuition. Did you find evidence in favor or against the claim that males and females have the same GPA in their first year of college in the population? Do the results corroborate your expectations? Try to explain why the results are the way they are (it does not have to be the correct explanation).

I found no evidence against the claim that males and females have the same GPA during their first year: both groups had almost the same mean GPA (2.437 for males and 2.438 for females), and their other summary statistics are very similar as well. I think the only reason some of the numbers are a bit different is that the number of females is much higher than the number of males, but that didn’t skew the results as much as I expected.

2. The variable bpl north america takes the value of 1 when the student was born in North America and 0 otherwise. Repeat parts 1b-1h, using bpl north america instead of sex, to understand if students who are born in North America have systematically different GPAs in their first year compared to students who are not born in North America. Enumerate your answers from 2b to 2h.

2b. The variable bpl north america takes the value of 1 when the student was born in North America and 0 otherwise

table(hw5$bpl_north_america)
## 
##     0     1 
##  5729 38633

There are 38633 students born in North America and 5729 otherwise in this data.

2c. Let µx be the population mean GPA for students born in North America in their first year, and µy be the population mean GPA for students born otherwise in their first year. We are interested in testing the following:

H0 : µx = µy

H1 : µx ≠ µy

That is, we want to test if, in the population, students born in North America and those born elsewhere have, on average, the same GPA in their first year of college based on the sample of students we have. Calculate the following test statistic:

t = (Xn - Yn) / sqrt((s2x / nx) + (s2y / ny))

Where Xn is the sample average GPA of students born in North America in their first year, Yn is the sample average GPA of students born otherwise in their first year, s2 is the squared standard deviation for each group, and n is the number of observations for each group. What is the value of the test statistic?

NorAm_mean<-mean(hw5$GPA_year1[hw5$bpl_north_america=="1"])  #2.440751
sd(hw5$GPA_year1[hw5$bpl_north_america=="1"]) #0.8828401
## [1] 0.8828401
summary(hw5$GPA_year1[hw5$bpl_north_america=="1"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.860   2.500   2.441   3.100   4.300
OW_mean<-mean(hw5$GPA_year1[hw5$bpl_north_america=="0"]) #2.414959
sd(hw5$GPA_year1[hw5$bpl_north_america=="0"]) #0.9539338
## [1] 0.9539338
(2.440751 - 2.414959) / sqrt((((0.8828401^2)/38633)+((0.9539338^2)/5729)))
## [1] 1.92771

2d. Given that nx and ny are large, we can use the normal approximation and conclude that our test statistic satisfies: t ∼ N (0, 1). Compute the quantile of the standard normal distribution such that above that number, only 2.5% of the mass is found. Call this number t95. Hint: use the function qnorm. The testing procedure is the following:

If |t| > t95 ⇒ Reject H0

If |t| ≤ t95 ⇒ Do not reject H0

Given the test statistic you found in part 2c, can we reject the null hypothesis?

qnorm(0.975, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) #1.96
## [1] 1.959964

We cannot reject the null hypothesis: |t| = 1.92771, which is less than t95 = 1.96.

2e. Calculate the p-value associated with the test statistic from part 2d. Hint: use the function pnorm.

pval_2e <- 2 * (1 - pnorm(1.92771)) #0.05389121

2f. Use the function t.test to conduct the same test. Compare the test statistic, the p-value, and the decision to reject or not with your responses for parts 2d-2e. If there are differences, explain why they arise.

t.test(hw5$GPA_year1[hw5$bpl_north_america=="1"],hw5$GPA_year1[hw5$bpl_north_america=="0"])
## 
##  Welch Two Sample t-test
## 
## data:  hw5$GPA_year1[hw5$bpl_north_america == "1"] and hw5$GPA_year1[hw5$bpl_north_america == "0"]
## t = 1.9277, df = 7258.1, p-value = 0.05393
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.0004362443  0.0520195881
## sample estimates:
## mean of x mean of y 
##  2.440751  2.414959

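The test statistic is the same up to rounding (1.9277 from t.test versus 1.92771 by hand). The p-value differs slightly (0.05393 from t.test versus 0.05389 by hand) because t.test uses a t distribution with 7258.1 degrees of freedom, while the hand calculation uses the standard normal approximation. The decision is the same in both cases: we do not reject the null hypothesis at the 5% level.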
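
To see where the small gap in p-values comes from, the same test statistic can be evaluated under both reference distributions. This is a sketch using the rounded statistic from part 2c and the degrees of freedom printed by t.test in part 2f; the variable names are illustrative.

```r
t_2c  <- 1.92771  # test statistic from part 2c
df_2f <- 7258.1   # Welch degrees of freedom reported by t.test in part 2f

p_normal <- 2 * pnorm(-abs(t_2c))           # N(0, 1) approximation (part 2e)
p_t      <- 2 * pt(-abs(t_2c), df = df_2f)  # t distribution, as t.test uses

print(c(normal = p_normal, t = p_t))
```

The t distribution has slightly heavier tails than the normal, so its p-value is slightly larger, but with thousands of degrees of freedom the difference only shows up in the fifth decimal place.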

2g. Produce two overlapping histograms for the variable GPA year1—one for North American born students and one for other students. Place two vertical lines corresponding to the sample mean of each group. Are the above statistical tests consistent with what you can observe in the histograms?

hw5$bpl_north_america<-as.factor(hw5$bpl_north_america)

ggplot(hw5, aes(x = GPA_year1, fill = bpl_north_america)) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 10, color = "white") +
  geom_vline(aes(xintercept = NorAm_mean, color = "North America Mean"), linetype = "dashed", linewidth = 1.5) +
  geom_vline(aes(xintercept = OW_mean, color = "Otherwise Mean"), linetype = "dashed", linewidth = 1) +
  labs(title = "North American Born vs. Otherwise Students GPA Year One", x = "GPA Score", y = "Number of People")

## Yes, the graph is consistent with not rejecting the null hypothesis: on average, students born in North America and students born elsewhere have similar GPAs during their first year in college.

2h. Interpret the results using your intuition. Did you find evidence in favor or against the claim that students born in North American or Otherwise have the same GPA in their first year of college in the population? Do the results corroborate your expectations? Try to explain why the results are the way they are (it does not have to be the correct explanation).

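At the 5% level I did not find evidence against the claim that the two groups have the same first-year GPA, although the result is borderline (t = 1.93, p ≈ 0.054). Their means differ by about 0.03 points (students born in North America are higher), their standard deviations differ by about 0.07 points, and the North American group is almost 7 times the size of the other group (38633 students born in North America and 5729 born elsewhere).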
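
How close this comparison is to the 5% threshold can also be read from the 95% confidence interval that t.test printed in part 2f. The following sketch rebuilds that interval from the rounded summary statistics in part 2c; the variable names are illustrative and the degrees of freedom are copied from the t.test output rather than recomputed.

```r
mean_x <- 2.440751; sd_x <- 0.8828401; n_x <- 38633  # born in North America
mean_y <- 2.414959; sd_y <- 0.9539338; n_y <- 5729   # born elsewhere

diff_means <- mean_x - mean_y
se <- sqrt(sd_x^2 / n_x + sd_y^2 / n_y)  # standard error of the difference
df_2f <- 7258.1  # degrees of freedom reported by t.test in part 2f

# 95% confidence interval for the difference in means
ci_2f <- diff_means + c(-1, 1) * qt(0.975, df = df_2f) * se
print(ci_2f)
```

The lower end of the interval falls just below zero, which matches the decision not to reject H0 at the 5% level.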

3. The variable age at entry contains the age in years at which students entered college. Use the function table to tabulate the variable. What values does it take? Create a new variable named very young that takes the value of 1 if the student was 18 years old or younger when entering college and 0 if not. Use this new variable very young instead of the variable sex to repeat parts 1b-1h, trying to understand if very young students have similar GPAs in their first year of college to students who are older when they enter college. Enumerate your answers from 3b to 3h.

table(hw5$age_at_entry)
## 
##    17    18    19    20    21 
##  2297 14380 24071  2901   713
hw5$very_young<-ifelse(hw5$age_at_entry<=18, 1, 0)

The variable age_at_entry takes the values 17, 18, 19, 20, and 21.

3b. Use the function table to determine how many students are in each age category (1 is <= 18, i.e., 17 or 18; 0 is >= 19, i.e., 19, 20, or 21).

table(hw5$very_young)
## 
##     0     1 
## 27685 16677

There are 16677 students who were 17 or 18 when they entered college and 27685 who were 19, 20, or 21.
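
As a quick consistency check, the very_young counts can be rebuilt from the age_at_entry table alone. The sketch below does not use the hw5 data: it reconstructs a stand-in age vector from the counts printed above (an assumption made so that the example is self-contained) and applies the same ifelse rule.

```r
# Counts copied from the table(hw5$age_at_entry) output above
age_counts <- c("17" = 2297, "18" = 14380, "19" = 24071, "20" = 2901, "21" = 713)

# Stand-in for hw5$age_at_entry: one entry per student
age_at_entry <- rep(as.numeric(names(age_counts)), times = age_counts)

# Same rule as in the homework: 1 if 18 or younger at entry, 0 otherwise
very_young <- ifelse(age_at_entry <= 18, 1, 0)

counts_3b <- table(very_young)
print(counts_3b)
```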

3c. Let µx be the population mean GPA for <=18 (1) students in their first year, and µy be the population mean GPA for >=19 (0) students in their first year. We are interested in testing the following:

H0 : µx = µy

H1 : µx ≠ µy

That is, we want to test if, in the population, very young students (18 or younger at entry) and older students (19 or older at entry) have, on average, the same GPA in their first year of college based on the sample of students we have. Calculate the following test statistic:

t = (Xn - Yn) / sqrt((s2x / nx) + (s2y / ny))

Where Xn is the sample average first-year GPA of very young students, Yn is the sample average first-year GPA of older students, s2 is the squared standard deviation for each group, and n is the number of observations for each group. What is the value of the test statistic?

young_mean<-mean(hw5$GPA_year1[hw5$very_young==1]) #2.522095
sd(hw5$GPA_year1[hw5$very_young==1]) #0.8881439
## [1] 0.8881439
old_mean<-mean(hw5$GPA_year1[hw5$very_young==0]) #2.386413
sd(hw5$GPA_year1[hw5$very_young==0]) #0.8910477
## [1] 0.8910477

Young=X Old=Y

t = (Xn - Yn) / sqrt((s2x / nx) + (s2y / ny))

t = (2.522095 - 2.386413) / sqrt((0.8881439^2 / 16677) + (0.8910477^2 / 27685))

t = 15.56613

3d. Given that nx and ny are large, we can use the normal approximation and conclude that our test statistic satisfies: t ∼ N (0, 1). Compute the quantile of the standard normal distribution such that above that number, only 2.5% of the mass is found. Call this number t95. Hint: use the function qnorm. The testing procedure is the following:

If |t| > t95 ⇒ Reject H0

If |t| ≤ t95 ⇒ Do not reject H0

Given the test statistic you found in part 3c, can we reject the null hypothesis?

qnorm(0.975, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) #1.96
## [1] 1.959964

We can reject the null hypothesis: |t| = 15.56613, which is far larger than t95 = 1.96.

3e. Calculate the p-value associated with the test statistic from part 3d. Hint: use the function pnorm.

pval_3e <- 2 * pnorm(15.56613, lower.tail = FALSE)  # effectively 0; avoids 1 - pnorm() rounding to exactly 0
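
For a test statistic this large, the tail area is far smaller than what double-precision arithmetic can separate from 1, so the way the upper tail is computed matters. The sketch below compares subtracting from 1 with asking pnorm for the upper tail directly through its `lower.tail` argument; the variable names are illustrative.

```r
t_3c <- 15.56613  # test statistic from part 3c

p_subtract <- 2 * (1 - pnorm(t_3c))                # upper tail by subtraction
p_tail     <- 2 * pnorm(t_3c, lower.tail = FALSE)  # upper tail computed directly

print(c(subtract = p_subtract, direct = p_tail))
```

Both versions lead to the same decision (reject H0), but only the direct tail calculation returns a positive number, which is closer to how t.test reports the result as "p-value < 2.2e-16" instead of exactly 0.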

3f. Use the function t.test to conduct the same test. Compare the test statistic, the p-value, and the decision to reject or not with your responses for parts 3d-3e. If there are differences, explain why they arise.

t.test(hw5$GPA_year1[hw5$very_young=="1"], hw5$GPA_year1[hw5$very_young=="0"])
## 
##  Welch Two Sample t-test
## 
## data:  hw5$GPA_year1[hw5$very_young == "1"] and hw5$GPA_year1[hw5$very_young == "0"]
## t = 15.566, df = 35228, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1185967 0.1527659
## sample estimates:
## mean of x mean of y 
##  2.522095  2.386413

3g. Produce two overlapping histograms for the variable GPA year1—one for <=18 (1) and >=19 (0). Place two vertical lines corresponding to the sample mean of each group. Are the above statistical tests consistent with what you can observe in the histograms?

hw5$very_young<-as.factor(hw5$very_young)

ggplot(hw5, aes(x = GPA_year1, fill = very_young)) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 10, color = "white") +
  geom_vline(aes(xintercept = young_mean, color = "<=18 years old (1) GPA Mean"), linetype = "dashed", linewidth = 1) +
  geom_vline(aes(xintercept = old_mean, color = ">=19 years old (0) GPA Mean"), linetype = "dashed", linewidth = 1.5) +
  labs(title = "Younger vs. Older Entering College GPA Year One", x = "GPA Score", y = "Number of People")

## Yes, the graph is consistent with rejecting the null hypothesis: on average, students who were >=19 years old when entering college (0) have somewhat lower first-year GPAs than the <=18 (1) students, which you can see from the gap between the dashed lines (for the other variables the lines were on top of each other).

3h. Interpret the results using your intuition. Did you find evidence in favor or against the claim that very young students and older students have the same GPA in their first year of college in the population? Do the results corroborate your expectations? Try to explain why the results are the way they are (it does not have to be the correct explanation).

The means for these two groups differ by about 0.14 points (2.52 versus 2.39), which is a highly statistically significant difference, so I found evidence against the claim that the two groups have the same first-year GPA. These results don’t necessarily corroborate my initial expectations: I didn’t think there would be much of a difference. But, if I think about it more, people who go to college early or on time (at 17-18 years of age) most likely did well in high school and went straight to college. We can also assume they had the means to go straight to college, and students with higher household incomes usually had more help throughout their education to do well in school and therefore excel in college. Those who go to college a few years after high school may not have excelled in school, could have been held back, or did not have the means to go. That could mean the opposite of the above: if they had a lower household income throughout their school years, they could have had less support (a school district with fewer resources, not much support at home, etc.).