In Project #4, we used the normal distribution to begin asking whether NSCC students differ from national averages. Now, armed with a full toolkit – t-tests, proportion tests, two-sample methods, and confidence intervals – we return to the NSCC Student Dataset to build a more complete statistical portrait. In this project, you will conduct multiple hypothesis tests and construct confidence intervals. For each inference question, you are responsible for identifying the correct type of inference to apply. Consider carefully: Is the variable of interest numeric or categorical? Are you comparing one group to a standard, or two groups to each other?
Load the NSCC Student Dataset and familiarize yourself with its variables.
# Load the NSCC student dataset
nscc_student_data <- read.csv("nscc_student_data.csv")
# Preview the structure of the dataset
str(nscc_student_data)
## 'data.frame': 40 obs. of 15 variables:
## $ Gender : chr "Female" "Female" "Female" "Female" ...
## $ PulseRate : int 64 75 74 65 NA 72 72 60 66 60 ...
## $ CoinFlip1 : int 5 4 6 4 NA 6 6 3 7 6 ...
## $ CoinFlip2 : int 5 6 1 4 NA 5 6 5 8 5 ...
## $ Height : num 62 62 60 62 66 ...
## $ ShoeLength : num 11 11 10 10.8 NA ...
## $ Age : int 19 21 25 19 26 21 19 24 24 20 ...
## $ Siblings : int 4 3 2 1 6 1 2 2 3 1 ...
## $ RandomNum : int 797 749 13 613 53 836 423 16 12 543 ...
## $ HoursWorking: int 35 25 30 18 24 15 20 0 40 30 ...
## $ Credits : int 13 12 6 9 15 9 15 15 13 16 ...
## $ Birthday : chr "July 5" "December 27" "January 31" "6-13" ...
## $ ProfsAge : int 31 30 29 31 32 32 28 28 31 28 ...
## $ Coffee : chr "No" "Yes" "Yes" "Yes" ...
## $ VoterReg : chr "Yes" "Yes" "No" "Yes" ...
Variable Classification:
According to the American Association of Community Colleges (AACC), the national average age of community college students is 26 years old. Is the average age of students in this NSCC sample consistent with that national figure, or do NSCC students differ?
a. Write the hypotheses. H0: mu = 26 HA: mu != 26
Single sample t-test
b. Calculate the test statistic and p-value.
t.test(nscc_student_data$Age, mu=26)
##
## One Sample t-test
##
## data: nscc_student_data$Age
## t = -1.1284, df = 39, p-value = 0.266
## alternative hypothesis: true mean is not equal to 26
## 95 percent confidence interval:
## 22.36979 27.03021
## sample estimates:
## mean of x
## 24.7
c. Decision and Conclusion.
p-value = 0.266 > alpha = 0.05 -> fail to reject H0
There is not sufficient evidence to support the claim that the mean age of NSCC students differs from the national average of 26.
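The statistic reported by `t.test()` can be checked by hand from the printed summary. The sketch below assumes only the numbers shown in the output above (sample mean, 95% interval, and n = 40); the names `xbar`, `mu0`, `se`, `t_stat`, and `p_value` are introduced here for illustration and do not appear in the project.

```r
# Summary values copied from the t.test() output above
xbar <- 24.7                      # sample mean age
mu0  <- 26                        # hypothesized mean under H0
n    <- 40                        # sample size (df = 39)
ci   <- c(22.36979, 27.03021)     # reported 95% confidence interval

# Recover the standard error from the interval half-width
se <- diff(ci) / (2 * qt(0.975, df = n - 1))

# One-sample t statistic and two-sided p-value
t_stat  <- (xbar - mu0) / se
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

round(c(t = t_stat, p = p_value), 4)
```

The recomputed values should agree with the reported t = -1.1284 and p-value = 0.266 up to rounding.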
According to the U.S. Census Bureau, approximately 70% of eligible American adults are registered to vote. Do NSCC students register to vote at a higher rate than the general population?
a. Write the hypotheses. H0: p = 0.70 HA: p > 0.70
one proportion z-test
b. Calculate the test statistic and p-value.
table(nscc_student_data$VoterReg)
##
## No Yes
## 9 31
prop.test(31, 40, p = 0.70, alternative="greater")
##
## 1-sample proportions test with continuity correction
##
## data: 31 out of 40, null probability 0.7
## X-squared = 0.74405, df = 1, p-value = 0.1942
## alternative hypothesis: true p is greater than 0.7
## 95 percent confidence interval:
## 0.6374747 1.0000000
## sample estimates:
## p
## 0.775
c. Decision and Conclusion.
p-value = 0.1942 > alpha = 0.05 -> fail to reject H0
There is not sufficient evidence to support the claim that the proportion of NSCC students registered to vote is greater than the national rate of 70%.
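As a check on the `prop.test()` output, the continuity-corrected statistic and the one-sided p-value can be recomputed from the counts in the table above (31 "Yes" out of 40). This is a minimal sketch; the variable names are illustrative, and the "at least 10 expected in each category" rule used for the condition check is a common textbook rule of thumb rather than something stated in the project.

```r
n  <- 40      # sample size
x  <- 31      # students registered to vote
p0 <- 0.70    # null proportion

# Expected counts under H0 (rule of thumb: both should be at least 10)
expected_yes <- n * p0
expected_no  <- n * (1 - p0)

# Chi-squared statistic with the 0.5 continuity correction
x_sq <- (abs(x - n * p0) - 0.5)^2 / (n * p0 * (1 - p0))

# One-sided ("greater") p-value from the equivalent z statistic
p_value <- pnorm(sqrt(x_sq), lower.tail = FALSE)

round(c(X_squared = x_sq, p = p_value), 4)
```

These should match the reported X-squared = 0.74405 and p-value = 0.1942 up to rounding.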
Balancing school and work is a reality for many community college students. Is there a significant difference in the average number of hours worked per week between male and female NSCC students?
a. Write the hypotheses.
Let \(\mu_F\) = mean hours worked per week for female NSCC students and \(\mu_M\) = mean hours worked per week for male NSCC students.
H0: \(\mu_F\) = \(\mu_M\) HA: \(\mu_F\) != \(\mu_M\)
two sample t-test
b. Calculate the test statistic and p-value.
# Hours worked per week, split by gender
nsccmale <- subset(nscc_student_data, Gender == "Male")$HoursWorking
nsccfemale <- subset(nscc_student_data, Gender == "Female")$HoursWorking
# Welch two-sample t-test; the estimated difference is male minus female
t.test(nsccmale, nsccfemale)
##
## Welch Two Sample t-test
##
## data: nsccmale and nsccfemale
## t = -2.2559, df = 18.057, p-value = 0.03671
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -22.9858354 -0.8204324
## sample estimates:
## mean of x mean of y
## 17.61538 29.51852
c. Decision and Conclusion.
p-value = 0.03671 < alpha = 0.05 -> reject H0
There is sufficient evidence to support the claim that the mean hours worked per week differs between male and female NSCC students.
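The negative interval above reflects argument order: the two-vector call computes male minus female. An alternative way to run the same Welch test is the formula interface of `t.test()`, which orders the groups alphabetically and so reports female minus male. The sketch below uses a small hypothetical data frame (`toy`, not the NSCC data) purely to show that the two calls agree apart from sign.

```r
# Hypothetical toy data (NOT the NSCC dataset), only to illustrate the interface
toy <- data.frame(
  Gender = c("Female", "Female", "Female", "Female",
             "Male", "Male", "Male", "Male"),
  HoursWorking = c(35, 25, 30, 18, 20, 0, 15, 24)
)

# Two-vector call: difference is first argument minus second (Male - Female)
by_vectors <- t.test(toy$HoursWorking[toy$Gender == "Male"],
                     toy$HoursWorking[toy$Gender == "Female"])

# Formula call: groups are ordered alphabetically (Female - Male)
by_formula <- t.test(HoursWorking ~ Gender, data = toy)

# Same test, opposite sign on the statistic
c(vectors = unname(by_vectors$statistic),
  formula = unname(by_formula$statistic))
```

On the project data, the equivalent formula call would be `t.test(HoursWorking ~ Gender, data = nscc_student_data)`.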
Three out of four NSCC students drink coffee – but is that rate the same for men and women? Test whether there is a significant difference in the proportion of coffee drinkers between male and female students.
a. Write the hypotheses.
Let \(p_F\) = proportion of female NSCC students who drink coffee and \(p_M\) = proportion of male NSCC students who drink coffee.
H0: \(p_F\) = \(p_M\) HA: \(p_F\) != \(p_M\)
Two proportion z-test
b. Calculate the test statistic and p-value.
coffeefemale <- sum(nscc_student_data$Coffee == "Yes" & nscc_student_data$Gender == "Female")
coffeemale <- sum(nscc_student_data$Coffee == "Yes" & nscc_student_data$Gender == "Male")
table(nscc_student_data$Gender)
##
## Female Male
## 27 13
femaletotal <- 27
maletotal <- 13
prop.test(c(coffeefemale, coffeemale), c(femaletotal, maletotal))
## Warning in prop.test(c(coffeefemale, coffeemale), c(femaletotal, maletotal)):
## Chi-squared approximation may be incorrect
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(coffeefemale, coffeemale) out of c(femaletotal, maletotal)
## X-squared = 3.1652e-31, df = 1, p-value = 1
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.3394307 0.2824506
## sample estimates:
## prop 1 prop 2
## 0.7407407 0.7692308
c. Decision and Conclusion.
p-value = 1 > alpha = 0.05 -> fail to reject H0
There is not sufficient evidence to support the claim that the proportion of coffee drinkers differs between male and female NSCC students.
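R's warning that the chi-squared approximation "may be incorrect" appears when some expected cell counts are small. One way to follow up, sketched below, is to inspect the expected counts and run Fisher's exact test, which does not rely on that approximation. The counts (20 of 27 women and 10 of 13 men drinking coffee) are inferred from the sample proportions printed above, and using Fisher's test here is a suggestion, not part of the original analysis.

```r
# 2 x 2 table of counts inferred from the prop.test() output above
coffee_counts <- matrix(
  c(20, 7,    # Female: Yes, No
    10, 3),   # Male:   Yes, No
  nrow = 2,
  dimnames = list(Coffee = c("Yes", "No"), Gender = c("Female", "Male"))
)

# Expected counts under H0; a value below 5 is what triggers the warning
expected <- suppressWarnings(chisq.test(coffee_counts))$expected
expected

# Fisher's exact test as a small-sample alternative
fisher_result <- fisher.test(coffee_counts)
fisher_result$p.value
```

The smallest expected count is 3.25 (men who do not drink coffee), which explains the warning.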
Rather than testing a specific claim, we want to estimate the true average number of credits taken per semester by all NSCC students.
a. Construct and interpret a 95% confidence interval for the mean credits taken per semester.
# One-sample t-test; we only need the 95% confidence interval from the output
t.test(nscc_student_data$Credits)
##
## One Sample t-test
##
## data: nscc_student_data$Credits
## t = 22.097, df = 39, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 10.69715 12.85285
## sample estimates:
## mean of x
## 11.775
b. Interpretation.
We are 95% confident that the true mean number of credits taken per semester by all NSCC students is between 10.70 and 12.85.
c. Follow-up question: Does this interval suggest students on average are taking a full-time load (defined as ≥ 12 credits)?
No. The sample mean (11.775) is below 12 and the interval extends down to 10.70, so the data do not show that NSCC students average a full-time load. Because 12 lies inside the interval, a true mean of 12 is still plausible, so we also cannot conclude that the average load is below full-time.
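The same question can be put as a one-sided test of H0: mu = 12 against HA: mu < 12. The sketch below is not part of the original analysis; it works only from the summary values printed above, and the variable names are illustrative.

```r
# Summary values copied from the t.test() output for Credits
xbar <- 11.775                    # sample mean credits
n    <- 40                        # sample size (df = 39)
ci   <- c(10.69715, 12.85285)     # reported 95% confidence interval

# Recover the standard error from the interval half-width
se <- diff(ci) / (2 * qt(0.975, df = n - 1))

# Test H0: mu = 12 against HA: mu < 12
t_stat  <- (xbar - 12) / se
p_value <- pt(t_stat, df = n - 1)

c(t = t_stat, p = p_value)
```

With the data loaded, the equivalent call would be `t.test(nscc_student_data$Credits, mu = 12, alternative = "less")`. The p-value is well above 0.05, which agrees with the interval: the sample does not show that the mean load is below 12 credits.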
Using any variable(s) in the NSCC Student Dataset that have not yet been analyzed in this project, formulate an original research question. Your question must be answerable with inference via a hypothesis test and/or confidence interval.
a. State your research question.
What is the true average height of all NSCC students?
b. Identify the type of inference and justify your choice.
I’m using a one-sample t confidence interval for a mean because Height is a numeric variable and I’m estimating a single population mean rather than testing a specific claim.
c. Write hypotheses (or describe what you are estimating, if using a confidence interval).
I’m estimating the average height of all NSCC students, which is unknown, using a sample of NSCC students.
d. Conduct the analysis.
t.test(nscc_student_data$Height)
##
## One Sample t-test
##
## data: nscc_student_data$Height
## t = 38.001, df = 38, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 61.08699 67.96173
## sample estimates:
## mean of x
## 64.52436
e. Conclusion.
We are 95% confident that the average height of all NSCC students is between 61.09 and 67.96 inches. This interval is based on the 39 students with a recorded height (df = 38).
Across this project, you conducted multiple hypothesis tests on the same dataset.
The 0.05 significance level is the probability of rejecting the null hypothesis when it is actually true (a Type I error, or false positive). Running several tests on the same dataset raises the chance of at least one false positive, because each test carries its own 5% risk and those risks accumulate. If the four tests in this project were independent, the chance of at least one false positive would be 1 - 0.95^4, or about 19%.
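The accumulation of false-positive risk can be made concrete with a short calculation. The sketch below assumes the four tests are independent (they share one sample, so this is only an approximation) and uses the p-values reported above. The Bonferroni adjustment via `p.adjust()` is shown as one common correction, not as something the project requires.

```r
alpha <- 0.05

# p-values reported for the four hypothesis tests in this project
p_values <- c(age = 0.266, voter_reg = 0.1942, hours = 0.03671, coffee = 1)
k <- length(p_values)

# Chance of at least one false positive across k independent tests
fwer <- 1 - (1 - alpha)^k
fwer

# Bonferroni-adjusted p-values (each multiplied by k, capped at 1)
bonferroni <- p.adjust(p_values, method = "bonferroni")
bonferroni
```

Under this adjustment, the hours-worked p-value becomes about 0.147, so that result would no longer fall below 0.05. This illustrates why a single significant result among several tests deserves some caution.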
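The difference in hours worked between male and female NSCC students was statistically significant (p = 0.037). It also looks practically significant: women in the sample averaged about 29.5 hours per week versus about 17.6 for men, a gap of roughly 12 hours, which is large enough to affect how much time a student has for coursework. Possible explanations include female students having greater financial necessity, more available time, or being more likely to be hired, but this dataset cannot tell us which. With only 13 men in the sample, the size of the gap is also estimated imprecisely.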
The sample may not include online NSCC students, which could make it unrepresentative of the full NSCC population. That would limit how far our conclusions can be generalized beyond the students who were surveyed.