In Project #4, we used the normal distribution to begin asking whether NSCC students differ from national averages. Now, armed with a full toolkit – t-tests, proportion tests, two-sample methods, and confidence intervals – we return to the NSCC Student Dataset to build a more complete statistical portrait. In this project, you will conduct multiple hypothesis tests and construct confidence intervals. For each inference question, you are responsible for identifying the correct type of inference to apply. Consider carefully: Is the variable of interest numeric or categorical? Are you comparing one group to a standard, or two groups to each other?
Load the NSCC Student Dataset and familiarize yourself with its variables.
## Load the NSCC student dataset
nscc <- read.csv("nsccstudentdata.csv")
## Preview the structure of the dataset
str(nscc)
## 'data.frame': 40 obs. of 15 variables:
## $ Gender : chr "Female" "Female" "Female" "Female" ...
## $ PulseRate : int 64 75 74 65 NA 72 72 60 66 60 ...
## $ CoinFlip1 : int 5 4 6 4 NA 6 6 3 7 6 ...
## $ CoinFlip2 : int 5 6 1 4 NA 5 6 5 8 5 ...
## $ Height : num 62 62 60 62 66 ...
## $ ShoeLength : num 11 11 10 10.8 NA ...
## $ Age : int 19 21 25 19 26 21 19 24 24 20 ...
## $ Siblings : int 4 3 2 1 6 1 2 2 3 1 ...
## $ RandomNum : int 797 749 13 613 53 836 423 16 12 543 ...
## $ HoursWorking: int 35 25 30 18 24 15 20 0 40 30 ...
## $ Credits : int 13 12 6 9 15 9 15 15 13 16 ...
## $ Birthday : chr "July 5" "December 27" "January 31" "6-13" ...
## $ ProfsAge : int 31 30 29 31 32 32 28 28 31 28 ...
## $ Coffee : chr "No" "Yes" "Yes" "Yes" ...
## $ VoterReg : chr "Yes" "Yes" "No" "Yes" ...
Variable Classification: There are four character variables, two numeric (double) variables, and nine integer variables.
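The tally above can be checked with a short sketch. The class labels below are copied by hand from the str(nscc) output rather than read from the file, and the name var_classes is only illustrative; with the data loaded, table(sapply(nscc, class)) should give the same counts.

```r
# Classes transcribed from the str(nscc) output (chr = character,
# int = integer, num = numeric). Hypothetical helper vector, not part
# of the original analysis.
var_classes <- c(
  Gender = "character", PulseRate = "integer", CoinFlip1 = "integer",
  CoinFlip2 = "integer", Height = "numeric", ShoeLength = "numeric",
  Age = "integer", Siblings = "integer", RandomNum = "integer",
  HoursWorking = "integer", Credits = "integer", Birthday = "character",
  ProfsAge = "integer", Coffee = "character", VoterReg = "character"
)

# Count how many variables fall into each class
class_counts <- table(var_classes)
print(class_counts)
```

Knowing each variable's class matters later: numeric and integer variables call for tests about means, while character variables such as Coffee and VoterReg call for tests about proportions.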
According to the American Association of Community Colleges (AACC), the national average age of community college students is 26 years old. Is the average age of students in this NSCC sample consistent with that national figure, or do NSCC students differ?
a. Write the hypotheses. \(H_0: \mu_{Age} = 26\) \(H_A: \mu_{Age} \neq 26\)
b. Calculate the test statistic and p-value.
## One-sample t-test comparing the sample mean Age to 26
t.test(nscc$Age, mu = 26)
##
## One Sample t-test
##
## data: nscc$Age
## t = -1.1284, df = 39, p-value = 0.266
## alternative hypothesis: true mean is not equal to 26
## 95 percent confidence interval:
## 22.36979 27.03021
## sample estimates:
## mean of x
## 24.7
c. Decision and Conclusion. Since p = 0.266 > 0.05, we fail to reject \(H_0\). The sample does not provide significant evidence that the mean age of NSCC students differs from the national average of 26. Consistent with this, the 95% confidence interval (22.37, 27.03) contains 26.
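To see where the test statistic comes from, the sketch below rebuilds it from the summary numbers printed by t.test(). The standard error is not printed directly, so it is backed out from the confidence interval; that step, and the variable names, are assumptions of this sketch rather than part of the original analysis.

```r
# Summary values copied from the t.test() output above
xbar <- 24.7                    # sample mean of Age
mu0  <- 26                      # hypothesized mean under H0
df   <- 39                      # degrees of freedom (n - 1)
ci   <- c(22.36979, 27.03021)   # printed 95% confidence interval

# The CI half-width equals t* times the standard error, so SE can be recovered
t_crit <- qt(0.975, df)
se <- diff(ci) / (2 * t_crit)

# Test statistic and two-sided p-value
t_stat  <- (xbar - mu0) / se
p_value <- 2 * pt(-abs(t_stat), df)

round(c(se = se, t = t_stat, p = p_value), 4)
```

Up to rounding, this should reproduce the printed t of about -1.128 and p-value of about 0.266.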
According to the U.S. Census Bureau, approximately 70% of eligible American adults are registered to vote. Do NSCC students register to vote at a higher rate than the general population?
a. Write the hypotheses. \(H_0: p = 0.70\) \(H_A: p > 0.70\)
b. Calculate the test statistic and p-value.
## Number of "Yes" responses
yes <- sum(nscc$VoterReg == "Yes", na.rm = TRUE)
## Total responses (all 40 rows of the dataset)
n <- nrow(nscc)
## one-proportion z-test comparing sample proportion to 0.70
prop.test(yes, n, p = 0.70, alternative = "greater", correct = FALSE)
##
## 1-sample proportions test without continuity correction
##
## data: yes out of n, null probability 0.7
## X-squared = 1.0714, df = 1, p-value = 0.1503
## alternative hypothesis: true p is greater than 0.7
## 95 percent confidence interval:
## 0.6510377 1.0000000
## sample estimates:
## p
## 0.775
c. Decision and Conclusion. Since p = 0.1503 > 0.05, we fail to reject \(H_0\). Although the sample proportion (0.775) is above 0.70, there is not sufficient evidence that NSCC students register to vote at a higher rate than the general population.
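prop.test() reports an X-squared statistic rather than a z-score. The sketch below, which assumes 31 "Yes" responses out of 40 (the count implied by the printed sample proportion of 0.775), computes the one-proportion z-test by hand to show that X-squared is simply z squared.

```r
# Counts implied by the output above: 0.775 * 40 = 31 "Yes" responses
yes <- 31
n   <- 40
p0  <- 0.70                       # null proportion

p_hat <- yes / n                  # sample proportion
se0   <- sqrt(p0 * (1 - p0) / n)  # standard error under H0
z     <- (p_hat - p0) / se0       # z test statistic

# One-sided (upper-tail) p-value, matching alternative = "greater"
p_value <- pnorm(z, lower.tail = FALSE)

round(c(z = z, z_squared = z^2, p_value = p_value), 4)

# Expected successes and failures under H0, for the normal-approximation check
c(successes = n * p0, failures = n * (1 - p0))
```

The squared z-score should match the printed X-squared of 1.0714, and the p-value should match 0.1503. Both expected counts (28 and 12) are at least 10, which supports using the normal approximation here.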
Balancing school and work is a reality for many community college students. Is there a significant difference in the average number of hours worked per week between male and female NSCC students?
a. Write the hypotheses.
Let \(\mu_F\) = mean hours worked per week for female NSCC students and \(\mu_M\) = mean hours worked per week for male NSCC students. \(H_0: \mu_F = \mu_M\) \(H_A: \mu_F \neq \mu_M\)
b. Calculate the test statistic and p-value.
## Two-sample (Welch) t-test comparing mean hours worked by gender
t.test(HoursWorking ~ Gender, data = nscc)
##
## Welch Two Sample t-test
##
## data: HoursWorking by Gender
## t = 2.2559, df = 18.057, p-value = 0.03671
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
## 0.8204324 22.9858354
## sample estimates:
## mean in group Female mean in group Male
## 29.51852 17.61538
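c. Decision and Conclusion. Since p = 0.0367 < 0.05, we reject \(H_0\). There is a significant difference in mean hours worked per week: female students in the sample average about 29.5 hours versus about 17.6 hours for male students, and the 95% confidence interval for the difference (0.82, 22.99) does not contain 0.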
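For reference, the Welch statistic has the form shown below. The output does not print the group standard deviations, so the standard error here is backed out from the printed difference in means and t value; treat it as an approximation inferred from the output, not a reported result.

\[
t = \frac{\bar{x}_F - \bar{x}_M}{\sqrt{\dfrac{s_F^2}{n_F} + \dfrac{s_M^2}{n_M}}},
\qquad
\bar{x}_F - \bar{x}_M = 29.519 - 17.615 = 11.903
\]

\[
SE \approx \frac{11.903}{2.2559} \approx 5.28,
\qquad
11.903 \pm 2.10 \times 5.28 \approx (0.8,\ 23.0)
\]

The critical value of about 2.10 corresponds to the Welch degrees of freedom (18.057), and the resulting interval agrees with the printed one up to rounding. Its width shows that, although the difference is significant, its size is estimated quite imprecisely.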
Three out of four NSCC students drink coffee – but is that rate the same for men and women? Test whether there is a significant difference in the proportion of coffee drinkers between male and female students.
a. Write the hypotheses.
Let \(p_F\) = proportion of female NSCC students who drink coffee and \(p_M\) = proportion of male NSCC students who drink coffee. \(H_0: p_F = p_M\) \(H_A: p_F \neq p_M\)
b. Calculate the test statistic and p-value.
## Create a contingency table of Gender vs Coffee preference
table_coffee <- table(nscc$Gender, nscc$Coffee)
## Display the table to verify counts
table_coffee
##
## No Yes
## Female 7 20
## Male 3 10
## Run a two-proportion test using the table
prop.test(table_coffee)
## Warning in prop.test(table_coffee): Chi-squared approximation may be incorrect
##
## 2-sample test for equality of proportions with continuity correction
##
## data: table_coffee
## X-squared = 3.7926e-32, df = 1, p-value = 1
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.2824506 0.3394307
## sample estimates:
## prop 1 prop 2
## 0.2592593 0.2307692
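c. Decision and Conclusion. Since p = 1 > 0.05, we fail to reject \(H_0\). There is no evidence of a difference in the proportion of coffee drinkers between female and male students. Note that prop.test() treats the first column of the table ("No") as the success, so the printed estimates are the proportions who do not drink coffee (0.259 for females, 0.231 for males); the two-sided test result is the same either way. The warning indicates that the counts are small enough that the chi-squared approximation may be unreliable.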
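The warning from prop.test() arises because at least one expected cell count is below 5. One common alternative for small 2x2 tables is Fisher's exact test. The sketch below rebuilds the table from the printed counts (the name coffee_counts is illustrative) and is offered as a possible cross-check, not as the method the project requires.

```r
# Rebuild the 2x2 table from the counts printed above
# (matrix() fills by column: the "No" column first, then "Yes")
coffee_counts <- matrix(
  c(7, 3, 20, 10),
  nrow = 2,
  dimnames = list(Gender = c("Female", "Male"), Coffee = c("No", "Yes"))
)

# Expected counts under independence; a value below 5 triggers the warning
expected <- suppressWarnings(chisq.test(coffee_counts)$expected)
print(expected)

# Fisher's exact test does not rely on the chi-squared approximation
ft <- fisher.test(coffee_counts)
print(ft$p.value)
```

The smallest expected count is 3.25 (male students who do not drink coffee), which explains the warning. If the exact test also gives a large p-value, the conclusion of no detectable difference does not depend on the approximation.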
Rather than testing a specific claim, we want to estimate the true average number of credits taken per semester by all NSCC students.
a. Construct and interpret a 95% confidence interval for the mean credits taken per semester.
## One-sample t-test to generate a 95% confidence interval for Credits
t.test(nscc$Credits)
##
## One Sample t-test
##
## data: nscc$Credits
## t = 22.097, df = 39, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 10.69715 12.85285
## sample estimates:
## mean of x
## 11.775
b. Interpretation. We are 95% confident that the true mean number of credits taken per semester by NSCC students is between 10.70 and 12.85.
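c. Follow-up question: Does this interval suggest students on average are taking a full-time load (defined as ≥ 12 credits)? Not conclusively. Twelve lies inside the interval, so a mean of 12 credits is plausible, but so are values below 12 (the sample mean is 11.775). The data are consistent with a full-time average load without establishing it.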
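The interval can be reproduced by hand from the printed output. The standard error is not shown directly; because the default test is against \(\mu = 0\), it can be recovered as the sample mean divided by the t statistic. The critical value \(t^* \approx 2.023\) is the 97.5th percentile of the t distribution with 39 degrees of freedom.

\[
SE = \frac{\bar{x}}{t} = \frac{11.775}{22.097} \approx 0.533
\]

\[
\bar{x} \pm t^{*} \cdot SE = 11.775 \pm 2.023 \times 0.533 \approx 11.775 \pm 1.078 = (10.70,\ 12.85)
\]

The reported t statistic and p-value test the uninteresting claim that the mean number of credits is 0, so only the confidence interval from this output is meaningful for the question asked.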
Using any variable(s) in the NSCC Student Dataset that have not yet been analyzed in this project, formulate an original research question. Your question must be answerable with inference via a hypothesis test and/or confidence interval.
a. State your research question. Do students have more than 2 siblings on average?
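b. Identify the type of inference and justify your choice. One-sample t-test for a mean: Siblings is a numeric variable, we are comparing a single group's mean to a fixed value (2), and the observations are assumed to be independent of each other.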
c. Write hypotheses (or describe what you are estimating, if using a confidence interval). \(H_0: \mu_{Siblings} = 2\) \(H_A: \mu_{Siblings} > 2\)
d. Conduct the analysis.
## One-sided, one-sample t-test: is the mean number of siblings greater than 2?
t.test(nscc$Siblings, mu = 2, alternative = "greater")
##
## One Sample t-test
##
## data: nscc$Siblings
## t = 0.66614, df = 39, p-value = 0.2546
## alternative hypothesis: true mean is greater than 2
## 95 percent confidence interval:
## 1.770603 Inf
## sample estimates:
## mean of x
## 2.15
e. Conclusion. Since p = 0.2546 > 0.05, we fail to reject \(H_0\). There is not sufficient evidence that NSCC students have more than 2 siblings on average.
Across this project, you conducted multiple hypothesis tests on the same dataset.
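What is the risk of conducting many hypothesis tests on the same dataset? (Hint: think about what α = 0.05 means in terms of false positives.) Running many tests increases the chance of making at least one Type I error (false positive). With α = 0.05, each test of a true null hypothesis has a 5% chance of appearing significant by chance alone, so the more tests we run, the more likely it is that at least one significant result is spurious.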
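The size of this risk can be sketched numerically. The code below uses the five p-values reported in this project, assumes the tests are independent (a simplification, since they share one dataset), and applies two standard adjustments available through p.adjust(). The Q1 to Q6 labels are illustrative names for the questions in the order they appear.

```r
alpha <- 0.05

# p-values copied from the five hypothesis tests above
p_values <- c(
  Q1_age      = 0.266,
  Q2_voter    = 0.1503,
  Q3_hours    = 0.03671,
  Q4_coffee   = 1,
  Q6_siblings = 0.2546
)
k <- length(p_values)

# Chance of at least one false positive if all k nulls are true
# and the tests are independent (assumption)
fwer <- 1 - (1 - alpha)^k
round(fwer, 3)

# Adjusted p-values that control the family-wise error rate
p_bonf <- p.adjust(p_values, method = "bonferroni")
p_holm <- p.adjust(p_values, method = "holm")
round(rbind(raw = p_values, bonferroni = p_bonf, holm = p_holm), 4)
```

Under these assumptions the family-wise error rate for five tests is about 0.23, well above 0.05. After a Bonferroni adjustment the hours-worked p-value becomes about 0.18, so that result should be read with some caution.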
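Choose a question in this project where you found a result that was statistically significant. Do you believe this result is practically significant? If so, what do you believe could be reasons for this result? Discuss. Example (Q3): The difference in hours worked was statistically significant, with female students in the sample working about 12 more hours per week on average than male students. A gap of that size is practically meaningful because it could affect stress, grades, or time management. The confidence interval for the difference is wide (0.82 to 22.99 hours), however, so the true size of the gap is uncertain.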
What is a limitation of the NSCC Student Dataset that affects all of the conclusions drawn in this project?
The sample is small (40 students), so estimates are imprecise and the tests have limited power to detect real differences.