In Project #4, we used the normal distribution to begin asking whether NSCC students differ from national averages. Now, armed with a full toolkit – t-tests, proportion tests, two-sample methods, and confidence intervals – we return to the NSCC Student Dataset to build a more complete statistical portrait. In this project, you will conduct multiple hypothesis tests and construct confidence intervals. For each inference question, you are responsible for identifying the correct type of inference to apply. Consider carefully: Is the variable of interest numeric or categorical? Are you comparing one group to a standard, or two groups to each other?
Load the NSCC Student Dataset and familiarize yourself with its variables.
# Load the NSCC student dataset
nscc <- read.csv("C:/Users/ash91/OneDrive/Desktop/Honors_Stats/nscc_student_data.csv")
# Preview the structure of the dataset
str(nscc)
## 'data.frame': 40 obs. of 15 variables:
## $ Gender : chr "Female" "Female" "Female" "Female" ...
## $ PulseRate : int 64 75 74 65 NA 72 72 60 66 60 ...
## $ CoinFlip1 : int 5 4 6 4 NA 6 6 3 7 6 ...
## $ CoinFlip2 : int 5 6 1 4 NA 5 6 5 8 5 ...
## $ Height : num 62 62 60 62 66 ...
## $ ShoeLength : num 11 11 10 10.8 NA ...
## $ Age : int 19 21 25 19 26 21 19 24 24 20 ...
## $ Siblings : int 4 3 2 1 6 1 2 2 3 1 ...
## $ RandomNum : int 797 749 13 613 53 836 423 16 12 543 ...
## $ HoursWorking: int 35 25 30 18 24 15 20 0 40 30 ...
## $ Credits : int 13 12 6 9 15 9 15 15 13 16 ...
## $ Birthday : chr "July 5" "December 27" "January 31" "6-13" ...
## $ ProfsAge : int 31 30 29 31 32 32 28 28 31 28 ...
## $ Coffee : chr "No" "Yes" "Yes" "Yes" ...
## $ VoterReg : chr "Yes" "Yes" "No" "Yes" ...
Variable Classification:
According to the American Association of Community Colleges (AACC), the national average age of community college students is 26 years old. Is the average age of students in this NSCC sample consistent with that national figure, or do NSCC students differ?
a. Write the hypotheses.
H0: \(\mu = 26\) H1: \(\mu \neq 26\)
b. Calculate the test statistic and p-value.
# Test Statistic
(mean(nscc$Age) - 26) / (sd(nscc$Age) / sqrt(length(nscc$Age)))
## [1] -1.12844
# P-Value
t.test(nscc$Age, alternative = "two.sided", mu = 26)$p.value
## [1] 0.2660281
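As a cross-check, the p-value reported by t.test() can be reproduced from the test statistic alone. The sketch below assumes all 40 ages are non-missing, so the degrees of freedom are 40 - 1 = 39; the variable names are illustrative.

```r
# Test statistic copied from the output above
t_age <- -1.12844
# Degrees of freedom for a one-sample t-test: n - 1 (assumes no missing ages)
df_age <- 40 - 1
# Two-sided p-value: twice the tail area beyond |t|
p_age <- 2 * pt(-abs(t_age), df = df_age)
round(p_age, 4)
```

The result should agree with the t.test() p-value of about 0.266, up to rounding of the test statistic.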
c. Decision and Conclusion.
Since the p-value (0.266) is greater than alpha (0.05), we fail to reject the null hypothesis (H0). There is insufficient evidence to suggest that the average age of NSCC students differs from the national average age of community college students (26 years).
According to the U.S. Census Bureau, approximately 70% of eligible American adults are registered to vote. Do NSCC students register to vote at a higher rate than the general population?
a. Write the hypotheses.
H0: \(p = .7\) H1: \(p > .7\)
b. Calculate the test statistic and p-value.
# Store Sample Data
table(nscc$VoterReg)
##
## No Yes
## 9 31
n <- 40
x <- 31
# Store Null Hypothesis
pnull <- .7
# Sample Proportion (p-hat)
phat <- x/n
# Standard Error
SE <- sqrt(pnull * (1 - pnull)/n)
# Test Statistic
teststat <- (phat - pnull) / SE
# P-Value
pnorm(teststat, lower.tail = FALSE)
## [1] 0.1503115
# Prop Test
prop.test(x, n, p = 0.7, alternative = "greater", correct = FALSE)
##
## 1-sample proportions test without continuity correction
##
## data: x out of n, null probability 0.7
## X-squared = 1.0714, df = 1, p-value = 0.1503
## alternative hypothesis: true p is greater than 0.7
## 95 percent confidence interval:
## 0.6510377 1.0000000
## sample estimates:
## p
## 0.775
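The hand calculation and prop.test() describe the same test: squaring the z statistic gives the X-squared value in the output. The sketch below checks that, along with a common large-sample rule of thumb (at least 10 expected successes and 10 expected failures under the null). That rule is an assumption added here, not something stated in the project.

```r
# Counts and null value from the voter registration test
x <- 31
n <- 40
p0 <- 0.7

# Expected successes and failures under H0
expected <- c(success = n * p0, failure = n * (1 - p0))
expected

# z statistic using the null standard error
z <- (x / n - p0) / sqrt(p0 * (1 - p0) / n)

# Same test via prop.test(), without continuity correction
res <- prop.test(x, n, p = p0, alternative = "greater", correct = FALSE)

# z squared should match the reported X-squared statistic
c(z_squared = z^2, x_squared = unname(res$statistic))

# The one-sided p-values should also match
c(manual = pnorm(z, lower.tail = FALSE), prop_test = res$p.value)
```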
c. Decision and Conclusion.
Since the p-value (0.150) is greater than alpha (0.05), we fail to reject the null hypothesis. There is insufficient evidence to suggest that NSCC students register to vote at a higher rate than the general population (70%).
Balancing school and work is a reality for many community college students. Is there a significant difference in the average number of hours worked per week between male and female NSCC students?
a. Write the hypotheses.
Let \(\mu_F\) = mean hours worked per week for female NSCC students and \(\mu_M\) = mean hours worked per week for male NSCC students.
H0: \(\mu_F = \mu_M\) H1: \(\mu_F \neq \mu_M\)
b. Calculate the test statistic and p-value.
# Males Subset
nsccmales <- subset(nscc, nscc$Gender == "Male")
#Females Subset
nsccfemales <- subset(nscc, nscc$Gender == "Female")
# Male Sample Size
nm <- 13
# Female Sample Size
nf <- 27
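# Standard Error (unpooled: each group's variance divided by its own sample size)
se_hours <- sqrt((sd(nsccmales$HoursWorking)^2/nm)+(sd(nsccfemales$HoursWorking)^2/nf))
# Test Statistic
ts_hours <- (mean(nsccmales$HoursWorking) - mean(nsccfemales$HoursWorking))/se_hours
# P-Value (two-sided: twice the tail area beyond |t|, valid for either sign of the statistic)
# Note: df = 39 is n - 1 for the combined sample. For an unpooled two-sample test,
# min(nm, nf) - 1 = 12 or the Welch approximation is the usual choice; either gives a somewhat larger p-value.
2 * pt(-abs(ts_hours), df = 39)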
## [1] 0.02975773
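When the two groups are not pooled, the degrees of freedom are usually taken from the Welch-Satterthwaite approximation rather than from the combined sample size. The function below is a minimal sketch of that formula. `welch_df` is a hypothetical helper, and the standard deviations passed to it are placeholder values, not statistics from the NSCC data.

```r
# Welch-Satterthwaite degrees of freedom from summary statistics
# s1, s2: sample standard deviations; n1, n2: sample sizes
welch_df <- function(s1, n1, s2, n2) {
  v1 <- s1^2 / n1
  v2 <- s2^2 / n2
  (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
}

# Placeholder SDs (10 and 12) with the group sizes 13 and 27
df_example <- welch_df(s1 = 10, n1 = 13, s2 = 12, n2 = 27)
df_example

# The result always lies between min(n1, n2) - 1 and n1 + n2 - 2
c(lower = min(13, 27) - 1, upper = 13 + 27 - 2)
```

In practice, t.test(x, y) applies this approximation by default (var.equal = FALSE), so calling it on the two groups is the simplest way to get the matching p-value.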
c. Decision and Conclusion.
Since the p-value (0.030) is less than alpha (0.05), we reject the null hypothesis. There is sufficient evidence of a difference in the average number of hours worked per week between male and female NSCC students.
Three out of four NSCC students drink coffee – but is that rate the same for men and women? Test whether there is a significant difference in the proportion of coffee drinkers between male and female students.
a. Write the hypotheses.
Let \(p_F\) = proportion of female NSCC students who drink coffee and \(p_M\) = proportion of male NSCC students who drink coffee.
H0: \(p_F = p_M\) H1: \(p_F \neq p_M\)
b. Calculate the test statistic and p-value.
# Proportion of NSCC Males Who Drink Coffee
malescoffee <- subset(nscc, Gender == "Male")
table(malescoffee$Coffee)
##
## No Yes
## 3 10
# Proportion of NSCC Females Who Drink Coffee
femalescoffee <- subset(nscc, Gender == "Female")
table(femalescoffee$Coffee)
##
## No Yes
## 7 20
# Count and Sample Size for NSCC Male Coffee-Drinkers
xm <- 10
nm <- 10 + 3
# Count and Sample Size for NSCC Female Coffee-Drinkers
xf <- 20
nf <- 20 + 7
# Sample Proportions for NSCC Male Coffee-Drinkers
xm/nm
## [1] 0.7692308
# Sample Proportions for NSCC Female Coffee-Drinkers
xf/nf
## [1] 0.7407407
# Pooled Proportion
ppool <- (xm + xf)/(nm + nf)
# Standard Error (using ppool)
se <- sqrt((ppool*(1-ppool)/nm) + (ppool*(1-ppool)/nf))
# Test Statistic
teststat <- (xm/nm - xf/nf)/se
# P-Value (two-sided: twice the upper-tail area beyond |z|)
round(2 * pnorm(abs(teststat), lower.tail = FALSE), 4)
## [1] 0.8455
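The two-sided p-value for a z statistic is twice the upper-tail area beyond |z|. As a sketch, the counts from the tables above can be run through both the hand formula and prop.test(); the two should agree when the continuity correction is turned off. prop.test() is expected to warn that the chi-squared approximation may be unreliable here, because one expected count (male non-drinkers, 13 × 0.25 = 3.25) is below 5, so the warning is suppressed in this illustration.

```r
# Counts taken from the two tables above
xm <- 10; nm <- 13   # male coffee drinkers, male sample size
xf <- 20; nf <- 27   # female coffee drinkers, female sample size

# Pooled proportion and standard error under H0: p_F = p_M
ppool <- (xm + xf) / (nm + nf)
se <- sqrt(ppool * (1 - ppool) * (1 / nm + 1 / nf))

# z statistic and two-sided p-value
z <- (xm / nm - xf / nf) / se
p_manual <- 2 * pnorm(abs(z), lower.tail = FALSE)

# Same test via prop.test(); warning about small expected counts suppressed
res <- suppressWarnings(prop.test(c(xm, xf), c(nm, nf), correct = FALSE))

c(manual = p_manual, prop_test = res$p.value)
```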
c. Decision and Conclusion.
Since the p-value is greater than alpha (0.05), we fail to reject the null hypothesis. There is insufficient evidence to suggest that there is a significant difference in the proportion of coffee-drinkers between male and female students at NSCC.
Rather than testing a specific claim, we want to estimate the true average number of credits taken per semester by all NSCC students.
a. Construct and interpret a 95% confidence interval for the mean credits taken per semester.
# 95% Confidence Interval for Mean Credits per Semester
t.test(nscc$Credits, conf.level = 0.95)$conf.int
## [1] 10.69715 12.85285
## attr(,"conf.level")
## [1] 0.95
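A t-based interval has the form sample mean ± t* × standard error, so the reported endpoints can be unpacked into those pieces. The sketch below assumes all 40 credit values are non-missing (so df = 39); the variable names are illustrative.

```r
# Endpoints copied from the t.test() output above
ci <- c(10.69715, 12.85285)

# The midpoint of the interval is the sample mean
xbar <- mean(ci)

# Half the width is the margin of error
margin <- diff(ci) / 2

# Critical value for 95% confidence (assumes n = 40, so df = 39)
t_star <- qt(0.975, df = 39)

# Implied standard error of the mean
se_credits <- margin / t_star

c(mean = xbar, margin = margin, t_star = t_star, se = se_credits)
```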
b. Interpretation.
We are 95% confident that the true mean number of credits taken per semester by NSCC students is between 10.7 and 12.9.
c. Follow-up question: Does this interval suggest students on average are taking a full-time load (defined as ≥ 12 credits)?
Because 12 credits lies inside the 95% confidence interval (10.7 to 12.9), it is plausible that students take a full-time load on average. However, the interval also contains values below 12, so it does not establish that the true average is a full-time load.
Using any variable(s) in the NSCC Student Dataset that have not yet been analyzed in this project, formulate an original research question. Your question must be answerable with inference via a hypothesis test and/or confidence interval.
a. State your research question.
What is the 95% confidence interval estimating the true average pulse rate of NSCC students?
Is there a significant difference in the average pulse rate between male and female NSCC students?
b. Identify the type of inference and justify your choice.
A two-sample t-test compares the mean pulse rates of male and female NSCC students. Because pulse rate is a numeric variable, the two groups are independent, and the population standard deviations are unknown, a two-sample t-test is the most appropriate choice of analysis.
c. Write hypotheses (or describe what you are estimating, if using a confidence interval).
Let \(\mu_F\) = mean pulse rate for female NSCC students and \(\mu_M\) = mean pulse rate for male NSCC students.
H0: \(\mu_F = \mu_M\) H1: \(\mu_F \neq \mu_M\)
d. Conduct the analysis.
# Males Subset
nsccmales <- subset(nscc, nscc$Gender == "Male")
#Females Subset
nsccfemales <- subset(nscc, nscc$Gender == "Female")
# Male Sample Size
table(is.na(nsccmales$PulseRate))
##
## FALSE
## 13
nm_pulse <- 13
# Female Sample Size
table(is.na(nsccfemales$PulseRate))
##
## FALSE TRUE
## 25 2
nf_pulse <- 25
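# Standard Error
# Note: nm and nf still hold the full group sizes from the coffee test (13 and 27).
# Because 2 female pulse rates are missing, nf_pulse (25) is the count that matches
# sd(..., na.rm = TRUE); substituting nm_pulse and nf_pulse would change the p-value printed below.
se_pulse <- sqrt((sd(nsccmales$PulseRate, na.rm = TRUE)^2/nm)+(sd(nsccfemales$PulseRate, na.rm = TRUE)^2/nf))
# Test Statistic
teststat_p <- (mean(nsccmales$PulseRate, na.rm = TRUE) - mean(nsccfemales$PulseRate, na.rm = TRUE))/se_pulse
# P-Value (two-sided: twice the tail area beyond |t|, valid for either sign of the statistic)
2 * pt(-abs(teststat_p), df = 37)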
## [1] 0.3481056
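When a variable has missing values, each group's sample size should be the count of non-missing observations so that it matches the sd(..., na.rm = TRUE) used in the standard error. The sketch below uses made-up pulse rates (not the NSCC data) and a hypothetical helper, `two_sample_t`, then compares the result with t.test(), which drops missing values on its own.

```r
# Made-up pulse rates for illustration only; NA marks a missing response
pulse_m <- c(70, 64, 80, 72, 68, 75)
pulse_f <- c(66, NA, 74, 71, 78, NA, 69, 73)

# Unpooled two-sample t statistic using non-missing counts as sample sizes
two_sample_t <- function(a, b) {
  a <- a[!is.na(a)]
  b <- b[!is.na(b)]
  se <- sqrt(var(a) / length(a) + var(b) / length(b))
  (mean(a) - mean(b)) / se
}

t_manual <- two_sample_t(pulse_m, pulse_f)

# t.test() removes NAs itself and uses the Welch (unpooled) statistic by default
t_builtin <- unname(t.test(pulse_m, pulse_f)$statistic)

c(manual = t_manual, t_test = t_builtin)
```

Counting with sum(!is.na(x)) instead of typing the sample size by hand keeps the standard error consistent if the data change.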
e. Conclusion.
Since the p-value is greater than alpha (0.05), we fail to reject the null hypothesis. There is insufficient evidence to suggest that there is a significant difference in the average pulse rate between male and female NSCC students.
Across this project, you conducted multiple hypothesis tests on the same dataset.
Conducting many hypothesis tests on the same dataset increases the chance of making at least one Type I error, that is, falsely rejecting a null hypothesis that is actually true.
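To put a number on that risk: if the five tests in this project were independent and each used alpha = 0.05, the chance of at least one false rejection when every null is true would be 1 - 0.95^5. Independence is an assumption made here for illustration, and the Bonferroni threshold shown is one common correction, not one the project requires.

```r
alpha <- 0.05   # per-test significance level
m <- 5          # number of hypothesis tests in this project

# Family-wise error rate under independence
fwer <- 1 - (1 - alpha)^m
fwer

# Bonferroni-adjusted per-test threshold
bonferroni_alpha <- alpha / m
bonferroni_alpha

# Smallest p-value reported in the project (hours worked by gender)
p_hours <- 0.02975773
p_hours < bonferroni_alpha
```

Under this stricter threshold, the hours-worked result (p ≈ 0.030) would no longer count as significant.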
In this project, the difference in average hours worked per week between male and female students was statistically significant. However, statistical significance does not by itself establish practical significance, which depends on how large the difference is and whether it matters in context. Other factors, such as the number of credits a student takes per semester, could affect a student’s availability to work, so hours worked per week may not depend on gender alone. Therefore, although the difference is statistically significant, I would not conclude that it is practically significant.
A limitation of the NSCC Student Dataset that may affect the conclusions drawn in this project is that the data may be self-reported, which can introduce inaccuracies. In addition, not all students respond to surveys, so the students who did respond may differ from those who did not. For both reasons, this sample may not be representative of the NSCC student population.