In Project #4, we used the normal distribution to begin asking whether NSCC students differ from national averages. Now, armed with a full toolkit – t-tests, proportion tests, two-sample methods, and confidence intervals – we return to the NSCC Student Dataset to build a more complete statistical portrait. In this project, you will conduct multiple hypothesis tests and construct confidence intervals. For each inference question, you are responsible for identifying the correct type of inference to apply. Consider carefully: Is the variable of interest numeric or categorical? Are you comparing one group to a standard, or two groups to each other?
Load the NSCC Student Dataset and familiarize yourself with its variables.
# Load the NSCC student dataset
nscc_student_data <- read.csv("C:/Users/aless/Downloads/nscc_student_data.csv")
# Preview the structure of the dataset
str(nscc_student_data)
## 'data.frame': 40 obs. of 15 variables:
## $ Gender : chr "Female" "Female" "Female" "Female" ...
## $ PulseRate : int 64 75 74 65 NA 72 72 60 66 60 ...
## $ CoinFlip1 : int 5 4 6 4 NA 6 6 3 7 6 ...
## $ CoinFlip2 : int 5 6 1 4 NA 5 6 5 8 5 ...
## $ Height : num 62 62 60 62 66 ...
## $ ShoeLength : num 11 11 10 10.8 NA ...
## $ Age : int 19 21 25 19 26 21 19 24 24 20 ...
## $ Siblings : int 4 3 2 1 6 1 2 2 3 1 ...
## $ RandomNum : int 797 749 13 613 53 836 423 16 12 543 ...
## $ HoursWorking: int 35 25 30 18 24 15 20 0 40 30 ...
## $ Credits : int 13 12 6 9 15 9 15 15 13 16 ...
## $ Birthday : chr "July 5" "December 27" "January 31" "6-13" ...
## $ ProfsAge : int 31 30 29 31 32 32 28 28 31 28 ...
## $ Coffee : chr "No" "Yes" "Yes" "Yes" ...
## $ VoterReg : chr "Yes" "Yes" "No" "Yes" ...
Variable Classification:
Numeric variables of the data set: Height, ShoeLength
Integer variables of the data set: PulseRate, CoinFlip1, CoinFlip2, Age, Siblings, RandomNum, HoursWorking, Credits, ProfsAge
Character variables of the data set: Gender, Birthday, Coffee, VoterReg
According to the American Association of Community Colleges (AACC), the national average age of community college students is 26 years old. Is the average age of students in this NSCC sample consistent with that national figure, or do NSCC students differ?
a. Write the hypotheses.
\(H_0: \mu = 26\)
\(H_A: \mu \neq 26\)
b. Calculate the test statistic and p-value.
This problem calls for a one-sample t-test. (α = 0.05)
#Sample mean, standard deviation, size, degrees of freedom, and standard error
mn <- mean(nscc_student_data$Age)
sd <- sd(nscc_student_data$Age)
n <- 40
df <- n - 1
se <- sd/sqrt(n)
#Calculate the test statistic (26 is the national average under the null)
t <- (mn - 26)/se
#Calculate the two-sided p-value
pt(abs(t), df, lower.tail = FALSE)*2
## [1] 0.2660281
c. Decision and Conclusion. Since the p-value (0.266) is greater than our alpha value of 0.05, we fail to reject the null hypothesis. There is not enough evidence to conclude that the average age of NSCC students differs from the national average of 26.
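As a cross-check on the manual calculation, R's built-in `t.test()` runs the same one-sample test in a single call. The sketch below is only an illustration: it uses the ten `Age` values visible in the `str()` preview rather than the full 40-student sample, so its numbers will not match the output above. The names `ages`, `mu0`, `t_stat`, `p_manual`, and `result` are introduced here for the example.

```r
# First ten Age values from the str() preview; NOT the full 40-student sample
ages <- c(19, 21, 25, 19, 26, 21, 19, 24, 24, 20)
mu0 <- 26  # national average age under the null hypothesis

# Manual one-sample t-test, following the same steps as above
n <- length(ages)
t_stat <- (mean(ages) - mu0) / (sd(ages) / sqrt(n))
p_manual <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# Built-in equivalent (two-sided by default)
result <- t.test(ages, mu = mu0)

# The manual and built-in p-values should agree
all.equal(p_manual, result$p.value)
```

With the full dataset loaded, the same check would be `t.test(nscc_student_data$Age, mu = 26)`.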
According to the U.S. Census Bureau, approximately 70% of eligible American adults are registered to vote. Do NSCC students register to vote at a higher rate than the general population?
a. Write the hypotheses.
\(H_0: p = 0.70\)
\(H_A: p > 0.70\)
b. Calculate the test statistic and p-value.
This problem calls for a one-proportion z-test. (α = 0.05)
#Observed counts, sample proportion, null proportion, and standard error under the null
table(nscc_student_data$VoterReg)
##
## No Yes
## 9 31
p <- 31/40
pnull <- 0.70
SEnull <- sqrt(pnull*(1-pnull)/40)
#Calculate the test statistic
teststat <- (p - pnull)/SEnull
#Calculate the p-value (the X-squared in the output equals teststat squared)
prop.test(31, 40, 0.70, alternative = "greater", correct = FALSE)
##
## 1-sample proportions test without continuity correction
##
## data: 31 out of 40, null probability 0.7
## X-squared = 1.0714, df = 1, p-value = 0.1503
## alternative hypothesis: true p is greater than 0.7
## 95 percent confidence interval:
## 0.6510377 1.0000000
## sample estimates:
## p
## 0.775
c. Decision and Conclusion. Since the p-value (0.1503) is greater than our alpha of 0.05, we fail to reject the null hypothesis. We do not have sufficient evidence to conclude that NSCC students register to vote at a higher rate than the general population.
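The `teststat` computed above is a z-statistic, while `prop.test()` reports an X-squared value. The following sketch, which uses only the counts from the table (31 of 40 registered), shows how the two connect: squaring z gives X-squared, and the one-sided normal tail area reproduces the reported p-value. The variable names are chosen for this example.

```r
# Counts taken from the VoterReg table above
x  <- 31    # students registered to vote
n  <- 40    # sample size
p0 <- 0.70  # null proportion

# Success-failure check under the null: both should be at least 10
c(expected_yes = n * p0, expected_no = n * (1 - p0))

# z-statistic using the standard error under the null
p_hat   <- x / n
se_null <- sqrt(p0 * (1 - p0) / n)
z       <- (p_hat - p0) / se_null

# One-sided (upper-tail) p-value for H_A: p > 0.70
p_value <- pnorm(z, lower.tail = FALSE)

# prop.test() without continuity correction reports z squared as X-squared
chi_sq <- unname(prop.test(x, n, p = p0, alternative = "greater",
                           correct = FALSE)$statistic)
all.equal(z^2, chi_sq)
```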
Balancing school and work is a reality for many community college students. Is there a significant difference in the average number of hours worked per week between male and female NSCC students?
a. Write the hypotheses.
\(H_0: \mu_F - \mu_M = 0\)
\(H_A: \mu_F - \mu_M \neq 0\)
b. Calculate the test statistic and p-value.
For this, we will perform a two-sample t-test for independent groups. (α = 0.05)
#Create subsets
nsccfemale <- subset(nscc_student_data, nscc_student_data$Gender == "Female")
nsccmale <- subset(nscc_student_data, nscc_student_data$Gender == "Male")
#Find the difference
diff <- mean(nsccfemale$HoursWorking) - mean(nsccmale$HoursWorking)
#Sample sizes and standard deviations
n1 <- 27
n2 <- 13
s1 <- sd(nsccfemale$HoursWorking)
s2 <- sd(nsccmale$HoursWorking)
#Calculate standard error
se <- sqrt((s1^2/n1)+(s2^2/n2))
#Calculate test statistic
teststat <- diff/se
#Calculate the test statistic and p-value using t.test
t.test(nsccfemale$HoursWorking, nsccmale$HoursWorking)
##
## Welch Two Sample t-test
##
## data: nsccfemale$HoursWorking and nsccmale$HoursWorking
## t = 2.2559, df = 18.057, p-value = 0.03671
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.8204324 22.9858354
## sample estimates:
## mean of x mean of y
## 29.51852 17.61538
c. Decision and Conclusion.
Since the p-value (0.0367) is less than alpha, we reject the null hypothesis in favor of the alternative. There is a statistically significant difference in average hours worked per week between male and female NSCC students. In this sample, women averaged about 29.5 hours and men about 17.6 hours.
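To see how the pieces of the `t.test()` output fit together, the sketch below rebuilds the p-value and the confidence interval from the printed summary values alone (group means, t, and Welch degrees of freedom). Because those printed values are rounded, the reconstruction matches the output only approximately. The variable names are introduced for this example.

```r
# Values copied from the Welch t-test output above (rounded as printed)
mean_f   <- 29.51852  # mean hours worked, female students
mean_m   <- 17.61538  # mean hours worked, male students
t_stat   <- 2.2559
df_welch <- 18.057

# The standard error is implied by t = difference / SE
diff_means <- mean_f - mean_m
se_diff    <- diff_means / t_stat

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df_welch, lower.tail = FALSE)

# 95% confidence interval: difference +/- critical value * SE
ci <- diff_means + c(-1, 1) * qt(0.975, df = df_welch) * se_diff

p_value
ci
```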
Three out of four NSCC students drink coffee – but is that rate the same for men and women? Test whether there is a significant difference in the proportion of coffee drinkers between male and female students.
a. Write the hypotheses.
\(H_0: p_F - p_M = 0\)
\(H_A: p_F - p_M \neq 0\)
b. Calculate the test statistic and p-value.
For this problem, we will perform a hypothesis test for two independent proportions. (α = 0.05)
#Find the successes and sample sizes
table(nsccfemale$Coffee)
##
## No Yes
## 7 20
table(nsccmale$Coffee)
##
## No Yes
## 3 10
#Calculate the test statistic
prop.test(x = c(20, 10), n = c(20 + 7, 10 + 3), correct = FALSE)$statistic
## Warning in prop.test(x = c(20, 10), n = c(20 + 7, 10 + 3), correct = FALSE):
## Chi-squared approximation may be incorrect
## X-squared
## 0.0379867
#Calculate the p-value
prop.test(x = c(20, 10), n = c(20 + 7, 10 + 3), correct = FALSE)$p.value
## Warning in prop.test(x = c(20, 10), n = c(20 + 7, 10 + 3), correct = FALSE):
## Chi-squared approximation may be incorrect
## [1] 0.8454698
c. Decision and Conclusion.
Since our p-value (0.845) is much greater than alpha, we fail to reject the null hypothesis. There is not enough evidence to conclude that the proportion of coffee drinkers differs between female and male NSCC students.
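The warning "Chi-squared approximation may be incorrect" appears because `prop.test()` flags any expected cell count below 5. The sketch below, built only from the two coffee tables above, computes the expected counts and then runs Fisher's exact test as one possible alternative that does not depend on the large-sample approximation. Using Fisher's test here is a suggestion for illustration, not part of the original analysis.

```r
# Observed counts from the two Coffee tables above
coffee <- matrix(c(20, 7,
                   10, 3),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Gender = c("Female", "Male"),
                                 Coffee = c("Yes", "No")))

# Expected counts if coffee drinking were independent of gender:
# (row total * column total) / grand total
expected <- outer(rowSums(coffee), colSums(coffee)) / sum(coffee)
expected
min(expected)  # a value below 5 is what triggers the warning

# Fisher's exact test avoids the chi-squared approximation
fisher_p <- fisher.test(coffee)$p.value
fisher_p
```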
Rather than testing a specific claim, we want to estimate the true average number of credits taken per semester by all NSCC students.
a. Construct and interpret a 95% confidence interval for the mean credits taken per semester.
#Find the mean and standard deviation
mn <- mean(nscc_student_data$Credits)
sd <- sd(nscc_student_data$Credits)
#Find the sample size
table(is.na(nscc_student_data$Credits))
##
## FALSE
## 40
n <- 40
#Calculate the interval
mn - 1.96*(sd/sqrt(n))
## [1] 10.73056
mn + 1.96*(sd/sqrt(n))
## [1] 12.81944
b. Interpretation.
We are 95% confident that the true average number of credits taken per semester by NSCC students is between 10.7 and 12.8 credits.
c. Follow-up question: Does this interval suggest students on average are taking a full-time load (defined as ≥ 12 credits)?
No. Because the interval (10.7 to 12.8) includes values below 12, it does not support the claim that students on average are taking a full-time load. The true mean could plausibly lie anywhere in the interval, including near 10.7. Therefore, we cannot say that NSCC students are on average taking a full-time load.
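The interval above uses the normal critical value 1.96. Since the population standard deviation is unknown and n = 40, a t critical value with 39 degrees of freedom is a common alternative and gives a slightly wider interval. The sketch below back-calculates the sample mean and standard error from the printed endpoints (so they are approximate) and recomputes the interval with `qt()`. The variable names are introduced for this example.

```r
# Endpoints printed above for the z-based interval
lower_z <- 10.73056
upper_z <- 12.81944
n <- 40

# Back-calculate the sample mean and standard error from those endpoints
mn <- (lower_z + upper_z) / 2
se <- (upper_z - lower_z) / (2 * 1.96)

# t critical value for 95% confidence with n - 1 degrees of freedom
t_crit <- qt(0.975, df = n - 1)

# t-based 95% confidence interval
ci_t <- mn + c(-1, 1) * t_crit * se
ci_t
```

With the dataset loaded, `t.test(nscc_student_data$Credits)$conf.int` returns the t-based interval directly. Either way the interval still contains 12, so the answer to the follow-up question does not change.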
Using any variable(s) in the NSCC Student Dataset that have not yet been analyzed in this project, formulate an original research question. Your question must be answerable with inference via a hypothesis test and/or confidence interval.
a. State your research question.
Does the number of hours worked per week affect the number of credits an NSCC student enrolls in?
b. Identify the type of inference and justify your choice.
This calls for a hypothesis test for two independent means, for which we will find the test statistic and p-value. (α = 0.05)
I will create subsets so that hours worked becomes a categorical variable. Group 1 consists of students who work 20 hours or more per week. Group 2 consists of students who work fewer than 20 hours per week.
c. Write hypotheses (or describe what you are estimating, if using a confidence interval).
\(H_0: \mu_1 = \mu_2\)
\(H_A: \mu_1 \neq \mu_2\)
d. Conduct the analysis.
#Create the subsets
group1 <- subset(nscc_student_data, HoursWorking >= 20)
group2 <- subset(nscc_student_data, HoursWorking < 20)
#Calculate the test statistic using t.test
t.test(group1$Credits, group2$Credits)$statistic
## t
## -0.1761456
e. Conclusion.
Our test statistic (t ≈ -0.18) is very close to zero, so the p-value is far greater than our alpha value of 0.05 and we fail to reject the null hypothesis. There is not enough evidence to conclude that the number of hours a student works per week affects the number of credits they enroll in.
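The conclusion relies on a p-value, but the code above prints only `$statistic`. The sketch below shows how both values can be pulled from one `t.test()` result. It is only an illustration: it uses the ten `HoursWorking` and `Credits` values visible in the `str()` preview, not the full sample, so its numbers will not match the output above. The data frame name `preview` is introduced for this example.

```r
# First ten rows of HoursWorking and Credits from the str() preview;
# NOT the full 40-student sample
preview <- data.frame(
  HoursWorking = c(35, 25, 30, 18, 24, 15, 20, 0, 40, 30),
  Credits      = c(13, 12, 6, 9, 15, 9, 15, 15, 13, 16)
)

# Same grouping rule as the analysis code: 20 or more hours vs. fewer than 20
group1 <- subset(preview, HoursWorking >= 20)
group2 <- subset(preview, HoursWorking < 20)

# Run the Welch two-sample t-test once and keep the whole result
result <- t.test(group1$Credits, group2$Credits)

# Extract the test statistic and the p-value from the same object
result$statistic
result$p.value
```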
Across this project, you conducted multiple hypothesis tests on the same dataset.
Since we used an alpha level of 0.05 as the standard for every test, each individual test has a 5% chance of rejecting the null hypothesis when it is actually true. When we run several hypothesis tests at this same alpha without adjusting it for the number of tests, the chance of at least one false positive across the whole project increases. So the probability that at least one of our conclusions is wrong is higher than we would like, and a significant result may be a false positive.
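To put a rough number on this, the sketch below computes the chance of at least one false positive across the five hypothesis tests in this project, under the simplifying assumption that the tests are independent (they share one dataset, so this is only an approximation). It then applies a Bonferroni adjustment, one common correction, to the four p-values printed above. The p-value for the final research question was not printed, so it is left out of the vector but still counted in the number of tests.

```r
alpha <- 0.05
k <- 5  # hypothesis tests in this project (assumed independent for this sketch)

# Probability of at least one false positive if every null hypothesis is true
fwer <- 1 - (1 - alpha)^k
fwer  # roughly 0.226

# p-values printed earlier in the project
p_values <- c(age          = 0.2660281,
              voter_reg    = 0.1503,
              hours_worked = 0.03671,
              coffee       = 0.8454698)

# Bonferroni adjustment: multiply each p-value by the number of tests (capped at 1)
p_bonf <- p.adjust(p_values, method = "bonferroni", n = k)
p_bonf

# Which results remain significant at alpha = 0.05 after adjustment?
p_bonf < alpha
```

Under this adjustment the hours-worked comparison no longer falls below 0.05, which is one reason to treat that result cautiously.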
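Only one question produced a statistically significant result: the difference in weekly working hours between men and women. I am not convinced this result is practically significant, because with a sample this small (27 women and 13 men) the 95% confidence interval for the difference, roughly 0.8 to 23 hours, is too wide to tell whether the true gap is trivial or substantial.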
One result that I doubt is that of Question 6, in which we asked whether the number of hours a student works affects the number of credits taken during the semester. We got a very high p-value and could not reject the null hypothesis. However, I suspect that a larger sample might produce different results. It would also be interesting to compare the grades of two groups of students who take the same number of credits but work different hours.
The biggest limitation affecting all of the conclusions drawn in this project is the sample size. NSCC enrolls about 5,000 students per semester, plus additional students who attend non-credit courses. Because our sample consists of only 40 students, it may not represent the entire population of NSCC students well. Another limitation is that we are working with self-reported data, which is likely to contain inaccuracies.
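A rough sense of how much a sample of 40 limits precision comes from the worst-case margin of error for a proportion (at p = 0.5). The sketch below computes that margin for n = 40 and the sample size that would be needed for a margin of 5 percentage points. It assumes a simple random sample and ignores the finite population of about 5,000 students, so the figures are approximate.

```r
n <- 40
z_crit <- qnorm(0.975)  # critical value for 95% confidence

# Worst-case margin of error for a sample proportion (p = 0.5)
moe_worst <- z_crit * sqrt(0.5 * 0.5 / n)
moe_worst  # about 0.155, i.e. roughly 15.5 percentage points

# Sample size needed for a worst-case margin of 5 percentage points
target_moe <- 0.05
n_needed <- ceiling((z_crit / target_moe)^2 * 0.5 * 0.5)
n_needed
```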