Purpose – The Full Statistical Portrait of NSCC Students

In Project #4, we used the normal distribution to begin asking whether NSCC students differ from national averages. Now, armed with a full toolkit – t-tests, proportion tests, two-sample methods, and confidence intervals – we return to the NSCC Student Dataset to build a more complete statistical portrait. In this project, you will conduct multiple hypothesis tests and construct confidence intervals. For each inference question, you are responsible for identifying the correct type of inference to apply. Consider carefully: Is the variable of interest numeric or categorical? Are you comparing one group to a standard, or two groups to each other?


Preparation

Load the NSCC Student Dataset and familiarize yourself with its variables.

# Load the NSCC student dataset
nscc <- read.csv("nscc_student_data.csv")

# Preview the structure of the dataset
str(nscc)
## 'data.frame':    40 obs. of  15 variables:
##  $ Gender      : chr  "Female" "Female" "Female" "Female" ...
##  $ PulseRate   : int  64 75 74 65 NA 72 72 60 66 60 ...
##  $ CoinFlip1   : int  5 4 6 4 NA 6 6 3 7 6 ...
##  $ CoinFlip2   : int  5 6 1 4 NA 5 6 5 8 5 ...
##  $ Height      : num  62 62 60 62 66 ...
##  $ ShoeLength  : num  11 11 10 10.8 NA ...
##  $ Age         : int  19 21 25 19 26 21 19 24 24 20 ...
##  $ Siblings    : int  4 3 2 1 6 1 2 2 3 1 ...
##  $ RandomNum   : int  797 749 13 613 53 836 423 16 12 543 ...
##  $ HoursWorking: int  35 25 30 18 24 15 20 0 40 30 ...
##  $ Credits     : int  13 12 6 9 15 9 15 15 13 16 ...
##  $ Birthday    : chr  "July 5" "December 27" "January 31" "6-13" ...
##  $ ProfsAge    : int  31 30 29 31 32 32 28 28 31 28 ...
##  $ Coffee      : chr  "No" "Yes" "Yes" "Yes" ...
##  $ VoterReg    : chr  "Yes" "Yes" "No" "Yes" ...

Variable Classification:
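
One quick way to start the classification is to ask R for each column's storage type. The sketch below uses a small made-up data frame (`demo`, not the real dataset) that mimics a few of the NSCC columns; the names `demo` and `var_type` are illustrative only. Treating every numeric column as a numeric variable is just a first-pass heuristic: a column such as `RandomNum` is stored as an integer but may not be meaningful to average.

```r
# Hypothetical mini data frame standing in for a few nscc columns
demo <- data.frame(
  Gender = c("Female", "Male", "Female"),
  Age    = c(19L, 21L, 25L),
  Height = c(62, 70.5, 60),
  Coffee = c("No", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

# Label each column by storage type: numeric/integer vs. everything else
var_type <- sapply(demo, function(col) {
  if (is.numeric(col)) "numeric" else "categorical"
})
var_type
```

With the real data, the same call would be `sapply(nscc, ...)`. Numeric variables lead to inference about means, categorical ones to inference about proportions.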


Question 1: How Old Are NSCC Students?

According to the American Association of Community Colleges (AACC), the national average age of community college students is 26 years old. Is the average age of students in this NSCC sample consistent with that national figure, or do NSCC students differ?

a. Write the hypotheses.

\(H_0: \mu = 26\)

\(H_1: \mu \neq 26\)

\(\alpha: 0.05\)

b. Calculate the test statistic and p-value.

# t-Test to find test statistic and p-value
t.test(nscc$Age, alternative = "two.sided", mu = 26)
## 
##  One Sample t-test
## 
## data:  nscc$Age
## t = -1.1284, df = 39, p-value = 0.266
## alternative hypothesis: true mean is not equal to 26
## 95 percent confidence interval:
##  22.36979 27.03021
## sample estimates:
## mean of x 
##      24.7

Test statistic: -1.13

P-value: 0.27

c. Decision and Conclusion.

0.27 > alpha. Therefore, we fail to reject the null hypothesis.

There is not sufficient evidence to suggest that the mean age of NSCC students is different from the national average of 26.
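
For reference, the test statistic that `t.test()` reports can be reproduced by hand from the formula \(t = (\bar{x} - \mu_0) / (s / \sqrt{n})\). The sketch below uses only the ten ages visible in the `str()` preview, so its numbers will not match the full-sample output above; the variable names are illustrative.

```r
# First ten ages shown in the str() preview (NOT the full sample of 40)
ages <- c(19, 21, 25, 19, 26, 21, 19, 24, 24, 20)
mu0  <- 26                      # null-hypothesis mean

n     <- length(ages)
se    <- sd(ages) / sqrt(n)     # standard error of the sample mean
tstat <- (mean(ages) - mu0) / se
pval  <- 2 * pt(abs(tstat), df = n - 1, lower.tail = FALSE)  # two-sided

c(t = tstat, p = pval)

# Cross-check against the built-in one-sample test
check <- t.test(ages, mu = mu0)
```

The hand-computed `tstat` and `pval` should agree with `check$statistic` and `check$p.value`.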


Question 2: Are NSCC Students More Civically Engaged?

According to the U.S. Census Bureau, approximately 70% of eligible American adults are registered to vote. Do NSCC students register to vote at a higher rate than the general population?

a. Write the hypotheses.

\(H_0: p = 0.7\)

\(H_1 : p > 0.7\)

\(\alpha: 0.05\)

b. Calculate the test statistic and p-value.

# Finding sample data
table(nscc$VoterReg)
## 
##  No Yes 
##   9  31
xvote <- 31
nvote <- 40

# Standard error
sevote <- sqrt((0.7*(1-0.7))/nvote)

# Test statistic
tvote <- (xvote/nvote - 0.7)/sevote

# P-value
pnorm(tvote, lower.tail = FALSE)
## [1] 0.1503115
# prop.test with null p = 0.7
prop.test(xvote, nvote, p = 0.7, alternative = "greater", correct = FALSE)
## 
##  1-sample proportions test without continuity correction
## 
## data:  xvote out of nvote, null probability 0.7
## X-squared = 1.0714, df = 1, p-value = 0.1503
## alternative hypothesis: true p is greater than 0.7
## 95 percent confidence interval:
##  0.6510377 1.0000000
## sample estimates:
##     p 
## 0.775

Test statistic: 1.04

P-value: 0.15

c. Decision and Conclusion.

0.15 > alpha. Therefore, we fail to reject the null hypothesis.

There is not sufficient evidence to show that the proportion of NSCC students who are registered to vote is higher than the national rate of 70%.
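
The z-test above relies on the normal approximation to the sampling distribution of a proportion. A minimal check of that assumption is sketched below, using the sample size and null proportion from this question. The cutoff of 10 is a common textbook rule of thumb and is an assumption here; some texts use 5 or 15.

```r
n  <- 40      # sample size from table(nscc$VoterReg)
p0 <- 0.7     # null proportion

expected_yes <- n * p0          # expected registered voters under H0
expected_no  <- n * (1 - p0)    # expected non-registered students under H0

c(expected_yes = expected_yes, expected_no = expected_no)

# Rule-of-thumb check (cutoff of 10 is an assumed convention)
normal_ok <- expected_yes >= 10 && expected_no >= 10
normal_ok
```

Both expected counts (28 and 12) clear the cutoff, so the normal approximation looks reasonable. If it did not, `binom.test()` in base R offers an exact alternative.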


Question 3: Do Male and Female Students Work Different Hours?

Balancing school and work is a reality for many community college students. Is there a significant difference in the average number of hours worked per week between male and female NSCC students?

a. Write the hypotheses.

Let \(\mu_F\) = mean hours worked per week for female NSCC students and \(\mu_M\) = mean hours worked per week for male NSCC students.

\(H_0: \mu_F = \mu_M\)

\(H_1: \mu_F \neq \mu_M\)

\(\alpha: 0.05\)

b. Calculate the test statistic and p-value.

# Split the data into female and male subsets
NSCCfemale <- subset(nscc, Gender == "Female")
NSCCmale <- subset(nscc, Gender == "Male")

# Sample data
mean(NSCCfemale$HoursWorking) - mean(NSCCmale$HoursWorking)
## [1] 11.90313
fsd <- sd(NSCCfemale$HoursWorking)
msd <- sd(NSCCmale$HoursWorking)

# Standard error
fmse <- sqrt((fsd^2/27)+(msd^2/13))

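# Test statistic and p-value
# Note: df = 13 is used here; the conservative convention min(27, 13) - 1 gives df = 12,
# which would make the hand-computed p-value slightly larger
fmt <- 11.90313/fmse
pt(q = fmt, df = 13, lower.tail = FALSE)*2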
## [1] 0.04194472
# Two-tailed t.test
t.test(NSCCfemale$HoursWorking, NSCCmale$HoursWorking)
## 
##  Welch Two Sample t-test
## 
## data:  NSCCfemale$HoursWorking and NSCCmale$HoursWorking
## t = 2.2559, df = 18.057, p-value = 0.03671
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   0.8204324 22.9858354
## sample estimates:
## mean of x mean of y 
##  29.51852  17.61538

Test statistic: 2.26

P-value: 0.04

c. Decision and Conclusion.

0.04 < alpha. Therefore, we reject the null hypothesis.

There is sufficient evidence of a difference in mean hours worked per week between female and male NSCC students. In this sample, female students worked about 11.9 more hours per week on average (29.5 vs. 17.6 hours).
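
The hand calculation and `t.test()` give the same test statistic but slightly different p-values (0.042 vs. 0.037) because they use different degrees of freedom: 13 by hand versus 18.057 from the Welch-Satterthwaite approximation. The sketch below shows how that approximation is computed. The two vectors are made-up values, not the NSCC data, and the names `grp_a`, `grp_b`, and `df_welch` are illustrative.

```r
# Hypothetical hours-worked values (NOT the NSCC data), for illustration only
grp_a <- c(35, 25, 30, 18, 24, 15, 20, 40)
grp_b <- c(0, 30, 12, 20, 8)

va <- var(grp_a) / length(grp_a)   # squared standard error, group A
vb <- var(grp_b) / length(grp_b)   # squared standard error, group B

tstat <- (mean(grp_a) - mean(grp_b)) / sqrt(va + vb)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (va + vb)^2 /
  (va^2 / (length(grp_a) - 1) + vb^2 / (length(grp_b) - 1))

pval <- 2 * pt(abs(tstat), df = df_welch, lower.tail = FALSE)

# Cross-check: t.test() runs the Welch test by default (var.equal = FALSE)
check <- t.test(grp_a, grp_b)
```

The approximate degrees of freedom always fall between `min(n1, n2) - 1` and `n1 + n2 - 2`, which is why the conservative hand calculation tends to give a somewhat larger p-value than `t.test()`.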


Question 4: Does Coffee Preference Differ by Gender?

Three out of four NSCC students drink coffee – but is that rate the same for men and women? Test whether there is a significant difference in the proportion of coffee drinkers between male and female students.

a. Write the hypotheses.

Let \(p_F\) = proportion of female NSCC students who drink coffee and \(p_M\) = proportion of male NSCC students who drink coffee.

\(H_0: p_F = p_M\)

\(H_1: p_F \neq p_M\)

\(\alpha: 0.05\)

b. Calculate the test statistic and p-value.

# Sample data
table(NSCCfemale$Coffee)
## 
##  No Yes 
##   7  20
table(NSCCmale$Coffee)
## 
##  No Yes 
##   3  10
xcf <- 20
xcm <- 10
ncf <- 27
ncm <- 13

# Standard error
pcoffee <- (xcf+xcm)/(ncf+ncm)
secoffee <- sqrt((pcoffee*(1-pcoffee)/ncf)+(pcoffee*(1-pcoffee)/ncm))

# Test statistic and p-value
tcoffee <- (((xcf/ncf)-(xcm/ncm))/secoffee)
pnorm(abs(tcoffee), lower.tail = FALSE)*2
## [1] 0.8454698
# prop.test for Coffee variables
prop.test(x = c(xcf, xcm), n = c(ncf, ncm), alternative = "two.sided", correct = FALSE)
## Warning in prop.test(x = c(xcf, xcm), n = c(ncf, ncm), alternative =
## "two.sided", : Chi-squared approximation may be incorrect
## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  c(xcf, xcm) out of c(ncf, ncm)
## X-squared = 0.037987, df = 1, p-value = 0.8455
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.3109406  0.2539606
## sample estimates:
##    prop 1    prop 2 
## 0.7407407 0.7692308

Test statistic: -0.19

P-value: 0.85

c. Decision and Conclusion.

0.85 > alpha. Therefore, we fail to reject the null hypothesis.

There is not sufficient evidence to suggest that there is any difference between the proportions of male and female NSCC students who drink coffee.
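
The warning "Chi-squared approximation may be incorrect" appears because at least one expected cell count under \(H_0\) is below 5. The sketch below recomputes the expected counts from the observed tables above and, as one possible alternative, runs Fisher's exact test, which does not depend on the chi-squared approximation. The object names are illustrative.

```r
# Observed counts from the tables above: rows = gender, columns = coffee
coffee <- matrix(c(20, 7,
                   10, 3),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Gender = c("Female", "Male"),
                                 Coffee = c("Yes", "No")))

# Expected counts under H0: row total * column total / grand total
expected <- outer(rowSums(coffee), colSums(coffee)) / sum(coffee)
expected

min(expected)   # smallest expected count (male non-drinkers)

# Exact test as a cross-check on the large-sample result
fisher_p <- fisher.test(coffee)$p.value
fisher_p
```

The smallest expected count is 3.25 (male students who do not drink coffee), which is what triggers the warning. With so few male students, the large-sample p-value should be read with some caution.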


Question 5: How Many Credits Are NSCC Students Taking?

Rather than testing a specific claim, we want to estimate the true average number of credits taken per semester by all NSCC students.

a. Construct and interpret a 95% confidence interval for the mean credits taken per semester.

# t.test for the Credits variable
t.test(nscc$Credits)
## 
##  One Sample t-test
## 
## data:  nscc$Credits
## t = 22.097, df = 39, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  10.69715 12.85285
## sample estimates:
## mean of x 
##    11.775

b. Interpretation.

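We are 95% confident that the true mean number of credits taken per semester by all NSCC students is between 10.7 and 12.9. Although each individual student takes a whole number of credits, a mean does not have to be a whole number, so the interval is reported as calculated rather than rounded to whole credits.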

c. Follow-up question: Does this interval suggest students on average are taking a full-time load (defined as ≥ 12 credits)?

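Not conclusively. The interval contains 12, so a mean at or above a full-time load is plausible, but it also contains values well below 12 (down to about 10.7), and the sample mean itself (11.775) falls slightly short of 12. The data therefore do not provide evidence that students are, on average, taking a full-time load; they are consistent with a true average either slightly below or slightly above 12 credits.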
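
As a check on where the interval comes from, it can be rebuilt from the formula \(\bar{x} \pm t^{*} \cdot SE\) using only the values printed in the `t.test()` output above. Because that output tests \(H_0: \mu = 0\), the standard error can be recovered as \(SE = \bar{x} / t\). This is a sketch from rounded printed values, so it matches the reported interval only up to rounding.

```r
xbar  <- 11.775          # sample mean from the t.test output
tstat <- 22.097          # t statistic for H0: mu = 0 from the same output
df    <- 39              # n - 1

se    <- xbar / tstat                 # since t = (xbar - 0) / se
tcrit <- qt(0.975, df)                # critical value for 95% confidence
ci    <- xbar + c(-1, 1) * tcrit * se
round(ci, 2)

# Does the interval contain the full-time threshold of 12 credits?
contains_12 <- ci[1] <= 12 && 12 <= ci[2]
contains_12
```

Since 12 lies inside the interval, a two-sided test of \(H_0: \mu = 12\) at \(\alpha = 0.05\) would not reject.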


Question 6: Your Turn – Choose Your Own Inference

Using any variable(s) in the NSCC Student Dataset that have not yet been analyzed in this project, formulate an original research question. Your question must be answerable with inference via a hypothesis test and/or confidence interval.

a. State your research question.

Is there a difference in mean shoe length between male and female NSCC students?

b. Identify the type of inference and justify your choice.

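I will answer this question with a hypothesis test, specifically a two-sample t-test. Shoe length is a numeric variable and I am comparing the means of two independent groups (male and female students), so a test for a difference between two means is the appropriate method.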

c. Write hypotheses (or describe what you are estimating, if using a confidence interval).

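Let \(\mu_F\) = mean shoe length for female NSCC students and \(\mu_M\) = mean shoe length for male NSCC students.

\(H_0: \mu_F = \mu_M\)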

\(H_1: \mu_F \neq \mu_M\)

\(\alpha: 0.05\)

d. Conduct the analysis.

# t.test for ShoeLength variable
t.test(NSCCfemale$ShoeLength, NSCCmale$ShoeLength)
## 
##  Welch Two Sample t-test
## 
## data:  NSCCfemale$ShoeLength and NSCCmale$ShoeLength
## t = -1.1684, df = 28.156, p-value = 0.2524
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.2328513  0.6105407
## sample estimates:
## mean of x mean of y 
##  10.07521  10.88636

Test statistic: -1.17

P-value: 0.25

e. Conclusion.

0.25 > alpha. Therefore, we fail to reject the null hypothesis.

There is not sufficient evidence to suggest there is a difference between the mean shoe lengths of male and female NSCC students.


Question 7: Reflection

Across this project, you conducted multiple hypothesis tests on the same dataset.

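Conducting many hypothesis tests on the same dataset increases the chance of at least one false positive (Type I error). Each test carries its own risk of rejecting a true null hypothesis, so with enough tests some results will appear significant purely because of the inherent variability of the data.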
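
To put rough numbers on this, the sketch below takes the p-values reported by the five hypothesis tests in this project (Questions 1 to 4 and 6), computes the chance of at least one false positive if all five null hypotheses were true, and applies two standard corrections with `p.adjust()`. The error-rate calculation assumes the tests are independent, which is only approximately true here since they share one sample.

```r
# p-values reported by the five tests in Questions 1-4 and 6
pvals <- c(Age = 0.266, VoterReg = 0.1503, Hours = 0.03671,
           Coffee = 0.8455, ShoeLength = 0.2524)
alpha <- 0.05

# Chance of at least one false positive if all five nulls were true
# (assumes independent tests, which is only an approximation here)
fwer <- 1 - (1 - alpha)^length(pvals)
round(fwer, 3)

# Bonferroni and Holm adjustments from base R
p_bonf <- p.adjust(pvals, method = "bonferroni")
p_holm <- p.adjust(pvals, method = "holm")
round(rbind(raw = pvals, bonferroni = p_bonf, holm = p_holm), 3)
```

With five tests at \(\alpha = 0.05\), the chance of at least one false positive is about 23%. After either adjustment, the hours-worked result (raw p ≈ 0.037, adjusted p ≈ 0.18) is no longer below 0.05, which supports treating that finding cautiously.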

The question of hours worked between male and female students had a statistically significant result, where it seemed female students worked more on average. I doubt the practical significance of this especially due to the small sample sizes of the groups (27 female students and 13 male students). I feel this result may reflect more on the sample of students rather than the whole NSCC population.

Perhaps the biggest limitation is the relatively small sample size of 40 students. Because of this, the means appear to be strongly influenced by just a few values, and the problem worsens when the dataset is split into subsets, which reduces the sample sizes even further.