Project #5 - The Full Picture: Inference on NSCC Students

Purpose – The Full Statistical Portrait of NSCC Students

In Project #4, we used the normal distribution to begin asking whether NSCC students differ from national averages. Now, armed with a full toolkit – t-tests, proportion tests, two-sample methods, and confidence intervals – we return to the NSCC Student Dataset to build a more complete statistical portrait. In this project, you will conduct multiple hypothesis tests and construct confidence intervals. For each inference question, you are responsible for identifying the correct type of inference to apply. Consider carefully: Is the variable of interest numeric or categorical? Are you comparing one group to a standard, or two groups to each other?

Preparation

Load the NSCC Student Dataset and familiarize yourself with its variables.

# Load the NSCC student dataset
nscc <- read.csv("C:/Users/ash91/OneDrive/Desktop/Honors_Stats/nscc_student_data.csv")

# Preview the structure of the dataset
str(nscc)

## 'data.frame':    40 obs. of  15 variables:
##  $ Gender      : chr  "Female" "Female" "Female" "Female" ...
##  $ PulseRate   : int  64 75 74 65 NA 72 72 60 66 60 ...
##  $ CoinFlip1   : int  5 4 6 4 NA 6 6 3 7 6 ...
##  $ CoinFlip2   : int  5 6 1 4 NA 5 6 5 8 5 ...
##  $ Height      : num  62 62 60 62 66 ...
##  $ ShoeLength  : num  11 11 10 10.8 NA ...
##  $ Age         : int  19 21 25 19 26 21 19 24 24 20 ...
##  $ Siblings    : int  4 3 2 1 6 1 2 2 3 1 ...
##  $ RandomNum   : int  797 749 13 613 53 836 423 16 12 543 ...
##  $ HoursWorking: int  35 25 30 18 24 15 20 0 40 30 ...
##  $ Credits     : int  13 12 6 9 15 9 15 15 13 16 ...
##  $ Birthday    : chr  "July 5" "December 27" "January 31" "6-13" ...
##  $ ProfsAge    : int  31 30 29 31 32 32 28 28 31 28 ...
##  $ Coffee      : chr  "No" "Yes" "Yes" "Yes" ...
##  $ VoterReg    : chr  "Yes" "Yes" "No" "Yes" ...

Variable Classification:

Question 1: How Old Are NSCC Students?

According to the American Association of Community Colleges (AACC), the national average age of community college students is 26 years old. Is the average age of students in this NSCC sample consistent with that national figure, or do NSCC students differ?

a. Write the hypotheses.

H0: \(\mu = 26\) H1: \(\mu \neq 26\)

b. Calculate the test statistic and p-value.

# Test Statistic

(mean(nscc$Age) - 26) / (sd(nscc$Age) / sqrt(length(nscc$Age)))

## [1] -1.12844

# P-Value

t.test(nscc$Age, alternative = "two.sided", mu = 26)$p.value

## [1] 0.2660281

c. Decision and Conclusion.

Since the p-value is greater than alpha (0.05), we fail to reject the null hypothesis (H0). There is insufficient evidence to suggest that the average age of students in this NSCC sample differs from the national average age of community college students.
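
As a hedged cross-check, the p-value in part b can be recovered from the test statistic alone. This sketch assumes all 40 ages are non-missing, so the degrees of freedom are 40 - 1 = 39; the variable names are illustrative and not part of the original analysis.

```r
# Reported test statistic from part b (rounded as printed)
t_stat <- -1.12844
df <- 40 - 1  # assumes n = 40 with no missing ages

# Two-sided p-value: double the tail area beyond |t|
p_value <- 2 * pt(-abs(t_stat), df = df)
p_value

# Critical value for a two-sided test at alpha = 0.05
t_crit <- qt(0.975, df = df)
abs(t_stat) < t_crit  # TRUE means we fail to reject H0
```

The result should agree with the `t.test()` p-value of about 0.266, up to rounding of the test statistic.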

Question 2: Are NSCC Students More Civically Engaged?

According to the U.S. Census Bureau, approximately 70% of eligible American adults are registered to vote. Do NSCC students register to vote at a higher rate than the general population?

a. Write the hypotheses.

H0: \(p = .7\) H1: \(p > .7\)

b. Calculate the test statistic and p-value.

# Store Sample Data

table(nscc$VoterReg)

## 
##  No Yes 
##   9  31

n <- 40
x <- 31

# Store Null Hypothesis

pnull <- .7

# Sample Proportion (p-hat)

phat <- x/n

# Standard Error

SE <- sqrt(pnull * (1 - pnull)/n)

# Test Statistic

teststat <- (phat - pnull) / SE

# P-Value 

pnorm(teststat, lower.tail = FALSE)

## [1] 0.1503115

# Prop Test

prop.test(x, n, p = 0.7, alternative = "greater", correct = FALSE)

## 
##  1-sample proportions test without continuity correction
## 
## data:  x out of n, null probability 0.7
## X-squared = 1.0714, df = 1, p-value = 0.1503
## alternative hypothesis: true p is greater than 0.7
## 95 percent confidence interval:
##  0.6510377 1.0000000
## sample estimates:
##     p 
## 0.775

c. Decision and Conclusion.

Since the p-value is greater than alpha (0.05), we fail to reject the null hypothesis. There is insufficient evidence to suggest that the NSCC students in this sample register to vote at a higher rate than the general population.
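
The normal approximation behind this z-test relies on the expected counts under H0 being large enough. A common rule of thumb is at least 10 expected successes and 10 expected failures; that threshold is an assumption here, since textbooks vary. A minimal sketch that checks the condition and shows that the squared z statistic matches the X-squared value reported by `prop.test()`:

```r
n <- 40    # sample size
x <- 31    # registered voters in the sample
p0 <- 0.7  # proportion under the null hypothesis

# Expected successes and failures under H0
expected <- c(yes = n * p0, no = n * (1 - p0))
all(expected >= 10)

# z statistic, its square, and the upper-tail p-value
z <- (x / n - p0) / sqrt(p0 * (1 - p0) / n)
z^2
pnorm(z, lower.tail = FALSE)
```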

Question 3: Do Male and Female Students Work Different Hours?

Balancing school and work is a reality for many community college students. Is there a significant difference in the average number of hours worked per week between male and female NSCC students?

a. Write the hypotheses.

Let \(\mu_F\) = mean hours worked per week for female NSCC students and \(\mu_M\) = mean hours worked per week for male NSCC students.

H0: \(\mu_F = \mu_M\) H1: \(\mu_F \neq \mu_M\)

b. Calculate the test statistic and p-value.

# Males Subset

nsccmales <- subset(nscc, nscc$Gender == "Male")

# Females Subset

nsccfemales <- subset(nscc, nscc$Gender == "Female")

# Male Sample Size

nm <- 13

# Female Sample Size 

nf <- 27

# Standard Error

se_hours <- sqrt((sd(nsccmales$HoursWorking)^2/nm)+(sd(nsccfemales$HoursWorking)^2/nf))

# Test Statistic 

ts_hours <- (mean(nsccmales$HoursWorking) - mean(nsccfemales$HoursWorking))/se_hours

# P-Value (two-sided: double the tail area beyond |t|)

2 * pt(-abs(ts_hours), df = 39)

## [1] 0.02975773

c. Decision and Conclusion.

Since the p-value is less than alpha (0.05), we reject the null hypothesis. There is sufficient evidence to suggest that there is a significant difference in the average number of hours worked per week between male and female NSCC students.
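
The p-value in part b depends on the degrees of freedom chosen for the t distribution (39 in the code above). As a hedged sensitivity check, this sketch recovers |t| from the reported p-value and recomputes the p-value with the conservative choice of df = min(n1, n2) - 1 = 12. The variable names are illustrative, and the conservative rule is one textbook convention rather than the only option.

```r
p_reported <- 0.02975773  # p-value printed in part b
df_used <- 39             # degrees of freedom used in part b

# Recover |t| from the reported two-sided p-value
t_abs <- abs(qt(p_reported / 2, df = df_used))

# Conservative degrees of freedom: smaller group size minus 1
df_conservative <- min(13, 27) - 1

# Two-sided p-value under the conservative choice
p_conservative <- 2 * pt(-t_abs, df = df_conservative)
c(t = t_abs, p_df39 = p_reported, p_df12 = p_conservative)
```

With the raw data loaded, `t.test(HoursWorking ~ Gender, data = nscc)` would run Welch's test, which estimates the degrees of freedom from the two sample variances instead of fixing them in advance.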

Question 4: Does Coffee Preference Differ by Gender?

Three out of four NSCC students drink coffee – but is that rate the same for men and women? Test whether there is a significant difference in the proportion of coffee drinkers between male and female students.

a. Write the hypotheses.

Let \(p_F\) = proportion of female NSCC students who drink coffee and \(p_M\) = proportion of male NSCC students who drink coffee.

H0: \(p_F = p_M\) H1: \(p_F \neq p_M\)

b. Calculate the test statistic and p-value.

# Proportion of NSCC Males Who Drink Coffee 

malescoffee <- subset(nscc, Gender == "Male")
table(malescoffee$Coffee)

## 
##  No Yes 
##   3  10

# Proportion of NSCC Females Who Drink Coffee

femalescoffee <- subset(nscc, Gender == "Female")
table(femalescoffee$Coffee)

## 
##  No Yes 
##   7  20

# Count and Sample Size for NSCC Male Coffee-Drinkers

xm <- 10
nm <- 10 + 3

# Count and Sample Size for NSCC Female Coffee-Drinkers

xf <- 20
nf <- 20 + 7

# Sample Proportions for NSCC Male Coffee-Drinkers

xm/nm

## [1] 0.7692308

# Sample Proportions for NSCC Female Coffee-Drinkers

xf/nf

## [1] 0.7407407

# Pooled Proportion

ppool <- (xm + xf)/(nm + nf)

# Standard Error (using ppool)

se <- sqrt((ppool*(1-ppool)/nm) + (ppool*(1-ppool)/nf))

# Test Statistic 

teststat <- (xm/nm - xf/nf)/se

# P-Value (two-sided: double the upper-tail area beyond |z|)

2 * pnorm(abs(teststat), lower.tail = FALSE)

## [1] 0.8454698

c. Decision and Conclusion.

Since the p-value is greater than alpha (0.05), we fail to reject the null hypothesis. There is insufficient evidence to suggest that there is a significant difference in the proportion of coffee-drinkers between male and female students at NSCC.
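
For a two-sided alternative, the p-value is twice the upper-tail area beyond |z|. As a hedged cross-check, the pooled z-test can be rebuilt from the counts in part b and compared with `prop.test()`, which reports the equivalent chi-squared test. The variable names below are illustrative.

```r
# Coffee drinkers and group sizes from the tables in part b
yes <- c(male = 10, female = 20)
size <- c(male = 13, female = 27)

# Pooled two-proportion z statistic
p_pool <- sum(yes) / sum(size)
se <- sqrt(p_pool * (1 - p_pool) * sum(1 / size))
z <- (yes[["male"]] / size[["male"]] - yes[["female"]] / size[["female"]]) / se

# Two-sided p-value from z
p_manual <- 2 * pnorm(abs(z), lower.tail = FALSE)

# Same test via prop.test(); the warning is suppressed because one
# expected count (male non-drinkers) is below 5
res <- suppressWarnings(prop.test(yes, size, correct = FALSE))
c(manual = p_manual, prop_test = res$p.value)
```

The two p-values should agree. The small expected count for male non-drinkers is a reason to treat the normal approximation with some caution for this sample.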

Question 5: How Many Credits Are NSCC Students Taking?

Rather than testing a specific claim, we want to estimate the true average number of credits taken per semester by all NSCC students.

a. Construct and interpret a 95% confidence interval for the mean credits taken per semester.

# 95% Confidence Interval for Mean Credits per Semester

t.test(nscc$Credits, conf.level = 0.95)$conf.int

## [1] 10.69715 12.85285
## attr(,"conf.level")
## [1] 0.95

b. Interpretation.

We are 95% confident that the true mean number of credits taken per semester by all NSCC students is between 10.7 and 12.9.

c. Follow-up question: Does this interval suggest students on average are taking a full-time load (defined as ≥ 12 credits)?

Not conclusively. Because 12 credits falls inside the 95% confidence interval (10.7 to 12.9), it is plausible that the true average is a full-time load. However, most of the interval lies below 12, so an average below a full-time load is also plausible. The interval therefore does not show that NSCC students are, on average, taking a full-time load.
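
A minimal sketch of how the interval in part a can be read: its midpoint is the sample mean and half its width is the margin of error. The values are copied from the printed output, and the variable names are illustrative.

```r
ci <- c(lower = 10.69715, upper = 12.85285)  # interval printed in part a

point_estimate <- mean(ci)      # the sample mean sits at the midpoint
margin <- unname(diff(ci)) / 2  # margin of error

# Is the full-time threshold of 12 credits inside the interval?
full_time <- 12
inside <- ci[["lower"]] <= full_time && full_time <= ci[["upper"]]

c(estimate = point_estimate, margin = margin)
inside
```

The point estimate falls below 12 while the interval still contains 12, which is why the data neither confirm nor rule out a full-time average load.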

Question 6: Your Turn – Choose Your Own Inference

Using any variable(s) in the NSCC Student Dataset that have not yet been analyzed in this project, formulate an original research question. Your question must be answerable with inference via a hypothesis test and/or confidence interval.

a. State your research question.

What is the 95% confidence interval estimating the true average pulse rate of NSCC students?

Is there a significant difference in the average pulse rate between male and female NSCC students?

b. Identify the type of inference and justify your choice.

A two-sample t-test is the appropriate method because it compares mean pulse rates between two independent groups: male and female NSCC students. Pulse rate is a numeric variable and the population standard deviations are unknown, so a t procedure is used rather than a z procedure.

c. Write hypotheses (or describe what you are estimating, if using a confidence interval).

Let \(\mu_F\) = mean pulse rate for female NSCC students and \(\mu_M\) = mean pulse rate for male NSCC students.

H0: \(\mu_F = \mu_M\) H1: \(\mu_F \neq \mu_M\)

d. Conduct the analysis.

# Males Subset

nsccmales <- subset(nscc, nscc$Gender == "Male")

# Females Subset

nsccfemales <- subset(nscc, nscc$Gender == "Female")

# Male Sample Size

table(is.na(nsccmales$PulseRate))

## 
## FALSE 
##    13

nm_pulse <- 13

# Female Sample Size 

table(is.na(nsccfemales$PulseRate))

## 
## FALSE  TRUE 
##    25     2

nf_pulse <- 25

# Standard Error

se_pulse <- sqrt((sd(nsccmales$PulseRate, na.rm = TRUE)^2/nm)+(sd(nsccfemales$PulseRate, na.rm = TRUE)^2/nf))

# Test Statistic 

teststat_p <- (mean(nsccmales$PulseRate, na.rm = TRUE) - mean(nsccfemales$PulseRate, na.rm = TRUE))/se_pulse

# P-Value (two-sided: double the tail area beyond |t|)

pt(-abs(teststat_p), df = 37)*2

## [1] 0.3481056

e. Conclusion.

Since the p-value is greater than alpha (0.05), we fail to reject the null hypothesis. There is insufficient evidence to suggest that there is a significant difference in the average pulse rate between male and female NSCC students.
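
When a variable has missing values, the standard error should divide by the number of non-missing observations in each group (13 males and 25 females for pulse rate, per the tables in part d), not by the full group sizes of 13 and 27 stored in `nm` and `nf`. A minimal sketch of a helper that counts non-missing values directly, so a sample size left over from an earlier question cannot slip in. The function name `se_diff` and the example vectors are hypothetical and are not taken from the NSCC data.

```r
# Hypothetical helper: standard error of a difference in means,
# counting only the non-missing observations in each group
se_diff <- function(x, y) {
  nx <- sum(!is.na(x))
  ny <- sum(!is.na(y))
  sqrt(var(x, na.rm = TRUE) / nx + var(y, na.rm = TRUE) / ny)
}

# Made-up illustration vectors (NOT the NSCC data)
a <- c(60, 72, 66, NA, 80)
b <- c(64, 75, NA, NA, 70, 68)
se_diff(a, b)
```

With the real data, the call would be `se_diff(nsccmales$PulseRate, nsccfemales$PulseRate)`. `t.test()` also drops missing values on its own when given the two vectors.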

Question 7: Reflection

Across this project, you conducted multiple hypothesis tests on the same dataset.

What is the risk of conducting many hypothesis tests on the same dataset? (Hint: think about what α = 0.05 means in terms of false positives.)

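Each test at α = 0.05 carries a 5% chance of rejecting a null hypothesis that is actually true (a Type I error, or false positive). Conducting many hypothesis tests on the same dataset compounds that risk: the more tests we run, the more likely it is that at least one "significant" result is a false positive.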
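
A minimal sketch of how quickly that risk grows, assuming the tests are independent and every null hypothesis is true. Independence is a simplifying assumption, since tests on the same 40 students are related. It also shows a Bonferroni adjustment as one common, conservative remedy.

```r
alpha <- 0.05
k <- 1:5  # this project ran five hypothesis tests (Questions 1-4 and 6)

# Chance of at least one false positive across k tests
fwer <- 1 - (1 - alpha)^k
round(fwer, 3)

# Bonferroni-adjusted per-test threshold for five tests
alpha / 5

# Bonferroni-adjusted p-value for the hours-worked test (Question 3)
p.adjust(0.02975773, method = "bonferroni", n = 5)
```

Under these assumptions, five tests give roughly a 23% chance of at least one false positive, and the hours-worked result would no longer fall below 0.05 after a Bonferroni adjustment.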

Choose a question in this project where you found a result that was statistically significant. Do you believe this result is practically significant? If so, what do you believe could be reasons for this result? Discuss.

In this project, the difference in average hours worked per week between male and female students was statistically significant. However, statistical significance alone does not make a result practically significant. Other factors, such as the number of credits a student is taking per semester, could affect a student’s availability to work, so hours worked per week may not depend on gender alone. Therefore, although the difference is statistically significant, I do not believe it is practically significant.

What is a limitation of the NSCC Student Dataset that affects all of the conclusions drawn in this project?

A limitation of the NSCC Student Dataset that affects every conclusion in this project is how the data were collected. The data may be self-reported, which can introduce inaccuracies, and not all students respond to surveys, so those who responded may differ systematically from those who did not. With only 40 students in the sample, it may therefore not be representative of the NSCC student population.

MAT143H - Introduction to Statistics Honors

Ashley Shepard

Due: TBD
