Project_5_Spring

title: "Project #5 - The Full Picture: Inference on NSCC Students"
subtitle: "MAT143H - Introduction to Statistics Honors"
author: "Lilyanna Romero"
date: "Due: Thursday, April 30"
output:
  html_document: default
  pdf_document: default

Purpose – The Full Statistical Portrait of NSCC Students

In Project #4, we used the normal distribution to begin asking whether NSCC students differ from national averages. Now, armed with a full toolkit – t-tests, proportion tests, two-sample methods, and confidence intervals – we return to the NSCC Student Dataset to build a more complete statistical portrait. In this project, you will conduct multiple hypothesis tests and construct confidence intervals. For each inference question, you are responsible for identifying the correct type of inference to apply. Consider carefully: Is the variable of interest numeric or categorical? Are you comparing one group to a standard, or two groups to each other?

Preparation

Load the NSCC Student Dataset and familiarize yourself with its variables.

# Load the NSCC student dataset
nscc <- read.csv("nscc_student_data.csv")

# Preview the structure of the dataset
str(nscc)

## 'data.frame':    40 obs. of  15 variables:
##  $ Gender      : chr  "Female" "Female" "Female" "Female" ...
##  $ PulseRate   : int  64 75 74 65 NA 72 72 60 66 60 ...
##  $ CoinFlip1   : int  5 4 6 4 NA 6 6 3 7 6 ...
##  $ CoinFlip2   : int  5 6 1 4 NA 5 6 5 8 5 ...
##  $ Height      : num  62 62 60 62 66 ...
##  $ ShoeLength  : num  11 11 10 10.8 NA ...
##  $ Age         : int  19 21 25 19 26 21 19 24 24 20 ...
##  $ Siblings    : int  4 3 2 1 6 1 2 2 3 1 ...
##  $ RandomNum   : int  797 749 13 613 53 836 423 16 12 543 ...
##  $ HoursWorking: int  35 25 30 18 24 15 20 0 40 30 ...
##  $ Credits     : int  13 12 6 9 15 9 15 15 13 16 ...
##  $ Birthday    : chr  "July 5" "December 27" "January 31" "6-13" ...
##  $ ProfsAge    : int  31 30 29 31 32 32 28 28 31 28 ...
##  $ Coffee      : chr  "No" "Yes" "Yes" "Yes" ...
##  $ VoterReg    : chr  "Yes" "Yes" "No" "Yes" ...

Variable Classification:

Question 1: How Old Are NSCC Students?

According to the American Association of Community Colleges (AACC), the national average age of community college students is 26 years old. Is the average age of students in this NSCC sample consistent with that national figure, or do NSCC students differ?

a. Write the hypotheses.

Type of Inference: One-sample t-test

H0: mu = 26

H1: mu != 26

b. Calculate the test statistic and p-value.

# Remove missing values from Age
age <- na.omit(nscc$Age)

# One-sample t-test of H0: mu = 26 (two-sided by default)
t.test(age, mu = 26)

## 
##  One Sample t-test
## 
## data:  age
## t = -1.1284, df = 39, p-value = 0.266
## alternative hypothesis: true mean is not equal to 26
## 95 percent confidence interval:
##  22.36979 27.03021
## sample estimates:
## mean of x 
##      24.7

c. Decision and Conclusion.

Decision: Using alpha = 0.05, the p-value (0.266) is greater than 0.05, so we fail to reject H0.

Conclusion: There is insufficient evidence to conclude that the average age of NSCC students differs from the national average of 26 years.
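To make the one-sample t-test less of a black box, the sketch below computes the test statistic and two-sided p-value by hand and compares them with `t.test()`. It uses only the first ten `Age` values visible in the `str()` preview, not the full sample, so its numbers will not match the output above; the names `ages_preview`, `t_stat`, and `p_value` are illustrative.

```r
# First ten Age values from the str() preview (NOT the full 40-student sample)
ages_preview <- c(19, 21, 25, 19, 26, 21, 19, 24, 24, 20)
mu0 <- 26  # hypothesized national mean age

n_obs <- length(ages_preview)
xbar  <- mean(ages_preview)
s     <- sd(ages_preview)

# t = (sample mean - hypothesized mean) / standard error
t_stat <- (xbar - mu0) / (s / sqrt(n_obs))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n_obs - 1)

# Built-in test on the same values for comparison
builtin <- t.test(ages_preview, mu = mu0)
```

The hand-computed `t_stat` and `p_value` should agree with `builtin$statistic` and `builtin$p.value`.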

Question 2: Are NSCC Students More Civically Engaged?

According to the U.S. Census Bureau, approximately 70% of eligible American adults are registered to vote. Do NSCC students register to vote at a higher rate than the general population?

a. Write the hypotheses.

Type of Inference: One-proportion z-test

H0: p = 0.70

H1: p > 0.70

b. Calculate the test statistic and p-value.

#Remove missing values from Voter Registration 
voter <- na.omit(nscc$VoterReg)

#Count number of "Yes" responses 
yes_count <- sum(voter == "Yes")

#Total number of responses
n <- length(voter)

#Perform one-proportion test
prop.test(yes_count, n, p = 0.70, alternative = "greater", correct = FALSE)

## 
##  1-sample proportions test without continuity correction
## 
## data:  yes_count out of n, null probability 0.7
## X-squared = 1.0714, df = 1, p-value = 0.1503
## alternative hypothesis: true p is greater than 0.7
## 95 percent confidence interval:
##  0.6510377 1.0000000
## sample estimates:
##     p 
## 0.775

c. Decision and Conclusion.

Decision: The p-value (0.1503) is greater than 0.05, so we fail to reject H0.

Conclusion: There is insufficient evidence to conclude that NSCC students register to vote at a higher rate than 70%.
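`prop.test()` reports an X-squared statistic rather than a z-score. As a rough check, the z statistic can be computed by hand; its square should reproduce X-squared. The counts below (31 "Yes" out of 40) are inferred from the reported sample proportion of 0.775 and the 40 observations, and the variable names are illustrative.

```r
# Counts inferred from the reported p-hat = 0.775 with 40 responses
yes_count <- 31
n_resp    <- 40
p0        <- 0.70  # null proportion

p_hat <- yes_count / n_resp

# Standard error under H0 uses the null proportion, not p_hat
se0 <- sqrt(p0 * (1 - p0) / n_resp)
z   <- (p_hat - p0) / se0

# One-sided (upper-tail) p-value for H1: p > 0.70
p_value <- pnorm(z, lower.tail = FALSE)

# Success-failure check: expected successes and failures under H0
expected_counts <- c(success = n_resp * p0, failure = n_resp * (1 - p0))
```

Here `z^2` matches the reported X-squared of 1.0714, and both expected counts are at least 10, so the normal approximation is reasonable.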

Question 3: Do Male and Female Students Work Different Hours?

Balancing school and work is a reality for many community college students. Is there a significant difference in the average number of hours worked per week between male and female NSCC students?

a. Write the hypotheses. Type of Inference: Two-sample t-test

Let \(\mu_F\) = mean hours worked per week for female NSCC students and \(\mu_M\) = mean hours worked per week for male NSCC students.

H0: mu_F = mu_M

H1: mu_F != mu_M

b. Calculate the test statistic and p-value.

#Extract hours worked for females (remove missing values)
female_hours <- na.omit(nscc$HoursWorking[nscc$Gender == "Female"])

#Extract hours worked for males (remove missing values)
male_hours <- na.omit(nscc$HoursWorking[nscc$Gender == "Male"])

#Perform two-sample t-test
t.test(female_hours, male_hours)

## 
##  Welch Two Sample t-test
## 
## data:  female_hours and male_hours
## t = 2.2559, df = 18.057, p-value = 0.03671
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   0.8204324 22.9858354
## sample estimates:
## mean of x mean of y 
##  29.51852  17.61538

c. Decision and Conclusion.

Decision: The p-value (0.03671) is less than 0.05, so we reject H0.

Conclusion: There is a statistically significant difference in average hours worked per week between male and female NSCC students. In this sample, female students worked more hours on average (29.5 vs. 17.6 hours per week).

Question 4: Does Coffee Preference Differ by Gender?

Three out of four NSCC students drink coffee – but is that rate the same for men and women? Test whether there is a significant difference in the proportion of coffee drinkers between male and female students.

a. Write the hypotheses.

Type of Inference: Two-proportion z-test

Let \(p_F\) = proportion of female NSCC students who drink coffee and \(p_M\) = proportion of male NSCC students who drink coffee.

H0: p_F = p_M H1: p_F != p_M

b. Calculate the test statistic and p-value.

#Create contingency table of Gender vs Coffee preference
coffee_table <- table(nscc$Gender, nscc$Coffee)

#Extract counts of "Yes"
female_yes <- coffee_table["Female","Yes"]
male_yes <- coffee_table["Male", "Yes"]

#Total counts for each group
female_total <- sum(coffee_table["Female",])
male_total <- sum(coffee_table["Male",])

#Perform two-proportion test
prop.test(c(female_yes, male_yes), c(female_total, male_total), correct = FALSE)

## Warning in prop.test(c(female_yes, male_yes), c(female_total, male_total), :
## Chi-squared approximation may be incorrect

## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  c(female_yes, male_yes) out of c(female_total, male_total)
## X-squared = 0.037987, df = 1, p-value = 0.8455
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.3109406  0.2539606
## sample estimates:
##    prop 1    prop 2 
## 0.7407407 0.7692308

c. Decision and Conclusion.

Decision: The p-value (0.8455) is greater than 0.05, so we fail to reject H0.

Conclusion: There is insufficient evidence to conclude that the proportion of coffee drinkers differs between male and female NSCC students.
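The warning "Chi-squared approximation may be incorrect" usually means that at least one expected cell count is below 5. The sketch below rebuilds the 2x2 table from the reported proportions (20 of 27 females and 10 of 13 males answering "Yes", which assumes no missing values in `Gender` or `Coffee`), inspects the expected counts, and runs Fisher's exact test as a cross-check. Fisher's test is a substitute technique shown for comparison, not the method the assignment asks for.

```r
# Counts inferred from the reported proportions 0.7407 (20/27) and 0.7692 (10/13)
coffee_counts <- matrix(
  c(20, 10,   # "Yes" column: Female, Male
     7,  3),  # "No" column:  Female, Male
  nrow = 2,
  dimnames = list(Gender = c("Female", "Male"), Coffee = c("Yes", "No"))
)

# Expected counts under H0 (equal proportions); the small-count warning is suppressed
expected <- suppressWarnings(chisq.test(coffee_counts, correct = FALSE))$expected

# Fisher's exact test does not rely on the large-sample approximation
fisher_result <- fisher.test(coffee_counts)

# Same two-proportion test as above, from the reconstructed counts
prop_result <- suppressWarnings(
  prop.test(coffee_counts[, "Yes"], rowSums(coffee_counts), correct = FALSE)
)
```

The smallest expected count is for males answering "No" (13 x 10 / 40 = 3.25), which is what triggers the warning. Comparing `fisher_result$p.value` with the `prop.test()` p-value shows whether the conclusion depends on the approximation.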

Question 5: How Many Credits Are NSCC Students Taking?

Rather than testing a specific claim, we want to estimate the true average number of credits taken per semester by all NSCC students.

Type of Inference: One-sample t-interval

a. Construct and interpret a 95% confidence interval for the mean credits taken per semester.

#Remove missing values from Credits 
credits <- na.omit(nscc$Credits)

#Compute 95% confidence interval 
t.test(credits, conf.level = 0.95)

## 
##  One Sample t-test
## 
## data:  credits
## t = 22.097, df = 39, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  10.69715 12.85285
## sample estimates:
## mean of x 
##    11.775

b. Interpretation. We are 95% confident that the true mean number of credits taken per semester by NSCC students is between 10.697 and 12.853 credits.

c. Follow-up question: Does this interval suggest students on average are taking a full-time load (defined as ≥ 12 credits)?

Not conclusively. The interval contains 12, so a mean of 12 credits is plausible, but it also contains values below 12, and the sample mean (11.775) is itself slightly under 12. The data are therefore consistent with a full-time average load but do not establish that the mean is at least 12 credits.
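The interval can be reconstructed from the reported statistics, which makes it easier to see where 12 credits sits inside it. Because `t.test()` was called without `mu`, the reported t = 22.097 tests against 0, so the standard error is the sample mean divided by t. The sketch below uses only numbers printed in the output above; the variable names are illustrative.

```r
# Values copied from the t.test() output above
xbar   <- 11.775   # sample mean credits
t_stat <- 22.097   # t statistic against mu = 0
df     <- 39       # degrees of freedom (n - 1)

# Standard error recovered from t = xbar / se
se <- xbar / t_stat

# 95% interval: mean +/- critical t value * standard error
t_crit <- qt(0.975, df)
ci     <- xbar + c(-1, 1) * t_crit * se

# How many standard errors the sample mean lies from 12 credits
t_vs_12 <- (xbar - 12) / se
```

The reconstructed `ci` agrees with the reported interval up to rounding, and `t_vs_12` is well under one standard error in magnitude, so the sample cannot distinguish a mean of 12 from values slightly above or below it.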

Question 6: Your Turn – Choose Your Own Inference

Using any variable(s) in the NSCC Student Dataset that have not yet been analyzed in this project, formulate an original research question. Your question must be answerable with inference via a hypothesis test and/or confidence interval.

a. State your research question.

Do male and female NSCC students differ in height?

b. Identify the type of inference and justify your choice.

Two-sample t-test, because we are comparing the means of a quantitative variable (height) between two independent groups.

c. Write hypotheses (or describe what you are estimating, if using a confidence interval).

H0: mu_F = mu_M

H1: mu_F != mu_M

d. Conduct the analysis.

#Extract heights for females (remove missing values)
female_height <- na.omit(nscc$Height[nscc$Gender == "Female"])

#Extract heights for males (remove missing values)
male_height <- na.omit(nscc$Height[nscc$Gender == "Male"])

#Perform two-sample t-test
t.test(female_height, male_height)

## 
##  Welch Two Sample t-test
## 
## data:  female_height and male_height
## t = -0.38388, df = 12.379, p-value = 0.7076
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -13.005955   9.098262
## sample estimates:
## mean of x mean of y 
##  63.87308  65.82692

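e. Conclusion.

Decision: The p-value (0.7076) is greater than 0.05, so we fail to reject H0.

Conclusion: There is insufficient evidence to conclude that mean height differs between male and female NSCC students. The 95% confidence interval for the difference (-13.0 to 9.1) is very wide, so this sample gives little precision about the true difference.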

Question 7: Reflection

Across this project, you conducted multiple hypothesis tests on the same dataset.

What is the risk of conducting many hypothesis tests on the same dataset? (Hint: think about what α = 0.05 means in terms of false positives.)

Conducting multiple hypothesis tests increases the probability of at least one false positive (Type I error). Each test run at α = 0.05 has a 5% chance of incorrectly rejecting a true null hypothesis, so the more tests we perform, the greater the overall chance that at least one result appears statistically significant purely by chance.
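A rough calculation shows how quickly this risk grows. The sketch below assumes the five hypothesis tests in this project (Questions 1, 2, 3, 4, and 6) are independent, which is a simplification since they share one dataset. It also applies a Bonferroni adjustment to the reported p-values as one common correction, not something the assignment requires.

```r
alpha <- 0.05
k     <- 5   # hypothesis tests in Questions 1, 2, 3, 4, and 6

# Chance of at least one false positive if all k null hypotheses are true
# (assumes independent tests, which is only an approximation here)
fwer <- 1 - (1 - alpha)^k

# Bonferroni: compare each p-value to alpha / k instead of alpha
bonferroni_alpha <- alpha / k

# p-values copied from the test output in this project
p_values   <- c(Q1 = 0.266, Q2 = 0.1503, Q3 = 0.03671, Q4 = 0.8455, Q6 = 0.7076)
p_adjusted <- p.adjust(p_values, method = "bonferroni")
```

Under these assumptions the family-wise error rate is about 0.226 rather than 0.05, and after the Bonferroni adjustment none of the five p-values stays below 0.05.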

Choose a question in this project where you found a result that was statistically significant. Do you believe this result is practically significant? If so, what do you believe could be reasons for this result? Discuss.

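In Question 3, we found a statistically significant difference in average hours worked per week between female and male students (29.5 vs. 17.6 hours, p = 0.037). A gap of roughly 12 hours per week would be practically significant, since it is more than a full extra workday on top of coursework. However, the 95% confidence interval for the difference is wide (about 0.8 to 23.0 hours), so the true gap could be anywhere from negligible to very large. Possible reasons include differences in the types of jobs held or in financial and family obligations, but the dataset does not record these, so they remain speculation. Because this was the only significant result among five tests, it could also be a false positive.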

What is a limitation of the NSCC Student Dataset that affects all of the conclusions drawn in this project?

Major limitations of the NSCC dataset are its small sample size (40 students), the fact that it is not a random sample, and its reliance on self-reported data with some missing values. These factors limit how confidently we can generalize the results to all NSCC students.