Introduction:
Background Information
Questions
Data
Analysis:
Question 1
Question 2
Question 3
Question 4
Question 5
Question 6
Question 7
Question 8
Question 9
Question 10
Conclusion:
References:
Background Information:
This project evaluates the sleep patterns of college students. Students were followed over a two-week period and asked to keep a diary recording the time and quality of their sleep. After the two-week period, students completed a survey about their attitudes and habits during the sleep study.
Questions:
To better understand the collected data, 10 questions will be evaluated to provide more insight into college students' well-being, school performance, and other aspects of their day-to-day lives.
1. Is there a significant difference in the average GPA between male and female college students?
2. Is there a significant difference in the average number of early classes between the first two class years and other class years?
3. Do students who identify as “larks” have significantly better cognitive skills (cognition z-score) compared to “owls”?
4. Is there a significant difference in the average number of classes missed in a semester between students who had at least one early class (EarlyClass=1) and those who didn’t (EarlyClass=0)?
5. Is there a significant difference in the average happiness level between students with at least moderate depression and normal depression status?
6. Is there a significant difference in average sleep quality scores between students who reported having at least one all-nighter (AllNighter=1) and those who didn’t (AllNighter=0)?
7. Do students who abstain from alcohol use have significantly better stress scores than those who report heavy alcohol use?
8. Is there a significant difference in the average number of drinks per week between students of different genders?
9. Is there a significant difference in the average weekday bedtime between students with high and normal stress (Stress=High vs. Stress=Normal)?
10. Is there a significant difference in the average hours of sleep on weekends between students in their first two years and other students?
Data:
The dataset contains 27 variables recorded for 253 college students. The 27 variables include the following:
# Summary of GPA by gender
myData %>%
group_by(Gender) %>%
summarize(
n = n(),
mean_GPA = mean(GPA, na.rm = TRUE),
sd_GPA = sd(GPA, na.rm = TRUE)
)
## # A tibble: 2 × 4
## Gender n mean_GPA sd_GPA
## <int> <int> <dbl> <dbl>
## 1 0 151 3.32 0.375
## 2 1 102 3.12 0.418
# Gender is coded 0/1 (numeric), so convert it to a factor to get one box per group
ggplot(myData, aes(x = factor(Gender), y = GPA, fill = factor(Gender))) +
  geom_boxplot() +
  labs(
    title = "GPA Distribution by Gender",
    x = "Gender (0 = Female, 1 = Male)",
    y = "GPA"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
# Run independent samples t-test
t_test_result <- t.test(GPA ~ Gender, data = myData, var.equal = FALSE)
t_test_result
##
## Welch Two Sample t-test
##
## data: GPA by Gender
## t = 3.9139, df = 200.9, p-value = 0.0001243
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 0.09982254 0.30252780
## sample estimates:
## mean in group 0 mean in group 1
## 3.324901 3.123725
# Extract p-value
t_test_result$p.value
## [1] 0.000124298
Comment: In the box plot, Gender = 0 represents females and Gender = 1 represents males.
Because the p-value is close to zero (\(p \approx 0.0001\)), we reject the null hypothesis of equal mean GPAs; a difference this large is unlikely to be due to random variation alone. Females have a higher mean GPA (3.32) than males (3.12), and the 95% confidence interval for the difference runs from about 0.10 to 0.30 GPA points.
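As a rough cross-check of the Welch test above, the t statistic and degrees of freedom can be recomputed from the group summaries alone. This is a minimal sketch: `welch_from_summary()` is a hypothetical helper introduced here for illustration, and because it uses the rounded standard deviations from the summary table, its results should be close to, but not exactly equal to, the `t.test()` output.

```r
# Welch two-sample t-test computed from summary statistics only.
# welch_from_summary() is a hypothetical helper, not part of the original analysis.
welch_from_summary <- function(m1, s1, n1, m2, s2, n2) {
  v1 <- s1^2 / n1                      # squared standard error, group 1
  v2 <- s2^2 / n2                      # squared standard error, group 2
  t_stat <- (m1 - m2) / sqrt(v1 + v2)  # Welch t statistic
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  p <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)  # two-sided p-value
  list(t = t_stat, df = df, p.value = p)
}

# Means from the t-test output, rounded SDs and group sizes from the summary table
q1 <- welch_from_summary(3.324901, 0.375, 151, 3.123725, 0.418, 102)
q1
```

The same helper can be applied to any of the other two-group comparisons in this report by substituting the relevant means, standard deviations, and group sizes.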
# Create Grouping Variable: First Two Years vs. Other Years
myData <- myData %>%
mutate(YearGroup = ifelse(ClassYear %in% c(1, 2),
"FirstTwo",
"OtherYears"))
table(myData$YearGroup)
##
## FirstTwo OtherYears
## 142 111
# Summary Statistics:
myData %>%
group_by(YearGroup) %>%
summarize(
n = n(),
mean_early = mean(NumEarlyClass, na.rm = TRUE),
sd_early = sd(NumEarlyClass, na.rm = TRUE)
)
## # A tibble: 2 × 4
## YearGroup n mean_early sd_early
## <chr> <int> <dbl> <dbl>
## 1 FirstTwo 142 2.07 1.66
## 2 OtherYears 111 1.31 1.25
# Visualization:
ggplot(myData, aes(x = YearGroup, y = NumEarlyClass, fill = YearGroup)) +
geom_boxplot(alpha = 0.8) +
labs(
title = "Number of Early Classes by Class Year Group",
x = "Class Year Group",
y = "Number of Early Classes"
) +
theme_minimal() +
theme(legend.position = "none")
# Two-Sample T-Test:
t_test_result <- t.test(NumEarlyClass ~ YearGroup, data = myData)
t_test_result
##
## Welch Two Sample t-test
##
## data: NumEarlyClass by YearGroup
## t = 4.1813, df = 250.69, p-value = 4.009e-05
## alternative hypothesis: true difference in means between group FirstTwo and group OtherYears is not equal to 0
## 95 percent confidence interval:
## 0.4042016 1.1240309
## sample estimates:
## mean in group FirstTwo mean in group OtherYears
## 2.070423 1.306306
Comment: Because the p-value is very small (\(p \approx 0.00004\)), we reject the null hypothesis and conclude that there is a statistically significant difference in the average number of early classes between students in the first two class years and students in other class years.
The box plot is consistent with this finding: students in the first two class years tend to have more early classes (mean 2.07) than students in other class years (mean 1.31).
# Compare cognition scores of Larks vs Owls
# Filter to only Lark and Owl (remove "Neither")
cogData <- myData %>%
filter(LarkOwl %in% c("Lark", "Owl")) %>%
droplevels()
table(cogData$LarkOwl)
##
## Lark Owl
## 41 49
# Summary statistics
cogData %>%
group_by(LarkOwl) %>%
summarize(
n = n(),
mean_cog = mean(CognitionZscore, na.rm = TRUE),
sd_cog = sd(CognitionZscore, na.rm = TRUE)
)
## # A tibble: 2 × 4
## LarkOwl n mean_cog sd_cog
## <chr> <int> <dbl> <dbl>
## 1 Lark 41 0.0902 0.830
## 2 Owl 49 -0.0384 0.653
# Visualization
ggplot(cogData, aes(x = LarkOwl, y = CognitionZscore, fill = LarkOwl)) +
geom_boxplot(alpha = 0.8) +
labs(
title = "Cognition Z-Score by Chronotype",
x = "Chronotype (Lark vs. Owl)",
y = "Cognition Z-Score"
) +
theme_minimal() +
theme(legend.position = "none")
# Test if Larks have significantly higher cognition scores than Owls
t_test_result <- t.test(CognitionZscore ~ LarkOwl,
data = cogData,
alternative = "greater") # one-tailed
t_test_result
##
## Welch Two Sample t-test
##
## data: CognitionZscore by LarkOwl
## t = 0.80571, df = 75.331, p-value = 0.2115
## alternative hypothesis: true difference in means between group Lark and group Owl is greater than 0
## 95 percent confidence interval:
## -0.1372184 Inf
## sample estimates:
## mean in group Lark mean in group Owl
## 0.09024390 -0.03836735
t.test(CognitionZscore ~ LarkOwl, data = cogData)
##
## Welch Two Sample t-test
##
## data: CognitionZscore by LarkOwl
## t = 0.80571, df = 75.331, p-value = 0.4229
## alternative hypothesis: true difference in means between group Lark and group Owl is not equal to 0
## 95 percent confidence interval:
## -0.1893561 0.4465786
## sample estimates:
## mean in group Lark mean in group Owl
## 0.09024390 -0.03836735
Comment: The box plot shows substantial overlap between the two groups (larks and owls), which suggests little difference in cognition z-scores between students who wake up early and students who stay up late.
The one-sided test, which matches the question of whether larks score higher, gives \(p \approx 0.21\) (the two-sided test gives \(p \approx 0.42\)), so we do not reject the null hypothesis. The small observed difference in means (0.09 vs. -0.04) is consistent with random variation.
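The two p-values reported for this question come from the same t statistic; only the tail area differs. The sketch below recomputes both from the reported `t` and `df` using base R's `pt()`. The variable names are illustrative, and the inputs are the rounded values printed in the output above.

```r
# Reported Welch statistic and degrees of freedom for the lark vs. owl comparison
t_stat <- 0.80571
df <- 75.331

# One-sided (alternative = "greater"): area to the right of t
p_one_sided <- pt(t_stat, df = df, lower.tail = FALSE)

# Two-sided: area in both tails beyond |t|
p_two_sided <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

c(one_sided = p_one_sided, two_sided = p_two_sided)
```

Because the observed difference is in the hypothesized direction (t > 0), the one-sided p-value is half of the two-sided one.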
# Summary Statistics by Early Class:
myData %>%
group_by(EarlyClass) %>%
summarize(
n = n(),
mean_missed = mean(ClassesMissed, na.rm = TRUE),
sd_missed = sd(ClassesMissed, na.rm = TRUE)
)
## # A tibble: 2 × 4
## EarlyClass n mean_missed sd_missed
## <int> <int> <dbl> <dbl>
## 1 0 85 2.65 3.48
## 2 1 168 1.99 3.10
# Visualization:
ggplot(myData, aes(x = factor(EarlyClass), y = ClassesMissed, fill = factor(EarlyClass))) +
geom_boxplot(alpha = 0.8) +
labs(
title = "Classes Missed by Early Class Attendance",
x = "Early Class (0 = No, 1 = Yes)",
y = "Number of Classes Missed"
) +
theme_minimal() +
theme(legend.position = "none")
# Two-Sample T-Test:
t_test_result <- t.test(ClassesMissed ~ EarlyClass, data = myData)
t_test_result
##
## Welch Two Sample t-test
##
## data: ClassesMissed by EarlyClass
## t = 1.4755, df = 152.78, p-value = 0.1421
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -0.2233558 1.5412830
## sample estimates:
## mean in group 0 mean in group 1
## 2.647059 1.988095
Comment: Based on the p-value (\(p \approx 0.14\)), we do not reject the null hypothesis: there is not enough evidence of a difference in the average number of classes missed between students with at least one early class and those with no early classes.
The box plot is consistent with this result, showing substantial overlap between the two groups.
# Filter to Normal vs. Moderate Depression:
happinessData <- myData %>%
filter(DepressionStatus %in% c("normal", "moderate")) %>%
droplevels()
table(happinessData$DepressionStatus)
##
## moderate normal
## 34 209
# Summary Statistics:
happinessData %>%
group_by(DepressionStatus) %>%
summarize(
n = n(),
mean_happiness = mean(Happiness, na.rm = TRUE),
sd_happiness = sd(Happiness, na.rm = TRUE)
)
## # A tibble: 2 × 4
## DepressionStatus n mean_happiness sd_happiness
## <chr> <int> <dbl> <dbl>
## 1 moderate 34 23.1 4.97
## 2 normal 209 27.1 4.88
# Visualization:
ggplot(happinessData, aes(x = DepressionStatus, y = Happiness, fill = DepressionStatus)) +
geom_boxplot(alpha = 0.8) +
labs(
title = "Happiness Level by Depression Status",
x = "Depression Status",
y = "Happiness Level"
) +
theme_minimal() +
theme(legend.position = "none")
# Two-Sample T-Test:
t_test_result <- t.test(Happiness ~ DepressionStatus,
data = happinessData)
t_test_result
##
## Welch Two Sample t-test
##
## data: Happiness by DepressionStatus
## t = -4.3253, df = 43.992, p-value = 8.616e-05
## alternative hypothesis: true difference in means between group moderate and group normal is not equal to 0
## 95 percent confidence interval:
## -5.818614 -2.119748
## sample estimates:
## mean in group moderate mean in group normal
## 23.08824 27.05742
Comment: Based on the p-value (\(p \approx 0.00008\)), we reject the null hypothesis and conclude that students with moderate depression have significantly lower happiness levels than students with normal depression status. Note that the filter keeps only the moderate (n = 34) and normal (n = 209) groups, so the 10 students in neither category are excluded from this comparison.
Looking at the box plot, we can see that students with moderate depression tend to have lower happiness levels than students with normal depression status.
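The question asks about students with at least moderate depression, while the filter above keeps only the "moderate" and "normal" categories. One way to include every non-normal category is to collapse them into a single group before testing. The sketch below uses a small made-up data frame, not the study data, and the "severe" label is an assumption about what the remaining category is called.

```r
# Toy data invented purely for illustration; these are NOT values from the study.
# The "severe" level is an assumed name for the category the filter left out.
toy <- data.frame(
  DepressionStatus = c("normal", "moderate", "severe", "normal", "moderate"),
  Happiness = c(28, 22, 18, 26, 24)
)

# Collapse every non-"normal" status into one "at_least_moderate" group
toy$DepGroup <- ifelse(toy$DepressionStatus == "normal",
                       "normal",
                       "at_least_moderate")

table(toy$DepGroup)
```

With the real data, the same `ifelse()` recoding could replace the `filter()` step, and the t-test would then use the recoded grouping variable in place of `DepressionStatus`.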
# Summary Statistics by AllNighter
myData %>%
group_by(AllNighter) %>%
summarize(
n = n(),
mean_sleep = mean(PoorSleepQuality, na.rm = TRUE),
sd_sleep = sd(PoorSleepQuality, na.rm = TRUE)
)
## # A tibble: 2 × 4
## AllNighter n mean_sleep sd_sleep
## <int> <int> <dbl> <dbl>
## 1 0 219 6.14 2.92
## 2 1 34 7.03 2.82
# Visualization:
ggplot(myData, aes(x = factor(AllNighter), y = PoorSleepQuality, fill = factor(AllNighter))) +
geom_boxplot(alpha = 0.8) +
labs(
title = "Sleep Quality by All-Nighter Status",
x = "Had at Least One All-Nighter (0 = No, 1 = Yes)",
y = "Poor Sleep Quality Score"
) +
theme_minimal() +
theme(legend.position = "none")
# Two-Sample T-Test:
t_test_result <- t.test(PoorSleepQuality ~ AllNighter, data = myData)
t_test_result
##
## Welch Two Sample t-test
##
## data: PoorSleepQuality by AllNighter
## t = -1.7068, df = 44.708, p-value = 0.09479
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -1.9456958 0.1608449
## sample estimates:
## mean in group 0 mean in group 1
## 6.136986 7.029412
Comment: Based on the p-value (\(p \approx 0.09\)), we do not reject the null hypothesis: there is not enough evidence that students who pulled at least one all-nighter had different sleep quality scores from students who did not.
The box plot shows substantial overlap between the two groups, which is consistent with this result.
# Filter to Abstain vs Heavy Alcohol Use:
stressData <- myData %>%
filter(AlcoholUse %in% c("Abstain", "Heavy")) %>%
droplevels()
table(stressData$AlcoholUse)
##
## Abstain Heavy
## 34 16
# Summary Statistics:
stressData %>%
group_by(AlcoholUse) %>%
summarize(
n = n(),
mean_stress = mean(StressScore, na.rm = TRUE),
sd_stress = sd(StressScore, na.rm = TRUE)
)
## # A tibble: 2 × 4
## AlcoholUse n mean_stress sd_stress
## <chr> <int> <dbl> <dbl>
## 1 Abstain 34 8.97 7.58
## 2 Heavy 16 10.4 7.80
# Visualization:
ggplot(stressData, aes(x = AlcoholUse, y = StressScore, fill = AlcoholUse)) +
geom_boxplot(alpha = 0.8) +
labs(
title = "Stress Score by Alcohol Use",
x = "Alcohol Use",
y = "Stress Score"
) +
theme_minimal() +
theme(legend.position = "none")
# Two-Sample T-Test:
t_test_result <- t.test(StressScore ~ AlcoholUse, data = stressData)
t_test_result
##
## Welch Two Sample t-test
##
## data: StressScore by AlcoholUse
## t = -0.62604, df = 28.733, p-value = 0.5362
## alternative hypothesis: true difference in means between group Abstain and group Heavy is not equal to 0
## 95 percent confidence interval:
## -6.261170 3.327346
## sample estimates:
## mean in group Abstain mean in group Heavy
## 8.970588 10.437500
Comment: Based on the p-value (\(p \approx 0.54\)), we do not reject the null hypothesis: there is not enough evidence that students who abstain from alcohol have different stress scores from students who report heavy alcohol use. Both groups are small (34 and 16 students), so the confidence interval for the difference is wide.
The box plot shows considerable overlap between the two groups, which is consistent with this result.
# Summary Statistics by Gender:
myData %>%
group_by(Gender) %>%
summarize(
n = n(),
mean_drinks = mean(Drinks, na.rm = TRUE),
sd_drinks = sd(Drinks, na.rm = TRUE)
)
## # A tibble: 2 × 4
## Gender n mean_drinks sd_drinks
## <int> <int> <dbl> <dbl>
## 1 0 151 4.24 2.72
## 2 1 102 7.54 4.93
# Visualization:
ggplot(myData, aes(x = factor(Gender), y = Drinks, fill = factor(Gender))) +
geom_boxplot(alpha = 0.8) +
labs(
title = "Number of Drinks per Week by Gender",
x = "Gender (0 = Female, 1 = Male)",
y = "Number of Drinks per Week"
) +
theme_minimal() +
theme(legend.position = "none")
# Two-Sample T-Test:
t_test_result <- t.test(Drinks ~ Gender, data = myData)
t_test_result
##
## Welch Two Sample t-test
##
## data: Drinks by Gender
## t = -6.1601, df = 142.75, p-value = 7.002e-09
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -4.360009 -2.241601
## sample estimates:
## mean in group 0 mean in group 1
## 4.238411 7.539216
Comment: Because the p-value is extremely small (\(p \approx 0.000000007\)), we reject the null hypothesis and conclude that male students consume significantly more drinks per week than female students (mean 7.54 vs. 4.24).
The box plot is consistent with this finding: male students tend to report more drinks per week than female students.
# Filter to High and Normal Stress:
SleepData <- myData %>%
filter(Stress %in% c("high", "normal")) %>%
droplevels() # no effect here: Stress is a character column, so levels() below returns NULL
levels(SleepData$Stress)
## NULL
table(SleepData$Stress)
##
## high normal
## 56 197
# Visualization:
ggplot(SleepData, aes(x = Stress, y = WeekdayBed, fill = Stress)) +
geom_boxplot(alpha = 0.6) +
labs(
title = "Weekday Bedtime by Stress Level",
x = "Stress Level",
y = "Weekday Bedtime (hours)"
) +
theme_minimal() +
theme(legend.position = "none")
# Two-Sample T-Test:
t.test(WeekdayBed ~ Stress, data = SleepData, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: WeekdayBed by Stress
## t = -1.0746, df = 87.048, p-value = 0.2855
## alternative hypothesis: true difference in means between group high and group normal is not equal to 0
## 95 percent confidence interval:
## -0.4856597 0.1447968
## sample estimates:
## mean in group high mean in group normal
## 24.71500 24.88543
Comment: Based on the p-value (\(p \approx 0.29\)), we do not reject the null hypothesis: there is no statistically significant difference in the average weekday bedtime between students with high stress and students with normal stress.
The box plot shows that the two groups overlap substantially, which is consistent with this result.
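The group means of about 24.7 and 24.9 are easier to read as clock times. The sketch below assumes that `WeekdayBed` is recorded in decimal hours with 24.0 meaning midnight; that coding is an assumption, not something stated in this report. `to_clock()` is a hypothetical helper introduced here for illustration.

```r
# Convert decimal hours (assumed coding: 24.0 = midnight) to "HH:MM" clock time.
# to_clock() is a hypothetical helper, not part of the original analysis.
to_clock <- function(hours) {
  # Round to whole minutes first, then wrap around a 24-hour day
  total_min <- round(hours * 60) %% (24 * 60)
  sprintf("%02d:%02d",
          as.integer(total_min %/% 60),
          as.integer(total_min %% 60))
}

# Group means from the t-test output (high stress, normal stress)
to_clock(c(24.71500, 24.88543))
```

Under that assumption, both group means fall a little before 1 a.m., roughly ten minutes apart.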
# Create a new grouping variable:
sleep_clean <- myData %>%
filter(!is.na(WeekendSleep), !is.na(ClassYear)) %>%
mutate(YearGroup = ifelse(ClassYear %in% c(1, 2),
"FirstTwoYears",
"OtherStudents"))
table(sleep_clean$YearGroup)
##
## FirstTwoYears OtherStudents
## 142 111
# Visualization
ggplot(sleep_clean, aes(x = YearGroup, y = WeekendSleep, fill = YearGroup)) +
geom_boxplot(alpha = 0.7) +
labs(title = "Weekend Sleep Hours by Year Group",
x = "Student Group",
y = "Average Weekend Sleep (hours)") +
theme_minimal()
# Two-Sample T-Test:
t.test(WeekendSleep ~ YearGroup, data = sleep_clean)
##
## Welch Two Sample t-test
##
## data: WeekendSleep by YearGroup
## t = -0.047888, df = 237.36, p-value = 0.9618
## alternative hypothesis: true difference in means between group FirstTwoYears and group OtherStudents is not equal to 0
## 95 percent confidence interval:
## -0.3497614 0.3331607
## sample estimates:
## mean in group FirstTwoYears mean in group OtherStudents
## 8.213592 8.221892
Comment: Based on the p-value (\(p \approx 0.96\)), we do not reject the null hypothesis: there is no evidence of a difference in average weekend sleep hours between students in their first two years and other students (means of 8.21 and 8.22 hours).
The box plot shows considerable overlap between the two groups, which is consistent with this result.
Conclusion:
This section summarizes the findings for each of the 10 questions evaluated in the Analysis section.
Question 1: Is there a significant difference in the average GPA between male and female college students?
Conclusion: Because the p-value is close to zero (\(p \approx 0.0001\)), we reject the null hypothesis of equal mean GPAs; a difference this large is unlikely to be due to random variation alone. Females have a higher mean GPA (3.32) than males (3.12).
Question 2: Is there a significant difference in the average number of early classes between the first two class years and other class years?
Conclusion: Because the p-value is very small (\(p \approx 0.00004\)), we reject the null hypothesis and conclude that there is a statistically significant difference in the average number of early classes between students in the first two class years and students in other class years.
Question 3: Do students who identify as “larks” have significantly better cognitive skills (cognition z-score) compared to “owls”?
Conclusion: Based on the one-sided p-value (\(p \approx 0.21\)), we do not reject the null hypothesis: there is not enough evidence that larks have higher cognition z-scores than owls. The box plot shows substantial overlap between the two groups, which is consistent with this result.
Question 4: Is there a significant difference in the average number of classes missed in a semester between students who had at least one early class (EarlyClass=1) and those who didn’t (EarlyClass=0)?
Conclusion: Based on the p-value (\(p \approx 0.14\)), we do not reject the null hypothesis: there is not enough evidence of a difference in the average number of classes missed between students with at least one early class and those with no early classes.
Question 5: Is there a significant difference in the average happiness level between students with at least moderate depression and normal depression status?
Conclusion: Based on the p-value (\(p \approx 0.00008\)), we reject the null hypothesis and conclude that students with moderate depression have significantly lower happiness levels than students with normal depression status.
Question 6: Is there a significant difference in average sleep quality scores between students who reported having at least one all-nighter (AllNighter=1) and those who didn’t (AllNighter=0)?
Conclusion: Based on the p-value (\(p \approx 0.09\)), we do not reject the null hypothesis and say that there is not enough evidence to support that students who pulled an all-nighter had different sleep quality than students who did not.
Question 7: Do students who abstain from alcohol use have significantly better stress scores than those who report heavy alcohol use?
Conclusion: Based on the p-value (\(p \approx 0.54\)), we do not reject the null hypothesis: there is not enough evidence that students who abstain from alcohol have different stress scores from students who report heavy alcohol use.
Question 8: Is there a significant difference in the average number of drinks per week between students of different genders?
Conclusion: Because the p-value is extremely small (\(p \approx 0.000000007\)), we reject the null hypothesis and conclude that male students consume significantly more drinks per week than female students.
Question 9: Is there a significant difference in the average weekday bedtime between students with high and normal stress (Stress=High vs. Stress=Normal)?
Conclusion: Based on the p-value (\(p \approx 0.29\)), we do not reject the null hypothesis: there is no statistically significant difference in the average weekday bedtime between students with high stress and students with normal stress.
Question 10: Is there a significant difference in the average hours of sleep on weekends between students in their first two years and other students?
Conclusion: Based on the p-value (\(p \approx 0.96\)), we do not reject the null hypothesis: there is no evidence of a difference in average weekend sleep hours between students in their first two years and other students.
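The ten decisions can be checked in one place by comparing each reported p-value with a significance level. The sketch below assumes \(\alpha = 0.05\), which the report does not state explicitly, and uses the one-sided p-value for Question 3 because that question is directional. The p-values are copied from the t-test output in the Analysis section; the object names are illustrative.

```r
# Assumed significance level; the report does not state one explicitly
alpha <- 0.05

# p-values copied from the t-test output for Questions 1 to 10
# (Question 3 uses the one-sided p-value)
results <- data.frame(
  question = 1:10,
  p_value = c(0.0001243, 4.009e-05, 0.2115, 0.1421, 8.616e-05,
              0.09479, 0.5362, 7.002e-09, 0.2855, 0.9618)
)

# Reject the null hypothesis only when the p-value is below alpha
results$decision <- ifelse(results$p_value < alpha,
                           "reject H0",
                           "fail to reject H0")
results
```

Under this rule the null hypothesis is rejected for Questions 1, 2, 5, and 8 and not rejected for the other six.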