sleep=read.csv("SleepStudy.csv", header=TRUE)
head(sleep)
# Only needs to be run once; the package name must be quoted
install.packages("ggplot2")
library(ggplot2)
Rob Champigny, Jake Durocher, and Billy Smith - Final Project
Introduction: For our project we will be looking into the sleep habits and sleep quality of a survey group of college students. More specifically, our group will be asking two statistical questions, one categorical and the other quantitative. Our categorical question asks whether a certain type of sleep habit (Lark, Owl, or Neither) is related to a student's overall depression status. Our quantitative question asks whether there is a correlation between students' alcohol use (average number of drinks) and sleep quality, specifically students' reported poor sleep quality scores.
Description of Methods: For our categorical question we will interpret a bar graph and look for any relationship between sleep habits and depression status. Our quantitative question will have a much more in-depth analysis. We will interpret histograms, boxplots, five-number summaries, the standard deviation, two-variable graphs, and the line of best fit.
Categorical Data Analysis
LarkOwl vs. Depression Status
This bar chart displays the depression status for each sleep type. By using it we can see whether there is an association between sleep type and depression.
ggplot(sleep, aes(DepressionStatus, fill=LarkOwl))+geom_bar(position = "dodge")

From this bar chart, we see that there is very little relationship between sleep type and depression status. However, Owls seem to have a slightly higher rate of depression than Larks. Those who claim to be neither an Owl nor a Lark show the most cases of normal depression status, but this is most likely because they make up the majority of the data.
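Because the Neither group is so much larger, raw counts can hide differences between sleep types. One way to compare the groups fairly is to look at the proportion of each depression status within each sleep type. The sketch below uses a small invented data frame (`toy`, with made-up rows and made-up status labels) in place of the real survey data; only the column names `LarkOwl` and `DepressionStatus` come from our data set.

```r
# Toy stand-in for the survey data; these rows are invented for illustration only.
toy <- data.frame(
  LarkOwl = c("Lark", "Lark", "Owl", "Owl", "Owl",
              "Neither", "Neither", "Neither", "Neither", "Neither"),
  DepressionStatus = c("normal", "normal", "normal", "moderate", "moderate",
                       "normal", "normal", "normal", "normal", "moderate")
)

# Counts of each depression status within each sleep type
counts <- table(toy$LarkOwl, toy$DepressionStatus)

# Convert counts to proportions within each row (each sleep type sums to 1)
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 2))
```

To try this on the real data, `toy` would be replaced with `sleep`. Each row then shows how depression status is distributed within one sleep type, regardless of how many students are in that group.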
Quantitative Data Analysis
For our quantitative question, we are going to see if there is a correlation between the number of drinks students have and students' reported poor sleep quality. We will do this by analyzing histograms, boxplots, and five-number summaries.
Histogram of Drinks
The histogram will display the range and shape of the data, while also giving us an estimate of the median.
qplot(sleep$Drinks,
geom="histogram",
binwidth = 1,
fill=I("tomato3"))

From this histogram, we can see that the majority of students have around 0-5 drinks. There is also a large group that has 0 drinks and some outliers that have around 20 drinks. The shape is right-skewed, with a long tail toward the higher values, and the answers range from 0 drinks to 24 drinks.
Boxplot of Drinks
The boxplot will show us direct measures of the spread, including the range, median, Q1, Q3, and IQR. It will also give us a chance to notice the outliers.
ggplot(sleep, aes(x="",y = Drinks)) +
geom_boxplot()+
ggtitle("Drinks")

This Boxplot displays the median number of drinks (5) and the wide range that reaches all the way up to 24 from 0. It also tells us the Q1 (3), Q3 (8), and IQR (5).
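One way to see which drink counts the boxplot marks as outliers is the 1.5 x IQR rule, which is the default whisker rule in `geom_boxplot`. The sketch below applies that rule to the quartiles read from the boxplot; the variable names are only illustrative.

```r
# Quartiles of Drinks as read from the boxplot above
q1 <- 3
q3 <- 8
iqr <- q3 - q1                    # interquartile range

# Fences of the 1.5 * IQR rule
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

cat("Lower fence:", lower_fence, "\n")
cat("Upper fence:", upper_fence, "\n")
```

Under this rule, any student reporting more than 15.5 drinks is drawn as an outlier point. The lower fence falls below 0, so there can be no low outliers.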
Summary of Drinks
The summary tells us the key measures of the spread of the number of drinks students have. It relates directly to the boxplot and is a more precise way to look at the min, Q1, median, mean, Q3, and max.
summary(sleep$Drinks)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 3.000 5.000 5.569 8.000 24.000
After reviewing the summary we are able to identify the min (0) and the max (24), so we know the range is 24. We also see the Q1 (3) and the Q3 (8), giving an IQR of 5. Finally, the median is 5 and the mean is 5.569. Because these two values are close, most of the data is clumped fairly close together, and the slightly higher mean reflects the few students with very high drink counts.
Standard Deviation of Drinks
The standard deviation is used to tell how far the number of drinks is spread from the average.
sd(sleep$Drinks)
[1] 4.095119
With a standard deviation of about 4.1, we can conclude that the number of drinks is typically within about 4.1 drinks of the mean number of drinks (5.569). Compared to a mean of 5.569, this is a fairly wide spread.
Histogram of Sleep Quality
The histogram will display the range and shape of the data, while also giving us an estimate of the median.
qplot(sleep$PoorSleepQuality,
geom="histogram",
binwidth = 1,
fill=I("khaki1"))

The histogram of poor sleep quality shows that students' scores cluster around 6. The shape is fairly symmetrical, and its center also gives us a rough estimate of the median, which is about 6. We can also notice an outlier (18).
Boxplot of Sleep Quality
The boxplot will show us direct measures of the spread, including the range, median, Q1, Q3, and IQR. It will also give us a chance to notice the outliers.
ggplot(sleep, aes(x="",y = PoorSleepQuality)) +
geom_boxplot()+
ggtitle("Poor Sleep Quality")

This boxplot displays the median poor sleep quality score (6) and the range, which reaches all the way up to 18 from 1. It also tells us the Q1 (4), Q3 (8), and IQR (4).
Summary of Sleep Quality
The summary tells us the key measures of the spread of the poor sleep quality scores that students have. It relates directly to the boxplot and is a more precise way to look at the min, Q1, median, mean, Q3, and max.
summary(sleep$PoorSleepQuality)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 4.000 6.000 6.257 8.000 18.000
After reviewing the summary we are able to identify the min (1) and the max (18), so we know the range is 17. We also see the Q1 (4) and the Q3 (8), giving an IQR of 4. Finally, the median is 6 and the mean is 6.257. This information points to the data being close together and clumped.
Standard Deviation of Sleep Quality
The standard deviation is used to tell how far the poor sleep quality scores are spread from the average.
sd(sleep$PoorSleepQuality)
[1] 2.919761
Since this is a fairly low standard deviation, we can conclude that most of the scores are close together and are typically within about 2.9 points of the mean poor sleep quality score (6.257).
Two Variable Analysis
Here we will look at the relationship between our two quantitative variables. By plotting poor sleep quality against number of drinks, we will be able to see whether there is any sort of relationship between them.
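Because drinks and sleep scores are measured on different scales, their standard deviations are hard to compare directly. One common way to compare them is the coefficient of variation, which is the standard deviation divided by the mean. This is a sketch using the values reported above; it is an extra check, not part of our original methods, and the variable names are only illustrative.

```r
# Standard deviations and means reported by sd() and summary() above
sd_drinks   <- 4.095119
mean_drinks <- 5.569
sd_sleep    <- 2.919761
mean_sleep  <- 6.257

# Coefficient of variation: spread relative to the size of the mean
cv_drinks <- sd_drinks / mean_drinks
cv_sleep  <- sd_sleep / mean_sleep

cat("CV of Drinks:           ", round(cv_drinks, 2), "\n")
cat("CV of PoorSleepQuality: ", round(cv_sleep, 2), "\n")
```

The ratio comes out at roughly 0.74 for drinks and roughly 0.47 for poor sleep quality, so relative to its mean the number of drinks is the more spread out of the two variables.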
ggplot(sleep, aes(x=PoorSleepQuality, y=Drinks)) +
geom_point(size=2)+
ggtitle("Drinks vs. Poor Sleep Score")

This scatterplot displays no visible relationship between poor sleep quality scores and the number of drinks students have. Therefore, we see no evidence in this sample that the number of drinks students have is related to their sleep quality.
Calculate the Correlation
Calculating the correlation will give us a number that shows either a strong or a weak linear relationship in our study of drinks and sleep quality. It will also tell us if the slope is positive or negative.
cor(sleep$PoorSleepQuality,sleep$Drinks)
[1] -0.001989989
From the number above we are able to determine that the correlation is extremely weak: at about -0.002 it is essentially zero. This means that there is almost no linear relationship between the number of drinks students have and their poor sleep quality score. The negative sign tells us that the best fit line would have a negative slope, although one that is very close to flat.
Fit a line to the data and add it to the plot
The line will allow us to interpret an equation relating the number of drinks to poor sleep quality.
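To put the size of this correlation in perspective, we can square it. For a simple linear fit with one predictor, the squared correlation gives the fraction of the variation in one variable that is explained by the other. This is a small sketch using the value reported by `cor()` above; the variable names are only illustrative.

```r
# Correlation between PoorSleepQuality and Drinks, as reported above
r <- -0.001989989

# Squared correlation: fraction of variation explained by a linear fit
r_squared <- r^2

cat(sprintf("r^2 = %.8f\n", r_squared))
cat(sprintf("Percent of variation explained: %.5f%%\n", 100 * r_squared))
```

The result is roughly 0.000004, so poor sleep quality explains practically none of the variation in the number of drinks.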
ggplot(sleep, aes(x=PoorSleepQuality, y=Drinks)) +
geom_point(size=2)+
stat_smooth(method="lm",col="red",se=FALSE)+
ggtitle("Drinks vs. Poor Sleep Score")

Equation of the best fit line
# Fit the linear model: Drinks as a function of PoorSleepQuality
sleep.lm = lm(Drinks ~ PoorSleepQuality, data = sleep)
# Extract the intercept and slope of the fitted line
coefficients(sleep.lm)
(Intercept) PoorSleepQuality
5.586633427 -0.002791066
The slope is -0.0028 and the intercept is 5.587.
So the model predicting the average number of drinks from the poor sleep quality score is:
Drinks = -0.0028*PoorSleep + 5.587
Because the correlation is essentially zero, this model would not be appropriate for prediction.
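To illustrate why the model is not useful, we can plug the lowest and highest observed poor sleep quality scores (1 and 18) into the fitted equation. This is a sketch using the coefficients reported above; `predict_drinks` is a helper name made up for this example.

```r
# Coefficients reported by coefficients(sleep.lm) above
intercept <- 5.586633427
slope     <- -0.002791066

# Hypothetical helper: predicted drinks for a given poor sleep quality score
predict_drinks <- function(poor_sleep) {
  intercept + slope * poor_sleep
}

low  <- predict_drinks(1)    # prediction at the minimum observed score
high <- predict_drinks(18)   # prediction at the maximum observed score

cat("Predicted drinks at score 1: ", round(low, 3), "\n")
cat("Predicted drinks at score 18:", round(high, 3), "\n")
cat("Difference across the range: ", round(low - high, 3), "\n")
```

Across the entire range of sleep scores the prediction changes by less than a twentieth of a drink, so the model predicts essentially the mean number of drinks for every student.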
