Daniel J Wilson
a. What is the mean of the sample?
The GLM equation for the mean is:
\[\bar{x} = \frac{1}{n} \sum_{i=1}^{n}x_{i}\]
\(\bar{x}\) is the mean
\(n\) is the number of terms
\(x_{i}\) is the value of each individual item
#READ IN CSV
sleep <- read.csv("/Users/danieljwilson/Dropbox/PROGRAMMING/R/StatsClass/Hmwrk1Sleep.csv", header = TRUE)
#FIND MEAN
sleepMean <- sum(sleep$hours)/length(sleep$hours)
sleepMean
## [1] 6.608696
The mean is 6.608696.
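As a quick cross-check, the manual sum-over-n calculation can be compared with R's built-in `mean()`. This is a minimal sketch that assumes the CSV holds the 23 values listed in the sorted table in part (b); the vector name `hours` is illustrative and does not come from the original script.

```r
# Hours of sleep transcribed from the sorted table in part (b).
# `hours` is an illustrative name, not a variable from the original script.
hours <- c(4, 4, 4, 4, 4, 5, 5, 6, 6, 6,
           7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 9, 9, 14)

manual_mean  <- sum(hours) / length(hours)  # the formula above, term by term
builtin_mean <- mean(hours)                 # R's built-in equivalent

# TRUE when the two approaches agree to numerical tolerance
all.equal(manual_mean, builtin_mean)
```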
b. What is the median of the sample (do by hand)?
| Index | Value |
|---|---|
| 01 | 4 |
| 02 | 4 |
| 03 | 4 |
| 04 | 4 |
| 05 | 4 |
| 06 | 5 |
| 07 | 5 |
| 08 | 6 |
| 09 | 6 |
| 10 | 6 |
| 11 | 7 |
| 12 | 7 |
| 13 | 7 |
| 14 | 7 |
| 15 | 7 |
| 16 | 7 |
| 17 | 7 |
| 18 | 7 |
| 19 | 8 |
| 20 | 8 |
| 21 | 9 |
| 22 | 9 |
| 23 | 14 |
The median value is 7.
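The by-hand procedure (sort, then pick the middle position) can be sketched in R as follows. The vector `hours` is an illustrative transcription of the table above, and the index arithmetic assumes an odd number of observations, as is the case here.

```r
# Values transcribed from the table above; `hours` is an illustrative name.
hours <- c(4, 4, 4, 4, 4, 5, 5, 6, 6, 6,
           7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 9, 9, 14)

sorted <- sort(hours)
n <- length(sorted)

# With an odd n, the median is the single middle value at position (n + 1) / 2.
middle_index <- (n + 1) / 2
by_hand <- sorted[middle_index]

# Built-in check
builtin <- median(hours)
```

With 23 values the middle position is the 12th row of the table.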
c. What is the mode (by hand)?
| Index | Value |
|---|---|
| 01 | 4 |
| 02 | 4 |
| 03 | 4 |
| 04 | 4 |
| 05 | 4 |
| 06 | 5 |
| 07 | 5 |
| 08 | 6 |
| 09 | 6 |
| 10 | 6 |
| 11 | 7 |
| 12 | 7 |
| 13 | 7 |
| 14 | 7 |
| 15 | 7 |
| 16 | 7 |
| 17 | 7 |
| 18 | 7 |
| 19 | 8 |
| 20 | 8 |
| 21 | 9 |
| 22 | 9 |
| 23 | 14 |
The mode is also 7.
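Base R has no function for the statistical mode, so one common approach is to tabulate the values and take the most frequent one. The sketch below assumes the data match the table above; `hours`, `counts`, and `mode_value` are illustrative names. Note that `which.max()` reports only the first maximum, so a tie between two values would go unnoticed.

```r
# Values transcribed from the table above; `hours` is an illustrative name.
hours <- c(4, 4, 4, 4, 4, 5, 5, 6, 6, 6,
           7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 9, 9, 14)

# Count how often each distinct value occurs.
counts <- table(hours)

# The mode is the value with the highest count (first one if there is a tie).
mode_value <- as.numeric(names(counts)[which.max(counts)])
mode_count <- max(counts)
```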
d. What is the variance?
#SUBTRACT THE MEAN FROM EACH VALUE
sleep$variance <- sleep$hours - sleepMean
#SQUARE TO MAKE POSITIVE
sleep$variance2 <- sleep$variance * sleep$variance
#ADD SQUARED VALUES AND DIVIDE BY NUMBER OF VALUES MINUS 1 (SAMPLE VARIANCE, BESSEL'S CORRECTION)
sleepVariance <- sum(sleep$variance2)/(length(sleep$variance2)-1)
sleepVariance
## [1] 5.067194
The variance is 5.067194.
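The step-by-step calculation can be checked against R's built-in `var()`, which also divides by n - 1. The sketch below additionally computes the divide-by-n version to show the difference between the two denominators. It assumes the data match the table in part (b); `hours`, `sample_var`, and `pop_style_var` are illustrative names.

```r
# Values transcribed from the table in part (b); `hours` is an illustrative name.
hours <- c(4, 4, 4, 4, 4, 5, 5, 6, 6, 6,
           7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 9, 9, 14)

# Sample variance: sum of squared deviations divided by n - 1.
sample_var <- var(hours)

# Same numerator divided by n instead, for comparison.
pop_style_var <- sum((hours - mean(hours))^2) / length(hours)

# The n - 1 denominator always gives the slightly larger value.
sample_var > pop_style_var
```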
e. What is the standard deviation?
#TAKE THE SQUARE ROOT OF THE VARIANCE
SD <- sqrt(sleepVariance)
SD
## [1] 2.251043
The standard deviation is 2.251043.
f. What is the standard error of the mean?
#STANDARD DEVIATION DIVIDED BY THE SQUARE ROOT OF THE SAMPLE SIZE
SE <- SD/sqrt(length(sleep$hours))
SE
## [1] 0.4693749
hist(sleep$hours, main="Sleep Histogram", xlab="hours", ylab="instances")
#ADD Z SCORE COLUMN
sleep$zed <- (sleep$hours - sleepMean)/ SD
sleep$zed
## [1] 3.2835024 0.1738325 -1.1588832 -1.1588832 1.0623096 1.0623096
## [7] 0.1738325 0.6180710 -0.7146446 0.1738325 -1.1588832 -0.2704061
## [13] -0.2704061 -0.2704061 0.6180710 -0.7146446 -1.1588832 0.1738325
## [19] 0.1738325 0.1738325 0.1738325 0.1738325 -1.1588832
#CHECK Z SCORES
scale(sleep$hours)
## [,1]
## [1,] 3.2835024
## [2,] 0.1738325
## [3,] -1.1588832
## [4,] -1.1588832
## [5,] 1.0623096
## [6,] 1.0623096
## [7,] 0.1738325
## [8,] 0.6180710
## [9,] -0.7146446
## [10,] 0.1738325
## [11,] -1.1588832
## [12,] -0.2704061
## [13,] -0.2704061
## [14,] -0.2704061
## [15,] 0.6180710
## [16,] -0.7146446
## [17,] -1.1588832
## [18,] 0.1738325
## [19,] 0.1738325
## [20,] 0.1738325
## [21,] 0.1738325
## [22,] 0.1738325
## [23,] -1.1588832
## attr(,"scaled:center")
## [1] 6.608696
## attr(,"scaled:scale")
## [1] 2.251043
Z-scores match.
#CALCULATE Z-SCORE
z16 <- (16- sleepMean) / SD
#CALCULATE PROBABILITY OF 16 HOUR (OR MORE) SLEEP USING "ROUGH" FORMULA
rough16 <- (.6^(z16^2))*.4
cat("The approximate probability is:", rough16)
## The approximate probability is: 5.504159e-05
#CALCULATE PROBABILITY USING PNORM
prob16 <- (1-pnorm(z16))
cat("The actual probability is:", prob16)
## The actual probability is: 1.509824e-05
#OR IN TERMS OF RATIOS
ratio16 <- (1/(1-pnorm(z16)))
cat("The ratio is 1 person in", ratio16)
## The ratio is 1 person in 66232.88
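An equivalent way to get the upper-tail probability is `pnorm()` with `lower.tail = FALSE`, which avoids the subtraction from 1 and is generally more accurate for very small tail areas. This is a sketch under the assumption that the data match the table in part (b); `hours` and `upper_tail` are illustrative names.

```r
# Values transcribed from the table in part (b); `hours` is an illustrative name.
hours <- c(4, 4, 4, 4, 4, 5, 5, 6, 6, 6,
           7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 9, 9, 14)

# z-score for 16 hours of sleep
z16 <- (16 - mean(hours)) / sd(hours)

# P(Z >= z16) computed directly from the upper tail
upper_tail <- pnorm(z16, lower.tail = FALSE)

# Expressed as "1 person in ..."
one_in <- 1 / upper_tail
```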
#SOLVE FOR HOURS
hours1 <- 1.96 * SD + sleepMean
hours2 <- -3.09 * SD + sleepMean
sprintf("The predicted hours of sleep for someone with a z-score of 1.96 would be %f, or about %i hours.", hours1, as.integer(hours1))
## [1] "The predicted hours of sleep for someone with a z-score of 1.96 would be 11.020740, or about 11 hours."
sprintf("The predicted hours of sleep for someone with a z-score of -3.09 would be %f, or about %i hours.", hours2, round(hours2, 0))
## [1] "The predicted hours of sleep for someone with a z-score of -3.09 would be -0.347027, or about 0 hours."
I would test this hypothesis by taking a random sample of graduate students and recording how many hours each had slept the previous night. To account for weekly sleep patterns, I would spread data collection across the week, sampling an equal number of randomly selected students each day. From these data I would calculate the mean, variance, standard deviation, and z-scores, then estimate how likely it is that a student sleeps about 5 hours per night (the range from 4.5 to 5.5 hours), as well as what percentage of the population sleeps more or less than 5 hours.
Based on the data for this exercise, this is not a very good hypothesis, since the sample mean is about 6.6 hours.
#Z Score for 5
zed5 <- (5 - sleepMean)/ SD
#Find percentage of people that are sleeping more than 5 hours
probMore5 <- 1-pnorm(zed5)
cat("The probability that someone sleeps more than 5 hours based on our data is", probMore5)
## The probability that someone sleeps more than 5 hours based on our data is 0.7625857
#Find the 4.5 - 5.5 interval
fiveInt <- pnorm((5.5-sleepMean)/SD) - pnorm((4.5-sleepMean)/SD)
cat("The probability that any individual sleeps about 5 hours based on our data is", fiveInt)
## The probability that any individual sleeps about 5 hours based on our data is 0.136734
#CALCULATE A 95% CONFIDENCE INTERVAL
error95 <- qnorm(0.975)*SD/sqrt(length(sleep$hours))
left95 <- sleepMean-error95
right95 <- sleepMean+error95
sprintf("The true mean has a probability of 95 percent of being in the interval between %f and %f assuming that the original random variable is normally distributed, and the samples are independent.", left95, right95)
## [1] "The true mean has a probability of 95 percent of being in the interval between 5.688738 and 7.528653 assuming that the original random variable is normally distributed, and the samples are independent."
#CALCULATE A 99% CONFIDENCE INTERVAL
error99 <- qnorm(0.995)*SD/sqrt(length(sleep$hours))
left99 <- sleepMean-error99
right99 <- sleepMean+error99
sprintf("The true mean has a probability of 99 percent of being in the interval between %f and %f assuming that the original random variable is normally distributed, and the samples are independent.", left99, right99)
## [1] "The true mean has a probability of 99 percent of being in the interval between 5.399666 and 7.817725 assuming that the original random variable is normally distributed, and the samples are independent."
My conclusion is that, based on the data collected so far, my friend is very likely wrong: the hypothesized mean of 5 hours lies below the lower bound of both the 95% interval (5.69 to 7.53) and the 99% interval (5.40 to 7.82).
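The intervals above use normal quantiles from `qnorm()`. With only 23 observations and a standard deviation estimated from the sample, a t-based interval is a common alternative; the sketch below shows it as a comparison, not as the method used in this report. It assumes the data match the table in part (b), and `hours`, `ci_z`, and `ci_t` are illustrative names.

```r
# Values transcribed from the table in part (b); `hours` is an illustrative name.
hours <- c(4, 4, 4, 4, 4, 5, 5, 6, 6, 6,
           7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 9, 9, 14)

n  <- length(hours)
se <- sd(hours) / sqrt(n)

# Normal-based 95% interval, as computed in the report
ci_z <- mean(hours) + c(-1, 1) * qnorm(0.975) * se

# t-based 95% interval with n - 1 degrees of freedom (slightly wider)
ci_t <- mean(hours) + c(-1, 1) * qt(0.975, df = n - 1) * se

# t.test() reports the same t-based interval
ci_from_t_test <- as.numeric(t.test(hours, mu = 5)$conf.int)
```

Under these assumptions the t-based interval is wider than the normal-based one, and its lower bound still sits above 5 hours.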
#STANDARD ERROR = STANDARD DEVIATION DIVIDED BY THE SQUARE ROOT OF THE SAMPLE SIZE
SE100 <- SD/sqrt(100)
cat("Standard error if sample was 100 =", SE100)
## Standard error if sample was 100 = 0.2251043
SE1000 <- SD/sqrt(1000)
cat("Standard error if sample was 1000 =", SE1000)
## Standard error if sample was 1000 = 0.07118422
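Because the standard error is the standard deviation divided by the square root of n, the same calculation can be vectorised over several sample sizes at once. This sketch assumes the standard deviation stays at its current sample value as n grows; `hours`, `sample_sizes`, and `se_by_n` are illustrative names.

```r
# Values transcribed from the table in part (b); `hours` is an illustrative name.
hours <- c(4, 4, 4, 4, 4, 5, 5, 6, 6, 6,
           7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 9, 9, 14)

sd_hours <- sd(hours)

# Current sample size plus the two hypothetical ones
sample_sizes <- c(23, 100, 1000)

# SE = SD / sqrt(n), computed for every n in one step
se_by_n <- sd_hours / sqrt(sample_sizes)
names(se_by_n) <- sample_sizes

se_by_n
```

Each tenfold increase in sample size shrinks the standard error by a factor of sqrt(10), roughly 3.16.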
a. Increase sample size
Increase power
b. Use a more representative sample
Increase power
c. Using p < .001 rather than p < .05 for an alpha cutoff
Decrease power
d. Using a one tailed test
Increase power
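Answers (a), (c), and (d) can be illustrated numerically with `power.t.test()` from base R's stats package. This is only a sketch: the 1-hour true difference (`delta`) and the standard deviation of 2.25 are assumed values chosen for illustration, and `power_for` is a hypothetical helper. Answer (b) concerns sampling quality and is not something this function models.

```r
# Hypothetical helper: power of a one-sample t-test under assumed
# delta = 1 hour and sd = 2.25 hours (illustrative values only).
power_for <- function(n = 23, sig.level = 0.05, alternative = "two.sided") {
  power.t.test(n = n, delta = 1, sd = 2.25, sig.level = sig.level,
               type = "one.sample", alternative = alternative)$power
}

baseline       <- power_for()                           # n = 23, alpha = .05, two-tailed
larger_n       <- power_for(n = 100)                    # (a) larger sample
stricter_alpha <- power_for(sig.level = 0.001)          # (c) stricter alpha cutoff
one_tailed     <- power_for(alternative = "one.sided")  # (d) one-tailed test

round(c(baseline = baseline, larger_n = larger_n,
        stricter_alpha = stricter_alpha, one_tailed = one_tailed), 3)
```

The one-tailed gain applies only when the true effect lies in the predicted direction.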
library(ggplot2)
cuts1 <- data.frame(Legend="95% CI", vals=c(left95, right95))
cuts2 <- data.frame(Legend="99% CI", vals=c(left99, right99))
cuts3 <- data.frame(Legend="Mean", vals=c(sleepMean))
cuts4 <- data.frame(Legend="Mode/Median", vals=c(7))
cuts <- rbind(cuts1,cuts2,cuts3,cuts4)
ggplot(data=sleep, aes(x=hours)) +
  geom_histogram(breaks=seq(0, 15, by=1),
                 col="red",
                 fill="yellow",
                 alpha = .2) +
  geom_vline(data=cuts,
             aes(xintercept=vals,
                 linetype=Legend,
                 colour = Legend),
             show.legend = TRUE) +
  labs(title="Hours of Sleep") +
  labs(x="hours", y="count")
The confidence intervals calculated are confidence intervals of the population mean based on the sample.
8 Hours