Daniel J Wilson

  1. Descriptive Statistics

a. What is the mean of the sample?

The GLM equation for the mean is:

\[\bar{x} = \frac{1}{n} \sum_{i=1}^{n}x_{i}\]

\(\bar{x}\) is the mean

\(n\) is the number of terms

\(x_{i}\) is the value of each individual item

#READ IN CSV
sleep <- read.csv("/Users/danieljwilson/Dropbox/PROGRAMMING/R/StatsClass/Hmwrk1Sleep.csv", header = TRUE)

#FIND MEAN
sleepMean <- sum(sleep$hours)/length(sleep$hours)
sleepMean
## [1] 6.608696

The mean is 6.608696.

b. What is the median of the sample (do by hand)?

Index Value
01 4
02 4
03 4
04 4
05 4
06 5
07 5
08 6
09 6
10 6
11 7
12 7
13 7
14 7
15 7
16 7
17 7
18 7
19 8
20 8
21 9
22 9
23 14

With 23 sorted values, the median is the middle (12th) value, which is 7.
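
The by-hand answer can be cross-checked in R. This is a minimal sketch that assumes the 23 values are re-entered from the sorted table above into a hypothetical vector `hours` (the homework itself reads them from the CSV); `medianByHand` is likewise a name introduced only for this example.

```r
# Hours re-entered from the sorted table above (assumed to match the CSV)
hours <- c(4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 9, 9, 14)

sorted <- sort(hours)
n <- length(sorted)

# For an odd n, the median is the value at position (n + 1) / 2
middle <- (n + 1) / 2
medianByHand <- sorted[middle]
medianByHand

# Cross-check against the built-in function
median(hours)
```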

c. What is the mode (by hand)?

Index Value
01 4
02 4
03 4
04 4
05 4
06 5
07 5
08 6
09 6
10 6
11 7
12 7
13 7
14 7
15 7
16 7
17 7
18 7
19 8
20 8
21 9
22 9
23 14

The mode is also 7, which occurs 8 times, more often than any other value.
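
Base R has no function for the statistical mode (`mode()` reports an object's storage mode instead), so a frequency table is one way to check the by-hand count. This sketch again assumes a hypothetical vector `hours` re-entered from the table; `counts` and `sleepMode` are illustrative names.

```r
# Hours re-entered from the sorted table above (assumed to match the CSV)
hours <- c(4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 9, 9, 14)

# Count how often each value occurs
counts <- table(hours)
counts

# The mode is the value with the highest count
sleepMode <- as.numeric(names(counts)[which.max(counts)])
sleepMode
```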

d. What is the variance?

#SUBTRACT THE MEAN FROM EACH VALUE (DEVIATIONS FROM THE MEAN)
sleep$variance <- sleep$hours - sleepMean

#SQUARE THE DEVIATIONS TO MAKE THEM POSITIVE
sleep$variance2 <- sleep$variance * sleep$variance

#ADD SQUARED DEVIATIONS AND DIVIDE BY NUMBER OF VALUES - 1 (SAMPLE VARIANCE)
sleepVariance <- sum(sleep$variance2)/(length(sleep$variance2)-1)
sleepVariance
## [1] 5.067194

The variance is 5.067194.
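
In the same notation as the mean formula, the calculation above corresponds to the sample variance. The symbol \(s^{2}\) is introduced here for illustration, and the sum of squared deviations is worked out from the 23 values in the table:

\[s^{2} = \frac{1}{n-1} \sum_{i=1}^{n}(x_{i} - \bar{x})^{2} = \frac{111.478261}{22} \approx 5.067194\]

Dividing by \(n - 1\) rather than \(n\) makes \(s^{2}\) an unbiased estimate of the population variance when the mean is itself estimated from the sample.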

e. What is the standard deviation?

#TAKE THE SQUARE ROOT OF THE VARIANCE
SD <- sqrt(sleepVariance)
SD
## [1] 2.251043

The standard deviation is 2.251043.
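
As a cross-check, the built-in `var()` and `sd()` functions use the same \(n - 1\) denominator, so they should reproduce the values above. This sketch assumes a hypothetical vector `hours` re-entered from the sorted table rather than read from the CSV.

```r
# Hours re-entered from the sorted table above (assumed to match the CSV)
hours <- c(4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 9, 9, 14)

# var() and sd() both divide by n - 1
var(hours)
sd(hours)

# sd() is the square root of var()
all.equal(sd(hours), sqrt(var(hours)))
```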

f. What is the standard error of the mean?

#STANDARD DEVIATION DIVIDED BY THE SQUARE ROOT OF THE SAMPLE SIZE
SE <- SD/sqrt(length(sleep$hours))
SE
## [1] 0.4693749
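
Written as an equation, with \(s\) standing for the standard deviation computed above (a symbol introduced here for illustration), the standard error of the mean is:

\[SE_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{2.251043}{\sqrt{23}} \approx 0.4694\]
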
  1. In R, generate a histogram of the data.
hist(sleep$hours, main="Sleep Histogram", xlab="hours", ylab="instances")

  1. Calculate and provide the z-score for each participant. Use R to write a script to do this, and then check with the built-in R functions.
#ADD Z SCORE COLUMN
sleep$zed <- (sleep$hours - sleepMean)/ SD
sleep$zed
##  [1]  3.2835024  0.1738325 -1.1588832 -1.1588832  1.0623096  1.0623096
##  [7]  0.1738325  0.6180710 -0.7146446  0.1738325 -1.1588832 -0.2704061
## [13] -0.2704061 -0.2704061  0.6180710 -0.7146446 -1.1588832  0.1738325
## [19]  0.1738325  0.1738325  0.1738325  0.1738325 -1.1588832
#CHECK Z SCORES
scale(sleep$hours)
##             [,1]
##  [1,]  3.2835024
##  [2,]  0.1738325
##  [3,] -1.1588832
##  [4,] -1.1588832
##  [5,]  1.0623096
##  [6,]  1.0623096
##  [7,]  0.1738325
##  [8,]  0.6180710
##  [9,] -0.7146446
## [10,]  0.1738325
## [11,] -1.1588832
## [12,] -0.2704061
## [13,] -0.2704061
## [14,] -0.2704061
## [15,]  0.6180710
## [16,] -0.7146446
## [17,] -1.1588832
## [18,]  0.1738325
## [19,]  0.1738325
## [20,]  0.1738325
## [21,]  0.1738325
## [22,]  0.1738325
## [23,] -1.1588832
## attr(,"scaled:center")
## [1] 6.608696
## attr(,"scaled:scale")
## [1] 2.251043

Z-scores match.
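
Rather than comparing the two printouts by eye, the match can be confirmed programmatically. This is a sketch under the assumption that the values are re-entered into a hypothetical vector `hours` (sorted here, so the order differs from the CSV); `zManual` and `zBuiltIn` are illustrative names.

```r
# Hours re-entered from the sorted table above (assumed to match the CSV)
hours <- c(4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 9, 9, 14)

# Manual z-scores: distance from the mean in standard deviation units
zManual <- (hours - mean(hours)) / sd(hours)

# scale() returns a one-column matrix with attributes, so flatten it first
zBuiltIn <- as.vector(scale(hours))

# TRUE when the two vectors agree up to numerical tolerance
all.equal(zManual, zBuiltIn)
```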

  1. How unlikely would it be to find a person who slept for 16 hours given this distribution? Show your work.
#CALCULATE Z-SCORE 
z16 <- (16- sleepMean) / SD

#CALCULATE PROBABILITY OF 16 HOUR (OR MORE) SLEEP USING "ROUGH" FORMULA
rough16 <- (.6^(z16^2))*.4
cat("The approximate probability is:", rough16)
## The approximate probability is: 5.504159e-05
#CALCULATE PROBABILITY USING PNORM
prob16 <- (1-pnorm(z16))
cat("The actual probability is:", prob16)
## The actual probability is: 1.509824e-05
#OR IN TERMS OF RATIOS
ratio16 <- (1/(1-pnorm(z16)))
cat("The ratio is 1 person in", ratio16)
## The ratio is 1 person in 66232.88
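
For very small tail probabilities, subtracting from 1 can lose precision, and `pnorm()` offers a `lower.tail = FALSE` argument that computes the upper tail directly. The difference is negligible at this z-score, so the sketch below is only an alternative way to get the same number. It assumes the mean and standard deviation reported above are typed in as constants.

```r
sleepMean <- 6.608696  # mean reported above
SD <- 2.251043         # standard deviation reported above

z16 <- (16 - sleepMean) / SD

# Upper-tail probability computed directly, without subtracting from 1
prob16 <- pnorm(z16, lower.tail = FALSE)
prob16
```
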
  1. What would be the expected score of someone who had a z-score of 1.96? What about -3.09?
#SOLVE FOR HOURS
hours1 <- 1.96 * SD + sleepMean
hours2 <- -3.09 * SD + sleepMean
sprintf("The predicted hours of sleep for someone with a z-score of 1.96 would be %f, or about %i hours.", hours1, as.integer(hours1))
## [1] "The predicted hours of sleep for someone with a z-score of 1.96 would be 11.020740, or about 11 hours."
sprintf("The predicted hours of sleep for someone with a z-score of -3.09 would be %f, or about %i hours.", hours2, round(hours2, 0))
## [1] "The predicted hours of sleep for someone with a z-score of -3.09 would be -0.347027, or about 0 hours."
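
Both predictions come from inverting the z-score formula. As a worked equation, using the mean and standard deviation computed earlier (with \(s\) standing for the standard deviation, a symbol introduced here for illustration):

\[x = \bar{x} + z \cdot s\]

\[x_{1.96} = 6.608696 + 1.96 \times 2.251043 \approx 11.020740\]

\[x_{-3.09} = 6.608696 - 3.09 \times 2.251043 \approx -0.347027\]

A negative number of hours is impossible, so the second value is best read as a sign that the normal model is only a rough approximation in the far lower tail of this data.
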
  1. A friend of yours is convinced that graduate students only sleep 5 hours a night. How would you test this hypothesis? Given this data, is this a good hypothesis? Write out the equations that you used to examine this hypothesis. Provide your models, alternative models, fit functions, and conclusions.

I would test this hypothesis by taking a random sample of graduate students (ideally more than 30) and recording how many hours each had slept the night before. I would spread data collection across every day of the week (an equal number of randomly selected students each day) to account for any weekly sleep patterns. From this data I would calculate the mean, variance, standard deviation, and standard error, and then check whether 5 hours is a plausible value for the population mean, for example by seeing whether 5 falls inside a confidence interval around the sample mean. The z-scores also show how likely it is that an individual student sleeps about 5 hours per night (specifically, between 4.5 and 5.5 hours) and what percentage of the population sleeps more or less than 5 hours.

Based on the data for this exercise, this is not a very good hypothesis, since the sample mean is about 6.6 hours.
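
The question also asks for the models and fit function. One way to write them out, as a sketch in model-comparison form (the labels Model C and Model A and the symbols \(\varepsilon_{i}\), \(SSE_{C}\), and \(SSE_{A}\) are introduced here for illustration), is to compare a model that fixes the mean at the friend's value of 5 with a model that estimates the mean from the data:

\[\text{Model C (friend's hypothesis): } x_{i} = 5 + \varepsilon_{i}\]

\[\text{Model A (alternative): } x_{i} = \bar{x} + \varepsilon_{i}\]

The fit function is the sum of squared errors of each model, computed from the 23 values:

\[SSE_{C} = \sum_{i=1}^{n}(x_{i} - 5)^{2} = 171\]

\[SSE_{A} = \sum_{i=1}^{n}(x_{i} - \bar{x})^{2} \approx 111.478\]

\[F = \frac{SSE_{C} - SSE_{A}}{SSE_{A}/(n-1)} = \frac{59.522}{5.067} \approx 11.75\]

This F statistic (1 and 22 degrees of freedom) is the square of the one-sample t statistic, \(t = (\bar{x} - 5)/SE_{\bar{x}} \approx 3.43\), and it is well above the 5% critical value of about 4.30, so the data fit Model A noticeably better than the friend's model.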

#Z Score for 5
zed5 <- (5 - sleepMean)/ SD
#Find percentage of people that are sleeping more than 5 hours
probMore5 <- 1-pnorm(zed5)
cat("The probability that someone sleeps more than 5 hours based on our data is", probMore5)
## The probability that someone sleeps more than 5 hours based on our data is 0.7625857
#Find the 4.5 - 5.5 interval
fiveInt <- pnorm((5.5-sleepMean)/SD) - pnorm((4.5-sleepMean)/SD)
cat("The probability that any individual sleeps about 5 hours based on our data is", fiveInt)
## The probability that any individual sleeps about 5 hours based on our data is 0.136734
#CALCULATE A 95% CONFIDENCE INTERVAL
error95 <- qnorm(0.975)*SD/sqrt(length(sleep$hours))
left95 <- sleepMean-error95
right95 <- sleepMean+error95

sprintf("We can be 95 percent confident that the true mean lies in the interval between %f and %f, assuming that the original random variable is normally distributed and the samples are independent.", left95, right95)
## [1] "We can be 95 percent confident that the true mean lies in the interval between 5.688738 and 7.528653, assuming that the original random variable is normally distributed and the samples are independent."
#CALCULATE A 99% CONFIDENCE INTERVAL
error99 <- qnorm(0.995)*SD/sqrt(length(sleep$hours))
left99 <- sleepMean-error99
right99 <- sleepMean+error99

sprintf("We can be 99 percent confident that the true mean lies in the interval between %f and %f, assuming that the original random variable is normally distributed and the samples are independent.", left99, right99)
## [1] "We can be 99 percent confident that the true mean lies in the interval between 5.399666 and 7.817725, assuming that the original random variable is normally distributed and the samples are independent."

My conclusion is that, based on the data collected so far, my friend is very likely wrong: 5 hours lies below the lower bound of both the 95% and the 99% confidence intervals, so it is very unlikely that the true population mean is 5.
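
The same conclusion can be checked with a one-sample t-test, which is a common choice when the standard deviation is estimated from a small sample. This is a sketch rather than part of the original analysis: it assumes a hypothetical vector `hours` re-entered from the sorted table, and `result` is an illustrative name.

```r
# Hours re-entered from the sorted table above (assumed to match the CSV)
hours <- c(4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 9, 9, 14)

# One-sample t-test of H0: population mean = 5 (two-sided by default)
result <- t.test(hours, mu = 5)

result$statistic  # t statistic
result$p.value    # two-sided p-value
result$conf.int   # 95% confidence interval based on the t distribution
```

The t-based interval is slightly wider than the `qnorm()` interval above because it accounts for the uncertainty in estimating the standard deviation from 23 observations.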

  1. What would the standard error of the mean be if the sample size was 100? What about 1000?
#STANDARD ERROR = STANDARD DEVIATION DIVIDED BY THE SQUARE ROOT OF THE SAMPLE SIZE
SE100 <- SD/sqrt(100)
cat("Standard error if sample was 100 =", SE100)
## Standard error if sample was 100 = 0.2251043
SE1000 <- SD/sqrt(1000)
cat("Standard error if sample was 1000 =", SE1000)
## Standard error if sample was 1000 = 0.07118422
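
The same calculation can be vectorised to compare several sample sizes at once. This sketch assumes the standard deviation would stay at the value estimated above (in practice it would be re-estimated from the new sample); `n` and `seByN` are illustrative names.

```r
SD <- 2.251043  # standard deviation reported above, assumed unchanged

# Current sample size plus the two hypothetical ones
n <- c(23, 100, 1000)

# Standard error shrinks with the square root of the sample size
seByN <- SD / sqrt(n)
data.frame(n = n, SE = seByN)
```
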
  1. How would each of these experimental changes change power?

a. Increase sample size

Increase power: a larger sample shrinks the standard error, so a true effect is easier to detect.

b. Use a more representative sample

Increase power

c. Using p < .001 rather than p < .05 for an alpha cutoff

Decrease power: a stricter alpha makes it harder to reject the null hypothesis, so more true effects are missed.

d. Using a one tailed test

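Increase power, provided the true effect lies in the predicted direction, because the whole alpha is placed in one tail.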
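
Three of these effects (a, c, and d) can be illustrated numerically with `power.t.test()` from base R's stats package. The scenario below is hypothetical: it assumes a true mean 1 hour away from the hypothesised value and a standard deviation of 2.25 hours, roughly the sample value above. The helper `powerFor` is an illustrative name, and the sketch does not cover sample representativeness.

```r
# Hypothetical scenario: true effect of 1 hour, SD of 2.25 hours
powerFor <- function(n = 23, alpha = 0.05, alternative = "two.sided") {
  power.t.test(n = n, delta = 1, sd = 2.25, sig.level = alpha,
               type = "one.sample", alternative = alternative)$power
}

baseline  <- powerFor()                           # n = 23, alpha = .05, two-tailed
largerN   <- powerFor(n = 100)                    # a. larger sample
stricter  <- powerFor(alpha = 0.001)              # c. stricter alpha
oneTailed <- powerFor(alternative = "one.sided")  # d. one-tailed test

c(baseline = baseline, largerN = largerN,
  stricter = stricter, oneTailed = oneTailed)
```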

  1. Challenge points! Figure out how to make a plot using the R package ggplot2 that makes a histogram of the raw data that clearly indicates the 95% and 99% confidence intervals. Also, include a vertical line at the mean, the median, and the mode. Provide your plot, and the code that you used to generate it.
library(ggplot2)
cuts1 <- data.frame(Legend="95% CI", vals=c(left95, right95))
cuts2 <- data.frame(Legend="99% CI", vals=c(left99, right99))
cuts3 <- data.frame(Legend="Mean", vals=c(sleepMean))
cuts4 <- data.frame(Legend="Mode/Median", vals=c(7))

cuts <- rbind(cuts1,cuts2,cuts3,cuts4)

ggplot(data=sleep, aes(x=hours)) +
  geom_histogram(breaks=seq(0, 15, by=1),
                 col="red", 
                 fill="yellow",
                 alpha = .2) +
  geom_vline(data=cuts, 
             aes(xintercept=vals, 
                 linetype=Legend,
                 colour = Legend),
             show.legend  = TRUE) +
  labs(title="Hours of Sleep", x="hours", y="count")

The confidence intervals calculated are confidence intervals of the population mean based on the sample.

  1. How long did it take you to complete this assignment?

8 Hours
