Homework Assignment #1

Daniel J Wilson

Descriptive Statistics

a. What is the mean of the sample?

a1. What is the GLM equation for the mean of this sample?

The GLM equation for the mean is:

\[\bar{x} = \frac{1}{n} \sum_{i=1}^{n}x_{i}\]

a2. What do each of the terms mean?

\(\bar{x}\) is the mean

\(n\) is the number of observations in the sample

\(x_{i}\) is the value of the \(i\)-th observation
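
In GLM notation, the same mean can also be written as the only parameter of an intercept-only model. This is a sketch, assuming the symbols \(b_{0}\) and \(e_{i}\), which do not appear in the answer above:

\[x_{i} = b_{0} + e_{i}\]

Here \(b_{0}\) is the single model parameter and \(e_{i}\) is the residual for observation \(i\). Choosing \(b_{0}\) to minimize the sum of squared residuals \(\sum_{i=1}^{n} e_{i}^{2}\) gives \(b_{0} = \bar{x}\).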

a3. Write R code to compute the mean (not the mean(x) command, but each of the steps to get to that function).

#READ IN CSV
sleep <- read.csv("/Users/danieljwilson/Dropbox/PROGRAMMING/R/StatsClass/Hmwrk1Sleep.csv", header = TRUE)

#FIND MEAN
sleepMean <- sum(sleep$hours)/length(sleep$hours)
sleepMean

## [1] 6.608696

a4. What is the answer?

The mean is 6.608696.

b. What is the median of the sample (do by hand)?

Index	Value
01	4
02	4
03	4
04	4
05	4
06	5
07	5
08	6
09	6
10	6
11	7
12	7
13	7
14	7
15	7
16	7
17	7
18	7
19	8
20	8
21	9
22	9
23	14

The sample has 23 values, so the median is the 12th value in sorted order: 7.
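
The by-hand steps can also be sketched in R. This assumes the 23 values are typed in from the table above rather than read from the CSV, and the names `hours`, `sorted` and `med` are illustrative only.

```r
# Values copied from the sorted table above (assumption: same data as the CSV)
hours <- c(rep(4, 5), rep(5, 2), rep(6, 3), rep(7, 8), rep(8, 2), rep(9, 2), 14)

# Sort, then pick the middle value (odd n) or average the two middle values (even n)
sorted <- sort(hours)
n <- length(sorted)
if (n %% 2 == 1) {
  med <- sorted[(n + 1) / 2]
} else {
  med <- mean(sorted[c(n / 2, n / 2 + 1)])
}
med
```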

c. What is the mode (by hand)?

Index	Value
01	4
02	4
03	4
04	4
05	4
06	5
07	5
08	6
09	6
10	6
11	7
12	7
13	7
14	7
15	7
16	7
17	7
18	7
19	8
20	8
21	9
22	9
23	14

The mode is also 7, which occurs 8 times (more than any other value).
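
Base R has no built-in function for the statistical mode, so a frequency table is one way to check the count. This is a minimal sketch, assuming the values are typed in from the table above; `counts` and `modeValue` are illustrative names.

```r
# Values copied from the sorted table above (assumption: same data as the CSV)
hours <- c(rep(4, 5), rep(5, 2), rep(6, 3), rep(7, 8), rep(8, 2), rep(9, 2), 14)

# Count how often each value occurs
counts <- table(hours)

# Keep the value(s) with the highest count (there can be ties in general)
modeValue <- as.numeric(names(counts)[counts == max(counts)])
modeValue
```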

d. What is the variance?

d1. Write R code to compute the var (not the var(x) command, but each of the steps to get to that function).

#SUBTRACT MEAN FROM EACH VALUE
sleep$variance <- sleep$hours - sleepMean

#SQUARE TO MAKE POSITIVE
sleep$variance2 <- sleep$variance * sleep$variance

#ADD SQUARED VALUES AND DIVIDE BY NUMBER OF VALUES MINUS 1 (n - 1 FOR THE SAMPLE VARIANCE)
sleepVariance <- sum(sleep$variance2)/(length(sleep$variance2)-1)
sleepVariance

## [1] 5.067194

d2. What is the answer?

The variance is 5.067194.
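
As a cross-check, the same steps can be collapsed into one expression and compared with the built-in var(). This sketch assumes the values are typed in from the table above; `manualVar` is an illustrative name.

```r
# Values copied from the sorted table above (assumption: same data as the CSV)
hours <- c(rep(4, 5), rep(5, 2), rep(6, 3), rep(7, 8), rep(8, 2), rep(9, 2), 14)

# Sum of squared deviations from the mean, divided by n - 1 (sample variance)
manualVar <- sum((hours - mean(hours))^2) / (length(hours) - 1)

# Compare with the built-in, which also divides by n - 1
all.equal(manualVar, var(hours))
```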

e. What is the standard deviation?

e1. Modify d1 above

#TAKE THE SQUARE ROOT OF THE VARIANCE
SD <- sqrt(sleepVariance)
SD

## [1] 2.251043

e2. What is the answer?

The standard deviation is 2.251043.

f. What is the standard error of the mean?

#STANDARD DEVIATION DIVIDED BY THE SQUARE ROOT OF THE SAMPLE SIZE
SE <- SD/sqrt(length(sleep$hours))
SE

## [1] 0.4693749

In R, generate a histogram of the data.

hist(sleep$hours, main="Sleep Histogram", xlab="hours", ylab="instances")

Calculate and provide the z-score for each participant. Use R to write a script to do this, and then check with the built in R functions.

#ADD Z SCORE COLUMN
sleep$zed <- (sleep$hours - sleepMean)/ SD
sleep$zed

##  [1]  3.2835024  0.1738325 -1.1588832 -1.1588832  1.0623096  1.0623096
##  [7]  0.1738325  0.6180710 -0.7146446  0.1738325 -1.1588832 -0.2704061
## [13] -0.2704061 -0.2704061  0.6180710 -0.7146446 -1.1588832  0.1738325
## [19]  0.1738325  0.1738325  0.1738325  0.1738325 -1.1588832

#CHECK Z SCORES
scale(sleep$hours)

##             [,1]
##  [1,]  3.2835024
##  [2,]  0.1738325
##  [3,] -1.1588832
##  [4,] -1.1588832
##  [5,]  1.0623096
##  [6,]  1.0623096
##  [7,]  0.1738325
##  [8,]  0.6180710
##  [9,] -0.7146446
## [10,]  0.1738325
## [11,] -1.1588832
## [12,] -0.2704061
## [13,] -0.2704061
## [14,] -0.2704061
## [15,]  0.6180710
## [16,] -0.7146446
## [17,] -1.1588832
## [18,]  0.1738325
## [19,]  0.1738325
## [20,]  0.1738325
## [21,]  0.1738325
## [22,]  0.1738325
## [23,] -1.1588832
## attr(,"scaled:center")
## [1] 6.608696
## attr(,"scaled:scale")
## [1] 2.251043

Z-scores match.
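
The comparison can also be done programmatically instead of by eye. This is a minimal sketch, assuming the values are typed in from the sorted table above, so the order differs from the CSV order shown in the output; `zManual` and `zBuiltin` are illustrative names.

```r
# Values copied from the sorted table above (assumption: same data as the CSV)
hours <- c(rep(4, 5), rep(5, 2), rep(6, 3), rep(7, 8), rep(8, 2), rep(9, 2), 14)

# Manual z-scores: deviation from the mean in units of the sample SD
zManual <- (hours - mean(hours)) / sd(hours)

# scale() returns a one-column matrix, so flatten it before comparing
zBuiltin <- as.numeric(scale(hours))

# TRUE when the two vectors agree up to floating-point tolerance
isTRUE(all.equal(zManual, zBuiltin))
```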

How unlikely would it be to find a person who slept for 16 hours given this distribution? Show your work.

#CALCULATE Z-SCORE 
z16 <- (16- sleepMean) / SD

#CALCULATE PROBABILITY OF 16 HOUR (OR MORE) SLEEP USING "ROUGH" FORMULA
rough16 <- (.6^(z16^2))*.4
cat("The approximate probability is:", rough16)

## The approximate probability is: 5.504159e-05

#CALCULATE PROBABILITY USING PNORM
prob16 <- (1-pnorm(z16))
cat("The actual probability is:", prob16)

## The actual probability is: 1.509824e-05

#OR IN TERMS OF RATIOS
ratio16 <- (1/(1-pnorm(z16)))
cat("The ratio is 1 person in", ratio16)

## The ratio is 1 person in 66232.88

What would be the expected score of someone who had a z-score of 1.96? What about -3.09?

#SOLVE FOR HOURS
hours1 <- 1.96 * SD + sleepMean
hours2 <- -3.09 * SD + sleepMean
sprintf("The predicted hours of sleep for someone with a z-score of 1.96 would be %f, or about %i hours.", hours1, as.integer(hours1))

## [1] "The predicted hours of sleep for someone with a z-score of 1.96 would be 11.020740, or about 11 hours."

sprintf("The predicted hours of sleep for someone with a z-score of -3.09 would be %f, or about %i hours.", hours2, round(hours2, 0))

## [1] "The predicted hours of sleep for someone with a z-score of -3.09 would be -0.347027, or about 0 hours."

A friend of yours is convinced that graduate students only sleep 5 hours a night. How would you test this hypothesis? Given this data, is this a good hypothesis? Write out the equations that you used to examine this hypothesis. Provide your models, alternative models, fit functions, and conclusions.

I would test this hypothesis by taking a random sample of graduate students and finding out how many hours they had slept the night before. I would spread the data selection over each day of the week (an equal number of randomly selected students each day) to account for any weekly sleep patterns. With this data you could calculate a mean value and then also variance, standard deviation, z-scores and then find how likely it is that a student sleeps 5 hours per night (specifically this would be the range from 4.5 to 5.5 hours) as well as what percentage of the population sleeps more or less than exactly 5 hours.

Based on the data for this exercise this is not a very good hypothesis since the mean from the data is about 6.6 hours.

#Z Score for 5
zed5 <- (5 - sleepMean)/ SD
#Find percentage of people that are sleeping more than 5 hours
probMore5 <- 1-pnorm(zed5)
cat("The probability that someone sleeps more than 5 hours based on our data is", probMore5)

## The probability that someone sleeps more than 5 hours based on our data is 0.7625857

#Find the 4.5 - 5.5 interval
fiveInt <- pnorm((5.5-sleepMean)/SD) - pnorm((4.5-sleepMean)/SD)
cat("The probability that any individual sleeps about 5 hours based on our data is", fiveInt)

## The probability that any individual sleeps about 5 hours based on our data is 0.136734

#CALCULATE A 95% CONFIDENCE INTERVAL
error95 <- qnorm(0.975)*SD/sqrt(length(sleep$hours))
left95 <- sleepMean-error95
right95 <- sleepMean+error95

sprintf("The true mean has a probability of 95 percent of being in the interval between %f and %f assuming that the original random variable is normally distributed, and the samples are independent.", left95, right95)

## [1] "The true mean has a probability of 95 percent of being in the interval between 5.688738 and 7.528653 assuming that the original random variable is normally distributed, and the samples are independent."

#CALCULATE A 99% CONFIDENCE INTERVAL
error99 <- qnorm(0.995)*SD/sqrt(length(sleep$hours))
left99 <- sleepMean-error99
right99 <- sleepMean+error99

sprintf("The true mean has a probability of 99 percent of being in the interval between %f and %f assuming that the original random variable is normally distributed, and the samples are independent.", left99, right99)

## [1] "The true mean has a probability of 99 percent of being in the interval between 5.399666 and 7.817725 assuming that the original random variable is normally distributed, and the samples are independent."

My conclusion is that, based on the data collected, my friend is very likely wrong: 5 hours falls below the lower bound of both the 95% interval (about 5.69) and the 99% interval (about 5.40) for the mean.
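
A one-sample t-test is another common way to examine the hypothesis that the true mean is 5. The sketch below is not part of the original analysis. It assumes the values are typed in from the table above and treats the sample as independent draws from a roughly normal population; `tStat`, `pValue` and `builtin` are illustrative names.

```r
# Values copied from the sorted table above (assumption: same data as the CSV)
hours <- c(rep(4, 5), rep(5, 2), rep(6, 3), rep(7, 8), rep(8, 2), rep(9, 2), 14)
n <- length(hours)

# Test statistic: distance of the sample mean from 5, in standard-error units
tStat <- (mean(hours) - 5) / (sd(hours) / sqrt(n))

# Two-tailed p-value from the t distribution with n - 1 degrees of freedom
pValue <- 2 * pt(-abs(tStat), df = n - 1)

# Built-in equivalent for comparison
builtin <- t.test(hours, mu = 5)

cat("t =", tStat, "on", n - 1, "df, p =", pValue, "\n")
```

A small p-value here points the same way as the confidence intervals: a true mean of 5 hours is hard to reconcile with this sample.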

What would the standard error of the mean be if the sample size was 100? What about 1000?

#STANDARD ERROR = STANDARD DEVIATION DIVIDED BY THE SQUARE ROOT OF THE SAMPLE SIZE
SE100 <- SD/sqrt(100)
cat("Standard error if sample was 100 =", SE100)

## Standard error if sample was 100 = 0.2251043

SE1000 <- SD/sqrt(1000)
cat("Standard error if sample was 1000 =", SE1000)

## Standard error if sample was 1000 = 0.07118422
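
The two calculations follow the same pattern, so they can be wrapped in a small helper. This is a sketch that assumes the standard deviation stays at the value estimated from the current 23 observations; `seForN` and `seValues` are hypothetical names.

```r
# Values copied from the sorted table above (assumption: same data as the CSV)
hours <- c(rep(4, 5), rep(5, 2), rep(6, 3), rep(7, 8), rep(8, 2), rep(9, 2), 14)

# Standard error of the mean for a given SD and sample size
seForN <- function(sdValue, n) sdValue / sqrt(n)

# Vectorized over n: current sample size, then 100 and 1000
seValues <- seForN(sd(hours), c(length(hours), 100, 1000))
seValues
```

Because the standard error scales with 1/sqrt(n), a tenfold increase in sample size shrinks it by a factor of about 3.16, not 10.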

How would each of these experimental changes change power?

a. Increase sample size

Increase power

b. Use a more representative sample

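Little direct effect on power (it mainly reduces bias and improves generalizability; power changes only if the new sample has different variability)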

c. Using p < .001 rather than p < .05 for an alpha cutoff

Decrease power

d. Using a one tailed test

Increase power
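
The direction of effects a, c and d can be illustrated with power.t.test() from the stats package. This is a sketch under made-up assumptions: a true difference of 1 hour and a standard deviation of 2.25 hours are hypothetical values chosen only for illustration, and the variable names are not from the original.

```r
# Hypothetical scenario: true difference of 1 hour, SD of 2.25 hours
basePower <- power.t.test(n = 23, delta = 1, sd = 2.25, sig.level = 0.05,
                          type = "one.sample", alternative = "two.sided")$power

# a. Larger sample size
largeNPower <- power.t.test(n = 100, delta = 1, sd = 2.25, sig.level = 0.05,
                            type = "one.sample", alternative = "two.sided")$power

# c. Stricter alpha cutoff
strictAlphaPower <- power.t.test(n = 23, delta = 1, sd = 2.25, sig.level = 0.001,
                                 type = "one.sample", alternative = "two.sided")$power

# d. One-tailed test (in the direction of the true difference)
oneTailedPower <- power.t.test(n = 23, delta = 1, sd = 2.25, sig.level = 0.05,
                               type = "one.sample", alternative = "one.sided")$power

c(base = basePower, largeN = largeNPower,
  strictAlpha = strictAlphaPower, oneTailed = oneTailedPower)
```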

Challenge points! Figure out how to make a plot using the R package ggplot2 that makes a histogram of the raw data that clearly indicates the 95 and 99% confidence intervals. Also, include a vertical line at the mean, the median, and the mode. Provide your plot, and the code that you used to generate it.

library(ggplot2)
cuts1 <- data.frame(Legend="95% CI", vals=c(left95, right95))
cuts2 <- data.frame(Legend="99% CI", vals=c(left99, right99))
cuts3 <- data.frame(Legend="Mean", vals=c(sleepMean))
cuts4 <- data.frame(Legend="Mode/Median", vals=c(7))

cuts <- rbind(cuts1,cuts2,cuts3,cuts4)

ggplot(data=sleep, aes(x=hours)) +
  geom_histogram(breaks=seq(0, 15, by=1),
                 col="red", 
                 fill="yellow",
                 alpha = .2) +
  # inherit.aes=FALSE stops this layer from looking for an "hours" column in cuts
  geom_vline(data=cuts, 
             aes(xintercept=vals, 
                 linetype=Legend,
                 colour = Legend),
             inherit.aes = FALSE,
             show.legend  = TRUE) +
   labs(title="Hours of Sleep", x="hours", y="count")

The confidence intervals calculated are confidence intervals of the population mean based on the sample.

How long did it take you to complete this assignment?

8 Hours
