Math and Statistics Lab 4b

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")

load("ames.RData")

population <- ames$Gr.Liv.Area

samp<-sample(population, 60)

summary(samp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     754    1094    1360    1482    1733    2794

hist(samp)

Exercise 1. The distribution is right-skewed. Most observations fall between 1000 and 1500. I interpret "typical" as the most frequent value, so the typical size falls in that range.

Exercise 2. Another sample is very unlikely to be identical to the first. We randomly pick 60 observations, and it would be extremely unlikely for all of them to match the original sample. However, another sample is very likely to be similar to the original one. Both samples draw 60 observations from the same population, which should be enough to give us an idea of the actual population, so both samples should resemble the population and therefore each other.
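
The point can be illustrated with a minimal sketch that draws two samples of 60 from the same population. The vector `sim_population` is an assumption: a simulated right-skewed (log-normal) population with hypothetical parameters, standing in for `ames$Gr.Liv.Area` so the snippet runs without the data file.

```r
# Hypothetical stand-in for ames$Gr.Liv.Area: a simulated right-skewed population
set.seed(1)
sim_population <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

# Two independent samples of 60 from the same population
samp_a <- sample(sim_population, 60)
samp_b <- sample(sim_population, 60)

# The samples contain different observations...
same_sample <- identical(samp_a, samp_b)

# ...but their means should both land near the population mean
c(mean(samp_a), mean(samp_b), mean(sim_population))
```

The two sample means differ from each other and from the population mean, but only by an amount on the order of the standard error.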

sample_mean <- mean(samp)

se <- sd(samp)/sqrt(60)

lower <- sample_mean-1.96*se

upper <- sample_mean+1.96*se

c(lower,upper)

## [1] 1358.825 1605.008

Exercise 3. The observations should be randomly sampled and independent, the sample size should be at least 30, and the distribution should not be strongly skewed.

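Exercise 4. "95% confidence" means that if we repeatedly drew samples and built an interval from each one, about 95% of those intervals would contain the population mean. For this sample, we are 95% confident that the population mean lies between 1358.825 and 1605.008.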

mean(population)

## [1] 1499.69

Exercise 5. Yes. The interval (1358.825, 1605.008) contains the population mean of 1499.69.

Exercise 6. About 95% of the intervals should capture the population mean, provided our assumptions hold, because that is what a 95% confidence level means.

# Preallocate one slot per simulated sample (the loop below runs 5000 times)
samp_mean <- rep(NA, 5000)

samp_sd <- rep(NA, 5000)

n <- 60

for (i in 1:5000){
  samp<-sample(population,n)
  samp_mean[i]<-mean(samp)
  samp_sd[i]<-sd(samp)
}

lower_vector<-samp_mean-1.96*samp_sd/sqrt(n)
upper_vector<-samp_mean+1.96*samp_sd/sqrt(n)

c(lower_vector[1],upper_vector[1])

## [1] 1242.624 1464.143

On my own.

94.2% of the intervals include the population mean (the code below finds that 5.8% miss it). This is close to, but not exactly, 95%, for a number of reasons:
- the interval relies on approximations that can introduce error: the sample mean stands in for the population mean and the sample SD for the population SD
- some samples may be too skewed to satisfy the conditions for the normal approximation
- the 95% figure describes the long-run proportion over infinitely many samples, so the proportion observed in 5000 samples will vary around it
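
Two of these effects can be roughly quantified. The sketch below is not part of the original lab. It compares the normal critical value with the t critical value for n - 1 degrees of freedom (the usual adjustment when the sample SD replaces the population SD), and it estimates how much a coverage rate based on 5000 intervals would fluctuate by chance if the true rate were exactly 95%. The names `n_sims`, `z_crit`, `t_crit`, and `mc_se` are introduced here for illustration.

```r
n <- 60          # sample size used in the lab
n_sims <- 5000   # number of simulated intervals

# Critical values: normal versus t with n - 1 degrees of freedom
z_crit <- qnorm(0.975)
t_crit <- qt(0.975, df = n - 1)
c(z_crit, t_crit)

# Standard error of an estimated coverage rate if the true rate were 0.95
mc_se <- sqrt(0.95 * 0.05 / n_sims)
mc_se
```

Because the t critical value is slightly larger than 1.96, intervals built with 1.96 are slightly too narrow, which pushes the observed coverage a little below 95%.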

plot_ci(lower_vector, upper_vector, mean(population))

count<-0
for (i in 1:5000){
  if (lower_vector[i]>mean(population)||upper_vector[i]<mean(population)){count<-count+1}
}

count/5000

## [1] 0.058
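
The same miss rate can be computed without an explicit loop. The following is a sketch under stated assumptions: `sim_population` is a simulated log-normal population with hypothetical parameters (not the Ames data), and `stats`, `lower`, `upper`, and `miss_rate` are illustrative names.

```r
set.seed(42)
# Hypothetical stand-in for ames$Gr.Liv.Area
sim_population <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)
n <- 60
n_sims <- 5000

# One row per simulated sample: column 1 is its mean, column 2 its SD
stats <- t(replicate(n_sims, {
  s <- sample(sim_population, n)
  c(mean(s), sd(s))
}))

lower <- stats[, 1] - 1.96 * stats[, 2] / sqrt(n)
upper <- stats[, 1] + 1.96 * stats[, 2] / sqrt(n)

# Compute the population mean once, then compare all intervals at once
mu <- mean(sim_population)
miss_rate <- mean(lower > mu | upper < mu)  # proportion of intervals missing mu
miss_rate
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, so it replaces both the counter and the final division.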

Let’s choose a 90% confidence level. The critical value is 1.645, and the interval for the first sample is (1260.424, 1446.342).

lower_vector1<-samp_mean-1.645*samp_sd/sqrt(n)
upper_vector1<-samp_mean+1.645*samp_sd/sqrt(n)

c(lower_vector1[1],upper_vector1[1])

## [1] 1260.424 1446.342

The observed coverage is 89.1% (1 - 0.1094 = 0.8906) versus the nominal 90%, which is very close.

plot_ci(lower_vector1, upper_vector1, mean(population))

count<-0
for (i in 1:5000){
  if (lower_vector1[i]>mean(population)||upper_vector1[i]<mean(population)){count<-count+1}
}

count/5000

## [1] 0.1094
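
The critical values 1.96 and 1.645 used above can be derived from the confidence level with `qnorm`. The helper `ci_mean` below is hypothetical, a minimal sketch of a normal-approximation interval for any level, and the sample `x` is simulated rather than taken from the Ames data.

```r
# Hypothetical helper: normal-approximation CI for a mean at a given level
ci_mean <- function(x, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)   # about 1.96 for 95%, about 1.645 for 90%
  se <- sd(x) / sqrt(length(x))
  mean(x) + c(-1, 1) * z * se
}

set.seed(7)
x <- rlnorm(60, meanlog = 7.25, sdlog = 0.33)  # simulated sample, not the Ames data

ci_90 <- ci_mean(x, level = 0.90)
ci_95 <- ci_mean(x, level = 0.95)
rbind(ci_90, ci_95)
```

The 90% interval is narrower than the 95% interval for the same sample, which is why it misses the population mean more often.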

Mikhail Groysman

November 12, 2018