7_MS_LabCI

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")

population<- ames$Gr.Liv.Area
n<-60
samp<- sample(population, n)

hist(samp)

summary(samp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     630    1074    1380    1419    1655    2715

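# Quartiles entered by hand. They do not match the summary() output above
# (1074 and 1655), likely because sample() drew a different sample on a later run.
samp_1stQ<-1066
samp_3rdQ<-1706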
ub_outlier<- samp_3rdQ+(samp_3rdQ- samp_1stQ)*1.5; ub_outlier

## [1] 2666

lb_outlier<- samp_1stQ-(samp_3rdQ- samp_1stQ)*1.5; lb_outlier

## [1] 106

print(samp>2666)

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

print(samp<106)

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

samp[12]

## [1] 630
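The fences above were typed in by hand from one draw of the sample. A minimal sketch of computing them directly from whatever sample is in hand follows; `iqr_fences` is a hypothetical helper name, and the short `x` vector is made-up illustration data, not the Ames sample.

```r
# Hypothetical helper: 1.5 * IQR fences computed from the data itself.
iqr_fences <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)
}

# Made-up values for illustration only (not the Ames sample).
x <- c(630, 900, 1074, 1200, 1380, 1500, 1655, 1800, 2715, 4000)

fences <- iqr_fences(x)
fences

# Positions and values of any observations outside the fences.
outlier_idx <- which(x < fences["lower"] | x > fences["upper"])
x[outlier_idx]
```

Using `which()` returns the positions of the flagged values directly, which avoids scanning a long TRUE/FALSE printout.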

##Exercise 1: Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.

The distribution of house sizes (ft^2) in this sample from Ames, Iowa, is right-skewed, with two values above the upper fence of 2666 ft^2 (the largest is 2715 ft^2). Besides the skew visible in the histogram, the median (1380) is lower than the mean (1419), which also points to a right-skewed distribution.

##Exercise 2: Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?

I would not expect another student’s sample distribution to be identical, because each sample is a different random draw. I would expect it to be similar, because a sample of n = 60 is large enough for both samples to be reasonably representative of the overall population.

sample_mean<- mean(samp);sample_mean

## [1] 1419.1

se<- sd(samp)/n^(1/2)

lower <-sample_mean - 1.96 * se
upper <-sample_mean + 1.96 * se
c(lower, upper)

## [1] 1311.282 1526.918
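The interval calculation above can be wrapped in a small function so that the critical value is derived from the confidence level instead of being typed in. This is only a sketch: `mean_ci` is a hypothetical helper name, and `x` is simulated stand-in data (an assumption, so the block runs without ames.RData).

```r
# Hypothetical helper: normal-approximation confidence interval for a mean.
mean_ci <- function(x, conf = 0.95) {
  z  <- qnorm(1 - (1 - conf) / 2)   # about 1.96 when conf = 0.95
  se <- sd(x) / sqrt(length(x))     # standard error s / sqrt(n)
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

# Simulated stand-in for `samp` (not the Ames data).
set.seed(1)
x <- rnorm(60, mean = 1500, sd = 500)

ci <- mean_ci(x)
ci
```

With the lab's objects loaded, `mean_ci(samp)` should reproduce `c(lower, upper)` up to the rounding of 1.96.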

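##Exercise 3: For the confidence interval to be valid, the sample mean must be normally distributed and have standard error s/√n. What conditions must be met for this to be true?

The observations must be independent, which a simple random sample provides, and the sample must be large enough that the sampling distribution of the mean is approximately normal. Even if the population is not normal, the sample mean is approximately normally distributed when n is sufficiently large. In this case n = 60, which is greater than the suggested minimum of n = 30.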
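The sample-size condition in Exercise 3 can be illustrated by simulation. The sketch below uses a made-up exponential population (an assumption, not the Ames data) and a hypothetical `skewness` helper to compare the skew of the population with the skew of means from samples of n = 60.

```r
set.seed(42)

# Made-up right-skewed population standing in for the house sizes.
pop <- rexp(3000, rate = 1 / 1500)

# Hypothetical helper: sample skewness (third central moment over sd cubed).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# 2000 sample means, each from a simple random sample of n = 60.
means <- replicate(2000, mean(sample(pop, 60)))

skewness(pop)    # skewness of the population
skewness(means)  # skewness of the sample means
```

The sample means should come out far less skewed than the population, which is what makes the normal-based interval reasonable at this sample size.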

##Exercise 4: What does “95% confidence” mean? If you’re not sure, see Section 4.2.2.

95% confidence means that the method used to build the interval captures the population mean in about 95% of repeated random samples. For a single sample, we say we are 95% confident that the population mean lies between the two endpoints of the interval.

mean(population)

## [1] 1499.69

##Exercise 5: Does your confidence interval capture the true average size of houses in Ames? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?

The 95% confidence interval of 1311.282 to 1526.918 captures the true population mean of 1499.69.

##Exercise 6: Each student in your class should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why? If you are working in this lab in a classroom, collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.

I would expect about 95% of the confidence intervals to capture the population mean, because each interval is built with a method that has 95% coverage (likely only 0 to 1 students in our class would have an interval that misses the population mean).

##Exercise: Loops for samples

samp_mean<-rep(NA, 50)
samp_sd<-rep(NA, 50)

for (i in 1:50){
  samp <- sample(population,n)
  samp_mean[i]<-mean(samp)
  samp_sd[i]<-sd(samp)
}

lower_vector<-samp_mean -1.96*(samp_sd/n^(1/2))
upper_vector<-samp_mean +1.96*(samp_sd/n^(1/2))


# Column 1 holds the lower bounds, column 2 the upper bounds
CI_list<-matrix(c(lower_vector,upper_vector), nrow = 50, ncol = 2)

CI_list

##           [,1]     [,2]
##  [1,] 1448.010 1775.056
##  [2,] 1438.238 1688.762
##  [3,] 1374.376 1645.224
##  [4,] 1406.726 1736.241
##  [5,] 1378.930 1615.403
##  [6,] 1337.042 1552.692
##  [7,] 1429.788 1700.978
##  [8,] 1367.411 1619.622
##  [9,] 1398.844 1598.590
## [10,] 1400.546 1687.988
## [11,] 1392.993 1638.807
## [12,] 1445.295 1687.305
## [13,] 1414.554 1679.613
## [14,] 1433.734 1641.033
## [15,] 1399.795 1656.638
## [16,] 1367.130 1596.703
## [17,] 1323.635 1563.365
## [18,] 1326.776 1567.590
## [19,] 1321.105 1591.861
## [20,] 1308.437 1552.463
## [21,] 1323.271 1535.662
## [22,] 1382.505 1658.295
## [23,] 1419.936 1747.231
## [24,] 1351.490 1595.744
## [25,] 1359.665 1613.835
## [26,] 1326.814 1568.553
## [27,] 1380.118 1635.216
## [28,] 1433.512 1804.754
## [29,] 1333.148 1551.185
## [30,] 1482.489 1725.211
## [31,] 1385.284 1620.882
## [32,] 1402.073 1645.093
## [33,] 1297.143 1505.757
## [34,] 1284.364 1558.603
## [35,] 1373.090 1630.277
## [36,] 1329.252 1612.148
## [37,] 1388.130 1679.803
## [38,] 1387.011 1660.289
## [39,] 1345.528 1636.505
## [40,] 1311.155 1502.512
## [41,] 1335.989 1590.677
## [42,] 1345.010 1684.990
## [43,] 1348.382 1607.518
## [44,] 1496.096 1723.504
## [45,] 1361.190 1692.710
## [46,] 1564.861 1884.939
## [47,] 1379.549 1609.051
## [48,] 1375.096 1634.570
## [49,] 1462.772 1807.028
## [50,] 1338.707 1538.726

##On your own: (#1)

samp_mean<-rep(NA, 50)
samp_sd<-rep(NA, 50)

for (i in 1:50){
  samp <- sample(population,n)
  samp_mean[i]<-mean(samp)
  samp_sd[i]<-sd(samp)
}

lower_vector<-samp_mean -1.96*(samp_sd/n^(1/2))
upper_vector<-samp_mean +1.96*(samp_sd/n^(1/2))

plot_ci(lower_vector, upper_vector, mean(population))

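# Number of the 50 intervals that miss the population mean
outside_ci<-4
# Divide by the number of intervals (50), not by the sample size n = 60
perc_outside_ci<-outside_ci/50;perc_outside_ci

## [1] 0.08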
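Instead of counting the missed intervals from the plot, the count can be taken from the bound vectors themselves. The sketch below reruns the loop on a made-up log-normal population (an assumption, so the block runs without ames.RData). The names `pop`, `reps` and `missed` are introduced here for illustration.

```r
set.seed(2020)

# Made-up skewed population standing in for ames$Gr.Liv.Area.
pop <- rlnorm(3000, meanlog = 7.2, sdlog = 0.35)
n <- 60      # size of each sample
reps <- 50   # number of samples, and so of intervals

samp_mean <- rep(NA, reps)
samp_sd <- rep(NA, reps)
for (i in 1:reps) {
  s <- sample(pop, n)
  samp_mean[i] <- mean(s)
  samp_sd[i] <- sd(s)
}
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)

# An interval misses when the population mean lies below its lower
# end or above its upper end.
missed <- lower_vector > mean(pop) | upper_vector < mean(pop)
sum(missed)    # number of intervals that miss
mean(missed)   # proportion that miss, out of `reps` intervals
```

Because `mean(missed)` divides by the length of the vector, the denominator always matches the number of intervals simulated.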

qnorm(.875, 0, 1)

## [1] 1.150349

##(#2) A 75% confidence interval leaves 12.5% in each tail, so the critical value is the 0.875 quantile of the standard normal distribution, approximately 1.150.
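The same reasoning gives the critical value for any confidence level: split the area outside the interval evenly between the two tails and take the matching upper quantile. A short sketch, with the set of confidence levels chosen only for illustration:

```r
# Confidence levels chosen for illustration.
conf <- c(0.75, 0.90, 0.95, 0.99)

# Area left in each tail, and the standard normal critical value.
tail_area <- (1 - conf) / 2
z <- qnorm(1 - tail_area)

data.frame(conf, tail_area, z)
```

The first row corresponds to `qnorm(.875, 0, 1)` above, and the 95% row gives the familiar value of about 1.96.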

samp_mean<-rep(NA, 50)
samp_sd<-rep(NA, 50)

for (i in 1:50){
  samp <- sample(population,n)
  samp_mean[i]<-mean(samp)
  samp_sd[i]<-sd(samp)
}

lower_vector_75<-samp_mean -1.15*(samp_sd/n^(1/2))
upper_vector_75<-samp_mean +1.15*(samp_sd/n^(1/2))


plot_ci(lower_vector_75, upper_vector_75, mean(population))

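# Number of the 50 intervals that miss the population mean
outside_ci<-15
# Divide by the number of intervals (50), not by the sample size n = 60
perc_outside_ci<-outside_ci/50;perc_outside_ci

## [1] 0.3

##(#3) In this run, 15 of the 50 intervals (30%) miss the population mean, so 70% capture it, a little below the nominal 75%. A few reruns gave similar proportions, with the coverage shifting by a few percentage points from run to run.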

MCS

3/17/2020