Lab 4

#### Name: Sonora Williams

#### Section: 01l

#### Date: September 24, 2013

### Exercises

#### Load data:

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
area = ames$Gr.Liv.Area
price = ames$SalePrice

Exercise 1: The histogram is right skewed with a peak at about 1,500 square feet. The long right tail shows that a small number of very large houses stretch the distribution upward, while almost no houses fall far below the peak, suggesting that this is not necessarily a very impoverished population.

hist(area)

[Figure: histogram of area]

Exercise 2:

set.seed(554985)

Exercise 3: This histogram differs in that its main peak falls below 1500 square feet, at about 1250. It is also right skewed: the largest areas reach about 2500 square feet, while the smallest is about 500 square feet.

samp1 = sample(area, 50)
hist(samp1)

[Figure: histogram of samp1]

Exercise 4: The second sample looks quite different from the first. Its peak is again centered at about 1500, but it has only 4 bins, which reach all the way up to 3000 square feet. A sample of size 100 would give a more precise estimate of the population mean than a sample of 50, and a sample of size 1000 would be more precise still.

samp2 = sample(area, 50)
hist(samp2)

[Figure: histogram of samp2]

Exercise 5: There are 5,000 elements in sample_means50. The mean of these sample means is about 1500, the minimum is about 1260, and the maximum is about 1760. With 50,000 sample means the center and spread would stay about the same, but the histogram would be smoother and the tails would reach slightly more extreme values.

sample_means50 = rep(0, 5000)
for (i in 1:5000) {
    samp = sample(area, 50)
    sample_means50[i] = mean(samp)
}
length(sample_means50)
## [1] 5000
summary(sample_means50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1260    1450    1500    1500    1540    1760
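As a rough cross-check on a summary like the one above, the spread of the sample means can be compared with the theoretical standard error, sd/sqrt(n). The sketch below does not use the Ames data: it assumes a simulated right-skewed population (`sim_pop`, a hypothetical stand-in drawn from a gamma distribution) so that it runs on its own.

```r
set.seed(1)

# Hypothetical stand-in for the population: right skewed, mean near 1500
sim_pop = rgamma(3000, shape = 9, scale = 1500 / 9)

# 5000 sample means, each from a random sample of 50 values
sim_means50 = replicate(5000, mean(sample(sim_pop, 50)))

# Theoretical standard error for n = 50
# (ignores the small finite-population correction)
theory_se = sd(sim_pop) / sqrt(50)

c(observed = sd(sim_means50), theoretical = theory_se)
```

The two numbers should be close, which is why the sample means cluster so much more tightly than the individual house sizes do.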

Exercise 6: sample_means_small has 100 elements, and each element is the mean of one random sample of fifty observations.

sample_means_small = rep(0, 100)
for (i in 1:100) {
    samp = sample(area, 50)
    sample_means_small[i] = mean(samp)
}

Exercise 7: The larger the sample size, the more precisely the sampling distribution is centered; in fact, the bigger the sample, the closer the sample mean tends to be to the actual population mean. The distribution also narrows as the sample size increases.
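The narrowing can be checked with a small simulation. This is only a sketch on a simulated right-skewed population (`sim_pop` and the helper `spread_of_means` are hypothetical names, not part of the lab), comparing the spread of sample means for three sample sizes.

```r
set.seed(2)

# Hypothetical stand-in for the population: right skewed, mean near 1500
sim_pop = rgamma(3000, shape = 9, scale = 1500 / 9)

# Standard deviation of `reps` sample means, each from a sample of size n
spread_of_means = function(n, reps = 2000) {
    sd(replicate(reps, mean(sample(sim_pop, n))))
}

sizes = c(10, 50, 100)
spreads = sapply(sizes, spread_of_means)
data.frame(n = sizes, sd_of_means = spreads)
```

The spread should shrink as n grows, roughly in proportion to 1/sqrt(n).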

Part II

Set seed:

# enter your UID
population = ames$Gr.Liv.Area
set.seed(554985)
samp = sample(population, 60)

Exercise 8: Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.

This sample has a right-skewed distribution with a median of about 1360 and a mean of about 1430 square feet. The typical size in the sample is roughly 1000 to 1500 square feet. I took “typical” to mean the peak of the histogram, since that range has the largest frequency.

population = ames$Gr.Liv.Area
samp = sample(population, 60)
summary(samp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     747    1120    1360    1430    1640    2940
hist(samp)

[Figure: histogram of samp]

Exercise 9: Now compare your distribution to your neighbors’. Do they look similar? Are they identical? Why, or why not?

Well, my neighbors left, so my imaginary friends, recreated using an adjacent computer, will have to do. Drawing a new sample, I got a noticeably different mean, about 1540, and a different histogram. The histogram is still right skewed, but the bin widths are smaller. The two are similar but not identical because each random sample contains different houses.

samp5 = sample(population, 60)
population = ames$Gr.Liv.Area
samp2 = sample(population, 60)
summary(samp5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     816    1180    1390    1540    1730    2640
hist(samp5)

[Figure: histogram of samp5]

Exercise 10: Interpret this interval in context of the data. Make sure to include all relevant code for calculating the interval in your report.

The lower limit is about 1326 and the upper limit is about 1541, meaning that we are 95% confident that the true mean living area of houses in Ames lies between 1326 and 1541 square feet.

sample_mean = mean(samp)
se = sd(samp)/sqrt(60)
lower = sample_mean - qnorm(0.975) * se
upper = sample_mean + qnorm(0.975) * se
c(lower, upper)
## [1] 1326 1541
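The same calculation can be wrapped in a small function so it is not retyped for every sample. This is a sketch only: `mean_ci` is a hypothetical helper, not something the lab provides, and `sim_samp` is simulated data standing in for `samp` so the block runs without the Ames file.

```r
# Hypothetical helper: normal-based confidence interval for a mean
mean_ci = function(x, conf_level = 0.95) {
    n = length(x)
    se = sd(x) / sqrt(n)                  # estimated standard error
    z = qnorm(1 - (1 - conf_level) / 2)   # two-sided critical value
    c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

set.seed(3)
# Simulated stand-in for a sample of 60 house sizes
sim_samp = rgamma(60, shape = 9, scale = 1500 / 9)

ci = mean_ci(sim_samp)
ci
```

Calling `mean_ci(samp)` on the real sample should reproduce the interval computed step by step above.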

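Exercise 11: For the confidence interval to be valid, the sample mean must be normally distributed and have standard error s/sqrt(n). What conditions must be met for this to be true?

Either the population has to be normal already or the sample size has to be large, and the observations have to be independent. The population here is right skewed rather than normal, so the condition is met through the sample: 60 houses is large enough (more than 30) for the sample mean to be approximately normal, and because the houses are drawn at random and make up a small share of the data set, the observations can be treated as independent.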

Exercise 12: What does “95% confidence” mean?

This means that if many random samples were drawn and an interval were built from each in the same way, about 95% of those intervals would contain the actual population mean. It does not mean that the mean itself moves in and out of one fixed interval.
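The repeated-sampling reading of "95% confidence" can be illustrated by simulation. The sketch below assumes a simulated right-skewed population (`sim_pop`, a hypothetical stand-in for the Ames areas), builds 1000 intervals from samples of size 60, and reports the fraction that capture the population mean.

```r
set.seed(4)

# Hypothetical stand-in for the population
sim_pop = rgamma(3000, shape = 9, scale = 1500 / 9)
true_mean = mean(sim_pop)

n = 60
reps = 1000

# TRUE when the interval from one sample contains the population mean
covered = replicate(reps, {
    s = sample(sim_pop, n)
    half_width = qnorm(0.975) * sd(s) / sqrt(n)
    abs(mean(s) - true_mean) <= half_width
})

mean(covered)  # proportion of intervals that capture the true mean
```

The proportion should land near 0.95, though not exactly on it, since it is itself the result of random sampling.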

Exercise 13: Does your confidence interval capture the true average size of houses in Ames? Do your neighbors' intervals capture this value?

My confidence interval, 1326 to 1541, does capture the true mean of about 1500, and my imaginary friend's interval, 1435 to 1650, captures it as well.

mean(population)
## [1] 1500
sample_mean = mean(samp5)
se = sd(samp)/sqrt(60)  # note: this uses the sd of samp, not samp5; sd(samp5) would match the second sample
lower = sample_mean - qnorm(0.975) * se
upper = sample_mean + qnorm(0.975) * se
c(lower, upper)
## [1] 1435 1650

Exercise 14: Each student in your class should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why? Collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.

I would expect about 95% of the intervals to contain the true mean, because a 95% confidence procedure captures the true mean for about 95% of random samples. In practice, 94% of the intervals contained the true mean.

Exercise 15: View the first interval using the code below. Does this interval capture the true population mean? Note that you will need to include earlier code calculating lower and upper in your report for the below command to work.

The first interval shown below, 1375 to 1640, does include the true mean of about 1500.

samp_mean = rep(NA, 50)
samp_sd = rep(NA, 50)
n = 60
for (i in 1:50) {
    samp = sample(population, n)  # obtain a sample of size n = 60 from the population
    samp_mean[i] = mean(samp)  # save sample mean in ith element of samp_mean
    samp_sd[i] = sd(samp)  # save sample sd in ith element of samp_sd
}

lower = samp_mean - qnorm(0.975) * samp_sd/sqrt(n)
upper = samp_mean + qnorm(0.975) * samp_sd/sqrt(n)
c(lower[1], upper[1])
## [1] 1375 1640

Exercise 16: What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If no, explain why.

There are fifty intervals in total and three of them do not include the true mean. Therefore 47 of the 50 include the true mean, or 94%. This is not exactly equal to the confidence level, and with 50 intervals it cannot be, since 95% of 50 is 47.5. The confidence level describes the long-run capture rate, so any finite set of intervals will vary around it; with many more intervals the percentage would tend to approach 95%. Still, 94% is pretty close.

plot_ci(lower, upper, mean(population))

[Figure: plot_ci output showing the 50 intervals against the population mean]
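How much the count of capturing intervals can wander with only 50 intervals can be sketched with the binomial distribution. This assumes, as a simplification, that each interval captures the true mean independently with probability exactly 0.95; the variable names are illustrative only.

```r
n_intervals = 50
conf_level = 0.95

# Expected number of capturing intervals (not a whole number here)
expected_count = n_intervals * conf_level

# Probability that exactly 47 of the 50 intervals capture the mean
p_exactly_47 = dbinom(47, size = n_intervals, prob = conf_level)

# Probability that 47 or fewer capture the mean
p_47_or_fewer = pbinom(47, size = n_intervals, prob = conf_level)

c(expected = expected_count, exactly_47 = p_exactly_47, at_most_47 = p_47_or_fewer)
```

Under that assumption, seeing 47 of 50 is one of the more likely outcomes rather than a sign that anything went wrong.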

Exercise 17: Confidence level = 92%. A 92% interval leaves 4% in each tail, so the critical value is qnorm(0.96), which is about 1.75.
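For a two-sided interval, the critical value is the normal quantile that leaves half of the leftover probability in each tail. A minimal sketch, where `critical_value` is a hypothetical helper name:

```r
# Two-sided normal critical value for a given confidence level
critical_value = function(conf_level) {
    alpha = 1 - conf_level      # total probability outside the interval
    qnorm(1 - alpha / 2)        # upper quantile leaving alpha/2 in the right tail
}

critical_value(0.92)  # same as qnorm(0.96)
critical_value(0.95)  # same as qnorm(0.975)
```

The result is positive by construction; the matching lower-tail quantile is simply its negative.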

Exercise 18: The new first interval is 1366 to 1562, which includes the true mean. This time 6 of the 50 intervals do not include the true mean, so 88% of the intervals capture it. Go figure. That is below the chosen 92% confidence level, but with only 50 intervals a gap of this size (44 captured versus an expected 46) is within ordinary sampling variation.

samp_mean = rep(NA, 50)
samp_sd = rep(NA, 50)
n = 60
for (i in 1:50) {
    samp = sample(population, n)  # obtain a sample of size n = 60 from the population
    samp_mean[i] = mean(samp)  # save sample mean in ith element of samp_mean
    samp_sd[i] = sd(samp)  # save sample sd in ith element of samp_sd
}

lower = samp_mean - qnorm(0.96) * samp_sd/sqrt(n)
upper = samp_mean + qnorm(0.96) * samp_sd/sqrt(n)
c(lower[1], upper[1])
## [1] 1366 1562
plot_ci(lower, upper, mean(population))

[Figure: plot_ci output showing the 50 intervals at 92% confidence against the population mean]