Lab 4 Questions - Xialing Walla

download.file("http://www.openintro.org/stat/data/ames.RData",destfile ="ames.RData")
load("ames.RData")
area<-ames$Gr.Liv.Area
price<-ames$SalePrice

Exercise 1 Describe this population distribution.

The population distribution is right skewed and unimodal. The right skew shows in the mean (1500) being larger than the median (1440): the long upper tail pulls the mean toward it. The range is about 5,300 (334 to 5640). With a finer histogram (breaks = 20) we can see some extreme outliers between 4,000 and 6,000.

summary(area)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    334    1130    1440    1500    1740    5640 
hist(area, breaks = 20 )

plot of chunk unnamed-chunk-3
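
One quick numeric check of the right skew described above is to compare the mean with the median, since a long upper tail typically pulls the mean above the median. The sketch below uses a simulated right-skewed vector, area_demo, as a stand-in so that it runs without the Ames file. The name area_demo and the log-normal parameters are assumptions chosen only for illustration; the same calls can be applied to area.

```r
# Hypothetical stand-in for `area`: a simulated right-skewed vector
# (log-normal parameters are illustrative, not fitted to the Ames data).
set.seed(1)
area_demo <- rlnorm(3000, meanlog = log(1440), sdlog = 0.33)

# A long right tail typically pulls the mean above the median.
skew_gap <- mean(area_demo) - median(area_demo)
skew_gap > 0

# Range = max - min; for the real data this is 5640 - 334.
diff(range(area_demo))
```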

Exercise 2 - Describe the distribution of this sample? How does it compare to the distribution of the population?

The distribution of the sample of 50 follows that of the population. This agrees with the basic properties of point estimates: the sample mean (1480) is close to the population mean (1500). However, the sample is less spread out: its range is about 2,650 (848 to 3500) versus about 5,300 for the population, and its outliers are not as extreme. Because this is a simple random sample, the results will differ from run to run unless set.seed is used to fix the random number generator.

set.seed(50)
samp1<-sample(area,50)
summary(samp1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    848    1090    1340    1480    1720    3500 
hist(samp1)

plot of chunk unnamed-chunk-5
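
The role of set.seed mentioned above can be illustrated with a small sketch. It assumes a made-up vector, pop_demo, in place of area so that it runs on its own; resetting the seed replays the same draw, while drawing again without a reset gives a new sample.

```r
# Hypothetical stand-in for `area` so the block runs on its own.
pop_demo <- seq(300, 5600, by = 10)

set.seed(50)
first_draw <- sample(pop_demo, 50)

set.seed(50)                        # resetting the seed replays the same draw
second_draw <- sample(pop_demo, 50)

third_draw <- sample(pop_demo, 50)  # no reset: a fresh, generally different sample

identical(first_draw, second_draw)
identical(first_draw, third_draw)
```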

Exercise 3 - Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

The mean of the second sample (1560) differs from the mean of samp1 (1480), as expected for two different random samples. As the sample size grows to 100 and 1000, the histogram looks more like that of the whole population. The standard errors also shrink (77.61, 51.3, and 15.66 for n = 50, 100, and 1000), which indicates that larger samples tend to give better estimates of the population mean, though it is not guaranteed that every large sample will give a better estimate than a particular small sample.

samp2<-sample(area,50)
summary(samp2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    845    1140    1530    1560    1850    3400 
sd(samp2)/(sqrt(50))
[1] 77.61
hist(samp2)

plot of chunk unnamed-chunk-7

set.seed(100)
samp3<-sample(area,100)
summary(samp3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    605    1130    1390    1510    1770    3110 
sd(samp3)/(sqrt(100))
[1] 51.3
hist(samp3)

plot of chunk unnamed-chunk-9

set.seed(1000)
samp4<-sample(area,1000)
summary(samp4)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    407    1130    1440    1510    1780    4320 
sd(samp4)/(sqrt(1000))
[1] 15.66
hist(samp4, breaks = 25)

plot of chunk unnamed-chunk-11
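
The standard errors above are all computed the same way, as the sample standard deviation divided by the square root of the sample size. A minimal sketch of that calculation as a reusable function follows; standard_error and pop_demo are names introduced here for illustration, and pop_demo is a simulated stand-in for area, so the printed values will not match the ones reported above.

```r
# Standard error of the mean: sample standard deviation over sqrt(n).
standard_error <- function(x) sd(x) / sqrt(length(x))

# Hypothetical right-skewed stand-in for `area` (illustrative parameters).
set.seed(2)
pop_demo <- rlnorm(3000, meanlog = log(1440), sdlog = 0.33)

# One sample at each of the sizes used in this exercise.
sizes <- c(50, 100, 1000)
se_by_size <- sapply(sizes, function(n) standard_error(sample(pop_demo, n)))
names(se_by_size) <- sizes
se_by_size
```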

Exercise 4 - How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

sample_means50 contains 5000 elements, each the mean of one sample of size 50. The sampling distribution is centered at about 1,500, which is the population mean. The sample mean is an unbiased estimator because the mean of its sampling distribution equals the true population mean. The histogram of the sample means is much narrower than the population distribution because each mean averages 50 observations; the spread depends on the sample size (50), not on the number of samples (5000). The larger the sample size, the smaller the spread, that is, the lower the variability of the sample mean. The shape of the histogram is close to a normal distribution. Collecting 50,000 sample means instead would not noticeably change the center or spread; the histogram would only look smoother.

sample_means50<-rep(0,5000)
for(i in 1:5000){
    samp<-sample(area,50)
    sample_means50[i]<-mean(samp)
}
hist(sample_means50, breaks = 25)

plot of chunk unnamed-chunk-13
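
Whether the number of sample means matters can be checked by simulation. The sketch below is one way to do it, assuming a simulated stand-in population (pop_demo) and a helper (sim_means) that are not part of the lab; it uses replicate() in place of the explicit loop and compares 5000 with 50,000 sample means of size 50.

```r
# Hypothetical right-skewed stand-in for `area` (illustrative parameters).
set.seed(3)
pop_demo <- rlnorm(3000, meanlog = log(1440), sdlog = 0.33)

# Draw `reps` samples of size n and return their means.
sim_means <- function(reps, n) replicate(reps, mean(sample(pop_demo, n)))

means_5k  <- sim_means(5000, 50)
means_50k <- sim_means(50000, 50)

# Compare center and spread of the two simulated sampling distributions,
# alongside the mean of the stand-in population.
c(center = mean(means_5k),  spread = sd(means_5k))
c(center = mean(means_50k), spread = sd(means_50k))
mean(pop_demo)
```

Under these assumptions the two centers and spreads should come out very close; the extra replicates mainly make the histogram smoother.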

Exercise 5 - To make sure you understand what you've done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?

There are 100 elements in sample_means_small. Each element is the mean of one random sample of size 50 drawn from area.

sample_means_small<-rep(0,100)
for(i in 1:100){
    samp<-sample(area, 50)
    sample_means_small[i]<-mean(samp)
}
sample_means_small
  [1] 1479 1511 1467 1523 1443 1541 1510 1585 1503 1474 1525 1472 1535 1550
 [15] 1588 1470 1458 1528 1591 1462 1565 1333 1462 1469 1605 1516 1526 1478
 [29] 1438 1546 1479 1413 1496 1744 1510 1503 1470 1547 1415 1408 1498 1529
 [43] 1485 1382 1575 1585 1442 1423 1414 1403 1474 1544 1447 1547 1488 1344
 [57] 1517 1574 1624 1538 1522 1452 1407 1584 1360 1568 1516 1571 1597 1465
 [71] 1456 1631 1511 1415 1348 1481 1470 1488 1545 1511 1548 1433 1399 1508
 [85] 1320 1486 1492 1588 1548 1445 1480 1505 1499 1577 1500 1473 1555 1464
 [99] 1499 1410
hist(sample_means_small)

plot of chunk unnamed-chunk-15

Exercise 6 When the sample size is larger, what happens to the center? What about the spread?

When the sample size is larger, the center of the sampling distribution stays at the population mean. The spread becomes smaller as the sample size increases, which indicates lower variability of the sample mean.
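
The effect of sample size on center and spread can be sketched with a short simulation. It assumes a simulated stand-in for area (pop_demo) and illustrative sample sizes of 10, 50, and 100 with 2000 replicates each; none of these choices come from the answers above.

```r
# Hypothetical right-skewed stand-in for `area` (illustrative parameters).
set.seed(4)
pop_demo <- rlnorm(3000, meanlog = log(1440), sdlog = 0.33)

# For each sample size, simulate 2000 sample means and summarize them.
sizes <- c(10, 50, 100)
results <- sapply(sizes, function(n) {
    means <- replicate(2000, mean(sample(pop_demo, n)))
    c(center = mean(means), spread = sd(means))
})
colnames(results) <- paste0("n=", sizes)

# Rows: center and spread of each simulated sampling distribution.
round(results, 1)
```

The center row should stay near mean(pop_demo) for every sample size, while the spread row shrinks as n grows.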

Lab 4 Data Analysis Questions - Xialing Walla

1. Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?

The best point estimate of the population mean is the sample mean, which is about 171,000 for this sample.

set.seed(50)
samp5<-sample(price,50)
summary(samp5)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  76500  112000  140000  171000  202000  584000 

2. Since you have access to the population, simulate the sampling distribution for the sample mean of price by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.

The sampling distribution of the mean is approximately normal, as expected for a sample size this large. Based on 5000 samples of size 50, I would guess the population mean home price to be about 180,000, and the spread of the sample means is small. The summary of price gives a population mean of 181,000, which is very close to that guess.

sample_means50<-rep(0,5000)
for(i in 1:5000){
    samp<-sample(price,50)
    sample_means50[i]<-mean(samp)
}
hist(sample_means50, breaks = 25)

plot of chunk unnamed-chunk-18

summary(price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  12800  130000  160000  181000  214000  755000 

3. Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

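The sampling distribution is again approximately normal and centered near the same value, but it is narrower than the one for samples of size 50. My guess would still be around 180,000.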

sample_means150<-rep(0,5000)
for(i in 1:5000){
    samp<-sample(price,150)
    sample_means150[i]<-mean(samp)
}
hist(sample_means150, breaks = 25)

plot of chunk unnamed-chunk-21

4. Of the sampling distributions from 2 and 3, which has a smaller spread? If we're concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

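The sampling distribution from question 3 (sample size 150) has the smaller spread. We would prefer a distribution with a small spread, because it means the sample means vary less from sample to sample and therefore fall close to the true value more often.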
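
The size of this difference can be anticipated from the usual standard error formula for a sample mean. In the sketch below, σ denotes the population standard deviation of price, which this report does not compute, and the small correction for sampling without replacement from a finite population is ignored.

```latex
\mathrm{SE}_{\bar{x}} = \frac{\sigma}{\sqrt{n}},
\qquad
\frac{\mathrm{SE}_{n=50}}{\mathrm{SE}_{n=150}}
  = \sqrt{\frac{150}{50}} = \sqrt{3} \approx 1.73
```

Under that approximation, tripling the sample size from 50 to 150 should shrink the spread of the sample means by a factor of about 1.73.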

5. What concepts from the textbook are covered in this lab? What concepts, if any, are not covered in the textbook? Have you seen these concepts elsewhere, e.g. lecture, discussion section, previous labs, or homework problems? Be specific in your answer.