download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
Making the variables
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
summary(area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
hist(area)

Exercise 1
Describe the population distribution
The population distribution is skewed right: most of the data sit on the left, and a few large outliers on the right stretch out the tail.
The unknown sampling distribution
Estimating the mean living area in Ames based on a sample
samp1 <- sample(area, 50)
## This draws one random sample of 50 observations from the population (without replacement) and stores it in samp1
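Because sample() draws at random, the numbers in a report like this change every time the code is rerun. Here is a minimal sketch of making a draw reproducible. It assumes a made-up stand-in for area (area_demo is hypothetical, not the Ames data), and the seed values are arbitrary.

```r
# Hypothetical stand-in for the area vector (made-up, right-skewed values)
set.seed(1)
area_demo <- round(rlnorm(3000, meanlog = 7.25, sdlog = 0.3))

# Setting the same seed before each draw gives the same sample
set.seed(42)
samp_a <- sample(area_demo, 50)
set.seed(42)
samp_b <- sample(area_demo, 50)

identical(samp_a, samp_b)  # TRUE: same seed, same 50 observations
length(samp_a)             # 50: one sample of 50 observations
```

Calling set.seed() once at the top of the report should be enough to keep the written answers and the printed output in agreement between runs.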
Exercise 2
Describe the distribution of this sample. How does it compare to the distribution of the population?
The sample covers a narrower range than the population (729 to 3140 versus 334 to 5642), but the two means are close (about 1495 versus 1500).
## To summarize samp1 we compute its mean, and a histogram helps to visualize it
summary(samp1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 729 1166 1418 1495 1704 3140
mean(samp1)
## [1] 1494.86
hist(samp1)

Exercise 3
Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?
Samples 1 and 2 are not very different from each other: both means are close to 1500 (about 1495 and 1477), and the ranges are fairly similar. A larger sample gives a more accurate estimate of the population mean, so the sample of size 1000 would provide the better estimate.
samp2 <- sample(area, 50)
summary(samp2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 691 1130 1436 1477 1726 2799
## Turning the hypothetical larger sample from the question into an actual sample of size 1000
samp3 <- sample(area, 1000)
summary(samp3)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1143 1468 1520 1776 4676
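A single sample of each size cannot show that bigger samples are more accurate, since any one draw can land close to the mean by luck. The sketch below repeats the draw many times and averages the absolute error of the sample mean for sizes 50, 100, and 1000. It assumes a made-up population (area_demo and the helper avg_error are hypothetical names, not part of the lab).

```r
set.seed(1)
area_demo <- rlnorm(3000, meanlog = 7.25, sdlog = 0.3)  # hypothetical population
pop_mean <- mean(area_demo)

# Average absolute error of the sample mean over `reps` repeated samples of size n
avg_error <- function(n, reps = 1000) {
  errors <- replicate(reps, abs(mean(sample(area_demo, n)) - pop_mean))
  mean(errors)
}

sizes <- c(50, 100, 1000)
errors_by_size <- sapply(sizes, avg_error)
names(errors_by_size) <- sizes
errors_by_size  # the typical error shrinks as the sample size grows
```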
## This chunk takes 5000 samples of size 50, stores the mean of each one in sample_means50, and then plots the means in a histogram
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
hist(sample_means50)

Exercise 4
How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?
There are 5000 elements in sample_means50. The sampling distribution is approximately normal, and its center is around 1500. Collecting 50,000 sample means instead would not change the shape, center, or spread much, because 5000 means are already enough to show the sampling distribution clearly; the histogram would only look smoother.
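One way to check the claim about 5000 versus 50,000 sample means is to simulate both and compare their center and spread. This is a rough sketch on a made-up population (area_demo and sim_means are hypothetical names), so the exact numbers will not match the Ames data.

```r
set.seed(1)
area_demo <- rlnorm(3000, meanlog = 7.25, sdlog = 0.3)  # hypothetical population

# Means of `reps` random samples, each of size 50
sim_means <- function(reps) replicate(reps, mean(sample(area_demo, 50)))

means_5k  <- sim_means(5000)
means_50k <- sim_means(50000)

# Center and spread of each simulated sampling distribution
c(mean_5k = mean(means_5k), mean_50k = mean(means_50k))
c(sd_5k = sd(means_5k), sd_50k = sd(means_50k))
```

The two centers and the two spreads should come out nearly equal, which is what "the distribution would not change much" means here.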
The for loop
Exercise 5
To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?
There are 100 elements in sample_means_small, and each element is the mean of one random sample of size 50 drawn from area.
## The for loop works much like it does in other languages: it runs the body once for each value of i in 1:100 and then stops.
## A smaller version of the loop above, to check understanding of how it works
sample_means_small <- rep(0, 100)
for (i in 1:100){
sampsmall <- sample(area, 50)
sample_means_small[i] <- mean(sampsmall)
}
sample_means_small
## [1] 1584.28 1607.44 1515.78 1505.56 1405.96 1452.22 1420.88 1388.64 1513.14
## [10] 1408.98 1506.54 1397.40 1476.14 1415.26 1595.50 1639.42 1603.98 1516.48
## [19] 1393.08 1490.68 1519.28 1411.52 1527.92 1654.52 1547.92 1606.32 1584.20
## [28] 1463.04 1449.40 1528.02 1539.56 1488.36 1408.94 1427.44 1520.78 1476.04
## [37] 1424.24 1485.94 1644.76 1368.58 1515.64 1469.38 1560.96 1433.28 1652.54
## [46] 1508.80 1568.00 1602.42 1551.72 1470.74 1503.58 1676.12 1500.54 1420.72
## [55] 1411.20 1371.72 1468.62 1592.36 1515.30 1535.80 1445.44 1450.80 1431.10
## [64] 1644.48 1528.90 1540.60 1499.04 1414.08 1527.16 1487.84 1505.66 1466.82
## [73] 1493.42 1542.32 1600.68 1301.22 1426.12 1406.64 1601.58 1415.72 1542.46
## [82] 1387.48 1683.50 1447.98 1563.44 1548.08 1571.14 1447.04 1491.12 1644.10
## [91] 1553.38 1498.84 1549.92 1380.80 1476.36 1552.80 1476.08 1530.30 1562.94
## [100] 1559.50
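The same 100 sample means can also be produced without writing the loop by hand. The sketch below compares the for loop with base R's replicate(), again on a made-up population (area_demo is hypothetical, and the seed is arbitrary).

```r
set.seed(1)
area_demo <- rlnorm(3000, meanlog = 7.25, sdlog = 0.3)  # hypothetical population

# Version 1: explicit for loop, filling a pre-allocated vector
set.seed(42)
loop_means <- rep(0, 100)
for (i in 1:100) {
  loop_means[i] <- mean(sample(area_demo, 50))
}

# Version 2: replicate() evaluates the expression 100 times and collects the results
set.seed(42)
rep_means <- replicate(100, mean(sample(area_demo, 50)))

length(rep_means)                 # 100, one mean per sample
all.equal(loop_means, rep_means)  # TRUE when both start from the same seed
```

The loop makes each step visible, which is the point of the exercise; replicate() is just a shorter way to write the same thing.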
Sample size and the sampling distribution
## Changing the sample size to see how that changes the sampling distribution
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 10)
sample_means10[i] <- mean(samp)
samp <- sample(area, 100)
sample_means100[i] <- mean(samp)
}
## This stacks the three histograms on the same x-axis limits so they are easier to compare.
par(mfrow = c(3, 1))
xlimits <- range(sample_means10)
hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

Exercise 6
When the sample size is larger, what happens to the center? What about the spread?
As the sample size gets larger, the center stays in about the same place while the spread becomes smaller. The peak of the histogram gets taller because more of the sample means land close to the center.
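The shrinking spread follows a standard rule of thumb: the standard deviation of the sample mean is roughly the population standard deviation divided by the square root of the sample size. The sketch below checks that rule by simulation on a made-up population (area_demo is hypothetical). The match is only approximate, partly because sample() draws without replacement from a finite population.

```r
set.seed(1)
area_demo <- rlnorm(3000, meanlog = 7.25, sdlog = 0.3)  # hypothetical population

sizes <- c(10, 50, 100)

# Observed spread: standard deviation of 5000 simulated sample means per size
observed_sd <- sapply(sizes, function(n) {
  sd(replicate(5000, mean(sample(area_demo, n))))
})

# Rule-of-thumb spread: population standard deviation divided by sqrt(n)
predicted_sd <- sd(area_demo) / sqrt(sizes)

data.frame(n = sizes, observed_sd = observed_sd, predicted_sd = predicted_sd)
```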
On my own
Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?
The best point estimate is the sample mean, which is $172,083 for the sample drawn below.
price_sample <- sample(price, 50)
mean(price_sample)
## [1] 172083
Since you have access to the population, simulate the sampling distribution for x̄_price by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.
Based on the histogram alone, the mean appears to be about $180,000, and the mean of the 5000 sample means computed below is about $180,773.
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(price, 50)
sample_means50[i] <- mean(samp)
}
hist(sample_means50)

mean(sample_means50)
## [1] 180772.7
## This code overwrites the sample_means50 vector created earlier (living-area means), so it now holds sale-price means
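The question also asks for the population mean itself, which is a separate calculation from the mean of the sample means. With the real data that would be mean(price). The sketch below shows the comparison on made-up prices (price_demo and sample_means_demo are hypothetical names), so its numbers are not the Ames results.

```r
set.seed(1)
price_demo <- rlnorm(3000, meanlog = 12, sdlog = 0.4)  # hypothetical prices

# 5000 sample means, each from a random sample of 50 prices
sample_means_demo <- replicate(5000, mean(sample(price_demo, 50)))

mean(price_demo)         # population mean, computed from every value
mean(sample_means_demo)  # mean of the 5000 sample means, which should land close to it
```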
Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?
The sampling distribution for samples of size 150 has a smaller range than the one for samples of size 50, so the sample means are closer to each other, and it has a higher peak in the center.
sample_means50 <- rep(NA, 5000)
sample_means150 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(price, 50)
sample_means50[i] <- mean(samp)
samp <- sample(price, 150)
sample_means150[i] <- mean(samp)
}
par(mfrow = c(2, 1))
xlimits <- range(sample_means50)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means150, breaks = 20, xlim = xlimits)
