download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
# Create two variables with short names that represent above ground living area of the house and the sales price
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
summary(area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
sd(area)
## [1] 505.5089
hist(area, breaks=50) #include more breaks to see shape better
#create density histogram
hist(area, probability=TRUE)
x <- 300:6000
y <- dnorm(x = x, mean = 1500, sd = 505)
lines(x = x, y, col = "blue")
#create Q-Q plot
qqnorm(area)
qqline(area)
The population distribution is right-skewed with a mean of 1500 and standard deviation of 505. There are some upper outliers.
# This command collects a simple random sample of size 50 from the vector area and assigns it to samp1
samp1 <- sample(area, 50)
mean(samp1)
## [1] 1525.68
sd(samp1)
## [1] 485.8597
hist(samp1, breaks=15)
# Create a density histogram showing normal curve
hist(samp1, probability=TRUE)
x <- 800:3000
y <- dnorm(x = x, mean = mean(samp1), sd = sd(samp1)) # use the sample's own mean and sd so the curve matches samp1
lines(x = x, y, col = "red")
qqnorm(samp1)
qqline(samp1)
The distribution of the sample looks mildly right-skewed, with its center near 1500. It is less skewed than the population distribution, but its center is about the same.
# If we’re interested in estimating the average living area in homes in Ames using the sample, our best single guess is the sample mean.
mean(samp1)
## [1] 1525.68
The sample mean (1526) is close to the population mean (1500).
# Take a second sample, also of size 50, and call it samp2.
samp2 <- sample(area, 50)
mean(samp2)
## [1] 1404.24
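The mean of samp2 (1404) is lower than the population mean (1500), whereas the mean of samp1 (1526) is slightly higher. The two differ because each is computed from a different random sample.
A sample of size 1000 would give a more accurate estimate of the population mean, because sample means vary less from sample to sample as the sample size grows.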
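Because sample() draws at random, the sample means reported here change every time the code is re-run. A minimal sketch of one way to make them repeatable, assuming a fixed seed is acceptable for the report. The seed value 2024 and the simulated vector pop (a stand-in for area so the snippet runs without the download) are illustrative assumptions, not part of the original lab.

```r
# Hypothetical stand-in for `area`: 3000 simulated, right-skewed values.
set.seed(1)
pop <- round(rlnorm(3000, meanlog = 7.25, sdlog = 0.33))

# Fixing the seed immediately before sampling makes the draw repeatable.
set.seed(2024)                 # arbitrary seed, chosen only for illustration
samp_a <- sample(pop, 50)

set.seed(2024)                 # same seed again
samp_b <- sample(pop, 50)

identical(samp_a, samp_b)      # TRUE: the same 50 values are drawn both times
mean(samp_a)                   # therefore the same mean on every run
```

With a seed set once at the top of the script, the numbers quoted in the written answers stay in step with the printed output.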
# Generate 5000 samples and compute the sample mean of each
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
hist(sample_means50)
# Here we use R to take 5000 samples of size 50 from the population, calculate the mean of each sample, and store each result in a vector called sample_means50.
# Adjust the bin width of your histogram to show a little more detail
hist(sample_means50, breaks = 25)
There are 5000 elements in sample_means50.
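The sampling distribution looks approximately normal, with its center (mean) near 1500.
No. Collecting a larger number of sample means (50,000) would not noticeably change the center, spread, or shape of the sampling distribution; the histogram would only look smoother.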
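The claim about 50,000 sample means can be checked by simulation. The sketch below assumes a simulated right-skewed vector pop in place of area, and the helper name sim_means is hypothetical; it compares the center and spread of 5000 and 50,000 sample means of size 50.

```r
# Hypothetical stand-in for `area`: 3000 simulated, right-skewed values.
set.seed(1)
pop <- round(rlnorm(3000, meanlog = 7.25, sdlog = 0.33))

# Illustrative helper: draw `reps` samples of size `n` and return their means.
sim_means <- function(reps, n) {
  replicate(reps, mean(sample(pop, n)))
}

means_5k  <- sim_means(5000, 50)
means_50k <- sim_means(50000, 50)

# Center and spread of the two simulated sampling distributions.
c(mean_5k = mean(means_5k), mean_50k = mean(means_50k))
c(sd_5k = sd(means_5k), sd_50k = sd(means_50k))
```

The two means and the two standard deviations should come out close to each other; more repetitions give a smoother picture of the same distribution rather than a different one.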
# With the for loop, these thousands of lines of code are compressed into a handful of lines. We’ve added one extra line to the code below, which prints the variable i during each iteration of the for loop. Run this code.
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
# print(i) - OMITTED TO SAVE SPACE IN OUTPUT
}
# Initialize a vector of 100 zeros called sample_means_small.
sample_means_small <- rep(0, 100)
# Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100.
for(i in 1:100){
samp <- sample(area, 50)
sample_means_small[i] <- mean(samp)
}
# Print the output to your screen (type sample_means_small into the console and press enter)
sample_means_small has 100 elements. Each element is the mean of one random sample of 50 homes drawn from area.
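The same 100 sample means can be produced without an explicit loop or a pre-allocated vector. This is a sketch of an alternative using base R's replicate(), not the approach the lab asks for; pop is a simulated stand-in for area, and the name sample_means_small_alt is chosen here for illustration.

```r
# Hypothetical stand-in for `area`: 3000 simulated, right-skewed values.
set.seed(1)
pop <- round(rlnorm(3000, meanlog = 7.25, sdlog = 0.33))

# replicate() evaluates the expression 100 times and collects the results
# into a numeric vector, so no index variable is needed.
sample_means_small_alt <- replicate(100, mean(sample(pop, 50)))

length(sample_means_small_alt)   # 100 elements, one per sample
head(sample_means_small_alt)     # first few sample means
```

The for loop makes each step visible, which is useful when learning; replicate() is shorter once the pattern is familiar.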
# To get a sense of the effect that sample size has on our distribution, let’s build up two more sampling distributions: one based on a sample size of 10 and another based on a sample size of 100.
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 10)
sample_means10[i] <- mean(samp)
samp <- sample(area, 100)
sample_means100[i] <- mean(samp)
}
# To see the effect that different sample sizes have on the sampling distribution, plot the three distributions on top of one another.
par(mfrow = c(3, 1))
xlimits <- range(sample_means10)
hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)
The first command specifies that you’d like to divide the plotting area into 3 rows and 1 column of plots (to return to the default setting of plotting one at a time, use par(mfrow = c(1, 1))). The breaks argument specifies the number of bins used in constructing the histogram. The xlim argument specifies the range of the x-axis of the histogram, and by setting it equal to xlimits for each histogram, we ensure that all three histograms will be plotted with the same limits on the x-axis.
As the sample size increases, the center of the sampling distribution stays the same but its spread shrinks. A smaller standard error (the standard deviation of the sampling distribution) means the sample mean is a more precise estimate of the population mean.
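The shrinking spread can be quantified by comparing the standard deviation of each simulated sampling distribution with the usual approximation sd(population) / sqrt(n). The sketch below assumes a simulated vector pop in place of area and ignores the finite-population correction, so the two columns should agree only approximately.

```r
# Hypothetical stand-in for `area`: 3000 simulated, right-skewed values.
set.seed(1)
pop <- round(rlnorm(3000, meanlog = 7.25, sdlog = 0.33))

sizes <- c(10, 50, 100)

# Empirical standard error: sd of 5000 simulated sample means for each size.
empirical_se <- sapply(sizes, function(n) {
  sd(replicate(5000, mean(sample(pop, n))))
})

# Approximate theoretical standard error, sigma / sqrt(n).
# Assumption: the finite-population correction is ignored.
theoretical_se <- sd(pop) / sqrt(sizes)

data.frame(n = sizes, empirical_se, theoretical_se)
```

The empirical column should fall as n grows, roughly in proportion to 1 / sqrt(n), which is why the histogram for samples of size 100 is the narrowest of the three.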