Load library

Import data

setwd("C:/Users/cassandra/Documents/R/win-library/3.4/DATA606/labs/Lab4a/more")
ames <- read.csv("ames.csv")
rooms <- ames$TotRms.AbvGrd
area <- ames$Gr.Liv.Area
price <- ames$SalePrice

hist(area)

Exercise 1. Describe this population distribution.

The population distribution is right-skewed. There are some outliers above 3,000 square feet.
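The skew can also be checked numerically. The sketch below is only illustrative: `area_demo` is a hypothetical, simulated stand-in for `ames$Gr.Liv.Area` (a lognormal draw with made-up parameters) so that it runs without `ames.csv`. With the real data, `area` would be used instead.

```r
# Hypothetical stand-in for ames$Gr.Liv.Area; replace with `area` when ames.csv is loaded.
set.seed(1)
area_demo <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# In a right-skewed distribution the long upper tail pulls the mean above the median.
area_mean <- mean(area_demo)
area_median <- median(area_demo)
area_mean > area_median

# Share of observations above 3000, the region where the histogram shows outliers.
share_above_3000 <- mean(area_demo > 3000)
share_above_3000
```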

Exercise 2. Describe the distribution of this sample. How does it compare to the distribution of the population?

samp1 <- sample(area, 50)

# Plot Sample 1 of area
hist(samp1)

The sample mean is almost equal to the population mean. The distribution of the sample of 50 is more symmetric than the population distribution.
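The comparison of the two means can be made explicit by printing them side by side. This is a minimal sketch on simulated data: `area_demo` and `samp1_demo` are hypothetical stand-ins for `area` and `samp1`, so the numbers will not match the lab's.

```r
# Hypothetical stand-in population; replace with `area` when ames.csv is loaded.
set.seed(2)
area_demo <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# One random sample of size 50, drawn without replacement (the default for sample()).
samp1_demo <- sample(area_demo, 50)

# Sample mean next to the population mean.
c(sample = mean(samp1_demo), population = mean(area_demo))

# Gap between the two means as a fraction of the population mean.
rel_gap <- abs(mean(samp1_demo) - mean(area_demo)) / mean(area_demo)
rel_gap
```

Because the sample is random, the gap changes from run to run unless the seed is fixed.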

Exercise 3. Take a second sample, also of size 50, and call it samp2.

How does the mean of samp2 compare with the mean of samp1?

Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

# Draw the larger samples: size 100 (samp2) and size 1000 (samp3)
samp2 <- sample(area, 100)
samp3 <- sample(area, 1000)
# Compare the two sample means
mean(samp2)
## [1] 1489.63
mean(samp3)
## [1] 1481.923

Increasing the sample size reduces the standard error, so the sample of size 1000 should provide a more accurate estimate of the population mean.
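The effect of sample size on the standard error can be sketched with the usual approximation $SE = \sigma / \sqrt{n}$. The code below assumes a simulated stand-in population (`area_demo`, hypothetical) and ignores the finite population correction, which would shrink the standard error further when sampling without replacement from a population of this size.

```r
# Hypothetical stand-in population; replace with `area` when ames.csv is loaded.
set.seed(3)
area_demo <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Approximate standard error of the sample mean, sigma / sqrt(n),
# for the three sample sizes discussed in the exercise.
sigma <- sd(area_demo)
n <- c(50, 100, 1000)
se <- sigma / sqrt(n)

data.frame(n = n, standard_error = se)
```

The standard error falls as `n` grows, which is why the larger sample is expected to land closer to the population mean on average.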

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}

hist(sample_means50)

Exercise 4. How many elements are there in sample_means50?

Describe the sampling distribution, and be sure to specifically note its center.

Would you expect the distribution to change if we instead collected 50,000 sample means?

There are 5000 elements in sample_means50.
The center of the sampling distribution is about 1,500. Taking means of larger samples reduces the spread of the sampling distribution.
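One way to explore the question about 50,000 sample means is to simulate both cases and compare their center and spread. This is a sketch on a hypothetical stand-in population (`area_demo`), using `replicate()` in place of the explicit loop. The number of sample means affects how smooth the histogram looks, while the center and spread are governed by the sample size of 50, so the two rows would be expected to be close.

```r
# Hypothetical stand-in population; replace with `area` when ames.csv is loaded.
set.seed(4)
area_demo <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Same sample size (50) in both cases; only the number of sample means differs.
means_5k <- replicate(5000, mean(sample(area_demo, 50)))
means_50k <- replicate(50000, mean(sample(area_demo, 50)))

# Center and spread of each simulated sampling distribution.
rbind(
  means_5000 = c(center = mean(means_5k), spread = sd(means_5k)),
  means_50000 = c(center = mean(means_50k), spread = sd(means_50k))
)
```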

Exercise 5. To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100.

There are 100 elements in sample_means_small. Each element is the mean of one random sample of size 50 drawn from area.
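The smaller loop described in this exercise could look like the following. It is a sketch that assumes a simulated stand-in population, `area_demo` (hypothetical), so that it runs without `ames.csv`. With the lab data, `area` would be sampled instead.

```r
# Hypothetical stand-in population; replace with `area` when ames.csv is loaded.
set.seed(5)
area_demo <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Initialize a vector of 100 zeros to hold the sample means.
sample_means_small <- rep(0, 100)

# Each pass draws a new sample of size 50 and stores its mean in slot i.
for (i in 1:100) {
  samp <- sample(area_demo, 50)
  sample_means_small[i] <- mean(samp)
}

# Number of stored sample means.
length(sample_means_small)
```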

1. Take a random sample of size 50 from price.

price_sample <- sample(price, 50)

#Find sample mean of price

mean(price_sample)
## [1] 172862.7

2. Since you have access to the population, simulate the sampling distribution for $\bar{x}_{price}$ by taking 5000 samples of size 50 from the population and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}

hist(sample_means50)

The sample mean is almost equal to the population mean.

The sampling distribution for samples of size 50 is more symmetric than the population distribution.

Based on the sampling distribution, I would guess the mean home price of the population to be about $180,000.

mean(price)

The population mean, computed from all data elements, is $180,796.
The sample mean and the population mean are almost the same.

3. Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

sample_means150 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}

hist(sample_means150)

The sampling distribution for samples of size 50 is more spread out toward both tails.
The sampling distribution for samples of size 150 is more concentrated around its center.
The shape for a sample size of 50 appears more symmetric.
I would guess the mean sale price of homes in Ames to be about $180,000. Both sampling distributions, for sample sizes 50 and 150, are centered near $180,000.

4. Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

The spread is smaller for the sampling distribution from 3 (sample size 150).
The larger the sample size, the smaller the spread and the lower the variability of the sample mean.
We would prefer a distribution with a small spread, so a larger sample size is preferable.
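The difference in spread can be quantified with the standard deviation of each set of sample means. The sketch below uses a hypothetical helper, `simulate_means()`, and a simulated stand-in for `price` (`price_demo`, a lognormal draw with made-up parameters), so the values are illustrative only. If the standard error scales roughly as $1/\sqrt{n}$, the ratio of the two spreads should be in the neighborhood of $\sqrt{3}$.

```r
# Hypothetical stand-in for ames$SalePrice; replace with `price` when ames.csv is loaded.
set.seed(6)
price_demo <- rlnorm(2930, meanlog = 12.02, sdlog = 0.41)

# Hypothetical helper: draw `reps` samples of a given size and return their means.
simulate_means <- function(population, size, reps = 5000) {
  replicate(reps, mean(sample(population, size)))
}

# Spread (standard deviation) of each simulated sampling distribution.
spread_50 <- sd(simulate_means(price_demo, 50))
spread_150 <- sd(simulate_means(price_demo, 150))
spread_ratio <- spread_50 / spread_150

c(size_50 = spread_50, size_150 = spread_150, ratio = spread_ratio)
```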