DATA606_Week7_Lab4a

load("more/ames.RData")

Describe this population distribution.

area <- ames$Gr.Liv.Area
price <- ames$SalePrice
hist(area)

The population distribution appears to be right-skewed and unimodal.

Describe the distribution of this sample. How does it compare to the distribution of the population?

samp1 <- sample(area, 50)
hist(samp1)

summary(samp1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     540    1302    1500    1498    1679    4476

Because this is a random sample, its distribution will vary from draw to draw. With only 50 observations, the sample gives a rougher picture than the population, but its mean (1498 here) should move closer to the population mean as the sample size increases.

Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

sample_means50 <- rep(NA, 5000)

for (i in 1:5000) {
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}

hist(sample_means50)

hist(sample_means50, breaks = 25)

summary(sample_means50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1220    1450    1498    1498    1544    1862

A larger sample gives an estimate that tends to be closer to the population mean, so the sample of size 1000 would provide the most accurate estimate.
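A minimal sketch of the comparison this question asks for. It assumes the lab's `area` vector is loaded; if it is not, the block falls back to a simulated right-skewed stand-in, which is an assumption and not the Ames data. The names `samp1_demo`, `samp100`, `samp1000`, `errors`, and `std_errors` are hypothetical.

```r
# Use the lab's `area` vector if it is loaded; otherwise fall back to a
# simulated right-skewed stand-in (an assumption, not the Ames data).
if (!exists("area", mode = "numeric")) {
  set.seed(606)
  area <- round(rlnorm(2930, meanlog = log(1450), sdlog = 0.33))
}

set.seed(1)  # fixed seed so the draws are reproducible
samp1_demo <- sample(area, 50)
samp2      <- sample(area, 50)
samp100    <- sample(area, 100)
samp1000   <- sample(area, 1000)

# Absolute error of each sample mean relative to the population mean
pop_mean <- mean(area)
errors <- c(
  n50_a = abs(mean(samp1_demo) - pop_mean),
  n50_b = abs(mean(samp2) - pop_mean),
  n100  = abs(mean(samp100) - pop_mean),
  n1000 = abs(mean(samp1000) - pop_mean)
)
print(round(errors, 1))

# Approximate standard error of the sample mean for each sample size
std_errors <- sd(area) / sqrt(c(n50 = 50, n100 = 100, n1000 = 1000))
print(round(std_errors, 1))
```

A single draw of size 1000 is not guaranteed to land closer than a single draw of size 50, but its standard error is smaller, so it is closer on average.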

How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

sample_means50 <- rep(NA, 5000)

samp <- sample(area, 50)
sample_means50[1] <- mean(samp)

samp <- sample(area, 50)
sample_means50[2] <- mean(samp)

samp <- sample(area, 50)
sample_means50[3] <- mean(samp)

samp <- sample(area, 50)
sample_means50[4] <- mean(samp)

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
  # print(i)
}

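sample_means50 has 5000 elements, one mean per sample. The sampling distribution is close to normal, and its center (about 1498) is almost the same as the population mean. Collecting 50,000 sample means instead would give a smoother histogram, but the center and spread should stay essentially the same, because they depend on the sample size (50), not on the number of samples.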
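One way to check the effect of collecting more sample means is to simulate both cases and compare center and spread. This is a sketch under assumptions: `sim_means`, `m5k`, and `m50k` are hypothetical names, and the fallback data is simulated rather than the Ames data.

```r
# Use the lab's `area` vector if it is loaded; otherwise fall back to a
# simulated right-skewed stand-in (an assumption, not the Ames data).
if (!exists("area", mode = "numeric")) {
  set.seed(606)
  area <- round(rlnorm(2930, meanlog = log(1450), sdlog = 0.33))
}

# Hypothetical helper: draw `reps` samples of size `n` and return their means
sim_means <- function(x, n, reps) {
  replicate(reps, mean(sample(x, n)))
}

set.seed(2)
m5k  <- sim_means(area, 50, 5000)
m50k <- sim_means(area, 50, 50000)

# Center and spread for each number of sample means
rbind(
  reps_5000  = c(center = mean(m5k),  spread = sd(m5k)),
  reps_50000 = c(center = mean(m50k), spread = sd(m50k))
)
```

Both rows should show nearly the same center and spread; only the smoothness of the histogram changes with the number of sample means.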

To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?

hist(sample_means50)

sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}

par(mfrow = c(3, 1))

xlimits <- range(sample_means10)

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

sample_means_small has 100 elements, and each element is the mean of a random sample of 50 values from area.
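The smaller loop described in the question could look like the sketch below. It assumes `area` is loaded and otherwise uses a simulated stand-in, which is an assumption and not the Ames data.

```r
# Use the lab's `area` vector if it is loaded; otherwise fall back to a
# simulated right-skewed stand-in (an assumption, not the Ames data).
if (!exists("area", mode = "numeric")) {
  set.seed(606)
  area <- round(rlnorm(2930, meanlog = log(1450), sdlog = 0.33))
}

set.seed(3)
sample_means_small <- rep(0, 100)  # vector of 100 zeros

for (i in 1:100) {
  samp <- sample(area, 50)             # one random sample of size 50
  sample_means_small[i] <- mean(samp)  # store that sample's mean
}

print(sample_means_small)
length(sample_means_small)
```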

When the sample size is larger, what happens to the center? What about the spread?

As the sample size gets larger, the center of the sampling distribution stays close to the population mean, and the spread gets narrower.
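The histograms can be backed up with numbers. The sketch below summarizes center and spread for sample sizes 10, 50, and 100 and compares the spread with the approximate standard error sd(area) / sqrt(n). The names `sizes`, `means_by_n`, and `result` are hypothetical, and the fallback data is a simulated assumption.

```r
# Use the lab's `area` vector if it is loaded; otherwise fall back to a
# simulated right-skewed stand-in (an assumption, not the Ames data).
if (!exists("area", mode = "numeric")) {
  set.seed(606)
  area <- round(rlnorm(2930, meanlog = log(1450), sdlog = 0.33))
}

set.seed(4)
sizes <- c(10, 50, 100)

# 5000 sample means for each sample size
means_by_n <- lapply(sizes, function(n) replicate(5000, mean(sample(area, n))))

result <- data.frame(
  n      = sizes,
  center = sapply(means_by_n, mean),  # mean of the sample means
  spread = sapply(means_by_n, sd),    # sd of the sample means
  theory = sd(area) / sqrt(sizes)     # approximate standard error
)
print(result)
```

The `theory` column ignores the finite-population correction that applies when sampling without replacement, so it is only an approximation.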

On your own

Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?

price1 <- sample(price,50)
summary(price1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   84500  130125  163900  187648  221500  377426

The best point estimate of the population mean is the sample mean, 187648 (roughly $190,000).

Since you have access to the population, simulate the sampling distribution for $\bar{x}_{price}$ by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.

samp1e <- sample(price, 50)
sample_means50 <- rep(NA, 5000)

for (i in 1:5000) {
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}

hist(sample_means50)

summary(sample_means50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  148828  173015  180147  180801  187958  227868

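The sampling distribution is approximately normal and centered near 180801, the mean of the 5000 sample means, so a reasonable guess for the mean home price of the population is about $180,800.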
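The last part of the question asks for the population mean itself. A minimal sketch, assuming the lab's `price` vector is loaded (otherwise a simulated stand-in is used, which is an assumption and not the Ames data); `pop_mean_price` and `means50_demo` are hypothetical names.

```r
# Use the lab's `price` vector if it is loaded; otherwise fall back to a
# simulated right-skewed stand-in (an assumption, not the Ames data).
if (!exists("price", mode = "numeric")) {
  set.seed(606)
  price <- round(rlnorm(2930, meanlog = log(165000), sdlog = 0.4))
}

# Population mean, computed directly from every sale price
pop_mean_price <- mean(price)

# Mean of a simulated sampling distribution (5000 samples of size 50)
set.seed(5)
means50_demo <- replicate(5000, mean(sample(price, 50)))

c(population = pop_mean_price, sampling_dist = mean(means50_demo))
```

The two values should agree closely, since the sampling distribution of the mean is centered on the population mean.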

Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

samp1e <- sample(price, 150)
sample_means150 <- rep(NA, 5000)

for (i in 1:5000) {
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}

hist(sample_means150)

summary(sample_means150)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  160502  176257  180628  180659  184893  206507

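The sampling distribution is again approximately normal, but it is narrower than the one for samples of size 50: the sample means range from 160502 to 206507 rather than from 148828 to 227868. Based on this distribution, the mean sale price of homes in Ames is about $180,700 (the mean of the 5000 sample means is 180659).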

Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

The sampling distribution from question 3 (sample size 150) has the smaller spread. To make estimates that are more often close to the true value, we would prefer a distribution with a small spread.
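A rough arithmetic check of this comparison, assuming the usual standard error formula for a sample mean (it ignores the small finite-population correction from sampling without replacement) and using the quartiles reported in the two summaries above:

$$
SE_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
\quad\Rightarrow\quad
\frac{SE_{50}}{SE_{150}} = \sqrt{\frac{150}{50}} = \sqrt{3} \approx 1.73
$$

$$
\frac{IQR_{50}}{IQR_{150}} = \frac{187958 - 173015}{184893 - 176257} = \frac{14943}{8636} \approx 1.73
$$

For approximately normal distributions the interquartile range is proportional to the standard deviation, so the simulated spread shrinks by about the factor the formula predicts.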

DATA606_Week7_Lab4a_Assignment

Mohamed Thasleem Kalikul Zaman

March 14, 2019
