We begin with descriptive statistics for our population and histograms of both the population and our first sample.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
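The summary above and the histograms it refers to can be produced with base R. This is a minimal sketch, assuming population is the vector of house areas (in ft²) and samp is the 60-house sample used throughout this write-up:

summary(population) # descriptive statistics for the full population
hist(population, main = "Population of house areas", xlab = "area (ft^2)")
hist(samp, main = "Sample of 60 house areas", xlab = "area (ft^2)")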


The mean of the full population is 1499.6904437 ft², while the mean of our sample of 60 houses is 1550.7666667 ft².
head(population, 30) # first 30 house areas in the population
## [1] 1656 896 1329 2110 1629 1604 1338 1280 1616 1804 1655 1187 1465 1341
## [15] 1502 3279 1752 1856 864 2073 1844 1173 1674 1004 1078 1056 882 864
## [29] 1337 987
samp
## [1] 1344 998 2229 1665 1492 2494 1660 2158 1136 1512 854 1721 1105 1668
## [15] 1436 1540 1633 1044 1121 893 1725 1364 1326 1538 1212 1430 1656 1822
## [29] 2110 1370 1983 1646 1296 2318 1612 1632 1002 2450 630 925 1935 1203
## [43] 1127 912 1714 1607 2044 1808 2698 1614 1920 1301 1361 2504 2520 960
## [57] 1363 875 1382 1448
sample_mean <- mean(samp) # mean of the 60-house sample
Our sample is much smaller than the population and its distribution appears more concentrated around the center. The population is more strongly right-skewed, with a large number of values toward the left and some large houses more than twice the median size. A typical house size is roughly 1500-1700 ft²; by typical we mean that a fair number of values fall in this range and the histogram's mode encompasses it.
I would not expect another student's sample distribution to look the same as mine; my own kept changing until I set a random seed. A sample of 60 is not large enough to give a stable picture of the distribution.
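As a sketch of how the sample can be drawn reproducibly (the seed value here is arbitrary and will not recreate the exact sample printed above):

set.seed(1234) # arbitrary seed chosen for illustration, so the draw is repeatable
samp <- sample(population, 60) # simple random sample of 60 house areas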
se <- sd(samp) / sqrt(60) # standard error of the sample mean
lower <- sample_mean - 1.96 * se # lower bound of the 95% confidence interval
upper <- sample_mean + 1.96 * se # upper bound of the 95% confidence interval
c(lower, upper)
## [1] 1430.844 1670.690
For the confidence interval to be valid, the sampling distribution of the sample mean must be approximately normal; when that holds, sample means are centered on the population mean. This requires that the data not be strongly skewed, that the sample be drawn at random, that it make up less than 10% of the population, and that it not be too small.
A confidence interval tells us the range within which we expect the true population mean to lie; we are 95% confident that the mean falls between these values.
Our confidence interval does capture the mean of the population.
Across repeated samples, we would expect intervals constructed this way to capture the true mean 95% of the time, which is another way of stating what a confidence interval means.
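We can check this in R rather than reading it off the numbers; a one-line verification using the bounds computed above:

mean(population) >= lower & mean(population) <= upper # TRUE if the interval captures the population mean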
samp_mean <- rep(NA, 50) # storage for the 50 sample means
samp_sd <- rep(NA, 50) # storage for the 50 sample standard deviations
n <- 60
for (i in 1:50) {
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd
}
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
c(lower_vector[1], upper_vector[1])
## [1] 1293.121 1602.679
plot_ci(lower_vector, upper_vector, mean(population))

In our previous analysis, we used 95% confidence intervals, which correspond to a critical value of 1.96 for a two-sided interval. In the next analysis, we will use two-sided 90% confidence intervals, which correspond to a critical value of 1.645.
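As a quick check (not part of the original lab code), qnorm() returns these critical values from the standard normal distribution:

qnorm(0.975) # two-sided 95% critical value, about 1.96
qnorm(0.95) # two-sided 90% critical value, about 1.645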
lower_vector <- samp_mean - 1.645 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.645 * samp_sd / sqrt(n)
c(lower_vector[1], upper_vector[1])
## [1] 1317.996 1577.804
plot_ci(lower_vector, upper_vector, mean(population))

Based on the 90% confidence intervals, 9 of our 50 samples produced intervals that did not capture the population mean, compared with the 5 we would expect on average. This is more misses than expected, but the coverage of any particular batch of intervals will vary; we cannot guarantee that exactly 90% of our intervals will contain the true mean.
We tried sampling again with a new seed and found 7 intervals missing the mean (at 95%) and 8 (at 90%). Re-running the analysis like this can be dangerous, because it opens the door to hunting for a preferred result, but in this case it simply illustrates the sampling variability in how many intervals miss the true mean.
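Rather than counting misses off the plot, they can be computed directly. A short sketch, assuming lower_vector and upper_vector hold the interval bounds from above:

misses <- sum(lower_vector > mean(population) | upper_vector < mean(population)) # intervals that do not contain the true mean
misses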