##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

This distribution consists of 2930 observations of area for dwellings in Ames, IA. Our data represent all dwellings in Ames, not a sample. The distribution appears to be right skewed, which is common when values are bounded below at 0 but include a number of large observations. Our empirical distribution has mean = 1499.69 and sd = 505.51.
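For reference, a minimal sketch of how these figures could be reproduced, assuming the data live in an ames data frame with an area column named Gr.Liv.Area (both names are assumptions):

# Assumed setup: 'ames' holds the 2930 Ames, IA dwellings and
# Gr.Liv.Area is the living-area column (both names are assumptions).
area <- ames$Gr.Liv.Area
summary(area)
mean(area)   # about 1499.69
sd(area)     # about 505.51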

Is it possible that our distribution is lognormal? Here is a histogram of the log of each data point:

Area_data_logged <- log(area)   # natural log of each dwelling's area
hist(Area_data_logged, breaks=45)
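One quick way to judge that, as a rough sketch, is a normal quantile plot of the logged values; if the points track the reference line, a lognormal model for area is plausible:

# If area were lognormal, its log would be approximately normal.
qqnorm(Area_data_logged)
qqline(Area_data_logged)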

samp1 <- sample(area, 50)   # a random sample of 50 dwelling areas
hist(samp1, breaks=45)

hist(samp1,breaks=30)

Our sample of 50 values is a subset of our greater population. It is not nearly normal, and it doesn't have an easily discernible shape.

The mean of our sample of 50 is 1391.64
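That number is just the sample mean of samp1; as a quick sketch:

mean(samp1)   # 1391.64 for the draw used here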

# Empty data frame with one column for the mean of each sample size (50, 100, 1000)
samples.area <- data.frame(samp2=double(),
                 samp3=double(),
                 samp4=double(),
                 stringsAsFactors=FALSE)
# Store the mean of a single random sample of each size
samples.area[1,1] <- mean(sample(area, 50))
samples.area[1,2] <- mean(sample(area, 100))
samples.area[1,3] <- mean(sample(area, 1000))
mean(samples.area[,1])
## [1] 1588.42
mean(samples.area[,2])
## [1] 1454.01
mean(samples.area[,3])
## [1] 1481.385

Our mean from draws of size 1000 was much closer to our population mean, and the sample of size 100 did somewhat better than the sample of size 50. As we increase n, a single sample mean tends to land closer to the population mean, because the standard error of the mean shrinks as the sample size grows.
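A small sketch of that idea: repeat each sample size many times (the 1000 repetitions here are an arbitrary choice) and compare the spread of the resulting means with the theoretical standard errors.

# The spread of the sample mean shrinks roughly like sd(area)/sqrt(n)
reps <- 1000
sd(replicate(reps, mean(sample(area, 50))))
sd(replicate(reps, mean(sample(area, 100))))
sd(replicate(reps, mean(sample(area, 1000))))
sd(area) / sqrt(c(50, 100, 1000))   # theoretical standard errors for comparison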

price_sample_1<-sample(price, 50)
mean(price)
## [1] 180796.1
mean(price_sample_1)
## [1] 181771.8
# 5000 repetitions: each row holds the mean of a fresh random sample of 50 areas
sample_means50 <- data.frame(sample.of.area=double(),
                 stringsAsFactors=FALSE)
for(i in 1:5000){sample_means50[i,1] <- mean(sample(area, 50))}

mean(sample_means50[,1])
## [1] 1500.675

sample_means50 has 5000 elements; each is the mean of a random sample of 50 dwellings from our population. (In machine learning, bagging and random forest algorithms similarly rely on repeated random samples of the data, though with different methods and goals.) The mean of sample_means50 is 1500.6748, which is very close to our population mean. With 50000 samples, we got 1500.127. The simulated mean approaches the population mean asymptotically, but each further reduction in simulation error costs considerably more computing time.
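A hedged sketch of that trade-off, using replicate() rather than growing a data frame row by row; the repetition counts mirror the ones discussed above, and the means_5k and means_50k names are just illustrative.

# Simulation error shrinks like 1/sqrt(repetitions), but run time grows linearly.
system.time(means_5k  <- replicate(5000,  mean(sample(area, 50))))
system.time(means_50k <- replicate(50000, mean(sample(area, 50))))
mean(means_5k)
mean(means_50k)   # typically closer to mean(area), at roughly ten times the computing time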

sample_means_small <- rep(0, 100)
for(i in 1:100){
   samp <- sample(area, 50)
   sample_means_small[i] <- mean(samp)
   }
sample_means_small
##   [1] 1463.74 1563.74 1476.20 1444.88 1379.22 1478.54 1565.34 1454.30
##   [9] 1506.14 1474.42 1673.80 1531.26 1442.32 1612.38 1495.30 1488.04
##  [17] 1445.74 1458.46 1506.52 1571.18 1537.08 1369.96 1456.80 1570.04
##  [25] 1491.86 1627.84 1512.10 1426.10 1370.20 1532.70 1516.52 1441.26
##  [33] 1478.90 1430.26 1604.92 1375.16 1492.16 1502.74 1542.94 1462.54
##  [41] 1476.40 1532.60 1430.16 1564.82 1630.74 1488.12 1562.70 1475.98
##  [49] 1500.74 1464.56 1549.50 1370.62 1532.14 1429.46 1563.70 1511.58
##  [57] 1412.24 1569.58 1567.16 1517.24 1563.78 1445.22 1552.22 1578.42
##  [65] 1687.68 1543.10 1532.48 1569.12 1439.02 1474.78 1450.58 1538.36
##  [73] 1509.38 1653.44 1526.22 1359.16 1545.66 1586.44 1466.82 1517.52
##  [81] 1462.06 1399.26 1552.04 1475.02 1542.28 1509.28 1373.86 1442.90
##  [89] 1473.28 1509.56 1480.06 1523.04 1551.18 1539.54 1546.40 1503.92
##  [97] 1527.24 1544.48 1551.38 1571.86
hist(sample_means_small)

sample_means_small has 100 elements; each is the mean of a random sample of 50 dwellings from our population. The mean of sample_means_small is 1505.4168, which is reasonably close to our population mean.

When we take more samples of the same size, the histogram fills in and looks more like a normal curve; the observed range may widen a little simply because more draws give more chances for extreme means, but the underlying spread of the sampling distribution does not change. Down below, we will graph a situation where each individual sample is larger. In that case, the distribution of sample means becomes narrower and more peaked, and its variance drops, because each larger sample produces a mean closer to the population mean. Here, taking more samples of size 50 doesn't make any single sample more accurate; it just gives us more information about our population and its probability distribution.
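As a sketch of that contrast, we can hold the number of repetitions fixed and change only the size of each sample; the means_n50 and means_n150 names below are just illustrative.

# Same number of repetitions, different sample sizes per repetition.
means_n50  <- replicate(5000, mean(sample(area, 50)))
means_n150 <- replicate(5000, mean(sample(area, 150)))
sd(means_n50)
sd(means_n150)   # smaller: each larger sample lands closer to the population mean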

On your own:

With a sample of size 50, our point estimate for the mean price is 203748.74.

sample_means50_price <- data.frame(sample.of.price=double(),
                 stringsAsFactors=FALSE)
for(i in 1:5000){sample_means50_price[i,1] <-mean(sample(price, 50))}

mean(sample_means50_price[,1])
## [1] 180983.3
plot.a <- ggplot(data = sample_means50_price, aes(x = sample_means50_price[,1])) +
  geom_histogram(aes(x = sample_means50_price[,1], y = ..density..),
                 fill = '#2354ff', binwidth = 55) +
  stat_function(fun = dnorm, color = '#eaff84', size = 1.1,
                args = list(mean = mean(sample_means50_price[,1]),
                            sd = sd(sample_means50_price[,1]))) +
  ylim(0, .00014) +
  theme(panel.background = element_rect(fill = '#34363d')) +
  ggtitle('5000 Size 50 Samples-price') +
  xlab('price')
plot.a

When we take a sample of 50 prices 5000 times, the distribution of sample means starts to look normal and its mean gets closer to the population mean. The population mean is 180796.06 and our simulated mean is 180983.28.
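As a rough check on that visual impression (a sketch, nothing more), a normal quantile plot of the 5000 sample means should hug the reference line if they are approximately normal:

# Quantile-quantile check of the sample means against a normal distribution
qqnorm(sample_means50_price[,1])
qqline(sample_means50_price[,1])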

## [1] "Variance of price, size=50"
##                 sample.of.price
## sample.of.price       127060780
## [1] "Variance of price, size=150"
##                 sample.of.price
## sample.of.price        41757176

When we take larger samples of 150, each sample does a better job of estimating the population mean. The distribution of sample means is now narrower and has a smaller variance, making this our most precise set of estimates. Increasing the sample size further would add less and less accuracy at the cost of more computing time, and the sample size should not exceed 10% of the population (here, about 293 of the 2930 dwellings) if we want to treat the observations as independent.
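For reference, a minimal sketch of how the variance comparison above could be produced, assuming a companion sample_means150_price object built the same way as sample_means50_price but with samples of size 150 (that object name is an assumption):

# Hypothetical companion object with samples of size 150.
sample_means150_price <- data.frame(sample.of.price=double(),
                 stringsAsFactors=FALSE)
for(i in 1:5000){sample_means150_price[i,1] <- mean(sample(price, 150))}

print("Variance of price, size=50")
var(sample_means50_price)
print("Variance of price, size=150")
var(sample_means150_price)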