Practice constructing confidence intervals for the population mean, \(\mu\), given a random sample, \(X_i, i= 1, \dots, n\), of observations from the population of interest. Depending on the situation, we could use different statistical distributions, but for the purpose of this lab, we will be focusing on constructing confidence intervals using the Standard Normal Distribution.
Apply the concepts learned during the lecture on confidence
intervals to business problems using real world data in a widely used
statistical package. In this lab we will use RStudio, an
integrative development environment for the R programming
language. RStudio has many built-in features that simplify
coding in R. R is a free software, widely used
among data enthusiasts, business analysts, and programmers to analyze
data and perform statistical analysis.
Learn how to interpret confidence intervals and how to use them to solve business problems.
Assume you have obtained a Simple Random Sample (SRS) of size \(n\) from a population with mean, \(\mu\), and variance, \(\sigma^2\). Using this sample, the sample mean and sample variance can be computed as follows:
sample mean = \(\bar x_n = \frac{1}{n}\sum_{i=1}^{n} x_i\)
sample variance = \(s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar x_n)^2\)
When can we used the normal distribution to determine confidence intervals confidence intervals for the population mean \(\mu\)?
when the variance \(\sigma\) is known and
the sample is iid from \(N(\mu, \sigma)\)
the sample is iid from any distribution, and \(n\) is “large enough” for the Central Limit Theorem (CLT) to apply (usually when \(n \geq 30\))
the variance \(\sigma\) is unknown, the sample is iid from any distribution, and \(n\) is “large enough” for the Central Limit Theorem to apply (usually when \(n \geq 30\), assuming distribution is not strongly skewed)
Note: iid stands for independent and identically distributed. While technically a SRS refers to sampling without replacement, meaning there’s dependence between observations, as long as \(n\) is much smaller than the population size, we can usually neglect this dependence.
When the quantiles from the standard normal distribution can be used to determine a 100(1 - \(\alpha\))% confidence interval for \(\mu\), the confidence interval takes the following form:
(\(\bar x_n\) - \(SE(\bar X_n)\) \(\times\) \(z_{\alpha/2}\), \(\bar x_n\) + \(SE(\bar X_n)\) \(\times\) \(z_{\alpha/2}\))
where
\(\bar x_n\) is the sample mean
\(SE(\bar X_n)\) is either the estimated or true standard error of \(\bar X_n\)
-qnorm(\(\alpha\)/2).We have collected information on the population of houses sold in Cedar Falls between September 2022 and March 2023 using Zillow marketplace data available here. The information doesn’t include townhomes, multi-family, condos, lots, or apartments. We have information on the sale price and features of the house (bedrooms, bathrooms, etc). This dataset is organized in the file CedarFalls_HousesSoldLast6moths.csv
Initially we are going to load the data into RStudio
using the function read.csv as follows:
SoldHouses <- read.csv("CedarFalls_HousesSoldLast6moths.csv")
dim(SoldHouses)
## [1] 185 13
head(SoldHouses, 5) # to show the first 5 rows of the dataset
## id price_sold_usd Bedrooms Bathrooms sold_month sold_month_text sold_date
## 1 1 425000 4 4 3 March 03/03/2023
## 2 2 170000 3 1 3 March 03/03/2023
## 3 3 295000 3 3 3 March 03/03/2023
## 4 4 443900 5 3 3 March 03/03/2023
## 5 5 865176 5 4 3 March 03/02/2023
## street_address city state zip_code Country
## 1 3623 Pheasant Dr Cedar Falls IA 50613 USA
## 2 1900 Walnut St Cedar Falls IA 50613 USA
## 3 5404 Meadowlark Ln Cedar Falls IA 50613 USA
## 4 2715 Falcon Ln Cedar Falls IA 50613 USA
## 5 5307 Fernwood Dr Cedar Falls IA 50613 USA
## Property.URL
## 1 https://www.zillow.com/homedetails/3623-Pheasant-Dr-Cedar-Falls-IA-50613/76666933_zpid/
## 2 https://www.zillow.com/homedetails/1900-Walnut-St-Cedar-Falls-IA-50613/76668231_zpid/
## 3 https://www.zillow.com/homedetails/5404-Meadowlark-Ln-Cedar-Falls-IA-50613/76671882_zpid/
## 4 https://www.zillow.com/homedetails/2715-Falcon-Ln-Cedar-Falls-IA-50613/93829357_zpid/
## 5 https://www.zillow.com/homedetails/5307-Fernwood-Dr-Cedar-Falls-IA-50613/2071239917_zpid/
tail(SoldHouses, 5) # to show the last 5 rows of the dataset
## id price_sold_usd Bedrooms Bathrooms sold_month sold_month_text sold_date
## 181 181 299000 3 3 9 September 09/08/2022
## 182 182 515000 6 3 9 September 09/08/2022
## 183 183 185000 4 1 9 September 09/08/2022
## 184 184 200000 4 1 9 September 09/07/2022
## 185 185 243700 3 1 9 September 09/06/2022
## street_address city state zip_code Country
## 181 1815 Grand Blvd Cedar Falls IA 50613 USA
## 182 2908 Wellington Dr Cedar Falls IA 50613 USA
## 183 532 Bonita Blvd Cedar Falls IA 50613 USA
## 184 1522 Theimer St Cedar Falls IA 50613 USA
## 185 4114 Heritage Rd Cedar Falls IA 50613 USA
## Property.URL
## 181 https://www.zillow.com/homedetails/1815-Grand-Blvd-Cedar-Falls-IA-50613/76649890_zpid/
## 182 https://www.zillow.com/homedetails/2908-Wellington-Dr-Cedar-Falls-IA-50613/76669939_zpid/
## 183 https://www.zillow.com/homedetails/532-Bonita-Blvd-Cedar-Falls-IA-50613/76670979_zpid/
## 184 https://www.zillow.com/homedetails/1522-Theimer-St-Cedar-Falls-IA-50613/76666062_zpid/
## 185 https://www.zillow.com/homedetails/4114-Heritage-Rd-Cedar-Falls-IA-50613/76671154_zpid/
Now let’s obtain some summary statistics. Note that you can do this
using the summarise function. You can calculate as many
statistics as you want using this function, and just combine the
results. Some of the functions below should be self explanatory (like
mean, median, sd,
IQR, min, and max). A new
function here is the quantile function, which you can use
to calculate values corresponding to specific percentile cutoffs in the
distribution. For example quantile(x, 0.25) will yield the
cutoff value for the 25th percentile (Q1) in the distribution of
x.
SoldHouses %>%
summarise(mu = mean(price_sold_usd), pop_med = median(price_sold_usd),
sigma = sd(price_sold_usd), pop_iqr = IQR(price_sold_usd),
pop_min = min(price_sold_usd), pop_max = max(price_sold_usd),
pop_q1 = quantile(price_sold_usd, 0.25), # first quartile, 25th percentile
pop_q3 = quantile(price_sold_usd, 0.75)) # third quartile, 75th percentile
## mu pop_med sigma pop_iqr pop_min pop_max pop_q1 pop_q3
## 1 277486.1 240000 139706.3 179900 55000 865176 175000 354900
In this lab, we have access to the entire population of houses sold in a specific period of time (sep2022 - March2023), but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or impossible. Because of this, we often take a sample of the population and use that to understand the properties of the population.
If you were interested in estimating the mean price of houses sold in
Cedar Falls based on a sample, you can use the sample_n
command to survey the population. Here we are taking a random sample of
30 units out of the population.
samp1 <- SoldHouses %>%
sample_n(30)
dim(samp1)
## [1] 30 13
head(samp1, 5) # to show the first 5 observations
## id price_sold_usd Bedrooms Bathrooms sold_month sold_month_text sold_date
## 1 52 450000 4 3 1 January 01/03/2023
## 2 108 230000 3 1 11 November 11/01/2022
## 3 40 340000 4 3 1 January 01/25/2023
## 4 157 297000 4 2 9 September 09/22/2022
## 5 106 536500 5 3 11 November 11/04/2022
## street_address city state zip_code Country
## 1 117 Kaspend Pl Cedar Falls IA 50613 USA
## 2 2712 Apollo St Cedar Falls IA 50613 USA
## 3 4210 Autumn Ridge Rd Cedar Falls IA 50613 USA
## 4 1310 Lilac Ln Cedar Falls IA 50613 USA
## 5 4231 James Dr Cedar Falls IA 50613 USA
## Property.URL
## 1 https://www.zillow.com/homedetails/117-Kaspend-Pl-Cedar-Falls-IA-50613/76666946_zpid/
## 2 https://www.zillow.com/homedetails/2712-Apollo-St-Cedar-Falls-IA-50613/76667894_zpid/
## 3 https://www.zillow.com/homedetails/4210-Autumn-Ridge-Rd-Cedar-Falls-IA-50613/76672495_zpid/
## 4 https://www.zillow.com/homedetails/1310-Lilac-Ln-Cedar-Falls-IA-50613/76650716_zpid/
## 5 https://www.zillow.com/homedetails/4231-James-Dr-Cedar-Falls-IA-50613/2061705383_zpid/
tail(samp1, 5) # to show the last 5 observations
## id price_sold_usd Bedrooms Bathrooms sold_month sold_month_text sold_date
## 26 63 590000 2 1 12 December 12/16/2022
## 27 38 240000 3 2 1 January 01/27/2023
## 28 162 324900 4 2 9 September 09/19/2022
## 29 6 114000 3 1 3 March 03/01/2023
## 30 116 203000 3 1 10 October 10/24/2022
## street_address city state zip_code Country
## 26 109 Tremont St Cedar Falls IA 50613 USA
## 27 1122 Oak Park Blvd Cedar Falls IA 50613 USA
## 28 1104 Franklin St Cedar Falls IA 50613 USA
## 29 115 Tremont St Cedar Falls IA 50613 USA
## 30 137 Highland Dr Cedar Falls IA 50613 USA
## Property.URL
## 26 https://www.zillow.com/homedetails/109-Tremont-St-Cedar-Falls-IA-50613/76666981_zpid/
## 27 https://www.zillow.com/homedetails/1122-Oak-Park-Blvd-Cedar-Falls-IA-50613/76670300_zpid/
## 28 https://www.zillow.com/homedetails/1104-Franklin-St-Cedar-Falls-IA-50613/76667604_zpid/
## 29 https://www.zillow.com/homedetails/115-Tremont-St-Cedar-Falls-IA-50613/76666985_zpid/
## 30 https://www.zillow.com/homedetails/137-Highland-Dr-Cedar-Falls-IA-50613/76665773_zpid/
This command collects a simple random sample of size 30 from the
SoldHouses dataset, and assigns the result to
samp1. This is similar to going into the City Assessor’s
database and pulling up the files on 30 random home sales.
Now let’s get a sense of some summary statistics for this sample
using the code we wrote above, but now the summary statistics are for a
sample of houses from the population. However be careful to not label
values mu and sigma anymore since these are
sample statistics, not population parameters.
# descriptive statistics
samp1 %>%
summarise(mean = mean(price_sold_usd), median = median(price_sold_usd),
sd = sd(price_sold_usd), min = min(price_sold_usd), max = max(price_sold_usd), sample_iqr = IQR(price_sold_usd),
sample_q1 = quantile(price_sold_usd, 0.25), # first quartile, 25th percentile
sample_q3 = quantile(price_sold_usd, 0.75)) # third quartile, 75th percentile
## mean median sd min max sample_iqr sample_q1 sample_q3
## 1 275476.7 245000 120251.3 114000 590000 149350 177500 326850
Let’s build an histogram for this sample as well (with the mean and the median in dashed vertical lines):
mean_samp1 = mean(samp1$price_sold_usd)
median_samp1 = median(samp1$price_sold_usd)
# histogram
ggplot(data = samp1, aes(x = price_sold_usd)) +
geom_histogram(binwidth = 60000, colour = "black", fill = "cornflowerblue") + geom_vline(aes(xintercept = as.integer(mean_samp1)), color="red", linetype="dashed") +
geom_vline(aes(xintercept = as.integer(median_samp1)), color="green", linetype="dashed") +
ggtitle("Histogram of sales prices") + labs(x = " Sales price ") + scale_x_continuous(breaks=seq(0,600000,60000))+ theme(plot.title = element_text(hjust = 0.5))
If we were interested in estimating the average price for the houses sold over the period of analysis using the sample, our best single guess is the sample mean:
samp1 %>%
summarise(x_bar = mean(price_sold_usd))
## x_bar
## 1 275476.7
Depending on which 30 homes you selected, your estimate could be a bit above or a bit below the true population mean prices. We were able to get a good sense of the population average by using a sample of only 16% of the population.
Now if we perform another sample of 30 units from the original population:
Would you expect the new mean sample to match the previous mean sample? Why, or why not? If the answer is no, would you expect the means to just be somewhat different or very different?
Let’s take a second sample, also of size 30, and call it
samp2. How does the mean of samp2 compare with
the mean of samp1? Suppose we took two more samples, one of
size 50 and one of size 100. Which would you think would provide a more
accurate estimate of the population mean?
ggplot(data = SoldHouses, aes(x = price_sold_usd)) +
geom_histogram(binwidth = 60000, colour = "black", fill = "cornflowerblue") +
ggtitle("Histogram of sales prices") + labs(x = " Sales price ") + theme(plot.title = element_text(hjust = 0.5))
samp2 <- SoldHouses %>%
sample_n(30)
samp2 %>%
summarise(x_bar2 = mean(price_sold_usd))
## x_bar2
## 1 260096.4
mean_samp2 = mean(samp2$price_sold_usd)
ggplot(NULL, aes(x = price_sold_usd)) +
geom_histogram(data = SoldHouses , binwidth = 60000, colour = "black", fill = "cornflowerblue") +
geom_histogram(data = samp2 , binwidth = 50000, colour = "white", fill = "indianred1") + geom_vline(aes(xintercept = mean_samp2), color="red", linetype="dashed") + ggtitle(" Realization of the sample mean, n = 30") +labs(x = " Sales price ") + theme(plot.title = element_text(hjust = 0.5))
samp3 <- SoldHouses %>%
sample_n(30)
samp3 %>%
summarise(x_bar3 = mean(price_sold_usd))
## x_bar3
## 1 279742.9
mean_samp3 = mean(samp3$price_sold_usd)
ggplot(NULL, aes(x = price_sold_usd)) +
geom_histogram(data = SoldHouses , binwidth = 60000, colour = "black", fill = "cornflowerblue") +
geom_histogram(data = samp3 , binwidth = 50000, colour = "white", fill = "indianred1") + geom_vline(aes(xintercept = mean_samp3), color="red", linetype="dashed") + ggtitle(" Realization of the sample mean, n = 30") +labs(x = " Sales price ") + theme(plot.title = element_text(hjust = 0.5))
samp4 <- SoldHouses %>%
sample_n(30)
samp4 %>%
summarise(x_bar4 = mean(price_sold_usd))
## x_bar4
## 1 250428.2
mean_samp4 = mean(samp4$price_sold_usd)
ggplot(NULL, aes(x = price_sold_usd)) +
geom_histogram(data = SoldHouses , binwidth = 60000, colour = "black", fill = "cornflowerblue") +
geom_histogram(data = samp4 , binwidth = 50000, colour = "white", fill = "indianred1") + geom_vline(aes(xintercept = mean_samp4), color="red", linetype="dashed") + ggtitle(" Realization of the sample mean, n = 30") +labs(x = " Sales price ") + theme(plot.title = element_text(hjust = 0.5))
samp5 <- SoldHouses %>%
sample_n(30)
samp5 %>%
summarise(x_bar5 = mean(price_sold_usd))
## x_bar5
## 1 252553.1
mean_samp5 = mean(samp5$price_sold_usd)
ggplot(NULL, aes(x = price_sold_usd)) +
geom_histogram(data = SoldHouses , binwidth = 60000, colour = "black", fill = "cornflowerblue") +
geom_histogram(data = samp5 , binwidth = 50000, colour = "white", fill = "indianred1") + geom_vline(aes(xintercept = mean_samp5), color="red", linetype="dashed") + ggtitle(" Realization of the sample mean, n = 30") +labs(x = " Sales price ") + theme(plot.title = element_text(hjust = 0.5))
samp6 <- SoldHouses %>%
sample_n(30)
samp6 %>%
summarise(x_bar6 = mean(price_sold_usd))
## x_bar6
## 1 241719.7
mean_samp6 = mean(samp6$price_sold_usd)
ggplot(NULL, aes(x = price_sold_usd)) +
geom_histogram(data = SoldHouses , binwidth = 60000, colour = "black", fill = "cornflowerblue") +
geom_histogram(data = samp6 , binwidth = 50000, colour = "white", fill = "indianred1") + geom_vline(aes(xintercept = mean_samp6), color="red", linetype="dashed") + ggtitle(" Realization of the sample mean, n = 30") +labs(x = " Sales price ") + theme(plot.title = element_text(hjust = 0.5))
Not surprisingly, every time you take another random sample, you get
a different sample mean. It’s useful to get a sense of just how much
variability you should expect when estimating the population mean this
way. The distribution of sample means, called the sampling
distribution (of the mean), can help you understand this
variability. In this lab, because you have access to the population, you
can build up the sampling distribution for the sample mean by repeating
the above steps many times. Here, we use R to take 5,000 different
samples of size 30 from the population, calculate the mean price of each
sample, and store each result in a vector called
sample_means50. Note that we specify that
replace = TRUE since sampling distributions are constructed
by sampling with replacement.
sample_means30 <- SoldHouses %>%
rep_sample_n(size = 30, reps = 5000, replace = TRUE) %>%
summarise(x_bar = mean(price_sold_usd))
mean_pop = mean(SoldHouses$price_sold_usd)
ggplot(NULL,) +
geom_histogram(data = SoldHouses, aes(x = price_sold_usd, y = after_stat(density)), binwidth = 60000, colour = "black", fill = "cornflowerblue", alpha = 0.2) +
geom_histogram(data = sample_means30, aes(x = x_bar, y = after_stat(density)), binwidth = 9000, colour = "white", fill = "chocolate1", alpha = 0.2) + geom_vline(aes(xintercept = mean_pop), color="black", linetype="dashed") + ggtitle(" Sampling distribution for mean sales price, n = 30, repetitions = 5000") +labs(x = " Sales price ") + theme(plot.title = element_text(hjust = 0.5))
Note that for each of the 5000 times we computed a mean, we did so from a different sample! Notice how close to a normal distribution is the distribution of the sample mean. This is very important to decide how to construct confidence intervals.
Indeed, the sampling distribution that you computed tells you much about estimating the average prices for houses sold in Cedar Falls. Interestingly, the sampling distribution is centered at the true average price (see vertical dashed black line in the histogram above), and the spread of the distribution indicates how much variability is incurred by sampling only 30 home sales.
Now let’s build the same histogram but with a point estimate (see vertical red dashed line) and the population mean of prices (see black vertical dashed line). If you were to run the code below multiple times, the graph is going to change because we are selcting a random sample each time. Go ahead a run the code and see what happens with the graph. Does the red dashed line stay the same, or does it move around the population mean!
samp2 <- SoldHouses %>%
sample_n(30)
samp2 %>%
summarise(x_bar2 = mean(price_sold_usd))
## x_bar2
## 1 259772.7
mean_samp2 = mean(samp2$price_sold_usd)
mean_pop = mean(SoldHouses$price_sold_usd)
ggplot(NULL, aes(x = price_sold_usd)) +
geom_histogram(data = SoldHouses , binwidth = 60000, colour = "black", fill = "cornflowerblue") +
geom_histogram(data = samp2 , binwidth = 50000, colour = "white", fill = "indianred1") + geom_vline(aes(xintercept = mean_samp2), color="red", linetype="dashed") + geom_vline(aes(xintercept = mean_pop), color="black", linetype="dashed") + ggtitle(" Realization of the sample mean, n = 30") +labs(x = " Sales price ") + theme(plot.title = element_text(hjust = 0.5))
Imagine you are a Real State Analyst for
What Cheer HomeBuilders, a company new to Cedar Falls, but
with an important presence in other states. They are new to the market,
so they don’t have access to the same information as companies
associated with the Home Builders Association of Iowa.
They are considering a project to develop houses in the range 450k-650k. Your supervisor asks you to write a report with the range of plausible values for the mean price of houses sold in Cedar Falls over the last six months to determine if the new development could be a good idea from a demand standpoint.
In particular he wants to be 95% confident about the
range of plausible values.
So your job is to use sample information to calculate a 95% confidence interval for mean prices of houses sold in Cedar Falls over the last six moths. Interpret this confidence interval in terms of the study:
Our 95% confidence interval takes the form of:
(\(\bar x_n\) - \(SE(\bar X_n)\) \(\times\) \(z_{\alpha/2}\), \(\bar x_n\) + \(SE(\bar X_n)\) \(\times\) \(z_{\alpha/2}\)) \(\approx\) 95%
where \(\bar x_n\) is the sample
mean for the sales prices, \(SE(\bar
X_n)\) is the standard error, \(z_{\alpha/2}\) is the 100\((1- \alpha/2)\) percentile of the standard
normal distribution. This is basically, getting the In R, \(z_{\alpha/2}\) = -qnorm(\(\alpha\)/2). We will use a
Simple Random Sample of 30 observations from the population. The dataset
sample_houses contain the sample information for sales
price for 30 houses in Cedar Halls over the last 6 months.
Let’s initially, generate the random sample using the function
sample_n(). Remember, it’s very unlikely that you will have
the full population to study any business problem you are interested.
Here the prupose is for you to understand why we can use features from
the normal distribution to approach our problem.
sample_houses <- SoldHouses %>%
sample_n(30) # this will select a random sample of 30 observations from the dataset `SoldHouses`
Now let’s build the interval with the information provided both here and in class:
n = nrow(sample_houses) # nrow() funtion help us to get the number of observatios
n
## [1] 30
sample_mean = mean(sample_houses$price_sold_usd)
sample_sd = sd(sample_houses$price_sold_usd)
se = sample_sd / sqrt(n)
quantile = -qnorm(0.025)
quantile
## [1] 1.959964
lower = sample_mean - se * quantile
upper = sample_mean + se * quantile
lower
## [1] 222733.7
upper
## [1] 310565
Our 95% confidence interval is (224999, 346428). Which means that we are 95% confident that the population mean for all the prices of houses sold in Cedar Falls over the last six months is between 224999 usd and 346,428 usd.
What does it mean for the Real State Analyst? What are the business implications?
In the remainder of this section, you will work on getting a sense of the effect that sample size has on your sampling distribution.
Use the Shiny app below to create sampling distributions of means of
price_sold_usds from samples of size 10, 50, and 100. Use
5,000 simulations.
What does each observation in the sampling distribution represent?
Effect of increasing the sample size -> How does the mean, standard error, and shape of the sampling distribution change as the sample size increases?
Effect of increasing the number of simulations -> How (if at all) do these values change if you increase the number of simulations?
Let’s do it together!