Professor: Andrés Castaño Zuluaga

Course: Business Statistics

University of Northern Iowa


Lab Goals

  1. Practice constructing confidence intervals for the population mean, \(\mu\), given a random sample, \(X_i, i= 1, \dots, n\), of observations from the population of interest. Depending on the situation, we could use different statistical distributions, but for the purpose of this lab, we will be focusing on constructing confidence intervals using the Standard Normal Distribution.

  2. Apply the concepts learned during the lecture on confidence intervals to business problems using real world data in a widely used statistical package. In this lab we will use RStudio, an integrative development environment for the R programming language. RStudio has many built-in features that simplify coding in R. R is a free software, widely used among data enthusiasts, business analysts, and programmers to analyze data and perform statistical analysis.

  3. Learn how to interpret confidence intervals and how to use them to solve business problems.

Constructing Confidence Intervals Using the Standard Normal Distribution

Assume you have obtained a Simple Random Sample (SRS) of size \(n\) from a population with mean, \(\mu\), and variance, \(\sigma^2\). Using this sample, the sample mean and sample variance can be computed as follows:

  1. sample mean = \(\bar x_n = \frac{1}{n}\sum_{i=1}^{n} x_i\)

  2. sample variance = \(s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar x_n)^2\)

When can we used the normal distribution to determine confidence intervals confidence intervals for the population mean \(\mu\)?

  1. when the variance \(\sigma\) is known and

    1. the sample is iid from \(N(\mu, \sigma)\)

    2. the sample is iid from any distribution, and \(n\) is “large enough” for the Central Limit Theorem (CLT) to apply (usually when \(n \geq 30\))

  2. the variance \(\sigma\) is unknown, the sample is iid from any distribution, and \(n\) is “large enough” for the Central Limit Theorem to apply (usually when \(n \geq 30\), assuming distribution is not strongly skewed)

Note: iid stands for independent and identically distributed. While technically a SRS refers to sampling without replacement, meaning there’s dependence between observations, as long as \(n\) is much smaller than the population size, we can usually neglect this dependence.

When the quantiles from the standard normal distribution can be used to determine a 100(1 - \(\alpha\))% confidence interval for \(\mu\), the confidence interval takes the following form:

(\(\bar x_n\) - \(SE(\bar X_n)\) \(\times\) \(z_{\alpha/2}\), \(\bar x_n\) + \(SE(\bar X_n)\) \(\times\) \(z_{\alpha/2}\))

where

  1. \(\bar x_n\) is the sample mean

  2. \(SE(\bar X_n)\) is either the estimated or true standard error of \(\bar X_n\)

  1. if the variance \(\sigma\) is known, \(SE(\bar X_n) = \sigma/\sqrt{n}\)
  2. if the variance \(\sigma\) is unknown, we replace \(SE(\bar X_n)\) with an its sample estimate \(s/\sqrt{n}\)
  1. \(z_{\alpha/2}\) is the 100\((1- \alpha/2)\) percentile of the standard normal distribution. In R, \(z_{\alpha/2}\) = -qnorm(\(\alpha\)/2).

Application: The Cedar Falls housing market over last 6 months (sep 2022 - march 2023)

We have collected information on the population of houses sold in Cedar Falls between September 2022 and March 2023 using Zillow marketplace data available here. The information doesn’t include townhomes, multi-family, condos, lots, or apartments. We have information on the sale price and features of the house (bedrooms, bathrooms, etc). This dataset is organized in the file CedarFalls_HousesSoldLast6moths.csv

Initially we are going to load the data into RStudio using the function read.csv as follows:

SoldHouses <- read.csv("CedarFalls_HousesSoldLast6moths.csv")
dim(SoldHouses)
## [1] 185  13
head(SoldHouses, 5) # to show the first 5 rows of the dataset
##   id price_sold_usd Bedrooms Bathrooms sold_month sold_month_text  sold_date
## 1  1         425000        4         4          3           March 03/03/2023
## 2  2         170000        3         1          3           March 03/03/2023
## 3  3         295000        3         3          3           March 03/03/2023
## 4  4         443900        5         3          3           March 03/03/2023
## 5  5         865176        5         4          3           March 03/02/2023
##       street_address        city state zip_code Country
## 1   3623 Pheasant Dr Cedar Falls    IA    50613     USA
## 2     1900 Walnut St Cedar Falls    IA    50613     USA
## 3 5404 Meadowlark Ln Cedar Falls    IA    50613     USA
## 4     2715 Falcon Ln Cedar Falls    IA    50613     USA
## 5   5307 Fernwood Dr Cedar Falls    IA    50613     USA
##                                                                                Property.URL
## 1   https://www.zillow.com/homedetails/3623-Pheasant-Dr-Cedar-Falls-IA-50613/76666933_zpid/
## 2     https://www.zillow.com/homedetails/1900-Walnut-St-Cedar-Falls-IA-50613/76668231_zpid/
## 3 https://www.zillow.com/homedetails/5404-Meadowlark-Ln-Cedar-Falls-IA-50613/76671882_zpid/
## 4     https://www.zillow.com/homedetails/2715-Falcon-Ln-Cedar-Falls-IA-50613/93829357_zpid/
## 5 https://www.zillow.com/homedetails/5307-Fernwood-Dr-Cedar-Falls-IA-50613/2071239917_zpid/
tail(SoldHouses, 5) # to show the last 5 rows of the dataset
##      id price_sold_usd Bedrooms Bathrooms sold_month sold_month_text  sold_date
## 181 181         299000        3         3          9       September 09/08/2022
## 182 182         515000        6         3          9       September 09/08/2022
## 183 183         185000        4         1          9       September 09/08/2022
## 184 184         200000        4         1          9       September 09/07/2022
## 185 185         243700        3         1          9       September 09/06/2022
##         street_address        city state zip_code Country
## 181    1815 Grand Blvd Cedar Falls    IA    50613     USA
## 182 2908 Wellington Dr Cedar Falls    IA    50613     USA
## 183    532 Bonita Blvd Cedar Falls    IA    50613     USA
## 184    1522 Theimer St Cedar Falls    IA    50613     USA
## 185   4114 Heritage Rd Cedar Falls    IA    50613     USA
##                                                                                  Property.URL
## 181    https://www.zillow.com/homedetails/1815-Grand-Blvd-Cedar-Falls-IA-50613/76649890_zpid/
## 182 https://www.zillow.com/homedetails/2908-Wellington-Dr-Cedar-Falls-IA-50613/76669939_zpid/
## 183    https://www.zillow.com/homedetails/532-Bonita-Blvd-Cedar-Falls-IA-50613/76670979_zpid/
## 184    https://www.zillow.com/homedetails/1522-Theimer-St-Cedar-Falls-IA-50613/76666062_zpid/
## 185   https://www.zillow.com/homedetails/4114-Heritage-Rd-Cedar-Falls-IA-50613/76671154_zpid/

Now let’s obtain some summary statistics. Note that you can do this using the summarise function. You can calculate as many statistics as you want using this function, and just combine the results. Some of the functions below should be self explanatory (like mean, median, sd, IQR, min, and max). A new function here is the quantile function, which you can use to calculate values corresponding to specific percentile cutoffs in the distribution. For example quantile(x, 0.25) will yield the cutoff value for the 25th percentile (Q1) in the distribution of x.

SoldHouses %>%
  summarise(mu = mean(price_sold_usd), pop_med = median(price_sold_usd), 
            sigma = sd(price_sold_usd), pop_iqr = IQR(price_sold_usd),
            pop_min = min(price_sold_usd), pop_max = max(price_sold_usd),
            pop_q1 = quantile(price_sold_usd, 0.25),  # first quartile, 25th percentile
            pop_q3 = quantile(price_sold_usd, 0.75))  # third quartile, 75th percentile
##         mu pop_med    sigma pop_iqr pop_min pop_max pop_q1 pop_q3
## 1 277486.1  240000 139706.3  179900   55000  865176 175000 354900

In this lab, we have access to the entire population of houses sold in a specific period of time (sep2022 - March2023), but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or impossible. Because of this, we often take a sample of the population and use that to understand the properties of the population.

If you were interested in estimating the mean price of houses sold in Cedar Falls based on a sample, you can use the sample_n command to survey the population. Here we are taking a random sample of 30 units out of the population.

samp1 <- SoldHouses %>%
  sample_n(30)
dim(samp1)
## [1] 30 13
head(samp1, 5) # to show the first 5 observations 
##    id price_sold_usd Bedrooms Bathrooms sold_month sold_month_text  sold_date
## 1  52         450000        4         3          1         January 01/03/2023
## 2 108         230000        3         1         11        November 11/01/2022
## 3  40         340000        4         3          1         January 01/25/2023
## 4 157         297000        4         2          9       September 09/22/2022
## 5 106         536500        5         3         11        November 11/04/2022
##         street_address        city state zip_code Country
## 1       117 Kaspend Pl Cedar Falls    IA    50613     USA
## 2       2712 Apollo St Cedar Falls    IA    50613     USA
## 3 4210 Autumn Ridge Rd Cedar Falls    IA    50613     USA
## 4        1310 Lilac Ln Cedar Falls    IA    50613     USA
## 5        4231 James Dr Cedar Falls    IA    50613     USA
##                                                                                  Property.URL
## 1       https://www.zillow.com/homedetails/117-Kaspend-Pl-Cedar-Falls-IA-50613/76666946_zpid/
## 2       https://www.zillow.com/homedetails/2712-Apollo-St-Cedar-Falls-IA-50613/76667894_zpid/
## 3 https://www.zillow.com/homedetails/4210-Autumn-Ridge-Rd-Cedar-Falls-IA-50613/76672495_zpid/
## 4        https://www.zillow.com/homedetails/1310-Lilac-Ln-Cedar-Falls-IA-50613/76650716_zpid/
## 5      https://www.zillow.com/homedetails/4231-James-Dr-Cedar-Falls-IA-50613/2061705383_zpid/
tail(samp1, 5) # to show the last 5 observations
##     id price_sold_usd Bedrooms Bathrooms sold_month sold_month_text  sold_date
## 26  63         590000        2         1         12        December 12/16/2022
## 27  38         240000        3         2          1         January 01/27/2023
## 28 162         324900        4         2          9       September 09/19/2022
## 29   6         114000        3         1          3           March 03/01/2023
## 30 116         203000        3         1         10         October 10/24/2022
##        street_address        city state zip_code Country
## 26     109 Tremont St Cedar Falls    IA    50613     USA
## 27 1122 Oak Park Blvd Cedar Falls    IA    50613     USA
## 28   1104 Franklin St Cedar Falls    IA    50613     USA
## 29     115 Tremont St Cedar Falls    IA    50613     USA
## 30    137 Highland Dr Cedar Falls    IA    50613     USA
##                                                                                 Property.URL
## 26     https://www.zillow.com/homedetails/109-Tremont-St-Cedar-Falls-IA-50613/76666981_zpid/
## 27 https://www.zillow.com/homedetails/1122-Oak-Park-Blvd-Cedar-Falls-IA-50613/76670300_zpid/
## 28   https://www.zillow.com/homedetails/1104-Franklin-St-Cedar-Falls-IA-50613/76667604_zpid/
## 29     https://www.zillow.com/homedetails/115-Tremont-St-Cedar-Falls-IA-50613/76666985_zpid/
## 30    https://www.zillow.com/homedetails/137-Highland-Dr-Cedar-Falls-IA-50613/76665773_zpid/

This command collects a simple random sample of size 30 from the SoldHouses dataset, and assigns the result to samp1. This is similar to going into the City Assessor’s database and pulling up the files on 30 random home sales.

Now let’s get a sense of some summary statistics for this sample using the code we wrote above, but now the summary statistics are for a sample of houses from the population. However be careful to not label values mu and sigma anymore since these are sample statistics, not population parameters.

# descriptive statistics
samp1 %>%
  summarise(mean = mean(price_sold_usd), median = median(price_sold_usd), 
            sd = sd(price_sold_usd), min = min(price_sold_usd), max =    max(price_sold_usd), sample_iqr = IQR(price_sold_usd),
            sample_q1 = quantile(price_sold_usd, 0.25),  # first quartile, 25th percentile
            sample_q3 = quantile(price_sold_usd, 0.75))  # third quartile, 75th percentile
##       mean median       sd    min    max sample_iqr sample_q1 sample_q3
## 1 275476.7 245000 120251.3 114000 590000     149350    177500    326850

Let’s build an histogram for this sample as well (with the mean and the median in dashed vertical lines):

mean_samp1 = mean(samp1$price_sold_usd)
median_samp1 = median(samp1$price_sold_usd)
# histogram
ggplot(data = samp1, aes(x = price_sold_usd)) +
  geom_histogram(binwidth = 60000, colour = "black", fill = "cornflowerblue") + geom_vline(aes(xintercept = as.integer(mean_samp1)), color="red", linetype="dashed") +
geom_vline(aes(xintercept = as.integer(median_samp1)), color="green", linetype="dashed") +
  ggtitle("Histogram of sales prices") + labs(x = " Sales price ") + scale_x_continuous(breaks=seq(0,600000,60000))+ theme(plot.title =  element_text(hjust = 0.5))

If we were interested in estimating the average price for the houses sold over the period of analysis using the sample, our best single guess is the sample mean:

samp1 %>%
summarise(x_bar = mean(price_sold_usd))
##      x_bar
## 1 275476.7

Depending on which 30 homes you selected, your estimate could be a bit above or a bit below the true population mean prices. We were able to get a good sense of the population average by using a sample of only 16% of the population.

Now if we perform another sample of 30 units from the original population:

Would you expect the new mean sample to match the previous mean sample? Why, or why not? If the answer is no, would you expect the means to just be somewhat different or very different?

Let’s take a second sample, also of size 30, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 50 and one of size 100. Which would you think would provide a more accurate estimate of the population mean?

ggplot(data = SoldHouses, aes(x = price_sold_usd)) +
  geom_histogram(binwidth = 60000, colour = "black", fill = "cornflowerblue") +
  ggtitle("Histogram of sales prices") + labs(x = " Sales price ") + theme(plot.title =  element_text(hjust = 0.5))

samp2 <- SoldHouses %>%
  sample_n(30)
samp2 %>%
summarise(x_bar2 = mean(price_sold_usd))
##     x_bar2
## 1 260096.4
mean_samp2 = mean(samp2$price_sold_usd)
ggplot(NULL, aes(x = price_sold_usd)) +
  geom_histogram(data = SoldHouses , binwidth = 60000, colour = "black", fill = "cornflowerblue") +
  geom_histogram(data = samp2 , binwidth = 50000, colour = "white", fill = "indianred1") + geom_vline(aes(xintercept = mean_samp2), color="red", linetype="dashed")  + ggtitle(" Realization of the sample mean, n = 30") +labs(x = " Sales price ") + theme(plot.title = element_text(hjust = 0.5))

samp3 <- SoldHouses %>%
  sample_n(30)
samp3 %>%
summarise(x_bar3 = mean(price_sold_usd))
##     x_bar3
## 1 279742.9
mean_samp3 = mean(samp3$price_sold_usd)
ggplot(NULL, aes(x = price_sold_usd)) +
  geom_histogram(data = SoldHouses , binwidth = 60000, colour = "black", fill = "cornflowerblue") +
  geom_histogram(data = samp3 , binwidth = 50000, colour = "white", fill = "indianred1") + geom_vline(aes(xintercept = mean_samp3), color="red", linetype="dashed")  + ggtitle(" Realization of the sample mean, n = 30") +labs(x = " Sales price ") + theme(plot.title = element_text(hjust = 0.5))

samp4 <- SoldHouses %>%
  sample_n(30)
samp4 %>%
summarise(x_bar4 = mean(price_sold_usd))
##     x_bar4
## 1 250428.2
mean_samp4 = mean(samp4$price_sold_usd)
ggplot(NULL, aes(x = price_sold_usd)) +
  geom_histogram(data = SoldHouses , binwidth = 60000, colour = "black", fill = "cornflowerblue") +
  geom_histogram(data = samp4 , binwidth = 50000, colour = "white", fill = "indianred1") + geom_vline(aes(xintercept = mean_samp4), color="red", linetype="dashed")  + ggtitle(" Realization of the sample mean, n = 30") +labs(x = " Sales price ") + theme(plot.title = element_text(hjust = 0.5))

samp5 <- SoldHouses %>%
  sample_n(30)
samp5 %>%
summarise(x_bar5 = mean(price_sold_usd))
##     x_bar5
## 1 252553.1
mean_samp5 = mean(samp5$price_sold_usd)
ggplot(NULL, aes(x = price_sold_usd)) +
  geom_histogram(data = SoldHouses , binwidth = 60000, colour = "black", fill = "cornflowerblue") +
  geom_histogram(data = samp5 , binwidth = 50000, colour = "white", fill = "indianred1") + geom_vline(aes(xintercept = mean_samp5), color="red", linetype="dashed")  + ggtitle(" Realization of the sample mean, n = 30") +labs(x = " Sales price ") + theme(plot.title = element_text(hjust = 0.5))

samp6 <- SoldHouses %>%
  sample_n(30)
samp6 %>%
summarise(x_bar6 = mean(price_sold_usd))
##     x_bar6
## 1 241719.7
mean_samp6 = mean(samp6$price_sold_usd)
ggplot(NULL, aes(x = price_sold_usd)) +
  geom_histogram(data = SoldHouses , binwidth = 60000, colour = "black", fill = "cornflowerblue") +
  geom_histogram(data = samp6 , binwidth = 50000, colour = "white", fill = "indianred1") + geom_vline(aes(xintercept = mean_samp6), color="red", linetype="dashed")  + ggtitle(" Realization of the sample mean, n = 30") +labs(x = " Sales price ") + theme(plot.title = element_text(hjust = 0.5))

Not surprisingly, every time you take another random sample, you get a different sample mean. It’s useful to get a sense of just how much variability you should expect when estimating the population mean this way. The distribution of sample means, called the sampling distribution (of the mean), can help you understand this variability. In this lab, because you have access to the population, you can build up the sampling distribution for the sample mean by repeating the above steps many times. Here, we use R to take 5,000 different samples of size 30 from the population, calculate the mean price of each sample, and store each result in a vector called sample_means50. Note that we specify that replace = TRUE since sampling distributions are constructed by sampling with replacement.

sample_means30 <- SoldHouses %>%
                    rep_sample_n(size = 30, reps = 5000, replace = TRUE) %>%
                    summarise(x_bar = mean(price_sold_usd))
mean_pop = mean(SoldHouses$price_sold_usd)
ggplot(NULL,) +
  geom_histogram(data = SoldHouses, aes(x = price_sold_usd, y = after_stat(density)), binwidth = 60000, colour = "black", fill = "cornflowerblue", alpha = 0.2) +
  geom_histogram(data = sample_means30, aes(x = x_bar, y = after_stat(density)), binwidth = 9000, colour = "white", fill = "chocolate1", alpha = 0.2) + geom_vline(aes(xintercept = mean_pop), color="black", linetype="dashed")  + ggtitle(" Sampling distribution for mean sales price, n = 30, repetitions = 5000") +labs(x = " Sales price ") + theme(plot.title = element_text(hjust = 0.5))

Note that for each of the 5000 times we computed a mean, we did so from a different sample! Notice how close to a normal distribution is the distribution of the sample mean. This is very important to decide how to construct confidence intervals.

Indeed, the sampling distribution that you computed tells you much about estimating the average prices for houses sold in Cedar Falls. Interestingly, the sampling distribution is centered at the true average price (see vertical dashed black line in the histogram above), and the spread of the distribution indicates how much variability is incurred by sampling only 30 home sales.

Now let’s build the same histogram but with a point estimate (see vertical red dashed line) and the population mean of prices (see black vertical dashed line). If you were to run the code below multiple times, the graph is going to change because we are selcting a random sample each time. Go ahead a run the code and see what happens with the graph. Does the red dashed line stay the same, or does it move around the population mean!

samp2 <- SoldHouses %>%
  sample_n(30)
samp2 %>%
summarise(x_bar2 = mean(price_sold_usd))
##     x_bar2
## 1 259772.7
mean_samp2 = mean(samp2$price_sold_usd)
mean_pop = mean(SoldHouses$price_sold_usd)
ggplot(NULL, aes(x = price_sold_usd)) +
  geom_histogram(data = SoldHouses , binwidth = 60000, colour = "black", fill = "cornflowerblue") +
  geom_histogram(data = samp2 , binwidth = 50000, colour = "white", fill = "indianred1") + geom_vline(aes(xintercept = mean_samp2), color="red", linetype="dashed")  + geom_vline(aes(xintercept = mean_pop), color="black", linetype="dashed") + ggtitle(" Realization of the sample mean, n = 30") +labs(x = " Sales price ") + theme(plot.title = element_text(hjust = 0.5))

Application Confidence Intervals

Imagine you are a Real State Analyst for What Cheer HomeBuilders, a company new to Cedar Falls, but with an important presence in other states. They are new to the market, so they don’t have access to the same information as companies associated with the Home Builders Association of Iowa.

They are considering a project to develop houses in the range 450k-650k. Your supervisor asks you to write a report with the range of plausible values for the mean price of houses sold in Cedar Falls over the last six months to determine if the new development could be a good idea from a demand standpoint.

In particular he wants to be 95% confident about the range of plausible values.

So your job is to use sample information to calculate a 95% confidence interval for mean prices of houses sold in Cedar Falls over the last six moths. Interpret this confidence interval in terms of the study:

Our 95% confidence interval takes the form of:

(\(\bar x_n\) - \(SE(\bar X_n)\) \(\times\) \(z_{\alpha/2}\), \(\bar x_n\) + \(SE(\bar X_n)\) \(\times\) \(z_{\alpha/2}\)) \(\approx\) 95%

where \(\bar x_n\) is the sample mean for the sales prices, \(SE(\bar X_n)\) is the standard error, \(z_{\alpha/2}\) is the 100\((1- \alpha/2)\) percentile of the standard normal distribution. This is basically, getting the In R, \(z_{\alpha/2}\) = -qnorm(\(\alpha\)/2). We will use a Simple Random Sample of 30 observations from the population. The dataset sample_houses contain the sample information for sales price for 30 houses in Cedar Halls over the last 6 months.

Let’s initially, generate the random sample using the function sample_n(). Remember, it’s very unlikely that you will have the full population to study any business problem you are interested. Here the prupose is for you to understand why we can use features from the normal distribution to approach our problem.

sample_houses <- SoldHouses %>%
sample_n(30) # this will select a random sample of 30 observations from the dataset `SoldHouses`

Now let’s build the interval with the information provided both here and in class:

n = nrow(sample_houses) # nrow() funtion help us to get the number of observatios
n
## [1] 30
sample_mean =  mean(sample_houses$price_sold_usd) 
sample_sd = sd(sample_houses$price_sold_usd)
se = sample_sd / sqrt(n) 
quantile = -qnorm(0.025)
quantile
## [1] 1.959964
lower = sample_mean - se * quantile
upper = sample_mean + se * quantile 
lower
## [1] 222733.7
upper
## [1] 310565

Our 95% confidence interval is (224999, 346428). Which means that we are 95% confident that the population mean for all the prices of houses sold in Cedar Falls over the last six months is between 224999 usd and 346,428 usd.

What does it mean for the Real State Analyst? What are the business implications?

How the number of simulations and the sample size affects the sampling distribution of the mean

In the remainder of this section, you will work on getting a sense of the effect that sample size has on your sampling distribution.

Use the Shiny app below to create sampling distributions of means of price_sold_usds from samples of size 10, 50, and 100. Use 5,000 simulations.

  • What does each observation in the sampling distribution represent?

  • Effect of increasing the sample size -> How does the mean, standard error, and shape of the sampling distribution change as the sample size increases?

  • Effect of increasing the number of simulations -> How (if at all) do these values change if you increase the number of simulations?

Let’s do it together!

Shiny applications not supported in static R Markdown documents