We’ll be using the AirBnB data set from New York City in 2019. It lists every apartment available in New York City in 2019.
The relevant columns are:
In this homework, you’ll write a function to calculate a confidence interval for the difference in average price between apartments in Manhattan vs Brooklyn. Then using that function, the simulate collecting 10,000 different samples and calculating a 95% confidence interval for each sample and see if almost 95% of the intervals contain the true difference in average price between the two boroughs.
Save a data set named abnb_clean that is:
abnb_full |>
# Keeping the specified rows
filter(
room_type == "Entire home/apt",
neighbourhood_group %in% c("Brooklyn", "Manhattan"),
availability_365 >= 10,
price <= 1000
) |>
# Keeping the specified columns
dplyr::select(
id,
host_id,
host_name,
borough = neighbourhood_group,
price,
minimum_nights
) ->
abnb_clean
# Displaying the results
tibble(abnb_clean)
## # A tibble: 12,976 × 6
## id host_id host_name borough price minimum_nights
## <int> <int> <chr> <chr> <int> <int>
## 1 2595 2845 Jennifer Manhattan 225 1
## 2 3831 4869 LisaRoxanne Brooklyn 89 1
## 3 5099 7322 Chris Manhattan 200 3
## 4 5238 7549 Ben Manhattan 150 1
## 5 6848 15991 Allen & Irina Brooklyn 140 2
## 6 7097 17571 Jane Brooklyn 215 2
## 7 7726 20950 Adam And Charity Brooklyn 99 3
## 8 7750 17985 Sing Manhattan 190 7
## 9 8490 25183 Nathalie Brooklyn 120 2
## 10 9357 30193 Tommi Manhattan 150 10
## # ℹ 12,966 more rows
The code below will take a random sample of 5% of each Manhattan and Brooklyn apartments (called stratified sampling) and save it as abnb_sample. Use is as a test case for question 2 below, with price as y and borough for the groups
# Making sure the sample is the same for everyone
RNGversion("4.1.0"); set.seed(1234)
# Taking the stratified sample
abnb_clean |>
slice_sample(
by = borough,
prop = 0.01
) ->
abnb_sample
# Checking the results:
abnb_sample |>
summarize(
.by = borough,
apts = n(),
price_avg = mean(price) |> scales::dollar(),
price_sd = sd(price) |> scales::dollar()
) |>
gt::gt()
borough | apts | price_avg | price_sd |
---|---|---|---|
Manhattan | 75 | $254.71 | $197.08 |
Brooklyn | 54 | $177.09 | $94.10 |
You’ll write a function named two_sample_CI
that
takes two vectors (y & groups) and calculates a confidence interval
for the difference between the two means.
The function will have 3 arguments:
y =
the vector of numeric values to calculate the
confidence interval fromgroups =
the vector that indicates which group the
values in y
belong toCL =
the confidence level of the confidence interval in
proportions (default = 0.95)The function should return the following 4 objects (with the correct names)
averages =
a vector with the mean of each group: \(\bar{y}_1\) and \(\bar{y}_2\)diff_avg =
difference in means: \(\bar{y}_1 - \bar{y}_2\)diff_SE =
the standard error of the difference in
means: \(\textrm{SE}(\bar{y}_1 - \bar{y}_2) =
\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\)diff_CI =
a vector that has the confidence interval for
the difference in means:\[(\bar{y}_1 - \bar{y}_2) \pm z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\]
To calculate \(z_{\alpha/2}\), use
z = qnorm(p = (1-CL)/2, lower.tail = F)
two_sample_CI <-
function(y, groups, CL = 0.95){
## Calculating the summary stats
# Joining y and groups as a data.frame to use the dplyr verbs
data.frame(
y,
groups
) |>
# Calculating n, mean, and sd for each group
summarize(
.by = groups,
n = n(),
y_bar = mean(y),
sd = sd(y)
) ->
summary_stats
# Calculating the needed values for the CI
summary_stats |>
# Calculating the variance for each group
mutate(variance = sd^2/n) |>
# Calculating the difference in means and the standard error
summarize(
avg_diff = abs(diff(y_bar)),
diff_SE = sqrt(sum(variance))
) |>
# Calculating the lower and upper bound of the CI
mutate(
ci_lower = avg_diff - qnorm(p = (1-CL)/2, lower.tail = F)*diff_SE,
ci_upper = avg_diff + qnorm(p = (1-CL)/2, lower.tail = F)*diff_SE
) ->
ci_values
return(
list(
averages = summary_stats$y_bar,
diff_avg = ci_values$avg_diff,
diff_SE = ci_values$diff_SE,
diff_CI = c(ci_values$ci_lower, ci_values$ci_upper)
)
)
}
Use the code chunk below to check the results:
# 90% CI
two_sample_CI(
y = abnb_sample$price,
groups = abnb_sample$borough,
CL = 0.90
)
## $averages
## [1] 254.7067 177.0926
##
## $diff_avg
## [1] 77.61407
##
## $diff_SE
## [1] 26.11239
##
## $diff_CI
## [1] 34.66302 120.56513
# 95% CI
two_sample_CI(
y = abnb_sample$price,
groups = abnb_sample$borough
)
## $averages
## [1] 254.7067 177.0926
##
## $diff_avg
## [1] 77.61407
##
## $diff_SE
## [1] 26.11239
##
## $diff_CI
## [1] 26.43474 128.79341
# 99% CI
two_sample_CI(
y = abnb_sample$price,
groups = abnb_sample$borough,
CL = 0.99
)
## $averages
## [1] 254.7067 177.0926
##
## $diff_avg
## [1] 77.61407
##
## $diff_SE
## [1] 26.11239
##
## $diff_CI
## [1] 10.35302 144.87513
Write a simulation that will take a random sample of 1% from the abnb_clean NYC apartments data set and use the function from question 2 to calculate the 90% CI. Repeat the process a total of 10,000 times
Store the results in a data.frame named sim_results that has 10,000 rows and six columns:
The results should match what is in the solutions in Brightspace
# How many loops to perform
m <- 10000
# Create the data frame to store the results
sim_results <-
data.frame(
Manhattan_avg = rep(-1, m),
Brooklyn_avg = rep(-1, m),
diff_avg = rep(-1, m),
diff_SE = rep(-1, m),
CI_lower = rep(-1, m),
CI_upper = rep(-1, m)
)
tibble(sim_results)
## # A tibble: 10,000 × 6
## Manhattan_avg Brooklyn_avg diff_avg diff_SE CI_lower CI_upper
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -1 -1 -1 -1 -1 -1
## 2 -1 -1 -1 -1 -1 -1
## 3 -1 -1 -1 -1 -1 -1
## 4 -1 -1 -1 -1 -1 -1
## 5 -1 -1 -1 -1 -1 -1
## 6 -1 -1 -1 -1 -1 -1
## 7 -1 -1 -1 -1 -1 -1
## 8 -1 -1 -1 -1 -1 -1
## 9 -1 -1 -1 -1 -1 -1
## 10 -1 -1 -1 -1 -1 -1
## # ℹ 9,990 more rows
Run the simulation where each loop should:
You can check the results in Brightspace to see if your results match
# Making sure the results are the same every time
RNGversion("4.1.0"); set.seed(2870)
# Write the simulation below:
for (i in 1:m){
# Getting the sample from abnb_clean
loop_sample <-
abnb_clean |>
slice_sample(
by = borough,
prop = 0.01
)
# Using the function to create the summary stats
loop_results <-
two_sample_CI(
y = loop_sample$price,
groups = loop_sample$borough,
CL = 0.90
)
## Storing the results in the columns of sim_results
# The individual means
sim_results[i, 1:2] <- loop_results$averages
# The difference in means
sim_results[i, 3] <- loop_results$diff_avg
# The SE of the difference in means
sim_results[i, 4] <- loop_results$diff_SE
# The confidence interval
sim_results[i, 5:6] <- loop_results$diff_CI
}
tibble(sim_results)
## # A tibble: 10,000 × 6
## Manhattan_avg Brooklyn_avg diff_avg diff_SE CI_lower CI_upper
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 251. 196. 54.5 24.7 13.9 95.2
## 2 244. 186. 58.3 26.4 14.9 102.
## 3 211. 181. 30.1 21.6 -5.45 65.7
## 4 249. 229. 20.8 33.0 -33.6 75.1
## 5 253. 178. 75.2 23.9 35.9 115.
## 6 222. 139. 82.9 15.1 58.0 108.
## 7 245. 162. 82.9 19.7 50.5 115.
## 8 252. 212. 40.0 24.5 -0.415 80.3
## 9 223. 186. 37.7 21.5 2.34 73.0
## 10 228. 167. 61.3 17.3 32.8 89.7
## # ℹ 9,990 more rows
Calculate, save, and display the true difference in the average price between apartments in Manhattan vs Brooklyn in the abnb_clean data set, then calculate the percentage of the 10,000 intervals created in question 3B.
# Calculating the true difference
abnb_clean |>
# The individual means
summarize(
.by = borough,
price_avg = mean(price)
) |>
# The difference
summarize(diff_avg = price_avg |> diff() |> abs()) |>
pull(diff_avg) ->
diff_in_means
diff_in_means
## [1] 64.6262
# Calculating the percentage of intervals that contain the true difference
sim_results |>
mutate(contains_true = CI_lower <= diff_in_means & CI_upper >= diff_in_means) |>
summarize(interval_prop = mean(contains_true))
## interval_prop
## 1 0.9009
sim_results |>
filter(CI_lower <= diff_in_means & CI_upper >= diff_in_means) |>
nrow()
## [1] 9009
Does the simulation indicate that the 90% confidence intervals are accurate?
Create the graph in Brightspace that illustrates how accurate the last 100 confidence intervals are. The graph should have:
sim_results |>
# Picking the last 100 rows
slice_tail(
n = 100
) |>
# Specifying the colors to use if the CI contains the parameter or not
mutate(
line_color = if_else(CI_lower < diff_in_means & CI_upper > diff_in_means,
"steelblue",
"tomato"),
row_num = row_number()
) |>
## Creating the graph
ggplot(
mapping = aes(
y = row_num
)
) +
# Drawing the line segment for the CI
geom_segment(
mapping = aes(
x = CI_lower,
y = row_num,
xend = CI_upper,
yend = row_num,
color = line_color
),
linewidth = 1
) +
# Adding the difference in means
geom_point(
mapping = aes(
x = diff_avg
),
size = 2
) +
# Adding the vertical line at the true difference
geom_vline(
xintercept = diff_in_means,
linetype = "dashed",
linewidth = 1
) +
# Changing the labels on the graph
labs(
title = "Results of 90% Confidence intervals for a difference in means",
x = "Difference in average price of apartments in Manhattan and Brooklyn",
y = NULL
) +
# Making the colors match the value in the column
scale_color_identity() +
# Removing the values on the y-axis
scale_y_continuous(
breaks = NULL,
expand = c(0.01, 0)
) +
# Changing the breaks on the x-axis
scale_x_continuous(
breaks = seq(-75, 200, 25),
minor_breaks = NULL,
labels = scales::label_dollar()
) +
theme_bw() +
theme(
plot.title = element_text(hjust = 0.5),
# axis.ticks.y = element_blank()
)