For this lab, you’ll be working with a group of other classmates, and each group will be assigned to critique a lab from a previous week.
Your group will have three goals:
First, create your own context for the lab. This should be a business use-case such as “a real estate firm aims to present housing trends (and recommendations) for their clients in Ames, Iowa”.
You do not need to solve the problem; you only need to define it.
Your scenario should include the following:
<some variable>
.”
Since this is a class, and not a workplace, we need to be careful not to present too much information to you all at once. For this reason, our labs are often not as analytically rigorous or thorough as they might be in practice … So here, your goal is to:
Present a list of at least 3 (improved) analyses you would recommend for your business scenario. Each proposed analysis should be accompanied by a “proof of concept” R implementation. (As usual, execute R code blocks here in the RMarkdown file.)
In the lab your group has been assigned, consider issues with models, statistical improvements, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and propose your solutions (even if you might need to request more data or resources to accomplish those solutions).
You’ll want to consider the following:
Feel free to use the reading for the week associated with your assigned lab to help refresh your memory on the concepts presented.
Review the materials from the Week 5 lesson on Ethics and Epistemology. This includes lecture slides, the lecture video, or the reading. You should also consider doing supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.). Some issues you might want to consider include:
For example, in Week 10-11, we used the year built, square footage, elevation, and the number of bedrooms to determine the price of an apartment. A few questions you might ask are:
Share your model critique in this notebook as your data dive submission for the week. Make sure to include your own R code which executes suggested routines.
library(conflicted)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.
We are working for the San Francisco local government, and we want to increase the number of affordable apartments for residents. To do this, we will assess apartment price data for the area along with features such as the number of bedrooms, bathrooms, and square footage. We want to identify which type of apartment yields the lowest end price for renters or homeowners. The city can then use this information to incentivise developers to build a higher proportion of new apartments with those characteristics.
url_ <- "https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/apartments/apartments.csv"
apts <- read_delim(url_, delim = ",")
## Rows: 492 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): in_sf, beds, bath, price, year_built, sqft, price_per_sqft, elevation
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
baths <- apts |>
  filter(bath %in% c(1, 2)) |>
  filter(beds == 2) |>
  filter(in_sf == 1)
baths_minus_one <- baths$bath - 1
model <- glm(baths_minus_one ~ price_per_sqft, data = baths,
             family = binomial(link = 'logit'))
model$coefficients
## (Intercept) price_per_sqft
## -3.610613832 0.004645585
sigmoid <- \(x) 1 / (1 + exp(-(-3.61 + 0.0046 * x)))
baths |>
  ggplot(mapping = aes(x = price_per_sqft, y = baths_minus_one)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Price per square foot of different 2 bedroom apartments",
       x = "Price per square foot",
       y = "1 bath (0) or 2 bath (1)") +
  scale_y_continuous(breaks = c(0, 1)) +
  theme_minimal()
The graph above gives a glimpse of how one-bath and two-bath apartments differ in price per square foot. Most of the one-bath apartments have a low price per square foot, with less variation than the two-bath apartments.
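The plotted curve uses coefficients typed in by hand after rounding. A minimal sketch of an alternative is to read the coefficients from the fitted model so the curve always matches the fit. The data frame `sim`, the column `two_bath`, and the generating values below are simulated assumptions for illustration only, not the apartments data.

```r
# Simulated stand-in for the filtered apartments data (hypothetical values).
set.seed(1)
n <- 200
sim <- data.frame(price_per_sqft = runif(n, 300, 1500))

# Generate a 0/1 outcome from a known logistic relationship.
true_prob <- plogis(-3.6 + 0.0046 * sim$price_per_sqft)
sim$two_bath <- rbinom(n, size = 1, prob = true_prob)

fit <- glm(two_bath ~ price_per_sqft, data = sim,
           family = binomial(link = "logit"))

# Build the curve from the fitted coefficients instead of typing them in.
b <- coef(fit)
sigmoid_fit <- function(x) {
  plogis(b[["(Intercept)"]] + b[["price_per_sqft"]] * x)
}

# predict() with type = "response" returns the same probabilities.
grid <- data.frame(price_per_sqft = c(500, 800, 1100))
manual <- sigmoid_fit(grid$price_per_sqft)
from_predict <- unname(predict(fit, newdata = grid, type = "response"))
print(cbind(grid, manual, from_predict))
```

A function like `sigmoid_fit` could be passed to `geom_function(fun = ...)` in place of the hard-coded version, which avoids rounding error and stays correct if the model is refit.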
# san francisco filter
sfapts <- apts |>
  filter(in_sf == 1)
sfapts |>
  ggplot(mapping = aes(x = as.factor(beds), y = price_per_sqft)) +
  geom_boxplot() +
  labs(x = "Number of Bedrooms", y = "Price per square foot") +
  theme_minimal()
sfapts$cpb <- sfapts$price / sfapts$beds
sfapts |>
  ggplot(mapping = aes(x = as.factor(beds), y = cpb)) +
  geom_boxplot() +
  labs(x = "Number of Beds", y = "Price per bed") +
  theme_minimal()
## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Interestingly, as the number of bedrooms increases, the typical (median) price per square foot appears to decrease steadily. However, when we look at price per bed, another valuable metric, the typical price is lowest at about 2-4 bedrooms. Although 6-bedroom apartments show the lowest price per bed overall, this may be an artifact of the small number of 6-bedroom apartments in the data. Since it's not practical for the majority of new apartments to be 6-bedroom units, we'll likely want to limit the scope of our future tests to apartments with four or fewer bedrooms.
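The warning above the price-per-bed plot reports three rows dropped as non-finite. One likely cause (an assumption, since the rows were not inspected here) is apartments recorded with zero bedrooms, where `price / beds` evaluates to `Inf`. A small sketch with a made-up data frame `toy` shows the behaviour and one way to make the exclusion explicit:

```r
# Hypothetical mini data set: one studio is recorded with beds = 0.
toy <- data.frame(
  beds  = c(0, 1, 2, 3),
  price = c(600000, 750000, 1200000, 1500000)
)

# Dividing by zero beds yields Inf, which ggplot2 drops with a warning.
toy$cpb <- toy$price / toy$beds
print(toy$cpb)

# Alternative: mark zero-bed units as NA so the exclusion is deliberate.
toy$cpb_clean <- ifelse(toy$beds > 0, toy$price / toy$beds, NA_real_)
print(toy)
```

Handling the zero-bed case explicitly keeps the decision visible in the code instead of leaving it to a plotting warning.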
sfapts |>
  ggplot(mapping = aes(x = as.factor(beds))) +
  geom_bar() +
  labs(x = "Number of Beds", y = "Number of Apartments") +
  theme_minimal()
First, we’ll test whether the mean price per square foot differs among 2-, 3-, and 4-bedroom apartments. Our null hypothesis is that there is no difference between them.
library(tidyverse)
# san francisco filter
sfapts <- apts |>
  filter(in_sf == 1)
filtered_sfapts <- sfapts |>
  filter(beds %in% c(2, 3, 4)) # anova only on 2, 3, and 4 beds
filtered_sfapts |>
  ggplot(mapping = aes(x = as.factor(beds), y = price_per_sqft)) +
  geom_boxplot() +
  labs(x = "Number of Bedrooms", y = "Price per square foot") +
  theme_minimal()
# Anova assumptions checks
filtered_sfapts |>
  filter(beds == 4) |> # normality check for each group
  ggplot(mapping = aes(x = price_per_sqft)) +
  geom_histogram() +
  scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
filtered_sfapts |>
  filter(beds == 4) |> # sd for each group is relatively close
  select(price_per_sqft) |>
  pull() |>
  sd()
## [1] 321.8597
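The checks above are shown for one group at a time. As a rough sketch, the same assumptions can be examined for all three groups in a single pass with base R. The data frame `sim` is simulated (group sizes, means, and spreads are made-up assumptions), and Bartlett's and Shapiro-Wilk tests are swapped in here as formal counterparts to the visual checks; they are not part of the original lab.

```r
# Simulated stand-in for filtered_sfapts (hypothetical values, not the lab data).
set.seed(42)
sim <- data.frame(
  beds = rep(c(2, 3, 4), times = c(100, 60, 35)),
  price_per_sqft = c(rnorm(100, 1000, 330),
                     rnorm(60, 900, 320),
                     rnorm(35, 850, 320))
)

# Standard deviation and sample size for every group at once.
group_sd <- tapply(sim$price_per_sqft, sim$beds, sd)
group_n  <- tapply(sim$price_per_sqft, sim$beds, length)
print(rbind(sd = group_sd, n = group_n))

# Bartlett's test of equal variances across the three groups.
bt <- bartlett.test(price_per_sqft ~ factor(beds), data = sim)
print(bt)

# Shapiro-Wilk normality test within each group.
shapiro_p <- tapply(sim$price_per_sqft, sim$beds,
                    function(x) shapiro.test(x)$p.value)
print(shapiro_p)
```

Small p-values from either test would suggest that the equal-variance or normality assumption is questionable, in which case a transformation or a rank-based test might be worth considering.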
m <- aov(price_per_sqft ~ beds, data = filtered_sfapts)
summary(m)
## Df Sum Sq Mean Sq F value Pr(>F)
## beds 1 1232894 1232894 11.18 0.000993 ***
## Residuals 193 21283901 110279
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
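We got an F value of 11.18, which is much larger than the reference value of 1 expected under the null hypothesis. The p-value (0.000993) is also well below 0.05, so a result this extreme would be unlikely if there were no relationship. Note that beds entered the model as a numeric variable (Df = 1), so this F test assesses a linear trend in price per square foot across bedroom counts rather than comparing three separate group means. With that caveat, our F value is suggestive of a difference across these three groups.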
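The ANOVA table above reports Df = 1 for `beds`, which is what `aov()` produces when the predictor is numeric. A one-way ANOVA comparing three groups would have 2 degrees of freedom. The sketch below contrasts the two specifications on simulated data (the data frame `sim` and all of its values are assumptions for illustration, so the F values it prints say nothing about the real apartments).

```r
# Simulated stand-in for filtered_sfapts (hypothetical values, not the lab data).
set.seed(7)
sim <- data.frame(
  beds = rep(c(2, 3, 4), times = c(100, 60, 35)),
  price_per_sqft = c(rnorm(100, 1000, 330),
                     rnorm(60, 900, 320),
                     rnorm(35, 850, 320))
)

# Numeric beds: a single slope, 1 degree of freedom (a linear-trend test).
m_numeric <- aov(price_per_sqft ~ beds, data = sim)

# Factor beds: one mean per group, 2 degrees of freedom (one-way ANOVA).
m_factor <- aov(price_per_sqft ~ factor(beds), data = sim)

# Extract the degrees of freedom (model term, then residuals).
df_numeric <- summary(m_numeric)[[1]][["Df"]]
df_factor  <- summary(m_factor)[[1]][["Df"]]
print(df_numeric)
print(df_factor)

print(summary(m_factor))
```

Wrapping the grouping variable in `factor()` is the usual way to get the group-means comparison that the stated null hypothesis describes.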
Let’s run the same process on the relationship between bedrooms and price per bed to see whether the groups actually differ.
filtered_sfapts$ppb <- filtered_sfapts$price / filtered_sfapts$beds
filtered_sfapts |>
  ggplot(mapping = aes(x = as.factor(beds), y = ppb)) +
  geom_boxplot() +
  #scale_y_continuous(breaks = c(0, 1)) +
  labs(x = "Number of Beds", y = "Price per bed") +
  theme_minimal()
m2 <- aov(ppb ~ beds, data = filtered_sfapts)
summary(m2)
## Df Sum Sq Mean Sq F value Pr(>F)
## beds 1 5.883e+11 5.883e+11 4.107 0.0441 *
## Residuals 193 2.765e+13 1.432e+11
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
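Again, an F value of 4.1 (p = 0.0441) indicates a difference in price per bed across bedroom counts, although the evidence is weaker than for price per square foot. Because beds is again treated as numeric (Df = 1), this is a test of a linear trend rather than of three separate group means.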
This adds some statistical rigor to the claim that bedroom count matters for both price per square foot and price per bed. In terms of efficiency, 3-bedroom apartments appear to have the lowest price per bed of the three groups, and they also have a lower price per square foot than 2-bedroom apartments. We could have run pairwise t-tests and computed confidence intervals for these comparisons, but since this is a proof of concept, we thought this should be sufficient.
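For reference, the pairwise comparisons and confidence intervals mentioned above could look something like the following sketch. It uses base R's `pairwise.t.test()` with a Holm correction and `TukeyHSD()` for simultaneous intervals; both choices are assumptions about how one might follow up, and the data frame `sim` is simulated rather than taken from the lab.

```r
# Simulated stand-in for filtered_sfapts (hypothetical values, not the lab data).
set.seed(7)
sim <- data.frame(
  beds = rep(c(2, 3, 4), times = c(100, 60, 35)),
  price_per_sqft = c(rnorm(100, 1000, 330),
                     rnorm(60, 900, 320),
                     rnorm(35, 850, 320))
)
sim$beds_f <- factor(sim$beds)

# Pairwise t-tests between bedroom groups, adjusted for multiple comparisons.
pw <- pairwise.t.test(sim$price_per_sqft, sim$beds_f,
                      p.adjust.method = "holm")
print(pw)

# Tukey's HSD: difference in means with a 95% interval for each pair.
tk <- TukeyHSD(aov(price_per_sqft ~ beds_f, data = sim))
print(tk)
```

An interval that excludes zero for a given pair (for example, 3 versus 2 bedrooms) would support the claim that those two groups differ, which the overall F test alone cannot show.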
Our data is beginning to suggest that 3-bedroom apartments might be among the most efficient in terms of price per square foot and price per bed. This would mean that occupants pay the least per room and get the most value for their floor area at this size. Additionally, we’ve seen that prices also tend to be lower when there are fewer bathrooms. Because of this, we might conclude that the city should incentivise building many more 2-4 bedroom apartments, especially three-bedroom units and those with fewer bathrooms than average for their bedroom count. However, it might not be that simple.
Existing Data Bias
Housing price data often includes inherent biases tied to socio-economic disparities, historical racial zoning practices, or unequal market dynamics.
Example: If the dataset over-represents affluent neighborhoods, results might fail to generalize to low-income housing markets.
Variable Concerns: Features like “Elevation” may inadvertently reflect systemic neighborhood segregation, reinforcing biases.
Profit Over Affordability: The analysis could unintentionally push developers to prioritize maximizing profits, undermining the goal of affordable housing initiatives.
Exclusion Risk: Developers may focus on market segments that yield higher returns, leaving out designs that cater to diverse housing needs.
Historical Context: Historical practices such as redlining might have shaped current price trends but remain unquantifiable in a purely statistical analysis.
Home Buyers/Renters:
Positive Impact: Improved access to affordable and well-designed housing if insights are implemented responsibly.
Negative Impact: Risk of pricing out lower-income families if developers focus solely on profit-driven designs.
Real Estate Developers:
Positive Impact: Data-driven insights could support cost-effective, equitable housing solutions aligned with affordability goals.
Negative Impact: Facing ethical scrutiny if findings drive inequitable housing practices.
Fairness and Bias Mitigation:
Employ fairness-aware preprocessing techniques to address biases in the dataset.
Regularly audit model outputs to ensure equitable treatment of different demographic groups.
Collaborate with Stakeholders:
Transparency and Documentation:
Publish comprehensive documentation detailing:
Data sources and preprocessing methods.
Model assumptions and potential limitations.
Ethical considerations and mitigation efforts.
This ensures future analyses maintain integrity and build upon an ethically sound foundation.