For this lab, you’ll be working with a group of other classmates, and each group will be assigned to critique a lab from a previous week.
Your group will have three goals:
First, create your own context for the lab. This should be a business use-case such as “a real estate firm aims to present housing trends (and recommendations) for their clients in Ames, Iowa”.
You do not need to solve the problem, you only need to define it.
Your scenario should include the following:
<some variable>
.”
Since this is a class, and not a workplace, we need to be careful not to present too much information to you all at once. For this reason, our labs are often not as analytically rigorous or thorough as they might be in practice … So here, your goal is to:
Present a list of at least 3 (improved) analyses you would recommend for your business scenario. Each proposed analysis should be accompanied by a “proof of concept” R implementation. (As usual, execute R code blocks here in the RMarkdown file.)
In the lab your group has been assigned, consider issues with models, statistical improvements, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and propose your solutions (even if you might need to request more data or resources to accomplish those solutions).
You’ll want to consider the following:
Feel free to use the reading for the week associated with your assigned lab to help refresh your memory on the concepts presented.
Review the materials from the Week 5 lesson on Ethics and Epistemology. This includes lecture slides, the lecture video, or the reading. You should also consider doing supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.). Some issues you might want to consider include:
For example, in Week 10-11, we used the year built, square footage, elevation, and the number of bedrooms to determine the price of an apartment. A few questions you might ask are:
Share your model critique in this notebook as your data dive submission for the week. Make sure to include your own R code which executes suggested routines.
A nationwide apartment rental platform (e.g., Zillow Rentals or Apartments.com) seeks to improve its rental price estimation model and recommend listings to users more accurately.
In order to enhance user satisfaction and reduce bounce rates on listings, the company needs to determine which features (e.g., square footage, elevation, year built) most influence the rental price of an apartment.
Relevant variables from the dataset: price, sqft, elevation, year_built, beds, bath
Analyses:
- GLMs to model price and in_sf (location classifier)
- Variable selection via stepwise regression
- Multicollinearity check via VIF
- Model comparison via AIC/BIC/deviance
Assumptions:
- GLMs are appropriate for the outcome variable type
- Residual deviance, and deviance comparisons between nested models, indicate whether a change improves fit
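The checks listed under Analyses can be prototyped with base R alone. The following is a minimal sketch on simulated data: the `sim` data frame and its coefficients are made up for illustration and only borrow the column names from the scenario. VIF is computed by hand as 1 / (1 - R^2) rather than with a package function, and `step()` is run backward with its default AIC penalty.

```r
# Simulated stand-in for the lab data; column names mirror the scenario, values are made up
set.seed(1)
n <- 300
sim <- data.frame(
  sqft       = runif(n, 400, 2500),
  elevation  = runif(n, 0, 200),
  year_built = sample(1900:2020, n, replace = TRUE)
)
sim$beds <- pmax(1, round(sim$sqft / 600 + rnorm(n, sd = 0.5)))
sim$log_price <- 12 + 0.0006 * sim$sqft + 0.002 * sim$elevation + rnorm(n, sd = 0.2)

full <- glm(log_price ~ sqft + elevation + year_built + beds, data = sim)

# VIF for each predictor: 1 / (1 - R^2) from regressing it on the other predictors
predictors <- c("sqft", "elevation", "year_built", "beds")
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = sim))$r.squared
  1 / (1 - r2)
})
print(round(vif, 2))

# Backward stepwise selection by AIC (k = 2); pass k = log(n) for a BIC-style penalty
reduced <- step(full, direction = "backward", trace = 0)

# Side-by-side comparison of the full and selected models
comparison <- data.frame(
  model    = c("full", "reduced"),
  AIC      = c(AIC(full), AIC(reduced)),
  BIC      = c(BIC(full), BIC(reduced)),
  deviance = c(deviance(full), deviance(reduced))
)
print(comparison)
```

On the real data, the same calls would run against `apts` and the lab's fitted models. A VIF well above roughly 5 to 10 is a common, informal signal that two predictors (for example sqft and beds) carry overlapping information.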
Identify the variables that most significantly influence apartment price and accurately classify location (SF vs. NY) to power recommendation algorithms and pricing insights.
Currently, only square footage is transformed (sqrt_sqft) while others like elevation or year_built are not.
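library(dplyr)

# Add a squared elevation term and a size-by-bedrooms interaction
apts <- apts |>
  mutate(elevation2 = elevation^2,
         sqft_beds_interaction = sqft * beds)

# Gaussian GLM; linear elevation is kept next to elevation2 so the curve is not forced to have zero slope at elevation 0
model_poly <- glm(log_price ~ sqrt_sqft + elevation + elevation2 + sqft_beds_interaction + beds, data = apts)
summary(model_poly)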
This captures nonlinear effects of elevation and the interaction between size and number of bedrooms, which may influence price.
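Whether the added terms actually improve the model can be checked with a nested-model comparison. The sketch below uses simulated data with a deliberately curved elevation effect (the `sim` data frame and its coefficients are hypothetical), fits the quadratic with `poly()`, and compares the two Gaussian GLMs with an F test and AIC.

```r
# Simulated stand-in for the lab data; the data-generating process is made up
set.seed(2)
n <- 400
sim <- data.frame(
  sqft      = runif(n, 400, 2500),
  elevation = runif(n, 0, 200),
  beds      = sample(1:4, n, replace = TRUE)
)
sim$log_price <- 12 + 0.02 * sqrt(sim$sqft) + 0.004 * sim$elevation -
  0.00002 * sim$elevation^2 + 0.05 * sim$beds + rnorm(n, sd = 0.1)

# Baseline: every predictor enters linearly (square footage on the sqrt scale)
base_fit <- glm(log_price ~ sqrt(sqft) + elevation + beds, data = sim)

# poly(elevation, 2) fits linear and quadratic terms on an orthogonal basis;
# sqrt(sqft):beds is the interaction on the same scale as the main effect
poly_fit <- glm(log_price ~ sqrt(sqft) + poly(elevation, 2) + beds + sqrt(sqft):beds,
                data = sim)

# F test for nested Gaussian GLMs, plus AIC as a penalized comparison
nested_test <- anova(base_fit, poly_fit, test = "F")
print(nested_test)
print(AIC(base_fit, poly_fit))
```

A small p-value in the `anova()` table, together with a lower AIC for the larger model, would support keeping the extra terms. If neither appears on the real data, the simpler model is easier to explain to a business audience.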
Residual plots, leverage, and influence points are missing.
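# Base R diagnostics in a 2x2 grid: residuals vs. fitted, Q-Q, scale-location, residuals vs. leverage
par(mfrow = c(2, 2))
plot(model_poly)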
These plots help validate GLM assumptions and identify outliers or influential points.
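Influential points can also be listed numerically rather than read off a plot. The following sketch uses simulated data with one planted outlier (the `sim` data frame is hypothetical) and flags rows whose Cook's distance exceeds 4 / n, a common rule of thumb rather than a hard cutoff.

```r
# Simulated stand-in for the lab data; values are made up
set.seed(3)
n <- 200
sim <- data.frame(sqft = runif(n, 400, 2500))
sim$log_price <- 12 + 0.0006 * sim$sqft + rnorm(n, sd = 0.2)

# Plant one unusual listing: very large but cheap
sim$sqft[1] <- 6000
sim$log_price[1] <- 11

fit <- glm(log_price ~ sqft, data = sim)

cooks <- cooks.distance(fit)  # influence of each row on the fitted coefficients
lev <- hatvalues(fit)         # leverage: how unusual each row's predictors are
threshold <- 4 / n            # rule of thumb, not a hard cutoff

flagged <- which(cooks > threshold)
print(data.frame(row = flagged, cooks = cooks[flagged], leverage = lev[flagged]))
```

Flagged rows are candidates for inspection, not automatic removal. On the real data, a flagged listing might be a data-entry error or a legitimately unusual apartment.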
The in_sf classification is modeled but not validated.
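library(caret)

# Classify as SF when the predicted probability exceeds 0.5 (in-sample predictions from model1)
apts$pred <- ifelse(predict(model1, type = "response") > 0.5, 1, 0)
# Assumes in_sf is coded 0/1; shared levels keep the table square, and positive = "1" reports metrics for SF
confusionMatrix(factor(apts$pred, levels = c(0, 1)), factor(apts$in_sf, levels = c(0, 1)), positive = "1")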
This provides accuracy, sensitivity, and specificity, giving insight into the model’s real-world classification ability.
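Because the predictions above are made on the same rows used to fit the model, the reported metrics may be optimistic. A hold-out split is one way to check this. The sketch below uses simulated data (the `sim` data frame, the 70/30 split, and the single-predictor model are assumptions for illustration) and computes the metrics from a base R table.

```r
# Simulated stand-in for the lab data; in_sf is coded 1 = SF, 0 = NY, values are made up
set.seed(4)
n <- 600
sim <- data.frame(
  elevation = c(rnorm(n / 2, 60, 30), rnorm(n / 2, 15, 10)),
  in_sf     = rep(c(1, 0), each = n / 2)
)

# 70/30 split so the metrics are computed on rows the model never saw
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- sim[train_idx, ]
test <- sim[-train_idx, ]

clf <- glm(in_sf ~ elevation, data = train, family = binomial)
prob <- predict(clf, newdata = test, type = "response")

# Fix the factor levels so the table is 2x2 even if one class is never predicted
pred <- factor(as.integer(prob > 0.5), levels = c(0, 1))
truth <- factor(test$in_sf, levels = c(0, 1))

cm <- table(predicted = pred, actual = truth)
accuracy <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # share of true SF rows predicted as SF
specificity <- cm["0", "0"] / sum(cm[, "0"])  # share of true NY rows predicted as NY
print(cm)
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

With the real data, the same split could be applied to `apts` before fitting, and `caret::confusionMatrix()` could replace the manual table.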
year_built may reflect systemic urban development disparities.