For this lab, you’ll be working with a group of other classmates, and each group will be assigned to critique a lab from a previous week.
Your group will have three goals:
First, create your own context for the lab. This should be a business use-case such as “a real estate firm aims to present housing trends (and recommendations) for their clients in Ames, Iowa”.
You do not need to solve the problem, you only need to define it.
Your scenario should include the following:
<some variable> will be
identified.”
Since this is a class, and not a workplace, we need to be careful not to present information to you too quickly or all at once. For this reason, our labs are often not as analytically rigorous or thorough as they might be in practice … Now that you have a more informed knowledge of statistics, your goal is to:
Present a list of at least 3 (improved) analyses you would recommend for your business scenario. Each proposed analysis should be accompanied by a “proof of concept” R implementation. (As usual, execute R code blocks here in the RMarkdown file.)
In the lab your group has been assigned, consider issues with models, statistical improvements, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and propose your solutions (even if you might need to request more data or resources to accomplish those solutions).
You’ll want to consider the following:
Feel free to use the reading for the week associated with your assigned lab to help refresh your memory on the concepts presented.
Review the materials from the Week 5 lesson on Ethics and Epistemology. This includes lecture slides, the lecture video, or the reading. You should also consider doing supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.). Some issues you might want to consider include:
For example, in Week 10-11, we used the year built, square footage, elevation, and the number of bedrooms to determine the price of an apartment. A few questions you might ask are:
Share your model critique in this notebook as your data dive submission for the week. Make sure to include your own R code which executes suggested routines.
Week 14 Data Dive: Woods, Guyon, Grant
Linear regression model analyzing an Anscombe’s quartet dataset
# Loading in data
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.5.2
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)
# remove scientific notation
options(scipen = 6)
# default theme, unless otherwise noted
theme_set(theme_minimal())
# Code we are analyzing
ames <- make_ames()
ames <- ames |> rename_with(tolower)
Problem statement / Audience
This business analysis is aimed at real estate agents and their clients (buyers and sellers) who want to understand which factors significantly affect the sale price of houses in Ames, Iowa, and so gain a fuller understanding of the housing market. The Ames housing dataset covers sales from 2006 to 2010.
Scope / Important variables
Our analysis runs a complete linear regression and reviews its summary statistics to identify which variables most significantly predict sale price. The current model in the Week 8 lab is incomplete: it reports only the coefficient estimates and leaves out important statistics such as t-values and p-values.
ames_basic <- ames |>
  filter(bldg_type == "OneFam",
         house_style == "One_Story",
         year_built >= 2000) |>
  mutate(great_qual = ifelse(overall_qual %in%
                               c("Very_Excellent", "Excellent", "Very_Good"),
                             1, 0))
ames_basic |>
  group_by(great_qual) |>
  summarize(num = n())
## # A tibble: 2 × 2
## great_qual num
## <dbl> <int>
## 1 0 145
## 2 1 182
# include all variables and their interaction
model <- lm(sale_price ~ year_remod_add + great_qual
            + year_remod_add:great_qual, ames_basic)
# to view more coefficients a bit easier
tidy(model) |>
  select(term, estimate) |>
  mutate(estimate = round(estimate, 1))
## # A tibble: 4 × 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) -829942.
## 2 year_remod_add 513.
## 3 great_qual -10088892.
## 4 year_remod_add:great_qual 5088.
We refit the linear regression with the same variables to obtain the full set of summary statistics. The summary describes the distribution of the residuals (minimum, quartiles, and maximum). More importantly, it reports the standard error, t-value, and p-value for each coefficient, so clients can see which estimates are statistically significant and how much to rely on them.
new_lm_model <- lm(sale_price ~ year_remod_add + great_qual + year_remod_add:great_qual, data = ames_basic)
summary(new_lm_model)
##
## Call:
## lm(formula = sale_price ~ year_remod_add + great_qual + year_remod_add:great_qual,
## data = ames_basic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -175088 -50737 -7689 32501 315315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -829941.8 5327025.6 -0.156 0.876
## year_remod_add 513.3 2656.3 0.193 0.847
## great_qual -10088892.4 7545297.2 -1.337 0.182
## year_remod_add:great_qual 5087.5 3761.7 1.352 0.177
##
## Residual standard error: 72100 on 323 degrees of freedom
## Multiple R-squared: 0.407, Adjusted R-squared: 0.4015
## F-statistic: 73.9 on 3 and 323 DF, p-value: < 2.2e-16
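In the summary above, the overall F-test is significant, yet no individual coefficient has a p-value below 0.05. One way to make that easier to read for clients is to report confidence intervals and to test the interaction term directly against a simpler additive model. The following is a minimal sketch, not part of the original lab; the names `full_model`, `reduced_model`, `ci`, and `interaction_test` are our own, and the data subset is rebuilt so the block runs on its own.

```r
library(AmesHousing)
library(dplyr)

# Rebuild the subset used in this notebook so the block is self-contained
ames_basic <- make_ames() |>
  rename_with(tolower) |>
  filter(bldg_type == "OneFam",
         house_style == "One_Story",
         year_built >= 2000) |>
  mutate(great_qual = ifelse(overall_qual %in%
                               c("Very_Excellent", "Excellent", "Very_Good"),
                             1, 0))

# Same specification as new_lm_model (a * b expands to a + b + a:b)
full_model <- lm(sale_price ~ year_remod_add * great_qual, data = ames_basic)

# Additive model without the interaction (hypothetical comparison model)
reduced_model <- lm(sale_price ~ year_remod_add + great_qual, data = ames_basic)

# 95% confidence intervals for every coefficient in the full model
ci <- confint(full_model, level = 0.95)
ci

# Partial F-test: does adding the interaction improve on the additive model?
interaction_test <- anova(reduced_model, full_model)
interaction_test
```

Because the interaction adds a single coefficient, the p-value of this partial F-test matches the p-value of the interaction's t-test in `summary()`. An interval that spans zero tells a client that the data cannot distinguish the direction of that effect.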
# Residuals vs Fitted plot (basic)
plot(new_lm_model, which = 1)
Based on the Residuals vs Fitted plot and the linear regression summary, it may be worth redirecting our analysis toward logistic regression. In the Residuals vs Fitted plot, the smoothed trend line curves only slightly around 0, suggesting that the linear form is a reasonable overall fit. However, none of the individual coefficients in the summary is significant at the 0.05 level (the smallest p-value is 0.177), even though the overall F-test is. This leads us to explore logistic regression with a binary outcome such as great_qual, to further investigate the relationships between these variables and what should drive decisions for real estate agents and clients buying and selling houses.
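One way to set up the logistic regression proposed here is to treat `great_qual` as the response, since sale price is continuous and cannot serve as a binary outcome directly. The following is a proof-of-concept sketch under that assumption; the choice of predictors, the rescaling of price to units of $10,000, and the names `logit_model` and `odds_ratios` are ours and are not taken from the original lab.

```r
library(AmesHousing)
library(dplyr)

# Rebuild the subset used in this notebook so the block is self-contained
ames_basic <- make_ames() |>
  rename_with(tolower) |>
  filter(bldg_type == "OneFam",
         house_style == "One_Story",
         year_built >= 2000) |>
  mutate(great_qual = ifelse(overall_qual %in%
                               c("Very_Excellent", "Excellent", "Very_Good"),
                             1, 0))

# Model the probability that a house is rated "great" quality.
# Price is rescaled to $10,000 units only to keep the coefficient readable.
logit_model <- glm(great_qual ~ I(sale_price / 10000) + year_remod_add,
                   data = ames_basic,
                   family = binomial(link = "logit"))

summary(logit_model)

# Exponentiated coefficients are odds ratios:
# the multiplicative change in the odds of great_qual = 1 per one-unit increase
odds_ratios <- exp(coef(logit_model))
odds_ratios
```

This reverses the direction of the original question (quality explained by price, rather than price explained by quality), so it complements the linear model and does not replace it.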
# Correlation Heatmap
library(GGally)
library(dplyr)
ames_basic_corr <- ames_basic |>
  mutate(interaction = year_remod_add * great_qual)
ggcorr(
  select(ames_basic_corr,
         year_remod_add,
         great_qual,
         interaction),
  label = TRUE
) +
  labs(title = "Correlation Heatmap for Ames Housing Data")
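Because `year_remod_add` only takes values from 2000 onward in this subset, the product `year_remod_add * great_qual` is almost a rescaled copy of `great_qual`, which inflates the standard errors seen in the regression summary. A common remedy is to center the year before forming the interaction. The sketch below is our own suggestion, not a step from the original lab; `year_c`, `interaction_raw`, `interaction_centered`, and `centered_model` are hypothetical names.

```r
library(AmesHousing)
library(dplyr)

# Rebuild the subset used in this notebook so the block is self-contained
ames_basic <- make_ames() |>
  rename_with(tolower) |>
  filter(bldg_type == "OneFam",
         house_style == "One_Story",
         year_built >= 2000) |>
  mutate(great_qual = ifelse(overall_qual %in%
                               c("Very_Excellent", "Excellent", "Very_Good"),
                             1, 0))

# Center the remodel year at its sample mean, then form both interactions
ames_centered <- ames_basic |>
  mutate(year_c = year_remod_add - mean(year_remod_add),
         interaction_raw = year_remod_add * great_qual,
         interaction_centered = year_c * great_qual)

# Correlation of great_qual with each version of the interaction term
cor_raw <- cor(ames_centered$great_qual, ames_centered$interaction_raw)
cor_centered <- cor(ames_centered$great_qual, ames_centered$interaction_centered)
c(raw = cor_raw, centered = cor_centered)

# Refit with the centered year: fitted values and the interaction slope are
# unchanged, but great_qual now measures the quality gap at the average year
centered_model <- lm(sale_price ~ year_c * great_qual, data = ames_centered)
summary(centered_model)
```

Centering does not change the model's predictions or its R-squared. It only reparameterizes the coefficients, so the `great_qual` estimate becomes the price difference at the mean remodel year in place of an extrapolation to year 0.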