Note: if you are completing this exercise in the fully-online version of H510, you’ll be doing it individually. So, you can ignore all instances of “your group”, and just think about them as “yourself”.
For this lab, you’ll be working with a group of other classmates, and each group will be assigned to critique a lab from a previous week.
Your group will have three goals:
First, create your own context for the lab. This should be a business use-case such as “a real estate firm aims to present housing trends (and recommendations) for their clients in Ames, Iowa”.
You do not need to solve the problem, you only need to define it.
Your scenario should include the following:
My work:
Audience: The pricing team at a real estate agency in Ames, Iowa, which advises homeowners and developers on setting competitive listing prices.
Problem Statement: Currently, the team relies only on average sale prices when recommending listing prices. It does not account for factors such as neighborhood, house quality, or size. To make more accurate recommendations, the team needs to determine statistically how these variables affect sale prices, using ANOVA and linear regression.
Scope: Many variables in the data are relevant to this problem. The most useful are likely OverallQual, GrLivArea, Neighborhood, and YearBuilt, with SalePrice as the outcome variable. Two analyses fit the problem: ANOVA, to test whether mean sale prices differ across neighborhoods, and linear regression, to predict SalePrice from OverallQual, GrLivArea, and YearBuilt. Together, these would give the pricing team a much better basis for advising clients on listing prices. The key assumptions are that the relationships between the predictors and SalePrice are linear and that the residuals are normally distributed.
Objective: I would define success with three criteria. First, a multiple linear regression model with an R^2 of at least 0.70, so that the model explains at least 70% of the variance in sale price. Second, identifying which neighborhoods have statistically different mean sale prices (via ANOVA with follow-up pairwise comparisons). Third, being able to say, with statistical support, which features (size, year built, quality) have a strong influence on housing prices.
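The first two success criteria can be checked numerically. The sketch below is only an illustration: it runs on simulated data (the `housing_sim` data frame, its neighborhood labels, and every coefficient are made up), so the numbers say nothing about the real Ames data.

```r
# Simulated stand-in for the Ames data; every value here is made up.
set.seed(42)
n <- 300
housing_sim <- data.frame(
  OverallQual  = sample(1:10, n, replace = TRUE),
  GrLivArea    = round(runif(n, 600, 3500)),
  YearBuilt    = sample(1950:2010, n, replace = TRUE),
  Neighborhood = factor(sample(c("North", "Central", "South"), n, replace = TRUE))
)

# Hypothetical neighborhood premiums, in dollars
premium <- c(North = 0, Central = 30000, South = -20000)
housing_sim$SalePrice <- 20000 * housing_sim$OverallQual +
  60 * housing_sim$GrLivArea +
  300 * (housing_sim$YearBuilt - 1950) +
  unname(premium[as.character(housing_sim$Neighborhood)]) +
  rnorm(n, sd = 25000)

# Criterion 1: does the regression explain at least 70% of the variance?
baseline_model <- lm(SalePrice ~ OverallQual + GrLivArea + YearBuilt,
                     data = housing_sim)
r_squared <- summary(baseline_model)$r.squared
meets_target <- r_squared >= 0.70

# Criterion 2: does mean SalePrice differ across neighborhoods?
neighborhood_aov <- aov(SalePrice ~ Neighborhood, data = housing_sim)
anova_p <- summary(neighborhood_aov)[[1]][["Pr(>F)"]][1]

print(r_squared)
print(meets_target)
print(anova_p)
```

With the real data, the same three lines (`lm`, `summary(...)$r.squared`, `aov`) would apply to the loaded data frame in place of `housing_sim`.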
Since this is a class, and not a workplace, we need to be careful not to present too much information to you all at once. For this reason, our labs are often not as analytically rigorous or thorough as they might be in practice … So here, your goal is to:
Present a list of at least 3 (improved) analyses you would recommend for your business scenario. Each proposed analysis should be accompanied by a “proof of concept” R implementation. (As usual, execute R code blocks here in the RMarkdown file.)
In the lab your group has been assigned, consider issues with models, statistical improvements, interpretations, analyses, visualizations, etc. Use this notebook as a sandbox for trying out different code, and investigating the data from a different perspective. Take notes on all the issues you see, and propose your solutions (even if you might need to request more data or resources to accomplish those solutions).
You’ll want to consider the following:
Feel free to use the reading for the week associated with your assigned lab to help refresh your memory on the concepts presented.
My work:
Issue 1 - Including interaction terms in the regression model. The initial model assumes that the predictors (e.g., OverallQual, GrLivArea, YearBuilt) have additive effects on SalePrice, so the effect of each one does not depend on the values of the others. However, these variables likely interact. For example, a large house of poor quality might not sell for as much as a smaller, higher-quality home.
Improvement 1 - Build a multiple linear regression model with interaction terms to capture combined effects.
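# Proof of concept for improvement 1
# Assumes the Ames data is already loaded as `housing`
# OverallQual * GrLivArea expands to both main effects plus their interaction
interaction_model <- lm(SalePrice ~ OverallQual * GrLivArea + YearBuilt, data = housing)
# Get results
summary(interaction_model)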
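One way to judge whether the interaction term earns its place is a partial F-test between the additive model and the interaction model. The sketch below uses simulated data (the `sim` data frame and all coefficients are invented, and an interaction is built in on purpose), so it only shows the mechanics.

```r
# Simulated data with a deliberate quality-by-size interaction; values are made up.
set.seed(1)
n <- 400
sim <- data.frame(
  OverallQual = sample(1:10, n, replace = TRUE),
  GrLivArea   = runif(n, 600, 3500),
  YearBuilt   = sample(1950:2010, n, replace = TRUE)
)
sim$SalePrice <- 10000 * sim$OverallQual + 20 * sim$GrLivArea +
  8 * sim$OverallQual * sim$GrLivArea +
  200 * (sim$YearBuilt - 1950) + rnorm(n, sd = 20000)

# Nested models: the additive model is the interaction model without the product term
additive_model    <- lm(SalePrice ~ OverallQual + GrLivArea + YearBuilt, data = sim)
interaction_model <- lm(SalePrice ~ OverallQual * GrLivArea + YearBuilt, data = sim)

# Partial F-test: does the interaction term significantly reduce residual error?
comparison <- anova(additive_model, interaction_model)
print(comparison)
interaction_p <- comparison[["Pr(>F)"]][2]

# Adjusted R^2 penalizes the extra parameter, so it is a fairer comparison than raw R^2
print(c(additive    = summary(additive_model)$adj.r.squared,
        interaction = summary(interaction_model)$adj.r.squared))
```

A small p-value in the second row of the `anova()` table suggests the interaction model fits better than the additive one. On real data the result could go either way.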
Issue 2 - Fix the skewness of SalePrice with a log transformation. Housing prices are typically right-skewed, which can violate the regression assumptions of normally distributed residuals and constant variance. A log transformation can stabilize the variance and bring the residuals closer to normal.
Improvement 2 - Fit the same regression with log(SalePrice) as the outcome so that the residuals better satisfy the model assumptions.
# Proof of concept for improvement 2
# Assumes the Ames data is already loaded as `housing`
# Regression model with log(SalePrice) as the outcome
log_model <- lm(log(SalePrice) ~ OverallQual + GrLivArea + YearBuilt, data = housing)
# Get results, then check the residuals-vs-fitted plot
summary(log_model)
plot(log_model, which = 1)
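A log-scale model predicts log dollars, so its predictions need to be back-transformed before they are shown to clients. The sketch below illustrates this on simulated right-skewed prices (the `sim` data frame, coefficients, and the example home are all made up), and uses a Shapiro-Wilk test as one possible numeric check on residual normality.

```r
# Simulated right-skewed prices; multiplicative noise is an assumption of this sketch.
set.seed(7)
n <- 300
sim <- data.frame(
  OverallQual = sample(1:10, n, replace = TRUE),
  GrLivArea   = runif(n, 600, 3500)
)
sim$SalePrice <- exp(10.5 + 0.12 * sim$OverallQual +
                       0.0003 * sim$GrLivArea + rnorm(n, sd = 0.2))

raw_model <- lm(SalePrice ~ OverallQual + GrLivArea, data = sim)
log_model <- lm(log(SalePrice) ~ OverallQual + GrLivArea, data = sim)

# Shapiro-Wilk test on residuals: a small p-value suggests non-normal residuals
raw_p <- shapiro.test(residuals(raw_model))$p.value
log_p <- shapiro.test(residuals(log_model))$p.value
print(c(raw = raw_p, log = log_p))

# Predictions from the log model are in log dollars; exp() converts back to dollars
new_home <- data.frame(OverallQual = 7, GrLivArea = 1800)
predicted_price <- exp(predict(log_model, newdata = new_home))
print(predicted_price)
```

Note that `exp()` of a log-scale prediction is closer to a typical (median) price than to the mean price, and that coefficients in the log model are read as approximate percentage changes in price rather than dollar amounts.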
Issue 3 - Using post-hoc tests after ANOVA. ANOVA tells us that there are differences between neighborhoods, but not WHICH neighborhoods differ.
Improvement 3 - After finding a significant ANOVA result, use Tukey’s Honest Significant Difference (HSD) test for pairwise comparisons between neighborhoods. Tukey’s test reports the estimated price difference for each pair of neighborhoods while adjusting for multiple comparisons, which shows us which neighborhoods are higher or lower priced.
# Proof of concept for improvement 3
# Assumes the Ames data is already loaded as `housing`
# Run the one-way ANOVA and summarize it
neighborhood_anova <- aov(SalePrice ~ Neighborhood, data = housing)
summary(neighborhood_anova)
# Tukey HSD test for all pairwise neighborhood comparisons
tukey_results <- TukeyHSD(neighborhood_anova)
print(tukey_results)
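With many neighborhoods, the full Tukey output can run to hundreds of pairs, so it may help to keep only the significant ones and sort them by size. The sketch below shows one way to do that on simulated data (the three neighborhood labels and their mean prices are invented for illustration).

```r
# Simulated prices for three hypothetical neighborhoods; values are made up.
set.seed(3)
sim <- data.frame(
  Neighborhood = factor(rep(c("North", "Central", "South"), each = 50)),
  SalePrice    = c(rnorm(50, mean = 150000, sd = 20000),
                   rnorm(50, mean = 155000, sd = 20000),
                   rnorm(50, mean = 220000, sd = 20000))
)

fit   <- aov(SalePrice ~ Neighborhood, data = sim)
tukey <- TukeyHSD(fit)

# TukeyHSD returns one matrix per factor with columns diff, lwr, upr, and "p adj"
pairs <- as.data.frame(tukey$Neighborhood)
pairs$comparison <- rownames(pairs)

# Keep pairs with an adjusted p-value below 0.05, largest differences first
significant <- pairs[pairs[["p adj"]] < 0.05,
                     c("comparison", "diff", "lwr", "upr", "p adj")]
significant <- significant[order(-abs(significant$diff)), ]
print(significant)
```

The `diff`, `lwr`, and `upr` columns give the estimated price gap and its confidence interval for each pair, which is usually easier for a pricing team to act on than the p-value alone.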
Review the materials from the Week 5 lesson on Ethics and Epistemology. This includes lecture slides, the lecture video, or the reading. You should also consider doing supplementary research on the topic at hand (e.g., news outlets, historical articles, etc.). Some issues you might want to consider include:
For example, in Week 10-11, we used the year built, square footage, elevation, and the number of bedrooms to determine the price of an apartment. A few questions you might ask are:
My work:
There are definite issues with this business concept, and several ways to address them. Because the company helps clients set listing prices, our advice would lower prices for some clients and raise them for others, which creates inherent risks.
Bias - Using the neighborhood variable is problematic because neighborhoods are affected by many external factors and biases. If a neighborhood has historically been valued lower because of discrimination, a model trained on that data would continue the trend. We would need to disclose this limitation directly to our customers so they understand it.
Errors and Misleading Advice - Our analysis improves the company’s models, but, like any model, none of them is perfect. Some clients will receive inaccurate advice, which could cost them money and have a negative effect on their lives. This is another risk to disclose to customers.
Things we can’t take into account in a model - Emotional attachment to a home, neighborhood amenities, market momentum, world events, and other factors can all affect people’s homes and their prices. We cannot currently account for these factors, and they could cause problems for our customers in the future.
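One way to make this uncertainty concrete for a client is to report a prediction interval alongside the point estimate. The sketch below is a minimal illustration on simulated data (the `sim` data frame, the coefficients, and the example home are all made up).

```r
# Simulated data; every value here is made up for illustration.
set.seed(11)
n <- 300
sim <- data.frame(
  OverallQual = sample(1:10, n, replace = TRUE),
  GrLivArea   = runif(n, 600, 3500)
)
sim$SalePrice <- 15000 * sim$OverallQual + 70 * sim$GrLivArea +
  rnorm(n, sd = 30000)

model <- lm(SalePrice ~ OverallQual + GrLivArea, data = sim)

# A 95% prediction interval covers the plausible sale price of a single home,
# not just the average price of homes like it
new_home <- data.frame(OverallQual = 6, GrLivArea = 1500)
estimate <- predict(model, newdata = new_home,
                    interval = "prediction", level = 0.95)
print(estimate)

interval_width <- estimate[, "upr"] - estimate[, "lwr"]
print(interval_width)
```

Quoting the range between `lwr` and `upr`, rather than only `fit`, tells the client how far the actual sale price could reasonably land from the recommendation.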
Share your model critique in this notebook as your data dive submission for the week. Make sure to include your own R code which executes suggested routines.