suppressMessages({
suppressWarnings({
library(tidyverse)
library(ggthemes)
library(ggrepel)
library(readr)
library(patchwork)
library(broom)
library(lindia)
library(car)
library(caret)
library(MASS)
})
})
After reviewing the week 10 lab material with my group members, I noted several observations, questions, and findings that came up during the group discussion.
The week 10 lab works through a fairly comprehensive apartments dataset covering apartment prices across the years in two selected cities. It uses that data to outline the theoretical background and practical application of GLMs, transformations, and link functions, with specific examples of Poisson and logistic regression. Although I do not think there is much wrong with the lab, I will state my main concern and then go through some of the six questions asked during the in-class group activity:
First, the impact of unobserved variables, and the resulting potential for omitted variable bias, is a major concern. A sensitivity analysis could show how including or excluding certain predictors changes the model’s coefficients and predictions. Relatedly, multicollinearity and the stability of the regression coefficients under high correlation among predictors should be examined further.
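As a rough illustration of that kind of check, the sketch below computes variance inflation factors by hand and then compares one coefficient with and without a correlated predictor. The data are simulated stand-ins that borrow the column names `price`, `sqft`, `beds`, and `bath`; the values and the helper `vif_manual()` are assumptions made for illustration, not part of the lab. For a model without factor terms, `car::vif()` should report the same quantity.

```r
# Simulated stand-in data (assumption: NOT the real apartments values)
set.seed(1)
n <- 200
sqft  <- runif(n, 400, 3000)
beds  <- round(sqft / 600 + rnorm(n, sd = 0.5))       # deliberately correlated with sqft
bath  <- round(1 + sqft / 1200 + rnorm(n, sd = 0.4))
price <- 50000 + 400 * sqft + 20000 * beds + rnorm(n, sd = 50000)
sim <- data.frame(price, sqft, beds, bath)

# Hypothetical helper: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on all of the other predictors
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(sim, c("sqft", "beds", "bath"))
print(round(vifs, 2))

# Simple sensitivity check: how far does the sqft coefficient move
# when the correlated predictor beds is left out?
full    <- lm(price ~ sqft + beds + bath, data = sim)
reduced <- lm(price ~ sqft + bath, data = sim)
coef_shift <- unname(coef(reduced)["sqft"] - coef(full)["sqft"])
print(coef_shift)
```

A large shift in a coefficient after dropping a correlated predictor is one sign that the estimate is not stable on its own.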
url_ <- "https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/apartments/apartments.csv"
apts <- read_delim(url_, delim = ",")
## Rows: 492 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): in_sf, beds, bath, price, year_built, sqft, price_per_sqft, elevation
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Model Assumptions: The assumptions underlying the GLMs, such as the distribution of the error terms and the form of the relationship between the response and the predictors, look somewhat off to me and deserve a closer check. Logistic regression assumes that the log odds of the response are linearly related to the predictors, while Poisson regression assumes that the log of the mean response is linearly related to the predictors. I think k-fold cross-validation of the logistic regression model would help assess its performance and stability.
Statistical Improvements: As far as I know, regularization techniques such as LASSO or Ridge can improve model performance and help prevent overfitting. Either one offers an alternative modeling approach, and which is preferable depends on the complexity of the data and the prediction task.
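A minimal sketch of k-fold cross-validation for a logistic regression is shown below, written in base R so each step is visible. The data are simulated stand-ins that reuse the column names `in_sf` and `elevation`; the values, the choice of k = 5, and the 0.5 classification threshold are all assumptions for illustration.

```r
# Simulated stand-in data (assumption: NOT the real apartments values)
set.seed(42)
n <- 300
elevation <- c(rnorm(n / 2, mean = 60, sd = 30),
               rnorm(n / 2, mean = 20, sd = 15))
in_sf <- rep(c(1, 0), each = n / 2)
sim <- data.frame(in_sf, elevation)

k <- 5
folds <- sample(rep(1:k, length.out = n))   # random fold label for every row

# Fit on k - 1 folds, score on the held-out fold, repeat for each fold
accuracy <- sapply(1:k, function(i) {
  train <- sim[folds != i, ]
  test  <- sim[folds == i, ]
  fit <- glm(in_sf ~ elevation, data = train, family = binomial)
  p <- predict(fit, newdata = test, type = "response")
  mean((p > 0.5) == (test$in_sf == 1))      # share classified correctly
})

print(round(accuracy, 3))
cat("mean accuracy:", round(mean(accuracy), 3), "\n")
```

If the per-fold accuracies vary widely, that suggests the model is sensitive to which rows it was trained on.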
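As one possible illustration of regularization, the sketch below fits a ridge regression with `MASS::lm.ridge()` over a grid of penalty values and picks the one with the lowest generalized cross-validation score. This is a linear-model ridge rather than a penalized GLM, and the data are simulated stand-ins using the column names `price`, `sqft`, `beds`, and `bath`; the lambda grid is an arbitrary assumption.

```r
library(MASS)

# Simulated stand-in data (assumption: NOT the real apartments values)
set.seed(10)
n <- 200
sqft  <- runif(n, 400, 3000)
beds  <- round(sqft / 600 + rnorm(n, sd = 0.5))
bath  <- round(1 + sqft / 1200 + rnorm(n, sd = 0.4))
price <- 50000 + 400 * sqft + 20000 * beds + rnorm(n, sd = 50000)
sim <- data.frame(price, sqft, beds, bath)

# Ridge fits over a grid of penalties (the grid itself is an assumption)
lambdas <- seq(0, 50, by = 0.5)
ridge_fit <- lm.ridge(price ~ sqft + beds + bath, data = sim, lambda = lambdas)

# Choose the penalty that minimizes generalized cross-validation (GCV)
best_idx    <- which.min(ridge_fit$GCV)
best_lambda <- ridge_fit$lambda[best_idx]
best_coef   <- coef(ridge_fit)[best_idx, ]   # intercept plus three slopes

print(best_lambda)
print(best_coef)
```

A LASSO fit would need a different package, so it is left out of this sketch.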
Possible Risks: Speaking as an entrepreneur, I feel the analysis does not discuss the ethical implications of predictive modeling. For example, in a real estate context such a model could influence market prices or contribute to gentrification. Using elevation as a predictor in the logistic regression could also inadvertently lead to redlining if its use is not carefully controlled.
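To address the overdispersion in the Poisson regression, we can fit a negative binomial model, which is better suited to overdispersed count data. Note that price is a dollar amount rather than a true count, so a count model is only an approximation here.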
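Before switching models, the overdispersion itself can be checked. The sketch below uses simulated counts (an assumption, not the apartments data) to show one common diagnostic: the Pearson chi-square statistic divided by the residual degrees of freedom, which should be close to 1 for a well-specified Poisson model. It then compares the Poisson and negative binomial fits by AIC.

```r
library(MASS)

# Simulated overdispersed counts (assumption: NOT the real apartments values)
set.seed(7)
n <- 500
x  <- runif(n, 20, 55)                      # plays the role of sqrt(sqft)
mu <- exp(2 + 0.06 * x)
y  <- rnbinom(n, size = 4, mu = mu)         # variance = mu + mu^2 / 4
sim <- data.frame(x, y)

# Poisson fit and its Pearson dispersion statistic
pois_fit <- glm(y ~ x, data = sim, family = poisson)
dispersion <- sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)
print(dispersion)   # values well above 1 point to overdispersion

# Negative binomial fit on the same data, compared by AIC
nb_fit <- glm.nb(y ~ x, data = sim)
print(c(poisson = AIC(pois_fit), negbin = AIC(nb_fit)))
```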
nb_model <- glm.nb(price ~ I(sqrt(sqft)), data = apts)
summary(nb_model)
##
## Call:
## glm.nb(formula = price ~ I(sqrt(sqft)), data = apts, init.theta = 3.792319495,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.955089 0.080320 148.84 <2e-16 ***
## I(sqrt(sqft)) 0.060993 0.002058 29.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(3.7923) family taken to be 1)
##
## Null deviance: 1575.45 on 491 degrees of freedom
## Residual deviance: 513.48 on 490 degrees of freedom
## AIC: 14661
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 3.792
## Std. Err.: 0.232
##
## 2 x log-likelihood: -14654.679
ggplot(apts, aes(x = sqrt(sqft), y = price)) +
  geom_point() +
  # family must be a family object, not a string; reuse the theta estimated by nb_model
  geom_smooth(method = "glm",
              method.args = list(family = negative.binomial(theta = nb_model$theta)),
              se = FALSE) +
  labs(title = "Fitting Overdispersed Count Data with Negative Binomial Model")
## `geom_smooth()` using formula = 'y ~ x'
Biases: The models should be further evaluated for existing or potential biases, such as sampling bias or measurement bias. How was the data collected? Does it represent the population? Asking these questions is critical to give us confidence that we are working with actionable data.
Are there better visualizations which could have been used?: Well, yes. The histograms and scatter plots in the lab were fairly basic, and the analysis could have been presented with more interactive and dynamic plots. Plots of the correlation between variables and residual plots could also provide more insight into model fit. The residuals vs. fitted plot I created below is one example.
# Deviance residuals (the default for resid() on a GLM) and fitted means
residuals <- resid(nb_model, type = "deviance")
fitted_values <- fitted(nb_model)
ggplot() +
  geom_point(aes(x = fitted_values, y = residuals)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(x = "Fitted values", y = "Residuals",
       title = "Residual vs. Fitted Plot") +
  theme_minimal()
This plot indicates that for most of the data points, the residuals are distributed around the zero line, which is generally a good sign for the regression model, although there appear to be several potential outliers with large residuals. This pattern suggests that the model may be appropriate for the majority of the data, but that there are specific points it does not predict accurately.
These outliers may be data entry errors or special cases, or they may indicate that a more complex model is needed to capture the underlying relationship. Whatever the case, further analysis is needed to determine the cause. It is also important to check whether the variance of the residuals is constant, since the outliers might point to heteroscedasticity or non-linearity that the model does not address.
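One way to follow up on the outliers is to flag points with large standardized residuals or large Cook's distance. The sketch below does this on simulated counts with one injected outlier; the data and the cutoffs (an absolute standardized residual above 3, and Cook's distance above 4/n) are assumptions and common rules of thumb, not values taken from the lab.

```r
library(MASS)

# Simulated counts with one injected outlier (assumption: NOT the real data)
set.seed(3)
n <- 300
x <- runif(n, 20, 55)
y <- rnbinom(n, size = 4, mu = exp(2 + 0.06 * x))
y[1] <- 5 * max(y)                          # one obviously extreme observation
sim <- data.frame(x, y)

fit <- glm.nb(y ~ x, data = sim)

std_res <- rstandard(fit)                   # standardized deviance residuals
cooks   <- cooks.distance(fit)              # influence of each point on the fit

# Rule-of-thumb cutoffs (assumptions, adjust to the problem at hand)
flagged <- which(abs(std_res) > 3 | cooks > 4 / n)
print(head(sim[flagged, ], 10))
```

Flagged rows are candidates for a closer look rather than automatic removal.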
These are the findings from my re-review of the lab. The discussion with my classmates gave me many clues, alongside my own findings during and after class.
Some analyses can be done with the data as it stands, but for future analyses it would be helpful to request additional data that could improve the model.