suppressMessages({
  suppressWarnings({
    library(tidyverse)
    library(ggthemes)
    library(ggrepel)
    library(readr)
    library(patchwork)
    library(broom)
    library(lindia)
    library(car)
    library(caret)
    library(MASS)
  })
})

Model Critique:

After reviewing the week 10 lab material with my group members, I noted a few observations, questions, and findings that came up during the group discussion.

The week 10 lab uses the apartments dataset, a fairly comprehensive dataset on the cost of apartments over the years in two selected cities, to outline the theoretical background and practical application of GLMs, transformations, and link functions, with specific examples of Poisson and logistic regression. Although I do not think there are many problems with the lab, I want to state my major concern and go through some of the six questions asked in class during the group activity:

First, the impact of unobserved variables and the potential for omitted variable bias is a major concern. Sensitivity analysis could be used to assess how including or excluding certain predictors affects the model’s predictions. Related issues, such as multicollinearity and the stability of the regression coefficients when predictors are highly correlated, should also be analyzed further.
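One way to make this sensitivity check concrete is sketched below. It is only an illustration: the helper names `coef_sensitivity` and `predictor_cor` are hypothetical, and the choice of `beds`, `bath`, and `year_built` as extra predictors is an assumption based on the columns in the apartments data, not something the lab prescribes.

```r
library(MASS)  # glm.nb()

# Hypothetical helper: refit the negative binomial model under a few
# specifications and track how the sqrt(sqft) coefficient and AIC move.
# The extra predictors are an assumed choice, not the lab's.
coef_sensitivity <- function(data) {
  specs <- list(
    base      = price ~ I(sqrt(sqft)),
    plus_beds = price ~ I(sqrt(sqft)) + beds,
    plus_all  = price ~ I(sqrt(sqft)) + beds + bath + year_built
  )
  fits <- lapply(specs, function(f) glm.nb(f, data = data))
  data.frame(
    spec           = names(fits),
    sqrt_sqft_coef = unname(sapply(fits, function(m) coef(m)[["I(sqrt(sqft))"]])),
    aic            = unname(sapply(fits, AIC)),
    row.names      = NULL
  )
}

# Hypothetical helper: pairwise correlations among candidate predictors,
# as a quick first look at multicollinearity.
predictor_cor <- function(data, vars = c("sqft", "beds", "bath", "year_built")) {
  round(cor(as.data.frame(data)[, vars], use = "complete.obs"), 2)
}
```

Once the apartments data has been read into `apts`, calling `coef_sensitivity(apts)` and `predictor_cor(apts)` would show whether the `sqrt(sqft)` coefficient is stable across specifications and which predictors are strongly correlated. A large shift in the coefficient when a predictor is added would be a hint, not proof, that the simpler model suffers from omitted variable bias. `car::vif()` on a multi-predictor fit is another option for the collinearity check.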

url_ <- "https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/apartments/apartments.csv"

apts <- read_delim(url_, delim = ",")
## Rows: 492 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): in_sf, beds, bath, price, year_built, sqft, price_per_sqft, elevation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nb_model <- glm.nb(price ~ I(sqrt(sqft)), data = apts)

summary(nb_model)
## 
## Call:
## glm.nb(formula = price ~ I(sqrt(sqft)), data = apts, init.theta = 3.792319495, 
##     link = log)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   11.955089   0.080320  148.84   <2e-16 ***
## I(sqrt(sqft))  0.060993   0.002058   29.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(3.7923) family taken to be 1)
## 
##     Null deviance: 1575.45  on 491  degrees of freedom
## Residual deviance:  513.48  on 490  degrees of freedom
## AIC: 14661
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  3.792 
##           Std. Err.:  0.232 
## 
##  2 x log-likelihood:  -14654.679
ggplot(apts, aes(x = sqrt(sqft), y = price)) +
  geom_point() +
  # `family` must be a family object, not a string; theta = 1 is kept from the original call
  geom_smooth(method = "glm",
              method.args = list(family = MASS::negative.binomial(theta = 1)),
              se = FALSE) +
  labs(title = "Fitting Overdispersed Count Data with Negative Binomial Model")
## `geom_smooth()` using formula = 'y ~ x'

residuals <- resid(nb_model)
fitted_values <- fitted(nb_model)

ggplot() +
  geom_point(aes(x = fitted_values, y = residuals)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(x = "Fitted values", y = "Residuals", 
       title = "Residual vs. Fitted Plot") +
  theme_minimal()

This plot shows that most residuals are scattered around the zero line, which is generally a good sign for the regression model, although there are several potential outliers with large residuals. This pattern suggests that the model may be appropriate for the majority of the data while failing to predict accurately at specific points.

These outliers may be data entry errors or special cases, or they may indicate that a more complex model is needed to capture the underlying relationship. Whatever the case, further analysis is needed to determine the cause. It is also important to check whether the variance of the residuals is constant, as the outliers might point to heteroscedasticity or non-linearity that the model does not address.
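A minimal sketch of how these checks could be started is given below. The helper name `residual_check` and the cutoff of 2 on the absolute deviance residual are assumptions chosen for illustration; the cutoff is a common rule of thumb, not a formal test.

```r
# Hypothetical helper: collect fitted values and deviance residuals from a
# fitted GLM and flag observations beyond a rule-of-thumb cutoff.
# `obs` indexes the rows actually used in the fit.
residual_check <- function(model, cutoff = 2) {
  out <- data.frame(
    obs       = seq_along(fitted(model)),
    fitted    = as.numeric(fitted(model)),
    dev_resid = as.numeric(residuals(model, type = "deviance"))
  )
  # sqrt(|residual|) is the quantity shown in a scale-location plot
  out$sqrt_abs_resid <- sqrt(abs(out$dev_resid))
  out$flagged <- abs(out$dev_resid) > cutoff
  out
}

# Only runs when the model from this report is available.
if (exists("nb_model")) {
  checks <- residual_check(nb_model)
  print(subset(checks, flagged))  # candidate outliers to inspect by hand

  # Scale-location view: a clear trend suggests non-constant variance
  plot(checks$fitted, checks$sqrt_abs_resid,
       xlab = "Fitted values", ylab = "sqrt(|deviance residual|)")
  lines(lowess(checks$fitted, checks$sqrt_abs_resid), col = "red")
}
```

The flagged rows can then be looked up in `apts` to judge whether they are data entry errors or unusual apartments. If the smoothed line in the scale-location plot rises or falls noticeably with the fitted values, that would be consistent with heteroscedasticity.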

These are the findings from my re-review of the lab. The discussion with my classmates gave me many clues, alongside my own findings during and after class.

Some analysis can be done with the data as it stands, but for future analyses it would be helpful to request additional data that could improve the model.
