Week 13: Assignment

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(broom)
library(lindia)
library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

Loading the dataset

url <- "https://raw.githubusercontent.com/leontoddjohnson/i590/main/data/apartments/apartments.csv"

apts <- read_delim(url, delim = ',')

## Rows: 492 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): in_sf, beds, bath, price, year_built, sqft, price_per_sqft, elevation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Transforming Explanatory Variables

Using price as the dependent variable and price per square foot as a predictor introduces a built-in dependence: price per square foot is price divided by square footage, so the predictor is itself a function of the response and the two are related by construction. The model also omits important predictors such as neighborhood, bedroom and bathroom counts, and building amenities, which can bias the estimated relationship. Because the data are observational rather than randomized, confounding variables may also drive the observed relationship between price and price per square foot.

The low R-squared value indicates limited explanatory power: much of the variability in price remains unexplained. Causation cannot be inferred from these observational data, since the same factors likely influence both price and price per square foot.

Furthermore, the analysis is restricted to NYC apartments (the rows where in_sf is 0), which keeps the sample consistent but limits generalizability.
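
The dependence described above can be illustrated with a small simulation. This is only a sketch on made-up numbers, not the apartments data: the names `sqft_sim`, `price_sim`, and `ppsf_sim` are hypothetical, and price is drawn independently of square footage so that any correlation with price per square foot comes purely from the division.

```r
# Simulated stand-in data (hypothetical values, NOT the apartments dataset)
set.seed(1)
n <- 500
sqft_sim  <- runif(n, min = 400, max = 2500)         # square footage
price_sim <- rlnorm(n, meanlog = 13.5, sdlog = 0.6)  # price, drawn independently of sqft

# Price per square foot contains price by construction
ppsf_sim <- price_sim / sqft_sim

# Price and sqft were generated independently, so this should be near zero
cor_price_sqft <- cor(price_sim, sqft_sim)

# Price and price per sqft are still correlated, because one is built from the other
cor_price_ppsf <- cor(price_sim, ppsf_sim)

print(c(price_vs_sqft = cor_price_sqft, price_vs_ppsf = cor_price_ppsf))
```

Even with no real link between price and size in the simulated data, price and price per square foot move together, which is why a regression between them says little about the housing market itself.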

# Model price per sqft as a function of price
model <- lm(price_per_sqft ~ price,
            data = filter(apts, in_sf == 0))

# Calculate R-squared
r_squared <- summary(model)$r.squared

# Plot relationship  
apts |>  
  filter(in_sf == 0) |> 
  ggplot(aes(x = price, y = price_per_sqft)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', se = FALSE) + 
  geom_smooth(se = FALSE) +
  labs(
    title = "Price per Sq Ft vs Price for Apartments", 
    subtitle = sprintf("R-squared: %.3f", r_squared)
  ) +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Switching the model orientation to predict price per square foot from price reflects the fact that price per square foot is derived from price. The two variables remain related by construction, so the fit is better read as a description of the data than as evidence of a causal effect.

Forecasting price per square foot has practical value: for a known budget, dividing the budget by the predicted price per square foot gives a rough estimate of the square footage that budget can buy. Applying a single filter on the relevant condition also makes the code clearer and less redundant.

Using `sprintf()` instead of `paste()` for string formatting keeps the code clean and readable. The refined plot title summarizes the relationship shown in the visualization, and clear labels make the plot easier to read.
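
The budget-to-square-footage idea can be sketched as follows. The data are simulated placeholders (the data frame `toy` and the value of `budget` are assumptions for illustration, not taken from the apartments data); the point is only the arithmetic of dividing a budget by the predicted price per square foot.

```r
# Simulated stand-in data (hypothetical values, NOT the apartments dataset)
set.seed(2)
toy <- data.frame(price = runif(150, min = 3e5, max = 3e6))
toy$price_per_sqft <- 600 + 0.0004 * toy$price + rnorm(150, sd = 150)

# Same orientation as in the assignment: price per sqft as a function of price
model_toy <- lm(price_per_sqft ~ price, data = toy)

# Hypothetical budget of one million dollars
budget <- 1e6

# Predicted price per sqft at that budget, then the implied square footage
pred_ppsf <- unname(predict(model_toy, newdata = data.frame(price = budget)))
est_sqft  <- budget / pred_ppsf

print(c(predicted_price_per_sqft = pred_ppsf, estimated_sqft = est_sqft))
```

With the real data, `filter(apts, in_sf == 0)` would replace `toy`. The estimate inherits all the uncertainty of the fitted model, so it is a rough guide at best.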

New Variable

# Add squared price per sqft as a new predictor
apts <- apts |>
  mutate(price_per_sqft_squared = price_per_sqft^2) 

# Model price using both original and squared price per sqft  
model <- lm(price ~ price_per_sqft_squared + price_per_sqft,
            data = filter(apts, in_sf == 0))

# Calculate R squared
r_squared <- summary(model)$r.squared 

# Plot the relationship
apts |>
  filter(in_sf == 0) |> 
  ggplot(aes(x = price_per_sqft_squared, y = price)) +
  geom_point() + 
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Price vs Squared Price per Sq Ft for Apts",
    x = "Price per Sq Ft Squared",
    y = "Price",
    subtitle = sprintf("R-squared: %.3f", r_squared)
  ) +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Combining filter calls that share the same condition removes redundancy and shortens the code. Clearly labeled axes give context for the data shown on the x and y axes.

Using `sprintf()` for the subtitle, instead of `paste()` with substitutions, results in cleaner code. Placing the squared term on the x-axis matches one of the two terms in the model specification. Note that the dashed line is a straight-line fit of price on the squared term alone, whereas the R-squared in the subtitle comes from the model with both the linear and squared terms.

An informative title captures the plot's essential relationship and helps readers grasp its purpose. In summary, these improvements in variable naming, code structure, and plot annotations make the analysis easier to understand and interpret.
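
As a small illustration of the formatting point, the sketch below compares the two approaches on a made-up value (the variable `r2_example` is hypothetical). `sprintf()` with `%.3f` always prints three decimal places, while `paste()` combined with `round()` drops trailing zeros.

```r
# Hypothetical R-squared value, chosen to show the difference in trailing zeros
r2_example <- 0.5

label_sprintf <- sprintf("R-squared: %.3f", r2_example)
label_paste   <- paste("R-squared:", round(r2_example, 3))

print(label_sprintf)  # "R-squared: 0.500"
print(label_paste)    # "R-squared: 0.5"
```

The fixed width from `sprintf()` keeps subtitles consistent across plots.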

Power Transformation on the Response

Residual Plot and QQ plot

The code for this step fits a linear regression model predicting price from price_per_sqft, using only observations where in_sf is 0. Diagnostic plots, specifically residuals vs. fitted and Q-Q plots, are generated with gg_diagnose and stored in the plots object. The plot_all function then displays these two plots, one plot per page. To improve the analysis, consider log-transforming the price variable to address skewness in the data.

Additional diagnostic plots such as scale-location, which checks whether the residual variance is constant, can give deeper insight into how well the model meets the regression assumptions. The filter on in_sf indicates that the data contain distinct groups of listings, so it is worth considering separate models for each group or including in_sf as a predictor. Adding labels and titles to the plots would also make them easier to interpret.
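
The code chunk for these diagnostics is not shown here, so the following is a minimal sketch rather than the original code. It uses lindia's single-plot helpers `gg_resfitted()`, `gg_qqplot()`, and `gg_scalelocation()` and adds a title to each plot. The data frame `toy` is simulated stand-in data; with the real data, `filter(apts, in_sf == 0)` would take its place.

```r
library(ggplot2)
library(lindia)

# Simulated stand-in data with a right-skewed response (hypothetical values)
set.seed(13)
toy <- data.frame(price_per_sqft = runif(200, min = 300, max = 2500))
toy$price <- exp(12 + 0.001 * toy$price_per_sqft + rnorm(200, sd = 0.5))

model_toy <- lm(price ~ price_per_sqft, data = toy)

# One diagnostic per plot, each with an explicit title
p_resfit <- gg_resfitted(model_toy) +
  labs(title = "Residuals vs. fitted values")
p_qq <- gg_qqplot(model_toy) +
  labs(title = "Normal Q-Q plot of residuals")
p_scale <- gg_scalelocation(model_toy) +
  labs(title = "Scale-location plot")

print(p_resfit)
print(p_qq)
print(p_scale)
```

Because the simulated response is skewed, the residual plot should fan out and the Q-Q plot should bend away from the line, which is the pattern a log transformation is meant to fix.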

BC Transformation

The code uses the powerTransform() function from the car package to identify an appropriate power transformation for the linear regression model stored in model. Specifying family = "bcPower" requests a Box-Cox power transformation. The result is saved in the object pT, and pT$lambda extracts the estimated optimal lambda, which here is 0.05507421. Because this value is close to 0, and the Box-Cox transformation with lambda equal to 0 is the natural logarithm, a log transformation of the response is a reasonable choice.
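
A minimal sketch of this step is given below, assuming simulated stand-in data in place of the apartments (the data frame `toy` and the name `lambda_hat` are hypothetical). The response is generated on a log scale, so the estimated lambda should come out near 0, mirroring the result reported above.

```r
library(car)

# Simulated stand-in data: the response is log-linear in the predictor
# (hypothetical values, NOT the apartments dataset)
set.seed(42)
toy <- data.frame(price_per_sqft = runif(300, min = 300, max = 2500))
toy$price <- exp(12 + 0.001 * toy$price_per_sqft + rnorm(300, sd = 0.5))

model_toy <- lm(price ~ price_per_sqft, data = toy)

# Estimate the Box-Cox power for the response of the fitted model
pT <- powerTransform(model_toy, family = "bcPower")
lambda_hat <- unname(pT$lambda)
print(lambda_hat)

# Apply the estimated transformation to the response
toy$price_bc <- bcPower(toy$price, lambda = lambda_hat)
```

In practice a lambda this close to 0 is usually rounded to 0, which means using `log(price)` instead of the exact power.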

Log Transform

To address skewness, the price variable is log-transformed and stored in a new variable, log_price. A linear model then predicts log_price from price_per_sqft, again using only observations where in_sf equals 0, and the R-squared value is taken from the model summary. Finally, a scatterplot shows the relationship between price_per_sqft and log_price, with a linear regression line and custom labels that include the R-squared value.
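
The steps described above can be sketched as follows. This is an illustration on simulated stand-in data, not the original chunk: the data frame `toy` and the names `nyc_toy`, `model_log`, and `p_log` are hypothetical, and with the real data `apts` would replace `toy`.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in data with the same column names as the assignment
# (hypothetical values, NOT the apartments dataset)
set.seed(7)
toy <- data.frame(
  in_sf = rep(c(0, 1), each = 100),
  price_per_sqft = runif(200, min = 300, max = 2500)
)
toy$price <- exp(12 + 0.001 * toy$price_per_sqft + rnorm(200, sd = 0.5))

# Log-transform the response
toy <- toy |>
  mutate(log_price = log(price))

# Filter once and reuse the result for both the model and the plot
nyc_toy <- filter(toy, in_sf == 0)

model_log <- lm(log_price ~ price_per_sqft, data = nyc_toy)
r_squared_log <- summary(model_log)$r.squared

p_log <- ggplot(nyc_toy, aes(x = price_per_sqft, y = log_price)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    title = "Log Price vs Price per Sq Ft",
    subtitle = sprintf("R-squared: %.3f", r_squared_log),
    x = "Price per Sq Ft",
    y = "log(Price)"
  ) +
  theme_classic()

print(p_log)
```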

Poisson Transform

Plotting the distribution of sqft before and after transformation would show the impact of the transformation. Additional diagnostics, such as residuals versus fitted values, would help check how well the smoothing line fits. The smoothing method (loess, gam, or another) should be specified explicitly. The plot could be improved by adding transparency to the points, faceting by subset, or using color to encode a variable. Finally, model validation techniques such as training/test splits would give a more rigorous assessment of the model's performance.
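
The training/test split suggested above could look like the sketch below. It uses base R on simulated stand-in data; the 80/20 split, the data frame `toy`, and the names `train_set` and `test_set` are assumptions for illustration, not part of the original analysis.

```r
# Simulated stand-in data (hypothetical values, NOT the apartments dataset)
set.seed(99)
toy <- data.frame(price_per_sqft = runif(200, min = 300, max = 2500))
toy$log_price <- 12 + 0.001 * toy$price_per_sqft + rnorm(200, sd = 0.5)

# Randomly assign 80% of rows to training and the rest to testing
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]

# Fit on the training rows only
fit <- lm(log_price ~ price_per_sqft, data = train_set)

# Compare in-sample and out-of-sample root mean squared error
rmse_train <- sqrt(mean(residuals(fit)^2))
pred_test  <- predict(fit, newdata = test_set)
rmse_test  <- sqrt(mean((test_set$log_price - pred_test)^2))

print(c(rmse_train = rmse_train, rmse_test = rmse_test))
```

A test error much larger than the training error would suggest the model is overfitting; similar values suggest it generalizes about as well as it fits.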