library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(broom)
library(lindia)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

Loading the dataset

url <- "https://raw.githubusercontent.com/leontoddjohnson/i590/main/data/apartments/apartments.csv"

apts <- read_delim(url, delim = ',')
## Rows: 492 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): in_sf, beds, bath, price, year_built, sqft, price_per_sqft, elevation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Transforming Explanatory Variables

Firstly, using price as the response variable and price per square foot as a predictor introduces a problematic dependency: price per square foot is calculated from price (price divided by square footage), so the predictor already contains the response. Any fitted relationship therefore partly reflects that arithmetic rather than the effect of a separately measured predictor.
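
One way to see this coupling is to recompute the ratio from its parts and compare it with the stored column. The sketch below assumes that price_per_sqft was derived as price / sqft, possibly with rounding; the names ppsf_recomputed and ppsf_check are introduced here only for illustration.

```r
library(tidyverse)

# Reuse `apts` from the loading step if it is already in the session;
# otherwise read it again so this block runs on its own.
if (!exists("apts")) {
  apts <- read_csv(
    "https://raw.githubusercontent.com/leontoddjohnson/i590/main/data/apartments/apartments.csv",
    show_col_types = FALSE
  )
}

# Assumption: price_per_sqft = price / sqft (up to rounding).
ppsf_check <- apts |>
  filter(in_sf == 0) |>
  mutate(
    ppsf_recomputed = price / sqft,
    abs_diff = abs(price_per_sqft - ppsf_recomputed)
  ) |>
  summarise(
    max_abs_diff = max(abs_diff, na.rm = TRUE),
    correlation = cor(price_per_sqft, ppsf_recomputed, use = "complete.obs")
  )

ppsf_check
```

A small max_abs_diff and a correlation close to 1 would confirm that the stored column is essentially the ratio of the other two.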

Furthermore, the model omits important predictors of price. Bedroom and bathroom counts are available in the data (beds, bath) but unused, and neighborhood and building amenities are not recorded at all, so the estimates may suffer from omitted-variable bias. The data is also observational rather than randomized, so confounders may drive part of the observed relationship between price and price per square foot.

The R-squared value indicates limited explanatory power: a substantial portion of the variability in price remains unexplained. Causation cannot be inferred from this observational data, because the factors that influence price likely influence price per square foot as well.

Finally, restricting the dataset to NYC apartments (in_sf == 0) makes the sample more consistent but limits how far the results generalize.

# Keep only the in_sf == 0 apartments once and reuse the filtered data
apts_nyc <- filter(apts, in_sf == 0)

# Model price per sqft as a function of price
model <- lm(price_per_sqft ~ price, data = apts_nyc)

# Calculate R-squared
r_squared <- summary(model)$r.squared

# Plot relationship: dashed gray line is the linear fit,
# solid line is the default (loess) smooth
apts_nyc |>
  ggplot(aes(x = price, y = price_per_sqft)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Price per Sq Ft vs Price for Apartments",
    subtitle = sprintf("R-squared: %.3f", r_squared)
  ) +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

  1. Inverting the model so that price per square foot is the response is the more natural framing, because price per square foot is computed from price. In a simple linear regression R-squared is the same whichever variable is the response, so the inversion changes the interpretation, not the fit, and the relationship should still be read as descriptive rather than causal.
  2. Predicting price per square foot has practical value: for a known budget, dividing the budget by the predicted price per square foot gives an estimate of the square footage it can buy. Applying the filter on in_sf == 0 once, rather than repeating it, also keeps the code clear and avoids redundancy.
  3. Using `sprintf()` rather than `paste()` for string formatting keeps the code cleaner and easier to read. The plot title states the relationship being shown, and clear labels make the figure more informative.
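
The budget idea in point 2 can be sketched as follows. The budget of 1,000,000 is a hypothetical value chosen only for illustration, and the names ppsf_model, budget, and budget_estimate are not part of the original analysis.

```r
library(tidyverse)

# Reuse `apts` from the loading step if it is already in the session;
# otherwise read it again so this block runs on its own.
if (!exists("apts")) {
  apts <- read_csv(
    "https://raw.githubusercontent.com/leontoddjohnson/i590/main/data/apartments/apartments.csv",
    show_col_types = FALSE
  )
}

# Same model as above: price per sqft as a function of price
ppsf_model <- lm(price_per_sqft ~ price, data = filter(apts, in_sf == 0))

# Hypothetical budget (an assumption, not a value from the data)
budget <- tibble(price = 1e6)

budget_estimate <- budget |>
  mutate(
    # Predicted price per sqft at this budget
    pred_price_per_sqft = as.numeric(predict(ppsf_model, newdata = budget)),
    # Implied square footage: budget divided by predicted price per sqft
    implied_sqft = price / pred_price_per_sqft
  )

budget_estimate
```

Given the limited explanatory power noted earlier, the implied square footage is a rough guide rather than a precise estimate.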

New Variable

# Add squared price per sqft as a new predictor
apts <- apts |>
  mutate(price_per_sqft_squared = price_per_sqft^2)

# Keep only the in_sf == 0 apartments once and reuse the filtered data
apts_nyc <- filter(apts, in_sf == 0)

# Model price using both original and squared price per sqft
model <- lm(price ~ price_per_sqft_squared + price_per_sqft,
            data = apts_nyc)

# Calculate R squared (this is for the two-term model above)
r_squared <- summary(model)$r.squared

# Plot the relationship. Note: the dashed line is a simple linear fit of
# price on the squared term alone, not the two-term model itself.
apts_nyc |>
  ggplot(aes(x = price_per_sqft_squared, y = price)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Price vs Squared Price per Sq Ft for Apts",
    x = "Price per Sq Ft Squared",
    y = "Price",
    subtitle = sprintf("R-squared: %.3f", r_squared)
  ) +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Filtering on the same condition once, rather than in several places, removes duplication and makes the code more concise. Explicit axis labels tell the reader what the x and y axes represent.

Using sprintf() for the subtitle, rather than paste() with substitutions, keeps the code cleaner. Placing the squared term on the x-axis follows the model specification, but the model also contains the linear price_per_sqft term, so the dashed line in the plot (a linear fit on the squared term alone) is not the fitted model itself, and the R-squared in the subtitle belongs to the two-term model.

An informative title summarizes the relationship the plot shows. Together, clearer variable names, code structure, and plot annotations make the analysis easier to follow and interpret.

Power Transformation on the Response

Residual Plot and QQ plot

The code for this step fits a linear regression model predicting price from price_per_sqft, using only observations where in_sf is 0. The gg_diagnose() function from lindia generates the diagnostic plots, of which only the residuals vs. fitted and Q-Q plots are saved to the plots object. The plot_all() function then displays these two plots with at most one plot per page. Log-transforming the price variable is worth considering to address skewness in the data.
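
The two diagnostics described above might be produced roughly as follows. This sketch swaps in lindia's standalone gg_resfitted() and gg_qqplot() helpers instead of the gg_diagnose() and plot_all() pair named in the text, and the object names res_plot and qq_plot are introduced here for illustration.

```r
library(tidyverse)
library(lindia)

# Reuse `apts` from the loading step if it is already in the session;
# otherwise read it again so this block runs on its own.
if (!exists("apts")) {
  apts <- read_csv(
    "https://raw.githubusercontent.com/leontoddjohnson/i590/main/data/apartments/apartments.csv",
    show_col_types = FALSE
  )
}

# Linear model of price on price_per_sqft for in_sf == 0 only
model <- lm(price ~ price_per_sqft, data = dplyr::filter(apts, in_sf == 0))

# Residuals vs. fitted values: look for curvature or a funnel shape
res_plot <- gg_resfitted(model) + theme_classic()

# Normal Q-Q plot of the residuals: look for departures from the line
qq_plot <- gg_qqplot(model) + theme_classic()

print(res_plot)
print(qq_plot)
```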

Further diagnostic plots, such as scale-location, would give more insight into whether the model meets the regression assumptions, in particular constant variance. The filter on in_sf shows that the data contains two groups of apartments, which suggests either fitting a separate model for each group or including in_sf as a predictor. Adding labels and titles to the diagnostic plots would also make them easier to interpret.
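
Two of these suggestions can be sketched together: a titled scale-location plot, and a model that keeps both groups and includes in_sf as a predictor. The interaction form price_per_sqft * in_sf is one possible choice made here for illustration, not something the original analysis specifies, and the names model_nyc, scale_loc_plot, and model_both are likewise illustrative.

```r
library(tidyverse)
library(broom)
library(lindia)

# Reuse `apts` from the loading step if it is already in the session;
# otherwise read it again so this block runs on its own.
if (!exists("apts")) {
  apts <- read_csv(
    "https://raw.githubusercontent.com/leontoddjohnson/i590/main/data/apartments/apartments.csv",
    show_col_types = FALSE
  )
}

# Scale-location plot for the filtered model, with an explicit title
model_nyc <- lm(price ~ price_per_sqft, data = dplyr::filter(apts, in_sf == 0))
scale_loc_plot <- gg_scalelocation(model_nyc) +
  labs(title = "Scale-Location: price ~ price_per_sqft (in_sf == 0)")
print(scale_loc_plot)

# Alternative to filtering: use all rows and let in_sf shift both the
# intercept and the slope (assumed specification, for illustration only)
model_both <- lm(price ~ price_per_sqft * in_sf, data = apts)
tidy(model_both)
```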

Box-Cox Transformation

The code uses the powerTransform() function from the car package to find a suitable power transformation of the response in the linear regression model named model. Specifying family = "bcPower" requests a Box-Cox power transformation. The result is stored in the object pT, and pT$lambda extracts the estimated optimal lambda, which in this case is 0.05507421. Because a Box-Cox lambda of 0 corresponds to the log transformation, an estimate this close to 0 supports log-transforming price.
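
A minimal sketch of the step described above, assuming the model is the same price ~ price_per_sqft regression on the in_sf == 0 rows used for the diagnostic plots:

```r
library(tidyverse)
library(car)

# Reuse `apts` from the loading step if it is already in the session;
# otherwise read it again so this block runs on its own.
if (!exists("apts")) {
  apts <- read_csv(
    "https://raw.githubusercontent.com/leontoddjohnson/i590/main/data/apartments/apartments.csv",
    show_col_types = FALSE
  )
}

# Assumed model: price on price_per_sqft for in_sf == 0 only
model <- lm(price ~ price_per_sqft, data = dplyr::filter(apts, in_sf == 0))

# Estimate the Box-Cox power for the response of this model
pT <- powerTransform(model, family = "bcPower")

# Estimated lambda only
pT$lambda

# Full summary, including tests of lambda = 0 (log) and lambda = 1 (none)
summary(pT)
```

The summary is useful alongside the point estimate because it reports whether the log transformation (lambda = 0) is statistically distinguishable from the estimated power.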

Log Transform

To address skewness, price is log-transformed into a new variable named log_price. A linear model is then fitted predicting log_price from price_per_sqft, using only observations where in_sf equals 0, and the R-squared value is taken from the model summary. A scatterplot of log_price against price_per_sqft shows the relationship, with a linear regression line and custom labels that include the R-squared value.
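
The steps in this section could look roughly like the sketch below. The object names apts_nyc, log_model, and log_plot and the exact label text are assumptions; only log_price, price_per_sqft, and the in_sf filter come from the description above.

```r
library(tidyverse)

# Reuse `apts` from the loading step if it is already in the session;
# otherwise read it again so this block runs on its own.
if (!exists("apts")) {
  apts <- read_csv(
    "https://raw.githubusercontent.com/leontoddjohnson/i590/main/data/apartments/apartments.csv",
    show_col_types = FALSE
  )
}

# Filter once and add the log-transformed response
apts_nyc <- apts |>
  filter(in_sf == 0) |>
  mutate(log_price = log(price))

# Linear model of log(price) on price per sqft
log_model <- lm(log_price ~ price_per_sqft, data = apts_nyc)
r_squared <- summary(log_model)$r.squared

# Scatterplot with the linear fit and R-squared in the subtitle
log_plot <- ggplot(apts_nyc, aes(x = price_per_sqft, y = log_price)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    title = "Log Price vs Price per Sq Ft for Apartments",
    subtitle = sprintf("R-squared: %.3f", r_squared),
    x = "Price per Sq Ft",
    y = "log(Price)"
  ) +
  theme_classic()

print(log_plot)
```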

Poisson Transform

Plotting the distribution of sqft before and after the transformation would show the transformation's effect. Additional diagnostics, such as residuals vs. fitted values, are recommended to validate the smoothing line, and the smoothing method (loess, gam, or another) should be stated explicitly. The plot itself could be improved by making the points semi-transparent, faceting by subset, or encoding a variable with color. Finally, model validation techniques such as a training/test split would give a more rigorous evaluation of the model's performance.
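
The training/test split mentioned last could be sketched as below. The model formula (log_price ~ price_per_sqft), the 80/20 split, and the seed are all assumptions made for illustration; the same pattern applies to whichever model is being evaluated.

```r
library(tidyverse)

# Reuse `apts` from the loading step if it is already in the session;
# otherwise read it again so this block runs on its own.
if (!exists("apts")) {
  apts <- read_csv(
    "https://raw.githubusercontent.com/leontoddjohnson/i590/main/data/apartments/apartments.csv",
    show_col_types = FALSE
  )
}

apts_nyc <- apts |>
  filter(in_sf == 0) |>
  mutate(log_price = log(price))

# Arbitrary seed so the split is reproducible
set.seed(1)

# Hold out 20% of the rows for testing (assumed split ratio)
n <- nrow(apts_nyc)
train_idx <- sample(n, size = floor(0.8 * n))
train <- apts_nyc[train_idx, ]
test <- apts_nyc[-train_idx, ]

# Fit on the training rows only
fit <- lm(log_price ~ price_per_sqft, data = train)

# Root mean squared error on the held-out rows (on the log scale)
test_pred <- predict(fit, newdata = test)
test_rmse <- sqrt(mean((test$log_price - test_pred)^2))
test_rmse
```

Comparing test_rmse with the same quantity computed on the training rows gives a rough indication of overfitting.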