R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document is generated that includes both the written content and the output of any embedded R code chunks. You can embed an R code chunk like this:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(broom)
library(lindia)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
url <- "https://raw.githubusercontent.com/leontoddjohnson/i590/main/data/apartments/apartments.csv"

apts <- read_delim(url, delim = ',')
## Rows: 492 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): in_sf, beds, bath, price, year_built, sqft, price_per_sqft, elevation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Transforming Explanatory Variables

Issues with the model:

  • Using price as the response and price per square foot as the predictor is problematic because price per square foot is computed from price. The predictor is partly a function of the response, so some of the fitted relationship is built in by construction rather than learned from the data.

  • The model does not account for other potential predictors of price, such as neighborhood, number of bedrooms or bathrooms, and building amenities. Omitting these variables could bias the results.

  • The data are observational rather than randomized, so underlying confounders may influence the apparent relationship between price and price per square foot.

Issues with interpretation:

  • The R-squared value suggests the model explains only part of the variance in price. Much of the variability is unaccounted for.

  • We cannot infer causation from this observational data. Factors driving both price and price per square foot are likely intertwined.

Issues with analysis:

  • Restricting to NYC apartments improves consistency, but limits generalizability. Testing across other cities would be needed to support broader claims.

  • Residual plots and other diagnostics should be used to validate model assumptions before finalizing the analysis.

Issues with visualization:

  • Plots could be enhanced by coloring points by neighborhood or by a categorical variable such as bedroom count. This would reveal subgroups and trends.

  • Axis scales could be adjusted (for example, with a log scale) so that the data points are spread more evenly across the plot.

  • Faceting by neighborhood or other categories could also reveal differences in the relationship.

Potential solutions:

  • Collect data on other apartment features to build a more robust predictive model using regression or machine learning approaches.

  • Perform experiments and randomized controlled trials to better isolate causal connections between factors.

  • Enhance geographic diversity of data to improve generalizability.

  • Leverage domain expertise from real estate analysts to identify other important variables.

  • Apply formal statistical testing to quantify uncertainty in estimates and support conclusions.

The key is identifying limitations in the current analysis and finding constructive ways to expand, strengthen, and diversify the modeling approach. More data and methodological rigor would help develop defensible, nuanced conclusions.
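The first modeling issue above, that price per square foot is built from price, can be illustrated with a small sketch. It assumes price_per_sqft is simply price divided by sqft and uses a made-up data frame (`toy` and all of its values are hypothetical, not the apartments data). Under that assumption, the three columns satisfy an exact identity on the log scale, so a regression recovers slopes of 1 with no error at all.

```r
set.seed(5)

# Made-up listings: sqft and price_per_sqft are drawn independently
toy <- data.frame(
  sqft = runif(150, 400, 3000),
  price_per_sqft = runif(150, 300, 2000)
)

# Assumption: price is defined from the other two columns
toy$price <- toy$price_per_sqft * toy$sqft

# On the log scale the definition is exactly additive:
# log(price) = log(price_per_sqft) + log(sqft)
fit_identity <- lm(log(price) ~ log(price_per_sqft) + log(sqft), data = toy)

# Both slopes come out as 1 (up to floating-point error)
slopes <- unname(coef(fit_identity)[c("log(price_per_sqft)", "log(sqft)")])
print(slopes)
```

A fit like this says nothing about the housing market; it only restates the definition of the predictor, which is why the direction and choice of variables in the models here deserve care.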

# Model price per sqft as a function of price
model <- lm(price_per_sqft ~ price, 
            filter(apts, in_sf == 0))

# Calculate R-squared
r_squared <- summary(model)$r.squared

# Plot relationship  
apts |>  
  filter(in_sf == 0) |> 
  ggplot(aes(x = price, y = price_per_sqft)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', se = FALSE) + 
  geom_smooth(se = FALSE) +
  labs(
    title = "Price per Sq Ft vs Price for Apartments", 
    subtitle = sprintf("R-squared: %.3f", r_squared)
  ) +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

1. Flipping the model:

  • The revised model predicts price per sqft from price rather than the reverse, on the view that the less stable variable should be predicted from the more stable one.

  • Price per sqft is computed from price, not the other way around, so modeling it in this direction matches how the variable is constructed.

  • Predicting price per sqft is more useful in practice: if you know the price you want to pay, dividing it by the predicted price per sqft gives an estimate of the square footage.

2. Single filter:

  • Filtering twice on the same condition is redundant.

  • Doing it once reduces duplicate code.

3. sprintf vs paste:

  • sprintf() is cleaner than paste() for formatting strings.

  • Rounding and formatting happen in one call, which makes the code more readable.

4. Plot title:

  • The revised title makes the relationship being shown clearer at a glance.

  • Good labels quickly orient the reader to the purpose of the visual.
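The sprintf() versus paste() point in the list above can be checked directly. This is a minimal sketch using a made-up number (`r_squared_example` is hypothetical and is not a result from the apartments model).

```r
# Hypothetical value for illustration only; not taken from the fitted model
r_squared_example <- 0.123456

# sprintf() rounds and formats in a single call
label_sprintf <- sprintf("R-squared: %.3f", r_squared_example)

# paste() needs a separate round() call to produce the same text
label_paste <- paste("R-squared:", round(r_squared_example, 3))

print(label_sprintf)
print(label_paste)
```

One practical difference: "%.3f" always prints three decimals (0.5 becomes "0.500"), while round() followed by paste() drops trailing zeros ("0.5"), so sprintf() keeps subtitles a consistent width across plots.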

## New Variable

# Add squared price per sqft as a new predictor
apts <- apts |>
  mutate(price_per_sqft_squared = price_per_sqft^2) 

# Model price using both original and squared price per sqft  
model <- lm(price ~ price_per_sqft_squared + price_per_sqft,
            filter(apts, in_sf == 0))

# Calculate R squared
r_squared <- summary(model)$r.squared 

# Plot the relationship
apts |>
  filter(in_sf == 0) |> 
  ggplot(aes(x = price_per_sqft_squared, y = price)) +
  geom_point() + 
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Price vs Squared Price per Sq Ft for Apts",
    x = "Price per Sq Ft Squared",
    y = "Price",
    subtitle = sprintf("R-squared: %.3f", r_squared)
  ) +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

1. More descriptive variable names:

  • Using price_per_sqft_squared instead of price_per_sqft_2 makes it clearer what the variable represents.

  • More descriptive names improve readability and model interpretability.

2. Single filter call:

  • Filtering multiple times on the same condition causes duplication.

  • Removing the duplicate filter call makes the code more concise.

3. Axis labels:

  • Labeling x and y axes makes the plot more understandable for readers.

  • It’s clearer what is being represented on each axis.

4. sprintf() for subtitle:

  • sprintf() allows incorporating the R-squared value directly in the subtitle text.

  • This is cleaner than using paste() and substitutions.

5. Squared term on x-axis:

  • Putting the squared term on the x-axis makes the plot consistent with the model specification.

  • It shows the relationship we are modeling more clearly.

6. More informative title:

  • The updated title summarizes the key relationship being shown.

  • This helps orient readers to the plot’s purpose.
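As a side note on the squared predictor, the same model can also be written without creating a new column, by squaring inside the formula with I(). The sketch below uses made-up data (the `toy` data frame and its coefficients are assumptions for illustration) and base R only, and checks that the two specifications give the same fit.

```r
set.seed(1)

# Synthetic stand-in for the apartments data (all values are made up)
toy <- data.frame(price_per_sqft = runif(100, 300, 2000))
toy$price <- 5e4 + 400 * toy$price_per_sqft +
  0.5 * toy$price_per_sqft^2 + rnorm(100, sd = 1e5)

# Approach used above: store the squared term as its own column
toy$price_per_sqft_squared <- toy$price_per_sqft^2
fit_column <- lm(price ~ price_per_sqft_squared + price_per_sqft, data = toy)

# Alternative: square the predictor inside the formula with I()
fit_inline <- lm(price ~ I(price_per_sqft^2) + price_per_sqft, data = toy)

# Both formulas describe the same model, so R-squared should agree
r2_column <- summary(fit_column)$r.squared
r2_inline <- summary(fit_inline)$r.squared
print(c(column = r2_column, inline = r2_inline))
```

The explicit column is convenient when the squared term is also needed as a plot axis, as it is here; the I() form keeps the data frame unchanged when the term is only needed in the model.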

### Power Transformation on the Response

## Residual Plot and QQ Plot

The code fits a linear regression model predicting price from price_per_sqft, filtering to only include observations where in_sf is 0. gg_diagnose is used to generate diagnostic plots for the model, but only the residuals vs fitted plot and the Q-Q plot are saved to the plots object. plot_all is used to display these two plots, with a maximum of 1 plot per page.

Some potential improvements:

  • The model formula implies a linear relationship between price and price_per_sqft. It may be better to log-transform price to account for skewness.

  • Additional diagnostic plots, such as scale-location, could be included to check regression assumptions.

  • The filter on in_sf suggests there may be two related samples in the data. Consider fitting separate models or including in_sf as a predictor.

  • Labels and titles could be added to the plots to make them more interpretable.
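The two diagnostics described above can be sketched without lindia. This version swaps in base R plotting and made-up data (the `toy` data frame and its coefficients are assumptions), so it shows the quantities behind a residuals vs fitted plot and a Q-Q plot rather than reproducing the gg_diagnose output.

```r
set.seed(42)

# Made-up data standing in for the in_sf == 0 apartments
toy <- data.frame(price_per_sqft = runif(200, 300, 2000))
toy$price <- 1e5 + 900 * toy$price_per_sqft + rnorm(200, sd = 1.5e5)

model_toy <- lm(price ~ price_per_sqft, data = toy)

# Collect the values that the two diagnostic plots are built from
diagnostics <- data.frame(
  fitted = fitted(model_toy),
  residual = residuals(model_toy),
  std_residual = rstandard(model_toy)
)

# Residuals vs fitted: look for curvature or a funnel shape
plot(diagnostics$fitted, diagnostics$residual,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points far from the line suggest non-normal residuals
qqnorm(diagnostics$std_residual, main = "Normal Q-Q")
qqline(diagnostics$std_residual)
```

Because the toy errors are drawn from a normal distribution with constant variance, both plots should look unremarkable here; the value of the diagnostics comes from comparing that baseline with what the real model produces.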

## BC Transformation

The code uses the powerTransform() function from the car package to identify an appropriate power transformation for the linear regression model 'model'. The family = "bcPower" argument specifies that a Box-Cox power transformation should be applied. The result of powerTransform() is stored in an object pT, and pT$lambda extracts the estimated optimal power, which here is 0.05507421. Because a Box-Cox power of 0 corresponds to the log transformation, a value this close to 0 suggests log-transforming price.
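To make the role of lambda more concrete, here is a minimal base R sketch of the idea behind a Box-Cox search: transform the response for a grid of lambda values and keep the one with the highest profile log-likelihood. This is an illustration under stated assumptions, not car's implementation. The data are made up so that log(y) is linear in x, and the helper names `box_cox` and `profile_loglik` are hypothetical.

```r
set.seed(7)

# Made-up data in which log(y) is linear in x, so a lambda near 0 is expected
x <- runif(200, 0, 4)
y <- exp(1 + 0.5 * x + rnorm(200, sd = 0.2))

# Box-Cox transform of a positive response for a given lambda
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Profile log-likelihood of lambda for the regression of y on x
# (up to an additive constant)
profile_loglik <- function(lambda) {
  fit <- lm(box_cox(y, lambda) ~ x)
  n <- length(y)
  rss <- sum(residuals(fit)^2)
  -n / 2 * log(rss / n) + (lambda - 1) * sum(log(y))
}

# Grid search over candidate powers
lambdas <- seq(-2, 2, by = 0.01)
loglik <- sapply(lambdas, profile_loglik)
lambda_hat <- lambdas[which.max(loglik)]
print(lambda_hat)
```

A lambda of 1 leaves the response essentially unchanged, 0.5 behaves like a square root, and 0 is the log, so an estimate near 0 on real data points toward a log-transformed response.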

Log Transform

The code transforms the price variable by taking its natural log, creating a new variable log_price; this reduces the skewness in price. It then fits a linear model predicting log_price from price_per_sqft on the observations where in_sf == 0, and extracts the R-squared value from the model summary. Finally, it draws a scatterplot of price_per_sqft and log_price with a linear regression line, and customizes the plot labels to include the R-squared value.
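The steps described above can be sketched as follows. The sketch uses base R and a made-up data frame whose column names mirror the apartments data (`toy`, its values, and the coefficients used to generate it are all assumptions), and it leaves out the plot.

```r
set.seed(3)

# Synthetic stand-in for the apartments data; values are made up
toy <- data.frame(
  in_sf = rep(c(0, 1), each = 100),
  price_per_sqft = runif(200, 300, 2000)
)
toy$price <- exp(12 + 0.001 * toy$price_per_sqft + rnorm(200, sd = 0.3))

# Log-transform the response
toy$log_price <- log(toy$price)

# Fit on the in_sf == 0 subset only, as in the analysis described above
toy_subset <- subset(toy, in_sf == 0)
model_log <- lm(log_price ~ price_per_sqft, data = toy_subset)
r_squared_log <- summary(model_log)$r.squared

# A slope b on the log scale means each extra unit of price_per_sqft
# multiplies the predicted price by roughly exp(b)
slope <- coef(model_log)[["price_per_sqft"]]
print(sprintf("R-squared: %.3f", r_squared_log))
print(exp(slope))
```

Note that this R-squared measures variance explained in log_price, so it is not directly comparable to the R-squared of a model fitted to price on its original scale.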

Poisson Transform

The distribution of sqft could be visualized before and after transforming to show the impact. Additional diagnostics, such as residuals vs fitted values, could help validate the smoothing line. The smoothing method (loess, gam, etc.) should be specified explicitly for clarity. The plot could be enhanced with transparency in the points, faceting for subsets, or color encoding a variable. Model validation techniques, such as training/test splits, would make the analysis more rigorous.
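The training/test split mentioned above could look something like the following. This is a sketch only: the data are made up (`toy`, the 70/30 split, and the log(price) ~ sqft model are assumptions chosen for illustration), and it uses base R rather than a modeling framework.

```r
set.seed(11)

# Made-up data standing in for the apartments listings
toy <- data.frame(sqft = runif(300, 400, 3000))
toy$price <- exp(12 + 0.0006 * toy$sqft + rnorm(300, sd = 0.25))

# Hold out 30% of the rows for testing
test_idx <- sample(nrow(toy), size = round(0.3 * nrow(toy)))
train <- toy[-test_idx, ]
test <- toy[test_idx, ]

# Fit on the training rows only
fit <- lm(log(price) ~ sqft, data = train)

# RMSE on the log scale, computed on rows the model never saw
pred <- predict(fit, newdata = test)
rmse_test <- sqrt(mean((log(test$price) - pred)^2))
print(rmse_test)
```

Comparing this held-out error with the error on the training rows gives a rough check on whether the model generalizes or is fitting noise in the sample.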