library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(broom)
library(lindia)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

Loading the dataset

url <- "https://raw.githubusercontent.com/leontoddjohnson/i590/main/data/apartments/apartments.csv"

apts <- read_delim(url, delim = ',')
## Rows: 492 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): in_sf, beds, bath, price, year_built, sqft, price_per_sqft, elevation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Transforming Explanatory Variables

Using property price as the response with price per square foot as the sole predictor is problematic, because price is built into the price-per-square-foot calculation. The two variables are therefore mechanically linked, and the independence assumption is hard to defend. The model also leaves out important factors such as neighborhood, bedroom and bathroom counts, and building amenities, which can bias the estimated relationship. In addition, the listings are not a random sample, so unobserved factors may be driving part of the relationship between price and price per square foot.
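To make the dependence concrete, here is a minimal sketch on synthetic data. The `toy` data frame and all of its values are invented for illustration and are not taken from `apts`. It also assumes that price per square foot is computed as price / sqft, which the column names suggest but the source does not state.

```r
# Synthetic stand-in for the apartment data (all values are invented).
set.seed(1)
toy <- data.frame(
  price = runif(100, 5e5, 3e6),   # listing price
  sqft  = runif(100, 400, 2500)   # floor area
)

# Assumption: price per square foot is defined as price / sqft.
toy$price_per_sqft <- toy$price / toy$sqft

# price now sits on both sides of the regression: any noise in price
# moves the response and the predictor together.
fit <- lm(price ~ price_per_sqft, data = toy)
summary(fit)$r.squared

# The response can be rebuilt exactly from the predictor and sqft.
all.equal(toy$price, toy$price_per_sqft * toy$sqft)
```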

The low R-squared value shows that the model explains little of the variation in prices, so most of the price variability remains unexplained. Because the data are observational and the factors influencing price and price per square foot are intertwined, the fit does not support conclusions about causation.

Moreover, the analysis is limited to New York City apartments, which keeps the sample consistent but limits how far the results generalize. A more robust analysis would consider additional factors and possibly extend the study to other locations.

# Model price per sqft as a function of price
model <- lm(price_per_sqft ~ price, 
            filter(apts, in_sf == 0))

# Calculate R-squared
r_squared <- summary(model)$r.squared

# Plot relationship  
apts |>  
  filter(in_sf == 0) |> 
  ggplot(aes(x = price, y = price_per_sqft)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', se = FALSE) + 
  geom_smooth(se = FALSE) +
  labs(
    title = "Price per Sq Ft vs Price for Apartments", 
    subtitle = sprintf("R-squared: %.3f", r_squared)
  ) +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

  1. Switching the model to predict price per square foot from price is more consistent with how the two variables are related, since price per square foot is derived from the overall price. The reasoning is that modeling a derived quantity from the quantity it depends on tends to give more interpretable results.
  2. Predicting price per square foot is also practically useful: given a known budget, dividing the budget by the predicted price per square foot gives an estimate of the square footage a buyer can expect. On the code side, filtering once on the relevant condition avoids repetition and makes the code easier to read.
  3. Using `sprintf()` rather than `paste()` for the subtitle keeps the string formatting compact and fixes the number of decimal places shown. The revised plot title states the relationship being plotted, and clear labels make the figure easier to interpret.
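The budget-to-square-footage idea in item 2 can be sketched as follows. The data are synthetic and the `budget` value is hypothetical. The only point is the arithmetic: estimated sqft = budget / predicted price per sqft.

```r
# Synthetic stand-in data (invented values, not the real listings).
set.seed(2)
toy <- data.frame(sqft = runif(200, 400, 2500))
toy$price <- 800 * toy$sqft + 3e5 + rnorm(200, sd = 1e5)
toy$price_per_sqft <- toy$price / toy$sqft

# Same model form as in the document: price per sqft as a function of price.
model <- lm(price_per_sqft ~ price, data = toy)

# Hypothetical budget for illustration.
budget <- 1.5e6

# Predicted price per sqft at that budget, then the implied floor area.
pred_ppsf <- predict(model, newdata = data.frame(price = budget))
est_sqft  <- unname(budget / pred_ppsf)
est_sqft
```

The estimate is only as good as the fitted model, so with a low R-squared it should be read as a rough guide.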

New Variable

# Add squared price per sqft as a new predictor
apts <- apts |>
  mutate(price_per_sqft_squared = price_per_sqft^2) 

# Model price using both original and squared price per sqft  
model <- lm(price ~ price_per_sqft_squared + price_per_sqft,
            filter(apts, in_sf == 0))

# Calculate R squared
r_squared <- summary(model)$r.squared 

# Plot the relationship
apts |>
  filter(in_sf == 0) |> 
  ggplot(aes(x = price_per_sqft_squared, y = price)) +
  geom_point() + 
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Price vs Squared Price per Sq Ft for Apts",
    x = "Price per Sq Ft Squared",
    y = "Price",
    subtitle = sprintf("R-squared: %.3f", r_squared)
  ) +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'


Combining filter calls that share the same condition removes repetition. When two filters apply the same condition, one is enough, and the code becomes shorter and easier to follow.
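One way to apply this, sketched on synthetic data, is to filter once, store the subset, and reuse it for both the model and the plot. The name `nyc` and the data values are assumptions made for illustration.

```r
library(dplyr)

# Synthetic stand-in data (invented values).
set.seed(3)
toy <- data.frame(
  in_sf          = rep(c(0, 1), each = 50),
  price_per_sqft = runif(100, 500, 2000)
)
toy$price <- 900 * toy$price_per_sqft + rnorm(100, sd = 1e5)

# Filter once and keep the result (nyc is a hypothetical name).
nyc <- filter(toy, in_sf == 0)

# The same subset feeds the model...
model <- lm(price ~ price_per_sqft, data = nyc)
r_squared <- summary(model)$r.squared

# ...and can be passed straight to ggplot(nyc, ...) with no second filter().
nrow(nyc)
```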

Clear axis labels tell the reader what each axis measures, so the plot can be read without referring back to the code.

Using sprintf() for the subtitle instead of paste() inserts the R-squared value into the string in a single call with a fixed number of decimal places, which keeps the code cleaner and more organized.
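A short comparison of the two approaches, using a hypothetical R-squared value rather than one from the fitted model:

```r
# Hypothetical value for illustration only.
r_squared <- 0.123456

# sprintf() formats and rounds in one step.
s1 <- sprintf("R-squared: %.3f", r_squared)

# paste() needs a separate round() call to get the same text.
s2 <- paste("R-squared:", round(r_squared, 3))

s1
s2

# The two differ when the value has trailing zeros:
sprintf("R-squared: %.3f", 0.5)      # keeps three decimals
paste("R-squared:", round(0.5, 3))   # drops the trailing zeros
```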

Putting the squared term on the x-axis matches the model specification, where price_per_sqft_squared is the first predictor, so the plot and the model read consistently. Note that the dashed line drawn by geom_smooth(method = 'lm') is fit on the squared term alone, while the R-squared in the subtitle comes from the model with both terms.

A well-chosen title summarizes what the plot shows and tells readers what to expect. Together, these refinements in variable naming, code structure, and plot annotations make the analysis easier to read and interpret.

Power Transformation on the Response

Residual Plot and QQ plot

The code for this step fits a linear regression model that predicts price from price_per_sqft, restricted to cases where in_sf is 0. The diagnostic plots, specifically residuals versus fitted values and the Q-Q plot, are generated with gg_diagnose and stored in the plots object. The plot_all function then displays them with at most one plot per page. One possible improvement is to log-transform the price variable, which can reduce skewness in the data.
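The code for this step is not shown, so the following is only a sketch of what it might look like, run on synthetic data. Instead of selecting plots out of gg_diagnose, it calls lindia's individual plotting functions gg_resfitted and gg_qqplot directly. That choice, the `toy` data, and the name `nyc` are assumptions for illustration.

```r
library(lindia)

# Synthetic stand-in data with a right-skewed price (invented values).
set.seed(4)
toy <- data.frame(
  in_sf          = rep(c(0, 1), each = 100),
  price_per_sqft = runif(200, 500, 2000)
)
toy$price <- exp(13 + 0.0005 * toy$price_per_sqft + rnorm(200, sd = 0.4))

nyc   <- subset(toy, in_sf == 0)
model <- lm(price ~ price_per_sqft, data = nyc)

# Build only the two diagnostics discussed in the text.
plots <- list(
  res_fitted = gg_resfitted(model),  # residuals vs fitted values
  qqplot     = gg_qqplot(model)      # normal Q-Q plot of residuals
)

# Show them one plot per page.
plot_all(plots, max.per.page = 1)
```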

For a more complete evaluation, additional diagnostic plots such as the scale-location plot can show how well the model meets the regression assumptions, in particular constant variance. The filter on in_sf suggests the data contain two related groups of samples, so it is worth considering either separate models for each group or including in_sf as a predictor. Adding labels and titles to the plots would also make them easier to interpret.
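As a rough sketch of the second suggestion, the model below keeps all rows and adds in_sf as a categorical predictor, then draws a scale-location plot with base R's plot method for lm objects. The data are synthetic and the name `pooled` is hypothetical.

```r
# Synthetic stand-in data (invented values).
set.seed(6)
toy <- data.frame(
  in_sf          = rep(c(0, 1), each = 100),
  price_per_sqft = runif(200, 500, 2000)
)
toy$price <- 900 * toy$price_per_sqft + 2e5 * toy$in_sf + rnorm(200, sd = 1e5)

# One model for both cities, with in_sf as a factor instead of a filter.
pooled <- lm(price ~ price_per_sqft + factor(in_sf), data = toy)
coef(pooled)

# which = 3 selects the scale-location plot
# (square root of |standardized residuals| against fitted values).
plot(pooled, which = 3)
```

Fitting separate models per city would instead allow the slope on price_per_sqft to differ between the two groups.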

BC Transformation

Here, the powerTransform() function from the car package estimates the best power transformation for the linear regression model, named model. Setting family = "bcPower" requests a Box-Cox power transformation. The result is stored in pT, and pT$lambda holds the estimated transformation parameter, which in this case is 0.05507421. Because this estimate is close to 0, where the Box-Cox family reduces to the natural log, it suggests a log transformation of the response.
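The call described above is not shown in the source, so here is a minimal sketch of it on synthetic data. The response is simulated as log-normal, so the estimated lambda should land near 0. The data and the manual transformation step are assumptions for illustration.

```r
library(car)

# Synthetic stand-in data: a log-normal response (invented values).
set.seed(5)
toy <- data.frame(price_per_sqft = runif(200, 500, 2000))
toy$price <- exp(13 + 0.001 * toy$price_per_sqft + rnorm(200, sd = 0.3))

model <- lm(price ~ price_per_sqft, data = toy)

# Estimate the Box-Cox power for the response of the fitted model.
pT <- powerTransform(model, family = "bcPower")
lambda <- unname(pT$lambda)
lambda

# Apply the Box-Cox transform by hand:
# (y^lambda - 1) / lambda, or log(y) when lambda is (numerically) 0.
price_bc <- if (abs(lambda) < 1e-6) {
  log(toy$price)
} else {
  (toy$price^lambda - 1) / lambda
}
```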

Log Transform

The price variable is log-transformed into a new variable, log_price, to address skewness. A linear model then predicts log_price from price_per_sqft, again for cases where in_sf equals 0, and the R-squared value is extracted from the model summary as a measure of the model's explanatory power. Finally, a scatterplot shows the relationship between price_per_sqft and log_price, with a linear regression line and labels that include the R-squared value.
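The steps described above could look roughly like this. The sketch follows the pattern of the earlier code in this document but runs on synthetic data. The plot title, the name `nyc`, and the data values are assumptions.

```r
library(dplyr)
library(ggplot2)

# Synthetic stand-in data with a right-skewed price (invented values).
set.seed(7)
toy <- data.frame(
  in_sf          = rep(c(0, 1), each = 100),
  price_per_sqft = runif(200, 500, 2000)
)
toy$price <- exp(13 + 0.001 * toy$price_per_sqft + rnorm(200, sd = 0.3))

# Add the log-transformed response and keep the in_sf == 0 rows.
nyc <- toy |>
  mutate(log_price = log(price)) |>
  filter(in_sf == 0)

# Model log price as a function of price per sqft.
model <- lm(log_price ~ price_per_sqft, data = nyc)
r_squared <- summary(model)$r.squared

# Scatterplot with the linear fit and R-squared in the subtitle.
p <- ggplot(nyc, aes(x = price_per_sqft, y = log_price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Log Price vs Price per Sq Ft",
    x = "Price per Sq Ft",
    y = "log(Price)",
    subtitle = sprintf("R-squared: %.3f", r_squared)
  ) +
  theme_classic()
p
```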

Poisson Transform

Comparing the distribution of sqft before and after the transformation would show the transformation's effect directly. Additional checks, such as a plot of residuals versus fitted values, would help confirm that the smoothing line fits adequately. Stating the smoothing method explicitly, whether loess, gam, or another, would make the analysis easier to follow. The plot could be improved by making points more transparent, splitting the data into subsets for clearer comparison, or using color to represent another variable. Finally, model validation such as a training/test split would give a more reliable assessment of how well the model performs on data it has not seen.
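A training/test split of the kind suggested above can be sketched in base R as follows. The data are synthetic, and both the 80/20 split and the log(price) model are assumptions chosen for illustration, not the model used in this section.

```r
# Synthetic stand-in data (invented values).
set.seed(8)
toy <- data.frame(price_per_sqft = runif(300, 500, 2000))
toy$price <- exp(13 + 0.001 * toy$price_per_sqft + rnorm(300, sd = 0.3))

# Hold out 20% of the rows for testing (the split ratio is an assumption).
n         <- nrow(toy)
train_idx <- sample(n, size = floor(0.8 * n))
train     <- toy[train_idx, ]
test      <- toy[-train_idx, ]

# Fit on the training rows only.
model <- lm(log(price) ~ price_per_sqft, data = train)

# Evaluate on the held-out rows: root mean squared error on the log scale.
pred <- predict(model, newdata = test)
rmse <- sqrt(mean((log(test$price) - pred)^2))
rmse
```

A test error much larger than the training error would point to overfitting.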