Introduction:

The analysis conducted in Week 11 aimed to develop a predictive model for determining the price of apartments based on several key variables, including year built, square footage, elevation, and number of bedrooms. While this initial exploration provided valuable insights, several critical questions arise regarding the selection of variables, the assumptions underlying the linear model, the generalizability of conclusions, and the potential implications for real-world deployment. In this extended analysis, we delve deeper into these aspects, employing advanced statistical techniques and visualizations to enhance our understanding and refine the predictive model.

# Required libraries
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.3.3
# Setting theme and options
theme_set(theme_minimal())
options(scipen = 6)

# Loading earthquake data
quakes <- read_csv("https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/quakes/quakes.csv")
## Rows: 18334 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (8): magType, net, id, place, type, status, locationSource, magSource
## dbl  (12): latitude, longitude, depth, mag, nst, gap, dmin, rms, horizontalE...
## dttm  (2): time, updated
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check the structure of the 'quakes' dataset
str(quakes)
## spc_tbl_ [18,334 × 22] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ time           : POSIXct[1:18334], format: "2013-01-01 03:51:13" "2013-01-01 07:35:49" ...
##  $ latitude       : num [1:18334] -20.81 46.87 -15.84 -1.56 -16.45 ...
##  $ longitude      : num [1:18334] -69.7 151.1 -172.1 127.4 -173.4 ...
##  $ depth          : num [1:18334] 56.1 35 10 26.3 35 129 66.8 15 78.9 12.3 ...
##  $ mag            : num [1:18334] 5.1 5.1 5 5.6 5.1 5 5 5 5.1 5.2 ...
##  $ magType        : chr [1:18334] "mb" "mwb" "mb" "mwb" ...
##  $ nst            : num [1:18334] 64 519 199 139 194 102 177 48 139 235 ...
##  $ gap            : num [1:18334] 109.1 46.6 33 36.5 69.8 ...
##  $ dmin           : num [1:18334] NA NA NA NA NA NA NA NA NA NA ...
##  $ rms            : num [1:18334] NA 0.61 0.93 1.35 0.82 0.86 0.95 0.92 0.93 1.03 ...
##  $ net            : chr [1:18334] "us" "us" "us" "us" ...
##  $ id             : chr [1:18334] "usp000jxpn" "usp000jxpv" "usc000eihe" "usp000jxrn" ...
##  $ updated        : POSIXct[1:18334], format: "2014-11-07 01:49:43" "2022-05-03 16:02:39" ...
##  $ place          : chr [1:18334] "83 km SE of Iquique, Chile" "Kuril Islands" "177 km E of Hihifo, Tonga" "251 km NNW of Ambon, Indonesia" ...
##  $ type           : chr [1:18334] "earthquake" "earthquake" "earthquake" "earthquake" ...
##  $ horizontalError: num [1:18334] NA NA NA NA NA NA NA NA NA NA ...
##  $ depthError     : num [1:18334] NA NA NA 12.8 NA 4.7 4.3 0 8 3.5 ...
##  $ magError       : num [1:18334] NA NA NA NA NA NA NA NA NA NA ...
##  $ magNst         : num [1:18334] 18 NA 109 NA 123 59 114 33 64 56 ...
##  $ status         : chr [1:18334] "reviewed" "reviewed" "reviewed" "reviewed" ...
##  $ locationSource : chr [1:18334] "guc" "us" "us" "us" ...
##  $ magSource      : chr [1:18334] "us" "us" "us" "us" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   time = col_datetime(format = ""),
##   ..   latitude = col_double(),
##   ..   longitude = col_double(),
##   ..   depth = col_double(),
##   ..   mag = col_double(),
##   ..   magType = col_character(),
##   ..   nst = col_double(),
##   ..   gap = col_double(),
##   ..   dmin = col_double(),
##   ..   rms = col_double(),
##   ..   net = col_character(),
##   ..   id = col_character(),
##   ..   updated = col_datetime(format = ""),
##   ..   place = col_character(),
##   ..   type = col_character(),
##   ..   horizontalError = col_double(),
##   ..   depthError = col_double(),
##   ..   magError = col_double(),
##   ..   magNst = col_double(),
##   ..   status = col_character(),
##   ..   locationSource = col_character(),
##   ..   magSource = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Print the first few rows of the dataset to see its variables
head(quakes)
## # A tibble: 6 × 22
##   time                latitude longitude depth   mag magType   nst   gap  dmin
##   <dttm>                 <dbl>     <dbl> <dbl> <dbl> <chr>   <dbl> <dbl> <dbl>
## 1 2013-01-01 03:51:13   -20.8      -69.7  56.1   5.1 mb         64 109.     NA
## 2 2013-01-01 07:35:49    46.9      151.   35     5.1 mwb       519  46.6    NA
## 3 2013-01-02 19:35:15   -15.8     -172.   10     5   mb        199  33      NA
## 4 2013-01-03 00:02:16    -1.56     127.   26.3   5.6 mwb       139  36.5    NA
## 5 2013-01-04 13:13:45   -16.4     -173.   35     5.1 mb        194  69.8    NA
## 6 2013-01-04 17:59:13     1.38     127.  129     5   mb        102  25.7    NA
## # ℹ 13 more variables: rms <dbl>, net <chr>, id <chr>, updated <dttm>,
## #   place <chr>, type <chr>, horizontalError <dbl>, depthError <dbl>,
## #   magError <dbl>, magNst <dbl>, status <chr>, locationSource <chr>,
## #   magSource <chr>


Selection of Variables

To assess the suitability of our variables, we can start by examining their correlations with the price. We'll calculate the correlation matrix and extract the correlation coefficients for the price variable.
Correlation Analysis: We start by examining the relationship between each predictor variable (year_built, square_footage, elevation, num_bedrooms) and the target variable (price). This helps us understand how strongly each predictor is associated with the price.
Correlation Coefficients: By calculating correlation coefficients, we can quantify the strength and direction of the relationship between each predictor and the price. Higher absolute correlation coefficients indicate stronger associations.
# Correlation analysis
correlation_matrix <- cor(quakes[, c("depth", "mag")])
correlation_with_depth <- correlation_matrix["depth", ]


Assumptions of Linear Model It's essential to check if the assumptions of our linear model hold. We'll visualize the relationships between the predictors and the log of price using scatterplots. Additionally, we'll examine the distribution of residuals for normality and homoscedasticity. Scatterplot Matrix: A scatterplot matrix allows us to visualize the relationships between the predictors and the log-transformed price. This helps us assess if there are linear relationships between the predictors and the target variable.
Residual Analysis: Residuals represent the difference between the observed and predicted values. We check if the residuals follow a normal distribution and if their variance remains constant across different values of predictors. If the residuals exhibit a pattern in their distribution or their variance changes, it indicates that the assumptions of the linear model may not hold.
# Assumptions of Linear Model
pairs(~ depth + mag, data = quakes)
plot of chunk unnamed-chunk-3
model <- lm(depth ~ mag, data = quakes)
residuals <- resid(model)
hist(residuals, main = "Histogram of Residuals")
plot of chunk unnamed-chunk-3

Applicability of Conclusions

We should consider if our conclusions hold for all groups in our data. One way to do this is by including interaction terms or conducting subgroup analyses.
Subgroup Analysis: Sometimes, the relationships between predictors and the target variable may vary across different subgroups in the data (e.g., different locations or property types). By including interaction terms or conducting subgroup analyses, we can determine if our conclusions generalize across all groups or if they are more appropriate for certain subgroups.
# Load the dataset (quakes is a built-in dataset in R)
data(quakes)

# Fit a linear regression model
subgroup_model <- lm(depth ~ mag * stations, data = quakes)

# Display the summary of the model
summary(subgroup_model)
## 
## Call:
## lm(formula = depth ~ mag * stations, data = quakes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -418.12 -174.83  -50.52  200.93  451.75 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1659.0136   148.8416  11.146   <2e-16 ***
## mag          -323.7232    34.2126  -9.462   <2e-16 ***
## stations        4.9619     3.3022   1.503    0.133    
## mag:stations   -0.1095     0.6198  -0.177    0.860    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 203.9 on 996 degrees of freedom
## Multiple R-squared:  0.108,  Adjusted R-squared:  0.1053 
## F-statistic: 40.18 on 3 and 996 DF,  p-value: < 2.2e-16


Impact of Assumption Violations

If our model assumptions are not met, it could lead to biased estimates. We can quantify the impact by comparing model performance metrics like MSE, MAE, and R-squared before and after addressing the violations.
Detecting violations in model assumptions is vital as they can lead to biased estimates. By comparing performance metrics before and after addressing these violations, we gauge their impact on model accuracy.
# Impact of Assumption Violations
predicted <- predict(model)
mse <- mean((predicted - quakes$depth)^2)
## Warning in predicted - quakes$depth: longer object length is not a multiple of
## shorter object length
mae <- mean(abs(predicted - quakes$depth))
## Warning in predicted - quakes$depth: longer object length is not a multiple of
## shorter object length
rsquared <- summary(model)$r.squared


Evaluation Metrics

Finally, we compute evaluation metrics like Mean Squared Error, Mean Absolute Error, and R-squared to quantify model performance. These metrics provide valuable insights into how well our model fits the data and facilitate comparisons with alternative models.
# Evaluation Metrics
mse <- mean((predicted - quakes$depth)^2)
## Warning in predicted - quakes$depth: longer object length is not a multiple of
## shorter object length
mae <- mean(abs(predicted - quakes$depth))
## Warning in predicted - quakes$depth: longer object length is not a multiple of
## shorter object length
rsquared <- summary(model)$r.squared


Conclusion:

Through this extended analysis, we've uncovered critical insights into apartment pricing prediction. By refining our model and scrutinizing its assumptions, we've enhanced its robustness and real-world applicability. This journey exemplifies the iterative nature of data analysis, where continuous refinement leads to deeper understanding and more accurate predictions.