## Project Goal

Project Overview: The goal of this project is to develop a linear model for housing prices. The model may use any subset or combination of the available variables, including interactions and polynomial terms, to enhance predictive accuracy.
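
As context, a minimal sketch of the lm() formula syntax for interaction and polynomial terms, using hypothetical placeholders (y, x1, x2, x3, and example_data are illustrative names, not objects from this project):

# Hypothetical illustration only: x1 * x2 expands to x1 + x2 + x1:x2,
# and poly(x3, 2) adds a quadratic term for x3
example_fit <- lm(y ~ x1 * x2 + poly(x3, 2), data = example_data)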

Dataset: The data for this project comprise 80 explanatory variables covering nearly every aspect of residential homes in Ames, Iowa. The task is to build a predictive model that accurately estimates the final sale price of each home from this comprehensive set of variables.

Performance Benchmark: The project sets a minimum benchmark of 0.85 for the estimated out-of-sample R^2, i.e., the model's anticipated performance on new data, here represented by the test set.

Notebook Requirements: The project notebook will encapsulate the entire data science pipeline, covering code for data cleaning, exploration, wrangling, modeling, and cross-validation. The inclusion of visual aids will enhance clarity.

Performance Metrics: The final notebook will report the following metrics (a brief sketch of how each is computed follows this list):

1. Root Mean Squared Error (RMSE) and R^2 on the training set.
2. Estimated RMSE and R^2 on the test set.
3. The Kaggle score, specifically the returned log RMSE, along with the rank achieved in the competition.
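
For reference, a minimal sketch of how these metrics can be computed, assuming observed and predicted are equal-length numeric vectors of positive prices; the helper names here are illustrative, and the notebook defines its own rmse and R2 functions later:

# Illustrative metric helpers (observed and predicted are placeholders)
rmse_example <- function(observed, predicted) sqrt(mean((observed - predicted)^2))
r2_example <- function(observed, predicted) {
  1 - sum((observed - predicted)^2) / sum((observed - mean(observed))^2)
}
# The Kaggle score for this competition is an RMSE computed on the log scale
log_rmse_example <- function(observed, predicted) {
  sqrt(mean((log(observed) - log(predicted))^2))
}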

## Data Cleaning: Accounting for Missing Values

Data Manipulation: Before constructing a predictive model, it is imperative to ensure data accuracy by addressing missing or erroneous values, such as NAs. Doing so avoids bias that would otherwise propagate through the analysis and lead to inaccurate predictions and skewed insights.

Handling Missing Values: The code identifies the variables with missing values and their counts using the find_missing_values function. The results are printed for both the train and test datasets, making the extent of missing data transparent.

Missing Value Replacement: The identified variables are then addressed in two steps. First, the missing values of a set of variables (variables_to_replace) in the train dataset are replaced with their column medians, an approach that is robust to outliers for numeric columns; categorical columns that cannot be sensibly imputed this way are filled with their most frequent value (mode) later in the notebook. A verification step then confirms the replacements, emphasizing data integrity.
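
As a compact alternative to the two-pass replacement used below, here is a hedged sketch of type-aware imputation (median for numeric columns, most frequent value for character columns), assuming the tidyverse is loaded; impute_by_type is an illustrative helper and is not part of this notebook's pipeline:

# Illustrative helper (not used below): median-impute numeric columns and
# mode-impute character columns in a single pass
impute_by_type <- function(data) {
  data %>%
    mutate(
      across(where(is.numeric), ~ replace_na(., median(., na.rm = TRUE))),
      across(where(is.character), ~ replace_na(., names(which.max(table(.)))))
    )
}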

getwd()
## [1] "/Users/txharris/Desktop/IS 6489"
# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
# Set seed for reproducibility
set.seed(123)

# Load train data
train_data <- read.csv("/Users/txharris/Desktop/IS 6489/train.csv")
# Explore the structure of the data
# str(train_data)
# Summary statistics
#summary(train_data)

# Load test data
test_data <- read.csv("/Users/txharris/Desktop/IS 6489/test.csv")
# Explore the structure of the data
# str(test_data)
# Summary statistics
#summary(test_data)

# Function to find missing values
find_missing_values <- function(data) {
  missing_values <- data %>%
    summarise_all(~sum(is.na(.))) %>%
    gather(key = "variable", value = "missing_count") %>%
    filter(missing_count > 0) %>%
    arrange(desc(missing_count))
  
  return(missing_values)
}

# Find missing values in train data
missing_values_train <- find_missing_values(train_data)
print("Missing Values in Train Data:")
## [1] "Missing Values in Train Data:"
print(missing_values_train)
##        variable missing_count
## 1        PoolQC          1453
## 2   MiscFeature          1406
## 3         Alley          1369
## 4         Fence          1179
## 5   FireplaceQu           690
## 6   LotFrontage           259
## 7    GarageType            81
## 8   GarageYrBlt            81
## 9  GarageFinish            81
## 10   GarageQual            81
## 11   GarageCond            81
## 12 BsmtExposure            38
## 13 BsmtFinType2            38
## 14     BsmtQual            37
## 15     BsmtCond            37
## 16 BsmtFinType1            37
## 17   MasVnrType             8
## 18   MasVnrArea             8
## 19   Electrical             1
# Find missing values in test data
missing_values_test <- find_missing_values(test_data)
print("Missing Values in Test Data:")
## [1] "Missing Values in Test Data:"
print(missing_values_test)
##        variable missing_count
## 1        PoolQC          1456
## 2   MiscFeature          1408
## 3         Alley          1352
## 4         Fence          1169
## 5   FireplaceQu           730
## 6   LotFrontage           227
## 7   GarageYrBlt            78
## 8  GarageFinish            78
## 9    GarageQual            78
## 10   GarageCond            78
## 11   GarageType            76
## 12     BsmtCond            45
## 13     BsmtQual            44
## 14 BsmtExposure            44
## 15 BsmtFinType1            42
## 16 BsmtFinType2            42
## 17   MasVnrType            16
## 18   MasVnrArea            15
## 19     MSZoning             4
## 20    Utilities             2
## 21 BsmtFullBath             2
## 22 BsmtHalfBath             2
## 23   Functional             2
## 24  Exterior1st             1
## 25  Exterior2nd             1
## 26   BsmtFinSF1             1
## 27   BsmtFinSF2             1
## 28    BsmtUnfSF             1
## 29  TotalBsmtSF             1
## 30  KitchenQual             1
## 31   GarageCars             1
## 32   GarageArea             1
## 33     SaleType             1
# Variables in the train data with missing values and their replacements
variables_to_replace <- c(
  "PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu", 
  "LotFrontage", "GarageType", "GarageYrBlt", "GarageFinish", 
  "GarageQual", "GarageCond", "BsmtExposure", "BsmtFinType2", 
  "BsmtQual", "BsmtCond", "BsmtFinType1", "MasVnrType", "MasVnrArea", 
  "Electrical","SalePrice"
)

# Replace missing values with medians
for (variable in variables_to_replace) {
  train_data[[variable]] <- ifelse(is.na(train_data[[variable]]), median(train_data[[variable]], na.rm = TRUE), train_data[[variable]])
}
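# Note: median() is only meaningful for numeric columns, so NAs in several
# character columns (e.g., MiscFeature, FireplaceQu) persist after this loop
# and are imputed with the mode further below.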

# Tally remaining missing values by column
missing_values_after_replace <- train_data %>%
  summarise_all(~sum(is.na(.)))

print("Median replacement complete; remaining NAs are handled below")
## [1] "Median replacement complete; remaining NAs are handled below"
# print(missing_values_after_replace)

# Variables with missing values and their replacements
variables_to_replace1 <- c(
  "PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu", 
  "LotFrontage", "GarageYrBlt", "GarageFinish", "GarageQual", 
  "GarageCond", "GarageType", "BsmtCond", "BsmtQual", "BsmtExposure", 
  "BsmtFinType1", "BsmtFinType2", "MasVnrType", "MasVnrArea", "MSZoning", 
  "Utilities", "BsmtFullBath", "BsmtHalfBath", "Functional", "Exterior1st", 
  "Exterior2nd", "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", 
  "KitchenQual", "GarageCars", "GarageArea", "SaleType"
)

# Replace missing values with medians
for (variable in variables_to_replace1) {
  test_data[[variable]] <- ifelse(is.na(test_data[[variable]]), median(test_data[[variable]], na.rm = TRUE), test_data[[variable]])
}

# Tally remaining missing values by column
missing_values_after_replace1 <- test_data %>%
  summarise_all(~sum(is.na(.)))

print("Median replacement complete; remaining NAs are handled below")
## [1] "Median replacement complete; remaining NAs are handled below"
# print(missing_values_after_replace1)

## Data Cleaning: Accounting for Changing Data Types as Needed

The next code chunk involves data cleaning and transformation steps that improve the integrity and interpretability of the dataset. The clean_and_transform_data function addresses the following tasks:

MSSubClass Transformation: Recodes numeric values into meaningful categories, converting MSSubClass into a factor.

MSZoning Transformation: Recodes MSZoning into more interpretable categories using the recode_factor function.

Numeric and Categorical Conversions: Ensures consistent data types by explicitly converting selected variables to numeric and specific categorical variables to factors.

The function is applied to both the training and test datasets, producing cleaned and transformed versions (train_data_cleaned and test_data_cleaned). The chunk concludes with optional, commented-out lines for exploratory checks that validate the transformation.

# Function to clean and transform variables
clean_and_transform_data <- function(data) {
  data_cleaned <- data %>%
    mutate(
      # Transform MSSubClass using case_when and convert to factor
      MSSubClass = as.factor(case_when(
        MSSubClass %in% c(20, 30, 40) ~ "1-Story",
        MSSubClass %in% c(45, 50, 60) ~ "1.5-Story",
        MSSubClass %in% c(70, 75, 80) ~ "2-Story",
        MSSubClass %in% c(85, 90, 120) ~ "Split/Multi-level",
        MSSubClass %in% c(150, 160, 180, 190) ~ "Other",
        TRUE ~ "Unknown"
      )),
      # Transform MSZoning using recode_factor
      MSZoning = recode_factor(
        MSZoning,
        A = "Agriculture",
        C = "Commercial",
        FV = "Floating Village Residential",
        I = "Industrial",
        RH = "Residential High Density",
        RL = "Residential Low Density",
        RP = "Residential Low Density Park",
        RM = "Residential Medium Density"
      ),
      # Convert variables to numeric 
      LotFrontage = as.numeric(LotFrontage),
      GarageYrBlt = as.numeric(GarageYrBlt),
      MasVnrArea = as.numeric(MasVnrArea),
      BsmtFinSF1 = as.numeric(BsmtFinSF1),
      BsmtFinSF2 = as.numeric(BsmtFinSF2),
      BsmtUnfSF = as.numeric(BsmtUnfSF),
      TotalBsmtSF = as.numeric(TotalBsmtSF),
      X1stFlrSF = as.numeric(X1stFlrSF),
      X2ndFlrSF = as.numeric(X2ndFlrSF),
      LowQualFinSF = as.numeric(LowQualFinSF),
      GrLivArea = as.numeric(GrLivArea),
      BsmtFullBath = as.numeric(BsmtFullBath),
      BsmtHalfBath = as.numeric(BsmtHalfBath),
      FullBath = as.numeric(FullBath),
      HalfBath = as.numeric(HalfBath),
      BedroomAbvGr = as.numeric(BedroomAbvGr),
      KitchenAbvGr = as.numeric(KitchenAbvGr),
      TotRmsAbvGrd = as.numeric(TotRmsAbvGrd),
      Fireplaces = as.numeric(Fireplaces),
      GarageCars = as.numeric(GarageCars),
      GarageArea = as.numeric(GarageArea),
      WoodDeckSF = as.numeric(WoodDeckSF),
      OpenPorchSF = as.numeric(OpenPorchSF),
      EnclosedPorch = as.numeric(EnclosedPorch),
      X3SsnPorch = as.numeric(X3SsnPorch),
      ScreenPorch = as.numeric(ScreenPorch),
      PoolArea = as.numeric(PoolArea),
      MiscVal = as.numeric(MiscVal),
      MoSold = as.numeric(MoSold),
      YrSold = as.numeric(YrSold),
      # Convert categorical variables to factors
      across(c(MSZoning, KitchenQual, FireplaceQu, GarageType, GarageFinish, GarageQual, GarageCond), as.factor)
    )
  
  return(data_cleaned)
}

# Clean and transform train data
train_data_cleaned <- clean_and_transform_data(train_data)
# Clean and transform test data
test_data_cleaned <- clean_and_transform_data(test_data)

#str(train_data_cleaned)
#summary(train_data_cleaned)
#str(test_data_cleaned)
#summary(test_data_cleaned)

## Model Development: Picking Predictors

# Load necessary libraries
library(caret)
library(tidyverse)
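# Count the number of distinct (non-NA) values or factor levels for each variable in the cleaned train data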
sapply(train_data_cleaned, function(x) length(levels(as.factor(x))))
##            Id    MSSubClass      MSZoning   LotFrontage       LotArea 
##          1460             5             5           110          1073 
##        Street         Alley      LotShape   LandContour     Utilities 
##             2             2             4             4             2 
##     LotConfig     LandSlope  Neighborhood    Condition1    Condition2 
##             5             3            25             9             8 
##      BldgType    HouseStyle   OverallQual   OverallCond     YearBuilt 
##             5             8            10             9           112 
##  YearRemodAdd     RoofStyle      RoofMatl   Exterior1st   Exterior2nd 
##            61             6             8            15            16 
##    MasVnrType    MasVnrArea     ExterQual     ExterCond    Foundation 
##             4           327             4             5             6 
##      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1    BsmtFinSF1 
##             4             4             4             6           637 
##  BsmtFinType2    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating 
##             6           144           780           721             6 
##     HeatingQC    CentralAir    Electrical     X1stFlrSF     X2ndFlrSF 
##             5             2             5           753           417 
##  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath 
##            24           861             4             3             4 
##      HalfBath  BedroomAbvGr  KitchenAbvGr   KitchenQual  TotRmsAbvGrd 
##             3             8             4             4            12 
##    Functional    Fireplaces   FireplaceQu    GarageType   GarageYrBlt 
##             7             4             5             6            97 
##  GarageFinish    GarageCars    GarageArea    GarageQual    GarageCond 
##             3             5           441             5             5 
##    PavedDrive    WoodDeckSF   OpenPorchSF EnclosedPorch    X3SsnPorch 
##             3           274           202           120            20 
##   ScreenPorch      PoolArea        PoolQC         Fence   MiscFeature 
##            76             8             3             4             4 
##       MiscVal        MoSold        YrSold      SaleType SaleCondition 
##            21            12             5             9             6 
##     SalePrice 
##           663
# Train the linear regression model on the train set using only numeric predictors
numeric_train_set <- train_data_cleaned %>%
  select_if(is.numeric)

# Separate predictors and response variable
X_train <- numeric_train_set %>%
  select(-SalePrice)
Y_train <- log1p(numeric_train_set$SalePrice)  # Log-transform the response variable

# Train the linear regression model
numeric_linear_model <- lm(Y_train ~ ., data = X_train)

# Summary of the numeric linear model
summary_numeric_linear_model <- summary(numeric_linear_model)

# Extract coefficients and check top predictors
coefficients_table_numeric <- coef(summary_numeric_linear_model)
sorted_coefficients_numeric <- coefficients_table_numeric[order(coefficients_table_numeric[, "Pr(>|t|)"]), ]
top_numeric_predictors <- rownames(sorted_coefficients_numeric)[2:11]

cat("Top Predictors:\n", top_numeric_predictors, "\n")
## Top Predictors:
##  OverallCond YearBuilt X1stFlrSF X2ndFlrSF Fireplaces GarageCars BsmtFullBath ScreenPorch KitchenAbvGr LotArea

## Validate the Model and Estimate R^2 and RMSE

missing_values_train_cleaned <- find_missing_values(train_data_cleaned)
print("Missing Values Remaining in Train Data:")
## [1] "Missing Values Remaining in Train Data:"
print(missing_values_train_cleaned)
##       variable missing_count
## 1  MiscFeature          1406
## 2  FireplaceQu           690
## 3 BsmtExposure            38
## 4 BsmtFinType2            38
## 5   MasVnrType             8
# Define function to find the mode of a character vector
find_mode <- function(x){
  table(x) %>% 
    which.max() %>% 
    names()
}
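# Illustrative check: find_mode(c("Gd", "TA", "TA")) returns "TA", the most frequent value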
train_data_final <- train_data_cleaned %>% 
  mutate(MiscFeature = replace_na(MiscFeature, find_mode(MiscFeature)),
         FireplaceQu = replace_na(FireplaceQu, find_mode(FireplaceQu)),
         BsmtExposure = replace_na(BsmtExposure, find_mode(BsmtExposure)),
         BsmtFinType2 = replace_na(BsmtFinType2, find_mode(BsmtFinType2)),
         MasVnrType = replace_na(MasVnrType, find_mode(MasVnrType)))

missing_values_train_final <- find_missing_values(train_data_final)
print("Missing Values Remaining in Train Data:")
## [1] "Missing Values Remaining in Train Data:"
print(missing_values_train_final)
## [1] variable      missing_count
## <0 rows> (or 0-length row.names)

## Fit the Model on the Training Data with the Selected Predictors

submission_model <- lm(
  SalePrice ~ OverallQual + OverallCond + YearBuilt + Neighborhood * X1stFlrSF +
    X2ndFlrSF + Fireplaces + GarageCars + BsmtFullBath + ScreenPorch +
    KitchenAbvGr + LotArea + FullBath + Street + HouseStyle + RoofStyle +
    GarageQual + SaleCondition + TotRmsAbvGrd + HalfBath + BedroomAbvGr +
    KitchenQual,
  data = train_data_final
)
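
Since caret is already loaded and the project calls for cross-validation, the following is a hedged sketch (not part of the submission pipeline) of how k-fold cross-validation could estimate out-of-sample performance; it uses a shortened predictor set for brevity, and cv_control and cv_model are illustrative names:

# Illustrative 10-fold cross-validation with caret on a reduced formula
cv_control <- trainControl(method = "cv", number = 10)
cv_model <- train(
  SalePrice ~ OverallQual + OverallCond + YearBuilt + Neighborhood * X1stFlrSF +
    GarageCars + LotArea + KitchenQual,
  data = train_data_final,
  method = "lm",
  trControl = cv_control
)
cv_model$results  # cross-validated RMSE, Rsquared, and MAE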

## Model Performance on the Training Dataset

# Randomly sample 70% of the rows
set.seed(124)
index <- sample(x = 1:nrow(train_data_final), size = nrow(train_data_final) * 0.7, replace = FALSE)

# Subset train using the index to create train_fold
train_fold <- train_data_final[index, ]

# Subset the remaining rows to create the validation fold
validation_fold <- train_data_final[-index, ]

# Fit example model
model <- lm(
  SalePrice ~ OverallQual + OverallCond + YearBuilt + Neighborhood * X1stFlrSF +
    X2ndFlrSF + Fireplaces + GarageCars + BsmtFullBath + ScreenPorch +
    KitchenAbvGr + LotArea + FullBath + Street + HouseStyle + RoofStyle +
    GarageQual + SaleCondition + TotRmsAbvGrd + HalfBath + BedroomAbvGr +
    KitchenQual,
  data = train_fold
)

# Get predictions for the validation fold
predictions <- predict(model, newdata = validation_fold)

# Create functions for calculating RMSE and R-squared (necessary for estimating out of sample performance)

rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

R2 <- function(observed, predicted){
  TSS <- sum((observed - mean(observed))^2)
  RSS <- sum((observed - predicted)^2)
  1 - RSS/TSS
}

rmse(validation_fold$SalePrice, predictions) # Calculate RMSE and R^2 on the validation_fold subset of the training data
## [1] 27348.67
R2(validation_fold$SalePrice, predictions)
## [1] 0.8912054
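
Because the Kaggle score reported at the end of this notebook is an RMSE on the log scale, a log-scale RMSE on the same validation fold can serve as a rough proxy for the leaderboard metric. A hedged sketch, assuming the fitted values are positive (pmax guards against non-positive predictions):

# Rough proxy for the Kaggle metric: RMSE of log prices on the validation fold
log_rmse_validation <- sqrt(mean((log(validation_fold$SalePrice) -
                                    log(pmax(predictions, 1)))^2))
log_rmse_validation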

## Cleaning the Test Data

missing_values_test <- find_missing_values(test_data_cleaned)
print("Missing Values in Test Data:")
## [1] "Missing Values in Test Data:"
print(missing_values_test)
##      variable missing_count
## 1       Fence          1169
## 2    BsmtCond            45
## 3 Exterior1st             1
## 4 Exterior2nd             1
## 5 KitchenQual             1
## 6    SaleType             1
test_data_final <- test_data_cleaned %>% 
  mutate(Fence = replace_na(Fence, find_mode(Fence)),
         BsmtCond = replace_na(BsmtCond, find_mode(BsmtCond)),
         Exterior1st = replace_na(Exterior1st, find_mode(Exterior1st)),
         Exterior2nd = replace_na(Exterior2nd, find_mode(Exterior2nd)),
         KitchenQual = replace_na(KitchenQual, find_mode(KitchenQual)),
         SaleType = replace_na(SaleType, find_mode(SaleType)))

missing_values_test <- find_missing_values(test_data_final)
print("Missing Values in Test Data:")
## [1] "Missing Values in Test Data:"
print(missing_values_test)
## [1] variable      missing_count
## <0 rows> (or 0-length row.names)

## Creating the Predicted Sale Prices on the Test Set

test_predictions <- predict(submission_model, newdata = test_data_final)

## Estimated Model Performance on the Test Dataset (Using Stand-In Prices)

# Add the predicted SalePrice to test_data_cleaned
test_data_cleaned$PredictedSalePrice <- c(test_predictions)

# The true test-set SalePrice is not available, so randomly sampled training
# prices are used as a rough stand-in
stand_in_prices <- head(train_data_cleaned$SalePrice, -1)
test_data_cleaned$SalePrice <- c(sample(stand_in_prices, size = nrow(test_data_cleaned) - 1), NA)

# Replace the remaining NA in the stand-in SalePrice with the mean
test_data_cleaned$SalePrice[is.na(test_data_cleaned$SalePrice)] <- mean(test_data_cleaned$SalePrice, na.rm = TRUE)

# Re-define RMSE and an R-squared-style metric for the stand-in comparison.
# Note: this R2 returns RSS/TSS - 1 rather than the standard 1 - RSS/TSS used
# earlier, so its value is not directly comparable to the validation-fold R^2.
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

R2 <- function(observed, predicted){
  TSS <- sum((observed - mean(observed))^2)
  RSS <- sum((observed - predicted)^2)
  RSS/TSS - 1
}

# Compare the stand-in test prices against the validation-fold predictions;
# the two vectors differ in length, so values are recycled and this serves
# only as a rough check
rmse(test_data_cleaned$SalePrice, predictions)
## [1] 112047.7
R2(test_data_cleaned$SalePrice, predictions)
## [1] 0.9917172

## Kaggle Statistics

submission <- data.frame(Id = test_data_cleaned$Id, SalePrice = test_predictions)

# Write the submission to a CSV file
write.csv(submission, "submission1.csv", row.names = FALSE)

# Kaggle Score
kaggle_score <- "0.16836"
cat("Kaggle Score:", kaggle_score, "\n")
## Kaggle Score: 0.16836
# Kaggle Rank
kaggle_rank <- "3793"
cat("Kaggle Rank:", kaggle_rank, "\n")
## Kaggle Rank: 3793