Loading the necessary packages:

library(knitr) # For formatting tables
library(e1071) # for checking the skewness
library(ggplot2) # For plots
library(reshape2) # To reshape the data
library(randomForest) #for feature importance
library(dplyr)
library(MASS)
library(gridExtra)
library(tibble)

Loading and reading the data in R:

# Subsetthe data to include only the first 1459 rows to meet Kaggle requirement
Train_df_subset <- head(Train_df, 1459)

# Fixng the erros I got when submitting in  Kaggle 
Train_df_subset$Id <- seq_along(Train_df_subset$SalePrice)

# Export Train_df to CSV file
write.csv(Train_df_subset, "Train_df_subset.csv", row.names = FALSE)

Exploring the data:

Glimpse of the data:

# Display the first few rows of the data:
head(Train_df)
# Display the statistical properties of the data
summary(Train_df)

This data set contains 1460 observations with 81 variables; 39 of those variables are numerical and 43 are categorical

# Check for missing values
sum(is.na(Train_df))
## [1] 6965

Checking the skewness of the quantitative variables:

# Identify the numerical and categorical data variables
numerical_vars <- sapply(Train_df, is.numeric)
Categorical_vars <- sapply(Train_df, is.character)

# Subset the data to include only quantitative variables
numerical_data <- Train_df[, numerical_vars]
Categorical_data <- Train_df[, Categorical_vars]

# Calculate skewness for quantitative variables
skewness_values <- sapply(numerical_data, skewness)
head(skewness_values)
##          Id  MSSubClass LotFrontage     LotArea OverallQual OverallCond 
##   0.0000000   1.4047656          NA  12.1826150   0.2164984   0.6916440

Based on the skewness level above, the LotArea has the greatest positive value, which means that this variable has distribution that is skewed to the right. Now, let’s identify the variables that has positive level of skewness and order them in ascending order, so we can see of our choice of the independent varaible is the right decision:

# Identify variables with positive skewness
positive_skew_vars <- names(skewness_values[skewness_values > 0])

# Sort variables based on skewness values in ascending order
sorted_positive_skew_vars <- positive_skew_vars[order(skewness_values[skewness_values > 0])]

# Print the names of variables with positive skewness sorted in ascending order
print(sorted_positive_skew_vars)
##  [1] "FullBath"      "YrSold"        "GarageArea"    "BedroomAbvGr" 
##  [5] "MoSold"        "OverallQual"   "BsmtFullBath"  "Fireplaces"   
##  [9] "HalfBath"      "TotRmsAbvGrd"  "OverallCond"   "X2ndFlrSF"    
## [13] "BsmtUnfSF"     "GrLivArea"     "X1stFlrSF"     "MSSubClass"   
## [17] "TotalBsmtSF"   "WoodDeckSF"    "BsmtFinSF1"    "SalePrice"    
## [21] "OpenPorchSF"   "EnclosedPorch" "BsmtHalfBath"  "ScreenPorch"  
## [25] "BsmtFinSF2"    "KitchenAbvGr"  "LowQualFinSF"  "X3SsnPorch"   
## [29] "LotArea"       "PoolArea"      "MiscVal"       NA             
## [33] NA              NA

Based on the calculated skewness, the independent variable I am choosing is Lot Area, so \(X=LotArea\). The dependent variable is Sale Price, so \(Y=SalePrice\).

X <- Train_df_subset$LotArea
Y <- Train_df_subset$SalePrice

Calculating the probability:

I am going to calculate the following probabilities:

  1. P(X>x | Y>y)

  2. P(X>x, Y>y)

  3. P(X<x | Y>y)

Where the small letter “x” is estimated as the 3d quartile of the X variable, and the small letter “y” is estimated as the 2d quartile of the Y variable.

P(X>x | Y>y)

\(P(X>x | Y>y) = \frac{\text {Number of observations where X>x and Y>y}}{\text{Number of observations where Y>y}}\)

First, let’s count the number of observations where both \(X>x\) and \(Y>y\):

# Define the thresholds
x_threshold <- 11602  # 3rd quartile of X (Lot Area)
y_threshold <- 163000 # 2nd quartile or median of Y (Sale Price)

# Filter the data where both X and Y exceed their thresholds
filtered_Train_df <- Train_df[Train_df$LotArea > x_threshold & Train_df$SalePrice > y_threshold, ]

# Count the number of observations in the filtered dataset
num_observations_both_exceed <- nrow(filtered_Train_df)

# Print the count
print(num_observations_both_exceed)
## [1] 276

Now, let’s count the number of observations where \(Y>y\):

# Filter the data where only Y exceeds its threshold
filtered_data_Y_exceeds <- Train_df[Train_df$SalePrice > y_threshold, ]

# Count the number of observations in the filtered dataset
num_observations_Y_exceeds <- nrow(filtered_data_Y_exceeds)

# Print the count
print(num_observations_Y_exceeds)
## [1] 728

So the conditional probability that \(X>x\) given that \(Y>y\) y is \(P(X>x | Y>y) = \fract{276}{728}\)

# Calculate the conditional probability where X>x given Y>y
Cond_Prob1 <- round(num_observations_both_exceed/num_observations_Y_exceeds, 3)

# Display the Conditional probability
print(Cond_Prob1)
## [1] 0.379

\(\boxed {P(X>x | Y>y) \approx 0.379}\)

This probability suggests that given an observation with higher Sale Price of a house that exceeds the 2nd quartile (\(\$163000\)), there is 37.9% chance that those houses have a lot area that exceeds the 3rd quartile (\(11602 ft^2\))

P(X>x, Y>y)

\(P(X>x , Y>y) = \frac{\text {Number of observations where X>x and Y>y}}{\text{Total number of observations}}\)

Based on the calculation on the previous subsection “P(X>x | Y>y)”, we know that the number of observations where \(X>x\) and \(Y>y\) is 276. And the total number of observations in the data set is \(1460\):

# Calculate the probability of both X>x and Y>y
Prob2 <- round(num_observations_both_exceed/1460, 3)

# Print the probability
print(Prob2)
## [1] 0.189

This probability suggests that there is only a chance of 18.9% that houses that are sold with relatively high price (exceeding the 2nd quartile), have also a big lot area that exceeds the 3rd quartile. In other words, there is a low occurrence of houses with both high sale price and large lot area.

P(X<x | Y>y)

\(P(X<x | Y>y) = \frac{\text {Number of observations where X<x and Y>y}}{\text{Number of observations where Y>y}}\)

We Already found that the number of observations where \(X>x\) and \(Y>y\) is \(276\). And the number of observations where \(Y>y\) is \(728\)> To find the number of observations where \(X<x\) and \(Y>y\) would be the difference between both: \(728-276\)

# Number of observations where X<x and Y>y
Obs_X_less_Y_more <- num_observations_Y_exceeds - num_observations_both_exceed

# Probability of observations where X<x and Y>y
Prob3 <- round(Obs_X_less_Y_more / num_observations_Y_exceeds, 3)

# Display the probability
print(Prob3)
## [1] 0.621

This probability indicates that given houses with higher sale price, there is higher chance, 62.1%, that those houses have smaller lot area (less than 3rd quartile).

# Filter the data where both X and Y below their thresholds
filtered_Train_df_both_leq <- Train_df[Train_df$LotArea <= x_threshold & Train_df$SalePrice <= y_threshold, ]

# Count the number of observations in the filtered dataset where both X and Y are less or equal to their thresholds
num_observations_both_lessEq <- nrow(filtered_Train_df_both_leq)

# Print the count
print(num_observations_both_lessEq)
## [1] 643
# Filter the data where X<=x and Y>y
filtered_Train_df_X_leq_Y_g <- Train_df[Train_df$LotArea <= x_threshold & Train_df$SalePrice > y_threshold, ]

# Count the number of observations in the filtered dataset 
num_observations_X_leq_Y_g <- nrow(filtered_Train_df_X_leq_Y_g)

# Print the count
print(num_observations_X_leq_Y_g )
## [1] 452
# Filter the data where X>x and Y<=y
filtered_Train_df_X_leq_Y_g <- Train_df[Train_df$LotArea > x_threshold & Train_df$SalePrice <= y_threshold, ]

# Count the number of observations in the filtered dataset 
num_observations_X_g_Y_leq <- nrow(filtered_Train_df_X_leq_Y_g)

# Print the count
print(num_observations_X_g_Y_leq )
## [1] 89

Let’s create a table with all counts that we calculated:

# Create the table matrix
table_matrix <- matrix(c(num_observations_both_lessEq, num_observations_X_leq_Y_g, 
                         num_observations_X_g_Y_leq, num_observations_both_exceed),
                       nrow = 2, byrow = TRUE)

# Add row and column names
rownames(table_matrix) <- c("<=3rd Qu", ">3rd Qu")
colnames(table_matrix) <- c("<=2nd Qu", ">2nd Qu")

# Add totals for each row and column
table_matrix <- cbind(table_matrix, rowSums(table_matrix))
table_matrix <- rbind(table_matrix, colSums(table_matrix))

# Assign the row and column names to the table object
dimnames(table_matrix) <- list("x/y" = c("<=3rd Qu", ">3rd Qu", "Total"),
                               "x/y" = c("<=2nd Qu", ">2nd Qu", "Total"))

# Convert to a table object
table_num <- as.table(table_matrix)

# Print the table
print(table_num)
##           x/y
## x/y        <=2nd Qu >2nd Qu Total
##   <=3rd Qu      643     452  1095
##   >3rd Qu        89     276   365
##   Total         732     728  1460

Let A be the new variable counting those observations above the 3d quartile for X, and let B be the new variable counting those observations above the 2d quartile for Y. Does \(P(A|B)=P(A)P(B)\)?

# Display the probability of A given B
Cond_Prob1
## [1] 0.379
# Calculate P(A) x P(B)
Prob_A_and_B <- round(((num_observations_X_g_Y_leq + num_observations_both_exceed) /1460)*(num_observations_Y_exceeds/1460),3)

# Print the probability of A and B
print(Prob_A_and_B)
## [1] 0.125

Since \(P(A|B) \neq P(A) \times P(B)\), we can conclude that A and B are not independent.

let’s evaluate by running a Chi Square test for association:

# Define the calculated number of observations
observed <- matrix(c(643, 452, 89, 276), nrow = 2, byrow = TRUE)

# Perform chi-square test
chi_square_result <- chisq.test(observed)

# Print the result
print(chi_square_result)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  observed
## X-squared = 127.74, df = 1, p-value < 2.2e-16

Based on the chi square result above, we can reject the null hypothesis of independence, meaning that the variables are dependent (as we concluded above mathematically) and there is a significant association between the variables. So, splitting the training data in this fashion doesn’t make them independent.

Univariate descriptive statistics:

First let’s plot histograms for the numerical variables in the training data (Train_df)

# Univariate plots for numerical variables
for (col in names(Train_df)[numerical_vars]) {
  print(ggplot(Train_df, aes_string(x = col)) +
          geom_histogram(fill = "skyblue", color = "black") +
          labs(title = paste("Histogram of", col), x = col, y = "Frequency"))
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 259 rows containing non-finite outside the scale range
## (`stat_bin()`).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8 rows containing non-finite outside the scale range
## (`stat_bin()`).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 81 rows containing non-finite outside the scale range
## (`stat_bin()`).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now, i am going to plot box plots for each of the categorical variables against the SalePRice, to visualize which variable has stronger influence on the sale price of houses,

# Convert categorical variables to factors
for (col in names(Categorical_data)) {
  Categorical_data[[col]] <- factor(Categorical_data[[col]])
}

# Univariate box plots for categorical variables with SalePrice
for (col in names(Categorical_data)) {
  print(ggplot(data = cbind(Categorical_data, numerical_data), aes_string(x = col, y = "SalePrice")) +
          geom_boxplot(fill = "skyblue", color = "black") +
          labs(title = paste("Boxplot of", col), x = col, y = "Sale Price"))
}

To assess the correlation between the categorical variables and the SalePrice of houses, we can use the Analysis of Variance technique (ANOVA) as follows:

# Convert categorical variables to factors
categorical_data <- lapply(Categorical_data, factor)

# Combine numerical and categorical data
combined_data <- cbind(numerical_data, categorical_data)

# Perform ANOVA for each categorical variable against SalePrice
anova_results <- lapply(categorical_data, function(cat_var) {
  anova_result <- aov(SalePrice ~ cat_var, data = combined_data)
  return(summary(anova_result))
})

# Extract p-values from ANOVA results
p_values <- lapply(anova_results, function(anova_result) {
  return(anova_result[[1]][["Pr(>F)"]][1])
})

# Combine variable names and p-values into a data frame
result_df <- data.frame(Variable = names(p_values), P_Value = unlist(p_values))

# Print the result
print(result_df)

It’s little overwhelming to get any conclusion about the association between those variables and the SalePrice from the table, let’s visualize the results in a bar plot:

ANOVA Plot
ANOVA Plot

Now it is evident that “Neighborhood” component has the tallest bar; the smallest p-value, which indicates there is a strong association between the sale price of the houses and the neighborhood. In other words, there are significant differences in SalePrice across different neighborhoods. On the other hands, the “utilities” has the biggest p-value; there is a weak association.

Scatter plot of X and Y

Now, let’s graph a scatter plot of X, “LotArea” and Y, “SalePrice”:

# Plot SalePrice (Y) against LotArea (X)
plot(Train_df$LotArea, Train_df$SalePrice, 
     xlab = "LotArea", ylab = "SalePrice", 
     main = "Scatterplot of LotArea vs SalePrice")

Most of the data points are clustered on the left side of the plot, which indicates that the distribution is skewed. To better visualize the data, we can apply logarithmic transformation to both variables to spread out the data points evenly:

# Log transformation
plot(log(Train_df$LotArea), log(Train_df$SalePrice), 
     xlab = "Log(LotArea)", ylab = "Log(SalePrice)", 
     main = "Scatterplot of Log(LotArea) vs Log(SalePrice)")

95% CI for the difference in the mean

Now, let’s calculate the 95% confidence interval for the difference in means of two variables (LotArea and SalePrice):

# Perform t-test for the difference in means
t_test_result <- t.test(Train_df$LotArea, Train_df$SalePrice)

# Extract confidence interval from the t-test result
ci_lower <- t_test_result$conf.int[1]
ci_upper <- t_test_result$conf.int[2]

# Print the confidence interval
cat("95% Confidence Interval for the difference in means: (", ci_lower, ", ", ci_upper, ")\n")
## 95% Confidence Interval for the difference in means: ( -174514.7 ,  -166294.1 )

The interval above indicates that the SalePrice tends to be significantly lower than LotArea within that range.

Now let’s derive a correlation matrix for two of the quantitative variables; LotArea and PriceSale. The variables X and Y are already defined before as LotArea and SalePrice respectively:

# Calculate the correlation coefficient between X and Y
correlation_coefficient <- cor(X, Y)

# Print the correlation coefficient
print(correlation_coefficient)
## [1] 0.2638429

The correlation coefficient is positive, which indicates that as LotArea increases, the SalePrice increases as well. The number, 0.2638434, suggests a weak positive correlation between LotArea and SalePrice.

Correlation matrix for both variables; LotArea and SalePrice:

# Create a subset of your dataset with only the two variables of interest
subset_df <- Train_df[, c("LotArea", "SalePrice")]

# Calculate the correlation matrix
correlation_matrix <- cor(subset_df)

# Print the correlation matrix
print(correlation_matrix)
##             LotArea SalePrice
## LotArea   1.0000000 0.2638434
## SalePrice 0.2638434 1.0000000

Now, let’s test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval:

# Test the hypothesis that the correlation between X and Y is 0
cor_test_result <- cor.test(X, Y)

# Print the result
print(cor_test_result)
## 
##  Pearson's product-moment correlation
## 
## data:  X and Y
## t = 10.441, df = 1457, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2154402 0.3109524
## sample estimates:
##       cor 
## 0.2638429
# Extract the confidence interval
ci_lower <- cor_test_result$conf.int[1]
ci_upper <- cor_test_result$conf.int[2]

# Print the 99% confidence interval
cat("99% Confidence Interval for the correlation coefficient:", ci_lower, "to", ci_upper, "\n")
## 99% Confidence Interval for the correlation coefficient: 0.2154402 to 0.3109524

The p-value, 2.2e-16, is very small which indicates strong evidence against the null hypothesis (that the correlation between these variables is 0). Therefore, we reject the null hypothesis, suggesting that the correlation between LotArea and SalePrice is statistically significant. The 99% Confidence Interval for the correlation coefficient: 0.2154574 to 0.3109369, emphasizes that the correlation coefficient is not 0. Meaning that the true correlation coefficient falls within the indicated interval. The “sample estimates”, estimates the correlation coefficient which I calculated before; 0.2638434.

Next, we are going to invert the correlation matrix, which is going to be the precision matrix and it will contain variance inflation factors on the diagonal.

# Invert the correlation matrix
inverted_matrix <- solve(correlation_matrix)

# Print the inverted matrix
print(inverted_matrix)
##              LotArea  SalePrice
## LotArea    1.0748219 -0.2835846
## SalePrice -0.2835846  1.0748219

Principle Components Analysis (PCA):

First, let’s multiply both correlation matrices, then conduct the PCA:

# Multiply correlation matrix by inverted matrix
cor_by_invert <- correlation_matrix %*% inverted_matrix

# Multiply inverted matrix by correlation matrix
invert_by_cor <- inverted_matrix %*% correlation_matrix

# Principal Components Analysis (PCA)
pca_cor_by_invert <- princomp(cor_by_invert)
pca_invert_by_cor <- princomp(invert_by_cor)

# Summary of PCA results
summary(pca_cor_by_invert)
## Importance of components:
##                           Comp.1       Comp.2
## Standard deviation     0.7071068 3.725290e-09
## Proportion of Variance 1.0000000 2.775558e-17
## Cumulative Proportion  1.0000000 1.000000e+00
summary(pca_invert_by_cor)
## Importance of components:
##                           Comp.1 Comp.2
## Standard deviation     0.7071068      0
## Proportion of Variance 1.0000000      0
## Cumulative Proportion  1.0000000      1

For “pca_cor_by_invert”:

  • Comp.1: This component has a standard deviation of approximately 0.7071068 and explains 100% of the variance.

  • Comp.2: This component has a standard deviation close to zero (3.725290e-09) and explains a very tiny proportion of the variance (close to zero).

For “pca_invert_by_cor”:

  • Comp.1: Similar to ‘pca_cor_by_prec’, this component has a standard deviation of approximately 0.7071068 and explains 100% of the variance, capturing all the variability in the data.

  • Comp.2: This component has a standard deviation of zero and explains none of the variance. It doesn’t contribute meaningfully to the understanding of the data.

Based on the skewness level of the variables of the data ‘Train_df’, the variable ‘LotArea’ has level of skewness of over 12, which indicates that it has a right skewed distribution. Let’s shift it so that the minimum value is above zero. To do that, we can add a constant value to each observation. This constant value should be at least equal to the absolute value of the smallest negative value in the variable, plus a small margin to ensure that the minimum value becomes positive.

# Compute the minimum value of LotArea
min_LotArea <- min(X)

# Compute the shifting constant
shift_constant <- abs(min_LotArea) + 1  # Add a small margin

# Shift LotArea by adding the constant to each observation
shifted_LotArea <- X + shift_constant

The minimum LotArea is 1300, which makes the shift constant is 1301. Let’s check the minimum of the ‘shifted_LotArea’ if it is above 0:

min(shifted_LotArea)
## [1] 2601

Let’s fit an exponential probability density function:

# Fit an exponential distribution to the shifted LotArea variable
fit_exp <- fitdistr(shifted_LotArea, "exponential")

# Print the fitted parameters
print(fit_exp)
##        rate    
##   8.461507e-05 
##  (2.215236e-06)

The estimated rate parameter \(\lambda\) = 8.461792e-05 suggests that the shifted LotArea values follow an exponential distribution, with an average rate of occurrence of approximately 8.461792e-05 per unit. The small standard error (2.214552e-06) indicates a relatively precise estimation of the rate parameter.

Now, let’s find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value.

# Optimal value of λ using MLE
optimal_lambda <- fit_exp$estimate

# Generate 1000 samples from the exponential distribution using the optimal λ
set.seed(123)  # for reproducibility
samples <- rexp(1000, rate = optimal_lambda)

# Display the first few samples
head(samples)
## [1]  9968.1681  6814.5102 15707.0701   373.1884   664.3140  3740.4827

The optimal \(\lambda\) is very close to the estimated \(\lambda\): optimal lambda (8.46e-05) \(\approx\) estimated lambda (8.461792e-05).

Histogram of the original and shifted Lot Area:

The values in the sample seem very large, we can visualize the data and the fitted distribution to assess the goodness of fit. To do this, we can create a histogram of the data along with the probability density function (PDF) of the fitted distribution:

# Plot histogram of the data
hist(shifted_LotArea, freq = FALSE, breaks = "FD", main = "Histogram of Shifted LotArea",
     xlab = "Shifted LotArea", ylab = "Density")

# Add the probability density function (PDF) of the fitted exponential distribution
curve(dexp(x, rate = optimal_lambda), add = TRUE, col = "red", lwd = 2)

# Add a legend
legend("topright", legend = c("Histogram", "Fitted Exponential PDF"),
       col = c("black", "red"), lwd = c(1, 2), bty = "n")

It seems that the histogram and the PDF closely align, that suggests a good fit. After visualizing the histogram of the shifted LotArea, we can plot a histogram of the original LotArea and compare it to the shifted one.

Let’s plot both histogram, side-to-side, for better comparison:

# Set up the plotting area
par(mfrow = c(1, 2))

# Plot histogram of the original LotArea variable
hist(X, freq = FALSE, breaks = "FD", main = "Histogram of Original LotArea",
     xlab = "Original LotArea", ylab = "Density")

# Plot histogram of the shifted LotArea variable
hist(shifted_LotArea, freq = FALSE, breaks = "FD", main = "Histogram of Shifted LotArea",
     xlab = "Shifted LotArea", ylab = "Density")

# Add the probability density function (PDF) of the fitted exponential distribution to the second plot
curve(dexp(x, rate = optimal_lambda), add = TRUE, col = "red", lwd = 2)

# Add a legend to the second plot
legend("topright", legend = c("Shifted LotArea", "Fitted Exponential PDF"),
       col = c("black", "red"), lwd = c(1, 2), bty = "n")

# Reset the plotting area
par(mfrow = c(1, 1))

Both histograms look almost the same shape, this indicates that the shifting process didn’t have significant impact on the distribution of the variable. So, the goodness of fit of the histogram of shifted LotArea and the similarity in distribution between both, original and shifted Lot Area suggest that the exponential distribution and the shifting process are appropriate for describing the data.

Nest, let’s use the exponential pdf to find the 5th and 95th percentiles using the cumulative distribution function (CDF).

The Cumulative Distribution Function (CDF)

# Calculate the 5th percentile using the cumulative distribution function (CDF)
fifth_percentile <- qexp(0.05, rate = optimal_lambda)

# Calculate the 95th percentile using the cumulative distribution function (CDF)
ninety_fifth_percentile <- qexp(0.95, rate = optimal_lambda)

# Print the results
print(paste("5th percentile:", fifth_percentile))
## [1] "5th percentile: 606.195719551753"
print(paste("95th percentile:", ninety_fifth_percentile))
## [1] "95th percentile: 35404.239537249"

In this step, we are going to generate a 95% confidence interval from the empirical data, assuming normality.

# Calculate the sample mean and standard deviation
sample_mean <- mean(shifted_LotArea)
sample_sd <- sd(shifted_LotArea)

# Calculate the standard error of the mean
se_mean <- sample_sd / sqrt(length(shifted_LotArea))

# Calculate the margin of error for a 95% confidence interval (assuming normality)
margin_of_error <- qnorm(0.975) * se_mean

# Calculate the lower and upper bounds of the confidence interval
lower_bound <- sample_mean - margin_of_error
upper_bound <- sample_mean + margin_of_error

# Print the confidence interval
print(paste("95% Confidence Interval (assuming normality): [", round(lower_bound, 2), ",", round(upper_bound, 2), "]"))
## [1] "95% Confidence Interval (assuming normality): [ 11305.89 , 12330.56 ]"

The interval shown above indicates that we can be 95% confident that the true population mean of the shifted LotArea variable falls within this range.

Now, we are going to provide the empirical 5th percentile and 95th percentile of the data.

# Calculate the empirical 5th percentile
empirical_5th_percentile <- quantile(shifted_LotArea, 0.05)

# Calculate the empirical 95th percentile
empirical_95th_percentile <- quantile(shifted_LotArea, 0.95)

# Print the results
print(paste("Empirical 5th percentile:", empirical_5th_percentile))
## [1] "Empirical 5th percentile: 4608.4"
print(paste("Empirical 95th percentile:", empirical_95th_percentile))
## [1] "Empirical 95th percentile: 18703.3"
  • The empirical 5th percentile of the data is 4612.7 \(ft^2\), which indicates that 5% of the houses in the dataset have a lot area of 4612.7 square feet or smaller.

  • The empirical 95th percentile of the data is 18702.15 \(ft^2\), this indicates that 95% of the houses in the dataset have a lot area of 18702.15 square feet or smaller.

Lot area plays a significant role in determining the sale price of houses, but its impact is influenced by various factors. So, we will perform regression modeling to check the influence of other variables of the data in Sale Price.

Regression Model:

Identifying Potential Predictors:

Let’s compute the correlation matrix for the entire subset that includes only the numerical variables:

# Compute the correlation matrix
correlation_matrix_numerical <- cor(numerical_data)

# Print the correlation matrix
print(correlation_matrix_numerical)
##                          Id   MSSubClass LotFrontage      LotArea OverallQual
## Id             1.0000000000  0.011156478          NA -0.033225519 -0.02836475
## MSSubClass     0.0111564782  1.000000000          NA -0.139781082  0.03262771
## LotFrontage              NA           NA           1           NA          NA
## LotArea       -0.0332255186 -0.139781082          NA  1.000000000  0.10580574
## OverallQual   -0.0283647539  0.032627708          NA  0.105805742  1.00000000
## OverallCond    0.0126089248 -0.059315817          NA -0.005636270 -0.09193234
## YearBuilt     -0.0127127154  0.027850137          NA  0.014227652  0.57232277
## YearRemodAdd  -0.0219976419  0.040581045          NA  0.013788427  0.55068392
## MasVnrArea               NA           NA          NA           NA          NA
## BsmtFinSF1    -0.0050240490 -0.069835749          NA  0.214103131  0.23966597
## BsmtFinSF2    -0.0059676720 -0.065648579          NA  0.111169745 -0.05911869
## BsmtUnfSF     -0.0079397034 -0.140759481          NA -0.002618360  0.30815893
## TotalBsmtSF   -0.0154145661 -0.238518409          NA  0.260833135  0.53780850
## X1stFlrSF      0.0104960410 -0.251758352          NA  0.299474579  0.47622383
## X2ndFlrSF      0.0055898489  0.307885721          NA  0.050985948  0.29549288
## LowQualFinSF  -0.0442299581  0.046473756          NA  0.004778970 -0.03042928
## GrLivArea      0.0082727577  0.074853180          NA  0.263116167  0.59300743
## BsmtFullBath   0.0022885556  0.003491026          NA  0.158154531  0.11109779
## BsmtHalfBath  -0.0201547452 -0.002332535          NA  0.048045571 -0.04015016
## FullBath       0.0055874529  0.131608222          NA  0.126030627  0.55059971
## HalfBath       0.0067838113  0.177354389          NA  0.014259469  0.27345810
## BedroomAbvGr   0.0377185542 -0.023438028          NA  0.119689908  0.10167636
## KitchenAbvGr   0.0029512364  0.281721040          NA -0.017783871 -0.18388223
## TotRmsAbvGrd   0.0272387244  0.040380065          NA  0.190014778  0.42745234
## Fireplaces    -0.0197716324 -0.045569340          NA  0.271364010  0.39676504
## GarageYrBlt              NA           NA          NA           NA          NA
## GarageCars     0.0165696771 -0.040109793          NA  0.154870740  0.60067072
## GarageArea     0.0176337785 -0.098671543          NA  0.180402755  0.56202176
## WoodDeckSF    -0.0296431972 -0.012579358          NA  0.171697687  0.23892339
## OpenPorchSF   -0.0004769113 -0.006100121          NA  0.084773809  0.30881882
## EnclosedPorch  0.0028892179 -0.012036622          NA -0.018339734 -0.11393686
## X3SsnPorch    -0.0466347889 -0.043824549          NA  0.020422830  0.03037057
## ScreenPorch    0.0013302086 -0.026030177          NA  0.043160378  0.06488636
## PoolArea       0.0570439048  0.008282708          NA  0.077672392  0.06516584
## MiscVal       -0.0062424048 -0.007683291          NA  0.038067692 -0.03140621
## MoSold         0.0211721766 -0.013584643          NA  0.001204988  0.07081517
## YrSold         0.0007117940 -0.021407038          NA -0.014261407 -0.02734671
## SalePrice     -0.0219167194 -0.084284135          NA  0.263843354  0.79098160
##                OverallCond    YearBuilt YearRemodAdd MasVnrArea   BsmtFinSF1
## Id             0.012608925 -0.012712715 -0.021997642         NA -0.005024049
## MSSubClass    -0.059315817  0.027850137  0.040581045         NA -0.069835749
## LotFrontage             NA           NA           NA         NA           NA
## LotArea       -0.005636270  0.014227652  0.013788427         NA  0.214103131
## OverallQual   -0.091932343  0.572322769  0.550683924         NA  0.239665966
## OverallCond    1.000000000 -0.375983196  0.073741498         NA -0.046230856
## YearBuilt     -0.375983196  1.000000000  0.592854976         NA  0.249503197
## YearRemodAdd   0.073741498  0.592854976  1.000000000         NA  0.128450547
## MasVnrArea              NA           NA           NA          1           NA
## BsmtFinSF1    -0.046230856  0.249503197  0.128450547         NA  1.000000000
## BsmtFinSF2     0.040229170 -0.049106831 -0.067758514         NA -0.050117400
## BsmtUnfSF     -0.136840570  0.149040392  0.181133087         NA -0.495251469
## TotalBsmtSF   -0.171097515  0.391452002  0.291065583         NA  0.522396052
## X1stFlrSF     -0.144202784  0.281985859  0.240379268         NA  0.445862656
## X2ndFlrSF      0.028942116  0.010307660  0.140023779         NA -0.137078986
## LowQualFinSF   0.025494320 -0.183784344 -0.062419100         NA -0.064502597
## GrLivArea     -0.079685865  0.199009714  0.287388520         NA  0.208171130
## BsmtFullBath  -0.054941515  0.187598550  0.119469879         NA  0.649211754
## BsmtHalfBath   0.117820915 -0.038161806 -0.012337032         NA  0.067418478
## FullBath      -0.194149489  0.468270787  0.439046484         NA  0.058543137
## HalfBath      -0.060769327  0.242655910  0.183330612         NA  0.004262424
## BedroomAbvGr   0.012980060 -0.070651217 -0.040580928         NA -0.107354677
## KitchenAbvGr  -0.087000855 -0.174800246 -0.149597521         NA -0.081006851
## TotRmsAbvGrd  -0.057583166  0.095589128  0.191739816         NA  0.044315624
## Fireplaces    -0.023819978  0.147716399  0.112581318         NA  0.260010920
## GarageYrBlt             NA           NA           NA         NA           NA
## GarageCars    -0.185757511  0.537850092  0.420622155         NA  0.224053522
## GarageArea    -0.151521371  0.478953820  0.371599809         NA  0.296970385
## WoodDeckSF    -0.003333699  0.224880142  0.205725920         NA  0.204306145
## OpenPorchSF   -0.032588814  0.188685840  0.226297633         NA  0.111760613
## EnclosedPorch  0.070356184 -0.387267783 -0.193919147         NA -0.102303306
## X3SsnPorch     0.025503660  0.031354513  0.045285810         NA  0.026450506
## ScreenPorch    0.054810529 -0.050364435 -0.038740011         NA  0.062020623
## PoolArea      -0.001984942  0.004949728  0.005829372         NA  0.140491286
## MiscVal        0.068776806 -0.034383139 -0.010286249         NA  0.003571473
## MoSold        -0.003510839  0.012398471  0.021490002         NA -0.015726948
## YrSold         0.043949746 -0.013617680  0.035743247         NA  0.014358922
## SalePrice     -0.077855894  0.522897333  0.507100967         NA  0.386419806
##                 BsmtFinSF2    BsmtUnfSF   TotalBsmtSF    X1stFlrSF    X2ndFlrSF
## Id            -0.005967672 -0.007939703 -0.0154145661  0.010496041  0.005589849
## MSSubClass    -0.065648579 -0.140759481 -0.2385184093 -0.251758352  0.307885721
## LotFrontage             NA           NA            NA           NA           NA
## LotArea        0.111169745 -0.002618360  0.2608331345  0.299474579  0.050985948
## OverallQual   -0.059118693  0.308158927  0.5378084986  0.476223829  0.295492879
## OverallCond    0.040229170 -0.136840570 -0.1710975146 -0.144202784  0.028942116
## YearBuilt     -0.049106831  0.149040392  0.3914520021  0.281985859  0.010307660
## YearRemodAdd  -0.067758514  0.181133087  0.2910655826  0.240379268  0.140023779
## MasVnrArea              NA           NA            NA           NA           NA
## BsmtFinSF1    -0.050117400 -0.495251469  0.5223960520  0.445862656 -0.137078986
## BsmtFinSF2     1.000000000 -0.209294492  0.1048095376  0.097117448 -0.099260316
## BsmtUnfSF     -0.209294492  1.000000000  0.4153596052  0.317987438  0.004469092
## TotalBsmtSF    0.104809538  0.415359605  1.0000000000  0.819529975 -0.174511950
## X1stFlrSF      0.097117448  0.317987438  0.8195299750  1.000000000 -0.202646181
## X2ndFlrSF     -0.099260316  0.004469092 -0.1745119501 -0.202646181  1.000000000
## LowQualFinSF   0.014806998  0.028166688 -0.0332453873 -0.014240673  0.063352950
## GrLivArea     -0.009639892  0.240257268  0.4548682025  0.566023969  0.687501064
## BsmtFullBath   0.158678061 -0.422900477  0.3073505537  0.244671104 -0.169493952
## BsmtHalfBath   0.070948134 -0.095804288 -0.0003145818  0.001955654 -0.023854784
## FullBath      -0.076443862  0.288886055  0.3237224136  0.380637495  0.421377983
## HalfBath      -0.032147837 -0.041117530 -0.0488037386 -0.119915909  0.609707300
## BedroomAbvGr  -0.015728114  0.166643317  0.0504499555  0.127400749  0.502900613
## KitchenAbvGr  -0.040751236  0.030085868 -0.0689006426  0.068100588  0.059305753
## TotRmsAbvGrd  -0.035226548  0.250647061  0.2855725637  0.409515979  0.616422635
## Fireplaces     0.046920709  0.051574882  0.3395193239  0.410531085  0.194560892
## GarageYrBlt             NA           NA            NA           NA           NA
## GarageCars    -0.038263513  0.214175190  0.4345848343  0.439316808  0.183925583
## GarageArea    -0.018226592  0.183302698  0.4866654638  0.489781654  0.138346959
## WoodDeckSF     0.067898326 -0.005316424  0.2320186091  0.235458623  0.092165418
## OpenPorchSF    0.003092562  0.129005415  0.2472637463  0.211671225  0.208026063
## EnclosedPorch  0.036543339 -0.002537855 -0.0954777367 -0.065291701  0.061988691
## X3SsnPorch    -0.029993398  0.020764006  0.0373837273  0.056104374 -0.024357648
## ScreenPorch    0.088871251 -0.012579273  0.0844889859  0.088758073  0.040606448
## PoolArea       0.041709055 -0.035092241  0.1260531321  0.131524976  0.081486878
## MiscVal        0.004939781 -0.023836645 -0.0184789224 -0.021095719  0.016196875
## MoSold        -0.015210738  0.034888443  0.0131961786  0.031371560  0.035164427
## YrSold         0.031705637 -0.041258195 -0.0149686480 -0.013603771 -0.028699914
## SalePrice     -0.011378121  0.214479106  0.6135805516  0.605852185  0.319333803
##                LowQualFinSF    GrLivArea  BsmtFullBath  BsmtHalfBath
## Id            -0.0442299581  0.008272758  0.0022885556 -0.0201547452
## MSSubClass     0.0464737559  0.074853180  0.0034910258 -0.0023325346
## LotFrontage              NA           NA            NA            NA
## LotArea        0.0047789699  0.263116167  0.1581545311  0.0480455709
## OverallQual   -0.0304292840  0.593007430  0.1110977861 -0.0401501577
## OverallCond    0.0254943199 -0.079685865 -0.0549415154  0.1178209151
## YearBuilt     -0.1837843444  0.199009714  0.1875985500 -0.0381618057
## YearRemodAdd  -0.0624191001  0.287388520  0.1194698791 -0.0123370321
## MasVnrArea               NA           NA            NA            NA
## BsmtFinSF1    -0.0645025969  0.208171130  0.6492117536  0.0674184779
## BsmtFinSF2     0.0148069979 -0.009639892  0.1586780608  0.0709481337
## BsmtUnfSF      0.0281666881  0.240257268 -0.4229004774 -0.0958042882
## TotalBsmtSF   -0.0332453873  0.454868203  0.3073505537 -0.0003145818
## X1stFlrSF     -0.0142406727  0.566023969  0.2446711042  0.0019556536
## X2ndFlrSF      0.0633529501  0.687501064 -0.1694939517 -0.0238547839
## LowQualFinSF   1.0000000000  0.134682813 -0.0471434219 -0.0058415048
## GrLivArea      0.1346828130  1.000000000  0.0348360495 -0.0189184832
## BsmtFullBath  -0.0471434219  0.034836050  1.0000000000 -0.1478709605
## BsmtHalfBath  -0.0058415048 -0.018918483 -0.1478709605  1.0000000000
## FullBath      -0.0007095096  0.630011646 -0.0645120486 -0.0545358120
## HalfBath      -0.0270800493  0.415771636 -0.0309049591 -0.0123399001
## BedroomAbvGr   0.1056065685  0.521269511 -0.1506728092  0.0465188484
## KitchenAbvGr   0.0075217443  0.100063165 -0.0415025464 -0.0379443502
## TotRmsAbvGrd   0.1311847760  0.825489374 -0.0532752361 -0.0238363413
## Fireplaces    -0.0212721434  0.461679134  0.1379277084  0.0289755866
## GarageYrBlt              NA           NA            NA            NA
## GarageCars    -0.0944795202  0.467247419  0.1318812244 -0.0208910590
## GarageArea    -0.0676014132  0.468997477  0.1791894804 -0.0245355796
## WoodDeckSF    -0.0254436480  0.247432821  0.1753151901  0.0401612233
## OpenPorchSF    0.0182510391  0.330223962  0.0673414614 -0.0253237579
## EnclosedPorch  0.0610812378  0.009113210 -0.0499106491 -0.0085553339
## X3SsnPorch    -0.0042956104  0.020643190 -0.0001060915  0.0351136309
## ScreenPorch    0.0267994130  0.101510396  0.0231477258  0.0321214072
## PoolArea       0.0621573723  0.170205336  0.0676155562  0.0200246298
## MiscVal       -0.0037928708 -0.002415640 -0.0230470249 -0.0073665245
## MoSold        -0.0221739606  0.050239681 -0.0253608943  0.0328727052
## YrSold        -0.0289208798 -0.036525820  0.0670491377 -0.0465238818
## SalePrice     -0.0256061300  0.708624478  0.2271222331 -0.0168441543
##                    FullBath     HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd
## Id             0.0055874529  0.006783811  0.037718554  0.002951236  0.027238724
## MSSubClass     0.1316082224  0.177354389 -0.023438028  0.281721040  0.040380065
## LotFrontage              NA           NA           NA           NA           NA
## LotArea        0.1260306265  0.014259469  0.119689908 -0.017783871  0.190014778
## OverallQual    0.5505997094  0.273458099  0.101676356 -0.183882235  0.427452343
## OverallCond   -0.1941494887 -0.060769327  0.012980060 -0.087000855 -0.057583166
## YearBuilt      0.4682707872  0.242655910 -0.070651217 -0.174800246  0.095589128
## YearRemodAdd   0.4390464839  0.183330612 -0.040580928 -0.149597521  0.191739816
## MasVnrArea               NA           NA           NA           NA           NA
## BsmtFinSF1     0.0585431369  0.004262424 -0.107354677 -0.081006851  0.044315624
## BsmtFinSF2    -0.0764438620 -0.032147837 -0.015728114 -0.040751236 -0.035226548
## BsmtUnfSF      0.2888860555 -0.041117530  0.166643317  0.030085868  0.250647061
## TotalBsmtSF    0.3237224136 -0.048803739  0.050449956 -0.068900643  0.285572564
## X1stFlrSF      0.3806374950 -0.119915909  0.127400749  0.068100588  0.409515979
## X2ndFlrSF      0.4213779829  0.609707300  0.502900613  0.059305753  0.616422635
## LowQualFinSF  -0.0007095096 -0.027080049  0.105606569  0.007521744  0.131184776
## GrLivArea      0.6300116463  0.415771636  0.521269511  0.100063165  0.825489374
## BsmtFullBath  -0.0645120486 -0.030904959 -0.150672809 -0.041502546 -0.053275236
## BsmtHalfBath  -0.0545358120 -0.012339900  0.046518848 -0.037944350 -0.023836341
## FullBath       1.0000000000  0.136380589  0.363251983  0.133115214  0.554784254
## HalfBath       0.1363805887  1.000000000  0.226651484 -0.068262549  0.343414858
## BedroomAbvGr   0.3632519830  0.226651484  1.000000000  0.198596758  0.676619936
## KitchenAbvGr   0.1331152142 -0.068262549  0.198596758  1.000000000  0.256045409
## TotRmsAbvGrd   0.5547842535  0.343414858  0.676619936  0.256045409  1.000000000
## Fireplaces     0.2436705031  0.203648508  0.107569681 -0.123936235  0.326114480
## GarageYrBlt              NA           NA           NA           NA           NA
## GarageCars     0.4696720433  0.219178152  0.086106438 -0.050633892  0.362288571
## GarageArea     0.4056562085  0.163549364  0.065252530 -0.064433047  0.337822121
## WoodDeckSF     0.1877032138  0.108080303  0.046853773 -0.090130273  0.165983884
## OpenPorchSF    0.2599774255  0.199740148  0.093809572 -0.070090610  0.234191588
## EnclosedPorch -0.1150929635 -0.095316526  0.041570435  0.037312385  0.004151299
## X3SsnPorch     0.0353530166 -0.004972488 -0.024477796 -0.024600359 -0.006683241
## ScreenPorch   -0.0081060933  0.072425845  0.044299691 -0.051613366  0.059382600
## PoolArea       0.0496038256  0.022381498  0.070702584 -0.014525116  0.083757350
## MiscVal       -0.0142898450  0.001290145  0.007766972  0.062340724  0.024762884
## MoSold         0.0558721290 -0.009049888  0.046543860  0.026588907  0.036907077
## YrSold        -0.0196688407 -0.010268669 -0.036013893  0.031687207 -0.034516354
## SalePrice      0.5606637627  0.284107676  0.168213154 -0.135907371  0.533723156
##                 Fireplaces GarageYrBlt  GarageCars  GarageArea   WoodDeckSF
## Id            -0.019771632          NA  0.01656968  0.01763378 -0.029643197
## MSSubClass    -0.045569340          NA -0.04010979 -0.09867154 -0.012579358
## LotFrontage             NA          NA          NA          NA           NA
## LotArea        0.271364010          NA  0.15487074  0.18040276  0.171697687
## OverallQual    0.396765038          NA  0.60067072  0.56202176  0.238923392
## OverallCond   -0.023819978          NA -0.18575751 -0.15152137 -0.003333699
## YearBuilt      0.147716399          NA  0.53785009  0.47895382  0.224880142
## YearRemodAdd   0.112581318          NA  0.42062215  0.37159981  0.205725920
## MasVnrArea              NA          NA          NA          NA           NA
## BsmtFinSF1     0.260010920          NA  0.22405352  0.29697039  0.204306145
## BsmtFinSF2     0.046920709          NA -0.03826351 -0.01822659  0.067898326
## BsmtUnfSF      0.051574882          NA  0.21417519  0.18330270 -0.005316424
## TotalBsmtSF    0.339519324          NA  0.43458483  0.48666546  0.232018609
## X1stFlrSF      0.410531085          NA  0.43931681  0.48978165  0.235458623
## X2ndFlrSF      0.194560892          NA  0.18392558  0.13834696  0.092165418
## LowQualFinSF  -0.021272143          NA -0.09447952 -0.06760141 -0.025443648
## GrLivArea      0.461679134          NA  0.46724742  0.46899748  0.247432821
## BsmtFullBath   0.137927708          NA  0.13188122  0.17918948  0.175315190
## BsmtHalfBath   0.028975587          NA -0.02089106 -0.02453558  0.040161223
## FullBath       0.243670503          NA  0.46967204  0.40565621  0.187703214
## HalfBath       0.203648508          NA  0.21917815  0.16354936  0.108080303
## BedroomAbvGr   0.107569681          NA  0.08610644  0.06525253  0.046853773
## KitchenAbvGr  -0.123936235          NA -0.05063389 -0.06443305 -0.090130273
## TotRmsAbvGrd   0.326114480          NA  0.36228857  0.33782212  0.165983884
## Fireplaces     1.000000000          NA  0.30078877  0.26914124  0.200018796
## GarageYrBlt             NA           1          NA          NA           NA
## GarageCars     0.300788766          NA  1.00000000  0.88247541  0.226342138
## GarageArea     0.269141238          NA  0.88247541  1.00000000  0.224666307
## WoodDeckSF     0.200018796          NA  0.22634214  0.22466631  1.000000000
## OpenPorchSF    0.169405327          NA  0.21356945  0.24143467  0.058660609
## EnclosedPorch -0.024821869          NA -0.15143416 -0.12177672 -0.125988888
## X3SsnPorch     0.011257239          NA  0.03576529  0.03508670 -0.032770634
## ScreenPorch    0.184530270          NA  0.05049379  0.05141176 -0.074181351
## PoolArea       0.095073522          NA  0.02093353  0.06104727  0.073378207
## MiscVal        0.001408605          NA -0.04308013 -0.02739991 -0.009551228
## MoSold         0.046357102          NA  0.04052173  0.02797380  0.021011044
## YrSold        -0.024095565          NA -0.03911690 -0.02737794  0.022270451
## SalePrice      0.466928837          NA  0.64040920  0.62343144  0.324413445
##                 OpenPorchSF EnclosedPorch    X3SsnPorch  ScreenPorch
## Id            -0.0004769113   0.002889218 -0.0466347889  0.001330209
## MSSubClass    -0.0061001212  -0.012036622 -0.0438245492 -0.026030177
## LotFrontage              NA            NA            NA           NA
## LotArea        0.0847738088  -0.018339734  0.0204228296  0.043160378
## OverallQual    0.3088188234  -0.113936859  0.0303705671  0.064886360
## OverallCond   -0.0325888135   0.070356184  0.0255036600  0.054810529
## YearBuilt      0.1886858400  -0.387267783  0.0313545131 -0.050364435
## YearRemodAdd   0.2262976327  -0.193919147  0.0452858098 -0.038740011
## MasVnrArea               NA            NA            NA           NA
## BsmtFinSF1     0.1117606134  -0.102303306  0.0264505062  0.062020623
## BsmtFinSF2     0.0030925622   0.036543339 -0.0299933980  0.088871251
## BsmtUnfSF      0.1290054146  -0.002537855  0.0207640057 -0.012579273
## TotalBsmtSF    0.2472637463  -0.095477737  0.0373837273  0.084488986
## X1stFlrSF      0.2116712255  -0.065291701  0.0561043745  0.088758073
## X2ndFlrSF      0.2080260632   0.061988691 -0.0243576484  0.040606448
## LowQualFinSF   0.0182510391   0.061081238 -0.0042956104  0.026799413
## GrLivArea      0.3302239617   0.009113210  0.0206431897  0.101510396
## BsmtFullBath   0.0673414614  -0.049910649 -0.0001060915  0.023147726
## BsmtHalfBath  -0.0253237579  -0.008555334  0.0351136309  0.032121407
## FullBath       0.2599774255  -0.115092963  0.0353530166 -0.008106093
## HalfBath       0.1997401475  -0.095316526 -0.0049724884  0.072425845
## BedroomAbvGr   0.0938095716   0.041570435 -0.0244777964  0.044299691
## KitchenAbvGr  -0.0700906099   0.037312385 -0.0246003587 -0.051613366
## TotRmsAbvGrd   0.2341915878   0.004151299 -0.0066832410  0.059382600
## Fireplaces     0.1694053271  -0.024821869  0.0112572390  0.184530270
## GarageYrBlt              NA            NA            NA           NA
## GarageCars     0.2135694456  -0.151434160  0.0357652851  0.050493792
## GarageArea     0.2414346721  -0.121776720  0.0350867002  0.051411762
## WoodDeckSF     0.0586606086  -0.125988888 -0.0327706336 -0.074181351
## OpenPorchSF    1.0000000000  -0.093079318 -0.0058424993  0.074303944
## EnclosedPorch -0.0930793175   1.000000000 -0.0373052828 -0.082864245
## X3SsnPorch    -0.0058424993  -0.037305283  1.0000000000 -0.031435847
## ScreenPorch    0.0743039439  -0.082864245 -0.0314358470  1.000000000
## PoolArea       0.0607621115   0.054202562 -0.0079915489  0.051307395
## MiscVal       -0.0185837390   0.018360600  0.0003539653  0.031945761
## MoSold         0.0712548848  -0.028887266  0.0294737952  0.023216992
## YrSold        -0.0576193601  -0.009915937  0.0186449254  0.010694106
## SalePrice      0.3158562271  -0.128577958  0.0445836653  0.111446571
##                   PoolArea       MiscVal       MoSold       YrSold   SalePrice
## Id             0.057043905 -0.0062424048  0.021172177  0.000711794 -0.02191672
## MSSubClass     0.008282708 -0.0076832913 -0.013584643 -0.021407038 -0.08428414
## LotFrontage             NA            NA           NA           NA          NA
## LotArea        0.077672392  0.0380676920  0.001204988 -0.014261407  0.26384335
## OverallQual    0.065165844 -0.0314062105  0.070815172 -0.027346708  0.79098160
## OverallCond   -0.001984942  0.0687768061 -0.003510839  0.043949746 -0.07785589
## YearBuilt      0.004949728 -0.0343831387  0.012398471 -0.013617680  0.52289733
## YearRemodAdd   0.005829372 -0.0102862488  0.021490002  0.035743247  0.50710097
## MasVnrArea              NA            NA           NA           NA          NA
## BsmtFinSF1     0.140491286  0.0035714735 -0.015726948  0.014358922  0.38641981
## BsmtFinSF2     0.041709055  0.0049397812 -0.015210738  0.031705637 -0.01137812
## BsmtUnfSF     -0.035092241 -0.0238366451  0.034888443 -0.041258195  0.21447911
## TotalBsmtSF    0.126053132 -0.0184789224  0.013196179 -0.014968648  0.61358055
## X1stFlrSF      0.131524976 -0.0210957195  0.031371560 -0.013603771  0.60585218
## X2ndFlrSF      0.081486878  0.0161968746  0.035164427 -0.028699914  0.31933380
## LowQualFinSF   0.062157372 -0.0037928708 -0.022173961 -0.028920880 -0.02560613
## GrLivArea      0.170205336 -0.0024156396  0.050239681 -0.036525820  0.70862448
## BsmtFullBath   0.067615556 -0.0230470249 -0.025360894  0.067049138  0.22712223
## BsmtHalfBath   0.020024630 -0.0073665245  0.032872705 -0.046523882 -0.01684415
## FullBath       0.049603826 -0.0142898450  0.055872129 -0.019668841  0.56066376
## HalfBath       0.022381498  0.0012901448 -0.009049888 -0.010268669  0.28410768
## BedroomAbvGr   0.070702584  0.0077669720  0.046543860 -0.036013893  0.16821315
## KitchenAbvGr  -0.014525116  0.0623407240  0.026588907  0.031687207 -0.13590737
## TotRmsAbvGrd   0.083757350  0.0247628842  0.036907077 -0.034516354  0.53372316
## Fireplaces     0.095073522  0.0014086054  0.046357102 -0.024095565  0.46692884
## GarageYrBlt             NA            NA           NA           NA          NA
## GarageCars     0.020933531 -0.0430801281  0.040521730 -0.039116904  0.64040920
## GarageArea     0.061047272 -0.0273999144  0.027973800 -0.027377940  0.62343144
## WoodDeckSF     0.073378207 -0.0095512282  0.021011044  0.022270451  0.32441344
## OpenPorchSF    0.060762111 -0.0185837390  0.071254885 -0.057619360  0.31585623
## EnclosedPorch  0.054202562  0.0183606001 -0.028887266 -0.009915937 -0.12857796
## X3SsnPorch    -0.007991549  0.0003539653  0.029473795  0.018644925  0.04458367
## ScreenPorch    0.051307395  0.0319457608  0.023216992  0.010694106  0.11144657
## PoolArea       1.000000000  0.0296686509 -0.033736640 -0.059688932  0.09240355
## MiscVal        0.029668651  1.0000000000 -0.006494550  0.004906262 -0.02118958
## MoSold        -0.033736640 -0.0064945502  1.000000000 -0.145721413  0.04643225
## YrSold        -0.059688932  0.0049062625 -0.145721413  1.000000000 -0.02892259
## SalePrice      0.092403549 -0.0211895796  0.046432245 -0.028922585  1.00000000

Let’s visualize the correlation matrix on a heat map and highlight the variables with strong correlation with the sale price:

# Convert the correlation matrix to long format
correlation_long <- melt(correlation_matrix_numerical)

# Plot heatmap
ggplot(correlation_long, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab", 
                       name="Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1)) +
  coord_fixed()

# 3. Highlight the cells corresponding to strong correlations with sale price
strong_correlation_threshold <- 0.6  

# Find variables strongly correlated with sale price
strong_correlation_with_saleprice <- correlation_matrix[,"SalePrice"] >= strong_correlation_threshold

# Highlight cells with strong correlation
ggplot(correlation_long, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab", 
                       name="Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1)) +
  coord_fixed() +
  geom_tile(data = correlation_long[strong_correlation_with_saleprice, ], 
            aes(Var1, Var2), fill = "black", color = "black")

This is hare to read, let’s split the numerical data into subset and plot the heat map of the variables against the SalePrice:

The error that occurs after running the code above (for creating a heat map, the code was hidden to minimize the amount of code displayed in this R Markdown document.) indicates that there maybe row names that are duplicated, also the standard deviation is zero suggests that there are some variables that might have constant value across all observations. Let’s fix this by also handling if there is any missing data, then re-run the code:

# Check for duplicate variable names
duplicated_names <- duplicated(names(numerical_data))
if (any(duplicated_names)) {
  print("Duplicate variable names detected:")
  print(names(numerical_data)[duplicated_names])
  # Rename variables if necessary
  names(numerical_data)[duplicated_names] <- make.unique(names(numerical_data)[duplicated_names])
}
# Identify constant variables
constant_vars <- apply(numerical_data, 2, function(x) {
  all(diff(x, na.rm = TRUE) == 0, na.rm = TRUE)  
})

# Print constant variables
if (any(constant_vars)) {
  print("Constant variables detected:")
  print(names(numerical_data)[constant_vars])
} else {
  print("No constant variables detected.")
}
## [1] "No constant variables detected."
# Remove constant variables from numerical_data
numerical_data <- numerical_data[, !constant_vars, drop = FALSE]

# Check for missing values
missing_values <- apply(numerical_data, 2, function(x) {
  any(is.na(x))
})

# Print variables with missing values
if (any(missing_values)) {
  print("Variables with missing values detected:")
  print(names(numerical_data)[missing_values])
} else {
  print("No missing values detected.")
}
## [1] "Variables with missing values detected:"
## [1] "LotFrontage" "MasVnrArea"  "GarageYrBlt"

Based on the output above, the constant value issue was resolved. Also, there are three variables that have missing values, let’s handle them by replacing them with the mean:

# Impute missing values with mean
numerical_data$LotFrontage[is.na(numerical_data$LotFrontage)] <- mean(numerical_data$LotFrontage, na.rm = TRUE)
numerical_data$MasVnrArea[is.na(numerical_data$MasVnrArea)] <- mean(numerical_data$MasVnrArea, na.rm = TRUE)
numerical_data$GarageYrBlt[is.na(numerical_data$GarageYrBlt)] <- mean(numerical_data$GarageYrBlt, na.rm = TRUE)

# Re-check for missing values
missing_values <- apply(numerical_data, 2, function(x) {
  any(is.na(x))
})

# Print variables with missing values (should be empty now)
if (any(missing_values)) {
  print("Variables with missing values detected:")
  print(names(numerical_data)[missing_values])
} else {
  print("No missing values detected.")
}
## [1] "No missing values detected."
#Calculate the correlation matrix again after handling the missing data and the constant values
correlation_matrix_New <- cor(numerical_data)

print(correlation_matrix_New)

Yay! It worked.

However, it is still hard to read the heat map even though the data was split into 5 then 10 subsets. Let’s run the full linear model using all numerical variables as predictors for SalePrice.

# Fit linear model
Full_model <- lm(SalePrice ~ ., data = numerical_data)

# Summary of the model
summary(Full_model)
## 
## Call:
## lm(formula = SalePrice ~ ., data = numerical_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -470552  -16741   -2026   13943  302726 
## 
## Coefficients: (2 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.593e+05  1.414e+06   0.325 0.745370    
## Id            -1.078e+00  2.187e+00  -0.493 0.622214    
## MSSubClass    -1.812e+02  2.769e+01  -6.544 8.35e-11 ***
## LotFrontage   -5.660e+01  5.180e+01  -1.093 0.274731    
## LotArea        4.283e-01  1.022e-01   4.190 2.95e-05 ***
## OverallQual    1.730e+04  1.189e+03  14.551  < 2e-16 ***
## OverallCond    4.680e+03  1.033e+03   4.530 6.39e-06 ***
## YearBuilt      2.714e+02  6.757e+01   4.016 6.23e-05 ***
## YearRemodAdd   1.350e+02  6.864e+01   1.966 0.049494 *  
## MasVnrArea     3.130e+01  5.961e+00   5.251 1.74e-07 ***
## BsmtFinSF1     1.916e+01  4.670e+00   4.102 4.32e-05 ***
## BsmtFinSF2     8.207e+00  7.061e+00   1.162 0.245312    
## BsmtUnfSF      9.267e+00  4.196e+00   2.208 0.027375 *  
## TotalBsmtSF           NA         NA      NA       NA    
## X1stFlrSF      4.905e+01  5.817e+00   8.433  < 2e-16 ***
## X2ndFlrSF      4.896e+01  4.986e+00   9.819  < 2e-16 ***
## LowQualFinSF   2.502e+01  2.001e+01   1.250 0.211372    
## GrLivArea             NA         NA      NA       NA    
## BsmtFullBath   9.372e+03  2.613e+03   3.587 0.000346 ***
## BsmtHalfBath   2.005e+03  4.093e+03   0.490 0.624240    
## FullBath       3.461e+03  2.838e+03   1.220 0.222839    
## HalfBath      -1.873e+03  2.665e+03  -0.703 0.482317    
## BedroomAbvGr  -1.008e+04  1.703e+03  -5.921 4.01e-09 ***
## KitchenAbvGr  -1.233e+04  5.219e+03  -2.363 0.018264 *  
## TotRmsAbvGrd   5.082e+03  1.238e+03   4.105 4.27e-05 ***
## Fireplaces     3.948e+03  1.778e+03   2.221 0.026524 *  
## GarageYrBlt    1.222e+02  6.963e+01   1.755 0.079554 .  
## GarageCars     1.126e+04  2.876e+03   3.916 9.44e-05 ***
## GarageArea    -4.201e+00  9.953e+00  -0.422 0.673051    
## WoodDeckSF     2.387e+01  8.021e+00   2.976 0.002970 ** 
## OpenPorchSF   -2.910e+00  1.519e+01  -0.192 0.848036    
## EnclosedPorch  1.179e+01  1.687e+01   0.699 0.484819    
## X3SsnPorch     1.975e+01  3.144e+01   0.628 0.530021    
## ScreenPorch    5.596e+01  1.720e+01   3.254 0.001163 ** 
## PoolArea      -2.834e+01  2.386e+01  -1.187 0.235279    
## MiscVal       -7.364e-01  1.855e+00  -0.397 0.691531    
## MoSold        -4.523e+01  3.450e+02  -0.131 0.895733    
## YrSold        -7.764e+02  7.028e+02  -1.105 0.269469    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34760 on 1424 degrees of freedom
## Multiple R-squared:  0.8132, Adjusted R-squared:  0.8086 
## F-statistic: 177.1 on 35 and 1424 DF,  p-value: < 2.2e-16

Based on the summary of the full regression model, the predictors that explain the Sale price are ‘MSSubClass’ (The building class), ‘LotArea’ (Lot size in square feet), ‘OverallQual’ (Overall material and finish quality), ‘OverallCond’ (Overall condition rating), YearBuilt’ (Original construction date), ‘MasVnrArea’ (Masonry veneer area in square feet), ‘BsmtFinSF1’ (Type 1 finished square feet), ‘X1stFlrSF’ (First Floor square feet), ‘X2ndFlrSF’ (Second floor square feet), ‘BsmtFullBath’ (Basement full bathrooms), ‘BedroomAbvGr’ (Number of bedrooms above ground), ‘TotRmsAbvGrd’ (Total rooms above grade (does not include bathrooms)), ‘GarageCars’ (Size of garage in car capacity), ‘WoodDeckSF’ (Wood deck area in square feet), and ‘ScreenPorch’ (Screen porch area in square feet).

Now, let’s perform the backward process elimination:

# Linear Model for the most significant predictors
half_full_model <- lm(SalePrice ~ MSSubClass + LotArea + OverallQual + OverallCond + 
                  YearBuilt + MasVnrArea + BsmtFinSF1 + X1stFlrSF + X2ndFlrSF + 
                  BsmtFullBath + BedroomAbvGr + TotRmsAbvGrd + GarageCars + 
                  WoodDeckSF + ScreenPorch, data = numerical_data)

# Step 1: Get summary of the full model
summary(half_full_model)
## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotArea + OverallQual + 
##     OverallCond + YearBuilt + MasVnrArea + BsmtFinSF1 + X1stFlrSF + 
##     X2ndFlrSF + BsmtFullBath + BedroomAbvGr + TotRmsAbvGrd + 
##     GarageCars + WoodDeckSF + ScreenPorch, data = numerical_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -482920  -16643   -1921   13974  282923 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -8.552e+05  8.803e+04  -9.715  < 2e-16 ***
## MSSubClass   -1.905e+02  2.402e+01  -7.930 4.34e-15 ***
## LotArea       4.430e-01  9.962e-02   4.447 9.37e-06 ***
## OverallQual   1.984e+04  1.089e+03  18.223  < 2e-16 ***
## OverallCond   5.290e+03  9.226e+02   5.733 1.20e-08 ***
## YearBuilt     3.930e+02  4.491e+01   8.752  < 2e-16 ***
## MasVnrArea    2.991e+01  5.880e+00   5.086 4.13e-07 ***
## BsmtFinSF1    1.053e+01  2.968e+00   3.548 0.000400 ***
## X1stFlrSF     5.704e+01  4.504e+00  12.664  < 2e-16 ***
## X2ndFlrSF     5.041e+01  4.005e+00  12.586  < 2e-16 ***
## BsmtFullBath  8.790e+03  2.381e+03   3.691 0.000232 ***
## BedroomAbvGr -1.070e+04  1.646e+03  -6.504 1.07e-10 ***
## TotRmsAbvGrd  4.687e+03  1.170e+03   4.007 6.46e-05 ***
## GarageCars    1.039e+04  1.698e+03   6.121 1.20e-09 ***
## WoodDeckSF    2.765e+01  7.889e+00   3.505 0.000470 ***
## ScreenPorch   5.670e+01  1.675e+01   3.384 0.000733 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34990 on 1444 degrees of freedom
## Multiple R-squared:  0.808,  Adjusted R-squared:  0.806 
## F-statistic: 405.1 on 15 and 1444 DF,  p-value: < 2.2e-16
# Step 2: Backward elimination
while(any(coef(summary(half_full_model))[, 4] > 0.05)) {
  max_p_value_index <- which.max(coef(summary(half_full_model))[, 4])
  if (max_p_value_index == 1) {
    break  # Intercept should not be removed
  }
  half_full_model <- update(half_full_model, . ~ . - names(coef(half_full_model))[max_p_value_index])
}

# Step 3: Final summary of the model after elimination
summary(half_full_model)
## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotArea + OverallQual + 
##     OverallCond + YearBuilt + MasVnrArea + BsmtFinSF1 + X1stFlrSF + 
##     X2ndFlrSF + BsmtFullBath + BedroomAbvGr + TotRmsAbvGrd + 
##     GarageCars + WoodDeckSF + ScreenPorch, data = numerical_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -482920  -16643   -1921   13974  282923 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -8.552e+05  8.803e+04  -9.715  < 2e-16 ***
## MSSubClass   -1.905e+02  2.402e+01  -7.930 4.34e-15 ***
## LotArea       4.430e-01  9.962e-02   4.447 9.37e-06 ***
## OverallQual   1.984e+04  1.089e+03  18.223  < 2e-16 ***
## OverallCond   5.290e+03  9.226e+02   5.733 1.20e-08 ***
## YearBuilt     3.930e+02  4.491e+01   8.752  < 2e-16 ***
## MasVnrArea    2.991e+01  5.880e+00   5.086 4.13e-07 ***
## BsmtFinSF1    1.053e+01  2.968e+00   3.548 0.000400 ***
## X1stFlrSF     5.704e+01  4.504e+00  12.664  < 2e-16 ***
## X2ndFlrSF     5.041e+01  4.005e+00  12.586  < 2e-16 ***
## BsmtFullBath  8.790e+03  2.381e+03   3.691 0.000232 ***
## BedroomAbvGr -1.070e+04  1.646e+03  -6.504 1.07e-10 ***
## TotRmsAbvGrd  4.687e+03  1.170e+03   4.007 6.46e-05 ***
## GarageCars    1.039e+04  1.698e+03   6.121 1.20e-09 ***
## WoodDeckSF    2.765e+01  7.889e+00   3.505 0.000470 ***
## ScreenPorch   5.670e+01  1.675e+01   3.384 0.000733 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34990 on 1444 degrees of freedom
## Multiple R-squared:  0.808,  Adjusted R-squared:  0.806 
## F-statistic: 405.1 on 15 and 1444 DF,  p-value: < 2.2e-16

Residual Analysis:

Residuals vs. Fitted Values Plot

plot(fitted(half_full_model), residuals(half_full_model),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, col = "red")

The residuals are not randomly scattered around 0; the relationship between the predictors and the response variable may not be purely linear.

Normal Q-Q Plot:

qqnorm(residuals(half_full_model))
qqline(residuals(half_full_model))

By looking at the tails of the Q-Q plot, they seem that they are curved, which indicates that the residuals are skewed; they are not symmetrically distributed around zero.

Residuals vs. Predictor Plots

png("residual_plots.png", width = 800, height = 800)
par(mfrow = c(3, 3))
for (predictor in names(half_full_model$model)[-1]) {
  plot(half_full_model$model[[predictor]], residuals(half_full_model),
       xlab = predictor, ylab = "Residuals",
       main = paste("Residuals vs", predictor))
}
dev.off()
## quartz_off_screen 
##                 2
Residual Plots
Residual Plots

Some of the plots show a pattern of vertical lines, that suggests that the residuals don’t have constant vaiance.

Scale-Location Plot:

sqrt_abs_residuals <- sqrt(abs(residuals(half_full_model)))
plot(fitted(half_full_model), sqrt_abs_residuals,
     xlab = "Fitted values", ylab = "Square root of absolute residuals",
     main = "Scale-Location Plot")
abline(h = 0, col = "red")

The points are clustered in the range of [0e+00, 4e+05], this indicates that the spread of residuals is relatively consistent across most of the fitted values. The presence of outliers outside the range mentioned above suggests that there might be specific observations where the model’s predictions are notably inaccurate.

Residuals vs. Leverage Plot:

plot(half_full_model, which = 5)

Predictions using linear model:

First, let’s create a subset of numerical data that includes only the most significant predictors:

# Create the list
predictor_variables <- c("MSSubClass", "LotArea", "OverallQual", "OverallCond", 
                         "YearBuilt", "MasVnrArea", "BsmtFinSF1", "X1stFlrSF", 
                         "X2ndFlrSF", "BsmtFullBath", "BedroomAbvGr", "TotRmsAbvGrd", 
                         "GarageCars", "WoodDeckSF", "ScreenPorch")

# Create a subset of "numerical_data" using the predictor variables
subset_data <- numerical_data[, c("SalePrice", predictor_variables)]

Now, let’s predict the Sale Price using the predictors variables:

# Predict the SalePrice using the half_full_model
predicted_SalePrice <- predict(half_full_model, newdata = numerical_data)

# Create a sequence of Id values starting from 1461
new_Id <- seq(1461, length.out = nrow(numerical_data), by = 1)

# Create a data frame with modified Id and predicted SalePrice
predictions <- data.frame(Id = new_Id, SalePrice = round(predicted_SalePrice, 0))

# View the predicted SalePrice
head(predictions)
##     Id SalePrice
## 1 1461    229830
## 2 1462    190568
## 3 1463    221832
## 4 1464    189561
## 5 1465    291333
## 6 1466    175566

Let’s export the predictions and submit it to Kaggle

# Remove the last row from the predictions data frame
predictions <- head(predictions, -1)

# Check for missing values in the SalePrice column
missing_values <- sum(is.na(predictions$SalePrice))
print(missing_values)
## [1] 0
# Replace missing values in the SalePrice column with the median value
predictions$SalePrice[is.na(predictions$SalePrice)] <- median(predictions$SalePrice, na.rm = TRUE)

# Write predictions data frame to a .csv file
write.csv(predictions, file = "predictions.csv", row.names = FALSE)
str(numerical_data)
## 'data.frame':    1460 obs. of  38 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ LotFrontage  : num  65 80 68 60 84 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ MasVnrArea   : num  196 0 162 0 350 0 186 240 0 0 ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ GarageYrBlt  : num  2003 1976 2001 1998 2000 ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

Exponential Model:

The Residuals vs. Fitted Values plot shows a smile-like shape which suggest that there is non-linear relationship between the predictors and the SalePrice. Let’s try exponential model and see what we can get:

# Step 1: Transform the Response Variable
numerical_data$LogSalePrice <- log(numerical_data$SalePrice)

# Step 2: Fit a Linear Regression Model
exp_model <- lm(LogSalePrice ~ LotArea, data = numerical_data)

# Step 3: Back-Transform the Coefficients
exp_model_coef <- exp(coef(exp_model))

# Print the coefficients
print(exp_model_coef)
##  (Intercept)      LotArea 
## 149604.25226      1.00001

The coefficient is equal to 1.00001 which indicates that the there is no strong exponential relationship between these variables.

# Evaluate model fit
summary(exp_model)
## 
## Call:
## lm(formula = LogSalePrice ~ LotArea, data = numerical_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.53664 -0.23781 -0.02546  0.24052  1.44451 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.192e+01  1.468e-02  811.56   <2e-16 ***
## LotArea     1.030e-05  1.013e-06   10.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3861 on 1458 degrees of freedom
## Multiple R-squared:  0.06621,    Adjusted R-squared:  0.06557 
## F-statistic: 103.4 on 1 and 1458 DF,  p-value: < 2.2e-16

The values above suggest that LotArea is statistically significant in predicting LogSalePrice, as indicated by the low p-value (< 2.2e-16). However, the adjusted R-squared value (0.06557) indicates that the model explains only a small proportion of the variability (6.557%) in LogSalePrice, suggesting that other factors not included in the model may also influence SalePrice.

Let’s plot the results to compare the original response variable with the fitted values obtained from the exponential regression model.

# Plot the fitted values against the original response variable
plot(numerical_data$SalePrice, exp(predict(exp_model)), xlab = "Original Response", ylab = "Fitted Response (Exponential)")

# Perform residual analysis
plot(fitted(exp_model), residuals(exp_model),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")

It seems that the exponential model is not great model for the variables Let’s try polynomial model and see what will get:

Polynomial Model:

# Select predictors
predictors <- c('LotArea', 'OverallQual', 'YearBuilt', 'MSSubClass', 'X1stFlrSF', 'X2ndFlrSF', 'BedroomAbvGr')  
degree <- 2  

# Create polynomial terms for each predictor individually
poly_terms <- lapply(numerical_data[predictors], poly, degree = degree)

# Combine polynomial terms into a single matrix
poly_matrix <- do.call(cbind, poly_terms)

# Fit polynomial regression model
poly_model <- lm(SalePrice ~ poly_matrix, data = numerical_data)

# Print summary of the model
summary(poly_model)
## 
## Call:
## lm(formula = SalePrice ~ poly_matrix, data = numerical_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -344789  -17704     860   15251  261834 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   180921.2      858.9 210.654  < 2e-16 ***
## poly_matrix1  286190.4    35901.7   7.971 3.16e-15 ***
## poly_matrix2  -90027.6    36866.0  -2.442   0.0147 *  
## poly_matrix1 1314488.7    56380.9  23.314  < 2e-16 ***
## poly_matrix2  542398.7    36578.4  14.828  < 2e-16 ***
## poly_matrix1  488417.4    43024.9  11.352  < 2e-16 ***
## poly_matrix2  -42447.8    38947.1  -1.090   0.2759    
## poly_matrix1 -176403.5    40307.7  -4.376 1.29e-05 ***
## poly_matrix2   68710.5    37926.9   1.812   0.0702 .  
## poly_matrix1  968307.0    52562.7  18.422  < 2e-16 ***
## poly_matrix2 -451619.2    35219.1 -12.823  < 2e-16 ***
## poly_matrix1 1056337.5    57262.3  18.447  < 2e-16 ***
## poly_matrix2  237490.0    38142.9   6.226 6.25e-10 ***
## poly_matrix1 -260090.0    43613.7  -5.963 3.10e-09 ***
## poly_matrix2  -25224.2    35522.8  -0.710   0.4778    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32820 on 1445 degrees of freedom
## Multiple R-squared:  0.831,  Adjusted R-squared:  0.8294 
## F-statistic: 507.5 on 14 and 1445 DF,  p-value: < 2.2e-16

Conclusion:

Using linear model at first shows that there are Several predictors that are statistically significant in predicting the sale price, as indicated by their low p-values (< 0.05). These predictors include ‘OverallQual’, ‘OverallCond’, ‘YearBuilt’, ‘X1stFlrSF’, ‘X2ndFlrSF’, ‘BedroomAbvGr’, ‘TotRmsAbvGrd’, ‘GarageCars’, ‘WoodDeckSF’, and ‘ScreenPorch’. By looking at the coefficients, we can see the predictors that have bigger rate of change (or slope). For instance, a one-unit increase in OverallQual is associated with an estimated increase in the sale price of approximately $19,840. The adjusted R-squared in the summary also indicates that 80.6% of the variability in the sale price is explained by the predictors included in the model. This suggests that the model provides a reasonably good fit to the data. So the linear model can be used to predict the sale price of houses based on the values of the predictors included in the model. We can use this equation:

\[ \begin{align} SalePrice = &-8.552^{05} + 1.984^{04} \times OverallQual+ -1.905^{02} \times MSSubClass + 4.430^{01} \times LotArea \\ &+ 5.290^{03} \times OverallCond + 3.930^{02} \times YearBuilt + 2.991^{01} \times MasVnrArea \\ &+ 1.053^{01} \times BsmtFinSF1 + 5.704^{01} \times X1stFlrSF + 5.041^{01} \times X2ndFlrSF \\ &+ 8.790^{03} \times BsmtFullBath + -1.070^{04} \times BedroomAbvGr \\ &+ 4.687^{03} \times TotRmsAbvGrd + 1.039^{04} \times GarageCars \\ &+ 2.765^{01} \times WoodDeckSF + 5.670^{01} \times ScreenPorch + \epsilon \end{align} \]

For the exponential model, the coefficient of ‘LotArea’ is estimated to be \(1.03 \times 10^{-5}\) which indicates that every one square foot increase in LotArea, the predicted natural logarithm of the sale price (LogSalePrice) increases by approximately \(1.03 \times 10^{-5}\) units. The p-value (\(<2.2 \times 10^{-16}\)) suggests that there is a significant relationship between LotArea and LogSalePrice. However, the adjusted R-squared approximately 6.57% of the variability in LogSalePrice is explained by LotArea. While statistically significant. The model can be used to predict the natural logarithm of the LogSalePrice (but not the sale price) based on the value of LotArea. The following equation can be used to predict the sale price:

$LogSalePrice = 11.92 + 1.030^{-05} LotArea + $

For the polynomial model, many of the predictors (the polynomial terms) have statistically significant coefficients, as indicated by the p-values (\(< 0.05\)). In other words, these terms have significant impact on predicting the sale price of the houses. The adjusted R-squared value of 0.8294 indicates that approximately 82.94% of the variability in the sale price is explained by the polynomial terms included in the model. This suggests s good fit of the data. This model can be also used to predict the sale price of houses based on the values of the polynomial terms included in the model.

Kaggle:

Please Find the image of my submission with my score and my profile in Kaggle: