library(knitr) # For formatting tables
library(e1071) # for checking the skewness
library(ggplot2) # For plots
library(reshape2) # To reshape the data
library(randomForest) #for feature importance
library(dplyr)
library(MASS)
library(gridExtra)
library(tibble)
# Subsetthe data to include only the first 1459 rows to meet Kaggle requirement
Train_df_subset <- head(Train_df, 1459)
# Fixng the erros I got when submitting in Kaggle
Train_df_subset$Id <- seq_along(Train_df_subset$SalePrice)
# Export Train_df to CSV file
write.csv(Train_df_subset, "Train_df_subset.csv", row.names = FALSE)
# Display the first few rows of the data:
head(Train_df)
# Display the statistical properties of the data
summary(Train_df)
This data set contains 1460 observations with 81 variables; 39 of those variables are numerical and 43 are categorical
# Check for missing values
sum(is.na(Train_df))
## [1] 6965
# Identify the numerical and categorical data variables
numerical_vars <- sapply(Train_df, is.numeric)
Categorical_vars <- sapply(Train_df, is.character)
# Subset the data to include only quantitative variables
numerical_data <- Train_df[, numerical_vars]
Categorical_data <- Train_df[, Categorical_vars]
# Calculate skewness for quantitative variables
skewness_values <- sapply(numerical_data, skewness)
head(skewness_values)
## Id MSSubClass LotFrontage LotArea OverallQual OverallCond
## 0.0000000 1.4047656 NA 12.1826150 0.2164984 0.6916440
Based on the skewness level above, the LotArea has the greatest positive value, which means that this variable has distribution that is skewed to the right. Now, let’s identify the variables that has positive level of skewness and order them in ascending order, so we can see of our choice of the independent varaible is the right decision:
# Identify variables with positive skewness
positive_skew_vars <- names(skewness_values[skewness_values > 0])
# Sort variables based on skewness values in ascending order
sorted_positive_skew_vars <- positive_skew_vars[order(skewness_values[skewness_values > 0])]
# Print the names of variables with positive skewness sorted in ascending order
print(sorted_positive_skew_vars)
## [1] "FullBath" "YrSold" "GarageArea" "BedroomAbvGr"
## [5] "MoSold" "OverallQual" "BsmtFullBath" "Fireplaces"
## [9] "HalfBath" "TotRmsAbvGrd" "OverallCond" "X2ndFlrSF"
## [13] "BsmtUnfSF" "GrLivArea" "X1stFlrSF" "MSSubClass"
## [17] "TotalBsmtSF" "WoodDeckSF" "BsmtFinSF1" "SalePrice"
## [21] "OpenPorchSF" "EnclosedPorch" "BsmtHalfBath" "ScreenPorch"
## [25] "BsmtFinSF2" "KitchenAbvGr" "LowQualFinSF" "X3SsnPorch"
## [29] "LotArea" "PoolArea" "MiscVal" NA
## [33] NA NA
Based on the calculated skewness, the independent variable I am choosing is Lot Area, so \(X=LotArea\). The dependent variable is Sale Price, so \(Y=SalePrice\).
X <- Train_df_subset$LotArea
Y <- Train_df_subset$SalePrice
I am going to calculate the following probabilities:
P(X>x | Y>y)
P(X>x, Y>y)
P(X<x | Y>y)
Where the small letter “x” is estimated as the 3d quartile of the X variable, and the small letter “y” is estimated as the 2d quartile of the Y variable.
\(P(X>x | Y>y) = \frac{\text {Number of observations where X>x and Y>y}}{\text{Number of observations where Y>y}}\)
First, let’s count the number of observations where both \(X>x\) and \(Y>y\):
# Define the thresholds
x_threshold <- 11602 # 3rd quartile of X (Lot Area)
y_threshold <- 163000 # 2nd quartile or median of Y (Sale Price)
# Filter the data where both X and Y exceed their thresholds
filtered_Train_df <- Train_df[Train_df$LotArea > x_threshold & Train_df$SalePrice > y_threshold, ]
# Count the number of observations in the filtered dataset
num_observations_both_exceed <- nrow(filtered_Train_df)
# Print the count
print(num_observations_both_exceed)
## [1] 276
Now, let’s count the number of observations where \(Y>y\):
# Filter the data where only Y exceeds its threshold
filtered_data_Y_exceeds <- Train_df[Train_df$SalePrice > y_threshold, ]
# Count the number of observations in the filtered dataset
num_observations_Y_exceeds <- nrow(filtered_data_Y_exceeds)
# Print the count
print(num_observations_Y_exceeds)
## [1] 728
So the conditional probability that \(X>x\) given that \(Y>y\) y is \(P(X>x | Y>y) = \fract{276}{728}\)
# Calculate the conditional probability where X>x given Y>y
Cond_Prob1 <- round(num_observations_both_exceed/num_observations_Y_exceeds, 3)
# Display the Conditional probability
print(Cond_Prob1)
## [1] 0.379
\(\boxed {P(X>x | Y>y) \approx 0.379}\)
This probability suggests that given an observation with higher Sale Price of a house that exceeds the 2nd quartile (\(\$163000\)), there is 37.9% chance that those houses have a lot area that exceeds the 3rd quartile (\(11602 ft^2\))
\(P(X>x , Y>y) = \frac{\text {Number of observations where X>x and Y>y}}{\text{Total number of observations}}\)
Based on the calculation on the previous subsection “P(X>x | Y>y)”, we know that the number of observations where \(X>x\) and \(Y>y\) is 276. And the total number of observations in the data set is \(1460\):
# Calculate the probability of both X>x and Y>y
Prob2 <- round(num_observations_both_exceed/1460, 3)
# Print the probability
print(Prob2)
## [1] 0.189
This probability suggests that there is only a chance of 18.9% that houses that are sold with relatively high price (exceeding the 2nd quartile), have also a big lot area that exceeds the 3rd quartile. In other words, there is a low occurrence of houses with both high sale price and large lot area.
\(P(X<x | Y>y) = \frac{\text {Number of observations where X<x and Y>y}}{\text{Number of observations where Y>y}}\)
We Already found that the number of observations where \(X>x\) and \(Y>y\) is \(276\). And the number of observations where \(Y>y\) is \(728\)> To find the number of observations where \(X<x\) and \(Y>y\) would be the difference between both: \(728-276\)
# Number of observations where X<x and Y>y
Obs_X_less_Y_more <- num_observations_Y_exceeds - num_observations_both_exceed
# Probability of observations where X<x and Y>y
Prob3 <- round(Obs_X_less_Y_more / num_observations_Y_exceeds, 3)
# Display the probability
print(Prob3)
## [1] 0.621
This probability indicates that given houses with higher sale price, there is higher chance, 62.1%, that those houses have smaller lot area (less than 3rd quartile).
# Filter the data where both X and Y below their thresholds
filtered_Train_df_both_leq <- Train_df[Train_df$LotArea <= x_threshold & Train_df$SalePrice <= y_threshold, ]
# Count the number of observations in the filtered dataset where both X and Y are less or equal to their thresholds
num_observations_both_lessEq <- nrow(filtered_Train_df_both_leq)
# Print the count
print(num_observations_both_lessEq)
## [1] 643
# Filter the data where X<=x and Y>y
filtered_Train_df_X_leq_Y_g <- Train_df[Train_df$LotArea <= x_threshold & Train_df$SalePrice > y_threshold, ]
# Count the number of observations in the filtered dataset
num_observations_X_leq_Y_g <- nrow(filtered_Train_df_X_leq_Y_g)
# Print the count
print(num_observations_X_leq_Y_g )
## [1] 452
# Filter the data where X>x and Y<=y
filtered_Train_df_X_leq_Y_g <- Train_df[Train_df$LotArea > x_threshold & Train_df$SalePrice <= y_threshold, ]
# Count the number of observations in the filtered dataset
num_observations_X_g_Y_leq <- nrow(filtered_Train_df_X_leq_Y_g)
# Print the count
print(num_observations_X_g_Y_leq )
## [1] 89
Let’s create a table with all counts that we calculated:
# Create the table matrix
table_matrix <- matrix(c(num_observations_both_lessEq, num_observations_X_leq_Y_g,
num_observations_X_g_Y_leq, num_observations_both_exceed),
nrow = 2, byrow = TRUE)
# Add row and column names
rownames(table_matrix) <- c("<=3rd Qu", ">3rd Qu")
colnames(table_matrix) <- c("<=2nd Qu", ">2nd Qu")
# Add totals for each row and column
table_matrix <- cbind(table_matrix, rowSums(table_matrix))
table_matrix <- rbind(table_matrix, colSums(table_matrix))
# Assign the row and column names to the table object
dimnames(table_matrix) <- list("x/y" = c("<=3rd Qu", ">3rd Qu", "Total"),
"x/y" = c("<=2nd Qu", ">2nd Qu", "Total"))
# Convert to a table object
table_num <- as.table(table_matrix)
# Print the table
print(table_num)
## x/y
## x/y <=2nd Qu >2nd Qu Total
## <=3rd Qu 643 452 1095
## >3rd Qu 89 276 365
## Total 732 728 1460
Let A be the new variable counting those observations above the 3d quartile for X, and let B be the new variable counting those observations above the 2d quartile for Y. Does \(P(A|B)=P(A)P(B)\)?
# Display the probability of A given B
Cond_Prob1
## [1] 0.379
# Calculate P(A) x P(B)
Prob_A_and_B <- round(((num_observations_X_g_Y_leq + num_observations_both_exceed) /1460)*(num_observations_Y_exceeds/1460),3)
# Print the probability of A and B
print(Prob_A_and_B)
## [1] 0.125
Since \(P(A|B) \neq P(A) \times P(B)\), we can conclude that A and B are not independent.
let’s evaluate by running a Chi Square test for association:
# Define the calculated number of observations
observed <- matrix(c(643, 452, 89, 276), nrow = 2, byrow = TRUE)
# Perform chi-square test
chi_square_result <- chisq.test(observed)
# Print the result
print(chi_square_result)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: observed
## X-squared = 127.74, df = 1, p-value < 2.2e-16
Based on the chi square result above, we can reject the null hypothesis of independence, meaning that the variables are dependent (as we concluded above mathematically) and there is a significant association between the variables. So, splitting the training data in this fashion doesn’t make them independent.
First let’s plot histograms for the numerical variables in the training data (Train_df)
# Univariate plots for numerical variables
for (col in names(Train_df)[numerical_vars]) {
print(ggplot(Train_df, aes_string(x = col)) +
geom_histogram(fill = "skyblue", color = "black") +
labs(title = paste("Histogram of", col), x = col, y = "Frequency"))
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 259 rows containing non-finite outside the scale range
## (`stat_bin()`).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8 rows containing non-finite outside the scale range
## (`stat_bin()`).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 81 rows containing non-finite outside the scale range
## (`stat_bin()`).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now, i am going to plot box plots for each of the categorical variables against the SalePRice, to visualize which variable has stronger influence on the sale price of houses,
# Convert categorical variables to factors
for (col in names(Categorical_data)) {
Categorical_data[[col]] <- factor(Categorical_data[[col]])
}
# Univariate box plots for categorical variables with SalePrice
for (col in names(Categorical_data)) {
print(ggplot(data = cbind(Categorical_data, numerical_data), aes_string(x = col, y = "SalePrice")) +
geom_boxplot(fill = "skyblue", color = "black") +
labs(title = paste("Boxplot of", col), x = col, y = "Sale Price"))
}
To assess the correlation between the categorical variables and the SalePrice of houses, we can use the Analysis of Variance technique (ANOVA) as follows:
# Convert categorical variables to factors
categorical_data <- lapply(Categorical_data, factor)
# Combine numerical and categorical data
combined_data <- cbind(numerical_data, categorical_data)
# Perform ANOVA for each categorical variable against SalePrice
anova_results <- lapply(categorical_data, function(cat_var) {
anova_result <- aov(SalePrice ~ cat_var, data = combined_data)
return(summary(anova_result))
})
# Extract p-values from ANOVA results
p_values <- lapply(anova_results, function(anova_result) {
return(anova_result[[1]][["Pr(>F)"]][1])
})
# Combine variable names and p-values into a data frame
result_df <- data.frame(Variable = names(p_values), P_Value = unlist(p_values))
# Print the result
print(result_df)
It’s little overwhelming to get any conclusion about the association between those variables and the SalePrice from the table, let’s visualize the results in a bar plot:
Now it is evident that “Neighborhood” component has the tallest bar; the smallest p-value, which indicates there is a strong association between the sale price of the houses and the neighborhood. In other words, there are significant differences in SalePrice across different neighborhoods. On the other hands, the “utilities” has the biggest p-value; there is a weak association.
Now, let’s graph a scatter plot of X, “LotArea” and Y, “SalePrice”:
# Plot SalePrice (Y) against LotArea (X)
plot(Train_df$LotArea, Train_df$SalePrice,
xlab = "LotArea", ylab = "SalePrice",
main = "Scatterplot of LotArea vs SalePrice")
Most of the data points are clustered on the left side of the plot, which indicates that the distribution is skewed. To better visualize the data, we can apply logarithmic transformation to both variables to spread out the data points evenly:
# Log transformation
plot(log(Train_df$LotArea), log(Train_df$SalePrice),
xlab = "Log(LotArea)", ylab = "Log(SalePrice)",
main = "Scatterplot of Log(LotArea) vs Log(SalePrice)")
Now, let’s calculate the 95% confidence interval for the difference in means of two variables (LotArea and SalePrice):
# Perform t-test for the difference in means
t_test_result <- t.test(Train_df$LotArea, Train_df$SalePrice)
# Extract confidence interval from the t-test result
ci_lower <- t_test_result$conf.int[1]
ci_upper <- t_test_result$conf.int[2]
# Print the confidence interval
cat("95% Confidence Interval for the difference in means: (", ci_lower, ", ", ci_upper, ")\n")
## 95% Confidence Interval for the difference in means: ( -174514.7 , -166294.1 )
The interval above indicates that the SalePrice tends to be significantly lower than LotArea within that range.
Now let’s derive a correlation matrix for two of the quantitative variables; LotArea and PriceSale. The variables X and Y are already defined before as LotArea and SalePrice respectively:
# Calculate the correlation coefficient between X and Y
correlation_coefficient <- cor(X, Y)
# Print the correlation coefficient
print(correlation_coefficient)
## [1] 0.2638429
The correlation coefficient is positive, which indicates that as LotArea increases, the SalePrice increases as well. The number, 0.2638434, suggests a weak positive correlation between LotArea and SalePrice.
Correlation matrix for both variables; LotArea and SalePrice:
# Create a subset of your dataset with only the two variables of interest
subset_df <- Train_df[, c("LotArea", "SalePrice")]
# Calculate the correlation matrix
correlation_matrix <- cor(subset_df)
# Print the correlation matrix
print(correlation_matrix)
## LotArea SalePrice
## LotArea 1.0000000 0.2638434
## SalePrice 0.2638434 1.0000000
Now, let’s test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval:
# Test the hypothesis that the correlation between X and Y is 0
cor_test_result <- cor.test(X, Y)
# Print the result
print(cor_test_result)
##
## Pearson's product-moment correlation
##
## data: X and Y
## t = 10.441, df = 1457, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2154402 0.3109524
## sample estimates:
## cor
## 0.2638429
# Extract the confidence interval
ci_lower <- cor_test_result$conf.int[1]
ci_upper <- cor_test_result$conf.int[2]
# Print the 99% confidence interval
cat("99% Confidence Interval for the correlation coefficient:", ci_lower, "to", ci_upper, "\n")
## 99% Confidence Interval for the correlation coefficient: 0.2154402 to 0.3109524
The p-value, 2.2e-16, is very small which indicates strong evidence against the null hypothesis (that the correlation between these variables is 0). Therefore, we reject the null hypothesis, suggesting that the correlation between LotArea and SalePrice is statistically significant. The 99% Confidence Interval for the correlation coefficient: 0.2154574 to 0.3109369, emphasizes that the correlation coefficient is not 0. Meaning that the true correlation coefficient falls within the indicated interval. The “sample estimates”, estimates the correlation coefficient which I calculated before; 0.2638434.
Next, we are going to invert the correlation matrix, which is going to be the precision matrix and it will contain variance inflation factors on the diagonal.
# Invert the correlation matrix
inverted_matrix <- solve(correlation_matrix)
# Print the inverted matrix
print(inverted_matrix)
## LotArea SalePrice
## LotArea 1.0748219 -0.2835846
## SalePrice -0.2835846 1.0748219
First, let’s multiply both correlation matrices, then conduct the PCA:
# Multiply correlation matrix by inverted matrix
cor_by_invert <- correlation_matrix %*% inverted_matrix
# Multiply inverted matrix by correlation matrix
invert_by_cor <- inverted_matrix %*% correlation_matrix
# Principal Components Analysis (PCA)
pca_cor_by_invert <- princomp(cor_by_invert)
pca_invert_by_cor <- princomp(invert_by_cor)
# Summary of PCA results
summary(pca_cor_by_invert)
## Importance of components:
## Comp.1 Comp.2
## Standard deviation 0.7071068 3.725290e-09
## Proportion of Variance 1.0000000 2.775558e-17
## Cumulative Proportion 1.0000000 1.000000e+00
summary(pca_invert_by_cor)
## Importance of components:
## Comp.1 Comp.2
## Standard deviation 0.7071068 0
## Proportion of Variance 1.0000000 0
## Cumulative Proportion 1.0000000 1
For “pca_cor_by_invert”:
Comp.1: This component has a standard deviation of approximately 0.7071068 and explains 100% of the variance.
Comp.2: This component has a standard deviation close to zero (3.725290e-09) and explains a very tiny proportion of the variance (close to zero).
For “pca_invert_by_cor”:
Comp.1: Similar to ‘pca_cor_by_prec’, this component has a standard deviation of approximately 0.7071068 and explains 100% of the variance, capturing all the variability in the data.
Comp.2: This component has a standard deviation of zero and explains none of the variance. It doesn’t contribute meaningfully to the understanding of the data.
Based on the skewness level of the variables of the data ‘Train_df’, the variable ‘LotArea’ has level of skewness of over 12, which indicates that it has a right skewed distribution. Let’s shift it so that the minimum value is above zero. To do that, we can add a constant value to each observation. This constant value should be at least equal to the absolute value of the smallest negative value in the variable, plus a small margin to ensure that the minimum value becomes positive.
# Compute the minimum value of LotArea
min_LotArea <- min(X)
# Compute the shifting constant
shift_constant <- abs(min_LotArea) + 1 # Add a small margin
# Shift LotArea by adding the constant to each observation
shifted_LotArea <- X + shift_constant
The minimum LotArea is 1300, which makes the shift constant is 1301. Let’s check the minimum of the ‘shifted_LotArea’ if it is above 0:
min(shifted_LotArea)
## [1] 2601
Let’s fit an exponential probability density function:
# Fit an exponential distribution to the shifted LotArea variable
fit_exp <- fitdistr(shifted_LotArea, "exponential")
# Print the fitted parameters
print(fit_exp)
## rate
## 8.461507e-05
## (2.215236e-06)
The estimated rate parameter \(\lambda\) = 8.461792e-05 suggests that the shifted LotArea values follow an exponential distribution, with an average rate of occurrence of approximately 8.461792e-05 per unit. The small standard error (2.214552e-06) indicates a relatively precise estimation of the rate parameter.
Now, let’s find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value.
# Optimal value of λ using MLE
optimal_lambda <- fit_exp$estimate
# Generate 1000 samples from the exponential distribution using the optimal λ
set.seed(123) # for reproducibility
samples <- rexp(1000, rate = optimal_lambda)
# Display the first few samples
head(samples)
## [1] 9968.1681 6814.5102 15707.0701 373.1884 664.3140 3740.4827
The optimal \(\lambda\) is very close to the estimated \(\lambda\): optimal lambda (8.46e-05) \(\approx\) estimated lambda (8.461792e-05).
The values in the sample seem very large, we can visualize the data and the fitted distribution to assess the goodness of fit. To do this, we can create a histogram of the data along with the probability density function (PDF) of the fitted distribution:
# Plot histogram of the data
hist(shifted_LotArea, freq = FALSE, breaks = "FD", main = "Histogram of Shifted LotArea",
xlab = "Shifted LotArea", ylab = "Density")
# Add the probability density function (PDF) of the fitted exponential distribution
curve(dexp(x, rate = optimal_lambda), add = TRUE, col = "red", lwd = 2)
# Add a legend
legend("topright", legend = c("Histogram", "Fitted Exponential PDF"),
col = c("black", "red"), lwd = c(1, 2), bty = "n")
It seems that the histogram and the PDF closely align, that suggests a good fit. After visualizing the histogram of the shifted LotArea, we can plot a histogram of the original LotArea and compare it to the shifted one.
Let’s plot both histogram, side-to-side, for better comparison:
# Set up the plotting area
par(mfrow = c(1, 2))
# Plot histogram of the original LotArea variable
hist(X, freq = FALSE, breaks = "FD", main = "Histogram of Original LotArea",
xlab = "Original LotArea", ylab = "Density")
# Plot histogram of the shifted LotArea variable
hist(shifted_LotArea, freq = FALSE, breaks = "FD", main = "Histogram of Shifted LotArea",
xlab = "Shifted LotArea", ylab = "Density")
# Add the probability density function (PDF) of the fitted exponential distribution to the second plot
curve(dexp(x, rate = optimal_lambda), add = TRUE, col = "red", lwd = 2)
# Add a legend to the second plot
legend("topright", legend = c("Shifted LotArea", "Fitted Exponential PDF"),
col = c("black", "red"), lwd = c(1, 2), bty = "n")
# Reset the plotting area
par(mfrow = c(1, 1))
Both histograms look almost the same shape, this indicates that the shifting process didn’t have significant impact on the distribution of the variable. So, the goodness of fit of the histogram of shifted LotArea and the similarity in distribution between both, original and shifted Lot Area suggest that the exponential distribution and the shifting process are appropriate for describing the data.
Nest, let’s use the exponential pdf to find the 5th and 95th percentiles using the cumulative distribution function (CDF).
# Calculate the 5th percentile using the cumulative distribution function (CDF)
fifth_percentile <- qexp(0.05, rate = optimal_lambda)
# Calculate the 95th percentile using the cumulative distribution function (CDF)
ninety_fifth_percentile <- qexp(0.95, rate = optimal_lambda)
# Print the results
print(paste("5th percentile:", fifth_percentile))
## [1] "5th percentile: 606.195719551753"
print(paste("95th percentile:", ninety_fifth_percentile))
## [1] "95th percentile: 35404.239537249"
In this step, we are going to generate a 95% confidence interval from the empirical data, assuming normality.
# Calculate the sample mean and standard deviation
sample_mean <- mean(shifted_LotArea)
sample_sd <- sd(shifted_LotArea)
# Calculate the standard error of the mean
se_mean <- sample_sd / sqrt(length(shifted_LotArea))
# Calculate the margin of error for a 95% confidence interval (assuming normality)
margin_of_error <- qnorm(0.975) * se_mean
# Calculate the lower and upper bounds of the confidence interval
lower_bound <- sample_mean - margin_of_error
upper_bound <- sample_mean + margin_of_error
# Print the confidence interval
print(paste("95% Confidence Interval (assuming normality): [", round(lower_bound, 2), ",", round(upper_bound, 2), "]"))
## [1] "95% Confidence Interval (assuming normality): [ 11305.89 , 12330.56 ]"
The interval shown above indicates that we can be 95% confident that the true population mean of the shifted LotArea variable falls within this range.
Now, we are going to provide the empirical 5th percentile and 95th percentile of the data.
# Calculate the empirical 5th percentile
empirical_5th_percentile <- quantile(shifted_LotArea, 0.05)
# Calculate the empirical 95th percentile
empirical_95th_percentile <- quantile(shifted_LotArea, 0.95)
# Print the results
print(paste("Empirical 5th percentile:", empirical_5th_percentile))
## [1] "Empirical 5th percentile: 4608.4"
print(paste("Empirical 95th percentile:", empirical_95th_percentile))
## [1] "Empirical 95th percentile: 18703.3"
The empirical 5th percentile of the data is 4612.7 \(ft^2\), which indicates that 5% of the houses in the dataset have a lot area of 4612.7 square feet or smaller.
The empirical 95th percentile of the data is 18702.15 \(ft^2\), this indicates that 95% of the houses in the dataset have a lot area of 18702.15 square feet or smaller.
Lot area plays a significant role in determining the sale price of houses, but its impact is influenced by various factors. So, we will perform regression modeling to check the influence of other variables of the data in Sale Price.
Let’s compute the correlation matrix for the entire subset that includes only the numerical variables:
# Compute the correlation matrix
correlation_matrix_numerical <- cor(numerical_data)
# Print the correlation matrix
print(correlation_matrix_numerical)
## Id MSSubClass LotFrontage LotArea OverallQual
## Id 1.0000000000 0.011156478 NA -0.033225519 -0.02836475
## MSSubClass 0.0111564782 1.000000000 NA -0.139781082 0.03262771
## LotFrontage NA NA 1 NA NA
## LotArea -0.0332255186 -0.139781082 NA 1.000000000 0.10580574
## OverallQual -0.0283647539 0.032627708 NA 0.105805742 1.00000000
## OverallCond 0.0126089248 -0.059315817 NA -0.005636270 -0.09193234
## YearBuilt -0.0127127154 0.027850137 NA 0.014227652 0.57232277
## YearRemodAdd -0.0219976419 0.040581045 NA 0.013788427 0.55068392
## MasVnrArea NA NA NA NA NA
## BsmtFinSF1 -0.0050240490 -0.069835749 NA 0.214103131 0.23966597
## BsmtFinSF2 -0.0059676720 -0.065648579 NA 0.111169745 -0.05911869
## BsmtUnfSF -0.0079397034 -0.140759481 NA -0.002618360 0.30815893
## TotalBsmtSF -0.0154145661 -0.238518409 NA 0.260833135 0.53780850
## X1stFlrSF 0.0104960410 -0.251758352 NA 0.299474579 0.47622383
## X2ndFlrSF 0.0055898489 0.307885721 NA 0.050985948 0.29549288
## LowQualFinSF -0.0442299581 0.046473756 NA 0.004778970 -0.03042928
## GrLivArea 0.0082727577 0.074853180 NA 0.263116167 0.59300743
## BsmtFullBath 0.0022885556 0.003491026 NA 0.158154531 0.11109779
## BsmtHalfBath -0.0201547452 -0.002332535 NA 0.048045571 -0.04015016
## FullBath 0.0055874529 0.131608222 NA 0.126030627 0.55059971
## HalfBath 0.0067838113 0.177354389 NA 0.014259469 0.27345810
## BedroomAbvGr 0.0377185542 -0.023438028 NA 0.119689908 0.10167636
## KitchenAbvGr 0.0029512364 0.281721040 NA -0.017783871 -0.18388223
## TotRmsAbvGrd 0.0272387244 0.040380065 NA 0.190014778 0.42745234
## Fireplaces -0.0197716324 -0.045569340 NA 0.271364010 0.39676504
## GarageYrBlt NA NA NA NA NA
## GarageCars 0.0165696771 -0.040109793 NA 0.154870740 0.60067072
## GarageArea 0.0176337785 -0.098671543 NA 0.180402755 0.56202176
## WoodDeckSF -0.0296431972 -0.012579358 NA 0.171697687 0.23892339
## OpenPorchSF -0.0004769113 -0.006100121 NA 0.084773809 0.30881882
## EnclosedPorch 0.0028892179 -0.012036622 NA -0.018339734 -0.11393686
## X3SsnPorch -0.0466347889 -0.043824549 NA 0.020422830 0.03037057
## ScreenPorch 0.0013302086 -0.026030177 NA 0.043160378 0.06488636
## PoolArea 0.0570439048 0.008282708 NA 0.077672392 0.06516584
## MiscVal -0.0062424048 -0.007683291 NA 0.038067692 -0.03140621
## MoSold 0.0211721766 -0.013584643 NA 0.001204988 0.07081517
## YrSold 0.0007117940 -0.021407038 NA -0.014261407 -0.02734671
## SalePrice -0.0219167194 -0.084284135 NA 0.263843354 0.79098160
## OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1
## Id 0.012608925 -0.012712715 -0.021997642 NA -0.005024049
## MSSubClass -0.059315817 0.027850137 0.040581045 NA -0.069835749
## LotFrontage NA NA NA NA NA
## LotArea -0.005636270 0.014227652 0.013788427 NA 0.214103131
## OverallQual -0.091932343 0.572322769 0.550683924 NA 0.239665966
## OverallCond 1.000000000 -0.375983196 0.073741498 NA -0.046230856
## YearBuilt -0.375983196 1.000000000 0.592854976 NA 0.249503197
## YearRemodAdd 0.073741498 0.592854976 1.000000000 NA 0.128450547
## MasVnrArea NA NA NA 1 NA
## BsmtFinSF1 -0.046230856 0.249503197 0.128450547 NA 1.000000000
## BsmtFinSF2 0.040229170 -0.049106831 -0.067758514 NA -0.050117400
## BsmtUnfSF -0.136840570 0.149040392 0.181133087 NA -0.495251469
## TotalBsmtSF -0.171097515 0.391452002 0.291065583 NA 0.522396052
## X1stFlrSF -0.144202784 0.281985859 0.240379268 NA 0.445862656
## X2ndFlrSF 0.028942116 0.010307660 0.140023779 NA -0.137078986
## LowQualFinSF 0.025494320 -0.183784344 -0.062419100 NA -0.064502597
## GrLivArea -0.079685865 0.199009714 0.287388520 NA 0.208171130
## BsmtFullBath -0.054941515 0.187598550 0.119469879 NA 0.649211754
## BsmtHalfBath 0.117820915 -0.038161806 -0.012337032 NA 0.067418478
## FullBath -0.194149489 0.468270787 0.439046484 NA 0.058543137
## HalfBath -0.060769327 0.242655910 0.183330612 NA 0.004262424
## BedroomAbvGr 0.012980060 -0.070651217 -0.040580928 NA -0.107354677
## KitchenAbvGr -0.087000855 -0.174800246 -0.149597521 NA -0.081006851
## TotRmsAbvGrd -0.057583166 0.095589128 0.191739816 NA 0.044315624
## Fireplaces -0.023819978 0.147716399 0.112581318 NA 0.260010920
## GarageYrBlt NA NA NA NA NA
## GarageCars -0.185757511 0.537850092 0.420622155 NA 0.224053522
## GarageArea -0.151521371 0.478953820 0.371599809 NA 0.296970385
## WoodDeckSF -0.003333699 0.224880142 0.205725920 NA 0.204306145
## OpenPorchSF -0.032588814 0.188685840 0.226297633 NA 0.111760613
## EnclosedPorch 0.070356184 -0.387267783 -0.193919147 NA -0.102303306
## X3SsnPorch 0.025503660 0.031354513 0.045285810 NA 0.026450506
## ScreenPorch 0.054810529 -0.050364435 -0.038740011 NA 0.062020623
## PoolArea -0.001984942 0.004949728 0.005829372 NA 0.140491286
## MiscVal 0.068776806 -0.034383139 -0.010286249 NA 0.003571473
## MoSold -0.003510839 0.012398471 0.021490002 NA -0.015726948
## YrSold 0.043949746 -0.013617680 0.035743247 NA 0.014358922
## SalePrice -0.077855894 0.522897333 0.507100967 NA 0.386419806
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF X1stFlrSF X2ndFlrSF
## Id -0.005967672 -0.007939703 -0.0154145661 0.010496041 0.005589849
## MSSubClass -0.065648579 -0.140759481 -0.2385184093 -0.251758352 0.307885721
## LotFrontage NA NA NA NA NA
## LotArea 0.111169745 -0.002618360 0.2608331345 0.299474579 0.050985948
## OverallQual -0.059118693 0.308158927 0.5378084986 0.476223829 0.295492879
## OverallCond 0.040229170 -0.136840570 -0.1710975146 -0.144202784 0.028942116
## YearBuilt -0.049106831 0.149040392 0.3914520021 0.281985859 0.010307660
## YearRemodAdd -0.067758514 0.181133087 0.2910655826 0.240379268 0.140023779
## MasVnrArea NA NA NA NA NA
## BsmtFinSF1 -0.050117400 -0.495251469 0.5223960520 0.445862656 -0.137078986
## BsmtFinSF2 1.000000000 -0.209294492 0.1048095376 0.097117448 -0.099260316
## BsmtUnfSF -0.209294492 1.000000000 0.4153596052 0.317987438 0.004469092
## TotalBsmtSF 0.104809538 0.415359605 1.0000000000 0.819529975 -0.174511950
## X1stFlrSF 0.097117448 0.317987438 0.8195299750 1.000000000 -0.202646181
## X2ndFlrSF -0.099260316 0.004469092 -0.1745119501 -0.202646181 1.000000000
## LowQualFinSF 0.014806998 0.028166688 -0.0332453873 -0.014240673 0.063352950
## GrLivArea -0.009639892 0.240257268 0.4548682025 0.566023969 0.687501064
## BsmtFullBath 0.158678061 -0.422900477 0.3073505537 0.244671104 -0.169493952
## BsmtHalfBath 0.070948134 -0.095804288 -0.0003145818 0.001955654 -0.023854784
## FullBath -0.076443862 0.288886055 0.3237224136 0.380637495 0.421377983
## HalfBath -0.032147837 -0.041117530 -0.0488037386 -0.119915909 0.609707300
## BedroomAbvGr -0.015728114 0.166643317 0.0504499555 0.127400749 0.502900613
## KitchenAbvGr -0.040751236 0.030085868 -0.0689006426 0.068100588 0.059305753
## TotRmsAbvGrd -0.035226548 0.250647061 0.2855725637 0.409515979 0.616422635
## Fireplaces 0.046920709 0.051574882 0.3395193239 0.410531085 0.194560892
## GarageYrBlt NA NA NA NA NA
## GarageCars -0.038263513 0.214175190 0.4345848343 0.439316808 0.183925583
## GarageArea -0.018226592 0.183302698 0.4866654638 0.489781654 0.138346959
## WoodDeckSF 0.067898326 -0.005316424 0.2320186091 0.235458623 0.092165418
## OpenPorchSF 0.003092562 0.129005415 0.2472637463 0.211671225 0.208026063
## EnclosedPorch 0.036543339 -0.002537855 -0.0954777367 -0.065291701 0.061988691
## X3SsnPorch -0.029993398 0.020764006 0.0373837273 0.056104374 -0.024357648
## ScreenPorch 0.088871251 -0.012579273 0.0844889859 0.088758073 0.040606448
## PoolArea 0.041709055 -0.035092241 0.1260531321 0.131524976 0.081486878
## MiscVal 0.004939781 -0.023836645 -0.0184789224 -0.021095719 0.016196875
## MoSold -0.015210738 0.034888443 0.0131961786 0.031371560 0.035164427
## YrSold 0.031705637 -0.041258195 -0.0149686480 -0.013603771 -0.028699914
## SalePrice -0.011378121 0.214479106 0.6135805516 0.605852185 0.319333803
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## Id -0.0442299581 0.008272758 0.0022885556 -0.0201547452
## MSSubClass 0.0464737559 0.074853180 0.0034910258 -0.0023325346
## LotFrontage NA NA NA NA
## LotArea 0.0047789699 0.263116167 0.1581545311 0.0480455709
## OverallQual -0.0304292840 0.593007430 0.1110977861 -0.0401501577
## OverallCond 0.0254943199 -0.079685865 -0.0549415154 0.1178209151
## YearBuilt -0.1837843444 0.199009714 0.1875985500 -0.0381618057
## YearRemodAdd -0.0624191001 0.287388520 0.1194698791 -0.0123370321
## MasVnrArea NA NA NA NA
## BsmtFinSF1 -0.0645025969 0.208171130 0.6492117536 0.0674184779
## BsmtFinSF2 0.0148069979 -0.009639892 0.1586780608 0.0709481337
## BsmtUnfSF 0.0281666881 0.240257268 -0.4229004774 -0.0958042882
## TotalBsmtSF -0.0332453873 0.454868203 0.3073505537 -0.0003145818
## X1stFlrSF -0.0142406727 0.566023969 0.2446711042 0.0019556536
## X2ndFlrSF 0.0633529501 0.687501064 -0.1694939517 -0.0238547839
## LowQualFinSF 1.0000000000 0.134682813 -0.0471434219 -0.0058415048
## GrLivArea 0.1346828130 1.000000000 0.0348360495 -0.0189184832
## BsmtFullBath -0.0471434219 0.034836050 1.0000000000 -0.1478709605
## BsmtHalfBath -0.0058415048 -0.018918483 -0.1478709605 1.0000000000
## FullBath -0.0007095096 0.630011646 -0.0645120486 -0.0545358120
## HalfBath -0.0270800493 0.415771636 -0.0309049591 -0.0123399001
## BedroomAbvGr 0.1056065685 0.521269511 -0.1506728092 0.0465188484
## KitchenAbvGr 0.0075217443 0.100063165 -0.0415025464 -0.0379443502
## TotRmsAbvGrd 0.1311847760 0.825489374 -0.0532752361 -0.0238363413
## Fireplaces -0.0212721434 0.461679134 0.1379277084 0.0289755866
## GarageYrBlt NA NA NA NA
## GarageCars -0.0944795202 0.467247419 0.1318812244 -0.0208910590
## GarageArea -0.0676014132 0.468997477 0.1791894804 -0.0245355796
## WoodDeckSF -0.0254436480 0.247432821 0.1753151901 0.0401612233
## OpenPorchSF 0.0182510391 0.330223962 0.0673414614 -0.0253237579
## EnclosedPorch 0.0610812378 0.009113210 -0.0499106491 -0.0085553339
## X3SsnPorch -0.0042956104 0.020643190 -0.0001060915 0.0351136309
## ScreenPorch 0.0267994130 0.101510396 0.0231477258 0.0321214072
## PoolArea 0.0621573723 0.170205336 0.0676155562 0.0200246298
## MiscVal -0.0037928708 -0.002415640 -0.0230470249 -0.0073665245
## MoSold -0.0221739606 0.050239681 -0.0253608943 0.0328727052
## YrSold -0.0289208798 -0.036525820 0.0670491377 -0.0465238818
## SalePrice -0.0256061300 0.708624478 0.2271222331 -0.0168441543
## FullBath HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd
## Id 0.0055874529 0.006783811 0.037718554 0.002951236 0.027238724
## MSSubClass 0.1316082224 0.177354389 -0.023438028 0.281721040 0.040380065
## LotFrontage NA NA NA NA NA
## LotArea 0.1260306265 0.014259469 0.119689908 -0.017783871 0.190014778
## OverallQual 0.5505997094 0.273458099 0.101676356 -0.183882235 0.427452343
## OverallCond -0.1941494887 -0.060769327 0.012980060 -0.087000855 -0.057583166
## YearBuilt 0.4682707872 0.242655910 -0.070651217 -0.174800246 0.095589128
## YearRemodAdd 0.4390464839 0.183330612 -0.040580928 -0.149597521 0.191739816
## MasVnrArea NA NA NA NA NA
## BsmtFinSF1 0.0585431369 0.004262424 -0.107354677 -0.081006851 0.044315624
## BsmtFinSF2 -0.0764438620 -0.032147837 -0.015728114 -0.040751236 -0.035226548
## BsmtUnfSF 0.2888860555 -0.041117530 0.166643317 0.030085868 0.250647061
## TotalBsmtSF 0.3237224136 -0.048803739 0.050449956 -0.068900643 0.285572564
## X1stFlrSF 0.3806374950 -0.119915909 0.127400749 0.068100588 0.409515979
## X2ndFlrSF 0.4213779829 0.609707300 0.502900613 0.059305753 0.616422635
## LowQualFinSF -0.0007095096 -0.027080049 0.105606569 0.007521744 0.131184776
## GrLivArea 0.6300116463 0.415771636 0.521269511 0.100063165 0.825489374
## BsmtFullBath -0.0645120486 -0.030904959 -0.150672809 -0.041502546 -0.053275236
## BsmtHalfBath -0.0545358120 -0.012339900 0.046518848 -0.037944350 -0.023836341
## FullBath 1.0000000000 0.136380589 0.363251983 0.133115214 0.554784254
## HalfBath 0.1363805887 1.000000000 0.226651484 -0.068262549 0.343414858
## BedroomAbvGr 0.3632519830 0.226651484 1.000000000 0.198596758 0.676619936
## KitchenAbvGr 0.1331152142 -0.068262549 0.198596758 1.000000000 0.256045409
## TotRmsAbvGrd 0.5547842535 0.343414858 0.676619936 0.256045409 1.000000000
## Fireplaces 0.2436705031 0.203648508 0.107569681 -0.123936235 0.326114480
## GarageYrBlt NA NA NA NA NA
## GarageCars 0.4696720433 0.219178152 0.086106438 -0.050633892 0.362288571
## GarageArea 0.4056562085 0.163549364 0.065252530 -0.064433047 0.337822121
## WoodDeckSF 0.1877032138 0.108080303 0.046853773 -0.090130273 0.165983884
## OpenPorchSF 0.2599774255 0.199740148 0.093809572 -0.070090610 0.234191588
## EnclosedPorch -0.1150929635 -0.095316526 0.041570435 0.037312385 0.004151299
## X3SsnPorch 0.0353530166 -0.004972488 -0.024477796 -0.024600359 -0.006683241
## ScreenPorch -0.0081060933 0.072425845 0.044299691 -0.051613366 0.059382600
## PoolArea 0.0496038256 0.022381498 0.070702584 -0.014525116 0.083757350
## MiscVal -0.0142898450 0.001290145 0.007766972 0.062340724 0.024762884
## MoSold 0.0558721290 -0.009049888 0.046543860 0.026588907 0.036907077
## YrSold -0.0196688407 -0.010268669 -0.036013893 0.031687207 -0.034516354
## SalePrice 0.5606637627 0.284107676 0.168213154 -0.135907371 0.533723156
## Fireplaces GarageYrBlt GarageCars GarageArea WoodDeckSF
## Id -0.019771632 NA 0.01656968 0.01763378 -0.029643197
## MSSubClass -0.045569340 NA -0.04010979 -0.09867154 -0.012579358
## LotFrontage NA NA NA NA NA
## LotArea 0.271364010 NA 0.15487074 0.18040276 0.171697687
## OverallQual 0.396765038 NA 0.60067072 0.56202176 0.238923392
## OverallCond -0.023819978 NA -0.18575751 -0.15152137 -0.003333699
## YearBuilt 0.147716399 NA 0.53785009 0.47895382 0.224880142
## YearRemodAdd 0.112581318 NA 0.42062215 0.37159981 0.205725920
## MasVnrArea NA NA NA NA NA
## BsmtFinSF1 0.260010920 NA 0.22405352 0.29697039 0.204306145
## BsmtFinSF2 0.046920709 NA -0.03826351 -0.01822659 0.067898326
## BsmtUnfSF 0.051574882 NA 0.21417519 0.18330270 -0.005316424
## TotalBsmtSF 0.339519324 NA 0.43458483 0.48666546 0.232018609
## X1stFlrSF 0.410531085 NA 0.43931681 0.48978165 0.235458623
## X2ndFlrSF 0.194560892 NA 0.18392558 0.13834696 0.092165418
## LowQualFinSF -0.021272143 NA -0.09447952 -0.06760141 -0.025443648
## GrLivArea 0.461679134 NA 0.46724742 0.46899748 0.247432821
## BsmtFullBath 0.137927708 NA 0.13188122 0.17918948 0.175315190
## BsmtHalfBath 0.028975587 NA -0.02089106 -0.02453558 0.040161223
## FullBath 0.243670503 NA 0.46967204 0.40565621 0.187703214
## HalfBath 0.203648508 NA 0.21917815 0.16354936 0.108080303
## BedroomAbvGr 0.107569681 NA 0.08610644 0.06525253 0.046853773
## KitchenAbvGr -0.123936235 NA -0.05063389 -0.06443305 -0.090130273
## TotRmsAbvGrd 0.326114480 NA 0.36228857 0.33782212 0.165983884
## Fireplaces 1.000000000 NA 0.30078877 0.26914124 0.200018796
## GarageYrBlt NA 1 NA NA NA
## GarageCars 0.300788766 NA 1.00000000 0.88247541 0.226342138
## GarageArea 0.269141238 NA 0.88247541 1.00000000 0.224666307
## WoodDeckSF 0.200018796 NA 0.22634214 0.22466631 1.000000000
## OpenPorchSF 0.169405327 NA 0.21356945 0.24143467 0.058660609
## EnclosedPorch -0.024821869 NA -0.15143416 -0.12177672 -0.125988888
## X3SsnPorch 0.011257239 NA 0.03576529 0.03508670 -0.032770634
## ScreenPorch 0.184530270 NA 0.05049379 0.05141176 -0.074181351
## PoolArea 0.095073522 NA 0.02093353 0.06104727 0.073378207
## MiscVal 0.001408605 NA -0.04308013 -0.02739991 -0.009551228
## MoSold 0.046357102 NA 0.04052173 0.02797380 0.021011044
## YrSold -0.024095565 NA -0.03911690 -0.02737794 0.022270451
## SalePrice 0.466928837 NA 0.64040920 0.62343144 0.324413445
## OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch
## Id -0.0004769113 0.002889218 -0.0466347889 0.001330209
## MSSubClass -0.0061001212 -0.012036622 -0.0438245492 -0.026030177
## LotFrontage NA NA NA NA
## LotArea 0.0847738088 -0.018339734 0.0204228296 0.043160378
## OverallQual 0.3088188234 -0.113936859 0.0303705671 0.064886360
## OverallCond -0.0325888135 0.070356184 0.0255036600 0.054810529
## YearBuilt 0.1886858400 -0.387267783 0.0313545131 -0.050364435
## YearRemodAdd 0.2262976327 -0.193919147 0.0452858098 -0.038740011
## MasVnrArea NA NA NA NA
## BsmtFinSF1 0.1117606134 -0.102303306 0.0264505062 0.062020623
## BsmtFinSF2 0.0030925622 0.036543339 -0.0299933980 0.088871251
## BsmtUnfSF 0.1290054146 -0.002537855 0.0207640057 -0.012579273
## TotalBsmtSF 0.2472637463 -0.095477737 0.0373837273 0.084488986
## X1stFlrSF 0.2116712255 -0.065291701 0.0561043745 0.088758073
## X2ndFlrSF 0.2080260632 0.061988691 -0.0243576484 0.040606448
## LowQualFinSF 0.0182510391 0.061081238 -0.0042956104 0.026799413
## GrLivArea 0.3302239617 0.009113210 0.0206431897 0.101510396
## BsmtFullBath 0.0673414614 -0.049910649 -0.0001060915 0.023147726
## BsmtHalfBath -0.0253237579 -0.008555334 0.0351136309 0.032121407
## FullBath 0.2599774255 -0.115092963 0.0353530166 -0.008106093
## HalfBath 0.1997401475 -0.095316526 -0.0049724884 0.072425845
## BedroomAbvGr 0.0938095716 0.041570435 -0.0244777964 0.044299691
## KitchenAbvGr -0.0700906099 0.037312385 -0.0246003587 -0.051613366
## TotRmsAbvGrd 0.2341915878 0.004151299 -0.0066832410 0.059382600
## Fireplaces 0.1694053271 -0.024821869 0.0112572390 0.184530270
## GarageYrBlt NA NA NA NA
## GarageCars 0.2135694456 -0.151434160 0.0357652851 0.050493792
## GarageArea 0.2414346721 -0.121776720 0.0350867002 0.051411762
## WoodDeckSF 0.0586606086 -0.125988888 -0.0327706336 -0.074181351
## OpenPorchSF 1.0000000000 -0.093079318 -0.0058424993 0.074303944
## EnclosedPorch -0.0930793175 1.000000000 -0.0373052828 -0.082864245
## X3SsnPorch -0.0058424993 -0.037305283 1.0000000000 -0.031435847
## ScreenPorch 0.0743039439 -0.082864245 -0.0314358470 1.000000000
## PoolArea 0.0607621115 0.054202562 -0.0079915489 0.051307395
## MiscVal -0.0185837390 0.018360600 0.0003539653 0.031945761
## MoSold 0.0712548848 -0.028887266 0.0294737952 0.023216992
## YrSold -0.0576193601 -0.009915937 0.0186449254 0.010694106
## SalePrice 0.3158562271 -0.128577958 0.0445836653 0.111446571
## PoolArea MiscVal MoSold YrSold SalePrice
## Id 0.057043905 -0.0062424048 0.021172177 0.000711794 -0.02191672
## MSSubClass 0.008282708 -0.0076832913 -0.013584643 -0.021407038 -0.08428414
## LotFrontage NA NA NA NA NA
## LotArea 0.077672392 0.0380676920 0.001204988 -0.014261407 0.26384335
## OverallQual 0.065165844 -0.0314062105 0.070815172 -0.027346708 0.79098160
## OverallCond -0.001984942 0.0687768061 -0.003510839 0.043949746 -0.07785589
## YearBuilt 0.004949728 -0.0343831387 0.012398471 -0.013617680 0.52289733
## YearRemodAdd 0.005829372 -0.0102862488 0.021490002 0.035743247 0.50710097
## MasVnrArea NA NA NA NA NA
## BsmtFinSF1 0.140491286 0.0035714735 -0.015726948 0.014358922 0.38641981
## BsmtFinSF2 0.041709055 0.0049397812 -0.015210738 0.031705637 -0.01137812
## BsmtUnfSF -0.035092241 -0.0238366451 0.034888443 -0.041258195 0.21447911
## TotalBsmtSF 0.126053132 -0.0184789224 0.013196179 -0.014968648 0.61358055
## X1stFlrSF 0.131524976 -0.0210957195 0.031371560 -0.013603771 0.60585218
## X2ndFlrSF 0.081486878 0.0161968746 0.035164427 -0.028699914 0.31933380
## LowQualFinSF 0.062157372 -0.0037928708 -0.022173961 -0.028920880 -0.02560613
## GrLivArea 0.170205336 -0.0024156396 0.050239681 -0.036525820 0.70862448
## BsmtFullBath 0.067615556 -0.0230470249 -0.025360894 0.067049138 0.22712223
## BsmtHalfBath 0.020024630 -0.0073665245 0.032872705 -0.046523882 -0.01684415
## FullBath 0.049603826 -0.0142898450 0.055872129 -0.019668841 0.56066376
## HalfBath 0.022381498 0.0012901448 -0.009049888 -0.010268669 0.28410768
## BedroomAbvGr 0.070702584 0.0077669720 0.046543860 -0.036013893 0.16821315
## KitchenAbvGr -0.014525116 0.0623407240 0.026588907 0.031687207 -0.13590737
## TotRmsAbvGrd 0.083757350 0.0247628842 0.036907077 -0.034516354 0.53372316
## Fireplaces 0.095073522 0.0014086054 0.046357102 -0.024095565 0.46692884
## GarageYrBlt NA NA NA NA NA
## GarageCars 0.020933531 -0.0430801281 0.040521730 -0.039116904 0.64040920
## GarageArea 0.061047272 -0.0273999144 0.027973800 -0.027377940 0.62343144
## WoodDeckSF 0.073378207 -0.0095512282 0.021011044 0.022270451 0.32441344
## OpenPorchSF 0.060762111 -0.0185837390 0.071254885 -0.057619360 0.31585623
## EnclosedPorch 0.054202562 0.0183606001 -0.028887266 -0.009915937 -0.12857796
## X3SsnPorch -0.007991549 0.0003539653 0.029473795 0.018644925 0.04458367
## ScreenPorch 0.051307395 0.0319457608 0.023216992 0.010694106 0.11144657
## PoolArea 1.000000000 0.0296686509 -0.033736640 -0.059688932 0.09240355
## MiscVal 0.029668651 1.0000000000 -0.006494550 0.004906262 -0.02118958
## MoSold -0.033736640 -0.0064945502 1.000000000 -0.145721413 0.04643225
## YrSold -0.059688932 0.0049062625 -0.145721413 1.000000000 -0.02892259
## SalePrice 0.092403549 -0.0211895796 0.046432245 -0.028922585 1.00000000
Let’s visualize the correlation matrix on a heat map and highlight the variables with strong correlation with the sale price:
# Convert the correlation matrix to long format
correlation_long <- melt(correlation_matrix_numerical)
# Plot heatmap
ggplot(correlation_long, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1)) +
coord_fixed()
# 3. Highlight the cells corresponding to strong correlations with sale price
strong_correlation_threshold <- 0.6
# Find variables strongly correlated with sale price
strong_correlation_with_saleprice <- correlation_matrix[,"SalePrice"] >= strong_correlation_threshold
# Highlight cells with strong correlation
ggplot(correlation_long, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1)) +
coord_fixed() +
geom_tile(data = correlation_long[strong_correlation_with_saleprice, ],
aes(Var1, Var2), fill = "black", color = "black")
This is hare to read, let’s split the numerical data into subset and plot the heat map of the variables against the SalePrice:
The error that occurs after running the code above (for creating a heat map, the code was hidden to minimize the amount of code displayed in this R Markdown document.) indicates that there maybe row names that are duplicated, also the standard deviation is zero suggests that there are some variables that might have constant value across all observations. Let’s fix this by also handling if there is any missing data, then re-run the code:
# Check for duplicate variable names
duplicated_names <- duplicated(names(numerical_data))
if (any(duplicated_names)) {
print("Duplicate variable names detected:")
print(names(numerical_data)[duplicated_names])
# Rename variables if necessary
names(numerical_data)[duplicated_names] <- make.unique(names(numerical_data)[duplicated_names])
}
# Identify constant variables
constant_vars <- apply(numerical_data, 2, function(x) {
all(diff(x, na.rm = TRUE) == 0, na.rm = TRUE)
})
# Print constant variables
if (any(constant_vars)) {
print("Constant variables detected:")
print(names(numerical_data)[constant_vars])
} else {
print("No constant variables detected.")
}
## [1] "No constant variables detected."
# Remove constant variables from numerical_data
numerical_data <- numerical_data[, !constant_vars, drop = FALSE]
# Check for missing values
missing_values <- apply(numerical_data, 2, function(x) {
any(is.na(x))
})
# Print variables with missing values
if (any(missing_values)) {
print("Variables with missing values detected:")
print(names(numerical_data)[missing_values])
} else {
print("No missing values detected.")
}
## [1] "Variables with missing values detected:"
## [1] "LotFrontage" "MasVnrArea" "GarageYrBlt"
Based on the output above, the constant value issue was resolved. Also, there are three variables that have missing values, let’s handle them by replacing them with the mean:
# Impute missing values with mean
numerical_data$LotFrontage[is.na(numerical_data$LotFrontage)] <- mean(numerical_data$LotFrontage, na.rm = TRUE)
numerical_data$MasVnrArea[is.na(numerical_data$MasVnrArea)] <- mean(numerical_data$MasVnrArea, na.rm = TRUE)
numerical_data$GarageYrBlt[is.na(numerical_data$GarageYrBlt)] <- mean(numerical_data$GarageYrBlt, na.rm = TRUE)
# Re-check for missing values
missing_values <- apply(numerical_data, 2, function(x) {
any(is.na(x))
})
# Print variables with missing values (should be empty now)
if (any(missing_values)) {
print("Variables with missing values detected:")
print(names(numerical_data)[missing_values])
} else {
print("No missing values detected.")
}
## [1] "No missing values detected."
#Calculate the correlation matrix again after handling the missing data and the constant values
correlation_matrix_New <- cor(numerical_data)
print(correlation_matrix_New)
Yay! It worked.
However, it is still hard to read the heat map even though the data was split into 5 then 10 subsets. Let’s run the full linear model using all numerical variables as predictors for SalePrice.
# Fit linear model
Full_model <- lm(SalePrice ~ ., data = numerical_data)
# Summary of the model
summary(Full_model)
##
## Call:
## lm(formula = SalePrice ~ ., data = numerical_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -470552 -16741 -2026 13943 302726
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.593e+05 1.414e+06 0.325 0.745370
## Id -1.078e+00 2.187e+00 -0.493 0.622214
## MSSubClass -1.812e+02 2.769e+01 -6.544 8.35e-11 ***
## LotFrontage -5.660e+01 5.180e+01 -1.093 0.274731
## LotArea 4.283e-01 1.022e-01 4.190 2.95e-05 ***
## OverallQual 1.730e+04 1.189e+03 14.551 < 2e-16 ***
## OverallCond 4.680e+03 1.033e+03 4.530 6.39e-06 ***
## YearBuilt 2.714e+02 6.757e+01 4.016 6.23e-05 ***
## YearRemodAdd 1.350e+02 6.864e+01 1.966 0.049494 *
## MasVnrArea 3.130e+01 5.961e+00 5.251 1.74e-07 ***
## BsmtFinSF1 1.916e+01 4.670e+00 4.102 4.32e-05 ***
## BsmtFinSF2 8.207e+00 7.061e+00 1.162 0.245312
## BsmtUnfSF 9.267e+00 4.196e+00 2.208 0.027375 *
## TotalBsmtSF NA NA NA NA
## X1stFlrSF 4.905e+01 5.817e+00 8.433 < 2e-16 ***
## X2ndFlrSF 4.896e+01 4.986e+00 9.819 < 2e-16 ***
## LowQualFinSF 2.502e+01 2.001e+01 1.250 0.211372
## GrLivArea NA NA NA NA
## BsmtFullBath 9.372e+03 2.613e+03 3.587 0.000346 ***
## BsmtHalfBath 2.005e+03 4.093e+03 0.490 0.624240
## FullBath 3.461e+03 2.838e+03 1.220 0.222839
## HalfBath -1.873e+03 2.665e+03 -0.703 0.482317
## BedroomAbvGr -1.008e+04 1.703e+03 -5.921 4.01e-09 ***
## KitchenAbvGr -1.233e+04 5.219e+03 -2.363 0.018264 *
## TotRmsAbvGrd 5.082e+03 1.238e+03 4.105 4.27e-05 ***
## Fireplaces 3.948e+03 1.778e+03 2.221 0.026524 *
## GarageYrBlt 1.222e+02 6.963e+01 1.755 0.079554 .
## GarageCars 1.126e+04 2.876e+03 3.916 9.44e-05 ***
## GarageArea -4.201e+00 9.953e+00 -0.422 0.673051
## WoodDeckSF 2.387e+01 8.021e+00 2.976 0.002970 **
## OpenPorchSF -2.910e+00 1.519e+01 -0.192 0.848036
## EnclosedPorch 1.179e+01 1.687e+01 0.699 0.484819
## X3SsnPorch 1.975e+01 3.144e+01 0.628 0.530021
## ScreenPorch 5.596e+01 1.720e+01 3.254 0.001163 **
## PoolArea -2.834e+01 2.386e+01 -1.187 0.235279
## MiscVal -7.364e-01 1.855e+00 -0.397 0.691531
## MoSold -4.523e+01 3.450e+02 -0.131 0.895733
## YrSold -7.764e+02 7.028e+02 -1.105 0.269469
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34760 on 1424 degrees of freedom
## Multiple R-squared: 0.8132, Adjusted R-squared: 0.8086
## F-statistic: 177.1 on 35 and 1424 DF, p-value: < 2.2e-16
Based on the summary of the full regression model, the predictors that explain the Sale price are ‘MSSubClass’ (The building class), ‘LotArea’ (Lot size in square feet), ‘OverallQual’ (Overall material and finish quality), ‘OverallCond’ (Overall condition rating), YearBuilt’ (Original construction date), ‘MasVnrArea’ (Masonry veneer area in square feet), ‘BsmtFinSF1’ (Type 1 finished square feet), ‘X1stFlrSF’ (First Floor square feet), ‘X2ndFlrSF’ (Second floor square feet), ‘BsmtFullBath’ (Basement full bathrooms), ‘BedroomAbvGr’ (Number of bedrooms above ground), ‘TotRmsAbvGrd’ (Total rooms above grade (does not include bathrooms)), ‘GarageCars’ (Size of garage in car capacity), ‘WoodDeckSF’ (Wood deck area in square feet), and ‘ScreenPorch’ (Screen porch area in square feet).
Now, let’s perform the backward process elimination:
# Linear Model for the most significant predictors
half_full_model <- lm(SalePrice ~ MSSubClass + LotArea + OverallQual + OverallCond +
YearBuilt + MasVnrArea + BsmtFinSF1 + X1stFlrSF + X2ndFlrSF +
BsmtFullBath + BedroomAbvGr + TotRmsAbvGrd + GarageCars +
WoodDeckSF + ScreenPorch, data = numerical_data)
# Step 1: Get summary of the full model
summary(half_full_model)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotArea + OverallQual +
## OverallCond + YearBuilt + MasVnrArea + BsmtFinSF1 + X1stFlrSF +
## X2ndFlrSF + BsmtFullBath + BedroomAbvGr + TotRmsAbvGrd +
## GarageCars + WoodDeckSF + ScreenPorch, data = numerical_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -482920 -16643 -1921 13974 282923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.552e+05 8.803e+04 -9.715 < 2e-16 ***
## MSSubClass -1.905e+02 2.402e+01 -7.930 4.34e-15 ***
## LotArea 4.430e-01 9.962e-02 4.447 9.37e-06 ***
## OverallQual 1.984e+04 1.089e+03 18.223 < 2e-16 ***
## OverallCond 5.290e+03 9.226e+02 5.733 1.20e-08 ***
## YearBuilt 3.930e+02 4.491e+01 8.752 < 2e-16 ***
## MasVnrArea 2.991e+01 5.880e+00 5.086 4.13e-07 ***
## BsmtFinSF1 1.053e+01 2.968e+00 3.548 0.000400 ***
## X1stFlrSF 5.704e+01 4.504e+00 12.664 < 2e-16 ***
## X2ndFlrSF 5.041e+01 4.005e+00 12.586 < 2e-16 ***
## BsmtFullBath 8.790e+03 2.381e+03 3.691 0.000232 ***
## BedroomAbvGr -1.070e+04 1.646e+03 -6.504 1.07e-10 ***
## TotRmsAbvGrd 4.687e+03 1.170e+03 4.007 6.46e-05 ***
## GarageCars 1.039e+04 1.698e+03 6.121 1.20e-09 ***
## WoodDeckSF 2.765e+01 7.889e+00 3.505 0.000470 ***
## ScreenPorch 5.670e+01 1.675e+01 3.384 0.000733 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34990 on 1444 degrees of freedom
## Multiple R-squared: 0.808, Adjusted R-squared: 0.806
## F-statistic: 405.1 on 15 and 1444 DF, p-value: < 2.2e-16
# Step 2: Backward elimination
while(any(coef(summary(half_full_model))[, 4] > 0.05)) {
max_p_value_index <- which.max(coef(summary(half_full_model))[, 4])
if (max_p_value_index == 1) {
break # Intercept should not be removed
}
half_full_model <- update(half_full_model, . ~ . - names(coef(half_full_model))[max_p_value_index])
}
# Step 3: Final summary of the model after elimination
summary(half_full_model)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotArea + OverallQual +
## OverallCond + YearBuilt + MasVnrArea + BsmtFinSF1 + X1stFlrSF +
## X2ndFlrSF + BsmtFullBath + BedroomAbvGr + TotRmsAbvGrd +
## GarageCars + WoodDeckSF + ScreenPorch, data = numerical_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -482920 -16643 -1921 13974 282923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.552e+05 8.803e+04 -9.715 < 2e-16 ***
## MSSubClass -1.905e+02 2.402e+01 -7.930 4.34e-15 ***
## LotArea 4.430e-01 9.962e-02 4.447 9.37e-06 ***
## OverallQual 1.984e+04 1.089e+03 18.223 < 2e-16 ***
## OverallCond 5.290e+03 9.226e+02 5.733 1.20e-08 ***
## YearBuilt 3.930e+02 4.491e+01 8.752 < 2e-16 ***
## MasVnrArea 2.991e+01 5.880e+00 5.086 4.13e-07 ***
## BsmtFinSF1 1.053e+01 2.968e+00 3.548 0.000400 ***
## X1stFlrSF 5.704e+01 4.504e+00 12.664 < 2e-16 ***
## X2ndFlrSF 5.041e+01 4.005e+00 12.586 < 2e-16 ***
## BsmtFullBath 8.790e+03 2.381e+03 3.691 0.000232 ***
## BedroomAbvGr -1.070e+04 1.646e+03 -6.504 1.07e-10 ***
## TotRmsAbvGrd 4.687e+03 1.170e+03 4.007 6.46e-05 ***
## GarageCars 1.039e+04 1.698e+03 6.121 1.20e-09 ***
## WoodDeckSF 2.765e+01 7.889e+00 3.505 0.000470 ***
## ScreenPorch 5.670e+01 1.675e+01 3.384 0.000733 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34990 on 1444 degrees of freedom
## Multiple R-squared: 0.808, Adjusted R-squared: 0.806
## F-statistic: 405.1 on 15 and 1444 DF, p-value: < 2.2e-16
plot(fitted(half_full_model), residuals(half_full_model),
xlab = "Fitted values", ylab = "Residuals",
main = "Residuals vs Fitted")
abline(h = 0, col = "red")
The residuals are not randomly scattered around 0; the relationship between the predictors and the response variable may not be purely linear.
qqnorm(residuals(half_full_model))
qqline(residuals(half_full_model))
By looking at the tails of the Q-Q plot, they seem that they are curved, which indicates that the residuals are skewed; they are not symmetrically distributed around zero.
png("residual_plots.png", width = 800, height = 800)
par(mfrow = c(3, 3))
for (predictor in names(half_full_model$model)[-1]) {
plot(half_full_model$model[[predictor]], residuals(half_full_model),
xlab = predictor, ylab = "Residuals",
main = paste("Residuals vs", predictor))
}
dev.off()
## quartz_off_screen
## 2
Some of the plots show a pattern of vertical lines, that suggests that the residuals don’t have constant vaiance.
sqrt_abs_residuals <- sqrt(abs(residuals(half_full_model)))
plot(fitted(half_full_model), sqrt_abs_residuals,
xlab = "Fitted values", ylab = "Square root of absolute residuals",
main = "Scale-Location Plot")
abline(h = 0, col = "red")
The points are clustered in the range of [0e+00, 4e+05], this indicates that the spread of residuals is relatively consistent across most of the fitted values. The presence of outliers outside the range mentioned above suggests that there might be specific observations where the model’s predictions are notably inaccurate.
plot(half_full_model, which = 5)
First, let’s create a subset of numerical data that includes only the most significant predictors:
# Create the list
predictor_variables <- c("MSSubClass", "LotArea", "OverallQual", "OverallCond",
"YearBuilt", "MasVnrArea", "BsmtFinSF1", "X1stFlrSF",
"X2ndFlrSF", "BsmtFullBath", "BedroomAbvGr", "TotRmsAbvGrd",
"GarageCars", "WoodDeckSF", "ScreenPorch")
# Create a subset of "numerical_data" using the predictor variables
subset_data <- numerical_data[, c("SalePrice", predictor_variables)]
Now, let’s predict the Sale Price using the predictors variables:
# Predict the SalePrice using the half_full_model
predicted_SalePrice <- predict(half_full_model, newdata = numerical_data)
# Create a sequence of Id values starting from 1461
new_Id <- seq(1461, length.out = nrow(numerical_data), by = 1)
# Create a data frame with modified Id and predicted SalePrice
predictions <- data.frame(Id = new_Id, SalePrice = round(predicted_SalePrice, 0))
# View the predicted SalePrice
head(predictions)
## Id SalePrice
## 1 1461 229830
## 2 1462 190568
## 3 1463 221832
## 4 1464 189561
## 5 1465 291333
## 6 1466 175566
Let’s export the predictions and submit it to Kaggle
# Remove the last row from the predictions data frame
predictions <- head(predictions, -1)
# Check for missing values in the SalePrice column
missing_values <- sum(is.na(predictions$SalePrice))
print(missing_values)
## [1] 0
# Replace missing values in the SalePrice column with the median value
predictions$SalePrice[is.na(predictions$SalePrice)] <- median(predictions$SalePrice, na.rm = TRUE)
# Write predictions data frame to a .csv file
write.csv(predictions, file = "predictions.csv", row.names = FALSE)
str(numerical_data)
## 'data.frame': 1460 obs. of 38 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ LotFrontage : num 65 80 68 60 84 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ MasVnrArea : num 196 0 162 0 350 0 186 240 0 0 ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ GarageYrBlt : num 2003 1976 2001 1998 2000 ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
The Residuals vs. Fitted Values plot shows a smile-like shape which suggest that there is non-linear relationship between the predictors and the SalePrice. Let’s try exponential model and see what we can get:
# Step 1: Transform the Response Variable
numerical_data$LogSalePrice <- log(numerical_data$SalePrice)
# Step 2: Fit a Linear Regression Model
exp_model <- lm(LogSalePrice ~ LotArea, data = numerical_data)
# Step 3: Back-Transform the Coefficients
exp_model_coef <- exp(coef(exp_model))
# Print the coefficients
print(exp_model_coef)
## (Intercept) LotArea
## 149604.25226 1.00001
The coefficient is equal to 1.00001 which indicates that the there is no strong exponential relationship between these variables.
# Evaluate model fit
summary(exp_model)
##
## Call:
## lm(formula = LogSalePrice ~ LotArea, data = numerical_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.53664 -0.23781 -0.02546 0.24052 1.44451
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.192e+01 1.468e-02 811.56 <2e-16 ***
## LotArea 1.030e-05 1.013e-06 10.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3861 on 1458 degrees of freedom
## Multiple R-squared: 0.06621, Adjusted R-squared: 0.06557
## F-statistic: 103.4 on 1 and 1458 DF, p-value: < 2.2e-16
The values above suggest that LotArea is statistically significant in predicting LogSalePrice, as indicated by the low p-value (< 2.2e-16). However, the adjusted R-squared value (0.06557) indicates that the model explains only a small proportion of the variability (6.557%) in LogSalePrice, suggesting that other factors not included in the model may also influence SalePrice.
Let’s plot the results to compare the original response variable with the fitted values obtained from the exponential regression model.
# Plot the fitted values against the original response variable
plot(numerical_data$SalePrice, exp(predict(exp_model)), xlab = "Original Response", ylab = "Fitted Response (Exponential)")
# Perform residual analysis
plot(fitted(exp_model), residuals(exp_model),
xlab = "Fitted values", ylab = "Residuals",
main = "Residuals vs Fitted")
It seems that the exponential model is not great model for the variables Let’s try polynomial model and see what will get:
# Select predictors
predictors <- c('LotArea', 'OverallQual', 'YearBuilt', 'MSSubClass', 'X1stFlrSF', 'X2ndFlrSF', 'BedroomAbvGr')
degree <- 2
# Create polynomial terms for each predictor individually
poly_terms <- lapply(numerical_data[predictors], poly, degree = degree)
# Combine polynomial terms into a single matrix
poly_matrix <- do.call(cbind, poly_terms)
# Fit polynomial regression model
poly_model <- lm(SalePrice ~ poly_matrix, data = numerical_data)
# Print summary of the model
summary(poly_model)
##
## Call:
## lm(formula = SalePrice ~ poly_matrix, data = numerical_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -344789 -17704 860 15251 261834
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 180921.2 858.9 210.654 < 2e-16 ***
## poly_matrix1 286190.4 35901.7 7.971 3.16e-15 ***
## poly_matrix2 -90027.6 36866.0 -2.442 0.0147 *
## poly_matrix1 1314488.7 56380.9 23.314 < 2e-16 ***
## poly_matrix2 542398.7 36578.4 14.828 < 2e-16 ***
## poly_matrix1 488417.4 43024.9 11.352 < 2e-16 ***
## poly_matrix2 -42447.8 38947.1 -1.090 0.2759
## poly_matrix1 -176403.5 40307.7 -4.376 1.29e-05 ***
## poly_matrix2 68710.5 37926.9 1.812 0.0702 .
## poly_matrix1 968307.0 52562.7 18.422 < 2e-16 ***
## poly_matrix2 -451619.2 35219.1 -12.823 < 2e-16 ***
## poly_matrix1 1056337.5 57262.3 18.447 < 2e-16 ***
## poly_matrix2 237490.0 38142.9 6.226 6.25e-10 ***
## poly_matrix1 -260090.0 43613.7 -5.963 3.10e-09 ***
## poly_matrix2 -25224.2 35522.8 -0.710 0.4778
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32820 on 1445 degrees of freedom
## Multiple R-squared: 0.831, Adjusted R-squared: 0.8294
## F-statistic: 507.5 on 14 and 1445 DF, p-value: < 2.2e-16
Using linear model at first shows that there are Several predictors that are statistically significant in predicting the sale price, as indicated by their low p-values (< 0.05). These predictors include ‘OverallQual’, ‘OverallCond’, ‘YearBuilt’, ‘X1stFlrSF’, ‘X2ndFlrSF’, ‘BedroomAbvGr’, ‘TotRmsAbvGrd’, ‘GarageCars’, ‘WoodDeckSF’, and ‘ScreenPorch’. By looking at the coefficients, we can see the predictors that have bigger rate of change (or slope). For instance, a one-unit increase in OverallQual is associated with an estimated increase in the sale price of approximately $19,840. The adjusted R-squared in the summary also indicates that 80.6% of the variability in the sale price is explained by the predictors included in the model. This suggests that the model provides a reasonably good fit to the data. So the linear model can be used to predict the sale price of houses based on the values of the predictors included in the model. We can use this equation:
\[ \begin{align} SalePrice = &-8.552^{05} + 1.984^{04} \times OverallQual+ -1.905^{02} \times MSSubClass + 4.430^{01} \times LotArea \\ &+ 5.290^{03} \times OverallCond + 3.930^{02} \times YearBuilt + 2.991^{01} \times MasVnrArea \\ &+ 1.053^{01} \times BsmtFinSF1 + 5.704^{01} \times X1stFlrSF + 5.041^{01} \times X2ndFlrSF \\ &+ 8.790^{03} \times BsmtFullBath + -1.070^{04} \times BedroomAbvGr \\ &+ 4.687^{03} \times TotRmsAbvGrd + 1.039^{04} \times GarageCars \\ &+ 2.765^{01} \times WoodDeckSF + 5.670^{01} \times ScreenPorch + \epsilon \end{align} \]
For the exponential model, the coefficient of ‘LotArea’ is estimated to be \(1.03 \times 10^{-5}\) which indicates that every one square foot increase in LotArea, the predicted natural logarithm of the sale price (LogSalePrice) increases by approximately \(1.03 \times 10^{-5}\) units. The p-value (\(<2.2 \times 10^{-16}\)) suggests that there is a significant relationship between LotArea and LogSalePrice. However, the adjusted R-squared approximately 6.57% of the variability in LogSalePrice is explained by LotArea. While statistically significant. The model can be used to predict the natural logarithm of the LogSalePrice (but not the sale price) based on the value of LotArea. The following equation can be used to predict the sale price:
$LogSalePrice = 11.92 + 1.030^{-05} LotArea + $
For the polynomial model, many of the predictors (the polynomial terms) have statistically significant coefficients, as indicated by the p-values (\(< 0.05\)). In other words, these terms have significant impact on predicting the sale price of the houses. The adjusted R-squared value of 0.8294 indicates that approximately 82.94% of the variability in the sale price is explained by the polynomial terms included in the model. This suggests s good fit of the data. This model can be also used to predict the sale price of houses based on the values of the polynomial terms included in the model.