In this project, housing prices in the Ames area are investigated using the Ames Housing Dataset, with the central question being: what are the key factors influencing housing prices, and how accurately can these prices be predicted based on available data? The focus is on three predictor variables: BsmtFin SF 1 (Basement Type 1 finished square feet), Bsmt Unf SF (Basement Unfinished square feet of basement area), and 1st Flr SF (First Floor square feet). These variables are considered essential aspects of residential properties that may impact their sale prices.
The methodology includes data cleaning and exploration to ensure data integrity, summarize the dataset’s main characteristics, and identify patterns or trends.
Correlation analysis examines the relationships between the predictor variables and the sale prices of the properties, helping to identify potential predictors of housing prices and understand their strengths of association. Regression modeling techniques are utilized to develop predictive models for housing prices based on the selected predictor variables, assessing the models’ performance and interpreting their coefficients to understand the impact of each predictor on housing prices.
Model evaluation is performed to ensure the reliability of the models, evaluating them for multicollinearity, outliers, adherence to model assumptions, and exploration of transformation techniques to improve model performance. Finally, regression for all subsets is employed to identify the best-fitting model based on adjusted R-squared, determining the most influential predictors of housing prices.
By pursuing these objectives and questions, valuable insights into the factors driving housing prices in the Ames area are provided, and a reliable predictive model is developed to aid stakeholders in making informed decisions in the real estate market.
Data Loading and Cleaning
In the project’s first step, the dataset was loaded and its column names were cleaned with the janitor package, which provides functions that streamline preparing data for analysis in R. The `clean_names()` function converts the column names to a consistent, readable format (for example, the first-floor area column becomes `x1st_flr_sf`), which makes them easier to reference in later steps. Descriptive statistics were then calculated to summarize the dataset: the mean, median, first and third quartiles, and minimum and maximum of each variable, along with the number of missing values in each column. Missing values were imputed with the column mean for numerical columns and the mode for factor columns.
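The imputation rule described above can be illustrated on a small example. This is a minimal sketch using base R only; the data frame `toy`, its values, and the helper `impute()` are hypothetical and are not part of the Ames data or the original script.

```r
# Toy data frame (hypothetical values) with missing entries
toy <- data.frame(
  lot_frontage = c(60, NA, 80, 70),
  garage_type  = factor(c("Attchd", "Detchd", NA, "Attchd"))
)

# Most frequent non-missing value of a vector
get_mode <- function(x) {
  ux <- unique(x[!is.na(x)])
  if (length(ux) == 0) return(NA)
  ux[which.max(tabulate(match(x, ux)))]
}

# Mean for numeric columns, mode for factor columns
impute <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else if (is.factor(x)) {
    x[is.na(x)] <- get_mode(x)
  }
  x
}

toy[] <- lapply(toy, impute)
colSums(is.na(toy))  # missing values remaining per column
```

Mean imputation keeps every row but shrinks the variance of the imputed column, so it is worth checking how many values were filled in before interpreting later results.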
The dataset’s structure was also examined to understand the types of variables present and their distributions. Following this, numeric columns were selected for correlation analysis, as these are the variables of interest for understanding the relationships between different features of the dataset and the target variable, in this case, sales price. This selection is based on the assumption that numerical relationships are more straightforward to analyze and interpret than categorical ones. The process of selecting numeric columns for correlation analysis sets the stage for further exploratory data analysis and model development, aiming to uncover the key factors influencing housing prices in the Ames area.
Figure 1: Histogram of sale prices, plotted to visualize their distribution
The histogram shows that the distribution is right-skewed, which means that the tail on the right side of the histogram (the higher sale prices) is longer than the left side. This suggests that a few houses were sold at prices far above the rest.
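The direction of skew can also be checked numerically rather than read off the histogram alone. The sketch below uses made-up prices and a hand-written moment-based `skewness()` helper (both hypothetical, not from the original script); a positive value indicates a longer right tail and a negative value a longer left tail.

```r
# Toy prices (hypothetical values, not taken from the Ames data)
prices <- c(90, 110, 120, 125, 130, 140, 150, 160, 300, 620) * 1000

# Moment-based sample skewness: mean cubed deviation divided by sd^3
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew <- skewness(prices)
skew_direction <- if (skew > 0) {
  "right-skewed"
} else if (skew < 0) {
  "left-skewed"
} else {
  "symmetric"
}
print(skew_direction)

# A mean above the median points in the same direction
mean(prices) > median(prices)
```

Applied to the report's data, the same check would be `skewness(df$sale_price)`.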
Figure 2: A Correlation matrix plot visualizing the correlation between the numeric values.
The correlation matrix plot above visualizes the relationships between numeric variables in the dataset, and here is how to interpret it:
Variables: The rows and columns represent different variables.
Correlation Coefficients: The cells show the correlation coefficients between pairs of variables, ranging from -1 to 1.
Color Coding
Positive Correlation: Blue cells indicate a positive correlation, where one variable increases as the other does.
Negative Correlation: Red cells indicate a negative correlation, where one variable decreases as the other increases.
Intensity: The darker the color, the stronger the correlation.
Overall Interpretation
The plot reveals patterns and dependencies among housing attributes like lot area, overall quality, and basement size.
A strong positive correlation is observed between ‘overall_qual’ and ‘sale_price’, suggesting that higher-quality houses sell for higher prices.
‘gr_liv_area’ and ‘tot_rms_abv_grd’ show a strong positive correlation, indicating that larger living areas are associated with more rooms above ground.
Some variables, such as ‘year_built’ and ‘bsmt_fin_sf_1’, show weaker or no significant correlations, implying they might not be as closely related to other housing attributes in the data set. This analysis provides insight into the factors influencing house prices and features.
Figure 3: The correlation between sale price and all other variables
The following were inferred from the above figure
order: This variable has a weak negative correlation with sale price (-0.03), implying a slight tendency for lower sale prices as the order increases. However, the correlation is not very strong.
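pid: There’s a weak negative correlation (-0.25) with the sale price. Properties with lower pid values tend to have higher sale prices, but pid is a parcel identifier rather than a housing attribute, so this should not be read as an effect of pid itself.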
ms_sub_class: The correlation is somewhat weak (-0.08), indicating a slight negative relationship with sale price.
lot_frontage: Moderate positive correlation (0.34). Properties with larger lot frontage tend to have higher sale prices.
lot_area: Moderate positive correlation (0.27). Sale price tends to increase with larger lot areas.
overall_qual: Strong positive correlation (0.80). Higher overall quality is strongly associated with higher sale prices.
overall_cond: Weak negative correlation (-0.10). A higher overall condition rating is weakly associated with lower sale prices.
year_built: Moderate positive correlation (0.56). Newer properties tend to have higher sale prices.
year_remod_add: Moderate positive correlation (0.53). Properties with more recent remodeling tend to have higher sale prices.
mas_vnr_area: Moderate positive correlation (0.51). More masonry veneer area is associated with higher sale prices.
bsmt_fin_sf_1, bsmt_fin_sf_2, bsmt_unf_sf: Moderate, negligible, and weak positive correlations (0.43, 0.01, 0.18). Larger Type 1 finished and unfinished basement areas are associated with higher sale prices, while Type 2 finished area shows almost no linear association.
total_bsmt_sf: Moderate positive correlation (0.63). Total basement area is positively correlated with sale price.
x1st_flr_sf, x2nd_flr_sf: Moderate positive correlation (0.62, 0.27). Larger first and second-floor areas are associated with higher sale prices.
low_qual_fin_sf: Weak negative correlation (-0.04). Low-quality finished square footage is weakly associated with lower sale prices.
gr_liv_area: Strong positive correlation (0.71). Larger above-ground living area is strongly associated with higher sale prices.
bsmt_full_bath, bsmt_half_bath, full_bath, half_bath: Mostly positive correlations (0.28, -0.04, 0.55, 0.29). More bathrooms generally correlate with higher sale prices; the exception is basement half baths, which show a negligible negative correlation.
bedroom_abv_gr, kitchen_abv_gr: Weak positive and negative correlations (0.14, -0.12). The number of bedrooms has a slight positive association, while the number of kitchens has a slight negative association with sale prices.
tot_rms_abv_grd: Moderate positive correlation (0.50). More total rooms above ground are associated with higher sale prices.
fireplaces: Moderate positive correlation (0.47). More fireplaces are associated with higher sale prices.
garage_yr_blt: Moderate positive correlation (0.51). Newer garages are associated with higher sale prices.
garage_cars, garage_area: Strong positive correlations (0.65, 0.64). Larger garages (in terms of car capacity and area) are strongly associated with higher sale prices.
wood_deck_sf, open_porch_sf, enclosed_porch, x3ssn_porch, screen_porch: Weak to moderate correlations (0.33, 0.31, -0.13, 0.03, 0.11). Larger wood decks and open porches tend to be associated with higher sale prices, three-season and screen porches show little association, and enclosed porches have a slight negative association.
pool_area, misc_val: Negligible positive and negative correlations (0.07, -0.02). Pool area has a very slight positive association with sale prices, while miscellaneous value has a very slight negative association.
mo_sold, yr_sold: Negligible positive and negative correlations (0.04, -0.03). Sale prices show almost no linear association with the month of sale and a very slight negative association with the year of sale.
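Labels such as weak, moderate, and strong are easier to apply consistently when they are tied to explicit cut-offs. The sketch below labels a few of the correlations reported above; the cut-offs (0.3 and 0.7 on the absolute value) are an assumption chosen for illustration, not thresholds stated in this report, and different cut-offs would move borderline values.

```r
# Correlations with sale_price, as reported in the list above
r <- c(overall_qual = 0.80, gr_liv_area = 0.71, year_built = 0.56,
       overall_cond = -0.10, low_qual_fin_sf = -0.04)

# Hypothetical cut-offs on |r|: < 0.3 weak, 0.3 to < 0.7 moderate, >= 0.7 strong
strength <- cut(abs(r), breaks = c(0, 0.3, 0.7, 1),
                labels = c("weak", "moderate", "strong"),
                right = FALSE, include.lowest = TRUE)
direction <- ifelse(r > 0, "positive", ifelse(r < 0, "negative", "none"))

cor_labels <- setNames(paste(strength, direction), names(r))
print(cor_labels)
```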
Table 1: The predictor variables with the highest, lowest, and closest-to-0.5 correlation with the sale price
| Criterion | Variable | Correlation |
|---|---|---|
| Highest correlation | overall_qual | 0.7992618 |
| Lowest correlation (in absolute value) | bsmt_fin_sf_2 | 0.005889764 |
| Correlation closest to 0.5 | tot_rms_abv_grd | 0.4954744 |
Figure 3: The scatter plot of the variables with the highest correlation with sale price
The scatter plot shows a positive correlation between “overall_qual” and “sale price”: as “overall_qual” increases, the “sale price” also tends to increase. This suggests that the property’s overall quality is a strong predictor of its sale price in the dataset analyzed.
Figure 4: The scatter plot of the variables with the lowest correlation with sale price
The scatter plot of “bsmt_fin_sf_2” (the Type 2 finished square footage of the basement) against the “sale price” of the property suggests a very weak relationship, as the data points are spread out widely. This means that changes in this finished basement area do not strongly predict changes in the sale price.
Figure 5: The scatter plot of the variables with the closest to 0.5 correlation
The scatter plot shows a positive correlation between “tot_rms_abv_grd” (representing total rooms above grade) and “sale price,” indicating that houses with more rooms tend to have higher sale prices. This suggests that buyers may value larger homes with more rooms, which can lead to higher prices. For sellers, having more rooms can be an advantage when selling a home.
In this section, a linear regression model was fitted to predict sale prices based on the selected variables (bsmt_fin_sf_1, bsmt_unf_sf, and x1st_flr_sf). The predictor variables were tested for multicollinearity, and the model was checked for outliers and for adherence to the model assumptions.
Figure 6: The output showing the result of the fitted model
Call:
lm(formula = sale_price ~ bsmt_fin_sf_1 + bsmt_unf_sf + x1st_flr_sf,
data = df)
Residuals:
Min 1Q Median 3Q Max
-635726 -35976 -11379 32192 388732
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37466.782 3475.638 10.78 <0.0000000000000002 ***
bsmt_fin_sf_1 68.389 3.929 17.41 <0.0000000000000002 ***
bsmt_unf_sf 47.415 3.792 12.51 <0.0000000000000002 ***
x1st_flr_sf 74.633 4.200 17.77 <0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 59600 on 2926 degrees of freedom
Multiple R-squared: 0.4441, Adjusted R-squared: 0.4435
F-statistic: 779 on 3 and 2926 DF, p-value: < 0.00000000000000022
The equation of the model is given as:
Sale Price = 37466.78 + 68.39 * bsmt_fin_sf_1 + 47.42 * bsmt_unf_sf + 74.63 * x1st_flr_sf
The coefficients of the linear regression model have the following implications for the sale price of properties:
Intercept: The intercept represents the estimated sale price when all predictor variables (bsmt_fin_sf_1, bsmt_unf_sf, and x1st_flr_sf) are zero, which is approximately $37,466.78. Because a house with zero first-floor area does not occur in practice, this value is an extrapolation and mainly serves to anchor the regression line.
bsmt_fin_sf_1: The coefficient value suggests that finished basement space is valuable. For every additional square foot of finished basement, the sale price is expected to increase by $68.39, indicating buyers are willing to pay more for this feature.
bsmt_unf_sf: Unfinished basement space adds value but is less than finished space. Each additional square foot contributes $47.42 to the sale price, which could be because unfinished space offers potential for future customization.
x1st_flr_sf: The first-floor area is highly valued, with each additional square foot increasing the sale price by $74.63. This might reflect buyer preferences for larger living spaces on the main floor.
Residual standard error (59600): This is the estimated standard deviation of the residuals, which indicates the typical amount by which the actual sale prices deviate from the model’s predicted sale prices. It’s approximately $59,600.
Multiple R-squared (0.4441): The linear relationship with the predictor variables (bsmt_fin_sf_1, bsmt_unf_sf, and x1st_flr_sf) can explain approximately 44.41% of the variability in sale prices.
Adjusted R-squared (0.4435): The Adjusted R-squared penalizes the R-squared value for the number of predictors in the model. It is slightly lower than the multiple R-squared, reflecting the adjustment for model complexity.
F-statistic (779) and p-value (< 0.00000000000000022): The F-statistic tests the overall significance of the regression model. In this case, the extremely small p-value indicates that the regression model is statistically significant, suggesting that at least one of the predictor variables (bsmt_fin_sf_1, bsmt_unf_sf, or x1st_flr_sf) is related to the sale price.
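The adjusted R-squared can be reproduced from the other quantities in the model summary, which makes the size of the penalty concrete. The sketch below uses the standard formula with the reported R-squared and degrees of freedom; the variable names are illustrative.

```r
r_squared    <- 0.4441   # Multiple R-squared from the model summary
n_predictors <- 3
residual_df  <- 2926     # "on 2926 degrees of freedom"
n_obs        <- residual_df + n_predictors + 1   # 2930 observations

# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)
round(adj_r_squared, 4)
```

With 2,930 observations and only three predictors, the penalty is very small, which is why the two values differ only in the fourth decimal place.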
The model suggests larger, well-appointed spaces are associated with higher property values. The exact sale price prediction would depend on the specific measurements of these features in a property.
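As a worked example of how the equation produces a prediction, the sketch below applies the reported coefficients to a hypothetical house. The house measurements are invented for illustration only.

```r
# Coefficients as reported in the model summary
coefs <- c(intercept = 37466.782, bsmt_fin_sf_1 = 68.389,
           bsmt_unf_sf = 47.415, x1st_flr_sf = 74.633)

# Hypothetical house (square feet); values chosen only for illustration
house <- c(bsmt_fin_sf_1 = 500, bsmt_unf_sf = 300, x1st_flr_sf = 1200)

# Intercept plus the sum of coefficient * measurement for each predictor
predicted_price <- coefs[["intercept"]] + sum(coefs[names(house)] * house)
round(predicted_price)
```

Given the residual standard error of about $59,600, a single prediction like this carries considerable uncertainty.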
Figure 7: Plots of Residuals vs fitted observations
The above figure is a scatter plot that visualizes the model’s residuals (the differences between observed and predicted values) against the predicted values (fitted values). The points are clustered around the origin, suggesting that for many observations, the predicted values are close to the actual values. The pink linear trend line intersects the cluster, indicating an attempt to capture the central tendency of the data points. Also, the specific outlier points are annotated with numerical values, highlighting observations where the model’s predictions significantly differ from the actual data.
Figure 8:Plots of Q-Q Residuals
The plot assesses whether a dataset follows a particular distribution, such as the normal distribution. The points along the reference line indicate that the data is normally distributed, and any deviations from this line suggest departures from normality. It can be observed from the plot that the data conforms to normal but experiences a slight deviation as it approaches the upper end. In the context of regression analysis, a Q-Q plot of residuals helps to verify the assumption of normally distributed errors; as the case may be in this plot, it can be ascertained that the normality assumption is satisfied.
Figure 9: Scale-Location plot
The plot shows that the standardized residuals are fairly evenly spread along the fitted values, a good sign of model fit in regression analysis. The roughly consistent spread in the Scale-Location plot suggests homoscedasticity, meaning the variability of the residuals is relatively constant across the range of fitted values.
Figure 10: Plot of Residual vs Leverage
The plot above is the Residuals vs. Leverage plot, which shows the highly influential observations in the model. Leverage measures how far an observation’s predictor values lie from those of the other observations, while influence (summarized by Cook’s distance) reflects how much the model’s coefficients would change if the observation were removed. From the plot, observation 1499 is influential: removing it from the dataset would noticeably affect the coefficients of the model.
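The quantities behind a Residuals vs. Leverage plot can be extracted directly with base R. The sketch below uses a small invented dataset in which the last point has an extreme predictor value and sits off the trend; the cut-off of twice the mean leverage is a common rule of thumb and an assumption here, not a threshold used in this report.

```r
# Toy data (hypothetical): the last point has an extreme x value and is off the trend
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9, 35.0)
fit <- lm(y ~ x)

leverage <- hatvalues(fit)       # how unusual each observation's x value is
cooks_d  <- cooks.distance(fit)  # how much the fit changes if the observation is dropped

high_leverage    <- which(leverage > 2 * mean(leverage))  # rule-of-thumb cut-off
most_influential <- which.max(cooks_d)

print(high_leverage)
print(most_influential)
```

For the report's model, the same calls would be `hatvalues(model)` and `cooks.distance(model)`.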
Figure 10: Multicollinearity test
bsmt_fin_sf_1 bsmt_unf_sf x1st_flr_sf
2.641384 2.289148 2.233810
The test indicates no problematic multicollinearity among the predictor variables, as all VIF values are below the commonly used threshold of 5.
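A VIF value has a direct interpretation: for each predictor, regress it on the other predictors and compute 1 / (1 - R-squared). The sketch below does this by hand in base R on simulated data; the data and variable names are hypothetical, and for a model with only numeric predictors the result should in principle agree with what `vif()` from the car package reports.

```r
set.seed(1)  # toy data, hypothetical
n  <- 200
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.7)   # deliberately correlated with x1
x3 <- rnorm(n)                        # unrelated to the others
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2 of predictor j regressed on the remaining predictors)
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```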
Figure 11: Outlier test
Based on the above results, the Bonferroni-adjusted p-values of the flagged observations are well below 0.05. This indicates that these observations are statistically significant outliers at a significance level of 0.05.
So, based on the provided data and the chosen significance level of 0.05, it can be concluded that there are outliers present in the dataset. Removing outliers from a regression model can lead to loss of valuable information, biased estimates, reduced generalizability, violation of assumptions, and potential ethical concerns. Instead of outright removal, it’s better to investigate outliers further and consider alternative approaches to mitigate their impact on the analysis.
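The logic of a Bonferroni outlier test can be sketched with base R: compute the studentized residuals, convert each to a two-sided p-value, and multiply by the number of observations. This is an illustrative reconstruction on invented data, written under the assumption that it mirrors what `outlierTest()` does; it is not the function's actual source.

```r
# Toy data (hypothetical) with one planted outlier at observation 10
x <- 1:20
y <- 2 * x + sin(x)
y[10] <- y[10] + 30
fit <- lm(y ~ x)

t_stat <- rstudent(fit)            # studentized (deleted) residuals
df_t   <- df.residual(fit) - 1     # one fewer df, since each case is left out in turn
p_raw  <- 2 * pt(-abs(t_stat), df = df_t)
p_bonf <- pmin(1, p_raw * length(t_stat))   # Bonferroni: multiply by the number of tests

flagged <- which(p_bonf < 0.05)
print(flagged)
```

Because the adjustment multiplies by the sample size, an observation must have a very extreme studentized residual to be flagged in a dataset with thousands of rows.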
In this section, the model variables were transformed using the log transformation to improve model performance. The model was refitted with the transformed data, and outlier tests were performed again.
Figure 12: The output showing the result of the fitted transformed model
Call:
lm(formula = log_sale_price ~ log_bsmt_fin_sf_1 + log_bsmt_unf_sf +
log_x1st_flr_sf, data = df)
Residuals:
Min 1Q Median 3Q Max
-2.2994 -0.2047 -0.0230 0.2303 1.0341
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.778915 0.126550 53.57 <0.0000000000000002 ***
log_bsmt_fin_sf_1 0.023300 0.002058 11.32 <0.0000000000000002 ***
log_bsmt_unf_sf 0.034390 0.003234 10.63 <0.0000000000000002 ***
log_x1st_flr_sf 0.706826 0.018561 38.08 <0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3122 on 2926 degrees of freedom
Multiple R-squared: 0.414, Adjusted R-squared: 0.4134
F-statistic: 689.2 on 3 and 2926 DF, p-value: < 0.00000000000000022
The model equation is given as
“log_sale_price = 6.7789 + 0.0233 * log_bsmt_fin_sf_1 + 0.0344 * log_bsmt_unf_sf + 0.7068 * log_x1st_flr_sf”
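Because both sides of this equation are on a log scale, a coefficient can be read as an approximate elasticity: the percentage change in sale price associated with a percentage change in the predictor. The sketch below works this out for the first-floor coefficient, using a hypothetical 10% increase in area. The basement terms were transformed as log(x + 1), so this reading applies to them only approximately.

```r
beta_first_floor <- 0.706826   # coefficient of log_x1st_flr_sf from the fitted model
increase <- 0.10               # hypothetical 10% larger first floor

# Multiplicative effect on price is (1 + increase)^beta
pct_change_price <- ((1 + increase)^beta_first_floor - 1) * 100
round(pct_change_price, 2)
```

Under this model, a 10% larger first floor is associated with roughly a 7% higher sale price, holding the basement variables fixed.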
To compare the untransformed model in output 1 with the log-transformed model in output 2, the R-squared values are set side by side:
Output 1 (untransformed data):
Multiple R-squared: 0.4441
Adjusted R-squared: 0.4435
Output 2 (log-transformed data):
Multiple R-squared: 0.414
Adjusted R-squared: 0.4134
Comparing the adjusted R-squared values, the untransformed model in output 1 has an adjusted R-squared of 0.4435, while the log-transformed model in output 2 has an adjusted R-squared of 0.4134.
So, the adjusted R-squared of the log-transformed model is lower by approximately 0.0301. Note that the two models have different response variables (sale_price and log_sale_price), so each R-squared measures explained variance on its own scale and the two values are not strictly comparable.
Therefore, in this case, the log transformation did not improve the model as measured by adjusted R-squared. Taken at face value, the comparison suggests that the untransformed model (output 1) provided a better fit to the data than the log-transformed model (output 2).
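One way to put a raw-scale model and a log-scale model on equal footing is to back-transform the log model's predictions and compare errors in the original units. The sketch below shows the idea on simulated data; the data, the `rmse()` helper, and the variable names are all hypothetical, and the simple `exp()` back-transformation ignores retransformation bias.

```r
set.seed(42)  # toy data, hypothetical
n     <- 300
area  <- runif(n, 500, 3000)
price <- exp(9 + 0.4 * log(area) + rnorm(n, sd = 0.2))
toy   <- data.frame(area, price)

fit_raw <- lm(price ~ area, data = toy)
fit_log <- lm(log(price) ~ log(area), data = toy)

# Bring the log-model predictions back to the price scale before comparing
pred_raw <- fitted(fit_raw)
pred_log <- exp(fitted(fit_log))

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse_raw <- rmse(toy$price, pred_raw)
rmse_log <- rmse(toy$price, pred_log)
c(raw = rmse_raw, log = rmse_log)
```

Comparing errors in dollars this way avoids relying on two R-squared values that are computed against different response variables.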
Figure 13: the result of the outliers for the transformed model
rstudent unadjusted p-value Bonferroni p
182 -7.438482 0.00000000000013311 0.00000000039002
1554 -6.342258 0.00000000026144000 0.00000076601000
In summary, while the log transformation of variables in output 2 did not result in an improvement in adjusted R-squared compared to output 1, it still offers several potential benefits:
Outlier Mitigation: The log transformation tends to reduce the influence of outliers, making the model more robust to extreme values.
Linearity Improvement: Logarithmic transformations can help linearize relationships that are nonlinear in their original scale, potentially improving the model’s linearity assumptions.
Heteroscedasticity Reduction: By compressing the range of values, log transformations can help stabilize the variance of the target variable, leading to more homoscedastic residuals.
Coefficient Interpretability: Transforming variables can sometimes make the coefficients more interpretable, especially when the original scale of the data is skewed or heteroscedastic.
Model Simplification: Despite the decrease in adjusted R-squared, the log transformation may simplify the model by reducing the complexity of relationships between predictors and the target variable.
Overall, while the adjusted R-squared may have decreased in output 2, the log transformation still offers potential advantages in terms of model robustness, linearity, and interpretability, which should be considered alongside other performance metrics when evaluating the effectiveness of the regression model.
In this section, regression for all subsets was conducted to identify the best model based on the adjusted R-squared. The preferred model equation and adjusted R-squared were also determined.
Figure 14: All subset regression model selection
Subset selection object
Call: regsubsets.formula(sale_price ~ bsmt_fin_sf_1 + bsmt_unf_sf +
x1st_flr_sf, data = df)
3 Variables (and intercept)
Forced in Forced out
bsmt_fin_sf_1 FALSE FALSE
bsmt_unf_sf FALSE FALSE
x1st_flr_sf FALSE FALSE
1 subsets of each size up to 3
Selection Algorithm: exhaustive
bsmt_fin_sf_1 bsmt_unf_sf x1st_flr_sf
1 ( 1 ) " " " " "*"
2 ( 1 ) "*" " " "*"
3 ( 1 ) "*" "*" "*"
The result from the analysis presented above aims to determine the best combination of predictor variables (bsmt_fin_sf_1, bsmt_unf_sf, and x1st_flr_sf) to predict sale_price. All subsets regression was performed, where every possible combination of the predictor variables, along with an intercept term, was considered.
Subset 1: Only x1st_flr_sf is included in the model, meaning it is the best single predictor of sale_price among the three variables.
Subset 2: Both bsmt_fin_sf_1 and x1st_flr_sf are included, meaning this pair forms the best two-variable model.
Subset 3: All three variables (bsmt_fin_sf_1, bsmt_unf_sf, and x1st_flr_sf) are included in the model; this is the only possible three-variable model.
The output lists only the best model of each size. Choosing among these three by adjusted R-squared selects the full three-variable model, although using just x1st_flr_sf alone may also offer a reasonable prediction.
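With only three candidate predictors, all subsets regression can also be carried out by hand, which shows what the selection is doing. The sketch below enumerates all seven non-empty subsets on simulated data and ranks them by adjusted R-squared using base R; the data and names are hypothetical, and this is an illustration of the idea rather than how `regsubsets()` is implemented.

```r
set.seed(7)  # toy data, hypothetical
n  <- 150
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 3 + 2 * x1 + 1 * x2 + rnorm(n)   # x3 is pure noise
toy <- data.frame(y, x1, x2, x3)

predictors <- c("x1", "x2", "x3")

# Every non-empty combination of predictors (2^3 - 1 = 7 subsets)
subsets <- unlist(lapply(seq_along(predictors),
                         function(k) combn(predictors, k, simplify = FALSE)),
                  recursive = FALSE)

# Fit each candidate model and record its adjusted R-squared
adj_r2 <- sapply(subsets, function(vars) {
  summary(lm(reformulate(vars, response = "y"), data = toy))$adj.r.squared
})
names(adj_r2) <- sapply(subsets, paste, collapse = " + ")

sort(adj_r2, decreasing = TRUE)
best <- names(which.max(adj_r2))
print(best)
```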
The coefficient of the model
(Intercept) bsmt_fin_sf_1 bsmt_unf_sf x1st_flr_sf
37466.78227 68.38900 47.41524 74.63252
Preferred model equation of the model:
sale_price = 37466.78 + 68.39 * bsmt_fin_sf_1 + 47.42 * bsmt_unf_sf + 74.63 * x1st_flr_sf
Adjusted R-square of the preferred model: 0.4434823
The comparison between the transformed model equation and the preferred model equation
In the transformed model, both the dependent and independent variables are transformed using the logarithm function, in an attempt to deal with the outliers observed in the dataset. In the preferred model, the dependent and independent variables are used directly without any transformation.
The preferred model has a higher adjusted R-squared (0.4435) than the transformed model (0.4134), which indicates that it explains a larger proportion of the variance in its dependent variable. Higher R-squared values generally indicate better goodness of fit, although the two models have different dependent variables (sale price and its logarithm), so the comparison is indicative rather than exact. I will, therefore, choose the model from the all subsets regression.
Conclusion
In this project, the key factors influencing housing prices in the Ames area were explored using the Ames Housing Dataset. Through data cleaning, exploration, correlation analysis, and regression modeling, valuable insights into the relationships between various housing attributes and sale prices were gained.
The analysis revealed several important findings:
Factors such as overall quality, basement size, and first-floor area have significant positive correlations with sale prices, indicating their strong influence on housing prices.
Other variables, such as year built, number of rooms, and garage size, also showed moderate to strong correlations with sale prices.
Regression modeling allowed quantifying the impact of these factors on sale prices. The coefficients of the linear regression model provided insights into how changes in predictor variables affect housing prices.
Testing for multicollinearity, outliers, and adherence to model assumptions ensured the reliability of the models.
A log transformation of the variables was attempted to improve model performance, but it did not improve the adjusted R-squared compared to the untransformed model.
All subsets regression helped identify the best combination of predictor variables, with the preferred model (all three predictors) having a higher adjusted R-squared than the log-transformed model.
In conclusion, the analysis provides valuable insights for stakeholders in the real estate market in the Ames area. By understanding the key factors influencing housing prices, stakeholders can make more informed decisions regarding property investments, sales, and development. The findings underscore the importance of features such as overall quality, basement size, and first-floor area in determining property values, highlighting areas for strategic focus in the real estate market.
References
R. Kabacoff, R in Action, 2nd Edition, Manning Publications, ISBN 978-161-7291-388
What is a Residuals vs. Leverage Plot? (Definition & Example) - Statology. Retrieved April 13, 2024, from https://www.statology.org/residuals-vs-leverage-plot/
How to Use regsubsets() in R for Model Selection - Statology. Retrieved April 13, 2024, from https://www.statology.org/regsubsets-in-r/
R-codes
#Title: Regression Diagnostics with R
#Name: Ayeni Taiwo
#Course: ALY 6015: Intermediate Analysis
#Instructor: Prof. Vivan Clements
#Date: 2024-04-11
cat("\014") # clears console
rm(list = ls()) # clears global environment
try(dev.off(dev.list()["RStudioGD"]), silent = TRUE)
try(p_unload(p_loaded(), character.only = TRUE), silent = TRUE)
options(scipen = 100)
library(pacman)
p_load(tidyverse)
p_load(ggthemes)
library(RColorBrewer)
p_load(psych, knitr, kableExtra)
p_load(correlation)
library(corrplot)
library(car)
library(leaps)
library(dplyr)
library(broom)
library(ggpubr)
library(readxl)
library(correlation)
library(ggplot2)
library(gridExtra)
library(MASS)
# Step 1: Load the Ames housing dataset
ames_data <- read.csv("AmesHousing.csv")
#cleaning data sets
p_load(janitor)
ames_data<-clean_names(ames_data)
ames_data
# Step 2: #Summary of the data
summary(ames_data) # Descriptive statistics
str(ames_data) # Structure of the dataset
names(ames_data)
# Select only numeric columns for correlation matrix
df<- select_if(ames_data, is.numeric)
df
#Summary of the subgroup data
summary(df)
# Step 3: Prepare the dataset for modeling
# Impute missing values with mean
# Check for missing values
colSums(is.na(df))
# Helper to calculate the mode, used when imputing missing values
get_mode <- function(x) {
ux <- unique(na.omit(x))
if(length(ux) == 0) return(NA) # Return NA if all values are NA
ux[which.max(tabulate(match(x, ux)))]
}
# Imputing missing values for factor columns with the mode
# (df holds only numeric columns at this point, so this step leaves it unchanged)
df[] <- lapply(df, function(x) if(is.factor(x)) replace(x, is.na(x), get_mode(x)) else x)
# Imputing missing values for numerical columns with the mean
df[] <- lapply(df, function(x) if(is.numeric(x)) replace(x, is.na(x), mean(x, na.rm = TRUE)) else x)
# Check if there are still missing values
colSums(is.na(df))
# Draw histogram for sale_price
hist(df$sale_price, main = "Histogram of Sale Price", xlab = "Sale Price")
#Step 4:Calculate the correlation matrix
correlation_matrix <- cor(df)
correlation_matrix
#Step 5: Plot the correlation matrix
corrplot(correlation_matrix, type = 'upper', col = brewer.pal( n = 8, name = "RdYlBu"))
#Step 6
# Calculate correlation between Sale Price and all other numeric variables
correlation <- cor(df, use = "complete.obs")
correlation
#Extracting correlations with Sale Price excluding itself
sale_price_corr <- correlation["sale_price", ]
sale_price_corr <- sale_price_corr[!names(sale_price_corr) == "sale_price"]
sale_price_corr
# Find the variable with the highest correlation and its value
highest_cor_val <- max(abs(sale_price_corr))
highest_cor_var <- names(sale_price_corr)[which.max(abs(sale_price_corr))]
highest_cor_val_and_var <- list(variable = highest_cor_var, correlation = highest_cor_val)
# Find the variable with the lowest correlation and its value
lowest_cor_val <- min(abs(sale_price_corr))
lowest_cor_var <- names(sale_price_corr)[which.min(abs(sale_price_corr))]
lowest_cor_val_and_var <- list(variable = lowest_cor_var, correlation = lowest_cor_val)
# Find the variable with the correlation closest to 0.5 and its value
closest_cor_0.5_val <- min(abs(sale_price_corr - 0.5))
closest_cor_0.5_var <- names(sale_price_corr)[which.min(abs(sale_price_corr - 0.5))]
closest_cor_0.5_val_and_var <- list(variable = closest_cor_0.5_var, correlation = sale_price_corr[closest_cor_0.5_var])
highest_cor_val_and_var
lowest_cor_val_and_var
closest_cor_0.5_val_and_var
# Scatter plot for variable with highest correlation
# (aes_string() is deprecated; .data[[name]] looks up a column by its name)
ggplot(df, aes(x = .data[[highest_cor_var]], y = sale_price)) +
  geom_point() +
  labs(x = highest_cor_var, title = paste("Scatter plot of", highest_cor_var, "vs Sale Price"))
# Scatter plot for variable with lowest correlation
ggplot(df, aes(x = .data[[lowest_cor_var]], y = sale_price)) +
  geom_point() +
  labs(x = lowest_cor_var, title = paste("Scatter plot of", lowest_cor_var, "vs Sale Price"))
# Scatter plot for variable with correlation closest to 0.5
ggplot(df, aes(x = .data[[closest_cor_0.5_var]], y = sale_price)) +
  geom_point() +
  labs(x = closest_cor_0.5_var, title = paste("Scatter plot of", closest_cor_0.5_var, "vs Sale Price"))
#Step 7: fitting variables to model
model <- lm(sale_price~bsmt_fin_sf_1+ bsmt_unf_sf + x1st_flr_sf, data = df)
summary(model)
#Step 8 Extract coefficients
coefficients <- coef(model)
# Write the equation for the fitted regression model
equation <- paste("Sale Price = ", coefficients[1], " + ",
coefficients[2], "* bsmt_fin_sf_1 + ",
coefficients[3], "* bsmt_unf_sf + ",
coefficients[4], "* x1st_flr_sf")
equation
#Step 9:plotting the regression model
par(mfrow = c(1, 1))
plot(model)
#step 10: check for multicollinearity
vif(model)
#step 11: Checking for outlier test
outlierTest(model)
#Step 12: Apply a transformation (log transformation)
df$log_sale_price <- log(df$sale_price)
df$log_bsmt_fin_sf_1 <- log(df$bsmt_fin_sf_1 + 1) # Adding 1 to avoid log(0)
df$log_bsmt_unf_sf <- log(df$bsmt_unf_sf + 1)
df$log_x1st_flr_sf <- log(df$x1st_flr_sf)
# Refit the model with the log-transformed data
cleaned_model <- lm(log_sale_price ~ log_bsmt_fin_sf_1 + log_bsmt_unf_sf + log_x1st_flr_sf, data = df)
summary(cleaned_model)
# Extract coefficients
coefficients <- coef(cleaned_model)
# Construct the equation
intercept <- coefficients[1]
log_bsmt_fin_sf_1_coef <- coefficients[2]
log_bsmt_unf_sf_coef <- coefficients[3]
log_x1st_flr_sf_coef <- coefficients[4]
# Write the equation
equation <- paste("log_sale_price =",
round(intercept, 4),
"+",
round(log_bsmt_fin_sf_1_coef, 4),
"* log_bsmt_fin_sf_1 +",
round(log_bsmt_unf_sf_coef, 4),
"* log_bsmt_unf_sf +",
round(log_x1st_flr_sf_coef, 4),
"* log_x1st_flr_sf")
# Print the equation
equation
# Run the outlier test again after the transformation
outlierTest(cleaned_model)
# Step 13: Perform all subsets regression
all_subsets <- regsubsets(sale_price~bsmt_fin_sf_1+ bsmt_unf_sf + x1st_flr_sf, data = df)
# Summary of all subsets regression
summary_result <- summary(all_subsets)
summary_result
# Find the best model based on adjusted R^2
best_model_index <- which.max(summary_result$adjr2)
best_model_index
# Get the coefficients of the best model
best_model_coef <- coef(all_subsets, id = best_model_index)
best_model_coef
# Extract variable names
variable_names <- names(best_model_coef)
variable_names
# Get the adjusted R-square of the best model
adj_r_squared <- summary_result$adjr2[best_model_index]
adj_r_squared
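# Construct the equation of the preferred model:
# the intercept comes first, followed by one "coefficient * variable" term per predictor
slope_terms <- paste(best_model_coef[-1], "*", variable_names[-1], collapse = " + ")
equation <- paste("sale_price =", best_model_coef[1], "+", slope_terms)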
# Print the equation and adjusted R-square
cat("Preferred model equation:\n", equation, "\n")
cat("Adjusted R-square of the preferred model:", adj_r_squared, "\n")