In the real estate market, the sale price of a house represents more than just a transactional figure, and understanding the factors that influence house prices is crucial for various stakeholders, including buyers, sellers, real estate agents, and policymakers. For buyers, knowing the key determinants of house prices helps in making informed decisions. Sellers can use this information to price their properties competitively. Real estate agents can provide valuable insights to their clients based on market trends. Policymakers can use this data to formulate housing policies that promote affordability and sustainability in the housing market.
This report aims to provide valuable insights into the Ames housing market, shedding light on the dynamics of house sales and the factors driving housing prices. The data used has 82 columns, comprising 23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables.
The objective of this project is to analyze the housing data from Ames, Iowa, to gain insights into the factors affecting housing prices.
What are the key factors influencing the sale price of houses in Ames, Iowa?
How do variables like overall quality, living area, and garage area impact housing prices?
Are there specific neighborhoods or housing styles that command higher prices in Ames?
Which year or month recorded the most sales?
We began the analysis by exploring the data, its dimensions, and its variables to get a better understanding of the dataset. The other steps in the methodology were data cleaning, exploratory data analysis (EDA), regression analysis, and diagnosis of the OLS assumptions.
The summary statistics indicate that there are some missing values in the dataset: looking at the n column, variables such as lot_frontage, mas_vnr_area, and total_bsmt_sf contain missing values. The average house sale price is $180,796.10, and the median sale price is $160,000. The minimum sale price is $34,900, while the maximum is $755,000, which is extremely far from the average price, suggesting the presence of outliers. Other attributes such as lot_area (lot size), pool_area, misc_val, and x3ssn_porch (three-season porch area) also show a high presence of outliers. In general, outliers are present in almost all the variables.
Many of the variables have both their minimum and median values at zero (0), which suggests that a substantial portion of the observations have a value of 0 for those variables. This is likely because not all houses have these features; for example, not every house has a guest bathroom (half_bath) or a basement, so zero (0) is recorded when the feature is absent.
The houses in the dataset have built years ranging from 1872 to 2010, with most of them sold between 2006 and 2010.
The sale price of houses in Ames, Iowa is right-skewed, with a skewness of 1.741715 and a kurtosis of 5.1025, indicating a long right tail. This implies that there are far more lower-priced houses than expensive ones.
Distribution of Sale Price
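These summary measures come from psych::describe(); a minimal sketch, assuming the data have been loaded and cleaned as in the appendix:

# Skewness and kurtosis of sale price via psych::describe()
library(psych)
describe(ames_housing$sale_price)[c("mean", "median", "skew", "kurtosis")]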
In total there are 13,960 missing values in the data, and 5 of the variables have more than 40% of their values missing. These variables are all categorical features, and imputing them for use in the analysis could compromise the integrity of the data, hence we drop these features:
pool_qc (pool quality), misc_feature (miscellaneous feature not covered in other categories), alley (type of alley access to the property), fence (fence quality), and fireplace_qu (fireplace quality).
A closer look at the summary statistics of the attributes with missing values, together with their histogram plots, indicates that most of these variables contain extreme outliers, so filling in with the mean could distort the analysis; we therefore fill missing values with the median.
The remaining missing values are in categorical variables that we will not use for the analysis, hence we select the numeric variables plus a few categorical features we want to explore or visualize.
Histogram with kernel density of Variables with Missing Values
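The median fill described above is a one-liner with dplyr, as implemented in the appendix:

# Replace NAs in every numeric column with that column's median (appendix)
library(dplyr)
ames_fill_na <- ames_housing_drop1 %>%
  mutate_if(is.numeric, ~ifelse(is.na(.), median(., na.rm = TRUE), .))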
As part of our research questions, we would like to explore house sales across the months in order to answer which month recorded the most sales, and also to investigate which neighborhoods or housing styles command higher prices in Ames.
Sale Price Across Months: The figure below indicates that most houses were sold in the middle months of the year (May, June, July), while the fewest were sold in the first and last months.
Number of Properties Sold in Each Month
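The monthly counts behind this figure were tallied as follows (excerpted from the appendix); month.abb converts month numbers into calendar-ordered labels:

# Sales count per month, with calendar-ordered month labels (appendix)
monthly_sales <- ames_select %>%
  group_by(mo_sold) %>%
  summarise(count = n()) %>%
  mutate(month_sold = factor(month.abb[mo_sold], levels = month.abb))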
Zoning Classification:
Most properties are found in Residential Low Density (RL) areas. The average price of properties in Floating Village Residential (FV) areas is the highest among all zoning classifications, at approximately $218,987.
Residential Low Density (RL) areas also have relatively high average prices, with properties averaging around $191,283. Residential Medium Density (RM) areas have a lower average price than Floating Village Residential (FV) and Residential Low Density (RL), with properties averaging approximately $126,781.
Frequency Plot & Average Price of Zoning Classification
Utilities & Foundation Material:
Buildings with a poured concrete (PConc) foundation are more expensive than buildings whose foundations are made of wood, stone, or cinder block (CBlock). This is not surprising, as poured concrete foundations offer several advantages over alternative materials, such as greater strength, durability, customization options, and energy efficiency.
Likewise, it makes sense that houses with all public utilities (E, G, W, & S) (AllPub) are more expensive than those with only electricity, gas, and water (NoSewr) or only electricity and gas (NoSeWa).
Frequency Plot of Utilities & Foundation
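Both comparisons are driven by the get_group_price() helper defined in the appendix, which returns the count and mean sale price for each level of a categorical column:

# Average sale price per utility type and per foundation material (appendix helper)
get_group_price(ames_select, "utilities")
get_group_price(ames_select, "foundation")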
Neighborhood
From the boxplot below it is evident that some neighborhoods are more expensive than others. Neighborhoods such as NoRidge, NridgHt, and StoneBr have higher sale prices than the rest. These neighborhoods are known for their higher-end properties, so it is not surprising that houses there sell at a premium.
Boxplot of Sale Price Across Neighborhoods
A closer look at the data indicates that some values can be combined into one; these include the numbers of bathrooms and the porch areas. We can also compute the age of the building from the year remodeled and the year sold.
Hence we create new features, namely age, total_bath, total_bsmt_bath, total_porch & total_flr_sf.
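The derived features are simple sums and differences of existing columns (from the appendix):

# Feature engineering (appendix): age at sale, combined baths, porch and floor area
ames_select$age             <- ames_select$yr_sold - ames_select$year_remod_add
ames_select$total_bath      <- ames_select$full_bath + ames_select$half_bath
ames_select$total_bsmt_bath <- ames_select$bsmt_full_bath + ames_select$bsmt_half_bath
ames_select$total_porch     <- ames_select$open_porch_sf + ames_select$x3ssn_porch +
  ames_select$screen_porch + ames_select$enclosed_porch
ames_select$total_flr_sf    <- ames_select$x1st_flr_sf + ames_select$x2nd_flr_sf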
We decided to drop some variables and select others based on domain knowledge of what we believe will be relevant. In total we selected 22 attributes, which were used to create the correlation table shown in the appendix.
We plotted the correlations to determine the relationship between the explanatory predictors and the sale price of houses; most of our variables have positive relationships with the response variable. The overall quality of the house (overall_qual) has the highest correlation at +0.799, followed by the above-ground living area (gr_liv_area, +0.707). Other variables such as garage_cars, garage_area, and the total number of baths show positive correlations of +0.648, +0.640, and +0.553 respectively.
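The correlation vector below can be produced along these lines (a sketch based on the appendix code; num_vars is a name introduced here for illustration):

# Correlation of every numeric column with sale_price (sketch)
num_vars <- ames_select[, sapply(ames_select, is.numeric)]
round(cor(num_vars)[, "sale_price"], 4)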
## ms_sub_class lot_frontage lot_area overall_qual overall_cond
## -0.0851 0.3403 0.2665 0.7993 -0.1017
## year_built year_remod_add mas_vnr_area bsmt_fin_sf_1 bsmt_fin_sf_2
## 0.5584 0.5330 0.5022 0.4329 0.0060
## bsmt_unf_sf total_bsmt_sf x1st_flr_sf x2nd_flr_sf low_qual_fin_sf
## 0.1829 0.6322 0.6217 0.2694 -0.0377
## gr_liv_area bsmt_full_bath bsmt_half_bath full_bath half_bath
## 0.7068 0.2758 -0.0358 0.5456 0.2851
## bedroom_abv_gr kitchen_abv_gr tot_rms_abv_grd fireplaces garage_yr_blt
## 0.1439 -0.1198 0.4955 0.4746 0.5089
## garage_cars garage_area wood_deck_sf open_porch_sf enclosed_porch
## 0.6478 0.6404 0.3271 0.3130 -0.1288
## x3ssn_porch screen_porch pool_area misc_val mo_sold
## 0.0322 0.1122 0.0684 -0.0157 0.0353
## yr_sold sale_price age total_bath total_bsmt_bath
## -0.0306 1.0000 -0.5349 0.5531 0.2494
## total_porch total_flr_sf
## 0.1853 0.7136
Interestingly, the age of the house shows a negative correlation of -0.53, indicating that the older the building, the lower the price, which is understandable since older houses tend to cost less than newer ones. The overall condition rating of the house also had a negative correlation of -0.10 with sale price.
Based on the correlation values and domain knowledge of the project, we selected 10 of the 22 numeric features (the sale price plus nine predictors) as our variables of interest.
Scatterplots with Regression Lines
The plots in Figure 7 indicate that for gr_liv_area the points lie close to the regression line. For misc_val, since the majority of the observations are zero (0), the regression line cannot capture most of the points, which likely explains its low correlation value. lot_area shows a related pattern: its points cluster at small values but are more spread out than those of misc_val, consistent with its moderate correlation value.
We fitted our linear model using the nine selected predictors. The results indicate that the model as a whole is significant, as the p-value is very small, but some individual predictors are not significant in predicting the sale price. For instance, total_bath is not significant, with a p-value > 0.05.
The adjusted R-squared is 0.8029, which implies that our independent variables explain about 80% of the variation in sale price. However, although the adjusted R-squared is high, the residual standard error is extremely large (35,470), indicating that the model deviates substantially from the true regression line and has several flaws that need to be investigated.
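The model behind the output below is an ordinary lm() fit on the variables of interest (appendix):

# Fit OLS of sale price on all retained predictors (appendix)
limodel <- lm(sale_price ~ ., data = interest_variables)
summary(limodel)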
##
## Call:
## lm(formula = sale_price ~ ., data = interest_variables)
##
## Residuals:
## Min 1Q Median 3Q Max
## -538129 -18093 -2119 15809 300365
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.488e+04 4.477e+03 -12.256 < 2e-16 ***
## overall_qual 2.003e+04 7.562e+02 26.485 < 2e-16 ***
## gr_liv_area 3.793e+01 2.218e+00 17.103 < 2e-16 ***
## garage_area 4.827e+01 3.959e+00 12.195 < 2e-16 ***
## total_bsmt_sf 2.724e+01 2.010e+00 13.551 < 2e-16 ***
## mas_vnr_area 4.198e+01 4.250e+00 9.878 < 2e-16 ***
## fireplaces 9.646e+03 1.185e+03 8.143 5.65e-16 ***
## lot_area 6.341e-01 9.046e-02 7.010 2.94e-12 ***
## age -5.028e+02 4.024e+01 -12.495 < 2e-16 ***
## total_bath 3.153e+02 1.291e+03 0.244 0.807
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35470 on 2920 degrees of freedom
## Multiple R-squared: 0.8035, Adjusted R-squared: 0.8029
## F-statistic: 1326 on 9 and 2920 DF, p-value: < 2.2e-16
To improve our model, we perform regression diagnostics to check whether the model violates any of the assumptions of Ordinary Least Squares.
Diagnostic Plots of the Model
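The four panels are R's standard diagnostics for an lm object, arranged in a 2x2 grid (appendix):

# Residuals vs Fitted, Q-Q, Scale-Location, Residuals vs Leverage (appendix)
par(mfrow = c(2, 2))
plot(limodel)
par(mfrow = c(1, 1))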
Linearity: The Residuals vs Fitted plot seems to indicate a pattern in the residuals, as the red line shows a curve rather than a straight line. This implies that the relationship between our predictors and the outcome is not linear.
Normality: In the Q-Q Residuals plot, some values do not fall along the normality line, with departures at the tails indicating the presence of outliers. Hence the normality assumption is violated.
Homoscedasticity: The Scale-Location plot is used to check the homogeneity of the residual variance; a horizontal line with equally spread points is a good indication of homoscedasticity. However, our plot exhibits the same pattern as the Residuals vs Fitted plot, with a curved trend and points that are not equally spread around the red line. Hence our model violates the homogeneity assumption.
Residuals vs Leverage Plot: The residuals vs leverage plot helps to identify influential points in our data. Our plot shows some influential points, such as observations 1499 and 2181.
Other Underlying Assumptions
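The multicollinearity and outlier checks reported below use the car package (appendix); a sqrt(VIF) above 2 would flag a multicollinearity concern:

# Variance inflation factors and Bonferroni outlier test (car package, appendix)
library(car)
vif(limodel)
sqrt(vif(limodel)) > 2   # TRUE would indicate a multicollinearity problem
outlierTest(limodel)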
## overall_qual gr_liv_area garage_area total_bsmt_sf mas_vnr_area
## 2.650421 2.926683 1.686679 1.826020 1.341961
## fireplaces lot_area age total_bath
## 1.371535 1.183014 1.640029 2.510956
## rstudent unadjusted p-value Bonferroni p
## 1499 -16.426574 4.5992e-58 1.3476e-54
## 2181 -13.296600 3.2713e-39 9.5849e-36
## 2182 -9.708560 5.9321e-22 1.7381e-18
## 1761 8.656864 7.8801e-18 2.3089e-14
## 1768 7.527015 6.8753e-14 2.0145e-10
## 45 6.645216 3.5988e-11 1.0544e-07
## 434 6.155465 8.5103e-10 2.4935e-06
## 1064 6.114456 1.0979e-09 3.2167e-06
## 1183 -6.025824 1.8934e-09 5.5477e-06
## 433 5.887181 4.3760e-09 1.2822e-05
To know which variables we need to transform, we use component + residual plots to visualize the individual features. These plots help assess the linearity assumption and detect potential non-linear relationships between the predictors and the response. A component + residual plot adds a line indicating the line of best fit; a significant difference between the residual line and the component line implies that the predictor does not have a linear relationship with the dependent variable.
From the figure below, two variables have the component line deviating from the line of best fit: lot_area and total_bsmt_sf show differences between the two lines.
To resolve this we will transform these values, and also transform the dependent variable, since outliers were observed. Kabacoff (2015) states that when a model violates the normality assumption, we can use the powerTransform() function to generate a maximum-likelihood estimate of the power transformation.
## bcPower Transformation to Normality
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## interest_variables$sale_price 0.0076 0 -0.0501 0.0654
##
## Likelihood ratio test that transformation parameter is equal to 0
## (log transformation)
## LRT df pval
## LR test, lambda = (0) 0.06741106 1 0.79514
##
## Likelihood ratio test that no transformation is needed
## LRT df pval
## LR test, lambda = (1) 966.3636 1 < 2.22e-16
The power transformation analysis suggests that the best transformation to achieve normality in the sale price variable is a Box-Cox transformation with an estimated power of approximately 0.0076. The likelihood ratio test comparing the Box-Cox transformation with a log transformation indicates no significant difference between them (p-value = 0.795). Additionally, the likelihood ratio test of whether any transformation is needed shows that a transformation is necessary (p-value < 2.22e-16), indicating that the original data do not meet the normality assumption.
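The Box-Cox summary above is produced by car::powerTransform() on the response (appendix):

# Maximum-likelihood Box-Cox power estimate for the response (appendix)
library(car)
summary(powerTransform(interest_variables$sale_price))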
Similarly, when the linearity assumption is violated, a transformation of the predictor variables can often help; the boxTidwell() function generates maximum-likelihood estimates of predictor powers that can improve linearity.
Using boxTidwell() to transform total_bsmt_sf did not work, so we removed that variable from the model.
The Box-Tidwell test examines the linearity assumption between the predictor variable (lot_area) and the response variable (sale_price).
In this case, the maximum likelihood estimate (MLE) of lambda, the transformation parameter, is approximately 0.19305. The score statistic (t) is -15.614, and the associated p-value is extremely low (< 2.2e-16), indicating strong evidence against the null hypothesis of linearity. Therefore, based on this test, there is significant evidence to reject the assumption of linearity between lot_area and sale_price, and we apply the estimated power transformation.
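The test output below comes from car::boxTidwell() (appendix):

# Box-Tidwell MLE of a power transformation for lot_area (appendix)
library(car)
boxTidwell(sale_price ~ lot_area, data = interest_variables)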
## MLE of lambda Score Statistic (t) Pr(>|t|)
## 0.19305 -15.614 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## iterations = 13
Refitting the regression using the transformed values improves the adjusted R-squared to almost 83% and reduces the residual standard error to 0.0014. Note that the response is now on the transformed scale, so the residual standard error is not directly comparable to that of the previous model.
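Applying the estimated powers and refitting gives the model summarized below (appendix):

# Transform the response and lot_area with the estimated powers, then refit (appendix)
interest_variables$transform_price <- interest_variables$sale_price^0.0076
interest_variables$trans_lot_area  <- interest_variables$lot_area^0.19305
limodel_1 <- lm(transform_price ~ overall_qual + gr_liv_area + garage_area +
                  mas_vnr_area + fireplaces + age + total_bath + trans_lot_area,
                data = interest_variables)
summary(limodel_1)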
##
## Call:
## lm(formula = transform_price ~ overall_qual + gr_liv_area + garage_area +
## mas_vnr_area + fireplaces + age + total_bath + trans_lot_area,
## data = interest_variables)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0164313 -0.0006444 0.0000845 0.0008081 0.0043690
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.082e+00 3.338e-04 3239.802 < 2e-16 ***
## overall_qual 1.075e-03 2.872e-05 37.427 < 2e-16 ***
## gr_liv_area 1.182e-06 8.778e-08 13.469 < 2e-16 ***
## garage_area 2.269e-06 1.569e-07 14.464 < 2e-16 ***
## mas_vnr_area 7.270e-07 1.663e-07 4.372 1.27e-05 ***
## fireplaces 5.599e-04 4.684e-05 11.954 < 2e-16 ***
## age -2.960e-05 1.588e-06 -18.641 < 2e-16 ***
## total_bath 1.205e-04 4.892e-05 2.463 0.0139 *
## trans_lot_area 8.276e-04 5.265e-05 15.719 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001403 on 2921 degrees of freedom
## Multiple R-squared: 0.8295, Adjusted R-squared: 0.8291
## F-statistic: 1777 on 8 and 2921 DF, p-value: < 2.2e-16
Using the all-subsets regression method to search for the best model shows that the best model can be achieved by removing the variable total_bath from the predictors. The best subset is the top model in the figure, with the highest adjusted R-squared of 0.83.
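A sketch of the all-subsets search with leaps::regsubsets(), keeping the five best models of each size and ranking by adjusted R-squared (per the appendix, over the candidate predictors of the transformed model):

# All-subsets search over the candidate predictors (appendix)
library(leaps)
all_subsets <- regsubsets(transform_price ~ overall_qual + gr_liv_area + garage_area +
                            mas_vnr_area + fireplaces + age + total_bath +
                            trans_lot_area, data = interest_variables, nbest = 5)
plot(all_subsets, scale = "adjr2")  # panel shown in the figure
summary(all_subsets)$which[which.max(summary(all_subsets)$adjr2), ]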
Best Model
## (Intercept) overall_qual gr_liv_area garage_area mas_vnr_area
## TRUE TRUE TRUE TRUE TRUE
## fireplaces age total_bath trans_lot_area
## TRUE TRUE TRUE TRUE
Hence the best model is:
transform_price ~ overall_qual + gr_liv_area + garage_area + mas_vnr_area + fireplaces + age + trans_lot_area
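Fitting this specification gives the summary below (appendix):

# Final model: drops total_bath (appendix)
limodel_2 <- lm(transform_price ~ overall_qual + gr_liv_area + garage_area +
                  mas_vnr_area + fireplaces + age + trans_lot_area,
                data = interest_variables)
summary(limodel_2)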
##
## Call:
## lm(formula = transform_price ~ overall_qual + gr_liv_area + garage_area +
## mas_vnr_area + fireplaces + age + trans_lot_area, data = interest_variables)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0168052 -0.0006273 0.0000876 0.0008136 0.0043374
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.082e+00 3.279e-04 3299.268 < 2e-16 ***
## overall_qual 1.079e-03 2.870e-05 37.595 < 2e-16 ***
## gr_liv_area 1.310e-06 7.089e-08 18.479 < 2e-16 ***
## garage_area 2.274e-06 1.570e-07 14.482 < 2e-16 ***
## mas_vnr_area 7.177e-07 1.664e-07 4.314 1.66e-05 ***
## fireplaces 5.548e-04 4.683e-05 11.848 < 2e-16 ***
## age -3.058e-05 1.539e-06 -19.872 < 2e-16 ***
## trans_lot_area 8.078e-04 5.208e-05 15.512 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001404 on 2922 degrees of freedom
## Multiple R-squared: 0.8292, Adjusted R-squared: 0.8288
## F-statistic: 2026 on 7 and 2922 DF, p-value: < 2.2e-16
We use the AIC to compare the models; the one with the lowest AIC score is preferred. The AIC of model 2 is -30168.26 and that of model 3 is -30164.18. Although model 2's AIC is marginally lower, the difference is small, and model 3 achieves essentially the same fit with one fewer predictor, so we select the simpler model as our best regression model:
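The comparison is a pair of AIC() calls on the two fitted models (appendix):

# AIC comparison of model 2 (with total_bath) and model 3 (without)
AIC(limodel_1)  # model 2
AIC(limodel_2)  # model 3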
transform_price ~ overall_qual + gr_liv_area + garage_area + mas_vnr_area + fireplaces + age + trans_lot_area
In our analysis we sought to understand the factors that affect the sale price of houses. We found that variables such as overall quality, above-ground living area, garage area, fireplaces, and age tend to affect the price of a house: the older the house, the lower the price. Also, the middle months of the year tend to have higher sales volumes than the beginning or end of the year.
After understanding our data, we built a model to predict the sale price of a house. The first model we built had a high residual standard error, and by investigating the model against the underlying assumptions of OLS, it was evident that the model violated almost all of them, with the exception of multicollinearity.
By transforming some of our features, the residual standard error (on the transformed scale) was reduced to 0.0014. Lastly, we used the all-subsets regression method to select the best predictors and the AIC to compare the candidate models; the third model, being simpler with comparable fit, was chosen over the second.
In conclusion, we use seven variables as the predictors in the model for sale price.
Analytics Vidhya. (2024, February 16). 6 Assumptions of Linear Regression: Plots and Solutions. https://www.analyticsvidhya.com/blog/2016/07/deeper-regression-analysis-assumptions-plots-solutions/
R-bloggers. (2012, September 28). Histogram + Density Plot Combo in R. https://www.r-bloggers.com/2012/09/histogram-density-plot-combo-in-r/
Kabacoff, R. I. (2015). R in Action, Second Edition. Manning Publications. Retrieved April 17, 2024, from https://learning.oreilly.com/library/view/r-in-action/9781617291388/kindle_split_002.html
Tan, A. T. (2022, March 16). Cracking the Ames Housing Dataset with Linear Regression. Towards Data Science. https://towardsdatascience.com/wrangling-through-dataland-modeling-house-prices-in-ames-iowa-75b9b4086c96
# load required packages
library(janitor)   # clean_names()
library(dplyr)     # data manipulation (glimpse, select, mutate_if, ...)
library(tidyr)     # gather()
library(ggplot2)   # plotting
library(ggpubr)    # ggarrange()
library(psych)     # describe()
library(car)       # vif(), outlierTest(), influencePlot(), powerTransform(), boxTidwell()
library(leaps)     # regsubsets()
library(corrplot)  # correlation matrix plot
#load data
ames_housing <- read.csv('AmesHousing.csv')
#View(ames_housing)
#Standardize column names
ames_housing <- clean_names(ames_housing)
#--- Explore data
dim(ames_housing) # 2930 rows, 82 columns
glimpse(ames_housing)
#summary(ames_housing) #there seem to be outliers in some of the variables
# select numeric columns
get_numeric_col <- function(df){
  # extract the numeric columns from a data frame
  numeric_df <- df[, sapply(df, is.numeric)]
  return(numeric_df)
}
#descriptive statistics
descriptive_state <- describe(get_numeric_col(ames_housing))
descriptive_state <- select(descriptive_state, c(n, mean, sd, min, max, skew, kurtosis))
descriptive_state
# check for missing values
any(is.na(ames_housing)) # missing values are present
sum(is.na(ames_housing)) # in total, 13960 missing values
# percentage of missing values per column
get_percent_missing <- function(df){
  # return the percentage of missing values in each column, sorted descending
  missing_percentage <- colMeans(is.na(df)) * 100
  missing_percentage <- data.frame(missing_percentage)
  missing_percentage <- missing_percentage %>%
    arrange(desc(missing_percentage)) %>% round(., 2)
  return(missing_percentage)
}
missing_percentage = get_percent_missing(ames_housing)
head(missing_percentage, 10) # about 5 of the variables have missing values
# above 40%; replacing these values in order to use them in the analysis
# may affect the integrity of our model, hence we will drop these columns
# drop variables: all the variables are categorical values
# pool_qc: pool quality (ordinal)
# misc_feature: miscellaneous feature not covered in other categories (nominal)
# alley: type of alley access to property (nominal)
# fence: fence quality (ordinal)
# fireplace_qu: fireplace quality (ordinal)
ames_housing_drop1 <- ames_housing %>% select(-c(pool_qc, misc_feature, alley,
fence, fireplace_qu))
any(duplicated(ames_housing_drop1)) # no duplicate values
# Select columns with missing values
missing_columns <- ames_housing_drop1[, colSums(is.na(ames_housing_drop1)) > 0]
#lets extract the numeric columns
missing_col_numeric <- get_numeric_col(missing_columns)
summary(missing_col_numeric) #summary variables with missing values
# a closer look into the summary statistics indicates that most of these variables
# contain extreme outliers, hence filling in with the mean may affect the integrity
# of the analysis, so we fill missing values with the median
# ------ Visualize the distribution of missing variable columns
# Reshape the dataframe into long format
df_long <- gather(missing_col_numeric, key = "variable", value = "value")
# Histogram with kernel density
hist_plot <- ggplot(df_long, aes(x = value)) +
geom_histogram(aes(y = after_stat(density)), bins = 20, fill = "skyblue", color = "black") +
geom_density(alpha = 0.5, color = "red") +
facet_wrap(~ variable, scales = "free") +
theme_minimal() +
labs(title = "Histogram with kernel density of Variables with Missing Values",
x = "Value", y = "Density") +
theme(plot.title = element_text(hjust = .5))
# Print the histogram plot
#print(hist_plot)
#----- fill missing values with median
ames_fill_na <- ames_housing_drop1 %>%
mutate_if(is.numeric, ~ifelse(is.na(.), median(., na.rm = TRUE), .))
get_percent_missing(ames_fill_na) #remaining missing values
# the remaining missing values are categorical variables which we will not use
# for the analysis hence we drop/select numeric variables and 2 or 3 categorical
# features we want to explore/visualize
# select variables of interest
ames_select <- ames_fill_na %>%
select(where(is.numeric) | ms_zoning| utilities|neighborhood|foundation)
#drop the order and pid column
ames_select <- select(ames_select, -c(order, pid))
#View(ames_select)
#any(is.na(ames_select)) no missing values
#factor the character variables
# ------------- Exploratory Data Analysis --------
price_hist <- ggplot(ames_housing_drop1, aes(x = sale_price)) +
geom_histogram(bins =30, fill='steelblue',color = 'white', alpha = 0.7) +
labs(title = 'Histogram of Sale Price',
x = 'Sale Price($)',
y = 'Frequency') +
scale_x_continuous(labels = scales::comma) +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5))
#price_hist
#summary(ames_housing_drop1$sale_price)
price_boxplot <- ggplot(ames_housing_drop1, aes(x = sale_price)) +
geom_boxplot(color = 'red') +
labs(title = 'Boxplot of Sale Price',
x = 'Sale Price($)') +
scale_x_continuous(labels = scales::comma) +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5))
#price_boxplot
# extract rows with outlier values
out <- boxplot.stats(ames_housing_drop1$sale_price)$out
out_ind <- which(ames_housing_drop1$sale_price %in% c(out))
ames_housing_drop1[out_ind,]
# our plot indicates the presence of outliers in our data, but for the
# sake of the diagnosis we keep them for now and investigate how relevant
# these values are later in the process
# ----- Monthly sales
monthly_sales<- ames_select %>%
group_by(mo_sold)%>%
summarise(count = n())%>%
mutate(month_sold = factor(month.abb[mo_sold], levels = month.abb))
#plot
monthly_plot <-ggplot(monthly_sales, aes(x=month_sold, y=count, fill=month_sold, label=count)) +
geom_col() + labs(title = 'Number of Properties Sold in Each Month',
x = 'Months',
y = 'Number of Properties') +
theme_minimal() + theme(legend.position = 'None',
plot.title = element_text(hjust = 0.5))
# ----- Year sales
year_sales<- ames_select %>%
group_by(yr_sold)%>%
summarise(count = n(),
price = mean(sale_price))
#plot
year_plot <- ggplot(year_sales, aes(x=yr_sold, y=count, fill=yr_sold, label=count)) +
geom_col() + labs(title = 'Number of Properties Sold in Each Year',
x = 'Year',
y = 'Number of Properties') +
theme_minimal() + theme(legend.position = 'None',
plot.title = element_text(hjust = 0.5))
# --------- functions
get_group_price <- function(df, x){
  # take a data frame and a categorical column name and return the count and
  # average sale price for each level, sorted by price
  average_price <- df %>% group_by(!!sym(x)) %>%
    summarise(Count = n(),
              Price = mean(sale_price)) %>%
    arrange(desc(Price))
  return(average_price)
}
# ------- ms_zoning
table(ames_select$ms_zoning) #frequency
ms_zone_price <- get_group_price(ames_select, 'ms_zoning')
zone_freq <- ggplot(ms_zone_price, aes(x = ms_zoning, y=Count, fill=ms_zoning)) +
geom_bar(stat = 'identity') +
labs(title = 'Frequency of Zone Classification',
x = "Zones",
y = 'Frequency') +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5))
zone_price <- ggplot(ms_zone_price, aes(x = ms_zoning, y=Price, fill=ms_zoning)) +
geom_bar(stat = 'identity') +
labs(title = 'Average Price Per Zone Classification',
x = "Zones",
y = 'Sale Price($)') +
scale_y_continuous(labels = scales::comma) +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5))
#view plot
#zone_freq
#zone_price
# ---- Utilities
utility_group <- get_group_price(ames_select, "utilities")
#bar plot
utility_freq <- ggplot(utility_group, aes(x = utilities, y=Count, fill=utilities)) +
geom_bar(stat = 'identity') +
labs(title = 'Frequency of Available Utilities',
x = "Utilities",
y = 'Frequency') +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5))
utility_price <- ggplot(utility_group, aes(x = utilities, y=Price, fill=utilities)) +
geom_bar(stat = 'identity') +
labs(title = 'Average Price Per Available Utilities',
x = "Utilities",
y = 'Sale Price($)') +
scale_y_continuous(labels = scales::comma) +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5))
#view plot
#utility_freq
#utility_price
# ----- foundation
foundation_group <- get_group_price(ames_select, 'foundation')
found_plot <- ggplot(foundation_group,mapping = aes(x=foundation, y=Price,fill=foundation))+
geom_col()+
scale_y_continuous(labels = scales::comma) +
theme_minimal()+ theme(legend.position = 'None',plot.title = element_text(hjust = 0.5)) +
labs(title = 'Average Price Per Foundation Type',
x = 'Foundation',
y = 'Sale Price($)')
# ----- neighbourhood
neigh_boxplot <- ggplot(ames_select, aes(x = factor(neighborhood), y = sale_price,
fill = neighborhood)) +
geom_boxplot(position = position_dodge(width = 0.8)) +
labs(x = "neighborhood", y = "Sale Price($)", title=' Boxplot Neighborhood') +
scale_y_continuous(labels = scales::comma) +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 90, hjust = 1),
legend.position = 'None')
#neigh_boxplot
# a closer look into the data indicates that we can combine some of the values
# into one; these include the number of bathrooms and the total porch area.
# we can also compute the age of the building from the year remodeled
# and the year sold.
# hence:
ames_select$age = ames_select$yr_sold - ames_select$year_remod_add
ames_select$total_bath <- ames_select$full_bath + ames_select$half_bath
ames_select$total_bsmt_bath <- ames_select$bsmt_full_bath + ames_select$bsmt_half_bath
ames_select$total_porch <- ames_select$open_porch_sf + ames_select$x3ssn_porch+
ames_select$screen_porch + ames_select$enclosed_porch
ames_select$total_flr_sf <- ames_select$x1st_flr_sf+ames_select$x2nd_flr_sf
# among the variables we drop some and select the ones we think, based on
# domain knowledge, will be relevant
selected_variables <- ames_select %>%
select(sale_price, lot_frontage, lot_area, overall_qual,
overall_cond, mas_vnr_area, total_bsmt_sf,
low_qual_fin_sf, gr_liv_area, bedroom_abv_gr,
kitchen_abv_gr, tot_rms_abv_grd, fireplaces,
garage_cars, garage_area,
pool_area, misc_val, age,
total_bath, total_bsmt_bath, total_porch,
total_flr_sf)
# ------------ Correlation ---------
corr_matrix <- round(cor(selected_variables), 4)
corr_table <- corr_matrix["sale_price", ]
print(corr_table)
## sale_price lot_frontage lot_area overall_qual overall_cond
## 1.0000 0.3403 0.2665 0.7993 -0.1017
## mas_vnr_area total_bsmt_sf low_qual_fin_sf gr_liv_area bedroom_abv_gr
## 0.5022 0.6322 -0.0377 0.7068 0.1439
## kitchen_abv_gr tot_rms_abv_grd fireplaces garage_cars garage_area
## -0.1198 0.4955 0.4746 0.6478 0.6404
## pool_area misc_val age total_bath total_bsmt_bath
## 0.0684 -0.0157 -0.5349 0.5531 0.2494
## total_porch total_flr_sf
## 0.1853 0.7136
# Create scatterplot
high_val <- ggplot(selected_variables, aes(x = gr_liv_area, y=sale_price)) +geom_point()+
geom_smooth(method ='lm')
low_val <- ggplot(ames_select, aes(x = misc_val, y=sale_price)) +geom_point()+
geom_smooth(method ='lm')
mid_val <- ggplot(selected_variables, aes(x = lot_area, y=sale_price)) +geom_point()+
geom_smooth(method ='lm')
# Arrange plots side by side
#ggarrange(high_val, low_val,mid_val, nrow = 1, ncol = 3)
# ---- variable of interest
interest_variables <- selected_variables %>%
  select(sale_price, overall_qual, gr_liv_area, garage_area, total_bsmt_sf,
         mas_vnr_area, fireplaces, lot_area, age, total_bath)
#library(corrplot)
interest_val_cor <- cor(interest_variables)
# Correlation matrix plot
par(cex = 0.7)
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(interest_val_cor, method="color", col=col(200),
type="upper", order="hclust",
addCoef.col = "black", # Add coefficient of correlation
tl.col="black", tl.srt=90, #Text label color and rotation
diag=FALSE, title = "Correlation Matrix",
mar = c(1, 1, 1, 1))
# ------ Regression Model ------
limodel <- lm(sale_price~., data=interest_variables)
summary(limodel)
# ------ checking assumption of the model
# to view the diagnostics in a 2x2 grid
#par(mfrow=c(2,2))
#plot(limodel)
#par(mfrow=c(1,1))
#------ Multicollinearity
vif(limodel)
sqrt(vif(limodel)) > 2
#--Outliers check
outlierTest(limodel)
# -- influence plot
influencePlot(limodel, id.method="identify", main="Influence Plot",
sub="Circle size is proportional to Cook’s distance")
#independence of the residual
#durbinWatsonTest(limodel)
#----- Improving Model
#crPlots(limodel)
# ---- Transformation
# From R in Action (2nd ed.): when the model violates the normality
# assumption, you typically attempt a transformation of the response variable.
# We can use the powerTransform() function in the car package to generate a
# maximum-likelihood estimate of the power:
#summary(powerTransform(interest_variables$sale_price))
# and, for predictor linearity:
#boxTidwell(sale_price~lot_area,data=interest_variables)
# using boxTidwell() to transform total_bsmt_sf did not work, hence we drop it
# and apply the estimated powers to the response and to lot_area
# include the transformed values in the data
interest_variables$transform_price <-(interest_variables$sale_price)^0.0076
interest_variables$trans_lot_area <- (interest_variables$lot_area)^0.19305
interest_variables$trans_total_bsmt_sf <- log(interest_variables$total_bsmt_sf) # -Inf when total_bsmt_sf is 0; not used in later models
# ------ Model 2 -------
limodel_1 <- lm(transform_price ~ overall_qual + gr_liv_area + garage_area +
mas_vnr_area + fireplaces + age + total_bath +
trans_lot_area, data = interest_variables)
# Summary of the linear regression model
summary(limodel_1)
#qqPlot(limodel_1)
# using all subset to find the best model;
# regsubsets
all_subsets <- regsubsets(transform_price ~ overall_qual + gr_liv_area + garage_area +
mas_vnr_area + fireplaces + lot_area + age + total_bath +
trans_lot_area, data = interest_variables, nbest = 5)
#plot(all_subsets, scale="adjr2")
cp_values <- summary(all_subsets)$cp
best_subset <- which.min(cp_values)
# Select the best model based on a criterion (e.g., adjusted R-squared)
best_model <- summary(all_subsets)$which[which.max(summary(all_subsets)$adjr2), ]
#---- Model 3 ------
limodel_2 <- lm(transform_price ~ overall_qual + gr_liv_area + garage_area +
                  mas_vnr_area + fireplaces + age + trans_lot_area,
                data = interest_variables)
summary(limodel_2)
#plot
reg_plot <- ggplot(interest_variables, aes(y=transform_price, x = overall_qual +
gr_liv_area + garage_area +
mas_vnr_area + fireplaces + lot_area +
age + trans_lot_area))+
geom_point() +geom_smooth(method="lm", col="black") +
labs(x = 'Predictors',
y = 'Sale Price',
title = 'Regression Line')
reg_plot
#best model: compare AIC (lower is better); model 2 = limodel_1, model 3 = limodel_2
model2_AIC <- AIC(limodel_1)
model3_AIC <- AIC(limodel_2)
#print
model2_AIC
model3_AIC