M1 Project Report
ALY6015:Intermediate Analytics
Northeastern University
Professor: Vladimir Shapiro

By: Zeeshan Ahmad Ansari

Date of Submission: 12 November, 2023


Library

#The report utilizes a set of libraries for various data processing and visualization tasks.

library(tidyverse)
library(readxl)
library(dplyr)
library(readr)
library(kableExtra)
library(corrplot)
library(car)

Introduction

Our research examines home sales data from the city of Ames, located in the state of Iowa, in the Midwestern United States. The timeframe of focus is the years spanning 2006 to 2010, a period of fluctuation in the national housing market. The dataset we will utilize contains 2,930 total observations, with each observation representing a home sale and including 82 distinct attributes about the home.

Through comprehensive analysis of this robust dataset, we aim to achieve two primary objectives. The first goal is identifying key factors that exhibit strong correlations with and influence on sold home prices in the Ames market during the study period. Uncovering these important determinants will shed light on the underlying drivers of real estate values and sale prices. The second objective is developing a predictive pricing model for Ames homes using linear regression techniques. Constructing an accurate forecasting model will enable projection of expected sales prices for future home listings in this city.

Our methodological approach will begin with thorough exploratory data analysis to understand data distributions and relationships. We will then conduct correlation studies to quantify connections between attributes and sale prices. After proper data preprocessing, including handling of any missing values and assessment of multicollinearity, we will fit a linear regression model to the data.

Several R programming packages will assist with analysis tasks. Corrplot and ggcorrplot will help visualize correlations. The leaps package will facilitate regression modeling and variable selection. Use of these tools will streamline our analytical workflow.

Completing these objectives will provide meaningful insights into factors impacting the Ames real estate market. The final regression model can support accurate home valuation and price prediction. More broadly, our findings will further comprehension of recent housing market trends, lending knowledge potentially transferable to other US cities.

Analysis


1. Load the Ames housing dataset.

In this task we will load a dataset called Ames Housing data set provided to us. We will use this data set for our further analysis

#Dataset_Employed_in_this_Report

M1W1Data = read.csv("D:/Quater_2/Second Part/ALY6015/Week_1/Assignment/Assignement_ALY6015/Dataset/AmesHousing.csv")


2. Perform Exploratory Data Analysis and use descriptive statistics to describe the data.

In this task we are generating descriptive statistics for both numerical and categorical columns. We have used summary command to get information about the dataset and improved the presentation using kable and kable_styling commands (Chiluiza, 2022).

# Rename Columns
colnames(M1W1Data) <- tolower(gsub("[ ,.]", "_", colnames(M1W1Data)))
#colnames(M1W1Data)


# Structure of the dataset
Structure_1 <- str(M1W1Data)
## 'data.frame':    2930 obs. of  82 variables:
##  $ order          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ pid            : int  526301100 526350040 526351010 526353030 527105010 527105030 527127150 527145080 527146030 527162130 ...
##  $ ms_subclass    : int  20 20 20 20 60 60 120 120 120 60 ...
##  $ ms_zoning      : chr  "RL" "RH" "RL" "RL" ...
##  $ lot_frontage   : int  141 80 81 93 74 78 41 43 39 60 ...
##  $ lot_area       : int  31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...
##  $ street         : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ alley          : chr  NA NA NA NA ...
##  $ lot_shape      : chr  "IR1" "Reg" "IR1" "Reg" ...
##  $ land_contour   : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ utilities      : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ lot_config     : chr  "Corner" "Inside" "Corner" "Corner" ...
##  $ land_slope     : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ neighborhood   : chr  "NAmes" "NAmes" "NAmes" "NAmes" ...
##  $ condition_1    : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ condition_2    : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ bldg_type      : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ house_style    : chr  "1Story" "1Story" "1Story" "1Story" ...
##  $ overall_qual   : int  6 5 6 7 5 6 8 8 8 7 ...
##  $ overall_cond   : int  5 6 6 5 5 6 5 5 5 5 ...
##  $ year_built     : int  1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
##  $ year_remod_add : int  1960 1961 1958 1968 1998 1998 2001 1992 1996 1999 ...
##  $ roof_style     : chr  "Hip" "Gable" "Hip" "Hip" ...
##  $ roof_matl      : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ exterior_1st   : chr  "BrkFace" "VinylSd" "Wd Sdng" "BrkFace" ...
##  $ exterior_2nd   : chr  "Plywood" "VinylSd" "Wd Sdng" "BrkFace" ...
##  $ mas_vnr_type   : chr  "Stone" "None" "BrkFace" "None" ...
##  $ mas_vnr_area   : int  112 0 108 0 0 20 0 0 0 0 ...
##  $ exter_qual     : chr  "TA" "TA" "TA" "Gd" ...
##  $ exter_cond     : chr  "TA" "TA" "TA" "TA" ...
##  $ foundation     : chr  "CBlock" "CBlock" "CBlock" "CBlock" ...
##  $ bsmt_qual      : chr  "TA" "TA" "TA" "TA" ...
##  $ bsmt_cond      : chr  "Gd" "TA" "TA" "TA" ...
##  $ bsmt_exposure  : chr  "Gd" "No" "No" "No" ...
##  $ bsmtfin_type_1 : chr  "BLQ" "Rec" "ALQ" "ALQ" ...
##  $ bsmtfin_sf_1   : int  639 468 923 1065 791 602 616 263 1180 0 ...
##  $ bsmtfin_type_2 : chr  "Unf" "LwQ" "Unf" "Unf" ...
##  $ bsmtfin_sf_2   : int  0 144 0 0 0 0 0 0 0 0 ...
##  $ bsmt_unf_sf    : int  441 270 406 1045 137 324 722 1017 415 994 ...
##  $ total_bsmt_sf  : int  1080 882 1329 2110 928 926 1338 1280 1595 994 ...
##  $ heating        : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ heating_qc     : chr  "Fa" "TA" "TA" "Ex" ...
##  $ central_air    : chr  "Y" "Y" "Y" "Y" ...
##  $ electrical     : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ x1st_flr_sf    : int  1656 896 1329 2110 928 926 1338 1280 1616 1028 ...
##  $ x2nd_flr_sf    : int  0 0 0 0 701 678 0 0 0 776 ...
##  $ low_qual_fin_sf: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ gr_liv_area    : int  1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
##  $ bsmt_full_bath : int  1 0 0 1 0 0 1 0 1 0 ...
##  $ bsmt_half_bath : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ full_bath      : int  1 1 1 2 2 2 2 2 2 2 ...
##  $ half_bath      : int  0 0 1 1 1 1 0 0 0 1 ...
##  $ bedroom_abvgr  : int  3 2 3 3 3 3 2 2 2 3 ...
##  $ kitchen_abvgr  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ kitchen_qual   : chr  "TA" "TA" "Gd" "Ex" ...
##  $ totrms_abvgrd  : int  7 5 6 8 6 7 6 5 5 7 ...
##  $ functional     : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ fireplaces     : int  2 0 0 2 1 1 0 0 1 1 ...
##  $ fireplace_qu   : chr  "Gd" NA NA "TA" ...
##  $ garage_type    : chr  "Attchd" "Attchd" "Attchd" "Attchd" ...
##  $ garage_yr_blt  : int  1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
##  $ garage_finish  : chr  "Fin" "Unf" "Unf" "Fin" ...
##  $ garage_cars    : int  2 1 1 2 2 2 2 2 2 2 ...
##  $ garage_area    : int  528 730 312 522 482 470 582 506 608 442 ...
##  $ garage_qual    : chr  "TA" "TA" "TA" "TA" ...
##  $ garage_cond    : chr  "TA" "TA" "TA" "TA" ...
##  $ paved_drive    : chr  "P" "Y" "Y" "Y" ...
##  $ wood_deck_sf   : int  210 140 393 0 212 360 0 0 237 140 ...
##  $ open_porch_sf  : int  62 0 36 0 34 36 0 82 152 60 ...
##  $ enclosed_porch : int  0 0 0 0 0 0 170 0 0 0 ...
##  $ x3ssn_porch    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ screen_porch   : int  0 120 0 0 0 0 0 144 0 0 ...
##  $ pool_area      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pool_qc        : chr  NA NA NA NA ...
##  $ fence          : chr  NA "MnPrv" NA NA ...
##  $ misc_feature   : chr  NA NA "Gar2" NA ...
##  $ misc_val       : int  0 0 12500 0 0 0 0 0 0 0 ...
##  $ mo_sold        : int  5 6 6 4 3 6 4 1 3 6 ...
##  $ yr_sold        : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ sale_type      : chr  "WD " "WD " "WD " "WD " ...
##  $ sale_condition : chr  "Normal" "Normal" "Normal" "Normal" ...
##  $ saleprice      : int  215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...
# Summary statistics of the dataset
data_summary <- summary(M1W1Data)

kable(data_summary, format = "html", align = "l") %>%
  column_spec(1, bold = TRUE)%>%
  kable_styling(full_width = TRUE, "striped",font_size = 14) %>%
  row_spec(0, bold = TRUE, background = "slategrey" , color = "white")%>%
  scroll_box(width = "100%", height = "400px")
order pid ms_subclass ms_zoning lot_frontage lot_area street alley lot_shape land_contour utilities lot_config land_slope neighborhood condition_1 condition_2 bldg_type house_style overall_qual overall_cond year_built year_remod_add roof_style roof_matl exterior_1st exterior_2nd mas_vnr_type mas_vnr_area exter_qual exter_cond foundation bsmt_qual bsmt_cond bsmt_exposure bsmtfin_type_1 bsmtfin_sf_1 bsmtfin_type_2 bsmtfin_sf_2 bsmt_unf_sf total_bsmt_sf heating heating_qc central_air electrical x1st_flr_sf x2nd_flr_sf low_qual_fin_sf gr_liv_area bsmt_full_bath bsmt_half_bath full_bath half_bath bedroom_abvgr kitchen_abvgr kitchen_qual totrms_abvgrd functional fireplaces fireplace_qu garage_type garage_yr_blt garage_finish garage_cars garage_area garage_qual garage_cond paved_drive wood_deck_sf open_porch_sf enclosed_porch x3ssn_porch screen_porch pool_area pool_qc fence misc_feature misc_val mo_sold yr_sold sale_type sale_condition saleprice
Min. : 1.0 Min. :5.263e+08 Min. : 20.00 Length:2930 Min. : 21.00 Min. : 1300 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Min. : 1.000 Min. :1.000 Min. :1872 Min. :1950 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Min. : 0.0 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Min. : 0.0 Length:2930 Min. : 0.00 Min. : 0.0 Min. : 0 Length:2930 Length:2930 Length:2930 Length:2930 Min. : 334.0 Min. : 0.0 Min. : 0.000 Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000 Min. :0.000 Length:2930 Min. : 2.000 Length:2930 Min. :0.0000 Length:2930 Length:2930 Min. :1895 Length:2930 Min. :0.000 Min. : 0.0 Length:2930 Length:2930 Length:2930 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0 Min. : 0.000 Length:2930 Length:2930 Length:2930 Min. : 0.00 Min. : 1.000 Min. :2006 Length:2930 Length:2930 Min. : 12789
1st Qu.: 733.2 1st Qu.:5.285e+08 1st Qu.: 20.00 Class :character 1st Qu.: 58.00 1st Qu.: 7440 Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1965 Class :character Class :character Class :character Class :character Class :character 1st Qu.: 0.0 Class :character Class :character Class :character Class :character Class :character Class :character Class :character 1st Qu.: 0.0 Class :character 1st Qu.: 0.00 1st Qu.: 219.0 1st Qu.: 793 Class :character Class :character Class :character Class :character 1st Qu.: 876.2 1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.:1126 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character 1st Qu.:0.0000 Class :character Class :character 1st Qu.:1960 Class :character 1st Qu.:1.000 1st Qu.: 320.0 Class :character Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0 1st Qu.: 0.000 Class :character Class :character Class :character 1st Qu.: 0.00 1st Qu.: 4.000 1st Qu.:2007 Class :character Class :character 1st Qu.:129500
Median :1465.5 Median :5.355e+08 Median : 50.00 Mode :character Median : 68.00 Median : 9436 Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Median : 6.000 Median :5.000 Median :1973 Median :1993 Mode :character Mode :character Mode :character Mode :character Mode :character Median : 0.0 Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Median : 370.0 Mode :character Median : 0.00 Median : 466.0 Median : 990 Mode :character Mode :character Mode :character Mode :character Median :1084.0 Median : 0.0 Median : 0.000 Median :1442 Median :0.0000 Median :0.00000 Median :2.000 Median :0.0000 Median :3.000 Median :1.000 Mode :character Median : 6.000 Mode :character Median :1.0000 Mode :character Mode :character Median :1979 Mode :character Median :2.000 Median : 480.0 Mode :character Mode :character Mode :character Median : 0.00 Median : 27.00 Median : 0.00 Median : 0.000 Median : 0 Median : 0.000 Mode :character Mode :character Mode :character Median : 0.00 Median : 6.000 Median :2008 Mode :character Mode :character Median :160000
Mean :1465.5 Mean :7.145e+08 Mean : 57.39 NA Mean : 69.22 Mean : 10148 NA NA NA NA NA NA NA NA NA NA NA NA Mean : 6.095 Mean :5.563 Mean :1971 Mean :1984 NA NA NA NA NA Mean : 101.9 NA NA NA NA NA NA NA Mean : 442.6 NA Mean : 49.72 Mean : 559.3 Mean :1052 NA NA NA NA Mean :1159.6 Mean : 335.5 Mean : 4.677 Mean :1500 Mean :0.4314 Mean :0.06113 Mean :1.567 Mean :0.3795 Mean :2.854 Mean :1.044 NA Mean : 6.443 NA Mean :0.5993 NA NA Mean :1978 NA Mean :1.767 Mean : 472.8 NA NA NA Mean : 93.75 Mean : 47.53 Mean : 23.01 Mean : 2.592 Mean : 16 Mean : 2.243 NA NA NA Mean : 50.63 Mean : 6.216 Mean :2008 NA NA Mean :180796
3rd Qu.:2197.8 3rd Qu.:9.072e+08 3rd Qu.: 70.00 NA 3rd Qu.: 80.00 3rd Qu.: 11555 NA NA NA NA NA NA NA NA NA NA NA NA 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2001 3rd Qu.:2004 NA NA NA NA NA 3rd Qu.: 164.0 NA NA NA NA NA NA NA 3rd Qu.: 734.0 NA 3rd Qu.: 0.00 3rd Qu.: 802.0 3rd Qu.:1302 NA NA NA NA 3rd Qu.:1384.0 3rd Qu.: 703.8 3rd Qu.: 0.000 3rd Qu.:1743 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000 NA 3rd Qu.: 7.000 NA 3rd Qu.:1.0000 NA NA 3rd Qu.:2002 NA 3rd Qu.:2.000 3rd Qu.: 576.0 NA NA NA 3rd Qu.: 168.00 3rd Qu.: 70.00 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0 3rd Qu.: 0.000 NA NA NA 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009 NA NA 3rd Qu.:213500
Max. :2930.0 Max. :1.007e+09 Max. :190.00 NA Max. :313.00 Max. :215245 NA NA NA NA NA NA NA NA NA NA NA NA Max. :10.000 Max. :9.000 Max. :2010 Max. :2010 NA NA NA NA NA Max. :1600.0 NA NA NA NA NA NA NA Max. :5644.0 NA Max. :1526.00 Max. :2336.0 Max. :6110 NA NA NA NA Max. :5095.0 Max. :2065.0 Max. :1064.000 Max. :5642 Max. :3.0000 Max. :2.00000 Max. :4.000 Max. :2.0000 Max. :8.000 Max. :3.000 NA Max. :15.000 NA Max. :4.0000 NA NA Max. :2207 NA Max. :5.000 Max. :1488.0 NA NA NA Max. :1424.00 Max. :742.00 Max. :1012.00 Max. :508.000 Max. :576 Max. :800.000 NA NA NA Max. :17000.00 Max. :12.000 Max. :2010 NA NA Max. :755000
NA NA NA NA NA’s :490 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA’s :23 NA NA NA NA NA NA NA NA’s :1 NA NA’s :1 NA’s :1 NA’s :1 NA NA NA NA NA NA NA NA NA’s :2 NA’s :2 NA NA NA NA NA NA NA NA NA NA NA’s :159 NA NA’s :1 NA’s :1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

The str()function is used to display the structure of the dataset M1W1Data. It reveals the types and formats of the variables in the dataset, including the number of observations and variables.

# Descriptive statistics of numerical columns
num_summary <- summary(select_if(M1W1Data, is.numeric))

kable(num_summary, format = "html", align = "l") %>%
  column_spec(1, bold = TRUE)%>%
  kable_styling(full_width = TRUE, "striped",font_size = 14) %>%
  row_spec(0, bold = TRUE, background = "slategrey" , color = "white")%>%
  scroll_box(width = "100%", height = "400px")
order pid ms_subclass lot_frontage lot_area overall_qual overall_cond year_built year_remod_add mas_vnr_area bsmtfin_sf_1 bsmtfin_sf_2 bsmt_unf_sf total_bsmt_sf x1st_flr_sf x2nd_flr_sf low_qual_fin_sf gr_liv_area bsmt_full_bath bsmt_half_bath full_bath half_bath bedroom_abvgr kitchen_abvgr totrms_abvgrd fireplaces garage_yr_blt garage_cars garage_area wood_deck_sf open_porch_sf enclosed_porch x3ssn_porch screen_porch pool_area misc_val mo_sold yr_sold saleprice
Min. : 1.0 Min. :5.263e+08 Min. : 20.00 Min. : 21.00 Min. : 1300 Min. : 1.000 Min. :1.000 Min. :1872 Min. :1950 Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0 Min. : 334.0 Min. : 0.0 Min. : 0.000 Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000 Min. :0.000 Min. : 2.000 Min. :0.0000 Min. :1895 Min. :0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0 Min. : 0.000 Min. : 0.00 Min. : 1.000 Min. :2006 Min. : 12789
1st Qu.: 733.2 1st Qu.:5.285e+08 1st Qu.: 20.00 1st Qu.: 58.00 1st Qu.: 7440 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1965 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 219.0 1st Qu.: 793 1st Qu.: 876.2 1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.:1126 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 5.000 1st Qu.:0.0000 1st Qu.:1960 1st Qu.:1.000 1st Qu.: 320.0 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 4.000 1st Qu.:2007 1st Qu.:129500
Median :1465.5 Median :5.355e+08 Median : 50.00 Median : 68.00 Median : 9436 Median : 6.000 Median :5.000 Median :1973 Median :1993 Median : 0.0 Median : 370.0 Median : 0.00 Median : 466.0 Median : 990 Median :1084.0 Median : 0.0 Median : 0.000 Median :1442 Median :0.0000 Median :0.00000 Median :2.000 Median :0.0000 Median :3.000 Median :1.000 Median : 6.000 Median :1.0000 Median :1979 Median :2.000 Median : 480.0 Median : 0.00 Median : 27.00 Median : 0.00 Median : 0.000 Median : 0 Median : 0.000 Median : 0.00 Median : 6.000 Median :2008 Median :160000
Mean :1465.5 Mean :7.145e+08 Mean : 57.39 Mean : 69.22 Mean : 10148 Mean : 6.095 Mean :5.563 Mean :1971 Mean :1984 Mean : 101.9 Mean : 442.6 Mean : 49.72 Mean : 559.3 Mean :1052 Mean :1159.6 Mean : 335.5 Mean : 4.677 Mean :1500 Mean :0.4314 Mean :0.06113 Mean :1.567 Mean :0.3795 Mean :2.854 Mean :1.044 Mean : 6.443 Mean :0.5993 Mean :1978 Mean :1.767 Mean : 472.8 Mean : 93.75 Mean : 47.53 Mean : 23.01 Mean : 2.592 Mean : 16 Mean : 2.243 Mean : 50.63 Mean : 6.216 Mean :2008 Mean :180796
3rd Qu.:2197.8 3rd Qu.:9.072e+08 3rd Qu.: 70.00 3rd Qu.: 80.00 3rd Qu.: 11555 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2001 3rd Qu.:2004 3rd Qu.: 164.0 3rd Qu.: 734.0 3rd Qu.: 0.00 3rd Qu.: 802.0 3rd Qu.:1302 3rd Qu.:1384.0 3rd Qu.: 703.8 3rd Qu.: 0.000 3rd Qu.:1743 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000 3rd Qu.:1.0000 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0 3rd Qu.: 168.00 3rd Qu.: 70.00 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009 3rd Qu.:213500
Max. :2930.0 Max. :1.007e+09 Max. :190.00 Max. :313.00 Max. :215245 Max. :10.000 Max. :9.000 Max. :2010 Max. :2010 Max. :1600.0 Max. :5644.0 Max. :1526.00 Max. :2336.0 Max. :6110 Max. :5095.0 Max. :2065.0 Max. :1064.000 Max. :5642 Max. :3.0000 Max. :2.00000 Max. :4.000 Max. :2.0000 Max. :8.000 Max. :3.000 Max. :15.000 Max. :4.0000 Max. :2207 Max. :5.000 Max. :1488.0 Max. :1424.00 Max. :742.00 Max. :1012.00 Max. :508.000 Max. :576 Max. :800.000 Max. :17000.00 Max. :12.000 Max. :2010 Max. :755000
NA NA NA NA’s :490 NA NA NA NA NA NA’s :23 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA NA NA NA NA’s :2 NA’s :2 NA NA NA NA NA NA NA’s :159 NA’s :1 NA’s :1 NA NA NA NA NA NA NA NA NA NA

The above provided code computes summary statistics for the numerical columns within the dataset.

# Descriptive statistics of categorical columns
cat_summary <- summary(select_if(M1W1Data, is.character))

kable(cat_summary, format = "html", align = "l") %>%
  column_spec(1, bold = TRUE)%>%
  kable_styling(full_width = TRUE, "striped",font_size = 14) %>%
  row_spec(0, bold = TRUE, background = "slategrey" , color = "white")%>%
  scroll_box(width = "100%", height = "200px")
ms_zoning street alley lot_shape land_contour utilities lot_config land_slope neighborhood condition_1 condition_2 bldg_type house_style roof_style roof_matl exterior_1st exterior_2nd mas_vnr_type exter_qual exter_cond foundation bsmt_qual bsmt_cond bsmt_exposure bsmtfin_type_1 bsmtfin_type_2 heating heating_qc central_air electrical kitchen_qual functional fireplace_qu garage_type garage_finish garage_qual garage_cond paved_drive pool_qc fence misc_feature sale_type sale_condition
Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930 Length:2930
Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character

The above provided code computes summary statistics for the categorical columns within the dataset.

Observations to Task_2:

The Ames Housing dataset under evaluation has 2930 observations over 82 variables. The variables contain a range of data kinds, including integers (denoted as ‘int’) and character strings (denoted as ‘chr’). It’s important to notice that some variables have NA values, which indicate missing or unrecorded data.

The summary command is used in the above code to provide a short summary of the dataset variables. The resulting table includes key statistics for each variable, such as minimum, maximum, median, mean, and quartile values. This allows for a fast grasp of the range and distribution of numerical properties.


3. Prepare the dataset for modeling by imputing missing values with the variable’s mean value or any other value that you prefer.

data <- M1W1Data 

# Check for missing values
summarize_na <- colSums(is.na(data))
#summarize_na

# Separating numeric and non-numeric columns
numeric_cols <- sapply(data, is.numeric)
non_numeric_cols <- sapply(data, function(x) !is.numeric(x))

# Imputing missing values for numeric cols with mean
data[numeric_cols] <- lapply(data[numeric_cols], function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))

# Imputing missig vlaues for non-numeric cols with "NA" 
data[non_numeric_cols] <- lapply(data[non_numeric_cols], function(x) ifelse(is.na(x), "NA", x))

variables_missing <- c("bsmt_qual", "bsmt_cond", "bsmt_exposure", "bsmtfin_type_1", "bsmtfin_type_2", "garage_type", "garage_finish", "garage_qual", "garage_cond", "mas_vnr_type")

# Replace NA with "NA" in the specified columns
data[variables_missing][is.na(data[variables_missing])] <- "NA"


# Checking again for missing values
missing_value_check <- colSums(is.na(data))
#missing_value_check

#view(data)

Observations to Task_3:

The code first separates the columns into numeric and non numeric using sapply(). This allows for imputation to be tailored based on data type. Imputing data correctly based on its type is an important consideration.

For numeric columns, lapply() is used to loop through each column and replace missing values with the mean. Imputing with the mean is a suitable strategy for numeric data.

Non-numeric columns have NA values imputed with a new NA category. This maintains the categorical nature of these variables, rather than converting them to numeric. Imputing a new category is appropriate for these data types.

Blank values across all columns are replaced with NA using lapply(). This ensures blank strings are treated as missing values, for consistency. Handling blank values is an important data cleaning step.


4. Use the cor() function to produce a correlation matrix of the numeric values.

numeric_data <- data[sapply(data, is.numeric)]

# Calculate the correlation matrix
correlation_matrix <- cor(numeric_data)
#correlation_matrix
# View the correlation matrix
view(correlation_matrix)

Observations to Task_4:

Firstly, we have selected only the columns that contain numerical data from the entire dataset using the code: numeric_data <- data[sapply(data, is.numeric)]. This step is important as it isolates columns with numeric values for subsequent analysis.

Next, we generated a correlation matrix, generated by the cor() function, essentially indicates how the different numeric columns in the dataset are related to each other.

Finally, we have displayed the correlation matrix. The printed correlation matrix shows a table of values, where each number represents the relationship between two different numeric columns. A higher value (either positive or negative) signifies a stronger relationship between the two columns, while a value closer to zero suggests a weaker or negligible relationship.

The purpose of generating a correlation matrix is to understand how the different numeric columns in the dataset are interrelated. By analyzing this matrix, one can derive insights into which columns might influence each other, which is valuable for various data analysis and predictive modeling tasks.


5. Produce a plot of the correlation matrix, and explain how to interpret it. (hint - check the corrplot or ggcorrplot plot libraries)

corrplot(correlation_matrix, tl.cex = 0.5, type = "upper")

Observations to Task_5:

In the above correlation plot the colors represent the degree and direction of the correlation. Darker colors indicate a stronger correlation, with blue shades for positive correlations and red shades for negative correlations.

The hc.order = TRUE parameter has arranged the variables based on their correlations. This arrangement helps in identifying clusters of variables that are more strongly correlated with each other.

The parameter lab = TRUE displays the variable names next to the cells, making it easier to identify which variables are being compared in the plot.

This plot allows us to do more detailed exploration of correlations between variables, especially in larger datasets. By observing the color intensity and patterns, we can quickly identify variables that exhibit significant relationships, understand the direction of their correlations, and recognize clusters of variables that are closely related based on the hierarchical arrangement.


6. Make a scatter plot for the X continuous variable with the highest correlation with SalePrice. Do the same for the X variable that has the lowest correlation with SalePrice. Finally, make a scatter plot between X and SalePrice with the correlation closest to 0.5. Interpret the scatter plots and describe how the patterns differ.

# Extract the correlation values of SalePrice with other variables
correlation_with_saleprice <- correlation_matrix["saleprice", ]
correlation_with_saleprice_sorted <- sort(correlation_with_saleprice)
correlation_with_saleprice_sorted
##             pid  enclosed_porch   kitchen_abvgr    overall_cond     ms_subclass 
##    -0.246521213    -0.128787442    -0.119813720    -0.101696932    -0.085091576 
## low_qual_fin_sf  bsmt_half_bath           order         yr_sold        misc_val 
##    -0.037659765    -0.035815123    -0.031407925    -0.030569087    -0.015691463 
##    bsmtfin_sf_2     x3ssn_porch         mo_sold       pool_area    screen_porch 
##     0.005889764     0.032224649     0.035258842     0.068403247     0.112151214 
##   bedroom_abvgr     bsmt_unf_sf        lot_area     x2nd_flr_sf  bsmt_full_bath 
##     0.143913428     0.182804552     0.266549220     0.269373357     0.275893674 
##       half_bath   open_porch_sf    wood_deck_sf    lot_frontage    bsmtfin_sf_1 
##     0.285056032     0.312950506     0.327143174     0.340751054     0.432794357 
##      fireplaces   totrms_abvgrd    mas_vnr_area   garage_yr_blt  year_remod_add 
##     0.474558093     0.495474417     0.505784081     0.510684432     0.532973754 
##       full_bath      year_built     x1st_flr_sf   total_bsmt_sf     garage_area 
##     0.545603901     0.558426106     0.621676063     0.632105117     0.640385461 
##     garage_cars     gr_liv_area    overall_qual       saleprice 
##     0.647861110     0.706779921     0.799261795     1.000000000
plot(data$saleprice, data$saleprice, 
     xlab = "SalePrice", ylab = "SalePrice",
     main = "Scatter plot between sale price with itself")

plot(data$bsmtfin_sf_2, data$saleprice, 
     xlab = "SalePrice", ylab = "bsmtfin_sf_2",
     main = "Scatter plot Sale Price and bsmtfin_sf_2")

plot(data$garage_area, data$saleprice, 
     xlab = "Sale Price", ylab = "Garage Area",
     main = "Scatter plot between Sale Price and Garage Area")

Observations to Task_6:

By examining the sorted correlation values, it provides an overview of how strongly each variable is related to the SalePrice. Variables with higher correlations are typically more influential in predicting SalePrice where SalePrice itself correlates with the highest value.

Scatter Plots:

  1. SalePrice vs. SalePrice: This plot simply shows a perfect linear relationship, which is expected as it’s the variable plotted against itself.

  2. SalePrice vs. bsmtfin_sf_2: The scatter plot visualizes the relationship between ‘SalePrice’ and ‘bsmtfin_sf_2’. This plot is not that much linear and there are outliers also which makes the interepretation tough.

  3. SalePrice vs. Garage Area: This plot shows the relationship between SalePrice and Garage Area. It has some outliers but we can see that with increse in sales price the Garage area also increases fot most of the places


7. Using at least 3 continuous variables, fit a regression model in R.

I have chosen the following three continuous varibales for the regression model: (Lot Frontage, Lot Area and Garage Area).

# Fit a linear regression model
model <- lm(saleprice ~ lot_frontage + lot_area + garage_area, data = numeric_data)

# Summary of the regression model
summary(model)
## 
## Call:
## lm(formula = saleprice ~ lot_frontage + lot_area + garage_area, 
##     data = numeric_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -378818  -32640   -4477   25820  470664 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.027e+04  3.983e+03  10.111  < 2e-16 ***
## lot_frontage 3.976e+02  5.827e+01   6.823 1.08e-11 ***
## lot_area     1.052e+00  1.519e-01   6.928 5.22e-12 ***
## garage_area  2.164e+02  5.505e+00  39.311  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 59980 on 2926 degrees of freedom
## Multiple R-squared:  0.4368, Adjusted R-squared:  0.4362 
## F-statistic: 756.5 on 3 and 2926 DF,  p-value: < 2.2e-16
# Extract model coefficients
coefficients <- summary(model)$coefficients
coefficients
##                  Estimate   Std. Error   t value      Pr(>|t|)
## (Intercept)  40270.388263 3982.6789662 10.111382  1.197653e-23
## lot_frontage   397.582519   58.2747105  6.822557  1.082565e-11
## lot_area         1.052361    0.1518948  6.928226  5.219572e-12
## garage_area    216.412087    5.5051092 39.311134 9.216922e-272
# Report the model in equation form
cat("SalePrice =", 
    round(coefficients[1, 1], digits = 2), 
    "+", round(coefficients[2, 1], digits = 2), "* lot_frontage",
    "+", round(coefficients[3, 1], digits = 2), "* lot_area",
    "+", round(coefficients[4, 1], digits = 2), "* garage_area", "\n")
## SalePrice = 40270.39 + 397.58 * lot_frontage + 1.05 * lot_area + 216.41 * garage_area

Observations to Task_7:

The model predicts home sale prices in Ames, using three predictor variables lot_frontage, lot_area, and garage_area. These variables likely represent important property characteristics influencing market value.

The output from the R summary() function indicates all three predictors are highly statistically significant (p < 0.001) with t-values exceeding 6. This signals they provide useful information to the model.

The R-squared of 0.4368 and adjusted R-squared of 0.4362 suggest that together the predictors explain approximately 43.7% of the variation in sale prices. The minimal difference between the two values implies the model should generalize reasonably well to new data.

Multicollinearity does not seem problematic as evidenced by the large t-values and relatively small standard errors for the coefficients.

The extracted model coefficients show lot_frontage and garage_area have the largest positive relationships with sale price. The equation form demonstrates a 1 unit increase in lot_frontage and garage_area is associated with a 397 and 216 increase in price, holding other variables constant.

Overall, the model provides a reasonable fit but could likely be improved by incorporating more predictive variables. Further refinement may increase the R-squared and decrease the residual error.


8. Report the model in equation form and interpret each coefficient of the model in the context of this problem.

Formula Used:

Sales Price = b0 + b1* lot frontage + b2* lot area + b3* garage area

Where:

b0 is the intercept.

b1 is the coefficient for lot_frontage.

b2 is the coefficient for lot_area.

b3 is the coefficient for garage_area.

  1. Intercept (b0): The intercept represents the estimated SalePrice when all the independent variables (lot_frontage, lot_area, and garage_area) are zero. In this context, it may not have a meaningful interpretation because it’s unlikely that all these variables would be zero.

  2. Coefficient for lot_frontage (b1): This coefficient represents the change in SalePrice for a one-unit increase in lot_frontage, holding other variables constant. For example, if b1 is 100, it means that for each additional unit of lot_frontage, the SalePrice is estimated to increase by 100 units, assuming all other variables remain constant.

  3. Coefficient for lot_area (b2): This coefficient represents the change in SalePrice for a one-unit increase in lot_area, holding other variables constant. Similarly, if b2 is, for instance, 50, it means that for each additional unit of lot_area, the SalePrice is estimated to increase by 50 units, assuming all other variables remain constant.

  4. Coefficient for garage_area (b3): This coefficient represents the change in SalePrice for a one-unit increase in garage_area, holding other variables constant. If b3 is, for example, 75, it means that for each additional unit of garage_area, the SalePrice is estimated to increase by 75 units, assuming all other variables remain constant.


9. Use the plot() function to plot your regression model. Interpret the four graphs that are produced.

# Plot the regression model diagnostics
par(mfrow=c(2,2))
plot(model)

Observations to Task_9:

These four plots are common diagnostic plots used to evaluate the assumptions of a linear regression model.

The residuals versus fitted values plot shows the errors scattered against the predicted values. We want to see a random pattern, but here the residuals seem to grow as the fitted values increase, implying the variance may not be constant across the range.

The normal Q-Q plot checks if the residuals follow a normal distribution, a requirement for regression. Most points align with the reference line, but some deviation at the extremes indicates potential outliers or heavy-tailed residuals.

The scale-location plot is similar to the first, assessing constancy of variance. Again, the increasing spread of residuals across fitted values points to possible heteroscedasticity.

The residuals versus leverage plot identifies highly influential observations. Points outside the Cook’s distance lines are leveraged and could skew the model. A few cases here appear potentially problematic for the overall fit.

In summary, the diagnostic plots reveal areas of concern like non-constant variance, influential points, and non-normality that may be distorting the model. Addressing these issues could improve the regression performance and reliability.

Overall, while the model might capture the central tendency of the data, the diagnostic plots suggest potential issues with non-constant variance (heteroscedasticity), the normality of residuals, and the presence of influential points. Addressing these issues might involve transformations of the dependent variable, robust regression methods, or removing influential observations, depending on the context and severity of the issues.


10. Check your model for multicollinearity and report your findings. What steps would you take to correct multicollinearity if it exists?

# Calculate Variance Inflation Factor (VIF)
vif_result <- vif(model)
print(vif_result)
## lot_frontage     lot_area  garage_area 
##     1.256838     1.166333     1.140594

Observations to Task_10:

The Variance Inflation Factor (VIF) assesses multicollinearity in a regression model, with values below 5 considered acceptable. For lot_frontage, lot_area, and garage_area, VIFs of 1.26, 1.17, and 1.14 respectively suggest very low multicollinearity. Overall, these variables exhibit very low inter-variable influence, assuring minimal multicollinearity concerns in the model.

If there is a multicollinearity we can use these steps to mitigate it:

  1. Remove variables with high VIF values. Sometimes dropping one of the highly correlated variables can help alleviate multicollinearity.

  2. Scale continuous variables (e.g., using standardization or normalization) to reduce the scale differences between predictors.

  3. If multicollinearity persists, PCA (Principal Component Analysis ) can be applied to transform the correlated variables into a set of uncorrelated variables.

After making adjustments to address multicollinearity, we need to re-evaluate the regression model to check if the issue has been resolved and the model’s performance has improved or not.


11. Check your model for outliers and report your findings. Should these observations be removed from the model?

# Extracting residuals and leverage values
residuals <- residuals(model)
leverage <- hatvalues(model)
stud_resid <- rstudent(model)
cooks_d <- cooks.distance(model)

# Combine the data with the residuals, leverage, and statistics
model_data <- model.frame(model)
outlier_data <- cbind(model_data, residuals, leverage, stud_resid, cooks_d)

# Create plots to visualize outliers
# Plot residuals vs. fitted values
plot(model, which = 1)

# Plot residuals vs. leverage
plot(model, which = 3)

# Plot Cook's distance
plot(model, which = 4)

Observations to Task_11:

Yes we do have outliers in the above model.

Outliers can have a significant impact on a regression model, influencing coefficient estimation and interpretation. They may require cautious thinking before removing them. Their removal, however, should be supported by domain knowledge or statistical reasons. It is critical to ensure that outliers are not influential because of a specific trend in the data or any systematic problem.

The residuals versus fitted values plot shows the errors spreading outwards as the predicted values get larger. This fanning pattern hints at heteroscedasticity - the error variance changes across the range of predictors, violating a regression assumption.

A few distinct points detach from the main cluster, like observations 1768 and 1499. These potential outliers could substantially impact model fit if they represent anomalies and not just natural variability.

The red line depicting the residual average trend isn’t perfectly horizontal, implying the relationship between predictors and outcome may not be strictly linear. The model may be missing some nonlinear associations between the independent and dependent variables.

In essence, the diagnostic plots signal concerns like heteroscedasticity, outliers, and nonlinearity that could undermine model assumptions. Addressing these issues through things like data cleaning, transformations, or model revisions could greatly improve the reliability and interpretability of the regression results.


Conclusion

In this analysis, we conducted an in-depth examination of the Ames, Iowa housing dataset, with the goals of understanding key drivers of home sale prices and developing a predictive pricing model. Through exploratory analysis and correlation studies, we gained insight into relationships between sale price and property attributes like lot size, living area, and garage space.

Our multiple linear regression model revealed lot frontage, lot area, and garage area as useful predictors, explaining approximately 44% of variation in sale prices.

In summary, our methodology enabled tangible progress towards the goals outlined. The workflow, from initial data wrangling to regression diagnostics, serves as a practical example of an end-to-end analytics process. The skills applied here are exploratory analysis, correlation mapping, model development will useful for learning.


References

  1. Dee Chiluiza. (2022, June 25).RPubs. https://rpubs.com/Dee_Chiluiza/vectors_matrix


Appendix
This report contains an R Markdown file named as follows Week_1_ALY6015_Ansari.Rmd