M1 Project Report
ALY6015:Intermediate Analytics
Northeastern University
Professor: Vladimir Shapiro

By: Zeeshan Ahmad Ansari
Date of Submission: 12 November, 2023

Library

#The report utilizes a set of libraries for various data processing and visualization tasks.

library(tidyverse)
library(readxl)
library(dplyr)
library(readr)
library(kableExtra)
library(corrplot)
library(car)

Introduction

Our research examines home sales data from the city of Ames, located in the state of Iowa, in the Midwestern United States. The timeframe of focus is the years spanning 2006 to 2010, a period of fluctuation in the national housing market. The dataset we will utilize contains 2,930 total observations, with each observation representing a home sale and including 82 distinct attributes about the home.

Through comprehensive analysis of this robust dataset, we aim to achieve two primary objectives. The first goal is identifying key factors that exhibit strong correlations with and influence on sold home prices in the Ames market during the study period. Uncovering these important determinants will shed light on the underlying drivers of real estate values and sale prices. The second objective is developing a predictive pricing model for Ames homes using linear regression techniques. Constructing an accurate forecasting model will enable projection of expected sales prices for future home listings in this city.

Our methodological approach will begin with thorough exploratory data analysis to understand data distributions and relationships. We will then conduct correlation studies to quantify connections between attributes and sale prices. After proper data preprocessing, including handling of any missing values and assessment of multicollinearity, we will fit a linear regression model to the data.

Several R programming packages will assist with analysis tasks. Corrplot and ggcorrplot will help visualize correlations. The leaps package will facilitate regression modeling and variable selection. Use of these tools will streamline our analytical workflow.

Completing these objectives will provide meaningful insights into factors impacting the Ames real estate market. The final regression model can support accurate home valuation and price prediction. More broadly, our findings will further comprehension of recent housing market trends, lending knowledge potentially transferable to other US cities.

Analysis

1. Load the Ames housing dataset.

In this task we will load a dataset called Ames Housing data set provided to us. We will use this data set for our further analysis

#Dataset_Employed_in_this_Report

M1W1Data = read.csv("D:/Quater_2/Second Part/ALY6015/Week_1/Assignment/Assignement_ALY6015/Dataset/AmesHousing.csv")

2. Perform Exploratory Data Analysis and use descriptive statistics to describe the data.

In this task we are generating descriptive statistics for both numerical and categorical columns. We have used summary command to get information about the dataset and improved the presentation using kable and kable_styling commands (Chiluiza, 2022).

# Rename Columns
colnames(M1W1Data) <- tolower(gsub("[ ,.]", "_", colnames(M1W1Data)))
#colnames(M1W1Data)


# Structure of the dataset
Structure_1 <- str(M1W1Data)

## 'data.frame':    2930 obs. of  82 variables:
##  $ order          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ pid            : int  526301100 526350040 526351010 526353030 527105010 527105030 527127150 527145080 527146030 527162130 ...
##  $ ms_subclass    : int  20 20 20 20 60 60 120 120 120 60 ...
##  $ ms_zoning      : chr  "RL" "RH" "RL" "RL" ...
##  $ lot_frontage   : int  141 80 81 93 74 78 41 43 39 60 ...
##  $ lot_area       : int  31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...
##  $ street         : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ alley          : chr  NA NA NA NA ...
##  $ lot_shape      : chr  "IR1" "Reg" "IR1" "Reg" ...
##  $ land_contour   : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ utilities      : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ lot_config     : chr  "Corner" "Inside" "Corner" "Corner" ...
##  $ land_slope     : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ neighborhood   : chr  "NAmes" "NAmes" "NAmes" "NAmes" ...
##  $ condition_1    : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ condition_2    : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ bldg_type      : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ house_style    : chr  "1Story" "1Story" "1Story" "1Story" ...
##  $ overall_qual   : int  6 5 6 7 5 6 8 8 8 7 ...
##  $ overall_cond   : int  5 6 6 5 5 6 5 5 5 5 ...
##  $ year_built     : int  1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
##  $ year_remod_add : int  1960 1961 1958 1968 1998 1998 2001 1992 1996 1999 ...
##  $ roof_style     : chr  "Hip" "Gable" "Hip" "Hip" ...
##  $ roof_matl      : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ exterior_1st   : chr  "BrkFace" "VinylSd" "Wd Sdng" "BrkFace" ...
##  $ exterior_2nd   : chr  "Plywood" "VinylSd" "Wd Sdng" "BrkFace" ...
##  $ mas_vnr_type   : chr  "Stone" "None" "BrkFace" "None" ...
##  $ mas_vnr_area   : int  112 0 108 0 0 20 0 0 0 0 ...
##  $ exter_qual     : chr  "TA" "TA" "TA" "Gd" ...
##  $ exter_cond     : chr  "TA" "TA" "TA" "TA" ...
##  $ foundation     : chr  "CBlock" "CBlock" "CBlock" "CBlock" ...
##  $ bsmt_qual      : chr  "TA" "TA" "TA" "TA" ...
##  $ bsmt_cond      : chr  "Gd" "TA" "TA" "TA" ...
##  $ bsmt_exposure  : chr  "Gd" "No" "No" "No" ...
##  $ bsmtfin_type_1 : chr  "BLQ" "Rec" "ALQ" "ALQ" ...
##  $ bsmtfin_sf_1   : int  639 468 923 1065 791 602 616 263 1180 0 ...
##  $ bsmtfin_type_2 : chr  "Unf" "LwQ" "Unf" "Unf" ...
##  $ bsmtfin_sf_2   : int  0 144 0 0 0 0 0 0 0 0 ...
##  $ bsmt_unf_sf    : int  441 270 406 1045 137 324 722 1017 415 994 ...
##  $ total_bsmt_sf  : int  1080 882 1329 2110 928 926 1338 1280 1595 994 ...
##  $ heating        : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ heating_qc     : chr  "Fa" "TA" "TA" "Ex" ...
##  $ central_air    : chr  "Y" "Y" "Y" "Y" ...
##  $ electrical     : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ x1st_flr_sf    : int  1656 896 1329 2110 928 926 1338 1280 1616 1028 ...
##  $ x2nd_flr_sf    : int  0 0 0 0 701 678 0 0 0 776 ...
##  $ low_qual_fin_sf: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ gr_liv_area    : int  1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
##  $ bsmt_full_bath : int  1 0 0 1 0 0 1 0 1 0 ...
##  $ bsmt_half_bath : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ full_bath      : int  1 1 1 2 2 2 2 2 2 2 ...
##  $ half_bath      : int  0 0 1 1 1 1 0 0 0 1 ...
##  $ bedroom_abvgr  : int  3 2 3 3 3 3 2 2 2 3 ...
##  $ kitchen_abvgr  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ kitchen_qual   : chr  "TA" "TA" "Gd" "Ex" ...
##  $ totrms_abvgrd  : int  7 5 6 8 6 7 6 5 5 7 ...
##  $ functional     : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ fireplaces     : int  2 0 0 2 1 1 0 0 1 1 ...
##  $ fireplace_qu   : chr  "Gd" NA NA "TA" ...
##  $ garage_type    : chr  "Attchd" "Attchd" "Attchd" "Attchd" ...
##  $ garage_yr_blt  : int  1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
##  $ garage_finish  : chr  "Fin" "Unf" "Unf" "Fin" ...
##  $ garage_cars    : int  2 1 1 2 2 2 2 2 2 2 ...
##  $ garage_area    : int  528 730 312 522 482 470 582 506 608 442 ...
##  $ garage_qual    : chr  "TA" "TA" "TA" "TA" ...
##  $ garage_cond    : chr  "TA" "TA" "TA" "TA" ...
##  $ paved_drive    : chr  "P" "Y" "Y" "Y" ...
##  $ wood_deck_sf   : int  210 140 393 0 212 360 0 0 237 140 ...
##  $ open_porch_sf  : int  62 0 36 0 34 36 0 82 152 60 ...
##  $ enclosed_porch : int  0 0 0 0 0 0 170 0 0 0 ...
##  $ x3ssn_porch    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ screen_porch   : int  0 120 0 0 0 0 0 144 0 0 ...
##  $ pool_area      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pool_qc        : chr  NA NA NA NA ...
##  $ fence          : chr  NA "MnPrv" NA NA ...
##  $ misc_feature   : chr  NA NA "Gar2" NA ...
##  $ misc_val       : int  0 0 12500 0 0 0 0 0 0 0 ...
##  $ mo_sold        : int  5 6 6 4 3 6 4 1 3 6 ...
##  $ yr_sold        : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ sale_type      : chr  "WD " "WD " "WD " "WD " ...
##  $ sale_condition : chr  "Normal" "Normal" "Normal" "Normal" ...
##  $ saleprice      : int  215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...

# Summary statistics of the dataset
data_summary <- summary(M1W1Data)

kable(data_summary, format = "html", align = "l") %>%
  column_spec(1, bold = TRUE)%>%
  kable_styling(full_width = TRUE, "striped",font_size = 14) %>%
  row_spec(0, bold = TRUE, background = "slategrey" , color = "white")%>%
  scroll_box(width = "100%", height = "400px")

order	pid	ms_subclass	ms_zoning	lot_frontage	lot_area	street	alley	lot_shape	land_contour	utilities	lot_config	land_slope	neighborhood	condition_1	condition_2	bldg_type	house_style	overall_qual	overall_cond	year_built	year_remod_add	roof_style	roof_matl	exterior_1st	exterior_2nd	mas_vnr_type	mas_vnr_area	exter_qual	exter_cond	foundation	bsmt_qual	bsmt_cond	bsmt_exposure	bsmtfin_type_1	bsmtfin_sf_1	bsmtfin_type_2	bsmtfin_sf_2	bsmt_unf_sf	total_bsmt_sf	heating	heating_qc	central_air	electrical	x1st_flr_sf	x2nd_flr_sf	low_qual_fin_sf	gr_liv_area	bsmt_full_bath	bsmt_half_bath	full_bath	half_bath	bedroom_abvgr	kitchen_abvgr	kitchen_qual	totrms_abvgrd	functional	fireplaces	fireplace_qu	garage_type	garage_yr_blt	garage_finish	garage_cars	garage_area	garage_qual	garage_cond	paved_drive	wood_deck_sf	open_porch_sf	enclosed_porch	x3ssn_porch	screen_porch	pool_area	pool_qc	fence	misc_feature	misc_val	mo_sold	yr_sold	sale_type	sale_condition	saleprice
Min. : 1.0	Min. :5.263e+08	Min. : 20.00	Length:2930	Min. : 21.00	Min. : 1300	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Min. : 1.000	Min. :1.000	Min. :1872	Min. :1950	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Min. : 0.0	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Min. : 0.0	Length:2930	Min. : 0.00	Min. : 0.0	Min. : 0	Length:2930	Length:2930	Length:2930	Length:2930	Min. : 334.0	Min. : 0.0	Min. : 0.000	Min. : 334	Min. :0.0000	Min. :0.00000	Min. :0.000	Min. :0.0000	Min. :0.000	Min. :0.000	Length:2930	Min. : 2.000	Length:2930	Min. :0.0000	Length:2930	Length:2930	Min. :1895	Length:2930	Min. :0.000	Min. : 0.0	Length:2930	Length:2930	Length:2930	Min. : 0.00	Min. : 0.00	Min. : 0.00	Min. : 0.000	Min. : 0	Min. : 0.000	Length:2930	Length:2930	Length:2930	Min. : 0.00	Min. : 1.000	Min. :2006	Length:2930	Length:2930	Min. : 12789
1st Qu.: 733.2	1st Qu.:5.285e+08	1st Qu.: 20.00	Class :character	1st Qu.: 58.00	1st Qu.: 7440	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	1st Qu.: 5.000	1st Qu.:5.000	1st Qu.:1954	1st Qu.:1965	Class :character	Class :character	Class :character	Class :character	Class :character	1st Qu.: 0.0	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	1st Qu.: 0.0	Class :character	1st Qu.: 0.00	1st Qu.: 219.0	1st Qu.: 793	Class :character	Class :character	Class :character	Class :character	1st Qu.: 876.2	1st Qu.: 0.0	1st Qu.: 0.000	1st Qu.:1126	1st Qu.:0.0000	1st Qu.:0.00000	1st Qu.:1.000	1st Qu.:0.0000	1st Qu.:2.000	1st Qu.:1.000	Class :character	1st Qu.: 5.000	Class :character	1st Qu.:0.0000	Class :character	Class :character	1st Qu.:1960	Class :character	1st Qu.:1.000	1st Qu.: 320.0	Class :character	Class :character	Class :character	1st Qu.: 0.00	1st Qu.: 0.00	1st Qu.: 0.00	1st Qu.: 0.000	1st Qu.: 0	1st Qu.: 0.000	Class :character	Class :character	Class :character	1st Qu.: 0.00	1st Qu.: 4.000	1st Qu.:2007	Class :character	Class :character	1st Qu.:129500
Median :1465.5	Median :5.355e+08	Median : 50.00	Mode :character	Median : 68.00	Median : 9436	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Median : 6.000	Median :5.000	Median :1973	Median :1993	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Median : 0.0	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Median : 370.0	Mode :character	Median : 0.00	Median : 466.0	Median : 990	Mode :character	Mode :character	Mode :character	Mode :character	Median :1084.0	Median : 0.0	Median : 0.000	Median :1442	Median :0.0000	Median :0.00000	Median :2.000	Median :0.0000	Median :3.000	Median :1.000	Mode :character	Median : 6.000	Mode :character	Median :1.0000	Mode :character	Mode :character	Median :1979	Mode :character	Median :2.000	Median : 480.0	Mode :character	Mode :character	Mode :character	Median : 0.00	Median : 27.00	Median : 0.00	Median : 0.000	Median : 0	Median : 0.000	Mode :character	Mode :character	Mode :character	Median : 0.00	Median : 6.000	Median :2008	Mode :character	Mode :character	Median :160000
Mean :1465.5	Mean :7.145e+08	Mean : 57.39	NA	Mean : 69.22	Mean : 10148	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	Mean : 6.095	Mean :5.563	Mean :1971	Mean :1984	NA	NA	NA	NA	NA	Mean : 101.9	NA	NA	NA	NA	NA	NA	NA	Mean : 442.6	NA	Mean : 49.72	Mean : 559.3	Mean :1052	NA	NA	NA	NA	Mean :1159.6	Mean : 335.5	Mean : 4.677	Mean :1500	Mean :0.4314	Mean :0.06113	Mean :1.567	Mean :0.3795	Mean :2.854	Mean :1.044	NA	Mean : 6.443	NA	Mean :0.5993	NA	NA	Mean :1978	NA	Mean :1.767	Mean : 472.8	NA	NA	NA	Mean : 93.75	Mean : 47.53	Mean : 23.01	Mean : 2.592	Mean : 16	Mean : 2.243	NA	NA	NA	Mean : 50.63	Mean : 6.216	Mean :2008	NA	NA	Mean :180796
3rd Qu.:2197.8	3rd Qu.:9.072e+08	3rd Qu.: 70.00	NA	3rd Qu.: 80.00	3rd Qu.: 11555	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	3rd Qu.: 7.000	3rd Qu.:6.000	3rd Qu.:2001	3rd Qu.:2004	NA	NA	NA	NA	NA	3rd Qu.: 164.0	NA	NA	NA	NA	NA	NA	NA	3rd Qu.: 734.0	NA	3rd Qu.: 0.00	3rd Qu.: 802.0	3rd Qu.:1302	NA	NA	NA	NA	3rd Qu.:1384.0	3rd Qu.: 703.8	3rd Qu.: 0.000	3rd Qu.:1743	3rd Qu.:1.0000	3rd Qu.:0.00000	3rd Qu.:2.000	3rd Qu.:1.0000	3rd Qu.:3.000	3rd Qu.:1.000	NA	3rd Qu.: 7.000	NA	3rd Qu.:1.0000	NA	NA	3rd Qu.:2002	NA	3rd Qu.:2.000	3rd Qu.: 576.0	NA	NA	NA	3rd Qu.: 168.00	3rd Qu.: 70.00	3rd Qu.: 0.00	3rd Qu.: 0.000	3rd Qu.: 0	3rd Qu.: 0.000	NA	NA	NA	3rd Qu.: 0.00	3rd Qu.: 8.000	3rd Qu.:2009	NA	NA	3rd Qu.:213500
Max. :2930.0	Max. :1.007e+09	Max. :190.00	NA	Max. :313.00	Max. :215245	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	Max. :10.000	Max. :9.000	Max. :2010	Max. :2010	NA	NA	NA	NA	NA	Max. :1600.0	NA	NA	NA	NA	NA	NA	NA	Max. :5644.0	NA	Max. :1526.00	Max. :2336.0	Max. :6110	NA	NA	NA	NA	Max. :5095.0	Max. :2065.0	Max. :1064.000	Max. :5642	Max. :3.0000	Max. :2.00000	Max. :4.000	Max. :2.0000	Max. :8.000	Max. :3.000	NA	Max. :15.000	NA	Max. :4.0000	NA	NA	Max. :2207	NA	Max. :5.000	Max. :1488.0	NA	NA	NA	Max. :1424.00	Max. :742.00	Max. :1012.00	Max. :508.000	Max. :576	Max. :800.000	NA	NA	NA	Max. :17000.00	Max. :12.000	Max. :2010	NA	NA	Max. :755000
NA	NA	NA	NA	NA’s :490	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA’s :23	NA	NA	NA	NA	NA	NA	NA	NA’s :1	NA	NA’s :1	NA’s :1	NA’s :1	NA	NA	NA	NA	NA	NA	NA	NA	NA’s :2	NA’s :2	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA’s :159	NA	NA’s :1	NA’s :1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

The str()function is used to display the structure of the dataset M1W1Data. It reveals the types and formats of the variables in the dataset, including the number of observations and variables.

# Descriptive statistics of numerical columns
num_summary <- summary(select_if(M1W1Data, is.numeric))

kable(num_summary, format = "html", align = "l") %>%
  column_spec(1, bold = TRUE)%>%
  kable_styling(full_width = TRUE, "striped",font_size = 14) %>%
  row_spec(0, bold = TRUE, background = "slategrey" , color = "white")%>%
  scroll_box(width = "100%", height = "400px")

order	pid	ms_subclass	lot_frontage	lot_area	overall_qual	overall_cond	year_built	year_remod_add	mas_vnr_area	bsmtfin_sf_1	bsmtfin_sf_2	bsmt_unf_sf	total_bsmt_sf	x1st_flr_sf	x2nd_flr_sf	low_qual_fin_sf	gr_liv_area	bsmt_full_bath	bsmt_half_bath	full_bath	half_bath	bedroom_abvgr	kitchen_abvgr	totrms_abvgrd	fireplaces	garage_yr_blt	garage_cars	garage_area	wood_deck_sf	open_porch_sf	enclosed_porch	x3ssn_porch	screen_porch	pool_area	misc_val	mo_sold	yr_sold	saleprice
Min. : 1.0	Min. :5.263e+08	Min. : 20.00	Min. : 21.00	Min. : 1300	Min. : 1.000	Min. :1.000	Min. :1872	Min. :1950	Min. : 0.0	Min. : 0.0	Min. : 0.00	Min. : 0.0	Min. : 0	Min. : 334.0	Min. : 0.0	Min. : 0.000	Min. : 334	Min. :0.0000	Min. :0.00000	Min. :0.000	Min. :0.0000	Min. :0.000	Min. :0.000	Min. : 2.000	Min. :0.0000	Min. :1895	Min. :0.000	Min. : 0.0	Min. : 0.00	Min. : 0.00	Min. : 0.00	Min. : 0.000	Min. : 0	Min. : 0.000	Min. : 0.00	Min. : 1.000	Min. :2006	Min. : 12789
1st Qu.: 733.2	1st Qu.:5.285e+08	1st Qu.: 20.00	1st Qu.: 58.00	1st Qu.: 7440	1st Qu.: 5.000	1st Qu.:5.000	1st Qu.:1954	1st Qu.:1965	1st Qu.: 0.0	1st Qu.: 0.0	1st Qu.: 0.00	1st Qu.: 219.0	1st Qu.: 793	1st Qu.: 876.2	1st Qu.: 0.0	1st Qu.: 0.000	1st Qu.:1126	1st Qu.:0.0000	1st Qu.:0.00000	1st Qu.:1.000	1st Qu.:0.0000	1st Qu.:2.000	1st Qu.:1.000	1st Qu.: 5.000	1st Qu.:0.0000	1st Qu.:1960	1st Qu.:1.000	1st Qu.: 320.0	1st Qu.: 0.00	1st Qu.: 0.00	1st Qu.: 0.00	1st Qu.: 0.000	1st Qu.: 0	1st Qu.: 0.000	1st Qu.: 0.00	1st Qu.: 4.000	1st Qu.:2007	1st Qu.:129500
Median :1465.5	Median :5.355e+08	Median : 50.00	Median : 68.00	Median : 9436	Median : 6.000	Median :5.000	Median :1973	Median :1993	Median : 0.0	Median : 370.0	Median : 0.00	Median : 466.0	Median : 990	Median :1084.0	Median : 0.0	Median : 0.000	Median :1442	Median :0.0000	Median :0.00000	Median :2.000	Median :0.0000	Median :3.000	Median :1.000	Median : 6.000	Median :1.0000	Median :1979	Median :2.000	Median : 480.0	Median : 0.00	Median : 27.00	Median : 0.00	Median : 0.000	Median : 0	Median : 0.000	Median : 0.00	Median : 6.000	Median :2008	Median :160000
Mean :1465.5	Mean :7.145e+08	Mean : 57.39	Mean : 69.22	Mean : 10148	Mean : 6.095	Mean :5.563	Mean :1971	Mean :1984	Mean : 101.9	Mean : 442.6	Mean : 49.72	Mean : 559.3	Mean :1052	Mean :1159.6	Mean : 335.5	Mean : 4.677	Mean :1500	Mean :0.4314	Mean :0.06113	Mean :1.567	Mean :0.3795	Mean :2.854	Mean :1.044	Mean : 6.443	Mean :0.5993	Mean :1978	Mean :1.767	Mean : 472.8	Mean : 93.75	Mean : 47.53	Mean : 23.01	Mean : 2.592	Mean : 16	Mean : 2.243	Mean : 50.63	Mean : 6.216	Mean :2008	Mean :180796
3rd Qu.:2197.8	3rd Qu.:9.072e+08	3rd Qu.: 70.00	3rd Qu.: 80.00	3rd Qu.: 11555	3rd Qu.: 7.000	3rd Qu.:6.000	3rd Qu.:2001	3rd Qu.:2004	3rd Qu.: 164.0	3rd Qu.: 734.0	3rd Qu.: 0.00	3rd Qu.: 802.0	3rd Qu.:1302	3rd Qu.:1384.0	3rd Qu.: 703.8	3rd Qu.: 0.000	3rd Qu.:1743	3rd Qu.:1.0000	3rd Qu.:0.00000	3rd Qu.:2.000	3rd Qu.:1.0000	3rd Qu.:3.000	3rd Qu.:1.000	3rd Qu.: 7.000	3rd Qu.:1.0000	3rd Qu.:2002	3rd Qu.:2.000	3rd Qu.: 576.0	3rd Qu.: 168.00	3rd Qu.: 70.00	3rd Qu.: 0.00	3rd Qu.: 0.000	3rd Qu.: 0	3rd Qu.: 0.000	3rd Qu.: 0.00	3rd Qu.: 8.000	3rd Qu.:2009	3rd Qu.:213500
Max. :2930.0	Max. :1.007e+09	Max. :190.00	Max. :313.00	Max. :215245	Max. :10.000	Max. :9.000	Max. :2010	Max. :2010	Max. :1600.0	Max. :5644.0	Max. :1526.00	Max. :2336.0	Max. :6110	Max. :5095.0	Max. :2065.0	Max. :1064.000	Max. :5642	Max. :3.0000	Max. :2.00000	Max. :4.000	Max. :2.0000	Max. :8.000	Max. :3.000	Max. :15.000	Max. :4.0000	Max. :2207	Max. :5.000	Max. :1488.0	Max. :1424.00	Max. :742.00	Max. :1012.00	Max. :508.000	Max. :576	Max. :800.000	Max. :17000.00	Max. :12.000	Max. :2010	Max. :755000
NA	NA	NA	NA’s :490	NA	NA	NA	NA	NA	NA’s :23	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA	NA	NA	NA	NA’s :2	NA’s :2	NA	NA	NA	NA	NA	NA	NA’s :159	NA’s :1	NA’s :1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

The above provided code computes summary statistics for the numerical columns within the dataset.

# Descriptive statistics of categorical columns
cat_summary <- summary(select_if(M1W1Data, is.character))

kable(cat_summary, format = "html", align = "l") %>%
  column_spec(1, bold = TRUE)%>%
  kable_styling(full_width = TRUE, "striped",font_size = 14) %>%
  row_spec(0, bold = TRUE, background = "slategrey" , color = "white")%>%
  scroll_box(width = "100%", height = "200px")

ms_zoning	street	alley	lot_shape	land_contour	utilities	lot_config	land_slope	neighborhood	condition_1	condition_2	bldg_type	house_style	roof_style	roof_matl	exterior_1st	exterior_2nd	mas_vnr_type	exter_qual	exter_cond	foundation	bsmt_qual	bsmt_cond	bsmt_exposure	bsmtfin_type_1	bsmtfin_type_2	heating	heating_qc	central_air	electrical	kitchen_qual	functional	fireplace_qu	garage_type	garage_finish	garage_qual	garage_cond	paved_drive	pool_qc	fence	misc_feature	sale_type	sale_condition
Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930	Length:2930
Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character	Class :character
Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character	Mode :character

The above provided code computes summary statistics for the categorical columns within the dataset.

Observations to Task_2:

The Ames Housing dataset under evaluation has 2930 observations over 82 variables. The variables contain a range of data kinds, including integers (denoted as ‘int’) and character strings (denoted as ‘chr’). It’s important to notice that some variables have NA values, which indicate missing or unrecorded data.

The summary command is used in the above code to provide a short summary of the dataset variables. The resulting table includes key statistics for each variable, such as minimum, maximum, median, mean, and quartile values. This allows for a fast grasp of the range and distribution of numerical properties.

3. Prepare the dataset for modeling by imputing missing values with the variable’s mean value or any other value that you prefer.

data <- M1W1Data 

# Check for missing values
summarize_na <- colSums(is.na(data))
#summarize_na

# Separating numeric and non-numeric columns
numeric_cols <- sapply(data, is.numeric)
non_numeric_cols <- sapply(data, function(x) !is.numeric(x))

# Imputing missing values for numeric cols with mean
data[numeric_cols] <- lapply(data[numeric_cols], function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))

# Imputing missig vlaues for non-numeric cols with "NA" 
data[non_numeric_cols] <- lapply(data[non_numeric_cols], function(x) ifelse(is.na(x), "NA", x))

variables_missing <- c("bsmt_qual", "bsmt_cond", "bsmt_exposure", "bsmtfin_type_1", "bsmtfin_type_2", "garage_type", "garage_finish", "garage_qual", "garage_cond", "mas_vnr_type")

# Replace NA with "NA" in the specified columns
data[variables_missing][is.na(data[variables_missing])] <- "NA"


# Checking again for missing values
missing_value_check <- colSums(is.na(data))
#missing_value_check

#view(data)

Observations to Task_3:

The code first separates the columns into numeric and non numeric using sapply(). This allows for imputation to be tailored based on data type. Imputing data correctly based on its type is an important consideration.

For numeric columns, lapply() is used to loop through each column and replace missing values with the mean. Imputing with the mean is a suitable strategy for numeric data.

Non-numeric columns have NA values imputed with a new NA category. This maintains the categorical nature of these variables, rather than converting them to numeric. Imputing a new category is appropriate for these data types.

Blank values across all columns are replaced with NA using lapply(). This ensures blank strings are treated as missing values, for consistency. Handling blank values is an important data cleaning step.

4. Use the cor() function to produce a correlation matrix of the numeric values.

numeric_data <- data[sapply(data, is.numeric)]

# Calculate the correlation matrix
correlation_matrix <- cor(numeric_data)
#correlation_matrix
# View the correlation matrix
view(correlation_matrix)

Observations to Task_4:

Firstly, we have selected only the columns that contain numerical data from the entire dataset using the code: numeric_data <- data[sapply(data, is.numeric)]. This step is important as it isolates columns with numeric values for subsequent analysis.

Next, we generated a correlation matrix, generated by the cor() function, essentially indicates how the different numeric columns in the dataset are related to each other.

Finally, we have displayed the correlation matrix. The printed correlation matrix shows a table of values, where each number represents the relationship between two different numeric columns. A higher value (either positive or negative) signifies a stronger relationship between the two columns, while a value closer to zero suggests a weaker or negligible relationship.

The purpose of generating a correlation matrix is to understand how the different numeric columns in the dataset are interrelated. By analyzing this matrix, one can derive insights into which columns might influence each other, which is valuable for various data analysis and predictive modeling tasks.

5. Produce a plot of the correlation matrix, and explain how to interpret it. (hint - check the corrplot or ggcorrplot plot libraries)

corrplot(correlation_matrix, tl.cex = 0.5, type = "upper")

Observations to Task_5:

In the above correlation plot the colors represent the degree and direction of the correlation. Darker colors indicate a stronger correlation, with blue shades for positive correlations and red shades for negative correlations.

The hc.order = TRUE parameter has arranged the variables based on their correlations. This arrangement helps in identifying clusters of variables that are more strongly correlated with each other.

The parameter lab = TRUE displays the variable names next to the cells, making it easier to identify which variables are being compared in the plot.

This plot allows us to do more detailed exploration of correlations between variables, especially in larger datasets. By observing the color intensity and patterns, we can quickly identify variables that exhibit significant relationships, understand the direction of their correlations, and recognize clusters of variables that are closely related based on the hierarchical arrangement.

6. Make a scatter plot for the X continuous variable with the highest correlation with SalePrice. Do the same for the X variable that has the lowest correlation with SalePrice. Finally, make a scatter plot between X and SalePrice with the correlation closest to 0.5. Interpret the scatter plots and describe how the patterns differ.

# Extract the correlation values of SalePrice with other variables
correlation_with_saleprice <- correlation_matrix["saleprice", ]
correlation_with_saleprice_sorted <- sort(correlation_with_saleprice)
correlation_with_saleprice_sorted

##             pid  enclosed_porch   kitchen_abvgr    overall_cond     ms_subclass 
##    -0.246521213    -0.128787442    -0.119813720    -0.101696932    -0.085091576 
## low_qual_fin_sf  bsmt_half_bath           order         yr_sold        misc_val 
##    -0.037659765    -0.035815123    -0.031407925    -0.030569087    -0.015691463 
##    bsmtfin_sf_2     x3ssn_porch         mo_sold       pool_area    screen_porch 
##     0.005889764     0.032224649     0.035258842     0.068403247     0.112151214 
##   bedroom_abvgr     bsmt_unf_sf        lot_area     x2nd_flr_sf  bsmt_full_bath 
##     0.143913428     0.182804552     0.266549220     0.269373357     0.275893674 
##       half_bath   open_porch_sf    wood_deck_sf    lot_frontage    bsmtfin_sf_1 
##     0.285056032     0.312950506     0.327143174     0.340751054     0.432794357 
##      fireplaces   totrms_abvgrd    mas_vnr_area   garage_yr_blt  year_remod_add 
##     0.474558093     0.495474417     0.505784081     0.510684432     0.532973754 
##       full_bath      year_built     x1st_flr_sf   total_bsmt_sf     garage_area 
##     0.545603901     0.558426106     0.621676063     0.632105117     0.640385461 
##     garage_cars     gr_liv_area    overall_qual       saleprice 
##     0.647861110     0.706779921     0.799261795     1.000000000

plot(data$saleprice, data$saleprice, 
     xlab = "SalePrice", ylab = "SalePrice",
     main = "Scatter plot between sale price with itself")

plot(data$bsmtfin_sf_2, data$saleprice, 
     xlab = "SalePrice", ylab = "bsmtfin_sf_2",
     main = "Scatter plot Sale Price and bsmtfin_sf_2")

plot(data$garage_area, data$saleprice, 
     xlab = "Sale Price", ylab = "Garage Area",
     main = "Scatter plot between Sale Price and Garage Area")

Observations to Task_6:

By examining the sorted correlation values, it provides an overview of how strongly each variable is related to the SalePrice. Variables with higher correlations are typically more influential in predicting SalePrice where SalePrice itself correlates with the highest value.

Scatter Plots:

SalePrice vs. SalePrice: This plot simply shows a perfect linear relationship, which is expected as it’s the variable plotted against itself.
SalePrice vs. bsmtfin_sf_2: The scatter plot visualizes the relationship between ‘SalePrice’ and ‘bsmtfin_sf_2’. This plot is not that much linear and there are outliers also which makes the interepretation tough.
SalePrice vs. Garage Area: This plot shows the relationship between SalePrice and Garage Area. It has some outliers but we can see that with increse in sales price the Garage area also increases fot most of the places

7. Using at least 3 continuous variables, fit a regression model in R.

I have chosen the following three continuous varibales for the regression model: (Lot Frontage, Lot Area and Garage Area).

# Fit a linear regression model
model <- lm(saleprice ~ lot_frontage + lot_area + garage_area, data = numeric_data)

# Summary of the regression model
summary(model)

## 
## Call:
## lm(formula = saleprice ~ lot_frontage + lot_area + garage_area, 
##     data = numeric_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -378818  -32640   -4477   25820  470664 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.027e+04  3.983e+03  10.111  < 2e-16 ***
## lot_frontage 3.976e+02  5.827e+01   6.823 1.08e-11 ***
## lot_area     1.052e+00  1.519e-01   6.928 5.22e-12 ***
## garage_area  2.164e+02  5.505e+00  39.311  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 59980 on 2926 degrees of freedom
## Multiple R-squared:  0.4368, Adjusted R-squared:  0.4362 
## F-statistic: 756.5 on 3 and 2926 DF,  p-value: < 2.2e-16

# Extract model coefficients
coefficients <- summary(model)$coefficients
coefficients

##                  Estimate   Std. Error   t value      Pr(>|t|)
## (Intercept)  40270.388263 3982.6789662 10.111382  1.197653e-23
## lot_frontage   397.582519   58.2747105  6.822557  1.082565e-11
## lot_area         1.052361    0.1518948  6.928226  5.219572e-12
## garage_area    216.412087    5.5051092 39.311134 9.216922e-272

# Report the model in equation form
cat("SalePrice =", 
    round(coefficients[1, 1], digits = 2), 
    "+", round(coefficients[2, 1], digits = 2), "* lot_frontage",
    "+", round(coefficients[3, 1], digits = 2), "* lot_area",
    "+", round(coefficients[4, 1], digits = 2), "* garage_area", "\n")

## SalePrice = 40270.39 + 397.58 * lot_frontage + 1.05 * lot_area + 216.41 * garage_area

Observations to Task_7:

The model predicts home sale prices in Ames, using three predictor variables lot_frontage, lot_area, and garage_area. These variables likely represent important property characteristics influencing market value.

The output from the R summary() function indicates all three predictors are highly statistically significant (p < 0.001) with t-values exceeding 6. This signals they provide useful information to the model.

The R-squared of 0.4368 and adjusted R-squared of 0.4362 suggest that together the predictors explain approximately 43.7% of the variation in sale prices. The minimal difference between the two values implies the model should generalize reasonably well to new data.

Multicollinearity does not seem problematic as evidenced by the large t-values and relatively small standard errors for the coefficients.

The extracted model coefficients show lot_frontage and garage_area have the largest positive relationships with sale price. The equation form demonstrates a 1 unit increase in lot_frontage and garage_area is associated with a 397 and 216 increase in price, holding other variables constant.

Overall, the model provides a reasonable fit but could likely be improved by incorporating more predictive variables. Further refinement may increase the R-squared and decrease the residual error.

8. Report the model in equation form and interpret each coefficient of the model in the context of this problem.

Formula Used:

Sales Price = b0 + b1* lot frontage + b2* lot area + b3* garage area

Where:

b0 is the intercept.

b1 is the coefficient for lot_frontage.

b2 is the coefficient for lot_area.

b3 is the coefficient for garage_area.

Intercept (b0): The intercept represents the estimated SalePrice when all the independent variables (lot_frontage, lot_area, and garage_area) are zero. In this context, it may not have a meaningful interpretation because it’s unlikely that all these variables would be zero.
Coefficient for lot_frontage (b1): This coefficient represents the change in SalePrice for a one-unit increase in lot_frontage, holding other variables constant. For example, if b1 is 100, it means that for each additional unit of lot_frontage, the SalePrice is estimated to increase by 100 units, assuming all other variables remain constant.
Coefficient for lot_area (b2): This coefficient represents the change in SalePrice for a one-unit increase in lot_area, holding other variables constant. Similarly, if b2 is, for instance, 50, it means that for each additional unit of lot_area, the SalePrice is estimated to increase by 50 units, assuming all other variables remain constant.
Coefficient for garage_area (b3): This coefficient represents the change in SalePrice for a one-unit increase in garage_area, holding other variables constant. If b3 is, for example, 75, it means that for each additional unit of garage_area, the SalePrice is estimated to increase by 75 units, assuming all other variables remain constant.

9. Use the plot() function to plot your regression model. Interpret the four graphs that are produced.

# Plot the regression model diagnostics
par(mfrow=c(2,2))
plot(model)

Observations to Task_9:

These four plots are common diagnostic plots used to evaluate the assumptions of a linear regression model.

The residuals versus fitted values plot shows the errors scattered against the predicted values. We want to see a random pattern, but here the residuals seem to grow as the fitted values increase, implying the variance may not be constant across the range.

The normal Q-Q plot checks if the residuals follow a normal distribution, a requirement for regression. Most points align with the reference line, but some deviation at the extremes indicates potential outliers or heavy-tailed residuals.

The scale-location plot is similar to the first, assessing constancy of variance. Again, the increasing spread of residuals across fitted values points to possible heteroscedasticity.

The residuals versus leverage plot identifies highly influential observations. Points outside the Cook’s distance lines are leveraged and could skew the model. A few cases here appear potentially problematic for the overall fit.

In summary, the diagnostic plots reveal areas of concern like non-constant variance, influential points, and non-normality that may be distorting the model. Addressing these issues could improve the regression performance and reliability.

Overall, while the model might capture the central tendency of the data, the diagnostic plots suggest potential issues with non-constant variance (heteroscedasticity), the normality of residuals, and the presence of influential points. Addressing these issues might involve transformations of the dependent variable, robust regression methods, or removing influential observations, depending on the context and severity of the issues.

10. Check your model for multicollinearity and report your findings. What steps would you take to correct multicollinearity if it exists?

# Calculate Variance Inflation Factor (VIF)
vif_result <- vif(model)
print(vif_result)

## lot_frontage     lot_area  garage_area 
##     1.256838     1.166333     1.140594

Observations to Task_10:

The Variance Inflation Factor (VIF) assesses multicollinearity in a regression model, with values below 5 considered acceptable. For lot_frontage, lot_area, and garage_area, VIFs of 1.26, 1.17, and 1.14 respectively suggest very low multicollinearity. Overall, these variables exhibit very low inter-variable influence, assuring minimal multicollinearity concerns in the model.

If there is a multicollinearity we can use these steps to mitigate it:

Remove variables with high VIF values. Sometimes dropping one of the highly correlated variables can help alleviate multicollinearity.
Scale continuous variables (e.g., using standardization or normalization) to reduce the scale differences between predictors.
If multicollinearity persists, PCA (Principal Component Analysis ) can be applied to transform the correlated variables into a set of uncorrelated variables.

After making adjustments to address multicollinearity, we need to re-evaluate the regression model to check if the issue has been resolved and the model’s performance has improved or not.

11. Check your model for outliers and report your findings. Should these observations be removed from the model?

# Extracting residuals and leverage values
residuals <- residuals(model)
leverage <- hatvalues(model)
stud_resid <- rstudent(model)
cooks_d <- cooks.distance(model)

# Combine the data with the residuals, leverage, and statistics
model_data <- model.frame(model)
outlier_data <- cbind(model_data, residuals, leverage, stud_resid, cooks_d)

# Create plots to visualize outliers
# Plot residuals vs. fitted values
plot(model, which = 1)

# Plot residuals vs. leverage
plot(model, which = 3)

# Plot Cook's distance
plot(model, which = 4)

Observations to Task_11:

Yes we do have outliers in the above model.

Outliers can have a significant impact on a regression model, influencing coefficient estimation and interpretation. They may require cautious thinking before removing them. Their removal, however, should be supported by domain knowledge or statistical reasons. It is critical to ensure that outliers are not influential because of a specific trend in the data or any systematic problem.

The residuals versus fitted values plot shows the errors spreading outwards as the predicted values get larger. This fanning pattern hints at heteroscedasticity - the error variance changes across the range of predictors, violating a regression assumption.

A few distinct points detach from the main cluster, like observations 1768 and 1499. These potential outliers could substantially impact model fit if they represent anomalies and not just natural variability.

The red line depicting the residual average trend isn’t perfectly horizontal, implying the relationship between predictors and outcome may not be strictly linear. The model may be missing some nonlinear associations between the independent and dependent variables.

In essence, the diagnostic plots signal concerns like heteroscedasticity, outliers, and nonlinearity that could undermine model assumptions. Addressing these issues through things like data cleaning, transformations, or model revisions could greatly improve the reliability and interpretability of the regression results.

Conclusion

In this analysis, we conducted an in-depth examination of the Ames, Iowa housing dataset, with the goals of understanding key drivers of home sale prices and developing a predictive pricing model. Through exploratory analysis and correlation studies, we gained insight into relationships between sale price and property attributes like lot size, living area, and garage space.

Our multiple linear regression model revealed lot frontage, lot area, and garage area as useful predictors, explaining approximately 44% of variation in sale prices.

In summary, our methodology enabled tangible progress towards the goals outlined. The workflow, from initial data wrangling to regression diagnostics, serves as a practical example of an end-to-end analytics process. The skills applied here are exploratory analysis, correlation mapping, model development will useful for learning.

References

Dee Chiluiza. (2022, June 25).RPubs. https://rpubs.com/Dee_Chiluiza/vectors_matrix

Appendix
This report contains an R Markdown file named as follows Week_1_ALY6015_Ansari.Rmd

Week_1_ALY6015 Assignment

2023-11-12

Library

Introduction

Analysis