rm(list=ls())
#Library######
library(readr)
library(tidyverse)
library(dplyr)
library(DT)
library(RColorBrewer)
library(rio)
library(dbplyr)
library(psych)
library(FSA)
library(knitr)
library(plotrix)
library(kableExtra)
library(ISLR)
library(data.table)
library(magrittr)
library(ggplot2)
library(summarytools)
library(hrbrthemes)
library(cowplot)
library(reshape2)
library(scales)
library(zoo)
library(corrplot)
library(lares)
library(leaps)
library(MASS)
library(car)
library(sjPlot)
library(sjmisc)
library(sjlabelled)
library(tidyr)
#dataset used######
AmesHousing_1 <- read_csv("Datasets/AmesHousing-1.csv")
Introduction
Simple regression analysis studies the relationship between a single
dependent variable and a single independent variable. For example, a
model that relates the number of cigarettes smoked per day (independent
variable) to the number of years a smoker lives (dependent variable)
may explain 85% of the variation in the number of years lived, leaving
15% unexplained. What if I also want to study the effect of alcohol
alongside cigarettes? Rather than running two separate simple
regressions, one of years lived on cigarettes per day and one of years
lived on alcohol consumption in ml/day, we can use multiple regression
analysis. In that case, alcohol consumption (ml/day) and cigarettes per
day are the independent variables, and the number of years lived is the
dependent variable (Bluman, 2014).
The multiple correlation coefficient is denoted by \(R\), and the coefficient of multiple determination
is denoted by \(R^2\).
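As a minimal sketch of these two quantities, the block below fits a multiple regression on R's built-in mtcars data and checks that \(R\), the correlation between observed and fitted values, squares to \(R^2\). The dataset and the variable choice are assumptions made only to keep the example self-contained; they are not part of the Ames analysis.

```r
# Illustrative only: mtcars stands in for a dataset with two predictors.
fit <- lm(mpg ~ wt + hp, data = mtcars)

r_squared  <- summary(fit)$r.squared         # coefficient of multiple determination
multiple_r <- cor(mtcars$mpg, fitted(fit))   # multiple correlation coefficient R

# For least squares with an intercept, R^2 equals the square of R.
print(c(R = multiple_r, R2 = r_squared))
```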
Moreover, understanding the data is crucial before preparing a study
plan. The study plan and the statistical tests always depend on the
data, so knowing the data makes both easier to choose. Using the
correct tools is equally important, and knowing which chart suits which
type of variable helps with data visualization (Bluman, 2014).
In a nutshell, it is important to become familiar with the data and to use appropriate statistical tools and tests in order to make sound decisions. This report uses the Ames Housing dataset, an alternative to the Boston Housing data. It covers individual residential properties sold in Ames, Iowa, between 2006 and 2010 and contains 82 variables and 2930 observations.
#Descriptive Statistics#######
##Data Describe#####
describe(AmesHousing_1) %>%
kable(caption = "<center> Table 1: Descriptive statistics of the dataset using describe()</center>",
align = "c",
digits = 2) %>%
kable_styling(bootstrap_options = c("hover",
"bordered",
"condensed",
"responsive",
"striped"), #the Bootstrap option is "striped", not "stripped"
font_size = 11) %>%
scroll_box(width = "100%", height = "100%") %>%
footnote(general = "Logical or categorical variables are converted to numeric and marked with *")
|  | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Order | 1 | 2930 | 1465.50 | 845.96 | 1465.5 | 1465.50 | 1086.00 | 1 | 2930 | 2929 | 0.00 | -1.20 | 15.63 |
| PID* | 2 | 2930 | 1465.50 | 845.96 | 1465.5 | 1465.50 | 1086.00 | 1 | 2930 | 2929 | 0.00 | -1.20 | 15.63 |
| MS SubClass* | 3 | 2930 | 5.29 | 4.36 | 5.0 | 4.75 | 5.93 | 1 | 16 | 15 | 0.73 | -0.51 | 0.08 |
| MS Zoning* | 4 | 2930 | 5.97 | 0.87 | 6.0 | 6.07 | 0.00 | 1 | 7 | 6 | -2.61 | 8.41 | 0.02 |
| Lot Frontage | 5 | 2440 | 69.22 | 23.37 | 68.0 | 68.35 | 17.79 | 21 | 313 | 292 | 1.50 | 11.20 | 0.47 |
| Lot Area | 6 | 2930 | 10147.92 | 7880.02 | 9436.5 | 9481.05 | 3024.50 | 1300 | 215245 | 213945 | 12.81 | 264.39 | 145.58 |
| Street* | 7 | 2930 | 2.00 | 0.06 | 2.0 | 2.00 | 0.00 | 1 | 2 | 1 | -15.52 | 239.01 | 0.00 |
| Alley* | 8 | 198 | 1.39 | 0.49 | 1.0 | 1.37 | 0.00 | 1 | 2 | 1 | 0.43 | -1.82 | 0.03 |
| Lot Shape* | 9 | 2930 | 2.94 | 1.41 | 4.0 | 3.05 | 0.00 | 1 | 4 | 3 | -0.61 | -1.60 | 0.03 |
| Land Contour* | 10 | 2930 | 3.78 | 0.70 | 4.0 | 4.00 | 0.00 | 1 | 4 | 3 | -3.12 | 8.44 | 0.01 |
| Utilities* | 11 | 2930 | 1.00 | 0.06 | 1.0 | 1.00 | 0.00 | 1 | 3 | 2 | 34.02 | 1187.96 | 0.00 |
| Lot Config* | 12 | 2930 | 4.06 | 1.60 | 5.0 | 4.32 | 0.00 | 1 | 5 | 4 | -1.19 | -0.44 | 0.03 |
| Land Slope* | 13 | 2930 | 1.05 | 0.25 | 1.0 | 1.00 | 0.00 | 1 | 3 | 2 | 4.98 | 26.62 | 0.00 |
| Neighborhood* | 14 | 2930 | 15.30 | 7.02 | 16.0 | 15.40 | 8.90 | 1 | 28 | 27 | -0.20 | -1.19 | 0.13 |
| Condition 1* | 15 | 2930 | 3.04 | 0.87 | 3.0 | 3.00 | 0.00 | 1 | 9 | 8 | 2.99 | 15.74 | 0.02 |
| Condition 2* | 16 | 2930 | 3.00 | 0.21 | 3.0 | 3.00 | 0.00 | 1 | 8 | 7 | 12.08 | 308.97 | 0.00 |
| Bldg Type* | 17 | 2930 | 1.52 | 1.22 | 1.0 | 1.17 | 0.00 | 1 | 5 | 4 | 2.15 | 3.00 | 0.02 |
| House Style* | 18 | 2930 | 4.02 | 1.91 | 3.0 | 4.01 | 0.00 | 1 | 8 | 7 | 0.32 | -0.95 | 0.04 |
| Overall Qual | 19 | 2930 | 6.09 | 1.41 | 6.0 | 6.08 | 1.48 | 1 | 10 | 9 | 0.19 | 0.05 | 0.03 |
| Overall Cond | 20 | 2930 | 5.56 | 1.11 | 5.0 | 5.47 | 0.00 | 1 | 9 | 8 | 0.57 | 1.48 | 0.02 |
| Year Built | 21 | 2930 | 1971.36 | 30.25 | 1973.0 | 1974.25 | 37.06 | 1872 | 2010 | 138 | -0.60 | -0.50 | 0.56 |
| Year Remod/Add | 22 | 2930 | 1984.27 | 20.86 | 1993.0 | 1985.63 | 20.76 | 1950 | 2010 | 60 | -0.45 | -1.34 | 0.39 |
| Roof Style* | 23 | 2930 | 2.39 | 0.82 | 2.0 | 2.24 | 0.00 | 1 | 6 | 5 | 1.56 | 0.89 | 0.02 |
| Roof Matl* | 24 | 2930 | 2.06 | 0.54 | 2.0 | 2.00 | 0.00 | 1 | 8 | 7 | 8.72 | 76.98 | 0.01 |
| Exterior 1st* | 25 | 2930 | 11.16 | 3.65 | 14.0 | 11.47 | 1.48 | 1 | 16 | 15 | -0.59 | -0.76 | 0.07 |
| Exterior 2nd* | 26 | 2930 | 11.87 | 4.00 | 15.0 | 12.19 | 2.97 | 1 | 17 | 16 | -0.56 | -0.90 | 0.07 |
| Mas Vnr Type* | 27 | 2907 | 3.45 | 1.04 | 4.0 | 3.47 | 0.00 | 1 | 5 | 4 | -0.58 | -1.10 | 0.02 |
| Mas Vnr Area | 28 | 2907 | 101.90 | 179.11 | 0.0 | 61.14 | 0.00 | 0 | 1600 | 1600 | 2.60 | 9.26 | 3.32 |
| Exter Qual* | 29 | 2930 | 3.53 | 0.70 | 4.0 | 3.64 | 0.00 | 1 | 4 | 3 | -1.79 | 3.67 | 0.01 |
| Exter Cond* | 30 | 2930 | 4.71 | 0.77 | 5.0 | 4.93 | 0.00 | 1 | 5 | 4 | -2.50 | 5.11 | 0.01 |
| Foundation* | 31 | 2930 | 2.39 | 0.73 | 2.0 | 2.45 | 1.48 | 1 | 6 | 5 | 0.01 | 0.76 | 0.01 |
| Bsmt Qual* | 32 | 2850 | 3.69 | 1.31 | 3.0 | 3.85 | 2.97 | 1 | 5 | 4 | -0.46 | -0.83 | 0.02 |
| Bsmt Cond* | 33 | 2850 | 4.80 | 0.69 | 5.0 | 5.00 | 0.00 | 1 | 5 | 4 | -3.33 | 9.73 | 0.01 |
| Bsmt Exposure* | 34 | 2847 | 3.28 | 1.13 | 4.0 | 3.47 | 0.00 | 1 | 4 | 3 | -1.16 | -0.32 | 0.02 |
| BsmtFin Type 1* | 35 | 2850 | 3.76 | 1.81 | 3.0 | 3.82 | 2.97 | 1 | 6 | 5 | -0.04 | -1.36 | 0.03 |
| BsmtFin SF 1 | 36 | 2929 | 442.63 | 455.59 | 370.0 | 384.08 | 548.56 | 0 | 5644 | 5644 | 1.41 | 6.84 | 8.42 |
| BsmtFin Type 2* | 37 | 2849 | 5.68 | 1.01 | 6.0 | 5.97 | 0.00 | 1 | 6 | 5 | -3.38 | 10.80 | 0.02 |
| BsmtFin SF 2 | 38 | 2929 | 49.72 | 169.17 | 0.0 | 2.04 | 0.00 | 0 | 1526 | 1526 | 4.14 | 18.73 | 3.13 |
| Bsmt Unf SF | 39 | 2929 | 559.26 | 439.49 | 466.0 | 510.77 | 415.13 | 0 | 2336 | 2336 | 0.92 | 0.40 | 8.12 |
| Total Bsmt SF | 40 | 2929 | 1051.61 | 440.62 | 990.0 | 1035.05 | 349.89 | 0 | 6110 | 6110 | 1.16 | 9.11 | 8.14 |
| Heating* | 41 | 2930 | 2.03 | 0.25 | 2.0 | 2.00 | 0.00 | 1 | 6 | 5 | 12.10 | 168.45 | 0.00 |
| Heating QC* | 42 | 2930 | 2.54 | 1.74 | 1.0 | 2.42 | 0.00 | 1 | 5 | 4 | 0.48 | -1.52 | 0.03 |
| Central Air* | 43 | 2930 | 1.93 | 0.25 | 2.0 | 2.00 | 0.00 | 1 | 2 | 1 | -3.47 | 10.01 | 0.00 |
| Electrical* | 44 | 2929 | 4.69 | 1.05 | 5.0 | 5.00 | 0.00 | 1 | 5 | 4 | -3.09 | 7.67 | 0.02 |
| 1st Flr SF | 45 | 2930 | 1159.56 | 391.89 | 1084.0 | 1127.17 | 349.89 | 334 | 5095 | 4761 | 1.47 | 6.95 | 7.24 |
| 2nd Flr SF | 46 | 2930 | 335.46 | 428.40 | 0.0 | 272.90 | 0.00 | 0 | 2065 | 2065 | 0.87 | -0.42 | 7.91 |
| Low Qual Fin SF | 47 | 2930 | 4.68 | 46.31 | 0.0 | 0.00 | 0.00 | 0 | 1064 | 1064 | 12.11 | 175.18 | 0.86 |
| Gr Liv Area | 48 | 2930 | 1499.69 | 505.51 | 1442.0 | 1452.25 | 461.09 | 334 | 5642 | 5308 | 1.27 | 4.12 | 9.34 |
| Bsmt Full Bath | 49 | 2928 | 0.43 | 0.52 | 0.0 | 0.40 | 0.00 | 0 | 3 | 3 | 0.62 | -0.75 | 0.01 |
| Bsmt Half Bath | 50 | 2928 | 0.06 | 0.25 | 0.0 | 0.00 | 0.00 | 0 | 2 | 2 | 3.94 | 14.88 | 0.00 |
| Full Bath | 51 | 2930 | 1.57 | 0.55 | 2.0 | 1.56 | 0.00 | 0 | 4 | 4 | 0.17 | -0.54 | 0.01 |
| Half Bath | 52 | 2930 | 0.38 | 0.50 | 0.0 | 0.34 | 0.00 | 0 | 2 | 2 | 0.70 | -1.03 | 0.01 |
| Bedroom AbvGr | 53 | 2930 | 2.85 | 0.83 | 3.0 | 2.83 | 0.00 | 0 | 8 | 8 | 0.31 | 1.88 | 0.02 |
| Kitchen AbvGr | 54 | 2930 | 1.04 | 0.21 | 1.0 | 1.00 | 0.00 | 0 | 3 | 3 | 4.31 | 19.82 | 0.00 |
| Kitchen Qual* | 55 | 2930 | 3.86 | 1.27 | 5.0 | 4.03 | 0.00 | 1 | 5 | 4 | -0.62 | -0.68 | 0.02 |
| TotRms AbvGrd | 56 | 2930 | 6.44 | 1.57 | 6.0 | 6.33 | 1.48 | 2 | 15 | 13 | 0.75 | 1.15 | 0.03 |
| Functional* | 57 | 2930 | 7.69 | 1.18 | 8.0 | 8.00 | 0.00 | 1 | 8 | 7 | -3.83 | 13.79 | 0.02 |
| Fireplaces | 58 | 2930 | 0.60 | 0.65 | 1.0 | 0.52 | 1.48 | 0 | 4 | 4 | 0.74 | 0.10 | 0.01 |
| Fireplace Qu* | 59 | 1508 | 3.72 | 1.13 | 3.0 | 3.78 | 1.48 | 1 | 5 | 4 | -0.12 | -1.01 | 0.03 |
| Garage Type* | 60 | 2773 | 3.28 | 1.79 | 2.0 | 3.11 | 0.00 | 1 | 6 | 5 | 0.75 | -1.31 | 0.03 |
| Garage Yr Blt | 61 | 2771 | 1978.13 | 25.53 | 1979.0 | 1980.71 | 31.13 | 1895 | 2207 | 312 | -0.38 | 1.82 | 0.48 |
| Garage Finish* | 62 | 2771 | 2.18 | 0.82 | 2.0 | 2.23 | 1.48 | 1 | 3 | 2 | -0.35 | -1.43 | 0.02 |
| Garage Cars | 63 | 2929 | 1.77 | 0.76 | 2.0 | 1.77 | 0.00 | 0 | 5 | 5 | -0.22 | 0.24 | 0.01 |
| Garage Area | 64 | 2929 | 472.82 | 215.05 | 480.0 | 468.35 | 182.36 | 0 | 1488 | 1488 | 0.24 | 0.94 | 3.97 |
| Garage Qual* | 65 | 2771 | 4.84 | 0.66 | 5.0 | 5.00 | 0.00 | 1 | 5 | 4 | -4.02 | 14.47 | 0.01 |
| Garage Cond* | 66 | 2771 | 4.90 | 0.52 | 5.0 | 5.00 | 0.00 | 1 | 5 | 4 | -5.25 | 26.38 | 0.01 |
| Paved Drive* | 67 | 2930 | 2.83 | 0.54 | 3.0 | 3.00 | 0.00 | 1 | 3 | 2 | -2.98 | 7.15 | 0.01 |
| Wood Deck SF | 68 | 2930 | 93.75 | 126.36 | 0.0 | 71.21 | 0.00 | 0 | 1424 | 1424 | 1.84 | 6.73 | 2.33 |
| Open Porch SF | 69 | 2930 | 47.53 | 67.48 | 27.0 | 33.87 | 40.03 | 0 | 742 | 742 | 2.53 | 10.92 | 1.25 |
| Enclosed Porch | 70 | 2930 | 23.01 | 64.14 | 0.0 | 4.83 | 0.00 | 0 | 1012 | 1012 | 4.01 | 28.42 | 1.18 |
| 3Ssn Porch | 71 | 2930 | 2.59 | 25.14 | 0.0 | 0.00 | 0.00 | 0 | 508 | 508 | 11.39 | 149.63 | 0.46 |
| Screen Porch | 72 | 2930 | 16.00 | 56.09 | 0.0 | 0.00 | 0.00 | 0 | 576 | 576 | 3.95 | 17.81 | 1.04 |
| Pool Area | 73 | 2930 | 2.24 | 35.60 | 0.0 | 0.00 | 0.00 | 0 | 800 | 800 | 16.92 | 299.06 | 0.66 |
| Pool QC* | 74 | 13 | 2.46 | 1.20 | 3.0 | 2.45 | 1.48 | 1 | 4 | 3 | -0.05 | -1.68 | 0.33 |
| Fence* | 75 | 572 | 2.41 | 0.84 | 3.0 | 2.49 | 0.00 | 1 | 4 | 3 | -0.68 | -0.89 | 0.03 |
| Misc Feature* | 76 | 106 | 3.85 | 0.55 | 4.0 | 4.00 | 0.00 | 1 | 5 | 4 | -3.16 | 10.37 | 0.05 |
| Misc Val | 77 | 2930 | 50.64 | 566.34 | 0.0 | 0.00 | 0.00 | 0 | 17000 | 17000 | 21.98 | 564.85 | 10.46 |
| Mo Sold | 78 | 2930 | 6.22 | 2.71 | 6.0 | 6.16 | 2.97 | 1 | 12 | 11 | 0.19 | -0.46 | 0.05 |
| Yr Sold | 79 | 2930 | 2007.79 | 1.32 | 2008.0 | 2007.74 | 1.48 | 2006 | 2010 | 4 | 0.13 | -1.16 | 0.02 |
| Sale Type* | 80 | 2930 | 9.36 | 1.88 | 10.0 | 9.87 | 0.00 | 1 | 10 | 9 | -3.32 | 10.76 | 0.03 |
| Sale Condition* | 81 | 2930 | 4.78 | 1.08 | 5.0 | 5.00 | 0.00 | 1 | 6 | 5 | -2.79 | 7.25 | 0.02 |
| SalePrice | 82 | 2930 | 180796.06 | 79886.69 | 160000.0 | 170429.15 | 54856.20 | 12789 | 755000 | 742211 | 1.74 | 5.10 | 1475.84 |
##Exploratory Data Analysis#####
house_rel = subset(AmesHousing_1, select=c("Lot Area",
"Total Bsmt SF",
"Year Built",
"Year Remod/Add",
"Yr Sold",
"SalePrice"))
names(house_rel) %<>% stringr::str_replace_all("\\s","_") #renaming column names
ggplot(data = house_rel, mapping = aes(x = Lot_Area, y = SalePrice)) +
geom_boxplot(mapping = aes(group = cut_width(Lot_Area, 12000)))+
coord_cartesian(xlim = c(0, 100000),
ylim = c(0,1000000))+
ggtitle("Plot 2.1: Lot Area to House Sales Price")+
xlab("Area of Lot (sq. feet)")+
ylab("Sales Price(USD)")+
theme(plot.title = element_text(hjust = 0.5))
#EDA overview
house_rel_mod <- reshape2::melt(house_rel) #convert columns to rows (wide to long)
ggplot(house_rel_mod, aes(value)) +
facet_wrap(~variable, scales = 'free_x') +
geom_histogram(binwidth = function(x) 2.8 * IQR(x) / (length(x)^(1/3)))+
scale_x_continuous(labels = function(x) format(x, scientific = FALSE))+
ggtitle("Plot 2.2: EDA Overview")+
theme(plot.title = element_text(hjust = 0.5))
ggplot(house_rel)+ #years sold bar chart
geom_bar(mapping = aes(x=Yr_Sold),
fill = "turquoise")+
ggtitle("Plot 2.3: House Year Sold")+
xlab("Year Sold")+
ylab("Number of Houses")+
theme(plot.title = element_text(hjust = 0.5))
ggplot(house_rel, aes(Year_Built))+
geom_bar(stat="count", width=0.6, fill="steelblue")+
scale_x_binned()+
ggtitle("Plot 2.4: House Year Built")+
xlab("House Built Year")+
ylab("Number of Houses")+
theme(plot.title = element_text(hjust = 0.5))
ggplot(data = house_rel, aes(x = Lot_Area))+
geom_bar(stat="count", width=0.7, fill="steelblue")+
scale_x_binned(limits = c(0, 250000))+ #one x scale only; a separate xlim() would be replaced by this scale
ggtitle("Plot 2.5: House Lot Area")+
xlab("Area of Lot (sq. feet)")+
ylab("Number of Houses")+
theme(plot.title = element_text(hjust = 0.5))
Before starting a study, it is crucial to understand the data.
Descriptive statistics and appropriate statistical tests are the basis
for conclusions that support decision making. Table 1 applies
describe() to the Ames Housing dataset and returns the key descriptive
statistics for each variable.
describe() computes descriptive statistics only for numeric variables.
Logical and categorical variables are converted to numeric codes for
the sake of calculation and are marked with \(*\). In Table 1, MS SubClass, MS Zoning,
Street, Lot Shape, and the other logical and categorical variables
carry this mark, so their statistics describe the codes rather than a
measured quantity.
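The conversion behind the starred rows can be sketched in base R. This assumes that describe() treats a categorical column much like `as.numeric(factor(x))`; the values below are hypothetical and only resemble the Street variable.

```r
# Hypothetical values resembling the Street variable.
street <- c("Pave", "Grvl", "Pave", "Pave")

# Factor levels are sorted alphabetically, so Grvl -> 1 and Pave -> 2.
street_codes <- as.numeric(factor(street))
print(street_codes)

# The mean of the codes is computable but has no physical meaning.
mean(street_codes)
```

This is why a starred variable such as Street can show a minimum of 1 and a maximum of 2 in Table 1.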
Descriptive statistics also help in choosing appropriate data
visualization tools. The box plot in Plot 2.1 shows the relationship
between lot area and sale price and reveals any outliers in that
relationship. For several lot-area bins, outliers appear as individual
dots, which can lie on either side of the box. Outliers in sale price
(USD) are present for lot areas between 25,000 and 50,000 sq. ft.
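The dots in a box plot follow the 1.5 × IQR convention, which is also the default for geom_boxplot(). The sketch below applies that rule with base R's boxplot.stats() to hypothetical prices; the numbers are invented for illustration, and the quartile computation may differ slightly from the one ggplot2 uses.

```r
# Hypothetical sale prices (USD) for a single lot-area bin.
prices <- c(120000, 135000, 150000, 160000, 175000, 450000)

stats <- boxplot.stats(prices)  # default coef = 1.5, i.e. the 1.5 * IQR rule
stats$stats                     # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out                       # values beyond the whiskers, drawn as dots
```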
For the descriptive statistics and exploratory data analysis, I
considered five common factors related to the sale of an individual
property: lot area, basement area, year built, year remodeled, and year
sold. These are factors a buyer would typically consider when
purchasing a property. I used several data visualization tools: box
plot, bar chart, and histogram. The EDA overview (Plot 2.2) gives a
compact view of these variables in a single chart. It shows that
property renovation was stable between 1960 and 1980, considerably
lower between 1980 and 1990, and increasing thereafter.
Plot 2.4 (House Year Built) shows that property building increased
gradually until 1920 and slowed between 1930 and 1950. Between 1950 and
1980, however, property building rose to almost double that of the
previous 30 years, and after 2000 it reached almost double that of the
preceding 10 years.
#Missing Value######
missing_dataset <- data.frame(AmesHousing_1) %>%
mutate(across(where(is.numeric),
~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))) #replace NAs in numeric columns with the column mean
##Table Representation######
describe(missing_dataset) %>%
kable(caption = "<center> Table 2: Descriptive statistics of the dataset after replacing NAs with the mean</center>",
align = "c",
digits = 2) %>%
kable_styling(bootstrap_options = c("hover",
"bordered",
"condensed",
"responsive",
"striped"), #the Bootstrap option is "striped", not "stripped"
font_size = 11) %>%
scroll_box(width = "100%", height = "100%") %>%
footnote(general = "Logical or categorical variables are converted to numeric and marked with *")
|  | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Order | 1 | 2930 | 1465.50 | 845.96 | 1465.50 | 1465.50 | 1086.00 | 1 | 2930 | 2929 | 0.00 | -1.20 | 15.63 |
| PID* | 2 | 2930 | 1465.50 | 845.96 | 1465.50 | 1465.50 | 1086.00 | 1 | 2930 | 2929 | 0.00 | -1.20 | 15.63 |
| MS.SubClass* | 3 | 2930 | 5.29 | 4.36 | 5.00 | 4.75 | 5.93 | 1 | 16 | 15 | 0.73 | -0.51 | 0.08 |
| MS.Zoning* | 4 | 2930 | 5.97 | 0.87 | 6.00 | 6.07 | 0.00 | 1 | 7 | 6 | -2.61 | 8.41 | 0.02 |
| Lot.Frontage | 5 | 2930 | 69.22 | 21.32 | 69.22 | 68.52 | 13.68 | 21 | 313 | 292 | 1.64 | 14.05 | 0.39 |
| Lot.Area | 6 | 2930 | 10147.92 | 7880.02 | 9436.50 | 9481.05 | 3024.50 | 1300 | 215245 | 213945 | 12.81 | 264.39 | 145.58 |
| Street* | 7 | 2930 | 2.00 | 0.06 | 2.00 | 2.00 | 0.00 | 1 | 2 | 1 | -15.52 | 239.01 | 0.00 |
| Alley* | 8 | 198 | 1.39 | 0.49 | 1.00 | 1.37 | 0.00 | 1 | 2 | 1 | 0.43 | -1.82 | 0.03 |
| Lot.Shape* | 9 | 2930 | 2.94 | 1.41 | 4.00 | 3.05 | 0.00 | 1 | 4 | 3 | -0.61 | -1.60 | 0.03 |
| Land.Contour* | 10 | 2930 | 3.78 | 0.70 | 4.00 | 4.00 | 0.00 | 1 | 4 | 3 | -3.12 | 8.44 | 0.01 |
| Utilities* | 11 | 2930 | 1.00 | 0.06 | 1.00 | 1.00 | 0.00 | 1 | 3 | 2 | 34.02 | 1187.96 | 0.00 |
| Lot.Config* | 12 | 2930 | 4.06 | 1.60 | 5.00 | 4.32 | 0.00 | 1 | 5 | 4 | -1.19 | -0.44 | 0.03 |
| Land.Slope* | 13 | 2930 | 1.05 | 0.25 | 1.00 | 1.00 | 0.00 | 1 | 3 | 2 | 4.98 | 26.62 | 0.00 |
| Neighborhood* | 14 | 2930 | 15.30 | 7.02 | 16.00 | 15.40 | 8.90 | 1 | 28 | 27 | -0.20 | -1.19 | 0.13 |
| Condition.1* | 15 | 2930 | 3.04 | 0.87 | 3.00 | 3.00 | 0.00 | 1 | 9 | 8 | 2.99 | 15.74 | 0.02 |
| Condition.2* | 16 | 2930 | 3.00 | 0.21 | 3.00 | 3.00 | 0.00 | 1 | 8 | 7 | 12.08 | 308.97 | 0.00 |
| Bldg.Type* | 17 | 2930 | 1.52 | 1.22 | 1.00 | 1.17 | 0.00 | 1 | 5 | 4 | 2.15 | 3.00 | 0.02 |
| House.Style* | 18 | 2930 | 4.02 | 1.91 | 3.00 | 4.01 | 0.00 | 1 | 8 | 7 | 0.32 | -0.95 | 0.04 |
| Overall.Qual | 19 | 2930 | 6.09 | 1.41 | 6.00 | 6.08 | 1.48 | 1 | 10 | 9 | 0.19 | 0.05 | 0.03 |
| Overall.Cond | 20 | 2930 | 5.56 | 1.11 | 5.00 | 5.47 | 0.00 | 1 | 9 | 8 | 0.57 | 1.48 | 0.02 |
| Year.Built | 21 | 2930 | 1971.36 | 30.25 | 1973.00 | 1974.25 | 37.06 | 1872 | 2010 | 138 | -0.60 | -0.50 | 0.56 |
| Year.Remod.Add | 22 | 2930 | 1984.27 | 20.86 | 1993.00 | 1985.63 | 20.76 | 1950 | 2010 | 60 | -0.45 | -1.34 | 0.39 |
| Roof.Style* | 23 | 2930 | 2.39 | 0.82 | 2.00 | 2.24 | 0.00 | 1 | 6 | 5 | 1.56 | 0.89 | 0.02 |
| Roof.Matl* | 24 | 2930 | 2.06 | 0.54 | 2.00 | 2.00 | 0.00 | 1 | 8 | 7 | 8.72 | 76.98 | 0.01 |
| Exterior.1st* | 25 | 2930 | 11.16 | 3.65 | 14.00 | 11.47 | 1.48 | 1 | 16 | 15 | -0.59 | -0.76 | 0.07 |
| Exterior.2nd* | 26 | 2930 | 11.87 | 4.00 | 15.00 | 12.19 | 2.97 | 1 | 17 | 16 | -0.56 | -0.90 | 0.07 |
| Mas.Vnr.Type* | 27 | 2907 | 3.45 | 1.04 | 4.00 | 3.47 | 0.00 | 1 | 5 | 4 | -0.58 | -1.10 | 0.02 |
| Mas.Vnr.Area | 28 | 2930 | 101.90 | 178.41 | 0.00 | 61.28 | 0.00 | 0 | 1600 | 1600 | 2.61 | 9.36 | 3.30 |
| Exter.Qual* | 29 | 2930 | 3.53 | 0.70 | 4.00 | 3.64 | 0.00 | 1 | 4 | 3 | -1.79 | 3.67 | 0.01 |
| Exter.Cond* | 30 | 2930 | 4.71 | 0.77 | 5.00 | 4.93 | 0.00 | 1 | 5 | 4 | -2.50 | 5.11 | 0.01 |
| Foundation* | 31 | 2930 | 2.39 | 0.73 | 2.00 | 2.45 | 1.48 | 1 | 6 | 5 | 0.01 | 0.76 | 0.01 |
| Bsmt.Qual* | 32 | 2850 | 3.69 | 1.31 | 3.00 | 3.85 | 2.97 | 1 | 5 | 4 | -0.46 | -0.83 | 0.02 |
| Bsmt.Cond* | 33 | 2850 | 4.80 | 0.69 | 5.00 | 5.00 | 0.00 | 1 | 5 | 4 | -3.33 | 9.73 | 0.01 |
| Bsmt.Exposure* | 34 | 2847 | 3.28 | 1.13 | 4.00 | 3.47 | 0.00 | 1 | 4 | 3 | -1.16 | -0.32 | 0.02 |
| BsmtFin.Type.1* | 35 | 2850 | 3.76 | 1.81 | 3.00 | 3.82 | 2.97 | 1 | 6 | 5 | -0.04 | -1.36 | 0.03 |
| BsmtFin.SF.1 | 36 | 2930 | 442.63 | 455.51 | 370.50 | 383.98 | 549.30 | 0 | 5644 | 5644 | 1.41 | 6.84 | 8.42 |
| BsmtFin.Type.2* | 37 | 2849 | 5.68 | 1.01 | 6.00 | 5.97 | 0.00 | 1 | 6 | 5 | -3.38 | 10.80 | 0.02 |
| BsmtFin.SF.2 | 38 | 2930 | 49.72 | 169.14 | 0.00 | 2.00 | 0.00 | 0 | 1526 | 1526 | 4.14 | 18.74 | 3.12 |
| Bsmt.Unf.SF | 39 | 2930 | 559.26 | 439.42 | 466.00 | 510.69 | 415.13 | 0 | 2336 | 2336 | 0.92 | 0.41 | 8.12 |
| Total.Bsmt.SF | 40 | 2930 | 1051.61 | 440.54 | 990.00 | 1035.00 | 349.89 | 0 | 6110 | 6110 | 1.16 | 9.11 | 8.14 |
| Heating* | 41 | 2930 | 2.03 | 0.25 | 2.00 | 2.00 | 0.00 | 1 | 6 | 5 | 12.10 | 168.45 | 0.00 |
| Heating.QC* | 42 | 2930 | 2.54 | 1.74 | 1.00 | 2.42 | 0.00 | 1 | 5 | 4 | 0.48 | -1.52 | 0.03 |
| Central.Air* | 43 | 2930 | 1.93 | 0.25 | 2.00 | 2.00 | 0.00 | 1 | 2 | 1 | -3.47 | 10.01 | 0.00 |
| Electrical* | 44 | 2929 | 4.69 | 1.05 | 5.00 | 5.00 | 0.00 | 1 | 5 | 4 | -3.09 | 7.67 | 0.02 |
| X1st.Flr.SF | 45 | 2930 | 1159.56 | 391.89 | 1084.00 | 1127.17 | 349.89 | 334 | 5095 | 4761 | 1.47 | 6.95 | 7.24 |
| X2nd.Flr.SF | 46 | 2930 | 335.46 | 428.40 | 0.00 | 272.90 | 0.00 | 0 | 2065 | 2065 | 0.87 | -0.42 | 7.91 |
| Low.Qual.Fin.SF | 47 | 2930 | 4.68 | 46.31 | 0.00 | 0.00 | 0.00 | 0 | 1064 | 1064 | 12.11 | 175.18 | 0.86 |
| Gr.Liv.Area | 48 | 2930 | 1499.69 | 505.51 | 1442.00 | 1452.25 | 461.09 | 334 | 5642 | 5308 | 1.27 | 4.12 | 9.34 |
| Bsmt.Full.Bath | 49 | 2930 | 0.43 | 0.52 | 0.00 | 0.40 | 0.00 | 0 | 3 | 3 | 0.62 | -0.75 | 0.01 |
| Bsmt.Half.Bath | 50 | 2930 | 0.06 | 0.25 | 0.00 | 0.00 | 0.00 | 0 | 2 | 2 | 3.94 | 14.89 | 0.00 |
| Full.Bath | 51 | 2930 | 1.57 | 0.55 | 2.00 | 1.56 | 0.00 | 0 | 4 | 4 | 0.17 | -0.54 | 0.01 |
| Half.Bath | 52 | 2930 | 0.38 | 0.50 | 0.00 | 0.34 | 0.00 | 0 | 2 | 2 | 0.70 | -1.03 | 0.01 |
| Bedroom.AbvGr | 53 | 2930 | 2.85 | 0.83 | 3.00 | 2.83 | 0.00 | 0 | 8 | 8 | 0.31 | 1.88 | 0.02 |
| Kitchen.AbvGr | 54 | 2930 | 1.04 | 0.21 | 1.00 | 1.00 | 0.00 | 0 | 3 | 3 | 4.31 | 19.82 | 0.00 |
| Kitchen.Qual* | 55 | 2930 | 3.86 | 1.27 | 5.00 | 4.03 | 0.00 | 1 | 5 | 4 | -0.62 | -0.68 | 0.02 |
| TotRms.AbvGrd | 56 | 2930 | 6.44 | 1.57 | 6.00 | 6.33 | 1.48 | 2 | 15 | 13 | 0.75 | 1.15 | 0.03 |
| Functional* | 57 | 2930 | 7.69 | 1.18 | 8.00 | 8.00 | 0.00 | 1 | 8 | 7 | -3.83 | 13.79 | 0.02 |
| Fireplaces | 58 | 2930 | 0.60 | 0.65 | 1.00 | 0.52 | 1.48 | 0 | 4 | 4 | 0.74 | 0.10 | 0.01 |
| Fireplace.Qu* | 59 | 1508 | 3.72 | 1.13 | 3.00 | 3.78 | 1.48 | 1 | 5 | 4 | -0.12 | -1.01 | 0.03 |
| Garage.Type* | 60 | 2773 | 3.28 | 1.79 | 2.00 | 3.11 | 0.00 | 1 | 6 | 5 | 0.75 | -1.31 | 0.03 |
| Garage.Yr.Blt | 61 | 2930 | 1978.13 | 24.83 | 1978.13 | 1980.62 | 29.85 | 1895 | 2207 | 312 | -0.40 | 2.09 | 0.46 |
| Garage.Finish* | 62 | 2771 | 2.18 | 0.82 | 2.00 | 2.23 | 1.48 | 1 | 3 | 2 | -0.35 | -1.43 | 0.02 |
| Garage.Cars | 63 | 2930 | 1.77 | 0.76 | 2.00 | 1.77 | 0.00 | 0 | 5 | 5 | -0.22 | 0.24 | 0.01 |
| Garage.Area | 64 | 2930 | 472.82 | 215.01 | 480.00 | 468.32 | 182.36 | 0 | 1488 | 1488 | 0.24 | 0.95 | 3.97 |
| Garage.Qual* | 65 | 2771 | 4.84 | 0.66 | 5.00 | 5.00 | 0.00 | 1 | 5 | 4 | -4.02 | 14.47 | 0.01 |
| Garage.Cond* | 66 | 2771 | 4.90 | 0.52 | 5.00 | 5.00 | 0.00 | 1 | 5 | 4 | -5.25 | 26.38 | 0.01 |
| Paved.Drive* | 67 | 2930 | 2.83 | 0.54 | 3.00 | 3.00 | 0.00 | 1 | 3 | 2 | -2.98 | 7.15 | 0.01 |
| Wood.Deck.SF | 68 | 2930 | 93.75 | 126.36 | 0.00 | 71.21 | 0.00 | 0 | 1424 | 1424 | 1.84 | 6.73 | 2.33 |
| Open.Porch.SF | 69 | 2930 | 47.53 | 67.48 | 27.00 | 33.87 | 40.03 | 0 | 742 | 742 | 2.53 | 10.92 | 1.25 |
| Enclosed.Porch | 70 | 2930 | 23.01 | 64.14 | 0.00 | 4.83 | 0.00 | 0 | 1012 | 1012 | 4.01 | 28.42 | 1.18 |
| X3Ssn.Porch | 71 | 2930 | 2.59 | 25.14 | 0.00 | 0.00 | 0.00 | 0 | 508 | 508 | 11.39 | 149.63 | 0.46 |
| Screen.Porch | 72 | 2930 | 16.00 | 56.09 | 0.00 | 0.00 | 0.00 | 0 | 576 | 576 | 3.95 | 17.81 | 1.04 |
| Pool.Area | 73 | 2930 | 2.24 | 35.60 | 0.00 | 0.00 | 0.00 | 0 | 800 | 800 | 16.92 | 299.06 | 0.66 |
| Pool.QC* | 74 | 13 | 2.46 | 1.20 | 3.00 | 2.45 | 1.48 | 1 | 4 | 3 | -0.05 | -1.68 | 0.33 |
| Fence* | 75 | 572 | 2.41 | 0.84 | 3.00 | 2.49 | 0.00 | 1 | 4 | 3 | -0.68 | -0.89 | 0.03 |
| Misc.Feature* | 76 | 106 | 3.85 | 0.55 | 4.00 | 4.00 | 0.00 | 1 | 5 | 4 | -3.16 | 10.37 | 0.05 |
| Misc.Val | 77 | 2930 | 50.64 | 566.34 | 0.00 | 0.00 | 0.00 | 0 | 17000 | 17000 | 21.98 | 564.85 | 10.46 |
| Mo.Sold | 78 | 2930 | 6.22 | 2.71 | 6.00 | 6.16 | 2.97 | 1 | 12 | 11 | 0.19 | -0.46 | 0.05 |
| Yr.Sold | 79 | 2930 | 2007.79 | 1.32 | 2008.00 | 2007.74 | 1.48 | 2006 | 2010 | 4 | 0.13 | -1.16 | 0.02 |
| Sale.Type* | 80 | 2930 | 9.36 | 1.88 | 10.00 | 9.87 | 0.00 | 1 | 10 | 9 | -3.32 | 10.76 | 0.03 |
| Sale.Condition* | 81 | 2930 | 4.78 | 1.08 | 5.00 | 5.00 | 0.00 | 1 | 6 | 5 | -2.79 | 7.25 | 0.02 |
| SalePrice | 82 | 2930 | 180796.06 | 79886.69 | 160000.00 | 170429.15 | 54856.20 | 12789 | 755000 | 742211 | 1.74 | 5.10 | 1475.84 |
The dataset has missing values in some numeric variables, and these
could distort the correlations. To avoid this, I replaced the NAs with
the mean of the respective variable.
describe() is then used again to report the descriptive statistics of
the dataset after the NAs were replaced with the mean.
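A small sketch shows what mean replacement does to the summary statistics. The vector is hypothetical, but the pattern matches the tables above, where the mean of Lot Frontage stays at 69.22 while its standard deviation drops from 23.37 to 21.32.

```r
x <- c(2, 4, NA, 6, NA, 8)  # hypothetical values with two missing entries

x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

mean(x, na.rm = TRUE)  # 5
mean(x_imputed)        # 5, unchanged by construction
sd(x, na.rm = TRUE)    # about 2.58
sd(x_imputed)          # 2, smaller because the filled values sit exactly at the mean
```

Because the filled values add no spread, mean replacement understates variability; this is worth keeping in mind when reading the correlations that follow.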
#Correlation Matrix####
only_numeric <- missing_dataset[sapply(missing_dataset, is.numeric)] #numeric columns kept in their own data frame
correla_dataset = cor(only_numeric, use = "pairwise.complete.obs")
corrplot(correla_dataset,
type = 'lower',
order = 'hclust',
tl.col = 'black',
cl.ratio = 0.2,
tl.srt = 45,
col = COL2('PuOr', 10),
tl.cex = 0.50,
title = "Chart 4.1: Correlation Matrix",
mar = c(0,0,1,0))
corr_var(only_numeric, SalePrice, top = 40)
Having explored the data and replaced the NAs with means, we can now examine the correlations among the numeric variables. The dataset contains continuous, nominal, and discrete variables, so I selected only the numeric columns to compute the correlations.
Chart 4.1 is a pairwise correlation matrix of the numeric variables
only. The scale at the bottom of the chart spans the correlation
coefficient range from -1 to 1. SalePrice is strongly correlated with
Overall Qual, Garage Area, Total Bsmt SF, and Gr Liv Area. However, the
matrix does not show the numeric value of each coefficient.
To read the coefficients themselves, I used corr_var(), which shows
which variables are strongly and which are weakly correlated with
SalePrice. Overall Qual, Gr Liv Area, and Total Bsmt SF are among the
strongest correlations with SalePrice. Bedroom AbvGr, Pool Area, and
BsmtFin SF 2 are weakly correlated with SalePrice.
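The same ranking can be approximated in base R without corr_var(). The sketch below is not the lares implementation: it takes one column of the correlation matrix and sorts it by absolute value. mtcars and its mpg column are stand-ins for the numeric Ames columns and SalePrice, chosen only so the example is self-contained.

```r
# Illustrative only: mtcars stands in for the numeric Ames columns, mpg for SalePrice.
cors <- cor(mtcars, use = "pairwise.complete.obs")[, "mpg"]
cors <- cors[names(cors) != "mpg"]                    # drop the target's correlation with itself

ranked <- cors[order(abs(cors), decreasing = TRUE)]   # strongest first, sign kept
print(round(head(ranked, 5), 3))
```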
A scatter plot is used to understand the nature of the correlation between two variables, with the independent variable on the x-axis and the dependent variable on the y-axis (Bluman, 2014).
#More than 0.5 correlation scatter plot#####
names(AmesHousing_1) %<>% stringr::str_replace_all("\\s","_")
T6_SC1 <- ggplot(data = AmesHousing_1, aes(x = Overall_Qual, y = SalePrice)) +
geom_point(alpha = 0.5,
pch = 24) +
geom_smooth(method = "lm", se = FALSE,
color = "#99004C",
lty=1,
lwd=1)+
labs(title = "Overall House Quality to Sales Price Correlation",
x = "Overall Quality",
y = "Sale Price(USD)")+
expand_limits(x = c(0, NA), y = c(0, NA)) +
scale_y_continuous(labels = unit_format(unit = "K", scale = 1e-3))+
theme(plot.title = element_text(hjust = 0.5))+
annotate("text", x = 2.5, y = 600000, #annotate() draws the label once, not once per data row
label = paste("Correlation:", round(correla_dataset["Overall.Qual", "SalePrice"], 3)),
color = "forestgreen", size = 3)
print(T6_SC1)
The scatter plot above shows the relationship between SalePrice and overall quality. The correlation coefficient is 0.799, which indicates a strong positive linear relationship. The points cluster around a fitted line that rises from left to right, and the positive slope of that line reflects the positive correlation. We can conclude that the higher the overall quality, the higher the sale price tends to be.
#Close to 0.5 correlation scatter plot#####
T6_SC2 <- ggplot(data = AmesHousing_1, aes(x = TotRms_AbvGrd, y = SalePrice)) +
geom_point(alpha = 0.5,
pch = 24) +
geom_smooth(method = "lm", se = FALSE,
color = "#99004C",
lty=1,
lwd=1)+
labs(title = "Total Rooms Above Ground to Sales Price Correlation",
x = "Total Rooms Above Ground",
y = "Sale Price(USD)")+
expand_limits(x = c(0, NA), y = c(0, NA)) +
scale_y_continuous(labels = unit_format(unit = "K", scale = 1e-3))+
theme(plot.title = element_text(hjust = 0.5))+
annotate("text", x = 2.5, y = 600000, #annotate() draws the label once, not once per data row
label = paste("Correlation:", round(correla_dataset["TotRms.AbvGrd", "SalePrice"], 4)),
color = "forestgreen", size = 3)
print(T6_SC2)
The scatter plot above shows the relationship between SalePrice and Total Rooms Above Ground. The correlation coefficient is 0.4955, which indicates a moderate positive relationship. The points scatter around a fitted line that rises from left to right, and the positive slope of that line reflects the positive correlation.
#Less than 0.5 correlation scatter plot#####
T6_SC3 <- ggplot(data = AmesHousing_1, aes(x = BsmtFin_SF_2, y = SalePrice)) +
geom_point(alpha = 0.5,
pch = 24) +
geom_smooth(method = "lm", se = FALSE,
color = "#99004C",
lty=1,
lwd=1)+
labs(title = "Type 2 Basement Finish to Sales Price Correlation",
x = "Type 2 finished basement sqr feet",
y = "Sale Price(USD)")+
expand_limits(x = c(0, NA), y = c(0, NA)) +
scale_y_continuous(labels = unit_format(unit = "K", scale = 1e-3))+
theme(plot.title = element_text(hjust = 0.5))+
annotate("text", x = 1250, y = 750000, #annotate() draws the label once, not once per data row
label = paste("Correlation:", round(correla_dataset["BsmtFin.SF.2", "SalePrice"], 4)),
color = "forestgreen", size = 3)
print(T6_SC3)
The scatter plot above shows the relationship between SalePrice and Type 2 Finished Basement area. The correlation coefficient is 0.0059, which is close to 0 and means there is essentially no linear relationship. The fitted line is nearly flat from left to right, neither increasing nor decreasing. The variables might still be related in some other, nonlinear way (Bluman, 2014).
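A correlation near zero rules out only a linear relationship. The toy example below, with made-up values, shows a variable that is fully determined by another yet has a correlation coefficient of zero.

```r
x <- -5:5
y <- x^2    # y is completely determined by x, but the relationship is not linear

cor(x, y)   # 0: no linear association despite the exact relationship
```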
#Correlation of SalePrice to chosen three continuous variables#####
par(mfrow = c(2,2))
T7_RegMdl = lm(SalePrice ~ Lot.Area + Mas.Vnr.Area + Garage.Area,
data = missing_dataset) #pass the data directly instead of attach()/detach()
t7 = summary(T7_RegMdl)
tab_model(t7)
Dependent variable: SalePrice

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 66352.12 | 61114.99 – 71589.25 | <0.001 |
| Lot.Area | 1.23 | 0.97 – 1.49 | <0.001 |
| Mas.Vnr.Area | 136.00 | 123.71 – 148.30 | <0.001 |
| Garage.Area | 186.33 | 175.97 – 196.68 | <0.001 |
| Observations | 2930 | | |
| R² / R² adjusted | 0.507 / 0.507 | | |
plot(T7_RegMdl)
For the regression model, I chose Lot Area, Mas Vnr Area, and Garage
Area as predictors of SalePrice.
The intercept of the regression is 66352.12. The slope for Lot Area
(\(b_1\)) is 1.23, the slope for Mas Vnr Area (\(b_2\)) is 136.00, and the slope for Garage Area (\(b_3\)) is 186.33.
The coefficient of determination \(R^2\) is 0.5071, meaning that
50.71% of the variation in the dependent variable SalePrice can be
explained by the independent variables. The regression coefficient of
Lot Area is 1.23, meaning that a one-square-foot increase in Lot Area
is associated with an increase of about 1.23 USD in SalePrice, holding
Garage Area and Mas Vnr Area constant (Kabacoff, 2015).
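To make the interpretation concrete, the fitted equation can be evaluated by hand. The coefficients below are copied from the regression table as reported (rounded), while the house is hypothetical and not taken from the dataset.

```r
# Estimates copied from the regression table (rounded as reported).
coefs <- c(intercept = 66352.12, Lot.Area = 1.23,
           Mas.Vnr.Area = 136.00, Garage.Area = 186.33)

# Hypothetical house; these inputs are assumptions for illustration.
house <- c(Lot.Area = 10000, Mas.Vnr.Area = 100, Garage.Area = 500)

# Predicted price = intercept + b1 * Lot.Area + b2 * Mas.Vnr.Area + b3 * Garage.Area
predicted_price <- coefs[["intercept"]] + sum(coefs[names(house)] * house)
predicted_price  # 185417.12
```

Raising Lot.Area by one square foot while leaving the other inputs fixed moves the prediction by exactly \(b_1 = 1.23\) USD.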
Furthermore, the four diagnostic graphs represent linearity
(upper left), normality (upper right), homoscedasticity (bottom left),
and residuals vs. leverage (bottom right).
The Residuals vs Leverage graph (bottom right) helps to identify
outliers in the model. The value of the dependent variable is not used
when calculating an observation's leverage (Kabacoff, 2015).
Normality (upper right) is a probability plot of the standardized
residuals against the values that would be expected under normality
(Kabacoff, 2015). Because the points do not all fall on the reference
line, the residuals depart from normality, which points to outliers in
the model.
Linearity (upper left): if the dependent variable is linearly related
to the independent variables, the model should account for any
systematic variation in the data, so the residuals should show no
pattern against the fitted values.
Homoscedasticity (bottom left): this graph checks for constant
variance. Its violation, heteroscedasticity, is the uneven dispersion
of residuals at different levels of the response variable.
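The visual checks above can be complemented with simple numeric ones. The sketch below is a rough approach, not the method used in this report: it runs a Shapiro-Wilk test for normality and uses the correlation between fitted values and absolute residuals as a crude indicator of non-constant variance. A small mtcars model stands in for T7_RegMdl so the block is self-contained.

```r
# Illustrative only: a small model on mtcars stands in for T7_RegMdl.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Normality: Shapiro-Wilk test on the standardized residuals.
normality <- shapiro.test(rstandard(fit))
normality$p.value   # small values suggest non-normal residuals

# Constant variance: raw residuals are uncorrelated with fitted values by
# construction, so the absolute residuals are used instead. A value far from
# zero hints that the spread changes with the fitted values.
spread_check <- cor(fitted(fit), abs(residuals(fit)))
spread_check
```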
#Multicollinearity#####
vifval = vif(T7_RegMdl) #understanding Variance Inflation Factor
print(vifval)
## Lot.Area Mas.Vnr.Area Garage.Area
## 1.050306 1.164050 1.199737
sqrtvifval = sqrt(vifval) > 2 #Problem
print(sqrtvifval)
## Lot.Area Mas.Vnr.Area Garage.Area
## FALSE FALSE FALSE
When independent variables in a regression model are correlated, multicollinearity emerges (Kabacoff, 2015). The Variance Inflation Factor (VIF) is used to assess multicollinearity.
The VIF values of our model range from 1.05 to 1.20, all well below 2, which means that some multicollinearity is present but at an acceptable level. If a VIF value is more than 5, it would be wise to reconsider the model, as multicollinearity is strongly present; the check above flags \(\sqrt{VIF} > 2\), and no predictor is flagged. As our model’s VIF values are close to 1, no further action is taken to correct multicollinearity.
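For intuition about what the VIF measures, it can be computed by hand: each predictor is regressed on the remaining predictors, and \(VIF_j = 1/(1 - R_j^2)\). The sketch below does this in base R on the built-in `mtcars` data, which is an assumption used as a stand-in for the Ames variables; `vif_by_hand` is a hypothetical name.

```r
# Stand-in predictors from built-in data; the report uses
# Lot.Area, Mas.Vnr.Area, and Garage.Area
predictors <- c("wt", "hp", "disp")

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on the remaining predictors
vif_by_hand <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

vif_by_hand
# Same rule of thumb as in the report's code: flag sqrt(VIF) > 2
sqrt(vif_by_hand) > 2
```

A VIF of 1 means the predictor is uncorrelated with the others, which is why values near 1, as in the report, indicate little multicollinearity.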
#Identifying specific and overall outliers in the dataset######
outlierTest(T7_RegMdl) # identify outliers from the dataset
## rstudent unadjusted p-value Bonferroni p
## 1761 9.188404 7.3519e-20 2.1541e-16
## 1499 -6.482266 1.0567e-10 3.0960e-07
## 1768 6.283111 3.8122e-10 1.1170e-06
## 433 5.750663 9.8079e-09 2.8737e-05
## 424 5.685941 1.4293e-08 4.1879e-05
## 2181 -5.643089 1.8301e-08 5.3622e-05
## 2333 5.532870 3.4287e-08 1.0046e-04
## 1064 5.264756 1.5054e-07 4.4108e-04
## 45 4.901155 1.0047e-06 2.9437e-03
## 434 4.508005 6.8011e-06 1.9927e-02
hat.plot <- function(fit) #identify high-leverage observations in the model
{
p <- length(coefficients(fit)) #number of estimated coefficients
n <- length(fitted(fit)) #number of observations
plot(hatvalues(fit), main = "Index Plot of Hat Values")
abline(h = c(2,3)*p/n, #reference lines at 2 and 3 times the average hat value
col = "red", lty = 2)
identify(1:n, hatvalues(fit), names(hatvalues(fit))) #interactive; returns integer(0) when knitted
}
hat.plot(T7_RegMdl) #outlier order: 957, 1571, 2116
## integer(0)
##Influential observation#####
k <- length(T7_RegMdl$coefficients) - 1 #number of predictors, excluding the intercept
cutoff <- 4/(nobs(T7_RegMdl) - k - 1) #Cook's D cutoff 4/(n - k - 1)
plot(T7_RegMdl, which = 4, cook.levels = cutoff)
abline(h = cutoff, lty = 2, col = "red")
So far we have run goodness-of-fit checks to quantify the fit of the model. In Task 9, plotting the model showed that outliers are present, so they need to be evaluated further.
The general outlierTest() provides a list of overall outliers in the model; however, we need specific outliers to decide whether to delete or keep them. The High Leverage Point function (hat.plot()) returns the precise observations, and the same can be confirmed with the Cook’s Distance plot.
Cook’s Distance:
The Cook’s Distance plot is a popular choice for identifying influential observations. Influential observations are those having a Cook’s D value greater than \(\frac{4}{n - k - 1}\), where \(n\) is the sample size and \(k\) is the number of predictor variables (Kabacoff, 2015).
We can conclude that observations 1571 and 2116 stand out in both the High Leverage Point plot and the Cook’s Distance plot. As these are only two points and the VIF scores are below 2, I feel they would not do much harm, and I decided not to change the model. Note that VIF measures multicollinearity rather than the influence of single observations, so this is a judgment call rather than a formal test.
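The two rules of thumb used above (hat values above \(2p/n\) and Cook’s D above \(4/(n - k - 1)\)) can also be applied without plots. The following is a minimal base R sketch on the built-in `mtcars` data, an assumed stand-in for the Ames model; the variable names `high_leverage` and `influential` are hypothetical.

```r
# Stand-in model on built-in data; the report's model is T7_RegMdl
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

n <- nobs(fit)              # sample size
k <- length(coef(fit)) - 1  # number of predictors (excluding intercept)
p <- k + 1                  # number of estimated coefficients

# High-leverage rule of thumb: hat value above 2 * p / n
lev <- hatvalues(fit)
high_leverage <- names(lev)[lev > 2 * p / n]

# Influence rule of thumb: Cook's D above 4 / (n - k - 1)
cd <- cooks.distance(fit)
influential <- names(cd)[cd > 4 / (n - k - 1)]

# Observations flagged by both checks
intersect(high_leverage, influential)
```

Taking the intersection mirrors the reasoning in the text: an observation is of most concern when both the leverage check and the Cook’s Distance check flag it.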
#Identifying good fit model within selected variables#####
stepAIC(T7_RegMdl, direction = "backward") #stepwise regression for good fit model
## Start: AIC=64083.87
## SalePrice ~ Lot.Area + Mas.Vnr.Area + Garage.Area
##
## Df Sum of Sq RSS AIC
## <none> 9.2132e+12 64084
## - Lot.Area 1 2.6218e+11 9.4754e+12 64164
## - Mas.Vnr.Area 1 1.4814e+12 1.0695e+13 64519
## - Garage.Area 1 3.9183e+12 1.3131e+13 65120
##
## Call:
## lm(formula = SalePrice ~ Lot.Area + Mas.Vnr.Area + Garage.Area)
##
## Coefficients:
## (Intercept) Lot.Area Mas.Vnr.Area Garage.Area
## 66352.12 1.23 136.00 186.33
Regmodsubset = regsubsets(SalePrice ~ Lot.Area + Mas.Vnr.Area + Garage.Area, data = missing_dataset, nbest = 3) #subset for finding good fit model
plot(Regmodsubset, scale = "adjr2")
summary(Regmodsubset)
## Subset selection object
## Call: regsubsets.formula(SalePrice ~ Lot.Area + Mas.Vnr.Area + Garage.Area,
## data = missing_dataset, nbest = 3)
## 3 Variables (and intercept)
## Forced in Forced out
## Lot.Area FALSE FALSE
## Mas.Vnr.Area FALSE FALSE
## Garage.Area FALSE FALSE
## 3 subsets of each size up to 3
## Selection Algorithm: exhaustive
## Lot.Area Mas.Vnr.Area Garage.Area
## 1 ( 1 ) " " " " "*"
## 1 ( 2 ) " " "*" " "
## 1 ( 3 ) "*" " " " "
## 2 ( 1 ) " " "*" "*"
## 2 ( 2 ) "*" " " "*"
## 2 ( 3 ) "*" "*" " "
## 3 ( 1 ) "*" "*" "*"
With stepAIC() I tried to find the best-fit model; however, backward elimination did not drop any variable, because removing any predictor would increase the AIC. Furthermore, regsubsets() was used to compare the best three subsets of each size, and it did not point to a better model either. Plotting the regsubsets() result gives the same picture, ranked by adjusted \(R^2\): the model with Lot Area, Mas Vnr Area, and Garage Area has the highest adjusted \(R^2\), about 51%.
After applying multiple methods, I have decided not to change the model.
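The idea behind the subset comparison can be sketched in base R: fit every non-empty subset of the candidate predictors and rank the fits by adjusted \(R^2\). This is an illustrative brute-force version on the built-in `mtcars` data (an assumption), not a reimplementation of regsubsets().

```r
# All non-empty subsets of three stand-in predictors
predictors <- c("wt", "hp", "disp")
subsets <- unlist(
  lapply(seq_along(predictors),
         function(m) combn(predictors, m, simplify = FALSE)),
  recursive = FALSE
)

# Fit each subset and record its adjusted R^2
adj_r2 <- sapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "mpg"), data = mtcars)
  summary(fit)$adj.r.squared
})
names(adj_r2) <- sapply(subsets, paste, collapse = " + ")

# Highest adjusted R^2 first
sort(adj_r2, decreasing = TRUE)
```

With three predictors there are only seven subsets, so an exhaustive comparison is cheap; with the 36 numeric variables used later, a dedicated routine is the practical choice.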
#Finding best fit regression model from only numeric data#####
#sigma(T7_RegMdl)/mean(missing_dataset$SalePrice)
Regmodsubset2 = regsubsets(SalePrice ~., nbest = 5, method = "exhaustive",
data = AmesHousing_1$only_numeric)
## Reordering variables and trying again:
summary(Regmodsubset2)
## Subset selection object
## Call: regsubsets.formula(SalePrice ~ ., nbest = 5, method = "exhaustive",
## data = AmesHousing_1$only_numeric)
## 36 Variables (and intercept)
## Forced in Forced out
## Order FALSE FALSE
## Lot.Frontage FALSE FALSE
## Lot.Area FALSE FALSE
## Overall.Qual FALSE FALSE
## Overall.Cond FALSE FALSE
## Year.Built FALSE FALSE
## Year.Remod.Add FALSE FALSE
## Mas.Vnr.Area FALSE FALSE
## BsmtFin.SF.1 FALSE FALSE
## BsmtFin.SF.2 FALSE FALSE
## Bsmt.Unf.SF FALSE FALSE
## X1st.Flr.SF FALSE FALSE
## X2nd.Flr.SF FALSE FALSE
## Low.Qual.Fin.SF FALSE FALSE
## Bsmt.Full.Bath FALSE FALSE
## Bsmt.Half.Bath FALSE FALSE
## Full.Bath FALSE FALSE
## Half.Bath FALSE FALSE
## Bedroom.AbvGr FALSE FALSE
## Kitchen.AbvGr FALSE FALSE
## TotRms.AbvGrd FALSE FALSE
## Fireplaces FALSE FALSE
## Garage.Yr.Blt FALSE FALSE
## Garage.Cars FALSE FALSE
## Garage.Area FALSE FALSE
## Wood.Deck.SF FALSE FALSE
## Open.Porch.SF FALSE FALSE
## Enclosed.Porch FALSE FALSE
## X3Ssn.Porch FALSE FALSE
## Screen.Porch FALSE FALSE
## Pool.Area FALSE FALSE
## Misc.Val FALSE FALSE
## Mo.Sold FALSE FALSE
## Yr.Sold FALSE FALSE
## Total.Bsmt.SF FALSE FALSE
## Gr.Liv.Area FALSE FALSE
## 5 subsets of each size up to 9
## Selection Algorithm: exhaustive
## Order Lot.Frontage Lot.Area Overall.Qual Overall.Cond Year.Built
## 1 ( 1 ) " " " " " " "*" " " " "
## 1 ( 2 ) " " " " " " " " " " " "
## 1 ( 3 ) " " " " " " " " " " " "
## 1 ( 4 ) " " " " " " " " " " " "
## 1 ( 5 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " "*" " " " "
## 2 ( 2 ) " " " " " " "*" " " " "
## 2 ( 3 ) " " " " " " "*" " " " "
## 2 ( 4 ) " " " " " " "*" " " " "
## 2 ( 5 ) " " " " " " "*" " " " "
## 3 ( 1 ) " " " " " " "*" " " " "
## 3 ( 2 ) " " " " " " "*" " " " "
## 3 ( 3 ) " " " " " " "*" " " " "
## 3 ( 4 ) " " " " " " "*" " " " "
## 3 ( 5 ) " " " " " " "*" " " " "
## 4 ( 1 ) " " " " " " "*" " " " "
## 4 ( 2 ) " " " " " " "*" " " " "
## 4 ( 3 ) " " " " " " "*" " " "*"
## 4 ( 4 ) " " " " " " "*" " " " "
## 4 ( 5 ) " " " " " " "*" " " " "
## 5 ( 1 ) " " " " " " "*" " " "*"
## 5 ( 2 ) " " " " " " "*" " " " "
## 5 ( 3 ) " " " " " " "*" " " " "
## 5 ( 4 ) " " " " " " "*" " " " "
## 5 ( 5 ) " " " " " " "*" " " " "
## 6 ( 1 ) " " " " " " "*" " " " "
## 6 ( 2 ) " " " " " " "*" " " " "
## 6 ( 3 ) " " " " " " "*" " " " "
## 6 ( 4 ) " " " " " " "*" " " " "
## 6 ( 5 ) " " " " " " "*" " " " "
## 7 ( 1 ) " " " " " " "*" " " " "
## 7 ( 2 ) " " " " " " "*" " " " "
## 7 ( 3 ) " " " " " " "*" " " " "
## 7 ( 4 ) " " " " " " "*" " " " "
## 7 ( 5 ) " " " " " " "*" " " " "
## 8 ( 1 ) " " " " " " "*" " " " "
## 8 ( 2 ) " " " " " " "*" " " " "
## 8 ( 3 ) " " " " "*" "*" " " " "
## 8 ( 4 ) " " " " " " "*" " " " "
## 8 ( 5 ) " " " " " " "*" " " " "
## 9 ( 1 ) " " " " "*" "*" " " " "
## 9 ( 2 ) " " " " " " "*" " " " "
## 9 ( 3 ) " " " " "*" "*" " " " "
## 9 ( 4 ) " " " " "*" "*" " " " "
## 9 ( 5 ) " " " " " " "*" " " " "
## Year.Remod.Add Mas.Vnr.Area BsmtFin.SF.1 BsmtFin.SF.2 Bsmt.Unf.SF
## 1 ( 1 ) " " " " " " " " " "
## 1 ( 2 ) " " " " " " " " " "
## 1 ( 3 ) " " " " " " " " " "
## 1 ( 4 ) " " " " " " " " " "
## 1 ( 5 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 2 ( 2 ) " " " " " " " " " "
## 2 ( 3 ) " " " " " " " " " "
## 2 ( 4 ) " " " " " " " " " "
## 2 ( 5 ) " " " " "*" " " " "
## 3 ( 1 ) " " " " "*" " " " "
## 3 ( 2 ) " " " " " " " " " "
## 3 ( 3 ) " " " " " " " " " "
## 3 ( 4 ) " " " " " " " " " "
## 3 ( 5 ) " " " " " " " " " "
## 4 ( 1 ) " " " " "*" " " " "
## 4 ( 2 ) " " " " "*" " " " "
## 4 ( 3 ) " " " " "*" " " " "
## 4 ( 4 ) " " " " "*" " " " "
## 4 ( 5 ) "*" " " "*" " " " "
## 5 ( 1 ) " " " " "*" " " " "
## 5 ( 2 ) "*" " " "*" " " " "
## 5 ( 3 ) " " " " " " " " "*"
## 5 ( 4 ) " " " " "*" " " " "
## 5 ( 5 ) "*" " " "*" " " " "
## 6 ( 1 ) "*" " " " " " " "*"
## 6 ( 2 ) "*" " " " " " " "*"
## 6 ( 3 ) "*" " " "*" " " " "
## 6 ( 4 ) "*" " " "*" " " " "
## 6 ( 5 ) "*" " " "*" " " " "
## 7 ( 1 ) "*" "*" " " " " "*"
## 7 ( 2 ) "*" "*" " " " " "*"
## 7 ( 3 ) "*" " " "*" " " " "
## 7 ( 4 ) "*" "*" "*" " " " "
## 7 ( 5 ) "*" "*" "*" " " " "
## 8 ( 1 ) "*" "*" " " " " "*"
## 8 ( 2 ) "*" "*" " " " " "*"
## 8 ( 3 ) "*" "*" " " " " "*"
## 8 ( 4 ) "*" "*" "*" " " " "
## 8 ( 5 ) "*" "*" " " " " "*"
## 9 ( 1 ) "*" "*" " " " " "*"
## 9 ( 2 ) "*" "*" "*" " " " "
## 9 ( 3 ) "*" "*" " " " " "*"
## 9 ( 4 ) "*" "*" "*" " " " "
## 9 ( 5 ) "*" "*" "*" " " " "
## Total.Bsmt.SF X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF Gr.Liv.Area
## 1 ( 1 ) " " " " " " " " " "
## 1 ( 2 ) " " " " " " " " "*"
## 1 ( 3 ) " " " " " " " " " "
## 1 ( 4 ) " " " " " " " " " "
## 1 ( 5 ) "*" " " " " " " " "
## 2 ( 1 ) " " " " " " " " "*"
## 2 ( 2 ) " " "*" " " " " " "
## 2 ( 3 ) "*" " " " " " " " "
## 2 ( 4 ) " " " " " " " " " "
## 2 ( 5 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " "*"
## 3 ( 2 ) "*" " " " " " " "*"
## 3 ( 3 ) " " "*" "*" " " " "
## 3 ( 4 ) " " "*" " " " " "*"
## 3 ( 5 ) " " " " " " " " "*"
## 4 ( 1 ) " " " " " " " " "*"
## 4 ( 2 ) " " " " " " " " "*"
## 4 ( 3 ) " " " " " " " " "*"
## 4 ( 4 ) " " " " " " " " "*"
## 4 ( 5 ) " " " " " " " " "*"
## 5 ( 1 ) " " " " " " " " "*"
## 5 ( 2 ) " " " " " " " " "*"
## 5 ( 3 ) "*" " " " " " " "*"
## 5 ( 4 ) "*" " " " " " " "*"
## 5 ( 5 ) " " " " " " " " "*"
## 6 ( 1 ) "*" " " " " " " "*"
## 6 ( 2 ) "*" " " " " " " "*"
## 6 ( 3 ) " " "*" "*" " " " "
## 6 ( 4 ) "*" " " " " " " "*"
## 6 ( 5 ) " " "*" "*" " " " "
## 7 ( 1 ) "*" " " " " " " "*"
## 7 ( 2 ) "*" " " " " " " "*"
## 7 ( 3 ) " " "*" "*" " " " "
## 7 ( 4 ) "*" " " " " " " "*"
## 7 ( 5 ) " " "*" "*" " " " "
## 8 ( 1 ) "*" " " " " " " "*"
## 8 ( 2 ) "*" " " " " " " "*"
## 8 ( 3 ) "*" " " " " " " "*"
## 8 ( 4 ) " " "*" "*" " " " "
## 8 ( 5 ) "*" " " " " " " "*"
## 9 ( 1 ) "*" " " " " " " "*"
## 9 ( 2 ) " " "*" "*" " " " "
## 9 ( 3 ) "*" " " " " " " "*"
## 9 ( 4 ) "*" " " " " " " "*"
## 9 ( 5 ) " " "*" " " " " "*"
## Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath Bedroom.AbvGr
## 1 ( 1 ) " " " " " " " " " "
## 1 ( 2 ) " " " " " " " " " "
## 1 ( 3 ) " " " " " " " " " "
## 1 ( 4 ) " " " " " " " " " "
## 1 ( 5 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 2 ( 2 ) " " " " " " " " " "
## 2 ( 3 ) " " " " " " " " " "
## 2 ( 4 ) " " " " " " " " " "
## 2 ( 5 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 3 ( 2 ) " " " " " " " " " "
## 3 ( 3 ) " " " " " " " " " "
## 3 ( 4 ) " " " " " " " " " "
## 3 ( 5 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " "
## 4 ( 2 ) " " " " " " " " " "
## 4 ( 3 ) " " " " " " " " " "
## 4 ( 4 ) " " " " " " " " " "
## 4 ( 5 ) " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " "
## 5 ( 2 ) " " " " " " " " " "
## 5 ( 3 ) " " " " " " " " " "
## 5 ( 4 ) " " " " " " " " " "
## 5 ( 5 ) " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " "
## 6 ( 2 ) " " " " " " " " " "
## 6 ( 3 ) " " " " " " " " " "
## 6 ( 4 ) " " " " " " " " " "
## 6 ( 5 ) " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " "
## 7 ( 2 ) " " " " " " " " " "
## 7 ( 3 ) " " " " " " " " " "
## 7 ( 4 ) " " " " " " " " " "
## 7 ( 5 ) " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " "
## 8 ( 2 ) " " " " " " " " " "
## 8 ( 3 ) " " " " " " " " " "
## 8 ( 4 ) " " " " " " " " " "
## 8 ( 5 ) " " " " " " " " " "
## 9 ( 1 ) " " " " " " " " " "
## 9 ( 2 ) " " " " " " " " " "
## 9 ( 3 ) " " " " " " " " " "
## 9 ( 4 ) " " " " " " " " " "
## 9 ( 5 ) " " " " " " " " " "
## Kitchen.AbvGr TotRms.AbvGrd Fireplaces Garage.Yr.Blt Garage.Cars
## 1 ( 1 ) " " " " " " " " " "
## 1 ( 2 ) " " " " " " " " " "
## 1 ( 3 ) " " " " " " " " "*"
## 1 ( 4 ) " " " " " " " " " "
## 1 ( 5 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 2 ( 2 ) " " " " " " " " " "
## 2 ( 3 ) " " " " " " " " " "
## 2 ( 4 ) " " " " " " " " " "
## 2 ( 5 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 3 ( 2 ) " " " " " " " " " "
## 3 ( 3 ) " " " " " " " " " "
## 3 ( 4 ) " " " " " " " " " "
## 3 ( 5 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " "
## 4 ( 2 ) " " " " " " " " "*"
## 4 ( 3 ) " " " " " " " " " "
## 4 ( 4 ) " " " " " " "*" " "
## 4 ( 5 ) " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " "
## 5 ( 2 ) " " " " " " " " " "
## 5 ( 3 ) " " " " " " " " "*"
## 5 ( 4 ) " " " " " " " " "*"
## 5 ( 5 ) " " " " " " " " "*"
## 6 ( 1 ) " " " " " " " " "*"
## 6 ( 2 ) " " " " " " " " " "
## 6 ( 3 ) " " " " " " " " "*"
## 6 ( 4 ) " " " " " " " " "*"
## 6 ( 5 ) " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " "*"
## 7 ( 2 ) " " " " " " " " " "
## 7 ( 3 ) "*" " " " " " " "*"
## 7 ( 4 ) " " " " " " " " "*"
## 7 ( 5 ) " " " " " " " " "*"
## 8 ( 1 ) " " " " " " " " "*"
## 8 ( 2 ) " " " " " " " " " "
## 8 ( 3 ) " " " " " " " " "*"
## 8 ( 4 ) "*" " " " " " " "*"
## 8 ( 5 ) "*" " " " " " " "*"
## 9 ( 1 ) " " " " " " " " "*"
## 9 ( 2 ) "*" " " " " " " "*"
## 9 ( 3 ) " " " " " " " " " "
## 9 ( 4 ) " " " " " " " " "*"
## 9 ( 5 ) "*" " " " " " " "*"
## Garage.Area Wood.Deck.SF Open.Porch.SF Enclosed.Porch X3Ssn.Porch
## 1 ( 1 ) " " " " " " " " " "
## 1 ( 2 ) " " " " " " " " " "
## 1 ( 3 ) " " " " " " " " " "
## 1 ( 4 ) "*" " " " " " " " "
## 1 ( 5 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 2 ( 2 ) " " " " " " " " " "
## 2 ( 3 ) " " " " " " " " " "
## 2 ( 4 ) "*" " " " " " " " "
## 2 ( 5 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 3 ( 2 ) " " " " " " " " " "
## 3 ( 3 ) " " " " " " " " " "
## 3 ( 4 ) " " " " " " " " " "
## 3 ( 5 ) "*" " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " "
## 4 ( 2 ) " " " " " " " " " "
## 4 ( 3 ) " " " " " " " " " "
## 4 ( 4 ) " " " " " " " " " "
## 4 ( 5 ) " " " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " "
## 5 ( 2 ) "*" " " " " " " " "
## 5 ( 3 ) " " " " " " " " " "
## 5 ( 4 ) " " " " " " " " " "
## 5 ( 5 ) " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " "
## 6 ( 2 ) "*" " " " " " " " "
## 6 ( 3 ) " " " " " " " " " "
## 6 ( 4 ) " " " " " " " " " "
## 6 ( 5 ) "*" " " " " " " " "
## 7 ( 1 ) " " " " " " " " " "
## 7 ( 2 ) "*" " " " " " " " "
## 7 ( 3 ) " " " " " " " " " "
## 7 ( 4 ) " " " " " " " " " "
## 7 ( 5 ) " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " "
## 8 ( 2 ) "*" " " " " " " " "
## 8 ( 3 ) " " " " " " " " " "
## 8 ( 4 ) " " " " " " " " " "
## 8 ( 5 ) " " " " " " " " " "
## 9 ( 1 ) " " " " " " " " " "
## 9 ( 2 ) " " " " " " " " " "
## 9 ( 3 ) "*" " " " " " " " "
## 9 ( 4 ) " " " " " " " " " "
## 9 ( 5 ) " " " " " " " " " "
## Screen.Porch Pool.Area Misc.Val Mo.Sold Yr.Sold
## 1 ( 1 ) " " " " " " " " " "
## 1 ( 2 ) " " " " " " " " " "
## 1 ( 3 ) " " " " " " " " " "
## 1 ( 4 ) " " " " " " " " " "
## 1 ( 5 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 2 ( 2 ) " " " " " " " " " "
## 2 ( 3 ) " " " " " " " " " "
## 2 ( 4 ) " " " " " " " " " "
## 2 ( 5 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 3 ( 2 ) " " " " " " " " " "
## 3 ( 3 ) " " " " " " " " " "
## 3 ( 4 ) " " " " " " " " " "
## 3 ( 5 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " "
## 4 ( 2 ) " " " " " " " " " "
## 4 ( 3 ) " " " " " " " " " "
## 4 ( 4 ) " " " " " " " " " "
## 4 ( 5 ) " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " "
## 5 ( 2 ) " " " " " " " " " "
## 5 ( 3 ) " " " " " " " " " "
## 5 ( 4 ) " " " " " " " " " "
## 5 ( 5 ) " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " "
## 6 ( 2 ) " " " " " " " " " "
## 6 ( 3 ) " " " " " " " " " "
## 6 ( 4 ) " " " " " " " " " "
## 6 ( 5 ) " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " "
## 7 ( 2 ) " " " " " " " " " "
## 7 ( 3 ) " " " " " " " " " "
## 7 ( 4 ) " " " " " " " " " "
## 7 ( 5 ) " " " " " " " " " "
## 8 ( 1 ) " " " " "*" " " " "
## 8 ( 2 ) " " " " "*" " " " "
## 8 ( 3 ) " " " " " " " " " "
## 8 ( 4 ) " " " " " " " " " "
## 8 ( 5 ) " " " " " " " " " "
## 9 ( 1 ) " " " " "*" " " " "
## 9 ( 2 ) " " " " "*" " " " "
## 9 ( 3 ) " " " " "*" " " " "
## 9 ( 4 ) " " " " "*" " " " "
## 9 ( 5 ) " " " " "*" " " " "
Bestfitreg = lm(data = AmesHousing_1$only_numeric,
SalePrice ~ Mas.Vnr.Area+ Gr.Liv.Area + BsmtFin.SF.1)
t13 = summary(Bestfitreg)
tab_model(t13)
Dependent variable: SalePrice
| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 16123.04 | 10289.45 – 21956.63 | <0.001 |
| Mas Vnr Area | 89.03 | 77.70 – 100.36 | <0.001 |
| Gr Liv Area | 90.67 | 86.77 – 94.57 | <0.001 |
| BsmtFin SF 1 | 44.34 | 40.19 – 48.49 | <0.001 |
| Observations | 2930 | | |
| R² / R² adjusted | 0.615 / 0.615 | | |
As we randomly selected three continuous variables for the first regression model, we can look for a better-fitting model by running regsubsets() on only the numeric variables of the Ames Housing dataset. regsubsets() provided the best five models of each size to choose from, and across these models Mas Vnr Area and Overall Quality are among the most common independent variables. Apart from these two, Gr Liv Area, Kitchen AbvGr, and Misc Val also appear among the selected independent variables.
I will consider Mas.Vnr.Area, Gr.Liv.Area, and BsmtFin.SF.1 for the best-fit model:
\(y' = 16123.04 + 89.03 x_1 + 90.67 x_2 + 44.34 x_3\), where \(x_1\) is Mas Vnr Area, \(x_2\) is Gr Liv Area, and \(x_3\) is BsmtFin SF 1.
The adjusted \(R^2\) is 0.615, meaning this model explains about 61.5% of the variation in sale price.
#Comparing two models#####
AIC(T7_RegMdl, Bestfitreg) %>%
kable(caption = "<center>Comparing two models</center>",
align = "c") %>%
kable_styling(bootstrap_options = c("hover",
"bordered",
"condensed",
"responsive",
"striped"),
font_size = 11) %>%
scroll_box(width = "100%", height = "100%")
| | df | AIC |
|---|---|---|
| T7_RegMdl | 5 | 72400.85 |
| Bestfitreg | 5 | 71673.36 |
In this task I compare the best-fit model with our earlier model to understand which one is more effective at predicting sale price. The \(R^2\) of the first model (T7_RegMdl) is 50.71%, while the adjusted \(R^2\) of Bestfitreg is 61.5%, so Bestfitreg explains more of the variation in sale price. AIC leads to the same conclusion: the AIC of Bestfitreg is 71673.36, lower than the 72400.85 of T7_RegMdl, and a lower AIC indicates the better-fitting model.
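As a rough illustration of where the AIC values come from, the sketch below recomputes AIC from the log-likelihood, \(AIC = 2p - 2\ln L\), and checks it against R's built-in `AIC()`. It uses two stand-in models on the built-in `mtcars` data (an assumption); `aic_by_hand` is a hypothetical helper.

```r
# Two stand-in models with the same number of parameters
fit_a <- lm(mpg ~ wt + hp + disp, data = mtcars)
fit_b <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# AIC = 2 * (number of estimated parameters) - 2 * log-likelihood
aic_by_hand <- function(fit) {
  ll <- logLik(fit)
  2 * attr(ll, "df") - 2 * as.numeric(ll)
}

# Matches the built-in AIC(); the model with the lower value is preferred
c(by_hand = aic_by_hand(fit_a), builtin = AIC(fit_a))
AIC(fit_a, fit_b)
```

For a linear model with three predictors, the parameter count is 5 (intercept, three slopes, and the residual variance), which matches the `df` column in the comparison table above.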
Tasks 2 and 3 provide a statistical overview of the dataset and replace NAs with the mean so that the best-fit regression model can be evaluated. It is crucial for a researcher to understand the data first: how many variables there are, what types of variables they are, and their summary statistics. These tools provide the descriptive statistics that are often required in a study. The describe() function reports the required descriptive statistics for the dataset, which saves time and helps to better visualize the data.
Task 4 onwards are more important: they use the correlation matrix and scatter plots to visualize the relationship between the independent and dependent variables. Apart from that, understanding multicollinearity, identifying outliers, and finding subsets of the model or the best-fit model using regsubsets() are important for analyzing the correlation.
Throughout this project I have used various statistical tests appropriate to the multiple correlation being tested. Occasionally a researcher needs to test the significance of the relationship between or among independent and dependent variables, and knowing which statistical test to use, a t test or an F test, is important in hypothesis testing.
This project gave me hands-on experience with correlation and regression analysis involving more than one independent variable and a dependent variable.
Bluman, A. (2014). Elementary statistics: A step by step approach. McGraw-Hill Education.
Kabacoff, R. (2015). R in Action. Manning Publications Co.
Soetewey, A. (2020). Correlogram in R: how to highlight the most correlated variables in a dataset. Stats and R. https://statsandr.com/blog/correlogram-in-r-how-to-highlight-the-most-correlated-variables-in-a-dataset/
Wei, T. & Simko, V. (2021). An Introduction to corrplot Package. Cran.r-project. https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html#visualize-non-correlation-matrix-na-value-and-math-label
Wickham, H. (2022). Flexibly Reshape Data: A reboot of the reshape package. Cran.r-project. https://cran.r-project.org/web/packages/reshape2/reshape2.pdf