rm(list=ls())

#Library######
library(readr)
library(tidyverse) 
library(dplyr) 
library(DT) 
library(RColorBrewer) 
library(rio) 
library(dbplyr) 
library(psych) 
library(FSA) 
library(knitr)
library(plotrix)
library(kableExtra)
library(ISLR)
library(data.table)
library(magrittr)
library(ggplot2)
library(summarytools)
library(hrbrthemes)
library(cowplot)
library(reshape2)
library(scales)
library(zoo)
library(corrplot)
library(lares)
library(leaps)
library(MASS)
library(car)
library(sjPlot)
library(sjmisc)
library(sjlabelled)
library(tidyr)

#dataset used######
AmesHousing_1 <- read_csv("Datasets/AmesHousing-1.csv")


Correlation and Regression


Introduction


Simple regression analysis studies the relationship between one dependent variable and one independent variable. For example, a regression of the number of years a smoker lives on the number of cigarettes smoked per day has one independent and one dependent variable; in that example, cigarette consumption explains 85% of the variation in years lived, and the remaining 15% is unexplained. What if we also want to study the effect of alcohol along with cigarettes? Rather than running two separate simple regressions (cigarettes per day against years lived, and alcohol consumption in ml/day against years lived), we can run a single multiple regression analysis. In that case, alcohol consumption in ml/day and number of cigarettes per day are the independent variables, and the number of years lived is the dependent variable (Bluman, 2014).
The multiple correlation coefficient is denoted by \(R\), and the coefficient of multiple determination is denoted by \(R^2\).

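Multiple linear regression formula: \(y=a+b_1x_1+b_2x_2+b_3x_3+\epsilon\), where \(a\) is the intercept, \(b_1, b_2, b_3\) are the regression coefficients of the independent variables \(x_1, x_2, x_3\), and \(\epsilon\) is the error term.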
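The formula can be sketched in R with two predictors. This is only an illustration on simulated data, not on the Ames data: the variable names (cigarettes, alcohol_ml, years_lived), the coefficients used to generate the data, and the seed are all assumptions made for the example.

```r
# Simulated data: every name and coefficient here is an illustrative assumption.
set.seed(42)
n <- 200
cigarettes  <- runif(n, 0, 30)     # x1: cigarettes per day
alcohol_ml  <- runif(n, 0, 100)    # x2: alcohol in ml per day
years_lived <- 85 - 0.5 * cigarettes - 0.05 * alcohol_ml + rnorm(n, sd = 3)

# Fit y = a + b1*x1 + b2*x2 + error
fit <- lm(years_lived ~ cigarettes + alcohol_ml)

r_squared  <- summary(fit)$r.squared  # coefficient of multiple determination
multiple_r <- sqrt(r_squared)         # multiple correlation coefficient

print(coef(fit))
print(c(R = multiple_r, R2 = r_squared))
```

In a least-squares fit with an intercept, the multiple correlation coefficient \(R\) equals the correlation between the observed and the fitted values of the dependent variable.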


Furthermore, understanding simple and multiple correlation and regression well enough to calculate predicted values is crucial in forecasting and decision making. Correlation and hypothesis testing lead back to the central limit theorem and to the level of significance. These statistical tests are critical for determining, analyzing, and predicting values to aid decision making.

Moreover, understanding the data is crucial before drawing up a study plan. The study plan and the statistical tests always depend on the data, so knowing the data makes both easier to prepare. Using the correct tools is equally important, and knowing which charts suit which type of variable helps in data visualization (Bluman, 2014).

In a nutshell, it is important to familiarize yourself with the data and to use appropriate statistical tools and tests in order to make sound decisions. In this report we work with the Ames Housing dataset, which serves as an alternative to the Boston Housing data. The Ames, Iowa dataset covers individual residential properties sold between 2006 and 2010. It contains 82 variables and 2,930 observations.

Task 2: Descriptive statistics and EDA of the Ames Housing dataset
#Descriptive Statistics#######

##Data Describe#####
describe(AmesHousing_1) %>%
  kable(caption = "<center> Table 1, Descriptive Statistics of dataset using describe()</center>",
        align = "c",
        digits = 2) %>%
   kable_styling(bootstrap_options = c("hover",
                                        "bordered",
                                       "condensed",
                                       "responsive",
                                       "striped"), # the option is spelled "striped", not "stripped"
                font_size = 11) %>%
   scroll_box(width = "100%", height = "100%") %>%
  footnote(general = "logical or categorical variables are converted to numeric denoted by *")
Table 1, Descriptive Statistics of dataset using describe()
vars n mean sd median trimmed mad min max range skew kurtosis se
Order 1 2930 1465.50 845.96 1465.5 1465.50 1086.00 1 2930 2929 0.00 -1.20 15.63
PID* 2 2930 1465.50 845.96 1465.5 1465.50 1086.00 1 2930 2929 0.00 -1.20 15.63
MS SubClass* 3 2930 5.29 4.36 5.0 4.75 5.93 1 16 15 0.73 -0.51 0.08
MS Zoning* 4 2930 5.97 0.87 6.0 6.07 0.00 1 7 6 -2.61 8.41 0.02
Lot Frontage 5 2440 69.22 23.37 68.0 68.35 17.79 21 313 292 1.50 11.20 0.47
Lot Area 6 2930 10147.92 7880.02 9436.5 9481.05 3024.50 1300 215245 213945 12.81 264.39 145.58
Street* 7 2930 2.00 0.06 2.0 2.00 0.00 1 2 1 -15.52 239.01 0.00
Alley* 8 198 1.39 0.49 1.0 1.37 0.00 1 2 1 0.43 -1.82 0.03
Lot Shape* 9 2930 2.94 1.41 4.0 3.05 0.00 1 4 3 -0.61 -1.60 0.03
Land Contour* 10 2930 3.78 0.70 4.0 4.00 0.00 1 4 3 -3.12 8.44 0.01
Utilities* 11 2930 1.00 0.06 1.0 1.00 0.00 1 3 2 34.02 1187.96 0.00
Lot Config* 12 2930 4.06 1.60 5.0 4.32 0.00 1 5 4 -1.19 -0.44 0.03
Land Slope* 13 2930 1.05 0.25 1.0 1.00 0.00 1 3 2 4.98 26.62 0.00
Neighborhood* 14 2930 15.30 7.02 16.0 15.40 8.90 1 28 27 -0.20 -1.19 0.13
Condition 1* 15 2930 3.04 0.87 3.0 3.00 0.00 1 9 8 2.99 15.74 0.02
Condition 2* 16 2930 3.00 0.21 3.0 3.00 0.00 1 8 7 12.08 308.97 0.00
Bldg Type* 17 2930 1.52 1.22 1.0 1.17 0.00 1 5 4 2.15 3.00 0.02
House Style* 18 2930 4.02 1.91 3.0 4.01 0.00 1 8 7 0.32 -0.95 0.04
Overall Qual 19 2930 6.09 1.41 6.0 6.08 1.48 1 10 9 0.19 0.05 0.03
Overall Cond 20 2930 5.56 1.11 5.0 5.47 0.00 1 9 8 0.57 1.48 0.02
Year Built 21 2930 1971.36 30.25 1973.0 1974.25 37.06 1872 2010 138 -0.60 -0.50 0.56
Year Remod/Add 22 2930 1984.27 20.86 1993.0 1985.63 20.76 1950 2010 60 -0.45 -1.34 0.39
Roof Style* 23 2930 2.39 0.82 2.0 2.24 0.00 1 6 5 1.56 0.89 0.02
Roof Matl* 24 2930 2.06 0.54 2.0 2.00 0.00 1 8 7 8.72 76.98 0.01
Exterior 1st* 25 2930 11.16 3.65 14.0 11.47 1.48 1 16 15 -0.59 -0.76 0.07
Exterior 2nd* 26 2930 11.87 4.00 15.0 12.19 2.97 1 17 16 -0.56 -0.90 0.07
Mas Vnr Type* 27 2907 3.45 1.04 4.0 3.47 0.00 1 5 4 -0.58 -1.10 0.02
Mas Vnr Area 28 2907 101.90 179.11 0.0 61.14 0.00 0 1600 1600 2.60 9.26 3.32
Exter Qual* 29 2930 3.53 0.70 4.0 3.64 0.00 1 4 3 -1.79 3.67 0.01
Exter Cond* 30 2930 4.71 0.77 5.0 4.93 0.00 1 5 4 -2.50 5.11 0.01
Foundation* 31 2930 2.39 0.73 2.0 2.45 1.48 1 6 5 0.01 0.76 0.01
Bsmt Qual* 32 2850 3.69 1.31 3.0 3.85 2.97 1 5 4 -0.46 -0.83 0.02
Bsmt Cond* 33 2850 4.80 0.69 5.0 5.00 0.00 1 5 4 -3.33 9.73 0.01
Bsmt Exposure* 34 2847 3.28 1.13 4.0 3.47 0.00 1 4 3 -1.16 -0.32 0.02
BsmtFin Type 1* 35 2850 3.76 1.81 3.0 3.82 2.97 1 6 5 -0.04 -1.36 0.03
BsmtFin SF 1 36 2929 442.63 455.59 370.0 384.08 548.56 0 5644 5644 1.41 6.84 8.42
BsmtFin Type 2* 37 2849 5.68 1.01 6.0 5.97 0.00 1 6 5 -3.38 10.80 0.02
BsmtFin SF 2 38 2929 49.72 169.17 0.0 2.04 0.00 0 1526 1526 4.14 18.73 3.13
Bsmt Unf SF 39 2929 559.26 439.49 466.0 510.77 415.13 0 2336 2336 0.92 0.40 8.12
Total Bsmt SF 40 2929 1051.61 440.62 990.0 1035.05 349.89 0 6110 6110 1.16 9.11 8.14
Heating* 41 2930 2.03 0.25 2.0 2.00 0.00 1 6 5 12.10 168.45 0.00
Heating QC* 42 2930 2.54 1.74 1.0 2.42 0.00 1 5 4 0.48 -1.52 0.03
Central Air* 43 2930 1.93 0.25 2.0 2.00 0.00 1 2 1 -3.47 10.01 0.00
Electrical* 44 2929 4.69 1.05 5.0 5.00 0.00 1 5 4 -3.09 7.67 0.02
1st Flr SF 45 2930 1159.56 391.89 1084.0 1127.17 349.89 334 5095 4761 1.47 6.95 7.24
2nd Flr SF 46 2930 335.46 428.40 0.0 272.90 0.00 0 2065 2065 0.87 -0.42 7.91
Low Qual Fin SF 47 2930 4.68 46.31 0.0 0.00 0.00 0 1064 1064 12.11 175.18 0.86
Gr Liv Area 48 2930 1499.69 505.51 1442.0 1452.25 461.09 334 5642 5308 1.27 4.12 9.34
Bsmt Full Bath 49 2928 0.43 0.52 0.0 0.40 0.00 0 3 3 0.62 -0.75 0.01
Bsmt Half Bath 50 2928 0.06 0.25 0.0 0.00 0.00 0 2 2 3.94 14.88 0.00
Full Bath 51 2930 1.57 0.55 2.0 1.56 0.00 0 4 4 0.17 -0.54 0.01
Half Bath 52 2930 0.38 0.50 0.0 0.34 0.00 0 2 2 0.70 -1.03 0.01
Bedroom AbvGr 53 2930 2.85 0.83 3.0 2.83 0.00 0 8 8 0.31 1.88 0.02
Kitchen AbvGr 54 2930 1.04 0.21 1.0 1.00 0.00 0 3 3 4.31 19.82 0.00
Kitchen Qual* 55 2930 3.86 1.27 5.0 4.03 0.00 1 5 4 -0.62 -0.68 0.02
TotRms AbvGrd 56 2930 6.44 1.57 6.0 6.33 1.48 2 15 13 0.75 1.15 0.03
Functional* 57 2930 7.69 1.18 8.0 8.00 0.00 1 8 7 -3.83 13.79 0.02
Fireplaces 58 2930 0.60 0.65 1.0 0.52 1.48 0 4 4 0.74 0.10 0.01
Fireplace Qu* 59 1508 3.72 1.13 3.0 3.78 1.48 1 5 4 -0.12 -1.01 0.03
Garage Type* 60 2773 3.28 1.79 2.0 3.11 0.00 1 6 5 0.75 -1.31 0.03
Garage Yr Blt 61 2771 1978.13 25.53 1979.0 1980.71 31.13 1895 2207 312 -0.38 1.82 0.48
Garage Finish* 62 2771 2.18 0.82 2.0 2.23 1.48 1 3 2 -0.35 -1.43 0.02
Garage Cars 63 2929 1.77 0.76 2.0 1.77 0.00 0 5 5 -0.22 0.24 0.01
Garage Area 64 2929 472.82 215.05 480.0 468.35 182.36 0 1488 1488 0.24 0.94 3.97
Garage Qual* 65 2771 4.84 0.66 5.0 5.00 0.00 1 5 4 -4.02 14.47 0.01
Garage Cond* 66 2771 4.90 0.52 5.0 5.00 0.00 1 5 4 -5.25 26.38 0.01
Paved Drive* 67 2930 2.83 0.54 3.0 3.00 0.00 1 3 2 -2.98 7.15 0.01
Wood Deck SF 68 2930 93.75 126.36 0.0 71.21 0.00 0 1424 1424 1.84 6.73 2.33
Open Porch SF 69 2930 47.53 67.48 27.0 33.87 40.03 0 742 742 2.53 10.92 1.25
Enclosed Porch 70 2930 23.01 64.14 0.0 4.83 0.00 0 1012 1012 4.01 28.42 1.18
3Ssn Porch 71 2930 2.59 25.14 0.0 0.00 0.00 0 508 508 11.39 149.63 0.46
Screen Porch 72 2930 16.00 56.09 0.0 0.00 0.00 0 576 576 3.95 17.81 1.04
Pool Area 73 2930 2.24 35.60 0.0 0.00 0.00 0 800 800 16.92 299.06 0.66
Pool QC* 74 13 2.46 1.20 3.0 2.45 1.48 1 4 3 -0.05 -1.68 0.33
Fence* 75 572 2.41 0.84 3.0 2.49 0.00 1 4 3 -0.68 -0.89 0.03
Misc Feature* 76 106 3.85 0.55 4.0 4.00 0.00 1 5 4 -3.16 10.37 0.05
Misc Val 77 2930 50.64 566.34 0.0 0.00 0.00 0 17000 17000 21.98 564.85 10.46
Mo Sold 78 2930 6.22 2.71 6.0 6.16 2.97 1 12 11 0.19 -0.46 0.05
Yr Sold 79 2930 2007.79 1.32 2008.0 2007.74 1.48 2006 2010 4 0.13 -1.16 0.02
Sale Type* 80 2930 9.36 1.88 10.0 9.87 0.00 1 10 9 -3.32 10.76 0.03
Sale Condition* 81 2930 4.78 1.08 5.0 5.00 0.00 1 6 5 -2.79 7.25 0.02
SalePrice 82 2930 180796.06 79886.69 160000.0 170429.15 54856.20 12789 755000 742211 1.74 5.10 1475.84
Note: logical or categorical variables are converted to numeric denoted by *
##Exploratory Data Analysis#####

house_rel = subset(AmesHousing_1, select=c("Lot Area",
                                           "Total Bsmt SF",
                                           "Year Built",
                                           "Year Remod/Add",
                                           "Yr Sold",
                                           "SalePrice"))
names(house_rel) %<>% stringr::str_replace_all("\\s","_") #renaming column names

ggplot(data = house_rel, mapping = aes(x = Lot_Area, y = SalePrice)) + 
  geom_boxplot(mapping = aes(group = cut_width(Lot_Area, 12000)))+
  coord_cartesian(xlim = c(0, 100000),
                  ylim = c(0,1000000))+
  ggtitle("Plot 2.1: Lot Area to House Sales Price")+
       xlab("Area of Lot (sq. feet)")+
       ylab("Sales Price(USD)")+
  theme(plot.title = element_text(hjust = 0.5))

#EDA overview
  
house_rel_mod <- reshape2::melt(house_rel) # convert columns to rows (wide to long format)
ggplot(house_rel_mod, aes(value)) +
  facet_wrap(~variable, scales = 'free_x') +
  geom_histogram(binwidth = function(x) 2.8 * IQR(x) / (length(x)^(1/3)))+
  scale_x_continuous(labels = function(x) format(x, scientific = FALSE))+
  ggtitle("Plot 2.2: EDA Overview")+
  theme(plot.title = element_text(hjust = 0.5))

ggplot(house_rel)+ #years sold bar chart
  geom_bar(mapping = aes(x=Yr_Sold),
           fill = "turquoise")+
  ggtitle("Plot 2.3: House Year Sold")+
       xlab("Year Sold")+
       ylab("Number of Houses")+
  theme(plot.title = element_text(hjust = 0.5))

ggplot(house_rel, aes(Year_Built))+
  geom_bar(stat="count", width=0.6, fill="steelblue")+
  scale_x_binned()+
  ggtitle("Plot 2.4: House Year Built")+
       xlab("House Built Year")+
       ylab("Number of Houses")+
  theme(plot.title = element_text(hjust = 0.5))

ggplot(data = house_rel, aes(x = Lot_Area))+
  geom_bar(stat="count", width=0.7, fill="steelblue")+
  scale_x_binned(limits = c(0, 250000))+ # limits go on the binned scale; a separate xlim() would be replaced by it
  ggtitle("Plot 2.5: House Lot Area")+
       xlab("Area of Lot (sq. feet)")+
       ylab("Number of Houses")+
  theme(plot.title = element_text(hjust = 0.5))


Observations:

In statistics, it is crucial to understand your data before starting a study. Once you have the facts, further steps are needed to reach a conclusion that aids decision making, and this is possible only if you know the data and the appropriate statistical tests.
Table 1 summarizes the Ames Housing dataset: describe() returns the key descriptive statistics for each variable.
describe() computes descriptive statistics for numeric variables; logical and categorical variables are converted to numeric for the sake of calculation and are marked with \(*\). In Table 1, MS SubClass, MS Zoning, Street, Lot Shape, and the other logical and categorical variables are marked with \(*\).
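The conversion of a categorical variable to numeric can be pictured with base R. The sketch below is an assumption about the general mechanism (factor levels mapped to integer codes), not a copy of what describe() does internally, and the zoning labels are hypothetical rather than taken from the Ames data.

```r
# Hypothetical zoning labels, not taken from the Ames data.
zoning <- c("RL", "RM", "RL", "FV", "RM", "RL")

# The character vector becomes a factor, and the factor's integer codes
# (levels in sorted order: FV, RL, RM) are what get summarised.
codes <- as.numeric(factor(zoning))

print(codes)
print(mean(codes))
```

The mean of such codes depends only on the ordering of the labels, so the statistics of the starred rows in Table 1 should not be interpreted like those of genuinely numeric variables.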

Moreover, descriptive statistics help in understanding the data and in choosing appropriate data visualization tools. The box plot in Plot 2.1 shows the relationship between lot area and sale price, and it also reveals any outliers in that relationship. In Plot 2.1, outliers appear as dots for several lot-area bins and can lie on either side of a box. Outliers are present for lot areas between 25,000 sq ft and 50,000 sq ft with respect to sale price (USD).
For the descriptive statistics and exploratory data analysis, I considered five common factors related to the sale of individual properties: lot area, basement area, year built, year of remodeling, and year sold. These are factors a buyer would weigh when purchasing a property. I used several data visualization tools: a box plot, bar charts, and histograms. The EDA overview (Plot 2.2) gives a compact view of these variables in a single chart. In the overview we can see that remodeling activity was considerably lower between 1980 and 1990 and increased thereafter, while it was stable between 1960 and 1980.
Plot 2.4 (House Year Built) shows that construction increased gradually until 1920 and slowed between 1930 and 1950. Between 1950 and 1980, however, construction almost doubled compared with the previous 30 years, and after 2000 it reached almost double the level of the previous 10 years.


Task 3: Imputing Missing Values


#Missing Value######


missing_dataset <- data.frame(AmesHousing_1) %>%
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))) # replace NA in numeric columns with the column mean


##Table Representation######

describe(missing_dataset) %>%
  kable(caption = "<center> Table 2, Descriptive Statistics of dataset after mean imputation using describe()</center>",
        align = "c",
        digits = 2) %>%
   kable_styling(bootstrap_options = c("hover",
                                        "bordered",
                                       "condensed",
                                       "responsive",
                                       "striped"), # the option is spelled "striped", not "stripped"
                font_size = 11) %>%
   scroll_box(width = "100%", height = "100%") %>%
  footnote(general = "logical or categorical variables are converted to numeric denoted by *")
Table 2, Descriptive Statistics of dataset after mean imputation using describe()
vars n mean sd median trimmed mad min max range skew kurtosis se
Order 1 2930 1465.50 845.96 1465.50 1465.50 1086.00 1 2930 2929 0.00 -1.20 15.63
PID* 2 2930 1465.50 845.96 1465.50 1465.50 1086.00 1 2930 2929 0.00 -1.20 15.63
MS.SubClass* 3 2930 5.29 4.36 5.00 4.75 5.93 1 16 15 0.73 -0.51 0.08
MS.Zoning* 4 2930 5.97 0.87 6.00 6.07 0.00 1 7 6 -2.61 8.41 0.02
Lot.Frontage 5 2930 69.22 21.32 69.22 68.52 13.68 21 313 292 1.64 14.05 0.39
Lot.Area 6 2930 10147.92 7880.02 9436.50 9481.05 3024.50 1300 215245 213945 12.81 264.39 145.58
Street* 7 2930 2.00 0.06 2.00 2.00 0.00 1 2 1 -15.52 239.01 0.00
Alley* 8 198 1.39 0.49 1.00 1.37 0.00 1 2 1 0.43 -1.82 0.03
Lot.Shape* 9 2930 2.94 1.41 4.00 3.05 0.00 1 4 3 -0.61 -1.60 0.03
Land.Contour* 10 2930 3.78 0.70 4.00 4.00 0.00 1 4 3 -3.12 8.44 0.01
Utilities* 11 2930 1.00 0.06 1.00 1.00 0.00 1 3 2 34.02 1187.96 0.00
Lot.Config* 12 2930 4.06 1.60 5.00 4.32 0.00 1 5 4 -1.19 -0.44 0.03
Land.Slope* 13 2930 1.05 0.25 1.00 1.00 0.00 1 3 2 4.98 26.62 0.00
Neighborhood* 14 2930 15.30 7.02 16.00 15.40 8.90 1 28 27 -0.20 -1.19 0.13
Condition.1* 15 2930 3.04 0.87 3.00 3.00 0.00 1 9 8 2.99 15.74 0.02
Condition.2* 16 2930 3.00 0.21 3.00 3.00 0.00 1 8 7 12.08 308.97 0.00
Bldg.Type* 17 2930 1.52 1.22 1.00 1.17 0.00 1 5 4 2.15 3.00 0.02
House.Style* 18 2930 4.02 1.91 3.00 4.01 0.00 1 8 7 0.32 -0.95 0.04
Overall.Qual 19 2930 6.09 1.41 6.00 6.08 1.48 1 10 9 0.19 0.05 0.03
Overall.Cond 20 2930 5.56 1.11 5.00 5.47 0.00 1 9 8 0.57 1.48 0.02
Year.Built 21 2930 1971.36 30.25 1973.00 1974.25 37.06 1872 2010 138 -0.60 -0.50 0.56
Year.Remod.Add 22 2930 1984.27 20.86 1993.00 1985.63 20.76 1950 2010 60 -0.45 -1.34 0.39
Roof.Style* 23 2930 2.39 0.82 2.00 2.24 0.00 1 6 5 1.56 0.89 0.02
Roof.Matl* 24 2930 2.06 0.54 2.00 2.00 0.00 1 8 7 8.72 76.98 0.01
Exterior.1st* 25 2930 11.16 3.65 14.00 11.47 1.48 1 16 15 -0.59 -0.76 0.07
Exterior.2nd* 26 2930 11.87 4.00 15.00 12.19 2.97 1 17 16 -0.56 -0.90 0.07
Mas.Vnr.Type* 27 2907 3.45 1.04 4.00 3.47 0.00 1 5 4 -0.58 -1.10 0.02
Mas.Vnr.Area 28 2930 101.90 178.41 0.00 61.28 0.00 0 1600 1600 2.61 9.36 3.30
Exter.Qual* 29 2930 3.53 0.70 4.00 3.64 0.00 1 4 3 -1.79 3.67 0.01
Exter.Cond* 30 2930 4.71 0.77 5.00 4.93 0.00 1 5 4 -2.50 5.11 0.01
Foundation* 31 2930 2.39 0.73 2.00 2.45 1.48 1 6 5 0.01 0.76 0.01
Bsmt.Qual* 32 2850 3.69 1.31 3.00 3.85 2.97 1 5 4 -0.46 -0.83 0.02
Bsmt.Cond* 33 2850 4.80 0.69 5.00 5.00 0.00 1 5 4 -3.33 9.73 0.01
Bsmt.Exposure* 34 2847 3.28 1.13 4.00 3.47 0.00 1 4 3 -1.16 -0.32 0.02
BsmtFin.Type.1* 35 2850 3.76 1.81 3.00 3.82 2.97 1 6 5 -0.04 -1.36 0.03
BsmtFin.SF.1 36 2930 442.63 455.51 370.50 383.98 549.30 0 5644 5644 1.41 6.84 8.42
BsmtFin.Type.2* 37 2849 5.68 1.01 6.00 5.97 0.00 1 6 5 -3.38 10.80 0.02
BsmtFin.SF.2 38 2930 49.72 169.14 0.00 2.00 0.00 0 1526 1526 4.14 18.74 3.12
Bsmt.Unf.SF 39 2930 559.26 439.42 466.00 510.69 415.13 0 2336 2336 0.92 0.41 8.12
Total.Bsmt.SF 40 2930 1051.61 440.54 990.00 1035.00 349.89 0 6110 6110 1.16 9.11 8.14
Heating* 41 2930 2.03 0.25 2.00 2.00 0.00 1 6 5 12.10 168.45 0.00
Heating.QC* 42 2930 2.54 1.74 1.00 2.42 0.00 1 5 4 0.48 -1.52 0.03
Central.Air* 43 2930 1.93 0.25 2.00 2.00 0.00 1 2 1 -3.47 10.01 0.00
Electrical* 44 2929 4.69 1.05 5.00 5.00 0.00 1 5 4 -3.09 7.67 0.02
X1st.Flr.SF 45 2930 1159.56 391.89 1084.00 1127.17 349.89 334 5095 4761 1.47 6.95 7.24
X2nd.Flr.SF 46 2930 335.46 428.40 0.00 272.90 0.00 0 2065 2065 0.87 -0.42 7.91
Low.Qual.Fin.SF 47 2930 4.68 46.31 0.00 0.00 0.00 0 1064 1064 12.11 175.18 0.86
Gr.Liv.Area 48 2930 1499.69 505.51 1442.00 1452.25 461.09 334 5642 5308 1.27 4.12 9.34
Bsmt.Full.Bath 49 2930 0.43 0.52 0.00 0.40 0.00 0 3 3 0.62 -0.75 0.01
Bsmt.Half.Bath 50 2930 0.06 0.25 0.00 0.00 0.00 0 2 2 3.94 14.89 0.00
Full.Bath 51 2930 1.57 0.55 2.00 1.56 0.00 0 4 4 0.17 -0.54 0.01
Half.Bath 52 2930 0.38 0.50 0.00 0.34 0.00 0 2 2 0.70 -1.03 0.01
Bedroom.AbvGr 53 2930 2.85 0.83 3.00 2.83 0.00 0 8 8 0.31 1.88 0.02
Kitchen.AbvGr 54 2930 1.04 0.21 1.00 1.00 0.00 0 3 3 4.31 19.82 0.00
Kitchen.Qual* 55 2930 3.86 1.27 5.00 4.03 0.00 1 5 4 -0.62 -0.68 0.02
TotRms.AbvGrd 56 2930 6.44 1.57 6.00 6.33 1.48 2 15 13 0.75 1.15 0.03
Functional* 57 2930 7.69 1.18 8.00 8.00 0.00 1 8 7 -3.83 13.79 0.02
Fireplaces 58 2930 0.60 0.65 1.00 0.52 1.48 0 4 4 0.74 0.10 0.01
Fireplace.Qu* 59 1508 3.72 1.13 3.00 3.78 1.48 1 5 4 -0.12 -1.01 0.03
Garage.Type* 60 2773 3.28 1.79 2.00 3.11 0.00 1 6 5 0.75 -1.31 0.03
Garage.Yr.Blt 61 2930 1978.13 24.83 1978.13 1980.62 29.85 1895 2207 312 -0.40 2.09 0.46
Garage.Finish* 62 2771 2.18 0.82 2.00 2.23 1.48 1 3 2 -0.35 -1.43 0.02
Garage.Cars 63 2930 1.77 0.76 2.00 1.77 0.00 0 5 5 -0.22 0.24 0.01
Garage.Area 64 2930 472.82 215.01 480.00 468.32 182.36 0 1488 1488 0.24 0.95 3.97
Garage.Qual* 65 2771 4.84 0.66 5.00 5.00 0.00 1 5 4 -4.02 14.47 0.01
Garage.Cond* 66 2771 4.90 0.52 5.00 5.00 0.00 1 5 4 -5.25 26.38 0.01
Paved.Drive* 67 2930 2.83 0.54 3.00 3.00 0.00 1 3 2 -2.98 7.15 0.01
Wood.Deck.SF 68 2930 93.75 126.36 0.00 71.21 0.00 0 1424 1424 1.84 6.73 2.33
Open.Porch.SF 69 2930 47.53 67.48 27.00 33.87 40.03 0 742 742 2.53 10.92 1.25
Enclosed.Porch 70 2930 23.01 64.14 0.00 4.83 0.00 0 1012 1012 4.01 28.42 1.18
X3Ssn.Porch 71 2930 2.59 25.14 0.00 0.00 0.00 0 508 508 11.39 149.63 0.46
Screen.Porch 72 2930 16.00 56.09 0.00 0.00 0.00 0 576 576 3.95 17.81 1.04
Pool.Area 73 2930 2.24 35.60 0.00 0.00 0.00 0 800 800 16.92 299.06 0.66
Pool.QC* 74 13 2.46 1.20 3.00 2.45 1.48 1 4 3 -0.05 -1.68 0.33
Fence* 75 572 2.41 0.84 3.00 2.49 0.00 1 4 3 -0.68 -0.89 0.03
Misc.Feature* 76 106 3.85 0.55 4.00 4.00 0.00 1 5 4 -3.16 10.37 0.05
Misc.Val 77 2930 50.64 566.34 0.00 0.00 0.00 0 17000 17000 21.98 564.85 10.46
Mo.Sold 78 2930 6.22 2.71 6.00 6.16 2.97 1 12 11 0.19 -0.46 0.05
Yr.Sold 79 2930 2007.79 1.32 2008.00 2007.74 1.48 2006 2010 4 0.13 -1.16 0.02
Sale.Type* 80 2930 9.36 1.88 10.00 9.87 0.00 1 10 9 -3.32 10.76 0.03
Sale.Condition* 81 2930 4.78 1.08 5.00 5.00 0.00 1 6 5 -2.79 7.25 0.02
SalePrice 82 2930 180796.06 79886.69 160000.00 170429.15 54856.20 12789 755000 742211 1.74 5.10 1475.84
Note: logical or categorical variables are converted to numeric denoted by *


Observations:

The dataset has missing values in some numeric variables, and these could distort the correlations. To avoid this, I replaced the NAs with the mean of the respective variable.
describe() is then used to show the descriptive statistics of the dataset after replacing the NAs with the mean.
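The effect of mean imputation can be seen on a small vector. The values below are hypothetical and only illustrate the technique; they are not rows of the Ames data.

```r
# Hypothetical lot-frontage values with two missing entries.
lot_frontage <- c(60, 80, NA, 70, 90, NA, 50)

# Replace each NA with the mean of the observed values.
imputed <- ifelse(is.na(lot_frontage),
                  mean(lot_frontage, na.rm = TRUE),
                  lot_frontage)

print(imputed)
print(c(mean_before = mean(lot_frontage, na.rm = TRUE),
        mean_after  = mean(imputed)))
print(c(sd_before = sd(lot_frontage, na.rm = TRUE),
        sd_after  = sd(imputed)))
```

Mean imputation leaves the mean unchanged but shrinks the standard deviation, because the filled-in values sit exactly at the centre. The same pattern is visible for Lot Frontage in the tables above: the mean stays at 69.22 while the standard deviation drops from 23.37 to 21.32.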


Task 4, 5: Correlation Matrix


#Correlation Matrix####

only_numeric <- missing_dataset[sapply(missing_dataset, is.numeric)] # keep numeric columns only

correla_dataset = cor(only_numeric, use = "pairwise.complete.obs")
corrplot(correla_dataset, 
         type = 'lower', 
         order = 'hclust', 
         tl.col = 'black',
         cl.ratio = 0.2, 
         tl.srt = 45, 
         col = COL2('PuOr', 10), 
         tl.cex = 0.50,
         title = "Chart 4.1: Correlation Matrix",
         mar=c(0,0,1,0) )

corr_var(only_numeric, SalePrice, top = 40)


Observations:

Now that we have gained insight into the data and replaced the NAs with means, we can compute the correlations between the numeric variables. The dataset contains continuous, nominal, and discrete variables, so I kept only the numeric columns to compute the correlations.

Chart 4.1 is a pairwise correlation matrix of the numeric variables only. The legend at the bottom of the chart shows the range of the correlation coefficient, from -1 to 1. SalePrice is strongly correlated with Overall Qual, Garage Area, Total Bsmt SF, and Gr Liv Area. However, exact coefficient values cannot be read from this matrix.
To obtain the coefficients, I used corr_var(), which ranks the variables by the strength of their correlation with SalePrice. Overall Qual, Gr Liv Area, and Total Bsmt SF are among the strongest correlations with SalePrice, whereas Bedroom AbvGr, Pool Area, and BsmtFin SF 2 are only weakly correlated with it.
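Ranking variables by their correlation with a target can also be done with base R alone. The sketch below uses simulated data; the column names (quality, area, noise, price) and the effect sizes are assumptions chosen for illustration.

```r
# Simulated numeric data; column names and effect sizes are assumptions.
set.seed(1)
n <- 500
quality <- rnorm(n)
area    <- rnorm(n)
noise   <- rnorm(n)
price   <- 3 * quality + 1 * area + rnorm(n)
houses  <- data.frame(quality, area, noise, price)

# Pairwise correlation matrix of all numeric columns.
cor_mat <- cor(houses, use = "pairwise.complete.obs")

# Correlation of every predictor with the target, strongest first.
target_cor <- cor_mat[rownames(cor_mat) != "price", "price"]
ranked <- target_cor[order(abs(target_cor), decreasing = TRUE)]
print(round(ranked, 3))
```

Sorting by absolute value keeps strong negative correlations near the top, which a plain descending sort would push to the bottom.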


Task 6: Scatter Plots for the Highest Correlation, the Correlation Closest to 0.5, and the Lowest Correlation

A scatter plot is used to understand the nature of the correlation between two variables, with the independent variable on the x-axis and the dependent variable on the y-axis (Bluman, 2014).


Task 6.1: Highest correlation (above 0.5)


#More than 0.5 correlation scatter plot#####

names(AmesHousing_1) %<>% stringr::str_replace_all("\\s","_")

T6_SC1 <- ggplot(data = AmesHousing_1, aes(x = Overall_Qual, y = SalePrice)) + 
  geom_point(alpha = 0.5, 
             pch = 24) +
  geom_smooth(method = "lm", se = FALSE,
              color = "#99004C",
              lty=1,
              lwd=1)+
  labs(title = "Overall House Quality to Sales Price Correlation",
       x = "Overall Quality",
       y = "Sale Price(USD)")+
  expand_limits(x = c(0, NA), y = c(0, NA)) +
  scale_y_continuous(labels = unit_format(unit = "K", scale = 1e-3))+
  theme(plot.title = element_text(hjust = 0.5))+
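 annotate("text", x=2.5, y=600000, # annotate() draws the label once; geom_text() would redraw it for every row
          label=paste("Correlation:", round(correla_dataset["Overall.Qual","SalePrice"],3)),
          color="forestgreen",size = 3)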


print(T6_SC1)


Observations:

The scatter plot above shows the relationship between SalePrice and overall quality. The correlation coefficient is 0.799, which indicates a strong positive relationship. The points follow a line that rises from left to right, and the positive slope of the fitted line confirms the positive correlation. We can conclude that the higher the overall quality, the higher the sale price tends to be.


Task 6.2: Correlation closest to 0.5


#Close to 0.5 correlation scatter plot#####

T6_SC2 <- ggplot(data = AmesHousing_1, aes(x = TotRms_AbvGrd, y = SalePrice)) + 
  geom_point(alpha = 0.5, 
             pch = 24) +
  geom_smooth(method = "lm", se = FALSE,
              color = "#99004C",
              lty=1,
              lwd=1)+
  labs(title = "Total Rooms Above Ground to Sales Price Correlation",
       x = "Total Rooms Above Ground",
       y = "Sale Price(USD)")+
  expand_limits(x = c(0, NA), y = c(0, NA)) +
  scale_y_continuous(labels = unit_format(unit = "K", scale = 1e-3))+
  theme(plot.title = element_text(hjust = 0.5))+
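 annotate("text", x=2.5, y=600000, # annotate() draws the label once; geom_text() would redraw it for every row
          label=paste("Correlation:", round(correla_dataset["TotRms.AbvGrd","SalePrice"],4)),
          color="forestgreen",size = 3)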


print(T6_SC2)


Observations:

The scatter plot above shows the relationship between SalePrice and Total Rooms Above Ground. The correlation coefficient is 0.4955, which indicates a moderate positive relationship. The points are scattered around a line that rises from left to right, and the positive slope of the fitted line confirms the positive correlation.


Task 6.3: Lowest correlation (below 0.5)


#Less than 0.5 correlation scatter plot#####

T6_SC3 <- ggplot(data = AmesHousing_1, aes(x = BsmtFin_SF_2, y = SalePrice)) + 
  geom_point(alpha = 0.5, 
             pch = 24) +
  geom_smooth(method = "lm", se = FALSE,
              color = "#99004C",
              lty=1,
              lwd=1)+
  labs(title = "Type 2 Basement Finish to Sales Price Correlation",
       x = "Type 2 finished basement sqr feet",
       y = "Sale Price(USD)")+
  expand_limits(x = c(0, NA), y = c(0, NA)) +
  scale_y_continuous(labels = unit_format(unit = "K", scale = 1e-3))+
  theme(plot.title = element_text(hjust = 0.5))+
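 annotate("text", x=1250, y=750000, # annotate() draws the label once; geom_text() would redraw it for every row
          label=paste("Correlation:", round(correla_dataset["BsmtFin.SF.2","SalePrice"],4)),
          color="forestgreen",size = 3)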


print(T6_SC3)


Observations:

The scatter plot above shows the relationship between SalePrice and Type 2 finished basement area. The correlation coefficient is 0.0059, which is close to 0, so there is no linear relationship between the two variables. The fitted line is nearly flat from left to right, neither increasing nor decreasing. The variables might still be related in some other, nonlinear way (Bluman, 2014).
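That a near-zero correlation does not rule out a nonlinear relationship can be shown with a deliberately artificial example. The data below are constructed for illustration and have nothing to do with the housing variables.

```r
# A perfectly deterministic but nonlinear relationship.
x <- -10:10
y <- x^2

# Pearson's r measures only linear association, so it is essentially zero
# here even though y is completely determined by x.
r <- cor(x, y)
print(round(r, 4))
```

For this reason a scatter plot should always accompany a correlation coefficient before concluding that two variables are unrelated.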


Task 7,8,9: Regression model of 3 variables


#Regression of SalePrice on three chosen continuous variables#####
par(mfrow = c(2,2)) # 2 x 2 grid for the four diagnostic plots

# pass the data frame through the data argument instead of attach()/detach()
T7_RegMdl = lm(SalePrice ~ Lot.Area + Mas.Vnr.Area + Garage.Area,
               data = missing_dataset)


t7 = summary(T7_RegMdl)
tab_model(t7)
                   Dependent variable
Predictors         Estimates   CI                     p
(Intercept)        66352.12    61114.99 – 71589.25    <0.001
Lot.Area           1.23        0.97 – 1.49            <0.001
Mas.Vnr.Area       136.00      123.71 – 148.30        <0.001
Garage.Area        186.33      175.97 – 196.68        <0.001
Observations       2930
R2 / R2 adjusted   0.507 / 0.507
plot(T7_RegMdl)


Observations:

For the regression model I chose Lot Area, Mas Vnr Area, and Garage Area as predictors of SalePrice.
The intercept of the regression is 66352.12; the slope for Lot Area (\(b_1\)) is 1.23, the slope for Mas Vnr Area (\(b_2\)) is 136.00, and the slope for Garage Area (\(b_3\)) is 186.33.
The coefficient of determination \(R^2\) is 0.5071, meaning that 50.71% of the variation in the dependent variable SalePrice is explained by the independent variables. The regression coefficient of Lot Area is 1.23, meaning that a one-square-foot increase in Lot Area is associated with a 1.23 USD increase in SalePrice, holding Garage Area and Mas Vnr Area constant (Kabacoff, 2015).

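Multiple linear regression prediction equation: \(y'=a+b_1x_1+b_2x_2+b_3x_3\), where \(x_1\) is Lot Area, \(x_2\) is Mas Vnr Area, and \(x_3\) is Garage Area. The error term \(\epsilon\) does not appear because \(y'\) is the predicted value.

\(y' = 66352.12 + 1.23x_1 + 136.00x_2 + 186.33x_3\)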
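The fitted equation can be applied by hand to a single house. In the sketch below the coefficients are the rounded values reported in the regression table, while the house itself (lot, veneer, and garage areas) is hypothetical and chosen only for illustration.

```r
# Rounded coefficients as reported in the regression table.
intercept     <- 66352.12
b_lot_area    <- 1.23
b_mas_vnr     <- 136.00
b_garage_area <- 186.33

# Predicted sale price from the fitted equation.
predict_price <- function(lot_area, mas_vnr_area, garage_area) {
  intercept +
    b_lot_area * lot_area +
    b_mas_vnr * mas_vnr_area +
    b_garage_area * garage_area
}

# Hypothetical house: 10,000 sq ft lot, 100 sq ft veneer, 480 sq ft garage.
predicted <- predict_price(10000, 100, 480)
print(predicted)
```

For this hypothetical house the equation gives roughly 181,690 USD. Because the coefficients are rounded, the result would differ slightly from predict() applied to the fitted model object.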

Furthermore, the four diagnostic plots assess linearity (upper left), normality (upper right), homoscedasticity (bottom left), and residuals versus leverage (bottom right).
Residuals vs Leverage (bottom right) helps identify outliers in the model. The value of the dependent variable is not used when calculating an observation's leverage (Kabacoff, 2015).
Normality (upper right) is a probability plot of the standardized residuals against the values that would be expected under normality (Kabacoff, 2015). Because the points do not all fall on the reference line, the residuals depart from normality, which suggests that there are outliers in the model.
Linearity (upper left): the model should account for all systematic variation in the data, so the residuals should show no pattern against the fitted values.
Homoscedasticity (bottom left): the residuals should have a constant spread across the fitted values. In a regression model, heteroscedasticity is the uneven dispersion of residuals at different levels of the response variable.


Task 10: Multicollinearity


#Multicollinearity#####
vifval = vif(T7_RegMdl) # Variance Inflation Factor for each predictor
print(vifval) 
##     Lot.Area Mas.Vnr.Area  Garage.Area 
##     1.050306     1.164050     1.199737
sqrtvifval = sqrt(vifval) > 2 # TRUE would indicate a multicollinearity problem
print(sqrtvifval)
##     Lot.Area Mas.Vnr.Area  Garage.Area 
##        FALSE        FALSE        FALSE


Observations:

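Multicollinearity arises when the independent variables in a regression model are correlated with each other (Kabacoff, 2015). The Variance Inflation Factor (VIF) is used to measure multicollinearity.

\(VIF_i = \frac{1}{1-R_i^2}\), where \(R_i^2\) is the \(R^2\) obtained by regressing the \(i\)-th independent variable on the other independent variables.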


The VIF values of our model range from 1.05 to 1.20, all below 2, which means that some multicollinearity is present but at an acceptable level. If a VIF value were more than 5, it would be advisable to reconsider the model, because multicollinearity would be strong. Since all of our VIF values are well below 2, no further action is taken to correct for multicollinearity.
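The VIF formula can be computed by hand with base R, which makes the role of \(R_i^2\) explicit. The predictors below are simulated; their names (x1, x2, x3) and the strength of their correlation are assumptions for illustration only.

```r
# Simulated predictors; names and correlations are assumptions.
set.seed(7)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)   # moderately correlated with x1
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF_i = 1 / (1 - R_i^2), with R_i^2 from regressing x_i on the others.
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

print(round(vif_manual, 3))
print(sqrt(vif_manual) > 2)   # TRUE would flag a multicollinearity problem
```

A VIF of 1 means a predictor is uncorrelated with the others, so values such as 1.05 to 1.20 correspond to only a small share of each predictor's variance being explained by the remaining predictors.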


Task 11: Identifying Outliers


#Identifying specific and overall outliers in the dataset######


outlierTest(T7_RegMdl) # identify outliers from the dataset
##       rstudent unadjusted p-value Bonferroni p
## 1761  9.188404         7.3519e-20   2.1541e-16
## 1499 -6.482266         1.0567e-10   3.0960e-07
## 1768  6.283111         3.8122e-10   1.1170e-06
## 433   5.750663         9.8079e-09   2.8737e-05
## 424   5.685941         1.4293e-08   4.1879e-05
## 2181 -5.643089         1.8301e-08   5.3622e-05
## 2333  5.532870         3.4287e-08   1.0046e-04
## 1064  5.264756         1.5054e-07   4.4108e-04
## 45    4.901155         1.0047e-06   2.9437e-03
## 434   4.508005         6.8011e-06   1.9927e-02
hat.plot <- function(fit) # index plot of hat values to spot high-leverage observations
  {
p <- length(coefficients(fit)) # number of estimated parameters, including the intercept
n <- length(fitted(fit))       # number of observations
plot (hatvalues(fit), main = "Index Plot of Hat Values") 
abline(h = c(2,3)*p/n,         # reference lines at 2 and 3 times the average hat value
col = "red", lty = 2)
identify(1:n, hatvalues(fit), names(hatvalues(fit))) # interactive; returns integer(0) in a non-interactive run
}
 
hat.plot(T7_RegMdl) #outlier order: 957, 1571, 2116

## integer(0)
##Influential observation#####

k <- length(T7_RegMdl$coefficients) - 1 # number of predictors (intercept excluded)
cutoff <- 4/(nobs(T7_RegMdl) - k - 1)   # Cook's D cutoff 4/(n - k - 1)
plot(T7_RegMdl, which = 4, cook.levels = cutoff)
abline(h = cutoff, lty = 2, col = "red")


Observations:

So far we have run goodness-of-fit checks to quantify how well the model fits. The diagnostic plots in Task 9 indicated outliers in the model, so these need to be evaluated further.
outlierTest() lists the overall outliers in the model; however, we need the specific outliers in order to decide whether to delete or keep them. The high-leverage-point function returns these points, and the result can be confirmed with the Cook’s Distance plot.

Cook’s Distance:
The Cook’s Distance plot is a popular choice for identifying influential observations. Influential observations are those with a Cook’s D value greater than \(\frac{4}{n - k - 1}\), where \(n\) is the sample size and \(k\) is the number of predictor variables (Kabacoff, 2015).
We can conclude that observations 1571 and 2116 are flagged both as high-leverage points and by Cook’s Distance. Since the VIF values are below 2 and only these two observations are flagged by both diagnostics, I judged that they would not harm the model and decided to leave it unchanged.
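Both cutoffs can also be applied without the interactive identify() step. The sketch below uses the built-in mtcars data as a stand-in for the housing data, an assumption made only so that the example runs anywhere; the model formula is likewise illustrative.

```r
# Built-in mtcars data stands in for the housing data (illustrative only).
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

n <- nobs(fit)                 # sample size
k <- length(coef(fit)) - 1     # number of predictors (intercept excluded)
p <- k + 1                     # number of estimated parameters

# High-leverage points: hat value above 2 * p / n.
hat_vals <- hatvalues(fit)
high_leverage <- names(hat_vals)[hat_vals > 2 * p / n]

# Influential points: Cook's distance above 4 / (n - k - 1).
cooks_d <- cooks.distance(fit)
influential <- names(cooks_d)[cooks_d > 4 / (n - k - 1)]

print(high_leverage)
print(influential)
print(intersect(high_leverage, influential))
```

Taking the intersection of the two sets mirrors the reasoning above, where only observations flagged by both diagnostics are considered for removal.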


Task 12:


#Identifying good fit model within selected variables#####

stepAIC(T7_RegMdl, direction = "backward") #stepwise regression for good fit model
## Start:  AIC=64083.87
## SalePrice ~ Lot.Area + Mas.Vnr.Area + Garage.Area
## 
##                Df  Sum of Sq        RSS   AIC
## <none>                       9.2132e+12 64084
## - Lot.Area      1 2.6218e+11 9.4754e+12 64164
## - Mas.Vnr.Area  1 1.4814e+12 1.0695e+13 64519
## - Garage.Area   1 3.9183e+12 1.3131e+13 65120
## 
## Call:
## lm(formula = SalePrice ~ Lot.Area + Mas.Vnr.Area + Garage.Area)
## 
## Coefficients:
##  (Intercept)      Lot.Area  Mas.Vnr.Area   Garage.Area  
##     66352.12          1.23        136.00        186.33
Regmodsubset = regsubsets(SalePrice ~ Lot.Area + Mas.Vnr.Area + Garage.Area, data = missing_dataset, nbest = 3) #subset for finding good fit model

plot(Regmodsubset, scale = "adjr2") 

summary(Regmodsubset)
## Subset selection object
## Call: regsubsets.formula(SalePrice ~ Lot.Area + Mas.Vnr.Area + Garage.Area, 
##     data = missing_dataset, nbest = 3)
## 3 Variables  (and intercept)
##              Forced in Forced out
## Lot.Area         FALSE      FALSE
## Mas.Vnr.Area     FALSE      FALSE
## Garage.Area      FALSE      FALSE
## 3 subsets of each size up to 3
## Selection Algorithm: exhaustive
##          Lot.Area Mas.Vnr.Area Garage.Area
## 1  ( 1 ) " "      " "          "*"        
## 1  ( 2 ) " "      "*"          " "        
## 1  ( 3 ) "*"      " "          " "        
## 2  ( 1 ) " "      "*"          "*"        
## 2  ( 2 ) "*"      " "          "*"        
## 2  ( 3 ) "*"      "*"          " "        
## 3  ( 1 ) "*"      "*"          "*"


Observation:

I used stepAIC() with backward elimination to look for a better-fitting model, but it did not drop any variable: removing any of the three predictors would increase the AIC. I then used regsubsets() with up to three subsets per model size, and it likewise pointed to no better alternative. Plotting the regsubsets() object on the adjusted \(R^2\) scale gave the same result: the model with Lot Area, Mas Vnr Area, and Garage Area has the highest adjusted \(R^2\), about 51%.
After applying these methods, I decided not to change the model.
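
To make the subset comparison concrete, the sketch below enumerates every subset of three predictors in base R and ranks the fits by adjusted \(R^2\). It assumes the built-in mtcars data with stand-in predictors (not the Ames variables) and is a hand-rolled illustration of an exhaustive search, not the algorithm regsubsets() uses internally.

```r
# Illustrative sketch on mtcars; the predictors are stand-ins, not the Ames variables.
predictors <- c("wt", "hp", "disp")

# Enumerate every non-empty subset of the predictors (7 subsets for 3 variables)
subsets <- unlist(
  lapply(seq_along(predictors),
         function(m) combn(predictors, m, simplify = FALSE)),
  recursive = FALSE
)

# Fit each candidate model and record its adjusted R-squared
adj_r2 <- sapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "mpg"), data = mtcars)
  summary(fit)$adj.r.squared
})
names(adj_r2) <- sapply(subsets, paste, collapse = " + ")

print(round(sort(adj_r2, decreasing = TRUE), 3))
best <- names(which.max(adj_r2))
print(best)
```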


Task 13:


#Finding best fit regression model from only numeric data#####

#sigma(T7_RegMdl)/mean(missing_dataset$SalePrice)

Regmodsubset2 = regsubsets(SalePrice ~., nbest = 5, method = "exhaustive",
                           data = AmesHousing_1$only_numeric)
## Reordering variables and trying again:
summary(Regmodsubset2)
## Subset selection object
## Call: regsubsets.formula(SalePrice ~ ., nbest = 5, method = "exhaustive", 
##     data = AmesHousing_1$only_numeric)
## 36 Variables  (and intercept)
##                 Forced in Forced out
## Order               FALSE      FALSE
## Lot.Frontage        FALSE      FALSE
## Lot.Area            FALSE      FALSE
## Overall.Qual        FALSE      FALSE
## Overall.Cond        FALSE      FALSE
## Year.Built          FALSE      FALSE
## Year.Remod.Add      FALSE      FALSE
## Mas.Vnr.Area        FALSE      FALSE
## BsmtFin.SF.1        FALSE      FALSE
## BsmtFin.SF.2        FALSE      FALSE
## Bsmt.Unf.SF         FALSE      FALSE
## X1st.Flr.SF         FALSE      FALSE
## X2nd.Flr.SF         FALSE      FALSE
## Low.Qual.Fin.SF     FALSE      FALSE
## Bsmt.Full.Bath      FALSE      FALSE
## Bsmt.Half.Bath      FALSE      FALSE
## Full.Bath           FALSE      FALSE
## Half.Bath           FALSE      FALSE
## Bedroom.AbvGr       FALSE      FALSE
## Kitchen.AbvGr       FALSE      FALSE
## TotRms.AbvGrd       FALSE      FALSE
## Fireplaces          FALSE      FALSE
## Garage.Yr.Blt       FALSE      FALSE
## Garage.Cars         FALSE      FALSE
## Garage.Area         FALSE      FALSE
## Wood.Deck.SF        FALSE      FALSE
## Open.Porch.SF       FALSE      FALSE
## Enclosed.Porch      FALSE      FALSE
## X3Ssn.Porch         FALSE      FALSE
## Screen.Porch        FALSE      FALSE
## Pool.Area           FALSE      FALSE
## Misc.Val            FALSE      FALSE
## Mo.Sold             FALSE      FALSE
## Yr.Sold             FALSE      FALSE
## Total.Bsmt.SF       FALSE      FALSE
## Gr.Liv.Area         FALSE      FALSE
## 5 subsets of each size up to 9
## Selection Algorithm: exhaustive
##          Order Lot.Frontage Lot.Area Overall.Qual Overall.Cond Year.Built
## 1  ( 1 ) " "   " "          " "      "*"          " "          " "       
## 1  ( 2 ) " "   " "          " "      " "          " "          " "       
## 1  ( 3 ) " "   " "          " "      " "          " "          " "       
## 1  ( 4 ) " "   " "          " "      " "          " "          " "       
## 1  ( 5 ) " "   " "          " "      " "          " "          " "       
## 2  ( 1 ) " "   " "          " "      "*"          " "          " "       
## 2  ( 2 ) " "   " "          " "      "*"          " "          " "       
## 2  ( 3 ) " "   " "          " "      "*"          " "          " "       
## 2  ( 4 ) " "   " "          " "      "*"          " "          " "       
## 2  ( 5 ) " "   " "          " "      "*"          " "          " "       
## 3  ( 1 ) " "   " "          " "      "*"          " "          " "       
## 3  ( 2 ) " "   " "          " "      "*"          " "          " "       
## 3  ( 3 ) " "   " "          " "      "*"          " "          " "       
## 3  ( 4 ) " "   " "          " "      "*"          " "          " "       
## 3  ( 5 ) " "   " "          " "      "*"          " "          " "       
## 4  ( 1 ) " "   " "          " "      "*"          " "          " "       
## 4  ( 2 ) " "   " "          " "      "*"          " "          " "       
## 4  ( 3 ) " "   " "          " "      "*"          " "          "*"       
## 4  ( 4 ) " "   " "          " "      "*"          " "          " "       
## 4  ( 5 ) " "   " "          " "      "*"          " "          " "       
## 5  ( 1 ) " "   " "          " "      "*"          " "          "*"       
## 5  ( 2 ) " "   " "          " "      "*"          " "          " "       
## 5  ( 3 ) " "   " "          " "      "*"          " "          " "       
## 5  ( 4 ) " "   " "          " "      "*"          " "          " "       
## 5  ( 5 ) " "   " "          " "      "*"          " "          " "       
## 6  ( 1 ) " "   " "          " "      "*"          " "          " "       
## 6  ( 2 ) " "   " "          " "      "*"          " "          " "       
## 6  ( 3 ) " "   " "          " "      "*"          " "          " "       
## 6  ( 4 ) " "   " "          " "      "*"          " "          " "       
## 6  ( 5 ) " "   " "          " "      "*"          " "          " "       
## 7  ( 1 ) " "   " "          " "      "*"          " "          " "       
## 7  ( 2 ) " "   " "          " "      "*"          " "          " "       
## 7  ( 3 ) " "   " "          " "      "*"          " "          " "       
## 7  ( 4 ) " "   " "          " "      "*"          " "          " "       
## 7  ( 5 ) " "   " "          " "      "*"          " "          " "       
## 8  ( 1 ) " "   " "          " "      "*"          " "          " "       
## 8  ( 2 ) " "   " "          " "      "*"          " "          " "       
## 8  ( 3 ) " "   " "          "*"      "*"          " "          " "       
## 8  ( 4 ) " "   " "          " "      "*"          " "          " "       
## 8  ( 5 ) " "   " "          " "      "*"          " "          " "       
## 9  ( 1 ) " "   " "          "*"      "*"          " "          " "       
## 9  ( 2 ) " "   " "          " "      "*"          " "          " "       
## 9  ( 3 ) " "   " "          "*"      "*"          " "          " "       
## 9  ( 4 ) " "   " "          "*"      "*"          " "          " "       
## 9  ( 5 ) " "   " "          " "      "*"          " "          " "       
##          Year.Remod.Add Mas.Vnr.Area BsmtFin.SF.1 BsmtFin.SF.2 Bsmt.Unf.SF
## 1  ( 1 ) " "            " "          " "          " "          " "        
## 1  ( 2 ) " "            " "          " "          " "          " "        
## 1  ( 3 ) " "            " "          " "          " "          " "        
## 1  ( 4 ) " "            " "          " "          " "          " "        
## 1  ( 5 ) " "            " "          " "          " "          " "        
## 2  ( 1 ) " "            " "          " "          " "          " "        
## 2  ( 2 ) " "            " "          " "          " "          " "        
## 2  ( 3 ) " "            " "          " "          " "          " "        
## 2  ( 4 ) " "            " "          " "          " "          " "        
## 2  ( 5 ) " "            " "          "*"          " "          " "        
## 3  ( 1 ) " "            " "          "*"          " "          " "        
## 3  ( 2 ) " "            " "          " "          " "          " "        
## 3  ( 3 ) " "            " "          " "          " "          " "        
## 3  ( 4 ) " "            " "          " "          " "          " "        
## 3  ( 5 ) " "            " "          " "          " "          " "        
## 4  ( 1 ) " "            " "          "*"          " "          " "        
## 4  ( 2 ) " "            " "          "*"          " "          " "        
## 4  ( 3 ) " "            " "          "*"          " "          " "        
## 4  ( 4 ) " "            " "          "*"          " "          " "        
## 4  ( 5 ) "*"            " "          "*"          " "          " "        
## 5  ( 1 ) " "            " "          "*"          " "          " "        
## 5  ( 2 ) "*"            " "          "*"          " "          " "        
## 5  ( 3 ) " "            " "          " "          " "          "*"        
## 5  ( 4 ) " "            " "          "*"          " "          " "        
## 5  ( 5 ) "*"            " "          "*"          " "          " "        
## 6  ( 1 ) "*"            " "          " "          " "          "*"        
## 6  ( 2 ) "*"            " "          " "          " "          "*"        
## 6  ( 3 ) "*"            " "          "*"          " "          " "        
## 6  ( 4 ) "*"            " "          "*"          " "          " "        
## 6  ( 5 ) "*"            " "          "*"          " "          " "        
## 7  ( 1 ) "*"            "*"          " "          " "          "*"        
## 7  ( 2 ) "*"            "*"          " "          " "          "*"        
## 7  ( 3 ) "*"            " "          "*"          " "          " "        
## 7  ( 4 ) "*"            "*"          "*"          " "          " "        
## 7  ( 5 ) "*"            "*"          "*"          " "          " "        
## 8  ( 1 ) "*"            "*"          " "          " "          "*"        
## 8  ( 2 ) "*"            "*"          " "          " "          "*"        
## 8  ( 3 ) "*"            "*"          " "          " "          "*"        
## 8  ( 4 ) "*"            "*"          "*"          " "          " "        
## 8  ( 5 ) "*"            "*"          " "          " "          "*"        
## 9  ( 1 ) "*"            "*"          " "          " "          "*"        
## 9  ( 2 ) "*"            "*"          "*"          " "          " "        
## 9  ( 3 ) "*"            "*"          " "          " "          "*"        
## 9  ( 4 ) "*"            "*"          "*"          " "          " "        
## 9  ( 5 ) "*"            "*"          "*"          " "          " "        
##          Total.Bsmt.SF X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF Gr.Liv.Area
## 1  ( 1 ) " "           " "         " "         " "             " "        
## 1  ( 2 ) " "           " "         " "         " "             "*"        
## 1  ( 3 ) " "           " "         " "         " "             " "        
## 1  ( 4 ) " "           " "         " "         " "             " "        
## 1  ( 5 ) "*"           " "         " "         " "             " "        
## 2  ( 1 ) " "           " "         " "         " "             "*"        
## 2  ( 2 ) " "           "*"         " "         " "             " "        
## 2  ( 3 ) "*"           " "         " "         " "             " "        
## 2  ( 4 ) " "           " "         " "         " "             " "        
## 2  ( 5 ) " "           " "         " "         " "             " "        
## 3  ( 1 ) " "           " "         " "         " "             "*"        
## 3  ( 2 ) "*"           " "         " "         " "             "*"        
## 3  ( 3 ) " "           "*"         "*"         " "             " "        
## 3  ( 4 ) " "           "*"         " "         " "             "*"        
## 3  ( 5 ) " "           " "         " "         " "             "*"        
## 4  ( 1 ) " "           " "         " "         " "             "*"        
## 4  ( 2 ) " "           " "         " "         " "             "*"        
## 4  ( 3 ) " "           " "         " "         " "             "*"        
## 4  ( 4 ) " "           " "         " "         " "             "*"        
## 4  ( 5 ) " "           " "         " "         " "             "*"        
## 5  ( 1 ) " "           " "         " "         " "             "*"        
## 5  ( 2 ) " "           " "         " "         " "             "*"        
## 5  ( 3 ) "*"           " "         " "         " "             "*"        
## 5  ( 4 ) "*"           " "         " "         " "             "*"        
## 5  ( 5 ) " "           " "         " "         " "             "*"        
## 6  ( 1 ) "*"           " "         " "         " "             "*"        
## 6  ( 2 ) "*"           " "         " "         " "             "*"        
## 6  ( 3 ) " "           "*"         "*"         " "             " "        
## 6  ( 4 ) "*"           " "         " "         " "             "*"        
## 6  ( 5 ) " "           "*"         "*"         " "             " "        
## 7  ( 1 ) "*"           " "         " "         " "             "*"        
## 7  ( 2 ) "*"           " "         " "         " "             "*"        
## 7  ( 3 ) " "           "*"         "*"         " "             " "        
## 7  ( 4 ) "*"           " "         " "         " "             "*"        
## 7  ( 5 ) " "           "*"         "*"         " "             " "        
## 8  ( 1 ) "*"           " "         " "         " "             "*"        
## 8  ( 2 ) "*"           " "         " "         " "             "*"        
## 8  ( 3 ) "*"           " "         " "         " "             "*"        
## 8  ( 4 ) " "           "*"         "*"         " "             " "        
## 8  ( 5 ) "*"           " "         " "         " "             "*"        
## 9  ( 1 ) "*"           " "         " "         " "             "*"        
## 9  ( 2 ) " "           "*"         "*"         " "             " "        
## 9  ( 3 ) "*"           " "         " "         " "             "*"        
## 9  ( 4 ) "*"           " "         " "         " "             "*"        
## 9  ( 5 ) " "           "*"         " "         " "             "*"        
##          Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath Bedroom.AbvGr
## 1  ( 1 ) " "            " "            " "       " "       " "          
## 1  ( 2 ) " "            " "            " "       " "       " "          
## 1  ( 3 ) " "            " "            " "       " "       " "          
## 1  ( 4 ) " "            " "            " "       " "       " "          
## 1  ( 5 ) " "            " "            " "       " "       " "          
## 2  ( 1 ) " "            " "            " "       " "       " "          
## 2  ( 2 ) " "            " "            " "       " "       " "          
## 2  ( 3 ) " "            " "            " "       " "       " "          
## 2  ( 4 ) " "            " "            " "       " "       " "          
## 2  ( 5 ) " "            " "            " "       " "       " "          
## 3  ( 1 ) " "            " "            " "       " "       " "          
## 3  ( 2 ) " "            " "            " "       " "       " "          
## 3  ( 3 ) " "            " "            " "       " "       " "          
## 3  ( 4 ) " "            " "            " "       " "       " "          
## 3  ( 5 ) " "            " "            " "       " "       " "          
## 4  ( 1 ) " "            " "            " "       " "       " "          
## 4  ( 2 ) " "            " "            " "       " "       " "          
## 4  ( 3 ) " "            " "            " "       " "       " "          
## 4  ( 4 ) " "            " "            " "       " "       " "          
## 4  ( 5 ) " "            " "            " "       " "       " "          
## 5  ( 1 ) " "            " "            " "       " "       " "          
## 5  ( 2 ) " "            " "            " "       " "       " "          
## 5  ( 3 ) " "            " "            " "       " "       " "          
## 5  ( 4 ) " "            " "            " "       " "       " "          
## 5  ( 5 ) " "            " "            " "       " "       " "          
## 6  ( 1 ) " "            " "            " "       " "       " "          
## 6  ( 2 ) " "            " "            " "       " "       " "          
## 6  ( 3 ) " "            " "            " "       " "       " "          
## 6  ( 4 ) " "            " "            " "       " "       " "          
## 6  ( 5 ) " "            " "            " "       " "       " "          
## 7  ( 1 ) " "            " "            " "       " "       " "          
## 7  ( 2 ) " "            " "            " "       " "       " "          
## 7  ( 3 ) " "            " "            " "       " "       " "          
## 7  ( 4 ) " "            " "            " "       " "       " "          
## 7  ( 5 ) " "            " "            " "       " "       " "          
## 8  ( 1 ) " "            " "            " "       " "       " "          
## 8  ( 2 ) " "            " "            " "       " "       " "          
## 8  ( 3 ) " "            " "            " "       " "       " "          
## 8  ( 4 ) " "            " "            " "       " "       " "          
## 8  ( 5 ) " "            " "            " "       " "       " "          
## 9  ( 1 ) " "            " "            " "       " "       " "          
## 9  ( 2 ) " "            " "            " "       " "       " "          
## 9  ( 3 ) " "            " "            " "       " "       " "          
## 9  ( 4 ) " "            " "            " "       " "       " "          
## 9  ( 5 ) " "            " "            " "       " "       " "          
##          Kitchen.AbvGr TotRms.AbvGrd Fireplaces Garage.Yr.Blt Garage.Cars
## 1  ( 1 ) " "           " "           " "        " "           " "        
## 1  ( 2 ) " "           " "           " "        " "           " "        
## 1  ( 3 ) " "           " "           " "        " "           "*"        
## 1  ( 4 ) " "           " "           " "        " "           " "        
## 1  ( 5 ) " "           " "           " "        " "           " "        
## 2  ( 1 ) " "           " "           " "        " "           " "        
## 2  ( 2 ) " "           " "           " "        " "           " "        
## 2  ( 3 ) " "           " "           " "        " "           " "        
## 2  ( 4 ) " "           " "           " "        " "           " "        
## 2  ( 5 ) " "           " "           " "        " "           " "        
## 3  ( 1 ) " "           " "           " "        " "           " "        
## 3  ( 2 ) " "           " "           " "        " "           " "        
## 3  ( 3 ) " "           " "           " "        " "           " "        
## 3  ( 4 ) " "           " "           " "        " "           " "        
## 3  ( 5 ) " "           " "           " "        " "           " "        
## 4  ( 1 ) " "           " "           " "        " "           " "        
## 4  ( 2 ) " "           " "           " "        " "           "*"        
## 4  ( 3 ) " "           " "           " "        " "           " "        
## 4  ( 4 ) " "           " "           " "        "*"           " "        
## 4  ( 5 ) " "           " "           " "        " "           " "        
## 5  ( 1 ) " "           " "           " "        " "           " "        
## 5  ( 2 ) " "           " "           " "        " "           " "        
## 5  ( 3 ) " "           " "           " "        " "           "*"        
## 5  ( 4 ) " "           " "           " "        " "           "*"        
## 5  ( 5 ) " "           " "           " "        " "           "*"        
## 6  ( 1 ) " "           " "           " "        " "           "*"        
## 6  ( 2 ) " "           " "           " "        " "           " "        
## 6  ( 3 ) " "           " "           " "        " "           "*"        
## 6  ( 4 ) " "           " "           " "        " "           "*"        
## 6  ( 5 ) " "           " "           " "        " "           " "        
## 7  ( 1 ) " "           " "           " "        " "           "*"        
## 7  ( 2 ) " "           " "           " "        " "           " "        
## 7  ( 3 ) "*"           " "           " "        " "           "*"        
## 7  ( 4 ) " "           " "           " "        " "           "*"        
## 7  ( 5 ) " "           " "           " "        " "           "*"        
## 8  ( 1 ) " "           " "           " "        " "           "*"        
## 8  ( 2 ) " "           " "           " "        " "           " "        
## 8  ( 3 ) " "           " "           " "        " "           "*"        
## 8  ( 4 ) "*"           " "           " "        " "           "*"        
## 8  ( 5 ) "*"           " "           " "        " "           "*"        
## 9  ( 1 ) " "           " "           " "        " "           "*"        
## 9  ( 2 ) "*"           " "           " "        " "           "*"        
## 9  ( 3 ) " "           " "           " "        " "           " "        
## 9  ( 4 ) " "           " "           " "        " "           "*"        
## 9  ( 5 ) "*"           " "           " "        " "           "*"        
##          Garage.Area Wood.Deck.SF Open.Porch.SF Enclosed.Porch X3Ssn.Porch
## 1  ( 1 ) " "         " "          " "           " "            " "        
## 1  ( 2 ) " "         " "          " "           " "            " "        
## 1  ( 3 ) " "         " "          " "           " "            " "        
## 1  ( 4 ) "*"         " "          " "           " "            " "        
## 1  ( 5 ) " "         " "          " "           " "            " "        
## 2  ( 1 ) " "         " "          " "           " "            " "        
## 2  ( 2 ) " "         " "          " "           " "            " "        
## 2  ( 3 ) " "         " "          " "           " "            " "        
## 2  ( 4 ) "*"         " "          " "           " "            " "        
## 2  ( 5 ) " "         " "          " "           " "            " "        
## 3  ( 1 ) " "         " "          " "           " "            " "        
## 3  ( 2 ) " "         " "          " "           " "            " "        
## 3  ( 3 ) " "         " "          " "           " "            " "        
## 3  ( 4 ) " "         " "          " "           " "            " "        
## 3  ( 5 ) "*"         " "          " "           " "            " "        
## 4  ( 1 ) "*"         " "          " "           " "            " "        
## 4  ( 2 ) " "         " "          " "           " "            " "        
## 4  ( 3 ) " "         " "          " "           " "            " "        
## 4  ( 4 ) " "         " "          " "           " "            " "        
## 4  ( 5 ) " "         " "          " "           " "            " "        
## 5  ( 1 ) "*"         " "          " "           " "            " "        
## 5  ( 2 ) "*"         " "          " "           " "            " "        
## 5  ( 3 ) " "         " "          " "           " "            " "        
## 5  ( 4 ) " "         " "          " "           " "            " "        
## 5  ( 5 ) " "         " "          " "           " "            " "        
## 6  ( 1 ) " "         " "          " "           " "            " "        
## 6  ( 2 ) "*"         " "          " "           " "            " "        
## 6  ( 3 ) " "         " "          " "           " "            " "        
## 6  ( 4 ) " "         " "          " "           " "            " "        
## 6  ( 5 ) "*"         " "          " "           " "            " "        
## 7  ( 1 ) " "         " "          " "           " "            " "        
## 7  ( 2 ) "*"         " "          " "           " "            " "        
## 7  ( 3 ) " "         " "          " "           " "            " "        
## 7  ( 4 ) " "         " "          " "           " "            " "        
## 7  ( 5 ) " "         " "          " "           " "            " "        
## 8  ( 1 ) " "         " "          " "           " "            " "        
## 8  ( 2 ) "*"         " "          " "           " "            " "        
## 8  ( 3 ) " "         " "          " "           " "            " "        
## 8  ( 4 ) " "         " "          " "           " "            " "        
## 8  ( 5 ) " "         " "          " "           " "            " "        
## 9  ( 1 ) " "         " "          " "           " "            " "        
## 9  ( 2 ) " "         " "          " "           " "            " "        
## 9  ( 3 ) "*"         " "          " "           " "            " "        
## 9  ( 4 ) " "         " "          " "           " "            " "        
## 9  ( 5 ) " "         " "          " "           " "            " "        
##          Screen.Porch Pool.Area Misc.Val Mo.Sold Yr.Sold
## 1  ( 1 ) " "          " "       " "      " "     " "    
## 1  ( 2 ) " "          " "       " "      " "     " "    
## 1  ( 3 ) " "          " "       " "      " "     " "    
## 1  ( 4 ) " "          " "       " "      " "     " "    
## 1  ( 5 ) " "          " "       " "      " "     " "    
## 2  ( 1 ) " "          " "       " "      " "     " "    
## 2  ( 2 ) " "          " "       " "      " "     " "    
## 2  ( 3 ) " "          " "       " "      " "     " "    
## 2  ( 4 ) " "          " "       " "      " "     " "    
## 2  ( 5 ) " "          " "       " "      " "     " "    
## 3  ( 1 ) " "          " "       " "      " "     " "    
## 3  ( 2 ) " "          " "       " "      " "     " "    
## 3  ( 3 ) " "          " "       " "      " "     " "    
## 3  ( 4 ) " "          " "       " "      " "     " "    
## 3  ( 5 ) " "          " "       " "      " "     " "    
## 4  ( 1 ) " "          " "       " "      " "     " "    
## 4  ( 2 ) " "          " "       " "      " "     " "    
## 4  ( 3 ) " "          " "       " "      " "     " "    
## 4  ( 4 ) " "          " "       " "      " "     " "    
## 4  ( 5 ) " "          " "       " "      " "     " "    
## 5  ( 1 ) " "          " "       " "      " "     " "    
## 5  ( 2 ) " "          " "       " "      " "     " "    
## 5  ( 3 ) " "          " "       " "      " "     " "    
## 5  ( 4 ) " "          " "       " "      " "     " "    
## 5  ( 5 ) " "          " "       " "      " "     " "    
## 6  ( 1 ) " "          " "       " "      " "     " "    
## 6  ( 2 ) " "          " "       " "      " "     " "    
## 6  ( 3 ) " "          " "       " "      " "     " "    
## 6  ( 4 ) " "          " "       " "      " "     " "    
## 6  ( 5 ) " "          " "       " "      " "     " "    
## 7  ( 1 ) " "          " "       " "      " "     " "    
## 7  ( 2 ) " "          " "       " "      " "     " "    
## 7  ( 3 ) " "          " "       " "      " "     " "    
## 7  ( 4 ) " "          " "       " "      " "     " "    
## 7  ( 5 ) " "          " "       " "      " "     " "    
## 8  ( 1 ) " "          " "       "*"      " "     " "    
## 8  ( 2 ) " "          " "       "*"      " "     " "    
## 8  ( 3 ) " "          " "       " "      " "     " "    
## 8  ( 4 ) " "          " "       " "      " "     " "    
## 8  ( 5 ) " "          " "       " "      " "     " "    
## 9  ( 1 ) " "          " "       "*"      " "     " "    
## 9  ( 2 ) " "          " "       "*"      " "     " "    
## 9  ( 3 ) " "          " "       "*"      " "     " "    
## 9  ( 4 ) " "          " "       "*"      " "     " "    
## 9  ( 5 ) " "          " "       "*"      " "     " "
Bestfitreg = lm(data = AmesHousing_1$only_numeric,
                SalePrice ~ Mas.Vnr.Area+ Gr.Liv.Area + BsmtFin.SF.1)

t13 = summary(Bestfitreg)

tab_model(t13)
                        SalePrice
Predictors              Estimates    CI                     p
(Intercept)             16123.04     10289.45 – 21956.63    <0.001
Mas Vnr Area            89.03        77.70 – 100.36         <0.001
Gr Liv Area             90.67        86.77 – 94.57          <0.001
BsmtFin SF 1            44.34        40.19 – 48.49          <0.001
Observations            2930
R² / R² adjusted        0.615 / 0.615


Observation:

Because the three continuous variables in the earlier model were chosen arbitrarily, we can look for a best-fit model by running regsubsets() on only the numeric columns of the Ames Housing dataset.
regsubsets() returned the five best models for each subset size. Across these models, Mas Vnr Area and Overall Quality are the most common independent variables. Apart from these two, Gr Liv Area, Kitchen AbvGr, and Misc Val also appear among the best-fit independent variables.
I will use Mas.Vnr.Area, Gr.Liv.Area, and BsmtFin.SF.1 for the best-fit model.

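Multiple linear regression prediction equation: \(y'=a+b_1x_1+b_2x_2+b_3x_3\)


\(y' = 16123.04 + 89.03x_1 + 90.67x_2 + 44.34x_3\), where \(x_1\) is Mas.Vnr.Area, \(x_2\) is Gr.Liv.Area, and \(x_3\) is BsmtFin.SF.1.
The adjusted \(R^2\) is 0.615, which means the model explains about 61.5% of the variance in sale price.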
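
As a worked example of the fitted equation, the sketch below plugs values into it by hand. The coefficients are copied from the tab_model() output above; the function name predict_sale_price and the input values (100, 1500, and 500 square feet) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the tab_model() output above
b <- c(intercept = 16123.04,
       mas_vnr_area = 89.03,
       gr_liv_area = 90.67,
       bsmtfin_sf1 = 44.34)

# Hypothetical helper: evaluates y' = a + b1*x1 + b2*x2 + b3*x3
predict_sale_price <- function(mas_vnr_area, gr_liv_area, bsmtfin_sf1) {
  unname(b["intercept"] +
           b["mas_vnr_area"] * mas_vnr_area +
           b["gr_liv_area"] * gr_liv_area +
           b["bsmtfin_sf1"] * bsmtfin_sf1)
}

# Hypothetical house: 100 sq ft veneer, 1500 sq ft living area, 500 sq ft finished basement
example_price <- predict_sale_price(100, 1500, 500)
print(example_price)  # 16123.04 + 8903 + 136005 + 22170 = 183201.04
```

With the fitted model object available, predict(Bestfitreg, newdata = ...) should give the same value up to the rounding of the printed coefficients.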

#Comparing two models#####

AIC(T7_RegMdl, Bestfitreg) %>%
  kable(caption = "<center>Comparing two models</center>",
        align = "c") %>%
  kable_styling(bootstrap_options = c("hover",
                                        "bordered",
                                       "condensed",
                                       "responsive",
                                       "striped"),
                font_size = 11) %>%
   scroll_box(width = "100%", height = "100%")
Comparing two models
              df    AIC
T7_RegMdl     5     72400.85
Bestfitreg    5     71673.36

Observation:

In this task I compare the best-fit model with the earlier model to see which one predicts sale price more effectively. The adjusted \(R^2\) of the first model (T7_RegMdl) is 50.71%, and the adjusted \(R^2\) of Bestfitreg is 61.5%, so Bestfitreg explains more of the variance in sale price. The AIC supports the same conclusion: lower values indicate a better fit, and Bestfitreg has an AIC of 71673.36 compared with 72400.85 for T7_RegMdl.
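
The sketch below shows where the AIC values and the df of 5 come from, assuming the built-in mtcars data and two illustrative models (not the Ames models); manual_aic is a hypothetical helper. It also illustrates why stepAIC() printed 64083.87 for T7_RegMdl while AIC() reports 72400.85: stepAIC() relies on extractAIC(), which for linear models omits a constant that depends only on \(n\), so the two scales differ by \(n(1+\log 2\pi)+2\), which is about 8316.98 for \(n = 2930\).

```r
# Illustrative sketch on the built-in mtcars data (not the Ames models).
fit_a <- lm(mpg ~ wt + hp + disp, data = mtcars)
fit_b <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Hypothetical helper: AIC = 2k - 2*logLik, where k counts the three slopes,
# the intercept, and the residual variance (k = 5 here)
manual_aic <- function(fit) {
  ll <- logLik(fit)
  k <- attr(ll, "df")
  2 * k - 2 * as.numeric(ll)
}

print(AIC(fit_a, fit_b))                        # lower AIC indicates the preferred model
print(c(manual_aic(fit_a), manual_aic(fit_b)))  # matches the AIC column

# extractAIC() (used by stepAIC) drops a constant that depends only on n
n <- nobs(fit_a)
gap <- AIC(fit_a) - extractAIC(fit_a)[2]
print(gap)
print(n * (1 + log(2 * pi)) + 2)                # same value as gap
```

Because the offset is identical for every model fitted to the same observations, both scales rank the models the same way.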

Conclusion:

Tasks 2 and 3 provided a statistical overview of the dataset and replaced NAs with the mean so that the best-fit regression model could be evaluated. It is crucial for a researcher to understand the data first: how many variables there are, what types they are, and what their summary statistics look like. These tools provide the descriptive statistics that are often required in a study. The describe() function reports these statistics for the dataset, which saves time and helps in visualizing the data.
Task 4 onwards focused on the correlation matrix and scatter plots, which visualize the relationships between the independent and dependent variables. Beyond that, checking multicollinearity, identifying outliers, and searching subsets of the model for a best fit with regsubsets() are important parts of the analysis.
Throughout this project I used the statistical tests appropriate to the multiple correlation being tested. A researcher occasionally needs to test the significance of the relationship between or among independent and dependent variables, and knowing which statistical test to use, the t test or the F test, is important in hypothesis testing.

This project gave me hands-on experience with correlation and regression testing involving more than one independent variable.

References:


Bluman, A. (2014). Elementary statistics: A step by step approach. McGraw-Hill Education.
Kabacoff, R. (2015). R in Action. Manning Publications Co.
Soetewey, A. (2020). Correlogram in R: How to highlight the most correlated variables in a dataset. Stats and R. https://statsandr.com/blog/correlogram-in-r-how-to-highlight-the-most-correlated-variables-in-a-dataset/
Wei, T., & Simko, V. (2021). An introduction to corrplot package. CRAN. https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html#visualize-non-correlation-matrix-na-value-and-math-label
Wickham, H. (2022). Flexibly reshape data: A reboot of the reshape package. CRAN. https://cran.r-project.org/web/packages/reshape2/reshape2.pdf