We pick the \(X3\) column vector for the independent variable, and the \(Y2\) column vector for the dependent variable. We assume \(x\) is estimated at \(X3\)’s third quartile and \(y\) at \(Y2\)’s first quartile.
# Choose Y2 for dependent variable
Y <- c(20.8, 14.6, 18, 7.3, 19.4, 13.5, 14.7, 15.3, 12.6, 13, 13.1, 10.3, 14.9, 14.8, 16.2, 15.7, 16.3, 11.5, 12.2, 11.8)
# Choose X3 for independent variable
X <- c(9.5, 3.7, 11.7, 7.4, 5.3, 7.4, 7.4, 8.6, 9.1, 11.4, 8.4, 7.3, 11.3, 4.4, 9.3, 10.9, 10.9, 7.7, 7.7, 11.5)
# Compile Y2 and X3 in a dataframe
prob_data <- data.frame(Y, X)
# Review a summary of the distribution
summary(prob_data)
## Y X
## Min. : 7.30 Min. : 3.700
## 1st Qu.:12.50 1st Qu.: 7.400
## Median :14.65 Median : 8.500
## Mean :14.30 Mean : 8.545
## 3rd Qu.:15.82 3rd Qu.:10.900
## Max. :20.80 Max. :11.700
\(X\)’s third quartile value of \(10.9\) appears twice within the distribution - to avoid it not being counted in probability calculcations we’ll treat the quartile inequalities as \(\leq\) and \(>\).
# Designate quantiles for independent and dependent variables
X_threshold1 <- quantile(X)[[4]]
X_threshold2 <- quantile(X)[[2]]
Y_threshold <- quantile(Y)[[2]]
# Calculate all the cases
ALLXY <- prob_data %>% tally()
XQ13 <- prob_data %>% filter(X <= X_threshold1) %>% tally()
XQ24 <- prob_data %>% filter(X > X_threshold2) %>% tally()
XQ4 <- prob_data %>% filter(X > X_threshold1) %>% tally()
YQ1 <- prob_data %>% filter(Y <= Y_threshold) %>% tally()
YQ24 <- prob_data %>% filter(Y > Y_threshold) %>% tally()
XQ13_YQ1 <- prob_data %>% filter(X <= X_threshold1 & Y <= Y_threshold) %>% tally()
XQ4_YQ1 <- prob_data %>% filter(X > X_threshold1 & Y <= Y_threshold) %>% tally()
XQ13_YQ24 <- prob_data %>% filter(X <= X_threshold1 & Y > Y_threshold) %>% tally()
XQ4_YQ24 <- prob_data %>% filter(X > X_threshold1 & Y > Y_threshold) %>% tally()
XQ24_YQ24 <- prob_data %>% filter(X > X_threshold2 & Y > Y_threshold) %>% tally()
We calculate the requested conditional probabilities:
## Y is greater than the first quartile in 15 out of 20 cases.
##
## X is greater than the third quartile in 4 out of 20 cases.
##
## The probability that X is greater than the third quartile conditioned on Y being greater than the first quartile is 0.2 .
## Y is greater than the first quartile in 15 out of 20 cases.
##
## X is greater than the third quartile in 4 out of 20 cases.
##
## The probability that X is greater than the third quartile and that Y is greater than the first quartile is 0.8.
## Y is greater than the first quartile in 15 out of 20 cases.
##
## X is less than the third quartile in 16 out of 20 cases.
##
## The probability that X is less than the third quartile conditioned on Y being greater than the first quartile is 0.8.
We put together a contingency table to summarize the counts.
# Build table with totals
contingency_table <- matrix(c(
XQ13_YQ1, XQ4_YQ1, YQ1,
XQ13_YQ24, XQ4_YQ24, YQ24,
XQ13, XQ4, ALLXY),
nrow = 3, ncol = 3, byrow = TRUE)
colnames(contingency_table) <- c('<= 3d quartile', '>3d quartile', 'Total')
rownames(contingency_table) = c('<= 1st quartile', '> 1st quartile', 'Total')
contingency_table
## <= 3d quartile >3d quartile Total
## <= 1st quartile 4 1 5
## > 1st quartile 12 3 15
## Total 16 4 20
Splitting the the training data this way does not make them independent. Two random variables are independent if the probability distribution of one variable is not affected by the presence of another. The sampling method described - letting \(A\) be the new variable counting those observations above the 1st quartile for \(X\), and \(B\) be the new variable counting those observations above the 1st quartile for \(Y\) - deliberately selects a specific segment of the distribution. Since this is non-random, it is unlikely that this will result in indepedence.
We compute the probabilities \(P(A)\), \(P(B)\), \(P(A) \times P(B)\), and \(P(AB)\).
## P(A) is 0.65.
##
## P(B) is 0.75.
##
## P(A) X P(B) is 0.5.
##
## P(AB) is 0.4875.
\(P(AB)\) and \(P(A) \times P(B)\) are not equivalent.
We check for association with a \(\chi^2\) test, treating the quartile associations as two factors with distinct categories.
##
## Pearson's Chi-squared test
##
## data: XQ24_vec
## X-squared = 2.7217, df = 12, p-value = 0.9972
The p-value is much larger than .05 threshold (nearly 1). Accordinlgy, we cannot be confident about rejecting the null hypothesis that there is no association between the two. This is consistent with the intuition that, as sampling is not random, the variables are unlikely to be independent.
We load in the Ames housing dataset from Kaggle’s House Prices: Advanced Regression Techniques competition.
# Load the train and test CSVs manually
train_path <- file.path(getwd(), "train.csv")
train_df <- read_csv(file = train_path, col_names = TRUE)
## Parsed with column specification:
## cols(
## .default = col_character(),
## Id = col_integer(),
## MSSubClass = col_integer(),
## LotFrontage = col_integer(),
## LotArea = col_integer(),
## OverallQual = col_integer(),
## OverallCond = col_integer(),
## YearBuilt = col_integer(),
## YearRemodAdd = col_integer(),
## MasVnrArea = col_integer(),
## BsmtFinSF1 = col_integer(),
## BsmtFinSF2 = col_integer(),
## BsmtUnfSF = col_integer(),
## TotalBsmtSF = col_integer(),
## `1stFlrSF` = col_integer(),
## `2ndFlrSF` = col_integer(),
## LowQualFinSF = col_integer(),
## GrLivArea = col_integer(),
## BsmtFullBath = col_integer(),
## BsmtHalfBath = col_integer(),
## FullBath = col_integer()
## # ... with 18 more columns
## )
## See spec(...) for full column specifications.
test_path <- file.path(getwd(), "test.csv")
test_df <- read_csv(file = test_path, col_names = TRUE)
## Parsed with column specification:
## cols(
## .default = col_character(),
## Id = col_integer(),
## MSSubClass = col_integer(),
## LotFrontage = col_integer(),
## LotArea = col_integer(),
## OverallQual = col_integer(),
## OverallCond = col_integer(),
## YearBuilt = col_integer(),
## YearRemodAdd = col_integer(),
## MasVnrArea = col_integer(),
## BsmtFinSF1 = col_integer(),
## BsmtFinSF2 = col_integer(),
## BsmtUnfSF = col_integer(),
## TotalBsmtSF = col_integer(),
## `1stFlrSF` = col_integer(),
## `2ndFlrSF` = col_integer(),
## LowQualFinSF = col_integer(),
## GrLivArea = col_integer(),
## BsmtFullBath = col_integer(),
## BsmtHalfBath = col_integer(),
## FullBath = col_integer()
## # ... with 17 more columns
## )
## See spec(...) for full column specifications.
head(train_df)
## # A tibble: 6 x 81
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## <int> <int> <chr> <int> <int> <chr> <chr> <chr>
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## # ... with 73 more variables: LandContour <chr>, Utilities <chr>,
## # LotConfig <chr>, LandSlope <chr>, Neighborhood <chr>,
## # Condition1 <chr>, Condition2 <chr>, BldgType <chr>, HouseStyle <chr>,
## # OverallQual <int>, OverallCond <int>, YearBuilt <int>,
## # YearRemodAdd <int>, RoofStyle <chr>, RoofMatl <chr>,
## # Exterior1st <chr>, Exterior2nd <chr>, MasVnrType <chr>,
## # MasVnrArea <int>, ExterQual <chr>, ExterCond <chr>, Foundation <chr>,
## # BsmtQual <chr>, BsmtCond <chr>, BsmtExposure <chr>,
## # BsmtFinType1 <chr>, BsmtFinSF1 <int>, BsmtFinType2 <chr>,
## # BsmtFinSF2 <int>, BsmtUnfSF <int>, TotalBsmtSF <int>, Heating <chr>,
## # HeatingQC <chr>, CentralAir <chr>, Electrical <chr>, `1stFlrSF` <int>,
## # `2ndFlrSF` <int>, LowQualFinSF <int>, GrLivArea <int>,
## # BsmtFullBath <int>, BsmtHalfBath <int>, FullBath <int>,
## # HalfBath <int>, BedroomAbvGr <int>, KitchenAbvGr <int>,
## # KitchenQual <chr>, TotRmsAbvGrd <int>, Functional <chr>,
## # Fireplaces <int>, FireplaceQu <chr>, GarageType <chr>,
## # GarageYrBlt <int>, GarageFinish <chr>, GarageCars <int>,
## # GarageArea <int>, GarageQual <chr>, GarageCond <chr>,
## # PavedDrive <chr>, WoodDeckSF <int>, OpenPorchSF <int>,
## # EnclosedPorch <int>, `3SsnPorch` <int>, ScreenPorch <int>,
## # PoolArea <int>, PoolQC <chr>, Fence <chr>, MiscFeature <chr>,
## # MiscVal <int>, MoSold <int>, YrSold <int>, SaleType <chr>,
## # SaleCondition <chr>, SalePrice <int>
head(test_df)
## # A tibble: 6 x 80
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## <int> <int> <chr> <int> <int> <chr> <chr> <chr>
## 1 1461 20 RH 80 11622 Pave <NA> Reg
## 2 1462 20 RL 81 14267 Pave <NA> IR1
## 3 1463 60 RL 74 13830 Pave <NA> IR1
## 4 1464 60 RL 78 9978 Pave <NA> IR1
## 5 1465 120 RL 43 5005 Pave <NA> IR1
## 6 1466 60 RL 75 10000 Pave <NA> IR1
## # ... with 72 more variables: LandContour <chr>, Utilities <chr>,
## # LotConfig <chr>, LandSlope <chr>, Neighborhood <chr>,
## # Condition1 <chr>, Condition2 <chr>, BldgType <chr>, HouseStyle <chr>,
## # OverallQual <int>, OverallCond <int>, YearBuilt <int>,
## # YearRemodAdd <int>, RoofStyle <chr>, RoofMatl <chr>,
## # Exterior1st <chr>, Exterior2nd <chr>, MasVnrType <chr>,
## # MasVnrArea <int>, ExterQual <chr>, ExterCond <chr>, Foundation <chr>,
## # BsmtQual <chr>, BsmtCond <chr>, BsmtExposure <chr>,
## # BsmtFinType1 <chr>, BsmtFinSF1 <int>, BsmtFinType2 <chr>,
## # BsmtFinSF2 <int>, BsmtUnfSF <int>, TotalBsmtSF <int>, Heating <chr>,
## # HeatingQC <chr>, CentralAir <chr>, Electrical <chr>, `1stFlrSF` <int>,
## # `2ndFlrSF` <int>, LowQualFinSF <int>, GrLivArea <int>,
## # BsmtFullBath <int>, BsmtHalfBath <int>, FullBath <int>,
## # HalfBath <int>, BedroomAbvGr <int>, KitchenAbvGr <int>,
## # KitchenQual <chr>, TotRmsAbvGrd <int>, Functional <chr>,
## # Fireplaces <int>, FireplaceQu <chr>, GarageType <chr>,
## # GarageYrBlt <int>, GarageFinish <chr>, GarageCars <int>,
## # GarageArea <int>, GarageQual <chr>, GarageCond <chr>,
## # PavedDrive <chr>, WoodDeckSF <int>, OpenPorchSF <int>,
## # EnclosedPorch <int>, `3SsnPorch` <int>, ScreenPorch <int>,
## # PoolArea <int>, PoolQC <chr>, Fence <chr>, MiscFeature <chr>,
## # MiscVal <int>, MoSold <int>, YrSold <int>, SaleType <chr>,
## # SaleCondition <chr>
First, we summarize the distribution of the dependent variable in this housing dataset, the sales prices of homes, labelled SalePrice.
# Check for missing values
is.na(train_df$SalePrice) %>% sum()
## [1] 0
# Calculate the mean
train_df$SalePrice %>% describe()
## vars n mean sd median trimmed mad min max range
## X1 1 1460 180921.2 79442.5 163000 170783.3 56338.8 34900 755000 720100
## skew kurtosis se
## X1 1.88 6.5 2079.11
# Plot the distribution
train_df %>% ggplot(aes(x = SalePrice, fill = ..count..)) +
geom_histogram(bins=45) +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma) +
labs(title = 'Distribution: Sale Price', x = 'Sale Price ($)', y = 'Count') +
theme_minimal() +
theme(plot.title = element_text(hjust = .5), legend.position = 'none')
There are 1,460 homes in the dataset, and a sales price for each of them. The lowest price is $34,900 and the highest is $755,000, with a mean value of $180,921. Prices show a pronounced rightward skew.
Documentation on the Ames Dataset suggests removing observations that are outliers or partial sales - all of which have square footage exceeding 4,000 square feet. As a best practice we could exclude these observations as the first act of data cleaning, but that would lead to a result set with a different length than that required by Kaggle.
# confirm 1stFlrSF is the right filter variable
train_df %>%
filter(GrLivArea >= 4000) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 4
Turning our attention to other features, we check for missing values in the training dataset.
# Check for noisome NAs
sapply(train_df, function(x) sum(is.na(x)))
## Id MSSubClass MSZoning LotFrontage LotArea
## 0 0 0 259 0
## Street Alley LotShape LandContour Utilities
## 0 1369 0 0 0
## LotConfig LandSlope Neighborhood Condition1 Condition2
## 0 0 0 0 0
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 0 0 0 0 0
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## 0 0 0 0 0
## MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 8 8 0 0 0
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 37 37 38 37 0
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## 38 0 0 0 0
## HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF
## 0 0 1 0 0
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 0 0 0 0 0
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 0 0 0 0 0
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 0 0 690 81 81
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## 81 0 0 81 81
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch
## 0 0 0 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## 0 0 1453 1179 1406
## MiscVal MoSold YrSold SaleType SaleCondition
## 0 0 0 0 0
## SalePrice
## 0
Rather than impute values or remove observations, in this analysis we’ll exclude features with NAs.
# Remove the affected columns
clean_df <- train_df %>%
dplyr::select(-c(LotFrontage, Alley, MasVnrType, MasVnrArea, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, Electrical, FireplaceQu, GarageType, GarageYrBlt, GarageFinish, GarageQual, GarageCond, PoolQC, Fence, MiscFeature))
A large number of features remain - some categorical, some ordinal, and a number quantitative.
# Summary of all features not missing values
str(clean_df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1460 obs. of 62 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ 1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ 2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ 3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 81
## .. ..$ Id : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ MSSubClass : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ MSZoning : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ LotFrontage : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ LotArea : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Street : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Alley : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ LotShape : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ LandContour : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Utilities : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ LotConfig : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ LandSlope : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Neighborhood : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Condition1 : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Condition2 : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ BldgType : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ HouseStyle : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ OverallQual : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ OverallCond : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ YearBuilt : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ YearRemodAdd : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ RoofStyle : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ RoofMatl : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Exterior1st : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Exterior2nd : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ MasVnrType : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ MasVnrArea : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ ExterQual : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ ExterCond : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Foundation : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ BsmtQual : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ BsmtCond : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ BsmtExposure : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ BsmtFinType1 : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ BsmtFinSF1 : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ BsmtFinType2 : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ BsmtFinSF2 : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ BsmtUnfSF : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ TotalBsmtSF : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Heating : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ HeatingQC : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ CentralAir : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Electrical : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ 1stFlrSF : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ 2ndFlrSF : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ LowQualFinSF : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ GrLivArea : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ BsmtFullBath : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ BsmtHalfBath : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ FullBath : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ HalfBath : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ BedroomAbvGr : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ KitchenAbvGr : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ KitchenQual : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ TotRmsAbvGrd : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Functional : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Fireplaces : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ FireplaceQu : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ GarageType : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ GarageYrBlt : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ GarageFinish : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ GarageCars : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ GarageArea : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ GarageQual : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ GarageCond : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ PavedDrive : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ WoodDeckSF : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ OpenPorchSF : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ EnclosedPorch: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ 3SsnPorch : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ ScreenPorch : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ PoolArea : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ PoolQC : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Fence : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ MiscFeature : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ MiscVal : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ MoSold : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ YrSold : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ SaleType : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ SaleCondition: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ SalePrice : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
There’s a lot to investigate in the training dataset, and a good way to understand relationships between the variables would be to conduct pairwise visualization. However, the number of factor levels for categorical variables makes this computationally prohibitive. As the aim of this analysis is regression, we reduce this dimensionality to focus on quantitative variables of interest before looking at correlations.
# Gather the quantitative variables of potential interest and crate a dataframe based on them
quantvar_subset1 <- c('Id', 'YearBuilt', 'YearRemodAdd', 'YrSold', 'LotArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'SalePrice')
working_df = clean_df[quantvar_subset1]
Narrowing the feature set to just quantitative variables still leaves around twenty variables - still quite a few. We visualize correlations between the features, as well as with sale prices.
# Remove Id, then create and visualize correlation matrix
working_df %>%
dplyr::select(-Id) %>%
cor() %>%
ggcorrplot(insig = 'blank', # leaves blank if not significant
colors = c('midnightblue', 'white', 'red2')) + # provides custom color range
labs(title = 'Correlation between Quantitative Variables', x = '', y = '') +
theme_minimal() +
theme(plot.title = element_text(hjust = .5),
legend.position = 'none',
axis.text.x = element_text(size = 8, angle = 90),
axis.text.y = element_text(size = 8))
In this grid, the stronger the positive correlation, the redder the cell; the stronger the negative correlation, the bluer. Cells that are completely white demonstrate no relationship, or more likely, none that is statistically significant.
There are few negative correlations in this dataset, and the most not particular insightful. For example:
With the exception of second floor area and overall area, the strongest positive correlations are with sale prices, the most marked of which are found:
GrLivArea, first floor 1stFlrSF, total basement TotalBsmtSF, and garage GarageArea; and,YearBuilt and when remodeled YearRemodAdd - which is the same value if a home was never remodeled.The year a home was built shows the most correlation with garage and total basement and less with first floor and total living area; the year it was remodeled has similar correlations with size, though total living area is higher and basement, lower.
We look at a summary of descriptive statistics for each feature.
# Create summary of univariate descriptive statistics
working_df_stats <- as_tibble(psych::describe(working_df))
working_df_stats <- tibble::rownames_to_column(working_df_stats, var = "feature")
working_df_stats_simple <- working_df_stats %>% dplyr::select(-c(vars, n, trimmed, mad, range, kurtosis, se))
working_df_stats_simple
## # A tibble: 21 x 7
## feature mean sd median min max skew
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Id 730. 422. 730. 1 1460 0
## 2 YearBuilt 1971. 30.2 1973 1872 2010 -0.612
## 3 YearRemodAdd 1985. 20.6 1994 1950 2010 -0.503
## 4 YrSold 2008. 1.33 2008 2006 2010 0.0961
## 5 LotArea 10517. 9981. 9478. 1300 215245 12.2
## 6 BsmtFinSF1 444. 456. 384. 0 5644 1.68
## 7 BsmtFinSF2 46.5 161. 0 0 1474 4.25
## 8 BsmtUnfSF 567. 442. 478. 0 2336 0.918
## 9 TotalBsmtSF 1057. 439. 992. 0 6110 1.52
## 10 1stFlrSF 1163. 387. 1087 334 4692 1.37
## 11 2ndFlrSF 347. 437. 0 0 2065 0.811
## 12 LowQualFinSF 5.84 48.6 0 0 572 8.99
## 13 GrLivArea 1515. 525. 1464 334 5642 1.36
## 14 GarageArea 473. 214. 480 0 1418 0.180
## 15 WoodDeckSF 94.2 125. 0 0 857 1.54
## 16 OpenPorchSF 46.7 66.3 25 0 547 2.36
## 17 3SsnPorch 3.41 29.3 0 0 508 10.3
## 18 ScreenPorch 15.1 55.8 0 0 480 4.11
## 19 PoolArea 2.76 40.2 0 0 738 14.8
## 20 MiscVal 43.5 496. 0 0 15500 24.4
## 21 SalePrice 180921. 79443. 163000 34900 755000 1.88
We dig into three quantiative variables in particular: GrLivArea, GarageArea, and YearRemodAdd. We start with a smaller correlation matrix specific to the three.
# Revised data frame based on those quantitative variables
narrow_working_df <- working_df %>% dplyr::select(Id, SalePrice, GrLivArea, GarageArea, YearRemodAdd)
# Subset data for pairwise comparison
narrow_corr_plot <- narrow_working_df %>%
dplyr::select(-Id, -SalePrice) %>%
ggpairs() +
theme_minimal()
narrow_corr_plot
GrLivArea is right-skewed - that is, there is a long tail of large homes, more of which were remodeled or possibly built (when homes have not been remodeled, this feature takes the same value at YearBuilt) in the nineties and aughts unsurprisingly bringing with them larger garages. It’s positivelty correlated with the other two features, more with GarageArea (.469) than YearRemodAdd (.287).
GarageArea follows an approximately normal distribution, save for the bump represented by homes without indoor or covered parking which is distributed across both home large and small and old and new. It’s positively correalted with YearRemodAdd (.372).
We already discussed YearRemodAdd’s in the context of the other features, so a quick word on its distribution: its left-skewed, representing either an uptick in remodeled and/or new homes from the nineties peaking in the mid-aughts (seemingly before 2007-8).
We also visualize the linear relationships of the features with the response variable: first, between the total living area of a home and its sale price, them the home’s garage area, and lastly the year the home was remodeled.
# GrLivArea vs. SalePrice: fit simple linear model and append predicted value and residuals
GrLivArea_lm <- lm(SalePrice ~ GrLivArea, data = narrow_working_df)
narrow_working_df$GrLivArea_pred <- predict(GrLivArea_lm)
narrow_working_df$GrLivArea_resid <- residuals(GrLivArea_lm)
# GrLivArea vs. SalePrice: scatterplot with regression line and residual visualization
GrLivArea_residviz <- ggplot(data = narrow_working_df, aes(x = GrLivArea, y = SalePrice)) +
geom_segment(aes(xend = GrLivArea, yend = GrLivArea_pred), alpha = .2) +
geom_smooth(method = 'lm', se = FALSE, color = 'darkgray') +
geom_point(aes(color = GrLivArea_resid), size = 1.5) +
scale_color_gradient2(low = 'midnightblue', mid = 'white', high = 'red2') +
guides(color = FALSE) +
geom_point(aes(y = GrLivArea_pred), size = 1.5, shape = 1) +
labs(title = 'Linear Model with Residuals: Total Living Area vs. Sale Price', x = 'Total Living Area (sq ft)', y = 'Sale Price ($)') +
ylim(0, 800000) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
# GrLivArea vs. SalePrice: plot residuals
GrLivArea_residplot <- ggplot(data = GrLivArea_lm, aes(x = .fitted, y = .resid)) +
geom_point(aes(y = .resid, color = .resid)) +
scale_color_gradient2(low = "midnightblue", mid = 'white', high = 'red2') +
stat_smooth(method = 'loess', se = TRUE, fill = 'gray95', color = 'darkgray') +
geom_hline(yintercept = 0, col = "black", linetype = "dashed", alpha = .8, size = .5) +
guides(color = FALSE) +
labs(title = 'Residual Plot: Total Living Area vs. Sale Price', x = 'Fitted Values', y = 'Residuals') +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
# GrLivArea vs. SalePrice: QQ-plot residuals
GrLivArea_qqplot <- ggplot(data = narrow_working_df, aes(sample = GrLivArea)) +
stat_qq(size = 1.5) +
stat_qq_line(color = 'darkgray') +
labs(title = 'Quartile-Quartile Plot: Total Living Area vs. Sale Price', x = "Theoretical Quantiles", y = "Standardized Residuals") +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
# GarageArea vs. SalePrice: fit simple linear model and append predicted value and residuals to GarageArea vs. SalePrice
GarageArea_lm <- lm(SalePrice ~ GarageArea, data = narrow_working_df)
narrow_working_df$GarageArea_pred <- predict(GarageArea_lm)
narrow_working_df$GarageArea_resid <- residuals(GarageArea_lm)
# GarageArea vs. SalePrice: scatterplot with regression line and residual visualization
GarageArea_residviz <- ggplot(data = narrow_working_df, aes(x = GarageArea, y = SalePrice)) +
geom_segment(aes(xend = GarageArea, yend = GarageArea_pred), alpha = .2) +
geom_smooth(method = 'lm', se = FALSE, color = 'darkgray') +
geom_point(aes(color = GarageArea_resid), size = 1.5) +
scale_color_gradient2(low = 'midnightblue', mid = 'white', high = 'red2') +
guides(color = FALSE) +
geom_point(aes(y = GarageArea_pred), size = 1.5, shape = 1) +
labs(title = 'Linear Model with Residuals: Garage Area vs. Sale Price', x = 'Garage Area (sq ft)', y = 'Sale Price ($)') +
ylim(0, 800000) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
# GarageArea vs. SalePrice: plot residuals
GarageArea_residplot <- ggplot(data = GarageArea_lm, aes(x = .fitted, y = .resid)) +
geom_point(aes(y = .resid, color = .resid)) +
scale_color_gradient2(low = "midnightblue", mid = 'white', high = 'red2') +
stat_smooth(method = 'loess', se = TRUE, fill = 'gray95', color = 'darkgray') +
geom_hline(yintercept = 0, col = "black", linetype = "dashed", alpha = .8, size = .5) +
guides(color = FALSE) +
labs(title = 'Residual Plot: : Garage Area vs. Sale Price', x = 'Fitted Values', y = 'Residuals') +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
# GarageArea vs. SalePrice: QQ-plot residuals
GarageArea_qqplot <- ggplot(data = narrow_working_df, aes(sample = GarageArea)) +
stat_qq(size = 1.5) +
stat_qq_line(color = 'darkgray') +
labs(title = 'Quartile-Quartile Plot: Garage Living Area vs. Sale Price', x = "Theoretical Quantiles", y = "Standardized Residuals") +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
# YearRemodAdd vs. SalePrice: fit simple linear model and append predicted value and residuals to original dataframe
YearRemodAdd_lm <- lm(SalePrice ~ YearRemodAdd, data = narrow_working_df)
narrow_working_df$YearRemodAdd_pred <- predict(YearRemodAdd_lm)
narrow_working_df$YearRemodAdd_resid <- residuals(YearRemodAdd_lm)
# YearRemodAdd vs. SalePrice: scatterplot with regression line and residual visualization
YearRemodAdd_residviz <- ggplot(data = narrow_working_df, aes(x = YearRemodAdd, y = SalePrice)) +
geom_segment(aes(xend = YearRemodAdd, yend = YearRemodAdd_pred), alpha = .2) +
geom_smooth(method = 'lm', se = FALSE, color = 'darkgray') +
geom_point(aes(color = YearRemodAdd_resid), size = 1.5) +
scale_color_gradient2(low = 'midnightblue', mid = 'white', high = 'red2') +
guides(color = FALSE) +
geom_point(aes(y = YearRemodAdd_pred), size = 1.5, shape = 1) +
labs(title = 'Linear Model with Residuals: Year Remodeled', x = 'Year Remodeled', y = 'Sale Price ($)') +
ylim(0, 800000) +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
# YearRemodAdd vs. SalePrice: plot residuals
YearRemodAdd_residplot <- ggplot(data = YearRemodAdd_lm, aes(x = .fitted, y = .resid)) +
geom_point(aes(y = .resid, color = .resid)) +
scale_color_gradient2(low = "midnightblue", mid = 'white', high = 'red2') +
stat_smooth(method = 'loess', se = TRUE, fill = 'gray95', color = 'darkgray') +
geom_hline(yintercept = 0, col = "black", linetype = "dashed", alpha = .8, size = .5) +
guides(color = FALSE) +
labs(title = 'Residual Plot: Year Remodeled', x = 'Fitted Values', y = 'Residuals') +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
# YearRemodAdd vs. SalePrice: QQ-plot residuals
YearRemodAdd_qqplot <- ggplot(data = narrow_working_df, aes(sample = YearRemodAdd)) +
stat_qq(size = 1.5) +
stat_qq_line(color = 'darkgray') +
labs(title = 'Quartile-Quartile Plot: Year Remodeled', x = "Theoretical Quantiles", y = "Standardized Residuals") +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
# summary(GrLivArea_lm)
GrLivArea_residviz
GrLivArea_residplot
GrLivArea_qqplot
# summary(GarageArea_lm)
GarageArea_residviz
GarageArea_residplot
GarageArea_qqplot
# summary(YearRemodAdd_lm)
YearRemodAdd_residviz
YearRemodAdd_residplot
YearRemodAdd_qqplot
GrLivArea reveals a positive relationship with SalePrice. The QQ plot underscores the feature’s pronounced positive, right-ward skew, and outliers are exerting leverage. Residuals grow in size for large values of SalePrice, meaning that the distribution isn’t very normal.
GarageArea shows similar patterns - though the relationship with SalePrice isn’t as strong, there are outliers and positive skew, and the tail impact of homes with no garage space is discernible in the residuals.
Interestingly, YearRemodAdd by itself has an even tamer positive relationship with SalePrice. Records seem to commence around 1950 and remodeling / new home construction falls off in the mid-aughts, so the negative skew in the QQ-plot needs to be interpreted. Based on the residuals, the longitudinal distribution does not appear normal.
We briefly revisit correlations between the features.
# Confirm correlations at the 80% confidence level
GrLivArea_GarageArea_corr <- cor.test(
x = narrow_working_df$GrLivArea, y = narrow_working_df$GarageArea, conf.level = .8)
GarageArea_YearRemodAdd_corr <- cor.test(
x = narrow_working_df$GarageArea, y = narrow_working_df$YearRemodAdd, conf.level = .8)
YearRemodAdd_GrLivArea_corr <- cor.test(
x = narrow_working_df$YearRemodAdd, y = narrow_working_df$GrLivArea, conf.level = .8)
cat(paste0('Correlation between total living area and garage area at the 80% confidence level is ', round(as.numeric(GrLivArea_GarageArea_corr$estimate), 4), '.'))
## Correlation between total living area and garage area at the 80% confidence level is 0.469.
cat(paste0('\nCorrelation between garage area and year remodeled at the 80% confidence level is ', round(as.numeric(GarageArea_YearRemodAdd_corr$estimate), 4), '.'))
##
## Correlation between garage area and year remodeled at the 80% confidence level is 0.3716.
cat(paste0('\nCorrelation between year remodeled and total living area at the 80% confidence level is ', round(as.numeric(YearRemodAdd_GrLivArea_corr$estimate), 4), '.'))
##
## Correlation between year remodeled and total living area at the 80% confidence level is 0.2874.
This confirms that Pearson test provided by the correaltion matrix were at the 80% confidence level, and confirms that \(P < .05\) for all three pairs.
Familywise error is the probability of coming to at least one false conclusion in a series of hyptohesis tests. To address this, we can group the confidence level across the three tests: \(\beta = .80\) across three tests equates to \(\beta = (1 - \frac{1 - .8}{3} = .933)\) for each test. We rerun the correlation tests at this confidence level.
# Conduct correlations at 93.3% confidence level
cor.test(x = narrow_working_df$GrLivArea, y = narrow_working_df$GarageArea, conf.level = .93333)
##
## Pearson's product-moment correlation
##
## data: narrow_working_df$GrLivArea and narrow_working_df$GarageArea
## t = 20.276, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 93.333 percent confidence interval:
## 0.4306870 0.5056208
## sample estimates:
## cor
## 0.4689975
cor.test(x = narrow_working_df$GarageArea, y = narrow_working_df$YearRemodAdd, conf.level = .93333)
##
## Pearson's product-moment correlation
##
## data: narrow_working_df$GarageArea and narrow_working_df$YearRemodAdd
## t = 15.283, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 93.333 percent confidence interval:
## 0.3294698 0.4122530
## sample estimates:
## cor
## 0.3715998
cor.test(x = narrow_working_df$YearRemodAdd, y = narrow_working_df$GrLivArea, conf.level = .93333)
##
## Pearson's product-moment correlation
##
## data: narrow_working_df$YearRemodAdd and narrow_working_df$GrLivArea
## t = 11.457, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 93.333 percent confidence interval:
## 0.2427298 0.3308317
## sample estimates:
## cor
## 0.2873885
At the 93.3% confidence level \(P < .05\) for all three pairs, so we can dismiss the null hypothesis.
We invert the correlation matrix to yield the precision matrix.
narrow_corr_matrix <- narrow_working_df %>%
dplyr::select(GrLivArea, GarageArea, YearRemodAdd) %>%
cor()
# Invert corrlation matrix
prec_matrix <- solve(narrow_corr_matrix)
cat('Correlation Matrix \n ')
## Correlation Matrix
##
round(narrow_corr_matrix, 8)
## GrLivArea GarageArea YearRemodAdd
## GrLivArea 1.0000000 0.4689975 0.2873885
## GarageArea 0.4689975 1.0000000 0.3715998
## YearRemodAdd 0.2873885 0.3715998 1.0000000
cat('\n Precision Matrix \n')
##
## Precision Matrix
round(prec_matrix, 8)
## GrLivArea GarageArea YearRemodAdd
## GrLivArea 1.3068513 -0.5491812 -0.1714984
## GarageArea -0.5491812 1.3909928 -0.3590643
## YearRemodAdd -0.1714984 -0.3590643 1.1827149
We multiply these two matrixes by one another to yield the identity matrix.
# Multiply correlation and precision matrix
check_matrix1 <- prec_matrix %*% narrow_corr_matrix
check_matrix2 <- narrow_corr_matrix %*% prec_matrix
cat('Precision Matrix X Correlation Matrix \n ')
## Precision Matrix X Correlation Matrix
##
round(check_matrix1, 8)
## GrLivArea GarageArea YearRemodAdd
## GrLivArea 1 0 0
## GarageArea 0 1 0
## YearRemodAdd 0 0 1
cat('\n Correlation Matrix X Precision Matrix \n ')
##
## Correlation Matrix X Precision Matrix
##
round(check_matrix2, 8)
## GrLivArea GarageArea YearRemodAdd
## GrLivArea 1 0 0
## GarageArea 0 1 0
## YearRemodAdd 0 0 1
Lastly, we conduct lower-upper decomposition on the matrix, using built-in functions rather than working long-hand.
# Perform LU decomposition
lu_matrix <- lu.decomposition(narrow_corr_matrix)
cat('LU Decomposition Matrix \n \n')
## LU Decomposition Matrix
##
lu_matrix
## $L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] 0.4689975 1.0000000 0
## [3,] 0.2873885 0.3035933 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 4.689975e-01 0.2873885
## [2,] 0 7.800414e-01 0.2368153
## [3,] 0 2.775558e-17 0.8455123
We explore features in the dataset skewed to the right.
# Skew by feature
working_df_stats %>%
dplyr::select(feature, skew) %>%
arrange(desc(skew))
## # A tibble: 21 x 2
## feature skew
## <chr> <dbl>
## 1 MiscVal 24.4
## 2 PoolArea 14.8
## 3 LotArea 12.2
## 4 3SsnPorch 10.3
## 5 LowQualFinSF 8.99
## 6 BsmtFinSF2 4.25
## 7 ScreenPorch 4.11
## 8 OpenPorchSF 2.36
## 9 SalePrice 1.88
## 10 BsmtFinSF1 1.68
## 11 WoodDeckSF 1.54
## 12 TotalBsmtSF 1.52
## 13 1stFlrSF 1.37
## 14 GrLivArea 1.36
## 15 BsmtUnfSF 0.918
## 16 2ndFlrSF 0.811
## 17 GarageArea 0.180
## 18 YrSold 0.0961
## 19 Id 0
## 20 YearRemodAdd -0.503
## 21 YearBuilt -0.612
Most features show positive (i.e. rightward) skew, with only a few revealing left skew. Of the features most right-skewed, MiscVal is definitionally non-specific and PoolArea’s incidence is somewhat sparse - so we choose LotArea, or the square footage of the home’s lot.
We scale LotArea so its minimum is just above zero.
# Find the minimum value and add a new feature that scales it just above 0
scaling_value <- min(working_df$LotArea) * .999999999999999
working_df <- working_df %>%
mutate(scaled_LotArea = LotArea - scaling_value) # straight subtraction does not alter the correlation
# Confirm it worked
working_df %>%
dplyr::select(Id, LotArea, scaled_LotArea) %>%
head()
## # A tibble: 6 x 3
## Id LotArea scaled_LotArea
## <int> <int> <dbl>
## 1 1 8450 7150.
## 2 2 9600 8300.
## 3 3 11250 9950.
## 4 4 9550 8250.
## 5 5 14260 12960.
## 6 6 14115 12815.
We fit an exponential probability distribution function, take 1,000 samples from the distribution based on the value of \(\lambda\), and then compare that sampled distribution to the actual.
# Fit exponential distribution
exp_fit <- fitdistr(working_df$scaled_LotArea, densfun = 'exponential')
exp_fit_lambda <- as.numeric(exp_fit$estimate)
exp_fit_distr <- as.data.frame(rexp(n = 1000, rate = exp_fit_lambda))
colnames(exp_fit_distr) <- 'sampled_values'
exp_fit_mean <- mean(exp_fit_distr$sampled_values)
exp_fit_stdev <- exp_fit$sd
lambda_ci <- c(exp_fit_mean + c(-1, 1) * 1.96 * exp_fit_stdev) # confidence interval @ 95% for sampled distribution
# Plot a histogram of the sampled exponential distribution
exp_fit_plot <- ggplot(data = exp_fit_distr, aes(sampled_values)) +
geom_histogram(binwidth = 5000, fill = 'midnightblue') +
scale_x_continuous(limits = c(0, 100000), labels = comma, breaks = seq(from = 0, to = 100000, by = 10000)) +
scale_y_continuous(limits = c(0, 500)) +
labs(title = 'Scaled Lot Area (Sampled)', x = 'Lot Area (sq ft)', y = 'Count') +
theme_minimal() +
theme(plot.title = element_text(hjust = .5), legend.position = 'none')
# Plot a histogram of the empirical distribution
scaled_LotArea_plot <- ggplot(data = working_df, aes(scaled_LotArea)) +
geom_histogram(binwidth = 5000, fill = 'dodgerblue3') +
scale_x_continuous(limits = c(0, 100000), labels = comma, breaks = seq(from = 0, to = 100000, by = 10000)) +
scale_y_continuous(limits = c(0, 500)) +
labs(title = 'Scaled Lot Area (Empirical)', x = 'Lot Area (sq ft)', y = 'Count') +
theme_minimal() +
theme(plot.title = element_text(hjust = .5), legend.position = 'none')
We also find the 5th and 95th percentiles using the CDF.
# Calculate mean and standard deviation of empirical distribution
actual_mean <- mean(working_df$scaled_LotArea)
actual_stdev <- sd(working_df$scaled_LotArea)
# [EXPLORE DISCREPANCY BETWEEN THE SIMULATED AND EMPIRICAL MEANS]
cat(paste0('The sampled distribution\'s mean is ', as.character(round(exp_fit_mean, 3)), '.'))
## The sampled distribution's mean is 8986.702.
cat(paste0('\nThe empirical distribution\'s mean is ', as.character(round(actual_mean, 3)), '.'))
##
## The empirical distribution's mean is 9216.828.
# The 5th and 95th percentiles of the exponential CDF
exp_quantiles <- qexp(p = c(.05, .95), rate = exp_fit_lambda)
cat(paste0('\nThe 5th percentile of the exponential cumulative distribution function is ', as.character(round(exp_quantiles[1]), 3), ' and the 95th percentile is ', as.character(round(exp_quantiles[2]), 3), '.'))
##
## The 5th percentile of the exponential cumulative distribution function is 473 and the 95th percentile is 27611.
Though both distributions are based on the same expontial \(\lambda\), the empirical distribution is less peaked and has a higher minimum value. The density of the empirical distribution seems logical, given that there is likely less residential demand for larger properties as well as a regulatory optimum for lot size. Controlling for the long tail of large lots, a more normal distribution might better approximate.
We build a mutliple regression model using stepwise backward elimination. As GrLivArea and TotalBsmtSF are highly correlated with 1stFlrSF, we won’t include it. TotalBsmtSF is also correlated with BsmtFinSF1, so we keep that out. We include LotArea but not scaled_LotArea.
We also check for missing values in the test dataset that will hamper are ability to make predictions.
# Check for noisome NAs
sapply(test_df, function(x) sum(is.na(x)))
## Id MSSubClass MSZoning LotFrontage LotArea
## 0 0 4 227 0
## Street Alley LotShape LandContour Utilities
## 0 1352 0 0 2
## LotConfig LandSlope Neighborhood Condition1 Condition2
## 0 0 0 0 0
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 0 0 0 0 0
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## 0 0 0 1 1
## MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 16 15 0 0 0
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 44 45 44 42 1
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## 42 1 1 1 0
## HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF
## 0 0 0 0 0
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 0 0 2 2 0
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 0 0 0 1 0
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 2 0 730 76 78
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## 78 1 1 78 78
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch
## 0 0 0 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## 0 0 1456 1169 1408
## MiscVal MoSold YrSold SaleType SaleCondition
## 0 0 0 1 0
The following features we would consider including in the regression are missing values in the test dataset: BsmtFinSF2, BsmtUnfSf, TotalBsmtSF, and GarageArea. We remove these from the model to avoid downstream issues with prediction on test data.
# We'll start with the following quantitative features
multregr_fit <- lm(formula = SalePrice ~
YearBuilt + YearRemodAdd + YrSold + `2ndFlrSF` + LowQualFinSF +
GrLivArea + WoodDeckSF + OpenPorchSF + `3SsnPorch` + ScreenPorch +
PoolArea + MiscVal + LotArea,
data = working_df
)
summary(multregr_fit)
##
## Call:
## lm(formula = SalePrice ~ YearBuilt + YearRemodAdd + YrSold +
## `2ndFlrSF` + LowQualFinSF + GrLivArea + WoodDeckSF + OpenPorchSF +
## `3SsnPorch` + ScreenPorch + PoolArea + MiscVal + LotArea,
## data = working_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -498006 -20230 -2687 16807 308272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.571e+05 1.705e+06 -0.503 0.615318
## YearBuilt 6.946e+02 4.823e+01 14.401 < 2e-16 ***
## YearRemodAdd 5.863e+02 6.980e+01 8.400 < 2e-16 ***
## YrSold -8.258e+02 8.495e+02 -0.972 0.331154
## `2ndFlrSF` -3.836e+01 3.703e+00 -10.361 < 2e-16 ***
## LowQualFinSF -8.251e+01 2.398e+01 -3.440 0.000598 ***
## GrLivArea 1.086e+02 3.480e+00 31.199 < 2e-16 ***
## WoodDeckSF 4.397e+01 9.581e+00 4.589 4.83e-06 ***
## OpenPorchSF 3.015e+01 1.823e+01 1.654 0.098438 .
## `3SsnPorch` 3.407e+01 3.841e+01 0.887 0.375275
## ScreenPorch 1.002e+02 2.046e+01 4.899 1.07e-06 ***
## PoolArea -5.431e+01 2.849e+01 -1.906 0.056794 .
## MiscVal -1.328e+00 2.265e+00 -0.586 0.557722
## LotArea 5.167e-01 1.200e-01 4.304 1.79e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42800 on 1446 degrees of freedom
## Multiple R-squared: 0.7123, Adjusted R-squared: 0.7097
## F-statistic: 275.4 on 13 and 1446 DF, p-value: < 2.2e-16
Based on p-values of the features, candidates for removal from the model in stepwise backward elimination are: MiscVal, 3SsnPorch, YrSold, and OpenPorchSF. Pulling out in indiviual iterations led to the same conclusion, so we extract them all in one go.
# Remove multiple features, jumping several steps of iteration
multregr_fit2 <- lm(formula = SalePrice ~
YearBuilt + YearRemodAdd + `2ndFlrSF` + LowQualFinSF +
GrLivArea + WoodDeckSF + ScreenPorch + PoolArea + LotArea,
data = working_df
)
summary(multregr_fit2)
##
## Call:
## lm(formula = SalePrice ~ YearBuilt + YearRemodAdd + `2ndFlrSF` +
## LowQualFinSF + GrLivArea + WoodDeckSF + ScreenPorch + PoolArea +
## LotArea, data = working_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -497392 -20233 -2667 16816 305240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.548e+06 1.160e+05 -21.961 < 2e-16 ***
## YearBuilt 7.031e+02 4.807e+01 14.626 < 2e-16 ***
## YearRemodAdd 5.941e+02 6.934e+01 8.569 < 2e-16 ***
## `2ndFlrSF` -3.859e+01 3.697e+00 -10.439 < 2e-16 ***
## LowQualFinSF -8.202e+01 2.398e+01 -3.420 0.000643 ***
## GrLivArea 1.100e+02 3.411e+00 32.239 < 2e-16 ***
## WoodDeckSF 4.237e+01 9.550e+00 4.437 9.81e-06 ***
## ScreenPorch 1.005e+02 2.041e+01 4.924 9.43e-07 ***
## PoolArea -5.299e+01 2.844e+01 -1.863 0.062636 .
## LotArea 5.185e-01 1.199e-01 4.323 1.64e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42820 on 1450 degrees of freedom
## Multiple R-squared: 0.7113, Adjusted R-squared: 0.7095
## F-statistic: 396.9 on 9 and 1450 DF, p-value: < 2.2e-16
\(R^{2}\) has dropped slightly. As PoolArea’s p-value exceeds the \(.05\) threshold, we try removing that feature in one more iteration of the model.
# We'll iterate manually and jump steps
multregr_fit3 <- lm(formula = SalePrice ~
YearBuilt + YearRemodAdd + `2ndFlrSF` + LowQualFinSF +
GrLivArea + WoodDeckSF + ScreenPorch + LotArea,
data = working_df
)
summary(multregr_fit3)
##
## Call:
## lm(formula = SalePrice ~ YearBuilt + YearRemodAdd + `2ndFlrSF` +
## LowQualFinSF + GrLivArea + WoodDeckSF + ScreenPorch + LotArea,
## data = working_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -519312 -20227 -2631 16853 288125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.557e+06 1.160e+05 -22.041 < 2e-16 ***
## YearBuilt 7.038e+02 4.811e+01 14.630 < 2e-16 ***
## YearRemodAdd 5.986e+02 6.936e+01 8.630 < 2e-16 ***
## `2ndFlrSF` -3.830e+01 3.697e+00 -10.360 < 2e-16 ***
## LowQualFinSF -8.351e+01 2.399e+01 -3.481 0.000514 ***
## GrLivArea 1.091e+02 3.385e+00 32.242 < 2e-16 ***
## WoodDeckSF 4.170e+01 9.551e+00 4.366 1.36e-05 ***
## ScreenPorch 9.928e+01 2.042e+01 4.863 1.28e-06 ***
## LotArea 5.144e-01 1.200e-01 4.286 1.94e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42850 on 1451 degrees of freedom
## Multiple R-squared: 0.7106, Adjusted R-squared: 0.709
## F-statistic: 445.4 on 8 and 1451 DF, p-value: < 2.2e-16
\(R^{2}\) fell with the removal of PoolArea, so we keep it and proceed with the previous iteration to predict SalePrice in the test dataset. As the remaining features have p-values below \(.05\) threshold, we halt backward elimination here.
We examine the model’s residual fit as well as a quantile-quantile plot.
# Multiple regression model: append predicted value and residuals
working_df$multregr_pred <- predict(multregr_fit2)
working_df$multregr_resid <- residuals(multregr_fit2)
# GrLivArea vs. SalePrice: plot residuals
multregr_residplot <- ggplot(data = multregr_fit2, aes(x = .fitted, y = .resid)) +
geom_point(aes(y = .resid, color = .resid)) +
scale_color_gradient2(low = "midnightblue", mid = 'white', high = 'red2') +
stat_smooth(method = 'loess', se = TRUE, fill = 'gray95', color = 'darkgray') +
geom_hline(yintercept = 0, col = "black", linetype = "dashed", alpha = .8, size = .5) +
guides(color = FALSE) +
labs(title = 'Residual Plot: Multiple Regression on Sale Price', x = 'Fitted Values', y = 'Residuals') +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
# GrLivArea vs. SalePrice: QQ-plot residuals
multregr_qqplot <- ggplot(data = working_df, aes(sample = SalePrice)) +
stat_qq(size = 1.5) +
stat_qq_line(color = 'darkgray') +
labs(title = 'Quartile-Quartile Plot: Multiple Regression on Sale Price', x = "Theoretical Quantiles", y = "Standardized Residuals") +
theme_minimal() +
theme(plot.title = element_text(hjust = .5))
multregr_residplot
multregr_qqplot
While the $R^{2} is relatively high, the residuals do not display normal variance, suggeesting that this model could benefit from extensive tuning - which I’ll get to learn about next semester in DATA 621.
We predict SalePrice of homes in the holdout test dataset based on the multiple linear regression, and then save those predictions for submission to Kaggle.
# Add the scaled_LotArea feature to the test set for completeness
test_df2 <- test_df %>%
mutate(scaled_LotArea = LotArea - scaling_value) # %>%
# dplyr::select(Id, YearBuilt, YearRemodAdd, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, LowQualFinSF, GrLivArea, GarageArea, WoodDeckSF, ScreenPorch, PoolArea, scaled_LotArea)
# Check it worked
test_df2 %>%
dplyr::select(Id, scaled_LotArea) %>%
head()
## # A tibble: 6 x 2
## Id scaled_LotArea
## <int> <dbl>
## 1 1461 10322.
## 2 1462 12967.
## 3 1463 12530.
## 4 1464 8678.
## 5 1465 3705.
## 6 1466 8700.
# Predict SalePrice of the test dataset
test_pred <- predict(object = multregr_fit, newdata = test_df2)
# Convert predicted values to tabular format
test_pred_df <- data.frame(cbind(test_df2$Id, test_pred))
colnames(test_pred_df) = c('Id', 'SalePrice')
test_pred_df$Id <- as.integer(test_pred_df$Id)
# Confirm data frame structure before exporting
nrow(test_pred_df)
## [1] 1459
colnames(test_pred_df)
## [1] "Id" "SalePrice"
# Export to CSV for submission to Kaggle
submit_path <- file.path(getwd(), "JO_submission.csv")
write_csv(x = test_pred_df, col_names = TRUE, path = submit_path)
My user name is jeremyobrien16, and this multiple regression was scored \(.21748\) on Kaggle.