I uploaded the kaggle datasets to my GitHub, reading them here for reproducability.
test_url <- "https://raw.githubusercontent.com/andrewbowen19/computationalMath605/main/data/test.csv"
train_url <- "https://raw.githubusercontent.com/andrewbowen19/computationalMath605/main/data/train.csv"
test <- read.csv(test_url)
train <- read.csv(train_url)
summary(train)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig LandSlope
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Neighborhood Condition1 Condition2 BldgType
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HouseStyle OverallQual OverallCond YearBuilt
## Length:1460 Min. : 1.000 Min. :1.000 Min. :1872
## Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Mode :character Median : 6.000 Median :5.000 Median :1973
## Mean : 6.099 Mean :5.575 Mean :1971
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## Max. :10.000 Max. :9.000 Max. :2010
##
## YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1950 Length:1460 Length:1460 Length:1460
## 1st Qu.:1967 Class :character Class :character Class :character
## Median :1994 Mode :character Mode :character Mode :character
## Mean :1985
## 3rd Qu.:2004
## Max. :2010
##
## Exterior2nd MasVnrType MasVnrArea ExterQual
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 0.0 Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## ExterCond Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical X1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature MiscVal
## Length:1460 Length:1460 Length:1460 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 43.49
## 3rd Qu.: 0.00
## Max. :15500.00
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 Length:1460 Length:1460
## 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character
## Median : 6.000 Median :2008 Mode :character Mode :character
## Mean : 6.322 Mean :2008
## 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
Creating a scatter plot of the basement square footage and sale price ($)
ggplot(train, aes(x=TotalBsmtSF, y=SalePrice)) + geom_point() + xlab("Basement Square Footage") + ylab("Sale Price ($)")
Creating another scatter plot between the lot area (\(ft^2\)) and SalePrice
ggplot(train, aes(x=LotArea, y=SalePrice)) + geom_point() + xlab("Lot Area (ft^2)") + ylab("Sale Price ($)")
Generating a correlation matrix between our three quantitative
variables above (LotArea, TotalBsmtSF,
SalePrice)
(corTrain <- cor(train[,c("LotArea", "TotalBsmtSF", "SalePrice") ]))
## LotArea TotalBsmtSF SalePrice
## LotArea 1.0000000 0.2608331 0.2638434
## TotalBsmtSF 0.2608331 1.0000000 0.6135806
## SalePrice 0.2638434 0.6135806 1.0000000
Running cor.test for each permutation of quantitative
variables above
for (i in colnames(corTrain)){
for (j in colnames(corTrain)){
if (i != j){
ctest <- cor.test(corTrain[,c(i)], corTrain[,c(j)], conf.level=0.8)
print(ctest)
}
}
}
##
## Pearson's product-moment correlation
##
## data: corTrain[, c(i)] and corTrain[, c(j)]
## t = -1.6444, df = 1, p-value = 0.3478
## alternative hypothesis: true correlation is not equal to 0
## sample estimates:
## cor
## -0.8544218
##
##
## Pearson's product-moment correlation
##
## data: corTrain[, c(i)] and corTrain[, c(j)]
## t = -1.6097, df = 1, p-value = 0.3539
## alternative hypothesis: true correlation is not equal to 0
## sample estimates:
## cor
## -0.8494291
##
##
## Pearson's product-moment correlation
##
## data: corTrain[, c(i)] and corTrain[, c(j)]
## t = -1.6444, df = 1, p-value = 0.3478
## alternative hypothesis: true correlation is not equal to 0
## sample estimates:
## cor
## -0.8544218
##
##
## Pearson's product-moment correlation
##
## data: corTrain[, c(i)] and corTrain[, c(j)]
## t = 0.50613, df = 1, p-value = 0.7017
## alternative hypothesis: true correlation is not equal to 0
## sample estimates:
## cor
## 0.4515869
##
##
## Pearson's product-moment correlation
##
## data: corTrain[, c(i)] and corTrain[, c(j)]
## t = -1.6097, df = 1, p-value = 0.3539
## alternative hypothesis: true correlation is not equal to 0
## sample estimates:
## cor
## -0.8494291
##
##
## Pearson's product-moment correlation
##
## data: corTrain[, c(i)] and corTrain[, c(j)]
## t = 0.50613, df = 1, p-value = 0.7017
## alternative hypothesis: true correlation is not equal to 0
## sample estimates:
## cor
## 0.4515869
For each combination of variables tested, we have p-values greater than our significance level (\(\alpha = 0.2\)). This indicates that we can not reject our null hypothesis that the true population correlation between variables is equal to 0.
det(corTrain)
## [1] 0.5703238
Our determinant is not equal to 0, so we should be
able to invert our correlation matrix. We can do this using the built-in
solve function in R:
(invCor <- solve(corTrain))
## LotArea TotalBsmtSF SalePrice
## LotArea 1.0932718 -0.1734874 -0.1820040
## TotalBsmtSF -0.1734874 1.6313307 -0.9551793
## SalePrice -0.1820040 -0.9551793 1.6341000
corTrain %*% invCor
## LotArea TotalBsmtSF SalePrice
## LotArea 1.000000e+00 0 0.000000e+00
## TotalBsmtSF 0.000000e+00 1 2.220446e-16
## SalePrice -2.775558e-17 0 1.000000e+00
Accounting for some floating point errors, the above matrix product is the identity matrix, which holds for the multiplication of a matrix \(A\) with its inverse \(A^{-1}\): \(I = AA^{-1}\)
LU Decomposition of our correlation matrix from above using the
matrixcalc package
(LU <- lu.decomposition(corTrain))
## $L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] 0.2608331 1.0000000 0
## [3,] 0.2638434 0.5845293 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0.2608331 0.2638434
## [2,] 0 0.9319661 0.5447615
## [3,] 0 0.0000000 0.6119577
Let’s verify that our LU decomposition does in fact equal our original correlation matrix when multiplied together: \(A = LU\)
A <- LU$L %*% LU$U
A == corTrain
## LotArea TotalBsmtSF SalePrice
## LotArea TRUE TRUE TRUE
## TotalBsmtSF TRUE TRUE TRUE
## SalePrice TRUE TRUE TRUE
Let’s use the MasVnrArea variable, which is definitely
skew-right. We’ll need to remove a few NA values from our
data first
df <- train %>% filter(!is.na(MasVnrArea))
qplot(df$MasVnrArea)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Calling fitdistr from the MASS package to
fit our dataset variable to an exponential
# Fit our distribution to an exponential
fit <- fitdistr(df$MasVnrArea, "exponential")
# Getting the rate poarameter lambda
lambda <- fit$estimate[["rate"]]
Now let’s generate random values with our lambda
variable as the rate parameter
dat <- rexp(1000, lambda)
randValues <- as.data.frame(dat, col.names=character(1))
hist(randValues[,1], col=rgb(0,0,1,0.5))
hist(df$MasVnrArea, add=TRUE, col=rgb(1,0,0,0.5))
These both look to be skew-right with similar shapes (although the Kaggle data has a bit longer tail). Overall, these historgrams are similarly shaped and likely would have been pulled from the same distribution
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
p5 <- pexp(0.05, rate=lambda)
p95 <- pexp(0.95, rate=lambda)
Let’s now construct
a 95% confidence interval from our Kaggle variable
MasVnrArea. To do this we’ll need our sample mean and
standard deviation for this field
n <- length(df$MasVnrArea)
mu <- mean(df$MasVnrArea)
sigma <- sd(df$MasVnrArea)
margin <- qt(0.975,df=n-1)*sigma/sqrt(n)
lower_interval <- mu - margin
upper_interval <- mu + margin
print(lower_interval)
## [1] 94.36422
print(upper_interval)
## [1] 113.0063
Getting 5th and 95th percentile of our MasVnrArea
variable
quantile(df$MasVnrArea, c(0.05, 0.95))
## 5% 95%
## 0 456
mylm <- lm(SalePrice ~ FullBath + LotArea + OverallCond + GarageArea, train)
summary(mylm)
##
## Call:
## lm(formula = SalePrice ~ FullBath + LotArea + OverallCond + GarageArea,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -259833 -29066 -4312 20005 426711
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.008e+04 9.529e+03 -2.107 0.035314 *
## FullBath 5.336e+04 2.890e+03 18.462 < 2e-16 ***
## LotArea 1.073e+00 1.467e-01 7.315 4.23e-13 ***
## OverallCond 4.587e+03 1.321e+03 3.472 0.000532 ***
## GarageArea 1.704e+02 7.455e+00 22.863 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54880 on 1455 degrees of freedom
## Multiple R-squared: 0.524, Adjusted R-squared: 0.5227
## F-statistic: 400.5 on 4 and 1455 DF, p-value: < 2.2e-16
plot(mylm)
Linear Relationship It doesn’t appear that our residuals have any relationship to our fitted values, per our first model plot above.
Residuals are distributed normally We can check this assumption with our QQ-plot. The plot is generally linear in the middle of our residuals, but we can test this with a simple histogram and a Shapiro Test
hist(mylm$residuals)
shapiro.test(mylm$residuals)
##
## Shapiro-Wilk normality test
##
## data: mylm$residuals
## W = 0.89165, p-value < 2.2e-16
Our p-value being less than a significance level of \(\alpha=0.05\) indicates we can assume normality of our residuals.
bptest(mylm)
##
## studentized Breusch-Pagan test
##
## data: mylm
## BP = 140, df = 4, p-value < 2.2e-16
Our p-value being less than \(\alpha=0.05\) means that we can reject the null hypothesis and claim that our variables meet the criterion of homoscedasticity.
test$SalePrice = predict(mylm, newdata = test)
Now we can write our predicted values to a csv file for Kaggle submission
# Get DF of only Id and price column for submission
df <- as.data.frame(test[, c("Id", "SalePrice")])
# impute any missing values for Kaggle as the mean sale price
x <- mean(df$SalePrice, na.rm=TRUE)
df[is.na(df$SalePrice), "SalePrice"] <- x
# Write data to csv
write.csv(df, "submission.csv", row.names=F)