Load our Datasets

I uploaded the kaggle datasets to my GitHub, reading them here for reproducability.

test_url <- "https://raw.githubusercontent.com/andrewbowen19/computationalMath605/main/data/test.csv"
train_url <- "https://raw.githubusercontent.com/andrewbowen19/computationalMath605/main/data/train.csv"

test <- read.csv(test_url)
train <- read.csv(train_url)

Descriptive and Inferential Statistics

summary(train)
##        Id           MSSubClass      MSZoning          LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   Mode  :character   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9                      Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                      Max.   :313.00  
##                                                      NA's   :259     
##     LotArea          Street             Alley             LotShape        
##  Min.   :  1300   Length:1460        Length:1460        Length:1460       
##  1st Qu.:  7554   Class :character   Class :character   Class :character  
##  Median :  9478   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10517                                                           
##  3rd Qu.: 11602                                                           
##  Max.   :215245                                                           
##                                                                           
##  LandContour         Utilities          LotConfig          LandSlope        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Neighborhood        Condition1         Condition2          BldgType        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   HouseStyle         OverallQual      OverallCond      YearBuilt   
##  Length:1460        Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  Class :character   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Mode  :character   Median : 6.000   Median :5.000   Median :1973  
##                     Mean   : 6.099   Mean   :5.575   Mean   :1971  
##                     3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##                     Max.   :10.000   Max.   :9.000   Max.   :2010  
##                                                                    
##   YearRemodAdd   RoofStyle           RoofMatl         Exterior1st       
##  Min.   :1950   Length:1460        Length:1460        Length:1460       
##  1st Qu.:1967   Class :character   Class :character   Class :character  
##  Median :1994   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1985                                                           
##  3rd Qu.:2004                                                           
##  Max.   :2010                                                           
##                                                                         
##  Exterior2nd         MasVnrType          MasVnrArea      ExterQual        
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median :   0.0   Mode  :character  
##                                        Mean   : 103.7                     
##                                        3rd Qu.: 166.0                     
##                                        Max.   :1600.0                     
##                                        NA's   :8                          
##   ExterCond          Foundation          BsmtQual           BsmtCond        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median : 383.5   Mode  :character  
##                                        Mean   : 443.6                     
##                                        3rd Qu.: 712.2                     
##                                        Max.   :5644.0                     
##                                                                           
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       Heating         
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Length:1460       
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   Class :character  
##  Median :   0.00   Median : 477.5   Median : 991.5   Mode  :character  
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4                     
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2                     
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0                     
##                                                                        
##   HeatingQC          CentralAir         Electrical          X1stFlrSF   
##  Length:1460        Length:1460        Length:1460        Min.   : 334  
##  Class :character   Class :character   Class :character   1st Qu.: 882  
##  Mode  :character   Mode  :character   Mode  :character   Median :1087  
##                                                           Mean   :1163  
##                                                           3rd Qu.:1391  
##                                                           Max.   :4692  
##                                                                         
##    X2ndFlrSF     LowQualFinSF       GrLivArea     BsmtFullBath   
##  Min.   :   0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000  
##  Median :   0   Median :  0.000   Median :1464   Median :0.0000  
##  Mean   : 347   Mean   :  5.845   Mean   :1515   Mean   :0.4253  
##  3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000  
##  Max.   :2065   Max.   :572.000   Max.   :5642   Max.   :3.0000  
##                                                                  
##   BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.0000   Median :3.000  
##  Mean   :0.05753   Mean   :1.565   Mean   :0.3829   Mean   :2.866  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :3.000   Max.   :2.0000   Max.   :8.000  
##                                                                    
##   KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
##  Min.   :0.000   Length:1460        Min.   : 2.000   Length:1460       
##  1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
##  Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
##  Mean   :1.047                      Mean   : 6.518                     
##  3rd Qu.:1.000                      3rd Qu.: 7.000                     
##  Max.   :3.000                      Max.   :14.000                     
##                                                                        
##    Fireplaces    FireplaceQu         GarageType         GarageYrBlt  
##  Min.   :0.000   Length:1460        Length:1460        Min.   :1900  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:1961  
##  Median :1.000   Mode  :character   Mode  :character   Median :1980  
##  Mean   :0.613                                         Mean   :1979  
##  3rd Qu.:1.000                                         3rd Qu.:2002  
##  Max.   :3.000                                         Max.   :2010  
##                                                        NA's   :81    
##  GarageFinish         GarageCars      GarageArea      GarageQual       
##  Length:1460        Min.   :0.000   Min.   :   0.0   Length:1460       
##  Class :character   1st Qu.:1.000   1st Qu.: 334.5   Class :character  
##  Mode  :character   Median :2.000   Median : 480.0   Mode  :character  
##                     Mean   :1.767   Mean   : 473.0                     
##                     3rd Qu.:2.000   3rd Qu.: 576.0                     
##                     Max.   :4.000   Max.   :1418.0                     
##                                                                        
##   GarageCond         PavedDrive          WoodDeckSF      OpenPorchSF    
##  Length:1460        Length:1460        Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :  0.00   Median : 25.00  
##                                        Mean   : 94.24   Mean   : 46.66  
##                                        3rd Qu.:168.00   3rd Qu.: 68.00  
##                                        Max.   :857.00   Max.   :547.00  
##                                                                         
##  EnclosedPorch      X3SsnPorch      ScreenPorch        PoolArea      
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.00   Median :  0.00   Median :  0.000  
##  Mean   : 21.95   Mean   :  3.41   Mean   : 15.06   Mean   :  2.759  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :552.00   Max.   :508.00   Max.   :480.00   Max.   :738.000  
##                                                                      
##     PoolQC             Fence           MiscFeature           MiscVal        
##  Length:1460        Length:1460        Length:1460        Min.   :    0.00  
##  Class :character   Class :character   Class :character   1st Qu.:    0.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :    0.00  
##                                                           Mean   :   43.49  
##                                                           3rd Qu.:    0.00  
##                                                           Max.   :15500.00  
##                                                                             
##      MoSold           YrSold       SaleType         SaleCondition     
##  Min.   : 1.000   Min.   :2006   Length:1460        Length:1460       
##  1st Qu.: 5.000   1st Qu.:2007   Class :character   Class :character  
##  Median : 6.000   Median :2008   Mode  :character   Mode  :character  
##  Mean   : 6.322   Mean   :2008                                        
##  3rd Qu.: 8.000   3rd Qu.:2009                                        
##  Max.   :12.000   Max.   :2010                                        
##                                                                       
##    SalePrice     
##  Min.   : 34900  
##  1st Qu.:129975  
##  Median :163000  
##  Mean   :180921  
##  3rd Qu.:214000  
##  Max.   :755000  
## 

Creating a scatter plot of the basement square footage and sale price ($)

ggplot(train, aes(x=TotalBsmtSF, y=SalePrice)) + geom_point() + xlab("Basement Square Footage") + ylab("Sale Price ($)")

Creating another scatter plot between the lot area (\(ft^2\)) and SalePrice

ggplot(train, aes(x=LotArea, y=SalePrice)) + geom_point() + xlab("Lot Area (ft^2)") + ylab("Sale Price ($)")

Generating a correlation matrix between our three quantitative variables above (LotArea, TotalBsmtSF, SalePrice)

(corTrain <- cor(train[,c("LotArea", "TotalBsmtSF", "SalePrice") ]))
##               LotArea TotalBsmtSF SalePrice
## LotArea     1.0000000   0.2608331 0.2638434
## TotalBsmtSF 0.2608331   1.0000000 0.6135806
## SalePrice   0.2638434   0.6135806 1.0000000

Running cor.test for each permutation of quantitative variables above

for (i in colnames(corTrain)){
  for (j in colnames(corTrain)){
    if (i != j){
      ctest <- cor.test(corTrain[,c(i)], corTrain[,c(j)], conf.level=0.8)
      print(ctest)
    }
  }
}
## 
##  Pearson's product-moment correlation
## 
## data:  corTrain[, c(i)] and corTrain[, c(j)]
## t = -1.6444, df = 1, p-value = 0.3478
## alternative hypothesis: true correlation is not equal to 0
## sample estimates:
##        cor 
## -0.8544218 
## 
## 
##  Pearson's product-moment correlation
## 
## data:  corTrain[, c(i)] and corTrain[, c(j)]
## t = -1.6097, df = 1, p-value = 0.3539
## alternative hypothesis: true correlation is not equal to 0
## sample estimates:
##        cor 
## -0.8494291 
## 
## 
##  Pearson's product-moment correlation
## 
## data:  corTrain[, c(i)] and corTrain[, c(j)]
## t = -1.6444, df = 1, p-value = 0.3478
## alternative hypothesis: true correlation is not equal to 0
## sample estimates:
##        cor 
## -0.8544218 
## 
## 
##  Pearson's product-moment correlation
## 
## data:  corTrain[, c(i)] and corTrain[, c(j)]
## t = 0.50613, df = 1, p-value = 0.7017
## alternative hypothesis: true correlation is not equal to 0
## sample estimates:
##       cor 
## 0.4515869 
## 
## 
##  Pearson's product-moment correlation
## 
## data:  corTrain[, c(i)] and corTrain[, c(j)]
## t = -1.6097, df = 1, p-value = 0.3539
## alternative hypothesis: true correlation is not equal to 0
## sample estimates:
##        cor 
## -0.8494291 
## 
## 
##  Pearson's product-moment correlation
## 
## data:  corTrain[, c(i)] and corTrain[, c(j)]
## t = 0.50613, df = 1, p-value = 0.7017
## alternative hypothesis: true correlation is not equal to 0
## sample estimates:
##       cor 
## 0.4515869

For each combination of variables tested, we have p-values greater than our significance level (\(\alpha = 0.2\)). This indicates that we can not reject our null hypothesis that the true population correlation between variables is equal to 0.

Linear Algebra and Correlation

det(corTrain)
## [1] 0.5703238

Our determinant is not equal to 0, so we should be able to invert our correlation matrix. We can do this using the built-in solve function in R:

(invCor <- solve(corTrain))
##                LotArea TotalBsmtSF  SalePrice
## LotArea      1.0932718  -0.1734874 -0.1820040
## TotalBsmtSF -0.1734874   1.6313307 -0.9551793
## SalePrice   -0.1820040  -0.9551793  1.6341000
corTrain %*% invCor
##                   LotArea TotalBsmtSF    SalePrice
## LotArea      1.000000e+00           0 0.000000e+00
## TotalBsmtSF  0.000000e+00           1 2.220446e-16
## SalePrice   -2.775558e-17           0 1.000000e+00

Accounting for some floating point errors, the above matrix product is the identity matrix, which holds for the multiplication of a matrix \(A\) with its inverse \(A^{-1}\): \(I = AA^{-1}\)

LU Decomposition of our correlation matrix from above using the matrixcalc package

(LU <- lu.decomposition(corTrain))
## $L
##           [,1]      [,2] [,3]
## [1,] 1.0000000 0.0000000    0
## [2,] 0.2608331 1.0000000    0
## [3,] 0.2638434 0.5845293    1
## 
## $U
##      [,1]      [,2]      [,3]
## [1,]    1 0.2608331 0.2638434
## [2,]    0 0.9319661 0.5447615
## [3,]    0 0.0000000 0.6119577

Let’s verify that our LU decomposition does in fact equal our original correlation matrix when multiplied together: \(A = LU\)

A <- LU$L %*% LU$U 
A == corTrain
##             LotArea TotalBsmtSF SalePrice
## LotArea        TRUE        TRUE      TRUE
## TotalBsmtSF    TRUE        TRUE      TRUE
## SalePrice      TRUE        TRUE      TRUE

Calculus-Bases Probability & Statistics

Let’s use the MasVnrArea variable, which is definitely skew-right. We’ll need to remove a few NA values from our data first

df <- train %>% filter(!is.na(MasVnrArea))
qplot(df$MasVnrArea)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Calling fitdistr from the MASS package to fit our dataset variable to an exponential

# Fit our distribution to an exponential
fit <- fitdistr(df$MasVnrArea, "exponential")

# Getting the rate poarameter lambda
lambda <- fit$estimate[["rate"]]

Now let’s generate random values with our lambda variable as the rate parameter

dat <- rexp(1000, lambda)
randValues <- as.data.frame(dat, col.names=character(1))


hist(randValues[,1], col=rgb(0,0,1,0.5))
hist(df$MasVnrArea, add=TRUE, col=rgb(1,0,0,0.5))

These both look to be skew-right with similar shapes (although the Kaggle data has a bit longer tail). Overall, these historgrams are similarly shaped and likely would have been pulled from the same distribution

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

p5 <- pexp(0.05, rate=lambda)
p95 <- pexp(0.95, rate=lambda)

Let’s now construct a 95% confidence interval from our Kaggle variable MasVnrArea. To do this we’ll need our sample mean and standard deviation for this field

n <- length(df$MasVnrArea)
mu <- mean(df$MasVnrArea)
sigma <- sd(df$MasVnrArea)

margin <- qt(0.975,df=n-1)*sigma/sqrt(n)

lower_interval <- mu - margin
upper_interval <- mu + margin
print(lower_interval)
## [1] 94.36422
print(upper_interval)
## [1] 113.0063

Getting 5th and 95th percentile of our MasVnrArea variable

quantile(df$MasVnrArea, c(0.05, 0.95))
##  5% 95% 
##   0 456

Modeling

mylm <- lm(SalePrice ~ FullBath + LotArea + OverallCond + GarageArea, train)

summary(mylm)
## 
## Call:
## lm(formula = SalePrice ~ FullBath + LotArea + OverallCond + GarageArea, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -259833  -29066   -4312   20005  426711 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.008e+04  9.529e+03  -2.107 0.035314 *  
## FullBath     5.336e+04  2.890e+03  18.462  < 2e-16 ***
## LotArea      1.073e+00  1.467e-01   7.315 4.23e-13 ***
## OverallCond  4.587e+03  1.321e+03   3.472 0.000532 ***
## GarageArea   1.704e+02  7.455e+00  22.863  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 54880 on 1455 degrees of freedom
## Multiple R-squared:  0.524,  Adjusted R-squared:  0.5227 
## F-statistic: 400.5 on 4 and 1455 DF,  p-value: < 2.2e-16
plot(mylm)

Linear Regression Assumptions

  1. Linear Relationship It doesn’t appear that our residuals have any relationship to our fitted values, per our first model plot above.

  2. Residuals are distributed normally We can check this assumption with our QQ-plot. The plot is generally linear in the middle of our residuals, but we can test this with a simple histogram and a Shapiro Test

hist(mylm$residuals)

shapiro.test(mylm$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  mylm$residuals
## W = 0.89165, p-value < 2.2e-16

Our p-value being less than a significance level of \(\alpha=0.05\) indicates we can assume normality of our residuals.

  1. Homoscedasticity We can test this with a Breush-Pagan test in R
bptest(mylm)
## 
##  studentized Breusch-Pagan test
## 
## data:  mylm
## BP = 140, df = 4, p-value < 2.2e-16

Our p-value being less than \(\alpha=0.05\) means that we can reject the null hypothesis and claim that our variables meet the criterion of homoscedasticity.

Results

test$SalePrice = predict(mylm, newdata = test)

Now we can write our predicted values to a csv file for Kaggle submission

# Get DF of only Id and price column for submission
df <- as.data.frame(test[, c("Id", "SalePrice")])

# impute any missing values for Kaggle as the mean sale price
x <- mean(df$SalePrice, na.rm=TRUE)
df[is.na(df$SalePrice), "SalePrice"] <- x

# Write data to csv
write.csv(df, "submission.csv", row.names=F)