Using R, generate a random variable \(X\) that has 10,000 random uniform numbers from 1 to \(N\), where \(N\) can be any number of your choosing greater than or equal to 6. Then generate a random variable \(Y\) that has 10,000 random normal numbers with a mean of \(\mu = (N+1)/2\) and a standard deviation of \(\sigma = (N+1)/2\).
The code below generates the random variables \(X\) and \(Y\). I set a seed to get consistent answers every time I run the code.
set.seed(100)
X = runif(10000,1,6)
Y = rnorm(10000,(6+1)/2,(6+1)/2)
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
\(\Large P(X>x | X > y)\)
The way that this probability is interpreted is as follows:
The probability that random variable X is greater than the median of X given that random variable X is greater than the 1st quartile of Y.
The code below divides the number of X values that are greater than the median of X (and greater than the first quartile of Y) by the number of X values that are greater than the first quartile of Y.
The resulting probability is \(0.5127\).
sum(X > median(X) & X > quantile(Y,0.25))/sum(X > quantile(Y,0.25))
## [1] 0.5127153
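As a sanity check, the same probability can be computed from the theoretical distributions, since X is uniform on (1, 6) and Y is normal with mean and standard deviation 3.5. The median of X (3.5) exceeds the first quartile of Y (about 1.14), so \(P(X>x \mid X>y) = P(X>x)/P(X>y)\):
# Theoretical check: P(X > 3.5) / P(X > Q1 of Y) for X ~ Unif(1,6), Y ~ N(3.5, 3.5)
y_theo = qnorm(0.25, mean = 3.5, sd = 3.5)          # true 1st quartile of Y, ~1.14
(1 - punif(3.5, 1, 6))/(1 - punif(y_theo, 1, 6))    # ~0.514, close to 0.5127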
\(\Large P(X>x, Y>y)\)
The way that this probability is interpreted is as follows:
The probability that random variable X is greater than the median of X and that random variable Y is greater than the first quartile of Y.
Given that these are two independent random variables, I can multiply \(P(X>x)\) with \(P(Y>y)\) to get \(P(X>x, Y>y)\).
By definition, half of the values in a distribution lie above its median, so \(P(X>x) = 0.5\); similarly, 75% of the values lie above the first quartile, so \(P(Y>y) = 0.75\). Multiplying these two probabilities together gives \(0.5 \times 0.75 = 0.375\).
The code below computes these probabilities explicitly from the simulated distributions themselves.
sum(X > median(X))/length(X) * sum(Y > quantile(Y,0.25))/length(Y)
## [1] 0.375
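The joint probability can also be estimated directly from the simulated pairs by counting both events together:
# Direct empirical estimate of P(X > x, Y > y):
mean(X > median(X) & Y > quantile(Y, 0.25))   # ~0.3755, close to 0.375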
\(\Large P(X<x|X>y)\)
The way that this probability is interpreted is as follows:
The probability that a random variable X is less than the median of X given that random variable X is greater than the first quartile of Y.
The code below finds the number of observations of X that are less than the median of X and greater than the first quartile of Y, and then divides that count by the number of observations of X that are greater than the first quartile of Y. The probability is \(0.4873\), which is the complement of part (a): \(1 - 0.5127 = 0.4873\).
sum(X < median(X) & X > quantile(Y,0.25))/sum(X > quantile(Y,0.25))
## [1] 0.4872847
Investigate whether \(P(X>x \text{ and } Y>y) = P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.
The code below builds a table of the marginal and joint probabilities. The first few lines count the observations in each joint cell. The next line combines the joint counts and marginal totals into one table. The remaining lines rename the rows and columns and display the probabilities.
one = sum(X > median(X) & Y > quantile(Y,0.25))
two = sum(X < median(X) & Y > quantile(Y,0.25))
three = sum(X > median(X) & Y < quantile(Y,0.25))
four = sum(X < median(X) & Y < quantile(Y,0.25))
matrix_1d = matrix(c(one,two,one+two,three,four,three+four,one+three,two+four,one+two+three+four),byrow = F,nrow=3)
rownames(matrix_1d) = c("X > median","X < median","margin")
colnames(matrix_1d) = c("Y > q1","Y < q1", "margin")
matrix_1d/10000
## Y > q1 Y < q1 margin
## X > median 0.3755 0.1245 0.5
## X < median 0.3745 0.1255 0.5
## margin 0.7500 0.2500 1.0
According to the table, \(P(X>x \text{ and } Y>y) = 0.3755\), while \(P(X>x)P(Y>y) = 0.5 \times 0.75 = 0.375\).
\(0.3755\) is very close to \(0.375\), which suggests that \(P(X>x \text{ and } Y>y) = P(X>x)P(Y>y)\). In other words, there is not enough evidence to conclude that the two probabilities are not equal.
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
matrix_1e = matrix(c(one,two,three,four),byrow = F,nrow = 2)
rownames(matrix_1e) = c("X > median","X < median")
colnames(matrix_1e) = c("Y > q1","Y < q1")
matrix_1e
## Y > q1 Y < q1
## X > median 3755 1245
## X < median 3745 1255
fisher.test(matrix_1e)
##
## Fisher's Exact Test for Count Data
##
## data: matrix_1e
## p-value = 0.8354
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9222661 1.1076494
## sample estimates:
## odds ratio
## 1.010724
chisq.test(matrix_1e)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: matrix_1e
## X-squared = 0.0432, df = 1, p-value = 0.8353
According to both tests, the p-values are approximately 0.835, so the null hypothesis of independence is not rejected. There is not enough evidence to conclude that the observed proportions differ from what would be expected if the events were independent.
The difference between the two tests is as follows:
Fisher's exact test determines whether the proportions of one event differ depending on the outcome of another event. It computes the p-value exactly, using the hypergeometric distribution and conditioning on the table margins. Because this involves many factorial calculations, it can become computationally expensive when the number of observations is large, although with R the computation here is not an issue.
The chi-squared test compares the observed cell counts to the counts expected if the events were independent. A chi-squared statistic is calculated from the observed and expected values and compared to the chi-squared distribution to find an approximate p-value; the approximation is accurate when the expected cell counts are large.
Since the sample size here is so large, the exact computation of Fisher's test offers no advantage, while the chi-squared approximation is excellent. The chi-squared test is therefore the more appropriate choice for this table.
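To make the comparison concrete, the expected counts under independence can be extracted from the chi-squared test object:
# Expected counts under independence: (row total * column total) / grand total
chisq.test(matrix_1e)$expected   # 3750 and 1250 in each row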
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. Do the following:
wd = "C:/Users/super_000/Desktop/Data 605 Math/Final/house-prices-advanced-regression-techniques"
setwd(wd)
train = read.csv("train.csv",stringsAsFactors = F)
test = read.csv("test.csv",stringsAsFactors = F)
Provide univariate descriptive statistics and appropriate plots for the training data set.
Given that there are 81 variables in the training data set, plotting every single one would be tedious. Instead, I provide summary statistics and frequency counts for each variable. In the code below, an lapply call iterates over each column and returns a frequency table for categorical (character) variables and a numeric summary (minimum, quartiles, mean, maximum) for numeric variables.
descriptive_statistics = lapply(1:ncol(train), function(x){
  if(class(train[,x]) == "character"){
    return(table(train[,x]))      # frequency table for categorical variables
  }
  else{
    return(summary(train[,x]))    # numeric summary for quantitative variables
  }
})
names(descriptive_statistics) = colnames(train)
descriptive_statistics
## $Id
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 365.8 730.5 730.5 1095.2 1460.0
##
## $MSSubClass
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.0 20.0 50.0 56.9 70.0 190.0
##
## $MSZoning
##
## C (all) FV RH RL RM
## 10 65 16 1151 218
##
## $LotFrontage
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 21.00 59.00 69.00 70.05 80.00 313.00 259
##
## $LotArea
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
##
## $Street
##
## Grvl Pave
## 6 1454
##
## $Alley
##
## Grvl Pave
## 50 41
##
## $LotShape
##
## IR1 IR2 IR3 Reg
## 484 41 10 925
##
## $LandContour
##
## Bnk HLS Low Lvl
## 63 50 36 1311
##
## $Utilities
##
## AllPub NoSeWa
## 1459 1
##
## $LotConfig
##
## Corner CulDSac FR2 FR3 Inside
## 263 94 47 4 1052
##
## $LandSlope
##
## Gtl Mod Sev
## 1382 65 13
##
## $Neighborhood
##
## Blmngtn Blueste BrDale BrkSide ClearCr CollgCr Crawfor Edwards Gilbert
## 17 2 16 58 28 150 51 100 79
## IDOTRR MeadowV Mitchel NAmes NoRidge NPkVill NridgHt NWAmes OldTown
## 37 17 49 225 41 9 77 73 113
## Sawyer SawyerW Somerst StoneBr SWISU Timber Veenker
## 74 59 86 25 25 38 11
##
## $Condition1
##
## Artery Feedr Norm PosA PosN RRAe RRAn RRNe RRNn
## 48 81 1260 8 19 11 26 2 5
##
## $Condition2
##
## Artery Feedr Norm PosA PosN RRAe RRAn RRNn
## 2 6 1445 1 2 1 1 2
##
## $BldgType
##
## 1Fam 2fmCon Duplex Twnhs TwnhsE
## 1220 31 52 43 114
##
## $HouseStyle
##
## 1.5Fin 1.5Unf 1Story 2.5Fin 2.5Unf 2Story SFoyer SLvl
## 154 14 726 8 11 445 37 65
##
## $OverallQual
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 6.000 6.099 7.000 10.000
##
## $OverallCond
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 5.000 5.575 6.000 9.000
##
## $YearBuilt
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1872 1954 1973 1971 2000 2010
##
## $YearRemodAdd
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1950 1967 1994 1985 2004 2010
##
## $RoofStyle
##
## Flat Gable Gambrel Hip Mansard Shed
## 13 1141 11 286 7 2
##
## $RoofMatl
##
## ClyTile CompShg Membran Metal Roll Tar&Grv WdShake WdShngl
## 1 1434 1 1 1 11 5 6
##
## $Exterior1st
##
## AsbShng AsphShn BrkComm BrkFace CBlock CemntBd HdBoard ImStucc MetalSd
## 20 1 2 50 1 61 222 1 220
## Plywood Stone Stucco VinylSd Wd Sdng WdShing
## 108 2 25 515 206 26
##
## $Exterior2nd
##
## AsbShng AsphShn Brk Cmn BrkFace CBlock CmentBd HdBoard ImStucc MetalSd
## 20 3 7 25 1 60 207 10 214
## Other Plywood Stone Stucco VinylSd Wd Sdng Wd Shng
## 1 142 5 26 504 197 38
##
## $MasVnrType
##
## BrkCmn BrkFace None Stone
## 15 445 864 128
##
## $MasVnrArea
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 0.0 103.7 166.0 1600.0 8
##
## $ExterQual
##
## Ex Fa Gd TA
## 52 14 488 906
##
## $ExterCond
##
## Ex Fa Gd Po TA
## 3 28 146 1 1282
##
## $Foundation
##
## BrkTil CBlock PConc Slab Stone Wood
## 146 634 647 24 6 3
##
## $BsmtQual
##
## Ex Fa Gd TA
## 121 35 618 649
##
## $BsmtCond
##
## Fa Gd Po TA
## 45 65 2 1311
##
## $BsmtExposure
##
## Av Gd Mn No
## 221 134 114 953
##
## $BsmtFinType1
##
## ALQ BLQ GLQ LwQ Rec Unf
## 220 148 418 74 133 430
##
## $BsmtFinSF1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 383.5 443.6 712.2 5644.0
##
## $BsmtFinType2
##
## ALQ BLQ GLQ LwQ Rec Unf
## 19 33 14 46 54 1256
##
## $BsmtFinSF2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 46.55 0.00 1474.00
##
## $BsmtUnfSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 223.0 477.5 567.2 808.0 2336.0
##
## $TotalBsmtSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 795.8 991.5 1057.4 1298.2 6110.0
##
## $Heating
##
## Floor GasA GasW Grav OthW Wall
## 1 1428 18 7 2 4
##
## $HeatingQC
##
## Ex Fa Gd Po TA
## 741 49 241 1 428
##
## $CentralAir
##
## N Y
## 95 1365
##
## $Electrical
##
## FuseA FuseF FuseP Mix SBrkr
## 94 27 3 1 1334
##
## $X1stFlrSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 882 1087 1163 1391 4692
##
## $X2ndFlrSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 347 728 2065
##
## $LowQualFinSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 5.845 0.000 572.000
##
## $GrLivArea
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1130 1464 1515 1777 5642
##
## $BsmtFullBath
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4253 1.0000 3.0000
##
## $BsmtHalfBath
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05753 0.00000 2.00000
##
## $FullBath
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 1.565 2.000 3.000
##
## $HalfBath
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3829 1.0000 2.0000
##
## $BedroomAbvGr
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 2.866 3.000 8.000
##
## $KitchenAbvGr
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 1.047 1.000 3.000
##
## $KitchenQual
##
## Ex Fa Gd TA
## 100 39 586 735
##
## $TotRmsAbvGrd
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 5.000 6.000 6.518 7.000 14.000
##
## $Functional
##
## Maj1 Maj2 Min1 Min2 Mod Sev Typ
## 14 5 31 34 15 1 1360
##
## $Fireplaces
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 0.613 1.000 3.000
##
## $FireplaceQu
##
## Ex Fa Gd Po TA
## 24 33 380 20 313
##
## $GarageType
##
## 2Types Attchd Basment BuiltIn CarPort Detchd
## 6 870 19 88 9 387
##
## $GarageYrBlt
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1900 1961 1980 1979 2002 2010 81
##
## $GarageFinish
##
## Fin RFn Unf
## 352 422 605
##
## $GarageCars
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 1.767 2.000 4.000
##
## $GarageArea
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 334.5 480.0 473.0 576.0 1418.0
##
## $GarageQual
##
## Ex Fa Gd Po TA
## 3 48 14 3 1311
##
## $GarageCond
##
## Ex Fa Gd Po TA
## 2 35 9 7 1326
##
## $PavedDrive
##
## N P Y
## 90 30 1340
##
## $WoodDeckSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 94.24 168.00 857.00
##
## $OpenPorchSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 25.00 46.66 68.00 547.00
##
## $EnclosedPorch
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 21.95 0.00 552.00
##
## $X3SsnPorch
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 3.41 0.00 508.00
##
## $ScreenPorch
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 15.06 0.00 480.00
##
## $PoolArea
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 2.759 0.000 738.000
##
## $PoolQC
##
## Ex Fa Gd
## 2 2 3
##
## $Fence
##
## GdPrv GdWo MnPrv MnWw
## 59 54 157 11
##
## $MiscFeature
##
## Gar2 Othr Shed TenC
## 2 2 49 1
##
## $MiscVal
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 43.49 0.00 15500.00
##
## $MoSold
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 6.000 6.322 8.000 12.000
##
## $YrSold
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2006 2007 2008 2008 2009 2010
##
## $SaleType
##
## COD Con ConLD ConLI ConLw CWD New Oth WD
## 43 2 9 5 5 4 122 3 1267
##
## $SaleCondition
##
## Abnorml AdjLand Alloca Family Normal Partial
## 101 4 12 20 1198 125
##
## $SalePrice
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
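The summaries above could be supplemented with a few quick plots; a minimal sketch, with the variables chosen purely for illustration:
# Histograms for a few key numeric variables (illustrative choice):
par(mfrow = c(1, 3))
hist(train$LotArea, main = "LotArea")
hist(train$GrLivArea, main = "GrLivArea")
hist(train$SalePrice, main = "SalePrice")
par(mfrow = c(1, 1))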
Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
The code below creates a correlogram of MSSubClass, LotFrontage, and SalePrice, with a scatterplot matrix in the lower triangle and shaded correlation panels in the upper triangle. There appears to be a moderate positive correlation between LotFrontage and SalePrice.
library(corrgram)
## Registered S3 method overwritten by 'seriation':
## method from
## reorder.hclust gclus
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
corrgram_df = train %>%
dplyr::select(MSSubClass,LotFrontage,SalePrice)
corrgram(corrgram_df, lower.panel = panel.pts,upper.panel = panel.shade)
Derive a correlation matrix for any three quantitative variables in the dataset.
The code below creates a correlation matrix between LotArea, OverallQual, and OverallCond.
cormat_df = train %>%
dplyr::select(LotArea,OverallQual,OverallCond)
cormat = cor(cormat_df)
cormat
## LotArea OverallQual OverallCond
## LotArea 1.00000000 0.10580574 -0.00563627
## OverallQual 0.10580574 1.00000000 -0.09193234
## OverallCond -0.00563627 -0.09193234 1.00000000
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.
cor.test(train$LotArea,train$SalePrice,method="pearson",conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train$LotArea and train$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
Given that the p-value is nearly zero for the correlation between Lot Area and Sale Price, we have enough evidence to reject the null hypothesis that the correlation is zero; with 80% confidence, the true correlation lies between 0.232 and 0.295.
cor.test(train$OverallQual,train$SalePrice,method="pearson",conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train$OverallQual and train$SalePrice
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.7780752 0.8032204
## sample estimates:
## cor
## 0.7909816
Given that the p-value is nearly zero for the correlation between Overall Quality and Sale Price, we have enough evidence to reject the null hypothesis and conclude with 80% confidence that the true correlation is between 0.778 and 0.803.
cor.test(train$OverallCond,train$SalePrice,method="pearson",conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train$OverallCond and train$SalePrice
## t = -2.9819, df = 1458, p-value = 0.002912
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.1111272 -0.0444103
## sample estimates:
## cor
## -0.07785589
The p-value here (0.0029) is larger than in the previous two tests but still well below common significance levels, so we have enough evidence to reject the null hypothesis; with 80% confidence, the true correlation between Overall Condition and Sale Price lies between -0.111 and -0.044.
Discuss the meaning of your analysis.
Through this analysis, I was able to uncover a few things. In the first part of the problem, I looked at the descriptive statistics for each variable: for numeric variables, a summary consisting of the minimum, 1st quartile, median, mean, 3rd quartile, and maximum; for categorical variables, a count of each category.
After looking at those summary statistics, I made a scatterplot matrix using the corrgram library that plotted MSSubClass and LotFrontage against SalePrice. The correlogram displays the pairwise correlations between the variables, with blue indicating a positive correlation and red indicating a negative correlation; the darker the color, the stronger the correlation.
From there, I looked at three other variables: Lot Area, Overall Condition, and Overall Quality. I used R to calculate a correlation matrix between these three variables and found that none of the three variables had strong correlations with each other given that their correlations were close to zero.
Finally, I tested the pairwise correlations between the three variables and Sale Price. All of the correlations were statistically significant, since none of the confidence intervals included zero. However, in terms of practical significance, the correlations for Lot Area and Overall Condition are so low that they may not be worth including in the final model.
Would you be worried about familywise error? Why or why not?
1 - ((1 - 0.2)^3)
## [1] 0.488
At a confidence level of 80%, the chance of a familywise error is high even for a small number of tests. With \(\alpha = 0.2\), a single test already has a 20% chance of a type I error, and performing three tests raises the chance of at least one type I error to \(1 - 0.8^3 = 48.8\%\). So yes, I would be worried. Familywise error would be less of a problem at a stricter (smaller) significance level, e.g. \(\alpha = 0.0001\); at the current level, it is very likely to occur.
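If the familywise error rate is a concern, one standard remedy is to adjust the p-values, for example with a Bonferroni correction; a minimal sketch using the p-values reported above:
# Bonferroni adjustment multiplies each p-value by the number of tests (capped at 1):
pvals = c(2.2e-16, 2.2e-16, 0.002912)   # p-values (upper bounds) from the tests above
p.adjust(pvals, method = "bonferroni")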
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
cormat_inv = solve(cormat)
cormat_inv
## LotArea OverallQual OverallCond
## LotArea 1.011338860 -0.10738903 -0.004172346
## OverallQual -0.107389032 1.01992670 0.093158977
## OverallCond -0.004172346 0.09315898 1.008540807
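As a check on the note that the diagonal contains variance inflation factors, each diagonal entry should equal \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing variable \(j\) on the other two:
# VIF check for LotArea: the precision matrix diagonal should equal 1/(1 - R^2).
r2 = summary(lm(LotArea ~ OverallQual + OverallCond, data = cormat_df))$r.squared
1/(1 - r2)   # ~1.0113, matching cormat_inv[1,1]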
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
round(cormat %*% cormat_inv,5)
## LotArea OverallQual OverallCond
## LotArea 1 0 0
## OverallQual 0 1 0
## OverallCond 0 0 1
round(cormat_inv %*% cormat,5)
## LotArea OverallQual OverallCond
## LotArea 1 0 0
## OverallQual 0 1 0
## OverallCond 0 0 1
Conduct LU decomposition on the matrix.
# Performs one step of Gaussian elimination on column `colnum`, recording the
# multipliers in L and the partially reduced matrix in U.
row_reducer = function(A, L, colnum, original){
  if(A[colnum, colnum] == 0){
    # Zero pivot: accept the factorization only if L %*% A already
    # reproduces the original matrix.
    if(identical(L %*% A, original)){
      return(list(L = L, U = A, A = original))
    }
    else{
      return("No Solution")
    }
  }
  else{
    pivot = A[colnum, colnum]
    rowx = A[colnum, ]
    for(x in seq(colnum + 1, nrow(A))){
      L[x, colnum] = A[x, colnum]/pivot                   # store the multiplier in L
      A[x, ] = ((-A[x, colnum]/pivot) * rowx) + A[x, ]    # eliminate the entry below the pivot
    }
  }
  return(list(L = L, U = A, A = original))
}
LU = function(A){
  original = A
  print("Initial Matrix")
  print(A)
  L = diag(nrow(A))   # start L as the identity matrix
  if(A[1, 1] == 0){
    return("No Solution")
  }
  else{
    output = row_reducer(A, L, 1, original)
    if(ncol(A) == 2){
      return(output)
    }
    else{
      # Eliminate the remaining columns one at a time.
      for(x in seq(2, ncol(A) - 1)){
        if(output[[1]][1] == "No Solution"){
          return("No Solution")
        }
        else{
          output = row_reducer(output$U, output$L, x, original)
        }
      }
    }
    return(output)
  }
}
LU(cormat)
## [1] "Initial Matrix"
## LotArea OverallQual OverallCond
## LotArea 1.00000000 0.10580574 -0.00563627
## OverallQual 0.10580574 1.00000000 -0.09193234
## OverallCond -0.00563627 -0.09193234 1.00000000
## $L
## [,1] [,2] [,3]
## [1,] 1.00000000 0.00000000 0
## [2,] 0.10580574 1.00000000 0
## [3,] -0.00563627 -0.09237006 1
##
## $U
## LotArea OverallQual OverallCond
## LotArea 1 0.1058057 -0.00563627
## OverallQual 0 0.9888051 -0.09133599
## OverallCond 0 0.0000000 0.99153152
##
## $A
## LotArea OverallQual OverallCond
## LotArea 1.00000000 0.10580574 -0.00563627
## OverallQual 0.10580574 1.00000000 -0.09193234
## OverallCond -0.00563627 -0.09193234 1.00000000
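A quick check confirms that the factorization reproduces the original matrix:
# L %*% U should recover the original correlation matrix:
decomp = LU(cormat)
all.equal(decomp$L %*% decomp$U, cormat, check.attributes = FALSE)   # TRUE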
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.
I chose LotArea, which the histogram below shows is clearly skewed to the right; its minimum value (1,300) is already above zero, so no shift is needed.
hist(train$LotArea)
Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
LotArea_exp = fitdistr(train$LotArea,densfun = "exponential")
LotArea_exp
## rate
## 9.508570e-05
## (2.488507e-06)
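As a sanity check, the maximum-likelihood estimate of an exponential rate is simply the reciprocal of the sample mean:
# The MLE of the exponential rate parameter is 1 / sample mean:
1/mean(train$LotArea)   # ~9.509e-05, matching the fitdistr estimate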
Find the optimal value of lambda for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000,lambda)).
lambda = LotArea_exp$estimate
samples = rexp(1000,lambda)
Plot a histogram and compare it with a histogram of your original variable.
The histograms look quite different. The exponential samples decline steadily from zero with a long, gradually thinning tail, while the real data are concentrated in a narrow range near the low end and drop off very sharply as lot area increases.
hist(samples, breaks = 50)
hist(train$LotArea, breaks = 50)
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
quantile(samples,probs = c(0.05,0.95))
## 5% 95%
## 533.3104 30457.0813
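Strictly speaking, the percentiles from the fitted CDF come directly from qexp rather than from the 1,000 samples; they are close to the sampled values above:
# Theoretical 5th and 95th percentiles from the fitted exponential CDF:
qexp(c(0.05, 0.95), rate = lambda)   # roughly 539 and 31506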
Also generate a 95% confidence interval from the empirical data, assuming normality.
LA = train$LotArea
normal_dist = rnorm(length(LA),mean(LA),sd(LA))
quantile(normal_dist, probs = c(0.05,0.95))
## 5% 95%
## -6037.193 27165.050
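Note that the 5th and 95th percentiles bound the central 90% of the distribution; a 95% interval under the normality assumption can instead be computed directly as the mean plus or minus 1.96 standard deviations:
# Central 95% interval assuming LotArea is normally distributed:
mean(LA) + c(-1.96, 1.96)*sd(LA)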
Finally, provide the empirical 5th percentile and 95th percentile of the data.
quantile(LA,probs = c(0.05,0.95))
## 5% 95%
## 3311.70 17401.15
Discuss.
The original data have a very clear right skew, so it didn't make sense to assume normality. The fitted exponential distribution was much closer to the true distribution than the normal distribution. Also, the normal distribution's lower bound was negative, while the real data have a minimum value of 1,300. However, the normal distribution's upper bound was closer to the empirical 95th percentile than the exponential's was. Overall, the exponential distribution is a better fit for LotArea, albeit with a more extreme upper tail than the normal distribution.
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis.
The first step in building my model was cleaning the data. Given that there are over 80 variables (some ordinal, some categorical, and some with missing elements), it made sense to recode many of them so they fit properly in a linear model. The code below uses the sqldf library, which allows SQL queries to be run against R data frames; recoding factors and ordinal variables is much easier in SQL than in base R.
The query below recodes the ordinal variables to numeric scales and converts binary variables to ones and zeros instead of their factor labels.
library(mice)
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:corrgram':
##
## panel.fill
##
## Attaching package: 'mice'
## The following object is masked from 'package:tidyr':
##
## complete
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
#detach("package:RMySQL", unload=TRUE)
query = "select Id
, MSSubClass
, MSZoning
, LotFrontage
, LotArea
, case when Street = 'Pave' then 1 else 0 end as StreetPaved
, Alley
, LotShape
, LandContour
, case when Utilities = 'AllPub' then 1 else 0 end as UtilitiesAll
, LotConfig
, case when LandSlope = 'Gtl' then 1
when LandSlope = 'Mod' then 2
when LandSlope = 'Sev' then 3
end as LandSlope
, Neighborhood
, Condition1
, Condition2
, BldgType
, HouseStyle
, OverallQual
, OverallCond
, YearBuilt
, YearRemodAdd
, RoofStyle
, RoofMatl
, Exterior1st
, Exterior2nd
, MasVnrType
, MasVnrArea
, case when ExterQual = 'Ex' then 5
when ExterQual = 'Gd' then 4
when ExterQual = 'TA' then 3
when ExterQual = 'Fa' then 2
when ExterQual = 'Po' then 1
end as ExterQual
, case when ExterCond = 'Ex' then 5
when ExterCond = 'Gd' then 4
when ExterCond = 'TA' then 3
when ExterCond = 'Fa' then 2
when ExterCond = 'Po' then 1
end as ExterCond
, Foundation
, case when BsmtQual = 'Ex' then 5
when BsmtQual = 'Gd' then 4
when BsmtQual = 'TA' then 3
when BsmtQual = 'Fa' then 2
when BsmtQual = 'Po' then 1
when BsmtQual = 'NA' then 0
end as BsmtQual
, case when BsmtCond = 'Ex' then 5
when BsmtCond = 'Gd' then 4
when BsmtCond = 'TA' then 3
when BsmtCond = 'Fa' then 2
when BsmtCond = 'Po' then 1
when BsmtCond = 'NA' then 0
end as BsmtCond
, case when BsmtExposure = 'Gd' then 4
when BsmtExposure = 'Av' then 3
when BsmtExposure = 'Mn' then 2
when BsmtExposure = 'No' then 1
when BsmtExposure = 'NA' then 0
end as BsmtExposure
, case when BsmtFinType1 = 'GLQ' then 6
when BsmtFinType1 = 'ALQ' then 5
when BsmtFinType1 = 'BLQ' then 4
when BsmtFinType1 = 'Rec' then 3
when BsmtFinType1 = 'LwQ' then 2
when BsmtFinType1 = 'Unf' then 1
when BsmtFinType1 = 'NA' then 0
end as BsmtFinType1
, BsmtFinSF1
, case when BsmtFinType2 = 'GLQ' then 6
when BsmtFinType2 = 'ALQ' then 5
when BsmtFinType2 = 'BLQ' then 4
when BsmtFinType2 = 'Rec' then 3
when BsmtFinType2 = 'LwQ' then 2
when BsmtFinType2 = 'Unf' then 1
when BsmtFinType2 = 'NA' then 0
end as BsmtFinType2
, BsmtFinSF2
, BsmtUnfSF
, TotalBsmtSF
, Heating
, case when HeatingQC = 'Ex' then 5
when HeatingQC = 'Gd' then 4
when HeatingQC = 'TA' then 3
when HeatingQC = 'Fa' then 2
when HeatingQC = 'Po' then 1
end as HeatingQC
, CentralAir
, Electrical
, X1stFlrSF
, X2ndFlrSF
, LowQualFinSF
, GrLivArea
, BsmtFullBath
, BsmtHalfBath
, FullBath
, HalfBath
, BedroomAbvGr
, KitchenAbvGr
, case when KitchenQual = 'Ex' then 5
when KitchenQual = 'Gd' then 4
when KitchenQual = 'TA' then 3
when KitchenQual = 'Fa' then 2
when KitchenQual = 'Po' then 1
end as KitchenQual
, TotRmsAbvGrd
, case when Functional = 'Typ' then 8
when Functional = 'Min1' then 7
when Functional = 'Min2' then 6
when Functional = 'Mod' then 5
when Functional = 'Maj1' then 4
when Functional = 'Maj2' then 3
when Functional = 'Sev' then 2
when Functional = 'Sal' then 1
end as Functional
, Fireplaces
, case when FireplaceQu = 'Ex' then 5
when FireplaceQu = 'Gd' then 4
when FireplaceQu = 'TA' then 3
when FireplaceQu = 'Fa' then 2
when FireplaceQu = 'Po' then 1
when FireplaceQu = 'NA' then 0
end as FireplaceQu
, GarageType
, GarageYrBlt
, case when GarageFinish = 'Fin' then 3
when GarageFinish = 'RFn' then 2
when GarageFinish = 'Unf' then 1
when GarageFinish = 'NA' then 0
end as GarageFinish
, GarageCars
, GarageArea
, case when GarageQual = 'Ex' then 5
when GarageQual = 'Gd' then 4
when GarageQual = 'TA' then 3
when GarageQual = 'Fa' then 2
when GarageQual = 'Po' then 1
when GarageQual = 'NA' then 0
end as GarageQual
, case when GarageCond = 'Ex' then 5
when GarageCond = 'Gd' then 4
when GarageCond = 'TA' then 3
when GarageCond = 'Fa' then 2
when GarageCond = 'Po' then 1
when GarageCond = 'NA' then 0
end as GarageCond
, PavedDrive
, WoodDeckSF
, OpenPorchSF
, EnclosedPorch
, X3SsnPorch
, ScreenPorch
, PoolArea
, PoolQC
, case when Fence = 'GdPrv' then 4
when Fence = 'MnPrv' then 3
when Fence = 'GdWo' then 2
when Fence = 'MnWw' then 1
when Fence = 'NA' then 0
end as Fence
, MiscFeature
, MiscVal
, MoSold
, YrSold
, SaleType
, SaleCondition
, SalePrice
from train"
cleaned_train = sqldf::sqldf(query)
The next step after cleaning the data was to impute the missing values. Since rows with missing values would otherwise be dropped when fitting the model, I imputed them using the mice package: predictive mean matching for the numeric missing values, and a value of 0 for all categorical missing values.
mice_impute_train = mice(cleaned_train,m=5,maxit = 5,method = 'pmm')
##
## iter imp variable
## 1 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 1 2 LotFrontage* MasVnrArea* GarageYrBlt*
## 1 3 LotFrontage* MasVnrArea* GarageYrBlt*
## 1 4 LotFrontage* MasVnrArea* GarageYrBlt*
## 1 5 LotFrontage* MasVnrArea* GarageYrBlt*
## 2 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 2 2 LotFrontage* MasVnrArea* GarageYrBlt*
## 2 3 LotFrontage* MasVnrArea* GarageYrBlt*
## 2 4 LotFrontage* MasVnrArea* GarageYrBlt*
## 2 5 LotFrontage* MasVnrArea* GarageYrBlt*
## 3 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 3 2 LotFrontage* MasVnrArea* GarageYrBlt*
## 3 3 LotFrontage* MasVnrArea* GarageYrBlt*
## 3 4 LotFrontage* MasVnrArea* GarageYrBlt*
## 3 5 LotFrontage* MasVnrArea* GarageYrBlt*
## 4 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 4 2 LotFrontage* MasVnrArea* GarageYrBlt*
## 4 3 LotFrontage* MasVnrArea* GarageYrBlt*
## 4 4 LotFrontage* MasVnrArea* GarageYrBlt*
## 4 5 LotFrontage* MasVnrArea* GarageYrBlt*
## 5 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 5 2 LotFrontage* MasVnrArea* GarageYrBlt*
## 5 3 LotFrontage* MasVnrArea* GarageYrBlt*
## 5 4 LotFrontage* MasVnrArea* GarageYrBlt*
## 5 5 LotFrontage* MasVnrArea* GarageYrBlt*
## * Please inspect the loggedEvents
## Warning: Number of logged events: 191
new_train = complete(mice_impute_train,2)
new_train[is.na(new_train)] = 0
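A quick check confirms that no missing values remain before modeling:
# After imputation and zero-filling, there should be no NAs left:
sum(is.na(new_train))   # 0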
From what I know about houses, the factors that affect price the most are size and location. Generally speaking, larger houses are almost always more expensive than smaller houses in the same neighborhood. For my model, I chose variables that relate to the physical size of the house, such as Lot Area, Garage Area, and Total Rooms Above Grade, as different measures of how big the house is. To account for location, I added the Neighborhood variable. While most neighborhood coefficients are not statistically significant, the expensive neighborhoods are, and they can raise predicted home values substantially.
clean_model = lm(SalePrice~
BsmtExposure+
BsmtFinSF1+
BsmtUnfSF+
Fireplaces+
GarageArea+
KitchenAbvGr+
LandContour+
LotArea+
MasVnrArea+
Neighborhood+
OverallCond+
OverallQual+
TotRmsAbvGrd
,data = new_train)
summary(clean_model)
##
## Call:
## lm(formula = SalePrice ~ BsmtExposure + BsmtFinSF1 + BsmtUnfSF +
## Fireplaces + GarageArea + KitchenAbvGr + LandContour + LotArea +
## MasVnrArea + Neighborhood + OverallCond + OverallQual + TotRmsAbvGrd,
## data = new_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -391127 -17758 -923 14516 318071
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.877e+04 1.467e+04 -4.007 6.47e-05 ***
## BsmtExposure1 -1.962e+04 6.471e+03 -3.032 0.002473 **
## BsmtExposure2 -1.424e+04 7.121e+03 -1.999 0.045769 *
## BsmtExposure3 -1.221e+04 6.895e+03 -1.771 0.076820 .
## BsmtExposure4 9.992e+03 7.425e+03 1.346 0.178599
## BsmtFinSF1 3.091e+01 3.106e+00 9.949 < 2e-16 ***
## BsmtUnfSF 1.668e+01 3.054e+00 5.461 5.59e-08 ***
## Fireplaces 7.948e+03 1.709e+03 4.652 3.60e-06 ***
## GarageArea 4.013e+01 5.592e+00 7.176 1.16e-12 ***
## KitchenAbvGr -2.744e+04 4.760e+03 -5.764 1.01e-08 ***
## LandContourHLS 2.188e+04 6.807e+03 3.215 0.001335 **
## LandContourLow 1.214e+04 7.964e+03 1.525 0.127596
## LandContourLvl 1.610e+04 4.633e+03 3.474 0.000528 ***
## LotArea 4.344e-01 1.065e-01 4.078 4.79e-05 ***
## MasVnrArea 1.895e+01 6.068e+00 3.123 0.001826 **
## NeighborhoodBlueste -4.364e+03 2.531e+04 -0.172 0.863126
## NeighborhoodBrDale -2.890e+04 1.224e+04 -2.361 0.018364 *
## NeighborhoodBrkSide -3.122e+03 9.697e+03 -0.322 0.747536
## NeighborhoodClearCr 8.176e+03 1.112e+04 0.735 0.462476
## NeighborhoodCollgCr 1.213e+04 8.770e+03 1.383 0.166992
## NeighborhoodCrawfor 1.981e+04 9.956e+03 1.990 0.046821 *
## NeighborhoodEdwards -1.330e+04 9.261e+03 -1.436 0.151287
## NeighborhoodGilbert 9.944e+03 9.204e+03 1.080 0.280179
## NeighborhoodIDOTRR -1.864e+04 1.028e+04 -1.813 0.070006 .
## NeighborhoodMeadowV -6.889e+03 1.196e+04 -0.576 0.564597
## NeighborhoodMitchel -3.573e+03 9.778e+03 -0.365 0.714882
## NeighborhoodNAmes -9.337e+03 8.879e+03 -1.052 0.293182
## NeighborhoodNoRidge 7.325e+04 1.014e+04 7.222 8.35e-13 ***
## NeighborhoodNPkVill -1.122e+04 1.404e+04 -0.799 0.424198
## NeighborhoodNridgHt 5.787e+04 9.295e+03 6.226 6.28e-10 ***
## NeighborhoodNWAmes -6.987e+03 9.406e+03 -0.743 0.457686
## NeighborhoodOldTown -1.429e+04 9.236e+03 -1.547 0.121971
## NeighborhoodSawyer -4.932e+03 9.517e+03 -0.518 0.604355
## NeighborhoodSawyerW 9.438e+03 9.524e+03 0.991 0.321857
## NeighborhoodSomerst 2.963e+04 9.135e+03 3.244 0.001207 **
## NeighborhoodStoneBr 7.005e+04 1.085e+04 6.455 1.48e-10 ***
## NeighborhoodSWISU -5.893e+03 1.113e+04 -0.530 0.596447
## NeighborhoodTimber 9.310e+03 1.020e+04 0.913 0.361511
## NeighborhoodVeenker 2.749e+04 1.327e+04 2.071 0.038505 *
## OverallCond 5.296e+03 9.017e+02 5.874 5.31e-09 ***
## OverallQual 1.648e+04 1.140e+03 14.457 < 2e-16 ***
## TotRmsAbvGrd 1.197e+04 7.180e+02 16.668 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33570 on 1418 degrees of freedom
## Multiple R-squared: 0.8265, Adjusted R-squared: 0.8215
## F-statistic: 164.7 on 41 and 1418 DF, p-value: < 2.2e-16
As we can see from the model output, the adjusted R-squared is quite good at 0.8215. The residuals are nearly centered at zero, which is a good sign that there is not much bias in the model. Looking at the residual plot below, we see a fairly tight, random spread around the zero line. There do not appear to be significant changes in variance across the data, nor any discernible pattern in the residuals. The only concern is a few outliers, but overall the residual plot looks very good.
plot(clean_model$residuals,main = "Residual Plot")
abline(h=0,col = "red")
Looking at the Q-Q plot, we can see that the points follow the line very closely except at the tails. This indicates that the residuals are approximately normally distributed, which is a good sign. At the extremes of housing prices, other factors beyond location and house size likely contribute to the price.
qqnorm(clean_model$residuals)
qqline(clean_model$residuals)
Overall, the model appears to be a good fit. The residuals do not exhibit any abnormal behavior that would indicate a non-linear model would be a better fit.
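Base R's built-in diagnostics produce the standard panels in one call, showing the same information more compactly:
# Residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage:
par(mfrow = c(2,2))
plot(clean_model)
par(mfrow = c(1,1))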
Now that the variables have been chosen, we can use cross-validation to estimate how well the model predicts new observations. I loaded the caret library, which has many useful machine learning functions, including utilities for cross-validating a model.
library(caret)
## Loading required package: ggplot2
Using 20-fold cross-validation, the model achieves a cross-validated R-squared of about 0.824, nearly matching the training fit, which suggests the model is not badly overfit. Time to make some predictions.
train_control_20 = trainControl(method = "cv", number = 20)
lm_cv20 = train(SalePrice~
BsmtExposure+
BsmtFinSF1+
BsmtUnfSF+
Fireplaces+
GarageArea+
KitchenAbvGr+
LandContour+
LotArea+
MasVnrArea+
Neighborhood+
OverallCond+
OverallQual+
TotRmsAbvGrd
,new_train,method = "lm", trControl = train_control_20, na.action = na.pass)
print(lm_cv20)
## Linear Regression
##
## 1460 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (20 fold)
## Summary of sample sizes: 1388, 1386, 1388, 1388, 1386, 1388, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 33831.5 0.8236922 22395.01
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
To make my predictions, I ran the same cleaning and imputation scripts on the testing data that I used for the training data.
query = "select Id
, MSSubClass
, MSZoning
, LotFrontage
, LotArea
, case when Street = 'Pave' then 1 else 0 end as StreetPaved
, Alley
, LotShape
, LandContour
, case when Utilities = 'AllPub' then 1 else 0 end as UtilitiesAll
, LotConfig
, case when LandSlope = 'Gtl' then 1
when LandSlope = 'Mod' then 2
when LandSlope = 'Sev' then 3
end as LandSlope
, Neighborhood
, Condition1
, Condition2
, BldgType
, HouseStyle
, OverallQual
, OverallCond
, YearBuilt
, YearRemodAdd
, RoofStyle
, RoofMatl
, Exterior1st
, Exterior2nd
, MasVnrType
, MasVnrArea
, case when ExterQual = 'Ex' then 5
when ExterQual = 'Gd' then 4
when ExterQual = 'TA' then 3
when ExterQual = 'Fa' then 2
when ExterQual = 'Po' then 1
end as ExterQual
, case when ExterCond = 'Ex' then 5
when ExterCond = 'Gd' then 4
when ExterCond = 'TA' then 3
when ExterCond = 'Fa' then 2
when ExterCond = 'Po' then 1
end as ExterCond
, Foundation
, case when BsmtQual = 'Ex' then 5
when BsmtQual = 'Gd' then 4
when BsmtQual = 'TA' then 3
when BsmtQual = 'Fa' then 2
when BsmtQual = 'Po' then 1
when BsmtQual = 'NA' then 0
end as BsmtQual
, case when BsmtCond = 'Ex' then 5
when BsmtCond = 'Gd' then 4
when BsmtCond = 'TA' then 3
when BsmtCond = 'Fa' then 2
when BsmtCond = 'Po' then 1
when BsmtCond = 'NA' then 0
end as BsmtCond
, case when BsmtExposure = 'Gd' then 4
when BsmtExposure = 'Av' then 3
when BsmtExposure = 'Mn' then 2
when BsmtExposure = 'No' then 1
when BsmtExposure = 'NA' then 0
end as BsmtExposure
, case when BsmtFinType1 = 'GLQ' then 6
when BsmtFinType1 = 'ALQ' then 5
when BsmtFinType1 = 'BLQ' then 4
when BsmtFinType1 = 'Rec' then 3
when BsmtFinType1 = 'LwQ' then 2
when BsmtFinType1 = 'Unf' then 1
when BsmtFinType1 = 'NA' then 0
end as BsmtFinType1
, BsmtFinSF1
, case when BsmtFinType2 = 'GLQ' then 6
when BsmtFinType2 = 'ALQ' then 5
when BsmtFinType2 = 'BLQ' then 4
when BsmtFinType2 = 'Rec' then 3
when BsmtFinType2 = 'LwQ' then 2
when BsmtFinType2 = 'Unf' then 1
when BsmtFinType2 = 'NA' then 0
end as BsmtFinType2
, BsmtFinSF2
, BsmtUnfSF
, TotalBsmtSF
, Heating
, case when HeatingQC = 'Ex' then 5
when HeatingQC = 'Gd' then 4
when HeatingQC = 'TA' then 3
when HeatingQC = 'Fa' then 2
when HeatingQC = 'Po' then 1
end as HeatingQC
, CentralAir
, Electrical
, X1stFlrSF
, X2ndFlrSF
, LowQualFinSF
, GrLivArea
, BsmtFullBath
, BsmtHalfBath
, FullBath
, HalfBath
, BedroomAbvGr
, KitchenAbvGr
, case when KitchenQual = 'Ex' then 5
when KitchenQual = 'Gd' then 4
when KitchenQual = 'TA' then 3
when KitchenQual = 'Fa' then 2
when KitchenQual = 'Po' then 1
end as KitchenQual
, TotRmsAbvGrd
, case when Functional = 'Typ' then 8
when Functional = 'Min1' then 7
when Functional = 'Min2' then 6
when Functional = 'Mod' then 5
when Functional = 'Maj1' then 4
when Functional = 'Maj2' then 3
when Functional = 'Sev' then 2
when Functional = 'Sal' then 1
end as Functional
, Fireplaces
, case when FireplaceQu = 'Ex' then 5
when FireplaceQu = 'Gd' then 4
when FireplaceQu = 'TA' then 3
when FireplaceQu = 'Fa' then 2
when FireplaceQu = 'Po' then 1
when FireplaceQu = 'NA' then 0
end as FireplaceQu
, GarageType
, GarageYrBlt
, case when GarageFinish = 'Fin' then 3
when GarageFinish = 'RFn' then 2
when GarageFinish = 'Unf' then 1
when GarageFinish = 'NA' then 0
end as GarageFinish
, GarageCars
, GarageArea
, case when GarageQual = 'Ex' then 5
when GarageQual = 'Gd' then 4
when GarageQual = 'TA' then 3
when GarageQual = 'Fa' then 2
when GarageQual = 'Po' then 1
when GarageQual = 'NA' then 0
end as GarageQual
, case when GarageCond = 'Ex' then 5
when GarageCond = 'Gd' then 4
when GarageCond = 'TA' then 3
when GarageCond = 'Fa' then 2
when GarageCond = 'Po' then 1
when GarageCond = 'NA' then 0
end as GarageCond
, PavedDrive
, WoodDeckSF
, OpenPorchSF
, EnclosedPorch
, X3SsnPorch
, ScreenPorch
, PoolArea
, PoolQC
, case when Fence = 'GdPrv' then 4
when Fence = 'MnPrv' then 3
when Fence = 'GdWo' then 2
when Fence = 'MnWw' then 1
when Fence = 'NA' then 0
end as Fence
, MiscFeature
, MiscVal
, MoSold
, YrSold
, SaleType
, SaleCondition
from test"
cleaned_test = sqldf::sqldf(query)
mice_impute = mice(cleaned_test,m=5,maxit = 5,method = 'pmm')
##
## iter imp variable
## 1 1 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 1 2 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 1 3 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 1 4 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 1 5 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 2 1 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 2 2 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 2 3 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 2 4 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 2 5 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 3 1 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 3 2 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 3 3 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 3 4 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 3 5 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 4 1 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 4 2 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 4 3 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 4 4 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 4 5 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 5 1 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 5 2 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 5 3 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 5 4 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 5 5 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## * Please inspect the loggedEvents
## Warning: Number of logged events: 591
new_test = complete(mice_impute,2)
new_test[is.na(new_test)] = 0
The code below makes predictions on the test set using my 20-fold cross-validated linear model. The output can be seen below.
predictions = predict(lm_cv20,new_test)
output_df = data.frame(Id = cleaned_test$Id, SalePrice = predictions)
head(output_df)
## Id SalePrice
## 1 1461 128234.7
## 2 1462 159431.6
## 3 1463 160906.0
## 4 1464 190147.8
## 5 1465 251799.0
## 6 1466 171994.6
#write.csv(output_df,"Kaggle submission.csv",row.names = F)
Kaggle username: AuChan
Score: 0.21546