Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with mean and standard deviation \(μ=σ=\frac{N+1}{2}\).

N <- 44
n <- 10000
X <- runif(n,1,N)
Y <- rnorm(n,(N+1)/2,(N+1)/2)

Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

x <- median(X)
y <- as.numeric(quantile(Y,0.25))

5 points

\(P(X>x | X>y)\)

X_gr_y <- X[which(X>y)]
a <- length(X_gr_y[which(X_gr_y>x)])
P_a <- a / length(X_gr_y)

\(P(X>x | X>y) = 0.58\)

\(P(X>x, Y>y)\)

X_gr_x <- X[which(X>x)]
#X_gr_x holds the X values above the median; their share of X should be around 50%
Y_gr_y <- Y[which(Y>y)]
#Y_gr_y holds the Y values above the 1st quartile of Y; their share of Y should be about 75%
#Note: multiplying the two marginal shares equals the joint probability only if X and Y are independent
P_b <- length(X_gr_x)/length(X) * length(Y_gr_y)/length(Y)

\(P(X>x , Y>y) = 0.38\)

\(P(X<x | X>y)\)

c <- length(X_gr_y[which(X_gr_y<x)])
P_c <- c / length(X_gr_y)

\(P(X<x | X>y) = 0.42\)
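The conditional probabilities above can also be sketched with logical vectors and checked against theory. This is a minimal sketch, not the original code: the seed is an assumption added for reproducibility, and the theoretical value uses the population median of X and first quartile of Y rather than the sample estimates.

```r
set.seed(1)                      # assumed seed; the original does not set one
N <- 44
n <- 10000
X <- runif(n, 1, N)
Y <- rnorm(n, (N + 1) / 2, (N + 1) / 2)
x <- median(X)
y <- as.numeric(quantile(Y, 0.25))

# The mean of a logical vector is the proportion of TRUE values
P_a <- mean(X[X > y] > x)        # P(X > x | X > y)
P_c <- mean(X[X > y] < x)        # P(X < x | X > y)

# Theoretical counterpart: P(X > x) / P(X > y) when x > y
x_th <- (1 + N) / 2                                  # population median of X
y_th <- qnorm(0.25, (N + 1) / 2, (N + 1) / 2)        # population 1st quartile of Y
P_a_th <- punif(x_th, 1, N, lower.tail = FALSE) /
  punif(y_th, 1, N, lower.tail = FALSE)
```

For N = 44 the theoretical value is about 0.586, which is consistent with the 0.58 reported above; the two conditional probabilities sum to 1 because they partition the event X > y.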

5 points Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

library(kableExtra)
#calculate missing options
X_le_x <- X[which(X<=x)]
Y_le_y <- Y[which(Y<=y)]
X_gr_x_and_Y_gr_y <- length(X_gr_x)/n * length(Y_gr_y)/n
X_gr_x_and_Y_le_y <- length(X_gr_x)/n * length(Y_le_y)/n
X_le_x_and_Y_gr_y <- length(X_le_x)/n * length(Y_gr_y)/n
X_le_x_and_Y_le_y <- length(X_le_x)/n * length(Y_le_y)/n
Tot_X_gr_x <- X_gr_x_and_Y_gr_y + X_gr_x_and_Y_le_y
Tot_X_le_x <- X_le_x_and_Y_gr_y + X_le_x_and_Y_le_y
Tot_Y_gr_y <- X_gr_x_and_Y_gr_y + X_le_x_and_Y_gr_y
Tot_Y_le_y <- X_gr_x_and_Y_le_y + X_le_x_and_Y_le_y

d <- matrix(c(X_gr_x_and_Y_gr_y,X_gr_x_and_Y_le_y,Tot_X_gr_x,X_le_x_and_Y_gr_y,X_le_x_and_Y_le_y,Tot_X_le_x, Tot_Y_gr_y,Tot_Y_le_y,Tot_X_gr_x+Tot_X_le_x), ncol = 3, byrow=TRUE)
colnames(d) <- c("Y>y", "Y\u2264y","Total")
rownames(d) <- c("X>x", "X\u2264x","Total")

Joint Probabilities

	Y>y	Y≤y	Total
X>x	0.375	0.125	0.5
X≤x	0.375	0.125	0.5
Total	0.750	0.250	1.0

From the table we see that P(X>x)P(Y>y) = 0.375.

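From part b we see that \(P(X>x, Y>y)\) is also 0.375 (reported there as 0.38 after rounding). Note that both figures were computed as the product of the marginal proportions, so they agree by construction; an independent check requires counting the observations where both conditions hold.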
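One way to compare the joint and marginal probabilities without assuming independence is to count the rows where both conditions hold. The sketch below regenerates the data with an assumed seed, so its numbers will differ slightly from the table above; `joint_counts`, `P_joint` and `P_product` are names introduced here for illustration.

```r
set.seed(1)                      # assumed seed; the original does not set one
N <- 44
n <- 10000
X <- runif(n, 1, N)
Y <- rnorm(n, (N + 1) / 2, (N + 1) / 2)
x <- median(X)
y <- as.numeric(quantile(Y, 0.25))

# Observed joint counts: each cell counts draws where both conditions hold
joint_counts <- table(X > x, Y > y, dnn = c("X>x", "Y>y"))

# Joint proportions with row and column totals (the marginal probabilities)
joint_probs <- addmargins(prop.table(joint_counts))

P_joint   <- mean(X > x & Y > y)          # observed P(X>x, Y>y)
P_product <- mean(X > x) * mean(Y > y)    # P(X>x) * P(Y>y)
```

If X and Y are independent, `P_joint` and `P_product` should differ only by sampling noise.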

5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Count table:

c_t <- d[1:2,1:2]*n
fisher.test(round(c_t))

## 
##  Fisher's Exact Test for Count Data
## 
## data:  round(c_t)
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9125 1.0959
## sample estimates:
## odds ratio 
##          1

chisq.test(c_t)

## 
##  Pearson's Chi-squared test
## 
## data:  c_t
## X-squared = 0, df = 1, p-value = 1

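With such a high p-value, we fail to reject the null hypothesis of independence, which is what we would expect since the variables were generated independently. Note, however, that the count table was built from products of the marginal proportions, so a p-value of 1 is guaranteed by construction; a table of observed joint counts would give a more meaningful test. As for the difference between the two tests: Fisher’s Exact Test computes an exact p-value from the hypergeometric distribution and is preferred when expected cell counts are small, while the Chi-square test relies on a large-sample approximation. As we have a fairly large sample size, Chi-square is the more appropriate choice, though at large cell values we would expect the two to yield similar results (as we see they do).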
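As a hedged illustration, both tests can be run on observed joint counts rather than on a table derived from the marginals. The data are regenerated here with an assumed seed, so the p-values will not match the output above.

```r
set.seed(1)                      # assumed seed; the original does not set one
N <- 44
n <- 10000
X <- runif(n, 1, N)
Y <- rnorm(n, (N + 1) / 2, (N + 1) / 2)
x <- median(X)
y <- as.numeric(quantile(Y, 0.25))

# 2x2 table of observed counts
counts <- table(X > x, Y > y, dnn = c("X>x", "Y>y"))

chi <- chisq.test(counts)    # Yates' continuity correction is applied by default for 2x2 tables
fis <- fisher.test(counts)   # exact test based on the hypergeometric distribution

p_values <- c(chi_square = chi$p.value, fisher = fis$p.value)
```

With cell counts in the thousands the two p-values should be close, which is why the cheaper Chi-square approximation is usually considered adequate at this sample size.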

Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following.

Descriptive and Inferential Statistics

5 points. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

summary(P2_train)

##        Id         MSSubClass       MSZoning     LotFrontage 
##  Min.   :   1   Min.   : 20.0   C (all):  10   Min.   : 21  
##  1st Qu.: 366   1st Qu.: 20.0   FV     :  65   1st Qu.: 59  
##  Median : 730   Median : 50.0   RH     :  16   Median : 69  
##  Mean   : 730   Mean   : 56.9   RL     :1151   Mean   : 70  
##  3rd Qu.:1095   3rd Qu.: 70.0   RM     : 218   3rd Qu.: 80  
##  Max.   :1460   Max.   :190.0                  Max.   :313  
##                                                NA's   :259  
##     LotArea        Street      Alley      LotShape  LandContour
##  Min.   :  1300   Grvl:   6   Grvl:  50   IR1:484   Bnk:  63   
##  1st Qu.:  7554   Pave:1454   Pave:  41   IR2: 41   HLS:  50   
##  Median :  9478               NA's:1369   IR3: 10   Low:  36   
##  Mean   : 10517                           Reg:925   Lvl:1311   
##  3rd Qu.: 11602                                                
##  Max.   :215245                                                
##                                                                
##   Utilities      LotConfig    LandSlope   Neighborhood   Condition1  
##  AllPub:1459   Corner : 263   Gtl:1382   NAmes  :225   Norm   :1260  
##  NoSeWa:   1   CulDSac:  94   Mod:  65   CollgCr:150   Feedr  :  81  
##                FR2    :  47   Sev:  13   OldTown:113   Artery :  48  
##                FR3    :   4              Edwards:100   RRAn   :  26  
##                Inside :1052              Somerst: 86   PosN   :  19  
##                                          Gilbert: 79   RRAe   :  11  
##                                          (Other):707   (Other):  15  
##    Condition2     BldgType      HouseStyle   OverallQual    OverallCond  
##  Norm   :1445   1Fam  :1220   1Story :726   Min.   : 1.0   Min.   :1.00  
##  Feedr  :   6   2fmCon:  31   2Story :445   1st Qu.: 5.0   1st Qu.:5.00  
##  Artery :   2   Duplex:  52   1.5Fin :154   Median : 6.0   Median :5.00  
##  PosN   :   2   Twnhs :  43   SLvl   : 65   Mean   : 6.1   Mean   :5.58  
##  RRNn   :   2   TwnhsE: 114   SFoyer : 37   3rd Qu.: 7.0   3rd Qu.:6.00  
##  PosA   :   1                 1.5Unf : 14   Max.   :10.0   Max.   :9.00  
##  (Other):   2                 (Other): 19                                
##    YearBuilt     YearRemodAdd    RoofStyle       RoofMatl     Exterior1st 
##  Min.   :1872   Min.   :1950   Flat   :  13   CompShg:1434   VinylSd:515  
##  1st Qu.:1954   1st Qu.:1967   Gable  :1141   Tar&Grv:  11   HdBoard:222  
##  Median :1973   Median :1994   Gambrel:  11   WdShngl:   6   MetalSd:220  
##  Mean   :1971   Mean   :1985   Hip    : 286   WdShake:   5   Wd Sdng:206  
##  3rd Qu.:2000   3rd Qu.:2004   Mansard:   7   ClyTile:   1   Plywood:108  
##  Max.   :2010   Max.   :2010   Shed   :   2   Membran:   1   CemntBd: 61  
##                                               (Other):   2   (Other):128  
##   Exterior2nd    MasVnrType    MasVnrArea   ExterQual ExterCond
##  VinylSd:504   BrkCmn : 15   Min.   :   0   Ex: 52    Ex:   3  
##  MetalSd:214   BrkFace:445   1st Qu.:   0   Fa: 14    Fa:  28  
##  HdBoard:207   None   :864   Median :   0   Gd:488    Gd: 146  
##  Wd Sdng:197   Stone  :128   Mean   : 104   TA:906    Po:   1  
##  Plywood:142   NA's   :  8   3rd Qu.: 166             TA:1282  
##  CmentBd: 60                 Max.   :1600                      
##  (Other):136                 NA's   :8                         
##   Foundation  BsmtQual   BsmtCond    BsmtExposure BsmtFinType1
##  BrkTil:146   Ex  :121   Fa  :  45   Av  :221     ALQ :220    
##  CBlock:634   Fa  : 35   Gd  :  65   Gd  :134     BLQ :148    
##  PConc :647   Gd  :618   Po  :   2   Mn  :114     GLQ :418    
##  Slab  : 24   TA  :649   TA  :1311   No  :953     LwQ : 74    
##  Stone :  6   NA's: 37   NA's:  37   NA's: 38     Rec :133    
##  Wood  :  3                                       Unf :430    
##                                                   NA's: 37    
##    BsmtFinSF1   BsmtFinType2   BsmtFinSF2       BsmtUnfSF   
##  Min.   :   0   ALQ :  19    Min.   :   0.0   Min.   :   0  
##  1st Qu.:   0   BLQ :  33    1st Qu.:   0.0   1st Qu.: 223  
##  Median : 384   GLQ :  14    Median :   0.0   Median : 478  
##  Mean   : 444   LwQ :  46    Mean   :  46.5   Mean   : 567  
##  3rd Qu.: 712   Rec :  54    3rd Qu.:   0.0   3rd Qu.: 808  
##  Max.   :5644   Unf :1256    Max.   :1474.0   Max.   :2336  
##                 NA's:  38                                   
##   TotalBsmtSF    Heating     HeatingQC CentralAir Electrical  
##  Min.   :   0   Floor:   1   Ex:741    N:  95     FuseA:  94  
##  1st Qu.: 796   GasA :1428   Fa: 49    Y:1365     FuseF:  27  
##  Median : 992   GasW :  18   Gd:241               FuseP:   3  
##  Mean   :1057   Grav :   7   Po:  1               Mix  :   1  
##  3rd Qu.:1298   OthW :   2   TA:428               SBrkr:1334  
##  Max.   :6110   Wall :   4                        NA's :   1  
##                                                               
##    X1stFlrSF      X2ndFlrSF     LowQualFinSF     GrLivArea   
##  Min.   : 334   Min.   :   0   Min.   :  0.0   Min.   : 334  
##  1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.0   1st Qu.:1130  
##  Median :1087   Median :   0   Median :  0.0   Median :1464  
##  Mean   :1163   Mean   : 347   Mean   :  5.8   Mean   :1515  
##  3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.0   3rd Qu.:1777  
##  Max.   :4692   Max.   :2065   Max.   :572.0   Max.   :5642  
##                                                              
##   BsmtFullBath    BsmtHalfBath       FullBath       HalfBath    
##  Min.   :0.000   Min.   :0.0000   Min.   :0.00   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:1.00   1st Qu.:0.000  
##  Median :0.000   Median :0.0000   Median :2.00   Median :0.000  
##  Mean   :0.425   Mean   :0.0575   Mean   :1.57   Mean   :0.383  
##  3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.:2.00   3rd Qu.:1.000  
##  Max.   :3.000   Max.   :2.0000   Max.   :3.00   Max.   :2.000  
##                                                                 
##   BedroomAbvGr   KitchenAbvGr  KitchenQual  TotRmsAbvGrd   Functional 
##  Min.   :0.00   Min.   :0.00   Ex:100      Min.   : 2.00   Maj1:  14  
##  1st Qu.:2.00   1st Qu.:1.00   Fa: 39      1st Qu.: 5.00   Maj2:   5  
##  Median :3.00   Median :1.00   Gd:586      Median : 6.00   Min1:  31  
##  Mean   :2.87   Mean   :1.05   TA:735      Mean   : 6.52   Min2:  34  
##  3rd Qu.:3.00   3rd Qu.:1.00               3rd Qu.: 7.00   Mod :  15  
##  Max.   :8.00   Max.   :3.00               Max.   :14.00   Sev :   1  
##                                                            Typ :1360  
##    Fireplaces    FireplaceQu   GarageType   GarageYrBlt   GarageFinish
##  Min.   :0.000   Ex  : 24    2Types :  6   Min.   :1900   Fin :352    
##  1st Qu.:0.000   Fa  : 33    Attchd :870   1st Qu.:1961   RFn :422    
##  Median :1.000   Gd  :380    Basment: 19   Median :1980   Unf :605    
##  Mean   :0.613   Po  : 20    BuiltIn: 88   Mean   :1978   NA's: 81    
##  3rd Qu.:1.000   TA  :313    CarPort:  9   3rd Qu.:2002               
##  Max.   :3.000   NA's:690    Detchd :387   Max.   :2010               
##                              NA's   : 81   NA's   :81                 
##    GarageCars     GarageArea   GarageQual  GarageCond  PavedDrive
##  Min.   :0.00   Min.   :   0   Ex  :   3   Ex  :   2   N:  90    
##  1st Qu.:1.00   1st Qu.: 334   Fa  :  48   Fa  :  35   P:  30    
##  Median :2.00   Median : 480   Gd  :  14   Gd  :   9   Y:1340    
##  Mean   :1.77   Mean   : 473   Po  :   3   Po  :   7             
##  3rd Qu.:2.00   3rd Qu.: 576   TA  :1311   TA  :1326             
##  Max.   :4.00   Max.   :1418   NA's:  81   NA's:  81             
##                                                                  
##    WoodDeckSF     OpenPorchSF    EnclosedPorch   X3SsnPorch   
##  Min.   :  0.0   Min.   :  0.0   Min.   :  0   Min.   :  0.0  
##  1st Qu.:  0.0   1st Qu.:  0.0   1st Qu.:  0   1st Qu.:  0.0  
##  Median :  0.0   Median : 25.0   Median :  0   Median :  0.0  
##  Mean   : 94.2   Mean   : 46.7   Mean   : 22   Mean   :  3.4  
##  3rd Qu.:168.0   3rd Qu.: 68.0   3rd Qu.:  0   3rd Qu.:  0.0  
##  Max.   :857.0   Max.   :547.0   Max.   :552   Max.   :508.0  
##                                                               
##   ScreenPorch       PoolArea      PoolQC       Fence      MiscFeature
##  Min.   :  0.0   Min.   :  0.0   Ex  :   2   GdPrv:  59   Gar2:   2  
##  1st Qu.:  0.0   1st Qu.:  0.0   Fa  :   2   GdWo :  54   Othr:   2  
##  Median :  0.0   Median :  0.0   Gd  :   3   MnPrv: 157   Shed:  49  
##  Mean   : 15.1   Mean   :  2.8   NA's:1453   MnWw :  11   TenC:   1  
##  3rd Qu.:  0.0   3rd Qu.:  0.0               NA's :1179   NA's:1406  
##  Max.   :480.0   Max.   :738.0                                       
##                                                                      
##     MiscVal          MoSold          YrSold        SaleType   
##  Min.   :    0   Min.   : 1.00   Min.   :2006   WD     :1267  
##  1st Qu.:    0   1st Qu.: 5.00   1st Qu.:2007   New    : 122  
##  Median :    0   Median : 6.00   Median :2008   COD    :  43  
##  Mean   :   43   Mean   : 6.32   Mean   :2008   ConLD  :   9  
##  3rd Qu.:    0   3rd Qu.: 8.00   3rd Qu.:2009   ConLI  :   5  
##  Max.   :15500   Max.   :12.00   Max.   :2010   ConLw  :   5  
##                                                 (Other):   9  
##  SaleCondition    SalePrice     
##  Abnorml: 101   Min.   : 34900  
##  AdjLand:   4   1st Qu.:129975  
##  Alloca :  12   Median :163000  
##  Family :  20   Mean   :180921  
##  Normal :1198   3rd Qu.:214000  
##  Partial: 125   Max.   :755000  
##

As we see, the data contains 80 variables, including the ID of the house.

Let’s look at some of these variables in more detail:

The categorical variable ‘Neighborhood’ provides an interesting snapshot of where more expensive homes can be found. Northridge, Northridge Heights and Stone Brook in particular stand out as the areas with the most expensive homes.

## Loading required package: ggplot2

The variable ‘OverallCond’ rates the overall condition of the house on a scale of 1 to 10, with 1 being “Very Poor” and 10 being “Very Excellent”. Below is a histogram depicting the distribution of this variable within the training dataset along with a line indicating the mean value.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    5.00    5.00    5.58    6.00    9.00

The variable ‘X1stFlrSF’ indicates the square footage of the first floor. It is right-skewed with a mean of 1163 sq ft.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334     882    1087    1163    1391    4692

The ‘TotRmsAbvGrd’ variable indicates how many total rooms the house has above grade (excluding bathrooms).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    5.00    6.00    6.52    7.00   14.00

Let us also take a look at the distribution of final Sales Price (our dependent variable for this dataset).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

To view the relationships between these three independent variables (and a couple of other ones) and our dependent variable, let’s take a look at the scatterplot matrix for them:

plot(P2_train[,c("OverallCond","X1stFlrSF", "X2ndFlrSF", "BedroomAbvGr", "TotRmsAbvGrd", "SalePrice")])

The correlation matrix for these variables is then:

require(ggcorrplot)

## Loading required package: ggcorrplot

## Warning: package 'ggcorrplot' was built under R version 3.6.3

cor_m <- cor(P2_train[,c("OverallCond","X1stFlrSF", "X2ndFlrSF", "BedroomAbvGr", "TotRmsAbvGrd", "SalePrice")])
cor_m

##              OverallCond X1stFlrSF X2ndFlrSF BedroomAbvGr TotRmsAbvGrd
## OverallCond      1.00000   -0.1442   0.02894      0.01298     -0.05758
## X1stFlrSF       -0.14420    1.0000  -0.20265      0.12740      0.40952
## X2ndFlrSF        0.02894   -0.2026   1.00000      0.50290      0.61642
## BedroomAbvGr     0.01298    0.1274   0.50290      1.00000      0.67662
## TotRmsAbvGrd    -0.05758    0.4095   0.61642      0.67662      1.00000
## SalePrice       -0.07786    0.6059   0.31933      0.16821      0.53372
##              SalePrice
## OverallCond   -0.07786
## X1stFlrSF      0.60585
## X2ndFlrSF      0.31933
## BedroomAbvGr   0.16821
## TotRmsAbvGrd   0.53372
## SalePrice      1.00000

ggcorrplot(cor_m)

Correlation testing between selected pairs to test the hypothesis: “The correlation between each pairwise set of variables is 0”.

To start, the plot above shows that most of the correlations between pairs are not in fact 0. But let us check each pairing.

Pairings

All pairings with BedroomAbvGr

cor.test(~BedroomAbvGr+OverallCond, data = P2_train, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  BedroomAbvGr and OverallCond
## t = 0.5, df = 1458, p-value = 0.6
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  -0.02059  0.04652
## sample estimates:
##     cor 
## 0.01298

cor.test(~BedroomAbvGr+X1stFlrSF, data = P2_train, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  BedroomAbvGr and X1stFlrSF
## t = 4.9, df = 1458, p-value = 0.000001
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.09424 0.16028
## sample estimates:
##    cor 
## 0.1274

cor.test(~BedroomAbvGr+X2ndFlrSF, data = P2_train, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  BedroomAbvGr and X2ndFlrSF
## t = 22, df = 1458, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4774 0.5276
## sample estimates:
##    cor 
## 0.5029

cor.test(~BedroomAbvGr+TotRmsAbvGrd, data = P2_train, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  BedroomAbvGr and TotRmsAbvGrd
## t = 35, df = 1458, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6580 0.6944
## sample estimates:
##    cor 
## 0.6766

cor.test(~BedroomAbvGr+SalePrice, data = P2_train, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  BedroomAbvGr and SalePrice
## t = 6.5, df = 1458, p-value = 0.0000000001
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.1354 0.2006
## sample estimates:
##    cor 
## 0.1682

Remaining pairings with OverallCond

cor.test(~OverallCond+X1stFlrSF, data = P2_train, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  OverallCond and X1stFlrSF
## t = -5.6, df = 1458, p-value = 0.00000003
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  -0.1769 -0.1112
## sample estimates:
##     cor 
## -0.1442

cor.test(~OverallCond+X2ndFlrSF, data = P2_train, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  OverallCond and X2ndFlrSF
## t = 1.1, df = 1458, p-value = 0.3
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  -0.004624  0.062443
## sample estimates:
##     cor 
## 0.02894

cor.test(~OverallCond+TotRmsAbvGrd, data = P2_train, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  OverallCond and TotRmsAbvGrd
## t = -2.2, df = 1458, p-value = 0.03
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  -0.09097 -0.02407
## sample estimates:
##      cor 
## -0.05758

cor.test(~OverallCond+SalePrice, data = P2_train, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  OverallCond and SalePrice
## t = -3, df = 1458, p-value = 0.003
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  -0.11113 -0.04441
## sample estimates:
##      cor 
## -0.07786

Remaining pairings with X1stFlrSF

cor.test(~X1stFlrSF+X2ndFlrSF, data = P2_train, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  X1stFlrSF and X2ndFlrSF
## t = -7.9, df = 1458, p-value = 0.000000000000005
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  -0.2346 -0.1702
## sample estimates:
##     cor 
## -0.2026

cor.test(~X1stFlrSF+TotRmsAbvGrd, data = P2_train, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  X1stFlrSF and TotRmsAbvGrd
## t = 17, df = 1458, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.3812 0.4371
## sample estimates:
##    cor 
## 0.4095

cor.test(~X1stFlrSF+SalePrice, data = P2_train, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  X1stFlrSF and SalePrice
## t = 29, df = 1458, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5842 0.6267
## sample estimates:
##    cor 
## 0.6059

Remaining pairings with X2ndFlrSF

cor.test(~X2ndFlrSF+TotRmsAbvGrd, data = P2_train, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  X2ndFlrSF and TotRmsAbvGrd
## t = 30, df = 1458, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5952 0.6368
## sample estimates:
##    cor 
## 0.6164

cor.test(~X2ndFlrSF+SalePrice, data = P2_train, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  X2ndFlrSF and SalePrice
## t = 13, df = 1458, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2889 0.3492
## sample estimates:
##    cor 
## 0.3193

Remaining pairing with TotRmsAbvGrd

cor.test(~TotRmsAbvGrd+SalePrice, data = P2_train, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  TotRmsAbvGrd and SalePrice
## t = 24, df = 1458, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5093 0.5573
## sample estimates:
##    cor 
## 0.5337

Conclusion

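Thirteen of the fifteen tests reject the null hypothesis of zero correlation at the 0.20 significance level implied by the 80% confidence intervals. (The line “alternative hypothesis: true correlation is not equal to 0” in each output only states the alternative; it is not the test’s verdict.) This makes sense, since each variable was collected from the same homes - they are related in some way. In two cases, BedroomAbvGr with OverallCond (p = 0.6) and OverallCond with X2ndFlrSF (p = 0.3), 0 lies within the 80% confidence interval, so we cannot reject the null hypothesis for those pairs. With 15 tests each run at the 0.20 level, the chance of at least one false rejection, if every null hypothesis were true and the tests were independent, would be 1 - 0.8^15 ≈ 0.96, so I would be wary of family-wise error. Most of the p-values are small enough to survive a Bonferroni correction (0.20/15 ≈ 0.013), but the OverallCond and TotRmsAbvGrd pairing (p = 0.03) would not.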
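A family-wise adjustment can be sketched with `p.adjust()` from base R. The data frame below is simulated purely for illustration (the column names `a`, `b`, `c` and the seed are hypothetical, not from the Kaggle data); the same pattern would apply to the six housing variables.

```r
set.seed(42)                     # assumed seed for the simulated stand-in data
n <- 200
df <- data.frame(a = rnorm(n))
df$b <- df$a + rnorm(n)          # correlated with a by construction
df$c <- rnorm(n)                 # independent of a and b

# All pairwise Pearson tests
pairs <- combn(names(df), 2)
p_raw <- apply(pairs, 2, function(v) cor.test(df[[v[1]]], df[[v[2]]])$p.value)
names(p_raw) <- apply(pairs, 2, paste, collapse = "~")

# Holm's step-down adjustment controls the family-wise error rate
p_holm <- p.adjust(p_raw, method = "holm")

# Chance of at least one false rejection across 15 independent tests at level 0.20
fwer_15 <- 1 - (1 - 0.20)^15
```

Holm's method is used here because it is never less powerful than Bonferroni while giving the same family-wise guarantee; `method = "bonferroni"` would work the same way.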

Linear Algebra and Correlation

I_cor <- solve(cor_m)
I_cor

##              OverallCond X1stFlrSF X2ndFlrSF BedroomAbvGr TotRmsAbvGrd
## OverallCond      1.02389    0.1724   0.02925     -0.07101      0.04221
## X1stFlrSF        0.17239    3.4116   2.47397     -0.11909     -1.85229
## X2ndFlrSF        0.02925    2.4740   3.45389     -0.36048     -2.15556
## BedroomAbvGr    -0.07101   -0.1191  -0.36048      2.09611     -1.48249
## TotRmsAbvGrd     0.04221   -1.8523  -2.15556     -1.48249      4.18325
## SalePrice       -0.04465   -1.8349  -1.38842      0.62039     -0.16948
##              SalePrice
## OverallCond   -0.04465
## X1stFlrSF     -1.83487
## X2ndFlrSF     -1.38842
## BedroomAbvGr   0.62039
## TotRmsAbvGrd  -0.16948
## SalePrice      2.53765

round(cor_m %*% I_cor,6)

##              OverallCond X1stFlrSF X2ndFlrSF BedroomAbvGr TotRmsAbvGrd
## OverallCond            1         0         0            0            0
## X1stFlrSF              0         1         0            0            0
## X2ndFlrSF              0         0         1            0            0
## BedroomAbvGr           0         0         0            1            0
## TotRmsAbvGrd           0         0         0            0            1
## SalePrice              0         0         0            0            0
##              SalePrice
## OverallCond          0
## X1stFlrSF            0
## X2ndFlrSF            0
## BedroomAbvGr         0
## TotRmsAbvGrd         0
## SalePrice            1

round(I_cor %*% cor_m,6)

##              OverallCond X1stFlrSF X2ndFlrSF BedroomAbvGr TotRmsAbvGrd
## OverallCond            1         0         0            0            0
## X1stFlrSF              0         1         0            0            0
## X2ndFlrSF              0         0         1            0            0
## BedroomAbvGr           0         0         0            1            0
## TotRmsAbvGrd           0         0         0            0            1
## SalePrice              0         0         0            0            0
##              SalePrice
## OverallCond          0
## X1stFlrSF            0
## X2ndFlrSF            0
## BedroomAbvGr         0
## TotRmsAbvGrd         0
## SalePrice            1

Both products yield the identity matrix, as expected from the product of two matrices which are the inverse of one another. Below is the LU decomposition of our correlation matrix:

require(matrixcalc)

## Loading required package: matrixcalc

dec_mat <- lu.decomposition(cor_m)
dec_mat

## $L
##          [,1]    [,2]   [,3]    [,4]    [,5] [,6]
## [1,]  1.00000  0.0000 0.0000  0.0000 0.00000    0
## [2,] -0.14420  1.0000 0.0000  0.0000 0.00000    0
## [3,]  0.02894 -0.2027 1.0000  0.0000 0.00000    0
## [4,]  0.01298  0.1320 0.5514  1.0000 0.00000    0
## [5,] -0.05758  0.4097 0.7294  0.3454 1.00000    0
## [6,] -0.07786  0.6073 0.4610 -0.2214 0.06679    1
## 
## $U
##      [,1]    [,2]     [,3]                     [,4]     [,5]     [,6]
## [1,]    1 -0.1442  0.02894  0.012980060094550580768 -0.05758 -0.07786
## [2,]    0  0.9792 -0.19847  0.129272510195122647403  0.40121  0.59463
## [3,]    0  0.0000  0.95893  0.528726854416190938935  0.69941  0.44211
## [4,]    0  0.0000  0.00000  0.691241584946385878574  0.23877 -0.15304
## [5,]    0  0.0000  0.00000 -0.000000000000000027756  0.23970  0.01601
## [6,]    0  0.0000  0.00000  0.000000000000000001854  0.00000  0.39407

#To confirm, let us multiply the L and U matrices and see if we get our original matrix.

dec_mat$L %*% dec_mat$U

##          [,1]    [,2]     [,3]    [,4]     [,5]     [,6]
## [1,]  1.00000 -0.1442  0.02894 0.01298 -0.05758 -0.07786
## [2,] -0.14420  1.0000 -0.20265 0.12740  0.40952  0.60585
## [3,]  0.02894 -0.2026  1.00000 0.50290  0.61642  0.31933
## [4,]  0.01298  0.1274  0.50290 1.00000  0.67662  0.16821
## [5,] -0.05758  0.4095  0.61642 0.67662  1.00000  0.53372
## [6,] -0.07786  0.6059  0.31933 0.16821  0.53372  1.00000

cor_m

##              OverallCond X1stFlrSF X2ndFlrSF BedroomAbvGr TotRmsAbvGrd
## OverallCond      1.00000   -0.1442   0.02894      0.01298     -0.05758
## X1stFlrSF       -0.14420    1.0000  -0.20265      0.12740      0.40952
## X2ndFlrSF        0.02894   -0.2026   1.00000      0.50290      0.61642
## BedroomAbvGr     0.01298    0.1274   0.50290      1.00000      0.67662
## TotRmsAbvGrd    -0.05758    0.4095   0.61642      0.67662      1.00000
## SalePrice       -0.07786    0.6059   0.31933      0.16821      0.53372
##              SalePrice
## OverallCond   -0.07786
## X1stFlrSF      0.60585
## X2ndFlrSF      0.31933
## BedroomAbvGr   0.16821
## TotRmsAbvGrd   0.53372
## SalePrice      1.00000
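For readers who want to see what the factorization does, here is a minimal base-R sketch of Doolittle elimination without pivoting. It is an illustration under stated assumptions, not the algorithm `matrixcalc` is known to use: the function name `lu_doolittle` and the small 3x3 matrix are hypothetical, and the method assumes no zero pivots are encountered.

```r
# Doolittle LU factorization without pivoting (assumes nonzero pivots)
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)                   # unit lower-triangular factor
  U <- A                         # reduced in place to upper-triangular form
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]            # elimination multiplier
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]     # subtract a multiple of the pivot row
    }
  }
  list(L = L, U = U)
}

# Small symmetric matrix used only for illustration
A <- matrix(c( 1.00, -0.14,  0.03,
              -0.14,  1.00, -0.20,
               0.03, -0.20,  1.00), nrow = 3, byrow = TRUE)
f <- lu_doolittle(A)
reconstructed <- f$L %*% f$U
```

Multiplying the factors back together should recover `A` up to floating-point error, which is the same check performed on the correlation matrix above.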

Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

As we saw above, the First Floor Square Footage variable [X1stFlrSF] is right-skewed so it works for this question.

require(MASS)

## Loading required package: MASS

## Warning: package 'MASS' was built under R version 3.6.3

exp_fit <- fitdistr(P2_train$X1stFlrSF,"exponential")
exp_fit

##       rate   
##   0.00086012 
##  (0.00002251)

lambda <- exp_fit$estimate

The calculated \(\lambda\) for the X1stFlrSF variable is approximately 0.00086 (0.0009 to four decimal places).

exp_dist <- rexp(1000,rate = lambda)
hist(P2_train$X1stFlrSF, main = "Histogram of 1st Floor Sq Ft", xlab = "X1stFlrSF", col = 'red')

hist(exp_dist, main = "Histogram of Fitted Exponential Density Function", xlab = "Fitted Exp Fxn", col = 'blue')

The \(5^{th}\) and \(95^{th}\) percentiles using the CDF of our exponential distribution are:

qexp(c(0.05,0.95), lambda)

## [1]   59.63 3482.92
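The same percentiles follow from inverting the exponential CDF, \(F(q) = 1 - e^{-\lambda q}\), which gives \(q = -\ln(1-p)/\lambda\). The sketch below hard-codes the rate printed by `fitdistr` above rather than refitting, so it agrees with the output only up to rounding of that rate.

```r
lambda <- 0.00086012             # rate as printed by fitdistr above (rounded)
p <- c(0.05, 0.95)

# Closed-form inverse of the exponential CDF
q_closed <- -log(1 - p) / lambda

# Built-in quantile function for comparison
q_builtin <- qexp(p, rate = lambda)
```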

The central 95% interval of our fitted exponential distribution (its 2.5th and 97.5th percentiles) is:

qexp(.025,rate = lambda)

## [1] 29.44

qexp(.025, rate = lambda, lower.tail = FALSE)

## [1] 4289

Using the empirical data, we see that the 5th and 95th percentiles were in fact:

quantile(P2_train$X1stFlrSF,c(0.05,0.95))

##   5%  95% 
##  673 1831
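The prompt also asks for a 95% confidence interval from the empirical data assuming normality. A minimal sketch of two common readings of that request follows. Because the training data are not available here, a simulated right-skewed vector stands in for `P2_train$X1stFlrSF`; the gamma parameters and seed are assumptions chosen only to give a similar mean.

```r
set.seed(7)                                     # assumed seed
v <- rgamma(1460, shape = 9, rate = 9 / 1163)   # hypothetical stand-in for P2_train$X1stFlrSF

m <- mean(v)
s <- sd(v)
n <- length(v)

# Reading 1: 95% confidence interval for the mean (normal approximation)
ci_mean <- m + c(-1, 1) * qnorm(0.975) * s / sqrt(n)

# Reading 2: central 95% range of a normal distribution with the sample mean and sd
range_95 <- qnorm(c(0.025, 0.975), mean = m, sd = s)
```

The first interval describes uncertainty about the average, so it is narrow for a sample of this size; the second describes where individual values would fall if the data were normal, and is the one comparable to the empirical percentiles.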

Modeling

10 points. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Multiple Linear Regression

The first step is to select the variables used within the model. There are a total of 80 variables, so it’s doubtful that they will all make it into our final model. Let us begin by taking a subset of these 80 variables and cleaning them up by removing ‘NA’s and assigning numerical values where appropriate. We start by assigning linear numerical values to the variables where quality/condition was evaluated either on an “Excellent to Poor” scale or, in the case of the type of damage, a sliding scale of severity. In the remaining cases, an ‘NA’ appears to correspond to a lack of the feature in question, so it can be assigned a 0 value.

library(plyr)
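#Note: "Kitchen" and "KitchnQual" match no column name in P2_train (the summary above lists KitchenAbvGr and KitchenQual), so no kitchen variable is carried into Base
want <- c("LotArea","Neighborhood","OverallQual","OverallCond", "YearBuilt","ExterQual","BsmtQual","BsmtCond","TotalBsmtSF","HeatingQC","CentralAir","X1stFlrSF","X2ndFlrSF","FullBath","BsmtFullBath","BedroomAbvGr","Kitchen","KitchnQual","TotRmsAbvGrd","Functional","GarageArea","PoolArea","WoodDeckSF","MiscVal")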

Base <- P2_train[,(names(P2_train) %in% want)]

Model <- data.frame(lapply(Base, function(x) {mapvalues(x, c("Ex","Gd","TA","Fa","Po","Typ","Min1","Min2","Mod","Maj1","Maj2","Sev","Sal"), c(5,4,3,2,1,7,6,5,4,3,2,1,0))}))

Model[,-2] <- sapply(Model[,-2],as.numeric)

Model[is.na(Model)] <- 0

#Create 2 additional variables showing the total number of full baths and the square 
#footage of the first 2 floors.
Model$TotFullBath <- Model$FullBath+Model$BsmtFullBath
Model$TotSF <- Model$X1stFlrSF + Model$X2ndFlrSF
drop <- c("FullBath","BsmtFullBath", "X1stFlrSF", "X2ndFlrSF")
Model <- Model[,!(names(Model) %in% drop)]
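One point worth checking in the cleanup above: `as.numeric()` applied to a factor returns the internal level codes, not the labels. If the quality columns are still factors when they are converted (an assumption; it depends on how `P2_train` was read), the resulting numbers follow alphabetical level order rather than the intended 5-to-1 scale. The sketch below uses a small hypothetical vector `q` and a named lookup `scores` to show the difference.

```r
# Hypothetical quality ratings stored as a factor
q <- factor(c("Ex", "Gd", "TA", "Fa", "Gd"))

# Level codes follow the sorted levels: Ex = 1, Fa = 2, Gd = 3, TA = 4
codes <- as.numeric(q)

# Explicit lookup by label gives the intended ordinal scores
scores <- c(Ex = 5, Gd = 4, TA = 3, Fa = 2, Po = 1)
mapped <- unname(scores[as.character(q)])
```

With level codes, "Ex" receives the lowest value and "TA" the highest, which would reverse part of the scale; a lookup by label, or `as.numeric(as.character(...))` after relabelling the levels, avoids this.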

Let us now center and scale (standardize) the variables to reduce the impact of significantly larger values (square footage versus number of baths, etc.). Finally we will add the Sale Price back to our model data frame.

#Took approach from https://www.gastonsanchez.com/visually-enforced/how-to/2014/01/15/Center-data-in-R/
center_scale <- function(x){
  scale(x, scale = TRUE)
}

Model[,-2] <- center_scale(Model[,-2])
Model$SalePrice <- P2_train$SalePrice
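
As a small illustration of what `scale()` does with its defaults, the sketch below standardizes a toy data frame (made-up values, not the housing data) and checks the result against the z-score formula computed by hand:

```r
# Toy data with two columns on very different scales
toy <- data.frame(sqft = c(800, 1200, 1600, 2000), baths = c(1, 2, 2, 3))

# scale() centers (center = TRUE) and divides by the standard deviation (scale = TRUE)
z <- scale(toy)

# The same transformation by hand for one column
manual <- (toy$sqft - mean(toy$sqft)) / sd(toy$sqft)

print(all.equal(as.numeric(z[, "sqft"]), manual))  # TRUE
print(round(colMeans(z), 10))                      # both columns have mean 0
print(apply(z, 2, sd))                             # both columns have sd 1
```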

In this case I will start by creating a correlation matrix and then add variables using forward selection, with the correlation scores as a guide to which variable to add next.

correl <- cor(Model[,-2])
ggcorrplot(correl)

print(correl[,"SalePrice"])

##      LotArea  OverallQual  OverallCond    YearBuilt    ExterQual 
##      0.26384      0.79098     -0.07786      0.52290     -0.63688 
##     BsmtQual     BsmtCond  TotalBsmtSF    HeatingQC   CentralAir 
##     -0.43888      0.14737      0.61358     -0.40018      0.25133 
## BedroomAbvGr TotRmsAbvGrd   Functional   GarageArea   WoodDeckSF 
##      0.16821      0.53372      0.11533      0.62343      0.32441 
##     PoolArea      MiscVal  TotFullBath        TotSF    SalePrice 
##      0.09240     -0.02119      0.58293      0.71688      1.00000
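
Correlations like those printed above can be ranked by absolute value, so that strong negative correlations are not overlooked. A minimal sketch on toy data; `rank_by_abs_cor` is a hypothetical helper name, not part of the original analysis:

```r
# Toy data: y is an exact linear function of a, so cor(a, y) is 1
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(5, 3, 4, 1, 2),
  c = c(3, 1, 2, 5, 4)
)
toy$y <- 2 * toy$a + 1

# Rank every predictor by the absolute value of its correlation with the target
rank_by_abs_cor <- function(df, target) {
  r <- cor(df)[, target]
  r <- r[names(r) != target]           # drop the target's correlation with itself
  r[order(abs(r), decreasing = TRUE)]  # strongest first, sign preserved
}

ranked <- rank_by_abs_cor(toy, "y")
print(names(ranked))  # "a" "b" "c"
```

Applied to this report's data, a call such as `head(rank_by_abs_cor(Model[, -2], "SalePrice"), 5)` would list the five strongest candidates.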

To start our model, let’s select the 5 variables with the highest absolute correlation with SalePrice: ‘OverallQual’, ‘TotSF’, ‘ExterQual’, ‘GarageArea’ and ‘TotalBsmtSF’. We will also include the ‘Neighborhood’ variable, for which we could not calculate a correlation since it is not numeric; lm() represents it with dummy variables.

multi <- lm(SalePrice~OverallQual + TotSF + ExterQual + GarageArea + TotalBsmtSF + Neighborhood, data = Model)
summary(multi)

## 
## Call:
## lm(formula = SalePrice ~ OverallQual + TotSF + ExterQual + GarageArea + 
##     TotalBsmtSF + Neighborhood, data = Model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -428443  -15434    -233   13721  266435 
## 
## Coefficients:
##                     Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)           168251       8542   19.70 < 0.0000000000000002 ***
## OverallQual            21051       1612   13.06 < 0.0000000000000002 ***
## TotSF                  23929       1303   18.36 < 0.0000000000000002 ***
## ExterQual              -9422       1315   -7.16     0.00000000000124 ***
## GarageArea              8346       1229    6.79     0.00000000001612 ***
## TotalBsmtSF             9013       1204    7.49     0.00000000000012 ***
## NeighborhoodBlueste    -8382      26023   -0.32               0.7474    
## NeighborhoodBrDale    -17309      12304   -1.41               0.1597    
## NeighborhoodBrkSide     3546       9855    0.36               0.7191    
## NeighborhoodClearCr    35018      10890    3.22               0.0013 ** 
## NeighborhoodCollgCr    14041       8917    1.57               0.1156    
## NeighborhoodCrawfor    31256       9914    3.15               0.0017 ** 
## NeighborhoodEdwards    -4093       9411   -0.43               0.6637    
## NeighborhoodGilbert    14458       9410    1.54               0.1246    
## NeighborhoodIDOTRR    -14731      10462   -1.41               0.1593    
## NeighborhoodMeadowV    -1306      12243   -0.11               0.9151    
## NeighborhoodMitchel     9380       9958    0.94               0.3464    
## NeighborhoodNAmes       5605       8982    0.62               0.5327    
## NeighborhoodNoRidge    66639      10272    6.49     0.00000000011987 ***
## NeighborhoodNPkVill    -2172      14417   -0.15               0.8803    
## NeighborhoodNridgHt    57602       9439    6.10     0.00000000134249 ***
## NeighborhoodNWAmes      6591       9541    0.69               0.4898    
## NeighborhoodOldTown   -14068       9325   -1.51               0.1316    
## NeighborhoodSawyer      9145       9627    0.95               0.3423    
## NeighborhoodSawyerW     8788       9649    0.91               0.3626    
## NeighborhoodSomerst    18911       9258    2.04               0.0413 *  
## NeighborhoodStoneBr    66493      10955    6.07     0.00000000164078 ***
## NeighborhoodSWISU      -8002      11252   -0.71               0.4771    
## NeighborhoodTimber     30859      10172    3.03               0.0025 ** 
## NeighborhoodVeenker    47655      13440    3.55               0.0004 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34700 on 1430 degrees of freedom
## Multiple R-squared:  0.813,  Adjusted R-squared:  0.81 
## F-statistic:  215 on 29 and 1430 DF,  p-value: <0.0000000000000002

We can immediately see very low p-values for each of the five numeric predictors, and several Neighborhood levels are significant as well. Let us proceed by adding more variables.

multi <- update(multi, .~. + TotFullBath + YearBuilt + TotRmsAbvGrd, data = Model)
summary(multi)

## 
## Call:
## lm(formula = SalePrice ~ OverallQual + TotSF + ExterQual + GarageArea + 
##     TotalBsmtSF + Neighborhood + TotFullBath + YearBuilt + TotRmsAbvGrd, 
##     data = Model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -413816  -14605    -135   13873  268557 
## 
## Coefficients:
##                      Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)         165258.19    8661.32   19.08 < 0.0000000000000002 ***
## OverallQual          20543.19    1618.67   12.69 < 0.0000000000000002 ***
## TotSF                21065.09    2043.60   10.31 < 0.0000000000000002 ***
## ExterQual            -8955.01    1314.11   -6.81       0.000000000014 ***
## GarageArea            7684.06    1228.05    6.26       0.000000000517 ***
## TotalBsmtSF           7785.07    1225.67    6.35       0.000000000286 ***
## NeighborhoodBlueste  -7748.00   25793.30   -0.30              0.76392    
## NeighborhoodBrDale  -12134.96   12263.68   -0.99              0.32258    
## NeighborhoodBrkSide  14238.58   10636.12    1.34              0.18088    
## NeighborhoodClearCr  37661.27   10994.93    3.43              0.00063 ***
## NeighborhoodCollgCr  12474.08    8844.62    1.41              0.15865    
## NeighborhoodCrawfor  39568.55   10496.92    3.77              0.00017 ***
## NeighborhoodEdwards     -7.89    9669.07    0.00              0.99935    
## NeighborhoodGilbert  11900.28    9320.52    1.28              0.20189    
## NeighborhoodIDOTRR   -2954.47   11259.31   -0.26              0.79305    
## NeighborhoodMeadowV    412.26   12221.86    0.03              0.97310    
## NeighborhoodMitchel   9065.21    9929.72    0.91              0.36143    
## NeighborhoodNAmes    11269.47    9206.83    1.22              0.22114    
## NeighborhoodNoRidge  66653.90   10254.97    6.50       0.000000000111 ***
## NeighborhoodNPkVill  -2326.23   14354.11   -0.16              0.87128    
## NeighborhoodNridgHt  56253.81    9356.18    6.01       0.000000002317 ***
## NeighborhoodNWAmes    8039.21    9564.92    0.84              0.40078    
## NeighborhoodOldTown  -2632.33   10401.86   -0.25              0.80026    
## NeighborhoodSawyer   13311.34    9747.38    1.37              0.17227    
## NeighborhoodSawyerW   7450.96    9603.94    0.78              0.43798    
## NeighborhoodSomerst  17189.53    9182.57    1.87              0.06141 .  
## NeighborhoodStoneBr  64982.35   10908.20    5.96       0.000000003227 ***
## NeighborhoodSWISU     -312.28   12010.40   -0.03              0.97926    
## NeighborhoodTimber   28967.29   10118.40    2.86              0.00426 ** 
## NeighborhoodVeenker  48913.76   13407.82    3.65              0.00027 ***
## TotFullBath           6294.02    1206.94    5.21       0.000000211009 ***
## YearBuilt             3702.22    2088.85    1.77              0.07655 .  
## TotRmsAbvGrd          1606.95    1659.68    0.97              0.33309    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34300 on 1427 degrees of freedom
## Multiple R-squared:  0.818,  Adjusted R-squared:  0.814 
## F-statistic:  200 on 32 and 1427 DF,  p-value: <0.0000000000000002

multi <- update(multi, .~.  + BsmtQual + LotArea + BedroomAbvGr + CentralAir + Functional + BsmtCond - YearBuilt, data = Model)
summary(multi)

## 
## Call:
## lm(formula = SalePrice ~ OverallQual + TotSF + ExterQual + GarageArea + 
##     TotalBsmtSF + Neighborhood + TotFullBath + TotRmsAbvGrd + 
##     BsmtQual + LotArea + BedroomAbvGr + CentralAir + Functional + 
##     BsmtCond, data = Model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -445285  -15028     502   13021  259444 
## 
## Coefficients:
##                     Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)           164852       8281   19.91 < 0.0000000000000002 ***
## OverallQual            18411       1599   11.51 < 0.0000000000000002 ***
## TotSF                  20899       2008   10.41 < 0.0000000000000002 ***
## ExterQual              -8301       1299   -6.39       0.000000000223 ***
## GarageArea              6112       1190    5.14       0.000000316995 ***
## TotalBsmtSF             7171       1262    5.68       0.000000016309 ***
## NeighborhoodBlueste    -7352      24834   -0.30              0.76726    
## NeighborhoodBrDale     -4894      11815   -0.41              0.67878    
## NeighborhoodBrkSide    15760       9546    1.65              0.09897 .  
## NeighborhoodClearCr    28431      10752    2.64              0.00828 ** 
## NeighborhoodCollgCr    14162       8668    1.63              0.10252    
## NeighborhoodCrawfor    37124       9546    3.89              0.00011 ***
## NeighborhoodEdwards     1051       9122    0.12              0.90829    
## NeighborhoodGilbert    11600       9125    1.27              0.20387    
## NeighborhoodIDOTRR      -166      10167   -0.02              0.98697    
## NeighborhoodMeadowV      144      11797    0.01              0.99023    
## NeighborhoodMitchel     7452       9643    0.77              0.43975    
## NeighborhoodNAmes      13742       8742    1.57              0.11619    
## NeighborhoodNoRidge    67702       9990    6.78       0.000000000018 ***
## NeighborhoodNPkVill    -1453      13794   -0.11              0.91612    
## NeighborhoodNridgHt    50967       9112    5.59       0.000000026660 ***
## NeighborhoodNWAmes     10430       9249    1.13              0.25964    
## NeighborhoodOldTown    -3688       9018   -0.41              0.68264    
## NeighborhoodSawyer     15947       9363    1.70              0.08875 .  
## NeighborhoodSawyerW     7453       9337    0.80              0.42488    
## NeighborhoodSomerst    19542       8964    2.18              0.02941 *  
## NeighborhoodStoneBr    62714      10503    5.97       0.000000002969 ***
## NeighborhoodSWISU       6801      10960    0.62              0.53500    
## NeighborhoodTimber     21094       9966    2.12              0.03446 *  
## NeighborhoodVeenker    43672      12883    3.39              0.00072 ***
## TotFullBath             5808       1152    5.04       0.000000525708 ***
## TotRmsAbvGrd            5125       1849    2.77              0.00564 ** 
## BsmtQual               -7959       1223   -6.51       0.000000000105 ***
## LotArea                 5352        995    5.38       0.000000086881 ***
## BedroomAbvGr           -4496       1352   -3.32              0.00091 ***
## CentralAir              3910        986    3.97       0.000076691803 ***
## Functional              4207        910    4.62       0.000004109705 ***
## BsmtCond                3663       1064    3.44              0.00059 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33000 on 1422 degrees of freedom
## Multiple R-squared:  0.832,  Adjusted R-squared:  0.828 
## F-statistic:  190 on 37 and 1422 DF,  p-value: <0.0000000000000002
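
Successive fits like these can also be compared directly rather than by reading each summary. A minimal sketch on R's built-in `mtcars` data (not the housing data), keeping each model under its own hypothetical name so that adjusted R-squared and a nested-model F-test can be compared:

```r
# Forward selection on a built-in dataset: add one predictor at a time
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- update(m1, . ~ . + hp)
m3 <- update(m2, . ~ . + qsec)

# Adjusted R-squared penalizes extra predictors, so it only rises when a term helps
adj <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
              function(m) summary(m)$adj.r.squared)
print(round(adj, 3))

# F-tests for whether each added term improves on the previous, smaller model
print(anova(m1, m2, m3))
```

Storing each step separately, instead of overwriting one object, keeps the earlier models available for this kind of comparison.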

Using this model let us set up our prediction:

#Select columns by the test set's own names (the training set has an extra SalePrice column)
TEST <- P2_test[,(names(P2_test) %in% want)]

Prediction <- data.frame(lapply(TEST, function(x) {mapvalues(x, c("Ex","Gd","TA","Fa","Po","Typ","Min1","Min2","Mod","Maj1","Maj2","Sev","Sal"), c(5,4,3,2,1,7,6,5,4,3,2,1,0))}))

Prediction[,-2] <- sapply(Prediction[,-2],as.numeric)

Prediction[is.na(Prediction)] <- 0
Prediction$TotFullBath <- Prediction$FullBath + Prediction$BsmtFullBath
Prediction$TotSF <- Prediction$X1stFlrSF + Prediction$X2ndFlrSF
Prediction[,-2] <- center_scale(Prediction[,-2])

Prediction$Predict <- predict(multi,Prediction,se.fit = FALSE)

Final <- data.frame(P2_test$Id,round(Prediction$Predict,2))
names(Final) = c("ID" ,"SalePrice")
write.csv(Final,file = "DATA605_Final_kaggle.csv", quote = FALSE, row.names = FALSE)
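
Note that the code above standardizes the test data with its own means and standard deviations. A common alternative is to reuse the centering and scaling values from the training data, so that both sets are on the same scale as the fitted coefficients. A minimal sketch on toy data (made-up values), using the attributes that `scale()` attaches to its result:

```r
# Toy training and test sets
train <- data.frame(sqft = c(800, 1200, 1600, 2000), baths = c(1, 2, 2, 3))
test  <- data.frame(sqft = c(1000, 3000), baths = c(1, 4))

# Standardize the training data and keep the parameters scale() used
train_z <- scale(train)
ctr <- attr(train_z, "scaled:center")  # column means of the training data
scl <- attr(train_z, "scaled:scale")   # column standard deviations of the training data

# Apply the training parameters to the test data
test_z <- scale(test, center = ctr, scale = scl)
print(test_z)
```

Whether this would change the Kaggle score reported below is not something this sketch can show; it only illustrates the mechanics.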

Random Forest

keep <- c("MSSubClass","LotArea", "Neighborhood", "HouseStyle", "OverallQual", "OverallCond", "YearBuilt", "YearRemodAdd", "ExterQual", "BsmtQual", "TotalBsmtSF", "X1stFlrSF", "X2ndFlrSF", "GrLivArea", "BsmtFullBath", "FullBath", "HalfBath", "BedroomAbvGr", "TotRmsAbvGrd", "GarageCars", "GarageArea", "Functional")

Forest <- P2_train[, (names(P2_train) %in% keep)]
Frst_Test <- P2_test[, (names(P2_test) %in% keep)]
summary(Frst_Test)

##    MSSubClass       LotArea       Neighborhood  HouseStyle 
##  Min.   : 20.0   Min.   : 1470   NAmes  :218   1.5Fin:160  
##  1st Qu.: 20.0   1st Qu.: 7391   OldTown:126   1.5Unf:  5  
##  Median : 50.0   Median : 9399   CollgCr:117   1Story:745  
##  Mean   : 57.4   Mean   : 9819   Somerst: 96   2.5Unf: 13  
##  3rd Qu.: 70.0   3rd Qu.:11518   Edwards: 94   2Story:427  
##  Max.   :190.0   Max.   :56600   NridgHt: 89   SFoyer: 46  
##                                  (Other):719   SLvl  : 63  
##   OverallQual     OverallCond     YearBuilt     YearRemodAdd  ExterQual
##  Min.   : 1.00   Min.   :1.00   Min.   :1879   Min.   :1950   Ex: 55   
##  1st Qu.: 5.00   1st Qu.:5.00   1st Qu.:1953   1st Qu.:1963   Fa: 21   
##  Median : 6.00   Median :5.00   Median :1973   Median :1992   Gd:491   
##  Mean   : 6.08   Mean   :5.55   Mean   :1971   Mean   :1984   TA:892   
##  3rd Qu.: 7.00   3rd Qu.:6.00   3rd Qu.:2001   3rd Qu.:2004            
##  Max.   :10.00   Max.   :9.00   Max.   :2010   Max.   :2010            
##                                                                        
##  BsmtQual    TotalBsmtSF     X1stFlrSF      X2ndFlrSF      GrLivArea   
##  Ex  :137   Min.   :   0   Min.   : 407   Min.   :   0   Min.   : 407  
##  Fa  : 53   1st Qu.: 784   1st Qu.: 874   1st Qu.:   0   1st Qu.:1118  
##  Gd  :591   Median : 988   Median :1079   Median :   0   Median :1432  
##  TA  :634   Mean   :1046   Mean   :1157   Mean   : 326   Mean   :1486  
##  NA's: 44   3rd Qu.:1305   3rd Qu.:1382   3rd Qu.: 676   3rd Qu.:1721  
##             Max.   :5095   Max.   :5095   Max.   :1862   Max.   :5095  
##             NA's   :1                                                  
##   BsmtFullBath      FullBath       HalfBath      BedroomAbvGr 
##  Min.   :0.000   Min.   :0.00   Min.   :0.000   Min.   :0.00  
##  1st Qu.:0.000   1st Qu.:1.00   1st Qu.:0.000   1st Qu.:2.00  
##  Median :0.000   Median :2.00   Median :0.000   Median :3.00  
##  Mean   :0.434   Mean   :1.57   Mean   :0.378   Mean   :2.85  
##  3rd Qu.:1.000   3rd Qu.:2.00   3rd Qu.:1.000   3rd Qu.:3.00  
##  Max.   :3.000   Max.   :4.00   Max.   :2.000   Max.   :6.00  
##  NA's   :2                                                    
##   TotRmsAbvGrd     Functional     GarageCars     GarageArea  
##  Min.   : 3.00   Typ    :1357   Min.   :0.00   Min.   :   0  
##  1st Qu.: 5.00   Min2   :  36   1st Qu.:1.00   1st Qu.: 318  
##  Median : 6.00   Min1   :  34   Median :2.00   Median : 480  
##  Mean   : 6.38   Mod    :  20   Mean   :1.77   Mean   : 473  
##  3rd Qu.: 7.00   Maj1   :   5   3rd Qu.:2.00   3rd Qu.: 576  
##  Max.   :15.00   (Other):   5   Max.   :5.00   Max.   :1488  
##                  NA's   :   2   NA's   :1      NA's   :1

The summary of the test data shows NAs in columns that had none in the training data, so we must impute them before running the prediction. In addition, some factor variables do not share the same set of levels in the test and training datasets. To account for this, we assign the factor levels of the training dataset to the corresponding variables in the test dataset.

library(forcats)

require(tidyr)

Forest <- transform(
  Forest,
  OverallQual= as.factor(OverallQual),
  OverallCond = as.factor(OverallCond),
  BsmtQual = fct_explicit_na(BsmtQual, na_level = "None")
)

Frst_Test <- transform(
  Frst_Test,
  OverallQual= as.factor(OverallQual),
  OverallCond = as.factor(OverallCond),
  BsmtQual = fct_explicit_na(BsmtQual, na_level = "None")
)


Frst_Test$TotalBsmtSF[is.na(Frst_Test$TotalBsmtSF)] <- as.integer(0)
Frst_Test$BsmtFullBath[is.na(Frst_Test$BsmtFullBath)] <- as.integer(0)
Frst_Test$Functional[is.na(Frst_Test$Functional)] <- as.factor("Typ")
Frst_Test$GarageCars[is.na(Frst_Test$GarageCars)] <- as.integer(0)
Frst_Test$GarageArea[is.na(Frst_Test$GarageArea)] <- as.integer(0)

levels(Frst_Test$Neighborhood) <- levels(Forest$Neighborhood)
levels(Frst_Test$HouseStyle) <- levels(Forest$HouseStyle)
levels(Frst_Test$OverallQual) <- levels(Forest$OverallQual)
levels(Frst_Test$OverallCond) <- levels(Forest$OverallCond)
levels(Frst_Test$ExterQual) <- levels(Forest$ExterQual)
levels(Frst_Test$BsmtQual) <- levels(Forest$BsmtQual)
levels(Frst_Test$Functional) <- levels(Forest$Functional)


summary(Frst_Test)

##    MSSubClass       LotArea       Neighborhood   HouseStyle   OverallQual 
##  Min.   : 20.0   Min.   : 1470   NAmes  :218   1Story :745   5      :428  
##  1st Qu.: 20.0   1st Qu.: 7391   OldTown:126   2.5Unf :427   6      :357  
##  Median : 50.0   Median : 9399   CollgCr:117   1.5Fin :160   7      :281  
##  Mean   : 57.4   Mean   : 9819   Somerst: 96   SFoyer : 63   8      :174  
##  3rd Qu.: 70.0   3rd Qu.:11518   Edwards: 94   2Story : 46   4      :110  
##  Max.   :190.0   Max.   :56600   NridgHt: 89   2.5Fin : 13   9      : 64  
##                                  (Other):719   (Other):  5   (Other): 45  
##   OverallCond    YearBuilt     YearRemodAdd  ExterQual BsmtQual  
##  5      :824   Min.   :1879   Min.   :1950   Ex: 55    Ex  :137  
##  6      :279   1st Qu.:1953   1st Qu.:1963   Fa: 21    Fa  : 53  
##  7      :185   Median :1973   Median :1992   Gd:491    Gd  :591  
##  8      : 72   Mean   :1971   Mean   :1984   TA:892    TA  :634  
##  4      : 44   3rd Qu.:2001   3rd Qu.:2004             None: 44  
##  3      : 25   Max.   :2010   Max.   :2010                       
##  (Other): 30                                                     
##   TotalBsmtSF     X1stFlrSF      X2ndFlrSF      GrLivArea   
##  Min.   :   0   Min.   : 407   Min.   :   0   Min.   : 407  
##  1st Qu.: 784   1st Qu.: 874   1st Qu.:   0   1st Qu.:1118  
##  Median : 988   Median :1079   Median :   0   Median :1432  
##  Mean   :1045   Mean   :1157   Mean   : 326   Mean   :1486  
##  3rd Qu.:1304   3rd Qu.:1382   3rd Qu.: 676   3rd Qu.:1721  
##  Max.   :5095   Max.   :5095   Max.   :1862   Max.   :5095  
##                                                             
##   BsmtFullBath      FullBath       HalfBath      BedroomAbvGr 
##  Min.   :0.000   Min.   :0.00   Min.   :0.000   Min.   :0.00  
##  1st Qu.:0.000   1st Qu.:1.00   1st Qu.:0.000   1st Qu.:2.00  
##  Median :0.000   Median :2.00   Median :0.000   Median :3.00  
##  Mean   :0.434   Mean   :1.57   Mean   :0.378   Mean   :2.85  
##  3rd Qu.:1.000   3rd Qu.:2.00   3rd Qu.:1.000   3rd Qu.:3.00  
##  Max.   :3.000   Max.   :4.00   Max.   :2.000   Max.   :6.00  
##                                                               
##   TotRmsAbvGrd   Functional    GarageCars     GarageArea  
##  Min.   : 3.00   Maj1:   5   Min.   :0.00   Min.   :   0  
##  1st Qu.: 5.00   Maj2:   4   1st Qu.:1.00   1st Qu.: 318  
##  Median : 6.00   Min1:  34   Median :2.00   Median : 480  
##  Mean   : 6.38   Min2:  36   Mean   :1.76   Mean   : 472  
##  3rd Qu.: 7.00   Mod :  20   3rd Qu.:2.00   3rd Qu.: 576  
##  Max.   :15.00   Sev :   1   Max.   :5.00   Max.   :1488  
##                  Typ :1359
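
One caveat about `levels(x) <- ...`: it relabels the existing levels by position rather than matching them by name. Comparing the two summaries, the 427 `2Story` houses in the test data appear under `2.5Unf` after the assignment, because the training data has a `2.5Fin` level that the test data lacks. A minimal sketch on toy data of the difference, with `factor(x, levels = ...)` shown as one alternative that matches by label:

```r
# Toy factors: the training data has a level the test data lacks
train_style <- factor(c("1Story", "2.5Fin", "2Story", "2Story"),
                      levels = c("1Story", "2.5Fin", "2Story"))
test_style  <- factor(c("1Story", "2Story", "2Story"),
                      levels = c("1Story", "2Story"))

# Positional relabelling: the second test level ("2Story") is renamed to "2.5Fin"
relabelled <- test_style
levels(relabelled) <- levels(train_style)
print(as.character(relabelled))  # "1Story" "2.5Fin" "2.5Fin"

# Matching by label: values are kept and the missing level is added as an empty one
aligned <- factor(test_style, levels = levels(train_style))
print(as.character(aligned))     # "1Story" "2Story" "2Story"
```

With `factor(x, levels = ...)`, any test value whose label is missing from the training levels becomes NA and would need to be imputed before prediction.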

require(randomForest)

Frst_mod <- randomForest(
  x = Forest,
  y = P2_train$SalePrice
)
Frst_pred <- predict(Frst_mod, newdata = Frst_Test)

Rand_For_Pred <- data.frame(P2_test$Id,Frst_pred)
names(Rand_For_Pred) = c("ID" ,"SalePrice")
write.csv(Rand_For_Pred,file = "DATA605_Final_kaggle_RF.csv", quote = FALSE, row.names = FALSE)

Kaggle Scores

My user name on Kaggle.com is ‘mishakollontai’.

Upon submission of the multiple linear model, I received a score of 0.17969.

The Random Forest model received a score of 0.15722. Since a lower score is better, it is noticeably more accurate than the linear model.
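
For context on these scores, the sketch below assumes that Kaggle evaluates this competition by the root mean squared error between the logarithms of predicted and actual sale prices. That is an assumption, since the report does not name the metric, and `rmsle` is a hypothetical helper name:

```r
# Root mean squared error on the log scale (assumed competition metric)
rmsle <- function(actual, predicted) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# A perfect prediction scores 0
print(rmsle(c(100000, 200000), c(100000, 200000)))

# Predictions that are all off by a factor of exp(0.1), about 10.5%, score 0.1
print(rmsle(c(100000, 200000), c(100000, 200000) * exp(0.1)))
```

Under that assumption, a score of about 0.16 corresponds to a typical relative error of roughly 16 to 17 percent in the predicted price.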

DATA605 Final Project

Misha Kollontai

5/13/2020