DATA605_Final Exam

Final Exam Part 1: (PageRank)

Question: Form the A matrix. Then, introduce decay and form the B matrix

Step 1: Mat A,Decay and Mat B

As shown in the notes, the transition matrix \(A\) is given by

\[A = \left[ \begin{array}{c} 0 & \frac{1}{2} & \frac{1}{2} & 0 & 0 & 0 \\ \frac{1}{6} & \frac{1}{6} & \frac{1}{6} & \frac{1}{6} & \frac{1}{6} & \frac{1}{6} \\ \frac{1}{3} & \frac{1}{3} & 0 & 0 & \frac{1}{3} & 0 \\ 0 & 0 & 0 & 0 & \frac{1}{2} & \frac{1}{2} \\ 0 & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} \\ 0 & 0 & 0 & 1 & 0 & 0 \end{array} \right]\]

The matrix \(B\) is obtained by \[B = 0.85 \times A + \frac{0.15}{n} \approx \left[ \begin{array}{c} 0.025 & 0.45 & 0.45 & 0.025 & 0.025 & 0.025 \\ 0.1667 & 0.1667 & 0.1667 & 0.1667 & 0.1667 & 0.1667 \\ 0.3083 & 0.3083 & 0.025 & 0.025 & 0.3083 & 0.025 \\ 0.025 & 0.025 & 0.025 & 0.025 & 0.45 & 0.45 \\ 0.025 & 0.025 & 0.025 & 0.45 & 0.025 & 0.45 \\ 0.025 & 0.025 & 0.025 & 0.875 & 0.025 & 0.025 \end{array} \right]\]

In R, these are stored as below:

A <- matrix(
  c(0, 1/2, 1/2, 0, 0, 0,
  0, 0, 0, 0, 0, 0,
  1/3, 1/3, 0, 0, 1/3, 0,
  0, 0, 0, 0, 1/2, 1/2,
  0, 0, 0, 1/2, 0, 1/2,
  0, 0, 0, 1, 0, 0),
  nrow = 6, byrow = TRUE)
A[2, ] <- rep(1/6, 6)
B <- 0.85 * A + 0.15 / nrow(A)
Question: Start with a uniform rank vector r and perform power iterations on B till convergence.That is, compute the solution r = Bn × r. Attempt this for a sufficiently large n so that r actually converges.

Step 2:Power Iterations

The following function is created to perform power iterations on \(B\) until convergence, utilizing a uniform rank vector

\(r^T = \left[ \begin{array}{c}\frac{1}{6} & \frac{1}{6} & \frac{1}{6} & \frac{1}{6} & \frac{1}{6} & \frac{1}{6} \end{array} \right]\)

power_iterate <- function(mat, vec) {
  converged <- FALSE
  n <- 0
  while(!converged) {
    vec <- crossprod(mat, vec)
    n <- n + 1
    if(identical(crossprod(mat, vec), vec)) {
      converged <- TRUE
    }
  }
  print(paste('Converged in', n, 'iterations'))
  return(vec)
}

r <- matrix(rep(1/nrow(B), nrow(B)), ncol=1)
it_results <- power_iterate(B, r)
## [1] "Converged in 66 iterations"

The page rank vector and associated page rankings, as calculated, are:

page vector rank
1 0.0517 6
2 0.0737 4
3 0.0574 5
4 0.3487 1
5 0.1999 3
6 0.2686 2
Question: Compute the eigen-decomposition of B and verify that you indeed get an eigenvalueof 1 as the largest eigenvalue and that its corresponding eigenvector is the same vector that you obtained in the previous power iteration method. Further, this eigenvector has all positive entries and it sums to 1.

Step 3:Eigen-Decomposition

For this exercise, we are interested in the largest eigenvalue of \(B\), as well as the associated left eigenvector of \(B\). Taking only the real components, this can be shown to be

Re(eigen(B)$values[1])
## [1] 1
ev_results <- matrix(Re(eigen(t(B))$vectors[,1]))

The eigenvectors returned by the eigen function return normal vectors (i.e. vectors with length 1); the returned vector is divided by its sum to give a vector with a sum 1. This vector is identical to the vector returned using the power_iterate function:

ev_results <- ev_results / sum(ev_results)
identical(round(ev_results, 7), round(it_results, 7))
## [1] TRUE
page vector rank
1 0.0517 6
2 0.0737 4
3 0.0574 5
4 0.3487 1
5 0.1999 3
6 0.2686 2
Question: Use the graph package in R and its page.rank method to compute the Page Rank of the graph as given in A.

Step 4: igraph Netwk

Using the igraph package, the network can be visualized in a directed graph, and the page rank of the nodes in the network returned.

library(igraph)

The igraph package handles decay using a damping factor of 0.85 and automatically assigns a uniform random probability to dangling nodes, which matches the two approaches outlined above. The page rank vector returned matches that returned through power iteration:

identical(round(it_results, 13), round(g_results, 13))
## [1] TRUE
Question: Verify that you do get the same PageRank vector as the three approaches above.

Step 5: Comparison of Results

As shown in the above sections, the page rank vector for the given universe of six pages was derived through three methods:

  • Power iteration of the matrix \(B\)
  • Eigenvector corresponding to \(\lambda = 1\) for the matrix \(B\)
  • igraph implementation using the matrix \(A\)

The three methods return the same results (to 13 decimal points of accuracy) once the eigenvector is scaled from being a unit vector.

all_results <- data.frame(
  "Power Iteration" = round(ev_results, 7), 
  Eigenvector = round(ev_results, 7), 
  Graph = round(g_results, 7), 
  Rank = rank(1 - ev_results),
  row.names = seq(1, 6), 
  check.names = FALSE)
kable(all_results, padding = 0, align = 'c', row.names = TRUE)
Power Iteration Eigenvector Graph Rank
1 0.0517047 0.0517047 0.0517047 6
2 0.0736793 0.0736793 0.0736793 4
3 0.0574124 0.0574124 0.0574124 5
4 0.3487037 0.3487037 0.3487037 1
5 0.1999038 0.1999038 0.1999038 3
6 0.2685961 0.2685961 0.2685961 2

Final Exam Part 2 (kaggle MNIST)

library(tidyverse)
library(grid)
library(matrixcalc)
library(caret)
library(nnet)
library(OpenImageR)

Load and Read train Data

train_data <- read_csv("train.csv")
## Rows: 42000 Columns: 785
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (785): label, pixel0, pixel1, pixel2, pixel3, pixel4, pixel5, pixel6, pi...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Step 3: Data Format Images

Using the training.csv file, plot representations of the first 10 images to understand the data format. Go ahead and divide all pixels by 255 to produce values between 0 and 1. (This is equivalent to min-max scaling.) (5 points)
digit<-function(x){
  m<-matrix(unlist(x), nrow=28, byrow=T)
  m<-t(apply(m, 2, rev))
  image(m, col=grey.colors(255))
}

par(mfrow=c(3,4))

for(i in 1:10){
  digit(train_data[i, -1])
}

Step 4:Frequency Distribution

4. What is the frequency distribution of the numbers in the dataset? (5 points)
train_label <- train_data$label
train_data_freq <- as.data.frame(table(train_label))
ggplot(train_data_freq, aes(x = train_label,y=Freq)) + 
  geom_histogram(stat='identity') + 
  labs(title = 'Frequency Distribution of Drawn Digits - Training Set',
       x = 'Drawn Digit')
## Warning: Ignoring unknown parameters: binwidth, bins, pad

Step 5:Pixel Intensity.

5. For each number, provide the mean pixel intensity. What does this tell you? (5 points)
labels = train_data[,1]
data <- train_data[,-1]/255
get_number_intensity <- function(target, labels, data){
  x = data[labels==target,]
  means = rowMeans(x)
  return(mean(means))
}
for (i in 1:9) {
  mean_intensity <- get_number_intensity(i , labels, data)
  
  ret_string = str_interp("Pixel intensity for number ${i} is ${mean_intensity}")
  
  print(ret_string)
}
## [1] "Pixel intensity for number 1 is 0.075972720428906"
## [1] "Pixel intensity for number 2 is 0.149415262873165"
## [1] "Pixel intensity for number 3 is 0.141657603055012"
## [1] "Pixel intensity for number 4 is 0.121212097314368"
## [1] "Pixel intensity for number 5 is 0.129231294625887"
## [1] "Pixel intensity for number 6 is 0.138730078688473"
## [1] "Pixel intensity for number 7 is 0.1147021021542"
## [1] "Pixel intensity for number 8 is 0.150981134516322"
## [1] "Pixel intensity for number 9 is 0.12281787715086"

Step 6: Generate Component with PCA

6. Reduce the data by using principal components that account for 95% of the variance. How many components did you generate? Use PCA to generate all possible components (100% of the variance). How many components are possible? Why? (5 points)
train_pca <- prcomp(data)
train_pca_std <- train_pca$sdev
train_pca_cum_var <- cumsum(train_pca_std^2)/sum(train_pca_std^2)
plot(train_pca_cum_var)

which.max(train_pca_cum_var >= .95)
## [1] 154
which.max(train_pca_cum_var >= 1)
## [1] 704

Step 7: Plot 1st 10 images by PCA

Plot the first 10 images generated by PCA. They will appear to be noise. Why? (5 points)
train_pca_rot <- train_pca$rotation
for (d in 1:10){
  plot(4,4, xlim=c(1,28), ylim=c(1,28))
  imageShow(array(train_pca_rot[,d],c(28,28)))
}

Step 8: Re-run PCA using 8’s

Now, select only those images that have labels that are 8’s. Re-run PCA that accounts for all of the variance (100%). Plot the first 10 images. What do you see? (5 points)
x = data[labels==8,]
pca_8 <- prcomp(x)
pca_8_std <- pca_8$sdev
pca_8_cum_var <- cumsum(pca_8_std^2)/sum(pca_8_std^2)
plot(pca_8_cum_var)

pca_8_rot <- pca_8$rotation
for (d in 1:10){
  plot(4,4, xlim=c(1,28), ylim=c(1,28))
  imageShow(array(pca_8_rot[,d],c(28,28)))
}

The variance hit 95% slower than I would have thought. But our images are clearly horizontal eights - that look vaguely like anthrax or a worn hippy tattoo.

Step 9: Test the Model

An incorrect approach to predicting the images would be to build a linear regression model with y as the digit values and X as the pixel matrix. Instead, we can build a multinomial model that classifies the digits. Build a multinomial model on the entirety of the training set. Then provide its classification accuracy (percent correctly identified) as well as a matrix of observed versus forecast values (confusion matrix). This matrix will be a 10 x 10, and correct classifications will be on the diagonal. (10 points)
test_size <- floor(.2 * nrow(train_data))
set.seed(5593)
test_index <- sample(seq_len(nrow(data)), size = test_size)

train_df <- data[-test_index,]
train_labels <- c(labels[-test_index,])[[1]]

test_df <- data[test_index,]
test_labels <- c(labels[test_index,])[[1]]
mn_model <- multinom(train_labels~., train_df, MaxNWts = 100000)
## # weights:  7860 (7065 variable)
## initial  value 77366.859125 
## iter  10 value 21811.620903
## iter  20 value 17688.896952
## iter  30 value 16823.987031
## iter  40 value 16350.435489
## iter  50 value 15948.139268
## iter  60 value 15168.768809
## iter  70 value 13746.644483
## iter  80 value 11899.115029
## iter  90 value 10593.975978
## iter 100 value 9859.334204
## final  value 9859.334204 
## stopped after 100 iterations
predictions <- predict(mn_model, test_df)
test_labels <- as.factor(test_labels)
confusionMatrix(predictions,test_labels)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 780   0  11   4   1  13  11  10   3   7
##          1   0 922  16   9   7   9   1   9  29   3
##          2   3   5 704  22   8   4   3  10   7   0
##          3   2   3  18 780   1  20   0   3  19   8
##          4   2   4   5   2 713  16   8  10   4  23
##          5   8   2   7  40   0 630  10   6  29   7
##          6   9   0  13   3   7  20 775   0   2   0
##          7   3   1  14  11   4   6   1 848   3  20
##          8   5  13  23  19   6  36   4   1 707  10
##          9   2   1   9   8  30   9   0  33  11 742
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9049          
##                  95% CI : (0.8984, 0.9111)
##     No Information Rate : 0.1132          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8942          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.95823   0.9695  0.85854  0.86860  0.91763  0.82569
## Specificity           0.99209   0.9889  0.99182  0.99014  0.99029  0.98573
## Pos Pred Value        0.92857   0.9174  0.91906  0.91335  0.90597  0.85250
## Neg Pred Value        0.99550   0.9961  0.98480  0.98436  0.99159  0.98264
## Prevalence            0.09690   0.1132  0.09762  0.10690  0.09250  0.09083
## Detection Rate        0.09286   0.1098  0.08381  0.09286  0.08488  0.07500
## Detection Prevalence  0.10000   0.1196  0.09119  0.10167  0.09369  0.08798
## Balanced Accuracy     0.97516   0.9792  0.92518  0.92937  0.95396  0.90571
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.95326   0.9118  0.86855  0.90488
## Specificity           0.99288   0.9916  0.98458  0.98641
## Pos Pred Value        0.93486   0.9308  0.85801  0.87811
## Neg Pred Value        0.99498   0.9891  0.98588  0.98968
## Prevalence            0.09679   0.1107  0.09690  0.09762
## Detection Rate        0.09226   0.1010  0.08417  0.08833
## Detection Prevalence  0.09869   0.1085  0.09810  0.10060
## Balanced Accuracy     0.97307   0.9517  0.92656  0.94564

Final Exam Part 3 (Kaggle House Prices)

load and read data

house_data <- read.csv("https://raw.githubusercontent.com/omocharly/DATA605/main/train.csv")
head(house_data)
##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
## 2  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
## 3  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
## 4  4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
## 5  5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
## 6  6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
##   Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 2    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
## 3    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 4    AllPub    Corner       Gtl      Crawfor       Norm       Norm     1Fam
## 5    AllPub       FR2       Gtl      NoRidge       Norm       Norm     1Fam
## 6    AllPub    Inside       Gtl      Mitchel       Norm       Norm     1Fam
##   HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1     2Story           7           5      2003         2003     Gable  CompShg
## 2     1Story           6           8      1976         1976     Gable  CompShg
## 3     2Story           7           5      2001         2002     Gable  CompShg
## 4     2Story           7           5      1915         1970     Gable  CompShg
## 5     2Story           8           5      2000         2000     Gable  CompShg
## 6     1.5Fin           5           5      1993         1995     Gable  CompShg
##   Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
## 2     MetalSd     MetalSd       None          0        TA        TA     CBlock
## 3     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
## 4     Wd Sdng     Wd Shng       None          0        TA        TA     BrkTil
## 5     VinylSd     VinylSd    BrkFace        350        Gd        TA      PConc
## 6     VinylSd     VinylSd       None          0        TA        TA       Wood
##   BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1       Gd       TA           No          GLQ        706          Unf
## 2       Gd       TA           Gd          ALQ        978          Unf
## 3       Gd       TA           Mn          GLQ        486          Unf
## 4       TA       Gd           No          ALQ        216          Unf
## 5       Gd       TA           Av          GLQ        655          Unf
## 6       Gd       TA           No          GLQ        732          Unf
##   BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1          0       150         856    GasA        Ex          Y      SBrkr
## 2          0       284        1262    GasA        Ex          Y      SBrkr
## 3          0       434         920    GasA        Ex          Y      SBrkr
## 4          0       540         756    GasA        Gd          Y      SBrkr
## 5          0       490        1145    GasA        Ex          Y      SBrkr
## 6          0        64         796    GasA        Ex          Y      SBrkr
##   X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1       856       854            0      1710            1            0        2
## 2      1262         0            0      1262            0            1        2
## 3       920       866            0      1786            1            0        2
## 4       961       756            0      1717            1            0        1
## 5      1145      1053            0      2198            1            0        2
## 6       796       566            0      1362            1            0        1
##   HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1        1            3            1          Gd            8        Typ
## 2        0            3            1          TA            6        Typ
## 3        1            3            1          Gd            6        Typ
## 4        0            3            1          Gd            7        Typ
## 5        1            4            1          Gd            9        Typ
## 6        1            1            1          TA            5        Typ
##   Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1          0        <NA>     Attchd        2003          RFn          2
## 2          1          TA     Attchd        1976          RFn          2
## 3          1          TA     Attchd        2001          RFn          2
## 4          1          Gd     Detchd        1998          Unf          3
## 5          1          TA     Attchd        2000          RFn          3
## 6          0        <NA>     Attchd        1993          Unf          2
##   GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1        548         TA         TA          Y          0          61
## 2        460         TA         TA          Y        298           0
## 3        608         TA         TA          Y          0          42
## 4        642         TA         TA          Y          0          35
## 5        836         TA         TA          Y        192          84
## 6        480         TA         TA          Y         40          30
##   EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1             0          0           0        0   <NA>  <NA>        <NA>
## 2             0          0           0        0   <NA>  <NA>        <NA>
## 3             0          0           0        0   <NA>  <NA>        <NA>
## 4           272          0           0        0   <NA>  <NA>        <NA>
## 5             0          0           0        0   <NA>  <NA>        <NA>
## 6             0        320           0        0   <NA> MnPrv        Shed
##   MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1       0      2   2008       WD        Normal    208500
## 2       0      5   2007       WD        Normal    181500
## 3       0      9   2008       WD        Normal    223500
## 4       0      2   2006       WD       Abnorml    140000
## 5       0     12   2008       WD        Normal    250000
## 6     700     10   2009       WD        Normal    143000
str(house_data)
## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr  NA NA NA NA ...
##  $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ MasVnrType   : chr  "BrkFace" "None" "BrkFace" "None" ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtQual     : chr  "Gd" "Gd" "Gd" "TA" ...
##  $ BsmtCond     : chr  "TA" "TA" "TA" "Gd" ...
##  $ BsmtExposure : chr  "No" "Gd" "Mn" "No" ...
##  $ BsmtFinType1 : chr  "GLQ" "ALQ" "GLQ" "ALQ" ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : chr  "Unf" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : chr  NA "TA" "TA" "Gd" ...
##  $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Detchd" ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : chr  "RFn" "RFn" "RFn" "Unf" ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr  NA NA NA NA ...
##  $ Fence        : chr  NA NA NA NA ...
##  $ MiscFeature  : chr  NA NA NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

Reduce the data to Numeric variables Only

Since we have alot of categorical variable in our data, We trim down the variables in the data only to numeric variables relevant to our analysis.

numeric_data <- house_data %>% select_if(is.numeric)
str(numeric_data)
## 'data.frame':    1460 obs. of  38 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

i decide to further filter down the data to take away the variable with too much zero

final_data <- numeric_data[,c(1, 4, 7,8, 10, 12:15, 17, 28:30, 38)]
str(final_data)
## 'data.frame':    1460 obs. of  14 variables:
##  $ Id          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ LotArea     : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ YearBuilt   : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd: int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ BsmtFinSF1  : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtUnfSF   : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ X1stFlrSF   : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF   : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ GrLivArea   : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ GarageArea  : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ WoodDeckSF  : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ SalePrice   : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

Univariate Descriptive Statistics and plots

summary(final_data)
##        Id            LotArea         YearBuilt     YearRemodAdd 
##  Min.   :   1.0   Min.   :  1300   Min.   :1872   Min.   :1950  
##  1st Qu.: 365.8   1st Qu.:  7554   1st Qu.:1954   1st Qu.:1967  
##  Median : 730.5   Median :  9478   Median :1973   Median :1994  
##  Mean   : 730.5   Mean   : 10517   Mean   :1971   Mean   :1985  
##  3rd Qu.:1095.2   3rd Qu.: 11602   3rd Qu.:2000   3rd Qu.:2004  
##  Max.   :1460.0   Max.   :215245   Max.   :2010   Max.   :2010  
##    BsmtFinSF1       BsmtUnfSF       TotalBsmtSF       X1stFlrSF   
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0.0   Min.   : 334  
##  1st Qu.:   0.0   1st Qu.: 223.0   1st Qu.: 795.8   1st Qu.: 882  
##  Median : 383.5   Median : 477.5   Median : 991.5   Median :1087  
##  Mean   : 443.6   Mean   : 567.2   Mean   :1057.4   Mean   :1163  
##  3rd Qu.: 712.2   3rd Qu.: 808.0   3rd Qu.:1298.2   3rd Qu.:1391  
##  Max.   :5644.0   Max.   :2336.0   Max.   :6110.0   Max.   :4692  
##    X2ndFlrSF      GrLivArea      GarageArea       WoodDeckSF    
##  Min.   :   0   Min.   : 334   Min.   :   0.0   Min.   :  0.00  
##  1st Qu.:   0   1st Qu.:1130   1st Qu.: 334.5   1st Qu.:  0.00  
##  Median :   0   Median :1464   Median : 480.0   Median :  0.00  
##  Mean   : 347   Mean   :1515   Mean   : 473.0   Mean   : 94.24  
##  3rd Qu.: 728   3rd Qu.:1777   3rd Qu.: 576.0   3rd Qu.:168.00  
##  Max.   :2065   Max.   :5642   Max.   :1418.0   Max.   :857.00  
##   OpenPorchSF       SalePrice     
##  Min.   :  0.00   Min.   : 34900  
##  1st Qu.:  0.00   1st Qu.:129975  
##  Median : 25.00   Median :163000  
##  Mean   : 46.66   Mean   :180921  
##  3rd Qu.: 68.00   3rd Qu.:214000  
##  Max.   :547.00   Max.   :755000
par(mfrow=c(2, 3))
plot(house_data$YearBuilt,house_data$SalePrice)
plot(house_data$BsmtUnfSF,house_data$SalePrice)
plot(house_data$X1stFlrSF,house_data$SalePrice)
plot(house_data$GrLivArea,house_data$SalePrice)
plot(house_data$GarageArea,house_data$SalePrice)

Correlation Matrix

Derive a correlation matrix for any THREE quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

I choose BsmtUnfSF, X1stFlrSF and SalePrice

corr_data <- house_data[, c("BsmtUnfSF", "X1stFlrSF", "SalePrice")]
corr_matrix <- round(cor(corr_data),2)
corr_matrix
##           BsmtUnfSF X1stFlrSF SalePrice
## BsmtUnfSF      1.00      0.32      0.21
## X1stFlrSF      0.32      1.00      0.61
## SalePrice      0.21      0.61      1.00
cor.test(corr_data$BsmtUnfSF, corr_data$X1stFlrSF, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  corr_data$BsmtUnfSF and corr_data$X1stFlrSF
## t = 12.807, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2874940 0.3478369
## sample estimates:
##       cor 
## 0.3179874
cor.test(corr_data$BsmtUnfSF, corr_data$SalePrice, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  corr_data$BsmtUnfSF and corr_data$SalePrice
## t = 8.3847, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.1822292 0.2462680
## sample estimates:
##       cor 
## 0.2144791
cor.test(corr_data$X1stFlrSF, corr_data$SalePrice, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  corr_data$X1stFlrSF and corr_data$SalePrice
## t = 29.078, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5841687 0.6266715
## sample estimates:
##       cor 
## 0.6058522

In all 3 tests we have a very small p value, therefore, we can reject the the null hypothesis. The true correlation is not 0 for any of the three pairs of variables.

The formula to estimate the familywise error rate is:

\(FWE≤1–(1–alphaIT)^c\)

Where:

αIT = alpha level for an individual test (e.g. .05), c = Number of comparisons.

Source: https://www.statisticshowto.datasciencecentral.com/familywise-error-rate/

In our case…

FWE <- 1-((1-0.05)^3)
FWE
## [1] 0.142625

So the probability of a family-wise error is just over 14%.

This is definitelty a significant risk.

Linear Algebra and Correlation

Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

precision_matrix <- solve(corr_matrix)
precision_matrix
##             BsmtUnfSF  X1stFlrSF   SalePrice
## BsmtUnfSF  1.11451514 -0.3406203 -0.02626983
## X1stFlrSF -0.34062025  1.6967113 -0.96346364
## SalePrice -0.02626983 -0.9634636  1.59322948
round(corr_matrix %*% precision_matrix, 2)
##           BsmtUnfSF X1stFlrSF SalePrice
## BsmtUnfSF         1         0         0
## X1stFlrSF         0         1         0
## SalePrice         0         0         1
round(precision_matrix %*% corr_matrix, 2)
##           BsmtUnfSF X1stFlrSF SalePrice
## BsmtUnfSF         1         0         0
## X1stFlrSF         0         1         0
## SalePrice         0         0         1
 lu.decomposition(corr_matrix)
## $L
##      [,1]      [,2] [,3]
## [1,] 1.00 0.0000000    0
## [2,] 0.32 1.0000000    0
## [3,] 0.21 0.6047237    1
## 
## $U
##      [,1]   [,2]     [,3]
## [1,]    1 0.3200 0.210000
## [2,]    0 0.8976 0.542800
## [3,]    0 0.0000 0.627656

Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)).

From the above analysis, we can see that the X1stFlrSF variable is right-skewed. The minimum value of the Lot Area is absolutely above zero so need not to shift.

#lot Area Minimum Value
min(house_data$X1stFlrSF)
## [1] 334
hist(house_data$X1stFlrSF, breaks = 20, main = "Histogram of X 1st Floor SF")

library(MASS)
d <- fitdistr(house_data$X1stFlrSF, densfun = 'exponential')
lambda <- d$estimate
epdf <- rexp(1000, lambda)

Optimal Value of λ:

The optimal value of lambda = 1/λ.

optimal_value <- 1/lambda
optimal_value
##     rate 
## 1162.627

Plot a histogram and compare it with a histogram of your original variable.

par(mfrow=c(1,2))
hist(house_data$X1stFlrSF, breaks = 20, col="violet", main = "Original - Lot Area")
hist(epdf, breaks = 20, col="royalblue", main = "Exponential - Lot Area")

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

round(quantile(epdf, c(.05, .95)), 3)
##       5%      95% 
##   59.508 3582.737
conf_int <- t.test(house_data$X1stFlrSF)
conf_int
## 
##  One Sample t-test
## 
## data:  house_data$X1stFlrSF
## t = 114.91, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1142.780 1182.473
## sample estimates:
## mean of x 
##  1162.627
round(quantile(house_data$X1stFlrSF, c(.05, .95)), 3)
##      5%     95% 
##  672.95 1831.25

The simulated data is not a good fit for the observed data in this case. The simulated exponential distribution is much more skewed than our original data. While the 95the percentile isn’t that far off, the 5th percentile is very different in our observed data vs. our simulation.

str(final_data)
## 'data.frame':    1460 obs. of  14 variables:
##  $ Id          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ LotArea     : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ YearBuilt   : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd: int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ BsmtFinSF1  : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtUnfSF   : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ X1stFlrSF   : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF   : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ GrLivArea   : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ GarageArea  : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ WoodDeckSF  : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ SalePrice   : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

Modelling

# Standardize predictors
means <- sapply(final_data[,2:13],mean)
stdev <- sapply(final_data[,2:13],sd)
df.scaled <- as.data.frame(scale(final_data[,2:13], center=means, scale=stdev))
df.scaled$SalePrice <- final_data$SalePrice
df.scaled$Id <- final_data$Id
head(df.scaled)
##       LotArea  YearBuilt YearRemodAdd  BsmtFinSF1   BsmtUnfSF TotalBsmtSF
## 1 -0.20707076  1.0506338    0.8783671  0.57522774 -0.94426706  -0.4591452
## 2 -0.09185490  0.1566800   -0.4294298  1.17159068 -0.64100836   0.4663051
## 3  0.07345481  0.9844150    0.8299302  0.09287536 -0.30153966  -0.3132614
## 4 -0.09686428 -1.8629933   -0.7200514 -0.49910256 -0.06164845  -0.6870887
## 5  0.37501979  0.9513056    0.7330564  0.46340969 -0.17480468   0.1996113
## 6  0.36049258  0.7195398    0.4908717  0.63223302 -1.13889578  -0.5959113
##     X1stFlrSF  X2ndFlrSF  GrLivArea  GarageArea WoodDeckSF OpenPorchSF
## 1 -0.79316202  1.1614536  0.3702066  0.35088009 -0.7519182  0.21642900
## 2  0.25705235 -0.7948909 -0.4823466 -0.06071021  1.6256378 -0.70424195
## 3 -0.62761099  1.1889432  0.5148362  0.63150985 -0.7519182 -0.07033736
## 4 -0.52155486  0.9369551  0.3835277  0.79053338 -0.7519182 -0.17598812
## 5 -0.04559563  1.6173231  1.2988806  1.69790291  0.7799299  0.56356723
## 6 -0.94836612  0.5017028 -0.2920446  0.03283304 -0.4327832 -0.25145296
##   SalePrice Id
## 1    208500  1
## 2    181500  2
## 3    223500  3
## 4    140000  4
## 5    250000  5
## 6    143000  6
attach(df.scaled)
model_1 <- lm(SalePrice ~ LotArea + YearBuilt + YearRemodAdd + BsmtFinSF1 + BsmtUnfSF + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + GrLivArea + GarageArea + WoodDeckSF + OpenPorchSF)
summary(model_1)
## 
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + YearRemodAdd + 
##     BsmtFinSF1 + BsmtUnfSF + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + 
##     GrLivArea + GarageArea + WoodDeckSF + OpenPorchSF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -626565  -18103   -3396   14109  281540 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  180921.2     1070.4 169.030  < 2e-16 ***
## LotArea        3773.3     1155.6   3.265 0.001119 ** 
## YearBuilt     13846.4     1496.0   9.256  < 2e-16 ***
## YearRemodAdd  11793.7     1377.5   8.561  < 2e-16 ***
## BsmtFinSF1     7883.3     3204.7   2.460 0.014013 *  
## BsmtUnfSF      1408.7     3025.7   0.466 0.641591    
## TotalBsmtSF   10654.5     3469.5   3.071 0.002174 ** 
## X1stFlrSF     15002.4     8972.3   1.672 0.094725 .  
## X2ndFlrSF     16330.1     9986.8   1.635 0.102232    
## GrLivArea     15749.9    11843.8   1.330 0.183790    
## GarageArea    11974.5     1408.2   8.503  < 2e-16 ***
## WoodDeckSF     3831.2     1148.2   3.337 0.000869 ***
## OpenPorchSF     904.3     1158.8   0.780 0.435294    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40900 on 1447 degrees of freedom
## Multiple R-squared:  0.7371, Adjusted R-squared:  0.735 
## F-statistic: 338.2 on 12 and 1447 DF,  p-value: < 2.2e-16

Remove Variables with High P-values from model

We try to improve on the model by removing variable with high P-value

# Remove BsmtUnfSF and OpenPorchSF
model_2 <- lm(SalePrice ~ LotArea + YearBuilt + YearRemodAdd + BsmtFinSF1 + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + GrLivArea + GarageArea + WoodDeckSF)
summary(model_2)
## 
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + YearRemodAdd + 
##     BsmtFinSF1 + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + GrLivArea + 
##     GarageArea + WoodDeckSF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -625862  -18181   -3391   14450  280297 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    180921       1070 169.099  < 2e-16 ***
## LotArea          3728       1150   3.242  0.00122 ** 
## YearBuilt       13894       1494   9.298  < 2e-16 ***
## YearRemodAdd    11921       1370   8.703  < 2e-16 ***
## BsmtFinSF1       6511       1277   5.099 3.87e-07 ***
## TotalBsmtSF     12122       2040   5.944 3.49e-09 ***
## X1stFlrSF       14905       8967   1.662  0.09671 .  
## X2ndFlrSF       16408       9982   1.644  0.10043    
## GrLivArea       15965      11835   1.349  0.17755    
## GarageArea      12040       1406   8.564  < 2e-16 ***
## WoodDeckSF       3735       1142   3.271  0.00110 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40880 on 1449 degrees of freedom
## Multiple R-squared:  0.737,  Adjusted R-squared:  0.7352 
## F-statistic:   406 on 10 and 1449 DF,  p-value: < 2.2e-16

We take out the next highest p-value

# Remove GrLivArea
model_3 <- lm(SalePrice ~ LotArea + YearBuilt + YearRemodAdd + BsmtFinSF1 + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + GarageArea + WoodDeckSF)
summary(model_3)
## 
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + YearRemodAdd + 
##     BsmtFinSF1 + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + GarageArea + 
##     WoodDeckSF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -626418  -18241   -3365   14384  279706 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    180921       1070 169.051  < 2e-16 ***
## LotArea          3713       1150   3.228  0.00128 ** 
## YearBuilt       13559       1474   9.199  < 2e-16 ***
## YearRemodAdd    11992       1369   8.759  < 2e-16 ***
## BsmtFinSF1       6447       1276   5.050 4.97e-07 ***
## TotalBsmtSF     12210       2039   5.988 2.68e-09 ***
## X1stFlrSF       26703       1980  13.489  < 2e-16 ***
## X2ndFlrSF       29780       1177  25.306  < 2e-16 ***
## GarageArea      12011       1406   8.543  < 2e-16 ***
## WoodDeckSF       3738       1142   3.272  0.00109 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40890 on 1450 degrees of freedom
## Multiple R-squared:  0.7367, Adjusted R-squared:  0.735 
## F-statistic: 450.7 on 9 and 1450 DF,  p-value: < 2.2e-16

Our 4th model has all very low p-values and a moderately OK R2 vale at 0.735. Let’s see how it does on the test data.

par(mfrow=c(2,2))
plot(model_3)

Work with Test Data

#Load the data and remove columns same as our training data
test_df <- read.csv("https://raw.githubusercontent.com/omocharly/DATA605/main/test.csv")
test_df <- test_df %>% select_if(is.numeric)
test_df <- test_df[,c(1, 4, 7,8, 10:17, 28:34)]
test_df <- test_df[,-c(6,11,16:19)]
str(test_df)
## 'data.frame':    1459 obs. of  13 variables:
##  $ Id          : int  1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 ...
##  $ LotArea     : int  11622 14267 13830 9978 5005 10000 7980 8402 10176 8400 ...
##  $ YearBuilt   : int  1961 1958 1997 1998 1992 1993 1992 1998 1990 1970 ...
##  $ YearRemodAdd: int  1961 1958 1998 1998 1992 1994 2007 1998 1990 1970 ...
##  $ BsmtFinSF1  : int  468 923 791 602 263 0 935 0 637 804 ...
##  $ BsmtUnfSF   : int  270 406 137 324 1017 763 233 789 663 0 ...
##  $ TotalBsmtSF : int  882 1329 928 926 1280 763 1168 789 1300 882 ...
##  $ X1stFlrSF   : int  896 1329 928 926 1280 763 1187 789 1341 882 ...
##  $ X2ndFlrSF   : int  0 0 701 678 0 892 0 676 0 0 ...
##  $ GrLivArea   : int  896 1329 1629 1604 1280 1655 1187 1465 1341 882 ...
##  $ GarageArea  : int  730 312 482 470 506 440 420 393 506 525 ...
##  $ WoodDeckSF  : int  140 393 212 360 0 157 483 0 192 240 ...
##  $ OpenPorchSF : int  0 36 34 36 82 84 21 75 0 0 ...
# Standardize test predictors
test.scaled <- as.data.frame(scale(test_df[,2:13], center=means, scale=stdev))
test.scaled$SalePrice <- test_df$SalePrice
test.scaled$Id <- test_df$Id
head(test.scaled)
##       LotArea  YearBuilt YearRemodAdd  BsmtFinSF1  BsmtUnfSF TotalBsmtSF
## 1  0.11072464 -0.3399610   -1.1559837  0.05341016 -0.6726921  -0.3998799
## 2  0.37572111 -0.4392892   -1.3012945  1.05100259 -0.3649071   0.6190272
## 3  0.33193908  0.8519774    0.6361825  0.76159116 -0.9736877  -0.2950259
## 4 -0.05398395  0.8850868    0.6361825  0.34720661 -0.5504834  -0.2995848
## 5 -0.55221739  0.6864304    0.3455610 -0.39605455  1.0178620   0.5073350
## 6 -0.05177982  0.7195398    0.4424348 -0.97268490  0.4430284  -0.6711326
##    X1stFlrSF  X2ndFlrSF  GrLivArea  GarageArea WoodDeckSF OpenPorchSF   Id
## 1 -0.6896926 -0.7948909 -1.1788522  1.20212368  0.3650544  -0.7042419 1461
## 2  0.4303636 -0.7948909 -0.3548443 -0.75293027  2.3835835  -0.1608952 1462
## 3 -0.6069171  0.8109610  0.2160619  0.04218737  0.9394975  -0.1910811 1463
## 4 -0.6120906  0.7582726  0.1684864 -0.01393859  2.1202971  -0.1608952 1464
## 5  0.3036136 -0.7948909 -0.4480923  0.15443927 -0.7519182   0.5333813 1465
## 6 -1.0337284  1.2485041  0.2655405 -0.15425346  0.5006868   0.5635672 1466
sp_predictions <- predict(model_3,newdata=test.scaled)
sp_predictions <- data.frame(as.vector(sp_predictions))
sp_predictions$Id <- test.scaled$Id
sp_predictions[,c(1,2)] <- sp_predictions[,c(2,1)]
colnames(sp_predictions) <- c("Id", "SalePrice")
sp_predictions[is.na(sp_predictions)] <- 0
head(sp_predictions)
##     Id SalePrice
## 1 1461  132036.3
## 2 1462  162773.4
## 3 1463  214604.9
## 4 1464  212925.6
## 5 1465  179443.9
## 6 1466  190921.1
write.csv(sp_predictions,'price_predictions.csv', row.names=FALSE)

Kaggle Result

My kaggle username is Charles Ugiagbe and my kaggle submission score is 0.47612