Data 605 Final

Problem 1: Playing With PageRank

Form the A matrix. Then, introduce decay and form the B matrix as we did in the course notes.

First let’s form A:

A <- matrix(c(0, 1/2, 1/2, 0, 0, 0,
              0, 0, 0, 0, 0, 0,
              1/3, 1/3, 0, 0, 1/3, 0,
              0, 0, 0, 0, 1/2, 1/2,
              0, 0, 0, 1/2, 0, 1/2,
              0, 0, 0, 1, 0, 0
              ), nrow = 6)

A

##      [,1] [,2]      [,3] [,4] [,5] [,6]
## [1,]  0.0    0 0.3333333  0.0  0.0    0
## [2,]  0.5    0 0.3333333  0.0  0.0    0
## [3,]  0.5    0 0.0000000  0.0  0.0    0
## [4,]  0.0    0 0.0000000  0.0  0.5    1
## [5,]  0.0    0 0.3333333  0.5  0.0    0
## [6,]  0.0    0 0.0000000  0.5  0.5    0

Initially I solved the problem with A as is, but the B matrix would converge at 0 (solving for the eigenvector showed me I’m not right). After some digging I found making A row-stochastic with a uniform vector gets me to the solution I want. So we will move forward with the vector below.

A <- matrix(c(0, 1/2, 1/2, 0, 0, 0,
             1/6, 1/6, 1/6, 1/6, 1/6, 1/6,
             1/3, 1/3, 0, 0, 1/3, 0,
             0, 0, 0, 0, 1/2, 1/2,
             0, 0, 0, 1/2, 0, 1/2,
             0, 0, 0, 1, 0, 0), nrow=6)
print(A)

##      [,1]      [,2]      [,3] [,4] [,5] [,6]
## [1,]  0.0 0.1666667 0.3333333  0.0  0.0    0
## [2,]  0.5 0.1666667 0.3333333  0.0  0.0    0
## [3,]  0.5 0.1666667 0.0000000  0.0  0.0    0
## [4,]  0.0 0.1666667 0.0000000  0.0  0.5    1
## [5,]  0.0 0.1666667 0.3333333  0.5  0.0    0
## [6,]  0.0 0.1666667 0.0000000  0.5  0.5    0

Next we form the decay matrix B

\[B = 0.85 \times A + \frac{0.15}{n}\]

n <- 6
B <- 0.85 * A + (0.15/n)
print(B)

##       [,1]      [,2]      [,3]  [,4]  [,5]  [,6]
## [1,] 0.025 0.1666667 0.3083333 0.025 0.025 0.025
## [2,] 0.450 0.1666667 0.3083333 0.025 0.025 0.025
## [3,] 0.450 0.1666667 0.0250000 0.025 0.025 0.025
## [4,] 0.025 0.1666667 0.0250000 0.025 0.450 0.875
## [5,] 0.025 0.1666667 0.3083333 0.450 0.025 0.025
## [6,] 0.025 0.1666667 0.0250000 0.450 0.450 0.025

Start with a uniform rank vector r and perform power iterations on B till convergence. That is, compute the solution r = Bn × r. Attempt this for a sufficiently large n so that r actually converges.

Just like in the example each element of r, to start, will be 1/6

r <- matrix(c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6), nrow=6, ncol=1)
r

##           [,1]
## [1,] 0.1666667
## [2,] 0.1666667
## [3,] 0.1666667
## [4,] 0.1666667
## [5,] 0.1666667
## [6,] 0.1666667

Plotting one of the values in r over n shows when n converges which is n~10

n <- 30
ns <- c(0)
rs <- c(r[1])
new_r <- r
Bn <- B

for (i in 1:n) {
  Bn <- Bn%*%B
  new_r <- Bn%*%r
  rs <- append(rs, new_r[4])
  ns <- append(ns, i)
}

plot(ns, rs)

Result:

new_r

##            [,1]
## [1,] 0.05170475
## [2,] 0.07367927
## [3,] 0.05741242
## [4,] 0.34870368
## [5,] 0.19990381
## [6,] 0.26859608

Compute the eigen-decomposition of B and verify that you indeed get an eigenvalue of 1 as the largest eigenvalue and that its corresponding eigenvector is the same vector that you obtained in the previous power iteration method. Further, this eigenvector has all positive entries and it sums to 1

As we can see 1 is an eigenvalue of B

x<-eigen(B)
x$values

## [1]  1.00000000+0i  0.57619235+0i -0.42500000+0i -0.42500000-0i -0.34991524+0i
## [6] -0.08461044+0i

To make sure the vector sums to 1 we divide it by the total sum. We can see the values are the same as the previous problem.

x$vectors[,1]/sum(x$vectors[,1])

## [1] 0.05170475+0i 0.07367926+0i 0.05741241+0i 0.34870369+0i 0.19990381+0i
## [6] 0.26859608+0i

Use the graph package in R and its page.rank method to compute the Page Rank of the graph as given in A. Note that you don’t need to apply decay. The package starts with a connected graph and applies decay internally. Verify that you do get the same PageRank vector as the two approaches above

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

A <- matrix(c(0, 1/2, 1/2, 0, 0, 0,
              0, 0, 0, 0, 0, 0,
              1/3, 1/3, 0, 0, 1/3, 0,
              0, 0, 0, 0, 1/2, 1/2,
              0, 0, 0, 1/2, 0, 1/2,
              0, 0, 0, 1, 0, 0
              ), nrow = 6, byrow = T)

This is really cool. I didn’t know R had the ability to do this.

g <- graph_from_adjacency_matrix(A, weighted=TRUE)
plot(g)

The same results are returned

as.matrix(page.rank(g)$vector)

##            [,1]
## [1,] 0.05170475
## [2,] 0.07367926
## [3,] 0.05741241
## [4,] 0.34870369
## [5,] 0.19990381
## [6,] 0.26859608

Problem 2

Using the training.csv file, plot representations of the first 10 images to understand the data format. Go ahead and divide all pixels by 255 to produce values between 0 and 1. (This is equivalent to min-max scaling.)

library(keras)

## 
## Attaching package: 'keras'

## The following objects are masked from 'package:igraph':
## 
##     %<-%, normalize

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::as_data_frame() masks tibble::as_data_frame(), igraph::as_data_frame()
## x purrr::compose()       masks igraph::compose()
## x tidyr::crossing()      masks igraph::crossing()
## x dplyr::filter()        masks stats::filter()
## x dplyr::groups()        masks igraph::groups()
## x dplyr::lag()           masks stats::lag()
## x purrr::simplify()      masks igraph::simplify()

library(readr)

test <-read_csv("test.csv")

## Rows: 28000 Columns: 784

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (784): pixel0, pixel1, pixel2, pixel3, pixel4, pixel5, pixel6, pixel7, p...

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

train <-read_csv("train.csv")

## Rows: 42000 Columns: 785

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (785): label, pixel0, pixel1, pixel2, pixel3, pixel4, pixel5, pixel6, pi...

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

plot_img = function(row)
{
  img <- matrix(row, nrow=28,ncol=28)
  mode(img) = "numeric"
  image(img, useRaster=TRUE, axes=FALSE)
}

for (i in 1:10) {
  plot_img(test[i,])
}

What is the frequency distribution of the numbers in the dataset?

As we can see the dataset is mainly consisting of 0’s. This is because the the images are mostly a uniform background color.

x_test <- array_reshape(as.matrix(test) / 255, c(nrow(test), 28, 28, 1))
hist(x_test)

hist(x_test[x_test > 0])

For each number, provide the mean pixel intensity. What does this tell you? (5 points)

I won’t show all the means, but when I looked at it all the initial values are 0 and the means increase in value towards the middle. This just means all the drawings are towards the center of the image.

colMeans(test)[1:20]

##  pixel0  pixel1  pixel2  pixel3  pixel4  pixel5  pixel6  pixel7  pixel8  pixel9 
##       0       0       0       0       0       0       0       0       0       0 
## pixel10 pixel11 pixel12 pixel13 pixel14 pixel15 pixel16 pixel17 pixel18 pixel19 
##       0       0       0       0       0       0       0       0       0       0

Reduce the data by using principal components that account for 95% of the variance. How many components did you generate? Use PCA to generate all possible components (100% of the variance). How many components are possible? Why?

X <- test
#Scaling the pixels
Xreduced <- X/255
Xcov <- cov(Xreduced)

#Reduce to principle components
pcaX <- prcomp(Xcov)

(cumsum(pcaX$sdev^2) / sum(pcaX$sdev^2))[1:20]

##  [1] 0.2537120 0.4202695 0.5442091 0.6401227 0.7112063 0.7702888 0.8042082
##  [8] 0.8314769 0.8539140 0.8716505 0.8859296 0.8987083 0.9082055 0.9176209
## [15] 0.9258411 0.9331399 0.9388017 0.9440446 0.9486074 0.9529410

Plot the first 10 images generated by PCA

It appears to be just noise. This is because its looking at the variation across a variety of different images - so it makes sense that it looks like a blob.

k<- 1
j<-784

for (i in 1:10){
  img<- matrix(pcaX$rotation[k:j],nrow=28, ncol=28)
  image(img, useRaster=TRUE, axes=FALSE)
  k<- k+784
  j<- j+784
}

Now, select only those images that have labels that are 8’s. Re-run PCA that accounts for all of the variance (100%).

First we need to filter the training data down to 8’s only. Then, we drop the label column.

library(dplyr)

train_8 <- train %>% filter(label == 8)
train_8 <- dplyr::select(train_8,c(2:785))
dim(train_8)

## [1] 4063  784

knitr::opts_chunk$set(cache = TRUE)
X <- train_8
#Scaling the pixels
Xreduced <- X/255
Xcov <- cov(Xreduced)

#Reduce to principle components
pcaX <- prcomp(Xcov)

Now Let’s plot the first few images. The images all resemble 8! As we expected.

k<- 1
j<-784

for (i in 1:10){
  img<- matrix(pcaX$rotation[k:j],nrow=28, ncol=28)
  image(img, useRaster=TRUE, axes=FALSE)
  k<- k+784
  j<- j+784
}

Multinomial

Here we plug our dataset into a multinomial model. This took quite a long time to run. If I had more time I would fiqure out how to optimize it.

knitr::opts_chunk$set(cache = TRUE)
library(nnet)

train$label <- as.factor(train$label)
X <- train[2:785]/255
X$label <- train$label
model <- multinom(label~., data=X, MaxNWts=1000000)

## # weights:  7860 (7065 variable)
## initial  value 96708.573906 
## iter  10 value 25322.714106
## iter  20 value 20402.086316
## iter  30 value 19312.872829
## iter  40 value 18703.256586
## iter  50 value 18197.815143
## iter  60 value 17732.985798
## iter  70 value 16739.962157
## iter  80 value 14961.658448
## iter  90 value 13446.085942
## iter 100 value 12442.636014
## final  value 12442.636014 
## stopped after 100 iterations

The calssification accuracy (checking against the dataset we already used) is 83%. Not the best.

prediction <- predict(model, train[2:785])
prediction <- as.data.frame(prediction, row.names=c('predicted_value'))

## Warning in as.data.frame.factor(prediction, row.names = c("predicted_value")):
## 'row.names' is not a character vector of length 42000 -- omitting it. Will be an
## error!

prediction$actual <- train$label
prediction$equal <- ifelse(prediction$prediction==prediction$actual,1,0)

sum(prediction$equal) / nrow(prediction)

## [1] 0.8342381

Interestingly it look like the had difficulty classifying 8.

library("caret")

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

confusionMatrix(prediction$prediction, prediction$actual)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0 3841    0   12    5    6   31   14   12    4    9
##          1    1 3734   11    3    2   11    4    7    5    3
##          2    9   13 3460   49   22   14   17   43    5    8
##          3   11   37   99 3833   18  235    4   52   32   36
##          4    6    2   22    2 3222   15    8   13    2   17
##          5    9    0    3   12    2 1926   11    3    1    4
##          6   27    5   31   13   22   53 3808    3    2    0
##          7    2    4   14    8    2    8    2 3420    1   19
##          8  217  867  492  391  473 1425  266  234 4005  303
##          9    9   22   33   35  303   77    3  614    6 3789
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8342          
##                  95% CI : (0.8306, 0.8378)
##     No Information Rate : 0.1115          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8158          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.92957  0.79718  0.82835  0.88095  0.79126  0.50751
## Specificity           0.99754  0.99874  0.99524  0.98608  0.99771  0.99882
## Pos Pred Value        0.97636  0.98757  0.95055  0.87973  0.97371  0.97717
## Neg Pred Value        0.99236  0.97514  0.98131  0.98624  0.97803  0.95331
## Prevalence            0.09838  0.11152  0.09945  0.10360  0.09695  0.09036
## Detection Rate        0.09145  0.08890  0.08238  0.09126  0.07671  0.04586
## Detection Prevalence  0.09367  0.09002  0.08667  0.10374  0.07879  0.04693
## Balanced Accuracy     0.96356  0.89796  0.91179  0.93351  0.89448  0.75317
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.92047  0.77710  0.98572  0.90473
## Specificity           0.99588  0.99840  0.87695  0.97086
## Pos Pred Value        0.96065  0.98276  0.46178  0.77469
## Neg Pred Value        0.99135  0.97453  0.99826  0.98925
## Prevalence            0.09850  0.10479  0.09674  0.09971
## Detection Rate        0.09067  0.08143  0.09536  0.09021
## Detection Prevalence  0.09438  0.08286  0.20650  0.11645
## Balanced Accuracy     0.95818  0.88775  0.93134  0.93779

Problem 3

Load Data

test2 <-read_csv("test2.csv")

## Rows: 1459 Columns: 80

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (37): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

train2 <-read_csv("train2.csv")

## Rows: 1460 Columns: 81

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (38): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Descriptive and Inferential Statistics - Univariate Descriptive Stats

library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

describe(train2)

##                vars    n      mean       sd   median   trimmed      mad   min
## Id                1 1460    730.50   421.61    730.5    730.50   541.15     1
## MSSubClass        2 1460     56.90    42.30     50.0     49.15    44.48    20
## MSZoning*         3 1460      4.03     0.63      4.0      4.06     0.00     1
## LotFrontage       4 1201     70.05    24.28     69.0     68.94    16.31    21
## LotArea           5 1460  10516.83  9981.26   9478.5   9563.28  2962.23  1300
## Street*           6 1460      2.00     0.06      2.0      2.00     0.00     1
## Alley*            7   91      1.45     0.50      1.0      1.44     0.00     1
## LotShape*         8 1460      2.94     1.41      4.0      3.05     0.00     1
## LandContour*      9 1460      3.78     0.71      4.0      4.00     0.00     1
## Utilities*       10 1460      1.00     0.03      1.0      1.00     0.00     1
## LotConfig*       11 1460      4.02     1.62      5.0      4.27     0.00     1
## LandSlope*       12 1460      1.06     0.28      1.0      1.00     0.00     1
## Neighborhood*    13 1460     13.15     5.89     13.0     13.11     7.41     1
## Condition1*      14 1460      3.03     0.87      3.0      3.00     0.00     1
## Condition2*      15 1460      3.01     0.26      3.0      3.00     0.00     1
## BldgType*        16 1460      1.49     1.20      1.0      1.14     0.00     1
## HouseStyle*      17 1460      4.04     1.91      3.0      4.03     1.48     1
## OverallQual      18 1460      6.10     1.38      6.0      6.08     1.48     1
## OverallCond      19 1460      5.58     1.11      5.0      5.48     0.00     1
## YearBuilt        20 1460   1971.27    30.20   1973.0   1974.13    37.06  1872
## YearRemodAdd     21 1460   1984.87    20.65   1994.0   1986.37    19.27  1950
## RoofStyle*       22 1460      2.41     0.83      2.0      2.26     0.00     1
## RoofMatl*        23 1460      2.08     0.60      2.0      2.00     0.00     1
## Exterior1st*     24 1460     10.62     3.20     13.0     10.93     1.48     1
## Exterior2nd*     25 1460     11.34     3.54     14.0     11.65     2.97     1
## MasVnrType*      26 1452      2.76     0.62      3.0      2.73     0.00     1
## MasVnrArea       27 1452    103.69   181.07      0.0     63.15     0.00     0
## ExterQual*       28 1460      3.54     0.69      4.0      3.65     0.00     1
## ExterCond*       29 1460      4.73     0.73      5.0      4.95     0.00     1
## Foundation*      30 1460      2.40     0.72      2.0      2.46     1.48     1
## BsmtQual*        31 1423      3.26     0.87      3.0      3.43     1.48     1
## BsmtCond*        32 1423      3.81     0.66      4.0      4.00     0.00     1
## BsmtExposure*    33 1422      3.27     1.15      4.0      3.46     0.00     1
## BsmtFinType1*    34 1423      3.73     1.83      3.0      3.79     2.97     1
## BsmtFinSF1       35 1460    443.64   456.10    383.5    386.08   568.58     0
## BsmtFinType2*    36 1422      5.71     0.94      6.0      5.98     0.00     1
## BsmtFinSF2       37 1460     46.55   161.32      0.0      1.38     0.00     0
## BsmtUnfSF        38 1460    567.24   441.87    477.5    519.29   426.99     0
## TotalBsmtSF      39 1460   1057.43   438.71    991.5   1036.70   347.67     0
## Heating*         40 1460      2.04     0.30      2.0      2.00     0.00     1
## HeatingQC*       41 1460      2.54     1.74      1.0      2.42     0.00     1
## CentralAir*      42 1460      1.93     0.25      2.0      2.00     0.00     1
## Electrical*      43 1459      4.68     1.05      5.0      5.00     0.00     1
## 1stFlrSF         44 1460   1162.63   386.59   1087.0   1129.99   347.67   334
## 2ndFlrSF         45 1460    346.99   436.53      0.0    285.36     0.00     0
## LowQualFinSF     46 1460      5.84    48.62      0.0      0.00     0.00     0
## GrLivArea        47 1460   1515.46   525.48   1464.0   1467.67   483.33   334
## BsmtFullBath     48 1460      0.43     0.52      0.0      0.39     0.00     0
## BsmtHalfBath     49 1460      0.06     0.24      0.0      0.00     0.00     0
## FullBath         50 1460      1.57     0.55      2.0      1.56     0.00     0
## HalfBath         51 1460      0.38     0.50      0.0      0.34     0.00     0
## BedroomAbvGr     52 1460      2.87     0.82      3.0      2.85     0.00     0
## KitchenAbvGr     53 1460      1.05     0.22      1.0      1.00     0.00     0
## KitchenQual*     54 1460      3.34     0.83      4.0      3.50     0.00     1
## TotRmsAbvGrd     55 1460      6.52     1.63      6.0      6.41     1.48     2
## Functional*      56 1460      6.75     0.98      7.0      7.00     0.00     1
## Fireplaces       57 1460      0.61     0.64      1.0      0.53     1.48     0
## FireplaceQu*     58  770      3.73     1.13      3.0      3.80     1.48     1
## GarageType*      59 1379      3.28     1.79      2.0      3.11     0.00     1
## GarageYrBlt      60 1379   1978.51    24.69   1980.0   1981.07    31.13  1900
## GarageFinish*    61 1379      2.18     0.81      2.0      2.23     1.48     1
## GarageCars       62 1460      1.77     0.75      2.0      1.77     0.00     0
## GarageArea       63 1460    472.98   213.80    480.0    469.81   177.91     0
## GarageQual*      64 1379      4.86     0.61      5.0      5.00     0.00     1
## GarageCond*      65 1379      4.90     0.52      5.0      5.00     0.00     1
## PavedDrive*      66 1460      2.86     0.50      3.0      3.00     0.00     1
## WoodDeckSF       67 1460     94.24   125.34      0.0     71.76     0.00     0
## OpenPorchSF      68 1460     46.66    66.26     25.0     33.23    37.06     0
## EnclosedPorch    69 1460     21.95    61.12      0.0      3.87     0.00     0
## 3SsnPorch        70 1460      3.41    29.32      0.0      0.00     0.00     0
## ScreenPorch      71 1460     15.06    55.76      0.0      0.00     0.00     0
## PoolArea         72 1460      2.76    40.18      0.0      0.00     0.00     0
## PoolQC*          73    7      2.14     0.90      2.0      2.14     1.48     1
## Fence*           74  281      2.43     0.86      3.0      2.48     0.00     1
## MiscFeature*     75   54      2.91     0.45      3.0      3.00     0.00     1
## MiscVal          76 1460     43.49   496.12      0.0      0.00     0.00     0
## MoSold           77 1460      6.32     2.70      6.0      6.25     2.97     1
## YrSold           78 1460   2007.82     1.33   2008.0   2007.77     1.48  2006
## SaleType*        79 1460      8.51     1.56      9.0      8.92     0.00     1
## SaleCondition*   80 1460      4.77     1.10      5.0      5.00     0.00     1
## SalePrice        81 1460 180921.20 79442.50 163000.0 170783.29 56338.80 34900
##                   max  range   skew kurtosis      se
## Id               1460   1459   0.00    -1.20   11.03
## MSSubClass        190    170   1.40     1.56    1.11
## MSZoning*           5      4  -1.73     6.25    0.02
## LotFrontage       313    292   2.16    17.34    0.70
## LotArea        215245 213945  12.18   202.26  261.22
## Street*             2      1 -15.49   238.01    0.00
## Alley*              2      1   0.20    -1.98    0.05
## LotShape*           4      3  -0.61    -1.60    0.04
## LandContour*        4      3  -3.16     8.65    0.02
## Utilities*          2      1  38.13  1453.00    0.00
## LotConfig*          5      4  -1.13    -0.59    0.04
## LandSlope*          3      2   4.80    24.47    0.01
## Neighborhood*      25     24   0.02    -1.06    0.15
## Condition1*         9      8   3.01    16.34    0.02
## Condition2*         8      7  13.14   247.54    0.01
## BldgType*           5      4   2.24     3.41    0.03
## HouseStyle*         8      7   0.31    -0.96    0.05
## OverallQual        10      9   0.22     0.09    0.04
## OverallCond         9      8   0.69     1.09    0.03
## YearBuilt        2010    138  -0.61    -0.45    0.79
## YearRemodAdd     2010     60  -0.50    -1.27    0.54
## RoofStyle*          6      5   1.47     0.61    0.02
## RoofMatl*           8      7   8.09    66.28    0.02
## Exterior1st*       15     14  -0.72    -0.37    0.08
## Exterior2nd*       16     15  -0.69    -0.52    0.09
## MasVnrType*         4      3  -0.07    -0.13    0.02
## MasVnrArea       1600   1600   2.66    10.03    4.75
## ExterQual*          4      3  -1.83     3.86    0.02
## ExterCond*          5      4  -2.56     5.29    0.02
## Foundation*         6      5   0.09     1.02    0.02
## BsmtQual*           4      3  -1.31     1.27    0.02
## BsmtCond*           4      3  -3.39    10.14    0.02
## BsmtExposure*       4      3  -1.15    -0.39    0.03
## BsmtFinType1*       6      5  -0.02    -1.39    0.05
## BsmtFinSF1       5644   5644   1.68    11.06   11.94
## BsmtFinType2*       6      5  -3.56    12.32    0.02
## BsmtFinSF2       1474   1474   4.25    20.01    4.22
## BsmtUnfSF        2336   2336   0.92     0.46   11.56
## TotalBsmtSF      6110   6110   1.52    13.18   11.48
## Heating*            6      5   9.83   110.98    0.01
## HeatingQC*          5      4   0.48    -1.51    0.05
## CentralAir*         2      1  -3.52    10.42    0.01
## Electrical*         5      4  -3.06     7.49    0.03
## 1stFlrSF         4692   4358   1.37     5.71   10.12
## 2ndFlrSF         2065   2065   0.81    -0.56   11.42
## LowQualFinSF      572    572   8.99    82.83    1.27
## GrLivArea        5642   5308   1.36     4.86   13.75
## BsmtFullBath        3      3   0.59    -0.84    0.01
## BsmtHalfBath        2      2   4.09    16.31    0.01
## FullBath            3      3   0.04    -0.86    0.01
## HalfBath            2      2   0.67    -1.08    0.01
## BedroomAbvGr        8      8   0.21     2.21    0.02
## KitchenAbvGr        3      3   4.48    21.42    0.01
## KitchenQual*        4      3  -1.42     1.72    0.02
## TotRmsAbvGrd       14     12   0.67     0.87    0.04
## Functional*         7      6  -4.08    16.37    0.03
## Fireplaces          3      3   0.65    -0.22    0.02
## FireplaceQu*        5      4  -0.16    -0.98    0.04
## GarageType*         6      5   0.76    -1.30    0.05
## GarageYrBlt      2010    110  -0.65    -0.42    0.66
## GarageFinish*       3      2  -0.35    -1.41    0.02
## GarageCars          4      4  -0.34     0.21    0.02
## GarageArea       1418   1418   0.18     0.90    5.60
## GarageQual*         5      4  -4.43    18.25    0.02
## GarageCond*         5      4  -5.28    26.77    0.01
## PavedDrive*         3      2  -3.30     9.22    0.01
## WoodDeckSF        857    857   1.54     2.97    3.28
## OpenPorchSF       547    547   2.36     8.44    1.73
## EnclosedPorch     552    552   3.08    10.37    1.60
## 3SsnPorch         508    508  10.28   123.06    0.77
## ScreenPorch       480    480   4.11    18.34    1.46
## PoolArea          738    738  14.80   222.19    1.05
## PoolQC*             3      2  -0.22    -1.90    0.34
## Fence*              4      3  -0.57    -0.88    0.05
## MiscFeature*        4      3  -2.93    10.71    0.06
## MiscVal         15500  15500  24.43   697.64   12.98
## MoSold             12     11   0.21    -0.41    0.07
## YrSold           2010      4   0.10    -1.19    0.03
## SaleType*           9      8  -3.83    14.57    0.04
## SaleCondition*      6      5  -2.74     6.82    0.03
## SalePrice      755000 720100   1.88     6.50 2079.11

hist(train2$SalePrice)

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

First I chose Year built because I think there is a correlation there, and I was correct.

train2 %>%
  ggplot(aes(y=SalePrice, x=YearBuilt)) + geom_point()

train2 %>%
  ggplot(aes(y=SalePrice, x=YearRemodAdd)) + geom_point()

Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.

cordf <- train2 %>% select(YearBuilt, LotArea, TotalBsmtSF)
cor <- cor(cordf)
cor

##              YearBuilt    LotArea TotalBsmtSF
## YearBuilt   1.00000000 0.01422765   0.3914520
## LotArea     0.01422765 1.00000000   0.2608331
## TotalBsmtSF 0.39145200 0.26083313   1.0000000

In each test the alternative hypothesis is true: that the correlation is not equal to 0. This is suspect especially for the test with a pval=0.014. So we may need to be worried about a Familywise error.

cor.test(train2$YearBuilt, train2$LotArea, conf.level=0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  train2$YearBuilt and train2$LotArea
## t = 0.54332, df = 1458, p-value = 0.587
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  -0.01934322  0.04776648
## sample estimates:
##        cor 
## 0.01422765

cor.test(train2$YearBuilt, train2$TotalBsmtSF, conf.level=0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  train2$YearBuilt and train2$TotalBsmtSF
## t = 16.243, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.3626548 0.4195023
## sample estimates:
##      cor 
## 0.391452

cor.test(train2$TotalBsmtSF, train2$LotArea, conf.level=0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  train2$TotalBsmtSF and train2$LotArea
## t = 10.317, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2292786 0.2918400
## sample estimates:
##       cor 
## 0.2608331

Linear Algebra and Correlation

Invert your correlation matrix from above.

precision_matrix <- solve(cor)
precision_matrix

##              YearBuilt    LotArea TotalBsmtSF
## YearBuilt    1.1926351  0.1124547  -0.4961913
## LotArea      0.1124547  1.0836039  -0.3266604
## TotalBsmtSF -0.4961913 -0.3266604   1.2794390

Multiply the correlation matrix by the precision matrix

round(precision_matrix %*% cor)

##             YearBuilt LotArea TotalBsmtSF
## YearBuilt           1       0           0
## LotArea             0       1           0
## TotalBsmtSF         0       0           1

LU Decomposition

library(matrixcalc)

## 
## Attaching package: 'matrixcalc'

## The following object is masked from 'package:igraph':
## 
##     %s%

lu.decomposition(cor)

## $L
##            [,1]      [,2] [,3]
## [1,] 1.00000000 0.0000000    0
## [2,] 0.01422765 1.0000000    0
## [3,] 0.39145200 0.2553154    1
## 
## $U
##      [,1]       [,2]      [,3]
## [1,]    1 0.01422765 0.3914520
## [2,]    0 0.99979757 0.2552637
## [3,]    0 0.00000000 0.7815926

Calculus-Based Probability & Statistics

hist(train2$GrLivArea)

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

lambda <- fitdistr(train2$GrLivArea,"exponential")
lambda$estimate

##        rate 
## 0.000659864

sample <- rexp(1000, lambda$estimate)

hist(sample, breaks=50)

5th and 95th percentiles

l<-qexp(0.05, rate = lambda$estimate)
u<-qexp(0.95, rate = lambda$estimate)
c(l,u)

## [1]   77.73313 4539.92351

quantile(train2$GrLivArea, c(0.05, 0.95))

##     5%    95% 
##  848.0 2466.1

Model

I’m going to use a linear model since that’s what I’m most comfortable with. First I’m going to remove qualitative data from the table

prep <- train2[,sapply(train2, is.numeric)]

Now we are going to create the model and remove any variables that are not statistically significant

model <- lm(SalePrice~., data=prep)
summary(model)

## 
## Call:
## lm(formula = SalePrice ~ ., data = prep)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -442182  -16955   -2824   15125  318183 
## 
## Coefficients: (2 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -3.351e+05  1.701e+06  -0.197 0.843909    
## Id            -1.205e+00  2.658e+00  -0.453 0.650332    
## MSSubClass    -2.001e+02  3.451e+01  -5.797 8.84e-09 ***
## LotFrontage   -1.160e+02  6.126e+01  -1.894 0.058503 .  
## LotArea        5.422e-01  1.575e-01   3.442 0.000599 ***
## OverallQual    1.866e+04  1.482e+03  12.592  < 2e-16 ***
## OverallCond    5.239e+03  1.368e+03   3.830 0.000135 ***
## YearBuilt      3.164e+02  8.766e+01   3.610 0.000321 ***
## YearRemodAdd   1.194e+02  8.668e+01   1.378 0.168607    
## MasVnrArea     3.141e+01  7.022e+00   4.473 8.54e-06 ***
## BsmtFinSF1     1.736e+01  5.838e+00   2.973 0.003014 ** 
## BsmtFinSF2     8.342e+00  8.766e+00   0.952 0.341532    
## BsmtUnfSF      5.005e+00  5.277e+00   0.948 0.343173    
## TotalBsmtSF           NA         NA      NA       NA    
## `1stFlrSF`     4.597e+01  7.360e+00   6.246 6.02e-10 ***
## `2ndFlrSF`     4.663e+01  6.102e+00   7.641 4.72e-14 ***
## LowQualFinSF   3.341e+01  2.794e+01   1.196 0.232009    
## GrLivArea             NA         NA      NA       NA    
## BsmtFullBath   9.043e+03  3.198e+03   2.828 0.004776 ** 
## BsmtHalfBath   2.465e+03  5.073e+03   0.486 0.627135    
## FullBath       5.433e+03  3.531e+03   1.539 0.124182    
## HalfBath      -1.098e+03  3.321e+03  -0.331 0.740945    
## BedroomAbvGr  -1.022e+04  2.155e+03  -4.742 2.40e-06 ***
## KitchenAbvGr  -2.202e+04  6.710e+03  -3.282 0.001063 ** 
## TotRmsAbvGrd   5.464e+03  1.487e+03   3.674 0.000251 ***
## Fireplaces     4.372e+03  2.189e+03   1.998 0.046020 *  
## GarageYrBlt   -4.728e+01  9.106e+01  -0.519 0.603742    
## GarageCars     1.685e+04  3.491e+03   4.827 1.58e-06 ***
## GarageArea     6.274e+00  1.213e+01   0.517 0.605002    
## WoodDeckSF     2.144e+01  1.002e+01   2.139 0.032662 *  
## OpenPorchSF   -2.252e+00  1.949e+01  -0.116 0.907998    
## EnclosedPorch  7.295e+00  2.062e+01   0.354 0.723590    
## `3SsnPorch`    3.349e+01  3.758e+01   0.891 0.373163    
## ScreenPorch    5.805e+01  2.041e+01   2.844 0.004532 ** 
## PoolArea      -6.052e+01  2.990e+01  -2.024 0.043204 *  
## MiscVal       -3.761e+00  6.960e+00  -0.540 0.589016    
## MoSold        -2.217e+02  4.229e+02  -0.524 0.600188    
## YrSold        -2.474e+02  8.458e+02  -0.293 0.769917    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36800 on 1085 degrees of freedom
##   (339 observations deleted due to missingness)
## Multiple R-squared:  0.8096, Adjusted R-squared:  0.8034 
## F-statistic: 131.8 on 35 and 1085 DF,  p-value: < 2.2e-16

prep <- prep %>% dplyr::select(MSSubClass, LotArea, OverallQual, OverallCond, YearBuilt, MasVnrArea,BsmtFinSF1,`1stFlrSF`,`2ndFlrSF`,BsmtFullBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageCars,WoodDeckSF,ScreenPorch,PoolArea,SalePrice)

model <- lm(SalePrice~., data=prep)
summary(model)

## 
## Call:
## lm(formula = SalePrice ~ ., data = prep)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -471826  -16295   -1995   13876  299766 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -8.391e+05  8.879e+04  -9.450  < 2e-16 ***
## MSSubClass   -1.632e+02  2.582e+01  -6.322 3.44e-10 ***
## LotArea       4.209e-01  1.002e-01   4.202 2.81e-05 ***
## OverallQual   1.869e+04  1.125e+03  16.619  < 2e-16 ***
## OverallCond   5.078e+03  9.226e+02   5.504 4.40e-08 ***
## YearBuilt     3.918e+02  4.508e+01   8.691  < 2e-16 ***
## MasVnrArea    2.949e+01  5.874e+00   5.021 5.78e-07 ***
## BsmtFinSF1    9.989e+00  2.975e+00   3.358 0.000807 ***
## `1stFlrSF`    5.770e+01  4.651e+00  12.407  < 2e-16 ***
## `2ndFlrSF`    4.883e+01  4.112e+00  11.874  < 2e-16 ***
## BsmtFullBath  8.955e+03  2.373e+03   3.774 0.000167 ***
## BedroomAbvGr -1.026e+04  1.646e+03  -6.233 6.00e-10 ***
## KitchenAbvGr -1.338e+04  5.098e+03  -2.624 0.008774 ** 
## TotRmsAbvGrd  5.418e+03  1.214e+03   4.461 8.78e-06 ***
## Fireplaces    2.507e+03  1.734e+03   1.446 0.148471    
## GarageCars    1.042e+04  1.695e+03   6.146 1.03e-09 ***
## WoodDeckSF    2.462e+01  7.911e+00   3.112 0.001892 ** 
## ScreenPorch   5.085e+01  1.689e+01   3.010 0.002657 ** 
## PoolArea     -3.049e+01  2.341e+01  -1.302 0.192979    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34820 on 1433 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.8095, Adjusted R-squared:  0.8071 
## F-statistic: 338.4 on 18 and 1433 DF,  p-value: < 2.2e-16

Residualus

We need to confirm this is a model that fits the data correctly. The first plot is almost on a line, the second shows a normal distribution, and the last is randomly placed. We are good to move forward with this model.

qqnorm(model$residuals)
qqline(model$residuals)

hist(model$residuals)

plot(x=model$residuals)

Predict Test Data

test_p <- test2 %>% dplyr::select(MSSubClass, LotArea, OverallQual, OverallCond, YearBuilt, MasVnrArea,BsmtFinSF1,`1stFlrSF`,`2ndFlrSF`,BsmtFullBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageCars,WoodDeckSF,ScreenPorch,PoolArea)

prediction <- predict(model, test_p)
prediction <- as.data.frame(prediction, row.names=c('SalePrice'))

## Warning in as.data.frame.numeric(prediction, row.names = c("SalePrice")):
## 'row.names' is not a character vector of length 1459 -- omitting it. Will be an
## error!

prediction$Id <- test2$Id
head(prediction)

##   prediction   Id
## 1   124344.6 1461
## 2   170976.1 1462
## 3   170828.1 1463
## 4   199895.3 1464
## 5   196684.0 1465
## 6   182311.1 1466

write.csv(prediction,"/Users/jordanglendrange/Documents/Data 605/Final/prediction.csv")

Submission

User Name: Jordan Glendrange Score: 1.40373