Question 1: PageRank

PageRank Functionality Verification:

Form the A Matrix, Decay, B Matrix

As a first note, I do not see how \(r_f = A \times r_i\) as stated: multiplying the given \(A\) by \(r_i\) does not produce the given \(r_f\).

For the PageRank algorithm to work properly, the columns of the transition matrix should sum to 1, as is the case for a column-stochastic Markov transition matrix.

A <- matrix(round(c(0,1/2,1/2,0,0,0,0,0,0,0,0,0,1/3,1/3,0,0,1/3, 0,0,0,0,0,1/2,1/2,0,0,0,1/2,0,1/2,0,0,0,1,0,0),2), nrow = 6,byrow = TRUE)
A
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.00 0.50  0.5  0.0 0.00  0.0
## [2,] 0.00 0.00  0.0  0.0 0.00  0.0
## [3,] 0.33 0.33  0.0  0.0 0.33  0.0
## [4,] 0.00 0.00  0.0  0.0 0.50  0.5
## [5,] 0.00 0.00  0.0  0.5 0.00  0.5
## [6,] 0.00 0.00  0.0  1.0 0.00  0.0
A_orig <- A
ri <- matrix(c(1/6,1/6,1/6,1/6,1/6,1/6),nrow = 6)
ri
##           [,1]
## [1,] 0.1666667
## [2,] 0.1666667
## [3,] 0.1666667
## [4,] 0.1666667
## [5,] 0.1666667
## [6,] 0.1666667
rf_1 <- A %*% ri
rf_1 
##           [,1]
## [1,] 0.1666667
## [2,] 0.0000000
## [3,] 0.1650000
## [4,] 0.1666667
## [5,] 0.1666667
## [6,] 0.1666667
rf <- (A %^% 6) %*% ri
rf
##            [,1]
## [1,] 0.03428356
## [2,] 0.00000000
## [3,] 0.06706975
## [4,] 0.16666667
## [5,] 0.16666667
## [6,] 0.16666667

Here we can see that given \(A\) and \(r_i\), \(r_f\) on its own is not equivalent to \(A \times r_i\).

Currently, row 2 of A is all zeros: page 2 has no outgoing links, so a surfer who lands there has no defined next step. Because, absent any clicks, there is a uniform probability of landing on each page, \(\frac 16\) will be placed in each entry of row 2.

As entered, A is row-oriented: once row 2 is filled in, each row sums to 1, but the columns do not. A way around this is to transpose A so that the matrix satisfies the condition that every column sums to 1.

A <- matrix(c(0,1/2,1/2,0,0,0,1/6,1/6,1/6,1/6,1/6,1/6,1/3,1/3,0,0,1/3, 0,0,0,0,0,1/2,1/2,0,0,0,1/2,0,1/2,0,0,0,1,0,0), nrow = 6,byrow = TRUE)
A
##           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
## [1,] 0.0000000 0.5000000 0.5000000 0.0000000 0.0000000 0.0000000
## [2,] 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
## [3,] 0.3333333 0.3333333 0.0000000 0.0000000 0.3333333 0.0000000
## [4,] 0.0000000 0.0000000 0.0000000 0.0000000 0.5000000 0.5000000
## [5,] 0.0000000 0.0000000 0.0000000 0.5000000 0.0000000 0.5000000
## [6,] 0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
A <- t(A)
A
##      [,1]      [,2]      [,3] [,4] [,5] [,6]
## [1,]  0.0 0.1666667 0.3333333  0.0  0.0    0
## [2,]  0.5 0.1666667 0.3333333  0.0  0.0    0
## [3,]  0.5 0.1666667 0.0000000  0.0  0.0    0
## [4,]  0.0 0.1666667 0.0000000  0.0  0.5    1
## [5,]  0.0 0.1666667 0.3333333  0.5  0.0    0
## [6,]  0.0 0.1666667 0.0000000  0.5  0.5    0
colSums(A)
## [1] 1 1 1 1 1 1

Recalculating \(r_f\),

ri <- matrix(c(1/6,1/6,1/6,1/6,1/6,1/6),nrow = 6)
ri
##           [,1]
## [1,] 0.1666667
## [2,] 0.1666667
## [3,] 0.1666667
## [4,] 0.1666667
## [5,] 0.1666667
## [6,] 0.1666667
rf_1 <- A %*% ri
rf_1 
##            [,1]
## [1,] 0.08333333
## [2,] 0.16666667
## [3,] 0.11111111
## [4,] 0.27777778
## [5,] 0.16666667
## [6,] 0.19444444
rf <- (A %^% 6) %*% ri
rf
##            [,1]
## [1,] 0.01316372
## [2,] 0.02272305
## [3,] 0.01515704
## [4,] 0.41632730
## [5,] 0.21763903
## [6,] 0.31498985

Because the goal is to find a probability vector \(r\) such that \(A \times r = r\), the uniform \(r_i\) from the example serves as a suitable starting vector for the manipulated version of A. Let \(d\) be the decay factor, and introduce \(d\) to A to form B, the decayed version of A. Following Page and Brin, we will use \(d = 0.85\). Note that \(n\) is the order of the matrix A (the number of pages), here 6.

\(B = d \times A + \frac {1 - d}n\)

\(B = 0.85 \times A + \frac {0.15}{6}\)

d <- 0.85
A_decay <- d*A
B <- A_decay + (1-d)/nrow(A)
B
##       [,1]      [,2]      [,3]  [,4]  [,5]  [,6]
## [1,] 0.025 0.1666667 0.3083333 0.025 0.025 0.025
## [2,] 0.450 0.1666667 0.3083333 0.025 0.025 0.025
## [3,] 0.450 0.1666667 0.0250000 0.025 0.025 0.025
## [4,] 0.025 0.1666667 0.0250000 0.025 0.450 0.875
## [5,] 0.025 0.1666667 0.3083333 0.450 0.025 0.025
## [6,] 0.025 0.1666667 0.0250000 0.450 0.450 0.025

It is worth noting that column 2 of B is identical to column 2 of A after the manipulation. This is because each entry of that column is \(\frac 16\), and \(0.85 \times \frac 16 + \frac{0.15}{6} = \frac 16\).

Step 2:

For our uniform rank vector r, we will use the example \(r_i\). So, \(r = \begin{vmatrix} \frac 16 \\ \frac 16 \\ \frac 16 \\ \frac 16 \\ \frac 16 \\ \frac 16\end{vmatrix}\)

matpow <- function(x,y){
  out <- x %^% y
  return(out)
}

r <- matrix(c(1/6,1/6,1/6,1/6,1/6,1/6),nrow = 6)
r
##           [,1]
## [1,] 0.1666667
## [2,] 0.1666667
## [3,] 0.1666667
## [4,] 0.1666667
## [5,] 0.1666667
## [6,] 0.1666667
r0 <- matpow(B,1) %*% r
r1 <- matpow(B,10) %*% r
r2 <- matpow(B,20) %*% r
r3 <- matpow(B,30) %*% r
r4 <- matpow(B,40) %*% r
r5 <- matpow(B,50) %*% r
r6 <- matpow(B,60) %*% r


result <- matrix(cbind(r1,r2,r3,r4,r5,r6), nrow = 6)
result
##            [,1]       [,2]       [,3]       [,4]       [,5]       [,6]
## [1,] 0.05205661 0.05170616 0.05170475 0.05170475 0.05170475 0.05170475
## [2,] 0.07428990 0.07368173 0.07367927 0.07367926 0.07367926 0.07367926
## [3,] 0.05782138 0.05741406 0.05741242 0.05741241 0.05741241 0.05741241
## [4,] 0.34797267 0.34870083 0.34870367 0.34870369 0.34870369 0.34870369
## [5,] 0.19975859 0.19990313 0.19990381 0.19990381 0.19990381 0.19990381
## [6,] 0.26810085 0.26859408 0.26859607 0.26859608 0.26859608 0.26859608

The results converge at or around 30 iterations; at 40 iterations and beyond there is no further change. \(n = 40\) is therefore a sufficiently large exponent to achieve convergence of \(r = B^n \times r\).
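
Rather than checking fixed powers of B, convergence can also be confirmed with an explicit tolerance. The sketch below is my own illustration (the function name power_iterate, the tolerance, and the iteration cap are arbitrary choices), assuming B and r as defined above; it iterates \(r \leftarrow B r\) until the update stops changing.

power_iterate <- function(B, r, tol = 1e-10, max_iter = 1000) {
  for (i in 1:max_iter) {
    r_new <- B %*% r
    # Stop once the largest entry-wise change falls below the tolerance
    if (max(abs(r_new - r)) < tol) return(list(r = r_new, iterations = i))
    r <- r_new
  }
  list(r = r, iterations = max_iter)
}
# power_iterate(B, r)$iterations should land in the same neighborhood as the table above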

Step 3

Eigen Decomposition:

eigen(B)
## eigen() decomposition
## $values
## [1]  1.00000000+0i  0.57619235+0i -0.42500000+0i -0.42500000-0i -0.34991524+0i
## [6] -0.08461044+0i
## 
## $vectors
##              [,1]          [,2]                      [,3]
## [1,] 0.1044385+0i  0.2931457+0i  2.486934e-15+0.0000e+00i
## [2,] 0.1488249+0i  0.5093703+0i -8.528385e-16-6.9832e-23i
## [3,] 0.1159674+0i  0.3414619+0i -1.930646e-15-0.0000e+00i
## [4,] 0.7043472+0i -0.5890805+0i -7.071068e-01+0.0000e+00i
## [5,] 0.4037861+0i -0.1413606+0i  7.071068e-01+0.0000e+00i
## [6,] 0.5425377+0i -0.4135367+0i  0.000000e+00-1.7058e-08i
##                           [,4]           [,5]            [,6]
## [1,]  2.486934e-15-0.0000e+00i -0.06471710+0i -0.212296003+0i
## [2,] -8.528385e-16+6.9832e-23i  0.01388698+0i  0.854071294+0i
## [3,] -1.930646e-15+0.0000e+00i  0.07298180+0i -0.363638739+0i
## [4,] -7.071068e-01+0.0000e+00i -0.66058664+0i  0.018399984+0i
## [5,]  7.071068e-01-0.0000e+00i  0.73761812+0i -0.304719509+0i
## [6,]  0.000000e+00+1.7058e-08i -0.09918316+0i  0.008182973+0i
#B %*% r

The largest eigenvalue is in fact 1. The corresponding eigenvector does not immediately match the vector obtained from the power iteration method; it needs to be normalized first. Recall that the entries of a probability vector must sum to 1, so to normalize I will divide each element of the eigenvector by the sum of its entries.

evec <- as.numeric(eigen(B)$vectors[,which.max(eigen(B)$values)])
## Warning in which.max(eigen(B)$values): imaginary parts discarded in coercion
evec <- matrix(evec,nrow = 6)

evec
##           [,1]
## [1,] 0.1044385
## [2,] 0.1488249
## [3,] 0.1159674
## [4,] 0.7043472
## [5,] 0.4037861
## [6,] 0.5425377
factor <- colSums(evec)

norm_vec <- evec/factor

norm_vec
##            [,1]
## [1,] 0.05170475
## [2,] 0.07367926
## [3,] 0.05741241
## [4,] 0.34870369
## [5,] 0.19990381
## [6,] 0.26859608
#colSums(norm_vec)

Below, I will display the eigenvector and the \(40^{th}\) power iteration vector side by side.

matrix(cbind(r4,norm_vec),nrow = 6)
##            [,1]       [,2]
## [1,] 0.05170475 0.05170475
## [2,] 0.07367926 0.07367926
## [3,] 0.05741241 0.05741241
## [4,] 0.34870369 0.34870369
## [5,] 0.19990381 0.19990381
## [6,] 0.26859608 0.26859608

As shown in the final summary matrix, the eigenvector and the r-vector are equivalent when the eigenvector is normalized.
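
As one more check (a sketch using the objects already defined), a stationary vector should be left unchanged by B, so the difference below should be essentially zero:

# Largest entry-wise difference between B %*% r and r for the normalized eigenvector
max(abs(B %*% norm_vec - norm_vec))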

Step 4

Graph of Page Rank

The only package I could find that includes page.rank as a function is igraph. For this step, I will use the original A matrix, as it may not be necessary to impose the uniform second row manually.

library(igraph)
## Warning: package 'igraph' was built under R version 4.0.5
## 
## Attaching package: 'igraph'
## The following object is masked from 'package:EBImage':
## 
##     normalize
## The following object is masked from 'package:imager':
## 
##     spectrum
## The following object is masked from 'package:matrixcalc':
## 
##     %s%
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
## 
##     compose, simplify
## The following object is masked from 'package:tidyr':
## 
##     crossing
## The following object is masked from 'package:tibble':
## 
##     as_data_frame
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
pltt <- graph_from_adjacency_matrix(A_orig,weighted = TRUE)
plot(pltt)

pr <- matrix(page.rank(pltt)$vector,nrow = 6)
pr
##            [,1]
## [1,] 0.05170475
## [2,] 0.07367926
## [3,] 0.05741241
## [4,] 0.34870369
## [5,] 0.19990381
## [6,] 0.26859608
r4
##            [,1]
## [1,] 0.05170475
## [2,] 0.07367926
## [3,] 0.05741241
## [4,] 0.34870369
## [5,] 0.19990381
## [6,] 0.26859608
matrix(cbind(pr,r4), nrow = 6)
##            [,1]       [,2]
## [1,] 0.05170475 0.05170475
## [2,] 0.07367926 0.07367926
## [3,] 0.05741241 0.05741241
## [4,] 0.34870369 0.34870369
## [5,] 0.19990381 0.19990381
## [6,] 0.26859608 0.26859608

It is plain to see that the two vectors are essentially equivalent.

Question 2

train <- read_csv("train.csv")
## Rows: 42000 Columns: 785
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (785): label, pixel0, pixel1, pixel2, pixel3, pixel4, pixel5, pixel6, pi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
test <- read_csv("test.csv")
## Rows: 28000 Columns: 784
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (784): pixel0, pixel1, pixel2, pixel3, pixel4, pixel5, pixel6, pixel7, p...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ten <- train[1:10, ]
ten
labs <- ten[,1]
ten_data <- ten[,2:785]
max(ten_data)
## [1] 255
ten_scale <- ten_data/255
max(ten_scale)
## [1] 1
digit <- matrix(unlist(ten_scale[1,]),nrow = 28,byrow = TRUE)
digit
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7]       [,8]      [,9]      [,10]
##  [1,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.00000000
##  [2,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.00000000
##  [3,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.00000000
##  [4,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.00000000
##  [5,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.00000000
##  [6,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.00000000
##  [7,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.00000000
##  [8,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.00000000
##  [9,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.00000000
## [10,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.00000000
## [11,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.00000000
## [12,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.00000000
## [13,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.00000000
## [14,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.00000000
## [15,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.00000000
## [16,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.00000000
## [17,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.00000000
## [18,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.08627451
## [19,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.40392157
## [20,]    0    0    0    0    0    0    0 0.00000000 0.3490196 0.94117647
## [21,]    0    0    0    0    0    0    0 0.05882353 0.8627451 0.99215686
## [22,]    0    0    0    0    0    0    0 0.36862745 0.9921569 0.99215686
## [23,]    0    0    0    0    0    0    0 0.34901961 0.9843137 0.99215686
## [24,]    0    0    0    0    0    0    0 0.00000000 0.8392157 0.85490196
## [25,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.00000000
## [26,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.00000000
## [27,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.00000000
## [28,]    0    0    0    0    0    0    0 0.00000000 0.0000000 0.00000000
##            [,11]      [,12]      [,13]      [,14]     [,15]      [,16]
##  [1,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
##  [2,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
##  [3,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
##  [4,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
##  [5,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
##  [6,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
##  [7,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
##  [8,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
##  [9,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
## [10,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.21176471
## [11,] 0.00000000 0.00000000 0.00000000 0.00000000 0.3647059 0.99607843
## [12,] 0.00000000 0.00000000 0.00000000 0.09019608 0.8235294 0.99607843
## [13,] 0.00000000 0.00000000 0.06274510 0.81960784 0.9921569 0.99607843
## [14,] 0.00000000 0.00000000 0.10588235 0.99215686 0.9921569 0.99607843
## [15,] 0.00000000 0.07843137 0.80784314 0.99607843 0.9960784 0.77647059
## [16,] 0.00000000 0.65882353 0.99215686 0.99215686 0.7686275 0.02745098
## [17,] 0.07843137 0.79607843 0.99215686 0.97254902 0.2980392 0.00000000
## [18,] 0.73725490 0.99215686 0.96078431 0.36470588 0.0000000 0.00000000
## [19,] 0.99215686 0.99215686 0.74901961 0.00000000 0.0000000 0.00000000
## [20,] 0.99215686 0.76470588 0.09803922 0.00000000 0.0000000 0.00000000
## [21,] 0.99215686 0.31372549 0.00000000 0.00000000 0.0000000 0.00000000
## [22,] 0.99215686 0.36862745 0.00000000 0.00000000 0.0000000 0.00000000
## [23,] 0.98039216 0.51372549 0.00000000 0.00000000 0.0000000 0.00000000
## [24,] 0.37254902 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
## [25,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
## [26,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
## [27,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
## [28,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
##            [,17]     [,18]     [,19]      [,20]     [,21]      [,22]      [,23]
##  [1,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
##  [2,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
##  [3,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
##  [4,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
##  [5,] 0.00000000 0.0000000 0.0000000 0.00000000 0.7372549 1.00000000 0.36862745
##  [6,] 0.00000000 0.0000000 0.0000000 0.74901961 0.9803922 0.99215686 0.36470588
##  [7,] 0.00000000 0.0000000 0.4823529 0.97254902 0.9921569 0.65490196 0.03921569
##  [8,] 0.00000000 0.3137255 0.9686275 0.99215686 0.8156863 0.05098039 0.00000000
##  [9,] 0.11372549 0.8117647 0.9921569 0.92156863 0.3019608 0.00000000 0.00000000
## [10,] 0.81960784 0.9921569 0.9921569 0.34509804 0.0000000 0.00000000 0.00000000
## [11,] 0.99215686 0.9333333 0.6666667 0.06666667 0.0000000 0.00000000 0.00000000
## [12,] 0.99215686 0.6235294 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [13,] 0.94117647 0.3176471 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [14,] 0.05098039 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [15,] 0.02745098 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [16,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [17,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [18,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [19,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [20,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [21,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [22,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [23,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [24,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [25,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [26,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [27,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [28,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
##       [,24] [,25] [,26] [,27] [,28]
##  [1,]     0     0     0     0     0
##  [2,]     0     0     0     0     0
##  [3,]     0     0     0     0     0
##  [4,]     0     0     0     0     0
##  [5,]     0     0     0     0     0
##  [6,]     0     0     0     0     0
##  [7,]     0     0     0     0     0
##  [8,]     0     0     0     0     0
##  [9,]     0     0     0     0     0
## [10,]     0     0     0     0     0
## [11,]     0     0     0     0     0
## [12,]     0     0     0     0     0
## [13,]     0     0     0     0     0
## [14,]     0     0     0     0     0
## [15,]     0     0     0     0     0
## [16,]     0     0     0     0     0
## [17,]     0     0     0     0     0
## [18,]     0     0     0     0     0
## [19,]     0     0     0     0     0
## [20,]     0     0     0     0     0
## [21,]     0     0     0     0     0
## [22,]     0     0     0     0     0
## [23,]     0     0     0     0     0
## [24,]     0     0     0     0     0
## [25,]     0     0     0     0     0
## [26,]     0     0     0     0     0
## [27,]     0     0     0     0     0
## [28,]     0     0     0     0     0
image(digit)

This figure needs to be rotated in order to properly visualize the digit, because image() does not draw a matrix in its printed orientation. This could have been avoided by manually supplying the appropriate x and y coordinates for each z value. Each digit matrix needs to be rotated roughly 90 degrees clockwise before plotting.

for (i in 1:10){
  digit <- rotateImage(matrix(unlist(train[i, -1]), nrow = 28, byrow = TRUE),270)
  image(digit, useRaster = TRUE, axes = FALSE)
}
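
As an alternative to rotateImage, the rotation can be done in base R by reversing the row order and then transposing. This is a minimal sketch of my own, assuming the same 28 x 28 layout as above (rotate_cw is a helper name I introduce here):

# Rotate a matrix 90 degrees clockwise: reverse the row order, then transpose
rotate_cw <- function(m) t(apply(m, 2, rev))

digit1 <- matrix(unlist(train[1, -1]), nrow = 28, byrow = TRUE)
image(rotate_cw(digit1), useRaster = TRUE, axes = FALSE)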

Step 4: Frequency Distribution

train <- data_frame(train)
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
ggplot(train, aes(x=label))+
  geom_histogram(aes(y=..count..), color = 'yellow', binwidth = 1)+
  #geom_density(alpha=.2,fill = "red")+
  ggtitle("Frequency Histogram of Digit Distribution")+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab('Digit')+
  scale_x_continuous(breaks = seq(min(train$label),max(train$label), by = 1))+
  ylab('Count')

table(train$label)
## 
##    0    1    2    3    4    5    6    7    8    9 
## 4132 4684 4177 4351 4072 3795 4137 4401 4063 4188

The distribution is close to uniform, with 1 the most common digit and 5 the least common. The difference between their counts is about 900 occurrences (4684 vs. 3795).

Step 5: Mean Pixel Intensity

For the mean pixel intensity, I will take the mean of the pixel values for each image in the training data, using the raw 0-255 values rather than the rescaled ones. I will calculate the mean pixel intensity for each image and then take the mean of those means.

train_data <- train[,-1]

train_data$mean <- rowMeans(train_data)

ggplot(train_data, aes(x=mean))+
  geom_histogram(aes(y=..density..), color = 'red', binwidth = 1)+
  geom_density(alpha=.2,fill = "orange")+
  ggtitle(" Histogram of Pixel Density")+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab('Pixel Intensity')+
  ylab('Density')

mean(train_data$mean)
## [1] 33.40891

The mean pixel intensity is roughly normally distributed. The average mean intensity for an image is 33.41 on the 0-255 scale. The higher an image's mean intensity, the more of the 28 x 28 frame the digit physically fills.
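
To tie the intensity back to the digits themselves, a quick sketch that averages the per-image means within each digit class (using the mean column added above):

# Average per-image mean intensity, grouped by digit label
tapply(train_data$mean, train$label, mean)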

Step 6:

Principal Components:

trained <- train[,-1]
scaled <- scale(trained,center =TRUE,scale =TRUE)
train_mean <- attr(scaled,"scaled:center")
train_sd <- attr(scaled,"scaled:scale")

pcomp <- princomp(trained)
std_pcomp <- pcomp$sdev
var_pcomp <- std_pcomp^2

cvar <- cumsum(std_pcomp^2)/sum(std_pcomp^2)

cvar_95 <- which.max(cvar >= 0.95)

#cvar_95

cvar1 <- which.max(cvar >= 1)

cvar1
## Comp.708 
##      708

154 components are needed to cover 95% of the variance, and 708 components are needed to account for essentially all of it.
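
A scree-style view makes the cutoff easier to see; this sketch plots the cumulative proportion of variance computed above and marks the 95% threshold:

# Cumulative proportion of variance explained, with the 95% cutoff marked
plot(cvar, type = "l", xlab = "Number of components",
     ylab = "Cumulative proportion of variance")
abline(h = 0.95, lty = 2)
abline(v = cvar_95, lty = 2)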

Step 7: Visualization

Again, this will need to be rotated.

for (i in 1:10){
  digit <- rotateImage(matrix(pcomp$loadings[,i], nrow = 28, byrow = TRUE),270)
  image(digit, useRaster = TRUE, axes = FALSE)
}

There is a lot of noise in each image because each principal component blends features from every digit class rather than describing any single digit. The noise can be reduced by fitting the components to a single digit class, which I do next for the eights.
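
One way to see what the leading components actually capture is to rebuild an image from only the first k of them. The sketch below is my own illustration (k = 50 is an arbitrary choice), using the scores, loadings, and column means stored in pcomp:

k <- 50
# Reconstruct the first training image from the leading k principal components
recon <- pcomp$scores[1, 1:k] %*% t(pcomp$loadings[, 1:k]) + pcomp$center
digit_recon <- rotateImage(matrix(recon, nrow = 28, byrow = TRUE), 270)
image(digit_recon, useRaster = TRUE, axes = FALSE)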

eights <- which(train$label == 8)
#eights

eights_data <- train[c(eights),]
eights_drop <- eights_data[,-1]

pcomp_8 <- princomp(eights_drop)
std_8 <- pcomp_8$sdev
var_8 <- std_8^2

cvar <- cumsum(std_8^2)/sum(std_8^2)

cvar_100 <- which.max(cvar >= 1)
cvar_100
## Comp.540 
##      540
for (i in 1:10){
  digit <- rotateImage(matrix(pcomp_8$loadings[,i], nrow = 28, byrow = TRUE),270)
  image(digit, useRaster = TRUE, axes = FALSE)
}

These images, especially the leading components, look very much like 8s. Because we are fitting only to eights, as opposed to all possible digits, we see a much clearer image of the desired digit.

Step 9:

The weights kept producing errors: multinom complained that there were too many weights, so I needed to add the MaxNWts parameter to allow the fit to run. Note that the model requires a great deal of computational power and time. I am using all of the pixel columns for the model; the computation time could be reduced by fitting on the leading principal components instead, for example the 154 components that cover 95% of the variance, as sketched below.
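
As a sketch of that idea (separate from the model actually fit below, and with 154 chosen only because it covered 95% of the variance earlier), the multinomial model could be trained on principal-component scores instead of raw pixels:

# Hypothetical reduced fit on principal-component scores (illustrative only)
pc_data <- data.frame(label = as.factor(train$label), pcomp$scores[, 1:154])
# Roughly (154 + 1) weights per class x 10 classes, well under MaxNWts = 2000
fit_pc <- multinom(label ~ ., data = pc_data, MaxNWts = 2000)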

training <- as.factor(train$label)
labels <- as.factor(train$label)

fit <- multinom(labels ~., data = train_data, MaxNWts = 10000)
## # weights:  7870 (7074 variable)
## initial  value 96708.573906 
## iter  10 value 27932.928742
## iter  20 value 22897.023793
## iter  30 value 21671.757612
## iter  40 value 21248.433597
## iter  50 value 21070.878338
## iter  60 value 20985.749008
## iter  70 value 20904.370367
## iter  80 value 20840.141748
## iter  90 value 20767.634400
## iter 100 value 20700.473515
## final  value 20700.473515 
## stopped after 100 iterations

Prediction:

prediction <- predict(fit,train_data)
confused <- confusionMatrix(prediction,labels)
confused
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0 3961    0   11    5   10   24   45    6   18   13
##          1   25 4606  153   88   88   83   57  111  384   84
##          2    7   13 3644   88   36   16   64   39   26    9
##          3    8   15   77 3786   13   92    3    9   67   56
##          4    9    3   25    3 3448   12   18   15   20   63
##          5   24   14   23  176   46 3286   92   12  168   52
##          6   10    1   13    5   28   42 3768    3   14    0
##          7   36   11  106   81   26   47   18 4047   41  190
##          8   38   20  104   88   52  139   42    7 3257   23
##          9   14    1   21   31  325   54   30  152   68 3698
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8929          
##                  95% CI : (0.8899, 0.8958)
##     No Information Rate : 0.1115          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8809          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.95862   0.9833  0.87240  0.87014  0.84676  0.86588
## Specificity           0.99651   0.9712  0.99212  0.99097  0.99557  0.98411
## Pos Pred Value        0.96775   0.8111  0.92440  0.91760  0.95354  0.84408
## Neg Pred Value        0.99549   0.9979  0.98600  0.98508  0.98374  0.98664
## Prevalence            0.09838   0.1115  0.09945  0.10360  0.09695  0.09036
## Detection Rate        0.09431   0.1097  0.08676  0.09014  0.08210  0.07824
## Detection Prevalence  0.09745   0.1352  0.09386  0.09824  0.08610  0.09269
## Balanced Accuracy     0.97756   0.9773  0.93226  0.93056  0.92116  0.92499
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.91080  0.91956  0.80162  0.88300
## Specificity           0.99694  0.98521  0.98648  0.98159
## Pos Pred Value        0.97013  0.87921  0.86393  0.84160
## Neg Pred Value        0.99032  0.99053  0.97892  0.98697
## Prevalence            0.09850  0.10479  0.09674  0.09971
## Detection Rate        0.08971  0.09636  0.07755  0.08805
## Detection Prevalence  0.09248  0.10960  0.08976  0.10462
## Balanced Accuracy     0.95387  0.95239  0.89405  0.93230

The model’s accuracy was 89.29%. All things considered, this is pretty good, but it may lead to many misdelivered envelopes for the USPS. Come to think of it, when I lived at 248, we received a lot of letters for 246. I won’t provide the street number, but that was also frequently misinterpreted.

Question 3

train_house <- read_csv("train_house.csv")
## Rows: 1460 Columns: 81
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (38): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
test_house <- read_csv("test_house.csv")
## Rows: 1459 Columns: 80
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (37): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Descriptive and Inferential Statistics

ggplot(train_house, aes(x=LotArea, y = SalePrice))+
  geom_point(aes(color = "orange"))+
  ggtitle('Sale Price vs. Lot Area')+
  theme(plot.title = element_text(hjust = 0.5),legend.position = "none")+
  xlab('Lot Area')+
  ylab('Sale Price')

ggplot(train_house, aes(x=GrLivArea, y = SalePrice))+
  geom_point(aes(color = "blue"))+
  ggtitle('Sale Price vs. Living Area')+
  theme(plot.title = element_text(hjust = 0.5),legend.position = "none")+
  xlab('Above Ground Living Area')+
  ylab('Sale Price')

ggplot(train_house, aes(x=SalePrice))+
  geom_histogram(aes(y=..density..), color = 'red', fill = 'green', binwidth = 25000)+
  geom_density(alpha=.1, fill = 'pink')+
  ggtitle('Distribution of Sale Price')+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab('Sale Price')+
  ylab('Density')

The distribution of sale price shows a heavy right skew. The Above Ground Living Area of a home heavily influences the sale price.

Next, we will visualize all of the relationships. Originally, I was going to use all numeric variables, but there are just too many columns available in the training set. I will limit the selection to 10 or fewer variables.

Columns 44 and 45 (originally 1stFlrSF and 2ndFlrSF) cannot currently be called by name because their names begin with digits, so I will rename them.

names(train_house)[44] <- "FirstFlrSF"
names(train_house)[45] <- "SecondFlrSF"

train_house["CentralAir"][train_house["CentralAir"] == "Y"] <- '1'
train_house["CentralAir"][train_house["CentralAir"] == "N"] <- '-1'

train_house$CentralAir <- as.numeric(train_house$CentralAir)

train_house %>%
  dplyr::select(c(LotArea,BsmtFinSF1,TotalBsmtSF,FirstFlrSF,SecondFlrSF,GrLivArea,GarageArea,WoodDeckSF,SalePrice)) %>%
  ggpairs()

Sale Price is most heavily correlated with GrLivArea, GarageArea, TotalBsmtSF, and FirstFlrSF. Interestingly, LotArea and Sale Price are not heavily correlated. If Lot Area were limited to values below 50,000 square feet, we would see a somewhat more linear relationship.

train_filter <- train_house[train_house$LotArea < 50000, ]

train_filter %>%
  dplyr::select(c(LotArea,BsmtFinSF1,TotalBsmtSF,FirstFlrSF,SecondFlrSF,GrLivArea,GarageArea,WoodDeckSF,SalePrice)) %>%
  ggpairs()

ggplot(train_filter, aes(x=LotArea, y = SalePrice))+
  geom_point(aes(color = "orange"))+
  geom_smooth(method = "lm")+
  ggtitle('Sale Price vs. Lot Area')+
  theme(plot.title = element_text(hjust = 0.5),legend.position = "none")+
  xlab('Lot Area')+
  ylab('Sale Price')
## `geom_smooth()` using formula 'y ~ x'

After filtering the data, it became more apparent that the spread in the LotArea vs. SalePrice relationship is too great for outlier removal alone to fix; the fit for LotArea vs. SalePrice is only slightly better.

Three Variables on One Scatterplot

ggplot(train_filter, aes(x=LotArea, y = SalePrice,color = HouseStyle))+
  geom_point()+
  geom_smooth(method = "lm")+
  ggtitle('Sale Price vs. Lot Area, Colored By House Style')+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab('Lot Area')+
  ylab('Sale Price')
## `geom_smooth()` using formula 'y ~ x'

ggplot(train_filter, aes(x=GrLivArea, y = SalePrice,color = YearBuilt))+
  geom_point()+
  geom_smooth(method = "lm")+
  ggtitle('Sale Price vs. Above Ground Living Area, Colored By Year Built')+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab('Above Ground Living Area')+
  ylab('Sale Price')
## `geom_smooth()` using formula 'y ~ x'

Correlation Matrix

Because Sale Price is the best representation of home value, I will keep that in the correlation matrix as one of the variables.

selection <- train_house %>%
  dplyr::select(c(LotArea,GrLivArea,SalePrice))
corr_mat <- cor(selection)
corr_mat
##             LotArea GrLivArea SalePrice
## LotArea   1.0000000 0.2631162 0.2638434
## GrLivArea 0.2631162 1.0000000 0.7086245
## SalePrice 0.2638434 0.7086245 1.0000000
cor.test(train_house$LotArea,train_house$GrLivArea,conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  train_house$LotArea and train_house$GrLivArea
## t = 10.414, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2315997 0.2940809
## sample estimates:
##       cor 
## 0.2631162
cor.test(train_house$LotArea,train_house$SalePrice,conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  train_house$LotArea and train_house$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2323391 0.2947946
## sample estimates:
##       cor 
## 0.2638434
cor.test(train_house$SalePrice,train_house$GrLivArea,conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  train_house$SalePrice and train_house$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6915087 0.7249450
## sample estimates:
##       cor 
## 0.7086245
train_house %>%
  dplyr::select(c(LotArea,GrLivArea,SalePrice)) %>%
  ggpairs()

alpha <- 0.2
c <- 3
fwe <- 1 - (1 - alpha)^c
fwe
## [1] 0.488
fwe_orig <- 1 - (1 - alpha)^9
fwe_orig
## [1] 0.8657823

Interestingly, the only strong correlation is between Sale Price and Above Ground Living Area. As the number of comparisons increases, the risk of familywise error also increases. For only three comparisons, familywise error is still a concern, but it is not as extreme: there is a 48.8% chance of at least one Type I error (rejecting a null hypothesis that is actually true). It is, however, a greater concern for the earlier set of correlation plots, where the number of comparisons was much higher; with nine comparisons, the familywise Type I error risk was 86.6%.
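
A common mitigation, not applied above but sketched here with the alpha and c values already defined, is a Bonferroni-style correction that tests each pair at \(\alpha / c\) so the familywise rate stays near the nominal level:

alpha_adj <- alpha / c
fwe_adj <- 1 - (1 - alpha_adj)^c
fwe_adj  # roughly 0.19, close to the nominal 0.2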

Linear Algebra and Correlation:

Precision Matrix:

precision_mat <- solve(corr_mat)

corprec <- corr_mat %*% precision_mat
preccor <- precision_mat %*% corr_mat
corprec
##                 LotArea     GrLivArea    SalePrice
## LotArea    1.000000e+00  0.000000e+00 0.000000e+00
## GrLivArea  1.387779e-17  1.000000e+00 2.220446e-16
## SalePrice -2.775558e-17 -2.220446e-16 1.000000e+00
preccor
##                 LotArea   GrLivArea    SalePrice
## LotArea    1.000000e+00 1.94289e-16 1.387779e-16
## GrLivArea -1.110223e-16 1.00000e+00 0.000000e+00
## SalePrice -1.110223e-16 0.00000e+00 1.000000e+00
LU <- lu.decomposition(corr_mat)

LU
## $L
##           [,1]      [,2] [,3]
## [1,] 1.0000000 0.0000000    0
## [2,] 0.2631162 1.0000000    0
## [3,] 0.2638434 0.6867466    1
## 
## $U
##      [,1]      [,2]      [,3]
## [1,]    1 0.2631162 0.2638434
## [2,]    0 0.9307699 0.6392030
## [3,]    0 0.0000000 0.4914162
LU decomposition of the products of the correlation and precision matrices, in both orders:
lu.decomposition(corprec)
## $L
##               [,1]          [,2] [,3]
## [1,]  1.000000e+00  0.000000e+00    0
## [2,]  1.387779e-17  1.000000e+00    0
## [3,] -2.775558e-17 -2.220446e-16    1
## 
## $U
##      [,1] [,2]         [,3]
## [1,]    1    0 0.000000e+00
## [2,]    0    1 2.220446e-16
## [3,]    0    0 1.000000e+00
lu.decomposition(preccor)
## $L
##               [,1]         [,2] [,3]
## [1,]  1.000000e+00 0.000000e+00    0
## [2,] -1.110223e-16 1.000000e+00    0
## [3,] -1.110223e-16 2.157042e-32    1
## 
## $U
##      [,1]        [,2]         [,3]
## [1,]    1 1.94289e-16 1.387779e-16
## [2,]    0 1.00000e+00 1.540744e-32
## [3,]    0 0.00000e+00 1.000000e+00
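
As a quick sanity check (a sketch using the objects above), multiplying the LU factors back together should recover the correlation matrix, and both products of the correlation and precision matrices should be numerically the identity:

all.equal(LU$L %*% LU$U, corr_mat, check.attributes = FALSE)
all.equal(corprec, diag(3), check.attributes = FALSE)
all.equal(preccor, diag(3), check.attributes = FALSE)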

Calculus Based Probability and Statistics

Conveniently for me, SalePrice is skewed to the right. I will try a log transformation to shift the distribution.

p1 <- ggplot(train_house, aes(x=SalePrice))+
  geom_histogram(aes(y=..density..), color = 'green', fill = 'blue', binwidth = 25000)+
  geom_density(alpha=.1, fill = 'pink')+
  ggtitle('Distribution of Sale Price')+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab('Sale Price')+
  ylab('Density')

pricelog <- log(train_house$SalePrice)

p3 <- ggplot(train_house, aes(x=log(SalePrice)))+
  geom_histogram(aes(y=..density..), color = 'green', fill = 'blue', binwidth = 0.5)+
  geom_density(alpha=.1, fill = 'pink')+
  ggtitle('Distribution of Sale Price')+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab('Sale Price')+
  ylab('Density')

The log transformation shifted the distribution, and it is much more centered now. For an exponential distribution, the fitted rate \(\lambda\) is the reciprocal of the mean, so the expected value is \(\frac 1\lambda\). I will fit an exponential to both the log transformed values and the raw values.

exp_dens <- fitdistr(train_house$SalePrice, 'exponential')
exp_log <- fitdistr(pricelog, 'exponential')
exp_dens
##        rate    
##   5.527268e-06 
##  (1.446552e-07)
exp_log
##       rate    
##   0.083166647 
##  (0.002176571)
lam <- exp_dens$estimate
lamlog <- exp_log$estimate

lam
##         rate 
## 5.527268e-06
lamlog
##       rate 
## 0.08316665
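
Because the exponential maximum-likelihood estimate of the rate is the reciprocal of the sample mean, a quick check (a sketch) is that \(1/\lambda\) matches the average sale price:

# The reciprocal of the fitted rate should equal the sample mean sale price
1/lam
mean(train_house$SalePrice)
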
samp <- rexp(1000,lam)
samplog <- rexp(1000,lamlog)
summary(samp)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     236.3   55259.2  121740.3  178049.4  244363.8 1689144.1
summary(samplog)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00195  3.18307  7.47977 11.20312 16.10674 83.41697

The log transformed values make it easier to comprehend the range of values.

sampdf <- data_frame(samp)
samplogdf <- data_frame(samplog)

p2 <- ggplot(sampdf, aes(x=samp))+
  geom_histogram(aes(y=..density..), color = 'green', fill = 'blue', binwidth = 25000)+
  geom_density(alpha=.1, fill = 'pink')+
  ggtitle('Sample Distribution of Sale Price')+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab('Sale Price')+
  ylab('Density')

ggarrange(p1,p2)

p4 <- ggplot(samplogdf, aes(x=samplog))+
  geom_histogram(aes(y=..density..), color = 'green', fill = 'blue', binwidth = 5)+
  geom_density(alpha=.1, fill = 'pink')+
  ggtitle('Sample Distribution of Sale Price')+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab('Sale Price')+
  ylab('Density')

ggarrange(p3,p4)

Both samples ended up being aggressively right skewed, more than either of the original variables.

samp_5 <- qexp(0.05,lam)
samp_95 <- qexp(.95,lam)

log_5 <- qexp(.05,lamlog)
log_95 <- qexp(.95,lamlog)

table(samp_5,samp_95)
##                   samp_95
## samp_5             541991.465498888
##   9280.04416175455                1
table(log_5,log_95)
##                    log_95
## log_5               36.0208373433093
##   0.616753182601496                1

So, the \(5^{th}\) and \(95^{th}\) percentiles of the fitted exponential distributions are 9,280 and 541,991 for the original data, and 0.617 and 36.02 for the log transformed data.
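
For reference, these theoretical percentiles come from the closed-form exponential quantile function, which is what qexp evaluates: \(q_p = \frac{-\ln(1-p)}{\lambda}\). Plugging the full-precision fitted rates into this formula reproduces the values above.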

Empirical Data Percentiles and Confidence Interval
interval <- t.test(train_house$SalePrice,conf.level=0.95)
interval05 <- t.test(train_house$SalePrice,conf.level=.05)

quantile(train_house$SalePrice, c(0.05,0.95))
##     5%    95% 
##  88000 326100
quantile(pricelog, c(0.05,.95))
##       5%      95% 
## 11.38509 12.69496

The empirical percentiles are far more reasonable than the theoretical percentiles. Forcing an exponential distribution onto data that does not actually follow one is not an ideal approach.

Confidence Intervals:

table(interval05$conf.int[1],interval05$conf.int[2])
##                  
##                   181051.592315692
##   180790.79946513                1
table(interval$conf.int[1],interval$conf.int[2])
##                   
##                    184999.550739737
##   176842.841041085                1
mean(train_house$SalePrice)
## [1] 180921.2

The 5% confidence interval for the population mean sale price runs from 180,791 to 181,052, and the 95% interval runs from 176,843 to 185,000. The sample mean Sale Price is 180,921, which sits at the center of both intervals.
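
For reference, the 95% interval above is simply the t-interval around the sample mean; a sketch of that computation by hand:

n <- nrow(train_house)
xbar <- mean(train_house$SalePrice)
se <- sd(train_house$SalePrice)/sqrt(n)
# 95% t-interval for the population mean sale price
xbar + c(-1, 1)*qt(0.975, df = n - 1)*se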

Modeling

Next, we will build a multiple regression model and submit it to the Kaggle Competition.

train_house %>%
  dplyr::select(c(LotArea,BsmtFinSF1,YearBuilt,TotalBsmtSF,FirstFlrSF,SecondFlrSF,GrLivArea,GarageArea,WoodDeckSF,SalePrice)) %>%
  ggpairs()

## 
## Call:
## lm(formula = SalePrice ~ LotArea + BsmtFinSF1 + YearBuilt + TotalBsmtSF + 
##     FirstFlrSF + SecondFlrSF + GrLivArea + GarageArea + WoodDeckSF, 
##     data = train_house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -629975  -18224   -3047   14203  270296 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.323e+06  8.574e+04 -15.429  < 2e-16 ***
## LotArea      3.224e-01  1.180e-01   2.732 0.006369 ** 
## BsmtFinSF1   1.268e+01  2.865e+00   4.425 1.04e-05 ***
## YearBuilt    6.721e+02  4.416e+01  15.219  < 2e-16 ***
## TotalBsmtSF  2.879e+01  4.766e+00   6.040 1.95e-09 ***
## FirstFlrSF   3.368e+01  2.378e+01   1.416 0.156926    
## SecondFlrSF  3.384e+01  2.344e+01   1.443 0.149109    
## GrLivArea    3.796e+01  2.308e+01   1.645 0.100253    
## GarageArea   5.898e+01  6.735e+00   8.757  < 2e-16 ***
## WoodDeckSF   3.469e+01  9.325e+00   3.720 0.000207 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41920 on 1450 degrees of freedom
## Multiple R-squared:  0.7233, Adjusted R-squared:  0.7215 
## F-statistic:   421 on 9 and 1450 DF,  p-value: < 2.2e-16

I will remove predictors with p-values higher than 0.10, one at a time, checking the remaining p-values after each removal to make sure that I don't discard valuable variables.
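
The refits themselves are not echoed here; the sketch below shows how one elimination pass could look with update(). The name model_full is my own placeholder for the nine-variable fit summarized above:

model_full <- lm(SalePrice ~ LotArea + BsmtFinSF1 + YearBuilt + TotalBsmtSF +
                   FirstFlrSF + SecondFlrSF + GrLivArea + GarageArea + WoodDeckSF,
                 data = train_house)
# Drop the weakest term, re-check the summary, then drop the next
model_step <- update(model_full, . ~ . - FirstFlrSF)
summary(update(model_step, . ~ . - SecondFlrSF))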

## 
## Call:
## lm(formula = SalePrice ~ LotArea + BsmtFinSF1 + YearBuilt + TotalBsmtSF + 
##     GrLivArea + GarageArea + WoodDeckSF, data = train_house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -628662  -18089   -3256   14460  271795 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.343e+06  8.385e+04 -16.013  < 2e-16 ***
## LotArea      3.269e-01  1.176e-01   2.780 0.005512 ** 
## BsmtFinSF1   1.281e+01  2.861e+00   4.476  8.2e-06 ***
## YearBuilt    6.823e+02  4.325e+01  15.774  < 2e-16 ***
## TotalBsmtSF  2.869e+01  3.439e+00   8.343  < 2e-16 ***
## GrLivArea    7.108e+01  2.542e+00  27.962  < 2e-16 ***
## GarageArea   5.944e+01  6.692e+00   8.882  < 2e-16 ***
## WoodDeckSF   3.484e+01  9.324e+00   3.736 0.000194 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41920 on 1452 degrees of freedom
## Multiple R-squared:  0.7229, Adjusted R-squared:  0.7215 
## F-statistic:   541 on 7 and 1452 DF,  p-value: < 2.2e-16

With an Adjusted \(R^2\) of 0.7215, this model should be pretty effective at explaining the variability in the dependent variable, sale price.

I will also perform the same regression, but I will use the logarithm of the Sale Price.

As in the untransformed model, FirstFlrSF and SecondFlrSF can be removed to begin; the reduced log-price model is summarized below.

## 
## Call:
## lm(formula = logprice ~ LotArea + BsmtFinSF1 + YearBuilt + TotalBsmtSF + 
##     GrLivArea + GarageArea + WoodDeckSF, data = train_log)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.91533 -0.07240  0.00738  0.09147  0.67537 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.515e+00  3.915e-01   6.424 1.79e-10 ***
## LotArea     1.798e-06  5.492e-07   3.274  0.00109 ** 
## BsmtFinSF1  4.191e-05  1.336e-05   3.137  0.00174 ** 
## YearBuilt   4.384e-03  2.020e-04  21.704  < 2e-16 ***
## TotalBsmtSF 1.282e-04  1.606e-05   7.983 2.88e-15 ***
## GrLivArea   3.437e-04  1.187e-05  28.954  < 2e-16 ***
## GarageArea  3.292e-04  3.125e-05  10.535  < 2e-16 ***
## WoodDeckSF  1.848e-04  4.354e-05   4.244 2.34e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1958 on 1452 degrees of freedom
## Multiple R-squared:  0.761,  Adjusted R-squared:  0.7598 
## F-statistic: 660.3 on 7 and 1452 DF,  p-value: < 2.2e-16

With an adjusted \(R^2\) of 0.7598, this model may be slightly better than the non-logarithmic model.

res_sct <- ggplot(data = model_adjust, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = 'red') +
  xlab("Fitted values") +
  ylab("Residuals") +
  ggtitle("Linearity of Residuals")+
  theme(plot.title = element_text(hjust = 0.5))

res_sct_log <- ggplot(data = model_adjust_log, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = 'red') +
  xlab("Fitted values") +
  ylab("Residuals") +
  ggtitle("Linearity of Residuals")+
  theme(plot.title = element_text(hjust = 0.5))

res_hst <- ggplot(data = model_adjust, aes(x = .resid)) +
  geom_histogram(binwidth = 25000) +
  xlab("Residuals") +
  ggtitle("Histogram of Residuals")+
  theme(plot.title = element_text(hjust = 0.5))

res_hst_log <- ggplot(data = model_adjust_log, aes(x = .resid)) +
  geom_histogram(binwidth = 0.25) +
  xlab("Residuals") +
  ggtitle("Histogram of Residuals")+
  theme(plot.title = element_text(hjust = 0.5))

res_npp <- ggplot(data = model_adjust, aes(sample = .resid)) +
  stat_qq()+
  ggtitle("Normal Probability Plot of Residuals")+
  theme(plot.title = element_text(hjust = 0.5))

res_npp_log <- ggplot(data = model_adjust_log, aes(sample = .resid)) +
  stat_qq()+
  ggtitle("Normal Probability Plot of Residuals")+
  theme(plot.title = element_text(hjust = 0.5))

lin_analysis <- ggarrange(p1, res_sct, res_hst, res_npp,
                      ncol = 2, nrow = 2)

lin_analysis_log <- ggarrange(p3, res_sct_log, res_hst_log, res_npp_log,
                      ncol = 2, nrow = 2)

lin_analysis

lin_analysis_log

Both models look to be decent fits for the data, with the edge going to the logarithmic model.

names(test_house)[44] <- "FirstFlrSF"
names(test_house)[45] <- "SecondFlrSF"

test_house["CentralAir"][test_house["CentralAir"] == "Y"] <- '1'
test_house["CentralAir"][test_house["CentralAir"] == "N"] <- '-1'

test_house$CentralAir <- as.numeric(test_house$CentralAir)

test_house_test <- test_house %>% dplyr::select(c(Id,LotArea,BsmtFinSF1,OverallQual,OverallCond,YearBuilt,MasVnrArea,TotalBsmtSF,FirstFlrSF,SecondFlrSF,GrLivArea,CentralAir,YrSold,GarageArea,WoodDeckSF,FullBath))

which(is.na(test_house_test))
##  [1]  3579  8986  9001  9177  9287  9299  9336  9606  9620  9635  9644  9663
## [13]  9887  9952  9981 10157 10874 20084
test_house_test <- test_house_test %>%
  replace(is.na(.),0)

which(is.na(test_house_test))
## integer(0)
house_predict <- predict(model_adjust,test_house_test,type = "response")
head(house_predict)
##        1        2        3        4        5        6 
## 142393.3 174609.2 213005.0 212616.0 179278.5 191596.1
house_log_predict <- predict(model_adjust_log,test_house_test,type = "response")
head(house_log_predict)
##        1        2        3        4        5        6 
## 11.83972 11.96564 12.20453 12.20861 12.03851 12.11075
house_log_predict_dollar <- exp(house_log_predict)
head(house_log_predict_dollar)
##        1        2        3        4        5        6 
## 138651.2 157258.1 199691.4 200508.0 169145.3 181815.8
house_kag <- data.frame(Id = test_house_test$Id, SalePrice = house_predict)

house_kag_log <- data.frame(Id = test_house_test$Id, SalePrice = house_log_predict_dollar)
house_kag_log <- drop_na(house_kag_log)
write.csv(house_kag,file = "house_kag.csv", row.names = FALSE)
write.csv(house_kag_log,file = "house_kag_log.csv", row.names = FALSE)

Scores:

As I expected, house_kag performed worse than house_kag_log. Using the log transformation allowed for a better prediction than leaving the Sale Price untransformed.

house_kag RMSE: 0.21194 (Would be \(3618^{th}\) place)

house_kag_log RMSE: 0.19997 (\(3537^{th}\) place)

Kaggle Username: shanehylton (Shane Hylton)
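
For context, the leaderboard metric is the root mean squared error between the logs of the predicted and observed sale prices, so an in-sample analogue for the log model (a sketch, using the model_adjust_log fit above) is:

# In-sample RMSE on the log scale for the log-price model
sqrt(mean(residuals(model_adjust_log)^2))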

Trying Again:

I will try this model one more time. My initial plan was to try some exponential transformations and to subset the years sold in order to fit the test data better.

Originally, I grouped the data into buckets representing the year each house was sold, but those predictions did not work out the way I expected, so I have instead added Year Sold as a parameter. I also originally used exponentiation to develop a model; it produced a strong R-squared value, but the actual predictions were dramatically skewed.

train_log <- train_log %>% dplyr::select(c(Id,LotArea,BsmtFinSF1,OverallQual,OverallCond,YearBuilt,MasVnrArea,TotalBsmtSF,FirstFlrSF,SecondFlrSF,GrLivArea,CentralAir,YrSold,GarageArea,WoodDeckSF,FullBath,logprice,SalePrice))

train_log <- train_log %>%
  replace(is.na(.),0)

mod_final <-  lm(logprice ~ LotArea + BsmtFinSF1 + YearBuilt + TotalBsmtSF + GrLivArea + GarageArea + WoodDeckSF + CentralAir + OverallQual + OverallCond + YrSold, data = train_log)
summary(mod_final)
## 
## Call:
## lm(formula = logprice ~ LotArea + BsmtFinSF1 + YearBuilt + TotalBsmtSF + 
##     GrLivArea + GarageArea + WoodDeckSF + CentralAir + OverallQual + 
##     OverallCond + YrSold, data = train_log)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.54809 -0.06741  0.00729  0.07564  0.47111 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.305e+01  6.290e+00   2.074 0.038233 *  
## LotArea      2.732e-06  4.465e-07   6.117 1.22e-09 ***
## BsmtFinSF1   5.050e-05  1.089e-05   4.637 3.86e-06 ***
## YearBuilt    3.242e-03  2.114e-04  15.342  < 2e-16 ***
## TotalBsmtSF  7.603e-05  1.344e-05   5.659 1.83e-08 ***
## GrLivArea    2.440e-04  1.090e-05  22.386  < 2e-16 ***
## GarageArea   2.255e-04  2.562e-05   8.804  < 2e-16 ***
## WoodDeckSF   1.309e-04  3.538e-05   3.699 0.000225 ***
## CentralAir   4.364e-02  9.597e-03   4.547 5.90e-06 ***
## OverallQual  9.847e-02  4.899e-03  20.098  < 2e-16 ***
## OverallCond  5.106e-02  4.359e-03  11.714  < 2e-16 ***
## YrSold      -4.462e-03  3.129e-03  -1.426 0.154134    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1583 on 1448 degrees of freedom
## Multiple R-squared:  0.8441, Adjusted R-squared:  0.8429 
## F-statistic: 712.6 on 11 and 1448 DF,  p-value: < 2.2e-16
pred_all <- predict(mod_final, test_house_test,type = "response")

pred_price <- exp(pred_all)

house_price_kaggle_edit <- data.frame(Id = test_house_test$Id,SalePrice = pred_price)

write.csv(house_price_kaggle_edit,file = "house_price_kaggle_edit.csv", row.names = FALSE)

Kaggle Score

Score for this submission: 0.15308

Leaderboard Position: 2500

Knitting Issues

This took so long to knit. I had to try a number of times. It kept getting hung up halfway through.