As a first note, I do not see how \(r_f = A \times r_i\): multiplying the given \(A\) by \(r_i\) does not reproduce the stated \(r_f\).
For the PageRank Algorithm to work properly, the columns of the transition matrix should sum to 1, as is the case in Markov transition matrices.
A <- matrix(round(c(0,1/2,1/2,0,0,0,0,0,0,0,0,0,1/3,1/3,0,0,1/3, 0,0,0,0,0,1/2,1/2,0,0,0,1/2,0,1/2,0,0,0,1,0,0),2), nrow = 6,byrow = TRUE)
A
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.00 0.50 0.5 0.0 0.00 0.0
## [2,] 0.00 0.00 0.0 0.0 0.00 0.0
## [3,] 0.33 0.33 0.0 0.0 0.33 0.0
## [4,] 0.00 0.00 0.0 0.0 0.50 0.5
## [5,] 0.00 0.00 0.0 0.5 0.00 0.5
## [6,] 0.00 0.00 0.0 1.0 0.00 0.0
A_orig <- A
ri <- matrix(c(1/6,1/6,1/6,1/6,1/6,1/6),nrow = 6)
ri
## [,1]
## [1,] 0.1666667
## [2,] 0.1666667
## [3,] 0.1666667
## [4,] 0.1666667
## [5,] 0.1666667
## [6,] 0.1666667
rf_1 <- A %*% ri
rf_1
## [,1]
## [1,] 0.1666667
## [2,] 0.0000000
## [3,] 0.1650000
## [4,] 0.1666667
## [5,] 0.1666667
## [6,] 0.1666667
rf <- (A %^% 6) %*% ri
rf
## [,1]
## [1,] 0.03428356
## [2,] 0.00000000
## [3,] 0.06706975
## [4,] 0.16666667
## [5,] 0.16666667
## [6,] 0.16666667
Here we can see that, given \(A\) and \(r_i\), neither \(A \times r_i\) nor \(A^6 \times r_i\) produces a proper rank vector: the entries no longer sum to 1, because the columns of \(A\) do not.
Currently, row 2 of A is all zeros: page 2 has no outgoing links, so it contributes zero probability. Because a surfer who lands on such a dangling page is assumed to jump to any of the pages with equal probability, \(\frac 16\) will be placed in each entry of row 2.
With row 2 filled in, every row of A sums to 1, but the columns do not; the matrix is row-stochastic rather than column-stochastic. A way around this is to transpose A so that the condition that all of the column sums equal 1 is satisfied.
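For context, here is a minimal sketch of how a column-stochastic transition matrix can be built directly from a 0/1 link matrix: normalize each column by its out-degree and give dangling pages a uniform column. The small links matrix below is a hypothetical example, not the matrix used in this exercise.
# sketch: links[i, j] = 1 when page j links to page i, so columns hold out-links
links <- matrix(c(0, 1, 1,
                  0, 0, 1,
                  1, 0, 0), nrow = 3, byrow = TRUE)
n <- nrow(links)
out_degree <- colSums(links)
trans <- sweep(links, 2, pmax(out_degree, 1), "/")  # divide each column by its out-degree
trans[, out_degree == 0] <- 1 / n                   # dangling pages jump uniformly
colSums(trans)                                      # every column now sums to 1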
A <- matrix(c(0,1/2,1/2,0,0,0,1/6,1/6,1/6,1/6,1/6,1/6,1/3,1/3,0,0,1/3, 0,0,0,0,0,1/2,1/2,0,0,0,1/2,0,1/2,0,0,0,1,0,0), nrow = 6,byrow = TRUE)
A
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.0000000 0.5000000 0.5000000 0.0000000 0.0000000 0.0000000
## [2,] 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
## [3,] 0.3333333 0.3333333 0.0000000 0.0000000 0.3333333 0.0000000
## [4,] 0.0000000 0.0000000 0.0000000 0.0000000 0.5000000 0.5000000
## [5,] 0.0000000 0.0000000 0.0000000 0.5000000 0.0000000 0.5000000
## [6,] 0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
A <- t(A)
A
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.0 0.1666667 0.3333333 0.0 0.0 0
## [2,] 0.5 0.1666667 0.3333333 0.0 0.0 0
## [3,] 0.5 0.1666667 0.0000000 0.0 0.0 0
## [4,] 0.0 0.1666667 0.0000000 0.0 0.5 1
## [5,] 0.0 0.1666667 0.3333333 0.5 0.0 0
## [6,] 0.0 0.1666667 0.0000000 0.5 0.5 0
colSums(A)
## [1] 1 1 1 1 1 1
Recalculating \(r_f\),
ri <- matrix(c(1/6,1/6,1/6,1/6,1/6,1/6),nrow = 6)
ri
## [,1]
## [1,] 0.1666667
## [2,] 0.1666667
## [3,] 0.1666667
## [4,] 0.1666667
## [5,] 0.1666667
## [6,] 0.1666667
rf_1 <- A %*% ri
rf_1
## [,1]
## [1,] 0.08333333
## [2,] 0.16666667
## [3,] 0.11111111
## [4,] 0.27777778
## [5,] 0.16666667
## [6,] 0.19444444
rf <- (A %^% 6) %*% ri
rf
## [,1]
## [1,] 0.01316372
## [2,] 0.02272305
## [3,] 0.01515704
## [4,] 0.41632730
## [5,] 0.21763903
## [6,] 0.31498985
Because the goal is to find a probability vector \(r\) such that \(A \times r = r\), the uniform \(r_i\) from the example is a suitable starting vector for iterating with the manipulated version of A. Let \(d\) be the damping (decay) factor, and introduce \(d\) into A to form B, the decayed version of A. Following Page and Brin, we will use \(d = 0.85\). Note that \(n\) is the dimension of A, i.e., the number of pages (here \(n = 6\)).
\(B = d \times A + \frac {1 - d}n\)
\(B = 0.85 \times A + \frac {0.15}{6}\)
d <- 0.85
A_decay <- d*A
B <- A_decay + (1-d)/nrow(A)
B
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.025 0.1666667 0.3083333 0.025 0.025 0.025
## [2,] 0.450 0.1666667 0.3083333 0.025 0.025 0.025
## [3,] 0.450 0.1666667 0.0250000 0.025 0.025 0.025
## [4,] 0.025 0.1666667 0.0250000 0.025 0.450 0.875
## [5,] 0.025 0.1666667 0.3083333 0.450 0.025 0.025
## [6,] 0.025 0.1666667 0.0250000 0.450 0.450 0.025
It is worth noting that column 2 of B is identical to column 2 of the manipulated A. Each entry in that column is \(\frac 16\), and \(0.85 \times \frac 16 + 0.15 \times \frac 16 = \frac 16\), so the damping leaves those entries unchanged.
For our uniform rank vector r, we will use the example \(r_i\). So, \(r = \begin{bmatrix} \frac 16 \\ \frac 16 \\ \frac 16 \\ \frac 16 \\ \frac 16 \\ \frac 16\end{bmatrix}\)
matpow <- function(x, y){
  # raise the matrix x to the y-th power using the %^% operator
  out <- x %^% y
  return(out)
}
r <- matrix(c(1/6,1/6,1/6,1/6,1/6,1/6),nrow = 6)
r
## [,1]
## [1,] 0.1666667
## [2,] 0.1666667
## [3,] 0.1666667
## [4,] 0.1666667
## [5,] 0.1666667
## [6,] 0.1666667
r0 <- matpow(B,1) %*% r
r1 <- matpow(B,10) %*% r
r2 <- matpow(B,20) %*% r
r3 <- matpow(B,30) %*% r
r4 <- matpow(B,40) %*% r
r5 <- matpow(B,50) %*% r
r6 <- matpow(B,60) %*% r
result <- matrix(cbind(r1,r2,r3,r4,r5,r6), nrow = 6)
result
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.05205661 0.05170616 0.05170475 0.05170475 0.05170475 0.05170475
## [2,] 0.07428990 0.07368173 0.07367927 0.07367926 0.07367926 0.07367926
## [3,] 0.05782138 0.05741406 0.05741242 0.05741241 0.05741241 0.05741241
## [4,] 0.34797267 0.34870083 0.34870367 0.34870369 0.34870369 0.34870369
## [5,] 0.19975859 0.19990313 0.19990381 0.19990381 0.19990381 0.19990381
## [6,] 0.26810085 0.26859408 0.26859607 0.26859608 0.26859608 0.26859608
The results converge at around 30 iterations; from 40 iterations onward the vector no longer changes. So \(n = 40\) is a sufficiently large power to reach the fixed point \(r = B^n \times r\).
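Rather than choosing the power by inspection, the same fixed point can be found with a small power-iteration loop that stops once successive vectors agree to within a tolerance. This is only a sketch using the B and r defined above; the helper name, tolerance, and iteration cap are arbitrary choices.
power_iterate <- function(M, r, tol = 1e-10, max_iter = 1000){
  # repeatedly apply M until the rank vector stops changing
  for (i in seq_len(max_iter)){
    r_new <- M %*% r
    if (max(abs(r_new - r)) < tol){
      return(list(vector = r_new, iterations = i))
    }
    r <- r_new
  }
  list(vector = r, iterations = max_iter)
}
power_iterate(B, r)$vector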
Eigen Decomposition:
eigen(B)
## eigen() decomposition
## $values
## [1] 1.00000000+0i 0.57619235+0i -0.42500000+0i -0.42500000-0i -0.34991524+0i
## [6] -0.08461044+0i
##
## $vectors
## [,1] [,2] [,3]
## [1,] 0.1044385+0i 0.2931457+0i 2.486934e-15+0.0000e+00i
## [2,] 0.1488249+0i 0.5093703+0i -8.528385e-16-6.9832e-23i
## [3,] 0.1159674+0i 0.3414619+0i -1.930646e-15-0.0000e+00i
## [4,] 0.7043472+0i -0.5890805+0i -7.071068e-01+0.0000e+00i
## [5,] 0.4037861+0i -0.1413606+0i 7.071068e-01+0.0000e+00i
## [6,] 0.5425377+0i -0.4135367+0i 0.000000e+00-1.7058e-08i
## [,4] [,5] [,6]
## [1,] 2.486934e-15-0.0000e+00i -0.06471710+0i -0.212296003+0i
## [2,] -8.528385e-16+6.9832e-23i 0.01388698+0i 0.854071294+0i
## [3,] -1.930646e-15+0.0000e+00i 0.07298180+0i -0.363638739+0i
## [4,] -7.071068e-01+0.0000e+00i -0.66058664+0i 0.018399984+0i
## [5,] 7.071068e-01-0.0000e+00i 0.73761812+0i -0.304719509+0i
## [6,] 0.000000e+00+1.7058e-08i -0.09918316+0i 0.008182973+0i
#B %*% r
The largest eigenvalue is in fact 1. The corresponding eigenvector does not immediately match the vector obtained by power iteration because eigen() scales its eigenvectors to unit length rather than to a sum of 1. Recall that a rank vector's entries must sum to 1, so to normalize I will divide each element by the sum of the vector.
evec <- as.numeric(eigen(B)$vectors[,which.max(eigen(B)$values)])
## Warning in which.max(eigen(B)$values): imaginary parts discarded in coercion
evec <- matrix(evec,nrow = 6)
evec
## [,1]
## [1,] 0.1044385
## [2,] 0.1488249
## [3,] 0.1159674
## [4,] 0.7043472
## [5,] 0.4037861
## [6,] 0.5425377
factor <- colSums(evec)
norm_vec <- evec/factor
norm_vec
## [,1]
## [1,] 0.05170475
## [2,] 0.07367926
## [3,] 0.05741241
## [4,] 0.34870369
## [5,] 0.19990381
## [6,] 0.26859608
#colSums(norm_vec)
Below, I will display the eigenvector and the \(40^{th}\) power iteration vector side by side.
matrix(cbind(r4,norm_vec),nrow = 6)
## [,1] [,2]
## [1,] 0.05170475 0.05170475
## [2,] 0.07367926 0.07367926
## [3,] 0.05741241 0.05741241
## [4,] 0.34870369 0.34870369
## [5,] 0.19990381 0.19990381
## [6,] 0.26859608 0.26859608
As shown in the final summary matrix, the eigenvector and the r-vector are equivalent when the eigenvector is normalized.
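As a quick sanity check, multiplying B by the normalized eigenvector should return the same vector, since it belongs to the eigenvalue 1:
# B %*% norm_vec should reproduce norm_vec up to floating-point error
max(abs(B %*% norm_vec - norm_vec))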
Graph of Page Rank
The only package I could find that includes page.rank as a function is igraph. For this step, I will use the original A matrix, as it may not be necessary to insert the uniform second row.
library(igraph)
## Warning: package 'igraph' was built under R version 4.0.5
##
## Attaching package: 'igraph'
## The following object is masked from 'package:EBImage':
##
## normalize
## The following object is masked from 'package:imager':
##
## spectrum
## The following object is masked from 'package:matrixcalc':
##
## %s%
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
##
## compose, simplify
## The following object is masked from 'package:tidyr':
##
## crossing
## The following object is masked from 'package:tibble':
##
## as_data_frame
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
pltt <- graph_from_adjacency_matrix(A_orig,weighted = TRUE)
plot(pltt)
pr <- matrix(page.rank(pltt)$vector,nrow = 6)
pr
## [,1]
## [1,] 0.05170475
## [2,] 0.07367926
## [3,] 0.05741241
## [4,] 0.34870369
## [5,] 0.19990381
## [6,] 0.26859608
r4
## [,1]
## [1,] 0.05170475
## [2,] 0.07367926
## [3,] 0.05741241
## [4,] 0.34870369
## [5,] 0.19990381
## [6,] 0.26859608
matrix(cbind(pr,r4),nrow = 6
)
## [,1] [,2]
## [1,] 0.05170475 0.05170475
## [2,] 0.07367926 0.07367926
## [3,] 0.05741241 0.05741241
## [4,] 0.34870369 0.34870369
## [5,] 0.19990381 0.19990381
## [6,] 0.26859608 0.26859608
It is plain to see that the two vectors are essentially equivalent.
train <- read_csv("train.csv")
## Rows: 42000 Columns: 785
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (785): label, pixel0, pixel1, pixel2, pixel3, pixel4, pixel5, pixel6, pi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
test <- read_csv("test.csv")
## Rows: 28000 Columns: 784
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (784): pixel0, pixel1, pixel2, pixel3, pixel4, pixel5, pixel6, pixel7, p...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ten <- train[1:10, ]
ten
labs <- ten[,1]
ten_data <- ten[,2:785]
max(ten_data)
## [1] 255
ten_scale <- ten_data/255
max(ten_scale)
## [1] 1
digit <- matrix(unlist(ten_scale[1,]),nrow = 28,byrow = TRUE)
digit
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.00000000
## [2,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.00000000
## [3,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.00000000
## [4,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.00000000
## [5,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.00000000
## [6,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.00000000
## [7,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.00000000
## [8,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.00000000
## [9,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.00000000
## [10,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.00000000
## [11,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.00000000
## [12,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.00000000
## [13,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.00000000
## [14,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.00000000
## [15,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.00000000
## [16,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.00000000
## [17,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.00000000
## [18,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.08627451
## [19,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.40392157
## [20,] 0 0 0 0 0 0 0 0.00000000 0.3490196 0.94117647
## [21,] 0 0 0 0 0 0 0 0.05882353 0.8627451 0.99215686
## [22,] 0 0 0 0 0 0 0 0.36862745 0.9921569 0.99215686
## [23,] 0 0 0 0 0 0 0 0.34901961 0.9843137 0.99215686
## [24,] 0 0 0 0 0 0 0 0.00000000 0.8392157 0.85490196
## [25,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.00000000
## [26,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.00000000
## [27,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.00000000
## [28,] 0 0 0 0 0 0 0 0.00000000 0.0000000 0.00000000
## [,11] [,12] [,13] [,14] [,15] [,16]
## [1,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
## [2,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
## [3,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
## [4,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
## [5,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
## [6,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
## [7,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
## [8,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
## [9,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
## [10,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.21176471
## [11,] 0.00000000 0.00000000 0.00000000 0.00000000 0.3647059 0.99607843
## [12,] 0.00000000 0.00000000 0.00000000 0.09019608 0.8235294 0.99607843
## [13,] 0.00000000 0.00000000 0.06274510 0.81960784 0.9921569 0.99607843
## [14,] 0.00000000 0.00000000 0.10588235 0.99215686 0.9921569 0.99607843
## [15,] 0.00000000 0.07843137 0.80784314 0.99607843 0.9960784 0.77647059
## [16,] 0.00000000 0.65882353 0.99215686 0.99215686 0.7686275 0.02745098
## [17,] 0.07843137 0.79607843 0.99215686 0.97254902 0.2980392 0.00000000
## [18,] 0.73725490 0.99215686 0.96078431 0.36470588 0.0000000 0.00000000
## [19,] 0.99215686 0.99215686 0.74901961 0.00000000 0.0000000 0.00000000
## [20,] 0.99215686 0.76470588 0.09803922 0.00000000 0.0000000 0.00000000
## [21,] 0.99215686 0.31372549 0.00000000 0.00000000 0.0000000 0.00000000
## [22,] 0.99215686 0.36862745 0.00000000 0.00000000 0.0000000 0.00000000
## [23,] 0.98039216 0.51372549 0.00000000 0.00000000 0.0000000 0.00000000
## [24,] 0.37254902 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
## [25,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
## [26,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
## [27,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
## [28,] 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.00000000
## [,17] [,18] [,19] [,20] [,21] [,22] [,23]
## [1,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [2,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [3,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [4,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [5,] 0.00000000 0.0000000 0.0000000 0.00000000 0.7372549 1.00000000 0.36862745
## [6,] 0.00000000 0.0000000 0.0000000 0.74901961 0.9803922 0.99215686 0.36470588
## [7,] 0.00000000 0.0000000 0.4823529 0.97254902 0.9921569 0.65490196 0.03921569
## [8,] 0.00000000 0.3137255 0.9686275 0.99215686 0.8156863 0.05098039 0.00000000
## [9,] 0.11372549 0.8117647 0.9921569 0.92156863 0.3019608 0.00000000 0.00000000
## [10,] 0.81960784 0.9921569 0.9921569 0.34509804 0.0000000 0.00000000 0.00000000
## [11,] 0.99215686 0.9333333 0.6666667 0.06666667 0.0000000 0.00000000 0.00000000
## [12,] 0.99215686 0.6235294 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [13,] 0.94117647 0.3176471 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [14,] 0.05098039 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [15,] 0.02745098 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [16,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [17,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [18,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [19,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [20,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [21,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [22,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [23,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [24,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [25,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [26,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [27,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [28,] 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000 0.00000000
## [,24] [,25] [,26] [,27] [,28]
## [1,] 0 0 0 0 0
## [2,] 0 0 0 0 0
## [3,] 0 0 0 0 0
## [4,] 0 0 0 0 0
## [5,] 0 0 0 0 0
## [6,] 0 0 0 0 0
## [7,] 0 0 0 0 0
## [8,] 0 0 0 0 0
## [9,] 0 0 0 0 0
## [10,] 0 0 0 0 0
## [11,] 0 0 0 0 0
## [12,] 0 0 0 0 0
## [13,] 0 0 0 0 0
## [14,] 0 0 0 0 0
## [15,] 0 0 0 0 0
## [16,] 0 0 0 0 0
## [17,] 0 0 0 0 0
## [18,] 0 0 0 0 0
## [19,] 0 0 0 0 0
## [20,] 0 0 0 0 0
## [21,] 0 0 0 0 0
## [22,] 0 0 0 0 0
## [23,] 0 0 0 0 0
## [24,] 0 0 0 0 0
## [25,] 0 0 0 0 0
## [26,] 0 0 0 0 0
## [27,] 0 0 0 0 0
## [28,] 0 0 0 0 0
image(digit)
This figure needs to be rotated in order to visualize the digit properly. This could have been avoided had I manually supplied the appropriate x and y coordinates for each z-value. Each matrix needs to be rotated roughly 90 degrees clockwise before plotting.
for (i in 1:10){
digit <- rotateImage(matrix(unlist(train[i, -1]), nrow = 28, byrow = TRUE),270)
image(digit, useRaster = TRUE, axes = FALSE)
}
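If rotateImage were unavailable, the same 90-degree clockwise rotation can be done in base R by reversing the rows and transposing. A minimal sketch; the helper name and the choice of the first training image are illustrative.
# rotate a matrix 90 degrees clockwise so image() shows it in print orientation
rotate_cw <- function(m) t(apply(m, 2, rev))
digit_first <- matrix(unlist(train[1, -1]), nrow = 28, byrow = TRUE)
image(rotate_cw(digit_first), useRaster = TRUE, axes = FALSE, col = grey.colors(256))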
train <- data_frame(train)
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
ggplot(train, aes(x=label))+
geom_histogram(aes(y=..count..), color = 'yellow', binwidth = 1)+
#geom_density(alpha=.2,fill = "red")+
ggtitle("Frequency Histogram of Digit Distribution")+
theme(plot.title = element_text(hjust = 0.5))+
xlab('Digit')+
scale_x_continuous(breaks = seq(min(train$label),max(train$label), by = 1))+
ylab('Count')
table(train$label)
##
## 0 1 2 3 4 5 6 7 8 9
## 4132 4684 4177 4351 4072 3795 4137 4401 4063 4188
The distribution is close to uniform, with 1 being the most common digit (4684 occurrences) and 5 the least common (3795), a spread of 889 occurrences.
For mean pixel intensity, I will work with the raw (unscaled) pixel values: I will compute the mean pixel intensity of each image in the training data and then take the mean of those per-image means.
train_data <- train[,-1]
train_data$mean <- rowMeans(train_data)
ggplot(train_data, aes(x=mean))+
geom_histogram(aes(y=..density..), color = 'red', binwidth = 1)+
geom_density(alpha=.2,fill = "orange")+
ggtitle(" Histogram of Pixel Density")+
theme(plot.title = element_text(hjust = 0.5))+
xlab('Pixel Intensity')+
ylab('Density')
mean(train_data$mean)
## [1] 33.40891
The per-image mean pixel intensity is roughly normally distributed, with an average of 33.41. The higher an image's mean intensity, the more of the frame the digit's strokes cover, i.e., the physically larger the written number.
Principal Components:
trained <- train[,-1]
scaled <- scale(trained,center =TRUE,scale =TRUE)
train_mean <- attr(scaled,"scaled:center")
train_sd <- attr(scaled,"scaled:scale")
pcomp <- princomp(trained)
std_pcomp <- pcomp$sdev
var_pcomp <- std_pcomp^2
cvar <- cumsum(std_pcomp^2)/sum(std_pcomp^2)
cvar_95 <- which.max(cvar >= 0.95)
#cvar_95
cvar1 <- which.max(cvar >= 1)
cvar1
## Comp.708
## 708
154 components are needed to cover 95% of the variance, and 708 components are needed to account for essentially all of the variance.
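To see what retaining most of the variance means in practice, a single image can be projected onto the leading components and reconstructed from its scores. This is a sketch using the pcomp object above; the choice of the first image and of 154 components is illustrative.
# reconstruct the first training image from its first 154 principal components
k <- 154
recon <- pcomp$scores[1, 1:k] %*% t(pcomp$loadings[, 1:k])  # 1 x 784 row vector
recon <- recon + pcomp$center                               # add back the column means
digit_recon <- matrix(recon, nrow = 28, byrow = TRUE)
image(digit_recon, useRaster = TRUE, axes = FALSE)          # will appear rotated, as before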
Again, this will need to be rotated.
for (i in 1:10){
digit <- rotateImage(matrix(pcomp$loadings[,i], nrow = 28, byrow = TRUE),270)
image(digit, useRaster = TRUE, axes = FALSE)
}
There is a lot of noise in each image. Each principal component is fit across all ten digit classes at once, so its loadings blend strokes from many different digits. The noise is reduced when the components are fit to a single digit class, as in the next step.
eights <- which(train$label == 8)
#eights
eights_data <- train[c(eights),]
eights_drop <- eights_data[,-1]
pcomp_8 <- princomp(eights_drop)
std_8 <- pcomp_8$sdev
var_8 <- std_8^2
cvar <- cumsum(std_8^2)/sum(std_8^2)
cvar_100 <- which.max(cvar >= 1)
cvar_100
## Comp.540
## 540
for (i in 1:10){
digit <- rotateImage(matrix(pcomp_8$loadings[,i], nrow = 28, byrow = TRUE),270)
image(digit, useRaster = TRUE, axes = FALSE)
}
These images, especially the first few components, look very much like 8s. Because we are fitting to eights only, rather than to all possible digits, we get a much clearer image of the desired digit.
The fit initially kept failing with errors about the weights: as I enlarged the model, nnet complained that there were too many weights, so I had to raise the MaxNWts parameter for the fit to run. Note that the model requires a great deal of computational power and time. I am using every pixel column in the model; the computation could be reduced by fitting on principal-component scores instead (for example, the 154 components that cover 95% of the variance), as sketched after the fit below.
training <- as.factor(train$label)
labels <- as.factor(train$label)
fit <- multinom(labels ~., data = train_data, MaxNWts = 10000)
## # weights: 7870 (7074 variable)
## initial value 96708.573906
## iter 10 value 27932.928742
## iter 20 value 22897.023793
## iter 30 value 21671.757612
## iter 40 value 21248.433597
## iter 50 value 21070.878338
## iter 60 value 20985.749008
## iter 70 value 20904.370367
## iter 80 value 20840.141748
## iter 90 value 20767.634400
## iter 100 value 20700.473515
## final value 20700.473515
## stopped after 100 iterations
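As noted before the fit, the computation could be cut substantially by training on principal-component scores rather than raw pixels. A hedged sketch of that idea, reusing pcomp and labels from earlier; the 154-component cutoff is the 95%-variance point, and this is not the model evaluated below.
# hypothetical alternative fit: multinomial logit on the first 154 PC scores
library(nnet)
pc_scores <- as.data.frame(pcomp$scores[, 1:154])
fit_pc <- multinom(labels ~ ., data = pc_scores, MaxNWts = 10000)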
Prediction:
prediction <- predict(fit,train_data)
confused <- confusionMatrix(prediction,labels)
confused
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8 9
## 0 3961 0 11 5 10 24 45 6 18 13
## 1 25 4606 153 88 88 83 57 111 384 84
## 2 7 13 3644 88 36 16 64 39 26 9
## 3 8 15 77 3786 13 92 3 9 67 56
## 4 9 3 25 3 3448 12 18 15 20 63
## 5 24 14 23 176 46 3286 92 12 168 52
## 6 10 1 13 5 28 42 3768 3 14 0
## 7 36 11 106 81 26 47 18 4047 41 190
## 8 38 20 104 88 52 139 42 7 3257 23
## 9 14 1 21 31 325 54 30 152 68 3698
##
## Overall Statistics
##
## Accuracy : 0.8929
## 95% CI : (0.8899, 0.8958)
## No Information Rate : 0.1115
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8809
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.95862 0.9833 0.87240 0.87014 0.84676 0.86588
## Specificity 0.99651 0.9712 0.99212 0.99097 0.99557 0.98411
## Pos Pred Value 0.96775 0.8111 0.92440 0.91760 0.95354 0.84408
## Neg Pred Value 0.99549 0.9979 0.98600 0.98508 0.98374 0.98664
## Prevalence 0.09838 0.1115 0.09945 0.10360 0.09695 0.09036
## Detection Rate 0.09431 0.1097 0.08676 0.09014 0.08210 0.07824
## Detection Prevalence 0.09745 0.1352 0.09386 0.09824 0.08610 0.09269
## Balanced Accuracy 0.97756 0.9773 0.93226 0.93056 0.92116 0.92499
## Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity 0.91080 0.91956 0.80162 0.88300
## Specificity 0.99694 0.98521 0.98648 0.98159
## Pos Pred Value 0.97013 0.87921 0.86393 0.84160
## Neg Pred Value 0.99032 0.99053 0.97892 0.98697
## Prevalence 0.09850 0.10479 0.09674 0.09971
## Detection Rate 0.08971 0.09636 0.07755 0.08805
## Detection Prevalence 0.09248 0.10960 0.08976 0.10462
## Balanced Accuracy 0.95387 0.95239 0.89405 0.93230
The model’s accuracy was 89.29%. All things considered, this is pretty good, but it may lead to many misdelivered envelopes for the USPS. Come to think of it, when I lived at 248, we received a lot of letters for 246. I won’t provide the street number, but that was also frequently misinterpreted.
train_house <- read_csv("train_house.csv")
## Rows: 1460 Columns: 81
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (38): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
test_house <- read_csv("test_house.csv")
## Rows: 1459 Columns: 80
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (37): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(train_house, aes(x=LotArea, y = SalePrice))+
geom_point(aes(color = "orange"))+
ggtitle('Sale Price vs. Lot Area')+
theme(plot.title = element_text(hjust = 0.5),legend.position = "none")+
xlab('Lot Area')+
ylab('Sale Price')
ggplot(train_house, aes(x=GrLivArea, y = SalePrice))+
geom_point(aes(color = "blue"))+
ggtitle('Sale Price vs. Living Area')+
theme(plot.title = element_text(hjust = 0.5),legend.position = "none")+
xlab('Above Ground Living Area')+
ylab('Sale Price')
ggplot(train_house, aes(x=SalePrice))+
geom_histogram(aes(y=..density..), color = 'red', fill = 'green', binwidth = 25000)+
geom_density(alpha=.1, fill = 'pink')+
ggtitle('Distribution of Sale Price')+
theme(plot.title = element_text(hjust = 0.5))+
xlab('Sale Price')+
ylab('Density')
The distribution of sale price shows a heavy right skew. The Above Ground Living Area of a home heavily influences the sale price.
Next, we will visualize all of the relationships. Originally, I was going to use all numeric variables, but there are just too many columns available in the training set. I will limit the selection to 10 or fewer variables.
Columns 44 and 45 (1stFlrSF and 2ndFlrSF) cannot currently be referenced by name in model formulas because their names begin with digits, so I rename them.
names(train_house)[44] <- "FirstFlrSF"
names(train_house)[45] <- "SecondFlrSF"
train_house["CentralAir"][train_house["CentralAir"] == "Y"] <- '1'
train_house["CentralAir"][train_house["CentralAir"] == "N"] <- '-1'
train_house$CentralAir <- as.numeric(train_house$CentralAir)
train_house %>%
dplyr::select(c(LotArea,BsmtFinSF1,TotalBsmtSF,FirstFlrSF,SecondFlrSF,GrLivArea,GarageArea,WoodDeckSF,SalePrice)) %>%
ggpairs()
Sale Price is most heavily correlated with GrLivArea, GarageArea, TotalBsmtSF, and FirstFlrSF. Interestingly, LotArea and SalePrice are not strongly correlated. If LotArea were restricted to values below 50,000, we would see a much more linear relationship.
train_filter <- train_house[train_house$LotArea < 50000, ]
train_filter %>%
dplyr::select(c(LotArea,BsmtFinSF1,TotalBsmtSF,FirstFlrSF,SecondFlrSF,GrLivArea,GarageArea,WoodDeckSF,SalePrice)) %>%
ggpairs()
ggplot(train_filter, aes(x=LotArea, y = SalePrice))+
geom_point(aes(color = "orange"))+
geom_smooth(method = "lm")+
ggtitle('Sale Price vs. Lot Area')+
theme(plot.title = element_text(hjust = 0.5),legend.position = "none")+
xlab('Lot Area')+
ylab('Sale Price')
## `geom_smooth()` using formula 'y ~ x'
Even after filtering, the spread in the LotArea vs. SalePrice data remains large, so removing the extreme lots can only do so much; the linear fit for LotArea vs. SalePrice is slightly better than before.
ggplot(train_filter, aes(x=LotArea, y = SalePrice,color = HouseStyle))+
geom_point()+
geom_smooth(method = "lm")+
ggtitle('Sale Price vs. Lot Area, Colored By House Style')+
theme(plot.title = element_text(hjust = 0.5))+
xlab('Lot Area')+
ylab('Sale Price')
## `geom_smooth()` using formula 'y ~ x'
ggplot(train_filter, aes(x=GrLivArea, y = SalePrice,color = YearBuilt))+
geom_point()+
geom_smooth(method = "lm")+
ggtitle('Sale Price vs. Above Ground Living Area, Colored By Year Built')+
theme(plot.title = element_text(hjust = 0.5))+
xlab('Above Ground Living Area')+
ylab('Sale Price')
## `geom_smooth()` using formula 'y ~ x'
Because Sale Price is the best representation of home value, I will keep that in the correlation matrix as one of the variables.
selection <- train_house %>%
dplyr::select(c(LotArea,GrLivArea,SalePrice))
corr_mat <- cor(selection)
corr_mat
## LotArea GrLivArea SalePrice
## LotArea 1.0000000 0.2631162 0.2638434
## GrLivArea 0.2631162 1.0000000 0.7086245
## SalePrice 0.2638434 0.7086245 1.0000000
cor.test(train_house$LotArea,train_house$GrLivArea,conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train_house$LotArea and train_house$GrLivArea
## t = 10.414, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2315997 0.2940809
## sample estimates:
## cor
## 0.2631162
cor.test(train_house$LotArea,train_house$SalePrice,conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train_house$LotArea and train_house$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
cor.test(train_house$SalePrice,train_house$GrLivArea,conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train_house$SalePrice and train_house$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
train_house %>%
dplyr::select(c(LotArea,GrLivArea,SalePrice)) %>%
ggpairs()
alpha <- 0.2
c <- 3
fwe <- 1 - (1 - alpha)^c
fwe
## [1] 0.488
fwe_orig <- 1 - (1 - alpha)^9
fwe_orig
## [1] 0.8657823
Interestingly, the only strong correlation is between Sale Price and Above Ground Living Area. As the number of comparisons increases, so does the risk of familywise error. For only three comparisons the concern is milder, but still real: there is a 48.8% chance of at least one Type I error (rejecting a null hypothesis that is actually true). The concern is much greater for the original set of correlation tests, where the number of comparisons was higher; for the 9-variable comparison, the familywise Type I error risk was 86.6%.
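One standard way to keep the familywise error rate at the nominal level is a Bonferroni-style correction, either testing each pair at \(\alpha / c\) or adjusting the p-values. A short sketch for the three pairwise tests above:
# Bonferroni control of the familywise error rate for the three correlation tests
pvals <- c(cor.test(train_house$LotArea, train_house$GrLivArea)$p.value,
           cor.test(train_house$LotArea, train_house$SalePrice)$p.value,
           cor.test(train_house$SalePrice, train_house$GrLivArea)$p.value)
p.adjust(pvals, method = "bonferroni")  # compare these adjusted p-values to alpha
alpha / c                               # equivalently, the per-test threshold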
Precision Matrix:
precision_mat <- solve(corr_mat)
corprec <- corr_mat %*% precision_mat
preccor <- precision_mat %*% corr_mat
corprec
## LotArea GrLivArea SalePrice
## LotArea 1.000000e+00 0.000000e+00 0.000000e+00
## GrLivArea 1.387779e-17 1.000000e+00 2.220446e-16
## SalePrice -2.775558e-17 -2.220446e-16 1.000000e+00
preccor
## LotArea GrLivArea SalePrice
## LotArea 1.000000e+00 1.94289e-16 1.387779e-16
## GrLivArea -1.110223e-16 1.00000e+00 0.000000e+00
## SalePrice -1.110223e-16 0.00000e+00 1.000000e+00
LU <- lu.decomposition(corr_mat)
LU
## $L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] 0.2631162 1.0000000 0
## [3,] 0.2638434 0.6867466 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0.2631162 0.2638434
## [2,] 0 0.9307699 0.6392030
## [3,] 0 0.0000000 0.4914162
lu.decomposition(corprec)
## $L
## [,1] [,2] [,3]
## [1,] 1.000000e+00 0.000000e+00 0
## [2,] 1.387779e-17 1.000000e+00 0
## [3,] -2.775558e-17 -2.220446e-16 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0 0.000000e+00
## [2,] 0 1 2.220446e-16
## [3,] 0 0 1.000000e+00
lu.decomposition(preccor)
## $L
## [,1] [,2] [,3]
## [1,] 1.000000e+00 0.000000e+00 0
## [2,] -1.110223e-16 1.000000e+00 0
## [3,] -1.110223e-16 2.157042e-32 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 1.94289e-16 1.387779e-16
## [2,] 0 1.00000e+00 1.540744e-32
## [3,] 0 0.00000e+00 1.000000e+00
Conveniently for me, SalePrice is skewed to the right. I will try a log transformation to shift the distribution.
p1 <- ggplot(train_house, aes(x=SalePrice))+
geom_histogram(aes(y=..density..), color = 'green', fill = 'blue', binwidth = 25000)+
geom_density(alpha=.1, fill = 'pink')+
ggtitle('Distribution of Sale Price')+
theme(plot.title = element_text(hjust = 0.5))+
xlab('Sale Price')+
ylab('Density')
pricelog <- log(train_house$SalePrice)
p3 <- ggplot(train_house, aes(x=log(SalePrice)))+
geom_histogram(aes(y=..density..), color = 'green', fill = 'blue', binwidth = 0.5)+
geom_density(alpha=.1, fill = 'pink')+
ggtitle('Distribution of Log Sale Price')+
theme(plot.title = element_text(hjust = 0.5))+
xlab('Log Sale Price')+
ylab('Density')
The log transformation shifted the distribution, and it is much more centered now. For an exponential distribution, the rate parameter \(\lambda\) is the reciprocal of the mean (the expected value is \(1/\lambda\)), which is what fitdistr estimates. I will fit to both the log-transformed values and the raw values.
exp_dens <- fitdistr(train_house$SalePrice, 'exponential')
exp_log <- fitdistr(pricelog, 'exponential')
exp_dens
## rate
## 5.527268e-06
## (1.446552e-07)
exp_log
## rate
## 0.083166647
## (0.002176571)
lam <- exp_dens$estimate
lamlog <- exp_log$estimate
lam
## rate
## 5.527268e-06
lamlog
## rate
## 0.08316665
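As a quick check, the maximum-likelihood rate of an exponential distribution is the reciprocal of the sample mean, so the fitted rates can be verified directly:
# fitdistr's exponential rate should equal 1 / mean of the data
1 / mean(train_house$SalePrice)
1 / mean(pricelog)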
samp <- rexp(1000,lam)
samplog <- rexp(1000,lamlog)
summary(samp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 236.3 55259.2 121740.3 178049.4 244363.8 1689144.1
summary(samplog)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00195 3.18307 7.47977 11.20312 16.10674 83.41697
The log transformed values make it easier to comprehend the range of values.
sampdf <- data_frame(samp)
samplogdf <- data_frame(samplog)
p2 <- ggplot(sampdf, aes(x=samp))+
geom_histogram(aes(y=..density..), color = 'green', fill = 'blue', binwidth = 25000)+
geom_density(alpha=.1, fill = 'pink')+
ggtitle('Sample Distribution of Sale Price')+
theme(plot.title = element_text(hjust = 0.5))+
xlab('Sale Price')+
ylab('Density')
ggarrange(p1,p2)
p4 <- ggplot(samplogdf, aes(x=samplog))+
geom_histogram(aes(y=..density..), color = 'green', fill = 'blue', binwidth = 5)+
geom_density(alpha=.1, fill = 'pink')+
ggtitle('Sample Distribution of Log Sale Price')+
theme(plot.title = element_text(hjust = 0.5))+
xlab('Log Sale Price')+
ylab('Density')
ggarrange(p3,p4)
Both simulated samples came out heavily right skewed, as expected for exponential draws, and more skewed than the original distributions they were fit to.
samp_5 <- qexp(0.05,lam)
samp_95 <- qexp(.95,lam)
log_5 <- qexp(.05,lamlog)
log_95 <- qexp(.95,lamlog)
table(samp_5,samp_95)
## samp_95
## samp_5 541991.465498888
## 9280.04416175455 1
table(log_5,log_95)
## log_95
## log_5 36.0208373433093
## 0.616753182601496 1
So, the theoretical \(5^{th}\) and \(95^{th}\) percentiles are 9280 and 541,991 for the original data, and 0.617 and 36.02 for the log-transformed data.
interval <- t.test(train_house$SalePrice,conf.level=0.95)
interval05 <- t.test(train_house$SalePrice,conf.level=.05)
quantile(train_house$SalePrice, c(0.05,0.95))
## 5% 95%
## 88000 326100
quantile(pricelog, c(0.05,.95))
## 5% 95%
## 11.38509 12.69496
The empirical percentiles are far more reasonable than the theoretical ones. Imposing an exponential model on data that does not actually follow that distribution is not an ideal approach.
Confidence Intervals:
table(interval05$conf.int[1],interval05$conf.int[2])
##
## 181051.592315692
## 180790.79946513 1
table(interval$conf.int[1],interval$conf.int[2])
##
## 184999.550739737
## 176842.841041085 1
mean(train_house$SalePrice)
## [1] 180921.2
We can be 95% confident that the true mean sale price lies between 176,842 and 185,000, and only 5% confident that it lies in the much narrower interval from 180,790 to 181,051. The sample mean Sale Price is 180,921, which by construction sits at the center of both intervals.
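For reference, the 95% interval returned by t.test can be reproduced by hand from the sample mean, its standard error, and the t quantiles:
# recompute the 95% confidence interval for the mean sale price by hand
sale_price <- train_house$SalePrice
se <- sd(sale_price) / sqrt(length(sale_price))
mean(sale_price) + qt(c(0.025, 0.975), df = length(sale_price) - 1) * se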
Next, we will build a multiple regression model and submit it to the Kaggle Competition.
train_house %>%
dplyr::select(c(LotArea,BsmtFinSF1,YearBuilt,TotalBsmtSF,FirstFlrSF,SecondFlrSF,GrLivArea,GarageArea,WoodDeckSF,SalePrice)) %>%
ggpairs()
##
## Call:
## lm(formula = SalePrice ~ LotArea + BsmtFinSF1 + YearBuilt + TotalBsmtSF +
## FirstFlrSF + SecondFlrSF + GrLivArea + GarageArea + WoodDeckSF,
## data = train_house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -629975 -18224 -3047 14203 270296
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.323e+06 8.574e+04 -15.429 < 2e-16 ***
## LotArea 3.224e-01 1.180e-01 2.732 0.006369 **
## BsmtFinSF1 1.268e+01 2.865e+00 4.425 1.04e-05 ***
## YearBuilt 6.721e+02 4.416e+01 15.219 < 2e-16 ***
## TotalBsmtSF 2.879e+01 4.766e+00 6.040 1.95e-09 ***
## FirstFlrSF 3.368e+01 2.378e+01 1.416 0.156926
## SecondFlrSF 3.384e+01 2.344e+01 1.443 0.149109
## GrLivArea 3.796e+01 2.308e+01 1.645 0.100253
## GarageArea 5.898e+01 6.735e+00 8.757 < 2e-16 ***
## WoodDeckSF 3.469e+01 9.325e+00 3.720 0.000207 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41920 on 1450 degrees of freedom
## Multiple R-squared: 0.7233, Adjusted R-squared: 0.7215
## F-statistic: 421 on 9 and 1450 DF, p-value: < 2.2e-16
I will remove predictors with p-values higher than 0.10. I will remove them one at a time and re-check the remaining p-values after each removal to make sure that I don't discard valuable variables.
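The fitting call itself is not echoed here, so as an illustration only, one elimination step might look like the following, assuming the full model above is stored in an object called model_full (the name is hypothetical):
# hypothetical elimination step: drop the weakest terms and re-check p-values
model_step <- update(model_full, . ~ . - FirstFlrSF)   # model_full is assumed, not shown above
summary(model_step)
model_step <- update(model_step, . ~ . - SecondFlrSF)
summary(model_step)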
##
## Call:
## lm(formula = SalePrice ~ LotArea + BsmtFinSF1 + YearBuilt + TotalBsmtSF +
## GrLivArea + GarageArea + WoodDeckSF, data = train_house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -628662 -18089 -3256 14460 271795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.343e+06 8.385e+04 -16.013 < 2e-16 ***
## LotArea 3.269e-01 1.176e-01 2.780 0.005512 **
## BsmtFinSF1 1.281e+01 2.861e+00 4.476 8.2e-06 ***
## YearBuilt 6.823e+02 4.325e+01 15.774 < 2e-16 ***
## TotalBsmtSF 2.869e+01 3.439e+00 8.343 < 2e-16 ***
## GrLivArea 7.108e+01 2.542e+00 27.962 < 2e-16 ***
## GarageArea 5.944e+01 6.692e+00 8.882 < 2e-16 ***
## WoodDeckSF 3.484e+01 9.324e+00 3.736 0.000194 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41920 on 1452 degrees of freedom
## Multiple R-squared: 0.7229, Adjusted R-squared: 0.7215
## F-statistic: 541 on 7 and 1452 DF, p-value: < 2.2e-16
With an Adjusted \(R^2\) of 0.7215, this model should be pretty effective at explaining the variability in the dependent variable, sale price.
I will also perform the same regression, but I will use the logarithm of the Sale Price.
Based on the p-values in the full model above, we can remove FirstFlrSF and SecondFlrSF from the start.
##
## Call:
## lm(formula = logprice ~ LotArea + BsmtFinSF1 + YearBuilt + TotalBsmtSF +
## GrLivArea + GarageArea + WoodDeckSF, data = train_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.91533 -0.07240 0.00738 0.09147 0.67537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.515e+00 3.915e-01 6.424 1.79e-10 ***
## LotArea 1.798e-06 5.492e-07 3.274 0.00109 **
## BsmtFinSF1 4.191e-05 1.336e-05 3.137 0.00174 **
## YearBuilt 4.384e-03 2.020e-04 21.704 < 2e-16 ***
## TotalBsmtSF 1.282e-04 1.606e-05 7.983 2.88e-15 ***
## GrLivArea 3.437e-04 1.187e-05 28.954 < 2e-16 ***
## GarageArea 3.292e-04 3.125e-05 10.535 < 2e-16 ***
## WoodDeckSF 1.848e-04 4.354e-05 4.244 2.34e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1958 on 1452 degrees of freedom
## Multiple R-squared: 0.761, Adjusted R-squared: 0.7598
## F-statistic: 660.3 on 7 and 1452 DF, p-value: < 2.2e-16
With an adjusted \(R^2\) of 0.7598, this model may be slightly better than the non-logarithmic model.
res_sct <- ggplot(data = model_adjust, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = 'red') +
xlab("Fitted values") +
ylab("Residuals") +
ggtitle("Linearity of Residuals")+
theme(plot.title = element_text(hjust = 0.5))
res_sct_log <- ggplot(data = model_adjust_log, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = 'red') +
xlab("Fitted values") +
ylab("Residuals") +
ggtitle("Linearity of Residuals")+
theme(plot.title = element_text(hjust = 0.5))
res_hst <- ggplot(data = model_adjust, aes(x = .resid)) +
geom_histogram(binwidth = 25000) +
xlab("Residuals") +
ggtitle("Histogram of Residuals")+
theme(plot.title = element_text(hjust = 0.5))
res_hst_log <- ggplot(data = model_adjust_log, aes(x = .resid)) +
geom_histogram(binwidth = 0.25) +
xlab("Residuals") +
ggtitle("Histogram of Residuals")+
theme(plot.title = element_text(hjust = 0.5))
res_npp <- ggplot(data = model_adjust, aes(sample = .resid)) +
stat_qq()+
ggtitle("Normal Probability Plot of Residuals")+
theme(plot.title = element_text(hjust = 0.5))
res_npp_log <- ggplot(data = model_adjust_log, aes(sample = .resid)) +
stat_qq()+
ggtitle("Normal Probability Plot of Residuals")+
theme(plot.title = element_text(hjust = 0.5))
lin_analysis <- ggarrange(p1, res_sct, res_hst, res_npp,
ncol = 2, nrow = 2)
lin_analysis_log <- ggarrange(p3, res_sct_log, res_hst_log, res_npp_log,
ncol = 2, nrow = 2)
lin_analysis
lin_analysis_log
Both models look to be decent fits for the data, with the edge going to the logarithmic model.
names(test_house)[44] <- "FirstFlrSF"
names(test_house)[45] <- "SecondFlrSF"
test_house["CentralAir"][test_house["CentralAir"] == "Y"] <- '1'
test_house["CentralAir"][test_house["CentralAir"] == "N"] <- '-1'
test_house$CentralAir <- as.numeric(test_house$CentralAir)
test_house_test <- test_house %>% dplyr::select(c(Id,LotArea,BsmtFinSF1,OverallQual,OverallCond,YearBuilt,MasVnrArea,TotalBsmtSF,FirstFlrSF,SecondFlrSF,GrLivArea,CentralAir,YrSold,GarageArea,WoodDeckSF,FullBath))
which(is.na(test_house_test))
## [1] 3579 8986 9001 9177 9287 9299 9336 9606 9620 9635 9644 9663
## [13] 9887 9952 9981 10157 10874 20084
test_house_test <- test_house_test %>%
replace(is.na(.),0)
which(is.na(test_house_test))
## integer(0)
house_predict <- predict(model_adjust,test_house_test,type = "response")
head(house_predict)
## 1 2 3 4 5 6
## 142393.3 174609.2 213005.0 212616.0 179278.5 191596.1
house_log_predict <- predict(model_adjust_log,test_house_test,type = "response")
head(house_log_predict)
## 1 2 3 4 5 6
## 11.83972 11.96564 12.20453 12.20861 12.03851 12.11075
house_log_predict_dollar <- exp(house_log_predict)
head(house_log_predict_dollar)
## 1 2 3 4 5 6
## 138651.2 157258.1 199691.4 200508.0 169145.3 181815.8
house_kag <- data.frame(Id = test_house_test$Id, SalePrice = house_predict)
house_kag_log <- data.frame(Id = test_house_test$Id, SalePrice = house_log_predict_dollar)
house_kag_log <- drop_na(house_kag_log)
write.csv(house_kag,file = "house_kag.csv", row.names = FALSE)
write.csv(house_kag_log,file = "house_kag_log.csv", row.names = FALSE)
As I expected, house_kag performed worse than house_kag_log: the log transformation of Sale Price yielded a better prediction than leaving it untransformed.
house_kag RMSE: 0.21194 (Would be \(3618^{th}\) place)
house_kag_log RMSE: 0.19997 (\(3537^{th}\) place)
Kaggle Username: shanehylton (Shane Hylton)
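For what it is worth, the leaderboard metric (RMSE between the logarithms of the predicted and observed prices, per the competition description) could also be approximated locally before submitting. A sketch using a simple hold-out split on the fitted log model from above; the seed and 80/20 split are arbitrary choices.
# rough local estimate of the Kaggle score for the log model
set.seed(123)
idx <- sample(nrow(train_log), size = floor(0.8 * nrow(train_log)))
fit_cv <- update(model_adjust_log, data = train_log[idx, ])
pred_cv <- predict(fit_cv, newdata = train_log[-idx, ])
sqrt(mean((pred_cv - train_log$logprice[-idx])^2))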
I will try this model one more time with additional predictors. My original plan was to apply some exponential transformations and to subset by the year sold to better fit the test data.
Grouping the data into buckets by the year each house was sold did not produce the predictions I expected, so I have instead added Year Sold as an ordinary predictor. Similarly, my first exponentiated model achieved a strong R-squared value, but its actual predictions were dramatically skewed, so I abandoned that approach as well.
train_log <- train_log %>% dplyr::select(c(Id,LotArea,BsmtFinSF1,OverallQual,OverallCond,YearBuilt,MasVnrArea,TotalBsmtSF,FirstFlrSF,SecondFlrSF,GrLivArea,CentralAir,YrSold,GarageArea,WoodDeckSF,FullBath,logprice,SalePrice))
train_log <- train_log %>%
replace(is.na(.),0)
mod_final <- lm(logprice ~ LotArea + BsmtFinSF1 + YearBuilt + TotalBsmtSF + GrLivArea + GarageArea + WoodDeckSF + CentralAir + OverallQual + OverallCond + YrSold, data = train_log)
summary(mod_final)
##
## Call:
## lm(formula = logprice ~ LotArea + BsmtFinSF1 + YearBuilt + TotalBsmtSF +
## GrLivArea + GarageArea + WoodDeckSF + CentralAir + OverallQual +
## OverallCond + YrSold, data = train_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.54809 -0.06741 0.00729 0.07564 0.47111
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.305e+01 6.290e+00 2.074 0.038233 *
## LotArea 2.732e-06 4.465e-07 6.117 1.22e-09 ***
## BsmtFinSF1 5.050e-05 1.089e-05 4.637 3.86e-06 ***
## YearBuilt 3.242e-03 2.114e-04 15.342 < 2e-16 ***
## TotalBsmtSF 7.603e-05 1.344e-05 5.659 1.83e-08 ***
## GrLivArea 2.440e-04 1.090e-05 22.386 < 2e-16 ***
## GarageArea 2.255e-04 2.562e-05 8.804 < 2e-16 ***
## WoodDeckSF 1.309e-04 3.538e-05 3.699 0.000225 ***
## CentralAir 4.364e-02 9.597e-03 4.547 5.90e-06 ***
## OverallQual 9.847e-02 4.899e-03 20.098 < 2e-16 ***
## OverallCond 5.106e-02 4.359e-03 11.714 < 2e-16 ***
## YrSold -4.462e-03 3.129e-03 -1.426 0.154134
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1583 on 1448 degrees of freedom
## Multiple R-squared: 0.8441, Adjusted R-squared: 0.8429
## F-statistic: 712.6 on 11 and 1448 DF, p-value: < 2.2e-16
pred_all <- predict(mod_final, test_house_test,type = "response")
pred_price <- exp(pred_all)
house_price_kaggle_edit <- data.frame(Id = test_house_test$Id,SalePrice = pred_price)
write.csv(house_price_kaggle_edit,file = "house_price_kaggle_edit.csv", row.names = FALSE)
Score for this submission: 0.15308
Leaderboard Position: 2500
This document took a long time to knit; I had to try a number of times because it kept getting hung up halfway through.