Form the A matrix. Then, introduce decay and form the B matrix as we did in the course notes.
First let’s form A:
A <- matrix(c(0, 1/2, 1/2, 0, 0, 0,
0, 0, 0, 0, 0, 0,
1/3, 1/3, 0, 0, 1/3, 0,
0, 0, 0, 0, 1/2, 1/2,
0, 0, 0, 1/2, 0, 1/2,
0, 0, 0, 1, 0, 0
), nrow = 6)
A
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.0 0 0.3333333 0.0 0.0 0
## [2,] 0.5 0 0.3333333 0.0 0.0 0
## [3,] 0.5 0 0.0000000 0.0 0.0 0
## [4,] 0.0 0 0.0000000 0.0 0.5 1
## [5,] 0.0 0 0.3333333 0.5 0.0 0
## [6,] 0.0 0 0.0000000 0.5 0.5 0
Initially I solved the problem with A as is, but the B matrix would converge at 0 (solving for the eigenvector showed me I’m not right). After some digging I found making A row-stochastic with a uniform vector gets me to the solution I want. So we will move forward with the vector below.
A <- matrix(c(0, 1/2, 1/2, 0, 0, 0,
1/6, 1/6, 1/6, 1/6, 1/6, 1/6,
1/3, 1/3, 0, 0, 1/3, 0,
0, 0, 0, 0, 1/2, 1/2,
0, 0, 0, 1/2, 0, 1/2,
0, 0, 0, 1, 0, 0), nrow=6)
print(A)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.0 0.1666667 0.3333333 0.0 0.0 0
## [2,] 0.5 0.1666667 0.3333333 0.0 0.0 0
## [3,] 0.5 0.1666667 0.0000000 0.0 0.0 0
## [4,] 0.0 0.1666667 0.0000000 0.0 0.5 1
## [5,] 0.0 0.1666667 0.3333333 0.5 0.0 0
## [6,] 0.0 0.1666667 0.0000000 0.5 0.5 0
Next we form the decay matrix B
\[B = 0.85 \times A + \frac{0.15}{n}\]
n <- 6
B <- 0.85 * A + (0.15/n)
print(B)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.025 0.1666667 0.3083333 0.025 0.025 0.025
## [2,] 0.450 0.1666667 0.3083333 0.025 0.025 0.025
## [3,] 0.450 0.1666667 0.0250000 0.025 0.025 0.025
## [4,] 0.025 0.1666667 0.0250000 0.025 0.450 0.875
## [5,] 0.025 0.1666667 0.3083333 0.450 0.025 0.025
## [6,] 0.025 0.1666667 0.0250000 0.450 0.450 0.025
Start with a uniform rank vector r and perform power iterations on B till convergence. That is, compute the solution r = Bn × r. Attempt this for a sufficiently large n so that r actually converges.
Just like in the example each element of r, to start, will be 1/6
r <- matrix(c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6), nrow=6, ncol=1)
r
## [,1]
## [1,] 0.1666667
## [2,] 0.1666667
## [3,] 0.1666667
## [4,] 0.1666667
## [5,] 0.1666667
## [6,] 0.1666667
Plotting one of the values in r over n shows when n converges which is n~10
n <- 30
ns <- c(0)
rs <- c(r[1])
new_r <- r
Bn <- B
for (i in 1:n) {
Bn <- Bn%*%B
new_r <- Bn%*%r
rs <- append(rs, new_r[4])
ns <- append(ns, i)
}
plot(ns, rs)
Result:
new_r
## [,1]
## [1,] 0.05170475
## [2,] 0.07367927
## [3,] 0.05741242
## [4,] 0.34870368
## [5,] 0.19990381
## [6,] 0.26859608
Compute the eigen-decomposition of B and verify that you indeed get an eigenvalue of 1 as the largest eigenvalue and that its corresponding eigenvector is the same vector that you obtained in the previous power iteration method. Further, this eigenvector has all positive entries and it sums to 1
As we can see 1 is an eigenvalue of B
x<-eigen(B)
x$values
## [1] 1.00000000+0i 0.57619235+0i -0.42500000+0i -0.42500000-0i -0.34991524+0i
## [6] -0.08461044+0i
To make sure the vector sums to 1 we divide it by the total sum. We can see the values are the same as the previous problem.
x$vectors[,1]/sum(x$vectors[,1])
## [1] 0.05170475+0i 0.07367926+0i 0.05741241+0i 0.34870369+0i 0.19990381+0i
## [6] 0.26859608+0i
Use the graph package in R and its page.rank method to compute the Page Rank of the graph as given in A. Note that you don’t need to apply decay. The package starts with a connected graph and applies decay internally. Verify that you do get the same PageRank vector as the two approaches above
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
A <- matrix(c(0, 1/2, 1/2, 0, 0, 0,
0, 0, 0, 0, 0, 0,
1/3, 1/3, 0, 0, 1/3, 0,
0, 0, 0, 0, 1/2, 1/2,
0, 0, 0, 1/2, 0, 1/2,
0, 0, 0, 1, 0, 0
), nrow = 6, byrow = T)
This is really cool. I didn’t know R had the ability to do this.
g <- graph_from_adjacency_matrix(A, weighted=TRUE)
plot(g)
The same results are returned
as.matrix(page.rank(g)$vector)
## [,1]
## [1,] 0.05170475
## [2,] 0.07367926
## [3,] 0.05741241
## [4,] 0.34870369
## [5,] 0.19990381
## [6,] 0.26859608
Using the training.csv file, plot representations of the first 10 images to understand the data format. Go ahead and divide all pixels by 255 to produce values between 0 and 1. (This is equivalent to min-max scaling.)
library(keras)
##
## Attaching package: 'keras'
## The following objects are masked from 'package:igraph':
##
## %<-%, normalize
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::as_data_frame() masks tibble::as_data_frame(), igraph::as_data_frame()
## x purrr::compose() masks igraph::compose()
## x tidyr::crossing() masks igraph::crossing()
## x dplyr::filter() masks stats::filter()
## x dplyr::groups() masks igraph::groups()
## x dplyr::lag() masks stats::lag()
## x purrr::simplify() masks igraph::simplify()
library(readr)
test <-read_csv("test.csv")
## Rows: 28000 Columns: 784
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (784): pixel0, pixel1, pixel2, pixel3, pixel4, pixel5, pixel6, pixel7, p...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
train <-read_csv("train.csv")
## Rows: 42000 Columns: 785
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (785): label, pixel0, pixel1, pixel2, pixel3, pixel4, pixel5, pixel6, pi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
plot_img = function(row)
{
img <- matrix(row, nrow=28,ncol=28)
mode(img) = "numeric"
image(img, useRaster=TRUE, axes=FALSE)
}
for (i in 1:10) {
plot_img(test[i,])
}
What is the frequency distribution of the numbers in the dataset?
As we can see the dataset is mainly consisting of 0’s. This is because the the images are mostly a uniform background color.
x_test <- array_reshape(as.matrix(test) / 255, c(nrow(test), 28, 28, 1))
hist(x_test)
hist(x_test[x_test > 0])
For each number, provide the mean pixel intensity. What does this tell you? (5 points)
I won’t show all the means, but when I looked at it all the initial values are 0 and the means increase in value towards the middle. This just means all the drawings are towards the center of the image.
colMeans(test)[1:20]
## pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9
## 0 0 0 0 0 0 0 0 0 0
## pixel10 pixel11 pixel12 pixel13 pixel14 pixel15 pixel16 pixel17 pixel18 pixel19
## 0 0 0 0 0 0 0 0 0 0
Reduce the data by using principal components that account for 95% of the variance. How many components did you generate? Use PCA to generate all possible components (100% of the variance). How many components are possible? Why?
X <- test
#Scaling the pixels
Xreduced <- X/255
Xcov <- cov(Xreduced)
#Reduce to principle components
pcaX <- prcomp(Xcov)
(cumsum(pcaX$sdev^2) / sum(pcaX$sdev^2))[1:20]
## [1] 0.2537120 0.4202695 0.5442091 0.6401227 0.7112063 0.7702888 0.8042082
## [8] 0.8314769 0.8539140 0.8716505 0.8859296 0.8987083 0.9082055 0.9176209
## [15] 0.9258411 0.9331399 0.9388017 0.9440446 0.9486074 0.9529410
Plot the first 10 images generated by PCA
It appears to be just noise. This is because its looking at the variation across a variety of different images - so it makes sense that it looks like a blob.
k<- 1
j<-784
for (i in 1:10){
img<- matrix(pcaX$rotation[k:j],nrow=28, ncol=28)
image(img, useRaster=TRUE, axes=FALSE)
k<- k+784
j<- j+784
}
Now, select only those images that have labels that are 8’s. Re-run PCA that accounts for all of the variance (100%).
First we need to filter the training data down to 8’s only. Then, we drop the label column.
library(dplyr)
train_8 <- train %>% filter(label == 8)
train_8 <- dplyr::select(train_8,c(2:785))
dim(train_8)
## [1] 4063 784
knitr::opts_chunk$set(cache = TRUE)
X <- train_8
#Scaling the pixels
Xreduced <- X/255
Xcov <- cov(Xreduced)
#Reduce to principle components
pcaX <- prcomp(Xcov)
Now Let’s plot the first few images. The images all resemble 8! As we expected.
k<- 1
j<-784
for (i in 1:10){
img<- matrix(pcaX$rotation[k:j],nrow=28, ncol=28)
image(img, useRaster=TRUE, axes=FALSE)
k<- k+784
j<- j+784
}
Multinomial
Here we plug our dataset into a multinomial model. This took quite a long time to run. If I had more time I would fiqure out how to optimize it.
knitr::opts_chunk$set(cache = TRUE)
library(nnet)
train$label <- as.factor(train$label)
X <- train[2:785]/255
X$label <- train$label
model <- multinom(label~., data=X, MaxNWts=1000000)
## # weights: 7860 (7065 variable)
## initial value 96708.573906
## iter 10 value 25322.714106
## iter 20 value 20402.086316
## iter 30 value 19312.872829
## iter 40 value 18703.256586
## iter 50 value 18197.815143
## iter 60 value 17732.985798
## iter 70 value 16739.962157
## iter 80 value 14961.658448
## iter 90 value 13446.085942
## iter 100 value 12442.636014
## final value 12442.636014
## stopped after 100 iterations
The calssification accuracy (checking against the dataset we already used) is 83%. Not the best.
prediction <- predict(model, train[2:785])
prediction <- as.data.frame(prediction, row.names=c('predicted_value'))
## Warning in as.data.frame.factor(prediction, row.names = c("predicted_value")):
## 'row.names' is not a character vector of length 42000 -- omitting it. Will be an
## error!
prediction$actual <- train$label
prediction$equal <- ifelse(prediction$prediction==prediction$actual,1,0)
sum(prediction$equal) / nrow(prediction)
## [1] 0.8342381
Interestingly it look like the had difficulty classifying 8.
library("caret")
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
confusionMatrix(prediction$prediction, prediction$actual)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8 9
## 0 3841 0 12 5 6 31 14 12 4 9
## 1 1 3734 11 3 2 11 4 7 5 3
## 2 9 13 3460 49 22 14 17 43 5 8
## 3 11 37 99 3833 18 235 4 52 32 36
## 4 6 2 22 2 3222 15 8 13 2 17
## 5 9 0 3 12 2 1926 11 3 1 4
## 6 27 5 31 13 22 53 3808 3 2 0
## 7 2 4 14 8 2 8 2 3420 1 19
## 8 217 867 492 391 473 1425 266 234 4005 303
## 9 9 22 33 35 303 77 3 614 6 3789
##
## Overall Statistics
##
## Accuracy : 0.8342
## 95% CI : (0.8306, 0.8378)
## No Information Rate : 0.1115
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8158
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.92957 0.79718 0.82835 0.88095 0.79126 0.50751
## Specificity 0.99754 0.99874 0.99524 0.98608 0.99771 0.99882
## Pos Pred Value 0.97636 0.98757 0.95055 0.87973 0.97371 0.97717
## Neg Pred Value 0.99236 0.97514 0.98131 0.98624 0.97803 0.95331
## Prevalence 0.09838 0.11152 0.09945 0.10360 0.09695 0.09036
## Detection Rate 0.09145 0.08890 0.08238 0.09126 0.07671 0.04586
## Detection Prevalence 0.09367 0.09002 0.08667 0.10374 0.07879 0.04693
## Balanced Accuracy 0.96356 0.89796 0.91179 0.93351 0.89448 0.75317
## Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity 0.92047 0.77710 0.98572 0.90473
## Specificity 0.99588 0.99840 0.87695 0.97086
## Pos Pred Value 0.96065 0.98276 0.46178 0.77469
## Neg Pred Value 0.99135 0.97453 0.99826 0.98925
## Prevalence 0.09850 0.10479 0.09674 0.09971
## Detection Rate 0.09067 0.08143 0.09536 0.09021
## Detection Prevalence 0.09438 0.08286 0.20650 0.11645
## Balanced Accuracy 0.95818 0.88775 0.93134 0.93779
Load Data
test2 <-read_csv("test2.csv")
## Rows: 1459 Columns: 80
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (37): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
train2 <-read_csv("train2.csv")
## Rows: 1460 Columns: 81
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (38): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Descriptive and Inferential Statistics - Univariate Descriptive Stats
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describe(train2)
## vars n mean sd median trimmed mad min
## Id 1 1460 730.50 421.61 730.5 730.50 541.15 1
## MSSubClass 2 1460 56.90 42.30 50.0 49.15 44.48 20
## MSZoning* 3 1460 4.03 0.63 4.0 4.06 0.00 1
## LotFrontage 4 1201 70.05 24.28 69.0 68.94 16.31 21
## LotArea 5 1460 10516.83 9981.26 9478.5 9563.28 2962.23 1300
## Street* 6 1460 2.00 0.06 2.0 2.00 0.00 1
## Alley* 7 91 1.45 0.50 1.0 1.44 0.00 1
## LotShape* 8 1460 2.94 1.41 4.0 3.05 0.00 1
## LandContour* 9 1460 3.78 0.71 4.0 4.00 0.00 1
## Utilities* 10 1460 1.00 0.03 1.0 1.00 0.00 1
## LotConfig* 11 1460 4.02 1.62 5.0 4.27 0.00 1
## LandSlope* 12 1460 1.06 0.28 1.0 1.00 0.00 1
## Neighborhood* 13 1460 13.15 5.89 13.0 13.11 7.41 1
## Condition1* 14 1460 3.03 0.87 3.0 3.00 0.00 1
## Condition2* 15 1460 3.01 0.26 3.0 3.00 0.00 1
## BldgType* 16 1460 1.49 1.20 1.0 1.14 0.00 1
## HouseStyle* 17 1460 4.04 1.91 3.0 4.03 1.48 1
## OverallQual 18 1460 6.10 1.38 6.0 6.08 1.48 1
## OverallCond 19 1460 5.58 1.11 5.0 5.48 0.00 1
## YearBuilt 20 1460 1971.27 30.20 1973.0 1974.13 37.06 1872
## YearRemodAdd 21 1460 1984.87 20.65 1994.0 1986.37 19.27 1950
## RoofStyle* 22 1460 2.41 0.83 2.0 2.26 0.00 1
## RoofMatl* 23 1460 2.08 0.60 2.0 2.00 0.00 1
## Exterior1st* 24 1460 10.62 3.20 13.0 10.93 1.48 1
## Exterior2nd* 25 1460 11.34 3.54 14.0 11.65 2.97 1
## MasVnrType* 26 1452 2.76 0.62 3.0 2.73 0.00 1
## MasVnrArea 27 1452 103.69 181.07 0.0 63.15 0.00 0
## ExterQual* 28 1460 3.54 0.69 4.0 3.65 0.00 1
## ExterCond* 29 1460 4.73 0.73 5.0 4.95 0.00 1
## Foundation* 30 1460 2.40 0.72 2.0 2.46 1.48 1
## BsmtQual* 31 1423 3.26 0.87 3.0 3.43 1.48 1
## BsmtCond* 32 1423 3.81 0.66 4.0 4.00 0.00 1
## BsmtExposure* 33 1422 3.27 1.15 4.0 3.46 0.00 1
## BsmtFinType1* 34 1423 3.73 1.83 3.0 3.79 2.97 1
## BsmtFinSF1 35 1460 443.64 456.10 383.5 386.08 568.58 0
## BsmtFinType2* 36 1422 5.71 0.94 6.0 5.98 0.00 1
## BsmtFinSF2 37 1460 46.55 161.32 0.0 1.38 0.00 0
## BsmtUnfSF 38 1460 567.24 441.87 477.5 519.29 426.99 0
## TotalBsmtSF 39 1460 1057.43 438.71 991.5 1036.70 347.67 0
## Heating* 40 1460 2.04 0.30 2.0 2.00 0.00 1
## HeatingQC* 41 1460 2.54 1.74 1.0 2.42 0.00 1
## CentralAir* 42 1460 1.93 0.25 2.0 2.00 0.00 1
## Electrical* 43 1459 4.68 1.05 5.0 5.00 0.00 1
## 1stFlrSF 44 1460 1162.63 386.59 1087.0 1129.99 347.67 334
## 2ndFlrSF 45 1460 346.99 436.53 0.0 285.36 0.00 0
## LowQualFinSF 46 1460 5.84 48.62 0.0 0.00 0.00 0
## GrLivArea 47 1460 1515.46 525.48 1464.0 1467.67 483.33 334
## BsmtFullBath 48 1460 0.43 0.52 0.0 0.39 0.00 0
## BsmtHalfBath 49 1460 0.06 0.24 0.0 0.00 0.00 0
## FullBath 50 1460 1.57 0.55 2.0 1.56 0.00 0
## HalfBath 51 1460 0.38 0.50 0.0 0.34 0.00 0
## BedroomAbvGr 52 1460 2.87 0.82 3.0 2.85 0.00 0
## KitchenAbvGr 53 1460 1.05 0.22 1.0 1.00 0.00 0
## KitchenQual* 54 1460 3.34 0.83 4.0 3.50 0.00 1
## TotRmsAbvGrd 55 1460 6.52 1.63 6.0 6.41 1.48 2
## Functional* 56 1460 6.75 0.98 7.0 7.00 0.00 1
## Fireplaces 57 1460 0.61 0.64 1.0 0.53 1.48 0
## FireplaceQu* 58 770 3.73 1.13 3.0 3.80 1.48 1
## GarageType* 59 1379 3.28 1.79 2.0 3.11 0.00 1
## GarageYrBlt 60 1379 1978.51 24.69 1980.0 1981.07 31.13 1900
## GarageFinish* 61 1379 2.18 0.81 2.0 2.23 1.48 1
## GarageCars 62 1460 1.77 0.75 2.0 1.77 0.00 0
## GarageArea 63 1460 472.98 213.80 480.0 469.81 177.91 0
## GarageQual* 64 1379 4.86 0.61 5.0 5.00 0.00 1
## GarageCond* 65 1379 4.90 0.52 5.0 5.00 0.00 1
## PavedDrive* 66 1460 2.86 0.50 3.0 3.00 0.00 1
## WoodDeckSF 67 1460 94.24 125.34 0.0 71.76 0.00 0
## OpenPorchSF 68 1460 46.66 66.26 25.0 33.23 37.06 0
## EnclosedPorch 69 1460 21.95 61.12 0.0 3.87 0.00 0
## 3SsnPorch 70 1460 3.41 29.32 0.0 0.00 0.00 0
## ScreenPorch 71 1460 15.06 55.76 0.0 0.00 0.00 0
## PoolArea 72 1460 2.76 40.18 0.0 0.00 0.00 0
## PoolQC* 73 7 2.14 0.90 2.0 2.14 1.48 1
## Fence* 74 281 2.43 0.86 3.0 2.48 0.00 1
## MiscFeature* 75 54 2.91 0.45 3.0 3.00 0.00 1
## MiscVal 76 1460 43.49 496.12 0.0 0.00 0.00 0
## MoSold 77 1460 6.32 2.70 6.0 6.25 2.97 1
## YrSold 78 1460 2007.82 1.33 2008.0 2007.77 1.48 2006
## SaleType* 79 1460 8.51 1.56 9.0 8.92 0.00 1
## SaleCondition* 80 1460 4.77 1.10 5.0 5.00 0.00 1
## SalePrice 81 1460 180921.20 79442.50 163000.0 170783.29 56338.80 34900
## max range skew kurtosis se
## Id 1460 1459 0.00 -1.20 11.03
## MSSubClass 190 170 1.40 1.56 1.11
## MSZoning* 5 4 -1.73 6.25 0.02
## LotFrontage 313 292 2.16 17.34 0.70
## LotArea 215245 213945 12.18 202.26 261.22
## Street* 2 1 -15.49 238.01 0.00
## Alley* 2 1 0.20 -1.98 0.05
## LotShape* 4 3 -0.61 -1.60 0.04
## LandContour* 4 3 -3.16 8.65 0.02
## Utilities* 2 1 38.13 1453.00 0.00
## LotConfig* 5 4 -1.13 -0.59 0.04
## LandSlope* 3 2 4.80 24.47 0.01
## Neighborhood* 25 24 0.02 -1.06 0.15
## Condition1* 9 8 3.01 16.34 0.02
## Condition2* 8 7 13.14 247.54 0.01
## BldgType* 5 4 2.24 3.41 0.03
## HouseStyle* 8 7 0.31 -0.96 0.05
## OverallQual 10 9 0.22 0.09 0.04
## OverallCond 9 8 0.69 1.09 0.03
## YearBuilt 2010 138 -0.61 -0.45 0.79
## YearRemodAdd 2010 60 -0.50 -1.27 0.54
## RoofStyle* 6 5 1.47 0.61 0.02
## RoofMatl* 8 7 8.09 66.28 0.02
## Exterior1st* 15 14 -0.72 -0.37 0.08
## Exterior2nd* 16 15 -0.69 -0.52 0.09
## MasVnrType* 4 3 -0.07 -0.13 0.02
## MasVnrArea 1600 1600 2.66 10.03 4.75
## ExterQual* 4 3 -1.83 3.86 0.02
## ExterCond* 5 4 -2.56 5.29 0.02
## Foundation* 6 5 0.09 1.02 0.02
## BsmtQual* 4 3 -1.31 1.27 0.02
## BsmtCond* 4 3 -3.39 10.14 0.02
## BsmtExposure* 4 3 -1.15 -0.39 0.03
## BsmtFinType1* 6 5 -0.02 -1.39 0.05
## BsmtFinSF1 5644 5644 1.68 11.06 11.94
## BsmtFinType2* 6 5 -3.56 12.32 0.02
## BsmtFinSF2 1474 1474 4.25 20.01 4.22
## BsmtUnfSF 2336 2336 0.92 0.46 11.56
## TotalBsmtSF 6110 6110 1.52 13.18 11.48
## Heating* 6 5 9.83 110.98 0.01
## HeatingQC* 5 4 0.48 -1.51 0.05
## CentralAir* 2 1 -3.52 10.42 0.01
## Electrical* 5 4 -3.06 7.49 0.03
## 1stFlrSF 4692 4358 1.37 5.71 10.12
## 2ndFlrSF 2065 2065 0.81 -0.56 11.42
## LowQualFinSF 572 572 8.99 82.83 1.27
## GrLivArea 5642 5308 1.36 4.86 13.75
## BsmtFullBath 3 3 0.59 -0.84 0.01
## BsmtHalfBath 2 2 4.09 16.31 0.01
## FullBath 3 3 0.04 -0.86 0.01
## HalfBath 2 2 0.67 -1.08 0.01
## BedroomAbvGr 8 8 0.21 2.21 0.02
## KitchenAbvGr 3 3 4.48 21.42 0.01
## KitchenQual* 4 3 -1.42 1.72 0.02
## TotRmsAbvGrd 14 12 0.67 0.87 0.04
## Functional* 7 6 -4.08 16.37 0.03
## Fireplaces 3 3 0.65 -0.22 0.02
## FireplaceQu* 5 4 -0.16 -0.98 0.04
## GarageType* 6 5 0.76 -1.30 0.05
## GarageYrBlt 2010 110 -0.65 -0.42 0.66
## GarageFinish* 3 2 -0.35 -1.41 0.02
## GarageCars 4 4 -0.34 0.21 0.02
## GarageArea 1418 1418 0.18 0.90 5.60
## GarageQual* 5 4 -4.43 18.25 0.02
## GarageCond* 5 4 -5.28 26.77 0.01
## PavedDrive* 3 2 -3.30 9.22 0.01
## WoodDeckSF 857 857 1.54 2.97 3.28
## OpenPorchSF 547 547 2.36 8.44 1.73
## EnclosedPorch 552 552 3.08 10.37 1.60
## 3SsnPorch 508 508 10.28 123.06 0.77
## ScreenPorch 480 480 4.11 18.34 1.46
## PoolArea 738 738 14.80 222.19 1.05
## PoolQC* 3 2 -0.22 -1.90 0.34
## Fence* 4 3 -0.57 -0.88 0.05
## MiscFeature* 4 3 -2.93 10.71 0.06
## MiscVal 15500 15500 24.43 697.64 12.98
## MoSold 12 11 0.21 -0.41 0.07
## YrSold 2010 4 0.10 -1.19 0.03
## SaleType* 9 8 -3.83 14.57 0.04
## SaleCondition* 6 5 -2.74 6.82 0.03
## SalePrice 755000 720100 1.88 6.50 2079.11
hist(train2$SalePrice)
Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
First I chose Year built because I think there is a correlation there, and I was correct.
train2 %>%
ggplot(aes(y=SalePrice, x=YearBuilt)) + geom_point()
train2 %>%
ggplot(aes(y=SalePrice, x=YearRemodAdd)) + geom_point()
Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.
cordf <- train2 %>% select(YearBuilt, LotArea, TotalBsmtSF)
cor <- cor(cordf)
cor
## YearBuilt LotArea TotalBsmtSF
## YearBuilt 1.00000000 0.01422765 0.3914520
## LotArea 0.01422765 1.00000000 0.2608331
## TotalBsmtSF 0.39145200 0.26083313 1.0000000
In each test the alternative hypothesis is true: that the correlation is not equal to 0. This is suspect especially for the test with a pval=0.014. So we may need to be worried about a Familywise error.
cor.test(train2$YearBuilt, train2$LotArea, conf.level=0.8)
##
## Pearson's product-moment correlation
##
## data: train2$YearBuilt and train2$LotArea
## t = 0.54332, df = 1458, p-value = 0.587
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.01934322 0.04776648
## sample estimates:
## cor
## 0.01422765
cor.test(train2$YearBuilt, train2$TotalBsmtSF, conf.level=0.8)
##
## Pearson's product-moment correlation
##
## data: train2$YearBuilt and train2$TotalBsmtSF
## t = 16.243, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.3626548 0.4195023
## sample estimates:
## cor
## 0.391452
cor.test(train2$TotalBsmtSF, train2$LotArea, conf.level=0.8)
##
## Pearson's product-moment correlation
##
## data: train2$TotalBsmtSF and train2$LotArea
## t = 10.317, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2292786 0.2918400
## sample estimates:
## cor
## 0.2608331
Invert your correlation matrix from above.
precision_matrix <- solve(cor)
precision_matrix
## YearBuilt LotArea TotalBsmtSF
## YearBuilt 1.1926351 0.1124547 -0.4961913
## LotArea 0.1124547 1.0836039 -0.3266604
## TotalBsmtSF -0.4961913 -0.3266604 1.2794390
Multiply the correlation matrix by the precision matrix
round(precision_matrix %*% cor)
## YearBuilt LotArea TotalBsmtSF
## YearBuilt 1 0 0
## LotArea 0 1 0
## TotalBsmtSF 0 0 1
LU Decomposition
library(matrixcalc)
##
## Attaching package: 'matrixcalc'
## The following object is masked from 'package:igraph':
##
## %s%
lu.decomposition(cor)
## $L
## [,1] [,2] [,3]
## [1,] 1.00000000 0.0000000 0
## [2,] 0.01422765 1.0000000 0
## [3,] 0.39145200 0.2553154 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0.01422765 0.3914520
## [2,] 0 0.99979757 0.2552637
## [3,] 0 0.00000000 0.7815926
hist(train2$GrLivArea)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
lambda <- fitdistr(train2$GrLivArea,"exponential")
lambda$estimate
## rate
## 0.000659864
sample <- rexp(1000, lambda$estimate)
hist(sample, breaks=50)
5th and 95th percentiles
l<-qexp(0.05, rate = lambda$estimate)
u<-qexp(0.95, rate = lambda$estimate)
c(l,u)
## [1] 77.73313 4539.92351
quantile(train2$GrLivArea, c(0.05, 0.95))
## 5% 95%
## 848.0 2466.1
I’m going to use a linear model since that’s what I’m most comfortable with. First I’m going to remove qualitative data from the table
prep <- train2[,sapply(train2, is.numeric)]
Now we are going to create the model and remove any variables that are not statistically significant
model <- lm(SalePrice~., data=prep)
summary(model)
##
## Call:
## lm(formula = SalePrice ~ ., data = prep)
##
## Residuals:
## Min 1Q Median 3Q Max
## -442182 -16955 -2824 15125 318183
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.351e+05 1.701e+06 -0.197 0.843909
## Id -1.205e+00 2.658e+00 -0.453 0.650332
## MSSubClass -2.001e+02 3.451e+01 -5.797 8.84e-09 ***
## LotFrontage -1.160e+02 6.126e+01 -1.894 0.058503 .
## LotArea 5.422e-01 1.575e-01 3.442 0.000599 ***
## OverallQual 1.866e+04 1.482e+03 12.592 < 2e-16 ***
## OverallCond 5.239e+03 1.368e+03 3.830 0.000135 ***
## YearBuilt 3.164e+02 8.766e+01 3.610 0.000321 ***
## YearRemodAdd 1.194e+02 8.668e+01 1.378 0.168607
## MasVnrArea 3.141e+01 7.022e+00 4.473 8.54e-06 ***
## BsmtFinSF1 1.736e+01 5.838e+00 2.973 0.003014 **
## BsmtFinSF2 8.342e+00 8.766e+00 0.952 0.341532
## BsmtUnfSF 5.005e+00 5.277e+00 0.948 0.343173
## TotalBsmtSF NA NA NA NA
## `1stFlrSF` 4.597e+01 7.360e+00 6.246 6.02e-10 ***
## `2ndFlrSF` 4.663e+01 6.102e+00 7.641 4.72e-14 ***
## LowQualFinSF 3.341e+01 2.794e+01 1.196 0.232009
## GrLivArea NA NA NA NA
## BsmtFullBath 9.043e+03 3.198e+03 2.828 0.004776 **
## BsmtHalfBath 2.465e+03 5.073e+03 0.486 0.627135
## FullBath 5.433e+03 3.531e+03 1.539 0.124182
## HalfBath -1.098e+03 3.321e+03 -0.331 0.740945
## BedroomAbvGr -1.022e+04 2.155e+03 -4.742 2.40e-06 ***
## KitchenAbvGr -2.202e+04 6.710e+03 -3.282 0.001063 **
## TotRmsAbvGrd 5.464e+03 1.487e+03 3.674 0.000251 ***
## Fireplaces 4.372e+03 2.189e+03 1.998 0.046020 *
## GarageYrBlt -4.728e+01 9.106e+01 -0.519 0.603742
## GarageCars 1.685e+04 3.491e+03 4.827 1.58e-06 ***
## GarageArea 6.274e+00 1.213e+01 0.517 0.605002
## WoodDeckSF 2.144e+01 1.002e+01 2.139 0.032662 *
## OpenPorchSF -2.252e+00 1.949e+01 -0.116 0.907998
## EnclosedPorch 7.295e+00 2.062e+01 0.354 0.723590
## `3SsnPorch` 3.349e+01 3.758e+01 0.891 0.373163
## ScreenPorch 5.805e+01 2.041e+01 2.844 0.004532 **
## PoolArea -6.052e+01 2.990e+01 -2.024 0.043204 *
## MiscVal -3.761e+00 6.960e+00 -0.540 0.589016
## MoSold -2.217e+02 4.229e+02 -0.524 0.600188
## YrSold -2.474e+02 8.458e+02 -0.293 0.769917
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36800 on 1085 degrees of freedom
## (339 observations deleted due to missingness)
## Multiple R-squared: 0.8096, Adjusted R-squared: 0.8034
## F-statistic: 131.8 on 35 and 1085 DF, p-value: < 2.2e-16
prep <- prep %>% dplyr::select(MSSubClass, LotArea, OverallQual, OverallCond, YearBuilt, MasVnrArea,BsmtFinSF1,`1stFlrSF`,`2ndFlrSF`,BsmtFullBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageCars,WoodDeckSF,ScreenPorch,PoolArea,SalePrice)
model <- lm(SalePrice~., data=prep)
summary(model)
##
## Call:
## lm(formula = SalePrice ~ ., data = prep)
##
## Residuals:
## Min 1Q Median 3Q Max
## -471826 -16295 -1995 13876 299766
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.391e+05 8.879e+04 -9.450 < 2e-16 ***
## MSSubClass -1.632e+02 2.582e+01 -6.322 3.44e-10 ***
## LotArea 4.209e-01 1.002e-01 4.202 2.81e-05 ***
## OverallQual 1.869e+04 1.125e+03 16.619 < 2e-16 ***
## OverallCond 5.078e+03 9.226e+02 5.504 4.40e-08 ***
## YearBuilt 3.918e+02 4.508e+01 8.691 < 2e-16 ***
## MasVnrArea 2.949e+01 5.874e+00 5.021 5.78e-07 ***
## BsmtFinSF1 9.989e+00 2.975e+00 3.358 0.000807 ***
## `1stFlrSF` 5.770e+01 4.651e+00 12.407 < 2e-16 ***
## `2ndFlrSF` 4.883e+01 4.112e+00 11.874 < 2e-16 ***
## BsmtFullBath 8.955e+03 2.373e+03 3.774 0.000167 ***
## BedroomAbvGr -1.026e+04 1.646e+03 -6.233 6.00e-10 ***
## KitchenAbvGr -1.338e+04 5.098e+03 -2.624 0.008774 **
## TotRmsAbvGrd 5.418e+03 1.214e+03 4.461 8.78e-06 ***
## Fireplaces 2.507e+03 1.734e+03 1.446 0.148471
## GarageCars 1.042e+04 1.695e+03 6.146 1.03e-09 ***
## WoodDeckSF 2.462e+01 7.911e+00 3.112 0.001892 **
## ScreenPorch 5.085e+01 1.689e+01 3.010 0.002657 **
## PoolArea -3.049e+01 2.341e+01 -1.302 0.192979
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34820 on 1433 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.8095, Adjusted R-squared: 0.8071
## F-statistic: 338.4 on 18 and 1433 DF, p-value: < 2.2e-16
Residualus
We need to confirm this is a model that fits the data correctly. The first plot is almost on a line, the second shows a normal distribution, and the last is randomly placed. We are good to move forward with this model.
qqnorm(model$residuals)
qqline(model$residuals)
hist(model$residuals)
plot(x=model$residuals)
Predict Test Data
test_p <- test2 %>% dplyr::select(MSSubClass, LotArea, OverallQual, OverallCond, YearBuilt, MasVnrArea,BsmtFinSF1,`1stFlrSF`,`2ndFlrSF`,BsmtFullBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageCars,WoodDeckSF,ScreenPorch,PoolArea)
prediction <- predict(model, test_p)
prediction <- as.data.frame(prediction, row.names=c('SalePrice'))
## Warning in as.data.frame.numeric(prediction, row.names = c("SalePrice")):
## 'row.names' is not a character vector of length 1459 -- omitting it. Will be an
## error!
prediction$Id <- test2$Id
head(prediction)
## prediction Id
## 1 124344.6 1461
## 2 170976.1 1462
## 3 170828.1 1463
## 4 199895.3 1464
## 5 196684.0 1465
## 6 182311.1 1466
write.csv(prediction,"/Users/jordanglendrange/Documents/Data 605/Final/prediction.csv")
User Name: Jordan Glendrange Score: 1.40373