Let's use the six-page universe that we had in the previous discussion.
For this directed graph, perform the following calculations in R.
A. Form the A matrix. Then, introduce decay and form the B matrix as we
did in the course notes. (5 Points)
B. Start with a uniform rank vector r and perform power iterations on B
till convergence. That is, compute the solution r = B^n × r. Attempt
this for a sufficiently large n so that r actually converges. (5
Points)
C. Compute the eigen-decomposition of B and verify that you indeed get
an eigenvalue of 1 as the largest eigenvalue and that its corresponding
eigenvector is the same vector that you obtained in the previous power
iteration method. Further, this eigenvector has all positive entries and
it sums to 1. (10 points)
D. Use the graph package in R and its page.rank method to compute the
Page Rank of the graph as given in A. Note that you don’t need to apply
decay. The package starts with a connected graph and applies decay
internally. Verify that you do get the same PageRank vector as the two
approaches above. (10 points)
Answers:
We’ll load our required libraries first.
suppressPackageStartupMessages(library(expm))
suppressPackageStartupMessages(library(markovchain))
suppressPackageStartupMessages(library(igraph))
Question 1 - Part A)
# Form matrix A: the data are entered column by column (byrow=FALSE),
# so column j holds the transition probabilities out of page j
# (each column sums to 1, i.e. A is column-stochastic)
A <- matrix( c( 0,   1/2, 1/2, 0,   0,   0,
                1/6, 1/6, 1/6, 1/6, 1/6, 1/6,
                1/3, 1/3, 0,   0,   1/3, 0,
                0,   0,   0,   0,   1/2, 1/2,
                0,   0,   0,   1/2, 0,   1/2,
                0,   0,   0,   1,   0,   0), nrow=6, byrow=FALSE)
A
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.0 0.1666667 0.3333333 0.0 0.0 0
## [2,] 0.5 0.1666667 0.3333333 0.0 0.0 0
## [3,] 0.5 0.1666667 0.0000000 0.0 0.0 0
## [4,] 0.0 0.1666667 0.0000000 0.0 0.5 1
## [5,] 0.0 0.1666667 0.3333333 0.5 0.0 0
## [6,] 0.0 0.1666667 0.0000000 0.5 0.5 0
# Apply decay with damping factor d = 0.85: B = d*A + (1 - d)/n
num_pages <- 6 # number of pages
B <- (0.85 * A) + 0.15/num_pages
B
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.025 0.1666667 0.3083333 0.025 0.025 0.025
## [2,] 0.450 0.1666667 0.3083333 0.025 0.025 0.025
## [3,] 0.450 0.1666667 0.0250000 0.025 0.025 0.025
## [4,] 0.025 0.1666667 0.0250000 0.025 0.450 0.875
## [5,] 0.025 0.1666667 0.3083333 0.450 0.025 0.025
## [6,] 0.025 0.1666667 0.0250000 0.450 0.450 0.025
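As a quick sanity check (a minimal sketch), every column of B should still sum to 1, since each column of A sums to 1 and 0.85 + 6 × (0.15/6) = 1.
# Sanity check: B should remain column-stochastic
colSums(B)   # expect a vector of six 1's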
Question 1 - Part B)
# Uniform rank vector
R <- matrix( c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6), nrow=6, byrow=FALSE)
# Perform numerous power iterations until convergence
r1 <- B%^% 1 %*% R
r1
## [,1]
## [1,] 0.09583333
## [2,] 0.16666667
## [3,] 0.11944444
## [4,] 0.26111111
## [5,] 0.16666667
## [6,] 0.19027778
r10 <- B%^% 10 %*% R
r10
## [,1]
## [1,] 0.05205661
## [2,] 0.07428990
## [3,] 0.05782138
## [4,] 0.34797267
## [5,] 0.19975859
## [6,] 0.26810085
r100 <- B%^% 100 %*% R
r100
## [,1]
## [1,] 0.05170475
## [2,] 0.07367926
## [3,] 0.05741241
## [4,] 0.34870369
## [5,] 0.19990381
## [6,] 0.26859608
r1000 <- B%^% 1000 %*% R
r1000
## [,1]
## [1,] 0.05170475
## [2,] 0.07367926
## [3,] 0.05741241
## [4,] 0.34870369
## [5,] 0.19990381
## [6,] 0.26859608
From the results above, r has already converged by n = 100 (the vectors for n = 100 and n = 1000 are identical to the printed precision), so n = 100 is sufficiently large.
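A more explicit way to confirm convergence (a minimal sketch; the tolerance of 1e-10 is an arbitrary choice) is to iterate until successive rank vectors stop changing:
r_iter <- matrix(rep(1/6, 6), nrow=6)
for (n in 1:1000) {
  r_next <- B %*% r_iter
  if (max(abs(r_next - r_iter)) < 1e-10) break  # stop once the update is negligible
  r_iter <- r_next
}
n       # number of iterations actually needed
r_iter  # should match r100 and r1000 above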
Question 1 - Part C)
We’ll compute the eigen-decomposition of B and verify the largest eigenvalue is 1
b_decomp <- eigen(B)
print(b_decomp$vectors)
## [,1] [,2] [,3]
## [1,] 0.1044385+0i 0.2931457+0i 1.832478e-15+0.000000e+00i
## [2,] 0.1488249+0i 0.5093703+0i 1.431623e-17-2.961329e-23i
## [3,] 0.1159674+0i 0.3414619+0i -1.537973e-15-0.000000e+00i
## [4,] 0.7043472+0i -0.5890805+0i -7.071068e-01+0.000000e+00i
## [5,] 0.4037861+0i -0.1413606+0i 7.071068e-01+0.000000e+00i
## [6,] 0.5425377+0i -0.4135367+0i 0.000000e+00-1.311726e-08i
## [,4] [,5] [,6]
## [1,] 1.832478e-15-0.000000e+00i -0.06471710+0i -0.212296003+0i
## [2,] 1.431623e-17+2.961329e-23i 0.01388698+0i 0.854071294+0i
## [3,] -1.537973e-15+0.000000e+00i 0.07298180+0i -0.363638739+0i
## [4,] -7.071068e-01+0.000000e+00i -0.66058664+0i 0.018399984+0i
## [5,] 7.071068e-01-0.000000e+00i 0.73761812+0i -0.304719509+0i
## [6,] 0.000000e+00+1.311726e-08i -0.09918316+0i 0.008182973+0i
print (b_decomp$values)
## [1] 1.00000000+0i 0.57619235+0i -0.42500000+0i -0.42500000-0i -0.34991524+0i
## [6] -0.08461044+0i
From the output above, the first and largest eigenvalue is 1.
Now we'll show that the corresponding eigenvector matches the vector obtained by the power iteration (up to normalization).
x <- r100 / sqrt(sum(r100^2)) # L2-normalize the r vector to match eigen()'s scaling
x <- round(as.vector(x),3)
# the power-iteration eigenvector
print(x)
## [1] 0.104 0.149 0.116 0.704 0.404 0.543
# our corresponding eigenvector, keeping only the real components
b_compare <- round(as.vector(Re(b_decomp$vectors[,1])), 3)
print(b_compare)
## [1] 0.104 0.149 0.116 0.704 0.404 0.543
# Compare the two vectors element-wise
all.equal(x, b_compare)
## [1] TRUE
So the power-iteration vector and the leading eigenvector are equal (up to normalization).
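The question also asks us to verify that this eigenvector has all positive entries and sums to 1. eigen() returns a unit-length (L2-normalized) vector, so rescaling it by its own sum puts it on the PageRank scale; a minimal sketch:
v1 <- Re(b_decomp$vectors[, 1])
v1 <- v1 / sum(v1)   # rescale so the entries sum to 1
all(v1 > 0)          # expect TRUE: all entries are positive
sum(v1)              # expect 1
round(v1, 6)         # should match r100 up to rounding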
Question 1 - Part D)
# Build the directed graph. A is column-stochastic (column j = links out of page j),
# so we transpose it to get the row-oriented adjacency matrix that igraph expects.
ig <- graph_from_adjacency_matrix(t(A), mode="directed", weighted=TRUE)
LO <- layout_nicely(ig)
plot(ig, asp = -10, layout=LO)
Now use the page_rank method.
pr <- page_rank(ig)
print(pr$vector)
## [1] 0.05170475 0.07367926 0.05741241 0.34870369 0.19990381 0.26859608
Finally, we verify that we get the same PageRank vector as the two approaches above.
# round both to 3 decimal places
pr_vector <- round(as.vector(pr$vector),3)
r_vector <- round(as.vector(r100),3)
# compare element-wise
all.equal(pr_vector, r_vector)
## [1] TRUE
# they are equivalent
Answers:
Question 2 - Part 3)
# readr provides read_csv() and imager provides as.cimg(), both used below
suppressPackageStartupMessages(library(readr))
suppressPackageStartupMessages(library(imager))
pixels <- read_csv("digit-recognizer/train.csv", show_col_types = FALSE)
#dim(pixels)
Plot First Ten Images
# Subselect the first ten images. Note, we start from column 2 to 785
# since that is where the pixel values are.
first_ten <- pixels[1:10,2:785]
for (image_num in 1:10){
  # Pull the 784 pixel values for this image
  p <- pixels[image_num,2:785]
  # Build a data frame of (x, y, value) triples so the image can be plotted
  df1 <- as.data.frame(matrix(nrow=784, ncol=3))
  colnames(df1) <- c("x","y", "value")
  x <- 0
  y <- 1
  for (i in 1:784){
    # walk across a 28-pixel row, then wrap to the next row
    x <- x+1
    if (x %% 29 == 0) {
      y <- y+1
      x <- 1
    }
    df1[i,] <- c(x,y,p[[i]])
  }
  as.cimg(df1) %>% plot # image
}
Divide the pixel values by the maximum of 255 for min-max scaling (applied here to the last image's data frame).
min_max <- df1['value'] / 255
#min_max
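Note that the chunk above rescales only the last image's data frame. To min-max scale every pixel in the dataset, the same division can be applied to the full pixel matrix; a minimal sketch:
pixels_scaled <- as.matrix(pixels[, 2:785]) / 255   # all pixel values now lie in [0, 1]
range(pixels_scaled)                                # expect 0 and 1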
Question 2 - Part 4)
We can use a histogram to see the distribution. The highest frequency is near zero, since the background is mostly black (pixel value 0).
hist(df1$value/255)
Question 2 - Part 5)
# Report the mean pixel intensity ("density") of each of the first ten images
for (i in 1:nrow(first_ten)) {
  density <- mean(unlist(first_ten[i,]))
  print(paste0("Image ", i, " - density: ", density))
}
## [1] "Image 1 - density: 21.2359693877551"
## [1] "Image 2 - density: 56.8992346938775"
## [1] "Image 3 - density: 17.1237244897959"
## [1] "Image 4 - density: 19.1645408163265"
## [1] "Image 5 - density: 65.1696428571428"
## [1] "Image 6 - density: 29.4145408163265"
## [1] "Image 7 - density: 21.8775510204081"
## [1] "Image 8 - density: 30.9795918367347"
## [1] "Image 9 - density: 35.6122448979592"
## [1] "Image 10 - density: 40.2678571428572"
The values above give the mean pixel intensity of each image. All images are mostly composed of pixel values of 0 (black background), so the higher the mean density, the more distinguishing (white/gray) pixels an image contains. Image 3 has the lowest density; like Image 1, it depicts a 1, a relatively thin and simple stroke. Image 5 has the highest mean density; it depicts a large 0 with many more non-black pixels.
Question 2 - Part 6)
Now we’ll reduce the data with PCA
pca <- prcomp(first_ten)
#plot(pca$x[,1], pca$x[,2])
pca.var <- pca$sdev^2
pca.var.per <- round(pca.var/sum(pca.var)*100,1)
barplot(pca.var.per)
To cover at least 95% of the variance we’ll need 8 out of 10 components.
sum(pca.var[1:8])/sum(pca.var) # > 95% of the variance uses 8 components
## [1] 0.974999
To cover 100% of the variance, the number of components needed is number_of_images - 1: with only 10 (centered) images there can be at most 9 non-trivial principal components, so using all 9 keeps all of the available information (i.e. no compression).
sum(pca.var[1:9])/sum(pca.var) # 100% of the variance uses 9 components
## [1] 1
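A compact way to find the smallest number of components that reaches a given variance threshold (a minimal sketch using cumsum):
cum_var <- cumsum(pca.var) / sum(pca.var)
which(cum_var >= 0.95)[1]   # smallest number of components covering at least 95% of the variance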
Question 2 - Part 7)
Plot First Ten Images Generated by PCA
num_components <- 8
# Reconstruct the images from the first 8 principal components
recon <- pca$x[, 1:num_components] %*% t(pca$rotation[, 1:num_components])
# Standardize each pixel column for display (the column means removed by prcomp are not added back)
recon <- scale(recon, scale = TRUE, center = TRUE)
# Iterate through each image
for (image_num in 1:10){
p <- recon[image_num,]
# For each image we have to make a dataframe for the
# plot function which needs: x; y; pixel value
df1 <- as.data.frame(matrix(nrow=784, ncol=3))
colnames(df1) <- c("x","y", "value")
x <- 0
y <- 1
for (i in 1:784){
x <- x+1
if (x %% 29 == 0) {
y <- y+1
x <- 1
}
#print (paste0( x, ' ', y, ' ', p[[i]]))
df1[i,] <- c(x,y,p[[i]])
}
as.cimg(df1) %>% plot # plot the completed image
}
Note that the reconstructed images appear noisy: the surrounding black pixels are (nearly) constant across all images, so they contribute no useful variance and are effectively discarded by PCA. What remains are the pixels near each drawn digit, which gives the 'noisy' look.
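If instead we wanted the reconstruction on the original pixel scale rather than the standardized display above, the column means removed by prcomp() would be added back in place of the scale() call; a minimal sketch:
recon_orig <- pca$x[, 1:num_components] %*% t(pca$rotation[, 1:num_components])
recon_orig <- sweep(recon_orig, 2, pca$center, "+")   # undo the centering done by prcomp()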
Question 2 - Part 8)
Rerun PCA using only the images of the digit 8.
only_eights <- pixels[pixels$label == 8,2:785]
pca <- prcomp(only_eights)
pca.var <- pca$sdev^2
pca.var.per <- round(pca.var/sum(pca.var)*100,1)
length(pca.var)
## [1] 784
sum(pca.var)
## [1] 2950838
sum(pca.var[1:length(pca.var)])/sum(pca.var)
## [1] 1
Plot the first 10 images of the number ‘8.’
num_components <- length(pca.var)
recon <- pca$x[, 1:num_components] %*% t(pca$rotation[, 1:num_components])
recon <- scale(recon, scale = TRUE, center = TRUE)
# Iterate through each image
for (image_num in 1:10){
p <- recon[image_num,]
# For each image we have to make a dataframe for the
# plot function which needs: x; y; pixel value
df1 <- as.data.frame(matrix(nrow=784, ncol=3))
colnames(df1) <- c("x","y", "value")
x <- 0
y <- 1
for (i in 1:784){
x <- x+1
if (x %% 29 == 0) {
y <- y+1
x <- 1
}
df1[i,] <- c(x,y,p[[i]])
}
as.cimg(df1) %>% plot # plot the completed image
}
As expected, only images of 8's appear, but more importantly every pixel of each image is reproduced. Covering 100% of the variance essentially means keeping all of the available information (i.e. all pixels).
Question 2 - Part 9)
Building a multinomial logistic regression model.
# multinom() comes from the nnet package
suppressPackageStartupMessages(library(nnet))
# construct the training set and remove the label column
training_set <- pixels
numbers <- pixels[,1]
training_set <- subset(training_set, select = -c(1))
# construct pca
pca <- prcomp(training_set)
pca.var <- pca$sdev^2
pca.var.per <- head(round(pca.var/sum(pca.var)*100,1), 100)
barplot(pca.var.per)
length(pca.var)
## [1] 784
sum(pca.var)
## [1] 3434021
# Determining the Components Responsible for 90% Variance
for (num_components in 1:length(pca.var)){
account_variance <- sum(pca.var[1:num_components])/sum(pca.var)
if (account_variance >= 0.90) {
break
}
}
print(paste0("90% num components: ", num_components))
## [1] "90% num components: 87"
# predict from pca
trg <- predict(pca, training_set)
trg <- data.frame(trg, numbers)
# Fit the multinomial model on the first 87 principal components
fmla <- as.formula(paste("label ~", paste0("PC", 1:87, collapse = " + ")))
mymodel <- multinom(fmla, data=trg)
## # weights: 890 (792 variable)
## initial value 96708.573906
## iter 10 value 21715.535210
## iter 20 value 20402.362049
## iter 30 value 20098.169753
## iter 40 value 20028.124805
## iter 50 value 20002.190793
## iter 60 value 19995.271690
## iter 70 value 19993.083032
## iter 80 value 19991.688681
## iter 90 value 19990.873235
## iter 100 value 19989.236903
## final value 19989.236903
## stopped after 100 iterations
Confusion matrix and classification accuracy
p <- predict(mymodel, trg)
tab <- table(p, trg$label)
tab
##
## p 0 1 2 3 4 5 6 7 8 9
## 0 3514 3 14 8 10 66 19 2 14 12
## 1 7 4441 60 28 41 80 13 57 110 37
## 2 48 77 3581 118 14 37 14 51 60 4
## 3 53 35 89 3805 4 328 1 6 203 56
## 4 32 1 61 2 3724 77 44 70 41 292
## 5 233 39 17 156 3 2874 51 4 131 19
## 6 89 14 153 34 79 124 3975 5 62 8
## 7 109 10 91 63 16 40 11 4038 34 168
## 8 38 60 91 75 18 105 9 8 3332 37
## 9 9 4 20 62 163 64 0 160 76 3555
sum(diag(tab))/sum(tab) # accuracy
## [1] 0.877119
So with 87 principal components, the multinomial model achieves 87.7% accuracy on the training data.
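Because this accuracy is measured on the same data used to fit the model, a holdout split would give a less optimistic estimate. A minimal sketch, assuming an arbitrary 80/20 split and seed (neither is part of the original analysis):
set.seed(123)                                   # arbitrary seed, for reproducibility only
idx   <- sample(nrow(trg), size = 0.8 * nrow(trg))
fit   <- multinom(fmla, data = trg[idx, ], trace = FALSE)
preds <- predict(fit, newdata = trg[-idx, ])
mean(as.character(preds) == as.character(trg$label[-idx]))   # holdout accuracy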
Answers
Load our required libraries.
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(corrplot))
suppressPackageStartupMessages(library(matrixcalc))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library (MASS))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(reshape2))
Question 3 - Part 1)
From the housing data, we examine the quantitative variables using summary statistics and box plots. Additionally, we'll use scatter plots against the sale price.
houses <- read_csv("question_3/train.csv", show_col_types = FALSE)
# Question 1 - Descriptive and Inferential Statistics (5 points)
# a) Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
area_columns <- c('1stFlrSF','2ndFlrSF','GrLivArea')
h <- houses[area_columns]
h_long <- melt(h)
## No id variables; using all as measure variables
ggplot(h_long, aes(x = variable, y = value)) + geom_boxplot()
summary(h)
## 1stFlrSF 2ndFlrSF GrLivArea
## Min. : 334 Min. : 0 Min. : 334
## 1st Qu.: 882 1st Qu.: 0 1st Qu.:1130
## Median :1087 Median : 0 Median :1464
## Mean :1163 Mean : 347 Mean :1515
## 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.:1777
## Max. :4692 Max. :2065 Max. :5642
room_columns <- c('FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','TotRmsAbvGrd')
h <- houses[room_columns]
h_long <- melt(h)
## No id variables; using all as measure variables
ggplot(h_long, aes(x = variable, y = value)) + geom_boxplot()
summary(h)
## FullBath HalfBath BedroomAbvGr KitchenAbvGr
## Min. :0.000 Min. :0.0000 Min. :0.000 Min. :0.000
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000
## Median :2.000 Median :0.0000 Median :3.000 Median :1.000
## Mean :1.565 Mean :0.3829 Mean :2.866 Mean :1.047
## 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :3.000 Max. :2.0000 Max. :8.000 Max. :3.000
## TotRmsAbvGrd
## Min. : 2.000
## 1st Qu.: 5.000
## Median : 6.000
## Mean : 6.518
## 3rd Qu.: 7.000
## Max. :14.000
misc_columns <- c('OverallQual','OverallCond','MoSold', 'Fireplaces', 'GarageCars')
h <- houses[misc_columns]
h_long <- melt(h)
## No id variables; using all as measure variables
ggplot(h_long, aes(x = variable, y = value)) + geom_boxplot()
summary(houses[misc_columns])
## OverallQual OverallCond MoSold Fireplaces
## Min. : 1.000 Min. :1.000 Min. : 1.000 Min. :0.000
## 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.: 5.000 1st Qu.:0.000
## Median : 6.000 Median :5.000 Median : 6.000 Median :1.000
## Mean : 6.099 Mean :5.575 Mean : 6.322 Mean :0.613
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.: 8.000 3rd Qu.:1.000
## Max. :10.000 Max. :9.000 Max. :12.000 Max. :3.000
## GarageCars
## Min. :0.000
## 1st Qu.:1.000
## Median :2.000
## Mean :1.767
## 3rd Qu.:2.000
## Max. :4.000
pairs(houses[,c("SalePrice","LotArea", "1stFlrSF", "GrLivArea", "YearBuilt", "MSSubClass")],
pch = 19,
cex = 0.15,
lower.panel=NULL)
# b) Derive a correlation matrix for any three quantitative variables in the dataset.
sub_data <- houses[, c("SalePrice","LotArea", "1stFlrSF", "GrLivArea", "YearBuilt", "MSSubClass")]
correlation <- cor(sub_data)
#round(correlation,2)
corrplot(correlation)
In the correlation plot above, we're primarily concerned with the first row, which shows SalePrice's relationship with the other quantitative variables. Judging by icon size and color, the above-grade living area (GrLivArea) and the 1st-floor square footage correlate most strongly with a house's sale price, followed by the year built. However, building class (MSSubClass) and lot size can be quickly eliminated from the regression model.
# c) Test the hypotheses that the correlations between each pairwise set of variables
# is 0 and provide an 80% confidence interval.
# get the upper triangle from the correlation matrix
corr_values <- round(correlation[upper.tri(correlation)],2)
t <- t.test(corr_values, conf.level = 0.80)
print(t)
##
## One Sample t-test
##
## data: corr_values
## t = 3.0032, df = 14, p-value = 0.009491
## alternative hypothesis: true mean is not equal to 0
## 80 percent confidence interval:
## 0.1233101 0.3233566
## sample estimates:
## mean of x
## 0.2233333
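The t-test above treats the 15 correlation values themselves as a sample. An alternative reading of the question is to test each pairwise correlation directly with cor.test() at an 80% confidence level; a minimal sketch for one pair (the same call can be looped over every pair):
# H0: the correlation between SalePrice and GrLivArea is 0
cor.test(houses$SalePrice, houses$GrLivArea, conf.level = 0.80)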
Discuss the meaning of your analysis.
A quick examination of the available data shows some viable quantitative variables that can help predict sale prices. The scatter and correlation plots show that the 1st-floor square footage and above-grade living area have a positive correlation with sale prices, while building class and lot size can be ignored. The box plots also show which variables have many outliers (e.g. above-grade living area) and which have little variance.
Would you be worried about familywise error? Why or why not? Yes. With an 80% confidence interval, the significance level is 0.20. With 15 pairwise correlations (the upper triangle of our 6 × 6 correlation matrix, hence df = 14 in the t-test above), the familywise error rate is:
\(\text{Familywise error rate} = 1 - (1 - \alpha)^{n}\)
Which is: \(1 - (1 - 0.20)^{15} \approx 0.965\)
So with 15 tests, we have roughly a 96.5% chance of making at least one Type I (familywise) error.
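The familywise error rate and a Bonferroni-corrected per-test level can be computed directly; a minimal sketch:
alpha   <- 0.20   # significance level implied by an 80% confidence interval
n_tests <- 15     # pairwise correlations in the upper triangle of a 6 x 6 matrix
1 - (1 - alpha)^n_tests   # familywise error rate, about 0.965
alpha / n_tests           # Bonferroni-corrected per-test significance level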
Question 3 - Part 2)
# Question 2 - Linear Algebra and Correlation. (5 points)
# a) Invert your correlation matrix from above.
# (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
# we use the solve function to find the inverse
precision_matrix <- solve(correlation)
# b) Multiply the correlation matrix by the precision matrix, and then multiply
# the precision matrix by the correlation matrix.
a <- correlation %*% precision_matrix
b <- precision_matrix %*% correlation
round(precision_matrix,2)
## SalePrice LotArea 1stFlrSF GrLivArea YearBuilt MSSubClass
## SalePrice 3.26 -0.19 -0.51 -1.75 -1.22 0.28
## LotArea -0.19 1.14 -0.19 -0.10 0.15 0.10
## 1stFlrSF -0.51 -0.19 1.91 -0.67 -0.15 0.46
## GrLivArea -1.75 -0.10 -0.67 2.56 0.61 -0.54
## YearBuilt -1.22 0.15 -0.15 0.61 1.56 -0.21
## MSSubClass 0.28 0.10 0.46 -0.54 -0.21 1.20
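Both products should be (numerically) the 6 × 6 identity matrix; a quick sketch to confirm:
round(a, 3)   # expect the 6 x 6 identity matrix
round(b, 3)   # expect the 6 x 6 identity matrix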
# c) Conduct LU decomposition on the matrix
precision_decomp <- lu.decomposition(precision_matrix)
round(precision_decomp$L,2)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.00 0.00 0.00 0.00 0.00 0
## [2,] -0.06 1.00 0.00 0.00 0.00 0
## [3,] -0.16 -0.20 1.00 0.00 0.00 0
## [4,] -0.54 -0.17 -0.55 1.00 0.00 0
## [5,] -0.37 0.07 -0.18 -0.20 1.00 0
## [6,] 0.09 0.10 0.30 -0.07 -0.03 1
round(precision_decomp$U,2)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 3.26 -0.19 -0.51 -1.75 -1.22 0.28
## [2,] 0.00 1.13 -0.22 -0.20 0.08 0.12
## [3,] 0.00 0.00 1.78 -0.98 -0.32 0.53
## [4,] 0.00 0.00 0.00 1.05 -0.21 -0.07
## [5,] 0.00 0.00 0.00 0.00 1.00 -0.03
## [6,] 0.00 0.00 0.00 0.00 0.00 1.00
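As a check (a minimal sketch), multiplying L and U back together should reproduce the precision matrix:
all.equal(precision_decomp$L %*% precision_decomp$U, precision_matrix,
          check.attributes = FALSE)   # expect TRUE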
Question 3 - Part 3)
To choose a right-skewed variable, we'll quickly look at histograms of the candidate variables.
# Question 3 - Calculus-Based Probability & Statistics (10 points)
# Many times, it makes sense to fit a closed form distribution to data.
# a) Select a variable in the Kaggle.com training dataset that is skewed to the right,
# shift it so that the minimum value is absolutely above zero if necessary.
# the snippet below is adapted from:
# https://drsimonj.svbtle.com/quick-plot-of-all-variables
houses %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(bins = 100)
# from the above facet histograms, 1stFlrSF is right skewed and the minimum is above zero
print(min(houses$`1stFlrSF`))
## [1] 334
hist(x=houses$`1stFlrSF`,breaks=100)
# b) Then load the MASS package and run fitdistr to fit an exponential probability density function.
# (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).
op <- options(digits = 10)
r <- fitdistr(houses$`1stFlrSF`, "exponential")
print(paste0("The estimated rate is " , round(r$estimate,8)))
## [1] "The estimated rate is 0.00086012"
# c) Find the optimal value of λ for this distribution, and then take 1000 samples from
# this exponential distribution using this value (e.g., rexp(1000, λ)).
# From the output of the fitdistr function
exp_distribution <- rexp(1000, r$estimate)
# d) Plot a histogram and compare it with a histogram of your original variable.
par(mfrow=c(1,2)) # set the plotting area into a 1*2 array
hist(x=houses$`1stFlrSF`,breaks=100, xlim=c(0, 6000), main="1st Floor Square Footage")
hist(exp_distribution,breaks=100, xlim=c(0, 6000), main="Fitted Distribution")
For the exponential distribution, we'll find the 5th and 95th percentiles using its CDF. Integrating the density gives the CDF:
\(f(x) = \lambda e^{-\lambda x}\)
\(F(x) = 1 - e^{-\lambda x}\)
Setting \(F(x) = p\) and isolating \(x\), we end up with:
\(x = \frac{-\ln(1 - p)}{\lambda}\)
So for the 95th percentile:
log(1-0.95)/-r$estimate
## rate
## 3482.918364
\(x = \frac{-\ln(1 - 0.95)}{0.00086012} \approx 3482.92\)
And for the 5th percentile:
log(1-0.05)/-r$estimate
## rate
## 59.63495422
\(x = \frac{-\ln(1 - 0.05)}{0.00086012} \approx 59.63\)
Reference: https://stats.stackexchange.com/a/55760
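The closed-form percentiles can be cross-checked against R's built-in exponential quantile function; a minimal sketch:
qexp(0.95, rate = r$estimate)   # 95th percentile, about 3483
qexp(0.05, rate = r$estimate)   # 5th percentile, about 60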
# f) Also generate a 95% confidence interval from the empirical data, assuming normality.
# Assuming normality, the t.test function gives a 95% confidence interval for the mean
t <- t.test(houses$`1stFlrSF`,conf.level=0.95)
print(paste0("The empirical data's 95% confidence interval is: ", round(t$conf.int[1],2), ' ', round(t$conf.int[2],2)))
## [1] "The empirical data's 95% confidence interval is: 1142.78 1182.47"
# Just for completeness, we'll do the same for the fitted distribution
ft <- t.test(exp_distribution,conf.level=0.95)
print(paste0("The fitted distribution's 95% confidence interval is: ", round(ft$conf.int[1],2), ' ', round(ft$conf.int[2],2)))
## [1] "The fitted distribution's 95% confidence interval is: 1141.58 1286.28"
For the empirical data, we sort the values and read off the observations at the 5% and 95% positions.
# g) Finally, provide the empirical 5th percentile and 95th percentile of the data.
# Sort all the values and index the observations at the 5% and 95% positions
values <- houses$`1stFlrSF` %>%
sort()
lower_bound <- length(values) * 0.05
upper_bound <- length(values) * 0.95
# Show the actual values
print(paste0("The 5th and 95th percentiles are: ", values[lower_bound], " and ", values[upper_bound]))
## [1] "The 5th and 95th percentiles are: 672 and 1831"
Discussion
This section explores the feasibility of using a fitted closed-form distribution instead of the empirical data. Comparing their histograms side by side, both are right skewed, but they behave differently near zero: the fitted exponential has its highest density at zero, whereas the empirical data has essentially no mass below the minimum 1st-floor size of 334 square feet.
The means and 95% confidence intervals of the two distributions are comparable (the same order of magnitude, with overlapping intervals), but the tail percentiles differ considerably (672/1831 empirically versus roughly 60/3483 for the fitted exponential), so the exponential is only a rough approximation of this variable.
Still, when working with problematic data (e.g. too few observations or strange outliers), or when a closed-form model is required, the fitdistr function can be useful.
Question 3 - Part 4)
Next, a linear model is fitted using the strongest predictors identified above.
# Create a linear model using our few predictors
multiple_model <- lm(SalePrice ~ OverallQual + `1stFlrSF` + YearBuilt + GrLivArea , data=houses)
summary(multiple_model)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + `1stFlrSF` + YearBuilt +
## GrLivArea, data = houses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -462172.06 -20306.72 -1542.95 17129.11 285342.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -968324.61646 81056.96825 -11.94622 < 2.22e-16 ***
## OverallQual 23810.52203 1135.98289 20.96028 < 2.22e-16 ***
## `1stFlrSF` 35.53502 3.31918 10.70596 < 2.22e-16 ***
## YearBuilt 449.90971 42.71840 10.53199 < 2.22e-16 ***
## GrLivArea 50.02524 2.72883 18.33213 < 2.22e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39250.5 on 1455 degrees of freedom
## Multiple R-squared: 0.7565598, Adjusted R-squared: 0.7558905
## F-statistic: 1130.457 on 4 and 1455 DF, p-value: < 2.2204e-16
We see that all predictors are significant and that the model explains 75.7% of the variance in sale price (adjusted R² ≈ 0.756). The overall F-test p-value is far below 0.05, so the multiple regression model above is usable.
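Before relying on the model, the standard residual diagnostics can be examined; a minimal sketch:
par(mfrow = c(2, 2))   # arrange the four standard diagnostic plots
plot(multiple_model)   # residuals vs fitted, Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))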
test_houses <- read_csv('question_3/test.csv', show_col_types = FALSE)
predictions <- predict(multiple_model, newdata = test_houses)
# Create an output dataframe
Id <- test_houses$Id
SalePrice <- predictions
output <- cbind(Id, SalePrice)
write.csv(output,"predictions.csv", row.names = FALSE, quote=FALSE)
Kaggle Check
After submitting the generated predictions on Kaggle, my file received a score of 0.59953. My Kaggle name is “Cliff Lee”.
Kaggle score