1. Playing with PageRank

Let’s use the 6-page universe that we had in the previous discussion. For this directed graph, perform the following calculations in R.
A. Form the A matrix. Then, introduce decay and form the B matrix as we did in the course notes. (5 Points)
B. Start with a uniform rank vector r and perform power iterations on B till convergence. That is, compute the solution r = B^n × r. Attempt this for a sufficiently large n so that r actually converges. (5 Points)
C. Compute the eigen-decomposition of B and verify that you indeed get an eigenvalue of 1 as the largest eigenvalue and that its corresponding eigenvector is the same vector that you obtained in the previous power iteration method. Further, this eigenvector has all positive entries and it sums to 1. (10 points)
D. Use the graph package in R and its page.rank method to compute the Page Rank of the graph as given in A. Note that you don’t need to apply decay. The package starts with a connected graph and applies decay internally. Verify that you do get the same PageRank vector as the two approaches above. (10 points)

Answers:

We’ll load our required libraries first.

suppressPackageStartupMessages(library(expm))
suppressPackageStartupMessages(library(markovchain))
suppressPackageStartupMessages(library(igraph))

Question 1 - Part A)

# Form matrix A. With byrow=FALSE, each line of six values below fills one
# COLUMN of A -- the outgoing-link probabilities of one page -- so A is
# column-stochastic (each column sums to 1).
A <- matrix( c(  0,   1/2, 1/2,   0,   0,    0, 
                 1/6, 1/6, 1/6, 1/6, 1/6,  1/6, 
                 1/3, 1/3,   0,   0, 1/3,    0, 
                 0,     0,   0,   0, 1/2,  1/2, 
                 0,     0,   0, 1/2,    0,  1/2,
                 0,     0,   0,   1,    0,    0), nrow=6, byrow=FALSE)
A
##      [,1]      [,2]      [,3] [,4] [,5] [,6]
## [1,]  0.0 0.1666667 0.3333333  0.0  0.0    0
## [2,]  0.5 0.1666667 0.3333333  0.0  0.0    0
## [3,]  0.5 0.1666667 0.0000000  0.0  0.0    0
## [4,]  0.0 0.1666667 0.0000000  0.0  0.5    1
## [5,]  0.0 0.1666667 0.3333333  0.5  0.0    0
## [6,]  0.0 0.1666667 0.0000000  0.5  0.5    0
# Apply decay (damping factor d = 0.85): B = d*A + (1 - d)/N
num_pages <- 6 # N, the number of pages
B <- (0.85 * A) + 0.15/num_pages
B
##       [,1]      [,2]      [,3]  [,4]  [,5]  [,6]
## [1,] 0.025 0.1666667 0.3083333 0.025 0.025 0.025
## [2,] 0.450 0.1666667 0.3083333 0.025 0.025 0.025
## [3,] 0.450 0.1666667 0.0250000 0.025 0.025 0.025
## [4,] 0.025 0.1666667 0.0250000 0.025 0.450 0.875
## [5,] 0.025 0.1666667 0.3083333 0.450 0.025 0.025
## [6,] 0.025 0.1666667 0.0250000 0.450 0.450 0.025

Question 1 - Part B)

# Uniform rank vector
R <- matrix( c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6), nrow=6, byrow=FALSE)

# Perform numerous power iterations until convergence
r1 <- B%^% 1 %*% R
r1
##            [,1]
## [1,] 0.09583333
## [2,] 0.16666667
## [3,] 0.11944444
## [4,] 0.26111111
## [5,] 0.16666667
## [6,] 0.19027778
r10 <- B%^% 10 %*% R
r10
##            [,1]
## [1,] 0.05205661
## [2,] 0.07428990
## [3,] 0.05782138
## [4,] 0.34797267
## [5,] 0.19975859
## [6,] 0.26810085
r100 <- B%^% 100 %*% R
r100
##            [,1]
## [1,] 0.05170475
## [2,] 0.07367926
## [3,] 0.05741241
## [4,] 0.34870369
## [5,] 0.19990381
## [6,] 0.26859608
r1000 <- B%^% 1000 %*% R
r1000
##            [,1]
## [1,] 0.05170475
## [2,] 0.07367926
## [3,] 0.05741241
## [4,] 0.34870369
## [5,] 0.19990381
## [6,] 0.26859608

From the above, r has already converged by n = 100 (the vector no longer changes between n = 100 and n = 1000), so n = 100 is sufficiently large.
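
Rather than picking n by inspection, the power iteration can also be run until successive rank vectors stop changing. A minimal sketch (the names r and r_new are just illustrative):

# Iterate B %*% r until the L1 change between updates is negligible
r <- matrix(rep(1/6, 6), nrow = 6)        # uniform starting vector
repeat {
  r_new <- B %*% r
  if (sum(abs(r_new - r)) < 1e-12) break  # converged
  r <- r_new
}
round(as.vector(r), 6)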

Question 1 - Part C)

We’ll compute the eigen-decomposition of B and verify that the largest eigenvalue is 1.

b_decomp <- eigen(B)
print(b_decomp$vectors)
##              [,1]          [,2]                        [,3]
## [1,] 0.1044385+0i  0.2931457+0i  1.832478e-15+0.000000e+00i
## [2,] 0.1488249+0i  0.5093703+0i  1.431623e-17-2.961329e-23i
## [3,] 0.1159674+0i  0.3414619+0i -1.537973e-15-0.000000e+00i
## [4,] 0.7043472+0i -0.5890805+0i -7.071068e-01+0.000000e+00i
## [5,] 0.4037861+0i -0.1413606+0i  7.071068e-01+0.000000e+00i
## [6,] 0.5425377+0i -0.4135367+0i  0.000000e+00-1.311726e-08i
##                             [,4]           [,5]            [,6]
## [1,]  1.832478e-15-0.000000e+00i -0.06471710+0i -0.212296003+0i
## [2,]  1.431623e-17+2.961329e-23i  0.01388698+0i  0.854071294+0i
## [3,] -1.537973e-15+0.000000e+00i  0.07298180+0i -0.363638739+0i
## [4,] -7.071068e-01+0.000000e+00i -0.66058664+0i  0.018399984+0i
## [5,]  7.071068e-01-0.000000e+00i  0.73761812+0i -0.304719509+0i
## [6,]  0.000000e+00+1.311726e-08i -0.09918316+0i  0.008182973+0i
print (b_decomp$values)
## [1]  1.00000000+0i  0.57619235+0i -0.42500000+0i -0.42500000-0i -0.34991524+0i
## [6] -0.08461044+0i

From the above output, the first and largest eigenvalue is indeed 1.

Now we’ll show that the corresponding eigenvector is the same vector obtained by the power iteration above (up to scaling).

x <- r100 / sqrt(sum(r100^2)) # normalize r to unit length, matching eigen()'s scaling
x <- round(as.vector(x),3)

# the power iteration eigenvector
print(x)
## [1] 0.104 0.149 0.116 0.704 0.404 0.543
# our corresponding eigenvector, keeping the real components only
b_compare <- round(as.vector(Re(b_decomp$vectors[,1])), 3)

print(b_compare)
## [1] 0.104 0.149 0.116 0.704 0.404 0.543
# We'll compare the two eigenvectors
setequal(x, b_compare)
## [1] TRUE

So we see the power iteration vector and the leading eigenvector are equal (up to normalization).
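
Part C also asks to confirm that this eigenvector has all positive entries and sums to 1 once rescaled. A short sketch using the objects above (v is an illustrative name):

# Rescale the leading eigenvector so its entries sum to 1
v <- Re(b_decomp$vectors[, 1])
v <- v / sum(v)
all(v > 0)              # every entry is positive
round(as.vector(v), 6)  # matches the converged PageRank vector r100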

Question 1 - Part D)

# Build the directed graph. A is column-stochastic (columns = outgoing links),
# while igraph expects rows to represent outgoing edges, hence t(A).
ig <- graph.adjacency(t(A), mode="directed", weighted=TRUE)
LO = layout_nicely(ig)
plot(ig, asp = -10, layout=LO)

Now use the page_rank method.

pr <- page_rank(ig)
print(pr$vector)
## [1] 0.05170475 0.07367926 0.05741241 0.34870369 0.19990381 0.26859608

Finally, we verify that we get the same PageRank vector as the two approaches above.

# round both to 3 decimal places
pr_vector <- round(pr$vector,3)
r_vector <- round(as.vector(r100),3)
setequal(pr_vector, r_vector)
## [1] TRUE
# they are equivalent
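
Note that setequal compares the two rounded vectors as unordered sets of values; an element-wise comparison is stricter and confirms the entries also agree position by position. A sketch:

# Stricter positional comparison of the two rounded vectors
all.equal(as.vector(pr_vector), r_vector)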





2. Digit Recognizer Competition

  1. Login to Kaggle.com
  2. Download data from https://www.kaggle.com/c/digit-recognizer/overview.
  3. Using the training.csv file, plot representations of the first 10 images to understand the data format. Go ahead and divide all pixels by 255 to produce values between 0 and 1. (This is equivalent to min-max scaling.) (5 points)
  4. What is the frequency distribution of the numbers in the dataset? (5 points)
  5. For each number, provide the mean pixel intensity. What does this tell you? (5 points)
  6. Reduce the data by using principal components that account for 95% of the variance. How many components did you generate? Use PCA to generate all possible components (100% of the variance). How many components are possible? Why? (5 points)
  7. Plot the first 10 images generated by PCA. They will appear to be noise. Why? (5 points)
  8. Now, select only those images that have labels that are 8’s. Re-run PCA that accounts for all of the variance (100%). Plot the first 10 images. What do you see? (5 points)
  9. An incorrect approach to predicting the images would be to build a linear regression model with y as the digit values and X as the pixel matrix. Instead, we can build a multinomial model that classifies the digits. Build a multinomial model on the entirety of the training set. Then provide its classification accuracy (percent correctly identified) as well as a matrix of observed versus forecast values (confusion matrix). This matrix will be a 10 x 10, and correct classifications will be on the diagonal. (10 points)

Answers:

Question 2 - Part 3)
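
The code in this question additionally relies on readr (for read_csv), imager (for as.cimg), and nnet (for multinom); they are assumed to be installed and are loaded here:

suppressPackageStartupMessages(library(readr))   # read_csv
suppressPackageStartupMessages(library(imager))  # as.cimg
suppressPackageStartupMessages(library(nnet))    # multinom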

pixels <- read_csv("digit-recognizer/train.csv", show_col_types = FALSE)
#dim(pixels)

Plot First Ten Images

# Subselect the first ten images (used again in Part 5). The pixel values are
# in columns 2 to 785; column 1 is the label.
first_ten <- pixels[1:10,2:785]

for (image_num in 1:10){
  p <- pixels[image_num,2:785]
  df1 <- as.data.frame(matrix(nrow=784, ncol=3))
  colnames(df1) <- c("x","y", "value")
  x <- 0
  y <- 1
  for (i in 1:784){
    x <- x+1
    if (x %% 29 == 0) {
      y <- y+1
      x <- 1
    }
    df1[i,] <- c(x,y,p[[i]])
  }
  as.cimg(df1) %>% plot    # image
}
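
The inner loop simply converts the flat pixel index into (x, y) coordinates on the 28 × 28 grid. The same (x, y, value) data frame can be built without a loop, for example (a sketch reproducing the same mapping; coords and df_alt are illustrative names, and the first image is used here):

# x varies fastest (position within a row), y is the row index
coords <- expand.grid(x = 1:28, y = 1:28)
df_alt <- data.frame(coords, value = as.numeric(unlist(pixels[1, 2:785])))
as.cimg(df_alt) %>% plot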

Dividing pixel values by the maximum value of 255 for min-max scaling.

min_max <- df1['value'] / 255
#min_max
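
The scaling above only touches the last image’s data frame (df1). To apply the same min-max scaling to every pixel in the dataset, the full pixel matrix can be divided by 255 (a sketch; pixels_scaled is an illustrative name, and the later parts continue to use the raw 0-255 values, as above):

pixels_scaled <- pixels
pixels_scaled[, -1] <- pixels_scaled[, -1] / 255   # leave the label column untouched
range(as.matrix(pixels_scaled[, -1]))              # now between 0 and 1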

Question 2 - Part 4)

Frequency Distribution

We can use a histogram of the pixel values (here from the last image’s data frame, df1) to see the distribution. The highest frequency is near zero, since the background color is mostly black (value of 0).

hist(df1$value/255)
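
If “the numbers in the dataset” is read as the digit labels rather than the pixel intensities, their frequency distribution can be tabulated directly (a sketch):

# Count of images for each digit 0-9
table(pixels$label)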

Question 2 - Part 5)

c <- 1
for (i in 1:nrow(first_ten)) {
  density <- mean(unlist(first_ten[i,]))
  print (paste0("Image ", c, " - density: ", density))
  c <- c + 1
}
## [1] "Image 1 - density: 21.2359693877551"
## [1] "Image 2 - density: 56.8992346938775"
## [1] "Image 3 - density: 17.1237244897959"
## [1] "Image 4 - density: 19.1645408163265"
## [1] "Image 5 - density: 65.1696428571428"
## [1] "Image 6 - density: 29.4145408163265"
## [1] "Image 7 - density: 21.8775510204081"
## [1] "Image 8 - density: 30.9795918367347"
## [1] "Image 9 - density: 35.6122448979592"
## [1] "Image 10 - density: 40.2678571428572"

The above values are the mean pixel intensities of the first ten images. All images are mostly composed of pixel values of 0 (black background), so the means sit well below the 0-255 midpoint. The higher the mean intensity, the more non-black (white/gray) pixels an image contains. Image 3 has the lowest mean intensity and represents a 1, a relatively simple stroke, while Image 5 has the highest mean intensity and represents a large 0 with many more non-black pixels.
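
If “for each number” is read as each digit class (0-9) rather than each of the first ten images, the mean pixel intensity per digit can be computed over the whole training set. A base-R sketch (img_means is an illustrative name):

# Average pixel intensity of each image, then average those by digit label
img_means <- rowMeans(pixels[, -1])
aggregate(img_means, by = list(label = pixels$label), FUN = mean)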

Question 2 - Part 6)

Now we’ll reduce the data with PCA (applied here to the ten images selected above).

pca <- prcomp(first_ten)
#plot(pca$x[,1], pca$x[,2])

pca.var <- pca$sdev^2
pca.var.per <- round(pca.var/sum(pca.var)*100,1)
barplot(pca.var.per)

To cover at least 95% of the variance we need 8 components.

sum(pca.var[1:8])/sum(pca.var)  # > 95% of the variance uses 8 components
## [1] 0.974999

To cover 100% of the variance, the number of components is number_of_images − 1 = 9: after centering, 10 images span a subspace of dimension at most 9, so only 9 components can carry non-zero variance. Using all 9 components means keeping all of the information in these images (i.e. no compression).

sum(pca.var[1:9])/sum(pca.var)  # 100% of the variance uses 9 components
## [1] 1
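
The number of components needed for a given variance threshold can also be read directly off the cumulative proportion of variance (a sketch; cum_var is an illustrative name):

# Smallest number of components whose cumulative variance reaches 95%
cum_var <- cumsum(pca.var) / sum(pca.var)
which(cum_var >= 0.95)[1]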

Question 2 - Part 7)

Plot First Ten Images Generated by PCA

num_components <- 8
recon <- pca$x[, 1:num_components] %*% t(pca$rotation[, 1:num_components])
recon <- scale(recon, scale = TRUE, center = TRUE)

# Iterate through each image
for (image_num in 1:10){
  p <- recon[image_num,]
  
  # For each image we have to make a dataframe for the 
  # plot function which needs: x; y; pixel value
  df1 <- as.data.frame(matrix(nrow=784, ncol=3))
  colnames(df1) <- c("x","y", "value")
  x <- 0
  y <- 1
  for (i in 1:784){
    x <- x+1
    if (x %% 29 == 0) {
      y <- y+1
      x <- 1
    }
    #print (paste0( x, ' ', y, ' ', p[[i]]))
    df1[i,] <- c(x,y,p[[i]])
  }
  as.cimg(df1) %>% plot    # plot the completed image
}

The reconstructed images appear noisy because these principal components are fit across all ten digits at once: the surrounding black background pixels are nearly identical in every image and so contribute almost no variance, while the retained components spread the remaining variance over the pixels near the digit strokes of several different digits. What is left after reconstruction is a mixture of stroke pixels, which reads as noise.

Question 2 - Part 8)

Re-run PCA using only the images whose label is 8.

only_eights <- pixels[pixels$label == 8,2:785]

pca <- prcomp(only_eights)

pca.var <- pca$sdev^2
pca.var.per <- round(pca.var/sum(pca.var)*100,1)

length(pca.var)
## [1] 784
sum(pca.var)
## [1] 2950838
sum(pca.var[1:length(pca.var)])/sum(pca.var)
## [1] 1

Plot the first 10 images of the number ‘8.’

num_components <- length(pca.var)
recon <- pca$x[, 1:num_components] %*% t(pca$rotation[, 1:num_components])
recon <- scale(recon, scale = TRUE, center = TRUE)

# Iterate through each image
for (image_num in 1:10){
  p <- recon[image_num,]
  
  # For each image we have to make a dataframe for the 
  # plot function which needs: x; y; pixel value
  df1 <- as.data.frame(matrix(nrow=784, ncol=3))
  colnames(df1) <- c("x","y", "value")
  x <- 0
  y <- 1
  for (i in 1:784){
    x <- x+1
    if (x %% 29 == 0) {
      y <- y+1
      x <- 1
    }
    df1[i,] <- c(x,y,p[[i]])
  }
  as.cimg(df1) %>% plot    # plot the completed image
}

We notice that only images of 8’s appear and, more importantly, every pixel of each image appears: covering 100% of the variance essentially means keeping all of the available information (i.e. all pixels), so each image is reconstructed exactly.

Question 2 - Part 9)

Building a multinomial model

# construct the training set and remove the label column
training_set <- pixels
numbers <- pixels[,1]
training_set <- subset(training_set, select = -c(1))  

# construct pca
pca <- prcomp(training_set)
pca.var <- pca$sdev^2
pca.var.per <- head(round(pca.var/sum(pca.var)*100,1), 100)
barplot(pca.var.per)

length(pca.var)
## [1] 784
sum(pca.var)
## [1] 3434021
# Determining the Components Responsible for 90% Variance
for (num_components in 1:length(pca.var)){
  account_variance <- sum(pca.var[1:num_components])/sum(pca.var)  
  if (account_variance >= 0.90) {
    break
  }
}
print(paste0("90% num components: ", num_components))
## [1] "90% num components: 87"
# predict from pca
trg <- predict(pca, training_set)
trg <- data.frame(trg, numbers)

# Include the first 87 principal components as predictors
mymodel <- multinom(reformulate(paste0("PC", 1:num_components), response = "label"), data = trg)
## # weights:  890 (792 variable)
## initial  value 96708.573906 
## iter  10 value 21715.535210
## iter  20 value 20402.362049
## iter  30 value 20098.169753
## iter  40 value 20028.124805
## iter  50 value 20002.190793
## iter  60 value 19995.271690
## iter  70 value 19993.083032
## iter  80 value 19991.688681
## iter  90 value 19990.873235
## iter 100 value 19989.236903
## final  value 19989.236903 
## stopped after 100 iterations

Confusion matrix and classification accuracy

p <- predict(mymodel, trg)
tab <- table(p, trg$label)
tab
##    
## p      0    1    2    3    4    5    6    7    8    9
##   0 3514    3   14    8   10   66   19    2   14   12
##   1    7 4441   60   28   41   80   13   57  110   37
##   2   48   77 3581  118   14   37   14   51   60    4
##   3   53   35   89 3805    4  328    1    6  203   56
##   4   32    1   61    2 3724   77   44   70   41  292
##   5  233   39   17  156    3 2874   51    4  131   19
##   6   89   14  153   34   79  124 3975    5   62    8
##   7  109   10   91   63   16   40   11 4038   34  168
##   8   38   60   91   75   18  105    9    8 3332   37
##   9    9    4   20   62  163   64    0  160   76 3555
sum(diag(tab))/sum(tab)  # accuracy
## [1] 0.877119
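
A per-digit view of the same confusion matrix shows which digits the model identifies most reliably (a sketch; rows of tab are predictions, columns are the observed labels):

# Recall for each digit: correct predictions divided by the observed count
round(diag(tab) / colSums(tab), 3)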

So with the first 87 principal components (about 90% of the variance), the multinomial model correctly classifies 87.7% of the training images.




3. House Prices: Advanced Regression Techniques Competition

  1. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not? 5 points
  2. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix. 5 points
  3. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss. 10 points
  4. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score. 10 points

Answers

Load our required libraries.

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(corrplot))
suppressPackageStartupMessages(library(matrixcalc))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library (MASS))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(reshape2))

Question 3 - Part 1)

From the housing data, we examine a selection of quantitative variables with summary statistics and box plots. Additionally, we’ll use scatter plots against the sale price.

houses <- read_csv("question_3/train.csv", show_col_types = FALSE)
# Question 1 - Descriptive and Inferential Statistics (5 points)
# a) Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. 

area_columns <- c('1stFlrSF','2ndFlrSF','GrLivArea')
h <- houses[area_columns]
h_long <- melt(h)
## No id variables; using all as measure variables
ggplot(h_long, aes(x = variable, y = value)) +   geom_boxplot()

summary(h)
##     1stFlrSF       2ndFlrSF      GrLivArea   
##  Min.   : 334   Min.   :   0   Min.   : 334  
##  1st Qu.: 882   1st Qu.:   0   1st Qu.:1130  
##  Median :1087   Median :   0   Median :1464  
##  Mean   :1163   Mean   : 347   Mean   :1515  
##  3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:1777  
##  Max.   :4692   Max.   :2065   Max.   :5642
room_columns <- c('FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','TotRmsAbvGrd')
h <- houses[room_columns]
h_long <- melt(h)
## No id variables; using all as measure variables
ggplot(h_long, aes(x = variable, y = value)) +   geom_boxplot()

summary(h)
##     FullBath        HalfBath       BedroomAbvGr    KitchenAbvGr  
##  Min.   :0.000   Min.   :0.0000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000  
##  Median :2.000   Median :0.0000   Median :3.000   Median :1.000  
##  Mean   :1.565   Mean   :0.3829   Mean   :2.866   Mean   :1.047  
##  3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000  
##  Max.   :3.000   Max.   :2.0000   Max.   :8.000   Max.   :3.000  
##   TotRmsAbvGrd   
##  Min.   : 2.000  
##  1st Qu.: 5.000  
##  Median : 6.000  
##  Mean   : 6.518  
##  3rd Qu.: 7.000  
##  Max.   :14.000
misc_columns <- c('OverallQual','OverallCond','MoSold', 'Fireplaces', 'GarageCars')
h <- houses[misc_columns]
h_long <- melt(h)
## No id variables; using all as measure variables
ggplot(h_long, aes(x = variable, y = value)) +   geom_boxplot()

summary(houses[misc_columns])
##   OverallQual      OverallCond        MoSold         Fireplaces   
##  Min.   : 1.000   Min.   :1.000   Min.   : 1.000   Min.   :0.000  
##  1st Qu.: 5.000   1st Qu.:5.000   1st Qu.: 5.000   1st Qu.:0.000  
##  Median : 6.000   Median :5.000   Median : 6.000   Median :1.000  
##  Mean   : 6.099   Mean   :5.575   Mean   : 6.322   Mean   :0.613  
##  3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.: 8.000   3rd Qu.:1.000  
##  Max.   :10.000   Max.   :9.000   Max.   :12.000   Max.   :3.000  
##    GarageCars   
##  Min.   :0.000  
##  1st Qu.:1.000  
##  Median :2.000  
##  Mean   :1.767  
##  3rd Qu.:2.000  
##  Max.   :4.000
pairs(houses[,c("SalePrice","LotArea", "1stFlrSF", "GrLivArea", "YearBuilt", "MSSubClass")], 
      pch = 19,  
      cex = 0.15,
      lower.panel=NULL)

# b) Derive a correlation matrix for any three quantitative variables in the dataset. 
sub_data <- houses[, c("SalePrice","LotArea", "1stFlrSF", "GrLivArea", "YearBuilt", "MSSubClass")]
correlation <- cor(sub_data)
#round(correlation,2)

corrplot(correlation)

In the correlation plot above, we’re primarily concerned with the first row, which shows SalePrice’s relationship with the other quantitative variables. Judging by circle size and color, 1st floor square footage and above-grade living area (GrLivArea) correlate most strongly with a house’s sale price, followed by year built. Building class (MSSubClass) and lot area, on the other hand, show little correlation and can be quickly eliminated from the regression model.

# c) Test the hypotheses that the correlations between each pairwise set of variables 
# is 0 and provide an 80% confidence interval. 

# get the upper triangle from the correlation matrix
corr_values <- round(correlation[upper.tri(correlation)],2)
t <- t.test(corr_values, conf.level = 0.80)
print(t)
## 
##  One Sample t-test
## 
## data:  corr_values
## t = 3.0032, df = 14, p-value = 0.009491
## alternative hypothesis: true mean is not equal to 0
## 80 percent confidence interval:
##  0.1233101 0.3233566
## sample estimates:
## mean of x 
## 0.2233333
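
The question can also be read as testing each pairwise correlation individually; cor.test tests H0: ρ = 0 for a single pair and returns a confidence interval at the requested level. A sketch for one pair (the other pairs follow the same pattern):

# Test that the correlation between SalePrice and GrLivArea is zero,
# with an 80% confidence interval for the correlation
cor.test(houses$SalePrice, houses$GrLivArea, conf.level = 0.80)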

Discuss the meaning of your analysis.

A quick examination of the available data shows several viable quantitative variables for predicting sale price. The scatter and correlation plots clearly show that 1st floor square footage and above-grade living area have a positive correlation with sale price, while building class and lot size can be ignored. The box plots also show which variables have many outliers (e.g. above-grade living area) and which have little variance.

Would you be worried about familywise error? Why or why not? Yes: with an 80% confidence interval, each test uses a significance level of 0.20. With 15 pairwise correlations (the upper triangle of our 6 × 6 correlation matrix, which gives the 14 degrees of freedom above), the familywise error rate is:

\(Familywise \hspace{0.5em} error \hspace{0.5em} rate = 1 - (1 - \alpha)^n\)

which is: \(Error \hspace{0.5em} rate = 1 - (1 - 0.20)^{15} \approx 0.965\)

So across these 15 tests, we have roughly a 96.5% chance of making at least one Type I (familywise) error.

Question 3 - Part 2)

# Question 2 - Linear Algebra and Correlation. (5 points)
# a) Invert your correlation matrix from above. 
#    (This is known as the precision matrix and contains variance inflation factors on the diagonal.) 


# we use the solve function to find the inverse
precision_matrix <- solve(correlation)

# b) Multiply the correlation matrix by the precision matrix, and then multiply 
#    the precision matrix by the correlation matrix. 
a <- correlation %*% precision_matrix
b <- precision_matrix %*% correlation
round(precision_matrix,2)
##            SalePrice LotArea 1stFlrSF GrLivArea YearBuilt MSSubClass
## SalePrice       3.26   -0.19    -0.51     -1.75     -1.22       0.28
## LotArea        -0.19    1.14    -0.19     -0.10      0.15       0.10
## 1stFlrSF       -0.51   -0.19     1.91     -0.67     -0.15       0.46
## GrLivArea      -1.75   -0.10    -0.67      2.56      0.61      -0.54
## YearBuilt      -1.22    0.15    -0.15      0.61      1.56      -0.21
## MSSubClass      0.28    0.10     0.46     -0.54     -0.21       1.20
# c) Conduct LU decomposition on the matrix
precision_decomp <- lu.decomposition(precision_matrix)
round(precision_decomp$L,2)
##       [,1]  [,2]  [,3]  [,4]  [,5] [,6]
## [1,]  1.00  0.00  0.00  0.00  0.00    0
## [2,] -0.06  1.00  0.00  0.00  0.00    0
## [3,] -0.16 -0.20  1.00  0.00  0.00    0
## [4,] -0.54 -0.17 -0.55  1.00  0.00    0
## [5,] -0.37  0.07 -0.18 -0.20  1.00    0
## [6,]  0.09  0.10  0.30 -0.07 -0.03    1
round(precision_decomp$U,2)
##      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
## [1,] 3.26 -0.19 -0.51 -1.75 -1.22  0.28
## [2,] 0.00  1.13 -0.22 -0.20  0.08  0.12
## [3,] 0.00  0.00  1.78 -0.98 -0.32  0.53
## [4,] 0.00  0.00  0.00  1.05 -0.21 -0.07
## [5,] 0.00  0.00  0.00  0.00  1.00 -0.03
## [6,] 0.00  0.00  0.00  0.00  0.00  1.00
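
As a sanity check, both products from part (b) should recover the 6 × 6 identity matrix, and the LU factors should multiply back to the precision matrix. A sketch using the objects computed above:

# a and b should both be the identity matrix (up to floating-point error)
all.equal(unname(a), diag(6))
all.equal(unname(b), diag(6))

# L %*% U should reproduce the precision matrix
all.equal(unname(precision_decomp$L %*% precision_decomp$U), unname(precision_matrix))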

Question 3 - Part 3)

To choose a right-skewed variable, we’ll quickly look at histograms of the candidate variables.

# Question 3 - Calculus-Based Probability & Statistics (10 points) 
#   Many times, it makes sense to fit a closed form distribution to data. 
# a) Select a variable in the Kaggle.com training dataset that is skewed to the right, 
#    shift it so that the minimum value is absolutely above zero if necessary. 

# the below snippet stolen from:
# https://drsimonj.svbtle.com/quick-plot-of-all-variables
houses %>%
  keep(is.numeric) %>%
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(bins = 100)

# from the above facet histograms, 1stFlrSF is right skewed and the minimum is above zero
print(min(houses$`1stFlrSF`))
## [1] 334
hist(x=houses$`1stFlrSF`,breaks=100)

# b) Then load the MASS package and run fitdistr to fit an exponential probability density function. 
#    (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). 

op <- options(digits = 10)
r <- fitdistr(houses$`1stFlrSF`, "exponential")

print(paste0("The estimated rate is " , round(r$estimate,8)))
## [1] "The estimated rate is 0.00086012"
# c) Find the optimal value of λ for this distribution, and then take 1000 samples from 
#    this exponential distribution using this value (e.g., rexp(1000, λ)). 

# From the output of the fitdistr function 
exp_distribution <- rexp(1000, r$estimate)


# d) Plot a histogram and compare it with a histogram of your original variable. 
par(mfrow=c(1,2))    # set the plotting area into a 1*2 array
hist(x=houses$`1stFlrSF`,breaks=100, xlim=c(0, 6000), main="1st Floor Square Footage")
hist(exp_distribution,breaks=100, xlim=c(0, 6000), main="Fitted Distribution")

For the exponential distribution, we’ll find the 5th and 95th percentiles by using the exponential distribution’s CDF. Integrating the pdf gives the CDF:

\(f(x) = \lambda e^{-\lambda x}\)
\(F(x) = 1 - e^{-\lambda x}\)

Setting \(F(x) = p\) and isolating \(x\), we end up with:

\(x = \frac{\log(1 - p)}{-\lambda}\)

So for the 95th percentile:

log(1-0.95)/-r$estimate
##        rate 
## 3482.918364

\(x = \frac{\log(1 - 0.95)}{-0.0008601} = 3482.918364\)

And for the 5th percentile:

log(1-0.05)/-r$estimate
##        rate 
## 59.63495422

\(x = \frac{\log(1 - 0.05)}{-0.0008601} = 59.63495422\)

Reference: https://stats.stackexchange.com/a/55760
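
The same percentiles come directly from R’s exponential quantile function, which inverts the CDF for us (a sketch):

# 5th and 95th percentiles of the fitted exponential distribution
qexp(c(0.05, 0.95), rate = r$estimate)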

# f) Also generate a 95% confidence interval from the empirical data, assuming normality. 
# Assuming normality, the t.test function gives a 95% confidence interval for the mean
t <- t.test(houses$`1stFlrSF`,conf.level=0.95)
print(paste0("The empirical data's 95% confidence interval is: ", round(t$conf.int[1],2), ' ', round(t$conf.int[2],2)))
## [1] "The empirical data's 95% confidence interval is: 1142.78 1182.47"
# Just for completeness, we'll do the same for the fitted distribution
ft <- t.test(exp_distribution,conf.level=0.95)
print(paste0("The fitted distribution's 95% confidence interval is: ", round(ft$conf.int[1],2), ' ', round(ft$conf.int[2],2)))
## [1] "The fitted distribution's 95% confidence interval is: 1141.58 1286.28"

For the empirical data, we simply sort all the values and read off the entries at the 5% and 95% positions.

# g) Finally, provide the empirical 5th percentile and 95th percentile of the data. 
# sort all the values and read off the entries at the 5% and 95% positions
values <- houses$`1stFlrSF` %>%
  sort()

lower_bound <- length(values) * 0.05
upper_bound <- length(values) * 0.95

# Show the actual values
print(paste0("The 5th and 95th percentiles are: ", values[lower_bound], " and ", values[upper_bound]))
## [1] "The 5th and 95th percentiles are: 672 and 1831"

Discussion

This section explores the feasibility of using a fitted closed-form distribution in place of the empirical data. Comparing the histograms side by side, both are right-skewed, but they behave very differently near zero: the fitted exponential places its highest density near zero, while the empirical data contains very few houses with a small first-floor area.

For the confidence intervals of the mean, the two distributions are broadly comparable: the intervals are within the same order of magnitude and the estimated means are close. The tail percentiles, however, differ considerably (672/1831 for the empirical data versus roughly 60/3483 for the fitted exponential), which again reflects the mismatch between the exponential shape and the data near zero.

So when working with problematic data (e.g. too few observations or unusual outliers), fitting a closed-form distribution with fitdistr can still be useful, provided the chosen family reasonably matches the shape of the data.

Question 3 - Part 4)

Next, a multiple regression model is built using the strongest predictors identified above.

# Create a linear model using our few predictors
multiple_model <- lm(SalePrice ~ OverallQual + `1stFlrSF` + YearBuilt + GrLivArea , data=houses)
summary(multiple_model)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + `1stFlrSF` + YearBuilt + 
##     GrLivArea, data = houses)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -462172.06  -20306.72   -1542.95   17129.11  285342.88 
## 
## Coefficients:
##                  Estimate    Std. Error   t value   Pr(>|t|)    
## (Intercept) -968324.61646   81056.96825 -11.94622 < 2.22e-16 ***
## OverallQual   23810.52203    1135.98289  20.96028 < 2.22e-16 ***
## `1stFlrSF`       35.53502       3.31918  10.70596 < 2.22e-16 ***
## YearBuilt       449.90971      42.71840  10.53199 < 2.22e-16 ***
## GrLivArea        50.02524       2.72883  18.33213 < 2.22e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39250.5 on 1455 degrees of freedom
## Multiple R-squared:  0.7565598,  Adjusted R-squared:  0.7558905 
## F-statistic: 1130.457 on 4 and 1455 DF,  p-value: < 2.2204e-16

We see that all predictors are highly significant and the model explains about 75.7% of the variance in sale price (adjusted R-squared ≈ 0.756). The overall F-test p-value is effectively zero, so the multiple regression model above is usable.

test_houses <- read_csv('question_3/test.csv', show_col_types = FALSE)

predictions <- predict(multiple_model, newdata = test_houses)

# Create an output dataframe
Id <- test_houses$Id
SalePrice <- predictions
output <- cbind(Id, SalePrice)
write.csv(output,"predictions.csv", row.names = FALSE, quote=FALSE)

Kaggle Check

After submitting the generated predictions on Kaggle, my file received a score of 0.59953. My Kaggle name is “Cliff Lee”.

Kaggle score