knitr::opts_chunk$set(echo = TRUE)
library(tinytex)
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(corrplot)
## corrplot 0.84 loaded
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
Source files: https://github.com/djlofland/DATA605_S2020/tree/master/
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of \(\mu = \sigma = (N+1)/2\).
N <- 8            # any N >= 6, per the problem statement
count <- 10000    # number of random draws
Y_mu <- (N+1)/2   # mean of Y
Y_std <- Y_mu     # standard deviation of Y equals its mean
# Generate distributions
X <- runif(count, 1, N)
Y <- rnorm(count, Y_mu, Y_std)
x <- median(X)          # small "x" is the median of X
y <- quantile(Y, 0.25)  # small "y" is the 1st quartile of Y
hist(X)
hist(Y)
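As a quick sanity check, the simulated draws should roughly match the specified parameters: with N = 8, X is uniform on [1, 8] with mean 4.5, and Y is normal with mean and standard deviation (N+1)/2 = 4.5.
# Quick check of the simulated moments (values vary slightly from run to run)
mean(X)   # should be near (1 + N) / 2 = 4.5
mean(Y)   # should be near (N + 1) / 2 = 4.5
sd(Y)     # should be near (N + 1) / 2 = 4.5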
Problem 1
5 points. Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
a. P(X>x | X>y)
prob_X_gt_x <- sum(X > x) / count  # 1 - punif(x, 1, N) ... expected: 0.5
prob_X_gt_y <- sum(X > y) / count
prob_X_gt_x_and_y <- sum(X > x & X > y) / count
(answer <- prob_X_gt_x + prob_X_gt_y - prob_X_gt_x_and_y)
## [1] 0.9332
Note that the event X>x overlaps the event X>y, so to calculate P(X>x or X>y) we take the individual probabilities and subtract the overlap where both X>x and X>y hold (inclusion-exclusion). The two events are not disjoint, so simply adding their probabilities would over-count. ANSWER: 0.9332
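As a quick cross-check, the same union probability can be computed directly from the simulated draws with R's vectorized OR:
# Empirical P(X > x or X > y), computed directly
mean(X > x | X > y)   # should match the inclusion-exclusion result (~0.93)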
b. P(X>x, Y>y)
prob_X_gt_x <- 1 - punif(x, 1, N)        # theoretical P(X > x); x is the median of X, so ~0.5
prob_Y_gt_y <- 1 - pnorm(y, Y_mu, Y_std) # theoretical P(Y > y); y is the 1st quartile of Y, so ~0.75
(answer <- prob_X_gt_x * prob_Y_gt_y)
Since X and Y were generated independently, P(X>x, Y>y) = P(X>x) P(Y>y), so we simply multiply the two marginal probabilities: roughly 0.5 × 0.75. ANSWER: approximately 0.375, which matches the joint probability table in the next section.
c. P(X<x | X>y)
prob_X_lt_x <- punif(x, 1, N)
prob_X_lt_y <- punif(y, 1, N)
(answer <- prob_X_lt_x - prob_X_lt_y)
## 25%
## 0.4345134
Since y < x, we are finding the probability that X falls within the region bounded by y and x. This is simply the probability that X < x minus the probability that X < y.
ANSWER: 0.4345134
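The closed-form answer can also be cross-checked empirically against the simulated draws:
# Empirical P(y < X < x) from the simulated sample
mean(X > y & X < x)   # should be close to the punif difference (~0.43)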
5 points. Investigate whether \(P(X>x\text{ and }Y>y)=P(X>x)\ P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.
# MARGINAL probabilities, estimated from the simulated data
prob_X_gt_x <- sum(X > x) / count
prob_X_le_x <- 1 - prob_X_gt_x
prob_Y_gt_y <- sum(Y > y) / count
prob_Y_le_y <- 1 - prob_Y_gt_y
# JOINT probability table built from the products of the marginals
prob_table <- matrix(c(prob_Y_le_y*prob_X_le_x, prob_Y_le_y*prob_X_gt_x,
                       prob_Y_gt_y*prob_X_le_x, prob_Y_gt_y*prob_X_gt_x), ncol=2)
(total <- sum(prob_table))
## [1] 1
(p_joint <- prob_table[2,2])
## [1] 0.375
(p_marg <- prob_X_gt_x * prob_Y_gt_y)
## [1] 0.375
| | \(Y\le y\) | \(Y>y\) | Total |
|---|---|---|---|
| \(X\le x\) | 0.125 | 0.375 | 0.5 |
| \(X>x\) | 0.125 | 0.375 | 0.5 |
| Total | 0.25 | 0.75 | 1 |
ANSWER: p_joint (0.375) is equal to p_marg (0.375). As a further check against the raw counts (see the contingency table in the next section), the empirical joint probability P(X>x and Y>y) is 3775/10000 = 0.3775, very close to the product of the marginals, 0.5 × 0.75 = 0.375, which supports independence.
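The same comparison can be made directly from the simulated draws rather than from the product table; a quick sketch:
# Empirical joint probability vs the product of the empirical marginals
p_joint_emp <- mean(X > x & Y > y)        # proportion of draws where both events occur
p_marg_emp  <- mean(X > x) * mean(Y > y)  # product of the marginal proportions
c(joint = p_joint_emp, marginal_product = p_marg_emp)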
5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
a <- sum(X <= x & Y <= y)
b <- sum(X >  x & Y <= y)
c <- sum(X <= x & Y >  y)
d <- sum(X >  x & Y >  y)
cont_table <- matrix(c(a, b, c, d),
ncol = 2,
dimnames = list(X = c("<= x", "> x"),
Y = c("<= y", "> y")))
(cont_table)
## Y
## X <= y > y
## <= x 1275 3725
## > x 1225 3775
fisher.test(cont_table)
##
## Fisher's Exact Test for Count Data
##
## data: cont_table
## p-value = 0.2578
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9624387 1.1560181
## sample estimates:
## odds ratio
## 1.054812
chisq.test(cont_table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: cont_table
## X-squared = 1.2805, df = 1, p-value = 0.2578
Our null hypothesis is that X IS independent of Y, and the alternative is that they are dependent. Both the Fisher and Chi-Squared tests give a p-value of 0.2578, which is greater than 0.05, so we do NOT reject the null hypothesis; the data are consistent with X being independent of Y. This was expected, since X and Y were generated independently by design. Fisher’s Exact Test is intended for small cell counts and the Chi-Squared Test for large ones: Fisher’s is exact (and more computationally expensive) at small counts, whereas Chi-Squared is a good approximation at large counts. With large counts the two tests should give essentially the same result, as we see here. In this situation the cell counts are large, so the Chi-Squared Test is the more appropriate choice.
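To make the comparison concrete, the Chi-Squared statistic measures how far the observed cell counts are from the counts expected under independence (row total × column total / grand total); a quick sketch of that calculation:
# Expected counts under independence for the 2x2 contingency table
row_tot  <- rowSums(cont_table)
col_tot  <- colSums(cont_table)
expected <- outer(row_tot, col_tot) / sum(cont_table)
expected                                    # compare with the observed counts in cont_table
sum((cont_table - expected)^2 / expected)   # chi-squared statistic (without Yates' correction)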
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
Note: My Kaggle username is djlofland
5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
train <- as_tibble(read.csv(file='./datasets/train.csv'))
test <- as_tibble(read.csv(file='./datasets/test.csv'))
# A Few Quantitative Features
plot(train$GrLivArea, train$SalePrice, las=3, main='Sale Price vs Living Area (Above Ground)', xlab='Living Area (Above Ground)', ylab='SalePrice (USD)')
plot(train$TotRmsAbvGrd, train$SalePrice, las=3, main='Sale Price vs Total Rooms (Above Ground)', xlab='Rooms', ylab='SalePrice (USD)')
plot(train$GarageCars, train$SalePrice, las=3, main='Sale Price vs Garage Size', xlab='Garage Size (Cars)', ylab='SalePrice (USD)')
# A Few Categorical Features
plot(train$Neighborhood, train$SalePrice, las=3, main='Sale Price vs Neighborbood', xlab='Neighborbood', ylab='SalePrice (USD)')
plot(train$KitchenQual, train$SalePrice, las=3, main='Sale Price vs Kitchen Quality', xlab='Kitchen Quality', ylab='SalePrice (USD)')
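For the univariate descriptive statistics and the scatterplot matrix requested above, base R summaries work well; a minimal sketch using a few of the quantitative columns already plotted:
# Univariate descriptive statistics for selected quantitative variables
summary(train[, c("SalePrice", "GrLivArea", "TotRmsAbvGrd", "GarageCars")])
# Scatterplot matrix: two independent variables plus the dependent variable
pairs(train[, c("GrLivArea", "TotRmsAbvGrd", "SalePrice")],
      main = "Scatterplot Matrix: Living Area, Rooms, Sale Price")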
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(corrgram)
## Registered S3 method overwritten by 'seriation':
## method from
## reorder.hclust gclus
##
## Attaching package: 'corrgram'
## The following object is masked from 'package:lattice':
##
## panel.fill
my_data <- train %>% dplyr::select(X1stFlrSF, FullBath, BedroomAbvGr, GarageCars, HalfBath, WoodDeckSF, LotArea)
my_data %>%
corrgram(lower.panel=corrgram::panel.ellipse,
upper.panel=panel.cor,
diag.panel=panel.density)
res <- rcorr(as.matrix(my_data))
corrplot(res$r, type="upper", order="hclust", p.mat = res$P, sig.level = 0.05, insig = "blank")
# Get pairwise Confidence intervals for 80% (ie alpha=0.2)
my_corr <- psych::corr.test(my_data, alpha=.20)
print(my_corr, short=FALSE)
## Call:psych::corr.test(x = my_data, alpha = 0.2)
## Correlation matrix
## X1stFlrSF FullBath BedroomAbvGr GarageCars HalfBath
## X1stFlrSF 1.00 0.38 0.13 0.44 -0.12
## FullBath 0.38 1.00 0.36 0.47 0.14
## BedroomAbvGr 0.13 0.36 1.00 0.09 0.23
## GarageCars 0.44 0.47 0.09 1.00 0.22
## HalfBath -0.12 0.14 0.23 0.22 1.00
## WoodDeckSF 0.24 0.19 0.05 0.23 0.11
## LotArea 0.30 0.13 0.12 0.15 0.01
## WoodDeckSF LotArea
## X1stFlrSF 0.24 0.30
## FullBath 0.19 0.13
## BedroomAbvGr 0.05 0.12
## GarageCars 0.23 0.15
## HalfBath 0.11 0.01
## WoodDeckSF 1.00 0.17
## LotArea 0.17 1.00
## Sample Size
## [1] 1460
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## X1stFlrSF FullBath BedroomAbvGr GarageCars HalfBath
## X1stFlrSF 0 0 0.00 0 0.00
## FullBath 0 0 0.00 0 0.00
## BedroomAbvGr 0 0 0.00 0 0.00
## GarageCars 0 0 0.00 0 0.00
## HalfBath 0 0 0.00 0 0.00
## WoodDeckSF 0 0 0.07 0 0.00
## LotArea 0 0 0.00 0 0.59
## WoodDeckSF LotArea
## X1stFlrSF 0.00 0.00
## FullBath 0.00 0.00
## BedroomAbvGr 0.15 0.00
## GarageCars 0.00 0.00
## HalfBath 0.00 0.59
## WoodDeckSF 0.00 0.00
## LotArea 0.00 0.00
##
## Confidence intervals based upon normal theory. To get bootstrapped values, try cor.ci
## raw.lower raw.r raw.upper raw.p lower.adj upper.adj
## X1FSF-FllBt 0.35 0.38 0.41 0.00 0.32 0.44
## X1FSF-BdrAG 0.09 0.13 0.16 0.00 0.07 0.18
## X1FSF-GrgCr 0.41 0.44 0.47 0.00 0.38 0.49
## X1FSF-HlfBt -0.15 -0.12 -0.09 0.00 -0.17 -0.06
## X1FSF-WdDSF 0.20 0.24 0.27 0.00 0.17 0.30
## X1FSF-LotAr 0.27 0.30 0.33 0.00 0.24 0.36
## FllBt-BdrAG 0.33 0.36 0.39 0.00 0.30 0.42
## FllBt-GrgCr 0.44 0.47 0.50 0.00 0.42 0.52
## FllBt-HlfBt 0.10 0.14 0.17 0.00 0.08 0.19
## FllBt-WdDSF 0.16 0.19 0.22 0.00 0.13 0.25
## FllBt-LotAr 0.09 0.13 0.16 0.00 0.07 0.18
## BdrAG-GrgCr 0.05 0.09 0.12 0.00 0.04 0.13
## BdrAG-HlfBt 0.19 0.23 0.26 0.00 0.16 0.29
## BdrAG-WdDSF 0.01 0.05 0.08 0.07 0.00 0.09
## BdrAG-LotAr 0.09 0.12 0.15 0.00 0.07 0.17
## GrgCr-HlfBt 0.19 0.22 0.25 0.00 0.16 0.28
## GrgCr-WdDSF 0.19 0.23 0.26 0.00 0.16 0.29
## GrgCr-LotAr 0.12 0.15 0.19 0.00 0.09 0.21
## HlfBt-WdDSF 0.07 0.11 0.14 0.00 0.06 0.16
## HlfBt-LotAr -0.02 0.01 0.05 0.59 -0.02 0.05
## WdDSF-LotAr 0.14 0.17 0.20 0.00 0.11 0.23
As we can see from the plots and tables, most of the pairwise correlations fall between -0.2 and +0.3. For linear regression we ideally want features with no correlation, i.e. features that are independent of each other. The challenge with housing data is that, as expected, many features are correlated. Looking at the pairwise correlation table and setting our alpha to 0.2, only the LotArea–HalfBath pair has a confidence interval for r that overlaps 0; the next closest is WoodDeckSF–BedroomAbvGr. Since we do see correlations between features, we expect a linear model to be somewhat less accurate at predicting SalePrice. That said, many of these correlations are small, so the effect on predictions may not be dramatic. Using 80% confidence bounds, we are reasonably confident the correlations are NOT zero for most pairs; however, their magnitudes are not far from 0. So even with Type I errors (where the true population correlation falls outside our interval), the true correlation is unlikely to be large in either direction, and consequently familywise error should not lead to major differences in the predictive power of a linear model.
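For any single pair, the 80% interval reported by psych::corr.test can be reproduced with the Fisher z-transformation; a minimal sketch using the X1stFlrSF / GarageCars pair as an example:
# 80% confidence interval for a correlation via the Fisher z-transformation
r  <- cor(my_data$X1stFlrSF, my_data$GarageCars)
n  <- nrow(my_data)
z  <- atanh(r)                               # Fisher z-transform of r
se <- 1 / sqrt(n - 3)                        # approximate standard error of z
ci <- tanh(z + c(-1, 1) * qnorm(0.90) * se)  # 80% two-sided, back-transformed to the r scale
round(c(lower = ci[1], r = r, upper = ci[2]), 2)  # compare with the X1FSF-GrgCr row above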
5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
library(matlib)
##
## Attaching package: 'matlib'
## The following object is masked from 'package:psych':
##
## tr
corr_mat <- as.matrix(my_corr$r)
# Make sure determinant != 0, i.e. the inverse exists
print(paste('Determinant of Correlation Matrix: ', det(corr_mat)))
## [1] "Determinant of Correlation Matrix: 0.350756830479752"
# Invert the Correlation Matrix to get the Precision Matrix
print('Precision Matrix: ')
## [1] "Precision Matrix: "
(pres_mat <- inv(corr_mat))
##
## [1,] 1.52861037 -0.28283323 -0.08813545 -0.52830583 0.38057116
## [2,] -0.28283323 1.55107173 -0.47991767 -0.54918229 -0.00779719
## [3,] -0.08813545 -0.47991767 1.23925935 0.22283008 -0.27877766
## [4,] -0.52830583 -0.54918229 0.22283008 1.57793506 -0.37412840
## [5,] 0.38057116 -0.00779719 -0.27877766 -0.37412840 1.20484257
## [6,] -0.17177269 -0.08488016 0.04972401 -0.09593524 -0.11741114
## [7,] -0.30570003 0.04639730 -0.10052032 -0.02181180 -0.01870148
##
## [1,] -0.17177269 -0.30570003
## [2,] -0.08488016 0.04639730
## [3,] 0.04972401 -0.10052032
## [4,] -0.09593524 -0.02181180
## [5,] -0.11741114 -0.01870148
## [6,] 1.10864924 -0.11763314
## [7,] -0.11763314 1.12157519
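As noted in the problem statement, the diagonal of the precision matrix holds the variance inflation factors for the corresponding features; a quick sketch to pull them out and label them:
# Variance inflation factors: the diagonal of the precision matrix
vifs <- diag(pres_mat)
names(vifs) <- colnames(corr_mat)
round(vifs, 2)   # values close to 1 indicate little multicollinearity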
# Multiply the Correlation Matrix by the Presision Matrix and vice versa
print('Correlation Matrix * Precision Matrix and vice versa')
## [1] "Correlation Matrix * Precision Matrix and vice versa"
corr_mat %*% pres_mat
##
## X1stFlrSF 1.000000e+00 -2.099920e-10 2.549849e-09 -2.011754e-10
## FullBath 2.985424e-09 1.000000e+00 4.609705e-09 -4.647446e-09
## BedroomAbvGr 3.062899e-09 3.780336e-09 1.000000e+00 -5.039961e-09
## GarageCars 3.581008e-09 -5.117253e-10 8.202727e-10 1.000000e+00
## HalfBath 5.421901e-09 5.298227e-09 4.012798e-09 -5.732065e-09
## WoodDeckSF 4.557388e-09 4.716603e-09 2.132956e-09 -4.433049e-09
## LotArea -2.656335e-09 -7.394940e-10 1.976166e-09 -3.004904e-09
##
## X1stFlrSF 4.966524e-09 2.987008e-09 -7.282273e-09
## FullBath 6.598785e-09 5.249805e-09 -5.279897e-09
## BedroomAbvGr 5.815376e-09 3.410413e-09 -1.245897e-09
## GarageCars 1.450829e-09 9.920393e-10 -6.286842e-09
## HalfBath 1.000000e+00 3.424746e-09 -9.377641e-10
## WoodDeckSF 5.023480e-09 1.000000e+00 -6.854421e-09
## LotArea 1.959763e-09 -3.051024e-09 1.000000e+00
pres_mat %*% corr_mat
## X1stFlrSF FullBath BedroomAbvGr GarageCars HalfBath
## [1,] 1.000000e+00 2.985424e-09 3.062899e-09 3.581008e-09 5.421901e-09
## [2,] -2.099920e-10 1.000000e+00 3.780336e-09 -5.117253e-10 5.298227e-09
## [3,] 2.549849e-09 4.609705e-09 1.000000e+00 8.202727e-10 4.012798e-09
## [4,] -2.011754e-10 -4.647446e-09 -5.039961e-09 1.000000e+00 -5.732065e-09
## [5,] 4.966524e-09 6.598785e-09 5.815376e-09 1.450829e-09 1.000000e+00
## [6,] 2.987008e-09 5.249805e-09 3.410413e-09 9.920393e-10 3.424746e-09
## [7,] -7.282273e-09 -5.279897e-09 -1.245897e-09 -6.286842e-09 -9.377641e-10
## WoodDeckSF LotArea
## [1,] 4.557388e-09 -2.656335e-09
## [2,] 4.716603e-09 -7.394940e-10
## [3,] 2.132956e-09 1.976166e-09
## [4,] -4.433049e-09 -3.004904e-09
## [5,] 5.023480e-09 1.959763e-09
## [6,] 1.000000e+00 -3.051024e-09
## [7,] -6.854421e-09 1.000000e+00
# Helper function to factor a matrix (from my Homework Assignment #2)
factor_matrix <- function(A) {
size <- nrow(A)
max <- size - 1
# Build our base L matrix
L <- matrix(0, nrow=size, ncol=size)
for (i in 1:size) {
L[i,i] = 1L
}
# Copy A into U - we will be decomposing U as we go, keeping the original A intact
U <- matrix(0, nrow=size, ncol=size)
for (r in 1:size) {
for (c in 1:size) {
U[r,c] <- A[r,c]
}
}
for (c in 1:max) {
base_row <- c
next_row <- c+1
for (r in next_row:size) {
# print(c(c, r))
L[r, c] = U[r, c] / U[base_row, c]
U[r,] = U[r,] - U[base_row,] * L[r, c]
}
}
U <- round(U, 5)
L <- round(L, 5)
return(list("A"=A, "U"=U, "L"=L))
}
print('LU Decomposition on Precision Matrix')
## [1] "LU Decomposition on Precision Matrix"
(lu_matrix <- factor_matrix(pres_mat))
## $A
##
## [1,] 1.52861037 -0.28283323 -0.08813545 -0.52830583 0.38057116
## [2,] -0.28283323 1.55107173 -0.47991767 -0.54918229 -0.00779719
## [3,] -0.08813545 -0.47991767 1.23925935 0.22283008 -0.27877766
## [4,] -0.52830583 -0.54918229 0.22283008 1.57793506 -0.37412840
## [5,] 0.38057116 -0.00779719 -0.27877766 -0.37412840 1.20484257
## [6,] -0.17177269 -0.08488016 0.04972401 -0.09593524 -0.11741114
## [7,] -0.30570003 0.04639730 -0.10052032 -0.02181180 -0.01870148
##
## [1,] -0.17177269 -0.30570003
## [2,] -0.08488016 0.04639730
## [3,] 0.04972401 -0.10052032
## [4,] -0.09593524 -0.02181180
## [5,] -0.11741114 -0.01870148
## [6,] 1.10864924 -0.11763314
## [7,] -0.11763314 1.12157519
##
## $U
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1.52861 -0.28283 -0.08814 -0.52831 0.38057 -0.17177 -0.30570
## [2,] 0.00000 1.49874 -0.49623 -0.64693 0.06262 -0.11666 -0.01017
## [3,] 0.00000 0.00000 1.06988 -0.02183 -0.23610 0.00119 -0.12151
## [4,] 0.00000 0.00000 0.00000 1.11565 -0.22039 -0.20564 -0.13433
## [5,] 0.00000 0.00000 0.00000 0.00000 1.01184 -0.11013 0.00448
## [6,] 0.00000 0.00000 0.00000 0.00000 0.00000 1.03038 -0.17691
## [7,] 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 1.00000
##
## $L
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1.00000 0.00000 0.00000 0.00000 0.00000 0.0000 0
## [2,] -0.18503 1.00000 0.00000 0.00000 0.00000 0.0000 0
## [3,] -0.05766 -0.33109 1.00000 0.00000 0.00000 0.0000 0
## [4,] -0.34561 -0.43165 -0.02040 1.00000 0.00000 0.0000 0
## [5,] 0.24897 0.04178 -0.22068 -0.19754 1.00000 0.0000 0
## [6,] -0.11237 -0.07784 0.00112 -0.18432 -0.10884 1.0000 0
## [7,] -0.19999 -0.00678 -0.11358 -0.12041 0.00443 -0.1717 1
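A quick way to validate the hand-rolled factorization is to multiply L and U back together and compare against the precision matrix:
# L %*% U should reconstruct the original (precision) matrix
reconstructed <- lu_matrix$L %*% lu_matrix$U
max(abs(reconstructed - pres_mat))   # should be ~0, up to the rounding applied to L and U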
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
Note: Based on initial data exploration, none of the raw numeric features really follows an exponential distribution. After a good deal of exploration (not shown), I finally settled on a new feature, PropertyAge, defined as YrSold - YearBuilt, i.e. the age of the property when it was sold. While it’s not a perfect exponential fit, it was the closest I could find after hours of digging. There is a large initial peak, reflecting that many of the properties purchased were new.
# Create a new feature, PropertyAge
train$PropertyAge <- train$YrSold - train$YearBuilt
my_data <- train$PropertyAge
# Fit an exponential distribution using the MASS package
dist <- fitdistr(my_data, "exponential")
lambda <- dist$estimate
# Draw random samples from an exponential distribution with rate lambda
samples <- rexp(1000, lambda)
hist(my_data, freq = FALSE, breaks = 100, xlim = c(0, quantile(my_data, 0.99)), col = rgb(0,0,1,0.2), xlab='PropertyAge', ylab='Density', main='PropertyAge Histograms')
curve(dexp(x, lambda), from = 0, col = "red", add = TRUE)
hist(samples, freq = FALSE, breaks = 250, xlim = c(0, quantile(my_data, 0.99)), col = rgb(1,0,0,0.1), add = TRUE)
legend("topright", c("PropertyAge", "Simulation"), col=c(rgb(0,0,1,0.2), rgb(1,0,0,0.1)), lwd=10)
The optimal \(\lambda\) for this fit is: 0.0273613
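This agrees with the closed-form maximum-likelihood estimate for an exponential distribution, which is simply the reciprocal of the sample mean:
# MLE of the exponential rate: lambda_hat = 1 / sample mean
1 / mean(my_data)   # should match dist$estimate (~0.0274)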
# Calculate the 5th and 95th percentiles from the fitted exponential CDF (inverse CDF via qexp)
lower_5 <- qexp(0.05, lambda)
upper_95 <- qexp(0.95, lambda)
The 5th percentile represents the lower end of the fitted distribution: 5% of simulated homes are less than 1.8746645 years old. The 95th percentile says that 95% of simulated homes are less than 109.487859 years old, or equivalently, 5% of simulated homes are older than 109.487859 years.
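These percentiles follow from inverting the exponential CDF, \(F(t) = 1 - e^{-\lambda t}\), which gives \(t_p = -\ln(1-p)/\lambda\); a quick check against the qexp values:
# Invert the exponential CDF by hand: t_p = -log(1 - p) / lambda
-log(1 - 0.05) / lambda   # ~1.87, matches qexp(0.05, lambda)
-log(1 - 0.95) / lambda   # ~109.5, matches qexp(0.95, lambda)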
Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data.
mu <- mean(train$PropertyAge)
n <- length(train$PropertyAge)
# 95% CI for the mean, assuming normality; the standard error uses mu in place of
# the sample sd (for an exponential distribution the sd equals the mean)
lower_5 <- mu - 1.96*mu/sqrt(n)
upper_95 <- mu + 1.96*mu/sqrt(n)
percent_5 <- quantile(train$PropertyAge, c(0.05))
percent_95 <- quantile(train$PropertyAge, c(0.95))
The mean age of the actual properties was 36.5479452 years, with a 95% confidence interval of 34.6731985 to 38.422692 years. At the extremes, 5% of properties are less than 1 year old (5th percentile) and 5% are older than 91 years (95th percentile).
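As an alternative check, t.test computes the normal-theory interval using the sample standard deviation; because the sample sd of PropertyAge is not exactly equal to its mean, this interval will differ slightly from the one above:
# 95% CI for the mean of PropertyAge using the sample standard deviation
t.test(train$PropertyAge, conf.level = 0.95)$conf.int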
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
Please see my GitHub repo and download/view the KaggleProject.ipynb and/or KaggleProject.html files.
https://github.com/djlofland/DATA605_S2020/tree/master/FinalProject/