1 Problem 1

1.1 Generate random variables

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of (N+1)/2. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

library(dplyr)
library(ggplot2)
set.seed(321)

# choose a value for N
N <- 13

# Generate uniform random numbers
X <- runif(10000, 1, N)

# generate normal random numbers
Y <- rnorm(10000, (N+1)/2, (N+1)/2)

# store X and Y in a dataframe
df <- data.frame(cbind(X,Y)) %>% as_tibble()
head(df)
## # A tibble: 6 x 2
##       X     Y
##   <dbl> <dbl>
## 1 12.5   6.34
## 2 12.2   6.28
## 3  3.86  9.26
## 4  4.06 12.0 
## 5  5.69 19.2 
## 6  5.09  6.21
# find x - median of X
x <- median(X)
paste("x = ", round(x,3))
## [1] "x =  7.022"
# find y - the 1st quartile of Y
y <- quantile(Y, 0.25 )
paste("y = ", round(y,3))
## [1] "y =  2.349"
# visual of X and Y
df %>% ggplot(aes(x=Y)) + geom_histogram(binwidth=1, color="black", fill="blue", alpha=.9) +
  geom_histogram(aes(x=X), color="black", fill="grey", alpha=.5) + 
  geom_vline(aes(xintercept=x), color="blue", linetype="dashed", size=1) + 
  geom_vline(aes(xintercept=y), color= "black", linetype="dashed", size=1) + 
  geom_text(aes(x=x), y=1000, label="x", size=9)+
  geom_text(aes(x=y), y=1000, label="y", size=9)+
  theme_minimal()

1.2 Calculate Probabilities

1.2.1 a. P(X>x | X>y)

This is a conditional probability: given that X > y, the sample space is restricted to the observations where X > y, and P(X > x) is then the proportion of those observations where X > x.

# get new sample space
x_gt_y <- df %>% filter(X > y) 

# probability is events where X > x divided by the # of events in the new sample space
p_a <- sum(x_gt_y$X > x)/nrow(x_gt_y)

paste("The probability that P(X>x | X>y) is approximately ", round(p_a,3 ))
## [1] "The probability that P(X>x | X>y) is approximately  0.565"

1.2.2 b. P(X>x, Y>y)

p_b <- sum(X > x & Y > y)/10000
paste("The probability that P(X>x, Y>y) is approximately", round(p_b,3 ))
## [1] "The probability that P(X>x, Y>y) is approximately 0.376"

1.2.3 c. P(X < x | X > y)

This conditional probability is computed the same way: restrict the sample space to the observations where X > y, then find the proportion of those where X < x.

x_gt_y <- df %>% filter(X>y) 
p_c <- sum(x_gt_y$X<x)/nrow(x_gt_y) 

paste("The probability that P(X<x | X>y) is approximately", round(p_c,3 ))
## [1] "The probability that P(X<x | X>y) is approximately 0.435"

1.3 Marginal/Joint Probability

Investigate whether P(X>x and Y>y) = P(X>x)P(Y>y).
This is done by first building a table and then evaluating the marginal and joint probabilities.

# find joint probabilities of each of the events
Xgtx_Ygty <- sum(X > x & Y >y)/10000
Xltx_Ygty <- sum(X < x & Y >y)/10000
Xgtx_Ylty <- sum(X > x & Y< y)/10000
Xltx_Ylty <- sum(X < x & Y <y)/10000

# create a dataframe to store the probability distribution - creating row and column names
df2 <- data.frame("Y_greater_than_y"=c(Xgtx_Ygty,Xltx_Ygty), "Y_less_than_y"=c(Xgtx_Ylty, Xltx_Ylty))
row.names(df2) <- c("X_greater_than_x","X_less_than_x")

# joint probability distribution 
df2
##                  Y_greater_than_y Y_less_than_y
## X_greater_than_x           0.3765        0.1235
## X_less_than_x              0.3735        0.1265
# calculate row and column sums - the marginals 
df2 <- cbind(df2, total=rowSums(df2))
df2 <- rbind(df2, total=colSums(df2))


df2
##                  Y_greater_than_y Y_less_than_y total
## X_greater_than_x           0.3765        0.1235   0.5
## X_less_than_x              0.3735        0.1265   0.5
## total                      0.7500        0.2500   1.0
# Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) 
paste("P(X>x and Y>y)=P(X>x)P(Y>y): ",  round(sum(X > x)/10000 * sum(Y>y)/10000,2) == round(df2[1,1],2))
## [1] "P(X>x and Y>y)=P(X>x)P(Y>y):  TRUE"

The third column and row are the marginal probability distributions; each sums to 1.
The marginals also match how x and y were defined: P(Y > y) is 0.75 because y is the 1st quartile of Y (the upper three quarters lie above it), and P(X > x) is 0.5 because x is the median of X. The joint probabilities are consistent with these marginals and with our distributions.
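The same independence check can be extended to all four cells of the table: under independence each joint probability should equal the product of its row and column marginals. A small sketch, reusing df2 (with its totals) from above:

# compare each joint probability to the product of its marginals
joint <- as.matrix(df2[1:2, 1:2])
expected_joint <- outer(df2[1:2, "total"], unlist(df2["total", 1:2]))
round(joint - expected_joint, 4)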

1.4 Independence Testing

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Many applications of probability and statistics are predicated on the independence of variables. Fisher’s Exact Test and the Chi-Square Test provide a way of evaluating independence.

As with any statistical test, we start with null and alternative hypotheses.
H0 : the variables are independent - there is no relationship between the variables.
H1 : the variables are dependent - knowing the value of one variable helps predict the other.

We will either reject or fail to reject the null hypothesis based on the p-value computed from the tests; I’ll choose an alpha of 0.05.

The functions for the tests require counts, not probabilities, and data should be in a table format.
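One direct way to build such a counts table is to call table() on the raw logical comparisons; this is a sketch, and the probability-table conversion used below yields the same counts (the TRUE/FALSE labels are ordered differently):

# equivalent contingency table of counts built directly from X and Y
table(X_gt_x = X > x, Y_gt_y = Y > y)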

# multiply by the number of events to convert the probability distribution table into table of counts
df2 <- df2*10000
df2 <-df2[1:2,1:2]

# convert to table
table <- as.table(as.matrix(df2))
table
##                  Y_greater_than_y Y_less_than_y
## X_greater_than_x             3765          1235
## X_less_than_x                3735          1265
# Fishers Exact test 
fisher.test(table, conf.int=T)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table
## p-value = 0.503
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9421511 1.1315373
## sample estimates:
## odds ratio 
##   1.032527
# Chi Square Test
chisq.test(table)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table
## X-squared = 0.44853, df = 1, p-value = 0.503

For both Fisher’s and the chi-square test, the p-value is much higher than the chosen significance level, so the null hypothesis, that there is no relationship between the variables and that they are independent, cannot be rejected. This makes sense intuitively when looking at the joint values in the table above: knowing whether Y is greater than y does not help predict whether X is greater than or less than x.

Generally, the chi-square test is used with large sample sizes and provides an approximate p-value, whereas Fisher’s exact test provides an exact p-value and is typically applied to small samples. In this case there was no real benefit to choosing one over the other since the sample is large (10,000 observations). In datasets with limited counts, Fisher’s test would be the more appropriate choice.
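To make the difference concrete, the chi-square statistic is computed from the expected counts under independence (row total times column total divided by the grand total). A sketch without the continuity correction, reusing the table object from above:

# expected counts under independence
expected <- outer(rowSums(table), colSums(table)) / sum(table)

# Pearson chi-square statistic and its p-value (1 degree of freedom)
chi_sq <- sum((table - expected)^2 / expected)
pchisq(chi_sq, df = 1, lower.tail = FALSE)

# this should match chisq.test(table, correct = FALSE)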


2 Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques .

2.1 Descriptive statistics

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset.

First, a look at the dataset.

require(dplyr)
# require(PerformanceAnalytics)
# source("http://www.sthda.com/upload/rquery_cormat.r")

# load the training data
train <- read.csv('C:/NYBackup/DATA605/regression/train.csv') %>% as_tibble()
test <- read.csv('C:/NYBackup/DATA605/regression/test.csv') %>% as_tibble()

train
## # A tibble: 1,460 x 81
##       Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
##    <int>      <int> <fct>          <int>   <int> <fct>  <fct> <fct>   
##  1     1         60 RL                65    8450 Pave   <NA>  Reg     
##  2     2         20 RL                80    9600 Pave   <NA>  Reg     
##  3     3         60 RL                68   11250 Pave   <NA>  IR1     
##  4     4         70 RL                60    9550 Pave   <NA>  IR1     
##  5     5         60 RL                84   14260 Pave   <NA>  IR1     
##  6     6         50 RL                85   14115 Pave   <NA>  IR1     
##  7     7         20 RL                75   10084 Pave   <NA>  Reg     
##  8     8         60 RL                NA   10382 Pave   <NA>  IR1     
##  9     9         50 RM                51    6120 Pave   <NA>  Reg     
## 10    10        190 RL                50    7420 Pave   <NA>  Reg     
## # ... with 1,450 more rows, and 73 more variables: LandContour <fct>,
## #   Utilities <fct>, LotConfig <fct>, LandSlope <fct>, Neighborhood <fct>,
## #   Condition1 <fct>, Condition2 <fct>, BldgType <fct>, HouseStyle <fct>,
## #   OverallQual <int>, OverallCond <int>, YearBuilt <int>,
## #   YearRemodAdd <int>, RoofStyle <fct>, RoofMatl <fct>,
## #   Exterior1st <fct>, Exterior2nd <fct>, MasVnrType <fct>,
## #   MasVnrArea <int>, ExterQual <fct>, ExterCond <fct>, Foundation <fct>,
## #   BsmtQual <fct>, BsmtCond <fct>, BsmtExposure <fct>,
## #   BsmtFinType1 <fct>, BsmtFinSF1 <int>, BsmtFinType2 <fct>,
## #   BsmtFinSF2 <int>, BsmtUnfSF <int>, TotalBsmtSF <int>, Heating <fct>,
## #   HeatingQC <fct>, CentralAir <fct>, Electrical <fct>, X1stFlrSF <int>,
## #   X2ndFlrSF <int>, LowQualFinSF <int>, GrLivArea <int>,
## #   BsmtFullBath <int>, BsmtHalfBath <int>, FullBath <int>,
## #   HalfBath <int>, BedroomAbvGr <int>, KitchenAbvGr <int>,
## #   KitchenQual <fct>, TotRmsAbvGrd <int>, Functional <fct>,
## #   Fireplaces <int>, FireplaceQu <fct>, GarageType <fct>,
## #   GarageYrBlt <int>, GarageFinish <fct>, GarageCars <int>,
## #   GarageArea <int>, GarageQual <fct>, GarageCond <fct>,
## #   PavedDrive <fct>, WoodDeckSF <int>, OpenPorchSF <int>,
## #   EnclosedPorch <int>, X3SsnPorch <int>, ScreenPorch <int>,
## #   PoolArea <int>, PoolQC <fct>, Fence <fct>, MiscFeature <fct>,
## #   MiscVal <int>, MoSold <int>, YrSold <int>, SaleType <fct>,
## #   SaleCondition <fct>, SalePrice <int>
# Iterative approach since there are 81 variables 
# scatterplot matrix on the whole dataset
# output suppressed since there are so many
# pairs(train[,c(81,1:8)])
# pairs(train[,c(81,9:15)])
# pairs(train[,c(81,16:22)])
# pairs(train[,c(81,23:29)])
# pairs(train[,c(81,30:36)])
# pairs(train[,c(81,37:43)])
# pairs(train[,c(81,49:55)])
# pairs(train[,c(81,61:67)])
# pairs(train[,c(81,73:80)])

2.1.1 Iteration 1

Select the variables that appear most correlated with SalePrice based on the scatterplot matrices above. 12 variables were selected, including the target. Summary statistics, new scatterplot matrices, and a correlation matrix are given.

# initial selection of the training set - 12 variables selected including the target
train_selection1 <- train %>% dplyr::select(SalePrice, GrLivArea, OverallQual, YearBuilt, X1stFlrSF, FullBath, TotRmsAbvGrd, LotArea, OverallCond, BedroomAbvGr, YrSold, LotFrontage)

#summary statistics
summary(train_selection1)
##    SalePrice        GrLivArea     OverallQual       YearBuilt   
##  Min.   : 34900   Min.   : 334   Min.   : 1.000   Min.   :1872  
##  1st Qu.:129975   1st Qu.:1130   1st Qu.: 5.000   1st Qu.:1954  
##  Median :163000   Median :1464   Median : 6.000   Median :1973  
##  Mean   :180921   Mean   :1515   Mean   : 6.099   Mean   :1971  
##  3rd Qu.:214000   3rd Qu.:1777   3rd Qu.: 7.000   3rd Qu.:2000  
##  Max.   :755000   Max.   :5642   Max.   :10.000   Max.   :2010  
##                                                                 
##    X1stFlrSF       FullBath      TotRmsAbvGrd       LotArea      
##  Min.   : 334   Min.   :0.000   Min.   : 2.000   Min.   :  1300  
##  1st Qu.: 882   1st Qu.:1.000   1st Qu.: 5.000   1st Qu.:  7554  
##  Median :1087   Median :2.000   Median : 6.000   Median :  9478  
##  Mean   :1163   Mean   :1.565   Mean   : 6.518   Mean   : 10517  
##  3rd Qu.:1391   3rd Qu.:2.000   3rd Qu.: 7.000   3rd Qu.: 11602  
##  Max.   :4692   Max.   :3.000   Max.   :14.000   Max.   :215245  
##                                                                  
##   OverallCond     BedroomAbvGr       YrSold      LotFrontage    
##  Min.   :1.000   Min.   :0.000   Min.   :2006   Min.   : 21.00  
##  1st Qu.:5.000   1st Qu.:2.000   1st Qu.:2007   1st Qu.: 59.00  
##  Median :5.000   Median :3.000   Median :2008   Median : 69.00  
##  Mean   :5.575   Mean   :2.866   Mean   :2008   Mean   : 70.05  
##  3rd Qu.:6.000   3rd Qu.:3.000   3rd Qu.:2009   3rd Qu.: 80.00  
##  Max.   :9.000   Max.   :8.000   Max.   :2010   Max.   :313.00  
##                                                 NA's   :259
# Scatterplot matrix separated into two for clarity 
pairs(train_selection1[,1:8])

pairs(train_selection1[,c(1,7:12)])

# correlation matrix
cor_matrix1 <- cor(train_selection1)
cor_matrix1
##                SalePrice   GrLivArea OverallQual   YearBuilt   X1stFlrSF
## SalePrice     1.00000000  0.70862448  0.79098160  0.52289733  0.60585218
## GrLivArea     0.70862448  1.00000000  0.59300743  0.19900971  0.56602397
## OverallQual   0.79098160  0.59300743  1.00000000  0.57232277  0.47622383
## YearBuilt     0.52289733  0.19900971  0.57232277  1.00000000  0.28198586
## X1stFlrSF     0.60585218  0.56602397  0.47622383  0.28198586  1.00000000
## FullBath      0.56066376  0.63001165  0.55059971  0.46827079  0.38063749
## TotRmsAbvGrd  0.53372316  0.82548937  0.42745234  0.09558913  0.40951598
## LotArea       0.26384335  0.26311617  0.10580574  0.01422765  0.29947458
## OverallCond  -0.07785589 -0.07968587 -0.09193234 -0.37598320 -0.14420278
## BedroomAbvGr  0.16821315  0.52126951  0.10167636 -0.07065122  0.12740075
## YrSold       -0.02892259 -0.03652582 -0.02734671 -0.01361768 -0.01360377
## LotFrontage           NA          NA          NA          NA          NA
##                 FullBath TotRmsAbvGrd     LotArea OverallCond BedroomAbvGr
## SalePrice     0.56066376   0.53372316  0.26384335 -0.07785589   0.16821315
## GrLivArea     0.63001165   0.82548937  0.26311617 -0.07968587   0.52126951
## OverallQual   0.55059971   0.42745234  0.10580574 -0.09193234   0.10167636
## YearBuilt     0.46827079   0.09558913  0.01422765 -0.37598320  -0.07065122
## X1stFlrSF     0.38063749   0.40951598  0.29947458 -0.14420278   0.12740075
## FullBath      1.00000000   0.55478425  0.12603063 -0.19414949   0.36325198
## TotRmsAbvGrd  0.55478425   1.00000000  0.19001478 -0.05758317   0.67661994
## LotArea       0.12603063   0.19001478  1.00000000 -0.00563627   0.11968991
## OverallCond  -0.19414949  -0.05758317 -0.00563627  1.00000000   0.01298006
## BedroomAbvGr  0.36325198   0.67661994  0.11968991  0.01298006   1.00000000
## YrSold       -0.01966884  -0.03451635 -0.01426141  0.04394975  -0.03601389
## LotFrontage           NA           NA          NA          NA           NA
##                   YrSold LotFrontage
## SalePrice    -0.02892259          NA
## GrLivArea    -0.03652582          NA
## OverallQual  -0.02734671          NA
## YearBuilt    -0.01361768          NA
## X1stFlrSF    -0.01360377          NA
## FullBath     -0.01966884          NA
## TotRmsAbvGrd -0.03451635          NA
## LotArea      -0.01426141          NA
## OverallCond   0.04394975          NA
## BedroomAbvGr -0.03601389          NA
## YrSold        1.00000000          NA
## LotFrontage           NA           1
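The NA entries come from LotFrontage, which has 259 missing values (see the summary above); cor() returns NA whenever a pair of observations contains an NA. A sketch of one way to recover those correlations, using only pairwise-complete observations:

# recompute the matrix, dropping NAs pairwise for each correlation
cor_matrix1_cc <- cor(train_selection1, use = "pairwise.complete.obs")
round(cor_matrix1_cc["SalePrice", "LotFrontage"], 3)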

2.1.2 Iteration 2

The selection is refined to 3 variables (SalePrice, GrLivArea, OverallQual), which are strongly correlated with one another.

# make new selection based on scatterplot results
train_selection2 <- train_selection1 %>% dplyr::select(SalePrice, GrLivArea, OverallQual)
pairs(train_selection2)

# create correlation matrix with 3 variables - using pearson
cor_matrix2 <- cor(train_selection2)
cor_matrix2
##             SalePrice GrLivArea OverallQual
## SalePrice   1.0000000 0.7086245   0.7909816
## GrLivArea   0.7086245 1.0000000   0.5930074
## OverallQual 0.7909816 0.5930074   1.0000000
# 3 pairwise correlations to test in the next section

2.1.3 Correlation Test

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

We test the hypothesis that the correlation between each pairwise set of variables is 0, and provide an 80% confidence interval. There are 3 pairs to test.

The null hypothesis for each pair is that the correlation between the two variables is 0; we look to reject or fail to reject it based on the p-value.

cor.test(train_selection2$SalePrice, train_selection2$GrLivArea, conf.level = .80)
## 
##  Pearson's product-moment correlation
## 
## data:  train_selection2$SalePrice and train_selection2$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6915087 0.7249450
## sample estimates:
##       cor 
## 0.7086245
cor.test(train_selection2$SalePrice, train_selection2$OverallQual, conf.level = .80)
## 
##  Pearson's product-moment correlation
## 
## data:  train_selection2$SalePrice and train_selection2$OverallQual
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.7780752 0.8032204
## sample estimates:
##       cor 
## 0.7909816
cor.test(train_selection2$GrLivArea, train_selection2$OverallQual, conf.level = .80)
## 
##  Pearson's product-moment correlation
## 
## data:  train_selection2$GrLivArea and train_selection2$OverallQual
## t = 28.121, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5708061 0.6143422
## sample estimates:
##       cor 
## 0.5930074

From the results of the tests, the variables are correlated. The output of each correlation test (using the Pearson method) shows a reasonably strong correlation, and in each case we can reject the null hypothesis that the correlation is 0. The 80% confidence intervals are provided in the output of each test, and none of them contains 0. There is not much cause for concern about familywise error, that is, committing at least one Type I error across the set of tests: only 3 hypothesis tests are performed, and with p-values below 2.2e-16 the chance of incorrectly rejecting the null in any of them is negligible. Type I errors should always be a consideration, but here we can be confident in the results of the correlation tests.
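For reference, with 3 independent tests at alpha = 0.05 the familywise error rate would be 1 - (1 - 0.05)^3 ≈ 0.14 if every null were true, and a Bonferroni adjustment would simply compare each p-value to 0.05/3; with p-values this small, neither changes any conclusion. A quick sketch:

# familywise error rate for 3 independent tests at alpha = 0.05
alpha <- 0.05
1 - (1 - alpha)^3

# Bonferroni-adjusted significance threshold
alpha / 3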

2.2 Linear Algebra and Correlation.

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

# invert the correlation matrix to get precision matrix
precision_matrix <- solve(cor_matrix2)
precision_matrix
##             SalePrice  GrLivArea OverallQual
## SalePrice    3.498623 -1.2927630  -2.0007280
## GrLivArea   -1.292763  2.0200794  -0.1753704
## OverallQual -2.000728 -0.1753704   2.6865350
#multiply correlation matrix by precision matrix
mat <- cor_matrix2 %*% precision_matrix
mat
##                 SalePrice     GrLivArea OverallQual
## SalePrice    1.000000e+00  0.000000e+00           0
## GrLivArea   -4.440892e-16  1.000000e+00           0
## OverallQual -4.440892e-16 -8.326673e-17           1
# LU decomposition via Gaussian elimination (Doolittle method, no pivoting)
LU.factorize <- function(A){
  # L starts as an identity matrix and will hold the row multipliers
  L <- diag(nrow(A))
  
  # U starts as a copy of A and is reduced to upper-triangular form
  U <- A

  # eliminate the entries below the diagonal, column by column
  for(i in 2:nrow(A)){
    for(j in 1:(i-1)){
      
      # multiplier that zeroes out U[i,j], using row j as the pivot row
      multiplier <- U[i,j]/U[j,j]
      U[i,] <- U[i,] - multiplier*U[j,]
      
      # store the multiplier in the lower triangle of L
      L[i,j] <- multiplier
    }
  }     
  print("U")
  print(U)
  print("L")
  print(L)
  print("L times U equals A ?")
  L%*%U == A

}

LU.factorize(mat)
## [1] "U"
##             SalePrice GrLivArea OverallQual
## SalePrice           1         0           0
## GrLivArea           0         1           0
## OverallQual         0         0           1
## [1] "L"
##               [,1]          [,2] [,3]
## [1,]  1.000000e+00  0.000000e+00    0
## [2,] -4.440892e-16  1.000000e+00    0
## [3,] -4.440892e-16 -8.326673e-17    1
## [1] "L times U equals A ?"
##      SalePrice GrLivArea OverallQual
## [1,]      TRUE      TRUE        TRUE
## [2,]      TRUE      TRUE        TRUE
## [3,]      TRUE      TRUE        TRUE
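As a cross-check of the hand-rolled factorization, the Matrix package provides an LU routine (with pivoting). This is a sketch and assumes the Matrix package is installed; for this well-behaved matrix the permutation should be trivial, so its L and U factors should agree with the ones above up to floating-point noise.

library(Matrix)

# LU factorization via LAPACK; expand() returns the L, U, and permutation factors
decomp <- expand(lu(mat))
decomp$L
decomp$U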

2.3 Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable.

2.3.1 Right-skewed variable

# find a variable that is skewed to the right
hist(train$X1stFlrSF, breaks=30)

The histogram of X1stFlrSF is skewed to the right, so finding probabilities under an assumption of normality would lead to inaccuracies.

2.3.2 Fit the exponential distribution

The parameter of interest is the rate, most often referred to as lambda (λ).

require(MASS)
x <- fitdistr(train$X1stFlrSF, densfun = 'exponential')

# find the rate, lambda
lambda <- x$estimate

#generate exponential random numbers
random_exp <- rexp(1000,lambda)

# plot the result
hist(random_exp, breaks=100)

This now resembles a classic exponential distribution.
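To make the comparison the assignment asks for, the simulated exponential sample and the original variable can be plotted side by side. A sketch, reusing random_exp from above:

# side-by-side comparison of the simulated and original distributions
par(mfrow = c(1, 2))
hist(random_exp, breaks = 50, main = "Simulated exponential", xlab = "value")
hist(train$X1stFlrSF, breaks = 50, main = "X1stFlrSF (original)", xlab = "square feet")
par(mfrow = c(1, 1))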

2.3.3 5th and 95th percentiles

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Here these are calculated from the empirical cumulative distribution function of the simulated exponential sample.

cdf <- ecdf(random_exp)
plot(cdf)

exp_quantiles <- quantile(cdf,c(.05,.95))
exp_quantiles
##         5%        95% 
##   49.05657 3698.80675
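The theoretical percentiles of the fitted exponential can also be obtained directly from its quantile function using the estimated rate; a sketch (these should be close to the values above, up to sampling noise in the 1,000 draws):

# theoretical 5th and 95th percentiles of an exponential with the fitted rate
qexp(c(0.05, 0.95), rate = lambda)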

2.3.4 Assuming normality

Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

# 95% confidence interval assuming normality 
mean <- mean(train$X1stFlrSF)
sd <- sd(train$X1stFlrSF)
err <- qnorm(.975)*sd/sqrt(nrow(train))
low_bound <- mean -err
up_bound <- mean + err
paste("95% of the data is is between ", round(low_bound,3), " and ", round(up_bound,3)) 
## [1] "95% of the data is is between  1142.797  and  1182.457"
# empirical 5th and 95 percentile
quantile(train$X1stFlrSF, c(.05, 0.95)) 
##      5%     95% 
##  672.95 1831.25

The three approaches give very different results. The exponential percentiles (roughly 49 and 3,699) are far from the empirical 5th and 95th percentiles (673 and 1,831). This is not an issue with the optimization of λ - fitdistr returns the maximum likelihood estimate, which for the exponential is simply 1/mean(x) - but rather an indication that the exponential distribution is a poor model for this variable: the data are right-skewed but have a minimum well above zero and a clear mode, whereas an exponential density is highest at zero. The normal-theory interval is much narrower because it is a confidence interval for the mean, not a range meant to cover 90% of the data.
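For an apples-to-apples comparison with the empirical percentiles, the 5th and 95th percentiles implied by a normal assumption can also be computed. A sketch using the mean and sd variables defined above:

# 5th and 95th percentiles implied by a normal model for X1stFlrSF
qnorm(c(0.05, 0.95), mean = mean, sd = sd)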

2.4 Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Using the variables selected in a previous section, build a model and then remove variables with high p-values using backward elimination. The goal is to avoid overfitting by reducing the number of explanatory variables. (A sketch of an automated alternative follows below.)
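An automated alternative to the manual elimination below is stepwise selection; step() uses AIC rather than raw p-values, but tends to arrive at a similar reduced model. A sketch on the complete cases of the earlier 12-variable selection:

# AIC-based backward elimination on complete cases of the earlier selection
train_cc <- na.omit(train_selection1)
lm_full  <- lm(SalePrice ~ ., data = train_cc)
lm_step  <- step(lm_full, direction = "backward", trace = 0)
summary(lm_step)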

2.4.1 Linear Model 1

lm1 <- lm(SalePrice~ GrLivArea + OverallQual + YearBuilt + X1stFlrSF + FullBath + TotRmsAbvGrd + LotArea + OverallCond + BedroomAbvGr + YrSold + LotFrontage, data=train) 
summary(lm1)
## 
## Call:
## lm(formula = SalePrice ~ GrLivArea + OverallQual + YearBuilt + 
##     X1stFlrSF + FullBath + TotRmsAbvGrd + LotArea + OverallCond + 
##     BedroomAbvGr + YrSold + LotFrontage, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -513412  -18598   -1289   15964  279374 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.017e+05  1.739e+06  -0.403  0.68674    
## GrLivArea     5.239e+01  4.914e+00  10.661  < 2e-16 ***
## OverallQual   2.208e+04  1.348e+03  16.384  < 2e-16 ***
## YearBuilt     5.926e+02  5.459e+01  10.855  < 2e-16 ***
## X1stFlrSF     2.687e+01  4.110e+00   6.537 9.31e-11 ***
## FullBath     -1.289e+03  3.070e+03  -0.420  0.67471    
## TotRmsAbvGrd  4.330e+03  1.448e+03   2.991  0.00283 ** 
## LotArea       8.430e-01  1.641e-01   5.137 3.26e-07 ***
## OverallCond   6.957e+03  1.176e+03   5.915 4.34e-09 ***
## BedroomAbvGr -1.214e+04  2.031e+03  -5.977 3.00e-09 ***
## YrSold       -2.851e+02  8.645e+02  -0.330  0.74161    
## LotFrontage   4.343e+01  5.780e+01   0.751  0.45256    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39740 on 1189 degrees of freedom
##   (259 observations deleted due to missingness)
## Multiple R-squared:  0.7749, Adjusted R-squared:  0.7728 
## F-statistic: 372.2 on 11 and 1189 DF,  p-value: < 2.2e-16

2.4.2 Linear Model 2

lm2 <- lm(SalePrice~ GrLivArea + OverallQual + YearBuilt + X1stFlrSF + TotRmsAbvGrd + LotArea + OverallCond + BedroomAbvGr, data=train)
summary(lm2)
## 
## Call:
## lm(formula = SalePrice ~ GrLivArea + OverallQual + YearBuilt + 
##     X1stFlrSF + TotRmsAbvGrd + LotArea + OverallCond + BedroomAbvGr, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -496265  -17810   -1641   15359  283630 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.243e+06  8.796e+04 -14.133  < 2e-16 ***
## GrLivArea     5.205e+01  4.072e+00  12.783  < 2e-16 ***
## OverallQual   2.124e+04  1.145e+03  18.559  < 2e-16 ***
## YearBuilt     5.805e+02  4.486e+01  12.940  < 2e-16 ***
## X1stFlrSF     2.915e+01  3.316e+00   8.790  < 2e-16 ***
## TotRmsAbvGrd  3.932e+03  1.253e+03   3.138  0.00174 ** 
## LotArea       6.989e-01  1.050e-01   6.656 3.97e-11 ***
## OverallCond   6.687e+03  9.803e+02   6.821 1.32e-11 ***
## BedroomAbvGr -1.144e+04  1.735e+03  -6.594 5.97e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37630 on 1451 degrees of freedom
## Multiple R-squared:  0.7769, Adjusted R-squared:  0.7757 
## F-statistic: 631.6 on 8 and 1451 DF,  p-value: < 2.2e-16

2.4.3 Linear Model 3

lm3 <- lm(SalePrice~ GrLivArea + OverallQual + YearBuilt , data=train)
summary(lm3)
## 
## Call:
## lm(formula = SalePrice ~ GrLivArea + OverallQual + YearBuilt, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -393773  -22639   -2424   18437  290554 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.053e+06  8.376e+04  -12.57   <2e-16 ***
## GrLivArea    6.209e+01  2.581e+00   24.06   <2e-16 ***
## OverallQual  2.520e+04  1.172e+03   21.50   <2e-16 ***
## YearBuilt    5.001e+02  4.409e+01   11.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40750 on 1456 degrees of freedom
## Multiple R-squared:  0.7374, Adjusted R-squared:  0.7368 
## F-statistic:  1363 on 3 and 1456 DF,  p-value: < 2.2e-16
  • F-statistic: the small p-value suggests that the relationship between the explanatory variables and the response is statistically significant
  • Standard errors: small relative to the magnitudes of the coefficients
  • p-values: small p-values suggest statistically significant coefficients
  • R^2: there is a relatively strong relationship between the explanatory variables and the response. Compared with lm1, which used 11 predictors, R-squared dropped by only about 0.04 (0.7749 to 0.7374).
  • Overall, the summary diagnostics for this simpler model are favorable.

An analysis is performed to provide diagnostics on the residuals.

# visual residual analysis 
train %>% 
  ggplot(aes(x=fitted(lm3),y=resid(lm3))) +
    geom_point()

qqnorm(resid(lm3))
qqline(resid(lm3))

The residual plots give a less favorable view of the model. The residual variance is not constant, especially at the high end of the fitted values. The Q-Q plot also shows the residuals deviating from a normal distribution in both tails.

2.4.4 Kaggle

username: RobetWelk Score: 0.61514

test <- read.csv(file="C:/NYBackup/DATA605/regression/test.csv") %>% 
  dplyr::select(Id, GrLivArea, OverallQual, YearBuilt)

predictions <- predict(lm3,test)
prediction_df <- data.frame(cbind(Id=test$Id, SalePrice=predictions))

prediction_df %>% 
  write.csv("C:/NYBackup/DATA605/regression/predictions_rjw.csv")