Problem 1.

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \[\mu=\sigma=(N+1)/2\]

N <- 10
count <- 10000
mu <- (N+1)/2
sigma <- mu

X <- runif(count, min=1, max=N)
median_x <- median(X)
Y <- rnorm(count, mean=mu, sd = sigma)
firstq_y <- quantile(Y, c(0.25), names=FALSE)

data <- data.frame(x = X, y = Y)

Probability.

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

  1. \[P(X>x \mid X>y)\]
given <- X[ X > firstq_y ]
X_gt_x <- given[ given > median_x ]
length(X_gt_x) / length(given)
## [1] 0.5513287

The probability that X is greater than the median of x given that X is also greater than the first quartile of Y is 0.5513287.

  1. \[P(X>x, Y>y)\]
nrow(data %>% filter(X > median_x & Y > firstq_y)) / nrow(data)
## [1] 0.3768

The probability that X is greater than the median of x and Y is greater than the first quartile of Y is 0.3768.

  1. \[P(X<x \mid X>y)\]
# should be 1-P(X>x | X>y) 
given <- X[ X > firstq_y ]
X_lt_x <- given[ given < median_x ]
length(X_lt_x) / length(given)
## [1] 0.4486713

The probability that X is less than the median of x given that Y is greater than the first quartile of Y is 0.4486713.

Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

mat <- matrix(c( nrow( data %>% filter(X < median_x & Y < firstq_y) ), # X < x & Y < y
                 nrow( data %>% filter(X > median_x & Y < firstq_y) ), # X > x & Y < y
                 nrow( data %>% filter(X < median_x & Y > firstq_y) ), # X < x & Y > y
                 nrow( data %>% filter(X > median_x & Y > firstq_y) ) # X > x & Y > y
              ) , nrow=2)

colnames(mat) <- c('Y lt y','Y gt y')
rownames(mat) <- c('X lt x','X gt x')

mat <- as.table(mat)
mat
##        Y lt y Y gt y
## X lt x   1268   3732
## X gt x   1232   3768
# Marginal probability
p_x_gt_x <- margin.table(mat,1)[2] / margin.table(mat)
p_y_gt_y <- margin.table(mat,2)[2] / margin.table(mat)
p_x_gt_x %*% p_y_gt_y
##       [,1]
## [1,] 0.375
# Joint probability 
p_xgtx_and_ygty = mat[2,2] / margin.table(mat)
p_xgtx_and_ygty
## [1] 0.3768

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

summary(mat) # chi-sq
## Number of cases in table: 10000 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 0.6912, df = 1, p-value = 0.4058
fisher.test(mat) # fisher's
## 
##  Fisher's Exact Test for Count Data
## 
## data:  mat
## p-value = 0.4189
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9482096 1.1388313
## sample estimates:
## odds ratio 
##   1.039161

Both the chi squared test and the fisher’s exact test have p-values greater than 0.05, so we cannot reject the null hypothesis that the ratios vary. The fisher’s exact test is for comparing two proportions that is conditional on the marginal frequencies. It can however be much less accurate when p-values are too large. In this sense ordinary chi-square tests are generally “more accurate” than “exact” tests.

Problem 2.

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

data_samplesub <- read_csv("sample_submission.csv")
data_test <- read_csv("test.csv")
data_train <- read_csv("train.csv")

train <- data_train %>% dplyr::select(-Id)
train_colnames <- names(train)

Descriptive and Inferential Statistics.

Provide univariate descriptive statistics and appropriate plots for the training data set.
dim(train)
## [1] 1460   80
str(train)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 1460 obs. of  80 variables:
##  $ MSSubClass   : num  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : num  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : num  8450 9600 11250 9550 14260 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr  NA NA NA NA ...
##  $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : num  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : num  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : num  2003 1976 2001 1915 2000 ...
##  $ YearRemodAdd : num  2003 1976 2002 1970 2000 ...
##  $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ MasVnrType   : chr  "BrkFace" "None" "BrkFace" "None" ...
##  $ MasVnrArea   : num  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtQual     : chr  "Gd" "Gd" "Gd" "TA" ...
##  $ BsmtCond     : chr  "TA" "TA" "TA" "Gd" ...
##  $ BsmtExposure : chr  "No" "Gd" "Mn" "No" ...
##  $ BsmtFinType1 : chr  "GLQ" "ALQ" "GLQ" "ALQ" ...
##  $ BsmtFinSF1   : num  706 978 486 216 655 ...
##  $ BsmtFinType2 : chr  "Unf" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : num  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : num  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : num  856 1262 920 756 1145 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ 1stFlrSF     : num  856 1262 920 961 1145 ...
##  $ 2ndFlrSF     : num  854 0 866 756 1053 ...
##  $ LowQualFinSF : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : num  1710 1262 1786 1717 2198 ...
##  $ BsmtFullBath : num  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : num  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : num  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : num  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : num  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : num  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : num  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : chr  NA "TA" "TA" "Gd" ...
##  $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Detchd" ...
##  $ GarageYrBlt  : num  2003 1976 2001 1998 2000 ...
##  $ GarageFinish : chr  "RFn" "RFn" "RFn" "Unf" ...
##  $ GarageCars   : num  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : num  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : num  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : num  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: num  0 0 0 272 0 0 0 228 205 0 ...
##  $ 3SsnPorch    : num  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr  NA NA NA NA ...
##  $ Fence        : chr  NA NA NA NA ...
##  $ MiscFeature  : chr  NA NA NA NA ...
##  $ MiscVal      : num  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : num  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : num  2008 2007 2008 2006 2008 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : num  208500 181500 223500 140000 250000 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   MSSubClass = col_double(),
##   ..   MSZoning = col_character(),
##   ..   LotFrontage = col_double(),
##   ..   LotArea = col_double(),
##   ..   Street = col_character(),
##   ..   Alley = col_character(),
##   ..   LotShape = col_character(),
##   ..   LandContour = col_character(),
##   ..   Utilities = col_character(),
##   ..   LotConfig = col_character(),
##   ..   LandSlope = col_character(),
##   ..   Neighborhood = col_character(),
##   ..   Condition1 = col_character(),
##   ..   Condition2 = col_character(),
##   ..   BldgType = col_character(),
##   ..   HouseStyle = col_character(),
##   ..   OverallQual = col_double(),
##   ..   OverallCond = col_double(),
##   ..   YearBuilt = col_double(),
##   ..   YearRemodAdd = col_double(),
##   ..   RoofStyle = col_character(),
##   ..   RoofMatl = col_character(),
##   ..   Exterior1st = col_character(),
##   ..   Exterior2nd = col_character(),
##   ..   MasVnrType = col_character(),
##   ..   MasVnrArea = col_double(),
##   ..   ExterQual = col_character(),
##   ..   ExterCond = col_character(),
##   ..   Foundation = col_character(),
##   ..   BsmtQual = col_character(),
##   ..   BsmtCond = col_character(),
##   ..   BsmtExposure = col_character(),
##   ..   BsmtFinType1 = col_character(),
##   ..   BsmtFinSF1 = col_double(),
##   ..   BsmtFinType2 = col_character(),
##   ..   BsmtFinSF2 = col_double(),
##   ..   BsmtUnfSF = col_double(),
##   ..   TotalBsmtSF = col_double(),
##   ..   Heating = col_character(),
##   ..   HeatingQC = col_character(),
##   ..   CentralAir = col_character(),
##   ..   Electrical = col_character(),
##   ..   `1stFlrSF` = col_double(),
##   ..   `2ndFlrSF` = col_double(),
##   ..   LowQualFinSF = col_double(),
##   ..   GrLivArea = col_double(),
##   ..   BsmtFullBath = col_double(),
##   ..   BsmtHalfBath = col_double(),
##   ..   FullBath = col_double(),
##   ..   HalfBath = col_double(),
##   ..   BedroomAbvGr = col_double(),
##   ..   KitchenAbvGr = col_double(),
##   ..   KitchenQual = col_character(),
##   ..   TotRmsAbvGrd = col_double(),
##   ..   Functional = col_character(),
##   ..   Fireplaces = col_double(),
##   ..   FireplaceQu = col_character(),
##   ..   GarageType = col_character(),
##   ..   GarageYrBlt = col_double(),
##   ..   GarageFinish = col_character(),
##   ..   GarageCars = col_double(),
##   ..   GarageArea = col_double(),
##   ..   GarageQual = col_character(),
##   ..   GarageCond = col_character(),
##   ..   PavedDrive = col_character(),
##   ..   WoodDeckSF = col_double(),
##   ..   OpenPorchSF = col_double(),
##   ..   EnclosedPorch = col_double(),
##   ..   `3SsnPorch` = col_double(),
##   ..   ScreenPorch = col_double(),
##   ..   PoolArea = col_double(),
##   ..   PoolQC = col_character(),
##   ..   Fence = col_character(),
##   ..   MiscFeature = col_character(),
##   ..   MiscVal = col_double(),
##   ..   MoSold = col_double(),
##   ..   YrSold = col_double(),
##   ..   SaleType = col_character(),
##   ..   SaleCondition = col_character(),
##   ..   SalePrice = col_double()
##   .. )
train %>% 
  select_if(is.numeric) %>% 
  skimr::skim()
## Skim summary statistics
##  n obs: 1460 
##  n variables: 37 
## 
## ── Variable type:numeric ────────────────────────────────────────────────────────────────────────────────────
##       variable missing complete    n       mean       sd    p0       p25
##       1stFlrSF       0     1460 1460   1162.63    386.59   334    882   
##       2ndFlrSF       0     1460 1460    346.99    436.53     0      0   
##      3SsnPorch       0     1460 1460      3.41     29.32     0      0   
##   BedroomAbvGr       0     1460 1460      2.87      0.82     0      2   
##     BsmtFinSF1       0     1460 1460    443.64    456.1      0      0   
##     BsmtFinSF2       0     1460 1460     46.55    161.32     0      0   
##   BsmtFullBath       0     1460 1460      0.43      0.52     0      0   
##   BsmtHalfBath       0     1460 1460      0.058     0.24     0      0   
##      BsmtUnfSF       0     1460 1460    567.24    441.87     0    223   
##  EnclosedPorch       0     1460 1460     21.95     61.12     0      0   
##     Fireplaces       0     1460 1460      0.61      0.64     0      0   
##       FullBath       0     1460 1460      1.57      0.55     0      1   
##     GarageArea       0     1460 1460    472.98    213.8      0    334.5 
##     GarageCars       0     1460 1460      1.77      0.75     0      1   
##    GarageYrBlt      81     1379 1460   1978.51     24.69  1900   1961   
##      GrLivArea       0     1460 1460   1515.46    525.48   334   1129.5 
##       HalfBath       0     1460 1460      0.38      0.5      0      0   
##   KitchenAbvGr       0     1460 1460      1.05      0.22     0      1   
##        LotArea       0     1460 1460  10516.83   9981.26  1300   7553.5 
##    LotFrontage     259     1201 1460     70.05     24.28    21     59   
##   LowQualFinSF       0     1460 1460      5.84     48.62     0      0   
##     MasVnrArea       8     1452 1460    103.69    181.07     0      0   
##        MiscVal       0     1460 1460     43.49    496.12     0      0   
##         MoSold       0     1460 1460      6.32      2.7      1      5   
##     MSSubClass       0     1460 1460     56.9      42.3     20     20   
##    OpenPorchSF       0     1460 1460     46.66     66.26     0      0   
##    OverallCond       0     1460 1460      5.58      1.11     1      5   
##    OverallQual       0     1460 1460      6.1       1.38     1      5   
##       PoolArea       0     1460 1460      2.76     40.18     0      0   
##      SalePrice       0     1460 1460 180921.2   79442.5  34900 129975   
##    ScreenPorch       0     1460 1460     15.06     55.76     0      0   
##    TotalBsmtSF       0     1460 1460   1057.43    438.71     0    795.75
##   TotRmsAbvGrd       0     1460 1460      6.52      1.63     2      5   
##     WoodDeckSF       0     1460 1460     94.24    125.34     0      0   
##      YearBuilt       0     1460 1460   1971.27     30.2   1872   1954   
##   YearRemodAdd       0     1460 1460   1984.87     20.65  1950   1967   
##         YrSold       0     1460 1460   2007.82      1.33  2006   2007   
##       p50       p75   p100     hist
##    1087     1391.25   4692 ▃▇▃▁▁▁▁▁
##       0      728      2065 ▇▁▂▂▁▁▁▁
##       0        0       508 ▇▁▁▁▁▁▁▁
##       3        3         8 ▁▃▇▂▁▁▁▁
##     383.5    712.25   5644 ▇▂▁▁▁▁▁▁
##       0        0      1474 ▇▁▁▁▁▁▁▁
##       0        1         3 ▇▁▆▁▁▁▁▁
##       0        0         2 ▇▁▁▁▁▁▁▁
##     477.5    808      2336 ▇▆▅▂▂▁▁▁
##       0        0       552 ▇▁▁▁▁▁▁▁
##       1        1         3 ▇▁▇▁▁▁▁▁
##       2        2         3 ▁▁▇▁▁▇▁▁
##     480      576      1418 ▁▅▇▅▂▁▁▁
##       2        2         4 ▁▃▁▇▁▂▁▁
##    1980     2002      2010 ▁▁▁▂▅▃▃▇
##    1464     1776.75   5642 ▂▇▅▁▁▁▁▁
##       0        1         2 ▇▁▁▅▁▁▁▁
##       1        1         3 ▁▁▇▁▁▁▁▁
##    9478.5  11601.5  215245 ▇▁▁▁▁▁▁▁
##      69       80       313 ▃▇▁▁▁▁▁▁
##       0        0       572 ▇▁▁▁▁▁▁▁
##       0      166      1600 ▇▂▁▁▁▁▁▁
##       0        0     15500 ▇▁▁▁▁▁▁▁
##       6        8        12 ▂▂▇▆▆▅▂▃
##      50       70       190 ▇▆▂▁▁▁▁▁
##      25       68       547 ▇▂▁▁▁▁▁▁
##       5        6         9 ▁▁▁▇▂▂▁▁
##       6        7        10 ▁▁▂▇▇▆▃▁
##       0        0       738 ▇▁▁▁▁▁▁▁
##  163000   214000    755000 ▃▇▂▁▁▁▁▁
##       0        0       480 ▇▁▁▁▁▁▁▁
##     991.5   1298.25   6110 ▂▇▂▁▁▁▁▁
##       6        7        14 ▁▆▆▇▁▁▁▁
##       0      168       857 ▇▃▁▁▁▁▁▁
##    1973     2000      2010 ▁▁▂▂▃▅▃▇
##    1994     2004      2010 ▅▂▂▂▁▂▅▇
##    2008     2009      2010 ▇▇▁▇▁▇▁▅
train %>% ggplot(aes(SalePrice)) +
  geom_histogram(fill="blue", binwidth = 10000) 

train %>% ggplot( aes(x=factor(TotRmsAbvGrd), y=SalePrice))+
  geom_boxplot() + labs(x='Rooms') +
  scale_y_continuous()

train %>% ggplot(aes(x =YearBuilt)) +
  geom_histogram(fill="blue") 

train %>% ggplot( aes(x=factor(GrLivArea), y=SalePrice))+
  geom_point(fill="blue") + labs(x='Living Area') 

train %>% ggplot( aes(x=factor(GarageCars), y=SalePrice))+
  geom_boxplot() + labs(x='Garage Cars')

train %>% ggplot(aes(x=factor(OverallQual), y=SalePrice)) +
   geom_boxplot() + labs(x='Overall Quality of House') 

train %>% ggplot( aes(x=factor(TotalBsmtSF), y=SalePrice))+
  geom_point() + labs(x='Basement Square Footage') 

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
pairs(~ OverallQual + YearBuilt + 
        GarageCars + GrLivArea + TotalBsmtSF + TotRmsAbvGrd +
        YrSold + SalePrice, data = train, main = "House Prices")

Derive a correlation matrix for any three quantitative variables in the dataset.
train %>%
  dplyr::select(LotArea, OverallQual, OverallCond, GrLivArea, TotRmsAbvGrd, GarageCars, SalePrice ) %>%
  cor(use="pairwise.complete.obs") %>%
  corrplot()

train %>%
  dplyr::select( GrLivArea, OverallQual, GarageCars, SalePrice) %>%
  cor(use="pairwise.complete.obs") %>%
  corrplot()

cor_var <- train %>%
  dplyr::select( GrLivArea, OverallQual, SalePrice) %>%
  cor(use="pairwise.complete.obs")
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.
Pearson’s Corellation
cor.test(train$OverallQual, train$SalePrice, method = 'pearson', conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train$OverallQual and train$SalePrice
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.7780752 0.8032204
## sample estimates:
##       cor 
## 0.7909816

The p-value is very close to 0 and the confidence interval is between 0.579 and 0.622, with a positive correlation of 0.601.

cor.test(train$GrLivArea, train$SalePrice, method = 'pearson', conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train$GrLivArea and train$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6915087 0.7249450
## sample estimates:
##       cor 
## 0.7086245

The p-value is again very close to 0 and the confidence interval is between 0.692 and 0.725, with a positive correlation of 0.709.

cor.test(train$GarageCars, train$SalePrice, method = 'pearson', conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train$GarageCars and train$SalePrice
## t = 31.839, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6201771 0.6597899
## sample estimates:
##       cor 
## 0.6404092

The p-value is also very close to 0 and the confidence interval is between 0.620 and 0.659, with a positive correlation of 0.640.

Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

The p-values are all very close to zero and we can therefore reject the null hypothesis that these variables are not correlated to the sales price variable.

The familywise error rate (FWE or FWER) is the probability of a coming to at least one false conclusion in a series of hypothesis tests. Since the p-values are all essentially zero, I would not be vey concerned about familywise error.

Linear Algebra and Correlation.

  • Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
cor_var_inv <- ginv(cor_var) %>% round()
cor_var_inv
##      [,1] [,2] [,3]
## [1,]    2    0   -1
## [2,]    0    3   -2
## [3,]   -1   -2    3
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
cor_var %*% cor_var_inv %>% round()
##             [,1] [,2] [,3]
## GrLivArea      1    0    0
## OverallQual    0    1    0
## SalePrice      0    0    1
cor_var_inv %*% cor_var %>% round()
##      GrLivArea OverallQual SalePrice
## [1,]         1           0         0
## [2,]         0           1         0
## [3,]         0           0         1
Conduct LU decomposition on the matrix.
cor_var
##             GrLivArea OverallQual SalePrice
## GrLivArea   1.0000000   0.5930074 0.7086245
## OverallQual 0.5930074   1.0000000 0.7909816
## SalePrice   0.7086245   0.7909816 1.0000000
Gauss_Elim <- function(a) {
    n <- nrow(a)
    #create an augmented matrix with a and the identity
    a <- cbind(a,diag(n))
    
    #multiply a row by a constant    
    for (i in 1 : n) {
        c <- diag(n)
        c[i, i] <- (1 / a[i, i])
        a <- c %*% a
       
        #if at the last row, stop
        if (i == n) {
            break ()
        }
        #add a multiple of a row to a different row
        for (j in (i + 1) : n) {
            r <- diag(n)
            r[j, i] <- (- a[j, i])
            a <- r %*% a    
        }
    }
    return(a)
}
#call Gauss Elimination function on matrix a
Gauss_Elim(cor_var)
##      GrLivArea OverallQual SalePrice                              
## [1,]         1   0.5930074 0.7086245  1.0000000  0.000000 0.000000
## [2,]         0   1.0000000 0.5718616 -0.9146519  1.542395 0.000000
## [3,]         0   0.0000000 1.0000000 -1.2927630 -2.000728 3.498623
#use Gauss_Elim() to compute Upper Triangular Matrix and Lower Triangular Matrix

#define LU Decomposition function
LU_decomp <- function(a) {
    n <- nrow (a)
    #compute the U matrix (Upper Triangular)
    g <- Gauss_Elim(a)
    #compute the L matrix (Lower Triangular) with multipliers located below the diagonal 
    h <- Gauss_Elim(g[, n + 1 : n])
    return(list(L = h[, n + 1 : n], U = g[, 1 : n]))
}

print(LU <- LU_decomp(cor_var))
## $L
##                                   
## [1,] 1.0000000 0.0000000 0.0000000
## [2,] 0.5930074 0.6483422 0.0000000
## [3,] 0.7086245 0.3707620 0.2858268
## 
## $U
##      GrLivArea OverallQual SalePrice
## [1,]         1   0.5930074 0.7086245
## [2,]         0   1.0000000 0.5718616
## [3,]         0   0.0000000 1.0000000
#show that the LU decomp is equal to matrix a
LU $ L %*% LU $ U
##      GrLivArea OverallQual SalePrice
## [1,] 1.0000000   0.5930074 0.7086245
## [2,] 0.5930074   1.0000000 0.7909816
## [3,] 0.7086245   0.7909816 1.0000000

Calculus-Based Probability & Statistics.

  • Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
#select variable skewed to the right
train %>% ggplot(aes(LotArea)) +
  geom_histogram(fill="blue") 

skimr::skim(train$LotArea)
## 
## Skim summary statistics
## 
## ── Variable type:numeric ────────────────────────────────────────────────────────────────────────────────────
##       variable missing complete    n     mean      sd   p0    p25    p50
##  train$LotArea       0     1460 1460 10516.83 9981.26 1300 7553.5 9478.5
##      p75   p100     hist
##  11601.5 215245 ▇▁▁▁▁▁▁▁
# shift it so that the minimum value is absolutely above zero
train_LotAreaShift <- train %>% mutate(LotArea = LotArea - 1300)
skimr::skim(train_LotAreaShift$LotArea)
## 
## Skim summary statistics
## 
## ── Variable type:numeric ────────────────────────────────────────────────────────────────────────────────────
##                    variable missing complete    n    mean      sd p0
##  train_LotAreaShift$LotArea       0     1460 1460 9216.83 9981.26  0
##     p25    p50     p75   p100     hist
##  6253.5 8178.5 10301.5 213945 ▇▁▁▁▁▁▁▁
#run fitdistr to fit an exponential probability density function.
lotarea_lambda <- fitdistr(train_LotAreaShift$LotArea, densfun = "exponential")
lotarea_lambda
##        rate    
##   1.084972e-04 
##  (2.839501e-06)
#Find optimal value of λ for this distribution, take 1000 samples from this exponential distribution using this value
lotarea_dist <- rexp(1000, lotarea_lambda$estimate)
#Plot a histogram and compare it with a histogram of your original variable
par(mfrow=c(1,2))
hist(lotarea_dist, freq = FALSE, breaks = 30)
hist(train_LotAreaShift$LotArea, freq = FALSE, breaks = 30)

plotdist(lotarea_dist, histo = TRUE, demp = TRUE)

lotarea_origdist <- sample(train$LotArea,  1000, replace=TRUE, prob=NULL)
plotdist(lotarea_origdist, histo = TRUE, demp = TRUE)

#Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF)

quantile(lotarea_dist, c(0.05, 0.95))
##         5%        95% 
##   590.6183 25967.5737
#generate a 95% confidence interval from the empirical data, assuming normality

CI(lotarea_dist, ci = 0.95)
##    upper     mean    lower 
## 9350.913 8802.742 8254.572
#provide the empirical 5th percentile and 95th percentile of the data

quantile(lotarea_origdist, c(.05, .95))
##       5%      95% 
##  4039.95 17134.00

Modeling.

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis.

house_regr_mdl <- lm(SalePrice~ OverallQual + YearBuilt + GarageCars + GrLivArea + TotalBsmtSF + TotRmsAbvGrd + BedroomAbvGr + LotArea + YrSold + MoSold + FullBath + HouseStyle + `1stFlrSF` + `2ndFlrSF` + Neighborhood + YearRemodAdd, data = train)
#summary(house_regr_mdl)

step <- stepAIC(house_regr_mdl, direction="both")
## Start:  AIC=30488.1
## SalePrice ~ OverallQual + YearBuilt + GarageCars + GrLivArea + 
##     TotalBsmtSF + TotRmsAbvGrd + BedroomAbvGr + LotArea + YrSold + 
##     MoSold + FullBath + HouseStyle + `1stFlrSF` + `2ndFlrSF` + 
##     Neighborhood + YearRemodAdd
## 
##                Df  Sum of Sq        RSS   AIC
## - YrSold        1 1.5197e+08 1.6072e+12 30486
## - `1stFlrSF`    1 3.2198e+08 1.6074e+12 30486
## - FullBath      1 1.3114e+09 1.6084e+12 30487
## - GrLivArea     1 1.7234e+09 1.6088e+12 30488
## - MoSold        1 1.7525e+09 1.6088e+12 30488
## <none>                       1.6071e+12 30488
## - `2ndFlrSF`    1 3.8592e+09 1.6109e+12 30490
## - YearBuilt     1 7.4439e+09 1.6145e+12 30493
## - TotRmsAbvGrd  1 1.4610e+10 1.6217e+12 30499
## - BedroomAbvGr  1 2.0971e+10 1.6280e+12 30505
## - TotalBsmtSF   1 2.2727e+10 1.6298e+12 30507
## - YearRemodAdd  1 2.4635e+10 1.6317e+12 30508
## - HouseStyle    7 4.6281e+10 1.6534e+12 30516
## - LotArea       1 4.3176e+10 1.6503e+12 30525
## - GarageCars    1 4.5008e+10 1.6521e+12 30526
## - OverallQual   1 2.0660e+11 1.8137e+12 30663
## - Neighborhood 24 3.1906e+11 1.9261e+12 30704
## 
## Step:  AIC=30486.24
## SalePrice ~ OverallQual + YearBuilt + GarageCars + GrLivArea + 
##     TotalBsmtSF + TotRmsAbvGrd + BedroomAbvGr + LotArea + MoSold + 
##     FullBath + HouseStyle + `1stFlrSF` + `2ndFlrSF` + Neighborhood + 
##     YearRemodAdd
## 
##                Df  Sum of Sq        RSS   AIC
## - `1stFlrSF`    1 3.0496e+08 1.6075e+12 30484
## - FullBath      1 1.3195e+09 1.6085e+12 30485
## - MoSold        1 1.6415e+09 1.6089e+12 30486
## - GrLivArea     1 1.7699e+09 1.6090e+12 30486
## <none>                       1.6072e+12 30486
## - `2ndFlrSF`    1 3.8054e+09 1.6110e+12 30488
## + YrSold        1 1.5197e+08 1.6071e+12 30488
## - YearBuilt     1 7.5380e+09 1.6148e+12 30491
## - TotRmsAbvGrd  1 1.4586e+10 1.6218e+12 30497
## - BedroomAbvGr  1 2.0884e+10 1.6281e+12 30503
## - TotalBsmtSF   1 2.2744e+10 1.6300e+12 30505
## - YearRemodAdd  1 2.4483e+10 1.6317e+12 30506
## - HouseStyle    7 4.6287e+10 1.6535e+12 30514
## - LotArea       1 4.3193e+10 1.6504e+12 30523
## - GarageCars    1 4.5165e+10 1.6524e+12 30525
## - OverallQual   1 2.0656e+11 1.8138e+12 30661
## - Neighborhood 24 3.1902e+11 1.9263e+12 30703
## 
## Step:  AIC=30484.51
## SalePrice ~ OverallQual + YearBuilt + GarageCars + GrLivArea + 
##     TotalBsmtSF + TotRmsAbvGrd + BedroomAbvGr + LotArea + MoSold + 
##     FullBath + HouseStyle + `2ndFlrSF` + Neighborhood + YearRemodAdd
## 
##                Df  Sum of Sq        RSS   AIC
## - FullBath      1 1.2941e+09 1.6088e+12 30484
## - MoSold        1 1.6196e+09 1.6092e+12 30484
## <none>                       1.6075e+12 30484
## + `1stFlrSF`    1 3.0496e+08 1.6072e+12 30486
## + YrSold        1 1.3494e+08 1.6074e+12 30486
## - YearBuilt     1 7.5318e+09 1.6151e+12 30489
## - TotRmsAbvGrd  1 1.4628e+10 1.6222e+12 30496
## - `2ndFlrSF`    1 2.0201e+10 1.6277e+12 30501
## - BedroomAbvGr  1 2.0858e+10 1.6284e+12 30501
## - TotalBsmtSF   1 2.3637e+10 1.6312e+12 30504
## - YearRemodAdd  1 2.4524e+10 1.6321e+12 30505
## - HouseStyle    7 4.8754e+10 1.6563e+12 30514
## - LotArea       1 4.3542e+10 1.6511e+12 30522
## - GarageCars    1 4.5922e+10 1.6535e+12 30524
## - GrLivArea     1 6.3600e+10 1.6711e+12 30539
## - OverallQual   1 2.0669e+11 1.8142e+12 30659
## - Neighborhood 24 3.1903e+11 1.9266e+12 30701
## 
## Step:  AIC=30483.69
## SalePrice ~ OverallQual + YearBuilt + GarageCars + GrLivArea + 
##     TotalBsmtSF + TotRmsAbvGrd + BedroomAbvGr + LotArea + MoSold + 
##     HouseStyle + `2ndFlrSF` + Neighborhood + YearRemodAdd
## 
##                Df  Sum of Sq        RSS   AIC
## - MoSold        1 1.7056e+09 1.6105e+12 30483
## <none>                       1.6088e+12 30484
## + FullBath      1 1.2941e+09 1.6075e+12 30484
## + `1stFlrSF`    1 2.7959e+08 1.6085e+12 30485
## + YrSold        1 1.4317e+08 1.6087e+12 30486
## - YearBuilt     1 6.7102e+09 1.6155e+12 30488
## - TotRmsAbvGrd  1 1.4341e+10 1.6232e+12 30495
## - `2ndFlrSF`    1 2.0399e+10 1.6292e+12 30500
## - BedroomAbvGr  1 2.3258e+10 1.6321e+12 30503
## - YearRemodAdd  1 2.3750e+10 1.6326e+12 30503
## - TotalBsmtSF   1 2.5221e+10 1.6340e+12 30504
## - HouseStyle    7 4.8139e+10 1.6570e+12 30513
## - LotArea       1 4.3567e+10 1.6524e+12 30521
## - GarageCars    1 4.5827e+10 1.6547e+12 30523
## - GrLivArea     1 6.3151e+10 1.6720e+12 30538
## - OverallQual   1 2.0687e+11 1.8157e+12 30658
## - Neighborhood 24 3.2311e+11 1.9319e+12 30703
## 
## Step:  AIC=30483.24
## SalePrice ~ OverallQual + YearBuilt + GarageCars + GrLivArea + 
##     TotalBsmtSF + TotRmsAbvGrd + BedroomAbvGr + LotArea + HouseStyle + 
##     `2ndFlrSF` + Neighborhood + YearRemodAdd
## 
##                Df  Sum of Sq        RSS   AIC
## <none>                       1.6105e+12 30483
## + MoSold        1 1.7056e+09 1.6088e+12 30484
## + FullBath      1 1.3802e+09 1.6092e+12 30484
## + `1stFlrSF`    1 2.5729e+08 1.6103e+12 30485
## + YrSold        1 3.5679e+07 1.6105e+12 30485
## - YearBuilt     1 6.6995e+09 1.6172e+12 30487
## - TotRmsAbvGrd  1 1.4715e+10 1.6252e+12 30494
## - `2ndFlrSF`    1 2.0274e+10 1.6308e+12 30500
## - YearRemodAdd  1 2.3775e+10 1.6343e+12 30503
## - BedroomAbvGr  1 2.4180e+10 1.6347e+12 30503
## - TotalBsmtSF   1 2.5718e+10 1.6362e+12 30504
## - HouseStyle    7 4.7554e+10 1.6581e+12 30512
## - LotArea       1 4.3687e+10 1.6542e+12 30520
## - GarageCars    1 4.5780e+10 1.6563e+12 30522
## - GrLivArea     1 6.2799e+10 1.6733e+12 30537
## - OverallQual   1 2.0549e+11 1.8160e+12 30657
## - Neighborhood 24 3.2156e+11 1.9321e+12 30701
step$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## SalePrice ~ OverallQual + YearBuilt + GarageCars + GrLivArea + 
##     TotalBsmtSF + TotRmsAbvGrd + BedroomAbvGr + LotArea + YrSold + 
##     MoSold + FullBath + HouseStyle + `1stFlrSF` + `2ndFlrSF` + 
##     Neighborhood + YearRemodAdd
## 
## Final Model:
## SalePrice ~ OverallQual + YearBuilt + GarageCars + GrLivArea + 
##     TotalBsmtSF + TotRmsAbvGrd + BedroomAbvGr + LotArea + HouseStyle + 
##     `2ndFlrSF` + Neighborhood + YearRemodAdd
## 
## 
##           Step Df   Deviance Resid. Df   Resid. Dev      AIC
## 1                                 1414 1.607075e+12 30488.10
## 2     - YrSold  1  151967691      1415 1.607227e+12 30486.24
## 3 - `1stFlrSF`  1  304957174      1416 1.607532e+12 30484.51
## 4   - FullBath  1 1294128942      1417 1.608827e+12 30483.69
## 5     - MoSold  1 1705642249      1418 1.610532e+12 30483.24
house_regr_mdl_aic <- lm(SalePrice~OverallQual + YearBuilt + GarageCars + GrLivArea + TotalBsmtSF + TotRmsAbvGrd + BedroomAbvGr + LotArea + HouseStyle + `2ndFlrSF` + Neighborhood + YearRemodAdd, data = train)
summary(house_regr_mdl_aic)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt + GarageCars + 
##     GrLivArea + TotalBsmtSF + TotRmsAbvGrd + BedroomAbvGr + LotArea + 
##     HouseStyle + `2ndFlrSF` + Neighborhood + YearRemodAdd, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -398905  -14372     -59   13885  262903 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -9.525e+05  1.623e+05  -5.869 5.46e-09 ***
## OverallQual          1.581e+04  1.175e+03  13.451  < 2e-16 ***
## YearBuilt            1.717e+02  7.069e+01   2.429 0.015276 *  
## GarageCars           1.068e+04  1.683e+03   6.349 2.91e-10 ***
## GrLivArea            4.008e+01  5.391e+00   7.436 1.79e-13 ***
## TotalBsmtSF          1.854e+01  3.895e+00   4.759 2.15e-06 ***
## TotRmsAbvGrd         4.251e+03  1.181e+03   3.599 0.000330 ***
## BedroomAbvGr        -7.730e+03  1.675e+03  -4.614 4.31e-06 ***
## LotArea              6.383e-01  1.029e-01   6.202 7.30e-10 ***
## HouseStyle1.5Unf     1.349e+04  9.970e+03   1.353 0.176266    
## HouseStyle1Story     2.321e+04  4.659e+03   4.981 7.11e-07 ***
## HouseStyle2.5Fin    -1.535e+04  1.329e+04  -1.155 0.248198    
## HouseStyle2.5Unf    -2.642e+04  1.097e+04  -2.408 0.016150 *  
## HouseStyle2Story    -9.881e+03  4.098e+03  -2.411 0.016032 *  
## HouseStyleSFoyer     2.414e+04  7.235e+03   3.336 0.000871 ***
## HouseStyleSLvl       1.946e+04  5.801e+03   3.355 0.000814 ***
## `2ndFlrSF`           3.145e+01  7.444e+00   4.225 2.54e-05 ***
## NeighborhoodBlueste  1.007e+04  2.546e+04   0.396 0.692511    
## NeighborhoodBrDale   1.051e+04  1.244e+04   0.844 0.398592    
## NeighborhoodBrkSide  3.181e+04  1.065e+04   2.987 0.002869 ** 
## NeighborhoodClearCr  3.335e+04  1.115e+04   2.991 0.002826 ** 
## NeighborhoodCollgCr  2.511e+04  8.867e+03   2.832 0.004686 ** 
## NeighborhoodCrawfor  4.963e+04  1.041e+04   4.768 2.05e-06 ***
## NeighborhoodEdwards  1.440e+04  9.707e+03   1.483 0.138251    
## NeighborhoodGilbert  1.944e+04  9.400e+03   2.068 0.038805 *  
## NeighborhoodIDOTRR   1.727e+04  1.123e+04   1.538 0.124183    
## NeighborhoodMeadowV  1.959e+04  1.248e+04   1.570 0.116687    
## NeighborhoodMitchel  1.641e+04  9.914e+03   1.656 0.098030 .  
## NeighborhoodNAmes    2.244e+04  9.267e+03   2.421 0.015589 *  
## NeighborhoodNoRidge  7.821e+04  1.027e+04   7.612 4.89e-14 ***
## NeighborhoodNPkVill  1.370e+04  1.422e+04   0.963 0.335513    
## NeighborhoodNridgHt  7.329e+04  9.176e+03   7.987 2.83e-15 ***
## NeighborhoodNWAmes   1.727e+04  9.565e+03   1.805 0.071253 .  
## NeighborhoodOldTown  1.079e+04  1.034e+04   1.044 0.296799    
## NeighborhoodSawyer   2.195e+04  9.794e+03   2.241 0.025180 *  
## NeighborhoodSawyerW  2.104e+04  9.592e+03   2.194 0.028408 *  
## NeighborhoodSomerst  3.394e+04  9.167e+03   3.702 0.000222 ***
## NeighborhoodStoneBr  7.675e+04  1.074e+04   7.148 1.40e-12 ***
## NeighborhoodSWISU    2.090e+04  1.220e+04   1.714 0.086829 .  
## NeighborhoodTimber   3.252e+04  1.018e+04   3.195 0.001429 ** 
## NeighborhoodVeenker  5.402e+04  1.326e+04   4.074 4.87e-05 ***
## YearRemodAdd         2.713e+02  5.930e+01   4.575 5.17e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33700 on 1418 degrees of freedom
## Multiple R-squared:  0.8251, Adjusted R-squared:   0.82 
## F-statistic: 163.1 on 41 and 1418 DF,  p-value: < 2.2e-16
preds_test = predict(house_regr_mdl_aic, newdata = data_test)
data_test$SalePrice <- preds_test
house_preds <- data_test %>% dplyr::select(Id, SalePrice) %>% mutate( SalePrice = ifelse(is.na(SalePrice), 0, SalePrice))
write.csv(house_preds, "KaggleHousePricePreds.csv", row.names=FALSE)

Using stepwise AIC variable selection, the resulting multivariate model explains 82.5% of the data with statistically significat p-values for OverallQual, GarageCars, GrLivArea, TotalBsmtSF, BedroomAbvGr, HouseStyle1Storyl, NeighborhoodNoRidge, NeighborhoodNridgHt, and YearRemodAdd. The residual standard error, the standard deviation of the residuals, is 33700 on 1418 degrees of freedom, so the predicted price will deviate from the actual price by a mean of 33700.

Report your Kaggle.com user name and score

Username: stephroark Submission: KaggleHousePricePreds.csv Score: 0.47708