Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu = \sigma = (N+1)/2\).

set.seed(2020)
N <- 8

# mu = sigma = (N + 1)/2
mu <- (N+1) / 2
sigma <- (N + 1) / 2

# Generate the random variable X
X <- runif(10000, min = 1, max = N)

# Generate the random variable Y
Y <- rnorm(10000, mean = mu, sd = sigma)

Part I

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

# Small x is estimated as the median of the X variable
x <- median(X)

# Small y is estimated as the 1st quartile of the Y variable
y <- quantile(Y, 0.25)

The median of X is 4.4247068.
The estimated 1st quartile of Y is 1.4292412.

Part A

P(X > x | X > y)

\[P(A | B) = \frac{P(A \cap B)}{P(B)}\]

# P(X > x & X > y)
a_and_b <- length(X[(X > x) & (X > y)])/length(X)

# P(X > y)
b <- length(X[X > y])/length(X)

# P(X > x | X > y)
probability <- a_and_b / b

Answer: P(X > x | X > y) = 0.534188

Part B

P(X > x, Y > y)

P(X > x, Y > y) = P(X > x) * P(Y > y)

# P(X > x)
a <- length(X[X > x])/length(X)

# P(Y > y)
b <- length(Y[Y > y])/length(Y)  

probability <- a * b

Answer: P(X > x, Y > y) = 0.375

Part C

P(X < x | X > y)

# P(X > x & X > y)
a_and_b <- length(X[(X < x) & (X > y)])/length(X)

# P(X > y)
b <- length(X[X > y])/length(X)

# P(X > x | X > y)
probability <- a_and_b / b

Answer: P(X < x | X > y) = 0.465812

Part II

Investigate whether \(P(X>x\,and\,Y>y) = P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.

a <- length(X[(X > x)])/length(X)
b <- length(Y[(Y > y)])/length(Y)
P(X > x) P(Y > y) P(X > x) * P(Y > y)
0.5 0.75 0.375
  X > x X <= x Total
Y > y 0.375 0.375 0.75
Y <= y 0.125 0.125 0.25
  X > x X <= x Total
Y > y 3739 3761 7500
Y <= y 1261 1239 2500

\(P(X>x\,and\,Y>y) = P(X>x)P(Y>y)\) : True, especially because X and Y are independent.

Part III

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

# Construct a contingency table
contingency <- matrix(c(sum(X>x & Y>y), sum(X>x & Y<=y), sum(X<=x & Y>y), sum(X<=x & Y<=y)), nrow = 2)
colnames(contingency) <- c('X > x', 'X <= x' )
rownames(contingency) <- c('Y > y', 'Y <= y')

contingency
##        X > x X <= x
## Y > y   3739   3761
## Y <= y  1261   1239

Fisher’s Exact Test:

fisher.result <- fisher.test(contingency)
fisher.result
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency
## p-value = 0.6277
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.891325 1.070505
## sample estimates:
## odds ratio 
##  0.9768025

Hypotheses:
\(H_0\): (null) the variables are independent
\(H_1\): the variables are dependent

The Fisher’s Exact Test on the data of X and Y produces a p-value of 0.6276998 which is greater than the significance level of 5%. The null hypothesis, \(H_0\) is not rejected. The variables are independent.

Chi Square Test:

chisquare.result <- chisq.test(contingency)
chisquare.result
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency
## X-squared = 0.2352, df = 1, p-value = 0.6277

Hypotheses:
\(H_0\): (null) the variables are independent
\(H_1\): the variables are dependent

The Chi Square Test on the data of X and Y produces a p-value of 0.6276946. The p-value does not provide any evidence to reject the null hypothesis: the variables are independent.

Summary

According to Matthias Döring, Fisher’s test is “preferable to the chi-squared test because it is an exact test.” However, Fisher’s test can become computationally infeasible in larger same sizes. For this circumstance and where the contingency table dimensions go beyond 2x2, the Chi Square is preferred. However, in this instance, both tests performed well.

Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition.https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following.

Part I

Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

# Import libraries for this section
library(ggplot2)
library(dplyr)
library(tidyr)
library(pander)
# Import the training and test data 
train_df <- read.csv("https://raw.githubusercontent.com/logicalschema/Data/main/DATA%20605/Final/train.csv", stringsAsFactors = FALSE)  
test_df <- read.csv("https://raw.githubusercontent.com/logicalschema/Data/main/DATA%20605/Final/test.csv", stringsAsFactors = FALSE)

# Will use this later
test_labels <- test_df$Id
test_df$SalePrice <- NA
combined_df <- rbind(train_df, test_df)


# Here is a glimpse of the data
head(train_df)
# Here is a summary of the data
summary(train_df)
##        Id           MSSubClass      MSZoning          LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   Mode  :character   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9                      Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                      Max.   :313.00  
##                                                      NA's   :259     
##     LotArea          Street             Alley             LotShape        
##  Min.   :  1300   Length:1460        Length:1460        Length:1460       
##  1st Qu.:  7554   Class :character   Class :character   Class :character  
##  Median :  9478   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10517                                                           
##  3rd Qu.: 11602                                                           
##  Max.   :215245                                                           
##                                                                           
##  LandContour         Utilities          LotConfig          LandSlope        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Neighborhood        Condition1         Condition2          BldgType        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   HouseStyle         OverallQual      OverallCond      YearBuilt   
##  Length:1460        Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  Class :character   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Mode  :character   Median : 6.000   Median :5.000   Median :1973  
##                     Mean   : 6.099   Mean   :5.575   Mean   :1971  
##                     3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##                     Max.   :10.000   Max.   :9.000   Max.   :2010  
##                                                                    
##   YearRemodAdd   RoofStyle           RoofMatl         Exterior1st       
##  Min.   :1950   Length:1460        Length:1460        Length:1460       
##  1st Qu.:1967   Class :character   Class :character   Class :character  
##  Median :1994   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1985                                                           
##  3rd Qu.:2004                                                           
##  Max.   :2010                                                           
##                                                                         
##  Exterior2nd         MasVnrType          MasVnrArea      ExterQual        
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median :   0.0   Mode  :character  
##                                        Mean   : 103.7                     
##                                        3rd Qu.: 166.0                     
##                                        Max.   :1600.0                     
##                                        NA's   :8                          
##   ExterCond          Foundation          BsmtQual           BsmtCond        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median : 383.5   Mode  :character  
##                                        Mean   : 443.6                     
##                                        3rd Qu.: 712.2                     
##                                        Max.   :5644.0                     
##                                                                           
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       Heating         
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Length:1460       
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   Class :character  
##  Median :   0.00   Median : 477.5   Median : 991.5   Mode  :character  
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4                     
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2                     
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0                     
##                                                                        
##   HeatingQC          CentralAir         Electrical          X1stFlrSF   
##  Length:1460        Length:1460        Length:1460        Min.   : 334  
##  Class :character   Class :character   Class :character   1st Qu.: 882  
##  Mode  :character   Mode  :character   Mode  :character   Median :1087  
##                                                           Mean   :1163  
##                                                           3rd Qu.:1391  
##                                                           Max.   :4692  
##                                                                         
##    X2ndFlrSF     LowQualFinSF       GrLivArea     BsmtFullBath   
##  Min.   :   0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000  
##  Median :   0   Median :  0.000   Median :1464   Median :0.0000  
##  Mean   : 347   Mean   :  5.845   Mean   :1515   Mean   :0.4253  
##  3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000  
##  Max.   :2065   Max.   :572.000   Max.   :5642   Max.   :3.0000  
##                                                                  
##   BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.0000   Median :3.000  
##  Mean   :0.05753   Mean   :1.565   Mean   :0.3829   Mean   :2.866  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :3.000   Max.   :2.0000   Max.   :8.000  
##                                                                    
##   KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
##  Min.   :0.000   Length:1460        Min.   : 2.000   Length:1460       
##  1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
##  Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
##  Mean   :1.047                      Mean   : 6.518                     
##  3rd Qu.:1.000                      3rd Qu.: 7.000                     
##  Max.   :3.000                      Max.   :14.000                     
##                                                                        
##    Fireplaces    FireplaceQu         GarageType         GarageYrBlt  
##  Min.   :0.000   Length:1460        Length:1460        Min.   :1900  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:1961  
##  Median :1.000   Mode  :character   Mode  :character   Median :1980  
##  Mean   :0.613                                         Mean   :1979  
##  3rd Qu.:1.000                                         3rd Qu.:2002  
##  Max.   :3.000                                         Max.   :2010  
##                                                        NA's   :81    
##  GarageFinish         GarageCars      GarageArea      GarageQual       
##  Length:1460        Min.   :0.000   Min.   :   0.0   Length:1460       
##  Class :character   1st Qu.:1.000   1st Qu.: 334.5   Class :character  
##  Mode  :character   Median :2.000   Median : 480.0   Mode  :character  
##                     Mean   :1.767   Mean   : 473.0                     
##                     3rd Qu.:2.000   3rd Qu.: 576.0                     
##                     Max.   :4.000   Max.   :1418.0                     
##                                                                        
##   GarageCond         PavedDrive          WoodDeckSF      OpenPorchSF    
##  Length:1460        Length:1460        Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :  0.00   Median : 25.00  
##                                        Mean   : 94.24   Mean   : 46.66  
##                                        3rd Qu.:168.00   3rd Qu.: 68.00  
##                                        Max.   :857.00   Max.   :547.00  
##                                                                         
##  EnclosedPorch      X3SsnPorch      ScreenPorch        PoolArea      
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.00   Median :  0.00   Median :  0.000  
##  Mean   : 21.95   Mean   :  3.41   Mean   : 15.06   Mean   :  2.759  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :552.00   Max.   :508.00   Max.   :480.00   Max.   :738.000  
##                                                                      
##     PoolQC             Fence           MiscFeature           MiscVal        
##  Length:1460        Length:1460        Length:1460        Min.   :    0.00  
##  Class :character   Class :character   Class :character   1st Qu.:    0.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :    0.00  
##                                                           Mean   :   43.49  
##                                                           3rd Qu.:    0.00  
##                                                           Max.   :15500.00  
##                                                                             
##      MoSold           YrSold       SaleType         SaleCondition     
##  Min.   : 1.000   Min.   :2006   Length:1460        Length:1460       
##  1st Qu.: 5.000   1st Qu.:2007   Class :character   Class :character  
##  Median : 6.000   Median :2008   Mode  :character   Mode  :character  
##  Mean   : 6.322   Mean   :2008                                        
##  3rd Qu.: 8.000   3rd Qu.:2009                                        
##  Max.   :12.000   Max.   :2010                                        
##                                                                       
##    SalePrice     
##  Min.   : 34900  
##  1st Qu.:129975  
##  Median :163000  
##  Mean   :180921  
##  3rd Qu.:214000  
##  Max.   :755000  
## 

Cleaning

Data is missing throughout so some cleaning is needed.

na_col <- which(colSums(is.na(combined_df)) > 0)
sort(colSums(sapply(combined_df[na_col], is.na)), decreasing = TRUE)
##       PoolQC  MiscFeature        Alley        Fence    SalePrice  FireplaceQu 
##         2909         2814         2721         2348         1459         1420 
##  LotFrontage  GarageYrBlt GarageFinish   GarageQual   GarageCond   GarageType 
##          486          159          159          159          159          157 
##     BsmtCond BsmtExposure     BsmtQual BsmtFinType2 BsmtFinType1   MasVnrType 
##           82           82           81           80           79           24 
##   MasVnrArea     MSZoning    Utilities BsmtFullBath BsmtHalfBath   Functional 
##           23            4            2            2            2            2 
##  Exterior1st  Exterior2nd   BsmtFinSF1   BsmtFinSF2    BsmtUnfSF  TotalBsmtSF 
##            1            1            1            1            1            1 
##   Electrical  KitchenQual   GarageCars   GarageArea     SaleType 
##            1            1            1            1            1
# Function to clean data: Courtesy of Gabriel Sanchez

clean <- function(df){
  
  quality <- c("None", "Fa", "TA", "Gd", "Ex")
  quality2 <- c("None","Po","Fa", "TA", "Gd", "Ex")
  finish <- c("None"=0, "Unf"=1, "RFn"=2, "Fin"=3)
  exp <- c("None","No", "Mn", "Av", "Gd")
  fin <- c("None","Unf","LwQ","Rec","BLQ","ALQ","GLQ")

  ## Pool
  df[["PoolQC"]][is.na(df[["PoolQC"]])] <- "None"
  #poolqc<-revalue(df[["PoolQC"]], levels=quality)
  poolqc<-factor(df[["PoolQC"]], levels=quality)
  df[["PoolQC"]] <- as.numeric(poolqc)
  
  ## MiscFeature
  df[["MiscFeature"]][is.na(df[["MiscFeature"]])] <- "None"
  df[["MiscFeature"]] <- as.factor(df[["MiscFeature"]])

  ## Alley
  df[["Alley"]][is.na(df[["Alley"]])] <- "None"
  df[["Alley"]] <- as.factor(df[["Alley"]])
  
  ## Fence
  df[["Fence"]][is.na(df[["Fence"]])] <- "None"
  df[["Fence"]] <- as.factor(df[["Fence"]])
  
  ## Fireplace
  df[["FireplaceQu"]][is.na(df[["FireplaceQu"]])] <- "None"
  #fp<-revalue(df[["FireplaceQu"]], levels=quality)
  fp<-factor(df[["FireplaceQu"]], levels=quality2)
  df[["FireplaceQu"]] <- as.numeric(fp)

  ## Lot
  df[["LotFrontage"]][is.na(df[["LotFrontage"]])] <- 69 # media

  ## Garage
  df[["GarageYrBlt"]][is.na(df[["GarageYrBlt"]])] <- 0
  df[["GarageYrBlt"]] <- factor(df[["GarageYrBlt"]])

  df[["GarageFinish"]][is.na(df[["GarageFinish"]])] <- "None"
  #gf <- revalue(df[["GarageFinish"]], finish)
  gf <- factor(df[["GarageFinish"]], finish)
  df[["GarageFinish"]]<-as.integer(gf)
  
  df[["GarageQual"]][is.na(df[["GarageQual"]])] <- "None"
  df[["GarageQual"]] <- as.numeric(factor(df[["GarageQual"]], levels=quality))
  
  df[["GarageCond"]][is.na(df[["GarageCond"]])] <- "None"
  df[["GarageCond"]] <- as.numeric(factor(df[["GarageCond"]], levels=quality))
  
  df[["GarageType"]][is.na(df[["GarageType"]])] <- "None"

  df[["GarageCars"]][is.na(df[["GarageCars"]])] <- 0
  df[["GarageArea"]][is.na(df[["GarageArea"]])] <- 0

  ## Basement
  df[["BsmtCond"]][is.na(df[["BsmtCond"]])] <- "None"
  df[["BsmtCond"]] <- as.numeric(factor(df[["BsmtCond"]], levels=quality))

  df[["BsmtExposure"]][is.na(df[["BsmtExposure"]])] <- "None"
  df[["BsmtExposure"]] <- as.numeric(factor(df[["BsmtExposure"]], levels=exp))

  df[["BsmtQual"]][is.na(df[["BsmtQual"]])] <- "None"
  df[["BsmtQual"]] <- as.numeric(factor(df[["BsmtQual"]], levels=quality))
  
  df[["BsmtFinType1"]][is.na(df[["BsmtFinType1"]])] <- "None"
  df[["BsmtFinType1"]] <- as.numeric(factor(df[["BsmtFinType1"]], levels=fin))

  df[["BsmtFinType2"]][is.na(df[["BsmtFinType2"]])] <- "None"
  df[["BsmtFinType2"]] <- as.numeric(factor(df[["BsmtFinType2"]], levels=fin)) 
  
  df[["BsmtFullBath"]][is.na(df[["BsmtFullBath"]])] <- 0
  df[["BsmtHalfBath"]][is.na(df[["BsmtHalfBath"]])] <- 0
  df[["BsmtFinSF1"]][is.na(df[["BsmtFinSF1"]])] <- 0
  df[["BsmtFinSF2"]][is.na(df[["BsmtFinSF2"]])] <- 0
  df[["BsmtUnfSF"]][is.na(df[["BsmtUnfSF"]])] <- 0
  df[["TotalBsmtSF"]][is.na(df[["TotalBsmtSF"]])] <- 0
  
  ## Masory
  df[["MasVnrType"]][is.na(df[["MasVnrType"]])] <- "None"
  df[["MasVnrArea"]][is.na(df[["MasVnrArea"]])] <- 0
  
  ## MSZoning
  df[["MSZoning"]][is.na(df[["MSZoning"]])] <- "RL"
  
  ## Utilities
  df[["Utilities"]][is.na(df[["Utilities"]])] <- "AllPub"
  
  ## Functional
  df[["Functional"]][is.na(df[["Functional"]])] <- "Typ"

  ## Exteriors
  df[["Exterior1st"]][is.na(df[["Exterior1st"]])] <- "Other"
  df[["Exterior2nd"]][is.na(df[["Exterior2nd"]])] <- "Other"
  
  df[["ExterQual"]] <- as.numeric(factor(df[["ExterQual"]], levels=quality))
  df[["ExterCond"]] <- as.numeric(factor(df[["ExterCond"]], levels=quality2))

  ## Electrical
  df[["Electrical"]][is.na(df[["Electrical"]])] <- "SBrkr"

  ## Kitchen Quality
  df[["KitchenQual"]][is.na(df[["KitchenQual"]])] <- "TA"
  df[["KitchenQual"]] <- as.numeric(factor(df[["KitchenQual"]], levels=quality))
  
  ## Sale Type
  df[["SaleType"]][is.na(df[["SaleType"]])] <- "WD"
  
  df[["Neighborhood"]]<- as.factor(df[["Neighborhood"]])
  
  return(df)
}


# Features: this function creates new variables to add to the dataset: Courtesy of Gabriel Sanchez
fe <- function(df){

  df[["TotalSF"]] <- df[["TotalBsmtSF"]] + df[["X1stFlrSF"]] + df[["X2ndFlrSF"]]
  df[["OverallGrade"]] <- df[["OverallQual"]] * df[["OverallCond"]]
  df[["ExterGrade"]] <- df[["ExterQual"]] * df[["ExterCond"]]
  df[["FireplaceScore"]] <- df[["Fireplaces"]] * df[["FireplaceQu"]]
  df[["TotalBath"]] <- df[["BsmtFullBath"]] + (0.5 * df[["BsmtHalfBath"]]) +   df[["FullBath"]] + (0.5 * df[["HalfBath"]])
  df[["TotalSqFeet"]] <- df[["GrLivArea"]] + df[["TotalBsmtSF"]] 
  df[["TotalPorchSF"]] <- df[["OpenPorchSF"]] + df[["EnclosedPorch"]] +   df[["X3SsnPorch"]] + df[["ScreenPorch"]]
  df[["PoolScore"]] <- df[["PoolArea"]] * df[["PoolQC"]]
  df[["KitchenScore"]] <- df[["KitchenAbvGr"]] * df[["KitchenQual"]]
  df[["Remod"]] <- ifelse(df[["YearBuilt"]]==df[["YearRemodAdd"]], 0, 1)
  df[["Age"]] <- as.numeric(df[["YrSold"]])-df[["YearRemodAdd"]]

    return(df) 
}

The following code will call the functions to add features and clean the data set.

train_df <- fe(clean(train_df))
test_df <- fe(clean(test_df))
test_df$SalePrice <- NA
combined_df <- rbind(train_df, test_df)

# Using a log function to normalize the data.
train_df$SalePrice <- log(train_df$SalePrice)


head(combined_df)

Scatterplot Matrix

The code generates a scatterplot matrix for the independent variables TotalSqFeet, LotArea, OverallGrade, and the dependent variable SalePrice. TotalSqFeet and OverallGrade are features that were added to help organize the data.

# Sale Price 
SalePrice <- train_df$SalePrice

# TotalSqFeet: df[["GrLivArea"]] + df[["TotalBsmtSF"]] (see Features)
TotalSqFeet <- train_df$TotalSqFeet

# LotArea: Lot size in square feet
LotArea <- train_df$LotArea

# OverallGrade: df[["OverallQual"]] * df[["OverallCond"]] (See Features) 
OverallGrade <- train_df$OverallGrade


matrixData = data.frame(SalePrice, TotalSqFeet, LotArea, OverallGrade)
pairs(matrixData, col="dodgerblue1")

Correlation Matrix

# Using the corrplot library
library(corrplot)
## corrplot 0.84 loaded
correlationMatrix <- cor(matrixData)
corrplot(correlationMatrix, method = "shade")

Hypothesis Testing
We’re going to test the hypothesis that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. We will be using the variables SalePrice, TotalSqFeet, LotArea, and OverallGrade.

# SalePrice vs TotalSqFeet
SalePricevTotalSqFeet <- cor.test(SalePrice, TotalSqFeet, method="pearson", conf.level=0.8)
SalePricevTotalSqFeet
## 
##  Pearson's product-moment correlation
## 
## data:  SalePrice and TotalSqFeet
## t = 46.567, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.7594241 0.7864287
## sample estimates:
##       cor 
## 0.7732768
# SalePrice vs LotArea
SalePricevLotArea <- cor.test(SalePrice, LotArea, method="pearson", conf.level=0.8)
SalePricevLotArea
## 
##  Pearson's product-moment correlation
## 
## data:  SalePrice and LotArea
## t = 10.168, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2257075 0.2883910
## sample estimates:
##       cor 
## 0.2573199
# SalePrice vs OverallGrade
SalePriecvOverallGrade <- cor.test(SalePrice, OverallGrade, method="pearson", conf.level=0.8)
SalePriecvOverallGrade
## 
##  Pearson's product-moment correlation
## 
## data:  SalePrice and OverallGrade
## t = 29.155, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5852309 0.6276507
## sample estimates:
##       cor 
## 0.6068728

The cor.test function we used has these items of information: statistic, parameter, p.value, estimate, null.value, alternative, method, data.name, conf.int.

Analysis of Correlation Tests

The following is a breakdown of each variable vs SalePrice.

  p-value Correlation Coefficient t Statistic Analysis
TotalSqFeet 8.831832810^{-291} 0.7732768 46.5669241 Correlation Exists
LotArea 1.643992110^{-23} 0.2573199 10.1678313 Correlation Exists
OverallGrade 1.292740110^{-147} 0.6068728 29.1554099 Correlation Exists
  80% Confidence Interval
TotalSqFeet (0.7594241, 0.7864287)
LotArea (0.2257075, 0.288391)
OverallGrade (0.5852309, 0.6276507)

Additional Correlation Chart

numericVars <- which(sapply(combined_df, is.numeric))
numericVarNames <- names(numericVars)
combined_numVar <- combined_df[, numericVars]
cor_numVar <- cor(combined_numVar, use="pairwise.complete.obs")
cor_sorted <- as.matrix(sort(cor_numVar[,'SalePrice'], decreasing = TRUE))
cor_high <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.5)))
cor_high
##  [1] "SalePrice"      "OverallQual"    "TotalSF"        "TotalSqFeet"   
##  [5] "GrLivArea"      "ExterQual"      "KitchenQual"    "GarageCars"    
##  [9] "TotalBath"      "GarageArea"     "BsmtQual"       "ExterGrade"    
## [13] "TotalBsmtSF"    "X1stFlrSF"      "OverallGrade"   "FullBath"      
## [17] "TotRmsAbvGrd"   "YearBuilt"      "FireplaceQu"    "YearRemodAdd"  
## [21] "FireplaceScore" "Age"
cor_numVar <- cor_numVar[cor_high, cor_high]
corrplot.mixed(cor_numVar, tl.col="black", tl.pos = "lt", tl.cex = .6,cl.cex = .6, number.cex=.5)  

Familywise Error?

According to the webpage https://www.statisticshowto.com/familywise-error-rate/:

The familywise error rate (FWE or FWER) is the probability of a coming to at least one false conclusion in a series of hypothesis tests . In other words, it’s the probability of making at least one Type I Error. The term “familywise” error rate comes from family of tests, which is the technical definition for a series of tests on data.

I would not be worried about familywise error because of the significant p-values, high t-statistic, and the number of observations in the study.

Part II

Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

# Contains a function for LU decomposition
library(matrixcalc)
## Warning: package 'matrixcalc' was built under R version 4.0.3
# Original correlation matrix
print(correlationMatrix)
##              SalePrice TotalSqFeet    LotArea OverallGrade
## SalePrice    1.0000000   0.7732768 0.25731989   0.60687281
## TotalSqFeet  0.7732768   1.0000000 0.30681366   0.42773887
## LotArea      0.2573199   0.3068137 1.00000000   0.08561631
## OverallGrade 0.6068728   0.4277389 0.08561631   1.00000000
# Inverts the correlation matrix from previous steps creating the precision matrix
precisionMatrix <- solve(correlationMatrix)
print(precisionMatrix)
##               SalePrice TotalSqFeet    LotArea OverallGrade
## SalePrice     3.2537366  -2.0034397 -0.1278159   -1.1067121
## TotalSqFeet  -2.0034397   2.5785843 -0.2873903    0.1374777
## LotArea      -0.1278159  -0.2873903  1.1120506    0.1052863
## OverallGrade -1.1067121   0.1374777  0.1052863    1.6038147
# Multiply the correlation matrix by the precision matrix 
print(correlationMatrix %*% precisionMatrix)  
##                 SalePrice   TotalSqFeet      LotArea  OverallGrade
## SalePrice    1.000000e+00  6.938894e-17 0.000000e+00  0.000000e+00
## TotalSqFeet  5.551115e-17  1.000000e+00 3.469447e-17  0.000000e+00
## LotArea      8.326673e-17 -8.326673e-17 1.000000e+00 -2.775558e-17
## OverallGrade 0.000000e+00  5.551115e-17 1.387779e-17  1.000000e+00
# Multiply the precision matrix by the correlation matrix
print(precisionMatrix %*% correlationMatrix)  
##                  SalePrice   TotalSqFeet       LotArea  OverallGrade
## SalePrice     1.000000e+00  4.996004e-16  1.942890e-16 -2.220446e-16
## TotalSqFeet  -3.608225e-16  1.000000e+00 -1.387779e-16 -1.665335e-16
## LotArea      -4.163336e-17 -6.938894e-17  1.000000e+00  0.000000e+00
## OverallGrade  2.220446e-16  1.110223e-16  2.775558e-17  1.000000e+00
# LU decomposition 
# https://www.rdocumentation.org/packages/matrixcalc/versions/1.0-3/topics/lu.decomposition
LU <-lu.decomposition(correlationMatrix)
L <- LU$L
U <- LU$U

print(L)
##           [,1]       [,2]        [,3] [,4]
## [1,] 1.0000000  0.0000000  0.00000000    0
## [2,] 0.7732768  1.0000000  0.00000000    0
## [3,] 0.2573199  0.2682155  1.00000000    0
## [4,] 0.6068728 -0.1033268 -0.06564743    1
print(U)
##      [,1]          [,2]          [,3]        [,4]
## [1,]    1  7.732768e-01  2.573199e-01  0.60687281
## [2,]    0  4.020429e-01  1.078341e-01 -0.04154182
## [3,]    0 -1.387779e-17  9.048637e-01 -0.05940198
## [4,]    0 -9.110411e-19 -6.938894e-18  0.62351342
print(L %*% U)
##           [,1]      [,2]       [,3]       [,4]
## [1,] 1.0000000 0.7732768 0.25731989 0.60687281
## [2,] 0.7732768 1.0000000 0.30681366 0.42773887
## [3,] 0.2573199 0.3068137 1.00000000 0.08561631
## [4,] 0.6068728 0.4277389 0.08561631 1.00000000

Part III

Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html. Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

The variable LotArea has a right skew as shown in the histogram below.

hist(LotArea)

Next, we will load the MASS package and run fitdistr to try to fit an exponential probability density function.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# Get the standard deviation of LotArea
sdLotArea <- sd(LotArea)

# Do the fit
fit <- fitdistr(LotArea, densfun = "exponential")  
fit 
##        rate    
##   9.508570e-05 
##  (2.488507e-06)
# the information stored in fit
names(fit)  
## [1] "estimate" "sd"       "vcov"     "n"        "loglik"
# Find the optimal value of lambda
lambda <- fit$estimate

# Take 1000 samples from the Exponential Probability Density Function
samples <- rexp(1000, lambda)

# Histogram of the samples
hist(samples)

The histogram of the 1,000 samples produces a less skewed distribution than the original.

Now, to find the 5th and 95th percentiles.

cdf <- ecdf(samples)
plot(cdf)

summary(cdf)
## Empirical CDF:     1000 unique values with summary
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     9.24  2837.95  7025.45 10277.20 14454.17 82617.00
exp_quantiles <- quantile(cdf,c(.05,.95))
exp_quantiles
##         5%        95% 
##   463.7209 30708.3385
# 95th confidence interval 
mu <- mean(samples)
sd <- sd(samples)
err <- qnorm(.975) * sd/sqrt(1000)
lower <- mu - err
upper <- mu + err

The 95% confidence interval is (9637.8796934, 1.09165310^{4}) with a mean of 1.027720510^{4}.

Part IV

Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

For constructing the model, I will use the variables OverallQual, TotalSqFeet, LotArea, and OverallGrade.

lm <- lm(SalePrice ~ OverallQual + TotalSqFeet + LotArea + OverallGrade, data = train_df)
summary(lm)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + TotalSqFeet + LotArea + 
##     OverallGrade, data = train_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.54110 -0.07758  0.02444  0.10885  0.54196 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.052e+01  2.311e-02 455.110  < 2e-16 ***
## OverallQual  1.430e-01  6.369e-03  22.450  < 2e-16 ***
## TotalSqFeet  1.869e-04  8.626e-06  21.669  < 2e-16 ***
## LotArea      3.195e-06  5.323e-07   6.002 2.45e-09 ***
## OverallGrade 3.474e-03  7.830e-04   4.437 9.83e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1911 on 1455 degrees of freedom
## Multiple R-squared:  0.7717, Adjusted R-squared:  0.7711 
## F-statistic:  1230 on 4 and 1455 DF,  p-value: < 2.2e-16
names(lm)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

Prediction

We will predict using the test data with the model we constructed. I use the exp() to revert the SalePrice.

# Create the predictions using the test data
predictions <- exp(predict(lm, test_df))
summary(predictions)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   52250  128465  157916  175420  201583 1402095
# Export for scoring 
export <- data.frame(test_df$Id, predictions)
colnames(export)[1] <- 'Id'
colnames(export)[2] <- 'SalePrice'
write.csv(export, file="export.csv", row.names=F, quote=FALSE)

Submission

I submitted my attempt to Kaggle. My username os sungsulee and my score was 0.18511. The model can be improved upon further iterations to remove and possibly add more variables.

The YouTube video of the presentation is here.

Kaggle Score