Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu = \sigma = (N+1)/2\).
set.seed(2020)
N <- 8
# mu = sigma = (N + 1)/2
mu <- (N+1) / 2
sigma <- (N + 1) / 2
# Generate the random variable X
X <- runif(10000, min = 1, max = N)
# Generate the random variable Y
Y <- rnorm(10000, mean = mu, sd = sigma)
Probability
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
# Small x is estimated as the median of the X variable
x <- median(X)
# Small y is estimated as the 1st quartile of the Y variable
y <- quantile(Y, 0.25)
The median of X is 4.4247068.
The estimated 1st quartile of Y is 1.4292412.
P(X > x | X > y)
\[P(A | B) = \frac{P(A \cap B)}{P(B)}\]
# P(X > x & X > y)
a_and_b <- length(X[(X > x) & (X > y)])/length(X)
# P(X > y)
b <- length(X[X > y])/length(X)
# P(X > x | X > y)
probability <- a_and_b / b
Answer: P(X > x | X > y) = 0.534188
P(X > x, Y > y)
P(X > x, Y > y) = P(X > x) * P(Y > y)
# P(X > x)
a <- length(X[X > x])/length(X)
# P(Y > y)
b <- length(Y[Y > y])/length(Y)
probability <- a * b
Answer: P(X > x, Y > y) = 0.375
P(X < x | X > y)
# P(X > x & X > y)
a_and_b <- length(X[(X < x) & (X > y)])/length(X)
# P(X > y)
b <- length(X[X > y])/length(X)
# P(X > x | X > y)
probability <- a_and_b / b
Answer: P(X < x | X > y) = 0.465812
Investigate whether \(P(X>x\,and\,Y>y) = P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.
a <- length(X[(X > x)])/length(X)
b <- length(Y[(Y > y)])/length(Y)
| P(X > x) | P(Y > y) | P(X > x) * P(Y > y) |
|---|---|---|
| 0.5 | 0.75 | 0.375 |
| X > x | X <= x | Total | |
|---|---|---|---|
| Y > y | 0.375 | 0.375 | 0.75 |
| Y <= y | 0.125 | 0.125 | 0.25 |
| X > x | X <= x | Total | |
|---|---|---|---|
| Y > y | 3739 | 3761 | 7500 |
| Y <= y | 1261 | 1239 | 2500 |
\(P(X>x\,and\,Y>y) = P(X>x)P(Y>y)\) : True, especially because X and Y are independent.
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
# Construct a contingency table
contingency <- matrix(c(sum(X>x & Y>y), sum(X>x & Y<=y), sum(X<=x & Y>y), sum(X<=x & Y<=y)), nrow = 2)
colnames(contingency) <- c('X > x', 'X <= x' )
rownames(contingency) <- c('Y > y', 'Y <= y')
contingency
## X > x X <= x
## Y > y 3739 3761
## Y <= y 1261 1239
Fisher’s Exact Test:
fisher.result <- fisher.test(contingency)
fisher.result
##
## Fisher's Exact Test for Count Data
##
## data: contingency
## p-value = 0.6277
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.891325 1.070505
## sample estimates:
## odds ratio
## 0.9768025
Hypotheses:
\(H_0\): (null) the variables are independent
\(H_1\): the variables are dependent
The Fisher’s Exact Test on the data of X and Y produces a p-value of 0.6276998 which is greater than the significance level of 5%. The null hypothesis, \(H_0\) is not rejected. The variables are independent.
Chi Square Test:
chisquare.result <- chisq.test(contingency)
chisquare.result
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: contingency
## X-squared = 0.2352, df = 1, p-value = 0.6277
Hypotheses:
\(H_0\): (null) the variables are independent
\(H_1\): the variables are dependent
The Chi Square Test on the data of X and Y produces a p-value of 0.6276946. The p-value does not provide any evidence to reject the null hypothesis: the variables are independent.
Summary
According to Matthias Döring, Fisher’s test is “preferable to the chi-squared test because it is an exact test.” However, Fisher’s test can become computationally infeasible in larger same sizes. For this circumstance and where the contingency table dimensions go beyond 2x2, the Chi Square is preferred. However, in this instance, both tests performed well.
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition.https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following.
Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
# Import libraries for this section
library(ggplot2)
library(dplyr)
library(tidyr)
library(pander)
# Import the training and test data
train_df <- read.csv("https://raw.githubusercontent.com/logicalschema/Data/main/DATA%20605/Final/train.csv", stringsAsFactors = FALSE)
test_df <- read.csv("https://raw.githubusercontent.com/logicalschema/Data/main/DATA%20605/Final/test.csv", stringsAsFactors = FALSE)
# Will use this later
test_labels <- test_df$Id
test_df$SalePrice <- NA
combined_df <- rbind(train_df, test_df)
# Here is a glimpse of the data
head(train_df)
# Here is a summary of the data
summary(train_df)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig LandSlope
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Neighborhood Condition1 Condition2 BldgType
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HouseStyle OverallQual OverallCond YearBuilt
## Length:1460 Min. : 1.000 Min. :1.000 Min. :1872
## Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Mode :character Median : 6.000 Median :5.000 Median :1973
## Mean : 6.099 Mean :5.575 Mean :1971
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## Max. :10.000 Max. :9.000 Max. :2010
##
## YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1950 Length:1460 Length:1460 Length:1460
## 1st Qu.:1967 Class :character Class :character Class :character
## Median :1994 Mode :character Mode :character Mode :character
## Mean :1985
## 3rd Qu.:2004
## Max. :2010
##
## Exterior2nd MasVnrType MasVnrArea ExterQual
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 0.0 Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## ExterCond Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical X1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature MiscVal
## Length:1460 Length:1460 Length:1460 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 43.49
## 3rd Qu.: 0.00
## Max. :15500.00
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 Length:1460 Length:1460
## 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character
## Median : 6.000 Median :2008 Mode :character Mode :character
## Mean : 6.322 Mean :2008
## 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
Cleaning
Data is missing throughout so some cleaning is needed.
na_col <- which(colSums(is.na(combined_df)) > 0)
sort(colSums(sapply(combined_df[na_col], is.na)), decreasing = TRUE)
## PoolQC MiscFeature Alley Fence SalePrice FireplaceQu
## 2909 2814 2721 2348 1459 1420
## LotFrontage GarageYrBlt GarageFinish GarageQual GarageCond GarageType
## 486 159 159 159 159 157
## BsmtCond BsmtExposure BsmtQual BsmtFinType2 BsmtFinType1 MasVnrType
## 82 82 81 80 79 24
## MasVnrArea MSZoning Utilities BsmtFullBath BsmtHalfBath Functional
## 23 4 2 2 2 2
## Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 1 1 1 1 1
## Electrical KitchenQual GarageCars GarageArea SaleType
## 1 1 1 1 1
# Function to clean data: Courtesy of Gabriel Sanchez
clean <- function(df){
quality <- c("None", "Fa", "TA", "Gd", "Ex")
quality2 <- c("None","Po","Fa", "TA", "Gd", "Ex")
finish <- c("None"=0, "Unf"=1, "RFn"=2, "Fin"=3)
exp <- c("None","No", "Mn", "Av", "Gd")
fin <- c("None","Unf","LwQ","Rec","BLQ","ALQ","GLQ")
## Pool
df[["PoolQC"]][is.na(df[["PoolQC"]])] <- "None"
#poolqc<-revalue(df[["PoolQC"]], levels=quality)
poolqc<-factor(df[["PoolQC"]], levels=quality)
df[["PoolQC"]] <- as.numeric(poolqc)
## MiscFeature
df[["MiscFeature"]][is.na(df[["MiscFeature"]])] <- "None"
df[["MiscFeature"]] <- as.factor(df[["MiscFeature"]])
## Alley
df[["Alley"]][is.na(df[["Alley"]])] <- "None"
df[["Alley"]] <- as.factor(df[["Alley"]])
## Fence
df[["Fence"]][is.na(df[["Fence"]])] <- "None"
df[["Fence"]] <- as.factor(df[["Fence"]])
## Fireplace
df[["FireplaceQu"]][is.na(df[["FireplaceQu"]])] <- "None"
#fp<-revalue(df[["FireplaceQu"]], levels=quality)
fp<-factor(df[["FireplaceQu"]], levels=quality2)
df[["FireplaceQu"]] <- as.numeric(fp)
## Lot
df[["LotFrontage"]][is.na(df[["LotFrontage"]])] <- 69 # media
## Garage
df[["GarageYrBlt"]][is.na(df[["GarageYrBlt"]])] <- 0
df[["GarageYrBlt"]] <- factor(df[["GarageYrBlt"]])
df[["GarageFinish"]][is.na(df[["GarageFinish"]])] <- "None"
#gf <- revalue(df[["GarageFinish"]], finish)
gf <- factor(df[["GarageFinish"]], finish)
df[["GarageFinish"]]<-as.integer(gf)
df[["GarageQual"]][is.na(df[["GarageQual"]])] <- "None"
df[["GarageQual"]] <- as.numeric(factor(df[["GarageQual"]], levels=quality))
df[["GarageCond"]][is.na(df[["GarageCond"]])] <- "None"
df[["GarageCond"]] <- as.numeric(factor(df[["GarageCond"]], levels=quality))
df[["GarageType"]][is.na(df[["GarageType"]])] <- "None"
df[["GarageCars"]][is.na(df[["GarageCars"]])] <- 0
df[["GarageArea"]][is.na(df[["GarageArea"]])] <- 0
## Basement
df[["BsmtCond"]][is.na(df[["BsmtCond"]])] <- "None"
df[["BsmtCond"]] <- as.numeric(factor(df[["BsmtCond"]], levels=quality))
df[["BsmtExposure"]][is.na(df[["BsmtExposure"]])] <- "None"
df[["BsmtExposure"]] <- as.numeric(factor(df[["BsmtExposure"]], levels=exp))
df[["BsmtQual"]][is.na(df[["BsmtQual"]])] <- "None"
df[["BsmtQual"]] <- as.numeric(factor(df[["BsmtQual"]], levels=quality))
df[["BsmtFinType1"]][is.na(df[["BsmtFinType1"]])] <- "None"
df[["BsmtFinType1"]] <- as.numeric(factor(df[["BsmtFinType1"]], levels=fin))
df[["BsmtFinType2"]][is.na(df[["BsmtFinType2"]])] <- "None"
df[["BsmtFinType2"]] <- as.numeric(factor(df[["BsmtFinType2"]], levels=fin))
df[["BsmtFullBath"]][is.na(df[["BsmtFullBath"]])] <- 0
df[["BsmtHalfBath"]][is.na(df[["BsmtHalfBath"]])] <- 0
df[["BsmtFinSF1"]][is.na(df[["BsmtFinSF1"]])] <- 0
df[["BsmtFinSF2"]][is.na(df[["BsmtFinSF2"]])] <- 0
df[["BsmtUnfSF"]][is.na(df[["BsmtUnfSF"]])] <- 0
df[["TotalBsmtSF"]][is.na(df[["TotalBsmtSF"]])] <- 0
## Masory
df[["MasVnrType"]][is.na(df[["MasVnrType"]])] <- "None"
df[["MasVnrArea"]][is.na(df[["MasVnrArea"]])] <- 0
## MSZoning
df[["MSZoning"]][is.na(df[["MSZoning"]])] <- "RL"
## Utilities
df[["Utilities"]][is.na(df[["Utilities"]])] <- "AllPub"
## Functional
df[["Functional"]][is.na(df[["Functional"]])] <- "Typ"
## Exteriors
df[["Exterior1st"]][is.na(df[["Exterior1st"]])] <- "Other"
df[["Exterior2nd"]][is.na(df[["Exterior2nd"]])] <- "Other"
df[["ExterQual"]] <- as.numeric(factor(df[["ExterQual"]], levels=quality))
df[["ExterCond"]] <- as.numeric(factor(df[["ExterCond"]], levels=quality2))
## Electrical
df[["Electrical"]][is.na(df[["Electrical"]])] <- "SBrkr"
## Kitchen Quality
df[["KitchenQual"]][is.na(df[["KitchenQual"]])] <- "TA"
df[["KitchenQual"]] <- as.numeric(factor(df[["KitchenQual"]], levels=quality))
## Sale Type
df[["SaleType"]][is.na(df[["SaleType"]])] <- "WD"
df[["Neighborhood"]]<- as.factor(df[["Neighborhood"]])
return(df)
}
# Features: this function creates new variables to add to the dataset: Courtesy of Gabriel Sanchez
fe <- function(df){
df[["TotalSF"]] <- df[["TotalBsmtSF"]] + df[["X1stFlrSF"]] + df[["X2ndFlrSF"]]
df[["OverallGrade"]] <- df[["OverallQual"]] * df[["OverallCond"]]
df[["ExterGrade"]] <- df[["ExterQual"]] * df[["ExterCond"]]
df[["FireplaceScore"]] <- df[["Fireplaces"]] * df[["FireplaceQu"]]
df[["TotalBath"]] <- df[["BsmtFullBath"]] + (0.5 * df[["BsmtHalfBath"]]) + df[["FullBath"]] + (0.5 * df[["HalfBath"]])
df[["TotalSqFeet"]] <- df[["GrLivArea"]] + df[["TotalBsmtSF"]]
df[["TotalPorchSF"]] <- df[["OpenPorchSF"]] + df[["EnclosedPorch"]] + df[["X3SsnPorch"]] + df[["ScreenPorch"]]
df[["PoolScore"]] <- df[["PoolArea"]] * df[["PoolQC"]]
df[["KitchenScore"]] <- df[["KitchenAbvGr"]] * df[["KitchenQual"]]
df[["Remod"]] <- ifelse(df[["YearBuilt"]]==df[["YearRemodAdd"]], 0, 1)
df[["Age"]] <- as.numeric(df[["YrSold"]])-df[["YearRemodAdd"]]
return(df)
}
The following code will call the functions to add features and clean the data set.
train_df <- fe(clean(train_df))
test_df <- fe(clean(test_df))
test_df$SalePrice <- NA
combined_df <- rbind(train_df, test_df)
# Using a log function to normalize the data.
train_df$SalePrice <- log(train_df$SalePrice)
head(combined_df)
Scatterplot Matrix
The code generates a scatterplot matrix for the independent variables TotalSqFeet, LotArea, OverallGrade, and the dependent variable SalePrice. TotalSqFeet and OverallGrade are features that were added to help organize the data.
# Sale Price
SalePrice <- train_df$SalePrice
# TotalSqFeet: df[["GrLivArea"]] + df[["TotalBsmtSF"]] (see Features)
TotalSqFeet <- train_df$TotalSqFeet
# LotArea: Lot size in square feet
LotArea <- train_df$LotArea
# OverallGrade: df[["OverallQual"]] * df[["OverallCond"]] (See Features)
OverallGrade <- train_df$OverallGrade
matrixData = data.frame(SalePrice, TotalSqFeet, LotArea, OverallGrade)
pairs(matrixData, col="dodgerblue1")
Correlation Matrix
# Using the corrplot library
library(corrplot)
## corrplot 0.84 loaded
correlationMatrix <- cor(matrixData)
corrplot(correlationMatrix, method = "shade")
Hypothesis Testing
We’re going to test the hypothesis that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. We will be using the variables SalePrice, TotalSqFeet, LotArea, and OverallGrade.
# SalePrice vs TotalSqFeet
SalePricevTotalSqFeet <- cor.test(SalePrice, TotalSqFeet, method="pearson", conf.level=0.8)
SalePricevTotalSqFeet
##
## Pearson's product-moment correlation
##
## data: SalePrice and TotalSqFeet
## t = 46.567, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.7594241 0.7864287
## sample estimates:
## cor
## 0.7732768
# SalePrice vs LotArea
SalePricevLotArea <- cor.test(SalePrice, LotArea, method="pearson", conf.level=0.8)
SalePricevLotArea
##
## Pearson's product-moment correlation
##
## data: SalePrice and LotArea
## t = 10.168, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2257075 0.2883910
## sample estimates:
## cor
## 0.2573199
# SalePrice vs OverallGrade
SalePriecvOverallGrade <- cor.test(SalePrice, OverallGrade, method="pearson", conf.level=0.8)
SalePriecvOverallGrade
##
## Pearson's product-moment correlation
##
## data: SalePrice and OverallGrade
## t = 29.155, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5852309 0.6276507
## sample estimates:
## cor
## 0.6068728
The cor.test function we used has these items of information: statistic, parameter, p.value, estimate, null.value, alternative, method, data.name, conf.int.
Analysis of Correlation Tests
The following is a breakdown of each variable vs SalePrice.
| p-value | Correlation Coefficient | t Statistic | Analysis | |
|---|---|---|---|---|
| TotalSqFeet | 8.831832810^{-291} | 0.7732768 | 46.5669241 | Correlation Exists |
| LotArea | 1.643992110^{-23} | 0.2573199 | 10.1678313 | Correlation Exists |
| OverallGrade | 1.292740110^{-147} | 0.6068728 | 29.1554099 | Correlation Exists |
| 80% Confidence Interval | |
|---|---|
| TotalSqFeet | (0.7594241, 0.7864287) |
| LotArea | (0.2257075, 0.288391) |
| OverallGrade | (0.5852309, 0.6276507) |
Additional Correlation Chart
numericVars <- which(sapply(combined_df, is.numeric))
numericVarNames <- names(numericVars)
combined_numVar <- combined_df[, numericVars]
cor_numVar <- cor(combined_numVar, use="pairwise.complete.obs")
cor_sorted <- as.matrix(sort(cor_numVar[,'SalePrice'], decreasing = TRUE))
cor_high <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.5)))
cor_high
## [1] "SalePrice" "OverallQual" "TotalSF" "TotalSqFeet"
## [5] "GrLivArea" "ExterQual" "KitchenQual" "GarageCars"
## [9] "TotalBath" "GarageArea" "BsmtQual" "ExterGrade"
## [13] "TotalBsmtSF" "X1stFlrSF" "OverallGrade" "FullBath"
## [17] "TotRmsAbvGrd" "YearBuilt" "FireplaceQu" "YearRemodAdd"
## [21] "FireplaceScore" "Age"
cor_numVar <- cor_numVar[cor_high, cor_high]
corrplot.mixed(cor_numVar, tl.col="black", tl.pos = "lt", tl.cex = .6,cl.cex = .6, number.cex=.5)
Familywise Error?
According to the webpage https://www.statisticshowto.com/familywise-error-rate/:
The familywise error rate (FWE or FWER) is the probability of a coming to at least one false conclusion in a series of hypothesis tests . In other words, it’s the probability of making at least one Type I Error. The term “familywise” error rate comes from family of tests, which is the technical definition for a series of tests on data.
I would not be worried about familywise error because of the significant p-values, high t-statistic, and the number of observations in the study.
Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
# Contains a function for LU decomposition
library(matrixcalc)
## Warning: package 'matrixcalc' was built under R version 4.0.3
# Original correlation matrix
print(correlationMatrix)
## SalePrice TotalSqFeet LotArea OverallGrade
## SalePrice 1.0000000 0.7732768 0.25731989 0.60687281
## TotalSqFeet 0.7732768 1.0000000 0.30681366 0.42773887
## LotArea 0.2573199 0.3068137 1.00000000 0.08561631
## OverallGrade 0.6068728 0.4277389 0.08561631 1.00000000
# Inverts the correlation matrix from previous steps creating the precision matrix
precisionMatrix <- solve(correlationMatrix)
print(precisionMatrix)
## SalePrice TotalSqFeet LotArea OverallGrade
## SalePrice 3.2537366 -2.0034397 -0.1278159 -1.1067121
## TotalSqFeet -2.0034397 2.5785843 -0.2873903 0.1374777
## LotArea -0.1278159 -0.2873903 1.1120506 0.1052863
## OverallGrade -1.1067121 0.1374777 0.1052863 1.6038147
# Multiply the correlation matrix by the precision matrix
print(correlationMatrix %*% precisionMatrix)
## SalePrice TotalSqFeet LotArea OverallGrade
## SalePrice 1.000000e+00 6.938894e-17 0.000000e+00 0.000000e+00
## TotalSqFeet 5.551115e-17 1.000000e+00 3.469447e-17 0.000000e+00
## LotArea 8.326673e-17 -8.326673e-17 1.000000e+00 -2.775558e-17
## OverallGrade 0.000000e+00 5.551115e-17 1.387779e-17 1.000000e+00
# Multiply the precision matrix by the correlation matrix
print(precisionMatrix %*% correlationMatrix)
## SalePrice TotalSqFeet LotArea OverallGrade
## SalePrice 1.000000e+00 4.996004e-16 1.942890e-16 -2.220446e-16
## TotalSqFeet -3.608225e-16 1.000000e+00 -1.387779e-16 -1.665335e-16
## LotArea -4.163336e-17 -6.938894e-17 1.000000e+00 0.000000e+00
## OverallGrade 2.220446e-16 1.110223e-16 2.775558e-17 1.000000e+00
# LU decomposition
# https://www.rdocumentation.org/packages/matrixcalc/versions/1.0-3/topics/lu.decomposition
LU <-lu.decomposition(correlationMatrix)
L <- LU$L
U <- LU$U
print(L)
## [,1] [,2] [,3] [,4]
## [1,] 1.0000000 0.0000000 0.00000000 0
## [2,] 0.7732768 1.0000000 0.00000000 0
## [3,] 0.2573199 0.2682155 1.00000000 0
## [4,] 0.6068728 -0.1033268 -0.06564743 1
print(U)
## [,1] [,2] [,3] [,4]
## [1,] 1 7.732768e-01 2.573199e-01 0.60687281
## [2,] 0 4.020429e-01 1.078341e-01 -0.04154182
## [3,] 0 -1.387779e-17 9.048637e-01 -0.05940198
## [4,] 0 -9.110411e-19 -6.938894e-18 0.62351342
print(L %*% U)
## [,1] [,2] [,3] [,4]
## [1,] 1.0000000 0.7732768 0.25731989 0.60687281
## [2,] 0.7732768 1.0000000 0.30681366 0.42773887
## [3,] 0.2573199 0.3068137 1.00000000 0.08561631
## [4,] 0.6068728 0.4277389 0.08561631 1.00000000
Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html. Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
The variable LotArea has a right skew as shown in the histogram below.
hist(LotArea)
Next, we will load the MASS package and run fitdistr to try to fit an exponential probability density function.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# Get the standard deviation of LotArea
sdLotArea <- sd(LotArea)
# Do the fit
fit <- fitdistr(LotArea, densfun = "exponential")
fit
## rate
## 9.508570e-05
## (2.488507e-06)
# the information stored in fit
names(fit)
## [1] "estimate" "sd" "vcov" "n" "loglik"
# Find the optimal value of lambda
lambda <- fit$estimate
# Take 1000 samples from the Exponential Probability Density Function
samples <- rexp(1000, lambda)
# Histogram of the samples
hist(samples)
The histogram of the 1,000 samples produces a less skewed distribution than the original.
Now, to find the 5th and 95th percentiles.
cdf <- ecdf(samples)
plot(cdf)
summary(cdf)
## Empirical CDF: 1000 unique values with summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.24 2837.95 7025.45 10277.20 14454.17 82617.00
exp_quantiles <- quantile(cdf,c(.05,.95))
exp_quantiles
## 5% 95%
## 463.7209 30708.3385
# 95th confidence interval
mu <- mean(samples)
sd <- sd(samples)
err <- qnorm(.975) * sd/sqrt(1000)
lower <- mu - err
upper <- mu + err
The 95% confidence interval is (9637.8796934, 1.09165310^{4}) with a mean of 1.027720510^{4}.
Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
For constructing the model, I will use the variables OverallQual, TotalSqFeet, LotArea, and OverallGrade.
lm <- lm(SalePrice ~ OverallQual + TotalSqFeet + LotArea + OverallGrade, data = train_df)
summary(lm)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + TotalSqFeet + LotArea +
## OverallGrade, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.54110 -0.07758 0.02444 0.10885 0.54196
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.052e+01 2.311e-02 455.110 < 2e-16 ***
## OverallQual 1.430e-01 6.369e-03 22.450 < 2e-16 ***
## TotalSqFeet 1.869e-04 8.626e-06 21.669 < 2e-16 ***
## LotArea 3.195e-06 5.323e-07 6.002 2.45e-09 ***
## OverallGrade 3.474e-03 7.830e-04 4.437 9.83e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1911 on 1455 degrees of freedom
## Multiple R-squared: 0.7717, Adjusted R-squared: 0.7711
## F-statistic: 1230 on 4 and 1455 DF, p-value: < 2.2e-16
names(lm)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
Prediction
We will predict using the test data with the model we constructed. I use the exp() to revert the SalePrice.
# Create the predictions using the test data
predictions <- exp(predict(lm, test_df))
summary(predictions)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 52250 128465 157916 175420 201583 1402095
# Export for scoring
export <- data.frame(test_df$Id, predictions)
colnames(export)[1] <- 'Id'
colnames(export)[2] <- 'SalePrice'
write.csv(export, file="export.csv", row.names=F, quote=FALSE)
Submission
I submitted my attempt to Kaggle. My username os sungsulee and my score was 0.18511. The model can be improved upon further iterations to remove and possibly add more variables.
The YouTube video of the presentation is here.
Kaggle Score