Final Project
Load Packages
library(ggplot2)
library(tidyverse)
library(knitr)
library(data.table)
library(RCurl)
library(magrittr)
library(psych)
library(Matrix)
library(matrixcalc)
library(MASS)
library(Rmisc)
Problem 1
Q. Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of \(\mu = \sigma = (N+1)/2\).
Answer
Please note that in this RMD, the questions are answered piecemeal, and are italicized and marked with Q wherever they appear.
Let’s choose N = 7. Below, I define a function f1() that initializes the variables mean, sd, df, X and Y. f1() is then called from the following code chunk, which prints the required output.
f1 <- function(N, seed_value)
{
mean <- (N + 1) / 2
sd <- mean # Since it's given that mu = sigma
#
set.seed(seed_value)
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean, sd) # 10,000 normal draws with mu = sigma = (N + 1) / 2
#
df <- data.frame(X, Y)
#
list1 <- list(mean = mean, sd = sd, df = df, X = X, Y = Y)
return(list1)
}
The code chunk below calls f1(), extracts the salient variables, and prints them.
N <- 7
seed_value = 121
list1 = f1(N, seed_value)
#
X = list1$X
Y = list1$Y
#
cat('mean(X) = ', mean(X), 'median(X) = ', median(X), 'sd(X) = ', sd(X))
## mean(X) = 4.020999 median(X) = 4.040544 sd(X) = 1.733148
cat('mean(Y) = ', mean(Y), 'median(Y) = ', median(Y), 'sd(Y) = ', sd(Y))
## mean(Y) = 3.967011 median(Y) = 4.000224 sd(Y) = 4.021847
#
head(list1$df, n = 10)
## X Y
## 1 3.395334 1.3882545
## 2 6.709554 9.4388717
## 3 4.258905 4.4950922
## 4 5.576774 4.0064366
## 5 4.305032 4.3511589
## 6 3.526142 -2.7237422
## 7 3.806151 1.7920659
## 8 4.679762 8.6216448
## 9 2.426268 3.6394108
## 10 2.528021 -0.1594078
#
hist(X)
hist(Y, probability = TRUE, main = "Normal, mean = 4, sd = 4")
abline(v = (N + 1) / 2, col = "blue")
So, we observe that the means and medians of X and Y are all close to (N + 1)/2 = 4. And the histograms show that X is uniformly distributed while Y is approximately Normal.
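As an additional check (a sketch, not required by the question), we can overlay the theoretical Normal density implied by \(\mu = \sigma = (N+1)/2\) on the histogram of Y; the curve should track the bars closely.
# Sketch: re-draw the density histogram of Y and overlay the N(4, 4) density
hist(Y, probability = TRUE, main = "Y with theoretical density overlaid")
curve(dnorm(x, mean = (N + 1) / 2, sd = (N + 1) / 2), add = TRUE, col = "red")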
Q. Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
x <- median(X) # Median of X
y <- quantile(Y, prob = c(.25)) # First quartile of Y
cat('Median(X) = ', x, ', ', '1st quartile of the Y = ', y)
## Median(X) = 4.040544 , 1st quartile of the Y = 1.252943
a. P(X>x | X>y)
Meaning: the conditional probability that X exceeds its median x, given that X exceeds y (the 1st quartile of Y): P(X > x | X > y) = P(X > x and X > y) / P(X > y)
cat('P(X > x | X > y) = ', sum(X > x & X > y) / sum(X > y))
## P(X > x | X > y) = 0.5216484
b. P(X>x, Y>y)
Meaning: the joint probability that X exceeds its median and Y exceeds its 1st quartile: P(X > x, Y > y) = P(X > x and Y > y)
cat('P(X > x, Y > y) = ', sum(X > x & Y > y) / length(X))
## P(X > x, Y > y) = 0.3745
c. P(X<x | X>y)
Meaning: the conditional probability that X falls below its median, given that X exceeds y: P(X < x | X > y) = P(X < x and X > y) / P(X > y)
cat('P(X < x | X > y) = ', sum(X < x & X > y) / sum(X > y))
## P(X < x | X > y) = 0.4783516
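The three probabilities above can also be computed with a small helper; this is a sketch (p_joint and p_cond are names introduced here, not part of the assignment) using the fact that the mean of a logical vector is an empirical probability.
p_joint <- function(a, b) mean(a & b)             # empirical P(A and B)
p_cond <- function(a, b) p_joint(a, b) / mean(b)  # empirical P(A | B)
p_cond(X > x, X > y)   # part a: 0.5216484
p_joint(X > x, Y > y)  # part b: 0.3745
p_cond(X < x, X > y)   # part c: 0.4783516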
Q. Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
prob_Y_lesser_than_y <- c( (sum(Y < y & X < x) / length(X)), (sum(Y < y & X > x) / length(X)) )
prob_Y_greater_than_y <- c( (sum(Y > y & X < x) / length(X)), (sum(Y > y & X > x) / length(X)) )
#
table1 <- rbind.data.frame(prob_Y_lesser_than_y, prob_Y_greater_than_y) # Building the table of Joint Probability Distributions.
#
colnames(table1) <- c("P(X < x)", "P(X > x)") # Set the column headings
rownames(table1) <- c("P(Y < y)", "P(Y > y)") # Set the row names
table1
## P(X < x) P(X > x)
## P(Y < y) 0.1245 0.1255
## P(Y > y) 0.3755 0.3745
Now, let’s compute the relevant marginal and joint probabilities from the table above:
print(paste('P(X > x) = ', sum(table1[,2]))) # P(X > x) = 0.1255 + 0.3745 = 0.5 (sum of the right column)
## [1] "P(X > x) = 0.5"
print(paste('P(Y > y) = ', sum(table1[2,]))) # P(Y > y) = 0.3755 + 0.3745 = 0.75 (sum of the bottom row)
## [1] "P(Y > y) = 0.75"
#
print(paste('P(X > x and Y > y) = ', table1[2,2])) # Bottom-right element of the table
## [1] "P(X > x and Y > y) = 0.3745"
print(paste('Therefore, P(X > x).P(Y > y) = ', sum(table1[,2]) * sum(table1[2,]))) # The product P(X > x)P(Y > y)
## [1] "Therefore, P(X > x).P(Y > y) = 0.375"
So, we observe that P(X > x and Y > y) = 0.3745 and P(X > x)P(Y > y) = 0.375. They are almost equal, which suggests (but does not by itself prove) that the two events are independent.
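As an aside, the same table can be built more idiomatically with table() and prop.table(); a minimal sketch equivalent to the manual construction above:
joint <- prop.table(table(X > x, Y > y))     # rows: X > x; columns: Y > y
addmargins(joint)                            # joint probabilities with marginal sums
joint["TRUE", "TRUE"]                        # P(X > x and Y > y)
sum(joint["TRUE", ]) * sum(joint[, "TRUE"])  # P(X > x) * P(Y > y)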
Q. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Let’s state the null and alternative hypotheses
H0: X > x and Y > y are independent events.
HA: X > x and Y > y are not independent events.
Computing the Contingency Table:
count_Y_lesser_than_y <- c( (sum(Y < y & X < x)), (sum(Y < y & X > x)) )
count_Y_greater_than_y <- c( (sum(Y > y & X < x)), (sum(Y > y & X > x)) )
#
table2 <- rbind.data.frame(count_Y_lesser_than_y, count_Y_greater_than_y) # Building the table of counts.
#
colnames(table2) <- c("Count(X < x)", "Count(X > x)") # Set the column headings
rownames(table2) <- c("Count(Y < y)", "Count(Y > y)") # Set the row names
table2
## Count(X < x) Count(X > x)
## Count(Y < y) 1245 1255
## Count(Y > y) 3755 3745
Fisher’s Exact test, below:
fisher.test(table2)
##
## Fisher's Exact Test for Count Data
##
## data: table2
## p-value = 0.8354
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9028128 1.0842857
## sample estimates:
## odds ratio
## 0.9893898
Chi-squared test, below:
chisq.test(table2)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table2
## X-squared = 0.0432, df = 1, p-value = 0.8353
We observe that the p-values from both tests are almost equal and well above 0.05. Therefore, we fail to reject the null hypothesis: the data are consistent with independence.
On the appropriateness of Fisher’s Exact test vs. the Chi-squared test: the Chi-squared test relies on a large-sample approximation and applies to contingency tables of order 2 x 2 or larger, while Fisher’s Exact test computes an exact p-value and is typically used for 2 x 2 tables with small cell counts (reference: https://www.datascienceblog.net/post/statistical_test/contingency_table_tests/). In this case the table is 2 x 2 and every cell count is large, so both tests are appropriate; with counts this large, the Chi-squared test is the conventional choice.
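A common rule of thumb is that the Chi-squared approximation is reliable when every expected cell count is at least 5; a quick sketch of that check, using the expected counts chisq.test() computes under independence:
# All expected counts here are far above 5 at n = 10,000, so the
# Chi-squared approximation is safe.
chisq.test(table2)$expected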
Problem 2
Q. You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following.
Answer
I registered with Kaggle.com; my user name and score are reported at the end of this document.
Q. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
Very basic descriptive statistics of train.csv are provided below:
train_csv <- read.csv("./house-prices-advanced-regression-techniques/train.csv")
head(train_csv)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl
## 2 2 20 RL 80 9600 Pave <NA> Reg Lvl
## 3 3 60 RL 68 11250 Pave <NA> IR1 Lvl
## 4 4 70 RL 60 9550 Pave <NA> IR1 Lvl
## 5 5 60 RL 84 14260 Pave <NA> IR1 Lvl
## 6 6 50 RL 85 14115 Pave <NA> IR1 Lvl
## Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 2 AllPub FR2 Gtl Veenker Feedr Norm 1Fam
## 3 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 4 AllPub Corner Gtl Crawfor Norm Norm 1Fam
## 5 AllPub FR2 Gtl NoRidge Norm Norm 1Fam
## 6 AllPub Inside Gtl Mitchel Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1 2Story 7 5 2003 2003 Gable CompShg
## 2 1Story 6 8 1976 1976 Gable CompShg
## 3 2Story 7 5 2001 2002 Gable CompShg
## 4 2Story 7 5 1915 1970 Gable CompShg
## 5 2Story 8 5 2000 2000 Gable CompShg
## 6 1.5Fin 5 5 1993 1995 Gable CompShg
## Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1 VinylSd VinylSd BrkFace 196 Gd TA PConc
## 2 MetalSd MetalSd None 0 TA TA CBlock
## 3 VinylSd VinylSd BrkFace 162 Gd TA PConc
## 4 Wd Sdng Wd Shng None 0 TA TA BrkTil
## 5 VinylSd VinylSd BrkFace 350 Gd TA PConc
## 6 VinylSd VinylSd None 0 TA TA Wood
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 Gd TA No GLQ 706 Unf
## 2 Gd TA Gd ALQ 978 Unf
## 3 Gd TA Mn GLQ 486 Unf
## 4 TA Gd No ALQ 216 Unf
## 5 Gd TA Av GLQ 655 Unf
## 6 Gd TA No GLQ 732 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## 2 0 284 1262 GasA Ex Y SBrkr
## 3 0 434 920 GasA Ex Y SBrkr
## 4 0 540 756 GasA Gd Y SBrkr
## 5 0 490 1145 GasA Ex Y SBrkr
## 6 0 64 796 GasA Ex Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1 856 854 0 1710 1 0 2
## 2 1262 0 0 1262 0 1 2
## 3 920 866 0 1786 1 0 2
## 4 961 756 0 1717 1 0 1
## 5 1145 1053 0 2198 1 0 2
## 6 796 566 0 1362 1 0 1
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 1 3 1 Gd 8 Typ
## 2 0 3 1 TA 6 Typ
## 3 1 3 1 Gd 6 Typ
## 4 0 3 1 Gd 7 Typ
## 5 1 4 1 Gd 9 Typ
## 6 1 1 1 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 <NA> Attchd 2003 RFn 2
## 2 1 TA Attchd 1976 RFn 2
## 3 1 TA Attchd 2001 RFn 2
## 4 1 Gd Detchd 1998 Unf 3
## 5 1 TA Attchd 2000 RFn 3
## 6 0 <NA> Attchd 1993 Unf 2
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## 2 460 TA TA Y 298 0
## 3 608 TA TA Y 0 42
## 4 642 TA TA Y 0 35
## 5 836 TA TA Y 192 84
## 6 480 TA TA Y 40 30
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 <NA> <NA> <NA>
## 2 0 0 0 0 <NA> <NA> <NA>
## 3 0 0 0 0 <NA> <NA> <NA>
## 4 272 0 0 0 <NA> <NA> <NA>
## 5 0 0 0 0 <NA> <NA> <NA>
## 6 0 320 0 0 <NA> MnPrv Shed
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
## 2 0 5 2007 WD Normal 181500
## 3 0 9 2008 WD Normal 223500
## 4 0 2 2006 WD Abnorml 140000
## 5 0 12 2008 WD Normal 250000
## 6 700 10 2009 WD Normal 143000
dim(train_csv)
## [1] 1460 81
describe(train_csv)
## vars n mean sd median trimmed mad min
## Id 1 1460 730.50 421.61 730.5 730.50 541.15 1
## MSSubClass 2 1460 56.90 42.30 50.0 49.15 44.48 20
## MSZoning* 3 1460 4.03 0.63 4.0 4.06 0.00 1
## LotFrontage 4 1201 70.05 24.28 69.0 68.94 16.31 21
## LotArea 5 1460 10516.83 9981.26 9478.5 9563.28 2962.23 1300
## Street* 6 1460 2.00 0.06 2.0 2.00 0.00 1
## Alley* 7 91 1.45 0.50 1.0 1.44 0.00 1
## LotShape* 8 1460 2.94 1.41 4.0 3.05 0.00 1
## LandContour* 9 1460 3.78 0.71 4.0 4.00 0.00 1
## Utilities* 10 1460 1.00 0.03 1.0 1.00 0.00 1
## LotConfig* 11 1460 4.02 1.62 5.0 4.27 0.00 1
## LandSlope* 12 1460 1.06 0.28 1.0 1.00 0.00 1
## Neighborhood* 13 1460 13.15 5.89 13.0 13.11 7.41 1
## Condition1* 14 1460 3.03 0.87 3.0 3.00 0.00 1
## Condition2* 15 1460 3.01 0.26 3.0 3.00 0.00 1
## BldgType* 16 1460 1.49 1.20 1.0 1.14 0.00 1
## HouseStyle* 17 1460 4.04 1.91 3.0 4.03 1.48 1
## OverallQual 18 1460 6.10 1.38 6.0 6.08 1.48 1
## OverallCond 19 1460 5.58 1.11 5.0 5.48 0.00 1
## YearBuilt 20 1460 1971.27 30.20 1973.0 1974.13 37.06 1872
## YearRemodAdd 21 1460 1984.87 20.65 1994.0 1986.37 19.27 1950
## RoofStyle* 22 1460 2.41 0.83 2.0 2.26 0.00 1
## RoofMatl* 23 1460 2.08 0.60 2.0 2.00 0.00 1
## Exterior1st* 24 1460 10.62 3.20 13.0 10.93 1.48 1
## Exterior2nd* 25 1460 11.34 3.54 14.0 11.65 2.97 1
## MasVnrType* 26 1452 2.76 0.62 3.0 2.73 0.00 1
## MasVnrArea 27 1452 103.69 181.07 0.0 63.15 0.00 0
## ExterQual* 28 1460 3.54 0.69 4.0 3.65 0.00 1
## ExterCond* 29 1460 4.73 0.73 5.0 4.95 0.00 1
## Foundation* 30 1460 2.40 0.72 2.0 2.46 1.48 1
## BsmtQual* 31 1423 3.26 0.87 3.0 3.43 1.48 1
## BsmtCond* 32 1423 3.81 0.66 4.0 4.00 0.00 1
## BsmtExposure* 33 1422 3.27 1.15 4.0 3.46 0.00 1
## BsmtFinType1* 34 1423 3.73 1.83 3.0 3.79 2.97 1
## BsmtFinSF1 35 1460 443.64 456.10 383.5 386.08 568.58 0
## BsmtFinType2* 36 1422 5.71 0.94 6.0 5.98 0.00 1
## BsmtFinSF2 37 1460 46.55 161.32 0.0 1.38 0.00 0
## BsmtUnfSF 38 1460 567.24 441.87 477.5 519.29 426.99 0
## TotalBsmtSF 39 1460 1057.43 438.71 991.5 1036.70 347.67 0
## Heating* 40 1460 2.04 0.30 2.0 2.00 0.00 1
## HeatingQC* 41 1460 2.54 1.74 1.0 2.42 0.00 1
## CentralAir* 42 1460 1.93 0.25 2.0 2.00 0.00 1
## Electrical* 43 1459 4.68 1.05 5.0 5.00 0.00 1
## X1stFlrSF 44 1460 1162.63 386.59 1087.0 1129.99 347.67 334
## X2ndFlrSF 45 1460 346.99 436.53 0.0 285.36 0.00 0
## LowQualFinSF 46 1460 5.84 48.62 0.0 0.00 0.00 0
## GrLivArea 47 1460 1515.46 525.48 1464.0 1467.67 483.33 334
## BsmtFullBath 48 1460 0.43 0.52 0.0 0.39 0.00 0
## BsmtHalfBath 49 1460 0.06 0.24 0.0 0.00 0.00 0
## FullBath 50 1460 1.57 0.55 2.0 1.56 0.00 0
## HalfBath 51 1460 0.38 0.50 0.0 0.34 0.00 0
## BedroomAbvGr 52 1460 2.87 0.82 3.0 2.85 0.00 0
## KitchenAbvGr 53 1460 1.05 0.22 1.0 1.00 0.00 0
## KitchenQual* 54 1460 3.34 0.83 4.0 3.50 0.00 1
## TotRmsAbvGrd 55 1460 6.52 1.63 6.0 6.41 1.48 2
## Functional* 56 1460 6.75 0.98 7.0 7.00 0.00 1
## Fireplaces 57 1460 0.61 0.64 1.0 0.53 1.48 0
## FireplaceQu* 58 770 3.73 1.13 3.0 3.80 1.48 1
## GarageType* 59 1379 3.28 1.79 2.0 3.11 0.00 1
## GarageYrBlt 60 1379 1978.51 24.69 1980.0 1981.07 31.13 1900
## GarageFinish* 61 1379 2.18 0.81 2.0 2.23 1.48 1
## GarageCars 62 1460 1.77 0.75 2.0 1.77 0.00 0
## GarageArea 63 1460 472.98 213.80 480.0 469.81 177.91 0
## GarageQual* 64 1379 4.86 0.61 5.0 5.00 0.00 1
## GarageCond* 65 1379 4.90 0.52 5.0 5.00 0.00 1
## PavedDrive* 66 1460 2.86 0.50 3.0 3.00 0.00 1
## WoodDeckSF 67 1460 94.24 125.34 0.0 71.76 0.00 0
## OpenPorchSF 68 1460 46.66 66.26 25.0 33.23 37.06 0
## EnclosedPorch 69 1460 21.95 61.12 0.0 3.87 0.00 0
## X3SsnPorch 70 1460 3.41 29.32 0.0 0.00 0.00 0
## ScreenPorch 71 1460 15.06 55.76 0.0 0.00 0.00 0
## PoolArea 72 1460 2.76 40.18 0.0 0.00 0.00 0
## PoolQC* 73 7 2.14 0.90 2.0 2.14 1.48 1
## Fence* 74 281 2.43 0.86 3.0 2.48 0.00 1
## MiscFeature* 75 54 2.91 0.45 3.0 3.00 0.00 1
## MiscVal 76 1460 43.49 496.12 0.0 0.00 0.00 0
## MoSold 77 1460 6.32 2.70 6.0 6.25 2.97 1
## YrSold 78 1460 2007.82 1.33 2008.0 2007.77 1.48 2006
## SaleType* 79 1460 8.51 1.56 9.0 8.92 0.00 1
## SaleCondition* 80 1460 4.77 1.10 5.0 5.00 0.00 1
## SalePrice 81 1460 180921.20 79442.50 163000.0 170783.29 56338.80 34900
## max range skew kurtosis se
## Id 1460 1459 0.00 -1.20 11.03
## MSSubClass 190 170 1.40 1.56 1.11
## MSZoning* 5 4 -1.73 6.25 0.02
## LotFrontage 313 292 2.16 17.34 0.70
## LotArea 215245 213945 12.18 202.26 261.22
## Street* 2 1 -15.49 238.01 0.00
## Alley* 2 1 0.20 -1.98 0.05
## LotShape* 4 3 -0.61 -1.60 0.04
## LandContour* 4 3 -3.16 8.65 0.02
## Utilities* 2 1 38.13 1453.00 0.00
## LotConfig* 5 4 -1.13 -0.59 0.04
## LandSlope* 3 2 4.80 24.47 0.01
## Neighborhood* 25 24 0.02 -1.06 0.15
## Condition1* 9 8 3.01 16.34 0.02
## Condition2* 8 7 13.14 247.54 0.01
## BldgType* 5 4 2.24 3.41 0.03
## HouseStyle* 8 7 0.31 -0.96 0.05
## OverallQual 10 9 0.22 0.09 0.04
## OverallCond 9 8 0.69 1.09 0.03
## YearBuilt 2010 138 -0.61 -0.45 0.79
## YearRemodAdd 2010 60 -0.50 -1.27 0.54
## RoofStyle* 6 5 1.47 0.61 0.02
## RoofMatl* 8 7 8.09 66.28 0.02
## Exterior1st* 15 14 -0.72 -0.37 0.08
## Exterior2nd* 16 15 -0.69 -0.52 0.09
## MasVnrType* 4 3 -0.07 -0.13 0.02
## MasVnrArea 1600 1600 2.66 10.03 4.75
## ExterQual* 4 3 -1.83 3.86 0.02
## ExterCond* 5 4 -2.56 5.29 0.02
## Foundation* 6 5 0.09 1.02 0.02
## BsmtQual* 4 3 -1.31 1.27 0.02
## BsmtCond* 4 3 -3.39 10.14 0.02
## BsmtExposure* 4 3 -1.15 -0.39 0.03
## BsmtFinType1* 6 5 -0.02 -1.39 0.05
## BsmtFinSF1 5644 5644 1.68 11.06 11.94
## BsmtFinType2* 6 5 -3.56 12.32 0.02
## BsmtFinSF2 1474 1474 4.25 20.01 4.22
## BsmtUnfSF 2336 2336 0.92 0.46 11.56
## TotalBsmtSF 6110 6110 1.52 13.18 11.48
## Heating* 6 5 9.83 110.98 0.01
## HeatingQC* 5 4 0.48 -1.51 0.05
## CentralAir* 2 1 -3.52 10.42 0.01
## Electrical* 5 4 -3.06 7.49 0.03
## X1stFlrSF 4692 4358 1.37 5.71 10.12
## X2ndFlrSF 2065 2065 0.81 -0.56 11.42
## LowQualFinSF 572 572 8.99 82.83 1.27
## GrLivArea 5642 5308 1.36 4.86 13.75
## BsmtFullBath 3 3 0.59 -0.84 0.01
## BsmtHalfBath 2 2 4.09 16.31 0.01
## FullBath 3 3 0.04 -0.86 0.01
## HalfBath 2 2 0.67 -1.08 0.01
## BedroomAbvGr 8 8 0.21 2.21 0.02
## KitchenAbvGr 3 3 4.48 21.42 0.01
## KitchenQual* 4 3 -1.42 1.72 0.02
## TotRmsAbvGrd 14 12 0.67 0.87 0.04
## Functional* 7 6 -4.08 16.37 0.03
## Fireplaces 3 3 0.65 -0.22 0.02
## FireplaceQu* 5 4 -0.16 -0.98 0.04
## GarageType* 6 5 0.76 -1.30 0.05
## GarageYrBlt 2010 110 -0.65 -0.42 0.66
## GarageFinish* 3 2 -0.35 -1.41 0.02
## GarageCars 4 4 -0.34 0.21 0.02
## GarageArea 1418 1418 0.18 0.90 5.60
## GarageQual* 5 4 -4.43 18.25 0.02
## GarageCond* 5 4 -5.28 26.77 0.01
## PavedDrive* 3 2 -3.30 9.22 0.01
## WoodDeckSF 857 857 1.54 2.97 3.28
## OpenPorchSF 547 547 2.36 8.44 1.73
## EnclosedPorch 552 552 3.08 10.37 1.60
## X3SsnPorch 508 508 10.28 123.06 0.77
## ScreenPorch 480 480 4.11 18.34 1.46
## PoolArea 738 738 14.80 222.19 1.05
## PoolQC* 3 2 -0.22 -1.90 0.34
## Fence* 4 3 -0.57 -0.88 0.05
## MiscFeature* 4 3 -2.93 10.71 0.06
## MiscVal 15500 15500 24.43 697.64 12.98
## MoSold 12 11 0.21 -0.41 0.07
## YrSold 2010 4 0.10 -1.19 0.03
## SaleType* 9 8 -3.83 14.57 0.04
## SaleCondition* 6 5 -2.74 6.82 0.03
## SalePrice 755000 720100 1.88 6.50 2079.11
Facts:
1) There are 1460 observations and 81 columns in train.csv. If we exclude Id and SalePrice, then there are 79 regressor variables.
2) SalePrice is the dependent variable.
Since the price of a house may depend quite a bit on its spatial characteristics, I’ll consider three area-related variables, plus the dependent variable SalePrice, for statistical analysis (a quick correlation check follows the list):
1) LotArea: Lot size in square feet
2) GrLivArea: Above grade (ground) living area square feet
3) GarageArea: Size of garage in square feet
4) SalePrice: Dependent variable
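As a quick sanity check on this choice, a sketch computing each area variable’s Pearson correlation with SalePrice:
# Correlation of each chosen predictor with the dependent variable
sapply(c("LotArea", "GrLivArea", "GarageArea"),
       function(v) cor(train_csv[[v]], train_csv$SalePrice))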
Univariate descriptive statistics of LotArea:
summary(train_csv$LotArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
hist(train_csv$LotArea, main = "LotArea: Lot size in square feet")
Univariate descriptive statistics of GrLivArea:
summary(train_csv$GrLivArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1130 1464 1515 1777 5642
hist(train_csv$GrLivArea, main = "GrLivArea: Above grade (ground) living area square feet")
Univariate descriptive statistics of GarageArea:
summary(train_csv$GarageArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 334.5 480.0 473.0 576.0 1418.0
hist(train_csv$GarageArea, main = "GarageArea: Size of garage in square feet")
Univariate descriptive statistics of SalePrice:
summary(train_csv$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
hist(train_csv$SalePrice, main = "SalePrice: Dependent variable")
Now, I’ll provide the scatter plots.
Scatter plot of LotArea vs SalePrice:
plot(train_csv$LotArea, train_csv$SalePrice, xlab = 'Lot size in square feet', ylab = 'Sale Price', main = 'Scatter Plot of Lot size vs. Sale Price')
Scatter plot of GrLivArea vs SalePrice:
plot(train_csv$GrLivArea, train_csv$SalePrice, xlab = 'Above grade (ground) living area square feet', ylab = 'Sale Price', main = 'Scatter Plot of Above grade (ground) living area vs. Sale Price')
Scatter plot of GarageArea vs SalePrice:
plot(train_csv$GarageArea, train_csv$SalePrice, xlab = 'Size of garage in square feet', ylab = 'Sale Price', main = 'Scatter Plot of Size of garage vs. Sale Price')
Now, I’ll provide the correlation matrix.
Correlation matrix of the three variables LotArea, GrLivArea, GarageArea:
cor_df <- data.frame(train_csv$LotArea, train_csv$GrLivArea, train_csv$GarageArea)
cor_matrix <- cor(cor_df)
colnames(cor_matrix) <- c("LotArea", "GrLivArea", "GarageArea") # Set the column headings
rownames(cor_matrix) <- c("LotArea", "GrLivArea", "GarageArea") # Set the row names
cor_matrix
## LotArea GrLivArea GarageArea
## LotArea 1.0000000 0.2631162 0.1804028
## GrLivArea 0.2631162 1.0000000 0.4689975
## GarageArea 0.1804028 0.4689975 1.0000000
Now, for a comparative visualization, I’ll display pairwise scatter plot matrices.
Pairwise scatter plot matrix of four variables SalePrice, LotArea, GrLivArea, GarageArea:
pairs(~SalePrice+LotArea+GrLivArea+GarageArea, data = train_csv, main = "Pairwise scatter plot matrix")
Now, we’ll test the hypothesis that the correlation between each pairwise set of variables is 0, and provide 80% confidence intervals.
For the following analysis, I referred to Chapter 6 of Introductory Statistics with R by Peter Dalgaard (Springer), and some internet resources, e.g., https://rcompanion.org/handbook/D_01.html.
Hypothesis testing: LotArea, GrLivArea
cor.test(train_csv$LotArea, train_csv$GrLivArea, method = "pearson", conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train_csv$LotArea and train_csv$GrLivArea
## t = 10.414, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2315997 0.2940809
## sample estimates:
## cor
## 0.2631162
Hypothesis testing: LotArea, GarageArea
cor.test(train_csv$LotArea, train_csv$GarageArea, method = "pearson", conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train_csv$LotArea and train_csv$GarageArea
## t = 7.0034, df = 1458, p-value = 3.803e-12
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.1477356 0.2126767
## sample estimates:
## cor
## 0.1804028
Hypothesis testing: GarageArea, GrLivArea
cor.test(train_csv$GarageArea, train_csv$GrLivArea, method = "pearson", conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train_csv$GarageArea and train_csv$GrLivArea
## t = 20.276, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4423993 0.4947713
## sample estimates:
## cor
## 0.4689975
Assuming a significance level of 0.05, we observe that each p-value is far below it. Therefore, the null hypothesis of zero correlation is rejected for all three pairs.
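As a side note, psych::corr.test() (psych is already loaded) runs all the pairwise tests in one call; a sketch, noting that per its documentation the p matrix reports raw p-values below the diagonal and multiple-test-adjusted ones (Holm by default) above it:
ct <- corr.test(cor_df, method = "pearson")
ct$r  # same correlations as above (under the original column names)
ct$p  # raw p-values below the diagonal, Holm-adjusted above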
The issue of Familywise Error:
First, the definition of Familywise Error Rate (from https://www.statisticshowto.com/familywise-error-rate/):
The familywise error rate (FWE or FWER) is the probability of coming to at least one false conclusion in a series of hypothesis tests. In other words, it’s the probability of making at least one Type I error. The term “familywise” comes from the notion of a family of tests, the technical term for a series of tests on the same data.
In my case there are only three tests and the p-values are extremely small, so I wouldn’t worry about familywise error here; the sketch below quantifies this.
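A minimal sketch of the arithmetic behind that judgment: with m = 3 tests at \(\alpha = 0.05\), the familywise error rate is at most 1 - (1 - 0.05)^3, roughly 0.14, and because the observed p-values are all below 4e-12, even a Bonferroni correction leaves all three rejections intact.
m <- 3
1 - (1 - 0.05)^m  # FWER upper bound for independent tests: ~0.1426
# Bonferroni-adjust the three p-values reported above; all remain tiny.
p.adjust(c(2.2e-16, 3.803e-12, 2.2e-16), method = "bonferroni")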
Q. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
Inversion
prec_matrix <- solve(cor_matrix)
prec_matrix
## LotArea GrLivArea GarageArea
## LotArea 1.07920917 -0.2469705 -0.07886378
## GrLivArea -0.24697046 1.3385010 -0.58319943
## GarageArea -0.07886378 -0.5831994 1.28774631
Now, I’ll multiply the correlation matrix by the precision matrix; after rounding, the result is the identity, as expected:
round(cor_matrix %*% prec_matrix)
## LotArea GrLivArea GarageArea
## LotArea 1 0 0
## GrLivArea 0 1 0
## GarageArea 0 0 1
Then, I’ll multiply the precision matrix by the correlation matrix:
round(prec_matrix %*% cor_matrix)
## LotArea GrLivArea GarageArea
## LotArea 1 0 0
## GrLivArea 0 1 0
## GarageArea 0 0 1
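Instead of rounding, we can check numerically how close both products are to the identity; a small sketch:
# Largest absolute deviation from the 3 x 3 identity; both should be ~0
max(abs(cor_matrix %*% prec_matrix - diag(3)))
max(abs(prec_matrix %*% cor_matrix - diag(3)))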
LU Decomposition of the correlation matrix (lower triangular factor):
expand(lu(cor_matrix))$L
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
## [,1] [,2] [,3]
## [1,] 1.0000000 . .
## [2,] 0.2631162 1.0000000 .
## [3,] 0.1804028 0.4528838 1.0000000
LU Decomposition of the correlation matrix (upper triangular factor):
expand(lu(cor_matrix))$U
## 3 x 3 Matrix of class "dtrMatrix"
## [,1] [,2] [,3]
## [1,] 1.0000000 0.2631162 0.1804028
## [2,] . 0.9307699 0.4215306
## [3,] . . 0.7765505
LU Decomposition of the precision matrix (lower triangular factor), this time using matrixcalc::lu.decomposition():
lu.decomposition(prec_matrix)$L
## [,1] [,2] [,3]
## [1,] 1.00000000 0.0000000 0
## [2,] -0.22884393 1.0000000 0
## [3,] -0.07307553 -0.4689975 1
LU Decomposition of the precision matrix (upper triangular factor):
lu.decomposition(prec_matrix)$U
## [,1] [,2] [,3]
## [1,] 1.079209 -0.2469705 -0.07886378
## [2,] 0.000000 1.2819833 -0.60124693
## [3,] 0.000000 0.0000000 1.00000000
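A quick sanity check (sketch): multiplying the factors back together should reproduce the original matrices up to floating-point error. matrixcalc::lu.decomposition() returns L and U with A = LU, so:
lu_c <- lu.decomposition(cor_matrix)
max(abs(lu_c$L %*% lu_c$U - cor_matrix))   # ~0
lu_p <- lu.decomposition(prec_matrix)
max(abs(lu_p$L %*% lu_p$U - prec_matrix))  # ~0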
Q. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
I tested several variables for right skewness (MasVnrArea, BsmtFinSF1, TotalBsmtSF, BsmtUnfSF, WoodDeckSF, OpenPorchSF, EnclosedPorch) and finally selected TotalBsmtSF, the total square feet of basement area, which is clearly skewed to the right. Its histogram and summary follow.
Histogram of Total square feet of basement area:
hist(train_csv$TotalBsmtSF, breaks = 200, main = "Total square feet of basement area")
Summary of Total square feet of basement area:
summary(train_csv$TotalBsmtSF)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 795.8 991.5 1057.4 1298.2 6110.0
Since the minimum of TotalBsmtSF is 0, I’ll shift the variable by a small amount so that it is strictly positive; I also drop the extreme values above 5,000 square feet. Please see the shifted histogram below.
TotalBsmtSF_shft <- train_csv$TotalBsmtSF[train_csv$TotalBsmtSF < 5000] + 0.01 # drop outliers >= 5000 sq ft, then shift so the minimum is strictly positive
hist(TotalBsmtSF_shft, breaks = 200, main = "Total square feet of basement area, shifted")
We’ll now calculate the optimal value of \(\lambda\) for an exponential fit to this distribution.
lambda <- fitdistr(TotalBsmtSF_shft, 'exponential')
lambda
## rate
## 9.487878e-04
## (2.483942e-05)
lambda$estimate
## rate
## 0.0009487878
Then, we’ll take 1000 samples from this exponential distribution using this value, rexp(1000, \(\lambda\)), and plot a histogram.
exp_samp_1000 <- rexp(1000, lambda$estimate)
hist(exp_samp_1000, breaks = 200, main = 'Fitted Distribution with 1000 samples')
Comparison: compared with the histogram of the original variable, the histogram of the 1000 exponential samples is far more heavily skewed toward zero.
The 5th and 95th percentiles from the fitted exponential, via qexp(), the inverse of the CDF:
print( paste('5th Percentile = ', qexp(0.05, rate = lambda$estimate, lower.tail = TRUE, log.p = FALSE)) )
## [1] "5th Percentile = 54.0619225502357"
print( paste('95th Percentile = ', qexp(0.95, rate = lambda$estimate, lower.tail = TRUE, log.p = FALSE)) )
## [1] "95th Percentile = 3157.43116303767"
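For reference, the exponential quantile function has the closed form \(q(p) = -\ln(1 - p)/\lambda\); a sketch showing it reproduces qexp() exactly:
rate <- lambda$estimate
-log(1 - c(0.05, 0.95)) / rate  # same values as qexp(c(0.05, 0.95), rate)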
Next, a 95% confidence interval from the empirical data, assuming normality. I computed the confidence interval with Rmisc::CI(); it could equivalently be computed with qnorm(), as sketched after the output.
# Confidence Interval using the shifted data:
CI(TotalBsmtSF_shft, ci = 0.95)
## upper mean lower
## 1075.464 1053.976 1032.489
# Confidence Interval using the original data, without the shift:
CI(train_csv$TotalBsmtSF, ci = 0.95)
## upper mean lower
## 1079.951 1057.429 1034.908
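As mentioned, the interval can also be computed by hand; a minimal sketch with qnorm(), assuming normality of the sample mean (Rmisc::CI() uses the t distribution, which is nearly identical at n = 1460):
v <- train_csv$TotalBsmtSF
mean(v) + qnorm(c(0.025, 0.975)) * sd(v) / sqrt(length(v))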
Finally, the empirical 5th and 95th percentiles of the data:
quantile(TotalBsmtSF_shft, c(0.05, 0.95))
## 5% 95%
## 518.61 1752.11
Discussion: The percentiles of the fitted exponential model (about 54 and 3157) are far from the empirical 5th and 95th percentiles (about 519 and 1752), and the histograms are also very different: the exponential samples are much more heavily skewed to the right. The exponential distribution is therefore a poor model for TotalBsmtSF; the CDF comparison sketched below makes this visible.
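To visualize the mismatch, a sketch comparing the empirical CDF with the fitted exponential CDF; the gap between the two curves confirms the poor fit:
plot(ecdf(TotalBsmtSF_shft), main = "Empirical CDF vs. fitted exponential CDF")
curve(pexp(x, rate = lambda$estimate), add = TRUE, col = "red")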
Q. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
I’ll start with most of the quantitative variables and fit a model to identify the significant ones:
vars <- c("LotArea", "OverallQual", "YearBuilt", "YearRemodAdd", "MasVnrArea", "BsmtFinSF1", "TotalBsmtSF", "X1stFlrSF", "X2ndFlrSF", "GrLivArea", "FullBath", "TotRmsAbvGrd", "Fireplaces", "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "SalePrice")
df2 <- train_csv[, vars] # Subset the candidate quantitative predictors plus SalePrice
mod1 <- lm(SalePrice ~., data = df2)
summary(mod1)
##
## Call:
## lm(formula = SalePrice ~ ., data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -517677 -17003 -1878 14217 289379
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.131e+06 1.261e+05 -8.973 < 2e-16 ***
## LotArea 4.928e-01 1.033e-01 4.769 2.04e-06 ***
## OverallQual 1.915e+04 1.172e+03 16.340 < 2e-16 ***
## YearBuilt 1.762e+02 4.948e+01 3.560 0.000383 ***
## YearRemodAdd 3.621e+02 6.163e+01 5.875 5.24e-09 ***
## MasVnrArea 2.968e+01 6.115e+00 4.853 1.35e-06 ***
## BsmtFinSF1 1.640e+01 2.582e+00 6.352 2.84e-10 ***
## TotalBsmtSF 1.050e+01 4.272e+00 2.457 0.014142 *
## X1stFlrSF 2.363e+01 2.067e+01 1.143 0.253146
## X2ndFlrSF 1.601e+01 2.033e+01 0.788 0.431023
## GrLivArea 2.143e+01 2.020e+01 1.061 0.288838
## FullBath -1.710e+03 2.610e+03 -0.655 0.512480
## TotRmsAbvGrd 1.744e+03 1.081e+03 1.614 0.106766
## Fireplaces 6.677e+03 1.788e+03 3.734 0.000196 ***
## GarageCars 9.986e+03 2.938e+03 3.398 0.000696 ***
## GarageArea 9.015e+00 9.981e+00 0.903 0.366560
## WoodDeckSF 2.709e+01 8.103e+00 3.343 0.000850 ***
## OpenPorchSF 6.811e+00 1.561e+01 0.436 0.662624
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36110 on 1434 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.795, Adjusted R-squared: 0.7926
## F-statistic: 327.2 on 17 and 1434 DF, p-value: < 2.2e-16
Now, I’ll keep only the significant quantitative variables and refit the model:
vars <- c("LotArea", "OverallQual", "YearBuilt", "YearRemodAdd", "MasVnrArea", "BsmtFinSF1", "TotalBsmtSF", "Fireplaces", "GarageCars", "WoodDeckSF", "SalePrice")
df2 <- train_csv[, vars] # Keep only the significant predictors plus SalePrice
#
mod1 <- lm(SalePrice ~., data = df2)
summary(mod1)
##
## Call:
## lm(formula = SalePrice ~ ., data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -424514 -20402 -2604 16180 367067
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.217e+05 1.273e+05 -6.455 1.48e-10 ***
## LotArea 7.234e-01 1.115e-01 6.487 1.20e-10 ***
## OverallQual 2.541e+04 1.188e+03 21.392 < 2e-16 ***
## YearBuilt -5.532e+01 4.854e+01 -1.140 0.255
## YearRemodAdd 4.391e+02 6.651e+01 6.603 5.67e-11 ***
## MasVnrArea 4.773e+01 6.528e+00 7.311 4.37e-13 ***
## BsmtFinSF1 1.409e+01 2.733e+00 5.157 2.85e-07 ***
## TotalBsmtSF 2.212e+01 3.280e+00 6.742 2.25e-11 ***
## Fireplaces 1.293e+04 1.867e+03 6.925 6.54e-12 ***
## GarageCars 1.769e+04 1.855e+03 9.535 < 2e-16 ***
## WoodDeckSF 3.926e+01 8.779e+00 4.472 8.36e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39380 on 1441 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.755, Adjusted R-squared: 0.7533
## F-statistic: 444.2 on 10 and 1441 DF, p-value: < 2.2e-16
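Beyond the residual histogram below, the standard lm diagnostic plots are worth a look; a sketch:
par(mfrow = c(1, 2))
plot(mod1, which = 1)  # residuals vs. fitted values (checks linearity)
plot(mod1, which = 2)  # normal Q-Q plot of residuals
par(mfrow = c(1, 1))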
Now, a histogram of the residuals:
hist(mod1$residuals, breaks = 200)
Now, I’ll import the test data set:
test_csv <- read.csv("./house-prices-advanced-regression-techniques/test.csv")
test_csv[complete.cases(test_csv),]
## [1] Id MSSubClass MSZoning LotFrontage LotArea
## [6] Street Alley LotShape LandContour Utilities
## [11] LotConfig LandSlope Neighborhood Condition1 Condition2
## [16] BldgType HouseStyle OverallQual OverallCond YearBuilt
## [21] YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## [26] MasVnrType MasVnrArea ExterQual ExterCond Foundation
## [31] BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## [36] BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## [41] HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF
## [46] LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## [51] HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## [56] Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## [61] GarageFinish GarageCars GarageArea GarageQual GarageCond
## [66] PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## [71] ScreenPorch PoolArea PoolQC Fence MiscFeature
## [76] MiscVal MoSold YrSold SaleType SaleCondition
## <0 rows> (or 0-length row.names)
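Since no test row is complete, it is worth checking which of the model’s own predictors are missing in the test set, because predict() returns NA for any row that is missing a predictor; a sketch:
colSums(is.na(test_csv[, setdiff(names(df2), "SalePrice")]))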
Note from the output above that no test row is complete: every row has at least one NA among its 80 columns. Predicting the output for the Kaggle submission:
prediction <- predict(mod1, test_csv)
#
kaggle_score <- data.frame(Id = test_csv[,"Id"], SalePrice = prediction)
kaggle_score[kaggle_score < 0] <- 0 # Clamp any negative predicted prices to 0
kaggle_score <- replace(kaggle_score, is.na(kaggle_score), 0) # Replace NA predictions with 0
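# Note (sketch, not used for the submitted score): zeroing negative or NA
# predictions is crude. An alternative is to impute missing test predictors
# with the training medians before calling predict(), e.g.:
# for (v in setdiff(names(df2), "SalePrice")) {
#   test_csv[[v]][is.na(test_csv[[v]])] <- median(df2[[v]], na.rm = TRUE)
# }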
write.csv(kaggle_score, file = "kaggle_score.csv", row.names = FALSE)
Kaggle and presentation related information
User Name
shovanbiswas
Display Name
Shovan Biswas
Kaggle score is copied below: [screenshot of Kaggle score]
Presentation link: