library(tidyverse)
library(knitr)
library(kableExtra)
library(stats)
library(MASS)
library(psych)
library(ggpubr)
library(matrixcalc)
library(caret)
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of \(\mu = \sigma = (N+1)/2\).
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
set.seed(11223)
N <- 10
n <- 10000
sigma <- (N + 1)/2
mu <- sigma
df <- data.frame(X = runif(n, min = 1, max = N), Y = rnorm(n, mean = mu, sd = sigma))
head(df, 10)
## X Y
## 1 3.802151 8.19249690
## 2 6.218204 1.20405093
## 3 6.405295 5.52387780
## 4 7.084643 7.01930549
## 5 6.506409 -4.56864516
## 6 9.597822 9.83654871
## 7 5.410977 10.89626221
## 8 7.078516 0.05238785
## 9 9.758482 3.74136127
## 10 5.094755 12.80918355
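As a quick sanity check on the simulation (not part of the required output), the sample mean and standard deviation of Y should both be close to \((N+1)/2 = 5.5\):

# Quick check: mean and sd of Y should both be near (N + 1)/2 = 5.5
c(mean = mean(df$Y), sd = sd(df$Y))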
Calculate the median of the X variable and the 1st quartile of the Y variable.
x = quantile(df$X, 0.5)
x
## 50%
## 5.532476
y = quantile(df$Y, 0.25)
y
## 25%
## 1.783892
Calculate as a minimum the below probabilities a through c.
a.) P(X>x | X>y)
The conditional probability of X>x given the event X>y is approximately 0.549. In other words, once we know an observation of X exceeds the 1st quartile of Y, there is roughly a 55% chance that it also exceeds the median of X.
prob_Xx_Xy = df %>% filter(X > x, X > y) %>% nrow()/n
prob_Xy = df %>% filter(X > y) %>% nrow()/n
prob = prob_Xx_Xy/prob_Xy
prob
## [1] 0.5493902
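Since x (the median of X, about 5.53) is larger than y (the 1st quartile of Y, about 1.78), the event X>x is contained in X>y, so the conditional probability reduces to a ratio of marginal probabilities:

\[P(X>x \mid X>y) = \frac{P(X>x,\ X>y)}{P(X>y)} = \frac{P(X>x)}{P(X>y)} \approx \frac{0.5}{0.91} \approx 0.55\]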
b.) P(X>x, Y>y)
The joint probability of X>x and Y>y is approximately 0.377.
prob_Xx_Yy <- df %>% filter(X > x, Y > y) %>% nrow()/n
prob_Xx_Yy
## [1] 0.3766
c.) P(X<x | X>y)
The conditional probability of X<x given the event X>y is approximately 0.451.
prob_xX_Xy = df %>% filter(X < x, X > y) %>% nrow()/n
prob_Xy = df %>% filter(X > y) %>% nrow()/n
prob = prob_xX_Xy/prob_Xy
prob
## [1] 0.4506098
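Because the events X<x and X>x partition the outcomes (ignoring ties at X = x), this result is simply the complement of part (a):

\[P(X<x \mid X>y) = 1 - P(X>x \mid X>y) \approx 1 - 0.549 = 0.451\]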
Investigate whether P(X>x and Y>y) = P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
Joint Probability:
joint_prob_AB = df %>% mutate(A = ifelse(X > x, "X > x", "X < x"), B = ifelse(Y > y, " Y > y", " Y < y")) %>% group_by(A, B) %>% summarise(count = n()) %>% mutate(probability = count/n)
joint_prob_AB
## # A tibble: 4 x 4
## # Groups: A [2]
## A B count probability
## <chr> <chr> <int> <dbl>
## 1 X < x " Y < y" 1266 0.127
## 2 X < x " Y > y" 3734 0.373
## 3 X > x " Y < y" 1234 0.123
## 4 X > x " Y > y" 3766 0.377
Marginal Probability:
marginal_prob_A = joint_prob_AB %>% ungroup() %>% group_by(A) %>% summarize(count = sum(count), probability = sum(probability))
marginal_prob_B = joint_prob_AB %>% ungroup() %>% group_by(B) %>% summarize(count = sum(count), probability = sum(probability))
marginal_prob_A
## # A tibble: 2 x 3
## A count probability
## <chr> <int> <dbl>
## 1 X < x 5000 0.5
## 2 X > x 5000 0.5
marginal_prob_B
## # A tibble: 2 x 3
## B count probability
## <chr> <int> <dbl>
## 1 " Y < y" 2500 0.25
## 2 " Y > y" 7500 0.75
Table:
tab <- bind_rows(joint_prob_AB, marginal_prob_A, marginal_prob_B) %>% dplyr::select(-count) %>% spread(A, probability)
tab$B[is.na(tab$B)] <- "Total"
tab <- tab %>% rename(Event = 'B')
names(tab)[4] <- "Total"
tab[,4][is.na(tab[,4])] <- 1
kable(tab, format='html') %>%
kable_styling("striped", full_width = FALSE)
| Event | X < x | X > x | Total |
|---|---|---|---|
| Y < y | 0.1266 | 0.1234 | 0.25 |
| Y > y | 0.3734 | 0.3766 | 0.75 |
| Total | 0.5000 | 0.5000 | 1.00 |
From the table above,
P(X>x and Y>y) = 0.3766
P(X>x)P(Y>y) = (0.5000)*(0.75) = 0.375
The difference is only about 0.002, so it is reasonable to conclude that P(X>x and Y>y) ≈ P(X>x)P(Y>y); that is, the two events appear to be independent.
From the results below, the p-values for both tests are much higher than the 5% significance level, so we cannot reject the null hypothesis of independence. Thus, independence holds for the two variables X and Y.
Fisher’s Exact Test is typically used when the sample is small, while the Chi-Square Test relies on a large-sample approximation and is appropriate when the sample is large. With 10,000 observations, the Chi-Square Test is the more appropriate choice here.
Chi-square Test:
tableXY <- table(df$X > x, df$Y>y)
chisq.test(tableXY)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tableXY
## X-squared = 0.51253, df = 1, p-value = 0.474
Fisher’s Exact Test:
fisher.test(tableXY)
##
## Fisher's Exact Test for Count Data
##
## data: tableXY
## p-value = 0.474
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9441663 1.1339759
## sample estimates:
## odds ratio
## 1.034733
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following.
Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatter plot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
house_price_train <- read.csv("https://raw.githubusercontent.com/SieSiongWong/DATA-605/master/train.csv", header=T, stringsAsFactors = F)
house_price_test <- read.csv("https://raw.githubusercontent.com/SieSiongWong/DATA-605/master/test.csv", header=T, stringsAsFactors = F)
Descriptive Statistics:
There are 1,460 observations of each of the 81 variables.
str(house_price_train)
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr NA "TA" "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr NA NA NA NA ...
## $ MiscFeature : chr NA NA NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
If I were buying a house to live in myself, I would judge the sale price based on key factors such as the overall condition, overall quality, and lot area. For this exercise, I’ll use the following dependent and independent variables.
Independent Variables: Lot Area, Overall Condition, and Overall Quality
Dependent Variable: Sale Price
Summary statistics of the independent and dependent variables.
df <- house_price_train %>% dplyr::select(SalePrice, OverallCond, OverallQual, LotArea)
describe(df)
## vars n mean sd median trimmed mad min
## SalePrice 1 1460 180921.20 79442.50 163000.0 170783.29 56338.80 34900
## OverallCond 2 1460 5.58 1.11 5.0 5.48 0.00 1
## OverallQual 3 1460 6.10 1.38 6.0 6.08 1.48 1
## LotArea 4 1460 10516.83 9981.26 9478.5 9563.28 2962.23 1300
## max range skew kurtosis se
## SalePrice 755000 720100 1.88 6.50 2079.11
## OverallCond 9 8 0.69 1.09 0.03
## OverallQual 10 9 0.22 0.09 0.04
## LotArea 215245 213945 12.18 202.26 261.22
Histogram:
par(mfrow=c(2,2))
hist(df$SalePrice, main="Sales Price", ylab="Frequency", col="hotpink", breaks=10)
hist(df$OverallCond, main="Overall Condition", ylab="Frequency", col="hotpink", breaks=10)
hist(df$OverallQual, main="Overall Quality", ylab="Frequency", col="hotpink", breaks=10)
hist(df$LotArea, main="Lot Area", ylab="Frequency", col="hotpink", breaks=10)
Scatter Plot:
ggplot(df, aes(x = LotArea, y = SalePrice)) + geom_point() + ggtitle("Sales Price vs Lot Area") + ylab("Sales Price") +xlab("Lot Area")
ggplot(df, aes(x = OverallQual, y = SalePrice)) + geom_point(aes(color = factor(OverallQual))) + ggtitle("Sales Price vs Overall Quality") + labs(color='Rates') + ylab("Sales Price") + xlab("Overall Quality")
ggplot(df, aes(x = OverallCond, y = SalePrice)) + geom_point(aes(color = factor(OverallCond))) +
ggtitle("Sales Price vs Overall Condition") + labs(color='Rates') + ylab("Sales Price") + xlab("Overall Condition")
Correlation Matrix:
correlation_matrix = df %>% dplyr::select(SalePrice, OverallQual, OverallCond, LotArea) %>% cor() %>% as.matrix()
correlation_matrix
## SalePrice OverallQual OverallCond LotArea
## SalePrice 1.00000000 0.79098160 -0.07785589 0.26384335
## OverallQual 0.79098160 1.00000000 -0.09193234 0.10580574
## OverallCond -0.07785589 -0.09193234 1.00000000 -0.00563627
## LotArea 0.26384335 0.10580574 -0.00563627 1.00000000
Hypothesis Testing:
cor.test(df$LotArea, df$SalePrice, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: df$LotArea and df$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
cor.test(df$OverallQual, df$SalePrice, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: df$OverallQual and df$SalePrice
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.7780752 0.8032204
## sample estimates:
## cor
## 0.7909816
cor.test(df$OverallCond, df$SalePrice, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: df$OverallCond and df$SalePrice
## t = -2.9819, df = 1458, p-value = 0.002912
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.1111272 -0.0444103
## sample estimates:
## cor
## -0.07785589
From the summary statistics and histograms, the Lot Area variable is heavily right-skewed, while the others are only slightly right-skewed. The scatter plots suggest that a straight line could reasonably be drawn through the points, indicating an approximately linear relationship with the sale price.
From the hypothesis tests above, the p-value for each pairwise correlation is far below the 0.2 significance level (the level corresponding to the 80% confidence intervals), which indicates that each correlation is statistically significant. We can therefore reject the null hypothesis of zero correlation and conclude that changes in each predictor’s value are associated with changes in the response variable.
Of the three predictors, the Overall Quality variable has the strongest relationship with the sale price, with a correlation of 0.791. This makes sense: the overall material and finish of the house is arguably the single most important factor in determining a fair price, and an experienced real estate agent would point to it to explain why one house is priced higher than another that otherwise looks the same.
The family-wise error rate is the probability of coming to at least one false conclusion in a series of hypothesis tests; in other words, it is the probability of making at least one Type I error, the incorrect rejection of a true null hypothesis. I am not overly worried about family-wise error here because we are running only three tests, the p-values are far below the significance level, and the directions of the relationships match common sense; for example, a bigger lot area should command a higher sale price when comparing apples to apples. It is really a matter of how strong each relationship is.
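If we did want to guard against family-wise error across the three correlation tests, a Bonferroni adjustment is straightforward. The sketch below plugs in the p-values reported by the cor.test calls above (the first two are the upper bounds of < 2.2e-16):

# Bonferroni adjustment of the three pairwise p-values (sketch using the values reported above)
p_vals <- c(2.2e-16, 2.2e-16, 0.002912)
p.adjust(p_vals, method = "bonferroni")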
Precision Matrix:
precision_matrix = solve(correlation_matrix)
precision_matrix
## SalePrice OverallQual OverallCond LotArea
## SalePrice 2.92833783 -2.2582064 0.017378675 -0.533593316
## OverallQual -2.25820645 2.7613570 0.079757301 0.304094866
## OverallCond 0.01737868 0.0797573 1.008643943 -0.007339038
## LotArea -0.53359332 0.3040949 -0.007339038 1.108568702
Multiply the Correlation Matrix by the Precision Matrix:
cmpm = round(correlation_matrix %*% precision_matrix)
cmpm
## SalePrice OverallQual OverallCond LotArea
## SalePrice 1 0 0 0
## OverallQual 0 1 0 0
## OverallCond 0 0 1 0
## LotArea 0 0 0 1
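The assignment also asks for the product in the other order. Since the precision matrix is the inverse of the correlation matrix, multiplying them in either order should return (approximately) the identity matrix:

# Precision matrix times correlation matrix -- should also round to the identity
round(precision_matrix %*% correlation_matrix)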
LU Decomposition:
decomp_mtx = lu.decomposition(correlation_matrix)
decomp_mtx
## $L
## [,1] [,2] [,3] [,4]
## [1,] 1.00000000 0.00000000 0.000000000 0
## [2,] 0.79098160 1.00000000 0.000000000 0
## [3,] -0.07785589 -0.08107364 1.000000000 0
## [4,] 0.26384335 -0.27484978 0.006620283 1
##
## $U
## [,1] [,2] [,3] [,4]
## [1,] 1 7.909816e-01 -0.07785589 0.263843354
## [2,] 0 3.743481e-01 -0.03034976 -0.102889497
## [3,] 0 -3.469447e-18 0.99147789 0.006563864
## [4,] 0 -1.385482e-17 0.00000000 0.902064074
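As a quick check of the factorization, the product of the L and U factors should reproduce the correlation matrix (check.attributes = FALSE ignores the missing dimnames on the factors):

# Verify the LU decomposition: L %*% U should recover the correlation matrix
all.equal(decomp_mtx$L %*% decomp_mtx$U, correlation_matrix, check.attributes = FALSE)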
From the above analysis, we can see that the Lot Area variable is right-skewed. Its minimum value is already well above zero, so no shift is needed.
Lot area minimum value:
min(df$LotArea)
## [1] 1300
Exponential probability density function:
epdf <- fitdistr(df$LotArea, densfun="exponential")
epdf
## rate
## 9.508570e-05
## (2.488507e-06)
Optimal Value of \(\lambda\):
The optimal value of \(\lambda\) is the fitted rate itself, approximately \(9.51\times10^{-5}\). Its reciprocal \(1/\lambda\) is the mean of the fitted exponential distribution, which equals the sample mean of Lot Area:
lambda <- epdf$estimate  # optimal lambda (the fitted rate)
1/lambda                 # mean of the fitted exponential distribution
## rate
## 10516.83
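As a sanity check, the maximum-likelihood estimate of the exponential rate is the reciprocal of the sample mean, so the following should match the fitted rate above:

# MLE of the exponential rate equals 1 / sample mean
1/mean(df$LotArea)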
Take 1000 samples from the fitted exponential distribution using this value of \(\lambda\):
epdf_samples <- rexp(1000, lambda)
Histogram of the 1000 samples versus the original:
par(mfrow = c(1, 2))
hist(epdf_samples, breaks = 50, col="hotpink", main = "Exponential - Lot Area")
hist(df$LotArea, breaks = 50, col="royalblue", main = "Original - Lot Area")
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF):
qexp(c(0.05, 0.95), rate = lambda)
## [1] 539.4428 31505.6013
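These quantiles agree with the closed-form inverse of the exponential CDF:

\[F(q) = 1 - e^{-\lambda q} \quad\Rightarrow\quad q_p = -\frac{\ln(1-p)}{\lambda}, \qquad q_{0.05} \approx 539.4, \quad q_{0.95} \approx 31505.6\]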
Generate a 95% confidence interval from the empirical data, assuming normality:
qnorm(c(0.025, 0.975), mean=mean(df$LotArea), sd=sd(df$LotArea))
## [1] -9046.092 30079.748
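Note that the interval above spans mean ± 1.96 standard deviations, i.e., it describes individual lot areas under a normality assumption. A confidence interval for the mean itself would use the standard error instead; a sketch of that alternative:

# Alternative: 95% confidence interval for the mean of LotArea (uses the standard error)
t.test(df$LotArea, conf.level = 0.95)$conf.int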
Empirical 5th percentile and 95th percentile of the data:
quantile(df$LotArea, c(0.05, 0.95))
## 5% 95%
## 3311.70 17401.15
From the histograms, we can see that the values drawn from the fitted exponential distribution are much more spread out than the original lot area data.
The empirical 5th and 95th percentiles (3311.70 and 17401.15) are quite different from the interval calculated assuming normality (-9046.09 to 30079.75), because the lot area variable is far from normally distributed. The range of the empirical percentiles does, however, fall within the normal-based interval.
mlm_1 <- lm(SalePrice ~ OverallQual + OverallCond + LotArea, data = df)
summary(mlm_1)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + OverallCond + LotArea,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -271356 -26923 -1468 20176 385490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.022e+05 8.633e+03 -11.832 <2e-16 ***
## OverallQual 4.430e+04 8.884e+02 49.860 <2e-16 ***
## OverallCond -4.237e+02 1.098e+03 -0.386 0.7
## LotArea 1.450e+00 1.226e-01 11.831 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46470 on 1456 degrees of freedom
## Multiple R-squared: 0.6585, Adjusted R-squared: 0.6578
## F-statistic: 935.9 on 3 and 1456 DF, p-value: < 2.2e-16
From the summary above, we can see that the p-value of the Overall Condition variable is greater than the 0.05 significance level, indicating that it is not statistically significant. We can therefore remove this variable from the model and rebuild it.
mlm_2 <- lm(SalePrice ~ OverallQual + LotArea, data = df)
summary(mlm_2)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + LotArea, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -271225 -26819 -1459 20172 385190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.047e+05 5.547e+03 -18.88 <2e-16 ***
## OverallQual 4.433e+04 8.844e+02 50.12 <2e-16 ***
## LotArea 1.450e+00 1.225e-01 11.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46460 on 1457 degrees of freedom
## Multiple R-squared: 0.6585, Adjusted R-squared: 0.658
## F-statistic: 1405 on 2 and 1457 DF, p-value: < 2.2e-16
Removing the Overall Condition variable makes essentially no difference to the fit, and the R-squared remains acceptable. The reported R-squared of 0.658 means the model explains 65.8 percent of the variation in sale price. The model’s p-value is far below the 0.05 significance level, which indicates that the sample data provide enough evidence to reject the null hypothesis: the model fits the data better than the intercept-only model. This suggests a strong association between the two predictors (OverallQual and LotArea) and the SalePrice variable.
For a good model, we would typically like the standard error of each coefficient to be at least five to ten times smaller than the coefficient itself. From the model summary, the standard errors of all terms, including the intercept, meet this criterion, which means there is relatively little variability in the slope and intercept estimates.
The fitted linear model is \[\text{SalePrice} = 4.433\times10^{4}\cdot\text{OverallQual} + 1.450\cdot\text{LotArea} - 1.047\times10^{5}\]
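As a quick illustration of the fitted equation, we can predict the price of a hypothetical house (the OverallQual and LotArea values below are assumed for illustration, not taken from the data):

# Predicted sale price for a hypothetical house with OverallQual = 7 and LotArea = 10000
predict(mlm_2, newdata = data.frame(OverallQual = 7, LotArea = 10000))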
Next, we can perform residual analysis to check whether the basic assumptions of linear regression are met:
par(mfrow=c(2,2))
# Residuals plot
plot(mlm_2$fitted.values, mlm_2$residuals,
xlab='Fitted Values', ylab='Residuals')
abline(h = 0, lty = 3, col="blue")
abline(h = 2e+5, lty = 3, col="red")
abline(h = -2e+5, lty = 3, col="red")
# Histogram plot
hist(mlm_2$residuals, xlab="Residuals")
qqnorm(mlm_2$residuals)
qqline(mlm_2$residuals)
The residuals plot suggests heteroscedasticity: the residuals are not uniformly scattered above and below zero, and their vertical spread increases as the fitted values increase, producing a cone-shaped pattern. The histogram of residuals does not appear noticeably skewed, but the Q-Q plot shows heavy tails.
We can address the heteroscedasticity with methods such as the Box-Cox transformation.
# Transform the sales price
new_saleprice = BoxCoxTrans(df$SalePrice)
# Combine the transformed sales price to the dataframe
df_new <- cbind(df, SalePrice_new = predict(new_saleprice, df$SalePrice))
# Create a new multiple regression model
mlm_3 <- lm(SalePrice_new ~ OverallQual + LotArea, data = df_new)
# Check the new model summary
summary(mlm_3)
##
## Call:
## lm(formula = SalePrice_new ~ OverallQual + LotArea, data = df_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.31026 -0.11439 0.01546 0.13021 0.63025
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.054e+01 2.626e-02 401.57 <2e-16 ***
## OverallQual 2.307e-01 4.187e-03 55.12 <2e-16 ***
## LotArea 6.915e-06 5.801e-07 11.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2199 on 1457 degrees of freedom
## Multiple R-squared: 0.6973, Adjusted R-squared: 0.6969
## F-statistic: 1678 on 2 and 1457 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
# Residuals plot
plot(mlm_3$fitted.values, mlm_3$residuals,
xlab='Fitted Values', ylab='Residuals')
abline(h = 0, lty = 3, col="blue")
abline(h = 0.5, lty = 3, col="red")
abline(h = -0.5, lty = 3, col="red")
# Histogram plot
hist(mlm_3$residuals, xlab="Residuals")
qqnorm(mlm_3$residuals)
qqline(mlm_3$residuals)
After the transformation, the residual plots show that the heteroscedasticity has largely been corrected.
The new model is \[\text{SalePrice} = \exp\left(1.054\times10^{1} + 2.307\times10^{-1}\cdot\text{OverallQual} + 6.915\times10^{-6}\cdot\text{LotArea}\right)\]
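The exp() back-transform assumes that the Box-Cox procedure selected \(\lambda \approx 0\) (i.e., a log transform). Printing the transformation object reports the estimated \(\lambda\), so this can be confirmed:

# Print the Box-Cox transformation object to see the estimated lambda
new_saleprice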
Now, let’s test the new model with the test data.
test_df <- house_price_test %>% dplyr::select(LotArea, OverallQual, Id)
test_df <- na.omit(test_df)
predict_price <- predict(mlm_3, newdata = test_df, type="response")
predict_price_df <- data.frame(test_df$Id, exp(predict_price))
colnames(predict_price_df) <- c('Id', 'SalePrice')
Write the predicted sale prices to a CSV file for submission to Kaggle.com.
write.csv(predict_price_df, file = "predicted_prices_siesiongwong.csv", row.names = FALSE)
My Kaggle user name is Siong, and the resulting score on Kaggle.com from this model is 0.21829.