library(tidyverse)
library(knitr)
library(kableExtra)
library(stats)
library(MASS)
library(psych)
library(ggpubr)
library(matrixcalc)
library(caret)
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of \(\mu = \sigma = (N+1)/2\).
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
set.seed(11223)
N <- 10
n <- 10000
sigma <- (N + 1)/2
mu <- sigma
df <- data.frame(X = runif(n, min = 1, max = N), Y = rnorm(n, mean = mu, sd = sigma))
head(df, 10)
## X Y
## 1 3.802151 8.19249690
## 2 6.218204 1.20405093
## 3 6.405295 5.52387780
## 4 7.084643 7.01930549
## 5 6.506409 -4.56864516
## 6 9.597822 9.83654871
## 7 5.410977 10.89626221
## 8 7.078516 0.05238785
## 9 9.758482 3.74136127
## 10 5.094755 12.80918355
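As a quick sanity check on the simulation (not part of the required output), the sample mean and standard deviation of Y should both be close to \((N+1)/2 = 5.5\):

# Quick check: mean and sd of Y should both be near (N + 1)/2 = 5.5
c(mean = mean(df$Y), sd = sd(df$Y))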
Calculate the median of the X variable and the 1st quartile of the Y variable.
x = quantile(df$X, 0.5)
x
## 50%
## 5.532476
y = quantile(df$Y, 0.25)
y
## 25%
## 1.783892
Calculate as a minimum the below probabilities a through c.
a.) P(X>x | X>y)
The conditional probability of X>x given the event X>y is approximately 0.549. In other words, once we know an observation of X exceeds the 1st quartile of Y, there is roughly a 55% chance that it also exceeds the median of X.
prob_Xx_Xy = df %>% filter(X > x, X > y) %>% nrow()/n
prob_Xy = df %>% filter(X > y) %>% nrow()/n
prob = prob_Xx_Xy/prob_Xy
prob
## [1] 0.5493902
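Since x (the median of X, about 5.53) is larger than y (the 1st quartile of Y, about 1.78), the event X>x is contained in X>y, so the conditional probability reduces to a ratio of marginal probabilities:

\[P(X>x \mid X>y) = \frac{P(X>x,\ X>y)}{P(X>y)} = \frac{P(X>x)}{P(X>y)} \approx \frac{0.5}{0.91} \approx 0.55\]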
b.) P(X>x, Y>y)
The joint probability of X>x and Y>y is approximately 0.377.
prob_Xx_Yy <- df %>% filter(X > x, Y > y) %>% nrow()/n
prob_Xx_Yy
## [1] 0.3766
c.) P(X<x | X>y)
The conditional probability of X<x given the event X>y is approximately 0.451.
prob_xX_Xy = df %>% filter(X < x, X > y) %>% nrow()/n
prob_Xy = df %>% filter(X > y) %>% nrow()/n
prob = prob_xX_Xy/prob_Xy
prob
## [1] 0.4506098
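Because the events X<x and X>x partition the outcomes (ignoring ties at X = x), this result is simply the complement of part (a):

\[P(X<x \mid X>y) = 1 - P(X>x \mid X>y) \approx 1 - 0.549 = 0.451\]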
Investigate whether P(X>x and Y>y) = P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
Joint Probability:
joint_prob_AB = df %>% mutate(A = ifelse(X > x, "X > x", "X < x"), B = ifelse(Y > y, " Y > y", " Y < y")) %>% group_by(A, B) %>% summarise(count = n()) %>% mutate(probability = count/n)
joint_prob_AB
## # A tibble: 4 x 4
## # Groups: A [2]
## A B count probability
## <chr> <chr> <int> <dbl>
## 1 X < x " Y < y" 1266 0.127
## 2 X < x " Y > y" 3734 0.373
## 3 X > x " Y < y" 1234 0.123
## 4 X > x " Y > y" 3766 0.377
Marginal Probability:
marginal_prob_A = joint_prob_AB %>% ungroup() %>% group_by(A) %>% summarize(count = sum(count), probability = sum(probability))
marginal_prob_B = joint_prob_AB %>% ungroup() %>% group_by(B) %>% summarize(count = sum(count), probability = sum(probability))
marginal_prob_A
## # A tibble: 2 x 3
## A count probability
## <chr> <int> <dbl>
## 1 X < x 5000 0.5
## 2 X > x 5000 0.5
marginal_prob_B
## # A tibble: 2 x 3
## B count probability
## <chr> <int> <dbl>
## 1 " Y < y" 2500 0.25
## 2 " Y > y" 7500 0.75
Table:
tab <- bind_rows(joint_prob_AB, marginal_prob_A, marginal_prob_B) %>% dplyr::select(-count) %>% spread(A, probability)
tab$B[is.na(tab$B)] <- "Total"
tab <- tab %>% rename(Event = 'B')
names(tab)[4] <- "Total"
tab[,4][is.na(tab[,4])] <- 1
kable(tab, format='html') %>%
kable_styling("striped", full_width = FALSE)
| Event | X < x | X > x | Total |
|---|---|---|---|
| Y < y | 0.1266 | 0.1234 | 0.25 |
| Y > y | 0.3734 | 0.3766 | 0.75 |
| Total | 0.5000 | 0.5000 | 1.00 |
From the table above,
P(X>x and Y>y) = 0.3766
P(X>x)P(Y>y) = (0.5000)*(0.75) = 0.375
The difference is only about 0.002, so it is reasonable to conclude that P(X>x and Y>y) ≈ P(X>x)P(Y>y); that is, the two events appear to be independent.
From the results below, the p-values for both tests are much higher than the 5% significance level, so we cannot reject the null hypothesis of independence. Thus, independence holds for the two variables X and Y.
Fisher’s Exact Test is typically used when the sample is small, while the Chi-Square Test relies on a large-sample approximation and is appropriate when the sample is large. With 10,000 observations, the Chi-Square Test is the more appropriate choice here.
Chi-square Test:
tableXY <- table(df$X > x, df$Y>y)
chisq.test(tableXY)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tableXY
## X-squared = 0.51253, df = 1, p-value = 0.474
Fisher’s Exact Test:
fisher.test(tableXY)
##
## Fisher's Exact Test for Count Data
##
## data: tableXY
## p-value = 0.474
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9441663 1.1339759
## sample estimates:
## odds ratio
## 1.034733
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following.
Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatter plot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
house_price_train <- read.csv("https://raw.githubusercontent.com/SieSiongWong/DATA-605/master/train.csv", header=T, stringsAsFactors = F)
house_price_test <- read.csv("https://raw.githubusercontent.com/SieSiongWong/DATA-605/master/test.csv", header=T, stringsAsFactors = F)
Descriptive Statistics:
There are 1,460 observations of each of the 81 variables.
str(house_price_train)
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr NA "TA" "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr NA NA NA NA ...
## $ MiscFeature : chr NA NA NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
If I were buying a house to live in myself, I would judge the sale price based on key factors such as the overall condition, overall quality, and lot area. For this exercise, I’ll use the following dependent and independent variables.
Independent Variables: Lot Area, Overall Condition, and Overall Quality
Dependent Variable: Sale Price
Summary statistics of the independent and dependent variables.
df <- house_price_train %>% dplyr::select(SalePrice, OverallCond, OverallQual, LotArea)
describe(df)
## vars n mean sd median trimmed mad min
## SalePrice 1 1460 180921.20 79442.50 163000.0 170783.29 56338.80 34900
## OverallCond 2 1460 5.58 1.11 5.0 5.48 0.00 1
## OverallQual 3 1460 6.10 1.38 6.0 6.08 1.48 1
## LotArea 4 1460 10516.83 9981.26 9478.5 9563.28 2962.23 1300
## max range skew kurtosis se
## SalePrice 755000 720100 1.88 6.50 2079.11
## OverallCond 9 8 0.69 1.09 0.03
## OverallQual 10 9 0.22 0.09 0.04
## LotArea 215245 213945 12.18 202.26 261.22
Histogram:
par(mfrow=c(2,2))
hist(df$SalePrice, main="Sales Price", ylab="Frequency", col="hotpink", breaks=10)
hist(df$OverallCond, main="Overall Condition", ylab="Frequency", col="hotpink", breaks=10)
hist(df$OverallQual, main="Overall Quality", ylab="Frequency", col="hotpink", breaks=10)
hist(df$LotArea, main="Lot Area", ylab="Frequency", col="hotpink", breaks=10)
Scatter Plot:
ggplot(df, aes(x = LotArea, y = SalePrice)) + geom_point() + ggtitle("Sales Price vs Lot Area") + ylab("Sales Price") +xlab("Lot Area")
ggplot(df, aes(x = OverallQual, y = SalePrice)) + geom_point(aes(color = factor(OverallQual))) + ggtitle("Sales Price vs Overall Quality") + labs(color='Rates') + ylab("Sales Price") + xlab("Overall Quality")
ggplot(df, aes(x = OverallCond, y = SalePrice)) + geom_point(aes(color = factor(OverallCond))) +
ggtitle("Sales Price vs Overall Condition") + labs(color='Rates') + ylab("Sales Price") + xlab("Overall Condition")
Correlation Matrix:
correlation_matrix = df %>% dplyr::select(SalePrice, OverallQual, OverallCond, LotArea) %>% cor() %>% as.matrix()
correlation_matrix
## SalePrice OverallQual OverallCond LotArea
## SalePrice 1.00000000 0.79098160 -0.07785589 0.26384335
## OverallQual 0.79098160 1.00000000 -0.09193234 0.10580574
## OverallCond -0.07785589 -0.09193234 1.00000000 -0.00563627
## LotArea 0.26384335 0.10580574 -0.00563627 1.00000000
Hypothesis Testing:
cor.test(df$LotArea, df$SalePrice, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: df$LotArea and df$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
cor.test(df$OverallQual, df$SalePrice, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: df$OverallQual and df$SalePrice
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.7780752 0.8032204
## sample estimates:
## cor
## 0.7909816
cor.test(df$OverallCond, df$SalePrice, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: df$OverallCond and df$SalePrice
## t = -2.9819, df = 1458, p-value = 0.002912
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.1111272 -0.0444103
## sample estimates:
## cor
## -0.07785589
From the summary statistics and histograms, the Lot Area variable is heavily right-skewed, while the others are only slightly right-skewed. The scatter plots suggest that a straight line could reasonably be drawn through the points, indicating an approximately linear relationship with the sale price.
From the hypothesis tests above, the p-value for each pairwise correlation is far below the 0.2 significance level (the level corresponding to the 80% confidence intervals), which indicates that each correlation is statistically significant. We can therefore reject the null hypothesis of zero correlation and conclude that changes in each predictor’s value are associated with changes in the response variable.
Of the three predictors, the Overall Quality variable has the strongest relationship with the sale price, with a correlation of 0.791. This makes sense: the overall material and finish of the house is arguably the single most important factor in determining a fair price, and an experienced real estate agent would point to it to explain why one house is priced higher than another that otherwise looks the same.
The family-wise error rate is the probability of coming to at least one false conclusion in a series of hypothesis tests; in other words, it is the probability of making at least one Type I error, the incorrect rejection of a true null hypothesis. I am not overly worried about family-wise error here because we are running only three tests, the p-values are far below the significance level, and the directions of the relationships match common sense; for example, a bigger lot area should command a higher sale price when comparing apples to apples. It is really a matter of how strong each relationship is.
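If we did want to guard against family-wise error across the three correlation tests, a Bonferroni adjustment is straightforward. The sketch below plugs in the p-values reported by the cor.test calls above (the first two are the upper bounds of < 2.2e-16):

# Bonferroni adjustment of the three pairwise p-values (sketch using the values reported above)
p_vals <- c(2.2e-16, 2.2e-16, 0.002912)
p.adjust(p_vals, method = "bonferroni")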
Precision Matrix:
precision_matrix = solve(correlation_matrix)
precision_matrix
## SalePrice OverallQual OverallCond LotArea
## SalePrice 2.92833783 -2.2582064 0.017378675 -0.533593316
## OverallQual -2.25820645 2.7613570 0.079757301 0.304094866
## OverallCond 0.01737868 0.0797573 1.008643943 -0.007339038
## LotArea -0.53359332 0.3040949 -0.007339038 1.108568702
Multiply the Correlation Matrix by the Precision Matrix:
cmpm = round(correlation_matrix %*% precision_matrix)
cmpm
## SalePrice OverallQual OverallCond LotArea
## SalePrice 1 0 0 0
## OverallQual 0 1 0 0
## OverallCond 0 0 1 0
## LotArea 0 0 0 1
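The assignment also asks for the product in the other order. Since the precision matrix is the inverse of the correlation matrix, multiplying them in either order should return (approximately) the identity matrix:

# Precision matrix times correlation matrix -- should also round to the identity
round(precision_matrix %*% correlation_matrix)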
LU Decomposition:
decomp_mtx = lu.decomposition(correlation_matrix)
decomp_mtx
## $L
## [,1] [,2] [,3] [,4]
## [1,] 1.00000000 0.00000000 0.000000000 0
## [2,] 0.79098160 1.00000000 0.000000000 0
## [3,] -0.07785589 -0.08107364 1.000000000 0
## [4,] 0.26384335 -0.27484978 0.006620283 1
##
## $U
## [,1] [,2] [,3] [,4]
## [1,] 1 7.909816e-01 -0.07785589 0.263843354
## [2,] 0 3.743481e-01 -0.03034976 -0.102889497
## [3,] 0 -3.469447e-18 0.99147789 0.006563864
## [4,] 0 -1.385482e-17 0.00000000 0.902064074
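As a quick check of the factorization, the product of the L and U factors should reproduce the correlation matrix (check.attributes = FALSE ignores the missing dimnames on the factors):

# Verify the LU decomposition: L %*% U should recover the correlation matrix
all.equal(decomp_mtx$L %*% decomp_mtx$U, correlation_matrix, check.attributes = FALSE)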
From the above analysis, we can see that the Lot Area variable is right-skewed. Its minimum value is already well above zero, so no shift is needed.
Lot area minimum value:
min(df$LotArea)
## [1] 1300
Exponential probability density function:
epdf <- fitdistr(df$LotArea, densfun="exponential")
epdf
## rate
## 9.508570e-05
## (2.488507e-06)
Optimal Value of \(\lambda\):
The optimal value of \(\lambda\) is the fitted rate itself, approximately \(9.51\times10^{-5}\). Its reciprocal \(1/\lambda\) is the mean of the fitted exponential distribution, which equals the sample mean of Lot Area:
lambda <- epdf$estimate  # optimal lambda (the fitted rate)
1/lambda                 # mean of the fitted exponential distribution
## rate
## 10516.83
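As a sanity check, the maximum-likelihood estimate of the exponential rate is the reciprocal of the sample mean, so the following should match the fitted rate above:

# MLE of the exponential rate equals 1 / sample mean
1/mean(df$LotArea)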
Take 1000 samples from the fitted exponential distribution using this value of \(\lambda\):
epdf_samples <- rexp(1000, lambda)
Histogram of the 1000 samples versus the original:
par(mfrow = c(1, 2))
hist(epdf_samples, breaks = 50, col="hotpink", main = "Exponential - Lot Area")
hist(df$LotArea, breaks = 50, col="royalblue", main = "Original - Lot Area")
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF):
qexp(c(0.05, 0.95), rate = lambda)
## [1] 539.4428 31505.6013
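These quantiles agree with the closed-form inverse of the exponential CDF:

\[F(q) = 1 - e^{-\lambda q} \quad\Rightarrow\quad q_p = -\frac{\ln(1-p)}{\lambda}, \qquad q_{0.05} \approx 539.4, \quad q_{0.95} \approx 31505.6\]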
Generate a 95% confidence interval from the empirical data, assuming normality:
qnorm(c(0.025, 0.975), mean=mean(df$LotArea), sd=sd(df$LotArea))
## [1] -9046.092 30079.748
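Note that the interval above spans mean ± 1.96 standard deviations, i.e., it describes individual lot areas under a normality assumption. A confidence interval for the mean itself would use the standard error instead; a sketch of that alternative:

# Alternative: 95% confidence interval for the mean of LotArea (uses the standard error)
t.test(df$LotArea, conf.level = 0.95)$conf.int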
Empirical 5th percentile and 95th percentile of the data:
quantile(df$LotArea, c(0.05, 0.95))
## 5% 95%
## 3311.70 17401.15
From the histograms, we can see that the values drawn from the fitted exponential distribution are much more spread out than the original lot area data.
The empirical 5th and 95th percentiles (3311.70 and 17401.15) are quite different from the interval calculated assuming normality (-9046.09 to 30079.75), because the lot area variable is far from normally distributed. The range of the empirical percentiles does, however, fall within the normal-based interval.
mlm_1 <- lm(SalePrice ~ OverallQual + OverallCond + LotArea, data = df)
summary(mlm_1)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + OverallCond + LotArea,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -271356 -26923 -1468 20176 385490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.022e+05 8.633e+03 -11.832 <2e-16 ***
## OverallQual 4.430e+04 8.884e+02 49.860 <2e-16 ***
## OverallCond -4.237e+02 1.098e+03 -0.386 0.7
## LotArea 1.450e+00 1.226e-01 11.831 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46470 on 1456 degrees of freedom
## Multiple R-squared: 0.6585, Adjusted R-squared: 0.6578
## F-statistic: 935.9 on 3 and 1456 DF, p-value: < 2.2e-16
From the summary above, we can see that the p-value of the Overall Condition variable is greater than the 0.05 significance level, indicating that it is not statistically significant. We can therefore remove this variable from the model and rebuild it.
mlm_2 <- lm(SalePrice ~ OverallQual + LotArea, data = df)
summary(mlm_2)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + LotArea, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -271225 -26819 -1459 20172 385190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.047e+05 5.547e+03 -18.88 <2e-16 ***
## OverallQual 4.433e+04 8.844e+02 50.12 <2e-16 ***
## LotArea 1.450e+00 1.225e-01 11.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46460 on 1457 degrees of freedom
## Multiple R-squared: 0.6585, Adjusted R-squared: 0.658
## F-statistic: 1405 on 2 and 1457 DF, p-value: < 2.2e-16
Removing the Overall Condition variable makes essentially no difference to the fit, and the R-squared remains acceptable. The reported R-squared of 0.658 means the model explains 65.8 percent of the variation in sale price. The model’s p-value is far below the 0.05 significance level, which indicates that the sample data provide enough evidence to reject the null hypothesis: the model fits the data better than the intercept-only model. This suggests a strong association between the two predictors (OverallQual and LotArea) and the SalePrice variable.
For a good model, we would typically like the standard error of each coefficient to be at least five to ten times smaller than the coefficient itself. From the model summary, the standard errors of all terms, including the intercept, meet this criterion, which means there is relatively little variability in the slope and intercept estimates.
The fitted linear model is \[\text{SalePrice} = 4.433\times10^{4}\cdot\text{OverallQual} + 1.450\cdot\text{LotArea} - 1.047\times10^{5}\]
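As a quick illustration of the fitted equation, we can predict the price of a hypothetical house (the OverallQual and LotArea values below are assumed for illustration, not taken from the data):

# Predicted sale price for a hypothetical house with OverallQual = 7 and LotArea = 10000
predict(mlm_2, newdata = data.frame(OverallQual = 7, LotArea = 10000))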
Next, we can perform residual analysis to check whether the basic assumptions of linear regression are met:
par(mfrow=c(2,2))
# Residuals plot
plot(mlm_2$fitted.values, mlm_2$residuals,
xlab='Fitted Values', ylab='Residuals')
abline(h = 0, lty = 3, col="blue")
abline(h = 2e+5, lty = 3, col="red")
abline(h = -2e+5, lty = 3, col="red")
# Histogram plot
hist(mlm_2$residuals, xlab="Residuals")
qqnorm(mlm_2$residuals)
qqline(mlm_2$residuals)
The residuals plot suggests heteroscedasticity: the residuals are not uniformly scattered above and below zero, and their vertical spread increases as the fitted values increase, producing a cone-shaped pattern. The histogram of residuals does not appear noticeably skewed, but the Q-Q plot shows heavy tails.
We can address the heteroscedasticity with methods such as the Box-Cox transformation.
# Transform the sales price
new_saleprice = BoxCoxTrans(df$SalePrice)
# Combine the transformed sales price to the dataframe
df_new <- cbind(df, SalePrice_new = predict(new_saleprice, df$SalePrice))
# Create a new multiple regression model
mlm_3 <- lm(SalePrice_new ~ OverallQual + LotArea, data = df_new)
# Check the new model summary
summary(mlm_3)
##
## Call:
## lm(formula = SalePrice_new ~ OverallQual + LotArea, data = df_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.31026 -0.11439 0.01546 0.13021 0.63025
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.054e+01 2.626e-02 401.57 <2e-16 ***
## OverallQual 2.307e-01 4.187e-03 55.12 <2e-16 ***
## LotArea 6.915e-06 5.801e-07 11.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2199 on 1457 degrees of freedom
## Multiple R-squared: 0.6973, Adjusted R-squared: 0.6969
## F-statistic: 1678 on 2 and 1457 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
# Residuals plot
plot(mlm_3$fitted.values, mlm_3$residuals,
xlab='Fitted Values', ylab='Residuals')
abline(h = 0, lty = 3, col="blue")
abline(h = 0.5, lty = 3, col="red")
abline(h = -0.5, lty = 3, col="red")
# Histogram plot
hist(mlm_3$residuals, xlab="Residuals")
qqnorm(mlm_3$residuals)
qqline(mlm_3$residuals)
After the transformation, the residual plots show that the heteroscedasticity has largely been corrected.
The new model is \[\text{SalePrice} = \exp\left(1.054\times10^{1} + 2.307\times10^{-1}\cdot\text{OverallQual} + 6.915\times10^{-6}\cdot\text{LotArea}\right)\]
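The exp() back-transform assumes that the Box-Cox procedure selected \(\lambda \approx 0\) (i.e., a log transform). Printing the transformation object reports the estimated \(\lambda\), so this can be confirmed:

# Print the Box-Cox transformation object to see the estimated lambda
new_saleprice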
Now, let’s test the new model with the test data.
test_df <- house_price_test %>% dplyr::select(LotArea, OverallQual, Id)
test_df <- na.omit(test_df)
predict_price <- predict(mlm_3, newdata = test_df, type="response")
predict_price_df <- data.frame(test_df$Id, exp(predict_price))
colnames(predict_price_df) <- c('Id', 'SalePrice')
Write the predicted sale prices to a CSV file for submission to Kaggle.com.
write.csv(predict_price_df, file = "predicted_prices_siesiongwong.csv", row.names = FALSE)
My Kaggle user name is Siong, and the resulting score on Kaggle.com from this model is 0.21829.