Pick one of the quantitative independent variables (Xi) from the data set below, and define that variable as X. Also, pick one of the dependent variables (Yi) below, and define that as Y.
Probability. Calculate, at a minimum, the probabilities a through c below. Assume the small letter “x” is estimated as the 3rd quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
Variable Selections & Calculation
\[\text{Let } X = X4,\ Y = Y1\]
x <- as.numeric(quantile(data1$X4, 0.75))
y <- as.numeric(quantile(data1$Y1, 0.25))
paste0("x = ", x)## [1] "x = 12.525"
paste0("y = ", y)## [1] "y = 18.55"
This is the probability that X is greater than its 3rd quartile, given that Y is greater than its 1st quartile.
\[\begin{align} P(X>x | Y>y) &= \frac{P(Y > y \space and \space X > x)}{P(Y>y)} \\ \\ &= \frac{P(Y > 18.55 \space and \space X > 12.525)}{P(Y>18.55)}\\ \\ &= \frac{0.2}{0.75} \\ \\ &= \frac{4}{15} \end{align}\]
Solution in R: P(Y > y and X > x)
pYANDX <- nrow( subset( data1, Y1 > y & X4 > x) ) / nrow(data1)
pYANDX
## [1] 0.2
P(Y > y)
pY <- nrow( subset( data1, data1$Y1 > y ) ) / nrow(data1)
pY
## [1] 0.75
P(X>x | Y>y)
pYANDX/pY
## [1] 0.2666667
This is the probability that an entry in X is greater than its 3rd quartile and the corresponding entry in Y is greater than its 1st quartile.
\[\begin{align} P(X>x , Y>y) &= P(X > x | Y > y)P(Y > y) \\ \\ &= \frac{4}{15}*\frac{3}{4} \\ \\ &= \frac{1}{5} \end{align}\]
Solution in R: P(X > x and Y > y)
nrow( subset( data1, data1$Y1 > y & data1$X4 > x) ) / nrow(data1)
## [1] 0.2
This is the probability that X is less than its 3rd quartile, given that Y is greater than its 1st quartile.
\[\begin{align} P(X<x | Y>y) &= \frac{P(Y > y \space and \space X < x)}{P(Y>y)} \\ \\ &= \frac{P(Y > 18.55 \space and \space X < 12.525)}{P(Y>18.55)}\\ \\ &= \frac{0.55}{0.75} \\ \\ &= \frac{11}{15} \end{align}\]
Solution in R: P(Y > y and X < x)
pYANDX <- nrow( subset( data1, Y1 > y & X4 < x) ) / nrow(data1)
pYANDX
## [1] 0.55
P(Y > y)
pY <- nrow( subset( data1, data1$Y1 > y ) ) / nrow(data1)
pY
## [1] 0.75
P(X < x | Y > y)
pYANDX/pY
## [1] 0.7333333
In addition, make a table of counts as shown below.
#Create Table
q1Table <- data.frame("x/y" = c("<=1st quartile", ">1st quartile", "Total"),
"<=3rd quartile" = c("","", ""),
">3rd quartile" = c("","", ""),
"Total" = c("","", ""), stringsAsFactors = FALSE)
# Calculate first quartiles
xQ1 <- as.numeric(quantile(data1$X4, 0.25))
yQ1 <- as.numeric(quantile(data1$Y1, 0.25))
names(q1Table) <- c("x/y","<=1st quartile", ">1st quartile", "Total")
# Calculate counts
q1Table$`<=1st quartile`[1] <- nrow( subset( data1, X4 <= xQ1 & Y1 <= yQ1 ) )
q1Table$`<=1st quartile`[2] <- nrow( subset( data1, X4 > xQ1 & Y1 <= yQ1 ) )
q1Table$`>1st quartile`[1] <- nrow( subset( data1, X4 <= xQ1 & Y1 > yQ1 ) )
q1Table$`>1st quartile`[2] <- nrow( subset( data1, X4 > xQ1 & Y1 > yQ1 ) )
# Calculate totals
q1Table$`<=1st quartile`[3] <- as.integer(q1Table$`<=1st quartile`[1]) + as.integer(q1Table$`<=1st quartile`[2])
q1Table$`>1st quartile`[3] <- as.integer(q1Table$`>1st quartile`[1]) + as.integer(q1Table$`>1st quartile`[2])
q1Table$Total[1] <- as.integer(q1Table$`<=1st quartile`[1]) + as.integer(q1Table$`>1st quartile`[1])
q1Table$Total[2] <- as.integer(q1Table$`<=1st quartile`[2]) + as.integer(q1Table$`>1st quartile`[2])
q1Table$Total[3] <- as.integer(q1Table$`<=1st quartile`[3]) + as.integer(q1Table$`>1st quartile`[3])
# Print table
kable(q1Table) %>%
kable_styling(full_width = F, bootstrap_options = "bordered") %>%
column_spec(1, bold = T, border_right = T)

| x/y | <=1st quartile | >1st quartile | Total |
|---|---|---|---|
| <=1st quartile | 1 | 4 | 5 |
| >1st quartile | 4 | 11 | 15 |
| Total | 5 | 15 | 20 |
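The same table of counts can be produced more compactly with base R. This is a minimal sketch, reusing the data1, xQ1, and yQ1 objects defined above; addmargins() appends the row and column totals.

# Sketch: cross-check the counts with table() and addmargins()
counts <- table(X = ifelse(data1$X4 <= xQ1, "<=1st quartile", ">1st quartile"),
                Y = ifelse(data1$Y1 <= yQ1, "<=1st quartile", ">1st quartile"))
addmargins(counts)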
Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y. Does P(AB)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
paste0("P(AB) = ", nrow( subset( data1, X4 > xQ1 & Y1 > yQ1 ) )/nrow(data1)) ## [1] "P(AB) = 0.55"
paste0("P(A) = ", nrow( subset( data1, X4 > xQ1) )/nrow(data1))## [1] "P(A) = 0.75"
paste0("P(B) = ", nrow( subset( data1, Y1 > yQ1 ) )/nrow(data1))## [1] "P(B) = 0.75"
paste0("P(A)P(B) = ", (nrow( subset( data1, X4 > xQ1 ) )/nrow(data1))*(nrow( subset( data1, Y1 > yQ1 ) )/nrow(data1)))## [1] "P(A)P(B) = 0.5625"
\[P(AB) = 0.55 \ne P(A)P(B) = 0.5625\]
\(P(AB)\) is not equal to \(P(A)P(B)\); therefore, splitting the data in this manner does not make them independent.
# Create A & B based on criteria and perform test
A <- subset( data1, X4 > xQ1 )
B <- subset( data1, Y1 > yQ1 )
chisq.test(A,B)
##
## Pearson's Chi-squared test
##
## data: A
## X-squared = 110.45, df = 98, p-value = 0.1838
The p-value of the Chi-Square test is not statistically significant; therefore, we cannot reject the null hypothesis that the variables are independent.
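A more direct formulation of the association test runs the chi-square on the 2 x 2 table of the A/B indicators rather than on the subsetted data frame. This is a hedged sketch using the same data1, xQ1, and yQ1 objects; its output is not reproduced here.

# Sketch: chi-square test of association between the A and B indicators
A_ind <- data1$X4 > xQ1
B_ind <- data1$Y1 > yQ1
chisq.test(table(A_ind, B_ind))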
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
# load data
training_data <- read.csv("data/train.csv", header = TRUE, stringsAsFactors = FALSE)

# Summarize data
summary <- describe(training_data )
summary$na <- nrow(training_data) - summary$n
summary$na_pct <- summary$na/nrow(training_data)*100
summary <- dplyr::select(summary, vars, n, mean, sd, median, min, max, na, na_pct )
names(summary) <- c("Variables", "Cases", "Mean", "SD", "Median", "Min", "Max", "NAs", "NA_Pct")
datatable(round(summary,2), options = list(filter = FALSE, pageLength = 81, dom = 'tip'), filter = 'none')

long_training_data <- dplyr::select_if(training_data, is.numeric) %>%
gather(training_data)
names(long_training_data) <- c("key", "value")
ggplot(long_training_data, aes(factor(key), value)) +
geom_boxplot() +
facet_wrap(~key, scale="free", ncol = 4) +
labs(x="", y="")training_data_categorical <- dplyr::select_if(training_data,is.character)
long_categorical_data <- gather(training_data_categorical)
ggplot(long_categorical_data, aes(value)) +
geom_histogram(bins = 10, stat="count") +
facet_wrap(~key, scales = 'free_x', ncol = 4) +
theme( axis.text.x = element_text(angle = 90))

training_data_numeric <- dplyr::select_if(training_data, is.numeric)
long_numeric_data <- gather(training_data_numeric)
ggplot(long_numeric_data, aes(value)) +
geom_histogram(bins = 10, stat="count") +
facet_wrap(~key, scales = 'free', ncol = 4) +
theme( axis.text.x = element_text(angle = 90))

ggpairs(dplyr::select(training_data, SalePrice, LotArea, LotFrontage, BedroomAbvGr, KitchenAbvGr), upper = "blank", diag = "blank")

Correlation Matrix
corrData <- dplyr::select(training_data, SalePrice, LotArea, BedroomAbvGr)
corrMatrix <- round( cor(corrData, method = "pearson", use = "complete.obs"), 4 )
corrMatrix
## SalePrice LotArea BedroomAbvGr
## SalePrice 1.0000 0.2638 0.1682
## LotArea 0.2638 1.0000 0.1197
## BedroomAbvGr 0.1682 0.1197 1.0000
80% Confidence Intervals: SalePrice & LotArea
cor.test(corrData$SalePrice, corrData$LotArea, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: corrData$SalePrice and corrData$LotArea
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
80% Confidence Intervals: SalePrice & BedroomAbvGr
cor.test(corrData$SalePrice, corrData$BedroomAbvGr, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: corrData$SalePrice and corrData$BedroomAbvGr
## t = 6.5159, df = 1458, p-value = 9.927e-11
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.1354160 0.2006421
## sample estimates:
## cor
## 0.1682132
80% Confidence Intervals: BedroomAbvGr & LotArea
cor.test(corrData$BedroomAbvGr, corrData$LotArea, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: corrData$BedroomAbvGr and corrData$LotArea
## t = 4.6033, df = 1458, p-value = 4.52e-06
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.08647564 0.15263840
## sample estimates:
## cor
## 0.1196899
Each of the 80% confidence intervals above excludes zero, so all three correlations are statistically significant at that level. The family-wise error rate is the probability of making one or more Type I errors when performing multiple hypothesis tests. I would be somewhat concerned here: with three simultaneous tests at only 80% confidence, the chance of at least one Type I error across the family is noticeably higher than for a single test. We could use the Bonferroni method to adjust the significance level in this context.
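To illustrate the Bonferroni idea, the three p-values could be adjusted directly. This is a sketch reusing the corrData object from above; p.adjust() multiplies each p-value by the number of tests (capped at 1), which is equivalent to dividing the significance level by three.

# Sketch: Bonferroni adjustment of the three correlation-test p-values
p_vals <- c(cor.test(corrData$SalePrice, corrData$LotArea)$p.value,
            cor.test(corrData$SalePrice, corrData$BedroomAbvGr)$p.value,
            cor.test(corrData$BedroomAbvGr, corrData$LotArea)$p.value)
p.adjust(p_vals, method = "bonferroni")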
Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
invCorrMatrix <- solve(corrMatrix)
invCorrMatrix
## SalePrice LotArea BedroomAbvGr
## SalePrice 1.0971260 -0.27121886 -0.15207170
## LotArea -0.2712189 1.08158397 -0.08384659
## BedroomAbvGr -0.1520717 -0.08384659 1.03561490
corrMatrix %*% invCorrMatrix
## SalePrice LotArea BedroomAbvGr
## SalePrice 1.000000e+00 2.428613e-17 0
## LotArea -2.081668e-17 1.000000e+00 0
## BedroomAbvGr -2.775558e-17 0.000000e+00 1
invCorrMatrix %*% corrMatrix
## SalePrice LotArea BedroomAbvGr
## SalePrice 1.000000e+00 -2.081668e-17 -2.775558e-17
## LotArea 2.428613e-17 1.000000e+00 0.000000e+00
## BedroomAbvGr 0.000000e+00 0.000000e+00 1.000000e+00
I decided to use my solution function from Assignment 2 to perform the decomposition.
decompose_A <- function(A){
# get the size of the matrix A
rows = dim(A)[1]
cols = dim(A)[2]
# initialize L & U
U <- A
L <- diag(rows)
for(col in 1:(cols - 1)){ # no need to process the last column
for(row in (col + 1):rows){
if((col) > rows){break}
if(U[row,col] == 0){next}
coeff <- ( (-1*U[row,col])/U[col,col] ) # calculate factor to change element to 0
U[row,] <- ( coeff * U[col,] ) + U[row,] # perform row operation 3
L[row,col] <- -1*coeff # use the opposite sign for the same location in L
}
}
return( list(L,U) )
}
LU <- decompose_A(corrMatrix)

Extract L
LU[[1]]
## [,1] [,2] [,3]
## [1,] 1.0000 0.0000000 0
## [2,] 0.2638 1.0000000 0
## [3,] 0.1682 0.0809631 1
Extract U
LU[[2]]
## SalePrice LotArea BedroomAbvGr
## SalePrice 1 0.2638000 0.16820000
## LotArea 0 0.9304096 0.07532884
## BedroomAbvGr 0 0.0000000 0.96560990
Does A = LU?
LU[[1]] %*% LU[[2]] == corrMatrix
## SalePrice LotArea BedroomAbvGr
## [1,] TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE
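As a cross-check on the hand-rolled decomposition, the Matrix package's lu() could be used. This is a sketch; because lu() applies partial pivoting, its factors may differ from the ones above by a row permutation P, and all.equal() compares with floating-point tolerance rather than exact equality.

# Sketch: LU decomposition via the Matrix package, then reconstruct and compare
library(Matrix)
factors <- expand(lu(corrMatrix))
all.equal(as.matrix(factors$P %*% factors$L %*% factors$U), corrMatrix,
          check.attributes = FALSE)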
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., \(rexp(1000,\lambda)\)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
I selected the LotFrontage variable because it appears to be right-skewed based on its histogram in Problem 2.1.4, and its minimum value is 21 (per the summary table in Problem 2.1.1), so no shift above zero is needed.
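As a quick numeric confirmation of the skew and the minimum, the sketch below assumes the e1071 package is available for its skewness() function.

# Sketch: confirm LotFrontage is right-skewed and strictly positive
library(e1071)
skewness(training_data$LotFrontage, na.rm = TRUE)  # positive value indicates right skew
min(training_data$LotFrontage, na.rm = TRUE)       # minimum of 21, so no shift is needed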
# fit PDF
expLotFrontage <- fitdistr(na.omit(training_data$LotFrontage), densfun = "exponential")
#extract lambda
lambda <- expLotFrontage$estimate
#sample PDF
sampleExp <- rexp(1000,lambda)
originalPlot <- ggplot(aes(x = LotFrontage), data = training_data) +
geom_histogram(fill = "green") +
labs(x="LotFrontage (ft)", y="Frequency")
pdfPlot <- ggplot( aes(x = X), data = data.frame(X = sampleExp) ) +
geom_histogram(fill = "blue") +
labs(x="Sample of Exponential PDF", y="Frequency")
grid.arrange(originalPlot, pdfPlot, ncol=2, nrow=1)

quantile(sampleExp, probs=c(0.05, 0.95))
## 5% 95%
## 4.227185 208.969174
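The 5th and 95th percentiles can also be taken directly from the fitted exponential CDF via its inverse, qexp(), which avoids the sampling noise in the 1,000 draws above; this short sketch uses the lambda estimated by fitdistr.

# Percentiles from the fitted exponential distribution itself (inverse CDF)
qexp(c(0.05, 0.95), rate = lambda)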
ci <- t.test(training_data$LotFrontage, conf.level=0.95)
paste0("The lower bound is ", ci[["conf.int"]][1], ", the upper bound is ", ci[["conf.int"]][2],".")## [1] "The lower bound is 68.6751299039821, the upper bound is 71.4247868320712."
quantile(training_data$LotFrontage, probs=c(0.05, 0.95), na.rm = TRUE)
## 5% 95%
## 34 107
The histograms of the observed data and of the exponential sample have broadly similar shapes, though the simulated sample follows the exponential form more closely, as expected. The fitted distribution's 5th percentile (about 4.2) is far below the empirical 5th percentile (34), while its 95th percentile (about 209) is far above the empirical 95th percentile (107), so the exponential model spreads the data much more widely than what we observe. Overall, the exponential distribution is only a rough fit for LotFrontage.
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
I decided to start my model with 15 independent variables I deemed important in the context of house prices. These variables along with their descriptions are below.
| variable | description |
|---|---|
| GrLivArea | Above grade (ground) living area square feet |
| MSZoning | The general zoning classification |
| LotFrontage | Linear feet of street connected to property |
| LotArea | Lot size in square feet |
| Street | Type of road access |
| Utilities | Type of utilities available |
| OverallQual | Overall material and finish quality |
| OverallCond | Overall condition rating |
| TotalBsmtSF | Total square feet of basement area |
| TotRmsAbvGrd | Total rooms above grade (does not include bathrooms) |
| BedroomAbvGr | Number of bedrooms above basement level |
| Functional | Home functionality rating |
| GarageArea | Size of garage in square feet |
| PoolArea | Pool area in square feet |
| MiscVal | Dollar value of miscellaneous feature |
ggplot( training_data, aes(x=SalePrice)) +
labs(title = "Histogram of SalePrice" ) +
geom_histogram()

The plot above shows that our target variable is right-skewed, which makes it a good candidate for a log transformation.
ggplot( training_data, aes(x=log(SalePrice))) +
labs(title = "Histogram of log(SalePrice)" ) +
geom_histogram()

The log transformation makes the distribution of the target noticeably more symmetric.
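A numeric before-and-after comparison of the skew supports this; the sketch below again assumes the e1071 package for its skewness() function.

# Sketch: skewness of SalePrice before and after the log transformation
library(e1071)
skewness(training_data$SalePrice)
skewness(log(training_data$SalePrice))  # expected to be much closer to zero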
model_data <- dplyr::select(training_data, GrLivArea, MSZoning, LotFrontage, LotArea, Street, Utilities,
OverallQual, OverallCond, TotalBsmtSF, TotRmsAbvGrd, BedroomAbvGr, Functional,
GarageArea, PoolArea, MiscVal, SalePrice)
model_data$logSalePrice <- log(model_data$SalePrice)
gather( model_data , "VARIABLE", "VALUE", -logSalePrice) %>%
ggplot( aes(x=VALUE, y=logSalePrice)) +
geom_point() + facet_wrap(~VARIABLE, scale = "free", ncol = 4) +
labs(x="", y="Sale Price", title = "Log Sale Price vs. Predictors" ) +
theme( axis.text.x = element_text(angle = 90)) +
scale_x_discrete(labels = NULL)

Based on the plots above, it seems that the predictors Utilities, PoolArea, Street, and MiscVal need to be binned or transformed, so I will remove these variables from the model. There is no issue with missing data, so we will proceed with building our model.
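A minimal sketch of how the stepwise trace below could have been generated (an assumption on my part: a backward step() search, with rows containing missing values dropped up front so all candidate models are fit on the same cases). Note that SalePrice itself appears among the candidate predictors in the trace; the final model further below is refit without it.

# Sketch (assumed): backward stepwise selection over the retained predictors
step_data <- na.omit(dplyr::select(model_data, logSalePrice, GrLivArea, MSZoning,
                                   LotFrontage, LotArea, OverallQual, OverallCond,
                                   TotalBsmtSF, TotRmsAbvGrd, BedroomAbvGr,
                                   Functional, GarageArea, SalePrice))
full_model <- lm(logSalePrice ~ ., data = step_data)
step(full_model, direction = "backward")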
## Start: AIC=-5410.48
## logSalePrice ~ GrLivArea + MSZoning + LotFrontage + LotArea +
## OverallQual + OverallCond + TotalBsmtSF + TotRmsAbvGrd +
## BedroomAbvGr + Functional + GarageArea + SalePrice
##
## Df Sum of Sq RSS AIC
## - TotRmsAbvGrd 1 0.0000 12.841 -5412.5
## - LotArea 1 0.0001 12.841 -5412.5
## <none> 12.841 -5410.5
## - GrLivArea 1 0.0304 12.871 -5409.6
## - LotFrontage 1 0.0533 12.894 -5407.5
## - TotalBsmtSF 1 0.1287 12.969 -5400.5
## - Functional 5 0.2971 13.138 -5393.0
## - BedroomAbvGr 1 0.2277 13.069 -5391.4
## - OverallCond 1 0.6587 13.500 -5352.4
## - GarageArea 1 0.9156 13.756 -5329.8
## - OverallQual 1 1.6480 14.489 -5267.5
## - MSZoning 4 2.9687 15.810 -5168.7
## - SalePrice 1 21.3359 34.177 -4236.8
##
## Step: AIC=-5412.48
## logSalePrice ~ GrLivArea + MSZoning + LotFrontage + LotArea +
## OverallQual + OverallCond + TotalBsmtSF + BedroomAbvGr +
## Functional + GarageArea + SalePrice
##
## Df Sum of Sq RSS AIC
## - LotArea 1 0.0001 12.841 -5414.5
## <none> 12.841 -5412.5
## - GrLivArea 1 0.0464 12.887 -5410.1
## - LotFrontage 1 0.0532 12.894 -5409.5
## - TotalBsmtSF 1 0.1305 12.971 -5402.3
## - Functional 5 0.2976 13.138 -5395.0
## - BedroomAbvGr 1 0.2917 13.133 -5387.5
## - OverallCond 1 0.6607 13.502 -5354.2
## - GarageArea 1 0.9168 13.758 -5331.7
## - OverallQual 1 1.6483 14.489 -5269.4
## - MSZoning 4 2.9795 15.820 -5169.9
## - SalePrice 1 21.4923 34.333 -4233.3
##
## Step: AIC=-5414.47
## logSalePrice ~ GrLivArea + MSZoning + LotFrontage + OverallQual +
## OverallCond + TotalBsmtSF + BedroomAbvGr + Functional + GarageArea +
## SalePrice
##
## Df Sum of Sq RSS AIC
## <none> 12.841 -5414.5
## - GrLivArea 1 0.0464 12.887 -5412.1
## - LotFrontage 1 0.0588 12.900 -5411.0
## - TotalBsmtSF 1 0.1310 12.972 -5404.3
## - Functional 5 0.2985 13.139 -5396.9
## - BedroomAbvGr 1 0.2923 13.133 -5389.4
## - OverallCond 1 0.6607 13.502 -5356.2
## - GarageArea 1 0.9176 13.758 -5333.6
## - OverallQual 1 1.6712 14.512 -5269.5
## - MSZoning 4 2.9802 15.821 -5171.8
## - SalePrice 1 21.7349 34.576 -4226.9
##
## Call:
## lm(formula = logSalePrice ~ GrLivArea + MSZoning + LotFrontage +
## OverallQual + OverallCond + TotalBsmtSF + BedroomAbvGr +
## Functional + GarageArea + SalePrice, data = model_data)
##
## Coefficients:
## (Intercept) GrLivArea MSZoningFV MSZoningRH
## 1.044e+01 2.193e-05 4.328e-01 3.236e-01
## MSZoningRL MSZoningRM LotFrontage OverallQual
## 3.949e-01 2.850e-01 -3.487e-04 4.726e-02
## OverallCond TotalBsmtSF BedroomAbvGr FunctionalMaj2
## 2.272e-02 3.236e-05 2.510e-02 -2.312e-01
## FunctionalMin1 FunctionalMin2 FunctionalMod FunctionalTyp
## -2.560e-02 -3.063e-02 -1.027e-01 -2.374e-02
## GarageArea SalePrice
## 1.710e-04 3.392e-06
houseModel <- lm(formula = logSalePrice ~ GrLivArea +
LotArea + OverallQual + OverallCond + TotalBsmtSF +
GarageArea, data = model_data)
summary(houseModel)
##
## Call:
## lm(formula = logSalePrice ~ GrLivArea + LotArea + OverallQual +
## OverallCond + TotalBsmtSF + GarageArea, data = model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.47725 -0.07374 0.01461 0.10421 0.52127
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.038e+01 3.369e-02 308.197 < 2e-16 ***
## GrLivArea 1.865e-04 1.178e-05 15.840 < 2e-16 ***
## LotArea 2.814e-06 5.031e-07 5.594 2.65e-08 ***
## OverallQual 1.410e-01 4.863e-03 28.986 < 2e-16 ***
## OverallCond 2.911e-02 4.327e-03 6.728 2.47e-11 ***
## TotalBsmtSF 1.271e-04 1.374e-05 9.247 < 2e-16 ***
## GarageArea 3.609e-04 2.820e-05 12.801 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1803 on 1453 degrees of freedom
## Multiple R-squared: 0.7971, Adjusted R-squared: 0.7963
## F-statistic: 951.4 on 6 and 1453 DF, p-value: < 2.2e-16
\[\begin{align} log(SalePrice) &= 10.38 + 0.0001865*GrLivArea + 0.0000028*LotArea + 0.1410*OverallQual \\ &+ 0.0291*OverallCond + 0.0001271*TotalBsmtSF + 0.0003609*GarageArea \end{align}\]
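As a quick sanity check of the fitted equation, we can back-transform a prediction to dollars. The predictor values below are hypothetical, chosen only for illustration.

# Sketch: predicted SalePrice for a hypothetical house (values are made up)
new_house <- data.frame(GrLivArea = 1500, LotArea = 10000, OverallQual = 6,
                        OverallCond = 5, TotalBsmtSF = 1000, GarageArea = 500)
exp(predict(houseModel, newdata = new_house))  # exp() undoes the log transform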
To assess whether this model is reliable, we will check for linearity, nearly normal residuals, constant variability (homoscedasticity), and issues with multicollinearity.
Linearity
The final variables show roughly linear relationships with the response, so this assumption is met, although some transformations would have made the relationships more clearly linear.
Nearly Normal Residuals
Investigating the Q-Q plot below, we can see evidence of skewness: the two most extreme residuals deviate strongly from the line, while the others show only small deviations. I would say the assumption of nearly normal residuals is met, but the outliers should be addressed.
ggplot( houseModel, aes(sample = houseModel$residuals) ) +
stat_qq() +
stat_qq_line() +
labs(x="Theoretical Quantiles", y="Sample Quantiles") +
ggtitle("Normal Q-Q Plot")Constant Variability and Independence
The plot of the residuals vs. the fitted values below shows the same two outliers, which may be influential. Because of them, I would say the constant variability assumption is not fully met.
ggplot(aes(x = houseModel$fitted.values, y = houseModel$residuals ), data = houseModel) +
geom_point(shape=1) +
labs(x="Fitted Values", y="Residuals") +
ggtitle("Residuals vs. Fitted Values") +
geom_abline(intercept = 0,
slope = 0,
color = "blue",
size = 1.0,
linetype="dashed")Multicollinearity
Multicollinearity is not a major problem, although there is some positive correlation between the OverallCond and OverallQual variables.
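Since the precision matrix earlier was noted to carry variance inflation factors on its diagonal, VIFs can also be computed directly for the final model. This is a sketch assuming the car package is available; values near 1 indicate little multicollinearity.

# Sketch: variance inflation factors for the final model
library(car)
vif(houseModel)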
model_data$sqrt_SalePrice <- sqrt(model_data$SalePrice)
corrplot( round( cor(dplyr::select(model_data,logSalePrice, GrLivArea, LotArea, OverallQual, OverallCond, TotalBsmtSF, GarageArea), method = "pearson", use = "complete.obs"), 2 ),
method="circle",
type="upper",
tl.col="black")Kaggle Score (RMSE): 0.17298
Kaggle Username: A.Gilharry
# Predict SalePrice for the test set and write the submission file
test_data <- read.csv("data/test.csv", header = TRUE, stringsAsFactors = FALSE)
test_data <- dplyr::select(test_data,Id, GrLivArea, LotArea, OverallQual, OverallCond, TotalBsmtSF, GarageArea)
predicts <- predict(houseModel, newdata = test_data, interval="prediction")
test_data$SalePrice <- round(exp(predicts[,1]))
test_data$SalePrice[is.na(test_data$SalePrice)] <- median(test_data$SalePrice, na.rm = TRUE)
predictions <- dplyr::select(test_data, Id, SalePrice)
write.csv(predictions, "data/predictions.csv", row.names = FALSE)
datatable(predictions, options = list(filter = FALSE, pageLength = 1416, dom = 'tip'), filter = 'none')