library(caret)
library(knitr)
library(randomForest)
library(MASS)
library(dplyr)
library(mice)
set.seed(1234)
training <- read.csv(file="train.csv")
testing <- read.csv(file="test.csv")
The data set has 81 variables and 1460 observations. The first variable, Id, is a sequential row identifier. The target variable is SalePrice, the property’s sale price in dollars.
The paper that introduces the data set notes that it contains outliers:
http://www.amstat.org/publications/jse/v19n3/decock.pdf
The figure below shows a scatter plot of SalePrice versus GrLivArea (above grade (ground) living area in square feet). Houses with more than 4000 square feet of living area are treated as outliers and excluded from the analysis.
# Exclude every house with 4000 or more square feet of living area
# from the data set.
training <- training %>%
  filter(GrLivArea < 4000)
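As a minimal sketch of what this filter does, the same call can be run on a toy data frame. The data frame `toy` and all of its values are made up for illustration and are not taken from train.csv. Note that the comparison is strict, so a house with exactly 4000 square feet is dropped as well.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from train.csv)
toy <- data.frame(
  Id        = 1:5,
  GrLivArea = c(1200, 3999, 4000, 4500, 5200),
  SalePrice = c(150000, 400000, 450000, 300000, 200000)
)

# Same filter as above: keep rows with strictly less than 4000 sq ft
toy_kept <- toy %>%
  filter(GrLivArea < 4000)

# Number of rows dropped by the filter
n_removed <- nrow(toy) - nrow(toy_kept)
print(toy_kept)
print(n_removed)
```

Comparing the row count before and after the filter is a quick way to confirm that only a handful of observations were removed.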
We split the data set into two parts: training data (90%) and testing data (10%). We use GrLivArea as the only predictor, since it is a good explanatory variable for the target variable SalePrice.
inTrain <- createDataPartition(y = training$SalePrice,
                               p = 0.90,
                               list = FALSE)
train <- training[inTrain, ]
test <- training[-inTrain, ]
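The split proportions can be checked with a short sketch. It uses a synthetic data frame (`toy`, with made-up values) in place of the real training data, so the numbers it prints are only illustrative. Because createDataPartition samples within groups of the outcome, the training share should come out close to, but not necessarily exactly, 0.90.

```r
library(caret)

set.seed(1234)

# Synthetic stand-in for the training data (values are made up)
n <- 1000
toy <- data.frame(
  GrLivArea = runif(n, min = 500, max = 3500)
)
toy$SalePrice <- 12000 + 110 * toy$GrLivArea + rnorm(n, sd = 30000)

# Same call as above: stratified 90% sample of row indices
idx <- createDataPartition(y = toy$SalePrice, p = 0.90, list = FALSE)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# Share of rows in each part; the two shares sum to 1
share_train <- nrow(toy_train) / n
share_test  <- nrow(toy_test) / n
print(c(train = share_train, test = share_test))
```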
We fit the model with the following instruction:
model <- lm(SalePrice ~ GrLivArea, data = train)
The output of the model is:
kable(summary(model)$coef, digits=4)
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 12215.0814 | 4658.8909 | 2.6219 | 0.0088 |
| GrLivArea | 111.9363 | 2.9472 | 37.9809 | 0.0000 |
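To make the coefficients concrete, here is a small worked example that applies the fitted line by hand. The intercept and slope are copied from the table; the 1500 square foot house is a hypothetical input chosen only for illustration.

```r
# Coefficients copied from the table above
intercept <- 12215.0814
slope     <- 111.9363

# Hypothetical house: 1500 sq ft of above-grade living area
gr_liv_area <- 1500

# Fitted line: SalePrice = intercept + slope * GrLivArea
predicted_price <- intercept + slope * gr_liv_area
print(predicted_price)  # 180119.5
```

In other words, each additional square foot of living area adds about 112 dollars to the predicted sale price.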
Both terms of the model are statistically significant (p < 0.01). To evaluate the model by the root-mean-squared error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sale price in the testing data, we use the following code:
pred.test <- predict(model, test)
# mean() with na.rm = TRUE averages over the non-missing terms only
rmse <- sqrt(mean((log(pred.test) - log(test$SalePrice))^2, na.rm = TRUE))
The RMSE between the logarithm of the predicted value and the logarithm of the observed sale price in the testing data is 0.2708674.
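The same metric can be wrapped in a small reusable function. This is a sketch, not part of the original analysis: the name `rmse_log` and the toy prices are assumptions, and treating non-positive predictions as missing is one possible design choice, made here because a linear model can in principle predict a price at or below zero, which has no real logarithm.

```r
# Hypothetical helper: RMSE between log(predicted) and log(observed)
rmse_log <- function(pred, obs) {
  # Non-positive values have no real logarithm; mark them as NA
  pred[which(pred <= 0)] <- NA
  # Average the squared log differences over the non-missing terms
  sqrt(mean((log(pred) - log(obs))^2, na.rm = TRUE))
}

# Toy check with made-up prices
obs  <- c(100000, 200000, 300000)
pred <- c(110000, 180000, 300000)
print(rmse_log(pred, obs))

# A perfect prediction gives an error of zero
print(rmse_log(obs, obs))
```

Because the error is measured on the log scale, a 10% miss on a cheap house counts about as much as a 10% miss on an expensive one.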
To generate the submission file, we use the following code:
pred.testing <- predict(model, testing)
submission <- data.frame(Id=testing$Id, SalePrice=pred.testing)
write.csv(submission, file="submission001.csv", row.names = FALSE, quote=FALSE)
This submission scored an RMSE of 0.28854 on Kaggle.
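Since the score is computed on logarithms, it can be worth checking a submission before writing it to disk. The sketch below is a hypothetical helper (`check_submission` is not part of the original workflow, and the toy data frames are made up) that assumes a valid submission has exactly the columns Id and SalePrice, no missing or non-positive prices, and no repeated Id.

```r
# Hypothetical helper: basic sanity checks on a submission data frame
check_submission <- function(submission) {
  stopifnot(
    identical(names(submission), c("Id", "SalePrice")),
    !anyNA(submission$SalePrice),      # no missing predictions
    all(submission$SalePrice > 0),     # log() needs positive prices
    !anyDuplicated(submission$Id)      # one row per house
  )
  invisible(TRUE)
}

# Toy submissions with made-up values
good <- data.frame(Id = 1:3, SalePrice = c(120000.5, 180119.5, 250000))
bad  <- data.frame(Id = 1:3, SalePrice = c(120000.5, NA, 250000))

check_submission(good)                              # passes silently
result_bad <- try(check_submission(bad), silent = TRUE)
print(inherits(result_bad, "try-error"))            # the NA is caught
```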