Loading packages and setting the random seed

library(caret)
library(knitr)
library(randomForest)
library(MASS)
library(dplyr)
library(mice)

set.seed(1234)

Loading the training and testing data sets

training <- read.csv(file="train.csv")

testing <- read.csv(file="test.csv")

The training data set has 81 variables and 1460 observations. The first variable, Id, is a sequential identifier. The target variable is SalePrice, the property's sale price in dollars.

Removing outliers

The following paper points out that the data set contains outliers.

http://www.amstat.org/publications/jse/v19n3/decock.pdf

A scatter plot of SalePrice versus GrLivArea (above grade (ground) living area in square feet) shows that houses with more than 4000 square feet of living area are outliers. We exclude these outliers from the analysis.

# We keep only houses with less than 4000 square feet of living area,
# so any house with 4000 square feet or more is excluded from the data set.

training <- training %>% 
            filter(GrLivArea < 4000)
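
As an illustration of the rule above, the following minimal sketch plots sale price against living area and marks the 4000 square feet threshold. The toy data frame and its values are made up for the example and are not rows from train.csv; only the column names and the threshold come from the analysis.

```r
# Toy data for illustration only; these are NOT rows from train.csv.
toy <- data.frame(
  GrLivArea = c(900, 1500, 2100, 2800, 4700, 5600),
  SalePrice = c(110000, 175000, 240000, 320000, 185000, 160000)
)

# Flag the points that the 4000 square feet rule would drop.
toy$outlier <- toy$GrLivArea >= 4000

# Scatter plot of sale price against living area, flagged points in red.
plot(toy$GrLivArea, toy$SalePrice,
     col = ifelse(toy$outlier, "red", "black"), pch = 19,
     xlab = "GrLivArea (square feet)", ylab = "SalePrice (dollars)")
abline(v = 4000, lty = 2)  # threshold used in the filter

# Same rule as the dplyr filter, written in base R.
toy_clean <- toy[toy$GrLivArea < 4000, ]
nrow(toy_clean)
```

Replacing toy with the real training data frame should reproduce the kind of plot described above, assuming the columns are read in as numeric.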

Simple linear regression model

We split the data set into two parts: training data (90%) and testing data (10%). We use GrLivArea as the predictor, since it is a good explanatory variable for the target variable SalePrice.

inTrain <- createDataPartition(y = training$SalePrice, 
                               p = 0.90, 
                               list = FALSE)



train <- training[inTrain, ]

test <- training[-inTrain, ]
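
To make the split proportions concrete, here is a minimal base R sketch of a 90/10 random split. It is a simplified stand-in and not what caret does internally: createDataPartition also samples within percentile groups of a numeric outcome, which this sketch skips. The vector y is toy data invented for the example.

```r
set.seed(1234)

# Toy outcome vector standing in for training$SalePrice (illustration only).
y <- seq(50000, 545000, by = 5000)  # 100 values

# Simple random 90/10 split: draw 90% of the row indices without replacement.
idx <- sample(seq_along(y), size = round(0.90 * length(y)))
train_y <- y[idx]   # rows used to fit the model
test_y  <- y[-idx]  # held-out rows used to evaluate it

# Share of observations in each part.
length(train_y) / length(y)
length(test_y) / length(y)
```

The two parts are disjoint and together cover every observation, which is the property the evaluation relies on.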

We fit the model with the following instruction.

model <- lm(SalePrice ~ GrLivArea, data = train)

The output of the model is:

kable(summary(model)$coef, digits=4)
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  12215.0814  4658.8909   2.6219    0.0088
GrLivArea      111.9363     2.9472  37.9809    0.0000
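
To read the coefficients, the fitted line can be evaluated by hand. The sketch below copies the two estimates from the table above and applies them to a hypothetical house of 1500 square feet; that area is an assumption chosen only for the example.

```r
# Coefficients copied from the summary table above.
intercept <- 12215.0814
slope <- 111.9363  # dollars per additional square foot of living area

# Hypothetical house with 1500 square feet of above grade living area.
gr_liv_area <- 1500

# Fitted line: SalePrice = intercept + slope * GrLivArea
predicted_price <- intercept + slope * gr_liv_area
predicted_price
```

Under these estimates the model predicts a price of about 180,120 dollars for such a house, and each additional square foot adds roughly 112 dollars.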

Both terms of the model are statistically significant (p < 0.01). To evaluate the model, we compute the root-mean-squared error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sale price in the testing data with the following code:

pred.test <- predict(model, test)

# Use mean() with na.rm = TRUE so that missing pairs are dropped from both
# the sum and the count.
rmse <- sqrt(mean((log(pred.test) - log(test$SalePrice))^2, na.rm = TRUE))

The RMSE between the logarithm of the predicted value and the logarithm of the observed sale price in the testing data is 0.2708674.
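
The metric can also be wrapped in a small helper. The following is a sketch under our own assumptions, not part of the original analysis: the name rmse_log is hypothetical, and dropping pairs with a missing or non-positive value is a design choice made here because the logarithm is undefined for them and a linear model can in principle predict a non-positive price. The input values are toy numbers.

```r
# RMSE between log predictions and log observations.
# Pairs with a missing or non-positive value are dropped before averaging.
rmse_log <- function(pred, obs) {
  ok <- !is.na(pred) & !is.na(obs) & pred > 0 & obs > 0
  sqrt(mean((log(pred[ok]) - log(obs[ok]))^2))
}

# Toy values for illustration only: two exact predictions and one that is
# off by 0.3 on the log scale.
pred <- c(100000, 200000, 150000)
obs  <- c(100000, 200000, 150000 * exp(0.3))
rmse_log(pred, obs)  # sqrt(0.3^2 / 3), about 0.1732
```

Because the errors are taken on the log scale, a given relative error counts the same for cheap and expensive houses.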

Predicting with the simple linear regression model

To generate the submission file, we use the following code:

pred.testing <- predict(model, testing)
submission <- data.frame(Id=testing$Id, SalePrice=pred.testing)
write.csv(submission, file="submission001.csv", row.names = FALSE, quote=FALSE)

On Kaggle, this submission scored an RMSE of 0.28854.
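
Before uploading a file, it can help to run a few basic checks on the submission data frame. The helper below is a hypothetical sketch, not part of the original workflow: the name check_submission and the specific checks are our assumptions, and the toy submission contains invented values.

```r
# Hypothetical helper: basic sanity checks on a submission data frame.
check_submission <- function(submission) {
  stopifnot(
    identical(names(submission), c("Id", "SalePrice")),  # expected columns
    !anyNA(submission$SalePrice),                        # no missing predictions
    all(submission$SalePrice > 0),                       # log() needs positive prices
    !anyDuplicated(submission$Id)                        # one row per house
  )
  invisible(TRUE)
}

# Toy submission for illustration only.
toy_submission <- data.frame(Id = 1:3,
                             SalePrice = c(120000.5, 180119.5, 210000))
check_submission(toy_submission)
```

The function stops with an error as soon as one check fails, so a call that returns silently means all of the checks passed.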