How to create a basic neural network in R
First you will need to install a package called "neuralnet", but if you try installing it the traditional way with the install.packages() function it won't load due to being made under an older version of R. Instead we will install a different package that will let us install the neuralnet package that we want:
install.packages("remotes")
library(remotes)
install_github("cran/neuralnet")
From here you will select the data to make a neural network with. I am chosing to use real estate data that can be downloaded from kaggle:
The data set used for this tutorial can be found here: https://www.kaggle.com/quantbruce/real-estate-price-prediction
set.seed(500)
library(MASS)
RealEstate <- read.csv("file:///C:/Users/Dakota/Documents/Real estate.csv")
data <- RealEstate
After selecting the data set, we will check to make sure there are no missing values, because then the neural network won't run.
apply(data,2,function(x) sum(is.na(x)))
## No TransactionDate HouseAge
## 0 0 0
## DistanceMRT NumberConvenienceStores Latitude
## 0 0 0
## Longitude HousePriceUnitArea
## 0 0
As you can see our dataset has no missing values!
From here we will need to begin randomly splitting the values to a train and a test set and from there running a linear regression model. We use Mean Squared Error as a measure of how far our predicted data is from the real data. Essentially we are trying to predict HousePrice using all the variables to create a pathway that predicts it accurately.
index <- sample(1:nrow(data),round(0.75*nrow(data)))
train <- data[index,]
test <- data[-index,]
lm.fit <- glm(HousePriceUnitArea~., data=train)
summary(lm.fit)
##
## Call:
## glm(formula = HousePriceUnitArea ~ ., data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -35.904 -5.657 -1.345 4.441 75.164
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.312e+03 8.440e+03 -1.103 0.270732
## No -4.296e-03 4.585e-03 -0.937 0.349507
## TransactionDate 4.038e+00 1.993e+00 2.026 0.043693 *
## HouseAge -2.861e-01 4.834e-02 -5.919 8.77e-09 ***
## DistanceMRT -4.993e-03 9.026e-04 -5.532 6.86e-08 ***
## NumberConvenienceStores 1.047e+00 2.310e-01 4.532 8.43e-06 ***
## Latitude 2.056e+02 5.548e+01 3.705 0.000251 ***
## Longitude -3.212e+01 5.980e+01 -0.537 0.591572
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 92.44944)
##
## Null deviance: 60116 on 309 degrees of freedom
## Residual deviance: 27920 on 302 degrees of freedom
## AIC: 2292.9
##
## Number of Fisher Scoring iterations: 2
pr.lm <- predict(lm.fit,test)
MSE.lm <- sum((pr.lm - test$medv)^2)/nrow(test)
Now we have to train our data for our neural network package. To do this we will normalize our data. This is similar to finding the z-score in other areas of stats. We find the min and max so we can scale it on the [0,1] interval.
maxs <- apply(data, 2, max)
mins <- apply(data, 2, min)
scaled <- as.data.frame(scale(data, center = mins, scale = maxs - mins))
train_ <- scaled[index,]
test_ <- scaled[-index,]
We will now reference the neuralnet package that we installed in the beginning. From here we will run our neuralnetwork. This examples takes the scaled values and outputs them into a three-layed neural network.
library(neuralnet)
n <- names(RealEstate)
f <- as.formula(paste("HousePriceUnitArea ~", paste(n[!n %in% "HousePrice"], collapse = " + ")))
nn <- neuralnet(f,data=train_,hidden=c(5,3),linear.output=T)
The graph is shown below:
p1 <- plot(nn, rep="best")
This graph seems a big complicated, but each black line shows the connections between each layer and the weights on each connection, the blue lines show the bias term added in each step. This is similar to an intercept in a linear model.
We will now predict the value of HousePrices using Mean squared Error that was mentioned above.
pr.nn <- compute(nn,test_[,1:8])
pr.nn_ <- pr.nn$net.result*(max(data$HousePriceUnitArea)-min(data$HousePriceUnitArea))+min(data$HousePriceUnitArea)
test.r <- (test_$HousePriceUnitArea)*(max(data$HousePriceUnitArea)-min(data$HousePriceUnitArea))+min(data$HousePriceUnitArea)
MSE.nn <- sum((test.r - pr.nn_)^2)/nrow(test_)
Now we will out put the two MSE:
print(paste(MSE.lm,MSE.nn))
## [1] "0 0.520851480256125"
Now we will plot the models to see which is better at predicting house prices, the linear model or neural network:
par(mfrow=c(1,2))
plot(test$HousePriceUnitArea,pr.nn_,col='red',main='Real vs predicted NN',pch=18,cex=0.7)
abline(0,1,lwd=2)
legend('bottomright',legend='NN',pch=18,col='red', bty='n')
plot(test$HousePriceUnitArea,pr.lm,col='blue',main='Real vs predicted lm',pch=18, cex=0.7)
abline(0,1,lwd=2)
legend('bottomright',legend='LM',pch=18,col='blue', bty='n', cex=.95)
Looking at these graphs, the neural network model seems to be a better model for this data set.
Here is another graph comparing the two models:
plot(test$HousePriceUnitArea,pr.nn_,col='red',main='Real vs predicted NN',pch=18,cex=0.7)
points(test$HousePriceUnitArea,pr.lm,col='blue',pch=18,cex=0.7)
abline(0,1,lwd=2)
legend('bottomright',legend=c('NN','LM'),pch=18,col=c('red','blue'))
One final thing we will do is cross validation,
This is done so we can see the error of our model and how accurate it is:
library(boot)
set.seed(200)
lm.fit <- glm(HousePriceUnitArea~.,data=data)
cv.glm(data,lm.fit,K=10)$delta[1]
## [1] 80.15406
I am now splitting the data into 90% train and 10% test for a random sample of 10 times.
set.seed(450)
cv.error <- NULL
k <- 10
library(plyr)
## Warning: package 'plyr' was built under R version 3.6.3
pbar <- create_progress_bar('text')
pbar$init(k)
for(i in 1:k){
index <- sample(1:nrow(data),round(0.9*nrow(data)))
train.cv <- scaled[index,]
test.cv <- scaled[-index,]
nn <- neuralnet(f,data=train.cv,hidden=c(5,2),linear.output=T)
pr.nn <- compute(nn,test.cv[,1:8])
pr.nn <- pr.nn$net.result*(max(data$HousePriceUnitArea)-min(data$HousePriceUnitArea))+min(data$HousePriceUnitArea)
test.cv.r <- (test.cv$HousePriceUnitArea)*(max(data$HousePriceUnitArea)-min(data$HousePriceUnitArea))+min(data$HousePriceUnitArea)
cv.error[i] <- sum((test.cv.r - pr.nn)^2)/nrow(test.cv)
pbar$step()
}
We now calculate the MSE and plot the results:
mean(cv.error)
## [1] 0.8496616
cv.error
## [1] 0.8601074 5.3745704 0.1546210 0.3014590 0.4527813 0.4712630 0.3829065
## [8] 0.2195910 0.1649245 0.1143915
The error for neural network is 0.4007904, which is much lower that the linear model error terms.
boxplot(cv.error,xlab='MSE CV',col='cyan',
border='blue',names='CV error (MSE)',
main='CV error (MSE) for NN',horizontal=TRUE)