Neural Network for Nutrition Rating Prediction
- Background
- Method
- Result
Import and Prepare Data to Train Model
Train The Model
Study Accuracy of the simple Neural Network Model
- Study Matrix.RMSE
- Variation of median RMSE

Neural Network for Nutrition Rating Prediction

Background

Getting health products you can count on is always important for consumers. In this simple example, we would like to recommend a choice of healthy cereals based on their nutrition rating. Ratings are based on important features such as calories, protein, fat, sodium, fiber, etc.

Method

A neural network is a computational system frequently employed in machine learning to create predictions based on existing data. In this example, we will train and test a neural network using the neuralnet library in R. Artificial neural networks (ANNs) have been applied in almost every aspect of food science over the past two decades, although most applications are in the development stage. ANNs hold a great deal of promise for modeling complex tasks in process control and simulation and in applications of machine perception including machine vision and electronic nose for food safety and quality control.

Result

The boxplot shows that the median RMSE across 100 samples is 5.70 when the length of the training set is fixed at 65. The variation of median RMSE shows that the median RMSE of our model decreases as the length of the training set increases. This is an important result. Model accuracy depends on the length of the training set, and the performance of the neural network model is sensitive to the training-test split.

Import and Prepare Data to Train Model

Load Nutrition Data from Webpage

DASL (pronounced “dazzle”) is an online library of data files and stories that illustrate the use of basic statistical methods. DASL is part of a larger effort to enhance the teaching of statistics using computers. The Cereals data file lists 77 types of cereal, collected from many grocery stores, each with a rating: http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html.

# Install relevant libraries (run once)
install.packages("neuralnet")
install.packages("boot")
install.packages("plyr")
install.packages("matrixStats")

# Read the contents of the page into a vector of character strings with the readLines function:
thepage = readLines('http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html')

# If you look at the web page, you'll see that the title "The Data:" is right above the data we want. 
# We can locate this line using the grep function:
grep('The Data:',thepage)

## [1] 40

text=thepage[42:119]

# Each line of text is tab-delimited: the first line holds the column names and
# the remaining 77 lines hold one cereal each. Split on tabs and reshape into 78 rows:

y <- strsplit(text, "\t")
df <- data.frame(matrix(unlist(y), nrow=78, byrow=T))

# Change column name to the correct one
colnames(df) <- as.character(unlist(df[1,]))
df1 = df[-1, ]
head(df1)

##                        name mfr type calories protein fat sodium fiber
## 2                 100%_Bran   N    C       70       4   1    130    10
## 3         100%_Natural_Bran   Q    C      120       3   5     15     2
## 4                  All-Bran   K    C       70       4   1    260     9
## 5 All-Bran_with_Extra_Fiber   K    C       50       4   0    140    14
## 6            Almond_Delight   R    C      110       2   2    200     1
## 7   Apple_Cinnamon_Cheerios   G    C      110       2   2    180   1.5
##   carbo sugars potass vitamins shelf weight cups    rating
## 2     5      6    280       25     3      1 0.33 68.402973
## 3     8      8    135        0     3      1    1 33.983679
## 4     7      5    320       25     3      1 0.33 59.425505
## 5     8      0    330       25     3      1  0.5 93.704912
## 6    14      8     -1       25     3      1 0.75 34.384843
## 7  10.5     10     70       25     1      1 0.75 29.509541
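
As an alternative to splitting on tabs by hand, base R's `read.table` can parse tab-delimited lines directly through its `text` argument, which also yields numeric columns without a round trip through a CSV file. The following is a minimal sketch, not the approach used in this example: `sample_lines` is a hypothetical stand-in for `thepage[42:119]`, built from values in the output above and cut down to a few columns for brevity.

```r
# Hypothetical stand-in for thepage[42:119]: a header line plus two data lines,
# tab-delimited, with only a few of the columns kept for brevity.
sample_lines <- c(
  "name\tmfr\tcalories\tfiber\trating",
  "100%_Bran\tN\t70\t10\t68.402973",
  "100%_Natural_Bran\tQ\t120\t2\t33.983679"
)

# read.table() parses the character vector as if it were a file.
# comment.char = "" and quote = "" keep characters such as # or ' literal.
cereal <- read.table(text = sample_lines, sep = "\t", header = TRUE,
                     stringsAsFactors = FALSE, comment.char = "", quote = "")

str(cereal)               # calories, fiber and rating are parsed as numbers
mean(cereal$rating)       # arithmetic works without further conversion
```

With the real page, the same call on `thepage[42:119]` would be expected to return all 16 columns, assuming every line has the same number of tab-separated fields.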

Simplify Data for Basic Rating Prediction

We pick just the first 5 features (variables) to demonstrate this basic analysis. We first randomize the data and split it for training and testing. Normally we divide the data set into two subsets: (1) a training set, used to train the model, and (2) a test set, used to test the trained model.

## Creating index variable 

# Simplify Data
data=cbind(df1[,4:8],df1$rating)
colnames(data)[colnames(data)=="df1$rating"] <- "rating"

# Save data for future use
write.csv(data, file = "D:/R_Files/MyData.csv")

# Read data from saved file
data = read.csv("D:/R_Files/MyData.csv", header=T)
data =data[,-1]

# Random sampling
samplesize = 0.60 * nrow(data)
set.seed(80)
index = sample( seq_len ( nrow ( data ) ), size = samplesize )

# Create training and test set
datatrain = data[ index, ]
datatest = data[ -index, ]

Feature Scaling and Normalization

Normalize the data so that every feature is on the same scale, which gives a balanced matrix to work with during the computation.

## Scale data for neural network
## Normalized data to the same scale

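# Column-wise maxima and minima (named so they do not mask base max() and min())
max_vals = apply(data, 2, max)
min_vals = apply(data, 2, min)
scaled = as.data.frame(scale(data, center = min_vals, scale = max_vals - min_vals))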
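
Min-max scaling maps each column to the range 0 to 1 with x' = (x - min) / (max - min), and it is inverted with x = x' * (max - min) + min, which is the transformation applied later to the predicted ratings. A small worked sketch on three rating values taken from the output above (the variable names `lo` and `hi` are illustrative only):

```r
# Toy values: the first three ratings from the output shown earlier.
rating <- c(68.402973, 33.983679, 59.425505)

lo <- min(rating)
hi <- max(rating)

# Forward transform: the smallest value maps to 0, the largest to 1.
rating_scaled <- (rating - lo) / (hi - lo)

# Inverse transform: recovers the original units.
rating_back <- rating_scaled * (hi - lo) + lo

range(rating_scaled)            # 0 and 1
all.equal(rating_back, rating)  # TRUE, up to floating-point tolerance
```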

Train The Model

Using Train Data to Fit Neural Network Model

In this simple example, we set up 5 features as the input layer and 1 neuron (rating) as the output layer. For simplicity, we use 1 hidden layer with 3 neurons.
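
To make the 5-3-1 architecture concrete, the forward pass of such a network can be sketched in a few lines of base R. The weights below are random placeholders rather than fitted values, and the logistic activation in the hidden layer is an assumption (it is the documented default of `neuralnet`). The output neuron is linear, matching the `linear.output = T` setting used in this example.

```r
set.seed(1)

# Hypothetical, randomly initialized parameters for a 5-3-1 network.
W1 <- matrix(rnorm(3 * 5), nrow = 3, ncol = 5)  # hidden weights: 3 neurons x 5 inputs
b1 <- rnorm(3)                                  # hidden biases
W2 <- matrix(rnorm(1 * 3), nrow = 1, ncol = 3)  # output weights: 1 neuron x 3 hidden
b2 <- rnorm(1)                                  # output bias

logistic <- function(z) 1 / (1 + exp(-z))

# Forward pass for one observation x (a numeric vector of 5 scaled features).
forward <- function(x) {
  hidden <- logistic(W1 %*% x + b1)   # 3 x 1 hidden activations
  as.numeric(W2 %*% hidden + b2)      # linear output: the scaled rating
}

x <- c(0.2, 0.5, 0.1, 0.4, 0.7)  # made-up scaled feature values
forward(x)

# Number of trainable parameters: (5 + 1) * 3 + (3 + 1) * 1 = 22
n_params <- length(W1) + length(b1) + length(W2) + length(b2)
n_params
```

Training amounts to adjusting these 22 numbers so that the output matches the scaled ratings in the training set as closely as possible.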

# load library
library(neuralnet)

## Warning: package 'neuralnet' was built under R version 3.4.4

# creating training and test set
trainNN = scaled[index , ]
testNN = scaled[-index , ]

# fit neural network
set.seed(2)
NN = neuralnet(rating ~ calories + protein + fat + sodium + fiber, trainNN, hidden = 3 , linear.output = T )

# plot neural network
plot(NN)

#import the function from Github
library(devtools)

## Warning: package 'devtools' was built under R version 3.4.3

source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')

## SHA-1 hash of file is 74c80bd5ddbc17ab3ae5ece9c0ed9beb612e87ef

#plot the model
plot.nnet(NN)

## Loading required package: scales

## Warning: package 'scales' was built under R version 3.4.3

## Loading required package: reshape

## Warning: package 'reshape' was built under R version 3.4.4

Neural Network Model Setup

Prediction Using Trained Model

We predict the rating using the neural network model. Remember that the predicted rating is on the scaled range, so it must be transformed back in order to compare it with the real rating.

plot(NN)
plot.nnet(NN)
## Prediction Using Trained Model

predict_testNN = compute(NN, testNN[,c(1:5)])
predict_testNN = (predict_testNN$net.result * (max(data$rating) - min(data$rating))) + min(data$rating)

plot(datatest$rating, predict_testNN, col='blue', pch=16, ylab = "predicted rating NN", xlab = "real rating")

abline(0,1)

# Calculate Root Mean Square Error (RMSE)
RMSE.NN = (sum((datatest$rating - predict_testNN)^2) / nrow(datatest)) ^ 0.5
RMSE.NN

## [1] 7.208592452
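
The RMSE expression above is the square root of the mean squared difference between the real and the predicted ratings. It can be wrapped in a small helper, sketched here with made-up numbers; `rmse` is an illustrative name, not a function from any of the packages used in this example.

```r
# RMSE: square root of the mean squared difference between actual and predicted.
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Made-up values for illustration only.
actual    <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 37)

# Errors are -2, 2, -3, 3, so the mean squared error is (4 + 4 + 9 + 9) / 4 = 6.5.
rmse(actual, predicted)   # sqrt(6.5), about 2.55
```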

Cross Validation of the simple Neural Network Model

A commonly used cross-validation technique is k-fold cross-validation. This method can be viewed as a recurring holdout method. The complete data set is partitioned into k equal subsets, and each time one subset is assigned as the test set while the others are used for training the model. Every data point gets a chance to be in both the test set and the training set, so this method reduces the dependence of performance on the training-test split and reduces the variance of the performance metrics.

The extreme case of k-fold cross-validation occurs when k is equal to the number of data points. The predictive model is then trained on all the data points except one, which takes the role of the test set. This method of leaving one data point out as the test set is known as leave-one-out cross-validation.
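
The fold mechanics described above can be sketched in base R. To keep the sketch self-contained, it uses synthetic data and a linear model (`lm`) as a stand-in for the neural network; both are assumptions made for illustration, and `kfold_rmse` is a hypothetical helper name. Setting `k` to the number of rows gives leave-one-out cross-validation.

```r
set.seed(42)

# Synthetic data standing in for the cereal features (not the real dataset).
n <- 60
toy <- data.frame(x1 = runif(n), x2 = runif(n))
toy$y <- 2 * toy$x1 - toy$x2 + rnorm(n, sd = 0.1)

# k-fold cross-validation returning one RMSE per fold.
kfold_rmse <- function(df, k) {
  # Randomly assign each row to one of k folds of (nearly) equal size.
  fold <- sample(rep(seq_len(k), length.out = nrow(df)))
  sapply(seq_len(k), function(i) {
    train <- df[fold != i, ]
    test  <- df[fold == i, ]
    fit   <- lm(y ~ x1 + x2, data = train)   # stand-in for the neural network
    pred  <- predict(fit, newdata = test)
    sqrt(mean((test$y - pred)^2))
  })
}

rmse_10fold <- kfold_rmse(toy, k = 10)          # 10 folds of 6 rows each
rmse_loo    <- kfold_rmse(toy, k = nrow(toy))   # leave-one-out: 60 folds of 1 row

mean(rmse_10fold)
mean(rmse_loo)
```

Every row appears in exactly one test fold, which is what distinguishes k-fold cross-validation from drawing independent random splits.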

# Load libraries
library(boot)

## Warning: package 'boot' was built under R version 3.4.4

library(plyr)

## Warning: package 'plyr' was built under R version 3.4.4

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:reshape':
## 
##     rename, round_any

# Initialize variables
set.seed(50)
k = 100
RMSE.NN = NULL

List = list( )

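# Fit neural network model within nested for loop.
# This is repeated random subsampling: for each training-set size j (10 to 65),
# draw k random training-test splits and fit the network on each.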
for(j in 10:65){
  for (i in 1:k) {
    index = sample(1:nrow(data),j )
    
    trainNN = scaled[index,]
    testNN = scaled[-index,]
    datatest = data[-index,]
    
    NN = neuralnet(rating ~ calories + protein + fat + sodium + fiber, trainNN, hidden = 3, linear.output= T)
    predict_testNN = compute(NN,testNN[,c(1:5)])
    predict_testNN = (predict_testNN$net.result*(max(data$rating)-min(data$rating)))+min(data$rating)
    
    RMSE.NN [i]<- (sum((datatest$rating - predict_testNN)^2)/nrow(datatest))^0.5
  }
  List[[j]] = RMSE.NN
}

Matrix.RMSE = do.call(cbind, List)
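
Because the list is indexed by the training-set size, which starts at 10, its first nine elements are empty, and `cbind` drops them when the matrix is built. The miniature sketch below illustrates the resulting layout with random numbers standing in for RMSE values; the names `results` and `m` are illustrative only.

```r
# Miniature version of the loop above: sizes 10..12, 4 repetitions each,
# with random numbers standing in for the RMSE values.
set.seed(1)
results <- list()
for (j in 10:12) {
  results[[j]] <- runif(4)
}

length(results)                  # 12: elements 1..9 are NULL placeholders
m <- do.call(cbind, results)     # cbind() ignores NULL arguments
dim(m)                           # 4 rows (repetitions), 3 columns (sizes 10, 11, 12)

# The column for training-set size j is therefore column j - 9.
all.equal(m[, 12 - 9], results[[12]])

# Column medians in base R, an alternative that does not need matrixStats.
apply(m, 2, median)
```

By the same rule, the matrix built above has 56 columns, and the column for a training set of length 65 is column 65 - 9 = 56.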

Study Accuracy of the simple Neural Network Model

Study Matrix.RMSE

The RMSE values can be accessed using the variable Matrix.RMSE. The size of the matrix is large; therefore we will try to make sense of the data through visualizations. First, we will prepare a boxplot for one of the columns in Matrix.RMSE, where training set has length equal to 65. One can prepare these box plots for each of the training set lengths (10 to 65).

The boxplot shows that the median RMSE across 100 samples when length of training set is fixed to 65 is 5.70.

## Prepare boxplot
boxplot(Matrix.RMSE[,56], ylab = "RMSE", main = "RMSE BoxPlot (length of training set = 65)")

Variation of median RMSE

The plot shows that the median RMSE of our model decreases as the length of the training set increases. This is an important result. The reader must remember that model accuracy depends on the length of the training set. The performance of the neural network model is sensitive to the training-test split.

library(matrixStats)

## Warning: package 'matrixStats' was built under R version 3.4.4

## 
## Attaching package: 'matrixStats'

## The following object is masked from 'package:plyr':
## 
##     count

med = colMedians(Matrix.RMSE)

X = seq(10,65)

plot (med~X, type = "l", xlab = "length of training set", ylab = "median RMSE", main = "Variation of RMSE with length of training set")

Simple Neural Network for Nutrition Rating Prediction

Janpu Hou

May 20, 2018