Linear Regression for Nutrition Rating Prediction

Background

Get health care products you can count on is always import for consumers. In this simple example, We like to recommend a choices of the healthy cereals for nutrition rating. Ratings are based on important features such as calories, protein, fat, sodium, fiber, etc.

Method

In statistics, linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called simple linear regression.

Result

General trend were observed, such as higher sugar less rating, higher calories less rating, and higher fiber higher rating. The relationship can be established as: Rating = -0.223 x calories + 3.27 x protein - 1.69 x fat - 0.0054 x sodium + 3.44 x fiber + 1.09 x carbo -0.725 x sugars -0.034 x potass -0.051 x vitamins

Import and Prepare Data to Train Model

Load Nutition Data from Webpage

DASL (pronounced “dazzle”) is an online library of datafiles and stories that illustrate the use of basic statistics methods. DASL is part of larger effort to enhance the teaching of statistics using computers. From many grocery stores, 77 type of cereals was found with rating:http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html.

# Read the contents of the page into a vector of character strings with the readLines function:
thepage = readLines('http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html')

# If you look at the web page, you'll see that the title "The Data:" is right above the data we want. 
# We can locate this line using the grep function:
grep('The Data:',thepage)
## [1] 40
text=thepage[42:119]

# Based on the previous step, the data that we want is always preceded by the HTML tag 
# "<td class="row-text»", and followed by "</td>". Let's grab all the lines that have that pattern:

y <- strsplit(text, "\t")
df <- data.frame(matrix(unlist(y), nrow=78, byrow=T))

# Change column name to the correct one
colnames(df) <- as.character(unlist(df[1,]))
df1 = df[-1,]
head(df1)
##                        name mfr type calories protein fat sodium fiber
## 2                 100%_Bran   N    C       70       4   1    130    10
## 3         100%_Natural_Bran   Q    C      120       3   5     15     2
## 4                  All-Bran   K    C       70       4   1    260     9
## 5 All-Bran_with_Extra_Fiber   K    C       50       4   0    140    14
## 6            Almond_Delight   R    C      110       2   2    200     1
## 7   Apple_Cinnamon_Cheerios   G    C      110       2   2    180   1.5
##   carbo sugars potass vitamins shelf weight cups    rating
## 2     5      6    280       25     3      1 0.33 68.402973
## 3     8      8    135        0     3      1    1 33.983679
## 4     7      5    320       25     3      1 0.33 59.425505
## 5     8      0    330       25     3      1  0.5 93.704912
## 6    14      8     -1       25     3      1 0.75 34.384843
## 7  10.5     10     70       25     1      1 0.75 29.509541
write.csv(df1, file = "D:/R_Files/Cereal_Data.csv")

total_data= read.csv("D:/R_Files/Cereal_Data.csv", header=T)

Tdata <- total_data[,5:17]

head(Tdata)
##   calories protein fat sodium fiber carbo sugars potass vitamins shelf
## 1       70       4   1    130  10.0   5.0      6    280       25     3
## 2      120       3   5     15   2.0   8.0      8    135        0     3
## 3       70       4   1    260   9.0   7.0      5    320       25     3
## 4       50       4   0    140  14.0   8.0      0    330       25     3
## 5      110       2   2    200   1.0  14.0      8     -1       25     3
## 6      110       2   2    180   1.5  10.5     10     70       25     1
##   weight cups   rating
## 1      1 0.33 68.40297
## 2      1 1.00 33.98368
## 3      1 0.33 59.42551
## 4      1 0.50 93.70491
## 5      1 0.75 34.38484
## 6      1 0.75 29.50954

Explortary Analysis on the Data

mean(Tdata$rating)
## [1] 42.6657
boxplot(Tdata$rating, main = "Cereal Rating Distribution", ylab = "Cereal Rating")

hist(Tdata$rating, main = "Cereal Rating Distribution",ylim=c(0,30),xlim=c(0,100),xlab = "Ceral Rating", ylab = "Frequency")
mean=mean(Tdata$rating)
sd=sd(Tdata$rating)
set.seed(80)
x <- seq(0,100)
hx <- dnorm(x,mean,sd)
y <- hx/max(hx)*30
lines(x, y,col="blue",lwd=2,lty=2) 
abline(v=mean,col="red",lwd=4,lty=2)

plot(Tdata$sugars,Tdata$rating, pch = 16, cex = 1.3, col = "blue", main = "Cereal Rating vs Sugar Content", xlab = "Sugar Content (g)", ylab = "Cereal Rating")

abline(lm(Tdata$rating ~ Tdata$sugars),col="red")

plot(Tdata$calories,Tdata$rating, pch = 16, cex = 1.3, col = "blue", main = "Cereal Rating vs Calories per serving", xlab = "Calories (per serving)", ylab = "Cereal Rating")
abline(lm(Tdata$rating ~ Tdata$calories),col="red")

plot(Tdata$fiber,Tdata$rating, pch = 16, cex = 1.3, col = "blue", main = "Cereal Rating vs Dietary Fiber", xlab = "Dietary Fiber(g)", ylab = "Cereal Rating")
abline(lm(Tdata$rating ~ Tdata$fiber),col="blue")

Linear Regression Model fit

Split data to train and test set

# Random sampling
samplesize = 0.60 * nrow(Tdata)
set.seed(80)
index = sample( seq_len ( nrow ( Tdata ) ), size = samplesize )

# Create training and test set
datatrain = Tdata[ index, ]
datatest = Tdata[ -index, ]

Linear Regression Model fit for train set

lm.fit <- glm(rating~., data=datatrain)
summary(lm.fit)
## 
## Call:
## glm(formula = rating ~ ., data = datatrain)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -5.554e-07  -1.446e-07  -2.890e-09   1.964e-07   5.085e-07  
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)  5.493e+01  4.613e-07  1.191e+08   <2e-16 ***
## calories    -2.227e-01  9.241e-09 -2.410e+07   <2e-16 ***
## protein      3.273e+00  7.096e-08  4.613e+07   <2e-16 ***
## fat         -1.691e+00  9.665e-08 -1.750e+07   <2e-16 ***
## sodium      -5.449e-02  5.878e-10 -9.271e+07   <2e-16 ***
## fiber        3.443e+00  5.663e-08  6.081e+07   <2e-16 ***
## carbo        1.092e+00  4.640e-08  2.354e+07   <2e-16 ***
## sugars      -7.249e-01  4.337e-08 -1.671e+07   <2e-16 ***
## potass      -3.399e-02  1.844e-09 -1.843e+07   <2e-16 ***
## vitamins    -5.121e-02  3.204e-09 -1.598e+07   <2e-16 ***
## shelf       -3.314e-08  5.998e-08 -5.530e-01    0.584    
## weight       2.649e-08  7.365e-07  3.600e-02    0.972    
## cups         4.082e-07  2.532e-07  1.612e+00    0.116    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 7.696234e-14)
## 
##     Null deviance: 8.2821e+03  on 45  degrees of freedom
## Residual deviance: 2.5398e-12  on 33  degrees of freedom
## AIC: -1245.7
## 
## Number of Fisher Scoring iterations: 1

Linear Regression Model test prediction on test set

pr.lm <- predict(lm.fit,datatest)
MSE.lm <- sum((pr.lm - datatest$rating)^2)/nrow(datatest)
MSE.lm
## [1] 1.747757e-13

Simplify Data for Basic Rating Prediction

Just pick first 5 features(variables) to demonstrate this basic analysis.

Split Simplify Data to Train and Test set

## Creating index variable 

# Simplify Data take first 5 variables
data=Tdata[c(1:5,13)]
plot(data)

# Random sampling
samplesize = 0.60 * nrow(data)
set.seed(80)
index = sample( seq_len ( nrow ( data ) ), size = samplesize )

# Create training and test set
datatrain = data[ index, ]
datatest = data[ -index, ]

Without normalizing data

Linear Regression Model fit on train Data

lm.fit <- glm(rating~., data=datatrain)
summary(lm.fit)
## 
## Call:
## glm(formula = rating ~ ., data = datatrain)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -10.5166   -3.5864   -0.4027    3.6199    9.6991  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 66.57180    5.58950  11.910 9.91e-15 ***
## calories    -0.27718    0.05630  -4.923 1.51e-05 ***
## protein      5.73881    1.02054   5.623 1.60e-06 ***
## fat         -4.84089    1.12240  -4.313 0.000102 ***
## sodium      -0.04282    0.01052  -4.071 0.000215 ***
## fiber        1.48288    0.55914   2.652 0.011414 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 33.9492)
## 
##     Null deviance: 8282.1  on 45  degrees of freedom
## Residual deviance: 1358.0  on 40  degrees of freedom
## AIC: 300.26
## 
## Number of Fisher Scoring iterations: 2

Using Linear Regression Model to predict on test Data

pr.lm <- predict(lm.fit,datatest)
MSE.lm <- sum((pr.lm - datatest$rating)^2)/nrow(datatest)
MSE.lm
## [1] 39.03381

With Normalizing Data

## Normalized data to the same scale

max = apply(data , 2 , max)
min = apply(data, 2 , min)
scaled = as.data.frame(scale(data, center = min, scale = max - min))
# creating training and test set
traindata = scaled[index , ]
testdata = scaled[-index , ]

Linear Regression Model fit on train Data

lm.fit <- glm(rating~., data=traindata)
summary(lm.fit)
## 
## Call:
## glm(formula = rating ~ ., data = traindata)
## 
## Deviance Residuals: 
##       Min         1Q     Median         3Q        Max  
## -0.138994  -0.047401  -0.005322   0.047843   0.128190  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.53407    0.04161  12.834 9.14e-16 ***
## calories    -0.40297    0.08185  -4.923 1.51e-05 ***
## protein      0.37924    0.06744   5.623 1.60e-06 ***
## fat         -0.31990    0.07417  -4.313 0.000102 ***
## sodium      -0.18109    0.04448  -4.071 0.000215 ***
## fiber        0.27438    0.10346   2.652 0.011414 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.005930253)
## 
##     Null deviance: 1.44671  on 45  degrees of freedom
## Residual deviance: 0.23721  on 40  degrees of freedom
## AIC: -97.76
## 
## Number of Fisher Scoring iterations: 2

Using Linear Regression Model to predict on test Data

pr.lm <- predict(lm.fit,testdata)
MSE.lm <- sum((pr.lm - testdata$rating)^2)/nrow(testdata)
MSE.lm
## [1] 0.006818434