Get health care products you can count on is always import for consumers. In this simple example, We like to recommend a choices of the healthy cereals for nutrition rating. Ratings are based on important features such as calories, protein, fat, sodium, fiber, etc.
In statistics, linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called simple linear regression.
General trend were observed, such as higher sugar less rating, higher calories less rating, and higher fiber higher rating. The relationship can be established as: Rating = -0.223 x calories + 3.27 x protein - 1.69 x fat - 0.0054 x sodium + 3.44 x fiber + 1.09 x carbo -0.725 x sugars -0.034 x potass -0.051 x vitamins
DASL (pronounced “dazzle”) is an online library of datafiles and stories that illustrate the use of basic statistics methods. DASL is part of larger effort to enhance the teaching of statistics using computers. From many grocery stores, 77 type of cereals was found with rating:http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html.
# Read the contents of the page into a vector of character strings with the readLines function:
thepage = readLines('http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html')
# If you look at the web page, you'll see that the title "The Data:" is right above the data we want.
# We can locate this line using the grep function:
grep('The Data:',thepage)
## [1] 40
text=thepage[42:119]
# Based on the previous step, the data that we want is always preceded by the HTML tag
# "<td class="row-text»", and followed by "</td>". Let's grab all the lines that have that pattern:
y <- strsplit(text, "\t")
df <- data.frame(matrix(unlist(y), nrow=78, byrow=T))
# Change column name to the correct one
colnames(df) <- as.character(unlist(df[1,]))
df1 = df[-1,]
head(df1)
## name mfr type calories protein fat sodium fiber
## 2 100%_Bran N C 70 4 1 130 10
## 3 100%_Natural_Bran Q C 120 3 5 15 2
## 4 All-Bran K C 70 4 1 260 9
## 5 All-Bran_with_Extra_Fiber K C 50 4 0 140 14
## 6 Almond_Delight R C 110 2 2 200 1
## 7 Apple_Cinnamon_Cheerios G C 110 2 2 180 1.5
## carbo sugars potass vitamins shelf weight cups rating
## 2 5 6 280 25 3 1 0.33 68.402973
## 3 8 8 135 0 3 1 1 33.983679
## 4 7 5 320 25 3 1 0.33 59.425505
## 5 8 0 330 25 3 1 0.5 93.704912
## 6 14 8 -1 25 3 1 0.75 34.384843
## 7 10.5 10 70 25 1 1 0.75 29.509541
write.csv(df1, file = "D:/R_Files/Cereal_Data.csv")
total_data= read.csv("D:/R_Files/Cereal_Data.csv", header=T)
Tdata <- total_data[,5:17]
head(Tdata)
## calories protein fat sodium fiber carbo sugars potass vitamins shelf
## 1 70 4 1 130 10.0 5.0 6 280 25 3
## 2 120 3 5 15 2.0 8.0 8 135 0 3
## 3 70 4 1 260 9.0 7.0 5 320 25 3
## 4 50 4 0 140 14.0 8.0 0 330 25 3
## 5 110 2 2 200 1.0 14.0 8 -1 25 3
## 6 110 2 2 180 1.5 10.5 10 70 25 1
## weight cups rating
## 1 1 0.33 68.40297
## 2 1 1.00 33.98368
## 3 1 0.33 59.42551
## 4 1 0.50 93.70491
## 5 1 0.75 34.38484
## 6 1 0.75 29.50954
mean(Tdata$rating)
## [1] 42.6657
boxplot(Tdata$rating, main = "Cereal Rating Distribution", ylab = "Cereal Rating")
hist(Tdata$rating, main = "Cereal Rating Distribution",ylim=c(0,30),xlim=c(0,100),xlab = "Ceral Rating", ylab = "Frequency")
mean=mean(Tdata$rating)
sd=sd(Tdata$rating)
set.seed(80)
x <- seq(0,100)
hx <- dnorm(x,mean,sd)
y <- hx/max(hx)*30
lines(x, y,col="blue",lwd=2,lty=2)
abline(v=mean,col="red",lwd=4,lty=2)
plot(Tdata$sugars,Tdata$rating, pch = 16, cex = 1.3, col = "blue", main = "Cereal Rating vs Sugar Content", xlab = "Sugar Content (g)", ylab = "Cereal Rating")
abline(lm(Tdata$rating ~ Tdata$sugars),col="red")
plot(Tdata$calories,Tdata$rating, pch = 16, cex = 1.3, col = "blue", main = "Cereal Rating vs Calories per serving", xlab = "Calories (per serving)", ylab = "Cereal Rating")
abline(lm(Tdata$rating ~ Tdata$calories),col="red")
plot(Tdata$fiber,Tdata$rating, pch = 16, cex = 1.3, col = "blue", main = "Cereal Rating vs Dietary Fiber", xlab = "Dietary Fiber(g)", ylab = "Cereal Rating")
abline(lm(Tdata$rating ~ Tdata$fiber),col="blue")
# Random sampling
samplesize = 0.60 * nrow(Tdata)
set.seed(80)
index = sample( seq_len ( nrow ( Tdata ) ), size = samplesize )
# Create training and test set
datatrain = Tdata[ index, ]
datatest = Tdata[ -index, ]
lm.fit <- glm(rating~., data=datatrain)
summary(lm.fit)
##
## Call:
## glm(formula = rating ~ ., data = datatrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.554e-07 -1.446e-07 -2.890e-09 1.964e-07 5.085e-07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.493e+01 4.613e-07 1.191e+08 <2e-16 ***
## calories -2.227e-01 9.241e-09 -2.410e+07 <2e-16 ***
## protein 3.273e+00 7.096e-08 4.613e+07 <2e-16 ***
## fat -1.691e+00 9.665e-08 -1.750e+07 <2e-16 ***
## sodium -5.449e-02 5.878e-10 -9.271e+07 <2e-16 ***
## fiber 3.443e+00 5.663e-08 6.081e+07 <2e-16 ***
## carbo 1.092e+00 4.640e-08 2.354e+07 <2e-16 ***
## sugars -7.249e-01 4.337e-08 -1.671e+07 <2e-16 ***
## potass -3.399e-02 1.844e-09 -1.843e+07 <2e-16 ***
## vitamins -5.121e-02 3.204e-09 -1.598e+07 <2e-16 ***
## shelf -3.314e-08 5.998e-08 -5.530e-01 0.584
## weight 2.649e-08 7.365e-07 3.600e-02 0.972
## cups 4.082e-07 2.532e-07 1.612e+00 0.116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 7.696234e-14)
##
## Null deviance: 8.2821e+03 on 45 degrees of freedom
## Residual deviance: 2.5398e-12 on 33 degrees of freedom
## AIC: -1245.7
##
## Number of Fisher Scoring iterations: 1
pr.lm <- predict(lm.fit,datatest)
MSE.lm <- sum((pr.lm - datatest$rating)^2)/nrow(datatest)
MSE.lm
## [1] 1.747757e-13
Just pick first 5 features(variables) to demonstrate this basic analysis.
## Creating index variable
# Simplify Data take first 5 variables
data=Tdata[c(1:5,13)]
plot(data)
# Random sampling
samplesize = 0.60 * nrow(data)
set.seed(80)
index = sample( seq_len ( nrow ( data ) ), size = samplesize )
# Create training and test set
datatrain = data[ index, ]
datatest = data[ -index, ]
lm.fit <- glm(rating~., data=datatrain)
summary(lm.fit)
##
## Call:
## glm(formula = rating ~ ., data = datatrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -10.5166 -3.5864 -0.4027 3.6199 9.6991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.57180 5.58950 11.910 9.91e-15 ***
## calories -0.27718 0.05630 -4.923 1.51e-05 ***
## protein 5.73881 1.02054 5.623 1.60e-06 ***
## fat -4.84089 1.12240 -4.313 0.000102 ***
## sodium -0.04282 0.01052 -4.071 0.000215 ***
## fiber 1.48288 0.55914 2.652 0.011414 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 33.9492)
##
## Null deviance: 8282.1 on 45 degrees of freedom
## Residual deviance: 1358.0 on 40 degrees of freedom
## AIC: 300.26
##
## Number of Fisher Scoring iterations: 2
pr.lm <- predict(lm.fit,datatest)
MSE.lm <- sum((pr.lm - datatest$rating)^2)/nrow(datatest)
MSE.lm
## [1] 39.03381
## Normalized data to the same scale
max = apply(data , 2 , max)
min = apply(data, 2 , min)
scaled = as.data.frame(scale(data, center = min, scale = max - min))
# creating training and test set
traindata = scaled[index , ]
testdata = scaled[-index , ]
lm.fit <- glm(rating~., data=traindata)
summary(lm.fit)
##
## Call:
## glm(formula = rating ~ ., data = traindata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.138994 -0.047401 -0.005322 0.047843 0.128190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.53407 0.04161 12.834 9.14e-16 ***
## calories -0.40297 0.08185 -4.923 1.51e-05 ***
## protein 0.37924 0.06744 5.623 1.60e-06 ***
## fat -0.31990 0.07417 -4.313 0.000102 ***
## sodium -0.18109 0.04448 -4.071 0.000215 ***
## fiber 0.27438 0.10346 2.652 0.011414 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.005930253)
##
## Null deviance: 1.44671 on 45 degrees of freedom
## Residual deviance: 0.23721 on 40 degrees of freedom
## AIC: -97.76
##
## Number of Fisher Scoring iterations: 2
pr.lm <- predict(lm.fit,testdata)
MSE.lm <- sum((pr.lm - testdata$rating)^2)/nrow(testdata)
MSE.lm
## [1] 0.006818434