My data set is nutrition data about food, which I chose from "100+ Interesting Data Sets for Statistics". The data is provided by the USDA (United States Department of Agriculture).
https://www.ars.usda.gov/Services/docs.htm?docid=24912
The dataset contains 50 nutrient values for 8,618 different foods. It can be divided into several food groups, such as “Dairy and Egg Products”, “Pork Products”, “Vegetables and Vegetable Products”, “Beef Products”, and “Fast Foods”. For this project, I decided to focus on the nutrition of Fast Foods, because nowadays many people eat fast food quite often: it is convenient and it is often cheaper. However, fast food is not good for our health, and the majority of fast foods are very high in calories, which can lead to obesity, cardiovascular disease (CVD), and hyperlipidemia.
The fast food data set I generated has 304 observations and 5 variables. The first column is the short description of the food. The second column is the energy in kilocalories. The third column is protein (g). The fourth column is total lipid (g). The fifth column is carbohydrate (g).
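# Read the fast food subset (comma-separated, with a header row)
fastf.table <- read.table("fastfood.csv", header = TRUE, sep = ",")
# Attach the table so the columns can be referenced by name in the models below
attach(fastf.table)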
head(fastf.table)
## Shrt_Desc Energ_Kcal Protein_.g.
## 1 FAST FOODS BISCUIT W/ EGG 274 8.53
## 2 FAST FOODS,BISCUIT,W/EGG&BACON 305 11.33
## 3 FAST FOODS,BISCUIT,W/EGG&HAM 233 10.64
## 4 BREAKFAST ITEMS,BISCUIT W/EGG&SAUSAGE 312 11.13
## 5 FAST FOODS,BISCUIT W/ EGG & STEAK 277 12.12
## 6 FAST FOODS,BISCUIT,W/EGG,CHS,&BACON 301 12.01
## Lipid_Tot_.g. Carbohydrt_.g.
## 1 16.23 23.46
## 2 20.73 19.06
## 3 14.08 16.37
## 4 20.77 21.05
## 5 19.21 14.37
## 6 17.48 24.44
plot(fastf.table[,2:5])
My independent variables are protein (g), total lipid (g), and carbohydrate (g).
My dependent variable is energy (kcal).
My null hypothesis is that the variation in energy (kcal) is due to randomness and cannot be explained by any of the three independent variables (protein (g), total lipid (g), and carbohydrate (g)).
In this part, I build the linear model using three different methods: an entry-wise model, a hierarchical model, and a step-wise model.
The entry-wise model includes all of the independent variables in the model at the same time.
fastf1.lm <- lm(Energ_Kcal~Protein_.g.+Lipid_Tot_.g.+Carbohydrt_.g.)
summary(fastf1.lm)
##
## Call:
## lm(formula = Energ_Kcal ~ Protein_.g. + Lipid_Tot_.g. + Carbohydrt_.g.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.481 -18.069 -4.789 11.104 108.853
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.7836 6.2726 9.690 < 2e-16 ***
## Protein_.g. 2.5759 0.3169 8.127 8.63e-15 ***
## Lipid_Tot_.g. 9.4858 0.2594 36.570 < 2e-16 ***
## Carbohydrt_.g. 2.0889 0.1407 14.849 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.14 on 334 degrees of freedom
## Multiple R-squared: 0.8303, Adjusted R-squared: 0.8288
## F-statistic: 544.7 on 3 and 334 DF, p-value: < 2.2e-16
A hierarchical regression model enters the factors in a theoretically determined order. According to the Atwater factors, calories are calculated from the protein, fat, and carbohydrate values per 100 grams of food. This is the reason why I chose these three independent variables.
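To illustrate the Atwater idea, here is a minimal sketch that applies the Atwater general factors (4 kcal per gram of protein, 9 per gram of fat, 4 per gram of carbohydrate) to the six rows shown by head() earlier. The use of the general factors is my assumption here. The USDA values may be based on food-specific factors, so small differences from the reported energy are expected. The short column names are made up for this example.

```r
# First six rows of the fast food table, copied from the head() output
foods <- data.frame(
  kcal    = c(274, 305, 233, 312, 277, 301),
  protein = c(8.53, 11.33, 10.64, 11.13, 12.12, 12.01),
  lipid   = c(16.23, 20.73, 14.08, 20.77, 19.21, 17.48),
  carb    = c(23.46, 19.06, 16.37, 21.05, 14.37, 24.44)
)

# Assumption: Atwater general factors in kcal per gram
atwater <- c(protein = 4, lipid = 9, carb = 4)

# Estimated energy = 4 * protein + 9 * lipid + 4 * carbohydrate
foods$kcal_atwater <- as.vector(as.matrix(foods[, names(atwater)]) %*% atwater)

# Difference between the Atwater estimate and the reported energy
foods$difference <- foods$kcal_atwater - foods$kcal
print(foods)
```

For these six foods the estimate stays within a few kcal of the reported energy, which supports using the three nutrients as predictors.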
When I think about how to compute calories, what comes to my mind first is protein. I also think protein is usually the most important nutritional factor, so I include protein (g) in the model first.
Second, I feel that high-lipid foods are usually high-calorie foods, and fast foods in particular are usually high in lipid. So I include total lipid (g) second in my model.
Finally, I add carbohydrate (g) to my model, because fast foods may not be high in carbohydrate and because I am personally not concerned about the carbohydrate (g) in foods.
fastf2.lm1 <- lm(Energ_Kcal~Protein_.g.)
fastf2.lm2 <- lm(Energ_Kcal~Protein_.g.+Lipid_Tot_.g.)
fastf2.lm3 <- lm(Energ_Kcal~Protein_.g.+Lipid_Tot_.g.+Carbohydrt_.g.)
summary(fastf2.lm1)
##
## Call:
## lm(formula = Energ_Kcal ~ Protein_.g.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -212.12 -41.05 4.09 36.26 353.22
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 225.3853 8.7312 25.81 < 2e-16 ***
## Protein_.g. 2.1895 0.6636 3.30 0.00107 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 71.79 on 336 degrees of freedom
## Multiple R-squared: 0.03139, Adjusted R-squared: 0.0285
## F-statistic: 10.89 on 1 and 336 DF, p-value: 0.001072
summary(fastf2.lm2)
##
## Call:
## lm(formula = Energ_Kcal ~ Protein_.g. + Lipid_Tot_.g.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -153.075 -21.916 1.128 26.587 190.756
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 124.4243 5.8927 21.115 <2e-16 ***
## Protein_.g. 0.4453 0.3636 1.225 0.222
## Lipid_Tot_.g. 9.5360 0.3337 28.577 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.78 on 335 degrees of freedom
## Multiple R-squared: 0.7182, Adjusted R-squared: 0.7166
## F-statistic: 427 on 2 and 335 DF, p-value: < 2.2e-16
summary(fastf2.lm3)
##
## Call:
## lm(formula = Energ_Kcal ~ Protein_.g. + Lipid_Tot_.g. + Carbohydrt_.g.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.481 -18.069 -4.789 11.104 108.853
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.7836 6.2726 9.690 < 2e-16 ***
## Protein_.g. 2.5759 0.3169 8.127 8.63e-15 ***
## Lipid_Tot_.g. 9.4858 0.2594 36.570 < 2e-16 ***
## Carbohydrt_.g. 2.0889 0.1407 14.849 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.14 on 334 degrees of freedom
## Multiple R-squared: 0.8303, Adjusted R-squared: 0.8288
## F-statistic: 544.7 on 3 and 334 DF, p-value: < 2.2e-16
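In a hierarchical regression, the usual question is whether each added block improves the fit. One way to check this is an F-test between the nested models with anova(). The sketch below runs on simulated data, not the USDA values, so that it is self-contained. The column names mirror the fast food table, and the data-generating coefficients (4, 9, 4) and the noise level are assumptions made for illustration.

```r
set.seed(42)
n <- 200

# Simulated stand-in for the fast food table (NOT the USDA values)
sim <- data.frame(
  Protein_.g.    = runif(n, 2, 30),
  Lipid_Tot_.g.  = runif(n, 1, 35),
  Carbohydrt_.g. = runif(n, 5, 60)
)
sim$Energ_Kcal <- 4 * sim$Protein_.g. + 9 * sim$Lipid_Tot_.g. +
  4 * sim$Carbohydrt_.g. + rnorm(n, sd = 30)

# Enter the predictors in the same order as the hierarchical models
m1 <- lm(Energ_Kcal ~ Protein_.g., data = sim)
m2 <- update(m1, . ~ . + Lipid_Tot_.g.)
m3 <- update(m2, . ~ . + Carbohydrt_.g.)

# F-tests for the improvement at each step
steps <- anova(m1, m2, m3)
print(steps)

# Change in R-squared at each step
r2 <- sapply(list(m1, m2, m3), function(m) summary(m)$r.squared)
print(diff(r2))
```

With the real data, the same call would be anova(fastf2.lm1, fastf2.lm2, fastf2.lm3).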
The step-wise linear model is built by adding the variables in order of their correlation with the dependent variable.
cor(fastf.table[,2:5])
## Energ_Kcal Protein_.g. Lipid_Tot_.g. Carbohydrt_.g.
## Energ_Kcal 1.0000000 0.1771651 0.84674741 0.22659671
## Protein_.g. 0.1771651 1.0000000 0.16787646 -0.45626974
## Lipid_Tot_.g. 0.8467474 0.1678765 1.00000000 -0.06517789
## Carbohydrt_.g. 0.2265967 -0.4562697 -0.06517789 1.00000000
Looking at the correlations between the independent variables, I find that protein and carbohydrate have a relatively high correlation of -0.456. This could cause suppression in my model and could inflate or deflate the coefficients of some variables.
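One rough way to judge how much the correlated predictors could affect the coefficients is the variance inflation factor (VIF). The VIFs are the diagonal of the inverse of the predictors' correlation matrix. The sketch below uses only the correlations printed by cor(); the VIF check is an addition to the analysis, not part of the original models.

```r
vars <- c("Protein_.g.", "Lipid_Tot_.g.", "Carbohydrt_.g.")

# Correlations among the three predictors, copied from the cor() output
R_xx <- matrix(
  c( 1.00000000,  0.16787646, -0.45626974,
     0.16787646,  1.00000000, -0.06517789,
    -0.45626974, -0.06517789,  1.00000000),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

# VIF_j = 1 / (1 - R_j^2) = j-th diagonal element of the inverse correlation matrix
vif <- setNames(diag(solve(R_xx)), vars)
print(round(vif, 3))
```

Values close to 1 indicate little inflation. With the correlations reported here, all three values stay well below the common rule-of-thumb thresholds of 5 or 10.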
fastf3.lm1 <- lm(Energ_Kcal~Lipid_Tot_.g.)
summary(fastf3.lm1)
##
## Call:
## lm(formula = Energ_Kcal ~ Lipid_Tot_.g.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -160.517 -19.980 1.347 27.503 188.220
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 128.7909 4.6952 27.43 <2e-16 ***
## Lipid_Tot_.g. 9.6046 0.3292 29.18 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.81 on 336 degrees of freedom
## Multiple R-squared: 0.717, Adjusted R-squared: 0.7161
## F-statistic: 851.2 on 1 and 336 DF, p-value: < 2.2e-16
This is a simple linear regression model: energy (kcal) is explained by total lipid. b0 of this model is 128.7909, which means that when total lipid is 0, the predicted energy is 128.7909 kcal. b1 of this model is 9.6046, which shows that each one-gram increase in total lipid is associated with an increase of 9.6046 kcal in energy.
Adjusted R^2 is 0.7161, which means that total lipid explains about 71.61% of the variation in energy (kcal). Although total lipid explains a lot of the variation in calories, more independent variables are needed to explain the calories.
Then, for the significance test, the F-statistic of this model is 851.2 on 1 and 336 DF, and the p-value is < 2.2e-16. So we can reject the null hypothesis and say that total lipid can explain the variation in energy (kcal).
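The F-statistic can be reproduced from the R^2 and the degrees of freedom, which shows how the two quantities are linked. This is a small sketch using the rounded values printed in the summary, so the result matches the reported F only up to rounding.

```r
r2     <- 0.717  # Multiple R-squared of fastf3.lm1 (rounded, from the summary)
k      <- 1      # number of predictors
df_res <- 336    # residual degrees of freedom

# F = (R^2 / k) / ((1 - R^2) / df_res)
f_stat <- (r2 / k) / ((1 - r2) / df_res)

# Upper-tail probability of the F distribution with k and df_res degrees of freedom
p_val <- pf(f_stat, k, df_res, lower.tail = FALSE)

cat("F =", round(f_stat, 1), " p-value =", format(p_val, digits = 3), "\n")
```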
fastf3.lm2 <- lm(Energ_Kcal~Lipid_Tot_.g.+Carbohydrt_.g.)
summary(fastf3.lm2)
##
## Call:
## lm(formula = Energ_Kcal ~ Lipid_Tot_.g. + Carbohydrt_.g.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -144.730 -17.579 0.057 16.669 99.197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 96.6342 4.8734 19.83 <2e-16 ***
## Lipid_Tot_.g. 9.8138 0.2800 35.05 <2e-16 ***
## Carbohydrt_.g. 1.5713 0.1371 11.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.94 on 335 degrees of freedom
## Multiple R-squared: 0.7967, Adjusted R-squared: 0.7955
## F-statistic: 656.5 on 2 and 335 DF, p-value: < 2.2e-16
In this model, I use two independent variables: energy (kcal) is explained by total lipid and carbohydrate. b0 of this model is 96.6342, which means that when total lipid and carbohydrate are both 0, the predicted energy is 96.6342 kcal. b1 of this model is 9.8138, which shows that each one-gram increase in total lipid is associated with an increase of 9.8138 kcal in energy, with carbohydrate held constant. b2 of this model is 1.5713, which shows that each one-gram increase in carbohydrate is associated with an increase of 1.5713 kcal in energy, with total lipid held constant.
Adjusted R^2 is 0.7955, which is higher than the previous adjusted R^2 of 0.7161. This means that total lipid and carbohydrate together explain about 79.55% of the variation in energy (kcal). The R^2 is high, but more independent variables are needed to explain the calories.
Then, the F-statistic of this model is 656.5 on 2 and 335 DF, and the p-value of this significance test is < 2.2e-16. So we can reject the null hypothesis and say that total lipid and carbohydrate can explain the variation in energy (kcal).
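Whether adding carbohydrate improves on the lipid-only model can also be checked with an F-test on the change in R^2. The sketch below uses the rounded Multiple R-squared values from the two summaries. When a single variable is added, the square root of this F equals the absolute t value of the added variable, which gives a quick consistency check against the reported t value of 11.46.

```r
r2_small   <- 0.717   # Multiple R-squared, lipid only
r2_big     <- 0.7967  # Multiple R-squared, lipid + carbohydrate
added      <- 1       # number of predictors added
df_res_big <- 335     # residual degrees of freedom of the larger model

# F for the change in R^2 between the nested models
f_change <- ((r2_big - r2_small) / added) / ((1 - r2_big) / df_res_big)
p_change <- pf(f_change, added, df_res_big, lower.tail = FALSE)

cat("F change =", round(f_change, 2),
    " sqrt(F) =", round(sqrt(f_change), 2),
    " p-value =", format(p_change, digits = 3), "\n")
```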
fastf3.lm3 <- lm(Energ_Kcal~Lipid_Tot_.g.+Carbohydrt_.g.+Protein_.g.)
summary(fastf3.lm3)
##
## Call:
## lm(formula = Energ_Kcal ~ Lipid_Tot_.g. + Carbohydrt_.g. + Protein_.g.)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.481 -18.069 -4.789 11.104 108.853
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.7836 6.2726 9.690 < 2e-16 ***
## Lipid_Tot_.g. 9.4858 0.2594 36.570 < 2e-16 ***
## Carbohydrt_.g. 2.0889 0.1407 14.849 < 2e-16 ***
## Protein_.g. 2.5759 0.3169 8.127 8.63e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.14 on 334 degrees of freedom
## Multiple R-squared: 0.8303, Adjusted R-squared: 0.8288
## F-statistic: 544.7 on 3 and 334 DF, p-value: < 2.2e-16
Finally, I include all three independent variables: energy (kcal) is explained by total lipid, carbohydrate, and protein. b0 of this model is 60.7836, which means that when total lipid, carbohydrate, and protein are all 0, the predicted energy is 60.7836 kcal. b1 of this model is 9.4858, which shows that each one-gram increase in total lipid is associated with an increase of 9.4858 kcal in energy, with the other two variables held constant. b2 of this model is 2.0889, which shows that each one-gram increase in carbohydrate is associated with an increase of 2.0889 kcal. b3 of this model is 2.5759, which shows that each one-gram increase in protein is associated with an increase of 2.5759 kcal.
Adjusted R^2 is 0.8288, which is higher than the previous two adjusted R^2 values of 0.7161 and 0.7955. We can say that adding protein to the explanation of the calories of fast food makes sense. Total lipid, carbohydrate, and protein together explain about 82.88% of the variation in energy (kcal). Even after including the three factors, more independent variables are still needed to explain the calories.
The F-statistic of this model is 544.7 on 3 and 334 DF, and the p-value of this significance test is < 2.2e-16. So we can reject the null hypothesis and say that total lipid, carbohydrate, and protein can explain the variation in energy (kcal).
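The raw coefficients are all in kcal per gram, but the three nutrients vary by different amounts, so standardized coefficients are easier to compare. They can be recovered from the correlation matrix alone as beta = solve(R_xx) %*% r_xy. The sketch below uses only the values printed by cor(). The standardized view is an addition to the analysis, not something reported in the summaries.

```r
vars <- c("Protein_.g.", "Lipid_Tot_.g.", "Carbohydrt_.g.")

# Correlations among the predictors, copied from the cor() output
R_xx <- matrix(
  c( 1.00000000,  0.16787646, -0.45626974,
     0.16787646,  1.00000000, -0.06517789,
    -0.45626974, -0.06517789,  1.00000000),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

# Correlation of each predictor with Energ_Kcal
r_xy <- setNames(c(0.1771651, 0.84674741, 0.22659671), vars)

# Standardized regression coefficients
beta <- setNames(as.vector(solve(R_xx, r_xy)), vars)

# R^2 of the full model equals the sum of beta * r_xy
r2 <- sum(beta * r_xy)

print(round(rbind(correlation = r_xy, std_beta = beta), 3))
cat("R-squared:", round(r2, 4), "\n")
```

The R^2 recovered this way agrees with the Multiple R-squared of 0.8303. The standardized coefficients of protein and carbohydrate are larger than their simple correlations with energy, which is the pattern expected when two negatively correlated predictors suppress each other.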
plot(Lipid_Tot_.g.,Energ_Kcal,pch = 21, bg = 'blue')
abline(fastf3.lm1)
fastC.lm <- lm(Energ_Kcal~Carbohydrt_.g.)
plot(Carbohydrt_.g.,Energ_Kcal,pch = 21, bg = 'blue')
abline(fastC.lm)
plot(Protein_.g.,Energ_Kcal,pch = 21, bg = 'blue')
abline(fastf2.lm1)
I draw the 3D scatterplot using two independent variables: total lipid (g) and carbohydrate (g).
library(scatterplot3d)
md <- scatterplot3d(Lipid_Tot_.g.,Carbohydrt_.g.,Energ_Kcal,pch = 21, main = "Regression plane",bg = 'blue',xlab = "Lipid total(g)", ylab = "carbohydrate(g)", zlab = "Energy Kcal",axis = TRUE)
md$plane3d(fastf3.lm2)
confint(fastf3.lm3)
## 2.5 % 97.5 %
## (Intercept) 48.444902 73.122341
## Lipid_Tot_.g. 8.975585 9.996073
## Carbohydrt_.g. 1.812213 2.365649
## Protein_.g. 1.952415 3.199287
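These are the 95% confidence intervals for the coefficients: if the sampling were repeated many times, about 95% of the intervals built this way would contain the true coefficient. None of the intervals includes 0, which agrees with the significant t-tests in the summary.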
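The intervals from confint() can be reproduced by hand as estimate ± t × standard error, using the t quantile for 334 residual degrees of freedom. The sketch below takes the rounded estimates and standard errors from the summary of fastf3.lm3, so it matches the confint() output only up to rounding.

```r
# Estimates and standard errors copied from summary(fastf3.lm3)
coefs <- data.frame(
  estimate = c(60.7836, 9.4858, 2.0889, 2.5759),
  se       = c(6.2726, 0.2594, 0.1407, 0.3169),
  row.names = c("(Intercept)", "Lipid_Tot_.g.", "Carbohydrt_.g.", "Protein_.g.")
)
df_res <- 334

# Two-sided 95% interval: 2.5% in each tail of the t distribution
t_crit <- qt(0.975, df_res)

ci <- cbind(
  lower = coefs$estimate - t_crit * coefs$se,
  upper = coefs$estimate + t_crit * coefs$se
)
rownames(ci) <- rownames(coefs)
print(round(ci, 3))
```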
hist(residuals(fastf3.lm3), breaks = 15, freq = FALSE, main = "Histogram of residuals", xlab = "Residuals", ylab = "Density")
The histogram suggests that the residuals are slightly left-skewed and may or may not show excess kurtosis. The histogram does not show any obvious outliers.
boxplot(residuals(fastf3.lm3), main = "Boxplot of residuals")
A boxplot is a good tool for seeing whether there are outliers, and it can also help us see whether the distribution is skewed. The boxplot shows that there are many outliers among the residuals. It also shows that the residuals are only very slightly left-skewed.
par(mfrow = c(1,1))
qqnorm(residuals(fastf3.lm3),main = "Q-Q Plot")
qqline(residuals(fastf3.lm3))
The Q-Q plot looks linear from the -2 quantile to the 1 quantile. However, there is very large curvature from the -3 to the -2 quantile and from the 1 to the 3 quantile. I think this may be because there are many outliers.
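The visual impression from the plots can be backed up with numbers. The boxplot rule can be applied directly to count the flagged points, and a Shapiro-Wilk test gives a formal check of normality. The sketch below runs on hypothetical heavy-tailed values generated with rt(), not on the actual residuals. The sample size, scale, and degrees of freedom are arbitrary assumptions.

```r
set.seed(7)

# Hypothetical heavy-tailed residuals standing in for residuals(fastf3.lm3)
res <- 20 * rt(300, df = 3)

# Points beyond 1.5 * IQR from the hinges, the rule boxplot() uses to draw outliers
flagged <- boxplot.stats(res)$out
cat("Flagged by the boxplot rule:", length(flagged), "of", length(res), "\n")

# Shapiro-Wilk test: a small p-value suggests a departure from normality
sw <- shapiro.test(res)
print(sw)
```

With the real model, res would be replaced by residuals(fastf3.lm3).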