Dataset

My dataset is nutrition data on foods, which I chose from the list “100+ Interesting Data Sets for Statistics”. The data are provided by the USDA (United States Department of Agriculture).

https://www.ars.usda.gov/Services/docs.htm?docid=24912

The dataset contains the amounts of 50 nutrients in 8,618 different foods. It can be divided into several food groups, such as “Dairy and Egg Products”, “Pork Products”, “Vegetables and Vegetable Products”, “Beef Products”, and “Fast Foods”. For this project, I decided to focus on the nutrition of Fast Foods, because many people now eat fast food quite often: it is convenient and it is cheap. However, fast food is not good for our health. Most fast foods are very high in calories, and a high-calorie diet can lead to obesity, cardiovascular disease (CVD), and hyperlipidemia.

The fast-food subset that I generated has 338 observations and 5 variables. The first column is the short description of the food. The second column is the energy in kilocalories. The third column is protein (g). The fourth column is total lipid (g). The fifth column is carbohydrate (g).

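# Read the fast-food subset; the first row of the CSV holds the column names
fastf.table <- read.table("fastfood.csv", header = TRUE, sep = ",")
# Attach the table so that later calls can refer to its columns by name
attach(fastf.table)
head(fastf.table)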
##                               Shrt_Desc Energ_Kcal Protein_.g.
## 1           FAST FOODS  BISCUIT  W/ EGG        274        8.53
## 2        FAST FOODS,BISCUIT,W/EGG&BACON        305       11.33
## 3          FAST FOODS,BISCUIT,W/EGG&HAM        233       10.64
## 4 BREAKFAST ITEMS,BISCUIT W/EGG&SAUSAGE        312       11.13
## 5     FAST FOODS,BISCUIT W/ EGG & STEAK        277       12.12
## 6   FAST FOODS,BISCUIT,W/EGG,CHS,&BACON        301       12.01
##   Lipid_Tot_.g. Carbohydrt_.g.
## 1         16.23          23.46
## 2         20.73          19.06
## 3         14.08          16.37
## 4         20.77          21.05
## 5         19.21          14.37
## 6         17.48          24.44
plot(fastf.table[,2:5])

Independent variables

My independent variables are: Protein (g), Lipid Total (g), and Carbohydrate (g).

Dependent variable

My dependent variable is Energy Kilocalorie.

Null hypothesis \(H_0\)

My null hypothesis is that the variation in Energy Kilocalorie is due to randomness and cannot be explained by any of the three independent variables (Protein (g), Lipid Total (g), and Carbohydrate (g)).

Multiple Linear model

In this part, I build the linear model with three different methods: the entry-wise model, the hierarchical model, and the step-wise model.

Entry-wise

The entry-wise model includes all of the independent variables in the model at the same time.

fastf1.lm <- lm(Energ_Kcal~Protein_.g.+Lipid_Tot_.g.+Carbohydrt_.g.)
summary(fastf1.lm)
## 
## Call:
## lm(formula = Energ_Kcal ~ Protein_.g. + Lipid_Tot_.g. + Carbohydrt_.g.)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.481 -18.069  -4.789  11.104 108.853 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     60.7836     6.2726   9.690  < 2e-16 ***
## Protein_.g.      2.5759     0.3169   8.127 8.63e-15 ***
## Lipid_Tot_.g.    9.4858     0.2594  36.570  < 2e-16 ***
## Carbohydrt_.g.   2.0889     0.1407  14.849  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.14 on 334 degrees of freedom
## Multiple R-squared:  0.8303, Adjusted R-squared:  0.8288 
## F-statistic: 544.7 on 3 and 334 DF,  p-value: < 2.2e-16

Hierarchical

A hierarchical regression model enters the independent variables in a theoretically determined order. According to the Atwater factors, the calories in a food are calculated from its protein, fat, and carbohydrate contents, with the values given per 100 grams of food. This is why I chose these three independent variables.
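As a rough illustration of the Atwater calculation, the sketch below applies the general Atwater factors (4 kcal per gram of protein, 9 kcal per gram of fat, 4 kcal per gram of carbohydrate) to the first two rows printed by head(fastf.table). The object names atwater, foods, and estimated_kcal are made up for this example, and the factors are the standard general values rather than anything taken from the USDA file itself.

```r
# General Atwater factors in kcal per gram (standard general values,
# assumed here; the USDA file may use food-specific factors).
atwater <- c(protein = 4, lipid = 9, carb = 4)

# Macronutrient amounts (g) for the first two rows shown by head(fastf.table).
foods <- rbind(
  biscuit_egg       = c(protein = 8.53,  lipid = 16.23, carb = 23.46),
  biscuit_egg_bacon = c(protein = 11.33, lipid = 20.73, carb = 19.06)
)

# Grams times kcal-per-gram, summed over the three nutrients.
estimated_kcal <- drop(foods %*% atwater)
print(round(estimated_kcal, 1))
```

The estimates come out at about 274.0 and 308.1 kcal, against the 274 and 305 kcal recorded in the table, so the three nutrients account for nearly all of the energy in these two foods.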

When I think about how to compute calories, what comes to mind first is protein. I also think protein is usually the most important nutritional factor, so I include Protein (g) in the model first.

Second, I feel that high-lipid foods are usually high-calorie foods, and fast foods in particular are usually high in lipid. So I include Lipid total (g) second in my model.

Finally, I add Carbohydrate (g) to my model. This is because fast foods may not be high in carbohydrate. Also, I personally am not concerned about the carbohydrate in foods.

fastf2.lm1 <- lm(Energ_Kcal~Protein_.g.)
fastf2.lm2 <- lm(Energ_Kcal~Protein_.g.+Lipid_Tot_.g.)
fastf2.lm3 <- lm(Energ_Kcal~Protein_.g.+Lipid_Tot_.g.+Carbohydrt_.g.)

summary(fastf2.lm1)
## 
## Call:
## lm(formula = Energ_Kcal ~ Protein_.g.)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -212.12  -41.05    4.09   36.26  353.22 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 225.3853     8.7312   25.81  < 2e-16 ***
## Protein_.g.   2.1895     0.6636    3.30  0.00107 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 71.79 on 336 degrees of freedom
## Multiple R-squared:  0.03139,    Adjusted R-squared:  0.0285 
## F-statistic: 10.89 on 1 and 336 DF,  p-value: 0.001072
summary(fastf2.lm2)
## 
## Call:
## lm(formula = Energ_Kcal ~ Protein_.g. + Lipid_Tot_.g.)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -153.075  -21.916    1.128   26.587  190.756 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   124.4243     5.8927  21.115   <2e-16 ***
## Protein_.g.     0.4453     0.3636   1.225    0.222    
## Lipid_Tot_.g.   9.5360     0.3337  28.577   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.78 on 335 degrees of freedom
## Multiple R-squared:  0.7182, Adjusted R-squared:  0.7166 
## F-statistic:   427 on 2 and 335 DF,  p-value: < 2.2e-16
summary(fastf2.lm3)
## 
## Call:
## lm(formula = Energ_Kcal ~ Protein_.g. + Lipid_Tot_.g. + Carbohydrt_.g.)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.481 -18.069  -4.789  11.104 108.853 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     60.7836     6.2726   9.690  < 2e-16 ***
## Protein_.g.      2.5759     0.3169   8.127 8.63e-15 ***
## Lipid_Tot_.g.    9.4858     0.2594  36.570  < 2e-16 ***
## Carbohydrt_.g.   2.0889     0.1407  14.849  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.14 on 334 degrees of freedom
## Multiple R-squared:  0.8303, Adjusted R-squared:  0.8288 
## F-statistic: 544.7 on 3 and 334 DF,  p-value: < 2.2e-16
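One common way to judge each step of a hierarchical model is to compare the nested fits directly. The sketch below does this on simulated data, because fastfood.csv is not reproduced here. The data frame sim, its column names, and the Atwater-style weights used to generate it are assumptions for illustration only. With the real data, the same anova() call could be applied to fastf2.lm1, fastf2.lm2, and fastf2.lm3.

```r
# Simulated stand-in for the fast-food table (NOT the USDA data).
set.seed(1)
n <- 200
sim <- data.frame(
  protein = runif(n, 2, 25),
  lipid   = runif(n, 1, 30),
  carb    = runif(n, 5, 50)
)
# Energy generated from Atwater-style weights plus noise (an assumption).
sim$kcal <- 4 * sim$protein + 9 * sim$lipid + 4 * sim$carb + rnorm(n, sd = 20)

# Nested models, entered in the same order as in the hierarchical section.
m1 <- lm(kcal ~ protein, data = sim)
m2 <- lm(kcal ~ protein + lipid, data = sim)
m3 <- lm(kcal ~ protein + lipid + carb, data = sim)

# Rows 2 and 3 test whether the predictor added at that step reduces the
# residual sum of squares by more than chance alone would.
step_tests <- anova(m1, m2, m3)
print(step_tests)

# R-squared gained at each step.
r2 <- sapply(list(m1, m2, m3), function(m) summary(m)$r.squared)
r2_change <- diff(r2)
print(r2_change)
```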

Step-wise

A step-wise linear model is built by adding the independent variables one at a time, in order of how strongly each is correlated with the dependent variable.

cor(fastf.table[,2:5])
##                Energ_Kcal Protein_.g. Lipid_Tot_.g. Carbohydrt_.g.
## Energ_Kcal      1.0000000   0.1771651    0.84674741     0.22659671
## Protein_.g.     0.1771651   1.0000000    0.16787646    -0.45626974
## Lipid_Tot_.g.   0.8467474   0.1678765    1.00000000    -0.06517789
## Carbohydrt_.g.  0.2265967  -0.4562697   -0.06517789     1.00000000
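The entry order for the step-wise model can be read off the first row of this matrix. A minimal sketch, with the three correlations copied by hand from the output above (r_energy and entry_order are names chosen for this example):

```r
# Correlations with Energ_Kcal, copied from the first row of the matrix above.
r_energy <- c(
  Protein_.g.    = 0.1771651,
  Lipid_Tot_.g.  = 0.84674741,
  Carbohydrt_.g. = 0.22659671
)

# Strongest absolute correlation first.
entry_order <- names(sort(abs(r_energy), decreasing = TRUE))
print(entry_order)
```

This gives lipid first, then carbohydrate, then protein.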

Among the independent variables, protein and carbohydrate have the strongest correlation, -0.456. This could cause suppression in my model and could inflate or deflate the coefficients of some variables.
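One way to gauge how much these correlations could disturb the coefficients is the variance inflation factor (VIF). This diagnostic is not part of the analysis above; it is added here only as a sketch. It relies on the identity that the VIFs are the diagonal of the inverse of the predictor correlation matrix, with the correlations copied from the output above.

```r
# Correlation matrix of the three predictors, copied from cor(fastf.table[,2:5]).
predictors <- c("Protein_.g.", "Lipid_Tot_.g.", "Carbohydrt_.g.")
R <- matrix(c(
   1.00000000,  0.16787646, -0.45626974,
   0.16787646,  1.00000000, -0.06517789,
  -0.45626974, -0.06517789,  1.00000000
), nrow = 3, byrow = TRUE, dimnames = list(predictors, predictors))

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry of solve(R).
vif <- diag(solve(R))
print(round(vif, 2))
```

All three values come out below 1.5, which suggests that the standard errors are only mildly inflated even though the protein coefficient changes noticeably between models.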

fastf3.lm1 <- lm(Energ_Kcal~Lipid_Tot_.g.)
summary(fastf3.lm1)
## 
## Call:
## lm(formula = Energ_Kcal ~ Lipid_Tot_.g.)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -160.517  -19.980    1.347   27.503  188.220 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   128.7909     4.6952   27.43   <2e-16 ***
## Lipid_Tot_.g.   9.6046     0.3292   29.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.81 on 336 degrees of freedom
## Multiple R-squared:  0.717,  Adjusted R-squared:  0.7161 
## F-statistic: 851.2 on 1 and 336 DF,  p-value: < 2.2e-16

Interpretation

This is a simple linear regression model: Energy Kcal is explained by Lipid total. The intercept b0 of this model is 128.7909, which means that when lipid total is 0, the predicted energy is 128.7909 kcal. The slope b1 is 9.6046, which shows that each one-gram increase in lipid total is associated with an increase of 9.6046 kcal.

Adjusted R^2 is 0.7161, which means that lipid total explains about 71.61% of the variation in energy Kcal. Although lipid total explains much of the variation in calories, more independent variables are needed to explain the rest.

For the significance test, the F-statistic of this model is 851.2 on 1 and 336 DF, and the p-value is < 2.2e-16. So we can reject the null hypothesis and say that lipid total can explain the variation in energy Kcal.
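To make the coefficients concrete, the sketch below plugs the lipid content of the first food in the table (16.23 g, recorded energy 274 kcal) into the fitted line, using the estimates printed above. The helper predict_kcal is a name invented for this example.

```r
# Estimates copied from summary(fastf3.lm1).
b0 <- 128.7909
b1 <- 9.6046

# Fitted line: predicted energy for a given lipid content in grams.
predict_kcal <- function(lipid_g) b0 + b1 * lipid_g

pred <- predict_kcal(16.23)   # first row of head(fastf.table)
obs  <- 274
print(c(predicted = pred, residual = obs - pred))

# With a single predictor, the overall F equals the squared t value of the slope.
t_slope <- 29.18
print(t_slope^2)
```

The prediction is about 284.7 kcal, roughly 10.7 kcal above the recorded value, and the squared t value is close to the reported F of 851.2, differing only through rounding.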

fastf3.lm2 <- lm(Energ_Kcal~Lipid_Tot_.g.+Carbohydrt_.g.)
summary(fastf3.lm2)
## 
## Call:
## lm(formula = Energ_Kcal ~ Lipid_Tot_.g. + Carbohydrt_.g.)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -144.730  -17.579    0.057   16.669   99.197 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     96.6342     4.8734   19.83   <2e-16 ***
## Lipid_Tot_.g.    9.8138     0.2800   35.05   <2e-16 ***
## Carbohydrt_.g.   1.5713     0.1371   11.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.94 on 335 degrees of freedom
## Multiple R-squared:  0.7967, Adjusted R-squared:  0.7955 
## F-statistic: 656.5 on 2 and 335 DF,  p-value: < 2.2e-16

Interpretation

In this model, I use two independent variables: Energy Kcal is explained by Lipid total and Carbohydrate. The intercept b0 of this model is 96.6342, which means that when lipid total and carbohydrate are both 0, the predicted energy is 96.6342 kcal. b1 is 9.8138, which shows that each one-gram increase in lipid total is associated with an increase of 9.8138 kcal when carbohydrate is held constant. b2 is 1.5713, which shows that each one-gram increase in carbohydrate is associated with an increase of 1.5713 kcal when lipid total is held constant.

Adjusted R^2 is 0.7955, which is higher than the previous adjusted R^2 of 0.7161. Lipid total and carbohydrate together explain about 79.55% of the variation in energy Kcal. The R^2 is high, but more independent variables are needed to explain the rest of the variation in calories.

The F-statistic of this model is 656.5 on 2 and 335 DF, and the p-value of this significance test is < 2.2e-16. So we can reject the null hypothesis and say that lipid total and carbohydrate can explain the variation in energy Kcal.
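The adjusted R^2 values quoted so far can be recovered from the multiple R^2 values with the usual adjustment formula. The sketch below uses a helper named adj_r2, invented for this example, and takes n = 338 from the residual degrees of freedom plus the number of estimated coefficients (for example 335 + 3).

```r
# Adjusted R^2 from R^2, sample size n, and number of predictors p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 338  # residual df + number of estimated coefficients, e.g. 335 + 3

adj_one <- adj_r2(0.7170, n, p = 1)  # lipid only
adj_two <- adj_r2(0.7967, n, p = 2)  # lipid + carbohydrate
print(round(c(adj_one, adj_two), 3))
```

Because the inputs are the rounded R^2 values from the summaries, the results match the reported 0.7161 and 0.7955 only to about three decimal places.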

fastf3.lm3 <- lm(Energ_Kcal~Lipid_Tot_.g.+Carbohydrt_.g.+Protein_.g.)
summary(fastf3.lm3)
## 
## Call:
## lm(formula = Energ_Kcal ~ Lipid_Tot_.g. + Carbohydrt_.g. + Protein_.g.)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.481 -18.069  -4.789  11.104 108.853 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     60.7836     6.2726   9.690  < 2e-16 ***
## Lipid_Tot_.g.    9.4858     0.2594  36.570  < 2e-16 ***
## Carbohydrt_.g.   2.0889     0.1407  14.849  < 2e-16 ***
## Protein_.g.      2.5759     0.3169   8.127 8.63e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.14 on 334 degrees of freedom
## Multiple R-squared:  0.8303, Adjusted R-squared:  0.8288 
## F-statistic: 544.7 on 3 and 334 DF,  p-value: < 2.2e-16

Interpretation

Finally, I include all three independent variables: Energy Kcal is explained by Lipid total, Carbohydrate, and Protein. The intercept b0 of this model is 60.7836, which means that when lipid total, carbohydrate, and protein are all 0, the predicted energy is 60.7836 kcal. b1 is 9.4858, which shows that each one-gram increase in lipid total is associated with an increase of 9.4858 kcal when the other two variables are held constant. b2 is 2.0889, which shows that each one-gram increase in carbohydrate is associated with an increase of 2.0889 kcal. b3 is 2.5759, which shows that each one-gram increase in protein is associated with an increase of 2.5759 kcal.

Adjusted R^2 is 0.8288, which is higher than the previous two adjusted R^2 values of 0.7161 and 0.7955, so adding protein to the explanation of fast-food calories makes sense. Lipid total, carbohydrate, and protein together explain about 82.88% of the variation in energy Kcal. Even with all three factors included, more independent variables are still needed to explain the rest of the variation in calories.

The F-statistic of this model is 544.7 on 3 and 334 DF, and the p-value of this significance test is < 2.2e-16. So we can reject the null hypothesis and say that lipid total, carbohydrate, and protein can explain the variation in energy Kcal.
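The overall F-statistics of the three step-wise models follow directly from their R^2 values and degrees of freedom. A minimal sketch, where f_from_r2 is a helper written for this example and the inputs are copied from the summaries above:

```r
# Overall F from R^2, number of predictors p, and residual degrees of freedom.
f_from_r2 <- function(r2, p, df_resid) (r2 / p) / ((1 - r2) / df_resid)

f_stats <- c(
  lipid               = f_from_r2(0.7170, 1, 336),
  lipid_carb          = f_from_r2(0.7967, 2, 335),
  lipid_carb_protein  = f_from_r2(0.8303, 3, 334)
)
print(round(f_stats, 1))

# Upper-tail p-value for the full model's reported F.
p_full <- pf(544.7, df1 = 3, df2 = 334, lower.tail = FALSE)
print(p_full < 2.2e-16)
```

The values agree with the reported 851.2, 656.5, and 544.7 up to the rounding of R^2.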

Plot the Regression Line

Lipid Total(g)

plot(Lipid_Tot_.g.,Energ_Kcal,pch = 21, bg = 'blue')
abline(fastf3.lm1)

Carbohydrate(g)

fastC.lm <- lm(Energ_Kcal~Carbohydrt_.g.)
plot(Carbohydrt_.g.,Energ_Kcal,pch = 21, bg = 'blue')
abline(fastC.lm)

Protein(g)

plot(Protein_.g.,Energ_Kcal,pch = 21, bg = 'blue')
abline(fastf2.lm1)

3D Scatterplot

I plot the 3D scatterplot using two independent variables: lipid total (g) and carbohydrate (g).

library(scatterplot3d)
## Warning: package 'scatterplot3d' was built under R version 3.1.3
md <- scatterplot3d(Lipid_Tot_.g., Carbohydrt_.g., Energ_Kcal,
                    pch = 21, bg = 'blue', main = "Regression plane",
                    xlab = "Lipid total(g)", ylab = "carbohydrate(g)",
                    zlab = "Energy Kcal", axis = TRUE)

md$plane3d(fastf3.lm2)

95% Confidence Intervals

confint(fastf3.lm3)
##                    2.5 %    97.5 %
## (Intercept)    48.444902 73.122341
## Lipid_Tot_.g.   8.975585  9.996073
## Carbohydrt_.g.  1.812213  2.365649
## Protein_.g.     1.952415  3.199287

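Each interval gives a range of plausible values for the corresponding coefficient: if the sampling were repeated many times, about 95% of the intervals constructed this way would contain the true coefficient. None of the intervals includes 0.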
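These intervals can be rebuilt by hand as estimate ± t × standard error, using the estimates and standard errors from summary(fastf3.lm3) and the residual degrees of freedom (334). The object names est, se, t_crit, and ci are chosen for this sketch.

```r
# Estimates and standard errors copied from summary(fastf3.lm3).
est <- c(`(Intercept)` = 60.7836, Lipid_Tot_.g. = 9.4858,
         Carbohydrt_.g. = 2.0889, Protein_.g. = 2.5759)
se  <- c(6.2726, 0.2594, 0.1407, 0.3169)

# Two-sided 95% critical value of the t distribution with 334 df.
t_crit <- qt(0.975, df = 334)

ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 3))
```

The result matches the confint() output to about three decimal places; the small differences come from using rounded estimates and standard errors.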

Checking Assumption: Normality of the Residuals

Histogram of residuals

# hist() plots counts by default, so the y axis is labelled as a frequency
hist(residuals(fastf3.lm3), breaks = 15, main = "Histogram of residuals", xlab = "Residuals", ylab = "Frequency")

The histogram suggests that the residuals are slightly left-skewed; whether there is excess kurtosis is unclear. The histogram also does not show any obvious outliers.
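Skewness and kurtosis can also be put into numbers rather than judged by eye. The sketch below defines two moment-based helpers, sample_skewness and excess_kurtosis, which are written for this example and are not base R functions, and tries them on two small made-up vectors. With the real model they could be applied to residuals(fastf3.lm3).

```r
# Moment-based skewness: 0 for a symmetric sample, positive for a long right tail.
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: 0 for a normal distribution, positive for heavy tails.
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

toy_symmetric    <- c(-2, -1, 0, 1, 2)    # made-up symmetric values
toy_right_tailed <- c(-1, -1, -1, 0, 8)   # made-up values with one large positive

print(sample_skewness(toy_symmetric))
print(excess_kurtosis(toy_symmetric))
print(sample_skewness(toy_right_tailed))
```

A negative skewness for the real residuals would support the left skew seen in the histogram, while a positive value would point the other way.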

Boxplot

boxplot(residuals(fastf3.lm3), main = "Boxplot of residuals")

A boxplot is a good tool for seeing whether there are outliers, and it also helps show whether the distribution is skewed. The boxplot shows many outliers among the residuals. It also shows only very slight left skewness.
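The boxplot flags points that lie more than 1.5 times the interquartile range beyond the box. As a rough check, the sketch below computes those fences from the residual quartiles printed in summary(fastf3.lm3). This is an approximation, since boxplot() uses hinges that can differ slightly from these quartiles, and the variable names are chosen for this example.

```r
# Residual summary values copied from summary(fastf3.lm3).
q1 <- -18.069
q3 <- 11.104
res_min <- -96.481
res_max <- 108.853

# Whiskers extend at most 1.5 * IQR beyond the quartiles.
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
print(c(lower = lower_fence, upper = upper_fence))

# Both extremes fall outside the fences, so both tails contain flagged points.
print(c(min_is_outlier = res_min < lower_fence,
        max_is_outlier = res_max > upper_fence))
```

The fences are about -61.8 and 54.9, so the smallest and largest residuals are both flagged, which is consistent with outliers appearing on both sides of the boxplot.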

Q-Q Plot

par(mfrow = c(1,1))
qqnorm(residuals(fastf3.lm3),main = "Q-Q Plot")
qqline(residuals(fastf3.lm3))

The Q-Q plot looks linear between the -2 and 1 quantiles. However, there is strong curvature from the -3 to the -2 quantile and from the 1 to the 3 quantile. I think this may be because there are many outliers.