Regression estimates the relationship between a dependent variable and one or more independent variables, and the fitted relationship can then be used to predict outcomes for new observations. The independent variables are the inputs or presumed causes (here, characteristics that are fixed for each participant, such as age or assigned diet), and the dependent variable is the output or effect that we expect to change in response.
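As a minimal illustration of this idea, the sketch below uses simulated data rather than the diet dataset. The variable names x and y and the true coefficients 2 and 3 are assumptions chosen for the example. It generates a dependent variable from an independent variable plus noise, recovers the relationship with lm(), and predicts the outcome for a new input.

```r
# Simulated example: names and true coefficients are illustrative assumptions.
set.seed(42)                          # make the random draws reproducible
n <- 100
x <- runif(n, min = 0, max = 10)      # independent variable (input)
y <- 2 + 3 * x + rnorm(n, sd = 1)     # dependent variable (output) with noise

fit <- lm(y ~ x)                      # estimate intercept and slope
est <- coef(fit)
print(est)

# Predict the dependent variable for a new value of the independent variable
new_obs <- data.frame(x = 5)
pred <- predict(fit, newdata = new_obs)
print(pred)
```

The estimated intercept and slope should land close to the values used to simulate the data, which is the sense in which regression "recovers" a relationship.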
In this report, we will use demographic, weight, and diet data to analyze relationships between dependent and independent variables.
First, we will take a look at the data. The dataset contains 7 variables (descriptive characteristics plus weight and diet details) and 78 observations, one per study participant, identified by the Person column with values from 1 to 78.
diet_data <- read.csv("Dietdata.csv")
str(diet_data)
## 'data.frame': 78 obs. of 7 variables:
## $ Person : int 25 26 1 2 3 4 5 6 7 8 ...
## $ gender : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Age : int 41 32 22 46 55 33 50 50 37 28 ...
## $ Height : int 171 174 159 192 170 171 170 201 174 176 ...
## $ pre.weight : int 60 103 58 60 64 64 65 66 67 69 ...
## $ Diet : int 2 2 1 1 1 1 1 1 1 1 ...
## $ weight6weeks: num 60 103 54.2 54 63.3 61.1 62.2 64 65 60.5 ...
head(diet_data)
## Person gender Age Height pre.weight Diet weight6weeks
## 1 25 0 41 171 60 2 60.0
## 2 26 0 32 174 103 2 103.0
## 3 1 0 22 159 58 1 54.2
## 4 2 0 46 192 60 1 54.0
## 5 3 0 55 170 64 1 63.3
## 6 4 0 33 171 64 1 61.1
table(diet_data$Diet)
##
## 1 2 3
## 24 27 27
In this analysis, the independent variables are gender, age, height, and diet. The weight measurements, pre.weight and weight6weeks, are the dependent variables. (We will also add a derived dependent variable, weightdifference, equal to pre.weight minus weight6weeks, so that positive values indicate weight lost.)
Now, we will do some exploratory data analysis.
summary(diet_data)
## Person gender Age Height
## Min. : 1.00 Min. :0.0000 Min. :16.00 Min. :141.0
## 1st Qu.:20.25 1st Qu.:0.0000 1st Qu.:32.25 1st Qu.:164.2
## Median :39.50 Median :0.0000 Median :39.00 Median :169.5
## Mean :39.50 Mean :0.4231 Mean :39.15 Mean :170.8
## 3rd Qu.:58.75 3rd Qu.:1.0000 3rd Qu.:46.75 3rd Qu.:174.8
## Max. :78.00 Max. :1.0000 Max. :60.00 Max. :201.0
## pre.weight Diet weight6weeks
## Min. : 58.00 Min. :1.000 Min. : 53.00
## 1st Qu.: 66.00 1st Qu.:1.000 1st Qu.: 61.85
## Median : 72.00 Median :2.000 Median : 68.95
## Mean : 72.53 Mean :2.038 Mean : 68.68
## 3rd Qu.: 78.00 3rd Qu.:3.000 3rd Qu.: 73.83
## Max. :103.00 Max. :3.000 Max. :103.00
There are no missing values in the data.
sum(is.na(diet_data))
## [1] 0
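A single total can hide where missing values occur. As a sketch, the per-column count below uses a small data frame rebuilt from the six rows printed by head(diet_data) above. The object names toy and toy_with_gap and the deliberately inserted NA are assumptions for illustration. On the full data the equivalent call would be colSums(is.na(diet_data)).

```r
# Toy copy of the first six rows shown by head(diet_data)
toy <- data.frame(
  Person       = c(25, 26, 1, 2, 3, 4),
  gender       = c(0, 0, 0, 0, 0, 0),
  Age          = c(41, 32, 22, 46, 55, 33),
  Height       = c(171, 174, 159, 192, 170, 171),
  pre.weight   = c(60, 103, 58, 60, 64, 64),
  Diet         = c(2, 2, 1, 1, 1, 1),
  weight6weeks = c(60.0, 103.0, 54.2, 54.0, 63.3, 61.1)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Illustration only: blank out one value to show how a gap would be reported
toy_with_gap <- toy
toy_with_gap$Age[3] <- NA
na_with_gap <- colSums(is.na(toy_with_gap))
print(na_with_gap)
```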
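We will create a correlation plot using the corrplot package. First we convert every column to numeric and add the weightdifference variable (pre.weight minus weight6weeks), then we compute a Pearson correlation matrix and pass it to the corrplot function.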
diet_ <- data.frame(lapply(diet_data, as.numeric))
diet_$weightdifference <- diet_$pre.weight - diet_$weight6weeks
str(diet_)
## 'data.frame': 78 obs. of 8 variables:
## $ Person : num 25 26 1 2 3 4 5 6 7 8 ...
## $ gender : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Age : num 41 32 22 46 55 33 50 50 37 28 ...
## $ Height : num 171 174 159 192 170 171 170 201 174 176 ...
## $ pre.weight : num 60 103 58 60 64 64 65 66 67 69 ...
## $ Diet : num 2 2 1 1 1 1 1 1 1 1 ...
## $ weight6weeks : num 60 103 54.2 54 63.3 61.1 62.2 64 65 60.5 ...
## $ weightdifference: num 0 0 3.8 6 0.7 ...
library(corrplot)
## corrplot 0.84 loaded
CHO <- cor(diet_, method = "pearson")
diet_[ ,2:8]
## gender Age Height pre.weight Diet weight6weeks weightdifference
## 1 0 41 171 60 2 60.0 0.0
## 2 0 32 174 103 2 103.0 0.0
## 3 0 22 159 58 1 54.2 3.8
## 4 0 46 192 60 1 54.0 6.0
## 5 0 55 170 64 1 63.3 0.7
## 6 0 33 171 64 1 61.1 2.9
## 7 0 50 170 65 1 62.2 2.8
## 8 0 50 201 66 1 64.0 2.0
## 9 0 37 174 67 1 65.0 2.0
## 10 0 28 176 69 1 60.5 8.5
## 11 0 28 165 70 1 68.1 1.9
## 12 0 45 165 70 1 66.9 3.1
## 13 0 60 173 72 1 70.5 1.5
## 14 0 48 156 72 1 69.0 3.0
## 15 0 41 163 72 1 68.4 3.6
## 16 0 37 167 82 1 81.1 0.9
## 17 0 44 174 58 2 60.1 -2.1
## 18 0 37 172 58 2 56.0 2.0
## 19 0 41 165 59 2 57.3 1.7
## 20 0 43 171 61 2 56.7 4.3
## 21 0 20 169 62 2 55.0 7.0
## 22 0 51 174 63 2 62.4 0.6
## 23 0 31 163 63 2 60.3 2.7
## 24 0 54 173 63 2 59.4 3.6
## 25 0 50 166 65 2 62.0 3.0
## 26 0 48 163 66 2 64.0 2.0
## 27 0 16 165 68 2 63.8 4.2
## 28 0 37 167 68 2 63.3 4.7
## 29 0 30 161 76 2 72.7 3.3
## 30 0 29 169 77 2 77.5 -0.5
## 31 0 51 165 60 3 53.0 7.0
## 32 0 35 169 62 3 56.4 5.6
## 33 0 21 159 64 3 60.6 3.4
## 34 0 22 169 65 3 58.2 6.8
## 35 0 36 160 66 3 58.2 7.8
## 36 0 20 169 67 3 61.6 5.4
## 37 0 35 163 67 3 60.2 6.8
## 38 0 45 155 69 3 61.8 7.2
## 39 0 58 141 70 3 63.0 7.0
## 40 0 37 170 70 3 62.7 7.3
## 41 0 31 170 72 3 71.1 0.9
## 42 0 35 171 72 3 64.4 7.6
## 43 0 56 171 73 3 68.9 4.1
## 44 0 48 153 75 3 68.7 6.3
## 45 0 41 157 76 3 71.0 5.0
## 46 1 39 168 71 1 71.6 -0.6
## 47 1 31 158 72 1 70.9 1.1
## 48 1 40 173 74 1 69.5 4.5
## 49 1 50 160 78 1 73.9 4.1
## 50 1 43 162 80 1 71.0 9.0
## 51 1 25 165 80 1 77.6 2.4
## 52 1 52 177 83 1 79.1 3.9
## 53 1 42 166 85 1 81.5 3.5
## 54 1 39 166 87 1 81.9 5.1
## 55 1 40 190 88 1 84.5 3.5
## 56 1 51 191 71 2 66.8 4.2
## 57 1 38 199 75 2 72.6 2.4
## 58 1 54 196 75 2 69.2 5.8
## 59 1 33 190 76 2 72.5 3.5
## 60 1 45 160 78 2 72.7 5.3
## 61 1 37 194 78 2 76.3 1.7
## 62 1 44 163 79 2 73.6 5.4
## 63 1 40 171 79 2 72.9 6.1
## 64 1 37 198 79 2 71.1 7.9
## 65 1 39 180 80 2 81.4 -1.4
## 66 1 31 182 80 2 75.7 4.3
## 67 1 36 155 71 3 68.5 2.5
## 68 1 47 179 73 3 72.1 0.9
## 69 1 29 166 76 3 72.5 3.5
## 70 1 37 173 78 3 77.5 0.5
## 71 1 31 177 78 3 75.2 2.8
## 72 1 26 179 78 3 69.4 8.6
## 73 1 40 179 79 3 74.5 4.5
## 74 1 35 183 83 3 80.2 2.8
## 75 1 49 177 84 3 79.9 4.1
## 76 1 28 164 85 3 79.7 5.3
## 77 1 40 167 87 3 77.8 9.2
## 78 1 51 175 88 3 81.9 6.1
corrplot(CHO, method = "number", type = "lower", tl.col = "black", tl.srt = 45)
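CHO is computed from all eight columns of diet_, so the plot also includes Person, which is only an identifier. A minimal sketch of excluding it before computing the correlations follows, using a toy frame built from the first six rows shown earlier. The object names toy, toy_noid, and CHO_noid are assumptions, and gender is left out of the toy frame only because it is constant in these six rows.

```r
# Toy frame from the first six printed rows (gender omitted: constant here)
toy <- data.frame(
  Person       = c(25, 26, 1, 2, 3, 4),
  Age          = c(41, 32, 22, 46, 55, 33),
  Height       = c(171, 174, 159, 192, 170, 171),
  pre.weight   = c(60, 103, 58, 60, 64, 64),
  Diet         = c(2, 2, 1, 1, 1, 1),
  weight6weeks = c(60.0, 103.0, 54.2, 54.0, 63.3, 61.1)
)
toy$weightdifference <- toy$pre.weight - toy$weight6weeks

# Drop the identifier by name, then compute Pearson correlations
toy_noid <- toy[, names(toy) != "Person"]
CHO_noid <- cor(toy_noid, method = "pearson")
print(round(CHO_noid, 2))
```

Passing a matrix built this way to corrplot() in place of CHO would plot only the measured variables.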
We now propose a model to answer the main question: can we predict weight loss? If so, which variable predicts it best?
x <- lm(gender~Diet, data = diet_)
head(diet_)
## Person gender Age Height pre.weight Diet weight6weeks weightdifference
## 1 25 0 41 171 60 2 60.0 0.0
## 2 26 0 32 174 103 2 103.0 0.0
## 3 1 0 22 159 58 1 54.2 3.8
## 4 2 0 46 192 60 1 54.0 6.0
## 5 3 0 55 170 64 1 63.3 0.7
## 6 4 0 33 171 64 1 61.1 2.9
summary(x)
##
## Call:
## lm(formula = gender ~ Diet, data = diet_)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4369 -0.4225 -0.4082 0.5775 0.5918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.39380 0.15380 2.560 0.0124 *
## Diet 0.01436 0.07014 0.205 0.8383
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5004 on 76 degrees of freedom
## Multiple R-squared: 0.0005512, Adjusted R-squared: -0.0126
## F-statistic: 0.04192 on 1 and 76 DF, p-value: 0.8383
diet_data123 <- lm(weightdifference ~ gender+Age+Height+Diet,diet_)
summary(diet_data123)
##
## Call:
## lm(formula = weightdifference ~ gender + Age + Height + Diet,
## data = diet_)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6204 -1.6190 0.1334 1.3499 5.8652
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.25706 4.75799 1.315 0.1926
## gender 0.45433 0.60510 0.751 0.4552
## Age -0.00373 0.02909 -0.128 0.8983
## Height -0.02507 0.02692 -0.932 0.3546
## Diet 0.89512 0.35332 2.533 0.0134 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.479 on 73 degrees of freedom
## Multiple R-squared: 0.1049, Adjusted R-squared: 0.0559
## F-statistic: 2.14 on 4 and 73 DF, p-value: 0.08444
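Diet is the only predictor with a statistically significant coefficient (p = 0.0134), although the model as a whole is not significant at the 0.05 level (F-test p = 0.084). We will drop age and gender because they have the largest p-values in this model (0.898 and 0.455, respectively) and focus only on diet and height.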
diet_data12345 <- lm(weightdifference ~ Height+Diet,diet_)
summary(diet_data12345)
##
## Call:
## lm(formula = weightdifference ~ Height + Diet, data = diet_)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8512 -1.6492 0.0305 1.4483 5.9469
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.10986 4.41751 1.157 0.25105
## Height -0.01836 0.02499 -0.735 0.46472
## Diet 0.91840 0.34667 2.649 0.00983 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.456 on 75 degrees of freedom
## Multiple R-squared: 0.09783, Adjusted R-squared: 0.07377
## F-statistic: 4.066 on 2 and 75 DF, p-value: 0.02106
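The Diet row and the adjusted R-squared in this summary can be checked by hand. The sketch below only reuses numbers printed above (the variable names are illustrative). It recomputes the t value as estimate divided by standard error, the two-sided p-value from a t distribution with 75 residual degrees of freedom, and the adjusted R-squared from the multiple R-squared with n = 78 observations and 2 predictors.

```r
# Values copied from the summary output above
estimate  <- 0.91840   # Diet coefficient
std_error <- 0.34667   # its standard error
df_resid  <- 75        # residual degrees of freedom
r_squared <- 0.09783   # multiple R-squared
n <- 78                # observations
k <- 2                 # predictors (Height, Diet)

# t statistic and two-sided p-value for the Diet coefficient
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# Adjusted R-squared penalizes the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(round(t_value, 3))
print(signif(p_value, 3))
print(round(adj_r_squared, 5))
```

The three printed values should agree, up to rounding, with the t value, Pr(>|t|), and Adjusted R-squared entries in the summary.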
This result shows that diet is significant at the 0.01 level (p = 0.00983) when it comes to predicting weight loss, while height is not (p = 0.465). The model explains only about 10% of the variance in the dependent variable, weightdifference (multiple R-squared = 0.098).
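Both models above enter Diet as the numeric codes 1, 2, and 3, which assumes that expected weight loss changes by the same amount from diet 1 to diet 2 as from diet 2 to diet 3. One possible alternative, sketched here as an assumption rather than as part of the original analysis, is to treat Diet as a categorical factor so that each diet gets its own estimate. The toy frame below holds twelve rows copied from the printed data (four per diet). Its name and the choice of rows are illustrative, so the estimates will not match a fit on all 78 participants.

```r
# Twelve rows copied from the printed data, four per diet (illustrative subset)
toy <- data.frame(
  pre.weight   = c(58, 60, 71, 80,  58, 62, 71, 79,  60, 66, 71, 87),
  Diet         = c(1, 1, 1, 1,  2, 2, 2, 2,  3, 3, 3, 3),
  weight6weeks = c(54.2, 54.0, 71.6, 71.0,  60.1, 55.0, 66.8, 71.1,
                   53.0, 58.2, 68.5, 77.8)
)
toy$weightdifference <- toy$pre.weight - toy$weight6weeks

# factor(Diet) makes lm() estimate a separate effect for diets 2 and 3,
# each measured relative to diet 1 (the reference level)
fit_factor <- lm(weightdifference ~ factor(Diet), data = toy)
print(coef(fit_factor))

# Group means for comparison with the coefficients
group_means <- tapply(toy$weightdifference, toy$Diet, mean)
print(group_means)
```

With this coding the intercept is the mean weight loss for diet 1, and the other two coefficients are the differences between the diet 2 and diet 3 means and the diet 1 mean.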