Introduction

Regression estimates the relationship between a dependent variable and one or more independent variables, and it can be used to predict outcomes for new observations. The independent variables are the inputs or presumed causes (typically characteristics that are fixed or set by the study design), and the dependent variable is the output or effect (the quantity expected to change in response).

In this report, we will use demographic, weight, and diet data to analyze relationships between dependent and independent variables.

Data

First, we will take a look at the data. The dataset contains 7 variables (descriptive characteristics plus weight and diet details) and 78 observations, one per participant in the study, numbered 1 to 78.

# read.csv() already returns a data frame, so no data.frame() wrapper is needed
diet_data <- read.csv("Dietdata.csv")
str(diet_data)

## 'data.frame':    78 obs. of  7 variables:
##  $ Person      : int  25 26 1 2 3 4 5 6 7 8 ...
##  $ gender      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age         : int  41 32 22 46 55 33 50 50 37 28 ...
##  $ Height      : int  171 174 159 192 170 171 170 201 174 176 ...
##  $ pre.weight  : int  60 103 58 60 64 64 65 66 67 69 ...
##  $ Diet        : int  2 2 1 1 1 1 1 1 1 1 ...
##  $ weight6weeks: num  60 103 54.2 54 63.3 61.1 62.2 64 65 60.5 ...

head(diet_data)

##   Person gender Age Height pre.weight Diet weight6weeks
## 1     25      0  41    171         60    2         60.0
## 2     26      0  32    174        103    2        103.0
## 3      1      0  22    159         58    1         54.2
## 4      2      0  46    192         60    1         54.0
## 5      3      0  55    170         64    1         63.3
## 6      4      0  33    171         64    1         61.1

table(diet_data$Diet)

## 
##  1  2  3 
## 24 27 27
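
The str() output shows gender and Diet stored as integers, although both are category codes rather than quantities. A minimal sketch of recoding them as factors follows. It uses only the six rows printed by head(diet_data); the names diet_head, Diet_f and gender_f are hypothetical, and the source does not say which gender code stands for which group, so the codes are kept as labels.

```r
# Six rows copied from the head(diet_data) output; the full file is not reproduced here
diet_head <- data.frame(
  gender = c(0, 0, 0, 0, 0, 0),
  Diet   = c(2, 2, 1, 1, 1, 1)
)

# Declare all three diet codes as levels so a level absent from this subset is still kept
diet_head$Diet_f <- factor(diet_head$Diet, levels = c(1, 2, 3),
                           labels = c("Diet 1", "Diet 2", "Diet 3"))

# The meaning of codes 0 and 1 is not stated in the source, so they stay as-is
diet_head$gender_f <- factor(diet_head$gender, levels = c(0, 1))

# Counts per diet group, including the empty "Diet 3" level
diet_counts <- table(diet_head$Diet_f)
diet_counts
```

With factors, functions such as table() and lm() treat the three diets as separate groups instead of as points on a numeric scale.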

Analysis

In this case, the independent variables are gender, age, height and diet. The weight measurements, pre.weight and weight6weeks, are the dependent variables. We will also add a third dependent variable, weightdifference, defined as pre.weight minus weight6weeks, so that positive values indicate weight lost over the six weeks.

Now, we will do some exploratory data analysis.

summary(diet_data)

##      Person          gender            Age            Height     
##  Min.   : 1.00   Min.   :0.0000   Min.   :16.00   Min.   :141.0  
##  1st Qu.:20.25   1st Qu.:0.0000   1st Qu.:32.25   1st Qu.:164.2  
##  Median :39.50   Median :0.0000   Median :39.00   Median :169.5  
##  Mean   :39.50   Mean   :0.4231   Mean   :39.15   Mean   :170.8  
##  3rd Qu.:58.75   3rd Qu.:1.0000   3rd Qu.:46.75   3rd Qu.:174.8  
##  Max.   :78.00   Max.   :1.0000   Max.   :60.00   Max.   :201.0  
##    pre.weight          Diet        weight6weeks   
##  Min.   : 58.00   Min.   :1.000   Min.   : 53.00  
##  1st Qu.: 66.00   1st Qu.:1.000   1st Qu.: 61.85  
##  Median : 72.00   Median :2.000   Median : 68.95  
##  Mean   : 72.53   Mean   :2.038   Mean   : 68.68  
##  3rd Qu.: 78.00   3rd Qu.:3.000   3rd Qu.: 73.83  
##  Max.   :103.00   Max.   :3.000   Max.   :103.00

There are no missing values in the data.

sum(is.na(diet_data))

## [1] 0
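
The single total above confirms that nothing is missing, but it would not show where a gap was if one existed. As a sketch, the per-column version of the same check looks like this. The data frame toy is a hypothetical three-row example with one value deliberately removed; it is not part of the diet data.

```r
# Hypothetical three-row data frame with one missing value, for illustration only
toy <- data.frame(Age = c(41, NA, 22), Height = c(171, 174, 159))

# Count of missing values in each column (a named numeric vector)
na_per_column <- colSums(is.na(toy))
na_per_column

# Number of rows with no missing value in any column
complete_rows <- sum(complete.cases(toy))
complete_rows
```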

We will create a correlation plot using the corrplot package. First we create a correlation matrix, then pass the matrix into the corrplot function.

diet_ <- data.frame(lapply(diet_data, as.numeric))

diet_$weightdifference <- diet_$pre.weight - diet_$weight6weeks

str(diet_)

## 'data.frame':    78 obs. of  8 variables:
##  $ Person          : num  25 26 1 2 3 4 5 6 7 8 ...
##  $ gender          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age             : num  41 32 22 46 55 33 50 50 37 28 ...
##  $ Height          : num  171 174 159 192 170 171 170 201 174 176 ...
##  $ pre.weight      : num  60 103 58 60 64 64 65 66 67 69 ...
##  $ Diet            : num  2 2 1 1 1 1 1 1 1 1 ...
##  $ weight6weeks    : num  60 103 54.2 54 63.3 61.1 62.2 64 65 60.5 ...
##  $ weightdifference: num  0 0 3.8 6 0.7 ...

library(corrplot)

## corrplot 0.84 loaded

CHO <- cor(diet_, method = "pearson") 

diet_[ ,2:8]

##    gender Age Height pre.weight Diet weight6weeks weightdifference
## 1       0  41    171         60    2         60.0              0.0
## 2       0  32    174        103    2        103.0              0.0
## 3       0  22    159         58    1         54.2              3.8
## 4       0  46    192         60    1         54.0              6.0
## 5       0  55    170         64    1         63.3              0.7
## 6       0  33    171         64    1         61.1              2.9
## 7       0  50    170         65    1         62.2              2.8
## 8       0  50    201         66    1         64.0              2.0
## 9       0  37    174         67    1         65.0              2.0
## 10      0  28    176         69    1         60.5              8.5
## 11      0  28    165         70    1         68.1              1.9
## 12      0  45    165         70    1         66.9              3.1
## 13      0  60    173         72    1         70.5              1.5
## 14      0  48    156         72    1         69.0              3.0
## 15      0  41    163         72    1         68.4              3.6
## 16      0  37    167         82    1         81.1              0.9
## 17      0  44    174         58    2         60.1             -2.1
## 18      0  37    172         58    2         56.0              2.0
## 19      0  41    165         59    2         57.3              1.7
## 20      0  43    171         61    2         56.7              4.3
## 21      0  20    169         62    2         55.0              7.0
## 22      0  51    174         63    2         62.4              0.6
## 23      0  31    163         63    2         60.3              2.7
## 24      0  54    173         63    2         59.4              3.6
## 25      0  50    166         65    2         62.0              3.0
## 26      0  48    163         66    2         64.0              2.0
## 27      0  16    165         68    2         63.8              4.2
## 28      0  37    167         68    2         63.3              4.7
## 29      0  30    161         76    2         72.7              3.3
## 30      0  29    169         77    2         77.5             -0.5
## 31      0  51    165         60    3         53.0              7.0
## 32      0  35    169         62    3         56.4              5.6
## 33      0  21    159         64    3         60.6              3.4
## 34      0  22    169         65    3         58.2              6.8
## 35      0  36    160         66    3         58.2              7.8
## 36      0  20    169         67    3         61.6              5.4
## 37      0  35    163         67    3         60.2              6.8
## 38      0  45    155         69    3         61.8              7.2
## 39      0  58    141         70    3         63.0              7.0
## 40      0  37    170         70    3         62.7              7.3
## 41      0  31    170         72    3         71.1              0.9
## 42      0  35    171         72    3         64.4              7.6
## 43      0  56    171         73    3         68.9              4.1
## 44      0  48    153         75    3         68.7              6.3
## 45      0  41    157         76    3         71.0              5.0
## 46      1  39    168         71    1         71.6             -0.6
## 47      1  31    158         72    1         70.9              1.1
## 48      1  40    173         74    1         69.5              4.5
## 49      1  50    160         78    1         73.9              4.1
## 50      1  43    162         80    1         71.0              9.0
## 51      1  25    165         80    1         77.6              2.4
## 52      1  52    177         83    1         79.1              3.9
## 53      1  42    166         85    1         81.5              3.5
## 54      1  39    166         87    1         81.9              5.1
## 55      1  40    190         88    1         84.5              3.5
## 56      1  51    191         71    2         66.8              4.2
## 57      1  38    199         75    2         72.6              2.4
## 58      1  54    196         75    2         69.2              5.8
## 59      1  33    190         76    2         72.5              3.5
## 60      1  45    160         78    2         72.7              5.3
## 61      1  37    194         78    2         76.3              1.7
## 62      1  44    163         79    2         73.6              5.4
## 63      1  40    171         79    2         72.9              6.1
## 64      1  37    198         79    2         71.1              7.9
## 65      1  39    180         80    2         81.4             -1.4
## 66      1  31    182         80    2         75.7              4.3
## 67      1  36    155         71    3         68.5              2.5
## 68      1  47    179         73    3         72.1              0.9
## 69      1  29    166         76    3         72.5              3.5
## 70      1  37    173         78    3         77.5              0.5
## 71      1  31    177         78    3         75.2              2.8
## 72      1  26    179         78    3         69.4              8.6
## 73      1  40    179         79    3         74.5              4.5
## 74      1  35    183         83    3         80.2              2.8
## 75      1  49    177         84    3         79.9              4.1
## 76      1  28    164         85    3         79.7              5.3
## 77      1  40    167         87    3         77.8              9.2
## 78      1  51    175         88    3         81.9              6.1

corrplot(CHO, method = "number", type = "lower", tl.col = "black", tl.srt = 45)
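
Note that CHO is computed on all eight columns, including Person, which is only an identifier; the subset diet_[ ,2:8] is printed above but never stored. A sketch of computing the matrix without the identifier follows, using base R only. It runs on 18 rows copied from the listing above (six per diet), so the values will differ from the full-data plot. The names diet_sub, vars and cor_no_id are hypothetical, and the Person values here are placeholders rather than the real IDs.

```r
# 18 rows copied from the data listing (six per diet); Person holds placeholder IDs
diet_sub <- data.frame(
  Person       = seq_len(18),
  gender       = c(0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1),
  Age          = c(22, 46, 55, 39, 40, 43, 44, 37, 20, 51, 38, 45, 51, 35, 21, 36, 26, 40),
  Height       = c(159, 192, 170, 168, 173, 162, 174, 172, 169, 191, 199, 160,
                   165, 169, 159, 155, 179, 167),
  pre.weight   = c(58, 60, 64, 71, 74, 80, 58, 58, 62, 71, 75, 78, 60, 62, 64, 71, 78, 87),
  Diet         = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3),
  weight6weeks = c(54.2, 54.0, 63.3, 71.6, 69.5, 71.0, 60.1, 56.0, 55.0, 66.8, 72.6, 72.7,
                   53.0, 56.4, 60.6, 68.5, 69.4, 77.8)
)
diet_sub$weightdifference <- diet_sub$pre.weight - diet_sub$weight6weeks

# Drop the identifier column by name before computing Pearson correlations
vars <- setdiff(names(diet_sub), "Person")
cor_no_id <- cor(diet_sub[, vars], method = "pearson")

# Correlation of each variable with weight loss, rounded for reading
round(cor_no_id["weightdifference", ], 2)
```

The resulting matrix can be passed to corrplot() in the same way as CHO. Because weightdifference is computed from pre.weight and weight6weeks, its correlations with those two columns are partly built in by construction.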

We will propose a model to answer the initial question: Can we predict weight loss? If yes, which variable is most effective at predicting weight loss?

x <- lm(gender~Diet, data = diet_)
head(diet_)

##   Person gender Age Height pre.weight Diet weight6weeks weightdifference
## 1     25      0  41    171         60    2         60.0              0.0
## 2     26      0  32    174        103    2        103.0              0.0
## 3      1      0  22    159         58    1         54.2              3.8
## 4      2      0  46    192         60    1         54.0              6.0
## 5      3      0  55    170         64    1         63.3              0.7
## 6      4      0  33    171         64    1         61.1              2.9

summary(x)

## 
## Call:
## lm(formula = gender ~ Diet, data = diet_)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4369 -0.4225 -0.4082  0.5775  0.5918 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.39380    0.15380   2.560   0.0124 *
## Diet         0.01436    0.07014   0.205   0.8383  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5004 on 76 degrees of freedom
## Multiple R-squared:  0.0005512,  Adjusted R-squared:  -0.0126 
## F-statistic: 0.04192 on 1 and 76 DF,  p-value: 0.8383
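
The model above regresses gender on Diet, which in effect asks whether the gender mix differs across the diet groups. Because both variables are categorical, a cross-tabulation with a chi-squared test of independence is a more conventional way to ask the same question. The sketch below assumes counts tallied by hand from the data listing above (45 participants coded 0 and 33 coded 1); the names tab and res are hypothetical.

```r
# Gender-by-diet counts tallied from the data listing; columns are diets 1, 2, 3
tab <- matrix(c(14, 10, 16, 11, 15, 12), nrow = 2,
              dimnames = list(gender = c("0", "1"), Diet = c("1", "2", "3")))
tab

# Column totals should match table(diet_data$Diet): 24, 27, 27
colSums(tab)

# Chi-squared test of independence between gender and diet group
res <- chisq.test(tab)
res
```

A large p-value here would be consistent with the regression result: no evidence that gender is distributed differently across the three diets.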

diet_data123 <- lm(weightdifference ~ gender + Age + Height + Diet, data = diet_)
summary(diet_data123)

## 
## Call:
## lm(formula = weightdifference ~ gender + Age + Height + Diet, 
##     data = diet_)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6204 -1.6190  0.1334  1.3499  5.8652 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  6.25706    4.75799   1.315   0.1926  
## gender       0.45433    0.60510   0.751   0.4552  
## Age         -0.00373    0.02909  -0.128   0.8983  
## Height      -0.02507    0.02692  -0.932   0.3546  
## Diet         0.89512    0.35332   2.533   0.0134 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.479 on 73 degrees of freedom
## Multiple R-squared:  0.1049, Adjusted R-squared:  0.0559 
## F-statistic:  2.14 on 4 and 73 DF,  p-value: 0.08444

Diet is the only predictor with a statistically significant coefficient (p = 0.0134); the model as a whole is not significant at the 0.05 level (F-test p = 0.084). We will remove age and gender, which have the largest p-values in this model (0.898 and 0.455), and refit using only diet and height.
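
One way to check that removing two predictors at once does not hurt the fit is a nested-model F-test with anova(). The sketch below runs on 18 rows copied from the data listing (six per diet) so that it is self-contained; the numbers it prints will not match the full-data output. The names diet_sub, full, reduced and cmp are hypothetical.

```r
# 18 rows copied from the data listing (six per diet), not the full dataset
diet_sub <- data.frame(
  gender       = c(0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1),
  Age          = c(22, 46, 55, 39, 40, 43, 44, 37, 20, 51, 38, 45, 51, 35, 21, 36, 26, 40),
  Height       = c(159, 192, 170, 168, 173, 162, 174, 172, 169, 191, 199, 160,
                   165, 169, 159, 155, 179, 167),
  pre.weight   = c(58, 60, 64, 71, 74, 80, 58, 58, 62, 71, 75, 78, 60, 62, 64, 71, 78, 87),
  Diet         = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3),
  weight6weeks = c(54.2, 54.0, 63.3, 71.6, 69.5, 71.0, 60.1, 56.0, 55.0, 66.8, 72.6, 72.7,
                   53.0, 56.4, 60.6, 68.5, 69.4, 77.8)
)
diet_sub$weightdifference <- diet_sub$pre.weight - diet_sub$weight6weeks

# Full model with all four predictors, and the reduced model without gender and Age
full    <- lm(weightdifference ~ gender + Age + Height + Diet, data = diet_sub)
reduced <- lm(weightdifference ~ Height + Diet, data = diet_sub)

# F-test of the two dropped terms jointly; the smaller model goes first
cmp <- anova(reduced, full)
cmp
```

A large value in the Pr(>F) column would suggest that gender and Age add little once Height and Diet are in the model.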

diet_data12345 <- lm(weightdifference ~ Height + Diet, data = diet_)
summary(diet_data12345)

## 
## Call:
## lm(formula = weightdifference ~ Height + Diet, data = diet_)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8512 -1.6492  0.0305  1.4483  5.9469 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  5.10986    4.41751   1.157  0.25105   
## Height      -0.01836    0.02499  -0.735  0.46472   
## Diet         0.91840    0.34667   2.649  0.00983 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.456 on 75 degrees of freedom
## Multiple R-squared:  0.09783,    Adjusted R-squared:  0.07377 
## F-statistic: 4.066 on 2 and 75 DF,  p-value: 0.02106

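This result shows that diet is significant at the 0.01 level (p = 0.00983) when it comes to predicting weight loss. However, the model explains only about 10% of the variance in the dependent variable, weightdifference (multiple R-squared = 0.098), so most of the variation in weight loss remains unexplained.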
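
The models above enter Diet as a number, which assumes that moving from diet 1 to 2 has the same effect as moving from 2 to 3. Since the three diets are categories, an alternative is to wrap Diet in factor() so each diet gets its own coefficient relative to diet 1. The sketch below does this on 18 rows copied from the data listing (six per diet), so its estimates will differ from the full-data output; the names diet_sub, fit_factor and r2 are hypothetical.

```r
# 18 rows copied from the data listing (six per diet), not the full dataset
diet_sub <- data.frame(
  Height       = c(159, 192, 170, 168, 173, 162, 174, 172, 169, 191, 199, 160,
                   165, 169, 159, 155, 179, 167),
  pre.weight   = c(58, 60, 64, 71, 74, 80, 58, 58, 62, 71, 75, 78, 60, 62, 64, 71, 78, 87),
  Diet         = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3),
  weight6weeks = c(54.2, 54.0, 63.3, 71.6, 69.5, 71.0, 60.1, 56.0, 55.0, 66.8, 72.6, 72.7,
                   53.0, 56.4, 60.6, 68.5, 69.4, 77.8)
)
diet_sub$weightdifference <- diet_sub$pre.weight - diet_sub$weight6weeks

# Treat Diet as a categorical predictor; diet 1 becomes the reference level
fit_factor <- lm(weightdifference ~ Height + factor(Diet), data = diet_sub)
summary(fit_factor)

# Share of the variance in weightdifference explained by the model
r2 <- summary(fit_factor)$r.squared
r2
```

The coefficients named factor(Diet)2 and factor(Diet)3 estimate how much more (or less) weight was lost on those diets than on diet 1, at a given height.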

Regression Assignment

Daria Chylak

Intro to Data Science - June 15, 2018
