Background

The purpose of this week’s discussion topic is to build a multiple regression model.

In the building of this model, we’re to include (1) quadratic term, (1) dichotomous term, and (1) dichotomous vs. quantitative interaction term.

Once built, we’re to interpret all coefficients, conduct residual analysis, and determine whether or not the linear model was appropriate as well as why we feel this way.

Dataset

The source of data is cited APA-style below:

I’d had trouble finding / deciding on a dataset based on the characteristics above and ultimately decided on this one because it checked the box of a dichotomous term (gender) and it seemed to have potential for a quadratic term (ie. being that BMI ~ m / h^2), and there’s a relationship between weight, height, BMI, and sex. Thus, it appears that this dataset would check all the boxes for the multiple regression model we’re trying to produce.

Load data

After downloading the .csv file from Kaggle and uploading it to Github, we read the corresponding data (in raw form) and then familiarize ourselves with the dataset by displaying column names, column number, row number, the 1st 6 observations.

#Read .csv data
bmi_data <- read_csv("https://raw.githubusercontent.com/Magnus-PS/CUNY-SPS-DATA-605/main/500person.csv")
## 
## -- Column specification --------------------------------------------------------------------
## cols(
##   Gender = col_character(),
##   Height = col_double(),
##   Weight = col_double(),
##   Index = col_double()
## )
#Familiarize ourselves with the dataset
colnames(bmi_data)
## [1] "Gender" "Height" "Weight" "Index"
ncol(bmi_data)
## [1] 4
nrow(bmi_data)
## [1] 500

Our data has 4 columns and 500 rows.

Our four columns are: Gender (categorical, dichotomous), Height (numeric, quadratic), Weight (numeric), and Index. Note that we will derive a dichotomous variable from a quantitative variable for Weight. I was not 100% certain but this was how I interpreted this spec.

Original model

We’ll take the Index as our dependent variable and the Gender, Height, and Weight as our independent variables and thus explore the relationship between. We’ll explore whether these variables make up an appropriate multiple regression model.

Before fitting our model, we explore the head and summary statistics for our dataset:

#Observe the 1st six observations
head(bmi_data)
## # A tibble: 6 x 4
##   Gender Height Weight Index
##   <chr>   <dbl>  <dbl> <dbl>
## 1 Male      174     96     4
## 2 Male      189     87     2
## 3 Female    185    110     4
## 4 Female    195    104     3
## 5 Male      149     61     3
## 6 Male      189    104     3
#Explore summary statistics for bmi_data
summary(bmi_data)
##     Gender              Height          Weight        Index      
##  Length:500         Min.   :140.0   Min.   : 50   Min.   :0.000  
##  Class :character   1st Qu.:156.0   1st Qu.: 80   1st Qu.:3.000  
##  Mode  :character   Median :170.5   Median :106   Median :4.000  
##                     Mean   :169.9   Mean   :106   Mean   :3.748  
##                     3rd Qu.:184.0   3rd Qu.:136   3rd Qu.:5.000  
##                     Max.   :199.0   Max.   :160   Max.   :5.000

From this we get a feel for the range of values that we’re dealing with and how the different variables may relate to our dependent Index variable.

BMI can be derived as mass / height ^2.

And from the above statistics it appears we’re dealing with metric systems units (kgs and ms) and thus no additional conversions will be necessary.

Additionally, it’s useful to note the descriptions the author provided for different weight ranges:

0 - Extremely Weak 1 - Weak 2 - Normal 3 - Overweight 4 - Obesity 5 - Extreme Obesity

Thus our exploration could reveal the correlation between gender, height, weight and the BMI category (Index) by making use of the following linear model:

Index = B1 x Gender + B2 x Weight + B3 x Height^2

We will first explore our multiple regression model without dichotomizing our Weight variable and then we’ll revisit it after dichotomization and a quadratic Height variable.

Summary statistics without variable “adjustments”:

#Use the lm() function, bmi_data, and the linear model defined above:
bmi.lm <- lm(Index ~ Gender + Weight + Height, data=bmi_data)

#Explore summary statistics
summary(bmi.lm)
## 
## Call:
## lm(formula = Index ~ Gender + Weight + Height, data = bmi_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3228 -0.3861  0.0654  0.3917  1.0970 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.1002043  0.2787214  21.886   <2e-16 ***
## GenderMale   0.0369855  0.0507158   0.729    0.466    
## Weight       0.0336697  0.0007836  42.969   <2e-16 ***
## Height      -0.0349487  0.0015497 -22.552   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5668 on 496 degrees of freedom
## Multiple R-squared:  0.8261, Adjusted R-squared:  0.825 
## F-statistic: 785.4 on 3 and 496 DF,  p-value: < 2.2e-16

From the above statistics we see that Gender provides little-to-no indication regarding an individual’s BMI yet Height and Weight are strong indicators. As noted by their respective low p-values and the relatively high R^2 value.

Adjusted model

Let’s now see what happens when we make our variable adjustments. First we’ll adapt our weight to be a dichotomous variable by making use of the dicho() function in the sjmisc library. The function dichotomizes the variable / column in question based on the median variable (assigning 0 below and 1 above this value).

library(sjmisc)
## Warning: package 'sjmisc' was built under R version 4.0.3
## 
## Attaching package: 'sjmisc'
## The following object is masked from 'package:purrr':
## 
##     is_empty
## The following object is masked from 'package:tidyr':
## 
##     replace_na
## The following object is masked from 'package:tibble':
## 
##     add_case
#Dichotomize Weight variable based on Median value
x <- dicho(bmi_data$Weight, val.labels = c("GT_Median", "LTE_Median"))
frq(x)
## 
## x <categorical>
## # total N=500  valid N=500  mean=0.49  sd=0.50
## 
## Value |      Label |   N | Raw % | Valid % | Cum. %
## ---------------------------------------------------
##     0 |  GT_Median | 253 | 50.60 |   50.60 |  50.60
##     1 | LTE_Median | 247 | 49.40 |   49.40 | 100.00
##  <NA> |       <NA> |   0 |  0.00 |    <NA> |   <NA>
bmi_data$Weight <- x

head(bmi_data)
## # A tibble: 6 x 4
##   Gender Height Weight Index
##   <chr>   <dbl> <fct>  <dbl>
## 1 Male      174 0          4
## 2 Male      189 0          2
## 3 Female    185 1          4
## 4 Female    195 0          3
## 5 Male      149 0          3
## 6 Male      189 0          3

Based on this output it seems our dichotomization does a good job of splitting the variable entries based on Weight and thus our dichotomization of a quantitative variable has been completed and we can assess the effects of this as well as squaring the Height on our linear model:

#Use the lm() function, bmi_data, and the linear model defined above:
bmi2.lm <- lm(Index ~ Gender + Weight + Height^2, data=bmi_data)

#Explore summary statistics
summary(bmi2.lm)
## 
## Call:
## lm(formula = Index ~ Gender + Weight + Height^2, data = bmi_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.07030 -0.49419  0.00189  0.60360  1.78840 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.828259   0.391790  22.533   <2e-16 ***
## GenderMale   0.057952   0.074366   0.779    0.436    
## Weight1      1.810940   0.074347  24.358   <2e-16 ***
## Height      -0.035325   0.002273 -15.544   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8311 on 496 degrees of freedom
## Multiple R-squared:  0.626,  Adjusted R-squared:  0.6238 
## F-statistic: 276.8 on 3 and 496 DF,  p-value: < 2.2e-16

There are notable coefficient differences from the original model to our adjusted model. The y-intercept (6.1 –> 8.8), Gender (0.037 –> 0.058), and Weight (0.034 –> 1.810) coeffficients all increased whereas the Height coefficient maintained a consistent value ~-0.035.

Aside from the aforementioned coefficient differences, the above statistics are similar to those we observed earlier:

  • Gender has little correlation / a high p-value
  • Weight1 and Height have strong correlations / low p-values
  • Our R-squared value, while lower than the original model, is still relatively high.

These are all indicators of a strong model.

Visualization and Analysis

It appears we have a good fit and can explore further via residual analysis:

#Subset graphs so we can display all (3) on one output
par(mfrow=c(2,2))

#Histogram of residuals
hist(resid(bmi2.lm), main = "Histogram of Residuals", xlab= "") 

#Residuals plot
plot(resid(bmi2.lm), fitted(bmi2.lm), main = "Residuals Plot") 

#Q-Q plot
qqnorm(resid(bmi2.lm))
qqline(resid(bmi2.lm))

Our summary statistics provided a vote of confidence for our linear model and now we can interpret the plots in determining the strength of our model.

  • Histogram: the histogram appears to be unimodal and symmetric. A normal distribution of residuals is what we want and it’s what we got. PASS

  • Residual Plot: we see obvious patterns (6 negatively sloped lines) but I’d seen similar patterns in my data last week and jumped to conclusions. Lesson learned. This week, we’ll mark it as reason to be concerned regarding the strength of our model but not reason to throw out the model altogether. We’ll conditionally pass based on this plot. PASS

  • Q-Q Plot: From our Q-Q plot, we observe a slight bend below the plotted line early and a slight bend below the plotted line later on but in general our residuals roughly follow the plotted line and support that we have a strong model. PASS

And so the linear model Index = (0.058 x Gender) + (1.811 x Weight) - (0.035 x Height^2) is an appropriate model.