The purpose of this week’s discussion topic is to build a multiple regression model.
In the building of this model, we’re to include (1) quadratic term, (1) dichotomous term, and (1) dichotomous vs. quantitative interaction term.
Once built, we’re to interpret all coefficients, conduct residual analysis, and determine whether or not the linear model was appropriate as well as why we feel this way.
The source of data is cited APA-style below:
I’d had trouble finding / deciding on a dataset based on the characteristics above and ultimately decided on this one because it checked the box of a dichotomous term (gender) and it seemed to have potential for a quadratic term (ie. being that BMI ~ m / h^2), and there’s a relationship between weight, height, BMI, and sex. Thus, it appears that this dataset would check all the boxes for the multiple regression model we’re trying to produce.
After downloading the .csv file from Kaggle and uploading it to Github, we read the corresponding data (in raw form) and then familiarize ourselves with the dataset by displaying column names, column number, row number, the 1st 6 observations.
#Read .csv data
bmi_data <- read_csv("https://raw.githubusercontent.com/Magnus-PS/CUNY-SPS-DATA-605/main/500person.csv")
##
## -- Column specification --------------------------------------------------------------------
## cols(
## Gender = col_character(),
## Height = col_double(),
## Weight = col_double(),
## Index = col_double()
## )
## [1] "Gender" "Height" "Weight" "Index"
## [1] 4
## [1] 500
Our data has 4 columns and 500 rows.
Our four columns are: Gender (categorical, dichotomous), Height (numeric, quadratic), Weight (numeric), and Index. Note that we will derive a dichotomous variable from a quantitative variable for Weight. I was not 100% certain but this was how I interpreted this spec.
We’ll take the Index as our dependent variable and the Gender, Height, and Weight as our independent variables and thus explore the relationship between. We’ll explore whether these variables make up an appropriate multiple regression model.
Before fitting our model, we explore the head and summary statistics for our dataset:
## # A tibble: 6 x 4
## Gender Height Weight Index
## <chr> <dbl> <dbl> <dbl>
## 1 Male 174 96 4
## 2 Male 189 87 2
## 3 Female 185 110 4
## 4 Female 195 104 3
## 5 Male 149 61 3
## 6 Male 189 104 3
## Gender Height Weight Index
## Length:500 Min. :140.0 Min. : 50 Min. :0.000
## Class :character 1st Qu.:156.0 1st Qu.: 80 1st Qu.:3.000
## Mode :character Median :170.5 Median :106 Median :4.000
## Mean :169.9 Mean :106 Mean :3.748
## 3rd Qu.:184.0 3rd Qu.:136 3rd Qu.:5.000
## Max. :199.0 Max. :160 Max. :5.000
From this we get a feel for the range of values that we’re dealing with and how the different variables may relate to our dependent Index variable.
BMI can be derived as mass / height ^2.
And from the above statistics it appears we’re dealing with metric systems units (kgs and ms) and thus no additional conversions will be necessary.
Additionally, it’s useful to note the descriptions the author provided for different weight ranges:
0 - Extremely Weak 1 - Weak 2 - Normal 3 - Overweight 4 - Obesity 5 - Extreme Obesity
Thus our exploration could reveal the correlation between gender, height, weight and the BMI category (Index) by making use of the following linear model:
Index = B1 x Gender + B2 x Weight + B3 x Height^2
We will first explore our multiple regression model without dichotomizing our Weight variable and then we’ll revisit it after dichotomization and a quadratic Height variable.
Summary statistics without variable “adjustments”:
#Use the lm() function, bmi_data, and the linear model defined above:
bmi.lm <- lm(Index ~ Gender + Weight + Height, data=bmi_data)
#Explore summary statistics
summary(bmi.lm)
##
## Call:
## lm(formula = Index ~ Gender + Weight + Height, data = bmi_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3228 -0.3861 0.0654 0.3917 1.0970
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.1002043 0.2787214 21.886 <2e-16 ***
## GenderMale 0.0369855 0.0507158 0.729 0.466
## Weight 0.0336697 0.0007836 42.969 <2e-16 ***
## Height -0.0349487 0.0015497 -22.552 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5668 on 496 degrees of freedom
## Multiple R-squared: 0.8261, Adjusted R-squared: 0.825
## F-statistic: 785.4 on 3 and 496 DF, p-value: < 2.2e-16
From the above statistics we see that Gender provides little-to-no indication regarding an individual’s BMI yet Height and Weight are strong indicators. As noted by their respective low p-values and the relatively high R^2 value.
Let’s now see what happens when we make our variable adjustments. First we’ll adapt our weight to be a dichotomous variable by making use of the dicho() function in the sjmisc library. The function dichotomizes the variable / column in question based on the median variable (assigning 0 below and 1 above this value).
## Warning: package 'sjmisc' was built under R version 4.0.3
##
## Attaching package: 'sjmisc'
## The following object is masked from 'package:purrr':
##
## is_empty
## The following object is masked from 'package:tidyr':
##
## replace_na
## The following object is masked from 'package:tibble':
##
## add_case
#Dichotomize Weight variable based on Median value
x <- dicho(bmi_data$Weight, val.labels = c("GT_Median", "LTE_Median"))
frq(x)
##
## x <categorical>
## # total N=500 valid N=500 mean=0.49 sd=0.50
##
## Value | Label | N | Raw % | Valid % | Cum. %
## ---------------------------------------------------
## 0 | GT_Median | 253 | 50.60 | 50.60 | 50.60
## 1 | LTE_Median | 247 | 49.40 | 49.40 | 100.00
## <NA> | <NA> | 0 | 0.00 | <NA> | <NA>
## # A tibble: 6 x 4
## Gender Height Weight Index
## <chr> <dbl> <fct> <dbl>
## 1 Male 174 0 4
## 2 Male 189 0 2
## 3 Female 185 1 4
## 4 Female 195 0 3
## 5 Male 149 0 3
## 6 Male 189 0 3
Based on this output it seems our dichotomization does a good job of splitting the variable entries based on Weight and thus our dichotomization of a quantitative variable has been completed and we can assess the effects of this as well as squaring the Height on our linear model:
#Use the lm() function, bmi_data, and the linear model defined above:
bmi2.lm <- lm(Index ~ Gender + Weight + Height^2, data=bmi_data)
#Explore summary statistics
summary(bmi2.lm)
##
## Call:
## lm(formula = Index ~ Gender + Weight + Height^2, data = bmi_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.07030 -0.49419 0.00189 0.60360 1.78840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.828259 0.391790 22.533 <2e-16 ***
## GenderMale 0.057952 0.074366 0.779 0.436
## Weight1 1.810940 0.074347 24.358 <2e-16 ***
## Height -0.035325 0.002273 -15.544 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8311 on 496 degrees of freedom
## Multiple R-squared: 0.626, Adjusted R-squared: 0.6238
## F-statistic: 276.8 on 3 and 496 DF, p-value: < 2.2e-16
There are notable coefficient differences from the original model to our adjusted model. The y-intercept (6.1 –> 8.8), Gender (0.037 –> 0.058), and Weight (0.034 –> 1.810) coeffficients all increased whereas the Height coefficient maintained a consistent value ~-0.035.
Aside from the aforementioned coefficient differences, the above statistics are similar to those we observed earlier:
These are all indicators of a strong model.
It appears we have a good fit and can explore further via residual analysis:
#Subset graphs so we can display all (3) on one output
par(mfrow=c(2,2))
#Histogram of residuals
hist(resid(bmi2.lm), main = "Histogram of Residuals", xlab= "")
#Residuals plot
plot(resid(bmi2.lm), fitted(bmi2.lm), main = "Residuals Plot")
#Q-Q plot
qqnorm(resid(bmi2.lm))
qqline(resid(bmi2.lm))
Our summary statistics provided a vote of confidence for our linear model and now we can interpret the plots in determining the strength of our model.
Histogram: the histogram appears to be unimodal and symmetric. A normal distribution of residuals is what we want and it’s what we got. PASS
Residual Plot: we see obvious patterns (6 negatively sloped lines) but I’d seen similar patterns in my data last week and jumped to conclusions. Lesson learned. This week, we’ll mark it as reason to be concerned regarding the strength of our model but not reason to throw out the model altogether. We’ll conditionally pass based on this plot. PASS
Q-Q Plot: From our Q-Q plot, we observe a slight bend below the plotted line early and a slight bend below the plotted line later on but in general our residuals roughly follow the plotted line and support that we have a strong model. PASS
And so the linear model Index = (0.058 x Gender) + (1.811 x Weight) - (0.035 x Height^2) is an appropriate model.