Harold Nelson
9/3/2020
I want to explore the relationships among age, gender, height, and weight.
I’ll use a sample of records from the Behavioral Risk Factors Surveillance System (BRFSS) conducted by the Centers for Disease Control (CDC). The data I used is available from Openintro.org. See https://www.openintro.org/book/statdata/?data=cdc.
Make a few packages available.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## 'data.frame': 20000 obs. of 9 variables:
## $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
## $ exerany : num 0 0 1 1 0 1 1 0 0 1 ...
## $ hlthplan: num 1 1 1 1 1 1 1 1 1 1 ...
## $ smoke100: num 0 1 1 0 0 0 0 0 1 0 ...
## $ height : num 70 64 60 66 61 64 71 67 65 70 ...
## $ weight : int 175 125 105 132 150 114 194 170 150 180 ...
## $ wtdesire: int 175 115 105 124 130 114 185 160 130 170 ...
## $ age : int 77 33 49 42 55 55 31 45 27 44 ...
## $ gender : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...
Before cleaning, create a new variable BMI, the body mass index. This corrects for the influence of height on weight in looking for obesity.
The body mass index (BMI) is a measure which incorprates both height and weight.
The standard interpetation of this measure is as follows:
cdc$BMI = (cdc$weight*703)/(cdc$height)^2
cdc$BMIDes = (cdc$wtdesire*703)/(cdc$height)^2
cdc$DesActRatio = cdc$BMIDes/cdc$BMI
cdc$BMICat = cut(cdc$BMI,c(18,5,24.9,29.9,39.9,200),labels =
c("Underweight","Normal","Overweight",
"Obese","Morbidly Obese"),include.lowest=T)
cdc$BMIDesCat = cut(cdc$BMIDes,c(18,5,24.9,29.9,39.9,200),labels =
c("Underweight","Normal","Overweight",
"Obese","Morbidly Obese"),include.lowest=T)
cdc$ageCat = cut_number(cdc$age,n=4,labels=c("18-31","32-43","44-57","58-99"))
table(cdc$BMICat,cdc$BMIDesCat)
##
## Underweight Normal Overweight Obese Morbidly Obese
## Underweight 124 139 8 0 0
## Normal 79 8065 304 13 0
## Overweight 8 3392 3850 45 1
## Obese 9 826 2098 602 6
## Morbidly Obese 6 81 191 138 15
## genhlth exerany hlthplan smoke100
## excellent:4657 Min. :0.0000 Min. :0.0000 Min. :0.0000
## very good:6972 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## good :5675 Median :1.0000 Median :1.0000 Median :0.0000
## fair :2019 Mean :0.7457 Mean :0.8738 Mean :0.4721
## poor : 677 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## height weight wtdesire age gender
## Min. :48.00 Min. : 68.0 Min. : 68.0 Min. :18.00 m: 9569
## 1st Qu.:64.00 1st Qu.:140.0 1st Qu.:130.0 1st Qu.:31.00 f:10431
## Median :67.00 Median :165.0 Median :150.0 Median :43.00
## Mean :67.18 Mean :169.7 Mean :155.1 Mean :45.07
## 3rd Qu.:70.00 3rd Qu.:190.0 3rd Qu.:175.0 3rd Qu.:57.00
## Max. :93.00 Max. :500.0 Max. :680.0 Max. :99.00
## BMI BMIDes DesActRatio BMICat
## Min. :12.40 Min. : 8.128 Min. :0.2667 Underweight : 271
## 1st Qu.:22.71 1st Qu.: 21.727 1st Qu.:0.8710 Normal :8461
## Median :25.60 Median : 23.746 Median :0.9444 Overweight :7296
## Mean :26.31 Mean : 23.971 Mean :0.9268 Obese :3541
## 3rd Qu.:28.89 3rd Qu.: 25.799 3rd Qu.:1.0000 Morbidly Obese: 431
## Max. :73.09 Max. :100.407 Max. :3.7778
## BMIDesCat ageCat
## Underweight : 226 18-31:5087
## Normal :12503 32-43:5263
## Overweight : 6451 44-57:4787
## Obese : 798 58-99:4863
## Morbidly Obese: 22
##
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 17534 very good 1 0 0 93 179 100 31 m
## BMI BMIDes DesActRatio BMICat BMIDesCat ageCat
## 17534 14.54931 8.128107 0.5586592 Underweight Underweight 18-31
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 18743 good 1 1 1 52 68 68 44 f
## BMI BMIDes DesActRatio BMICat BMIDesCat ageCat
## 18743 17.67899 17.67899 1 Underweight Underweight 44-57
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 16874 good 0 1 0 69 180 680 24 m
## BMI BMIDes DesActRatio BMICat BMIDesCat ageCat
## 16874 26.57845 100.4075 3.777778 Overweight Morbidly Obese 18-31
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 10034 very good 1 1 1 73 290 601 56 m
## 13086 good 0 1 1 62 300 300 48 f
## 13607 very good 0 1 0 69 350 350 33 f
## 16874 good 0 1 0 69 180 680 24 m
## BMI BMIDes DesActRatio BMICat BMIDesCat ageCat
## 10034 38.25671 79.28373 2.072414 Obese Morbidly Obese 44-57
## 13086 54.86472 54.86472 1.000000 Morbidly Obese Morbidly Obese 44-57
## 13607 51.68032 51.68032 1.000000 Morbidly Obese Morbidly Obese 32-43
## 16874 26.57845 100.40748 3.777778 Overweight Morbidly Obese 18-31
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 17534 very good 1 0 0 93 179 100 31 m
## BMI BMIDes DesActRatio BMICat BMIDesCat ageCat
## 17534 14.54931 8.128107 0.5586592 Underweight Underweight 18-31
## Rejects
## FALSE TRUE
## 19997 3
## Keepers
## FALSE TRUE
## 3 19997
## genhlth exerany hlthplan smoke100
## excellent:4657 Min. :0.0000 Min. :0.0000 Min. :0.0000
## very good:6970 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## good :5674 Median :1.0000 Median :1.0000 Median :0.0000
## fair :2019 Mean :0.7457 Mean :0.8738 Mean :0.4721
## poor : 677 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## height weight wtdesire age gender
## Min. :48.00 Min. : 68.0 Min. : 68 Min. :18.00 m: 9566
## 1st Qu.:64.00 1st Qu.:140.0 1st Qu.:130 1st Qu.:31.00 f:10431
## Median :67.00 Median :165.0 Median :150 Median :43.00
## Mean :67.18 Mean :169.7 Mean :155 Mean :45.07
## 3rd Qu.:70.00 3rd Qu.:190.0 3rd Qu.:175 3rd Qu.:57.00
## Max. :84.00 Max. :500.0 Max. :350 Max. :99.00
## BMI BMIDes DesActRatio BMICat
## Min. :12.40 Min. :10.44 Min. :0.2667 Underweight : 270
## 1st Qu.:22.71 1st Qu.:21.73 1st Qu.:0.8710 Normal :8461
## Median :25.60 Median :23.75 Median :0.9444 Overweight :7295
## Mean :26.31 Mean :23.96 Mean :0.9266 Obese :3540
## 3rd Qu.:28.89 3rd Qu.:25.80 3rd Qu.:1.0000 Morbidly Obese: 431
## Max. :73.09 Max. :54.86 Max. :1.9681
## BMIDesCat ageCat
## Underweight : 225 18-31:5085
## Normal :12503 32-43:5263
## Overweight : 6451 44-57:4786
## Obese : 798 58-99:4863
## Morbidly Obese: 20
##
The following graphics demonstrate the ggplot2 package, which incorporates a “grammar of graphics” system. The idea is that any statistical graphic can be built from a small number of components combined in various ways.
Do a scatterplot of height and weight.
The large number of points obscures the detail, but some upward drift to the right is apparent. It makes sense to create a linear model.
##
## Call:
## lm(formula = weight ~ height, data = cdc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107.114 -22.488 -5.309 16.303 315.497
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -193.30051 3.84730 -50.24 <2e-16 ***
## height 5.40294 0.05716 94.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.32 on 19995 degrees of freedom
## Multiple R-squared: 0.3088, Adjusted R-squared: 0.3088
## F-statistic: 8935 on 1 and 19995 DF, p-value: < 2.2e-16
This model has a residual standard error of 33.32. Think of this as the typical error made by the model. The estimated coefficient of height indicates that an extra inch of height adds an average 5.40 pounds.
Let’s see what adding gender to the model does.
##
## Call:
## lm(formula = weight ~ height + gender, data = cdc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107.34 -22.55 -5.74 15.69 323.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -117.81883 5.68976 -20.71 <2e-16 ***
## height 4.37206 0.08085 54.08 <2e-16 ***
## genderf -11.93441 0.66714 -17.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.05 on 19994 degrees of freedom
## Multiple R-squared: 0.3197, Adjusted R-squared: 0.3197
## F-statistic: 4699 on 2 and 19994 DF, p-value: < 2.2e-16
There is a small reduction of the residual standard error. The increase in weight associated with an extra inch of height is reduced to 4.37, and we see that after accounting for height, we should reduce the predicted weight for a female by 11.93 pounds.
This section demonstrates some features added to R in the dplyr package, including piping.
cdc %>%
select(height,gender,weight) %>%
group_by(height,gender) %>%
summarize(mean_weight = mean(weight),
sd_weight = sd(weight),
n=n() )%>%
ungroup() -> c2
## `summarise()` has grouped output by 'height'. You can override using the `.groups` argument.
## # A tibble: 57 × 5
## height gender mean_weight sd_weight n
## <dbl> <fct> <dbl> <dbl> <int>
## 1 48 f 105 14.1 2
## 2 49 m 160 NA 1
## 3 50 f 112 NA 1
## 4 51 f 160 14.1 2
## 5 52 f 112. 61.5 2
## 6 53 f 145. 31.8 7
## 7 54 f 98.3 11.5 3
## 8 55 m 90 NA 1
## 9 55 f 135 39.7 3
## 10 56 f 117. 20.5 17
## # … with 47 more rows
We’ve restructured our data into a different shape. Look at it.
## # A tibble: 10 × 5
## height gender mean_weight sd_weight n
## <dbl> <fct> <dbl> <dbl> <int>
## 1 48 f 105 14.1 2
## 2 49 m 160 NA 1
## 3 50 f 112 NA 1
## 4 51 f 160 14.1 2
## 5 52 f 112. 61.5 2
## 6 53 f 145. 31.8 7
## 7 54 f 98.3 11.5 3
## 8 55 m 90 NA 1
## 9 55 f 135 39.7 3
## 10 56 f 117. 20.5 17
## # A tibble: 10 × 5
## height gender mean_weight sd_weight n
## <dbl> <fct> <dbl> <dbl> <int>
## 1 77 m 241. 38.5 78
## 2 77 f 174. 34.6 2
## 3 78 m 241. 49.0 40
## 4 78 f 173. 42.0 3
## 5 79 m 248. 57.3 15
## 6 80 m 222. 77.9 10
## 7 81 m 208. 69.3 3
## 8 82 m 208. 23.3 2
## 9 83 m 265 NA 1
## 10 84 m 260 NA 1
Note that average BMI varies with age for both genders. It rises until about 60 and then declines. The yellow curve is a “loess” smoother, which summarizes the behavior of the points in the scatterplot.
ggplot(data=cdc,aes(x=age,y=BMI,color=BMI))+geom_point()+
geom_smooth(method = "loess",color="yellow")+
facet_wrap(~gender,nrow=2)
## `geom_smooth()` using formula 'y ~ x'
As before the yellow curve is a loess smoother, which describes the overall pattern in the scatterplot. The red line separates the points which indicate a desire to gain weight (above the line) from those which indicate a desire to lose weight (below the line).
d <- ggplot(data = cdc,aes(x=BMI,y=BMIDes,color=DesActRatio))
d1 = d + geom_point(aes(alpha=.1)) +
geom_smooth(method="loess",color="yellow") +
geom_abline(slope=1,intercept=0,color="red")
d1
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'