Gender Height and Weight

Harold Nelson

9/3/2020

A Sample Data Analysis

The Questions

I want to explore the relationships among age, gender, height, and weight.

The Data

I’ll use a sample of records from the Behavioral Risk Factors Surveillance System (BRFSS) conducted by the Centers for Disease Control (CDC). The data I used is available from Openintro.org. See https://www.openintro.org/book/statdata/?data=cdc.

Load the data.

load("cdc.Rdata")

Packages

Make a few packages available.

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Determine the structure of the object CDC.

str(cdc)
## 'data.frame':    20000 obs. of  9 variables:
##  $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
##  $ exerany : num  0 0 1 1 0 1 1 0 0 1 ...
##  $ hlthplan: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ smoke100: num  0 1 1 0 0 0 0 0 1 0 ...
##  $ height  : num  70 64 60 66 61 64 71 67 65 70 ...
##  $ weight  : int  175 125 105 132 150 114 194 170 150 180 ...
##  $ wtdesire: int  175 115 105 124 130 114 185 160 130 170 ...
##  $ age     : int  77 33 49 42 55 55 31 45 27 44 ...
##  $ gender  : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...

BMI

Before cleaning, create a new variable BMI, the body mass index. This corrects for the influence of height on weight in looking for obesity.

The body mass index (BMI) is a measure which incorprates both height and weight.

The standard interpetation of this measure is as follows:

New Variables.

cdc$BMI = (cdc$weight*703)/(cdc$height)^2
cdc$BMIDes = (cdc$wtdesire*703)/(cdc$height)^2
cdc$DesActRatio = cdc$BMIDes/cdc$BMI
cdc$BMICat = cut(cdc$BMI,c(18,5,24.9,29.9,39.9,200),labels = 
       c("Underweight","Normal","Overweight",
       "Obese","Morbidly Obese"),include.lowest=T)
cdc$BMIDesCat = cut(cdc$BMIDes,c(18,5,24.9,29.9,39.9,200),labels = 
       c("Underweight","Normal","Overweight",
       "Obese","Morbidly Obese"),include.lowest=T)
cdc$ageCat = cut_number(cdc$age,n=4,labels=c("18-31","32-43","44-57","58-99"))


table(cdc$BMICat,cdc$BMIDesCat)
##                 
##                  Underweight Normal Overweight Obese Morbidly Obese
##   Underweight            124    139          8     0              0
##   Normal                  79   8065        304    13              0
##   Overweight               8   3392       3850    45              1
##   Obese                    9    826       2098   602              6
##   Morbidly Obese           6     81        191   138             15

Examine the data and look for anomalies

summary(cdc)
##       genhlth        exerany          hlthplan         smoke100     
##  excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  very good:6972   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  good     :5675   Median :1.0000   Median :1.0000   Median :0.0000  
##  fair     :2019   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721  
##  poor     : 677   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      height          weight         wtdesire          age        gender   
##  Min.   :48.00   Min.   : 68.0   Min.   : 68.0   Min.   :18.00   m: 9569  
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   f:10431  
##  Median :67.00   Median :165.0   Median :150.0   Median :43.00            
##  Mean   :67.18   Mean   :169.7   Mean   :155.1   Mean   :45.07            
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00            
##  Max.   :93.00   Max.   :500.0   Max.   :680.0   Max.   :99.00            
##       BMI            BMIDes         DesActRatio                BMICat    
##  Min.   :12.40   Min.   :  8.128   Min.   :0.2667   Underweight   : 271  
##  1st Qu.:22.71   1st Qu.: 21.727   1st Qu.:0.8710   Normal        :8461  
##  Median :25.60   Median : 23.746   Median :0.9444   Overweight    :7296  
##  Mean   :26.31   Mean   : 23.971   Mean   :0.9268   Obese         :3541  
##  3rd Qu.:28.89   3rd Qu.: 25.799   3rd Qu.:1.0000   Morbidly Obese: 431  
##  Max.   :73.09   Max.   :100.407   Max.   :3.7778                        
##           BMIDesCat       ageCat    
##  Underweight   :  226   18-31:5087  
##  Normal        :12503   32-43:5263  
##  Overweight    : 6451   44-57:4787  
##  Obese         :  798   58-99:4863  
##  Morbidly Obese:   22               
## 

Look at the records which have unusual values.

cdc[cdc$height==93,]
##         genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 17534 very good       1        0        0     93    179      100  31      m
##            BMI   BMIDes DesActRatio      BMICat   BMIDesCat ageCat
## 17534 14.54931 8.128107   0.5586592 Underweight Underweight  18-31
cdc[cdc$weight==68,]
##       genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 18743    good       1        1        1     52     68       68  44      f
##            BMI   BMIDes DesActRatio      BMICat   BMIDesCat ageCat
## 18743 17.67899 17.67899           1 Underweight Underweight  44-57
cdc[cdc$wtdesire==680,]
##       genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 16874    good       0        1        0     69    180      680  24      m
##            BMI   BMIDes DesActRatio     BMICat      BMIDesCat ageCat
## 16874 26.57845 100.4075    3.777778 Overweight Morbidly Obese  18-31
cdc[cdc$BMIDes > 50,]
##         genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 10034 very good       1        1        1     73    290      601  56      m
## 13086      good       0        1        1     62    300      300  48      f
## 13607 very good       0        1        0     69    350      350  33      f
## 16874      good       0        1        0     69    180      680  24      m
##            BMI    BMIDes DesActRatio         BMICat      BMIDesCat ageCat
## 10034 38.25671  79.28373    2.072414          Obese Morbidly Obese  44-57
## 13086 54.86472  54.86472    1.000000 Morbidly Obese Morbidly Obese  44-57
## 13607 51.68032  51.68032    1.000000 Morbidly Obese Morbidly Obese  32-43
## 16874 26.57845 100.40748    3.777778     Overweight Morbidly Obese  18-31
cdc[cdc$BMIDes < 10,]
##         genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 17534 very good       1        0        0     93    179      100  31      m
##            BMI   BMIDes DesActRatio      BMICat   BMIDesCat ageCat
## 17534 14.54931 8.128107   0.5586592 Underweight Underweight  18-31

Remove Anomalies

Rejects = cdc$BMIDes < 10 | 
   (cdc$BMIDes > 50 & cdc$DesActRatio > 1.0) 
table(Rejects)
## Rejects
## FALSE  TRUE 
## 19997     3
Keepers = !Rejects
table(Keepers)
## Keepers
## FALSE  TRUE 
##     3 19997
cdc = cdc[Keepers,]
summary(cdc)
##       genhlth        exerany          hlthplan         smoke100     
##  excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  very good:6970   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  good     :5674   Median :1.0000   Median :1.0000   Median :0.0000  
##  fair     :2019   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721  
##  poor     : 677   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      height          weight         wtdesire        age        gender   
##  Min.   :48.00   Min.   : 68.0   Min.   : 68   Min.   :18.00   m: 9566  
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130   1st Qu.:31.00   f:10431  
##  Median :67.00   Median :165.0   Median :150   Median :43.00            
##  Mean   :67.18   Mean   :169.7   Mean   :155   Mean   :45.07            
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175   3rd Qu.:57.00            
##  Max.   :84.00   Max.   :500.0   Max.   :350   Max.   :99.00            
##       BMI            BMIDes       DesActRatio                BMICat    
##  Min.   :12.40   Min.   :10.44   Min.   :0.2667   Underweight   : 270  
##  1st Qu.:22.71   1st Qu.:21.73   1st Qu.:0.8710   Normal        :8461  
##  Median :25.60   Median :23.75   Median :0.9444   Overweight    :7295  
##  Mean   :26.31   Mean   :23.96   Mean   :0.9266   Obese         :3540  
##  3rd Qu.:28.89   3rd Qu.:25.80   3rd Qu.:1.0000   Morbidly Obese: 431  
##  Max.   :73.09   Max.   :54.86   Max.   :1.9681                        
##           BMIDesCat       ageCat    
##  Underweight   :  225   18-31:5085  
##  Normal        :12503   32-43:5263  
##  Overweight    : 6451   44-57:4786  
##  Obese         :  798   58-99:4863  
##  Morbidly Obese:   20               
## 

Basic Facts

The following graphics demonstrate the ggplot2 package, which incorporates a “grammar of graphics” system. The idea is that any statistical graphic can be built from a small number of components combined in various ways.

ggplot(data = cdc,aes(gender,height)) + geom_boxplot() +
ggtitle("Men are Taller")

ggplot(data = cdc,aes(gender,weight)) + geom_boxplot() +
ggtitle("Men are Heavier")

ggplot(data = cdc,aes(gender,BMI)) + 
geom_boxplot() +
ggtitle("BMI Reduces the Gender Gap")

A Model - Height + Gender -> Weight

Do a scatterplot of height and weight.

ggplot(data = cdc,(aes(x = height,y = weight))) +
geom_point()

The large number of points obscures the detail, but some upward drift to the right is apparent. It makes sense to create a linear model.

lm1 = lm(weight~height,data = cdc)
summary(lm1)
## 
## Call:
## lm(formula = weight ~ height, data = cdc)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -107.114  -22.488   -5.309   16.303  315.497 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -193.30051    3.84730  -50.24   <2e-16 ***
## height         5.40294    0.05716   94.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.32 on 19995 degrees of freedom
## Multiple R-squared:  0.3088, Adjusted R-squared:  0.3088 
## F-statistic:  8935 on 1 and 19995 DF,  p-value: < 2.2e-16

This model has a residual standard error of 33.32. Think of this as the typical error made by the model. The estimated coefficient of height indicates that an extra inch of height adds an average 5.40 pounds.

Let’s see what adding gender to the model does.

lm2 = lm(weight~height+gender,data = cdc)
summary(lm2)
## 
## Call:
## lm(formula = weight ~ height + gender, data = cdc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -107.34  -22.55   -5.74   15.69  323.08 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -117.81883    5.68976  -20.71   <2e-16 ***
## height         4.37206    0.08085   54.08   <2e-16 ***
## genderf      -11.93441    0.66714  -17.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.05 on 19994 degrees of freedom
## Multiple R-squared:  0.3197, Adjusted R-squared:  0.3197 
## F-statistic:  4699 on 2 and 19994 DF,  p-value: < 2.2e-16

There is a small reduction of the residual standard error. The increase in weight associated with an extra inch of height is reduced to 4.37, and we see that after accounting for height, we should reduce the predicted weight for a female by 11.93 pounds.

Examine an Aggregate Relationship

This section demonstrates some features added to R in the dplyr package, including piping.

cdc %>%
  select(height,gender,weight) %>%
  group_by(height,gender) %>%
  summarize(mean_weight = mean(weight),
            sd_weight = sd(weight),
            n=n() )%>%
  ungroup() -> c2
## `summarise()` has grouped output by 'height'. You can override using the `.groups` argument.
c2
## # A tibble: 57 × 5
##    height gender mean_weight sd_weight     n
##     <dbl> <fct>        <dbl>     <dbl> <int>
##  1     48 f            105        14.1     2
##  2     49 m            160        NA       1
##  3     50 f            112        NA       1
##  4     51 f            160        14.1     2
##  5     52 f            112.       61.5     2
##  6     53 f            145.       31.8     7
##  7     54 f             98.3      11.5     3
##  8     55 m             90        NA       1
##  9     55 f            135        39.7     3
## 10     56 f            117.       20.5    17
## # … with 47 more rows
p = ggplot(c2,aes(x=height,y=mean_weight,
                  color=gender,size = n))
p + geom_point()

Look at c2

We’ve restructured our data into a different shape. Look at it.

head(c2,10)
## # A tibble: 10 × 5
##    height gender mean_weight sd_weight     n
##     <dbl> <fct>        <dbl>     <dbl> <int>
##  1     48 f            105        14.1     2
##  2     49 m            160        NA       1
##  3     50 f            112        NA       1
##  4     51 f            160        14.1     2
##  5     52 f            112.       61.5     2
##  6     53 f            145.       31.8     7
##  7     54 f             98.3      11.5     3
##  8     55 m             90        NA       1
##  9     55 f            135        39.7     3
## 10     56 f            117.       20.5    17
tail(c2,10)
## # A tibble: 10 × 5
##    height gender mean_weight sd_weight     n
##     <dbl> <fct>        <dbl>     <dbl> <int>
##  1     77 m             241.      38.5    78
##  2     77 f             174.      34.6     2
##  3     78 m             241.      49.0    40
##  4     78 f             173.      42.0     3
##  5     79 m             248.      57.3    15
##  6     80 m             222.      77.9    10
##  7     81 m             208.      69.3     3
##  8     82 m             208.      23.3     2
##  9     83 m             265       NA       1
## 10     84 m             260       NA       1
c2 %>% 
  ggplot(aes(x = height, y = mean_weight, size = n,color=gender)) + geom_point()

BMI and Age

Note that average BMI varies with age for both genders. It rises until about 60 and then declines. The yellow curve is a “loess” smoother, which summarizes the behavior of the points in the scatterplot.

ggplot(data=cdc,aes(x=age,y=BMI,color=BMI))+geom_point()+
  geom_smooth(method = "loess",color="yellow")+
  facet_wrap(~gender,nrow=2)
## `geom_smooth()` using formula 'y ~ x'

Exploring Intentions

As before the yellow curve is a loess smoother, which describes the overall pattern in the scatterplot. The red line separates the points which indicate a desire to gain weight (above the line) from those which indicate a desire to lose weight (below the line).

d <- ggplot(data = cdc,aes(x=BMI,y=BMIDes,color=DesActRatio)) 
              
d1 = d + geom_point(aes(alpha=.1)) +
  geom_smooth(method="loess",color="yellow") +
  geom_abline(slope=1,intercept=0,color="red")
d1
## `geom_smooth()` using formula 'y ~ x'

d2 = d1 +
  facet_wrap(~gender,nrow=2) +
  ggtitle("Actual vs Desired BMI by Gender")
d2  
## `geom_smooth()` using formula 'y ~ x'

d3 = d1 + 
  facet_grid(gender~ageCat) +
  ggtitle("Actual vs Desired BMI by Gender and Age")
d3
## `geom_smooth()` using formula 'y ~ x'