PASS 10-19-2016

Harold Nelson

10/14/2016

Preliminaries

The presenter: Harold Nelson
* BS in Math. Notre Dame
* MS in Math Univ, of Kentucky
* Phd in Economics (Econometrics) UCSD
* Many years working in government, SAS

R
* Reverse-engineered version of S created by academics in New Zealand around 1995.
* Open source
* Assumes all data in RAM, does not use external memory - No longer true.
* Extremely extensible - about 12,000 “packages.”

Let’s diverge and look at R and RStudio, Using R alone is now unusual.

Analysis of a Dataframe

I want to explore the relationships among age, gender and obesity.

I’ll use a sample of records from the Behavioral Risk Factors Surveillance System (BRFSS) conducted by the Centers for Disease Control (CDC).

First, make the data and a few packages available.

load("~/cdc.Rdata")
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Determine the structure of the object CDC.

str(cdc)
## 'data.frame':    20000 obs. of  9 variables:
##  $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
##  $ exerany : num  0 0 1 1 0 1 1 0 0 1 ...
##  $ hlthplan: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ smoke100: num  0 1 1 0 0 0 0 0 1 0 ...
##  $ height  : num  70 64 60 66 61 64 71 67 65 70 ...
##  $ weight  : int  175 125 105 132 150 114 194 170 150 180 ...
##  $ wtdesire: int  175 115 105 124 130 114 185 160 130 170 ...
##  $ age     : int  77 33 49 42 55 55 31 45 27 44 ...
##  $ gender  : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...

BMI

The body mass index (BMI) is a measure which incorprates both height and weight.

The standard interpetation of this measure is as follows:

New Variables.

cdc$BMI = (cdc$weight*703)/(cdc$height)^2
cdc$BMIDes = (cdc$wtdesire*703)/(cdc$height)^2
cdc$DesActRatio = cdc$BMIDes/cdc$BMI
cdc$BMICat = cut(cdc$BMI,c(18,5,24.9,29.9,39.9,200),labels = 
       c("Underweight","Normal","Overweight",
       "Obese","Morbidly Obese"),include.lowest=T)
cdc$BMIDesCat = cut(cdc$BMIDes,c(18,5,24.9,29.9,39.9,200),labels = 
       c("Underweight","Normal","Overweight",
       "Obese","Morbidly Obese"),include.lowest=T)
cdc$ageCat = cut_number(cdc$age,n=4,labels=c("18-31","32-43","44-57","58-99"))


table(cdc$BMICat,cdc$BMIDesCat)
##                 
##                  Underweight Normal Overweight Obese Morbidly Obese
##   Underweight            124    139          8     0              0
##   Normal                  79   8065        304    13              0
##   Overweight               8   3392       3850    45              1
##   Obese                    9    826       2098   602              6
##   Morbidly Obese           6     81        191   138             15

Examine the data and look for anomalies

summary(cdc)
##       genhlth        exerany          hlthplan         smoke100     
##  excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  very good:6972   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  good     :5675   Median :1.0000   Median :1.0000   Median :0.0000  
##  fair     :2019   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721  
##  poor     : 677   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      height          weight         wtdesire          age        gender   
##  Min.   :48.00   Min.   : 68.0   Min.   : 68.0   Min.   :18.00   m: 9569  
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   f:10431  
##  Median :67.00   Median :165.0   Median :150.0   Median :43.00            
##  Mean   :67.18   Mean   :169.7   Mean   :155.1   Mean   :45.07            
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00            
##  Max.   :93.00   Max.   :500.0   Max.   :680.0   Max.   :99.00            
##       BMI            BMIDes         DesActRatio                BMICat    
##  Min.   :12.40   Min.   :  8.128   Min.   :0.2667   Underweight   : 271  
##  1st Qu.:22.71   1st Qu.: 21.727   1st Qu.:0.8710   Normal        :8461  
##  Median :25.60   Median : 23.746   Median :0.9444   Overweight    :7296  
##  Mean   :26.31   Mean   : 23.971   Mean   :0.9268   Obese         :3541  
##  3rd Qu.:28.89   3rd Qu.: 25.799   3rd Qu.:1.0000   Morbidly Obese: 431  
##  Max.   :73.09   Max.   :100.407   Max.   :3.7778                        
##           BMIDesCat       ageCat    
##  Underweight   :  226   18-31:5087  
##  Normal        :12503   32-43:5263  
##  Overweight    : 6451   44-57:4787  
##  Obese         :  798   58-99:4863  
##  Morbidly Obese:   22               
## 

Look at the records which have unusual values.

cdc[cdc$height==93,]
##         genhlth exerany hlthplan smoke100 height weight wtdesire age
## 17534 very good       1        0        0     93    179      100  31
##       gender      BMI   BMIDes DesActRatio      BMICat   BMIDesCat ageCat
## 17534      m 14.54931 8.128107   0.5586592 Underweight Underweight  18-31
cdc[cdc$weight==68,]
##       genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 18743    good       1        1        1     52     68       68  44      f
##            BMI   BMIDes DesActRatio      BMICat   BMIDesCat ageCat
## 18743 17.67899 17.67899           1 Underweight Underweight  44-57
cdc[cdc$wtdesire==680,]
##       genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 16874    good       0        1        0     69    180      680  24      m
##            BMI   BMIDes DesActRatio     BMICat      BMIDesCat ageCat
## 16874 26.57845 100.4075    3.777778 Overweight Morbidly Obese  18-31
cdc[cdc$BMIDes > 50,]
##         genhlth exerany hlthplan smoke100 height weight wtdesire age
## 10034 very good       1        1        1     73    290      601  56
## 13086      good       0        1        1     62    300      300  48
## 13607 very good       0        1        0     69    350      350  33
## 16874      good       0        1        0     69    180      680  24
##       gender      BMI    BMIDes DesActRatio         BMICat      BMIDesCat
## 10034      m 38.25671  79.28373    2.072414          Obese Morbidly Obese
## 13086      f 54.86472  54.86472    1.000000 Morbidly Obese Morbidly Obese
## 13607      f 51.68032  51.68032    1.000000 Morbidly Obese Morbidly Obese
## 16874      m 26.57845 100.40748    3.777778     Overweight Morbidly Obese
##       ageCat
## 10034  44-57
## 13086  44-57
## 13607  32-43
## 16874  18-31
cdc[cdc$BMIDes < 10,]
##         genhlth exerany hlthplan smoke100 height weight wtdesire age
## 17534 very good       1        0        0     93    179      100  31
##       gender      BMI   BMIDes DesActRatio      BMICat   BMIDesCat ageCat
## 17534      m 14.54931 8.128107   0.5586592 Underweight Underweight  18-31

Remove Anomalies

Rejects = cdc$BMIDes < 10 | 
   (cdc$BMIDes > 50 & cdc$DesActRatio > 1.0) 
table(Rejects)
## Rejects
## FALSE  TRUE 
## 19997     3
Keepers = !Rejects
table(Keepers)
## Keepers
## FALSE  TRUE 
##     3 19997
cdc = cdc[Keepers,]
summary(cdc)
##       genhlth        exerany          hlthplan         smoke100     
##  excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  very good:6970   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  good     :5674   Median :1.0000   Median :1.0000   Median :0.0000  
##  fair     :2019   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721  
##  poor     : 677   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      height          weight         wtdesire        age        gender   
##  Min.   :48.00   Min.   : 68.0   Min.   : 68   Min.   :18.00   m: 9566  
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130   1st Qu.:31.00   f:10431  
##  Median :67.00   Median :165.0   Median :150   Median :43.00            
##  Mean   :67.18   Mean   :169.7   Mean   :155   Mean   :45.07            
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175   3rd Qu.:57.00            
##  Max.   :84.00   Max.   :500.0   Max.   :350   Max.   :99.00            
##       BMI            BMIDes       DesActRatio                BMICat    
##  Min.   :12.40   Min.   :10.44   Min.   :0.2667   Underweight   : 270  
##  1st Qu.:22.71   1st Qu.:21.73   1st Qu.:0.8710   Normal        :8461  
##  Median :25.60   Median :23.75   Median :0.9444   Overweight    :7295  
##  Mean   :26.31   Mean   :23.96   Mean   :0.9266   Obese         :3540  
##  3rd Qu.:28.89   3rd Qu.:25.80   3rd Qu.:1.0000   Morbidly Obese: 431  
##  Max.   :73.09   Max.   :54.86   Max.   :1.9681                        
##           BMIDesCat       ageCat    
##  Underweight   :  225   18-31:5085  
##  Normal        :12503   32-43:5263  
##  Overweight    : 6451   44-57:4786  
##  Obese         :  798   58-99:4863  
##  Morbidly Obese:   20               
## 

Basic Facts

The following graphics demonstrate the ggplot2 package, which incorporates a “grammar of graphics” system. The idea is that any statistical graphic can be built from a small number of components combined in various ways.

ggplot(data = cdc,aes(gender,height)) + geom_boxplot() +
ggtitle("Men are Taller")

ggplot(data = cdc,aes(gender,weight)) + geom_boxplot() +
ggtitle("Men are Heavier")

ggplot(data = cdc,aes(gender,BMI)) + 
geom_boxplot() +
ggtitle("BMI Reduces the Gender Gap")

A Model - Height + Gender -> Weight

Do a scatterplot of height and weight.

ggplot(data = cdc,(aes(x = height,y = weight))) +
geom_point()

The large number of points obscures the detail, but some upward drift to the right is apparent. It makes sense to create a linear model.

lm1 = lm(weight~height,data = cdc)
summary(lm1)
## 
## Call:
## lm(formula = weight ~ height, data = cdc)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -107.114  -22.488   -5.309   16.303  315.497 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -193.30051    3.84730  -50.24   <2e-16 ***
## height         5.40294    0.05716   94.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.32 on 19995 degrees of freedom
## Multiple R-squared:  0.3088, Adjusted R-squared:  0.3088 
## F-statistic:  8935 on 1 and 19995 DF,  p-value: < 2.2e-16

This model has a residual standard error of 33.29. It indicates that an extra inch of height adds an average 5.39 pounds. Think of this as the typical error made by the model.

Let’s see what adding gender to the model does.

lm2 = lm(weight~height+gender,data = cdc)
summary(lm2)
## 
## Call:
## lm(formula = weight ~ height + gender, data = cdc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -107.34  -22.55   -5.74   15.69  323.08 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -117.81883    5.68976  -20.71   <2e-16 ***
## height         4.37206    0.08085   54.08   <2e-16 ***
## genderf      -11.93441    0.66714  -17.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.05 on 19994 degrees of freedom
## Multiple R-squared:  0.3197, Adjusted R-squared:  0.3197 
## F-statistic:  4699 on 2 and 19994 DF,  p-value: < 2.2e-16

There is a small reduction of the residual standard error. The increase in weight associated with an extra inch of height is reduced, and we see that after accounting for height, we should reduce the predicted weight for a female by 11.93 pounds.

Examine an Aggregate Relationship

This section demonstrates some features added to R in the dplyr package, including piping.

cdc %>%
  select(height,gender,weight) %>%
  group_by(height,gender) %>%
  summarize(mean_weight = mean(weight),
            sd_weight = sd(weight),
            n=n() )%>%
  ungroup() -> c2
c2
## # A tibble: 57 x 5
##    height gender mean_weight sd_weight     n
##     <dbl> <fctr>       <dbl>     <dbl> <int>
## 1      48      f   105.00000  14.14214     2
## 2      49      m   160.00000        NA     1
## 3      50      f   112.00000        NA     1
## 4      51      f   160.00000  14.14214     2
## 5      52      f   111.50000  61.51829     2
## 6      53      f   145.28571  31.82093     7
## 7      54      f    98.33333  11.54701     3
## 8      55      m    90.00000        NA     1
## 9      55      f   135.00000  39.68627     3
## 10     56      f   117.41176  20.47577    17
## # ... with 47 more rows
p = ggplot(c2,aes(x=height,y=mean_weight,
                  color=gender,size = n))
p + geom_point()

BMI and Age

Note that average BMI varies with age for both genders. It rises until about 60 and then declines. The yellow curve is a “loess” smoother, which summarizes the behavior of the points in the scatterplot.

ggplot(data=cdc,aes(x=age,y=BMI,color=BMI))+geom_point()+
  geom_smooth(method = "loess",color="yellow")+
  facet_wrap(~gender,nrow=2)

Exploring Intentions

As before the yellow curve is a loess smoother, which describes the overall pattern in the scatterplot. The red line separates the points which indicate a desire to gain weight (above the line) from those which indicate a desire to lose weight (below the line).

d <- ggplot(data = cdc,aes(x=BMI,y=BMIDes,color=DesActRatio)) 
              
d1 = d + geom_point(aes(alpha=.1)) +
  geom_smooth(method="loess",color="yellow") +
  geom_abline(slope=1,intercept=0,color="red")
d1

d2 = d1 +
  facet_wrap(~gender,nrow=2) +
  ggtitle("Actual vs Desired BMI by Gender")
d2  

d3 = d1 + 
  facet_grid(gender~ageCat) +
  ggtitle("Actual vs Desired BMI by Gender and Age")
d3