Harold Nelson
10/14/2016
The presenter: Harold Nelson
* BS in Math. Notre Dame
* MS in Math Univ, of Kentucky
* Phd in Economics (Econometrics) UCSD
* Many years working in government, SAS
R
* Reverse-engineered version of S created by academics in New Zealand around 1995.
* Open source
* Assumes all data in RAM, does not use external memory - No longer true.
* Extremely extensible - about 12,000 “packages.”
Let’s diverge and look at R and RStudio, Using R alone is now unusual.
I want to explore the relationships among age, gender and obesity.
I’ll use a sample of records from the Behavioral Risk Factors Surveillance System (BRFSS) conducted by the Centers for Disease Control (CDC).
First, make the data and a few packages available.
load("~/cdc.Rdata")
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
str(cdc)
## 'data.frame': 20000 obs. of 9 variables:
## $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
## $ exerany : num 0 0 1 1 0 1 1 0 0 1 ...
## $ hlthplan: num 1 1 1 1 1 1 1 1 1 1 ...
## $ smoke100: num 0 1 1 0 0 0 0 0 1 0 ...
## $ height : num 70 64 60 66 61 64 71 67 65 70 ...
## $ weight : int 175 125 105 132 150 114 194 170 150 180 ...
## $ wtdesire: int 175 115 105 124 130 114 185 160 130 170 ...
## $ age : int 77 33 49 42 55 55 31 45 27 44 ...
## $ gender : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...
The body mass index (BMI) is a measure which incorprates both height and weight.
The standard interpetation of this measure is as follows:
cdc$BMI = (cdc$weight*703)/(cdc$height)^2
cdc$BMIDes = (cdc$wtdesire*703)/(cdc$height)^2
cdc$DesActRatio = cdc$BMIDes/cdc$BMI
cdc$BMICat = cut(cdc$BMI,c(18,5,24.9,29.9,39.9,200),labels =
c("Underweight","Normal","Overweight",
"Obese","Morbidly Obese"),include.lowest=T)
cdc$BMIDesCat = cut(cdc$BMIDes,c(18,5,24.9,29.9,39.9,200),labels =
c("Underweight","Normal","Overweight",
"Obese","Morbidly Obese"),include.lowest=T)
cdc$ageCat = cut_number(cdc$age,n=4,labels=c("18-31","32-43","44-57","58-99"))
table(cdc$BMICat,cdc$BMIDesCat)
##
## Underweight Normal Overweight Obese Morbidly Obese
## Underweight 124 139 8 0 0
## Normal 79 8065 304 13 0
## Overweight 8 3392 3850 45 1
## Obese 9 826 2098 602 6
## Morbidly Obese 6 81 191 138 15
summary(cdc)
## genhlth exerany hlthplan smoke100
## excellent:4657 Min. :0.0000 Min. :0.0000 Min. :0.0000
## very good:6972 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## good :5675 Median :1.0000 Median :1.0000 Median :0.0000
## fair :2019 Mean :0.7457 Mean :0.8738 Mean :0.4721
## poor : 677 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## height weight wtdesire age gender
## Min. :48.00 Min. : 68.0 Min. : 68.0 Min. :18.00 m: 9569
## 1st Qu.:64.00 1st Qu.:140.0 1st Qu.:130.0 1st Qu.:31.00 f:10431
## Median :67.00 Median :165.0 Median :150.0 Median :43.00
## Mean :67.18 Mean :169.7 Mean :155.1 Mean :45.07
## 3rd Qu.:70.00 3rd Qu.:190.0 3rd Qu.:175.0 3rd Qu.:57.00
## Max. :93.00 Max. :500.0 Max. :680.0 Max. :99.00
## BMI BMIDes DesActRatio BMICat
## Min. :12.40 Min. : 8.128 Min. :0.2667 Underweight : 271
## 1st Qu.:22.71 1st Qu.: 21.727 1st Qu.:0.8710 Normal :8461
## Median :25.60 Median : 23.746 Median :0.9444 Overweight :7296
## Mean :26.31 Mean : 23.971 Mean :0.9268 Obese :3541
## 3rd Qu.:28.89 3rd Qu.: 25.799 3rd Qu.:1.0000 Morbidly Obese: 431
## Max. :73.09 Max. :100.407 Max. :3.7778
## BMIDesCat ageCat
## Underweight : 226 18-31:5087
## Normal :12503 32-43:5263
## Overweight : 6451 44-57:4787
## Obese : 798 58-99:4863
## Morbidly Obese: 22
##
cdc[cdc$height==93,]
## genhlth exerany hlthplan smoke100 height weight wtdesire age
## 17534 very good 1 0 0 93 179 100 31
## gender BMI BMIDes DesActRatio BMICat BMIDesCat ageCat
## 17534 m 14.54931 8.128107 0.5586592 Underweight Underweight 18-31
cdc[cdc$weight==68,]
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 18743 good 1 1 1 52 68 68 44 f
## BMI BMIDes DesActRatio BMICat BMIDesCat ageCat
## 18743 17.67899 17.67899 1 Underweight Underweight 44-57
cdc[cdc$wtdesire==680,]
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 16874 good 0 1 0 69 180 680 24 m
## BMI BMIDes DesActRatio BMICat BMIDesCat ageCat
## 16874 26.57845 100.4075 3.777778 Overweight Morbidly Obese 18-31
cdc[cdc$BMIDes > 50,]
## genhlth exerany hlthplan smoke100 height weight wtdesire age
## 10034 very good 1 1 1 73 290 601 56
## 13086 good 0 1 1 62 300 300 48
## 13607 very good 0 1 0 69 350 350 33
## 16874 good 0 1 0 69 180 680 24
## gender BMI BMIDes DesActRatio BMICat BMIDesCat
## 10034 m 38.25671 79.28373 2.072414 Obese Morbidly Obese
## 13086 f 54.86472 54.86472 1.000000 Morbidly Obese Morbidly Obese
## 13607 f 51.68032 51.68032 1.000000 Morbidly Obese Morbidly Obese
## 16874 m 26.57845 100.40748 3.777778 Overweight Morbidly Obese
## ageCat
## 10034 44-57
## 13086 44-57
## 13607 32-43
## 16874 18-31
cdc[cdc$BMIDes < 10,]
## genhlth exerany hlthplan smoke100 height weight wtdesire age
## 17534 very good 1 0 0 93 179 100 31
## gender BMI BMIDes DesActRatio BMICat BMIDesCat ageCat
## 17534 m 14.54931 8.128107 0.5586592 Underweight Underweight 18-31
Rejects = cdc$BMIDes < 10 |
(cdc$BMIDes > 50 & cdc$DesActRatio > 1.0)
table(Rejects)
## Rejects
## FALSE TRUE
## 19997 3
Keepers = !Rejects
table(Keepers)
## Keepers
## FALSE TRUE
## 3 19997
cdc = cdc[Keepers,]
summary(cdc)
## genhlth exerany hlthplan smoke100
## excellent:4657 Min. :0.0000 Min. :0.0000 Min. :0.0000
## very good:6970 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## good :5674 Median :1.0000 Median :1.0000 Median :0.0000
## fair :2019 Mean :0.7457 Mean :0.8738 Mean :0.4721
## poor : 677 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## height weight wtdesire age gender
## Min. :48.00 Min. : 68.0 Min. : 68 Min. :18.00 m: 9566
## 1st Qu.:64.00 1st Qu.:140.0 1st Qu.:130 1st Qu.:31.00 f:10431
## Median :67.00 Median :165.0 Median :150 Median :43.00
## Mean :67.18 Mean :169.7 Mean :155 Mean :45.07
## 3rd Qu.:70.00 3rd Qu.:190.0 3rd Qu.:175 3rd Qu.:57.00
## Max. :84.00 Max. :500.0 Max. :350 Max. :99.00
## BMI BMIDes DesActRatio BMICat
## Min. :12.40 Min. :10.44 Min. :0.2667 Underweight : 270
## 1st Qu.:22.71 1st Qu.:21.73 1st Qu.:0.8710 Normal :8461
## Median :25.60 Median :23.75 Median :0.9444 Overweight :7295
## Mean :26.31 Mean :23.96 Mean :0.9266 Obese :3540
## 3rd Qu.:28.89 3rd Qu.:25.80 3rd Qu.:1.0000 Morbidly Obese: 431
## Max. :73.09 Max. :54.86 Max. :1.9681
## BMIDesCat ageCat
## Underweight : 225 18-31:5085
## Normal :12503 32-43:5263
## Overweight : 6451 44-57:4786
## Obese : 798 58-99:4863
## Morbidly Obese: 20
##
The following graphics demonstrate the ggplot2 package, which incorporates a “grammar of graphics” system. The idea is that any statistical graphic can be built from a small number of components combined in various ways.
ggplot(data = cdc,aes(gender,height)) + geom_boxplot() +
ggtitle("Men are Taller")
ggplot(data = cdc,aes(gender,weight)) + geom_boxplot() +
ggtitle("Men are Heavier")
ggplot(data = cdc,aes(gender,BMI)) +
geom_boxplot() +
ggtitle("BMI Reduces the Gender Gap")
Do a scatterplot of height and weight.
ggplot(data = cdc,(aes(x = height,y = weight))) +
geom_point()
The large number of points obscures the detail, but some upward drift to the right is apparent. It makes sense to create a linear model.
lm1 = lm(weight~height,data = cdc)
summary(lm1)
##
## Call:
## lm(formula = weight ~ height, data = cdc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107.114 -22.488 -5.309 16.303 315.497
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -193.30051 3.84730 -50.24 <2e-16 ***
## height 5.40294 0.05716 94.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.32 on 19995 degrees of freedom
## Multiple R-squared: 0.3088, Adjusted R-squared: 0.3088
## F-statistic: 8935 on 1 and 19995 DF, p-value: < 2.2e-16
This model has a residual standard error of 33.29. It indicates that an extra inch of height adds an average 5.39 pounds. Think of this as the typical error made by the model.
Let’s see what adding gender to the model does.
lm2 = lm(weight~height+gender,data = cdc)
summary(lm2)
##
## Call:
## lm(formula = weight ~ height + gender, data = cdc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107.34 -22.55 -5.74 15.69 323.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -117.81883 5.68976 -20.71 <2e-16 ***
## height 4.37206 0.08085 54.08 <2e-16 ***
## genderf -11.93441 0.66714 -17.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.05 on 19994 degrees of freedom
## Multiple R-squared: 0.3197, Adjusted R-squared: 0.3197
## F-statistic: 4699 on 2 and 19994 DF, p-value: < 2.2e-16
There is a small reduction of the residual standard error. The increase in weight associated with an extra inch of height is reduced, and we see that after accounting for height, we should reduce the predicted weight for a female by 11.93 pounds.
This section demonstrates some features added to R in the dplyr package, including piping.
cdc %>%
select(height,gender,weight) %>%
group_by(height,gender) %>%
summarize(mean_weight = mean(weight),
sd_weight = sd(weight),
n=n() )%>%
ungroup() -> c2
c2
## # A tibble: 57 x 5
## height gender mean_weight sd_weight n
## <dbl> <fctr> <dbl> <dbl> <int>
## 1 48 f 105.00000 14.14214 2
## 2 49 m 160.00000 NA 1
## 3 50 f 112.00000 NA 1
## 4 51 f 160.00000 14.14214 2
## 5 52 f 111.50000 61.51829 2
## 6 53 f 145.28571 31.82093 7
## 7 54 f 98.33333 11.54701 3
## 8 55 m 90.00000 NA 1
## 9 55 f 135.00000 39.68627 3
## 10 56 f 117.41176 20.47577 17
## # ... with 47 more rows
p = ggplot(c2,aes(x=height,y=mean_weight,
color=gender,size = n))
p + geom_point()
Note that average BMI varies with age for both genders. It rises until about 60 and then declines. The yellow curve is a “loess” smoother, which summarizes the behavior of the points in the scatterplot.
ggplot(data=cdc,aes(x=age,y=BMI,color=BMI))+geom_point()+
geom_smooth(method = "loess",color="yellow")+
facet_wrap(~gender,nrow=2)
As before the yellow curve is a loess smoother, which describes the overall pattern in the scatterplot. The red line separates the points which indicate a desire to gain weight (above the line) from those which indicate a desire to lose weight (below the line).
d <- ggplot(data = cdc,aes(x=BMI,y=BMIDes,color=DesActRatio))
d1 = d + geom_point(aes(alpha=.1)) +
geom_smooth(method="loess",color="yellow") +
geom_abline(slope=1,intercept=0,color="red")
d1
d2 = d1 +
facet_wrap(~gender,nrow=2) +
ggtitle("Actual vs Desired BMI by Gender")
d2
d3 = d1 +
facet_grid(gender~ageCat) +
ggtitle("Actual vs Desired BMI by Gender and Age")
d3