Data Preparation
#create vector with all needed packages
load_packages <- c("psych","prettydoc", "dplyr", "tidyr", "foreign", "knitr", "ggplot2")
install_load <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg))
    install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}
install_load(load_packages)
## Loading required package: psych
## Loading required package: prettydoc
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: tidyr
## Loading required package: foreign
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## psych prettydoc dplyr tidyr foreign knitr ggplot2
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#CODE SOURCE DOCUMENTATION: https://gist.github.com/stevenworthington/3178163
url_data <- "https://www.cdc.gov/brfss/annual_data/2015/files/LLCP2015XPT.zip"
# Large file: download it to a temporary location, then read the extracted XPT file
temp <- tempfile(fileext = ".zip")
download.file(url_data, temp, mode = "wb")
brfss_data <- read.xport(unzip(temp))
unlink(temp)
# glimpse(brfss_data)
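my_variables <- brfss_data %>%
  filter(GENHLTH < 6) %>%          # keep codes 1-5; drops 7 (don't know) and 9 (refused)
  mutate(BMI = X_BMI5/100) %>%     # X_BMI5 is stored with 2 implied decimal places
  dplyr::select(BMI, GENHLTH) %>%
  rename(reported_general_health = GENHLTH) %>%
  na.omit()                        # drop rows with a missing value in either variable
# X_BMI5CAT not used
Research question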
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
Is the Body Mass Index predictive of the self-reported level of general health?
Cases
What are the cases, and how many are there?
The cases are adult U.S. residents, specifically “non-institutionalized adults who reside in each of the states and selected U.S. territories.” There are 441,456 cases in the 2015 survey, of which 404,030 gave usable responses for both selected variables.
Citation: (https://www.cdc.gov/brfss/annual_data/2015/pdf/2015moduleanalysis.pdf)
nrow(brfss_data)
## [1] 441456
nrow(my_variables)
## [1] 404030
Data collection
Describe the method of data collection.
The data were collected as a collaborative project between the U.S. states and territories and the Centers for Disease Control and Prevention (CDC) through the Behavioral Risk Factor Surveillance System (BRFSS) telephone survey. In 2011, the survey started sampling both landline and cellular phone numbers.
Type of study
What type of study is this (observational/experiment)?
This is an observational study.
Data Source
If you collected the data, state self-collected. If not, provide a citation/link.
The CDC facilitated the study, and the data can be obtained in the following zip file: (https://www.cdc.gov/brfss/annual_data/2015/files/LLCP2015XPT.zip)
The data, survey overview, codebook, calculated variable explanations and other related materials can be found at the following site: (https://www.cdc.gov/brfss/annual_data/annual_2015.html)
Response
What is the response variable, and what type is it (numerical/categorical)?
The response variable is the self-reported description of the respondent’s general health. It is an ordinal categorical variable.
Would you say that in general your health is:
| Value | Value Label |
|---|---|
| 1 | Excellent |
| 2 | Very good |
| 3 | Good |
| 4 | Fair |
| 5 | Poor |
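Because the response is ordinal, the numeric codes can be stored as an ordered factor so that R keeps both the labels and their ranking. The following is a minimal sketch using a small made-up vector of codes; `genhlth_codes`, `health_labels` and `general_health` are illustrative names and are not part of the analysis code.

```r
# Hypothetical sample of GENHLTH codes (1-5); not taken from the BRFSS file
genhlth_codes <- c(3, 1, 5, 2, 2, 4)

# Labels follow the value table above
health_labels <- c("Excellent", "Very good", "Good", "Fair", "Poor")

# ordered = TRUE records the ranking Excellent < Very good < Good < Fair < Poor
general_health <- factor(genhlth_codes,
                         levels = 1:5,
                         labels = health_labels,
                         ordered = TRUE)

# Frequency of each category, printed in rank order
table(general_health)
```

Note that a lower code means better reported health, so "Excellent" is the lowest level of the ordered factor.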
Explanatory
What is the explanatory variable, and what type is it (numerical/categorical)?
The explanatory variable is the Body Mass Index (BMI), a continuous numerical variable. The World Health Organization (WHO) describes the BMI as a “simple index of weight-for-height that is commonly used to classify underweight, overweight and obesity in adults.”
The WHO groups the BMI values into 4 primary classes:
| Class | BMI Range |
|---|---|
| Underweight | <18.50 |
| Normal range | 18.50 - 24.99 |
| Overweight | 25.00 - 29.99 |
| Obese | ≥30.00 |
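The WHO classes can be assigned to numeric BMI values with base R's `cut()`. This is a sketch only: the BMI values are hypothetical, and the analysis in this document uses BMI as a numerical variable and does not create these classes.

```r
# Hypothetical BMI values chosen to sit on and around the class boundaries
bmi <- c(17.2, 18.5, 24.99, 25, 29.99, 30, 41.3)

# right = FALSE closes each interval on the left: [18.5, 25), [25, 30), [30, Inf)
bmi_class <- cut(bmi,
                 breaks = c(-Inf, 18.5, 25, 30, Inf),
                 labels = c("Underweight", "Normal range", "Overweight", "Obese"),
                 right = FALSE)

# Show each value next to its assigned class
data.frame(bmi, bmi_class)
```

Using half-open intervals avoids the small gaps implied by the two-decimal ranges in the table, so a value such as 24.995 still falls in the normal range.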
The BRFSS calculated the BMI by converting the reported heights and weights to meters and kilograms and then dividing the weight by the square of the height.
Citation: (http://apps.who.int/bmi/index.jsp?introPage=intro_3.html)
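The BMI calculation described above can be sketched as follows. The helper `bmi_from_imperial()` and the example respondent are hypothetical, the inputs are assumed to be in pounds and inches, and the rounding used to mimic the stored value is an assumption, not the documented BRFSS procedure.

```r
# Conversion factors (exact by definition)
KG_PER_LB  <- 0.45359237
M_PER_INCH <- 0.0254

# Hypothetical helper, not a BRFSS function: BMI = weight (kg) / height (m)^2
bmi_from_imperial <- function(weight_lb, height_in) {
  weight_kg <- weight_lb * KG_PER_LB
  height_m  <- height_in * M_PER_INCH
  weight_kg / height_m^2
}

# Hypothetical respondent: 180 lb and 70 in
bmi <- bmi_from_imperial(180, 70)
round(bmi, 2)

# X_BMI5-style storage keeps two implied decimals as a whole number,
# so dividing by 100 recovers the BMI (rounding here is an assumption)
stored <- round(bmi * 100)
stored / 100
```

This is why the data preparation step divides `X_BMI5` by 100 before using it as `BMI`.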
Relevant summary statistics
Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
options(scipen=999)
kable(t(round(describe(my_variables), 4)))
| | BMI | reported_general_health |
|---|---|---|
| vars | 1.0000 | 2.0000 |
| n | 404030.0000 | 404030.0000 |
| mean | 28.0427 | 2.5602 |
| sd | 6.6535 | 1.0865 |
| median | 26.9500 | 2.0000 |
| trimmed | 27.3352 | 2.5089 |
| mad | 5.1298 | 1.4826 |
| min | 12.0200 | 1.0000 |
| max | 99.9500 | 5.0000 |
| range | 87.9300 | 4.0000 |
| skew | 2.2434 | 0.3720 |
| kurtosis | 12.1673 | -0.4664 |
| se | 0.0105 | 0.0017 |
ggplot(my_variables, aes(x=BMI)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(my_variables, aes(x=reported_general_health)) + geom_bar()
boxplot(my_variables$BMI ~ my_variables$reported_general_health)
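The boxplot compares BMI across the health categories. A numeric companion, in line with the prompt's request for group means, SDs and sample sizes, could look like the sketch below. It uses base R and a small made-up data frame, `toy`, standing in for `my_variables`; the values are not BRFSS results.

```r
# Hypothetical toy data standing in for my_variables (values are made up)
toy <- data.frame(
  BMI = c(22.1, 24.3, 27.8, 26.0, 31.5, 29.9),
  reported_general_health = c(1, 1, 2, 2, 3, 3)
)

groups <- toy$reported_general_health

# Sample size, mean and standard deviation of BMI within each health category
group_stats <- data.frame(
  reported_general_health = sort(unique(groups)),
  n    = as.vector(tapply(toy$BMI, groups, length)),
  mean = as.vector(tapply(toy$BMI, groups, mean)),
  sd   = as.vector(tapply(toy$BMI, groups, sd))
)

group_stats
```

Replacing `toy` with `my_variables` would produce one row per reported health level.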