D606 Project Proposal

Data Preparation

#create vector with all needed packages
load_packages <- c("psych","prettydoc", "dplyr", "tidyr", "foreign", "knitr", "ggplot2")

install_load <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg)) 
    install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}

install_load(load_packages)

## Loading required package: psych

## Loading required package: prettydoc

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Loading required package: tidyr

## Loading required package: foreign

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

##     psych prettydoc     dplyr     tidyr   foreign     knitr   ggplot2 
##      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE

#CODE SOURCE DOCUMENTATION: https://gist.github.com/stevenworthington/3178163

url_data <- "https://www.cdc.gov/brfss/annual_data/2015/files/LLCP2015XPT.zip"

# Download the zipped SAS transport file (large), read it, then delete the temporary copy
temp <- tempfile(fileext = ".zip")
download.file(url_data, temp)
brfss_data <- read.xport(unzip(temp))
unlink(temp)
# glimpse(brfss_data)

my_variables <- brfss_data %>% 
  filter(GENHLTH < 6) %>% 
  mutate(BMI = X_BMI5/100) %>% 
  dplyr::select(BMI, GENHLTH) %>% 
  rename(reported_general_health = GENHLTH) %>%  
  na.omit() 

# X_BMI5CAT not used
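Because the BRFSS file is large, one way to avoid downloading it on every knit is to keep a local copy and fetch it only when it is missing. The sketch below is a minimal illustration of that idea, not part of the original analysis: `cached_download` and the local file name `LLCP2015XPT.zip` are hypothetical names chosen for this example.

```r
# Hypothetical helper (not part of the original analysis): download a file
# only when no local copy exists yet.
cached_download <- function(url, dest, fetch = utils::download.file) {
  if (!file.exists(dest)) {
    # mode = "wb" keeps binary files such as .zip intact on Windows
    fetch(url, dest, mode = "wb")
  }
  invisible(dest)
}

# Example usage (assumed local file name; not run here because the file is large):
# zip_path <- cached_download(url_data, "LLCP2015XPT.zip")
# brfss_data <- foreign::read.xport(unzip(zip_path))
```

The `fetch` argument exists only so the download step can be swapped out, for example when checking the helper without network access.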

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Is the Body Mass Index predictive of the self-reported level of general health?

Cases

What are the cases, and how many are there?

The cases are adult U.S. residents, specifically “non-institutionalized adults who reside in each of the states and selected U.S. territories.” There are 441,456 cases in the 2015 survey, of which 404,030 gave usable responses for both selected variables.

Citation: (https://www.cdc.gov/brfss/annual_data/2015/pdf/2015moduleanalysis.pdf)

nrow(brfss_data)

## [1] 441456

nrow(my_variables)

## [1] 404030
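As a quick check on how much of the sample survives the filtering, the two counts above can be turned into a number of dropped cases and a retention percentage. This is only arithmetic on the reported row counts; the variable names are chosen for this example.

```r
# Counts reported by nrow() in the output above
n_total <- 441456
n_complete <- 404030

n_dropped <- n_total - n_complete      # cases removed by filter() and na.omit()
share_kept <- n_complete / n_total     # proportion of cases retained

n_dropped                     # 37426
round(100 * share_kept, 1)    # 91.5
```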

Data collection

Describe the method of data collection.

The data were collected through the Behavioral Risk Factor Surveillance System (BRFSS), a telephone survey run collaboratively by the U.S. states, territories, and the Centers for Disease Control and Prevention (CDC). In 2011, the survey started sampling both landline and cellular phone numbers.

Type of study

What type of study is this (observational/experiment)?

This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

The CDC facilitated the study, and the data can be obtained in the following zip file: (https://www.cdc.gov/brfss/annual_data/2015/files/LLCP2015XPT.zip)

The data, survey overview, codebook, calculated variable explanations and other related materials can be found at the following site: (https://www.cdc.gov/brfss/annual_data/annual_2015.html)

Response

What is the response variable, and what type is it (numerical/categorical)?

The response variable is the self-reported description of the respondent’s general health. It is an ordinal categorical variable.

Would you say that in general your health is:

Value	Value Label
1	Excellent
2	Very good
3	Good
4	Fair
5	Poor
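Since the response is ordinal, it may be convenient to store the numeric codes as an ordered factor carrying the labels from the table above. The following is a small sketch using made-up codes; `health_labels`, `example_codes`, and `example_health` are names introduced here for illustration.

```r
# Labels from the codebook table, in code order 1 to 5
health_labels <- c("Excellent", "Very good", "Good", "Fair", "Poor")

# Illustrative codes only, not taken from the survey data
example_codes <- c(1, 3, 5, 2)

example_health <- factor(example_codes, levels = 1:5,
                         labels = health_labels, ordered = TRUE)
example_health

# Ordered factors support comparisons; lower codes mean better health
example_health < "Good"
```

The same call could be applied to `my_variables$reported_general_health`, which would also make plots show the labels instead of the codes.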

Explanatory

What is the explanatory variable, and what type is it (numerical/categorical)?

The explanatory variable is the Body Mass Index (BMI), a continuous numerical variable. The World Health Organization (WHO) describes the BMI as a “simple index of weight-for-height that is commonly used to classify underweight, overweight and obesity in adults.”

The WHO groups the BMI values into four primary classes:

Class	BMI Range
Underweight	<18.50
Normal range	18.50 - 24.99
Overweight	25.00 - 29.99
Obese	≥30.00

The BRFSS calculated the BMI by converting the reported heights and weights to meters and kilograms and then dividing the weight by the square of the height.

Citation: (http://apps.who.int/bmi/index.jsp?introPage=intro_3.html)
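The calculation and the WHO classes described above can be sketched in a few lines of R. This is an illustration only: the BRFSS supplies the computed value as X_BMI5, so the helpers `bmi_from_imperial` and `who_class` are hypothetical, and the example respondent is made up. The conversion factors are the standard definitions (1 lb = 0.45359237 kg, 1 in = 0.0254 m).

```r
# Hypothetical helpers for illustration; the BRFSS derives X_BMI5 itself
bmi_from_imperial <- function(weight_lb, height_in) {
  weight_kg <- weight_lb * 0.45359237  # pounds to kilograms
  height_m  <- height_in * 0.0254      # inches to meters
  weight_kg / height_m^2               # BMI = kg / m^2
}

who_class <- function(bmi) {
  cut(bmi,
      breaks = c(-Inf, 18.5, 25, 30, Inf),
      labels = c("Underweight", "Normal range", "Overweight", "Obese"),
      right = FALSE)  # intervals are [lower, upper), so 30.00 falls in "Obese"
}

# Made-up respondent: 180 lb, 70 in
example_bmi <- bmi_from_imperial(180, 70)
round(example_bmi, 2)
who_class(example_bmi)
```

Setting `right = FALSE` makes each class include its lower bound, which matches the table (18.50 is "Normal range", 25.00 is "Overweight").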

Relevant summary statistics

Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

options(scipen=999)
kable(t(round(describe(my_variables), 4)))

	BMI	reported_general_health
vars	1.0000	2.0000
n	404030.0000	404030.0000
mean	28.0427	2.5602
sd	6.6535	1.0865
median	26.9500	2.0000
trimmed	27.3352	2.5089
mad	5.1298	1.4826
min	12.0200	1.0000
max	99.9500	5.0000
range	87.9300	4.0000
skew	2.2434	0.3720
kurtosis	12.1673	-0.4664
se	0.0105	0.0017

ggplot(my_variables, aes(x=BMI)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(my_variables, aes(x=reported_general_health)) + geom_bar()
  
boxplot(my_variables$BMI ~ my_variables$reported_general_health)
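The template also asks for means, SDs, and sample sizes per group when comparing groups. A minimal base-R sketch of such a table is shown below on a small toy data frame; the toy values are invented for illustration and are not BRFSS results, and `group_summary` is a hypothetical helper name.

```r
# Toy data for illustration only; these values are not from the BRFSS
toy <- data.frame(
  BMI = c(22.1, 24.3, 27.5, 26.0, 31.2, 29.8),
  reported_general_health = c(1, 1, 2, 2, 3, 3)
)

# Sample size, mean, and standard deviation of BMI per health level
group_summary <- function(df) {
  groups <- split(df$BMI, df$reported_general_health)
  data.frame(
    reported_general_health = as.numeric(names(groups)),
    n = unname(sapply(groups, length)),
    mean_BMI = unname(sapply(groups, mean)),
    sd_BMI = unname(sapply(groups, sd))
  )
}

group_summary(toy)
# With the real data: group_summary(my_variables)
```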