Descriptive BRFSS Healthcare Data Analytics using R

Brief Summary on the topic

What is risk factor surveillance?

Risk factor surveillance means keeping track of the rates of risk factors, which are the things or states in our daily lives that confer risk to our health.

There are two main surveillance systems in the United States:

National Health and Nutrition Examination Survey (NHANES).
Behavioral Risk Factor Surveillance System (BRFSS).

We will be using the BRFSS dataset.

BRFSS is a federal and state collaboration.
Data collectors call randomly generated landline or cell phone numbers.
Data from BRFSS are publicly available. CDC uses them for healthcare program planning, and independent researchers can use them for in-depth research and analytics.

Types of BRFSS Analytics

Descriptive Analysis

Aimed at developing population-based rates.
Depends on the sampling approach, so it uses sampling "weights".
More often done by CDC and states.

Cross-Sectional Analysis

Aimed at exploring cross-sectional associations (hinting at potential causes).
Weighting is generally not used.
More often done by independent researchers.
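
To make the role of weights concrete, here is a minimal sketch on made-up data. The data frame `toy`, its columns, and the weight values are all hypothetical, and the sketch ignores strata and sampling units, which a full design-based analysis would also need to account for.

```r
# Toy data: 1 = has the condition, 0 = does not (all values are made up)
toy <- data.frame(
  outcome = c(1, 0, 0, 1, 0),
  weight  = c(100, 300, 300, 50, 250)  # hypothetical survey weights
)

# Unweighted proportion: every respondent counts equally
unweighted <- mean(toy$outcome)

# Weighted proportion: each respondent counts in proportion to the weight
weighted <- weighted.mean(toy$outcome, w = toy$weight)

unweighted
weighted
```

The two numbers differ because the respondents with the condition carry small weights in this toy example, which is the kind of gap a descriptive (population-rate) analysis has to correct for.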

Resources

BRFSS Resource Provided by CDC.

Datasets https://www.cdc.gov/brfss/annual_data/annual_data.htm
Questionnaires.
Codebook - Provides univariate summary statistics about each variable.
Documentation and explanations.

Environment Setup

# Load the required packages (if packages are not available, install them first)
for (package in c('foreign','gtools','questionr','MASS','caret','readr','ggplot2','magrittr','ggthemes','dplyr','corrplot','caTools')) {
  if (!require(package, character.only = TRUE, quietly = TRUE)) {
    install.packages(package)
    library(package, character.only = TRUE)
  }
}
# Package description
# foreign   : Reads in "foreign" data formats (here, the SAS XPORT file).
# gtools    : Provides defmacro() for writing macros.
# dplyr     : Helps with calculating means and standard deviations.
# questionr : Allows you to do a weighted analysis.
# MASS      : Allows you to do bivariate tests.

Designing Metadata

# Set up the data dictionary
# Assume we are building our own variables for a "US Army dataset", so we describe the data in Army terms, i.e.:
# 1. Select a subpopulation (veterans).
# 2. Select a risk factor (alcohol).
# 3. Select two outcomes.
#     Continuous: hours of sleep per night
#     Categorical: asthma (yes/no)
# 4. Define confounding variables (variables associated with both the exposure and the outcome).

Reading and Cleaning the Data

## Get the data
library(foreign)
# Read the SAS XPORT (.xpt) file and save it in an object called "BRFSS_a"
BRFSS_a <- read.xport("C:/CompleteMLProjects/Healthcare/BRFSS/Analytics/Data/LLCP2014.xpt")
colnames(BRFSS_a)

##   [1] "X_STATE"   "FMONTH"    "IDATE"     "IMONTH"    "IDAY"     
##   [6] "IYEAR"     "DISPCODE"  "SEQNO"     "X_PSU"     "CTELENUM" 
##  [11] "PVTRESD1"  "COLGHOUS"  "STATERES"  "LADULT"    "NUMADULT" 
##  [16] "NUMMEN"    "NUMWOMEN"  "GENHLTH"   "PHYSHLTH"  "MENTHLTH" 
##  [21] "POORHLTH"  "HLTHPLN1"  "PERSDOC2"  "MEDCOST"   "CHECKUP1" 
##  [26] "EXERANY2"  "SLEPTIM1"  "CVDINFR4"  "CVDCRHD4"  "CVDSTRK3" 
##  [31] "ASTHMA3"   "ASTHNOW"   "CHCSCNCR"  "CHCOCNCR"  "CHCCOPD1" 
##  [36] "HAVARTH3"  "ADDEPEV2"  "CHCKIDNY"  "DIABETE3"  "DIABAGE2" 
##  [41] "LASTDEN3"  "RMVTETH3"  "VETERAN3"  "MARITAL"   "CHILDREN" 
##  [46] "EDUCA"     "EMPLOY1"   "INCOME2"   "WEIGHT2"   "HEIGHT3"  
##  [51] "NUMHHOL2"  "NUMPHON2"  "CPDEMO1"   "INTERNET"  "RENTHOM1" 
##  [56] "SEX"       "PREGNANT"  "QLACTLM2"  "USEEQUIP"  "BLIND"    
##  [61] "DECIDE"    "DIFFWALK"  "DIFFDRES"  "DIFFALON"  "SMOKE100" 
##  [66] "SMOKDAY2"  "STOPSMK2"  "LASTSMK2"  "USENOW3"   "ALCDAY5"  
##  [71] "AVEDRNK2"  "DRNK3GE5"  "MAXDRNKS"  "FLUSHOT6"  "FLSHTMY2" 
##  [76] "PNEUVAC3"  "SHINGLE2"  "FALL12MN"  "FALLINJ2"  "SEATBELT" 
##  [81] "DRNKDRI2"  "HADMAM"    "HOWLONG"   "PROFEXAM"  "LENGEXAM" 
##  [86] "HADPAP2"   "LASTPAP2"  "HADHYST2"  "PCPSAAD2"  "PCPSADI1" 
##  [91] "PCPSARE1"  "PSATEST1"  "PSATIME"   "PCPSARS1"  "BLDSTOOL" 
##  [96] "LSTBLDS3"  "HADSIGM3"  "HADSGCO1"  "LASTSIG3"  "HIVTST6"  
## [101] "HIVTSTD3"  "WHRTST10"  "PDIABTST"  "PREDIAB1"  "INSULIN"  
## [106] "BLDSUGAR"  "FEETCHK2"  "DOCTDIAB"  "CHKHEMO3"  "FEETCHK"  
## [111] "EYEEXAM"   "DIABEYE"   "DIABEDU"   "PAINACT2"  "QLMENTL2" 
## [116] "QLSTRES2"  "QLHLTH2"   "MEDICARE"  "HLTHCVR1"  "DELAYMED" 
## [121] "DLYOTHER"  "NOCOV121"  "LSTCOVRG"  "DRVISITS"  "MEDSCOST" 
## [126] "CARERCVD"  "MEDBILL1"  "ASBIALCH"  "ASBIDRNK"  "ASBIBING" 
## [131] "ASBIADVC"  "ASBIRDUC"  "WTCHSALT"  "LONGWTCH"  "DRADVISE" 
## [136] "ASTHMAGE"  "ASATTACK"  "ASERVIST"  "ASDRVIST"  "ASRCHKUP" 
## [141] "ASACTLIM"  "ASYMPTOM"  "ASNOSLEP"  "ASTHMED3"  "ASINHALR" 
## [146] "IMFVPLAC"  "TETANUS"   "HPVTEST"   "HPLSTTST"  "HPVADVC2" 
## [151] "HPVADSHT"  "CNCRDIFF"  "CNCRAGE"   "CNCRTYP1"  "CSRVTRT1" 
## [156] "CSRVDOC1"  "CSRVSUM"   "CSRVRTRN"  "CSRVINST"  "CSRVINSR" 
## [161] "CSRVDEIN"  "CSRVCLIN"  "CSRVPAIN"  "CSRVCTL1"  "RRCLASS2" 
## [166] "RRCOGNT2"  "RRATWRK2"  "RRHCARE3"  "RRPHYSM2"  "RREMTSM2" 
## [171] "SCNTMNY1"  "SCNTMEL1"  "SCNTPAID"  "SCNTWRK1"  "SCNTLPAD" 
## [176] "SCNTLWK1"  "SCNTVOT1"  "SXORIENT"  "TRNSGNDR"  "RCSGENDR" 
## [181] "RCSRLTN2"  "CASTHDX2"  "CASTHNO2"  "EMTSUPRT"  "LSATISFY" 
## [186] "CTELNUM1"  "CELLFON2"  "CADULT"    "PVTRESD2"  "CCLGHOUS" 
## [191] "CSTATE"    "LANDLINE"  "HHADULT"   "QSTVER"    "QSTLANG"  
## [196] "MSCODE"    "X_STSTR"   "X_STRWT"   "X_RAWRAKE" "X_WT2RAKE"
## [201] "X_AGE80"   "X_IMPRACE" "X_IMPNPH"  "X_CHISPNC" "X_CPRACE" 
## [206] "X_CRACE1"  "X_IMPCAGE" "X_IMPCRAC" "X_IMPCSEX" "X_CLLCPWT"
## [211] "X_DUALUSE" "X_DUALCOR" "X_LLCPWT2" "X_LLCPWT"  "X_RFHLTH" 
## [216] "X_HCVU651" "X_TOTINDA" "X_LTASTH1" "X_CASTHM1" "X_ASTHMS1"
## [221] "X_DRDXAR1" "X_EXTETH2" "X_ALTETH2" "X_DENVST2" "X_PRACE1" 
## [226] "X_MRACE1"  "X_HISPANC" "X_RACE"    "X_RACEG21" "X_RACEGR3"
## [231] "X_RACE_G1" "X_AGEG5YR" "X_AGE65YR" "X_AGE_G"   "HTIN4"    
## [236] "HTM4"      "WTKG3"     "X_BMI5"    "X_BMI5CAT" "X_RFBMI5" 
## [241] "X_CHLDCNT" "X_EDUCAG"  "X_INCOMG"  "X_SMOKER3" "X_RFSMOK3"
## [246] "DRNKANY5"  "DROCDY3_"  "X_RFBING5" "X_DRNKDY4" "X_DRNKMO4"
## [251] "X_RFDRHV4" "X_RFDRMN4" "X_RFDRWM4" "X_FLSHOT6" "X_PNEUMO2"
## [256] "X_RFSEAT2" "X_RFSEAT3" "X_RFMAM2Y" "X_MAM502Y" "X_MAM5021"
## [261] "X_RFPAP32" "X_RFPAP33" "X_RFPSA21" "X_RFBLDS2" "X_RFBLDS3"
## [266] "X_RFSIGM2" "X_COL10YR" "X_HFOB3YR" "X_FS5YR"   "X_FOBTFS" 
## [271] "X_CRCREC"  "X_AIDTST3" "X_IMPEDUC" "X_IMPMRTL" "X_IMPHOME"
## [276] "RCSBRAC1"  "RCSRACE1"  "RCHISLA1"  "RCSBIRTH"

Keep only the needed variables from the dataset

#define object list of variables to be kept
BRFSSVarList <- c("VETERAN3", 
            "ALCDAY5",
            "SLEPTIM1",
            "ASTHMA3",
            "X_AGE_G",
            "SMOKE100",
            "SMOKDAY2",
            "SEX",
            "X_HISPANC",
            "X_MRACE1",
            "MARITAL",
            "GENHLTH",
            "HLTHPLN1",
            "EDUCA",
            "INCOME2",
            "X_BMI5CAT",
            "EXERANY2")

# subset by varlist
BRFSS_b <- BRFSS_a[BRFSSVarList]

# check columns
colnames(BRFSS_b)

##  [1] "VETERAN3"  "ALCDAY5"   "SLEPTIM1"  "ASTHMA3"   "X_AGE_G"  
##  [6] "SMOKE100"  "SMOKDAY2"  "SEX"       "X_HISPANC" "X_MRACE1" 
## [11] "MARITAL"   "GENHLTH"   "HLTHPLN1"  "EDUCA"     "INCOME2"  
## [16] "X_BMI5CAT" "EXERANY2"

# check rows
nrow(BRFSS_b)

## [1] 464664

# SUBSET THE DATASET TO KEEP ONLY VETERANS (VETERAN3 == 1)
BRFSS_c <- subset(BRFSS_b,VETERAN3==1)

# Check the variable value 
# BRFSS_c$VETERAN3
# Check the number of rows for BRFSS_c
nrow(BRFSS_c)

## [1] 62120

# We can see that 62120 veterans are in the dataset.
# The remaining 464664 - 62120 = 402544 rows are non-veterans or respondents without a valid VETERAN3 answer.

# ONLY KEEP ROWS WITH VALID ALCOHOL/EXPOSURE VARIABLE.
BRFSS_d <- subset(BRFSS_c, ALCDAY5 < 777 | ALCDAY5 == 888)

# Take a look at the data
# BRFSS_d$ALCDAY5
# Check the number of rows for BRFSS_d
nrow(BRFSS_d)

## [1] 58991

# 58991 veterans have a valid alcohol response (including those who reported no drinks, coded 888).
# Hence 62120 - 58991 = 3129 veterans were dropped for an invalid or missing alcohol response.
 

# EXCLUDING ROWS WITHOUT VALID SLEEP TIME
# Only keep rows with valid sleep data
BRFSS_e <- subset(BRFSS_d,SLEPTIM1 < 77)

# Check the number of rows for BRFSS_e
nrow(BRFSS_e)

## [1] 58321

# 58321 veterans have valid sleep data.
# Hence 58991 - 58321 = 670 veterans were dropped for invalid or missing sleep data.

# EXCLUDING ROWS WITHOUT VALID ASTHMA DATA

# Only keep rows with valid asthma data
BRFSS_f <- subset(BRFSS_e, ASTHMA3 < 7)

# Check the number of rows for BRFSS_f
nrow(BRFSS_f)

## [1] 58131

# 58131 veterans have valid asthma data.
# Hence 58321 - 58131 = 190 veterans were dropped for invalid or missing asthma data.
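
The step-by-step exclusions above can also be tracked with a small helper that reports how many rows each filter keeps and drops. This is a sketch on made-up data: `keep_rows` is a hypothetical helper, not part of any package, and the toy values only mimic the coding conventions used here.

```r
# Toy data using the same sentinel-code conventions as the text (values made up)
toy <- data.frame(
  VETERAN3 = c(1, 1, 1, 2, 1, 1),
  ALCDAY5  = c(101, 888, 999, 205, 777, 230),
  SLEPTIM1 = c(7, 99, 8, 6, 7, 5),
  ASTHMA3  = c(2, 1, 2, 2, 1, 9)
)

# Hypothetical helper: apply a row filter and report how many rows it drops
keep_rows <- function(df, keep, label) {
  keep <- !is.na(keep) & keep   # treat NA as "do not keep", as subset() does
  message(label, ": kept ", sum(keep), ", dropped ", sum(!keep))
  df[keep, , drop = FALSE]
}

step1 <- keep_rows(toy,   toy$VETERAN3 == 1, "veterans")
step2 <- keep_rows(step1, step1$ALCDAY5 < 777 | step1$ALCDAY5 == 888, "valid alcohol")
step3 <- keep_rows(step2, step2$SLEPTIM1 < 77, "valid sleep")
step4 <- keep_rows(step3, step3$ASTHMA3 < 7, "valid asthma")
```

Logging the dropped count at each step gives the same bookkeeping as the manual subtractions, without re-deriving the arithmetic by hand.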

Generating Exposure Variable

First, we turn to our exposure, alcohol. We make a grouping variable for alcohol, plus indicator variables for drinking monthly and drinking weekly.

From the ALCDAY5 entry in the data dictionary, we see that values in the range 101 to 199 should be coded as ALCGRP = 3 (drinks weekly), values in the range 201 to 299 as 2 (drinks monthly), and 888 as 1 (no drinks). Everything else is coded 9, for unknown.

If ALCGRP is 2, the drink-monthly flag is 1 and everyone else gets 0. If ALCGRP is 3, the drink-weekly flag is 1 and everyone else gets 0.

# Add grouping and indicator variables for alcohol
# First make copy of the dataset
BRFSS_g <- BRFSS_f

# add the categorical variable set to 9 to the dataset
BRFSS_g$ALCGRP <- 9

# update according to data Dictionary
BRFSS_g$ALCGRP[BRFSS_g$ALCDAY5 < 200] <- 3
BRFSS_g$ALCGRP[BRFSS_g$ALCDAY5 >= 200 & BRFSS_g$ALCDAY5 <777] <- 2
BRFSS_g$ALCGRP[BRFSS_g$ALCDAY5 == 888] <- 1

# Check the variable 
table(BRFSS_g$ALCGRP, BRFSS_g$ALCDAY5)

##    
##       101   102   103   104   105   106   107   201   202   203   204
##   1     0     0     0     0     0     0     0     0     0     0     0
##   2     0     0     0     0     0     0     0  4353  3271  1890  1576
##   3  2454  1888  1334   634   686   309  2011     0     0     0     0
##    
##       205   206   207   208   209   210   211   212   213   214   215
##   1     0     0     0     0     0     0     0     0     0     0     0
##   2  1431   564   362   415    28  1152     9   259    12   102  1090
##   3     0     0     0     0     0     0     0     0     0     0     0
##    
##       216   217   218   219   220   221   222   223   224   225   226
##   1     0     0     0     0     0     0     0     0     0     0     0
##   2    27    14    28     1  1087    31    28     9    32   504    31
##   3     0     0     0     0     0     0     0     0     0     0     0
##    
##       227   228   229   230   888
##   1     0     0     0     0 26169
##   2    40   149    81  4070     0
##   3     0     0     0     0     0

# Add flags
# Flags for Monthly drinkers
BRFSS_g$DRKMONTHLY <- 0
BRFSS_g$DRKMONTHLY[BRFSS_g$ALCGRP == 2] <- 1

table(BRFSS_g$ALCGRP,BRFSS_g$DRKMONTHLY)

##    
##         0     1
##   1 26169     0
##   2     0 22646
##   3  9316     0

# Flags for Weekly drinkers (ALCGRP == 3)
BRFSS_g$DRKWEEKLY <- 0
BRFSS_g$DRKWEEKLY[BRFSS_g$ALCGRP == 3] <- 1

table(BRFSS_g$ALCGRP,BRFSS_g$DRKWEEKLY)

##    
##         0     1
##   1 26169     0
##   2 22646     0
##   3     0  9316
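
As a compact illustration of the coding rule described for ALCDAY5 (weekly = 3, monthly = 2, no drinks = 1, everything else = 9), the sketch below recodes a made-up vector in one helper. `recode_alcgrp` and the input values are hypothetical; the explicit range checks are an assumption about how to make the rule safe on unfiltered data.

```r
# Made-up ALCDAY5 values covering each coding range, plus invalid codes
alcday5 <- c(101, 107, 201, 230, 888, 777, 999, NA)

# Hypothetical helper: map ALCDAY5 codes to the ALCGRP groups
recode_alcgrp <- function(x) {
  grp <- rep(9, length(x))                     # default: unknown
  grp[!is.na(x) & x >= 101 & x <= 199] <- 3    # drinks weekly
  grp[!is.na(x) & x >= 201 & x <= 299] <- 2    # drinks monthly
  grp[!is.na(x) & x == 888]            <- 1    # no drinks
  grp
}

alcgrp     <- recode_alcgrp(alcday5)
drkmonthly <- as.integer(alcgrp == 2)   # 1 only for monthly drinkers
drkweekly  <- as.integer(alcgrp == 3)   # 1 only for weekly drinkers

data.frame(alcday5, alcgrp, drkmonthly, drkweekly)
```

Bounding both ends of each range means the recode does not rely on invalid codes having been filtered out beforehand.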

Generate outcome variables from data dictionary

First, we are going to clean up our outcome variable for sleep duration. Next, we will make sure we have binary variable or flag that is valid for our asthma outcome.

#  We need to remove the rows with no information on sleep time and we want to turn our asthma variable into an indicator variable with only ones and zeroes.

# First make copy of the dataset
BRFSS_h <- BRFSS_g

# Make and test sleep variable
# First generate SLEPTIM2, a continuous copy of SLEPTIM1, and set every value to NA
BRFSS_h$SLEPTIM2 <- NA

# Copy SLEPTIM1 only where it is not NA, not 77, and not 99.
# Index both sides with the same condition so the lengths always match.
validSleep <- !is.na(BRFSS_h$SLEPTIM1) & BRFSS_h$SLEPTIM1 != 77 & BRFSS_h$SLEPTIM1 != 99
BRFSS_h$SLEPTIM2[validSleep] <- BRFSS_h$SLEPTIM1[validSleep]

# Check the variable
table(BRFSS_h$SLEPTIM1,BRFSS_h$SLEPTIM2)

##     
##          1     2     3     4     5     6     7     8     9    10    11
##   1     38     0     0     0     0     0     0     0     0     0     0
##   2      0   134     0     0     0     0     0     0     0     0     0
##   3      0     0   465     0     0     0     0     0     0     0     0
##   4      0     0     0  1687     0     0     0     0     0     0     0
##   5      0     0     0     0  3690     0     0     0     0     0     0
##   6      0     0     0     0     0 11854     0     0     0     0     0
##   7      0     0     0     0     0     0 16557     0     0     0     0
##   8      0     0     0     0     0     0     0 17889     0     0     0
##   9      0     0     0     0     0     0     0     0  3426     0     0
##   10     0     0     0     0     0     0     0     0     0  1705     0
##   11     0     0     0     0     0     0     0     0     0     0   111
##   12     0     0     0     0     0     0     0     0     0     0     0
##   13     0     0     0     0     0     0     0     0     0     0     0
##   14     0     0     0     0     0     0     0     0     0     0     0
##   15     0     0     0     0     0     0     0     0     0     0     0
##   16     0     0     0     0     0     0     0     0     0     0     0
##   17     0     0     0     0     0     0     0     0     0     0     0
##   18     0     0     0     0     0     0     0     0     0     0     0
##   20     0     0     0     0     0     0     0     0     0     0     0
##   21     0     0     0     0     0     0     0     0     0     0     0
##   22     0     0     0     0     0     0     0     0     0     0     0
##   24     0     0     0     0     0     0     0     0     0     0     0
##     
##         12    13    14    15    16    17    18    20    21    22    24
##   1      0     0     0     0     0     0     0     0     0     0     0
##   2      0     0     0     0     0     0     0     0     0     0     0
##   3      0     0     0     0     0     0     0     0     0     0     0
##   4      0     0     0     0     0     0     0     0     0     0     0
##   5      0     0     0     0     0     0     0     0     0     0     0
##   6      0     0     0     0     0     0     0     0     0     0     0
##   7      0     0     0     0     0     0     0     0     0     0     0
##   8      0     0     0     0     0     0     0     0     0     0     0
##   9      0     0     0     0     0     0     0     0     0     0     0
##   10     0     0     0     0     0     0     0     0     0     0     0
##   11     0     0     0     0     0     0     0     0     0     0     0
##   12   411     0     0     0     0     0     0     0     0     0     0
##   13     0    19     0     0     0     0     0     0     0     0     0
##   14     0     0    38     0     0     0     0     0     0     0     0
##   15     0     0     0    32     0     0     0     0     0     0     0
##   16     0     0     0     0    35     0     0     0     0     0     0
##   17     0     0     0     0     0     3     0     0     0     0     0
##   18     0     0     0     0     0     0    24     0     0     0     0
##   20     0     0     0     0     0     0     0     8     0     0     0
##   21     0     0     0     0     0     0     0     0     1     0     0
##   22     0     0     0     0     0     0     0     0     0     2     0
##   24     0     0     0     0     0     0     0     0     0     0     2

Make and test asthma variable

# Assign 9 to ASTHMA4 
BRFSS_h$ASTHMA4 <- 9

# Then assign 1 to all who have reported ASTHMA
BRFSS_h$ASTHMA4[BRFSS_h$ASTHMA3 == 1] <- 1

# Then assign 0 to all who have reported no ASTHMA
BRFSS_h$ASTHMA4[BRFSS_h$ASTHMA3 == 2] <- 0

# Check the variable
table(BRFSS_h$ASTHMA3,BRFSS_h$ASTHMA4)

##    
##         0     1
##   1     0  5343
##   2 52788     0

Generating Age variables

# First make copy of the dataset
BRFSS_i <- BRFSS_h

# Following the data dictionary, set all age group flags to 0 by default.
# Age group 18 to 24 is kept as the reference group, so it gets no flag.
BRFSS_i$AGE2 <- 0   # Age 25 to 34
BRFSS_i$AGE3 <- 0   # Age 35 to 44
BRFSS_i$AGE4 <- 0   # Age 45 to 54
BRFSS_i$AGE5 <- 0   # Age 55 to 64
BRFSS_i$AGE6 <- 0   # Age 65 and older

# set conditions to update the flags

BRFSS_i$AGE2[BRFSS_i$X_AGE_G == 2] <- 1
table(BRFSS_i$X_AGE_G,BRFSS_i$AGE2)

##    
##         0     1
##   1   899     0
##   2     0  2657
##   3  3589     0
##   4  6543     0
##   5 10724     0
##   6 33719     0

BRFSS_i$AGE3[BRFSS_i$X_AGE_G == 3] <- 1
table(BRFSS_i$X_AGE_G,BRFSS_i$AGE3)

##    
##         0     1
##   1   899     0
##   2  2657     0
##   3     0  3589
##   4  6543     0
##   5 10724     0
##   6 33719     0

BRFSS_i$AGE4[BRFSS_i$X_AGE_G == 4] <- 1
table(BRFSS_i$X_AGE_G,BRFSS_i$AGE4)

##    
##         0     1
##   1   899     0
##   2  2657     0
##   3  3589     0
##   4     0  6543
##   5 10724     0
##   6 33719     0

BRFSS_i$AGE5[BRFSS_i$X_AGE_G == 5] <- 1
table(BRFSS_i$X_AGE_G,BRFSS_i$AGE5)

##    
##         0     1
##   1   899     0
##   2  2657     0
##   3  3589     0
##   4  6543     0
##   5     0 10724
##   6 33719     0

BRFSS_i$AGE6[BRFSS_i$X_AGE_G == 6] <- 1
table(BRFSS_i$X_AGE_G,BRFSS_i$AGE6)

##    
##         0     1
##   1   899     0
##   2  2657     0
##   3  3589     0
##   4  6543     0
##   5 10724     0
##   6     0 33719
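
The five age flags follow one pattern, so they can also be generated in a loop. This is a sketch on made-up codes rather than the approach used above; one difference to keep in mind is that a missing `X_AGE_G` would give `NA` here instead of 0.

```r
# Made-up age-group codes (1 = 18 to 24 is the reference group)
toy <- data.frame(X_AGE_G = c(1, 2, 3, 4, 5, 6, 6, 2))

# Create AGE2..AGE6 in a loop instead of one block per level
for (level in 2:6) {
  toy[[paste0("AGE", level)]] <- as.integer(toy$X_AGE_G == level)
}

# Each row should have at most one flag set; reference rows have none
flag_count <- rowSums(toy[paste0("AGE", 2:6)])
flag_count
```

Checking that no row has more than one flag set plays the same role as the cross-tabulations against `X_AGE_G`.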

Make smoking variables

# Make smoking variables
BRFSS_i$NEVERSMK <- 0
BRFSS_i$NEVERSMK [BRFSS_i$SMOKE100 == 2] <- 1
table(BRFSS_i$SMOKE100,BRFSS_i$NEVERSMK)

##    
##         0     1
##   1 35267     0
##   2     0 22622
##   7   208     0
##   9    33     0

# Make grouping variable
BRFSS_i$SMOKGRP <- 9
BRFSS_i$SMOKGRP[BRFSS_i$SMOKDAY2 == 1 | BRFSS_i$SMOKDAY2 == 2] <- 1
BRFSS_i$SMOKGRP[BRFSS_i$SMOKDAY2 == 3 | BRFSS_i$NEVERSMK == 1] <- 2

table(BRFSS_i$SMOKGRP,BRFSS_i$SMOKDAY2)

##    
##         1     2     3     7     9
##   1  6476  2095     0     0     0
##   2     0     0 26639     0     0
##   9     0     0     0    26    31

table(BRFSS_i$SMOKGRP,BRFSS_i$SMOKE100)

##    
##         1     2     7     9
##   1  8571     0     0     0
##   2 26639 22622     0     0
##   9    57     0   208    33

BRFSS_i$SMOKER <- 0
BRFSS_i$SMOKER[BRFSS_i$SMOKGRP == 1] <- 1

table(BRFSS_i$SMOKGRP, BRFSS_i$SMOKER)

##    
##         0     1
##   1     0  8571
##   2 49261     0
##   9   299     0
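
The cross-tabs suggest that SMOKDAY2 is missing for respondents who never smoked, since the 22622 never-smokers do not appear in the SMOKDAY2 table yet still land in group 2. The sketch below, on made-up values, shows why the grouping logic handles that: in R, `NA | TRUE` evaluates to `TRUE`.

```r
# Made-up rows: the last respondent never smoked, so SMOKDAY2 is missing
smoke100 <- c(1, 1, 1, 2)
smokday2 <- c(1, 3, 9, NA)

neversmk <- as.integer(smoke100 == 2)

smokgrp <- rep(9, length(smoke100))
smokgrp[smokday2 == 1 | smokday2 == 2] <- 1   # current smokers
smokgrp[smokday2 == 3 | neversmk == 1] <- 2   # former or never smokers

# NA | TRUE evaluates to TRUE, so the never-smoker still lands in group 2
NA | TRUE
smokgrp
```

Assigning a single value through a logical index that contains `NA` skips the `NA` positions, which is why the first assignment leaves the never-smoker untouched.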

Make Sex variable

BRFSS_i$MALE <- 0
BRFSS_i$MALE[BRFSS_i$SEX == 1] <- 1

table(BRFSS_i$MALE, BRFSS_i$SEX)

##    
##         1     2
##   0     0  5160
##   1 52971     0

Make Hispanic variable

BRFSS_i$HISPANIC <- 0
BRFSS_i$HISPANIC[BRFSS_i$X_HISPANC == 1] <- 1

table(BRFSS_i$HISPANIC, BRFSS_i$X_HISPANC)

##    
##         1     2     9
##   0     0 55262   607
##   1  2262     0     0

Make Race variables

BRFSS_i$RACEGRP <- 9
BRFSS_i$RACEGRP[BRFSS_i$X_MRACE1 == 1] <- 1
BRFSS_i$RACEGRP[BRFSS_i$X_MRACE1 == 2] <- 2
BRFSS_i$RACEGRP[BRFSS_i$X_MRACE1 == 3] <- 3
BRFSS_i$RACEGRP[BRFSS_i$X_MRACE1 == 4] <- 4
BRFSS_i$RACEGRP[BRFSS_i$X_MRACE1 == 5] <- 5
BRFSS_i$RACEGRP[BRFSS_i$X_MRACE1 == 6 | BRFSS_i$X_MRACE1 == 7] <- 6

table(BRFSS_i$RACEGRP , BRFSS_i$X_MRACE1)

##    
##         1     2     3     4     5     6     7    77    99
##   1 49394     0     0     0     0     0     0     0     0
##   2     0  3939     0     0     0     0     0     0     0
##   3     0     0   930     0     0     0     0     0     0
##   4     0     0     0   557     0     0     0     0     0
##   5     0     0     0     0   261     0     0     0     0
##   6     0     0     0     0     0   656  1400     0     0
##   9     0     0     0     0     0     0     0   182   797

BRFSS_i$BLACK <- 0
BRFSS_i$ASIAN <- 0
BRFSS_i$OTHRACE <- 0

BRFSS_i$BLACK[BRFSS_i$RACEGRP == 2] <- 1
table(BRFSS_i$RACEGRP, BRFSS_i$BLACK)

##    
##         0     1
##   1 49394     0
##   2     0  3939
##   3   930     0
##   4   557     0
##   5   261     0
##   6  2056     0
##   9   994     0

BRFSS_i$ASIAN[BRFSS_i$RACEGRP == 4] <- 1
table(BRFSS_i$RACEGRP, BRFSS_i$ASIAN)

##    
##         0     1
##   1 49394     0
##   2  3939     0
##   3   930     0
##   4     0   557
##   5   261     0
##   6  2056     0
##   9   994     0

# RACEGRP 6 already combines X_MRACE1 codes 6 and 7, so RACEGRP never equals 7
BRFSS_i$OTHRACE[BRFSS_i$RACEGRP %in% c(3, 5, 6)] <- 1
table(BRFSS_i$RACEGRP, BRFSS_i$OTHRACE)

##    
##         0     1
##   1 49394     0
##   2  3939     0
##   3     0   930
##   4   557     0
##   5     0   261
##   6     0  2056
##   9   994     0

Make Marital variables

BRFSS_i$MARGRP <- 9
BRFSS_i$MARGRP[BRFSS_i$MARITAL == 1 | BRFSS_i$MARITAL == 5] <- 1
BRFSS_i$MARGRP[BRFSS_i$MARITAL == 2 | BRFSS_i$MARITAL == 3 ] <- 2
BRFSS_i$MARGRP[BRFSS_i$MARITAL == 4] <- 3

table(BRFSS_i$MARGRP, BRFSS_i$MARITAL)

##    
##         1     2     3     4     5     6     9
##   1 35855     0     0     0  4696     0     0
##   2     0  8396  7192     0     0     0     0
##   3     0     0     0   982     0     0     0
##   9     0     0     0     0     0   796   214

BRFSS_i$NEVERMAR <- 0
BRFSS_i$FORMERMAR <- 0

BRFSS_i$NEVERMAR[BRFSS_i$MARGRP == 3] <- 1
table(BRFSS_i$MARGRP, BRFSS_i$NEVERMAR)

##    
##         0     1
##   1 40551     0
##   2 15588     0
##   3     0   982
##   9  1010     0

BRFSS_i$FORMERMAR[BRFSS_i$MARGRP == 2] <- 1
table(BRFSS_i$MARGRP, BRFSS_i$FORMERMAR)

##    
##         0     1
##   1 40551     0
##   2     0 15588
##   3   982     0
##   9  1010     0

Make Genhealth variables

BRFSS_i$GENHLTH2 <- 9
BRFSS_i$GENHLTH2[BRFSS_i$GENHLTH == 1] <- 1
BRFSS_i$GENHLTH2[BRFSS_i$GENHLTH == 2] <- 2
BRFSS_i$GENHLTH2[BRFSS_i$GENHLTH == 3] <- 3
BRFSS_i$GENHLTH2[BRFSS_i$GENHLTH == 4] <- 4
BRFSS_i$GENHLTH2[BRFSS_i$GENHLTH == 5] <- 5

table(BRFSS_i$GENHLTH2, BRFSS_i$GENHLTH)

##    
##         1     2     3     4     5     7     9
##   1  9016     0     0     0     0     0     0
##   2     0 18111     0     0     0     0     0
##   3     0     0 18797     0     0     0     0
##   4     0     0     0  8436     0     0     0
##   5     0     0     0     0  3569     0     0
##   9     0     0     0     0     0   100   102

BRFSS_i$FAIRHLTH <- 0
BRFSS_i$POORHLTH <- 0

BRFSS_i$FAIRHLTH [BRFSS_i$GENHLTH2 == 4] <- 1
table(BRFSS_i$FAIRHLTH, BRFSS_i$GENHLTH2)

##    
##         1     2     3     4     5     9
##   0  9016 18111 18797     0  3569   202
##   1     0     0     0  8436     0     0

BRFSS_i$POORHLTH [BRFSS_i$GENHLTH2 == 5] <- 1
table(BRFSS_i$POORHLTH, BRFSS_i$GENHLTH2)

##    
##         1     2     3     4     5     9
##   0  9016 18111 18797  8436     0   202
##   1     0     0     0     0  3569     0

Make health plan variables

BRFSS_i$HLTHPLN2 <- 9
BRFSS_i$HLTHPLN2[BRFSS_i$HLTHPLN1 == 1] <- 1
BRFSS_i$HLTHPLN2[BRFSS_i$HLTHPLN1 == 2] <- 2

table(BRFSS_i$HLTHPLN1, BRFSS_i$HLTHPLN2)

##    
##         1     2     9
##   1 55795     0     0
##   2     0  2203     0
##   7     0     0    47
##   9     0     0    86

BRFSS_i$NOPLAN <- 0
BRFSS_i$NOPLAN [BRFSS_i$HLTHPLN2== 2] <- 1
table(BRFSS_i$NOPLAN, BRFSS_i$HLTHPLN2)

##    
##         1     2     9
##   0 55795     0   133
##   1     0  2203     0

Make education variables

BRFSS_i$EDGROUP <- 9
BRFSS_i$EDGROUP[BRFSS_i$EDUCA == 1 | BRFSS_i$EDUCA == 2 | BRFSS_i$EDUCA == 3] <- 1
BRFSS_i$EDGROUP[BRFSS_i$EDUCA == 4] <- 2
BRFSS_i$EDGROUP[BRFSS_i$EDUCA == 5] <- 3
BRFSS_i$EDGROUP[BRFSS_i$EDUCA == 6] <- 4

table(BRFSS_i$EDGROUP, BRFSS_i$EDUCA)

##    
##         1     2     3     4     5     6     9
##   1    33   704  1746     0     0     0     0
##   2     0     0     0 16241     0     0     0
##   3     0     0     0     0 17559     0     0
##   4     0     0     0     0     0 21742     0
##   9     0     0     0     0     0     0   106

BRFSS_i$LOWED <- 0
BRFSS_i$SOMECOLL <- 0

BRFSS_i$LOWED[BRFSS_i$EDGROUP == 1 | BRFSS_i$EDGROUP == 2 ] <- 1
table(BRFSS_i$LOWED, BRFSS_i$EDGROUP)

##    
##         1     2     3     4     9
##   0     0     0 17559 21742   106
##   1  2483 16241     0     0     0

BRFSS_i$SOMECOLL [BRFSS_i$EDGROUP == 3] <- 1
table(BRFSS_i$SOMECOLL, BRFSS_i$EDGROUP)

##    
##         1     2     3     4     9
##   0  2483 16241     0 21742   106
##   1     0     0 17559     0     0

Make income variables

BRFSS_i$INCOME3 <- BRFSS_i$INCOME2
BRFSS_i$INCOME3[BRFSS_i$INCOME2 >=77] <- 9

table(BRFSS_i$INCOME2, BRFSS_i$INCOME3)

##     
##          1     2     3     4     5     6     7     8     9
##   1   1165     0     0     0     0     0     0     0     0
##   2      0  2111     0     0     0     0     0     0     0
##   3      0     0  3148     0     0     0     0     0     0
##   4      0     0     0  4774     0     0     0     0     0
##   5      0     0     0     0  6491     0     0     0     0
##   6      0     0     0     0     0  9305     0     0     0
##   7      0     0     0     0     0     0  9636     0     0
##   8      0     0     0     0     0     0     0 15230     0
##   77     0     0     0     0     0     0     0     0  2132
##   99     0     0     0     0     0     0     0     0  4139

BRFSS_i$INC1 <- 0
BRFSS_i$INC2 <- 0
BRFSS_i$INC3 <- 0
BRFSS_i$INC4 <- 0
BRFSS_i$INC5 <- 0
BRFSS_i$INC6 <- 0
BRFSS_i$INC7 <- 0

BRFSS_i$INC1[BRFSS_i$INCOME3 == 1] <- 1
table(BRFSS_i$INC1, BRFSS_i$INCOME3)

##    
##         1     2     3     4     5     6     7     8     9
##   0     0  2111  3148  4774  6491  9305  9636 15230  6271
##   1  1165     0     0     0     0     0     0     0     0

BRFSS_i$INC2[BRFSS_i$INCOME3 == 2] <- 1
table(BRFSS_i$INC2, BRFSS_i$INCOME3)

##    
##         1     2     3     4     5     6     7     8     9
##   0  1165     0  3148  4774  6491  9305  9636 15230  6271
##   1     0  2111     0     0     0     0     0     0     0

BRFSS_i$INC3[BRFSS_i$INCOME3 == 3] <- 1
table(BRFSS_i$INC3, BRFSS_i$INCOME3)

##    
##         1     2     3     4     5     6     7     8     9
##   0  1165  2111     0  4774  6491  9305  9636 15230  6271
##   1     0     0  3148     0     0     0     0     0     0

BRFSS_i$INC4[BRFSS_i$INCOME3 == 4] <- 1
table(BRFSS_i$INC4, BRFSS_i$INCOME3)

##    
##         1     2     3     4     5     6     7     8     9
##   0  1165  2111  3148     0  6491  9305  9636 15230  6271
##   1     0     0     0  4774     0     0     0     0     0

BRFSS_i$INC5[BRFSS_i$INCOME3 == 5] <- 1
table(BRFSS_i$INC5, BRFSS_i$INCOME3)

##    
##         1     2     3     4     5     6     7     8     9
##   0  1165  2111  3148  4774     0  9305  9636 15230  6271
##   1     0     0     0     0  6491     0     0     0     0

BRFSS_i$INC6[BRFSS_i$INCOME3 == 6] <- 1
table(BRFSS_i$INC6, BRFSS_i$INCOME3)

##    
##         1     2     3     4     5     6     7     8     9
##   0  1165  2111  3148  4774  6491     0  9636 15230  6271
##   1     0     0     0     0     0  9305     0     0     0

BRFSS_i$INC7[BRFSS_i$INCOME3 == 7] <- 1
table(BRFSS_i$INC7, BRFSS_i$INCOME3)

##    
##         1     2     3     4     5     6     7     8     9
##   0  1165  2111  3148  4774  6491  9305     0 15230  6271
##   1     0     0     0     0     0     0  9636     0     0

Make BMI variables

BRFSS_i$BMICAT<- 9
BRFSS_i$BMICAT[BRFSS_i$X_BMI5CAT ==1] <- 1
BRFSS_i$BMICAT[BRFSS_i$X_BMI5CAT ==2] <- 2
BRFSS_i$BMICAT[BRFSS_i$X_BMI5CAT ==3] <- 3
BRFSS_i$BMICAT[BRFSS_i$X_BMI5CAT ==4] <- 4

table(BRFSS_i$BMICAT, BRFSS_i$X_BMI5CAT)

##    
##         1     2     3     4
##   1   478     0     0     0
##   2     0 14340     0     0
##   3     0     0 25572     0
##   4     0     0     0 16871
##   9     0     0     0     0

BRFSS_i$UNDWT <- 0
BRFSS_i$OVWT <- 0
BRFSS_i$OBESE <- 0

BRFSS_i$UNDWT[BRFSS_i$BMICAT== 1] <- 1
table(BRFSS_i$UNDWT, BRFSS_i$BMICAT)

##    
##         1     2     3     4     9
##   0     0 14340 25572 16871   870
##   1   478     0     0     0     0

BRFSS_i$OVWT[BRFSS_i$BMICAT== 3] <- 1
table(BRFSS_i$OVWT, BRFSS_i$BMICAT)

##    
##         1     2     3     4     9
##   0   478 14340     0 16871   870
##   1     0     0 25572     0     0

BRFSS_i$OBESE[BRFSS_i$BMICAT== 4] <- 1
table(BRFSS_i$OBESE, BRFSS_i$BMICAT)

##    
##         1     2     3     4     9
##   0   478 14340 25572     0   870
##   1     0     0     0 16871     0

Make exercise variables

BRFSS_i$EXERANY3<- 9
BRFSS_i$EXERANY3[BRFSS_i$EXERANY2 ==1] <- 1
BRFSS_i$EXERANY3[BRFSS_i$EXERANY2 ==2] <- 2

table(BRFSS_i$EXERANY3, BRFSS_i$EXERANY2)

##    
##         1     2     7     9
##   1 44357     0     0     0
##   2     0 13641     0     0
##   9     0     0    57    75

BRFSS_i$NOEXER <- 0
BRFSS_i$NOEXER[BRFSS_i$EXERANY3 ==2] <- 1
table(BRFSS_i$NOEXER, BRFSS_i$EXERANY3)

##    
##         1     2     9
##   0 44357     0   133
##   1     0 13641     0

nrow(BRFSS_i)

## [1] 58131

Write out analytic dataset

write.csv(BRFSS_i, file = "analytic.csv")

Now that we have a clean dataset in hand, let us analyze the data.

#read in analytic table
analytic <- read.csv(file="C:/CompleteMLProjects/Healthcare/BRFSS/Analytics/Code/analytic.csv", header=TRUE, sep=",")

#Look at distribution of categorical outcome asthma

AsthmaFreq <- table(analytic$ASTHMA4)
AsthmaFreq

## 
##     0     1 
## 52788  5343

write.csv(AsthmaFreq, file = "AsthmaFreq.csv")

#what proportion of our dataset has asthma?
#proportion = asthma count / total count (not asthma count / no-asthma count)
PropAsthma <- 5343/(52788 + 5343)
PropAsthma

## [1] 0.09191309

#Look at categorical outcome asthma by exposure, ALCGRP
AsthmaAlcFreq <- table(analytic$ASTHMA4, analytic$ALCGRP)
AsthmaAlcFreq

##    
##         1     2     3
##   0 23498 20749  8541
##   1  2671  1897   775

write.csv(AsthmaAlcFreq, file = "AsthmaAlcFreq.csv")
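
Raw counts are hard to compare across alcohol groups of different sizes, so a proportion table helps. The sketch below rebuilds the ASTHMA4 by ALCGRP counts shown above as a matrix (so it runs without the data file) and uses base R's `prop.table`; the object names are illustrative.

```r
# Counts copied from the ASTHMA4 x ALCGRP table in the text
AsthmaAlc <- matrix(
  c(23498, 2671,    # ALCGRP 1: no drinks
    20749, 1897,    # ALCGRP 2: monthly
     8541,  775),   # ALCGRP 3: weekly
  nrow = 2,
  dimnames = list(ASTHMA4 = c("0", "1"), ALCGRP = c("1", "2", "3"))
)

# Overall proportion with asthma: asthma rows divided by all rows
overall <- sum(AsthmaAlc["1", ]) / sum(AsthmaAlc)

# Column proportions: share with and without asthma inside each alcohol group
by_group <- prop.table(AsthmaAlc, margin = 2)

round(overall, 4)
round(by_group, 4)
```

With `margin = 2` each column sums to 1, so the second row reads directly as the asthma proportion within each alcohol group.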

Look at distribution of sleep duration

#summary statistics
summary(analytic$SLEPTIM2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   6.000   7.000   7.116   8.000  24.000

#look at histogram and box plot of total file
hist(analytic$SLEPTIM2, 
    main = "Histogram of SLEPTIM2",
    xlab = "Class SLEPTIM2",
    ylab = "Frequency",
    xlim=c(0,15), 
    ylim=c(0,20000),
    border = "red",
    col= "yellow",
    las = 1,
    breaks = 24)

boxplot(analytic$SLEPTIM2, main="Box Plot of SLEPTIM2", 
    xlab="Total File", ylab="SLEPTIM2")

#See box plots of groups next to each other
boxplot(SLEPTIM2~ALCGRP, data=analytic, main="Box Plot of SLEPTIM2 by ALCGRP", 
    xlab="ALCGRP", ylab="SLEPTIM2")

Making frequencies per category

AsthmaFreq <- table(analytic$ASTHMA4)
AsthmaFreq

## 
##     0     1 
## 52788  5343

write.csv(AsthmaFreq, file = "AsthmaFreq.csv")

AlcFreq <- table(analytic$ALCGRP)
AlcFreq

## 
##     1     2     3 
## 26169 22646  9316

write.csv(AlcFreq , file = "AlcFreq.csv")

#USING MACROS

#install package gtools
#then call up library

library(gtools)
#use defmacro to define the macro
FreqTbl <-defmacro(OutputTable, InputVar, CSVTable, 
expr={
OutputTable <- table(InputVar);
write.csv(OutputTable, file = paste0(CSVTable, ".csv"))
})

FreqTbl (AlcFreq, analytic$ALCGRP, "Alc")
FreqTbl (AgeFreq, analytic$X_AGE_G, "Age")
FreqTbl (SexFreq, analytic$SEX, "Sex")
FreqTbl (HispFreq, analytic$X_HISPANC, "Hisp")
FreqTbl (RaceFreq, analytic$RACEGRP, "Race")
FreqTbl (MaritalFreq, analytic$MARGRP, "Mar")
FreqTbl (EdFreq, analytic$EDGROUP, "Ed")
FreqTbl (IncFreq, analytic$INCOME3, "Inc")
FreqTbl (BMIFreq, analytic$BMICAT, "BMI")
FreqTbl (SmokeFreq, analytic$SMOKGRP, "Smok")
FreqTbl (ExerFreq, analytic$EXERANY3, "Exer")
FreqTbl (HlthPlanFreq, analytic$HLTHPLN2, "HlthPln")
FreqTbl (GenHlthFreq, analytic$GENHLTH2, "GenHlth")
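
For comparison, the same tabulate-and-write step can be sketched with a plain function instead of `defmacro`. This is an alternative technique, not the approach used above: `freq_tbl`, the toy data, and the choice to write into `tempdir()` are all assumptions made for illustration.

```r
# Hypothetical helper: tabulate one variable, write it to CSV, return the table
freq_tbl <- function(x, name, dir = tempdir()) {
  tbl  <- table(x, useNA = "ifany")
  path <- file.path(dir, paste0(name, ".csv"))
  write.csv(as.data.frame(tbl), file = path, row.names = FALSE)
  tbl
}

# Made-up data standing in for the analytic dataset
toy <- data.frame(ALCGRP = c(1, 1, 2, 3, 3, 3), SEX = c(1, 2, 1, 1, 1, 2))

# One call per variable, driven by a named vector of output file stems
stems <- c(ALCGRP = "Alc", SEX = "Sex")
freqs <- lapply(names(stems), function(v) freq_tbl(toy[[v]], stems[[v]]))
names(freqs) <- names(stems)

freqs$ALCGRP
```

A function returns its table rather than creating a variable in the caller's environment, so the results are collected in a list instead of in separately named objects.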

Checking for Asthma frequencies

### Subset dataset to people with asthma only
asthmaonly <- subset(analytic, ASTHMA4 == 1)
table(asthmaonly$ASTHMA4)

## 
##    1 
## 5343

nrow(asthmaonly)

## [1] 5343

AsthmaFreq <- table(asthmaonly$ASTHMA4)
AsthmaFreq

## 
##    1 
## 5343

write.csv(AsthmaFreq, file = "Asthma.csv")

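#USING MACROS
#Note: these calls reuse the CSV file names from the full-dataset run, so those files are overwritten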
library(gtools)
#use defmacro to define the macro
FreqTbl <-defmacro(OutputTable, InputVar, CSVTable, 
expr={
OutputTable <- table(InputVar);
write.csv(OutputTable, file = paste0(CSVTable, ".csv"))
})

FreqTbl (AlcGrpFreq, asthmaonly$ALCGRP, "Alc")
FreqTbl (AgeGrpFreq, asthmaonly$X_AGE_G, "Age")
FreqTbl (SexFreq, asthmaonly$SEX, "Sex")
FreqTbl (HispFreq, asthmaonly$X_HISPANC, "Hisp")
FreqTbl (RaceFreq, asthmaonly$RACEGRP, "Race")
FreqTbl (MaritalFreq, asthmaonly$MARGRP, "Mar")
FreqTbl (EdFreq, asthmaonly$EDGROUP, "Ed")
FreqTbl (IncFreq, asthmaonly$INCOME3, "Inc")
FreqTbl (BMIFreq, asthmaonly$BMICAT, "BMI")
FreqTbl (SmokeFreq, asthmaonly$SMOKGRP, "Smok")
FreqTbl (ExerFreq, asthmaonly$EXERANY3, "Exer")
FreqTbl (HlthPlanFreq, asthmaonly$HLTHPLN2, "HlthPln")
FreqTbl (GenHlthFreq, asthmaonly$GENHLTH2, "GenHlth")

Checking for No Asthma frequencies

#subset dataset to respondents without asthma
noasthmaonly <- subset(analytic, ASTHMA4 != 1)
table(noasthmaonly$ASTHMA4)

## 
##     0 
## 52788

nrow(noasthmaonly)

## [1] 52788

AsthmaFreq <- table(noasthmaonly$ASTHMA4)
AsthmaFreq

## 
##     0 
## 52788

write.csv(AsthmaFreq, file = "Asthma.csv")

#USING MACROS
library(gtools)
#use defmacro to define the macro
FreqTbl <-defmacro(OutputTable, InputVar, CSVTable, 
expr={
OutputTable <- table(InputVar);
write.csv(OutputTable, file = paste0(CSVTable, ".csv"))
})

FreqTbl (AlcGrpFreq, noasthmaonly$ALCGRP, "Alc")
FreqTbl (AgeGrpFreq, noasthmaonly$X_AGE_G, "Age")
FreqTbl (SexFreq, noasthmaonly$SEX, "Sex")
FreqTbl (HispFreq, noasthmaonly$X_HISPANC, "Hisp")
FreqTbl (RaceFreq, noasthmaonly$RACEGRP, "Race")
FreqTbl (MaritalFreq, noasthmaonly$MARGRP, "Mar")
FreqTbl (EdFreq, noasthmaonly$EDGROUP, "Ed")
FreqTbl (IncFreq, noasthmaonly$INCOME3, "Inc")
FreqTbl (BMIFreq, noasthmaonly$BMICAT, "BMI")
FreqTbl (SmokeFreq, noasthmaonly$SMOKGRP, "Smok")
FreqTbl (ExerFreq, noasthmaonly$EXERANY3, "Exer")
FreqTbl (HlthPlanFreq, noasthmaonly$HLTHPLN2, "HlthPln")
FreqTbl (GenHlthFreq, noasthmaonly$GENHLTH2, "GenHlth")
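
The three passes above (everyone, asthma only, no asthma) can also be read off a single two-way table. This is a hedged sketch on made-up data: `analytic_demo` stands in for `analytic`, and only the variable names `ALCGRP` and `ASTHMA4` come from the text.

```r
# Toy stand-in for the analytic dataset (values are made up for illustration)
analytic_demo <- data.frame(
  ASTHMA4 = c(0, 0, 1, 0, 1, 0, 0, 1),
  ALCGRP  = c(1, 2, 1, 3, 2, 1, 2, 1)
)

# Rows: alcohol group; columns: asthma status (0 = no, 1 = yes)
alc_by_asthma <- table(ALCGRP  = analytic_demo$ALCGRP,
                       ASTHMA4 = analytic_demo$ASTHMA4)

# addmargins() appends row and column totals, so one table covers all three views
alc_counts <- addmargins(alc_by_asthma)

# Column percentages: distribution of alcohol group within each asthma stratum
alc_col_pct <- round(100 * prop.table(alc_by_asthma, margin = 2), 1)

alc_counts
alc_col_pct
```

The `Sum` column of `alc_counts` matches the full-sample frequencies, while the `0` and `1` columns match the no-asthma and asthma subsets.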

Means and Standard Deviations

mean(analytic$SLEPTIM2)

## [1] 7.115756

sd(analytic$SLEPTIM2)

## [1] 1.468601

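#load package plyr for ddply()
#(plyr is attached after dplyr here, so it masks several dplyr functions; see the message below)
library(plyr)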

## -------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## -------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

#example
ddply(analytic,~ALCGRP,summarise,mean=mean(SLEPTIM2),sd=sd(SLEPTIM2))

##   ALCGRP     mean       sd
## 1      1 7.126103 1.593871
## 2      2 7.090259 1.375262
## 3      3 7.148669 1.312195

#USING MACROS
library(gtools)
SumTbl <- defmacro(OutputTable, GroupVar, CSVTable,
expr={
OutputTable <- ddply(analytic,~GroupVar,summarise,mean=mean(SLEPTIM2),sd=sd(SLEPTIM2));
write.csv(OutputTable, file = paste0(CSVTable, ".csv"))
})

SumTbl (AlcGrpSum, analytic$ALCGRP, "Alc")
SumTbl (AgeGrpSum, analytic$X_AGE_G, "Age")
SumTbl (SexSum, analytic$SEX, "Sex")
SumTbl (HispSum, analytic$X_HISPANC, "Hisp")
SumTbl (RaceSum, analytic$RACEGRP, "Race")
SumTbl (MaritalSum, analytic$MARGRP, "Mar")
SumTbl (EdSum, analytic$EDGROUP, "Ed")
SumTbl (IncSum, analytic$INCOME3, "Inc")
SumTbl (BMISum, analytic$BMICAT, "BMI")
SumTbl (SmokeSum, analytic$SMOKGRP, "Smok")
SumTbl (ExerSum, analytic$EXERANY3, "Exer")
SumTbl (HlthPlanSum, analytic$HLTHPLN2, "HlthPln")
SumTbl (GenHlthSum, analytic$GENHLTH2, "GenHlth")
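
If the plyr/dplyr masking is a concern, the same grouped means and standard deviations can be produced with base R alone. The sketch below uses `aggregate()` on a made-up stand-in (`analytic_demo`); it is one possible alternative, not the approach used in this analysis.

```r
# Toy stand-in for the analytic dataset (values are made up for illustration)
analytic_demo <- data.frame(
  ALCGRP   = c(1, 1, 2, 2, 3, 3),
  SLEPTIM2 = c(6, 8, 7, 7, 5, 9)
)

# Mean and SD of sleep duration within each alcohol group
sleep_by_alc <- aggregate(SLEPTIM2 ~ ALCGRP, data = analytic_demo,
                          FUN = function(x) c(mean = mean(x), sd = sd(x)))

# aggregate() stores both statistics in one matrix column; flatten it
# into ordinary columns named SLEPTIM2.mean and SLEPTIM2.sd
sleep_by_alc <- do.call(data.frame, sleep_by_alc)

sleep_by_alc
```

The flattened data frame can be passed to `write.csv()` in the same way as the `ddply()` output.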

weights example

WeightVarList <- c("X_STATE", "X_LLCPWT", "ASTHMA3")

BRFSS_weights <- subset(BRFSS_a[WeightVarList])

colnames(BRFSS_weights)

## [1] "X_STATE"  "X_LLCPWT" "ASTHMA3"

nrow(BRFSS_weights)

## [1] 464664

#use questionr package

library(questionr)

WeightedAsthma <- wtd.table(BRFSS_weights$ASTHMA3, 
    y=BRFSS_weights$X_STATE, weights = BRFSS_weights$X_LLCPWT, normwt = FALSE, na.rm = TRUE,
    na.show = FALSE)
write.csv(WeightedAsthma, file = "WeightedAsthma.csv")
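
With `normwt = FALSE`, each cell of the weighted table is the sum of the survey weights of the respondents in that cell. The sketch below reproduces that idea with base R's `xtabs()` and converts the sums into weighted percentages per state. The data are made up, `brfss_demo` is a hypothetical stand-in for `BRFSS_weights`, and the coding 1 = yes, 2 = no for `ASTHMA3` is an assumption that should be checked against the codebook.

```r
# Toy stand-in for BRFSS_weights (values are made up for illustration)
brfss_demo <- data.frame(
  X_STATE  = c(1, 1, 1, 2, 2),
  ASTHMA3  = c(1, 2, 2, 1, 2),          # assumed coding: 1 = yes, 2 = no
  X_LLCPWT = c(100, 300, 100, 50, 150)  # survey weight per respondent
)

# Sum of weights in each asthma-by-state cell
weighted_counts <- xtabs(X_LLCPWT ~ ASTHMA3 + X_STATE, data = brfss_demo)

# Weighted percentage within each state (columns sum to 100)
weighted_pct <- 100 * prop.table(weighted_counts, margin = 2)

weighted_counts
weighted_pct
```

In this toy example, state 1 has 100 of 500 weighted respondents reporting asthma, so the weighted rate is 20 percent even though only one of three sampled respondents answered yes.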

Table1 Chisq

#load MASS library (chisq.test() itself is part of base R's stats package)

library(MASS)

#make table

AlcTbl = table(analytic$ASTHMA4, analytic$ALCGRP) 

#run test
chisq.test(AlcTbl)

## 
##  Pearson's Chi-squared test
## 
## data:  AlcTbl
## X-squared = 58.823, df = 2, p-value = 1.686e-13

#make macro

library(gtools)

ChiTest <- defmacro(VarName, TblName, expr={
TblName = table(analytic$ASTHMA4, analytic$VarName); 
chisq.test(TblName)})

ChiTest(ALCGRP, AlcTbl)

## 
##  Pearson's Chi-squared test
## 
## data:  AlcTbl
## X-squared = 58.823, df = 2, p-value = 1.686e-13

ChiTest(X_AGE_G, AgeTbl)

## 
##  Pearson's Chi-squared test
## 
## data:  AgeTbl
## X-squared = 54.193, df = 5, p-value = 1.913e-10

ChiTest(SEX, SexTbl)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  SexTbl
## X-squared = 250, df = 1, p-value < 2.2e-16

ChiTest(X_HISPANC, HispTbl)

## 
##  Pearson's Chi-squared test
## 
## data:  HispTbl
## X-squared = 4.7509, df = 2, p-value = 0.09297

ChiTest(RACEGRP, RaceTbl)

## 
##  Pearson's Chi-squared test
## 
## data:  RaceTbl
## X-squared = 97.668, df = 6, p-value < 2.2e-16

ChiTest(MARGRP, MarTbl)

## 
##  Pearson's Chi-squared test
## 
## data:  MarTbl
## X-squared = 51.822, df = 3, p-value = 3.269e-11

ChiTest(EDGROUP, EdTbl)

## 
##  Pearson's Chi-squared test
## 
## data:  EdTbl
## X-squared = 59.697, df = 4, p-value = 3.359e-12

ChiTest(INCOME3, IncTbl)

## 
##  Pearson's Chi-squared test
## 
## data:  IncTbl
## X-squared = 269.59, df = 8, p-value < 2.2e-16

ChiTest(BMICAT, BMITbl)

## 
##  Pearson's Chi-squared test
## 
## data:  BMITbl
## X-squared = 154.35, df = 4, p-value < 2.2e-16

ChiTest(SMOKGRP, SmokTbl)

## 
##  Pearson's Chi-squared test
## 
## data:  SmokTbl
## X-squared = 34.156, df = 2, p-value = 3.829e-08

ChiTest(EXERANY3, ExerTbl)

## 
##  Pearson's Chi-squared test
## 
## data:  ExerTbl
## X-squared = 116.25, df = 2, p-value < 2.2e-16

ChiTest(HLTHPLN2, HlthPlnTbl)

## 
##  Pearson's Chi-squared test
## 
## data:  HlthPlnTbl
## X-squared = 6.3515, df = 2, p-value = 0.04176

ChiTest(GENHLTH2, GenHlthTbl)

## 
##  Pearson's Chi-squared test
## 
## data:  GenHlthTbl
## X-squared = 929.84, df = 5, p-value < 2.2e-16
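
When the test results are needed as a table rather than as printed output, the tests can be run in a loop and the statistic, degrees of freedom and p-value collected in a data frame. This is a sketch under stated assumptions: `analytic_demo` is a made-up stand-in for `analytic`, and `covariates` lists only two of the variables.

```r
# Toy stand-in for the analytic dataset (counts are made up for illustration)
analytic_demo <- data.frame(
  ASTHMA4 = rep(c(0, 1), times = c(60, 40)),
  ALCGRP  = c(rep(1:3, times = c(30, 20, 10)), rep(1:3, times = c(10, 10, 20))),
  SEX     = c(rep(1:2, times = c(30, 30)), rep(1:2, times = c(20, 20)))
)

covariates <- c("ALCGRP", "SEX")

# One chi-squared test of asthma status against each covariate
chi_results <- lapply(covariates, function(v) {
  chisq.test(table(analytic_demo$ASTHMA4, analytic_demo[[v]]))
})
names(chi_results) <- covariates

# Collect the key numbers from each "htest" object
chi_summary <- data.frame(
  variable  = covariates,
  statistic = vapply(chi_results, function(r) unname(r$statistic), numeric(1),
                     USE.NAMES = FALSE),
  df        = vapply(chi_results, function(r) unname(r$parameter), numeric(1),
                     USE.NAMES = FALSE),
  p_value   = vapply(chi_results, function(r) r$p.value, numeric(1),
                     USE.NAMES = FALSE)
)

chi_summary
```

As in the output above, `chisq.test()` applies Yates' continuity correction automatically for 2 x 2 tables such as the one for `SEX`.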

ANOVAS for Table 1

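#example ANOVA
#Note: ALCGRP is numeric, so lm() fits a single slope (1 df, a linear trend test)
#rather than comparing the three group means; use factor(ALCGRP) for a one-way ANOVA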

AlcANOVA <- lm(formula = SLEPTIM2 ~ ALCGRP, data = analytic)
summary(AlcANOVA)

## 
## Call:
## lm(formula = SLEPTIM2 ~ ALCGRP, data = analytic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1173 -1.1149 -0.1149  0.8839 16.8851 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.113753   0.015596 456.129   <2e-16 ***
## ALCGRP      0.001171   0.008396   0.139    0.889    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.469 on 58129 degrees of freedom
## Multiple R-squared:  3.347e-07,  Adjusted R-squared:  -1.687e-05 
## F-statistic: 0.01945 on 1 and 58129 DF,  p-value: 0.8891

#make macro

library(gtools)

ANOVATest <- defmacro(VarName, TblName, expr={
TblName<- lm(formula = SLEPTIM2 ~ VarName, data = analytic);
summary(TblName)})

#call macro

ANOVATest (ALCGRP, AlcANOVA)

## 
## Call:
## lm(formula = SLEPTIM2 ~ ALCGRP, data = analytic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1173 -1.1149 -0.1149  0.8839 16.8851 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.113753   0.015596 456.129   <2e-16 ***
## ALCGRP      0.001171   0.008396   0.139    0.889    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.469 on 58129 degrees of freedom
## Multiple R-squared:  3.347e-07,  Adjusted R-squared:  -1.687e-05 
## F-statistic: 0.01945 on 1 and 58129 DF,  p-value: 0.8891

ANOVATest (X_AGE_G, AgeANOVA)

## 
## Call:
## lm(formula = SLEPTIM2 ~ X_AGE_G, data = analytic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3304 -0.8282 -0.0793  0.6696 17.4229 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.823829   0.025087  232.15   <2e-16 ***
## X_AGE_G     0.251102   0.004737   53.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.434 on 58129 degrees of freedom
## Multiple R-squared:  0.04611,    Adjusted R-squared:  0.0461 
## F-statistic:  2810 on 1 and 58129 DF,  p-value: < 2.2e-16

ANOVATest (X_HISPANC, HispANOVA)

## 
## Call:
## lm(formula = SLEPTIM2 ~ X_HISPANC, data = analytic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1151 -1.1151 -0.1151  0.8849 16.8849 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.074865   0.017791 397.667   <2e-16 ***
## X_HISPANC   0.020102   0.008217   2.446   0.0144 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.469 on 58129 degrees of freedom
## Multiple R-squared:  0.0001029,  Adjusted R-squared:  8.573e-05 
## F-statistic: 5.984 on 1 and 58129 DF,  p-value: 0.01444

ANOVATest (RACEGRP, RaceANOVA)

## 
## Call:
## lm(formula = SLEPTIM2 ~ RACEGRP, data = analytic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1447 -1.1447 -0.1447  0.8553 16.9811 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.207593   0.008676  830.73   <2e-16 ***
## RACEGRP     -0.062898   0.004239  -14.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.466 on 58129 degrees of freedom
## Multiple R-squared:  0.003773,   Adjusted R-squared:  0.003756 
## F-statistic: 220.1 on 1 and 58129 DF,  p-value: < 2.2e-16

ANOVATest (MARGRP, MarANOVA)

## 
## Call:
## lm(formula = SLEPTIM2 ~ MARGRP, data = analytic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1312 -1.0962 -0.1312  0.8688 16.9389 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.166251   0.009925 722.058  < 2e-16 ***
## MARGRP      -0.035043   0.005439  -6.443 1.18e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.468 on 58129 degrees of freedom
## Multiple R-squared:  0.0007136,  Adjusted R-squared:  0.0006964 
## F-statistic: 41.51 on 1 and 58129 DF,  p-value: 1.181e-10

ANOVATest (EDGROUP, EdANOVA)

## 
## Call:
## lm(formula = SLEPTIM2 ~ EDGROUP, data = analytic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1284 -1.1026 -0.1155  0.8845 16.8974 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.076713   0.020434 346.323   <2e-16 ***
## EDGROUP     0.012928   0.006458   2.002   0.0453 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.469 on 58129 degrees of freedom
## Multiple R-squared:  6.893e-05,  Adjusted R-squared:  5.172e-05 
## F-statistic: 4.007 on 1 and 58129 DF,  p-value: 0.04532

ANOVATest (INCOME3, IncANOVA)

## 
## Call:
## lm(formula = SLEPTIM2 ~ INCOME3, data = analytic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1601 -1.0937 -0.1103  0.8731 16.9229 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.010660   0.020018 350.218  < 2e-16 ***
## INCOME3     0.016604   0.003013   5.511 3.58e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.468 on 58129 degrees of freedom
## Multiple R-squared:  0.0005223,  Adjusted R-squared:  0.0005051 
## F-statistic: 30.37 on 1 and 58129 DF,  p-value: 3.578e-08

ANOVATest (BMICAT, BMIANOVA)

## 
## Call:
## lm(formula = SLEPTIM2 ~ BMICAT, data = analytic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2210 -1.0718 -0.1216  0.8784 16.8784 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.270774   0.019127 380.128   <2e-16 ***
## BMICAT      -0.049735   0.005818  -8.549   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.468 on 58129 degrees of freedom
## Multiple R-squared:  0.001256,   Adjusted R-squared:  0.001239 
## F-statistic: 73.09 on 1 and 58129 DF,  p-value: < 2.2e-16

ANOVATest (SMOKGRP, SmokANOVA)

## 
## Call:
## lm(formula = SLEPTIM2 ~ SMOKGRP, data = analytic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1311 -1.1311 -0.1311  0.8689 17.0067 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.855398   0.019435  352.74   <2e-16 ***
## SMOKGRP     0.137860   0.009774   14.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.466 on 58129 degrees of freedom
## Multiple R-squared:  0.003411,   Adjusted R-squared:  0.003394 
## F-statistic: 198.9 on 1 and 58129 DF,  p-value: < 2.2e-16

ANOVATest (EXERANY3, ExerANOVA)

## 
## Call:
## lm(formula = SLEPTIM2 ~ EXERANY3, data = analytic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1284 -1.1115 -0.1115  0.8885 16.8885 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.09463    0.01486 477.468   <2e-16 ***
## EXERANY3     0.01686    0.01082   1.559    0.119    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.469 on 58129 degrees of freedom
## Multiple R-squared:  4.18e-05,   Adjusted R-squared:  2.46e-05 
## F-statistic:  2.43 on 1 and 58129 DF,  p-value: 0.119

ANOVATest (HLTHPLN2, HlthPlnANOVA)

## 
## Call:
## lm(formula = SLEPTIM2 ~ HLTHPLN2, data = analytic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1191 -1.1191 -0.1191  0.8809 16.8809 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.17916    0.01629 440.606  < 2e-16 ***
## HLTHPLN2    -0.06003    0.01431  -4.195 2.73e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.468 on 58129 degrees of freedom
## Multiple R-squared:  0.0003027,  Adjusted R-squared:  0.0002855 
## F-statistic:  17.6 on 1 and 58129 DF,  p-value: 2.73e-05

ANOVATest (GENHLTH2, GenHlthANOVA)

## 
## Call:
## lm(formula = SLEPTIM2 ~ GENHLTH2, data = analytic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1852 -1.1019 -0.1019  0.8981 16.9397 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.226787   0.015306 472.167  < 2e-16 ***
## GENHLTH2    -0.041631   0.005265  -7.907 2.69e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.468 on 58129 degrees of freedom
## Multiple R-squared:  0.001074,   Adjusted R-squared:  0.001057 
## F-statistic: 62.52 on 1 and 58129 DF,  p-value: 2.689e-15
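
Every model above reports a single coefficient and an F-statistic on 1 degree of freedom, because the grouping variables are stored as numbers and `lm()` treats them as continuous. The sketch below, on made-up data, contrasts that trend model with a one-way ANOVA obtained by wrapping the predictor in `factor()`. `analytic_demo` is a hypothetical stand-in for `analytic`, and no results for the real data are implied.

```r
# Toy stand-in for the analytic dataset (values are made up for illustration)
analytic_demo <- data.frame(
  ALCGRP   = rep(1:3, each = 4),
  SLEPTIM2 = c(6, 7, 8, 7,  7, 8, 6, 7,  5, 9, 7, 8)
)

# Numeric predictor: one slope, i.e. a test for linear trend (1 df)
trend_fit <- lm(SLEPTIM2 ~ ALCGRP, data = analytic_demo)

# Factor predictor: one mean per group, i.e. a one-way ANOVA (groups - 1 df)
anova_fit <- lm(SLEPTIM2 ~ factor(ALCGRP), data = analytic_demo)

trend_table <- anova(trend_fit)
anova_table <- anova(anova_fit)

trend_table
anova_table
```

With three alcohol groups the factor version uses 2 degrees of freedom for the group term, which is the comparison of group means that a Table 1 ANOVA p-value usually refers to.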

ttests for Table 1

t.test(analytic$SLEPTIM2~analytic$ASTHMA4)

## 
##  Welch Two Sample t-test
## 
## data:  analytic$SLEPTIM2 by analytic$ASTHMA4
## t = 5.8738, df = 6060, p-value = 4.485e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.09852347 0.19722952
## sample estimates:
## mean in group 0 mean in group 1 
##        7.129348        6.981471

t.test(analytic$SLEPTIM2~analytic$SEX)

## 
##  Welch Two Sample t-test
## 
## data:  analytic$SLEPTIM2 by analytic$SEX
## t = 12.658, df = 6146.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.2342547 0.3201083
## sample estimates:
## mean in group 1 mean in group 2 
##        7.140360        6.863178

The descriptive analysis of the BRFSS data is now complete. Descriptive analysis often leads into regression, so the next part proceeds with linear regression for this analysis.

Descriptive BRFSS Healthcare Data Analytics using R

Nazima Khan

May 9, 2017

Objective

To analyze Behavioral Risk Factor Surveillance System (BRFSS) data for “VETERANS” using the BRFSS dataset.
