library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
load("BRFSS 2013.RData")
Before we start analyzing the data, we must filter out invalid observations that could hamper the analysis. Since we will split the data by state, I extracted all the unique values this variable can assume and noticed two levels, X_state = 0 and X_state = 80, that are not valid states, so I excluded the corresponding rows.
unique(brfss2013$X_state);
## [1] Alabama Alaska Arizona
## [4] Arkansas California Colorado
## [7] Connecticut Delaware District of Columbia
## [10] Florida 80 Georgia
## [13] Hawaii Idaho Illinois
## [16] Indiana Iowa Kansas
## [19] Kentucky Louisiana Maine
## [22] Maryland Massachusetts Michigan
## [25] 0 Minnesota Mississippi
## [28] Missouri Montana Nebraska
## [31] Nevada New Hampshire New Jersey
## [34] New Mexico New York North Carolina
## [37] North Dakota Ohio Oklahoma
## [40] Oregon Pennsylvania Rhode Island
## [43] South Carolina South Dakota Tennessee
## [46] Texas Utah Vermont
## [49] Virginia Washington West Virginia
## [52] Wisconsin Wyoming Guam
## [55] Puerto Rico
## 55 Levels: 0 Alabama Alaska Arizona Arkansas California ... 80
# Remove invalid data
brfss2013 <- brfss2013[-which(brfss2013$X_state == 80 | brfss2013$X_state == 0),]
# remove rows where genhlth is NA
brfss2013 <- brfss2013[-which(is.na(brfss2013$genhlth)),]
# drop the now-unused factor levels of X_state
brfss2013$X_state <- factor(brfss2013$X_state);
# save this dataset to use in question 3
q3_brfss2013 <- brfss2013;
Since the number of interviewees varies greatly between states, before graphing our variables I found the state with the fewest interviewees and down-sampled every other state to that same number. In this manner, each state contributes roughly the same number of observations.
###########################################################################
# as we are using a random number generator, set a seed so you can reproduce the same sample_n results
set.seed(123);
# start sampling each state; but how many rows should we sample from each one?
tableStats <- table(brfss2013$X_state);
minSample <- min(tableStats) # get the smallest state sample size (1896)
# out of curiosity, which state had the fewest interviewees? (Guam)
tableStats[which(tableStats == min(tableStats))]
## Guam
## 1896
# sample minSample samples from each state
brfss2013 <- brfss2013 %>% group_by(X_state) %>% sample_n(minSample);
As the project description explains, the data was collected from all 50 US states plus a few additional territories, providing a solid basis for evidence about the American population aged 18 and older. Moreover, because the interviewees were randomly sampled from the entire US population, the findings can be generalized to that population. However, since there was no random assignment, it is not possible to extract causal relations from the data. In short, the answers to the questions raised in this report are generalizable, but not causal.
An additional caveat is that, since the data was collected first from landline telephones and only later from cellphones, there may be a dependence between variables: people with the financial means to own a telephone or cellphone may also have easier access to educational resources about general public health.
The main characteristic of this survey is its random sample of people from all 50 US states. Starting from this point, we can try to find which states have a better general health status (GHS) and analyze a few variables that might influence the GHS.
Research question 1: Which states have the best general health status?
Research question 2: How is the sex of the interviewee related to the GHS?
Research question 3: Is the number of minutes dedicated to physical activity related to the GHS?
Research question 1:
First, consider the levels that the GHS variable (genhlth) can assume:
levels(brfss2013$genhlth)
## [1] "Excellent" "Very good" "Good" "Fair" "Poor"
Let's give a score to each GHS level: Excellent = 5, Very good = 4, Good = 3, Fair = 2, Poor = 1.
For the first question, we will create a summary statistic named mean_GHS, calculated as the sum of the GHS scores (from the genhlth variable) in each state divided by the maximum possible score (number of samples * 5). After this we can plot a bar chart showing the 5 states with the highest mean_GHS. If we convert the genhlth factor directly to numeric, we get the opposite of the values defined above, so a small manipulation is needed to obtain our scores.
head((brfss2013$genhlth))
## [1] Fair Fair Good Excellent Good Excellent
## Levels: Excellent Very good Good Fair Poor
head(as.numeric(brfss2013$genhlth)) # the factor coding is the opposite of what we want (Excellent = 1, Very good = 2, etc.)
## [1] 4 4 3 1 3 1
# add a new column with our desired numeric values
brfss2013 <- mutate(brfss2013, numeric_GHS = (5 - as.numeric(genhlth)) + 1)
head(brfss2013[,c('genhlth','numeric_GHS')]) # now we have the desired conversion
## Source: local data frame [6 x 2]
##
## genhlth numeric_GHS
## (fctr) (dbl)
## 1 Fair 2
## 2 Fair 2
## 3 Good 3
## 4 Excellent 5
## 5 Good 3
## 6 Excellent 5
# extract mean_GHS from each state
statesAndMean_GHS <- brfss2013 %>% group_by(X_state) %>% summarise(mean_GHS = sum(numeric_GHS)/(n() * 5));
# order by mean_GHS (descending) and print the 10 highest scores
statesAndMean_GHS <- statesAndMean_GHS[order(statesAndMean_GHS$mean_GHS,decreasing = TRUE),]
statesAndMean_GHS[1:10,]
## Source: local data frame [10 x 2]
##
## X_state mean_GHS
## (fctr) (dbl)
## 1 Minnesota 0.7320675
## 2 Vermont 0.7283755
## 3 District of Columbia 0.7272152
## 4 Colorado 0.7263713
## 5 New Hampshire 0.7261603
## 6 Connecticut 0.7199367
## 7 Utah 0.7133966
## 8 Alaska 0.7086498
## 9 Idaho 0.7036920
## 10 Hawaii 0.7029536
As we can see, Minnesota has the highest mean_GHS score, followed by Vermont, the District of Columbia, and others.
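The bar chart mentioned earlier can be drawn directly from this table. Here is a minimal sketch (assuming ggplot2 is loaded; top5_states is a name introduced here just for illustration):
# plot the 5 states with the highest mean_GHS (illustrative sketch)
top5_states <- statesAndMean_GHS[1:5,]
ggplot(top5_states, aes(x = reorder(X_state, -mean_GHS), y = mean_GHS)) +
geom_bar(stat = "identity") +
ggtitle("Top 5 states by mean_GHS") +
labs(x = "State", y = "mean_GHS")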
Research question 2:
For question 2, we ask whether sex makes a difference in GHS among people within the same state. For this, we selected a random sample of 5 states along with their mean_GHS values.
# choose 5 random states
randomStates <- as.character(sample(unique(brfss2013$X_state),size = 5));
randomStates
## [1] "New Mexico" "Idaho" "Wisconsin" "Connecticut" "Washington"
Next, we do the following steps:
# 1 - filter data from these 5 states
randomStatesData <- filter(brfss2013, X_state %in% randomStates);
# 2 - update factor levels
randomStatesData$X_state <- factor(randomStatesData$X_state)
# 3 - calculate the mean_GHS for each state, grouped by state and then sex
randomStatesMean_GHS <- randomStatesData %>% group_by(X_state,sex) %>% summarise(mean_GHS = sum(numeric_GHS)/(n() * 5));
# 4 - plot mean_GHS scores for each state, grouped by sex, and make the title bold
ggplot(randomStatesMean_GHS, aes(X_state, mean_GHS, fill = sex)) +
geom_bar(stat="identity", position = "dodge") +
ggtitle("mean_GHS Scores for Each City Grouped by Sex\n") +
theme(plot.title = element_text(face = "bold"))
From the plot above, we can see that the selected states all have roughly the same overall mean_GHS value, with Connecticut being the best state for both men and women, and New Mexico the worst, also for both sexes. Without a more rigorous statistical treatment, we cannot know for sure whether any of these states has a significantly higher or lower mean_GHS, nor whether there is any difference in mean_GHS between men and women. But in an exploratory sense, I would not say there is a difference in the mean_GHS score between the sexes.
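As an additional quick exploratory check (a sketch, not a formal test), we could also pool the five states together and compare the overall mean_GHS by sex alone:
# overall mean_GHS by sex, pooling the 5 sampled states (exploratory sketch)
randomStatesData %>%
group_by(sex) %>%
summarise(mean_GHS = sum(numeric_GHS)/(n() * 5))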
Research question 3:
Is the number of minutes dedicated to physical activity related to the GHS?
An important factor that can contribute to the overall GHS is the amount of exercise people do. According to the American Heart Association, it is recommended that people do at least 150 minutes of exercise per week.
In the codebook for the brfss2013, the variable “exeroft1” is defined as “How many times per week or per month did you take part in this activity during the past month?”, and the column “exerhmm1” as “And when you took part in this activity (if the person does an activity), for how many minutes or hours did you usually keep at it?”
To estimate the number of minutes each person exercises per week, I created a function to calculate it. If you look at the codebook, you will see how exeroft1 is encoded. The function estimates how many times per week each person exercised, whether the original value is coded per week or per month, and multiplies that by the number of minutes spent in each session, as given by the exerhmm1 value.
# function to calculate the amount of minutes of exercises per week
calcMinExercise <- function(dataSet){
# value starting with 1xx = amount of exercises by week
# value starting with 2xx = amount of exercises by month
if(dataSet['exeroft1'] > 200){ # if estimate is monthly
# estimate amount of weekly exercise dividing it by 4 weeks
nExercWeekly <- (dataSet['exeroft1'] - 200) / 4;
}else{
# just remove the 1xx header indicating it is weekly
nExercWeekly <- dataSet['exeroft1'] - 100;
}
# amount of exercise minutes per week
amountOfMinutesExercisesWeekly <- dataSet['exerhmm1'] * nExercWeekly;
as.numeric(amountOfMinutesExercisesWeekly);
}
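To illustrate how the function behaves, here are two made-up example inputs, passed as named vectors just as apply() will pass data frame rows below: 205 encodes 5 sessions per month (1.25 per week), and 103 encodes 3 sessions per week:
# hypothetical inputs, shown only to illustrate the two encodings
calcMinExercise(c(exeroft1 = 205, exerhmm1 = 30)) # (205-200)/4 * 30 = 37.5 min/week
calcMinExercise(c(exeroft1 = 103, exerhmm1 = 60)) # (103-100) * 60 = 180 min/week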
To perform this calculation, we must remove the rows that have NAs in either of the columns used by the function, i.e. the exeroft1 and exerhmm1 variables. We will manipulate the q3_brfss2013 dataset saved at the beginning of the script, which preserves the state of brfss2013 after preprocessing but before the manipulations done for questions 1 and 2.
# remove rows with NAs in the exerhmm1 or exeroft1 column
q3_brfss2013 <- q3_brfss2013[!is.na(q3_brfss2013$exerhmm1) & !is.na(q3_brfss2013$exeroft1),];
# calculate the number of exercise minutes per week for each row
weekyMinExercise <- apply(q3_brfss2013[,c('exeroft1','exerhmm1')],
1,
calcMinExercise);
# and append it to q3_brfss2013 dataset
q3_brfss2013 <- cbind(q3_brfss2013,weekyMinExercise);
q3_brfss2013Filtered <- q3_brfss2013[,c('weekyMinExercise', 'genhlth')];
rm(weekyMinExercise)
rm(q3_brfss2013)
We could now try to analyze whether there is any correlation between the number of minutes spent on physical activity and the GHS, but looking at the result of the previous calculation, we see that there are a few outliers.
summary(q3_brfss2013Filtered$weekyMinExercise)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -200.0 90.0 180.0 353.3 400.0 71920.0
As you can see, we have negative minimum values, as well as huge values for the number of minutes spent on physical activity in a week, e.g. 71920, which would be roughly 1199 hours of exercise per week, or about 171 hours per day (more hours than a day has!).
One approach we could take to remove the outliers is the 68-95-99.7 rule: the probability that a point lies more than 3 standard deviations from the mean is less than 1%. However, our data is not normal, as we can see from the difference between its mean and median. The value 71920 is clearly an outlier, but it may not be the only one. A way to verify this is to calculate the 99th percentile of the data, i.e. the cut-off point below which 99% of the data lies, and remove anything above that cut-off value.
quantile(q3_brfss2013Filtered$weekyMinExercise,.99)
## 99%
## 2800
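For comparison, here is roughly where the 3-standard-deviation cut-off rejected above would land on the still-unfiltered values (a sketch; the right skew inflates the standard deviation, which is part of why that rule fits poorly here):
# mean + 3 SD cut-off, shown only for comparison with the 99th percentile
mean(q3_brfss2013Filtered$weekyMinExercise) + 3 * sd(q3_brfss2013Filtered$weekyMinExercise)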
# keep only rows with values above 0 and below the 99th percentile
q3_brfss2013Filtered <- q3_brfss2013Filtered %>% filter(weekyMinExercise > 0 & weekyMinExercise < quantile(q3_brfss2013Filtered$weekyMinExercise,.99))
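As an aside, with the weekly minutes computed we can sketch the share of this subsample that meets the 150-minutes-per-week recommendation cited earlier (keeping in mind that this subsample only contains people who reported exercising at all):
# proportion of the exercising subsample at or above 150 minutes per week
mean(q3_brfss2013Filtered$weekyMinExercise >= 150)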
# reorder factors so "Excellent" is shown on the top of the histogram, instead of the bottom
q3_brfss2013Filtered$genhlth <- factor(q3_brfss2013Filtered$genhlth,levels = c('Poor','Fair','Good','Very good','Excellent'))
To plot two ggplot graphs on the same page, I used the gridExtra package.
Two plots were created. On the left-hand side is a histogram showing the count of how many minutes people exercise in a week, segmented by GHS level. The distribution is right-skewed: the count falls off slowly as the number of minutes grows, with people exercising anywhere from 0 minutes to more than 10 hours a week.
Because the left plot cannot show the percentage of each GHS category within each bin of the histogram, I made a slight change to the ggplot command for the second plot, telling it to "fill" the bars, so it shows exactly what percentage of each bin belongs to each GHS category. A disadvantage of this second plot is that we do not know the absolute number of samples in each bin, so the best interpretation comes from using both plots together.
#install.packages('gridExtra')
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# create histogram 1
h1 <- ggplot(data=q3_brfss2013Filtered, aes(x=weekyMinExercise,fill=genhlth)) +
geom_histogram(binwidth = 60) +
labs(title='Minutes of exercise per week') +
labs(x='Minutes', y='Count')
# create filled histogram 2, showing the proportion of each GHS level per bin
h2 <- ggplot(data=q3_brfss2013Filtered, aes(x=weekyMinExercise,fill=genhlth)) +
geom_histogram(binwidth = 60,position='fill') +
labs(title='Minutes of exercise per week') +
labs(x='Minutes', y='Proportion')
grid.arrange(h1, h2, ncol=2)
For this sample, we can see a visible increase in the percentage of "Excellent" GHS among people who exercise up to around 500 minutes per week. Since most of this sample exercised up to 500 minutes, as we can see on the left-hand plot, exercise might really have an impact on the overall GHS. Beyond that point, the counts decrease roughly exponentially, and the percentage of each GHS level within each bin varies a lot. We see the largest share of "Excellent" GHS around 1800 minutes, but those bins also show the largest "Poor" share. That could indicate something, or it could just be sampling variation.
Thanks =]