library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
load("BRFSS 2013.RData")
Before we start analyzing the data, we must filter out invalid observations that could hamper the analysis. Since we will split the data by state, I extracted all the unique values this variable can assume and noticed two levels, X_state = 0 and X_state = 80, that are not valid states, so I excluded the corresponding rows.
unique(brfss2013$X_state);
## [1] Alabama Alaska Arizona
## [4] Arkansas California Colorado
## [7] Connecticut Delaware District of Columbia
## [10] Florida 80 Georgia
## [13] Hawaii Idaho Illinois
## [16] Indiana Iowa Kansas
## [19] Kentucky Louisiana Maine
## [22] Maryland Massachusetts Michigan
## [25] 0 Minnesota Mississippi
## [28] Missouri Montana Nebraska
## [31] Nevada New Hampshire New Jersey
## [34] New Mexico New York North Carolina
## [37] North Dakota Ohio Oklahoma
## [40] Oregon Pennsylvania Rhode Island
## [43] South Carolina South Dakota Tennessee
## [46] Texas Utah Vermont
## [49] Virginia Washington West Virginia
## [52] Wisconsin Wyoming Guam
## [55] Puerto Rico
## 55 Levels: 0 Alabama Alaska Arizona Arkansas California ... 80
# Remove invalid data
brfss2013 <- brfss2013[-which(brfss2013$X_state == 80 | brfss2013$X_state == 0),]
# remove rows where genhlth is NA
brfss2013 <- brfss2013[-which(is.na(brfss2013$genhlth)),]
# drop the now-unused factor levels of X_state
brfss2013$X_state <- factor(brfss2013$X_state);
# save this dataset to use in question 3
q3_brfss2013 <- brfss2013;
Since the number of interviewees varies greatly between states, before graphing our variables I found the state with the fewest interviewees and down-sampled every other state to that same number. In this manner, each state contributes roughly the same number of observations.
###########################################################################
# as we are using a random number generator, set a seed so you can reproduce the same sample_n results
set.seed(123);
# start sampling each state; but how many rows should we sample from each one?
tableStats <- table(brfss2013$X_state);
minSample <- min(tableStats) # get the smallest state sample size (1896)
# out of curiosity, which state had the fewest interviewees? (Guam)
tableStats[which(tableStats == min(tableStats))]
## Guam
## 1896
# sample minSample samples from each state
brfss2013 <- brfss2013 %>% group_by(X_state) %>% sample_n(minSample);
As the project description explains, the data was collected from all 50 US states plus a few additional territories, providing a solid basis for evidence about the American population aged 18 and older. Moreover, because the interviewees were randomly sampled from the entire US population, the findings can be generalized to that population. However, since there was no random assignment, it is not possible to extract causal relations from the data. In short, the answers to the questions raised in this report are generalizable, but not causal.
An additional caveat is that, since the data was collected first from landline telephones and only later from cellphones, there may be a dependence between variables: people with the financial means to own a telephone or cellphone may also have easier access to educational resources about general public health.
The main characteristic of this survey is its random sample of people from all 50 US states. Starting from this point, we can try to find which states have a better general health status (GHS) and analyze a few variables that might influence the GHS.
Research question 1: Which states have the best general health status?
Research question 2: How is the sex of the interviewee related to the GHS?
Research question 3: Is the number of minutes dedicated to physical activity related to the GHS?
Research question 1:
First, consider the levels that the GHS variable (genhlth) can assume:
levels(brfss2013$genhlth)
## [1] "Excellent" "Very good" "Good" "Fair" "Poor"
Let's give a score to each GHS level: Excellent = 5, Very good = 4, Good = 3, Fair = 2, Poor = 1.
For the first question, we will create a summary statistic named mean_GHS, calculated as the sum of the GHS scores (from the genhlth variable) in each state divided by the maximum possible score (number of samples * 5). After this we can plot a bar chart showing the 5 states with the highest mean_GHS. If we convert the genhlth factor directly to numeric, we get the opposite of the values defined above, so a small manipulation is needed to obtain our scores.
head((brfss2013$genhlth))
## [1] Fair Fair Good Excellent Good Excellent
## Levels: Excellent Very good Good Fair Poor
head(as.numeric(brfss2013$genhlth)) # the factor coding is the opposite of what we want (Excellent = 1, Very good = 2, etc.)
## [1] 4 4 3 1 3 1
# add a new column with our desired numeric values
brfss2013 <- mutate(brfss2013, numeric_GHS = (5 - as.numeric(genhlth)) + 1)
head(brfss2013[,c('genhlth','numeric_GHS')]) # now we have the desired conversion
## Source: local data frame [6 x 2]
##
## genhlth numeric_GHS
## (fctr) (dbl)
## 1 Fair 2
## 2 Fair 2
## 3 Good 3
## 4 Excellent 5
## 5 Good 3
## 6 Excellent 5
# extract mean_GHS from each state
statesAndMean_GHS <- brfss2013 %>% group_by(X_state) %>% summarise(mean_GHS = sum(numeric_GHS)/(n() * 5));
# order by mean_GHS (descending) and print the 10 highest scores
statesAndMean_GHS <- statesAndMean_GHS[order(statesAndMean_GHS$mean_GHS,decreasing = TRUE),]
statesAndMean_GHS[1:10,]
## Source: local data frame [10 x 2]
##
## X_state mean_GHS
## (fctr) (dbl)
## 1 Minnesota 0.7320675
## 2 Vermont 0.7283755
## 3 District of Columbia 0.7272152
## 4 Colorado 0.7263713
## 5 New Hampshire 0.7261603
## 6 Connecticut 0.7199367
## 7 Utah 0.7133966
## 8 Alaska 0.7086498
## 9 Idaho 0.7036920
## 10 Hawaii 0.7029536
As we can see, Minnesota has the highest mean_GHS score, followed by Vermont, the District of Columbia, and others.
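The bar chart mentioned earlier can be drawn directly from this table. Here is a minimal sketch (assuming ggplot2 is loaded; top5_states is a name introduced here just for illustration):
# plot the 5 states with the highest mean_GHS (illustrative sketch)
top5_states <- statesAndMean_GHS[1:5,]
ggplot(top5_states, aes(x = reorder(X_state, -mean_GHS), y = mean_GHS)) +
geom_bar(stat = "identity") +
ggtitle("Top 5 states by mean_GHS") +
labs(x = "State", y = "mean_GHS")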
Research question 2:
For question 2, we ask whether sex makes a difference in GHS among people within the same state. For this, we selected a random sample of 5 states along with their mean_GHS values.
# choose 5 random states
randomStates <- as.character(sample(unique(brfss2013$X_state),size = 5));
randomStates
## [1] "New Mexico" "Idaho" "Wisconsin" "Connecticut" "Washington"
Next, we do the following steps:
# 1 - filter data from these 5 states
randomStatesData <- filter(brfss2013, X_state %in% randomStates);
# 2 - update factor levels
randomStatesData$X_state <- factor(randomStatesData$X_state)
# 3 - calculate the mean_GHS for each state, grouped by state and then sex
randomStatesMean_GHS <- randomStatesData %>% group_by(X_state,sex) %>% summarise(mean_GHS = sum(numeric_GHS)/(n() * 5));
# 4 - plot mean_GHS scores for each state, grouped by sex, and make the title bold
ggplot(randomStatesMean_GHS, aes(X_state, mean_GHS, fill = sex)) +
geom_bar(stat="identity", position = "dodge") +
ggtitle("mean_GHS Scores for Each City Grouped by Sex\n") +
theme(plot.title = element_text(face = "bold"))
From the plot above, we can see that the selected states all have roughly the same overall mean_GHS value, with Connecticut being the best state for both men and women, and New Mexico the worst, also for both sexes. Without a more rigorous statistical treatment, we cannot know for sure whether any of these states has a significantly higher or lower mean_GHS, nor whether there is any difference in mean_GHS between men and women. But in an exploratory sense, I would not say there is a difference in the mean_GHS score between the sexes.
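As an additional quick exploratory check (a sketch, not a formal test), we could also pool the five states together and compare the overall mean_GHS by sex alone:
# overall mean_GHS by sex, pooling the 5 sampled states (exploratory sketch)
randomStatesData %>%
group_by(sex) %>%
summarise(mean_GHS = sum(numeric_GHS)/(n() * 5))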
Research question 3:
Is the number of minutes dedicated to physical activity related to the GHS?
An important factor that can contribute to the overall GHS is the amount of exercise people do. According to the American Heart Association, it is recommended that people do at least 150 minutes of exercise per week.
In the codebook for the brfss2013, the variable “exeroft1” is defined as “How many times per week or per month did you take part in this activity during the past month?”, and the column “exerhmm1” as “And when you took part in this activity (if the person does an activity), for how many minutes or hours did you usually keep at it?”
To estimate the number of minutes each person exercises per week, I created a function to calculate it. If you look at the codebook, you will see how exeroft1 is encoded. The function estimates how many times per week each person exercised, whether the original value is coded per week or per month, and multiplies that by the number of minutes spent in each session, as given by the exerhmm1 value.
# function to calculate the amount of minutes of exercises per week
calcMinExercise <- function(dataSet){
# value starting with 1xx = amount of exercises by week
# value starting with 2xx = amount of exercises by month
if(dataSet['exeroft1'] > 200){ # if estimate is monthly
# estimate amount of weekly exercise dividing it by 4 weeks
nExercWeekly <- (dataSet['exeroft1'] - 200) / 4;
}else{
# just remove the 1xx header indicating it is weekly
nExercWeekly <- dataSet['exeroft1'] - 100;
}
# amount of exercise minutes per week
amountOfMinutesExercisesWeekly <- dataSet['exerhmm1'] * nExercWeekly;
as.numeric(amountOfMinutesExercisesWeekly);
}
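To illustrate how the function behaves, here are two made-up example inputs, passed as named vectors just as apply() will pass data frame rows below: 205 encodes 5 sessions per month (1.25 per week), and 103 encodes 3 sessions per week:
# hypothetical inputs, shown only to illustrate the two encodings
calcMinExercise(c(exeroft1 = 205, exerhmm1 = 30)) # (205-200)/4 * 30 = 37.5 min/week
calcMinExercise(c(exeroft1 = 103, exerhmm1 = 60)) # (103-100) * 60 = 180 min/week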
To perform this calculation, we must remove the rows that have NAs in either of the columns used by the function, i.e. the exeroft1 and exerhmm1 variables. We will manipulate the q3_brfss2013 dataset saved at the beginning of the script, which preserves the state of brfss2013 after preprocessing but before the manipulations done for questions 1 and 2.
# remove rows with NAs in the exerhmm1 or exeroft1 column
q3_brfss2013 <- q3_brfss2013[!is.na(q3_brfss2013$exerhmm1) & !is.na(q3_brfss2013$exeroft1),];
# calculate the number of exercise minutes per week for each row
weekyMinExercise <- apply(q3_brfss2013[,c('exeroft1','exerhmm1')],
1,
calcMinExercise);
# and append it to q3_brfss2013 dataset
q3_brfss2013 <- cbind(q3_brfss2013,weekyMinExercise);
q3_brfss2013Filtered <- q3_brfss2013[,c('weekyMinExercise', 'genhlth')];
rm(weekyMinExercise)
rm(q3_brfss2013)
We could now try to analyze whether there is any correlation between the number of minutes spent on physical activity and the GHS, but looking at the result of the previous calculation, we see that there are a few outliers.
summary(q3_brfss2013Filtered$weekyMinExercise)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -200.0 90.0 180.0 353.3 400.0 71920.0
As you can see, we have negative minimum values, as well as huge values for the number of minutes spent on physical activity in a week, e.g. 71920, which would be roughly 1199 hours of exercise per week, or about 171 hours per day (more hours than a day has!).
One approach we could take to remove the outliers is the 68-95-99.7 rule: the probability that a point lies more than 3 standard deviations from the mean is less than 1%. However, our data is not normal, as we can see from the difference between its mean and median. The value 71920 is clearly an outlier, but it may not be the only one. A way to verify this is to calculate the 99th percentile of the data, i.e. the cut-off point below which 99% of the data lies, and remove anything above that cut-off value.
quantile(q3_brfss2013Filtered$weekyMinExercise,.99)
## 99%
## 2800
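For comparison, here is roughly where the 3-standard-deviation cut-off rejected above would land on the still-unfiltered values (a sketch; the right skew inflates the standard deviation, which is part of why that rule fits poorly here):
# mean + 3 SD cut-off, shown only for comparison with the 99th percentile
mean(q3_brfss2013Filtered$weekyMinExercise) + 3 * sd(q3_brfss2013Filtered$weekyMinExercise)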
# keep only rows with values above 0 and below the 99th percentile
q3_brfss2013Filtered <- q3_brfss2013Filtered %>% filter(weekyMinExercise > 0 & weekyMinExercise < quantile(q3_brfss2013Filtered$weekyMinExercise,.99))
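As an aside, with the weekly minutes computed we can sketch the share of this subsample that meets the 150-minutes-per-week recommendation cited earlier (keeping in mind that this subsample only contains people who reported exercising at all):
# proportion of the exercising subsample at or above 150 minutes per week
mean(q3_brfss2013Filtered$weekyMinExercise >= 150)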
# reorder factors so "Excellent" is shown on the top of the histogram, instead of the bottom
q3_brfss2013Filtered$genhlth <- factor(q3_brfss2013Filtered$genhlth,levels = c('Poor','Fair','Good','Very good','Excellent'))
To plot two ggplot graphs on the same page, I used the gridExtra package.
Two plots were created. On the left-hand side is a histogram showing the count of how many minutes people exercise in a week, segmented by GHS level. The distribution is right-skewed: the count falls off slowly as the number of minutes grows, with people exercising anywhere from 0 minutes to more than 10 hours a week.
Because the left plot cannot show the percentage of each GHS category within each bin of the histogram, I made a slight change to the ggplot command for the second plot, telling it to "fill" the bars, so it shows exactly what percentage of each bin belongs to each GHS category. A disadvantage of this second plot is that we do not know the absolute number of samples in each bin, so the best interpretation comes from using both plots together.
#install.packages('gridExtra')
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# create histogram 1
h1 <- ggplot(data=q3_brfss2013Filtered, aes(x=weekyMinExercise,fill=genhlth)) +
geom_histogram(binwidth = 60) +
labs(title='Minutes of exercise per week') +
labs(x='Minutes', y='Count')
# create filled histogram 2, showing the proportion of each GHS level per bin
h2 <- ggplot(data=q3_brfss2013Filtered, aes(x=weekyMinExercise,fill=genhlth)) +
geom_histogram(binwidth = 60,position='fill') +
labs(title='Minutes of exercise per week') +
labs(x='Minutes', y='Proportion')
grid.arrange(h1, h2, ncol=2)
For this sample, we can see a visible increase in the percentage of "Excellent" GHS among people who exercise up to around 500 minutes per week. Since most of this sample exercised up to 500 minutes, as we can see on the left-hand plot, exercise might really have an impact on the overall GHS. Beyond that point, the counts decrease roughly exponentially, and the percentage of each GHS level within each bin varies a lot. We see the largest share of "Excellent" GHS around 1800 minutes, but those bins also show the largest "Poor" share. That could indicate something, or it could just be sampling variation.
Thanks =]