Exploring the BRFSS data

Setup

Load packages

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.4.4

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.4.4

Load data

[Data link](https://d3c33hcgiwev3.cloudfront.net/_384b2d9eda4b29131fb681b243a7767d_brfss2013.RData?Expires=1531612800&Signature=QQegUiwVmyN4lPTRfiMzgYN-3VGqlpfnPFcEDYZk68pJAeZ2ia-bkeLp5r-SWhLZo1hTgoB7mfki0UC~-cYROVt1M2pgXetQMIO8S3Wu100O~9Nn7FbvuokkCZaCQ30s3LCOqBKgOKxPwQGe3nFYqF9i2rKnArBxxzpSl~y6Gfw_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A)

[Code book](https://d3c33hcgiwev3.cloudfront.net/_e34476fda339107329fc316d1f98e042_brfss_codebook.html?Expires=1531612800&Signature=ViANU04dRrf2KV9coYRjJRBWG~x4xlkWFD-lnVcwfyXQV4Jeg6iFyqGsKTAm2EgoYHbgyeE2pAasHzrDaRpB9lvI1hD6d7uH72HLHCmiD1iVQ-rjSfYsOfjb4sw-VJjt8INnaHdt99j97X1oaASvEOBKTzvykngV5cvoVxV83Y8_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A)

load("brfss2013.RData")

Part 1: Data

I think the results from the data cannot be fully generalised. The interviewers selected people at random, but the choice to participate was left to each individual, so the people who answered may differ from those who declined. In addition, the data were collected only from adults who reside in a private residence or college housing.

We also cannot say that one variable definitely causes another. The data are observational, with no random assignment of respondents to conditions, so the plots can show association but not causation.

Part 2: Research questions

Research question 1: Is the number of hours slept related to the number of days of poor mental health? Sleep is closely tied to the health of the brain, so we want to know whether there is any relationship between sleep time and mental health.

Research question 2: Do people who exercise more sleep more? A person who exercises more may get more tired and therefore sleep longer, or may be health conscious and make sure to sleep for a good amount of time.

Research question 3: Which states have the healthiest people? This is a basic question that will help in ranking the states by health. We will calculate the percentage of people who report “Excellent” health in every state and rank the states by plotting.

Part 3: Exploratory data analysis

Research question 1:

# Select the two columns "menthlth" and "sleptim1", group by sleptim1,
# and take the mean of menthlth for each reported number of hours slept
data1 <- brfss2013 %>%
  select(menthlth, sleptim1) %>%
  group_by(sleptim1) %>%
  summarize(menthlth = mean(menthlth, na.rm = TRUE))
# Remove the group where sleptim1 is NA
data1 <- data1[!is.na(data1$sleptim1), ]
# Drop rows 1, 26 and 27, which were treated as potential outliers
data1 <- data1[-c(26, 27, 1), ]
data1

## # A tibble: 24 x 2
##    sleptim1 menthlth
##       <int>    <dbl>
##  1        1    11.7 
##  2        2    13.3 
##  3        3    12.3 
##  4        4     9.77
##  5        5     6.28
##  6        6     3.93
##  7        7     2.21
##  8        8     2.22
##  9        9     2.74
## 10       10     4.69
## # ... with 14 more rows

ggplot(data1, aes(x = sleptim1, y = menthlth)) +
  geom_point() +
  geom_line() +
  geom_smooth(method = lm)

We use two variables to examine the relationship between sleep time and mental health: “sleptim1” gives the hours of sleep, and “menthlth” gives the “Number Of Days Mental Health Not Good”. We summarised the data as the mean of menthlth for each value of sleptim1 and then deleted some potential outliers.
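Dropping rows by position with `-c(26, 27, 1)` depends on the row order of the summary table. A minimal sketch of a value-based alternative follows. It assumes the rows treated as outliers are those whose `sleptim1` falls outside 1 to 24 hours, and it uses a small invented data frame (`toy`) in place of `brfss2013`.

```r
library(dplyr)

# Hypothetical stand-in for brfss2013; all values are invented for illustration
toy <- data.frame(
  sleptim1 = c(0, 1, 7, 7, 8, 24, 103, 450, NA),
  menthlth = c(30, 12, 2, 4, 3, 10, 0, 5, 1)
)

data1_alt <- toy %>%
  # keep only non-missing sleep times in the assumed plausible range
  filter(!is.na(sleptim1), sleptim1 >= 1, sleptim1 <= 24) %>%
  group_by(sleptim1) %>%
  summarize(menthlth = mean(menthlth, na.rm = TRUE))

data1_alt
```

Filtering on the value keeps the intent visible in the code and still works if the set of reported sleep times changes.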

From the plot, people who sleep between 5 and 10 hours report the fewest days of poor mental health on average (the table shows the minimum at 7 to 8 hours), and the number is higher for both shorter and longer sleep. The two variables therefore appear to be related, with the best values in the 5 to 10 hour range.
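Because the pattern falls and then rises again, the straight line that `geom_smooth` adds to the plot may describe it poorly. The sketch below uses only the ten rows printed in the output above (not the full 24-row table) to locate the minimum and to compare a straight-line fit with a quadratic one. The choice of a quadratic is an illustrative assumption, not part of the original analysis.

```r
# The ten rows printed in the output above (sleptim1 = 1 to 10)
shown <- data.frame(
  sleptim1 = 1:10,
  menthlth = c(11.7, 13.3, 12.3, 9.77, 6.28, 3.93, 2.21, 2.22, 2.74, 4.69)
)

# Hours of sleep with the lowest mean number of bad mental-health days
best_hour <- shown$sleptim1[which.min(shown$menthlth)]
best_hour

# Compare a straight line with a quadratic curve on these ten points
fit_line <- lm(menthlth ~ sleptim1, data = shown)
fit_quad <- lm(menthlth ~ poly(sleptim1, 2), data = shown)
r2 <- c(linear    = summary(fit_line)$r.squared,
        quadratic = summary(fit_quad)$r.squared)
r2
```

In the plot itself, a flexible smoother such as `geom_smooth(method = "loess")` would be one way to follow the curve instead of forcing a straight line.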

Research question 2:

data2 <- select(brfss2013, sleptim1, exerhmm1)
# Remove rows with NA in either variable
data2 <- data2[!is.na(data2$sleptim1), ]
data2 <- data2[!is.na(data2$exerhmm1), ]
# Group by sleep time and take the mean exercise time for each number of hours slept
data2 <- group_by(data2, sleptim1) %>% summarize(exerhmm1 = mean(exerhmm1))
g <- ggplot(data2, aes(y = exerhmm1, x = sleptim1))
g + geom_point() + geom_smooth(method = lm)

Here, we have selected two variables: exerhmm1, which records “Minutes Or Hours Walking, Running, Jogging, Or Swimming”, and sleptim1, which records hours of sleep. The aim is to see whether there is any relationship between these two variables.
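One caution about averaging `exerhmm1` directly: its name and label suggest the values may be stored as combined hours and minutes (for example, 130 meaning 1 hour 30 minutes) rather than as plain minutes. That coding is an assumption here and should be checked against the code book. If it holds, a conversion to minutes before taking the mean could look like the sketch below, where `hhmm_to_minutes` is a hypothetical helper.

```r
# Hypothetical helper: assumes values are coded HHMM (e.g. 130 = 1 h 30 min)
hhmm_to_minutes <- function(x) {
  hours   <- x %/% 100   # leading digits
  minutes <- x %% 100    # last two digits
  hours * 60 + minutes
}

converted <- hhmm_to_minutes(c(30, 100, 130, 215))
converted
```

Under that assumption, the mean of the converted values is a mean in minutes, whereas the mean of the raw codes mixes the two units.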

According to the plot, exercise time decreases as sleep time increases. The data cannot tell us why, but possible explanations include the following:

A person who gives more time to sleep may have less time for exercise: a day has only 24 hours, so more hours spent on one activity leave fewer for the other.
A person who sleeps more may be less active in general and so may not want to exercise much.
A person who exercises more may be very active throughout the day and so may not want to sleep as long.
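The group means in this plot give every sleep duration equal weight, even though extreme durations may be reported by very few respondents. The following sketch, on invented data, keeps the group size next to each mean so that sparse groups can be identified. The data frame `toy` and the column name `n` are illustrative assumptions.

```r
library(dplyr)

# Invented data: most respondents sleep 7 or 8 hours, one reports 20
toy <- data.frame(
  sleptim1 = c(7, 7, 7, 8, 8, 8, 8, 20),
  exerhmm1 = c(30, 45, 60, 20, 40, 60, 80, 5)
)

data2_alt <- toy %>%
  group_by(sleptim1) %>%
  summarize(exerhmm1 = mean(exerhmm1),  # mean exercise time per sleep duration
            n = n())                    # number of respondents behind each mean

data2_alt
```

With `n` available, `geom_point(aes(size = n))` would show in the plot which means rest on only a handful of respondents.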

Research question 3:

# Select the required columns
data3 <- select(brfss2013, X_state, genhlth)
# Group by both state and health status
data3 <- group_by(data3, X_state, genhlth)
# Summarize to get the number of respondents in every group
data3 <- summarize(data3, count = n())
head(data3)

## # A tibble: 6 x 3
## # Groups:   X_state [2]
##   X_state genhlth   count
##   <fct>   <fct>     <int>
## 1 0       <NA>          1
## 2 Alabama Excellent   843
## 3 Alabama Very good  1645
## 4 Alabama Good       2161
## 5 Alabama Fair       1218
## 6 Alabama Poor        610

# Remove rows where genhlth is NA
data3 <- data3[!is.na(data3$genhlth), ]
# Add a new variable holding the total number of respondents in each state
data3 <- mutate(data3, sum1 = sum(count))
head(data3)

## # A tibble: 6 x 4
## # Groups:   X_state [2]
##   X_state genhlth   count  sum1
##   <fct>   <fct>     <int> <int>
## 1 Alabama Excellent   843  6477
## 2 Alabama Very good  1645  6477
## 3 Alabama Good       2161  6477
## 4 Alabama Fair       1218  6477
## 5 Alabama Poor        610  6477
## 6 Alaska  Excellent   874  4561

# Add a new variable holding the share of each health status within its state
data3 <- mutate(data3, percent = count / sum1)

## Warning: package 'bindrcpp' was built under R version 3.4.4

head(data3)

## # A tibble: 6 x 5
## # Groups:   X_state [2]
##   X_state genhlth   count  sum1 percent
##   <fct>   <fct>     <int> <int>   <dbl>
## 1 Alabama Excellent   843  6477  0.130 
## 2 Alabama Very good  1645  6477  0.254 
## 3 Alabama Good       2161  6477  0.334 
## 4 Alabama Fair       1218  6477  0.188 
## 5 Alabama Poor        610  6477  0.0942
## 6 Alaska  Excellent   874  4561  0.192

# Select the required columns from data3
data7 <- select(data3, X_state, genhlth, percent)
head(data7)

## # A tibble: 6 x 3
## # Groups:   X_state [2]
##   X_state genhlth   percent
##   <fct>   <fct>       <dbl>
## 1 Alabama Excellent  0.130 
## 2 Alabama Very good  0.254 
## 3 Alabama Good       0.334 
## 4 Alabama Fair       0.188 
## 5 Alabama Poor       0.0942
## 6 Alaska  Excellent  0.192

ggplot(data7, aes(x = X_state, y = percent)) +
  geom_point(aes(color = genhlth)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Here, we want to know the health status state by state. We took two variables: “X_state”, the name of the participating state, and “genhlth”, the general health status of a person. We grouped the data by both variables and counted the respondents in each group. We then summed the counts within each state to get the state total, and divided each count by that total to get the share of respondents in each health category. This share is stored in a new column, “percent”, as a proportion between 0 and 1.
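The total and the share can also be computed in one grouped step. The sketch below uses the Alabama counts printed in the output above and divides each count by the sum within its state. The object names `alabama` and `shares` are illustrative.

```r
library(dplyr)

# Counts for Alabama as printed in the output above
alabama <- data.frame(
  X_state = "Alabama",
  genhlth = c("Excellent", "Very good", "Good", "Fair", "Poor"),
  count   = c(843, 1645, 2161, 1218, 610)
)

shares <- alabama %>%
  group_by(X_state) %>%
  mutate(percent = count / sum(count)) %>%  # share within each state
  ungroup()

shares
```

Multiplying `percent` by 100 would turn the proportions into percentages in the usual sense.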

From the plot, we can see that the “District of Columbia” has the greatest percentage of people in “Excellent” health and “Tennessee” has the highest percentage in “Poor” health. The percentages for “Good” and “Very good” are higher than those for “Fair”, “Poor”, and “Excellent”, so on average people report good health.
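Reading a ranking off an alphabetically ordered axis is difficult, so sorting by the “Excellent” share makes the comparison explicit. The sketch below uses only the two “Excellent” rows visible in the output above (Alabama and Alaska). The same steps would apply to the full table after filtering to `genhlth == "Excellent"`.

```r
library(dplyr)

# "Excellent" rows for the two states visible in the output above
excellent <- data.frame(
  X_state = c("Alabama", "Alaska"),
  count   = c(843, 874),
  sum1    = c(6477, 4561)
)

ranking <- excellent %>%
  mutate(percent = count / sum1) %>%
  arrange(desc(percent))   # highest share of "Excellent" first

ranking
```

In a plot, mapping `x = reorder(X_state, percent)` would order the states along the axis by that share instead of alphabetically.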