This sample of data was collected with a questionnaire designed by the BRFSS. The BRFSS questionnaire consists of a core component and optional modules. Many questions are taken from established national surveys, such as the National Health Interview Survey or the National Health and Nutrition Examination Survey.
In a telephone survey such as the BRFSS, a sample record is one telephone number in the list of all telephone numbers the system randomly selects for dialing.
BRFSS divides telephone numbers into two groups, or strata, which are sampled separately.
The target population for cellular telephone samples in 2013 consists of persons residing in a private residence or college housing, who have a working cellular telephone, are aged 18 and older, and received 90 percent or more of their calls on cellular telephones.
I’d like to work in the socioeconomic context. Therefore we will use Module 19, which covers Social Context, and Section 8, which covers Demographics.
Research question 1: Do males make more money than females?
Research question 2: Which income range tends to drink more on average?
Research question 3: Overall, do females drink more alcohol than males?
Research question 1: Because the data set is huge and this is only an academic exercise, we will work with the State of California. Let’s save the male and female respondents in separate variables.
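library(dplyr)    # provides filter(), select(), summarise() and the %>% pipe
library(ggplot2)  # provides ggplot() for the plots further down
California <- filter(brfss2013, X_state == "California")
cali_males <- filter(California, sex == "Male")
cali_females <- filter(California, sex == "Female")
Let’s check the proportion of males and females in this subsample.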
## [1] 0.441917
## [1] 0.558083
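
The two values above sum to 1 and are presumably the male and female shares, in that order. A minimal sketch of one way to compute such shares, using a small invented data frame (`toy`) in place of `California`; the names `toy` and `sex_share` are illustrative only:

```r
# Toy stand-in for the California subsample (values invented for illustration).
toy <- data.frame(
  sex = factor(c("Male", "Female", "Female", "Male", "Female"),
               levels = c("Male", "Female"))
)

# table() counts respondents per sex; prop.table() turns counts into shares.
sex_share <- prop.table(table(toy$sex))
sex_share
```

On the real data the same idea would apply to `California$sex`, assuming that column has no missing values worth keeping track of.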
Let’s plot the income ranges in the male sample. Before that, we need to tidy the data a little: I already explored the dataset and found some missing values. Missing values are annoying because they complicate the EDA whenever we apply functions to the data.
My approach is to create, from the cali_males dataset, a new data frame that holds only the income2 variable with clean values, so I can keep working from there. table() drops NA values by default, so the resulting counts cover only respondents who reported an income range. We can use mutate to add more variables later.
income_df_males <- as.data.frame(table(cali_males$income2)) %>% rename(Range = Var1)
income_df_females <- as.data.frame(table(cali_females$income2)) %>% rename(Range = Var1)
Now let’s plot this.
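
A minimal sketch of one way to draw such a bar chart with ggplot2. The counts and the names `toy_income` and `p_income` are invented for illustration; only the column names `Range` and `Freq` follow the data frames built above.

```r
library(ggplot2)

# Invented counts in the shape of income_df_males (columns Range and Freq).
income_levels <- c("Less than $10,000", "Less than $50,000", "$75,000 or more")
toy_income <- data.frame(
  Range = factor(income_levels, levels = income_levels),  # keep income order
  Freq  = c(12, 30, 58)
)

# geom_col() takes bar heights directly from the Freq column.
p_income <- ggplot(toy_income, aes(x = Range, y = Freq)) +
  geom_col() +
  labs(x = "Income range", y = "Respondents")
p_income
```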
Now let’s check the females.
Now let’s draw a scatter plot to compare the two groups. Blue dots are males. Pink dots are females.
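
A minimal sketch of how such a comparison plot could be built with ggplot2. The counts, the `toy_*` names, and the choice to stack both tables with `rbind()` are assumptions for illustration, not the original plotting code.

```r
library(ggplot2)

# Invented counts in the shape of income_df_males / income_df_females.
ranges <- c("Less than $10,000", "Less than $50,000", "$75,000 or more")
toy_males   <- data.frame(Range = ranges, Freq = c(10, 25, 60), sex = "Male")
toy_females <- data.frame(Range = ranges, Freq = c(15, 35, 70), sex = "Female")

# Stack both tables so a single geom_point() layer can color by sex.
toy_both <- rbind(toy_males, toy_females)
toy_both$Range <- factor(toy_both$Range, levels = ranges)

p_compare <- ggplot(toy_both, aes(x = Range, y = Freq, color = sex)) +
  geom_point(size = 3) +
  scale_color_manual(values = c(Male = "blue", Female = "pink"))
p_compare
```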
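It seems females tend to do better in California. Note, however, that the subsample does not contain equal numbers of males and females, so raw counts per income range are not directly comparable.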
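
Because the two groups differ in size, comparing within-sex shares is one way to put them on the same scale. A minimal sketch with invented counts (the vectors `male_counts` and `female_counts` are hypothetical, not taken from the survey):

```r
# Invented counts per income range, one named vector per sex.
male_counts   <- c("Less than $50,000" = 40, "$75,000 or more" = 60)
female_counts <- c("Less than $50,000" = 90, "$75,000 or more" = 60)

# Convert counts to within-sex shares so group size no longer matters.
male_share   <- prop.table(male_counts)
female_share <- prop.table(female_counts)

# One row per sex, one column per income range.
rbind(Male = male_share, Female = female_share)
```

In this toy case both sexes have the same raw count in the top range, yet that range holds a larger share of the males than of the females.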
Research question 2: Moving on to question 2.
We have so many missing values that I propose using dplyr to build a small data frame and save it for plotting later. We will do this for both males and females.
# na.omit() takes no column arguments; it drops every row that contains an NA
drink_males_income <- cali_males %>% select(income2, avedrnk2) %>% na.omit() %>% group_by(income_cat = income2) %>% summarise(mean_drinks = mean(avedrnk2)) %>% arrange(desc(mean_drinks))
drink_males_income
## # A tibble: 8 x 2
##   income_cat        mean_drinks
##   <fct>                   <dbl>
## 1 Less than $35,000        3.14
## 2 Less than $25,000        3.12
## 3 Less than $20,000        3.09
## 4 Less than $10,000        2.92
## 5 Less than $50,000        2.86
## 6 Less than $15,000        2.82
## 7 Less than $75,000        2.72
## 8 $75,000 or more          2.14
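It seems that males who earn between $25,000 and $35,000 tend to drink the most, although the three highest categories are within 0.05 drinks of each other. Males earning $75,000 or more drink the least.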
Let’s plot this.
g_males <- ggplot(data=drink_males_income, aes(x=income_cat,y=mean_drinks, group="Males")) + geom_point() + geom_line()
g_males
Now let’s do it for females.
# na.omit() takes no column arguments; it drops every row that contains an NA
drink_female_income <- cali_females %>% select(income2, avedrnk2) %>% na.omit() %>% group_by(income_cat = income2) %>% summarise(mean_drinks = mean(avedrnk2)) %>% arrange(desc(mean_drinks))
drink_female_income
## # A tibble: 8 x 2
##   income_cat        mean_drinks
##   <fct>                   <dbl>
## 1 Less than $10,000        2.14
## 2 Less than $20,000        1.85
## 3 Less than $15,000        1.84
## 4 Less than $50,000        1.81
## 5 Less than $25,000        1.80
## 6 $75,000 or more          1.73
## 7 Less than $35,000        1.67
## 8 Less than $75,000        1.56
Females who make less than $10,000 tend to drink the most.
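
A group mean can rest on very few respondents, so it may help to report the group size next to it. A minimal sketch on invented data, assuming the same column names as above (`income2`, `avedrnk2`); the data frame `toy` and the result `toy_summary` are hypothetical:

```r
library(dplyr)

# Invented respondents; NA marks a missing answer, as in the survey data.
toy <- data.frame(
  income2  = c("Less than $10,000", "Less than $10,000", "$75,000 or more",
               "$75,000 or more", "$75,000 or more", NA),
  avedrnk2 = c(2, 4, 1, NA, 3, 2)
)

toy_summary <- toy %>%
  filter(!is.na(income2), !is.na(avedrnk2)) %>%          # keep complete rows only
  group_by(income_cat = income2) %>%
  summarise(mean_drinks = mean(avedrnk2), n = n()) %>%   # n = rows behind each mean
  arrange(desc(mean_drinks))
toy_summary
```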
Let’s plot this.
g_females<-ggplot(data=drink_female_income, aes(x=income_cat, y=mean_drinks, group="Females")) + geom_point() + geom_line()
g_females
Research question 3:
# na.omit() takes no column arguments; it drops every row that contains an NA
drinks_males_avg <- cali_males %>% select(avedrnk2) %>% na.omit() %>% summarise(mean_drinks = mean(avedrnk2), sd_drinks = sd(avedrnk2))
drinks_females_avg <- cali_females %>% select(avedrnk2) %>% na.omit() %>% summarise(mean_dd = mean(avedrnk2), sd_females = sd(avedrnk2))
df_drinks <- data.frame(gender = c("Female", "Male"), mean = c(drinks_females_avg$mean_dd, drinks_males_avg$mean_drinks))
df_drinks
##   gender     mean
## 1 Female 1.734375
## 2   Male 2.567489
We can see that, on average, men drink more than women (about 2.57 versus 1.73 drinks), so the answer to question 3 is no.
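
The same comparison can be produced in a single pipeline by grouping on sex instead of splitting the data first. A minimal sketch on invented data, assuming the column names `sex` and `avedrnk2`; the data frame `toy` and the result `toy_by_sex` are hypothetical:

```r
library(dplyr)

# Invented respondents; NA marks a missing answer.
toy <- data.frame(
  sex      = c("Male", "Male", "Female", "Female", "Female"),
  avedrnk2 = c(3, 2, 1, NA, 2)
)

toy_by_sex <- toy %>%
  filter(!is.na(avedrnk2)) %>%     # drop respondents without a drinks value
  group_by(sex) %>%
  summarise(mean_drinks = mean(avedrnk2), sd_drinks = sd(avedrnk2))
toy_by_sex
```

This keeps the mean and the standard deviation for both groups in one table with consistent column names.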