DATA 606 Data Project Proposal

Data Preparation

library(RSocrata)
library(tidyverse)
library(dplyr)
library(infer)
library(ggpubr)

library(ggplot2)
data <- read.socrata("https://chronicdata.cdc.gov/OData.svc/i8ja-z54a")

# The data is a little cluttered, so I will tidy it.
# Remove unnecessary columns from the data set.

data<-data%>%select(-c("Class","Data_value_unit","Data_value_type","DataSource","Data_Value_Footnote_Symbol","Data_Value_Footnote","QuestionID","ResponseID"))

# Then split the data into two data frames: one for mammograms and one for Pap tests.

mammograms<-data%>%filter(Topic=="Mammogram")

papTests<-data%>%filter(Topic=="Pap Test")

# Keep only the rows broken out by education level
edu<-c("EDUCA1","EDUCA2","EDUCA3","EDUCA4")

mammograms<-mammograms%>%filter(BreakoutID %in% edu)
papTests<-papTests%>%filter(BreakoutID %in% edu)
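
Because the filter above works on BreakoutID codes while later steps filter on Break_Out labels, it may help to confirm how the two line up. The following is a minimal sketch on a hypothetical toy data frame: the pairing of IDs with labels, and the "AGE01" row, are assumptions for illustration and not values taken from the BRFSS table.

```r
library(dplyr)

# Hypothetical rows. The ID/label pairing is assumed for illustration only;
# the real mapping should be read from the downloaded data.
toy <- data.frame(
  BreakoutID = c("EDUCA1", "EDUCA1", "EDUCA4", "AGE01"),
  Break_Out  = c("Less than H.S.", "Less than H.S.", "College graduate", "18-24"),
  stringsAsFactors = FALSE
)

edu <- c("EDUCA1", "EDUCA2", "EDUCA3", "EDUCA4")

# One row per ID/label pair among the education breakouts, with row counts.
edu_labels <- toy %>%
  filter(BreakoutID %in% edu) %>%
  count(BreakoutID, Break_Out)

edu_labels
```

Running the same `count(BreakoutID, Break_Out)` on mammograms or papTests would list the labels actually present after filtering.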

Research question

Are women with higher levels of education more likely to request services at the OB-GYN, such as mammograms and Pap tests?

Cases

Each case summarizes the survey responses of a group of women across the United States, grouped by household income, age, and race. There are 39,818 cases in the data set.

Data collection

The data was collected by the Behavioral Risk Factor Surveillance System (BRFSS). It is a collection of surveys given to randomly selected women from each US state.

Type of study

This is an observational study.

Data Source

The data is sourced from the CDC’s chronic health website, under “BRFSS: Table of Women’s Health”¹. For this project, RSocrata was used to retrieve the data set.

Dependent Variable

The response variable is Response, recorded separately for mammograms and Pap tests; it is qualitative.

Independent Variable

The independent variables are sample size and education; sample size is quantitative and education is qualitative.

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plots, boxplots, etc.). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

m_overall<-mammograms%>%group_by(Response)%>%summarise(Total=sum(Sample_Size))
p_overall<-papTests%>%group_by(Response)%>%summarise(Total=sum(Sample_Size))


p1<-ggplot(m_overall, aes(x=Response, y=Total))+ geom_col()+theme_classic()+labs(title = "Mammogram responses, totaled by sample size")
p2<-ggplot(p_overall, aes(x=Response, y=Total))+ geom_col()+theme_classic()+labs(title = "Pap test responses, totaled by sample size")
ggarrange(p1,p2,nrow = 2)
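
The totals above pool all education levels, while the research question compares them. A minimal sketch of one way to break the "Yes" share out by education follows. It assumes Sample_Size counts the respondents behind each row, as the totals above treat it, and it runs on a hypothetical toy data frame whose numbers are made up for illustration and are not BRFSS values.

```r
library(dplyr)

# Hypothetical toy rows mimicking the columns used above (Break_Out, Response,
# Sample_Size). The numbers are invented for illustration, not BRFSS values.
toy <- data.frame(
  Break_Out   = c("Less than H.S.", "Less than H.S.",
                  "College graduate", "College graduate"),
  Response    = c("Yes", "No", "Yes", "No"),
  Sample_Size = c(60, 40, 90, 10),
  stringsAsFactors = FALSE
)

# Total respondents per education level and response, then the share of
# each response within its education level; keep only the "Yes" rows.
yes_share <- toy %>%
  group_by(Break_Out, Response) %>%
  summarise(Total = sum(Sample_Size), .groups = "drop") %>%
  group_by(Break_Out) %>%
  mutate(Share = Total / sum(Total)) %>%
  ungroup() %>%
  filter(Response == "Yes")

yes_share
```

Replacing `toy` with mammograms or papTests would give the corresponding shares for the real data, provided Response holds only the values being compared.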

m_CG<-mammograms%>%filter(Break_Out=="College graduate")
m_DH<-mammograms%>%filter(Break_Out=="Less than H.S.")

m_CG<-m_CG%>% mutate(mam_yes = ifelse(Response == "Yes", "yes", "no"))
high_edu<-ggplot(m_CG, aes(x=Sample_Size, y=mam_yes)) +geom_boxplot()+theme_classic()+labs(title = "Recent mammogram sample size: college graduates")

m_DH<-m_DH%>% mutate(mam_yes = ifelse(Response == "Yes", "yes", "no"))
no_edu<-ggplot(m_DH, aes(x=Sample_Size, y=mam_yes)) +geom_boxplot()+theme_classic2()+labs(title = "Recent mammogram sample size: no high school degree")
ggarrange(no_edu,high_edu,nrow = 2)

null_mam <- m_CG %>%
  specify(Sample_Size ~ mam_yes) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

# Set the number of bins explicitly rather than relying on the default
ggplot(data = null_mam, aes(x = stat)) +geom_histogram(bins = 30)
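
A null distribution such as null_mam is usually read against the observed statistic. The sketch below shows how that comparison might look with infer's `get_p_value()`. It uses a hypothetical stand-in for m_CG (`toy_CG`, with made-up values that are not BRFSS values) and an arbitrary seed, so the result says nothing about the real data.

```r
library(dplyr)
library(infer)

set.seed(606)  # arbitrary seed so the permutations are reproducible

# Hypothetical stand-in for m_CG. Values are invented for illustration.
toy_CG <- data.frame(
  Sample_Size = c(120, 150, 135, 160, 142, 30, 25, 41, 38, 22),
  mam_yes     = factor(rep(c("yes", "no"), each = 5))
)

# Observed difference in mean Sample_Size between "yes" and "no" rows.
obs_diff <- toy_CG %>%
  specify(Sample_Size ~ mam_yes) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

# Null distribution, built the same way as null_mam above.
null_toy <- toy_CG %>%
  specify(Sample_Size ~ mam_yes) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

# Two-sided p-value: share of permuted statistics at least as extreme
# as the observed difference.
p_val <- null_toy %>%
  get_p_value(obs_stat = obs_diff, direction = "two-sided")

obs_diff
p_val
```

With the real data, the same two steps would use m_CG in place of `toy_CG` and null_mam in place of `null_toy`.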

¹ “BRFSS: Table of Women’s Health.” Centers for Disease Control and Prevention, https://chronicdata.cdc.gov/Behavioral-Risk-Factors/BRFSS-Table-of-Women-s-Health/i8ja-z54a