https://github.com/fivethirtyeight/data/tree/master/marriage
# load data
library(psych)
library(ggplot2)
library(plyr)
divorce_df <- read.table("https://raw.githubusercontent.com/fivethirtyeight/data/master/marriage/divorce.csv",
header = TRUE, sep = ",")
#We split the analysis of divorce rates into two age groups:
#1) For people aged 35 to 44
#2) For people aged 45 to 54
#Create one data frame per age group, keeping only the divorce rates by education level
Edu_divorce_rates_35_to_44 <- divorce_df[c('year','all_3544','HS_3544','SC_3544','BAp_3544','BAo_3544','GD_3544')]
Edu_divorce_rates_45_to_54 <- divorce_df[c('year','all_4554','HS_4554','SC_4554','BAp_4554','BAo_4554','GD_4554')]
#Rename columns to more descriptive names
Edu_divorce_rates_35_to_44 <- rename(Edu_divorce_rates_35_to_44, c("all_3544"="ALL", "HS_3544"="High_School", "SC_3544"="Some_college", "BAp_3544"="Bachelors_degree_more", "BAo_3544"="bachelors_degree_only", "GD_3544"="Graduate_degree"))
Edu_divorce_rates_45_to_54 <- rename(Edu_divorce_rates_45_to_54, c("all_4554"="ALL", "HS_4554"="High_School", "SC_4554"="Some_college", "BAp_4554"="Bachelors_degree_more", "BAo_4554"="bachelors_degree_only", "GD_4554"="Graduate_degree"))
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
Are divorce rates higher for couples with a high school diploma than for couples with a college degree?
Is educational level correlated with divorce rates for married couples?
What are the cases, and how many are there?
Each case is one year of divorce-rate data. There are 17 observations in the data set.
Describe the method of data collection.
The data come from the American Community Survey (years 2001-2012), via IPUMS USA.
What type of study is this (observational/experiment)?
This is an observational study.
If you collected the data, state self-collected. If not, provide a citation/link.
The data were collected by the American Community Survey via IPUMS USA and are available online here: https://github.com/fivethirtyeight/data/tree/master/marriage. For this project, the data were read into R with read.table().
Ben Casselman, FiveThirtyEight, (2014), GitHub repository, https://github.com/fivethirtyeight/data/tree/master/marriage
What is the response variable, and what type is it (numerical/categorical)?
The response variable is the divorce rate within each education level. It is a numerical variable.
What is the explanatory variable, and what type is it (numerical/categorical)?
The explanatory variable is educational level. It is a categorical variable.
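In the data frames above, each education level is stored as its own column, so educational level is not yet a single categorical variable. A minimal sketch of one way to stack those columns into a long format is given here. The helper name `to_long` and the toy data frame are hypothetical, and the toy values are placeholders rather than real divorce rates.

```r
# Toy stand-in for Edu_divorce_rates_35_to_44.
# The values are placeholders for illustration, NOT real divorce rates.
toy_wide <- data.frame(
  year = c(2001, 2002, 2003),
  High_School = c(0.10, 0.20, 0.30),
  bachelors_degree_only = c(0.05, 0.10, 0.15)
)

# Hypothetical helper: stack every column except id_col into one
# numeric column (divorce_rate) plus one factor column (education).
to_long <- function(wide, id_col = "year") {
  rate_cols <- setdiff(names(wide), id_col)
  long <- data.frame(
    id = rep(wide[[id_col]], times = length(rate_cols)),
    education = factor(rep(rate_cols, each = nrow(wide)), levels = rate_cols),
    divorce_rate = unlist(wide[rate_cols], use.names = FALSE)
  )
  names(long)[1] <- id_col
  long
}

toy_long <- to_long(toy_wide)
toy_long
```

Assuming the renaming step above has been run, the same call could be applied as `to_long(Edu_divorce_rates_35_to_44)`, which gives one row per year and education level.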
Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
summary(divorce_df$HS_3544)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03489 0.17254 0.17545 0.16015 0.18838 0.19240
summary(divorce_df$BAo_3544)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02751 0.10711 0.11086 0.10182 0.11186 0.11853
describe(divorce_df$all_3544)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 17 0.14 0.04 0.16 0.15 0.01 0.03 0.17 0.13 -1.83 1.8
## se
## X1 0.01
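The prompt asks for means, SDs, and sample sizes of each group. A minimal base-R sketch that collects these three numbers for several education columns at once is given here. The helper name `group_summary` and the toy data frame are hypothetical, and the toy values are placeholders rather than real divorce rates.

```r
# Toy stand-in for the divorce-rate data frame.
# The values are placeholders for illustration, NOT real divorce rates.
toy_rates <- data.frame(
  year = c(2001, 2002, 2003),
  High_School = c(0.10, 0.20, 0.30),
  bachelors_degree_only = c(0.05, 0.10, 0.15)
)

# Hypothetical helper: sample size, mean and SD for each named column,
# ignoring missing values.
group_summary <- function(df, cols) {
  data.frame(
    group = cols,
    n = sapply(cols, function(col) sum(!is.na(df[[col]])), USE.NAMES = FALSE),
    mean = sapply(cols, function(col) mean(df[[col]], na.rm = TRUE), USE.NAMES = FALSE),
    sd = sapply(cols, function(col) sd(df[[col]], na.rm = TRUE), USE.NAMES = FALSE)
  )
}

toy_stats <- group_summary(toy_rates, c("High_School", "bachelors_degree_only"))
toy_stats
```

On the project data, a call such as `group_summary(divorce_df, c("HS_3544", "BAo_3544"))` would put the high school and bachelor's-only groups side by side in one table.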
#Plot of year vs. divorce rate for high school degree holders
plot(Edu_divorce_rates_35_to_44[c('year','High_School')])
lines(Edu_divorce_rates_35_to_44[c('year','High_School')])
#Plot of year vs. divorce rate for holders of a bachelor's degree or more
plot(Edu_divorce_rates_35_to_44[c('year','Bachelors_degree_more')])
lines(Edu_divorce_rates_35_to_44[c('year','Bachelors_degree_more')])
ggplot(Edu_divorce_rates_35_to_44, aes(x=High_School)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(Edu_divorce_rates_35_to_44, aes(x=Bachelors_degree_more)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
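The `stat_bin()` message appears because no bin count was supplied, and the default of 30 bins is a lot for 17 observations. One possible way to choose a smaller count is Sturges' rule, available in base R as `nclass.Sturges()`. This is a sketch of that idea, not the only reasonable choice. The toy data frame is hypothetical, and its values are evenly spaced placeholders rather than real divorce rates.

```r
# Toy stand-in with 17 rows, matching the number of observations.
# The values are evenly spaced placeholders, NOT real divorce rates.
toy_rates <- data.frame(High_School = seq(0.03, 0.19, length.out = 17))

# Sturges' rule: ceiling(log2(n) + 1) bins for n observations.
n_bins <- nclass.Sturges(toy_rates$High_School)
n_bins

# Build the histogram only if ggplot2 is installed; print(p) draws it.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(toy_rates, ggplot2::aes(x = High_School)) +
    ggplot2::geom_histogram(bins = n_bins)
}
```

With the project data, the same `bins = n_bins` argument could be passed to the two `geom_histogram()` calls above, which also silences the message.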