In this observational study, I plan to explore birth rates to answer the following questions:
1)Has overall birth rate declined in the United Sates?
2)What age group has significiant declines in birth rate over time?
3)What age group has significiant increase in birth rate over time?
4)Is there a correlation between women completing college and birth rate?
Birth rate is the dependent variable which is qualitative. Age range and Year are Independent Variables.
State laws require birth certificates to be completed for all births, and Federal law mandates national collection and publication of births and other vital statistics data. The National Vital Statistics System, the Federal compilation of this data, is the result of the cooperation between the National Center for Health Statistics (NCHS) and the States to provide access to statistical information from birth certificates.
This dataset includes birth rates for females by age group in the United States since 1940.The number of states in the reporting area differ historically. In 1915 (when the birth registration area was established), 10 states and the District of Columbia reported births; by 1933, 48 states and the District of Columbia were reporting births, with the last two states, Alaska and Hawaii, added to the registration area in 1959 and 1960, when these regions gained statehood.
birthrates<- read.csv("https://raw.githubusercontent.com/IsARam/DATA606/master/NCHS_-_Birth_Rates_for_Females_by_Age_Group__United_States-1.csv?_sm_au_=iVV5PSknnkM1sR2F", header = TRUE, sep = ",")The data was sourced from: https://data.cdc.gov/NCHS/NCHS-Birth-Rates-for-Females-by-Age-Group-United-S/yt7u-eiyg
head(birthrates)## Year Age.Group Birth.Rate
## 1 2015 10-14 years 0.2
## 2 2015 15-19 years 22.3
## 3 2015 20-24 years 76.8
## 4 2015 25-29 years 104.3
## 5 2015 30-34 years 101.5
## 6 2015 35-39 years 51.8
I spread the data to view the data across multiple rows which allowed me to observe summary statistics for all age groups.
library(tidyr)
birthratesspread<-spread(birthrates,key = Age.Group, value = Birth.Rate)
head(birthratesspread)## Year 10-14 years 15-19 years 20-24 years 25-29 years 30-34 years
## 1 1940 0.7 54.1 135.6 122.8 83.4
## 2 1941 0.7 56.9 145.4 128.7 85.3
## 3 1942 0.7 61.1 165.1 142.7 91.8
## 4 1943 0.8 61.7 164.0 147.8 99.5
## 5 1944 0.8 54.3 151.8 136.5 98.1
## 6 1945 0.8 51.1 138.9 132.2 100.2
## 35-39 years 40-44 years 45-49 years
## 1 46.3 15.6 1.9
## 2 46.1 15.0 1.7
## 3 47.9 14.7 1.6
## 4 52.8 15.7 1.5
## 5 54.6 16.1 1.4
## 6 56.9 16.6 1.6
summary(birthratesspread)## Year 10-14 years 15-19 years 20-24 years
## Min. :1940 Min. :0.2000 Min. :22.30 Min. : 76.8
## 1st Qu.:1959 1st Qu.:0.8000 1st Qu.:50.52 1st Qu.:107.7
## Median :1978 Median :0.9000 Median :56.45 Median :114.5
## Mean :1978 Mean :0.9395 Mean :59.60 Mean :146.2
## 3rd Qu.:1996 3rd Qu.:1.2000 3rd Qu.:70.35 3rd Qu.:188.0
## Max. :2015 Max. :1.4000 Max. :96.30 Max. :260.6
## 25-29 years 30-34 years 35-39 years 40-44 years
## Min. :104.3 Min. : 52.30 Min. :19.00 Min. : 3.800
## 1st Qu.:111.0 1st Qu.: 74.03 1st Qu.:31.25 1st Qu.: 5.800
## Median :117.0 Median : 91.80 Median :44.70 Median : 9.500
## Mean :134.7 Mean : 88.35 Mean :40.84 Mean : 9.936
## 3rd Qu.:161.3 3rd Qu.:101.65 3rd Qu.:52.65 3rd Qu.:15.150
## Max. :199.4 Max. :118.90 Max. :59.90 Max. :16.600
## 45-49 years
## Min. :0.2000
## 1st Qu.:0.3000
## Median :0.6000
## Mean :0.6803
## 3rd Qu.:0.9000
## Max. :1.9000
In order to determine there is a correlation between women completing college and birth rate and in order to do that I need to first group the data by year.
library(dplyr)
avgbirthbydecade<-birthrates %>% mutate(Year, Decade = Year - (Year %% 10)) %>% group_by(Decade) %>% dplyr::summarise(Average_Birth_Rate = mean(Birth.Rate, na.rm=TRUE))
head(avgbirthbydecade)## # A tibble: 6 x 2
## Decade Average_Birth_Rate
## <dbl> <dbl>
## 1 1940 68.9
## 2 1950 87.7
## 3 1960 75.9
## 4 1970 48.4
## 5 1980 46.5
## 6 1990 50.4
The data represents Percent of the Population with a Bachelor’s Degree or Higher by Sex and Age, for the United States: 1940 to 2000. I reformated the data fram to show education Bachelor’s_ Percentage in the same maner as Birth Rate.
education<- read.csv("https://raw.githubusercontent.com/IsARam/DATA606/master/table02.csv?_sm_au_=iVVDZ7t47jt7HsMM", header = TRUE, sep = ",")
conseducation <- education[40,2:8]
names(conseducation) <- c("1940","1950","1960","1970","1980","1990","2000")
row.names(conseducation) <- "Percent"
edu<-conseducation %>% gather("1940","1950","1960","1970","1980","1990","2000", key = "Decade", value = "Bachelors_ Percentage")
edu## Decade Bachelors_ Percentage
## 1 1940 4.9
## 2 1950 5.9
## 3 1960 7.8
## 4 1970 13.3
## 5 1980 20.5
## 6 1990 22.4
## 7 2000 29.7
The data was sourced from: https://www.census.gov/data/tables/2000/dec/phc-t-41.html
combined<-merge(edu,avgbirthbydecade, by.x="Decade")
combined## Decade Bachelors_ Percentage Average_Birth_Rate
## 1 1940 4.9 68.85750
## 2 1950 5.9 87.72125
## 3 1960 7.8 75.89375
## 4 1970 13.3 48.35000
## 5 1980 20.5 46.46500
## 6 1990 22.4 50.35500
## 7 2000 29.7 51.41250
library(ggplot2)## Warning: package 'ggplot2' was built under R version 3.5.3
library(RColorBrewer)
ggplot(birthrates, aes(x=birthrates$Year, y=birthrates$Birth.Rate, color =birthrates$Age.Group))+geom_point()+xlab("Year")+ylab("Birth Rate")+ggtitle("Birth Rate by Year")library(ggplot2)
library(RColorBrewer)
ggplot(birthrates, aes(x=birthrates$Year, y=birthrates$Birth.Rate, color =birthrates$Year))+geom_point()+xlab("Year")+ylab("Birth Rate")+ggtitle("Birth Rate by Year By Age Group")+ facet_wrap(birthrates$Age.Group)From Exploratory data analysis, I am able to determine that the overall birth rate has declined in the United Sates. The age group that has a significiant declines in birth rate over time are women ages 20 to 24.The age group that has a significiant increase in birth rate over time are women ages 30 to 34.
summary(birthratesspread$`20-24 years`)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 76.8 107.7 114.5 146.2 188.0 260.6
hist(birthratesspread$`20-24 years`,xlab = "Birth Rate", main = "Birth Rate for women ages 20-24")summary(birthratesspread$`30-34 years`)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 52.30 74.03 91.80 88.35 101.65 118.90
hist(birthratesspread$`30-34 years`,xlab = "Birth Rate", main = "Birth Rate for women ages 30-34")library(ggplot2)
library(RColorBrewer)
ggplot(avgbirthbydecade, aes(x=Decade, y=Average_Birth_Rate))+geom_point()+xlab("Year")+ylab("Birth Rate")+ggtitle("Average Birth Rate by Year")library(ggplot2)
library(RColorBrewer)
ggplot(edu, aes(x=Decade, y=edu$`Bachelors_ Percentage`))+geom_point()+xlab("Year")+ylab("Bachelors Percentage")+ggtitle("Bachelors Percentage By Year")library(ggplot2)
library(RColorBrewer)
ggplot() +
geom_point(data=combined, aes(x = Decade, y = `Bachelors_ Percentage`), color = "blue") +
geom_point(data=combined, aes(x = Decade, y = combined$`Average_Birth_Rate`), color = " Dark green") +
xlab('Decade') +
ylab('Total')H0 = A higher level of education causes a increase in Average Birth Rate. H1 = A higher level of education causes a decrease in Average Birth Rate.
qqnorm(combined$Average_Birth_Rate)
qqline(combined$Average_Birth_Rate)shapiro.test(combined$Average_Birth_Rate)##
## Shapiro-Wilk normality test
##
## data: combined$Average_Birth_Rate
## W = 0.85433, p-value = 0.1346
inference(combined$`Bachelors_ Percentage`, est = "median", type = "ci", null = 0, method = "theoretical")qqnorm(combined$`Bachelors_ Percentage`)
qqline(combined$`Bachelors_ Percentage`)shapiro.test(combined$`Bachelors_ Percentage`)##
## Shapiro-Wilk normality test
##
## data: combined$`Bachelors_ Percentage`
## W = 0.91535, p-value = 0.4341
ttest<- t.test(combined$Average_Birth_Rate,combined$`Bachelors_ Percentage`)
ttest##
## Welch Two Sample t-test
##
## data: combined$Average_Birth_Rate and combined$`Bachelors_ Percentage`
## t = 6.5351, df = 9.6876, p-value = 7.64e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 30.4874 62.2426
## sample estimates:
## mean of x mean of y
## 61.29357 14.92857
library(DATA606)## Loading required package: shiny
## Loading required package: openintro
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following object is masked from 'package:ggplot2':
##
## diamonds
## The following objects are masked from 'package:datasets':
##
## cars, trees
## Loading required package: OIdata
## Loading required package: RCurl
## Loading required package: bitops
##
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
##
## complete
## Loading required package: maps
## Loading required package: markdown
##
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics
## This package is designed to support this course. The text book used
## is OpenIntro Statistics, 3rd Edition. You can read this by typing
## vignette('os3') or visit www.OpenIntro.org.
##
## The getLabs() function will return a list of the labs available.
##
## The demo(package='DATA606') will list the demos that are available.
##
## Attaching package: 'DATA606'
## The following object is masked from 'package:utils':
##
## demo
cor(combined$`Bachelors_ Percentage`, combined$Average_Birth_Rate)## [1] -0.7673735
plot(combined$`Bachelors_ Percentage`, combined$Average_Birth_Rate)m1<-lm(combined$`Bachelors_ Percentage`~ combined$Average_Birth_Rate, data = combined)
summary(m1)##
## Call:
## lm(formula = combined$`Bachelors_ Percentage` ~ combined$Average_Birth_Rate,
## data = combined)
##
## Residuals:
## 1 2 3 4 5 6 7
## -6.6263 2.8585 -0.5615 -7.4505 -1.0984 2.5513 10.3270
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.4982 10.6060 4.007 0.0103 *
## combined$Average_Birth_Rate -0.4498 0.1681 -2.676 0.0440 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.667 on 5 degrees of freedom
## Multiple R-squared: 0.5889, Adjusted R-squared: 0.5066
## F-statistic: 7.161 on 1 and 5 DF, p-value: 0.04403
plot(m1)hist(resid(m1))anovaavg <- aov(combined$`Bachelors_ Percentage`~ combined$Average_Birth_Rate)
summary(anovaavg)## Df Sum Sq Mean Sq F value Pr(>F)
## combined$Average_Birth_Rate 1 318.3 318.3 7.161 0.044 *
## Residuals 5 222.3 44.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anovabac <- aov(combined$Average_Birth_Rate ~ combined$`Bachelors_ Percentage`)
summary(anovabac)## Df Sum Sq Mean Sq F value Pr(>F)
## combined$`Bachelors_ Percentage` 1 926.6 926.6 7.161 0.044 *
## Residuals 5 646.9 129.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From my result, it appears that there is a negative correlation which means that both variables move in the opposite direction.As education increases, birth rate decreases.
Overall birth rate by year shows birth rate was at it’s highest point for all age groups in the 1960’, mainly for 20-24 year old women. Birth Rate has been on the decline since the 1960’s, especially within women ages 20-24. It appears that women are having children later in life. There appears to be an increase of birth rates for women ages 30-34. I believe this is due to women prioritizng education and career’s before starting a family.
The analysis examined wheter there is an association between Bachelor’s and Average Birth Rate. The two sample t-test results support that there is an association. The P-value is less than .05, the null hyothesis A higher level of education causes a increase in Average Birth Rate.
Future research could entail delving deeper into Average Birth Rate by Race/Ethnicity and Socioeconomic status. It would be interesting to see if Average Birth Rate could be predicted when factoring these conditions.