In this set of problems, we will first go through some of the data that are collected by the World Bank. We will do some cleaning on the data before we start analyzing it. Then, we will try to do a simple web scraping exercise where we will analyze the data as well.
The dataset presented in this problem set is a truncated version of one you can find on the World Bank Gender Statistics website.
For the purpose of data cleansing and manipulation, we will use the tidyverse package. Once we upload the data we are going to see that there are multiple indicators of gender, countries and years in the data. In this case we are just interested in analyzing the data for one indicator that is the Adolescent Fertility Rate, in the data the indicator code for this variable is called SP.ADO.TFRT. This indicator measures the annual number of births to women 15 to 19 years of age per 1,000 women in that age group. It represents the risk of childbearing among adolescent women 15 to 19 years of age. It is also referred to as the age-specific fertility rate for women aged 15-19. Once we have completed this problem set we will have more information of how this rate has evolved over time and how it varies across different groups of countries.
# Preliminaries
rm(list=ls())
library("utils")
library("tidyverse")
# Set working directory
setwd("C:/Users/Tarek/Documents/MITx 14.310x - Data Analysis for Social Scientists")
# Read the data
gender_data <- as_tibble(read.csv("DataSets/Week3/Gender_StatsData.csv"))
# The first few observations
head(gender_data, 20)
NA
Teenage Fertility Rate
teenage_fr <- gender_data %>% filter(Indicator.Code == "SP.ADO.TFRT")
factor(teenage_fr$Country.Code)
[1] ARB CSS CEB EAR EAS EAP TEA EMU ECS ECA TEC EUU FCS HPC HIC IBD IBT IDB IDX IDA LTE
[22] LCN LAC TLA LDC LMY LIC LMC MEA MNA TMN MIC NAC OED OSS PSS PST PRE SST SAS TSA SSF
[43] SSA TSS UMC WLD AFG ALB DZA ASM AND AGO ATG ARG ARM ABW AUS AUT AZE BHS BHR BGD BRB
[64] BLR BEL BLZ BEN BMU BTN BOL BIH BWA BRA VGB BRN BGR BFA BDI CPV KHM CMR CAN CYM CAF
[85] TCD CHI CHL CHN COL COM COD COG CRI CIV HRV CUB CUW CYP CZE DNK DJI DMA DOM ECU EGY
[106] SLV GNQ ERI EST SWZ ETH FRO FJI FIN FRA PYF GAB GMB GEO DEU GHA GIB GRC GRL GRD GUM
[127] GTM GIN GNB GUY HTI HND HKG HUN ISL IND IDN IRN IRQ IRL IMN ISR ITA JAM JPN JOR KAZ
[148] KEN KIR PRK KOR XKX KWT KGZ LAO LVA LBN LSO LBR LBY LIE LTU LUX MAC MKD MDG MWI MYS
[169] MDV MLI MLT MHL MRT MUS MEX FSM MDA MCO MNG MNE MAR MOZ MMR NAM NRU NPL NLD NCL NZL
[190] NIC NER NGA MNP NOR OMN PAK PLW PAN PNG PRY PER PHL POL PRT PRI QAT ROU RUS RWA WSM
[211] SMR STP SAU SEN SRB SYC SLE SGP SXM SVK SVN SLB SOM ZAF SSD ESP LKA KNA LCA MAF VCT
[232] SDN SUR SWE CHE SYR TJK TZA THA TLS TGO TON TTO TUN TUR TKM TCA TUV UGA UKR ARE GBR
[253] USA URY UZB VUT VEN VNM VIR PSE YEM ZMB ZWE
263 Levels: ABW AFG AGO ALB AND ARB ARE ARG ARM ASM ATG AUS AUT AZE BDI BEL BEN ... ZWE
Summary Statistics
round(mean(teenage_fr$X2000, na.rm = TRUE), 2)
[1] 63.15
round(sd(teenage_fr$X2000, na.rm = TRUE), 2)
[1] 46.92
Now, we are interested in plotting the evolution of the Adolescent Fertility Rate from 1960 to 2015. In addition, we are interested in having different information in the same plot. First, we want to plot the sample mean of all the data set, but also we want to add more information such as the rate for low, middle and high income countries (an indicator for country code is stored in the variable “Country.Code”).
Inspect this variable to get a sense of what it contains. Note that it includes indicators for both countries, regions, and income group. Since we are only interested in the trends by income group, we want to filter the data to contain only the fertility rate for high, middle, and low income countries as well as the world average.
Trends By Income Group
byincomelevel <- filter(teenage_fr, Country.Code %in% c("LIC", "HIC", "MIC", "WLD"))
byincomelevel
Notice, there are still two problems with the resulting data:
It contains additional variables that we don’t need or are meaningless at this level of aggregation.
It is not organized in a very intuitive way. A more natural way to organize this data, and prepare it for plotting, is to have each observation represent either a year or a country group-year, and each of the columns represent either the fertility rate for a given group, or if the data is at the country-group year level, then just the fertility rate.
Data by Group Year Using gather()
plotdata_bygroupyear <- gather(byincomelevel, "Year", "FertilityRate", X1960:X2015) %>% select(Year, Country.Name, Country.Code, FertilityRate)
# plotdata_bygroupyear <- mutate(plotdata_bygroupyear, Year=as.numeric(str_sub(Year,-4)))
plotdata_bygroupyear <- mutate(plotdata_bygroupyear, Year=as.numeric(str_replace(Year, "X", "")))
head(plotdata_bygroupyear)
Suppose you take a look at the data and change your mind. You decided you prefer to look at the data at the year level and have the fertility rates for each income-group as separate variables.
Data by Group Year Using spread()
plotdata_byyear<-select(plotdata_bygroupyear, Country.Code, Year, FertilityRate) %>% spread(Country.Code, FertilityRate)
plotdata_byyear
Let’s plot the fertility rate over time, separately for each income level.
p <- ggplot(plotdata_bygroupyear, aes(x = Year, y = FertilityRate, group = Country.Code, col = Country.Code))
p + geom_line() +
labs(x = "Year", y = "Fertility Rate", title = 'Fertility Rate by Country-Income-Level over Time') +
scale_x_continuous(minor_breaks = seq(1960, 2015, 5), breaks = seq(1960, 2015, 10)) +
theme(plot.title = element_text(hjust = 0.5))

Now, we are not going to consider the trends of the different categories over the years. Instead, we are going to compare how the distribution of the Adolescent Fertility Rate is different between 1960 and 2000.
#Generating histdata_twoyears
histdata_twoyears <- select(teenage_fr, Country.Name, Country.Code, Indicator.Name, Indicator.Code, X1960,X2000)
histdata_twoyears <- gather(teenage_fr, Year, FertilityRate, X1960, X2000) %>%
select(Year, Country.Name, Country.Code, FertilityRate)
histdata_twoyears <- filter(histdata_twoyears,!is.na(FertilityRate))
ggplot(histdata_twoyears, aes(x=FertilityRate)) +
geom_histogram(data = subset(histdata_twoyears, Year=="X1960"), bins = 20, color="darkred", fill="red", alpha=0.2) +
geom_histogram(data=subset(histdata_twoyears, Year=="X2000"), bins = 20, color="darkblue", fill="blue", alpha=0.2) +
labs(title = "Fertility Rates' Distribution in 1960 and 2000", x = "Fertility Rate", y = "Number of Adolescents")

Kernels
ggplot(histdata_twoyears, aes(x=FertilityRate, group=Year, color=Year, alpha=0.2)) + geom_histogram(aes(y=..density..)) + geom_density(data=subset(histdata_twoyears, Year=="X1960"), color="darkred", fill="red", alpha=0.2, bw=5, kernel = "optcosine") + geom_density(data=subset(histdata_twoyears, Year=="X2000"), color="darkblue", fill="blue", alpha=0.2, bw=5, kernel = "optcosine")

Cummulative Distribution Function
ggplot(histdata_twoyears, aes(x = FertilityRate)) +
stat_ecdf(aes(color = Year,linetype = Year),
geom = "step", size = 1.5) +
scale_color_manual(values = c("#00AFBB", "#E7B800"))+
labs(x ="Fertility Rate", y = "Cummulative Distribution Function", title = "Empirical Cummulative Distribution Function in 1960 and 2000") +
theme(plot.title = element_text(hjust = 0.5))

CDF Plots
The CDFs below are calculated by integrating calculated marginal pdfs. You can find the derivatives of the below CDFs to find the corresponding pdf for each random variable. CDF is defined as the probability of X less than or equal to x. From this definition, and if the random variables are continuous, then one may find the derivative of the pdf.
x <- seq(0,1,0.001)
y <- seq(0,1,0.001)
cdfx <- 6/5 * (x^2/2 + x/3)
cdfy <- 6/5 * (x/2 + x^3/3)
plot(x, cdfx, type = "l", col="blue", xlab=" ", ylab = "Cumulative Probability", xlim=c(0,1), main="CDF plot")
lines(y, cdfy, lty=2, col="red", lwd=2)
legend("bottomright", ncol=1, legend = c("X", "Y"), lty=c(1,2), col=c("blue", "red"))

Resources
---
title: "Teenage Fertility Rate"
output: html_notebook
---

In this set of problems, we will first go through some of the data that are collected by the World Bank. We will do some cleaning on the data before we start analyzing it. Then, we will try to do a simple web scraping exercise where we will analyze the data as well.


The dataset presented in this problem set is a truncated version of one you can find on the World Bank Gender Statistics website.

For the purpose of data cleansing and manipulation, we will use the tidyverse package. Once we upload the data we are going to see that there are multiple indicators of gender, countries and years in the data. In this case we are just interested in analyzing the data for one indicator that is the Adolescent Fertility Rate, in the data the indicator code for this variable is called SP.ADO.TFRT. This indicator measures the annual number of births to women 15 to 19 years of age per 1,000 women in that age group. It represents the risk of childbearing among adolescent women 15 to 19 years of age. It is also referred to as the age-specific fertility rate for women aged 15-19. Once we have completed this problem set we will have more information of how this rate has evolved over time and how it varies across different groups of countries.

```{r}
# Preliminaries
rm(list=ls())
library("utils")
library("tidyverse")

# Set working directory
setwd("C:/Users/Tarek/Documents/MITx 14.310x - Data Analysis for Social Scientists")

# Read the data
gender_data <- as_tibble(read.csv("DataSets/Week3/Gender_StatsData.csv"))

# The first few observations
head(gender_data, 20)

```
### Teenage Fertility Rate
```{r}
teenage_fr <- gender_data %>% filter(Indicator.Code == "SP.ADO.TFRT")
factor(teenage_fr$Country.Code)
```
### Summary Statistics
```{r}
round(mean(teenage_fr$X2000, na.rm = TRUE), 2)
round(sd(teenage_fr$X2000, na.rm = TRUE), 2)
```
Now, we are interested in plotting the evolution of the Adolescent Fertility Rate from 1960 to 2015. In addition, we are interested in having different information in the same plot. First, we want to plot the sample mean of all the data set, but also we want to add more information such as the rate for low, middle and high income countries (an indicator for country code is stored in the variable "Country.Code").

Inspect this variable to get a sense of what it contains. Note that it includes indicators for both countries, regions, and income group.  Since we are only interested in the trends by income group, we want to filter the data to contain only the fertility rate for high, middle, and low income countries as well as the world average.

### Trends By Income Group
```{r}
byincomelevel <- filter(teenage_fr, Country.Code %in% c("LIC", "HIC", "MIC", "WLD"))
byincomelevel
```
Notice, there are still two problems with the resulting data:

1.     It contains additional variables that we don’t need or are meaningless at this level of aggregation.

2.     It is not organized in a very intuitive way.  A more natural way to organize this data, and prepare it for plotting, is to have each observation represent either a year or a country group-year, and each of the columns represent either the fertility rate for a given group, or if the data is at the country-group year level, then just the fertility rate.


### Data by Group Year Using `gather()`
```{r}
plotdata_bygroupyear <- gather(byincomelevel, "Year", "FertilityRate", X1960:X2015) %>% select(Year, Country.Name, Country.Code, FertilityRate)  
# Remove X from the years 
# plotdata_bygroupyear <- mutate(plotdata_bygroupyear, Year=as.numeric(str_sub(Year,-4))) 
# OR
# plotdata_bygroupyear <- mutate(plotdata_bygroupyear, Year=as.numeric(str_sub(Year, 2, 5)))
# OR
plotdata_bygroupyear <- mutate(plotdata_bygroupyear, Year=as.numeric(str_replace(Year, "X", "")))
head(plotdata_bygroupyear)
```

Suppose you take a look at the data and change your mind. You decided you prefer to look at the data at the year level and have the fertility rates for each income-group as separate variables.

### Data by Group Year Using `spread()`
```{r}
plotdata_byyear<- select(plotdata_bygroupyear, Country.Code, Year, FertilityRate) %>% spread(Country.Code, FertilityRate)
plotdata_byyear
```
Let’s plot the fertility rate over time, separately for each income level.

```{r}
p <- ggplot(plotdata_bygroupyear, aes(x = Year, y = FertilityRate, group = Country.Code, col = Country.Code))
p + geom_line() + 
    labs(x = "Year", y = "Fertility Rate", title = 'Fertility Rate by Country-Income-Level over Time') +
    scale_x_continuous(minor_breaks = seq(1960, 2015, 5), breaks = seq(1960, 2015, 10)) +
    theme(plot.title = element_text(hjust = 0.5))
```
Now, we are not going to consider the trends of the different categories over the years. Instead, we are going to compare how the distribution of the Adolescent Fertility Rate is different between 1960 and 2000. 

```{r}
#Generating histdata_twoyears
histdata_twoyears <- select(teenage_fr, Country.Name, Country.Code, Indicator.Name, Indicator.Code, X1960,X2000)

histdata_twoyears <- gather(teenage_fr, Year, FertilityRate, X1960, X2000) %>%
  select(Year, Country.Name, Country.Code, FertilityRate)

histdata_twoyears <- filter(histdata_twoyears,!is.na(FertilityRate))

ggplot(histdata_twoyears, aes(x=FertilityRate)) + 
  geom_histogram(data = subset(histdata_twoyears, Year=="X1960"), bins = 20, color="darkred", fill="red", alpha=0.2) + 
  geom_histogram(data=subset(histdata_twoyears, Year=="X2000"), bins = 20, color="darkblue", fill="blue", alpha=0.2) + 
  labs(title = "Fertility Rates' Distribution in 1960 and 2000", x = "Fertility Rate", y = "Number of Adolescents")
```
### Kernels
```{r}
ggplot(histdata_twoyears, aes(x=FertilityRate, group=Year, color=Year, alpha=0.2)) +        geom_histogram(aes(y=..density..)) + geom_density(data=subset(histdata_twoyears, Year=="X1960"), color="darkred", fill="red", alpha=0.2, bw=5) + geom_density(data = subset(histdata_twoyears, Year=="X2000"), color="darkblue", fill="blue", alpha=0.2, bw=5)
```
### Cummulative Distribution Function
```{r}
ggplot(histdata_twoyears, aes(x = FertilityRate)) +
  stat_ecdf(aes(color = Year,linetype = Year), 
              geom = "step", size = 1.5) +
  scale_color_manual(values = c("#00AFBB", "#E7B800"))+
  labs(x ="Fertility Rate", y = "Cummulative Distribution Function", title = "Empirical Cummulative Distribution Function in 1960 and 2000") +
  theme(plot.title = element_text(hjust = 0.5))
```
### CDF Plots 

The CDFs below are calculated by integrating calculated marginal pdfs. You can find the derivatives of the below CDFs to find the corresponding pdf for each random variable. CDF is defined as the probability of X less than or equal to x. From this definition, and if the random variables are continuous, then one may find the derivative of the pdf.
```{r}
x <- seq(0,1,0.001)
y <- seq(0,1,0.001)
cdfx <- 6/5 * (x^2/2 + x/3)
cdfy <- 6/5 * (x/2 + x^3/3)
plot(x, cdfx, type = "l", col="blue", xlab=" ", ylab = "Cumulative Probability", xlim=c(0,1), main="CDF plot")
lines(y, cdfy, lty=2, col="red", lwd=2)
legend("bottomright", ncol=1, legend = c("X", "Y"), lty=c(1,2), col=c("blue", "red"))
```
**Resources**

- [World Bank Gender statistics](https://datacatalog.worldbank.org/dataset/gender-statistics)
- [ECDF in ggplot](https://www.datanovia.com/en/lessons/ggplot-ecdf/)
- [tidyr](https://uc-r.github.io/tidyr)
- [Jupyter and R Markdow](https://www.datacamp.com/community/blog/jupyter-notebook-r#markdown)
- [Further ploting using ggplot](https://rpubs.com/euclid/343644)
- [Additional Resources](https://courses.edx.org/courses/course-v1:MITx+14.310x+1T2018/e6d5d84945fb4acaad25a310e2bdb4c9/)



