For this exercise, we will practice opening, examining, and analyzing data from IPUMS, a web repository of population data.
These exercises use the data set “1% sample Botswana ever-married women age 15+” available via the Econ 448 class account from https://international.ipums.org/international/ (class code is W0NV72). Please download the data and answer the following questions.
2, Why would an estimate of the effect of contraceptive use on a woman’s fertility potentially be biased? How does Miller use a natural experiment to remove this bias? In your answer, define the natural experiment and discuss how the natural experiment satisfies the “exclusion restriction” for the bias you described in the first part of this question.
Since there is a bias toward the contraction on woman’s fertility, Miller suggested that a Non-discrimination in provision of contraceptive information and services, Miller’s experiment in Profamilia recommend that access to comprehensive contraceptive information and services be provided equally to everyone voluntarily, free of discrimination, coercion or violence (based on individual choice) that a female can be more independent on their fertility choice and the contraction bias will be lower because when collecting the data. Miller’s natural experiment allow the bias to be lower when collecting the data and because of it. The bias on contraction can be a lot lower with it.
Profamilia also supplied vast quantities of contraceptive devices not requiring medical supervision to local drugstores at cost and it advertised its services by radio.Because the precise timing of its programmes is assumed to be largely exogenous, it is important to investigate regularities that may have governed. As a result, Miller’s experienment can lower the bias in the contraceptive use on a woman’s fertility.
Millers experment on Profamilia’s success rests on the ways of reaching women with modern contraceptives under difficult circumstances that it pioneered. Distin guishing features of its philosophy include its focus on the poor and its recruitment of lay-women to provide outreach in their own communities. In counties where it operated, Profamilia also supplied vast quantities of contraceptive devices not requiring medical supervision to local drugstores at cost and it advertised its services by radio.
1.2 Recommend that laws and policies support programmes to ensure that comprehensive contraceptive information and services are provided to all segments of the population. Special attention should be given to disadvantaged and marginalized populations in their access to these services.
#Remember, only need to run these commands once
pacman::p_load(fixest,ivreg,doBy,htmltools,shiny,DT, coefplot)
also installing the dependencies ‘Deriv’, ‘microbenchmark’
trying URL 'https://cran.rstudio.com/src/contrib/Deriv_4.1.3.tar.gz'
Content type 'application/x-gzip' length 37214 bytes (36 KB)
==================================================
downloaded 36 KB
trying URL 'https://cran.rstudio.com/src/contrib/microbenchmark_1.4.9.tar.gz'
Content type 'application/x-gzip' length 59607 bytes (58 KB)
==================================================
downloaded 58 KB
trying URL 'https://cran.rstudio.com/src/contrib/doBy_4.6.14.tar.gz'
Content type 'application/x-gzip' length 4486402 bytes (4.3 MB)
==================================================
downloaded 4.3 MB
* installing *source* package ‘Deriv’ ...
** package ‘Deriv’ successfully unpacked and MD5 sums checked
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (Deriv)
* installing *source* package ‘microbenchmark’ ...
** package ‘microbenchmark’ successfully unpacked and MD5 sums checked
** using staged installation
checking for gcc... x86_64-conda-linux-gnu-cc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether x86_64-conda-linux-gnu-cc accepts -g... yes
checking for x86_64-conda-linux-gnu-cc option to accept ISO C89... none needed
checking how to run the C preprocessor... x86_64-conda-linux-gnu-cc -E
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for stdint.h... (cached) yes
checking for stdlib.h... (cached) yes
checking sys/time.h usability... yes
checking sys/time.h presence... yes
checking for sys/time.h... yes
checking for uint64_t... yes
checking for error_at_line... yes
checking for mach_absolute_time... no
checking for library containing clock_gettime... -lrt
checking for best clockid_t to use with clock_gettime... CLOCK_MONOTONIC_RAW
configure: creating ./config.status
config.status: creating src/Makevars
config.status: creating src/config.h
** libs
x86_64-conda-linux-gnu-cc -I"/opt/conda/lib/R/include" -DNDEBUG -D_POSIX_C_SOURCE=200112L -DHAVE_CONFIG_H -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /opt/conda/include -I/opt/conda/include -Wl,-rpath-link,/opt/conda/lib -fpic -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/r-base-split_1630154298835/work=/usr/local/src/conda/r-base-4.1.1 -fdebug-prefix-map=/opt/conda=/usr/local/src/conda-prefix -c do_nothing.c -o do_nothing.o
x86_64-conda-linux-gnu-cc -I"/opt/conda/lib/R/include" -DNDEBUG -D_POSIX_C_SOURCE=200112L -DHAVE_CONFIG_H -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /opt/conda/include -I/opt/conda/include -Wl,-rpath-link,/opt/conda/lib -fpic -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/r-base-split_1630154298835/work=/usr/local/src/conda/r-base-4.1.1 -fdebug-prefix-map=/opt/conda=/usr/local/src/conda-prefix -c init.c -o init.o
x86_64-conda-linux-gnu-cc -I"/opt/conda/lib/R/include" -DNDEBUG -D_POSIX_C_SOURCE=200112L -DHAVE_CONFIG_H -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /opt/conda/include -I/opt/conda/include -Wl,-rpath-link,/opt/conda/lib -fpic -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/r-base-split_1630154298835/work=/usr/local/src/conda/r-base-4.1.1 -fdebug-prefix-map=/opt/conda=/usr/local/src/conda-prefix -c nanotimer.c -o nanotimer.o
x86_64-conda-linux-gnu-cc -shared -L/opt/conda/lib/R/lib -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,-rpath,/opt/conda/lib -Wl,-rpath-link,/opt/conda/lib -L/opt/conda/lib -o microbenchmark.so do_nothing.o init.o nanotimer.o -lrt -L/opt/conda/lib/R/lib -lR
installing to /opt/conda/lib/R/library/00LOCK-microbenchmark/00new/microbenchmark/libs
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (microbenchmark)
* installing *source* package ‘doBy’ ...
** package ‘doBy’ successfully unpacked and MD5 sums checked
** using staged installation
** R
** data
*** moving datasets to lazyload DB
** inst
** byte-compile and prepare package for lazy loading
Note: ... may be used in an incorrect context
Note: ... may be used in an incorrect context
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (doBy)
The downloaded source packages are in
‘/tmp/Rtmp7DC58a/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
doBy installed
trying URL 'https://cran.rstudio.com/src/contrib/coefplot_1.2.8.tar.gz'
Content type 'application/x-gzip' length 35905 bytes (35 KB)
==================================================
downloaded 35 KB
* installing *source* package ‘coefplot’ ...
** package ‘coefplot’ successfully unpacked and MD5 sums checked
** using staged installation
** R
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (coefplot)
The downloaded source packages are in
‘/tmp/Rtmp7DC58a/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
coefplot installed
Install these packages and (optional) change your working directory every time you start a fresh session.
library(ggplot2)
library(tidyr)
library(ivreg)
library(dplyr)
library(fixest)
library(doBy)
library(stargazer)
library(ipumsr)
library(coefplot)
#setwd("C:/Users/knoxm/OneDrive - UW/Pandemic Teaching/Econ 448 Data Course")
ddi <- read_ipums_ddi("ipumsi_00009.xml") ##Use whatever title your file has when it downloads, but the name here should be correct for this exercise.
ipumsr::ipums_view(ddi)
df_ipums <- read_ipums_micro(ddi)
Use of data from IPUMS-International is subject to conditions including that
users should cite the data appropriately. Use command `ipums_conditions()` for
more details.
df_analysis <-as.data.frame(df_ipums)
head(df_analysis)
ddi <- read_ipums_ddi("ipumsi_00009.xml")
ipumsr::ipums_view(ddi)
df_ipums <- read_ipums_micro(ddi)
Use of data from IPUMS-International is subject to conditions including that
users should cite the data appropriately. Use command `ipums_conditions()` for
more details.
df_analysis <-as.data.frame(df_ipums)
head(df_analysis)
tail(df_analysis)
stargazer(df_analysis, type = "text")
=====================================================================
Statistic N Mean St. Dev. Min Max
---------------------------------------------------------------------
COUNTRY 5,778 72.000 0.000 72 72
YEAR 5,778 2,011.000 0.000 2,011 2,011
SAMPLE 5,778 72,201,101.000 0.000 72,201,101 72,201,101
SERIAL 5,778 28,818,525.000 15,908,041.000 3,000 61,644,000
HHWT 5,778 61.989 0.816 0 62
OWNERSHIP 5,778 1.323 0.572 0 9
OWNERSHIPD 5,778 158.956 56.817 0 999
ELECTRIC 5,778 1.442 0.517 0 2
CELL 5,778 1.071 0.295 0 2
INTERNET 5,778 1.043 0.249 0 2
TV 5,778 15.559 5.174 0 20
RADIO 5,778 1.623 0.506 0 2
TOILET 5,778 19.759 4.558 0 23
FLOOR 5,778 191.630 41.364 0 236
PERNUM 5,778 1.975 1.546 1 19
PERWT 5,778 62.000 0.000 62 62
AGE 5,778 42.705 17.268 15 94
SEX 5,778 2.000 0.000 2 2
MARST 5,778 2.304 0.695 2 4
MARSTD 5,778 244.694 65.862 210 400
CHBORN 5,778 3.501 2.756 0 14
CHSURV 5,778 13.336 29.273 0 99
CHBORNF 5,778 1.742 1.654 0 9
CHBORNM 5,778 1.758 1.670 0 9
RELIGION 5,778 5.529 1.546 1 9
RELIGIOND 5,778 5,535.263 1,554.382 1,000 9,999
YRSCHOOL 5,778 10.251 15.616 0 98
EMPSTAT 5,778 1.919 0.955 1 9
EMPSTATD 5,778 202.388 97.422 110 999
LABFORCE 5,778 1.593 0.506 1 8
YRSCHOOL_SP 2,883 12.713 20.909 0 99
---------------------------------------------------------------------
##first I will clean the data by filtering the unnecessary data out. like the years of schooling is way too big for the person, and I decided to delect the data thats unrealivent with a filter <50 and the childborn <31.
df_analysis <- filter(df_analysis, YRSCHOOL<50)
df_analysis <- filter(df_analysis, YRSCHOOL_SP<50)
df_analysis <- filter(df_analysis, CHBORN<31)
stargazer(df_analysis, type = "text")
=====================================================================
Statistic N Mean St. Dev. Min Max
---------------------------------------------------------------------
COUNTRY 2,646 72.000 0.000 72 72
YEAR 2,646 2,011.000 0.000 2,011 2,011
SAMPLE 2,646 72,201,101.000 0.000 72,201,101 72,201,101
SERIAL 2,646 26,935,105.000 16,266,070.000 3,000 61,172,000
HHWT 2,646 62.000 0.000 62 62
OWNERSHIP 2,646 1.394 0.596 0 9
OWNERSHIPD 2,646 165.420 59.086 0 999
ELECTRIC 2,646 1.451 0.508 0 2
CELL 2,646 1.068 0.273 0 2
INTERNET 2,646 1.065 0.267 0 2
TV 2,646 15.896 5.026 0 20
RADIO 2,646 1.665 0.483 0 2
TOILET 2,646 19.656 4.502 0 23
FLOOR 2,646 191.871 40.091 0 236
PERNUM 2,646 2.114 1.345 1 18
PERWT 2,646 62.000 0.000 62 62
AGE 2,646 39.301 13.684 15 94
SEX 2,646 2.000 0.000 2 2
MARST 2,646 2.003 0.070 2 4
MARSTD 2,646 215.367 8.399 210 400
CHBORN 2,646 3.329 2.616 0 14
CHSURV 2,646 13.271 29.264 0 99
CHBORNF 2,646 1.637 1.589 0 9
CHBORNM 2,646 1.689 1.619 0 9
RELIGION 2,646 5.548 1.537 1 9
RELIGIOND 2,646 5,556.017 1,548.194 1,000 9,999
YRSCHOOL 2,646 8.070 4.779 0 18
EMPSTAT 2,646 1.882 0.950 1 9
EMPSTATD 2,646 198.126 96.365 110 999
LABFORCE 2,646 1.614 0.503 1 8
YRSCHOOL_SP 2,646 7.547 5.319 0 18
---------------------------------------------------------------------
stargazer(df_analysis[c("CHBORN","CHBORNF","AGE", "YRSCHOOL", "YRSCHOOL_SP", "LABFORCE") ], type = "text", summary.stat = c("n","mean","sd", "min", "max"), covariate.labels = c("Children Ever Born", "Female baby Ever Born", "Age", "Years of School", "Spouse Years of School", "Labor force participation"), digits = 2, title="Summary Statistics", notes="Note: Data are 1% sample of ever-married women 15+ from 2011 Botswana census. Source: IPUMS International")
Warning: the condition has length > 1 and only the first element will be usedWarning: the condition has length > 1 and only the first element will be used
Summary Statistics
==============================================================================================================
Statistic N Mean St. Dev. Min Max
--------------------------------------------------------------------------------------------------------------
Children Ever Born 2,646 3.33 2.62 0 14
Female baby Ever Born 2,646 1.64 1.59 0 9
Age 2,646 39.30 13.68 15 94
Years of School 2,646 8.07 4.78 0 18
Spouse Years of School 2,646 7.55 5.32 0 18
Labor force participation 2,646 1.61 0.50 1 8
--------------------------------------------------------------------------------------------------------------
Note: Data are 1% sample of ever-married women 15+ from 2011 Botswana census. Source: IPUMS International
stargazer(df_analysis[c("ELECTRIC", "TOILET", "FLOOR", "RELIGION", "MARST", "EMPSTAT")], type = "text")
Warning: the condition has length > 1 and only the first element will be usedWarning: the condition has length > 1 and only the first element will be used
========================================
Statistic N Mean St. Dev. Min Max
----------------------------------------
ELECTRIC 2,646 1.451 0.508 0 2
TOILET 2,646 19.656 4.502 0 23
FLOOR 2,646 191.871 40.091 0 236
RELIGION 2,646 5.548 1.537 1 9
MARST 2,646 2.003 0.070 2 4
EMPSTAT 2,646 1.882 0.950 1 9
----------------------------------------
summary(df_analysis$Employed)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 1.0000 0.5091 1.0000 1.0000
3.2 Plot mean number of children against mean years of education. What relationship do you see between these data and what does that tell you about what output you would expect from a regression?
From the below graph, I would expect an negative relationsip between
df_means_by_children<- df_analysis %>%
group_by(YRSCHOOL) %>%
summarise(CHBORN.mean=mean(CHBORN)) %>%
as.data.frame()
head(df_means_by_children)
ggplot(df_means_by_children, aes(x = YRSCHOOL, y=CHBORN.mean))+geom_point()+ geom_smooth(method = lm, formula = y~poly(x,1))
mean_by_child <- lm(CHBORN~YRSCHOOL, data = df_analysis)
stargazer(mean_by_child, type = "text")
Warning: length of NULL cannot be changedWarning: length of NULL cannot be changedWarning: length of NULL cannot be changedWarning: length of NULL cannot be changedWarning: length of NULL cannot be changed
===============================================
Dependent variable:
---------------------------
CHBORN
-----------------------------------------------
YRSCHOOL -0.268***
(0.009)
Constant 5.490***
(0.087)
-----------------------------------------------
Observations 2,646
R2 0.239
Adjusted R2 0.239
Residual Std. Error 2.282 (df = 2644)
F Statistic 831.654*** (df = 1; 2644)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
print.data.frame(df_means_by_children)
YRSCHOOL CHBORN.mean
1 0 5.394673
2 1 5.086957
3 2 5.518519
4 3 4.821429
5 4 4.610169
6 5 4.682540
7 6 4.984615
8 7 4.383099
9 8 2.447059
10 9 2.725762
11 10 2.226221
12 11 2.010989
13 12 1.871094
14 13 2.179104
15 14 2.071429
16 15 1.823529
17 16 1.889610
18 17 1.812500
19 18 1.758621
3.3 Using the main data frame, use ordinary least squares to estimate the coefficient β1 in the regression model Yi=β0+β1Xi+ϵi, where Y is children ever born and X is a woman’s education. Show your results. Does the sign match what you expect to see? Note: Typically, we might transform the data, perhaps by taking the log of children, or use a non-linear regression such as a Poisson regression here. However, the zeros make this a little more complicated.
f_ols<-feols(YRSCHOOL~i(AGE), data=df_analysis)
summary(f_ols)
OLS estimation, Dep. Var.: YRSCHOOL
Observations: 2,646
Standard-errors: IID
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 4.07788 Adj. R2: 0.25107
iplot(f_ols)
stargazer(mean_by_child, type = "text")
Warning: length of NULL cannot be changedWarning: length of NULL cannot be changedWarning: length of NULL cannot be changedWarning: length of NULL cannot be changedWarning: length of NULL cannot be changed
===============================================
Dependent variable:
---------------------------
CHBORN
-----------------------------------------------
YRSCHOOL -0.268***
(0.009)
Constant 5.490***
(0.087)
-----------------------------------------------
Observations 2,646
R2 0.239
Adjusted R2 0.239
Residual Std. Error 2.282 (df = 2644)
F Statistic 831.654*** (df = 1; 2644)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
#It matched with my initial thoughts on the negative coefficients of 0.23 between number of child born and education status. because more child born means the female will have less time for having education. It correlates to our prior plot that between mother’s education and its child born.
3.4 Now let’s add some control variables. Choose at least two control variables from the available variables in the data. What variables did you choose and what sign do you expect the coefficients on these variables to have? Why?
#I added the number of people in family and the age of them. I will expect a positive coefficient from the number of people since more people in family means a larger possibility that there will be more children in the family. and for age, a female will have more kids as she’s older because she will have a longer birth history and more time to give birth of childrens.
3.5 Add these variables to your regression model from the previous question and display your results. Do the coefficients match your expectations? Does the coefficient on education change once you add the control variables?
var_test1 <- lm(CHBORN ~ YRSCHOOL_SP + AGE + RELIGION, data=df_analysis)
summary(var_test1)
Call:
lm(formula = CHBORN ~ YRSCHOOL_SP + AGE + RELIGION, data = df_analysis)
Residuals:
Min 1Q Median 3Q Max
-8.662 -1.087 -0.113 1.037 8.826
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.530877 0.208444 2.547 0.0109 *
YRSCHOOL_SP -0.125114 0.008280 -15.110 <2e-16 ***
AGE 0.091790 0.003223 28.483 <2e-16 ***
RELIGION 0.024280 0.025725 0.944 0.3453
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.02 on 2642 degrees of freedom
Multiple R-squared: 0.4047, Adjusted R-squared: 0.404
F-statistic: 598.7 on 3 and 2642 DF, p-value: < 2.2e-16
stargazer(var_test1, type = "text")
Warning: length of NULL cannot be changedWarning: length of NULL cannot be changedWarning: length of NULL cannot be changedWarning: length of NULL cannot be changedWarning: length of NULL cannot be changed
===============================================
Dependent variable:
---------------------------
CHBORN
-----------------------------------------------
YRSCHOOL_SP -0.125***
(0.008)
AGE 0.092***
(0.003)
RELIGION 0.024
(0.026)
Constant 0.531**
(0.208)
-----------------------------------------------
Observations 2,646
R2 0.405
Adjusted R2 0.404
Residual Std. Error 2.020 (df = 2642)
F Statistic 598.718*** (df = 3; 2642)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
#I used RELIGION to test the variables, and we can see a positive coefficient in religion, which matched with my expectations and the cofficient on education got higher when I add the control variables. religion only has a slight impact on the Childborn.