Overview and Setup

For this exercise, we will practice opening, examining, and analyzing data from IPUMS, a web repository of population data.

These exercises use the data set “1% sample Botswana ever-married women age 15+” available via the Econ 448 class account from https://international.ipums.org/international/ (class code is W0NV72). Please download the data and answer the following questions.

2, Why would an estimate of the effect of contraceptive use on a woman’s fertility potentially be biased? How does Miller use a natural experiment to remove this bias? In your answer, define the natural experiment and discuss how the natural experiment satisfies the “exclusion restriction” for the bias you described in the first part of this question.

#Remember, only need to run these commands once
pacman::p_load(fixest,ivreg,doBy,htmltools,shiny,DT, coefplot)

Install these packages and (optional) change your working directory every time you start a fresh session.

library(ggplot2)
library(tidyr)
library(ivreg)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:doBy':
## 
##     order_by
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(fixest)
library(doBy)
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(ipumsr)
library(coefplot)
#setwd("C:/Users/knoxm/OneDrive - UW/Pandemic Teaching/Econ 448 Data Course")
ddi <- read_ipums_ddi("ipumsi_00009.xml")  ##Use whatever title your file has when it downloads, but the name here should be correct for this exercise.
ipumsr::ipums_view(ddi)
df_ipums <- read_ipums_micro(ddi)  
## Use of data from IPUMS-International is subject to conditions including that
## users should cite the data appropriately. Use command `ipums_conditions()` for
## more details.
df_analysis <-as.data.frame(df_ipums)
head(df_analysis)
ddi <- read_ipums_ddi("ipumsi_00009.xml") 
ipumsr::ipums_view(ddi)
df_ipums <- read_ipums_micro(ddi)  
## Use of data from IPUMS-International is subject to conditions including that
## users should cite the data appropriately. Use command `ipums_conditions()` for
## more details.
df_analysis <-as.data.frame(df_ipums)
head(df_analysis)
tail(df_analysis)
stargazer(df_analysis, type = "text")
## 
## =====================================================================
## Statistic     N        Mean         St. Dev.       Min        Max    
## ---------------------------------------------------------------------
## COUNTRY     5,778     72.000         0.000          72         72    
## YEAR        5,778   2,011.000        0.000        2,011      2,011   
## SAMPLE      5,778 72,201,101.000     0.000      72,201,101 72,201,101
## SERIAL      5,778 28,818,525.000 15,908,041.000   3,000    61,644,000
## HHWT        5,778     61.989         0.816          0          62    
## OWNERSHIP   5,778     1.323          0.572          0          9     
## OWNERSHIPD  5,778    158.956         56.817         0         999    
## ELECTRIC    5,778     1.442          0.517          0          2     
## CELL        5,778     1.071          0.295          0          2     
## INTERNET    5,778     1.043          0.249          0          2     
## TV          5,778     15.559         5.174          0          20    
## RADIO       5,778     1.623          0.506          0          2     
## TOILET      5,778     19.759         4.558          0          23    
## FLOOR       5,778    191.630         41.364         0         236    
## PERNUM      5,778     1.975          1.546          1          19    
## PERWT       5,778     62.000         0.000          62         62    
## AGE         5,778     42.705         17.268         15         94    
## SEX         5,778     2.000          0.000          2          2     
## MARST       5,778     2.304          0.695          2          4     
## MARSTD      5,778    244.694         65.862        210        400    
## CHBORN      5,778     3.501          2.756          0          14    
## CHSURV      5,778     13.336         29.273         0          99    
## CHBORNF     5,778     1.742          1.654          0          9     
## CHBORNM     5,778     1.758          1.670          0          9     
## RELIGION    5,778     5.529          1.546          1          9     
## RELIGIOND   5,778   5,535.263      1,554.382      1,000      9,999   
## YRSCHOOL    5,778     10.251         15.616         0          98    
## EMPSTAT     5,778     1.919          0.955          1          9     
## EMPSTATD    5,778    202.388         97.422        110        999    
## LABFORCE    5,778     1.593          0.506          1          8     
## YRSCHOOL_SP 2,883     12.713         20.909         0          99    
## ---------------------------------------------------------------------
##first I will clean the data by filtering the unnecessary data out. like the years of schooling is way too big for the person, and I decided to delect the data thats unrealivent with a filter <50 and the childborn <31.

df_analysis <- filter(df_analysis, YRSCHOOL<50)
df_analysis <- filter(df_analysis, YRSCHOOL_SP<50)
df_analysis <- filter(df_analysis, CHBORN<31)

stargazer(df_analysis, type = "text")
## 
## =====================================================================
## Statistic     N        Mean         St. Dev.       Min        Max    
## ---------------------------------------------------------------------
## COUNTRY     2,646     72.000         0.000          72         72    
## YEAR        2,646   2,011.000        0.000        2,011      2,011   
## SAMPLE      2,646 72,201,101.000     0.000      72,201,101 72,201,101
## SERIAL      2,646 26,935,105.000 16,266,070.000   3,000    61,172,000
## HHWT        2,646     62.000         0.000          62         62    
## OWNERSHIP   2,646     1.394          0.596          0          9     
## OWNERSHIPD  2,646    165.420         59.086         0         999    
## ELECTRIC    2,646     1.451          0.508          0          2     
## CELL        2,646     1.068          0.273          0          2     
## INTERNET    2,646     1.065          0.267          0          2     
## TV          2,646     15.896         5.026          0          20    
## RADIO       2,646     1.665          0.483          0          2     
## TOILET      2,646     19.656         4.502          0          23    
## FLOOR       2,646    191.871         40.091         0         236    
## PERNUM      2,646     2.114          1.345          1          18    
## PERWT       2,646     62.000         0.000          62         62    
## AGE         2,646     39.301         13.684         15         94    
## SEX         2,646     2.000          0.000          2          2     
## MARST       2,646     2.003          0.070          2          4     
## MARSTD      2,646    215.367         8.399         210        400    
## CHBORN      2,646     3.329          2.616          0          14    
## CHSURV      2,646     13.271         29.264         0          99    
## CHBORNF     2,646     1.637          1.589          0          9     
## CHBORNM     2,646     1.689          1.619          0          9     
## RELIGION    2,646     5.548          1.537          1          9     
## RELIGIOND   2,646   5,556.017      1,548.194      1,000      9,999   
## YRSCHOOL    2,646     8.070          4.779          0          18    
## EMPSTAT     2,646     1.882          0.950          1          9     
## EMPSTATD    2,646    198.126         96.365        110        999    
## LABFORCE    2,646     1.614          0.503          1          8     
## YRSCHOOL_SP 2,646     7.547          5.319          0          18    
## ---------------------------------------------------------------------
stargazer(df_analysis[c("CHBORN","CHBORNF","AGE", "YRSCHOOL", "YRSCHOOL_SP", "LABFORCE") ], type = "text", summary.stat = c("n","mean","sd", "min", "max"), covariate.labels = c("Children Ever Born", "Female baby Ever Born", "Age", "Years of School", "Spouse Years of School", "Labor force participation"), digits = 2, title="Summary Statistics", notes="Note: Data are 1% sample of ever-married women 15+ from 2011 Botswana census. Source: IPUMS International")
## 
## Summary Statistics
## ==============================================================================================================
## Statistic                                  N             Mean          St. Dev.          Min          Max     
## --------------------------------------------------------------------------------------------------------------
## Children Ever Born                       2,646           3.33            2.62             0            14     
## Female baby Ever Born                    2,646           1.64            1.59             0            9      
## Age                                      2,646           39.30           13.68           15            94     
## Years of School                          2,646           8.07            4.78             0            18     
## Spouse Years of School                   2,646           7.55            5.32             0            18     
## Labor force participation                2,646           1.61            0.50             1            8      
## --------------------------------------------------------------------------------------------------------------
## Note: Data are 1% sample of ever-married women 15+ from 2011 Botswana census. Source: IPUMS International
stargazer(df_analysis[c("ELECTRIC", "TOILET", "FLOOR", "RELIGION", "MARST", "EMPSTAT")], type = "text")
## 
## ========================================
## Statistic   N    Mean   St. Dev. Min Max
## ----------------------------------------
## ELECTRIC  2,646  1.451   0.508    0   2 
## TOILET    2,646 19.656   4.502    0  23 
## FLOOR     2,646 191.871  40.091   0  236
## RELIGION  2,646  5.548   1.537    1   9 
## MARST     2,646  2.003   0.070    2   4 
## EMPSTAT   2,646  1.882   0.950    1   9 
## ----------------------------------------
df_analysis$Employed <- ifelse(df_analysis$EMPSTAT==1,1,0)

summary(df_analysis$Employed)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.5091  1.0000  1.0000

3.2 Plot mean number of children against mean years of education. What relationship do you see between these data and what does that tell you about what output you would expect from a regression?

From the below graph, I would expect an negative relationsip between

df_means_by_children<- df_analysis %>% 
  group_by(YRSCHOOL) %>% 
  summarise(CHBORN.mean=mean(CHBORN)) %>%
  as.data.frame()

head(df_means_by_children)  
ggplot(df_means_by_children, aes(x = YRSCHOOL, y=CHBORN.mean))+geom_point()+ geom_smooth(method = lm, formula = y~poly(x,1))
## Don't know how to automatically pick scale for object of type haven_labelled/vctrs_vctr/integer. Defaulting to continuous.

mean_by_child <- lm(CHBORN~YRSCHOOL, data = df_analysis)

stargazer(mean_by_child, type = "text")
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                               CHBORN           
## -----------------------------------------------
## YRSCHOOL                     -0.268***         
##                               (0.009)          
##                                                
## Constant                     5.490***          
##                               (0.087)          
##                                                
## -----------------------------------------------
## Observations                   2,646           
## R2                             0.239           
## Adjusted R2                    0.239           
## Residual Std. Error      2.282 (df = 2644)     
## F Statistic          831.654*** (df = 1; 2644) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
print.data.frame(df_means_by_children)
##    YRSCHOOL CHBORN.mean
## 1         0    5.394673
## 2         1    5.086957
## 3         2    5.518519
## 4         3    4.821429
## 5         4    4.610169
## 6         5    4.682540
## 7         6    4.984615
## 8         7    4.383099
## 9         8    2.447059
## 10        9    2.725762
## 11       10    2.226221
## 12       11    2.010989
## 13       12    1.871094
## 14       13    2.179104
## 15       14    2.071429
## 16       15    1.823529
## 17       16    1.889610
## 18       17    1.812500
## 19       18    1.758621

3.3 Using the main data frame, use ordinary least squares to estimate the coefficient β1 in the regression model Yi=β0+β1Xi+ϵi, where Y is children ever born and X is a woman’s education. Show your results. Does the sign match what you expect to see? Note: Typically, we might transform the data, perhaps by taking the log of children, or use a non-linear regression such as a Poisson regression here. However, the zeros make this a little more complicated.

f_ols<-feols(YRSCHOOL~i(AGE), data=df_analysis)

summary(f_ols)
## OLS estimation, Dep. Var.: YRSCHOOL
## Observations: 2,646 
## Standard-errors: IID 
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)  7.000000    4.13613  1.692403  0.09069 .  
## AGE::16     -3.000000    4.62433 -0.648742  0.51656    
## AGE::17      1.333333    4.35986  0.305820  0.75977    
## AGE::18      0.733333    4.27178  0.171669  0.86371    
## AGE::19      3.200000    4.23827  0.755025  0.45030    
## AGE::20      2.885714    4.19480  0.687926  0.49156    
## AGE::21      2.063830    4.17990  0.493751  0.62152    
## AGE::22      3.508772    4.17225  0.840977  0.40044    
## AGE::23      3.098039    4.17648  0.741782  0.45829    
## AGE::24      3.464789    4.16516  0.831851  0.40557    
## AGE::25      3.412500    4.16190  0.819938  0.41233    
## AGE::26      3.650000    4.16190  0.877003  0.38057    
## AGE::27      2.409091    4.15956  0.579169  0.56253    
## AGE::28      3.084211    4.15784  0.741781  0.45829    
## AGE::29      3.329114    4.16223  0.799840  0.42388    
## AGE::30      3.305263    4.15784  0.794947  0.42672    
## AGE::31      3.430380    4.16223  0.824169  0.40992    
## AGE::32      2.276596    4.15807  0.547512  0.58407    
## AGE::33      2.376623    4.16290  0.570905  0.56811    
## AGE::34      2.460784    4.15636  0.592053  0.55387    
## AGE::35      2.986111    4.16475  0.716996  0.47344    
## AGE::36      1.525641    4.16256  0.366515  0.71401    
## AGE::37      2.676471    4.16643  0.642389  0.52068    
## AGE::38      1.854545    4.17356  0.444356  0.65682    
## AGE::39      2.314815    4.17425  0.554546  0.57925    
## AGE::40      2.355932    4.17104  0.564832  0.57224    
## AGE::41      1.090909    4.16735  0.261775  0.79352    
## AGE::42      1.016393    4.16990  0.243746  0.80745    
## AGE::43      1.727273    4.17356  0.413861  0.67901    
## AGE::44      1.372549    4.17648  0.328637  0.74246    
## AGE::45     -0.189189    4.16398 -0.045435  0.96376    
## AGE::46      0.555556    4.18183  0.132850  0.89432    
## AGE::47     -0.320000    4.17729 -0.076605  0.93894    
## AGE::48     -1.285714    4.17290 -0.308111  0.75802    
## AGE::49     -2.189189    4.19165 -0.522274  0.60152    
## AGE::50     -0.872340    4.17990 -0.208699  0.83470    
## AGE::51     -2.171429    4.19480 -0.517647  0.60475    
## AGE::52     -2.191489    4.17990 -0.524292  0.60012    
## AGE::53     -2.000000    4.19652 -0.476586  0.63370    
## AGE::54     -0.333333    4.18882 -0.079577  0.93658    
## AGE::55     -1.541667    4.22142 -0.365201  0.71499    
## AGE::56     -2.459459    4.19165 -0.586752  0.55742    
## AGE::57     -1.347826    4.22509 -0.319005  0.74975    
## AGE::58     -3.100000    4.20450 -0.737305  0.46100    
## AGE::59     -4.030303    4.19833 -0.959977  0.33716    
## AGE::60     -1.473684    4.24358 -0.347274  0.72841    
## AGE::61     -3.500000    4.30502 -0.813004  0.41629    
## AGE::62     -4.833333    4.22142 -1.144954  0.25233    
## AGE::63     -2.125000    4.26343 -0.498425  0.61823    
## AGE::64     -2.947368    4.24358 -0.694548  0.48740    
## AGE::65     -3.461538    4.29227 -0.806460  0.42005    
## AGE::66     -3.692308    4.29227 -0.860224  0.38975    
## AGE::67     -4.692308    4.29227 -1.093201  0.27441    
## AGE::68     -3.000000    4.30502 -0.696861  0.48595    
## AGE::69     -3.090909    4.32005 -0.715480  0.47438    
## AGE::70     -6.181818    4.32005 -1.430961  0.15256    
## AGE::71     -2.750000    4.30502 -0.638789  0.52302    
## AGE::72     -5.000000    4.77599 -1.046903  0.29524    
## AGE::73     -5.200000    4.53090 -1.147674  0.25121    
## AGE::74     -6.000000    4.33801 -1.383123  0.16675    
## AGE::75     -4.666667    4.35986 -1.070370  0.28455    
## AGE::76     -5.166667    4.46753 -1.156493  0.24759    
## AGE::77     -3.500000    5.06570 -0.690921  0.48968    
## AGE::78     -5.571429    4.42171 -1.260017  0.20778    
## AGE::79     -6.500000    4.46753 -1.454942  0.14581    
## AGE::80     -7.000000    5.06570 -1.381841  0.16714    
## AGE::81     -7.000000    5.06570 -1.381841  0.16714    
## AGE::82     -7.000000    5.84937 -1.196710  0.23153    
## AGE::83     -5.000000    5.84937 -0.854793  0.39275    
## AGE::84     -5.500000    5.06570 -1.085733  0.27770    
## AGE::85     -2.000000    5.84937 -0.341917  0.73244    
## AGE::86     -7.000000    5.84937 -1.196710  0.23153    
## AGE::87     -7.000000    4.77599 -1.465664  0.14286    
## AGE::94     -7.000000    5.84937 -1.196710  0.23153    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 4.07788   Adj. R2: 0.25107
iplot(f_ols)

stargazer(mean_by_child, type = "text")
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                               CHBORN           
## -----------------------------------------------
## YRSCHOOL                     -0.268***         
##                               (0.009)          
##                                                
## Constant                     5.490***          
##                               (0.087)          
##                                                
## -----------------------------------------------
## Observations                   2,646           
## R2                             0.239           
## Adjusted R2                    0.239           
## Residual Std. Error      2.282 (df = 2644)     
## F Statistic          831.654*** (df = 1; 2644) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

#It matched with my initial thoughts on the negative coefficients of 0.23 between number of child born and education status. because more child born means the female will have less time for having education. It correlates to our prior plot that between mother’s education and its child born.

3.4 Now let’s add some control variables. Choose at least two control variables from the available variables in the data. What variables did you choose and what sign do you expect the coefficients on these variables to have? Why?

#I added the number of people in family and the age of them. I will expect a positive coefficient from the number of people since more people in family means a larger possibility that there will be more children in the family. and for age, a female will have more kids as she’s older because she will have a longer birth history and more time to give birth of childrens.

3.5 Add these variables to your regression model from the previous question and display your results. Do the coefficients match your expectations? Does the coefficient on education change once you add the control variables?

var_test1 <- lm(CHBORN ~ YRSCHOOL_SP + AGE + RELIGION, data=df_analysis)
summary(var_test1)
## 
## Call:
## lm(formula = CHBORN ~ YRSCHOOL_SP + AGE + RELIGION, data = df_analysis)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.662 -1.087 -0.113  1.037  8.826 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.530877   0.208444   2.547   0.0109 *  
## YRSCHOOL_SP -0.125114   0.008280 -15.110   <2e-16 ***
## AGE          0.091790   0.003223  28.483   <2e-16 ***
## RELIGION     0.024280   0.025725   0.944   0.3453    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.02 on 2642 degrees of freedom
## Multiple R-squared:  0.4047, Adjusted R-squared:  0.404 
## F-statistic: 598.7 on 3 and 2642 DF,  p-value: < 2.2e-16
stargazer(var_test1, type = "text")
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                               CHBORN           
## -----------------------------------------------
## YRSCHOOL_SP                  -0.125***         
##                               (0.008)          
##                                                
## AGE                          0.092***          
##                               (0.003)          
##                                                
## RELIGION                       0.024           
##                               (0.026)          
##                                                
## Constant                      0.531**          
##                               (0.208)          
##                                                
## -----------------------------------------------
## Observations                   2,646           
## R2                             0.405           
## Adjusted R2                    0.404           
## Residual Std. Error      2.020 (df = 2642)     
## F Statistic          598.718*** (df = 3; 2642) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

#I used RELIGION to test the variables, and we can see a positive coefficient in religion, which matched with my expectations and the cofficient on education got higher when I add the control variables. religion only has a slight impact on the Childborn.