## EMPTY FUNCTIONS AND VAR IN THE ENV TAB/WINDOW ##
# Clear the workspace
rm(list = ls()) # Clear environment
gc() # Clear unused memory
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 530298 28.4 1180437 63.1 NA 669400 35.8
## Vcells 983050 7.6 8388608 64.0 16384 1851654 14.2
cat("\f") # Clear the console
## IMPORT DATA ##
# Load required package
library(readxl)
# Read excel file
Fertility <- read_excel("Downloads/Take-home assignment #2-6/Fertility.xls")
# View(Fertility)
# Store file in data frame
df_fertility <- as.data.frame(Fertility)
# 30000 observations, 9 variables
In this assignment, I explore the causal relationship between fertility and female labor supply; that is, how much a woman's labor supply falls when she has an additional child. I first use OLS regression, then add an instrumental variable, same-sex siblings, and compare the results. I estimate this effect using data on married women from the 1980 U.S. Census.
After importing the data into RStudio, the software I used to conduct the analysis, I cleaned the fertility data frame by giving the variables more descriptive names. The data set contains 30,000 observations of the following 9 variables:
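kids – dummy variable for number of children (1 = more than 2 children; every mother in the sample has at least two, since the sex of the second child is recorded for all observations, although the table labels read "more than 1")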
first_boy – dummy variable for sex of first child (1 = boy)
second_boy – dummy variable for sex of second child (1 = boy)
kids_samesex – dummy variable for sex of first two children (1 = same sex)
age_mom – age of mother during census
mom_black – dummy variable for race of mother (1 = black)
mom_hispanic – dummy variable for race of mother (1 = Hispanic)
mom_other – dummy variable for race of mother (1 = not black, Hispanic or white)
weeks_worked – weeks worked by mother in 1979
## CHANGE VARIABLE NAMES ##
# Load required package
library(dplyr)
# Rename all variables in a single call (new_name = old_name)
df_fertility <- df_fertility %>%
  rename(
    kids         = morekids, # morekids -> kids
    first_boy    = boy1st,   # boy1st -> first_boy
    second_boy   = boy2nd,   # boy2nd -> second_boy
    kids_samesex = samesex,  # samesex -> kids_samesex
    age_mom      = agem1,    # agem1 -> age_mom
    mom_black    = black,    # black -> mom_black
    mom_hispanic = hispan,   # hispan -> mom_hispanic
    mom_other    = othrace,  # othrace -> mom_other
    weeks_worked = weeksm1   # weeksm1 -> weeks_worked
  )
Additionally, there are no missing values in the data set.
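The check for missing values can be sketched as follows. This is a minimal illustration, not necessarily the code used for the assignment: `count_missing` is a hypothetical helper name, and the `toy` data frame is made up for demonstration and is not the census data.

```r
# Hypothetical helper: count missing values in each column of a data frame
count_missing <- function(df) {
  colSums(is.na(df))
}

# Made-up data for illustration only (not the census data)
toy <- data.frame(weeks_worked = c(0, 52, NA), kids = c(1, 0, 1))
count_missing(toy)
```

Applied to the real data, `count_missing(df_fertility)` returning zero for every column would correspond to the statement above.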
The following table summarizes the key features of the data:
##
## Summary Statistics of Fertility Data Frame
## =========================================================================================
## Statistic                                                Mean  St. Dev.  Median  Min  Max
## -----------------------------------------------------------------------------------------
## Number of Children (1=more than 1)                      0.378     0.485       0    0    1
## First Child (1=boy)                                     0.515     0.500       1    0    1
## Second Child (1=boy)                                    0.506     0.500       1    0    1
## Sex of Children (1=same sex)                            0.503     0.500       1    0    1
## Age of Mother                                          30.354     3.383      31   21   35
## Race of Mother: Black (1=black)                         0.053     0.225       0    0    1
## Race of Mother: Hispanic (1=hispanic)                   0.074     0.262       0    0    1
## Race of Mother: Other (1=not black, hispanic or white)  0.057     0.232       0    0    1
## Weeks Worked of Mother (in 1979)                       19.210    21.941       6    0   52
## -----------------------------------------------------------------------------------------
The age of mothers in the data set spans from 21 to 35 years, with a mean of about 30.4 and a median of 31 years, so the distribution is roughly symmetric. The weeks worked by mothers range from 0 to 52. Interestingly, the median is 6 weeks while the mean is approximately 19 weeks, indicating a right-skewed distribution. Finally, most variables in the data set are dummy variables that take the value 0 or 1, so I generated additional graphs to visualize these variables for the population under analysis.
According to the bar chart below, the race of most mothers is categorized as “unknown,” which results from a value of 0 on all three race variables (mom_black, mom_hispanic, and mom_other). Since mom_other represents mothers who are neither black, Hispanic, nor white, the mothers in the “unknown” category are probably predominantly white.
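The "unknown" category described above can be derived from the three dummies along these lines. This is only a sketch: `race_category` is a hypothetical helper, the `toy` data frame is made up, and the order of precedence when more than one dummy equals 1 (black, then Hispanic, then other) is an assumption rather than something the data description specifies.

```r
# Hypothetical helper: collapse the three race dummies into one label.
# A mother with all three dummies equal to 0 is labelled "unknown".
# Assumption: if several dummies equal 1, the first match below wins.
race_category <- function(df) {
  ifelse(df$mom_black == 1, "black",
    ifelse(df$mom_hispanic == 1, "hispanic",
      ifelse(df$mom_other == 1, "other", "unknown")))
}

# Made-up data for illustration only (not the census data)
toy <- data.frame(
  mom_black    = c(1, 0, 0, 0),
  mom_hispanic = c(0, 1, 0, 0),
  mom_other    = c(0, 0, 1, 0)
)
table(race_category(toy))
```

A bar chart of the categories could then be drawn with `barplot(table(race_category(df_fertility)))`.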
Additionally, the graph titled "Number of Kids" shows that the majority of mothers in the sample (about 62%, since the mean of kids is 0.378) fall in the kids = 0 category.
The next graph shows that the number of mothers whose first two children are of the same sex is almost equal to the number whose first two children are of opposite sexes (the mean of kids_samesex is 0.503).
Finally, the variable weeks_worked is central to the analysis, since we are trying to measure the causal effect of fertility on labor supply. It is the dependent variable that we intend to explain. In the histogram below, the distribution of this variable is bimodal, with peaks at 0 and around 50 weeks, which could indicate that the sample contains two distinct groups. For instance, some mothers may have a stronger support system, such as a nanny or partner, that allows them to work more weeks a year. Other mothers may have a partner who works full time, or younger children to take care of, so they work fewer weeks.
To begin with, I ran OLS regressions of weeks_worked on the variable kids to estimate the relationship between fertility and female labor supply. Based on the variables available in the fertility data frame, I ran two OLS regressions. The first is the raw estimate, with no controls. In the second model, I included controls for the mother's age and race/ethnicity, as these variables can be related to female labor supply and correlated with the number of children. The table below presents both OLS models and their coefficients:
##
## Summary Statistics Table of OLS models
## ========================================================================================
##                                                     Dependent variable:
##                                    -----------------------------------------------------
##                                                        weeks_worked
##                                           no controls                 controls
##                                               (1)                        (2)
## ----------------------------------------------------------------------------------------
## Number of Children (1=more than 1)         -6.008***                  -6.896***
##                                             (0.259)                    (0.258)
##
## Age of Mother                                                          0.842***
##                                                                        (0.037)
##
## Race of Mother (1=black)                                              11.534***
##                                                                        (0.553)
##
## Race of Mother (1=hispanic)                                             -0.247
##                                                                        (0.521)
##
## Race of Mother (1=other)                                               3.335***
##                                                                        (0.589)
##
## Constant                                   21.478***                  -4.523***
##                                             (0.159)                    (1.123)
##
## ----------------------------------------------------------------------------------------
## Observations                                 30,000                     30,000
## R2                                           0.018                      0.048
## Adjusted R2                                  0.018                      0.048
## Residual Std. Error                   21.747 (df = 29998)        21.413 (df = 29994)
## F Statistic                        538.156*** (df = 1; 29998) 300.292*** (df = 5; 29994)
## ========================================================================================
## Note:                                                        *p<0.1; **p<0.05; ***p<0.01
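The two specifications reported above could be fitted along the following lines. This is a sketch under stated assumptions rather than the original code: `fit_ols` is a hypothetical helper, and the `sim` data frame is simulated purely so that the example runs on its own; it does not reproduce the census results.

```r
# Hypothetical helper: fit the two OLS specifications from the table
fit_ols <- function(df) {
  list(
    # (1) raw estimate, no controls
    no_controls = lm(weeks_worked ~ kids, data = df),
    # (2) controls for the mother's age and race/ethnicity
    controls = lm(weeks_worked ~ kids + age_mom + mom_black +
                    mom_hispanic + mom_other, data = df)
  )
}

# Simulated data for illustration only (not the census data)
set.seed(1)
n <- 500
sim <- data.frame(
  kids         = rbinom(n, 1, 0.4),
  age_mom      = sample(21:35, n, replace = TRUE),
  mom_black    = rbinom(n, 1, 0.05),
  mom_hispanic = rbinom(n, 1, 0.07),
  mom_other    = rbinom(n, 1, 0.06)
)
sim$weeks_worked <- 20 - 6 * sim$kids + 0.8 * (sim$age_mom - 30) + rnorm(n, sd = 5)

models <- fit_ols(sim)
summary(models$controls)
```

With the real data, `fit_ols(df_fertility)` would return the models whose coefficients appear in columns (1) and (2).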
In OLS model 1 (no controls), mothers with kids = 1 work about 6 fewer weeks per year than mothers with kids = 0. The coefficient is statistically significant at the 1% level. However, the R-squared is very low (0.018), which indicates that kids explains only a small proportion of the variance in weeks worked, and the residual standard error (21.7 weeks) means that the fitted values deviate noticeably from the observed values. The sign of the coefficient makes sense: a mother with more children likely has less time to work, which decreases female labor supply. Note, however, that a low R-squared and a high residual standard error describe how well the model fits, not whether the coefficient is biased; the concern about bias comes from the possibility that kids is correlated with unobserved determinants of labor supply.
In model 2, which adds the controls for age and race, mothers with kids = 1 work almost 7 fewer weeks, ceteris paribus. This is again statistically significant at the 1% level. The R-squared increases only slightly (to 0.048) and the residual standard error decreases only slightly (to 21.4), so the controls add little explanatory power, and the estimate may still be biased if kids is endogenous.
I believe that the OLS coefficient estimates on kids suffer from simultaneity bias, meaning that reverse causation biases the results. In this exercise, we want to measure how fertility (number of kids) affects female labor supply (weeks worked in a year). However, female labor supply can also affect the number of kids a woman has. For instance, women who work more weeks in a year might delay having children or have fewer children because of career considerations or other factors related to their work-life balance. Simultaneity bias is therefore likely in this model.
To remove the bias in the OLS model, we can use instrumental variable regression. Two conditions must hold for an instrumental variable to be valid. The first is relevance: the instrument must be correlated with the endogenous regressor. In this case, if we use kids_samesex as the instrument, it should be correlated with the variable kids. We can argue that parents whose first two children are of the same sex are more likely to have an additional child, because a mix of sexes is typically preferred (Angrist & Evans, 1998). The second condition is exogeneity: the instrument must be uncorrelated with the unobserved determinants of the dependent variable, so that it affects weeks_worked only through its impact on kids. In this case, the sex mix of the first two children is effectively randomly assigned, so it is not caused by the mother's fertility or labor supply decisions. Additionally, the sex of the children plausibly does not directly influence the number of weeks worked by the mother. Overall, kids_samesex appears to be a conceptually valid instrument.
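The relevance condition can be checked empirically with a first-stage regression of kids on kids_samesex. The sketch below is illustrative only: `first_stage_f` is a hypothetical helper, and the `sim` data are simulated with an arbitrarily chosen effect size so that the example runs on its own. A common rule of thumb treats a first-stage F-statistic below about 10 as a sign of a weak instrument.

```r
# Hypothetical helper: F-statistic of the first-stage regression
# of the endogenous regressor (kids) on the instrument (kids_samesex)
first_stage_f <- function(df) {
  fs <- lm(kids ~ kids_samesex, data = df)
  unname(summary(fs)$fstatistic["value"])
}

# Simulated data for illustration only (not the census data);
# the 0.2 effect of kids_samesex on kids is an arbitrary assumption
set.seed(2)
n <- 2000
sim <- data.frame(kids_samesex = rbinom(n, 1, 0.5))
sim$kids <- rbinom(n, 1, 0.3 + 0.2 * sim$kids_samesex)

first_stage_f(sim)
```

Exogeneity, by contrast, cannot be tested with a single instrument and has to be argued conceptually, as in the paragraph above.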
Using R, I then applied the 2SLS IV estimator with same-sex siblings as the instrument to estimate the causal effect of having more kids on weeks worked. The IV point estimates resemble the OLS estimates: in IV model 2, which controls for the mother's age and the black dummy, kids = 1 decreases the number of weeks worked by about 5.7 weeks, ceteris paribus. However, this estimate is not statistically significant. The table below presents the IV models:
##
## Summary Statistics Table of IV model
## ==========================================================================
##                                              Dependent variable:
##                                    ---------------------------------------
##                                                 weeks_worked
##                                            (1)                 (2)
## --------------------------------------------------------------------------
## Number of Children (1=more than 1)       -6.033              -5.741
##                                          (3.758)             (3.671)
##
## Age of Mother                                               0.818***
##                                                              (0.068)
##
## Ethnicity of Mother: Black                                  11.247***
##                                                              (0.636)
##
## Constant                                21.488***           -4.062***
##                                          (1.425)             (1.182)
##
## --------------------------------------------------------------------------
## Observations                              30,000              30,000
## R2                                        0.018               0.046
## Adjusted R2                               0.018               0.046
## Residual Std. Error                21.747 (df = 29998) 21.432 (df = 29996)
## ==========================================================================
## Note:                                          *p<0.1; **p<0.05; ***p<0.01
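The logic of the 2SLS estimator without controls, as in column (1), can be sketched in base R. This is an illustration under stated assumptions, not the code that produced the table: `tsls_manual` and `wald_iv` are hypothetical helpers, and the `sim` data are simulated with an unobserved factor `u` that affects both fertility and labor supply.

```r
# Hypothetical helper: 2SLS by hand with one instrument and no controls
tsls_manual <- function(df) {
  # Stage 1: predict the endogenous regressor from the instrument
  stage1 <- lm(kids ~ kids_samesex, data = df)
  df$kids_hat <- fitted(stage1)
  # Stage 2: regress the outcome on the predicted regressor
  stage2 <- lm(weeks_worked ~ kids_hat, data = df)
  unname(coef(stage2)["kids_hat"])
}

# Equivalent ratio form: cov(outcome, instrument) / cov(regressor, instrument)
wald_iv <- function(df) {
  cov(df$weeks_worked, df$kids_samesex) / cov(df$kids, df$kids_samesex)
}

# Simulated data for illustration only (not the census data)
set.seed(3)
n <- 5000
z <- rbinom(n, 1, 0.5) # instrument
u <- rnorm(n)          # unobserved confounder
sim <- data.frame(kids_samesex = z)
sim$kids <- as.numeric(0.2 * z + 0.5 * u + rnorm(n) > 0.3)
sim$weeks_worked <- 20 - 6 * sim$kids + 5 * u + rnorm(n, sd = 5)

c(tsls = tsls_manual(sim), wald = wald_iv(sim))
```

The two helpers return the same point estimate. The standard errors reported by the second-stage `lm()` are not valid for inference, so a dedicated routine such as `ivreg()` from the AER package, called as `ivreg(weeks_worked ~ kids | kids_samesex, data = df_fertility)`, would be the usual choice; whether that package was used here is not stated.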
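Even though IV estimation was used, the estimate should be interpreted with caution. Its standard error (about 3.7) is roughly 14 times larger than the OLS standard error (about 0.26), so the IV estimate is imprecise, and bias could still remain if the instrument is not fully exogenous or if relevant variables are omitted. In other words, in this sample the instrument does not yield a clearly more reliable estimate than OLS.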
Overall, given that weeks worked ranges from 0 to 52, a decrease of approximately 6 weeks may be considered economically meaningful. Prior research shows that an increase in the number of children decreases female labor supply (Angrist & Evans, 1998). However, because the IV coefficient is not statistically significant, this sample provides insufficient evidence of a causal effect of fertility on weeks worked, even when same-sex siblings is used as the instrument.