This assignment empirically estimates a discrete choice binary logit model. Specifically, I estimate women’s decision of whether or not to engage in an extramarital affair as a function of how religious they consider themselves to be. Examining this relationship gives a better understanding of how religious beliefs affect women’s decision to have an extramarital affair.
The dataset used for the assignment contains a sample of 6366 women from the United States in the 1970s. The main dependent variable, affair, is a dummy variable for whether a woman chose to have an extramarital affair (= 1) or not (= 0) during their marriage.
After inspecting the raw data set in R, I found no missing values.
The data contain 6,366 observations of 9 variables: individual demographics such as age, religious level, educational attainment, number of children and occupation, and marriage-related variables such as marriage satisfaction, years married, and the main dummy variable indicating whether the individual has ever had an extramarital affair.
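A check of this kind can be sketched in a few lines of base R. The data frame `df_toy` below is a made-up stand-in (it is not the assignment's data and deliberately contains one `NA`), used only to show what the checks report.

```r
# Toy stand-in for the raw data; the values are invented for illustration only
df_toy <- data.frame(
  affair    = c(0, 1, 0, NA),
  age       = c(27, 32, 22, 42),
  religious = c(3, 1, 2, 4)
)

# Number of missing values in each column (all zeros would mean no missing data)
na_per_column <- colSums(is.na(df_toy))
print(na_per_column)

# Number of rows (observations) and columns (variables)
print(dim(df_toy))

# Single TRUE/FALSE answer: is there any missing value at all?
any_missing <- anyNA(df_toy)
print(any_missing)
```

Applied to the real data set, the same calls would be expected to report zero missing values per column and dimensions of 6366 by 9, as described above.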
Next, to ensure the proper interpretation of the variables’ values, I converted the education, occupation and occupation of husband variables into categorical variables. For instance, the education variable initially consisted of six different numerical values, which were linked to specific educational categories:
9 = grade school
12 = high school
14 = some college
16 = college graduate
17 = some graduate school
20 = advanced degree
Using numerical values for education might lead to incorrect interpretations in summary statistics or regression analysis, as these numbers represent categories rather than continuous numerical values. Treating them as numerical would imply a meaningful ordering or distance between categories, which is not appropriate for categorical variables. Therefore, by converting education, occupation and occupation of husband into categorical variables, we can accurately capture the different educational levels and occupations, and avoid misleading interpretations of coefficients or statistical results.
## CHANGE EDUC VAR TO CHAR ##
# transform var as char
df_affair$educ <- as.character(df_affair$educ)
# change value names
df_affair$educ[df_affair$educ == "9"] <- "grade school"
df_affair$educ[df_affair$educ == "12"] <- "highschool"
df_affair$educ[df_affair$educ == "14"] <- "some college"
df_affair$educ[df_affair$educ == "16"] <- "college graduate"
df_affair$educ[df_affair$educ == "17"] <- "some graduate school"
df_affair$educ[df_affair$educ == "20"] <- "advanced degree"
## CHANGE OCCUPATION VAR TO CHAR ##
# transform var as char
df_affair$occupation <- as.character(df_affair$occupation)
df_affair$occupation_husb <- as.character(df_affair$occupation_husb)
# change value names
df_affair$occupation[df_affair$occupation == "1"] <- "student"
df_affair$occupation[df_affair$occupation == "2"] <- "unskilled or semiskilled"
df_affair$occupation[df_affair$occupation == "3"] <- "whitecollar"
df_affair$occupation[df_affair$occupation == "4"] <- "skilled"
df_affair$occupation[df_affair$occupation == "5"] <- "business"
df_affair$occupation[df_affair$occupation == "6"] <- "professional"
###
df_affair$occupation_husb[df_affair$occupation_husb == "1"] <- "student"
df_affair$occupation_husb[df_affair$occupation_husb == "2"] <- "unskilled or semiskilled"
df_affair$occupation_husb[df_affair$occupation_husb == "3"] <- "whitecollar"
df_affair$occupation_husb[df_affair$occupation_husb == "4"] <- "skilled"
df_affair$occupation_husb[df_affair$occupation_husb == "5"] <- "business"
df_affair$occupation_husb[df_affair$occupation_husb == "6"] <- "professional"
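As a more compact alternative to the value-by-value recoding above, the same mapping can be expressed with `factor()`. This is only a sketch: `educ_codes` is a hypothetical miniature of the education column, not the assignment's data.

```r
# Hypothetical miniature of the education column, using the codes listed earlier
educ_codes <- c(12, 14, 9, 20, 16, 17, 14)

# Map each numeric code to its label in a single step
educ_factor <- factor(
  educ_codes,
  levels = c(9, 12, 14, 16, 17, 20),
  labels = c("grade school", "highschool", "some college",
             "college graduate", "some graduate school", "advanced degree")
)

# Frequency of each education category
print(table(educ_factor))

# The first level is the reference category when the factor enters a regression
print(levels(educ_factor)[1])
```

One design consideration: when a character column is passed to a model, R converts it to a factor with alphabetically sorted levels, so the reference category is simply the first label in alphabetical order. An explicit `factor()` call lets the reference category be chosen deliberately.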
The following table displays the summary statistics of the quantitative variables of the Extramarital Affair data set:
## SUMMARY STATISTICS ##
# load required packages
library(stargazer)
# labels for table
labels <- c('Affair (1=yes,0=no)',
'Marriage Satisfaction (1=very poor,5=very good)',
'Age',
'Years Married',
'Number of Children',
'Religious Level (1=not,4=strongly)'
)
# descriptive stats table
stargazer(df_affair,
type = "text",
title = "Summary Statistics Table of Extramarital Affair Dataset",
summary.stat = c("mean","median","sd","min","max"),
digits = 1,
covariate.labels = labels
)
##
## Summary Statistics Table of Extramarital Affair Dataset
## ======================================
## Statistic Mean Median St. Dev. Min Max
## ======================================
Most women in the sample have never had an affair: the median of the affair variable is 0 (“never had an affair”) and the mean is 0.3, so about three in ten responses equal 1. The graph below confirms this, showing that 67.8% of women have never had an affair:
Furthermore, a large majority of women report being satisfied with their marriage. Both the mean and the median of the marriage satisfaction rating are 4, so responses cluster toward the top of the 1 to 5 scale. Inspecting the distribution in the graph below shows that the most common rating, chosen by 42.2% of women, is 5 (very good):
Moreover, the age distribution of women in the sample ranges from approximately 18 to over 42 years, with a mean age of around 29 years and a median age of 27 years. Given that the sample was collected in the 1970s, it reflects a time when women typically married and had children at a younger age. Consequently, the age range in the sample is not excessively young and is suitable for the analysis.
Every woman in the sample is married, since years married lies between 0.5 and 23 years, with a mean of 9 years and a median of 6 years. The standard deviation of 7.3 years is large relative to the mean, so marriage durations are widely dispersed.
The number of children ranges from 0 to more than 5, with the mean and median around 1. This is lower than we might expect, given that the sample was collected in the 1970s, when women had children earlier and more frequently, as mentioned above.
Women in the sample also consider themselves mostly mildly religious with the mean and median at around 2. Based on the bar chart below, 38% consider themselves fairly religious, and 35.6% consider themselves mildly religious:
Most women in the sample have a high school degree or attended some college, as displayed in the following graph:
Additionally, 42.7% of them are white-collar workers and only 0.6% are students, which is notable given that some women in the sample are as young as 18 years old. However, since the sample comes from the 1970s, these observations are plausible. The graph below shows women’s occupations:
Finally, most husbands are either skilled workers (31.9%) or in business (27.9%), which could suggest that some women in the sample do not work and take care of the children instead:
Lastly, most variables in the sample are not highly correlated with each other. There are a few exceptions: age and years married are very strongly and positively correlated, meaning that older women tend to have been married longer. Number of children is also strongly and positively correlated with both age and years married. Education and occupation are positively correlated as well, although less strongly than we would expect. The correlation plot below depicts the relationships between the variables in the sample:
A discrete choice binary logit model was used to estimate the decision by women of whether to engage in an extramarital affair or not based on how religious they believe they are, controlling for individual demographics and marriage-related variables that are correlated with religious level and affect whether they engaged in an affair or not.
The equation for the model is
\[ P(\text{Affair}_i = 1 \mid \text{Religiosity}_i, \alpha_i, \gamma_i) = \frac{1}{1+e^{-(\beta_0 + \beta_1 \text{Religiosity}_i + \beta_2 \alpha_i + \beta_3 \gamma_i)}} \]
where \(\alpha\) includes the individual demographic controls, such as age, number of children and occupation, which can affect whether a woman has an affair and may be correlated with how religious she is. Additionally, \(\gamma\) includes the marriage-related controls, such as marriage rating, which affect the dependent variable, affair, and are correlated with the main independent variable, religiosity. I chose not to include years married because it is highly correlated with both age and number of children.
To visualize and capture the effects of adding controls, I constructed three models, progressively incorporating additional variables as controls. Logit coefficients have no simple interpretation in terms of probabilities, so it is best to consider predicted probabilities or differences in predicted probabilities.
##
## Summary Statistics of Discrete Choice Logit Models
## ======================================================================
## Dependent variable:
## --------------------------------
## affair
## (1) (2) (3)
## ----------------------------------------------------------------------
## How Religious -0.319*** -0.407*** -0.371***
## (0.031) (0.033) (0.035)
##
## Age 0.025*** 0.021***
## (0.005) (0.006)
##
## Number of Children 0.203*** 0.175***
## (0.026) (0.027)
##
## Occupation: Professional -0.110 -0.100
## (0.219) (0.230)
##
## Occupation: Skilled -0.666*** -0.669***
## (0.094) (0.099)
##
## Occupation: Student -1.127*** -1.092**
## (0.430) (0.441)
##
## Occupation: Semi-skilled or Unskilled -0.571*** -0.673***
## (0.110) (0.116)
##
## Occupation: White Collar -0.237*** -0.327***
## (0.087) (0.092)
##
## Marriage Satisfaction Rating -0.705***
## (0.031)
##
## Constant 0.018 -0.427** 2.526***
## (0.078) (0.174) (0.224)
##
## ----------------------------------------------------------------------
## Observations 6,366 6,366 6,366
## Log Likelihood -3,948.996 -3,794.169 -3,509.196
## Akaike Inf. Crit. 7,901.993 7,606.338 7,038.393
## ======================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
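Three nested specifications like those in the table above could be estimated along the following lines. This is a minimal sketch on simulated data: the data frame `sim`, its data-generating coefficients, and the omission of the occupation dummies are all assumptions made for brevity, so the numbers it produces are not comparable to the table.

```r
set.seed(1)
n <- 500

# Simulated stand-in data; variable names mirror the report, values are invented
sim <- data.frame(
  religious     = sample(1:4, n, replace = TRUE),
  age           = runif(n, 18, 42),
  children      = sample(0:5, n, replace = TRUE),
  rate_marriage = sample(1:5, n, replace = TRUE)
)

# Assumed data-generating process for the binary outcome
eta <- 2 - 0.4 * sim$religious + 0.02 * sim$age +
  0.2 * sim$children - 0.7 * sim$rate_marriage
sim$affair <- rbinom(n, 1, plogis(eta))

# Model 1: religiosity only
m1 <- glm(affair ~ religious,
          family = binomial(link = "logit"), data = sim)
# Model 2: adds demographic controls
m2 <- glm(affair ~ religious + age + children,
          family = binomial(link = "logit"), data = sim)
# Model 3: adds the marriage-related control
m3 <- glm(affair ~ religious + age + children + rate_marriage,
          family = binomial(link = "logit"), data = sim)

# Log-likelihood of each model; it cannot fall as regressors are added
loglik <- sapply(list(m1, m2, m3), function(m) as.numeric(logLik(m)))
print(loglik)
```

Because each model nests the previous one, the maximized log-likelihood can only stay the same or rise as controls are added, which is the pattern visible in the table.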
The first model is the raw estimate with no controls. The coefficient on religiosity is negative (-0.319), which tells us that religiosity has a negative relationship with the log-odds of ever having an affair (affair = 1). The model’s log-likelihood (-3,949.0) is the lowest of the three models, meaning that it fits the data worst. Plotting the fitted logit curve, we observe a shallow S-curve, which indicates that religiosity alone only weakly separates the two outcomes.
The second model adds individual demographics: age, number of children and occupation. The coefficient becomes more negative (-0.407), and the log-likelihood rises to -3,794.2, indicating a better fit.
The third model’s coefficient is still negative at -0.371, and the log-likelihood rises further to -3,509.2. The negative coefficient indicates that when a woman’s religious level increases by one unit, the log-odds of having an affair decrease by 0.371, holding the other variables constant. Intuitively, the coefficient makes sense, as there is a known connection between religion and marriage. Individuals who are religious may be less inclined to engage in extramarital affairs because of their beliefs and values regarding fidelity and commitment within marriage. The negative coefficient therefore reinforces the notion that religious beliefs and moral principles may influence individuals’ behavior in their marital relationships.
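The improvement in fit across the three models can also be checked formally with likelihood-ratio statistics, computed directly from the log-likelihoods reported in the regression table. This is a supplementary sketch rather than part of the original analysis; the parameter counts in `k` are my own count of the coefficients listed in the table.

```r
# Log-likelihoods as reported in the regression table
ll <- c(m1 = -3948.996, m2 = -3794.169, m3 = -3509.196)

# Number of estimated coefficients per model, counted from the table
# (assumption: intercept + religiosity; + age, children, 5 occupation dummies;
#  + marriage satisfaction)
k <- c(m1 = 2, m2 = 9, m3 = 10)

# Likelihood-ratio statistics for each pair of nested models
lr_12 <- 2 * (ll[["m2"]] - ll[["m1"]])
lr_23 <- 2 * (ll[["m3"]] - ll[["m2"]])

# Chi-squared p-values, with degrees of freedom equal to the added parameters
p_12 <- pchisq(lr_12, df = k[["m2"]] - k[["m1"]], lower.tail = FALSE)
p_23 <- pchisq(lr_23, df = k[["m3"]] - k[["m2"]], lower.tail = FALSE)

# AIC recomputed from the log-likelihoods as a consistency check
aic <- -2 * ll + 2 * k

print(c(lr_12 = lr_12, lr_23 = lr_23))
print(c(p_12 = p_12, p_23 = p_23))
print(aic)
```

The statistics come out at roughly 309.7 (7 degrees of freedom) and 569.9 (1 degree of freedom), and the recomputed AIC values agree with the table up to rounding, so each set of controls improves the fit by far more than chance alone would explain.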
Based on the coefficient test for model 3, we find that the relation between the probability of affair and religious level is negative, and that the corresponding coefficient is highly significant:
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.5257680 0.2260772 11.1721 < 2.2e-16 ***
## religious -0.3709582 0.0343006 -10.8149 < 2.2e-16 ***
## age 0.0209652 0.0057654 3.6364 0.0002765 ***
## children 0.1750653 0.0276246 6.3373 2.338e-10 ***
## occupationprofessional -0.0995960 0.2330537 -0.4274 0.6691229
## occupationskilled -0.6685823 0.0977567 -6.8392 7.961e-12 ***
## occupationstudent -1.0917573 0.4515056 -2.4180 0.0156045 *
## occupationunskilled or semiskilled -0.6734314 0.1165257 -5.7793 7.503e-09 ***
## occupationwhitecollar -0.3274667 0.0914739 -3.5799 0.0003437 ***
## rate_marriage -0.7048972 0.0320613 -21.9859 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
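Since log-odds are hard to read directly, the religiosity estimate from the coefficient test above can be converted into an odds ratio with an approximate 95% confidence interval. The sketch below uses only the estimate and standard error printed in that output; the Wald-type interval is an assumption about how one might summarize the uncertainty.

```r
# Estimate and standard error for `religious`, copied from the z test output
beta_religious <- -0.3709582
se_religious   <- 0.0343006

# Odds ratio for a one-unit increase in religious level
odds_ratio <- exp(beta_religious)

# Wald 95% confidence interval on the log-odds scale, then exponentiated
ci_95 <- exp(beta_religious + c(-1, 1) * qnorm(0.975) * se_religious)

# z statistic, which should reproduce the value in the output
z_value <- beta_religious / se_religious

print(odds_ratio)
print(ci_95)
print(z_value)
```

The odds ratio is roughly 0.69, so each additional step on the religiosity scale is associated with about 31% lower odds of having had an affair, with an interval of roughly 0.65 to 0.74.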
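Furthermore, to evaluate the performance of logit model 3, I used a confusion matrix, which yielded moderate results. The model’s accuracy, the proportion of correct predictions among all observations, is 72%, only slightly above the no-information rate of 67.8% obtained by always predicting “no affair”. Note that the positive class in the output is 0, so the sensitivity of 0.90 refers to women who did not have an affair, while the specificity of 0.34 shows that the model correctly identifies only about a third of the women who did.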
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3891 1362
## 1 422 691
##
## Accuracy : 0.7198
## 95% CI : (0.7086, 0.7308)
## No Information Rate : 0.6775
## P-Value [Acc > NIR] : 1.578e-13
##
## Kappa : 0.2713
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9022
## Specificity : 0.3366
## Pos Pred Value : 0.7407
## Neg Pred Value : 0.6208
## Prevalence : 0.6775
## Detection Rate : 0.6112
## Detection Prevalence : 0.8252
## Balanced Accuracy : 0.6194
##
## 'Positive' Class : 0
##
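The headline statistics in this output can be reproduced by hand from the four cell counts, which makes explicit what each one measures. The sketch below rebuilds the matrix from the counts printed above and treats 0 as the positive class, as the output does.

```r
# Cell counts copied from the confusion matrix (rows: prediction, cols: reference)
cm <- matrix(
  c(3891, 422, 1362, 691),
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

# Share of all observations that are predicted correctly
accuracy <- sum(diag(cm)) / sum(cm)

# With 0 as the positive class: share of actual 0s predicted as 0
sensitivity <- cm["0", "0"] / sum(cm[, "0"])

# Share of actual 1s (women who had an affair) predicted as 1
specificity <- cm["1", "1"] / sum(cm[, "1"])

# Accuracy of always predicting the most common class
no_information_rate <- max(colSums(cm)) / sum(cm)

# Average of sensitivity and specificity
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, nir = no_information_rate,
              balanced = balanced_accuracy), 4))
```

Read this way, the 72% accuracy is driven mostly by correct predictions of "no affair"; the balanced accuracy of about 0.62 gives a less flattering picture of how well the model detects affairs.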
Overall, as we gradually include more controls in each model, the predicted probabilities of having an affair become more precise. The following graphs illustrate the predicted probabilities for each model:
Marginal effects are usually preferred to coefficients as they provide a more intuitive interpretation of the impact of each predictor on the probability of the outcome. They represent the change in the predicted probability of the outcome for a one-unit change in the predictor, while holding other variables constant.
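For a logit model, the marginal effect of a continuous regressor on the probability is the coefficient multiplied by \(p(1-p)\), so an average marginal effect can be computed by averaging that factor over the sample. The sketch below illustrates the calculation on simulated data; `sim` and its coefficients are assumptions, and the method shown is not necessarily the routine used to produce the figures in this report.

```r
set.seed(42)
n <- 1000

# Simulated stand-in data; names mirror the report, values are invented
sim <- data.frame(
  religious     = sample(1:4, n, replace = TRUE),
  rate_marriage = sample(1:5, n, replace = TRUE)
)
sim$affair <- rbinom(
  n, 1, plogis(2.5 - 0.4 * sim$religious - 0.7 * sim$rate_marriage)
)

fit <- glm(affair ~ religious + rate_marriage,
           family = binomial(link = "logit"), data = sim)

# Fitted probabilities for every observation
p_hat <- fitted(fit)

# Average marginal effect: coefficient times the sample mean of p * (1 - p)
ame <- coef(fit)[-1] * mean(p_hat * (1 - p_hat))

print(coef(fit)[-1])  # effects on the log-odds scale
print(ame)            # effects on the probability scale
```

Because \(p(1-p)\) never exceeds 0.25, an average marginal effect always has the same sign as its coefficient and is at most a quarter of its size, which is why the marginal effects reported below are much smaller in magnitude than the logit coefficients.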
Based on model 1, the average marginal effect of -0.069 means that each one-unit increase in religious level lowers the predicted probability of having an affair by about 6.9 percentage points, ceteris paribus. When individual demographics are included in model 2, the average marginal effect moves to -0.083, which indicates that controlling for demographics strengthens the estimated relationship between religiosity and the likelihood of having an affair. Finally, in model 3, the average marginal effect returns to -0.069, matching the estimate from model 1, which suggests that adding the marriage-related control alongside the demographics did not substantially alter the relationship between women’s religious level and the likelihood of having an affair relative to the raw estimate.
This result implies that religious level remains a relevant and influential factor in explaining the probability of having an affair, even after accounting for the effects of individual demographics and other marriage-related variables.
| Variables | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| Religiosity | -0.069 | -0.083 | -0.069 |
| Age | n/a | 0.005 | 0.0034 |
| Number of Children | n/a | 0.041 | 0.034 |
| Marriage Satisfaction | n/a | n/a | -0.13 |
| Occupation: Professional | n/a | -0.025 | -0.02 |
| Occupation: Skilled | n/a | -0.139 | -0.128 |
| Occupation: Student | n/a | -0.215 | -0.195 |
| Occupation: White Collar | n/a | -0.053 | -0.129 |
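Finally, based on the above marginal effects for model 3, the occupation categories student (-0.195), white collar (-0.129) and skilled (-0.128), together with marriage satisfaction (-0.13), have the largest marginal effects in absolute value. Because the occupation effects are negative and measured relative to the omitted occupation category (business), they imply that women in the sample who are students, white-collar workers or skilled workers are less likely to have had an affair than women in business.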
This paper provides evidence that being more religious decreases women’s likelihood of having an extramarital affair. The negative coefficient of -0.371 determines the downward slope of the S-curve, and the average marginal effect of -0.069 summarizes the overall effect of religiosity on the probability of having an extramarital affair.