Problem 1: Reshaping, analyzing, and visualizing data

Let’s work a bit more with the data set on resumes from (Oreopoulos, 2011) that we discussed in the last recitation.

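First, read the Oreopoulos data into R.

library(readstata13)  # provides read.dta13()
library(tidyverse)    # dplyr verbs and pipe, tidyr::pivot_wider()
setwd("~/Desktop/Data Analysis")
data <- read.dta13("~/Desktop/Data Analysis/HW3/oreopoulos.dta")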
  1. Find all the unique combinations of occupation types and ethnicity dyads. Find the total number of callbacks for each occupation type and ethnicity combination, and present the result in a table sorted by count.
summary_table <- data %>%
  group_by(occupation_type, name_ethnicity) %>%
  summarize(total_callbacks = sum(callback, na.rm = TRUE) 
            + sum(second_callback, na.rm = TRUE), .groups = 'drop') %>%
  arrange(desc(total_callbacks))

#install.packages("kableExtra")
library(kableExtra)

summary_table %>%
  kable("html", 
        caption = "Total Callbacks for Each Occupation Type and Ethnicity Combination", 
        col.names = c("Occupation Type", "Ethnicity", "Total Callbacks")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Total Callbacks for Each Occupation Type and Ethnicity Combination

| Occupation Type | Ethnicity | Total Callbacks |
|---|---|---|
| Finance | Canada | 112 |
| Marketing and Sales | Canada | 85 |
| Retail | Canada | 74 |
| Finance | Indian | 66 |
| Retail | Indian | 60 |
| Administrative | Canada | 58 |
| Finance | Chinese | 57 |
| Marketing and Sales | Indian | 48 |
| Retail | Chinese | 48 |
| Marketing and Sales | Chinese | 47 |
| Programmer | Canada | 45 |
| Programmer | Indian | 42 |
| Programmer | Chinese | 36 |
| Insurance | Indian | 34 |
| Administrative | Chinese | 32 |
| Insurance | Canada | 32 |
| Marketing and Sales | British | 32 |
| Administrative | Indian | 27 |
| Retail | British | 27 |
| Retail | Chn-Cdn | 27 |
| Marketing and Sales | Chn-Cdn | 25 |
| Marketing and Sales | Pakistani | 24 |
| Executive Assisstant | Canada | 23 |
| Civil Engineer | Canada | 22 |
| Clerical | Indian | 17 |
| Electrical Engineer | Canada | 17 |
| Finance | Chn-Cdn | 17 |
| Clerical | Chinese | 16 |
| Finance | British | 16 |
| Insurance | Chinese | 16 |
| Marketing and Sales | Greek | 16 |
| Administrative | British | 15 |
| Retail | Pakistani | 15 |
| Programmer | Chn-Cdn | 14 |
| Administrative | Chn-Cdn | 13 |
| Civil Engineer | Indian | 12 |
| Clerical | Canada | 12 |
| Insurance | Chn-Cdn | 12 |
| Civil Engineer | Chinese | 11 |
| Executive Assisstant | Indian | 11 |
| Technology | Canada | 10 |
| Accounting | Chinese | 9 |
| Administrative | Greek | 9 |
| Executive Assisstant | Chinese | 8 |
| Finance | Pakistani | 8 |
| Human Resources Payroll | Canada | 8 |
| Accounting | Canada | 7 |
| Administrative | Pakistani | 7 |
| Executive Assistant | Canada | 7 |
| Insurance | Pakistani | 7 |
| Programmer | British | 7 |
| Accounting | Indian | 6 |
| Education | Canada | 6 |
| Electrical Engineer | Indian | 6 |
| Executive Assistant | Chinese | 6 |
| Insurance | British | 6 |
| Ecommerce | Chinese | 5 |
| Electrical Engineer | Chn-Cdn | 5 |
| Executive Assisstant | British | 5 |
| Executive Assistant | Indian | 5 |
| Human Resources Payroll | Indian | 5 |
| Maintenance Technician | Canada | 5 |
| Production | Indian | 5 |
| Programmer | Pakistani | 5 |
| Retail | Greek | 5 |
| Social Worker | Canada | 5 |
| Technology | British | 5 |
| Technology | Indian | 5 |
| Biotech and Pharmacy | Canada | 4 |
| Biotech and Pharmacy | Indian | 4 |
| Civil Engineer | Chn-Cdn | 4 |
| Clerical | Chn-Cdn | 4 |
| Electrical Engineer | British | 4 |
| Electrical Engineer | Chinese | 4 |
| Electrical Engineer | Pakistani | 4 |
| Maintenance Technician | British | 4 |
| Technology | Chn-Cdn | 4 |
| Accounting | Greek | 3 |
| Clerical | Greek | 3 |
| Education | British | 3 |
| Education | Indian | 3 |
| Executive Assisstant | Pakistani | 3 |
| Human Resources Payroll | Chn-Cdn | 3 |
| Maintenance Technician | Indian | 3 |
| Programmer | Greek | 3 |
| Technology | Chinese | 3 |
| Ecommerce | Canada | 2 |
| Executive Assisstant | Chn-Cdn | 2 |
| Executive Assistant | Chn-Cdn | 2 |
| Food Services Managers | Canada | 2 |
| Food Services Managers | Chn-Cdn | 2 |
| Human Resources Payroll | British | 2 |
| Maintenance Technician | Chn-Cdn | 2 |
| Maintenance Technician | Pakistani | 2 |
| Media and Arts | Canada | 2 |
| Media and Arts | Chinese | 2 |
| Media and Arts | Indian | 2 |
| Production | Canada | 2 |
| Production | Chinese | 2 |
| Technology | Greek | 2 |
| Technology | Pakistani | 2 |
| Accounting | Chn-Cdn | 1 |
| Biotech and Pharmacy | Chinese | 1 |
| Civil Engineer | British | 1 |
| Civil Engineer | Pakistani | 1 |
| Ecommerce | British | 1 |
| Ecommerce | Indian | 1 |
| Electrical Engineer | Greek | 1 |
| Executive Assistant | Greek | 1 |
| Finance | Greek | 1 |
| Food Services Managers | Chinese | 1 |
| Food Services Managers | Indian | 1 |
| Maintenance Technician | Chinese | 1 |
| Social Worker | Chinese | 1 |
| Biotech and Pharmacy | British | 0 |
| Biotech and Pharmacy | Chn-Cdn | 0 |
| Biotech and Pharmacy | Pakistani | 0 |
| Civil Engineer | Greek | 0 |
| Ecommerce | Chn-Cdn | 0 |
| Ecommerce | Greek | 0 |
| Ecommerce | Pakistani | 0 |
| Education | Chinese | 0 |
| Education | Chn-Cdn | 0 |
| Education | Pakistani | 0 |
| Food Services Managers | Greek | 0 |
| Human Resources Payroll | Chinese | 0 |
| Human Resources Payroll | Greek | 0 |
| Human Resources Payroll | Pakistani | 0 |
| Insurance | Greek | 0 |
| Production | British | 0 |
| Production | Chn-Cdn | 0 |
| Production | Greek | 0 |
| Production | Pakistani | 0 |
| Social Worker | British | 0 |
| Social Worker | Chn-Cdn | 0 |
| Social Worker | Indian | 0 |
| Social Worker | Pakistani | 0 |
  1. Notice that the data from your previous answer is in the long format. Reshape the data into the wide format where each row has a unique occupation type and each column tracks the ethnicity of the candidate. Each cell in the resulting data frame should contain the total number of callbacks for the respective occupation-type-name-ethnicity combination.
wide_table <- summary_table %>%
  pivot_wider(names_from = name_ethnicity, values_from = total_callbacks, values_fill = list(total_callbacks = 0))

colnames(wide_table) <- c("Occupation Type", "Canadian", "Indian", "Chinese", "British", "Chinese - Canadian", "Pakistani", "Greek")


wide_table %>%
  kable("html", caption = "Total Callbacks for Each Occupation Type and Ethnicity Combination") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
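Total Callbacks for Each Occupation Type and Ethnicity Combination

| Occupation Type | Canadian | Indian | Chinese | British | Chinese - Canadian | Pakistani | Greek |
|---|---|---|---|---|---|---|---|
| Finance | 112 | 66 | 57 | 16 | 17 | 8 | 1 |
| Marketing and Sales | 85 | 48 | 47 | 32 | 25 | 24 | 16 |
| Retail | 74 | 60 | 48 | 27 | 27 | 15 | 5 |
| Administrative | 58 | 27 | 32 | 15 | 13 | 7 | 9 |
| Programmer | 45 | 42 | 36 | 7 | 14 | 5 | 3 |
| Insurance | 32 | 34 | 16 | 6 | 12 | 7 | 0 |
| Executive Assisstant | 23 | 11 | 8 | 5 | 2 | 3 | 0 |
| Civil Engineer | 22 | 12 | 11 | 1 | 4 | 1 | 0 |
| Clerical | 12 | 17 | 16 | 0 | 4 | 0 | 3 |
| Electrical Engineer | 17 | 6 | 4 | 4 | 5 | 4 | 1 |
| Technology | 10 | 5 | 3 | 5 | 4 | 2 | 2 |
| Accounting | 7 | 6 | 9 | 0 | 1 | 0 | 3 |
| Human Resources Payroll | 8 | 5 | 0 | 2 | 3 | 0 | 0 |
| Executive Assistant | 7 | 5 | 6 | 0 | 2 | 0 | 1 |
| Education | 6 | 3 | 0 | 3 | 0 | 0 | 0 |
| Ecommerce | 2 | 1 | 5 | 1 | 0 | 0 | 0 |
| Maintenance Technician | 5 | 3 | 1 | 4 | 2 | 2 | 0 |
| Production | 2 | 5 | 2 | 0 | 0 | 0 | 0 |
| Social Worker | 5 | 0 | 1 | 0 | 0 | 0 | 0 |
| Biotech and Pharmacy | 4 | 4 | 1 | 0 | 0 | 0 | 0 |
| Food Services Managers | 2 | 1 | 1 | 0 | 2 | 0 | 0 |
| Media and Arts | 2 | 2 | 2 | 0 | 0 | 0 | 0 |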
  1. Run and interpret a regression where the outcome is whether a resume got a callback. The covariates in your regression should be the ethnicity of each candidate as well as the levels of the candidate's skills: speaking, writing and personal skills. Your interpretation should detail what a one-unit increase in the independent variable means for the dependent variable.

Because the dependent variable is binary, a linear regression can produce predicted probabilities below 0 (no callback) or above 1 (callback). I therefore use a logistic regression, which models the probability of the event occurring and keeps the predicted probabilities within the 0 to 1 range.
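
To illustrate the point about bounded predictions, here is a minimal sketch on simulated data. The predictor, the sample size, and the parameter values are made up for illustration and are not taken from the Oreopoulos data set.

```r
# Simulated illustration (hypothetical data, not the Oreopoulos resumes)
set.seed(1)
n <- 500
x <- rnorm(n, sd = 2)                      # a made-up continuous predictor
y <- rbinom(n, 1, plogis(-2 + 1.5 * x))    # binary outcome from a logistic process

lpm   <- lm(y ~ x)                         # linear probability model
logit <- glm(y ~ x, family = binomial)     # logistic regression

# Range of fitted "probabilities" from each model
lpm_range   <- range(fitted(lpm))
logit_range <- range(fitted(logit))
print(rbind(linear = lpm_range, logistic = logit_range))
```

In a setup like this, the linear fit extends past the unit interval at the extremes of x, while the logistic fit cannot by construction.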

library(stargazer)  # regression tables

model1 <- glm(callback ~ name_ethnicity + skillspeaking + skillwriting + skillsocialper, 
              data = data, family = binomial)
stargazer(model1, 
          type = 'html',  
          header = FALSE, 
          dep.var.labels = "Got a Callback", 
          covariate.labels = c("Canadian", "Chinese", "Chinese-Canadian", "Greek", 
                               "Indian", "Pakistani", "Speaking Skills", "Writing Skills", "Social Skills", "Intercept (British)"),
          style = 'qje',   
          title = 'Regression Results: Callbacks across Ethnicities and Skills' ,
          notes = "",  
          column.sep.width = "1pt", 
          out = "summary_table.html",
          coef = list(exp(coef(model1))))
Regression Results: Callbacks across Ethnicities and Skills

| | Got a Callback |
|---|---|
| Canadian | 1.337*** (0.118) |
| Chinese | 0.800*** (0.125) |
| Chinese-Canadian | 0.640*** (0.146) |
| Greek | 0.882*** (0.204) |
| Indian | 0.755*** (0.123) |
| Pakistani | 0.582*** (0.167) |
| Speaking Skills | 1.027*** (0.005) |
| Writing Skills | 0.988*** (0.003) |
| Social Skills | 0.996*** (0.004) |
| Intercept (British) | 0.052 (0.286) |
| N | 12,897 |
| Log Likelihood | -4,110.432 |
| Akaike Inf. Crit. | 8,240.865 |

Notes: ***Significant at the 1 percent level.
**Significant at the 5 percent level.
*Significant at the 10 percent level.

Because the stargazer() call exponentiates the coefficients, the table reports odds ratios rather than log odds. An odds ratio compares the odds of a callback in one group with the odds in the reference group (here, British names): values above 1 indicate higher odds and values below 1 indicate lower odds. For a continuous covariate, the odds ratio is the multiplicative change in the odds for a one-unit increase. Two caveats apply when reading the table. First, the standard errors in parentheses are still on the log-odds scale. Second, because only coef is supplied, stargazer recomputes the stars from the exponentiated coefficient divided by that standard error, so the stars do not test whether an odds ratio differs from 1 and they overstate significance. The significance statements below instead use the approximate z statistic log(odds ratio) / standard error, computed from the rounded values in the table.

Intercept (British): British candidates serve as the reference group against which all other candidates are compared. The exponentiated intercept, 0.052, is not an odds ratio but the baseline odds of a callback for a British-named resume with all three skill scores equal to zero, which corresponds to a probability of about 0.052 / 1.052 ≈ 4.9%.

Canadian candidate (1.337): Candidates with a Canadian name have 33.7% higher odds of getting a callback than British candidates (z ≈ 2.5, significant at about the 5% level).

Chinese candidate (0.800): Candidates with a Chinese name have 20% lower odds of getting a callback than British candidates (z ≈ -1.8, significant only at about the 10% level).

Chinese-Canadian candidate (0.640): Candidates with a Chinese-Canadian name have 36% lower odds of receiving a callback than British candidates (z ≈ -3.1, significant at about the 1% level).

Greek candidate (0.882): Greek candidates have 11.8% lower odds of receiving a callback than British candidates, but the difference is not statistically significant (z ≈ -0.6).

Indian candidate (0.755): Indian candidates have 24.5% lower odds of getting a callback than British candidates (z ≈ -2.3, significant at about the 5% level).

Pakistani candidate (0.582): Pakistani candidates have 41.8% lower odds of getting a callback than British candidates (z ≈ -3.2, significant at about the 1% level).

Speaking Skills (1.027): For each one-unit increase in speaking skills, the odds of receiving a callback increase by 2.7%. With z of roughly 5 this is significant at the 1% level, so speaking skills have a positive effect on the likelihood of getting a callback.

Writing Skills (0.988): For each one-unit increase in writing skills, the odds of receiving a callback decrease by 1.2%. With z of roughly -4 this is significant at the 1% level, so writing skills have a negative, but small, estimated effect on the likelihood of getting a callback.

Social-Personal Skills (0.996): For each one-unit increase in social skills, the odds of receiving a callback decrease by 0.4%. This is not statistically significant (z of roughly -1), and the effect is negligible in practical terms.
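
Since the table shows odds ratios next to standard errors that are on the log-odds scale, a quick way to check significance is to recompute the z statistics by hand. The sketch below does this with the rounded values copied from the table above, so the results are approximate; the vector names are my own.

```r
# Odds ratios and log-odds standard errors copied from the regression table above
odds_ratio <- c(Canadian = 1.337, Chinese = 0.800, `Chinese-Canadian` = 0.640,
                Greek = 0.882, Indian = 0.755, Pakistani = 0.582,
                Speaking = 1.027, Writing = 0.988, Social = 0.996)
std_error  <- c(0.118, 0.125, 0.146, 0.204, 0.123, 0.167, 0.005, 0.003, 0.004)

z_stat  <- log(odds_ratio) / std_error   # tests log(OR) = 0, i.e. OR = 1
p_value <- 2 * pnorm(-abs(z_stat))       # two-sided normal p-value

result <- data.frame(odds_ratio, std_error,
                     z = round(z_stat, 2), p = round(p_value, 3))
print(result)
```

As far as I can tell from the stargazer documentation, passing t.auto = FALSE and p.auto = FALSE together with coef makes the package keep the model's own test statistics and p-values, which should avoid the problem in the table itself.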

  1. Run a regression where the outcome is whether a candidate gets a callback and the covariates are name-ethnicity variables and candidates’ language skills. Run the same regression but only for “Programmer” jobs. Run the same regression but only for “Retail” jobs. What do you notice? What does that tell you about the calculus of the employers?
model2 <- glm(callback ~ name_ethnicity + language_skills, 
             data = data, family = binomial)
programmerdata <- subset(data, occupation_type == "Programmer")
model_Prog <- glm(callback ~ name_ethnicity + language_skills,
                             data = programmerdata, family = binomial)
retaildata <- subset(data, occupation_type == "Retail")

model_Retail <- glm(callback ~ name_ethnicity + language_skills,
                         data = retaildata, family = binomial)
stargazer(model2, model_Prog, model_Retail, 
          type = 'html',  
          header = FALSE, 
          dep.var.labels = "Got a Callback", 
          covariate.labels = c( "Canadian", "Chinese", "Chinese-Canadian", "Greek", 
                               "Indian", "Pakistani", "Language Skills" , "Intercept(British)"),
          style = 'qje',   
          title = 'Regression Results: Callbacks and Language Skills by Ethnicity and Occupation',
          notes.align = "l", 
          column.labels = c("All Jobs", "Programmer Jobs", "Retail Jobs"),
          coef = list(exp(coef(model2)), exp(coef(model_Prog)), exp(coef(model_Retail))),
          out = "model_summary_table.html")
Regression Results: Callbacks and Language Skills by Ethnicity and Occupation

Dependent variable: Got a Callback

| | All Jobs (1) | Programmer Jobs (2) | Retail Jobs (3) |
|---|---|---|---|
| Canadian | 1.314*** (0.118) | 1.851*** (0.497) | 0.830*** (0.275) |
| Chinese | 0.794*** (0.124) | 1.724*** (0.500) | 0.515* (0.290) |
| Chinese-Canadian | 0.639*** (0.146) | 1.313** (0.562) | 0.686** (0.329) |
| Greek | 0.883*** (0.203) | 1.455* (0.767) | 0.447 (0.582) |
| Indian | 0.745*** (0.122) | 1.611*** (0.498) | 0.480* (0.286) |
| Pakistani | 0.585*** (0.166) | 0.850 (0.655) | 0.447 (0.398) |
| Language Skills | 1.202*** (0.066) | 1.163*** (0.210) | 1.311*** (0.175) |
| Intercept(British) | 0.120 (0.106) | 0.075 (0.468) | 0.270 (0.241) |
| N | 12,910 | 1,172 | 1,391 |
| Log Likelihood | -4,137.972 | -401.585 | -582.409 |
| Akaike Inf. Crit. | 8,291.944 | 819.171 | 1,180.819 |

Notes: ***Significant at the 1 percent level.
**Significant at the 5 percent level.
*Significant at the 10 percent level.

In this table the coefficients are odds ratios relative to British names, but the standard errors in parentheses are still on the log-odds scale, and stargazer computes the stars from the exponentiated coefficient divided by that standard error. The stars therefore overstate significance. Below, the percentages are the changes in odds implied by the odds ratios, and significance is judged from the approximate z statistic log(odds ratio) / standard error, computed from the rounded table values.

Canadian candidate: Positive effect in All Jobs (+31.4%, z ≈ 2.3, significant at about the 5% level). The effects in Programmer Jobs (+85.1%, z ≈ 1.2) and Retail Jobs (-17.0%, z ≈ -0.7) are not significant.

Chinese candidate: Negative effect in All Jobs (-20.6%, z ≈ -1.9, significant only at about the 10% level) and in Retail Jobs (-48.5%, z ≈ -2.3, about the 5% level). The positive estimate in Programmer Jobs (+72.4%, z ≈ 1.1) is not significant.

Chinese-Canadian candidate: Negative effect in All Jobs (-36.1%, z ≈ -3.1, about the 1% level). The estimates in Programmer Jobs (+31.3%, z ≈ 0.5) and Retail Jobs (-31.4%, z ≈ -1.1) are not significant.

Greek candidate: No significant effect in any sample: All Jobs (-11.7%, z ≈ -0.6), Programmer Jobs (+45.5%, z ≈ 0.5), Retail Jobs (-55.3%, z ≈ -1.4).

Indian candidate: Negative effect in All Jobs (-25.5%, z ≈ -2.4, about the 5% level) and in Retail Jobs (-52.0%, z ≈ -2.6, on the border between the 5% and 1% levels). The positive estimate in Programmer Jobs (+61.1%, z ≈ 1.0) is not significant.

Pakistani candidate: Negative effect in All Jobs (-41.5%, z ≈ -3.2, about the 1% level) and in Retail Jobs (-55.3%, z ≈ -2.0, about the 5% level). No significant effect in Programmer Jobs (-15.0%, z ≈ -0.2).

Language Skills: All Jobs: 20.2% increase in odds per unit increase in language skills (z ≈ 2.8, about the 1% level).

Programmer Jobs: 16.3% increase in odds per unit increase in language skills (z ≈ 0.7, not significant).

Retail Jobs: 31.1% increase in odds per unit increase in language skills (z ≈ 1.5, not significant).

The point estimates for language skills are positive in all three samples, but only the pooled estimate is statistically distinguishable from no effect.

Programmer Jobs: The odds ratios for Canadian, Chinese, Chinese-Canadian, Greek, and Indian names are all above 1, but none is statistically distinguishable from British names in this sample of 1,172 resumes. Pakistani candidates likewise face no significant disadvantage or advantage in this category.

Retail Jobs: Chinese, Indian, and Pakistani candidates have significantly lower odds of receiving callbacks for retail positions, indicating an ethnic disparity in customer-facing roles. All retail odds ratios are below 1, including the Canadian one, although the Canadian, Chinese-Canadian, and Greek gaps are not significant. Because every coefficient is measured relative to British names applying to the same kind of job, this pattern reflects a relative advantage for British names in retail rather than overall hiring conditions in the retail industry.

All Jobs: Only Canadian names have significantly higher odds of a callback (+31.4%). Every other ethnicity has lower estimated odds than the reference British group, clearly so for Chinese-Canadian, Indian, and Pakistani names, marginally for Chinese names, and not significantly for Greek names.

Employers' hiring decisions appear to be influenced by both ethnic background and the specific job type, with ethnic disparities more pronounced in the pooled sample and in retail roles. In technical roles, such as programming, ethnic background seems to play a lesser role, which suggests that employers may prioritize technical skills over cultural fit in these contexts. Assumptions such as stereotypes (e.g., Indians being good at programming roles) may also play a role in the hiring process, although these estimates cannot separate that explanation from others. Unconscious biases and preferences likely affect hiring for customer-facing positions, where ethnicity may shape employers' perceptions of a candidate's suitability for the role. Language skills have a positive estimated effect in every sample, which suggests that the ability to communicate matters to employers regardless of the specific role, though the evidence is only strong in the pooled sample.

  1. Run the same regression as above, but include the interaction between ethnicity and language skills. How do the results differ?
model3 <- glm(callback ~ name_ethnicity * language_skills, data = data, family = binomial)
programmerinteraction <- glm(callback ~ name_ethnicity * language_skills, data = programmerdata, family = binomial)
retailinteraction <- glm(callback ~ name_ethnicity * language_skills, data = retaildata, family = binomial)
stargazer(model3, programmerinteraction, retailinteraction, 
          type = 'html',  
          header = FALSE, 
          dep.var.labels = "Got a Callback", 
          covariate.labels = c("Canadian", "Chinese", "Chinese-Canadian", "Greek", 
                               "Indian", "Pakistani", "Language Skills", "Canadian*Language Skills", "Chinese*Language Skills" ,
                               "Chinese Canadian*Language Skills", "Greek*Language Skills", 
                               "Indian*Language Skills", "Pakistani*Language Skills", "British (Intercept)"),
          style = 'qje',   
          title = 'Regression Results: Interaction Between Ethnicity and Language Skills for Callbacks',
          column.labels = c("All Jobs", "Programmer Jobs", "Retail Jobs"),
          coef = list(exp(coef(model3)), exp(coef(programmerinteraction)), exp(coef(retailinteraction))),
          out = "interaction_model_summary_table.html")
Regression Results: Interaction Between Ethnicity and Language Skills for Callbacks

Dependent variable: Got a Callback

| | All Jobs (1) | Programmer Jobs (2) | Retail Jobs (3) |
|---|---|---|---|
| Canadian | 1.269*** (0.131) | 1.579*** (0.562) | 0.764** (0.307) |
| Chinese | 0.789*** (0.139) | 1.705*** (0.561) | 0.430 (0.327) |
| Chinese-Canadian | 0.594*** (0.165) | 1.215* (0.639) | 0.514 (0.376) |
| Greek | 0.881*** (0.241) | 1.412 (0.911) | 0.257 (0.780) |
| Indian | 0.687*** (0.138) | 1.419** (0.563) | 0.453 (0.320) |
| Pakistani | 0.484** (0.194) | 0.571 (0.787) | 0.253 (0.499) |
| Language Skills | 0.943*** (0.275) | 0.750 (1.155) | 0.676 (0.610) |
| Canadian*Language Skills | 1.222*** (0.299) | 1.889 (1.219) | 1.501** (0.697) |
| Chinese*Language Skills | 1.099*** (0.313) | 1.060 (1.239) | 2.250*** (0.720) |
| Chinese Canadian*Language Skills | 1.407*** (0.357) | 1.411 (1.356) | 3.596*** (0.802) |
| Greek*Language Skills | 1.117** (0.462) | 1.259 (1.733) | 6.165*** (1.256) |
| Indian*Language Skills | 1.425*** (0.305) | 1.676 (1.218) | 1.407** (0.713) |
| Pakistani*Language Skills | 2.181*** (0.392) | 4.308*** (1.503) | 8.518*** (0.918) |
| British (Intercept) | 0.126 (0.116) | 0.083 (0.520) | 0.311 (0.263) |
| N | 12,910 | 1,172 | 1,391 |
| Log Likelihood | -4,134.734 | -400.444 | -577.596 |
| Akaike Inf. Crit. | 8,297.469 | 828.887 | 1,183.193 |

Notes: ***Significant at the 1 percent level.
**Significant at the 5 percent level.
*Significant at the 10 percent level.

General Jobs: In the interaction model, the Language Skills odds ratio (0.943) is the effect of language skills for British names, and each interaction term is the factor by which that effect is multiplied for the named group. All six interaction odds ratios are above 1, ranging from 1.099 (Chinese) to 2.181 (Pakistani), so the estimated payoff to language skills is larger for every other group than for British candidates, and largest for Pakistani candidates (implied odds ratio 0.943 × 2.181 ≈ 2.06). The stars in the table overstate significance, because the standard errors are on the log-odds scale while the stars are computed from the exponentiated coefficients. Judged by log(odds ratio) / standard error, only the Pakistani interaction is borderline significant at the 5% level (z ≈ 2.0); the other interactions are not (all |z| below 1.2).

Programmer Jobs: Pakistani candidates have the largest estimated interaction with language skills (odds ratio = 4.308), but with a log-odds standard error of 1.503 it is not statistically distinguishable from 1 (z ≈ 1.0). The same holds for the Canadian, Chinese, Chinese-Canadian, Greek, and Indian interactions. The point estimate is consistent with language skills, in addition to technical competency, being expected of Pakistani candidates more than of others, which would indicate a bias that requires additional skills of Pakistani candidates to receive a callback. With only 1,172 programmer resumes, however, the data are too noisy to establish this.

Retail Jobs: Every interaction odds ratio is above 1. Pakistani candidates see the largest estimated increase (8.518), followed by Greek (6.165), Chinese-Canadian (3.596), and Chinese (2.250) candidates, and then Canadian (1.501) and Indian (1.407) candidates. Only the Pakistani interaction is significant at about the 5% level (z ≈ 2.3). The main effect of language skills (0.676) applies to British candidates and is not significant. The ethnicity main effects now describe candidates whose language_skills value is zero; at that level, Chinese (0.430), Indian (0.453), and Pakistani (0.253) candidates have significantly lower odds of a callback than British candidates. The estimates thus suggest that language skills narrow the gap for minority candidates in retail jobs, most clearly for Pakistani candidates.
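
In a model with an interaction, the language-skills odds ratio for a given ethnicity is the product of the main-effect odds ratio (which applies to the British reference group) and that group's interaction odds ratio. The sketch below works this out for the All Jobs column, using the rounded values from the table above; the vector names are my own.

```r
# All Jobs column of the interaction table (rounded values copied from above)
lang_main <- 0.943   # language-skills odds ratio for British names (reference)
interaction_or <- c(Canadian = 1.222, Chinese = 1.099, `Chinese-Canadian` = 1.407,
                    Greek = 1.117, Indian = 1.425, Pakistani = 2.181)

# Implied language-skills odds ratio within each ethnicity group
group_lang_or <- c(British = lang_main, lang_main * interaction_or)
print(round(group_lang_or, 3))
```

These are point estimates only; a standard error for each product would need the covariance between the two coefficients, which the table does not report.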

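In the previous question, ethnicity and language skills are treated as separate, additive predictors. Each ethnic group has its own effect on the likelihood of a callback, and language skills are included as a separate main effect. However, there is no consideration of how the impact of language skills might differ across ethnicities.

In this model, the interaction recognizes that the effect of language skills might not be the same for all ethnic groups and allows the relationship to change depending on the ethnicity. The ethnicity coefficients change meaning accordingly: they now compare each group with British candidates at a language_skills value of zero. The point estimates suggest that language skills have a stronger effect for some ethnicities than for others across job categories. The Akaike Information Criterion is nevertheless higher for the interaction model in all three samples (8,297.5 versus 8,291.9 for all jobs, 828.9 versus 819.2 for programmer jobs, and 1,183.2 versus 1,180.8 for retail jobs), so the six extra terms add little explanatory power.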
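
One way to compare the additive and interaction specifications formally is a likelihood-ratio test. The sketch below uses the log likelihoods reported in the two regression tables above and assumes six extra parameters (one interaction per non-reference ethnicity); the object names are my own.

```r
# Log likelihoods copied from the two tables (additive vs. interaction model)
loglik_additive    <- c(all = -4137.972, programmer = -401.585, retail = -582.409)
loglik_interaction <- c(all = -4134.734, programmer = -400.444, retail = -577.596)
df_extra <- 6  # six ethnicity-by-language interaction terms

lr_stat <- 2 * (loglik_interaction - loglik_additive)       # likelihood-ratio statistic
p_value <- pchisq(lr_stat, df = df_extra, lower.tail = FALSE)

lr_table <- data.frame(lr_stat = round(lr_stat, 3), p_value = round(p_value, 3))
print(lr_table)
```

With the fitted model objects available, anova(model2, model3, test = "Chisq") should give the same comparison directly.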

  1. Create an interactive heat map displaying the number of callbacks by the type of jobs and the ethnicity of the name. Hint: use the plot_ly() function with the argument type = "heatmap".
library(plotly)

# Heat map of total callbacks: ethnicity on the x-axis, job type on the y-axis
heatmapplot <- plot_ly(data = summary_table,
                       x = ~name_ethnicity,
                       y = ~occupation_type,
                       z = ~total_callbacks,
                       type = "heatmap",
                       colors = "Oranges") %>%
  layout(title = "Number of Callbacks by Job Type and Ethnicity",
         xaxis = list(title = "Ethnicity of Name"),  # matches x = ~name_ethnicity
         yaxis = list(title = "Job Type")) %>%       # matches y = ~occupation_type
  colorbar(title = "Number of Callbacks")

heatmapplot

Problem 2: Causal Inference

Remember, DAGs go from left to right in temporal order!

  1. A researcher would like to study the relationship between education and income. She hypothesizes that individuals with more education will earn more money in the future. She surveys 1,000 individuals, asking them their highest degree earned and their yearly wage, runs a regression, and finds a statistically significant positive effect, concluding that there is a causal effect of education on income. Is this researcher correct to make causal claims? Why or why not? Draw a causal graph to support your conclusions.

The researcher is not justified in drawing a causal conclusion from these findings. The regression establishes a correlation between education and income, but in observational survey data that correlation can be produced by confounding variables that influence both. Socioeconomic status, networking, and work experience can each affect the level of education a person attains as well as the income they go on to earn. For example, an individual with a high socioeconomic status has access to better institutions and, therefore, better networks, which enables them to gain more relevant work experience and raises their income. Reverse causality is also possible: people with higher incomes can afford to spend more on further education, or may pursue it to improve their professional prospects. Without a design that rules out these alternatives, for instance by measuring and adjusting for the confounders, the positive coefficient cannot be read as the causal effect of education.

#install.packages("dagitty")
library(dagitty)

dag <- dagitty("dag {
  Socioeconomic_Status -> Education
  Socioeconomic_Status -> Networking
  Socioeconomic_Status -> Work_Experience
  Socioeconomic_Status -> Income 
  Education -> Work_Experience
  Education -> Income
  Networking -> Work_Experience
  Networking -> Income
  Work_Experience -> Income
}")

plot(dag)
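
Given a DAG like this one, dagitty can also list which variables would need to be adjusted for in order to identify the total effect of education on income. The sketch below repeats the graph so that it runs on its own (the object name dag_q1 is my own) and assumes the graph is correct, which the survey data cannot confirm.

```r
library(dagitty)

# Same graph as above, repeated so the block is self-contained
dag_q1 <- dagitty("dag {
  Socioeconomic_Status -> Education
  Socioeconomic_Status -> Networking
  Socioeconomic_Status -> Work_Experience
  Socioeconomic_Status -> Income
  Education -> Work_Experience
  Education -> Income
  Networking -> Work_Experience
  Networking -> Income
  Work_Experience -> Income
}")

# Minimal covariate sets that close all back-door paths from Education to Income
sets <- adjustmentSets(dag_q1, exposure = "Education", outcome = "Income")
print(sets)
```

Under this graph, the only back-door paths run through socioeconomic status, which the researcher's survey did not measure.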

  1. Another researcher also wants to study the relationship between education and income (labor economists aren’t very creative). This one, however, has better data. Her country introduced a scholarship to help people attend university: anyone who applied received two free years of university education. Looking only at people who applied in the first year of the program, she compares 1,000 people who received the scholarship to 1,000 people who did not receive the scholarship, and found that five years after the program started, the people who obtained the scholarship had a higher income than those who did not receive the scholarship. She concludes that education increases income. Is this researcher correct to make causal claims? Why or why not? Draw a causal graph to support your conclusions.

It would not be correct to draw a causal conclusion from this researcher's findings, for several reasons. As in the previous question, confounding variables have to be considered to understand the relationship between education and income. Student motivation affects both whether someone obtains the scholarship and their later income, and networks both help students learn about the scholarship and boost their chances of finding well-paying jobs. There may also be systematic differences between scholarship recipients and non-recipients, that is, selection bias in the sample. To support a causal claim, the treatment (the independent variable, here the scholarship) would have to be randomly assigned. Here, the researcher only compares the scholarship recipients to those who did not receive it. There is no random assignment of people to treatment and control groups, so the groups could differ from each other in ways that are not controlled for, and the income gap cannot be attributed to education alone.

dag <- dagitty('
   dag {
    "Student Motivation" -> "Education" -> "Income"
    "Student Motivation" -> "Income"
    "Student Networks" -> "Education"
    "Student Networks" -> "Income"
    "Student Motivation" -> "Student Networks"
  }
')

plot(dag)
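
To make the selection problem concrete, here is a minimal simulation sketch. Every number in it (the true effect of 2, the strength of motivation, the sample size) is hypothetical and chosen only for illustration; motivation stands in for the unobserved confounder in the DAG.

```r
# Hypothetical simulation: motivation drives both scholarship receipt and income
set.seed(42)
n <- 20000
motivation  <- rnorm(n)
scholarship <- rbinom(n, 1, plogis(1.5 * motivation))  # motivated people more often get it
true_effect <- 2                                       # assumed causal effect on income
income <- 30 + true_effect * scholarship + 5 * motivation + rnorm(n, sd = 5)

# Naive comparison of recipients and non-recipients
naive <- mean(income[scholarship == 1]) - mean(income[scholarship == 0])

# Comparison that adjusts for the confounder (only possible here because we simulated it)
adjusted <- unname(coef(lm(income ~ scholarship + motivation))["scholarship"])

print(c(true = true_effect, naive = naive, adjusted = adjusted))
```

The naive difference mixes the effect of the scholarship with the effect of motivation, which is the pattern the researcher's comparison cannot rule out.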

  1. A third researcher decides that she wants to settle this debate once and for all. She has the full support of a government and infinite money to solve this problem; her government agrees to let her run a randomized experiment. She flips a coin for every city in the country, and if the coin comes up heads, that city will get a free kindergarten. If tails, it gets a free park. Twenty years later, she finds that the cities that got kindergartens have higher incomes than the cities that received parks, and concludes that education increases income. Is this researcher correct to make causal claims? Why or why not? Draw a causal graph to support your conclusions.

The researcher's conclusion that education increases income goes further than the experiment supports, for several reasons. First, the coin flip makes the assignment independent of pre-existing city characteristics in expectation, but with a limited number of cities, chance imbalances in factors such as pre-existing economic status and urbanization can remain, and these affect both the education and the income of residents. Second, the random assignment occurs at the city level, whereas the claim is about individuals. Randomizing which cities get a park or a kindergarten does not control individual-level factors, such as where a household sits in the city's income distribution or how close it lives to the new facility, so a city-level difference does not show that a given person's education raised that person's income. Third, a kindergarten is an educational intervention, whereas a park is a leisure and public-good intervention. The effects of the two may not be directly comparable, because people engage with these services differently: kindergartens serve families with children of kindergarten age, while parks are publicly accessible. The design therefore estimates the effect of receiving a kindergarten rather than a park, and it treats education and leisure or community well-being as if they were opposed to each other. Fourth, the introduction of kindergartens and parks may have caused spillover effects on the city that this research has not captured, and these could account for the differences.

In the DAG, kindergarten represents the educational intervention given, that seems to result in higher income. Here park represents the non-educational intervention given, that seems to result in lower income.

dag <- dagitty('
  dag {
    "Pre-existing Economic Status" -> "Kindergarten"
    "Pre-existing Economic Status" -> "Park"  
    "Pre-existing Economic Status" -> "Income"
    "Urbanization" -> "Kindergarten"
    "Urbanization" -> "Park"
    "Urbanization" -> "Income"
    "Kindergarten" -> "Income"
    "Park" -> "Income"
  }
')

plot(dag)
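
How much pre-existing differences between cities can matter under a coin-flip design depends on how many cities there are. The following sketch simulates the chance gap in a baseline characteristic between kindergarten and park cities; the baseline variable, its distribution, and the city counts are all hypothetical.

```r
set.seed(7)

# Gap in a (made-up) standardized baseline income between the two arms of one coin-flip draw
baseline_gap <- function(n_cities) {
  baseline     <- rnorm(n_cities)                   # pre-existing city income, standardized
  kindergarten <- rbinom(n_cities, 1, 0.5) == 1     # one coin flip per city
  mean(baseline[kindergarten]) - mean(baseline[!kindergarten])
}

# Spread of that gap over 1,000 hypothetical re-randomizations, for small and large countries
gap_sd <- sapply(c(`20 cities` = 20, `2000 cities` = 2000),
                 function(n) sd(replicate(1000, baseline_gap(n)), na.rm = TRUE))
print(round(gap_sd, 3))
```

With few cities, the two arms can differ noticeably at baseline purely by chance; with many cities, the gap shrinks toward zero.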

  1. Why did the control group cities receive a park instead of nothing?

By providing the control group with a park, the researcher compares two types of public investment: one related to education (kindergarten) and one related to community infrastructure (park). This helps isolate the specific effect of education on income, because both groups of cities receive a general community improvement and differ only in its type. The kindergarten intervention focuses on early childhood education, which improves cognitive development and enhances future income potential. In contrast, the park is a leisure-promoting intervention that aims to improve physical health and community well-being but does not directly affect education or cognitive development. If the control group had received nothing, it would be difficult to distinguish whether any observed differences in income were due to the kindergarten specifically or simply due to the fact that the experimental group was receiving something (any intervention) while the control group was not. In addition, because this is a public intervention experiment, withholding public goods for the sake of an experiment would raise ethical concerns about accessibility and the equal treatment of citizens.