Problem 1: Reshaping, analyzing, and visualizing data

Let’s work a bit more with the data set on resumes from (Oreopoulos, 2011) that we discussed in the last recitation.

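First, read the Oreopoulos data into R.

library(readstata13)  # provides read.dta13()
library(tidyverse)    # dplyr verbs and tidyr::pivot_wider() used below
setwd("~/Desktop/Data Analysis")
data <- read.dta13("~/Desktop/Data Analysis/HW3/oreopoulos.dta")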

Find all the unique combinations of occupation types and ethnicity dyads. Find the total number of callbacks for each occupation type and ethnicity combination, and present the result in a table sorted by count.

summary_table <- data %>%
  group_by(occupation_type, name_ethnicity) %>%
  summarize(total_callbacks = sum(callback, na.rm = TRUE) 
            + sum(second_callback, na.rm = TRUE), .groups = 'drop') %>%
  arrange(desc(total_callbacks))

#install.packages("kableExtra")
library(kableExtra)

summary_table %>%
  kable("html", 
        caption = "Total Callbacks for Each Occupation Type and Ethnicity Combination", 
        col.names = c("Occupation Type", "Ethnicity", "Total Callbacks")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Total Callbacks for Each Occupation Type and Ethnicity Combination
Occupation Type	Ethnicity	Total Callbacks
Finance	Canada	112
Marketing and Sales	Canada	85
Retail	Canada	74
Finance	Indian	66
Retail	Indian	60
Administrative	Canada	58
Finance	Chinese	57
Marketing and Sales	Indian	48
Retail	Chinese	48
Marketing and Sales	Chinese	47
Programmer	Canada	45
Programmer	Indian	42
Programmer	Chinese	36
Insurance	Indian	34
Administrative	Chinese	32
Insurance	Canada	32
Marketing and Sales	British	32
Administrative	Indian	27
Retail	British	27
Retail	Chn-Cdn	27
Marketing and Sales	Chn-Cdn	25
Marketing and Sales	Pakistani	24
Executive Assisstant	Canada	23
Civil Engineer	Canada	22
Clerical	Indian	17
Electrical Engineer	Canada	17
Finance	Chn-Cdn	17
Clerical	Chinese	16
Finance	British	16
Insurance	Chinese	16
Marketing and Sales	Greek	16
Administrative	British	15
Retail	Pakistani	15
Programmer	Chn-Cdn	14
Administrative	Chn-Cdn	13
Civil Engineer	Indian	12
Clerical	Canada	12
Insurance	Chn-Cdn	12
Civil Engineer	Chinese	11
Executive Assisstant	Indian	11
Technology	Canada	10
Accounting	Chinese	9
Administrative	Greek	9
Executive Assisstant	Chinese	8
Finance	Pakistani	8
Human Resources Payroll	Canada	8
Accounting	Canada	7
Administrative	Pakistani	7
Executive Assistant	Canada	7
Insurance	Pakistani	7
Programmer	British	7
Accounting	Indian	6
Education	Canada	6
Electrical Engineer	Indian	6
Executive Assistant	Chinese	6
Insurance	British	6
Ecommerce	Chinese	5
Electrical Engineer	Chn-Cdn	5
Executive Assisstant	British	5
Executive Assistant	Indian	5
Human Resources Payroll	Indian	5
Maintenance Technician	Canada	5
Production	Indian	5
Programmer	Pakistani	5
Retail	Greek	5
Social Worker	Canada	5
Technology	British	5
Technology	Indian	5
Biotech and Pharmacy	Canada	4
Biotech and Pharmacy	Indian	4
Civil Engineer	Chn-Cdn	4
Clerical	Chn-Cdn	4
Electrical Engineer	British	4
Electrical Engineer	Chinese	4
Electrical Engineer	Pakistani	4
Maintenance Technician	British	4
Technology	Chn-Cdn	4
Accounting	Greek	3
Clerical	Greek	3
Education	British	3
Education	Indian	3
Executive Assisstant	Pakistani	3
Human Resources Payroll	Chn-Cdn	3
Maintenance Technician	Indian	3
Programmer	Greek	3
Technology	Chinese	3
Ecommerce	Canada	2
Executive Assisstant	Chn-Cdn	2
Executive Assistant	Chn-Cdn	2
Food Services Managers	Canada	2
Food Services Managers	Chn-Cdn	2
Human Resources Payroll	British	2
Maintenance Technician	Chn-Cdn	2
Maintenance Technician	Pakistani	2
Media and Arts	Canada	2
Media and Arts	Chinese	2
Media and Arts	Indian	2
Production	Canada	2
Production	Chinese	2
Technology	Greek	2
Technology	Pakistani	2
Accounting	Chn-Cdn	1
Biotech and Pharmacy	Chinese	1
Civil Engineer	British	1
Civil Engineer	Pakistani	1
Ecommerce	British	1
Ecommerce	Indian	1
Electrical Engineer	Greek	1
Executive Assistant	Greek	1
Finance	Greek	1
Food Services Managers	Chinese	1
Food Services Managers	Indian	1
Maintenance Technician	Chinese	1
Social Worker	Chinese	1
Biotech and Pharmacy	British	0
Biotech and Pharmacy	Chn-Cdn	0
Biotech and Pharmacy	Pakistani	0
Civil Engineer	Greek	0
Ecommerce	Chn-Cdn	0
Ecommerce	Greek	0
Ecommerce	Pakistani	0
Education	Chinese	0
Education	Chn-Cdn	0
Education	Pakistani	0
Food Services Managers	Greek	0
Human Resources Payroll	Chinese	0
Human Resources Payroll	Greek	0
Human Resources Payroll	Pakistani	0
Insurance	Greek	0
Production	British	0
Production	Chn-Cdn	0
Production	Greek	0
Production	Pakistani	0
Social Worker	British	0
Social Worker	Chn-Cdn	0
Social Worker	Indian	0
Social Worker	Pakistani	0

Notice that the data from your previous answer is in the long format. Reshape the data into the wide format where each row has a unique occupation type and each column tracks the ethnicity of the candidate. Each cell in the resulting data frame should contain the total number of callbacks for the respective occupation-type-name-ethnicity combination.

wide_table <- summary_table %>%
  pivot_wider(names_from = name_ethnicity, values_from = total_callbacks, values_fill = list(total_callbacks = 0))

colnames(wide_table) <- c("Occupation Type", "Canadian", "Indian", "Chinese", "British", "Chinese - Canadian", "Pakistani", "Greek")


wide_table %>%
  kable("html", caption = "Total Callbacks for Each Occupation Type and Ethnicity Combination") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Total Callbacks for Each Occupation Type and Ethnicity Combination
Occupation Type	Canadian	Indian	Chinese	British	Chinese - Canadian	Pakistani	Greek
Finance	112	66	57	16	17	8	1
Marketing and Sales	85	48	47	32	25	24	16
Retail	74	60	48	27	27	15	5
Administrative	58	27	32	15	13	7	9
Programmer	45	42	36	7	14	5	3
Insurance	32	34	16	6	12	7	0
Executive Assisstant	23	11	8	5	2	3	0
Civil Engineer	22	12	11	1	4	1	0
Clerical	12	17	16	0	4	0	3
Electrical Engineer	17	6	4	4	5	4	1
Technology	10	5	3	5	4	2	2
Accounting	7	6	9	0	1	0	3
Human Resources Payroll	8	5	0	2	3	0	0
Executive Assistant	7	5	6	0	2	0	1
Education	6	3	0	3	0	0	0
Ecommerce	2	1	5	1	0	0	0
Maintenance Technician	5	3	1	4	2	2	0
Production	2	5	2	0	0	0	0
Social Worker	5	0	1	0	0	0	0
Biotech and Pharmacy	4	4	1	0	0	0	0
Food Services Managers	2	1	1	0	2	0	0
Media and Arts	2	2	2	0	0	0	0
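As a cross-check on the reshape that needs no extra packages, base R's xtabs() can build the same occupation-by-ethnicity layout. This is only a minimal sketch: the `long` data frame is a hypothetical four-row excerpt whose values are copied from the long table above, not the full data.

```r
# Four rows copied from the long-format table (illustrative excerpt only).
long <- data.frame(
  occupation_type = c("Finance", "Finance", "Retail", "Retail"),
  name_ethnicity  = c("Canada", "Indian", "Canada", "Indian"),
  total_callbacks = c(112, 66, 74, 60)
)

# Rows become occupation types, columns become ethnicities,
# and each cell holds the summed callbacks for that combination.
wide <- xtabs(total_callbacks ~ occupation_type + name_ethnicity, data = long)
wide
```

Combinations that never occur in the long data would appear as 0 in the cross-tabulation, which mirrors the values_fill argument used with pivot_wider().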

Run and interpret a regression where the outcome is whether a resume got a callback. The covariates in your regression should be the ethnicity of each candidate as well as the levels of candidates’ skills: speaking, writing, and personal skills. Your interpretation should detail what a one-unit increase in the independent variable means for the dependent variable.

Because the dependent variable is binary, a linear regression could produce predicted probabilities below 0 (no callback) or above 1 (received a callback). Instead, I opted for a logistic regression, which models the probability of an event occurring and ensures that the predicted probabilities remain bounded between 0 and 1.
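The difference between the two model types can be sketched on simulated data. Everything in this block is hypothetical (the `skill` score, the coefficients, and the sample size are made up for illustration and are not taken from the Oreopoulos data); it only shows that logistic predictions stay inside the unit interval while linear predictions are not constrained to.

```r
# Simulated data for illustration only (not the Oreopoulos data).
set.seed(1)
n <- 500
skill    <- runif(n, 0, 100)             # hypothetical skill score
p_true   <- plogis(-6 + 0.08 * skill)    # assumed true callback probability
callback <- rbinom(n, 1, p_true)

lpm   <- lm(callback ~ skill)                       # linear probability model
logit <- glm(callback ~ skill, family = binomial)   # logistic regression

grid <- data.frame(skill = c(0, 50, 100))
lpm_pred   <- predict(lpm, newdata = grid)                       # unconstrained
logit_pred <- predict(logit, newdata = grid, type = "response")  # probabilities

range(lpm_pred)    # can fall below 0 or above 1
range(logit_pred)  # always strictly between 0 and 1
```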

library(stargazer)

model1 <- glm(callback ~ name_ethnicity + skillspeaking + skillwriting + skillsocialper, 
              data = data, family = binomial)
# Report odds ratios, but keep the model's own z-statistics and p-values:
# with t.auto/p.auto left at TRUE, stargazer recomputes them as exp(coef) / SE.
stargazer(model1, 
          type = 'html',  
          header = FALSE, 
          dep.var.labels = "Got a Callback", 
          covariate.labels = c("Canadian", "Chinese", "Chinese-Canadian", "Greek", 
                               "Indian", "Pakistani", "Speaking Skills", "Writing Skills", "Social Skills", "Intercept (British)"),
          style = 'qje',   
          title = 'Regression Results: Callbacks across Ethnicities and Skills' ,
          notes = "",  
          column.sep.width = "1pt", 
          out = "summary_table.html",
          coef = list(exp(coef(model1))),
          t.auto = FALSE, p.auto = FALSE)

**Regression Results: Callbacks across Ethnicities and Skills**

	Got a Callback

Canadian	1.337^***
	(0.118)

Chinese	0.800^***
	(0.125)

Chinese-Canadian	0.640^***
	(0.146)

Greek	0.882^***
	(0.204)

Indian	0.755^***
	(0.123)

Pakistani	0.582^***
	(0.167)

Speaking Skills	1.027^***
	(0.005)

Writing Skills	0.988^***
	(0.003)

Social Skills	0.996^***
	(0.004)

Intercept (British)	0.052
	(0.286)

N	12,897
Log Likelihood	-4,110.432
Akaike Inf. Crit.	8,240.865

Notes:	^***Significant at the 1 percent level.
	^**Significant at the 5 percent level.
	^*Significant at the 10 percent level.

Because this is a logistic regression, the model is estimated on the log-odds scale. I exponentiated the coefficients, so the table reports odds ratios: the odds of a callback for one group relative to the odds for the reference group (or, for the skill scores, per one-unit increase). Values greater than 1 indicate higher odds and values less than 1 indicate lower odds. The standard errors in parentheses are still on the log-odds scale, so I judge significance from the Wald statistic z = log(odds ratio) / SE, computed from the rounded table values. The stars printed in the table should not be used for this, since they result from dividing the exponentiated coefficient by the log-odds standard error.

Intercept (British): British-named candidates serve as the reference group, against which all other candidates are compared. The exponentiated intercept of 0.052 is not an odds ratio but the baseline odds of a callback for a British-named candidate whose three skill scores are all zero, which corresponds to a probability of about 4.9% (0.052 / 1.052).

Canadian candidate (1.337): Candidates with a Canadian name have 33.7% higher odds of getting a callback compared to British candidates (z ≈ 2.5, significant at the 5% level).

Chinese candidate (0.800): Candidates with a Chinese name have 20% lower odds of getting a callback compared to British candidates (z ≈ -1.8, significant only at the 10% level).

Chinese-Canadian candidate (0.640): Candidates with a Chinese-Canadian name have 36% lower odds of receiving a callback compared to British candidates (z ≈ -3.1, significant at the 1% level).

Greek candidate (0.882): Greek candidates have 11.8% lower odds of receiving a callback compared to British candidates, but the difference is not statistically significant (z ≈ -0.6).

Indian candidate (0.755): Indian candidates have 24.5% lower odds of getting a callback compared to British candidates (z ≈ -2.3, significant at the 5% level).

Pakistani candidate (0.582): Pakistani candidates have 41.8% lower odds of getting a callback compared to British candidates (z ≈ -3.2, significant at the 1% level).

Speaking Skills (1.027): For each one-unit increase in speaking skills, the odds of receiving a callback increase by 2.7% (z ≈ 5.3, significant at the 1% level), meaning speaking skills have a positive effect on the likelihood of getting a callback.

Writing Skills (0.988): For each one-unit increase in writing skills, the odds of receiving a callback decrease by 1.2% (z ≈ -4.0, significant at the 1% level), meaning writing skills have a negative, but small, effect on the likelihood of getting a callback.

Social-Personal Skills (0.996): For each one-unit increase in social skills, the odds of receiving a callback decrease by 0.4%. This is not statistically significant (z ≈ -1.0), and the effect is very small and likely negligible in practical terms.
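The significance of each odds ratio can be checked by hand from the numbers in the table. The sketch below assumes that the standard errors in parentheses are on the log-odds scale (stargazer was given exponentiated coefficients but no matching standard errors), and it works from rounded values, so the z-statistics are approximate.

```r
# Odds ratios and standard errors copied from the first regression table.
# Assumption: the standard errors are on the log-odds scale.
or <- c(Canadian = 1.337, Chinese = 0.800, `Chinese-Canadian` = 0.640,
        Greek = 0.882, Indian = 0.755, Pakistani = 0.582,
        Speaking = 1.027, Writing = 0.988, Social = 0.996)
se <- c(0.118, 0.125, 0.146, 0.204, 0.123, 0.167, 0.005, 0.003, 0.004)

z <- log(or) / se          # Wald z-statistic on the log-odds scale
p <- 2 * pnorm(-abs(z))    # two-sided p-value under a normal approximation

check <- data.frame(odds_ratio = or, se_log_odds = se,
                    z = round(z, 2), p = round(p, 3))
check
```

With the fitted model available, the same quantities can be read directly from summary(model1)$coefficients, which avoids the rounding error.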

Run a regression where the outcome is whether a candidate gets a callback and the covariates are name-ethnicity variables and candidates’ language skills. Run the same regression but only for “Programmer” jobs. Run the same regression but only for “Retail” jobs. What do you notice? What does that tell you about the calculus of the employers?

model2 <- glm(callback ~ name_ethnicity + language_skills, 
             data = data, family = binomial)

programmerdata <- subset(data, occupation_type == "Programmer")
model_Prog <- glm(callback ~ name_ethnicity + language_skills,
                             data = programmerdata, family = binomial)

retaildata <- subset(data, occupation_type == "Retail")

model_Retail <- glm(callback ~ name_ethnicity + language_skills,
                         data = retaildata, family = binomial)

# t.auto/p.auto = FALSE keep each model's own z-statistics and p-values;
# otherwise stargazer recomputes them as exp(coef) / SE.
stargazer(model2, model_Prog, model_Retail, 
          type = 'html',  
          header = FALSE, 
          dep.var.labels = "Got a Callback", 
          covariate.labels = c( "Canadian", "Chinese", "Chinese-Canadian", "Greek", 
                               "Indian", "Pakistani", "Language Skills" , "Intercept(British)"),
          style = 'qje',   
          title = 'Regression Results: Callbacks and Language Skills by Ethnicity and Occupation',
          notes.align = "l", 
          column.labels = c("All Jobs", "Programmer Jobs", "Retail Jobs"),
          coef = list(exp(coef(model2)), exp(coef(model_Prog)), exp(coef(model_Retail))),
          t.auto = FALSE, p.auto = FALSE,
          out = "model_summary_table.html")

**Regression Results: Callbacks and Language Skills by Ethnicity and Occupation**

	Got a Callback
	All Jobs	Programmer Jobs	Retail Jobs
	(1)	(2)	(3)

Canadian	1.314^***	1.851^***	0.830^***
	(0.118)	(0.497)	(0.275)

Chinese	0.794^***	1.724^***	0.515^*
	(0.124)	(0.500)	(0.290)

Chinese-Canadian	0.639^***	1.313^**	0.686^**
	(0.146)	(0.562)	(0.329)

Greek	0.883^***	1.455^*	0.447
	(0.203)	(0.767)	(0.582)

Indian	0.745^***	1.611^***	0.480^*
	(0.122)	(0.498)	(0.286)

Pakistani	0.585^***	0.850	0.447
	(0.166)	(0.655)	(0.398)

Language Skills	1.202^***	1.163^***	1.311^***
	(0.066)	(0.210)	(0.175)

Intercept(British)	0.120	0.075	0.270
	(0.106)	(0.468)	(0.241)

N	12,910	1,172	1,391
Log Likelihood	-4,137.972	-401.585	-582.409
Akaike Inf. Crit.	8,291.944	819.171	1,180.819

Notes:	^***Significant at the 1 percent level.
	^**Significant at the 5 percent level.
	^*Significant at the 10 percent level.

The percentages below are changes in the odds of a callback relative to British candidates. Significance is judged from z = log(odds ratio) / SE using the rounded table values, because the standard errors are on the log-odds scale while the coefficients are odds ratios; the printed stars do not account for this.

Canadian candidate: Higher odds in All Jobs (+31.4%, z ≈ 2.3, significant at the 5% level) and Programmer Jobs (+85.1%, z ≈ 1.2, not significant). Lower odds in Retail Jobs (-17%, z ≈ -0.7, not significant).

Chinese candidate: Lower odds in All Jobs (-20.6%, z ≈ -1.9, significant at the 10% level) and Retail Jobs (-48.5%, z ≈ -2.3, significant at the 5% level). Higher odds in Programmer Jobs (+72.4%, z ≈ 1.1, not significant).

Chinese-Canadian candidate: Lower odds in All Jobs (-36.1%, z ≈ -3.1, significant at the 1% level) and Retail Jobs (-31.4%, z ≈ -1.1, not significant). Higher odds in Programmer Jobs (+31.3%, z ≈ 0.5, not significant).

Greek candidate: Lower odds in All Jobs (-11.7%, z ≈ -0.6) and Retail Jobs (-55.3%, z ≈ -1.4), and higher odds in Programmer Jobs (+45.5%, z ≈ 0.5). None of these is statistically significant.

Indian candidate: Lower odds in All Jobs (-25.5%, z ≈ -2.4, significant at the 5% level) and Retail Jobs (-52%, z ≈ -2.6, significant at the 5% level). Higher odds in Programmer Jobs (+61.1%, z ≈ 1.0, not significant).

Pakistani candidate: Lower odds in All Jobs (-41.5%, z ≈ -3.2, significant at the 1% level) and Retail Jobs (-55.3%, z ≈ -2.0, significant at the 5% level). No significant effect in Programmer Jobs (-15.0%, z ≈ -0.2).

Language Skills: All Jobs: 20.2% increase in odds per unit increase in language skills (z ≈ 2.8, significant at the 1% level).

Programmer Jobs: 16.3% increase in odds per unit increase in language skills (z ≈ 0.7, not significant).

Retail Jobs: 31.1% increase in odds per unit increase in language skills (z ≈ 1.5, not significant).

The language-skills odds ratio is above 1 for all jobs, programmer jobs, and retail jobs, but it is only statistically distinguishable from 1 in the pooled sample.

Programmer Jobs: Every group except Pakistani candidates has an odds ratio above 1 relative to British candidates, but with only 1,172 observations none of these differences is statistically significant. There is therefore no clear evidence that any ethnic group is advantaged or disadvantaged in technical roles.

Retail Jobs: Chinese, Indian, and Pakistani candidates have significantly lower odds of receiving callbacks for retail positions (all at the 5% level), indicating an ethnic disparity in customer-facing roles. The estimates for Chinese-Canadian, Greek, and Canadian candidates are also below 1 but not significant. Since every odds ratio is below 1, British-named candidates have the highest estimated callback odds in retail. This is a statement about relative odds within retail; it does not by itself say anything about overall hiring conditions in the retail industry during the data collection period.

All Jobs: Only Canadian candidates have higher odds of a callback (+31.4%) across jobs. Every other ethnicity has lower odds than the reference British group, although the Greek difference is not significant.

Employers’ hiring decisions appear to be influenced by both ethnic background and the specific job type, with ethnic disparities more pronounced in the pooled sample and in retail roles. In technical roles such as programming, ethnic background seems to play a lesser role, suggesting that employers may prioritize technical skills over cultural fit in these contexts. Stereotypes (e.g., Indians being good at programming) may also play a role in the hiring process. Unconscious biases and preferences likely affect hiring for customer-facing positions, where ethnicity may shape employers’ perceptions of a candidate’s suitability for the role. The language-skills odds ratio is above 1 in every sample, suggesting that the ability to communicate matters to employers regardless of the specific role, although the estimate is only precise in the pooled sample.

Run the same regression as above, but include the interaction between ethnicity and language skills. How do the results differ?

model3 <- glm(callback ~ name_ethnicity * language_skills, data = data, family = binomial)

programmerinteraction <- glm(callback ~ name_ethnicity * language_skills, data = programmerdata, family = binomial)

retailinteraction <- glm(callback ~ name_ethnicity * language_skills, data = retaildata, family = binomial)

# t.auto/p.auto = FALSE keep each model's own z-statistics and p-values;
# otherwise stargazer recomputes them as exp(coef) / SE.
stargazer(model3, programmerinteraction, retailinteraction, 
          type = 'html',  
          header = FALSE, 
          dep.var.labels = "Got a Callback", 
          covariate.labels = c("Canadian", "Chinese", "Chinese-Canadian", "Greek", 
                               "Indian", "Pakistani", "Language Skills", "Canadian*Language Skills", "Chinese*Language Skills" ,
                               "Chinese Canadian*Language Skills", "Greek*Language Skills", 
                               "Indian*Language Skills", "Pakistani*Language Skills", "British (Intercept)"),
          style = 'qje',   
          title = 'Regression Results: Interaction Between Ethnicity and Language Skills for Callbacks',
          column.labels = c("All Jobs", "Programmer Jobs", "Retail Jobs"),
          coef = list(exp(coef(model3)), exp(coef(programmerinteraction)), exp(coef(retailinteraction))),
          t.auto = FALSE, p.auto = FALSE,
          out = "interaction_model_summary_table.html")

**Regression Results: Interaction Between Ethnicity and Language Skills for Callbacks**

	Got a Callback
	All Jobs	Programmer Jobs	Retail Jobs
	(1)	(2)	(3)

Canadian	1.269^***	1.579^***	0.764^**
	(0.131)	(0.562)	(0.307)

Chinese	0.789^***	1.705^***	0.430
	(0.139)	(0.561)	(0.327)

Chinese-Canadian	0.594^***	1.215^*	0.514
	(0.165)	(0.639)	(0.376)

Greek	0.881^***	1.412	0.257
	(0.241)	(0.911)	(0.780)

Indian	0.687^***	1.419^**	0.453
	(0.138)	(0.563)	(0.320)

Pakistani	0.484^**	0.571	0.253
	(0.194)	(0.787)	(0.499)

Language Skills	0.943^***	0.750	0.676
	(0.275)	(1.155)	(0.610)

Canadian*Language Skills	1.222^***	1.889	1.501^**
	(0.299)	(1.219)	(0.697)

Chinese*Language Skills	1.099^***	1.060	2.250^***
	(0.313)	(1.239)	(0.720)

Chinese Canadian*Language Skills	1.407^***	1.411	3.596^***
	(0.357)	(1.356)	(0.802)

Greek*Language Skills	1.117^**	1.259	6.165^***
	(0.462)	(1.733)	(1.256)

Indian*Language Skills	1.425^***	1.676	1.407^**
	(0.305)	(1.218)	(0.713)

Pakistani*Language Skills	2.181^***	4.308^***	8.518^***
	(0.392)	(1.503)	(0.918)

British (Intercept)	0.126	0.083	0.311
	(0.116)	(0.520)	(0.263)

N	12,910	1,172	1,391
Log Likelihood	-4,134.734	-400.444	-577.596
Akaike Inf. Crit.	8,297.469	828.887	1,183.193

Notes:	^***Significant at the 1 percent level.
	^**Significant at the 5 percent level.
	^*Significant at the 10 percent level.

In a model with interactions, the Language Skills coefficient is the language-skills odds ratio for the British reference group, and each interaction term is the factor by which that odds ratio differs for the named group. Significance is judged from z = log(odds ratio) / SE using the rounded table values, because the standard errors are on the log-odds scale; the printed stars do not account for this.

General Jobs: All interaction odds ratios are above 1, which means the estimated payoff to language skills is larger for every other group than for British candidates. The largest interaction is for Pakistani candidates (odds ratio = 2.181, z ≈ 2.0, significant at the 5% level), indicating that language proficiency boosts their callback chances the most. The interactions for Canadian (1.222), Chinese (1.099), Chinese-Canadian (1.407), Greek (1.117), and Indian (1.425) candidates are also positive, but none is statistically significant (|z| < 1.2).

Programmer Jobs: Pakistani candidates again have the largest interaction (odds ratio = 4.308), but its standard error is large (z ≈ 1.0), and none of the interactions in this subsample is statistically significant. The point estimates are consistent with language skills, in addition to technical competency, being expected of Pakistani candidates more than of others, which would indicate a bias that requires additional skills from Pakistani candidates before they receive a callback. The programmer subsample is too small to establish this.

Retail Jobs: All interaction odds ratios are above 1. Pakistani candidates see the largest increase (8.518, z ≈ 2.3, significant at the 5% level), followed by Greek (6.165, z ≈ 1.4), Chinese-Canadian (3.596, z ≈ 1.6), and Chinese (2.250, z ≈ 1.1) candidates. The interactions for Canadian (1.501) and Indian (1.407) candidates are smaller (z ≈ 0.6 and 0.5). Only the Pakistani interaction is statistically significant, so the point estimates suggest, but do not establish, that language skills matter more for non-British names in retail jobs.

In question (e), ethnicity and language skills are treated as separate, independent predictors. Each ethnic group has its own effect on the likelihood of a callback, and language skills are included as a separate main effect. However, there is no consideration of how the impact of language skills might differ across ethnicities.

In this model, the interaction recognizes that the effect of language skills might not be the same for all ethnic groups and allows the relationship to change depending on the ethnicity. It shows that language skills have a stronger or weaker estimated effect for certain ethnicities compared to others, across job categories. The Akaike information criterion is higher for every interaction model than for its counterpart without interactions (8,297.469 versus 8,291.944 for all jobs, 828.887 versus 819.171 for programmer jobs, and 1,183.193 versus 1,180.819 for retail jobs), so the added terms do not clearly improve the fit.
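To read an interaction model, it helps to combine the main effect with each interaction term into one language-skills odds ratio per group. The sketch below does this for the "All Jobs" column, using odds ratios copied from the table; because effects add on the log-odds scale, they multiply on the odds-ratio scale. The variable names are hypothetical, and the results inherit the rounding of the table.

```r
# Odds ratios copied from the "All Jobs" column of the interaction table.
or_language_british <- 0.943   # language-skills odds ratio for the reference group
or_interaction <- c(Canadian = 1.222, Chinese = 1.099, `Chinese-Canadian` = 1.407,
                    Greek = 1.117, Indian = 1.425, Pakistani = 2.181)

# exp(b_language + b_interaction) = exp(b_language) * exp(b_interaction)
or_language_by_group <- c(British = or_language_british,
                          or_language_british * or_interaction)
round(or_language_by_group, 3)
```

For example, the implied language-skills odds ratio for Pakistani-named candidates is 0.943 × 2.181 ≈ 2.06, compared with 0.943 for British-named candidates. These are point estimates only; their uncertainty would have to come from the fitted model's covariance matrix.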

Create an interactive heat map displaying the number of callbacks by the type of jobs and the ethnicity of the name. Hint: use the plot_ly() function with the argument type=“heatmap”.

library(plotly)

heatmapplot <- plot_ly(data = summary_table, 
                       x = ~name_ethnicity, 
                       y = ~occupation_type, 
                       z = ~total_callbacks, 
                       type = "heatmap",
                       colors = "Oranges") %>%
  layout(title = "Number of Callbacks by Job Type and Ethnicity",
         xaxis = list(title = "Ethnicity of Name"),   # x is mapped to name_ethnicity
         yaxis = list(title = "Job Type")) %>%        # y is mapped to occupation_type
  colorbar(title = "Number of Callbacks")             # title for the heatmap's color scale

heatmapplot

Problem 2: Causal Inference

Remember, DAGs go from left to right in temporal order!

A researcher would like to study the relationship between education and income. She hypothesizes that individuals with more education will earn more money in the future. She surveys 1,000 individuals, asking them their highest degree earned and their yearly wage, runs a regression, and finds a statistically significant positive effect, concluding that there is a causal effect of education on income. Is this researcher correct to make causal claims? Why or why not? Draw a causal graph to support your conclusions.

The researcher is not correct to draw a causal conclusion from these findings. Her regression shows a correlation between education and income, but it does not rule out confounding variables that influence both. Socioeconomic status, networking, and work experience can each affect the level of education a person achieves as well as the income they go on to earn. For example, an individual with high socioeconomic status has access to better institutions and, therefore, better networks, which help them gain more relevant work experience and raise their income. Reverse causality is also possible: people who earn more can afford to spend more on further education, or may need further degrees to advance professionally. A causal claim would require, at a minimum, controlling for these confounders and ruling out reverse causality.

#install.packages("dagitty")
library(dagitty)

dag <- dagitty("dag {
  Socioeconomic_Status -> Education
  Socioeconomic_Status -> Networking
  Socioeconomic_Status -> Work_Experience
  Socioeconomic_Status -> Income 
  Education -> Work_Experience
  Education -> Income
  Networking -> Work_Experience
  Networking -> Income
  Work_Experience -> Income
}")

plot(dag)
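The confounding argument can be illustrated with a small simulation. All names and effect sizes below are hypothetical and chosen only for illustration: income is generated so that education has no causal effect at all, yet a regression that omits socioeconomic status still finds a clearly positive coefficient.

```r
# Simulated data for illustration only; effect sizes are made up.
set.seed(42)
n <- 1000
ses       <- rnorm(n)                    # socioeconomic status (confounder)
education <- 12 + 2 * ses + rnorm(n)     # SES raises education
income    <- 30 + 5 * ses + rnorm(n)     # SES raises income; education has NO effect

naive    <- lm(income ~ education)         # omits the confounder
adjusted <- lm(income ~ education + ses)   # blocks the backdoor path through SES

coef(naive)["education"]      # clearly positive despite no causal effect
coef(adjusted)["education"]   # close to zero once SES is controlled for
```

In real survey data the confounders are rarely all observed, which is why adjusting for covariates alone does not license a causal claim.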

Another researcher also wants to study the relationship between education and income (labor economists aren’t very creative). This one, however, has better data. Her country introduced a scholarship to help people attend university: anyone who applied received two free years of university education. Looking only at people who applied in the first year of the program, she compares 1,000 people who received the scholarship to 1,000 people who did not receive the scholarship, and found that five years after the program started, the people who obtained the scholarship had a higher income than those who did not receive the scholarship. She concludes that education increases income. Is this researcher correct to make causal claims? Why or why not? Draw a causal graph to support your conclusions.

It would not be correct to draw a causal conclusion from this researcher’s findings, for several reasons.

As in the previous question, confounding variables have to be considered. Student motivation affects both whether someone obtains the scholarship and how much they later earn, and networks both help students learn about the scholarship and boost their chances of finding well-paying jobs.

There may also be systematic differences between scholarship recipients and non-recipients, that is, selection bias in the sample. Because anyone who applied received the scholarship, recipients selected themselves into the program. To support a causal claim, the treatment (the independent variable) has to be randomly assigned. Here, the researcher simply compares people who received the scholarship with people who did not. There is no random assignment of people to treatment and control groups, so the groups could differ from each other in ways that are not controlled. This makes a direct causal claim difficult to defend.

dag <- dagitty('
   dag {
    "Student Motivation" -> "Education" -> "Income"
    "Student Motivation" -> "Income"
    "Student Networks" -> "Education"
    "Student Networks" -> "Income"
    "Student Motivation" -> "Student Networks"
  }
')

plot(dag)

A third researcher decides that she wants to settle this debate once and for all. She has the full support of a government and infinite money to solve this problem; her government agrees to let her run a randomized experiment. She flips a coin for every city in the country, and if the coin comes up heads, that city will get a free kindergarten. If tails, it gets a free park. Twenty years later, she finds that the cities that got kindergartens have higher incomes than the cities that received parks, and concludes that education increases income. Is this researcher correct to make causal claims? Why or why not? Draw a causal graph to support your conclusions.

The researcher’s conclusion that education increases income is not fully justified, for several reasons.

First, although the coin flip balances city characteristics in expectation, chance imbalances can remain, particularly if the country has few cities. Factors such as the pre-existing economic status of cities and their degree of urbanization affect both the education and the income of residents, so any imbalance in them would bias the comparison.

Second, the random assignment occurs at the city level, but income is earned by individuals. Randomizing which cities get kindergartens or parks does not control individual-level factors within a city, such as where a household sits in the income distribution or how close it lives to the new facility.

Third, a kindergarten is an educational intervention, whereas a park is a leisure and public-good intervention. The two are not directly comparable because people engage with them differently: kindergartens serve families with children of kindergarten age, while parks are open to everyone. Framing the choice as park versus kindergarten also implies that education and leisure or community well-being are opposed to each other.

Finally, the introduction of kindergartens and parks may have caused spillover effects that this research design does not capture, and these could account for part of the difference in incomes.

In the DAG, Kindergarten represents the educational intervention, which appears to result in higher income, and Park represents the non-educational intervention, which appears to result in lower income.

dag <- dagitty('
  dag {
    "Pre-existing Economic Status" -> "Kindergarten"
    "Pre-existing Economic Status" -> "Park"  
    "Pre-existing Economic Status" -> "Income"
    "Urbanization" -> "Kindergarten"
    "Urbanization" -> "Park"
    "Urbanization" -> "Income"
    "Kindergarten" -> "Income"
    "Park" -> "Income"
  }
')

plot(dag)

Why did the control group cities receive a park instead of nothing?

Giving the control cities a park lets the researcher compare two types of intervention: one related to education (kindergarten) and one related to community infrastructure (park). This helps isolate the specific effect of education on income while holding constant the general effect of receiving a public investment. The kindergarten targets early childhood education, which improves cognitive development and enhances future income potential. The park aims to improve leisure, physical health, and community well-being but does not directly affect education or cognitive development.

If the control group had received nothing, it would be difficult to tell whether any observed difference in income was due to the kindergarten specifically or simply to the fact that the treated cities received something (any intervention) while the control cities did not.

Also, because this is a public intervention experiment, withholding public goods from some cities for the sake of an experiment would raise ethical concerns about accessibility and equality among citizens.

Homework 3

Gaya Menon (gm2918)

2022-10-14
