library(AER)
library(dplyr)
library(ggplot2)
library(lmtest)
library(sandwich)
library(haven)
library(lmtest)
library(stargazer)

Problem 1.

(a) Interpret the intercept in this equation, and comment on its sign and magnitude.

Answer: The intercept is -124.84. It implies that at a zero income the consumption is predicted to be -124.84 and the negative sign is not meaningful as we cannot have a negative consumption. This informs us that the model is only effective when the values of the positive income in the sample are positive not when income is zero. The value of 124.84 is quite low as compared to the average family earnings in the data.

(b) What is the predicted consumption when family income is USD 30000?

predicted_cons <- -124.84 + 0.853 * 30000
predicted_cons

## [1] 25465.16

Answer: When family income is $30,000, the predicted consumption is $25,465.16.

(c) With inc on the x-axis, draw a graph of the estimated MPC and APC.

income <- seq(5000, 100000, by = 1000)
MPC <- rep(0.853, length(income))
APC <- (-124.84 / income) + 0.853

plot_data <- data.frame(income, MPC, APC)

ggplot(plot_data, aes(x = income)) +
  geom_line(aes(y = MPC, color = "MPC"), size = 1.2) +
  geom_line(aes(y = APC, color = "APC"), size = 1.2) +
  labs(x = "Income (USD)", y = "Propensity", color = "", title="The Estimated MPC and APC") +
  ylim(0.82, 0.86) +
  theme_minimal()

Graph Interpretation: The MPC has a value of 0.853, which is illustrated as the straight line. It implies that every additional dollar of income increases consumption by 85.3 cents. The APC begins at low income and approaches the MPC with increase in incomes because of the negative impact of the intercept which is less when there is an increase in income.

Problem 2.There are approximately 40,000 highway traffic fatalities every year in the United States. About one-third of fatal crashes involve a driver who was drinking (Stock and Watson).Using the package “AER”, use the dataset “fatalities” for this problem. Generate a subset of the dataset for the year = 1988 only.

data(Fatalities)
fatalities_1988 <- subset(Fatalities, year == 1988)

(a) create summary statistics for these variables (state, beertax, jail, service, breath, young-drivers, fatal, pop), and write one paragraph summarizing the main findings. The goal of this paragraph is to provide the background you would like the reader to have before you present any information about the effect of certain policies in reducing traffic fatalities.

summary(fatalities_1988[, c("beertax", "jail", "service", "breath", 
                             "youngdrivers", "fatal", "pop")])

##     beertax          jail    service   breath    youngdrivers    
##  Min.   :0.04331   no  :33   no  :37   no :24   Min.   :0.07314  
##  1st Qu.:0.19440   yes :14   yes :10   yes:24   1st Qu.:0.15543  
##  Median :0.34649   NA's: 1   NA's: 1            Median :0.16198  
##  Mean   :0.47982                                Mean   :0.16200  
##  3rd Qu.:0.59806                                3rd Qu.:0.17104  
##  Max.   :2.19442                                Max.   :0.22072  
##      fatal             pop          
##  Min.   : 104.0   Min.   :  479000  
##  1st Qu.: 294.2   1st Qu.: 1578252  
##  Median : 723.5   Median : 3479496  
##  Mean   : 974.8   Mean   : 5074334  
##  3rd Qu.:1101.5   3rd Qu.: 5920494  
##  Max.   :5390.0   Max.   :28314028

The 1988 data consists of traffic fatality data of 48 US states. Taxes on beer vary widely by state, with the range of 0.04 to 2.19 per case with the median of 0.48. Approximately, 29 percent of states have a mandatory jail sentence on drunken driving, 21 percent require community service, and half of the states have laws on breath test. The average percentage of young drivers at the ages 15-24 is approximately 16.2, and it varies between 7.3 and 22.1. The fatalities range between 104 to 5,390 per state with an average of 974.8. Population in the states increases up to more than 28 million from 479,000. TThe population of the states is growing to over 28 million among 479,000. The contrast in the policies and outcomes offers us with an excellent opportunity to consider the questions whether the legislations on drunk driving actually reduce the number of traffic deaths.

(b) Create a variable (called VFR) that indicates the vehicle fatality rate (defined as the number of vehicle fatalities per 10,000 people living in the state). What is the average vehicle fatality rate in this sample?

fatalities_1988$VFR <- (fatalities_1988$fatal / fatalities_1988$pop) * 10000
mean(fatalities_1988$VFR)

## [1] 2.069594

Answer: The average vehicle fatality rate is 2.07 deaths per 10,000 people.

(c) uppose we are interested in estimating the relationship between the fatality rate (VFR) and policies that could be used to reduce traffic fatalities. Consider the following PRF:

$VFR = \beta_0 + \beta_1 jail + \beta_2 beertax + \beta_3 service + \beta_4 breath + \beta_5 youngdrivers + u$

(i) Before running any regressions, what sign do you expect β1 to have?

I expect β₁ to be negative.The effect of jail sentences should be to deter people to drunk drive, therefore, states that have jail laws should record few deaths on roads than those that do not.

(ii) stimate the above PRF using R. Interpret the coefficient on jail. Is it statistically significant at the 5 percent level? Is the sign of this coefficient the one you expected?

model <- lm(VFR ~ jail + beertax + service + breath + youngdrivers, 
            data = fatalities_1988)
summary(model)

## 
## Call:
## lm(formula = VFR ~ jail + beertax + service + breath + youngdrivers, 
##     data = fatalities_1988)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7297 -0.3283 -0.1218  0.3028  1.0466 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    1.3329     0.5339   2.497   0.0166 *
## jailyes        0.4170     0.2030   2.054   0.0464 *
## beertax        0.3860     0.1785   2.163   0.0364 *
## serviceyes    -0.1500     0.2098  -0.715   0.4788  
## breathyes     -0.1238     0.1546  -0.801   0.4280  
## youngdrivers   3.2209     3.3735   0.955   0.3453  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.471 on 41 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.2859, Adjusted R-squared:  0.1988 
## F-statistic: 3.283 on 5 and 41 DF,  p-value: 0.01388

Interpretation: The coefficient on jail is 0.4170. This implies that a state whereby the jail sentence is mandatory has 0.417 higher traffic deaths per 10,000 population as opposed to a state where there are no jail laws, other things held constant. The p-value stands at 0.0464 and this is less than 0.05. Therefore the coefficient becomes statistically significant at level 5% level. It is a positive sign and this was not what I expected. I expected that jail would minimize deaths but all we are seeing is the contrary. Probably states with worse drunk driving problems had stricter laws, or jail is not effective without other policies.

(iii) Interpret the coefficient on beertax. Be very specific to the context.

The beertax coefficient stands at 0.3860. The rate of fatality increases by 0.386 deaths per 10,000 people when beer tax is increased one dollar per case, other things being constant. This is strange in that increased prices should lessen drinking and drunk driving. It is possible that states that have more accidents increased beer taxes to address this issue, or beer taxes themselves do not have a significant effect on behavior.

(iv) Interpret the coefficient on youngdrivers. Be very specific to the context.

The youngdrivers coefficient is 3.2209. Youngdrivers being a proportion between 0 and 1 implies that an increment of 100 percentage points in young drivers would increase the fatality rate by 3.22 deaths per 10,000 people. In more realistic terms, a 1 percentage point rise in young motorists corresponds to a 0.0322 fatalities/10,000 population rise in the fatalities. This is understandable since young drivers are not as experienced and are more reckless.

(v) Next, for each of the estimates above jail, beertax and youngdrivers, calculate the 95% confidence intervals. What do you conclude about the precision of these estimates?

ci_results <- confint(model, parm = c("jailyes", "beertax", "youngdrivers"), level = 0.95)
ci_results

##                    2.5 %     97.5 %
## jailyes       0.00705388  0.8269059
## beertax       0.02560153  0.7464246
## youngdrivers -3.59210932 10.0339191

What do you conclude about the precision of these estimates?

For jail, the 95% CI is [0.0071, 0.8269]. This interval doesn’t include zero, so we’re confident the effect is positive. The width of the interval is moderate and has enough precision in the estimate. The lower endpoint is near zero meaning that there is uncertainty on the actual magnitude of the effect. For beertax, the 95% CI is [0.0256, 0.7464]. This period does not have zero indicating that beer taxes are related to the increased fatality rates. The interval is adequate with acceptable accuracy. For youngdrivers, the 95% CI is [-3.5921, 10.0339]. This is a very broad range and zero is contained within it hence, we cannot tell whether young drivers are raising or lowering the fatality rates. The low accuracy may be due to low level of variation of this variable among states or due to correlation with other variables.

Problem 3.

primitive <- read_dta("C:/Users/User/Desktop/RM 2 Data/primitive_aejmacro2.dta")

head(primitive)

## # A tibble: 6 × 78
##   wbname country colony mineuro majeuro ly2002    a1    a2     a3    a4       c1
##   <chr>  <chr>    <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl>    <dbl>
## 1 ""     French…      1       0       0     NA     1     1  0.667    NA  1.49e-8
## 2 ""     French…      1       0       0     NA    NA    NA  0.667    NA NA      
## 3 ""     Lesser…      0       0       0     NA    NA    NA  0.667    NA NA      
## 4 ""     Micron…      1       0       0     NA    NA    NA  0.667    NA NA      
## 5 ""     New Ca…      1       0       0     NA    NA    NA  0.667    NA NA      
## 6 ""     Surina…      1       0       0     NA     1     1 NA        NA  1.49e-8
## # ℹ 67 more variables: c2 <dbl>, c3 <dbl>, c4 <dbl>, t1 <dbl>, t2 <dbl>,
## #   t3 <dbl>, t4 <dbl>, mt1 <dbl>, mt2 <dbl>, mt3 <dbl>, mw1 <dbl>, mw2 <dbl>,
## #   mw3 <dbl>, mw4 <dbl>, tr1 <dbl>, tr2 <dbl>, tr3 <dbl>, tr4 <dbl>,
## #   tr1500cc <dbl>, mt1500c <dbl>, c1500c <dbl>, u1000 <dbl>, u0 <dbl>,
## #   pop1000 <dbl>, lpopu4 <dbl>, larable <dbl>, latitude <dbl>,
## #   distequat <dbl>, tropical <dbl>, eu <dbl>, af <dbl>, as <dbl>, am <dbl>,
## #   oc <dbl>, weu <dbl>, chi <dbl>, india <dbl>, arab <dbl>, year <dbl>, …

(a) Replicating Table 7A

(i) Reproduce Table 7A. Report the coefficients, t-statistics, sample size and R-squared.

Table 7A examines the persistence of technology adoption from 0 AD to 1500 AD.

# Regression Models for Table 7A
m1 <- lm(tr1500cc~tr2,data=primitive)
m2 <- lm(tr1500cc~tr1,data=primitive)
m3 <- lm(tr1500cc~tr2+eu+af+as+am,data=primitive)
m4 <- lm(tr1500cc~tr1+eu+af+as+am,data=primitive)
m5 <- lm(tr1500cc~tr2+distequat+dist2,data=primitive)
m6 <- lm(tr1500cc~tr1+distequat+dist2,data=primitive)
m7 <- lm(tr1500cc~tr2+distequat+dist2+landlocked,data=primitive)
m8 <- lm(tr1500cc~tr1+distequat+dist2+landlocked,data=primitive)
m9 <- lm(tr1500cc~tr2+landlocked+tropical,data=primitive)
m10 <- lm(tr1500cc~tr1+landlocked+tropical,data=primitive)

# Clustered standard errors
c1 <- coeftest(m1, vcov = vcovCL(m1, cluster = ~clus1000))
c2 <- coeftest(m2, vcov = vcovCL(m2, cluster = ~clus1000))
c3 <- coeftest(m3, vcov = vcovCL(m3, cluster = ~clus1000))
c4 <- coeftest(m4, vcov = vcovCL(m4, cluster = ~clus1000))
c5 <- coeftest(m5, vcov = vcovCL(m5, cluster = ~clus1000))
c6 <- coeftest(m6, vcov = vcovCL(m6, cluster = ~clus1000))
c7 <- coeftest(m7, vcov = vcovCL(m7, cluster = ~clus1000))
c8 <- coeftest(m8, vcov = vcovCL(m8, cluster = ~clus1000))
c9 <- coeftest(m9, vcov = vcovCL(m9, cluster = ~clus1000))
c10 <- coeftest(m10, vcov = vcovCL(m10, cluster = ~clus1000))

# Extract t-statistics and p-values
models <- list(m1, m2, m3, m4, m5, m6, m7, m8, m9, m10)
t_stats_list <- list(
  round(c1[,3], 2), round(c2[,3], 2), round(c3[,3], 2), round(c4[,3], 2), round(c5[,3], 2),
  round(c6[,3], 2), round(c7[,3], 2), round(c8[,3], 2), round(c9[,3], 2), round(c10[,3], 2)
)
p_value_list <- list(c1[,4], c2[,4], c3[,4], c4[,4], c5[,4], c6[,4], c7[,4], c8[,4], c9[,4], c10[,4])

# R-squared
rsq_rounded <- sapply(models, function(m) round(summary(m)$r.squared, 2))

# Create table
stargazer(models,
          type = "text",
          title = "Table 7A",
          dep.var.labels = "Technology in 1500 AD",
          se = t_stats_list,
          p = p_value_list,
          omit.stat = c("f", "ser", "adj.rsq", "rsq"),
          add.lines = list(c("R-squared", rsq_rounded)))

## 
## Table 7A
## ========================================================================================================
##                                                  Dependent variable:                                    
##              -------------------------------------------------------------------------------------------
##                                                 Technology in 1500 AD                                   
##                (1)      (2)      (3)      (4)      (5)      (6)       (7)      (8)       (9)      (10)  
## --------------------------------------------------------------------------------------------------------
## tr2          0.782***          0.356***          0.614***          0.635***           0.691***          
##              (8.360)           (5.750)           (7.140)            (7.580)            (6.150)          
##                                                                                                         
## tr1                   0.784***          0.242***          0.582***           0.584***           0.637***
##                       (5.550)           (4.920)           (3.500)            (3.430)            (3.250) 
##                                                                                                         
## eu                             0.520*** 0.689***                                                        
##                                (11.070) (17.080)                                                        
##                                                                                                         
## af                              0.005   0.193***                                                        
##                                (0.100)  (4.410)                                                         
##                                                                                                         
## as                             0.368*** 0.537***                                                        
##                                (5.110)  (9.110)                                                         
##                                                                                                         
## am                              -0.004   0.068*                                                         
##                                (-0.340) (1.900)                                                         
##                                                                                                         
## distequat                                         0.348    -0.097    0.424    -0.118                    
##                                                  (0.570)  (-0.270)  (0.820)  (-0.300)                   
##                                                                                                         
## dist2                                             0.590   1.280***   0.438   1.310***                   
##                                                  (0.820)  (2.750)   (0.720)  (2.710)                    
##                                                                                                         
## landlocked                                                         -0.098***  0.016   -0.101**   0.010  
##                                                                    (-2.790)  (0.380)  (-2.500)  (0.220) 
##                                                                                                         
## tropical                                                                              -0.249*** -0.215**
##                                                                                       (-3.090)  (-2.260)
##                                                                                                         
## Constant      -0.062   0.128*  0.030***  0.018    -0.129   0.088*   -0.129*   0.086     0.146   0.299** 
##              (-0.900) (1.700)  (2.850)  (0.560)  (-1.640) (1.710)  (-1.790)  (1.620)   (1.460)  (2.130) 
##                                                                                                         
## --------------------------------------------------------------------------------------------------------
## R-squared      0.47     0.47     0.86     0.87     0.67     0.63     0.69      0.63     0.64      0.57  
## Observations   116      104      116      104      108       97       108       97       116      104   
## ========================================================================================================
## Note:                                                                        *p<0.1; **p<0.05; ***p<0.01

(ii) Why didn’t the authors include all the continent dummies (Europe, Africa, Asia,America and Oceania) in their regression

The authors are not able to include all the five continent dummies as it produces perfect multicollinearity with the intercept. All countries fall under one continent, hence the sum of the five dummies will be 1 in each observation. This amount is equal to the intercept that is also 1. In the event that the variables are correlated perfectly such as this, the regression is unable to discern individual effects in each case. As such it drops one continent as the reference group. That continent’s effect gets captured in the intercept, and the other dummies show differences compared to it.

(iii) For column (1), interpret the coefficient on “Overall technology adoption level in 0 AD”. Be very specific.

The coefficient on tr2 in Column (1) is 0.786. A one-unit increase in technology in 0 AD leads to a 0.786 unit increase in technology in 1500 AD. Since both are indices from 0 to 1, this shows very strong persistence over 1,500 years. Countries that were more advanced in 0 AD stayed more advanced in 1500 AD.

(iv) For column (3), interpret the coefficient on “Overall technology adoption level in 0 AD”. Be very specific.

The coefficient on tr2 in Column (3) is 0.356. This means that, holding the country’s continent constant, a one-unit increase in overall technology adoption in 0 AD is associated with about a 0.356-unit increase in technology adoption in 1500 AD.

(v) Interpret the R-squared for column (1).

About 47% of the cross-country variation in technology adoption in 1500 AD is explained by their technology levels in 0 AD.

(vi) hoose the appropriate regression estimates in Table 7A: What is the predicted value of the “Overall technology adoption level in 1500 AD” for a country in Africa with an average “Overall technology adoption level in 0 AD”? (Hint refer to Table 4 for the average level)

# Get Africa's average from Table 4
avg_tr2_africa <- 0.77

# Use Column 3 model (m3) which has continent dummies
pred_africa <- coef(m3)["(Intercept)"] + 
               coef(m3)["tr2"] * avg_tr2_africa + 
               coef(m3)["af"] * 1

cat("Predicted technology in 1500 AD for Africa:", round(pred_africa, 3))

## Predicted technology in 1500 AD for Africa: 0.308

Using Column (2) from Table 7A, which includes continent dummies, the prediction for an African country with average 0 AD technology level of 0.77 , from Table 4, is calculated as: 0.5268 + 0.3709(0.77) + (-0.5043) = 0.308. This is the estimated general level of adoption of technology in 1500 AD.

(b) Plot Figure 1 and Figure 2

# Figure 1: Technology 1500 AD vs Log Income 2002
ggplot(primitive, aes(x = tr3, y = ly2002)) +
  geom_point(alpha = 0.6, size = 2, color = "red") +
  geom_smooth(method = "lm", se = FALSE, color = "purple", linewidth = 1) +
  geom_text(aes(label = country), size = 2.5, check_overlap = TRUE, vjust = -0.5) +
  labs(
    title = "Figure 1",
    x = "Overall technology adoption level in 1500 AD",
    y = "Log per capita income in 2002"
  ) +
  theme_minimal()

# Figure 2: Migration-adjusted Technology 1500 AD vs Log Income 2002
ggplot(primitive, aes(x = tr3mig, y = ly2002)) +
  geom_point(alpha = 0.6, size = 2, color = "orange") +
  geom_smooth(method = "lm", se = FALSE, color = "brown", linewidth = 1) +
  geom_text(aes(label = country), size = 2.5, check_overlap = TRUE, vjust = -0.5) +
  labs(
    title = "Figure 2",
    x = "Migration-adjusted technology level in 1500 AD",
    y = "Log per capita income in 2002"
  ) +
  theme_minimal()

Comment visually on the fit of the OLS line for these two figures. What stands out to you?

In both figures, historical technology and the current income have a distinct positive correlation. Figure 1 is more spread around the line with some countries such as the US, Canada and Australia being way above the line even though they were less technologically developed in 1500 AD. With migration being adjusted, Figure 2 indicates a far tighter fit. The points are distributed closer to the line particularly those countries that were previously outliers. This suggests that accounting for migration patterns matters a lot - people brought their technology with them when they moved, which is why migration-adjusted technology is a better predictor of current income than just looking at where technology was geographically located in 1500 AD.

(c) The authors find a significant positive association between technology adoption levels in 1500 AD and current income levels

(i) Explain why this finding does not necessarily imply a causal relationship between historical technology adoption and current income levels.

Correlation does not imply causation. The fact that countries that had more technology in 1500 AD are rich today does not necessarily mean that the technology made them rich. It is likely that other issues, such as geography, institutions, or natural resources, were affecting not only the technology of the past, but also the present income. This was observed in Table 7A - once we included geographic controls, the coefficient on technology reduced greatly, from 0.786 to 0.371, indicating that geography was affecting the relationship. In the absence of the ability to control all these other factors and a natural experiment, we cannot say that technology in 1500 AD was the actual cause of the income differences in the present world.

(ii) Describe two potential confounding factors that could explain this association without implying direct causation

Geography and institutions are the two key confounding factors. Geography such as climate, arable land, and access to the sea assisted nations to invent technology in 1500 AD and continues to expand the economy of nations up to date meaning that geography drives both of them and not technology causing income. Similarly, strong institutions that emerged early promoted technology adoption in 1500 AD and these same institutions persist today driving current income. Both create correlations without direct causation from historical technology to modern wealth.

(d) Reflection

(a) What was your top challenge that you faced in this exercise? How did you overcome it?

The most significant challenge was to learn what variables in the dataset would be related to the concepts of the paper. The names of the variables such as tr1, tr2 and tr3 did not make sense to me when I first saw the data. I responded to this by taking a close read of the document that defines the variables accompanying the dataset and cross-matched it with the methodology section of the paper. Another R command that I used was a str() and a names() command to review the structure of the dataset and ensure that I was referring to the right variable. Doing initial test regressions on each variable acted as a good way of checking whether I was in the right direction, before trying the complete Table 7A replication.

(b) f you were unable to fully replicate the results in this exercise, what factors might explain these discrepancies?

My replication findings are not similar to those obtained in the original paper. In the case of Column (1), I got a t-statistic of 10.01 and a coefficient of 0.786. In the case of Column (3), my coefficient was 0.371 with a t -statistic of 5.82. These discrepancies could have been explained by several reasons. Depending on how the authors selected the initial working dataset, the dataset that I used might not be consistent with their original working dataset in terms of which observations it included and how it dealt with missing values. Sources of specific data cleaning may have been used, or the authors may have used some exclusion criteria that were not well defined in the published article. Also, not all statistical software packages and versions will support the implementation of clustered standard errors. Although with the vcovCL R command using the clustering variable of clus1000 as an input, the actual algorithmic implementation of calculating clustered standard error in Stata (apparently what the authors did) can yield slightly different values. The publicly accessible dataset can also have been modified or revised after the time of writing the paper, which will result in discrepancies in the sample structure or the values of the variables.

DDT104: Research Method 2 - Problem Set 1

Sarah AlAbdulkarim

12 November 2025