Submission Rules

Submit: The knitted HTML output or .Rmd file

PART A. Difference-in-Differences: use PS1.xlsx file from practice session 1

Research Question

Is the change in log(net_assets) after 2022 different for systemic banks compared to non-systemic banks?

Structural Equation

Let:

\(y_{it}\) = log(net_assets_it)
\(treated_i\) = 1 if bank is systemic
\(post_t\) = 1 if year >= 2022
\(DiD_{it}\) = treated_i × post_t

Estimate:

\[ y_{it} = \alpha + \beta_1 treated_i + \beta_2 post_t + \delta (treated_i \times post_t) + u_{it} \]
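To see what each coefficient measures, it may help to write out the conditional mean of the four group-period cells implied by the equation above. This is a sketch under the assumption that \(E[u_{it} \mid treated_i, post_t] = 0\); the cell notation on the left-hand side is introduced here only for illustration.

```latex
\[
\begin{aligned}
E[y_{it} \mid treated_i = 0,\ post_t = 0] &= \alpha \\
E[y_{it} \mid treated_i = 0,\ post_t = 1] &= \alpha + \beta_2 \\
E[y_{it} \mid treated_i = 1,\ post_t = 0] &= \alpha + \beta_1 \\
E[y_{it} \mid treated_i = 1,\ post_t = 1] &= \alpha + \beta_1 + \beta_2 + \delta
\end{aligned}
\]

\[
\underbrace{\big[(\alpha + \beta_1 + \beta_2 + \delta) - (\alpha + \beta_1)\big]}_{\text{change for systemic banks}}
-
\underbrace{\big[(\alpha + \beta_2) - \alpha\big]}_{\text{change for non-systemic banks}}
= \delta
\]
```

Under this reading, \(\beta_1\) is the pre-2022 gap between the two groups, \(\beta_2\) is the change over time for non-systemic banks, and \(\delta\) is the additional change for systemic banks.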

Load Data

df <- read.csv("/Users/illia/Downloads/PS1.csv")
head(df)

##   nkb                       bank          ROA loans_corp loans_hh  net_assets
## 1  46         АТ КБ "ПриватБанк" 4.292320e+00   14502070 42302983 386939574.0
## 2   6              АТ "Ощадбанк" 1.033015e+00   54311100  8914977 235722486.5
## 3   2          АТ "Укрексімбанк" 1.669166e-05   48449783   141701 192835491.9
## 4 274            АБ "УКРГАЗБАНК" 2.710253e-01   44330831  3186752 142780022.3
## 5 593  ПАТ "РОЗРАХУНКОВИЙ ЦЕНТР" 5.368773e-01          0        0    384649.6
## 6  36 АТ "Райффайзен Банк Аваль" 3.586104e+00   39976357  6038606 111547920.0
##   ownership   GB_TNA    L_TNA L_NFC_TNA    L_HH_TNA    D_TNA         TNA
## 1         1 50.47218 14.68060   3.74789 10.93271028 79.99600 386939574.0
## 2         1 44.58017 26.82225  23.04027  3.78197993 78.43384 235722486.5
## 3         1 30.63463 25.19841  25.12493  0.07348282 59.45002 192835491.9
## 4         1 29.98680 33.28027  31.04834  2.23193115 88.64743 142780022.3
## 5         1 30.36400  0.00000   0.00000  0.00000000 25.62168    384649.6
## 6         2 10.57642 41.25130  35.83783  5.41346396 79.03632 111547920.0
##        TNA_p       CIR sys year  gdp
## 1 21.3174614  35.00666   1 2020 -3.8
## 2 12.9865368  74.84880   1 2020 -3.8
## 3 10.6237858        NA   1 2020 -3.8
## 4  7.8661058  65.73606   1 2020 -3.8
## 5  0.0211913 101.73284   0 2020 -3.8
## 6  6.1454517  51.79713   1 2020 -3.8

Prepare Variables

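library(dplyr)  # mutate(), arrange(), %>%
library(psych)  # describe()

df <- df %>%
  mutate(nkb = as.character(nkb),
         year = as.integer(year),
         sys = as.integer(sys),
         post = as.integer(year >= 2022),   # 1 from 2022 onward
         treated = sys,                     # systemic banks form the treated group
         ln_net_assets = log(net_assets)    # outcome: log of net assets
  ) %>%
  arrange(nkb, year)

describe(df)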

##               vars   n        mean          sd     median     trimmed
## nkb*             1 335       36.91       21.36      37.00       36.88
## bank*            2 335       43.54       25.31      44.00       43.57
## ROA              3 335        0.79        3.05       0.91        1.07
## loans_corp       4 335  8044764.86 16219438.56 1053898.03  3743779.09
## loans_hh         5 335  2581825.26  8175652.63   48697.62   665238.09
## net_assets       6 335 37575226.57 89277782.81 5156326.64 16042671.76
## ownership        7 335        2.59        0.65       3.00        2.72
## GB_TNA           8 335       19.38       18.18      14.84       16.82
## L_TNA            9 335       29.12       18.59      27.81       28.10
## L_NFC_TNA       10 335       23.79       16.89      23.08       22.53
## L_HH_TNA        11 335        5.33       12.21       0.98        2.28
## D_TNA           12 335       67.88       21.06      75.00       71.35
## TNA             13 335 37575226.57 89277782.81 5156326.64 16042671.76
## TNA_p           14 335        1.50        3.42       0.23        0.66
## CIR             15 334       79.19       60.53      69.62       70.56
## sys             16 335        0.24        0.43       0.00        0.17
## year            17 335     2021.90        1.41    2022.00     2021.88
## gdp             18 335       -4.31       12.67       2.90       -2.50
## post            19 335        0.57        0.50       1.00        0.59
## treated         20 335        0.24        0.43       0.00        0.17
## ln_net_assets   21 335       15.73        1.88      15.46       15.66
##                      mad       min          max        range  skew kurtosis
## nkb*               28.17      1.00        73.00        72.00  0.01    -1.23
## bank*              32.62      1.00        87.00        86.00  0.00    -1.22
## ROA                 1.27    -21.09        13.40        34.49 -2.00    12.28
## loans_corp    1463850.97      0.00  90004245.09  90004245.09  2.76     7.48
## loans_hh        72199.09      0.00  79836630.95  79836630.95  5.58    37.83
## net_assets    6412014.43 183562.69 771835030.30 771651467.61  4.56    26.33
## ownership           0.00      1.00         3.00         2.00 -1.32     0.49
## GB_TNA             16.47      0.00        87.39        87.39  1.14     0.88
## L_TNA              19.81      0.00        79.11        79.11  0.43    -0.42
## L_NFC_TNA          18.08      0.00        73.66        73.66  0.58    -0.20
## L_HH_TNA            1.45      0.00        78.93        78.93  3.88    15.84
## D_TNA              14.34      0.00        96.72        96.72 -1.41     1.41
## TNA           6412014.43 183562.69 771835030.30 771651467.61  4.56    26.33
## TNA_p               0.28      0.01        23.39        23.39  4.06    19.31
## CIR                25.86     13.02       674.89       661.87  6.09    52.20
## sys                 0.00      0.00         1.00         1.00  1.22    -0.51
## year                1.48   2020.00      2024.00         4.00  0.09    -1.30
## gdp                 3.85    -28.80         5.50        34.30 -1.28    -0.09
## post                0.00      0.00         1.00         1.00 -0.28    -1.93
## treated             0.00      0.00         1.00         1.00  1.22    -0.51
## ln_net_assets       1.87     12.12        20.46         8.34  0.36    -0.61
##                       se
## nkb*                1.17
## bank*               1.38
## ROA                 0.17
## loans_corp     886162.59
## loans_hh       446683.62
## net_assets    4877766.34
## ownership           0.04
## GB_TNA              0.99
## L_TNA               1.02
## L_NFC_TNA           0.92
## L_HH_TNA            0.67
## D_TNA               1.15
## TNA           4877766.34
## TNA_p               0.19
## CIR                 3.31
## sys                 0.02
## year                0.08
## gdp                 0.69
## post                0.03
## treated             0.02
## ln_net_assets       0.10

Export the Prepared Data

# df already contains the prepared variables, so it can be exported directly
df_table <- df

write.csv(df_table, "HW1_Prepared_Data.csv", row.names = FALSE)

Estimate DiD

model_did <- lm(ln_net_assets ~ treated + post + treated:post, data = df)


summary(model_did)

## 
## Call:
## lm(formula = ln_net_assets ~ treated + post + treated:post, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0579 -0.8007  0.0591  0.8225  3.2493 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   14.8542     0.1182 125.639   <2e-16 ***
## treated        3.1072     0.2508  12.389   <2e-16 ***
## post           0.1598     0.1579   1.012    0.312    
## treated:post   0.2712     0.3263   0.831    0.406    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.251 on 331 degrees of freedom
## Multiple R-squared:  0.5597, Adjusted R-squared:  0.5557 
## F-statistic: 140.3 on 3 and 331 DF,  p-value: < 2.2e-16

Questions (Answer in text)

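Is there evidence that systemic banks changed net assets differently after 2022? The point estimate is positive: the DiD coefficient is 0.2712, meaning that log net assets of systemic banks rose by about 0.27 log points (roughly 31%) more than those of non-systemic banks after 2022. The estimate is imprecise, however, so the data do not provide reliable evidence of a different change.
Is it statistically significant? No. The standard error is 0.3263 (t = 0.831, p = 0.406), so the coefficient is not significant at any conventional level.
What is average net assets for systemically important banks before 2022? The mean of log(net_assets) for systemic banks before 2022 is 14.8542 + 3.1072 = 17.9614. Back-transforming gives exp(17.9614) ≈ 6.3 × 10^7 in the units of net_assets, which is the geometric mean rather than the arithmetic mean.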
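The arithmetic behind these answers can be checked with a short sketch. It only re-uses the four coefficients printed by summary(model_did); the object names (b, cell_means and so on) are hypothetical and chosen for illustration.

```r
# Coefficients copied from the summary(model_did) output in Part A
b <- c(alpha = 14.8542, treated = 3.1072, post = 0.1598, did = 0.2712)

# Mean log net assets implied for each group-period cell
cell_means <- c(
  control_pre  = b[["alpha"]],
  control_post = b[["alpha"]] + b[["post"]],
  treated_pre  = b[["alpha"]] + b[["treated"]],
  treated_post = sum(b)
)
print(round(cell_means, 4))

# Back-transforming the treated/pre cell gives a geometric mean of net_assets
geo_mean_treated_pre <- exp(cell_means[["treated_pre"]])
print(geo_mean_treated_pre)

# Approximate percentage interpretation of the DiD coefficient
pct_did <- 100 * (exp(b[["did"]]) - 1)
print(round(pct_did, 1))
```

Because the model is saturated in treated and post, these implied cell means should match the sample means of ln_net_assets in each cell up to rounding.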

PART B. Panel Estimators using WAGEPAN

We now compare pooled OLS, between, and first-difference estimators.

data("wagepan", package = "wooldridge")
df_w <- wagepan

Consider the following panel data model:

\[ y_{it} = \beta_0 + \beta_1 educ_{it} + \beta_2 exper_{it} +\beta_3 hours_{it}+ a_i + u_{it} \] where:

\(y_{it}\) = log wage of individual \(i\) at time \(t\)
\(educ_{it}\) = years of education
\(exper_{it}\) = labor market experience
\(hours_{it}\) = hours worked

1. Estimate pooled OLS

# Local CSV copy of wagepan, loaded only for inspection; the models below use df_w
df_b <- read.csv("/Users/illia/Downloads/wagepan.csv")
head(df_b)

##   nr year agric black bus construc ent exper fin hisp poorhlth hours manuf
## 1 13 1980     0     0   1        0   0     1   0    0        0  2672     0
## 2 13 1981     0     0   0        0   0     2   0    0        0  2320     0
## 3 13 1982     0     0   1        0   0     3   0    0        0  2940     0
## 4 13 1983     0     0   1        0   0     4   0    0        0  2960     0
## 5 13 1984     0     0   0        0   0     5   0    0        0  3071     0
## 6 13 1985     0     0   1        0   0     6   0    0        0  2864     0
##   married min nrthcen nrtheast occ1 occ2 occ3 occ4 occ5 occ6 occ7 occ8 occ9 per
## 1       0   0       0        1    0    0    0    0    0    0    0    0    1   0
## 2       0   0       0        1    0    0    0    0    0    0    0    0    1   1
## 3       0   0       0        1    0    0    0    0    0    0    0    0    1   0
## 4       0   0       0        1    0    0    0    0    0    0    0    0    1   0
## 5       0   0       0        1    0    0    0    0    1    0    0    0    0   1
## 6       0   0       0        1    0    1    0    0    0    0    0    0    0   0
##   pro pub rur south educ tra trad union    lwage d81 d82 d83 d84 d85 d86 d87
## 1   0   0   0     0   14   0    0     0 1.197540   0   0   0   0   0   0   0
## 2   0   0   0     0   14   0    0     1 1.853060   1   0   0   0   0   0   0
## 3   0   0   0     0   14   0    0     0 1.344462   0   1   0   0   0   0   0
## 4   0   0   0     0   14   0    0     0 1.433213   0   0   1   0   0   0   0
## 5   0   0   0     0   14   0    0     0 1.568125   0   0   0   1   0   0   0
## 6   0   0   0     0   14   0    0     0 1.699891   0   0   0   0   1   0   0
##   expersq
## 1       1
## 2       4
## 3       9
## 4      16
## 5      25
## 6      36

library(plm)        # panel estimators
library(stargazer)  # regression tables

# 1. Pooled OLS
model_pooled <- plm(lwage ~ educ + exper + hours, data = df_w, model = "pooling")

# 2. Between Estimator
model_between <- plm(lwage ~ educ + exper + hours, data = df_w, model = "between")

# 3. First Differences
# Note: educ may drop out here if it does not vary over time!
model_fd <- plm(lwage ~ educ + exper + hours, data = df_w, model = "fd")

# 4. Report (stargazer)
stargazer(model_pooled, model_between, model_fd, type = "text")

## 
## ========================================================================================
##                                          Dependent variable:                            
##              ---------------------------------------------------------------------------
##                                                 lwage                                   
##                         (1)                      (2)                      (3)           
## ----------------------------------------------------------------------------------------
## educ                 0.110***                 0.097***                                  
##                       (0.005)                  (0.011)                                  
##                                                                                         
## exper                0.059***                 0.036***                                  
##                       (0.003)                  (0.012)                                  
##                                                                                         
## hours               -0.0001***                 0.00000                -0.0002***        
##                      (0.00001)                (0.00004)                (0.00001)        
##                                                                                         
## Constant               0.098                    0.269                  0.080***         
##                       (0.066)                  (0.199)                  (0.007)         
##                                                                                         
## ----------------------------------------------------------------------------------------
## Observations           4,360                     545                     3,815          
## R2                     0.146                    0.134                    0.059          
## Adjusted R2            0.146                    0.129                    0.058          
## F Statistic  248.918*** (df = 3; 4356) 27.938*** (df = 3; 541) 237.729*** (df = 1; 3813)
## ========================================================================================
## Note:                                                        *p<0.1; **p<0.05; ***p<0.01

Questions

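Interpret significant coefficients. - The educ and exper coefficients are highly significant: one more year of education is associated with about 0.110 log points (roughly 11%) higher wages, and one more year of experience with about 0.059 log points (roughly 6%), holding the other regressors fixed. The hours coefficient is also significant but very small: about -0.0001 per hour, so 100 additional hours are associated with a wage lower by something on the order of 1%.
What assumptions are required for unbiasedness? The model must be linear in parameters, the sample must be random, there must be no perfect collinearity among the regressors, and the composite error \(a_i + u_{it}\) must have zero conditional mean given the regressors. The last condition requires in particular that the individual effect \(a_i\) is uncorrelated with educ, exper and hours.
Why might pooled OLS be biased in panel settings? Because pooled OLS leaves the individual effect \(a_i\) in the error term. If \(a_i\) (for example, ability) is correlated with a regressor such as educ, the coefficients suffer from omitted-variable (heterogeneity) bias. In addition, \(a_i\) makes the errors of the same individual correlated over time, so the default standard errors are unreliable.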
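The heterogeneity bias described in the last answer can be illustrated with a small simulation. This is a minimal sketch on made-up data, not on wagepan: the data frame toy, the true slope of 1 and the way x is made to depend on the individual effect are all assumptions chosen for illustration.

```r
set.seed(123)
n_id <- 500   # individuals
n_t  <- 5     # periods per individual

toy <- data.frame(
  id = rep(seq_len(n_id), each = n_t),
  t  = rep(seq_len(n_t), times = n_id)
)
a <- rnorm(n_id)                                # unobserved individual effect a_i
toy$a <- a[toy$id]
toy$x <- toy$a + rnorm(nrow(toy))               # regressor correlated with a_i
toy$y <- 1 * toy$x + toy$a + rnorm(nrow(toy))   # true slope is 1

# Pooled OLS ignores a_i, which ends up in the error term
b_pooled <- coef(lm(y ~ x, data = toy))[["x"]]

# First differences within each individual remove a_i
toy$dy <- ave(toy$y, toy$id, FUN = function(v) c(NA, diff(v)))
toy$dx <- ave(toy$x, toy$id, FUN = function(v) c(NA, diff(v)))
b_fd <- coef(lm(dy ~ dx, data = toy))[["dx"]]

print(c(pooled = b_pooled, first_diff = b_fd))
```

In this design the pooled slope should settle near 1.5 (the true slope plus cov(x, a)/var(x) = 1/2), while the first-difference slope should stay close to 1.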

2. Estimate between Estimator

m_be <- plm(lwage ~ educ + exper + hours, 
            data=df_w, 
            index = c("nr","year"), 
            model="between")
summary(m_be)

## Oneway (individual) effect Between Model
## 
## Call:
## plm(formula = lwage ~ educ + exper + hours, data = df_w, model = "between", 
##     index = c("nr", "year"))
## 
## Balanced Panel: n = 545, T = 8, N = 4360
## Observations used in estimation: 545
## 
## Residuals:
##        Min.     1st Qu.      Median     3rd Qu.        Max. 
## -1.20409033 -0.24853220  0.00038719  0.25961118  1.67324865 
## 
## Coefficients:
##               Estimate Std. Error t-value  Pr(>|t|)    
## (Intercept) 2.6861e-01 1.9857e-01  1.3527  0.176716    
## educ        9.6904e-02 1.1007e-02  8.8042 < 2.2e-16 ***
## exper       3.6441e-02 1.1640e-02  3.1307  0.001838 ** 
## hours       1.3105e-06 4.1122e-05  0.0319  0.974589    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    83.06
## Residual Sum of Squares: 71.918
## R-Squared:      0.13414
## Adj. R-Squared: 0.12934
## F-statistic: 27.9383 on 3 and 541 DF, p-value: < 2.22e-16

Questions

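What variation does the between estimator use? - Only the between (cross-sectional) variation: it runs OLS on each individual's time averages of lwage, educ, exper and hours, so all variation over time within an individual is discarded.
How do results differ from pooled OLS? - Pooled OLS treats all 4,360 person-year observations as independent, whereas the between estimator uses one time-averaged observation per person (545 observations). Here the between coefficients on educ (0.097 vs. 0.110) and exper (0.036 vs. 0.059) are smaller, and the hours coefficient is essentially zero and insignificant, whereas it is negative and significant in pooled OLS. In unbalanced panels pooled OLS also gives more weight to units observed more often; this panel is balanced (T = 8), so that difference does not arise here.
What are pros and cons? - Pros: it can estimate coefficients on time-invariant variables such as educ, and averaging over time makes it less sensitive to measurement error. Cons: it discards the within-individual information, works with a much smaller sample (545 instead of 4,360 observations), and remains biased if the individual effect is correlated with the regressors.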
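The idea that the between estimator is OLS on individual time averages can be sketched by hand in base R. The example uses made-up data; the data frame toy and its columns id, t, x and y are hypothetical names, not variables from wagepan.

```r
set.seed(42)
n_id <- 100   # individuals
n_t  <- 8     # periods per individual

toy <- data.frame(
  id = rep(seq_len(n_id), each = n_t),
  t  = rep(seq_len(n_t), times = n_id)
)
toy$x <- rnorm(n_id)[toy$id] + rnorm(nrow(toy))   # individual level plus yearly noise
toy$y <- 0.5 + 1 * toy$x + rnorm(nrow(toy))

# Step 1: collapse the panel to one row of time averages per individual
means <- aggregate(cbind(y, x) ~ id, data = toy, FUN = mean)

# Step 2: ordinary least squares on the averaged data
between_fit <- lm(y ~ x, data = means)

print(nrow(means))        # one observation per individual
print(coef(between_fit))
```

The regression in step 2 uses n_id observations rather than n_id times n_t, which mirrors the drop from 4,360 to 545 observations in the plm output.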

3. Estimate first Differences Estimator

m_fd <- plm(
 lwage ~ educ + exper + hours,
 data = df_w,
 index = c("nr", "year"),
 model = "fd"
)

summary(m_fd)

## Oneway (individual) effect First-Difference Model
## 
## Call:
## plm(formula = lwage ~ educ + exper + hours, data = df_w, model = "fd", 
##     index = c("nr", "year"))
## 
## Balanced Panel: n = 545, T = 8, N = 4360
## Observations used in estimation: 3815
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -4.534461 -0.135898 -0.024686  0.114997  4.661021 
## 
## Coefficients:
##                Estimate  Std. Error t-value  Pr(>|t|)    
## (Intercept)  8.0454e-02  7.0220e-03  11.457 < 2.2e-16 ***
## hours       -2.2272e-04  1.4445e-05 -15.418 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    751.19
## Residual Sum of Squares: 707.11
## R-Squared:      0.058688
## Adj. R-Squared: 0.058441
## F-statistic: 237.729 on 1 and 3813 DF, p-value: < 2.22e-16

Questions

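Why are education and experience excluded? Education is constant over time for each individual in this sample, so its first difference is zero and it disappears. Experience increases by exactly 1 every year, so its first difference equals 1 for every observation; it is perfectly collinear with the intercept of the differenced equation and is dropped.
Compare results to pooled and between. Pooled OLS and between estimates are likely biased by unobserved individual heterogeneity, while the first-difference model removes time-invariant factors by using only within-individual changes. Here FD identifies only the hours coefficient, which is -0.0002 compared with -0.0001 in pooled OLS and approximately zero in the between model. The educ and exper coefficients cannot be compared because they are not identified in FD; the FD intercept (0.0805) captures average annual growth in log wages, which includes the effect of one more year of experience.
What problem does FD solve? Endogeneity caused by time-invariant unobserved individual heterogeneity \(a_i\), which differencing removes.
What are disadvantages of FD? One period per individual is lost (4,360 observations become 3,815), coefficients on time-invariant variables cannot be estimated, and the estimator relies only on changes over time, which can amplify the influence of measurement error.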
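The answers above follow from differencing the Part B model between two consecutive years. The short derivation below is a sketch using the notation of the model stated earlier, with \(\Delta\) denoting the change from \(t-1\) to \(t\); it assumes, as observed in this sample, that \(educ_{it}\) does not change and \(exper_{it}\) rises by one each year.

```latex
\[
\begin{aligned}
y_{it} - y_{i,t-1}
  &= \beta_1 (educ_{it} - educ_{i,t-1}) + \beta_2 (exper_{it} - exper_{i,t-1}) \\
  &\quad + \beta_3 (hours_{it} - hours_{i,t-1}) + (a_i - a_i) + (u_{it} - u_{i,t-1}) \\
\Delta y_{it}
  &= \beta_1 \cdot 0 + \beta_2 \cdot 1 + \beta_3 \Delta hours_{it} + \Delta u_{it} \\
  &= \beta_2 + \beta_3 \Delta hours_{it} + \Delta u_{it}
\end{aligned}
\]
```

Under the model as written, the intercept of the differenced regression therefore plays the role of \(\beta_2\); in practice it would also absorb any common wage growth over time, so it should not be read as a clean estimate of the return to experience.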

4. Use `stargazer` and report all regressions

# 1. Load the package (if it is not already loaded)
library(stargazer)

# 2. Print the comparison table
stargazer(model_pooled, m_be, model_fd, 
          type = "text", 
          column.labels = c("Pooled OLS", "Between", "First Diff"),
          title = "Comparison of Panel Estimators for Wages",
          digits = 4)

## 
## Comparison of Panel Estimators for Wages
## ===========================================================================================
##                                           Dependent variable:                              
##              ------------------------------------------------------------------------------
##                                                  lwage                                     
##                      Pooled OLS                 Between                  First Diff        
##                         (1)                       (2)                       (3)            
## -------------------------------------------------------------------------------------------
## educ                 0.1097***                 0.0969***                                   
##                       (0.0046)                  (0.0110)                                   
##                                                                                            
## exper                0.0591***                 0.0364***                                   
##                       (0.0029)                  (0.0116)                                   
##                                                                                            
## hours                -0.0001***                 0.000001                 -0.0002***        
##                      (0.00001)                 (0.00004)                 (0.00001)         
##                                                                                            
## Constant               0.0977                    0.2686                  0.0805***         
##                       (0.0657)                  (0.1986)                  (0.0070)         
##                                                                                            
## -------------------------------------------------------------------------------------------
## Observations           4,360                      545                      3,815           
## R2                     0.1463                    0.1341                    0.0587          
## Adjusted R2            0.1458                    0.1293                    0.0584          
## F Statistic  248.9177*** (df = 3; 4356) 27.9383*** (df = 3; 541) 237.7293*** (df = 1; 3813)
## ===========================================================================================
## Note:                                                           *p<0.1; **p<0.05; ***p<0.01

Final Comparison

When would you prefer pooled OLS? When there is no unobserved individual heterogeneity, or when it is uncorrelated with the regressors.
When would you prefer between? When I am specifically interested in long-run differences between individuals rather than changes over time.
When would you prefer FD? When I suspect that time-invariant unobserved factors are correlated with the regressors.
Which estimator is most credible here and why? The first-difference (FD) estimator is the most credible for the hours coefficient, because differencing removes time-invariant unobserved heterogeneity such as ability. Statistical significance alone does not make an estimator credible, and FD cannot identify the returns to educ or exper in this sample.

BONUS TASK (+1 point)

In a simple 2×2 Difference-in-Differences setup without additional controls, the DiD coefficient from the regression

\[ y_{it} = \alpha + \beta_1 treated_i + \beta_2 post_t + \delta(treated_i \times post_t) + u_{it} \]

must be identical (up to rounding) to the manual DiD computed from group means.

1) Construct the 2×2 Mean Table

Using the dataset from Part A, compute the mean of log(net_assets) for each treatment–period cell:

Treated (systemic), Pre (year < 2022): \(\bar{Y}_{T,pre}\)
Treated (systemic), Post (year ≥ 2022): \(\bar{Y}_{T,post}\)
Control (non-systemic), Pre (year < 2022): \(\bar{Y}_{C,pre}\)
Control (non-systemic), Post (year ≥ 2022): \(\bar{Y}_{C,post}\)

Fill in the table below:

                              Pre (year < 2022)        Post (year ≥ 2022)
Systemic Banks (T = 1)        \(\bar{Y}_{T,pre}\)      \(\bar{Y}_{T,post}\)
Non-Systemic Banks (T = 0)    \(\bar{Y}_{C,pre}\)      \(\bar{Y}_{C,post}\)
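One way to build this 2×2 table is with tapply() over the treated and post indicators. The sketch below uses a tiny made-up data frame (toy, with invented values) as a stand-in for the Part A data; with the real data, the same call would be applied to df and its columns treated, post and ln_net_assets.

```r
# Hypothetical stand-in for the Part A data (values are made up for illustration)
toy <- data.frame(
  treated       = c(1, 1, 1, 1, 0, 0, 0, 0),
  post          = c(0, 0, 1, 1, 0, 0, 1, 1),
  ln_net_assets = c(17.8, 18.0, 18.3, 18.5, 14.7, 14.9, 15.0, 15.2)
)

# Rows index treated (0/1), columns index post (0/1)
means <- with(toy, tapply(ln_net_assets,
                          list(treated = treated, post = post),
                          mean))
print(means)
```

Row "1" of the result corresponds to systemic banks and column "1" to the post period, so means["1", "0"] fills the top-left cell of the table.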

2) Compute the Manual DiD Estimator

Using the values from the table above, compute:

\[ \widehat{DiD} = (\bar{Y}_{T,post} - \bar{Y}_{T,pre}) - (\bar{Y}_{C,post} - \bar{Y}_{C,pre}) \]

3) Compare with the Regression DiD Coefficient

Estimate the DiD regression from Part A:

\[ y_{it} = \alpha + \beta_1 treated_i + \beta_2 post_t + \delta (treated_i \times post_t) + u_{it} \]

Extract the interaction coefficient \(\hat{\delta}\) and compare it to \(\widehat{DiD}\).
They should be numerically identical (up to rounding).
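The comparison can be sketched end to end on simulated data. Everything in the example is an assumption made for illustration: the data frame toy, the outcome y, the cell probabilities and the true effect of 0.3. The point is only that the manual DiD and the interaction coefficient coincide, even with unequal cell sizes.

```r
set.seed(1)
n <- 200

# Simulated 2x2 design with unequal cell sizes
toy <- data.frame(
  treated = rbinom(n, 1, 0.3),
  post    = rbinom(n, 1, 0.6)
)
toy$y <- 15 + 3 * toy$treated + 0.2 * toy$post +
  0.3 * toy$treated * toy$post + rnorm(n)

# Manual DiD from the four cell means
cell <- with(toy, tapply(y, list(treated, post), mean))
did_manual <- (cell["1", "1"] - cell["1", "0"]) -
  (cell["0", "1"] - cell["0", "0"])

# Regression DiD: coefficient on the interaction term
fit <- lm(y ~ treated + post + treated:post, data = toy)
did_reg <- coef(fit)[["treated:post"]]

print(c(manual = did_manual, regression = did_reg))
print(all.equal(did_manual, did_reg))
```

The two numbers agree because the regression is saturated: with only treated, post and their interaction, the fitted values are exactly the four cell means.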

Homework 1

Student Name

2026-02-22
