Introduction

This part focuses on panel data. Panel data consist of repeated observations on the same individuals over time. For now we are dealing with micro panel data (many individuals, few time periods), so we do not have to worry about the unit-root issues that arise in time series. This note has two parts: Part I replicates the material from Lecture 4, and Part II solves the Lab exercise.

Part I Replicating Lecture 4

1.1 Creating panel data

Before doing any analysis, you have to clean your data, that is, organize it into a form the computer can read and manipulate. For this step you can use any tool: Stata, R, or Excel. I will focus on Stata and R. Later you will see that R has a considerable advantage when it comes to cleaning and reshaping data.



.  cd "Z:\L14009\Dataset_part_1"
Z:\L14009\Dataset_part_1

. 
. use bhps_1991.dta 

. append using bhps_1992.dta

. append using bhps_1993.dta 

. 
. sort pid 

. 
. describe 

Contains data from bhps_1991.dta
  obs:        29,709                          
 vars:           142                          3 Oct 2008 12:53
 size:     5,496,165                          
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
adoid           byte    %8.0g      adoid      date of interview: day
adoim           byte    %8.0g      adoim      date of interview: month
aplbornc        byte    %8.0g      aplbornc   country of birth
asex            byte    %8.0g      asex       sex
apaju           byte    %8.0g      apaju      father not working when resp.
                                                aged 14
apasoc          int     %8.0g      apasoc     father's occupation (soc), resp.
                                                aged 14
apasemp         byte    %8.0g      apasemp    father self employed, resp. aged
                                                14
apaboss         byte    %8.0g      apaboss    father had employees, resp. aged
                                                14
apamngr         byte    %8.0g      apamngr    father was manager, resp. aged 14
amaju           byte    %8.0g      amaju      mother not working when resp.
                                                aged 14
amasoc          int     %8.0g      amasoc     mother's occupation (soc), resp.
                                                aged 14
amasemp         byte    %8.0g      amasemp    mother self employed, resp. aged
                                                14
amaboss         byte    %8.0g      amaboss    mother had employees, resp. aged
                                                14
amamngr         byte    %8.0g      amamngr    mother was manager, resp. aged 14
amlstat         byte    %8.0g      amlstat    present legal marital status
aschool         byte    %8.0g      aschool    never went to /still at school
ascend          byte    %8.0g      ascend     school leaving age
asctype         byte    %8.0g      asctype    type of school attended
ascnow          byte    %8.0g      ascnow     still at school
afetype         byte    %8.0g      afetype    type of further education
                                                attended
afenow          byte    %8.0g      afenow     still in further education
afeend          byte    %8.0g      afeend     further education leaving age
asmoker         byte    %8.0g      asmoker    smoker
ancigs          byte    %8.0g      ancigs     number of cigarettes smoked
arach16         byte    %8.0g      arach16    responsible for dependent child
                                                under 16
ajbhas          byte    %8.0g      ajbhas     did paid work last week
ajboff          byte    %8.0g      ajboff     no work last week but has job
ajboffy         byte    %8.0g      ajboffy    reason off work last week
ajbsoc          int     %8.0g      ajbsoc     occupation (soc): current main
                                                job
ajbsic          int     %8.0g      ajbsic     industry (sic) of employer:
                                                current job
ajbsemp         byte    %8.0g      ajbsemp    employee or self-employed:
                                                current job
ajbsize         byte    %8.0g      ajbsize    number employed at workplace:
                                                current jo
ajbhrs          byte    %8.0g      ajbhrs     no. of hours normally worked per
                                                week
atujbpl         byte    %8.0g      atujbpl    union or staff association at
                                                workplace
atuin1          byte    %8.0g      atuin1     member of workplace union
ajbbgd          byte    %8.0g      ajbbgd     day started current job
ajbbgm          byte    %8.0g      ajbbgm     month started current job
ahunurs         byte    %8.0g      ahunurs    who cares for ill children
ajbstat         byte    %8.0g      ajbstat    current labour force status
arace           byte    %8.0g      arace      ethnic group membership
pid             long    %12.0g                cross-wave person identifier
aregion         byte    %8.0g      aregion    region / metropolitan area
aage            byte    %8.0g      aage       age at date of interview
anchild         byte    %8.0g      anchild    number of own children in
                                                household
aqfedhi         byte    %8.0g      aqfedhi    highest educational qualification
aqfvoc          byte    %8.0g      aqfvoc     has vocational qualifications
aqfachi         byte    %8.0g      aqfachi    highest academic qualification
ajbft           byte    %8.0g      ajbft      employed full time
apaygu          double  %10.0g     apaygu     usual gross pay per month:
                                                current job
acjsten         int     %8.0g      acjsten    length (days) of current labour
                                                market
ayr2uk4         int     %8.0g      ayr2uk4    year came to britain: 4 digit
ajbbgy4         int     %8.0g      ajbbgy4    year started current job: 4 digit
bdoid           byte    %8.0g                 date of interview: day
bdoim           byte    %8.0g      bdoim      date of interview: month
bivlyr          byte    %8.0g      bivlyr     ic: interviewed last year
bsex            byte    %8.0g      bsex       sex
bjbstat         byte    %8.0g      bjbstat    current economic activity
bplbornc        byte    %8.0g      bplbornc   country of birth
brace           byte    %8.0g      brace      ethnic group membership
bschool         byte    %8.0g      bschool    never went to /still at school
bscend          byte    %8.0g      bscend     school leaving age
bsctype         byte    %8.0g      bsctype    type of school attended
bscnow          byte    %8.0g      bscnow     still at school
bfetype         byte    %8.0g      bfetype    type of further education
                                                attended
bfenow          byte    %8.0g      bfenow     still in further education
bfeend          byte    %8.0g      bfeend     further education leaving age
bsmoker         byte    %8.0g      bsmoker    smoker
bncigs          byte    %8.0g      bncigs     number of cigarettes smoked
bmlstat         byte    %8.0g      bmlstat    present legal marital status
bjbhas          byte    %8.0g      bjbhas     did paid work last week
bjboff          byte    %8.0g      bjboff     no work last week but has job
bjboffy         byte    %8.0g      bjboffy    reason off work last week
bjbsoc          int     %8.0g      bjbsoc     occupation (soc): current main
                                                job
bjbsic          int     %8.0g      bjbsic     industry (sic) of employer:
                                                current job
bjbsemp         byte    %8.0g      bjbsemp    employee or self-employed:
                                                current job
bjbsize         byte    %8.0g      bjbsize    no. employed at workplace:
                                                current job
bjbhrs          byte    %8.0g      bjbhrs     no. of hours normally worked per
                                                week
bjbbgd          byte    %8.0g      bjbbgd     day started current job
bjbbgm          byte    %8.0g      bjbbgm     month started current job
btujbpl         byte    %8.0g      btujbpl    union or staff association at
                                                workplace
btuin1          byte    %8.0g      btuin1     member of workplace union
bjbed           byte    %8.0g      bjbed      had work related training since
                                                1.9.91
bhunurs         byte    %8.0g      bhunurs    who cares for ill children
bage            byte    %8.0g      bage       age at date of interview
bnchild         byte    %8.0g      bnchild    number of own children in
                                                household
brach16         byte    %8.0g      brach16    whether responsible adult for
                                                child
bsampst         byte    %8.0g      bsampst    sample membership status
bregion         byte    %8.0g      bregion    region / metropolitan area
bqfedhi         byte    %8.0g      bqfedhi    highest educational qualification
bqfvoc          byte    %8.0g      bqfvoc     has vocational qualifications
bqfachi         byte    %8.0g      bqfachi    highest academic qualification
bjbft           byte    %8.0g      bjbft      employed full time
bpaygu          double  %10.0g     bpaygu     usual gross pay per month:
                                                current job
bcjsten         int     %8.0g      bcjsten    length (days) of current labour
                                                market
bdoiy4          int     %8.0g      bdoiy4     date of interview: 4 digit year
byr2uk4         int     %8.0g      byr2uk4    year came to britain: 4 digit
bjbbgy4         int     %8.0g      bjbbgy4    year started current job: 4 digit
cdoid           byte    %8.0g                 date of interview: day
cdoim           byte    %8.0g      cdoim      date of interview: month
civievr         byte    %8.0g      civievr    ever interviewed
csex            byte    %8.0g      csex       sex
cjbstat         byte    %8.0g      cjbstat    current economic activity
cmlstat         byte    %8.0g      cmlstat    present legal marital status
cplbornc        byte    %8.0g      cplbornc   country of birth
crace           byte    %8.0g      crace      ethnic group membership
cschool         byte    %8.0g      cschool    never went to /still at school
cscend          byte    %8.0g      cscend     school leaving age
csctype         byte    %8.0g      csctype    type of school attended
cscnow          byte    %8.0g      cscnow     still at school
cfetype         byte    %8.0g      cfetype    type of further education
                                                attended
cfenow          byte    %8.0g      cfenow     still in further education
cfeend          byte    %8.0g      cfeend     further education leaving age
csmoker         byte    %8.0g      csmoker    smoker
cncigs          byte    %8.0g      cncigs     number of cigarettes smoked
cjbhas          byte    %8.0g      cjbhas     did paid work last week
cjboff          byte    %8.0g      cjboff     no work last week but has job
cjboffy         byte    %8.0g      cjboffy    reason off work last week
cjbsoc          int     %8.0g      cjbsoc     occupation (soc): current main
                                                job
cjbsic          int     %8.0g      cjbsic     industry (sic) of employer:
                                                current job
cjbsemp         byte    %8.0g      cjbsemp    employee or self-employed:
                                                current job
cjbsize         byte    %8.0g      cjbsize    no. employed at workplace:
                                                current job
cjbhrs          byte    %8.0g      cjbhrs     no. of hours normally worked per
                                                week
cjbbgd          byte    %8.0g      cjbbgd     day started current job
cjbbgm          byte    %8.0g      cjbbgm     month started current job
ctujbpl         byte    %8.0g      ctujbpl    union or staff association at
                                                workplace
ctuin1          byte    %8.0g      ctuin1     member of workplace union
cjbed           byte    %8.0g      cjbed      had work related training since
                                                1.9.92
chunurs         byte    %8.0g      chunurs    who cares for ill children
cage            byte    %8.0g      cage       age at date of interview
cnchild         byte    %8.0g      cnchild    number of own children in
                                                household
crach16         byte    %8.0g      crach16    whether responsible adult for
                                                child
csampst         byte    %8.0g      csampst    sample membership status
cregion         byte    %8.0g      cregion    region / metropolitan area
cqfedhi         byte    %8.0g      cqfedhi    highest educational qualification
cqfvoc          byte    %8.0g      cqfvoc     has vocational qualifications
cqfachi         byte    %8.0g      cqfachi    highest academic qualification
cjbft           byte    %8.0g      cjbft      employed full time
cpaygu          double  %10.0g     cpaygu     usual gross pay per month:
                                                current job
ccjsten         int     %8.0g      ccjsten    length (days) current labour
                                                market sp.
cdoiy4          int     %8.0g      cdoiy4     date of interview: 4 digit year
cyr2uk4         int     %8.0g      cyr2uk4    year came to britain: 4 digit
cjbbgy4         int     %8.0g      cjbbgy4    year started current job: 4 digit
-------------------------------------------------------------------------------
Sorted by: pid
     Note: Dataset has changed since last saved.

. 
. * If you check the variable names carefully, you will find the pattern a_, b_, c_;
. * the content after prefix a/b/c is the same. 
. * a means 1991; b means 1992; c means 1993 
. 
. * Let us list some data to take a look. 
. 
. 
. list pid *age *sex if _n <= 10, sepby(pid)

       +----------------------------------------------------------+
       |      pid   aage   bage   cage     asex     bsex     csex |
       |----------------------------------------------------------|
    1. | 10002251     91      .      .   female        .        . |
       |----------------------------------------------------------|
    2. | 10004491      .     29      .        .     male        . |
    3. | 10004491     28      .      .     male        .        . |
       |----------------------------------------------------------|
    4. | 10004521      .      .     28        .        .     male |
    5. | 10004521     26      .      .     male        .        . |
    6. | 10004521      .     27      .        .     male        . |
       |----------------------------------------------------------|
    7. | 10007857     57      .      .   female        .        . |
    8. | 10007857      .      .     59        .        .   female |
    9. | 10007857      .     59      .        .   female        . |
       |----------------------------------------------------------|
   10. | 10014578      .     55      .        .   female        . |
       +----------------------------------------------------------+

. 
. * _n is the number of the current observation, so _n <= 10 keeps the first ten rows.
. 

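As you can see, age and sex are each stored in three separate variables (aage, bage, cage and asex, bsex, csex), one per wave. This is because we appended three datasets whose variable names carry different wave prefixes. To clean the data, we first need to understand the structure of the dataset and then give the variables the same names in every wave. For how to use rename or renpfix, see the Stata help for these commands.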



. cd "Z:\L14009\Dataset_part_1"
Z:\L14009\Dataset_part_1

. 
. clear 

. use bhps_1991

. renpfix a

. generate year=1991

. save wave1, replace 
file wave1.dta saved

. 
. * renpfix a removes the prefix a from every variable name that starts with a
. 
. clear 

. use bhps_1992

. renpfix b

. generate year=1992

. save wave2, replace 
file wave2.dta saved

. 
. clear 

. use bhps_1993

. renpfix c

. generate year=1993

. save wave3, replace 
file wave3.dta saved

. 
. list pid year sex age if _n <= 10, sepby(pid)

      +--------------------------------+
      |      pid   year      sex   age |
      |--------------------------------|
   1. | 10004521   1993     male    28 |
      |--------------------------------|
   2. | 10007857   1993   female    59 |
      |--------------------------------|
   3. | 20002092   1993   female    26 |
      |--------------------------------|
   4. | 10014578   1993   female    56 |
      |--------------------------------|
   5. | 10014608   1993     male    59 |
      |--------------------------------|
   6. | 10016813   1993     male    37 |
      |--------------------------------|
   7. | 10016848   1993   female    33 |
      |--------------------------------|
   8. | 10017933   1993   female    51 |
      |--------------------------------|
   9. | 10017968   1993     male    48 |
      |--------------------------------|
  10. | 10019057   1993   female    61 |
      +--------------------------------+

. 
. * now we have year = 1993, and the prefixed names (aage, cage, asex, etc.) are gone
. 
. ** The data is quite clean now. 
. 

Once the data are clean, we can combine the three waves.



. 
. cd "Z:\L14009\Dataset_part_1"
Z:\L14009\Dataset_part_1

. 
. use wave1

. append using wave2

. append using wave3

. sort pid year 

. list pid year sex age if _n <= 10, sepby(pid)

       +--------------------------------+
       |      pid   year      sex   age |
       |--------------------------------|
    1. | 10002251   1991   female    91 |
       |--------------------------------|
    2. | 10004491   1991     male    28 |
    3. | 10004491   1992     male    29 |
       |--------------------------------|
    4. | 10004521   1991     male    26 |
    5. | 10004521   1992     male    27 |
    6. | 10004521   1993     male    28 |
       |--------------------------------|
    7. | 10007857   1991   female    57 |
    8. | 10007857   1992   female    59 |
    9. | 10007857   1993   female    59 |
       |--------------------------------|
   10. | 10014578   1991   female    54 |
       +--------------------------------+

. 
. save wave_final, replace 
file wave_final.dta saved

. 

The crucial points to remember when creating panel data are:

  1. You need one variable that identifies each unit and another that identifies each time period.

  2. You can append cross-section datasets to each other to create the panel only if all your variables have the same names in every wave.
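The two points above can be illustrated outside Stata. The following is a minimal Python sketch, not the lecture's own code: the toy records, the `waves` dictionary and the helper `strip_prefix` are all made up for illustration. It removes the wave prefix from each variable name, adds a `year` identifier, and stacks the waves into one long panel sorted by unit and time.

```python
# Toy stand-ins for the three wave files; all values are invented.
waves = {
    1991: ("a", [{"pid": 1, "aage": 26, "asex": "male"},
                 {"pid": 2, "aage": 57, "asex": "female"}]),
    1992: ("b", [{"pid": 1, "bage": 27, "bsex": "male"}]),
    1993: ("c", [{"pid": 1, "cage": 28, "csex": "male"},
                 {"pid": 2, "cage": 59, "csex": "female"}]),
}


def strip_prefix(row, prefix, keep=("pid",)):
    """Drop the wave prefix from every variable name except the identifiers."""
    out = {}
    for name, value in row.items():
        if name not in keep and name.startswith(prefix):
            name = name[len(prefix):]
        out[name] = value
    return out


panel = []
for year, (prefix, rows) in waves.items():
    for row in rows:
        clean = strip_prefix(row, prefix)
        clean["year"] = year  # time identifier, one value per wave
        panel.append(clean)

# Sort by unit identifier, then by time identifier.
panel.sort(key=lambda r: (r["pid"], r["year"]))

for r in panel:
    print(r["pid"], r["year"], r["sex"], r["age"])
```

Stacking only works here because, after the prefix is removed, every wave uses the same names (`age`, `sex`), which is exactly point 2.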


1.2 Describing panel data

We can use all the basic commands we discussed in Lecture 2 to analyse panel data. However, it is crucial to realise that we now have repeated observations on the same individuals.



. cd "Z:\L14009\Dataset_part_1"
Z:\L14009\Dataset_part_1

. 
. use wave_final 

. 
. tab sex 

           sex  |      Freq.     Percent        Cum.
----------------+-----------------------------------
        male    |     13,939       46.92       46.92
        female  |     15,770       53.08      100.00
----------------+-----------------------------------
          Total |     29,709      100.00

. 
. tab year sex 

           |         sex 
      year |   male       female  |     Total
-----------+----------------------+----------
      1991 |     4,833      5,431 |    10,264 
      1992 |     4,630      5,215 |     9,845 
      1993 |     4,476      5,124 |     9,600 
-----------+----------------------+----------
     Total |    13,939     15,770 |    29,709 

. 
. xtset pid year 
       panel variable:  pid (unbalanced)
        time variable:  year, 1991 to 1993, but with gaps
                delta:  1 unit

. 
. * set the dataset as panel data 
. 
. xtdes 

     pid:  10002251, 10004491, ..., 37763717                 n =      11754
    year:  1991, 1992, ..., 1993                             T =          3
           Delta(year) = 1 unit
           Span(year)  = 3 periods
           (pid*year uniquely identifies each observation)

Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                         1       1       2         3         3       3       3

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+---------
     8170     69.51   69.51 |  111
     1045      8.89   78.40 |  1..
      800      6.81   85.21 |  11.
      615      5.23   90.44 |  ..1
      566      4.82   95.25 |  .11
      309      2.63   97.88 |  .1.
      249      2.12  100.00 |  1.1
 ---------------------------+---------
    11754    100.00         |  XXX

. 
. drop if paygu <= 0
(15,037 observations deleted)

. 
. xtsum paygu 

Variable         |      Mean   Std. Dev.       Min        Max |    Observations
-----------------+--------------------------------------------+----------------
paygu    overall |  962.6754   719.4627   8.666667   13010.01 |     N =   14672
         between |             698.7412   8.666667   9173.723 |     n =    6590
         within  |             173.1106  -1205.659   4988.588 | T-bar =  2.2264

. 
. * be careful about 'between' and 'within'
. 
. * see this example 
. 
. xttab region 

                  Overall             Between            Within
   region |    Freq.  Percent      Freq.  Percent        Percent
----------+-----------------------------------------------------
 inner lo |     573      3.91       293      4.45          93.40
 outer lo |     940      6.41       434      6.59          95.47
 r. of so |    2841     19.36      1299     19.71          97.66
 south we |    1303      8.88       595      9.03          98.60
 east ang |     511      3.48       234      3.55          98.43
 east mid |    1169      7.97       539      8.18          97.71
 west mid |     527      3.59       254      3.85          98.95
 r. of we |     788      5.37       354      5.37          98.59
 greater  |     620      4.23       277      4.20          98.32
 merseysi |     283      1.93       131      1.99          98.85
 r. of no |     660      4.50       295      4.48          96.38
 south yo |     376      2.56       167      2.53          97.50
 west yor |     523      3.56       242      3.67          98.28
 r. of yo |     493      3.36       222      3.37          96.62
 tyne & w |     345      2.35       159      2.41          98.22
 r. of no |     610      4.16       270      4.10          98.46
  wales   |     694      4.73       314      4.76          99.26
 scotland |    1416      9.65       660     10.02          99.14
----------+-----------------------------------------------------
    Total |   14672    100.00      6739    102.26          97.79
                              (n = 6590)
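
  • The between variation is the variation in individual means across individuals. Averaging over time removes the time dimension, so between is about differences among individuals.

  • The within variation tells us how much the value varies over time for each person, so within is about changes over time.

In the xttab example above, the Overall column tells us that the data contain 573 person-years in which the person lives in region 1 (inner London), which is 3.91% of all person-years. The Between column counts people rather than person-years: 293 individuals lived in inner London in at least one year. The Within column says that, among those who ever lived in inner London, 93.40% of their observations are in inner London.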
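To make the between/within split in the xtsum output concrete, here is a small Python sketch on made-up pay data. It assumes the usual definitions: the between figure is the standard deviation of the individual means, and the within figure is the standard deviation of each value minus its individual mean plus the grand mean. The names `pay`, `person_means` and `within_values` are illustrative, and the exact degrees-of-freedom conventions Stata uses are not checked here.

```python
from statistics import mean, stdev

# Invented pay values for two people (pid -> one value per observed year).
# The panel is unbalanced: person 1 has two years, person 2 has three.
pay = {1: [10.0, 20.0], 2: [30.0, 50.0, 40.0]}

all_values = [x for xs in pay.values() for x in xs]
grand_mean = mean(all_values)

# Between: one mean per person, so the time dimension is averaged out.
person_means = {pid: mean(xs) for pid, xs in pay.items()}

# Within: deviation from the person's own mean, re-centred on the grand mean.
within_values = [x - person_means[pid] + grand_mean
                 for pid, xs in pay.items() for x in xs]

overall_sd = stdev(all_values)
between_sd = stdev(list(person_means.values()))
within_sd = stdev(within_values)
t_bar = len(all_values) / len(pay)  # average number of observations per person

print("overall sd:", round(overall_sd, 3))
print("between sd:", round(between_sd, 3))
print("within sd: ", round(within_sd, 3))
print("T-bar:     ", t_bar)
```

Because the within values are deviations re-centred on the grand mean, they can fall below zero even when every raw value is positive, which is why the within minimum of paygu in the xtsum output is negative.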

1.3 Built-in functions for time-series operators

Stata has built-in time-series operators for calculating lags, leads, and differences. For example, L.varname is the first lag of varname, F.varname is the first lead, D.varname is the first difference, and L3.varname is the three-period lag. These operators require the data to be declared with xtset (or tsset) first.
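The behaviour of these operators on a panel with gaps can be sketched in Python. This is an illustration of the idea only, not Stata's implementation; the functions `lag`, `lead` and `diff` and the toy data are hypothetical. The key point is that a lag looks up the same person at period t-k, and is missing when that period was not observed.

```python
# (pid, year) -> value. Person 2 has no 1992 record, so the panel has a gap.
x = {(1, 1991): 5.0, (1, 1992): 7.0, (1, 1993): 8.0,
     (2, 1991): 3.0, (2, 1993): 9.0}


def lag(series, k=1):
    """k-period lag within each pid; None where period t-k is not observed."""
    return {(pid, t): series.get((pid, t - k)) for (pid, t) in series}


def lead(series, k=1):
    """k-period lead, i.e. a lag of -k periods."""
    return lag(series, -k)


def diff(series):
    """First difference x_t - x_{t-1}; None where the lag is missing."""
    lagged = lag(series)
    return {key: (None if lagged[key] is None else series[key] - lagged[key])
            for key in series}


for key in sorted(x):
    print(key, "x =", x[key], "L.x =", lag(x)[key],
          "F.x =", lead(x)[key], "D.x =", diff(x)[key])
```

Because the lookup is keyed on both `pid` and `year`, a lag never reaches across persons, and a gap year yields a missing value rather than the previous row.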

1.4 Econometric methods for panel data

Suppose we have panel data for just two periods t=1, 2. We can write a simple model with a single explanatory variable as \[y_{it} = \beta_{0} + \delta d_{2t} + \beta x_{it} + \alpha_{i} + u_{it} , t = 1, 2 \]

In this notation \(i\) denotes the person, firm, city and so on, and \(t\) denotes the time period. The variable \(d_{2t}\) is a dummy variable which equals 0 when \(t = 1\) and 1 when \(t = 2\). Pooling the two periods in this way, we can treat the data as if they came from a single data set, and treat the variation in \(y_{it}\) and \(x_{it}\) across \(t\) in the same way as the variation across \(i\).

Now, let’s run a pooled regression.



. cd "Z:\L14009\Dataset_part_1"
Z:\L14009\Dataset_part_1

. 
. use wave_final

. 
. drop if year == 1991
(10,264 observations deleted)

. 
. drop if age<16|age>60
(4,292 observations deleted)

. drop if jbed<0
(4,735 observations deleted)

. 
. gen training = 1 if jbed ==1
(7,050 missing values generated)

. replace training = 0 if jbed == 2
(7,050 real changes made)

. 
. gen female = 1 if sex == 2
(5,319 missing values generated)

. replace female = 0 if sex == 1
(5,319 real changes made)

. 
. gen d2 = 1 if year == 1993
(5,298 missing values generated)

. replace d2 = 0 if year == 1992
(5,298 real changes made)

. 
. gen lnpay = ln(paygu)
(1,278 missing values generated)

. 
. save wave_2year, replace 
file wave_2year.dta saved

. 
. * the regression will be 
. 
. regress lnpay d2 female age training 

      Source |       SS           df       MS      Number of obs   =     9,140
-------------+----------------------------------   F(4, 9135)      =    790.78
       Model |   1829.6012         4  457.400299   Prob > F        =    0.0000
    Residual |  5283.86235     9,135  .578419523   R-squared       =    0.2572
-------------+----------------------------------   Adj R-squared   =    0.2569
       Total |  7113.46354     9,139  .778363447   Root MSE        =    .76054

------------------------------------------------------------------------------
       lnpay |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          d2 |   .0207247   .0159143     1.30   0.193    -.0104709    .0519202
      female |   -.708709   .0159335   -44.48   0.000    -.7399422   -.6774758
         age |   .0153719   .0006934    22.17   0.000     .0140128    .0167311
    training |   .4393872   .0167418    26.24   0.000     .4065694    .4722049
       _cons |   6.258067   .0293321   213.35   0.000     6.200569    6.315564
------------------------------------------------------------------------------

. 
. * the same regression, now allowing observations on the same individual to be
. * correlated across time (standard errors clustered on pid)
. 
. regress lnpay d2 female age training, vce(cluster pid)

Linear regression                               Number of obs     =      9,140
                                                F(4, 5434)        =     573.27
                                                Prob > F          =     0.0000
                                                R-squared         =     0.2572
                                                Root MSE          =     .76054

                                (Std. Err. adjusted for 5,435 clusters in pid)
------------------------------------------------------------------------------
             |               Robust
       lnpay |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          d2 |   .0207247   .0100369     2.06   0.039     .0010483     .040401
      female |   -.708709   .0198915   -35.63   0.000    -.7477043   -.6697136
         age |   .0153719   .0009572    16.06   0.000     .0134955    .0172484
    training |   .4393872   .0179993    24.41   0.000     .4041013    .4746731
       _cons |   6.258067   .0402289   155.56   0.000     6.179202    6.336932
------------------------------------------------------------------------------

. 

In fact, panel data can potentially help us deal with the endogeneity problem.

It turns out that panel data make it straightforward to remove the fixed effect \(\alpha_{i}\), and with it any bias from correlation between \(\alpha_{i}\) and \(x_{it}\). Writing out the model equation separately for each year we get:

\[y_{i2} = \beta_{0} + \delta + \beta x_{i2} + \alpha_{i} + u_{i2} , t = 2 \]

\[y_{i1} = \beta_{0} + \beta x_{i1} + \alpha_{i} + u_{i1} , t = 1\]

Subtracting \(y_{i1}\) from \(y_{i2}\) we get:

\[y_{i2} - y_{i1} = \delta + \beta (x_{i2} -x_{i1}) + (u_{i2} - u_{i1})\]

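Now the unobserved fixed effect \(\alpha_{i}\) has been differenced away. Any regressor that does not change over time, such as female, is differenced away too, which is why Stata omits D.female in the regression below.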
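The effect of differencing can be checked on simulated data. The sketch below is a hypothetical Python illustration, not part of the lecture material: the parameter values, the data-generating process and the helper `ols_slope` are all assumptions chosen so that \(x_{it}\) is correlated with \(\alpha_{i}\). For brevity the pooled regression here leaves out the year dummy, which is unrelated to \(x_{it}\) in this simulation.

```python
import random

rng = random.Random(0)           # fixed seed so the run is reproducible
beta, delta, n = 0.5, 0.1, 2000  # assumed true slope, period effect, sample size


def ols_slope(xs, ys):
    """Slope from a simple regression of ys on xs (intercept included)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx


pooled_x, pooled_y, dx, dy = [], [], [], []
for _ in range(n):
    alpha = rng.gauss(0, 1)        # unobserved fixed effect
    x1 = alpha + rng.gauss(0, 1)   # x is correlated with alpha in both periods
    x2 = alpha + rng.gauss(0, 1)
    y1 = 1.0 + beta * x1 + alpha + rng.gauss(0, 0.5)
    y2 = 1.0 + delta + beta * x2 + alpha + rng.gauss(0, 0.5)
    pooled_x += [x1, x2]
    pooled_y += [y1, y2]
    dx.append(x2 - x1)             # differencing removes alpha
    dy.append(y2 - y1)

pooled_beta = ols_slope(pooled_x, pooled_y)
fd_beta = ols_slope(dx, dy)
print(f"true beta = {beta}, pooled OLS = {pooled_beta:.3f}, "
      f"first difference = {fd_beta:.3f}")
```

In this setup the pooled slope absorbs part of \(\alpha_{i}\) and overstates \(\beta\), while the first-difference slope should land close to the true value.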



. 
. cd "Z:\L14009\Dataset_part_1"
Z:\L14009\Dataset_part_1

. 
. use wave_2year

. 
. xtset pid year 
       panel variable:  pid (unbalanced)
        time variable:  year, 1992 to 1993
                delta:  1 unit

. 
. * xtset has to be declared before panel or time-series operators can be used;
. * the setting is kept only if the dataset is saved after xtset
. 
. list pid year D.lnpay age D.age training D.training if _n <= 10, sepby(pid)

       +---------------------------------------------------------------+
       |                           D.          D.                    D.|
       |      pid   year       lnpay   age   age   training   training |
       |---------------------------------------------------------------|
    1. | 10007857   1992           .    59     .          1          . |
    2. | 10007857   1993   -.0066824    59     0          1          0 |
       |---------------------------------------------------------------|
    3. | 10014608   1992           .    58     .          0          . |
    4. | 10014608   1993    .0862846    59     1          0          0 |
       |---------------------------------------------------------------|
    5. | 10016813   1992           .    37     .          0          . |
       |---------------------------------------------------------------|
    6. | 10016848   1992           .    33     .          1          . |
       |---------------------------------------------------------------|
    7. | 10017933   1992           .    49     .          0          . |
    8. | 10017933   1993           .    51     2          0          0 |
       |---------------------------------------------------------------|
    9. | 10017968   1992           .    46     .          0          . |
   10. | 10017968   1993    .1077027    48     2          0          0 |
       +---------------------------------------------------------------+

. 
. regress D.lnpay D.female D.training D.age 
note: D.female omitted because of collinearity

      Source |       SS           df       MS      Number of obs   =     3,705
-------------+----------------------------------   F(2, 3702)      =      1.79
       Model |  .418572903         2  .209286451   Prob > F        =    0.1667
    Residual |  432.190724     3,702  .116745198   R-squared       =    0.0010
-------------+----------------------------------   Adj R-squared   =    0.0004
       Total |  432.609297     3,704  .116795167   Root MSE        =    .34168

------------------------------------------------------------------------------
     D.lnpay |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |
         D1. |          0  (omitted)
             |
    training |
         D1. |   .0016944   .0107622     0.16   0.875     -.019406    .0227948
             |
         age |
         D1. |  -.0378549    .020068    -1.89   0.059    -.0772004    .0014905
             |
       _cons |   .0939594   .0207339     4.53   0.000     .0533083    .1346104
------------------------------------------------------------------------------

. 
. predict uhat, residuals 
(6,713 missing values generated)

. 
. ** We can also use within-groups estimator 
. 
. xtreg lnpay d2 female age training, fe 
note: female omitted because of collinearity

Fixed-effects (within) regression               Number of obs     =      9,140
Group variable: pid                             Number of groups  =      5,435

R-sq:                                           Obs per group:
     within  = 0.0274                                         min =          1
     between = 0.0481                                         avg =        1.7
     overall = 0.0333                                         max =          2

                                                F(3,3702)         =      34.73
corr(u_i, Xb)  = -0.5742                        Prob > F          =     0.0000

------------------------------------------------------------------------------
       lnpay |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          d2 |   .0939594   .0207339     4.53   0.000     .0533083    .1346104
      female |          0  (omitted)
         age |  -.0378549    .020068    -1.89   0.059    -.0772004    .0014905
    training |   .0016944   .0107622     0.16   0.875     -.019406    .0227948
       _cons |   7.922322   .7131802    11.11   0.000     6.524057    9.320587
-------------+----------------------------------------------------------------
     sigma_u |  1.0943318
     sigma_e |  .24160422
         rho |  .95352258   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(5434, 3702) = 19.58                 Prob > F = 0.0000

. 
. ** we can use random effect
. 
. xtreg lnpay d2 female age training, re 

Random-effects GLS regression                   Number of obs     =      9,140
Group variable: pid                             Number of groups  =      5,435

R-sq:                                           Obs per group:
     within  = 0.0143                                         min =          1
     between = 0.2305                                         avg =        1.7
     overall = 0.2216                                         max =          2

                                                Wald chi2(4)      =    1671.92
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000

------------------------------------------------------------------------------
       lnpay |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          d2 |   .0369959   .0057385     6.45   0.000     .0257487    .0482432
      female |  -.7234406   .0212953   -33.97   0.000    -.7651786   -.6817026
         age |   .0172896   .0009032    19.14   0.000     .0155193    .0190599
    training |   .0945188   .0101153     9.34   0.000     .0746932    .1143444
       _cons |   6.267822   .0354034   177.04   0.000     6.198433    6.337212
-------------+----------------------------------------------------------------
     sigma_u |  .74015203
     sigma_e |  .24160422
         rho |  .90370698   (fraction of variance due to u_i)
------------------------------------------------------------------------------

. 
. ** Let us check the residuals
. 
. qnorm uhat 

. 
. graph export "xtreg_re1.png", replace 
(file xtreg_re1.png written in PNG format)

. 

Figure: normal Q-Q plot of the residuals uhat from the first-difference regression (xtreg_re1.png)
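Returning to the estimates above: the fixed-effects coefficients on training and age match the first-difference ones exactly, and the fixed-effects coefficient on d2 equals the first-difference constant. A short derivation, offered as a sketch for individuals observed in both years, shows why. With \(T = 2\) the individual mean is \(\bar{y}_{i} = (y_{i1} + y_{i2})/2\), so the within transformation gives

\[y_{i2} - \bar{y}_{i} = \tfrac{1}{2}(y_{i2} - y_{i1}), \qquad y_{i1} - \bar{y}_{i} = -\tfrac{1}{2}(y_{i2} - y_{i1})\]

and likewise for \(x_{it}\), \(u_{it}\) and the dummy \(d_{2t}\), whose demeaned value at \(t = 2\) is \(1/2\). The demeaned equation for \(t = 2\) is therefore

\[\tfrac{1}{2}(y_{i2} - y_{i1}) = \tfrac{1}{2}\delta + \tfrac{1}{2}\beta (x_{i2} - x_{i1}) + \tfrac{1}{2}(u_{i2} - u_{i1})\]

which is the first-differenced equation multiplied by \(1/2\), and the equation for \(t = 1\) is its negative. Both estimators therefore minimise the same sum of squares up to a constant factor. Individuals observed in only one year have demeaned values of zero and contribute nothing to the within estimates.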



This is the end of Part I. You should read more in Wooldridge's textbook and CT (2009) to understand the logic behind the methods and models.