This is a non-comprehensive review of Stata. It is based on a set of data files that can be downloaded directly from Canvas (see files/sections/stata_review_empty.zip).

Preliminaries

File Organization

Before jumping into the do-file I want to make some remarks about the directory setup and handling of the data.

There are a lot of empty directories in what you downloaded. These are all set up in anticipation of needing them later on. Even if you think that you only have a small project at the beginning you will probably end up with a whole bunch of files by the end. Be a proactive organizer.

The underlying data that we’ll use is stored in the directory data/raw. We’re going to put together a file (the aforementioned do-file) that will tell Stata to carry out a set of commands on the data. However, we will never actually modify the data in data/raw. We will instead create new datasets and save those elsewhere. This is critical for replicating an analysis.

Comments

You’re going to want to comment your code so that it is clear to future-you and others what exactly it is that you’re doing.

We can comment in a lot of ways:

* one line comment - Beginning of line only
disp "Another Method" // one line comment - Any place in line

/*
    BLOCK COMMENT  
*/

Running and getting commands

We can easily send commands from the do-file to the Stata interpreter in a few ways:

  • ctrl+d runs the entire do file
  • [highlight line]+ctrl+d runs the selected lines
  • ctrl+r quietly runs the do-file. That is nothing is printed to the Stata output screen.

You can also fetch previously-executed commands from i) the left panel (assuming default Stata config) or ii) clicking on the command line and hitting fn+[UP arrow].

A Sample .do File

Setup

Clear everything from memory

clear all

Avoid the annoying “more” message

set more off

Now define global “macros”, which is what Stata calls variables that are not columns in the dataframe. These are the variables that the user needs to specify in order to run your code; make them conspicuous.

// WORKING DIRECTORY - Must be set to run the script.
global wk_dir "C:/Users/Michael/Dropbox/teaching/ECON326/Recitations/Recitation1/stata_review"
// Log file name
global log_file "log/state_review"
 . // WORKING DIRECTORY - Must be set to run the script.
. global wk_dir "C:/Users/Michael/Dropbox/teaching/ECON326/Recitations/Recitation1/stata_review"

. // Log file name
. global log_file "log/state_review"

In this case it will be overkill to seprately define wk_dir. We will only use it for one purpose. However, you could imagine a case where there is a need to jump between several directories (and this might only occur on line 345). Defining these macros is going to make it so that the user only needs to set/change these once.

Set the working directory to wk_dir, which we specified above. We use the Stata macro syntax \$[macro name]. The prefix on the name tells Stata to interpret the input as the name of a global macro not as, e.g., a column name.
cd $wk_dir 
 . cd $wk_dir 
C:\Users\Michael\Dropbox\teaching\ECON326\Recitations\Recitation1\stata_review

Log Files

Next we need to start up the log file. What does this do? The log command is going to create (or replace) a file and write everything that we see in the Stata output window to it.

We want to do this because it preserves our results. This will make it easier to share and review our results later. For example, if you don’t have Stata on hand you can still review a log file (if you set the options correctly).

Start logging:
//log using $log_file , replace text
disp "hello"
//log close
 . //log using $log_file , replace text
. disp "hello"
hello

. //log close

Let’s discuss each of the elements of this command:

  • log using $log_file - Begin to log “using” the specified file. In this case it is the name that we defined above.
  • , - The comma separates the main command from the options in Stata.
  • replace - Stata will try to open a new log file. However, if we run the do-file several times a log file was already created. This tells Stata that it is okay to go ahead and replace the pre-exsting log file if it finds one.
  • text - This option will specify the output format of the log file. By specifying text we ensure that we will be able to open and read the log file in any text editor, not just Stata.2

Once we start the log file Stata will write all of our commands (not just those in the do-file) to the log file until we tell it to stop logging. To do this we will add a line at the end of the file–log close. It is a good idea to add this line at the same time that you add the log start statement (so that you don’t forget later) then write the rest of your commands between these two lines.

If you’re debugging the file while running it you may run into an error that halts the execution of the code. Since your log close statement is below the error line, it will not be executed. Thus, the log file is still open. If you then try to re-run the code you will get an error at the log using line saying that a log file is already open. One way to get around this is to comment out these two lines while you’re writing the file (you probably won’t need to look at them until after you’re done writing the file anyway). Make a note to make sure that you go back and log in the end.

Data Cleaning and Shoveling

Start with the dataset in “data/raw/lifeexp-eur&asia”.

// Open the base dataset
use "data/raw/lifeexp-eur&asia", replace
 . // Open the base dataset
. use "data/raw/lifeexp-eur&asia", replace
(Life expectancy, 1998)

For more on use of the use command type help use onto the command line.

Now we’re going to inspect the data that was just loaded in to see i) what it is and ii) if it makes sense.

// Inspect it: (Bad labels, data types, no labels, missing data)
describe
 . // Inspect it: (Bad labels, data types, no labels, missing data)
. describe

Contains data from data/raw/lifeexp-eur&asia.dta
  obs:            44                          Life expectancy, 1998
 vars:             5                          29 Mar 2017 15:40
 size:         1,540                          (_dta has notes)
------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------------------------------
region          byte    %12.0g     region     Country
country         str28   %28s                  Region
popgrowth       float   %9.0g               * Avg. annual % growth
lexp            byte    %9.0g               * Life expectancy at birth
safewater       byte    %9.0g               * 
                                            * indicated variables have notes
------------------------------------------------------------------------------------------------------------------------
Sorted by:  

Notice that the data labels for region and country appear to be switched around. We can check which is wrong (bad labels or bad variable names?) by looking at the data itself.

Look at the data itself (I comment this out when running full file because it will limit the number of boxes flashing up).
//browse //
 . //browse //

In the data empty entries show up as .. Notice that a lot of safewater entries are missing.

It also appears to be the case that the labels are mixed up. We can fix this easily:
// Correct Errors in Labeling
label var region Region
label var country Country
 . // Correct Errors in Labeling
. label var region Region

. label var country Country

We can get a better picture of the missing data:

misstable summarize, all
 . misstable summarize, all
                                                               Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
  -------------+--------------------------------+------------------------------
        region |                            44  |      1          1           1
       country | (string variable)              |
     popgrowth |                            44  |     20        -.5         2.8
          lexp |                            44  |     15         65          79
     safewater |        27                  17  |     12         55         100
  -----------------------------------------------------------------------------

We want to see both i) how many entries are missing in each entry and ii) if the data seems to make sense. It is not uncommon to program data as being missing by entering the data as -999 (or something along these lines). This doesn’t appear to be the case in this data, though.

We should also check to make sure that the other variables aren’t super-odd. For instance, are the lexp entries reasonable? We shouldn’t see negative entries or entries above 100 (or 90 really).

Also, if we’re working with hand-entered (or OCR’d) data we would want to do a lot more to check for mis-types.

Combining Data: Append and Merge

We have just imported some data. Usually we need to combine this data with some other datasets in order to get the full set of variables that we need.

There are two principle operations that we might carry out:

  • append adds additional rows to the frame. For instance, we could append a line for the U.S.
  • merge adds additional columns to the frame. In a minute we will add data on GNP for each country.

Append

We have a second dataset that is similar to the first but has additional countries from the Western Hemisphere.

We begin by peaking into the data to be appended to make sure that it is suitable for being merged.

The preserve command is going to store the data we have in memory right now while we open the other dataset.

Then open up the other data file and do the same data quality checks as before:
// Preserve current data for later
preserve
// Now using new data
use "data/raw/lifeexp-america", clear
// Run same checks on this data as last
describe
browse
misstable summarize, all
// Restore previous data
restore
 . // Preserve current data for later
. preserve

. // Now using new data
. use "data/raw/lifeexp-america", clear
(Life expectancy, 1998)

. // Run same checks on this data as last
. describe

Contains data from data/raw/lifeexp-america.dta
  obs:            23                          Life expectancy, 1998
 vars:             5                          4 Jan 2011 17:24
 size:           805                          (_dta has notes)
------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------------------------------
region          byte    %12.0g     region     Region
country         str28   %28s                  Country
popgrowth       float   %9.0g               * Avg. annual % growth
lexp            byte    %9.0g               * Life expectancy at birth
safewater       byte    %9.0g               * 
                                            * indicated variables have notes
------------------------------------------------------------------------------------------------------------------------
Sorted by:  

. browse
request ignored because of batch mode

. misstable summarize, all
                                                               Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
  -------------+--------------------------------+------------------------------
        region |                            23  |      2          2           3
       country | (string variable)              |
     popgrowth |                            23  |     16         .7           3
          lexp |                            23  |     15         54          79
     safewater |         1                  22  |     19         28          99
  -----------------------------------------------------------------------------

. // Restore previous data
. restore
In order to do a succesful append we need to make sure that the variable names in the two files match up as we want them to. For instance, we need to make sure that the cases of the variable names are the same. Stata is case-sensitive; i.e. interprets POPGROWTH differently from popgrowth. Everything looks good. We can restore the Europe and Asia data from memory (i.e. undo the preserve from earlier) and carry out the append:
// Append
append using "data/raw/lifeexp-america"
 . // Append
. append using "data/raw/lifeexp-america"
(label region already defined)

Merge

Now we’re going to merge our current dataframe with a dataset containing variables on GNP per capita by country.

In order to merge we need to ensure that we have common variables between the two datasets and that these merge variables play well together.

Stata requires that the two merge variables have identical names. Also, the merge command is “stupid” in the sense that it will only match values that are exactly identical; it cannot tell that U.S. and United States are the same countries.[^ For that matter, it can’t tell that united states and United States are the same country.]

As with the other datasets we begin by inspecting the new dataset.
// Merge dataset on per capita income - add COLUMNS
// Peak into merging dataset to check for problems, identify merging variables
save "data/int/lifeexp", replace
use "data/raw/gnppc", clear
describe
browse
misstable summarize
 . // Merge dataset on per capita income - add COLUMNS
. // Peak into merging dataset to check for problems, identify merging variables
. save "data/int/lifeexp", replace
file data/int/lifeexp.dta saved

. use "data/raw/gnppc", clear

. describe

Contains data from data/raw/gnppc.dta
  obs:            64                          
 vars:             2                          29 Mar 2017 16:13
 size:         1,472                          
------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------------------------------
Country         str19   %19s                  
gnppc           long    %8.0g                 GNP per capita
------------------------------------------------------------------------------------------------------------------------
Sorted by:  

. browse
request ignored because of batch mode

. misstable summarize
(variables nonmissing or string)
We see that there is a country variable but that it needs to be converted to lowercase in order to match with the variable in the other “master” dataset. Let’s do that.
* Can match on country name but need variables to be the same
rename Country country
* Save as NEW dataset in 'data/int'
save "data/int/gnppc_for_merge", replace
* Get old data
use "data/int/lifeexp"
 . * Can match on country name but need variables to be the same
. rename Country country

. * Save as NEW dataset in 'data/int'
. save "data/int/gnppc_for_merge", replace
file data/int/gnppc_for_merge.dta saved

. * Get old data
. use "data/int/lifeexp"
(Life expectancy, 1998)

Notice how I saved the modified data in a new file in the intermediate data directory. This is in keeping with the idea that–in order to ensure the replicability of the analysis–we never actually change the original data.3

Okay, let’s go ahead and try to merge. The merge syntax is similar to the append syntax except we need to specify the merge type and the merge variables.

For the merge type I specify 1:1. That is, we tell Stata that, for each entry in the master data, we should find exactly one entry in the using data. This is a nice way of ensuring that you’re not introducing or removing spurious observations.4
// Now do the actual merge!
// Still a bit hazy on the "sort" rationale. Just do it if you get an error...
sort country
// Let's preserve the data in case something goes wrong on merge
save "data/int/lifeexp", replace
// Merge
merge 1:1 country using "data/int/gnppc_for_merge"
 . // Now do the actual merge!
. // Still a bit hazy on the "sort" rationale. Just do it if you get an error...
. sort country

. // Let's preserve the data in case something goes wrong on merge
. save "data/int/lifeexp", replace
file data/int/lifeexp.dta saved

. // Merge
. merge 1:1 country using "data/int/gnppc_for_merge"

    Result                           # of obs.
    -----------------------------------------
    not matched                             9
        from master                         6  (_merge==1)
        from using                          3  (_merge==2)

    matched                                61  (_merge==3)
    -----------------------------------------
The merge command gives us a summary of how succesful the merge was. Let’s look into the variables that weren’t merged and make sure that they make sense. We should also browse the data to make sure that the data that was reported as being succesfully merged was merged correctly.
// What countries were left out?
list region country _merge if _merge != 3
 . // What countries were left out?
. list region country _merge if _merge != 3

     +---------------------------------------------------------------+
     |       region                        country            _merge |
     |---------------------------------------------------------------|
  9. | Eur & C.Asia         Bosnia and Herzegovina   master only (1) |
 15. |         N.A.                           Cuba   master only (1) |
 38. | Eur & C.Asia                  Macedonia FYR   master only (1) |
 49. |         N.A.                    Puerto Rico   master only (1) |
 60. | Eur & C.Asia                   Turkmenistan   master only (1) |
     |---------------------------------------------------------------|
 67. | Eur & C.Asia   Yugoslavia, FR (Serb./Mont.)   master only (1) |
 68. |            .                       Colombia    using only (2) |
 69. |            .                         Israel    using only (2) |
 70. |            .                      Macedonia    using only (2) |
     +---------------------------------------------------------------+

Okay, notice that there is an unmatched master entry called Macedonia FYR and an unmatched using entry called Macedonia. These are probably the same country but we should probably check it out to make sure. (I found the FYR to be somewhat puzzling–is it maybe an annotation added by someone to specify something…?) Let’s just look up Macedonia on Wikipedia and see if anything pops up. In fact, it does. These are referring to the same country.

In this next chunk I’m going to go back into the using data and rename the entry for Macedonia. But I’m also going to add a block comment with what I found out about Macedonia. The idea here is that–at some point down the line–you or someone else are going to be looking at the code (maybe for a bug) and you’re going to wonder if this is actually correct. Save your future self a moment and put it right in the code:
/* 
Appears that Macedonia and Macedonia FYR are actually the same country?

Yes, from Wikipedia entry for "Republic of Macedonia"
 (https://en.wikipedia.org/wiki/Republic_of_Macedonia)

    It became a member of the United Nations in 1993, but, as a result of an 
    ongoing dispute with Greece over the use of the name "Macedonia", was 
    admitted under the provisional description the former Yugoslav Republic 
    of Macedonia[10][11] (sometimes unofficially abbreviated as FYROM and 
    FYR Macedonia), a term that is also used by international organizations 
    such as the European Union,[12] the Council of Europe[13] and NATO.[14]
*/ 

// Amend gnppc_for_merge data
use "data/int/gnppc_for_merge", clear
// Change a single name
replace country = "Macedonia FYR" if country == "Macedonia"
// Save
save "data/int/gnppc_for_merge", replace
// Restore back life expectations data
use "data/int/lifeexp", replace
 . /* 
> Appears that Macedonia and Macedonia FYR are actually the same country?
> 
> Yes, from Wikipedia entry for "Republic of Macedonia"
>  (https://en.wikipedia.org/wiki/Republic_of_Macedonia)
> 
>         It became a member of the United Nations in 1993, but, as a result of an 
>         ongoing dispute with Greece over the use of the name "Macedonia", was 
>         admitted under the provisional description the former Yugoslav Republic 
>         of Macedonia[10][11] (sometimes unofficially abbreviated as FYROM and 
>         FYR Macedonia), a term that is also used by international organizations 
>         such as the European Union,[12] the Council of Europe[13] and NATO.[14]
> */ 
. 
. // Amend gnppc_for_merge data
. use "data/int/gnppc_for_merge", clear

. // Change a single name
. replace country = "Macedonia FYR" if country == "Macedonia"
(1 real change made)

. // Save
. save "data/int/gnppc_for_merge", replace
file data/int/gnppc_for_merge.dta saved

. // Restore back life expectations data
. use "data/int/lifeexp", replace
(Life expectancy, 1998)
Now let’s try this merge again.
// Merge again
// Merge
merge 1:1 country using "data/int/gnppc_for_merge"

// What countries were left out?
li region country _merge if _merge != 3
// Check on Macedonia:
li region country _merge if country == "Macedonia FYR"
 . // Merge again
. // Merge
. merge 1:1 country using "data/int/gnppc_for_merge"

    Result                           # of obs.
    -----------------------------------------
    not matched                             7
        from master                         5  (_merge==1)
        from using                          2  (_merge==2)

    matched                                62  (_merge==3)
    -----------------------------------------

. 
. // What countries were left out?
. li region country _merge if _merge != 3

     +---------------------------------------------------------------+
     |       region                        country            _merge |
     |---------------------------------------------------------------|
  9. | Eur & C.Asia         Bosnia and Herzegovina   master only (1) |
 15. |         N.A.                           Cuba   master only (1) |
 49. |         N.A.                    Puerto Rico   master only (1) |
 60. | Eur & C.Asia                   Turkmenistan   master only (1) |
 67. | Eur & C.Asia   Yugoslavia, FR (Serb./Mont.)   master only (1) |
     |---------------------------------------------------------------|
 68. |            .                       Colombia    using only (2) |
 69. |            .                         Israel    using only (2) |
     +---------------------------------------------------------------+

. // Check on Macedonia:
. li region country _merge if country == "Macedonia FYR"

     +--------------------------------------------+
     |       region         country        _merge |
     |--------------------------------------------|
 38. | Eur & C.Asia   Macedonia FYR   matched (3) |
     +--------------------------------------------+

Everything looks to be in order. At this point the data is ready for analysis. Let’s go ahead and save it to the final data directory.

This is useful if we want to come back and do something later but don’t feel like re-running the entire setup above.
// Data is ready for analysis. Save it to 'data/fin' 
save "data/fin/lifeexp_main", replace
 . // Data is ready for analysis. Save it to 'data/fin' 
. save "data/fin/lifeexp_main", replace
file data/fin/lifeexp_main.dta saved

Descriptive Statistics

In this section we discuss descriptive statistics like the mean, media, and variance of the variables.

The most common command here is summarize (which can be abbreviated to su). Here’s the basic usage:
// Basic Summary Stats - All variables by default
su
// Particular set of variables
su lexp safewater
 . // Basic Summary Stats - All variables by default
. su

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      region |        67    1.477612    .7252629          1          3
     country |         0
   popgrowth |        67    .9567164    .9295199        -.5          3
        lexp |        67    72.31343    4.742487         54         79
   safewater |        39    76.05128    18.12232         28        100
-------------+--------------------------------------------------------
       gnppc |        64    8797.125    10595.19        370      39980
      _merge |        69    2.826087    .5411459          1          3

. // Particular set of variables
. su lexp safewater

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        lexp |        67    72.31343    4.742487         54         79
   safewater |        39    76.05128    18.12232         28        100
Can we get more information than that? Sure: use detail
su gnppc, detail
 . su gnppc, detail

                       GNP per capita
-------------------------------------------------------------
      Percentiles      Smallest
 1%          370            370
 5%          410            380
10%          740            380       Obs                  64
25%         1440            410       Sum of Wgt.          64

50%         3445                      Mean           8797.125
                        Largest       Std. Dev.      10595.19
75%        15300          29240
90%        25580          33040       Variance       1.12e+08
95%        29240          34310       Skewness       1.270107
99%        39980          39980       Kurtosis       3.322431

Subsetting Data

Oftentimes we’ll only want to run a command on a subset of the data. For example, we might want to see how the results change if we exclude some possible outliers or if we restrict the data to a group of observations (e.g. countries in North America, students in Ms. Jone’s class).

Stata provides fairly simple syntax that can be used to subset data that is uniform across many common commands.
// Stats by region
bysort region: su lexp popgrowth
// Stats for subsets meeting logical condition
su lexp popgrowth if lexp > 75
//  include if (EITHER lexp > 75 OR lexp < 72)
su lexp popgrowth if lexp > 75 | lexp < 72
//  include if (lexp > 75 AND in region 1)
su lexp popgrowth if lexp > 75 & region == 1
 . // Stats by region
. bysort region: su lexp popgrowth

------------------------------------------------------------------------------------------------------------------------
-> region = Eur & C.Asia

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        lexp |        44    73.06818    4.150639         65         79
   popgrowth |        44        .525    .7175945        -.5        2.8

------------------------------------------------------------------------------------------------------------------------
-> region = N.A.

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        lexp |        14    71.21429    6.411271         54         79
   popgrowth |        14    1.692857    .7488086         .7          3

------------------------------------------------------------------------------------------------------------------------
-> region = S.A.

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        lexp |         9    70.33333    4.062019         62         75
   popgrowth |         9    1.922222    .6534099         .7        2.9

------------------------------------------------------------------------------------------------------------------------
-> region = .

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        lexp |         0
   popgrowth |         0


. // Stats for subsets meeting logical condition
. su lexp popgrowth if lexp > 75

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        lexp |        19    77.57895    1.070607         76         79
   popgrowth |        19    .5105263    .2960757         .1        1.2

. //  include if (EITHER lexp > 75 OR lexp < 72)
. su lexp popgrowth if lexp > 75 | lexp < 72

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        lexp |        46    71.80435    5.623836         54         79
   popgrowth |        46    1.034783    .9998261        -.4          3

. //  include if (lexp > 75 AND in region 1)
. su lexp popgrowth if lexp > 75 & region == 1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        lexp |        15    77.73333    .9611501         76         79
   popgrowth |        15    .3866667    .1597617         .1         .7

In the last few lines I’ve combined logical conditions by using the & (‘and’) and | (‘inclusive or’) operators.

Use Macros

We now consider another subsetting problem: figure out the median life expectancy in the sample then get summary stats for the subsets of countries that are above/below the median.

When we call su (and many other commands) a number of macro variables will be created. You can see the full set of macro variables that are created at the bottom of the help pages (command: help su).

We’re going to get the macro for the median, save it to a a new macro (so that it isn’t overwritten) and then run the stats.5
// Detailed stats
su gnppc, detail
// Save median GNP for later
local gnp_med = `r(p50)'
// Stats for subsets meeting logical condition: If GNP above/below median.
su lexp popgrowth if gnppc < `gnp_med'
su lexp popgrowth if gnppc >= `gnp_med'
 . // Detailed stats
. su gnppc, detail

                       GNP per capita
-------------------------------------------------------------
      Percentiles      Smallest
 1%          370            370
 5%          410            380
10%          740            380       Obs                  64
25%         1440            410       Sum of Wgt.          64

50%         3445                      Mean           8797.125
                        Largest       Std. Dev.      10595.19
75%        15300          29240
90%        25580          33040       Variance       1.12e+08
95%        29240          34310       Skewness       1.270107
99%        39980          39980       Kurtosis       3.322431

. // Save median GNP for later
. local gnp_med = `r(p50)'

. // Stats for subsets meeting logical condition: If GNP above/below median.
. su lexp popgrowth if gnppc < `gnp_med'

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        lexp |        31    68.96774    4.020646         54         75
   popgrowth |        31    1.293548    1.045286        -.4          3

. su lexp popgrowth if gnppc >= `gnp_med'

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        lexp |        36    75.19444    3.178674         66         79
   popgrowth |        36    .6666667    .7111359        -.5        2.8

Wildcards

It can be somewhat tedious to write out certain variables or sets of variables. Fortunately, Stata makes it fairly easy to reference those variables without writing them all out explicitly.

The basic idea of a wildcard variable is that you enter some fragment of a variable name and stata will return all variables that match the fragment.

Consider the following hypothetical example:6

// Variables: emp29, emp30, emp31, rmpd, gdp, log_gdp
// Prefix - Get all variables matching the prefix
emp*    =>      emp29, emp30, emp31
// Suffix - Get all variables matching the suffix
*gdp    =>      gdp, log_gdp
// Prefix and Suffix - Get all variables where string occurs at any point
*mp*    =>      emp29, emp30, emp31, rmpd
Let’s try it in the real data:
// Select variables - safewater and variables with "p" in name
su safewater *p*
 . // Select variables - safewater and variables with "p" in name
. su safewater *p*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
   safewater |        39    76.05128    18.12232         28        100
   popgrowth |        67    .9567164    .9295199        -.5          3
        lexp |        67    72.31343    4.742487         54         79
       gnppc |        64    8797.125    10595.19        370      39980

Correlations and Regressions

The most common command for correlations is corr. We can use the exact same syntax that we did with su to subset the data:7
corr pop* lexp gnppc safewater
// By region
bysort region: corr lexp popgrowth
// For subsets meeting logical condition
corr lexp popgrowth if lexp > 75
//  include if (EITHER lexp > 75 OR lexp < 72)
corr lexp popgrowth if lexp > 75 | lexp < 72
//  include if (lexp > 75 AND in region 1)
corr lexp popgrowth if lexp > 75 & region == 1
 . corr pop* lexp gnppc safewater
(obs=36)

             | popgro~h     lexp    gnppc safewa~r
-------------+------------------------------------
   popgrowth |   1.0000
        lexp |  -0.4734   1.0000
       gnppc |  -0.4857   0.6941   1.0000
   safewater |  -0.3988   0.8199   0.7108   1.0000


. // By region
. bysort region: corr lexp popgrowth

------------------------------------------------------------------------------------------------------------------------
-> region = Eur & C.Asia
(obs=44)

             |     lexp popgro~h
-------------+------------------
        lexp |   1.0000
   popgrowth |  -0.2949   1.0000


------------------------------------------------------------------------------------------------------------------------
-> region = N.A.
(obs=14)

             |     lexp popgro~h
-------------+------------------
        lexp |   1.0000
   popgrowth |  -0.5925   1.0000


------------------------------------------------------------------------------------------------------------------------
-> region = S.A.
(obs=9)

             |     lexp popgro~h
-------------+------------------
        lexp |   1.0000
   popgrowth |  -0.3893   1.0000


------------------------------------------------------------------------------------------------------------------------
-> region = .
no observations


. // For subsets meeting logical condition
. corr lexp popgrowth if lexp > 75
(obs=19)

             |     lexp popgro~h
-------------+------------------
        lexp |   1.0000
   popgrowth |  -0.0028   1.0000


. //  include if (EITHER lexp > 75 OR lexp < 72)
. corr lexp popgrowth if lexp > 75 | lexp < 72
(obs=46)

             |     lexp popgro~h
-------------+------------------
        lexp |   1.0000
   popgrowth |  -0.4679   1.0000


. //  include if (lexp > 75 AND in region 1)
. corr lexp popgrowth if lexp > 75 & region == 1
(obs=15)

             |     lexp popgro~h
-------------+------------------
        lexp |   1.0000
   popgrowth |   0.3008   1.0000
The most common command for regressions is reg:
reg pop* lexp gnppc safewater, r
// By region
bysort region: reg lexp popgrowth, r
// For subsets meeting logical condition
reg lexp popgrowth if lexp > 75, r
//  include if (EITHER lexp > 75 OR lexp < 72)
reg lexp popgrowth if lexp > 75 | lexp < 72, r
//  include if (lexp > 75 AND in region 1)
reg lexp popgrowth if lexp > 75 & region == 1, r
 . reg pop* lexp gnppc safewater, r

Linear regression                                      Number of obs =      36
                                                       F(  3,    32) =    8.48
                                                       Prob > F      =  0.0003
                                                       R-squared     =  0.2753
                                                       Root MSE      =  .83165

------------------------------------------------------------------------------
             |               Robust
   popgrowth |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lexp |  -.0594837   .0485842    -1.22   0.230    -.1584465     .039479
       gnppc |    -.00003   .0000133    -2.26   0.031     -.000057   -2.97e-06
   safewater |   .0057573   .0190139     0.30   0.764    -.0329728    .0444874
       _cons |   5.408651   2.288166     2.36   0.024     .7478094    10.06949
------------------------------------------------------------------------------

. // By region
. bysort region: reg lexp popgrowth, r

------------------------------------------------------------------------------------------------------------------------
-> region = Eur & C.Asia

Linear regression                                      Number of obs =      44
                                                       F(  1,    42) =    8.59
                                                       Prob > F      =  0.0054
                                                       R-squared     =  0.0870
                                                       Root MSE      =  4.0129

------------------------------------------------------------------------------
             |               Robust
        lexp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   popgrowth |  -1.705995   .5821101    -2.93   0.005    -2.880741   -.5312495
       _cons |   73.96383   .7604829    97.26   0.000     72.42911    75.49855
------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-> region = N.A.

Linear regression                                      Number of obs =      14
                                                       F(  1,    12) =   14.22
                                                       Prob > F      =  0.0027
                                                       R-squared     =  0.3511
                                                       Root MSE      =  5.3756

------------------------------------------------------------------------------
             |               Robust
        lexp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   popgrowth |  -5.073004   1.345406    -3.77   0.003    -8.004392   -2.141615
       _cons |   79.80216   1.881382    42.42   0.000     75.70298    83.90134
------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-> region = S.A.

Linear regression                                      Number of obs =       9
                                                       F(  1,     7) =    3.41
                                                       Prob > F      =  0.1072
                                                       R-squared     =  0.1516
                                                       Root MSE      =  3.9999

------------------------------------------------------------------------------
             |               Robust
        lexp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   popgrowth |  -2.420299   1.310152    -1.85   0.107    -5.518317     .677719
       _cons |   74.98569   2.288981    32.76   0.000     69.57311    80.39827
------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-> region = .
no observations

. // For subsets meeting logical condition
. reg lexp popgrowth if lexp > 75, r

Linear regression                                      Number of obs =      19
                                                       F(  1,    17) =    0.00
                                                       Prob > F      =  0.9921
                                                       R-squared     =  0.0000
                                                       Root MSE      =  1.1016

------------------------------------------------------------------------------
             |               Robust
        lexp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   popgrowth |  -.0100066   .9906707    -0.01   0.992    -2.100139    2.080126
       _cons |   77.58406   .4920928   157.66   0.000     76.54583    78.62228
------------------------------------------------------------------------------

. //  include if (EITHER lexp > 75 OR lexp < 72)
. reg lexp popgrowth if lexp > 75 | lexp < 72, r

Linear regression                                      Number of obs =      46
                                                       F(  1,    44) =   16.05
                                                       Prob > F      =  0.0002
                                                       R-squared     =  0.2189
                                                       Root MSE      =  5.0265

------------------------------------------------------------------------------
             |               Robust
        lexp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   popgrowth |  -2.631737   .6569521    -4.01   0.000    -3.955737   -1.307737
       _cons |   74.52762   1.080852    68.95   0.000     72.34931    76.70594
------------------------------------------------------------------------------

. //  include if (lexp > 75 AND in region 1)
. reg lexp popgrowth if lexp > 75 & region == 1, r

Linear regression                                      Number of obs =      15
                                                       F(  1,    13) =    1.54
                                                       Prob > F      =  0.2366
                                                       R-squared     =  0.0905
                                                       Root MSE      =  .95124

------------------------------------------------------------------------------
             |               Robust
        lexp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   popgrowth |   1.809701   1.458352     1.24   0.237    -1.340876    4.960279
       _cons |   77.03358   .6216602   123.92   0.000     75.69057     78.3766
------------------------------------------------------------------------------

Notice that I’ve added an option to include robust standard errors (see help reg for more info).

Correlations–Treatment of Missing Data

There are several countries in our data that have missing data entries. We need to think carefully about how we want to deal with this in our analysis as this will affect the interpretation of our results.

In this part I want to briefly discuss how Stata deals with missing data by default in the context of correlations. The default treatment of missing data for a given command may not be the best treatment of missing data for your project so you need to be aware of what it’s doing and how you can change the behavior.

How does corr deal with missing data? Consider the first four entries of the data:
li country pop* lexp gnppc safewater in 1/4
 . li country pop* lexp gnppc safewater in 1/4

     +----------------------------------------------------+
     |       country   popgro~h   lexp   gnppc   safewa~r |
     |----------------------------------------------------|
  1. | Macedonia FYR         .3     73    1290          . |
  2. |        Latvia        -.2     70    2420          . |
  3. |      Slovenia         .2     75    9780         98 |
  4. |        France         .5     78   24210        100 |
     +----------------------------------------------------+

The first two countries have full data so they are kept. The second two lack data on safewater so they are removed. However, to compute a correlation coefficient between popgrowth and lexp we only need non-missing data for these two variables!

Does this make sense? Is Stata throwing good data away? Try to think of cases where it may or may not.

Suppose that we don’t want Stata to treat missing data like this. Then we can use the pwcorr command, which computes elementwise coefficients by default:
// NOTICE: these are different!
corr pop* lexp gnppc safewater
pwcorr pop* lexp gnppc safewater, obs 
// Restrict pwcorr to sample of countries with full data to get back corr
pwcorr pop* lexp gnppc safewater ///
    if popgrowth != . & lexp != . & gnppc != . & safewater != . , obs
// pwcorr can also display p-values for null hypothesis H0: CORR != 0
pwcorr  lexp gnppc safewater, sig
 . // NOTICE: these are different!
. corr pop* lexp gnppc safewater
(obs=36)

             | popgro~h     lexp    gnppc safewa~r
-------------+------------------------------------
   popgrowth |   1.0000
        lexp |  -0.4734   1.0000
       gnppc |  -0.4857   0.6941   1.0000
   safewater |  -0.3988   0.8199   0.7108   1.0000


. pwcorr pop* lexp gnppc safewater, obs 

             | popgro~h     lexp    gnppc safewa~r
-------------+------------------------------------
   popgrowth |   1.0000 
             |       67
             |
        lexp |  -0.4327   1.0000 
             |       67       67
             |
       gnppc |  -0.3520   0.7170   1.0000 
             |       62       62       64
             |
   safewater |  -0.4323   0.8315   0.7108   1.0000 
             |       39       39       36       39
             |

. // Restrict pwcorr to sample of countries with full data to get back corr
. pwcorr pop* lexp gnppc safewater ///
>         if popgrowth != . & lexp != . & gnppc != . & safewater != . , obs

             | popgro~h     lexp    gnppc safewa~r
-------------+------------------------------------
   popgrowth |   1.0000 
             |       36
             |
        lexp |  -0.4734   1.0000 
             |       36       36
             |
       gnppc |  -0.4857   0.6941   1.0000 
             |       36       36       36
             |
   safewater |  -0.3988   0.8199   0.7108   1.0000 
             |       36       36       36       36
             |

. // pwcorr can also display p-values for null hypothesis H0: CORR != 0
. pwcorr  lexp gnppc safewater, sig

             |     lexp    gnppc safewa~r
-------------+---------------------------
        lexp |   1.0000 
             |
             |
       gnppc |   0.7170   1.0000 
             |   0.0000
             |
   safewater |   0.8315   0.7108   1.0000 
             |   0.0000   0.0000
             |

Dummy/Factor Variables in Regression

Suppose we want to include a dummy variable in a regression. How do we do this? I’m going to give a sketch here.

First, in order to use dummy variables we need to tell Stata that a variable should be interpreted as a factor/dummy/categorical variable.

If you browse the data you’ll notice that country and region have different font colors even though they both seem to be strings (i.e. words/characters rather than numbers). Look closely at the top bar of the browse window. When you click a cell the data will appear in the bar. But now notice that if we select a cell in the region column a number comes up. Under the hood region is in fact an integer variable, not a string; it just displays as a string. This integer is simply an index for the region. It has no intrinsic meaning.

Now compare the following two regressions:
xi: reg popgrowth lexp gnppc i.region, r
reg popgrowth lexp gnppc region, r
 . xi: reg popgrowth lexp gnppc i.region, r
i.region          _Iregion_1-3        (naturally coded; _Iregion_1 omitted)

Linear regression                                      Number of obs =      62
                                                       F(  4,    57) =   30.10
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.5642
                                                       Root MSE      =  .62581

------------------------------------------------------------------------------
             |               Robust
   popgrowth |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lexp |  -.0396824   .0193411    -2.05   0.045    -.0784123   -.0009525
       gnppc |  -2.85e-06   8.03e-06    -0.35   0.724    -.0000189    .0000132
  _Iregion_2 |   1.211036   .2173555     5.57   0.000     .7757895    1.646283
  _Iregion_3 |   1.290804   .2471327     5.22   0.000     .7959289    1.785678
       _cons |   3.433166   1.403248     2.45   0.018       .62321    6.243122
------------------------------------------------------------------------------

. reg popgrowth lexp gnppc region, r

Linear regression                                      Number of obs =      62
                                                       F(  3,    58) =   32.30
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.5117
                                                       Root MSE      =  .65674

------------------------------------------------------------------------------
             |               Robust
   popgrowth |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lexp |  -.0463142   .0183285    -2.53   0.014    -.0830026   -.0096257
       gnppc |  -1.84e-06   7.24e-06    -0.25   0.801    -.0000163    .0000127
      region |   .7423806   .1270336     5.84   0.000     .4880952    .9966659
       _cons |   3.223895   1.398233     2.31   0.025     .4250281    6.022762
------------------------------------------------------------------------------

The first regression creates several new variables. Figure out what they’re doing. How does the interpretation of the region variables differ across specifications?

I will say that the first makes sense while the second is completely non-sensical. The reason has to do with the fact that the indices of the regions are arbitrary.

Graphs

There is a lot that can be done with graphs in Stata. The basic command is twoway. See the help file.

A quick twoway plot can be created as follows:
scatter lexp gnppc
 . scatter lexp gnppc

To make anything more elaborate we need to delve into twoway. I will do this presently using some new data from the Penn World Tables (PWT).

PWT Data Setup and Context

My goal is to make a scatterplot that will help me to evaluate Solow’s “Strong Convergence” Hypothesis: countries that are initially poor will have greater subsequent growth.8

I first need to set up the data. In doing so I introduce some new commands:

  • import excel one way to begin to get excel or other, non-.dta files into Stata
  • do call and excecute another do file
  • xtset sets up a panel.
  • encode makes a string variable a factor variable like region
  • egen provides many ways of combining data to create a new variable.
  • reshape pivots/reshapes/transposes data along select dimensions
Here is the code:
// Bring in PWT
import excel "data/raw/pwt90.xlsx", sheet("Data") firstrow clear
// Run the labels
do "do/pwt_label"

// Add in continent codes for later - many to one merge
preserve
import delimited "data/raw/country_continents.csv", clear 
drop countrycode
rename alpha3 countrycode
save "data/int/country_continents", replace
restore
merge m:1 countrycode using "data/int/country_continents"
keep if _merge == 3
drop _merge

// Set up panel
// Need the country variable to be a factor variable
encode country, gen(country_fac)
// Displays variable in terms of underlying label/string
tab country_fac
// But under the hood it is actually a number! Uncomment and run (throws error)
/*
count if country_fac == "Turkmenistan"
count if country == "Turkmenistan"
*/

// Now set panel
xtset country_fac year
 . // Bring in PWT
. import excel "data/raw/pwt90.xlsx", sheet("Data") firstrow clear

. // Run the labels
. do "do/pwt_label"

. /*
> Variable Labels for Penn World Tables - http://www.rug.nl/ggdc/productivity/pwt/
> */
. 
. label var country       "Country name"

. label var currency_unit "Currency unit"

. label var year  "Year"

. label var rgdpe "Expenditure-side real GDP at chained PPPs (in mil. 2011US$)"

. label var rgdpo "Output-side real GDP at chained PPPs (in mil. 2011US$)"

. label var pop   "Population (in millions)"

. label var emp   "Number of persons engaged (in millions)"

. label var avh   "Average annual hours worked by persons engaged"

. label var hc    "Human capital index, based on years of schooling and returns to education; see Human capital in PWT9.
> "
note: label truncated to 80 characters

. label var ccon  "Real consumption of households and government, at current PPPs (in mil. 2011US$)"

. label var cda   "Real domestic absorption, (real consumption plus investment), at current PPPs (in mil. 2011US$)"
note: label truncated to 80 characters

. label var cgdpe "Expenditure-side real GDP at current PPPs (in mil. 2011US$)"

. label var cgdpo "Output-side real GDP at current PPPs (in mil. 2011US$)"

. label var ck    "Capital stock at current PPPs (in mil. 2011US$)"

. label var ctfp  "TFP level at current PPPs (USA=1)"

. label var cwtfp "Welfare-relevant TFP levels at current PPPs (USA=1)"

. 
. label var rgdpna        "Real GDP at constant 2011 national prices (in mil. 2011US$)"

. label var rconna        "Real consumption at constant 2011 national prices (in mil. 2011US$)"

. label var rdana "Real domestic absorption at constant 2011 national prices (in mil. 2011US$)"

. label var rkna  "Capital stock at constant 2011 national prices (in mil. 2011US$)"

. label var rtfpna        "TFP at constant national prices (2011=1)"

. label var rwtfpna       "Welfare-relevant TFP at constant national prices (2011=1)"

. label var labsh "Share of labour compensation in GDP at current national prices"

. label var delta "Average depreciation rate of the capital stock"

. 
. label var xr    "Exchange rate, national currency/USD (market+estimated)"

. label var pl_con        "Price level of CCON (PPP/XR), price level of USA GDPo in 2011=1"

. label var pl_da "Price level of CDA (PPP/XR), price level of USA GDPo in 2011=1"

. label var pl_gdpo       "Price level of CGDPo (PPP/XR),  price level of USA GDPo in 2011=1"

. 
. label var i_cig "0/1/2: relative price data for consumption, investment and government is extrapolated (0), benchmark 
> (1) or interpolated (2)"
note: label truncated to 80 characters

. label var i_xm  "0/1/2: relative price data for exports and imports is extrapolated (0), benchmark (1) or interpolated
>  (2)"
note: label truncated to 80 characters

. label var i_xr  "0/1: the exchange rate is market-based (0) or estimated (1)"

. label var i_outlier     "0/1: the observation on pl_gdpe or pl_gdpo is not an outlier (0) or an outlier (1)"
note: label truncated to 80 characters

. label var cor_exp       "Correlation between expenditure shares of the country and the US (benchmark observations only
> )"
note: label truncated to 80 characters

. label var statcap       "Statistical capacity indicator (source: World Bank, developing countries only)"

. 
. label var csh_c "Share of household consumption at current PPPs"

. label var csh_i "Share of gross capital formation at current PPPs"

. label var csh_g "Share of government consumption at current PPPs"

. label var csh_x "Share of merchandise exports at current PPPs"

. label var csh_m "Share of merchandise imports at current PPPs"

. label var csh_r "Share of residual trade and GDP statistical discrepancy at current PPPs"

. 
. label var pl_c  "Price level of household consumption,  price level of USA GDPo in 2011=1"

. label var pl_i  "Price level of capital formation,  price level of USA GDPo in 2011=1"

. label var pl_g  "Price level of government consumption,  price level of USA GDPo in 2011=1"

. label var pl_x  "Price level of exports, price level of USA GDPo in 2011=1"

. label var pl_m  "Price level of imports, price level of USA GDPo in 2011=1"

. label var pl_k  "Price level of the capital stock, price level of USA in 2011=1"

. 
. // Add in continent codes for later - many to one merge
. preserve

. import delimited "data/raw/country_continents.csv", clear 
(9 vars, 249 obs)

. drop countrycode

. rename alpha3 countrycode

. save "data/int/country_continents", replace
file data/int/country_continents.dta saved

. restore

. merge m:1 countrycode using "data/int/country_continents"

    Result                           # of obs.
    -----------------------------------------
    not matched                            67
        from master                         0  (_merge==1)
        from using                         67  (_merge==2)

    matched                            11,830  (_merge==3)
    -----------------------------------------

. keep if _merge == 3
(67 observations deleted)

. drop _merge

. 
. // Set up panel
. // Need the country variable to be a factor variable
. encode country, gen(country_fac)

. // Displays variable in terms of underlying label/string
. tab country_fac

                      Country name |      Freq.     Percent        Cum.
-----------------------------------+-----------------------------------
                           Albania |         65        0.55        0.55
                           Algeria |         65        0.55        1.10
                            Angola |         65        0.55        1.65
                          Anguilla |         65        0.55        2.20
               Antigua and Barbuda |         65        0.55        2.75
                         Argentina |         65        0.55        3.30
                           Armenia |         65        0.55        3.85
                             Aruba |         65        0.55        4.40
                         Australia |         65        0.55        4.95
                           Austria |         65        0.55        5.49
                        Azerbaijan |         65        0.55        6.04
                           Bahamas |         65        0.55        6.59
                           Bahrain |         65        0.55        7.14
                        Bangladesh |         65        0.55        7.69
                          Barbados |         65        0.55        8.24
                           Belarus |         65        0.55        8.79
                           Belgium |         65        0.55        9.34
                            Belize |         65        0.55        9.89
                             Benin |         65        0.55       10.44
                           Bermuda |         65        0.55       10.99
                            Bhutan |         65        0.55       11.54
  Bolivia (Plurinational State of) |         65        0.55       12.09
            Bosnia and Herzegovina |         65        0.55       12.64
                          Botswana |         65        0.55       13.19
                            Brazil |         65        0.55       13.74
            British Virgin Islands |         65        0.55       14.29
                 Brunei Darussalam |         65        0.55       14.84
                          Bulgaria |         65        0.55       15.38
                      Burkina Faso |         65        0.55       15.93
                           Burundi |         65        0.55       16.48
                        Cabo Verde |         65        0.55       17.03
                          Cambodia |         65        0.55       17.58
                          Cameroon |         65        0.55       18.13
                            Canada |         65        0.55       18.68
                    Cayman Islands |         65        0.55       19.23
          Central African Republic |         65        0.55       19.78
                              Chad |         65        0.55       20.33
                             Chile |         65        0.55       20.88
                             China |         65        0.55       21.43
              China, Hong Kong SAR |         65        0.55       21.98
                  China, Macao SAR |         65        0.55       22.53
                          Colombia |         65        0.55       23.08
                           Comoros |         65        0.55       23.63
                             Congo |         65        0.55       24.18
                        Costa Rica |         65        0.55       24.73
                           Croatia |         65        0.55       25.27
                           Curaçao |         65        0.55       25.82
                            Cyprus |         65        0.55       26.37
                    Czech Republic |         65        0.55       26.92
                     Côte d'Ivoire |         65        0.55       27.47
                 D.R. of the Congo |         65        0.55       28.02
                           Denmark |         65        0.55       28.57
                          Djibouti |         65        0.55       29.12
                          Dominica |         65        0.55       29.67
                Dominican Republic |         65        0.55       30.22
                           Ecuador |         65        0.55       30.77
                             Egypt |         65        0.55       31.32
                       El Salvador |         65        0.55       31.87
                 Equatorial Guinea |         65        0.55       32.42
                           Estonia |         65        0.55       32.97
                          Ethiopia |         65        0.55       33.52
                              Fiji |         65        0.55       34.07
                           Finland |         65        0.55       34.62
                            France |         65        0.55       35.16
                             Gabon |         65        0.55       35.71
                            Gambia |         65        0.55       36.26
                           Georgia |         65        0.55       36.81
                           Germany |         65        0.55       37.36
                             Ghana |         65        0.55       37.91
                            Greece |         65        0.55       38.46
                           Grenada |         65        0.55       39.01
                         Guatemala |         65        0.55       39.56
                            Guinea |         65        0.55       40.11
                     Guinea-Bissau |         65        0.55       40.66
                             Haiti |         65        0.55       41.21
                          Honduras |         65        0.55       41.76
                           Hungary |         65        0.55       42.31
                           Iceland |         65        0.55       42.86
                             India |         65        0.55       43.41
                         Indonesia |         65        0.55       43.96
        Iran (Islamic Republic of) |         65        0.55       44.51
                              Iraq |         65        0.55       45.05
                           Ireland |         65        0.55       45.60
                            Israel |         65        0.55       46.15
                             Italy |         65        0.55       46.70
                           Jamaica |         65        0.55       47.25
                             Japan |         65        0.55       47.80
                            Jordan |         65        0.55       48.35
                        Kazakhstan |         65        0.55       48.90
                             Kenya |         65        0.55       49.45
                            Kuwait |         65        0.55       50.00
                        Kyrgyzstan |         65        0.55       50.55
                   Lao People's DR |         65        0.55       51.10
                            Latvia |         65        0.55       51.65
                           Lebanon |         65        0.55       52.20
                           Lesotho |         65        0.55       52.75
                           Liberia |         65        0.55       53.30
                         Lithuania |         65        0.55       53.85
                        Luxembourg |         65        0.55       54.40
                        Madagascar |         65        0.55       54.95
                            Malawi |         65        0.55       55.49
                          Malaysia |         65        0.55       56.04
                          Maldives |         65        0.55       56.59
                              Mali |         65        0.55       57.14
                             Malta |         65        0.55       57.69
                        Mauritania |         65        0.55       58.24
                         Mauritius |         65        0.55       58.79
                            Mexico |         65        0.55       59.34
                          Mongolia |         65        0.55       59.89
                        Montenegro |         65        0.55       60.44
                        Montserrat |         65        0.55       60.99
                           Morocco |         65        0.55       61.54
                        Mozambique |         65        0.55       62.09
                           Myanmar |         65        0.55       62.64
                           Namibia |         65        0.55       63.19
                             Nepal |         65        0.55       63.74
                       Netherlands |         65        0.55       64.29
                       New Zealand |         65        0.55       64.84
                         Nicaragua |         65        0.55       65.38
                             Niger |         65        0.55       65.93
                           Nigeria |         65        0.55       66.48
                            Norway |         65        0.55       67.03
                              Oman |         65        0.55       67.58
                          Pakistan |         65        0.55       68.13
                            Panama |         65        0.55       68.68
                          Paraguay |         65        0.55       69.23
                              Peru |         65        0.55       69.78
                       Philippines |         65        0.55       70.33
                            Poland |         65        0.55       70.88
                          Portugal |         65        0.55       71.43
                             Qatar |         65        0.55       71.98
                 Republic of Korea |         65        0.55       72.53
               Republic of Moldova |         65        0.55       73.08
                           Romania |         65        0.55       73.63
                Russian Federation |         65        0.55       74.18
                            Rwanda |         65        0.55       74.73
             Saint Kitts and Nevis |         65        0.55       75.27
                       Saint Lucia |         65        0.55       75.82
             Sao Tome and Principe |         65        0.55       76.37
                      Saudi Arabia |         65        0.55       76.92
                           Senegal |         65        0.55       77.47
                            Serbia |         65        0.55       78.02
                        Seychelles |         65        0.55       78.57
                      Sierra Leone |         65        0.55       79.12
                         Singapore |         65        0.55       79.67
         Sint Maarten (Dutch part) |         65        0.55       80.22
                          Slovakia |         65        0.55       80.77
                          Slovenia |         65        0.55       81.32
                      South Africa |         65        0.55       81.87
                             Spain |         65        0.55       82.42
                         Sri Lanka |         65        0.55       82.97
    St. Vincent and the Grenadines |         65        0.55       83.52
                State of Palestine |         65        0.55       84.07
                    Sudan (Former) |         65        0.55       84.62
                          Suriname |         65        0.55       85.16
                         Swaziland |         65        0.55       85.71
                            Sweden |         65        0.55       86.26
                       Switzerland |         65        0.55       86.81
              Syrian Arab Republic |         65        0.55       87.36
                 TFYR of Macedonia |         65        0.55       87.91
                            Taiwan |         65        0.55       88.46
                        Tajikistan |         65        0.55       89.01
                          Thailand |         65        0.55       89.56
                              Togo |         65        0.55       90.11
               Trinidad and Tobago |         65        0.55       90.66
                           Tunisia |         65        0.55       91.21
                            Turkey |         65        0.55       91.76
                      Turkmenistan |         65        0.55       92.31
          Turks and Caicos Islands |         65        0.55       92.86
        U.R. of Tanzania: Mainland |         65        0.55       93.41
                            Uganda |         65        0.55       93.96
                           Ukraine |         65        0.55       94.51
              United Arab Emirates |         65        0.55       95.05
                    United Kingdom |         65        0.55       95.60
                     United States |         65        0.55       96.15
                           Uruguay |         65        0.55       96.70
                        Uzbekistan |         65        0.55       97.25
Venezuela (Bolivarian Republic of) |         65        0.55       97.80
                          Viet Nam |         65        0.55       98.35
                             Yemen |         65        0.55       98.90
                            Zambia |         65        0.55       99.45
                          Zimbabwe |         65        0.55      100.00
-----------------------------------+-----------------------------------
                             Total |     11,830      100.00

. // But under the hood it is actually a number! Uncomment and run (throws error)
. /*
> count if country_fac == "Turkmenistan"
> count if country == "Turkmenistan"
> */
. 
. // Now set panel
. xtset country_fac year
       panel variable:  country_fac (strongly balanced)
        time variable:  year, 1950 to 2014
                delta:  1 unit

. 
end of do-file
// Make average growth rates
// For purposes of comparison, let's drop all years < 1970
keep if year >= 1970
// For ease of analysis, let's just keep those variables that we'll be needing
keep year country* rgdpe pop *region*
// Make log income per capita - Notice that rgdpe and pop both in millions
gen log_gdp_pc = log(rgdpe) - log(pop)
label var log_gdp_pc "Log GDP per capita - log(rgdpe / pop)"
// Now make a growth rate using the lag operator
gen d_gdp_pc = D.log_gdp_pc
label var d_gdp_pc "Growth GDP per capita - D.log_gdp_pc"
// Make mean growth rates using egen and bysort
bysort country: egen mean_d_gdp_pc = mean(d_gdp_pc)
// Make a percent by multiplying by 100
replace mean_d_gdp_pc  = mean_d_gdp_pc * 100
label var mean_d_gdp_pc "Mean Growth - GDP per capita (%), 1970-2014"
// Should do spot check to make sure correct
order country mean_d_gdp_pc
//browse

// How do we get the first year's GDP?
// reshape! long -> wide
// What do we need?
keep log_gdp_pc mean_d_gdp_pc country* year *region*
reshape wide log_gdp_pc, i(country) j(year)
// Really only need to keep the log_gdp_pc from 1970 - Make a new variable
gen init_log_gdp_pc = log_gdp_pc1970
label var init_log_gdp_pc "Log GDP per capita, 1970"
// Drop all others
drop log_gdp_pc*
 . // Make average growth rates
. // For purposes of comparison, let's drop all years < 1970
. keep if year >= 1970
(3640 observations deleted)

. // For ease of analysis, let's just keep those variables that we'll be needing
. keep year country* rgdpe pop *region*

. // Make log income per capita - Notice that rgdpe and pop both in millions
. gen log_gdp_pc = log(rgdpe) - log(pop)
(549 missing values generated)

. label var log_gdp_pc "Log GDP per capita - log(rgdpe / pop)"

. // Now make a growth rate using the lag operator
. gen d_gdp_pc = D.log_gdp_pc
(731 missing values generated)

. label var d_gdp_pc "Growth GDP per capita - D.log_gdp_pc"

. // Make mean growth rates using egen and bysort
. bysort country: egen mean_d_gdp_pc = mean(d_gdp_pc)

. // Make a percent by multiplying by 100
. replace mean_d_gdp_pc  = mean_d_gdp_pc * 100
(8190 real changes made)

. label var mean_d_gdp_pc "Mean Growth - GDP per capita (%), 1970-2014"

. // Should do spot check to make sure correct
. order country mean_d_gdp_pc

. //browse
. 
. // How do we get the first year's GDP?
. // reshape! long -> wide
. // What do we need?
. keep log_gdp_pc mean_d_gdp_pc country* year *region*

. reshape wide log_gdp_pc, i(country) j(year)
(note: j = 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991
>  1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                     8190   ->     182
Number of variables                  10   ->      53
j variable (45 values)             year   ->   (dropped)
xij variables:
                             log_gdp_pc   ->   log_gdp_pc1970 log_gdp_pc1971 ... log_gdp_pc2014
-----------------------------------------------------------------------------

. // Really only need to keep the log_gdp_pc from 1970 - Make a new variable
. gen init_log_gdp_pc = log_gdp_pc1970
(26 missing values generated)

. label var init_log_gdp_pc "Log GDP per capita, 1970"

. // Drop all others
. drop log_gdp_pc*

Plotting

Let’s start out with another simple plot:
scatter mean_d_gdp_pc init_log_gdp_pc
 . scatter mean_d_gdp_pc init_log_gdp_pc

It kind of looks like there’s a weak negative correlation between the variables. Indeed, if we run an OLS regression we find that the correlation is significant at the 10% level. In fact, a long line of literature has found just the opposite–there is no close correlation at all. What gives?!

If we look at our figure we see that there are a few high-income countries that seem to be outliers.9

Let’s try to think of a way to deal with those datapoints. First, who are they? Let’s replace the circles with country abbreviations:
twoway (scatter mean_d_gdp_pc init_log_gdp_pc if init_log_gdp_pc <= log(30000), ///
    mlabel(countrycode) mlabp(0) m(i) mlabs(1.65)) || ///
    (scatter mean_d_gdp_pc init_log_gdp_pc if init_log_gdp_pc > log(30000), ///
    mlabel(countrycode) mlabp(0) m(i) mlabs(1.65) ///
    legend(off) ///
    title("Initial GDP and Subsequent Growth, 1970-2014") ///
    )
 . twoway (scatter mean_d_gdp_pc init_log_gdp_pc if init_log_gdp_pc <= log(30000), ///
>         mlabel(countrycode) mlabp(0) m(i) mlabs(1.65)) || ///
>         (scatter mean_d_gdp_pc init_log_gdp_pc if init_log_gdp_pc > log(30000), ///
>         mlabel(countrycode) mlabp(0) m(i) mlabs(1.65) ///
>         legend(off) ///
>         title("Initial GDP and Subsequent Growth, 1970-2014") ///
>         )

Aha, they’re all countries with a bunch of oil! Long story short, this is a lurking variable in the anaylsis. For the purpose of making our graph, then, we’ll get rid of them.

Consider the modified picture:
twoway scatter mean_d_gdp_pc init_log_gdp_pc  if init_log_gdp_pc <= log(30000), ///
    title("Initial GDP and Subsequent Growth, 1970-2014") ///
    mlabel(countrycode) mlabp(0) m(i) mlabs(1.45)
 . twoway scatter mean_d_gdp_pc init_log_gdp_pc  if init_log_gdp_pc <= log(30000), ///
>         title("Initial GDP and Subsequent Growth, 1970-2014") ///
>         mlabel(countrycode) mlabp(0) m(i) mlabs(1.45)

Notice how–having removed the outliers–the apparent correlation falls apart. Part of this is just mechanical. But I think that an important part of it has to do with how we perceive figures; make a lot of graphs to be sure that you’re eyes aren’t playing any tricks on you.

What if we wanted to see how the correlation held up across continents? Here’s two possibilities:
// What about separate plots by continent?
twoway scatter mean_d_gdp_pc init_log_gdp_pc, ///
    mlabel(countrycode) mlabp(0) m(i)  mlabs(1.45) ///
    by(region)
// One plot with color by subregion?
separate mean_d_gdp_pc, by(region)
twoway scatter mean_d_gdp_pc* init_log_gdp_pc, ///
    title("Initial GDP and Subsequent Growth, 1970-2014") ///
    legend(off) m(i) ///
    ytitle("Mean Growth - GDP per capita (%), 1970-2014")
 . // What about separate plots by continent?
. twoway scatter mean_d_gdp_pc init_log_gdp_pc, ///
>         mlabel(countrycode) mlabp(0) m(i)  mlabs(1.45) ///
>         by(region)

. // One plot with color by subregion?
. separate mean_d_gdp_pc, by(region)

              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------------------------------
mean_d_gdp_pc1  float   %9.0g                 mean_d_gdp_pc, region == Africa
mean_d_gdp_pc2  float   %9.0g                 mean_d_gdp_pc, region == Americas
mean_d_gdp_pc3  float   %9.0g                 mean_d_gdp_pc, region == Asia
mean_d_gdp_pc4  float   %9.0g                 mean_d_gdp_pc, region == Europe
mean_d_gdp_pc5  float   %9.0g                 mean_d_gdp_pc, region == Oceania

. twoway scatter mean_d_gdp_pc* init_log_gdp_pc, ///
>         title("Initial GDP and Subsequent Growth, 1970-2014") ///
>         legend(off) m(i) ///
>         ytitle("Mean Growth - GDP per capita (%), 1970-2014")


  1. Instructor: Cynthia Kinnan, TA: Michael Wooley

  2. The alternative (and default) option is an .smcl file. These files only appear to be well-formatted when opened in Stata. I don’t know of any obvious advantages to this file type.

  3. In particular, if we did this our do file couldn’t be run more than once without error–the files that it is manipulating on the second run wouldn’t be the same as those that we manipulated on the first.

  4. 1:1 can be compared to a many-to-one (m:1) merge, which says that there may be several rows in the master data that correspond to the same row in the using data. For example, suppose that we were going to merge GNP by region to the data. There are many countries for each region, which would lead to a m:1 merge. A 1:m merge is the mirror inverse of the m:1 merge; e.g. let the GNP-by-region data be the master and the life expectancy data be the using data. We can also do a m:m merge. Notice that this case is especially dicey because we’re actually potentially introducing new observations into the data.

  5. Does all of this macro business seem like overkill? Why shouldn’t we just copy and paste the one number back into the do editor? Well, suppose that you realized that you messed up one of the steps in the previous section. Then the number that you copied and pasted here will probably no longer be correct. Now we could go back and update the median in the do file. However, this seems pretty cumbersome and I, for one, do not trust myself to remember that this even needs to be fixed.

  6. If you’re familiar with regular expressions you may be wondering if you can use those to select variable names. You can’t do this without complicating things a bit more. See help regex for more info.

  7. In fact, I’ve simply copied and pasted the code chunk from above and replaced su with corr.

  8. See chapter 3 of Ray for more details.

  9. Also, if we compare our figure to Ray’s Fig. 4.5 from his updated chapter 3 (here) we see that he cuts off the per capita income at $30,000.