This is a non-comprehensive review of Stata. It is based on a set of data files that can be downloaded directly from Canvas (see files/sections/stata_review_empty.zip).
Before jumping into the do-file I want to make some remarks about the directory setup and handling of the data.
There are a lot of empty directories in what you downloaded. These are all set up in anticipation of needing them later on. Even if you think that you only have a small project at the beginning you will probably end up with a whole bunch of files by the end. Be a proactive organizer.
The underlying data that we’ll use is stored in the directory data/raw. We’re going to put together a file (the aforementioned do-file) that will tell Stata to carry out a set of commands on the data. However, we will never actually modify the data in data/raw. We will instead create new datasets and save those elsewhere. This is critical for replicating an analysis.
We can easily send commands from the do-file to the Stata interpreter in a few ways:
- ctrl+d runs the entire do-file.
- [highlight lines]+ctrl+d runs the selected lines.
- ctrl+r quietly runs the do-file; that is, nothing is printed to the Stata output screen.

You can also fetch previously-executed commands from i) the left panel (assuming the default Stata configuration) or ii) clicking on the command line and hitting fn+[UP arrow].
.do File

This is a header for the underlying .do file. Give a short description of what the script does, what (if anything) the user needs to specify, and (if applicable) what it produces.
/*
main - Review of Stata commands
Michael Wooley
Econ 326 (or assoc project)
Last Edit: March 31, 2017
Script will run so long as wk_dir is modified to point to the "stata_review" directory.
*/
Clear everything from memory
clear all
Avoid the annoying “more” message
set more off
Now define global "macros," which is what Stata calls variables that are not columns in the dataset. These are the variables that the user needs to specify in order to run your code; make them conspicuous.
// WORKING DIRECTORY - Must be set to run the script.
global wk_dir "C:/Users/Michael/Dropbox/teaching/ECON326/Recitations/Recitation1/stata_review"
// Log file name
global log_file "log/state_review"
. // WORKING DIRECTORY - Must be set to run the script.
. global wk_dir "C:/Users/Michael/Dropbox/teaching/ECON326/Recitations/Recitation1/stata_review"
. // Log file name
. global log_file "log/state_review"
In this case it will be overkill to separately define wk_dir. We will only use it for one purpose. However, you could imagine a case where there is a need to jump between several directories (and this might only occur on line 345). Defining these macros is going to make it so that the user only needs to set/change these once.
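For example, if the script read from and wrote to several directories, you might define one global per directory at the top and refer to them throughout. A minimal sketch (the names raw_dir and int_dir are just illustrative, not part of the do-file below):
// Hypothetical sketch: one global per directory, set once at the top
global raw_dir "data/raw"
global int_dir "data/int"
use "$raw_dir/lifeexp-eur&asia", clear
save "$int_dir/lifeexp", replace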
Now we change the working directory to wk_dir, which we specified above. We use the Stata macro syntax $[macro name]. The $ prefix on the name tells Stata to interpret the input as the name of a global macro, not as, e.g., a column name.
cd $wk_dir
. cd $wk_dir
C:\Users\Michael\Dropbox\teaching\ECON326\Recitations\Recitation1\stata_review
Next we need to start up the log file. What does this do? The log command is going to create (or replace) a file and write everything that we see in the Stata output window to it.
We want to do this because it preserves our results. This will make it easier to share and review our results later. For example, if you don’t have Stata on hand you can still review a log file (if you set the options correctly).
Start logging:
//log using $log_file , replace text
disp "hello"
//log close
. //log using $log_file , replace text
. disp "hello"
hello
. //log close
Let’s discuss each of the elements of this command:
- log using $log_file - Begin to log "using" the specified file. In this case it is the name that we defined above.
- , - The comma separates the main command from the options in Stata.
- replace - Stata will try to open a new log file. However, if we have run the do-file before, a log file already exists. This tells Stata that it is okay to go ahead and replace the pre-existing log file if it finds one.
- text - This option specifies the output format of the log file. By specifying text we ensure that we will be able to open and read the log file in any text editor, not just Stata.2

Once we start the log file, Stata will write all of our commands (not just those in the do-file) to the log file until we tell it to stop logging. To do this we add a line at the end of the file: log close. It is a good idea to add this line at the same time that you add the log start statement (so that you don't forget later), then write the rest of your commands between these two lines.
If you're debugging the file while running it, you may run into an error that halts execution. Since your log close statement is below the error line, it will not be executed, so the log file is left open. If you then try to re-run the code, you will get an error at the log using line saying that a log file is already open. One way around this is to comment out these two lines while you're writing the file (you probably won't need to look at the log until after you're done writing it anyway). Make a note to go back and un-comment them when you're finished.
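Another way around this, a minimal sketch, is to put a capture log close immediately before the log using line: capture swallows the error that log close would otherwise throw when no log is open, so the do-file can be re-run from the top even after a crash.
// Close any log left open by an earlier, interrupted run; do nothing otherwise
capture log close
log using $log_file, replace text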
Start with the dataset in “data/raw/lifeexp-eur&asia”.
// Open the base dataset
use "data/raw/lifeexp-eur&asia", replace
. // Open the base dataset
. use "data/raw/lifeexp-eur&asia", replace
(Life expectancy, 1998)
For more on the use command, type help use at the command line.
Now we’re going to inspect the data that was just loaded in to see i) what it is and ii) if it makes sense.
// Inspect it: (Bad labels, data types, no labels, missing data)
describe
. // Inspect it: (Bad labels, data types, no labels, missing data)
. describe
Contains data from data/raw/lifeexp-eur&asia.dta
obs: 44 Life expectancy, 1998
vars: 5 29 Mar 2017 15:40
size: 1,540 (_dta has notes)
------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
------------------------------------------------------------------------------------------------------------------------
region byte %12.0g region Country
country str28 %28s Region
popgrowth float %9.0g * Avg. annual % growth
lexp byte %9.0g * Life expectancy at birth
safewater byte %9.0g *
* indicated variables have notes
------------------------------------------------------------------------------------------------------------------------
Sorted by:
Notice that the data labels for region and country appear to be switched around. We can check which is wrong (bad labels or bad variable names?) by looking at the data itself.
//browse //
. //browse //
In the data, empty entries show up as a single period (.). Notice that a lot of safewater entries are missing.
// Correct Errors in Labeling
label var region Region
label var country Country
. // Correct Errors in Labeling
. label var region Region
. label var country Country
We can get a better picture of the missing data:
misstable summarize, all
. misstable summarize, all
Obs<.
+------------------------------
| | Unique
Variable | Obs=. Obs>. Obs<. | values Min Max
-------------+--------------------------------+------------------------------
region | 44 | 1 1 1
country | (string variable) |
popgrowth | 44 | 20 -.5 2.8
lexp | 44 | 15 65 79
safewater | 27 17 | 12 55 100
-----------------------------------------------------------------------------
We want to see both i) how many entries are missing for each variable and ii) whether the data seem to make sense. It is not uncommon for missing data to be coded as -999 (or something along those lines). That doesn't appear to be the case in this data, though.
We should also check to make sure that the other variables aren’t super-odd. For instance, are the lexp entries reasonable? We shouldn’t see negative entries or entries above 100 (or 90 really).
Also, if we’re working with hand-entered (or OCR’d) data we would want to do a lot more to check for mis-types.
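One way to automate these sanity checks is with assert, which halts the do-file when its condition fails for any observation; a minimal sketch (the -999 sentinel is only an example of the kind of thing to screen for):
// Stop the script if any non-missing life expectancy is implausible
assert inrange(lexp, 1, 100) if !missing(lexp)
// Count suspicious sentinel values such as -999
count if popgrowth == -999 | lexp == -999 | safewater == -999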
We have just imported some data. Usually we need to combine this data with some other datasets in order to get the full set of variables that we need.
There are two principal operations that we might carry out:

- append adds additional rows to the frame. For instance, we could append a line for the U.S.
- merge adds additional columns to the frame. In a minute we will add data on GNP for each country.

We have a second dataset that is similar to the first but has additional countries from the Western Hemisphere.
We begin by peeking into the data to be appended to make sure that it is suitable for appending.
The preserve command is going to store the data we have in memory right now while we open the other dataset.
// Preserve current data for later
preserve
// Now using new data
use "data/raw/lifeexp-america", clear
// Run same checks on this data as last
describe
browse
misstable summarize, all
// Restore previous data
restore
. // Preserve current data for later
. preserve
. // Now using new data
. use "data/raw/lifeexp-america", clear
(Life expectancy, 1998)
. // Run same checks on this data as last
. describe
Contains data from data/raw/lifeexp-america.dta
obs: 23 Life expectancy, 1998
vars: 5 4 Jan 2011 17:24
size: 805 (_dta has notes)
------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
------------------------------------------------------------------------------------------------------------------------
region byte %12.0g region Region
country str28 %28s Country
popgrowth float %9.0g * Avg. annual % growth
lexp byte %9.0g * Life expectancy at birth
safewater byte %9.0g *
* indicated variables have notes
------------------------------------------------------------------------------------------------------------------------
Sorted by:
. browse
request ignored because of batch mode
. misstable summarize, all
Obs<.
+------------------------------
| | Unique
Variable | Obs=. Obs>. Obs<. | values Min Max
-------------+--------------------------------+------------------------------
region | 23 | 2 2 3
country | (string variable) |
popgrowth | 23 | 16 .7 3
lexp | 23 | 15 54 79
safewater | 1 22 | 19 28 99
-----------------------------------------------------------------------------
. // Restore previous data
. restore
For the append to work properly, variable names in the two datasets must match exactly, and names are case-sensitive: Stata would treat POPGROWTH differently from popgrowth. Everything looks good. We can restore the Europe and Asia data from memory (i.e. undo the preserve from earlier) and carry out the append:
// Append
append using "data/raw/lifeexp-america"
. // Append
. append using "data/raw/lifeexp-america"
(label region already defined)
Now we’re going to merge our current dataframe with a dataset containing variables on GNP per capita by country.
In order to merge we need to ensure that we have common variables between the two datasets and that these merge variables play well together.
Stata requires that the two merge variables have identical names. Also, the merge command is "stupid" in the sense that it will only match values that are exactly identical; it cannot tell that U.S. and United States are the same country. (For that matter, it can't tell that united states and United States are the same country.)
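Because matching is exact, it can help to normalize string merge keys in both datasets before merging. A minimal sketch, assuming a recent Stata with strtrim() and stritrim() (older versions use trim() and itrim()); the data here happen not to need it:
// Drop stray leading/trailing spaces and collapse doubled internal spaces
replace country = strtrim(stritrim(country))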
// Merge dataset on per capita income - add COLUMNS
// Peek into merging dataset to check for problems, identify merging variables
save "data/int/lifeexp", replace
use "data/raw/gnppc", clear
describe
browse
misstable summarize
. // Merge dataset on per capita income - add COLUMNS
. // Peek into merging dataset to check for problems, identify merging variables
. save "data/int/lifeexp", replace
file data/int/lifeexp.dta saved
. use "data/raw/gnppc", clear
. describe
Contains data from data/raw/gnppc.dta
obs: 64
vars: 2 29 Mar 2017 16:13
size: 1,472
------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
------------------------------------------------------------------------------------------------------------------------
Country str19 %19s
gnppc long %8.0g GNP per capita
------------------------------------------------------------------------------------------------------------------------
Sorted by:
. browse
request ignored because of batch mode
. misstable summarize
(variables nonmissing or string)
* Can match on country name but need variables to be the same
rename Country country
* Save as NEW dataset in 'data/int'
save "data/int/gnppc_for_merge", replace
* Get old data
use "data/int/lifeexp"
. * Can match on country name but need variables to be the same
. rename Country country
. * Save as NEW dataset in 'data/int'
. save "data/int/gnppc_for_merge", replace
file data/int/gnppc_for_merge.dta saved
. * Get old data
. use "data/int/lifeexp"
(Life expectancy, 1998)
Notice how I saved the modified data in a new file in the intermediate data directory. This is in keeping with the idea that–in order to ensure the replicability of the analysis–we never actually change the original data.3
Okay, let’s go ahead and try to merge. The merge syntax is similar to the append syntax except we need to specify the merge type and the merge variables.
For the merge type I specify 1:1. That is, we tell Stata that, for each entry in the master data, we should find exactly one entry in the using data. This is a nice way of ensuring that you're not introducing or removing spurious observations.4
// Now do the actual merge!
// Still a bit hazy on the "sort" rationale. Just do it if you get an error...
sort country
// Let's preserve the data in case something goes wrong on merge
save "data/int/lifeexp", replace
// Merge
merge 1:1 country using "data/int/gnppc_for_merge"
. // Now do the actual merge!
. // Still a bit hazy on the "sort" rationale. Just do it if you get an error...
. sort country
. // Let's preserve the data in case something goes wrong on merge
. save "data/int/lifeexp", replace
file data/int/lifeexp.dta saved
. // Merge
. merge 1:1 country using "data/int/gnppc_for_merge"
Result # of obs.
-----------------------------------------
not matched 9
from master 6 (_merge==1)
from using 3 (_merge==2)
matched 61 (_merge==3)
-----------------------------------------
You should also browse the data to make sure that the observations reported as successfully merged really were merged correctly.
// What countries were left out?
list region country _merge if _merge != 3
. // What countries were left out?
. list region country _merge if _merge != 3
+---------------------------------------------------------------+
| region country _merge |
|---------------------------------------------------------------|
9. | Eur & C.Asia Bosnia and Herzegovina master only (1) |
15. | N.A. Cuba master only (1) |
38. | Eur & C.Asia Macedonia FYR master only (1) |
49. | N.A. Puerto Rico master only (1) |
60. | Eur & C.Asia Turkmenistan master only (1) |
|---------------------------------------------------------------|
67. | Eur & C.Asia Yugoslavia, FR (Serb./Mont.) master only (1) |
68. | . Colombia using only (2) |
69. | . Israel using only (2) |
70. | . Macedonia using only (2) |
+---------------------------------------------------------------+
Okay, notice that there is an unmatched master entry called Macedonia FYR and an unmatched using entry called Macedonia. These are probably the same country, but we should check to make sure. (I found the FYR to be somewhat puzzling: is it maybe an annotation added by someone to specify something?) Let's just look up Macedonia on Wikipedia and see if anything pops up. In fact, it does. These are referring to the same country.
/*
Appears that Macedonia and Macedonia FYR are actually the same country?
Yes, from Wikipedia entry for "Republic of Macedonia"
(https://en.wikipedia.org/wiki/Republic_of_Macedonia)
It became a member of the United Nations in 1993, but, as a result of an
ongoing dispute with Greece over the use of the name "Macedonia", was
admitted under the provisional description the former Yugoslav Republic
of Macedonia[10][11] (sometimes unofficially abbreviated as FYROM and
FYR Macedonia), a term that is also used by international organizations
such as the European Union,[12] the Council of Europe[13] and NATO.[14]
*/
// Amend gnppc_for_merge data
use "data/int/gnppc_for_merge", clear
// Change a single name
replace country = "Macedonia FYR" if country == "Macedonia"
// Save
save "data/int/gnppc_for_merge", replace
// Restore back life expectations data
use "data/int/lifeexp", replace
. /*
> Appears that Macedonia and Macedonia FYR are actually the same country?
>
> Yes, from Wikipedia entry for "Republic of Macedonia"
> (https://en.wikipedia.org/wiki/Republic_of_Macedonia)
>
> It became a member of the United Nations in 1993, but, as a result of an
> ongoing dispute with Greece over the use of the name "Macedonia", was
> admitted under the provisional description the former Yugoslav Republic
> of Macedonia[10][11] (sometimes unofficially abbreviated as FYROM and
> FYR Macedonia), a term that is also used by international organizations
> such as the European Union,[12] the Council of Europe[13] and NATO.[14]
> */
.
. // Amend gnppc_for_merge data
. use "data/int/gnppc_for_merge", clear
. // Change a single name
. replace country = "Macedonia FYR" if country == "Macedonia"
(1 real change made)
. // Save
. save "data/int/gnppc_for_merge", replace
file data/int/gnppc_for_merge.dta saved
. // Restore back life expectations data
. use "data/int/lifeexp", replace
(Life expectancy, 1998)
// Merge again
// Merge
merge 1:1 country using "data/int/gnppc_for_merge"
// What countries were left out?
li region country _merge if _merge != 3
// Check on Macedonia:
li region country _merge if country == "Macedonia FYR"
. // Merge again
. // Merge
. merge 1:1 country using "data/int/gnppc_for_merge"
Result # of obs.
-----------------------------------------
not matched 7
from master 5 (_merge==1)
from using 2 (_merge==2)
matched 62 (_merge==3)
-----------------------------------------
.
. // What countries were left out?
. li region country _merge if _merge != 3
+---------------------------------------------------------------+
| region country _merge |
|---------------------------------------------------------------|
9. | Eur & C.Asia Bosnia and Herzegovina master only (1) |
15. | N.A. Cuba master only (1) |
49. | N.A. Puerto Rico master only (1) |
60. | Eur & C.Asia Turkmenistan master only (1) |
67. | Eur & C.Asia Yugoslavia, FR (Serb./Mont.) master only (1) |
|---------------------------------------------------------------|
68. | . Colombia using only (2) |
69. | . Israel using only (2) |
+---------------------------------------------------------------+
. // Check on Macedonia:
. li region country _merge if country == "Macedonia FYR"
+--------------------------------------------+
| region country _merge |
|--------------------------------------------|
38. | Eur & C.Asia Macedonia FYR matched (3) |
+--------------------------------------------+
Everything looks to be in order. At this point the data is ready for analysis. Let’s go ahead and save it to the final data directory.
// Data is ready for analysis. Save it to 'data/fin'
save "data/fin/lifeexp_main", replace
. // Data is ready for analysis. Save it to 'data/fin'
. save "data/fin/lifeexp_main", replace
file data/fin/lifeexp_main.dta saved
In this section we discuss descriptive statistics like the mean, median, and variance of the variables.
The most common command here is summarize (which can be abbreviated to su). Here's the basic usage:
// Basic Summary Stats - All variables by default
su
// Particular set of variables
su lexp safewater
. // Basic Summary Stats - All variables by default
. su
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
region | 67 1.477612 .7252629 1 3
country | 0
popgrowth | 67 .9567164 .9295199 -.5 3
lexp | 67 72.31343 4.742487 54 79
safewater | 39 76.05128 18.12232 28 100
-------------+--------------------------------------------------------
gnppc | 64 8797.125 10595.19 370 39980
_merge | 69 2.826087 .5411459 1 3
. // Particular set of variables
. su lexp safewater
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
lexp | 67 72.31343 4.742487 54 79
safewater | 39 76.05128 18.12232 28 100
For more detailed statistics (percentiles, variance, skewness, kurtosis), add the detail option:
su gnppc, detail
. su gnppc, detail
GNP per capita
-------------------------------------------------------------
Percentiles Smallest
1% 370 370
5% 410 380
10% 740 380 Obs 64
25% 1440 410 Sum of Wgt. 64
50% 3445 Mean 8797.125
Largest Std. Dev. 10595.19
75% 15300 29240
90% 25580 33040 Variance 1.12e+08
95% 29240 34310 Skewness 1.270107
99% 39980 39980 Kurtosis 3.322431
Oftentimes we'll only want to run a command on a subset of the data. For example, we might want to see how the results change if we exclude some possible outliers or if we restrict the data to a group of observations (e.g. countries in North America, students in Ms. Jones's class).
Stata provides a fairly simple syntax for subsetting data that is uniform across many common commands.
// Stats by region
bysort region: su lexp popgrowth
// Stats for subsets meeting logical condition
su lexp popgrowth if lexp > 75
// include if (EITHER lexp > 75 OR lexp < 72)
su lexp popgrowth if lexp > 75 | lexp < 72
// include if (lexp > 75 AND in region 1)
su lexp popgrowth if lexp > 75 & region == 1
. // Stats by region
. bysort region: su lexp popgrowth
------------------------------------------------------------------------------------------------------------------------
-> region = Eur & C.Asia
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
lexp | 44 73.06818 4.150639 65 79
popgrowth | 44 .525 .7175945 -.5 2.8
------------------------------------------------------------------------------------------------------------------------
-> region = N.A.
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
lexp | 14 71.21429 6.411271 54 79
popgrowth | 14 1.692857 .7488086 .7 3
------------------------------------------------------------------------------------------------------------------------
-> region = S.A.
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
lexp | 9 70.33333 4.062019 62 75
popgrowth | 9 1.922222 .6534099 .7 2.9
------------------------------------------------------------------------------------------------------------------------
-> region = .
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
lexp | 0
popgrowth | 0
. // Stats for subsets meeting logical condition
. su lexp popgrowth if lexp > 75
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
lexp | 19 77.57895 1.070607 76 79
popgrowth | 19 .5105263 .2960757 .1 1.2
. // include if (EITHER lexp > 75 OR lexp < 72)
. su lexp popgrowth if lexp > 75 | lexp < 72
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
lexp | 46 71.80435 5.623836 54 79
popgrowth | 46 1.034783 .9998261 -.4 3
. // include if (lexp > 75 AND in region 1)
. su lexp popgrowth if lexp > 75 & region == 1
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
lexp | 15 77.73333 .9611501 76 79
popgrowth | 15 .3866667 .1597617 .1 .7
In the last few lines I’ve combined logical conditions by using the & (‘and’) and | (‘inclusive or’) operators.
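As an aside, conditions like these can sometimes be written more compactly with Stata's inrange() and inlist() functions; a sketch of equivalent subsets:
// Same subset as lexp > 75 | lexp < 72 (missing lexp is included by both forms)
su lexp popgrowth if !inrange(lexp, 72, 75)
// Keep only the first two region codes (Eur & C.Asia and N.A. in this encoding)
su lexp popgrowth if inlist(region, 1, 2)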
We now consider another subsetting problem: figure out the median GNP per capita in the sample, then get summary stats for the subsets of countries that are above/below that median.
When we call su (and many other commands), a number of results are stored as macros in r(). You can see the full set of stored results at the bottom of the command's help page (command: help su).
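To find the name of the stored result you need, you can run the command and then list everything it left behind in r(); a minimal sketch:
// List all r() results stored by the last r-class command
su gnppc, detail
return list
display r(p50)   // the median GNP per capita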
// Detailed stats
su gnppc, detail
// Save median GNP for later
local gnp_med = `r(p50)'
// Stats for subsets meeting logical condition: If GNP above/below median.
su lexp popgrowth if gnppc < `gnp_med'
su lexp popgrowth if gnppc >= `gnp_med'
. // Detailed stats
. su gnppc, detail
GNP per capita
-------------------------------------------------------------
Percentiles Smallest
1% 370 370
5% 410 380
10% 740 380 Obs 64
25% 1440 410 Sum of Wgt. 64
50% 3445 Mean 8797.125
Largest Std. Dev. 10595.19
75% 15300 29240
90% 25580 33040 Variance 1.12e+08
95% 29240 34310 Skewness 1.270107
99% 39980 39980 Kurtosis 3.322431
. // Save median GNP for later
. local gnp_med = `r(p50)'
. // Stats for subsets meeting logical condition: If GNP above/below median.
. su lexp popgrowth if gnppc < `gnp_med'
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
lexp | 31 68.96774 4.020646 54 75
popgrowth | 31 1.293548 1.045286 -.4 3
. su lexp popgrowth if gnppc >= `gnp_med'
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
lexp | 36 75.19444 3.178674 66 79
popgrowth | 36 .6666667 .7111359 -.5 2.8
It can be somewhat tedious to write out certain variables or sets of variables. Fortunately, Stata makes it fairly easy to reference those variables without writing them all out explicitly.
The basic idea of a wildcard variable is that you enter some fragment of a variable name and Stata will return all variables that match the fragment.
Consider the following hypothetical example:6
// Variables: emp29, emp30, emp31, rmpd, gdp, log_gdp
// Prefix - Get all variables matching the prefix
emp* => emp29, emp30, emp31
// Suffix - Get all variables matching the suffix
*gdp => gdp, log_gdp
// Prefix and Suffix - Get all variables where string occurs at any point
*mp* => emp29, emp30, emp31, rmpd
Let’s try it in the real data:
// Select variables - safewater and variables with "p" in name
su safewater *p*
. // Select variables - safewater and variables with "p" in name
. su safewater *p*
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
safewater | 39 76.05128 18.12232 28 100
popgrowth | 67 .9567164 .9295199 -.5 3
lexp | 67 72.31343 4.742487 54 79
gnppc | 64 8797.125 10595.19 370 39980
The same subsetting syntax works with the correlation command, corr. We can use the exact same syntax that we did with su to subset the data:7
corr pop* lexp gnppc safewater
// By region
bysort region: corr lexp popgrowth
// For subsets meeting logical condition
corr lexp popgrowth if lexp > 75
// include if (EITHER lexp > 75 OR lexp < 72)
corr lexp popgrowth if lexp > 75 | lexp < 72
// include if (lexp > 75 AND in region 1)
corr lexp popgrowth if lexp > 75 & region == 1
. corr pop* lexp gnppc safewater
(obs=36)
| popgro~h lexp gnppc safewa~r
-------------+------------------------------------
popgrowth | 1.0000
lexp | -0.4734 1.0000
gnppc | -0.4857 0.6941 1.0000
safewater | -0.3988 0.8199 0.7108 1.0000
. // By region
. bysort region: corr lexp popgrowth
------------------------------------------------------------------------------------------------------------------------
-> region = Eur & C.Asia
(obs=44)
| lexp popgro~h
-------------+------------------
lexp | 1.0000
popgrowth | -0.2949 1.0000
------------------------------------------------------------------------------------------------------------------------
-> region = N.A.
(obs=14)
| lexp popgro~h
-------------+------------------
lexp | 1.0000
popgrowth | -0.5925 1.0000
------------------------------------------------------------------------------------------------------------------------
-> region = S.A.
(obs=9)
| lexp popgro~h
-------------+------------------
lexp | 1.0000
popgrowth | -0.3893 1.0000
------------------------------------------------------------------------------------------------------------------------
-> region = .
no observations
. // For subsets meeting logical condition
. corr lexp popgrowth if lexp > 75
(obs=19)
| lexp popgro~h
-------------+------------------
lexp | 1.0000
popgrowth | -0.0028 1.0000
. // include if (EITHER lexp > 75 OR lexp < 72)
. corr lexp popgrowth if lexp > 75 | lexp < 72
(obs=46)
| lexp popgro~h
-------------+------------------
lexp | 1.0000
popgrowth | -0.4679 1.0000
. // include if (lexp > 75 AND in region 1)
. corr lexp popgrowth if lexp > 75 & region == 1
(obs=15)
| lexp popgro~h
-------------+------------------
lexp | 1.0000
popgrowth | 0.3008 1.0000
The same goes for the regression command, reg:
reg pop* lexp gnppc safewater, r
// By region
bysort region: reg lexp popgrowth, r
// For subsets meeting logical condition
reg lexp popgrowth if lexp > 75, r
// include if (EITHER lexp > 75 OR lexp < 72)
reg lexp popgrowth if lexp > 75 | lexp < 72, r
// include if (lexp > 75 AND in region 1)
reg lexp popgrowth if lexp > 75 & region == 1, r
. reg pop* lexp gnppc safewater, r
Linear regression Number of obs = 36
F( 3, 32) = 8.48
Prob > F = 0.0003
R-squared = 0.2753
Root MSE = .83165
------------------------------------------------------------------------------
| Robust
popgrowth | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lexp | -.0594837 .0485842 -1.22 0.230 -.1584465 .039479
gnppc | -.00003 .0000133 -2.26 0.031 -.000057 -2.97e-06
safewater | .0057573 .0190139 0.30 0.764 -.0329728 .0444874
_cons | 5.408651 2.288166 2.36 0.024 .7478094 10.06949
------------------------------------------------------------------------------
. // By region
. bysort region: reg lexp popgrowth, r
------------------------------------------------------------------------------------------------------------------------
-> region = Eur & C.Asia
Linear regression Number of obs = 44
F( 1, 42) = 8.59
Prob > F = 0.0054
R-squared = 0.0870
Root MSE = 4.0129
------------------------------------------------------------------------------
| Robust
lexp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
popgrowth | -1.705995 .5821101 -2.93 0.005 -2.880741 -.5312495
_cons | 73.96383 .7604829 97.26 0.000 72.42911 75.49855
------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------
-> region = N.A.
Linear regression Number of obs = 14
F( 1, 12) = 14.22
Prob > F = 0.0027
R-squared = 0.3511
Root MSE = 5.3756
------------------------------------------------------------------------------
| Robust
lexp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
popgrowth | -5.073004 1.345406 -3.77 0.003 -8.004392 -2.141615
_cons | 79.80216 1.881382 42.42 0.000 75.70298 83.90134
------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------
-> region = S.A.
Linear regression Number of obs = 9
F( 1, 7) = 3.41
Prob > F = 0.1072
R-squared = 0.1516
Root MSE = 3.9999
------------------------------------------------------------------------------
| Robust
lexp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
popgrowth | -2.420299 1.310152 -1.85 0.107 -5.518317 .677719
_cons | 74.98569 2.288981 32.76 0.000 69.57311 80.39827
------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------
-> region = .
no observations
. // For subsets meeting logical condition
. reg lexp popgrowth if lexp > 75, r
Linear regression Number of obs = 19
F( 1, 17) = 0.00
Prob > F = 0.9921
R-squared = 0.0000
Root MSE = 1.1016
------------------------------------------------------------------------------
| Robust
lexp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
popgrowth | -.0100066 .9906707 -0.01 0.992 -2.100139 2.080126
_cons | 77.58406 .4920928 157.66 0.000 76.54583 78.62228
------------------------------------------------------------------------------
. // include if (EITHER lexp > 75 OR lexp < 72)
. reg lexp popgrowth if lexp > 75 | lexp < 72, r
Linear regression Number of obs = 46
F( 1, 44) = 16.05
Prob > F = 0.0002
R-squared = 0.2189
Root MSE = 5.0265
------------------------------------------------------------------------------
| Robust
lexp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
popgrowth | -2.631737 .6569521 -4.01 0.000 -3.955737 -1.307737
_cons | 74.52762 1.080852 68.95 0.000 72.34931 76.70594
------------------------------------------------------------------------------
. // include if (lexp > 75 AND in region 1)
. reg lexp popgrowth if lexp > 75 & region == 1, r
Linear regression Number of obs = 15
F( 1, 13) = 1.54
Prob > F = 0.2366
R-squared = 0.0905
Root MSE = .95124
------------------------------------------------------------------------------
| Robust
lexp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
popgrowth | 1.809701 1.458352 1.24 0.237 -1.340876 4.960279
_cons | 77.03358 .6216602 123.92 0.000 75.69057 78.3766
------------------------------------------------------------------------------
Notice that I’ve added an option to include robust standard errors (see help reg for more info).
There are several countries in our data that have missing data entries. We need to think carefully about how we want to deal with this in our analysis as this will affect the interpretation of our results.
In this part I want to briefly discuss how Stata deals with missing data by default in the context of correlations. The default treatment of missing data for a given command may not be the best treatment of missing data for your project so you need to be aware of what it’s doing and how you can change the behavior.
How does corr deal with missing data? Consider the first four entries of the data:
li country pop* lexp gnppc safewater in 1/4
. li country pop* lexp gnppc safewater in 1/4
+----------------------------------------------------+
| country popgro~h lexp gnppc safewa~r |
|----------------------------------------------------|
1. | Macedonia FYR .3 73 1290 . |
2. | Latvia -.2 70 2420 . |
3. | Slovenia .2 75 9780 98 |
4. | France .5 78 24210 100 |
+----------------------------------------------------+
The last two countries have full data, so they are kept. The first two lack data on safewater, so corr drops them entirely. However, to compute a correlation coefficient between popgrowth and lexp we only need non-missing data for those two variables!
Does this make sense? Is Stata throwing good data away? Try to think of cases where it may or may not.
Suppose that we don't want Stata to treat missing data like this. Then we can use the pwcorr command, which by default computes each pairwise correlation using all observations that are non-missing for that pair:
// NOTICE: these are different!
corr pop* lexp gnppc safewater
pwcorr pop* lexp gnppc safewater, obs
// Restrict pwcorr to sample of countries with full data to get back corr
pwcorr pop* lexp gnppc safewater ///
if popgrowth != . & lexp != . & gnppc != . & safewater != . , obs
// pwcorr can also display p-values for the null hypothesis H0: CORR == 0
pwcorr lexp gnppc safewater, sig
. // NOTICE: these are different!
. corr pop* lexp gnppc safewater
(obs=36)
| popgro~h lexp gnppc safewa~r
-------------+------------------------------------
popgrowth | 1.0000
lexp | -0.4734 1.0000
gnppc | -0.4857 0.6941 1.0000
safewater | -0.3988 0.8199 0.7108 1.0000
. pwcorr pop* lexp gnppc safewater, obs
| popgro~h lexp gnppc safewa~r
-------------+------------------------------------
popgrowth | 1.0000
| 67
|
lexp | -0.4327 1.0000
| 67 67
|
gnppc | -0.3520 0.7170 1.0000
| 62 62 64
|
safewater | -0.4323 0.8315 0.7108 1.0000
| 39 39 36 39
|
. // Restrict pwcorr to sample of countries with full data to get back corr
. pwcorr pop* lexp gnppc safewater ///
> if popgrowth != . & lexp != . & gnppc != . & safewater != . , obs
| popgro~h lexp gnppc safewa~r
-------------+------------------------------------
popgrowth | 1.0000
| 36
|
lexp | -0.4734 1.0000
| 36 36
|
gnppc | -0.4857 0.6941 1.0000
| 36 36 36
|
safewater | -0.3988 0.8199 0.7108 1.0000
| 36 36 36 36
|
. // pwcorr can also display p-values for the null hypothesis H0: CORR == 0
. pwcorr lexp gnppc safewater, sig
| lexp gnppc safewa~r
-------------+---------------------------
lexp | 1.0000
|
|
gnppc | 0.7170 1.0000
| 0.0000
|
safewater | 0.8315 0.7108 1.0000
| 0.0000 0.0000
|
Suppose we want to include a dummy variable in a regression. How do we do this? I’m going to give a sketch here.
First, in order to use dummy variables we need to tell Stata that a variable should be interpreted as a factor/dummy/categorical variable.
If you browse the data you’ll notice that country and region have different font colors even though they both seem to be strings (i.e. words/characters rather than numbers). Look closely at the top bar of the browse window. When you click a cell the data will appear in the bar. But now notice that if we select a cell in the region column a number comes up. Under the hood region is in fact an integer variable, not a string; it just displays as a string. This integer is simply an index for the region. It has no intrinsic meaning.
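One way to see this for yourself is to inspect the value label attached to region and the raw integer codes behind it; a minimal sketch:
// Show the integer-to-text mapping stored in the value label
label list region
// Tabulate the raw integer codes instead of their labels
tab region, nolabel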
xi: reg popgrowth lexp gnppc i.region, r
reg popgrowth lexp gnppc region, r
. xi: reg popgrowth lexp gnppc i.region, r
i.region _Iregion_1-3 (naturally coded; _Iregion_1 omitted)
Linear regression Number of obs = 62
F( 4, 57) = 30.10
Prob > F = 0.0000
R-squared = 0.5642
Root MSE = .62581
------------------------------------------------------------------------------
| Robust
popgrowth | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lexp | -.0396824 .0193411 -2.05 0.045 -.0784123 -.0009525
gnppc | -2.85e-06 8.03e-06 -0.35 0.724 -.0000189 .0000132
_Iregion_2 | 1.211036 .2173555 5.57 0.000 .7757895 1.646283
_Iregion_3 | 1.290804 .2471327 5.22 0.000 .7959289 1.785678
_cons | 3.433166 1.403248 2.45 0.018 .62321 6.243122
------------------------------------------------------------------------------
. reg popgrowth lexp gnppc region, r
Linear regression Number of obs = 62
F( 3, 58) = 32.30
Prob > F = 0.0000
R-squared = 0.5117
Root MSE = .65674
------------------------------------------------------------------------------
| Robust
popgrowth | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lexp | -.0463142 .0183285 -2.53 0.014 -.0830026 -.0096257
gnppc | -1.84e-06 7.24e-06 -0.25 0.801 -.0000163 .0000127
region | .7423806 .1270336 5.84 0.000 .4880952 .9966659
_cons | 3.223895 1.398233 2.31 0.025 .4250281 6.022762
------------------------------------------------------------------------------
The first regression creates several new variables. Figure out what they’re doing. How does the interpretation of the region variables differ across specifications?
I will say that the first makes sense while the second is completely nonsensical. The reason has to do with the fact that the integer indices of the regions are arbitrary.
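As an aside, in Stata 11 and later you can get the same set of region dummies without the xi: prefix by using factor-variable notation directly in the regression; a sketch:
// i.region expands into region dummies on the fly, with the base category omitted
reg popgrowth lexp gnppc i.region, r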
There is a lot that can be done with graphs in Stata. The basic command is twoway. See the help file.
scatter lexp gnppc
. scatter lexp gnppc
To make anything more elaborate we need to delve into twoway. I will do this presently using some new data from the Penn World Tables (PWT).
My goal is to make a scatterplot that will help me to evaluate Solow’s “Strong Convergence” Hypothesis: countries that are initially poor will have greater subsequent growth.8
I first need to set up the data. In doing so I introduce some new commands:
- import excel is one way to get Excel or other non-.dta files into Stata.
- do calls and executes another do-file.
- xtset sets up a panel.
- encode makes a string variable a factor variable like region.
- egen provides many ways of combining data to create a new variable.
- reshape pivots/reshapes/transposes data along selected dimensions.

// Bring in PWT
import excel "data/raw/pwt90.xlsx", sheet("Data") firstrow clear
// Run the labels
do "do/pwt_label"
// Add in continent codes for later - many to one merge
preserve
import delimited "data/raw/country_continents.csv", clear
drop countrycode
rename alpha3 countrycode
save "data/int/country_continents", replace
restore
merge m:1 countrycode using "data/int/country_continents"
keep if _merge == 3
drop _merge
// Set up panel
// Need the country variable to be a factor variable
encode country, gen(country_fac)
// Displays variable in terms of underlying label/string
tab country_fac
// But under the hood it is actually a number! Uncomment and run (throws error)
/*
count if country_fac == "Turkmenistan"
count if country == "Turkmenistan"
*/
// Now set panel
xtset country_fac year
. // Bring in PWT
. import excel "data/raw/pwt90.xlsx", sheet("Data") firstrow clear
. // Run the labels
. do "do/pwt_label"
. /*
> Variable Labels for Penn World Tables - http://www.rug.nl/ggdc/productivity/pwt/
> */
.
. label var country "Country name"
. label var currency_unit "Currency unit"
. label var year "Year"
. label var rgdpe "Expenditure-side real GDP at chained PPPs (in mil. 2011US$)"
. label var rgdpo "Output-side real GDP at chained PPPs (in mil. 2011US$)"
. label var pop "Population (in millions)"
. label var emp "Number of persons engaged (in millions)"
. label var avh "Average annual hours worked by persons engaged"
. label var hc "Human capital index, based on years of schooling and returns to education; see Human capital in PWT9.
> "
note: label truncated to 80 characters
. label var ccon "Real consumption of households and government, at current PPPs (in mil. 2011US$)"
. label var cda "Real domestic absorption, (real consumption plus investment), at current PPPs (in mil. 2011US$)"
note: label truncated to 80 characters
. label var cgdpe "Expenditure-side real GDP at current PPPs (in mil. 2011US$)"
. label var cgdpo "Output-side real GDP at current PPPs (in mil. 2011US$)"
. label var ck "Capital stock at current PPPs (in mil. 2011US$)"
. label var ctfp "TFP level at current PPPs (USA=1)"
. label var cwtfp "Welfare-relevant TFP levels at current PPPs (USA=1)"
.
. label var rgdpna "Real GDP at constant 2011 national prices (in mil. 2011US$)"
. label var rconna "Real consumption at constant 2011 national prices (in mil. 2011US$)"
. label var rdana "Real domestic absorption at constant 2011 national prices (in mil. 2011US$)"
. label var rkna "Capital stock at constant 2011 national prices (in mil. 2011US$)"
. label var rtfpna "TFP at constant national prices (2011=1)"
. label var rwtfpna "Welfare-relevant TFP at constant national prices (2011=1)"
. label var labsh "Share of labour compensation in GDP at current national prices"
. label var delta "Average depreciation rate of the capital stock"
.
. label var xr "Exchange rate, national currency/USD (market+estimated)"
. label var pl_con "Price level of CCON (PPP/XR), price level of USA GDPo in 2011=1"
. label var pl_da "Price level of CDA (PPP/XR), price level of USA GDPo in 2011=1"
. label var pl_gdpo "Price level of CGDPo (PPP/XR), price level of USA GDPo in 2011=1"
.
. label var i_cig "0/1/2: relative price data for consumption, investment and government is extrapolated (0), benchmark
> (1) or interpolated (2)"
note: label truncated to 80 characters
. label var i_xm "0/1/2: relative price data for exports and imports is extrapolated (0), benchmark (1) or interpolated
> (2)"
note: label truncated to 80 characters
. label var i_xr "0/1: the exchange rate is market-based (0) or estimated (1)"
. label var i_outlier "0/1: the observation on pl_gdpe or pl_gdpo is not an outlier (0) or an outlier (1)"
note: label truncated to 80 characters
. label var cor_exp "Correlation between expenditure shares of the country and the US (benchmark observations only
> )"
note: label truncated to 80 characters
. label var statcap "Statistical capacity indicator (source: World Bank, developing countries only)"
.
. label var csh_c "Share of household consumption at current PPPs"
. label var csh_i "Share of gross capital formation at current PPPs"
. label var csh_g "Share of government consumption at current PPPs"
. label var csh_x "Share of merchandise exports at current PPPs"
. label var csh_m "Share of merchandise imports at current PPPs"
. label var csh_r "Share of residual trade and GDP statistical discrepancy at current PPPs"
.
. label var pl_c "Price level of household consumption, price level of USA GDPo in 2011=1"
. label var pl_i "Price level of capital formation, price level of USA GDPo in 2011=1"
. label var pl_g "Price level of government consumption, price level of USA GDPo in 2011=1"
. label var pl_x "Price level of exports, price level of USA GDPo in 2011=1"
. label var pl_m "Price level of imports, price level of USA GDPo in 2011=1"
. label var pl_k "Price level of the capital stock, price level of USA in 2011=1"
.
. // Add in continent codes for later - many to one merge
. preserve
. import delimited "data/raw/country_continents.csv", clear
(9 vars, 249 obs)
. drop countrycode
. rename alpha3 countrycode
. save "data/int/country_continents", replace
file data/int/country_continents.dta saved
. restore
. merge m:1 countrycode using "data/int/country_continents"
Result # of obs.
-----------------------------------------
not matched 67
from master 0 (_merge==1)
from using 67 (_merge==2)
matched 11,830 (_merge==3)
-----------------------------------------
. keep if _merge == 3
(67 observations deleted)
. drop _merge
.
. // Set up panel
. // Need the country variable to be a factor variable
. encode country, gen(country_fac)
. // Displays variable in terms of underlying label/string
. tab country_fac
Country name | Freq. Percent Cum.
-----------------------------------+-----------------------------------
Albania | 65 0.55 0.55
Algeria | 65 0.55 1.10
Angola | 65 0.55 1.65
Anguilla | 65 0.55 2.20
Antigua and Barbuda | 65 0.55 2.75
Argentina | 65 0.55 3.30
Armenia | 65 0.55 3.85
Aruba | 65 0.55 4.40
Australia | 65 0.55 4.95
Austria | 65 0.55 5.49
Azerbaijan | 65 0.55 6.04
Bahamas | 65 0.55 6.59
Bahrain | 65 0.55 7.14
Bangladesh | 65 0.55 7.69
Barbados | 65 0.55 8.24
Belarus | 65 0.55 8.79
Belgium | 65 0.55 9.34
Belize | 65 0.55 9.89
Benin | 65 0.55 10.44
Bermuda | 65 0.55 10.99
Bhutan | 65 0.55 11.54
Bolivia (Plurinational State of) | 65 0.55 12.09
Bosnia and Herzegovina | 65 0.55 12.64
Botswana | 65 0.55 13.19
Brazil | 65 0.55 13.74
British Virgin Islands | 65 0.55 14.29
Brunei Darussalam | 65 0.55 14.84
Bulgaria | 65 0.55 15.38
Burkina Faso | 65 0.55 15.93
Burundi | 65 0.55 16.48
Cabo Verde | 65 0.55 17.03
Cambodia | 65 0.55 17.58
Cameroon | 65 0.55 18.13
Canada | 65 0.55 18.68
Cayman Islands | 65 0.55 19.23
Central African Republic | 65 0.55 19.78
Chad | 65 0.55 20.33
Chile | 65 0.55 20.88
China | 65 0.55 21.43
China, Hong Kong SAR | 65 0.55 21.98
China, Macao SAR | 65 0.55 22.53
Colombia | 65 0.55 23.08
Comoros | 65 0.55 23.63
Congo | 65 0.55 24.18
Costa Rica | 65 0.55 24.73
Croatia | 65 0.55 25.27
Curaçao | 65 0.55 25.82
Cyprus | 65 0.55 26.37
Czech Republic | 65 0.55 26.92
Côte d'Ivoire | 65 0.55 27.47
D.R. of the Congo | 65 0.55 28.02
Denmark | 65 0.55 28.57
Djibouti | 65 0.55 29.12
Dominica | 65 0.55 29.67
Dominican Republic | 65 0.55 30.22
Ecuador | 65 0.55 30.77
Egypt | 65 0.55 31.32
El Salvador | 65 0.55 31.87
Equatorial Guinea | 65 0.55 32.42
Estonia | 65 0.55 32.97
Ethiopia | 65 0.55 33.52
Fiji | 65 0.55 34.07
Finland | 65 0.55 34.62
France | 65 0.55 35.16
Gabon | 65 0.55 35.71
Gambia | 65 0.55 36.26
Georgia | 65 0.55 36.81
Germany | 65 0.55 37.36
Ghana | 65 0.55 37.91
Greece | 65 0.55 38.46
Grenada | 65 0.55 39.01
Guatemala | 65 0.55 39.56
Guinea | 65 0.55 40.11
Guinea-Bissau | 65 0.55 40.66
Haiti | 65 0.55 41.21
Honduras | 65 0.55 41.76
Hungary | 65 0.55 42.31
Iceland | 65 0.55 42.86
India | 65 0.55 43.41
Indonesia | 65 0.55 43.96
Iran (Islamic Republic of) | 65 0.55 44.51
Iraq | 65 0.55 45.05
Ireland | 65 0.55 45.60
Israel | 65 0.55 46.15
Italy | 65 0.55 46.70
Jamaica | 65 0.55 47.25
Japan | 65 0.55 47.80
Jordan | 65 0.55 48.35
Kazakhstan | 65 0.55 48.90
Kenya | 65 0.55 49.45
Kuwait | 65 0.55 50.00
Kyrgyzstan | 65 0.55 50.55
Lao People's DR | 65 0.55 51.10
Latvia | 65 0.55 51.65
Lebanon | 65 0.55 52.20
Lesotho | 65 0.55 52.75
Liberia | 65 0.55 53.30
Lithuania | 65 0.55 53.85
Luxembourg | 65 0.55 54.40
Madagascar | 65 0.55 54.95
Malawi | 65 0.55 55.49
Malaysia | 65 0.55 56.04
Maldives | 65 0.55 56.59
Mali | 65 0.55 57.14
Malta | 65 0.55 57.69
Mauritania | 65 0.55 58.24
Mauritius | 65 0.55 58.79
Mexico | 65 0.55 59.34
Mongolia | 65 0.55 59.89
Montenegro | 65 0.55 60.44
Montserrat | 65 0.55 60.99
Morocco | 65 0.55 61.54
Mozambique | 65 0.55 62.09
Myanmar | 65 0.55 62.64
Namibia | 65 0.55 63.19
Nepal | 65 0.55 63.74
Netherlands | 65 0.55 64.29
New Zealand | 65 0.55 64.84
Nicaragua | 65 0.55 65.38
Niger | 65 0.55 65.93
Nigeria | 65 0.55 66.48
Norway | 65 0.55 67.03
Oman | 65 0.55 67.58
Pakistan | 65 0.55 68.13
Panama | 65 0.55 68.68
Paraguay | 65 0.55 69.23
Peru | 65 0.55 69.78
Philippines | 65 0.55 70.33
Poland | 65 0.55 70.88
Portugal | 65 0.55 71.43
Qatar | 65 0.55 71.98
Republic of Korea | 65 0.55 72.53
Republic of Moldova | 65 0.55 73.08
Romania | 65 0.55 73.63
Russian Federation | 65 0.55 74.18
Rwanda | 65 0.55 74.73
Saint Kitts and Nevis | 65 0.55 75.27
Saint Lucia | 65 0.55 75.82
Sao Tome and Principe | 65 0.55 76.37
Saudi Arabia | 65 0.55 76.92
Senegal | 65 0.55 77.47
Serbia | 65 0.55 78.02
Seychelles | 65 0.55 78.57
Sierra Leone | 65 0.55 79.12
Singapore | 65 0.55 79.67
Sint Maarten (Dutch part) | 65 0.55 80.22
Slovakia | 65 0.55 80.77
Slovenia | 65 0.55 81.32
South Africa | 65 0.55 81.87
Spain | 65 0.55 82.42
Sri Lanka | 65 0.55 82.97
St. Vincent and the Grenadines | 65 0.55 83.52
State of Palestine | 65 0.55 84.07
Sudan (Former) | 65 0.55 84.62
Suriname | 65 0.55 85.16
Swaziland | 65 0.55 85.71
Sweden | 65 0.55 86.26
Switzerland | 65 0.55 86.81
Syrian Arab Republic | 65 0.55 87.36
TFYR of Macedonia | 65 0.55 87.91
Taiwan | 65 0.55 88.46
Tajikistan | 65 0.55 89.01
Thailand | 65 0.55 89.56
Togo | 65 0.55 90.11
Trinidad and Tobago | 65 0.55 90.66
Tunisia | 65 0.55 91.21
Turkey | 65 0.55 91.76
Turkmenistan | 65 0.55 92.31
Turks and Caicos Islands | 65 0.55 92.86
U.R. of Tanzania: Mainland | 65 0.55 93.41
Uganda | 65 0.55 93.96
Ukraine | 65 0.55 94.51
United Arab Emirates | 65 0.55 95.05
United Kingdom | 65 0.55 95.60
United States | 65 0.55 96.15
Uruguay | 65 0.55 96.70
Uzbekistan | 65 0.55 97.25
Venezuela (Bolivarian Republic of) | 65 0.55 97.80
Viet Nam | 65 0.55 98.35
Yemen | 65 0.55 98.90
Zambia | 65 0.55 99.45
Zimbabwe | 65 0.55 100.00
-----------------------------------+-----------------------------------
Total | 11,830 100.00
. // But under the hood it is actually a number! Uncomment and run (throws error)
. /*
> count if country_fac == "Turkmenistan"
> count if country == "Turkmenistan"
> */
.
. // Now set panel
. xtset country_fac year
panel variable: country_fac (strongly balanced)
time variable: year, 1950 to 2014
delta: 1 unit
.
end of do-file
// Make average growth rates
// For purposes of comparison, let's drop all years < 1970
keep if year >= 1970
// For ease of analysis, let's just keep those variables that we'll be needing
keep year country* rgdpe pop *region*
// Make log income per capita - Notice that rgdpe and pop both in millions
gen log_gdp_pc = log(rgdpe) - log(pop)
label var log_gdp_pc "Log GDP per capita - log(rgdpe / pop)"
// Now make a growth rate using the lag operator
gen d_gdp_pc = D.log_gdp_pc
label var d_gdp_pc "Growth GDP per capita - D.log_gdp_pc"
// Make mean growth rates using egen and bysort
bysort country: egen mean_d_gdp_pc = mean(d_gdp_pc)
// Make a percent by multiplying by 100
replace mean_d_gdp_pc = mean_d_gdp_pc * 100
label var mean_d_gdp_pc "Mean Growth - GDP per capita (%), 1970-2014"
// Should do spot check to make sure correct
order country mean_d_gdp_pc
//browse
// How do we get the first year's GDP?
// reshape! long -> wide
// What do we need?
keep log_gdp_pc mean_d_gdp_pc country* year *region*
reshape wide log_gdp_pc, i(country) j(year)
// Really only need to keep the log_gdp_pc from 1970 - Make a new variable
gen init_log_gdp_pc = log_gdp_pc1970
label var init_log_gdp_pc "Log GDP per capita, 1970"
// Drop all others
drop log_gdp_pc*
. // Make average growth rates
. // For purposes of comparison, let's drop all years < 1970
. keep if year >= 1970
(3640 observations deleted)
. // For ease of analysis, let's just keep those variables that we'll be needing
. keep year country* rgdpe pop *region*
. // Make log income per capita - Notice that rgdpe and pop both in millions
. gen log_gdp_pc = log(rgdpe) - log(pop)
(549 missing values generated)
. label var log_gdp_pc "Log GDP per capita - log(rgdpe / pop)"
. // Now make a growth rate using the lag operator
. gen d_gdp_pc = D.log_gdp_pc
(731 missing values generated)
. label var d_gdp_pc "Growth GDP per capita - D.log_gdp_pc"
. // Make mean growth rates using egen and bysort
. bysort country: egen mean_d_gdp_pc = mean(d_gdp_pc)
. // Make a percent by multiplying by 100
. replace mean_d_gdp_pc = mean_d_gdp_pc * 100
(8190 real changes made)
. label var mean_d_gdp_pc "Mean Growth - GDP per capita (%), 1970-2014"
. // Should do spot check to make sure correct
. order country mean_d_gdp_pc
. //browse
.
. // How do we get the first year's GDP?
. // reshape! long -> wide
. // What do we need?
. keep log_gdp_pc mean_d_gdp_pc country* year *region*
. reshape wide log_gdp_pc, i(country) j(year)
(note: j = 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991
> 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014)
Data long -> wide
-----------------------------------------------------------------------------
Number of obs. 8190 -> 182
Number of variables 10 -> 53
j variable (45 values) year -> (dropped)
xij variables:
log_gdp_pc -> log_gdp_pc1970 log_gdp_pc1971 ... log_gdp_pc2014
-----------------------------------------------------------------------------
. // Really only need to keep the log_gdp_pc from 1970 - Make a new variable
. gen init_log_gdp_pc = log_gdp_pc1970
(26 missing values generated)
. label var init_log_gdp_pc "Log GDP per capita, 1970"
. // Drop all others
. drop log_gdp_pc*
scatter mean_d_gdp_pc init_log_gdp_pc
. scatter mean_d_gdp_pc init_log_gdp_pc
It kind of looks like there’s a weak negative correlation between the variables. Indeed, if we run an OLS regression we find that the correlation is significant at the 10% level. In fact, a long line of literature has found just the opposite–there is no close correlation at all. What gives?!
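The regression itself is not in the do-file; a minimal sketch of what it might look like, assuming robust standard errors as elsewhere in this review:
// Cross-country convergence regression (illustrative)
reg mean_d_gdp_pc init_log_gdp_pc, r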
If we look at our figure we see that there are a few high-income countries that seem to be outliers.9
Let's try to think of a way to deal with those data points. First, who are they? Let's replace the circles with country abbreviations:
twoway (scatter mean_d_gdp_pc init_log_gdp_pc if init_log_gdp_pc <= log(30000), ///
mlabel(countrycode) mlabp(0) m(i) mlabs(1.65)) || ///
(scatter mean_d_gdp_pc init_log_gdp_pc if init_log_gdp_pc > log(30000), ///
mlabel(countrycode) mlabp(0) m(i) mlabs(1.65) ///
legend(off) ///
title("Initial GDP and Subsequent Growth, 1970-2014") ///
)
. twoway (scatter mean_d_gdp_pc init_log_gdp_pc if init_log_gdp_pc <= log(30000), ///
> mlabel(countrycode) mlabp(0) m(i) mlabs(1.65)) || ///
> (scatter mean_d_gdp_pc init_log_gdp_pc if init_log_gdp_pc > log(30000), ///
> mlabel(countrycode) mlabp(0) m(i) mlabs(1.65) ///
> legend(off) ///
> title("Initial GDP and Subsequent Growth, 1970-2014") ///
> )
Aha, they're all countries with a bunch of oil! Long story short, this is a lurking variable in the analysis. For the purpose of making our graph, then, we'll get rid of them.
Consider the modified picture:
twoway scatter mean_d_gdp_pc init_log_gdp_pc if init_log_gdp_pc <= log(30000), ///
title("Initial GDP and Subsequent Growth, 1970-2014") ///
mlabel(countrycode) mlabp(0) m(i) mlabs(1.45)
. twoway scatter mean_d_gdp_pc init_log_gdp_pc if init_log_gdp_pc <= log(30000), ///
> title("Initial GDP and Subsequent Growth, 1970-2014") ///
> mlabel(countrycode) mlabp(0) m(i) mlabs(1.45)
Notice how, having removed the outliers, the apparent correlation falls apart. Part of this is just mechanical. But I think an important part of it has to do with how we perceive figures; make a lot of graphs to be sure that your eyes aren't playing tricks on you.
What if we wanted to see how the correlation holds up across continents? Here are two possibilities:
// What about separate plots by continent?
twoway scatter mean_d_gdp_pc init_log_gdp_pc, ///
mlabel(countrycode) mlabp(0) m(i) mlabs(1.45) ///
by(region)
// One plot with color by subregion?
separate mean_d_gdp_pc, by(region)
twoway scatter mean_d_gdp_pc* init_log_gdp_pc, ///
title("Initial GDP and Subsequent Growth, 1970-2014") ///
legend(off) m(i) ///
ytitle("Mean Growth - GDP per capita (%), 1970-2014")
. // What about separate plots by continent?
. twoway scatter mean_d_gdp_pc init_log_gdp_pc, ///
> mlabel(countrycode) mlabp(0) m(i) mlabs(1.45) ///
> by(region)
. // One plot with color by subregion?
. separate mean_d_gdp_pc, by(region)
storage display value
variable name type format label variable label
------------------------------------------------------------------------------------------------------------------------
mean_d_gdp_pc1 float %9.0g mean_d_gdp_pc, region == Africa
mean_d_gdp_pc2 float %9.0g mean_d_gdp_pc, region == Americas
mean_d_gdp_pc3 float %9.0g mean_d_gdp_pc, region == Asia
mean_d_gdp_pc4 float %9.0g mean_d_gdp_pc, region == Europe
mean_d_gdp_pc5 float %9.0g mean_d_gdp_pc, region == Oceania
. twoway scatter mean_d_gdp_pc* init_log_gdp_pc, ///
> title("Initial GDP and Subsequent Growth, 1970-2014") ///
> legend(off) m(i) ///
> ytitle("Mean Growth - GDP per capita (%), 1970-2014")
1. Instructor: Cynthia Kinnan, TA: Michael Wooley
2. The alternative (and default) option is an .smcl file. These files only appear to be well-formatted when opened in Stata. I don't know of any obvious advantages to this file type.
3. In particular, if we did this our do-file couldn't be run more than once without error: the files that it is manipulating on the second run wouldn't be the same as those that we manipulated on the first.
4. 1:1 can be compared to a many-to-one (m:1) merge, which says that there may be several rows in the master data that correspond to the same row in the using data. For example, suppose that we were going to merge GNP by region to the data. There are many countries for each region, which would lead to a m:1 merge. A 1:m merge is the mirror inverse of the m:1 merge; e.g. let the GNP-by-region data be the master and the life expectancy data be the using data. We can also do a m:m merge. Notice that this case is especially dicey because we're actually potentially introducing new observations into the data.
5. Does all of this macro business seem like overkill? Why shouldn't we just copy and paste the one number back into the do editor? Well, suppose that you realized that you messed up one of the steps in the previous section. Then the number that you copied and pasted here will probably no longer be correct. Now we could go back and update the median in the do-file. However, this seems pretty cumbersome and I, for one, do not trust myself to remember that this even needs to be fixed.
6. If you're familiar with regular expressions you may be wondering if you can use those to select variable names. You can't do this without complicating things a bit more. See help regex for more info.
7. In fact, I've simply copied and pasted the code chunk from above and replaced su with corr.
8. See chapter 3 of Ray for more details.
9. Also, if we compare our figure to Ray's Fig. 4.5 from his updated chapter 3 (here) we see that he cuts off the per capita income at $30,000.
Comments
You’re going to want to comment your code so that it is clear to future-you and others what exactly it is that you’re doing.
We can comment in a lot of ways:
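A short sketch of the comment styles Stata accepts (the display commands are just placeholders):
* A whole line starting with an asterisk is a comment
// A double slash also starts a comment
display "hello"   // and can follow a command on the same line
/*
A block comment can
span several lines
*/
display ///
    "hello again"   // /// continues a command onto the next line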