Data Background Info

Data describes long-run economic growth. Data can be downloaded on blackboard:

  • pwt100.xlsx
  • TaxRates.xlsx
  • CountryCodes.txt

Our main data source will be the Penn World Table (PWT), a cross-country data set covering real GDP, population, and other macroeconomic variables. The PWT is built from two distinct data sources: National income data from each country’s national statistical agency. Systematic data comparing prices across countries, constructed by the World Bank’s International Comparison Program (ICP). The ICP data is needed to account for a simple economic reality: each country’s GDP is calculated using local prices, but prices of key goods and services vary dramatically across countries: housing is much more expensive in Vancouver than in Houston, and a haircut is much cheaper in Mumbai than in London. The PWT research team use the results of the ICP to convert each country’s GDP data to comparable (PPP) units. The current version of the PWT is available online at http://www.ggdc.net/pwt. Source and original example

We also have a secondary data source with cross-country data on top marginal tax rates. Most countries have progressive tax systems - that means that residents with high income pay a higher tax rate than residents with low income - but these higher rates only apply to marginal income. For example if the marginal tax rate is 30% on taxable income below $100,000 and 40% on taxable income above $100,000, a taxpayer with $150,000 in taxable income would pay 30% on the first $100,000 and 40% on the remaining $50,000 for a total tax bill of $50,000, or an average tax rate of 33%. The data on the top marginal tax rate is obtained from the Tax Foundation7, a US-based policy and research organization. This data set comes from Table 1 in the report “Taxing High Incomes: A Comparison of 41 Countries,” available online at https://taxfoundation.org/taxing-high-income-2019/.

Finally, we have a “crosswalk” data file to help with linking the GDP and tax data (more about this below). Download the required file CountryCodes.txt. This file may display in your browser rather than downloading like the other files. If that happens, you can right-click on the link above and select “Save link as” to save the file on your computer.

read_excel() has additional arguments that can be used to specify what you want R to do:

Usage read_excel(path, sheet = 1, col_names = TRUE, col_types = NULL, na = "", skip = 0)

Arguments

  • path: Path to the xls/xlsx file
  • sheet: Sheet to read. Either a string (the name of a sheet), or an integer (the position of the sheet). Defaults to the first sheet.
  • col_names: Either TRUE to use the first row as column names, FALSE to number columns sequentially from X1 to Xn, or a character vector giving a name for each column.
  • col_types: Either NULL to guess from the spreadsheet or a character vector containing “blank”, “numeric”, “date” or “text”.
  • na: Missing value. By default readxl converts blank cells to missing data. Set this value if you have used a sentinel value for missing values.
  • skip: Number of rows to skip before reading any data.

You could find this out by either googling “read_excel in R” and it would be one of the first results OR you can do ?read_excel in your code (console or script/markdown work) to pull up the Help tab with more information!

library(readxl)
taxes <- read_excel("TaxRates.xlsx", sheet = 1)
taxes
## # A tibble: 41 x 9
##    Country        IncomeCentral IncomeLocalRegi~ IncomeSurtaxes EmployeeSocialC~
##    <chr>                  <dbl>            <dbl>          <dbl>            <dbl>
##  1 Australia               45               NA            NA                NA  
##  2 Austria                 55               NA            NA                NA  
##  3 Belgium                 50                3.5          NA                13.1
##  4 Bulgaria                10               NA            NA                NA  
##  5 Canada                  33               17.5          NA                NA  
##  6 Chile                   35.5             NA            NA                NA  
##  7 Croatia                 36               NA             6.48             NA  
##  8 Cyprus                  35               NA            NA                NA  
##  9 Czech Republic          15               NA             7                NA  
## 10 Denmark                 52.0             NA            NA                 8  
## # ... with 31 more rows, and 4 more variables:
## #   EmployeeSocialContributionsNonDeductible <dbl>, Payroll tax <dbl>,
## #   Consumption tax <dbl>, Effective marginal tax rate <dbl>
countrynames <- read_csv("CountryCodes.txt")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   `CountryName                        CountryCode` = col_character()
## )
## Warning: 2 parsing failures.
## row col  expected    actual               file
##  40  -- 1 columns 2 columns 'CountryCodes.txt'
##  41  -- 1 columns 2 columns 'CountryCodes.txt'
countrynames
## # A tibble: 184 x 1
##    `CountryName                        CountryCode`
##    <chr>                                           
##  1 Albania                            ALB          
##  2 Algeria                            DZA          
##  3 Angola                             AGO          
##  4 Anguilla                           AIA          
##  5 Antigua and Barbuda                ATG          
##  6 Argentina                          ARG          
##  7 Armenia                            ARM          
##  8 Aruba                              ABW          
##  9 Australia                          AUS          
## 10 Austria                            AUT          
## # ... with 174 more rows

Awe, this one is messed up. What should we do? What kind of file was this originally?

I googled “fixed width files in R” and found my answer right away.
- Note: I tried read_fwf() first, then tried read_table(). Success doesn’t always happen right away.

countrynames <- read_table("CountryCodes.txt")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   CountryName = col_character(),
##   CountryCode = col_character()
## )
countrynames # Tibble, 184 observations, 2 variables
## # A tibble: 184 x 2
##    CountryName         CountryCode
##    <chr>               <chr>      
##  1 Albania             ALB        
##  2 Algeria             DZA        
##  3 Angola              AGO        
##  4 Anguilla            AIA        
##  5 Antigua and Barbuda ATG        
##  6 Argentina           ARG        
##  7 Armenia             ARM        
##  8 Aruba               ABW        
##  9 Australia           AUS        
## 10 Austria             AUT        
## # ... with 174 more rows
pwt100 <- read_excel("pwt100.xlsx")
pwt100
## # A tibble: 9 x 1
##   `Penn World Table, version 10.0`                                              
##   <chr>                                                                         
## 1  <NA>                                                                         
## 2 "This file contains the data of PWT 10.0, as available on www.ggdc.net/pwt"   
## 3 "Please refer to www.ggdc.net/pwt for extensive documentation of the differen~
## 4  <NA>                                                                         
## 5 "When using these data, please refer to the following paper:"                 
## 6 "Feenstra, Robert C., Robert Inklaar and Marcel P. Timmer (2015), \"The Next ~
## 7  <NA>                                                                         
## 8 "Note"                                                                        
## 9 "On January 28, 2021 we updated PWT 10.0 due to [1] an error in the current v~

Hmmm… This has a lot of extra information in it that I don’t want. After opening the excel file (in Excel) I realized I wanted the information from the 3rd sheet.

pwt100 <- read_excel("pwt100.xlsx", sheet = 3)
pwt100 # success. Tibble with 12,810 rows and 52 columns
## # A tibble: 12,810 x 52
##    countrycode country currency_unit   year rgdpe rgdpo   pop   emp   avh    hc
##    <chr>       <chr>   <chr>          <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 ABW         Aruba   Aruban Guilder  1950    NA    NA    NA    NA    NA    NA
##  2 ABW         Aruba   Aruban Guilder  1951    NA    NA    NA    NA    NA    NA
##  3 ABW         Aruba   Aruban Guilder  1952    NA    NA    NA    NA    NA    NA
##  4 ABW         Aruba   Aruban Guilder  1953    NA    NA    NA    NA    NA    NA
##  5 ABW         Aruba   Aruban Guilder  1954    NA    NA    NA    NA    NA    NA
##  6 ABW         Aruba   Aruban Guilder  1955    NA    NA    NA    NA    NA    NA
##  7 ABW         Aruba   Aruban Guilder  1956    NA    NA    NA    NA    NA    NA
##  8 ABW         Aruba   Aruban Guilder  1957    NA    NA    NA    NA    NA    NA
##  9 ABW         Aruba   Aruban Guilder  1958    NA    NA    NA    NA    NA    NA
## 10 ABW         Aruba   Aruban Guilder  1959    NA    NA    NA    NA    NA    NA
## # ... with 12,800 more rows, and 42 more variables: ccon <dbl>, cda <dbl>,
## #   cgdpe <dbl>, cgdpo <dbl>, cn <dbl>, ck <dbl>, ctfp <dbl>, cwtfp <dbl>,
## #   rgdpna <dbl>, rconna <dbl>, rdana <dbl>, rnna <dbl>, rkna <dbl>,
## #   rtfpna <dbl>, rwtfpna <dbl>, labsh <dbl>, irr <dbl>, delta <dbl>, xr <dbl>,
## #   pl_con <dbl>, pl_da <dbl>, pl_gdpo <dbl>, i_cig <chr>, i_xm <chr>,
## #   i_xr <chr>, i_outlier <chr>, i_irr <chr>, cor_exp <dbl>, statcap <dbl>,
## #   csh_c <dbl>, csh_i <dbl>, csh_g <dbl>, csh_x <dbl>, csh_m <dbl>, ...

Combining data

  1. We want to combine multiple data files into a single dataframe/tibble
  2. We want to join observations in one tibble with observations in another tibble
  3. We will want to construct new variables that are group-level aggregates (counts, averages, or sums) of existing variables

Adding Country Codes to the Tax Rate data

Combine CountryCodes, taxes, and pwt100 into one dataframe (or tibble).

  • pwt100 identifies a country the the country name and three-letter ISO country code
  • taxes only has country name

We could try to match using the country names but they aren’t always exactly the same. For example, “South Korea” in the taxrates data compared to “Republic of Korea” in pwt100.

Let’s give our dataframe taxes the standard three-letter ISO country codes. Do this by “joining”, the CountryNames data to the taxes data.

# ?left_join      # see the syntax for the command
countrytaxes <- left_join(taxes, countrynames, by = c("Country" = "CountryName"))
countrytaxes      # 41 X 10 tibble, CountryCode is in last column
## # A tibble: 41 x 10
##    Country        IncomeCentral IncomeLocalRegi~ IncomeSurtaxes EmployeeSocialC~
##    <chr>                  <dbl>            <dbl>          <dbl>            <dbl>
##  1 Australia               45               NA            NA                NA  
##  2 Austria                 55               NA            NA                NA  
##  3 Belgium                 50                3.5          NA                13.1
##  4 Bulgaria                10               NA            NA                NA  
##  5 Canada                  33               17.5          NA                NA  
##  6 Chile                   35.5             NA            NA                NA  
##  7 Croatia                 36               NA             6.48             NA  
##  8 Cyprus                  35               NA            NA                NA  
##  9 Czech Republic          15               NA             7                NA  
## 10 Denmark                 52.0             NA            NA                 8  
## # ... with 31 more rows, and 5 more variables:
## #   EmployeeSocialContributionsNonDeductible <dbl>, Payroll tax <dbl>,
## #   Consumption tax <dbl>, Effective marginal tax rate <dbl>, CountryCode <chr>

Combining PWT data (economic data) with tax data for each country

Now suppose we want to create a variable that is each country’s population( pop in the pwt100 tibble) in 2019 and a variable for the country’s population in 1990 to see how a country’s population has changed over time.

Filtering

First, lets look at our data. Filter the pwt100 data so that it only includes data from 1990 and 2019. Save it as a new object called growth. How many observations are there?

pwt100 %>% filter(year == 1990 | year ==2019)
## # A tibble: 366 x 52
##    countrycode country  currency_unit  year  rgdpe  rgdpo     pop      emp   avh
##    <chr>       <chr>    <chr>         <dbl>  <dbl>  <dbl>   <dbl>    <dbl> <dbl>
##  1 ABW         Aruba    Aruban Guild~  1990 2.58e3 3.01e3 6.21e-2 NA          NA
##  2 ABW         Aruba    Aruban Guild~  2019 3.92e3 3.47e3 1.06e-1  0.0476     NA
##  3 AGO         Angola   Kwanza         1990 3.63e4 3.60e4 1.18e+1  6.67       NA
##  4 AGO         Angola   Kwanza         2019 2.28e5 2.28e5 3.18e+1 16.6        NA
##  5 AIA         Anguilla East Caribbe~  1990 2.06e2 2.48e2 8.90e-3  0.00355    NA
##  6 AIA         Anguilla East Caribbe~  2019 3.77e2 2.26e2 1.49e-2 NA          NA
##  7 ALB         Albania  Lek            1990 1.20e4 1.21e4 3.29e+0  1.32       NA
##  8 ALB         Albania  Lek            2019 3.59e4 3.61e4 2.88e+0  1.08       NA
##  9 ARE         United ~ UAE Dirham     1990 2.27e5 2.65e5 1.83e+0  0.949      NA
## 10 ARE         United ~ UAE Dirham     2019 6.82e5 6.46e5 9.77e+0  5.81       NA
## # ... with 356 more rows, and 43 more variables: hc <dbl>, ccon <dbl>,
## #   cda <dbl>, cgdpe <dbl>, cgdpo <dbl>, cn <dbl>, ck <dbl>, ctfp <dbl>,
## #   cwtfp <dbl>, rgdpna <dbl>, rconna <dbl>, rdana <dbl>, rnna <dbl>,
## #   rkna <dbl>, rtfpna <dbl>, rwtfpna <dbl>, labsh <dbl>, irr <dbl>,
## #   delta <dbl>, xr <dbl>, pl_con <dbl>, pl_da <dbl>, pl_gdpo <dbl>,
## #   i_cig <chr>, i_xm <chr>, i_xr <chr>, i_outlier <chr>, i_irr <chr>,
## #   cor_exp <dbl>, statcap <dbl>, csh_c <dbl>, csh_i <dbl>, csh_g <dbl>, ...

Creating a new variable for population

In Excel, this could be done by creating a uniqueID for country_year and matching the data from 1990 and 2019. In Excel, this would look something like this: Create a new variable for pop1990 and pop2019 and then fill in using vlookup or xlookup: =XLOOKUP(CONCAT(A2,"2019"),pwt100!A$2:A$12811,pwt100!H$2:H$12811) IF I had the three dataframes as separate worksheets within one Excel workbook.

I will do something similar in R. I want to create a new variable using population data from 1990.

class(pwt100$year)  #numeric, this is good
## [1] "numeric"
# countrytaxes$pop1990 <- ifelse(pwt100$year == 1990, pwt100$pop, )

pop1990  <- pwt100 %>% 
  filter(year == 1990) %>% # only observations (rows) from 1990, Still has all of the columns
  select(countrycode, pop) # only keepsthe variables referenced
dim(pop1990) # 183 rows/countries, 3 columns
## [1] 183   2
pop2019 <- pwt100 %>%
  filter(year == 2019) %>%
  select(countrycode, pop)
dim(pop2019) # 183 rows, 52 columns
## [1] 183   2

Pause: What do I want?

I want to add two columns that have the population during 1990 and 2019 in my country tax rate data.

growth <- left_join(countrytaxes, pop1990, by = c("CountryCode"= "countrycode")) %>% rename(pop1990 = pop)
growth  # 41 X 11
## # A tibble: 41 x 11
##    Country        IncomeCentral IncomeLocalRegi~ IncomeSurtaxes EmployeeSocialC~
##    <chr>                  <dbl>            <dbl>          <dbl>            <dbl>
##  1 Australia               45               NA            NA                NA  
##  2 Austria                 55               NA            NA                NA  
##  3 Belgium                 50                3.5          NA                13.1
##  4 Bulgaria                10               NA            NA                NA  
##  5 Canada                  33               17.5          NA                NA  
##  6 Chile                   35.5             NA            NA                NA  
##  7 Croatia                 36               NA             6.48             NA  
##  8 Cyprus                  35               NA            NA                NA  
##  9 Czech Republic          15               NA             7                NA  
## 10 Denmark                 52.0             NA            NA                 8  
## # ... with 31 more rows, and 6 more variables:
## #   EmployeeSocialContributionsNonDeductible <dbl>, Payroll tax <dbl>,
## #   Consumption tax <dbl>, Effective marginal tax rate <dbl>,
## #   CountryCode <chr>, pop1990 <dbl>
growth <- left_join(growth, pop2019, by = c("CountryCode"= "countrycode")) %>% rename(pop2019 = pop)
growth    # 41 X 12
## # A tibble: 41 x 12
##    Country        IncomeCentral IncomeLocalRegi~ IncomeSurtaxes EmployeeSocialC~
##    <chr>                  <dbl>            <dbl>          <dbl>            <dbl>
##  1 Australia               45               NA            NA                NA  
##  2 Austria                 55               NA            NA                NA  
##  3 Belgium                 50                3.5          NA                13.1
##  4 Bulgaria                10               NA            NA                NA  
##  5 Canada                  33               17.5          NA                NA  
##  6 Chile                   35.5             NA            NA                NA  
##  7 Croatia                 36               NA             6.48             NA  
##  8 Cyprus                  35               NA            NA                NA  
##  9 Czech Republic          15               NA             7                NA  
## 10 Denmark                 52.0             NA            NA                 8  
## # ... with 31 more rows, and 7 more variables:
## #   EmployeeSocialContributionsNonDeductible <dbl>, Payroll tax <dbl>,
## #   Consumption tax <dbl>, Effective marginal tax rate <dbl>,
## #   CountryCode <chr>, pop1990 <dbl>, pop2019 <dbl>

Aggregating by group

Sometimes we will want to create a new variable that is the sum or average of another variable within some group, or the count of the number of observations in that group.

We can construct group-level averages variables using the AVERAGEIFS() function. It takes three arguments:

Average investment share by country

Suppose we are interested in whether high-tax countries tend to have higher or lower investment rates than low-tax countries. In order to do this, we might add a variable to our GrowthData worksheet that describes the average investment share of GDP (the variable csh_i) in each country over the full period of the data

Note: In Excel, COUNTIFS(), AVERAGEIFS(), SUMIFS(), MINIFS() and MAXIFS() functions all allow for multiple criteria to be used. Excel also includes a set of older functions COUNTIF(), AVERAGEIF(), SUMIF(), MINIF() and MAXIF() that allow only a single criterion. You may see these in older worksheets.

csh_i: Average investment share of GDP

pwt100 %>% group_by(countrycode) %>%
  summarize(avginvestshare = mean(csh_i))%>%
  arrange(desc(avginvestshare))
## # A tibble: 183 x 2
##    countrycode avginvestshare
##    <chr>                <dbl>
##  1 CYP                  0.444
##  2 NGA                  0.415
##  3 CHE                  0.349
##  4 ISL                  0.338
##  5 FIN                  0.336
##  6 LUX                  0.315
##  7 NOR                  0.304
##  8 VEN                  0.303
##  9 JPN                  0.300
## 10 DEU                  0.296
## # ... with 173 more rows
# tibble 183 X 2