Data describes long-run economic growth. Data can be downloaded on blackboard:
Our main data source will be the Penn World Table (PWT), a cross-country data set covering real GDP, population, and other macroeconomic variables. The PWT is built from two distinct data sources: National income data from each country’s national statistical agency. Systematic data comparing prices across countries, constructed by the World Bank’s International Comparison Program (ICP). The ICP data is needed to account for a simple economic reality: each country’s GDP is calculated using local prices, but prices of key goods and services vary dramatically across countries: housing is much more expensive in Vancouver than in Houston, and a haircut is much cheaper in Mumbai than in London. The PWT research team use the results of the ICP to convert each country’s GDP data to comparable (PPP) units. The current version of the PWT is available online at http://www.ggdc.net/pwt. Source and original example
We also have a secondary data source with cross-country data on top marginal tax rates. Most countries have progressive tax systems - that means that residents with high income pay a higher tax rate than residents with low income - but these higher rates only apply to marginal income. For example if the marginal tax rate is 30% on taxable income below $100,000 and 40% on taxable income above $100,000, a taxpayer with $150,000 in taxable income would pay 30% on the first $100,000 and 40% on the remaining $50,000 for a total tax bill of $50,000, or an average tax rate of 33%. The data on the top marginal tax rate is obtained from the Tax Foundation7, a US-based policy and research organization. This data set comes from Table 1 in the report “Taxing High Incomes: A Comparison of 41 Countries,” available online at https://taxfoundation.org/taxing-high-income-2019/.
Finally, we have a “crosswalk” data file to help with linking the GDP and tax data (more about this below). Download the required file CountryCodes.txt. This file may display in your browser rather than downloading like the other files. If that happens, you can right-click on the link above and select “Save link as” to save the file on your computer.
read_excel() has additional arguments that can be used to specify what you want R to do:
Usage read_excel(path, sheet = 1, col_names = TRUE, col_types = NULL, na = "", skip = 0)
Arguments
You could find this out by either googling “read_excel in R” and it would be one of the first results OR you can do ?read_excel in your code (console or script/markdown work) to pull up the Help tab with more information!
library(readxl)
taxes <- read_excel("TaxRates.xlsx", sheet = 1)
taxes
## # A tibble: 41 x 9
## Country IncomeCentral IncomeLocalRegi~ IncomeSurtaxes EmployeeSocialC~
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Australia 45 NA NA NA
## 2 Austria 55 NA NA NA
## 3 Belgium 50 3.5 NA 13.1
## 4 Bulgaria 10 NA NA NA
## 5 Canada 33 17.5 NA NA
## 6 Chile 35.5 NA NA NA
## 7 Croatia 36 NA 6.48 NA
## 8 Cyprus 35 NA NA NA
## 9 Czech Republic 15 NA 7 NA
## 10 Denmark 52.0 NA NA 8
## # ... with 31 more rows, and 4 more variables:
## # EmployeeSocialContributionsNonDeductible <dbl>, Payroll tax <dbl>,
## # Consumption tax <dbl>, Effective marginal tax rate <dbl>
countrynames <- read_csv("CountryCodes.txt")
##
## -- Column specification --------------------------------------------------------
## cols(
## `CountryName CountryCode` = col_character()
## )
## Warning: 2 parsing failures.
## row col expected actual file
## 40 -- 1 columns 2 columns 'CountryCodes.txt'
## 41 -- 1 columns 2 columns 'CountryCodes.txt'
countrynames
## # A tibble: 184 x 1
## `CountryName CountryCode`
## <chr>
## 1 Albania ALB
## 2 Algeria DZA
## 3 Angola AGO
## 4 Anguilla AIA
## 5 Antigua and Barbuda ATG
## 6 Argentina ARG
## 7 Armenia ARM
## 8 Aruba ABW
## 9 Australia AUS
## 10 Austria AUT
## # ... with 174 more rows
Awe, this one is messed up. What should we do? What kind of file was this originally?
I googled “fixed width files in R” and found my answer right away.
- Note: I tried read_fwf() first, then tried read_table(). Success doesn’t always happen right away.
countrynames <- read_table("CountryCodes.txt")
##
## -- Column specification --------------------------------------------------------
## cols(
## CountryName = col_character(),
## CountryCode = col_character()
## )
countrynames # Tibble, 184 observations, 2 variables
## # A tibble: 184 x 2
## CountryName CountryCode
## <chr> <chr>
## 1 Albania ALB
## 2 Algeria DZA
## 3 Angola AGO
## 4 Anguilla AIA
## 5 Antigua and Barbuda ATG
## 6 Argentina ARG
## 7 Armenia ARM
## 8 Aruba ABW
## 9 Australia AUS
## 10 Austria AUT
## # ... with 174 more rows
pwt100 <- read_excel("pwt100.xlsx")
pwt100
## # A tibble: 9 x 1
## `Penn World Table, version 10.0`
## <chr>
## 1 <NA>
## 2 "This file contains the data of PWT 10.0, as available on www.ggdc.net/pwt"
## 3 "Please refer to www.ggdc.net/pwt for extensive documentation of the differen~
## 4 <NA>
## 5 "When using these data, please refer to the following paper:"
## 6 "Feenstra, Robert C., Robert Inklaar and Marcel P. Timmer (2015), \"The Next ~
## 7 <NA>
## 8 "Note"
## 9 "On January 28, 2021 we updated PWT 10.0 due to [1] an error in the current v~
Hmmm… This has a lot of extra information in it that I don’t want. After opening the excel file (in Excel) I realized I wanted the information from the 3rd sheet.
pwt100 <- read_excel("pwt100.xlsx", sheet = 3)
pwt100 # success. Tibble with 12,810 rows and 52 columns
## # A tibble: 12,810 x 52
## countrycode country currency_unit year rgdpe rgdpo pop emp avh hc
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ABW Aruba Aruban Guilder 1950 NA NA NA NA NA NA
## 2 ABW Aruba Aruban Guilder 1951 NA NA NA NA NA NA
## 3 ABW Aruba Aruban Guilder 1952 NA NA NA NA NA NA
## 4 ABW Aruba Aruban Guilder 1953 NA NA NA NA NA NA
## 5 ABW Aruba Aruban Guilder 1954 NA NA NA NA NA NA
## 6 ABW Aruba Aruban Guilder 1955 NA NA NA NA NA NA
## 7 ABW Aruba Aruban Guilder 1956 NA NA NA NA NA NA
## 8 ABW Aruba Aruban Guilder 1957 NA NA NA NA NA NA
## 9 ABW Aruba Aruban Guilder 1958 NA NA NA NA NA NA
## 10 ABW Aruba Aruban Guilder 1959 NA NA NA NA NA NA
## # ... with 12,800 more rows, and 42 more variables: ccon <dbl>, cda <dbl>,
## # cgdpe <dbl>, cgdpo <dbl>, cn <dbl>, ck <dbl>, ctfp <dbl>, cwtfp <dbl>,
## # rgdpna <dbl>, rconna <dbl>, rdana <dbl>, rnna <dbl>, rkna <dbl>,
## # rtfpna <dbl>, rwtfpna <dbl>, labsh <dbl>, irr <dbl>, delta <dbl>, xr <dbl>,
## # pl_con <dbl>, pl_da <dbl>, pl_gdpo <dbl>, i_cig <chr>, i_xm <chr>,
## # i_xr <chr>, i_outlier <chr>, i_irr <chr>, cor_exp <dbl>, statcap <dbl>,
## # csh_c <dbl>, csh_i <dbl>, csh_g <dbl>, csh_x <dbl>, csh_m <dbl>, ...
Combine CountryCodes, taxes, and pwt100 into one dataframe (or tibble).
We could try to match using the country names but they aren’t always exactly the same. For example, “South Korea” in the taxrates data compared to “Republic of Korea” in pwt100.
Let’s give our dataframe taxes the standard three-letter ISO country codes. Do this by “joining”, the CountryNames data to the taxes data.
# ?left_join # see the syntax for the command
countrytaxes <- left_join(taxes, countrynames, by = c("Country" = "CountryName"))
countrytaxes # 41 X 10 tibble, CountryCode is in last column
## # A tibble: 41 x 10
## Country IncomeCentral IncomeLocalRegi~ IncomeSurtaxes EmployeeSocialC~
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Australia 45 NA NA NA
## 2 Austria 55 NA NA NA
## 3 Belgium 50 3.5 NA 13.1
## 4 Bulgaria 10 NA NA NA
## 5 Canada 33 17.5 NA NA
## 6 Chile 35.5 NA NA NA
## 7 Croatia 36 NA 6.48 NA
## 8 Cyprus 35 NA NA NA
## 9 Czech Republic 15 NA 7 NA
## 10 Denmark 52.0 NA NA 8
## # ... with 31 more rows, and 5 more variables:
## # EmployeeSocialContributionsNonDeductible <dbl>, Payroll tax <dbl>,
## # Consumption tax <dbl>, Effective marginal tax rate <dbl>, CountryCode <chr>
Now suppose we want to create a variable that is each country’s population( pop in the pwt100 tibble) in 2019 and a variable for the country’s population in 1990 to see how a country’s population has changed over time.
First, lets look at our data. Filter the pwt100 data so that it only includes data from 1990 and 2019. Save it as a new object called growth. How many observations are there?
pwt100 %>% filter(year == 1990 | year ==2019)
## # A tibble: 366 x 52
## countrycode country currency_unit year rgdpe rgdpo pop emp avh
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ABW Aruba Aruban Guild~ 1990 2.58e3 3.01e3 6.21e-2 NA NA
## 2 ABW Aruba Aruban Guild~ 2019 3.92e3 3.47e3 1.06e-1 0.0476 NA
## 3 AGO Angola Kwanza 1990 3.63e4 3.60e4 1.18e+1 6.67 NA
## 4 AGO Angola Kwanza 2019 2.28e5 2.28e5 3.18e+1 16.6 NA
## 5 AIA Anguilla East Caribbe~ 1990 2.06e2 2.48e2 8.90e-3 0.00355 NA
## 6 AIA Anguilla East Caribbe~ 2019 3.77e2 2.26e2 1.49e-2 NA NA
## 7 ALB Albania Lek 1990 1.20e4 1.21e4 3.29e+0 1.32 NA
## 8 ALB Albania Lek 2019 3.59e4 3.61e4 2.88e+0 1.08 NA
## 9 ARE United ~ UAE Dirham 1990 2.27e5 2.65e5 1.83e+0 0.949 NA
## 10 ARE United ~ UAE Dirham 2019 6.82e5 6.46e5 9.77e+0 5.81 NA
## # ... with 356 more rows, and 43 more variables: hc <dbl>, ccon <dbl>,
## # cda <dbl>, cgdpe <dbl>, cgdpo <dbl>, cn <dbl>, ck <dbl>, ctfp <dbl>,
## # cwtfp <dbl>, rgdpna <dbl>, rconna <dbl>, rdana <dbl>, rnna <dbl>,
## # rkna <dbl>, rtfpna <dbl>, rwtfpna <dbl>, labsh <dbl>, irr <dbl>,
## # delta <dbl>, xr <dbl>, pl_con <dbl>, pl_da <dbl>, pl_gdpo <dbl>,
## # i_cig <chr>, i_xm <chr>, i_xr <chr>, i_outlier <chr>, i_irr <chr>,
## # cor_exp <dbl>, statcap <dbl>, csh_c <dbl>, csh_i <dbl>, csh_g <dbl>, ...
In Excel, this could be done by creating a uniqueID for country_year and matching the data from 1990 and 2019. In Excel, this would look something like this: Create a new variable for pop1990 and pop2019 and then fill in using vlookup or xlookup: =XLOOKUP(CONCAT(A2,"2019"),pwt100!A$2:A$12811,pwt100!H$2:H$12811) IF I had the three dataframes as separate worksheets within one Excel workbook.
I will do something similar in R. I want to create a new variable using population data from 1990.
class(pwt100$year) #numeric, this is good
## [1] "numeric"
# countrytaxes$pop1990 <- ifelse(pwt100$year == 1990, pwt100$pop, )
pop1990 <- pwt100 %>%
filter(year == 1990) %>% # only observations (rows) from 1990, Still has all of the columns
select(countrycode, pop) # only keepsthe variables referenced
dim(pop1990) # 183 rows/countries, 3 columns
## [1] 183 2
pop2019 <- pwt100 %>%
filter(year == 2019) %>%
select(countrycode, pop)
dim(pop2019) # 183 rows, 52 columns
## [1] 183 2
Pause: What do I want?
I want to add two columns that have the population during 1990 and 2019 in my country tax rate data.
growth <- left_join(countrytaxes, pop1990, by = c("CountryCode"= "countrycode")) %>% rename(pop1990 = pop)
growth # 41 X 11
## # A tibble: 41 x 11
## Country IncomeCentral IncomeLocalRegi~ IncomeSurtaxes EmployeeSocialC~
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Australia 45 NA NA NA
## 2 Austria 55 NA NA NA
## 3 Belgium 50 3.5 NA 13.1
## 4 Bulgaria 10 NA NA NA
## 5 Canada 33 17.5 NA NA
## 6 Chile 35.5 NA NA NA
## 7 Croatia 36 NA 6.48 NA
## 8 Cyprus 35 NA NA NA
## 9 Czech Republic 15 NA 7 NA
## 10 Denmark 52.0 NA NA 8
## # ... with 31 more rows, and 6 more variables:
## # EmployeeSocialContributionsNonDeductible <dbl>, Payroll tax <dbl>,
## # Consumption tax <dbl>, Effective marginal tax rate <dbl>,
## # CountryCode <chr>, pop1990 <dbl>
growth <- left_join(growth, pop2019, by = c("CountryCode"= "countrycode")) %>% rename(pop2019 = pop)
growth # 41 X 12
## # A tibble: 41 x 12
## Country IncomeCentral IncomeLocalRegi~ IncomeSurtaxes EmployeeSocialC~
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Australia 45 NA NA NA
## 2 Austria 55 NA NA NA
## 3 Belgium 50 3.5 NA 13.1
## 4 Bulgaria 10 NA NA NA
## 5 Canada 33 17.5 NA NA
## 6 Chile 35.5 NA NA NA
## 7 Croatia 36 NA 6.48 NA
## 8 Cyprus 35 NA NA NA
## 9 Czech Republic 15 NA 7 NA
## 10 Denmark 52.0 NA NA 8
## # ... with 31 more rows, and 7 more variables:
## # EmployeeSocialContributionsNonDeductible <dbl>, Payroll tax <dbl>,
## # Consumption tax <dbl>, Effective marginal tax rate <dbl>,
## # CountryCode <chr>, pop1990 <dbl>, pop2019 <dbl>
Sometimes we will want to create a new variable that is the sum or average of another variable within some group, or the count of the number of observations in that group.
We can construct group-level averages variables using the AVERAGEIFS() function. It takes three arguments: