607_Project_2

The goal of Project 2 is to choose wide/untidy datasets from Week 5 Discussion, ready from CSV into R to tidy/transform as needed, and perform the analysis requested in the discussion. I use Chun San Yip’s Week 5 Discussion here

Read data from original repo and write csv.

df = read_csv('https://github.com/jmhsi/DATA_607/raw/master/Project%202/API_EAS_DS2_en_excel_v2_822197.csv')

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Data Source` = col_character(),
##   `World Development Indicators` = col_character(),
##   X3 = col_character(),
##   X4 = col_character()
## )

## See spec(...) for full column specifications.

dict = read_csv('https://github.com/jmhsi/DATA_607/raw/master/Project%202/EAS_dict.csv')

## Parsed with column specification:
## cols(
##   INDICATOR_CODE = col_character(),
##   INDICATOR_NAME = col_character(),
##   SOURCE_NOTE = col_character(),
##   SOURCE_ORGANIZATION = col_character()
## )

write.csv(df, 'API_EAS_DS2_en_excel_v2_822197.csv', row.names=FALSE)
write.csv(dict, 'EAS_dict.csv', row.names=FALSE)

Load in data and tidy-up to long form. Chun recommends pivoting_longer turning the year columns.

df = read_csv('API_EAS_DS2_en_excel_v2_822197.csv', skip = 2)

## Warning: Missing column names filled in: 'X1' [1], 'X2' [2], 'X3' [3], 'X4' [4],
## 'X5' [5], 'X6' [6], 'X7' [7], 'X8' [8], 'X9' [9], 'X10' [10], 'X11' [11],
## 'X12' [12], 'X13' [13], 'X14' [14], 'X15' [15], 'X16' [16], 'X17' [17],
## 'X18' [18], 'X19' [19], 'X20' [20], 'X21' [21], 'X22' [22], 'X23' [23],
## 'X24' [24], 'X25' [25], 'X26' [26], 'X27' [27], 'X28' [28], 'X29' [29],
## 'X30' [30], 'X31' [31], 'X32' [32], 'X33' [33], 'X34' [34], 'X35' [35],
## 'X36' [36], 'X37' [37], 'X38' [38], 'X39' [39], 'X40' [40], 'X41' [41],
## 'X42' [42], 'X43' [43], 'X44' [44], 'X45' [45], 'X46' [46], 'X47' [47],
## 'X48' [48], 'X49' [49], 'X50' [50], 'X51' [51], 'X52' [52], 'X53' [53],
## 'X54' [54], 'X55' [55], 'X56' [56], 'X57' [57], 'X58' [58], 'X59' [59],
## 'X60' [60], 'X61' [61], 'X62' [62], 'X63' [63], 'X64' [64]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   X1 = col_character(),
##   X2 = col_character(),
##   X3 = col_character(),
##   X4 = col_character()
## )

## See spec(...) for full column specifications.

dict = read_csv('EAS_dict.csv')

## Parsed with column specification:
## cols(
##   INDICATOR_CODE = col_character(),
##   INDICATOR_NAME = col_character(),
##   SOURCE_NOTE = col_character(),
##   SOURCE_ORGANIZATION = col_character()
## )

colnames(df) = df[1,]
# drop the header row
df = df[2:length(df),]
# grab only the percent data
pct_dict = dict[grep('%', dict$INDICATOR_NAME),]
pct_df = df[df$`Indicator Code` %in% pct_dict$INDICATOR_CODE,]
# pivot all the years to make long
pct_df = pct_df %>% pivot_longer(cols = 5:ncol(pct_df), names_to = 'year', values_to ='percent', values_drop_na = TRUE)
kable(head(pct_df), caption='Tidied data')

Tidied data
Country Name	Country Code	Indicator Name	Indicator Code	year	percent
East Asia & Pacific	EAS	Merchandise imports from low- and middle-income economies in Sub-Saharan Africa (% of total merchandise imports)	TM.VAL.MRCH.R6.ZS	1960	1.589339
East Asia & Pacific	EAS	Merchandise imports from low- and middle-income economies in Sub-Saharan Africa (% of total merchandise imports)	TM.VAL.MRCH.R6.ZS	1961	1.893343
East Asia & Pacific	EAS	Merchandise imports from low- and middle-income economies in Sub-Saharan Africa (% of total merchandise imports)	TM.VAL.MRCH.R6.ZS	1962	1.859862
East Asia & Pacific	EAS	Merchandise imports from low- and middle-income economies in Sub-Saharan Africa (% of total merchandise imports)	TM.VAL.MRCH.R6.ZS	1963	1.920439
East Asia & Pacific	EAS	Merchandise imports from low- and middle-income economies in Sub-Saharan Africa (% of total merchandise imports)	TM.VAL.MRCH.R6.ZS	1964	2.259369
East Asia & Pacific	EAS	Merchandise imports from low- and middle-income economies in Sub-Saharan Africa (% of total merchandise imports)	TM.VAL.MRCH.R6.ZS	1965	2.283279

Discussion questions: I add my own question as nothing was specifically mentioned to analyze. Which indicator had the most/least pct change over the years presented?

sub_df = pct_df %>% group_by(`Indicator Code`) %>% summarise(min_pct = min(percent), max_pct = max(percent), min_year = min(year), max_year = max(year), min_year_pct = percent[which(year == min(year))], max_year_pct = percent[which(year == max(year))])
sub_df$largest_pct_diff = sub_df$max_pct - sub_df$min_pct
sub_df$year_max_min_pct_diff = sub_df$max_year_pct - sub_df$min_year_pct

plt1 = ggplot(sub_df, aes(x = reorder(`Indicator Code`, year_max_min_pct_diff), y = largest_pct_diff))+ geom_bar(stat = 'identity', color="blue", fill = 'white') + coord_flip() + labs(y="Largest diff in % across all years", x="Indicator Code")
plt2 = ggplot(sub_df, aes(x = reorder(`Indicator Code`, year_max_min_pct_diff), y = year_max_min_pct_diff))+ geom_bar(stat = 'identity', color="blue", fill = 'white') + coord_flip() + labs(y="Diff in % of most recent year and earliest recorded year", x="Indicator Code")
grid.arrange(plt1, plt2, ncol=2)

We see that TM.TAX.MRCH.BC.ZS has some of the largest difference in percent between years, as well as the largest difference in percent of the earliest recorded year and most recent year. Perhaps unsurprisingly, the Indicator Codes with very little change in percent of earliest recorded year and most recent year also have very little percent difference across all years. Some next steps in analysis could be to look at what these Indicator Codes mean from the data dictionary and see if the percent differences we see make sense comparing earliest to most recent data.

607_Project_2_3

Justin Hsi

3/5/2020