The goal of Project 2 is to choose wide/untidy datasets from Week 5 Discussion, ready from CSV into R to tidy/transform as needed, and perform the analysis requested in the discussion. I use Chun San Yip’s Week 5 Discussion here
Read data from original repo and write csv.
df = read_csv('https://github.com/jmhsi/DATA_607/raw/master/Project%202/API_EAS_DS2_en_excel_v2_822197.csv')
## Parsed with column specification:
## cols(
## .default = col_double(),
## `Data Source` = col_character(),
## `World Development Indicators` = col_character(),
## X3 = col_character(),
## X4 = col_character()
## )
## See spec(...) for full column specifications.
dict = read_csv('https://github.com/jmhsi/DATA_607/raw/master/Project%202/EAS_dict.csv')
## Parsed with column specification:
## cols(
## INDICATOR_CODE = col_character(),
## INDICATOR_NAME = col_character(),
## SOURCE_NOTE = col_character(),
## SOURCE_ORGANIZATION = col_character()
## )
write.csv(df, 'API_EAS_DS2_en_excel_v2_822197.csv', row.names=FALSE)
write.csv(dict, 'EAS_dict.csv', row.names=FALSE)
Load in data and tidy-up to long form. Chun recommends pivoting_longer turning the year columns.
df = read_csv('API_EAS_DS2_en_excel_v2_822197.csv', skip = 2)
## Warning: Missing column names filled in: 'X1' [1], 'X2' [2], 'X3' [3], 'X4' [4],
## 'X5' [5], 'X6' [6], 'X7' [7], 'X8' [8], 'X9' [9], 'X10' [10], 'X11' [11],
## 'X12' [12], 'X13' [13], 'X14' [14], 'X15' [15], 'X16' [16], 'X17' [17],
## 'X18' [18], 'X19' [19], 'X20' [20], 'X21' [21], 'X22' [22], 'X23' [23],
## 'X24' [24], 'X25' [25], 'X26' [26], 'X27' [27], 'X28' [28], 'X29' [29],
## 'X30' [30], 'X31' [31], 'X32' [32], 'X33' [33], 'X34' [34], 'X35' [35],
## 'X36' [36], 'X37' [37], 'X38' [38], 'X39' [39], 'X40' [40], 'X41' [41],
## 'X42' [42], 'X43' [43], 'X44' [44], 'X45' [45], 'X46' [46], 'X47' [47],
## 'X48' [48], 'X49' [49], 'X50' [50], 'X51' [51], 'X52' [52], 'X53' [53],
## 'X54' [54], 'X55' [55], 'X56' [56], 'X57' [57], 'X58' [58], 'X59' [59],
## 'X60' [60], 'X61' [61], 'X62' [62], 'X63' [63], 'X64' [64]
## Parsed with column specification:
## cols(
## .default = col_double(),
## X1 = col_character(),
## X2 = col_character(),
## X3 = col_character(),
## X4 = col_character()
## )
## See spec(...) for full column specifications.
dict = read_csv('EAS_dict.csv')
## Parsed with column specification:
## cols(
## INDICATOR_CODE = col_character(),
## INDICATOR_NAME = col_character(),
## SOURCE_NOTE = col_character(),
## SOURCE_ORGANIZATION = col_character()
## )
colnames(df) = df[1,]
# drop the header row
df = df[2:length(df),]
# grab only the percent data
pct_dict = dict[grep('%', dict$INDICATOR_NAME),]
pct_df = df[df$`Indicator Code` %in% pct_dict$INDICATOR_CODE,]
# pivot all the years to make long
pct_df = pct_df %>% pivot_longer(cols = 5:ncol(pct_df), names_to = 'year', values_to ='percent', values_drop_na = TRUE)
kable(head(pct_df), caption='Tidied data')
| Country Name | Country Code | Indicator Name | Indicator Code | year | percent |
|---|---|---|---|---|---|
| East Asia & Pacific | EAS | Merchandise imports from low- and middle-income economies in Sub-Saharan Africa (% of total merchandise imports) | TM.VAL.MRCH.R6.ZS | 1960 | 1.589339 |
| East Asia & Pacific | EAS | Merchandise imports from low- and middle-income economies in Sub-Saharan Africa (% of total merchandise imports) | TM.VAL.MRCH.R6.ZS | 1961 | 1.893343 |
| East Asia & Pacific | EAS | Merchandise imports from low- and middle-income economies in Sub-Saharan Africa (% of total merchandise imports) | TM.VAL.MRCH.R6.ZS | 1962 | 1.859862 |
| East Asia & Pacific | EAS | Merchandise imports from low- and middle-income economies in Sub-Saharan Africa (% of total merchandise imports) | TM.VAL.MRCH.R6.ZS | 1963 | 1.920439 |
| East Asia & Pacific | EAS | Merchandise imports from low- and middle-income economies in Sub-Saharan Africa (% of total merchandise imports) | TM.VAL.MRCH.R6.ZS | 1964 | 2.259369 |
| East Asia & Pacific | EAS | Merchandise imports from low- and middle-income economies in Sub-Saharan Africa (% of total merchandise imports) | TM.VAL.MRCH.R6.ZS | 1965 | 2.283279 |
Discussion questions: I add my own question as nothing was specifically mentioned to analyze. Which indicator had the most/least pct change over the years presented?
sub_df = pct_df %>% group_by(`Indicator Code`) %>% summarise(min_pct = min(percent), max_pct = max(percent), min_year = min(year), max_year = max(year), min_year_pct = percent[which(year == min(year))], max_year_pct = percent[which(year == max(year))])
sub_df$largest_pct_diff = sub_df$max_pct - sub_df$min_pct
sub_df$year_max_min_pct_diff = sub_df$max_year_pct - sub_df$min_year_pct
plt1 = ggplot(sub_df, aes(x = reorder(`Indicator Code`, year_max_min_pct_diff), y = largest_pct_diff))+ geom_bar(stat = 'identity', color="blue", fill = 'white') + coord_flip() + labs(y="Largest diff in % across all years", x="Indicator Code")
plt2 = ggplot(sub_df, aes(x = reorder(`Indicator Code`, year_max_min_pct_diff), y = year_max_min_pct_diff))+ geom_bar(stat = 'identity', color="blue", fill = 'white') + coord_flip() + labs(y="Diff in % of most recent year and earliest recorded year", x="Indicator Code")
grid.arrange(plt1, plt2, ncol=2)
We see that TM.TAX.MRCH.BC.ZS has some of the largest difference in percent between years, as well as the largest difference in percent of the earliest recorded year and most recent year. Perhaps unsurprisingly, the Indicator Codes with very little change in percent of earliest recorded year and most recent year also have very little percent difference across all years. Some next steps in analysis could be to look at what these Indicator Codes mean from the data dictionary and see if the percent differences we see make sense comparing earliest to most recent data.