Fuzzy Match
Joining of the CO2 and country_list datasets based on the country name attribute in both datasets.
As the country names may not 100% match, a fuzzy match of the country names and manual mapping of any unmatched names is completed to allow for the two datasets to be used together.
# Set country names to lower in both files to avoid case sensitivity
co2$country_name <- tolower(co2$Entity)
country_list$country_name <- tolower(country_list$`COUNTRY NAME`)
# Function to fuzzy match country names
unmatched_country_names <- function(df1, df2, column_name){
df1_col <- unique(select(df1,column_name))
df2_col <- unique(select(df2,column_name))
name_matches <- stringdist_left_join(x = df1_col, y= df2_col, max_dist=0)
# Return all positive matches from fuzzy join which are NOT exact matches
positive_name_matches <- name_matches %>% filter(complete.cases(.))
filter(positive_name_matches, !(1 == 2))
unmatched <- name_matches %>% filter_all(any_vars(is.na(.)))
return(unmatched)
}
# Return a dataframe of country names which do not match between country_list and co2 datasets
unmatched <- unmatched_country_names(co2,country_list, 'country_name')
Joining by: "country_name"
unmatched
43 rows with unmatched names.
Majority of the unmatched countries are either regions of the world (e.g. south america), countries with island, south, north or saint in their name or other records (e.g. world, annex b), which may have been abbreviated in the country_list dataset (e.g. north might be nth) or are irrelvant for our purposes.
Also several ‘annex b’ types, which are unknown.
All of these non-country records are to be removed.
# Non-countries in unmatched
non_countries <- c('africa', 'asia', 'central america', 'eu28', 'europe', 'kp annex b', 'middle east', 'non kp annex b', 'non-oecd', 'north america', 'oecd','oceania', 'statistical differences', 'world')
# Create a subset of the data excluding the non-country records
co2 <- subset(co2, !(country_name %in% non_countries))
# Group by country name and review for further non-country records
co2 %>%
group_by(country_name) %>%
summarise()
unmatched2 <- unmatched_country_names(co2,country_list, 'country_name')
Joining by: "country_name"
unmatched2
After re-running the match on the cleaned data, there are now 29 countries which do not have matched.
Options to match them could include regex or more detailed, prescriptive fuzzy matching, however given there are only 29 unmatched countries, resolved to manual match them.
# List of unmatched countries which have a corresponding record in the country_list dataset
unmatched_countries = c('czechoslovakia', 'democratic republic of congo', 'faeroe islands', 'falkland islands', 'gambia', 'micronesia (country)', 'north korea', 'palau', 'reunion', 'saint helena', 'saint kitts and nevis', 'saint lucia', 'saint pierre and miquelon', 'saint vincent and the grenadines', 'sint maarten (dutch part)', 'south korea', 'south sudan', 'yugoslavia')
# A list manually mapping the unmatched countries to the country_list dataset
manually_matched <- c('czech republic', 'congo', 'faroe islands', 'falkland islands (islas malvinas)', 'gambia the', 'micronesia, federated states of', 'korea, north', 'palau - trust territory of the pacific islands', 'reunion and associated islands', 'st. helena', 'st. kitts and nevis', 'st. lucia', 'st. pierre and miquelon','st. lucia and st. vincent', 'st. marteen, st. eustatius, and saba', 'korea, south', 'sudan', 'yugoslavia (& former territory)')
# Map and update the country names in co2 dataset using the manually_matched names
count <- 1
for (country in unmatched_countries) {
co2 <- co2 %>%
mutate(country_name = ifelse (country_name == country, country_name <- manually_matched[count], co2$country_name))
count <- count + 1
}
# Review the remaining unmatched countries
unmatched_round2 <- unmatched_country_names(co2,country_list, 'country_name')
Joining by: "country_name"
unmatched_round2
Manually checked and these countries do not exist in the country_list dataset.
Resolved to remove these from the co2 dataset.
# Drop countries from co2 data which do not have a corresponding record in the country_list dataset
co2 <- subset(co2, !(country_name %in% unmatched_round2$country_name.x))
# Combine the co2 emissions for countries with the same name but different codes
co2 <- co2 %>%
group_by(country_name, Year) %>%
summarise(co2 = sum(`Annual COâ‚‚ emissions (Global Carbon Project (2017)) (million tonnes)`))
---
title: "CO2 Data Cleaning"
output: html_notebook
---
```{r}
# Import required libraries
library(readr)
library(dplyr)
library(ggvis)
library(fuzzyjoin)
library(janitor)
```


## Import CSV files
```{r}
# Import raw CSV files
country_list <- read_delim('/Users/todddequincey/Google Drive/Education & Learning/Data Science/USQ/Semester 2 2018/CSC8002 Big Data Management/Project/Datasets/cleaned_datasets/country_cleaned.csv', delim=",")
co2 <-  read_delim('/Users/todddequincey/Google Drive/Education & Learning/Data Science/USQ/Semester 2 2018/CSC8002 Big Data Management/Project/Datasets/uncleaned_datasets/annual-co-emissions-per-country.csv', delim=",")
```

## Fuzzy Match
Joining of the CO2 and country_list datasets based on the country name attribute in both datasets.  
As the country names may not 100% match, a fuzzy match of the country names and manual mapping of any unmatched names is completed to allow for the two datasets to be used together.   
```{r}
# Set country names to lower in both files to avoid case sensitivity
co2$country_name <- tolower(co2$Entity)
country_list$country_name <- tolower(country_list$`COUNTRY NAME`)
```


```{r}
# Function to fuzzy match country names
unmatched_country_names <- function(df1, df2, column_name){
  df1_col <- unique(select(df1,column_name))
  df2_col <- unique(select(df2,column_name))
  
  name_matches <- stringdist_left_join(x = df1_col, y= df2_col, max_dist=0)
  
  # Return all positive matches from fuzzy join which are NOT exact matches
  positive_name_matches <- name_matches %>% filter(complete.cases(.))
  filter(positive_name_matches, !(1 == 2))
  unmatched <- name_matches %>% filter_all(any_vars(is.na(.)))
  return(unmatched)
}
```

```{r}
# Return a dataframe of country names which do not match between country_list and co2 datasets
unmatched <- unmatched_country_names(co2,country_list, 'country_name')
unmatched
```
`r as.integer(count(unique(unmatched)))` rows with unmatched names.  
Majority of the unmatched countries are either regions of the world (e.g. south america), countries with island, south, north or saint in their name or other records (e.g. world, annex b), which may have been abbreviated in the country_list dataset (e.g. north might be nth) or are irrelvant for our purposes.  
Also several 'annex b' types, which are unknown.  
All of these non-country records are to be removed. 

```{r}
# Non-countries in unmatched
non_countries <- c('africa', 'asia', 'central america', 'eu28', 'europe', 'kp annex b', 'middle east', 'non kp annex b', 'non-oecd', 'north america', 'oecd','oceania', 'statistical differences', 'world')

# Create a subset of the data excluding the non-country records
co2 <- subset(co2, !(country_name %in% non_countries))

# Group by country name and review for further non-country records
co2 %>%
  group_by(country_name) %>%
  summarise()
```

```{r}
unmatched2 <- unmatched_country_names(co2,country_list, 'country_name')
unmatched2
```
After re-running the match on the cleaned data, there are now `r as.integer(count(unique(unmatched2)))` countries which do not have matched.  
Options to match them could include regex or more detailed, prescriptive fuzzy matching, however given there are only `r as.integer(count(unique(unmatched2)))` unmatched countries, resolved to manual match them.  


```{r}
# List of unmatched countries which have a corresponding record in the country_list dataset
unmatched_countries = c('czechoslovakia', 'democratic republic of congo', 'faeroe islands', 'falkland islands', 'gambia', 'micronesia (country)', 'north korea', 'palau', 'reunion', 'saint helena', 'saint kitts and nevis', 'saint lucia', 'saint pierre and miquelon', 'saint vincent and the grenadines', 'sint maarten (dutch part)', 'south korea', 'south sudan', 'yugoslavia')

# A list manually mapping the unmatched countries to the country_list dataset
manually_matched <- c('czech republic', 'congo', 'faroe islands', 'falkland islands (islas malvinas)', 'gambia  the', 'micronesia, federated states of', 'korea, north', 'palau - trust territory of the pacific islands', 'reunion and associated islands', 'st. helena', 'st. kitts and nevis', 'st. lucia', 'st. pierre and miquelon','st. lucia and st. vincent', 'st. marteen, st. eustatius, and saba', 'korea, south', 'sudan', 'yugoslavia (& former territory)')


# Map and update the country names in co2 dataset using the manually_matched names
count <- 1
for (country in unmatched_countries) {
  co2 <- co2 %>%
    mutate(country_name = ifelse (country_name == country, country_name <- manually_matched[count], co2$country_name))
    count <- count + 1
}
```


```{r}
# Review the remaining unmatched countries
unmatched_round2 <- unmatched_country_names(co2,country_list, 'country_name')
unmatched_round2
```
Manually checked and these countries do not exist in the country_list dataset.  
Resolved to remove these from the co2 dataset.  


```{r}
# Drop countries from co2 data which do not have a corresponding record in the country_list dataset
co2 <- subset(co2, !(country_name %in% unmatched_round2$country_name.x))
```



```{r}
# Combine the co2 emissions for countries with the same name but different codes
co2 <- co2 %>%
  group_by(country_name, Year) %>%
  summarise(co2 = sum(`Annual CO₂ emissions (Global Carbon Project (2017)) (million tonnes)`))
```


## General Data Cleaning
##### Remove any empty rows and columns
```{r}
# Remove rows and columns with totally empty values 
remove_empty(co2, 'rows')
remove_empty(co2, 'cols')
```
**Results**  
No empty rows or columns found. Therefore no changes to the df


##### Check for missing values
```{r}
# Check for missing values / nulls on key 
anyNA(co2)
```
**Results**  
No missing values identified.


##### Check for negative CO2 emission values
```{r}
# Return the minimum value for CO2 emissions column
min(co2[,3])
```
**Results**  
Minimum value in co2_emissions column is `r min(co2[,3])`. Therefore, no negative values.


##### Check year range is correct
```{r}
# Return the minimum year value
min_year <- min(co2[,2])
max_year <-  max(co2[,2])
min_year
max_year
```
**Results**  
Minimum value in the year column is `r min_year`, with a maximum of `r max_year`.  
Values are within expected range.


## Export results to CSV
```{r}
# Write cleaned data back to csv for import into MySQL
write_csv(co2, '/Users/todddequincey/Google Drive/Education & Learning/Data Science/USQ/Semester 2 2018/CSC8002 Big Data Management/Project/Datasets/cleaned_datasets/co2_cleaned.csv')
```




