MATH2349 Semester 2, 2019

Required packages

Provide the packages required to reproduce the report.

library(readxl)

## Warning: package 'readxl' was built under R version 3.5.3

library(readr)

## Warning: package 'readr' was built under R version 3.5.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.5.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

## Warning: package 'tidyr' was built under R version 3.5.3

library(Hmisc)

## Warning: package 'Hmisc' was built under R version 3.5.3

## Loading required package: lattice

## Warning: package 'lattice' was built under R version 3.5.3

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.5.3

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(outliers)
library(forecast)

## Warning: package 'forecast' was built under R version 3.5.3

Executive Summary

Three datasets used were sourced from the World Bank including the global carbon dioxide emissions dataset, a country dataset with categorical information such as region and income group and a world population dataset. Once all the datasets were imported into R, the next step involved joining the CO2 emissions dataset with the country dataset using a left join through the common variable ‘Country Code’. Before the second join could be actioned, both the newly combined emission-country dataset and the world population dataset would need to be made tidy.

A number of conversions were applied to the datasets including converting ‘Income Group’ from a character variable to an ordinal factor variable in the emissions-country dataset. The emissions-country dataset and population dataset were both identified as untidy. The column names, 1960-2014 for emissions-country and 1960-2018 for population datasets were converted from wide to long format via the gather function.

In order to join the datasets, a new variable was created called ‘id’ by combining the country code and year of each dataset separately using the unite function. Using a left join, both data sets were combined via ‘id’. The dataset was then made tidy once more by separating the ‘id’ variable into country code and year. Unneeded variables were removed and final conversions made.

Using the mutate function, a new variable was created titled CO2 emissions (KT) Per Capita by dividing CO2 emissions by the population per country and per year.

The CO2 emissions column was then scanned for missing values, generating 2261 NA’s. The NA’s were imputed with the median CO2 emission for each country. In addition, countries which had zero entries in the emission columns were removed. The NA’s identified in the population column however were removed given they represented less than 2% of observations.

The dataset was subset with the United States. Three outliers were identified within the CO2 Emissions column and ultimately capped, assigning the outlier to its nearest neighbour. The CO2 emissions of the US subset was transformed to overcome an extreme left skew using a power of 4 function. The histogram generated was closer to normal than any other transformation function including Box Cox. ## Data The datasets selected contains Carbon dioxide (CO2) emissions data from 1960 to 2014. Carbon dioxide emissions are those stemming from the burning of fossil fuels and the manufacture of cement. They include carbon dioxide produced during consumption of solid, liquid, and gas fuels and gas flaring. The datasets have been sourced from the World Bank: https://data.worldbank.org/indicator/EN.ATM.CO2E.KT. The first dataset, emissions, contains the carbon dioxide emssions per year across each country. The second dataset, country, contains the categorical information such as the region of each country and the income group. Both datasets contain the country code character variable. This variable will be used to join both datasets together.

A third dataset, population, contains population records from 1960 to 2014, also soruced from The World Bank: https://data.worldbank.org/indicator/SP.POP.TOTL.

First Step, import the first dataset, ‘emissions’ and skip the first 4 rows. The ‘emissions’ dataset labelled ‘emis’ contains 264 observations across 64 variables.

emis<-read_csv("emissions.csv", skip=4)

## Warning: Missing column names filled in: 'X64' [64]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Country Name` = col_character(),
##   `Country Code` = col_character(),
##   `Indicator Name` = col_character(),
##   `Indicator Code` = col_character(),
##   `2015` = col_logical(),
##   `2016` = col_logical(),
##   `2017` = col_logical(),
##   `2018` = col_logical(),
##   X64 = col_logical()
## )

## See spec(...) for full column specifications.

head(emis)

Second step, import the second dataset, ‘country’. No rows skipped. The ‘country’ dataset contains 263 observations across 6 variables.

country<-read_csv("country.csv")

## Warning: Missing column names filled in: 'X6' [6]

## Parsed with column specification:
## cols(
##   `Country Code` = col_character(),
##   Region = col_character(),
##   IncomeGroup = col_character(),
##   SpecialNotes = col_character(),
##   TableName = col_character(),
##   X6 = col_logical()
## )

head(country)

Third step, import the third dataset, ‘wpopulation’. The first 4 rows will be skipped. There are 264 observations and 63 variables

worldpop<-read_csv("wpopulation.csv", skip=4)

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Country Name` = col_character(),
##   `Country Code` = col_character(),
##   `Indicator Name` = col_character(),
##   `Indicator Code` = col_character()
## )

## See spec(...) for full column specifications.

head(worldpop)

The next step is to join the emissions and country datasets first using the common ‘Country Code’ variable. This will be performed by using a left join in order to match all the data within the ‘emissions’ dataset with the ‘country’ dataset.

combined_emis<-emis %>% left_join(country, by = "Country Code")

head(combined_emis)

Once this join is complete, the next step will be to join the new dataset ‘combined_emis’ with the population dataset. However, this will need to be performed after the ‘combined_emis’ Data set and the population dataset conforms to tidy principles.
## Understand The combined_emis dataset has a number of different variables. This includes 8 character variables including ‘Country Name’, ‘Country Code’, ‘Indicator Name’ and ‘Indicator Code’, ‘Region’, ‘IncomeGroup’, SpecialNotes and ‘TableName’. In addition there are 55 numerical variables which range from 1960 to 2014. This will need to be looked at further when we re-shape the data. Furthermore, there are 4 logical variables (titled 2015-2018) which appear empty at this stage. This is to be expected given that emissions data is only available from 1960 to 2014.

str(combined_emis)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 264 obs. of  69 variables:
##  $ Country Name  : chr  "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ Country Code  : chr  "ABW" "AFG" "AGO" "ALB" ...
##  $ Indicator Name: chr  "CO2 emissions (kt)" "CO2 emissions (kt)" "CO2 emissions (kt)" "CO2 emissions (kt)" ...
##  $ Indicator Code: chr  "EN.ATM.CO2E.KT" "EN.ATM.CO2E.KT" "EN.ATM.CO2E.KT" "EN.ATM.CO2E.KT" ...
##  $ 1960          : num  NA 414 550 2024 NA ...
##  $ 1961          : num  NA 491 455 2281 NA ...
##  $ 1962          : num  NA 689 1181 2464 NA ...
##  $ 1963          : num  NA 708 1151 2083 NA ...
##  $ 1964          : num  NA 840 1225 2017 NA ...
##  $ 1965          : num  NA 1008 1188 2175 NA ...
##  $ 1966          : num  NA 1093 1555 2552 NA ...
##  $ 1967          : num  NA 1283 994 2681 NA ...
##  $ 1968          : num  NA 1225 1672 3073 NA ...
##  $ 1969          : num  NA 942 2787 3245 NA ...
##  $ 1970          : num  NA 1672 3583 3744 NA ...
##  $ 1971          : num  NA 1896 3410 4353 NA ...
##  $ 1972          : num  NA 1533 4507 5644 NA ...
##  $ 1973          : num  NA 1639 4881 5291 NA ...
##  $ 1974          : num  NA 1918 4873 4345 NA ...
##  $ 1975          : num  NA 2127 4415 4595 NA ...
##  $ 1976          : num  NA 1988 3286 4950 NA ...
##  $ 1977          : num  NA 2391 3535 5721 NA ...
##  $ 1978          : num  NA 2160 5412 6494 NA ...
##  $ 1979          : num  NA 2241 5504 7587 NA ...
##  $ 1980          : num  NA 1760 5346 5170 NA ...
##  $ 1981          : num  NA 1984 5280 7341 NA ...
##  $ 1982          : num  NA 2101 4650 7308 NA ...
##  $ 1983          : num  NA 2523 5115 7631 NA ...
##  $ 1984          : num  NA 2831 5009 7825 NA ...
##  $ 1985          : num  NA 3509 4701 7880 NA ...
##  $ 1986          : num  180 3143 4661 8056 NA ...
##  $ 1987          : num  447 3124 5816 7444 NA ...
##  $ 1988          : num  612 2868 5130 7327 NA ...
##  $ 1989          : num  649 2776 5009 8984 NA ...
##  $ 1990          : num  1639 2615 5115 5515 407 ...
##  $ 1991          : num  1683 2439 5090 4287 407 ...
##  $ 1992          : num  1463 1393 5196 2516 407 ...
##  $ 1993          : num  1595 1346 5776 2336 411 ...
##  $ 1994          : num  1613 1294 3891 1925 407 ...
##  $ 1995          : num  1668 1243 10975 2087 425 ...
##  $ 1996          : num  1690 1177 10458 2017 455 ...
##  $ 1997          : num  1745 1096 7382 1544 466 ...
##  $ 1998          : num  1797 1041 7308 1753 491 ...
##  $ 1999          : num  1808 821 9156 2985 513 ...
##  $ 2000          : num  2380 774 9542 3022 524 ...
##  $ 2001          : num  2409 818 9732 3223 524 ...
##  $ 2002          : num  2439 1071 12666 3751 532 ...
##  $ 2003          : num  2563 1195 9065 4294 535 ...
##  $ 2004          : num  2618 950 18793 4166 561 ...
##  $ 2005          : num  2721 1327 19156 4254 576 ...
##  $ 2006          : num  2717 1650 22266 3898 546 ...
##  $ 2007          : num  2824 2274 25152 3927 539 ...
##  $ 2008          : num  2659 4206 25709 4375 539 ...
##  $ 2009          : num  2629 6769 27792 4378 517 ...
##  $ 2010          : num  2508 8463 29057 4598 517 ...
##  $ 2011          : num  2501 12240 30341 5240 491 ...
##  $ 2012          : num  1349 10755 33399 4910 488 ...
##  $ 2013          : num  862 10015 32618 5064 477 ...
##  $ 2014          : num  873 9809 34763 5717 462 ...
##  $ 2015          : logi  NA NA NA NA NA NA ...
##  $ 2016          : logi  NA NA NA NA NA NA ...
##  $ 2017          : logi  NA NA NA NA NA NA ...
##  $ 2018          : logi  NA NA NA NA NA NA ...
##  $ X64           : logi  NA NA NA NA NA NA ...
##  $ Region        : chr  "Latin America & Caribbean" "South Asia" "Sub-Saharan Africa" "Europe & Central Asia" ...
##  $ IncomeGroup   : chr  "High income" "Low income" "Lower middle income" "Upper middle income" ...
##  $ SpecialNotes  : chr  NA NA NA NA ...
##  $ TableName     : chr  "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ X6            : logi  NA NA NA NA NA NA ...

In relation to the second dataset (i.e. the pppulation dataset), very much similar to the ‘combined_emis’ dataset, there are 4 character varables associated with the population data. This includes ‘Country Name’, ‘Country Code’, ‘Indicator Name’, ‘Indicator Code’ and 59 numerical variables which range from 1960 to 2018.

Overall, a number of variable conversions will be needed at this stage for both the combined_emis dataset and the population dataset.

In relation to the combined_emis dataset, - ‘IncomeGroup’ will be converted to an ordinal factor variable. - ‘Region’ will also need to be converted into a factor variable. - The 4 logical variables representing 2015-2018 would normally need to be made into numerical variables given that the data under each year column contains carbon emission figures. However, this information will be removed post tidy transformation.

Each step in relation to variable conversions for the combined_emis dataset is summarised below.

Converting IncomeGroup into an ordinal factor variable (ordered)

combined_emis$IncomeGroup<-factor(combined_emis$IncomeGroup, levels = c("Low income", "Lower middle income", "Upper middle income", "High income"), ordered=TRUE)

Checking the conversion

class(combined_emis$IncomeGroup)

## [1] "ordered" "factor"

Converting Region into a factor variable.

combined_emis$Region<-factor(combined_emis$Region)

Checking the conversion

class(combined_emis$Region)

## [1] "factor"

The population dataset which is still yet to be joined will be checked for variable conversions after tidy transformation.

Tidy & Manipulate Data I

The combined_emis dataset is untidy at this stage and does not conform to the tidy data principles. The columns names, 1960 - 2018 are values instead of variables, as a result, we need to transform the data from wide to long format.

Before we begin to convert from wide to long format, a number of variables need to be removed as they provide no value. In addition ‘Year variables’ 2015-2018 contain no information as the dataset only records CO2 emissions from 1960 to 2014. As a result, they will also be removed at this stage.

combined_emis_untidy<-combined_emis %>% select(-'SpecialNotes', -'Indicator Code', -'Indicator Name', -'TableName', -'X64',-'X6', -'2015', -'2016', -'2017', -'2018')

head(combined_emis_untidy)

Now that the variables are removed, we can now transform our combined_emis dataset. The gather function will be used to transform the dataset from wide to long form.

combined_emis_tidy<-combined_emis_untidy %>% gather(3:57, key="Year", value = "CO2_Emissions")

head(combined_emis_tidy)

The CO2_Emissions variable is a numeric variable. No changes required.

class(combined_emis_tidy$CO2_Emissions)

## [1] "numeric"

Now that the combined_emis dataset has been tranformed to tidy, we can now begin to clean the world population dataset before the final join.

Like the combined_emis dataset, the world population dataset needs to be made tidy given that the colnames are values instead of variables. The gather function is used.

worldpop_tidy<-worldpop %>% gather(5:63, key="Year", value = "Population")

head(worldpop_tidy)

Note that, in order to join both datasets, country code alone will not suffice given that it is not the key attribute of the dataset anymore. The key is both Year and the Country Code. The combined_emis dataset will be succesfully joined with the world population dataset on a key variable of country code & Year combined. This will allow the population to be mached with the carbon emissions for each country at each year.

To do this, we will need to unite Country Code and Year in both datasets.

First, we will do this for combined_emis.

combined_emis_tidy_Unite<-combined_emis_tidy %>% unite(id, 'Country Code', 'Year')

head(combined_emis_tidy_Unite)

second, we do the same for population.

worldpop_tidy_unite<-worldpop_tidy %>% unite(id, 'Country Code', 'Year')

head(worldpop_tidy_unite)

To make life easier, some unwanted variables will now be removed from the world population dataset.

worldpop_tidy_unite_clean<-worldpop_tidy_unite %>% select(-'Indicator Name', -'Indicator Code', -'Country Name')

head(worldpop_tidy_unite_clean)

Now that both datasets have a mutual variable, we can now conduct the final left join to obtain our complete dataset.

combined_emis_pop<-combined_emis_tidy_Unite %>% left_join(worldpop_tidy_unite_clean, by='id')

head(combined_emis_pop)

Now that the datasets have been joined, it is clear that the dataset is still not tidy. We can now separate id into country code and year.

combined_emis_pop_tidy<-combined_emis_pop %>% separate(id, into = c("Country_Code", "Year"), sep= "_")

head(combined_emis_pop_tidy)

The final step is to now convert Year into a integer variable as it is currently a character variable.

combined_emis_pop_tidy$Year<-as.integer(combined_emis_pop_tidy$Year)

head(combined_emis_pop_tidy)

The dataset is now complete and tidy. ## Tidy & Manipulate Data II The purpose of joining the emissions and populatins dataset was to be able to perform a mutate function to create a new variable called CO2 Emissions (KT) per Capita. This is achieved by dividing CO2_Emissions by Population for each row of data in order to obtain CO2 Emissions(KT) per Capita per year and country.

combined_emis_pop_tidy_new<-combined_emis_pop_tidy %>% 
  mutate("CO2 Emissions Per Capita"= CO2_Emissions/Population)

head(combined_emis_pop_tidy_new)

Scan I

In this step, we have scanned the ‘combined_emis_pop_tidy’ dataset for missing values and/or NA’s in the CO2 Emissions variable. In this step, we have subset the NA’s in its own dataframe called ’Total_emissions_tidy_missing, as shown below.

Total_emissions_tidy_missing<-combined_emis_pop_tidy %>% subset(is.na(CO2_Emissions))

head(Total_emissions_tidy_missing)

Overall, there are 2261 missing values or NA’s in the CO2_Emissions Column.

Total_emissions_tidy_missing %>% summarise(NA_Count=sum(is.na(CO2_Emissions)))

If we look at the country level, there are a number of countries who have never had CO2 emissions recorded. Countries with an NA_Count of 55 would indicate this including American Samoa, Channel Islands, Guam, Isle of Man, Kosovo, Monaco, Northern Mariana Islands, Not classified, Puerto Rico, San Marino, St. Martin (French part) and Virgin Islands (U.S.).

Total_emissions_tidy_missing %>% 
  group_by(`Country Name`) %>%
  summarise(NA_Count=sum(is.na(CO2_Emissions))) %>%
  arrange(desc(NA_Count))

To take care of the NA’s in the CO2_Emissiosn column, the first step will be impute the missing NA’s values with the average CO2 emissions for each country. Median instead of mean was used given that the CO2 emissions data is strongly skewed to the right.

hist(combined_emis_pop_tidy$CO2_Emissions, xlab="CO2 Emissions", main="Histogram of CO2 Emissions", col="light blue")

After that, the NA’s left will be the countries listed above with an NA_Count of 55 each.

combined_tidy_Imputed <- combined_emis_pop_tidy %>%
  group_by(`Country Name`) %>%
  mutate(CO2_Emissions= ifelse(
    is.na(CO2_Emissions),
    median(CO2_Emissions, na.rm = TRUE),
    CO2_Emissions
  ))

head(combined_tidy_Imputed)

The number of NA’s that remain are 660 (which represents the NA’s of the countries with 55 Na’s each in the list above)

sum(is.na(combined_tidy_Imputed$CO2_Emissions))

## [1] 660

With that, we can now remove these countries from the data set.

combined_tidy_Imputed_1<-combined_tidy_Imputed[!(is.na(combined_tidy_Imputed$CO2_Emissions) | combined_tidy_Imputed$CO2_Emissions==" "), ]

head(combined_tidy_Imputed_1, 40)

When we examine the dataset above, we can see that American Samoa is not found. It has been removed.

combined_tidy_Imputed_1 %>% filter(`Country Name` =="American Samoa")

Using our updated dataset, we will now examine the population variable for missing values. Note that because the main focus of the dataset is CO2 emissions, the countiries that were removed from the data set also had their associated population removed for that year. As a result the total number of missing values in the Population column will be less than if this variables was examined first.

Below we have again created a subset dataframe to show just the missing values for the population.

Total_population_tidy_missing<-combined_tidy_Imputed_1 %>% subset(is.na(Population))

head(Total_population_tidy_missing)

Missing Population values grouped by country is shown below.

Total_population_tidy_missing %>% 
  group_by(`Country Name`) %>%
  summarise(NA_Count=sum(is.na(Population))) %>%
  arrange(desc(NA_Count))

The total number of NA’s in the remaining dataset for Population is 104.

sum(is.na(Total_population_tidy_missing$Population))

## [1] 104

The population data cannot be taken care of using a mean average. This would not be logical. The countries listed above would have had population data reported in other years. One idea was to use the population data from the one particular year and carry it forward to the next where the population was missing. However, this didn’t appear as an appropriate solution as the population data would not reflect the actual populaton in a given year.

For example, Serbia and West Bank and Gaza are missing data from 1960 to 1989. While Sint Maarten (Dutch part) is missing population data from 1960 to 1997.

Below is an example of the population missing for Sint Maarten (Dutch part).

Total_population_tidy_missing %>% filter(`Country Name`== "Sint Maarten (Dutch part)")

Given that there are only 108 missing values across these 5 countries, we will remove them from the data-set. The maximum number of rows for each country would be 55, 55 Multiplied by the 5 countires gives 275 rows of data. Given that the current dataset has 13,860 observations, we can remove the 275 observations given that it is difficult to accurately project the population forwards or backwards based on the available information.

NA_rate = (275/13860)*100
NA_rate

## [1] 1.984127

Given that the remaining NA rate for the missing population data is less than 2%, we will remove them from the dataset.

combined_tidy_Imputed_2<-combined_tidy_Imputed_1 %>% filter(`Country Name`!= "Eritrea",`Country Name`!= "Sint Maarten (Dutch part)", `Country Name`!= "Serbia", `Country Name`!="West Bank and Gaza", `Country Name`!="Kuwait")

head(combined_tidy_Imputed_2)

To confirm if the countries have been removed, we will sum the na’s in the population column.

sum(is.na(combined_tidy_Imputed_2$Population))

## [1] 0

Finally, We will check for special values in the CO2 Emissions and Population columns.

is.specialorNA <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x) | is.na(x))
}

There are no special values in the CO2 emissions column.

sum(sapply(combined_tidy_Imputed_2$CO2_Emissions,is.specialorNA))

## [1] 0

There are no special values in the population column.

sum(sapply(combined_tidy_Imputed_2$Population,is.specialorNA))

## [1] 0

This is to be expected given that both variables were modified with NA’s either imputed with median or removed. ## Scan II For this section, we will be examing outliers using the CO2 Emissions variable. Here we will subset the data to examine the USA given that the main dataset is very large.

Emissions_USA<-combined_tidy_Imputed_2 %>% filter(`Country Name`=="United States")

head(Emissions_USA)

First step is to examine the boxplot of the USA emissions data and determine the presence of outliers.

Emissions_USA$CO2_Emissions %>% boxplot(main="Box Plot of USA CO2 Emissions, 1960 - 2014", ylab="Emissions", col ="light blue")

From the boxplot, it is clear that the emissions data for the USA is not normally distributed. Therefore it is suitable to use Tukey’s Test instead of the z-score to detect outliers. The box plot aproach can’t detect the missing values clearly alone.

As a result, we need to identify the outliers exactly.

outliers <- boxplot(Emissions_USA$CO2_Emissions, main = "Box Plot of USA CO2 Emissions, 1960 - 2014", ylab="Emissions", col="light blue")$out

outliers

## [1] 2890696 2880506 2987208

The rate of outliers.

outlier_rate <- (length(outliers) / nrow(Emissions_USA))*100
outlier_rate

## [1] 5.454545

The percentage of the outliers is greater than 5%, as a result, we will apply the capping approach to handle the oultiers and assign them to the nearest neighbour.

cap <- function(x){
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}

Emissions_USA_capped <- Emissions_USA$CO2_Emissions %>% cap()

The histogram below confirms that the oultiers have been capped.

boxplot(Emissions_USA_capped, col="light blue", main="Box Plot of CO2 Emissions, outliers removed", ylab="CO2 Emissions")

## Transform To use the appropriate transformation strategy we need to confirm the skewness of the chosen variable attirbute. The histogram of the CO2 emissions variable for the USA subset shows a left skew.

hist(Emissions_USA$CO2_Emissions, col="light blue", xlab="CO2 Emissions", main="Histogram of CO2 Emissions in the USA, 1960-2014")

The square power tranformation was first used, but showed little change to the skewness of the CO2 attribute. The fourth power transformation was then applied, which gave the best distribution, at least closer to normal.

Emission_USA_fourth <-Emissions_USA$CO2_Emissions^4

Histogram of the 4th power transformation.

hist(Emission_USA_fourth, col="light blue", xlab="CO2 Emissions", main="Histogram of CO2 Emissions in the USA, 1960-2014")

Below is the statistical summary of the dataset following the 4th power transformation.

summary(Emission_USA_fourth)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 6.885e+25 3.686e+26 5.413e+26 5.733e+26 7.752e+26 1.124e+27

To confirm that the 4th power was the best solution, we also looked into the BoxCox transformation strategy.

Emissions_USA_BoxCox<- BoxCox(Emissions_USA$CO2_Emissions,lambda = TRUE)

Output of the BoxCox transformation.

Emissions_USA_BoxCox

##  [1] 2890695 2880505 2987207 3119230 3255994 3390922 3561877 3695708
##  [9] 3831354 4024748 4328904 4356769 4564952 4770194 4598487 4406329
## [17] 4613100 4742292 4890860 4901795 4723209 4535799 4306747 4341877
## [25] 4475191 4492554 4495462 4688372 4892525 4955080 4823402 4820846
## [33] 4909533 5028673 5094353 5132919 5252111 5368714 5401010 5504668
## [41] 5693684 5595793 5641308 5675701 5756074 5789726 5697285 5789030
## [49] 5614110 5263504 5395531 5289680 5119435 5159160 5254278
## attr(,"lambda")
## [1] TRUE

Histogram of the BoxCox transformation.

hist(Emissions_USA_BoxCox, col="light blue", xlab="CO2 Emissions", main="Histogram of CO2 Emissions in the USA, 1960-2014")

As shown above, the BoxCox transformation did not provide the desired normal distribution. It is clear that the 4th power transformation provided a distribution closer to normal than the BoxCox transformation.

MATH2349 Semester 2, 2019

Assignment 3

Andrew Mina s3807663, Thomas McLaughlin s3349230, Erdal Avsar s3766954

Required packages

Executive Summary

Tidy & Manipulate Data I

Scan I