Data Wrangling (Data Preprocessing)

Setup

# Load the necessary packages required to reproduce the report. For example:

library(kableExtra)
library(magrittr)
library(readr)
library(rvest)

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:readr':
## 
##     guess_encoding

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:kableExtra':
## 
##     group_rows

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:magrittr':
## 
##     extract

library(deductive)

## Warning: package 'deductive' was built under R version 4.1.3

library(validate)

## Warning: package 'validate' was built under R version 4.1.3

## 
## Attaching package: 'validate'

## The following object is masked from 'package:dplyr':
## 
##     expr

library(Hmisc)

## Warning: package 'Hmisc' was built under R version 4.1.3

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:validate':
## 
##     expr

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:validate':
## 
##     label, label<-

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(outliers)

## Warning: package 'outliers' was built under R version 4.1.3

library(MVN)

## Warning: package 'MVN' was built under R version 4.1.3

library(ggplot2)
library(knitr)
library(forecast)

## Warning: package 'forecast' was built under R version 4.1.3

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Student names, numbers and percentage of contributions

Group information
Student name	Student number	Percentage of contribution
Muhammad Alkourashi	s3586013	33.34%
Luxi Zou	s3957680	33.33%
Lifang Chen	s3957681	33.33%

Executive Summary

In this data preprocessing report, we have completed the following.

Loading relevant packages and data sets: Two data sets sourced from Worldbank.org were preprocessed. Packages involved in visualisation, transformation, outlier detecting and manipulating data frames were required.
Check and converted data types: We checked that character variables were correctly assigned, converted all the numerical recorded observations, and converted a single variable to a factor. We have kept the Year variable as a character.
Re orientated and merged data sets: As our data sets required a merge, we needed to pivot both data sets to a long format in order to avoid overlapping or cancelling out data. We then completed a left merge and successfully combined data sets.
Created an additional variable: We created an additional variable by multiplying the GDP per capita and Pupil-teacher ratio, with the designated name ‘GDP per Classroom’.
Removed missing values from the data: We have removed any observations containing with missing values, leaving us with less observations predominantly from the 1960:1970 Year variables.
Selected a comparison range from the data: We have filtered the data to shown only results from 1985 and 2015. We have done so for comparison and demonstration purposes.
Visualised clean data: We have displayed the output of both GDP per capita and Pupil-teacher ratio for 1985 and 2015 via side-by-side boxplots and histograms. All outputs show a right skew distribution.
Checked for outliers: We have conducted the Chi-squared QQ Plot function via MVN, as well as comparing the Z Scores of data to determine outliers. We have not removed any outliers from the data sets or comparison data.
Performed data transformation on comparison data: We utilised the Log10 and BoxCox transformation methods to attempt normalising the data. Doing so produced a comparitively more normal distribution of both the GDP per capita and Pupil-teacher ratios.

Data

Both data sets were located from the website World Bank (Worldbank.org, 2022). Our first data set provides recorded observations of primary school Pupils-teacher ratios across hundreds of countries over the past 60+ years. This data set can be located at: https://data.worldbank.org/indicator/SE.PRM.ENRL.TC.ZS?view=chart. Variables in this data set include country names, country codes, indicator names, indicator code, and yearly recorded observations ranging from 1960 to 2021.

Our second data set comes from the same data host, with yearly observations GDP per capita in US dollars across same countries for the past 60+ years. This data set can be viewed at: https://data.worldbank.org/indicator/NY.GDP.PCAP.CD?view=chart. Variables similarly include country names, country codes, indicator names, indicator codes, and GDP per capita observations of years ranging from 1960 to 2021.

After downloading and viewing the head of both sets, we’ve had to skip the first 4 rows for both data sets to allow for proper formatting of columns and column names. R codes are provided below. We intend to merge these data sets, and undertake the data preprocessing methods we have learnt throughout this unit.

# Importing data

pri_tea <- read_csv("API_SE.PRM.ENRL.TC.ZS_DS2_en_csv_v2_4029478.csv",
                    skip = 4)

## New names:
## * `` -> ...67

## Rows: 266 Columns: 67

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (4): Country Name, Country Code, Indicator Name, Indicator Code
## dbl (50): 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, ...
## lgl (13): 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 2020, ...

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(pri_tea, n=15)

gdp_pcap <- read_csv("API_NY.GDP.PCAP.CD_DS2_en_csv_v2_4019678.csv",
                     skip = 4)

## New names:
## * `` -> ...67

## Rows: 266 Columns: 67

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (4): Country Name, Country Code, Indicator Name, Indicator Code
## dbl (61): 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, ...
## lgl  (2): 2021, ...67

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(gdp_pcap, n=15)

Understand

We can see from the import of both data sets, that there is a consistent uniformity of country name, country code, indicator names and codes as character variables. Many of the year variables seen have been initially read as double. There are quite a few year variables that have read as logical variables, most likely as they are made up of NA observations.

As all year columns are required to read as a double or numerical value, we are required to use the as.numeric function to convert these misread columns, and unify the data. We have also decided to convert the ‘country name’ variable on both data sets to a factor using as.factor, as this variable contains unchanging categories.

# This is the R chunk for the Understand Section

str(pri_tea)

## spec_tbl_df [266 x 67] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Country Name  : chr [1:266] "Aruba" "Africa Eastern and Southern" "Afghanistan" "Africa Western and Central" ...
##  $ Country Code  : chr [1:266] "ABW" "AFE" "AFG" "AFW" ...
##  $ Indicator Name: chr [1:266] "Pupil-teacher ratio, primary" "Pupil-teacher ratio, primary" "Pupil-teacher ratio, primary" "Pupil-teacher ratio, primary" ...
##  $ Indicator Code: chr [1:266] "SE.PRM.ENRL.TC.ZS" "SE.PRM.ENRL.TC.ZS" "SE.PRM.ENRL.TC.ZS" "SE.PRM.ENRL.TC.ZS" ...
##  $ 1960          : logi [1:266] NA NA NA NA NA NA ...
##  $ 1961          : logi [1:266] NA NA NA NA NA NA ...
##  $ 1962          : logi [1:266] NA NA NA NA NA NA ...
##  $ 1963          : logi [1:266] NA NA NA NA NA NA ...
##  $ 1964          : logi [1:266] NA NA NA NA NA NA ...
##  $ 1965          : logi [1:266] NA NA NA NA NA NA ...
##  $ 1966          : logi [1:266] NA NA NA NA NA NA ...
##  $ 1967          : logi [1:266] NA NA NA NA NA NA ...
##  $ 1968          : logi [1:266] NA NA NA NA NA NA ...
##  $ 1969          : logi [1:266] NA NA NA NA NA NA ...
##  $ 1970          : num [1:266] NA NA 41.2 NA NA ...
##  $ 1971          : num [1:266] NA NA 39.1 NA 44.4 ...
##  $ 1972          : num [1:266] NA NA 40.4 NA 42.8 ...
##  $ 1973          : num [1:266] NA NA 38.3 NA 40.9 ...
##  $ 1974          : num [1:266] NA NA 37.2 NA 32.3 ...
##  $ 1975          : num [1:266] NA NA 37.3 NA NA ...
##  $ 1976          : num [1:266] NA NA 35.8 NA NA ...
##  $ 1977          : num [1:266] NA NA NA NA NA ...
##  $ 1978          : num [1:266] NA NA NA NA NA ...
##  $ 1979          : num [1:266] NA NA NA NA NA ...
##  $ 1980          : num [1:266] NA NA NA NA NA ...
##  $ 1981          : num [1:266] NA NA NA NA NA ...
##  $ 1982          : num [1:266] NA NA NA NA 31.5 ...
##  $ 1983          : num [1:266] NA NA NA NA 36.8 ...
##  $ 1984          : num [1:266] NA NA NA NA 31.8 ...
##  $ 1985          : num [1:266] NA NA NA NA 29.8 ...
##  $ 1986          : num [1:266] NA NA NA NA 31.3 ...
##  $ 1987          : num [1:266] NA NA NA NA 33.4 ...
##  $ 1988          : num [1:266] NA NA NA NA 32.6 ...
##  $ 1989          : num [1:266] NA NA 36.4 NA 33.4 ...
##  $ 1990          : num [1:266] NA NA 41.2 NA NA ...
##  $ 1991          : num [1:266] NA NA NA NA 31.9 ...
##  $ 1992          : num [1:266] NA NA NA NA 27.2 ...
##  $ 1993          : num [1:266] NA NA 48.7 NA NA ...
##  $ 1994          : num [1:266] NA NA 57.9 NA NA ...
##  $ 1995          : num [1:266] NA NA NA NA NA ...
##  $ 1996          : num [1:266] NA NA NA NA NA ...
##  $ 1997          : num [1:266] NA NA NA NA NA ...
##  $ 1998          : num [1:266] NA NA 32.1 NA 41.8 ...
##  $ 1999          : num [1:266] 19.2 NA 33.2 NA NA ...
##  $ 2000          : num [1:266] 19.1 NA NA NA NA ...
##  $ 2001          : num [1:266] 18.9 NA NA NA NA ...
##  $ 2002          : num [1:266] 19.1 NA NA NA NA ...
##  $ 2003          : num [1:266] 18.4 NA NA NA NA ...
##  $ 2004          : num [1:266] 18.5 NA NA NA NA ...
##  $ 2005          : num [1:266] 18.1 NA NA NA NA ...
##  $ 2006          : num [1:266] 18.2 NA 42.3 NA NA ...
##  $ 2007          : num [1:266] 17.4 NA 42.8 NA 41 ...
##  $ 2008          : num [1:266] 17.3 NA 43.7 NA NA ...
##  $ 2009          : num [1:266] 17.1 NA 42.8 NA NA ...
##  $ 2010          : num [1:266] 16.8 NA 44.4 NA 45.6 ...
##  $ 2011          : num [1:266] 15.1 NA 43.5 NA 42.5 ...
##  $ 2012          : num [1:266] 14.8 NA 44.7 NA NA ...
##  $ 2013          : num [1:266] NA NA 45.7 NA NA ...
##  $ 2014          : num [1:266] NA NA NA NA NA ...
##  $ 2015          : num [1:266] NA NA 43.4 NA 50 ...
##  $ 2016          : num [1:266] NA NA 44.6 NA NA ...
##  $ 2017          : num [1:266] NA NA 48 NA NA ...
##  $ 2018          : num [1:266] NA NA 48.8 NA NA ...
##  $ 2019          : num [1:266] NA NA NA NA NA NA NA NA NA NA ...
##  $ 2020          : logi [1:266] NA NA NA NA NA NA ...
##  $ 2021          : logi [1:266] NA NA NA NA NA NA ...
##  $ ...67         : logi [1:266] NA NA NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Country Name` = col_character(),
##   ..   `Country Code` = col_character(),
##   ..   `Indicator Name` = col_character(),
##   ..   `Indicator Code` = col_character(),
##   ..   `1960` = col_logical(),
##   ..   `1961` = col_logical(),
##   ..   `1962` = col_logical(),
##   ..   `1963` = col_logical(),
##   ..   `1964` = col_logical(),
##   ..   `1965` = col_logical(),
##   ..   `1966` = col_logical(),
##   ..   `1967` = col_logical(),
##   ..   `1968` = col_logical(),
##   ..   `1969` = col_logical(),
##   ..   `1970` = col_double(),
##   ..   `1971` = col_double(),
##   ..   `1972` = col_double(),
##   ..   `1973` = col_double(),
##   ..   `1974` = col_double(),
##   ..   `1975` = col_double(),
##   ..   `1976` = col_double(),
##   ..   `1977` = col_double(),
##   ..   `1978` = col_double(),
##   ..   `1979` = col_double(),
##   ..   `1980` = col_double(),
##   ..   `1981` = col_double(),
##   ..   `1982` = col_double(),
##   ..   `1983` = col_double(),
##   ..   `1984` = col_double(),
##   ..   `1985` = col_double(),
##   ..   `1986` = col_double(),
##   ..   `1987` = col_double(),
##   ..   `1988` = col_double(),
##   ..   `1989` = col_double(),
##   ..   `1990` = col_double(),
##   ..   `1991` = col_double(),
##   ..   `1992` = col_double(),
##   ..   `1993` = col_double(),
##   ..   `1994` = col_double(),
##   ..   `1995` = col_double(),
##   ..   `1996` = col_double(),
##   ..   `1997` = col_double(),
##   ..   `1998` = col_double(),
##   ..   `1999` = col_double(),
##   ..   `2000` = col_double(),
##   ..   `2001` = col_double(),
##   ..   `2002` = col_double(),
##   ..   `2003` = col_double(),
##   ..   `2004` = col_double(),
##   ..   `2005` = col_double(),
##   ..   `2006` = col_double(),
##   ..   `2007` = col_double(),
##   ..   `2008` = col_double(),
##   ..   `2009` = col_double(),
##   ..   `2010` = col_double(),
##   ..   `2011` = col_double(),
##   ..   `2012` = col_double(),
##   ..   `2013` = col_double(),
##   ..   `2014` = col_double(),
##   ..   `2015` = col_double(),
##   ..   `2016` = col_double(),
##   ..   `2017` = col_double(),
##   ..   `2018` = col_double(),
##   ..   `2019` = col_double(),
##   ..   `2020` = col_logical(),
##   ..   `2021` = col_logical(),
##   ..   ...67 = col_logical()
##   .. )
##  - attr(*, "problems")=<externalptr>

pri_tea$`Country Name` <- as.factor(pri_tea$`Country Name`)
pri_tea$`Indicator Code` <- factor(c("Indicator Code"),
                                 levels = c("Indicator Code"))
pri_tea[,5:67] <- sapply(pri_tea[,5:67], as.numeric)


attributes(gdp_pcap)

## $names
##  [1] "Country Name"   "Country Code"   "Indicator Name" "Indicator Code"
##  [5] "1960"           "1961"           "1962"           "1963"          
##  [9] "1964"           "1965"           "1966"           "1967"          
## [13] "1968"           "1969"           "1970"           "1971"          
## [17] "1972"           "1973"           "1974"           "1975"          
## [21] "1976"           "1977"           "1978"           "1979"          
## [25] "1980"           "1981"           "1982"           "1983"          
## [29] "1984"           "1985"           "1986"           "1987"          
## [33] "1988"           "1989"           "1990"           "1991"          
## [37] "1992"           "1993"           "1994"           "1995"          
## [41] "1996"           "1997"           "1998"           "1999"          
## [45] "2000"           "2001"           "2002"           "2003"          
## [49] "2004"           "2005"           "2006"           "2007"          
## [53] "2008"           "2009"           "2010"           "2011"          
## [57] "2012"           "2013"           "2014"           "2015"          
## [61] "2016"           "2017"           "2018"           "2019"          
## [65] "2020"           "2021"           "...67"         
## 
## $spec
## cols(
##   `Country Name` = col_character(),
##   `Country Code` = col_character(),
##   `Indicator Name` = col_character(),
##   `Indicator Code` = col_character(),
##   `1960` = col_double(),
##   `1961` = col_double(),
##   `1962` = col_double(),
##   `1963` = col_double(),
##   `1964` = col_double(),
##   `1965` = col_double(),
##   `1966` = col_double(),
##   `1967` = col_double(),
##   `1968` = col_double(),
##   `1969` = col_double(),
##   `1970` = col_double(),
##   `1971` = col_double(),
##   `1972` = col_double(),
##   `1973` = col_double(),
##   `1974` = col_double(),
##   `1975` = col_double(),
##   `1976` = col_double(),
##   `1977` = col_double(),
##   `1978` = col_double(),
##   `1979` = col_double(),
##   `1980` = col_double(),
##   `1981` = col_double(),
##   `1982` = col_double(),
##   `1983` = col_double(),
##   `1984` = col_double(),
##   `1985` = col_double(),
##   `1986` = col_double(),
##   `1987` = col_double(),
##   `1988` = col_double(),
##   `1989` = col_double(),
##   `1990` = col_double(),
##   `1991` = col_double(),
##   `1992` = col_double(),
##   `1993` = col_double(),
##   `1994` = col_double(),
##   `1995` = col_double(),
##   `1996` = col_double(),
##   `1997` = col_double(),
##   `1998` = col_double(),
##   `1999` = col_double(),
##   `2000` = col_double(),
##   `2001` = col_double(),
##   `2002` = col_double(),
##   `2003` = col_double(),
##   `2004` = col_double(),
##   `2005` = col_double(),
##   `2006` = col_double(),
##   `2007` = col_double(),
##   `2008` = col_double(),
##   `2009` = col_double(),
##   `2010` = col_double(),
##   `2011` = col_double(),
##   `2012` = col_double(),
##   `2013` = col_double(),
##   `2014` = col_double(),
##   `2015` = col_double(),
##   `2016` = col_double(),
##   `2017` = col_double(),
##   `2018` = col_double(),
##   `2019` = col_double(),
##   `2020` = col_double(),
##   `2021` = col_logical(),
##   ...67 = col_logical()
## )
## 
## $problems
## <pointer: 0x00000000261efd70>
## 
## $row.names
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## [163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## [181] 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## [199] 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
## [217] 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
## [235] 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252
## [253] 253 254 255 256 257 258 259 260 261 262 263 264 265 266
## 
## $class
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

gdp_pcap$`Country Name`<- as.factor(gdp_pcap$`Country Name`)
gdp_pcap$`Indicator Code`<- factor (c("Indicator Code"),
                                  levels = c("Indicator Code"))
gdp_pcap[,5:67] <- sapply(gdp_pcap[,5:67], as.numeric)

We’ve confirmed the imported variable types using the str() and attributes() scripts to check on both data sets’ structures and attributes.

As indicated, we have converted ‘country names’ into a factor variable in both data sets, this symmetry was required in order to minimise issues while merging the data sets. Indicator Code into only one factor since they are the same and those are for two data sets merge purposes.

We also successfully converted all year variables into numeric via as.numeric for both data sets. This is to keep consistency of variable types.

Tidy & Manipulate Data I

# This is the R chunk for the Tidy & Manipulate Data I 

P2 <- select(pri_tea, -67) 
g2 <- select(gdp_pcap, -67)

P3 <- P2 [,-(3:4)] %>% pivot_longer(names_to =  "Year", values_to = "Pupil-teacher ratio",  cols= 3:64)

g3 <- g2 [,-(3:4)] %>% pivot_longer(names_to = "Year", values_to = "GDP per capita", cols = 3:64)


merged_country_data <- left_join(P3,g3, key="Country Name")

## Joining, by = c("Country Name", "Country Code", "Year")

str(merged_country_data)

## tibble [16,492 x 5] (S3: tbl_df/tbl/data.frame)
##  $ Country Name       : Factor w/ 266 levels "Afghanistan",..: 13 13 13 13 13 13 13 13 13 13 ...
##  $ Country Code       : chr [1:16492] "ABW" "ABW" "ABW" "ABW" ...
##  $ Year               : chr [1:16492] "1960" "1961" "1962" "1963" ...
##  $ Pupil-teacher ratio: num [1:16492] NA NA NA NA NA NA NA NA NA NA ...
##  $ GDP per capita     : num [1:16492] NA NA NA NA NA NA NA NA NA NA ...

The original data sets both contained a column ‘…67’ with all NA and not representing any information. We have subset the data, removing the column in both data sets as it does not provide any information or observations. Though the data could be seen as tidy, it’s current format makes comparative analysis quite impractical given the orientation of data.

In order for us to properly interpret and merge the data sets, we have used the pivot_longer function to transform the data sets from a wide to long format, we gathered year columns into a new set of variables. This is required as we would otherwise merge data sets with clashing variables, potentially cancelling out observations.

Using left_join with the key “Country Name”, we’ve successfully been able to merge the GDP per capita and Pupil-teacher ratio values of the pivoted data sets. Doing so will allow us to further mutate and tidy data. This pivot and merge has allowed us to be able to view both significant variables recorded by the county for each individual year.

Tidy & Manipulate Data II

# This is the R chunk for the Tidy & Manipulate Data II 

merged_country_data <- merged_country_data %>% mutate (`GDP per Classroom` = `GDP per capita`*`Pupil-teacher ratio`)

To provide further understanding of the relationship between the two variables we have chosen to focus on, we have used the mutate function to multiply the GDP per capita variable by the primary Pupil-teacher ratio. Doing so should give us an estimated GDP per capita attributed to a single classroom for each year.

Scan I

# This is the R chunk for the Scan I

#is.na(merged_country_data) - We have omitted this code to minimise the length of this report.

clean_country_data <-na.omit(merged_country_data)

comparison_data <- clean_country_data %>% filter(Year == 2015 | Year == 1985)

boxplot(comparison_data$`Pupil-teacher ratio` ~ comparison_data$Year, border = 'olivedrab4',
        col = 'olivedrab3', horizontal = TRUE, ylab= "Year", xlab= 'Pupils to Teacher Ratios in 1985 and 2015')

boxplot(comparison_data$`GDP per capita` ~ comparison_data$Year, border = 'olivedrab4',
        col = 'olivedrab3', horizontal = TRUE, ylab= "Year", xlab= 'GDP per-capita in 1985 and 2015 (USD$)')

ggplot(comparison_data, aes(x = `GDP per capita`)) +
  geom_histogram(fill = "olivedrab3", colour = "white") +
  facet_grid(Year ~ .) +theme_bw() + xlab("GDP per-capita (Per Country)") + ylab("Country Frequency")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(comparison_data, aes(x = `Pupil-teacher ratio`)) +
  geom_histogram(fill = "olivedrab3", colour = "white") +
  facet_grid(Year ~ .) + theme_bw() + xlab("Pupil-teacher Ratio (Students)") + ylab("CountryFrequency")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Given that the data sets by nature are observations over time, we have decided not to imputate or replace any missing values. Instead, we have opted to use the na.omit function, removing any observation containing a missing value. Doing so has removed roughly 46% of the entire data set’s observations.

Though this practice is not recommended, we believe replacing missing values with the mean or median value may alter the pattern of any existing linear relationship. Given that every country also faces different cultural, economic and social circumstances, if we were to replace any missing values, it would have to be conducted for each country independently.

We’ve decided to check and initially compare the GDP per capita and Pupil-teacher ratio of two years, 1985 and 2015 which are 30 years apart. This was in attempt to see if the overall snapshots of time have changed, and in what way.

We have created boxplots using boxplot(), and also used the ggplot geom_histogram functions to visualise any changes in data from the two picked observations. Both the boxplots and histograms showed a right skewed distribution in both years 1985 and 2015.

Scan II

# This is the R chunk for the Scan II
comparison_data %>% summary()

##        Country Name Country Code           Year           Pupil-teacher ratio
##  Algeria     :  2   Length:330         Length:330         Min.   : 7.845     
##  Angola      :  2   Class :character   Class :character   1st Qu.:17.056     
##  Arab World  :  2   Mode  :character   Mode  :character   Median :24.827     
##  Austria     :  2                                         Mean   :26.320     
##  Bahamas, The:  2                                         3rd Qu.:34.047     
##  Bahrain     :  2                                         Max.   :69.510     
##  (Other)     :318                                                            
##  GDP per capita      GDP per Classroom
##  Min.   :    34.52   Min.   :   1818  
##  1st Qu.:   847.57   1st Qu.:  29197  
##  Median :  2936.91   Median :  73330  
##  Mean   :  9477.65   Mean   : 146127  
##  3rd Qu.:  9171.13   3rd Qu.: 173287  
##  Max.   :167313.27   Max.   :2097893  
##

comparison_obs <- comparison_data %>% dplyr::select(`GDP per capita`,`Pupil-teacher ratio`)
head(comparison_obs)

comparison_outliers <- mvn(data = comparison_obs, multivariateOutlierMethod = "quan", showOutliers = TRUE)

z_scores_tpr <- comparison_data$`Pupil-teacher ratio` %>% scores(type = "z")
z_scores_gdp <- comparison_data$`GDP per capita` %>% scores(type = "z")
which(abs(z_scores_tpr) >3 )

## [1] 209 285

which(abs(z_scores_gdp) >3 )

## [1]  32 168 179 181 184 221

To determine outliers of both the GDP per capita and Pupil-teacher ration, we have used two separate methods. The first being the mvn “quan” multivariate outlier method, which produced a Chi-square QQ plot indicating 118 of 330 variables could contain outliers. The second was by assessing the Z scores for all data values via the scores funciton, highlighting values with a Z score above 3. Of the 330 variables, 8 values were found to be potential outliers.

Given the high percentage of estimated outliers highlighted in the Chi-Square plot, we have decided not to remove any outliers from the remaining data frame of observations. As also previously mentioned, as the data set itself is recorded cyclically, we should expect some variance outside of the IQR, so the removal of outliers may not be appropriate.

Transform

# This is the R chunk for the Transform Section
comparison_data$log10_gdppc <- comparison_data$`GDP per capita` %>% log10()
comparison_data$log10_pttr <- comparison_data$`Pupil-teacher ratio` %>% log10()

ggplot(comparison_data, aes(x = `log10_gdppc`)) +
  geom_histogram(fill = "light blue", colour = "white") +
  facet_grid(Year ~ .) +theme_bw() + xlab("Log10 GDP per-capita (Per Country)") + ylab("Country Frequency")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(comparison_data, aes(x = `log10_pttr`)) +
  geom_histogram(fill = "light blue", colour = "white") +
  facet_grid(Year ~ .) + theme_bw() + xlab(" Log10 Pupil-teacher Ratio (Students)") + ylab("CountryFrequency")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

comparison_data$logbc_gdppc <- comparison_data$`GDP per capita` %>% BoxCox(lambda = "auto")
comparison_data$logbc_pttr <- comparison_data$`Pupil-teacher ratio` %>% BoxCox(lambda = "auto")

ggplot(comparison_data, aes(x = `logbc_gdppc`)) +
  geom_histogram(fill = "orange", colour = "white") +
  facet_grid(Year ~ .) +theme_bw() + xlab("BoxCox GDP per-capita (Per Country)") + ylab("Country Frequency")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(comparison_data, aes(x = `logbc_pttr`)) +
  geom_histogram(fill = "orange", colour = "white") +
  facet_grid(Year ~ .) + theme_bw() + xlab("BoxCox Pupil-teacher Ratio (Students)") + ylab("CountryFrequency")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We were able to perform two types of data transformation to our selected comparison data. The first was the Log10 method, given that the histograms showed right skewed data. As seen in the above transformed histogram, we see a vastly more symmetrical and normal distribution of variable data.

Our second and more preferred method of data transformation undertaken was the BoxCox Transformation method. This is because the data set itself contains recorded data over a series of time, and caters to right skewed distributions.

This method is more suitable than the log10 given the length of which data was recorded. It also shows a simillarly more normal distribution of data compared to the pre-transformed variables.

Reference:

UNESCO Institute for Statistics (February 2020) Pupil-teacher ratio, Primary. Available at: https://data.worldbank.org/indicator/SE.PRM.ENRL.TC.ZS?view=chart. (Accessed: 19 May 2022).

World Bank national accounts data, and OECD National Accounts data files, GDP per capita (Current US$).Available at: https://data.worldbank.org/indicator/NY.GDP.PCAP.CD?view=chart. (Accessed: 19 May 2022).