Required packages

Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.

# This is the R chunk for the required packages
There were 24 warnings (use warnings() to see them)
library(readxl)
library(dplyr)
library(tidyverse)
library(MVN)
library(outliers)
library(ggplot2)
library(forecast)

Executive Summary

Data pre-processing for this task involves the following steps. First, GDP, crime rate and population are selected and merged into one data frame by a common attribute, ‘Country Code’. I then summarise the data to check its quality, e.g., whether the data types are correct. IncomeGroup is stored as character, so it is converted to a factor. The summary also shows missing values in several attributes. The next step is to tidy the data. Missing 2018 values of ‘GDP’ are replaced with the country’s mean over the most recent 5 years (2014 to 2018). Missing 2018 values of the crime rate are replaced with the country’s mean from 2001 to 2018. I then mutate a new attribute, ‘GDP per capita’, as ‘GDP’/‘Population’ for the later comparison. I also mutate a new column, ‘Rank’, to replace ‘IncomeGroup’, because ‘IncomeGroup’ has some missing values. After scanning the table again, a few rows still contain missing values; they do not affect the analysis, so I drop them.

The next step is to inspect boxplots for outliers. All numeric attributes contain outliers, so I apply a cap function to limit their influence. A log transformation is an alternative way to handle them. The data also vary over several orders of magnitude: for example, the minimum population is 1.151e+04 and the maximum is 1.393e+09. For data with this spread, a log transformation does a better job than the other techniques tried. The histograms of all numeric attributes are right-skewed, and the log transformation reduces this skewness. Finally, I create a cleaned data frame called df_final.
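The effect of the log transformation on right skewness can be illustrated with a minimal sketch. The data below are synthetic (drawn with `rlnorm`), not the World Bank data, and the `skewness` helper is a hypothetical function written out here to avoid extra packages.

```r
# Synthetic, right-skewed data standing in for a variable such as population
# (assumption: NOT the report's data).
set.seed(42)
x <- rlnorm(1000, meanlog = 15, sdlog = 2)

# Sample skewness (third standardised moment), written out by hand.
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^1.5
}

skew_raw <- skewness(x)
skew_log <- skewness(log10(x))

cat("skewness before log10:", round(skew_raw, 2), "\n")
cat("skewness after  log10:", round(skew_log, 2), "\n")
```

A skewness near 0 after the transformation indicates a roughly symmetric distribution, which is the pattern the histograms in the Transform section are used to check.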

Data

The task is to understand whether the relation between GDP and crime rate is more significant than the relation between GDP per capita and crime rate. All the datasets are from the World Bank and cover 1960 to 2019. As there are more missing values in 2019 than in 2018, the analysis uses the 2018 data. Country name and country code identify a country, and country code is used as the key to merge the data. GDP is the GDP in 2018 (current US$). The crime rate is intentional homicides per 100,000 people. Population is the population in 2018. Income group categorises countries into 4 groups: “High income”, “Low income”, “Lower middle income” and “Upper middle income”. I will also create two new attributes, “GDP_per_capita” and “rank”, later.

Reference & Data source:

World_Gdp data: Api.worldbank.org. 2020. GDP. [online] Available at: http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=csv [Accessed 14 October 2020].

Population data : Data.worldbank.org. 2020. Population. [online] Available at: https://data.worldbank.org/indicator/SP.POP.TOTL [Accessed 14 October 2020].

Crime: Api.worldbank.org. 2020. Intentional Homicides (Per 100,000 People). [online] Available at: http://api.worldbank.org/v2/en/indicator/VC.IHR.PSRC.P5?downloadformat=csv [Accessed 14 October 2020].

# This is the R chunk for the Data Section
# Create dfs by correctly selecting the rows.
world_GDP <- read_excel("C:/Users/wei_s/Desktop/world GDP.xls",skip = 3, col_names = TRUE)
world_crime <- read_excel("C:/Users/wei_s/Desktop/crime rate international over 40 years.xls",
                          skip = 3, col_names = TRUE)
population <- read_excel("C:/Users/wei_s/Desktop/World population.xls",skip=3, sheet = 1)
country_info <- read_excel("C:/Users/wei_s/Desktop/World population.xls",sheet = 2)

head(world_GDP)
head(world_crime)
head(population)
head(country_info)

# As there are data from 1960-2019, we have to select the appropriate columns to create dataframe
df1 <- world_GDP %>% select(1:3,59:63)
df12018 <- world_GDP %>% select(1:3,63)
df2 <- world_crime %>% select(2,"2001":"2018")
df22018 <- world_crime %>% select(1:3,63)
df3<- population %>% select(1,2,63)
df4<- country_info %>% select(1,3)

# Join df_GDP with  df_country_info
df5 <- df1 %>%
  left_join(df4,by = c("Country Code"))

# Please note: 4 more datasets will be joined together after tidying the data later.
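The column selection and key join used in this chunk can be sketched on toy data. The tibbles and country codes below are made up for illustration; selecting by column name instead of position is a suggested alternative, not what the chunk above does.

```r
library(dplyr)

# Toy stand-ins for the World Bank tables (all values are invented).
gdp_toy <- tibble(
  `Country Name` = c("Aland", "Borduria", "Cascara"),
  `Country Code` = c("ALD", "BRD", "CSC"),
  `2017` = c(10, 20, 30),
  `2018` = c(11, NA, 33)
)
info_toy <- tibble(
  `Country Code` = c("ALD", "BRD"),
  IncomeGroup = c("High income", "Low income")
)

# Selecting by name keeps working even if the column order of the file changes.
gdp_2018 <- gdp_toy %>% select(`Country Code`, `2018`)

# left_join keeps every row of the left table; keys without a match get NA.
joined <- gdp_2018 %>% left_join(info_toy, by = "Country Code")
print(joined)
```

Because `left_join` never drops rows from the left table, countries missing from the right table show up as `NA` and can be found in the later scan step.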

Understand

Summarise the types of variables and data structures, check the attributes in the data and apply proper data type conversions. In addition to the R codes and outputs, explain briefly the steps that you have taken. In this section, show that you have fulfilled minimum requirements 2-4.

# This is the R chunk for the Understand Section
# Check the basic information of data frame
str(df1)
tibble [264 x 8] (S3: tbl_df/tbl/data.frame)
 $ Country Name  : chr [1:264] "Aruba" "Afghanistan" "Angola" "Albania" ...
 $ Country Code  : chr [1:264] "ABW" "AFG" "AGO" "ALB" ...
 $ Indicator Name: chr [1:264] "GDP (current US$)" "GDP (current US$)" "GDP (current US$)" "GDP (current US$)" ...
 $ 2014          : num [1:264] 2.77e+09 2.05e+10 1.46e+11 1.32e+10 3.27e+09 ...
 $ 2015          : num [1:264] 2.92e+09 1.99e+10 1.16e+11 1.14e+10 2.79e+09 ...
 $ 2016          : num [1:264] 2.97e+09 1.94e+10 1.01e+11 1.19e+10 2.90e+09 ...
 $ 2017          : num [1:264] 3.06e+09 2.02e+10 1.22e+11 1.30e+10 3.00e+09 ...
 $ 2018          : num [1:264] NA 1.95e+10 1.01e+11 1.51e+10 3.22e+09 ...
str(df2)
tibble [264 x 19] (S3: tbl_df/tbl/data.frame)
 $ Country Code: chr [1:264] "ABW" "AFG" "AGO" "ALB" ...
 $ 2001        : num [1:264] 4.31 NA NA 7.03 NA ...
 $ 2002        : num [1:264] 5.26 NA NA 6.91 NA ...
 $ 2003        : num [1:264] 4.12 NA NA 5.32 NA ...
 $ 2004        : num [1:264] 2.03 NA NA 4.22 1.31 ...
 $ 2005        : num [1:264] 6 NA NA 4.99 NA ...
 $ 2006        : num [1:264] 4.96 NA NA 3.1 NA ...
 $ 2007        : num [1:264] 2.96 NA NA 3.46 0 ...
 $ 2008        : num [1:264] 4.93 NA NA 3.1 1.19 ...
 $ 2009        : num [1:264] 3.94 3.93 NA 2.86 0 ...
 $ 2010        : num [1:264] 3.93 3.37 NA 4.31 0 ...
 $ 2011        : num [1:264] 1.96 4.09 4.36 4.85 1.19 ...
 $ 2012        : num [1:264] 3.9 6.25 4.85 5.39 0 ...
 $ 2013        : num [1:264] 5.82 NA NA 4.27 0 ...
 $ 2014        : num [1:264] 1.93 NA NA 4.04 0 ...
 $ 2015        : num [1:264] NA 9.78 NA 2.21 0 ...
 $ 2016        : num [1:264] NA 6.55 NA 2.74 NA ...
 $ 2017        : num [1:264] NA 6.68 NA 2.01 NA ...
 $ 2018        : num [1:264] NA 6.66 NA 2.29 NA ...
str(df3)
tibble [264 x 3] (S3: tbl_df/tbl/data.frame)
 $ Country Name: chr [1:264] "Aruba" "Afghanistan" "Angola" "Albania" ...
 $ Country Code: chr [1:264] "ABW" "AFG" "AGO" "ALB" ...
 $ 2018        : num [1:264] 105845 37172386 30809762 2866376 77006 ...
# Check data quality, e.g., whether the data types are correct.
summary(df5)
 Country Name       Country Code       Indicator Name          2014                2015          
 Length:264         Length:264         Length:264         Min.   :3.729e+07   Min.   :3.549e+07  
 Class :character   Class :character   Class :character   1st Qu.:9.312e+09   1st Qu.:8.742e+09  
 Mode  :character   Mode  :character   Mode  :character   Median :5.535e+10   Median :5.029e+10  
                                                          Mean   :2.619e+12   Mean   :2.475e+12  
                                                          3rd Qu.:5.569e+11   3rd Qu.:5.156e+11  
                                                          Max.   :7.945e+13   Max.   :7.520e+13  
                                                          NA's   :13          NA's   :14         
      2016                2017                2018           IncomeGroup       
 Min.   :3.655e+07   Min.   :4.062e+07   Min.   :4.259e+07   Length:264        
 1st Qu.:8.734e+09   1st Qu.:9.670e+09   1st Qu.:1.221e+10   Class :character  
 Median :5.160e+10   Median :5.473e+10   Median :5.960e+10   Mode  :character  
 Mean   :2.515e+12   Mean   :2.693e+12   Mean   :2.948e+12                     
 3rd Qu.:5.157e+11   3rd Qu.:5.410e+11   3rd Qu.:5.555e+11                     
 Max.   :7.634e+13   Max.   :8.123e+13   Max.   :8.636e+13                     
 NA's   :15          NA's   :15          NA's   :23                            
str(df5)
tibble [264 x 9] (S3: tbl_df/tbl/data.frame)
 $ Country Name  : chr [1:264] "Aruba" "Afghanistan" "Angola" "Albania" ...
 $ Country Code  : chr [1:264] "ABW" "AFG" "AGO" "ALB" ...
 $ Indicator Name: chr [1:264] "GDP (current US$)" "GDP (current US$)" "GDP (current US$)" "GDP (current US$)" ...
 $ 2014          : num [1:264] 2.77e+09 2.05e+10 1.46e+11 1.32e+10 3.27e+09 ...
 $ 2015          : num [1:264] 2.92e+09 1.99e+10 1.16e+11 1.14e+10 2.79e+09 ...
 $ 2016          : num [1:264] 2.97e+09 1.94e+10 1.01e+11 1.19e+10 2.90e+09 ...
 $ 2017          : num [1:264] 3.06e+09 2.02e+10 1.22e+11 1.30e+10 3.00e+09 ...
 $ 2018          : num [1:264] NA 1.95e+10 1.01e+11 1.51e+10 3.22e+09 ...
 $ IncomeGroup   : chr [1:264] "High income" "Low income" "Lower middle income" "Upper middle income" ...
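# Convert population to integer. Note: values above .Machine$integer.max (2,147,483,647)
# cannot be stored as integer and become NA, which is what the warning below reports.
# Keeping the column as numeric (double) would avoid this.
df3$'2018' <-as.integer(df3$'2018')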
NAs introduced by coercion to integer range
df5$IncomeGroup<- as.factor(df5$IncomeGroup)
class(df3$'2018')
[1] "integer"
# IncomeGroup is stored as character; convert it to an ordered factor
# with levels running from high income to low income.
df5$IncomeGroup <-factor( df5$IncomeGroup,
                          levels = c("High income", "Upper middle income",
                                     "Lower middle income", "Low income"), ordered=TRUE ) 
class(df5$IncomeGroup)
[1] "ordered" "factor" 
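The coercion warning shown in this section comes from the limited range of R's integer type. The sketch below reproduces it on a made-up population vector (an assumption for illustration, not the report's data).

```r
# Largest value an R integer can hold (2^31 - 1).
int_max <- .Machine$integer.max
print(int_max)

# Made-up population values; the last one exceeds the integer range.
pop <- c(105845, 37172386, 7.6e9)

# as.integer() turns the out-of-range value into NA (normally with a warning,
# suppressed here to keep the output short).
pop_int <- suppressWarnings(as.integer(pop))
print(pop_int)

# Left as double, the vector keeps every value.
print(anyNA(pop))
```

Leaving a count column as double costs nothing in accuracy at this magnitude and avoids silently introducing missing values.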

Tidy & Manipulate Data I

Explain why your data (or one of the data sets) doesn’t conform the tidy data principles (minimum requirement #5). Apply the required steps to reshape the data into a tidy format. In addition to the R codes and outputs, explain everything that you do in this step.

# This is the R chunk for the Tidy & Manipulate Data I 
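# The year columns ("2014" to "2018") hold values of one variable (year) spread across
# several columns, so the data is in wide format rather than tidy; gather() reshapes it to long format.
# Some GDP values are missing. To get a more reasonable estimate,
# missing 2018 values are replaced with the country's mean over the 5 years "2014" to "2018".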

ggdp <- df1 %>%
  gather("2014":"2018", key = "year", value = "gdp",na.rm=TRUE)

# Calculate each country's mean GDP over the years that have a value
dfmean <- aggregate(ggdp[, 5], list(ggdp$`Country Code`), mean)

# Rename the columns of dfmean
dfmean <- dfmean %>% 
  rename(
    'Country Code'=Group.1,meangdp = gdp)

# Replace missing 2018 values with the mean calculated above
df12018 <- df12018 %>%
  left_join(dfmean, by = c("Country Code")) %>%
  mutate(`2018` = coalesce(`2018`, meangdp)) %>%
  select('Country Code', '2018')
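The reshape-and-impute steps above can be sketched end to end on a toy table. This version uses `pivot_longer()` (the newer tidyr replacement for `gather()`) and `group_by()`/`summarise()` instead of `aggregate()`; these are swapped-in alternatives, and the table and its values are invented.

```r
library(dplyr)
library(tidyr)

# Toy wide table: one column per year (all values are made up).
gdp_wide <- tibble(
  `Country Code` = c("AAA", "BBB", "CCC"),
  `2016` = c(1, 4, NA),
  `2017` = c(2, NA, NA),
  `2018` = c(NA, 6, NA)
)

# Wide to long: one row per country-year, dropping missing values.
gdp_long <- gdp_wide %>%
  pivot_longer(cols = `2016`:`2018`, names_to = "year", values_to = "gdp",
               values_drop_na = TRUE)

# Mean of the available years per country.
gdp_mean <- gdp_long %>%
  group_by(`Country Code`) %>%
  summarise(meangdp = mean(gdp), .groups = "drop")

# Fill missing 2018 values with that mean; countries with no data stay NA.
gdp_2018 <- gdp_wide %>%
  select(`Country Code`, `2018`) %>%
  left_join(gdp_mean, by = "Country Code") %>%
  mutate(`2018` = coalesce(`2018`, meangdp)) %>%
  select(`Country Code`, `2018`)
print(gdp_2018)
```

A country with no observations in any year (CCC here) has no mean to fall back on, so its value remains missing and has to be handled in a later step.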

Tidy & Manipulate Data II

Create/mutate at least one variable from the existing variables (minimum requirement #6). In addition to the R codes and outputs, explain everything that you do in this step.


# This is the R chunk for the Tidy & Manipulate Data II 
# Missing crime rates are imputed differently from missing GDP values.
# GDP tends to increase over time, so I used the mean of the most recent 5 years.
# The crime rate is relatively more stable and has more missing values, so I use
# the mean of the 18 years from 2001 to 2018.

gcrime <- df2 %>%
  gather("2001":"2018", key = "year", value = "crime",na.rm=TRUE)

# Calculate each country's mean crime rate over the years that have a value
crimemean <- aggregate(gcrime[, 3], list(gcrime$`Country Code`), mean)

# Rename the columns of crimemean
crimemean <- crimemean %>% 
  rename(
    'Country Code'=Group.1,meancrime = crime)

# Replace missing 2018 values with the mean calculated above
df22018 <- df22018 %>%
  left_join(crimemean, by = c("Country Code")) %>%
  mutate(`2018` = coalesce(`2018`, meancrime)) %>%
  select('Country Code', '2018')

Scan I

Scan the data for missing values, special values and obvious errors (i.e. inconsistencies). In this step, you should fulfil the minimum requirement #7. In addition to the R codes and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.


# No special values or obvious errors were found in the data set.
# Missing values in GDP and crime rate were identified and replaced
# in the two previous steps, using a different rule for each.
# While scanning the data frame, I noticed that some countries look
# incorrectly categorised in IncomeGroup. For example, ALB (Albania) has a GDP per capita
# of $5284, which I think should not be in the Upper middle income category.
# In addition, IncomeGroup still has some missing values, so I replace
# 'IncomeGroup' with a new ranking that divides 'GDP PER CAPITAL' into 4 categories:
# 4 represents the highest GDP per capita category and 1 the lowest.

# Join the 4 tables into one data frame

df <- df12018 %>%
  left_join(df22018,by = c("Country Code")) %>% 
  left_join(df3,by = c("Country Code"))%>% 
  left_join(df4,by = c("Country Code")) 

# Rename the columns of df ('2018.x' is GDP, '2018.y' is the crime rate, '2018' is population)

df <- df %>% 
  rename(GDP='2018.x','CRIME RATE'='2018.y', Population='2018')

# Create a new variable 'GDP PER CAPITAL' as 'GDP'/'Population'
df <- df %>% mutate('GDP PER CAPITAL' = GDP /Population)

# Convert 'GDP PER CAPITAL' to integer (this truncates the decimals)
df$'GDP PER CAPITAL' <-df$'GDP PER CAPITAL'%>% as.integer

# Create a new variable with ntile(), which divides 'GDP PER CAPITAL' into 4 equally sized groups.
df$'Rank'<- ntile(df$'GDP PER CAPITAL',4)

# Convert the group number to a factor
df$'Rank'<- as_factor(df$'Rank' )

head(df)

# There are still some missing values, but they won't affect our analysis, so we drop them
df <- df%>% drop_na()
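How `ntile()` assigns the four ranks can be seen on a small vector. The GDP per capita figures below are invented for illustration.

```r
library(dplyr)

# Made-up GDP per capita values, including one missing value.
gdp_pc <- c(300, 900, 2500, 5200, 9800, 21000, 47000, 83000, NA)

# ntile() splits the non-missing values into 4 groups of (near) equal size;
# 1 is the lowest group, 4 the highest, and NA stays NA.
rank4 <- ntile(gdp_pc, 4)
print(rank4)

# Stored as an ordered factor, the ranks keep their low-to-high meaning.
rank_f <- factor(rank4, levels = 1:4, ordered = TRUE)
print(table(rank_f, useNA = "ifany"))
```

Note that the groups are defined by position in the sorted data, not by fixed income thresholds, so the cut-offs depend on which countries are in the table.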

Scan II

Scan the numeric data for outliers. In this step, you should fulfil the minimum requirement #8. In addition to the R codes and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.

# This is the R chunk for the Scan II
# Generate boxplot to check outliers

df$`CRIME RATE`<- as.numeric(df$`CRIME RATE`)

df$GDP %>% boxplot(main = "GDP 2018")


df$`CRIME RATE` %>%  boxplot(main = "CRIME RATE 2018")


df$Population %>%  boxplot(main = "POPULATION IN DIFFERENT COUNTRIES 2018")
NAs produced by integer overflow

df$`GDP PER CAPITAL`%>% boxplot(main = "GDP_PER_CAPITAL 2018")



# The boxplots show many outliers in every numeric variable.
# Z-scores for the crime rate: observations with |z| > 3 are flagged as outliers.
z.scores <- df$`CRIME RATE` %>%  scores(type = "z") 
z.scores %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.7127 -0.6048 -0.4129  0.0000  0.1270  4.6063 
which( abs(z.scores) >3 )
[1]  69  82  98 147 175 176
# One way to deal with outliers is capping.
# Here, capping replaces values below Q1 - 1.5*IQR with the 5th percentile
# and values above Q3 + 1.5*IQR with the 95th percentile.
cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}


df_sub <- df %>%  dplyr::select('GDP','CRIME RATE','Population','GDP PER CAPITAL') 

summary(df_sub)
      GDP              CRIME RATE        Population        GDP PER CAPITAL 
 Min.   :4.259e+07   Min.   : 0.1563   Min.   :1.151e+04   Min.   :   271  
 1st Qu.:7.733e+09   1st Qu.: 1.2078   1st Qu.:1.295e+06   1st Qu.:  3066  
 Median :4.029e+10   Median : 3.0792   Median :7.451e+06   Median :  7691  
 Mean   :4.656e+11   Mean   : 7.1052   Mean   :3.981e+07   Mean   : 19332  
 3rd Qu.:2.451e+11   3rd Qu.: 8.3436   3rd Qu.:2.918e+07   3rd Qu.: 23726  
 Max.   :2.053e+13   Max.   :52.0189   Max.   :1.393e+09   Max.   :185829  
df_capped <- sapply(df_sub, FUN = cap) 

summary(df_capped)
      GDP              CRIME RATE        Population        GDP PER CAPITAL
 Min.   :4.259e+07   Min.   : 0.1563   Min.   :    11508   Min.   :  271  
 1st Qu.:7.733e+09   1st Qu.: 1.2078   1st Qu.:  1294974   1st Qu.: 3066  
 Median :4.029e+10   Median : 3.0792   Median :  7451000   Median : 7691  
 Mean   :2.763e+11   Mean   : 7.2913   Mean   : 23983682   Mean   :17850  
 3rd Qu.:2.451e+11   3rd Qu.: 8.3436   3rd Qu.: 29183078   3rd Qu.:23726  
 Max.   :1.720e+12   Max.   :32.5943   Max.   :126495269   Max.   :72551  
# However, instead of capping the values, we will transform the variables
# to reduce the influence of outliers. For example, we can use the log function.
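The capping rule can be checked on a small vector. The sketch repeats the `cap` function so that it runs on its own; the rates and the `n_outliers` helper are hypothetical and only serve the illustration.

```r
# Same capping rule as in the chunk above, repeated so the sketch is self-contained.
cap <- function(x) {
  q <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  x[x < q[2] - 1.5 * IQR(x)] <- q[1]
  x[x > q[3] + 1.5 * IQR(x)] <- q[4]
  x
}

# Made-up homicide rates with one extreme value.
rate <- c(0.5, 0.8, 1.1, 1.5, 2.0, 2.6, 3.1, 3.9, 4.4, 5.0, 52.0)

# Hypothetical helper: count values outside the 1.5 * IQR fences.
n_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  sum(x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x))
}

capped <- cap(rate)
cat("values outside the fences before capping:", n_outliers(rate), "\n")
cat("largest value before and after capping:", max(rate), max(capped), "\n")
```

In a sample this small the 95th percentile is itself pulled up by the extreme value, so capping shrinks the outlier without necessarily bringing it inside the fences; this is one reason to also consider a transformation.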

Transform

Apply an appropriate transformation for at least one of the variables. In addition to the R codes and outputs, explain everything that you do in this step. In this step, you should fulfil the minimum requirement #9.

# This is the R chunk for the Transform Section


# As discussed above, a log transformation can reduce the influence of outliers.
# Compare the two boxplots of Population: one on the original scale,
# the other after the log transformation.
# After the transformation, the extreme values are far less extreme.

df$log_population <- log10(df$Population) 
df$log_population %>% boxplot(main = "Population 2018 WITH THE LOG TRANSFORMATION")

df$Population %>% boxplot(main = "Population 2018 WITHOUT THE LOG TRANSFORMATION")
NAs produced by integer overflow

# Check the histograms to see whether the data are left- or right-skewed

hist(df$GDP)

hist(df$`CRIME RATE`)

hist(df$log_population)

hist(df$`GDP PER CAPITAL`)



# The Box-Cox transformation is used to make non-normal data closer to
# a normal distribution.
# We can use BoxCox to make the crime rate more normal. It is also worth
# comparing it with the log transformation to see which method works better.

df$boxcox_Crime<- BoxCox(df$`CRIME RATE`,lambda = "auto") 
hist(df$boxcox_Crime)

df$log_Crime <- log10(df$`CRIME RATE`)
hist(df$log_Crime)


# After comparing the histograms, I think the log transformation does a better job.

# Min-max normalisation rescales the data to the range 0 to 1.
# This is a good way to normalise our dataset.
# As the output is long, I will just show the code.
# R CODE: minmaxnormalise <- function(x){(x- min(x)) /(max(x)-min(x))}
# R CODE: minmaxnormalise(df_sub$GDP)
# There are many other potential ways to transform the data, such as centering.

center_df <-scale(df_sub, center = TRUE, scale = FALSE) 
center_df
                 GDP  CRIME RATE Population GDP PER CAPITAL
  [1,] -4.627191e+11 -3.10145442  -39701676      8318.71585
  [2,] -4.461615e+11 -0.44960582   -2635135    -18808.28415
  [3,] -3.642927e+11 -2.50373663   -8997759    -16043.28415
  [4,] -4.504989e+11 -4.81567453  -36941145    -14048.28415
  [5,] -4.624276e+11 -6.73536261  -39730515     22460.71585
  [6,] -4.343085e+10 -6.35191379  -30176562     24506.71585
  [7,]  5.422562e+10 -1.78068823    4686981     -7649.28415
  [8,] -4.531880e+11 -5.41125141  -36855745    -15112.28415
  [9,] -4.650099e+11 -1.01816421  -39752056     -7866.28415
 [10,] -4.640353e+11  3.71355261  -39711235     -2606.28415
 [11,]  9.682585e+11 -6.21353457  -14824833     38062.71585
 [12,] -1.013764e+10 -6.13793836  -30967000     32192.71585
 [13,] -4.185330e+11 -4.90405952  -29867750    -14593.28415
 [14,] -4.626090e+11 -2.24419700  -28632143    -19061.28415
 [15,]  7.704002e+10 -5.05591174  -28380467     28158.71585
 [16,] -4.513949e+11 -5.97766956  -28322473    -18092.28415
 [17,] -4.494465e+11 -6.35968616  -20055986    -18512.28415
 [18,] -1.916068e+11 -4.73183817  121548518    -17634.28415
 [19,] -3.994450e+11 -5.80050038  -32782484     -9909.28415
 [20,] -4.279934e+11 -6.42981985  -38238082      4658.71585
 [21,] -4.532214e+11 17.78321827  -39421881     12884.71585
 [22,] -4.454624e+11 -5.93185671  -36483592    -13260.28415
 [23,] -4.056146e+11 -4.71429418  -30324022    -13002.28415
 [24,] -4.637747e+11 25.88056215  -39424450    -14448.28415
 [25,] -4.253582e+11  1.37256239  -28454379    -15784.28415
 [26,]  1.419837e+12 20.27736332  169661812    -10331.28415
 [27,] -4.605594e+11  2.66318357  -39520880     -1587.28415
 [28,] -4.520785e+11 -6.50361682  -39378559     12295.71585
 [29,] -4.631992e+11 -5.91215959  -39053127    -16089.28415
 [30,] -4.469826e+11  7.93173594  -37553395    -11053.28415
 [31,] -4.634249e+11 13.01523368  -35141144    -18857.28415
 [32,]  1.250617e+12 -5.34924584   -2749756     26980.71585
 [33,]  2.394947e+11 -6.51869895  -31293192     63485.71585
 [34,] -1.673879e+11 -2.70027249  -21078361     -3408.28415
 [35,]  1.342917e+13 -6.57807618 1352922479     -9356.28415
 [36,] -4.269518e+11 -3.34500347  -14591284    -17798.28415
 [37,] -1.320770e+11 18.23863563    9841164    -12614.28415
 [38,] -4.636789e+11 -0.30074446  -39263754    -15715.28415
 [39,] -4.050920e+11  4.15608753  -34808080     -7220.28415
 [40,] -3.656229e+11 -1.77285119  -28469383    -10511.28415
 [41,] -4.625180e+11 15.15020971  -39647721       240.71585
 [42,] -4.601605e+11  0.19850489  -39743347     66144.71585
 [43,] -4.406839e+11 -5.84388056  -38618256      1656.71585
 [44,] -2.206585e+11 -6.07262105  -29177593      3713.71585
 [45,]  3.483903e+12 -6.15719035   43098261     28306.71585
 [46,] -4.650950e+11  6.06682866  -39735896    -11641.28415
 [47,] -1.099706e+11 -6.09684501  -34013885     42057.71585
 [48,] -3.800905e+11  2.94456698  -29180356    -11282.28415
 [49,] -2.918879e+11 -5.99545046    2420908    -15218.28415
 [50,] -3.580839e+11 -1.30453963  -22723164    -13037.28415
 [51,] -2.147511e+11 -5.73844220   58616074    -16783.28415
 [52,]  9.540893e+11 -6.48408704    6990233     11004.71585
 [53,] -4.348987e+11 -4.98862567  -38485544      3925.71585
 [54,] -3.813765e+11  1.68731158   69417038    -18561.28415
 [55,] -1.897522e+11 -5.47549536  -34291996     30688.71585
 [56,] -4.601091e+11 -4.54602576  -38924038    -13066.28415
 [57,]  2.322218e+12 -5.90653047   27158391     22298.71585
 [58,]  2.395022e+12 -5.90025219   26652823     23710.71585
 [59,] -4.480462e+11 -4.88180448  -36080972    -14610.28415
 [60,] -4.000894e+11 -5.17208976  -10040413    -17130.28415
 [61,] -4.641877e+11 -3.85371551  -37933212    -18555.28415
 [62,] -2.475075e+11 -6.16430303  -29074639       991.71585
 [63,] -4.644772e+11  1.91083838  -39696067     -8847.28415
 [64,] -4.625943e+11  6.90723505  -39751498     35137.71585
 [65,] -3.925277e+11 15.39618175  -23460571    -14860.28415
 [66,] -4.597259e+11 -3.71515378  -39641753     16379.71585
 [67,] -4.617672e+11  7.14374222  -39028517    -14353.28415
 [68,] -1.039528e+11 -6.45403063  -32356521     29209.71585
 [69,] -4.416217e+11 31.82042413  -30219999    -16827.28415
 [70,] -4.046545e+11 -6.52774517  -35719678     -4412.28415
 [71,] -4.559872e+11 -0.42542431  -28684345    -18464.28415
 [72,] -3.077630e+11 -5.28817804  -30031957     -3182.28415
 [73,]  5.765944e+11 -6.54781368  227855914    -15439.28415
 [74,] -4.587499e+11 -5.31848526  -39723444     62686.71585
 [75,]  2.247519e+12 -4.02593453 1312809807    -17327.28415
 [76,] -8.297154e+10 -6.23356151  -34940205     59288.71585
 [77,] -4.540405e+10 -4.35137614   41992748    -14195.28415
 [78,] -2.414179e+11  3.08721438   -1373921    -13498.28415
 [79,] -4.399083e+11 -6.21419783  -39454800     53635.71585
 [80,] -9.505792e+10 -5.03553545  -30924721     22386.71585
 [81,]  1.620118e+12 -6.53611631   20614239     15187.71585
 [82,] -4.499320e+11 36.74711456  -36872666    -13978.28415
 [83,] -4.234146e+11 -5.46552894  -29851510    -15091.28415
 [84,]  4.489161e+12 -6.84259288   86721579     19826.71585
 [85,] -2.863059e+11  2.38477971  -21531023     -9520.28415
 [86,] -3.778673e+11 -2.17643895   11585489    -17625.28415
 [87,] -4.573748e+11 -4.91608936  -33484721    -18024.28415
 [88,] -4.410741e+11 -4.13858723  -23557723    -17820.28415
 [89,] -4.654492e+11 -1.33036044  -39691674    -17634.28415
 [90,] -4.646351e+11 26.86227472  -39755080       -57.28415
 [91,]  1.254933e+12 -6.50131758   11799112     14007.71585
 [92,] -3.250005e+11 -5.34340088  -35670209     14661.71585
 [93,] -4.106846e+11 -4.61224047  -32958596    -11308.28415
 [94,] -4.623819e+11 -3.39786809  -34988544    -18655.28415
 [95,] -4.635800e+11 14.33636362  -39625632     -7975.28415
 [96,] -4.592170e+11 -4.46789707  -39769611    150250.71585
 [97,] -3.772200e+11 -4.68392334  -18137521    -15252.28415
 [98,] -4.630700e+11 30.50083456  -37699389    -18111.28415
 [99,] -4.121907e+11 -2.53581093  -37005978      -252.28415
[100,] -3.947259e+11 -6.12021050  -39199571     97321.71585
[101,] -4.313320e+11 -2.74936201  -37880347     -1527.28415
[102,] -4.105618e+11 -6.78852740  -39175885     67875.71585
[103,] -3.477245e+11 -5.68964409   -3778383    -16060.28415
[104,] -4.584577e+11 -6.10980232  -39768839    166496.71585
[105,] -4.541885e+11 -3.00837407  -37101472    -15099.28415
[106,] -4.603184e+11 -5.48488408  -39291825     -9002.28415
[107,]  7.550536e+11 21.96589465   86383267     -9659.28415
[108,] -4.530170e+11 -5.90495016  -37724563    -13270.28415
[109,] -4.510423e+11 -5.51155961  -39322891     10800.71585
[110,] -3.894779e+11 -5.21809724   13900874    -17914.28415
[111,] -4.601391e+11 -4.87516807  -39185294    -10482.28415
[112,] -4.525371e+11 -0.92261904  -36637313    -15198.28415
[113,] -4.509287e+11 -2.43089814  -10311559    -18834.28415
[114,] -4.514641e+11 -4.18530687  -38542218     -8124.28415
[115,] -4.587286e+11 -3.16634208  -21664206    -18951.28415
[116,] -1.070640e+11 -4.95226098   -8278936     -7959.28415
[117,] -4.521917e+11 10.30014451  -37359266    -13837.28415
[118,] -4.528191e+11 -2.46336026  -17364573    -18761.28415
[119,] -6.748549e+10 27.41887893  156067219    -17300.28415
[120,] -4.525820e+11  4.30034032  -33342008    -17312.28415
[121,]  4.484590e+11 -6.51898538  -22575897     33715.71585
[122,] -3.147928e+10 -6.63682326  -34495605     62401.71585
[123,] -4.364724e+11 -4.20787704  -11719650    -18294.28415
[124,] -2.577253e+11 -5.97937407  -34966521     23616.71585
[125,] -3.863705e+11 -6.83598663  -34978038     -2918.28415
[126,] -1.510784e+11 -3.22208424  172407509    -17850.28415
[127,] -4.005177e+11  2.27985549  -35630648     -3740.28415
[128,] -2.436009e+11 -0.10338967   -7818265    -12391.28415
[129,] -1.188040e+11 -0.64017881   66844401    -16080.28415
[130,] -4.653619e+11  4.06115540  -39789614     -3473.28415
[131,] -4.422334e+11  1.67715975  -31201205    -16612.28415
[132,]  1.214682e+11 -6.37471227   -1832771     -3872.28415
[133,] -3.646660e+11 13.98314800  -36614167     12288.71585
[134,] -2.243713e+11 -6.31540016  -29523699      4128.71585
[135,] -4.252612e+11  0.03967302  -32851450    -13527.28415
[136,] -4.510300e+11 -6.61164222  -35238434    -16134.28415
[137,] -2.742838e+11 -6.58865925  -37025844     49460.71585
[138,] -2.240189e+11 -5.82351727  -20334976     -6924.28415
[139,]  1.203937e+12  1.10430865  104670339     -7777.28415
[140,] -4.560182e+11 -4.22845730  -27505582    -18550.28415
[141,]  3.208759e+11 -5.89454086   -6107574      4005.71585
[142,] -4.395673e+11 -2.19649722    1994012    -18709.28415
[143,] -4.424099e+11 -6.83764882  -23953161    -17867.28415
[144,] -9.242881e+10 -6.94884921  -34168845     46855.71585
[145,] -4.642503e+11 -2.35856129  -39154663    -17195.28415
[146,] -4.615608e+11 -4.89462103  -32157367    -18799.28415
[147,] -4.395285e+11 44.91375920  -33386777    -15265.28415
[148,] -4.640080e+11 -6.78250826  -39773736     29148.71585
[149,] -4.150486e+11 -5.87827639  -32824917    -12086.28415
[150,] -4.526659e+11  6.79483303  -28831601    -18150.28415
[151,] -4.652236e+11 -3.35821139  -39596493    -17331.28415
[152,] -4.621878e+11  2.36174815  -39231530    -13329.28415
[153,] -3.598254e+11 -5.96818170  -34360750        95.71585
[154,] -4.116115e+11 -6.62389703  -37733627      6721.71585
[155,]  8.980948e+10 -6.02209429  -29632307     35256.71585
[156,] -4.609353e+11  6.53714031  -38671330    -15187.28415
[157,] -4.640599e+11  4.57708685  -39710759     -2942.28415
[158,] -4.646236e+11 -2.10113033  -39769856      7809.71585
[159,]  4.086821e+10 -1.40836303   29621003    -12037.28415
[160,] -4.581229e+11 -4.84552114  -30706684    -18506.28415
[161,] -4.248848e+11 -2.32808286  -33956613    -12366.28415
[162,] -4.640773e+11 -3.13648152  -38539549    -18095.28415
[163,] -4.651955e+11 -3.07087539  -39704324    -14968.28415
[164,] -4.418377e+11 20.36478202  -38417663     -2203.28415
[165,] -4.258756e+11 -4.54647917  -28242317    -15894.28415
[166,]  3.057044e+11 -4.51469130   42512203     -9962.28415
[167,] -4.656033e+11 -3.69328335  -39796013    -15632.28415
[168,] -4.076447e+11  0.80761229   16510827    -18303.28415
[169,] -4.328731e+11  3.41929331    2915618    -18565.28415
[170,] -3.347440e+11 -0.52354086    4814997    -16399.28415
[171,] -4.060490e+11  4.95528605  -36358222     -2055.28415
[172,]  2.006340e+13 -2.14821642  286879980     43507.71585
[173,] -4.152533e+11 -4.25377884   -6851421    -17803.28415
[174,] -4.648346e+11 15.43487504  -39697311    -11971.28415
[175,]  1.671342e+10 29.58246787  -10937326     -2625.28415
[176,] -4.618739e+11 32.52163989  -39700544     15926.71585
[177,] -2.204322e+11 -5.76796120   55732874    -16766.28415
[178,] -4.648254e+11 -0.29835277  -39611391    -15149.28415
[179,] -4.577029e+11 -4.70623688  -38010436    -14913.28415
[180,] -4.380546e+11 -2.34877900  -11308834    -18364.28415
[181,] -9.735696e+10 29.01836244   17972101    -12958.28415
[182,] -4.386407e+11 -1.31289558  -22455699    -17776.28415
[183,] -4.413343e+11  3.15834062  -25368503    -17649.28415
attr(,"scaled:center")
            GDP      CRIME RATE      Population GDP PER CAPITAL 
   4.656459e+11    7.105167e+00    3.980752e+07    1.933228e+04 
# To keep the transformations consistent,
# I think log is one of the best options,
# because the data span a very wide range. For example,
# population ranges from about 10^4 to 10^9.

# create new column for GDP AND GDP PER CAPITA using Log
df$log_GDP <- log10(df$GDP) 
df$log_GDPPC <- log10(df$`GDP PER CAPITAL`) 

# Create the final table for analysis after cleaning
df_final <- df %>% select(1,4,8,10,11,12,13)
head(df_final)
str(df_final)
tibble [183 x 7] (S3: tbl_df/tbl/data.frame)
 $ Country Code: chr [1:183] "ABW" "AFG" "AGO" "ALB" ...
 $ Country Name: chr [1:183] "Aruba" "Afghanistan" "Angola" "Albania" ...
 $ Rank        : Factor w/ 4 levels "1","2","3","4": 4 1 2 2 4 4 3 2 3 3 ...
 $ boxcox_Crime: num [1:183] 1.258 1.661 1.371 0.781 -1.069 ...
  ..- attr(*, "lambda")= num -0.143
 $ log_Crime   : num [1:183] 0.602 0.823 0.663 0.36 -0.432 ...
 $ log_GDP     : num [1:183] 9.47 10.29 11.01 10.18 9.51 ...
 $ log_GDPPC   : num [1:183] 4.44 2.72 3.52 3.72 4.62 ...
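The min-max normalisation mentioned in the Transform chunk can be sketched as follows. The function must be applied to a whole vector (or column by column), not element by element; the data frame `toy` and its values are made up for illustration.

```r
# Rescale a numeric vector to the range 0 to 1.
minmaxnormalise <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Made-up GDP values spanning several orders of magnitude.
gdp <- c(4.3e7, 7.7e9, 4.0e10, 2.5e11, 2.1e13)
gdp_scaled <- minmaxnormalise(gdp)
print(range(gdp_scaled))

# For several columns at once, apply the function over the columns of a data frame.
toy <- data.frame(gdp = gdp, crime = c(0.2, 1.2, 3.1, 8.3, 52.0))
toy_scaled <- as.data.frame(lapply(toy, minmaxnormalise))
print(toy_scaled)
```

Min-max scaling changes only the range, not the shape of the distribution, so a right-skewed variable stays right-skewed; that is why it complements rather than replaces the log transformation.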

NOTE: Note that sometimes the order of the tasks may be different than the order given here. For example, you may need to tidy the data sets first to be able to create the common key to merge. Therefore, for such cases you may have a different ordering of the sections. 

Any further or optional pre-processing tasks can be added to the template using an additional section in the R Markdown file. Make sure your code is visible (within the margin of the page). Do not use View() to show your data, instead give headers (using head() )



---
title: "MATH2349 Data Wrangling"
author: "Yuan Hong s3501537"
subtitle: Assignment 2
output:
  html_notebook: default
  pdf_document: default
  html_document:
    df_print: paged
---

## Required packages 


Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.

```{r}
# This is the R chunk for the required packages
library(readxl)
library(dplyr)
library(tidyverse)
library(MVN)
library(outliers)
library(ggplot2)
library(forecast)
```


## Executive Summary 

Data pre-processing for this task involves the following steps. First, GDP, crime rate and population are selected and merged into one data frame by a common attribute, ‘Country Code’. I then summarise the data to check its quality, e.g., whether the data types are correct. IncomeGroup is stored as character, so it is converted to a factor. The summary also shows missing values in several attributes. The next step is to tidy the data. Missing 2018 values of ‘GDP’ are replaced with the country’s mean over the most recent 5 years (2014 to 2018). Missing 2018 values of the crime rate are replaced with the country’s mean from 2001 to 2018. I then mutate a new attribute, ‘GDP per capita’, as ‘GDP’/‘Population’ for the later comparison. I also mutate a new column, ‘Rank’, to replace ‘IncomeGroup’, because ‘IncomeGroup’ has some missing values. After scanning the table again, a few rows still contain missing values; they do not affect the analysis, so I drop them.

The next step is to scan boxplots for outliers. Because all numeric attributes contain outliers, I apply a capping function to limit their influence. A log transformation is an alternative way to deal with outliers. In addition, the values in the dataset vary over several orders of magnitude. For example, the minimum population of a country is 1.151e+04 and the maximum is 1.393e+09. For data like this, a log transformation works better than the other techniques considered. The histograms of all numeric attributes are right-skewed, and the log transformation reduces this skewness. Finally, I create a cleaned data frame called df_final.
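As a rough illustration of why the log helps with a range like this, the sketch below compares a simple moment-based skewness before and after `log10()`. Only the minimum and maximum are taken from the summary above; the three values in between are made up for illustration, and `skewness_simple` is a hypothetical helper, not a function from any of the loaded packages.

```r
# Hypothetical helper: moment-based sample skewness (not from a package).
skewness_simple <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Minimum and maximum are from the summary above; the middle values are made up.
toy_population <- c(1.151e+04, 1e+05, 1e+06, 1e+07, 1.393e+09)

skew_raw <- skewness_simple(toy_population)
skew_log <- skewness_simple(log10(toy_population))

# The log-transformed values are less right-skewed than the raw values.
c(raw = skew_raw, log10 = skew_log)
```

A positive value indicates right skewness, so a smaller positive value after the transformation is what the histograms are expected to show.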




## Data 

The task is to understand whether the relationship between GDP and crime rate is stronger than the relationship between GDP per capita and crime rate. All datasets are from the World Bank and cover 1960 to 2019. As there are more missing values in 2019 than in 2018, the 2018 data are used for the analysis. Country name and country code identify a country, and the country code is used as the key to merge the data. GDP is the GDP in 2018. The crime rate is intentional homicides per 100,000 people. Population is the population in 2018. Income group categorises countries into 4 groups: "High income", "Upper middle income", "Lower middle income" and "Low income". I will also create two new attributes later: GDP per capita (column "GDP PER CAPITAL") and "Rank".
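The idea of using the country code as a merge key can be sketched on toy data as follows. This is only an illustration in base R: the data frames `toy_gdp` and `toy_crime` and all their values are made up, and the report itself uses `left_join()` from dplyr rather than `merge()`.

```r
# Toy tables keyed by a country code (all values are made up).
toy_gdp   <- data.frame(code = c("AAA", "BBB", "CCC"), gdp = c(100, 200, 300))
toy_crime <- data.frame(code = c("AAA", "CCC"), crime = c(1.5, 7.0))

# A key should be unique in each table, otherwise the join duplicates rows.
stopifnot(anyDuplicated(toy_gdp$code) == 0, anyDuplicated(toy_crime$code) == 0)

# all.x = TRUE keeps every row of the left table, like a left join.
merged <- merge(toy_gdp, toy_crime, by = "code", all.x = TRUE)
merged
```

Country "BBB" has no crime record, so its `crime` value is `NA` after the merge. This is the same way missing values can appear in the joined table later in the report.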

Reference & Data source:

World_Gdp data: Api.worldbank.org. 2020. GDP. [online] Available at: <http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=csv> [Accessed 14 October 2020].

Population data : Data.worldbank.org. 2020. Population. [online] Available at: <https://data.worldbank.org/indicator/SP.POP.TOTL> [Accessed 14 October 2020].

Crime: Api.worldbank.org. 2020. Intentional Homicides (Per 100,000 People). [online] Available at: <http://api.worldbank.org/v2/en/indicator/VC.IHR.PSRC.P5?downloadformat=csv> [Accessed 14 October 2020].



```{r}
# This is the R chunk for the Data Section
# Read each sheet, skipping the first 3 rows so that the header row is read correctly.
world_GDP <- read_excel("C:/Users/wei_s/Desktop/world GDP.xls",skip = 3, col_names = TRUE)
world_crime <- read_excel("C:/Users/wei_s/Desktop/crime rate international over 40 years.xls",
                          skip = 3, col_names = TRUE)
population <- read_excel("C:/Users/wei_s/Desktop/World population.xls",skip=3, sheet = 1)
country_info <- read_excel("C:/Users/wei_s/Desktop/World population.xls",sheet = 2)

head(world_GDP)
head(world_crime)
head(population)
head(country_info)

# The data cover 1960-2019, so we select only the columns needed for each data frame
df1 <- world_GDP %>% select(1:3,59:63)
df12018 <- world_GDP %>% select(1:3,63)
df2 <- world_crime %>% select(2,"2001":"2018")
df22018 <- world_crime %>% select(1:3,63)
df3<- population %>% select(1,2,63)
df4<- country_info %>% select(1,3)

# Join the GDP data (df1) with the country information (df4)
df5 <- df1 %>%
  left_join(df4,by = c("Country Code"))

# Please note: the 4 data sets will be joined together later, after tidying the data.

```
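The positional selections above depend on the exact column layout of the downloaded files. As a hedged alternative, the sketch below selects the year columns by name on a toy data frame. `toy_wide` and its values are made up, and only two years are kept to keep the example short.

```r
# Toy stand-in for a wide World Bank table (values are made up).
toy_wide <- data.frame(
  `Country Name` = c("Country A", "Country B"),
  `Country Code` = c("AAA", "BBB"),
  `2013` = c(9, 19),
  `2014` = c(10, 20),
  `2015` = c(11, NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Build the year labels once and select by name instead of by position.
years <- as.character(2014:2015)
toy_sel <- toy_wide[, c("Country Name", "Country Code", years)]

names(toy_sel)
```

With dplyr, the same idea could be written as `select(`Country Name`, `Country Code`, all_of(years))`. Selecting by name keeps working if a column is added to or removed from the source file.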

## Understand 

Summarise the types of variables and data structures, check the attributes in the data and apply proper data type conversions. In addition to the R codes and outputs, explain briefly the steps that you have taken. In this section, show that you have fulfilled minimum requirements 2-4.


```{r}
# This is the R chunk for the Understand Section
# Check the structure of each data frame
str(df1)
str(df2)
str(df3)

# Check data quality, e.g. whether the data types are correct
summary(df5)
str(df5)

# Population is a count, so store it as an integer.
# Note: as.integer() returns NA (with a warning) for values above
# .Machine$integer.max (about 2.1e9); any such rows are dropped by drop_na() later.
df3$`2018` <- as.integer(df3$`2018`)
class(df3$`2018`)

# IncomeGroup is read as character; convert it to an ordered factor
df5$IncomeGroup <- factor(df5$IncomeGroup,
                          levels = c("High income", "Upper middle income",
                                     "Lower middle income", "Low income"), ordered = TRUE)
class(df5$IncomeGroup)

```
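To show what the ordered factor conversion above gives us, here is a small sketch on a made-up character vector. The level order is the one used in the chunk above, and `toy_income` is a hypothetical name.

```r
# Made-up income groups, including one missing value.
toy_income <- c("Low income", "High income", "Upper middle income", NA)

toy_income_f <- factor(
  toy_income,
  levels = c("High income", "Upper middle income",
             "Lower middle income", "Low income"),
  ordered = TRUE
)

# With this level order, "High income" is the first (smallest) level,
# so `<` reads as "is in a richer group than".
toy_income_f[2] < toy_income_f[1]

# Count each level, keeping the missing value visible.
table(toy_income_f, useNA = "ifany")
```

Because the factor is ordered, comparisons and sorting follow the income order rather than the alphabetical order of the labels.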


##	Tidy & Manipulate Data I 

Explain why your data (or one of the data sets) doesn’t conform the tidy data principles (minimum requirement #5). Apply the required steps to reshape the data into a tidy format. In addition to the R codes and outputs, explain everything that you do in this step.


```{r}
# This is the R chunk for the Tidy & Manipulate Data I 
# The GDP table is in wide format: the year columns "2014" to "2018" are values of a
# single variable (year), not variables themselves, so the data are not tidy.
# Gathering them into long format also lets us compute a per-country mean over
# "2014" to "2018", which is used to replace missing 2018 GDP values.

ggdp <- df1 %>%
  gather("2014":"2018", key = "year", value = "gdp",na.rm=TRUE)

# Mean GDP per country over the available years (column 5 is gdp)
dfmean <- aggregate(ggdp[, 5], list(ggdp$`Country Code`), mean)

# Rename the columns of dfmean
dfmean <- dfmean %>% 
  rename(`Country Code` = Group.1, meangdp = gdp)

# Replace missing 2018 values with the mean calculated above.
# Columns are referenced inside mutate() directly rather than via df12018$...
df12018 <- df12018 %>%
  left_join(dfmean, by = c("Country Code")) %>%
  mutate(`2018` = ifelse(is.na(`2018`), meangdp, `2018`)) %>%
  select(`Country Code`, `2018`)

```
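The imputation rule above (use the 2018 value if present, otherwise the country's mean over the available years) can be checked on a toy table. This is a base R sketch with made-up values, and `toy_gdp_years` is a hypothetical name. It uses `rowMeans()` on the wide table instead of the gather and aggregate steps, which gives the same mean per country.

```r
# Toy wide table with some missing years (values are made up).
toy_gdp_years <- data.frame(
  code = c("AAA", "BBB", "CCC"),
  `2016` = c(10, 20, NA),
  `2017` = c(12, 22, NA),
  `2018` = c(NA, 24, NA),
  check.names = FALSE
)

year_cols <- c("2016", "2017", "2018")

# Mean over the years that are available for each country.
# A country with no data at all gets NaN, which is.na() also treats as missing.
row_mean <- rowMeans(toy_gdp_years[, year_cols], na.rm = TRUE)

# Keep the observed 2018 value; fall back to the mean only when it is missing.
toy_gdp_years$gdp_2018 <- ifelse(is.na(toy_gdp_years[["2018"]]),
                                 row_mean,
                                 toy_gdp_years[["2018"]])
toy_gdp_years
```

Country "AAA" receives the mean of its 2016 and 2017 values, "BBB" keeps its observed value, and "CCC" stays missing because there is nothing to average.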

##	Tidy & Manipulate Data II 

Create/mutate at least one variable from the existing variables (minimum requirement #6). In addition to the R codes and outputs, explain everything that you do in this step.

```{r}

# This is the R chunk for the Tidy & Manipulate Data II 
# Missing crime values are handled differently from missing GDP values.
# GDP tends to increase over time, so I used the average of the most recent 5 years.
# The crime rate is relatively more stable and has more missing values, so I use
# the average of the 18 years from 2001 to 2018.

gcrime <- df2 %>%
  gather("2001":"2018", key = "year", value = "crime",na.rm=TRUE)

# Mean crime rate per country over the available years (column 3 is crime)
crimemean <- aggregate(gcrime[, 3], list(gcrime$`Country Code`), mean)

# Rename the columns of crimemean
crimemean <- crimemean %>% 
  rename(`Country Code` = Group.1, meancrime = crime)

# Replace missing 2018 values with the mean calculated above.
# Columns are referenced inside mutate() directly rather than via df22018$...
df22018 <- df22018 %>%
  left_join(crimemean, by = c("Country Code")) %>%
  mutate(`2018` = ifelse(is.na(`2018`), meancrime, `2018`)) %>%
  select(`Country Code`, `2018`)

```


##	Scan I 

Scan the data for missing values, special values and obvious errors (i.e. inconsistencies). In this step, you should fulfil the minimum requirement #7. In addition to the R codes and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.


```{r}

# This is the R chunk for the Scan I
# No special values or obvious errors were found in the data set.
# However, there are some missing values for GDP and crime rate, which were identified and
# replaced in the two previous steps, using a different averaging window for each.
# While scanning the data frame, I noticed that the IncomeGroup of some countries does not
# match their GDP per capita. For example, ALB (Albania) has a GDP per capita of $5284,
# which I think should not be in the "Upper middle income" category.
# In addition, IncomeGroup still has some missing values, so I will replace
# 'IncomeGroup' by creating a new ranking that divides 'GDP PER CAPITAL' into 4 categories:
# 4 represents the highest GDP per capita category and 1 the lowest.

# Join the 4 tables into one table

df <- df12018 %>%
  left_join(df22018,by = c("Country Code")) %>% 
  left_join(df3,by = c("Country Code"))%>% 
  left_join(df4,by = c("Country Code")) 

# Rename the columns of df

df <- df %>% 
  rename(GDP='2018.x','CRIME RATE'='2018.y', Population='2018')

# Create a new variable 'GDP PER CAPITAL' from 'GDP' / 'Population'
df <- df %>% mutate('GDP PER CAPITAL' = GDP /Population)

# Convert 'GDP PER CAPITAL' to integer (this truncates the decimal part)
df$'GDP PER CAPITAL' <-df$'GDP PER CAPITAL'%>% as.integer

# Create a new variable 'Rank' with ntile(), which divides 'GDP PER CAPITAL' into 4 levels.
df$'Rank'<- ntile(df$'GDP PER CAPITAL',4)

# Convert the rank from number to factor
df$'Rank'<- as_factor(df$'Rank' )

head(df)

# There are still some missing values, but they won't affect our analysis, so we drop them
df <- df%>% drop_na()

```
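The chunk above states that no special values were found. One way such a check could be made explicit is sketched below on toy data. `count_special` is a hypothetical helper and the values are made up; it counts `NA`, `NaN` and infinite values per column.

```r
# Toy data: one missing GDP and one zero population (values are made up).
toy_scan <- data.frame(
  gdp        = c(1e9, NA, 5e8, 2e9),
  population = c(1e6, 2e6, 0, 4e6)
)

# Dividing by a zero population produces Inf, a special value.
toy_scan$gdp_per_capita <- toy_scan$gdp / toy_scan$population

# Hypothetical helper: count NA, NaN and infinite values in a numeric vector.
count_special <- function(x) {
  c(na       = sum(is.na(x) & !is.nan(x)),
    nan      = sum(is.nan(x)),
    infinite = sum(is.infinite(x)))
}

special_counts <- sapply(toy_scan, count_special)
special_counts
```

A table of zeros in the `nan` and `infinite` rows would support the statement that the data contain no special values, while the `na` row shows how many rows `drop_na()` may remove.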


##	Scan II

Scan the numeric data for outliers. In this step, you should fulfil the minimum requirement #8. In addition to the R codes and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.

```{r}
# This is the R chunk for the Scan II
# Generate boxplot to check outliers

df$`CRIME RATE`<- as.numeric(df$`CRIME RATE`)

df$GDP %>% boxplot(main = "GDP 2018")

df$`CRIME RATE` %>%  boxplot(main = "CRIME RATE 2018")

df$Population %>%  boxplot(main = "POPULATION IN DIFFERENT COUNTRIES 2018")

df$`GDP PER CAPITAL`%>% boxplot(main = "GDP_PER_CAPITAL 2018")


# From these boxplots, we can see that there are many outliers in the data.
# z-scores give a second check: values with |z| > 3 are flagged as outliers.
z.scores <- df$`CRIME RATE` %>%  scores(type = "z") 
z.scores %>% summary()
which( abs(z.scores) >3 )

# One way to deal with outliers is capping (winsorising).
# Values beyond the 1.5 * IQR fences are replaced with the 5th or 95th percentile.
cap <- function(x){
  
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x }


df_sub <- df %>%  dplyr::select('GDP','CRIME RATE','Population','GDP PER CAPITAL') 

summary(df_sub)

df_capped <- sapply(df_sub, FUN = cap) 

summary(df_capped)

# However, instead of capping the values, we can also transform the variables
# to reduce the effect of outliers, for example with a log transformation.


```
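To make the capping rule concrete, the sketch below applies the same kind of function to a small made-up vector with one extreme value. The function is restated here so the example runs on its own, and `toy_values` is a hypothetical name.

```r
# Capping function with the same logic as in the chunk above.
cap <- function(x) {
  q <- unname(quantile(x, c(0.05, 0.25, 0.75, 0.95)))
  iqr <- q[3] - q[2]
  x[x < q[2] - 1.5 * iqr] <- q[1]   # below the lower fence: 5th percentile
  x[x > q[3] + 1.5 * iqr] <- q[4]   # above the upper fence: 95th percentile
  x
}

# Made-up vector: nine ordinary values and one extreme value.
toy_values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
toy_capped <- cap(toy_values)

# Only the extreme value changes; it is pulled down to the 95th percentile.
rbind(original = toy_values, capped = toy_capped)
```

In a small sample like this, the 95th percentile is itself pulled up by the extreme value, so the capped value can still be far from the rest of the data. This is one reason to compare capping with a transformation.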


##	Transform 

Apply an appropriate transformation for at least one of the variables. In addition to the R codes and outputs, explain everything that you do in this step. In this step, you should fulfil the minimum requirement #9.


```{r}
# This is the R chunk for the Transform Section


# As discussed above, a log transformation can reduce the effect of outliers.
# We can compare two boxplots of Population: one of the original data and
# one after the log transformation.
# After the transformation, the boxplot no longer shows outliers.

df$log_population <- log10(df$Population) 
df$log_population %>% boxplot(main = "Population 2018 WITH THE LOG TRANSFORMATION")
df$Population %>% boxplot(main = "Population 2018 WITHOUT THE LOG TRANSFORMATION")

# We can check the histograms to see whether the data are left- or right-skewed

hist(df$GDP)
hist(df$`CRIME RATE`)
hist(df$log_population)
hist(df$`GDP PER CAPITAL`)


# The Box-Cox transformation is used to make non-normal data closer to
# a normal distribution.
# We can use BoxCox to make the crime rate more normal. At the same time,
# it is useful to compare it with the log transformation to see which method is better.

df$boxcox_Crime<- BoxCox(df$`CRIME RATE`,lambda = "auto") 
hist(df$boxcox_Crime)
df$log_Crime <- log10(df$`CRIME RATE`)
hist(df$log_Crime)

# After comparing the histograms, I think the log transformation does a better job.

# Min-max normalisation rescales the data to the range 0 to 1 and is another way
# to normalise the dataset. As the output is long, I only show the code.
# R CODE: minmaxnormalise <- function(x){(x - min(x)) / (max(x) - min(x))}
# R CODE: lapply(df_sub, minmaxnormalise)
# There are other ways to transform the data, such as centring.

center_df <-scale(df_sub, center = TRUE, scale = FALSE) 
head(center_df)

# To keep the transformations consistent, I think log is one of the best options,
# because the data span a very wide range. For example,
# population ranges from about 10^4 to 10^9.

# Create new columns for GDP and GDP per capita using log10
df$log_GDP <- log10(df$GDP) 
df$log_GDPPC <- log10(df$`GDP PER CAPITAL`) 

# Create the final table for analysis after cleaning.
# Columns are selected by position, so this depends on the column order built above.
df_final <- df %>% dplyr::select(1,4,8,10,11,12,13)
head(df_final)
str(df_final)

```
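The min-max normalisation mentioned in the comments above can be sketched as follows. The function needs to be applied to a whole column at a time, because the minimum and maximum are taken over the column. `toy_mm` and its values are made up, and `minmax_normalise` is a hypothetical helper written for this example.

```r
# Hypothetical helper: rescale a numeric vector to the range 0 to 1.
minmax_normalise <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy data with very different scales (values are made up).
toy_mm <- data.frame(gdp = c(10, 20, 40), population = c(1e4, 1e6, 1e9))

# lapply() over the data frame applies the function column by column.
toy_scaled <- as.data.frame(lapply(toy_mm, minmax_normalise))
toy_scaled
```

After rescaling, each column has a minimum of 0 and a maximum of 1, but the shape of the distribution is unchanged. This is why I still prefer the log transformation for the skewed variables in this report.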


