This reports analyses and evaluates household economic status for various countries around the world between the year 2015 - 2019. Methods of analysis include using grammer of data manipulation, pipe operator, histograms and log10 transformation to combine and analyse this data. Findings show a positvely skewed dataset in variable EXPENDITURE and using log10 transformation provides a more symmetrical data set. Results of the data illustrates the household spending far exceeds the disposable income on average which is a real economic issue especially when crisis climates (such as 2020) occur. Recommendations include: * Collecting household disposable income data for 2019 so when data is joined it minimises NA rates. * Current climate data acquisition will be interesting to observe the damaged caused by crisis upon households
In order to complete this assessment, the following packages need to be installed:
# This is an R chunk for required packages
install.packages("outliers")
Error in install.packages : Updating loaded packages
install.packages("MVN")
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:
https://cran.rstudio.com/bin/windows/Rtools/
Installing package into 㤼㸱E:/Users/Terence/Documents/R/win-library/3.6㤼㸲
(as 㤼㸱lib㤼㸲 is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/MVN_5.8.zip'
Content type 'application/zip' length 392241 bytes (383 KB)
downloaded 383 KB
package ‘MVN’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Terence\AppData\Local\Temp\Rtmp0yVb1L\downloaded_packages
install.packages("outliers")
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:
https://cran.rstudio.com/bin/windows/Rtools/
Installing package into 㤼㸱E:/Users/Terence/Documents/R/win-library/3.6㤼㸲
(as 㤼㸱lib㤼㸲 is unspecified)
Warning in install.packages :
package ‘outliers’ is in use and will not be installed
install.packages("Hmisc")
Error in install.packages : Updating loaded packages
library(dplyr) # grammer of data manipulation eg. mutate(), filter(), summarise()
library(tidyr) # reshapes data
library(outliers) # shows outliers
library(Hmisc)
The current datasets follow on from Assignment 1 which looks at Household Economic Status for various countries around the world. Where the first Assignment looks at debt, the aim of this Assignment is to look at household income and spending.
The location of 2 datasets were sourced from OECD website specifically from the following links:
https://data.oecd.org/hha/household-disposable-income.htm#indicator-chart https://data.oecd.org/hha/household-spending.htm#indicator-chart
Dataset 1 - Household disposable income This dataset looks at gross household disposable income in US dollars per capita. It looks at years between 2015 to 2018 as this is the latest data available. They are segregated by countries around the world.
Dataset 2 - Household spending This dataset observes the total spending in US dollars in millions between 2015 to 2019 and is also separated by various countries.
In order to merge the 2 datasets into 1, the use of left_join function was applied and passed variable name "“ï..LOCATION” & “TIME”. This joined dataset 1 to dataset 2 by joining the location variable then the time variable. You can see that not all observations in one dataset match the second dataset meaning that there was one lot of data in one and not the other. For example, one data set has information in 2019 whereas the other doesn’t hence NA data.
NA’s and tidy of data will be addressed in the following sections of this report.
A clear description of data sets, their sources, and variable descriptions should be provided. In this section, you must also provide the R codes with outputs (head of data sets) that you used to import/read/scrape the data set. You need to fulfill steps #1-2 and merge at least two data sets to create the one you are going to work on. In addition to the R codes and outputs, you need to explain the steps that you have taken.
# head of datasets
hhdincome <- read.csv("hhdincome.csv")
hhdincome
hhspending <- read.csv("hhspending.csv")
hhspending
names(hhdincome)
[1] "ï..LOCATION" "INDICATOR" "SUBJECT" "MEASURE" "FREQUENCY" "TIME" "Value" "Flag.Codes"
names(hhspending)
[1] "ï..LOCATION" "INDICATOR" "SUBJECT" "MEASURE" "FREQUENCY" "TIME" "Value" "Flag.Codes"
total <- hhspending %>%
left_join(hhdincome, by = c("ï..LOCATION", "TIME"))
Column `攼㹦..LOCATION` joining factors with different levels, coercing to character vector
This area will look at the total object which will be passed through class(), str() and attributes() function.
When the total object is passed through the class() function, it illustrates that the 2 joined datasets is a dataframe. str() function provides a more insightful look into the dataset being that is outlines observations (252) and variables (14). The data set also includes various data types being factors, numerics, character and integer.
attributes() function access the ‘total’ object’s attributes (2). In this instance, you can see that the row names are values between 1 to 252, class being data frame and the dates are outlined for each variable.
The following section will look at tidying up the dataset.
# summarise variables and data structures
class(total)
[1] "data.frame"
str(total)
'data.frame': 252 obs. of 14 variables:
$ ï..LOCATION : chr "AUS" "AUS" "AUS" "AUS" ...
$ INDICATOR.x : Factor w/ 1 level "HHEXP": 1 1 1 1 1 1 1 1 1 1 ...
$ SUBJECT.x : Factor w/ 1 level "TOT": 1 1 1 1 1 1 1 1 1 1 ...
$ MEASURE.x : Factor w/ 1 level "MLN_USD": 1 1 1 1 1 1 1 1 1 1 ...
$ FREQUENCY.x : Factor w/ 1 level "A": 1 1 1 1 1 1 1 1 1 1 ...
$ TIME : int 2015 2016 2017 2018 2015 2016 2017 2018 2019 2015 ...
$ Value.x : num 656213 689858 711943 745618 217783 ...
$ Flag.Codes.x: Factor w/ 3 levels "","E","P": 1 1 1 1 1 1 1 1 1 1 ...
$ INDICATOR.y : Factor w/ 1 level "HHDI": 1 1 1 1 1 1 1 1 NA 1 ...
$ SUBJECT.y : Factor w/ 1 level "GROSSADJ": 1 1 1 1 1 1 1 1 NA 1 ...
$ MEASURE.y : Factor w/ 1 level "USD_CAP": 1 1 1 1 1 1 1 1 NA 1 ...
$ FREQUENCY.y : Factor w/ 1 level "A": 1 1 1 1 1 1 1 1 NA 1 ...
$ Value.y : num 37731 39031 39275 40237 34423 ...
$ Flag.Codes.y: Factor w/ 2 levels "","E": 1 1 1 1 1 1 1 1 NA 1 ...
attributes(total)
$row.names
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
[29] 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
[57] 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84
[85] 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
[113] 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
[141] 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168
[169] 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196
[197] 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224
[225] 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252
$class
[1] "data.frame"
$names
[1] "ï..LOCATION" "INDICATOR.x" "SUBJECT.x" "MEASURE.x" "FREQUENCY.x" "TIME" "Value.x"
[8] "Flag.Codes.x" "INDICATOR.y" "SUBJECT.y" "MEASURE.y" "FREQUENCY.y" "Value.y" "Flag.Codes.y"
This section will look at tidying up the dataset so as it is more digestable. In order to change the variable names in the dataset, I have created a vector of all relevant names with enclosed commas "“, for example”COUNTRY" replaces “ï..LOCATION”. This is then assigned to names() function and passing the ‘total’ object through allowing the newly created variable names to replace existing ones.
Secondly, as 2 datasets were joined together the columns (variables) were not in an order which flowed well for example, YEAR was in the middle of the dataset which I think would make more sense to move to the second column so as the reader can determine firstly the COUNTRY then the YEAR that this dataset is looking at.
In order to rearrange the dataset, I used the square brackets [] to subset the total object. To adjust the variables within the square brackets it’s require to manipulate date in the right hand side of the comma. In this instance, the columns are structed as 1,6,2,3,4,7,9,10,11,13 meaning column 6 has been subsetted to position 2, column 5, 8 and 12 have been removed. This was assigned to a new object beign ‘total1’.
The columns which have been removed is very similar to Assignment 1 which was the “FLAG” because it doesn’t have relevance in analysing data in this specific instance. According to the OECD User Guide (3) the Flag.codes variable appears alongside the numerical data and represents whether the numerical data is either E (Estimated value) F (Forecast value) P (Provisional). In this particular example of data which I’ve chosen, there where flags for specific observations for example E (Estimate value) for specific countries which is good to know but not directly relevant to the numeric analysis therefore I have decided to remove to tidy up the data for the end user.
Furthermore, a data type conversion was applied to COUNTRY converting it from a character to a factor data type because of how the data type sorts the levels alphabetically (4).
There weren’t any empty spaces missing except for whent the data was combined which created some NA’s but this will be addressed in the following section of the assignment.
All characters were in uppercase which also supports the idea that the original data was already quite tidy.
# This is an R chunk for Tidy & Manipulate Data I section
# Changing names in variables
names(total) <- c("COUNTRY", "SCALE_1", "SUBJECT_1", "MEASURE_1", "FREQUENCY", "YEAR", "SPEND_VALUE",
"CODES", "SCALE_2", "SUBJECT_2", "MEASURE_2", "FREQUENCY_2", "INCOME_VALUE", "CODES_2")
# including only necessary columns. Getting rid of frequence and codes
# moving year to second position
total1 <- total[,c(1,6,2,3,4,7,9,10,11,13)]
#changing COUNTRY to factor in order to sort by alphabet
total1$COUNTRY <- as.factor(total1$COUNTRY)
total1
NA
NA
NA
NA
In this section, the mutate() function is used to create a new column to determine the EXPENDITURE of each country per year.
In order to achieve this, the pipe operator from dplyr package is used to “chain multiple methods”; thinking of it more like “THEN” method (5). Therefore, by applying the pipe operator in the total1 dataset I am chaining this dataset to the next command being the mutate() function. Within the funtion, I create a new variable named EXPENDITURE and I want to look at the expense per country by subtracting the INCOME_VALUE by SPEND_VALUE to see how much expense each country uses per year. I round the float to 2 decimal places by passing the equation through a round() function.
To better visualise the SPEND_VALUE and INCOME_VALUE variables, I passed each variable through the hist() function which illustrates 2 historgrams. First being the graph of household SPEND_VALUE and the second being household INCOME_VALUE.
Printing our total2 will allow the end user to see the data set joined together with the new addtional column at the end named EXPENDITURE.
# This is an R chunk for Tidy & Manipulate Data II section
total2 <- total1 %>%
mutate(EXPENDITURE = round(SPEND_VALUE - INCOME_VALUE,2))
hist(total2$SPEND_VALUE)
hist(total2$INCOME_VALUE)
total2
NA
NA
NA
After joining the 2 datasets together and creating the EXPENDITURE variable, there are NA’s within this column because once combined not all datasets have values for a particular year. For example, there was no recorded INCOME_VALUE for AUT in 2019 but there was SPEND_VALUE recorded hence the NA.
In this dataset I am looking only at the variable EXPENDITURE and analysing this column therefore focusing only on omitting NA’s within this column. However if I was looking at the whole dataset then the methods and processes would differ. In order to only analyse complete values in the EXPENDITURE variable, I use the complete.case() function and pass EXPENDITURE through this function. This is of course after the comma and enclosed by square brackets within “table2” object.
You can see in the first table, all the rows containing NA’s in the EXPENDITURE variable has been removed and therefore looking at only whole values and omitted missing values. This allows the end use to observe and analyse full datasets.
By passing total2$EXPENDITURE through the hist() function you can better visualise the the data with the frequency of expenditure in total of all countries. Another method of visualisation is used in the last image which looks at density using ggplot. By passing total2 object through as the first argument and selecting EXPENDITURE for the x axis you can see that the graph illustates a similar output as the histogram but in a more fluid way.
I chose to display both the histogram and the density graph because of the gap between the measurements. The density provides a better visual aide for the gap between the data set and outliers.
is.na(total2$EXPENDITURE)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
[20] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[39] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
[58] TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[77] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
[96] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[115] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[134] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[153] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE
[172] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[191] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
[210] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[229] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[248] TRUE TRUE TRUE TRUE TRUE
hist(total2$EXPENDITURE, main = "Histogram of Expenditure", xlab = "Expenditure")
total2[,complete.cases("EXPENDITURE")]
#density curve
ggplot(total2, aes(x = EXPENDITURE)) + geom_density()
NA
NA
NA
This sections aims to illustrate any outliers in the EXPENDITURE variable by passing the variable through the boxplot() function. OUtliers are defined as observations that are far away from other observations (7). You can see in the first graph the outliers represented as dots outside of the boxplot.
Another method to detect outliers is the z-score which uses a standardised score of all observations (8). It is considered an outlier based on this score, if the absolute value is greater than 3 (8). In this section, I have used the scores() function to calculate the z-scores for the EXPENDITURE variable.
In addition, to find out how many outliers there are in this chosen data set, I used a pipe operator to z.scores then drilled down by using length() and which() function was above SD 3 to determine the outlier amount.
In this instance the z score appears to be zero indicating that it points to the mean.
boxplot(total2$EXPENDITURE)
#scan for outliers
z.scores <- total2$EXPENDITURE %>%
scores(type= "z")
z.scores %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
NA NA NA NaN NA NA 252
length(which(abs(z.scores)>3))
[1] 0
Log transformation specifically base 10 consolidates high values and spreads low values in order to reduce right skewness to obtain symmetry within the dataset (6). This section looks to apply log transformation to EXPENDITURE variable by passing total2$EXPENDITURE variable through log10() function and assigning it to a new variable name log_expenditure. The first graph illustrates a somewhat more symmetrical dataset.
This can also be seen in the boxplot dataset where outliers have been omitted illustrating a more homogenous dataset to be analysed.
# This is an R chunk for Transform section
log_expenditure <- log10(total2$EXPENDITURE)
NaNs produced
hist(log_expenditure, main = "Histogram of Expenditure", xlab = "Expenditure")
# boxplot without outliers
boxplot(log_expenditure)
REFERENCE
2.RDocumentation [Internet]. United States of America. Available from :https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/attributes
OECD [Internet]. Paris (France):OECD.Stat Web Browser User Guide;2013.WBOS User Guide(EN). Available from: https://stats.oecd.org/Content/themes/OECD/static/help/WBOS%20User%20Guide%20(EN).PDF
Github [Internet]. Fribourg (Switzerland). Avalable from: https://janhove.github.io/analysis/2016/08/18/ordering-factor-levels#:~:text=R%20tip%3A%20Ordering%20factor%20levels%20more%20easily,meaningful%20principles%20than%20alphabetical%20order.
Datacamp[Internet]. New York (United States of America): Datacamp Community Tutorials; 2020. Available from: https://www.datacamp.com/community/tutorials/pipe-r-tutorial
RMIT [Internet]. Melbourne (Australia): RMIT University. 4.4.1 Techniques to detect outliers: Available from: https://rmit.instructure.com/courses/70749/pages/4-dot-4-1-data-transformation-i?module_item_id=2312451
Atkinson, AC & Hawkins, DM 1981, ‘Identification of Outliers (Links to an external site.)’, Biometrics, vol. 37, no. 4, pp. 860-861.
RMIT [Internet]. Melbourne (Australia): RMIT University. 4.3.2 Data Transformation I: Available from: https://rmit.instructure.com/courses/70749/pages/4-dot-3-2-techniques-to-detect-outliers?module_item_id=2312446