Required packages

library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(outliers)
library(MVN)
library(forecast)
library(knitr)

Executive Summary

The object of this assignment is to preprocess the data set which are created by merging two different data sets: data_1 and data_2. Firstly, we made the two data sets tidy which including remove blank rows, rename inappropriate variable names and make the table structure tidy(merge every states columns into one colume and named States and separate the levels of type as independent columns). Next, merged two tidy datasets as a ‘fullData’ set and created a new column. Moreover, we checked missing value, special symbol and converted variable type. Then, we used self-defined function to recognise outliers and applied transformation finish our preprocessing work.

Data

This assignment merged two data sets: data_1(88 observations of 10 variables) and data_2(80 observations of 10 variables) which are inside of the ‘Summary tables’ dataset from Australian Bureau of Statistics (http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/4610.02015-16?OpenDocument). Summary tables, by State and Territory (Selected Indicators), 2008-09 to 2015-16 provides information on the water use and consumptive practices of households and key industries (Agriculture and Water Supply, Sewerage and Drainage Services).

The variable description is produced here from data_1 file:
* Year: factor.
* Type: Water consumption by industry (GL), Water consumption by households (GL), Total water consumption (GL), Gross State Product, chain volume measures (\(m), Population at 30 June 2016 ('000), Number of households ('000), Water use productivity (\)),Industry (a),Total economy (b),Total water consumption per capita (kL), Water consumption per household (kL).
* NSW: New South Wales state.
* VIC: Victoria state.
* QLD: Queensland state.
* SA: South Australia state.
* WA: Western Australia state.
* TAS: Tasmania state.
* NT: North Territory.
* ACT: Australian Capital Territory.

The variable description is produced here from data_2 file:
* Year: factor.
* Type: Agriculture, Aquaculture, forestry and fishing, Mining, Manufacturing, Electricity and gas supply, Water supply, sewerage and drainage services (a), Other industries, Water consumption by industry, Water consumption by households, Total water consumption.
* NSW: New South Wales state.
* VIC: Victoria state.
* QLD: Queensland state.
* SA: South Australia state.
* WA: Western Australia state.
* TAS: Tasmania state.
* NT: North Territory.
* ACT: Australian Capital Territory.

data_1 <- read_csv("data_1.csv") #read the summary data
data_2 <- read_csv("data_2.csv") #read the industry data
head(data_1)
head(data_2)
# With these untidy datasets, we have to make them tidy before merge.

Understand

The variable description is produced from data_1(88 observations of 10 variables):
* Year: integer.
* Type: character.
* NSW: numeric.
* VIC: numeric.
* QLD: numeric.
* SA: numeric.
* WA: numeric.
* TAS: numeric.
* NT: numeric.
* ACT: numeric.

The variable description is produced from data_2(80 observations of 10 variables):
* Year: integer.
* Type: character.
* NSW: numeric.
* VIC: numeric.
* QLD: numeric.
* SA: numeric.
* WA: numeric.
* TAS: character.
* NT: character.
* ACT: character.

str(data_1) # check the data type in each variable
## Classes 'tbl_df', 'tbl' and 'data.frame':    88 obs. of  10 variables:
##  $ years: int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ Type : chr  "Water consumption by industry (GL)" "Water consumption by households (GL)" "Total water consumption (GL)" "Gross State Product, chain volume measures ($m)" ...
##  $ NSW  : num  4655 562 5217 531323 7726 ...
##  $ VIC  : num  3335 384 3719 373624 6068 ...
##  $ QLD  : num  3584 375 3958 314569 4844 ...
##  $ SA   : num  1039 136 1175 101096 1708 ...
##  $ WA   : num  1080 335 1415 255214 2617 ...
##  $ TAS  : num  415 38 453 26039 519 ...
##  $ NT   : num  105 39 144 23648 245 ...
##  $ ACT  : num  20 31 51 36225 396 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 10
##   .. ..$ years: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Type : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ NSW  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_number" "collector"
##   .. ..$ VIC  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_number" "collector"
##   .. ..$ QLD  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_number" "collector"
##   .. ..$ SA   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_number" "collector"
##   .. ..$ WA   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_number" "collector"
##   .. ..$ TAS  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_number" "collector"
##   .. ..$ NT   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_number" "collector"
##   .. ..$ ACT  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_number" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"
str(data_2) # check the data type in each variable
## Classes 'tbl_df', 'tbl' and 'data.frame':    80 obs. of  10 variables:
##  $ years: int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ type : chr  "Agriculture" "Aquaculture, forestry and fishing" "Mining" "Manufacturing" ...
##  $ NSW  : num  3150 24 81 108 52 ...
##  $ VIC  : num  2326 26 4 138 127 ...
##  $ QLD  : num  2454 51 135 181 79 ...
##  $ SA   : num  804 7 46 49 2 ...
##  $ WA   : num  350 53 363 63 28 ...
##  $ TAS  : chr  "306" "9" "9" "36" ...
##  $ NT   : chr  "43" "-" "22" "-" ...
##  $ ACT  : chr  "-" "-" "1" "-" ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 10
##   .. ..$ years: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ type : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ NSW  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_number" "collector"
##   .. ..$ VIC  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_number" "collector"
##   .. ..$ QLD  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_number" "collector"
##   .. ..$ SA   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_number" "collector"
##   .. ..$ WA   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_number" "collector"
##   .. ..$ TAS  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ NT   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ ACT  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"
# We got character, numeric and integer types in both dataset, and apply the required data type conversions after tidy the data.

Tidy & Manipulate Data I

Check if the data conforms the tidy data principles. If your data is not in a tidy format, reshape your data into a tidy format (minimum requirement #5). In addition to the R codes and outputs, explain everything that you do in this step.

rowSums(is.na(data_1))
##  [1] 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0
## [36] 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0
## [71] 0 0 8 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0
# We found the "Water use productivity ($)" doesn't contain any value in every years, 
# so using `filter()` remove it before tidy data.

data_1 <- data_1 %>% filter(Type != "Water use productivity ($)")
## Warning: package 'bindrcpp' was built under R version 3.4.4
# Moverover, we found repeating imformation under the column `Type`. 
# `Population` be labelled as different names with year, so remane 
# it as a unique name before tidy to avoid duplication of information.
# Checking the levels of `Tpye`
unique(data_1$Type)
##  [1] "Water consumption by industry (GL)"             
##  [2] "Water consumption by households (GL)"           
##  [3] "Total water consumption (GL)"                   
##  [4] "Gross State Product, chain volume measures ($m)"
##  [5] "Population at 30 June 2016 ('000)"              
##  [6] "Number of households ('000)"                    
##  [7] "Industry (a)"                                   
##  [8] "Total economy (b)"                              
##  [9] "Total water consumption per capita (kL)"        
## [10] "Water consumption per household (kL)"           
## [11] "Population at 30 June 2015 ('000)"              
## [12] "Population at 30 June 2014 ('000)"              
## [13] "Population at 30 June 2013 ('000)"              
## [14] "Population at 30 June 2012 ('000)"              
## [15] "Population at 30 June 2011 ('000)"              
## [16] "Population at 30 June 2010 ('000)"              
## [17] "Population at 30 June 2009 ('000)"
# Using `grep()` search each `Type` column which include `Population` and rename it.
data_1$Type[grep("Population", data_1$Type)] <- "Population" 
unique(data_1$Type) # Checking all values are renamed successful.
##  [1] "Water consumption by industry (GL)"             
##  [2] "Water consumption by households (GL)"           
##  [3] "Total water consumption (GL)"                   
##  [4] "Gross State Product, chain volume measures ($m)"
##  [5] "Population"                                     
##  [6] "Number of households ('000)"                    
##  [7] "Industry (a)"                                   
##  [8] "Total economy (b)"                              
##  [9] "Total water consumption per capita (kL)"        
## [10] "Water consumption per household (kL)"
# After tidy data, each value in `Type` column becomes a new column, hence these values should be simplify.
# The following code renamed each level of `Type` :
data_1$Type[which(data_1$Type == "Water consumption by industry (GL)")]                 <- "Consumption_by_Industry"
data_1$Type[which(data_1$Type == "Water consumption by households (GL)")]               <- "Consumption_by_Households"
data_1$Type[which(data_1$Type == "Total water consumption (GL)")]                       <- "Total"
data_1$Type[which(data_1$Type == "Gross State Product, chain volume measures ($m)")]    <- "GSP"
data_1$Type[which(data_1$Type == "Number of households ('000)")]                        <- "Number_of_Households"
data_1$Type[which(data_1$Type == "Water use productivity ($)")]                         <- "Productivity"
data_1$Type[which(data_1$Type == "Industry (a)")]                                       <- "Industry"
data_1$Type[which(data_1$Type == "Total economy (b)")]                                  <- "Economy"
data_1$Type[which(data_1$Type == "Total water consumption per capita (kL)")]            <- "Consumption_per_Capita"
data_1$Type[which(data_1$Type == "Water consumption per household (kL)")]               <- "Consumption_per_Household"

# Checking all values are renamed successfully.
unique(data_1$Type) 
##  [1] "Consumption_by_Industry"   "Consumption_by_Households"
##  [3] "Total"                     "GSP"                      
##  [5] "Population"                "Number_of_Households"     
##  [7] "Industry"                  "Economy"                  
##  [9] "Consumption_per_Capita"    "Consumption_per_Household"
# The last step is tidy data using both `gather()` and `spread()` functions. 
# Using `gather()` to merge every states columns into one colume and named `States`.
# Using `spread()` to separate the levels of `Type` as independent columns.
tidyData_1 <- data_1 %>% 
              gather(NSW,VIC,QLD,SA,WA,TAS,NT,ACT, key="States", value="values") %>% 
              spread(key="Type", value="values")

# Checking dataset should be tidied now.
head(tidyData_1)
# Remove repeating variables which contained in the first dataset.
data_2 <- data_2 %>% filter(type != "Water consumption by industry" & 
                              type != "Water consumption by households" & 
                              type != "Total water consumption")

# Checking the levels of `type`
unique(data_2$type)
## [1] "Agriculture"                                     
## [2] "Aquaculture, forestry and fishing"               
## [3] "Mining"                                          
## [4] "Manufacturing"                                   
## [5] "Electricity and gas supply"                      
## [6] "Water supply, sewerage and drainage services (a)"
## [7] "Other industries"
# After tidy data, each value in `type` column becomes a new column, hence these values should be simplify.
# The following code renamed each level of `type` :
data_2$type[which(data_2$type == "Aquaculture, forestry and fishing")]                   <- "Aquaculture_Forestry_Fishing"
data_2$type[which(data_2$type == "Electricity and gas supply")]                          <- "Electricity_Gas"
data_2$type[which(data_2$type == "Water supply, sewerage and drainage services (a)")]    <- "Water_Supply_Sewerage"
data_2$type[which(data_2$type == "Other industries")]                                    <- "Other_Industries"

# Checking all values are renamed successful.
unique(data_2$type)
## [1] "Agriculture"                  "Aquaculture_Forestry_Fishing"
## [3] "Mining"                       "Manufacturing"               
## [5] "Electricity_Gas"              "Water_Supply_Sewerage"       
## [7] "Other_Industries"
# The last step is tidy data using both `gather()` and `spread()` functions. 
# Using `gather()` to merge every states columns into one colume and named `States`.
# Using `spread()` to separate the levels of `type` as independent columns.
tidyData_2 <- data_2 %>% 
              gather(NSW,VIC,QLD,SA,WA,TAS,NT,ACT, key="States", value="values") %>% 
              spread(key="type", value="values")

# Checking dataset should be tidied now.
head(tidyData_2)
# Checking the columns' name for data merging
names(tidyData_1)
##  [1] "years"                     "States"                   
##  [3] "Consumption_by_Households" "Consumption_by_Industry"  
##  [5] "Consumption_per_Capita"    "Consumption_per_Household"
##  [7] "Economy"                   "GSP"                      
##  [9] "Industry"                  "Number_of_Households"     
## [11] "Population"                "Total"
names(tidyData_2)
## [1] "years"                        "States"                      
## [3] "Agriculture"                  "Aquaculture_Forestry_Fishing"
## [5] "Electricity_Gas"              "Manufacturing"               
## [7] "Mining"                       "Other_Industries"            
## [9] "Water_Supply_Sewerage"
# Using `full_join()` function and matching both `years` and `states` columns to merge data. 
fullData <- tidyData_1 %>% full_join(tidyData_2, by=c("years", "States"))
# Checking dataset
head(fullData)

Tidy & Manipulate Data II

Create/mutate at least one variable from the existing variables (minimum requirement #6). In addition to the R codes and outputs, explain everything that you do in this step..

# In this section, we would like to create a column to show the proportion of water consumption of each state by years.
# First of all, create a table of total water consumption in each year.
Total <- fullData %>% group_by(years) %>% summarise(Total_in_Year = sum(Total))
# Secondly, using `left_join()` function merge the total water consumption for each year.
fullData <- fullData %>% left_join(Total, by="years")
# Finally, calculate the proportion and using `round()` function specified decimal.
fullData$Proportion_of_Consumption <- round(fullData$Total/fullData$Total_in_Year,3)
# Checking dataset
head(fullData)

Scan I

Scan the data for missing values, inconsistencies and obvious errors. In this step, you should fulfil the minimum requirement #7. In addition to the R codes and outputs, explain how you dealt with these values.

# Checking missing value within dataset
sum(is.na(fullData))
## [1] 0
# The output shows there is no missing value in the dataset, however, when print out the values some special symbols appeared.
fullData$Agriculture
##  [1] "2"    "2001" "35"   "2144" "788"  "264"  "1517" "325"  "1"    "2127"
## [11] "49"   "1928" "720"  "286"  "1553" "324"  "1"    "2861" "52"   "1878"
## [21] "646"  "185"  "1234" "319"  "-"    "4037" "52"   "2014" "661"  "203" 
## [31] "2041" "315"  "1"    "6210" "45"   "2463" "759"  "246"  "2607" "288" 
## [41] "1"    "4983" "50"   "2761" "691"  "235"  "2572" "296"  "-"    "4120"
## [51] "56"   "2250" "709"  "228"  "2738" "308"  "-"    "3150" "43"   "2454"
## [61] "804"  "306"  "2326" "350"
sum(fullData == "-")
## [1] 43
# It means this state does not have that type of industry.
# Therefore, to deal with them, replacing as `NA` to indicate there is no such industry.
# It is better than replaced by `0` when calculate the mean water consumption. 
# In this case, `0` means that state has this industry without any water consumption.
fullData[fullData == "-"] <- NA

# In total, we have 43 "missing values" but do not actually means missing.
sum(is.na(fullData))
## [1] 43
# After replacing missing value, we could apply the required data type conversions.
str(fullData)
## Classes 'tbl_df', 'tbl' and 'data.frame':    64 obs. of  21 variables:
##  $ years                       : int  2009 2009 2009 2009 2009 2009 2009 2009 2010 2010 ...
##  $ States                      : chr  "ACT" "NSW" "NT" "QLD" ...
##  $ Consumption_by_Households   : num  27 548 39 340 128 69 331 335 28 556 ...
##  $ Consumption_by_Industry     : num  21 4007 121 3001 1051 ...
##  $ Consumption_per_Capita      : num  135 646 708 772 733 925 549 608 130 605 ...
##  $ Consumption_per_Household   : num  200 210 574 213 199 340 163 403 204 210 ...
##  $ Economy                     : num  642 99 118 80 77 54 110 134 675 106 ...
##  $ GSP                         : num  30821 450461 18870 266578 90702 ...
##  $ Industry                    : num  1468 112 156 89 86 ...
##  $ Number_of_Households        : num  135 2607 68 1595 642 ...
##  $ Population                  : num  355 7054 226 4329 1609 ...
##  $ Total                       : num  48 4555 160 3341 1179 ...
##  $ Agriculture                 : chr  "2" "2001" "35" "2144" ...
##  $ Aquaculture_Forestry_Fishing: chr  NA "21" "1" "74" ...
##  $ Electricity_Gas             : chr  NA "92" "1" "79" ...
##  $ Manufacturing               : chr  NA "140" "22" "142" ...
##  $ Mining                      : chr  NA "67" "21" "120" ...
##  $ Other_Industries            : chr  "11" "350" "32" "212" ...
##  $ Water_Supply_Sewerage       : chr  "7" "1336" "9" "229" ...
##  $ Total_in_Year               : num  14061 14061 14061 14061 14061 ...
##  $ Proportion_of_Consumption   : num  0.003 0.324 0.011 0.238 0.084 0.033 0.21 0.097 0.003 0.32 ...
# From the structure of dataset we found the `States` column should convert to factor.
# In addition, the remaining columns with type of character should convert to numeric.

# Change 'States' into an ordered factor.
fullData$States <- factor(fullData$States, ordered=TRUE) 
# Change others character's columns into numeric.
fullData$Agriculture <- as.numeric(fullData$Agriculture)
fullData$Aquaculture_Forestry_Fishing <- as.numeric(fullData$Aquaculture_Forestry_Fishing)
fullData$Electricity_Gas <- as.numeric(fullData$Electricity_Gas)
fullData$Manufacturing <- as.numeric(fullData$Manufacturing)
fullData$Mining <- as.numeric(fullData$Mining)
fullData$Other_Industries <- as.numeric(fullData$Other_Industries)
fullData$Water_Supply_Sewerage <- as.numeric(fullData$Water_Supply_Sewerage)
# Checking structure of dataset
str(fullData)
## Classes 'tbl_df', 'tbl' and 'data.frame':    64 obs. of  21 variables:
##  $ years                       : int  2009 2009 2009 2009 2009 2009 2009 2009 2010 2010 ...
##  $ States                      : Ord.factor w/ 8 levels "ACT"<"NSW"<"NT"<..: 1 2 3 4 5 6 7 8 1 2 ...
##  $ Consumption_by_Households   : num  27 548 39 340 128 69 331 335 28 556 ...
##  $ Consumption_by_Industry     : num  21 4007 121 3001 1051 ...
##  $ Consumption_per_Capita      : num  135 646 708 772 733 925 549 608 130 605 ...
##  $ Consumption_per_Household   : num  200 210 574 213 199 340 163 403 204 210 ...
##  $ Economy                     : num  642 99 118 80 77 54 110 134 675 106 ...
##  $ GSP                         : num  30821 450461 18870 266578 90702 ...
##  $ Industry                    : num  1468 112 156 89 86 ...
##  $ Number_of_Households        : num  135 2607 68 1595 642 ...
##  $ Population                  : num  355 7054 226 4329 1609 ...
##  $ Total                       : num  48 4555 160 3341 1179 ...
##  $ Agriculture                 : num  2 2001 35 2144 788 ...
##  $ Aquaculture_Forestry_Fishing: num  NA 21 1 74 9 9 31 94 NA 23 ...
##  $ Electricity_Gas             : num  NA 92 1 79 2 NA 123 27 NA 69 ...
##  $ Manufacturing               : num  NA 140 22 142 84 50 143 60 NA 141 ...
##  $ Mining                      : num  NA 67 21 120 22 18 6 252 NA 61 ...
##  $ Other_Industries            : num  11 350 32 212 99 25 238 171 15 337 ...
##  $ Water_Supply_Sewerage       : num  7 1336 9 229 47 ...
##  $ Total_in_Year               : num  14061 14061 14061 14061 14061 ...
##  $ Proportion_of_Consumption   : num  0.003 0.324 0.011 0.238 0.084 0.033 0.21 0.097 0.003 0.32 ...

Scan II

Scan the numeric data for outliers. In this step, you should fulfil the minimum requirement #8. In addition to the R codes and outputs, explain how you dealt with these values.

# The `MVN` function could not performance well with such dataset, hence we decide to use `z scores` detecting outliers.
# Due to the size of dataset, we build a function to detect outliers for each numeric column.
showOutliers <- function(data,columns,type){ # the parameters include data, number of columns and the scores type.
        N = length(columns)    # calculate the length of total clumns to create the table of output.
        Columns = array(NA,N)  # create variable of table to show the column's name
        Min = array(NA,N)      # create variable of the minimum scores
        # to save time, the following two parameters could be ignored this time
        #Median = array(NA,N)  # create variable of the median of scores
        #Mean = array(NA,N)    # create variable of the mean of scores
        Max = array(NA,N)      # create variable of the maximum of scores
        Outliers = array(NA,N) # create variable of the outliers in total for each column
        count = 1              # create a variable to count the row number
      for (i in columns){      
              scores <- na.omit(data[,i]) %>% scores(type = type) # calculate scores for column i
              Min[count] = round(min(scores),4) # calculate minimum scores and distribute to `Min`
              #Median[count] = round(median(scores),4)  # calculate median scores and distribute to `Median`
              #Mean[count] = round(mean(scores),4)      # calculate mean scores and distribute to `Mean`
              Max[count] = round(max(scores),4)         # calculate maximum scores and distribute to `Max`
              Outliers[count] = length(which(abs(scores) >3))   # detecting outliers and distribute to `Outliers`
              Columns[count] = names(data)[i]           # distribute the column's name
              count = count + 1                         # move onto next row
      }
      # return a table to show the result of calculation
      return(as.data.frame(list(Columns=Columns, Outliers=Outliers, Min=Min, Max=Max)))
} 
# Using the function above to detect outliers with all numeric columns.
outliers <- showOutliers(data=fullData,columns = c(3:21), type = "z")
# Showing the first 6 columns with outliers
kable(outliers)
Columns Outliers Min Max
Consumption_by_Households 0 -1.1077 1.8479
Consumption_by_Industry 1 -0.9217 3.3715
Consumption_per_Capita 0 -2.2652 2.4851
Consumption_per_Household 0 -0.9709 2.8767
Economy 0 -0.6074 2.9678
GSP 0 -1.0388 2.0665
Industry 1 -0.4907 3.1657
Number_of_Households 0 -1.0580 1.8832
Population 0 -1.0288 1.8967
Total 1 -0.9458 3.2664
Agriculture 1 -0.8830 3.6031
Aquaculture_Forestry_Fishing 0 -1.1534 2.5578
Electricity_Gas 0 -1.0419 1.8935
Manufacturing 0 -1.5556 1.9085
Mining 1 -0.8405 3.6125
Other_Industries 0 -1.1593 2.3518
Water_Supply_Sewerage 0 -0.6961 2.9891
Total_in_Year 0 -1.2203 1.6266
Proportion_of_Consumption 0 -0.9818 2.5824
# From the output above, we found 5 outliers in total. 
# to deal with them one by one
z.scores <- fullData[,4] %>% scores(type = "z")
# check the row of the outlier in `Consumption_by_Industry`
which(abs(z.scores)>3) 
## [1] 34
# calculate the value again by `Total` minus `Consumption_by_Households` and it's equal.
fullData$Consumption_by_Industry[34] == fullData$Total[34]-fullData$Consumption_by_Households[34]
## [1] TRUE
# calculate the value again by summing up all industry's water consumption and it's equal. So, it is not an outlier.
fullData$Consumption_by_Industry[34] == sum(fullData[34,13:19])
## [1] TRUE
z.scores <- fullData[,12] %>% scores(type = "z")
# check the row of the outlier in `Total`
which(abs(z.scores)>3) 
## [1] 34
# calculate the value again by sumup water consumption by households and industry, it's equal. So, it is not an outlier.
fullData$Total[34] == fullData$Consumption_by_Households[34] + fullData$Consumption_by_Industry[34]
## [1] TRUE
z.scores <- na.omit(fullData[,13]) %>% scores(type = "z")
# check the row of the outlier in `Agriculture`
which(abs(z.scores)>3)
## [1] 33
# couse we removed `NA` when calculate z scores, hence that we should find the row of outlier in raw data.
na.omit(fullData$Agriculture)[33]
## [1] 6210
which(fullData$Agriculture == 6210)
## [1] 34
# calculate the value again by using total industry water consumption minus the sum of others industry and it's equal.
# So, it is not an outlier.
fullData$Agriculture[34] == fullData$Consumption_by_Industry[34] - sum(fullData[34,14:19])
## [1] TRUE
z.scores <- na.omit(fullData[,17]) %>% scores(type = "z")
# check the row of the outlier in `Mining`
which(abs(z.scores)>3) 
## [1] 49
# As we removed `NA` when calculate z scores, hence we should find the row of outlier in raw data.
na.omit(fullData$Mining)[49]
## [1] 452
which(fullData$Mining == 452)
## [1] 56
# calculate the value again by using total industry water consumption minus the sum of others industry and it's equal.
# So, it is not an outlier.
fullData$Mining[56] == fullData$Consumption_by_Industry[56] - sum(fullData[56,c(13:16,18:19)], na.rm = TRUE)
## [1] TRUE
z.scores <- fullData[,9] %>% scores(type = "z")
# found the row of the outlier in `Industry`
which(abs(z.scores)>3) 
## [1] 49
# We found that `ACT` keeps a very high amount of industry through these years.
ACT <- fullData %>% filter(States == "ACT") %>% select(Industry)
# When check the outlier in `ACT` alone, there is no outlier any more.
z.scores <- ACT %>% scores(type = "z")
which(abs(z.scores)>3) 
## integer(0)
# Therefore, it is not an outlier.

Transform

Apply an appropriate transformation for at least one of the variables. In addition to the R codes and outputs, explain everything that you do in this step. In this step, you should fulfil the minimum requirement #9.

# Checking the value of column `Manufacturing` whether it is normal distribution or not.
hist(fullData$Manufacturing)

# The histogram shows that it is far from normal distribution, hence apply log transformation to make it normal. 
manufacturing.log <- log(fullData$Manufacturing)
# After log transformation, the histogram still exist skewness. However, it is much better than raw data. 
hist(manufacturing.log)

NOTE: Follow the order outlined above in the report. Make sure your code is visible (within the margin of the page). Do not use View() to show your data instead give headers (using head() )

Any further or optional pre-processing tasks can be added to the template using an additional section in the R Markdown file. Please also provide the R codes, outputs and brief explanations on why and how you applied these tasks on the data.