library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(outliers)
library(MVN)
library(forecast)
library(knitr)
The object of this assignment is to preprocess the data set which are created by merging two different data sets: data_1 and data_2. Firstly, we made the two data sets tidy which including remove blank rows, rename inappropriate variable names and make the table structure tidy(merge every states columns into one colume and named States and separate the levels of type as independent columns). Next, merged two tidy datasets as a ‘fullData’ set and created a new column. Moreover, we checked missing value, special symbol and converted variable type. Then, we used self-defined function to recognise outliers and applied transformation finish our preprocessing work.
This assignment merged two data sets: data_1(88 observations of 10 variables) and data_2(80 observations of 10 variables) which are inside of the ‘Summary tables’ dataset from Australian Bureau of Statistics (http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/4610.02015-16?OpenDocument). Summary tables, by State and Territory (Selected Indicators), 2008-09 to 2015-16 provides information on the water use and consumptive practices of households and key industries (Agriculture and Water Supply, Sewerage and Drainage Services).
The variable description is produced here from data_1 file:
* Year: factor.
* Type: Water consumption by industry (GL), Water consumption by households (GL), Total water consumption (GL), Gross State Product, chain volume measures (\(m), Population at 30 June 2016 ('000), Number of households ('000), Water use productivity (\)),Industry (a),Total economy (b),Total water consumption per capita (kL), Water consumption per household (kL).
* NSW: New South Wales state.
* VIC: Victoria state.
* QLD: Queensland state.
* SA: South Australia state.
* WA: Western Australia state.
* TAS: Tasmania state.
* NT: North Territory.
* ACT: Australian Capital Territory.
The variable description is produced here from data_2 file:
* Year: factor.
* Type: Agriculture, Aquaculture, forestry and fishing, Mining, Manufacturing, Electricity and gas supply, Water supply, sewerage and drainage services (a), Other industries, Water consumption by industry, Water consumption by households, Total water consumption.
* NSW: New South Wales state.
* VIC: Victoria state.
* QLD: Queensland state.
* SA: South Australia state.
* WA: Western Australia state.
* TAS: Tasmania state.
* NT: North Territory.
* ACT: Australian Capital Territory.
data_1 <- read_csv("data_1.csv") #read the summary data
data_2 <- read_csv("data_2.csv") #read the industry data
head(data_1)
head(data_2)
# With these untidy datasets, we have to make them tidy before merge.
The variable description is produced from data_1(88 observations of 10 variables):
* Year: integer.
* Type: character.
* NSW: numeric.
* VIC: numeric.
* QLD: numeric.
* SA: numeric.
* WA: numeric.
* TAS: numeric.
* NT: numeric.
* ACT: numeric.
The variable description is produced from data_2(80 observations of 10 variables):
* Year: integer.
* Type: character.
* NSW: numeric.
* VIC: numeric.
* QLD: numeric.
* SA: numeric.
* WA: numeric.
* TAS: character.
* NT: character.
* ACT: character.
str(data_1) # check the data type in each variable
## Classes 'tbl_df', 'tbl' and 'data.frame': 88 obs. of 10 variables:
## $ years: int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ Type : chr "Water consumption by industry (GL)" "Water consumption by households (GL)" "Total water consumption (GL)" "Gross State Product, chain volume measures ($m)" ...
## $ NSW : num 4655 562 5217 531323 7726 ...
## $ VIC : num 3335 384 3719 373624 6068 ...
## $ QLD : num 3584 375 3958 314569 4844 ...
## $ SA : num 1039 136 1175 101096 1708 ...
## $ WA : num 1080 335 1415 255214 2617 ...
## $ TAS : num 415 38 453 26039 519 ...
## $ NT : num 105 39 144 23648 245 ...
## $ ACT : num 20 31 51 36225 396 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 10
## .. ..$ years: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Type : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ NSW : list()
## .. .. ..- attr(*, "class")= chr "collector_number" "collector"
## .. ..$ VIC : list()
## .. .. ..- attr(*, "class")= chr "collector_number" "collector"
## .. ..$ QLD : list()
## .. .. ..- attr(*, "class")= chr "collector_number" "collector"
## .. ..$ SA : list()
## .. .. ..- attr(*, "class")= chr "collector_number" "collector"
## .. ..$ WA : list()
## .. .. ..- attr(*, "class")= chr "collector_number" "collector"
## .. ..$ TAS : list()
## .. .. ..- attr(*, "class")= chr "collector_number" "collector"
## .. ..$ NT : list()
## .. .. ..- attr(*, "class")= chr "collector_number" "collector"
## .. ..$ ACT : list()
## .. .. ..- attr(*, "class")= chr "collector_number" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
str(data_2) # check the data type in each variable
## Classes 'tbl_df', 'tbl' and 'data.frame': 80 obs. of 10 variables:
## $ years: int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ type : chr "Agriculture" "Aquaculture, forestry and fishing" "Mining" "Manufacturing" ...
## $ NSW : num 3150 24 81 108 52 ...
## $ VIC : num 2326 26 4 138 127 ...
## $ QLD : num 2454 51 135 181 79 ...
## $ SA : num 804 7 46 49 2 ...
## $ WA : num 350 53 363 63 28 ...
## $ TAS : chr "306" "9" "9" "36" ...
## $ NT : chr "43" "-" "22" "-" ...
## $ ACT : chr "-" "-" "1" "-" ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 10
## .. ..$ years: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ type : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ NSW : list()
## .. .. ..- attr(*, "class")= chr "collector_number" "collector"
## .. ..$ VIC : list()
## .. .. ..- attr(*, "class")= chr "collector_number" "collector"
## .. ..$ QLD : list()
## .. .. ..- attr(*, "class")= chr "collector_number" "collector"
## .. ..$ SA : list()
## .. .. ..- attr(*, "class")= chr "collector_number" "collector"
## .. ..$ WA : list()
## .. .. ..- attr(*, "class")= chr "collector_number" "collector"
## .. ..$ TAS : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ NT : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ ACT : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
# We got character, numeric and integer types in both dataset, and apply the required data type conversions after tidy the data.
Check if the data conforms the tidy data principles. If your data is not in a tidy format, reshape your data into a tidy format (minimum requirement #5). In addition to the R codes and outputs, explain everything that you do in this step.
rowSums(is.na(data_1))
## [1] 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0
## [36] 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0
## [71] 0 0 8 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0
# We found the "Water use productivity ($)" doesn't contain any value in every years,
# so using `filter()` remove it before tidy data.
data_1 <- data_1 %>% filter(Type != "Water use productivity ($)")
## Warning: package 'bindrcpp' was built under R version 3.4.4
# Moverover, we found repeating imformation under the column `Type`.
# `Population` be labelled as different names with year, so remane
# it as a unique name before tidy to avoid duplication of information.
# Checking the levels of `Tpye`
unique(data_1$Type)
## [1] "Water consumption by industry (GL)"
## [2] "Water consumption by households (GL)"
## [3] "Total water consumption (GL)"
## [4] "Gross State Product, chain volume measures ($m)"
## [5] "Population at 30 June 2016 ('000)"
## [6] "Number of households ('000)"
## [7] "Industry (a)"
## [8] "Total economy (b)"
## [9] "Total water consumption per capita (kL)"
## [10] "Water consumption per household (kL)"
## [11] "Population at 30 June 2015 ('000)"
## [12] "Population at 30 June 2014 ('000)"
## [13] "Population at 30 June 2013 ('000)"
## [14] "Population at 30 June 2012 ('000)"
## [15] "Population at 30 June 2011 ('000)"
## [16] "Population at 30 June 2010 ('000)"
## [17] "Population at 30 June 2009 ('000)"
# Using `grep()` search each `Type` column which include `Population` and rename it.
data_1$Type[grep("Population", data_1$Type)] <- "Population"
unique(data_1$Type) # Checking all values are renamed successful.
## [1] "Water consumption by industry (GL)"
## [2] "Water consumption by households (GL)"
## [3] "Total water consumption (GL)"
## [4] "Gross State Product, chain volume measures ($m)"
## [5] "Population"
## [6] "Number of households ('000)"
## [7] "Industry (a)"
## [8] "Total economy (b)"
## [9] "Total water consumption per capita (kL)"
## [10] "Water consumption per household (kL)"
# After tidy data, each value in `Type` column becomes a new column, hence these values should be simplify.
# The following code renamed each level of `Type` :
data_1$Type[which(data_1$Type == "Water consumption by industry (GL)")] <- "Consumption_by_Industry"
data_1$Type[which(data_1$Type == "Water consumption by households (GL)")] <- "Consumption_by_Households"
data_1$Type[which(data_1$Type == "Total water consumption (GL)")] <- "Total"
data_1$Type[which(data_1$Type == "Gross State Product, chain volume measures ($m)")] <- "GSP"
data_1$Type[which(data_1$Type == "Number of households ('000)")] <- "Number_of_Households"
data_1$Type[which(data_1$Type == "Water use productivity ($)")] <- "Productivity"
data_1$Type[which(data_1$Type == "Industry (a)")] <- "Industry"
data_1$Type[which(data_1$Type == "Total economy (b)")] <- "Economy"
data_1$Type[which(data_1$Type == "Total water consumption per capita (kL)")] <- "Consumption_per_Capita"
data_1$Type[which(data_1$Type == "Water consumption per household (kL)")] <- "Consumption_per_Household"
# Checking all values are renamed successfully.
unique(data_1$Type)
## [1] "Consumption_by_Industry" "Consumption_by_Households"
## [3] "Total" "GSP"
## [5] "Population" "Number_of_Households"
## [7] "Industry" "Economy"
## [9] "Consumption_per_Capita" "Consumption_per_Household"
# The last step is tidy data using both `gather()` and `spread()` functions.
# Using `gather()` to merge every states columns into one colume and named `States`.
# Using `spread()` to separate the levels of `Type` as independent columns.
tidyData_1 <- data_1 %>%
gather(NSW,VIC,QLD,SA,WA,TAS,NT,ACT, key="States", value="values") %>%
spread(key="Type", value="values")
# Checking dataset should be tidied now.
head(tidyData_1)
# Remove repeating variables which contained in the first dataset.
data_2 <- data_2 %>% filter(type != "Water consumption by industry" &
type != "Water consumption by households" &
type != "Total water consumption")
# Checking the levels of `type`
unique(data_2$type)
## [1] "Agriculture"
## [2] "Aquaculture, forestry and fishing"
## [3] "Mining"
## [4] "Manufacturing"
## [5] "Electricity and gas supply"
## [6] "Water supply, sewerage and drainage services (a)"
## [7] "Other industries"
# After tidy data, each value in `type` column becomes a new column, hence these values should be simplify.
# The following code renamed each level of `type` :
data_2$type[which(data_2$type == "Aquaculture, forestry and fishing")] <- "Aquaculture_Forestry_Fishing"
data_2$type[which(data_2$type == "Electricity and gas supply")] <- "Electricity_Gas"
data_2$type[which(data_2$type == "Water supply, sewerage and drainage services (a)")] <- "Water_Supply_Sewerage"
data_2$type[which(data_2$type == "Other industries")] <- "Other_Industries"
# Checking all values are renamed successful.
unique(data_2$type)
## [1] "Agriculture" "Aquaculture_Forestry_Fishing"
## [3] "Mining" "Manufacturing"
## [5] "Electricity_Gas" "Water_Supply_Sewerage"
## [7] "Other_Industries"
# The last step is tidy data using both `gather()` and `spread()` functions.
# Using `gather()` to merge every states columns into one colume and named `States`.
# Using `spread()` to separate the levels of `type` as independent columns.
tidyData_2 <- data_2 %>%
gather(NSW,VIC,QLD,SA,WA,TAS,NT,ACT, key="States", value="values") %>%
spread(key="type", value="values")
# Checking dataset should be tidied now.
head(tidyData_2)
# Checking the columns' name for data merging
names(tidyData_1)
## [1] "years" "States"
## [3] "Consumption_by_Households" "Consumption_by_Industry"
## [5] "Consumption_per_Capita" "Consumption_per_Household"
## [7] "Economy" "GSP"
## [9] "Industry" "Number_of_Households"
## [11] "Population" "Total"
names(tidyData_2)
## [1] "years" "States"
## [3] "Agriculture" "Aquaculture_Forestry_Fishing"
## [5] "Electricity_Gas" "Manufacturing"
## [7] "Mining" "Other_Industries"
## [9] "Water_Supply_Sewerage"
# Using `full_join()` function and matching both `years` and `states` columns to merge data.
fullData <- tidyData_1 %>% full_join(tidyData_2, by=c("years", "States"))
# Checking dataset
head(fullData)
Create/mutate at least one variable from the existing variables (minimum requirement #6). In addition to the R codes and outputs, explain everything that you do in this step..
# In this section, we would like to create a column to show the proportion of water consumption of each state by years.
# First of all, create a table of total water consumption in each year.
Total <- fullData %>% group_by(years) %>% summarise(Total_in_Year = sum(Total))
# Secondly, using `left_join()` function merge the total water consumption for each year.
fullData <- fullData %>% left_join(Total, by="years")
# Finally, calculate the proportion and using `round()` function specified decimal.
fullData$Proportion_of_Consumption <- round(fullData$Total/fullData$Total_in_Year,3)
# Checking dataset
head(fullData)
Scan the data for missing values, inconsistencies and obvious errors. In this step, you should fulfil the minimum requirement #7. In addition to the R codes and outputs, explain how you dealt with these values.
# Checking missing value within dataset
sum(is.na(fullData))
## [1] 0
# The output shows there is no missing value in the dataset, however, when print out the values some special symbols appeared.
fullData$Agriculture
## [1] "2" "2001" "35" "2144" "788" "264" "1517" "325" "1" "2127"
## [11] "49" "1928" "720" "286" "1553" "324" "1" "2861" "52" "1878"
## [21] "646" "185" "1234" "319" "-" "4037" "52" "2014" "661" "203"
## [31] "2041" "315" "1" "6210" "45" "2463" "759" "246" "2607" "288"
## [41] "1" "4983" "50" "2761" "691" "235" "2572" "296" "-" "4120"
## [51] "56" "2250" "709" "228" "2738" "308" "-" "3150" "43" "2454"
## [61] "804" "306" "2326" "350"
sum(fullData == "-")
## [1] 43
# It means this state does not have that type of industry.
# Therefore, to deal with them, replacing as `NA` to indicate there is no such industry.
# It is better than replaced by `0` when calculate the mean water consumption.
# In this case, `0` means that state has this industry without any water consumption.
fullData[fullData == "-"] <- NA
# In total, we have 43 "missing values" but do not actually means missing.
sum(is.na(fullData))
## [1] 43
# After replacing missing value, we could apply the required data type conversions.
str(fullData)
## Classes 'tbl_df', 'tbl' and 'data.frame': 64 obs. of 21 variables:
## $ years : int 2009 2009 2009 2009 2009 2009 2009 2009 2010 2010 ...
## $ States : chr "ACT" "NSW" "NT" "QLD" ...
## $ Consumption_by_Households : num 27 548 39 340 128 69 331 335 28 556 ...
## $ Consumption_by_Industry : num 21 4007 121 3001 1051 ...
## $ Consumption_per_Capita : num 135 646 708 772 733 925 549 608 130 605 ...
## $ Consumption_per_Household : num 200 210 574 213 199 340 163 403 204 210 ...
## $ Economy : num 642 99 118 80 77 54 110 134 675 106 ...
## $ GSP : num 30821 450461 18870 266578 90702 ...
## $ Industry : num 1468 112 156 89 86 ...
## $ Number_of_Households : num 135 2607 68 1595 642 ...
## $ Population : num 355 7054 226 4329 1609 ...
## $ Total : num 48 4555 160 3341 1179 ...
## $ Agriculture : chr "2" "2001" "35" "2144" ...
## $ Aquaculture_Forestry_Fishing: chr NA "21" "1" "74" ...
## $ Electricity_Gas : chr NA "92" "1" "79" ...
## $ Manufacturing : chr NA "140" "22" "142" ...
## $ Mining : chr NA "67" "21" "120" ...
## $ Other_Industries : chr "11" "350" "32" "212" ...
## $ Water_Supply_Sewerage : chr "7" "1336" "9" "229" ...
## $ Total_in_Year : num 14061 14061 14061 14061 14061 ...
## $ Proportion_of_Consumption : num 0.003 0.324 0.011 0.238 0.084 0.033 0.21 0.097 0.003 0.32 ...
# From the structure of dataset we found the `States` column should convert to factor.
# In addition, the remaining columns with type of character should convert to numeric.
# Change 'States' into an ordered factor.
fullData$States <- factor(fullData$States, ordered=TRUE)
# Change others character's columns into numeric.
fullData$Agriculture <- as.numeric(fullData$Agriculture)
fullData$Aquaculture_Forestry_Fishing <- as.numeric(fullData$Aquaculture_Forestry_Fishing)
fullData$Electricity_Gas <- as.numeric(fullData$Electricity_Gas)
fullData$Manufacturing <- as.numeric(fullData$Manufacturing)
fullData$Mining <- as.numeric(fullData$Mining)
fullData$Other_Industries <- as.numeric(fullData$Other_Industries)
fullData$Water_Supply_Sewerage <- as.numeric(fullData$Water_Supply_Sewerage)
# Checking structure of dataset
str(fullData)
## Classes 'tbl_df', 'tbl' and 'data.frame': 64 obs. of 21 variables:
## $ years : int 2009 2009 2009 2009 2009 2009 2009 2009 2010 2010 ...
## $ States : Ord.factor w/ 8 levels "ACT"<"NSW"<"NT"<..: 1 2 3 4 5 6 7 8 1 2 ...
## $ Consumption_by_Households : num 27 548 39 340 128 69 331 335 28 556 ...
## $ Consumption_by_Industry : num 21 4007 121 3001 1051 ...
## $ Consumption_per_Capita : num 135 646 708 772 733 925 549 608 130 605 ...
## $ Consumption_per_Household : num 200 210 574 213 199 340 163 403 204 210 ...
## $ Economy : num 642 99 118 80 77 54 110 134 675 106 ...
## $ GSP : num 30821 450461 18870 266578 90702 ...
## $ Industry : num 1468 112 156 89 86 ...
## $ Number_of_Households : num 135 2607 68 1595 642 ...
## $ Population : num 355 7054 226 4329 1609 ...
## $ Total : num 48 4555 160 3341 1179 ...
## $ Agriculture : num 2 2001 35 2144 788 ...
## $ Aquaculture_Forestry_Fishing: num NA 21 1 74 9 9 31 94 NA 23 ...
## $ Electricity_Gas : num NA 92 1 79 2 NA 123 27 NA 69 ...
## $ Manufacturing : num NA 140 22 142 84 50 143 60 NA 141 ...
## $ Mining : num NA 67 21 120 22 18 6 252 NA 61 ...
## $ Other_Industries : num 11 350 32 212 99 25 238 171 15 337 ...
## $ Water_Supply_Sewerage : num 7 1336 9 229 47 ...
## $ Total_in_Year : num 14061 14061 14061 14061 14061 ...
## $ Proportion_of_Consumption : num 0.003 0.324 0.011 0.238 0.084 0.033 0.21 0.097 0.003 0.32 ...
Scan the numeric data for outliers. In this step, you should fulfil the minimum requirement #8. In addition to the R codes and outputs, explain how you dealt with these values.
# The `MVN` function could not performance well with such dataset, hence we decide to use `z scores` detecting outliers.
# Due to the size of dataset, we build a function to detect outliers for each numeric column.
showOutliers <- function(data,columns,type){ # the parameters include data, number of columns and the scores type.
N = length(columns) # calculate the length of total clumns to create the table of output.
Columns = array(NA,N) # create variable of table to show the column's name
Min = array(NA,N) # create variable of the minimum scores
# to save time, the following two parameters could be ignored this time
#Median = array(NA,N) # create variable of the median of scores
#Mean = array(NA,N) # create variable of the mean of scores
Max = array(NA,N) # create variable of the maximum of scores
Outliers = array(NA,N) # create variable of the outliers in total for each column
count = 1 # create a variable to count the row number
for (i in columns){
scores <- na.omit(data[,i]) %>% scores(type = type) # calculate scores for column i
Min[count] = round(min(scores),4) # calculate minimum scores and distribute to `Min`
#Median[count] = round(median(scores),4) # calculate median scores and distribute to `Median`
#Mean[count] = round(mean(scores),4) # calculate mean scores and distribute to `Mean`
Max[count] = round(max(scores),4) # calculate maximum scores and distribute to `Max`
Outliers[count] = length(which(abs(scores) >3)) # detecting outliers and distribute to `Outliers`
Columns[count] = names(data)[i] # distribute the column's name
count = count + 1 # move onto next row
}
# return a table to show the result of calculation
return(as.data.frame(list(Columns=Columns, Outliers=Outliers, Min=Min, Max=Max)))
}
# Using the function above to detect outliers with all numeric columns.
outliers <- showOutliers(data=fullData,columns = c(3:21), type = "z")
# Showing the first 6 columns with outliers
kable(outliers)
| Columns | Outliers | Min | Max |
|---|---|---|---|
| Consumption_by_Households | 0 | -1.1077 | 1.8479 |
| Consumption_by_Industry | 1 | -0.9217 | 3.3715 |
| Consumption_per_Capita | 0 | -2.2652 | 2.4851 |
| Consumption_per_Household | 0 | -0.9709 | 2.8767 |
| Economy | 0 | -0.6074 | 2.9678 |
| GSP | 0 | -1.0388 | 2.0665 |
| Industry | 1 | -0.4907 | 3.1657 |
| Number_of_Households | 0 | -1.0580 | 1.8832 |
| Population | 0 | -1.0288 | 1.8967 |
| Total | 1 | -0.9458 | 3.2664 |
| Agriculture | 1 | -0.8830 | 3.6031 |
| Aquaculture_Forestry_Fishing | 0 | -1.1534 | 2.5578 |
| Electricity_Gas | 0 | -1.0419 | 1.8935 |
| Manufacturing | 0 | -1.5556 | 1.9085 |
| Mining | 1 | -0.8405 | 3.6125 |
| Other_Industries | 0 | -1.1593 | 2.3518 |
| Water_Supply_Sewerage | 0 | -0.6961 | 2.9891 |
| Total_in_Year | 0 | -1.2203 | 1.6266 |
| Proportion_of_Consumption | 0 | -0.9818 | 2.5824 |
# From the output above, we found 5 outliers in total.
# to deal with them one by one
z.scores <- fullData[,4] %>% scores(type = "z")
# check the row of the outlier in `Consumption_by_Industry`
which(abs(z.scores)>3)
## [1] 34
# calculate the value again by `Total` minus `Consumption_by_Households` and it's equal.
fullData$Consumption_by_Industry[34] == fullData$Total[34]-fullData$Consumption_by_Households[34]
## [1] TRUE
# calculate the value again by summing up all industry's water consumption and it's equal. So, it is not an outlier.
fullData$Consumption_by_Industry[34] == sum(fullData[34,13:19])
## [1] TRUE
z.scores <- fullData[,12] %>% scores(type = "z")
# check the row of the outlier in `Total`
which(abs(z.scores)>3)
## [1] 34
# calculate the value again by sumup water consumption by households and industry, it's equal. So, it is not an outlier.
fullData$Total[34] == fullData$Consumption_by_Households[34] + fullData$Consumption_by_Industry[34]
## [1] TRUE
z.scores <- na.omit(fullData[,13]) %>% scores(type = "z")
# check the row of the outlier in `Agriculture`
which(abs(z.scores)>3)
## [1] 33
# couse we removed `NA` when calculate z scores, hence that we should find the row of outlier in raw data.
na.omit(fullData$Agriculture)[33]
## [1] 6210
which(fullData$Agriculture == 6210)
## [1] 34
# calculate the value again by using total industry water consumption minus the sum of others industry and it's equal.
# So, it is not an outlier.
fullData$Agriculture[34] == fullData$Consumption_by_Industry[34] - sum(fullData[34,14:19])
## [1] TRUE
z.scores <- na.omit(fullData[,17]) %>% scores(type = "z")
# check the row of the outlier in `Mining`
which(abs(z.scores)>3)
## [1] 49
# As we removed `NA` when calculate z scores, hence we should find the row of outlier in raw data.
na.omit(fullData$Mining)[49]
## [1] 452
which(fullData$Mining == 452)
## [1] 56
# calculate the value again by using total industry water consumption minus the sum of others industry and it's equal.
# So, it is not an outlier.
fullData$Mining[56] == fullData$Consumption_by_Industry[56] - sum(fullData[56,c(13:16,18:19)], na.rm = TRUE)
## [1] TRUE
z.scores <- fullData[,9] %>% scores(type = "z")
# found the row of the outlier in `Industry`
which(abs(z.scores)>3)
## [1] 49
# We found that `ACT` keeps a very high amount of industry through these years.
ACT <- fullData %>% filter(States == "ACT") %>% select(Industry)
# When check the outlier in `ACT` alone, there is no outlier any more.
z.scores <- ACT %>% scores(type = "z")
which(abs(z.scores)>3)
## integer(0)
# Therefore, it is not an outlier.
Apply an appropriate transformation for at least one of the variables. In addition to the R codes and outputs, explain everything that you do in this step. In this step, you should fulfil the minimum requirement #9.
# Checking the value of column `Manufacturing` whether it is normal distribution or not.
hist(fullData$Manufacturing)
# The histogram shows that it is far from normal distribution, hence apply log transformation to make it normal.
manufacturing.log <- log(fullData$Manufacturing)
# After log transformation, the histogram still exist skewness. However, it is much better than raw data.
hist(manufacturing.log)
NOTE: Follow the order outlined above in the report. Make sure your code is visible (within the margin of the page). Do not use View() to show your data instead give headers (using head() )
Any further or optional pre-processing tasks can be added to the template using an additional section in the R Markdown file. Please also provide the R codes, outputs and brief explanations on why and how you applied these tasks on the data.