Analysing Changes in Water Consumption for Households in Victoria

Are Households Using Water More Efficiently Over Time?

S3442140 Shishen Chen, S3744421 Shipren Jayadev

Last updated: 04 August, 2019

Introduction

Water is a precious and limited resource, so we should take steps to ensure that we do not use so much of it that its sources run out. In this report we investigate whether average household water consumption across Victoria has increased.

Introduction Cont.

Knowing whether or not consumption has increased can help us predict the measures that need to be taken to keep water consumption sustainable.

Problem Statement

This report investigates whether average water consumption increased between 2008 and 2009.
We will conduct a hypothesis test to determine whether water usage in Victoria changed significantly between the two years.

Data

The data was collected from an open data website: www.data.vic.gov.au/data/dataset/melbourne-water-use-by-postcode.

The data source meets the requirement of having a Creative Commons license.

#Loading the packages used throughout the analysis
library(readr)  # read_csv
library(dplyr)  # %>% and summarise
library(tidyr)  # gather
library(car)    # leveneTest

#Renaming and cleaning data for analysis
watertemporary <- read_csv("fileswateruse.csv")
water <- watertemporary[-1, ]  # drop the first row, which holds no usage values
names(water) <- c("Postcode", "Suburb", "Y08", "Y09", "Change (%)")
water$Y08 <- as.numeric(as.character(water$Y08))
water$Y09 <- as.numeric(as.character(water$Y09))

#Reshaping to long format (one row per postcode and year) for the tests
#The first row is dropped before gathering so it is not repeated for each year
water1 <- watertemporary[-1, ] %>% gather("2008", "2009", key = "year", value = "usage")
water1$year <- as.integer(water1$year)
water1$usage <- as.numeric(water1$usage)

#Obtaining variables for later use
mean08 <- mean(water$Y08)
mean09 <- mean(water$Y09)
std08 <- sd(water$Y08)
std09 <- sd(water$Y09)

#Creating column subset for summary statistics
cols <- c("Y08", "Y09")

Data Cont.

There are four main variables in this dataset: the postcode, consumption in 2008, consumption in 2009, and the percentage change between 2008 and 2009. Water consumption is measured in litres per household per day.

No substantial preprocessing was required, as the dataset was collected from a reputable source and is well presented as it is.

Only a few steps were needed to “clean up” the data and make it easier to analyse: dropping the first row, converting the water consumption columns from character to numeric, and renaming the headings.
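The cleaning steps described above can be sketched on a small stand-in data frame. This is only an illustration: the values, and the assumption that the raw first row holds text rather than usage figures, are hypothetical and not taken from the real file.

```r
# Toy stand-in for the raw file: every column arrives as character and
# the first row carries no usage values. (Hypothetical values only.)
raw <- data.frame(
  Postcode = c("Postcode", "3000", "3002"),
  Suburb   = c("Suburb", "Melbourne", "East Melbourne"),
  `2008`   = c("litres", "310", "402"),
  `2009`   = c("litres", "295", "390"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

clean <- raw[-1, ]                                     # drop the first row
names(clean) <- c("Postcode", "Suburb", "Y08", "Y09")  # rename the headings
clean$Y08 <- as.numeric(clean$Y08)                     # character -> numeric
clean$Y09 <- as.numeric(clean$Y09)

str(clean)
```

Dropping the text row before converting avoids the `NA` values (and the coercion warning) that `as.numeric` would otherwise produce for it.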

Descriptive Statistics and Visualisation

As the output below shows, water consumption in 2008 and 2009 follows a non-linear pattern across postcodes.

Both histograms show a very slight right skew: in both years, consumption in most postcodes sits close to the average, with a somewhat longer tail towards the maximum than towards the minimum.
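One way to put a number on the skew seen in a histogram is the moment-based sample skewness. The sketch below uses a hypothetical helper, `skewness()`, and made-up values rather than the real data; other definitions of skewness apply small-sample corrections and give slightly different numbers.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}

# Hypothetical usage values for illustration only (not the real data)
usage <- c(255, 380, 410, 440, 450, 480, 520, 677)
skewness(usage)  # a positive value indicates a longer right tail
```

Applied to `water$Y08` and `water$Y09`, a small positive value would be consistent with the slight right skew described above.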

In the line graph, the red 2008 line sits consistently above the blue 2009 line, indicating that consumption decreased over the year. Whether or not this decrease is statistically significant is investigated further below.

#Dot plots for both years
plot(water$Y08,xlab = "Number of Postcodes", ylab = "Litres", main = "Year 2008")

plot(water$Y09,xlab = "Number of Postcodes", ylab = "Litres", main = "Year 2009")

#Histograms with normal curve overlay for both years
h1 <- hist(water$Y08, breaks=10, col="green", xlab="Litres", 
   main="Water Consumption Histogram (08)") 
xfit<-seq(min(water$Y08),max(water$Y08),length=211) 
yfit<-dnorm(xfit,mean=mean08,sd=std08) 
yfit <- yfit*diff(h1$mids[1:2])*length(water$Y08) 
lines(xfit, yfit, col="red", lwd=2)

h2 <- hist(water$Y09, breaks=10, col="blue", xlab="Litres", 
   main="Water Consumption Histogram (09)") 
xfit<-seq(min(water$Y09),max(water$Y09),length=211) 
yfit<-dnorm(xfit,mean=mean09,sd=std09) 
yfit <- yfit*diff(h2$mids[1:2])*length(water$Y09) 
lines(xfit, yfit, col="red", lwd=2)

#Line graph for both years
plot(water$Y08,type = "l",col = "red", xlab = "Postcode", ylab = "Water Consumption", 
   main = "Water Usage (Victoria)")
lines(water$Y09, type = "l", col = "blue")
legend("topleft", inset = 0.01, legend=c("2008", "2009"), col=c("red", "blue"), lty=1:1)

Descriptive Statistics Cont.

#Summary Table for Water Consumption in 2008
table08 <- water %>% summarise(Year = "2008",
                               Min = min(Y08, na.rm = TRUE),
                               Q1 = quantile(Y08, probs = .25, na.rm = TRUE),
                               Median = median(Y08, na.rm = TRUE),
                               Q3 = quantile(Y08, probs = .75, na.rm = TRUE),
                               Max = max(Y08, na.rm = TRUE),
                               Mean = mean(Y08, na.rm = TRUE),
                               SD = sd(Y08, na.rm = TRUE),
                               n = n(),
                               Missing = sum(is.na(Y08)))
knitr::kable(table08)

Year	Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
2008	255	402.5	442	480.5	677	443.0142	72.25474	211	0

#Summary Table for Water Consumption in 2009
table09 <- water %>% summarise(Year = "2009",
                               Min = min(Y09, na.rm = TRUE),
                               Q1 = quantile(Y09, probs = .25, na.rm = TRUE),
                               Median = median(Y09, na.rm = TRUE),
                               Q3 = quantile(Y09, probs = .75, na.rm = TRUE),
                               Max = max(Y09, na.rm = TRUE),
                               Mean = mean(Y09, na.rm = TRUE),
                               SD = sd(Y09, na.rm = TRUE),
                               n = n(),
                               Missing = sum(is.na(Y09)))
knitr::kable(table09)

Year	Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
2009	234	377.5	413	451.5	652	417.0427	67.59792	211	0

Hypothesis Testing

In this investigation, the two-sample t-test will be used. First, its assumption of equal variances is checked using Levene’s test, which has the following hypotheses: \[H_0: \sigma_1^2 = \sigma_2^2\] \[H_A: \sigma_1^2 \ne \sigma_2^2\]

leveneTest(usage ~ as.factor(year), data = water1)

The p-value for Levene’s test of equal variance for water usage between 2008 and 2009 was 0.38. As p > 0.05, we fail to reject \(H_0\), so it is reasonable to assume that the variances are equal.
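To show what Levene’s test measures, the sketch below reproduces its logic by hand in base R: it is a one-way ANOVA on each value’s absolute deviation from its group centre. The median is used as the centre here, which is the default in `car::leveneTest`. The toy data frame and its values are hypothetical, not the real dataset.

```r
# Hypothetical toy data: five usage values per year (not the real dataset)
toy <- data.frame(
  usage = c(400, 420, 450, 470, 500, 380, 410, 430, 460, 520),
  year  = factor(rep(c("2008", "2009"), each = 5))
)

# Step 1: absolute deviation of each value from its own group's median
toy$dev <- abs(toy$usage - ave(toy$usage, toy$year, FUN = median))

# Step 2: one-way ANOVA on those deviations; its F test is the Levene statistic
levene_by_hand <- anova(lm(dev ~ year, data = toy))
levene_by_hand
```

A large p-value in this table means the typical spread around the centre is similar in both groups, which is what the equal-variance assumption requires.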

The two-sample t-test has the following statistical hypotheses: \[H_0: \mu_1 - \mu_2 = 0\]

\[H_A: \mu_1 - \mu_2 \ne 0\] where \(\mu_1\) and \(\mu_2\) refer to the mean water usage in 2008 and 2009 respectively. The null hypothesis is that the difference between the mean water usage of the two years, treated here as independent samples, is 0. The alternative is that this difference is not equal to 0. As homogeneity of variance was supported by Levene’s test above, the equal-variance argument of the test is set to true.
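For reference, the pooled-variance form of the statistic that `t.test` computes when `var.equal = TRUE` can be written out as follows, where \(\bar{x}_i\), \(s_i\) and \(n_i\) denote the sample mean, standard deviation and size for each year (symbols introduced here for illustration):

\[s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\]

\[t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad df = n_1 + n_2 - 2\]

With 211 postcodes in each year, this gives \(df = 211 + 211 - 2 = 420\).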

t.test(usage~year,
       data = water1,
       var.equal = TRUE,
       alternative = "two.sided")

## 
##  Two Sample t-test
## 
## data:  usage by year
## t = 3.8128, df = 420, p-value = 0.000158
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  12.58231 39.36082
## sample estimates:
## mean in group 2008 mean in group 2009 
##           443.0142           417.0427

The test gives t = 3.81 with 420 degrees of freedom and a p-value of 0.00016. As this is less than 0.05, we reject \(H_0\).
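As a cross-check, the test output can be approximately reproduced from the summary statistics alone. The sketch below assumes the pooled-variance two-sample t-test and uses the means, standard deviations and sample sizes reported in the summary tables; the variable names are illustrative.

```r
# Summary statistics copied from the 2008 and 2009 summary tables
n1 <- 211; m1 <- 443.0142; s1 <- 72.25474   # 2008
n2 <- 211; m2 <- 417.0427; s2 <- 67.59792   # 2009

df      <- n1 + n2 - 2
sp      <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df)  # pooled SD
se      <- sp * sqrt(1 / n1 + 1 / n2)                        # SE of the difference
t_stat  <- (m1 - m2) / se
p_value <- 2 * pt(-abs(t_stat), df)                          # two-sided p-value
ci      <- (m1 - m2) + c(-1, 1) * qt(0.975, df) * se         # 95% CI

round(c(t = t_stat, df = df, p = p_value), 5)
round(ci, 2)
```

Up to rounding of the inputs, the results should agree with the `t.test` output above, which confirms that the reported statistic follows from the descriptive statistics.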

Discussion

The two-sample t-test shows that mean water usage in 2008 and 2009 is statistically different. Mean usage fell from 443.0 litres per household per day in 2008 to 417.0 in 2009, a difference of about 26 litres (95% confidence interval 12.58 to 39.36). In other words, 2009 saw a statistically significant decrease in water usage compared to 2008.

There is therefore no evidence of an increase in average water consumption between the two years.

References

Data.vic.gov.au. (2018). Melbourne water use by postcode - Victorian Government Data Directory. [online] Available at: https://www.data.vic.gov.au/data/dataset/melbourne-water-use-by-postcode [Accessed 28 Oct. 2018].