MATH2349 Data Wrangling

Required packages

library(readr)
library(readxl)
library(dplyr)
library(ggplot2)
library(deductive)
library(validate)
library(lubridate)
library(Hmisc)
library(tidyr)
library(outliers)

Executive Summary

Using the ‘rbind’ function I combined 12 weather data sets. The combined data set was then merged with a 13th to add seasons as a variable. ‘left_join’ function was used and they were merged by the date variable. However, prior to merging data sets I changed the date format so they matched. I removed the first empty column and changed column names to remove unwanted characters. I wish I also removed the spaces in the column names at this stage… I’ve then proceeded to create an extra variable for month as a factor by duplicating the date variable, separating day, month and year and then dropping day and year columns. I then converted the month number to a factor and added labels. I changed the data type for 22 variables. I believed I would achieve more accurate alternatives for missing characters if I completed this step before tidying the data, so I’ve gone ahead and replaced missing data with the mean for numeric and mode for categorical variables. I’ve then gone and tidied up the data based on the variables that had multiple values stored in rows. In hindsight I should’ve created separate data frames for each of the variables i.e. a separate data frame for MSL pressure, cloud, humidity, wind direction, wind speed and temperature. I eventually ended up with 105408 observations. I’ve used box plots to easily search for an identify outliers. Once outliers were identified, I created subsets of the data and used the capping method to replace outliers. Another hindsight moment but I think there’s an opportunity to search for and replace outliers before tidying the data. However, by doing this we would obviously lose the original values completely. Hope you enjoy the following 25 pages.

Data

13 Data sets overall

12 data sets from the Bureau of Meteorology (BoM) website (http://www.bom.gov.au/climate/dwo/IDCJDW3033.latest.shtml)
- From the above site you can access data for the last 14 months under the heading Other times and other places
- Data is 12 months of daily weather observations for the city Melbourne and range from 1/09/2019 - 31/08/2020
- Daily data can be downloaded month
- Data sets contains 22 variables and 366 observations after combining 12 months of data
  - Date: Date format DD/MM/YYYY
  - Minimum temperature (C): Minimum temperature in degrees celcius
  - Maximum temperature (C): Maximum temperature in degrees celcius
  - Rainfall (mm): Rainfall in millimetres
  - Evaporation (mm): “Class A” pan evaporation in millimetres
  - Sunshine (hours): Bright sunshine measured in hours
  - Direction of maximum wind gust: Direction of strongest gust using 16 compass points
  - Speed of maximum wind gust (km/h): Speed of strongest wind gust in kilometres per hour
  - Time of maximum wind gust: Time of strongest wind gust based on locat time hh:mm
  - 9am Temperature (°C): Temperature at 9 am in degrees celcius
  - 9am relative humidity (%): Relative humidity at 9 am in percent
  - 9am cloud amount (oktas): Fraction of sky obscured by cloud at 9 am measured in eighths
  - 9am wind direction: Wind direction averaged over 10 minutes prior to 9 am from compass points
  - 9am wind speed (km/h): Wind speed averaged over 10 minutes prior to 9 am in kilometres per hour
  - 9am MSL pressure (hPa): Atmospheric pressure reduced to mean sea level at 9 am in hectopascals
  - 3pmm Temperature (°C): Temperature at 3 pm in degrees celcius
  - 3pm relative humidity (%): Relative humidity at 3 pm in percent
  - 3pm cloud amount (oktas): Fraction of sky obscured by cloud at 3 pm measured in eighths
  - 3pm wind direction: Wind direction averaged over 10 minutes prior to 3 pm from compass points
  - 3pm wind speed (km/h): Wind speed averaged over 10 minutes prior to 3 pm in kilometres per hour
  - 3pm MSL pressure (hPa): Atmospheric pressure reduced to mean sea level at 3 pm in hectopascals
- Variable information and definitions have been taken from http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml.
1 Data set created
- Data sets contains 2 variables and 366 observations
  - Date: Date format DD/MM/YYYY
  - Season: four seasons of the year (1 = Spring, 2 = Summer, 3 = Autumn, 4 = Winter)
3 variables are added to data set as part of the data wrangling process
- City: City of observation if you ever wanted to compare with another City
- Month: 9 = September, 10 = October, 11 = November, 12 = December, 1 = January, 2 = February, 3 = March, 4 = April, 5 = May, 6 = June, 7 = July, 8 = August)
- Temp_norm: mutated column by dividing temperature value by the mean temperature

# Import data and skip the rows that don't form part of the data set
vicsep <- read_csv("VIC_IDCJDW3050.201909.csv", skip = 7) 
vicoct <- read_csv("VIC_IDCJDW3050.201910.csv", skip = 7) 
vicnov <- read_csv("VIC_IDCJDW3050.201911.csv", skip = 7) 
vicdec <- read_csv("VIC_IDCJDW3033.201912.csv", skip = 4) 
vicjan <- read_csv("VIC_IDCJDW3050.202001.csv", skip = 7)
vicfeb <- read_csv("VIC_IDCJDW3050.202002.csv", skip = 7)
vicmar <- read_csv("VIC_IDCJDW3050.202003.csv", skip = 7)
vicapr <- read_csv("VIC_IDCJDW3050.202004.csv", skip = 7)
vicmay <- read_csv("VIC_IDCJDW3050.202005.csv", skip = 7)
vicjun <- read_csv("VIC_IDCJDW3050.202006.csv", skip = 7)
vicjul <- read_csv("VIC_IDCJDW3050.202007.csv", skip = 7)
vicaug <- read_csv("VIC_IDCJDW3050.202008.csv", skip = 7)

# Example of one data set as they all have the same variables
head(vicsep)

# Import seasons data set
seasons <- read_csv("seasons.csv")
head(seasons)

# Combine the data sets using rbind 
viccomb <- rbind(vicsep, vicoct, vicnov, vicdec, vicjan, vicfeb, vicmar, vicapr, vicmay, vicjun, vicjul, vicaug)


# Drop the first column that won't be used 
viccomb$X1 <- NULL

# Add the City variable 
viccomb$City <- c("Melbourne")

head(viccomb)

# As we can see from the header of 'seasons' & 'viccomb' data sets the date formats don't match. This variable is key for our merge & we need to convert them into the same format before merging 

class(seasons$Date)

## [1] "character"

class(viccomb$Date)

## [1] "character"

seasons$Date <- as.Date(seasons$Date, format = "%d/%m/%Y")
viccomb$Date <- as.Date(viccomb$Date)

# Merge 'seasons' to 'viccomb'
vicseason <- left_join(viccomb, seasons, by = "Date")

head(vicseason)

# Change column names to remove characters
colnames(vicseason)[2] <- "Minimum temperature (C)"
colnames(vicseason)[3] <- "Maximum temperature (C)"
colnames(vicseason)[10] <- "9am Temperature (C)"
colnames(vicseason)[16] <- "3pm Temperature (C)"

# Add in Month variable
vicseason$Month = vicseason$Date
vsm <- vicseason %>% separate(Month, into = c("year", "Month", "day"), sep = "-")
vsm$year <- NULL
vsm$day <- NULL

head(vsm)

Understand

We can obtain the data types from the headers and will need to make some changes to make the data suitable for analysis. Below table highlights current formations and required formats in bold. As you’ll see, we need to change the data types for 22 variables, keeping in mind we’ve already made the date changes.

Variable	Current Data Type	Required Data Type
Date	date	date
Minimum temperature (C)	double	numeric
Maximum temperature (C)	double	numeric
Rainfall (mm)	double	numeric
Evaporation (mm)	double	numeric
Sunshine (hours)	double	numeric
Direction of maximum wind gust	character	factor
Speed of maximum wind gust (km/h)	number	factor
Time of maximum wind gust	time	time
9am Temperature (C)	double	numeric
9am relative humidity (%)	double	numeric
9am cloud amount (oktas)	double	numeric
9am wind direction	character	factor
9am wind speed (km/h)	character	numeric
9am MSL pressure (hPa)	double	numeric
3pm Temperature (C)	double	numeric
3pm relative humidity (%)	double	numeric
3pm cloud amount (oktas)	double	numeric
3pm wind direction	character	factor
3pm wind speed (km/h)	character	numeric
3pm MSL pressure (hPa)	double	numeric
City	character	factor
Season	double	factor
Month	character	factor

# Below code is to change data types and apply labels where appropriate
vsm$`Minimum temperature (C)` <- as.numeric(vsm$`Minimum temperature (C)`)
vsm$`Maximum temperature (C)` <- as.numeric(vsm$`Maximum temperature (C)`)
vsm$`Rainfall (mm)` <- as.numeric(vsm$`Rainfall (mm)`)
vsm$`Evaporation (mm)` <- as.numeric(vsm$`Evaporation (mm)`)
vsm$`Sunshine (hours)` <- as.numeric(vsm$`Sunshine (hours)`)
vsm$`Direction of maximum wind gust` <- as.factor(vsm$`Direction of maximum wind gust`)
vsm$`9am cloud amount (oktas)` <- as.numeric(vsm$`9am cloud amount (oktas)`)
vsm$`Speed of maximum wind gust (km/h)` <- as.numeric(vsm$`Speed of maximum wind gust (km/h)`)
vsm$`9am Temperature (C)` <- as.numeric(vsm$`9am Temperature (C)`)
vsm$`9am relative humidity (%)` <- as.numeric(vsm$`9am relative humidity (%)`)
vsm$`9am wind direction` <- as.factor(vsm$`9am wind direction`)
vsm$`9am wind speed (km/h)` <- as.numeric(vsm$`9am wind speed (km/h)`)
vsm$`9am MSL pressure (hPa)` <- as.numeric(vsm$`9am MSL pressure (hPa)`)
vsm$`3pm cloud amount (oktas)` <- as.numeric(vsm$`3pm cloud amount (oktas)`)
vsm$`3pm Temperature (C)` <- as.numeric(vsm$`3pm Temperature (C)`)
vsm$`3pm relative humidity (%)` <- as.numeric(vsm$`3pm relative humidity (%)`)
vsm$`3pm wind direction` <- as.factor(vsm$`3pm wind direction`)
vsm$`3pm wind speed (km/h)` <- as.numeric(vsm$`3pm wind speed (km/h)`)
vsm$`3pm MSL pressure (hPa)` <- as.numeric(vsm$`3pm MSL pressure (hPa)`)
vsm$Season <- as.factor(vsm$Season)
vsm$Season <- factor(vsm$Season, levels = c("1","2","3","4"), 
                     labels = c("Spring", "Summer", "Winter", "Autumn"), ordered = TRUE)
levels(vsm$Season)

## [1] "Spring" "Summer" "Winter" "Autumn"

vsm$City <- as.factor(vsm$City)
vsm$Month <- as.factor(vsm$Month)
vsm$Month <- factor(vsm$Month, levels = c("09","10","11","12","01","02","03","04","05","06","07","08"),
                      labels = c("September", "October", "November","December", "January", "February", 
                                 "March", "April", "May", "June", "July", "August"), ordered = TRUE)
levels(vsm$Month)

##  [1] "September" "October"   "November"  "December"  "January"   "February" 
##  [7] "March"     "April"     "May"       "June"      "July"      "August"

Scan I

Mixing it up here but for more accurate imputations I’ve decided to scan for missing values before tidying the data. This allows for more accurate mean estimates as when we tidy the data we’ll be joining variable from 9am to 3pm and min and max values.

# First lets check which columns are missing values
colSums(is.na(vsm))

##                              Date           Minimum temperature (C) 
##                                 0                                 0 
##           Maximum temperature (C)                     Rainfall (mm) 
##                                 0                                 0 
##                  Evaporation (mm)                  Sunshine (hours) 
##                                33                                32 
##    Direction of maximum wind gust Speed of maximum wind gust (km/h) 
##                                 5                                 5 
##         Time of maximum wind gust               9am Temperature (C) 
##                                 5                                 1 
##         9am relative humidity (%)          9am cloud amount (oktas) 
##                                 1                                35 
##                9am wind direction             9am wind speed (km/h) 
##                                13                                13 
##            9am MSL pressure (hPa)               3pm Temperature (C) 
##                                 2                                 0 
##         3pm relative humidity (%)          3pm cloud amount (oktas) 
##                                 0                                35 
##                3pm wind direction             3pm wind speed (km/h) 
##                                 2                                 2 
##            3pm MSL pressure (hPa)                              City 
##                                 0                                 0 
##                            Season                             Month 
##                                 0                                 0

# 14 columns have missing values. I'll start off first with the numeric variables and replace missing with mean values. 
vsm$`Evaporation (mm)`<- impute(vsm$`Evaporation (mm)`, fun = mean)
vsm$`Sunshine (hours)`<- impute(vsm$`Sunshine (hours)`, fun = mean)
vsm$`Speed of maximum wind gust (km/h)`<- impute(vsm$`Speed of maximum wind gust (km/h)`, fun = mean)
vsm$`9am Temperature (C)` <- impute(vsm$`9am Temperature (C)`, fun = mean)
vsm$`9am relative humidity (%)` <- impute(vsm$`9am relative humidity (%)`, fun = mean)
vsm$`9am cloud amount (oktas)`  <- impute(vsm$`9am cloud amount (oktas)`, fun = mean)
vsm$`9am wind speed (km/h)`<- impute(vsm$`9am wind speed (km/h)`, fun = mean)
vsm$`9am MSL pressure (hPa)`<- impute(vsm$`9am MSL pressure (hPa)`, fun = mean)
vsm$`3pm cloud amount (oktas)`  <- impute(vsm$`3pm cloud amount (oktas)`, fun = mean)
vsm$`3pm wind speed (km/h)`<- impute(vsm$`3pm wind speed (km/h)`, fun = mean)


# Now to replace the missing categorical/factor variables with the mode
vsm$`Direction of maximum wind gust` <- impute(vsm$`Direction of maximum wind gust`, fun = mode)
vsm$`9am wind direction` <- impute(vsm$`9am wind direction`, fun = mode)
vsm$`3pm wind direction`<- impute(vsm$`3pm wind direction`, fun = mode)
vsm$`Time of maximum wind gust`<- impute(vsm$`Time of maximum wind gust`, fun = mode)

# I'll be using the below function from Dr. Anil Dolgun's Module 5 course material to Check every numerical column whether they have infinite or NaN or NA values using a function called is.specialorNA
is.specialorNA <- function(x){
  if (is.numeric(x)) (is.infinite(x) | is.nan(x) | is.na(x))
}

# Now to apply this function to the data set.

sapply(vsm, function(x) sum( is.na(x) ))

##                              Date           Minimum temperature (C) 
##                                 0                                 0 
##           Maximum temperature (C)                     Rainfall (mm) 
##                                 0                                 0 
##                  Evaporation (mm)                  Sunshine (hours) 
##                                 0                                 0 
##    Direction of maximum wind gust Speed of maximum wind gust (km/h) 
##                                 0                                 0 
##         Time of maximum wind gust               9am Temperature (C) 
##                                 5                                 0 
##         9am relative humidity (%)          9am cloud amount (oktas) 
##                                 0                                 0 
##                9am wind direction             9am wind speed (km/h) 
##                                 0                                 0 
##            9am MSL pressure (hPa)               3pm Temperature (C) 
##                                 0                                 0 
##         3pm relative humidity (%)          3pm cloud amount (oktas) 
##                                 0                                 0 
##                3pm wind direction             3pm wind speed (km/h) 
##                                 0                                 0 
##            3pm MSL pressure (hPa)                              City 
##                                 0                                 0 
##                            Season                             Month 
##                                 0                                 0

head(vsm)

Tidy & Manipulate Data I

I get a bit nervous with this step as I go from 366 to 105408 observations, but here we go. Wind speed, temperature (c), wind direction, humidity, cloud amount and MSL pressure variables all have multiple values stored in rows, therefore the data set is not tidy.

# Below codes will aim to ensure each variable has it's own column by grouping every time the variable is measured and bring it into 2 columns being the categorical variable and the value. For example, combining the max speed of wind, 9am wind speed and 3pm wind speed: 

vsmgather <- vsm %>%  gather ('Speed of maximum wind gust (km/h)', '9am wind speed (km/h)', 
                              '3pm wind speed (km/h)', key = "Wind Measurement", value = "Wind km/h")
vsmgather1 <- vsmgather %>% gather ('Minimum temperature (C)','Maximum temperature (C)','9am Temperature (C)',
                                    '3pm Temperature (C)', key = "Temperature Measurement", value = "Degrees Celcius(C)")
vsmgather2 <- vsmgather1 %>% gather ('Direction of maximum wind gust', '9am wind direction','3pm wind direction',
                                     '3pm wind direction',key = 'Wind Direction', value = 'Direction')
vsmgather3 <- vsmgather2 %>% gather('9am relative humidity (%)','3pm relative humidity (%)', key = "Humidity", value = "%")
vsmgather4 <- vsmgather3 %>% gather('9am cloud amount (oktas)','3pm cloud amount (oktas)', key = "Cloud Amount", value = "Oktas")
vsmgather5 <- vsmgather4 %>% gather('9am MSL pressure (hPa)','3pm MSL pressure (hPa)', key = "MSL Pressure", value = "hPA")

# AS each line

Tidy & Manipulate Data II

Below we’ll create a column to calculate the Temperature normalisation by dividing the temperature value by the mean temperature value

# Mutate and divide the newly created 'Degrees Celcius (C)' variable by mean 

vsmtidy <- vsmgather5 %>% mutate(Temp_norm = `Degrees Celcius(C)`/mean(`Degrees Celcius(C)`))

head(vsmtidy)

Scan II

Below we’ll scan the variables for any outliers. We have 3 univariate outliers and 5 multivariate outliers. All scans will be done using box plots.

#During the imputation of missing values some of the data types changed to double. In order to create the box plots I'll need to change these back to numeric. 
vsmtidy$`Sunshine (hours)` <- as.numeric(vsmtidy$`Sunshine (hours)`)
vsmtidy$`Wind km/h` <- as.numeric(vsmtidy$`Wind km/h`)
vsmtidy$`Degrees Celcius(C)` <- as.numeric(vsmtidy$`Degrees Celcius(C)`)
vsmtidy$`%` <- as.numeric(vsmtidy$`%`)
vsmtidy$hPA <- as.numeric(vsmtidy$hPA)
vsmtidy$Oktas <- as.numeric(vsmtidy$Oktas)

# Now to check for Univariate outliers in Rainfall. As rainfall is best measured by monthly rainfall as opposed to daily I've created a dataframe with monthly totals. If i used daily figures a lot of values would end up with too many outliers and the data would be useless. Therefore by using using monthly values we end up with a more accurate picture. 

#Rainfall box plots
monthrain <- vsm %>% group_by(Month) %>% summarise(Monthly_Rainfall = sum(`Rainfall (mm)`, na.rm = TRUE))
monthrain$Monthly_Rainfall %>% boxplot(main="Rainfall", ylab="mm", col = "lightblue")

#Good start, no outliers for rainfall

#Evaporation box plot, similar to rainfall we're using monthly totals. 
monthevaporation <- vsm %>% group_by(Month) %>% summarise(Monthly_Evap = sum(`Evaporation (mm)`, na.rm = TRUE))
monthevaporation$Monthly_Evap %>% boxplot(main="Evaporation", ylab="mm", col = "cyan")

#No outliers for evaporation

#sunshine box plot
vsmtidy$`Sunshine (hours)` %>%  boxplot(main="Sunshine", ylab="hours", col = "coral")

#No outliers

#Now to check the Multivariate outliers

#Wind box plot
boxplot(vsmtidy$`Wind km/h` ~ vsmtidy$`Wind Measurement`, main="Wind by km/h", ylab = "km/h", xlab = "Measurement", col = "pink")

#Outliers detected we'll deal with them shortly

#Temperature box plot
boxplot(vsmtidy$`Degrees Celcius(C)` ~ vsmtidy$`Temperature Measurement`, main="Temperature by Degree Celcius", ylab = "Temperature (C)", xlab = "Temperature Measurement", col = "yellow")

#Outliers detected

#Humidity box plot
boxplot(vsmtidy$`%` ~ vsmtidy$Humidity, main="Humidity time by Humidity %", ylab = "Humidity %", xlab = "Humidity time", col = "orange")

#outliers detected

#MSL Pressure
boxplot(vsmtidy$hPA ~ vsmtidy$`MSL Pressure`, main="MSL Pressure by hPa", ylab = "hPA", xlab = "MSL Pressure", col= "magenta")

#Outliers detected are minimal, no action . 

#Cloud box plots
boxplot(vsmtidy$Oktas ~ vsmtidy$`Cloud Amount`, main="Cloud measurement by Oktas", ylab = "Oktas", xlab = "Time of Cloud Measurement", col = "purple")

#No outliers

#I am going to use the method of capping to replace outliers. These outliers aren't mistakes therefore using capping ensures we keep our results close enough to actuals without having outliers skewing the outcomes. Again, I'm using a function Dr. Anil Dolgun's Data Wrangling module 6 notes. Which also indicate, the function was taken from https://stackoverflow.com/questions/13339685/how-to-replace-outliers-with-the-5th-and-95th-percentile-values-in-r?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa

cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}

# Now to replace the outliers I'll need to break the multivariate outliers down by the qualitative variable. First I'll start by creating data frame for Wind
wind <- vsmtidy %>% select(Date, `Wind Measurement`,`Wind km/h`)
#Now to filter variable '3pm Wind Speed'
wind3 <- wind %>% filter(wind$`Wind Measurement` == "3pm wind speed (km/h)") 
#apply the capping function to a subset of the 3pm wind data
wind3_capped <- wind3$`Wind km/h` %>% cap()
#repeat the above steps to other variables as well

#9am Wind Speed
wind9 <- wind %>% filter(wind$`Wind Measurement` == "9am wind speed (km/h)") 
wind9_capped <- wind9$`Wind km/h` %>% cap()


#Max Wind Speed
windmax <- wind %>% filter(wind$`Wind Measurement` == "Speed of maximum wind gust (km/h)") 
windmax_capped <- windmax$`Wind km/h` %>% cap()

#Below subsets and summaries have been set up to draw a comparison between original values and the capped values.
winduncapped <- cbind(windmax$`Wind km/h`, wind9$`Wind km/h`, wind3$`Wind km/h`)
windcapped <- cbind(windmax_capped, wind9_capped, wind3_capped)
summary(winduncapped)

##        V1              V2              V3      
##  Min.   :13.00   Min.   : 2.00   Min.   : 2.0  
##  1st Qu.:28.00   1st Qu.: 7.00   1st Qu.: 9.0  
##  Median :33.00   Median : 9.00   Median :13.0  
##  Mean   :34.83   Mean   :10.17   Mean   :13.8  
##  3rd Qu.:43.00   3rd Qu.:13.00   3rd Qu.:17.0  
##  Max.   :85.00   Max.   :33.00   Max.   :33.0

summary(windcapped)

##  windmax_capped   wind9_capped    wind3_capped  
##  Min.   :13.00   Min.   : 2.00   Min.   : 2.00  
##  1st Qu.:28.00   1st Qu.: 7.00   1st Qu.: 9.00  
##  Median :33.00   Median : 9.00   Median :13.00  
##  Mean   :34.71   Mean   :10.09   Mean   :13.77  
##  3rd Qu.:43.00   3rd Qu.:13.00   3rd Qu.:17.00  
##  Max.   :61.00   Max.   :22.00   Max.   :28.00

boxplot(winduncapped, col="red")

boxplot(windcapped, col="maroon")

# We'll follow the same steps for the Temperature outliers
temp <- vsmtidy %>% select(Date, `Temperature Measurement`,`Degrees Celcius(C)`)

#3pm Temperature 
temp3 <- temp %>% filter(temp$`Temperature Measurement` == "3pm Temperature (C)") 
temp3_capped <- temp3$`Degrees Celcius(C)` %>% cap()


#9am Temperature 
temp9 <-temp %>% filter(temp$`Temperature Measurement`== "9am Temperature (C)") 
temp9_capped <- temp9$`Degrees Celcius(C)`  %>% cap()


#Max Temperature 
tempmax <- temp %>% filter(temp$`Temperature Measurement`== "Maximum temperature (C)") 
tempmax_capped <- tempmax$`Degrees Celcius(C)` %>% cap()

#Min Temperature
tempmin <- temp %>% filter(temp$`Temperature Measurement`== "Minimum temperature (C)") 
tempmin_capped <- tempmin$`Degrees Celcius(C)` %>% cap()

#Comparison with uncapped and capped data
tempuncapped <- cbind(tempmax$`Degrees Celcius(C)`,tempmin$`Degrees Celcius(C)`, temp3$`Degrees Celcius(C)`, temp9$`Degrees Celcius(C)`)
tempcapped <- cbind(tempmax_capped, tempmin_capped, temp3_capped, temp9_capped)
summary(tempuncapped)

##        V1              V2              V3              V4       
##  Min.   :10.30   Min.   : 1.70   Min.   : 7.20   Min.   : 2.50  
##  1st Qu.:15.30   1st Qu.: 8.20   1st Qu.:14.30   1st Qu.:10.80  
##  Median :18.60   Median :10.85   Median :17.45   Median :13.65  
##  Mean   :19.94   Mean   :11.00   Mean   :18.37   Mean   :13.98  
##  3rd Qu.:23.00   3rd Qu.:13.60   3rd Qu.:20.90   3rd Qu.:16.50  
##  Max.   :43.50   Max.   :23.50   Max.   :42.20   Max.   :33.10

summary(tempcapped)

##  tempmax_capped  tempmin_capped   temp3_capped    temp9_capped  
##  Min.   :10.30   Min.   : 1.70   Min.   : 7.20   Min.   : 2.50  
##  1st Qu.:15.30   1st Qu.: 8.20   1st Qu.:14.30   1st Qu.:10.80  
##  Median :18.60   Median :10.85   Median :17.45   Median :13.65  
##  Mean   :19.74   Mean   :10.98   Mean   :18.15   Mean   :13.83  
##  3rd Qu.:23.00   3rd Qu.:13.60   3rd Qu.:20.90   3rd Qu.:16.50  
##  Max.   :34.30   Max.   :21.50   Max.   :30.60   Max.   :24.20

boxplot(tempuncapped, col="red")

boxplot(tempcapped, col="maroon")

# We'll follow the same steps for the Temperature outliers
humidity <- vsmtidy %>% select(Date, Humidity,'%')

#3pm humidity
humidity3 <- humidity %>% filter(humidity$Humidity == "3pm relative humidity (%)") 
humidity3_capped <- humidity3$'%' %>% cap()


# 9am humidity
humidity9 <- humidity %>% filter(humidity$Humidity == "9am relative humidity (%)") 
humidity9_capped <- humidity9$'%' %>% cap()

#Comparison with uncapped and capped data
humidityuncapped <- cbind(humidity9$'%',humidity3$'%')
humiditycapped <- cbind(humidity9_capped, humidity3_capped)

summary(humidityuncapped)

##        V1               V2        
##  Min.   : 17.00   Min.   : 13.00  
##  1st Qu.: 65.00   1st Qu.: 47.00  
##  Median : 74.00   Median : 56.00  
##  Mean   : 73.72   Mean   : 57.11  
##  3rd Qu.: 83.00   3rd Qu.: 66.00  
##  Max.   :100.00   Max.   :100.00

summary(humiditycapped)

##  humidity9_capped humidity3_capped
##  Min.   : 38.00   Min.   :19.00   
##  1st Qu.: 65.00   1st Qu.:47.00   
##  Median : 74.00   Median :56.00   
##  Mean   : 74.09   Mean   :57.13   
##  3rd Qu.: 83.00   3rd Qu.:66.00   
##  Max.   :100.00   Max.   :92.00

boxplot(humidityuncapped, col="red")

boxplot(humiditycapped, col="maroon")

References

Dolgun, A 2020, ‘Module 6 Scan: Outliers’, Module notes, MATH2349, RMIT University, http://rare-phoenix-161610.appspot.com/secured/Module_06.html#multivariate_outlier_detection_methods
Dolgun, A 2020, ‘Module 5 Scan: Missing Values’, Module notes, MATH2349, RMIT University, http://rare-phoenix-161610.appspot.com/secured/Module_05.html
Baglin, J 2020, ‘R Bootcamp - Course 4: R Markdown’, Module notes, MATH1324, RMIT University, https://astral-theory-157510.appspot.com/secured/RBootcamp_Course_04.html#creating_an_r_markdown_document_in_r_studio
RStudio, R Markdown from RStudio https://rmarkdown.rstudio.com/lesson-1.html
Modern Data Science with R, 2nd edition, Benjamin S. Baumer, Daniel T. Kaplan, and Nicholas J. Horton https://beanumber.github.io/mdsr2e/ch-dataII.html
Bureau of Meteorology, Notes to accompany Daily Weather Observations, http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml
Bureau of Meteorology, Melbourne (Olympic Park), Victoria October 2020 Daily Weather Observations, http://www.bom.gov.au/climate/dwo/IDCJDW3033.latest.shtml