In this notebook, we’ve got a set of questions.

Your challenge is to match each question to the appropriate code chunk(s). For the charts you don’t use, can you explain why not, or what’s wrong with them?

Try to work out what each code chunk will output before you run it.

The code snippets are designed so you could copy them into your own work to do analyses (if you wanted to), and to demonstrate useful functions.

Preliminaries

Load packages. If you don’t have them installed already, uncomment (delete the #) the first line in the following code block (the line starting ‘install.packages’).

Once you’ve done that, you can test the file by using the ‘knit’ button (in RStudio).

Read more about R Markdown and ‘knitting’ (rendering) documents: https://rmarkdown.rstudio.com/authoring_quick_tour.html#overview

To read documentation and view usage examples, type ?function_name in the console (e.g. ?hist) or search from the help bar in RStudio.

First, we’ll load the required packages

#install.packages(c("psych","ggplot2","doBy","reshape2","knitr","lattice"))
  sh <- suppressPackageStartupMessages #To get rid of start-up messages while loading the libraries
  sh(library(ggplot2))  #for graphs and plots
## Warning: package 'ggplot2' was built under R version 3.5.3
  sh(library(psych))    #for statistical measures and testing
## Warning: package 'psych' was built under R version 3.5.2
  sh(library(doBy))         #for group by analysis
## Warning: package 'doBy' was built under R version 3.5.2
  sh(library(reshape2)) #for data wrangling
## Warning: package 'reshape2' was built under R version 3.5.2
  sh(library(knitr))    #for rendering markdown
  sh(library(lattice))  #just to illustrate another histogram function 
## Warning: package 'lattice' was built under R version 3.5.2

The datasets

Load data from the CSV files into the RStudio environment. We’ve got two input files here; make sure they are in the same folder as this notebook. If not, point to the right directory when reading the file by changing the path, e.g. read.csv("C:/…/syd_weath.csv")

weather_sydney <- read.csv("syd_weath.csv", stringsAsFactors = FALSE)
weather_melbourne <- read.csv("mel_weath.csv", stringsAsFactors = FALSE)
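If the files live in a different folder, one option is to build the path with file.path() and check that it exists before reading. The following is only a sketch: `read_weather`, `data_dir` and `demo_weath.csv` are hypothetical names introduced for illustration, and the demo writes a tiny stand-in CSV to a temporary folder rather than using the real weather files.

```r
# Hypothetical helper: read a CSV from a given folder,
# stopping with a clear message if the file is not there.
read_weather <- function(file_name, data_dir = ".") {
  path <- file.path(data_dir, file_name)
  if (!file.exists(path)) {
    stop("Cannot find ", path, " (working directory: ", getwd(), ")")
  }
  read.csv(path, stringsAsFactors = FALSE)
}

# Demo with a stand-in file in a temporary folder (made-up values, not the real data)
data_dir <- tempdir()
write.csv(data.frame(date = c("2017-01-01", "2017-01-02"), temp = c(23.1, 25.4)),
          file.path(data_dir, "demo_weath.csv"), row.names = FALSE)

demo <- read_weather("demo_weath.csv", data_dir)
str(demo)
```

Checking the path first tends to give a more helpful error than the default read.csv() failure, because it also reports the current working directory.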

Data Exploration

First, we explore the dataset to see what it includes.

#View the first 10 rows of the dataset
head(weather_sydney,10)
#View the variables in the data set
colnames(weather_sydney)
##  [1] "X"            "date"         "temp"         "dew_pt"      
##  [5] "hum"          "wind_spd"     "wind_gust"    "dir"         
##  [9] "vis"          "pressure"     "wind_chill"   "heat_index"  
## [13] "precip"       "precip_rate"  "precip_total" "cond"        
## [17] "fog"          "rain"         "snow"         "hail"        
## [21] "thunder"      "tornado"
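Beyond the column names, it can help to check how each column is stored and how many values are missing. Here is a small sketch on a made-up data frame (`toy` is a stand-in, not the real weather data); the same calls should work on weather_sydney.

```r
# Toy stand-in for a weather data frame (values are made up)
toy <- data.frame(
  date = c("2017-01-01", "2017-01-01", "2017-01-02"),
  temp = c(23.1, NA, 25.4),
  cond = c("Clear", "", "Haze"),
  stringsAsFactors = FALSE
)

col_types <- sapply(toy, class)   # storage type of each column
na_counts <- colSums(is.na(toy))  # number of missing values per column

print(col_types)
print(na_counts)
summary(toy$temp)                 # min, quartiles, mean, max and NA count
```

Note that an empty string ("") is not counted as NA, so a column can look complete while still containing blank entries.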

Plots

The code below shows an example of a plot (a pretty ugly one). We will see how it can be improved later.

hist(weather_sydney$temp)

hist(weather_melbourne$temp)

The questions - under each question, copy the code chunk(s) that best address it

Q1. What is the relationship between dew point and humidity?

Q2. Where do the fastest winds flow in Sydney?

Q3. How does temperature compare in Melbourne vs Sydney?

Q4. What kind of relationship is there between visibility and humidity?

Q5. What weather conditions are seen in Melbourne and Sydney?

Q6. How does humidity vary in Melbourne vs Sydney across months?

Q7. What weather condition has the lowest humidity?

Tables

We are calculating summary statistics on key variables here. Read about measures of skew and kurtosis here: https://codeburst.io/2-important-statistics-terms-you-need-to-know-in-data-science-skewness-and-kurtosis-388fef94eeaa

library(knitr)

kable(rbind(psych::describe(weather_sydney$temp),psych::describe(weather_melbourne$temp)), caption = "Summary of Mel & Sydney weather - Temperature")
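Summary of Mel & Sydney weather - Temperature

|     | vars | n    | mean     | sd       | median | trimmed  | mad    | min | max  | range | skew     | kurtosis   | se        |
|-----|------|------|----------|----------|--------|----------|--------|-----|------|-------|----------|------------|-----------|
| X1  | 1    | 2206 | 23.37307 | 3.006867 | 23.0   | 23.16025 | 2.9652 | 17  | 39.0 | 22.0  | 1.097941 | 2.9843386  | 0.0640194 |
| X11 | 1    | 1562 | 23.01485 | 5.097803 | 22.8   | 22.73952 | 5.9304 | 13  | 40.4 | 27.4  | 0.475591 | -0.2072643 | 0.1289860 |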
#note, you should label the rows
kable(rbind(psych::describe(weather_sydney$wind_spd),psych::describe(weather_melbourne$wind_spd)), caption = "Summary of Mel & Sydney weather - Wind Speed")
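Summary of Mel & Sydney weather - Wind Speed

|     | vars | n    | mean        | sd       | median  | trimmed     | mad      | min     | max     | range | skew      | kurtosis  | se        |
|-----|------|------|-------------|----------|---------|-------------|----------|---------|---------|-------|-----------|-----------|-----------|
| X1  | 1    | 2206 | 22.85938    | 10.47271 | 22.2    | 22.39581    | 13.63992 | 0.0     | 55.6    | 55.6  | 0.3531646 | -0.718568 | 0.2229751 |
| X11 | 1    | 1562 | -1608.80000 | 0.00000  | -1608.8 | -1608.80000 | 0.00000  | -1608.8 | -1608.8 | 0.0   | NaN       | NaN       | 0.0000000 |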

You should notice something wrong in the data… What’s going on there?
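When a summary looks implausible, two quick checks are the range of the column and the number of distinct values it holds. The sketch below uses made-up readings and assumes -9999 is the placeholder for a missing value; the placeholder in your own data may differ.

```r
# Made-up readings; -9999 is an assumed placeholder for "missing"
wind <- c(12.5, 18.0, -9999, 22.2, -9999)

range(wind)            # an impossible minimum is a warning sign
length(unique(wind))   # a column with only one distinct value is another

wind_clean <- wind
wind_clean[wind_clean == -9999] <- NA   # recode the placeholder as missing

mean(wind_clean, na.rm = TRUE)          # the summary now ignores the placeholder
```

Recoding to NA, rather than deleting rows, keeps the other columns of those rows available for analysis.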

Code blocks

As in any data analysis, the first step is preparing the raw dataset in the required format by cleaning it. Learn more about combining data frames in R here: http://www.programmingr.com/examples/r-dataframe/merge-data-frames/

#This data used to need a lot more cleaning! Previously, you needed to remove -9999 values for temp. If you notice abnormalities in the data, make sure you clean it.

#Adding location details to individual files to merge the dataframes
weather_sydney$loc <- "SYD" #populating a new column loc with value=SYD
weather_melbourne$loc <- "MEL"

#Combining the rows of the two datasets and selecting only the required two columns
temps <- subset(rbind(weather_sydney, weather_melbourne), select = c("temp", "loc"))
temps$temp <- as.numeric(temps$temp)
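The tag-then-stack pattern used above can be tried on a small scale first. This is a sketch with two made-up data frames (`syd`, `mel` and `both` are illustrative names only).

```r
# Two toy data frames with the same columns (values are made up)
syd <- data.frame(temp = c("23.1", "25.4"), hum = c(60, 55),
                  stringsAsFactors = FALSE)
mel <- data.frame(temp = c("18.9", "21.0", "19.5"), hum = c(70, 65, 72),
                  stringsAsFactors = FALSE)

# Tag each row with its source before stacking, so the origin is not lost
syd$loc <- "SYD"
mel$loc <- "MEL"

# rbind() stacks rows and requires matching column names
both <- subset(rbind(syd, mel), select = c("temp", "loc"))
both$temp <- as.numeric(both$temp)  # text that is not a number would become NA, with a warning

table(both$loc)
```

rbind() is for stacking rows from data frames with the same columns; joining data frames side by side on a shared key is a different operation (see merge()).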

The code blocks do not map one-to-one onto the questions (Q1 to Q7). Your task is to work out which block(s) answer each question. Note that some figures and code blocks that do not answer any of the questions well are also included.

Block 1: Summary statistics by category

summaryBy(temp~loc, data=temps,FUN=c(min, max, mean, var))
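For comparison, a similar grouped summary can be produced in base R without doBy. This is a sketch on made-up data, not a replacement for the block above; `toy`, `group_means` and `group_stats` are illustrative names.

```r
toy <- data.frame(
  temp = c(20, 22, 24, 15, 19, 26),
  loc  = c("SYD", "SYD", "SYD", "MEL", "MEL", "MEL")
)

# One summary function at a time with aggregate()
group_means <- aggregate(temp ~ loc, data = toy, FUN = mean)

# Several summaries at once: split by group, summarise, then transpose
group_stats <- t(sapply(split(toy$temp, toy$loc), function(x) {
  c(min = min(x), max = max(x), mean = mean(x), var = var(x))
}))

print(group_means)
print(group_stats)
```

Looking at the variance alongside the mean matters here: two groups can have similar means but very different spreads.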

Plots

Block 2

histogram(~ temp | loc, data=temps)

Block 3

ggplot(temps, aes(x = temp, fill = loc)) + geom_histogram(alpha = .5, position = 'identity') 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Block 4

#Note the use of 'density': the two datasets have different numbers of temperature readings, so density rescales each histogram to a total area of 1 and lets us compare the share of readings at each temperature rather than raw counts. Alpha is the transparency level.
ggplot(temps, aes(x = temp, fill = loc)) + geom_histogram(alpha = .5, aes(y = ..density..), position = 'identity')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
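To see what the density scale does when group sizes differ, the bins can be computed by hand with base hist(). This sketch uses simulated values (not the weather data) and an arbitrary bin width of 2, chosen purely for illustration.

```r
set.seed(1)                                # reproducible simulated data
big   <- rnorm(2000, mean = 23, sd = 3)    # larger group
small <- rnorm(500,  mean = 23, sd = 5)    # smaller group

# Shared bins of width 2 that cover both groups
lo <- floor(min(c(big, small)))
hi <- ceiling(max(c(big, small)))
breaks <- seq(lo, hi + 2, by = 2)

h_big   <- hist(big,   breaks = breaks, plot = FALSE)
h_small <- hist(small, breaks = breaks, plot = FALSE)

# Counts depend on group size ...
c(big = sum(h_big$counts), small = sum(h_small$counts))

# ... but on the density scale each histogram has total area 1
c(big   = sum(h_big$density * diff(breaks)),
  small = sum(h_small$density * diff(breaks)))
```

The same idea applies to the message about `bins = 30`: setting an explicit binwidth makes the bins comparable across plots instead of relying on the default.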

Block 5

ggplot(temps) + 
  geom_bar(aes(x = loc, y = temp, fill = loc),
           position = "dodge", stat = "summary", fun.y = "mean")

Block 6

ggplot(temps, aes(x=loc, y=temp, fill=loc)) + geom_boxplot() +
    guides(fill=FALSE)+
    stat_summary(fun.y=mean, geom="point", shape=5, size=4)

Block 7

#Using the Sydney weather dataframe
line_example <- subset(weather_sydney, !is.na(hum), select=c("hum", "dew_pt","date"))
line_example[c("hum","dew_pt")] <- lapply(line_example[c("hum","dew_pt")],as.numeric)

line_example <- aggregate(. ~ date, line_example, FUN=mean)

library(reshape2)
#convert to long
line_example <- melt(line_example, id.vars = c("date"))

ggplot(data=line_example, aes(x=date, y=value, group=variable, colour=variable)) +
    geom_line() +
    geom_point()

cor.test(as.numeric(weather_sydney$hum),as.numeric(weather_sydney$dew_pt))
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(weather_sydney$hum) and as.numeric(weather_sydney$dew_pt)
## t = 40.31, df = 2204, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6267465 0.6748281
## sample estimates:
##       cor 
## 0.6514409
#Can you observe a similar relationship in the Melbourne weather dataframe?

#What if we want to explore the relationship between dew_pt and other features?
#https://support.office.com/en-us/article/Present-your-data-in-a-scatter-chart-or-a-line-chart-4570a80f-599a-4d6b-a155-104a9018b86e

#Is date an important variable in this analysis? Does the scaling of the data give us the best available insight into the relationships between paired values? Is the use of a line to join data points appropriate given missing data?
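The daily-averaging step in Block 7 uses the formula interface of aggregate(), which by default drops any row containing an NA before it averages. A small sketch on made-up readings (`readings` and `daily` are illustrative names) shows the effect.

```r
readings <- data.frame(
  date   = c("2017-01-01", "2017-01-01", "2017-01-02", "2017-01-02"),
  hum    = c(60, 70, 50, NA),
  dew_pt = c(15, 17, 12, 14)
)

# "." means "all other columns"; rows with any NA are removed first
daily <- aggregate(. ~ date, data = readings, FUN = mean)
print(daily)
```

For the second day, the row with the missing humidity is dropped entirely, so its dew point (14) does not contribute to the daily mean either. That is worth keeping in mind when reading a line chart built from such averages.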

Block 8

ggplot(weather_sydney, aes(x=hum, y=dew_pt)) +
    geom_point(shape=1)      # Use hollow circles

cor.test(as.numeric(weather_sydney$hum),as.numeric(weather_sydney$dew_pt))
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(weather_sydney$hum) and as.numeric(weather_sydney$dew_pt)
## t = 40.31, df = 2204, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6267465 0.6748281
## sample estimates:
##       cor 
## 0.6514409
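The printed cor.test() output can also be read programmatically, which is handy when reporting results in text or tables. This is a sketch on simulated data; `x`, `y` and `res` are illustrative names.

```r
set.seed(42)                          # reproducible simulated data
x <- rnorm(100)
y <- 0.6 * x + rnorm(100, sd = 0.8)   # y depends partly on x

res <- cor.test(x, y)

res$estimate   # the correlation coefficient r
res$conf.int   # 95 percent confidence interval for r
res$p.value    # p-value for the test that the true correlation is 0
```

A very small p-value only says the correlation is unlikely to be exactly zero; the size of r and the scatter plot say more about how strong and how linear the relationship is.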

Block 9

ggplot(weather_sydney, aes(x=hum, y=vis)) +
    geom_point(shape=4)      # Use x to denote data points
## Warning: Removed 755 rows containing missing values (geom_point).

cor.test(as.numeric(weather_sydney$hum),as.numeric(weather_sydney$vis))
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(weather_sydney$hum) and as.numeric(weather_sydney$vis)
## t = -16.148, df = 1449, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4332746 -0.3460133
## sample estimates:
##        cor 
## -0.3905208
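Notice that the degrees of freedom here (1449) are lower than in the dew point test (2204). cor.test() uses only pairs where both values are present, and the degrees of freedom are the number of such pairs minus 2. A sketch with made-up values containing two gaps:

```r
hum <- c(60, 65, 70, 75, 80, 85, 90, 95)
vis <- c(10, NA,  9,  8, NA,  7,  5,  4)   # made-up values with two gaps

n_pairs <- sum(complete.cases(hum, vis))   # pairs where both values are present
res <- cor.test(hum, vis)

n_pairs                  # 6
unname(res$parameter)    # 4, i.e. n_pairs - 2
```

Checking how many pairs were actually used is a quick way to see how much of the data a result is based on.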

Block 10

weather <- rbind(weather_sydney[c("temp","dew_pt","hum","wind_spd","precip_total","cond","date","loc")],weather_melbourne[c("temp","dew_pt","hum","wind_spd","precip_total","cond","date","loc")])
#weather <- subset(weather, temp >-300)
weather$month <- format(as.Date(weather$date), "%m")

ggplot(weather, aes(x=month, y=hum, fill=loc)) + geom_boxplot() +
    guides(fill=FALSE) +
    stat_summary(fun.y=mean, geom="point", shape=5, size=4) +
    facet_wrap(~loc)
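The month column in Block 10 comes from format() applied to a Date. The sketch below assumes the dates are stored as year-month-day text, which is an assumption about the files rather than something confirmed here; other layouts would need a format argument in as.Date().

```r
dates <- c("2017-01-15", "2017-03-02", "2017-11-30")  # assumed ISO year-month-day text

month_num <- format(as.Date(dates), "%m")   # "01" "03" "11"

# Optional: a factor with all twelve levels keeps calendar order
# and shows months with no data as empty groups
month_fact <- factor(month_num, levels = sprintf("%02d", 1:12))

table(month_fact)
```

Because "%m" returns zero-padded text, the months sort correctly even as character values.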

Block 11

summaryBy(hum~cond, data=weather,FUN=c(min, max, mean, var))

Block 12

unique(weather$cond)
##  [1] "Partly Cloudy"                "Clear"                       
##  [3] ""                             "Mostly Cloudy"               
##  [5] "Scattered Clouds"             "Haze"                        
##  [7] "Thunderstorm"                 "Light Rain Showers"          
##  [9] "Light Thunderstorms and Rain" "Light Rain"                  
## [11] "Overcast"                     "Unknown"                     
## [13] "Rain Showers"                 "Light Drizzle"               
## [15] "Drizzle"                      "Heavy Rain Showers"          
## [17] "Rain"                         NA
table(weather$cond,weather$loc)
##                               
##                                MEL SYD
##                                  0 470
##   Clear                          0 285
##   Drizzle                        0   3
##   Haze                           0  63
##   Heavy Rain Showers             0   2
##   Light Drizzle                  0  11
##   Light Rain                     0  40
##   Light Rain Showers             0  72
##   Light Thunderstorms and Rain   0   5
##   Mostly Cloudy                  0 591
##   Overcast                       0  27
##   Partly Cloudy                  0 362
##   Rain                           0  10
##   Rain Showers                   0  13
##   Scattered Clouds               0 247
##   Thunderstorm                   0   1
##   Unknown                        0   4
weather_con <- unique(subset(weather,select=c("cond","date","loc")))

ggplot(data=weather_con, aes(x=cond, fill = loc)) +
    geom_bar(position=position_dodge()) +
    theme(axis.text.x = element_text(angle = 90, vjust = .5, hjust = 1))
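By default table() leaves out NA values, which can make a group look empty when its values are really missing. The sketch below uses made-up vectors to show the difference; whether this explains the zeros in the MEL column above is something to check against the data.

```r
cond <- c("Clear", "", "Haze", NA, NA, "Clear")
loc  <- c("SYD", "SYD", "SYD", "MEL", "MEL", "SYD")

default_tab <- table(cond, loc)                   # NA conditions are silently dropped
full_tab    <- table(cond, loc, useNA = "ifany")  # adds a row for NA

print(default_tab)
print(full_tab)
```

The empty string "" is treated as an ordinary category, which is why it gets its own unlabeled row.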

Block 13

ggplot(weather_sydney, aes(x=dir, y=wind_spd, fill=dir)) + geom_boxplot() +
    guides(fill=FALSE)+
    stat_summary(fun.y=mean, geom="point", shape=5, size=4)

Block 14

# geom_bar is designed to make it easy to create bar charts that show counts (or sums of weights)

ggplot(data=weather_sydney, aes(x=dir, fill=wind_spd, stat="mean")) +
    geom_bar(position= "stack") +
    theme(axis.text.x = element_text(angle = 90, vjust = .5, hjust = 1))
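The comment at the top of Block 14 points at the difference between counting rows and summarising a value. A sketch on made-up readings (`toy`, `counts` and `means` are illustrative names) shows the two side by side.

```r
toy <- data.frame(
  dir      = c("N", "N", "N", "S", "S", "W"),
  wind_spd = c(10, 12, 14, 30, 34, 20)
)

counts <- table(toy$dir)                        # how many readings per direction
means  <- tapply(toy$wind_spd, toy$dir, mean)   # average speed per direction

print(counts)
print(means)
```

Here the direction with the most readings (N) is not the one with the highest average speed (S), so a chart of counts and a chart of means would answer different questions.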