In this notebook, we’ve got a set of questions.
Your challenge is to match each question to the appropriate code chunk(s). For charts you don’t use, can you explain why not, or what’s wrong with them?
Try to work out what each code chunk will output before you run it.
The code snippets are designed so you could copy them into your own work to do analyses (if you wanted to), and to demonstrate useful functions.
Load packages. If you don’t have them installed already, uncomment (delete the #) the first line in the following code block (starting ‘install.packages’).
Once you’ve done that, you can test the file by using the ‘knit’ button (in RStudio).
Read more about R Markdown and ‘knitting’ (rendering) documents: https://rmarkdown.rstudio.com/authoring_quick_tour.html#overview
To read documentation and view examples of usage, type ?function_name in the console (e.g. ?hist) or search from the help bar in RStudio.
First, we’ll load the required packages:
#install.packages(c("psych","ggplot2","doBy","reshape2","knitr","lattice"))
sh <- suppressPackageStartupMessages #To hide package startup messages while loading the libraries (warnings are still shown)
sh(library(ggplot2)) #for graphs and plots
## Warning: package 'ggplot2' was built under R version 3.5.3
sh(library(psych)) #for statistical measures and testing
## Warning: package 'psych' was built under R version 3.5.2
sh(library(doBy)) #for group by analysis
## Warning: package 'doBy' was built under R version 3.5.2
sh(library(reshape2)) #for data wrangling
## Warning: package 'reshape2' was built under R version 3.5.2
sh(library(knitr)) #for rendering markdown
sh(library(lattice)) #just to illustrate another histogram function
## Warning: package 'lattice' was built under R version 3.5.2
#The datasets
Load data from the csv files into the RStudio environment. We’ve got two input files here; make sure they are in the same folder as the code. If not, point to the right directory when reading the file by changing the path, e.g. read.csv("C:/…/syd_weath.csv").
weather_sydney <- read.csv("syd_weath.csv", stringsAsFactors = FALSE)
weather_melbourne <- read.csv("mel_weath.csv", stringsAsFactors = FALSE)
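If the files live somewhere else, one way to avoid editing each read.csv call is to keep the folder in a single variable. The following is a minimal sketch; `data_dir`, `syd_path` and `mel_path` are hypothetical names, not part of the original notebook.

```r
# Hypothetical data folder; "." means the current working directory
data_dir <- "."

# file.path() joins the folder and file name with the correct separator
syd_path <- file.path(data_dir, "syd_weath.csv")
mel_path <- file.path(data_dir, "mel_weath.csv")

# Report any missing file clearly before trying to read it
for (p in c(syd_path, mel_path)) {
  if (!file.exists(p)) {
    message("File not found: ", p, " (working directory: ", getwd(), ")")
  }
}
```

The paths can then be passed to read.csv in place of the bare file names.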
First, we explore the dataset to see what it includes.
#View the first 10 rows of the dataset
head(weather_sydney,10)
#View the variables in the data set
colnames(weather_sydney)
## [1] "X" "date" "temp" "dew_pt"
## [5] "hum" "wind_spd" "wind_gust" "dir"
## [9] "vis" "pressure" "wind_chill" "heat_index"
## [13] "precip" "precip_rate" "precip_total" "cond"
## [17] "fog" "rain" "snow" "hail"
## [21] "thunder" "tornado"
An example of a plot (a pretty ugly one) is shown in the code below. We will see how this can be improved later.
hist(weather_sydney$temp)
hist(weather_melbourne$temp)
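As a first step towards a tidier plot, base hist() accepts a title, axis labels and explicit bin edges. The sketch below uses simulated values in place of weather_sydney$temp so that it runs on its own; the numbers are not real weather data, and the one-degree bins are just one reasonable choice.

```r
# Simulated temperatures standing in for weather_sydney$temp (not real data)
set.seed(1)
temp <- round(rnorm(200, mean = 23, sd = 3), 1)

# One-unit-wide bins spanning the full range of the data
breaks <- seq(floor(min(temp, na.rm = TRUE)),
              ceiling(max(temp, na.rm = TRUE)),
              by = 1)

# hist() draws the plot and invisibly returns the bin counts
h <- hist(temp,
          breaks = breaks,
          main   = "Sydney temperature (simulated example)",
          xlab   = "Temperature",
          ylab   = "Number of observations",
          col    = "grey80",
          border = "white")
```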
#The questions - under each question, copy the code chunk(s) that best address it
We are calculating summary statistics on key variables here. Read about measures of skew and kurtosis here: https://codeburst.io/2-important-statistics-terms-you-need-to-know-in-data-science-skewness-and-kurtosis-388fef94eeaa
library(knitr)
kable(rbind(psych::describe(weather_sydney$temp),psych::describe(weather_melbourne$temp)), caption = "Summary of Mel & Sydney weather - Temperature")
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 1 | 2206 | 23.37307 | 3.006867 | 23.0 | 23.16025 | 2.9652 | 17 | 39.0 | 22.0 | 1.097941 | 2.9843386 | 0.0640194 |
| X11 | 1 | 1562 | 23.01485 | 5.097803 | 22.8 | 22.73952 | 5.9304 | 13 | 40.4 | 27.4 | 0.475591 | -0.2072643 | 0.1289860 |
#note, you should label the rows
kable(rbind(psych::describe(weather_sydney$wind_spd),psych::describe(weather_melbourne$wind_spd)), caption = "Summary of Mel & Sydney weather - Wind Speed")
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 1 | 2206 | 22.85938 | 10.47271 | 22.2 | 22.39581 | 13.63992 | 0.0 | 55.6 | 55.6 | 0.3531646 | -0.718568 | 0.2229751 |
| X11 | 1 | 1562 | -1608.80000 | 0.00000 | -1608.8 | -1608.80000 | 0.00000 | -1608.8 | -1608.8 | 0.0 | NaN | NaN | 0.0000000 |
You should notice something wrong in the data… What’s going on there?
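The summary tables above carry generic row names (X1, X11), which makes it hard to tell the cities apart. A minimal sketch of labelling the rows follows; it uses made-up readings and a hypothetical helper, `summarise_vec`, so that it runs without the psych package.

```r
# Made-up temperature readings standing in for the two datasets
syd_temp <- c(22.0, 23.5, 25.0, 21.5)
mel_temp <- c(18.0, 24.0, 30.0, 16.0)

# Hypothetical helper: a one-row summary of a numeric vector
summarise_vec <- function(x) {
  data.frame(n      = sum(!is.na(x)),
             mean   = mean(x, na.rm = TRUE),
             sd     = sd(x, na.rm = TRUE),
             median = median(x, na.rm = TRUE))
}

# Stack the two summaries, then give each row a meaningful label
summary_tab <- rbind(summarise_vec(syd_temp), summarise_vec(mel_temp))
rownames(summary_tab) <- c("Sydney", "Melbourne")
summary_tab
```

The same `rownames(...) <-` assignment should work on the result of rbind(psych::describe(...), psych::describe(...)) before it is passed to kable.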
#Code blocks
As in any data analysis, the first step is preparing the raw dataset in the required format by cleaning it. Learn more about combining data frames in R here: http://www.programmingr.com/examples/r-dataframe/merge-data-frames/
#This data used to need a lot more cleaning! Previously, you needed to remove -9999 values for temp. If you notice abnormalities in the data, make sure you clean it.
#Adding location details to individual files to merge the dataframes
weather_sydney$loc <- "SYD" #populating a new column loc with value=SYD
weather_melbourne$loc <- "MEL"
#Combining the rows of the two datasets and selecting only the required two columns
temps <- subset(rbind(weather_sydney, weather_melbourne), select = c("temp", "loc"))
temps$temp <- as.numeric(temps$temp)
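Placeholder codes such as the -9999 mentioned in the comment above distort every summary statistic if they are left in. One common approach, sketched here with made-up readings and a hypothetical helper called `replace_sentinel`, is to convert them to NA so that functions with na.rm = TRUE ignore them.

```r
# Hypothetical helper: replace a sentinel code (such as -9999) with NA
replace_sentinel <- function(x, sentinel = -9999) {
  x <- as.numeric(x)
  x[which(x == sentinel)] <- NA  # which() skips values that are already NA
  x
}

# Made-up readings; here -9999 marks a missing measurement
temp_raw   <- c("21.5", "-9999", "23.0", NA, "19.5")
temp_clean <- replace_sentinel(temp_raw)

temp_clean
mean(temp_clean, na.rm = TRUE)  # mean of the three valid readings
```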
The code blocks do not correspond to the questions asked (Q1 to Q7) directly. Your task is to identify the right ones to match them correctly. Note that additional figures and code that do not answer the questions aptly are also present.
summaryBy(temp~loc, data=temps,FUN=c(min, max, mean, var))
histogram(~ temp | loc, data=temps)
ggplot(temps, aes(x = temp, fill = loc)) + geom_histogram(alpha = .5, position = 'identity')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Note the use of 'density': the two datasets contain different numbers of observations, so density rescales each histogram to a total area of 1 and makes the distributions comparable. Alpha is the transparency level.
ggplot(temps, aes(x = temp, fill = loc)) + geom_histogram(alpha = .5, aes(y = ..density..), position = 'identity')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
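The stat_bin() message appears because no bin width was chosen, so ggplot2 falls back to 30 bins. A minimal sketch of setting the width explicitly is shown below; the data frame `temps_demo` is simulated for illustration and is not the real weather data, and a width of 1 is only one plausible choice.

```r
library(ggplot2)

# Simulated temperatures for two locations (not the real weather data)
set.seed(42)
temps_demo <- data.frame(
  temp = c(rnorm(300, mean = 23, sd = 3), rnorm(200, mean = 23, sd = 5)),
  loc  = rep(c("SYD", "MEL"), times = c(300, 200))
)

# binwidth = 1 gives bars one unit wide instead of the default 30 bins
p <- ggplot(temps_demo, aes(x = temp, fill = loc)) +
  geom_histogram(binwidth = 1, alpha = 0.5, position = "identity")
p
```

Trying a few different widths is worthwhile, since very narrow bins exaggerate noise and very wide bins hide the shape of the distribution.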
ggplot(temps) +
geom_bar(aes(x = loc, y = temp, fill = loc),
position = "dodge", stat = "summary", fun.y = "mean")
ggplot(temps, aes(x=loc, y=temp, fill=loc)) + geom_boxplot() +
guides(fill=FALSE)+
stat_summary(fun.y=mean, geom="point", shape=5, size=4)
#Using the Sydney weather dataframe
line_example <- subset(weather_sydney, !is.na(hum), select=c("hum", "dew_pt","date"))
line_example[c("hum","dew_pt")] <- lapply(line_example[c("hum","dew_pt")],as.numeric)
line_example <- aggregate(. ~ date, line_example, FUN=mean)
library(reshape2)
#convert to long
line_example <- melt(line_example, id.vars = c("date"))
ggplot(data=line_example, aes(x=date, y=value, group=variable, colour=variable)) +
geom_line() +
geom_point()
cor.test(as.numeric(weather_sydney$hum),as.numeric(weather_sydney$dew_pt))
##
## Pearson's product-moment correlation
##
## data: as.numeric(weather_sydney$hum) and as.numeric(weather_sydney$dew_pt)
## t = 40.31, df = 2204, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6267465 0.6748281
## sample estimates:
## cor
## 0.6514409
#Can you observe a similar relationship in the Melbourne weather dataframe?
#What if we want to explore the relationship between dew_pt and other features
#https://support.office.com/en-us/article/Present-your-data-in-a-scatter-chart-or-a-line-chart-4570a80f-599a-4d6b-a155-104a9018b86e
#Is date an important variable in this analysis? Does the scaling of the data give us the best available insight into relationships of paired values? Is the use of a line to join datapoints appropriate given missing data?
ggplot(weather_sydney, aes(x=hum, y=dew_pt)) +
geom_point(shape=1) # Use hollow circles
cor.test(as.numeric(weather_sydney$hum),as.numeric(weather_sydney$dew_pt))
##
## Pearson's product-moment correlation
##
## data: as.numeric(weather_sydney$hum) and as.numeric(weather_sydney$dew_pt)
## t = 40.31, df = 2204, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6267465 0.6748281
## sample estimates:
## cor
## 0.6514409
ggplot(weather_sydney, aes(x=hum, y=vis)) +
geom_point(shape=4) # Use x to denote data points
## Warning: Removed 755 rows containing missing values (geom_point).
cor.test(as.numeric(weather_sydney$hum),as.numeric(weather_sydney$vis))
##
## Pearson's product-moment correlation
##
## data: as.numeric(weather_sydney$hum) and as.numeric(weather_sydney$vis)
## t = -16.148, df = 1449, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4332746 -0.3460133
## sample estimates:
## cor
## -0.3905208
#Columns to keep from both data frames
cols <- c("temp","dew_pt","hum","wind_spd","precip_total","cond","date","loc")
weather <- rbind(weather_sydney[cols], weather_melbourne[cols])
#weather <- subset(weather, temp >-300)
weather$month <- format(as.Date(weather$date), "%m")
ggplot(weather, aes(x=month, y=hum, fill=loc)) + geom_boxplot() +
guides(fill=FALSE) +
stat_summary(fun.y=mean, geom="point", shape=5, size=4) +
facet_wrap(~loc)
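The month column in the chunk above is built with format(as.Date(...), "%m"), which returns the month as a two-digit character string. The sketch below shows this on a few made-up dates, assuming they are stored in ISO format (YYYY-MM-DD); the conversion to month names is an optional extra, not part of the original code.

```r
# Made-up dates, assumed to be in ISO format (YYYY-MM-DD)
dates <- c("2017-01-15", "2017-02-03", "2017-12-31")

# "%m" extracts the month as a zero-padded character string
months <- format(as.Date(dates), "%m")
months

# Optional: an ordered factor of month abbreviations for nicer axis labels
month_names <- factor(month.abb[as.integer(months)], levels = month.abb)
month_names
```

Because the result is character rather than numeric, ggplot treats month as a discrete variable, which is what a per-month boxplot needs.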
summaryBy(hum~cond, data=weather,FUN=c(min, max, mean, var))
unique(weather$cond)
## [1] "Partly Cloudy" "Clear"
## [3] "" "Mostly Cloudy"
## [5] "Scattered Clouds" "Haze"
## [7] "Thunderstorm" "Light Rain Showers"
## [9] "Light Thunderstorms and Rain" "Light Rain"
## [11] "Overcast" "Unknown"
## [13] "Rain Showers" "Light Drizzle"
## [15] "Drizzle" "Heavy Rain Showers"
## [17] "Rain" NA
table(weather$cond,weather$loc)
##
## MEL SYD
## 0 470
## Clear 0 285
## Drizzle 0 3
## Haze 0 63
## Heavy Rain Showers 0 2
## Light Drizzle 0 11
## Light Rain 0 40
## Light Rain Showers 0 72
## Light Thunderstorms and Rain 0 5
## Mostly Cloudy 0 591
## Overcast 0 27
## Partly Cloudy 0 362
## Rain 0 10
## Rain Showers 0 13
## Scattered Clouds 0 247
## Thunderstorm 0 1
## Unknown 0 4
weather_con <- unique(subset(weather,select=c("cond","date","loc")))
ggplot(data=weather_con, aes(x=cond, fill = loc)) +
geom_bar(position=position_dodge()) +
theme(axis.text.x = element_text(angle = 90, vjust = .5, hjust = 1))
ggplot(weather_sydney, aes(x=dir, y=wind_spd, fill=dir)) + geom_boxplot() +
guides(fill=FALSE)+
stat_summary(fun.y=mean, geom="point", shape=5, size=4)
# geom_bar is designed to make it easy to create bar charts that show counts (or sums of weights)
ggplot(data=weather_sydney, aes(x=dir, fill=wind_spd, stat="mean")) +
geom_bar(position= "stack") +
theme(axis.text.x = element_text(angle = 90, vjust = .5, hjust = 1))
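Since geom_bar counts rows by default, a per-group average has to be computed in some other way. One option, sketched here with base R's aggregate() on made-up readings, is to calculate the means first and plot the resulting table afterwards; the `wind` data frame and its values are illustrative only.

```r
# Made-up wind readings; directions and speeds are illustrative only
wind <- data.frame(
  dir      = c("North", "North", "South", "South", "West"),
  wind_spd = c(10, 20, 5, 15, 30)
)

# Mean wind speed for each direction, one row per group
mean_by_dir <- aggregate(wind_spd ~ dir, data = wind, FUN = mean)
mean_by_dir
```

A table like this, with one value per group, can be drawn with geom_col(), which plots the values as given instead of counting rows.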