#  devtools::install_github("cboettig/knitcitations@v1")
  library(knitcitations); cleanbib()
  cite_options(citation_format = "pandoc", check.entries=FALSE)
  library(bibtex)
  library(psych)
  library(curl)
  library(devtools)
  library(dplyr)
  library(ggplot2)
  library(tidyr)
  library(reshape2)
  library(knitr)
  library(lattice) #just to illustrate another histogram function
  library(kfigr) #this lets us crossreference figures, etc. Read more about it https://github.com/mkoohafkan/kfigr/edit/master/vignettes/introduction.Rmd 
  library(rwunderground) #see https://github.com/ALShum/rwunderground - you'll need to register an api key
source("C:\\Users\\125295_admin\\Oxygen Enterprise\\Project data\\RProjects\\DSI\\rwunderground_key.R") #this loads my key without making it visible to you . That file just has one line: 
  #rwunderground::set_api_key("mykey")
  
   #library(rnoaa) - complicated and returned crap data
  #install_github("weatherData", "Ram-N") - now defunct; the CRAN version (install.packages("weatherData")) won't do, because the personal weather station data is only available via the github version
  #package info at https://ram-n.github.io/weatherData/

  options("kfigr.prefix" = TRUE)
  options("kfigr.link" = TRUE)

1 Abstract

If you really want to extend yourself, or already have skills in working with markdown, you might choose to use this file instead of Microsoft Word. Please note that while we’re keen for you to extend your technical skills, a key concern of AT2 is how you communicate about and with data, so take care not to get distracted by technical issues, and keep your focus on the criteria. This template mirrors the Word file to provide a structure for the report. Make sure that you read it closely, several times.

2 Word Length

2800 words (excluding data excerpts and appendices, visualisations, and references)

3 Using this template

This is the suggested structure for your report. The basic structure is similar to the style of academic papers and, if followed, should ensure that everything you need to include is present. I have included the assessment criteria at the relevant places to remind you of what needs to be in the report.

You are free to vary the structure by renaming the sections, including other sections, or dropping ones that you don’t use. Keep in mind that the suggested structure is conventional (and therefore easy to follow), practical, and comprehensive. (Criterion 5: Professionally presented in a manner appropriate to the discipline.) If you do use this template, you will need to install R, RStudio, and the packages listed in the code block at the head of this document.

Note: We have provided some sample code below, along with some text between angle brackets, < >. All of this should be replaced by your work.

Please don’t forget to include a title, name, student number, etc. on a covering sheet.

4 Introduction

<a paragraph that gives an overview of what you’ve done>

4.1 Citations

You’ll want to ensure that you connect what you did, and what you found, to the wider context of data science - including external sources of information (such as academic studies). You can build your reflection (criterion 4) through the paper like that. You’ll need to work out how to cite (this is not my expertise)… The relationship was first described by Halpern et al. (2006). However, there are also opinions that the relationship is spurious (Keil et al. 2012). We used R for our calculations (R Core Team 2017), and we used package knitcitations (Boettiger 2017) to make the bibliography.

This is a block quotation; if you have a long quote from someone, this is the best way to do it (but don’t forget the citation). This is a very long line that will still be quoted properly when it wraps. Oh boy let’s keep writing to make sure this is long enough to actually wrap for everyone. Oh, you can put Markdown into a blockquote.

4.2 Other formatting

You’ll see that we can:

  1. Format things, e.g.
  2. And add headings using a # (but note, to get that to display properly I had to ‘escape’ it using a preceding backslash)
  3. And we can use citations, inline code, and charts
  4. All this means we can write a document, but we can also pull data in live and display it to the reader, who can also download this Rmd to see how we did it…it’s pretty cool hey?

But, just because it’s in a different format, that doesn’t mean you can get away with not following normal writing conventions. Writing should be in paragraphs, with correct spelling and grammar, and figures, etc. should be fully explained to the reader. For example, it’s weird that I’ve just dumped the Fig.1 image below, so in the methods I’ll make sure to explain it properly. (Note, though, that this does illustrate how to include images - so if you do analysis outside R, you can still include figures, etc. without the code being R based.)

Fig.1 A random logo image
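In case you’re wondering how: in the Rmd source an image can be dropped in with plain Markdown, ![Fig.1 A random logo image](logo.png), or from an R chunk - a sketch, where "logo.png" stands in for whatever file you actually have:

knitr::include_graphics("logo.png") #knitr is loaded in the setup chunk; the path is relative to the Rmd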

5 Description of process, or method

<this is where you give details about what you’ve been collecting and how much data you have; why you chose this data to collect; how you managed data quality and frequency-of-collection issues; what you did to anonymise or de-identify the data; and how you dealt with the storage and sharing of data within the group. Do not include a dump of all your data here. If you wish to include examples of data (and I think you should) then put these in an appendix to the report.
Criterion 1: Justifies a method to obtain data from multiple sources, for gaining insight into a chosen problem, including analysis of data quality issues in the individual and group data.>

6 Analysis

<describe how you analysed your data, and how you contrasted your data with the group’s data.
Criterion 2: Justifies the analysis of the obtained data, including quality issues, to draw conclusions in a professional and engaging manner.>

6.1 Equations (probably not useful, but just in case)

If you want to insert equations (you probably don’t) you can do so using the syntax below. You can also insert bits of inline code, so the 2+2 here is produced by a piece of code, and the 4 is produced by evaluating that same expression inline (namely 2+2).

The deterministic part of the model is defined by this in-line equation as \(\mu_i = \beta_0 + \beta_1x\), and the stochastic part by the centered equation:

\[ \frac{1}{\sqrt{2\pi}\sigma}e^{-(x-\mu_i)^2/(2\sigma^2)} \]
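For the curious, the raw Rmd behind the lines above looks like this (standard R Markdown syntax: inline code goes in backticks starting with r, inline maths in $...$, and displayed maths in $$...$$):

The 4 is produced by the inline code `r 2+2`, the in-line equation by
$\mu_i = \beta_0 + \beta_1 x$, and the centered equation by:

$$ \frac{1}{\sqrt{2\pi}\sigma}e^{-(x-\mu_i)^2/(2\sigma^2)} $$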

6.2 Get and load data

# Sydney is 33.8688° S, 151.2093° E
#see https://github.com/ALShum/rwunderground - you'll need to register an api key
#use_metric = TRUE

end_date <- Sys.Date()-1
start_date <- Sys.Date()-35

#lookup_airport("Melbourne")
#set_location(zip_code = "2016") you can also use this

#weather_sydney <- history_range(set_location(airport_code = "SYD"), date_start = start_date, date_end = end_date, limit = 10, no_api = FALSE, use_metric = TRUE, key = get_api_key(), raw = FALSE, message = TRUE)
#write.csv(weather_sydney, file = "syd_weath.csv")
weather_sydney <- read.csv("syd_weath.csv", stringsAsFactors = F)

#weather_melbourne <- history_range(set_location(PWS_id = "IVICTORI842"), date_start = start_date, date_end = end_date, limit = 10, no_api = FALSE, use_metric = TRUE, key = get_api_key(), raw = FALSE, message = TRUE) #deliberately taking data from a slightly worse station. Use airport_code = "YMML" for equiv
#weather_melbourne_gd <- history_range(set_location(PWS_id = "INORTHCO3"), date_start = start_date, date_end = end_date, limit = 10, no_api = FALSE, use_metric = TRUE, key = get_api_key(), raw = FALSE, message = TRUE)
#write.csv(weather_melbourne, file = "mel_weath.csv")
weather_melbourne <- read.csv("mel_weath.csv", stringsAsFactors = F)

#to make this more interesting I'm going to randomly delete 800 observations
weather_melbourne <- weather_melbourne[-sample(1:nrow(weather_melbourne), 800), ]
#for another 250 observations we're going to deliberately add noisy missing data in the form of -9999 values
weather_melbourne$temp[sample(nrow(weather_melbourne),250)] <- -9999

#weather_sydney_summary <- history_daily(set_location(airport_code = "SYD"), date = start_date, use_metric = TRUE, key = get_api_key(), raw = FALSE, message = TRUE)
#weather_melbourne_summary <- history_daily(set_location(airport_code = "YMML"), date = start_date, use_metric = TRUE, key = get_api_key(), raw = FALSE, message = TRUE)

#if we wanted to write this data and read it back in - or if you want to read data from your system or the web - you can use this pair of lines
#write.csv(weather_sydney, file = "syd_weath.csv")
#read.csv("syd_weath.csv", stringsAsFactors = F) #(you might want to change stringsAsFactors to TRUE)

6.3 Tables

library(knitr)

kable(rbind(describe(weather_sydney$temp),describe(weather_melbourne$temp)), caption = "Summary of Mel & Sydney weather")
Summary of Mel & Sydney weather

       vars     n         mean           sd  median      trimmed       mad    min   max    range        skew   kurtosis           se
  X1      1  2206     23.37307     3.006867    23.0     23.16025   2.96520     17  39.0     22.0   1.0979415   2.984339    0.0640194
  X11     1   762  -3265.15289  4708.511055    18.4  -2836.86672  11.41602  -9999  39.7  10038.7  -0.7308688  -1.467748  170.5713586
#note, you should label the rows

You’ll see above that I labelled the table; I did that by adding anchor="table" to the start of the chunk (along with the name Summary of Mel & Sydney weather). Now, I can use figr("Summary of Mel & Sydney weather", "Table") to refer to it like this: 1. I haven’t worked out if you can get it to output the whole caption (e.g. Table 1: Caption name here). You should also see something weird in the data…what’s going on there…
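For reference, the pattern looks roughly like this in the Rmd source (a sketch based on what’s described above and the kfigr vignette linked in the setup chunk - check the vignette for the exact options):

```{r Summary of Mel & Sydney weather, anchor="table"}
kable(rbind(describe(weather_sydney$temp), describe(weather_melbourne$temp)),
      caption = "Summary of Mel & Sydney weather")
```

…and then in the text: `r figr("Summary of Mel & Sydney weather", "Table")`.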

6.4 Plots

An example (a pretty ugly one) of a plot is in the code, and visible below (once I get the captioning working again)…Let’s see how this can be improved.

hist(weather_sydney$temp)

hist(weather_melbourne$temp)

#this data used to need a lot more cleaning! Previously, you needed to remove the -9999 values yourself
SYD_temp <- as.data.frame(as.numeric(unlist(subset(weather_sydney, temp >-300, select=c("temp"))))) #you could also replace these with NA, but here we're just going to exclude the missing data
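#a sketch of that NA alternative (not run here), using dplyr's na_if() to turn
#the -9999 sentinel values into NA, then na.rm = TRUE downstream:
#SYD_temp_na <- within(weather_sydney, temp <- dplyr::na_if(as.numeric(temp), -9999))
#mean(SYD_temp_na$temp, na.rm = TRUE)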
colnames(SYD_temp)[1] <- "temp"
SYD_temp$loc <- "SYD"

MEL_temp <- as.data.frame(as.numeric(unlist(subset(weather_melbourne, temp >-300, select=c("temp")))))
colnames(MEL_temp)[1] <- "temp"
MEL_temp$loc <- "MEL"

temps <- rbind(SYD_temp, MEL_temp)
temps$temp <- as.numeric(temps$temp)

First off, I did a bit of cleaning (code chunk above). The ggplot package makes much nicer figures, like the one shown in Figure 1, but what’s wrong with that figure?

ggplot(temps, aes(x = temp, fill = loc)) + geom_histogram(alpha = .5, position = 'identity') 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
pretty histograms

Hm, ok, let’s try and fix Figure 1

ggplot(temps, aes(x = temp, fill = loc)) + geom_histogram(alpha = .5, aes(y = ..density..), position = 'identity') #note use of 'density' because we have unequal temperature counts in each dataset, and this lets us understand the data as a percentage over the period. Alpha is the transparency level.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
pretty histograms
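By the way, that stat_bin() message is ggplot nudging us to choose a binwidth deliberately rather than accept the default 30 bins. Since we’re plotting temperatures in °C, a one-degree bin is a defensible choice (a sketch - pick whatever suits your data):

ggplot(temps, aes(x = temp, fill = loc)) +
    geom_histogram(aes(y = ..density..), alpha = .5, position = 'identity', binwidth = 1)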

Figure 2 is an improvement. There’s another (also simple) way to do this

histogram(~ temp | loc, data=temps)

What’s wrong with this?

ggplot(temps) + 
  geom_bar(aes(x = loc, y = temp, fill = loc),
           position = "dodge", stat = "summary", fun.y = "mean")

More informative?

ggplot(temps, aes(x=loc, y=temp, fill=loc)) + geom_boxplot() +
    guides(fill=FALSE)+
    stat_summary(fun.y=mean, geom="point", shape=5, size=4)

A couple of useful things: let’s pull the date out into its own column, and this time we’ll replace missing values (-9999) with NA

ggplot(weather_sydney, aes(x=temp, y=dew_pt)) +
    geom_point(shape=1)      # Use hollow circles

weather_sydney[weather_sydney == -9999] <- NA
weather_sydney$date <- as.Date(weather_sydney$date)

weather_melbourne[weather_melbourne == -9999] <- NA
weather_melbourne$date <- as.Date(weather_melbourne$date)

ggplot(weather_sydney, aes(x=temp, y=dew_pt)) +
    geom_point(shape=1)      # Use hollow circles

What if we want to explore the relationship between dew_pt and other features? (For background on scatter versus line charts, see https://support.office.com/en-us/article/Present-your-data-in-a-scatter-chart-or-a-line-chart-4570a80f-599a-4d6b-a155-104a9018b86e)

One way you might be tempted to do this…

bad_example <- subset(weather_sydney, !is.na(hum), select=c("hum", "dew_pt","date"))
bad_example[c("hum","dew_pt")] <- lapply(bad_example[c("hum","dew_pt")],as.numeric)

bad_example <- aggregate(. ~ date, bad_example, FUN=mean)

#convert to long
bad_example <- melt(bad_example, id.vars = c("date"))

ggplot(data=bad_example, aes(x=date, y=value, group=variable, colour=variable)) +
    geom_line() +
    geom_point()

#Is date an important variable in this analysis? Does the scaling of the data give us the best available insight into relationships of paired values? Is the use of a line to join datapoints appropriate given missing data?

A better way?

ggplot(weather_sydney, aes(x=hum, y=dew_pt)) +
    geom_point(shape=1)      # Use hollow circles

cor.test(as.numeric(weather_sydney$hum),as.numeric(weather_sydney$dew_pt))
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(weather_sydney$hum) and as.numeric(weather_sydney$dew_pt)
## t = 40.31, df = 2204, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6267465 0.6748281
## sample estimates:
##       cor 
## 0.6514409

Of course, you don’t have to just display the correlation; you can output the coefficient in-line with code: 0.6514409.
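One tidy way to do that (a sketch; ct is just a name I’ve made up) is to store the test result and round the estimate in-line:

ct <- cor.test(as.numeric(weather_sydney$hum), as.numeric(weather_sydney$dew_pt))
round(ct$estimate, 2) #in the Rmd text itself you'd write `r round(ct$estimate, 2)`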

Ok what if we want to look at how weather varies over time and place?

weather_sydney$loc <- "Sydney"
weather_melbourne$loc <- "Melbourne"
weather <- rbind(weather_sydney[c("temp","dew_pt","hum","wind_spd","precip_total","cond","date","loc")],weather_melbourne[c("temp","dew_pt","hum","wind_spd","precip_total","cond","date","loc")])
weather$month <- format(as.Date(weather$date), "%m")

ggplot(weather, aes(x=month, y=temp, fill=loc)) + geom_boxplot() +
    guides(fill=FALSE) +
    stat_summary(fun.y=mean, geom="point", shape=5, size=4) +
    facet_wrap(~loc)
## Warning: Removed 250 rows containing non-finite values (stat_boxplot).
## Warning: Removed 250 rows containing non-finite values (stat_summary).

Or at how weather events vary by place

#unique(weather$Events)
unique(weather$cond)
##  [1] "Partly Cloudy"                "Clear"                       
##  [3] ""                             "Mostly Cloudy"               
##  [5] "Scattered Clouds"             "Haze"                        
##  [7] "Thunderstorm"                 "Light Rain Showers"          
##  [9] "Light Thunderstorms and Rain" "Light Rain"                  
## [11] "Overcast"                     "Unknown"                     
## [13] "Rain Showers"                 "Light Drizzle"               
## [15] "Drizzle"                      "Heavy Rain Showers"          
## [17] "Rain"                         NA
table(weather$cond,weather$loc)
##                               
##                                Melbourne Sydney
##                                        0    470
##   Clear                                0    285
##   Drizzle                              0      3
##   Haze                                 0     63
##   Heavy Rain Showers                   0      2
##   Light Drizzle                        0     11
##   Light Rain                           0     40
##   Light Rain Showers                   0     72
##   Light Thunderstorms and Rain         0      5
##   Mostly Cloudy                        0    591
##   Overcast                             0     27
##   Partly Cloudy                        0    362
##   Rain                                 0     10
##   Rain Showers                         0     13
##   Scattered Clouds                     0    247
##   Thunderstorm                         0      1
##   Unknown                              0      4
weather_con <- unique(subset(weather,select=c("cond","date","loc")))

ggplot(data=weather_con, aes(x=cond, fill = loc)) +
    geom_bar(position=position_dodge()) +
    theme(axis.text.x = element_text(angle = 90, vjust = .5, hjust = 1))

#weather_con <- unique(subset(weather,select=c("Events","DateUTC","loc")))
#ggplot(data=weather_con, aes(x=Events, fill = loc)) +
#    geom_bar(position=position_dodge()) +
#    scale_y_continuous(labels=scales::percent) +
#    theme(axis.text.x = element_text(angle = 90, vjust = .5, hjust = 1))

We’ve often seen students refer to ‘average mood’. Sometimes this might make sense, but here is an analogous example…

#let's take the weather condition data, and code it from best ('Clear' = 7) down to worst ('Thunderstorm' = 1); any condition we don't recode will end up as NA
weather_con$cond[weather_con$cond=="Clear"] <- 7
weather_con$cond[weather_con$cond==""] <- 6
weather_con$cond[weather_con$cond=="Mostly Cloudy"] <- 5
weather_con$cond[weather_con$cond=="Light Rain Showers"] <- 4
weather_con$cond[weather_con$cond=="Light Thunderstorms and Rain"] <- 3
weather_con$cond[weather_con$cond=="Heavy Rain Showers"] <- 2
weather_con$cond[weather_con$cond=="Thunderstorm"] <- 1

weather_con$cond <- as.numeric(weather_con$cond)
## Warning: NAs introduced by coercion
ggplot(weather_con, aes(x = loc, y = cond, fill=loc)) + geom_boxplot() +
    stat_summary(fun.y=mean, geom="point", shape=5, size=4)
## Warning: Removed 129 rows containing non-finite values (stat_boxplot).
## Warning: Removed 129 rows containing non-finite values (stat_summary).
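If you do want something you can summarise, an ordered factor at least keeps the ordinal nature of the conditions explicit instead of pretending the gaps between categories are equal (a sketch; the ordering mirrors my recoding above, and anything uncoded again becomes NA):

cond_levels <- c("Thunderstorm", "Heavy Rain Showers", "Light Thunderstorms and Rain",
                 "Light Rain Showers", "Mostly Cloudy", "", "Clear") #worst to best
weather_ord <- unique(subset(weather, select=c("cond","date","loc")))
weather_ord$cond <- factor(weather_ord$cond, levels = cond_levels, ordered = TRUE)
summary(weather_ord$cond)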

7 Findings and Conclusions

<what conclusions did you come to as a result of the analysis of your data and of the group’s data.
Criterion 2: Justifies the analysis of the obtained data, including quality issues, to draw conclusions in a professional and engaging manner.>

8 Discussion

<discuss aspects of the process that you see as important. For example, what difficulties did you encounter; how could you avoid problems if you did it again; etc>

Your ‘justification’ and evaluation of your approach is likely to go in this section, but may also be threaded through the preceding sections. This includes Criterion 3: Identifies, contextualises, and reflects on the ethical, privacy, and legal issues relevant to the collection and analysis of personal data of self and others.

9 Reflection

<General reflection on what you learnt during this task. What are you unsure about? What would you do differently if you had to do it all again?
Criterion 4: Connects the individual experience of this QS project to the practice of data science (and the preceding three criteria).>

10 Other

If you are submitting any additional materials, such as short multimedia presentations or visualisations (such as Prezi, or voice-over video/screen capture, etc), they probably can’t be submitted through UTSOnline, so you will need to arrange some other process such as posting on YouTube or elsewhere, or handing in a memory stick or CD/DVD. Please ensure that additional material like this is accessible to the markers (test this by accessing it through someone else’s computer) and avoid any restrictive or proprietary software constraints. Remember to check any included web links!

Diagrams, figures, charts and illustrations must be labelled, and explained, and must be referred to from somewhere in the report. If drawn from another source, then the source must be provided.

11 References

  write.bibtex(file="references.bib")

Boettiger, C. (2017). Knitcitations: Citations for ’knitr’ markdown files. Retrieved from https://CRAN.R-project.org/package=knitcitations

Halpern, B.S., Regan, H.M., Possingham, H.P. & McCarthy, M.A. (2006). Accounting for uncertainty in marine reserve design. Ecology Letters, 9, 2–11. Retrieved from https://doi.org/10.1111/j.1461-0248.2005.00827.x

Keil, P., Belmaker, J., Wilson, A.M., Unitt, P. & Jetz, W. (2012). Downscaling of species distribution models: a hierarchical approach (R. Freckleton, Ed.). Methods in Ecology and Evolution, 4, 82–94. Retrieved from https://doi.org/10.1111/j.2041-210x.2012.00264.x

R Core Team. (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Retrieved from https://www.R-project.org/