1 Your AT2 and using R Notebooks

Data scientists don’t just use Word and PowerPoint to write. They also write live reports that draw on real-time data to show visualisations alongside narratives. This type of writing can draw on APIs, databases, and local data to write text and conduct reproducible analysis of data for insights. One of the key tools for this is the ‘notebook.’

For AT2, we want you to use RStudio to write and submit your report. We know this will be unfamiliar for many of you, but that’s ok. We’re not asking you to learn to code. We’ve provided a template, and if you want, you can simply modify the example ‘markdown’ to format your own report, and load visualisations that you’ve created in other tools (like RawGraphs.io, or Tableau). Some of you will want to go further, and that’s ok too! But remember to address the assessment criteria - this isn’t an assignment where you have to demonstrate technical coding skills.

Please note that while we’re keen for you to extend your technical skills, a key concern of AT2 is how you communicate about and with data, so take caution not to get distracted by technical issues, and to focus on the criteria. This template provides a structure for the report. Make sure that you read it closely, several times.

This template serves two purposes:

  1. It gives you a suggested structure for your report
  2. It demonstrates some of the functions of R and markdown

1.1 Structure of the template

I have included the assessment criteria at the relevant places to remind you of what needs to be in the report.

You are free to vary the structure by renaming the sections, including other sections, or dropping ones that you don’t use. Keep in mind that the suggested structure is conventional (and therefore easy to follow), practical, and comprehensive. (Criterion 5: Professionally presented in a manner appropriate to the discipline.) If you do use this template, you will need to install R, RStudio, and the packages listed in the code block at the head of this document.

Note: We have provided some sample code below, along with some text mostly marked as blockquotes using >. All of this should be replaced by your work.

Please don’t forget to include a title, name, student number, etc. on a covering sheet

1.2 To submit AT2, you will:

  1. Complete self assessment on REVIEW
  2. Use the RMarkdown template to ‘knit’ an html file (html is the default; you may use pdf, or knit to rich text if you really want, but I suggest you stick to html)
  3. Upload to Canvas:
    • If you have used this template with defaults, just the html output. This should include the code links (if not, also upload your Rmd file).
    • If you have modified anything, then the HTML/PDF/.doc output and any associated directories (it may be necessary to zip this). In this case you will also need to share the raw .Rmd.

You may also wish to share these on github or rpubs - however, consider the privacy implications of doing so first.

1.3 Word Length

2800 words (excluding data excerpts and appendices, visualisations, and references).

See details below for referencing. If you use footnotes, they are included like this [^1].

To check this, you can either copy the HTML output into Word, or use the Word Count addin, e.g. wordcountaddin:::text_stats()

wordcountaddin:::text_stats()
Method           koRpus        stringi
Word count       3213          3087
Character count  18195         18199
Sentence count   246           Not available
Reading time     16.1 minutes  15.4 minutes
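
If the addin is not available, a rough cross-check can be done in base R. The sketch below is only an approximation: `rough_word_count()` is a hypothetical helper (not part of wordcountaddin) that drops fenced code chunks and counts whitespace-separated tokens, so its totals will differ a little from the addin's.

```r
# Hypothetical helper: rough word count for lines of R Markdown source.
# Fenced code chunks are skipped; everything else is split on whitespace.
rough_word_count <- function(lines) {
  is_fence <- grepl("^\\s*`{3}", lines)
  # An odd number of fences seen so far means the line sits inside a chunk
  inside_chunk <- cumsum(is_fence) %% 2 == 1
  prose <- lines[!is_fence & !inside_chunk]
  words <- unlist(strsplit(prose, "\\s+"))
  sum(nzchar(words))
}

# Small made-up example: two sentences of prose around one code chunk
fence <- strrep("`", 3)
example_lines <- c(
  "This report compares my step counts with my group's.",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence,
  "Steps were lower on weekends."
)

rough_word_count(example_lines)
```

To use it on a real file, pass in `readLines()` of your own .Rmd.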

1.3.1 Grammar and spelling in RStudio

If you’re using RStudio, you can still do grammar and spelling checks. The ‘Visual editor’ mode makes this more natural (Ctrl+Shift+F4 on Windows).

The gramr package lets you run an open tool within RStudio to get this feedback (you can explore the code on GitHub).

pacman::p_load_gh("ropenscilabs/gramr")

1.4 Introduction to R Notebooks

Click here to read more about R Notebooks!

This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Try executing the chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.
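
For example, a chunk might contain nothing more than the lines below. This is only an illustration; `cars` is a small dataset that ships with R, not part of the assignment.

```r
# cars is a built-in dataset: 50 observations of speed and stopping distance
summary(cars)

# Plots drawn in a chunk appear directly beneath it in the notebook
plot(cars)
```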

Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.

1.5 Read more about working with notebooks in R

# Needs the tidyverse (for tibble and %>%) and bs4cards (for cards) to be loaded first
rintro <- tibble(
  long_name = c("Install RStudio interactive","rmarkdown intro","DSI vignettes","rmarkdown detailed guide"),
  blurb = c("Interactive learnr activity to setup RStudio, R, and packages","An introduction to R markdown","Check the DSI vignettes (and, via Canvas, exemplar reports) for inspiration","A longer book about how to use markdown with R"),
  image_url = c("https://rstudio.github.io/learnr/logo.png","https://raw.githubusercontent.com/rstudio/rmarkdown/main/man/figures/logo.png","https://rstudio.github.io/learnr/logo.png","https://raw.githubusercontent.com/rstudio/rmarkdown/main/man/figures/logo.png"),
  resource_url = c("https://learnr-examples.shinyapps.io/ex-setup-r/","https://rmarkdown.rstudio.com/authoring_quick_tour.html#overview","https://sjgknight.github.io/DSI/","https://bookdown.org/yihui/rmarkdown/markdown-syntax.html"),
  tags = c("Get-started; learnr","Get-started; markdown","intermediate; analysis","advanced; markdown")
)

rintro %>% 
  cards(
    title = long_name,
    text = blurb,
    link = resource_url,
    image = image_url,
    tags = paste("all;", tags),
    width = 4,
    footer = tags,
    layout = "label-right"
  )

  • Install RStudio interactive: Interactive learnr activity to setup RStudio, R, and packages
  • rmarkdown intro: An introduction to R markdown
  • DSI vignettes: Check the DSI vignettes (and, via Canvas, exemplar reports) for inspiration
  • rmarkdown detailed guide: A longer book about how to use markdown with R

What is R Markdown? from RStudio, Inc. on Vimeo.

1.6 Setting the notebook up

The template is made up of:

  1. this file (which you’re probably viewing the output of)
  2. the raw .Rmd file
  3. a suggested directory structure

The easiest way to work with it is to download the github repo and open the .Rproj file in RStudio.

1.6.1 Using projects in R

It is best to set up your assignment as a project, rather than just have a single RMarkdown file. Setting up a project will define your working directory based on where a .Rproj file is located. Other files and folders can then be found relative to that .Rproj file. This gives projects some advantages:

  • It’s easier to find your files, because you can set up subfolders with consistent names
  • You can refer to your data with relative referencing, e.g. ../data/my_data.csv, rather than having to type C:\folder\other_folder\data\my_data.csv.
  • When you open a project, RStudio starts a fresh R session for it, so the packages and objects you had loaded for other work won’t get in the way of what you’re working on now. When you close the project, RStudio goes back to the state it was in before.

To start a project in RStudio, click File -> New Project and follow the prompts to set up a new project in a new folder.

  1. Create subfolders called “R” and “data.”
  2. Save this template to the R folder, along with any other R code files you create on the project.
  3. Save any data files (eg csv files, or screenshots from your other analysis) to the data folder.
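
The folder layout and relative referencing described above can be sketched in a few lines of base R. Everything here is illustrative: the project name `AT2_project`, the file `my_data.csv`, and the step counts are made up, and the folders are created in a temporary directory so nothing on your computer is changed.

```r
# Sketch only: build the suggested layout inside a temporary folder
project_dir <- file.path(tempdir(), "AT2_project")  # hypothetical project name
dir.create(file.path(project_dir, "R"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project_dir, "data"), showWarnings = FALSE)

# Save a tiny made-up dataset into the data folder
steps <- data.frame(day = c("Mon", "Tue", "Wed"), steps = c(8200, 10450, 6900))
write.csv(steps, file.path(project_dir, "data", "my_data.csv"), row.names = FALSE)

# From a file saved in the R folder, the data sits one level up, then in data/
old_wd <- setwd(file.path(project_dir, "R"))
my_data <- read.csv("../data/my_data.csv")
setwd(old_wd)  # restore the previous working directory

my_data
```

The same relative path keeps working if the whole project folder is moved or zipped and shared, which is the main reason to prefer it over a full path.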

I highly recommend this link on project-oriented workflow

1.6.2 Installing Packages

If we don’t have these packages, we’ll need to download them from the internet. Here’s some code that does that.

You’ll see a “#” at the start of the first line; this tells R that it is a ‘comment’ not code. If you remove the “#” R will try to run the code.

Installing the packages only puts them on our computer. To use them in our project, we also need to load them. I’ve used a package called pacman, which checks whether the named packages are installed, installs any that are missing, and loads them. Normally you should not do this, because it’s useful to be aware of the environment you’re executing code with.

You may also want to knit the file on your computer, which will install the useful packages below.

#install.packages("pacman")

library(pacman)

p_load(bs4cards, tidyverse, flexdashboard, shiny, psych, devtools, bibtex, curl, gganimate)

p_load_gh("benmarwick/wordcountaddin")

#go to Tools > Addins to select the wordcountaddin 

pacman::p_install_gh("hadley/emo") # install only; call its functions directly with emo:: (largely for illustrative purposes)
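
To make the check-install-load pattern visible, here is a minimal base R sketch of roughly what pacman automates. `load_or_install()` is a hypothetical helper written for this illustration, not a pacman function, and it only handles CRAN packages.

```r
# Hypothetical helper showing roughly what pacman::p_load() automates
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    # requireNamespace() returns FALSE (rather than stopping) if a package is missing
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    # character.only = TRUE lets library() take the package name from a variable
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# "tools" and "stats" ship with R, so nothing is downloaded here
load_or_install(c("tools", "stats"))
```

Writing the steps out like this keeps it clear which packages a report depends on, which is the awareness the paragraph above recommends.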

1.7 Timeline for task 📆

For formal timelines make sure you refer to (1) the subject outline (the most important document in any subject), (2) the subject Canvas site and REVIEW, both of which show deadlines, and (3) if unsure, ask me.

pacman::p_load(timevis)

week_1 <- as.Date("2022-02-21")

tl <- data.frame(
  id      = 0:14,
  long_content = c("Pre-work: What Does Facebook Know About Me?",
              "Criterion 1: Choose group, data, and method. Establish communication approach and begin sharing data and insights.",
              "Criterion 1: Justify collection and analysis. Be able to justify your approach 'for the method to obtain data from multiple sources, for gaining insight into a chosen problem, including analysis of data quality issues in the individual and group data' (Criterion 1) - draft this section in the template",
              "Have data, share insight. Ensure you have a shared dataset in preparation for Mystery Box formative task; start to think about insights (criterion 2)",
              "AT2a due. Group status update, and your preliminary thoughts on analysis and external (ideally scholarly) resources you're drawing on",
              "AT1 due. Analysis and planning. Continue thinking about insights you might gain, visualisations you can use, issues (including ethical) with your data (criteria 2 and 3).  Review sample assignments and the AT2 template.",
              "Consider issues in data. Focus on issues with your data (including ethical) (criteria 1-3) and their implications for the practice of data science (criterion 4)",
                            "STUVAC HERE. Continue on AT2.",
              "Consider issues in data. Continue from week 7, with a particular focus on how comparing across the levels of data (individual, group, cohort) provides insights. Ensure you have considered the privacy and ethical issues throughout your report, and the implications of the project for the practice of data science",
              "Week 9, draft submission of AT2b. See detailed instructions.",
              "Week 10, review colleague's AT2b. Continue work on your own final submission",
              "Week 11, review colleague's AT2b. Continue work on your own final submission",
              "Week 12 AT2b feedback due. You should use that feedback to reflect on how to improve for your final submission",
              "STUVAC. Continue AT2 work.",
              "AT2C Due. Final assessment period."
              )
  )

tl <- tl %>% mutate(start = week_1 + 7*id,
                      end = start + 7,
                    # short label: the text before the first full stop (str_split_i() needs stringr 1.5.0 or later)
                    content = str_split_i(long_content, "\\.", 1))
#  X-WR-TIMEZONE = "Australia/Sydney"
#library(calendar)
tl %>% transmute(
  DTSTART = as.POSIXct(.$start),
  DTEND = as.POSIXct(.$end),
  SUMMARY = content,
  DESCRIPTION = long_content) %>%
  mutate(UID = replicate(nrow(.), calendar::ic_guid())) %>%
  calendar::ical() %>%
  calendar::ic_write(.,"DSI_AT2.ics")

You should be able to download a calendar of events to import into your provider of choice by clicking on this ics calendar download.
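
The week dates in the timeline come from simple Date arithmetic: adding a number to a Date adds that many days. A small sketch using the same start date and only the first three rows:

```r
week_1 <- as.Date("2022-02-21")  # same start date as the timeline code
id <- 0:2                        # first three rows only

start <- week_1 + 7 * id         # adding a number to a Date adds that many days
end <- start + 7

data.frame(id, start, end)
```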

1.8 Professionally presented (criterion 5), and reflection (criterion 4)

Each criterion threads right through the report. This is especially true for 4 and 5. I will especially look for reports that:

  • are creative, using novel approaches to the data and its analysis or representation;
  • think about stakeholders, including how to communicate to people who might be impacted by this data, privacy and ethics, implications for innovation (new tools/apps/policies), great visual communication, etc.;
  • engage with technical approaches. This doesn’t have to mean coding amazing API pipelines; it could be exploring the range of publicly available GUI/demo services that might do interesting things with the data you’ve gathered.

Criterion 5 Level of professionalism in the presentation appropriate to the discipline: You can see specific guidance on this criterion in the subject outline. Remember, your visualisations and the way you develop your narrative are a part of professional presentation. You should draw on external sources to support and contextualise your work throughout. Be careful to emphasise interpretation and analysis over description and narrative. So, don’t tell us about discussions you had and who said what (description); tell us about the decisions you made, why, and their implications for the practice of data science (analysis).

2 The Quantified Self

For AT2 you will collect, record, share, and analyse several types of data about yourself and compare and contrast what you find in your analysis with an analysis of the same data from the group.

You will negotiate and agree on a process for recording, sharing, and storing the data being collected as a group in the first class session for AT2. Your attendance at this session will be crucial in getting off to a strong start with a minimum of disruption for this major task.

The following requirements apply to your data collection:

  1. Two sources of data negotiated with your group for sharing:

    • Unstructured. One source must be unstructured in nature (e.g. text, comments, images, audio; you might obtain this from social media, email, Slack, Twitter, daily photos, etc.). You may find the ‘what does Facebook know about you’ materials useful for this. You will need to work out a method to process this data.
    • Structured or unstructured. The second source can be structured or unstructured; for structured data, you can draw on one of the many examples provided.
  2. One additional individual dataset, structured or unstructured. This can be of personal interest to you. It does not need to be shared across the groups, but should be analysed by you in your report.

  3. External cohort-level data to compare your own data to (probably summary data from previously published work): The idea of this dataset is that you will have data from: (1) an individual, (2) a small group, and (3) a larger cohort. You will probably draw on published summary level data (for example, what is the average step count in Australia?…for who?), or publicly available stepcount data. In order of complexity, you may be able to obtain insights from one of these sources:

    • Searching for your target variable on Google or the library database (e.g. ‘sleep, health, Australia’; ‘average steps sydney,’ etc.). This is the recommended approach.
    • Some apps have public datasets, directly accessible publicly (e.g. Strava), or via platforms such as Kaggle
    • Open humans foundation - has a set of notebooks that demonstrate obtaining and processing data from different services
    • If you’re interested in data that might help us validate measures (e.g., how accurate smartwatch vs phone stepcounts are), (1) check the literature, and (2) datasets such as the wireless sensor data mining one http://www.cis.fordham.edu/wisdm/dataset.php or/and http://crowdsignals.io/ may be of interest.

Examples of data that you and your group could collect include: daily step counts; pulse rates; time spent on activities each day (exercise, grooming, travelling, eating/cooking, shopping, sleeping, studying, etc.); sleep patterns; daily spending; number and length of conversations each day; location tracking, and so on. Some of these can be easily tracked via smartphone apps; see examples at https://quantifiedself.com/

Old examples of this assignment, and all of our feedback given in a previous semester are available via Canvas.

You might find the DSI vignettes, many created by students in the Statistics subject, helpful if you want to use R to do analysis (but remember, you do not have to!).

2.1 AT2 components

Assignment two has 3 parts. This structure ensures you’re on track for the assignment, and provides an opportunity for you to resubmit your AT2 taking into account the feedback provided to make changes.

2.1.1 Week 5 AT2a

AT2a is due in week 5, and is a short online form (only available in the week before the due date).

2.1.2 Week 9 AT2b

AT2b is due in week 9, via Canvas, and consists of (a) a draft of your final submission, and (b) your feedback to your class colleagues via peer review.

2.1.3 Exam period AT2c

AT2c is your final submission, due in the UTS exam period.

2.2 Formatting guide

Here are some formatting tricks you can use.

2.2.1 Fonts

italics
bold
bold italics
verbatim code
superscript2
subscript2

This is a block quotation, if you have a long quote from someone this is the best way to do it (but don’t forget the citation). This is a very long line that will still be quoted properly when it wraps. Oh boy let’s keep writing to make sure this is long enough to actually wrap for everyone. Oh, you can put Markdown into a blockquote.

2.2.2 Headings

Add headings using a # (but note, to get that to display properly I had to ‘escape’ it using a preceding backslash \#). One # gives you a line with Heading 1 style, ## gives you Heading 2 etc.

2.2.3 Lists

  1. Numbered
  2. Lists
  3. Are
  4. Possible
  • And so
  • are bulleted lists

More examples can be found on the cheat sheet at this link (check website for versions in languages other than English)

2.2.4 Equations

If you want to insert equations (you probably don’t) you can do so using the syntax below. You can also insert bits of inline code, like so: the 2+2 here is produced by a piece of code, and the 4 is produced by an equation (namely 2+2).

The deterministic part of the model is defined by this in-line equation as \(\mu_i = \beta_0 + \beta_1x\), and the stochastic part by the centered equation:

\[ \frac{1}{\sqrt{2\pi}\sigma}e^{-(x-\mu_i)^2/(2\sigma^2)} \]

More examples at this link

2.3 Embedding an image

You might have saved some analysis from another program as a picture file. This is how you paste it: Let’s embed a UTS logo, which I’ve saved to the data folder.

knitr::include_graphics(here::here("AT2_default_template/data/uts_logo_new.png"), dpi = NA)

Or like this: A logo

2.4 Tables

To create tables either:

  • Use R functions to output table data (ideally formatted)
  • Include images of the table - this isn’t ideal, but for this assignment it’s fine
  • Use markdown per the examples below
  • Use bootstrap functions per the examples below
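
For the first option, a minimal sketch: build a small summary with base R, then format it with `knitr::kable()`. The built-in `iris` data, the caption, and the rounding are arbitrary choices for illustration.

```r
# Mean of each measurement per species, using the built-in iris data
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)

# knitr::kable() turns a data frame into a formatted table when the document is
# knitted; fall back to plain printing if knitr is not installed
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::kable(species_means, digits = 2, caption = "Mean measurements by species")
} else {
  print(species_means)
}
```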

2.4.1 Markdown tables

Markdown is fine for simple tables (but you can’t have merged cells, so here I’ve got two tables next to each other). You can create these easily using the Visual editor in RStudio, or tools like TablesGenerator.

Data source: Tweets made by each group member
Data structure: JSON structured, but raw text, media (images), and URLs etc. require further processing for analysis.
Row 1 b1 c1 d1
Row 2 b2 c2 d2

2.4.1.1 Bootstrap layout

You can use bootstrap to create complex layouts, here’s a fairly simple example.

Here is the first Div.

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

And this second div block will be put on the right:

plot(iris[, -5])

2.5 Citations and References

There are three ways for you to include references:

  1. Just write them in, formatted in Harvard style.
  2. Use footnotes
  3. Use a .bib file and cite like this (Knight and Shum 2020)

You’ll want to ensure that you connect what you did, and what you found, to the wider context of data science - including external sources of information (such as academic studies). You can build your reflection (criterion 4) through the paper like that. Use external sources to support and contextualise your claims, by giving examples of where things have gone wrong or worked well before, of relevant policies or systems, and of research into the potential, methods, and issues.

You’ll need to work out how to cite…

2.5.1 Using footnotes - Easiest option, and acceptable for AT2, especially if you get stuck with the other options!

If you’re stuck we’ll just accept footnotes for this assignment. To insert them you just type ^[This is a footnote.]; you’ll get a hyperlinked number, and the list is automatically created at the end of your document! Pretty useful, right?[1]

2.5.2 Citing using a .bib file and citation manager

If you create a .bib file, you can cite using (Halpern et al. 2006) - where your bib file has the ‘key’ (the bit after the @) with all the other detail. See the sample file!

The packages you use are automatically added to a .bib and included in the template by the function at the end of the template.
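
The function the template uses for this is not shown on this page. As a sketch of the general idea only, base R can already produce a BibTeX entry with `citation()` and `toBibtex()`; the temporary file below stands in for a real file such as a hypothetical `packages.bib`.

```r
# Get the citation entry for R itself and convert it to BibTeX text
r_bib <- toBibtex(citation())

# Write it to a .bib file (a temporary file here; in a project you might use a
# hypothetical name such as "packages.bib" instead)
bib_file <- tempfile(fileext = ".bib")
writeLines(r_bib, bib_file)

# Peek at the first line of the file that was written
readLines(bib_file)[1]
```

Passing a package name, as in `citation("stats")`, gives the entry for that package instead.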

2.5.3 knitcitations package

You can use the knitcitations package to add citations by doi or url.

2.6 Other formatting

2.6.1 Including quotations

This is a block quotation, if you have a long quote from someone this is the best way to do it (but don’t forget the citation). This is a very long line that will still be quoted properly when it wraps. Oh boy let’s keep writing to make sure this is long enough to actually wrap for everyone. Oh, you can put Markdown into a blockquote.

You’ll see that we can:

  1. Format things, e.g.
    • italicise
    • or bold
    • or even bold italics (you can also have numbered sublists…)

    …See the markdown cheatsheet for more on this… To link, we use [description here](http://urlhere.com).
  2. And add headings using a # (but note, to get that to display properly I had to ‘escape’ it using a preceding backslash)
  3. And we can use citations, inline code, and charts
  4. All this means we can write a document, but we can also pull data in live and display it to the reader, who can also download this Rmd to see how we did it…it’s pretty cool hey?

But, just because it’s in a different format, that doesn’t mean you can get away with not following normal writing conventions. Writing should be in paragraphs, with correct spelling and grammar, and figures, etc. should be fully explained to the reader.

2.7 Inline R output

You can show full R chunks. But you might also write some output inline, e.g. output the coefficient in-line with code: 0.418684
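
As a sketch of where such a number can come from (this is an illustration on the built-in `cars` data, not the model behind the value quoted above), fit a model in a chunk and keep the coefficient as a single number:

```r
# Fit a simple linear model on the built-in cars data (illustrative only)
fit <- lm(dist ~ speed, data = cars)

# Pull out the slope as a single unnamed number, ready to drop into a sentence
slope <- unname(coef(fit)["speed"])
round(slope, 2)
```

In the text of the .Rmd, writing `r round(slope, 2)` between backticks then inserts that value in the middle of a sentence when the document is knitted.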

2.8 Equations (probably not useful, but just in case)

If you want to insert equations (you probably don’t) you can do so using the syntax below. You can also insert bits of inline code, like so: the 2+2 here is produced by a piece of code, and the 4 is produced by an equation (namely 2+2).

The deterministic part of the model is defined by this in-line equation as \(\mu_i = \beta_0 + \beta_1x\), and the stochastic part by the centered equation:

\[ \frac{1}{\sqrt{2\pi}\sigma}e^{-(x-\mu_i)^2/(2\sigma^2)} \]
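
The centered equation is the density of a normal distribution with mean \(\mu_i\) and standard deviation \(\sigma\). A quick way to convince yourself is to type the formula into R and compare it with the built-in `dnorm()`; the values of `x`, `mu`, and `sigma` below are arbitrary.

```r
# Arbitrary illustrative values
x <- 1.5
mu <- 1
sigma <- 2

# The formula exactly as written in the equation
by_hand <- 1 / (sqrt(2 * pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))

# R's built-in normal density
built_in <- dnorm(x, mean = mu, sd = sigma)

all.equal(by_hand, built_in)
```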

3 The template itself!

In the other .Rmd file we’ll start the template itself.

Halpern, Benjamin S., Helen M. Regan, Hugh P. Possingham, and Michael A. McCarthy. 2006. “Accounting for Uncertainty in Marine Reserve Design.” Ecology Letters 9 (1): 2–11. https://doi.org/10.1111/j.1461-0248.2005.00827.x.
Knight, Simon, and Simon Buckingham Shum. 2020. “Artificial Intelligence Holds Great Potential for Both Students and Teachers – but Only If Used Wisely.” The Conversation. https://theconversation.com/artificial-intelligence-holds-great-potential-for-both-students-and-teachers-but-only-if-used-wisely-81024.

  1. This is a footnote, see how it auto appears at the end of the doc.