Data scientists don’t just use Word and PowerPoint to write. They also write live reports that draw on real-time data to show visualisations alongside narratives. This type of writing can draw on APIs, databases, and local data to write text and conduct reproducible analysis of data for insights. One of the key tools for this is the ‘notebook’.
For AT2, we want you to use RStudio to write and submit your report. We know this will be unfamiliar for many of you, but that’s ok. We’re not asking you to learn to code. We’ve provided a template, and if you want, you can simply modify the example ‘markdown’ to format your own report, and load visualisations that you’ve created in other tools (like RawGraphs.io, or Tableau). Some of you will want to go further, and that’s ok too! But remember to address the assessment criteria - this isn’t an assignment where you have to demonstrate technical coding skills.
Please note that while we’re keen for you to extend your technical skills, a key concern of AT2 is how you communicate about and with data, so take care not to get distracted by technical issues, and focus on the criteria. This template provides a structure for the report. Make sure that you read it closely, several times.
This template serves two purposes:
R and markdown
I have included the assessment criteria at the relevant places to remind you of what needs to be in the report.
You are free to vary the structure by renaming the sections, including other sections, or dropping ones that you don’t use. Keep in mind that the suggested structure is conventional (and therefore easy to follow), practical, and comprehensive. (Criterion 5: Professionally presented in a manner appropriate to the discipline.) If you do use this template, you will need to install R, RStudio, and the packages listed in the code block at the head of this document.
Note: We have provided some sample code below, along with some text mostly marked as blockquotes using >. All of this should be replaced by your work.
Please don’t forget to include a title, name, student number, etc. on a covering sheet.
You may also wish to share these on GitHub or RPubs; however, consider the privacy implications of doing so first.
2800 words (excluding data excerpts and appendices, visualisations, and references).
See details below for referencing. If you use footnotes, they are included like this [^1].
To check this, you can either copy the HTML output to Word, or use the Word Count Addin, e.g. wordcountaddin:::text_stats()
wordcountaddin:::text_stats()

| Method | koRpus | stringi |
|---|---|---|
| Word count | 3213 | 3087 |
| Character count | 18195 | 18199 |
| Sentence count | 246 | Not available |
| Reading time | 16.1 minutes | 15.4 minutes |
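If the addin is not available, a rough cross-check is possible in base R. This is only a sketch under simple assumptions, not how wordcountaddin counts: the helper name `count_words()` is hypothetical, and it just splits on whitespace, so markdown syntax and code would be counted as words too.

```r
# Rough word count: split on runs of whitespace and count non-empty pieces.
# count_words() is a hypothetical helper, not part of any package.
count_words <- function(text) {
  words <- unlist(strsplit(text, "[[:space:]]+"))
  sum(nzchar(words))
}

example_text <- "Data scientists don't just use Word and PowerPoint to write."
count_words(example_text)
```

For a whole report you could pass in the lines of the file, for example `count_words(readLines("my_report.Rmd"))`, where the file name is a placeholder. Expect the result to differ from the addin, which excludes code chunks.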
If you’re using RStudio, you can still do grammar and spelling checks. The ‘Visual editor’ mode makes this more natural (Ctrl+Shift+F4 on Windows).
The gramr package lets you run an open tool within RStudio to get this feedback (you can explore the code on GitHub).
pacman::p_load_gh("ropenscilabs/gramr")

This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing the chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.
Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).
The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.
rintro <- tibble(
  long_name = c("Install RStudio interactive","rmarkdown intro","DSI vignettes","rmarkdown detailed guide"),
  blurb = c("Interactive learnr activity to setup RStudio, R, and packages","An introduction to R markdown","Check the DSI vignettes (and, via Canvas, exemplar reports) for inspiration","A longer book about how to use markdown with R"),
  image_url = c("https://rstudio.github.io/learnr/logo.png","https://raw.githubusercontent.com/rstudio/rmarkdown/main/man/figures/logo.png","https://rstudio.github.io/learnr/logo.png","https://raw.githubusercontent.com/rstudio/rmarkdown/main/man/figures/logo.png"),
  resource_url = c("https://learnr-examples.shinyapps.io/ex-setup-r/","https://rmarkdown.rstudio.com/authoring_quick_tour.html#overview","https://sjgknight.github.io/DSI/","https://bookdown.org/yihui/rmarkdown/markdown-syntax.html"),
  tags = c("Get-started; learnr","Get-started; markdown","intermediate; analysis","advanced; markdown")
)

rintro %>%
  cards(
    title = long_name,
    text = blurb,
    link = resource_url,
    image = image_url,
    tags = paste("all;", tags),
    width = 4,
    footer = tags,
    layout = "label-right"
  )
What is R Markdown? from RStudio, Inc. on Vimeo.
The template is made up of:
The easiest way to work with it is to download the github repo and open the .Rproj file in RStudio.
It is best to set up your assignment as a project, rather than just have a single R Markdown file. Setting up a project will define your working directory based on where a .Rproj file is located. Other files and folders can then be found relative to that .Rproj file. This gives projects some advantages:
You can refer to files with a relative path such as ../data/my_data.csv, rather than having to type C:\folder\other_folder\data\my_data.csv.
To start a project in RStudio, click File -> New Project and follow the prompts to set up a new project in a new folder.
I highly recommend this link on project-oriented workflow
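To make the relative-path idea concrete, here is a minimal base R sketch. Everything in it is an assumption for illustration: the project folder is a temporary stand-in, the `data/my_data.csv` layout and the step-count values are made up, and `setwd()` is used only to imitate what opening an .Rproj file does for you.

```r
# Hypothetical project folder inside a temporary directory (stand-in for your real project)
project_dir <- file.path(tempdir(), "my_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Made-up example data saved into the project's data folder
write.csv(
  data.frame(day = 1:3, steps = c(4000, 7500, 6200)),
  file.path(project_dir, "data", "my_data.csv"),
  row.names = FALSE
)

# Opening the .Rproj file would set the working directory; setwd() imitates that here
old_wd <- setwd(project_dir)

# The path is now relative to the project folder, with no drive letter or full path
my_data <- read.csv(file.path("data", "my_data.csv"))

# Restore the previous working directory
setwd(old_wd)
```

Because the path is relative, the same line would work unchanged on a teammate's computer, wherever they keep the project folder.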
If we don’t have these packages, we’ll need to download them from the internet. Here’s some code that does that.
You’ll see a “#” at the start of the first line; this tells R that it is a ‘comment’ not code. If you remove the “#” R will try to run the code.
Installing the packages only puts them on our computer. To use them in our project, we need them loaded. I’ve used a package called pacman, which checks whether you have the named packages installed, installs any that are missing, and loads them. Normally you should not do this, because it’s useful to be aware of the environment you’re executing code with.
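The difference between installing and loading can be sketched in base R. This is a rough illustration of the idea only, not pacman's actual implementation, and `load_or_install()` is a hypothetical helper name.

```r
# Hypothetical helper: install a package only if it is missing, then load it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)              # puts the package on your computer
  }
  library(pkg, character.only = TRUE)  # makes it available in this session
  invisible(pkg)
}

# "stats" ships with R, so nothing is downloaded in this example
load_or_install("stats")
```

Writing `library()` calls explicitly, rather than hiding them in a helper, keeps it obvious which packages your report depends on.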
You may also want to knit the file on your computer, which will install the useful packages below.
#install.packages("pacman")
library(pacman)
p_load(bs4cards, tidyverse, flexdashboard, shiny, psych, devtools, bibtex, curl, gganimate)
p_load_gh("benmarwick/wordcountaddin")
#go to Tools > Addins to select the wordcountaddin
pacman::p_install_gh("hadley/emo") #install, but call functions directly. Largely for illustrative purposes

For formal timelines, make sure you refer to (1) the subject outline (the most important document in any subject), (2) the subject Canvas site and REVIEW, both of which show deadlines, and (3) if unsure, ask me.
pacman::p_load(timevis)
week_1 <- as.Date("2022-02-21")
tl <- data.frame(
  id = 0:14,
  long_content = c("Pre-work: What Does Facebook Know About Me?",
    "Criterion 1: Choose group, data, and method. Establish communication approach and begin sharing data and insights.",
    "Criterion 1: Justify collection and analysis. Be able to justify your approach 'for the method to obtain data from multiple sources, for gaining insight into a chosen problem, including analysis of data quality issues in the individual and group data' (Criterion 1) - draft this section in the template",
    "Have data, share insight. Ensure you have a shared dataset in preparation for Mystery Box formative task; start to think about insights (criterion 2)",
    "AT2a due. Group status update, and your preliminary thoughts on analysis and external (ideally scholarly) resources you're drawing on",
    "AT1 due. Analysis and planning. Continue thinking about insights you might gain, visualisations you can use, issues (including ethical) with your data (criteria 2 and 3). Review sample assignments and the AT2 template.",
    "Consider issues in data. Focus on issues with your data (including ethical) (criteria 1-3) and their implications for the practice of data science (criterion 4)",
    "STUVAC HERE. Continue on AT2.",
    "Consider issues in data. Continue from week 7, with a particular focus on how comparing across the levels of data (individual, group, cohort) provides insights. Ensure you have considered the privacy and ethical issues throughout your report, and the implications of the project for the practice of data science",
    "Week 9, draft submission of AT2b. See detailed instructions.",
    "Week 10, review colleague's AT2b. Continue work on your own final submission",
    "Week 11, review colleague's AT2b. Continue work on your own final submission",
    "Week 12 AT2b feedback due. You should use that feedback to reflect on how to improve for your final submission",
    "STUVAC. Continue AT2 work.",
    "AT2C Due. Final assessment period."
  )
)
tl <- tl %>% mutate(start = week_1 + 7*id,
                    end = start + 7,
                    content = str_split_n(long_content, "\\.", 1))

# X-WR-TIMEZONE = "Australia/Sydney"
#library(calendar)
tl %>% transmute(
    DTSTART = as.POSIXct(.$start),
    DTEND = as.POSIXct(.$end),
    SUMMARY = content,
    DESCRIPTION = long_content) %>%
  mutate(UID = replicate(nrow(.), calendar::ic_guid())) %>%
  calendar::ical() %>%
  calendar::ic_write(.,"DSI_AT2.ics")

You should be able to download a calendar of events to import into your provider of choice by clicking on this ics calendar download.
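The date arithmetic behind the timeline can be checked in base R without any packages. This is a sketch, assuming the intent of the `str_split_n()` call is to keep the text before the first full stop; the `sub()` pattern is my stand-in for that step, and the two strings are shortened examples taken from the list above.

```r
week_1 <- as.Date("2022-02-21")  # start of week 1, as in the chunk above
id <- 0:14                       # one entry per week

# Adding a number to a Date adds that many days
start <- week_1 + 7 * id
end <- start + 7

# Short label: drop everything from the first full stop onwards (assumed intent)
long_content <- c("AT2a due. Group status update", "STUVAC. Continue AT2 work.")
content <- sub("\\..*$", "", long_content)

format(range(start))  # first and last start dates
content
```

Each `start` value is exactly seven days after the previous one, so changing `week_1` shifts the whole timeline for a different semester.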
Each criterion threads right through the report. This is especially true for 4 and 5. I will especially look for reports that:
Criterion 5 Level of professionalism in the presentation appropriate to the discipline: You can see specific guidance on this criterion in the subject outline. Remember, your visualisations, and the way you develop your narrative are a part of professional presentation. You should draw on external sources to support and contextualise your work throughout. Be careful to emphasise interpretation and analysis over description and narrative. So, don’t tell us about discussions you had and who said what (description), tell us about the decisions you made, why, and their implications for the practice of data science (analysis).
For AT2 you will collect, record, share, and analyse several types of data about yourself and compare and contrast what you find in your analysis with an analysis of the same data from the group.
You will negotiate and agree a process for recording, sharing and storing the data being collected as a group, in the first class session for AT2. Your attendance at this session will be crucial in getting off to a strong start with a minimum of disruption for this major task.
The following requirements apply to your data collection:
Two sources of data negotiated with your group for sharing:
One additional individual dataset, structured or unstructured. This can be of personal interest to you. It does not need to be shared across the groups, but should be analysed by you in your report.
External cohort-level data to compare your own data to (probably summary data from previously published work): The idea of this dataset is that you will have data from: (1) an individual, (2) a small group, and (3) a larger cohort. You will probably draw on published summary-level data (for example, what is the average step count in Australia, and for whom?), or publicly available step count data. In order of complexity, you may be able to obtain insights from one of these sources:
Examples of data that you and your group could collect include: daily step counts; pulse rates; time spent on activities each day (exercise, grooming, travelling, eating/cooking, shopping, sleeping, studying, etc.); sleep patterns; daily spending; number and length of conversations each day; location tracking, and so on. Some of these can be easily tracked via smartphone apps; see examples at https://quantifiedself.com/
Old examples of this assignment, and all of our feedback given in a previous semester are available via Canvas.
You might find the DSI vignettes, many created by students in the Statistics subject, helpful if you want to use R to do analysis (but remember, you do not have to!).
Assignment two has 3 parts. This structure ensures you’re on track for the assignment, and provides an opportunity for you to resubmit your AT2 taking into account the feedback provided to make changes.
AT2a is due week 5, and is a short online form (only available in the week before due date)
AT2b is due week 9, via Canvas, and consists of (a) a draft of your final submission, and (b) your feedback to your class colleagues via peer review
AT2c is your final submission, due in the UTS exam period
Here are some formatting tricks you can use.
italics
bold
bold italics
verbatim code
superscript2
subscript2
This is a block quotation, if you have a long quote from someone this is the best way to do it (but don’t forget the citation). This is a very long line that will still be quoted properly when it wraps. Oh boy let’s keep writing to make sure this is long enough to actually wrap for everyone. Oh, you can put Markdown into a blockquote.
Add headings using a # (but note, to get that to display properly I had to ‘escape’ it using a preceding backslash \#). One # gives you a line with Heading 1 style, ## gives you Heading 2 etc.
More examples can be found on the cheat sheet at this link (check website for versions in languages other than English)
If you want to insert equations (you probably don’t) you can do so using the syntax below. You can also insert bits of inline code like, so the 2+2 here is produced by a piece of code, and the 4 is produced by an equation (namely 2+2)
The deterministic part of the model is defined by this in-line equation as \(\mu_i = \beta_0 + \beta_1x\), and the stochastic part by the centered equation:
\[ \frac{1}{\sqrt{2\pi}\sigma}e^{-(x-\mu_i)^2/(2\sigma^2)} \]

More examples at this link
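If you want to check that an equation like this says what you think it says, you can evaluate it in R. The sketch below types the centred equation in directly and compares it with R's built-in `dnorm()`. The values of `beta_0`, `beta_1`, `sigma`, and the inputs are made-up assumptions, and `normal_density()` is a hypothetical helper name.

```r
# Made-up parameter values, for illustration only
beta_0 <- 1
beta_1 <- 0.5
sigma  <- 2
x_pred <- 2   # predictor value used in the in-line equation

# Deterministic part: mu_i = beta_0 + beta_1 * x
mu_i <- beta_0 + beta_1 * x_pred

# Stochastic part: the centred equation, written out term by term
normal_density <- function(x, mu, sigma) {
  1 / (sqrt(2 * pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))
}

# Compare the hand-written formula with R's built-in normal density
all.equal(normal_density(1.5, mu_i, sigma), dnorm(1.5, mean = mu_i, sd = sigma))
```

The two agree because the centred equation is the normal density with mean \(\mu_i\) and standard deviation \(\sigma\).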
You might have saved some analysis from another program as a picture file. This is how you paste it: Let’s embed a UTS logo, which I’ve saved to the data folder.
knitr::include_graphics(here::here("AT2_default_template/data/uts_logo_new.png"), dpi = NA)

Or like this:
To create tables, use either:
markdown per the examples below
bootstrap functions per the examples below
Markdown is fine for simple tables (but you can’t have merged cells, so here I’ve got two tables next to each other). You can create these easily using the Visual editor in RStudio, or tools like TablesGenerator.

| Data source: Tweets made by each group member |
|---|
| Data structure: JSON structured, but raw text, media (images), and URLs etc. require further processing for analysis. |

| Row 1 | b1 | c1 | d1 |
|---|---|---|---|
| Row 2 | b2 | c2 | d2 |
You can use bootstrap to create complex layouts, here’s a fairly simple example.
Here is the first Div.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
And this second div block will be put on the right:
plot(iris[, -5])

There are three ways for you to include references:
You’ll want to ensure that you connect what you did, and what you found, to the wider context of data science - including external sources of information (such as academic studies). You can build your reflection (criterion 4) through the paper like that. Use external sources to support and contextualise your claims, by giving examples of where things have gone wrong or worked well before, of relevant policies or systems, and of research into the potential, methods, and issues.
You’ll need to work out how to cite…
If you’re stuck we’ll just accept footnotes for this assignment. To insert them you just type ^[This is a footnote.], and you’ll get a hyperlinked number, with the list created automatically at the end of your document. Pretty useful, right?[^1]
If you create a .bib file, you can cite using (Halpern et al. 2006) - where your bib file has the ‘key’ (the bit after the @) with all the other detail. See the sample file!
The packages you use are automatically added to a .bib and included in the template by the function at the end of the template.
You can use the knitcitations package to add citations by doi or url.
This is a block quotation, if you have a long quote from someone this is the best way to do it (but don’t forget the citation). This is a very long line that will still be quoted properly when it wraps. Oh boy let’s keep writing to make sure this is long enough to actually wrap for everyone. Oh, you can put Markdown into a blockquote.
You’ll see that we can:
italicise
or bold
or even bold italics (you can also have numbered sublists…)
…See the markdown cheatsheet for more on this….to link … we use \[description here\](http://urlhere.com).
But, just because it’s in a different format, that doesn’t mean you can get away with not following normal writing conventions. Writing should be in paragraphs, with correct spelling and grammar, and figures, etc. should be fully explained to the reader.
You can show full R chunks. But you might also write some output inline, e.g. output the coefficient in-line with code: 0.418684
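As a sketch of how a number like that could be prepared for inline use, the code below fits a simple model and stores the rounded coefficient as text. The model is a hypothetical one on the built-in iris data and is not the model behind the number quoted above.

```r
# Hypothetical model on the built-in iris data (not the model behind the number above)
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

# Pull out the slope and round it for reporting
slope <- unname(coef(fit)["Petal.Length"])
slope_text <- format(round(slope, 3), nsmall = 3)

slope_text
```

In the text of your report you would then write the inline expression `r slope_text` (in backticks), so the number updates automatically if the data change.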
In the other .Rmd file we’ll start the template itself.
[^1]: This is a footnote; see how it automatically appears at the end of the doc.