This is a tutorial on how to use R Markdown for reproducible research. Here we can type long passages or descriptions of our data without needing to comment out our text with the # symbol. In our first example, we will use the ToothGrowth dataset. In this experiment, guinea pigs were given different amounts of vitamin C to see the effect on the animals’ tooth growth.

To run R code in a markdown file, we need to mark the sections that contain R code. We call these sections “code chunks”.

Below is a code chunk:

tooth_data <- ToothGrowth
head(tooth_data)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

As you can see after pressing the “play” button, the results are printed inline in the R Markdown file.

fit <- lm(len ~ dose, data = ToothGrowth)

b <- coef(fit)

plot(len ~ dose, data = ToothGrowth)

abline(fit)

The slope of the regression line is 9.7635714
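The slope quoted above is the `dose` coefficient of the fitted model. As a minimal sketch, it can be pulled out by name rather than by position; the variable name `slope` is introduced here for illustration only.

```r
# Fit the same linear model used for the plot above
fit <- lm(len ~ dose, data = ToothGrowth)

# coef() returns a named vector with "(Intercept)" and "dose"
b <- coef(fit)

# Extract the slope by name (`slope` is an illustrative name)
slope <- b[["dose"]]
slope
```

Extracting by name keeps the code readable if more predictors are added to the model later.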

section headers

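We can also add sections and subsections to our R Markdown document, similar to numbered or bulleted headings in a Word document. This is done with the “#” symbol, the same one we previously used to mark comments in an R script. One “#” gives a first level header, “##” a second level header, and “###” a third level header.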

first level header

second level header

third level header

Make sure there is a space after the hashtag and before the text, or it will not work.

We can also add bullet points to our R Markdown file.

- one item
- one item
- one item
- one more item
- one more item
- one more item
  - one last item

It’s important to note here that in R Markdown indentation matters!

1. first item
2. second item
3. third item
    - subitem 1
    - subitem 2
    - subitem 3

block quotes

We can put really nice quotes into the markdown document. We do this by starting the line with the “>” symbol.

“genes are like the story and DNA is the language the story is written in.” — Sam Kean

hyperlinks

Hyperlinks can also be incorporated into these files. This is especially useful in HTML files, since they are viewed in a web browser and will redirect the reader to the materials that you want to show them. Here we will use the link to R Markdown’s homepage for this example. RMarkdown

formulas

We can also put nicely formatted formulas into markdown by wrapping them in double dollar signs ($$). Here is the Hardy-Weinberg formula:

\[p^2 + 2pq + q^2 = 1\]

You can get really complex as well:

\[\Theta = \begin{pmatrix}\alpha & \beta\\ \gamma & \delta \end{pmatrix}\]

code chunks

code chunk options

There are options that control how knitr interprets each code chunk in your R Markdown file. The following options are available.

eval (T or F): whether or not to evaluate the code chunk

print("Hello World")

echo (T or F): whether or not to show the code for the chunk; the results will still print.

## [1] "Hello, World"

cache: if enabled, the same code chunk will not be evaluated the next time knitr is run, as long as the chunk has not changed. Great for code that has long run times.

fig.width or fig.height: the size of the graphical device for the R plots, in inches. The figures are first written to the knitr document and then to files that are saved separately.

out.width or out.height: the output size of the R plots in the output document

fig.cap: the text of the figure caption

fit <- lm(len ~ dose, data = ToothGrowth)

b <- coef(fit)

plot(len ~ dose, data = ToothGrowth)

abline(fit)

Figure 1: the tooth growth of guinea pigs when given variable amounts of Vitamin C
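The options described above are usually written in the header of each individual chunk. As a sketch, assuming the knitr package is installed, defaults for every chunk can also be set once with `knitr::opts_chunk$set()`, typically in a setup chunk near the top of the file. The values below are arbitrary examples, not recommendations.

```r
# Set default chunk options for the whole document
# (the specific values are examples only)
knitr::opts_chunk$set(
  echo = TRUE,      # show the code
  eval = TRUE,      # run the code
  cache = FALSE,    # do not cache results
  fig.width = 6,    # plot width in inches
  fig.height = 4    # plot height in inches
)

# Look up a default that is currently in effect
knitr::opts_chunk$get("fig.width")
```

Options given in an individual chunk header still override these defaults for that chunk.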

We can also add a table of contents to our HTML document. We do this by altering the YAML header (the block of settings at the very top of the file). We can add this:

title: "HTML_Tutorial"
author: "Cassie_Wingate"
date: "2024-06-28"
output:
  html_document:
    toc: true
    toc_float: true

This will give us a nice floating table of contents on the left-hand side of the document.

tabs

You can also add tabs to your report. To do this, place “{.tabset}” at the end of the header line for the section you want to split into tabs. Every subheader beneath that section will then become a new tab.

themes

You can also add themes to an HTML document that change the highlighting color and hyperlink color. This can be nice aesthetically. To do this, change your theme in the YAML to one of the following:

cerulean, journal, flatly, readable, spacelab, united, cosmo, lumen, paper, sandstone, simplex, yeti, null

You can also change the syntax highlighting style by specifying highlight:

default, tango, pygments, kate, monochrome, espresso, zenburn, haddock, textmate

code folding

You can also use the code_folding option to let the reader toggle between showing and hiding the code. This is done with:

code_folding: hide

summary

There are a ton of options and ways to customize your R code output using HTML. This is also a great way to display a portfolio of your work if you are trying to market yourself to interested parties.

Data Wrangling with R

If you want to filter based on different operators, you can use the following:

- equals ==
- not equal to !=
- greater than >
- less than <
- greater than or equal to >=
- less than or equal to <=

You can also use logical operators to be more selective:

- and &
- or |
- not !
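A short sketch of how these operators behave on a plain vector may help before applying them to a full dataset. The vector `month_values` is made up for illustration; each comparison returns one TRUE or FALSE per element.

```r
# A small made-up vector of month numbers (illustration only)
month_values <- c(1, 3, 4, 10, 12)

month_values == 3    # equal to
month_values != 1    # not equal to
month_values >= 10   # greater than or equal to

# Combine conditions with | (or), & (and), and ! (not)
is_spring   <- month_values == 3 | month_values == 4
is_mid_year <- month_values > 1 & month_values < 12
not_january <- !(month_values == 1)

# A logical vector can be used to keep only the matching elements
spring <- month_values[is_spring]
spring
```

Filtering a data frame works the same way: the condition is evaluated for every row, and only rows where it is TRUE are kept.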

Loading and Exploring Data

For this problem set we will load the flights dataset from the “nycflights13” package.

library(dplyr)  # provides filter(), arrange(), select(), and rename()
??flights  # View documentation
my_data <- nycflights13::flights
head(my_data)

## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Filtering Data

Datasets can be filtered based on certain specific criteria to display desired values.

Here is the data regarding flights on October 14th:

oct_14_flights <- filter(my_data, month == 10, day == 14)
head(oct_14_flights)

## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013    10    14      451            500        -9      624            648
## 2  2013    10    14      511            517        -6      733            757
## 3  2013    10    14      536            545        -9      814            855
## 4  2013    10    14      540            545        -5      932            933
## 5  2013    10    14      548            545         3      824            827
## 6  2013    10    14      549            600       -11      719            730
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

To save this filtered data and print it at the same time, wrap the assignment in parentheses:

(oct_14_flights_2 <- filter(my_data, month == 10, day == 14))

## # A tibble: 987 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    10    14      451            500        -9      624            648
##  2  2013    10    14      511            517        -6      733            757
##  3  2013    10    14      536            545        -9      814            855
##  4  2013    10    14      540            545        -5      932            933
##  5  2013    10    14      548            545         3      824            827
##  6  2013    10    14      549            600       -11      719            730
##  7  2013    10    14      552            600        -8      650            659
##  8  2013    10    14      553            600        -7      646            700
##  9  2013    10    14      554            600        -6      836            829
## 10  2013    10    14      555            600        -5      832            855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
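The parentheses around the assignment are plain R behavior and not specific to filtering. A minimal sketch with made-up values (`x` and `y` are illustrative names):

```r
# Plain assignment: the value is stored, nothing is printed
x <- c(4.2, 11.5, 7.3)

# Assignment wrapped in parentheses: the value is stored AND printed
(y <- x * 2)
```

This is handy in a tutorial, where you usually want to both keep a result and show it to the reader.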

Here is the data regarding flights through September:

flights_through_september <- filter(my_data, month < 10)
head(flights_through_september)

## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Logical operators can be used for selective filtering:

Here is the data regarding flights in March or April:

march_april_flights <- filter(my_data, month == 3 | month == 4)
head(march_april_flights)

## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     3     1        4           2159       125      318             56
## 2  2013     3     1       50           2358        52      526            438
## 3  2013     3     1      117           2245       152      223           2354
## 4  2013     3     1      454            500        -6      633            648
## 5  2013     3     1      505            515       -10      746            810
## 6  2013     3     1      521            530        -9      813            827
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

No single flight can be in both March and April, so combining the two conditions with & returns an empty table:

march_and_april_flights <- filter(my_data, month == 3 & month == 4)
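A minimal sketch of the difference between & and |, assuming dplyr is installed. The small data frame `toy` is made up to stand in for the flights data, and `%in%` is shown as a common shorthand for several "or" conditions on the same column.

```r
library(dplyr)

# A tiny made-up table standing in for the flights data
toy <- data.frame(month = c(2, 3, 3, 4, 5), day = c(1, 4, 9, 4, 4))

# A row cannot have month == 3 and month == 4 at once, so nothing matches
both <- filter(toy, month == 3 & month == 4)
nrow(both)

# "March or April" needs | instead
either <- filter(toy, month == 3 | month == 4)
nrow(either)

# %in% is a shorthand for chaining several == conditions with |
same <- filter(toy, month %in% c(3, 4))
nrow(same)
```

Use & when the conditions refer to different columns, for example a month and a day, so that a single row can satisfy both.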

Here is the data regarding flights on March 4th:

march_4_flights <- filter(my_data, month == 3 & day == 4)

Here is the data regarding flights NOT in January:

non_jan_flights <- filter(my_data, month != 1)
head(non_jan_flights)

## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013    10     1      447            500       -13      614            648
## 2  2013    10     1      522            517         5      735            757
## 3  2013    10     1      536            545        -9      809            855
## 4  2013    10     1      539            545        -6      801            827
## 5  2013    10     1      539            545        -6      917            933
## 6  2013    10     1      544            550        -6      912            932
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Arranging Data

Data can be arranged using “arrange”:

To arrange by year, day, and month:

arrange(my_data, year, day, month)

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

To arrange in descending order:

descending <- arrange(my_data, desc(year), desc(day), desc(month))
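The order in which columns are passed to arrange matters: rows are sorted by the first column, and later columns only break ties. A small sketch, assuming dplyr is installed and using a made-up data frame `toy`:

```r
library(dplyr)

# A tiny made-up table (illustration only)
toy <- data.frame(month = c(2, 1, 2, 1), day = c(1, 2, 2, 1))

# Sort by month first, then by day within each month
by_month_day <- arrange(toy, month, day)

# Sort by day first, then by month within each day
by_day_month <- arrange(toy, day, month)

# desc() reverses the order for that column
newest_first <- arrange(toy, desc(month), desc(day))

by_month_day
by_day_month
newest_first
```

Comparing the three results side by side shows how the same four rows end up in different orders.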

Selecting Columns

selecting specific columns

To select specific columns from a dataset:

# Select specific columns
calendar <- select(my_data, year, month, day)
head(calendar)

## # A tibble: 6 × 3
##    year month   day
##   <int> <int> <int>
## 1  2013     1     1
## 2  2013     1     1
## 3  2013     1     1
## 4  2013     1     1
## 5  2013     1     1
## 6  2013     1     1

Columns can also be selected using a negative selection, which keeps everything except the named columns:

everything_else <- select(my_data, -(year:day))
head(everything_else)

## # A tibble: 6 × 16
##   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##      <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
## 1      517            515         2      830            819        11 UA     
## 2      533            529         4      850            830        20 UA     
## 3      542            540         2      923            850        33 AA     
## 4      544            545        -1     1004           1022       -18 B6     
## 5      554            600        -6      812            837       -25 DL     
## 6      554            558        -4      740            728        12 UA     
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
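The selection styles shown above can be compared on a small example. This is a sketch assuming dplyr is installed; the data frame `toy` and its values are made up for illustration.

```r
library(dplyr)

# A tiny made-up table with a few of the flights column names
toy <- data.frame(
  year = 2013,
  month = 1:3,
  day = 4:6,
  carrier = c("UA", "AA", "B6"),
  dest = c("IAH", "MIA", "BQN")
)

# Keep columns by name
named <- select(toy, year, month, day)

# Keep a range of adjacent columns with the : operator
ranged <- select(toy, year:day)

# Drop a range of columns with a minus sign
dropped <- select(toy, -(year:day))

names(named)
names(ranged)
names(dropped)
```

The range form depends on the column order in the data, so it is safest when the columns are known to sit next to each other.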

Renaming Columns

Columns can be renamed.

Here is the column “dep_time” being renamed to “departure_time”:

my_data <- rename(my_data, departure_time = dep_time)
head(my_data)

## # A tibble: 6 × 19
##    year month   day departure_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>          <int>          <int>     <dbl>    <int>
## 1  2013     1     1            517            515         2      830
## 2  2013     1     1            533            529         4      850
## 3  2013     1     1            542            540         2      923
## 4  2013     1     1            544            545        -1     1004
## 5  2013     1     1            554            600        -6      812
## 6  2013     1     1            554            558        -4      740
## # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
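Note that rename takes the new name on the left and the old name on the right, and that it returns a new table rather than changing the original in place. A small sketch, assuming dplyr is installed and using a made-up data frame `toy`:

```r
library(dplyr)

# A tiny made-up table (illustration only)
toy <- data.frame(dep_time = c(517, 533), arr_time = c(830, 850))

# Syntax: rename(data, new_name = old_name)
renamed <- rename(toy, departure_time = dep_time)

names(renamed)  # the renamed copy
names(toy)      # the original is unchanged unless it is reassigned
```

Because the example above assigns the result back to `my_data`, any later code that still refers to `dep_time` on that table would need to use `departure_time` instead.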

Final Project

Cassie Wingate

2024-06-20