This is a tutorial on how to use R Markdown for reproducible research. Here we can type long passages or descriptions of our data without the need to “hashtag” out our comments with the # symbol. In our first example, we will be using the ToothGrowth dataset. In this experiment, guinea pigs were given different amounts of vitamin C to see the effect on the animals' tooth growth.
To run R code in a markdown file, we need to mark the sections that contain R code. We call these sections “code chunks”.
Below is a code chunk:
Toothdata <- ToothGrowth
head(Toothdata)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
As you can see after pressing the “play” button, the results are printed inline in the R Markdown file.
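In the .Rmd source, a chunk starts with a line of three backticks followed by {r} and ends with a line of three backticks. Here is a minimal sketch of what knitting does with such a chunk. It assumes the knitr package is installed, and the little document in rmd_source is made up for illustration (it is not this tutorial's source file):

```r
library(knitr)

# A tiny R Markdown source held as a character vector:
# one line of prose followed by one code chunk.
rmd_source <- c(
  "The first rows of ToothGrowth:",
  "",
  "```{r}",
  "head(ToothGrowth, 3)",
  "```"
)

# knit() runs the chunk and returns the resulting markdown as one string.
knitted <- knit(text = rmd_source, quiet = TRUE)
cat(knitted)
```

The printed result contains the prose, the code, and the output lines prefixed with ##, which is the same layout you see in this document.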
fit <- lm(len ~ dose, data = ToothGrowth)  # linear model of tooth length on dose
b <- fit$coefficients                      # intercept and slope
plot(len ~ dose, data = ToothGrowth)
abline(fit)                                # add the fitted regression line
The slope of the regression line is 9.7635714
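A sentence like the one above can be filled in by R instead of being typed by hand. This is a small sketch of one way to pull the slope out of the model; the variable name slope is just an illustration:

```r
# Fit the same linear model that is used for the plot.
fit <- lm(len ~ dose, data = ToothGrowth)

# coef() returns a named vector with "(Intercept)" and "dose";
# the "dose" entry is the slope of the regression line.
slope <- unname(coef(fit)["dose"])
slope
```

In the text of the .Rmd file, inline code is written as a backtick, the letter r, a space, the expression (here, slope), and a closing backtick. When the document is knitted, the expression is replaced by its value.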
We can also put sections and subsections in our R Markdown file, similar to numbered or bulleted headings in a Word document. This is done with the “#” symbol that we previously used to mark comment text in an R script.
Make sure there is a space between the hashtag and the text, or it will not work.
We can also add bullet points to our R Markdown file.
It's important to note here that in R Markdown, indentation matters!
We can put nicely formatted quotes into the markdown document. We do this by using the “>” symbol.
“genes are like the story and DNA is the language the story is written in.” — Sam Kean
Hyperlinks can also be incorporated into these files. This is especially useful in HTML files, since they are viewed in a web browser and will redirect the reader to the material that you want to show them. Here we will use the link to R Markdown's homepage as an example. RMarkdown
We can also put nicely formatted formulas into markdown by wrapping them in $ signs: one $ on each side for a formula inside a sentence, or two ($$) on each side for a formula displayed on its own line. The Hardy-Weinberg formula:
\[p^2 + 2pq + q^2 = 1\]
You can get really complex as well:
\[\Theta = \begin{pmatrix}\alpha & \beta\\ \gamma & \delta \end{pmatrix}\]
There are options that control how knitr handles each code chunk. Some of the options are:
eval (TRUE or FALSE): whether or not to evaluate the code chunk
print("Hello World")
echo (TRUE or FALSE): whether or not to show the code for the chunk; the results will still print
## [1] "Hello, World"
cache: if enabled, the same code chunk will not be evaluated again the next time the document is knitted, as long as the chunk has not changed. Great for code that has long run times.
fig.width or fig.height: the graphical device size of the R plots, in inches. The figures are first written to the knitr document, then to files that are saved separately.
out.width or out.height: the output size of the R plots in the final document
fig.cap: the text of the figure caption
fit <- lm(len ~ dose, data = ToothGrowth)  # linear model of tooth length on dose
b <- fit$coefficients                      # intercept and slope
plot(len ~ dose, data = ToothGrowth)
abline(fit)                                # add the fitted regression line
Figure 1: The tooth growth of guinea pigs when given variable amounts of vitamin C
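To see the eval and echo options side by side, here is a small sketch that knits two made-up chunks from a character vector. It assumes the knitr package is installed; the chunk contents are only for illustration:

```r
library(knitr)

# Two illustrative chunks: the first is shown but not run (eval=FALSE),
# the second is run but its code is hidden (echo=FALSE).
rmd_source <- c(
  "```{r, eval=FALSE}",
  "print('shown, not run')",
  "```",
  "",
  "```{r, echo=FALSE}",
  "print('run, not shown')",
  "```"
)

# knit() returns the knitted markdown as one string.
knitted <- knit(text = rmd_source, quiet = TRUE)
cat(knitted)
```

In the result, the first chunk appears as code with no output, and the second appears as output with no code. Options are written after the r inside the curly braces, separated by commas.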
We can also add a table of contents to our HTML document. We do this by altering the YAML header (the block of settings at the very beginning of the file). We can add this:
title: "HTML_Tutorial"
author: "Cassie_Wingate"
date: "2024-06-28"
output:
  html_document:
    toc: true
    toc_float: true
This will give us a nice floating table of contents on the left-hand side of the document.
You can also add tabs to your report. To do this, mark the header whose subsections should become tabs by placing “{.tabset}” at the end of that header line. Every header one level below it will then be a new tab.
You can also add themes to an HTML document that change the highlighting color and hyperlink color. This can be nice aesthetically. To do this, change the theme in the YAML to one of the following:
cerulean, journal, flatly, readable, spacelab, united, cosmo, lumen, paper, sandstone, simplex, yeti, null
You can also change the syntax highlighting color of the code by setting highlight to one of the following:
default, tango, pygments, kate, monochrome, espresso, zenburn, haddock, textmate
You can also use the code_folding option to let the reader toggle between displaying the code and hiding the code. This is done with:
code_folding: hide
There are a ton of options and ways to customize your R document using HTML. This is also a great way to display a portfolio of your work if you are trying to market yourself to interested parties.
If you want to filter based on different operators, you can use the following:
equals: ==
not equal to: !=
greater than: >
less than: <
greater than or equal to: >=
less than or equal to: <=
You can also use logical operators to be more selective:
and: &
or: |
not: !
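Before using these operators on a whole dataset, it can help to try them on a small vector. This is a minimal sketch in base R; the vector month_values is made up for illustration:

```r
# A small vector of month numbers to test against.
month_values <- c(1, 3, 4, 10, 12)

# Comparison operators return one TRUE/FALSE per element.
month_values == 3     # equal to 3
month_values != 1     # not equal to 1
month_values >= 10    # greater than or equal to 10

# Logical operators combine comparisons element by element.
spring  <- month_values == 3 | month_values == 4   # March OR April
late    <- month_values > 3 & month_values < 12    # after March AND before December
not_jan <- !(month_values == 1)                    # NOT January

# Use the TRUE/FALSE vectors to keep only the matching values.
month_values[spring]
month_values[late]
```

Note that a test for equality needs two equals signs (==). A single = is an assignment, not a comparison.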
For this problem set we will load the “flights” dataset from the “nycflights13” package.
library(dplyr) # Provides filter(), arrange(), select(), and rename()
??flights # View documentation
my_data <- nycflights13::flights
head(my_data)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Datasets can be filtered based on certain specific criteria to display desired values.
Here is the data regarding flights on October 14th:
oct_14_flights <- filter(my_data, month == 10, day == 14)
head(oct_14_flights)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 14 451 500 -9 624 648
## 2 2013 10 14 511 517 -6 733 757
## 3 2013 10 14 536 545 -9 814 855
## 4 2013 10 14 540 545 -5 932 933
## 5 2013 10 14 548 545 3 824 827
## 6 2013 10 14 549 600 -11 719 730
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
To save this filtered data and print it at the same time, wrap the assignment in parentheses:
(oct_14_flights_2 <- filter(my_data, month == 10, day == 14))
## # A tibble: 987 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 14 451 500 -9 624 648
## 2 2013 10 14 511 517 -6 733 757
## 3 2013 10 14 536 545 -9 814 855
## 4 2013 10 14 540 545 -5 932 933
## 5 2013 10 14 548 545 3 824 827
## 6 2013 10 14 549 600 -11 719 730
## 7 2013 10 14 552 600 -8 650 659
## 8 2013 10 14 553 600 -7 646 700
## 9 2013 10 14 554 600 -6 836 829
## 10 2013 10 14 555 600 -5 832 855
## # ℹ 977 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
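The parentheses trick is not specific to filter. Here is a tiny base R sketch of the same idea, with made-up variable names:

```r
# Plain assignment stores the value without printing it.
quiet_value <- 1:3

# Wrapping the assignment in parentheses stores the value AND prints it.
(loud_value <- 1:3)
```

Both variables end up holding the same values; the only difference is whether the result is shown when the line runs.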
Here is the data regarding flights through September:
flights_through_september <- filter(my_data, month < 10)
head(flights_through_september)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Here is the data regarding flights in March or April:
march_april_flights <- filter(my_data, month == 3 | month == 4)
head(march_april_flights)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 3 1 4 2159 125 318 56
## 2 2013 3 1 50 2358 52 526 438
## 3 2013 3 1 117 2245 152 223 2354
## 4 2013 3 1 454 500 -6 633 648
## 5 2013 3 1 505 515 -10 746 810
## 6 2013 3 1 521 530 -9 813 827
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Asking for flights in both March and April with & returns no rows, because a single flight cannot be in two months at once:
march_and_april_flights <- filter(my_data, month == 3 & month == 4)
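The difference between & and | is easiest to see on a few rows. This is a small sketch that assumes the dplyr package is installed; toy_flights is a made-up table, not part of nycflights13:

```r
library(dplyr)

# Hypothetical toy data: one row per flight, with month and day only.
toy_flights <- tibble(
  month = c(3, 3, 4, 5),
  day   = c(4, 9, 4, 1)
)

# A row can never have month 3 and month 4 at once, so & keeps nothing.
both <- filter(toy_flights, month == 3 & month == 4)

# | keeps rows that satisfy either condition.
either <- filter(toy_flights, month == 3 | month == 4)

# %in% is a shorter way to write the same "either" test.
either_in <- filter(toy_flights, month %in% c(3, 4))

nrow(both)
nrow(either)
nrow(either_in)
```

Use & when the conditions are about different columns (for example a month and a day), and | or %in% when you want several values of the same column.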
Here is the data regarding flights on March 4th:
march_4_flights <- filter(my_data, month == 3 & day == 4)
Here is the data regarding flights NOT in January:
non_jan_flights <- filter(my_data, month != 1)
head(non_jan_flights)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 1 447 500 -13 614 648
## 2 2013 10 1 522 517 5 735 757
## 3 2013 10 1 536 545 -9 809 855
## 4 2013 10 1 539 545 -6 801 827
## 5 2013 10 1 539 545 -6 917 933
## 6 2013 10 1 544 550 -6 912 932
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Data can be arranged using “arrange”:
To arrange by year, then day, then month:
arrange(my_data, year, day, month)
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
To arrange in descending order:
descending <- arrange(my_data, desc(year), desc(day), desc(month))
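The order of the columns given to arrange matters: rows are sorted by the first column, and later columns only break ties. Here is a small sketch, assuming dplyr is installed and using a made-up table called toy_flights:

```r
library(dplyr)

# Hypothetical toy data with two months and a few days.
toy_flights <- tibble(
  month = c(1, 2, 1, 2),
  day   = c(15, 3, 2, 28)
)

# Sort by month first; day only decides the order within a month.
ascending_rows <- arrange(toy_flights, month, day)

# desc() reverses the order of the column it wraps.
descending_rows <- arrange(toy_flights, desc(month), desc(day))

ascending_rows
descending_rows
```

So sorting by year, day, month puts every 1st of the month together before any 2nd of the month, which is usually different from calendar order.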
To select specific columns from a dataset:
# Select specific columns
calendar <- select(my_data, year, month, day)
head(calendar)
## # A tibble: 6 × 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
Columns can also be selected using a negative selection:
everything_else <- select(my_data, -(year:day))
head(everything_else)
## # A tibble: 6 × 16
## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 517 515 2 830 819 11 UA
## 2 533 529 4 850 830 20 UA
## 3 542 540 2 923 850 33 AA
## 4 544 545 -1 1004 1022 -18 B6
## 5 554 600 -6 812 837 -25 DL
## 6 554 558 -4 740 728 12 UA
## # ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
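Here are a few ways to pick columns, shown on a small table so the result is easy to check. This sketch assumes dplyr is installed; toy_flights and its values are made up for illustration:

```r
library(dplyr)

# Hypothetical toy data with a handful of flight-like columns.
toy_flights <- tibble(
  year = 2013, month = 1, day = 1:2,
  dep_time = c(517, 533), dep_delay = c(2, 4), carrier = c("UA", "AA")
)

# A range of neighbouring columns, from year through day.
date_cols <- select(toy_flights, year:day)

# Negative selection: everything except year through day.
no_date_cols <- select(toy_flights, -(year:day))

# A selection helper: every column whose name starts with "dep".
dep_cols <- select(toy_flights, starts_with("dep"))

names(date_cols)
names(no_date_cols)
names(dep_cols)
```

The parentheses in -(year:day) are needed so that the minus sign applies to the whole range rather than only to year.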
Columns can be renamed.
Here is the column “dep_time” being renamed to “departure_time”:
my_data <- rename(my_data, departure_time = dep_time)
head(my_data)
## # A tibble: 6 × 19
## year month day departure_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
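In rename, the new name goes on the left of the equals sign and the old name on the right. The original table only changes if you assign the result back to it. This is a small sketch, assuming dplyr is installed and using a made-up table called toy_flights:

```r
library(dplyr)

# Hypothetical toy data with a column we want to rename.
toy_flights <- tibble(dep_time = c(517, 533), carrier = c("UA", "AA"))

# new_name = old_name
renamed <- rename(toy_flights, departure_time = dep_time)

names(toy_flights)  # the original still uses dep_time
names(renamed)      # the copy uses departure_time
```

After a rename that is assigned back to the same variable, any later code has to use the new column name.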