R is excellent for reading data and implementing statistical analyses.
R
This programming language is based on an older commercial language called S. It is “Open Source”, which means that all the code needed to build and modify R is available free of charge. R is distributed under the GNU General Public License, which does not restrict commercial use.
R also stimulates the development of open-source tools, and there is a huge community of R users and developers who continuously contribute new features to the language.
Check Version: When R starts up, it prints the version number. Please refer to this number when describing your work (e.g. R 4.5.1). A generic citation of R can be obtained with the citation() function. Also, be aware of “package versions”.
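As a minimal sketch, the version information mentioned above can also be retrieved from within a running session. The choice of the stats package below is only an illustration; any installed package name would work.

```r
# R version string, as printed at start-up
r_version <- R.version.string
print(r_version)

# Generic citation for R itself
r_citation <- citation()
print(r_citation)

# Version of an installed package ("stats" is used only as an illustration)
stats_version <- packageVersion("stats")
print(stats_version)
```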
RStudio
It is an Integrated Development Environment (IDE) for R (and other languages). This means that it is an interface that makes working with R easier.
A frequent error is to report that “RStudio was used to analyze data”. This is not the case: R is used for data analysis, and RStudio is just the interface.
Quarto is an open-source, next-generation scientific and technical publishing system designed for creating dynamic, reproducible documents, websites, books, and presentations.
It allows users to weave code elements (in R in this case) with formatted narrative elements to create reports in HTML (browser will open), PDF, or Word.
We prefer HTML output. We will provide Quarto Markdown (.qmd) files and expect you to modify them or create new ones, and to return both the .qmd and .html files.
The YAML header at the top of every Quarto (.qmd) file tells the program how to format the output. One example would be how to format and place the table of contents for the page.
HTML is the language used to build web documents (along with some extensions). By default, Quarto compiles everything into HTML.
Knitr options
The act of building a report from a Quarto Markdown file using R and RStudio is called “knitting”.
A first “code chunk” in R should establish clear instructions for knitting:
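The original chunk is not reproduced in this copy of the notes. A minimal sketch of such a setup chunk, assuming the knitr engine is used and with option values chosen only for illustration, could look like this:

```r
# Global chunk options for knitting (values are illustrative assumptions)
knitr::opts_chunk$set(
  echo    = TRUE,   # show the R code in the rendered report
  warning = FALSE   # hide warnings in the rendered report
)
```

Options set this way apply to every later chunk unless a chunk overrides them.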
Setup Code
It is recommended to set up all the utilities needed by an R program at the top of your .qmd script. Some of these are:
Install all needed packages
Call/invoke packages
Specify printing and formatting options
Prepare paths to working files
In R, we first download each package with the function install.packages(), which only needs to be done once. We then load the installed packages in every session with the function library().
#====================================================================#
# Setup Options
#====================================================================#
# remove all objects if restarting script
rm(list = ls())
# set tibble width for printing
options(tibble.width = Inf)
# remove scientific notation
options(scipen = 999)
#==============================================================================#
# Install Packages / Load Packages
#==============================================================================#
#install.packages("lubridate")
#install.packages("tidyverse")
library(lubridate)
library(tidyverse)
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#==============================================================================#
# Set paths
#==============================================================================#
# set all paths
path_main <- "C:/Users/jsteibel/OneDrive/Documents/job/455/"
path_data <- str_c(path_main, "Data/", sep = "")
# str_c() is from the stringr package, which is a part of 'tidyverse'
# this concatenates two pieces of the path. We will use it a lot.
# NOTE: We cannot set the working directory in quarto with this method
# as we do in an R script. Use the root.dir option in YAML.
#==============================================================================#
# Set Inputs
#==============================================================================#
# data file name
data_file <- "Production.csv"
data_file_raw <- "Production_raw.csv"
TIP: Load packages in order of least important to most important (i.e. load tidyverse last) so that the most commonly used functions are most accessible.
Data
Dr. Gustavo Silva provided all data.
Tips for reading in data:
Avoid opening a CSV file with Excel!
Open CSV and other ‘flat’ files with barebones text editors
Write R code to document all changes/alterations
Here are some common functions to read data into R:
read.table()
fread() (data.table package)
read_delim() (from the readr package)
read_excel() (readxl package)
In this class, we will predominantly use the read_delim() function from the readr package. This function provides modern features and very good performance.
Read in CSV file from folder
full_file_path <- str_c(path_data, data_file)
# read sow production data
# this code lets R guess the types of columns and read their names.
# it's simple, but there are risks in letting R guess
production <- read_delim(full_file_path, delim = ",", col_names = TRUE)
Rows: 105 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (4): ID, PARITY, TOTALBORN, LIVEBORN
date (2): SERVDATE, FARROWINGDATE
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# this code explicitly assigns names and types of variables
production_2 <- read_delim(
  file = full_file_path,    # path to file
  delim = ",",              # 'comma' separated (fields/columns)
  skip = 1,                 # Skip the first line (we will give col names)
  col_names = c("ID", "PARITY", "SERVDATE", "FARROWDATE", "TOTALBORN", "BORNALIVE"),
  col_types = "ficcii",     # Set Types
  na = c("", "NULL", "NA")  # vector of possible missing values
)
The best way to view and search function documentation is the “Help” tab in RStudio.
Alternatively, from the Console, you can type ? followed by the function name, such as ?str_c. This will pull up the help file for the str_c() function.
Notice that, at the top of the help page, the curly braces tell you which package the function is from. In this case, {stringr}.
Types
Here is a list of the characters you can give to the col_types argument of read_delim() and what they stand for.
c = character
f = factor
D = date
n = number (parsed flexibly and stored as a double; d = double)
i = integer
l = logical
? = guess
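To see how the compact type string maps onto columns, here is a small sketch that parses a few inline rows instead of a file. The rows are made up for illustration, and wrapping the text in I() so that readr treats it as literal data assumes readr 2.0 or later.

```r
library(readr)

# Three made-up rows; I() marks the string as literal data, not a file path
csv_text <- I("ID,PARITY,SERVDATE,TOTALBORN\nA1,7,2024-07-16,13\nA2,6,2024-07-12,15\nA3,4,NULL,11\n")

demo <- read_delim(
  csv_text,
  delim     = ",",
  col_types = "fiDi",              # factor, integer, date, integer
  na        = c("", "NULL", "NA")  # strings treated as missing
)

# One class per column
demo_classes <- sapply(demo, class)
demo_classes
```

Each character in the string applies to one column, in order, so the string must have as many characters as there are columns.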
To determine what the ‘class’ of the object is, we use the class() function.
Character
Character types are just held as strings in R. Common examples are IDs or farm names: anything that may need string manipulation later on.
You can convert something into a character with as.character().
Factors
Factors look like characters on the outside; internally, however, they are stored as integer codes. Factors are used to store categorical variables.
For example, sex may be stored as:
1 = Female
2 = Male
This is because factor levels are ordered alphabetically by default, until they are reordered with another function.
The different categories are called levels. In this example, there are 2 levels (Female and Male).
You can convert something into a factor with as.factor().
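A short sketch of the points above, using a made-up vector: the levels come out in alphabetical order, the stored values are integer codes, and the order can be changed explicitly with factor().

```r
sex <- c("Male", "Female", "Female", "Male")
sex_factor <- as.factor(sex)

levels(sex_factor)      # "Female" "Male": alphabetical by default
as.integer(sex_factor)  # 2 1 1 2: the integer codes behind the labels

# Reorder the levels explicitly so that "Male" comes first
sex_reordered <- factor(sex, levels = c("Male", "Female"))
as.integer(sex_reordered)  # 1 2 2 1
```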
Dates
Dates have the structure YYYY-MM-DD in R. This structure is unambiguous, unlike the date formats Excel may display. Everyone should use this format in their research to avoid mix-ups with others.
In R, dates can be read as characters and then converted into dates with one of many functions. This is my preferred method.
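As a sketch of this read-as-character-then-convert approach, here is a base R version with made-up values. The format string is the part that has to match how the text is laid out.

```r
# Dates stored as text in day/month/year order
serv_chr <- c("16/07/2024", "12/07/2024")
class(serv_chr)    # "character"

# The format string describes the layout of the text
serv_date <- as.Date(serv_chr, format = "%d/%m/%Y")
class(serv_date)   # "Date"
format(serv_date)  # "2024-07-16" "2024-07-12"
```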
Numeric
Numeric is also known as ‘double’, for double-precision floating point (a number that can have a decimal part). We store continuous variables with this type in R.
Use as.numeric() to convert something into a number.
Integers
Integers are whole numbers, positive or negative, such as -1, 0, 1, 2, 3, etc. We use this type to store discrete variables (although discrete variables are often stored as numeric).
Convert a column into an integer with as.integer().
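A brief sketch of the difference between the two number types, with made-up values. Note that as.integer() drops the decimal part rather than rounding.

```r
x <- 3    # a bare number is stored as a double
y <- 3L   # the L suffix creates an integer
class(x)  # "numeric"
class(y)  # "integer"

# as.integer() truncates toward zero rather than rounding
z <- as.integer(c(2.9, -2.9))
z         # 2 -2
```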
Logical
TRUE and FALSE are the values of the logical type, also known as boolean. They are represented internally as 0’s and 1’s.
0 = FALSE
1 = TRUE
Use as.logical() to convert something into a logical type.
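Because TRUE behaves as 1 and FALSE as 0 in arithmetic, logical vectors can be counted and averaged directly. A small sketch with made-up values:

```r
alive <- c(TRUE, FALSE, TRUE, TRUE)

n_alive    <- sum(alive)   # number of TRUE values: 3
prop_alive <- mean(alive)  # proportion of TRUE values: 0.75

from_numbers <- as.logical(c(0, 1))  # FALSE TRUE
from_text    <- as.logical("yes")    # NA: only TRUE/FALSE-like strings are recognized
```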
Check Types
To check the types in R, you can use the following 2 functions.
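The two functions are not shown in this copy of the notes. As an assumption, the sketch below uses class() and str(), two base R functions commonly used for this purpose, on a small made-up data frame.

```r
# Made-up data frame for illustration
pigs <- data.frame(
  ID        = c("18152", "18367"),
  PARITY    = c(7L, 7L),
  TOTALBORN = c(13, 23),
  stringsAsFactors = FALSE
)

class(pigs$PARITY)                  # class of a single column: "integer"
col_classes <- sapply(pigs, class)  # class of every column
col_classes
str(pigs)                           # compact overview of structure and types
```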
A common operation is type conversion. This is done to ensure that all columns in a dataset have a correct format for further analyses. There are many type conversion functions in R. Here are a few:
as.numeric()
as.character()
as.factor()
as.Date()
as.logical()
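A short sketch of type conversion on a made-up data frame. It also shows one common pitfall: calling as.numeric() directly on a factor returns the integer codes, not the labels.

```r
# Made-up data in which everything was read as character
raw <- data.frame(
  PARITY   = c("7", "6", "4"),
  SERVDATE = c("2024-07-16", "2024-07-12", "2024-07-16"),
  stringsAsFactors = FALSE
)

raw$PARITY   <- as.numeric(raw$PARITY)
raw$SERVDATE <- as.Date(raw$SERVDATE)

# Caution: as.numeric() on a factor returns the integer codes, not the labels
f <- as.factor(c("10", "20", "10"))
codes  <- as.numeric(f)                # 1 2 1
values <- as.numeric(as.character(f))  # 10 20 10
```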
Dates
It is highly recommended that date variables/features are managed using functions in the lubridate package in R. For a full account of lubridate, please see https://lubridate.tidyverse.org/.
Let’s convert SERVDATE and FARROWDATE from character to date in production_3.
data_file <- "Production_Raw.csv"
full_file_path <- str_c(path_data, data_file)
# Explicitly assigns names and types of variables
production_r <- read_delim(
  file = full_file_path,    # path to file
  delim = ",",              # 'comma' separated (fields/columns)
  skip = 1,                 # Skip the first line (we will give col names)
  col_names = c("ID", "PARITY", "SERVDATE", "FARROWDATE", "TOTALBORN", "BORNALIVE"),
  col_types = "ficcii",     # Set Types
  na = c("", "NULL", "NA")  # vector of possible missing values
)
production_r  # Service is in European format, Farrow is in American format
# A tibble: 105 × 28
ID PARITY SERVDATE FARROWDATE TOTALBORN BORNALIVE X7 X8 X9
<fct> <int> <chr> <chr> <int> <int> <lgl> <lgl> <lgl>
1 18152 7 16/07/2024 11/11/2024 13 13 NA NA NA
2 18367 7 16/07/2024 11/11/2024 23 14 NA NA NA
3 19166 7 16/07/2024 11/9/2024 18 17 NA NA NA
4 19600 6 12/7/2024 11/5/2024 15 14 NA NA NA
5 20619 6 16/07/2024 11/10/2024 13 13 NA NA NA
6 21079 6 16/07/2024 11/12/2024 10 8 NA NA NA
7 21228 6 16/07/2024 11/10/2024 16 14 NA NA NA
8 21502 6 16/07/2024 11/9/2024 16 14 NA NA NA
9 22119 6 16/07/2024 11/8/2024 15 14 NA NA NA
10 22192 4 12/7/2024 11/4/2024 11 8 NA NA NA
X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22
<lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 NA NA NA NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA NA NA NA NA NA
7 NA NA NA NA NA NA NA NA NA NA NA NA NA
8 NA NA NA NA NA NA NA NA NA NA NA NA NA
9 NA NA NA NA NA NA NA NA NA NA NA NA NA
10 NA NA NA NA NA NA NA NA NA NA NA NA NA
X23 X24 X25 X26 X27 X28
<lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 NA NA NA NA NA NA
2 NA NA NA NA NA NA
3 NA NA NA NA NA NA
4 NA NA NA NA NA NA
5 NA NA NA NA NA NA
6 NA NA NA NA NA NA
7 NA NA NA NA NA NA
8 NA NA NA NA NA NA
9 NA NA NA NA NA NA
10 NA NA NA NA NA NA
# ℹ 95 more rows
Use the functions below to modify production_r so that it exactly matches the other two datasets.
All you need to do is alter the order of the ‘y’, ‘m’, and ‘d’ to get the correct function. If you have a date such as 25/01/2021, use the function dmy().
ymd() for year-month-day formats (any separator)
dmy() for day-month-year formats
mdy() for month-day-year formats
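As a sketch of how these parsers apply to the two formats seen in production_r, the example below uses a few values copied from the printed rows above (rows 1 and 4) as plain character vectors rather than the full dataset.

```r
library(lubridate)

serv   <- c("16/07/2024", "12/7/2024")  # day/month/year (European)
farrow <- c("11/11/2024", "11/5/2024")  # month/day/year (American)

serv_date   <- dmy(serv)
farrow_date <- mdy(farrow)

format(serv_date)    # "2024-07-16" "2024-07-12"
format(farrow_date)  # "2024-11-11" "2024-11-05"

# Days between service and farrowing
days_between <- as.numeric(farrow_date - serv_date)
days_between         # 118 116
```

Once both columns are true dates, arithmetic such as the difference in days works directly.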
The lubridate package also has functions to extract parts of the date such as year(), quarter(), month(), and day().
# extract month from service date
month(production_2$SERVDATE)
We can use the table() function to create a count of each level of a factor. This is called a frequency table.
# table of service months
table(month(production_2$SERVDATE))
7 12
73 32
Other functions in lubridate:
year()
wday() with label = TRUE for the day of the week by name (weekdays() is the base R counterpart)
quarter() to extract the quarter (quarters() is the base R counterpart)
floor_date(), round_date(), and ceiling_date() to round a date down, to the nearest, or up to specific units
week() to extract the week of the year
yq() to parse year-quarter dates
today() and now() to get the current date or date-time
Many more to deal with periods, durations, and intervals!
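A small sketch that combines several of the extraction and rounding functions above with table(). The three dates are made up for illustration.

```r
library(lubridate)

# Made-up service dates
serv_date <- ymd(c("2024-07-16", "2024-07-12", "2024-12-07"))

year(serv_date)     # 2024 2024 2024
month(serv_date)    # 7 7 12
quarter(serv_date)  # 3 3 4

# Round each date down to the first day of its month
month_start <- floor_date(serv_date, unit = "month")
format(month_start) # "2024-07-01" "2024-07-01" "2024-12-01"

# Frequency table of service months
month_counts <- table(month(serv_date))
month_counts
```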