R Markdown
An R Markdown file is a plain text file with an .Rmd extension. The file is a notebook interface for R. For more info, check out RStudio.com.
I prefer to use this type of file to write my code because it allows for lots of space to write notes. Additionally, because the chunk format allows a user to run only certain sections of code at a time, I find troubleshooting errors to be much easier. If you happen to also be a Python user, you might find the R Markdown format similar to the Jupyter Notebook style of coding. Notice I said “also”, because by virtue of you being here and trying these code samples, you have become an R user. Look at you!
All R Markdown code chunks start and end with ``` (three back ticks). On your keyboard, the back tick can be found on the same key as the tilde (~). Back ticks are not the same as apostrophes. After the opening back ticks comes a single curly brace, the lowercase letter r, and then a closing curly brace. The chunk is then closed with three more back ticks.
The length of the chunk is determined by the number of new lines between the opening and closing back ticks.
For example:
#this is one new line
#this is two new lines
In R, comments are written by prefacing the text with a “#”. This prevents the line from running. Always use comments in your R code so that others (and more than likely yourself as well) will understand what the code is doing.
Anything written outside the chunk, like this text, will print as plain text when the entire document is knitted (done by clicking the blue “Knit” button in the RStudio Integrated Development Environment (IDE)).
To run a single chunk of code, click the green arrow on the right-hand side of the chunk. To run all chunks at once, click “Run” at the top, and select “Run All”. Alternatively, try knitting the R Markdown document to see how pretty your project has become :)
What is knitting? Knitting is how the RStudio IDE renders an R Markdown document (basically printing it to your browser, if the defaults are left in place) so users can see the output of all the code and all the visualizations at once. If you are using R Markdown for school, this is how to see what your professor will see when you hand in your code.
Knitting can result in an HTML, PDF, or Microsoft Word document. To choose between them, select the arrow next to the “Knit” button.
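Knitting can also be started from the R console instead of the “Knit” button. The sketch below assumes the rmarkdown package and pandoc are available (RStudio normally provides both); the file name demo.Rmd and its contents are made up for illustration. It writes a tiny R Markdown file to a temporary folder, which also shows what a chunk looks like in the raw file, and then renders it to HTML.

```r
# Build the three-back-tick fence programmatically so it is easy to see
fence <- strrep("`", 3)

# A minimal, hypothetical R Markdown document: one line of text and one chunk
rmd_lines <- c(
  "---",
  "title: \"Demo\"",
  "---",
  "",
  "Text outside a chunk prints as plain text.",
  "",
  paste0(fence, "{r}"),
  "# this comment sits inside a code chunk",
  "1 + 1",
  fence
)

rmd_path <- file.path(tempdir(), "demo.Rmd")
writeLines(rmd_lines, rmd_path)

# Render only if rmarkdown and pandoc are installed (an assumption about your setup)
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  out_path <- rmarkdown::render(rmd_path, output_format = "html_document", quiet = TRUE)
  print(out_path)
}
```

Swapping "html_document" for "pdf_document" or "word_document" selects the other two formats; PDF output additionally needs a LaTeX installation.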
For more info, see the R Markdown Cheat Sheet or Michael Sachs’ Introduction to knitr.
Package Loading
If you want to use any functions besides those included in base R (and you absolutely will), you’ll need to install and load packages. There are thousands of R packages, all designed to make your life easier by including functions for commonly needed tasks. Some are even specific to different disciplines or act as intermediaries between languages.
For example, there are R packages to:
* reshape data,
* let you use SQL-style code,
* use Python code in the same file as R code,
* create visualizations like graphs and plots,
* create visualizations that tap into Google Maps,
* work with text data and with dates,
* and even just make your code output more aesthetically pleasing (I’m looking at you, kableExtra).
The typical way this is done is to type the following code:
install.packages('dplyr') #installs the package. Only needed the first time you use a package.
library(dplyr) #loads the package. Need to run this line EVERY TIME you use a package.
Because R requires users to load the packages they will need every time a script is run (not just a quirk of R, Python does this too), users inevitably find themselves with a section that looks like this at the top of every R script:
library(dplyr)
library(ggplot2)
library(stringr)
library(stringi)
library(purrr)
library(readr)
library(readxl)
library(caret)
As you can imagine, this can get quite long, and takes up a lot of valuable screen space. Here R users have several options, such as using a function like ‘apply’ to load a vector of package names, writing a loop to do this, or using a pre-written function to load multiple packages all at once, in a single line of code.
My favorite is the last option, and my package loader of choice is pacman.
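For comparison, the first option (an apply-style function over a vector of package names) could be sketched as below. The package names are placeholders chosen only because they ship with every R installation; swap in your own, already installed, packages.

```r
# Packages to load; these three ship with every R installation,
# so the example runs without installing anything
pkgs <- c("stats", "utils", "tools")

# library() normally expects an unquoted name, so character.only = TRUE
# tells it to treat each element of the vector as a package name
invisible(lapply(pkgs, library, character.only = TRUE))

# Check that each package is now attached to the search path
loaded <- paste0("package:", pkgs) %in% search()
print(loaded)
```

Unlike the pacman approach, this sketch does not install anything: library() stops with an error if a package in the vector is missing.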
The pacman package’s ‘p_load’ function can load all desired packages very quickly and with minimal code. This function is a wrapper for ‘library’ and ‘require’. It checks to see if a package is installed, and if it isn’t, it attempts to install it from CRAN or any other repository in the pacman repository list. So now instead of 15 lines of “library(package_name)” at the top of our file, we have simply:
#load all packages at once using pacman.
pacman::p_load(knitr, dplyr, readxl, writexl, tidyverse, lubridate, ggplot2, pastecs)
Much better!
Using a single function from a package
Packages contain multiple functions, and sometimes functions with the same name can appear in multiple packages. If you find this happening to you (most noticeable to me during times when my code is throwing errors), a solution can be to tell R specifically which package you want to use for that function. To do that, type the package name, followed by two colons (::), then the function name:
dplyr::select(mydata, column_name) #where the dataframe is named "mydata", and the column you want to select is called "column_name"
Importing Data
Now that packages are loaded, it’s time to import data. This can be done with the RStudio IDE by going to the Environment/History panel (usually top right in RStudio), clicking “Import Dataset”, and browsing for a file. This is particularly helpful if, like me, you sometimes can’t remember where you stored the file you want to use.
Alternatively, you can do it with code:
#read_csv() comes from the readr package, which is loaded as part of the tidyverse
mtcars <- read_csv("/Users/mkinlan/Downloads/mtcars_dataset.csv", na = "null",
                   col_names = TRUE, show_col_types = FALSE)
#Viewing the entire dataset as a separate tab in RStudio. This should appear next to the tab in RStudio with your R Markdown file
View(mtcars)
Exploring the Data with Descriptive Statistics
One of the first things to do is to explore the data and try to understand more about it. To see just a few lines of the dataset and get some descriptive statistics:
# This will print below the chunk, so you may have to scroll down.
head(mtcars,10) #looking at just the top 10 lines of the dataset.
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
summary(mtcars) #min, mean, median, quartiles, and max for numeric variables
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
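The values printed by summary() are rounded for display. As a small sketch, the same quantities can be computed directly for a single column with base R; this uses the copy of mtcars that ships with R (datasets::mtcars), which is an assumption if your imported file differs.

```r
# Use the mtcars data that ships with R, one column at a time
mpg <- datasets::mtcars$mpg

# The five-number summary plus the mean, as printed by summary()
print(summary(mpg))

# The same quantities computed directly, without display rounding
five_numbers <- quantile(mpg, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five_numbers)
print(mean(mpg))
```

Here the first quartile comes out as 15.425, which summary() displays as 15.43.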
The stat.desc function in the pastecs package gives typical stats like mean, median, and quartiles, but also performs tests of normality. To learn more about options for this function, see this UCLA article.
stat.desc(mtcars, norm = TRUE) #count of null (zero) and NA values, standard deviation, skewness, normality test
## mpg cyl disp hp drat
## nbr.val 32.0000000 3.200000e+01 3.200000e+01 32.00000000 32.00000000
## nbr.null 0.0000000 0.000000e+00 0.000000e+00 0.00000000 0.00000000
## nbr.na 0.0000000 0.000000e+00 0.000000e+00 0.00000000 0.00000000
## min 10.4000000 4.000000e+00 7.110000e+01 52.00000000 2.76000000
## max 33.9000000 8.000000e+00 4.720000e+02 335.00000000 4.93000000
## range 23.5000000 4.000000e+00 4.009000e+02 283.00000000 2.17000000
## sum 642.9000000 1.980000e+02 7.383100e+03 4694.00000000 115.09000000
## median 19.2000000 6.000000e+00 1.963000e+02 123.00000000 3.69500000
## mean 20.0906250 6.187500e+00 2.307219e+02 146.68750000 3.59656250
## SE.mean 1.0654240 3.157093e-01 2.190947e+01 12.12031731 0.09451874
## CI.mean.0.95 2.1729465 6.438934e-01 4.468466e+01 24.71955013 0.19277224
## var 36.3241028 3.189516e+00 1.536080e+04 4700.86693548 0.28588135
## std.dev 6.0269481 1.785922e+00 1.239387e+02 68.56286849 0.53467874
## coef.var 0.2999881 2.886338e-01 5.371779e-01 0.46740771 0.14866382
## skewness 0.6106550 -1.746119e-01 3.816570e-01 0.72602366 0.26590390
## skew.2SE 0.7366922 -2.106512e-01 4.604298e-01 0.87587259 0.32078561
## kurtosis -0.3727660 -1.762120e+00 -1.207212e+00 -0.13555112 -0.71470062
## kurt.2SE -0.2302812 -1.088573e+00 -7.457714e-01 -0.08373853 -0.44151592
## normtest.W 0.9475647 7.533100e-01 9.200127e-01 0.93341934 0.94588390
## normtest.p 0.1228814 6.058338e-06 2.080657e-02 0.04880824 0.11006076
## wt qsec vs am gear
## nbr.val 32.00000000 32.0000000 3.200000e+01 3.200000e+01 3.200000e+01
## nbr.null 0.00000000 0.0000000 1.800000e+01 1.900000e+01 0.000000e+00
## nbr.na 0.00000000 0.0000000 0.000000e+00 0.000000e+00 0.000000e+00
## min 1.51300000 14.5000000 0.000000e+00 0.000000e+00 3.000000e+00
## max 5.42400000 22.9000000 1.000000e+00 1.000000e+00 5.000000e+00
## range 3.91100000 8.4000000 1.000000e+00 1.000000e+00 2.000000e+00
## sum 102.95200000 571.1600000 1.400000e+01 1.300000e+01 1.180000e+02
## median 3.32500000 17.7100000 0.000000e+00 0.000000e+00 4.000000e+00
## mean 3.21725000 17.8487500 4.375000e-01 4.062500e-01 3.687500e+00
## SE.mean 0.17296847 0.3158899 8.909831e-02 8.820997e-02 1.304266e-01
## CI.mean.0.95 0.35277153 0.6442617 1.817172e-01 1.799054e-01 2.660067e-01
## var 0.95737897 3.1931661 2.540323e-01 2.489919e-01 5.443548e-01
## std.dev 0.97845744 1.7869432 5.040161e-01 4.989909e-01 7.378041e-01
## coef.var 0.30412851 0.1001159 1.152037e+00 1.228285e+00 2.000825e-01
## skewness 0.42314646 0.3690453 2.402577e-01 3.640159e-01 5.288545e-01
## skew.2SE 0.51048252 0.4452150 2.898461e-01 4.391476e-01 6.380083e-01
## kurtosis -0.02271075 0.3351142 -2.001938e+00 -1.924741e+00 -1.069751e+00
## kurt.2SE -0.01402987 0.2070213 -1.236724e+00 -1.189035e+00 -6.608529e-01
## normtest.W 0.94325772 0.9732509 6.322635e-01 6.250744e-01 7.727856e-01
## normtest.p 0.09265499 0.5935176 9.737376e-08 7.836354e-08 1.306844e-05
## carb
## nbr.val 3.200000e+01
## nbr.null 0.000000e+00
## nbr.na 0.000000e+00
## min 1.000000e+00
## max 8.000000e+00
## range 7.000000e+00
## sum 9.000000e+01
## median 2.000000e+00
## mean 2.812500e+00
## SE.mean 2.855297e-01
## CI.mean.0.95 5.823417e-01
## var 2.608871e+00
## std.dev 1.615200e+00
## coef.var 5.742933e-01
## skewness 1.050874e+00
## skew.2SE 1.267771e+00
## kurtosis 1.257043e+00
## kurt.2SE 7.765553e-01
## normtest.W 8.510972e-01
## normtest.p 4.382405e-04
To learn more about how to interpret the results of stat.desc, see Verify if Data are Normally Distributed in R: Part 2, from Scientifically Sound.
Looking at the ‘skewness’ (0.61) and ‘kurtosis’ (-0.37) outputs for the ‘mpg’ variable, we can see that mpg is skewed right, or positively skewed (most of the data sits toward the left side of a histogram, with a longer tail stretching to the right), with a somewhat flat and light-tailed distribution.
This means there are fewer extreme values (aka outliers or influential observations) than is typically the case in a Normal or Gaussian distribution.
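To make those numbers less of a black box, here is a minimal base R sketch that recomputes them for mpg. The formulas are my assumption about how the values are defined (moment-based skewness and excess kurtosis, plus the Shapiro-Wilk test from the stats package); they are checked against the stat.desc output above rather than taken from the pastecs documentation.

```r
# Use the mtcars data that ships with R
mpg <- datasets::mtcars$mpg
n <- length(mpg)
deviations <- mpg - mean(mpg)

# Skewness: average cubed deviation, scaled by the cubed standard deviation
skewness <- sum(deviations^3) / (n * sd(mpg)^3)

# Excess kurtosis: average fourth-power deviation, scaled by the squared
# variance, minus 3 so that a Normal distribution scores 0
kurtosis <- sum(deviations^4) / (n * var(mpg)^2) - 3

# Shapiro-Wilk normality test from base R's stats package
sw <- shapiro.test(mpg)

print(round(c(skewness = skewness, kurtosis = kurtosis), 2))
print(round(c(W = unname(sw$statistic), p = sw$p.value), 4))
```

A positive skewness means the longer tail is on the right, and a negative excess kurtosis means lighter tails than a Normal distribution.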
To see this visually, we can make a quick histogram:
hist(mtcars$mpg) #select the specific column by typing a dollar sign after the dataset name
The more the data mimics the Normal distribution, the less of a tail you’ll see. This variable almost has a Normal distribution; if there were more data points between 25 and 30 on the x-axis it would look fairly Normal.
Learn more about Kurtosis with Kurtosis Explained: Basics of Kurtosis Interpretation on Graphs
Alternatively, you can make a prettier histogram with the magic of ggplot2:
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 4.5, color = "#e9ecef", fill = "#00AD0B") +
  ggtitle("Histogram of Variable 'MPG' with binwidth 4.5")
You’ll notice that the appearance of skewness and kurtosis can change dramatically depending on the binwidth chosen for a histogram. The plot above has a binwidth of 4.5, but if we change it to binwidth=10, then we lose all sight of any skewness or tails:
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 10, color = "#e9ecef", fill = "#34ADDE") +
  ggtitle("Histogram of Variable 'MPG' with binwidth 10")
At this point you might be asking, “Well, how did the base R function (the first histogram) determine binwidth? It must have picked a certain binwidth value for a reason.”
To which I say you are correct!
The hist() function in base R (the default functions that come with R) determines its bins using the “Sturges” method by default.
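As a rough sketch of what that means for mpg, Sturges' rule suggests a number of bins from the sample size alone, and hist() then rounds that suggestion to "pretty" break points. The code below uses the copy of mtcars that ships with R and computes the histogram without drawing it.

```r
# Use the mtcars data that ships with R
mpg <- datasets::mtcars$mpg

# Sturges' rule: number of bins = ceiling(log2(n) + 1)
n_bins <- ceiling(log2(length(mpg)) + 1)

# Base R exposes the same rule as nclass.Sturges()
stopifnot(n_bins == nclass.Sturges(mpg))

# hist() treats that count as a suggestion and passes it to pretty(),
# which picks "round" break points covering the range of the data
suggested_breaks <- pretty(range(mpg), n = n_bins)

# Compute the histogram without drawing it and inspect the breaks it used
h <- hist(mpg, plot = FALSE)
print(n_bins)
print(h$breaks)
print(h$counts)
```

With 32 cars the rule suggests 6 bins, and the breaks land on 10, 15, 20, 25, 30, and 35, so the base R histogram effectively uses a binwidth of 5, between the two ggplot2 choices above.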
For more information on that function, see: