How to use R Markdown, Import Data, and Start EDA with R

R Markdown

An R Markdown file is a plain text file with an .Rmd extension. The file is a notebook interface for R. For more info, check out RStudio.com.

I prefer to use this type of file to write my code because it allows for lots of space to write notes. Additionally, because the chunk format allows a user to run only certain sections of code at a time, I find troubleshooting errors to be much easier. If you happen to also be a Python user, you might find the R Markdown format similar to the Jupyter Notebook style of coding. Notice I said “also”, because by virtue of you being here and trying these code samples, you have become an R user. Look at you!

All R Markdown code chunks start and end with three back ticks (```). On your keyboard, the back ticks can be found on the same key as the tilde (~). These are not the same as apostrophes. After the opening back ticks comes an opening curly brace, the lowercase letter r, and a closing curly brace, like this: {r}. The chunk is then closed with three more back ticks on their own line.

The length of the chunk is determined by the number of new lines between the opening and closing back ticks.

For example:

#this is one new line
#this is two new lines

In R, comments are written by prefacing the text with a “#”. This prevents the line from running. Always use comments in your R code so that others (and more than likely yourself as well) will understand what the code is doing.

Anything written outside the chunk, like this text, will print as plain text when the entire document is knitted (done by clicking the blue “Knit” button in the RStudio Integrated Development Environment (IDE)).

To run a single chunk of code, click the green arrow on the right-hand side of the chunk. To run all chunks at once, click “Run” at the top, and select “Run All”. Alternatively, try knitting the R Markdown document to see how pretty your project has become :)

What is knitting? Knitting is how the RStudio IDE renders an R Markdown document (basically prints it to your browser, if the defaults are left in place) so users can see the text, the code, and the output of all code and visualizations together in one finished document. If you are using R Markdown for school, this is how to see what your professor will see when you hand in your code.

Knitting can result in an HTML, PDF, or Microsoft Word document. To choose between them, select the arrow next to the “Knit” button.
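Knitting can also be done from code instead of the button. The sketch below is only an illustration: it writes a tiny, made-up R Markdown file to a temporary location (the file contents and the names `rmd_lines` and `rmd_path` are hypothetical), then calls `rmarkdown::render()`, which, as far as I understand, is what the Knit button runs behind the scenes. Rendering is skipped if the rmarkdown package or pandoc is not available on your machine.

```r
# Build a three-back-tick fence without typing it literally inside this example
fence <- strrep("`", 3)

# A minimal, hypothetical R Markdown document: plain text plus one code chunk
rmd_lines <- c(
  "This sentence is outside the chunk, so it prints as plain text.",
  "",
  paste0(fence, "{r}"),
  "# this comment is inside the chunk and will not run",
  "x <- 1 + 1",
  "x",
  fence
)

# Save it to a temporary file so nothing on your machine is overwritten
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)

# Knit from code, but only if rmarkdown and pandoc are both available
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()
if (can_render) {
  out_file <- rmarkdown::render(rmd_path, output_format = "html_document",
                                quiet = TRUE)
  print(out_file)  # path to the rendered HTML file
}
```

Swapping `"html_document"` for `"pdf_document"` or `"word_document"` corresponds to the other choices under the Knit arrow (PDF output also needs a LaTeX installation).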

For more info, see the R Markdown Cheat Sheet or Michael Sachs’ Introduction to knitr.

Package Loading

If you want to use any functions besides those included in base R (and you absolutely will), you’ll need to install and load packages. There are hundreds of R packages, all designed to make your life easier by including functions for commonly needed tasks. Some are even specific to different disciplines or act as intermediaries between languages.

For example, there are R packages to:

* reshape data,
* let you use SQL-style code,
* use Python code in the same file as R code,
* create visualizations like graphs and plots,
* create visualizations that tap into Google Maps,
* work with text data and with dates,
* and even make your code output more aesthetically pleasing (I’m looking at you, kableExtra).

The typical way this is done is to type the following code:

install.packages('dplyr') #installs the package. Only needed the first time you use a package.
library(dplyr) #loads the package. Need to run this line EVERY TIME you use a package.

Because R requires users to load the packages they will need every time a script is run (not just a quirk of R, Python does this too), users inevitably find themselves with a section that looks like this, at the top of every R script:

library(dplyr)
library(ggplot2)
library(stringr)
library(stringi)
library(purrr)
library(readr)
library(readxl)
library(caret)

As you can imagine, this can get quite long, and takes up a lot of valuable screen space. Here R users have several options, such as using a function like ‘apply’ to load a vector of package names, writing a loop to do this, or using a pre-written function to load multiple packages all at once, in a single line of code.
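To make the loop and apply-style options concrete, here is a minimal sketch. The helper name `load_packages` is my own invention for illustration, not a function from any package. To keep the example runnable offline it is demonstrated on packages that ship with R; in practice you would pass names like "dplyr" or "ggplot2".

```r
# Hypothetical helper: load every package named in a character vector,
# installing any that are missing first.
load_packages <- function(pkgs) {
  for (pkg in pkgs) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)  # needs an internet connection and a CRAN mirror
    }
    # character.only = TRUE tells library() that pkg holds the name as a string
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# Demonstrated with packages that ship with R, so nothing is downloaded here
loaded <- load_packages(c("stats", "utils", "tools"))

# The same idea as a one-liner with lapply(), for packages already installed
invisible(lapply(c("stats", "utils"), library, character.only = TRUE))
```

The `character.only = TRUE` argument is the important part: without it, `library()` would look for a package literally named `pkg`.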

My favorite is the last option, and my package loader of choice is pacman.

The pacman package’s ‘p_load’ function can load all desired packages very quickly and with minimal code. This function is a wrapper for ‘library’ and ‘require’. It checks to see if a package is installed, and if it isn’t, it attempts to install it from CRAN or any other repository in the pacman repository list. So now instead of 15 lines of “library(package_name)” at the top of our file, we have simply:

#load all packages at once using pacman.
pacman::p_load(knitr, dplyr, readxl, writexl, tidyverse, lubridate, ggplot2, pastecs)

Much better!

Using a single function from a package

Packages contain multiple functions, and sometimes functions with the same name can appear in multiple packages. When that happens, the most recently loaded package masks the others, so R may quietly run a different function than the one you meant (I notice this most when my code is throwing errors). The solution is to tell R specifically which package you want to use for that function. To do that, type the package name, followed by two colons (::), then the function name:

dplyr::select(mydata, column_name) #where the dataframe is named "mydata", and the column you want to select is called "column_name"
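A classic example of a name clash is `filter`, which exists in both the stats package (part of base R) and dplyr. The sketch below uses the built-in mtcars data and only runs the dplyr part if dplyr is installed; the variable names are just for illustration.

```r
# stats::filter() and dplyr::filter() share a name but do different things.
# Naming the package removes any doubt about which one runs.

# Base R: a 3-point moving average over a numeric vector
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
print(moving_avg)

# dplyr: keep rows of a data frame that match a condition, then pick columns
# (this part only runs if dplyr is installed)
if (requireNamespace("dplyr", quietly = TRUE)) {
  six_cyl <- dplyr::filter(mtcars, cyl == 6)
  print(dplyr::select(six_cyl, mpg, cyl))
}
```

Because every call is prefixed with its package, this code behaves the same no matter which packages are loaded or in what order.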

Importing Data

Now that packages are loaded, it’s time to import data. This can be done with the RStudio IDE by going to the Environment/History panel (usually top right in RStudio), clicking “Import Dataset”, and browsing for a file. This is particularly helpful if, like me, you sometimes can’t remember where you stored the file you want to use.

Alternatively, you can do it with code:

#Read the CSV into a data frame. The string "null" is treated as a missing value.
mtcars <- read_csv("/Users/mkinlan/Downloads/mtcars_dataset.csv",
                   na = "null", col_names = TRUE, show_col_types = FALSE)

#Viewing the entire dataset as a separate tab in RStudio. This should appear next to the tab in RStudio with your R Markdown file
View(mtcars)
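If you don’t have a CSV handy, you can still practice the import step. The sketch below is an assumption-laden stand-in for the code above: it uses the copy of mtcars that ships with R, writes it to a temporary file (the names `csv_path` and `cars_from_csv` are made up), and reads it back with base R’s `read.csv()` rather than readr’s `read_csv()`, so no extra packages are needed.

```r
# mtcars ships with R, so write it to a temporary CSV to have a file to import
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Base R import; na.strings plays the same role as the na argument in read_csv()
cars_from_csv <- read.csv(csv_path, header = TRUE, na.strings = "null")

dim(cars_from_csv)    # number of rows and columns
names(cars_from_csv)  # column names taken from the header row
```

Note that `row.names = FALSE` drops the car names, since they are stored as row names rather than as a column.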

Exploring the Data with Descriptive Statistics

One of the first things to do is to explore the data and try to understand more about it. To see just a few lines of the dataset and get some descriptive statistics:

# This will print below the chunk, so you may have to scroll down.
head(mtcars,10) #looking at just the top 10 lines of the dataset.

##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

summary(mtcars) #min, mean, median, quartiles, and max for numeric variables

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
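Alongside head() and summary(), a few other base R functions give a quick feel for the shape of the data. This is just a short sketch using the built-in mtcars dataset; the variable name `missing_per_column` is my own.

```r
# Dimensions: number of rows and columns
dim(mtcars)

# Compact overview of each column's type and first few values
str(mtcars)

# Count missing values in each column
missing_per_column <- colSums(is.na(mtcars))
print(missing_per_column)
```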

The stat.desc function in the pastecs package gives typical stats like mean, median, and range, but also performs tests of normality. To learn more about options for this function, see this UCLA article.

stat.desc(mtcars, norm = TRUE) #counts of zero (nbr.null) and missing (nbr.na) values, standard deviation, skewness, normality test

##                      mpg           cyl          disp            hp         drat
## nbr.val       32.0000000  3.200000e+01  3.200000e+01   32.00000000  32.00000000
## nbr.null       0.0000000  0.000000e+00  0.000000e+00    0.00000000   0.00000000
## nbr.na         0.0000000  0.000000e+00  0.000000e+00    0.00000000   0.00000000
## min           10.4000000  4.000000e+00  7.110000e+01   52.00000000   2.76000000
## max           33.9000000  8.000000e+00  4.720000e+02  335.00000000   4.93000000
## range         23.5000000  4.000000e+00  4.009000e+02  283.00000000   2.17000000
## sum          642.9000000  1.980000e+02  7.383100e+03 4694.00000000 115.09000000
## median        19.2000000  6.000000e+00  1.963000e+02  123.00000000   3.69500000
## mean          20.0906250  6.187500e+00  2.307219e+02  146.68750000   3.59656250
## SE.mean        1.0654240  3.157093e-01  2.190947e+01   12.12031731   0.09451874
## CI.mean.0.95   2.1729465  6.438934e-01  4.468466e+01   24.71955013   0.19277224
## var           36.3241028  3.189516e+00  1.536080e+04 4700.86693548   0.28588135
## std.dev        6.0269481  1.785922e+00  1.239387e+02   68.56286849   0.53467874
## coef.var       0.2999881  2.886338e-01  5.371779e-01    0.46740771   0.14866382
## skewness       0.6106550 -1.746119e-01  3.816570e-01    0.72602366   0.26590390
## skew.2SE       0.7366922 -2.106512e-01  4.604298e-01    0.87587259   0.32078561
## kurtosis      -0.3727660 -1.762120e+00 -1.207212e+00   -0.13555112  -0.71470062
## kurt.2SE      -0.2302812 -1.088573e+00 -7.457714e-01   -0.08373853  -0.44151592
## normtest.W     0.9475647  7.533100e-01  9.200127e-01    0.93341934   0.94588390
## normtest.p     0.1228814  6.058338e-06  2.080657e-02    0.04880824   0.11006076
##                        wt        qsec            vs            am          gear
## nbr.val       32.00000000  32.0000000  3.200000e+01  3.200000e+01  3.200000e+01
## nbr.null       0.00000000   0.0000000  1.800000e+01  1.900000e+01  0.000000e+00
## nbr.na         0.00000000   0.0000000  0.000000e+00  0.000000e+00  0.000000e+00
## min            1.51300000  14.5000000  0.000000e+00  0.000000e+00  3.000000e+00
## max            5.42400000  22.9000000  1.000000e+00  1.000000e+00  5.000000e+00
## range          3.91100000   8.4000000  1.000000e+00  1.000000e+00  2.000000e+00
## sum          102.95200000 571.1600000  1.400000e+01  1.300000e+01  1.180000e+02
## median         3.32500000  17.7100000  0.000000e+00  0.000000e+00  4.000000e+00
## mean           3.21725000  17.8487500  4.375000e-01  4.062500e-01  3.687500e+00
## SE.mean        0.17296847   0.3158899  8.909831e-02  8.820997e-02  1.304266e-01
## CI.mean.0.95   0.35277153   0.6442617  1.817172e-01  1.799054e-01  2.660067e-01
## var            0.95737897   3.1931661  2.540323e-01  2.489919e-01  5.443548e-01
## std.dev        0.97845744   1.7869432  5.040161e-01  4.989909e-01  7.378041e-01
## coef.var       0.30412851   0.1001159  1.152037e+00  1.228285e+00  2.000825e-01
## skewness       0.42314646   0.3690453  2.402577e-01  3.640159e-01  5.288545e-01
## skew.2SE       0.51048252   0.4452150  2.898461e-01  4.391476e-01  6.380083e-01
## kurtosis      -0.02271075   0.3351142 -2.001938e+00 -1.924741e+00 -1.069751e+00
## kurt.2SE      -0.01402987   0.2070213 -1.236724e+00 -1.189035e+00 -6.608529e-01
## normtest.W     0.94325772   0.9732509  6.322635e-01  6.250744e-01  7.727856e-01
## normtest.p     0.09265499   0.5935176  9.737376e-08  7.836354e-08  1.306844e-05
##                      carb
## nbr.val      3.200000e+01
## nbr.null     0.000000e+00
## nbr.na       0.000000e+00
## min          1.000000e+00
## max          8.000000e+00
## range        7.000000e+00
## sum          9.000000e+01
## median       2.000000e+00
## mean         2.812500e+00
## SE.mean      2.855297e-01
## CI.mean.0.95 5.823417e-01
## var          2.608871e+00
## std.dev      1.615200e+00
## coef.var     5.742933e-01
## skewness     1.050874e+00
## skew.2SE     1.267771e+00
## kurtosis     1.257043e+00
## kurt.2SE     7.765553e-01
## normtest.W   8.510972e-01
## normtest.p   4.382405e-04

To learn more about how to interpret the results of stat.desc, see Verify if Data are Normally Distributed in R: Part 2, from Scientifically Sound.
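The normtest.W and normtest.p rows are documented in pastecs as the statistic and p-value of a Shapiro-Wilk test of normality. If you only care about one variable, you can run that test directly with base R’s `shapiro.test()`. A minimal sketch using the built-in mtcars data (the name `sw` is arbitrary):

```r
# Shapiro-Wilk test of normality on the mpg column
sw <- shapiro.test(mtcars$mpg)

sw$statistic  # compare with the normtest.W row for mpg
sw$p.value    # compare with the normtest.p row for mpg

# A p-value above 0.05 means the test does not reject normality at the 5% level
sw$p.value > 0.05
```

For mpg the p-value is above 0.05, so the test gives no strong evidence against normality. With only 32 observations, that is weaker than saying the data are Normal.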

Looking at the ‘skewness’ (0.61) and ‘kurtosis’ (-0.37) outputs for the ‘mpg’ variable, we can see that mpg is skewed right, also called positively skewed (most of the data sits toward the left side of a histogram, with a longer tail stretching to the right), and has a slightly flat, light-tailed distribution.

The negative kurtosis means there are fewer extreme values (aka outliers or influential observations) than is typically the case in a Normal or Gaussian distribution.
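To demystify those two numbers, here is a sketch that computes them by hand for mpg. The formulas below are my reading of how these statistics are commonly defined (third and fourth central moments scaled by the sample standard deviation), and they appear to reproduce the stat.desc values shown above; other packages use slightly different corrections, so small differences elsewhere are normal.

```r
x <- mtcars$mpg
n <- length(x)
deviations <- x - mean(x)

# Skewness: positive values mean a longer right tail, negative a longer left tail
skewness_mpg <- sum(deviations^3) / (n * sd(x)^3)

# Excess kurtosis: subtracting 3 means a Normal distribution scores 0,
# so negative values indicate lighter tails than Normal
kurtosis_mpg <- sum(deviations^4) / (n * sd(x)^4) - 3

round(c(skewness = skewness_mpg, kurtosis = kurtosis_mpg), 2)
```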

To see this visually, we can make a quick histogram:

hist(mtcars$mpg) #select the specific column by typing a dollar sign after the dataset name

The more the data mimics the Normal distribution, the less of a lopsided tail you’ll see. This variable almost has a Normal distribution; if there were more data points between 25 and 30 on the x-axis, it would look fairly Normal.

Learn more about kurtosis with Kurtosis Explained: Basics of Kurtosis Interpretation on Graphs

Alternatively, you can make a prettier histogram with the magic of ggplot2:

ggplot(mtcars,aes(x=mpg))+geom_histogram(binwidth=4.5, color="#e9ecef",fill="#00AD0B") + ggtitle("Histogram of Variable 'MPG' with binwidth 4.5")

You’ll notice that the appearance of skewness and kurtosis can change dramatically depending on the binwidth chosen for a histogram. The plot above has a binwidth of 4.5, but if we change it to binwidth=10, then we lose all sight of any skewness or tails:

ggplot(mtcars,aes(x=mpg))+geom_histogram(binwidth=10, color="#e9ecef",fill="#34ADDE") + ggtitle("Histogram of Variable 'MPG' with binwidth 10")

At this point you might be asking, “Well, how did the base R function (the first histogram) determine binwidth? It must have picked a certain binwidth value for a reason.”

To which I say you are correct!

The hist() function in base R (the default functions that come with R) determines the number of bins using the “Sturges” method by default.
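Sturges’ rule suggests ceiling(log2(n) + 1) bins for n observations. The sketch below checks that against R’s own `nclass.Sturges()` and then peeks at the breakpoints hist() actually used, without drawing the plot. The variable names are just for illustration.

```r
# Sturges' rule: suggested number of bins for n observations
n <- length(mtcars$mpg)
sturges_bins <- ceiling(log2(n) + 1)
sturges_bins == nclass.Sturges(mtcars$mpg)  # same rule, built into R

# hist() treats that number as a suggestion, then picks round breakpoints
h <- hist(mtcars$mpg, plot = FALSE)
h$breaks           # the bin edges hist() chose
diff(h$breaks)[1]  # the resulting bin width
```

Because hist() rounds to tidy breakpoints, the final number of bins can differ from the Sturges suggestion. That is why the base R histogram of mpg ends up with a bin width of 5.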

For more information on that function, see: