Hi!
If you’re reading this you’re probably just getting started with R, or you need a serious refresher since refusing to open it after it tried to make you do three different updates in one month and you decided maybe data science wasn’t for you (hah no personal experience here).
Nonetheless, I created this little guide as a very basic introduction to R. I got thrown into the thick of things when I first started and I think a guide like this, even hanging out on the periphery, might have saved me some of the cognitive stress of trying to figure it all out alongside the actual statistics I was supposed to be doing.
I hope you find this useful and good luck with your R journey!
(P.S. everything is at least a little easier if you make your graphs hot pink and you use dark mode O.O) [1]
Although it is easy enough to execute these commands in R using the various buttons and gadgets, keyboard shortcuts are a great way to keep the flow of your code writing going, and to help you get faster at using R.
The easier it is to navigate the basics, the quicker you will be able to get through topics simply because code becomes easier to write and execute.
Open the keyboard shortcut reference:
Windows: SHIFT+ALT+K
Mac: SHIFT+OPTION+K
Insert a new code chunk:
Windows: CTRL+ALT+I
Mac: CMD+OPTION+I
Note: Sometimes on Mac the Windows shortcut works instead of the Mac one. Also, on some Mac keyboards the Option key is labelled “alt”, so you might find yourself pressing CMD+ALT+I.
Run the current line of code:
You will need to make sure your cursor is somewhere on the line of code when you use this command.
Windows: CTRL+ENTER
Mac: CMD+ENTER
Run the current chunk:
You will need to make sure your cursor is somewhere in the chunk when you use this command.
Windows: CTRL+SHIFT+ENTER
Mac: CMD+SHIFT+ENTER
Run all chunks:
Windows: CTRL+ALT+R
Mac: CMD+OPTION+R
Insert the assignment operator (<-):
Windows: ALT+- (the hyphen key)
Mac: OPTION+- (the hyphen key)
Insert the pipe operator (%>%):
Windows: CTRL+SHIFT+M
Mac: CMD+SHIFT+M
Knit or preview the document:
Windows: CTRL+SHIFT+K
Mac: CMD+SHIFT+K
Save the file:
Windows: CTRL+S
Mac: CMD+S
Typically an unsaved R file will look like this:
The text is generally either blue or red, and there is an asterisk (*) at the end of the file name.
After you save it should look like this:
The file name is typically white/black in colour and there is NO asterisk next to it.
NB!! This should be done every 20-30 minutes. You will find that R can be a little moody (which is fair because everything suffers in this economy) but that does mean sometimes it may crash or just stop loading what you’re asking it to do, forcing you to quit and reload your project. Saving periodically means you won’t lose big chunks of work should anything happen. Be careful, and try to get into the habit of saving your work regularly.
To start with, when you create an “R Notebook” (which you will have to do for assignments and tests), it will open with a whole lot of text.
This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.
Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).
The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.
I highly recommend reading this little snippet as it is a good introduction to how user friendly R is. If you consider it more a temperamental orange cat and less a starved lion, it might be easier to handle? Probably…I have faith.
You will end up removing all of this when you create notebooks in the future as it will likely have nothing to do with your assignments, but it is worth a read.
As you are well aware R is very fussy with its syntax, or in other words its language. Every bracket must have both an open and close, all chunks must have the correct number of backticks (```), arguments must be separated by commas, etc. (honestly some degrees require less 0_0).
And sometimes all these details can get a little confusing, especially when you’re working with larger codes, or more complicated functions. So there are some great hacks to keep you from dropping everything for the cool ocean floor.
One of my personal favourites is rainbow parentheses. This is a setting in R that gives R permission to use different colour brackets within your code, so you always know which colours match up together, and can keep track of your open and closed brackets.
Here’s a screenshot of what my brackets look like in a code chunk with a few different functions going on:
While it is not important what is going on in this chunk, do note the colours of the brackets. As you can see, my outer brackets are pink (i.e., where it says mutate; everything within the pink brackets is running with the mutate function), then the brackets within them are yellow (i.e., where it says sum), then green (i.e., where it says c_across), and then turquoise (i.e., where it says everything).
Because all the brackets are different colours I can see within this line of code where bracket pairs start and end, and what functions I’m trying to run within them.
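The screenshot is not reproduced here, so below is a minimal sketch of the same kind of nesting, assuming you have the dplyr package installed. The `scores` data frame and its column names are made up for illustration; only the function names (mutate, sum, c_across, everything) come from the description above.

```r
library(dplyr)

# A tiny made-up data frame: three quiz scores for two people
scores <- data.frame(q1 = c(1, 2), q2 = c(3, 4), q3 = c(5, 6))

# Four levels of brackets: mutate( sum( c_across( everything() ) ) )
totals <- scores %>%
  rowwise() %>%                                        # work one row at a time
  mutate(total = sum(c_across(everything()))) %>%      # add up every column in the row
  ungroup()

totals
```

With rainbow parentheses switched on, each of those four bracket pairs would show up in its own colour, which makes it much easier to see that every open bracket has a partner.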
So, in order to switch on your rainbow parentheses you only have to do one simple step.
At the very top of your R screen (likely in the top left corner) where there are drop down menus like “File”, “Edit”, “Code”, “View”, etc. you are going to click on “Code”, and then find the option that says “Rainbow Parentheses”. Click on that, make sure it has a tick next to it, and Shakespearian magic, your brackets should all be some kind of colour now.
Here are screenshots to help:
As you can see above here is the list of drop down options.
And then below, the “rainbow parentheses” function is on, indicated by the tick.
Another fussy thing about R is that it wants commas to separate every argument in your function. This is something that can be easily missed and lead to annoying errors that take a frustrating amount of time to realise and find (I’m not talking from experience, what do you mean 0.0)
So try to get into the habit of putting in your commas, and checking that they’re there as you write your code. Don’t wait till you’ve written large portions of code, only to run it and be met with errors.
In this code I was creating a dataframe. Note how all my arguments have commas between them BUT all my functions also have commas between them. There’s a comma after the yellow bracket in the ‘time’ line, and a comma after the yellow bracket in the mean line. Had I not put those commas in, I would have certainly got an error.
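Since the screenshot is missing, here is a small sketch of what that kind of dataframe code can look like. The column names and numbers are invented for illustration, so treat it as a pattern rather than the original chunk.

```r
# Every argument inside data.frame() is separated by a comma,
# and so is every value inside c()
study <- data.frame(
  time  = c(1, 2, 3),          # comma after this closing bracket
  mean  = c(4.2, 5.1, 6.3),    # and after this one
  group = c("a", "b", "a")     # no comma needed after the last argument
)

study
```

Try deleting the comma at the end of the `time` line and running it again to see what the error looks like.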
#1. You will notice it has 3 backticks at the beginning AND end. These backticks indicate it’s a chunk of code. You need to have 3 at the top and the bottom in order for R to read it as a usable code chunk.
I encourage you to play around with it and see what happens if you don’t have the correct number of back ticks; errors are a great way to learn. Small spoiler: the code chunk doesn’t run o_0
#2. You will also notice it has curly brackets that contain the letter ‘r’ at the top. This just means it is an R code chunk. Because you can use different coding languages in R it is useful to have the type of code chunk labelled. R does it for you automatically. You may notice that some code chunks have other text in the curly brackets. This is mostly useful when you are creating interactive output files, or you need to use an entire code chunk and its output in a later portion of your code. It will help tell R which chunk it needs to work with. However, typically you will not need to worry about labelling your code chunks. Having the standard {r} in there is great.
#3. You will know a code chunk is usable when the backticks and your curly brackets are blue in colour. It should also have the typical options in the top right corner: the gear, the downwards arrow, and the green play button. All those mean it is a valid code chunk and can be used for code (this does not however mean your code has the same privileges. just like unpaid interns, sometimes code only runs errors >_< )
Your code chunks will not always need to be shown in your output HTML/PDF file so why is it sometimes necessary to add text to it?
Well, the text you put in your RMD notebook in all this blank space is usually the interpretations from your data, and the outputs of your code (for example, you coded a graph and now need to explain what that graph is telling us about your data). Interpretation and relevant answers will go into this RMD space.
But sometimes you need to know what is happening in your code, and if someone else had to use your code they would need to know what was happening too. So adding explanatory text by using the #-symbol and typing your explanation next to/above/below your code within the code chunk is a good way to keep your coding explanations neat and useful without cluttering up your RMD notebook.
In this code I was importing pdf files from a folder on my laptop into R and then asking R to read the text of those PDFs. As you can see the hashtagged text explains what is happening in the code. Now anyone who uses this code will know what is running and why we are running it. It also means they know what kind of output to expect.
It is important that any explanatory text within your chunks has a hashtag in front of it. Otherwise, R will attempt to read it as actual code and you will run into errors. The hashtag essentially tells R to ignore whatever comes after it.
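The original screenshot used PDF files, which need an extra package. As a stand-in, here is a rough sketch of the same idea with plain text files and base R only. The folder name, file names, and contents are all made up; the point is how the hashtagged lines explain each step.

```r
# Create a temporary folder with two small text files to read
# (a stand-in for a real folder of documents on your laptop)
folder <- file.path(tempdir(), "demo_docs")
dir.create(folder, showWarnings = FALSE)
writeLines("First document.", file.path(folder, "doc1.txt"))
writeLines("Second document.", file.path(folder, "doc2.txt"))

# List every .txt file in the folder, with its full path
files <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)

# Read the text of each file into a list, one element per file
texts <- lapply(files, readLines)

# Check how many files were read in
length(texts)
```

Everything after a # is ignored by R, so you can be as chatty as you like in there.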
When doing assignments, you will have to do a lot of interpretation which means paragraphs of explanation about why this is this way, how you got to this conclusion, and the price of eggs in Japan. Inevitably there will be typos, but luckily R has a handy spell check function of its own that will catch a good few of the typing errors your tired eyes missed.
Find it here:
Highly recommend using it!
You will need to install packages into R quite regularly depending on the kind of functions you need and this means you’ll need to get familiar with when to install and when not to install (very Hamlet of you).
(Note I’ve hashtagged these example codes because they cannot be run as is. However you would not have hashtags before your code when running these functions)
The first thing to ask is “Do I already have this package in my R programme?” The way to check that is simply by clicking on the “Packages” tab in the bottom right panel of your screen.
If the package appears in the list then it is installed and you DO NOT need to install it again (unless you reinstall R or move to a new version). A tick in the little box next to the package means it is also loaded for this session. If the package is not showing up in the package section at all, then you don’t have it and need to run the code to install it.
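If you would rather check with code than with the Packages tab, here is one possible sketch. `is_installed` is a helper name I made up (it is not built into R), and "readxl" is only a placeholder for whichever package you need.

```r
# Returns TRUE if the package is installed on this computer
# (is_installed is a made-up helper name, not a built-in function)
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")   # "stats" ships with R, so it is always there

# Install only if missing. Hashtagged so it does not run by accident;
# remove the # and swap in the package you need.
# if (!is_installed("readxl")) install.packages("readxl")
```

Note the inverted commas around the package name in install.packages(); it will not work without them.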
However, every single time you start a new project, or you restart your R you need to LOAD your packages. This means using the library function.
The library function only allows you to load one package at a time so you will have to have multiple different lines of code to run every package you want to. This is fine in the beginning when you are only using three or four packages but as you dive deeper into R and do more complicated analyses you will find your package list grows astronomically. Loading each individual package becomes tiresome.
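A minimal sketch of what that looks like. I am using stats and utils here only because they ship with R and so will run on any computer; in your own work these would be whatever packages your analysis needs.

```r
# One library() call per package.
# stats and utils come with R, so they stand in for your own packages here.
library(stats)
library(utils)

# search() lists everything that is currently loaded
search()
```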
Luckily, there is an easy way to do this so you don’t have to keep typing in “library()” for every package you want to load. I will give you the code chunk, and then explain it.
You will first need to load the pacman package as this allows you to use the p_load() command. Then using p_load() you can type all the packages you want to load for your work, separating them by commas, and run the entire chunk in one go.
The p_load command is especially useful because it overrides the need to separate the “installing” and “loading” process of a package. So yes, even if you don’t have a package installed, you can simply put the name into your p_load() function and if it’s not installed it will first install it and then load it! (ah how revolutionary is the world?!)
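The original chunk is not shown, so here is a sketch of how this could look. The `if` check around it is my own addition so the code does not fall over on a computer without pacman, and stats and utils are placeholders for your own package list.

```r
# install.packages("pacman")   # run once if pacman is not installed yet

if (requireNamespace("pacman", quietly = TRUE)) {
  library(pacman)
  # Package names are separated by commas; no inverted commas needed here
  p_load(stats, utils)
}
```

In your own notebook you would usually skip the `if` and just write the library(pacman) and p_load() lines.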
Just a quick note:
R has a LOT of functions, and sometimes the same functions can be found in multiple packages, which confuses R (like a newborn giraffe learning to walk), or some packages override other packages and end up becoming compressed in the R claw machine. So, there will be instances where you are trying to run a code and it’s just giving you a problem it can’t work out how to solve. Generally something like “Error: could not find function…”.
All you have to do is specify to R which package you want to use when you’re trying to execute this function.
The two colons (::) mean “in here” essentially. So your code is saying “in the readr package I am trying to use the function read_table”.
And that should do the trick!
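Here is a small sketch of the double colon in action. The first line uses the stats package because it comes with R; the readr line is hashtagged because it needs readr installed and a real file, and "my_file.txt" is just a made-up name.

```r
# package::function() tells R exactly which package to look in
middle <- stats::median(c(1, 3, 5))
middle

# The same idea with the example from the text
# (hashtagged because it needs the readr package and a real file):
# my_data <- readr::read_table("my_file.txt")
```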
As is the nature of data, and data collection, things can get messy and you might have to manipulate, change, and tidy your data before you can use it for analysis. Luckily R is a very understanding boss and has a bunch of ways to help you out with this.
One of the first things you’ll do when doing data analysis is, naturally, read in a data set that you can analyse. But there are different ways to read in data depending on the data file.
One of the cool things about R is that there is a bunch of datasets already stored in the programme, and in various packages in R. This is great, especially when you’re experimenting with code and trying to figure out how the whole world of R programming, data analysis, and coding works. Often times the data already in R (or R packages) has already been cleaned (to the best of the community’s ability) and there is very little data wrangling (wrestling and strangling put together O_O) you need to do in order to get your data to work for you.
A great way to check what datasets are available to you is to run this function:
This will list ALL the datasets available in R as well as a brief summary of what the data is.
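As a sketch of how this can look: data() with nothing in the brackets opens the full list. The second part is an optional extra of mine that narrows the list to the "datasets" package that ships with R and stores the names so you can search them.

```r
# data() on its own opens the full list of available datasets
# data()

# To keep things short, look only at the "datasets" package that ships with R
available <- data(package = "datasets")$results[, "Item"]

head(available)            # the first few dataset names
"mtcars" %in% available    # is mtcars one of them?
```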
To read in the datasets that are already in R is pretty simple. You don’t even have to assign it a name. It looks like this:
Running this code will put the dataset name in the “values” section of your global environment with an argument next to it that looks like “< Promise >” (romantic huh?).
But we know we don’t want our datasets to be in the values section, we want it in the section above titled “Data”. All we have to do is click on the dataset name in that values box and it will activate the actual dataset and move it up to the Data portion.
And now we have a dataset in our global environment to use when we’re ready.
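A minimal sketch of that step, assuming the built-in mtcars dataset as the example (the dataset used in the original chunk is not shown). Calling a function on the dataset does the same job as clicking on it.

```r
# Load a built-in dataset by name; no assignment needed
data("mtcars")

# Using the dataset (here, looking at the first rows) also "activates" it
head(mtcars)

# Number of rows and columns
dim(mtcars)
```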
However most often you’ll be using datasets that are not in R and need to be imported in, and although this process has a few more steps, it is just as uncomplicated as the previous one.
General structure:
For all these file types you will need to first assign a name to the data. This can be absolutely anything, but try to make it as simple as possible and avoid capital letters as they are a pain to remember and you end up becoming best friends with syntax errors (true avoidable horror).
Following this you put the function specific to the file type.
read.filetype("nameofdatafile_asitappears_inyourcomputerfolder.filetype").
So your code will look something like this:
If these waters are a little murky, don’t worry, practical examples follow below.
NB!! The most important thing to note is that any data you import like this should save directly into your “Data” section in your Global Environment. If it saves into your values section, you have not imported the data; you have likely given R just a “name” or a blank file of some kind. Be sure to check that your data is in your DATA section in your environment, and that it tells you the number of observations and variables. Clicking on the name in your environment will also automatically open a tab that shows you the full dataframe.
A fun hack for you:
You must have the inverted commas there for this to work.
Specific File Types
To read in a CSV file your code will look something like this:
Now in my data section of my global environment I will have a dataframe titled “happiness” that I can use for analysis.
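Since the original chunk and data file are not shown, here is a self-contained sketch. It first writes a tiny made-up CSV into a temporary folder (the countries and scores are invented) so that there is something to read back in; with your own data you would only need the read.csv() line.

```r
# Write a tiny made-up CSV to a temporary folder so there is something to read
csv_path <- file.path(tempdir(), "happiness.csv")
write.csv(
  data.frame(country = c("A", "B", "C"), score = c(7.1, 6.4, 5.9)),
  csv_path,
  row.names = FALSE
)

# name <- read.filetype("file name in inverted commas")
happiness <- read.csv(csv_path)

# Check that it came in as a dataframe with the right columns
str(happiness)
```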
To import Excel files you need the readxl package. Remember, if you don’t have it installed you need to install it first and then load it in R using your library function.
Note that we use underscores for excel data and not periods. This is because it is from a specific package and the maker of the function chose to use underscores instead of dots. Not really an issue as you will see suggestions come up as you’re typing the function and that will show you the correct format but this is good to be aware of all the same.
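Here is a rough sketch, assuming the package in question is readxl. Rather than inventing a file, it borrows one of the example workbooks that readxl ships with; the `if` check is my own addition so nothing breaks if readxl is not installed.

```r
if (requireNamespace("readxl", quietly = TRUE)) {
  # readxl comes with a few example workbooks; this finds the path to one
  xlsx_path <- readxl::readxl_example("datasets.xlsx")

  # Note the underscore in read_excel()
  first_sheet <- readxl::read_excel(xlsx_path)

  head(first_sheet)
}
```

With your own data you would replace `xlsx_path` with the name of your Excel file in inverted commas.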
Why “table”? Because .txt files are simply text, we need to make them into a dataframe. We can shape the text in any way we want as we import it, and typically we see dataframes as tables. Therefore when we read the file in we say “read as table” (read.table()).
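A small sketch of read.table() at work. The file and its contents are made up and written to a temporary folder first, so the example runs on its own; `header = TRUE` tells R that the first line holds the column names.

```r
# Write a tiny made-up text file so there is something to read
txt_path <- file.path(tempdir(), "scores.txt")
writeLines(c("name score", "ana 12", "ben 15"), txt_path)

# Read the text in as a table; the first line becomes the column names
scores <- read.table(txt_path, header = TRUE)

scores
```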
And there you go! Expert at reading data into R!
Often times we need to know what type of variable(s) we’re dealing with. Especially if we want to conduct statistical tests on it. For example, it’d be really hard to calculate a correlation coefficient on a character variable type (the correlation is the alphabet…I guess?).
Luckily there are multiple ways to build a bridge and to check what kind of variables you’re handling. See the code chunks below:
## 'data.frame': 32 obs. of 5 variables:
## $ class : Factor w/ 4 levels "1st","2nd","3rd",..: 1 2 3 4 1 2 3 4 1 2 ...
## $ sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 2 2 2 1 1 ...
## $ age : Factor w/ 2 levels "Child","Adult": 1 1 1 1 1 1 1 1 2 2 ...
## $ survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ freq : num 0 0 35 0 0 0 17 0 118 154 ...
## Rows: 32
## Columns: 5
## $ class <fct> 1st, 2nd, 3rd, Crew, 1st, 2nd, 3rd, Crew, 1st, 2nd, 3rd, Crew…
## $ sex <fct> Male, Male, Male, Male, Female, Female, Female, Female, Male,…
## $ age <fct> Child, Child, Child, Child, Child, Child, Child, Child, Adult…
## $ survived <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, N…
## $ freq <dbl> 0, 0, 35, 0, 0, 0, 17, 0, 118, 154, 387, 670, 4, 13, 89, 3, 5…
Note: str() is part of the base R functions; glimpse() is part of the “dplyr” package.
Both of these outputs will list all your variables in the first column, then the variable type, and then the actual values of each observation (aka the actual data).
Note the differences between the two outputs when it comes to listing the actual observations. The str tells you the “names” the observations are called and then gives you the level numbers for each observation when listing the actual data. Glimpse on the other hand gives you the “names” of the data as they appear in the dataframe.
For example, in this dataset the variable age is classified as 1 = Child and 2 = Adult, so you are either a child OR an adult. In str it tells you that age is a factor with 2 levels “Child” or “Adult”, and then lists 1’s or 2’s for each observation.
However in glimpse it tells you age is a factor (<fct>), and then lists “Child” or “Adult” for each observation.
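The chunks that produced those outputs are not shown, so here is a sketch that gives the same shape of result. It assumes `titanic` was built from R's built-in Titanic table with the column names switched to lower case, which is my guess based on the output above. The glimpse() call only runs if dplyr is installed.

```r
# Assumption: titanic is built from R's built-in Titanic table
titanic <- as.data.frame(Titanic)
names(titanic) <- tolower(names(titanic))

# Base R: structure of the dataframe
str(titanic)

# dplyr: same idea, but shows the labels instead of the level numbers
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(titanic)
}
```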
But sometimes we have really large datasets with 20+ variables, for example, and we just want to look at one variable type, or we simply do not want to clutter up our RMD space with a massive output that lists every single variable. So we use a more succinct function:
## [1] "factor"
Now we know age is a factor, which is helpful because then we know it’s a categorical variable and not continuous, which means we have groupings in our data.
Note: You cannot pass two datasets or two variables into any of these functions at once. So if you wanted to glimpse two different datasets like titanic and mtcars you could not run code that reads str(titanic, mtcars), nor could you run glimpse(titanic, mtcars). If you wanted to find out the class for the age AND sex categories in the titanic dataset you cannot ask your code to run class(titanic$age, titanic$sex). It will give you an error. So you would need to run a separate line of code for each dataset/variable you want to know about.
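A sketch of the one-at-a-time approach, again assuming `titanic` comes from the built-in Titanic table. The sapply() line at the end is an extra of mine, not something covered above: it is one way to run class() on every column in a single line.

```r
# Assumption: titanic is built from R's built-in Titanic table
titanic <- as.data.frame(Titanic)
names(titanic) <- tolower(names(titanic))

# One variable per call
class(titanic$age)
class(titanic$sex)

# One way to check every column in one go: apply class() to each column
column_types <- sapply(titanic, class)
column_types
```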
The types of variables you can get are:
| Sym | Explanation |
|---|---|
| int | integer |
| chr | character |
| dbl | double (real numbers) |
| fct | factor |
| dttm | date & time |
Now that we know how to check what our data is we can move on to what to do when we don’t want it to be that. This might be the closest we’ll ever get to playing Creator of the Universe so use these functions with abandon!
We can manipulate our data in all sorts of ways, but there are a few common ways you will come across time and again.
Let’s stick with the titanic data for now, and run through a scenario.
Let’s say, we saw the variables and variable types when we glimpsed our data set, but when importing the data the researchers didn’t make “sex” a factor, and so, it being words, read into R as a character variable. While yes, “male” and “female” are characters because they’re just a bunch of letters, for the purpose of data analysis we need them to mean something to us.
## [1] "character"
For example if we wanted to compare the rate of survival between males and females, we would not be able to do that if the variable sex was just characters. We need them to be actual categories (or factors) in our data set.
So we have to manipulate the data so we can use that variable in our analysis. And how we do that is pretty simple and standardised no matter how you want to change your variable.
# dataframe$variable <- as.typeyouarechangingto(dataframe$variable)
titanic$sex <- as.factor(titanic$sex)
class(titanic$sex)
## [1] "factor"
Now it is a factor and we can use it for analysis.
Note: I have written the above code so that when I changed the variable sex from character to factor, I changed the exact same variable in the data frame. However, sometimes we want a variable to stay as one type, but also have it be another type. For this we need to add a variable to our data frame. And wonderfully, it is not complicated at all.
We’ll keep with the sex variable. Say we wanted the sex variable as both a character variable AND as a factor. All we need to do is create a separate column where sex reads as a factor.
Note that by putting the dataframe first and then the new_variable_name I am telling R where to put the new variable. If I simply said “sexfactor <- as.factor(titanic$sex)” this would have saved as an object in our “values” section of the global environment (top right panel) and would not have saved into our titanic dataframe.
## [1] "character"
## [1] "factor"
## Rows: 32
## Columns: 6
## $ class <fct> 1st, 2nd, 3rd, Crew, 1st, 2nd, 3rd, Crew, 1st, 2nd, 3rd, Cre…
## $ sex <chr> "Male", "Male", "Male", "Male", "Female", "Female", "Female"…
## $ age <fct> Child, Child, Child, Child, Child, Child, Child, Child, Adul…
## $ survived <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, …
## $ freq <dbl> 0, 0, 35, 0, 0, 0, 17, 0, 118, 154, 387, 670, 4, 13, 89, 3, …
## $ sexfactor <fct> Male, Male, Male, Male, Female, Female, Female, Female, Male…
And Grecian magic, here we have a character sex variable, a factor sex variable, and we’ve added in another column to our dataframe.
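The chunk itself is missing, so here is a sketch of the steps that would give output like the above. It assumes `titanic` comes from the built-in Titanic table, and it turns sex into a character variable first to recreate the scenario.

```r
# Assumption: titanic is built from R's built-in Titanic table
titanic <- as.data.frame(Titanic)
names(titanic) <- tolower(names(titanic))

# Recreate the scenario: sex was read in as a character variable
titanic$sex <- as.character(titanic$sex)

# dataframe$new_variable_name <- as.type(dataframe$variable)
titanic$sexfactor <- as.factor(titanic$sex)

class(titanic$sex)         # the original column is untouched
class(titanic$sexfactor)   # the new column is a factor
```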
What do the dollar signs ($) mean? They work similarly to how the double colon (::) works for packages. You are essentially telling R “in this object” pull out “this thing”. So double colons work for packages, and dollar signs work for datasets.
There are quite a few symbols in R that get used for various code functions. In the table below I present the most frequent and basic ones:
Note: I am still using the ‘titanic’ data for examples.
| Sym | Use | Explanation | Code_Example | Example |
|---|---|---|---|---|
| & | and | This condition AND this condition | filter(class == "1st" & sex == "Female") | Give me data for rows that are ‘1st Class’ AND ‘Female’ |
| \| | or | This condition OR this condition | case_when(class == "2nd" \| class == "Crew" ~ "A") | If class is 2nd or Crew give them the letter A |
| ! | not | NOT this condition | filter(!(class == "Crew")) | Do NOT give me data for ‘Crew’ |
| == | equal to | anything EXACTLY THE SAME as this value | filter(class == "3rd") | Give me data for ONLY ‘3rd Class’ |
| != | not equal to | anything EXCEPT this value | filter(freq != 0) | Give me all data EXCEPT if their frequency equals 0 |
| %in% | in this vector | any value found WITHIN this vector | filter(class %in% c("2nd", "3rd", "Crew")) | Give me ‘2nd, 3rd, and Crew’ in ‘class’ |
| < / > | less than / greater than | if the value is less/greater than this then keep | filter(freq < 20) | Give me data for frequency LESS THAN 20 |
| <= / >= | less/greater than or equal to | if the value is less/greater than or equal to this then keep | filter(freq >= 45) | Give me data for frequency GREATER than or EQUAL to 45 |
Note: a single equal sign (=) and a double equal sign (==) cannot be used interchangeably. A single = means you are assigning a value to something (for example, to an argument inside a function). A double == asks whether two things are equal, which is how you pull out that specific element of your data.
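To see a few of these symbols at work, here is a sketch using base R square brackets rather than filter(), so that it runs without any extra packages. It assumes `titanic` comes from the built-in Titanic table; the object names on the left are made up.

```r
# Assumption: titanic is built from R's built-in Titanic table
titanic <- as.data.frame(Titanic)
names(titanic) <- tolower(names(titanic))

# == : keep only the 3rd class rows
third <- titanic[titanic$class == "3rd", ]

# & : both conditions must be true
first_female <- titanic[titanic$class == "1st" & titanic$sex == "Female", ]

# | : either condition can be true
second_or_crew <- titanic[titanic$class == "2nd" | titanic$class == "Crew", ]

# %in% : class is any one of the listed values
not_first <- titanic[titanic$class %in% c("2nd", "3rd", "Crew"), ]

# != : drop the rows where freq is 0
non_zero <- titanic[titanic$freq != 0, ]

# How many rows did each one keep?
nrow(third)
nrow(not_first)
```

The comma before the closing square bracket matters: it means “these rows, all columns”.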
Hopefully I’ve covered the very very basics, and it doesn’t feel so overwhelming. The good news is that all of this becomes habit after a few tries! And then the fun of making graphs and interpreting findings takes you by storm.
Take it slow, use YouTube when things don’t make sense, use your tutors when sense stops making you, and be patient with yourself.
Here are some incredible resources for R that truly make life worth living (again 0_0):
An Introduction to Statistical Learning with Applications in R. Textbook by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
Also remember to rest your eyes, and really have fun with it. Coding is right at your fingertips now :)
I, Ciara Flashman, permit free, credited use of this document. I give permission to share indiscriminately, provided no unauthorised edits, deletions, or otherwise changes are made without my prior written consent.
If you would like to suggest an edit, or additions, please contact me at ciaraflashman16@gmail.com.
[1] You can change the appearance of your R by going to Tools > Global Options > Appearance > Editor Theme.