Guide to Using R for the first time
R and RStudio are two distinctly different applications that serve different purposes.
R is the software that performs the actual instructions. Without R installed on your computer or server, you would not be able to run any commands.
RStudio is software that provides an interface to R. It’s sometimes referred to as an Integrated Development Environment (IDE). Its purpose is to provide bells and whistles that can improve your experience with the R software.
If you need R and RStudio on your work laptop you will need to raise a service request. To download them on a personal laptop the following links can be used.
Download R - https://cran.r-project.org/bin/windows/base/
Download RStudio - https://www.rstudio.com/products/rstudio/download
When you first open RStudio you will need to open a new script. This can be done by clicking on File in the top left hand corner, then New File, then R Script. (There are other file types you can choose, such as R Markdown, which we won’t be covering in this session.) You will get a blank script screen.
The top left hand box is the script, where you can write your code. The console in the bottom left corner can be used to test and run some simple commands.
For example, if you type 2+2 next to the blue > and then press return, it will return the answer 4. Anything you type in the console at the bottom won’t be saved when you close the script.
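As a small illustration of the kind of command you might try in the console, the lines below do some arithmetic and store a result. The object name total is just an example.

```r
# Simple arithmetic, as you might type it next to the > prompt
2 + 2

# Store a result in a named object so it can be reused
total <- 2 + 2

# Use the stored value in another calculation
total * 10
```

If you want to keep commands like these, write them in the script rather than the console.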
R has many built-in functions you can use to perform specific tasks, but additional packages are available that can make these tasks easier to perform or provide functionality that the base R functions do not have.
In the script you can write notes to go with your code. If you add # in front of any text, R will treat it as a note and will not run it.
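A short sketch of how notes can sit alongside code in a script. The object names radius and area are made up for this example.

```r
# This whole line is a note and is ignored by R
radius <- 3             # a note can also follow code on the same line
area <- pi * radius^2   # area of a circle with that radius
area
```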
You can save the script you are working on by clicking on File and then Save As. You can then give the file a name and select the folder you want to save it in.
To open a script you have worked on recently, go to File and choose Recent Files, which lists the scripts you have worked on recently.
You can also use the Files tab on the lower right hand side of the screen to navigate to the folder you saved your script in.
Packages are collections of R functions, data, and compiled code in a well-defined format, created to add specific functionality. There are 10,000+ user contributed packages and growing.
Packages for R can be installed from the CRAN package repository in two ways. Depending on where R is installed, you may need to reinstall a package each time you open R. You can set up a personal library for your packages to be installed into so that they don’t need to be installed each time. This won’t be covered in this session, but we can provide details on how to do this.
install.packages("tidyverse")
Once a package is installed, you need to use the library function to load it in your script. The code below loads the tidyverse package.
library(tidyverse)
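If you do not want to reinstall a package that is already present, one option is to check for it first. The sketch below uses a small helper called install_if_missing, which is a made-up name for this example and not part of R or tidyverse. It relies on requireNamespace, which returns TRUE when a package is available.

```r
# Hypothetical helper: install a package only when it is not already available
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is now available
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Make sure tidyverse is installed, then load it
install_if_missing("tidyverse")
library(tidyverse)
```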
If you are using more than one package then you need to load them all at the start of your script. Each one should have library and then the package name in brackets. This example shows how to load both ggplot2 and dplyr.
library(ggplot2)
library(dplyr)
Within a package there are functions which are “self contained” modules of code that accomplish a specific task. Functions usually take in some sort of data structure (value, vector, dataframe etc.), process it, and return a result. For each package in R there is documentation produced which provides information on the different functions it contains and gives examples of how to use the function. These can be accessed in two main ways.
help(package="dplyr")
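Help is also available for individual functions. The lines below are a sketch of some common ways to reach it, using the base R function mean as the example. The dplyr line assumes dplyr is installed.

```r
# Open the help page for a single function (two equivalent forms)
?mean
help("mean")

# Name the package as well when you want that package's version of a function
help("summarise", package = "dplyr")

# Run the worked examples from the bottom of a help page
example("mean")
```

In RStudio the help page opens in the Help tab in the lower right hand pane.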
The working directory is the file path for the directory you are working in on your computer. This is the location R will look in for any files you are trying to read in. It is also where R will store any output you save. Note: you can specify the full file path when loading or saving if the working directory is not the folder you want.
You can see what the current working directory of R is by using the following code.
getwd()
You can set the working directory to a specific folder on your computer, for example a project folder where you have saved data that you would like to load into R. To do this you would use the following code. This sets my working directory to a folder called R saved in my user folder on my C drive. You need to replace Cypher in the code with your own cypher. (Work files and any sensitive information should not be saved on your C drive; this is for example purposes only.) NOTE - the slashes in the directory path are forward slashes, not the backslashes you get when you copy a file path from your computer.
setwd("C:/Users/Cypher/R")
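The sketch below shows one way to change the working directory temporarily and then put it back. It uses R's temporary folder so that it runs on any computer. The path built with file.path is hypothetical and only illustrates that R joins the parts with forward slashes.

```r
# Remember the current working directory so it can be restored later
old_wd <- getwd()

# Switch to R's temporary folder (it always exists), purely for illustration
setwd(tempdir())
getwd()

# Build a file path from its parts; file.path joins them with forward slashes
data_path <- file.path("C:", "Users", "Cypher", "R", "Mydata.csv")
data_path

# Go back to the original working directory
setwd(old_wd)
```

On Windows a path written with doubled backslashes, such as "C:\\Users\\Cypher\\R", also works, but forward slashes are easier to type and read.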
You can load many data formats into R. This session will cover reading in CSV and Excel files as these are the two most common file types you might want to load.
The readr package, which is loaded as part of tidyverse, has a function called read_csv for reading in CSV files (see code below). Base R also has a built-in equivalent called read.csv. If you have set your working directory to the folder your data is saved in, then you can use the following code and just add the name of the CSV file in the brackets. When you load the data in you need to give the dataframe you are creating in R a name. For example, if you wanted to import a CSV file called Mydata and call the table first_table, you would use the following code. In the code you put the name you want to call the table in R first, followed by <-
first_table <- read_csv("Mydata.csv")
If the data you want to load is in a different location to your working directory, then you need to add the file path to where the data is saved. For example, if you wanted to load a CSV called Mydata2 that was saved in a folder called CSVS in your personal folder, and you wanted to call it second_table, you would use the following code. Both the file path and the name of the CSV need to be included inside the quotation marks. REMEMBER that the slashes in the file path should be forward slashes and not backslashes.
second_table <- read_csv("C:/Users/cypher/CSVS/Mydata2.csv")
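To try reading a CSV without needing a file of your own, the sketch below first writes a tiny made-up table to R's temporary folder and then reads it back, once with read_csv from readr and once with base R's read.csv. The data, the temporary location and the name first_table_base are assumptions for this example only.

```r
library(readr)

# Write a small example CSV to a temporary file (made-up data)
csv_path <- file.path(tempdir(), "Mydata.csv")
write_csv(data.frame(name = c("a", "b", "c"), value = c(1, 2, 3)), csv_path)

# Read it back with readr; show_col_types = FALSE hides the column summary message
first_table <- read_csv(csv_path, show_col_types = FALSE)

# The same file read with base R, which needs no extra packages
first_table_base <- read.csv(csv_path)

nrow(first_table)         # number of rows read
names(first_table_base)   # column names
```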
Once your data has been loaded you will see it in the environment window in the top right of your screen. This tells you how many rows the table has (obs, short for observations) and how many columns (variables). If you click on the name of the table you can then view the table.
To load an Excel file you can use a package called readxl (alternative packages are also available). readxl is installed when you install tidyverse, but it is not loaded by library(tidyverse), so you need to run library(readxl) before using it. As above, if you have set your working directory to the folder your Excel spreadsheet is saved in, then you just need to include the name of the Excel file in the brackets. For example, if you want to import Mydata3 and call it third_table, you would use the following code.
third_table <- read_excel("Mydata3.xlsx")
Again, if the Excel spreadsheet is saved in a different location to your working directory, then you need to add the file path to where the data is saved. For example, if you wanted to load an Excel spreadsheet called Mydata4 that was saved in a folder called EXCEL in your personal folder, and you wanted to call it fourth_table, you would use the following code.
fourth_table <- read_excel("C:/Users/Cypher/EXCEL/Mydata4.xlsx")
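Excel workbooks often have more than one sheet. The sketch below uses an example workbook that ships with readxl, so no file of your own is needed. It lists the sheets and reads one of them by name. The object names xlsx_path and cars_sheet are made up for this example.

```r
library(readxl)

# readxl includes example workbooks; readxl_example() returns the path to one
xlsx_path <- readxl_example("datasets.xlsx")

# List the sheets in the workbook
excel_sheets(xlsx_path)

# Read a specific sheet by name (if sheet is not given, the first sheet is read)
cars_sheet <- read_excel(xlsx_path, sheet = "mtcars")
head(cars_sheet)
```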
For this section we are going to use one of the data sets pre-built in R called mtcars, but any of the following analysis could be done on data sets you have loaded yourself. To load the data we use the following code. The functions used in this section are from the dplyr package that we installed via the tidyverse package.
mtcars <- mtcars
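Before analysing a table it can help to look at its size and contents. The lines below are a sketch using base R functions on mtcars.

```r
# First six rows of the table
head(mtcars)

# Number of rows and columns
dim(mtcars)

# Column names
names(mtcars)

# Data type of each column, with a preview of the values
str(mtcars)

# Minimum, quartiles, mean and maximum of one column
summary(mtcars$mpg)
```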
If you want to get the sum of a column you can use the summarise function in dplyr. The following code will give you a data output item with the sum of all of the cylinders of the 32 cars in the mtcars data set. In this example we are creating a data object called num_cyl to keep a record of the value. We also need to use a pipe to pass the table to the summarise function. The pipe is %>% (you might also see R's native pipe, which is written |>). The keyboard shortcut to create the pipe symbol in RStudio is Ctrl+Shift+M.
num_cyl <- mtcars %>%
  summarise(sum(cyl))
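When summarise is used without naming the result, the output column takes the name of the expression, such as sum(cyl). The sketch below names the output columns instead and calculates two summaries at once. The names num_cyl_named, total_cyl and mean_mpg are made up for this example.

```r
library(dplyr)

# Give each summary a column name of your choice
num_cyl_named <- mtcars %>%
  summarise(
    total_cyl = sum(cyl),   # sum of the cyl column
    mean_mpg  = mean(mpg)   # average of the mpg column
  )

num_cyl_named

# Pull a single value out of the result
num_cyl_named$total_cyl
```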
If you want to filter rows from the table, for example you only want to look at cars with 4 cylinders the following code can be used. A double = is used for an exact match.
Cylinders4 <- mtcars %>%
  filter(cyl == 4)
If you want to filter on more than one column, for example if we want cars with 4 cylinders that get more than 30 mpg, then the following code can be used. The & means both conditions must be true.
Cylinders4_mpgn <- mtcars %>%
  filter(cyl == 4 & mpg > 30)
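Conditions can be combined in other ways too. The sketch below shows a comma (which works like &), the | symbol for "or", and %in% for matching any value in a list. The object names are made up for this example.

```r
library(dplyr)

# A comma between conditions means AND, the same as &
both <- mtcars %>%
  filter(cyl == 4, mpg > 30)

# | means OR: keep rows where either condition is true
either <- mtcars %>%
  filter(cyl == 4 | mpg > 30)

# %in% keeps rows whose value matches any item in the list
four_or_six <- mtcars %>%
  filter(cyl %in% c(4, 6))

# Number of rows kept by each filter
nrow(both)
nrow(either)
nrow(four_or_six)
```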
If you want to count the number of cars by the number of cylinders they have, then you can use the following code. In group_by you add the column name that you want to group by, in this example the number of cylinders (cyl). We then use the summarise function to count the number of cars. In the brackets you add the name you want the new column to have, and = n() counts the number of rows for each number of cylinders.
Num_cars <- mtcars %>%
  group_by(cyl) %>%
  summarise(number_of_cars = n())
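dplyr also has a count function that does the grouping and counting in one step. The sketch below should give the same counts as group_by followed by summarise. The name num_cars_count is made up for this example.

```r
library(dplyr)

# count() groups by cyl and counts the rows in each group;
# name sets the name of the new count column
num_cars_count <- mtcars %>%
  count(cyl, name = "number_of_cars")

num_cars_count
```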
We will use the ggplot2 package to draw a bar chart to show the number of cars by the number of cylinders they have. geom_bar is used to draw a bar chart; if you wanted to draw a line graph instead you would use geom_line. This session just covers the very basics of drawing a graph. At the end of the document there is a link to a cheat sheet which contains example code and details of other options and parameters you can change and include in graphs.
ggplot(data = Num_cars, aes(x = cyl, y = number_of_cars)) +
  geom_bar(stat = "identity")
We can see the scale for the cylinders is continuous and not specific to the 3 cylinder sizes we have in our data. This happens when you plot a number on the x axis. To correct this we need to change the data type of this field to a factor, so that each number is recognised as an individual category. This can be done using the mutate function.
Num_cars <- Num_cars %>%
  mutate(cyl = as.factor(cyl))
If you then replot the graph the scale of the x axis is now correct.
ggplot(data = Num_cars, aes(x = cyl, y = number_of_cars)) +
  geom_bar(stat = "identity")
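To see what the conversion to a factor changes, the sketch below rebuilds the count table under a different name (num_cars_demo, made up for this example) and checks the data type before and after. It also shows an alternative: converting inside aes with factor(), and letting geom_bar count the rows itself, which is what it does when stat is not set.

```r
library(dplyr)
library(ggplot2)

# Rebuild the count table so this example runs on its own
num_cars_demo <- mtcars %>%
  group_by(cyl) %>%
  summarise(number_of_cars = n())

class(num_cars_demo$cyl)    # data type before the conversion

num_cars_demo <- num_cars_demo %>%
  mutate(cyl = as.factor(cyl))

class(num_cars_demo$cyl)    # data type after the conversion
levels(num_cars_demo$cyl)   # the distinct categories the factor holds

# Alternative: convert inside aes() and let geom_bar() count rows of mtcars
ggplot(data = mtcars, aes(x = factor(cyl))) +
  geom_bar()
```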
If you want to change the colour of your bars you can specify the colour; this example uses steelblue. By using the fill argument you can manually select the colour. You can view the colours available for ggplot2 here. https://r-graph-gallery.com/ggplot2-color.html
ggplot(data = Num_cars, aes(x = cyl, y = number_of_cars)) +
  geom_bar(stat = "identity", fill = "steelblue")
If you want your bars to be different colours you can set fill to a column inside aes and ggplot2 will use a default set of colours. There are also different colour palettes available and you can create your own colour themes.
ggplot(data = Num_cars, aes(x = cyl, y = number_of_cars, fill = cyl)) +
  geom_bar(stat = "identity")
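One way to choose your own colours for each bar is scale_fill_manual, which takes a named list of colours, one per category. The sketch below rebuilds the count table under a made-up name (cyl_counts) so that it runs on its own. The three colours are arbitrary choices, and geom_col is used as a shorter equivalent of geom_bar(stat = "identity").

```r
library(dplyr)
library(ggplot2)

# Count cars per number of cylinders and make cyl a factor
cyl_counts <- mtcars %>%
  count(cyl, name = "number_of_cars") %>%
  mutate(cyl = as.factor(cyl))

# One colour per category; the names must match the factor levels
cyl_colours <- c("4" = "steelblue", "6" = "darkorange", "8" = "firebrick")

ggplot(data = cyl_counts, aes(x = cyl, y = number_of_cars, fill = cyl)) +
  geom_col() +
  scale_fill_manual(values = cyl_colours)
```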
You can give your graph a title by using ggtitle. You can change the location and size of the title but we won’t be covering that in this session.
ggplot(data = Num_cars, aes(x = cyl, y = number_of_cars, fill = cyl)) +
  geom_bar(stat = "identity") +
  ggtitle("Number of cars by number of Cylinders")
You can also change the axis labels using xlab and ylab.
ggplot(data = Num_cars, aes(x = cyl, y = number_of_cars, fill = cyl)) +
  geom_bar(stat = "identity") +
  ggtitle("Number of cars by number of Cylinders") +
  xlab("Number of Cylinders") +
  ylab("Number of Cars")
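The title, axis labels and legend title can also be set together with the labs function. The sketch below is one way to do that. It rebuilds the count table under a made-up name (cyl_counts) so that it runs on its own, and stores the plot in an object called p so it can be printed or reused.

```r
library(dplyr)
library(ggplot2)

# Count cars per number of cylinders and make cyl a factor
cyl_counts <- mtcars %>%
  count(cyl, name = "number_of_cars") %>%
  mutate(cyl = as.factor(cyl))

# labs() sets the title, both axis labels and the legend title in one call
p <- ggplot(data = cyl_counts, aes(x = cyl, y = number_of_cars, fill = cyl)) +
  geom_col() +
  labs(
    title = "Number of cars by number of Cylinders",
    x     = "Number of Cylinders",
    y     = "Number of Cars",
    fill  = "Cylinders"
  )

# Typing the object name draws the plot
p
```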