Dr. J. Kavanagh
2023-09-09
This document sets out the basic commands of R through a series of examples.
You must have the following installed on your computer before we begin:
The appropriate links for these downloads are provided in the Moodle page for this module.
When you open RStudio for the first time there are four windows, which you can rearrange to your own preference. The same applies to the text colour, font type and size, so the interface is very customisable. RStudio uses the memory of your computer, and you need to set a Working Directory for your analysis.
Go to Console and type the following:
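The command itself is not shown in this copy of the notes. Assuming the intended command is base R's getwd(), which reports the current working directory, it can be sketched as follows (the object name current_dir is only illustrative):

```r
# Ask R which folder it is currently reading from and writing to
current_dir <- getwd()
current_dir
```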
## [1] "/Users/jackkavanagh/Dropbox/R_Business"
My advice is to create a specific folder and link RStudio to it using the Session -> Set Working Directory -> Choose Directory… option in the Menu Bar. Otherwise you can specify the folder in the console as follows:
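The console command is missing from this copy. A minimal sketch, assuming base R's setwd() is what was intended, is given here. The folder used is tempdir(), chosen only so that the example runs on any machine; in practice you would substitute the path to your own project folder.

```r
# tempdir() is a stand-in so the example runs anywhere;
# replace it with the path to your own project folder
project_dir <- tempdir()

# Point R at that folder, then confirm the change
setwd(project_dir)
getwd()
```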
This lecture is designed to show you the potential for using R for statistical analysis. You must have the following libraries installed via RStudio before we begin. Run the following command to install the necessary packages.
# Package names must be combined into a single character vector with c()
install.packages(c("tidyverse", "ggthemes", "historydata", "lubridate"))
# The library() command will load the relevant libraries
library(tidyverse)
library(ggthemes)
library(historydata)
library(lubridate)
Since the creation of the ‘Tidyverse’ by Hadley Wickham there has been a trend to teach solely within the tidyverse framework, which involves concentrating on a number of small, interlinked packages in R. This approach, while useful for creating a basic overview of R, prevents users from understanding the full potential of R for analysis and leads to errors that can often be quickly solved using base R commands.
The following definitions were set out by Deborah Nolan in ‘An introduction to programming in R (2019)’:
In R, vectors are the primitive objects. A vector is simply an ordered collection of values grouped together into a single container. Some primitive types of vectors are numeric, logical, and character. A very important characteristic of these vectors is that they can only store values of the same type. A vector contains values that are homogeneous primitive elements. That is, a numeric vector contains only real numbers, a logical vector stores values that are either TRUE or FALSE, and character vectors store strings.
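As a short illustration of the definition above, the sketch below builds one vector of each primitive type and then shows what happens when types are mixed. The object names are illustrative only.

```r
# One vector of each primitive type mentioned above
numeric_vec   <- c(1.5, 2, 3.25)
logical_vec   <- c(TRUE, FALSE, TRUE)
character_vec <- c("a", "b", "c")

class(numeric_vec)
class(logical_vec)
class(character_vec)

# A vector can hold only one type, so mixed values are
# silently converted to a single common type (here, character)
mixed_vec <- c(1, "two", TRUE)
class(mixed_vec)
```

Because of this conversion, a single stray text value in a column of numbers turns the whole vector into character strings.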
To sum up:
The Console is where you type in commands and after hitting return you get results. As R was originally designed for statistical purposes, it has fully operational calculator functions as follows:
+ = same
- = same
* = multiply
/ = divide
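The expressions that produced the results printed below are not shown in this copy. The following are hypothetical expressions, chosen only because they give the same values, to illustrate each operator:

```r
# Hypothetical inputs; the original expressions are not known
2 + 2      # addition
25 - 5     # subtraction
13 * 4     # multiplication
12 / 2     # division
pi         # built-in constant
pi * 3     # constants can be used in calculations
3 - 2
```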
## [1] 4
## [1] 20
## [1] 52
## [1] 6
## [1] 3.141593
## [1] 9.424778
## [1] 1
You can assign functions and results to a named vector in R using the <- operator.
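The assignment code is not shown in this copy. A minimal sketch, assuming object names x and y (both hypothetical), that gives the two results printed below:

```r
# Store the result of a calculation under the name x
x <- pi * 3
x

# Store a sequence of numbers under the name y
y <- c(1, 2, 3, 4, 5)
y
```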
## [1] 9.424778
## [1] 1 2 3 4 5
Loops are essential automated processes that can analyse multiple datasets. They often appear off-putting but are quite simple to understand if broken down into the distinct types that are available.
## [1] 1
Now this loop is going to add 4 to the r_loop vector 15 times.
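The loop itself is missing from this copy. A sketch of a for loop that matches the description, assuming r_loop starts at 1 as the printed output suggests:

```r
# Starting value, as shown in the output above
r_loop <- 1

# Repeat 15 times: add 4 to r_loop and print the running total
for (i in 1:15) {
  r_loop <- r_loop + 4
  print(r_loop)
}
```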
## [1] 5
## [1] 9
## [1] 13
## [1] 17
## [1] 21
## [1] 25
## [1] 29
## [1] 33
## [1] 37
## [1] 41
## [1] 45
## [1] 49
## [1] 53
## [1] 57
## [1] 61
# This creates a character vector
r_loop_2 <- c("Ringo", "John", "Paul", "George", "Linda",
              "Janice", "Ella", "Sarah", "Barbara")
for (i in r_loop_2) { # Loop over character vector
  print(paste("The name", i, "consists of", nchar(i), "characters."))
}
## [1] "The name Ringo consists of 5 characters."
## [1] "The name John consists of 4 characters."
## [1] "The name Paul consists of 4 characters."
## [1] "The name George consists of 6 characters."
## [1] "The name Linda consists of 5 characters."
## [1] "The name Janice consists of 6 characters."
## [1] "The name Ella consists of 4 characters."
## [1] "The name Sarah consists of 5 characters."
## [1] "The name Barbara consists of 7 characters."
You can tell your loop either to skip a step in the sequence or to end partway through it, using the very simple syntax of next or break.
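The original loops are not shown in this copy. The sketch below is one way to produce the output that follows, assuming both loops run over 1 to 7 and act on the value 4. The helper vectors reached and kept are hypothetical and only record which values were printed.

```r
# break: leave the loop entirely once i reaches 4
reached <- integer(0)
for (i in 1:7) {
  if (i == 4) break
  reached <- c(reached, i)
  print(paste("Sin é", i))
}

# next: skip the rest of this iteration when i is 4, then carry on
kept <- integer(0)
for (i in 1:7) {
  if (i == 4) next
  kept <- c(kept, i)
  print(paste("Anseo", i))
}
```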
## [1] "Sin é 1"
## [1] "Sin é 2"
## [1] "Sin é 3"
## [1] "Anseo 1"
## [1] "Anseo 2"
## [1] "Anseo 3"
## [1] "Anseo 5"
## [1] "Anseo 6"
## [1] "Anseo 7"
Sample data is included with R and with individual libraries. Use the following command to bring up all the sample datasets that could be called into the Environment panel of RStudio.
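The command is not shown in this copy. Assuming it is base R's data() function, a minimal sketch is given below. The second call restricts the listing to a single package; "datasets" is used here because it ships with R, and the object name available is illustrative.

```r
# List the sample datasets in every package currently loaded
# (in RStudio the list opens in its own tab)
data()

# Restrict the listing to one package and inspect it as a table
available <- data(package = "datasets")
head(available$results[, c("Item", "Title")])
```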
We want to bring in the following datasets from the historydata library
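The loading code is missing from this copy. Judging by the column names in the output below, the three datasets appear to be judges_people, judges_appointments and early_colleges. A sketch assuming those names and the data() function:

```r
library(historydata)

# Load each dataset into the Environment panel
data(judges_people)
data(judges_appointments)
data(early_colleges)

# Typing a dataset's name prints it
judges_people
judges_appointments
early_colleges
```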
These will now be listed as
## # A tibble: 3,532 × 13
## judge_id name_first name_middle name_last name_suffix birth_date
## <int> <chr> <chr> <chr> <chr> <int>
## 1 3419 Ronnie <NA> Abrams <NA> 1968
## 2 1 Matthew T. Abruzzo <NA> 1889
## 3 2 Marcus Wilson Acheson <NA> 1828
## 4 3 William Marsh Acker Jr. 1927
## 5 4 Harold Arnold Ackerman <NA> 1928
## 6 5 James Waldo Ackerman <NA> 1926
## 7 6 Raymond L. Acosta <NA> 1925
## 8 7 J[ackson] Leroy Adair <NA> 1887
## 9 8 Arlin Marvin Adams <NA> 1921
## 10 9 Elmer Bragg Adams <NA> 1842
## # ℹ 3,522 more rows
## # ℹ 7 more variables: birthplace_city <chr>, birthplace_state <chr>,
## # death_date <int>, death_city <chr>, death_state <chr>, gender <chr>,
## # race <chr>
## # A tibble: 4,202 × 15
## judge_id court_name court_type president_name president_party nomination_date
## <int> <chr> <chr> <chr> <chr> <chr>
## 1 3419 U. S. Dis… USDC Barack Obama Democratic 07/28/2011
## 2 1 U. S. Dis… USDC Franklin D. R… Democratic 02/03/1936
## 3 2 U. S. Dis… USDC Rutherford B.… Republican 01/06/1880
## 4 3 U. S. Dis… USDC Ronald Reagan Republican 07/22/1982
## 5 4 U. S. Dis… USDC Jimmy Carter Democratic 09/28/1979
## 6 5 U. S. Dis… USDC Gerald Ford Republican 06/18/1976
## 7 6 U. S. Dis… USDC Ronald Reagan Republican 09/09/1982
## 8 7 U. S. Dis… USDC Franklin D. R… Democratic 03/24/1937
## 9 8 U. S. Cou… USCA Richard M. Ni… Republican 09/22/1969
## 10 9 U. S. Dis… USDC Grover Clevel… Democratic 12/04/1895
## # ℹ 4,192 more rows
## # ℹ 9 more variables: predecessor_last_name <chr>,
## # predecessor_first_name <chr>, senate_confirmation_date <chr>,
## # commission_date <chr>, chief_judge_begin <int>, chief_judge_end <int>,
## # retirement_from_active_service <chr>, termination_date <chr>,
## # termination_reason <chr>
## # A tibble: 65 × 6
## college original_name city state established sponsorship
## <chr> <chr> <chr> <chr> <int> <chr>
## 1 Harvard <NA> Camb… MA 1636 Congregati…
## 2 William and Mary <NA> Will… VA 1693 Anglican
## 3 Yale <NA> New … CT 1701 Congregati…
## 4 Pennsylvania, Univ. of <NA> Phil… PA 1740 Nondenomin…
## 5 Princeton College of New Je… Prin… NJ 1746 Presbyteri…
## 6 Columbia King's College New … NY 1754 Anglican
## 7 Brown <NA> Prov… RI 1765 Baptist
## 8 Rutgers Queen's College New … NJ 1766 Dutch Refo…
## 9 Dartmouth <NA> Hano… NH 1769 Congregati…
## 10 Charleston, Coll. Of <NA> Char… SC 1770 Anglican
## # ℹ 55 more rows
The $ operator is used to access the internal components (columns) of a dataframe
early_colleges$established
The %>% pipeline command will be used throughout this week to link various command queries
early_colleges %>% select(established)
The %in% command is used for matching a column's values against a vector within a dataframe
early_colleges %>% filter(established %in% c(1795, 1797, 1802))
Please try each of these now in the Console section of RStudio
The head() and tail() commands are useful for exploring datasets: they show the first and last rows of the dataset respectively. This is particularly useful when importing data, to check that all the information has been read in correctly.
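The calls are not shown in this copy. Since the output below has ten rows in each table, a sketch assuming the row count was passed as the second argument:

```r
library(historydata)

# First ten rows of the dataset
head(early_colleges, 10)

# Last ten rows of the dataset
tail(early_colleges, 10)
```

Without the second argument, both functions return six rows.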
## # A tibble: 10 × 6
## college original_name city state established sponsorship
## <chr> <chr> <chr> <chr> <int> <chr>
## 1 Harvard <NA> Camb… MA 1636 Congregati…
## 2 William and Mary <NA> Will… VA 1693 Anglican
## 3 Yale <NA> New … CT 1701 Congregati…
## 4 Pennsylvania, Univ. of <NA> Phil… PA 1740 Nondenomin…
## 5 Princeton College of New Je… Prin… NJ 1746 Presbyteri…
## 6 Columbia King's College New … NY 1754 Anglican
## 7 Brown <NA> Prov… RI 1765 Baptist
## 8 Rutgers Queen's College New … NJ 1766 Dutch Refo…
## 9 Dartmouth <NA> Hano… NH 1769 Congregati…
## 10 Charleston, Coll. Of <NA> Char… SC 1770 Anglican
## # A tibble: 10 × 6
## college original_name city state established sponsorship
## <chr> <chr> <chr> <chr> <int> <chr>
## 1 Holy Cross <NA> Worchester MA 1843 Roman Cath…
## 2 Mississipps, Univ. of <NA> Oxford MI 1844 Secular
## 3 Louisiana, Univ. of <NA> New Orleans LA 1845 Secular
## 4 U.S. Naval Academy <NA> Annapolis MD 1845 Secular
## 5 Beloit <NA> Beloit WI 1846 Congregati…
## 6 Bucknell <NA> Lewisburg PA 1846 Baptist
## 7 Grinnell <NA> Grinnell IA 1846 Congregati…
## 8 Mount Union <NA> Alliance OH 1846 Methodist
## 9 Earlham <NA> Richmond IN 1847 Quaker
## 10 Wisconsin, Univ. of <NA> Madison WI 1848 Secular
Another way to view an entire dataset is to use the glimpse() command, which displays every column of the dataset along with its type
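The call is not shown in this copy. A minimal sketch, assuming the glimpse() function loaded with dplyr:

```r
library(dplyr)
library(historydata)

# One line per column: name, type and the first few values
glimpse(early_colleges)
```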
## Rows: 65
## Columns: 6
## $ college <chr> "Harvard", "William and Mary", "Yale", "Pennsylvania, Un…
## $ original_name <chr> NA, NA, NA, NA, "College of New Jersey", "King's College…
## $ city <chr> "Cambridge", "Williamsburg", "New Haven", "Philadelphia"…
## $ state <chr> "MA", "VA", "CT", "PA", "NJ", "NY", "RI", "NJ", "NH", "S…
## $ established <int> 1636, 1693, 1701, 1740, 1746, 1754, 1765, 1766, 1769, 17…
## $ sponsorship <chr> "Congregational; after 1805 Unitarian", "Anglican", "Con…
The summary() command provides an overview of the dataset and is most beneficial with numerical data
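The call is not shown in this copy. A minimal sketch using base R's summary():

```r
library(historydata)

# Character columns report only length and class;
# numeric columns report the minimum, quartiles, mean and maximum
summary(early_colleges)
```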
## college original_name city state
## Length:65 Length:65 Length:65 Length:65
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## established sponsorship
## Min. :1636 Length:65
## 1st Qu.:1793 Class :character
## Median :1823 Mode :character
## Mean :1810
## 3rd Qu.:1838
## Max. :1848
The select() command from ‘dplyr’ is a very versatile command that can be used in sequence using the %>% pipeline to link to other commands.
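The code is not shown in this copy. Based on the three columns and six rows in the output below, a sketch assuming select() was piped into head() (the object name three_cols is illustrative):

```r
library(dplyr)
library(historydata)

# Keep only three columns, then show the first six rows
three_cols <- early_colleges %>%
  select(college, city, state) %>%
  head()
three_cols
```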
## # A tibble: 6 × 3
## college city state
## <chr> <chr> <chr>
## 1 Harvard Cambridge MA
## 2 William and Mary Williamsburg VA
## 3 Yale New Haven CT
## 4 Pennsylvania, Univ. of Philadelphia PA
## 5 Princeton Princeton NJ
## 6 Columbia New York NY
The filter() command also from ‘dplyr’ is very useful and can implement numerical and text commands.
This example shows the first six colleges established prior to 1800
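The code is not shown in this copy. A sketch assuming a numerical comparison on the established column (the object name pre_1800 is illustrative):

```r
library(dplyr)
library(historydata)

# Keep the rows where the founding year is earlier than 1800
pre_1800 <- early_colleges %>% filter(established < 1800)
head(pre_1800)
```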
## # A tibble: 6 × 6
## college original_name city state established sponsorship
## <chr> <chr> <chr> <chr> <int> <chr>
## 1 Harvard <NA> Camb… MA 1636 Congregati…
## 2 William and Mary <NA> Will… VA 1693 Anglican
## 3 Yale <NA> New … CT 1701 Congregati…
## 4 Pennsylvania, Univ. of <NA> Phil… PA 1740 Nondenomin…
## 5 Princeton College of New Jer… Prin… NJ 1746 Presbyteri…
## 6 Columbia King's College New … NY 1754 Anglican
This example shows the colleges in the state of New York using the == operator
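The code is not shown in this copy. A sketch assuming the comparison is made on the state abbreviation "NY" (the object name ny_colleges is illustrative):

```r
library(dplyr)
library(historydata)

# == tests for equality; a single = would not work here
ny_colleges <- early_colleges %>% filter(state == "NY")
head(ny_colleges)
```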
## # A tibble: 6 × 6
## college original_name city state established sponsorship
## <chr> <chr> <chr> <chr> <int> <chr>
## 1 Columbia King's College New York NY 1754 Anglican
## 2 Union College <NA> Schenectady NY 1795 Presbyteri…
## 3 U.S. Military Academy <NA> West Point NY 1802 Secular
## 4 Colgate <NA> Hamilton NY 1819 Baptist
## 5 New York Univ. <NA> New York NY 1831 Nondenomin…
## 6 Fordham <NA> Fordham NY 1841 Roman Cath…
Using != instead displays the colleges in all states other than New York
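The code is not shown in this copy. A sketch, again assuming the abbreviation "NY" (the object name not_ny is illustrative):

```r
library(dplyr)
library(historydata)

# != keeps every row whose state is anything other than "NY"
not_ny <- early_colleges %>% filter(state != "NY")
head(not_ny)
```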
## # A tibble: 6 × 6
## college original_name city state established sponsorship
## <chr> <chr> <chr> <chr> <int> <chr>
## 1 Harvard <NA> Camb… MA 1636 Congregati…
## 2 William and Mary <NA> Will… VA 1693 Anglican
## 3 Yale <NA> New … CT 1701 Congregati…
## 4 Pennsylvania, Univ. of <NA> Phil… PA 1740 Nondenomin…
## 5 Princeton College of New Jer… Prin… NJ 1746 Presbyteri…
## 6 Brown <NA> Prov… RI 1765 Baptist
The simple logical operators for the filter command are:
& (and)
| (or)
! (not)
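The code is not shown in this copy. The output below contains only colleges in NY, VA and MA, so the first call in this sketch assumes the | operator was used. The other two calls are additional illustrations of & and ! rather than part of the original notes, and all object names are illustrative.

```r
library(dplyr)
library(historydata)

# | (or): a row is kept if any one of the conditions is true
ny_va_ma <- early_colleges %>%
  filter(state == "NY" | state == "VA" | state == "MA")
head(ny_va_ma)

# & (and): a row is kept only if both conditions are true
early_ny <- early_colleges %>%
  filter(state == "NY" & established < 1800)

# ! (not): reverses a condition
outside_ny <- early_colleges %>% filter(!(state == "NY"))
```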
## # A tibble: 6 × 6
## college original_name city state established sponsorship
## <chr> <chr> <chr> <chr> <int> <chr>
## 1 Harvard <NA> Cambridge MA 1636 Congregation…
## 2 William and Mary <NA> Williamsburg VA 1693 Anglican
## 3 Columbia King's College New York NY 1754 Anglican
## 4 Hampden-Sydney <NA> Hampden-Sydney VA 1775 Presbyterian
## 5 Williams <NA> Williamstown MA 1793 Congregation…
## 6 Union College <NA> Schenectady NY 1795 Presbyterian…
Note that the filter which produced the result above can also be expressed using the %in% command as follows:
# Create a new character vector of three states using their abbreviations
three_states <- c("NY", "VA", "MA")
# Filter for this using the %in% command
early_colleges %>% filter(state %in% three_states) %>% head()
## # A tibble: 6 × 6
## college original_name city state established sponsorship
## <chr> <chr> <chr> <chr> <int> <chr>
## 1 Harvard <NA> Cambridge MA 1636 Congregation…
## 2 William and Mary <NA> Williamsburg VA 1693 Anglican
## 3 Columbia King's College New York NY 1754 Anglican
## 4 Hampden-Sydney <NA> Hampden-Sydney VA 1775 Presbyterian
## 5 Williams <NA> Williamstown MA 1793 Congregation…
## 6 Union College <NA> Schenectady NY 1795 Presbyterian…
Although mutate() creates a new column, the column is lost unless you save the result back into the main dataframe. Therefore always point your code back to the original dataframe using the -> command.
Some programmers use the = sign for assignment. However, this is not recommended in R, as that sign has other uses, for example setting arguments inside a function call such as mutate().
# Now when you run this code, the number of variables will increase to 7
early_colleges %>% mutate(location = paste(city, state, sep = ",")) -> early_colleges
## # A tibble: 65 × 7
## college original_name city state established sponsorship location
## <chr> <chr> <chr> <chr> <int> <chr> <chr>
## 1 Harvard <NA> Camb… MA 1636 Congregati… Cambrid…
## 2 William and Mary <NA> Will… VA 1693 Anglican William…
## 3 Yale <NA> New … CT 1701 Congregati… New Hav…
## 4 Pennsylvania, Uni… <NA> Phil… PA 1740 Nondenomin… Philade…
## 5 Princeton College of N… Prin… NJ 1746 Presbyteri… Princet…
## 6 Columbia King's Colle… New … NY 1754 Anglican New Yor…
## 7 Brown <NA> Prov… RI 1765 Baptist Provide…
## 8 Rutgers Queen's Coll… New … NJ 1766 Dutch Refo… New Bru…
## 9 Dartmouth <NA> Hano… NH 1769 Congregati… Hanover…
## 10 Charleston, Coll.… <NA> Char… SC 1770 Anglican Charles…
## # ℹ 55 more rows
Filter the early_colleges to show the colleges established in the original 13 colonies of the United States of America
Create a new object from the early_colleges dataset showing the largest religious sponsorship of colleges