Last lab, we touched on defining objects, and one thing we introduced briefly was the vector. A vector is a single object that holds multiple values as opposed to a single value. We define a vector by using the c(x, y, z) function, where x, y, and z represent the values we want to store in that vector.
Let us define an object called temperatureF that stores the values 78, 85, 64, 54, 102, and 98.6, which represent temperatures in degrees Fahrenheit:
library(tinytex)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Creating an object of recorded temperatures in degrees-Fahrenheit.
temperatureF <- c( 78, 85, 64, 54, 102, 98.6)
# Prints the created object.
temperatureF
## [1] 78.0 85.0 64.0 54.0 102.0 98.6
The power and usefulness of vectors is that R can often do the same calculation on all elements of a vector with one command, and we played around with this power last lab. For example, to convert a temperature from Fahrenheit to Celsius, we subtract 32 and then multiply by 5/9. We can do that for all the numbers in this vector at once:
\(\dfrac{5}{9} \times (\text{temperatureF}- 32)\)
# Creates an object that stores the output of the resulting conversion calculation with recorded temperatures in degrees-Fahrenheit as the input.
temperatureC <- (5/9)*(temperatureF - 32)
# Prints the temperatures in Celsius.
temperatureC
## [1] 25.55556 29.44444 17.77778 12.22222 38.88889 37.00000
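As a quick check on the conversion, here is a minimal sketch that converts the Celsius values back to Fahrenheit and compares them with the original vector. The name backToF is made up for this illustration, and the two temperature vectors are redefined so the block runs on its own.

```r
# Same vectors as in the lab, redefined so this block is self-contained.
temperatureF <- c(78, 85, 64, 54, 102, 98.6)
temperatureC <- (5/9) * (temperatureF - 32)

# Reverse the formula on every element at once (backToF is a hypothetical name).
backToF <- temperatureC * 9/5 + 32

# all.equal() compares numbers while allowing for tiny rounding differences.
all.equal(backToF, temperatureF)

# round() is also applied to every element; here to 1 decimal place.
round(temperatureC, 1)
```

If the formula was typed correctly, all.equal( ) should report TRUE.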
If you want to pull out a specific value from a vector, use brackets, [x], where x denotes the index (position) of the value you want to pull out.
# Pulls out the 3rd value in the temperatures in Celsius object.
temperatureC[3]
## [1] 17.77778
# Pulls out the 5th value in the temperatures in Celsius object.
temperatureC[5]
## [1] 38.88889
Be careful with the difference between ( ) parentheses and [ ] brackets and using them in R.
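To make the distinction concrete, here is a small sketch. The vector temps and its values are made up for illustration; parentheses are used to call a function, while brackets are used to select elements by position.

```r
# Hypothetical example vector, defined here so the block runs on its own.
temps <- c(25.6, 29.4, 17.8, 12.2, 38.9, 37.0)

# Parentheses call a function: round() is applied to every element.
round(temps)

# Brackets pull elements out by position.
temps[3]          # the 3rd value
temps[c(1, 5)]    # the 1st and 5th values
temps[-1]         # everything except the 1st value
```

Note that selecting more than one position requires wrapping the positions in c( ) inside the brackets.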
Vectors are mathematical objects that store multiple values, and you can perform operations directly on these values. Instead of extracting individual values from a vector for calculations, we can apply functions and operations to the entire vector at once. This makes calculations more efficient and simplifies coding when working with large data sets. This is especially useful if we want to perform some statistical analysis on a set of values (e.g., finding the mean) or to investigate what our objects, or variables, look like.
Mean: We use the function mean(x), where x is the object (vector) we want to find the mean of.
mean(temperatureC)
## [1] 26.81481
Sum: We use the function sum(x), where x is the object (vector) we want to find the sum of.
sum(temperatureC)
## [1] 160.8889
Length: We use the function length(x), where x is the object (vector) we want to find the length of, that is, how many values (elements) are in that object.
length(temperatureC)
## [1] 6
If you want to remove every object in your environment, click the “broom” icon to clear the whole environment. If you would like to remove only certain objects, use either remove( ) or rm( ) with the names of the objects inside the parentheses (if you have multiple objects to remove, separate them with commas).
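# Create two example objects.
x = c(1, 2, 3, 4, 5)
y = c(6, 7, 8, 9, 10)
# Remove a single object.
remove(x)
# Remove several objects at once. Because x was already removed above, R warns that it cannot find x; y is still removed.
remove(x, y)
## Warning in remove(x, y): object 'x' not found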
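If you want to confirm that an object is really gone, one option is the base R function exists( ). The sketch below uses made-up object names (demoX, demoY) so that it does not interfere with anything in your own environment.

```r
# Hypothetical objects created only for this illustration.
demoX <- c(1, 2, 3)
demoY <- c(4, 5, 6)

# exists() reports whether an object with the given name is currently defined.
exists("demoX")   # TRUE

# Remove both objects in a single call.
rm(demoX, demoY)

exists("demoX")   # FALSE
```

Checking with exists( ) before calling rm( ) is one way to avoid the "object not found" warning.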
Packages in R are collections of functions, data, and documentation bundled together to extend R’s capabilities. They allow users to perform specialized tasks (e.g., data manipulation, visualization, machine learning) that aren’t available in base R. Popular packages include ggplot2 (for visualization), dplyr (for data manipulation), and caret (for machine learning). Many packages are available online, and it is up to you to decide which ones will help you reach your objective in R more efficiently.
We can install a package using the install.packages()
function. Let us install the package dplyr, which
allows for easy modification of data frames, since we
will be using it for this lab.
install.packages("dplyr")
## Warning: package 'dplyr' is in use and will not be installed
Once a package is installed, it needs to be loaded into R during a session if you want to use it. You do this with a function called library( ).
library(dplyr)
*If library( ) gives you an error, the package is most likely not installed yet.
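One way to avoid both the "in use" warning and the "not installed" error is to install a package only when it is missing. The sketch below wraps this in a helper called load_or_install, which is a made-up name for this illustration and not part of R. It is demonstrated with the stats package, which ships with R, so nothing needs to be downloaded.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
load_or_install <- function(pkg) {
  # requireNamespace() returns FALSE if the package cannot be found.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE tells library() that pkg holds the name as text.
  library(pkg, character.only = TRUE)
}

# stats comes with every R installation, so this call installs nothing.
load_or_install("stats")
```

For this lab you would call load_or_install("dplyr") instead; the first run needs an internet connection if dplyr is not yet installed.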
A working directory in R is the folder or location on your computer where R reads and saves files by default. It’s like R’s “home base” for finding and storing files.
A file path defines the location of a file or
directory on your computer. It shows the path that leads from the root
directory (like C:\ on Windows or / on
Linux/macOS) to the file or folder you want to work with.
Windows:
C:\Users\YourName\Documents\file.csv
macOS/Linux:
/Users/YourName/Documents/file.csv
Open your file explorer (Windows Explorer, Finder, etc.).
Find the file or folder you would like to refer to or save in.
Right-click on the file and select “Properties” (Windows) or “Get Info” (macOS), then copy the location. On macOS you can also hold the Option key while right-clicking to copy the path directly.
Add the file name (if it’s a file) at the end of the location.
If you are using Windows, R will not read a file path that contains a single \ (backslash), because R treats the backslash as an escape character inside text strings. We can fix this in either of two ways:
We can change every \ (backslash) into a / (forward slash), OR
We can add another \ (backslash) next to each existing backslash so that we have two: \\ (double backslash).
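The two fixes can be compared side by side in a short sketch. The path below is the placeholder path from the example above, not a real file, and the object names are made up. The sketch also uses file.path( ), a base R function that joins folder and file names with forward slashes.

```r
# Placeholder Windows path written both ways; replace with your own location.
path_forward <- "C:/Users/YourName/Documents/file.csv"
path_double  <- "C:\\Users\\YourName\\Documents\\file.csv"

# Printing with cat() shows that each doubled backslash is stored as a single one.
cat(path_double, "\n")

# file.path() builds a path from its pieces using forward slashes.
path_built <- file.path("C:/Users/YourName/Documents", "file.csv")

# The built path matches the forward-slash version exactly.
identical(path_built, path_forward)
```

Either form can be passed to functions such as setwd( ) or read.csv( ); the forward-slash form is usually easier to type and read.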
You can check your current working directory using getwd( ).
getwd()
## [1] "C:/BI412L/Lab 2 Intro to R Part 2"
You can also set your working directory if it currently is not the file path that you want with setwd( ), with your desired file path inside the parentheses.
setwd("C:\\BI412L\\ABDLabs\\ABDLabs\\DataForLabs")
Here, for practice, we set our working directory to the folder that holds the lab data. Keep in mind that working directories are personal: choose whichever folder you want to read files from or save files to.
Once a working directory is set, a data set stored in that folder can normally be read by typing just its file name rather than the full file path. In R Markdown documents, however, a working directory set with setwd( ) inside a code chunk applies only to that chunk, so this shortcut needs some extra setup using an external package and a new command, which I will go over in another lab. For now, in order to read a data set, we copy the whole file path, with the data set's file name attached, and read that into R (see next section).
Sometimes, we already have a data set we want to work with in an external file. We can read this data into R so that we can work with it there. In these labs, we have saved the data in “comma-separated values” format, CSV for short.
For example, in this lab, let’s use a data set about the passengers of the RMS Titanic. One of the data sets in the folder that we downloaded in the first week is called “titanic.csv”. This is a data set of 1313 passengers from the voyage of this ship. It contains some personal information about each passenger, as well as whether they survived the accident or not.
To import a CSV file into R, we use the read.csv() function as in the following command (We will use the full file path to the titanic data set).
titanicData <- read.csv("C:\\BI412L\\ABDLabs\\ABDLabs\\DataForLabs\\titanic.csv", stringsAsFactors = TRUE)
This looks for the file called titanic.csv in the folder called DataForLabs. Here we have given the name titanicData to the object in R that contains all this passenger data.
What stringsAsFactors does: When set to TRUE, any character columns in a data frame are automatically converted into factors. Factors are categorical variables with a predefined set of levels (unique values). When set to FALSE (which is the default behavior in R 4.0.0 and later), character columns remain as character strings.
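The difference is easiest to see on a tiny example. The sketch below reads a few made-up rows supplied as text through the text argument of read.csv( ), so no file is needed; the object names are hypothetical.

```r
# A tiny made-up CSV supplied as text instead of a file.
csv_text <- "name,sex\nAnn,female\nBob,male\nCat,female"

# Read the same data twice, changing only stringsAsFactors.
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_char   <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$sex)    # "factor"
levels(as_factor$sex)   # "female" "male"
class(as_char$sex)      # "character"
```

Functions such as summary( ) count the values of a factor column per level, which is why the Titanic data is read with stringsAsFactors = TRUE here.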
To see if the data loads appropriately, we might want to run the command summary( ). This will give a summary of the data file that we are reading.
summary(titanicData)
## passenger_class name age
## 1st:322 Carlsson,MrFransOlof : 2 Min. : 0.1667
## 2nd:280 Connolly,MissKate : 2 1st Qu.:21.0000
## 3rd:711 Kelly,MrJames : 2 Median :30.0000
## Abbing,MrAnthony : 1 Mean :31.1942
## Abbott,MasterEugeneJoseph: 1 3rd Qu.:41.0000
## Abbott,MrRossmoreEdward : 1 Max. :71.0000
## (Other) :1304 NA's :680
## embarked home_destination sex survive
## :493 :558 female:463 no :864
## Cherbourg :202 NewYork,NY : 65 male :850 yes:449
## Queenstown : 45 London : 14
## Southampton:573 Montreal,PQ : 10
## Cornwall/Akron,OH: 9
## Paris,France : 9
## (Other) :648
Sometimes we would like to add a new column to a data frame. The easiest way to do this is to simply assign a new vector to a new column name, using the $.
For example, to add the log of age as a column in the titanicData data frame, we can write:
titanicData$log_age = log(titanicData$age)
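The same pattern can be tried on a small stand-in data frame. Everything in the sketch below (the data frame demo and its values) is made up for illustration. It also shows that a missing age (NA) stays missing in the new column.

```r
# Hypothetical mini data frame standing in for titanicData.
demo <- data.frame(name = c("A", "B", "C"), age = c(29, 2, NA))

# Assigning to a column name that does not exist yet creates that column.
demo$log_age <- log(demo$age)

# The new column now appears after the existing ones.
names(demo)

# The NA in age produces an NA in log_age rather than an error.
demo$log_age
```

Note that log( ) in R is the natural logarithm; use log10( ) if you want base 10.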
The head( ) command shows the first 6 rows of a data frame by default. We can specify how many rows we want to see by adding a comma followed by the desired number of rows inside head( ).
You can run the command head(titanicData) to see that log_age is now a column in titanicData.
# Gets first 6 rows.
head(titanicData)
## passenger_class name age embarked
## 1 1st Allen,MissElisabethWalton 29.0000 Southampton
## 2 1st Allison,MissHelenLoraine 2.0000 Southampton
## 3 1st Allison,MrHudsonJoshuaCreighton 30.0000 Southampton
## 4 1st Allison,MrsHudsonJ.C.(BessieWaldoDaniels) 25.0000 Southampton
## 5 1st Allison,MasterHudsonTrevor 0.9167 Southampton
## 6 1st Anderson,MrHarry 47.0000 Southampton
## home_destination sex survive log_age
## 1 StLouis,MO female yes 3.36729583
## 2 Montreal,PQ/Chesterville,ON female no 0.69314718
## 3 Montreal,PQ/Chesterville,ON male no 3.40119738
## 4 Montreal,PQ/Chesterville,ON female no 3.21887582
## 5 Montreal,PQ/Chesterville,ON male yes -0.08697501
## 6 NewYork,NY male yes 3.85014760
# Gets first 10 rows.
head(titanicData, 10)
## passenger_class name age
## 1 1st Allen,MissElisabethWalton 29.0000
## 2 1st Allison,MissHelenLoraine 2.0000
## 3 1st Allison,MrHudsonJoshuaCreighton 30.0000
## 4 1st Allison,MrsHudsonJ.C.(BessieWaldoDaniels) 25.0000
## 5 1st Allison,MasterHudsonTrevor 0.9167
## 6 1st Anderson,MrHarry 47.0000
## 7 1st Andrews,MissKorneliaTheodosia 63.0000
## 8 1st Andrews,MrThomas,jr 39.0000
## 9 1st Appleton,MrsEdwardDale(CharlotteLamson) 58.0000
## 10 1st Artagaveytia,MrRamon 71.0000
## embarked home_destination sex survive log_age
## 1 Southampton StLouis,MO female yes 3.36729583
## 2 Southampton Montreal,PQ/Chesterville,ON female no 0.69314718
## 3 Southampton Montreal,PQ/Chesterville,ON male no 3.40119738
## 4 Southampton Montreal,PQ/Chesterville,ON female no 3.21887582
## 5 Southampton Montreal,PQ/Chesterville,ON male yes -0.08697501
## 6 Southampton NewYork,NY male yes 3.85014760
## 7 Southampton Hudson,NY female yes 4.14313473
## 8 Southampton Belfast,NI male no 3.66356165
## 9 Southampton Bayside,Queens,NY female yes 4.06044301
## 10 Cherbourg Montevideo,Uruguay male no 4.26267988
Sometimes we want to do an analysis only on some of the data that fit certain criteria. For example, we might want to analyze the data from the Titanic using only the information from females.
The easiest way to do this is to use the filter( ) function from the package dplyr. (Make sure you have installed the dplyr package as described above, and then load it into R using library( )):
library(dplyr)
In the titanic data set there is a variable named sex, and an individual is female if that variable has value “female”. We can create a new data frame that includes only the data from females with the following command:
titanicDataFemalesOnly <- filter(titanicData, sex == "female")
This new data frame will include all the same columns as the original titanicData, but it will only include the rows for which the sex was “female”.
Note that the syntax here requires a double equal sign, ==. In R (and many other computer languages), the double equal sign creates a statement that is evaluated as true or false, while a single equal sign assigns the value on its right-hand side to the object on its left. Here we are asking, for each individual, whether sex is “female”, not assigning the value “female” to the variable sex. So we use a double equal sign, ==.
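To see what == produces on its own, here is a minimal base R sketch with a made-up vector called sex. The comparison returns one TRUE or FALSE per element, and filter( ) keeps the rows where the result is TRUE.

```r
# Hypothetical vector standing in for the sex column.
sex <- c("female", "male", "female")

# == compares every element with "female" and returns TRUE or FALSE for each.
sex == "female"          # TRUE FALSE TRUE

# TRUE counts as 1, so sum() gives the number of matches.
sum(sex == "female")     # 2

# The same logical vector can be used inside brackets to keep only the matches.
sex[sex == "female"]
```

Writing sex = "female" instead would overwrite the whole vector with the single value "female".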