Vectors

In the last lab, we touched on defining objects and briefly introduced vectors. A vector is a single object that holds multiple values, as opposed to a single value. We define a vector with the c() function, as in c(x, y, z), where x, y, and z represent the values we want to store in that vector.

Temperature Example:

Let us define an object called temperatureF by storing the values 78, 85, 64, 54, 102, and 98.6, which will represent the temperature in degrees-Fahrenheit, in that object:

library(tinytex)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Creating an object of recorded temperatures in degrees-Fahrenheit.
temperatureF <- c( 78, 85, 64, 54, 102, 98.6) 

# Prints the created object.
temperatureF
## [1]  78.0  85.0  64.0  54.0 102.0  98.6

The power and usefulness of vectors is that R can often do the same calculation on all elements of a vector with one command, and we played around with this power last lab. For example, to convert a temperature in Fahrenheit to Celsius, we subtract 32 and then multiply by 5/9. We can do that for all the numbers in this vector at once:

  1. Let us come up with the equation for degrees-Celsius using our given degrees-Fahrenheit by putting our defined object into the conversion formula. It should look something like this:

\(\dfrac{5}{9} \times (\text{temperatureF}- 32)\)

  2. Let us assign this calculation to another object called temperatureC, which will store our conversions (our output values).
# Creates an object that stores the output of the resulting conversion calculation with recorded temperatures in degrees-Fahrenheit as the input.
temperatureC <- (5/9)*(temperatureF - 32)

# Prints the temperatures in Celsius.
temperatureC
## [1] 25.55556 29.44444 17.77778 12.22222 38.88889 37.00000
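
As a quick check on the conversion, the formula can be inverted to recover the original Fahrenheit values. This is only a minimal sketch; the name temperatureF_check is made up for this illustration and does not appear elsewhere in the lab.

```r
# Recorded temperatures in degrees-Fahrenheit (same values as above).
temperatureF <- c(78, 85, 64, 54, 102, 98.6)

# Convert every element to degrees-Celsius in one step.
temperatureC <- (5/9) * (temperatureF - 32)

# Invert the formula to go back to degrees-Fahrenheit.
# temperatureF_check is a hypothetical name used only for this check.
temperatureF_check <- temperatureC * (9/5) + 32

# all.equal() compares numbers while allowing for tiny rounding differences.
all.equal(temperatureF, temperatureF_check)
```

If the two vectors agree, the conversion was applied to every element and not just the first one.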

If you want to pull out a specific value from a vector, use square brackets, [x], where x denotes the index of the value you want to pull out.

# Pulls out the 3rd value in the temperatures in Celsius object.
temperatureC[3]
## [1] 17.77778
# Pulls out the 5th value in the temperatures in Celsius object.
temperatureC[5]
## [1] 38.88889

Be careful with the difference between ( ) parentheses and [ ] brackets and using them in R.

  • One of the common ways to slip up in R is to confuse [square brackets], which pull out an element of a vector, with (parentheses), which enclose the arguments of a function.
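
The two kinds of brackets often appear together, which is where the confusion tends to start. The sketch below shows a few indexing patterns beyond a single index; it rebuilds temperatureC so that it can be run on its own.

```r
# Temperatures in degrees-Celsius, rebuilt from the Fahrenheit values.
temperatureC <- (5/9) * (c(78, 85, 64, 54, 102, 98.6) - 32)

# Parentheses call the function c(); the brackets then pull out
# the 1st and 3rd values.
temperatureC[c(1, 3)]

# A range of indices pulls out consecutive values (here the 2nd through 4th).
temperatureC[2:4]

# A negative index drops that value instead of selecting it.
temperatureC[-1]
```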

Basic Calculations of Vectors

Vectors store multiple values, and you can perform operations directly on these values. Instead of extracting individual values from a vector for calculations, we can apply functions and operations to the entire vector at once. This makes calculations more efficient and simplifies coding when working with large datasets. This is especially useful if we want to perform some statistical analysis on a set of values (e.g., finding the mean) or to investigate what our objects, or variables, look like.

  1. Mean: We use the function mean(x), where x is the object (vector) we want to find the mean of.

    • Example: Find the mean of the temperature in degrees-Celsius.
    mean(temperatureC)
    ## [1] 26.81481
  2. Sum: We use the function sum(x), where x is the object (vector) we want to find the sum of.

    • Example: Find the sum of the temperature in degrees-Celsius.
    sum(temperatureC)
    ## [1] 160.8889
  3. Length: We use the function length(x), where x is the object (vector) we want to find the length of, that is, how many values (elements) are in that object.

    • Example: Find how many values are in temperature in degrees-Celsius.
    length(temperatureC)
    ## [1] 6
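
These three functions are related: the mean is simply the sum divided by the length. The sketch below checks that relationship and shows a few other base R functions that work on a whole vector in the same way (min(), max(), and round() are not covered above, so treat them as optional extras).

```r
# Temperatures in degrees-Celsius, rebuilt from the Fahrenheit values.
temperatureC <- (5/9) * (c(78, 85, 64, 54, 102, 98.6) - 32)

# The mean is the sum divided by the number of values.
sum(temperatureC) / length(temperatureC)

# Smallest and largest values in the vector.
min(temperatureC)
max(temperatureC)

# Round every value to one decimal place.
round(temperatureC, 1)
## [1] 25.6 29.4 17.8 12.2 38.9 37.0
```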

Removing Selected Objects from Environment

To remove every object in your environment, click the “broom” icon button, which clears the whole environment. To remove only certain objects, use either remove( ) or rm( ) with the objects you want to remove inside the parentheses. If you would like to remove multiple objects, separate them with commas.

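# Creates two example vectors.
x = c(1, 2, 3, 4, 5)
y = c(6, 7, 8, 9, 10)

# Removes a single object.
remove(x)

# Removes several objects at once. Because x is already gone, R warns about x but still removes y.
remove(x, y)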
## Warning in remove(x, y): object 'x' not found
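
One way to avoid a warning like this is to check whether an object exists before removing it. This is a minimal sketch using the base R function exists(); the inherits = FALSE argument is an extra used here so that R only looks in the current environment.

```r
# Create two example vectors.
x <- c(1, 2, 3, 4, 5)
y <- c(6, 7, 8, 9, 10)

# exists() takes the object's name as a character string.
exists("x", inherits = FALSE)

# Remove x, then confirm that it is gone.
rm(x)
exists("x", inherits = FALSE)

# Guard the removal so that no warning appears if x is already gone.
if (exists("x", inherits = FALSE)) rm(x)

# y is still there and can be removed as usual.
rm(y)
```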

Packages

Packages in R are collections of functions, data, and documentation bundled together to extend R’s capabilities. They allow users to perform specialized tasks (e.g., data manipulation, visualization, machine learning) that aren’t available in base R. Popular packages include ggplot2 (for visualization), dplyr (for data manipulation), and caret (for machine learning). Thousands of packages are available, most of them through the Comprehensive R Archive Network (CRAN), so it is up to you to decide which ones will make your work in R more efficient.

Installing Packages

We can install a package using the install.packages() function. Let us install the package dplyr, which allows for easy modification of data frames, since we will be using it for this lab.

install.packages("dplyr")
## Warning: package 'dplyr' is in use and will not be installed
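
A package only needs to be installed once per computer, so it can be convenient to install it only when it is missing. The sketch below does this with a small helper; install_if_missing() is a hypothetical function written for this example and is not part of R or of dplyr.

```r
# install_if_missing() is a hypothetical helper written for this example.
install_if_missing <- function(pkg) {
  # requireNamespace() checks whether the package can be loaded,
  # without attaching it the way library() does.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is available now.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so nothing is installed by this call.
install_if_missing("stats")

# For this lab you would run:
# install_if_missing("dplyr")
```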

Loading Packages

Once a package is installed, it needs to be loaded into R during a session if you want to use it. You do this with a function called library( ).

library(dplyr)

*If you get an error saying that there is no package called 'dplyr', the package is not installed yet.

Work Directory

A working directory in R is the folder or location on your computer where R reads and saves files by default. It’s like R’s “home base” for finding and storing files.

File Path

A file path defines the location of a file or directory on your computer. It shows the path that leads from the root directory (like C:\ on Windows or / on Linux/macOS) to the file or folder you want to work with.

Examples of File Path:

  • Windows: C:\Users\YourName\Documents\file.csv

  • macOS/Linux: /Users/YourName/Documents/file.csv

How to Get a Desired Filepath:

  • Open your file explorer (Windows Explorer, Finder, etc.).

  • Find the file or folder you would like to refer to or save in.

  • Right-click on the file and select “Properties” (Windows) or “Get Info” (macOS), and copy the location. On macOS, you can also hold the Option key while right-clicking the file and choose the option to copy its path.

  • Add the file name (if it’s a file) at the end of the location.

File Path Disclaimer:

If you are using Windows, R will not read a file path that contains single \ (backslashes), because R treats a backslash inside a string as the start of an escape sequence. We can do either of two things to fix this:

  • We can change all the \ (backslashes) into / (forward slashes), OR

  • We can double each \ (backslash) so that every one becomes \\ (two backslashes).
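
Both fixes describe the same location. The sketch below compares them using the example path from above (YourName is a placeholder, not a real folder) and also shows file.path(), a base R function that builds a path from its pieces.

```r
# The same (placeholder) location written both ways.
path_forward <- "C:/Users/YourName/Documents/file.csv"
path_escaped <- "C:\\Users\\YourName\\Documents\\file.csv"

# cat() prints the string as R stores it: each \\ is one backslash.
cat(path_escaped, "\n")

# Replace every backslash with a forward slash.
path_converted <- gsub("\\", "/", path_escaped, fixed = TRUE)
identical(path_converted, path_forward)

# file.path() joins the pieces with forward slashes for us.
path_built <- file.path("C:", "Users", "YourName", "Documents", "file.csv")
identical(path_built, path_forward)
```

Building paths with file.path() avoids typing slashes by hand, which is where most mistakes happen.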

Getting a Work Directory

You can check your current working directory using getwd( ).

getwd()
## [1] "C:/BI412L/Lab 2 Intro to R Part 2"

Setting a Work Directory

You can also set your working directory if it currently is not the file path that you want with setwd( ), with your desired file path inside the parentheses.

setwd("C:\\BI412L\\ABDLabs\\ABDLabs\\DataForLabs")

For practice, we set our working directory to the folder that holds the lab data, but your working directory is a personal choice: pick whichever folder you want to read files from or save files to.
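
If you change the working directory only temporarily, it helps to remember where you started so that you can go back. The sketch below uses tempdir(), a temporary folder that every R session has, as a stand-in for a folder of your own; the object names are made up for this example.

```r
# Remember where we started.
original_wd <- getwd()

# setwd() changes the working directory and invisibly returns the old one.
# tempdir() is only a stand-in here for a folder of your choice.
previous_wd <- setwd(tempdir())

# The working directory is now the temporary folder.
getwd()

# Go back to where we started.
setwd(original_wd)
identical(getwd(), original_wd)
```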

Additional Disclaimers Pertaining to Working Directories (Referencing Files & Reading Data Sets):

In a regular R session, setting a working directory lets us read a data set by typing only its file name, because R looks for the file in that directory. In R Markdown documents, however, a working directory set with setwd( ) inside one code chunk does not carry over to later chunks, so this shortcut is not reliable there. Working around this takes an external package and a new command, which I will go over in another lab. For now, in order to read a data set, we need to copy the whole file path with the data set attached and read that into R (see the next section).

Reading a File into R

Sometimes, we already have a data set we want to work with from an external file. We can read this data into R so that we can work with it there. In these labs, we have saved the data in a “comma-separated values” format, CSV for short.

For example in this lab, let’s use a data set about the passengers of the RMS Titanic. One of the data sets in the folder that we downloaded in the first week is called “titanic.csv”. This is a data set of 1313 passengers from the voyage of this ship, which contains some personal information about each passenger as well as whether they survived the accident or not.

To import a CSV file into R, we use the read.csv() function as in the following command (We will use the full file path to the titanic data set).

titanicData <- read.csv("C:\\BI412L\\ABDLabs\\ABDLabs\\DataForLabs\\titanic.csv", stringsAsFactors = TRUE)

This looks for the file called titanic.csv in the folder called DataForLabs. Here we have given the name titanicData to the object in R that contains all this passenger data.

What stringsAsFactors Does:

  • When set to TRUE, any character columns in a data frame are automatically converted into factors. Factors are categorical variables with a predefined set of levels (unique values).

  • When set to FALSE (the default since R 4.0.0), character columns remain as character strings.
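
The difference is easiest to see by reading the same file both ways. The sketch below writes a tiny made-up CSV file to a temporary location (it is not the real titanic.csv) and reads it back with each setting.

```r
# A tiny made-up data set, used only for illustration.
toy <- data.frame(
  name = c("A", "B", "C"),
  sex  = c("female", "male", "female"),
  age  = c(29, 2, 30)
)

# Save it as a CSV file in a temporary location.
toy_path <- tempfile(fileext = ".csv")
write.csv(toy, toy_path, row.names = FALSE)

# Read the same file with each setting.
with_factors    <- read.csv(toy_path, stringsAsFactors = TRUE)
without_factors <- read.csv(toy_path, stringsAsFactors = FALSE)

class(with_factors$sex)      # "factor"
levels(with_factors$sex)     # "female" "male"
class(without_factors$sex)   # "character"
```

Numeric columns such as age are read as numbers either way; only the text columns are affected.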

To check that the data loaded correctly, we can run the command summary( ). This gives a summary of each column in the data frame that we just read in.

summary(titanicData)
##  passenger_class                        name           age         
##  1st:322         Carlsson,MrFransOlof     :   2   Min.   : 0.1667  
##  2nd:280         Connolly,MissKate        :   2   1st Qu.:21.0000  
##  3rd:711         Kelly,MrJames            :   2   Median :30.0000  
##                  Abbing,MrAnthony         :   1   Mean   :31.1942  
##                  Abbott,MasterEugeneJoseph:   1   3rd Qu.:41.0000  
##                  Abbott,MrRossmoreEdward  :   1   Max.   :71.0000  
##                  (Other)                  :1304   NA's   :680      
##         embarked            home_destination     sex      survive  
##             :493                    :558     female:463   no :864  
##  Cherbourg  :202   NewYork,NY       : 65     male  :850   yes:449  
##  Queenstown : 45   London           : 14                           
##  Southampton:573   Montreal,PQ      : 10                           
##                    Cornwall/Akron,OH:  9                           
##                    Paris,France     :  9                           
##                    (Other)          :648
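
The summary above reports 680 NA's for age, meaning that the age is missing for those passengers. The sketch below shows, on a small made-up data frame standing in for titanicData, a few other base R functions for checking what was loaded and how missing values affect a calculation.

```r
# A small made-up data frame standing in for titanicData.
toy <- data.frame(
  sex = c("female", "male", "female", "male"),
  age = c(29, NA, 30, 2),
  stringsAsFactors = TRUE
)

# Number of rows, number of columns, and the column names.
nrow(toy)
ncol(toy)
names(toy)

# str() shows the type of each column and its first few values.
str(toy)

# Count the missing ages.
sum(is.na(toy$age))

# With a missing value the mean is NA unless we ask R to skip NAs.
mean(toy$age)
mean(toy$age, na.rm = TRUE)
```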

Introduction to Data Frames

Adding a New Column to a Data Frame

Sometimes we would like to add a new column to a data frame. The easiest way to do this is to simply assign a new vector to a new column name, using the $.

For example, to add the natural log of age as a column in the titanicData data frame, we can write:

titanicData$log_age = log(titanicData$age)

The head( ) command shows the first 6 rows of a data frame by default. We can specify how many rows we want to see by adding a comma followed by the desired number of rows inside head( ).

You can run the command head(titanicData) to see that log_age is now a column in titanicData.

# Gets first 6 rows.
head(titanicData)
##   passenger_class                                      name     age    embarked
## 1             1st                 Allen,MissElisabethWalton 29.0000 Southampton
## 2             1st                  Allison,MissHelenLoraine  2.0000 Southampton
## 3             1st           Allison,MrHudsonJoshuaCreighton 30.0000 Southampton
## 4             1st Allison,MrsHudsonJ.C.(BessieWaldoDaniels) 25.0000 Southampton
## 5             1st                Allison,MasterHudsonTrevor  0.9167 Southampton
## 6             1st                          Anderson,MrHarry 47.0000 Southampton
##              home_destination    sex survive     log_age
## 1                  StLouis,MO female     yes  3.36729583
## 2 Montreal,PQ/Chesterville,ON female      no  0.69314718
## 3 Montreal,PQ/Chesterville,ON   male      no  3.40119738
## 4 Montreal,PQ/Chesterville,ON female      no  3.21887582
## 5 Montreal,PQ/Chesterville,ON   male     yes -0.08697501
## 6                  NewYork,NY   male     yes  3.85014760
# Gets first 10 rows.
head(titanicData, 10)
##    passenger_class                                      name     age
## 1              1st                 Allen,MissElisabethWalton 29.0000
## 2              1st                  Allison,MissHelenLoraine  2.0000
## 3              1st           Allison,MrHudsonJoshuaCreighton 30.0000
## 4              1st Allison,MrsHudsonJ.C.(BessieWaldoDaniels) 25.0000
## 5              1st                Allison,MasterHudsonTrevor  0.9167
## 6              1st                          Anderson,MrHarry 47.0000
## 7              1st             Andrews,MissKorneliaTheodosia 63.0000
## 8              1st                       Andrews,MrThomas,jr 39.0000
## 9              1st   Appleton,MrsEdwardDale(CharlotteLamson) 58.0000
## 10             1st                      Artagaveytia,MrRamon 71.0000
##       embarked            home_destination    sex survive     log_age
## 1  Southampton                  StLouis,MO female     yes  3.36729583
## 2  Southampton Montreal,PQ/Chesterville,ON female      no  0.69314718
## 3  Southampton Montreal,PQ/Chesterville,ON   male      no  3.40119738
## 4  Southampton Montreal,PQ/Chesterville,ON female      no  3.21887582
## 5  Southampton Montreal,PQ/Chesterville,ON   male     yes -0.08697501
## 6  Southampton                  NewYork,NY   male     yes  3.85014760
## 7  Southampton                   Hudson,NY female     yes  4.14313473
## 8  Southampton                  Belfast,NI   male      no  3.66356165
## 9  Southampton           Bayside,Queens,NY female     yes  4.06044301
## 10   Cherbourg          Montevideo,Uruguay   male      no  4.26267988
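
The same two steps, adding a column with $ and previewing rows with head( ), can be tried on a small made-up data frame without needing the titanic file. This is only a sketch; toy and its values are invented for illustration.

```r
# A small made-up data frame with one missing age.
toy <- data.frame(
  name = c("A", "B", "C", "D"),
  age  = c(29, 2, 0.9167, NA)
)

# Assigning to a new column name with $ adds that column.
toy$log_age <- log(toy$age)

# Show only the first 2 rows.
head(toy, 2)

# Ages below 1 give a negative log, and a missing age stays missing.
toy$log_age
```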

Choosing Subsets of Data

Sometimes we want to do an analysis only on some of the data that fit certain criteria. For example, we might want to analyze the data from the Titanic using only the information from females.

The easiest way to do this is to use the filter( ) function from the package dplyr. (Make sure you have installed the dplyr package as described above, and then load it into R using library( )):

library(dplyr)

In the titanic data set there is a variable named sex, and an individual is female if that variable has value “female”. We can create a new data frame that includes only the data from females with the following command:

titanicDataFemalesOnly <- filter(titanicData, sex == "female")

This new data frame will include all the same columns as the original titanicData, but it will only include the rows for which the sex was “female”.

Note that the syntax here requires a double equal sign, ==. In R (and many other computer languages), the double equal sign creates a statement that can be evaluated as true or false, while a single equal sign assigns the value on its right-hand side to the name on its left-hand side. Here we are asking, for each individual, whether sex is “female”, not assigning the value “female” to the variable sex. So we use a double equal sign ==.
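
To see what the == comparison produces, it helps to run it on its own. The sketch below uses a small made-up data frame and base R bracket subsetting in place of filter( ), so that it runs without any package; it is meant to illustrate the comparison, not to replace the dplyr approach above.

```r
# A small made-up data frame, used only for illustration.
toy <- data.frame(
  name = c("A", "B", "C"),
  sex  = c("female", "male", "female")
)

# == asks a TRUE/FALSE question for each row.
is_female <- toy$sex == "female"
is_female
## [1]  TRUE FALSE  TRUE

# TRUE counts as 1, so sum() gives the number of females.
sum(is_female)

# Keep only the rows where the answer was TRUE (all columns are kept).
toy_females <- toy[is_female, ]
nrow(toy_females)
```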