Dataframes have several properties that make them more user-friendly than matrices, including easier data access and the ability to use some functions that require dataframes. Dataframes behave much like matrices with higher-level properties. That means you can apply most of the commands that you have learned with matrices to dataframes.
When talking about dataframes, we will refer to rows as observations and columns as variables. This language is used when conducting data analysis. Dataframes have multiple observations (e.g. days) on which variables (e.g. number of classes, type of protein, etc.) were measured.
my_classes_df <- data.frame(my_classes, stringsAsFactors = FALSE) # We want to keep the original classes of the data intact (not convert strings, or character vectors, to factors)
meal_plan_df <- data.frame(meal_plan, stringsAsFactors = FALSE)
class(meal_plan_df)
## [1] "data.frame"
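As a minimal sketch of the conversion above, the following builds a small character matrix and turns it into a dataframe. The object `toy_matrix` and its values are hypothetical; they only stand in for objects such as `meal_plan`.

```r
# Hypothetical character matrix: one row per day, one column per variable
toy_matrix <- matrix(
  c("Monday", "Tuesday", "Tofu", "Steak"),
  nrow = 2,
  dimnames = list(NULL, c("day", "protein"))
)

# Convert to a dataframe, keeping strings as character vectors
toy_df <- data.frame(toy_matrix, stringsAsFactors = FALSE)

class(toy_df)      # "data.frame"
dim(toy_df)        # 2 observations (rows) and 2 variables (columns)
class(toy_df$day)  # "character", not "factor"
```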
Subsetting a dataframe can be done the same way as with matrices. However, when using dataframes it is more efficient to use the $ operator to access variables.
# Find on which day you will eat steak using the matrix way
eat_steak <- meal_plan_df[, "protein"] == "Steak"
meal_plan_df[eat_steak, "day"]
## [1] "Tuesday"
# Find on which day you will eat steak using $
meal_plan_df$day[meal_plan_df$protein == "Steak"] # $ extracts a variable (column) from the dataframe by name
## [1] "Tuesday"
# Count the number of days with classes
length(which(my_classes_df[,2] > 0)) # using indexing
## [1] 3
length(which(my_classes_df$class > 0)) # using $
## [1] 3
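To see that the matrix way and the $ way return the same thing, here is a short sketch on a made-up dataframe. The name `toy_df` and its values are assumptions for illustration only.

```r
# Hypothetical data, for illustration only
toy_df <- data.frame(
  day = c("Monday", "Tuesday", "Wednesday"),
  class = c(2, 0, 1),
  stringsAsFactors = FALSE
)

# Three equivalent ways to extract the "class" variable as a vector
by_position <- toy_df[, 2]
by_name     <- toy_df[, "class"]
by_dollar   <- toy_df$class
identical(by_position, by_dollar)  # TRUE
identical(by_name, by_dollar)      # TRUE

# A logical vector marks the observations that meet a condition ...
has_class <- toy_df$class > 0
# ... and can be used to subset another variable
toy_df$day[has_class]              # "Monday" "Wednesday"

# sum() counts the TRUE values, an alternative to length(which(...))
sum(has_class)                     # 2
```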
Find what protein you will eat on the day that you eat avocado using $.
Sometimes, we need to delete observations or variables in a dataframe, or objects from the environment. Be careful when overwriting or removing existing objects. If you haven’t saved a previous version, you won’t be able to access them anymore.
# Delete a variable
meal_plan_df$dessert <- NULL # Overwrites meal_plan_df. Deletes the dessert variable
meal_plan_df2 <- meal_plan_df[, c(1:3)] # Creates a new dataset with subsetting; using the colon as "to". Deletes the drink variable
# Delete observations
meal_plan_df2 <- meal_plan_df2[c(1:3), ] # Deletes the last two observations (days)
# Remove an object from the environment
rm(my_classes_df)
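One way to guard against losing data is to keep a copy before deleting anything. The sketch below assumes a hypothetical dataframe `toy_df`; it also uses a negative index, which drops the listed rows instead of keeping them.

```r
# Hypothetical data, for illustration only
toy_df <- data.frame(
  day = c("Monday", "Tuesday", "Wednesday"),
  protein = c("Tofu", "Steak", "Fish"),
  dessert = c("Cake", "Fruit", "None"),
  stringsAsFactors = FALSE
)

toy_backup <- toy_df    # keep a copy before modifying
toy_df$dessert <- NULL  # drop a variable by name
toy_df <- toy_df[-3, ]  # a negative index drops the third observation

names(toy_df)           # "day" "protein"
nrow(toy_df)            # 2
nrow(toy_backup)        # 3: the backup is untouched

rm(toy_backup)          # remove the copy from the environment
exists("toy_backup")    # FALSE
```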
Typically, you will work with existing datasets. Often, datasets are not in an R dataframe format, but they can be converted using a built-in function. When working with data, it is best practice to set your working directory first. This ensures that all the data and scripts are saved in the same location.
# Get and Set the working directory
getwd() # get the current directory
## [1] "/Users/sarahlachance/Documents/Tutorial/Math camp/Programming part - 2021"
setwd("/Users/sarahlachance/Documents/Tutorial/Math camp/Programming part - 2021") # Set the directory to a new location. Assumes that you have created such a folder beforehand. For the present tutorial, the path should be "/cloud/project/Day_2/"
getwd()
## [1] "/Users/sarahlachance/Documents/Tutorial/Math camp/Programming part - 2021"
# Import R data
load("titanic.Rda") # Assumes that you have a dataset with that name in your working directory. The usual extension for saved R objects, such as dataframes, is .Rda
load("/Users/sarahlachance/Documents/Tutorial/Math camp/Programming part - 2021/titanic.Rda") # Otherwise, you can include the full path to the file
# Load other data types
police <- read.csv("police-fair-2009.csv") # use function read.[format]. csv is the most common format
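If you do not have a csv file at hand, the following sketch writes a small one to a temporary folder and reads it back. The dataframe `toy_df` and the file name `toy.csv` are made up for this example.

```r
# Hypothetical data written to a temporary file, so the sketch runs anywhere
toy_df <- data.frame(
  id = 1:3,
  city = c("Montreal", "Toronto", "Halifax"),
  stringsAsFactors = FALSE
)
csv_path <- file.path(tempdir(), "toy.csv")
write.csv(toy_df, csv_path, row.names = FALSE)

file.exists(csv_path)  # check that the file is there before reading it
toy_read <- read.csv(csv_path, stringsAsFactors = FALSE)
str(toy_read)          # 3 observations of 2 variables
```

Building the path with `file.path()` avoids typing the separators by hand, which helps when a script moves between computers.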
Today we will work with data on the Titanic. Here is some background information (Click here for the source):
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
First, let’s preview the data.
dim(titanic) # Dimensions (number of rows/observations and columns/variables)
## [1] 891 9
names(titanic) # Variable names
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Fare"
str(titanic) # Returns the dimensions, the names, and class of each variable
## 'data.frame': 891 obs. of 9 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num 725 7128 792 5310 805 ...
View(titanic) # Browse the dataset. Note that each observation is a passenger.
To get a sense of the potential relationships that may exist in the data, we can explore the variables by looking at their distribution. Usually, we have research questions and hypotheses (preliminary answers to these questions based on theory) that guide our exploration.
Contingency tables are used to explore factors (categorical variables).
In this section, we will answer three questions: 1) What is the overall rate of survival? 2) What is the rate of survival by class? 3) What is the rate of survival among females?
# Get a table with the frequency of survivors and fatalities
table(titanic$Survived)
##
## 0 1
## 549 342
# Get the proportions instead of the frequencies
table(titanic$Survived) / sum(table(titanic$Survived)) # Using the sum() function to get the total, and dividing the frequencies element-wise by this sum
##
## 0 1
## 0.6161616 0.3838384
# Let's break the code in chunks
survived_freq <- table(titanic$Survived) # create an object with the frequencies
total_passengers <- sum(table(titanic$Survived)) # create an object with the total
survived_freq / total_passengers # divide the frequencies by the total
##
## 0 1
## 0.6161616 0.3838384
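The same proportions can be obtained in one step with prop.table(). The sketch below rebuilds the counts reported above (549 and 342) in a stand-alone vector called `survived`, which is an assumption made so that the example runs without the titanic data.

```r
# Rebuild the Survived counts reported above (549 fatalities, 342 survivors)
survived <- factor(rep(c("0", "1"), times = c(549, 342)))

survived_freq <- table(survived)
prop.table(survived_freq)                  # same result as dividing by the sum
round(100 * prop.table(survived_freq), 1)  # as percentages: 61.6 and 38.4
```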
We can also look at bivariate distributions, i.e. explore potential relationships between two variables. In this example, we will look at the relationship between passenger class and survival.
# Get a table with the frequency of survivors and fatalities, crossed with passenger class
table(titanic$Pclass, titanic$Survived) # the first factor supplied is displayed as rows, the second as columns
##
## 0 1
## 1 80 136
## 2 97 87
## 3 372 119
# Get the share of each class among fatalities and among survivors
survived_class <- table(titanic$Pclass, titanic$Survived) # create a table object
prop.table(survived_class, margin = 2) # using the prop.table() function with the margin argument set to 2 to get column proportions. For row proportions, we would set it to 1.
##
## 0 1
## 1 0.1457195 0.3976608
## 2 0.1766849 0.2543860
## 3 0.6775956 0.3479532
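For comparison, here is a sketch of the row version. It rebuilds the class-by-survival counts shown above as a stand-alone table, an assumption made so that the example runs without the titanic data, and sets margin to 1 so that each row (class) sums to 1.

```r
# Rebuild the class-by-survival counts shown above
survived_class <- as.table(matrix(
  c(80, 97, 372, 136, 87, 119),
  nrow = 3,
  dimnames = list(Pclass = c("1", "2", "3"), Survived = c("0", "1"))
))

row_props <- prop.table(survived_class, margin = 1)  # proportions within each class
round(row_props, 3)  # column "1": about 0.630, 0.473 and 0.242
rowSums(row_props)   # every row sums to 1
```

Read this way, the column labelled 1 gives the rate of survival within each passenger class.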
If we set the margin argument at 2, what kind of question could we answer?
What is the proportion of survivors among females?
We can explore the distribution of numeric variables by looking at descriptive statistics (minimum, maximum, mean, median, quartiles).
Just as the median ranks observations on a variable and splits them into two groups with the same number of observations, quartiles rank observations and split them into four equal groups.
# Use the summary() function to get the descriptive statistics of the Age variable
summary(titanic$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 20.12 28.00 29.70 38.00 80.00 177
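To see how the median and the quartiles split a variable, here is a small sketch on a made-up vector of nine values. The vector `x` is hypothetical and has no missing values.

```r
# Hypothetical values, already sorted to make the splits easy to see
x <- 1:9

median(x)    # 5: four values lie below and four above
quantile(x)  # 0%: 1, 25%: 3, 50%: 5, 75%: 7, 100%: 9
summary(x)   # the same statistics, plus the mean (5)
```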
What is the range of the age variable?
What is the median number of family members?
Variables do not always have the class required for the data analysis that we want to do. When this is the case, we can convert them.
There are two new classes of variables in the titanic dataset: integer and factor. Factors are variables that are categorical. Integers are whole numbers. A special type of integer variable, the dichotomous variable, takes a value of 0 or 1 to indicate success or failure (e.g. success at surviving, failure at surviving).
# From character to factor
titanic$Sex <- as.factor(titanic$Sex)
class(titanic$Sex)
## [1] "factor"
levels(titanic$Sex) # factors have levels, whose order matters in some kinds of data analysis
## [1] "female" "male"
# From factor to numeric
titanic$Survived_dicho <- as.numeric(titanic$Survived) # This is the wrong way
unique(titanic$Survived_dicho) # the variable has taken the positions of its previous levels, 1 & 2...oops!
## [1] 1 2
titanic$Survived_dicho <- as.numeric(as.character(titanic$Survived)) # You need to convert to character before you convert to numeric, otherwise the variable will take levels as values.
unique(titanic$Survived_dicho)
## [1] 0 1
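The pitfall is easier to see with labels that are far from 1 and 2. The following sketch uses a hypothetical factor `f`; the last line shows an alternative that converts the levels once and then indexes them by the factor.

```r
# Hypothetical factor whose labels look like numbers
f <- factor(c("10", "20", "10", "30"))

as.numeric(f)                # 1 2 1 3: the internal level codes
as.numeric(as.character(f))  # 10 20 10 30: the labels, as intended
as.numeric(levels(f))[f]     # 10 20 10 30: same result via the levels
```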
We can manipulate data to suit our purposes by creating new variables from existing ones, renaming existing variables, or recoding existing variables.
# Rename "SibSp" & "Parch" variables
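names(titanic)[match(c("SibSp", "Parch"), names(titanic))] <- c("Sibling_spouse", "Parent_children") # match() finds the position of each old name, so each one is replaced by its new name. Comparing the names to a vector with == would rely on recycling and only works when the positions happen to line up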
# Create a variable that measures the total number of family relatives
titanic$Family <- titanic$Sibling_spouse + titanic$Parent_children
# Recode the Fare variable to change the unit from cents to dollars
titanic$Fare <- titanic$Fare / 100 # Note that this overwrites an existing variable
# Check the changes
head(titanic) # Only shows the first few observations
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age
## 1 Braund, Mr. Owen Harris male 22
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38
## 3 Heikkinen, Miss. Laina female 26
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35
## 5 Allen, Mr. William Henry male 35
## 6 Moran, Mr. James male NA
## Sibling_spouse Parent_children Fare Survived_dicho Family
## 1 1 0 7.2500 0 1
## 2 1 0 71.2833 1 1
## 3 0 0 7.9250 1 0
## 4 1 0 53.1000 1 1
## 5 0 0 8.0500 0 0
## 6 0 0 8.4583 0 0
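When renaming several variables at once, it is safer to look up the old names by position than to compare the names to a vector with ==, because == recycles the shorter vector along the longer one. A minimal sketch on a hypothetical dataframe `toy_df`:

```r
# Hypothetical data, for illustration only
toy_df <- data.frame(id = 1:2, SibSp = c(1, 0), Parch = c(0, 2), fare = c(7.25, 8.05))

# == recycles the short vector along the names, so nothing lines up here
names(toy_df) == c("SibSp", "Parch")  # FALSE FALSE FALSE FALSE

# match() returns the position of each old name, whatever the column order
old_names <- c("SibSp", "Parch")
new_names <- c("Sibling_spouse", "Parent_children")
names(toy_df)[match(old_names, names(toy_df))] <- new_names

names(toy_df)  # "id" "Sibling_spouse" "Parent_children" "fare"
```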
When we need to assign a value to a variable conditional on another variable taking a certain value, we can use logical operators.
In this example, we will create a new dichotomous sex variable using the existing Sex factor variable.
titanic$Sex_dicho[titanic$Sex == "female"] <- 1 # Here, success is defined as "female"... I had to choose, no offense ;-) !
titanic$Sex_dicho[titanic$Sex == "male"] <- 0
# Check the changes
unique(titanic$Sex_dicho)
## [1] 0 1
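The same recoding can be written in one step with ifelse(), which takes a condition, the value to use when it is TRUE, and the value to use when it is FALSE. The vector `sex` below is made up for this sketch.

```r
# Hypothetical factor, for illustration only
sex <- factor(c("male", "female", "female", "male"))

# ifelse(test, yes, no) recodes every observation in one step
sex_dicho <- ifelse(sex == "female", 1, 0)
sex_dicho              # 0 1 1 0

# Cross-tabulate the old and new values to check the recoding
table(sex, sex_dicho)
```

A cross-tabulation like this is a quick check that every level ended up with the value we intended.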
Oftentimes, data is missing on some variables for some observations. Missing data can create problems when using functions.
# Get the average age
mean(titanic$Age) # Returns NA because of missing values
## [1] NA
# We can set the argument na.rm to TRUE to tell R to ignore missing values
mean(titanic$Age, na.rm = TRUE)
## [1] 29.69912
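Before dropping missing values, it helps to know how many there are. The sketch below uses is.na() on a made-up vector `ages`.

```r
# Hypothetical ages with two missing values
ages <- c(22, 38, NA, 35, NA)

is.na(ages)               # FALSE FALSE TRUE FALSE TRUE
sum(is.na(ages))          # 2 missing values
mean(is.na(ages))         # 0.4, the share of missing values
mean(ages, na.rm = TRUE)  # 31.66667, the mean of 22, 38 and 35
```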
Save your data FREQUENTLY! And remember to save it before quitting.
save(titanic, file = "titanic_data.Rda") # save a particular dataframe
save.image(file = "tutorial.RData") # save the whole environment. Note the .RData extension
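To check that a saved file can be read back, we can save an object, remove it, and load it again. This sketch uses a hypothetical dataframe `toy_df` and writes to a temporary folder rather than to the working directory.

```r
# Hypothetical data saved to a temporary file
toy_df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
rda_path <- file.path(tempdir(), "toy_data.Rda")

save(toy_df, file = rda_path)  # write the object to disk
rm(toy_df)                     # remove it from the environment
loaded <- load(rda_path)       # load() restores the object under its saved name

loaded                         # "toy_df": load() returns the names it restored
nrow(toy_df)                   # 3
```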
This work by Sarah Lachance is licensed under CC BY-NC-ND 4.0