Overview

  1. Converting a Matrix to a Dataframe
  2. Subsetting a Dataframe
  3. Deleting Data
  4. Importing Data
  5. Previewing Data
  6. Exploring Data
  7. Converting Data Classes
  8. Creating, Renaming, Recoding Variables
  9. Missing Values
  10. Saving Data

Converting a Matrix to a Dataframe

Dataframes have several properties that make them more user-friendly than matrices, including easier data access and the ability to use some functions that require dataframes. Dataframes are essentially matrices with higher-level properties. That means you can apply the indexing commands that you have learned with matrices to dataframes.

When talking about dataframes, we will refer to rows as observations and columns as variables. This language is used when conducting data analysis. Dataframes have multiple observations (e.g. days) on which variables (e.g. number of classes, type of protein, etc.) were measured.

my_classes_df <- data.frame(my_classes, stringsAsFactors = FALSE) # We want to keep the original classes of the data intact (not change strings, or character vectors, to factors)
meal_plan_df <- data.frame(meal_plan, stringsAsFactors = FALSE) 
class(meal_plan_df)
## [1] "data.frame"
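
As a minimal, self-contained sketch of the same conversion, the example below builds a small character matrix and converts it. The object names (toy_matrix, toy_df) and the values are made up for illustration and are not part of the tutorial data.

```r
# Hypothetical character matrix: 3 observations (days), 2 variables
toy_matrix <- matrix(
  c("Monday", "Tuesday", "Wednesday",
    "Tofu", "Steak", "Fish"),
  nrow = 3,
  dimnames = list(NULL, c("day", "protein"))
)

# Convert to a dataframe, keeping strings as character vectors
toy_df <- data.frame(toy_matrix, stringsAsFactors = FALSE)

class(toy_df)          # "data.frame"
class(toy_df$protein)  # "character"
dim(toy_df)            # 3 rows, 2 columns
```

Because the matrix has column names, they become the variable names of the dataframe.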

Subsetting a Dataframe

Subsetting a dataframe can be done the same way as with matrices. However, when using dataframes it is often more convenient to use the $ operator to access variables by name.

# Find on which day you will eat steak using the matrix way
eat_steak <- meal_plan_df[, "protein"] == "Steak" 
meal_plan_df[eat_steak, "day"] 
## [1] "Tuesday"
# Find on which day you will eat steak using $
meal_plan_df$day[meal_plan_df$protein == "Steak"] # $ extracts a variable from the dataframe by name
## [1] "Tuesday"
# Count the number of days with classes
length(which(my_classes_df[,2] > 0)) # using indexing
## [1] 3
length(which(my_classes_df$class > 0)) # using $
## [1] 3
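
To see that both approaches select the same values, here is a short sketch on a made-up dataframe (toy_df and its contents are hypothetical, chosen only for illustration).

```r
# Hypothetical dataframe, for illustration only
toy_df <- data.frame(
  day = c("Monday", "Tuesday", "Wednesday"),
  classes = c(2, 0, 1),
  stringsAsFactors = FALSE
)

# Matrix-style indexing: rows that meet the condition, then the column
by_index <- toy_df[toy_df[, "classes"] > 0, "day"]

# The same selection with $
by_dollar <- toy_df$day[toy_df$classes > 0]

identical(by_index, by_dollar)  # TRUE: both give "Monday" "Wednesday"

# sum() on a logical vector counts the TRUE values,
# which is an alternative to length(which(...))
sum(toy_df$classes > 0)         # 2
```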

Exercise 1

Find what protein you will eat on the day that you eat avocado using $.

Deleting Data

Sometimes, we need to delete observations or variables in a dataframe, or objects from the environment. Be careful when overwriting or removing existing objects. If you haven’t saved a previous version, you won’t be able to access them anymore.

# Delete a variable
meal_plan_df$dessert <- NULL # Overwrites meal_plan_df. Deletes the dessert variable
meal_plan_df2 <- meal_plan_df[, c(1:3)] # Creates a new dataset with subsetting; using the colon as "to". Deletes the drink variable

# Delete observations
meal_plan_df2 <- meal_plan_df2[c(1:3), ] # Deletes the last two observations (days)

# Remove an object from the environment
rm(my_classes_df)
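
Another common way to drop variables or observations is negative indexing, which keeps everything except the positions listed. The sketch below uses a made-up dataframe (toy_df is hypothetical) and saves the results under new names so the original stays available.

```r
# Hypothetical dataframe, for illustration only
toy_df <- data.frame(
  day = c("Mon", "Tue", "Wed", "Thu"),
  protein = c("Tofu", "Steak", "Fish", "Eggs"),
  drink = c("Water", "Tea", "Juice", "Milk"),
  stringsAsFactors = FALSE
)

# Drop the third variable (drink) by position
no_drink <- toy_df[, -3]

# Drop the last two observations
first_two <- toy_df[-c(3, 4), ]

names(no_drink)   # "day" "protein"
nrow(first_two)   # 2

# The original object is unchanged because the results have new names
dim(toy_df)       # 4 3
```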

Importing Data

Typically, you will work with existing datasets. Often, datasets are not in an R dataframe format, but they can be converted using a built-in function. When working with data, it is best practice to set your working directory first. This ensures that all the data and scripts are saved in the same location.

# Get and Set the working directory 
getwd() # get the current directory
## [1] "/Users/sarahlachance/Documents/Tutorial/Math camp/Programming part - 2021"
setwd("/Users/sarahlachance/Documents/Tutorial/Math camp/Programming part - 2021") # Set the directory to a new location. Assumes that you have created such a folder beforehand. For the present tutorial, the path should be "/cloud/project/Day_2/"
getwd()
## [1] "/Users/sarahlachance/Documents/Tutorial/Math camp/Programming part - 2021"
# Import R data 
load("titanic.Rda") # Assumes that you have a dataset with that name in your working directory. .Rda (or .RData) is the usual extension for saved R objects
load("/Users/sarahlachance/Documents/Tutorial/Math camp/Programming part - 2021/titanic.Rda") # Otherwise, you can include the full path to the file

# Load other data types
police <- read.csv("police-fair-2009.csv") # use function read.[format]. csv is the most common format
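
The import step can be tried without any of the tutorial files. The sketch below writes a made-up dataframe to a temporary csv file and reads it back; toy_df, csv_path, and imported are hypothetical names used only for this illustration.

```r
# Hypothetical data written to a temporary file, so the example does not
# depend on any file in your working directory
toy_df <- data.frame(id = 1:3, city = c("A", "B", "C"), stringsAsFactors = FALSE)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy_df, csv_path, row.names = FALSE)

# Check that the file is where we expect before importing it
file.exists(csv_path)

# Import the csv file as a dataframe
imported <- read.csv(csv_path, stringsAsFactors = FALSE)
dim(imported)    # 3 rows, 2 columns
names(imported)  # "id" "city"
```

Checking file.exists() first can help tell a wrong path apart from a problem with the file itself.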

Previewing Data

Today we will work with data on the Titanic. Here is some background information (Click here for the source):

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

First, let’s preview the data.

dim(titanic) # Dimensions (number of rows/observations and columns/variables)
## [1] 891   9
names(titanic) # Variable names
## [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
## [6] "Age"         "SibSp"       "Parch"       "Fare"
str(titanic) # Returns the dimensions, the names, and class of each variable
## 'data.frame':    891 obs. of  9 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Fare       : num  725 7128 792 5310 805 ...
View(titanic) # Browse the dataset. Note that each observation is a passenger.
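
The same preview functions work on any dataframe. As a small sketch, the example below uses iris, a dataset that ships with R, as a stand-in; the object names (n_obs, n_vars, first_rows, last_rows) are made up for illustration.

```r
# iris is a built-in dataset; it stands in for your own data here
n_obs  <- nrow(iris)   # number of observations
n_vars <- ncol(iris)   # number of variables

first_rows <- head(iris, n = 3)   # first 3 observations
last_rows  <- tail(iris, n = 3)   # last 3 observations

c(n_obs, n_vars)        # 150 5
sapply(iris, class)     # class of each variable
```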

Exercise 2

  1. What are the dimensions of the police dataset?
  2. What are the observations in the police dataset (on what were measurements made)?
  3. What is the class of the N variable?

Exploring Data

Factor Variables: Contingency Tables

To get a sense of the potential relationships that may exist in the data, we can explore the variables by looking at their distribution. Usually, we have research questions and hypotheses (preliminary answers to these questions based on theory) that guide our exploration.

Contingency tables are used to explore factors (categorical variables).

In this section, we will answer the questions: 1) What is the overall rate of survival?; 2) What is the rate of survival by class?; 3) What is the rate of survival among females?

# Get a table with the frequency of survivors and fatalities
table(titanic$Survived)
## 
##   0   1 
## 549 342
# Get the proportions instead of the frequencies
table(titanic$Survived) / sum(table(titanic$Survived)) # Using the sum() function to get the total, and dividing the frequencies element-wise by this sum
## 
##         0         1 
## 0.6161616 0.3838384
# Let's break the code in chunks
survived_freq <- table(titanic$Survived) # create an object with the frequencies
total_passengers <- sum(table(titanic$Survived)) # create an object with the total
survived_freq / total_passengers # divide the frequencies by the total
## 
##         0         1 
## 0.6161616 0.3838384
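
The division above can also be done with prop.table(), which the next part of this section uses for two-way tables. Here is a minimal sketch on a made-up vector (outcome is hypothetical, not the Titanic data).

```r
# Hypothetical outcome vector: 0 = no, 1 = yes
outcome <- c(0, 1, 1, 0, 0)

freq <- table(outcome)
prop <- prop.table(freq)   # same result as freq / sum(freq)

prop                       # 0.6 for "0", 0.4 for "1"
round(100 * prop, 1)       # the same proportions as percentages
```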


We can also look at bivariate distributions, i.e. explore potential relationships between two variables. In this example, we will look at the relationship between passenger class and survival.

# Get a table with the frequency of survivals and fatalities, crossed with class frequencies.
table(titanic$Pclass, titanic$Survived) # the first factor supplied is displayed as rows, the second as columns
##    
##       0   1
##   1  80 136
##   2  97  87
##   3 372 119
# Get the distribution of classes among fatalities and among survivors
survived_class <- table(titanic$Pclass, titanic$Survived)  # create a table object
prop.table(survived_class, margin = 2) # using the prop.table() function with the argument margin set at 2 to get column proportions (each column sums to 1). For row proportions (each row sums to 1), we would set it at 1.
##    
##             0         1
##   1 0.1457195 0.3976608
##   2 0.1766849 0.2543860
##   3 0.6775956 0.3479532
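
A quick way to check which margin was used is to sum the rows or columns of the result. The sketch below does this on made-up data; group, outcome, and the other object names are hypothetical.

```r
# Hypothetical data: a grouping variable and a 0/1 outcome
group   <- c("a", "a", "a", "b", "b")
outcome <- c(0, 1, 1, 0, 0)
tab <- table(group, outcome)   # rows = group, columns = outcome

by_row <- prop.table(tab, margin = 1)   # proportions within each row
by_col <- prop.table(tab, margin = 2)   # proportions within each column

rowSums(by_row)  # each row sums to 1
colSums(by_col)  # each column sums to 1
```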

Question 1

If we set the margin argument at 2, what kind of question could we answer?

Exercise 3

What is the proportion of survivors among females?

Numeric Variables: Descriptive Statistics

We can explore the distribution of numeric variables by looking at descriptive statistics (minimum, maximum, mean, median, quartiles).

Just as the median ranks the observations on a variable and splits them into two groups with the same number of observations, quartiles rank the observations and split them into four equal groups.

# Use the summary() function to get the descriptive statistics of the Age variable
summary(titanic$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177
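
To make the link between the median and the quartiles concrete, here is a small sketch on a made-up vector (ages is hypothetical). It uses quantile() with R's default settings.

```r
# Hypothetical ages, already sorted for readability
ages <- c(10, 20, 30, 40, 50)

median(ages)                           # 30: half of the values lie on each side
quantile(ages)                         # 0%, 25%, 50%, 75%, 100%: 10 20 30 40 50
quantile(ages, probs = c(0.25, 0.75))  # first and third quartiles: 20 40
```

The 50% quantile is the median, and the 25% and 75% quantiles are the values reported as 1st Qu. and 3rd Qu. by summary().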

Question 2

What is the range of the age variable?

Exercise 4

What is the median number of family members?

Converting Data Classes

Variables do not always have the class required for the data analysis that we want to do. When that is the case, we can convert them.

There are two new classes of variables in the titanic dataset: integer and factor. Factors are variables that are categorical. Integers are whole numbers. A special type of integer variables, dichotomous variables, take a value of 0 or 1 to indicate success or failure (e.g. success at surviving, failure at surviving).

# From character to factor
titanic$Sex <- as.factor(titanic$Sex)
class(titanic$Sex)
## [1] "factor"
levels(titanic$Sex) # factors have levels, whose order matters when doing some kinds of data analysis
## [1] "female" "male"
# From factor to numeric
titanic$Survived_dicho <- as.numeric(titanic$Survived) # This is the wrong way
unique(titanic$Survived_dicho) # the variable has taken the integer codes of its levels (1 & 2) instead of the values 0 & 1...oops!
## [1] 1 2
titanic$Survived_dicho <- as.numeric(as.character(titanic$Survived)) # You need to convert to character before you convert to numeric, otherwise the variable will take the level codes as values.
unique(titanic$Survived_dicho) 
## [1] 0 1
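
The pitfall is easier to see when the level labels are far from their internal codes. The sketch below uses a made-up factor (f is hypothetical) whose labels are "10" and "20".

```r
# Hypothetical factor whose labels look like numbers
f <- factor(c("10", "20", "10"))

levels(f)                     # "10" "20"
as.numeric(f)                 # 1 2 1: the internal level codes
as.numeric(as.character(f))   # 10 20 10: the values we wanted
```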

Creating, Renaming, Recoding Variables

We can manipulate data to suit our purposes by creating new variables from existing ones, renaming existing variables, or recoding existing variables.

# Rename "SibSp" & "Parch" variables
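names(titanic)[match(c("SibSp", "Parch"), names(titanic))] <- c("Sibling_spouse", "Parent_children") # match() finds the position of each old name, so each new name goes to the right variable. Comparing names(titanic) to a vector of names with == relies on recycling and only works when the positions happen to line up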

# Create a variable that measures the total number of family relatives 
titanic$Family <- titanic$Sibling_spouse + titanic$Parent_children
  
# Recode the Fare variable to change the unit from cents to dollars
titanic$Fare <- titanic$Fare / 100 # Note that this overwrites an existing variable

# Check the changes
head(titanic) # Only shows the first few observations
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age
## 1                             Braund, Mr. Owen Harris   male  22
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38
## 3                              Heikkinen, Miss. Laina female  26
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35
## 5                            Allen, Mr. William Henry   male  35
## 6                                    Moran, Mr. James   male  NA
##   Sibling_spouse Parent_children    Fare Survived_dicho Family
## 1              1               0  7.2500              0      1
## 2              1               0 71.2833              1      1
## 3              0               0  7.9250              1      0
## 4              1               0 53.1000              1      1
## 5              0               0  8.0500              0      0
## 6              0               0  8.4583              0      0
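
Renaming several variables by name, rather than by position, can be sketched on a made-up dataframe as follows. The object names (toy_df, old_names, new_names) are hypothetical; match() is used here because it returns the position of each old name regardless of where the column sits.

```r
# Hypothetical dataframe with two variables to rename
toy_df <- data.frame(id = 1:2, SibSp = c(1, 0), Parch = c(0, 2))

old_names <- c("SibSp", "Parch")
new_names <- c("Sibling_spouse", "Parent_children")

# Stop early if one of the old names is not in the dataframe
stopifnot(all(old_names %in% names(toy_df)))

# match() gives the column position of each old name, in order
names(toy_df)[match(old_names, names(toy_df))] <- new_names

names(toy_df)  # "id" "Sibling_spouse" "Parent_children"
```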

Creating and Recoding Variables with Logical Operators

When we need to assign a value to a variable conditional on another variable taking a certain value, we can use logical operators.

In this example, we will create a new dichotomous sex variable using the existing Sex factor variable.

titanic$Sex_dicho[titanic$Sex == "female"] <- 1 # Here, success is defined as "female"... I had to choose, no offense ;-) !
titanic$Sex_dicho[titanic$Sex == "male"] <- 0

# Check the changes
unique(titanic$Sex_dicho)
## [1] 0 1
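
An alternative for a two-category recode is ifelse(), which tests a condition for every observation and returns one value where it is TRUE and another where it is FALSE. This is a sketch on a made-up vector (sex and sex_dicho are hypothetical), not a replacement for the approach above.

```r
# Hypothetical factor, for illustration only
sex <- factor(c("male", "female", "female", "male"))

# 1 where the condition is TRUE, 0 where it is FALSE
sex_dicho <- ifelse(sex == "female", 1, 0)

sex_dicho              # 0 1 1 0
table(sex, sex_dicho)  # cross-check the recode against the original variable
```

Cross-tabulating the old and new variables is a simple way to confirm that every category was recoded as intended.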

Missing Values

Oftentimes, data is missing on some variables for some observations. Missing data can create problems when using functions.

# Get the average age
mean(titanic$Age) # Returns NA because of missing values
## [1] NA
# We can set the argument na.rm to TRUE to tell R to ignore missing values
mean(titanic$Age, na.rm = TRUE)
## [1] 29.69912
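
Before dropping missing values, it helps to know how many there are. The sketch below uses is.na() on a made-up vector (ages is hypothetical) to find, count, and filter them.

```r
# Hypothetical ages with two missing values
ages <- c(22, 38, NA, 35, NA)

is.na(ages)               # FALSE FALSE TRUE FALSE TRUE
sum(is.na(ages))          # 2 missing values
ages[!is.na(ages)]        # 22 38 35: only the observed values

mean(ages, na.rm = TRUE)  # mean of the observed values only
```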

Saving Data

Save your data FREQUENTLY! And remember to save it before quitting.

save(titanic,  file = "titanic_data.Rda") # save a particular dataframe 
save.image(file = "tutorial.RData") # save the whole environment. Note the .RData extension
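
To check that a saved file can be restored, you can save an object, remove it, and load it back. The sketch below does this with a made-up dataframe and a temporary file; toy_df, rda_path, and loaded_names are hypothetical names.

```r
# Hypothetical dataframe saved to a temporary file
toy_df <- data.frame(x = 1:3)
rda_path <- tempfile(fileext = ".Rda")
save(toy_df, file = rda_path)

# Remove the object, then restore it from the file
rm(toy_df)
loaded_names <- load(rda_path)  # load() returns the names of the restored objects

loaded_names  # "toy_df"
toy_df$x      # 1 2 3
```

Note that load() restores objects under the names they had when saved, so it can overwrite an object with the same name in your environment.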

This work by Sarah Lachance is licensed under CC BY-NC-ND 4.0