Merging Data

Recitation 3

Rithika Kumar

September 19, 2019

Goal for today

  1. Review knitting on R Markdown
  2. Merging data
  3. Exercise
  4. Q&A

Reviewing R Markdown

Opening an R Markdown file in R Studio

  1. Open up Rstudio
  2. Create a new Markdown file and while creating it choose PDF in the pop-up
  3. Save the file
  4. R Markdown is saved as a .Rmd file
  5. knitr() then converts to Pandoc which then converts it to a Word/PDF/Html etc.

Writing functions within R Markdown

  1. Code chunks start with the following line: ```{r} and end with ```.
  2. You can quickly insert chunks into your R Markdown file with the keyboard shortcut Cmd + Option + I (Windows Ctrl + Alt + I).
  3. Let’s try to import our dataset by typing the command within the code chunk

Your code should look something like this:

getwd()
setwd("/Users/rithika/Google\ Drive/Penn/TA/Intro\ to\ DS/Rk_Recitation/Data") 
#insert the path of the directory of your choice
country.profile <- read.csv("country_profile_variables.csv")

Other things to know about R Markdown

  1. In case you don’t want the code chunk to be visible in your document you can use the following syntax: ```{r, echo = FALSE} and end with ```

  2. Using # creates a new section while ## creates a sub-sub section Eg. # Answer 1.1 –> this will create a new section within your document

  3. Once you are done writing, you must knit your document using the Knit button at the top. Click on it and knit to PDF.

  4. Let’s now go over the example on the screen to better understand how this is done.

Merging data

Download the data

  1. Download the data lfp1.csv and *lfp2.csv from the Canvas website
  2. Set your working directory and make sure the file you just downloaded is there
  3. Load them into R Studio
setwd("/Users/rithika/Google\ Drive/Penn/TA/Intro\ to\ DS/Rk_Recitation/Data") 

lfp1 <- read.csv("lfp1.csv")
lfp2 <- read.csv("lfp2.csv")

Our goal: Merge these two datasets

We want to see if there is a unique ID that links both these datasets

head(lfp1)
head(lfp2)

Merging the data

It looks like the first column in these two datasets is the same. The next thing we want to do is check whether these variables have the same class using class().

class(lfp1)
class(lfp2)

In order to merge the data, we need to ensure that the unique identifier has the same name in both datasets and for this we need to rename one of them

# Let's rename the one in lfp2

lfp2$id <- lfp2$id.no

Checking the dimensions

# Verifying that the number of unique IDs is similar to the number of rows
length(unique(lfp2$id))==dim(lfp2)[1]

# How many passengers in this dataset?
dim(lfp2)[1]

# Verifying that the number of unique IDs is similar to the number of rows
length(unique(lfp1$id))==dim(lfp1)[1]

# How many passengers in this dataset?
dim(lfp1)[1]

However, we now see that number of unique IDs in lfp1 is lesser than that in lfp2. This will result in NAs when we merge Which might be needed to be deleted after the merge

Merging the data!!

library(dplyr)
# Merging using merge()
lfp.merge <- merge(lfp1,lfp2,by="id")

# Merging using right_join()
lfp.merge <- right_join(lfp1,lfp2,by="id")

# Checking the dimensions of the new dataset
dim(lfp.merge)

# Remove old datasets (if needed)
#rm(lfp1)
#rm(lfp2)

Merged data

We have successfully merged both datasets! Let’s take a look at the content of our new dataset.

Before that let’s clear all the NAs

# Removing all rows with missing values using the na.omit() function (we call this "listwise deletion")
lfp.clean <- na.omit(lfp.merge)
# Looking at the content of the first five rows using head()
head(lfp.merge,5)

As you see, we have all of the information about our respondents within one dataframe rather than it being split over two separate df

Exercise

  1. Using the lfp1 and lfp2 datasets, identify if they have a unique ID between them (you already did this if you followed recitation)

  2. Use the left_join() command to merge lfp1 and lfp2 and identify the difference in this dataset from the dataset formed using right_join. Hint: the summary() command might provide some insights

  3. Similarly, use the full_join() and inner_join() to merge lfp1 and lfp2 and get the dimensions of the resulting datasets.