Jose M. Fernandez
September 11, 2018
Merge - adds variables to a dataset. This document will use -merge- function. Merging two datasets require that both have at least one variable in common (either string or numeric). If string make sure the categories have the same spelling (i.e. country names, etc.).
Explore each dataset separately before merging. Make sure to use all possible common variables (for example, if merging two panel datasets you will need country and years).
# merge two data frames by ID
total <- merge(data.frameA,data frameB,by="ID")
# merge two data frames by ID and Country
total <- merge(data.frameA,data frameB,by=c("ID","Country"))An inner join of df1 and df2: Return only the rows in which the left table have matching keys in the right table.
An outer join of df1 and df2: Returns all rows from both tables, join records from the left which have matching keys in the right table.
A left outer join (or simply left join) of df1 and df2 Return all rows from the left table, and any rows with matching keys from the right table.
A right outer join of df1 and df2 Return all rows from the right table, and any rows with matching keys from the left table.
df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))
inner.df<-merge(df1, df2)
outer.df<-merge(x = df1, y = df2, by = "CustomerId", all = TRUE)
left.df<-merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)
right.df<-merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)
cross.df<-merge(x = df1, y = df2, by = NULL)Append - adds cases/observations to a dataset. This document will use the -rbind- function.
Appending two datasets require that both have variables with exactly the same name. If using categorical data make sure the categories on both datasets refer to exactly the same thing (i.e. 1 “Agree”, 2“Disagree”, 3 “DK” on both).
Sometimes you get tables by year, but you want a panel dataset. So one way to merge the files is by stacking them.
# Stacking ----------------------------------------------------------------
# Let's generate some fake data to illustrate combining data frames by stacking.
first <- data.frame(x0=1:5,
x1=rnorm(5),
x2=c("M","F","M","F","F"))
second <- data.frame(x0=10:14,
x1=rnorm(5),
x2=c("M","F","M","F","F"))
third <- data.frame(x4=c(3,3,1,3,2),
x5=c("e","g","v","b","z"))# We can use the rbind() function to stack data frames. Make sure the number of
# columns match. Also, the names and classes of values being joined must match.
# Here we stack the first, the second and then the first again:
rbind(first, second, first)## x0 x1 x2
## 1 1 -0.3054741 M
## 2 2 0.5675513 F
## 3 3 -0.3136784 M
## 4 4 1.5195405 F
## 5 5 1.8820481 F
## 6 10 1.3836129 M
## 7 11 -0.3645436 F
## 8 12 -0.7598269 M
## 9 13 -0.9192285 F
## 10 14 0.3347758 F
## 11 1 -0.3054741 M
## 12 2 0.5675513 F
## 13 3 -0.3136784 M
## 14 4 1.5195405 F
## 15 5 1.8820481 F
class(rbind(first, second, first)) # still a data frame## [1] "data.frame"
## [1] "data.frame"
# works with vectors too:
rbind(1:3,4:6)## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
class(rbind(1:3,4:6)) # matrix## [1] "matrix"
By adding columns: If the two sets of data have an equal set of rows, and the order of the rows is identical, then adding columns makes sense. Your options for doing this are data.frame or cbind().
# Side-by-side ------------------------------------------------------------
# Use the cbind function to combine data frames side-by-side:
cbind(first,third)## x0 x1 x2 x4 x5
## 1 1 -0.3054741 M 3 e
## 2 2 0.5675513 F 3 g
## 3 3 -0.3136784 M 1 v
## 4 4 1.5195405 F 3 b
## 5 5 1.8820481 F 2 z
class(cbind(first,third))## [1] "data.frame"
The function cbind does not require the vectors to be of the same size. The shorter vector will get recycled.
cbind(rbind(first,second),third)## x0 x1 x2 x4 x5
## 1 1 -0.3054741 M 3 e
## 2 2 0.5675513 F 3 g
## 3 3 -0.3136784 M 1 v
## 4 4 1.5195405 F 3 b
## 5 5 1.8820481 F 2 z
## 6 10 1.3836129 M 3 e
## 7 11 -0.3645436 F 3 g
## 8 12 -0.7598269 M 1 v
## 9 13 -0.9192285 F 3 b
## 10 14 0.3347758 F 2 z
# This is NOT true with rbindMerger, Append, and Join
We are going to scrape the top 100 movies from IMBD.
There are four steps to webscrapping 1. Extract - obtaining the data - Library to use rvest 2. Preprocess - Cleaning the data - Library to use stringr 3. Analyze - Using the data - Libraries to use dplyr, ggplot2
Web scraping is a technique for converting the data present in unstructured format (HTML tags) over the web to the structured format which can easily be accessed and used.
Almost all the main languages provide ways for performing web scrapping. In these slides, we’ll use R for scrapping the data for the most popular feature films from the IMDb website.
We’ll get a number of features for each of the 100 popular feature films. Also, we’ll look at the most common problems that one might face while scrapping data from the internet because of lack of consistency in the website code and look at how to solve these problems.
limiting the number of calls your ip can make to the website
Your webscrap program will need to be modified if the website is updated.
I am sure the first questions that must have popped in your head till now is “Why do we need web scraping”? As I stated before, the possibilities with web scraping are immense.
To provide you with hands-on knowledge, we are going to scrap data from IMDB. Some other possible applications that you can use web scrapping for are:
There are several ways of scraping data from the web. Some of the popular ways are:
We’ll use the DOM parsing approach during the course of this article. And rely on the CSS selectors of the webpage for finding the relevant fields which contain the desired information. But before we begin there are a few prerequisites that one need in order to proficiently scrap data from any website.
install.packages('rvest')Now, let’s get started with scraping the IMDb website for the 100 most popular feature films released in 2016. You can access them here.
#Loading the rvest package
library('rvest')## Loading required package: xml2
library('stringr')
#Specifying the url for desired website to be scrapped
url <- 'https://www.imdb.com/list/ls055592025/'
#Reading the HTML code from the website
page <- read_html(url)Step 1: Now, we will start with scraping the Rank field. For that, we’ll use the selector gadget to get the specific CSS selectors that encloses the rankings. You can click on the extension in your browser and select the rankings field with cursor.
Make sure that all the rankings are selected. You can select some more ranking sections in case you are not able to get all of them and you can also de-select them by clicking on the selected section to make sure that you only have those sections highlighted that you want to scrap for that go.
Step 2: Once you are sure that you have made the right selections, you need to copy the corresponding CSS selector that you can view in the bottom center.
Step 3: Once you know the CSS selector that contains the rankings, you can use this simple R code to get all the rankings:
#Using CSS selectors to scrap the rankings section
movie.rank <- html_nodes(page,'.text-primary')
#Converting the ranking data to text
movie.rank <- html_text(movie.rank)
#Let's have a look at the rankings
head(movie.rank)## [1] "1." "2." "3." "4." "5." "6."
Step 4: Once you have the data, make sure that it looks in the desired format. I am preprocessing my data to convert it to numerical format.
#Data-Preprocessing: Converting rankings to numerical
movie.rank<-as.numeric(movie.rank)
#Let's have another look at the rankings
head(movie.rank)## [1] 1 2 3 4 5 6
Step 5: Now you can clear the selector section and select all the titles. You can visually inspect that all the titles are selected. Make any required additions and deletions with the help of your curser. I have done the same here.
We repeat this for each variable
# movie.rank <- page %>% html_nodes(".text-primary") %>% html_text()
titles <- page %>% html_nodes(".lister-item-header a") %>% html_text()
ratings <- page %>% html_nodes(".ipl-rating-star.small .ipl-rating-star__rating") %>% html_text() %>% as.numeric()
release.date <- page %>% html_nodes(".lister-item-header") %>% html_text()
movie.rating <- page %>% html_nodes(".certificate") %>% html_text()
starring.cast <- page %>% html_nodes(".text-small:nth-child(6)") %>% html_text()
movie.genre <- page %>% html_nodes(".genre") %>% html_text()
movie.runtime <- page %>% html_nodes(".runtime") %>% html_text()
movie.summary <- page %>% html_nodes(".ipl-rating-widget+ p , .ratings-metascore+ p") %>% html_text()
movie.votes <- page %>% html_nodes(".text-small+ .text-small") %>% html_text()Let’s look at each one of our variables starting with titles, the star rating, and the movie rating
head(titles) # These one looks good## [1] "The Godfather" "The Shawshank Redemption"
## [3] "Schindler's List" "Raging Bull"
## [5] "Casablanca" "Citizen Kane"
head(ratings)## [1] 9.2 9.3 8.9 8.2 8.5 8.4
head(movie.rating)## [1] "R" "R" "R" "R" "PG" "PG"
movie.rating<-factor(movie.rating)There are some that need to be cleaned up
head(release.date) #Not so pretty## [1] "\n 1.\n \n The Godfather\n (1972)\n"
## [2] "\n 2.\n \n The Shawshank Redemption\n (1994)\n"
## [3] "\n 3.\n \n Schindler's List\n (1993)\n"
## [4] "\n 4.\n \n Raging Bull\n (1980)\n"
## [5] "\n 5.\n \n Casablanca\n (1942)\n"
## [6] "\n 6.\n \n Citizen Kane\n (1941)\n"
A common type of html code used is new line or return We will use the gsub function to get rid of those terms
release.date<-gsub("\n","",release.date)
head(release.date)## [1] " 1. The Godfather (1972)"
## [2] " 2. The Shawshank Redemption (1994)"
## [3] " 3. Schindler's List (1993)"
## [4] " 4. Raging Bull (1980)"
## [5] " 5. Casablanca (1942)"
## [6] " 6. Citizen Kane (1941)"
We will use the gsub function to get rid of extra spaces as well
release.date<-gsub(" ","",release.date)
head(release.date)## [1] "1.The Godfather(1972)" "2.The Shawshank Redemption(1994)"
## [3] "3.Schindler's List(1993)" "4.Raging Bull(1980)"
## [5] "5.Casablanca(1942)" "6.Citizen Kane(1941)"
Typically these commands are all you need to clean up a string. However, we want to extract the year between the (), which is difficult in R because a lot of functions use () and ignore them as a character.
# First, we tell R to grab any thing within ()
release.date<-str_extract_all(release.date, "\\([^()]+\\)")
# This code says to grab the items between parantheses
release.date <- substring(release.date, 2, nchar(release.date)-1)
# Lastly, we change the values from character to numeric
release.date <- release.date %>% as.numeric()
head(release.date)## [1] 1972 1994 1993 1980 1942 1941
Sometimes we only want to grab the numerical values from a character string. For example in run-time we do not need the label for minutes just the number
head(movie.runtime)## [1] "175 min" "142 min" "195 min" "129 min" "102 min" "119 min"
#Grab the first set of numbers
movie.runtime<-as.numeric(gsub("([0-9]+).*$", "\\1", movie.runtime))
head(movie.runtime)## [1] 175 142 195 129 102 119
In this slide, we deal with the most complicated variable movie.vote This variable contains both the number of voters as well as Gross Earning for the movie. We need to not only clean up the variable, but also split the variable. The best method is to isolate the values you want for each variable. Let’s start with vote totals first.
#First, we clean movie.votes
movie.votes <- gsub("\n","",movie.votes) # removes \n
movie.votes <- gsub(" ","",movie.votes) # removes extra white space
movie.votes <- gsub("Votes:","",movie.votes) #removes the word Votes
movie.votes <- gsub("Gross:","",movie.votes) #removes the word Gross
movie.votes <- gsub(",","",movie.votes) #removes all the commas
movie.votes1<-readr::parse_number(movie.votes) #extracts the first number
head(movie.votes1)## [1] 1365575 1993670 1028619 275709 454998 348946
Now that we have vote totals, we want to obtain revenue. We will need to split our movie.vote variable by the character “|”.
# This one extracts only Revenue
# It splits the variable at each instance of "|" and holds the second value
# We need to do this for each row of the vector.
# We use sapply to repeat the split at each row
movie.rev<-sapply(strsplit(movie.votes, split='|',fixed=TRUE),function(x) (x[2]))
head(movie.rev)## [1] "$134.97M" "$28.34M" "$96.07M" "$23.38M" "$1.02M" "$1.59M"
# Lastly, we need to extract the numerical values
movie.rev <- readr::parse_number(movie.rev)
head(movie.rev)## [1] 134.97 28.34 96.07 23.38 1.02 1.59
starring.cast<-gsub("\n","",starring.cast)
starring.cast<-gsub(" ","",starring.cast)
starring.cast<-gsub("Director:","",starring.cast)
starring.cast<-gsub("Directors:","",starring.cast)
movie.director<-sapply(strsplit(starring.cast, split="| Stars:", fixed=TRUE),function(x) (x[1]))
movie.cast<-sapply(strsplit(starring.cast, split="| Stars:", fixed=TRUE),function(x) (x[2]))
head(movie.director)## [1] "Francis Ford Coppola " "Frank Darabont " "Steven Spielberg "
## [4] "Martin Scorsese " "Michael Curtiz " "Orson Welles "
head(movie.cast)## [1] "Marlon Brando, Al Pacino, James Caan, Diane Keaton"
## [2] "Tim Robbins, Morgan Freeman, Bob Gunton, William Sadler"
## [3] "Liam Neeson, Ralph Fiennes, Ben Kingsley, Caroline Goodall"
## [4] "Robert De Niro, Cathy Moriarty, Joe Pesci, Frank Vincent"
## [5] "Humphrey Bogart, Ingrid Bergman, Paul Henreid, Claude Rains"
## [6] "Orson Welles, Joseph Cotten, Dorothy Comingore, Agnes Moorehead"
movie.summary <- gsub(" ","",gsub("\n","",movie.summary))
head(movie.summary)## [1] "The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son."
## [2] "Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency."
## [3] "In German-occupied Poland during World War II, Oskar Schindler gradually becomes concerned for his Jewish workforce after witnessing their persecution by the Nazi Germans."
## [4] "The life of boxer Jake LaMotta, as the violence and temper that leads him to the top in the ring destroys his life outside of it."
## [5] "A cynical nightclub owner protects an old flame and her husband from Nazis in Morocco."
## [6] "Following the death of a publishing tycoon, news reporters scramble to discover the meaning of his final utterance."
movie.genre <- gsub(" ","",gsub("\n","",movie.genre))
movie.genre.primary <- factor(gsub(",.*","",movie.genre))
head(movie.genre.primary)## [1] Crime Drama Biography Biography Drama Drama
## 11 Levels: Action Adventure Biography Comedy Crime Drama ... Western
movie.info<-data.frame(rank = movie.rank, titles, rating = movie.rating, score = ratings, year = release.date, votes = movie.votes1, gross = movie.rev, genres=movie.genre, main_genre = movie.genre.primary, runtime = movie.runtime, director = movie.director, cast = movie.cast, synopsis = movie.summary)library(knitr)
table(movie.info$rating,movie.info$main_genre) %>% kable(caption="Top 100 Movies by Genre and Rating")| Action | Adventure | Biography | Comedy | Crime | Drama | Family | Film-Noir | Horror | Mystery | Western | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Approved | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| G | 0 | 2 | 1 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
| GP | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Not Rated | 0 | 4 | 1 | 5 | 3 | 7 | 0 | 2 | 0 | 0 | 0 |
| Passed | 0 | 1 | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 0 |
| PG | 3 | 5 | 1 | 5 | 0 | 6 | 1 | 0 | 0 | 2 | 0 |
| PG-13 | 1 | 2 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| R | 3 | 0 | 5 | 1 | 9 | 12 | 0 | 0 | 2 | 0 | 1 |
library('ggplot2')
qplot(data = movie.info,runtime,fill = main_genre,bins = 30)ggplot(movie.info, aes(x=score,y=gross))+geom_smooth()