Data exploration with the dplyr, tidyr and stringr libraries. - Column subsetting - Filtering of rows - Boolean operators, Boolean algebra, de Morgan’s laws - Creating new columns (1x Challenge) - Missing values - Manipulating text (3x Challenge) - Aggregating data (1x Challenge) - Pivot tables, data in long and wide format - Merging tables
Useful resources:
- dplyr
cheatsheet
- tidyr
cheatsheet
- stringr
cheatsheet
- ggplot2
cheatsheet
- A.
Kassambara - Guide to Create Beautiful Graphics in R.
The data comes from https://flixgem.com/ (dataset version as of March 12, 2021). The data contains information on 9425 movies and series available on Netlix.
We select columns by their names using the select() function. We can also delete columns by preceding the name of a column with the - symbol.
dane %>%
select(Title, Runtime, IMDb.Score, Release.Date) %>%
head(5)
dane %>%
select(-Netflix.Link, -IMDb.Link, -Image, -Poster, -TMDb.Trailer)%>%
head(5)
dane %>%
select(1:10)%>%
head(5)
dane %>%
select(Title:Runtime)%>%
head(5)
Useful functions when selecting/deleting columns: - starts_with() - select or delete columns starting with a given string. - ends_with() - select or delete columns ending with a given string - contains() - select or delete columns containing the given string.
dane %>%
select(starts_with('IMDb'))%>%
head(10)
dane %>%
select(ends_with('Score'))%>%
head(10)
dane %>%
select(contains('Date'))%>%
head(10)
Use the matches() function to select or remove columns containing a given regular expression. A useful tool in building and testing regular expressions is at the link https://regex101.com/.
dane %>%
select(matches('^[a-z]{5,6}$')) %>%
head(10)
dane %>%
select(-matches('\\.'))%>%
head(10)
The select() function always returns a data frame, while we also have the option of returning a vector using the pull() function.
dane %>%
select(IMDb.Score)%>%
head(10)
# dane %>%
# select(IMDb.Score) %>%
# unlist(use.names = FALSE)
dane %>%
pull(IMDb.Score)%>%
head(10)
dane %>%
pull(IMDb.Score, Title)%>%
head(10)
We filter the rows with the filter() function using the ==, !=, >, >=, <, <=, between() operators.
dane %>%
filter(Series.or.Movie == "Series")%>%
head(10)
dane %>%
filter(IMDb.Score > 8)%>%
head(10)
The logical operator AND denoted by the symbol &. - FALSE & FALSE = FALSE - FALSE & TRUE = FALSE - TRUE & FALSE = FALSE - TRUE & TRUE = TRUE
dane %>%
filter(IMDb.Score >= 8 & Series.or.Movie == 'Series')%>%
head(10)
The logical operator OR denoted by the symbol |. - FALSE | FALSE = FALSE - FALSE | TRUE = TRUE - TRUE | FALSE = TRUE - TRUE | TRUE = TRUE
dane %>%
filter(IMDb.Score >= 9 | IMDb.Votes < 1000)%>%
head(10)
De Morgan’s laws say that when we enter with a negation under the parenthesis, OR turns into AND (and vice versa).
not (A & B) = (not A) | (not B) not (A | B) = (not A) & (not B)
dane %>%
filter(!(IMDb.Score >= 9 | IMDb.Votes < 1000))%>%
head(10)
dane %>%
filter(!(IMDb.Score >= 9) & !(IMDb.Votes < 1000))%>%
head(10)
Use the mutate() function to either add new columns to the data frame or edit existing columns.
dane %>%
mutate(score_category = if_else(IMDb.Score >= 5, 'Good', 'Poor')) %>%
select(Title, IMDb.Score, score_category)%>%
head(10)
dane %>%
transmute(
Release = Release.Date %>% as.Date(format = '%y/%m/%d')
,Netflix.Release = Netflix.Release.Date %>% as.Date(format = '%y/%m/%d')
)
CHALLENGE 1: What is the oldest Woody Allen film available on Netflix?
dane %>%
select(Title,Director, Release.Date) %>%
filter(Director == "Woody Allen") %>%
mutate(Release = as.numeric(format(as.Date(Release.Date,format = "%m/%d/%Y"),"%Y"))) %>%
arrange(Release) %>%
head(1)
In the case of the *case_when()* function, we do not need to write conditions that create mutually disjoint sets. The evaluation occurs when the first condition is met, followed immediately by another iteration.
dane %>%
mutate(score_category = case_when(
IMDb.Score <= 2 ~ 'Very Poor'
,IMDb.Score <= 4 ~ 'Poor'
,IMDb.Score <= 6 ~ 'Medium'
,IMDb.Score <= 8 ~ 'Good'
,IMDb.Score <= 10 ~ 'Very Good'
)) %>%
select(Title, IMDb.Score, score_category)%>%
head(10)
We perform mathematical operations for each row and based on several columns using the rowwise() function.
dane %>%
mutate(avg_score = mean(c(IMDb.Score * 10
,Hidden.Gem.Score * 10
,Rotten.Tomatoes.Score
,Metacritic.Score)
,na.rm = TRUE) %>%
round(2)) %>%
select(Title, avg_score)%>%
head(10)
dane %>%
rowwise() %>%
mutate(avg_score = mean(c(IMDb.Score * 10
,Hidden.Gem.Score * 10
,Rotten.Tomatoes.Score
,Metacritic.Score)
,na.rm = TRUE) %>%
round(2)) %>%
select(Title, avg_score)%>%
head(10)
By default, the columns created with mutate() are at the end of the table. With relocate() we can change the positions of individual columns in the table.
dane %>%
mutate(Popularity = if_else(IMDb.Votes > quantile(IMDb.Votes, 0.90, na.rm = TRUE), 'High', 'Not High')) %>%
relocate(Popularity, .after = Title)
We rename the columns using the rename() function.
dane %>%
rename(
Tytul = Title
,Gatunek = Genre
)
Using functions from the tidyr library, we can tame missing values: - drop_na() - we remove rows containing missing values in the indicated columns. - replace_na() - we replace missing values with the specified constant. - fill() - we replace missing values with the previous or next available value.
dane %>%
sapply(function(x) is.na(x) %>% sum())
dane %>%
drop_na(Hidden.Gem.Score)
dane %>%
mutate(Hidden.Gem.Score = replace_na(Hidden.Gem.Score, median(Hidden.Gem.Score, na.rm = TRUE))) %>%
sapply(function(x) is.na(x) %>% sum())
dane %>%
replace_na(list(Hidden.Gem.Score = median(dane$Hidden.Gem.Score, na.rm = TRUE))) %>%
sapply(function(x) is.na(x) %>% sum())
The stringr library contains a lot of useful functions for manipulating text and regular expressions. Most of the functions in this library start with str_.
Q: What can be improved in the following code so that the tidyverse style convention is followed?
gatunki = dane$Genre %>%
paste0(collapse = ', ') %>%
str_extract_all('[A-Za-z]+') %>%
unlist() %>%
table() %>%
as.data.frame()
gatunki %>%
arrange(-Freq)
dane %>%
mutate(poland_available = str_detect(Country.Availability, 'Poland')) %>%
filter(poland_available == TRUE) %>%
pull(Title)%>%
head(10)
Using separate(), we can split one column into several, and combine several columns into one using the unite() function.
dane %>%
unite(
col = 'Scores'
,c('Hidden.Gem.Score', 'IMDb.Score', 'Rotten.Tomatoes.Score', 'Metacritic.Score')
,sep = ', '
) %>%
select(Title, Scores)%>%
head(10)
CHALLENGE 2: What are the three highest rated comedies available in Polish?
dane %>%
rowwise() %>%
mutate(avg_score = mean(c(IMDb.Score * 10
,Hidden.Gem.Score * 10
,Rotten.Tomatoes.Score
,Metacritic.Score)
,na.rm = TRUE) %>%
round(2)) %>%
select(Title, Country.Availability, avg_score)%>%
filter(Country.Availability == "Poland") %>%
head(3)
CHALLENGE 3: For 2019 and 2020 productions, what is the average time between release and appearance on Netflix?
dane$Release <- as.numeric(format(as.Date(dane$Release.Date,format = "%m/%d/%Y"),"%Y"))
dane$Netflix.Release <- as.numeric(format(as.Date(dane$Netflix.Release.Date,format = "%m/%d/%Y"),"%Y"))
dane$avg_time_movie <- as.numeric(as.Date(dane$Netflix.Release) - as.Date(dane$Release))
dane %>%
filter(Release == '2019' | Release == '2020') %>%
select(Title, avg_time_movie, Release, Netflix.Release)
CHALLENGE 4: What are the most popular tags for productions available in Polish?
dane %>%
group_by(Tags, Languages) %>%
filter(Languages == "Polish") %>%
summarise(Count = n()) %>%
arrange(desc(Count))
Using the group_by() and summarize() functions, we perform operations on aggregated data.
dane %>%
group_by(Series.or.Movie) %>%
summarize(
count = n()
,avg_imdb_score = mean(IMDb.Score, na.rm = TRUE) %>% round(2)
,avg_imdb_votes = mean(IMDb.Votes, na.rm = TRUE) %>% round(0)
,sum_awards = sum(Awards.Received, na.rm = TRUE)
)
dane %>%
group_by(Series.or.Movie, Runtime) %>%
summarize(n = n()) %>%
arrange(-n)
CHALLENGE 5: What are the average ratings of films produced in each decade (i.e., 1960s, 1970s, 1980s, 1990s, etc.)?
dane$Date_Col <- as.Date(dane$Release.Date, format = "%d/%m/%Y")
dane$Decade <- cut(dane$Date_Col, breaks = seq(as.Date("1959-12-31"), as.Date("2020-12-31"), by = "10 years"), labels = c("1960s", "1970s", "1980s", "1990s", "2000s", "2010s", "2020s"))
dane %>%
rowwise() %>%
mutate(avg_score = mean(c(IMDb.Score * 10
,Hidden.Gem.Score * 10
,Rotten.Tomatoes.Score
,Metacritic.Score)
,na.rm = TRUE) %>%
round(2)) %>%
summarise(Title, avg_score, Decade) %>%
arrange(Decade)
Data in wide format: - rows represent individual observations - columns represent attributes of these observations - cells represent values of individual attributes for individual observations.
Data in long format: - in the first column we have observations (the observation key can also consist of more than one column) - in the second column we have attributes - in the third column we have values.
The long format is useful, among other things, when creating charts in the ggplot2 library.
dane_pivot = dane %>%
select(Title, ends_with('Score'))
dane_pivot = dane_pivot %>%
pivot_longer(
cols = 2:5
,names_to = 'Attribute'
,values_to = 'Value'
)
dane_pivot = dane_pivot %>%
pivot_wider(
id_cols = 1
,names_from = 'Attribute'
,values_from = 'Value'
)
## Warning: Values from `Value` are not uniquely identified; output will contain list-cols.
## • Use `values_fn = list` to suppress this warning.
## • Use `values_fn = {summary_fun}` to summarise duplicates.
## • Use the following dplyr code to identify duplicates.
## {data} %>%
## dplyr::group_by(Title, Attribute) %>%
## dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
## dplyr::filter(n > 1L)
oceny_metacritic = dane %>%
select(Title, Metacritic.Score) %>%
.[1:100,] %>%
drop_na()
oceny_rotten_tomatoes = dane %>%
select(Title, Rotten.Tomatoes.Score) %>%
.[1:100,] %>%
drop_na()
We join tables by their corresponding keys just as we do in SQL.
oceny_metacritic %>%
left_join(oceny_rotten_tomatoes, by = c('Title' = 'Title'))
## Title Metacritic.Score Rotten.Tomatoes.Score
## 1 Lets Fight Ghost 82 98
## 2 HOW TO BUILD A GIRL 69 79
## 3 The Invisible 36 20
## 4 Joker 59 68
## 5 I 51 52
## 6 Harrys Daughters 85 96
## 7 The Closet 72 85
## 8 Trial by Fire 51 61
## 9 Dilili in Paris 37 NA
## 10 Framing John DeLorean 67 90
## 11 Alice 67 75
## 12 Ordinary People 86 89
## 13 Paths of the Soul 90 94
## 14 Rebel in the Rye 46 30
## 15 The Return 82 NA
## 16 Stray 54 56
## 17 Stand by Me 75 91
## 18 Wonderstruck 71 68
## 19 Intimate Strangers 71 86
## 20 The Girl on the Train 48 44
## 21 Ride Your Wave 63 93
## 22 Capone 46 NA
## 23 Above Suspicion 57 NA
## 24 A Call to Spy 65 NA
## 25 Red 60 72
## 26 The Mole Agent 69 NA
## 27 I Care a Lot 67 NA
## 28 Burden 57 97
## 29 Collective 95 NA
## 30 Love 51 40
## 31 Amanda 63 NA
## 32 Corpus Christi 77 NA
## 33 The Shadow 50 35
## 34 Aftermath 44 42
## 35 Unhinged 40 NA
## 36 John Lewis: Good Trouble 70 NA
## 37 Repo Man 82 98
## 38 For Love of the Game 43 46
## 39 The Replacement Killers 42 36
oceny_metacritic %>%
right_join(oceny_rotten_tomatoes, by = c('Title' = 'Title'))
## Title Metacritic.Score Rotten.Tomatoes.Score
## 1 Lets Fight Ghost 82 98
## 2 HOW TO BUILD A GIRL 69 79
## 3 The Invisible 36 20
## 4 Joker 59 68
## 5 I 51 52
## 6 Harrys Daughters 85 96
## 7 The Closet 72 85
## 8 Trial by Fire 51 61
## 9 Framing John DeLorean 67 90
## 10 Alice 67 75
## 11 Ordinary People 86 89
## 12 Paths of the Soul 90 94
## 13 Rebel in the Rye 46 30
## 14 Stray 54 56
## 15 Stand by Me 75 91
## 16 Wonderstruck 71 68
## 17 Intimate Strangers 71 86
## 18 The Girl on the Train 48 44
## 19 Ride Your Wave 63 93
## 20 Red 60 72
## 21 Burden 57 97
## 22 Love 51 40
## 23 The Shadow 50 35
## 24 Aftermath 44 42
## 25 Repo Man 82 98
## 26 For Love of the Game 43 46
## 27 The Replacement Killers 42 36
## 28 The Simple Minded Murderer NA 92
## 29 Comrades: Almost a Love Story NA 89
## 30 The Mysterians NA 51
## 31 Repast NA 87
## 32 Sway NA 86
## 33 When a Woman Ascends the Stairs NA 100
## 34 Yearning NA 88
## 35 Ginza Cosmetics NA 45
## 36 Floating Clouds NA 83
## 37 Life and Nothing But NA 86
## 38 Let Joy Reign Supreme NA 79
## 39 Coup de Torchon NA 83
## 40 Keys To The Heart NA 77
## 41 Gonjiam: Haunted Asylum NA 91
## 42 Golden Slumber NA 75
## 43 Extreme Job NA 82
## 44 Default NA 78
## 45 The Accidental Detective 2: In Action NA 73
## 46 1987: When the Day Comes NA 82
## 47 Ten Years Japan NA 100
## 48 Overcoming NA 88
## 49 Awara Paagal Deewana NA 54
oceny_metacritic %>%
inner_join(oceny_rotten_tomatoes, by = c('Title' = 'Title'))
## Title Metacritic.Score Rotten.Tomatoes.Score
## 1 Lets Fight Ghost 82 98
## 2 HOW TO BUILD A GIRL 69 79
## 3 The Invisible 36 20
## 4 Joker 59 68
## 5 I 51 52
## 6 Harrys Daughters 85 96
## 7 The Closet 72 85
## 8 Trial by Fire 51 61
## 9 Framing John DeLorean 67 90
## 10 Alice 67 75
## 11 Ordinary People 86 89
## 12 Paths of the Soul 90 94
## 13 Rebel in the Rye 46 30
## 14 Stray 54 56
## 15 Stand by Me 75 91
## 16 Wonderstruck 71 68
## 17 Intimate Strangers 71 86
## 18 The Girl on the Train 48 44
## 19 Ride Your Wave 63 93
## 20 Red 60 72
## 21 Burden 57 97
## 22 Love 51 40
## 23 The Shadow 50 35
## 24 Aftermath 44 42
## 25 Repo Man 82 98
## 26 For Love of the Game 43 46
## 27 The Replacement Killers 42 36
oceny_metacritic %>%
full_join(oceny_rotten_tomatoes, by = c('Title' = 'Title'))
## Title Metacritic.Score Rotten.Tomatoes.Score
## 1 Lets Fight Ghost 82 98
## 2 HOW TO BUILD A GIRL 69 79
## 3 The Invisible 36 20
## 4 Joker 59 68
## 5 I 51 52
## 6 Harrys Daughters 85 96
## 7 The Closet 72 85
## 8 Trial by Fire 51 61
## 9 Dilili in Paris 37 NA
## 10 Framing John DeLorean 67 90
## 11 Alice 67 75
## 12 Ordinary People 86 89
## 13 Paths of the Soul 90 94
## 14 Rebel in the Rye 46 30
## 15 The Return 82 NA
## 16 Stray 54 56
## 17 Stand by Me 75 91
## 18 Wonderstruck 71 68
## 19 Intimate Strangers 71 86
## 20 The Girl on the Train 48 44
## 21 Ride Your Wave 63 93
## 22 Capone 46 NA
## 23 Above Suspicion 57 NA
## 24 A Call to Spy 65 NA
## 25 Red 60 72
## 26 The Mole Agent 69 NA
## 27 I Care a Lot 67 NA
## 28 Burden 57 97
## 29 Collective 95 NA
## 30 Love 51 40
## 31 Amanda 63 NA
## 32 Corpus Christi 77 NA
## 33 The Shadow 50 35
## 34 Aftermath 44 42
## 35 Unhinged 40 NA
## 36 John Lewis: Good Trouble 70 NA
## 37 Repo Man 82 98
## 38 For Love of the Game 43 46
## 39 The Replacement Killers 42 36
## 40 The Simple Minded Murderer NA 92
## 41 Comrades: Almost a Love Story NA 89
## 42 The Mysterians NA 51
## 43 Repast NA 87
## 44 Sway NA 86
## 45 When a Woman Ascends the Stairs NA 100
## 46 Yearning NA 88
## 47 Ginza Cosmetics NA 45
## 48 Floating Clouds NA 83
## 49 Life and Nothing But NA 86
## 50 Let Joy Reign Supreme NA 79
## 51 Coup de Torchon NA 83
## 52 Keys To The Heart NA 77
## 53 Gonjiam: Haunted Asylum NA 91
## 54 Golden Slumber NA 75
## 55 Extreme Job NA 82
## 56 Default NA 78
## 57 The Accidental Detective 2: In Action NA 73
## 58 1987: When the Day Comes NA 82
## 59 Ten Years Japan NA 100
## 60 Overcoming NA 88
## 61 Awara Paagal Deewana NA 54
oceny_metacritic %>%
anti_join(oceny_rotten_tomatoes, by = c('Title' = 'Title'))
## Title Metacritic.Score
## 1 Dilili in Paris 37
## 2 The Return 82
## 3 Capone 46
## 4 Above Suspicion 57
## 5 A Call to Spy 65
## 6 The Mole Agent 69
## 7 I Care a Lot 67
## 8 Collective 95
## 9 Amanda 63
## 10 Corpus Christi 77
## 11 Unhinged 40
## 12 John Lewis: Good Trouble 70
oceny_rotten_tomatoes %>%
anti_join(oceny_metacritic, by = c('Title' = 'Title'))
## Title Rotten.Tomatoes.Score
## 1 The Simple Minded Murderer 92
## 2 Comrades: Almost a Love Story 89
## 3 The Mysterians 51
## 4 Repast 87
## 5 Sway 86
## 6 When a Woman Ascends the Stairs 100
## 7 Yearning 88
## 8 Ginza Cosmetics 45
## 9 Floating Clouds 83
## 10 Life and Nothing But 86
## 11 Let Joy Reign Supreme 79
## 12 Coup de Torchon 83
## 13 Keys To The Heart 77
## 14 Gonjiam: Haunted Asylum 91
## 15 Golden Slumber 75
## 16 Extreme Job 82
## 17 Default 78
## 18 The Accidental Detective 2: In Action 73
## 19 1987: When the Day Comes 82
## 20 Ten Years Japan 100
## 21 Overcoming 88
## 22 Awara Paagal Deewana 54