Basic operations in R - part 3.

Data exploration

Kinga Derewecka, Piotr Bochiński

2023-11-26

Table of contents:

Data exploration with the dplyr, tidyr and stringr libraries
- Column subsets
- Filtering rows
- Logical operators, Boolean algebra, De Morgan's laws
- Creating new columns (1x Challenge)
- Missing values
- Text manipulation (3x Challenge)
- Data aggregation (1x Challenge)
- Pivot tables, data in long and wide format
- Joining tables

Useful materials:
- dplyr cheatsheet
- tidyr cheatsheet
- stringr cheatsheet
- ggplot2 cheatsheet
- A. Kassambara - Guide to Create Beautiful Graphics in R.

The data comes from https://flixgem.com/ (dataset version from 12 March 2021). It contains information on 9425 movies and series available on Netflix.

Data exploration with the dplyr, tidyr and stringr libraries

Column subsets

We select columns by name with select(). Columns can also be dropped by prefixing the column name with the - symbol.

dane %>%
  select(Title, Runtime, IMDb.Score, Release.Date) %>%
  head(5)

dane %>%
  select(-Netflix.Link, -IMDb.Link, -Image, -Poster, -TMDb.Trailer)%>%
  head(5)

dane %>%
  select(1:10)%>%
  head(5)

dane %>%
  select(Title:Runtime)%>%
  head(5)

Useful helper functions when selecting or dropping columns:
- starts_with() - selects or drops columns whose names start with a given string
- ends_with() - selects or drops columns whose names end with a given string
- contains() - selects or drops columns whose names contain a given string.

dane %>%
  select(starts_with('IMDb'))%>% 
  head(10)

dane %>%
  select(ends_with('Score'))%>% 
  head(10)

dane %>%
  select(contains('Date'))%>% 
  head(10)

With matches() we select or drop columns whose names match a given regular expression. A handy tool for building and testing regular expressions is available at https://regex101.com/.

dane %>%
  select(matches('^[a-z]{5,6}$')) %>% 
  head(10)

dane %>%
  select(-matches('\\.'))%>% 
  head(10)
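To make the two patterns above easier to read, here is a minimal sketch on a hypothetical toy table (the column names only imitate the style of the Netflix data, and the values are made up). One detail worth noticing is that matches() ignores letter case by default, so the pattern '^[a-z]{5,6}$' also catches names that start with a capital letter.

```r
library(dplyr)

# Hypothetical toy table, used only to inspect which column names get selected
toy <- data.frame(
  Title = c('A', 'B'),
  Runtime = c('< 30 minutes', '1-2 hour'),
  IMDb.Score = c(7.1, 6.4),
  Release.Date = c('1/2/20', '3/4/21')
)

# Names made of 5 or 6 letters and nothing else (case-insensitive by default)
letters_only <- toy %>% select(matches('^[a-z]{5,6}$')) %>% names()

# '\\.' is a literal dot; the minus sign drops every column with a dot in its name
no_dot <- toy %>% select(-matches('\\.')) %>% names()

print(letters_only)
print(no_dot)
```

Runtime has seven letters, so it is kept only by the second call.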

select() always returns a data frame; to get a vector instead, we can use pull().

dane %>%
  select(IMDb.Score)%>% 
  head(10)

# dane %>%
#   select(IMDb.Score) %>%
#   unlist(use.names = FALSE)

dane %>%
  pull(IMDb.Score)%>% 
  head(10)

dane %>%
  pull(IMDb.Score, Title)%>% 
  head(10)
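The difference between the three calls above can be checked directly on a small example. The sketch below uses a hypothetical three-row table; only the function calls mirror the ones used on the Netflix data.

```r
library(dplyr)

# Hypothetical toy table
toy <- data.frame(
  Title = c('A', 'B', 'C'),
  IMDb.Score = c(7.1, 6.4, 8.0),
  stringsAsFactors = FALSE
)

as_frame  <- toy %>% select(IMDb.Score)        # one-column data frame
as_vector <- toy %>% pull(IMDb.Score)          # plain numeric vector
named     <- toy %>% pull(IMDb.Score, Title)   # numeric vector named by Title

print(class(as_frame))
print(as_vector)
print(named['C'])
```

The second argument of pull() is convenient when a value has to be looked up later by its label, as in named['C'].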

Filtering rows

We filter rows with filter(), using the comparison operators ==, !=, >, >=, <, <= and the helper function between().

dane %>%
  filter(Series.or.Movie == "Series")%>% 
  head(10)

dane %>%
  filter(IMDb.Score > 8)%>% 
  head(10)
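The text mentions between() without showing it, so here is a short sketch on a hypothetical toy table. between(x, left, right) includes both endpoints, which makes it equivalent to x >= left & x <= right.

```r
library(dplyr)

# Hypothetical toy table
toy <- data.frame(
  Title = c('A', 'B', 'C', 'D'),
  IMDb.Score = c(5.9, 6.0, 7.5, 8.1),
  stringsAsFactors = FALSE
)

# Both endpoints are included in the range
in_range <- toy %>% filter(between(IMDb.Score, 6, 8))

# The same filter written with two comparisons
same <- toy %>% filter(IMDb.Score >= 6 & IMDb.Score <= 8)

print(in_range)
```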

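Logical operators, Boolean algebra, De Morgan's laws

The logical operator AND is written & and returns TRUE only when both operands are TRUE:
- FALSE & FALSE = FALSE
- FALSE & TRUE = FALSE
- TRUE & FALSE = FALSE
- TRUE & TRUE = TRUE

The logical operator OR is written | and returns FALSE only when both operands are FALSE.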

dane %>%
  filter(IMDb.Score >= 8 & Series.or.Movie == 'Series')%>% 
  head(10)

dane %>%
  filter(IMDb.Score >= 9 | IMDb.Votes < 1000)%>% 
  head(10)

De Morgan's laws state that when a negation is moved inside the parentheses, OR turns into AND (and vice versa):
- not (A & B) = (not A) | (not B)
- not (A | B) = (not A) & (not B)

The two filters below therefore return the same rows.

dane %>%
  filter(!(IMDb.Score >= 9 | IMDb.Votes < 1000))%>% 
  head(10)

dane %>%
  filter(!(IMDb.Score >= 9) & !(IMDb.Votes < 1000))%>% 
  head(10)
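De Morgan's laws can be verified in R itself by checking every combination of TRUE and FALSE. This is a minimal sketch in base R; the vector names a and b are arbitrary.

```r
# All four combinations of two logical values
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

# not (A & B) = (not A) | (not B)
law1 <- identical(!(a & b), (!a) | (!b))

# not (A | B) = (not A) & (not B)
law2 <- identical(!(a | b), (!a) & (!b))

print(c(law1, law2))
```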

Creating new columns

With mutate() we add new columns to a data frame or modify existing ones.

dane %>%
  mutate(score_category = if_else(IMDb.Score >= 5, 'Good', 'Poor')) %>%
  select(Title, IMDb.Score, score_category)%>% 
  head(10)

dane %>%
  transmute(
    Release = Release.Date %>% as.Date(format = '%m/%d/%y')
    ,Netflix.Release = Netflix.Release.Date %>% as.Date(format = '%m/%d/%y')
  )
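A note on the format string used above: '%y' expects a two-digit year, while '%Y' expects a four-digit one. The sketch below uses made-up date strings to show what happens when the two are mixed up; it is an illustration only and makes no claim about how dates are actually stored in the dataset.

```r
# Two-digit year with '%y': parsed as expected
short_year <- as.Date('3/4/21', format = '%m/%d/%y')

# Four-digit year with '%y': only the first two digits of the year are read,
# so the result is silently wrong instead of raising an error
long_year_wrong <- as.Date('3/4/2021', format = '%m/%d/%y')

# Four-digit year with '%Y': parsed as expected
long_year_right <- as.Date('3/4/2021', format = '%m/%d/%Y')

print(short_year)
print(long_year_wrong)
print(long_year_right)
```

Because the mistake produces a valid but incorrect date, it is worth checking a few converted values by eye after every conversion.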

CHALLENGE 1: What is the oldest Woody Allen film available on Netflix?

library(lubridate)
dane %>%
  filter(Director == "Woody Allen") %>%
  filter(mdy(Release.Date) == (min(mdy(Release.Date)))) %>%
  select(Title, Release.Date)

With case_when() the conditions do not have to define mutually disjoint sets. The conditions are checked in order, and each row receives the value assigned to the first condition it satisfies; the remaining conditions are skipped for that row.

dane %>%
  mutate(score_category = case_when(
    IMDb.Score <= 2 ~ 'Very Poor'
    ,IMDb.Score <= 4 ~ 'Poor'
    ,IMDb.Score <= 6 ~ 'Medium'
    ,IMDb.Score <= 8 ~ 'Good'
    ,IMDb.Score <= 10 ~ 'Very Good'
    )) %>%
  select(Title, IMDb.Score, score_category)%>% 
  head(10)
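Since the first satisfied condition wins, the order of the conditions matters. The sketch below applies the same thresholds to a few made-up scores, once in ascending order and once with the widest condition placed first.

```r
library(dplyr)

# Made-up scores for illustration
scores <- c(1.5, 3.9, 6.0, 9.2)

# Narrowest condition first: every score lands in its own bucket
ordered <- case_when(
  scores <= 2 ~ 'Very Poor',
  scores <= 4 ~ 'Poor',
  scores <= 6 ~ 'Medium',
  scores <= 8 ~ 'Good',
  scores <= 10 ~ 'Very Good'
)

# Widest condition first: it is satisfied by every score, so nothing else is reached
reversed <- case_when(
  scores <= 10 ~ 'Very Good',
  scores <= 2 ~ 'Very Poor'
)

print(ordered)
print(reversed)
```

A row that satisfies none of the conditions (for example a missing score) gets NA.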

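Calculations that are carried out separately for each row and combine several columns are done with the help of rowwise(). Without it (first example below), mean() receives the entire columns at once, so every row gets the same overall average; with rowwise() (second example), the average is computed within each row.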

dane %>%
  mutate(avg_score = mean(c(IMDb.Score * 10
                            ,Hidden.Gem.Score * 10
                            ,Rotten.Tomatoes.Score
                            ,Metacritic.Score)
                          ,na.rm = TRUE) %>%
           round(2)) %>%
  select(Title, avg_score)%>% 
  head(10)

dane %>% 
  rowwise() %>%
  mutate(avg_score = mean(c(IMDb.Score * 10
                            ,Hidden.Gem.Score * 10
                            ,Rotten.Tomatoes.Score
                            ,Metacritic.Score)
                          ,na.rm = TRUE) %>%
           round(2)) %>%
  select(Title, avg_score)%>% 
  head(10)
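The difference between the two pipelines above is easiest to see on a table small enough to compute by hand. The sketch below uses a hypothetical two-row table with columns x and y.

```r
library(dplyr)

# Hypothetical toy table with one missing value
toy <- data.frame(Title = c('A', 'B'), x = c(10, 30), y = c(20, NA))

# Without rowwise(): mean() sees c(10, 30, 20, NA), so both rows get 60 / 3 = 20
whole <- toy %>%
  mutate(avg = mean(c(x, y), na.rm = TRUE))

# With rowwise(): row A gets mean(10, 20) = 15, row B gets mean(30) = 30
per_row <- toy %>%
  rowwise() %>%
  mutate(avg = mean(c(x, y), na.rm = TRUE)) %>%
  ungroup()   # drop the row-wise grouping once it is no longer needed

print(whole$avg)
print(per_row$avg)
```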

By default, columns created with mutate() are placed at the end of the table. With relocate() we can change the position of individual columns in the table.

dane %>%
  mutate(Popularity = if_else(IMDb.Votes > quantile(IMDb.Votes, 0.90, na.rm = TRUE), 'High', 'Not High')) %>%
  relocate(Popularity, .after = Title)

We rename columns with rename().

dane %>%
  rename(
    Tytul = Title
    ,Gatunek = Genre
  )

Missing values

Functions from the tidyr library help us tame missing values:
- drop_na() - removes rows that contain missing values in the indicated columns
- replace_na() - replaces missing values with a given constant
- fill() - replaces missing values with the previous or next available value.

dane %>%
  sapply(function(x) is.na(x) %>% sum())

dane %>%
  drop_na(Hidden.Gem.Score)

dane %>%
  mutate(Hidden.Gem.Score = replace_na(Hidden.Gem.Score, median(Hidden.Gem.Score, na.rm = TRUE))) %>%
  sapply(function(x) is.na(x) %>% sum())

dane %>%
  replace_na(list(Hidden.Gem.Score = median(dane$Hidden.Gem.Score, na.rm = TRUE))) %>%
  sapply(function(x) is.na(x) %>% sum())
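fill() is listed above but not demonstrated, so here is a minimal sketch on a hypothetical toy table. By default values are carried downwards; the .direction argument switches to carrying them upwards.

```r
library(tidyr)

# Hypothetical toy table with gaps in the value column
toy <- data.frame(day = 1:5, value = c(NA, 2, NA, NA, 5))

# Default direction 'down': each gap takes the previous available value;
# the leading NA stays because nothing precedes it
down <- fill(toy, value)

# Direction 'up': each gap takes the next available value
up <- fill(toy, value, .direction = 'up')

print(down$value)
print(up$value)
```

fill() makes sense mainly when the row order carries meaning, for example in time series.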

Text manipulation

The stringr library contains many useful functions for working with text and regular expressions. Most functions in this library start with str_.

Q: What could be improved in the code below so that it follows the tidyverse style convention?

gatunki = dane$Genre %>%
  paste0(collapse = ', ') %>%
  str_extract_all('[A-Za-z]+') %>%
  unlist() %>%
  table() %>%
  as.data.frame()

gatunki %>%
  arrange(-Freq)
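One thing to keep in mind about the pattern '[A-Za-z]+' used above: it treats every character that is not a letter as a separator, so a name containing a hyphen or a space is cut into pieces. The sketch below compares it with splitting on the delimiter, using made-up genre strings; whether this matters for the real Genre column depends on the values it holds.

```r
library(stringr)

# Made-up genre strings, including one hyphenated name
genres <- c('Comedy, Drama', 'Drama', 'Sci-Fi, Comedy')

# Extracting runs of letters breaks 'Sci-Fi' into 'Sci' and 'Fi'
by_regex <- genres %>%
  paste0(collapse = ', ') %>%
  str_extract_all('[A-Za-z]+') %>%
  unlist()

# Splitting on the delimiter keeps each name whole
by_split <- genres %>%
  str_split(', ') %>%
  unlist()

print(table(by_regex))
print(table(by_split))
```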

dane %>%
  mutate(poland_available = str_detect(Country.Availability, 'Poland')) %>%
  filter(poland_available == TRUE) %>%
  pull(Title)%>% 
  head(10)

With separate() we can split one column into several, and with unite() we can merge several columns into one.

dane %>%
  unite(
    col = 'Scores'
    ,c('Hidden.Gem.Score', 'IMDb.Score', 'Rotten.Tomatoes.Score', 'Metacritic.Score')
    ,sep = ', '
  ) %>%
  select(Title, Scores)%>% 
  head(10)
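Only unite() is shown above, so the sketch below adds the opposite direction. It runs on a hypothetical toy table whose Scores column imitates the output of the unite() call; the column names passed to into are chosen for illustration.

```r
library(tidyr)

# Hypothetical toy table with two scores glued into one text column
toy <- data.frame(
  Title = c('A', 'B'),
  Scores = c('7.1, 80', '6.4, 55'),
  stringsAsFactors = FALSE
)

# Split one column into two; convert = TRUE turns the pieces into numbers
parts <- separate(
  toy,
  col = Scores,
  into = c('IMDb.Score', 'Metacritic.Score'),
  sep = ', ',
  convert = TRUE
)

# unite() reverses the operation and gives back the original text column
back <- unite(parts, col = 'Scores', IMDb.Score, Metacritic.Score, sep = ', ')

print(parts)
print(back)
```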

CHALLENGE 2: What are the three highest-rated comedies available in Polish?

dane %>%
  filter(Genre %>% str_detect("Comedy")) %>%
  filter(Languages %>% str_detect("Polish")) %>%
  arrange(desc(IMDb.Score)) %>%
  head(3)

CHALLENGE 3: For productions from 2019 and 2020, what is the average time between the premiere and the release on Netflix?

library(lubridate)
dane %>%
  filter(year(mdy(Release.Date)) %in% c(2019,2020)) %>%
  mutate(Czas_Miedzy = as.numeric(difftime(mdy(Netflix.Release.Date), mdy(Release.Date), units = "days"))) %>%
  summarise(Srednia_Czas = mean(Czas_Miedzy, na.rm = TRUE))

CHALLENGE 4: What are the most popular tags for productions available in Polish?

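dane %>%
  filter(Languages %>% str_detect("Polish")) %>%
  # assumption: Tags holds a comma-separated list of tags, so put one tag per row before counting
  separate_rows(Tags, sep = ',\\s*') %>%
  group_by(Tags) %>%
  summarise(Ilość = n()) %>%
  arrange(desc(Ilość))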

Data aggregation

With group_by() and summarize() we perform operations on aggregated data.

dane %>%
  group_by(Series.or.Movie) %>%
  summarize(
    count = n()
    ,avg_imdb_score = mean(IMDb.Score, na.rm = TRUE) %>% round(2)
    ,avg_imdb_votes = mean(IMDb.Votes, na.rm = TRUE) %>% round(0)
    ,sum_awards = sum(Awards.Received, na.rm = TRUE)
  )

dane %>%
  group_by(Series.or.Movie, Runtime) %>%
  summarize(n = n()) %>%
  arrange(-n)
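To see what group_by() and summarize() return, it helps to run them on a table small enough to check by hand. The sketch below uses a hypothetical three-row table; .groups = 'drop' is added so that the result is no longer grouped.

```r
library(dplyr)

# Hypothetical toy table; the only Series row has a missing score
toy <- data.frame(
  kind = c('Movie', 'Movie', 'Series'),
  score = c(6, 8, NA),
  stringsAsFactors = FALSE
)

stats <- toy %>%
  group_by(kind) %>%
  summarize(
    count = n(),                           # number of rows in the group
    avg_score = mean(score, na.rm = TRUE), # mean of the non-missing scores
    .groups = 'drop'
  )

print(stats)
```

When every value in a group is missing, mean() with na.rm = TRUE has nothing left to average and returns NaN, as happens for the Series group here.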

CHALLENGE 5: What are the average ratings of films produced in each decade (i.e. the 60s, 70s, 80s, 90s, etc.)?

dane %>%
  mutate(RokProdukcji = year(mdy(Release.Date)), Dekada = 10 * (RokProdukcji %/% 10)) %>%
  group_by(Dekada) %>%
  summarise(SredniaOcena = mean(IMDb.Score, na.rm= TRUE))

Pivot tables, data in long and wide format

Data in wide format:
- rows represent individual observations
- columns represent attributes of those observations
- cells hold the values of the individual attributes for the individual observations.

Data in long format:
- the first column holds the observations (the observation key may also consist of more than one column)
- the second column holds the attributes
- the third column holds the values.

The long format is useful, among other things, when creating plots with the ggplot2 library.

dane_pivot = dane %>%
  select(Title, ends_with('Score'))

dane_pivot = dane_pivot %>%
  pivot_longer(
    cols = 2:5
    ,names_to = 'Attribute'
    ,values_to = 'Value'
  )

dane_pivot = dane_pivot %>%
  pivot_wider(
    id_cols = 1
    ,names_from = 'Attribute'
    ,values_from = 'Value'
  )

## Warning: Values from `Value` are not uniquely identified; output will contain list-cols.
## • Use `values_fn = list` to suppress this warning.
## • Use `values_fn = {summary_fun}` to summarise duplicates.
## • Use the following dplyr code to identify duplicates.
##   {data} %>%
##   dplyr::group_by(Title, Attribute) %>%
##   dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
##   dplyr::filter(n > 1L)
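The warning above appears when the same combination of Title and Attribute occurs more than once, so pivot_wider() does not know which value to put in the cell. The sketch below reproduces the situation on a hypothetical long table and resolves it by averaging the duplicates with values_fn; averaging is only one possible choice and is used here as an example.

```r
library(dplyr)
library(tidyr)

# Hypothetical long table in which title 'A' appears twice for the same attribute
long <- data.frame(
  Title = c('A', 'A', 'B'),
  Attribute = 'IMDb.Score',
  Value = c(6, 8, 8.4),
  stringsAsFactors = FALSE
)

# Find the key combinations that are not unique
dupes <- long %>%
  count(Title, Attribute) %>%
  filter(n > 1)

# Collapse duplicates with a summary function instead of getting list-columns
wide <- long %>%
  pivot_wider(
    id_cols = Title,
    names_from = Attribute,
    values_from = Value,
    values_fn = mean
  )

print(dupes)
print(wide)
```

An alternative is to make the key unique before pivoting, for example by adding another identifying column to id_cols.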

Joining tables

oceny_metacritic = dane %>%
  select(Title, Metacritic.Score) %>%
  .[1:100,] %>%
  drop_na()

oceny_rotten_tomatoes = dane %>%
  select(Title, Rotten.Tomatoes.Score) %>%
  .[1:100,] %>%
  drop_na()

We join tables on the appropriate keys, just as we do in SQL.

oceny_metacritic %>%
  left_join(oceny_rotten_tomatoes, by = c('Title' = 'Title'))

##                       Title Metacritic.Score Rotten.Tomatoes.Score
## 1          Lets Fight Ghost               82                    98
## 2       HOW TO BUILD A GIRL               69                    79
## 3             The Invisible               36                    20
## 4                     Joker               59                    68
## 5                         I               51                    52
## 6          Harrys Daughters               85                    96
## 7                The Closet               72                    85
## 8             Trial by Fire               51                    61
## 9           Dilili in Paris               37                    NA
## 10    Framing John DeLorean               67                    90
## 11                    Alice               67                    75
## 12          Ordinary People               86                    89
## 13        Paths of the Soul               90                    94
## 14         Rebel in the Rye               46                    30
## 15               The Return               82                    NA
## 16                    Stray               54                    56
## 17              Stand by Me               75                    91
## 18             Wonderstruck               71                    68
## 19       Intimate Strangers               71                    86
## 20    The Girl on the Train               48                    44
## 21           Ride Your Wave               63                    93
## 22                   Capone               46                    NA
## 23          Above Suspicion               57                    NA
## 24            A Call to Spy               65                    NA
## 25                      Red               60                    72
## 26           The Mole Agent               69                    NA
## 27             I Care a Lot               67                    NA
## 28                   Burden               57                    97
## 29               Collective               95                    NA
## 30                     Love               51                    40
## 31                   Amanda               63                    NA
## 32           Corpus Christi               77                    NA
## 33               The Shadow               50                    35
## 34                Aftermath               44                    42
## 35                 Unhinged               40                    NA
## 36 John Lewis: Good Trouble               70                    NA
## 37                 Repo Man               82                    98
## 38     For Love of the Game               43                    46
## 39  The Replacement Killers               42                    36

oceny_metacritic %>%
  right_join(oceny_rotten_tomatoes, by = c('Title' = 'Title'))

##                                    Title Metacritic.Score Rotten.Tomatoes.Score
## 1                       Lets Fight Ghost               82                    98
## 2                    HOW TO BUILD A GIRL               69                    79
## 3                          The Invisible               36                    20
## 4                                  Joker               59                    68
## 5                                      I               51                    52
## 6                       Harrys Daughters               85                    96
## 7                             The Closet               72                    85
## 8                          Trial by Fire               51                    61
## 9                  Framing John DeLorean               67                    90
## 10                                 Alice               67                    75
## 11                       Ordinary People               86                    89
## 12                     Paths of the Soul               90                    94
## 13                      Rebel in the Rye               46                    30
## 14                                 Stray               54                    56
## 15                           Stand by Me               75                    91
## 16                          Wonderstruck               71                    68
## 17                    Intimate Strangers               71                    86
## 18                 The Girl on the Train               48                    44
## 19                        Ride Your Wave               63                    93
## 20                                   Red               60                    72
## 21                                Burden               57                    97
## 22                                  Love               51                    40
## 23                            The Shadow               50                    35
## 24                             Aftermath               44                    42
## 25                              Repo Man               82                    98
## 26                  For Love of the Game               43                    46
## 27               The Replacement Killers               42                    36
## 28            The Simple Minded Murderer               NA                    92
## 29         Comrades: Almost a Love Story               NA                    89
## 30                        The Mysterians               NA                    51
## 31                                Repast               NA                    87
## 32                                  Sway               NA                    86
## 33       When a Woman Ascends the Stairs               NA                   100
## 34                              Yearning               NA                    88
## 35                       Ginza Cosmetics               NA                    45
## 36                       Floating Clouds               NA                    83
## 37                  Life and Nothing But               NA                    86
## 38                 Let Joy Reign Supreme               NA                    79
## 39                       Coup de Torchon               NA                    83
## 40                     Keys To The Heart               NA                    77
## 41               Gonjiam: Haunted Asylum               NA                    91
## 42                        Golden Slumber               NA                    75
## 43                           Extreme Job               NA                    82
## 44                               Default               NA                    78
## 45 The Accidental Detective 2: In Action               NA                    73
## 46              1987: When the Day Comes               NA                    82
## 47                       Ten Years Japan               NA                   100
## 48                            Overcoming               NA                    88
## 49                  Awara Paagal Deewana               NA                    54

oceny_metacritic %>%
  inner_join(oceny_rotten_tomatoes, by = c('Title' = 'Title'))

##                      Title Metacritic.Score Rotten.Tomatoes.Score
## 1         Lets Fight Ghost               82                    98
## 2      HOW TO BUILD A GIRL               69                    79
## 3            The Invisible               36                    20
## 4                    Joker               59                    68
## 5                        I               51                    52
## 6         Harrys Daughters               85                    96
## 7               The Closet               72                    85
## 8            Trial by Fire               51                    61
## 9    Framing John DeLorean               67                    90
## 10                   Alice               67                    75
## 11         Ordinary People               86                    89
## 12       Paths of the Soul               90                    94
## 13        Rebel in the Rye               46                    30
## 14                   Stray               54                    56
## 15             Stand by Me               75                    91
## 16            Wonderstruck               71                    68
## 17      Intimate Strangers               71                    86
## 18   The Girl on the Train               48                    44
## 19          Ride Your Wave               63                    93
## 20                     Red               60                    72
## 21                  Burden               57                    97
## 22                    Love               51                    40
## 23              The Shadow               50                    35
## 24               Aftermath               44                    42
## 25                Repo Man               82                    98
## 26    For Love of the Game               43                    46
## 27 The Replacement Killers               42                    36

oceny_metacritic %>%
  full_join(oceny_rotten_tomatoes, by = c('Title' = 'Title'))

##                                    Title Metacritic.Score Rotten.Tomatoes.Score
## 1                       Lets Fight Ghost               82                    98
## 2                    HOW TO BUILD A GIRL               69                    79
## 3                          The Invisible               36                    20
## 4                                  Joker               59                    68
## 5                                      I               51                    52
## 6                       Harrys Daughters               85                    96
## 7                             The Closet               72                    85
## 8                          Trial by Fire               51                    61
## 9                        Dilili in Paris               37                    NA
## 10                 Framing John DeLorean               67                    90
## 11                                 Alice               67                    75
## 12                       Ordinary People               86                    89
## 13                     Paths of the Soul               90                    94
## 14                      Rebel in the Rye               46                    30
## 15                            The Return               82                    NA
## 16                                 Stray               54                    56
## 17                           Stand by Me               75                    91
## 18                          Wonderstruck               71                    68
## 19                    Intimate Strangers               71                    86
## 20                 The Girl on the Train               48                    44
## 21                        Ride Your Wave               63                    93
## 22                                Capone               46                    NA
## 23                       Above Suspicion               57                    NA
## 24                         A Call to Spy               65                    NA
## 25                                   Red               60                    72
## 26                        The Mole Agent               69                    NA
## 27                          I Care a Lot               67                    NA
## 28                                Burden               57                    97
## 29                            Collective               95                    NA
## 30                                  Love               51                    40
## 31                                Amanda               63                    NA
## 32                        Corpus Christi               77                    NA
## 33                            The Shadow               50                    35
## 34                             Aftermath               44                    42
## 35                              Unhinged               40                    NA
## 36              John Lewis: Good Trouble               70                    NA
## 37                              Repo Man               82                    98
## 38                  For Love of the Game               43                    46
## 39               The Replacement Killers               42                    36
## 40            The Simple Minded Murderer               NA                    92
## 41         Comrades: Almost a Love Story               NA                    89
## 42                        The Mysterians               NA                    51
## 43                                Repast               NA                    87
## 44                                  Sway               NA                    86
## 45       When a Woman Ascends the Stairs               NA                   100
## 46                              Yearning               NA                    88
## 47                       Ginza Cosmetics               NA                    45
## 48                       Floating Clouds               NA                    83
## 49                  Life and Nothing But               NA                    86
## 50                 Let Joy Reign Supreme               NA                    79
## 51                       Coup de Torchon               NA                    83
## 52                     Keys To The Heart               NA                    77
## 53               Gonjiam: Haunted Asylum               NA                    91
## 54                        Golden Slumber               NA                    75
## 55                           Extreme Job               NA                    82
## 56                               Default               NA                    78
## 57 The Accidental Detective 2: In Action               NA                    73
## 58              1987: When the Day Comes               NA                    82
## 59                       Ten Years Japan               NA                   100
## 60                            Overcoming               NA                    88
## 61                  Awara Paagal Deewana               NA                    54

oceny_metacritic %>%
  anti_join(oceny_rotten_tomatoes, by = c('Title' = 'Title'))

##                       Title Metacritic.Score
## 1           Dilili in Paris               37
## 2                The Return               82
## 3                    Capone               46
## 4           Above Suspicion               57
## 5             A Call to Spy               65
## 6            The Mole Agent               69
## 7              I Care a Lot               67
## 8                Collective               95
## 9                    Amanda               63
## 10           Corpus Christi               77
## 11                 Unhinged               40
## 12 John Lewis: Good Trouble               70

oceny_rotten_tomatoes %>%
  anti_join(oceny_metacritic, by = c('Title' = 'Title'))

##                                    Title Rotten.Tomatoes.Score
## 1             The Simple Minded Murderer                    92
## 2          Comrades: Almost a Love Story                    89
## 3                         The Mysterians                    51
## 4                                 Repast                    87
## 5                                   Sway                    86
## 6        When a Woman Ascends the Stairs                   100
## 7                               Yearning                    88
## 8                        Ginza Cosmetics                    45
## 9                        Floating Clouds                    83
## 10                  Life and Nothing But                    86
## 11                 Let Joy Reign Supreme                    79
## 12                       Coup de Torchon                    83
## 13                     Keys To The Heart                    77
## 14               Gonjiam: Haunted Asylum                    91
## 15                        Golden Slumber                    75
## 16                           Extreme Job                    82
## 17                               Default                    78
## 18 The Accidental Detective 2: In Action                    73
## 19              1987: When the Day Comes                    82
## 20                       Ten Years Japan                   100
## 21                            Overcoming                    88
## 22                  Awara Paagal Deewana                    54
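The join variants used in this section can be compared side by side on two tiny tables. The sketch below uses hypothetical data in which titles B and C appear in both tables, A only in the left one and D only in the right one, and then counts the rows returned by each join.

```r
library(dplyr)

# Hypothetical toy tables sharing the key column Title
left_tbl <- data.frame(
  Title = c('A', 'B', 'C'),
  Metacritic.Score = c(82, 69, 36),
  stringsAsFactors = FALSE
)
right_tbl <- data.frame(
  Title = c('B', 'C', 'D'),
  Rotten.Tomatoes.Score = c(79, 20, 92),
  stringsAsFactors = FALSE
)

# Number of rows returned by each kind of join
rows <- c(
  left  = nrow(left_join(left_tbl, right_tbl, by = 'Title')),   # all left rows
  right = nrow(right_join(left_tbl, right_tbl, by = 'Title')),  # all right rows
  inner = nrow(inner_join(left_tbl, right_tbl, by = 'Title')),  # only shared keys
  full  = nrow(full_join(left_tbl, right_tbl, by = 'Title')),   # keys from either table
  anti  = nrow(anti_join(left_tbl, right_tbl, by = 'Title'))    # left rows without a match
)

print(rows)
```

The same relationship holds for the outputs above: the full join has as many rows as the left join plus the rows of the right table that have no match in the left one (39 + 22 = 61).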