Tidy data

What is tidy data?

Multiple variables per column

Being a busy person, you don’t want to spend too much time on Netflix, so you decide to crunch some numbers on TV show and movie durations before deciding what to watch. You’ve managed to obtain a dataset named netflix_df, but its duration column has an issue. It contains strings with both a value and unit of duration (“min” or “Season”).

You’ll tidy this dataset so that each variable gets its own column.

As will always be the case in this course, the tidyr package has been pre-loaded for you.

Inspect netflix_df by typing its name directly in the R console at the bottom-right and hitting Enter to see what string separates the value from the unit in the duration column.

netflix_df <- read.delim("DATABASE/netflix.txt", sep = ",")

Separate the duration column over two variables named value and unit. Pass the string separating the number from the unit to the sep argument.

netflix_df %>% 
  # Split the duration column into value and unit columns
  separate(duration, into = c("value", "unit"), sep = " ", convert = TRUE)

##       X                    title    type value    unit
## 1     1               Article 15   Movie   125     min
## 2     2      Kill Me If You Dare   Movie   100     min
## 3     3                  The Spy TV Show     1  Season
## 4     4        The World We Make   Movie   108     min
## 5     5                 Watchman   Movie    93     min
## 6     6    Mo Gilligan: Momentum   Movie    64     min
## 7     7                   Domino   Movie    89     min
## 8     8                TUNA GIRL   Movie    90     min
## 9     9            Bard of Blood TV Show     1  Season
## 10   10   Dragons: Rescue Riders TV Show     1  Season
## 11   11                 Skylines TV Show     1  Season
## 12   12           The Politician TV Show     1  Season
## 13   13                   Malaal   Movie   133     min
## 14   14                  Birders   Movie    38     min
## 15   15                    Furie   Movie    97     min
## 16   16     Oh! Baby (Malayalam)   Movie   146     min
## 17   17         Oh! Baby (Tamil)   Movie   146     min
## 18   18  Inside Man: Most Wanted   Movie   106     min
## 19   19             Manmadhudu 2   Movie   152     min
## 20   20              Team Kaylie TV Show     1  Season
## 21   21                  BONDING TV Show     1  Season
## 22   22                 Vagabond TV Show     1  Season
## 23   23         Criminal: France TV Show     1  Season
## 24   24        Criminal: Germany TV Show     1  Season
## 25   25          Criminal: Spain TV Show     1  Season
## 26   26             Criminal: UK TV Show     1  Season
## 27   27             Daddy Issues   Movie    82     min
## 28   28         The Hockey Girls TV Show     1  Season
## 29   29   True: Tricky Treat Day   Movie    24     min
## 30   30              Kabir Singh   Movie   171     min
## 31   31 When the Camellia Blooms TV Show     1  Season
## 32   32               Avengement   Movie    88     min
## 33   33   The Last Kids on Earth TV Show     1  Season
## 34   34       Surviving R. Kelly TV Show     1  Season
## 35   35             The Outsider   Movie    86     min
## 36   36                 Oh! Baby   Movie   157     min
## 37   37      American Warfighter   Movie   106     min
## 38   38             Deviant Love   Movie    87     min
## 39   39                 Marianne TV Show     1  Season
## 40   40                  Monarca TV Show     1  Season
## 41   41        Shanghai Fortress   Movie   107     min
## 42   42                Tall Girl   Movie   103     min
## 43   43                  Top Boy TV Show     1  Season
## 44   44             Unbelievable TV Show     1  Season
## 45   45               The I-Land TV Show     1  Season
## 46   46      The Mind, Explained TV Show     1  Season
## 47   47   Bill Burr: Paper Tiger   Movie    67     min
## 48   48                   Evelyn   Movie    96     min
## 49   49            Our Godfather   Movie    93     min
## 50   50       Mokalik (Mechanic)   Movie   100     min
## 51   51        Uncle Naji in UAE   Movie    95     min
## 52   52                    After   Movie   106     min
## 53   53            Rhythm + Flow TV Show     1  Season
## 54   54  Deon Cole: Cole Hearted   Movie    70     min
## 55   55  My Country: The New Age TV Show     1  Season
## 56   56                 Noblemen   Movie   109     min
## 57   57          Brother in Love   Movie   110     min
## 58   58        In the Tall Grass   Movie   102     min
## 59   59        Kids on the Block   Movie   102     min
## 60   60             Raising Dion TV Show     1  Season
## 61   61              Nowhere Man TV Show     1  Season
## 62   62           Mehandi Circus   Movie   126     min
## 63   63                   Tunnel TV Show     1  Season
## 64   64               Seis Manos TV Show     1  Season
## 65   65           A 3 Minute Hug   Movie    28     min
## 66   66           Wait, My Youth TV Show     1  Season
## 67   67         The Last Whistle   Movie    88     min
## 68   68                 Incoming   Movie    89     min
## 69   69               Assimilate   Movie    94     min
## 70   70              Brotherhood TV Show     1  Season
## 71   71      Dolemite Is My Name   Movie   118     min
## 72   72       It Takes a Lunatic   Movie   127     min
## 73   73        Nailed It! France TV Show     1  Season
## 74   74         Nailed It! Spain TV Show     1  Season
## 75   75         Prank Encounters TV Show     1  Season
## 76   76              Rattlesnake   Movie    86     min
## 77   77              The Untamed TV Show     1  Season
## 78   78                 Daybreak TV Show     1  Season
## 79   79       Echo in the Canyon   Movie    82     min
## 80   80 Revenge of the Pontianak   Movie    92     min
## 81   81   Dancing with the Birds   Movie    52     min
## 82   82      Living Undocumented TV Show     1  Season
## 83   83          Ready to Mingle   Movie    98     min
## 84   84                      Eli   Movie    98     min
## 85   85  Interior Design Masters TV Show     1  Season
## 86   86     Living with Yourself TV Show     1  Season
## 87   87                Seventeen   Movie   100     min
## 88   88         Tell Me Who I Am   Movie    86     min
## 89   89           The Laundromat   Movie    96     min
## 90   90      Unnatural Selection TV Show     1  Season
## 91   91                 Upstarts   Movie   112     min
## 92   92             THE UNLISTED TV Show     1  Season
## 93   93     Ghosts of Sugar Land   Movie    22     min
## 94   94                 Homeland   Movie    73     min
## 95   95            Magical Andes TV Show     1  Season
## 96   96              Street Flow   Movie    96     min
## 97   97          The Lies Within TV Show     1  Season
## 98   98                Fractured   Movie   100     min
## 99   99      The Bling Lagosians   Movie    99     min
## 100 100       The Forest of Love   Movie   151     min
## 101 101            The Influence   Movie   102     min
## 102 102      Ultramarine Magmell TV Show     1  Season
## 103 103                   A.M.I.   Movie    77     min
## 104 104    Nikki Glaser: Bangin’   Movie    64     min
## 105 105       Green Eggs and Ham TV Show     1  Season
## 106 106              Let It Snow   Movie    93     min
## 107 107           Paradise Beach   Movie    94     min
## 108 108             Burning Cane   Movie    78     min
## 109 109                    SCAMS TV Show     1  Season
## 110 110  Seth Meyers: Lobby Baby   Movie    61     min
## 111 111         Tune in for Love   Movie   123     min
## 112 112     Undercover Brother 2   Movie    85     min
## 113 113                    Voice TV Show     1  Season
## 114 114    Oththa Seruppu Size 7   Movie   103     min
## 115 115      The Devil Next Door TV Show     1  Season
## 116 116                Chocolate TV Show     1  Season
## 117 117          The Zoya Factor   Movie   135     min
## 118 118                Atlantics   Movie   106     min
## 119 119           I Lost My Body   Movie    81     min
## 120 120     Sugar Rush Christmas TV Show     1  Season
## 121 121     The Charming Stepmom TV Show     1  Season
## 122 122  The Movies That Made Us TV Show     1  Season
## 123 123             Holiday Rush   Movie    94     min
## 124 124                   Levius TV Show     1  Season
## 125 125          Lugar de Mulher TV Show     1  Season
## 126 126     Merry Happy Whatever TV Show     1  Season
## 127 127              Mythomaniac TV Show     1  Season
## 128 128                   Broken TV Show     1  Season
## 129 129      Evvarikee Cheppoddu   Movie   134     min
## 130 130 Little Singham: Mahabali   Movie    69     min
## 131 131             The Irishman   Movie   209     min
## 132 132      True: Winter Wishes   Movie    46     min
## 133 133                Zona Rosa TV Show     1  Season
## 134 134                  Pranaam   Movie   117     min
## 135 135 What the F* Is Going On?   Movie    87     min
## 136 136                    Awake   Movie    92     min
## 137 137                  Brother   Movie    97     min
## 138 138          Dino Girl Gauko TV Show     1  Season
## 139 139 Narcoworld: Dope Stories TV Show     1  Season
## 140 140         Nobody's Looking TV Show     1  Season
## 141 141          Shelby American   Movie   119     min
## 142 142         Singapore Social TV Show     1  Season
## 143 143                   Mortel TV Show     1  Season
## 144 144          Bangkok Buddies TV Show     1  Season
## 145 145                Dorasaani   Movie   134     min
## 146 146          Holiday Secrets TV Show     1  Season
## 147 147                Mallesham   Movie   131     min
## 148 148            Miss Culinary TV Show     1  Season
## 149 149        No Time for Shame TV Show     1  Season
## 150 150             Candy Online TV Show     1  Season
## 151 151          Earthquake Bird   Movie   107     min
## 152 152                 Guna 369   Movie   143     min
## 153 153             House Arrest   Movie   105     min
## 154 154                    Klaus   Movie    98     min
## 155 155                 The Club TV Show     1  Season
## 156 156             The Stranded TV Show     1  Season
## 157 157          My Dear Warrior TV Show     1  Season
## 158 158               Submission   Movie    95     min
## 159 159         To Be of Service   Movie    88     min
## 160 160        A Remarkable Tale   Movie    94     min
## 161 161             American Son   Movie    90     min
## 162 162       Christmas Break-In   Movie    87     min
## 163 163                    Drive   Movie   119     min
## 164 164         Fire in Paradise   Movie    40     min
## 165 165                    Hache TV Show     1  Season
## 166 166              Hello Ninja TV Show     1  Season
## 167 167      Holiday in the Wild   Movie    86     min
## 168 168               Santa Girl   Movie    91     min
## 169 169                 The King   Movie   141     min
## 170 170  The Man Without Gravity   Movie   107     min
## 171 171          We Are the Wave TV Show     1  Season
## 172 172                    Abyss TV Show     1  Season
## 173 173          I Have a Script TV Show     1  Season
## 174 174  In the Bosom of a Thorn TV Show     1  Season
## 175 175               The Writer TV Show     1  Season
## 176 176                 What If? TV Show     1  Season
## 177 177              Last Breath   Movie    86     min
## 178 178      The Wandering Earth   Movie   125     min
## 179 179       Always Be My Maybe   Movie   102     min
## 180 180               Chopsticks   Movie   101     min
## 181 181           Killer Ratings TV Show     1  Season
## 182 182                  Pegasus   Movie    98     min
## 183 183         When They See Us TV Show     1  Season
## 184 184  Svaha: The Sixth Finger   Movie   123     min
## 185 185         All In My Family   Movie    40     min
## 186 186 Crime Diaries: Night Out TV Show     1  Season
## 187 187               Dead to Me TV Show     1  Season
## 188 188       Despite Everything   Movie    79     min
## 189 189                   Flinch TV Show     1  Season
## 190 190          The Last Summer   Movie   110     min
## 191 191            Tuca & Bertie TV Show     1  Season
## 192 192               Undercover TV Show     1  Season
## 193 193                  Charmed TV Show     1  Season
## 194 194        Historical Roasts TV Show     1  Season
## 195 195                Thackeray TV Show     1  Season
## 196 196              After Maria   Movie    38     min
## 197 197                      Joy   Movie   101     min
## 198 198         Rim of the World   Movie    99     min
## 199 199           The Perfection   Movie    91     min
## 200 200                WHAT / IF TV Show     1  Season
## 201 201   A Tale of Two Kitchens   Movie    30     min
## 202 202  Wanda Sykes: Not Normal   Movie    66     min
## 203 203        SGT. Will Gardner   Movie   125     min
## 204 204              The Brawler   Movie    95     min
## 205 205                     1994 TV Show     1  Season
## 206 206              It's Bruno! TV Show     1  Season
## 207 207                    Maria   Movie    90     min
## 208 208        See You Yesterday   Movie    87     min
## 209 209       Well-Intended Love TV Show     1  Season
## 210 210                 Good Sam   Movie    90     min
## 211 211           PATRICK (2019)   Movie    94     min
## 212 212               I Hear You TV Show     1  Season
## 213 213              Backdraft 2   Movie   102     min
## 214 214            Malibu Rescue   Movie    70     min
## 215 215             The Defected TV Show     1  Season
## 216 216              Shéhérazade   Movie   111     min
## 217 217                Jailbirds TV Show     1  Season
## 218 218              The Society TV Show     1  Season
## 219 219             Wine Country   Movie   104     min
## 220 220    Hidden in Plain Sight   Movie    87     min
## 221 221     Knock Down The House   Movie    88     min
## 222 222      Roswell, New Mexico TV Show     1  Season
## 223 223               Sur Sapata   Movie   140     min
## 224 224     The Kindness Diaries TV Show     1  Season
## 225 225               After Life TV Show     1  Season
## 226 226                  Juanita   Movie    90     min
## 227 227                   Lady J   Movie   111     min
## 228 228                   Shadow TV Show     1  Season
## 229 229       Walk. Ride. Rodeo.   Movie   100     min
## 230 230                The Order TV Show     1  Season
## 231 231             The Hard Way   Movie    94     min
## 232 232              Mercy Black   Movie    88     min
## 233 233                   15-Aug   Movie   124     min
## 234 234                  Bayonet   Movie   102     min
## 235 235                  Osmosis TV Show     1  Season
## 236 236           The Highwaymen   Movie   132     min
## 237 237                 Traitors TV Show     1  Season
## 238 238             All American TV Show     1  Season
## 239 239              Delhi Crime TV Show     1  Season
## 240 240     Most Beautiful Thing TV Show     1  Season
## 241 241           Selling Sunset TV Show     1  Season
## 242 242                 The Dirt   Movie   108     min
## 243 243              Stay Tuned! TV Show     1  Season
## 244 244     My Husband Won't Fit TV Show     1  Season
## 245 245      Amy Schumer Growing   Movie    61     min
## 246 246               Green Door TV Show     1  Season
## 247 247     Love, Death & Robots TV Show     1  Season
## 248 248          Turn Up Charlie TV Show     1  Season
## 249 249          Triple Frontier   Movie   126     min
## 250 250                   Losers TV Show     1  Season
## 251 251          Northern Rescue TV Show     1  Season
## 252 252       Beneath the Leaves   Movie    90     min
## 253 253       Berlin, I Love You   Movie   120     min
## 254 254          Elisa & Marcela   Movie   119     min
## 255 255              I Am Mother   Movie   114     min
## 256 256              Luka Chuppi   Movie   125     min
## 257 257                Pachamama   Movie    71     min
## 258 258            Rock My Heart   Movie   110     min
## 259 259        Tales of the City TV Show     1  Season
## 260 260      The Black Godfather   Movie   118     min
## 261 261            Alles ist gut   Movie    90     min
## 262 262          Everybody Knows   Movie   133     min
## 263 263             Dolphin Kick   Movie    90     min
## 264 264             Inhuman Kiss   Movie   122     min
## 265 265       Romeo Akbar Walter   Movie   141     min
## 266 266             Super Deluxe   Movie   176     min
## 267 267                   7SEEDS TV Show     1  Season
## 268 268                Exhibit A TV Show     1  Season
## 269 269          Family Business TV Show     1  Season
## 270 270            Staged Killer   Movie    88     min
## 271 271              Super Bheem TV Show     1  Season
## 272 272           The Chosen One TV Show     1  Season
## 273 273                    ANIMA   Movie    15     min
## 274 274        Answer for Heaven TV Show     1  Season
## 275 275  Daniel Sosa: Maleducado   Movie    55     min
## 276 276                 Unbroken   Movie   109     min
## 277 277 Mike Epps: Only One Mike   Movie    63     min
## 278 278            Triple Threat   Movie    96     min
## 279 279          Blood Will Tell   Movie   113     min
## 280 280                  Bolívar TV Show     1  Season
## 281 281             Mr. Iglesias TV Show     1  Season
## 282 282                 Rakkhosh   Movie   116     min
## 283 283          The Wolf's Call   Movie   116     min
## 284 284       Arthdal Chronicles TV Show     1  Season
## 285 285                  Krutant   Movie   114     min
## 286 286                    Beats   Movie   110     min
## 287 287    The Edge of Democracy   Movie   122     min
## 288 288           Charité at War TV Show     1  Season
## 289 289        Life Overtakes Me   Movie    40     min
## 290 290                   Luckee   Movie   113     min
## 291 291   Somewhere Only We Know TV Show     1  Season
## 292 292     The Alcàsser Murders TV Show     1  Season
## 293 293            DJ Cinderella   Movie    95     min
## 294 294                    Leila TV Show     1  Season
## 295 295           Murder Mystery   Movie    98     min
## 296 296                 Trinkets TV Show     1  Season
## 297 297                   Yankee TV Show     1  Season
## 298 298                     Jinn TV Show     1  Season
## 299 299            The 3rd Eye 2   Movie   117     min
## 300 300    Jo Koy: Comin' In Hot   Movie    60     min
## 301 301    A Thousand Goodnights TV Show     1  Season
## 302 302              Oh, Ramona!   Movie   109     min
## 303 303            Then Came You   Movie    97     min
## 304 304           We Are Legends   Movie   109     min
## 305 305   Aziz Ansari: RIGHT NOW   Movie    65     min
## 306 306           Krishna Balram TV Show     1  Season
## 307 307              In The Dark TV Show     1  Season
## 308 308                7 (Seven) TV Show     1  Season
## 309 309   All Good Ones Get Away   Movie    83     min
## 310 310         American Hangman   Movie    99     min
## 311 311           The Last Czars TV Show     1  Season
## 312 312         Mission of Honor   Movie   107     min
## 313 313           Twelve Forever TV Show     1  Season
## 314 314                    Eerie   Movie   100     min
## 315 315         Girls With Balls   Movie    77     min
## 316 316                  The Son   Movie    93     min
## 317 317             Another Life TV Show     1  Season
## 318 318           The Great Hack   Movie   114     min
## 319 319  Eurovision Song Contest TV Show     1  Season
## 320 320             Sexy Central TV Show     1  Season
## 321 321           ¡Ay, mi madre!   Movie    81     min
## 322 322               Typewriter TV Show     1  Season
## 323 323         Secret Obsession   Movie    98     min
## 324 324          Unrequited Love TV Show     1  Season
## 325 325                       4L   Movie   105     min
## 326 326               Blown Away TV Show     1  Season
## 327 327       Extreme Engagement TV Show     1  Season
## 328 328        Kidnapping Stella   Movie    90     min
## 329 329         One Spring Night TV Show     1  Season
## 330 330              Point Blank   Movie    87     min
## 331 331          Taco Chronicles TV Show     1  Season
## 332 332               True Tunes TV Show     1  Season
## 333 333             Chhota Bheem TV Show     1  Season
## 334 334           Family Reunion TV Show     1  Season
## 335 335 Parchís: the Documentary   Movie   106     min
## 336 336             Soul to Keep   Movie    94     min
## 337 337         Speed Is My Need   Movie    75     min
## 338 338                  Kingdom TV Show     1  Season
## 339 339                    Polar   Movie   119     min
## 340 340      Thieves of the Wood TV Show     1  Season
## 341 341                    Close   Movie    95     min
## 342 342                       IO   Movie    96     min
## 343 343                   Patria   Movie    90     min
## 344 344                  Deadcon   Movie    78     min
## 345 345                  Jezebel   Movie    86     min
## 346 346                 NiNoKuni   Movie   107     min
## 347 347           Eye For An Eye   Movie   108     min
## 348 348               Saving Zoë   Movie    95     min
## 349 349                The Brave   Movie   103     min
## 350 350            Deadly Switch   Movie    86     min
## 351 351         Women of Mafia 2   Movie   132     min
## 352 352            Made in China   Movie   128     min
## 353 353            Sex Education TV Show     1  Season
## 354 354           The Last Laugh   Movie    99     min
## 355 355              Giri / Haji TV Show     1  Season
## 356 356               Fall Girls   Movie    80     min
## 357 357         High Flying Bird   Movie    91     min
## 358 358        Nailed It! Mexico TV Show     1  Season
## 359 359  Romance is a bonus book TV Show     1  Season
## 360 360                Firebrand   Movie   117     min
## 361 361                Paddleton   Movie    89     min
## 362 362              Paris Is Us   Movie    84     min
## 363 363                   The 43 TV Show     1  Season
## 364 364     The Umbrella Academy TV Show     1  Season
## 365 365            Dating Around TV Show     1  Season
## 366 366           Always a Witch TV Show     1  Season
## 367 367             Russian Doll TV Show     1  Season
## 368 368   True: Happy Hearts Day   Movie    24     min
## 369 369           Velvet Buzzsaw   Movie   113     min
## 370 370                    Saaho   Movie   172     min
## 371 371        Nothing to Lose 2   Movie    97     min
## 372 372                  Glow Up TV Show     1  Season
## 373 373           Marriage Story   Movie   137     min
## 374 374    The Confession Killer TV Show     1  Season
## 375 375  Three Days of Christmas TV Show     1  Season
## 376 376           Triad Princess TV Show     1  Season
## 377 377             Virgin River TV Show     1  Season
## 378 378       Home for Christmas TV Show     1  Season
## 379 379                   V Wars TV Show     1  Season
## 380 380              Let's Dance   Movie   109     min
## 381 381         The Road to Love TV Show     1  Season
## 382 382                   Aaviri   Movie   110     min
## 383 383  ARASHI's Diary -Voyage- TV Show     1  Season
## 384 384               How High 2   Movie    89     min
## 385 385             The Neighbor TV Show     1  Season
## 386 386                 Wish Man   Movie   108     min
## 387 387                 The Gift TV Show     1  Season
## 388 388                  The App   Movie    79     min
## 389 389   The Bonfire of Destiny TV Show     1  Season
## 390 390       Doom: Annihilation   Movie    97     min
## 391 391     Grand-Daddy Day Care   Movie    96     min
## 392 392   Jarhead: Law of Return   Movie   103     min
## 393 393               Sweetheart   Movie    83     min
## 394 394     Como caído del cielo   Movie   117     min
## 395 395           Jacob's Ladder   Movie    90     min
## 396 396          Back of the Net   Movie    86     min
## 397 397                    Agent TV Show     1  Season
## 398 398            The Two Popes   Movie   126     min
## 399 399              The Witcher TV Show     1  Season
## 400 400           After the Raid   Movie    25     min
## 401 401        Twice Upon A Time TV Show     1  Season
## 402 402               Soundtrack TV Show     1  Season
## 403 403     Crash Landing on You TV Show     1  Season
## 404 404        Holy Expectations   Movie    89     min
## 405 405            6 Underground   Movie   129     min
## 406 406         The 9th Precinct   Movie    95     min
## 407 407          The Sky Is Pink   Movie   143     min
## 408 408 Michelle Wolf: Joke Show   Movie    60     min
## 409 409               Baby Dolls   Movie   127     min
## 410 410                Dead Kids   Movie    94     min
## 411 411        High End Yaariyan   Movie   120     min
## 412 412          Wandering Stars   Movie    95     min
## 413 413           Little Singham TV Show     1  Season
## 414 414                 Sintonia TV Show     1  Season
## 415 415               The Family TV Show     1  Season
## 416 416        The InBESTigators TV Show     1  Season
## 417 417        Tiny House Nation TV Show     1  Season
## 418 418   Westside vs. the World   Movie    96     min
## 419 419                   Dollar TV Show     1  Season
## 420 420       The Naked Director TV Show     1  Season
## 421 421             Wu Assassins TV Show     1  Season
## 422 422                    Badla   Movie   117     min
## 423 423              The Pursuit   Movie    76     min
## 424 424          Enter the Anime   Movie    59     min
## 425 425           Back to School   Movie    84     min
## 426 426         CAROLE & TUESDAY TV Show     1  Season
## 427 427          Mi amigo Alexis   Movie   102     min
## 428 428        Styling Hollywood TV Show     1  Season
## 429 429                     Expo   Movie    81     min
## 430 430         Falling Inn Love   Movie    98     min
## 431 431                   Kardec   Movie   110     min
## 432 432                  Fanatic   Movie    91     min
## 433 433           Kaake Da Viyah   Movie   133     min
## 434 434              Mayday Life   Movie   120     min
## 435 435    Rust Valley Restorers TV Show     1  Season
## 436 436                   Saavat   Movie   118     min
## 437 437               Love Alarm TV Show     1  Season
## 438 438         American Factory   Movie   110     min
## 439 439               Hyperdrive TV Show     1  Season
## 440 440  Simon Amstell: Set Free   Movie    52     min
## 441 441    Basketball or Nothing TV Show     1  Season
## 442 442                  Léa & I   Movie    84     min
## 443 443                Otherhood   Movie   101     min
## 444 444                Uriyadi 2   Movie   116     min
## 445 445                   45 rpm TV Show     1  Season
## 446 446           Better Than Us TV Show     1  Season
## 447 447                Diagnosis TV Show     1  Season
## 448 448        For Love or Money   Movie    95     min
## 449 449           Green Frontier TV Show     1  Season
## 450 450               Sextuplets   Movie   100     min
## 451 451        The King's Avatar TV Show     1  Season
## 452 452   The Little Switzerland   Movie    86     min
## 453 453           Cannon Busters TV Show     1  Season
## 454 454     Til Death Do Us Part TV Show     1  Season
## 455 455 Whindersson Nunes: Adult   Movie    69     min
## 456 456               Happy Jail TV Show     1  Season
## 457 457                    Uyare   Movie   119     min
## 458 458                    90 ML   Movie   123     min
## 459 459                Woodstock   Movie    97     min
## 460 460      DC Super Hero Girls TV Show     1  Season
## 461 461   Petta (Telugu Version)   Movie   170     min
## 462 462            Alien Warfare   Movie    88     min
## 463 463                 Backfire   Movie    97     min
## 464 464               Our Planet TV Show     1  Season
## 465 465                    Petta   Movie   170     min
## 466 466                Quicksand TV Show     1  Season
## 467 467                  Tijuana TV Show     1  Season
## 468 468            Unicorn Store   Movie    92     min
## 469 469                Possessed TV Show     1  Season
## 470 470       A Home with A View   Movie    92     min
## 471 471                 Chambers TV Show     1  Season
## 472 472         Loving is Losing   Movie    91     min
## 473 473               Money Trap   Movie   122     min
## 474 474              Street Food TV Show     1  Season
## 475 475 Little Singham in London   Movie    66     min
## 476 476         Grass Is Greener   Movie    98     min
## 477 477                 Lunatics TV Show     1  Season
## 478 478            Music Teacher   Movie   102     min
## 479 479      Rilakkuma and Kaoru TV Show     1  Season
## 480 480            Someone Great   Movie    92     min
## 481 481             The Ruthless   Movie   111     min
## 482 482       The Creative Brain   Movie    53     min
## 483 483          A Land Imagined   Movie    95     min
## 484 484           Huge in France TV Show     1  Season
## 485 485                  Special TV Show     1  Season
## 486 486         The Perfect Date   Movie    91     min
## 487 487             Black Summer TV Show     1  Season
## 488 488                  Persona TV Show     1  Season
## 489 489              The Silence   Movie    91     min
## 490 490             You vs. Wild TV Show     1  Season
## 491 491                      706   Movie   118     min
## 492 492                 Ultraman TV Show     1  Season
## 493 493                    Elite TV Show     2 Seasons
## 494 494                   Gotham TV Show     5 Seasons
## 495 495              El Marginal TV Show     3 Seasons
## 496 496                Explained TV Show     2 Seasons
## 497 497                   Glitch TV Show     3 Seasons
## 498 498                Locked Up TV Show     4 Seasons
## 499 499           Disenchantment TV Show     2 Seasons
## 500 500              Fastest Car TV Show     2 Seasons
## 501 501            The Blacklist TV Show     6 Seasons
## 502 502                I'm Sorry TV Show     2 Seasons
## 503 503            The Chef Show TV Show     2 Seasons
## 504 504                The Ranch TV Show     7 Seasons
## 505 505                   Scream TV Show     3 Seasons
## 506 506              Creeped Out TV Show     2 Seasons
## 507 507           Peaky Blinders TV Show     5 Seasons
## 508 508                   Rotten TV Show     2 Seasons
## 509 509           Super Monsters TV Show     3 Seasons
## 510 510            KENGAN ASHURA TV Show     2 Seasons
## 511 511        Flavorful Origins TV Show     2 Seasons
## 512 512                Jeopardy! TV Show     5 Seasons
## 513 513          BoJack Horseman TV Show     6 Seasons
## 514 514       Greenhouse Academy TV Show     3 Seasons
## 515 515      The Kominsky Method TV Show     2 Seasons
## 516 516                     Baby TV Show     2 Seasons
## 517 517                MeatEater TV Show     4 Seasons
## 518 518     The House of Flowers TV Show     2 Seasons
## 519 519                   Chosen TV Show     2 Seasons
## 520 520                  Haunted TV Show     2 Seasons
## 521 521               Insatiable TV Show     2 Seasons
## 522 522         The Hook Up Plan TV Show     2 Seasons
## 523 523     YooHoo to the Rescue TV Show     2 Seasons
## 524 524           Schitt's Creek TV Show     5 Seasons
## 525 525          Carmen Sandiego TV Show     2 Seasons
## 526 526            Little Things TV Show     3 Seasons
## 527 527                  Busted! TV Show     2 Seasons
## 528 528            Wild District TV Show     2 Seasons
## 529 529              Millennials TV Show     2 Seasons
## 530 530          Chip and Potato TV Show     2 Seasons
## 531 531         La Reina del Sur TV Show     2 Seasons
## 532 532                High Seas TV Show     2 Seasons
## 533 533      Nailed It! Holiday! TV Show     2 Seasons
## 534 534        The Dragon Prince TV Show     3 Seasons
## 535 535                The Crown TV Show     3 Seasons
## 536 536              Llama Llama TV Show     2 Seasons
## 537 537    The Toys That Made Us TV Show     3 Seasons
## 538 538                 The Yard TV Show     2 Seasons
## 539 539           Chief of Staff TV Show     2 Seasons
## 540 540                 Atypical TV Show     3 Seasons
## 541 541                 The Deep TV Show     3 Seasons
## 542 542                  Lucifer TV Show     4 Seasons
## 543 543                Bad Blood TV Show     2 Seasons
## 544 544             Supernatural TV Show    14 Seasons
## 545 545     Arrested Development TV Show     5 Seasons
## 546 546                Supergirl TV Show     4 Seasons
## 547 547                Riverdale TV Show     3 Seasons
## 548 548                The Flash TV Show     5 Seasons
## 549 549                    Arrow TV Show     7 Seasons
## 550 550         Prince of Peoria TV Show     2 Seasons
## 551 551                Nailed It TV Show     3 Seasons
## 552 552                 The Rain TV Show     2 Seasons
## 553 553               White Gold TV Show     2 Seasons
## 554 554                     Easy TV Show     3 Seasons
## 555 555            The Mechanism TV Show     2 Seasons
## 556 556              On My Block TV Show     2 Seasons
## 557 557       Santa Clarita Diet TV Show     3 Seasons
## 558 558          Black Lightning TV Show     2 Seasons
## 559 559                   The OA TV Show     2 Seasons
## 560 560             Larva Island TV Show     2 Seasons
## 561 561                       3% TV Show     3 Seasons
## 562 562      Designated Survivor TV Show     3 Seasons
## 563 563             Black Mirror TV Show     5 Seasons
## 564 564          Madam Secretary TV Show     5 Seasons
## 565 565         Documentary Now! TV Show     3 Seasons
## 566 566                  Dynasty TV Show     2 Seasons
## 567 567                     Dope TV Show     3 Seasons
## 568 568            Instant Hotel TV Show     2 Seasons
## 569 569             Motown Magic TV Show     2 Seasons
## 570 570                  Slasher TV Show     3 Seasons
## 571 571          Forest of Piano TV Show     2 Seasons
## 572 572                     Dark TV Show     2 Seasons
## 573 573       Girls Incarcerated TV Show     2 Seasons
## 574 574        Go! Live Your Way TV Show     2 Seasons
## 575 575           The Casketeers TV Show     2 Seasons
## 576 576     The Confession Tapes TV Show     2 Seasons
## 577 577               Aggretsuko TV Show     2 Seasons
## 578 578               Black Spot TV Show     2 Seasons
## 579 579           El desconocido TV Show     2 Seasons
## 580 580   Marvel's Jessica Jones TV Show     3 Seasons
## 581 581                Kakegurui TV Show     2 Seasons
## 582 582                Free Rein TV Show     3 Seasons
## 583 583                Kakegurui TV Show     2 Seasons
## 584 584          Stranger Things TV Show     3 Seasons
## 585 585              The Letdown TV Show     2 Seasons
## 586 586            Yummy Mummies TV Show     2 Seasons
## 587 587      My First First Love TV Show     2 Seasons
## 588 588  Orange Is the New Black TV Show     7 Seasons
## 589 589               Sugar Rush TV Show     2 Seasons
## 590 590          The Worst Witch TV Show     3 Seasons
## 591 591         La casa de papel TV Show     3 Seasons
## 592 592            Last Chance U TV Show     4 Seasons
## 593 593                Queer Eye TV Show     4 Seasons
## 594 594            Pinky Malinky TV Show     3 Seasons
## 595 595 3Below: Tales of Arcadia TV Show     2 Seasons
## 596 596             Bonus Family TV Show     3 Seasons
## 597 597            Club of Crows TV Show     4 Seasons
## 598 598    Marvel's The Punisher TV Show     2 Seasons
## 599 599        Hip-Hop Evolution TV Show     3 Seasons
## 600 600                    Booba TV Show     3 Seasons
## 601 601        Grace and Frankie TV Show     5 Seasons
## 602 602     Friends from College TV Show     2 Seasons
## 603 603             Horrid Henry TV Show     2 Seasons
## 604 604       Harvey Street Kids TV Show     3 Seasons
## 605 605   COMEDIANS of the world TV Show    13 Seasons
## 606 606                Big Mouth TV Show     3 Seasons
## 607 607          Man Like Mobeen TV Show     2 Seasons
## 608 608        One Day at a Time TV Show     3 Seasons
## 609 609             Chef's Table TV Show     6 Seasons
## 610 610   Suburra: Blood on Rome TV Show     2 Seasons
## 611 611              Cable Girls TV Show     4 Seasons
## 612 612                     GLOW TV Show     3 Seasons
## 613 613                  IZombie TV Show     5 Seasons
## 614 614          Jane The Virgin TV Show     5 Seasons
## 615 615             No Good Nick TV Show     2 Seasons
## 616 616      Mighty Little Bheem TV Show     2 Seasons
## 617 617             Workin' Moms TV Show     3 Seasons
## 618 618       Million Pound Menu TV Show     2 Seasons
## 619 619           13 Reasons Why TV Show     3 Seasons
## 620 620                HERO MASK TV Show     2 Seasons
## 621 621        Ask the StoryBots TV Show     3 Seasons
## 622 622        Dear White People TV Show     3 Seasons
## 623 623              Derry Girls TV Show     2 Seasons
## 624 624               MINDHUNTER TV Show     2 Seasons
## 625 625   QB1: Beyond the Lights TV Show     3 Seasons
## 626 626             Sacred Games TV Show     2 Seasons
## 627 627                  The 100 TV Show     6 Seasons
## 628 628               Knightfall TV Show     2 Seasons
## 629 629                Wentworth TV Show     7 Seasons
## 630 630         Men on a Mission TV Show     4 Seasons
## 631 631      Spirit: Riding Free TV Show     8 Seasons
## 632 632        Kim's Convenience TV Show     3 Seasons
## 633 633            The Protector TV Show     2 Seasons
## 634 634                   Cuckoo TV Show     5 Seasons
## 635 635                Samantha! TV Show     2 Seasons
## 636 636      Crazy Ex-Girlfriend TV Show     4 Seasons
## 637 637       Pokémon the Series TV Show     2 Seasons

Columns with multiple values

International phone numbers

You work for a multinational company that uses auto-dialer software to contact its customers. When new customers subscribe online they are asked for a phone number but they often forget to add the country code needed for international calls. You were asked to fix this issue in the database. You’ve been given a data frame with national numbers and country codes named phone_nr_df. Now you want to combine the country_code and national_number columns to create valid international numbers.

Use the unite() function to create a new international_number column, using an empty string as the separator.

phone_nr_df <- read.delim("DATABASE/phone_nr_df.txt", sep ="," )
head(phone_nr_df)

##   X        country country_code national_number
## 1 1            USA            1      2025550117
## 2 2 United Kingdom           44      1632960924
## 3 3         Brazil           55     95552452220
## 4 4      Australia           61      1900654321
## 5 5          China           86     13555953217
## 6 6          India           91      8555843898

phone_nr_df %>%
  # Unite the country_code and national_number columns
  unite("international_number", country_code, national_number, sep = "")

##   X        country international_number
## 1 1            USA          12025550117
## 2 2 United Kingdom         441632960924
## 3 3         Brazil        5595552452220
## 4 4      Australia         611900654321
## 5 5          China        8613555953217
## 6 6          India         918555843898

Extracting observations from values

You’re given a sample of the Netflix dataset containing TV shows and their casts called tvshow_df. You want to learn which six actors have the most appearances.

However, the dataset only has one row per TV show, and multiple actors are listed in the cast column.

Transform the data so that for each TV show, every actor has a row. The number of appearances will be calculated for you.

The dplyr package has been pre-loaded for you. Use separate_rows() on the cast column, using the appropriate separator for the sep argument.

tvshow_df <- read.delim("DATABASE/tvshow_df.txt", sep ="," ) %>% select(.,c(-"X"))

tvshow_df <- tvshow_df %>% 
  mutate(
    OTRA = as.character(tvshow_df$cast)
  ) %>% select(., c("OTRA", "title")) %>% setnames(., c("cast","title" ))

Use the head() function to keep just the top six.

tvshow_df %>% 
  # Separate the actors in the cast column over multiple rows
  separate_rows(cast, sep = ", ") %>% 
  rename(actor = cast) %>% 
  count(actor, sort = TRUE) %>% 
  head()

## # A tibble: 6 x 2
##   actor                  n
##   <chr>              <int>
## 1 Takahiro Sakurai      18
## 2 Yuki Kaji             16
## 3 Daisuke Ono           14
## 4 David Attenborough    14
## 5 Ashleigh Ball         12
## 6 Hiroshi Kamiya        12

Separating into columns and rows

Remember the drink ingredients data from the video? You’ve been given an similar version (drink_df) that also includes quantities and units. Now you want to create an overview of how much of each ingredient you should buy to make these drinks.

The dplyr package has been pre-loaded for you.

Inspect drink_df in the console to find the right separator in the ingredients column.

drink_df <- read.delim("DATABASE/drink_df.txt", sep ="," ) %>% select(., c(-"X"))

drink_df <- drink_df %>% mutate(
  ingredients_off = as.character(drink_df$ingredients)
) %>% select(., c("drink", "ingredients_off")) %>% setnames(., c("drink", "ingredients"))

Separate the ingredients column so that for each drink each ingredient gets a row.

drink_df %>% 
  # Separate the ingredients over rows
  separate_rows(ingredients, sep = ";")

## # A tibble: 9 x 2
##   drink          ingredients      
##   <fct>          <chr>            
## 1 Chocolate milk "milk 0.3 L"     
## 2 Chocolate milk " chocolate 40 g"
## 3 Chocolate milk " sugar 10 g"    
## 4 Orange juice   "oranges 3"      
## 5 Orange juice   " sugar 20 g"    
## 6 Cappuccino     "milk 0.1 L"     
## 7 Cappuccino     " water 0.1 L"   
## 8 Cappuccino     " coffee 30 g"   
## 9 Cappuccino     " sugar 5 g"

Inspect the output of the previous step to find the separator that splits the ingredients column into three columns: ingredient, quantity, and unit. Make sure to convert data types to numeric when possible.

drink_df %>% 
  # Separate the ingredients over rows
  separate_rows(ingredients, sep = "; ") %>% 
  # Separate ingredients into three columns
  separate(
    ingredients, 
    into = c("ingredient", "quantity", "unit"), 
    sep = " ", 
    convert = TRUE
  )

## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [4].

## # A tibble: 9 x 4
##   drink          ingredient quantity unit 
##   <fct>          <chr>         <dbl> <chr>
## 1 Chocolate milk milk            0.3 L    
## 2 Chocolate milk chocolate      40   g    
## 3 Chocolate milk sugar          10   g    
## 4 Orange juice   oranges         3   <NA> 
## 5 Orange juice   sugar          20   g    
## 6 Cappuccino     milk            0.1 L    
## 7 Cappuccino     water           0.1 L    
## 8 Cappuccino     coffee         30   g    
## 9 Cappuccino     sugar           5   g

Group the data by ingredient and unit. Calculate the total quantity of each ingredient.

drink_df %>% 
  # Separate the ingredients over rows
  separate_rows(ingredients, sep = "; ") %>% 
  # Separate ingredients into three columns
  separate(
    ingredients, 
    into = c("ingredient", "quantity", "unit"), 
    sep = " ", 
    convert = TRUE
  ) %>% 
  # Group by ingredient and unit
  group_by(ingredient, unit) %>% 
  # Calculate the total quantity of each ingredient
  summarize(quantity = sum(quantity))

## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [4].

## `summarise()` has grouped output by 'ingredient'. You can override using the `.groups` argument.

## # A tibble: 6 x 3
## # Groups:   ingredient [6]
##   ingredient unit  quantity
##   <chr>      <chr>    <dbl>
## 1 chocolate  g         40  
## 2 coffee     g         30  
## 3 milk       L          0.4
## 4 oranges    <NA>       3  
## 5 sugar      g         35  
## 6 water      L          0.1

Missing values

And the Oscar for best director goes to …

You’re working on a sample of the Netflix dataset pre-loaded as director_df. This time, the data frame contains just the directors and movie titles. Your goal is to identify the directors who created the most movies. Since the director column contains multiple names, you’ll first separate its values over multiple rows and then count the directors.

Since you don’t want movies without directors polluting your overview, you’ll apply the drop_na() function.

The dplyr package has been pre-loaded for you.

director_df <- read.delim("DATABASE/director_df.txt", sep ="," ) %>% select(., c(-"X"))

Inspect director_df in the console to see what string separates directors in the director column. Spread the values in the director column over separate rows.

director_df <- director_df %>% mutate(
  ingredients_off = as.character(director_df$director)
) %>% select(., c("ingredients_off", "title")) %>% setnames(., c("director", "title"))

director_df %>% 
  # Spread the director column over separate rows
  separate_rows(director, sep = ", ")

## # A tibble: 4,815 x 2
##    director              title                                  
##    <chr>                 <fct>                                  
##  1 Richard Finn          Norm of the North: King Sized Adventure
##  2 Tim Maltby            Norm of the North: King Sized Adventure
##  3 <NA>                  Jandino: Whatever it Takes             
##  4 Fernando Lebrija      #realityhigh                           
##  5 Gabe Ibáñez           Automata                               
##  6 Rodrigo Toro          Fabrizio Copano: Solo pienso en mi     
##  7 Francisco Schultz     Fabrizio Copano: Solo pienso en mi     
##  8 Henrik Ruben Genz     Good People                            
##  9 José Miguel Contreras Joaquín Reyes: Una y no más            
## 10 Daniel Alfredson      Kidnapping Mr. Heineken                
## # … with 4,805 more rows

Count the number of times each director appears in the data. Make sure to sort the output.

director_df %>% 
  # Spread the director column over separate rows
  separate_rows(director, sep = ", ") %>% 
  # Count the number of movies per director
  count(director, sort = TRUE)

## # A tibble: 3,536 x 2
##    director             n
##    <chr>            <int>
##  1 <NA>               128
##  2 Jan Suter           21
##  3 Raúl Campos         19
##  4 Jay Karas           14
##  5 Marcus Raboy        14
##  6 Jay Chapman         12
##  7 Martin Scorsese      9
##  8 Steven Spielberg     9
##  9 David Dhawan         8
## 10 Johnnie To           8
## # … with 3,526 more rows

Drop rows containing NA values in the director column.

director_df %>% 
  # Drop rows with NA values in the director column
  drop_na() %>% 
  # Spread the director column over separate rows
  separate_rows(director, sep = ", ") %>% 
  # Count the number of movies per director
  count(director, sort = TRUE)

## # A tibble: 3,535 x 2
##    director             n
##    <chr>            <int>
##  1 Jan Suter           21
##  2 Raúl Campos         19
##  3 Jay Karas           14
##  4 Marcus Raboy        14
##  5 Jay Chapman         12
##  6 Martin Scorsese      9
##  7 Steven Spielberg     9
##  8 David Dhawan         8
##  9 Johnnie To           8
## 10 Lance Bangs          8
## # … with 3,525 more rows

Imputing sales data

You’ve been asked to create a report that allows management to compare sales figures per quarter for two years. The problem is that the dataset (sales_df) contains missing values. You’ll need to impute the values in the year column so that you can visualize the data.

The ggplot2 package has been pre-loaded for you.

 sales_df <- read.delim("DATABASE/sales_df.txt", sep ="," ) %>% select(., c(-"X"))

Inspect sales_df in the console, pay attention to the year column. Use the fill() function to impute the year column in the correct direction. Create a line plot where each year has a different color.

sales_df %>% 
  # Impute the year column
  fill(year,.direction = "up") %>%
  # Create a line plot with sales per quarter colored by year.
  ggplot(aes(x = quarter, y = sales, color = year, group = year)) +
  geom_line()

Nuclear bombs per continent

Since WWII, a number of nations have been detonating nuclear bombs for military research. A tally of bombs detonated per nation has been calculated from the Nuclear Explosion DataBase (NEDB) and provided as nuke_df. You are interested in finding out how many bombs have been detonated by nations grouped per continent. To achieve this goal, nuke_df will be joined to country_to_continent_df which is a mapping of nation to continent. You will need to overwrite missing values with zeros so that you can create a nice plot.

The dplyr and ggplot2 packages have been pre-loaded for you.

Side note 1: Bombs detonated by the Soviet Union were attributed to the Russian Federation.

Side note 2: The Russian Federation is solely mapped to Europe for simplicity.

country_to_continent_df <- read.delim("DATABASE/country_to_continent_df.txt", sep ="," ) %>% select(., c(-"X"))
nuke_df <- read.delim("DATABASE/nuke_df.txt", sep ="," ) %>% select(., c(-"X"))

# country_to_continent_df %>% 
#   left_join(nuke_df, by = "country_code") %>% 
#   # Impute the missing values in the n_bombs column with 0L
#   replace_na(list(n_bombs = 0L))

Group the dataset by continent and aggregate the data by summing the number of bombs.

# country_to_continent_df %>% 
#   left_join(nuke_df, by = "country_code") %>% 
#   # Impute the missing values in the n_bombs column with 0L
#   replace_na(list(n_bombs = 0L)) %>% 
#   # Group the dataset by continent
#   group_by(continent) %>% 
#   # Sum the number of bombs per continent
#   summarize(n_bombs_continent = sum(n_bombs))

Plot the summed number of bombs detonated by nations from each continent.

# country_to_continent_df %>% 
#   left_join(nuke_df, by = "country_code") %>%  
#   # Impute the missing values in the n_bombs column with 0L
#   replace_na(list(n_bombs = 0L)) %>% 
#   # Group the dataset by continent
#   group_by(continent) %>% 
#   # Sum the number of bombs per continent
#   summarize(n_bombs_continent = sum(n_bombs)) %>% 
#   # Plot the number of bombs per continent
#   ggplot(aes(x = continent, y =n_bombs_continent)) +
#   geom_col()

From wide to long and back

From wide to long

Nuclear bombs per country

You’ve been given a version of the Nuclear Explosion DataBase (NEDB) where country names are specified in the column headers (nuke_df). You want to visualize how many nukes were detonated per year per country. You’ll need to pivot the data and replace NA values first.

The ggplot2 package has been pre-loaded for you.

nuke_df <- read.delim("DATABASE/nuke_df.txt", sep = ",") %>% select(., c(-"X"))
head(nuke_df)

##   year United.States Russian.Federation United.Kingdom France China India
## 1 1945             3                 NA             NA     NA    NA    NA
## 2 1946             2                 NA             NA     NA    NA    NA
## 3 1947            NA                 NA             NA     NA    NA    NA
## 4 1948             3                 NA             NA     NA    NA    NA
## 5 1949            NA                  1             NA     NA    NA    NA
## 6 1950            NA                 NA             NA     NA    NA    NA
##   Pakistan North.Korea
## 1       NA          NA
## 2       NA          NA
## 3       NA          NA
## 4       NA          NA
## 5       NA          NA
## 6       NA          NA

Pivot all columns except for year to a longer format.

nuke_df %>% 
  # Pivot the data to a longer format
  pivot_longer(-year)

## # A tibble: 584 x 3
##     year name               value
##    <int> <chr>              <int>
##  1  1945 United.States          3
##  2  1945 Russian.Federation    NA
##  3  1945 United.Kingdom        NA
##  4  1945 France                NA
##  5  1945 China                 NA
##  6  1945 India                 NA
##  7  1945 Pakistan              NA
##  8  1945 North.Korea           NA
##  9  1946 United.States          2
## 10  1946 Russian.Federation    NA
## # … with 574 more rows

The country names are now in the name column. Overwrite its name with country. The value column should be named n_bombs.

nuke_df %>% 
  # Pivot the data to a longer format
  pivot_longer(
    -year, 
    # Overwrite the names of the two new columns
    names_to = "country", 
    values_to = "n_bombs"
  )

## # A tibble: 584 x 3
##     year country            n_bombs
##    <int> <chr>                <int>
##  1  1945 United.States            3
##  2  1945 Russian.Federation      NA
##  3  1945 United.Kingdom          NA
##  4  1945 France                  NA
##  5  1945 China                   NA
##  6  1945 India                   NA
##  7  1945 Pakistan                NA
##  8  1945 North.Korea             NA
##  9  1946 United.States            2
## 10  1946 Russian.Federation      NA
## # … with 574 more rows

Replace the NA values in the n_bombs column with integer zero values (0L).

nuke_df %>% 
  # Pivot the data to a longer format
  pivot_longer(
    -year, 
    # Overwrite the names of the two new columns
    names_to = "country", 
    values_to = "n_bombs"
  ) %>%
  # Replace NA values for n_bombs with 0L
  replace_na(list(n_bombs = 0L))

## # A tibble: 584 x 3
##     year country            n_bombs
##    <int> <chr>                <int>
##  1  1945 United.States            3
##  2  1945 Russian.Federation       0
##  3  1945 United.Kingdom           0
##  4  1945 France                   0
##  5  1945 China                    0
##  6  1945 India                    0
##  7  1945 Pakistan                 0
##  8  1945 North.Korea              0
##  9  1946 United.States            2
## 10  1946 Russian.Federation       0
## # … with 574 more rows

Create a line plot where the number of bombs dropped per country is plotted over time. Use country to color the lines.

nuke_df %>% 
  # Pivot the data to a longer format
  pivot_longer(
    -year, 
    # Overwrite the names of the two new columns
    names_to = "country", 
    values_to = "n_bombs"
  ) %>% 
  # Replace NA values for n_bombs with 0L
  replace_na(list(n_bombs = 0L)) %>% 
  # Plot the number of bombs per country over time
  ggplot(aes(x = year, y = n_bombs, color = country)) +
  geom_line()

WHO obesity per country

According to the World Health Organization (WHO), worldwide obesity has nearly tripled since 1975. You’re interested in the severity of this global health issue per country and whether males and females are affected differently. You’ll use the WHO’s obesity data (obesity_df) to investigate this issue. The data holds the percentage of females, males, and both sexes combined that are considered obese (BMI > 30) per country.

You want to create a scatterplot where, per nation, you can see the obesity data colored differently for females and males. This implies that sex should become a variable with its own column.

The ggplot2 package has been pre-loaded for you.

obesity_df <- read.delim("DATABASE/obesity_df.txt", sep = ",") %>% select(., c(-"X"))
head(obesity_df)

##               country male female both_sexes
## 1         Afghanistan  3.2    7.6        5.5
## 2             Albania 21.6   21.8       21.7
## 3             Algeria 19.9   34.9       27.4
## 4             Andorra 25.9   25.3       25.6
## 5              Angola  4.0   12.1        8.2
## 6 Antigua and Barbuda 11.6   25.9       18.9

Inspect obesity_df in the console. Pivot the male and female columns. The old column names should go in the sex column, the original values should go in the pct_obese column.

obesity_df %>% 
  # Pivot the male and female columns
  pivot_longer(c("male", "female"), 
  names_to = "sex", 
  values_to = "pct_obese"
  
  )

## # A tibble: 380 x 4
##    country     both_sexes sex    pct_obese
##    <fct>            <dbl> <chr>      <dbl>
##  1 Afghanistan        5.5 male         3.2
##  2 Afghanistan        5.5 female       7.6
##  3 Albania           21.7 male        21.6
##  4 Albania           21.7 female      21.8
##  5 Algeria           27.4 male        19.9
##  6 Algeria           27.4 female      34.9
##  7 Andorra           25.6 male        25.9
##  8 Andorra           25.6 female      25.3
##  9 Angola             8.2 male         4  
## 10 Angola             8.2 female      12.1
## # … with 370 more rows

Create a scatterplot with pct_obese per country colored by sex. The country variable has been ordered by overall obesity % and added for you.

obesity_df %>% 
  # Pivot the male and female columns
  pivot_longer(c(male, female),
               names_to = "sex",
               values_to = "pct_obese") %>% 
  # Create a scatter plot with pct_obese per country colored by sex
  ggplot(aes(x = pct_obese, color = sex,
             y = forcats::fct_reorder(country, both_sexes))) +
  geom_point() +
  scale_y_discrete(breaks = c("India", "Nauru", "Cuba", "Brazil",
                              "Pakistan", "Gabon", "Italy", "Oman",
                              "China", "United States of America")) +
  labs(x = "% Obese", y = "Country")

Bond… James Bond

You’ve been given a James Bond movie dataset (bond_df) and want to visualize the number of movies that Bond actors have featured in per decade. However, the data is in an untidy format with the decade values captured in the column headers. You’ll tidy this dataset to give each variable its own column.

bond_df <- read.delim("DATABASE/bond_df.txt", sep = ",") %>% select(., c(-"X"))
#nos importa los nombrs mal 


columnas <- names(bond_df)

#vemos cual es el tamaño de las columnas
length <- length(columnas)

for(i in 1:length){
  
  #seleccionamos la columna
  
  columna <- columnas[i]
  # print(columna)
  
  #ESTABLECEMOS LA CONDICION 
  
  CONDICION <- str_detect(columna, "^X")
  
 
  
  if(CONDICION == TRUE){
    
    #CORREGIMOS EL ERROR
    
    CORREGIDO <- str_extract_all(columnas[i], "[:digit:]+", simplify = TRUE)
    
    
     #reemplazamos el nombre de la columna mal escrita
    columnas[i] <- replace(columnas[i], CONDICION == TRUE, CORREGIDO)
    
    
    
    
    
  }else{
    
    next
  }
  


}
#implementamos los nombres bien
names(bond_df) <- columnas

#vemos el resultado
head(bond_df)

##             Bond 1960 1970 1980 1990 2000 2010 2020
## 1   Sean Connery    5    1   NA   NA   NA   NA   NA
## 2    David Niven    1   NA   NA   NA   NA   NA   NA
## 3 George Lazenby    1   NA   NA   NA   NA   NA   NA
## 4    Roger Moore   NA    4    3   NA   NA   NA   NA
## 5 Timothy Dalton   NA   NA    2   NA   NA   NA   NA
## 6 Pierce Brosnan   NA   NA   NA    3    1   NA   NA

The ggplot2 package has been pre-loaded for you.

bond_df %>% 
  # Pivot the data to long format and set the column names
  pivot_longer(-"Bond", 
  names_to = "decade", 
  values_to = "n_movies")

## # A tibble: 49 x 3
##    Bond         decade n_movies
##    <fct>        <chr>     <dbl>
##  1 Sean Connery 1960          5
##  2 Sean Connery 1970          1
##  3 Sean Connery 1980         NA
##  4 Sean Connery 1990         NA
##  5 Sean Connery 2000         NA
##  6 Sean Connery 2010         NA
##  7 Sean Connery 2020         NA
##  8 David Niven  1960          1
##  9 David Niven  1970         NA
## 10 David Niven  1980         NA
## # … with 39 more rows

Drop any NA values in the n_movies column while it is created.

bond_df %>% 
  # Pivot the data to long format
  pivot_longer(
    -Bond, 
    # Overwrite the names of the two newly created columns
    names_to = "decade", 
    values_to = "n_movies", 
    # Drop na values
    values_drop_na = TRUE
  )

## # A tibble: 12 x 3
##    Bond           decade n_movies
##    <fct>          <chr>     <dbl>
##  1 Sean Connery   1960          5
##  2 Sean Connery   1970          1
##  3 David Niven    1960          1
##  4 George Lazenby 1960          1
##  5 Roger Moore    1970          4
##  6 Roger Moore    1980          3
##  7 Timothy Dalton 1980          2
##  8 Pierce Brosnan 1990          3
##  9 Pierce Brosnan 2000          1
## 10 Daniel Craig   2000          2
## 11 Daniel Craig   2010          2
## 12 Daniel Craig   2020          1

Transform the decade column data type to integer.

  bond_df %>% 
  # Pivot the data to long format
  pivot_longer(
    -Bond, 
    # Overwrite the names of the two newly created columns
    names_to = "decade", 
    values_to = "n_movies", 
    # Drop na values
    values_drop_na = TRUE, 
    # Transform the decade column data type to integer
   names_transform = list(decade = as.integer)
  ) %>% 
  ggplot(aes(x = decade + 5, y = n_movies, fill = Bond))+
  geom_col()

Deriving variables from column headers

New-Zealand’s bird of the year

Every year New Zealanders vote en masse to decide which species gets the bird of the year trophy. The contest is organized by the Forest & Bird agency which allows each person to give points to up to five birds (first pick gets 5 points, second 4, …). Your job is to decide this year’s winner from the messy dataset that’s been pre-loaded for you as bird_df.

The dplyr package has been pre-loaded for you.

bird_df <- read.delim("DATABASE/bird_df.txt", sep = ",") %>% select(., c(-"X"))

head(bird_df)

##                    points_5            points_4           points_3
## 1       Yellow-eyed penguin                 Kea                Tūī
## 2       Yellow-eyed penguin              Kererū                Tūī
## 3 Fiordland Crested Penguin   Bar-tailed Godwit    Banded Dotterel
## 4                   Fantail                 Kea                Tūī
## 5                    Kākāpō Little Spotted Kiwi Rockhopper Penguin
## 6       Yellow-eyed penguin                <NA>               <NA>
##                   points_2            points_1
## 1 Southern Royal Albatross Southern Brown Kiwi
## 2               Saddleback  Rockhopper Penguin
## 3              Black Robin           Blue Duck
## 4                     Kākā              Kākāpō
## 5                   Kōkako        Shore Plover
## 6                     <NA>                <NA>

Inspect bird_df in the console. Pivot bird_df to longer format so that an integer column points and a character column species are created. Use the names_prefix argument to clean up the points column and make sure no NA values remain.

bird_df %>%
  # Pivot the data to create a two column data frame
  pivot_longer(
    starts_with("points_"),
    names_to = "points",
    names_prefix = "points_",
    names_transform = list(points = as.integer),
    values_to = "species",
    values_drop_na = TRUE
  )

## # A tibble: 46,808 x 2
##    points species                 
##     <int> <fct>                   
##  1      5 Yellow-eyed penguin     
##  2      4 Kea                     
##  3      3 Tūī                     
##  4      2 Southern Royal Albatross
##  5      1 Southern Brown Kiwi     
##  6      5 Yellow-eyed penguin     
##  7      4 Kererū                  
##  8      3 Tūī                     
##  9      2 Saddleback              
## 10      1 Rockhopper Penguin      
## # … with 46,798 more rows

Calculate the total_points each species got.

bird_df %>%
  # Pivot the data to create a 2 column data frame
  pivot_longer(
    starts_with("points_"),
    names_to = "points",
    names_prefix = "points_",
    names_transform = list(points = as.integer),
    values_to = "species",
    values_drop_na = TRUE
  ) %>%
  group_by(species) %>% 
  summarize(total_points = sum(points)) %>% 
  slice_max(total_points, n = 5)

## # A tibble: 5 x 2
##   species             total_points
##   <fct>                      <int>
## 1 Yellow-eyed penguin        13522
## 2 Kākāpō                      8017
## 3 Black Robin                 6542
## 4 Banded Dotterel             5988
## 5 Kākā                        5229

Big tech stock prices

You’re an analyst at an investment firm and want to visualize the weekly closing prices of five big tech firms’ stocks. However, the dataset you’ve been handed (stock_df) is messy and has the year and week variables stored in the column headers. You’ll pivot this data into a tidy format, extract the variables from the headers, and create a line plot.

stock_df <- read.delim("DATABASE/stock_df.txt", sep = ",") %>% select(., c(-"X"))

head(stock_df)

##     company X2019_week1 X2019_week2 X2019_week3 X2019_week4 X2019_week5
## 1    Amazon   1847.8400   1640.5601    1696.200     1670.57     1626.23
## 2     Apple     73.4125     38.0725      39.205       39.44       41.63
## 3  Facebook    205.2500    143.8000     150.040      149.01      165.71
## 4    Google   1337.0200   1057.1899    1098.260     1090.99     1110.75
## 5 Microsoft    157.7000    102.8000     107.710      107.17      102.78
##   X2019_week6 X2019_week7 X2019_week8 X2019_week9 X2019_week10 X2019_week11
## 1   1588.2200    1607.950   1631.5601   1671.7300    1620.8000      1712.36
## 2     42.6025      42.605     43.2425     43.7425      43.2275        46.53
## 3    167.3300     162.500    161.8900    162.2800     169.6000       165.98
## 4   1095.0601    1113.650   1110.3700   1140.9900    1142.3199      1184.46
## 5    105.6700     108.220    110.9700    112.5300     110.5100       115.91
##   X2019_week12 X2019_week13 X2019_week14 X2019_week15 X2019_week16 X2019_week17
## 1    1764.7700    1780.7500      1837.28    1843.0601     1861.690     1950.630
## 2      47.7625      47.4875        49.25      49.7175       50.965       51.075
## 3     164.3400     166.6900       175.72     179.1000      178.280      191.490
## 4    1205.5000    1173.3101      1207.15    1217.8700     1236.370     1272.180
## 5     117.0500     117.9400       119.89     120.9500      123.370      129.890
##   X2019_week18 X2019_week19 X2019_week20 X2019_week21 X2019_week22 X2019_week23
## 1    1962.4600     1889.980      1869.00    1823.2800    1775.0699    1804.0300
## 2      52.9375       49.295        47.25      44.7425      43.7675      47.5375
## 3     195.4700      188.340       185.30     181.0600     177.4700     173.3500
## 4    1185.4000     1164.270      1162.30    1133.4700    1103.6300    1066.0400
## 5     128.9000      127.130       128.07     126.2400     123.6800     131.4000
##   X2019_week24 X2019_week25 X2019_week26 X2019_week27 X2019_week28 X2019_week29
## 1     1869.670     1911.300      1893.63    1942.9100     2011.000    1964.5200
## 2       48.185       49.695        49.48      51.0575       50.825      50.6475
## 3      181.330      191.140       193.00     196.4000      204.870     198.3600
## 4     1085.350     1121.880      1080.91    1131.5900     1144.900    1130.1000
## 5      132.450      136.970       133.96     137.0600      138.900     136.6200
##   X2019_week30 X2019_week31 X2019_week32 X2019_week33 X2019_week34 X2019_week35
## 1     1943.050     1823.240    1807.5800     1792.570      1749.62     1776.290
## 2       51.935       51.005      50.2475       51.625        50.66       52.185
## 3      199.750      189.020     187.8500      183.700       177.75      185.670
## 4     1250.410     1193.990    1188.0100     1177.600      1151.29     1188.100
## 5      141.340      136.900     137.7100      136.130       133.39      137.860
##   X2019_week36 X2019_week37 X2019_week38 X2019_week39 X2019_week40 X2019_week41
## 1     1833.510    1839.3400    1794.1600     1725.450    1739.6500    1731.9200
## 2       53.315      54.6875      54.4325       54.705      56.7525      59.0525
## 3      187.490     187.1900     189.9300      177.100     180.4500     184.1900
## 4     1204.930    1239.5601    1229.9301     1225.090    1209.0000    1215.4500
## 5      139.100     137.3200     139.4400      137.730     138.1200     139.6800
##   X2019_week42 X2019_week43 X2019_week44 X2019_week45 X2019_week46 X2019_week47
## 1    1757.5100     1761.330     1791.440     1785.880      1739.49     1745.720
## 2      59.1025       61.645       63.955       65.035        66.44       65.445
## 3     185.8500      187.890      193.620      190.840       195.10      198.820
## 4    1245.4900     1265.130     1273.740     1311.370      1334.87     1295.340
## 5     137.4100      140.730      143.720      145.960       149.97      149.590
##   X2019_week48 X2019_week49 X2019_week50 X2019_week51 X2019_week52 X2020_week1
## 1    1800.8000    1751.6000    1760.9399      1786.50      1869.80   1874.9700
## 2      66.8125      67.6775      68.7875        69.86        72.45     74.3575
## 3     201.6400     201.0500     194.1100       206.30       208.10    208.6700
## 4    1304.9600    1340.6200    1347.8300      1349.59      1351.89   1360.6600
## 5     151.3800     151.7500     154.5300       157.41       158.96    158.6200
##   X2020_week2 X2020_week3 X2020_week4 X2020_week5 X2020_week6 X2020_week7
## 1   1883.1600   1864.7200   1861.6400   2008.7200   2079.2800   2134.8701
## 2     77.5825     79.6825     79.5775     77.3775     80.0075     81.2375
## 3    218.0600    222.1400    217.9400    201.9100    212.3300    214.1800
## 4   1429.7300   1480.3900   1466.7100   1434.2300   1479.2300   1520.7400
## 5    161.3400    167.1000    165.0400    170.2300    183.8900    185.3500
##   X2020_week8 X2020_week9 X2020_week10 X2020_week11 X2020_week12 X2020_week13
## 1   2095.9700     1883.75    1901.0900    1785.0000      1846.09     1900.100
## 2     78.2625       68.34      72.2575      69.4925        57.31       61.935
## 3    210.1800      192.47     181.0900     170.2800       149.73      156.790
## 4   1485.1100     1339.33    1298.4100    1219.7300      1072.32     1110.710
## 5    178.5900      162.01     161.5700     158.8300       137.35      149.700
##   X2020_week14 X2020_week15 X2020_week16 X2020_week17 X2020_week18 X2020_week19
## 1    1906.5900    2042.7600      2375.00    2410.2200    2286.0400    2379.6101
## 2      60.3525      66.9975        70.70      70.7425      72.2675      77.5325
## 3     154.1800     175.1900       179.24     190.0700     202.2700     212.3500
## 4    1097.8800    1211.4500      1283.25    1279.3101    1320.6100    1388.3700
## 5     153.8300     165.1400       178.60     174.5500     174.5700     184.6800
##   X2020_week20 X2020_week21 X2020_week22 X2020_week23 X2020_week24 X2020_week25
## 1    2409.7800    2436.8799     2442.370     2483.000      2545.02      2675.01
## 2      76.9275      79.7225       79.485       82.875        84.70        87.43
## 3     210.8800     234.9100      225.090      230.770       228.58       238.79
## 4    1373.1899    1410.4200     1428.920     1438.390      1413.18      1431.72
## 5     183.1600     183.5100      183.250      187.200       187.74       195.15
##   X2020_week26 X2020_week27 X2020_week28 X2020_week29 X2020_week30 X2020_week31
## 1    2692.8701    2890.3000      3200.00    2961.9700     3008.910      3164.68
## 2      88.4075      91.0275        95.92      96.3275       92.615       106.26
## 3     216.0800     233.4200       245.07     242.0300      230.710       253.67
## 4    1359.9000    1464.7000      1541.74    1515.5500     1511.870      1482.96
## 5     196.3300     206.2600       213.67     202.8800      201.300       205.01
##   X2020_week32 X2020_week33 X2020_week34 X2020_week35 X2020_week36 X2020_week37
## 1    3167.4600    3148.0200      3284.72    3401.8000      3294.62      3116.22
## 2     111.1125     114.9075       124.37     124.8075       120.96       112.00
## 3     268.4400     261.2400       267.01     293.6600       282.73       266.61
## 4    1494.4900    1507.7300      1580.42    1644.4100      1591.04      1520.72
## 5     212.4800     208.9000       213.02     228.9100       214.25       204.03
##   X2020_week38 X2020_week39 X2020_week40 X2020_week41 X2020_week42 X2020_week43
## 1      2954.91      3095.13      3125.00      3286.65      3272.71      3204.40
## 2       106.84       112.28       113.02       116.97       119.02       115.04
## 3       252.53       254.82       259.94       264.45       265.93       284.79
## 4      1459.99      1444.96      1458.42      1515.22      1573.01      1641.00
## 5       200.39       207.82       206.19       215.81       219.66       216.23
##   X2020_week44 X2020_week45 X2020_week46 X2020_week47 X2020_week48 X2020_week49
## 1      3036.15      3311.37      3128.81      3099.40      3195.34      3162.58
## 2       108.86       118.69       119.26       117.34       116.59       122.25
## 3       263.11       293.41       276.95       269.70       277.81       279.70
## 4      1621.01      1761.75      1777.02      1742.19      1793.19      1827.99
## 5       202.47       223.72       216.51       210.39       215.23       214.36
##   X2020_week50 X2020_week51 X2020_week52 X2020_week53
## 1      3116.42      3201.65      3172.69      3256.93
## 2       122.41       126.66       131.97       132.69
## 3       273.55       276.40       267.40       273.16
## 4      1781.77      1731.01      1738.85      1751.88
## 5       213.26       218.59       222.75       222.42

The ggplot2 package has been pre-loaded for you.

#nos importa los nombrs mal 


columnas <- names(stock_df)

#vemos cual es el tamaño de las columnas
length <- length(columnas)

for(i in 1:length){
  
  #seleccionamos la columna
  
  columna <- columnas[i]
  # print(columna)
  
  #ESTABLECEMOS LA CONDICION 
  
  CONDICION <- str_detect(columna, "^X")
  
 
  # print(CONDICION)
  if(CONDICION == TRUE){
    
    #CORREGIMOS EL ERROR
    
    CORREGIDO <- str_extract_all(columnas[i], "[:digit:]+\\_+[:alnum:]+", simplify = TRUE)
    # print(CORREGIDO)
    
     #reemplazamos el nombre de la columna mal escrita
     columnas[i] <- replace(columnas[i], CONDICION == TRUE, CORREGIDO)
      
    
    
    
  }else{
    
    next
  }
  


}
#implementamos los nombres bien
names(stock_df) <- columnas

#vemos el resultado 

head(stock_df)

##     company 2019_week1 2019_week2 2019_week3 2019_week4 2019_week5 2019_week6
## 1    Amazon  1847.8400  1640.5601   1696.200    1670.57    1626.23  1588.2200
## 2     Apple    73.4125    38.0725     39.205      39.44      41.63    42.6025
## 3  Facebook   205.2500   143.8000    150.040     149.01     165.71   167.3300
## 4    Google  1337.0200  1057.1899   1098.260    1090.99    1110.75  1095.0601
## 5 Microsoft   157.7000   102.8000    107.710     107.17     102.78   105.6700
##   2019_week7 2019_week8 2019_week9 2019_week10 2019_week11 2019_week12
## 1   1607.950  1631.5601  1671.7300   1620.8000     1712.36   1764.7700
## 2     42.605    43.2425    43.7425     43.2275       46.53     47.7625
## 3    162.500   161.8900   162.2800    169.6000      165.98    164.3400
## 4   1113.650  1110.3700  1140.9900   1142.3199     1184.46   1205.5000
## 5    108.220   110.9700   112.5300    110.5100      115.91    117.0500
##   2019_week13 2019_week14 2019_week15 2019_week16 2019_week17 2019_week18
## 1   1780.7500     1837.28   1843.0601    1861.690    1950.630   1962.4600
## 2     47.4875       49.25     49.7175      50.965      51.075     52.9375
## 3    166.6900      175.72    179.1000     178.280     191.490    195.4700
## 4   1173.3101     1207.15   1217.8700    1236.370    1272.180   1185.4000
## 5    117.9400      119.89    120.9500     123.370     129.890    128.9000
##   2019_week19 2019_week20 2019_week21 2019_week22 2019_week23 2019_week24
## 1    1889.980     1869.00   1823.2800   1775.0699   1804.0300    1869.670
## 2      49.295       47.25     44.7425     43.7675     47.5375      48.185
## 3     188.340      185.30    181.0600    177.4700    173.3500     181.330
## 4    1164.270     1162.30   1133.4700   1103.6300   1066.0400    1085.350
## 5     127.130      128.07    126.2400    123.6800    131.4000     132.450
##   2019_week25 2019_week26 2019_week27 2019_week28 2019_week29 2019_week30
## 1    1911.300     1893.63   1942.9100    2011.000   1964.5200    1943.050
## 2      49.695       49.48     51.0575      50.825     50.6475      51.935
## 3     191.140      193.00    196.4000     204.870    198.3600     199.750
## 4    1121.880     1080.91   1131.5900    1144.900   1130.1000    1250.410
## 5     136.970      133.96    137.0600     138.900    136.6200     141.340
##   2019_week31 2019_week32 2019_week33 2019_week34 2019_week35 2019_week36
## 1    1823.240   1807.5800    1792.570     1749.62    1776.290    1833.510
## 2      51.005     50.2475      51.625       50.66      52.185      53.315
## 3     189.020    187.8500     183.700      177.75     185.670     187.490
## 4    1193.990   1188.0100    1177.600     1151.29    1188.100    1204.930
## 5     136.900    137.7100     136.130      133.39     137.860     139.100
##   2019_week37 2019_week38 2019_week39 2019_week40 2019_week41 2019_week42
## 1   1839.3400   1794.1600    1725.450   1739.6500   1731.9200   1757.5100
## 2     54.6875     54.4325      54.705     56.7525     59.0525     59.1025
## 3    187.1900    189.9300     177.100    180.4500    184.1900    185.8500
## 4   1239.5601   1229.9301    1225.090   1209.0000   1215.4500   1245.4900
## 5    137.3200    139.4400     137.730    138.1200    139.6800    137.4100
##   2019_week43 2019_week44 2019_week45 2019_week46 2019_week47 2019_week48
## 1    1761.330    1791.440    1785.880     1739.49    1745.720   1800.8000
## 2      61.645      63.955      65.035       66.44      65.445     66.8125
## 3     187.890     193.620     190.840      195.10     198.820    201.6400
## 4    1265.130    1273.740    1311.370     1334.87    1295.340   1304.9600
## 5     140.730     143.720     145.960      149.97     149.590    151.3800
##   2019_week49 2019_week50 2019_week51 2019_week52 2020_week1 2020_week2
## 1   1751.6000   1760.9399     1786.50     1869.80  1874.9700  1883.1600
## 2     67.6775     68.7875       69.86       72.45    74.3575    77.5825
## 3    201.0500    194.1100      206.30      208.10   208.6700   218.0600
## 4   1340.6200   1347.8300     1349.59     1351.89  1360.6600  1429.7300
## 5    151.7500    154.5300      157.41      158.96   158.6200   161.3400
##   2020_week3 2020_week4 2020_week5 2020_week6 2020_week7 2020_week8 2020_week9
## 1  1864.7200  1861.6400  2008.7200  2079.2800  2134.8701  2095.9700    1883.75
## 2    79.6825    79.5775    77.3775    80.0075    81.2375    78.2625      68.34
## 3   222.1400   217.9400   201.9100   212.3300   214.1800   210.1800     192.47
## 4  1480.3900  1466.7100  1434.2300  1479.2300  1520.7400  1485.1100    1339.33
## 5   167.1000   165.0400   170.2300   183.8900   185.3500   178.5900     162.01
##   2020_week10 2020_week11 2020_week12 2020_week13 2020_week14 2020_week15
## 1   1901.0900   1785.0000     1846.09    1900.100   1906.5900   2042.7600
## 2     72.2575     69.4925       57.31      61.935     60.3525     66.9975
## 3    181.0900    170.2800      149.73     156.790    154.1800    175.1900
## 4   1298.4100   1219.7300     1072.32    1110.710   1097.8800   1211.4500
## 5    161.5700    158.8300      137.35     149.700    153.8300    165.1400
##   2020_week16 2020_week17 2020_week18 2020_week19 2020_week20 2020_week21
## 1     2375.00   2410.2200   2286.0400   2379.6101   2409.7800   2436.8799
## 2       70.70     70.7425     72.2675     77.5325     76.9275     79.7225
## 3      179.24    190.0700    202.2700    212.3500    210.8800    234.9100
## 4     1283.25   1279.3101   1320.6100   1388.3700   1373.1899   1410.4200
## 5      178.60    174.5500    174.5700    184.6800    183.1600    183.5100
##   2020_week22 2020_week23 2020_week24 2020_week25 2020_week26 2020_week27
## 1    2442.370    2483.000     2545.02     2675.01   2692.8701   2890.3000
## 2      79.485      82.875       84.70       87.43     88.4075     91.0275
## 3     225.090     230.770      228.58      238.79    216.0800    233.4200
## 4    1428.920    1438.390     1413.18     1431.72   1359.9000   1464.7000
## 5     183.250     187.200      187.74      195.15    196.3300    206.2600
##   2020_week28 2020_week29 2020_week30 2020_week31 2020_week32 2020_week33
## 1     3200.00   2961.9700    3008.910     3164.68   3167.4600   3148.0200
## 2       95.92     96.3275      92.615      106.26    111.1125    114.9075
## 3      245.07    242.0300     230.710      253.67    268.4400    261.2400
## 4     1541.74   1515.5500    1511.870     1482.96   1494.4900   1507.7300
## 5      213.67    202.8800     201.300      205.01    212.4800    208.9000
##   2020_week34 2020_week35 2020_week36 2020_week37 2020_week38 2020_week39
## 1     3284.72   3401.8000     3294.62     3116.22     2954.91     3095.13
## 2      124.37    124.8075      120.96      112.00      106.84      112.28
## 3      267.01    293.6600      282.73      266.61      252.53      254.82
## 4     1580.42   1644.4100     1591.04     1520.72     1459.99     1444.96
## 5      213.02    228.9100      214.25      204.03      200.39      207.82
##   2020_week40 2020_week41 2020_week42 2020_week43 2020_week44 2020_week45
## 1     3125.00     3286.65     3272.71     3204.40     3036.15     3311.37
## 2      113.02      116.97      119.02      115.04      108.86      118.69
## 3      259.94      264.45      265.93      284.79      263.11      293.41
## 4     1458.42     1515.22     1573.01     1641.00     1621.01     1761.75
## 5      206.19      215.81      219.66      216.23      202.47      223.72
##   2020_week46 2020_week47 2020_week48 2020_week49 2020_week50 2020_week51
## 1     3128.81     3099.40     3195.34     3162.58     3116.42     3201.65
## 2      119.26      117.34      116.59      122.25      122.41      126.66
## 3      276.95      269.70      277.81      279.70      273.55      276.40
## 4     1777.02     1742.19     1793.19     1827.99     1781.77     1731.01
## 5      216.51      210.39      215.23      214.36      213.26      218.59
##   2020_week52 2020_week53
## 1     3172.69     3256.93
## 2      131.97      132.69
## 3      267.40      273.16
## 4     1738.85     1751.88
## 5      222.75      222.42

Inspect stock_df in the console. Pivot stock_df so that the integer columns year and week are created from the column names and the original values are moved to the price column. Use the names_sep argument to separate the column names.

stock_df %>% 
  # Pivot the data to create 3 new columns: year, week, price
  pivot_longer(
    -company,
    names_to = c("year", "week"),
    values_to = "price",
    names_sep = "_week",
    names_transform = list(
      year = as.integer,
      week = as.integer)
  )

## # A tibble: 525 x 4
##    company  year  week price
##    <fct>   <int> <int> <dbl>
##  1 Amazon   2019     1 1848.
##  2 Amazon   2019     2 1641.
##  3 Amazon   2019     3 1696.
##  4 Amazon   2019     4 1671.
##  5 Amazon   2019     5 1626.
##  6 Amazon   2019     6 1588.
##  7 Amazon   2019     7 1608.
##  8 Amazon   2019     8 1632.
##  9 Amazon   2019     9 1672.
## 10 Amazon   2019    10 1621.
## # … with 515 more rows

Create a line plot where the price is shown per week and color by company. The year variable has been dealt with for you.

stock_df %>% 
  # Pivot the data to create 3 new columns: year, week, price
  pivot_longer(
    -company,
    names_to = c("year", "week"),
    values_to = "price",
    names_sep = "_week",
    names_transform = list(
      year = as.integer,
      week = as.integer)
  ) %>%
  # Create a line plot with price per week, color by company
  ggplot(aes(week, price,  color = company)) +
  geom_line() +
  facet_grid(. ~ year)

Deriving variables from complex column headers

Soviet space dogs, the dog perspective

You’ll be working on an pre-processed sample of the USSR space dogs database compiled by Duncan Geere and pre-loaded for you as space_dogs_df. Each of the 42 rows in this dataset represents a test rocket launch which had one or two very brave dogs on board.

Your goal is to reshape this dataset so that for each launch, each dog has a row.

The challenge is that in the column headers (name_1, name_2, gender_1, and gender_2), the part before the _ separator can point to two different variables (name and gender), while the second part always points to the dog ID (1st or 2nd dog).

space_dogs_df <- read.delim("DATABASE/space_dogs_df.txt", sep = ",") %>% select(., c(-"X"))

head(space_dogs_df)

##         date                                                        result
## 1 1966-02-22                        recovered safely after a 22-day flight
## 2 1961-03-25                                   one orbit, recovered safely
## 3 1961-03-09                                   one orbit, recovered safely
## 4 1960-12-22  upper stage failed, dogs recovered after a suborbital flight
## 5 1960-12-01 spent one day in orbit, capsule was destroyed, both dogs died
## 6 1960-09-22                                              recovered safely
##                  name_1            name_2 gender_1 gender_2
## 1     Ugolyok / Snezhok Veterok / Bzdunok     Male     Male
## 2            Zvezdochka              <NA>   Female     <NA>
## 3            Chernuskha              <NA>   Female     <NA>
## 4                Shutka           Kometka   Female   Female
## 5                Mushka          Pchyolka   Female   Female
## 6 Kusachka / Otvazhnaya              Neva   Female   Female

As the first argument to pivot_longer(), pass the columns to pivot (name_1, name_2, gender_1, and gender_2). Complete the names_to argument so that the first part of the column headers are reused. Make sure NA values are dropped since not all rockets had two dogs.

space_dogs_df %>% 
  pivot_longer(
    # Add the columns to pivot
    name_1:gender_2,
    names_sep = "_",
    # Complete the names_to argument to re-use the first part of the column headers
    names_to = c(".value", "dog_id"),
    # Make sure NA values are dropped
    values_drop_na = TRUE
  )

## # A tibble: 81 x 5
##    date     result                                  dog_id name           gender
##    <fct>    <fct>                                   <chr>  <fct>          <fct> 
##  1 1966-02… recovered safely after a 22-day flight  1      Ugolyok / Sne… Male  
##  2 1966-02… recovered safely after a 22-day flight  2      Veterok / Bzd… Male  
##  3 1961-03… one orbit, recovered safely             1      Zvezdochka     Female
##  4 1961-03… one orbit, recovered safely             1      Chernuskha     Female
##  5 1960-12… upper stage failed, dogs recovered aft… 1      Shutka         Female
##  6 1960-12… upper stage failed, dogs recovered aft… 2      Kometka        Female
##  7 1960-12… spent one day in orbit, capsule was de… 1      Mushka         Female
##  8 1960-12… spent one day in orbit, capsule was de… 2      Pchyolka       Female
##  9 1960-09… recovered safely                        1      Kusachka / Ot… Female
## 10 1960-09… recovered safely                        2      Neva           Female
## # … with 71 more rows

WHO obesity vs. life expectancy

You’ve been given a sample of WHO data (who_df) with obesity percentages and life expectancy data per country, year, and sex. You want to visually inspect the correlation between obesity and life expectancy.

However, the data is very messy with four variables hidden in the column names. Each column name is made up of three parts separated by underscores: Values for the year, followed by those for sex, and then values for either pct.obese or life.exp. Since the third part of the column name string holds two variables you’ll need to use the special “.value” value in the names_to argument.

You’ll pivot the data into a tidy format and create the scatterplot.

who_df <- read.delim("DATABASE/who_df.txt", sep = ",") %>% select(., c(-"X"))

head(who_df)

##               country X2000_male_pct.obese X2000_male_life.exp
## 1         Afghanistan                  1.2                54.6
## 2             Albania                 11.7                70.4
## 3             Algeria                 10.3                69.4
## 4              Angola                  1.4                45.2
## 5 Antigua and Barbuda                  6.8                69.7
## 6           Argentina                 19.1                70.2
##   X2000_female_pct.obese X2000_female_life.exp X2001_male_pct.obese
## 1                    3.4                  57.3                  1.3
## 2                   14.7                  76.0                 12.1
## 3                   24.2                  72.2                 10.8
## 4                    5.6                  49.5                  1.5
## 5                   18.2                  75.1                  7.1
## 6                   22.1                  77.7                 19.6
##   X2001_male_life.exp X2001_female_pct.obese X2001_female_life.exp
## 1                55.3                    3.6                  57.9
## 2                71.6                   15.0                  76.7
## 3                69.9                   24.9                  72.5
## 4                46.0                    5.9                  50.4
## 5                70.3                   18.6                  75.5
## 6                70.3                   22.5                  77.7
##   X2002_male_pct.obese X2002_male_life.exp X2002_female_pct.obese
## 1                  1.3                56.5                    3.8
## 2                 12.6                71.4                   15.4
## 3                 11.3                70.4                   25.5
## 4                  1.6                47.3                    6.3
## 5                  7.3                71.1                   19.1
## 6                 20.0                70.4                   22.9
##   X2002_female_life.exp X2003_male_pct.obese X2003_male_life.exp
## 1                  58.5                  1.4                56.9
## 2                  75.9                 13.2                70.8
## 3                  73.0                 11.8                70.7
## 4                  51.5                  1.7                48.4
## 5                  75.7                  7.6                71.4
## 6                  78.0                 20.5                70.6
##   X2003_female_pct.obese X2003_female_life.exp X2004_male_pct.obese
## 1                    4.0                  59.1                  1.5
## 2                   15.8                  75.2                 13.7
## 3                   26.2                  73.3                 12.3
## 4                    6.6                  52.6                  1.9
## 5                   19.5                  75.2                  7.8
## 6                   23.3                  77.9                 21.0
##   X2004_male_life.exp X2004_female_pct.obese X2004_female_life.exp
## 1                57.4                    4.2                  59.6
## 2                71.0                   16.2                  75.8
## 3                71.6                   26.8                  74.0
## 4                49.4                    6.9                  53.7
## 5                71.5                   19.9                  75.1
## 6                71.4                   23.7                  78.3
##   X2005_male_pct.obese X2005_male_life.exp X2005_female_pct.obese
## 1                  1.6                57.8                    4.4
## 2                 14.3                71.6                   16.6
## 3                 12.9                72.3                   27.5
## 4                  2.0                50.6                    7.3
## 5                  8.1                71.4                   20.4
## 6                 21.5                71.6                   24.2
##   X2005_female_life.exp X2006_male_pct.obese X2006_male_life.exp
## 1                  60.1                  1.7                58.0
## 2                  75.6                 14.9                71.8
## 3                  74.6                 13.4                72.9
## 4                  54.8                  2.1                51.7
## 5                  75.3                  8.4                71.3
## 6                  78.6                 22.0                71.9
##   X2006_female_pct.obese X2006_female_life.exp X2007_male_pct.obese
## 1                    4.7                  60.5                  1.8
## 2                   17.1                  76.2                 15.5
## 3                   28.2                  75.0                 14.0
## 4                    7.7                  56.0                  2.3
## 5                   20.8                  75.6                  8.6
## 6                   24.6                  78.9                 22.5
##   X2007_male_life.exp X2007_female_pct.obese X2007_female_life.exp
## 1                58.4                    4.9                  61.0
## 2                71.9                   17.5                  76.4
## 3                73.4                   28.8                  75.5
## 4                52.8                    8.0                  57.2
## 5                71.3                   21.3                  76.2
## 6                71.5                   25.0                  78.5
##   X2008_male_pct.obese X2008_male_life.exp X2008_female_pct.obese
## 1                  2.0                59.0                    5.2
## 2                 16.1                72.1                   18.0
## 3                 14.6                73.8                   29.5
## 4                  2.4                53.9                    8.4
## 5                  8.9                71.6                   21.7
## 6                 23.0                72.0                   25.4
##   X2008_female_life.exp X2009_male_pct.obese X2009_male_life.exp
## 1                  61.6                  2.1                59.5
## 2                  76.5                 16.8                72.3
## 3                  75.8                 15.2                74.1
## 4                  58.4                  2.6                55.0
## 5                  76.3                  9.2                72.1
## 6                  79.2                 23.6                72.3
##   X2009_female_pct.obese X2009_female_life.exp X2010_male_pct.obese
## 1                    5.4                  62.1                  2.2
## 2                   18.4                  76.7                 17.4
## 3                   30.2                  76.1                 15.8
## 4                    8.8                  59.6                  2.8
## 5                   22.2                  76.7                  9.5
## 6                   25.9                  79.2                 24.1
##   X2010_male_life.exp X2010_female_pct.obese X2010_female_life.exp
## 1                59.9                    5.7                  62.5
## 2                72.5                   18.9                  77.0
## 3                74.4                   30.9                  76.4
## 4                56.1                    9.3                  60.7
## 5                72.4                   22.7                  76.9
## 6                72.3                   26.3                  79.0
##   X2011_male_pct.obese X2011_male_life.exp X2011_female_pct.obese
## 1                  2.4                60.5                    6.0
## 2                 18.1                72.6                   19.4
## 3                 16.5                74.6                   31.5
## 4                  2.9                57.2                    9.7
## 5                  9.8                72.6                   23.2
## 6                 24.6                72.5                   26.7
##   X2011_female_life.exp X2012_male_pct.obese X2012_male_life.exp
## 1                  63.1                  2.5                60.9
## 2                  78.2                 18.8                73.0
## 3                  76.6                 17.1                74.8
## 4                  61.8                  3.1                58.1
## 5                  77.2                 10.1                72.5
## 6                  79.3                 25.1                72.7
##   X2012_female_pct.obese X2012_female_life.exp X2013_male_pct.obese
## 1                    6.3                  63.6                  2.7
## 2                   19.9                  77.7                 19.5
## 3                   32.2                  76.8                 17.8
## 4                   10.2                  62.6                  3.3
## 5                   23.7                  77.0                 10.5
## 6                   27.2                  79.7                 25.7
##   X2013_male_life.exp X2013_female_pct.obese X2013_female_life.exp
## 1                61.5                    6.6                  64.1
## 2                73.6                   20.4                  78.1
## 3                75.0                   32.9                  76.9
## 4                58.8                   10.6                  63.3
## 5                72.5                   24.2                  77.0
## 6                72.8                   27.6                  79.7
##   X2014_male_pct.obese X2014_male_life.exp X2014_female_pct.obese
## 1                  2.8                61.7                    7.0
## 2                 20.2                74.0                   20.8
## 3                 18.5                75.1                   33.6
## 4                  3.6                59.4                   11.1
## 5                 10.8                72.1                   24.8
## 6                 26.2                73.2                   28.1
##   X2014_female_life.exp X2015_male_pct.obese X2015_male_life.exp
## 1                  64.4                  3.0                61.8
## 2                  78.5                 20.9                74.2
## 3                  77.1                 19.2                75.3
## 4                  64.0                  3.8                59.9
## 5                  77.1                 11.2                72.5
## 6                  80.0                 26.8                73.3
##   X2015_female_pct.obese X2015_female_life.exp X2016_male_pct.obese
## 1                    7.3                  64.7                  3.2
## 2                   21.3                  78.2                 21.6
## 3                   34.2                  77.2                 19.9
## 4                   11.6                  64.5                  4.0
## 5                   25.3                  77.4                 11.6
## 6                   28.5                  80.1                 27.3
##   X2016_male_life.exp X2016_female_pct.obese X2016_female_life.exp
## 1                61.0                    7.6                  64.5
## 2                74.3                   21.8                  78.6
## 3                75.4                   34.9                  77.4
## 4                60.3                   12.1                  64.9
## 5                72.5                   25.9                  77.5
## 6                73.5                   29.0                  80.3

#nos importa los nombrs mal 


columnas <- names(who_df)

#vemos cual es el tamaño de las columnas
length <- length(columnas)

for(i in 1:length){
  
  #seleccionamos la columna
  
  columna <- columnas[i]
  # print(columna)
  
  #ESTABLECEMOS LA CONDICION 
  
  CONDICION <- str_detect(columna, "^X")
  
 
  
  if(CONDICION == TRUE){
    
    #CORREGIMOS EL ERROR
    
    CORREGIDO <- str_extract_all(columnas[i], "[:digit:]+\\_+[:alpha:]+\\_[:alpha:]+\\.+[:alpha:]+", simplify = TRUE)
    
    
     #reemplazamos el nombre de la columna mal escrita
     columnas[i] <- replace(columnas[i], CONDICION == TRUE, CORREGIDO)
    
    
    
    
    
  }else{
    
    next
  }
  


}
#implementamos los nombres bien
names(who_df) <- columnas

#vemos el resultado

head(who_df)

##               country 2000_male_pct.obese 2000_male_life.exp
## 1         Afghanistan                 1.2               54.6
## 2             Albania                11.7               70.4
## 3             Algeria                10.3               69.4
## 4              Angola                 1.4               45.2
## 5 Antigua and Barbuda                 6.8               69.7
## 6           Argentina                19.1               70.2
##   2000_female_pct.obese 2000_female_life.exp 2001_male_pct.obese
## 1                   3.4                 57.3                 1.3
## 2                  14.7                 76.0                12.1
## 3                  24.2                 72.2                10.8
## 4                   5.6                 49.5                 1.5
## 5                  18.2                 75.1                 7.1
## 6                  22.1                 77.7                19.6
##   2001_male_life.exp 2001_female_pct.obese 2001_female_life.exp
## 1               55.3                   3.6                 57.9
## 2               71.6                  15.0                 76.7
## 3               69.9                  24.9                 72.5
## 4               46.0                   5.9                 50.4
## 5               70.3                  18.6                 75.5
## 6               70.3                  22.5                 77.7
##   2002_male_pct.obese 2002_male_life.exp 2002_female_pct.obese
## 1                 1.3               56.5                   3.8
## 2                12.6               71.4                  15.4
## 3                11.3               70.4                  25.5
## 4                 1.6               47.3                   6.3
## 5                 7.3               71.1                  19.1
## 6                20.0               70.4                  22.9
##   2002_female_life.exp 2003_male_pct.obese 2003_male_life.exp
## 1                 58.5                 1.4               56.9
## 2                 75.9                13.2               70.8
## 3                 73.0                11.8               70.7
## 4                 51.5                 1.7               48.4
## 5                 75.7                 7.6               71.4
## 6                 78.0                20.5               70.6
##   2003_female_pct.obese 2003_female_life.exp 2004_male_pct.obese
## 1                   4.0                 59.1                 1.5
## 2                  15.8                 75.2                13.7
## 3                  26.2                 73.3                12.3
## 4                   6.6                 52.6                 1.9
## 5                  19.5                 75.2                 7.8
## 6                  23.3                 77.9                21.0
##   2004_male_life.exp 2004_female_pct.obese 2004_female_life.exp
## 1               57.4                   4.2                 59.6
## 2               71.0                  16.2                 75.8
## 3               71.6                  26.8                 74.0
## 4               49.4                   6.9                 53.7
## 5               71.5                  19.9                 75.1
## 6               71.4                  23.7                 78.3
##   2005_male_pct.obese 2005_male_life.exp 2005_female_pct.obese
## 1                 1.6               57.8                   4.4
## 2                14.3               71.6                  16.6
## 3                12.9               72.3                  27.5
## 4                 2.0               50.6                   7.3
## 5                 8.1               71.4                  20.4
## 6                21.5               71.6                  24.2
##   2005_female_life.exp 2006_male_pct.obese 2006_male_life.exp
## 1                 60.1                 1.7               58.0
## 2                 75.6                14.9               71.8
## 3                 74.6                13.4               72.9
## 4                 54.8                 2.1               51.7
## 5                 75.3                 8.4               71.3
## 6                 78.6                22.0               71.9
##   2006_female_pct.obese 2006_female_life.exp 2007_male_pct.obese
## 1                   4.7                 60.5                 1.8
## 2                  17.1                 76.2                15.5
## 3                  28.2                 75.0                14.0
## 4                   7.7                 56.0                 2.3
## 5                  20.8                 75.6                 8.6
## 6                  24.6                 78.9                22.5
##   2007_male_life.exp 2007_female_pct.obese 2007_female_life.exp
## 1               58.4                   4.9                 61.0
## 2               71.9                  17.5                 76.4
## 3               73.4                  28.8                 75.5
## 4               52.8                   8.0                 57.2
## 5               71.3                  21.3                 76.2
## 6               71.5                  25.0                 78.5
##   2008_male_pct.obese 2008_male_life.exp 2008_female_pct.obese
## 1                 2.0               59.0                   5.2
## 2                16.1               72.1                  18.0
## 3                14.6               73.8                  29.5
## 4                 2.4               53.9                   8.4
## 5                 8.9               71.6                  21.7
## 6                23.0               72.0                  25.4
##   2008_female_life.exp 2009_male_pct.obese 2009_male_life.exp
## 1                 61.6                 2.1               59.5
## 2                 76.5                16.8               72.3
## 3                 75.8                15.2               74.1
## 4                 58.4                 2.6               55.0
## 5                 76.3                 9.2               72.1
## 6                 79.2                23.6               72.3
##   2009_female_pct.obese 2009_female_life.exp 2010_male_pct.obese
## 1                   5.4                 62.1                 2.2
## 2                  18.4                 76.7                17.4
## 3                  30.2                 76.1                15.8
## 4                   8.8                 59.6                 2.8
## 5                  22.2                 76.7                 9.5
## 6                  25.9                 79.2                24.1
##   2010_male_life.exp 2010_female_pct.obese 2010_female_life.exp
## 1               59.9                   5.7                 62.5
## 2               72.5                  18.9                 77.0
## 3               74.4                  30.9                 76.4
## 4               56.1                   9.3                 60.7
## 5               72.4                  22.7                 76.9
## 6               72.3                  26.3                 79.0
##   2011_male_pct.obese 2011_male_life.exp 2011_female_pct.obese
## 1                 2.4               60.5                   6.0
## 2                18.1               72.6                  19.4
## 3                16.5               74.6                  31.5
## 4                 2.9               57.2                   9.7
## 5                 9.8               72.6                  23.2
## 6                24.6               72.5                  26.7
##   2011_female_life.exp 2012_male_pct.obese 2012_male_life.exp
## 1                 63.1                 2.5               60.9
## 2                 78.2                18.8               73.0
## 3                 76.6                17.1               74.8
## 4                 61.8                 3.1               58.1
## 5                 77.2                10.1               72.5
## 6                 79.3                25.1               72.7
##   2012_female_pct.obese 2012_female_life.exp 2013_male_pct.obese
## 1                   6.3                 63.6                 2.7
## 2                  19.9                 77.7                19.5
## 3                  32.2                 76.8                17.8
## 4                  10.2                 62.6                 3.3
## 5                  23.7                 77.0                10.5
## 6                  27.2                 79.7                25.7
##   2013_male_life.exp 2013_female_pct.obese 2013_female_life.exp
## 1               61.5                   6.6                 64.1
## 2               73.6                  20.4                 78.1
## 3               75.0                  32.9                 76.9
## 4               58.8                  10.6                 63.3
## 5               72.5                  24.2                 77.0
## 6               72.8                  27.6                 79.7
##   2014_male_pct.obese 2014_male_life.exp 2014_female_pct.obese
## 1                 2.8               61.7                   7.0
## 2                20.2               74.0                  20.8
## 3                18.5               75.1                  33.6
## 4                 3.6               59.4                  11.1
## 5                10.8               72.1                  24.8
## 6                26.2               73.2                  28.1
##   2014_female_life.exp 2015_male_pct.obese 2015_male_life.exp
## 1                 64.4                 3.0               61.8
## 2                 78.5                20.9               74.2
## 3                 77.1                19.2               75.3
## 4                 64.0                 3.8               59.9
## 5                 77.1                11.2               72.5
## 6                 80.0                26.8               73.3
##   2015_female_pct.obese 2015_female_life.exp 2016_male_pct.obese
## 1                   7.3                 64.7                 3.2
## 2                  21.3                 78.2                21.6
## 3                  34.2                 77.2                19.9
## 4                  11.6                 64.5                 4.0
## 5                  25.3                 77.4                11.6
## 6                  28.5                 80.1                27.3
##   2016_male_life.exp 2016_female_pct.obese 2016_female_life.exp
## 1               61.0                   7.6                 64.5
## 2               74.3                  21.8                 78.6
## 3               75.4                  34.9                 77.4
## 4               60.3                  12.1                 64.9
## 5               72.5                  25.9                 77.5
## 6               73.5                  29.0                 80.3

The ggplot2 package has been pre-loaded for you.

who_df %>% 
  # Put each variable in its own column
  pivot_longer(
    -country,
    names_to = c("year", "sex", ".value"),
    names_sep = "_", 
    names_transform = list("year" = as.integer)
  )

## # A tibble: 6,120 x 5
##    country      year sex    pct.obese life.exp
##    <fct>       <int> <chr>      <dbl>    <dbl>
##  1 Afghanistan  2000 male         1.2     54.6
##  2 Afghanistan  2000 female       3.4     57.3
##  3 Afghanistan  2001 male         1.3     55.3
##  4 Afghanistan  2001 female       3.6     57.9
##  5 Afghanistan  2002 male         1.3     56.5
##  6 Afghanistan  2002 female       3.8     58.5
##  7 Afghanistan  2003 male         1.4     56.9
##  8 Afghanistan  2003 female       4       59.1
##  9 Afghanistan  2004 male         1.5     57.4
## 10 Afghanistan  2004 female       4.2     59.6
## # … with 6,110 more rows

Use ggplot() to create a scatterplot with life.exp over pct.obese. Color the points by sex.

who_df %>% 
  # Put each variable in its own column
  pivot_longer(
    -country,
    names_to = c("year", "sex", ".value"),
    names_sep = "_", 
    names_transform = list("year" = as.integer)
  ) %>%
  # Create a plot with life expectancy over obesity
  ggplot(.,aes(pct.obese,life.exp , color = sex))+
  geom_point()

Uncounting observations

You’ve found the job of your dreams providing technical support for a dog breed beauty contest. The jury members want a spreadsheet with the breed and id of each participating dog so that they can add the scores later on. You’ve only been given the number of participants per dog breed (dog_df) so you decide to use your tidyr skills to create the desired result.

dog_df <- read.delim("DATABASE/dog_df.txt", sep = ",") %>% select(., c(-"X"))

head(dog_df)

##              breed n_participants
## 1            Husky              5
## 2 Golden retriever             12
## 3           Poodle             13
## 4        Shiba Inu              7

Inspect the data in the console. Uncount the data so that per breed, each dog gets a row and an ID. The ID should go in the dog_id column.

dog_df %>% 
  # Create one row for each participant and add the id
  uncount(n_participants, .id = "dog_id")

##                 breed dog_id
## 1               Husky      1
## 1.1             Husky      2
## 1.2             Husky      3
## 1.3             Husky      4
## 1.4             Husky      5
## 2    Golden retriever      1
## 2.1  Golden retriever      2
## 2.2  Golden retriever      3
## 2.3  Golden retriever      4
## 2.4  Golden retriever      5
## 2.5  Golden retriever      6
## 2.6  Golden retriever      7
## 2.7  Golden retriever      8
## 2.8  Golden retriever      9
## 2.9  Golden retriever     10
## 2.10 Golden retriever     11
## 2.11 Golden retriever     12
## 3              Poodle      1
## 3.1            Poodle      2
## 3.2            Poodle      3
## 3.3            Poodle      4
## 3.4            Poodle      5
## 3.5            Poodle      6
## 3.6            Poodle      7
## 3.7            Poodle      8
## 3.8            Poodle      9
## 3.9            Poodle     10
## 3.10           Poodle     11
## 3.11           Poodle     12
## 3.12           Poodle     13
## 4           Shiba Inu      1
## 4.1         Shiba Inu      2
## 4.2         Shiba Inu      3
## 4.3         Shiba Inu      4
## 4.4         Shiba Inu      5
## 4.5         Shiba Inu      6
## 4.6         Shiba Inu      7

From long to wide data

Soviet space dogs, the flight perspective

Remember the USSR space dogs dataset1? You changed it to a long format so that for every dog in every rocket launch, there was a row. Suppose you’re given this tidy dataset and are asked to answer the question, “In what percentage of flights were both dogs of the same gender?”

You’ll reshape and investigate space_dogs_df to find the answer.

The dplyr package has been pre-loaded for you.

1 Compiled by Duncan Geere.

space_dogs_df <- read.delim("DATABASE/space_dogs_df2.txt", sep = ",") %>% select(., c(-"X"))

head(space_dogs_df)

##         date gender dog_id
## 1 1951-07-22   Male      1
## 2 1951-07-22   Male      2
## 3 1951-07-29   Male      1
## 4 1951-07-29 Female      2
## 5 1951-08-15   Male      1
## 6 1951-08-15   Male      2

Pivot the data to a wider format, deriving new column names from the dog_id column and values from the gender column.

space_dogs_df %>% 
  # Pivot the data to a wider format
  pivot_wider(names_from = dog_id, values_from = gender, names_prefix = "gender_")

## # A tibble: 42 x 3
##    date       gender_1 gender_2
##    <fct>      <fct>    <fct>   
##  1 1951-07-22 Male     Male    
##  2 1951-07-29 Male     Female  
##  3 1951-08-15 Male     Male    
##  4 1951-08-19 Male     Male    
##  5 1951-08-28 Male     Male    
##  6 1951-09-03 Male     Male    
##  7 1951-06-26 Female   Male    
##  8 1954-07-02 Female   Male    
##  9 1954-07-07 Female   Male    
## 10 1954-07-26 Female   Male    
## # … with 32 more rows

Drop rows that contain NA values.

space_dogs_df %>% 
  # Pivot the data to a wider format
  pivot_wider(names_from = dog_id, values_from = gender, names_prefix = "gender_") %>% 
  # Drop rows with NA values
  drop_na()

## # A tibble: 39 x 3
##    date       gender_1 gender_2
##    <fct>      <fct>    <fct>   
##  1 1951-07-22 Male     Male    
##  2 1951-07-29 Male     Female  
##  3 1951-08-15 Male     Male    
##  4 1951-08-19 Male     Male    
##  5 1951-08-28 Male     Male    
##  6 1951-09-03 Male     Male    
##  7 1951-06-26 Female   Male    
##  8 1954-07-02 Female   Male    
##  9 1954-07-07 Female   Male    
## 10 1954-07-26 Female   Male    
## # … with 29 more rows

Create a new column same_gender, which has a TRUE value when gender_1 equals gender_2.

space_dogs_df %>% 
  # Pivot the data to a wider format
  pivot_wider(names_from = dog_id, values_from = gender, names_prefix = "gender_") %>% 
  # Drop rows with NA values
  drop_na() %>% 
  # Create a Boolean column on whether both dogs have the same gender
  mutate(
    same_gender = ifelse(gender_1 == gender_2, TRUE, 0)
  ) %>% 
  summarize(pct_same_gender = mean(same_gender))

## # A tibble: 1 x 1
##   pct_same_gender
##             <dbl>
## 1           0.795

Planet temperature & distance to the Sun

The intensity of light radiated by a light source follows an inverse square relationship with the distance it has traveled.

planet_df <- read.delim("DATABASE/planet_df.txt", sep = ",") %>% select(., c(-"X"))

head(planet_df)

##    planet          metric   value
## 1 Mercury        diameter  4879.0
## 2 Mercury distance_to_sun    57.9
## 3 Mercury     temperature   167.0
## 4   Venus        diameter 12104.0
## 5   Venus distance_to_sun   108.2
## 6   Venus     temperature   464.0

You wonder if you could observe this trend in the temperature of the planets in our Solar System given their distance to the Sun. You’ll use the planet_df dataset from the devstronomy project to investigate this.

planet_df %>% 
  # Give each planet variable its own column
  pivot_wider(

    names_from = metric, 

    values_from = value


  )

## # A tibble: 8 x 4
##   planet  diameter distance_to_sun temperature
##   <fct>      <dbl>           <dbl>       <dbl>
## 1 Mercury     4879            57.9         167
## 2 Venus      12104           108.          464
## 3 Earth      12756           150.           15
## 4 Mars        6792           228.          -65
## 5 Jupiter   142984           779.         -110
## 6 Saturn    120536          1434.         -140
## 7 Uranus     51118          2872.         -195
## 8 Neptune    49528          4495.         -200

Use the ggplot() function to create a plot with the temperature over the distance_to_sun.

planet_df %>% 
  # Give each planet variable its own column
  pivot_wider(names_from = "metric", values_from = "value") %>% 
  # Plot planet temperature over distance to sun
  ggplot(., aes(temperature, distance_to_sun)) +
  geom_point(aes(size = diameter)) +
  geom_text(aes(label = planet), vjust = -1) +
  labs(x = "Distance to sun (million km)", 
       y = "Mean temperature (°C)") +
  theme(legend.position = "none")

### Transposing planet data You’re again working on a planet dataset derived from the devstronomy project. This time, you’re interested in the correlation between the diameter of a planet and the number of moons circling it.

However, the dataset (planet_df) has a row for each variable and a column for each planet (observation). You’ll transpose this data in two steps and then create a plot to inspect the correlation.

The ggplot2 package has been pre-loaded for you.

planet_df <- read.delim("DATABASE/planet_df2.txt", sep = ",") %>% select(., c(-"X"))

#isolamos la primera columna que está bien
metric <- planet_df %>% select(., c(metric))

#quitamos la columna para que no nos estorbe

planet_df <- planet_df %>% select(., c(-"metric"))

#sacamos las repeticiones
length <- length(planet_df)


for(i in 1:1){
#extraemos el nombre de LA COLUMNA
nombre_columna <- names(planet_df[i])



#extraemos los valosres de la columna en cuestion
columna <- planet_df[,c(i)] %>% as.data.frame() %>%
  #y le ponemos su nombre
  setnames(., nombre_columna)



  
#quitamos la columna extraida para no causar confusiones
planet_df <- planet_df[,c(-i)]

#juntamos a su dataframe original 

planet_df <- bind_cols(planet_df, columna)



}


#finalmente le juntamos la columna primeramente extraida

planet_df <- bind_cols(planet_df, metric)



  #borramos el dataframe creado en balde

rm(metric)

rm(columna)

#vemos el aspecto del dataframe
head(planet_df)

##      Venus    Earth     Mars  Jupiter   Saturn  Uranus Neptune Mercury
## 1     4.87     5.97    0.642   1898.0    568.0    86.8   102.0    0.33
## 2 12104.00 12756.00 6792.000 142984.0 120536.0 51118.0 49528.0 4879.00
## 3   108.20   149.60  227.900    778.6   1433.5  2872.5  4495.1   57.90
## 4     8.90     9.80    3.700     23.1      9.0     8.7    11.0    3.70
## 5     0.00     1.00    2.000     79.0     62.0    27.0    14.0    0.00
##              metric
## 1              mass
## 2          diameter
## 3 distance_from_sun
## 4           gravity
## 5   number_of_moons

Inspect planet_df in the console. Pivot the data so that planet names are put in a column named planet.

planet_df %>%
  # Pivot all columns except metric to long format
  pivot_longer(-metric, names_to = "planet")

## # A tibble: 40 x 3
##    metric   planet      value
##    <fct>    <chr>       <dbl>
##  1 mass     Venus       4.87 
##  2 mass     Earth       5.97 
##  3 mass     Mars        0.642
##  4 mass     Jupiter  1898    
##  5 mass     Saturn    568    
##  6 mass     Uranus     86.8  
##  7 mass     Neptune   102    
##  8 mass     Mercury     0.33 
##  9 diameter Venus   12104    
## 10 diameter Earth   12756    
## # … with 30 more rows

Pivot the data so that each variable in the metric column gets its own column.

planet_df %>%
  # Pivot all columns except metric to long format
  pivot_longer(-metric, names_to = "planet") %>% 
  # Put each metric in its own column
  pivot_wider(
    names_from = metric, 
    values_from = value
  )

## # A tibble: 8 x 6
##   planet      mass diameter distance_from_sun gravity number_of_moons
##   <chr>      <dbl>    <dbl>             <dbl>   <dbl>           <dbl>
## 1 Venus      4.87     12104             108.      8.9               0
## 2 Earth      5.97     12756             150.      9.8               1
## 3 Mars       0.642     6792             228.      3.7               2
## 4 Jupiter 1898       142984             779.     23.1              79
## 5 Saturn   568       120536            1434.      9                62
## 6 Uranus    86.8      51118            2872.      8.7              27
## 7 Neptune  102        49528            4495.     11                14
## 8 Mercury    0.33      4879              57.9     3.7               0

Use the ggplot() function to create a plot with the number_of_moons over diameter.

planet_df %>%
  # Pivot all columns except metric to long format
  pivot_longer(-metric, names_to = "planet") %>% 
  # Put each metric in its own column
  pivot_wider(names_from = metric, values_from = value) %>% 
  # Plot the number of moons vs planet diameter
  ggplot(., aes(x = diameter, y = number_of_moons)) +
  geom_point(aes(size = diameter)) +
  geom_text(aes(label = planet), vjust = -1) +
  labs(x = "Diameter (km)", y = "Number of moons") +
  theme(legend.position = "none")

Expanding data

Creating unique combinations of vectors

Letters of the genetic code

The basic building blocks of RNA are four molecules described by a single letter each: adenine (A), cytosine (C), guanine (G), and uracil (U). The information carried by an RNA strand can be represented as a long sequence of these four letters. To read this code, one has to divide this chain into sequences of three letters each (e.g. GCU, ACG, …). These three letter sequences are known as codons. The concept is illustrated in the image below.

Your goal for this exercise is to create a data frame with all possible three letter sequences (codons) from a vector with the four letters representing the RNA building blocks.

Create a tibble with three columns called letter1, letter2, and letter3 that holds all possible combinations of the vector letters using expand_grid().

letters <- c("A", "C", "G", "U")

# Create a tibble with all possible 3 way combinations
codon_df <- expand_grid(
  letter1 = letters,
  letter2 = letters,
  letter3 = letters
)

codon_df

## # A tibble: 64 x 3
##    letter1 letter2 letter3
##    <chr>   <chr>   <chr>  
##  1 A       A       A      
##  2 A       A       C      
##  3 A       A       G      
##  4 A       A       U      
##  5 A       C       A      
##  6 A       C       C      
##  7 A       C       G      
##  8 A       C       U      
##  9 A       G       A      
## 10 A       G       C      
## # … with 54 more rows

Use the unite() function from chapter one to merge these three columns into a single column named codon. Use an empty string as the separator.

codon_df %>% 
  # Unite these three columns into a "codon" column
  unite("codon", letter1, letter2, letter3, sep = "")

## # A tibble: 64 x 1
##    codon
##    <chr>
##  1 AAA  
##  2 AAC  
##  3 AAG  
##  4 AAU  
##  5 ACA  
##  6 ACC  
##  7 ACG  
##  8 ACU  
##  9 AGA  
## 10 AGC  
## # … with 54 more rows

When did humans replace dogs in space?

You already know that in the early days of spaceflight, the USSR was testing rockets with dogs. You now wonder when exactly humans started replacing dogs on space flight missions. You’ve been given a dataset space_df with the number of both dogs (compiled by Duncan Geere) and humans in space per year from 1951 till 1970 (collected from Wikipedia).

Your goal is to create a plot that shows you the number of individuals sent into space per species. Before you can create this plot, you’ll first have to introduce zero values for missing combinations of year and species.

space_df <- read.delim("DATABASE/space_df.txt", sep = ",") %>% select(., c(-"X"))
head(space_df)

##   year species n_in_space
## 1 1951     Dog         14
## 2 1954     Dog          6
## 3 1955     Dog          6
## 4 1956     Dog          8
## 5 1957     Dog         11
## 6 1958     Dog         12

The dplyr and ggplot2 packages have been pre-loaded for you.

Create full_df, a tibble with all unique combinations of the variables year (from 1951 to 1970) and species (“Human” and “Dog”).

# Create a tibble with all combinations of years and species
full_df <- expand_grid(

    year = 1951:1970, 
    species = c("Human", "Dog")

)

full_df

## # A tibble: 40 x 2
##     year species
##    <int> <chr>  
##  1  1951 Human  
##  2  1951 Dog    
##  3  1952 Human  
##  4  1952 Dog    
##  5  1953 Human  
##  6  1953 Dog    
##  7  1954 Human  
##  8  1954 Dog    
##  9  1955 Human  
## 10  1955 Dog    
## # … with 30 more rows

Perform a right_join() between space_df and full_df on the year and species columns.

space_df %>% 
  # Join with full_df so that missing values are introduced
  right_join(full_df, by = c("year", "species")) %>% 
  arrange(year)

##    year species n_in_space
## 1  1951     Dog         14
## 2  1951   Human         NA
## 3  1952   Human         NA
## 4  1952     Dog         NA
## 5  1953   Human         NA
## 6  1953     Dog         NA
## 7  1954     Dog          6
## 8  1954   Human         NA
## 9  1955     Dog          6
## 10 1955   Human         NA
## 11 1956     Dog          8
## 12 1956   Human         NA
## 13 1957     Dog         11
## 14 1957   Human         NA
## 15 1958     Dog         12
## 16 1958   Human         NA
## 17 1959     Dog          4
## 18 1959   Human         NA
## 19 1960     Dog         16
## 20 1960   Human         NA
## 21 1961     Dog          2
## 22 1961   Human          4
## 23 1962   Human          5
## 24 1962     Dog         NA
## 25 1963   Human          5
## 26 1963     Dog         NA
## 27 1964   Human          3
## 28 1964     Dog         NA
## 29 1965   Human         12
## 30 1965     Dog         NA
## 31 1966     Dog          2
## 32 1966   Human         10
## 33 1967   Human          1
## 34 1967     Dog         NA
## 35 1968   Human          7
## 36 1968     Dog         NA
## 37 1969   Human         23
## 38 1969     Dog         NA
## 39 1970   Human          5
## 40 1970     Dog         NA

Use the ggplot() function to create a line plot of n_in_space over year, colored by species.

space_df %>% 
  # Join with full_df so that missing values are introduced
  right_join(full_df, by = c("year", "species")) %>% 
  # Create a line plot with n_in_space over year, color by species
  ggplot(., aes(year, n_in_space, color = species)) +
  geom_line()

## Warning: Removed 14 row(s) containing missing values (geom_path).

Use the replace_na() function to overwrite NA values in the n_in_space column with zeros.

space_df %>% 
  # Join with full_df so that missing values are introduced
  right_join(full_df, by = c("year", "species")) %>% 
  # Overwrite NA values for n_in_space with 0L
  replace_na(list(n_in_space = 0L)) %>% 
  # Create a line plot with n_in_space over year, color by species
  ggplot(aes(x = year, y = n_in_space, color = species)) +
  geom_line()

Finding missing observations

You’re an inspector at a nuclear plant and have to validate whether every reactor has received its daily safety check over the course of a full year. The safety check logs are in reactor_df, a data frame with columns date, reactor, and check.

Two vectors, dates and reactors, with all dates of the year and reactors at the plant respectively have been created for you. You’ll use the combination of the expand_grid() and anti_join() functions to find dates where particular reactors were not checked.

The dplyr package has been pre-loaded for you.

reactor_df <- read.delim("DATABASE/reactor_df.txt", sep = ",") %>% select(., c(-"X"))
head(reactor_df)

##         date reactor checked
## 1 1986-01-01       A    TRUE
## 2 1986-01-01       B    TRUE
## 3 1986-01-01       C    TRUE
## 4 1986-01-01       D    TRUE
## 5 1986-01-02       A    TRUE
## 6 1986-01-02       B    TRUE

dates <- read.delim("DATABASE/dates.txt", sep = ",", header = FALSE)

head(dates)

##                                                             V1
## 1   [1] 1986-01-01 1986-01-02 1986-01-03 1986-01-04 1986-01-05
## 2   [6] 1986-01-06 1986-01-07 1986-01-08 1986-01-09 1986-01-10
## 3  [11] 1986-01-11 1986-01-12 1986-01-13 1986-01-14 1986-01-15
## 4  [16] 1986-01-16 1986-01-17 1986-01-18 1986-01-19 1986-01-20
## 5  [21] 1986-01-21 1986-01-22 1986-01-23 1986-01-24 1986-01-25
## 6  [26] 1986-01-26 1986-01-27 1986-01-28 1986-01-29 1986-01-30

reactors <- read.delim("DATABASE/reactors.txt", sep = ",", header = FALSE)

head(reactors)

##        V1
## 1 A B C D

Use the expand_grid() function to create a tibble holding all combinations of the variables date and reactor. Use the dates and reactors vectors created for you.

# Create a tibble with all combinations of dates and reactors
# full_df <- expand_grid(
# 
#   date = dates, 
#   reactor = reactors
# 
# )

full_df <- read.delim("DATABASE/full_df.txt", sep = ",", header = FALSE)

head(full_df)

##   V1         V2      V3
## 1 NA       date reactor
## 2  1 1986-01-01       A
## 3  2 1986-01-01       B
## 4  3 1986-01-01       C
## 5  4 1986-01-01       D
## 6  5 1986-01-02       A

Perform an anti-join between full_df and reactor_df on the date and reactor columns.

# full_df %>% 
#   anti_join(
#     reactor_df, 
#     by = c("date", "reactor")
#   )

Completing data with all value combinations

Completing the Solar System

You have been given a data frame (planet_df) from the devstronomy project with the number of moons per planet in our Solar System. However, Mercury and Venus, the two moonless planets, are absent. You want to expand this dataset using the complete() function and a vector planets that contains all eight planet’s names.

planet_df <- read.delim("DATABASE/planet_df3.txt", sep = ",", header = FALSE)

head(planet_df)

##   V1      V2      V3
## 1 NA  planet n_moons
## 2  1   Earth       1
## 3  2    Mars       2
## 4  3 Jupiter      79
## 5  4  Saturn      62
## 6  5  Uranus      27

planets = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")

Complete the planet variable using the planets vector. Replace NA values in the n_moons variable with 0L values.

# planet_df %>% 
#   complete(.,
#     # Complete the planet variable
#     planet = planets
#     # Overwrite NA values for n_moons with 0L
#     fill = list(n_moons = 0L)
#   )


planet_completed <- read.delim("DATABASE/planet_complete.txt", sep = ",", header = FALSE)

head(planet_completed)

##   V1      V2      V3
## 1 NA  planet n_moons
## 2  1   Earth       1
## 3  2 Jupiter      79
## 4  3    Mars       2
## 5  4 Mercury       0
## 6  5 Neptune      14

Zero Olympic medals

Since 1896, athletes from all over the world have been competing in the modern Olympic games. You’ve been given a dataset (medal_df) with observations for all medals won by athletes from the 10 most successful countries in Olympic history. You want to create a visual with the number of medals won per country (team) per year. However, since not all countries won medals each year, you’ll have to introduce zero values before you can make an accurate visual.

The ggplot2 and dplyr packages have been pre-loaded for you. In step 2 and 3 the scale_color_brewer() function is used to color lines in the plot with a palette that makes it easier to distinguish the different countries.

medal_df <- read.delim("DATABASE/medal_df.txt", sep = ",")

head(medal_df)

##   X                      name      sport year  medal          team
## 1 1             Edward Battel    Cycling 1896 Bronze Great Britain
## 2 2      Arthur Charles Blake  Athletics 1896 Silver United States
## 3 3 Conrad Helmut Fritz Bcker Gymnastics 1896   Gold       Germany
## 4 4 Conrad Helmut Fritz Bcker Gymnastics 1896   Gold       Germany
## 5 5     John Mary Pius Boland     Tennis 1896   Gold Great Britain
## 6 6       Thomas Edmund Burke  Athletics 1896   Gold United States

Inspect medal_df in the console. Count the number of medals won per team and year.

medal_df %>% 
  # Count the medals won per team and year
  count(team, year, name = "n_medals")

##              team year n_medals
## 1       Australia 1896        2
## 2       Australia 1900        5
## 3       Australia 1904        4
## 4       Australia 1906        3
## 5       Australia 1920        7
## 6       Australia 1924       10
## 7       Australia 1928        4
## 8       Australia 1932        5
## 9       Australia 1936        1
## 10      Australia 1948       16
## 11      Australia 1952       20
## 12      Australia 1956       62
## 13      Australia 1960       46
## 14      Australia 1964       41
## 15      Australia 1968       51
## 16      Australia 1972       20
## 17      Australia 1976       23
## 18      Australia 1980       13
## 19      Australia 1984       52
## 20      Australia 1988       35
## 21      Australia 1992       57
## 22      Australia 1996      130
## 23      Australia 2000      181
## 24      Australia 2004      157
## 25      Australia 2008      149
## 26      Australia 2012      114
## 27      Australia 2016       82
## 28         France 1896       11
## 29         France 1900       75
## 30         France 1904        1
## 31         France 1906       45
## 32         France 1908       36
## 33         France 1912       23
## 34         France 1920      134
## 35         France 1924      101
## 36         France 1928       44
## 37         France 1932       40
## 38         France 1936       44
## 39         France 1948       72
## 40         France 1952       40
## 41         France 1956       33
## 42         France 1960       12
## 43         France 1964       31
## 44         France 1968       27
## 45         France 1972       22
## 46         France 1976       20
## 47         France 1980       29
## 48         France 1984       67
## 49         France 1988       29
## 50         France 1992       55
## 51         France 1996       49
## 52         France 2000       64
## 53         France 2004       53
## 54         France 2008       77
## 55         France 2012       78
## 56         France 2016       96
## 57        Germany 1896       31
## 58        Germany 1900        2
## 59        Germany 1904       15
## 60        Germany 1906       30
## 61        Germany 1908       19
## 62        Germany 1912       39
## 63        Germany 1928       77
## 64        Germany 1932       43
## 65        Germany 1936      215
## 66        Germany 1952       37
## 67        Germany 1956       52
## 68        Germany 1960       86
## 69        Germany 1964      113
## 70        Germany 1992      198
## 71        Germany 1996      122
## 72        Germany 2000      116
## 73        Germany 2004      147
## 74        Germany 2008       96
## 75        Germany 2012       92
## 76        Germany 2016      157
## 77  Great Britain 1896        7
## 78  Great Britain 1900       22
## 79  Great Britain 1904        2
## 80  Great Britain 1906       31
## 81  Great Britain 1908      167
## 82  Great Britain 1912      128
## 83  Great Britain 1920      101
## 84  Great Britain 1924       55
## 85  Great Britain 1928       58
## 86  Great Britain 1932       33
## 87  Great Britain 1936       30
## 88  Great Britain 1948       59
## 89  Great Britain 1952       31
## 90  Great Britain 1956       37
## 91  Great Britain 1960       28
## 92  Great Britain 1964       26
## 93  Great Britain 1968       19
## 94  Great Britain 1972       29
## 95  Great Britain 1976       32
## 96  Great Britain 1980       47
## 97  Great Britain 1984       71
## 98  Great Britain 1988       54
## 99  Great Britain 1992       50
## 100 Great Britain 1996       26
## 101 Great Britain 2000       52
## 102 Great Britain 2004       55
## 103 Great Britain 2008       81
## 104 Great Britain 2012      122
## 105 Great Britain 2016      145
## 106       Hungary 1896        6
## 107       Hungary 1900        5
## 108       Hungary 1904        4
## 109       Hungary 1906       13
## 110       Hungary 1908       20
## 111       Hungary 1912       30
## 112       Hungary 1924       22
## 113       Hungary 1928       21
## 114       Hungary 1932       33
## 115       Hungary 1936       41
## 116       Hungary 1948       62
## 117       Hungary 1952      102
## 118       Hungary 1956       64
## 119       Hungary 1960       66
## 120       Hungary 1964       56
## 121       Hungary 1968       81
## 122       Hungary 1972       81
## 123       Hungary 1976       55
## 124       Hungary 1980       61
## 125       Hungary 1988       44
## 126       Hungary 1992       45
## 127       Hungary 1996       43
## 128       Hungary 2000       53
## 129       Hungary 2004       40
## 130       Hungary 2008       27
## 131       Hungary 2012       26
## 132       Hungary 2016       22
## 133         Italy 1900        5
## 134         Italy 1906        9
## 135         Italy 1908        8
## 136         Italy 1912       25
## 137         Italy 1920       84
## 138         Italy 1924       51
## 139         Italy 1928       72
## 140         Italy 1932       77
## 141         Italy 1936       70
## 142         Italy 1948       67
## 143         Italy 1952       52
## 144         Italy 1956       45
## 145         Italy 1960       85
## 146         Italy 1964       51
## 147         Italy 1968       33
## 148         Italy 1972       25
## 149         Italy 1976       31
## 150         Italy 1980       37
## 151         Italy 1984       63
## 152         Italy 1988       29
## 153         Italy 1992       45
## 154         Italy 1996       71
## 155         Italy 2000       65
## 156         Italy 2004      104
## 157         Italy 2008       42
## 158         Italy 2012       68
## 159         Italy 2016       70
## 160        Russia 1908        3
## 161        Russia 1912        7
## 162        Russia 1996      115
## 163        Russia 2000      187
## 164        Russia 2004      189
## 165        Russia 2008      142
## 166        Russia 2012      138
## 167        Russia 2016      113
## 168  Soviet Union 1952      117
## 169  Soviet Union 1956      169
## 170  Soviet Union 1960      167
## 171  Soviet Union 1964      174
## 172  Soviet Union 1968      192
## 173  Soviet Union 1972      214
## 174  Soviet Union 1976      286
## 175  Soviet Union 1980      442
## 176  Soviet Union 1988      300
## 177        Sweden 1900        1
## 178        Sweden 1906       21
## 179        Sweden 1908       79
## 180        Sweden 1912      153
## 181        Sweden 1920      143
## 182        Sweden 1924       72
## 183        Sweden 1928       31
## 184        Sweden 1932       26
## 185        Sweden 1936       28
## 186        Sweden 1948       67
## 187        Sweden 1952       64
## 188        Sweden 1956       28
## 189        Sweden 1960        7
## 190        Sweden 1964       10
## 191        Sweden 1968        9
## 192        Sweden 1972       22
## 193        Sweden 1976       10
## 194        Sweden 1980       20
## 195        Sweden 1984       34
## 196        Sweden 1988       16
## 197        Sweden 1992       35
## 198        Sweden 1996       31
## 199        Sweden 2000       32
## 200        Sweden 2004       12
## 201        Sweden 2008        5
## 202        Sweden 2012       22
## 203        Sweden 2016       28
## 204 United States 1896       20
## 205 United States 1900       45
## 206 United States 1904      199
## 207 United States 1906       24
## 208 United States 1908       65
## 209 United States 1912      107
## 210 United States 1920      194
## 211 United States 1924      174
## 212 United States 1928       88
## 213 United States 1932      170
## 214 United States 1936       96
## 215 United States 1948      143
## 216 United States 1952      122
## 217 United States 1956      121
## 218 United States 1960      120
## 219 United States 1964      159
## 220 United States 1968      166
## 221 United States 1972      171
## 222 United States 1976      164
## 223 United States 1984      352
## 224 United States 1988      207
## 225 United States 1992      222
## 226 United States 1996      255
## 227 United States 2000      240
## 228 United States 2004      259
## 229 United States 2008      309
## 230 United States 2012      238
## 231 United States 2016      256

Use ggplot() to create a line plot with n_medals over year, colored by team.

medal_df %>% 
  # Count the medals won per team and year
  count(team, year, name = "n_medals") %>% 
  # Plot n_medals over year, colored by team
  ggplot(., aes(year, n_medals, color = team)) +
  geom_line() +
  scale_color_brewer(palette = "Paired")

Complete the team and year variables, replace NA values in the n_medals column with zeros.

medal_df %>% 
  # Count the medals won per team and year
  count(team, year, name = "n_medals") %>% 
  # Complete the team and year variables, fill n_medals with zeros
  complete(team, year, fill = list(n_medals = 0)) %>% 
  # Plot n_medals over year, colored by team
  ggplot(aes(x = year, y = n_medals, color = team)) +
  geom_line() +
  scale_color_brewer(palette = "Paired")

Creating a sequence with full_seq()

The full_seq() function will look for the minimal and maximal values inside the vector you pass it and will then generate a full sequence of numbers with a fixed period in between them. When used inside the complete() function, full_seq() is a handy tool to make sure there are no missing observations in your data. Before combining these two functions you’ll generate a few sequences with full_seq() on its own to get the hang of this function.

Use full_seq() to create a sequence with all years from 2020 till 2030.

# Generate all years from 2020 to 2030
years <- full_seq(2020:2030, period = 1)
years

##  [1] 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030

Use full_seq() to create a sequence with all decades from 1980 till 2030.

# Generate all decades from 1980 to 2030
decades <- full_seq(c(1980, 2030), period = 10)
decades

## [1] 1980 1990 2000 2010 2020 2030

Use full_seq() to create a sequence with all dates in 1980 using the outer_dates vector.

outer_dates <- c(as.Date("1980-01-01"), as.Date("1980-12-31"))

# Generate the dates for all days in 1980
full_seq(outer_dates, period = 1)

##   [1] "1980-01-01" "1980-01-02" "1980-01-03" "1980-01-04" "1980-01-05"
##   [6] "1980-01-06" "1980-01-07" "1980-01-08" "1980-01-09" "1980-01-10"
##  [11] "1980-01-11" "1980-01-12" "1980-01-13" "1980-01-14" "1980-01-15"
##  [16] "1980-01-16" "1980-01-17" "1980-01-18" "1980-01-19" "1980-01-20"
##  [21] "1980-01-21" "1980-01-22" "1980-01-23" "1980-01-24" "1980-01-25"
##  [26] "1980-01-26" "1980-01-27" "1980-01-28" "1980-01-29" "1980-01-30"
##  [31] "1980-01-31" "1980-02-01" "1980-02-02" "1980-02-03" "1980-02-04"
##  [36] "1980-02-05" "1980-02-06" "1980-02-07" "1980-02-08" "1980-02-09"
##  [41] "1980-02-10" "1980-02-11" "1980-02-12" "1980-02-13" "1980-02-14"
##  [46] "1980-02-15" "1980-02-16" "1980-02-17" "1980-02-18" "1980-02-19"
##  [51] "1980-02-20" "1980-02-21" "1980-02-22" "1980-02-23" "1980-02-24"
##  [56] "1980-02-25" "1980-02-26" "1980-02-27" "1980-02-28" "1980-02-29"
##  [61] "1980-03-01" "1980-03-02" "1980-03-03" "1980-03-04" "1980-03-05"
##  [66] "1980-03-06" "1980-03-07" "1980-03-08" "1980-03-09" "1980-03-10"
##  [71] "1980-03-11" "1980-03-12" "1980-03-13" "1980-03-14" "1980-03-15"
##  [76] "1980-03-16" "1980-03-17" "1980-03-18" "1980-03-19" "1980-03-20"
##  [81] "1980-03-21" "1980-03-22" "1980-03-23" "1980-03-24" "1980-03-25"
##  [86] "1980-03-26" "1980-03-27" "1980-03-28" "1980-03-29" "1980-03-30"
##  [91] "1980-03-31" "1980-04-01" "1980-04-02" "1980-04-03" "1980-04-04"
##  [96] "1980-04-05" "1980-04-06" "1980-04-07" "1980-04-08" "1980-04-09"
## [101] "1980-04-10" "1980-04-11" "1980-04-12" "1980-04-13" "1980-04-14"
## [106] "1980-04-15" "1980-04-16" "1980-04-17" "1980-04-18" "1980-04-19"
## [111] "1980-04-20" "1980-04-21" "1980-04-22" "1980-04-23" "1980-04-24"
## [116] "1980-04-25" "1980-04-26" "1980-04-27" "1980-04-28" "1980-04-29"
## [121] "1980-04-30" "1980-05-01" "1980-05-02" "1980-05-03" "1980-05-04"
## [126] "1980-05-05" "1980-05-06" "1980-05-07" "1980-05-08" "1980-05-09"
## [131] "1980-05-10" "1980-05-11" "1980-05-12" "1980-05-13" "1980-05-14"
## [136] "1980-05-15" "1980-05-16" "1980-05-17" "1980-05-18" "1980-05-19"
## [141] "1980-05-20" "1980-05-21" "1980-05-22" "1980-05-23" "1980-05-24"
## [146] "1980-05-25" "1980-05-26" "1980-05-27" "1980-05-28" "1980-05-29"
## [151] "1980-05-30" "1980-05-31" "1980-06-01" "1980-06-02" "1980-06-03"
## [156] "1980-06-04" "1980-06-05" "1980-06-06" "1980-06-07" "1980-06-08"
## [161] "1980-06-09" "1980-06-10" "1980-06-11" "1980-06-12" "1980-06-13"
## [166] "1980-06-14" "1980-06-15" "1980-06-16" "1980-06-17" "1980-06-18"
## [171] "1980-06-19" "1980-06-20" "1980-06-21" "1980-06-22" "1980-06-23"
## [176] "1980-06-24" "1980-06-25" "1980-06-26" "1980-06-27" "1980-06-28"
## [181] "1980-06-29" "1980-06-30" "1980-07-01" "1980-07-02" "1980-07-03"
## [186] "1980-07-04" "1980-07-05" "1980-07-06" "1980-07-07" "1980-07-08"
## [191] "1980-07-09" "1980-07-10" "1980-07-11" "1980-07-12" "1980-07-13"
## [196] "1980-07-14" "1980-07-15" "1980-07-16" "1980-07-17" "1980-07-18"
## [201] "1980-07-19" "1980-07-20" "1980-07-21" "1980-07-22" "1980-07-23"
## [206] "1980-07-24" "1980-07-25" "1980-07-26" "1980-07-27" "1980-07-28"
## [211] "1980-07-29" "1980-07-30" "1980-07-31" "1980-08-01" "1980-08-02"
## [216] "1980-08-03" "1980-08-04" "1980-08-05" "1980-08-06" "1980-08-07"
## [221] "1980-08-08" "1980-08-09" "1980-08-10" "1980-08-11" "1980-08-12"
## [226] "1980-08-13" "1980-08-14" "1980-08-15" "1980-08-16" "1980-08-17"
## [231] "1980-08-18" "1980-08-19" "1980-08-20" "1980-08-21" "1980-08-22"
## [236] "1980-08-23" "1980-08-24" "1980-08-25" "1980-08-26" "1980-08-27"
## [241] "1980-08-28" "1980-08-29" "1980-08-30" "1980-08-31" "1980-09-01"
## [246] "1980-09-02" "1980-09-03" "1980-09-04" "1980-09-05" "1980-09-06"
## [251] "1980-09-07" "1980-09-08" "1980-09-09" "1980-09-10" "1980-09-11"
## [256] "1980-09-12" "1980-09-13" "1980-09-14" "1980-09-15" "1980-09-16"
## [261] "1980-09-17" "1980-09-18" "1980-09-19" "1980-09-20" "1980-09-21"
## [266] "1980-09-22" "1980-09-23" "1980-09-24" "1980-09-25" "1980-09-26"
## [271] "1980-09-27" "1980-09-28" "1980-09-29" "1980-09-30" "1980-10-01"
## [276] "1980-10-02" "1980-10-03" "1980-10-04" "1980-10-05" "1980-10-06"
## [281] "1980-10-07" "1980-10-08" "1980-10-09" "1980-10-10" "1980-10-11"
## [286] "1980-10-12" "1980-10-13" "1980-10-14" "1980-10-15" "1980-10-16"
## [291] "1980-10-17" "1980-10-18" "1980-10-19" "1980-10-20" "1980-10-21"
## [296] "1980-10-22" "1980-10-23" "1980-10-24" "1980-10-25" "1980-10-26"
## [301] "1980-10-27" "1980-10-28" "1980-10-29" "1980-10-30" "1980-10-31"
## [306] "1980-11-01" "1980-11-02" "1980-11-03" "1980-11-04" "1980-11-05"
## [311] "1980-11-06" "1980-11-07" "1980-11-08" "1980-11-09" "1980-11-10"
## [316] "1980-11-11" "1980-11-12" "1980-11-13" "1980-11-14" "1980-11-15"
## [321] "1980-11-16" "1980-11-17" "1980-11-18" "1980-11-19" "1980-11-20"
## [326] "1980-11-21" "1980-11-22" "1980-11-23" "1980-11-24" "1980-11-25"
## [331] "1980-11-26" "1980-11-27" "1980-11-28" "1980-11-29" "1980-11-30"
## [336] "1980-12-01" "1980-12-02" "1980-12-03" "1980-12-04" "1980-12-05"
## [341] "1980-12-06" "1980-12-07" "1980-12-08" "1980-12-09" "1980-12-10"
## [346] "1980-12-11" "1980-12-12" "1980-12-13" "1980-12-14" "1980-12-15"
## [351] "1980-12-16" "1980-12-17" "1980-12-18" "1980-12-19" "1980-12-20"
## [356] "1980-12-21" "1980-12-22" "1980-12-23" "1980-12-24" "1980-12-25"
## [361] "1980-12-26" "1980-12-27" "1980-12-28" "1980-12-29" "1980-12-30"
## [366] "1980-12-31"

The Cold War’s hottest year

In October 1962, during the Cuban missile crisis, the world came close to a full scale nuclear war. Throughout 1962, the USA, USSR, and France together detonated a record 178 nuclear bombs for military power display and research. You’ve been given a sample of the Nuclear Explosion Database (NEDB) for that year (cumul_nukes_1962_df) with an observation for each date on which a bomb was detonated. The total_bombs variable contains the cumulative number of bombs detonated by a country up to that point in time.

You’ll complete the dataset to hold the full sequence of dates, and visualize the total number of bombs per country over time. You’ll also use the fill() function from Chapter One to impute missing values.

cumul_nukes_1962_df <- read.delim("DATABASE/cumul_nukes_1962_df.txt", sep = ",") %>% select(., c(-"X"))

#transformamos la fecha par qu lo lea como fecha

cumul_nukes_1962_df <- cumul_nukes_1962_df %>% 
  mutate(
    fecha = as.POSIXct(date, origin = '1970-01-01', tz = 'GMT')
  ) %>% 
  
  #quitamos la columna inicial 
  select(.,c (
    -"date"
  )) %>% 
  
  #renombramos aquellacolumna recientemente creada
  setnames(.,
           c("country", "total_bombs", "date"))
head(cumul_nukes_1962_df)

##              country total_bombs       date
## 1             France           0 1962-01-01
## 2             France           1 1962-05-01
## 3 Russian Federation           0 1962-01-01
## 4 Russian Federation           1 1962-02-02
## 5 Russian Federation           2 1962-08-01
## 6 Russian Federation           3 1962-08-03

#### hayq ue cambiar el formato de datoe a as.date() esta en formato factor

The dplyr and ggplot2 packages have been pre-loaded for you.

Complete the dataset so that for each country there is an observation of each date using the full_seq() function.

cumul_nukes_1962_df %>% 
  # Complete the dataset
  complete(country, date = full_seq(date, period = 1))

## # A tibble: 92,793,603 x 3
##    country date                total_bombs
##    <fct>   <dttm>                    <int>
##  1 France  1962-01-01 00:00:00           0
##  2 France  1962-01-01 00:00:01          NA
##  3 France  1962-01-01 00:00:02          NA
##  4 France  1962-01-01 00:00:03          NA
##  5 France  1962-01-01 00:00:04          NA
##  6 France  1962-01-01 00:00:05          NA
##  7 France  1962-01-01 00:00:06          NA
##  8 France  1962-01-01 00:00:07          NA
##  9 France  1962-01-01 00:00:08          NA
## 10 France  1962-01-01 00:00:09          NA
## # … with 92,793,593 more rows

Group the data by country. Use the fill() function to overwrite NA values in the total_bombs variable with the last known value.

cumul_nukes_1962_df %>% 
  # Complete the dataset
  complete(country, date = full_seq(date, period = 1)) %>% 
  # Group the data by country
  group_by(country) %>% 
  # Impute missing values with the last known observation
  fill(total_bombs)

## # A tibble: 92,793,603 x 3
## # Groups:   country [3]
##    country date                total_bombs
##    <fct>   <dttm>                    <int>
##  1 France  1962-01-01 00:00:00           0
##  2 France  1962-01-01 00:00:01           0
##  3 France  1962-01-01 00:00:02           0
##  4 France  1962-01-01 00:00:03           0
##  5 France  1962-01-01 00:00:04           0
##  6 France  1962-01-01 00:00:05           0
##  7 France  1962-01-01 00:00:06           0
##  8 France  1962-01-01 00:00:07           0
##  9 France  1962-01-01 00:00:08           0
## 10 France  1962-01-01 00:00:09           0
## # … with 92,793,593 more rows

Use ggplot() to visualize the total_bombs at any given date, color the line plot by country. Some code has been added for you to visualize the Cuban Missile Crisis.

# cumul_nukes_1962_df %>% 
#   # Complete the dataset
#   complete(country, date = full_seq(date, period = 1)) %>% 
#   # Group the data by country
#   group_by(country) %>% 
#   # Impute missing values with the last known observation
#   fill(total_bombs) %>% 
#   # Plot the number of bombs over time, color by country
#   ggplot(aes(date, total_bombs, color = country)) +
#   # These two lines will mark the Cuban Missile Crisis 
#   geom_rect(xmin = as.Date("1962-10-16"), xmax = as.Date("1962-10-29"), ymin = -Inf, ymax = Inf, color = NA)+ 
#   geom_text(x = as.Date("1962-10-22"), y = 15, label = "Cuban Missile Crisis", angle = 90, color = "white")+
#   geom_line()

#casca R

Advanced completions

Olympic medals per continent

You want to compare Olympic performance of athletes per continent over time, both on the winter and summer Olympics. You’ve been given a dataset medal_df with the average number of medals won per participant of each continent since 1928. You’ll complete this data to introduce zero values for years where a continent did not win any medals.

The ggplot2 package has been pre-loaded for you.

medal_df <- read.delim("DATABASE/medal_df2.txt", sep = ",") %>% select(., c(-"X"))
  
head(medal_df)

##   continent season year medals_per_participant
## 1    Africa Summer 1928             0.17241379
## 2    Africa Summer 1932             0.41666667
## 3    Africa Summer 1936             0.07058824
## 4    Africa Summer 1948             0.09166667
## 5    Africa Summer 1952             0.08064516
## 6    Africa Summer 1956             0.06542056

Complete the dataset so that each continent has a medals_per_participant value at each Olympic event. Missing values should be filled with zeros. Nest the season and year variables using the nesting() function, since the summer and winter Olympics don’t occur in the same years. Use ggplot() to create a line plot with the medals_per_participant per year, color the plot by continent.

medal_df %>% 
  # Give each continent an observation at each Olympic event
  complete(
    continent, 
    nesting(season, year), 
    fill = list(medals_per_participant = 0)
  ) %>%
  # Plot the medals_per_participant over time, colored by continent
  ggplot(aes(x = year, y = medals_per_participant, color = continent)) +
  geom_line() +
  facet_grid(season ~ .)

Tracking a virus outbreak

You’re a doctor in a remote village confronted with a virus outbreak. You have been collecting data on when your patients got infected and recovered in a data frame named patient_df. Your goal is to create a visual with the number of sick patients over time. You’ll first have to reshape the data so that you can count the number of sick patients per day.

patient_df <- read.delim("DATABASE/patient_df.txt", sep = ",") %>% select(., c(-"X"))
  

#transformacion de las fechas

patient_df <- patient_df %>% 
  mutate(
    infected1 = as.POSIXct(infected, origin = '1970-01-01', tz = 'GMT'), 
    recovered1 = as.POSIXct(recovered, origin = '1970-01-01', tz = 'GMT')
  ) %>% 
  
  #quitamos la columna inicial 
  select(.,c (
     -"infected", - "recovered"
  )) %>% 
  
  #renombramos aquellacolumna recientemente creada
  setnames(.,
           c("patient", "infected", "recovered"))
head(patient_df)

##   patient   infected  recovered
## 1       A 2020-02-08 2020-03-28
## 2       B 2020-02-19 2020-03-04
## 3       C 2020-02-27 2020-03-19
## 4       D 2020-03-05 2020-03-19
## 5       E 2020-03-06 2020-03-29
## 6       F 2020-03-07 2020-03-22

The dplyr and ggplot2 packages have been pre-loaded for you.

Inspect patient_df in the console. Pivot the infected and recovered columns to long format, the old column names should go in the status variable, the values to date.

patient_df %>% 
  # Pivot the infected and recovered columns to long format
  pivot_longer(

    c("infected", "recovered"), 

    names_to = "status", 
    values_to = "date"
  )

## # A tibble: 52 x 3
##    patient status    date               
##    <fct>   <chr>     <dttm>             
##  1 A       infected  2020-02-08 00:00:00
##  2 A       recovered 2020-03-28 00:00:00
##  3 B       infected  2020-02-19 00:00:00
##  4 B       recovered 2020-03-04 00:00:00
##  5 C       infected  2020-02-27 00:00:00
##  6 C       recovered 2020-03-19 00:00:00
##  7 D       infected  2020-03-05 00:00:00
##  8 D       recovered 2020-03-19 00:00:00
##  9 E       infected  2020-03-06 00:00:00
## 10 E       recovered 2020-03-29 00:00:00
## # … with 42 more rows

Group the data by patient and then complete the date column so that each date between infection and recovery is added using the full_seq() column. At the end, ungroup the data.

patient_df %>% 
  # Pivot the infected and recovered columns to long format
  pivot_longer(-patient, names_to = "status", values_to = "date") %>% 
  select(-status) %>% 
  # Group by patient
  group_by(patient) %>% 
  # Complete the date range per patient using full_seq()
  complete(
    date = full_seq(date, period = 1)


  )%>% 
   
  # Ungroup the data
  ungroup()

## # A tibble: 56,851,226 x 2
##    patient date               
##    <fct>   <dttm>             
##  1 A       2020-02-08 00:00:00
##  2 A       2020-02-08 00:00:01
##  3 A       2020-02-08 00:00:02
##  4 A       2020-02-08 00:00:03
##  5 A       2020-02-08 00:00:04
##  6 A       2020-02-08 00:00:05
##  7 A       2020-02-08 00:00:06
##  8 A       2020-02-08 00:00:07
##  9 A       2020-02-08 00:00:08
## 10 A       2020-02-08 00:00:09
## # … with 56,851,216 more rows

Each date is now a day on which a patient was sick, count the dates and name the new variable n_sick.

# DATA <- patient_df %>% 
#   # Pivot the infected and recovered columns to long format
#   pivot_longer(-patient, names_to = "status", values_to = "date") %>% 
#   select(-status) %>% 
#   # Group by patient
#   group_by(patient) %>% 
#   # Complete the date range per patient using full_seq()
#   complete(date = full_seq(date, period = 1)) %>% 
#   # Ungroup the data
#   ungroup() %>% 
#   # Count the dates, the count goes in the n_sick variable
#   count(date, name = "n_sick") 


  # ggplot(DATA, aes(x = date, y = n_sick))+
  # geom_line()

Counting office occupants

Imagine you’re an office facility manager and want to know how many people are present throughout the day. You’ve installed a sensor at the entrance that counts the number of people entering and leaving the building. The sensor sends an update at the end of every 20 minute time slot if at least one person passed.

To create a dataset ready for visualization, you’ll combine the different techniques you’ve learned so far.

sensor_df <- read.delim("DATABASE/sensor_df.txt", sep = ",") %>% select(., c(-"X"))
  

sensor_df <- sensor_df %>% 
  mutate(
    time1 = as.POSIXct(time, origin = '1970-01-01', tz = 'GMT')
  ) %>% 
  
  #quitamos la columna inicial 
  select(.,c (
     -"time"
  )) %>% 
  
  #renombramos aquellacolumna recientemente creada
  setnames(.,
           c("enter", "exit", "time"))
head(sensor_df)

##   enter exit                time
## 1     1    0 2020-01-01 07:00:00
## 2     2    0 2020-01-01 07:20:00
## 3     3   -1 2020-01-01 07:40:00
## 4     5    0 2020-01-01 08:00:00
## 5     3    0 2020-01-01 08:20:00
## 6     4   -1 2020-01-01 08:40:00

The dplyr and ggplot2 packages have been pre-loaded for you.

Complete the time variable by using the seq() function to create a sequence between the min and max values with an interval of “20 min”. Fill NA values of enter and exit with 0L.

sensor_df %>% 
  # Complete the time column with a 20 minute interval
  complete(time = seq(min(time), max(time), by = "20 min"),
           fill = list(enter = 0L, exit = 0L))

## # A tibble: 36 x 3
##    time                enter  exit
##    <dttm>              <int> <int>
##  1 2020-01-01 07:00:00     1     0
##  2 2020-01-01 07:20:00     2     0
##  3 2020-01-01 07:40:00     3    -1
##  4 2020-01-01 08:00:00     5     0
##  5 2020-01-01 08:20:00     3     0
##  6 2020-01-01 08:40:00     4    -1
##  7 2020-01-01 09:00:00     4     0
##  8 2020-01-01 09:20:00     1     0
##  9 2020-01-01 09:40:00     1    -1
## 10 2020-01-01 10:00:00     0     0
## # … with 26 more rows

Calculate the total_inside variable by taking the cumulative sum of enter plus exit using the cumsum() function.

sensor_df %>% 
  # Complete the time column with a 20 minute interval
  complete(time = seq(min(time), max(time), by = "20 min"),
           fill = list(enter = 0L, exit = 0L)) %>%
  # Calculate the total number of people inside
  mutate(total_inside = cumsum(enter + exit))

## # A tibble: 36 x 4
##    time                enter  exit total_inside
##    <dttm>              <int> <int>        <int>
##  1 2020-01-01 07:00:00     1     0            1
##  2 2020-01-01 07:20:00     2     0            3
##  3 2020-01-01 07:40:00     3    -1            5
##  4 2020-01-01 08:00:00     5     0           10
##  5 2020-01-01 08:20:00     3     0           13
##  6 2020-01-01 08:40:00     4    -1           16
##  7 2020-01-01 09:00:00     4     0           20
##  8 2020-01-01 09:20:00     1     0           21
##  9 2020-01-01 09:40:00     1    -1           21
## 10 2020-01-01 10:00:00     0     0           21
## # … with 26 more rows

Pivot the enter and exit columns to long format. The column names should go in the direction variable, the values in n_people.

sensor_df %>% 
  # Complete the time column with a 20 minute interval
  complete(time = seq(min(time), max(time), by = "20 min"),
           fill = list(enter = 0L, exit = 0L)) %>%
  # Calculate the total number of people inside
  mutate(total_inside = cumsum(enter + exit)) %>% 
  # Pivot the enter and exit columns to long format
  pivot_longer(
    c("enter", "exit"), 
    names_to = "direction", 

    values_to = "n_people"

  )

## # A tibble: 72 x 4
##    time                total_inside direction n_people
##    <dttm>                     <int> <chr>        <int>
##  1 2020-01-01 07:00:00            1 enter            1
##  2 2020-01-01 07:00:00            1 exit             0
##  3 2020-01-01 07:20:00            3 enter            2
##  4 2020-01-01 07:20:00            3 exit             0
##  5 2020-01-01 07:40:00            5 enter            3
##  6 2020-01-01 07:40:00            5 exit            -1
##  7 2020-01-01 08:00:00           10 enter            5
##  8 2020-01-01 08:00:00           10 exit             0
##  9 2020-01-01 08:20:00           13 enter            3
## 10 2020-01-01 08:20:00           13 exit             0
## # … with 62 more rows

Use ggplot() to visualize the n_people in the building over time. Use the fill argument to color the area plot by direction.

sensor_df %>% 
  # Complete the time column with a 20 minute interval
  complete(time = seq(min(time), max(time), by = "20 min"),
           fill = list(enter = 0L, exit = 0L)) %>%
  # Calculate the total number of people inside
  mutate(total_inside = cumsum(enter + exit)) %>% 
  # Pivot the enter and exit columns to long format
  pivot_longer(enter:exit, names_to = "direction", values_to = "n_people") %>% 
  # Plot the number of people over time, fill by direction
  ggplot(., aes(time, n_people, fill = direction)) +
  geom_area() +
  geom_line(aes(y = total_inside))

Reshaping-Data-with-tidyr

Tidy data

What is tidy data?

Multiple variables per column

Columns with multiple values

International phone numbers

Extracting observations from values

Separating into columns and rows

Missing values

And the Oscar for best director goes to …

Imputing sales data

Nuclear bombs per continent

From wide to long and back

From wide to long

Nuclear bombs per country

WHO obesity per country

Bond… James Bond

Deriving variables from column headers

New-Zealand’s bird of the year

Big tech stock prices

Deriving variables from complex column headers

Soviet space dogs, the dog perspective

WHO obesity vs. life expectancy

Uncounting observations

From long to wide data

Soviet space dogs, the flight perspective

Planet temperature & distance to the Sun

Expanding data

Creating unique combinations of vectors

Letters of the genetic code

When did humans replace dogs in space?

Finding missing observations

Completing data with all value combinations

Completing the Solar System

Zero Olympic medals

Creating a sequence with full_seq()

The Cold War’s hottest year

Advanced completions

Olympic medals per continent

Tracking a virus outbreak

Counting office occupants