Being a busy person, you don’t want to spend too much time on Netflix, so you decide to crunch some numbers on TV show and movie durations before deciding what to watch. You’ve managed to obtain a dataset named netflix_df, but its duration column has an issue. It contains strings with both a value and unit of duration (“min” or “Season”).
You’ll tidy this dataset so that each variable gets its own column.
As will always be the case in this course, the tidyr package has been pre-loaded for you.
Inspect netflix_df by typing its name directly in the R console at the bottom-right and hitting Enter to see what string separates the value from the unit in the duration column.
netflix_df <- read.delim("DATABASE/netflix.txt", sep = ",")
Separate the duration column over two variables named value and unit. Pass the string separating the number from the unit to the sep argument.
netflix_df %>%
# Split the duration column into value and unit columns
separate(duration, into = c("value", "unit"), sep = " ", convert = TRUE)
## X title type value unit
## 1 1 Article 15 Movie 125 min
## 2 2 Kill Me If You Dare Movie 100 min
## 3 3 The Spy TV Show 1 Season
## 4 4 The World We Make Movie 108 min
## 5 5 Watchman Movie 93 min
## 6 6 Mo Gilligan: Momentum Movie 64 min
## 7 7 Domino Movie 89 min
## 8 8 TUNA GIRL Movie 90 min
## 9 9 Bard of Blood TV Show 1 Season
## 10 10 Dragons: Rescue Riders TV Show 1 Season
## 11 11 Skylines TV Show 1 Season
## 12 12 The Politician TV Show 1 Season
## 13 13 Malaal Movie 133 min
## 14 14 Birders Movie 38 min
## 15 15 Furie Movie 97 min
## 16 16 Oh! Baby (Malayalam) Movie 146 min
## 17 17 Oh! Baby (Tamil) Movie 146 min
## 18 18 Inside Man: Most Wanted Movie 106 min
## 19 19 Manmadhudu 2 Movie 152 min
## 20 20 Team Kaylie TV Show 1 Season
## 21 21 BONDING TV Show 1 Season
## 22 22 Vagabond TV Show 1 Season
## 23 23 Criminal: France TV Show 1 Season
## 24 24 Criminal: Germany TV Show 1 Season
## 25 25 Criminal: Spain TV Show 1 Season
## 26 26 Criminal: UK TV Show 1 Season
## 27 27 Daddy Issues Movie 82 min
## 28 28 The Hockey Girls TV Show 1 Season
## 29 29 True: Tricky Treat Day Movie 24 min
## 30 30 Kabir Singh Movie 171 min
## 31 31 When the Camellia Blooms TV Show 1 Season
## 32 32 Avengement Movie 88 min
## 33 33 The Last Kids on Earth TV Show 1 Season
## 34 34 Surviving R. Kelly TV Show 1 Season
## 35 35 The Outsider Movie 86 min
## 36 36 Oh! Baby Movie 157 min
## 37 37 American Warfighter Movie 106 min
## 38 38 Deviant Love Movie 87 min
## 39 39 Marianne TV Show 1 Season
## 40 40 Monarca TV Show 1 Season
## 41 41 Shanghai Fortress Movie 107 min
## 42 42 Tall Girl Movie 103 min
## 43 43 Top Boy TV Show 1 Season
## 44 44 Unbelievable TV Show 1 Season
## 45 45 The I-Land TV Show 1 Season
## 46 46 The Mind, Explained TV Show 1 Season
## 47 47 Bill Burr: Paper Tiger Movie 67 min
## 48 48 Evelyn Movie 96 min
## 49 49 Our Godfather Movie 93 min
## 50 50 Mokalik (Mechanic) Movie 100 min
## 51 51 Uncle Naji in UAE Movie 95 min
## 52 52 After Movie 106 min
## 53 53 Rhythm + Flow TV Show 1 Season
## 54 54 Deon Cole: Cole Hearted Movie 70 min
## 55 55 My Country: The New Age TV Show 1 Season
## 56 56 Noblemen Movie 109 min
## 57 57 Brother in Love Movie 110 min
## 58 58 In the Tall Grass Movie 102 min
## 59 59 Kids on the Block Movie 102 min
## 60 60 Raising Dion TV Show 1 Season
## 61 61 Nowhere Man TV Show 1 Season
## 62 62 Mehandi Circus Movie 126 min
## 63 63 Tunnel TV Show 1 Season
## 64 64 Seis Manos TV Show 1 Season
## 65 65 A 3 Minute Hug Movie 28 min
## 66 66 Wait, My Youth TV Show 1 Season
## 67 67 The Last Whistle Movie 88 min
## 68 68 Incoming Movie 89 min
## 69 69 Assimilate Movie 94 min
## 70 70 Brotherhood TV Show 1 Season
## 71 71 Dolemite Is My Name Movie 118 min
## 72 72 It Takes a Lunatic Movie 127 min
## 73 73 Nailed It! France TV Show 1 Season
## 74 74 Nailed It! Spain TV Show 1 Season
## 75 75 Prank Encounters TV Show 1 Season
## 76 76 Rattlesnake Movie 86 min
## 77 77 The Untamed TV Show 1 Season
## 78 78 Daybreak TV Show 1 Season
## 79 79 Echo in the Canyon Movie 82 min
## 80 80 Revenge of the Pontianak Movie 92 min
## 81 81 Dancing with the Birds Movie 52 min
## 82 82 Living Undocumented TV Show 1 Season
## 83 83 Ready to Mingle Movie 98 min
## 84 84 Eli Movie 98 min
## 85 85 Interior Design Masters TV Show 1 Season
## 86 86 Living with Yourself TV Show 1 Season
## 87 87 Seventeen Movie 100 min
## 88 88 Tell Me Who I Am Movie 86 min
## 89 89 The Laundromat Movie 96 min
## 90 90 Unnatural Selection TV Show 1 Season
## 91 91 Upstarts Movie 112 min
## 92 92 THE UNLISTED TV Show 1 Season
## 93 93 Ghosts of Sugar Land Movie 22 min
## 94 94 Homeland Movie 73 min
## 95 95 Magical Andes TV Show 1 Season
## 96 96 Street Flow Movie 96 min
## 97 97 The Lies Within TV Show 1 Season
## 98 98 Fractured Movie 100 min
## 99 99 The Bling Lagosians Movie 99 min
## 100 100 The Forest of Love Movie 151 min
## 101 101 The Influence Movie 102 min
## 102 102 Ultramarine Magmell TV Show 1 Season
## 103 103 A.M.I. Movie 77 min
## 104 104 Nikki Glaser: Bangin’ Movie 64 min
## 105 105 Green Eggs and Ham TV Show 1 Season
## 106 106 Let It Snow Movie 93 min
## 107 107 Paradise Beach Movie 94 min
## 108 108 Burning Cane Movie 78 min
## 109 109 SCAMS TV Show 1 Season
## 110 110 Seth Meyers: Lobby Baby Movie 61 min
## 111 111 Tune in for Love Movie 123 min
## 112 112 Undercover Brother 2 Movie 85 min
## 113 113 Voice TV Show 1 Season
## 114 114 Oththa Seruppu Size 7 Movie 103 min
## 115 115 The Devil Next Door TV Show 1 Season
## 116 116 Chocolate TV Show 1 Season
## 117 117 The Zoya Factor Movie 135 min
## 118 118 Atlantics Movie 106 min
## 119 119 I Lost My Body Movie 81 min
## 120 120 Sugar Rush Christmas TV Show 1 Season
## 121 121 The Charming Stepmom TV Show 1 Season
## 122 122 The Movies That Made Us TV Show 1 Season
## 123 123 Holiday Rush Movie 94 min
## 124 124 Levius TV Show 1 Season
## 125 125 Lugar de Mulher TV Show 1 Season
## 126 126 Merry Happy Whatever TV Show 1 Season
## 127 127 Mythomaniac TV Show 1 Season
## 128 128 Broken TV Show 1 Season
## 129 129 Evvarikee Cheppoddu Movie 134 min
## 130 130 Little Singham: Mahabali Movie 69 min
## 131 131 The Irishman Movie 209 min
## 132 132 True: Winter Wishes Movie 46 min
## 133 133 Zona Rosa TV Show 1 Season
## 134 134 Pranaam Movie 117 min
## 135 135 What the F* Is Going On? Movie 87 min
## 136 136 Awake Movie 92 min
## 137 137 Brother Movie 97 min
## 138 138 Dino Girl Gauko TV Show 1 Season
## 139 139 Narcoworld: Dope Stories TV Show 1 Season
## 140 140 Nobody's Looking TV Show 1 Season
## 141 141 Shelby American Movie 119 min
## 142 142 Singapore Social TV Show 1 Season
## 143 143 Mortel TV Show 1 Season
## 144 144 Bangkok Buddies TV Show 1 Season
## 145 145 Dorasaani Movie 134 min
## 146 146 Holiday Secrets TV Show 1 Season
## 147 147 Mallesham Movie 131 min
## 148 148 Miss Culinary TV Show 1 Season
## 149 149 No Time for Shame TV Show 1 Season
## 150 150 Candy Online TV Show 1 Season
## 151 151 Earthquake Bird Movie 107 min
## 152 152 Guna 369 Movie 143 min
## 153 153 House Arrest Movie 105 min
## 154 154 Klaus Movie 98 min
## 155 155 The Club TV Show 1 Season
## 156 156 The Stranded TV Show 1 Season
## 157 157 My Dear Warrior TV Show 1 Season
## 158 158 Submission Movie 95 min
## 159 159 To Be of Service Movie 88 min
## 160 160 A Remarkable Tale Movie 94 min
## 161 161 American Son Movie 90 min
## 162 162 Christmas Break-In Movie 87 min
## 163 163 Drive Movie 119 min
## 164 164 Fire in Paradise Movie 40 min
## 165 165 Hache TV Show 1 Season
## 166 166 Hello Ninja TV Show 1 Season
## 167 167 Holiday in the Wild Movie 86 min
## 168 168 Santa Girl Movie 91 min
## 169 169 The King Movie 141 min
## 170 170 The Man Without Gravity Movie 107 min
## 171 171 We Are the Wave TV Show 1 Season
## 172 172 Abyss TV Show 1 Season
## 173 173 I Have a Script TV Show 1 Season
## 174 174 In the Bosom of a Thorn TV Show 1 Season
## 175 175 The Writer TV Show 1 Season
## 176 176 What If? TV Show 1 Season
## 177 177 Last Breath Movie 86 min
## 178 178 The Wandering Earth Movie 125 min
## 179 179 Always Be My Maybe Movie 102 min
## 180 180 Chopsticks Movie 101 min
## 181 181 Killer Ratings TV Show 1 Season
## 182 182 Pegasus Movie 98 min
## 183 183 When They See Us TV Show 1 Season
## 184 184 Svaha: The Sixth Finger Movie 123 min
## 185 185 All In My Family Movie 40 min
## 186 186 Crime Diaries: Night Out TV Show 1 Season
## 187 187 Dead to Me TV Show 1 Season
## 188 188 Despite Everything Movie 79 min
## 189 189 Flinch TV Show 1 Season
## 190 190 The Last Summer Movie 110 min
## 191 191 Tuca & Bertie TV Show 1 Season
## 192 192 Undercover TV Show 1 Season
## 193 193 Charmed TV Show 1 Season
## 194 194 Historical Roasts TV Show 1 Season
## 195 195 Thackeray TV Show 1 Season
## 196 196 After Maria Movie 38 min
## 197 197 Joy Movie 101 min
## 198 198 Rim of the World Movie 99 min
## 199 199 The Perfection Movie 91 min
## 200 200 WHAT / IF TV Show 1 Season
## 201 201 A Tale of Two Kitchens Movie 30 min
## 202 202 Wanda Sykes: Not Normal Movie 66 min
## 203 203 SGT. Will Gardner Movie 125 min
## 204 204 The Brawler Movie 95 min
## 205 205 1994 TV Show 1 Season
## 206 206 It's Bruno! TV Show 1 Season
## 207 207 Maria Movie 90 min
## 208 208 See You Yesterday Movie 87 min
## 209 209 Well-Intended Love TV Show 1 Season
## 210 210 Good Sam Movie 90 min
## 211 211 PATRICK (2019) Movie 94 min
## 212 212 I Hear You TV Show 1 Season
## 213 213 Backdraft 2 Movie 102 min
## 214 214 Malibu Rescue Movie 70 min
## 215 215 The Defected TV Show 1 Season
## 216 216 Shéhérazade Movie 111 min
## 217 217 Jailbirds TV Show 1 Season
## 218 218 The Society TV Show 1 Season
## 219 219 Wine Country Movie 104 min
## 220 220 Hidden in Plain Sight Movie 87 min
## 221 221 Knock Down The House Movie 88 min
## 222 222 Roswell, New Mexico TV Show 1 Season
## 223 223 Sur Sapata Movie 140 min
## 224 224 The Kindness Diaries TV Show 1 Season
## 225 225 After Life TV Show 1 Season
## 226 226 Juanita Movie 90 min
## 227 227 Lady J Movie 111 min
## 228 228 Shadow TV Show 1 Season
## 229 229 Walk. Ride. Rodeo. Movie 100 min
## 230 230 The Order TV Show 1 Season
## 231 231 The Hard Way Movie 94 min
## 232 232 Mercy Black Movie 88 min
## 233 233 15-Aug Movie 124 min
## 234 234 Bayonet Movie 102 min
## 235 235 Osmosis TV Show 1 Season
## 236 236 The Highwaymen Movie 132 min
## 237 237 Traitors TV Show 1 Season
## 238 238 All American TV Show 1 Season
## 239 239 Delhi Crime TV Show 1 Season
## 240 240 Most Beautiful Thing TV Show 1 Season
## 241 241 Selling Sunset TV Show 1 Season
## 242 242 The Dirt Movie 108 min
## 243 243 Stay Tuned! TV Show 1 Season
## 244 244 My Husband Won't Fit TV Show 1 Season
## 245 245 Amy Schumer Growing Movie 61 min
## 246 246 Green Door TV Show 1 Season
## 247 247 Love, Death & Robots TV Show 1 Season
## 248 248 Turn Up Charlie TV Show 1 Season
## 249 249 Triple Frontier Movie 126 min
## 250 250 Losers TV Show 1 Season
## 251 251 Northern Rescue TV Show 1 Season
## 252 252 Beneath the Leaves Movie 90 min
## 253 253 Berlin, I Love You Movie 120 min
## 254 254 Elisa & Marcela Movie 119 min
## 255 255 I Am Mother Movie 114 min
## 256 256 Luka Chuppi Movie 125 min
## 257 257 Pachamama Movie 71 min
## 258 258 Rock My Heart Movie 110 min
## 259 259 Tales of the City TV Show 1 Season
## 260 260 The Black Godfather Movie 118 min
## 261 261 Alles ist gut Movie 90 min
## 262 262 Everybody Knows Movie 133 min
## 263 263 Dolphin Kick Movie 90 min
## 264 264 Inhuman Kiss Movie 122 min
## 265 265 Romeo Akbar Walter Movie 141 min
## 266 266 Super Deluxe Movie 176 min
## 267 267 7SEEDS TV Show 1 Season
## 268 268 Exhibit A TV Show 1 Season
## 269 269 Family Business TV Show 1 Season
## 270 270 Staged Killer Movie 88 min
## 271 271 Super Bheem TV Show 1 Season
## 272 272 The Chosen One TV Show 1 Season
## 273 273 ANIMA Movie 15 min
## 274 274 Answer for Heaven TV Show 1 Season
## 275 275 Daniel Sosa: Maleducado Movie 55 min
## 276 276 Unbroken Movie 109 min
## 277 277 Mike Epps: Only One Mike Movie 63 min
## 278 278 Triple Threat Movie 96 min
## 279 279 Blood Will Tell Movie 113 min
## 280 280 Bolívar TV Show 1 Season
## 281 281 Mr. Iglesias TV Show 1 Season
## 282 282 Rakkhosh Movie 116 min
## 283 283 The Wolf's Call Movie 116 min
## 284 284 Arthdal Chronicles TV Show 1 Season
## 285 285 Krutant Movie 114 min
## 286 286 Beats Movie 110 min
## 287 287 The Edge of Democracy Movie 122 min
## 288 288 Charité at War TV Show 1 Season
## 289 289 Life Overtakes Me Movie 40 min
## 290 290 Luckee Movie 113 min
## 291 291 Somewhere Only We Know TV Show 1 Season
## 292 292 The Alcàsser Murders TV Show 1 Season
## 293 293 DJ Cinderella Movie 95 min
## 294 294 Leila TV Show 1 Season
## 295 295 Murder Mystery Movie 98 min
## 296 296 Trinkets TV Show 1 Season
## 297 297 Yankee TV Show 1 Season
## 298 298 Jinn TV Show 1 Season
## 299 299 The 3rd Eye 2 Movie 117 min
## 300 300 Jo Koy: Comin' In Hot Movie 60 min
## 301 301 A Thousand Goodnights TV Show 1 Season
## 302 302 Oh, Ramona! Movie 109 min
## 303 303 Then Came You Movie 97 min
## 304 304 We Are Legends Movie 109 min
## 305 305 Aziz Ansari: RIGHT NOW Movie 65 min
## 306 306 Krishna Balram TV Show 1 Season
## 307 307 In The Dark TV Show 1 Season
## 308 308 7 (Seven) TV Show 1 Season
## 309 309 All Good Ones Get Away Movie 83 min
## 310 310 American Hangman Movie 99 min
## 311 311 The Last Czars TV Show 1 Season
## 312 312 Mission of Honor Movie 107 min
## 313 313 Twelve Forever TV Show 1 Season
## 314 314 Eerie Movie 100 min
## 315 315 Girls With Balls Movie 77 min
## 316 316 The Son Movie 93 min
## 317 317 Another Life TV Show 1 Season
## 318 318 The Great Hack Movie 114 min
## 319 319 Eurovision Song Contest TV Show 1 Season
## 320 320 Sexy Central TV Show 1 Season
## 321 321 ¡Ay, mi madre! Movie 81 min
## 322 322 Typewriter TV Show 1 Season
## 323 323 Secret Obsession Movie 98 min
## 324 324 Unrequited Love TV Show 1 Season
## 325 325 4L Movie 105 min
## 326 326 Blown Away TV Show 1 Season
## 327 327 Extreme Engagement TV Show 1 Season
## 328 328 Kidnapping Stella Movie 90 min
## 329 329 One Spring Night TV Show 1 Season
## 330 330 Point Blank Movie 87 min
## 331 331 Taco Chronicles TV Show 1 Season
## 332 332 True Tunes TV Show 1 Season
## 333 333 Chhota Bheem TV Show 1 Season
## 334 334 Family Reunion TV Show 1 Season
## 335 335 Parchís: the Documentary Movie 106 min
## 336 336 Soul to Keep Movie 94 min
## 337 337 Speed Is My Need Movie 75 min
## 338 338 Kingdom TV Show 1 Season
## 339 339 Polar Movie 119 min
## 340 340 Thieves of the Wood TV Show 1 Season
## 341 341 Close Movie 95 min
## 342 342 IO Movie 96 min
## 343 343 Patria Movie 90 min
## 344 344 Deadcon Movie 78 min
## 345 345 Jezebel Movie 86 min
## 346 346 NiNoKuni Movie 107 min
## 347 347 Eye For An Eye Movie 108 min
## 348 348 Saving Zoë Movie 95 min
## 349 349 The Brave Movie 103 min
## 350 350 Deadly Switch Movie 86 min
## 351 351 Women of Mafia 2 Movie 132 min
## 352 352 Made in China Movie 128 min
## 353 353 Sex Education TV Show 1 Season
## 354 354 The Last Laugh Movie 99 min
## 355 355 Giri / Haji TV Show 1 Season
## 356 356 Fall Girls Movie 80 min
## 357 357 High Flying Bird Movie 91 min
## 358 358 Nailed It! Mexico TV Show 1 Season
## 359 359 Romance is a bonus book TV Show 1 Season
## 360 360 Firebrand Movie 117 min
## 361 361 Paddleton Movie 89 min
## 362 362 Paris Is Us Movie 84 min
## 363 363 The 43 TV Show 1 Season
## 364 364 The Umbrella Academy TV Show 1 Season
## 365 365 Dating Around TV Show 1 Season
## 366 366 Always a Witch TV Show 1 Season
## 367 367 Russian Doll TV Show 1 Season
## 368 368 True: Happy Hearts Day Movie 24 min
## 369 369 Velvet Buzzsaw Movie 113 min
## 370 370 Saaho Movie 172 min
## 371 371 Nothing to Lose 2 Movie 97 min
## 372 372 Glow Up TV Show 1 Season
## 373 373 Marriage Story Movie 137 min
## 374 374 The Confession Killer TV Show 1 Season
## 375 375 Three Days of Christmas TV Show 1 Season
## 376 376 Triad Princess TV Show 1 Season
## 377 377 Virgin River TV Show 1 Season
## 378 378 Home for Christmas TV Show 1 Season
## 379 379 V Wars TV Show 1 Season
## 380 380 Let's Dance Movie 109 min
## 381 381 The Road to Love TV Show 1 Season
## 382 382 Aaviri Movie 110 min
## 383 383 ARASHI's Diary -Voyage- TV Show 1 Season
## 384 384 How High 2 Movie 89 min
## 385 385 The Neighbor TV Show 1 Season
## 386 386 Wish Man Movie 108 min
## 387 387 The Gift TV Show 1 Season
## 388 388 The App Movie 79 min
## 389 389 The Bonfire of Destiny TV Show 1 Season
## 390 390 Doom: Annihilation Movie 97 min
## 391 391 Grand-Daddy Day Care Movie 96 min
## 392 392 Jarhead: Law of Return Movie 103 min
## 393 393 Sweetheart Movie 83 min
## 394 394 Como caído del cielo Movie 117 min
## 395 395 Jacob's Ladder Movie 90 min
## 396 396 Back of the Net Movie 86 min
## 397 397 Agent TV Show 1 Season
## 398 398 The Two Popes Movie 126 min
## 399 399 The Witcher TV Show 1 Season
## 400 400 After the Raid Movie 25 min
## 401 401 Twice Upon A Time TV Show 1 Season
## 402 402 Soundtrack TV Show 1 Season
## 403 403 Crash Landing on You TV Show 1 Season
## 404 404 Holy Expectations Movie 89 min
## 405 405 6 Underground Movie 129 min
## 406 406 The 9th Precinct Movie 95 min
## 407 407 The Sky Is Pink Movie 143 min
## 408 408 Michelle Wolf: Joke Show Movie 60 min
## 409 409 Baby Dolls Movie 127 min
## 410 410 Dead Kids Movie 94 min
## 411 411 High End Yaariyan Movie 120 min
## 412 412 Wandering Stars Movie 95 min
## 413 413 Little Singham TV Show 1 Season
## 414 414 Sintonia TV Show 1 Season
## 415 415 The Family TV Show 1 Season
## 416 416 The InBESTigators TV Show 1 Season
## 417 417 Tiny House Nation TV Show 1 Season
## 418 418 Westside vs. the World Movie 96 min
## 419 419 Dollar TV Show 1 Season
## 420 420 The Naked Director TV Show 1 Season
## 421 421 Wu Assassins TV Show 1 Season
## 422 422 Badla Movie 117 min
## 423 423 The Pursuit Movie 76 min
## 424 424 Enter the Anime Movie 59 min
## 425 425 Back to School Movie 84 min
## 426 426 CAROLE & TUESDAY TV Show 1 Season
## 427 427 Mi amigo Alexis Movie 102 min
## 428 428 Styling Hollywood TV Show 1 Season
## 429 429 Expo Movie 81 min
## 430 430 Falling Inn Love Movie 98 min
## 431 431 Kardec Movie 110 min
## 432 432 Fanatic Movie 91 min
## 433 433 Kaake Da Viyah Movie 133 min
## 434 434 Mayday Life Movie 120 min
## 435 435 Rust Valley Restorers TV Show 1 Season
## 436 436 Saavat Movie 118 min
## 437 437 Love Alarm TV Show 1 Season
## 438 438 American Factory Movie 110 min
## 439 439 Hyperdrive TV Show 1 Season
## 440 440 Simon Amstell: Set Free Movie 52 min
## 441 441 Basketball or Nothing TV Show 1 Season
## 442 442 Léa & I Movie 84 min
## 443 443 Otherhood Movie 101 min
## 444 444 Uriyadi 2 Movie 116 min
## 445 445 45 rpm TV Show 1 Season
## 446 446 Better Than Us TV Show 1 Season
## 447 447 Diagnosis TV Show 1 Season
## 448 448 For Love or Money Movie 95 min
## 449 449 Green Frontier TV Show 1 Season
## 450 450 Sextuplets Movie 100 min
## 451 451 The King's Avatar TV Show 1 Season
## 452 452 The Little Switzerland Movie 86 min
## 453 453 Cannon Busters TV Show 1 Season
## 454 454 Til Death Do Us Part TV Show 1 Season
## 455 455 Whindersson Nunes: Adult Movie 69 min
## 456 456 Happy Jail TV Show 1 Season
## 457 457 Uyare Movie 119 min
## 458 458 90 ML Movie 123 min
## 459 459 Woodstock Movie 97 min
## 460 460 DC Super Hero Girls TV Show 1 Season
## 461 461 Petta (Telugu Version) Movie 170 min
## 462 462 Alien Warfare Movie 88 min
## 463 463 Backfire Movie 97 min
## 464 464 Our Planet TV Show 1 Season
## 465 465 Petta Movie 170 min
## 466 466 Quicksand TV Show 1 Season
## 467 467 Tijuana TV Show 1 Season
## 468 468 Unicorn Store Movie 92 min
## 469 469 Possessed TV Show 1 Season
## 470 470 A Home with A View Movie 92 min
## 471 471 Chambers TV Show 1 Season
## 472 472 Loving is Losing Movie 91 min
## 473 473 Money Trap Movie 122 min
## 474 474 Street Food TV Show 1 Season
## 475 475 Little Singham in London Movie 66 min
## 476 476 Grass Is Greener Movie 98 min
## 477 477 Lunatics TV Show 1 Season
## 478 478 Music Teacher Movie 102 min
## 479 479 Rilakkuma and Kaoru TV Show 1 Season
## 480 480 Someone Great Movie 92 min
## 481 481 The Ruthless Movie 111 min
## 482 482 The Creative Brain Movie 53 min
## 483 483 A Land Imagined Movie 95 min
## 484 484 Huge in France TV Show 1 Season
## 485 485 Special TV Show 1 Season
## 486 486 The Perfect Date Movie 91 min
## 487 487 Black Summer TV Show 1 Season
## 488 488 Persona TV Show 1 Season
## 489 489 The Silence Movie 91 min
## 490 490 You vs. Wild TV Show 1 Season
## 491 491 706 Movie 118 min
## 492 492 Ultraman TV Show 1 Season
## 493 493 Elite TV Show 2 Seasons
## 494 494 Gotham TV Show 5 Seasons
## 495 495 El Marginal TV Show 3 Seasons
## 496 496 Explained TV Show 2 Seasons
## 497 497 Glitch TV Show 3 Seasons
## 498 498 Locked Up TV Show 4 Seasons
## 499 499 Disenchantment TV Show 2 Seasons
## 500 500 Fastest Car TV Show 2 Seasons
## 501 501 The Blacklist TV Show 6 Seasons
## 502 502 I'm Sorry TV Show 2 Seasons
## 503 503 The Chef Show TV Show 2 Seasons
## 504 504 The Ranch TV Show 7 Seasons
## 505 505 Scream TV Show 3 Seasons
## 506 506 Creeped Out TV Show 2 Seasons
## 507 507 Peaky Blinders TV Show 5 Seasons
## 508 508 Rotten TV Show 2 Seasons
## 509 509 Super Monsters TV Show 3 Seasons
## 510 510 KENGAN ASHURA TV Show 2 Seasons
## 511 511 Flavorful Origins TV Show 2 Seasons
## 512 512 Jeopardy! TV Show 5 Seasons
## 513 513 BoJack Horseman TV Show 6 Seasons
## 514 514 Greenhouse Academy TV Show 3 Seasons
## 515 515 The Kominsky Method TV Show 2 Seasons
## 516 516 Baby TV Show 2 Seasons
## 517 517 MeatEater TV Show 4 Seasons
## 518 518 The House of Flowers TV Show 2 Seasons
## 519 519 Chosen TV Show 2 Seasons
## 520 520 Haunted TV Show 2 Seasons
## 521 521 Insatiable TV Show 2 Seasons
## 522 522 The Hook Up Plan TV Show 2 Seasons
## 523 523 YooHoo to the Rescue TV Show 2 Seasons
## 524 524 Schitt's Creek TV Show 5 Seasons
## 525 525 Carmen Sandiego TV Show 2 Seasons
## 526 526 Little Things TV Show 3 Seasons
## 527 527 Busted! TV Show 2 Seasons
## 528 528 Wild District TV Show 2 Seasons
## 529 529 Millennials TV Show 2 Seasons
## 530 530 Chip and Potato TV Show 2 Seasons
## 531 531 La Reina del Sur TV Show 2 Seasons
## 532 532 High Seas TV Show 2 Seasons
## 533 533 Nailed It! Holiday! TV Show 2 Seasons
## 534 534 The Dragon Prince TV Show 3 Seasons
## 535 535 The Crown TV Show 3 Seasons
## 536 536 Llama Llama TV Show 2 Seasons
## 537 537 The Toys That Made Us TV Show 3 Seasons
## 538 538 The Yard TV Show 2 Seasons
## 539 539 Chief of Staff TV Show 2 Seasons
## 540 540 Atypical TV Show 3 Seasons
## 541 541 The Deep TV Show 3 Seasons
## 542 542 Lucifer TV Show 4 Seasons
## 543 543 Bad Blood TV Show 2 Seasons
## 544 544 Supernatural TV Show 14 Seasons
## 545 545 Arrested Development TV Show 5 Seasons
## 546 546 Supergirl TV Show 4 Seasons
## 547 547 Riverdale TV Show 3 Seasons
## 548 548 The Flash TV Show 5 Seasons
## 549 549 Arrow TV Show 7 Seasons
## 550 550 Prince of Peoria TV Show 2 Seasons
## 551 551 Nailed It TV Show 3 Seasons
## 552 552 The Rain TV Show 2 Seasons
## 553 553 White Gold TV Show 2 Seasons
## 554 554 Easy TV Show 3 Seasons
## 555 555 The Mechanism TV Show 2 Seasons
## 556 556 On My Block TV Show 2 Seasons
## 557 557 Santa Clarita Diet TV Show 3 Seasons
## 558 558 Black Lightning TV Show 2 Seasons
## 559 559 The OA TV Show 2 Seasons
## 560 560 Larva Island TV Show 2 Seasons
## 561 561 3% TV Show 3 Seasons
## 562 562 Designated Survivor TV Show 3 Seasons
## 563 563 Black Mirror TV Show 5 Seasons
## 564 564 Madam Secretary TV Show 5 Seasons
## 565 565 Documentary Now! TV Show 3 Seasons
## 566 566 Dynasty TV Show 2 Seasons
## 567 567 Dope TV Show 3 Seasons
## 568 568 Instant Hotel TV Show 2 Seasons
## 569 569 Motown Magic TV Show 2 Seasons
## 570 570 Slasher TV Show 3 Seasons
## 571 571 Forest of Piano TV Show 2 Seasons
## 572 572 Dark TV Show 2 Seasons
## 573 573 Girls Incarcerated TV Show 2 Seasons
## 574 574 Go! Live Your Way TV Show 2 Seasons
## 575 575 The Casketeers TV Show 2 Seasons
## 576 576 The Confession Tapes TV Show 2 Seasons
## 577 577 Aggretsuko TV Show 2 Seasons
## 578 578 Black Spot TV Show 2 Seasons
## 579 579 El desconocido TV Show 2 Seasons
## 580 580 Marvel's Jessica Jones TV Show 3 Seasons
## 581 581 Kakegurui TV Show 2 Seasons
## 582 582 Free Rein TV Show 3 Seasons
## 583 583 Kakegurui TV Show 2 Seasons
## 584 584 Stranger Things TV Show 3 Seasons
## 585 585 The Letdown TV Show 2 Seasons
## 586 586 Yummy Mummies TV Show 2 Seasons
## 587 587 My First First Love TV Show 2 Seasons
## 588 588 Orange Is the New Black TV Show 7 Seasons
## 589 589 Sugar Rush TV Show 2 Seasons
## 590 590 The Worst Witch TV Show 3 Seasons
## 591 591 La casa de papel TV Show 3 Seasons
## 592 592 Last Chance U TV Show 4 Seasons
## 593 593 Queer Eye TV Show 4 Seasons
## 594 594 Pinky Malinky TV Show 3 Seasons
## 595 595 3Below: Tales of Arcadia TV Show 2 Seasons
## 596 596 Bonus Family TV Show 3 Seasons
## 597 597 Club of Crows TV Show 4 Seasons
## 598 598 Marvel's The Punisher TV Show 2 Seasons
## 599 599 Hip-Hop Evolution TV Show 3 Seasons
## 600 600 Booba TV Show 3 Seasons
## 601 601 Grace and Frankie TV Show 5 Seasons
## 602 602 Friends from College TV Show 2 Seasons
## 603 603 Horrid Henry TV Show 2 Seasons
## 604 604 Harvey Street Kids TV Show 3 Seasons
## 605 605 COMEDIANS of the world TV Show 13 Seasons
## 606 606 Big Mouth TV Show 3 Seasons
## 607 607 Man Like Mobeen TV Show 2 Seasons
## 608 608 One Day at a Time TV Show 3 Seasons
## 609 609 Chef's Table TV Show 6 Seasons
## 610 610 Suburra: Blood on Rome TV Show 2 Seasons
## 611 611 Cable Girls TV Show 4 Seasons
## 612 612 GLOW TV Show 3 Seasons
## 613 613 IZombie TV Show 5 Seasons
## 614 614 Jane The Virgin TV Show 5 Seasons
## 615 615 No Good Nick TV Show 2 Seasons
## 616 616 Mighty Little Bheem TV Show 2 Seasons
## 617 617 Workin' Moms TV Show 3 Seasons
## 618 618 Million Pound Menu TV Show 2 Seasons
## 619 619 13 Reasons Why TV Show 3 Seasons
## 620 620 HERO MASK TV Show 2 Seasons
## 621 621 Ask the StoryBots TV Show 3 Seasons
## 622 622 Dear White People TV Show 3 Seasons
## 623 623 Derry Girls TV Show 2 Seasons
## 624 624 MINDHUNTER TV Show 2 Seasons
## 625 625 QB1: Beyond the Lights TV Show 3 Seasons
## 626 626 Sacred Games TV Show 2 Seasons
## 627 627 The 100 TV Show 6 Seasons
## 628 628 Knightfall TV Show 2 Seasons
## 629 629 Wentworth TV Show 7 Seasons
## 630 630 Men on a Mission TV Show 4 Seasons
## 631 631 Spirit: Riding Free TV Show 8 Seasons
## 632 632 Kim's Convenience TV Show 3 Seasons
## 633 633 The Protector TV Show 2 Seasons
## 634 634 Cuckoo TV Show 5 Seasons
## 635 635 Samantha! TV Show 2 Seasons
## 636 636 Crazy Ex-Girlfriend TV Show 4 Seasons
## 637 637 Pokémon the Series TV Show 2 Seasons
You work for a multinational company that uses auto-dialer software to contact its customers. When new customers subscribe online they are asked for a phone number but they often forget to add the country code needed for international calls. You were asked to fix this issue in the database. You’ve been given a data frame with national numbers and country codes named phone_nr_df. Now you want to combine the country_code and national_number columns to create valid international numbers.
Use the unite() function to create a new international_number column, using an empty string as the separator.
phone_nr_df <- read.delim("DATABASE/phone_nr_df.txt", sep ="," )
head(phone_nr_df)
## X country country_code national_number
## 1 1 USA 1 2025550117
## 2 2 United Kingdom 44 1632960924
## 3 3 Brazil 55 95552452220
## 4 4 Australia 61 1900654321
## 5 5 China 86 13555953217
## 6 6 India 91 8555843898
phone_nr_df %>%
# Unite the country_code and national_number columns
unite("international_number", country_code, national_number, sep = "")
## X country international_number
## 1 1 USA 12025550117
## 2 2 United Kingdom 441632960924
## 3 3 Brazil 5595552452220
## 4 4 Australia 611900654321
## 5 5 China 8613555953217
## 6 6 India 918555843898
You’re given a sample of the Netflix dataset containing TV shows and their casts called tvshow_df. You want to learn which six actors have the most appearances.
However, the dataset only has one row per TV show, and multiple actors are listed in the cast column.
Transform the data so that for each TV show, every actor has a row. The number of appearances will be calculated for you.
The dplyr package has been pre-loaded for you. Use separate_rows() on the cast column, using the appropriate separator for the sep argument.
tvshow_df <- read.delim("DATABASE/tvshow_df.txt", sep ="," ) %>% select(.,c(-"X"))
tvshow_df <- tvshow_df %>%
mutate(
OTRA = as.character(tvshow_df$cast)
) %>% select(., c("OTRA", "title")) %>% setnames(., c("cast","title" ))
Use the head() function to keep just the top six.
tvshow_df %>%
# Separate the actors in the cast column over multiple rows
separate_rows(cast, sep = ", ") %>%
rename(actor = cast) %>%
count(actor, sort = TRUE) %>%
head()
## # A tibble: 6 x 2
## actor n
## <chr> <int>
## 1 Takahiro Sakurai 18
## 2 Yuki Kaji 16
## 3 Daisuke Ono 14
## 4 David Attenborough 14
## 5 Ashleigh Ball 12
## 6 Hiroshi Kamiya 12
Remember the drink ingredients data from the video? You’ve been given an similar version (drink_df) that also includes quantities and units. Now you want to create an overview of how much of each ingredient you should buy to make these drinks.
The dplyr package has been pre-loaded for you.
Inspect drink_df in the console to find the right separator in the ingredients column.
drink_df <- read.delim("DATABASE/drink_df.txt", sep ="," ) %>% select(., c(-"X"))
drink_df <- drink_df %>% mutate(
ingredients_off = as.character(drink_df$ingredients)
) %>% select(., c("drink", "ingredients_off")) %>% setnames(., c("drink", "ingredients"))
Separate the ingredients column so that for each drink each ingredient gets a row.
drink_df %>%
# Separate the ingredients over rows
separate_rows(ingredients, sep = ";")
## # A tibble: 9 x 2
## drink ingredients
## <fct> <chr>
## 1 Chocolate milk "milk 0.3 L"
## 2 Chocolate milk " chocolate 40 g"
## 3 Chocolate milk " sugar 10 g"
## 4 Orange juice "oranges 3"
## 5 Orange juice " sugar 20 g"
## 6 Cappuccino "milk 0.1 L"
## 7 Cappuccino " water 0.1 L"
## 8 Cappuccino " coffee 30 g"
## 9 Cappuccino " sugar 5 g"
Inspect the output of the previous step to find the separator that splits the ingredients column into three columns: ingredient, quantity, and unit. Make sure to convert data types to numeric when possible.
drink_df %>%
# Separate the ingredients over rows
separate_rows(ingredients, sep = "; ") %>%
# Separate ingredients into three columns
separate(
ingredients,
into = c("ingredient", "quantity", "unit"),
sep = " ",
convert = TRUE
)
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [4].
## # A tibble: 9 x 4
## drink ingredient quantity unit
## <fct> <chr> <dbl> <chr>
## 1 Chocolate milk milk 0.3 L
## 2 Chocolate milk chocolate 40 g
## 3 Chocolate milk sugar 10 g
## 4 Orange juice oranges 3 <NA>
## 5 Orange juice sugar 20 g
## 6 Cappuccino milk 0.1 L
## 7 Cappuccino water 0.1 L
## 8 Cappuccino coffee 30 g
## 9 Cappuccino sugar 5 g
Group the data by ingredient and unit. Calculate the total quantity of each ingredient.
drink_df %>%
# Separate the ingredients over rows
separate_rows(ingredients, sep = "; ") %>%
# Separate ingredients into three columns
separate(
ingredients,
into = c("ingredient", "quantity", "unit"),
sep = " ",
convert = TRUE
) %>%
# Group by ingredient and unit
group_by(ingredient, unit) %>%
# Calculate the total quantity of each ingredient
summarize(quantity = sum(quantity))
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [4].
## `summarise()` has grouped output by 'ingredient'. You can override using the `.groups` argument.
## # A tibble: 6 x 3
## # Groups: ingredient [6]
## ingredient unit quantity
## <chr> <chr> <dbl>
## 1 chocolate g 40
## 2 coffee g 30
## 3 milk L 0.4
## 4 oranges <NA> 3
## 5 sugar g 35
## 6 water L 0.1
You’re working on a sample of the Netflix dataset pre-loaded as director_df. This time, the data frame contains just the directors and movie titles. Your goal is to identify the directors who created the most movies. Since the director column contains multiple names, you’ll first separate its values over multiple rows and then count the directors.
Since you don’t want movies without directors polluting your overview, you’ll apply the drop_na() function.
The dplyr package has been pre-loaded for you.
director_df <- read.delim("DATABASE/director_df.txt", sep ="," ) %>% select(., c(-"X"))
Inspect director_df in the console to see what string separates directors in the director column. Spread the values in the director column over separate rows.
director_df <- director_df %>% mutate(
ingredients_off = as.character(director_df$director)
) %>% select(., c("ingredients_off", "title")) %>% setnames(., c("director", "title"))
director_df %>%
# Spread the director column over separate rows
separate_rows(director, sep = ", ")
## # A tibble: 4,815 x 2
## director title
## <chr> <fct>
## 1 Richard Finn Norm of the North: King Sized Adventure
## 2 Tim Maltby Norm of the North: King Sized Adventure
## 3 <NA> Jandino: Whatever it Takes
## 4 Fernando Lebrija #realityhigh
## 5 Gabe Ibáñez Automata
## 6 Rodrigo Toro Fabrizio Copano: Solo pienso en mi
## 7 Francisco Schultz Fabrizio Copano: Solo pienso en mi
## 8 Henrik Ruben Genz Good People
## 9 José Miguel Contreras Joaquín Reyes: Una y no más
## 10 Daniel Alfredson Kidnapping Mr. Heineken
## # … with 4,805 more rows
Count the number of times each director appears in the data. Make sure to sort the output.
director_df %>%
# Spread the director column over separate rows
separate_rows(director, sep = ", ") %>%
# Count the number of movies per director
count(director, sort = TRUE)
## # A tibble: 3,536 x 2
## director n
## <chr> <int>
## 1 <NA> 128
## 2 Jan Suter 21
## 3 Raúl Campos 19
## 4 Jay Karas 14
## 5 Marcus Raboy 14
## 6 Jay Chapman 12
## 7 Martin Scorsese 9
## 8 Steven Spielberg 9
## 9 David Dhawan 8
## 10 Johnnie To 8
## # … with 3,526 more rows
Drop rows containing NA values in the director column.
director_df %>%
# Drop rows with NA values in the director column
drop_na() %>%
# Spread the director column over separate rows
separate_rows(director, sep = ", ") %>%
# Count the number of movies per director
count(director, sort = TRUE)
## # A tibble: 3,535 x 2
## director n
## <chr> <int>
## 1 Jan Suter 21
## 2 Raúl Campos 19
## 3 Jay Karas 14
## 4 Marcus Raboy 14
## 5 Jay Chapman 12
## 6 Martin Scorsese 9
## 7 Steven Spielberg 9
## 8 David Dhawan 8
## 9 Johnnie To 8
## 10 Lance Bangs 8
## # … with 3,525 more rows
You’ve been asked to create a report that allows management to compare sales figures per quarter for two years. The problem is that the dataset (sales_df) contains missing values. You’ll need to impute the values in the year column so that you can visualize the data.
The ggplot2 package has been pre-loaded for you.
sales_df <- read.delim("DATABASE/sales_df.txt", sep ="," ) %>% select(., c(-"X"))
Inspect sales_df in the console, pay attention to the year column. Use the fill() function to impute the year column in the correct direction. Create a line plot where each year has a different color.
sales_df %>%
# Impute the year column
fill(year,.direction = "up") %>%
# Create a line plot with sales per quarter colored by year.
ggplot(aes(x = quarter, y = sales, color = year, group = year)) +
geom_line()
Since WWII, a number of nations have been detonating nuclear bombs for military research. A tally of bombs detonated per nation has been calculated from the Nuclear Explosion DataBase (NEDB) and provided as nuke_df. You are interested in finding out how many bombs have been detonated by nations grouped per continent. To achieve this goal, nuke_df will be joined to country_to_continent_df which is a mapping of nation to continent. You will need to overwrite missing values with zeros so that you can create a nice plot.
The dplyr and ggplot2 packages have been pre-loaded for you.
Side note 1: Bombs detonated by the Soviet Union were attributed to the Russian Federation.
Side note 2: The Russian Federation is solely mapped to Europe for simplicity.
country_to_continent_df <- read.delim("DATABASE/country_to_continent_df.txt", sep ="," ) %>% select(., c(-"X"))
nuke_df <- read.delim("DATABASE/nuke_df.txt", sep ="," ) %>% select(., c(-"X"))
# country_to_continent_df %>%
# left_join(nuke_df, by = "country_code") %>%
# # Impute the missing values in the n_bombs column with 0L
# replace_na(list(n_bombs = 0L))
Group the dataset by continent and aggregate the data by summing the number of bombs.
# country_to_continent_df %>%
# left_join(nuke_df, by = "country_code") %>%
# # Impute the missing values in the n_bombs column with 0L
# replace_na(list(n_bombs = 0L)) %>%
# # Group the dataset by continent
# group_by(continent) %>%
# # Sum the number of bombs per continent
# summarize(n_bombs_continent = sum(n_bombs))
Plot the summed number of bombs detonated by nations from each continent.
# country_to_continent_df %>%
# left_join(nuke_df, by = "country_code") %>%
# # Impute the missing values in the n_bombs column with 0L
# replace_na(list(n_bombs = 0L)) %>%
# # Group the dataset by continent
# group_by(continent) %>%
# # Sum the number of bombs per continent
# summarize(n_bombs_continent = sum(n_bombs)) %>%
# # Plot the number of bombs per continent
# ggplot(aes(x = continent, y =n_bombs_continent)) +
# geom_col()
You’ve been given a version of the Nuclear Explosion DataBase (NEDB) where country names are specified in the column headers (nuke_df). You want to visualize how many nukes were detonated per year per country. You’ll need to pivot the data and replace NA values first.
The ggplot2 package has been pre-loaded for you.
nuke_df <- read.delim("DATABASE/nuke_df.txt", sep = ",") %>% select(., c(-"X"))
head(nuke_df)
## year United.States Russian.Federation United.Kingdom France China India
## 1 1945 3 NA NA NA NA NA
## 2 1946 2 NA NA NA NA NA
## 3 1947 NA NA NA NA NA NA
## 4 1948 3 NA NA NA NA NA
## 5 1949 NA 1 NA NA NA NA
## 6 1950 NA NA NA NA NA NA
## Pakistan North.Korea
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
Pivot all columns except for year to a longer format.
nuke_df %>%
# Pivot the data to a longer format
pivot_longer(-year)
## # A tibble: 584 x 3
## year name value
## <int> <chr> <int>
## 1 1945 United.States 3
## 2 1945 Russian.Federation NA
## 3 1945 United.Kingdom NA
## 4 1945 France NA
## 5 1945 China NA
## 6 1945 India NA
## 7 1945 Pakistan NA
## 8 1945 North.Korea NA
## 9 1946 United.States 2
## 10 1946 Russian.Federation NA
## # … with 574 more rows
The country names are now in the name column. Overwrite its name with country. The value column should be named n_bombs.
nuke_df %>%
# Pivot the data to a longer format
pivot_longer(
-year,
# Overwrite the names of the two new columns
names_to = "country",
values_to = "n_bombs"
)
## # A tibble: 584 x 3
## year country n_bombs
## <int> <chr> <int>
## 1 1945 United.States 3
## 2 1945 Russian.Federation NA
## 3 1945 United.Kingdom NA
## 4 1945 France NA
## 5 1945 China NA
## 6 1945 India NA
## 7 1945 Pakistan NA
## 8 1945 North.Korea NA
## 9 1946 United.States 2
## 10 1946 Russian.Federation NA
## # … with 574 more rows
Replace the NA values in the n_bombs column with integer zero values (0L).
nuke_df %>%
# Pivot the data to a longer format
pivot_longer(
-year,
# Overwrite the names of the two new columns
names_to = "country",
values_to = "n_bombs"
) %>%
# Replace NA values for n_bombs with 0L
replace_na(list(n_bombs = 0L))
## # A tibble: 584 x 3
## year country n_bombs
## <int> <chr> <int>
## 1 1945 United.States 3
## 2 1945 Russian.Federation 0
## 3 1945 United.Kingdom 0
## 4 1945 France 0
## 5 1945 China 0
## 6 1945 India 0
## 7 1945 Pakistan 0
## 8 1945 North.Korea 0
## 9 1946 United.States 2
## 10 1946 Russian.Federation 0
## # … with 574 more rows
Create a line plot where the number of bombs dropped per country is plotted over time. Use country to color the lines.
nuke_df %>%
# Pivot the data to a longer format
pivot_longer(
-year,
# Overwrite the names of the two new columns
names_to = "country",
values_to = "n_bombs"
) %>%
# Replace NA values for n_bombs with 0L
replace_na(list(n_bombs = 0L)) %>%
# Plot the number of bombs per country over time
ggplot(aes(x = year, y = n_bombs, color = country)) +
geom_line()
According to the World Health Organization (WHO), worldwide obesity has nearly tripled since 1975. You’re interested in the severity of this global health issue per country and whether males and females are affected differently. You’ll use the WHO’s obesity data (obesity_df) to investigate this issue. The data holds the percentage of females, males, and both sexes combined that are considered obese (BMI > 30) per country.
You want to create a scatterplot where, per nation, you can see the obesity data colored differently for females and males. This implies that sex should become a variable with its own column.
The ggplot2 package has been pre-loaded for you.
obesity_df <- read.delim("DATABASE/obesity_df.txt", sep = ",") %>% select(., c(-"X"))
head(obesity_df)
## country male female both_sexes
## 1 Afghanistan 3.2 7.6 5.5
## 2 Albania 21.6 21.8 21.7
## 3 Algeria 19.9 34.9 27.4
## 4 Andorra 25.9 25.3 25.6
## 5 Angola 4.0 12.1 8.2
## 6 Antigua and Barbuda 11.6 25.9 18.9
Inspect obesity_df in the console. Pivot the male and female columns. The old column names should go in the sex column, the original values should go in the pct_obese column.
obesity_df %>%
# Pivot the male and female columns
pivot_longer(c("male", "female"),
names_to = "sex",
values_to = "pct_obese"
)
## # A tibble: 380 x 4
## country both_sexes sex pct_obese
## <fct> <dbl> <chr> <dbl>
## 1 Afghanistan 5.5 male 3.2
## 2 Afghanistan 5.5 female 7.6
## 3 Albania 21.7 male 21.6
## 4 Albania 21.7 female 21.8
## 5 Algeria 27.4 male 19.9
## 6 Algeria 27.4 female 34.9
## 7 Andorra 25.6 male 25.9
## 8 Andorra 25.6 female 25.3
## 9 Angola 8.2 male 4
## 10 Angola 8.2 female 12.1
## # … with 370 more rows
Create a scatterplot with pct_obese per country colored by sex. The country variable has been ordered by overall obesity % and added for you.
obesity_df %>%
# Pivot the male and female columns
pivot_longer(c(male, female),
names_to = "sex",
values_to = "pct_obese") %>%
# Create a scatter plot with pct_obese per country colored by sex
ggplot(aes(x = pct_obese, color = sex,
y = forcats::fct_reorder(country, both_sexes))) +
geom_point() +
scale_y_discrete(breaks = c("India", "Nauru", "Cuba", "Brazil",
"Pakistan", "Gabon", "Italy", "Oman",
"China", "United States of America")) +
labs(x = "% Obese", y = "Country")
You’ve been given a James Bond movie dataset (bond_df) and want to visualize the number of movies that Bond actors have featured in per decade. However, the data is in an untidy format with the decade values captured in the column headers. You’ll tidy this dataset to give each variable its own column.
bond_df <- read.delim("DATABASE/bond_df.txt", sep = ",") %>% select(., c(-"X"))
#nos importa los nombrs mal
columnas <- names(bond_df)
#vemos cual es el tamaño de las columnas
length <- length(columnas)
for(i in 1:length){
#seleccionamos la columna
columna <- columnas[i]
# print(columna)
#ESTABLECEMOS LA CONDICION
CONDICION <- str_detect(columna, "^X")
if(CONDICION == TRUE){
#CORREGIMOS EL ERROR
CORREGIDO <- str_extract_all(columnas[i], "[:digit:]+", simplify = TRUE)
#reemplazamos el nombre de la columna mal escrita
columnas[i] <- replace(columnas[i], CONDICION == TRUE, CORREGIDO)
}else{
next
}
}
#implementamos los nombres bien
names(bond_df) <- columnas
#vemos el resultado
head(bond_df)
## Bond 1960 1970 1980 1990 2000 2010 2020
## 1 Sean Connery 5 1 NA NA NA NA NA
## 2 David Niven 1 NA NA NA NA NA NA
## 3 George Lazenby 1 NA NA NA NA NA NA
## 4 Roger Moore NA 4 3 NA NA NA NA
## 5 Timothy Dalton NA NA 2 NA NA NA NA
## 6 Pierce Brosnan NA NA NA 3 1 NA NA
The ggplot2 package has been pre-loaded for you.
bond_df %>%
# Pivot the data to long format and set the column names
pivot_longer(-"Bond",
names_to = "decade",
values_to = "n_movies")
## # A tibble: 49 x 3
## Bond decade n_movies
## <fct> <chr> <dbl>
## 1 Sean Connery 1960 5
## 2 Sean Connery 1970 1
## 3 Sean Connery 1980 NA
## 4 Sean Connery 1990 NA
## 5 Sean Connery 2000 NA
## 6 Sean Connery 2010 NA
## 7 Sean Connery 2020 NA
## 8 David Niven 1960 1
## 9 David Niven 1970 NA
## 10 David Niven 1980 NA
## # … with 39 more rows
Drop any NA values in the n_movies column while it is created.
bond_df %>%
# Pivot the data to long format
pivot_longer(
-Bond,
# Overwrite the names of the two newly created columns
names_to = "decade",
values_to = "n_movies",
# Drop na values
values_drop_na = TRUE
)
## # A tibble: 12 x 3
## Bond decade n_movies
## <fct> <chr> <dbl>
## 1 Sean Connery 1960 5
## 2 Sean Connery 1970 1
## 3 David Niven 1960 1
## 4 George Lazenby 1960 1
## 5 Roger Moore 1970 4
## 6 Roger Moore 1980 3
## 7 Timothy Dalton 1980 2
## 8 Pierce Brosnan 1990 3
## 9 Pierce Brosnan 2000 1
## 10 Daniel Craig 2000 2
## 11 Daniel Craig 2010 2
## 12 Daniel Craig 2020 1
Transform the decade column data type to integer.
bond_df %>%
# Pivot the data to long format
pivot_longer(
-Bond,
# Overwrite the names of the two newly created columns
names_to = "decade",
values_to = "n_movies",
# Drop na values
values_drop_na = TRUE,
# Transform the decade column data type to integer
names_transform = list(decade = as.integer)
) %>%
ggplot(aes(x = decade + 5, y = n_movies, fill = Bond))+
geom_col()
Every year New Zealanders vote en masse to decide which species gets the bird of the year trophy. The contest is organized by the Forest & Bird agency which allows each person to give points to up to five birds (first pick gets 5 points, second 4, …). Your job is to decide this year’s winner from the messy dataset that’s been pre-loaded for you as bird_df.
The dplyr package has been pre-loaded for you.
bird_df <- read.delim("DATABASE/bird_df.txt", sep = ",") %>% select(., c(-"X"))
head(bird_df)
## points_5 points_4 points_3
## 1 Yellow-eyed penguin Kea Tūī
## 2 Yellow-eyed penguin Kererū Tūī
## 3 Fiordland Crested Penguin Bar-tailed Godwit Banded Dotterel
## 4 Fantail Kea Tūī
## 5 Kākāpō Little Spotted Kiwi Rockhopper Penguin
## 6 Yellow-eyed penguin <NA> <NA>
## points_2 points_1
## 1 Southern Royal Albatross Southern Brown Kiwi
## 2 Saddleback Rockhopper Penguin
## 3 Black Robin Blue Duck
## 4 Kākā Kākāpō
## 5 Kōkako Shore Plover
## 6 <NA> <NA>
Inspect bird_df in the console. Pivot bird_df to longer format so that an integer column points and a character column species are created. Use the names_prefix argument to clean up the points column and make sure no NA values remain.
bird_df %>%
# Pivot the data to create a two column data frame
pivot_longer(
starts_with("points_"),
names_to = "points",
names_prefix = "points_",
names_transform = list(points = as.integer),
values_to = "species",
values_drop_na = TRUE
)
## # A tibble: 46,808 x 2
## points species
## <int> <fct>
## 1 5 Yellow-eyed penguin
## 2 4 Kea
## 3 3 Tūī
## 4 2 Southern Royal Albatross
## 5 1 Southern Brown Kiwi
## 6 5 Yellow-eyed penguin
## 7 4 Kererū
## 8 3 Tūī
## 9 2 Saddleback
## 10 1 Rockhopper Penguin
## # … with 46,798 more rows
Calculate the total_points each species got.
bird_df %>%
# Pivot the data to create a 2 column data frame
pivot_longer(
starts_with("points_"),
names_to = "points",
names_prefix = "points_",
names_transform = list(points = as.integer),
values_to = "species",
values_drop_na = TRUE
) %>%
group_by(species) %>%
summarize(total_points = sum(points)) %>%
slice_max(total_points, n = 5)
## # A tibble: 5 x 2
## species total_points
## <fct> <int>
## 1 Yellow-eyed penguin 13522
## 2 Kākāpō 8017
## 3 Black Robin 6542
## 4 Banded Dotterel 5988
## 5 Kākā 5229
You’re an analyst at an investment firm and want to visualize the weekly closing prices of five big tech firms’ stocks. However, the dataset you’ve been handed (stock_df) is messy and has the year and week variables stored in the column headers. You’ll pivot this data into a tidy format, extract the variables from the headers, and create a line plot.
stock_df <- read.delim("DATABASE/stock_df.txt", sep = ",") %>% select(., c(-"X"))
head(stock_df)
## company X2019_week1 X2019_week2 X2019_week3 X2019_week4 X2019_week5
## 1 Amazon 1847.8400 1640.5601 1696.200 1670.57 1626.23
## 2 Apple 73.4125 38.0725 39.205 39.44 41.63
## 3 Facebook 205.2500 143.8000 150.040 149.01 165.71
## 4 Google 1337.0200 1057.1899 1098.260 1090.99 1110.75
## 5 Microsoft 157.7000 102.8000 107.710 107.17 102.78
## X2019_week6 X2019_week7 X2019_week8 X2019_week9 X2019_week10 X2019_week11
## 1 1588.2200 1607.950 1631.5601 1671.7300 1620.8000 1712.36
## 2 42.6025 42.605 43.2425 43.7425 43.2275 46.53
## 3 167.3300 162.500 161.8900 162.2800 169.6000 165.98
## 4 1095.0601 1113.650 1110.3700 1140.9900 1142.3199 1184.46
## 5 105.6700 108.220 110.9700 112.5300 110.5100 115.91
## X2019_week12 X2019_week13 X2019_week14 X2019_week15 X2019_week16 X2019_week17
## 1 1764.7700 1780.7500 1837.28 1843.0601 1861.690 1950.630
## 2 47.7625 47.4875 49.25 49.7175 50.965 51.075
## 3 164.3400 166.6900 175.72 179.1000 178.280 191.490
## 4 1205.5000 1173.3101 1207.15 1217.8700 1236.370 1272.180
## 5 117.0500 117.9400 119.89 120.9500 123.370 129.890
## X2019_week18 X2019_week19 X2019_week20 X2019_week21 X2019_week22 X2019_week23
## 1 1962.4600 1889.980 1869.00 1823.2800 1775.0699 1804.0300
## 2 52.9375 49.295 47.25 44.7425 43.7675 47.5375
## 3 195.4700 188.340 185.30 181.0600 177.4700 173.3500
## 4 1185.4000 1164.270 1162.30 1133.4700 1103.6300 1066.0400
## 5 128.9000 127.130 128.07 126.2400 123.6800 131.4000
## X2019_week24 X2019_week25 X2019_week26 X2019_week27 X2019_week28 X2019_week29
## 1 1869.670 1911.300 1893.63 1942.9100 2011.000 1964.5200
## 2 48.185 49.695 49.48 51.0575 50.825 50.6475
## 3 181.330 191.140 193.00 196.4000 204.870 198.3600
## 4 1085.350 1121.880 1080.91 1131.5900 1144.900 1130.1000
## 5 132.450 136.970 133.96 137.0600 138.900 136.6200
## X2019_week30 X2019_week31 X2019_week32 X2019_week33 X2019_week34 X2019_week35
## 1 1943.050 1823.240 1807.5800 1792.570 1749.62 1776.290
## 2 51.935 51.005 50.2475 51.625 50.66 52.185
## 3 199.750 189.020 187.8500 183.700 177.75 185.670
## 4 1250.410 1193.990 1188.0100 1177.600 1151.29 1188.100
## 5 141.340 136.900 137.7100 136.130 133.39 137.860
## X2019_week36 X2019_week37 X2019_week38 X2019_week39 X2019_week40 X2019_week41
## 1 1833.510 1839.3400 1794.1600 1725.450 1739.6500 1731.9200
## 2 53.315 54.6875 54.4325 54.705 56.7525 59.0525
## 3 187.490 187.1900 189.9300 177.100 180.4500 184.1900
## 4 1204.930 1239.5601 1229.9301 1225.090 1209.0000 1215.4500
## 5 139.100 137.3200 139.4400 137.730 138.1200 139.6800
## X2019_week42 X2019_week43 X2019_week44 X2019_week45 X2019_week46 X2019_week47
## 1 1757.5100 1761.330 1791.440 1785.880 1739.49 1745.720
## 2 59.1025 61.645 63.955 65.035 66.44 65.445
## 3 185.8500 187.890 193.620 190.840 195.10 198.820
## 4 1245.4900 1265.130 1273.740 1311.370 1334.87 1295.340
## 5 137.4100 140.730 143.720 145.960 149.97 149.590
## X2019_week48 X2019_week49 X2019_week50 X2019_week51 X2019_week52 X2020_week1
## 1 1800.8000 1751.6000 1760.9399 1786.50 1869.80 1874.9700
## 2 66.8125 67.6775 68.7875 69.86 72.45 74.3575
## 3 201.6400 201.0500 194.1100 206.30 208.10 208.6700
## 4 1304.9600 1340.6200 1347.8300 1349.59 1351.89 1360.6600
## 5 151.3800 151.7500 154.5300 157.41 158.96 158.6200
## X2020_week2 X2020_week3 X2020_week4 X2020_week5 X2020_week6 X2020_week7
## 1 1883.1600 1864.7200 1861.6400 2008.7200 2079.2800 2134.8701
## 2 77.5825 79.6825 79.5775 77.3775 80.0075 81.2375
## 3 218.0600 222.1400 217.9400 201.9100 212.3300 214.1800
## 4 1429.7300 1480.3900 1466.7100 1434.2300 1479.2300 1520.7400
## 5 161.3400 167.1000 165.0400 170.2300 183.8900 185.3500
## X2020_week8 X2020_week9 X2020_week10 X2020_week11 X2020_week12 X2020_week13
## 1 2095.9700 1883.75 1901.0900 1785.0000 1846.09 1900.100
## 2 78.2625 68.34 72.2575 69.4925 57.31 61.935
## 3 210.1800 192.47 181.0900 170.2800 149.73 156.790
## 4 1485.1100 1339.33 1298.4100 1219.7300 1072.32 1110.710
## 5 178.5900 162.01 161.5700 158.8300 137.35 149.700
## X2020_week14 X2020_week15 X2020_week16 X2020_week17 X2020_week18 X2020_week19
## 1 1906.5900 2042.7600 2375.00 2410.2200 2286.0400 2379.6101
## 2 60.3525 66.9975 70.70 70.7425 72.2675 77.5325
## 3 154.1800 175.1900 179.24 190.0700 202.2700 212.3500
## 4 1097.8800 1211.4500 1283.25 1279.3101 1320.6100 1388.3700
## 5 153.8300 165.1400 178.60 174.5500 174.5700 184.6800
## X2020_week20 X2020_week21 X2020_week22 X2020_week23 X2020_week24 X2020_week25
## 1 2409.7800 2436.8799 2442.370 2483.000 2545.02 2675.01
## 2 76.9275 79.7225 79.485 82.875 84.70 87.43
## 3 210.8800 234.9100 225.090 230.770 228.58 238.79
## 4 1373.1899 1410.4200 1428.920 1438.390 1413.18 1431.72
## 5 183.1600 183.5100 183.250 187.200 187.74 195.15
## X2020_week26 X2020_week27 X2020_week28 X2020_week29 X2020_week30 X2020_week31
## 1 2692.8701 2890.3000 3200.00 2961.9700 3008.910 3164.68
## 2 88.4075 91.0275 95.92 96.3275 92.615 106.26
## 3 216.0800 233.4200 245.07 242.0300 230.710 253.67
## 4 1359.9000 1464.7000 1541.74 1515.5500 1511.870 1482.96
## 5 196.3300 206.2600 213.67 202.8800 201.300 205.01
## X2020_week32 X2020_week33 X2020_week34 X2020_week35 X2020_week36 X2020_week37
## 1 3167.4600 3148.0200 3284.72 3401.8000 3294.62 3116.22
## 2 111.1125 114.9075 124.37 124.8075 120.96 112.00
## 3 268.4400 261.2400 267.01 293.6600 282.73 266.61
## 4 1494.4900 1507.7300 1580.42 1644.4100 1591.04 1520.72
## 5 212.4800 208.9000 213.02 228.9100 214.25 204.03
## X2020_week38 X2020_week39 X2020_week40 X2020_week41 X2020_week42 X2020_week43
## 1 2954.91 3095.13 3125.00 3286.65 3272.71 3204.40
## 2 106.84 112.28 113.02 116.97 119.02 115.04
## 3 252.53 254.82 259.94 264.45 265.93 284.79
## 4 1459.99 1444.96 1458.42 1515.22 1573.01 1641.00
## 5 200.39 207.82 206.19 215.81 219.66 216.23
## X2020_week44 X2020_week45 X2020_week46 X2020_week47 X2020_week48 X2020_week49
## 1 3036.15 3311.37 3128.81 3099.40 3195.34 3162.58
## 2 108.86 118.69 119.26 117.34 116.59 122.25
## 3 263.11 293.41 276.95 269.70 277.81 279.70
## 4 1621.01 1761.75 1777.02 1742.19 1793.19 1827.99
## 5 202.47 223.72 216.51 210.39 215.23 214.36
## X2020_week50 X2020_week51 X2020_week52 X2020_week53
## 1 3116.42 3201.65 3172.69 3256.93
## 2 122.41 126.66 131.97 132.69
## 3 273.55 276.40 267.40 273.16
## 4 1781.77 1731.01 1738.85 1751.88
## 5 213.26 218.59 222.75 222.42
The ggplot2 package has been pre-loaded for you.
#nos importa los nombrs mal
columnas <- names(stock_df)
#vemos cual es el tamaño de las columnas
length <- length(columnas)
for(i in 1:length){
#seleccionamos la columna
columna <- columnas[i]
# print(columna)
#ESTABLECEMOS LA CONDICION
CONDICION <- str_detect(columna, "^X")
# print(CONDICION)
if(CONDICION == TRUE){
#CORREGIMOS EL ERROR
CORREGIDO <- str_extract_all(columnas[i], "[:digit:]+\\_+[:alnum:]+", simplify = TRUE)
# print(CORREGIDO)
#reemplazamos el nombre de la columna mal escrita
columnas[i] <- replace(columnas[i], CONDICION == TRUE, CORREGIDO)
}else{
next
}
}
#implementamos los nombres bien
names(stock_df) <- columnas
#vemos el resultado
head(stock_df)
## company 2019_week1 2019_week2 2019_week3 2019_week4 2019_week5 2019_week6
## 1 Amazon 1847.8400 1640.5601 1696.200 1670.57 1626.23 1588.2200
## 2 Apple 73.4125 38.0725 39.205 39.44 41.63 42.6025
## 3 Facebook 205.2500 143.8000 150.040 149.01 165.71 167.3300
## 4 Google 1337.0200 1057.1899 1098.260 1090.99 1110.75 1095.0601
## 5 Microsoft 157.7000 102.8000 107.710 107.17 102.78 105.6700
## 2019_week7 2019_week8 2019_week9 2019_week10 2019_week11 2019_week12
## 1 1607.950 1631.5601 1671.7300 1620.8000 1712.36 1764.7700
## 2 42.605 43.2425 43.7425 43.2275 46.53 47.7625
## 3 162.500 161.8900 162.2800 169.6000 165.98 164.3400
## 4 1113.650 1110.3700 1140.9900 1142.3199 1184.46 1205.5000
## 5 108.220 110.9700 112.5300 110.5100 115.91 117.0500
## 2019_week13 2019_week14 2019_week15 2019_week16 2019_week17 2019_week18
## 1 1780.7500 1837.28 1843.0601 1861.690 1950.630 1962.4600
## 2 47.4875 49.25 49.7175 50.965 51.075 52.9375
## 3 166.6900 175.72 179.1000 178.280 191.490 195.4700
## 4 1173.3101 1207.15 1217.8700 1236.370 1272.180 1185.4000
## 5 117.9400 119.89 120.9500 123.370 129.890 128.9000
## 2019_week19 2019_week20 2019_week21 2019_week22 2019_week23 2019_week24
## 1 1889.980 1869.00 1823.2800 1775.0699 1804.0300 1869.670
## 2 49.295 47.25 44.7425 43.7675 47.5375 48.185
## 3 188.340 185.30 181.0600 177.4700 173.3500 181.330
## 4 1164.270 1162.30 1133.4700 1103.6300 1066.0400 1085.350
## 5 127.130 128.07 126.2400 123.6800 131.4000 132.450
## 2019_week25 2019_week26 2019_week27 2019_week28 2019_week29 2019_week30
## 1 1911.300 1893.63 1942.9100 2011.000 1964.5200 1943.050
## 2 49.695 49.48 51.0575 50.825 50.6475 51.935
## 3 191.140 193.00 196.4000 204.870 198.3600 199.750
## 4 1121.880 1080.91 1131.5900 1144.900 1130.1000 1250.410
## 5 136.970 133.96 137.0600 138.900 136.6200 141.340
## 2019_week31 2019_week32 2019_week33 2019_week34 2019_week35 2019_week36
## 1 1823.240 1807.5800 1792.570 1749.62 1776.290 1833.510
## 2 51.005 50.2475 51.625 50.66 52.185 53.315
## 3 189.020 187.8500 183.700 177.75 185.670 187.490
## 4 1193.990 1188.0100 1177.600 1151.29 1188.100 1204.930
## 5 136.900 137.7100 136.130 133.39 137.860 139.100
## 2019_week37 2019_week38 2019_week39 2019_week40 2019_week41 2019_week42
## 1 1839.3400 1794.1600 1725.450 1739.6500 1731.9200 1757.5100
## 2 54.6875 54.4325 54.705 56.7525 59.0525 59.1025
## 3 187.1900 189.9300 177.100 180.4500 184.1900 185.8500
## 4 1239.5601 1229.9301 1225.090 1209.0000 1215.4500 1245.4900
## 5 137.3200 139.4400 137.730 138.1200 139.6800 137.4100
## 2019_week43 2019_week44 2019_week45 2019_week46 2019_week47 2019_week48
## 1 1761.330 1791.440 1785.880 1739.49 1745.720 1800.8000
## 2 61.645 63.955 65.035 66.44 65.445 66.8125
## 3 187.890 193.620 190.840 195.10 198.820 201.6400
## 4 1265.130 1273.740 1311.370 1334.87 1295.340 1304.9600
## 5 140.730 143.720 145.960 149.97 149.590 151.3800
## 2019_week49 2019_week50 2019_week51 2019_week52 2020_week1 2020_week2
## 1 1751.6000 1760.9399 1786.50 1869.80 1874.9700 1883.1600
## 2 67.6775 68.7875 69.86 72.45 74.3575 77.5825
## 3 201.0500 194.1100 206.30 208.10 208.6700 218.0600
## 4 1340.6200 1347.8300 1349.59 1351.89 1360.6600 1429.7300
## 5 151.7500 154.5300 157.41 158.96 158.6200 161.3400
## 2020_week3 2020_week4 2020_week5 2020_week6 2020_week7 2020_week8 2020_week9
## 1 1864.7200 1861.6400 2008.7200 2079.2800 2134.8701 2095.9700 1883.75
## 2 79.6825 79.5775 77.3775 80.0075 81.2375 78.2625 68.34
## 3 222.1400 217.9400 201.9100 212.3300 214.1800 210.1800 192.47
## 4 1480.3900 1466.7100 1434.2300 1479.2300 1520.7400 1485.1100 1339.33
## 5 167.1000 165.0400 170.2300 183.8900 185.3500 178.5900 162.01
## 2020_week10 2020_week11 2020_week12 2020_week13 2020_week14 2020_week15
## 1 1901.0900 1785.0000 1846.09 1900.100 1906.5900 2042.7600
## 2 72.2575 69.4925 57.31 61.935 60.3525 66.9975
## 3 181.0900 170.2800 149.73 156.790 154.1800 175.1900
## 4 1298.4100 1219.7300 1072.32 1110.710 1097.8800 1211.4500
## 5 161.5700 158.8300 137.35 149.700 153.8300 165.1400
## 2020_week16 2020_week17 2020_week18 2020_week19 2020_week20 2020_week21
## 1 2375.00 2410.2200 2286.0400 2379.6101 2409.7800 2436.8799
## 2 70.70 70.7425 72.2675 77.5325 76.9275 79.7225
## 3 179.24 190.0700 202.2700 212.3500 210.8800 234.9100
## 4 1283.25 1279.3101 1320.6100 1388.3700 1373.1899 1410.4200
## 5 178.60 174.5500 174.5700 184.6800 183.1600 183.5100
## 2020_week22 2020_week23 2020_week24 2020_week25 2020_week26 2020_week27
## 1 2442.370 2483.000 2545.02 2675.01 2692.8701 2890.3000
## 2 79.485 82.875 84.70 87.43 88.4075 91.0275
## 3 225.090 230.770 228.58 238.79 216.0800 233.4200
## 4 1428.920 1438.390 1413.18 1431.72 1359.9000 1464.7000
## 5 183.250 187.200 187.74 195.15 196.3300 206.2600
## 2020_week28 2020_week29 2020_week30 2020_week31 2020_week32 2020_week33
## 1 3200.00 2961.9700 3008.910 3164.68 3167.4600 3148.0200
## 2 95.92 96.3275 92.615 106.26 111.1125 114.9075
## 3 245.07 242.0300 230.710 253.67 268.4400 261.2400
## 4 1541.74 1515.5500 1511.870 1482.96 1494.4900 1507.7300
## 5 213.67 202.8800 201.300 205.01 212.4800 208.9000
## 2020_week34 2020_week35 2020_week36 2020_week37 2020_week38 2020_week39
## 1 3284.72 3401.8000 3294.62 3116.22 2954.91 3095.13
## 2 124.37 124.8075 120.96 112.00 106.84 112.28
## 3 267.01 293.6600 282.73 266.61 252.53 254.82
## 4 1580.42 1644.4100 1591.04 1520.72 1459.99 1444.96
## 5 213.02 228.9100 214.25 204.03 200.39 207.82
## 2020_week40 2020_week41 2020_week42 2020_week43 2020_week44 2020_week45
## 1 3125.00 3286.65 3272.71 3204.40 3036.15 3311.37
## 2 113.02 116.97 119.02 115.04 108.86 118.69
## 3 259.94 264.45 265.93 284.79 263.11 293.41
## 4 1458.42 1515.22 1573.01 1641.00 1621.01 1761.75
## 5 206.19 215.81 219.66 216.23 202.47 223.72
## 2020_week46 2020_week47 2020_week48 2020_week49 2020_week50 2020_week51
## 1 3128.81 3099.40 3195.34 3162.58 3116.42 3201.65
## 2 119.26 117.34 116.59 122.25 122.41 126.66
## 3 276.95 269.70 277.81 279.70 273.55 276.40
## 4 1777.02 1742.19 1793.19 1827.99 1781.77 1731.01
## 5 216.51 210.39 215.23 214.36 213.26 218.59
## 2020_week52 2020_week53
## 1 3172.69 3256.93
## 2 131.97 132.69
## 3 267.40 273.16
## 4 1738.85 1751.88
## 5 222.75 222.42
Inspect stock_df in the console. Pivot stock_df so that the integer columns year and week are created from the column names and the original values are moved to the price column. Use the names_sep argument to separate the column names.
stock_df %>%
# Pivot the data to create 3 new columns: year, week, price
pivot_longer(
-company,
names_to = c("year", "week"),
values_to = "price",
names_sep = "_week",
names_transform = list(
year = as.integer,
week = as.integer)
)
## # A tibble: 525 x 4
## company year week price
## <fct> <int> <int> <dbl>
## 1 Amazon 2019 1 1848.
## 2 Amazon 2019 2 1641.
## 3 Amazon 2019 3 1696.
## 4 Amazon 2019 4 1671.
## 5 Amazon 2019 5 1626.
## 6 Amazon 2019 6 1588.
## 7 Amazon 2019 7 1608.
## 8 Amazon 2019 8 1632.
## 9 Amazon 2019 9 1672.
## 10 Amazon 2019 10 1621.
## # … with 515 more rows
Create a line plot where the price is shown per week and color by company. The year variable has been dealt with for you.
stock_df %>%
# Pivot the data to create 3 new columns: year, week, price
pivot_longer(
-company,
names_to = c("year", "week"),
values_to = "price",
names_sep = "_week",
names_transform = list(
year = as.integer,
week = as.integer)
) %>%
# Create a line plot with price per week, color by company
ggplot(aes(week, price, color = company)) +
geom_line() +
facet_grid(. ~ year)
You’ll be working on an pre-processed sample of the USSR space dogs database compiled by Duncan Geere and pre-loaded for you as space_dogs_df. Each of the 42 rows in this dataset represents a test rocket launch which had one or two very brave dogs on board.
Your goal is to reshape this dataset so that for each launch, each dog has a row.
The challenge is that in the column headers (name_1, name_2, gender_1, and gender_2), the part before the _ separator can point to two different variables (name and gender), while the second part always points to the dog ID (1st or 2nd dog).
space_dogs_df <- read.delim("DATABASE/space_dogs_df.txt", sep = ",") %>% select(., c(-"X"))
head(space_dogs_df)
## date result
## 1 1966-02-22 recovered safely after a 22-day flight
## 2 1961-03-25 one orbit, recovered safely
## 3 1961-03-09 one orbit, recovered safely
## 4 1960-12-22 upper stage failed, dogs recovered after a suborbital flight
## 5 1960-12-01 spent one day in orbit, capsule was destroyed, both dogs died
## 6 1960-09-22 recovered safely
## name_1 name_2 gender_1 gender_2
## 1 Ugolyok / Snezhok Veterok / Bzdunok Male Male
## 2 Zvezdochka <NA> Female <NA>
## 3 Chernuskha <NA> Female <NA>
## 4 Shutka Kometka Female Female
## 5 Mushka Pchyolka Female Female
## 6 Kusachka / Otvazhnaya Neva Female Female
As the first argument to pivot_longer(), pass the columns to pivot (name_1, name_2, gender_1, and gender_2). Complete the names_to argument so that the first part of the column headers are reused. Make sure NA values are dropped since not all rockets had two dogs.
space_dogs_df %>%
pivot_longer(
# Add the columns to pivot
name_1:gender_2,
names_sep = "_",
# Complete the names_to argument to re-use the first part of the column headers
names_to = c(".value", "dog_id"),
# Make sure NA values are dropped
values_drop_na = TRUE
)
## # A tibble: 81 x 5
## date result dog_id name gender
## <fct> <fct> <chr> <fct> <fct>
## 1 1966-02… recovered safely after a 22-day flight 1 Ugolyok / Sne… Male
## 2 1966-02… recovered safely after a 22-day flight 2 Veterok / Bzd… Male
## 3 1961-03… one orbit, recovered safely 1 Zvezdochka Female
## 4 1961-03… one orbit, recovered safely 1 Chernuskha Female
## 5 1960-12… upper stage failed, dogs recovered aft… 1 Shutka Female
## 6 1960-12… upper stage failed, dogs recovered aft… 2 Kometka Female
## 7 1960-12… spent one day in orbit, capsule was de… 1 Mushka Female
## 8 1960-12… spent one day in orbit, capsule was de… 2 Pchyolka Female
## 9 1960-09… recovered safely 1 Kusachka / Ot… Female
## 10 1960-09… recovered safely 2 Neva Female
## # … with 71 more rows
You’ve been given a sample of WHO data (who_df) with obesity percentages and life expectancy data per country, year, and sex. You want to visually inspect the correlation between obesity and life expectancy.
However, the data is very messy with four variables hidden in the column names. Each column name is made up of three parts separated by underscores: Values for the year, followed by those for sex, and then values for either pct.obese or life.exp. Since the third part of the column name string holds two variables you’ll need to use the special “.value” value in the names_to argument.
You’ll pivot the data into a tidy format and create the scatterplot.
who_df <- read.delim("DATABASE/who_df.txt", sep = ",") %>% select(., c(-"X"))
head(who_df)
## country X2000_male_pct.obese X2000_male_life.exp
## 1 Afghanistan 1.2 54.6
## 2 Albania 11.7 70.4
## 3 Algeria 10.3 69.4
## 4 Angola 1.4 45.2
## 5 Antigua and Barbuda 6.8 69.7
## 6 Argentina 19.1 70.2
## X2000_female_pct.obese X2000_female_life.exp X2001_male_pct.obese
## 1 3.4 57.3 1.3
## 2 14.7 76.0 12.1
## 3 24.2 72.2 10.8
## 4 5.6 49.5 1.5
## 5 18.2 75.1 7.1
## 6 22.1 77.7 19.6
## X2001_male_life.exp X2001_female_pct.obese X2001_female_life.exp
## 1 55.3 3.6 57.9
## 2 71.6 15.0 76.7
## 3 69.9 24.9 72.5
## 4 46.0 5.9 50.4
## 5 70.3 18.6 75.5
## 6 70.3 22.5 77.7
## X2002_male_pct.obese X2002_male_life.exp X2002_female_pct.obese
## 1 1.3 56.5 3.8
## 2 12.6 71.4 15.4
## 3 11.3 70.4 25.5
## 4 1.6 47.3 6.3
## 5 7.3 71.1 19.1
## 6 20.0 70.4 22.9
## X2002_female_life.exp X2003_male_pct.obese X2003_male_life.exp
## 1 58.5 1.4 56.9
## 2 75.9 13.2 70.8
## 3 73.0 11.8 70.7
## 4 51.5 1.7 48.4
## 5 75.7 7.6 71.4
## 6 78.0 20.5 70.6
## X2003_female_pct.obese X2003_female_life.exp X2004_male_pct.obese
## 1 4.0 59.1 1.5
## 2 15.8 75.2 13.7
## 3 26.2 73.3 12.3
## 4 6.6 52.6 1.9
## 5 19.5 75.2 7.8
## 6 23.3 77.9 21.0
## X2004_male_life.exp X2004_female_pct.obese X2004_female_life.exp
## 1 57.4 4.2 59.6
## 2 71.0 16.2 75.8
## 3 71.6 26.8 74.0
## 4 49.4 6.9 53.7
## 5 71.5 19.9 75.1
## 6 71.4 23.7 78.3
## X2005_male_pct.obese X2005_male_life.exp X2005_female_pct.obese
## 1 1.6 57.8 4.4
## 2 14.3 71.6 16.6
## 3 12.9 72.3 27.5
## 4 2.0 50.6 7.3
## 5 8.1 71.4 20.4
## 6 21.5 71.6 24.2
## X2005_female_life.exp X2006_male_pct.obese X2006_male_life.exp
## 1 60.1 1.7 58.0
## 2 75.6 14.9 71.8
## 3 74.6 13.4 72.9
## 4 54.8 2.1 51.7
## 5 75.3 8.4 71.3
## 6 78.6 22.0 71.9
## X2006_female_pct.obese X2006_female_life.exp X2007_male_pct.obese
## 1 4.7 60.5 1.8
## 2 17.1 76.2 15.5
## 3 28.2 75.0 14.0
## 4 7.7 56.0 2.3
## 5 20.8 75.6 8.6
## 6 24.6 78.9 22.5
## X2007_male_life.exp X2007_female_pct.obese X2007_female_life.exp
## 1 58.4 4.9 61.0
## 2 71.9 17.5 76.4
## 3 73.4 28.8 75.5
## 4 52.8 8.0 57.2
## 5 71.3 21.3 76.2
## 6 71.5 25.0 78.5
## X2008_male_pct.obese X2008_male_life.exp X2008_female_pct.obese
## 1 2.0 59.0 5.2
## 2 16.1 72.1 18.0
## 3 14.6 73.8 29.5
## 4 2.4 53.9 8.4
## 5 8.9 71.6 21.7
## 6 23.0 72.0 25.4
## X2008_female_life.exp X2009_male_pct.obese X2009_male_life.exp
## 1 61.6 2.1 59.5
## 2 76.5 16.8 72.3
## 3 75.8 15.2 74.1
## 4 58.4 2.6 55.0
## 5 76.3 9.2 72.1
## 6 79.2 23.6 72.3
## X2009_female_pct.obese X2009_female_life.exp X2010_male_pct.obese
## 1 5.4 62.1 2.2
## 2 18.4 76.7 17.4
## 3 30.2 76.1 15.8
## 4 8.8 59.6 2.8
## 5 22.2 76.7 9.5
## 6 25.9 79.2 24.1
## X2010_male_life.exp X2010_female_pct.obese X2010_female_life.exp
## 1 59.9 5.7 62.5
## 2 72.5 18.9 77.0
## 3 74.4 30.9 76.4
## 4 56.1 9.3 60.7
## 5 72.4 22.7 76.9
## 6 72.3 26.3 79.0
## X2011_male_pct.obese X2011_male_life.exp X2011_female_pct.obese
## 1 2.4 60.5 6.0
## 2 18.1 72.6 19.4
## 3 16.5 74.6 31.5
## 4 2.9 57.2 9.7
## 5 9.8 72.6 23.2
## 6 24.6 72.5 26.7
## X2011_female_life.exp X2012_male_pct.obese X2012_male_life.exp
## 1 63.1 2.5 60.9
## 2 78.2 18.8 73.0
## 3 76.6 17.1 74.8
## 4 61.8 3.1 58.1
## 5 77.2 10.1 72.5
## 6 79.3 25.1 72.7
## X2012_female_pct.obese X2012_female_life.exp X2013_male_pct.obese
## 1 6.3 63.6 2.7
## 2 19.9 77.7 19.5
## 3 32.2 76.8 17.8
## 4 10.2 62.6 3.3
## 5 23.7 77.0 10.5
## 6 27.2 79.7 25.7
## X2013_male_life.exp X2013_female_pct.obese X2013_female_life.exp
## 1 61.5 6.6 64.1
## 2 73.6 20.4 78.1
## 3 75.0 32.9 76.9
## 4 58.8 10.6 63.3
## 5 72.5 24.2 77.0
## 6 72.8 27.6 79.7
## X2014_male_pct.obese X2014_male_life.exp X2014_female_pct.obese
## 1 2.8 61.7 7.0
## 2 20.2 74.0 20.8
## 3 18.5 75.1 33.6
## 4 3.6 59.4 11.1
## 5 10.8 72.1 24.8
## 6 26.2 73.2 28.1
## X2014_female_life.exp X2015_male_pct.obese X2015_male_life.exp
## 1 64.4 3.0 61.8
## 2 78.5 20.9 74.2
## 3 77.1 19.2 75.3
## 4 64.0 3.8 59.9
## 5 77.1 11.2 72.5
## 6 80.0 26.8 73.3
## X2015_female_pct.obese X2015_female_life.exp X2016_male_pct.obese
## 1 7.3 64.7 3.2
## 2 21.3 78.2 21.6
## 3 34.2 77.2 19.9
## 4 11.6 64.5 4.0
## 5 25.3 77.4 11.6
## 6 28.5 80.1 27.3
## X2016_male_life.exp X2016_female_pct.obese X2016_female_life.exp
## 1 61.0 7.6 64.5
## 2 74.3 21.8 78.6
## 3 75.4 34.9 77.4
## 4 60.3 12.1 64.9
## 5 72.5 25.9 77.5
## 6 73.5 29.0 80.3
#nos importa los nombrs mal
columnas <- names(who_df)
#vemos cual es el tamaño de las columnas
length <- length(columnas)
for(i in 1:length){
#seleccionamos la columna
columna <- columnas[i]
# print(columna)
#ESTABLECEMOS LA CONDICION
CONDICION <- str_detect(columna, "^X")
if(CONDICION == TRUE){
#CORREGIMOS EL ERROR
CORREGIDO <- str_extract_all(columnas[i], "[:digit:]+\\_+[:alpha:]+\\_[:alpha:]+\\.+[:alpha:]+", simplify = TRUE)
#reemplazamos el nombre de la columna mal escrita
columnas[i] <- replace(columnas[i], CONDICION == TRUE, CORREGIDO)
}else{
next
}
}
#implementamos los nombres bien
names(who_df) <- columnas
#vemos el resultado
head(who_df)
## country 2000_male_pct.obese 2000_male_life.exp
## 1 Afghanistan 1.2 54.6
## 2 Albania 11.7 70.4
## 3 Algeria 10.3 69.4
## 4 Angola 1.4 45.2
## 5 Antigua and Barbuda 6.8 69.7
## 6 Argentina 19.1 70.2
## 2000_female_pct.obese 2000_female_life.exp 2001_male_pct.obese
## 1 3.4 57.3 1.3
## 2 14.7 76.0 12.1
## 3 24.2 72.2 10.8
## 4 5.6 49.5 1.5
## 5 18.2 75.1 7.1
## 6 22.1 77.7 19.6
## 2001_male_life.exp 2001_female_pct.obese 2001_female_life.exp
## 1 55.3 3.6 57.9
## 2 71.6 15.0 76.7
## 3 69.9 24.9 72.5
## 4 46.0 5.9 50.4
## 5 70.3 18.6 75.5
## 6 70.3 22.5 77.7
## 2002_male_pct.obese 2002_male_life.exp 2002_female_pct.obese
## 1 1.3 56.5 3.8
## 2 12.6 71.4 15.4
## 3 11.3 70.4 25.5
## 4 1.6 47.3 6.3
## 5 7.3 71.1 19.1
## 6 20.0 70.4 22.9
## 2002_female_life.exp 2003_male_pct.obese 2003_male_life.exp
## 1 58.5 1.4 56.9
## 2 75.9 13.2 70.8
## 3 73.0 11.8 70.7
## 4 51.5 1.7 48.4
## 5 75.7 7.6 71.4
## 6 78.0 20.5 70.6
## 2003_female_pct.obese 2003_female_life.exp 2004_male_pct.obese
## 1 4.0 59.1 1.5
## 2 15.8 75.2 13.7
## 3 26.2 73.3 12.3
## 4 6.6 52.6 1.9
## 5 19.5 75.2 7.8
## 6 23.3 77.9 21.0
## 2004_male_life.exp 2004_female_pct.obese 2004_female_life.exp
## 1 57.4 4.2 59.6
## 2 71.0 16.2 75.8
## 3 71.6 26.8 74.0
## 4 49.4 6.9 53.7
## 5 71.5 19.9 75.1
## 6 71.4 23.7 78.3
## 2005_male_pct.obese 2005_male_life.exp 2005_female_pct.obese
## 1 1.6 57.8 4.4
## 2 14.3 71.6 16.6
## 3 12.9 72.3 27.5
## 4 2.0 50.6 7.3
## 5 8.1 71.4 20.4
## 6 21.5 71.6 24.2
## 2005_female_life.exp 2006_male_pct.obese 2006_male_life.exp
## 1 60.1 1.7 58.0
## 2 75.6 14.9 71.8
## 3 74.6 13.4 72.9
## 4 54.8 2.1 51.7
## 5 75.3 8.4 71.3
## 6 78.6 22.0 71.9
## 2006_female_pct.obese 2006_female_life.exp 2007_male_pct.obese
## 1 4.7 60.5 1.8
## 2 17.1 76.2 15.5
## 3 28.2 75.0 14.0
## 4 7.7 56.0 2.3
## 5 20.8 75.6 8.6
## 6 24.6 78.9 22.5
## 2007_male_life.exp 2007_female_pct.obese 2007_female_life.exp
## 1 58.4 4.9 61.0
## 2 71.9 17.5 76.4
## 3 73.4 28.8 75.5
## 4 52.8 8.0 57.2
## 5 71.3 21.3 76.2
## 6 71.5 25.0 78.5
## 2008_male_pct.obese 2008_male_life.exp 2008_female_pct.obese
## 1 2.0 59.0 5.2
## 2 16.1 72.1 18.0
## 3 14.6 73.8 29.5
## 4 2.4 53.9 8.4
## 5 8.9 71.6 21.7
## 6 23.0 72.0 25.4
## 2008_female_life.exp 2009_male_pct.obese 2009_male_life.exp
## 1 61.6 2.1 59.5
## 2 76.5 16.8 72.3
## 3 75.8 15.2 74.1
## 4 58.4 2.6 55.0
## 5 76.3 9.2 72.1
## 6 79.2 23.6 72.3
## 2009_female_pct.obese 2009_female_life.exp 2010_male_pct.obese
## 1 5.4 62.1 2.2
## 2 18.4 76.7 17.4
## 3 30.2 76.1 15.8
## 4 8.8 59.6 2.8
## 5 22.2 76.7 9.5
## 6 25.9 79.2 24.1
## 2010_male_life.exp 2010_female_pct.obese 2010_female_life.exp
## 1 59.9 5.7 62.5
## 2 72.5 18.9 77.0
## 3 74.4 30.9 76.4
## 4 56.1 9.3 60.7
## 5 72.4 22.7 76.9
## 6 72.3 26.3 79.0
## 2011_male_pct.obese 2011_male_life.exp 2011_female_pct.obese
## 1 2.4 60.5 6.0
## 2 18.1 72.6 19.4
## 3 16.5 74.6 31.5
## 4 2.9 57.2 9.7
## 5 9.8 72.6 23.2
## 6 24.6 72.5 26.7
## 2011_female_life.exp 2012_male_pct.obese 2012_male_life.exp
## 1 63.1 2.5 60.9
## 2 78.2 18.8 73.0
## 3 76.6 17.1 74.8
## 4 61.8 3.1 58.1
## 5 77.2 10.1 72.5
## 6 79.3 25.1 72.7
## 2012_female_pct.obese 2012_female_life.exp 2013_male_pct.obese
## 1 6.3 63.6 2.7
## 2 19.9 77.7 19.5
## 3 32.2 76.8 17.8
## 4 10.2 62.6 3.3
## 5 23.7 77.0 10.5
## 6 27.2 79.7 25.7
## 2013_male_life.exp 2013_female_pct.obese 2013_female_life.exp
## 1 61.5 6.6 64.1
## 2 73.6 20.4 78.1
## 3 75.0 32.9 76.9
## 4 58.8 10.6 63.3
## 5 72.5 24.2 77.0
## 6 72.8 27.6 79.7
## 2014_male_pct.obese 2014_male_life.exp 2014_female_pct.obese
## 1 2.8 61.7 7.0
## 2 20.2 74.0 20.8
## 3 18.5 75.1 33.6
## 4 3.6 59.4 11.1
## 5 10.8 72.1 24.8
## 6 26.2 73.2 28.1
## 2014_female_life.exp 2015_male_pct.obese 2015_male_life.exp
## 1 64.4 3.0 61.8
## 2 78.5 20.9 74.2
## 3 77.1 19.2 75.3
## 4 64.0 3.8 59.9
## 5 77.1 11.2 72.5
## 6 80.0 26.8 73.3
## 2015_female_pct.obese 2015_female_life.exp 2016_male_pct.obese
## 1 7.3 64.7 3.2
## 2 21.3 78.2 21.6
## 3 34.2 77.2 19.9
## 4 11.6 64.5 4.0
## 5 25.3 77.4 11.6
## 6 28.5 80.1 27.3
## 2016_male_life.exp 2016_female_pct.obese 2016_female_life.exp
## 1 61.0 7.6 64.5
## 2 74.3 21.8 78.6
## 3 75.4 34.9 77.4
## 4 60.3 12.1 64.9
## 5 72.5 25.9 77.5
## 6 73.5 29.0 80.3
The ggplot2 package has been pre-loaded for you.
who_df %>%
# Put each variable in its own column
pivot_longer(
-country,
names_to = c("year", "sex", ".value"),
names_sep = "_",
names_transform = list("year" = as.integer)
)
## # A tibble: 6,120 x 5
## country year sex pct.obese life.exp
## <fct> <int> <chr> <dbl> <dbl>
## 1 Afghanistan 2000 male 1.2 54.6
## 2 Afghanistan 2000 female 3.4 57.3
## 3 Afghanistan 2001 male 1.3 55.3
## 4 Afghanistan 2001 female 3.6 57.9
## 5 Afghanistan 2002 male 1.3 56.5
## 6 Afghanistan 2002 female 3.8 58.5
## 7 Afghanistan 2003 male 1.4 56.9
## 8 Afghanistan 2003 female 4 59.1
## 9 Afghanistan 2004 male 1.5 57.4
## 10 Afghanistan 2004 female 4.2 59.6
## # … with 6,110 more rows
Use ggplot() to create a scatterplot with life.exp over pct.obese. Color the points by sex.
who_df %>%
# Put each variable in its own column
pivot_longer(
-country,
names_to = c("year", "sex", ".value"),
names_sep = "_",
names_transform = list("year" = as.integer)
) %>%
# Create a plot with life expectancy over obesity
ggplot(.,aes(pct.obese,life.exp , color = sex))+
geom_point()
You’ve found the job of your dreams providing technical support for a dog breed beauty contest. The jury members want a spreadsheet with the breed and id of each participating dog so that they can add the scores later on. You’ve only been given the number of participants per dog breed (dog_df) so you decide to use your tidyr skills to create the desired result.
dog_df <- read.delim("DATABASE/dog_df.txt", sep = ",") %>% select(., c(-"X"))
head(dog_df)
## breed n_participants
## 1 Husky 5
## 2 Golden retriever 12
## 3 Poodle 13
## 4 Shiba Inu 7
Inspect the data in the console. Uncount the data so that per breed, each dog gets a row and an ID. The ID should go in the dog_id column.
dog_df %>%
# Create one row for each participant and add the id
uncount(n_participants, .id = "dog_id")
## breed dog_id
## 1 Husky 1
## 1.1 Husky 2
## 1.2 Husky 3
## 1.3 Husky 4
## 1.4 Husky 5
## 2 Golden retriever 1
## 2.1 Golden retriever 2
## 2.2 Golden retriever 3
## 2.3 Golden retriever 4
## 2.4 Golden retriever 5
## 2.5 Golden retriever 6
## 2.6 Golden retriever 7
## 2.7 Golden retriever 8
## 2.8 Golden retriever 9
## 2.9 Golden retriever 10
## 2.10 Golden retriever 11
## 2.11 Golden retriever 12
## 3 Poodle 1
## 3.1 Poodle 2
## 3.2 Poodle 3
## 3.3 Poodle 4
## 3.4 Poodle 5
## 3.5 Poodle 6
## 3.6 Poodle 7
## 3.7 Poodle 8
## 3.8 Poodle 9
## 3.9 Poodle 10
## 3.10 Poodle 11
## 3.11 Poodle 12
## 3.12 Poodle 13
## 4 Shiba Inu 1
## 4.1 Shiba Inu 2
## 4.2 Shiba Inu 3
## 4.3 Shiba Inu 4
## 4.4 Shiba Inu 5
## 4.5 Shiba Inu 6
## 4.6 Shiba Inu 7
Remember the USSR space dogs dataset1? You changed it to a long format so that for every dog in every rocket launch, there was a row. Suppose you’re given this tidy dataset and are asked to answer the question, “In what percentage of flights were both dogs of the same gender?”
You’ll reshape and investigate space_dogs_df to find the answer.
The dplyr package has been pre-loaded for you.
1 Compiled by Duncan Geere.
space_dogs_df <- read.delim("DATABASE/space_dogs_df2.txt", sep = ",") %>% select(., c(-"X"))
head(space_dogs_df)
## date gender dog_id
## 1 1951-07-22 Male 1
## 2 1951-07-22 Male 2
## 3 1951-07-29 Male 1
## 4 1951-07-29 Female 2
## 5 1951-08-15 Male 1
## 6 1951-08-15 Male 2
Pivot the data to a wider format, deriving new column names from the dog_id column and values from the gender column.
space_dogs_df %>%
# Pivot the data to a wider format
pivot_wider(names_from = dog_id, values_from = gender, names_prefix = "gender_")
## # A tibble: 42 x 3
## date gender_1 gender_2
## <fct> <fct> <fct>
## 1 1951-07-22 Male Male
## 2 1951-07-29 Male Female
## 3 1951-08-15 Male Male
## 4 1951-08-19 Male Male
## 5 1951-08-28 Male Male
## 6 1951-09-03 Male Male
## 7 1951-06-26 Female Male
## 8 1954-07-02 Female Male
## 9 1954-07-07 Female Male
## 10 1954-07-26 Female Male
## # … with 32 more rows
Drop rows that contain NA values.
space_dogs_df %>%
# Pivot the data to a wider format
pivot_wider(names_from = dog_id, values_from = gender, names_prefix = "gender_") %>%
# Drop rows with NA values
drop_na()
## # A tibble: 39 x 3
## date gender_1 gender_2
## <fct> <fct> <fct>
## 1 1951-07-22 Male Male
## 2 1951-07-29 Male Female
## 3 1951-08-15 Male Male
## 4 1951-08-19 Male Male
## 5 1951-08-28 Male Male
## 6 1951-09-03 Male Male
## 7 1951-06-26 Female Male
## 8 1954-07-02 Female Male
## 9 1954-07-07 Female Male
## 10 1954-07-26 Female Male
## # … with 29 more rows
Create a new column same_gender, which has a TRUE value when gender_1 equals gender_2.
space_dogs_df %>%
# Pivot the data to a wider format
pivot_wider(names_from = dog_id, values_from = gender, names_prefix = "gender_") %>%
# Drop rows with NA values
drop_na() %>%
# Create a Boolean column on whether both dogs have the same gender
mutate(
same_gender = ifelse(gender_1 == gender_2, TRUE, 0)
) %>%
summarize(pct_same_gender = mean(same_gender))
## # A tibble: 1 x 1
## pct_same_gender
## <dbl>
## 1 0.795
The intensity of light radiated by a light source follows an inverse square relationship with the distance it has traveled.
planet_df <- read.delim("DATABASE/planet_df.txt", sep = ",") %>% select(., c(-"X"))
head(planet_df)
## planet metric value
## 1 Mercury diameter 4879.0
## 2 Mercury distance_to_sun 57.9
## 3 Mercury temperature 167.0
## 4 Venus diameter 12104.0
## 5 Venus distance_to_sun 108.2
## 6 Venus temperature 464.0
You wonder if you could observe this trend in the temperature of the planets in our Solar System given their distance to the Sun. You’ll use the planet_df dataset from the devstronomy project to investigate this.
planet_df %>%
# Give each planet variable its own column
pivot_wider(
names_from = metric,
values_from = value
)
## # A tibble: 8 x 4
## planet diameter distance_to_sun temperature
## <fct> <dbl> <dbl> <dbl>
## 1 Mercury 4879 57.9 167
## 2 Venus 12104 108. 464
## 3 Earth 12756 150. 15
## 4 Mars 6792 228. -65
## 5 Jupiter 142984 779. -110
## 6 Saturn 120536 1434. -140
## 7 Uranus 51118 2872. -195
## 8 Neptune 49528 4495. -200
Use the ggplot() function to create a plot with the temperature over the distance_to_sun.
planet_df %>%
# Give each planet variable its own column
pivot_wider(names_from = "metric", values_from = "value") %>%
# Plot planet temperature over distance to sun
ggplot(., aes(temperature, distance_to_sun)) +
geom_point(aes(size = diameter)) +
geom_text(aes(label = planet), vjust = -1) +
labs(x = "Distance to sun (million km)",
y = "Mean temperature (°C)") +
theme(legend.position = "none")
### Transposing planet data You’re again working on a planet dataset derived from the devstronomy project. This time, you’re interested in the correlation between the diameter of a planet and the number of moons circling it.
However, the dataset (planet_df) has a row for each variable and a column for each planet (observation). You’ll transpose this data in two steps and then create a plot to inspect the correlation.
The ggplot2 package has been pre-loaded for you.
planet_df <- read.delim("DATABASE/planet_df2.txt", sep = ",") %>% select(., c(-"X"))
#isolamos la primera columna que está bien
metric <- planet_df %>% select(., c(metric))
#quitamos la columna para que no nos estorbe
planet_df <- planet_df %>% select(., c(-"metric"))
#sacamos las repeticiones
length <- length(planet_df)
for(i in 1:1){
#extraemos el nombre de LA COLUMNA
nombre_columna <- names(planet_df[i])
#extraemos los valosres de la columna en cuestion
columna <- planet_df[,c(i)] %>% as.data.frame() %>%
#y le ponemos su nombre
setnames(., nombre_columna)
#quitamos la columna extraida para no causar confusiones
planet_df <- planet_df[,c(-i)]
#juntamos a su dataframe original
planet_df <- bind_cols(planet_df, columna)
}
#finalmente le juntamos la columna primeramente extraida
planet_df <- bind_cols(planet_df, metric)
#borramos el dataframe creado en balde
rm(metric)
rm(columna)
#vemos el aspecto del dataframe
head(planet_df)
## Venus Earth Mars Jupiter Saturn Uranus Neptune Mercury
## 1 4.87 5.97 0.642 1898.0 568.0 86.8 102.0 0.33
## 2 12104.00 12756.00 6792.000 142984.0 120536.0 51118.0 49528.0 4879.00
## 3 108.20 149.60 227.900 778.6 1433.5 2872.5 4495.1 57.90
## 4 8.90 9.80 3.700 23.1 9.0 8.7 11.0 3.70
## 5 0.00 1.00 2.000 79.0 62.0 27.0 14.0 0.00
## metric
## 1 mass
## 2 diameter
## 3 distance_from_sun
## 4 gravity
## 5 number_of_moons
Inspect planet_df in the console. Pivot the data so that planet names are put in a column named planet.
planet_df %>%
# Pivot all columns except metric to long format
pivot_longer(-metric, names_to = "planet")
## # A tibble: 40 x 3
## metric planet value
## <fct> <chr> <dbl>
## 1 mass Venus 4.87
## 2 mass Earth 5.97
## 3 mass Mars 0.642
## 4 mass Jupiter 1898
## 5 mass Saturn 568
## 6 mass Uranus 86.8
## 7 mass Neptune 102
## 8 mass Mercury 0.33
## 9 diameter Venus 12104
## 10 diameter Earth 12756
## # … with 30 more rows
Pivot the data so that each variable in the metric column gets its own column.
planet_df %>%
# Pivot all columns except metric to long format
pivot_longer(-metric, names_to = "planet") %>%
# Put each metric in its own column
pivot_wider(
names_from = metric,
values_from = value
)
## # A tibble: 8 x 6
## planet mass diameter distance_from_sun gravity number_of_moons
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Venus 4.87 12104 108. 8.9 0
## 2 Earth 5.97 12756 150. 9.8 1
## 3 Mars 0.642 6792 228. 3.7 2
## 4 Jupiter 1898 142984 779. 23.1 79
## 5 Saturn 568 120536 1434. 9 62
## 6 Uranus 86.8 51118 2872. 8.7 27
## 7 Neptune 102 49528 4495. 11 14
## 8 Mercury 0.33 4879 57.9 3.7 0
Use the ggplot() function to create a plot with the number_of_moons over diameter.
planet_df %>%
# Pivot all columns except metric to long format
pivot_longer(-metric, names_to = "planet") %>%
# Put each metric in its own column
pivot_wider(names_from = metric, values_from = value) %>%
# Plot the number of moons vs planet diameter
ggplot(., aes(x = diameter, y = number_of_moons)) +
geom_point(aes(size = diameter)) +
geom_text(aes(label = planet), vjust = -1) +
labs(x = "Diameter (km)", y = "Number of moons") +
theme(legend.position = "none")
The basic building blocks of RNA are four molecules described by a single letter each: adenine (A), cytosine (C), guanine (G), and uracil (U). The information carried by an RNA strand can be represented as a long sequence of these four letters. To read this code, one has to divide this chain into sequences of three letters each (e.g. GCU, ACG, …). These three letter sequences are known as codons. The concept is illustrated in the image below.
Your goal for this exercise is to create a data frame with all possible three letter sequences (codons) from a vector with the four letters representing the RNA building blocks.
Create a tibble with three columns called letter1, letter2, and letter3 that holds all possible combinations of the vector letters using expand_grid().
letters <- c("A", "C", "G", "U")
# Create a tibble with all possible 3 way combinations
codon_df <- expand_grid(
letter1 = letters,
letter2 = letters,
letter3 = letters
)
codon_df
## # A tibble: 64 x 3
## letter1 letter2 letter3
## <chr> <chr> <chr>
## 1 A A A
## 2 A A C
## 3 A A G
## 4 A A U
## 5 A C A
## 6 A C C
## 7 A C G
## 8 A C U
## 9 A G A
## 10 A G C
## # … with 54 more rows
Use the unite() function from chapter one to merge these three columns into a single column named codon. Use an empty string as the separator.
codon_df %>%
# Unite these three columns into a "codon" column
unite("codon", letter1, letter2, letter3, sep = "")
## # A tibble: 64 x 1
## codon
## <chr>
## 1 AAA
## 2 AAC
## 3 AAG
## 4 AAU
## 5 ACA
## 6 ACC
## 7 ACG
## 8 ACU
## 9 AGA
## 10 AGC
## # … with 54 more rows
You already know that in the early days of spaceflight, the USSR was testing rockets with dogs. You now wonder when exactly humans started replacing dogs on space flight missions. You’ve been given a dataset space_df with the number of both dogs (compiled by Duncan Geere) and humans in space per year from 1951 till 1970 (collected from Wikipedia).
Your goal is to create a plot that shows you the number of individuals sent into space per species. Before you can create this plot, you’ll first have to introduce zero values for missing combinations of year and species.
space_df <- read.delim("DATABASE/space_df.txt", sep = ",") %>% select(., c(-"X"))
head(space_df)
## year species n_in_space
## 1 1951 Dog 14
## 2 1954 Dog 6
## 3 1955 Dog 6
## 4 1956 Dog 8
## 5 1957 Dog 11
## 6 1958 Dog 12
The dplyr and ggplot2 packages have been pre-loaded for you.
Create full_df, a tibble with all unique combinations of the variables year (from 1951 to 1970) and species (“Human” and “Dog”).
# Create a tibble with all combinations of years and species
full_df <- expand_grid(
year = 1951:1970,
species = c("Human", "Dog")
)
full_df
## # A tibble: 40 x 2
## year species
## <int> <chr>
## 1 1951 Human
## 2 1951 Dog
## 3 1952 Human
## 4 1952 Dog
## 5 1953 Human
## 6 1953 Dog
## 7 1954 Human
## 8 1954 Dog
## 9 1955 Human
## 10 1955 Dog
## # … with 30 more rows
Perform a right_join() between space_df and full_df on the year and species columns.
space_df %>%
# Join with full_df so that missing values are introduced
right_join(full_df, by = c("year", "species")) %>%
arrange(year)
## year species n_in_space
## 1 1951 Dog 14
## 2 1951 Human NA
## 3 1952 Human NA
## 4 1952 Dog NA
## 5 1953 Human NA
## 6 1953 Dog NA
## 7 1954 Dog 6
## 8 1954 Human NA
## 9 1955 Dog 6
## 10 1955 Human NA
## 11 1956 Dog 8
## 12 1956 Human NA
## 13 1957 Dog 11
## 14 1957 Human NA
## 15 1958 Dog 12
## 16 1958 Human NA
## 17 1959 Dog 4
## 18 1959 Human NA
## 19 1960 Dog 16
## 20 1960 Human NA
## 21 1961 Dog 2
## 22 1961 Human 4
## 23 1962 Human 5
## 24 1962 Dog NA
## 25 1963 Human 5
## 26 1963 Dog NA
## 27 1964 Human 3
## 28 1964 Dog NA
## 29 1965 Human 12
## 30 1965 Dog NA
## 31 1966 Dog 2
## 32 1966 Human 10
## 33 1967 Human 1
## 34 1967 Dog NA
## 35 1968 Human 7
## 36 1968 Dog NA
## 37 1969 Human 23
## 38 1969 Dog NA
## 39 1970 Human 5
## 40 1970 Dog NA
Use the ggplot() function to create a line plot of n_in_space over year, colored by species.
space_df %>%
# Join with full_df so that missing values are introduced
right_join(full_df, by = c("year", "species")) %>%
# Create a line plot with n_in_space over year, color by species
ggplot(., aes(year, n_in_space, color = species)) +
geom_line()
## Warning: Removed 14 row(s) containing missing values (geom_path).
Use the replace_na() function to overwrite NA values in the n_in_space column with zeros.
space_df %>%
# Join with full_df so that missing values are introduced
right_join(full_df, by = c("year", "species")) %>%
# Overwrite NA values for n_in_space with 0L
replace_na(list(n_in_space = 0L)) %>%
# Create a line plot with n_in_space over year, color by species
ggplot(aes(x = year, y = n_in_space, color = species)) +
geom_line()
You’re an inspector at a nuclear plant and have to validate whether every reactor has received its daily safety check over the course of a full year. The safety check logs are in reactor_df, a data frame with columns date, reactor, and check.
Two vectors, dates and reactors, with all dates of the year and reactors at the plant respectively have been created for you. You’ll use the combination of the expand_grid() and anti_join() functions to find dates where particular reactors were not checked.
The dplyr package has been pre-loaded for you.
reactor_df <- read.delim("DATABASE/reactor_df.txt", sep = ",") %>% select(., c(-"X"))
head(reactor_df)
## date reactor checked
## 1 1986-01-01 A TRUE
## 2 1986-01-01 B TRUE
## 3 1986-01-01 C TRUE
## 4 1986-01-01 D TRUE
## 5 1986-01-02 A TRUE
## 6 1986-01-02 B TRUE
dates <- read.delim("DATABASE/dates.txt", sep = ",", header = FALSE)
head(dates)
## V1
## 1 [1] 1986-01-01 1986-01-02 1986-01-03 1986-01-04 1986-01-05
## 2 [6] 1986-01-06 1986-01-07 1986-01-08 1986-01-09 1986-01-10
## 3 [11] 1986-01-11 1986-01-12 1986-01-13 1986-01-14 1986-01-15
## 4 [16] 1986-01-16 1986-01-17 1986-01-18 1986-01-19 1986-01-20
## 5 [21] 1986-01-21 1986-01-22 1986-01-23 1986-01-24 1986-01-25
## 6 [26] 1986-01-26 1986-01-27 1986-01-28 1986-01-29 1986-01-30
reactors <- read.delim("DATABASE/reactors.txt", sep = ",", header = FALSE)
head(reactors)
## V1
## 1 A B C D
Use the expand_grid() function to create a tibble holding all combinations of the variables date and reactor. Use the dates and reactors vectors created for you.
# Create a tibble with all combinations of dates and reactors
# full_df <- expand_grid(
#
# date = dates,
# reactor = reactors
#
# )
full_df <- read.delim("DATABASE/full_df.txt", sep = ",", header = FALSE)
head(full_df)
## V1 V2 V3
## 1 NA date reactor
## 2 1 1986-01-01 A
## 3 2 1986-01-01 B
## 4 3 1986-01-01 C
## 5 4 1986-01-01 D
## 6 5 1986-01-02 A
Perform an anti-join between full_df and reactor_df on the date and reactor columns.
# full_df %>%
# anti_join(
# reactor_df,
# by = c("date", "reactor")
# )
You have been given a data frame (planet_df) from the devstronomy project with the number of moons per planet in our Solar System. However, Mercury and Venus, the two moonless planets, are absent. You want to expand this dataset using the complete() function and a vector planets that contains all eight planet’s names.
planet_df <- read.delim("DATABASE/planet_df3.txt", sep = ",", header = FALSE)
head(planet_df)
## V1 V2 V3
## 1 NA planet n_moons
## 2 1 Earth 1
## 3 2 Mars 2
## 4 3 Jupiter 79
## 5 4 Saturn 62
## 6 5 Uranus 27
planets = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
Complete the planet variable using the planets vector. Replace NA values in the n_moons variable with 0L values.
# planet_df %>%
# complete(.,
# # Complete the planet variable
# planet = planets
# # Overwrite NA values for n_moons with 0L
# fill = list(n_moons = 0L)
# )
planet_completed <- read.delim("DATABASE/planet_complete.txt", sep = ",", header = FALSE)
head(planet_completed)
## V1 V2 V3
## 1 NA planet n_moons
## 2 1 Earth 1
## 3 2 Jupiter 79
## 4 3 Mars 2
## 5 4 Mercury 0
## 6 5 Neptune 14
Since 1896, athletes from all over the world have been competing in the modern Olympic games. You’ve been given a dataset (medal_df) with observations for all medals won by athletes from the 10 most successful countries in Olympic history. You want to create a visual with the number of medals won per country (team) per year. However, since not all countries won medals each year, you’ll have to introduce zero values before you can make an accurate visual.
The ggplot2 and dplyr packages have been pre-loaded for you. In step 2 and 3 the scale_color_brewer() function is used to color lines in the plot with a palette that makes it easier to distinguish the different countries.
medal_df <- read.delim("DATABASE/medal_df.txt", sep = ",")
head(medal_df)
## X name sport year medal team
## 1 1 Edward Battel Cycling 1896 Bronze Great Britain
## 2 2 Arthur Charles Blake Athletics 1896 Silver United States
## 3 3 Conrad Helmut Fritz Bcker Gymnastics 1896 Gold Germany
## 4 4 Conrad Helmut Fritz Bcker Gymnastics 1896 Gold Germany
## 5 5 John Mary Pius Boland Tennis 1896 Gold Great Britain
## 6 6 Thomas Edmund Burke Athletics 1896 Gold United States
Inspect medal_df in the console. Count the number of medals won per team and year.
medal_df %>%
# Count the medals won per team and year
count(team, year, name = "n_medals")
## team year n_medals
## 1 Australia 1896 2
## 2 Australia 1900 5
## 3 Australia 1904 4
## 4 Australia 1906 3
## 5 Australia 1920 7
## 6 Australia 1924 10
## 7 Australia 1928 4
## 8 Australia 1932 5
## 9 Australia 1936 1
## 10 Australia 1948 16
## 11 Australia 1952 20
## 12 Australia 1956 62
## 13 Australia 1960 46
## 14 Australia 1964 41
## 15 Australia 1968 51
## 16 Australia 1972 20
## 17 Australia 1976 23
## 18 Australia 1980 13
## 19 Australia 1984 52
## 20 Australia 1988 35
## 21 Australia 1992 57
## 22 Australia 1996 130
## 23 Australia 2000 181
## 24 Australia 2004 157
## 25 Australia 2008 149
## 26 Australia 2012 114
## 27 Australia 2016 82
## 28 France 1896 11
## 29 France 1900 75
## 30 France 1904 1
## 31 France 1906 45
## 32 France 1908 36
## 33 France 1912 23
## 34 France 1920 134
## 35 France 1924 101
## 36 France 1928 44
## 37 France 1932 40
## 38 France 1936 44
## 39 France 1948 72
## 40 France 1952 40
## 41 France 1956 33
## 42 France 1960 12
## 43 France 1964 31
## 44 France 1968 27
## 45 France 1972 22
## 46 France 1976 20
## 47 France 1980 29
## 48 France 1984 67
## 49 France 1988 29
## 50 France 1992 55
## 51 France 1996 49
## 52 France 2000 64
## 53 France 2004 53
## 54 France 2008 77
## 55 France 2012 78
## 56 France 2016 96
## 57 Germany 1896 31
## 58 Germany 1900 2
## 59 Germany 1904 15
## 60 Germany 1906 30
## 61 Germany 1908 19
## 62 Germany 1912 39
## 63 Germany 1928 77
## 64 Germany 1932 43
## 65 Germany 1936 215
## 66 Germany 1952 37
## 67 Germany 1956 52
## 68 Germany 1960 86
## 69 Germany 1964 113
## 70 Germany 1992 198
## 71 Germany 1996 122
## 72 Germany 2000 116
## 73 Germany 2004 147
## 74 Germany 2008 96
## 75 Germany 2012 92
## 76 Germany 2016 157
## 77 Great Britain 1896 7
## 78 Great Britain 1900 22
## 79 Great Britain 1904 2
## 80 Great Britain 1906 31
## 81 Great Britain 1908 167
## 82 Great Britain 1912 128
## 83 Great Britain 1920 101
## 84 Great Britain 1924 55
## 85 Great Britain 1928 58
## 86 Great Britain 1932 33
## 87 Great Britain 1936 30
## 88 Great Britain 1948 59
## 89 Great Britain 1952 31
## 90 Great Britain 1956 37
## 91 Great Britain 1960 28
## 92 Great Britain 1964 26
## 93 Great Britain 1968 19
## 94 Great Britain 1972 29
## 95 Great Britain 1976 32
## 96 Great Britain 1980 47
## 97 Great Britain 1984 71
## 98 Great Britain 1988 54
## 99 Great Britain 1992 50
## 100 Great Britain 1996 26
## 101 Great Britain 2000 52
## 102 Great Britain 2004 55
## 103 Great Britain 2008 81
## 104 Great Britain 2012 122
## 105 Great Britain 2016 145
## 106 Hungary 1896 6
## 107 Hungary 1900 5
## 108 Hungary 1904 4
## 109 Hungary 1906 13
## 110 Hungary 1908 20
## 111 Hungary 1912 30
## 112 Hungary 1924 22
## 113 Hungary 1928 21
## 114 Hungary 1932 33
## 115 Hungary 1936 41
## 116 Hungary 1948 62
## 117 Hungary 1952 102
## 118 Hungary 1956 64
## 119 Hungary 1960 66
## 120 Hungary 1964 56
## 121 Hungary 1968 81
## 122 Hungary 1972 81
## 123 Hungary 1976 55
## 124 Hungary 1980 61
## 125 Hungary 1988 44
## 126 Hungary 1992 45
## 127 Hungary 1996 43
## 128 Hungary 2000 53
## 129 Hungary 2004 40
## 130 Hungary 2008 27
## 131 Hungary 2012 26
## 132 Hungary 2016 22
## 133 Italy 1900 5
## 134 Italy 1906 9
## 135 Italy 1908 8
## 136 Italy 1912 25
## 137 Italy 1920 84
## 138 Italy 1924 51
## 139 Italy 1928 72
## 140 Italy 1932 77
## 141 Italy 1936 70
## 142 Italy 1948 67
## 143 Italy 1952 52
## 144 Italy 1956 45
## 145 Italy 1960 85
## 146 Italy 1964 51
## 147 Italy 1968 33
## 148 Italy 1972 25
## 149 Italy 1976 31
## 150 Italy 1980 37
## 151 Italy 1984 63
## 152 Italy 1988 29
## 153 Italy 1992 45
## 154 Italy 1996 71
## 155 Italy 2000 65
## 156 Italy 2004 104
## 157 Italy 2008 42
## 158 Italy 2012 68
## 159 Italy 2016 70
## 160 Russia 1908 3
## 161 Russia 1912 7
## 162 Russia 1996 115
## 163 Russia 2000 187
## 164 Russia 2004 189
## 165 Russia 2008 142
## 166 Russia 2012 138
## 167 Russia 2016 113
## 168 Soviet Union 1952 117
## 169 Soviet Union 1956 169
## 170 Soviet Union 1960 167
## 171 Soviet Union 1964 174
## 172 Soviet Union 1968 192
## 173 Soviet Union 1972 214
## 174 Soviet Union 1976 286
## 175 Soviet Union 1980 442
## 176 Soviet Union 1988 300
## 177 Sweden 1900 1
## 178 Sweden 1906 21
## 179 Sweden 1908 79
## 180 Sweden 1912 153
## 181 Sweden 1920 143
## 182 Sweden 1924 72
## 183 Sweden 1928 31
## 184 Sweden 1932 26
## 185 Sweden 1936 28
## 186 Sweden 1948 67
## 187 Sweden 1952 64
## 188 Sweden 1956 28
## 189 Sweden 1960 7
## 190 Sweden 1964 10
## 191 Sweden 1968 9
## 192 Sweden 1972 22
## 193 Sweden 1976 10
## 194 Sweden 1980 20
## 195 Sweden 1984 34
## 196 Sweden 1988 16
## 197 Sweden 1992 35
## 198 Sweden 1996 31
## 199 Sweden 2000 32
## 200 Sweden 2004 12
## 201 Sweden 2008 5
## 202 Sweden 2012 22
## 203 Sweden 2016 28
## 204 United States 1896 20
## 205 United States 1900 45
## 206 United States 1904 199
## 207 United States 1906 24
## 208 United States 1908 65
## 209 United States 1912 107
## 210 United States 1920 194
## 211 United States 1924 174
## 212 United States 1928 88
## 213 United States 1932 170
## 214 United States 1936 96
## 215 United States 1948 143
## 216 United States 1952 122
## 217 United States 1956 121
## 218 United States 1960 120
## 219 United States 1964 159
## 220 United States 1968 166
## 221 United States 1972 171
## 222 United States 1976 164
## 223 United States 1984 352
## 224 United States 1988 207
## 225 United States 1992 222
## 226 United States 1996 255
## 227 United States 2000 240
## 228 United States 2004 259
## 229 United States 2008 309
## 230 United States 2012 238
## 231 United States 2016 256
Use ggplot() to create a line plot with n_medals over year, colored by team.
medal_df %>%
# Count the medals won per team and year
count(team, year, name = "n_medals") %>%
# Plot n_medals over year, colored by team
ggplot(., aes(year, n_medals, color = team)) +
geom_line() +
scale_color_brewer(palette = "Paired")
Complete the team and year variables, replace NA values in the n_medals column with zeros.
medal_df %>%
# Count the medals won per team and year
count(team, year, name = "n_medals") %>%
# Complete the team and year variables, fill n_medals with zeros
complete(team, year, fill = list(n_medals = 0)) %>%
# Plot n_medals over year, colored by team
ggplot(aes(x = year, y = n_medals, color = team)) +
geom_line() +
scale_color_brewer(palette = "Paired")
The full_seq() function will look for the minimal and maximal values inside the vector you pass it and will then generate a full sequence of numbers with a fixed period in between them. When used inside the complete() function, full_seq() is a handy tool to make sure there are no missing observations in your data. Before combining these two functions you’ll generate a few sequences with full_seq() on its own to get the hang of this function.
Use full_seq() to create a sequence with all years from 2020 till 2030.
# Generate all years from 2020 to 2030
years <- full_seq(2020:2030, period = 1)
years
## [1] 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030
Use full_seq() to create a sequence with all decades from 1980 till 2030.
# Generate all decades from 1980 to 2030
decades <- full_seq(c(1980, 2030), period = 10)
decades
## [1] 1980 1990 2000 2010 2020 2030
Use full_seq() to create a sequence with all dates in 1980 using the outer_dates vector.
outer_dates <- c(as.Date("1980-01-01"), as.Date("1980-12-31"))
# Generate the dates for all days in 1980
full_seq(outer_dates, period = 1)
## [1] "1980-01-01" "1980-01-02" "1980-01-03" "1980-01-04" "1980-01-05"
## [6] "1980-01-06" "1980-01-07" "1980-01-08" "1980-01-09" "1980-01-10"
## [11] "1980-01-11" "1980-01-12" "1980-01-13" "1980-01-14" "1980-01-15"
## [16] "1980-01-16" "1980-01-17" "1980-01-18" "1980-01-19" "1980-01-20"
## [21] "1980-01-21" "1980-01-22" "1980-01-23" "1980-01-24" "1980-01-25"
## [26] "1980-01-26" "1980-01-27" "1980-01-28" "1980-01-29" "1980-01-30"
## [31] "1980-01-31" "1980-02-01" "1980-02-02" "1980-02-03" "1980-02-04"
## [36] "1980-02-05" "1980-02-06" "1980-02-07" "1980-02-08" "1980-02-09"
## [41] "1980-02-10" "1980-02-11" "1980-02-12" "1980-02-13" "1980-02-14"
## [46] "1980-02-15" "1980-02-16" "1980-02-17" "1980-02-18" "1980-02-19"
## [51] "1980-02-20" "1980-02-21" "1980-02-22" "1980-02-23" "1980-02-24"
## [56] "1980-02-25" "1980-02-26" "1980-02-27" "1980-02-28" "1980-02-29"
## [61] "1980-03-01" "1980-03-02" "1980-03-03" "1980-03-04" "1980-03-05"
## [66] "1980-03-06" "1980-03-07" "1980-03-08" "1980-03-09" "1980-03-10"
## [71] "1980-03-11" "1980-03-12" "1980-03-13" "1980-03-14" "1980-03-15"
## [76] "1980-03-16" "1980-03-17" "1980-03-18" "1980-03-19" "1980-03-20"
## [81] "1980-03-21" "1980-03-22" "1980-03-23" "1980-03-24" "1980-03-25"
## [86] "1980-03-26" "1980-03-27" "1980-03-28" "1980-03-29" "1980-03-30"
## [91] "1980-03-31" "1980-04-01" "1980-04-02" "1980-04-03" "1980-04-04"
## [96] "1980-04-05" "1980-04-06" "1980-04-07" "1980-04-08" "1980-04-09"
## [101] "1980-04-10" "1980-04-11" "1980-04-12" "1980-04-13" "1980-04-14"
## [106] "1980-04-15" "1980-04-16" "1980-04-17" "1980-04-18" "1980-04-19"
## [111] "1980-04-20" "1980-04-21" "1980-04-22" "1980-04-23" "1980-04-24"
## [116] "1980-04-25" "1980-04-26" "1980-04-27" "1980-04-28" "1980-04-29"
## [121] "1980-04-30" "1980-05-01" "1980-05-02" "1980-05-03" "1980-05-04"
## [126] "1980-05-05" "1980-05-06" "1980-05-07" "1980-05-08" "1980-05-09"
## [131] "1980-05-10" "1980-05-11" "1980-05-12" "1980-05-13" "1980-05-14"
## [136] "1980-05-15" "1980-05-16" "1980-05-17" "1980-05-18" "1980-05-19"
## [141] "1980-05-20" "1980-05-21" "1980-05-22" "1980-05-23" "1980-05-24"
## [146] "1980-05-25" "1980-05-26" "1980-05-27" "1980-05-28" "1980-05-29"
## [151] "1980-05-30" "1980-05-31" "1980-06-01" "1980-06-02" "1980-06-03"
## [156] "1980-06-04" "1980-06-05" "1980-06-06" "1980-06-07" "1980-06-08"
## [161] "1980-06-09" "1980-06-10" "1980-06-11" "1980-06-12" "1980-06-13"
## [166] "1980-06-14" "1980-06-15" "1980-06-16" "1980-06-17" "1980-06-18"
## [171] "1980-06-19" "1980-06-20" "1980-06-21" "1980-06-22" "1980-06-23"
## [176] "1980-06-24" "1980-06-25" "1980-06-26" "1980-06-27" "1980-06-28"
## [181] "1980-06-29" "1980-06-30" "1980-07-01" "1980-07-02" "1980-07-03"
## [186] "1980-07-04" "1980-07-05" "1980-07-06" "1980-07-07" "1980-07-08"
## [191] "1980-07-09" "1980-07-10" "1980-07-11" "1980-07-12" "1980-07-13"
## [196] "1980-07-14" "1980-07-15" "1980-07-16" "1980-07-17" "1980-07-18"
## [201] "1980-07-19" "1980-07-20" "1980-07-21" "1980-07-22" "1980-07-23"
## [206] "1980-07-24" "1980-07-25" "1980-07-26" "1980-07-27" "1980-07-28"
## [211] "1980-07-29" "1980-07-30" "1980-07-31" "1980-08-01" "1980-08-02"
## [216] "1980-08-03" "1980-08-04" "1980-08-05" "1980-08-06" "1980-08-07"
## [221] "1980-08-08" "1980-08-09" "1980-08-10" "1980-08-11" "1980-08-12"
## [226] "1980-08-13" "1980-08-14" "1980-08-15" "1980-08-16" "1980-08-17"
## [231] "1980-08-18" "1980-08-19" "1980-08-20" "1980-08-21" "1980-08-22"
## [236] "1980-08-23" "1980-08-24" "1980-08-25" "1980-08-26" "1980-08-27"
## [241] "1980-08-28" "1980-08-29" "1980-08-30" "1980-08-31" "1980-09-01"
## [246] "1980-09-02" "1980-09-03" "1980-09-04" "1980-09-05" "1980-09-06"
## [251] "1980-09-07" "1980-09-08" "1980-09-09" "1980-09-10" "1980-09-11"
## [256] "1980-09-12" "1980-09-13" "1980-09-14" "1980-09-15" "1980-09-16"
## [261] "1980-09-17" "1980-09-18" "1980-09-19" "1980-09-20" "1980-09-21"
## [266] "1980-09-22" "1980-09-23" "1980-09-24" "1980-09-25" "1980-09-26"
## [271] "1980-09-27" "1980-09-28" "1980-09-29" "1980-09-30" "1980-10-01"
## [276] "1980-10-02" "1980-10-03" "1980-10-04" "1980-10-05" "1980-10-06"
## [281] "1980-10-07" "1980-10-08" "1980-10-09" "1980-10-10" "1980-10-11"
## [286] "1980-10-12" "1980-10-13" "1980-10-14" "1980-10-15" "1980-10-16"
## [291] "1980-10-17" "1980-10-18" "1980-10-19" "1980-10-20" "1980-10-21"
## [296] "1980-10-22" "1980-10-23" "1980-10-24" "1980-10-25" "1980-10-26"
## [301] "1980-10-27" "1980-10-28" "1980-10-29" "1980-10-30" "1980-10-31"
## [306] "1980-11-01" "1980-11-02" "1980-11-03" "1980-11-04" "1980-11-05"
## [311] "1980-11-06" "1980-11-07" "1980-11-08" "1980-11-09" "1980-11-10"
## [316] "1980-11-11" "1980-11-12" "1980-11-13" "1980-11-14" "1980-11-15"
## [321] "1980-11-16" "1980-11-17" "1980-11-18" "1980-11-19" "1980-11-20"
## [326] "1980-11-21" "1980-11-22" "1980-11-23" "1980-11-24" "1980-11-25"
## [331] "1980-11-26" "1980-11-27" "1980-11-28" "1980-11-29" "1980-11-30"
## [336] "1980-12-01" "1980-12-02" "1980-12-03" "1980-12-04" "1980-12-05"
## [341] "1980-12-06" "1980-12-07" "1980-12-08" "1980-12-09" "1980-12-10"
## [346] "1980-12-11" "1980-12-12" "1980-12-13" "1980-12-14" "1980-12-15"
## [351] "1980-12-16" "1980-12-17" "1980-12-18" "1980-12-19" "1980-12-20"
## [356] "1980-12-21" "1980-12-22" "1980-12-23" "1980-12-24" "1980-12-25"
## [361] "1980-12-26" "1980-12-27" "1980-12-28" "1980-12-29" "1980-12-30"
## [366] "1980-12-31"
In October 1962, during the Cuban missile crisis, the world came close to a full scale nuclear war. Throughout 1962, the USA, USSR, and France together detonated a record 178 nuclear bombs for military power display and research. You’ve been given a sample of the Nuclear Explosion Database (NEDB) for that year (cumul_nukes_1962_df) with an observation for each date on which a bomb was detonated. The total_bombs variable contains the cumulative number of bombs detonated by a country up to that point in time.
You’ll complete the dataset to hold the full sequence of dates, and visualize the total number of bombs per country over time. You’ll also use the fill() function from Chapter One to impute missing values.
cumul_nukes_1962_df <- read.delim("DATABASE/cumul_nukes_1962_df.txt", sep = ",") %>% select(., c(-"X"))
#transformamos la fecha par qu lo lea como fecha
cumul_nukes_1962_df <- cumul_nukes_1962_df %>%
mutate(
fecha = as.POSIXct(date, origin = '1970-01-01', tz = 'GMT')
) %>%
#quitamos la columna inicial
select(.,c (
-"date"
)) %>%
#renombramos aquellacolumna recientemente creada
setnames(.,
c("country", "total_bombs", "date"))
head(cumul_nukes_1962_df)
## country total_bombs date
## 1 France 0 1962-01-01
## 2 France 1 1962-05-01
## 3 Russian Federation 0 1962-01-01
## 4 Russian Federation 1 1962-02-02
## 5 Russian Federation 2 1962-08-01
## 6 Russian Federation 3 1962-08-03
#### hayq ue cambiar el formato de datoe a as.date() esta en formato factor
The dplyr and ggplot2 packages have been pre-loaded for you.
Complete the dataset so that for each country there is an observation of each date using the full_seq() function.
cumul_nukes_1962_df %>%
# Complete the dataset
complete(country, date = full_seq(date, period = 1))
## # A tibble: 92,793,603 x 3
## country date total_bombs
## <fct> <dttm> <int>
## 1 France 1962-01-01 00:00:00 0
## 2 France 1962-01-01 00:00:01 NA
## 3 France 1962-01-01 00:00:02 NA
## 4 France 1962-01-01 00:00:03 NA
## 5 France 1962-01-01 00:00:04 NA
## 6 France 1962-01-01 00:00:05 NA
## 7 France 1962-01-01 00:00:06 NA
## 8 France 1962-01-01 00:00:07 NA
## 9 France 1962-01-01 00:00:08 NA
## 10 France 1962-01-01 00:00:09 NA
## # … with 92,793,593 more rows
Group the data by country. Use the fill() function to overwrite NA values in the total_bombs variable with the last known value.
cumul_nukes_1962_df %>%
# Complete the dataset
complete(country, date = full_seq(date, period = 1)) %>%
# Group the data by country
group_by(country) %>%
# Impute missing values with the last known observation
fill(total_bombs)
## # A tibble: 92,793,603 x 3
## # Groups: country [3]
## country date total_bombs
## <fct> <dttm> <int>
## 1 France 1962-01-01 00:00:00 0
## 2 France 1962-01-01 00:00:01 0
## 3 France 1962-01-01 00:00:02 0
## 4 France 1962-01-01 00:00:03 0
## 5 France 1962-01-01 00:00:04 0
## 6 France 1962-01-01 00:00:05 0
## 7 France 1962-01-01 00:00:06 0
## 8 France 1962-01-01 00:00:07 0
## 9 France 1962-01-01 00:00:08 0
## 10 France 1962-01-01 00:00:09 0
## # … with 92,793,593 more rows
Use ggplot() to visualize the total_bombs at any given date, color the line plot by country. Some code has been added for you to visualize the Cuban Missile Crisis.
# cumul_nukes_1962_df %>%
# # Complete the dataset
# complete(country, date = full_seq(date, period = 1)) %>%
# # Group the data by country
# group_by(country) %>%
# # Impute missing values with the last known observation
# fill(total_bombs) %>%
# # Plot the number of bombs over time, color by country
# ggplot(aes(date, total_bombs, color = country)) +
# # These two lines will mark the Cuban Missile Crisis
# geom_rect(xmin = as.Date("1962-10-16"), xmax = as.Date("1962-10-29"), ymin = -Inf, ymax = Inf, color = NA)+
# geom_text(x = as.Date("1962-10-22"), y = 15, label = "Cuban Missile Crisis", angle = 90, color = "white")+
# geom_line()
#casca R
You want to compare Olympic performance of athletes per continent over time, both on the winter and summer Olympics. You’ve been given a dataset medal_df with the average number of medals won per participant of each continent since 1928. You’ll complete this data to introduce zero values for years where a continent did not win any medals.
The ggplot2 package has been pre-loaded for you.
medal_df <- read.delim("DATABASE/medal_df2.txt", sep = ",") %>% select(., c(-"X"))
head(medal_df)
## continent season year medals_per_participant
## 1 Africa Summer 1928 0.17241379
## 2 Africa Summer 1932 0.41666667
## 3 Africa Summer 1936 0.07058824
## 4 Africa Summer 1948 0.09166667
## 5 Africa Summer 1952 0.08064516
## 6 Africa Summer 1956 0.06542056
Complete the dataset so that each continent has a medals_per_participant value at each Olympic event. Missing values should be filled with zeros. Nest the season and year variables using the nesting() function, since the summer and winter Olympics don’t occur in the same years. Use ggplot() to create a line plot with the medals_per_participant per year, color the plot by continent.
medal_df %>%
# Give each continent an observation at each Olympic event
complete(
continent,
nesting(season, year),
fill = list(medals_per_participant = 0)
) %>%
# Plot the medals_per_participant over time, colored by continent
ggplot(aes(x = year, y = medals_per_participant, color = continent)) +
geom_line() +
facet_grid(season ~ .)
You’re a doctor in a remote village confronted with a virus outbreak. You have been collecting data on when your patients got infected and recovered in a data frame named patient_df. Your goal is to create a visual with the number of sick patients over time. You’ll first have to reshape the data so that you can count the number of sick patients per day.
patient_df <- read.delim("DATABASE/patient_df.txt", sep = ",") %>% select(., c(-"X"))
#transformacion de las fechas
patient_df <- patient_df %>%
mutate(
infected1 = as.POSIXct(infected, origin = '1970-01-01', tz = 'GMT'),
recovered1 = as.POSIXct(recovered, origin = '1970-01-01', tz = 'GMT')
) %>%
#quitamos la columna inicial
select(.,c (
-"infected", - "recovered"
)) %>%
#renombramos aquellacolumna recientemente creada
setnames(.,
c("patient", "infected", "recovered"))
head(patient_df)
## patient infected recovered
## 1 A 2020-02-08 2020-03-28
## 2 B 2020-02-19 2020-03-04
## 3 C 2020-02-27 2020-03-19
## 4 D 2020-03-05 2020-03-19
## 5 E 2020-03-06 2020-03-29
## 6 F 2020-03-07 2020-03-22
The dplyr and ggplot2 packages have been pre-loaded for you.
Inspect patient_df in the console. Pivot the infected and recovered columns to long format, the old column names should go in the status variable, the values to date.
patient_df %>%
# Pivot the infected and recovered columns to long format
pivot_longer(
c("infected", "recovered"),
names_to = "status",
values_to = "date"
)
## # A tibble: 52 x 3
## patient status date
## <fct> <chr> <dttm>
## 1 A infected 2020-02-08 00:00:00
## 2 A recovered 2020-03-28 00:00:00
## 3 B infected 2020-02-19 00:00:00
## 4 B recovered 2020-03-04 00:00:00
## 5 C infected 2020-02-27 00:00:00
## 6 C recovered 2020-03-19 00:00:00
## 7 D infected 2020-03-05 00:00:00
## 8 D recovered 2020-03-19 00:00:00
## 9 E infected 2020-03-06 00:00:00
## 10 E recovered 2020-03-29 00:00:00
## # … with 42 more rows
Group the data by patient and then complete the date column so that each date between infection and recovery is added using the full_seq() column. At the end, ungroup the data.
patient_df %>%
# Pivot the infected and recovered columns to long format
pivot_longer(-patient, names_to = "status", values_to = "date") %>%
select(-status) %>%
# Group by patient
group_by(patient) %>%
# Complete the date range per patient using full_seq()
complete(
date = full_seq(date, period = 1)
)%>%
# Ungroup the data
ungroup()
## # A tibble: 56,851,226 x 2
## patient date
## <fct> <dttm>
## 1 A 2020-02-08 00:00:00
## 2 A 2020-02-08 00:00:01
## 3 A 2020-02-08 00:00:02
## 4 A 2020-02-08 00:00:03
## 5 A 2020-02-08 00:00:04
## 6 A 2020-02-08 00:00:05
## 7 A 2020-02-08 00:00:06
## 8 A 2020-02-08 00:00:07
## 9 A 2020-02-08 00:00:08
## 10 A 2020-02-08 00:00:09
## # … with 56,851,216 more rows
Each date is now a day on which a patient was sick, count the dates and name the new variable n_sick.
# DATA <- patient_df %>%
# # Pivot the infected and recovered columns to long format
# pivot_longer(-patient, names_to = "status", values_to = "date") %>%
# select(-status) %>%
# # Group by patient
# group_by(patient) %>%
# # Complete the date range per patient using full_seq()
# complete(date = full_seq(date, period = 1)) %>%
# # Ungroup the data
# ungroup() %>%
# # Count the dates, the count goes in the n_sick variable
# count(date, name = "n_sick")
# ggplot(DATA, aes(x = date, y = n_sick))+
# geom_line()
Imagine you’re an office facility manager and want to know how many people are present throughout the day. You’ve installed a sensor at the entrance that counts the number of people entering and leaving the building. The sensor sends an update at the end of every 20 minute time slot if at least one person passed.
To create a dataset ready for visualization, you’ll combine the different techniques you’ve learned so far.
sensor_df <- read.delim("DATABASE/sensor_df.txt", sep = ",") %>% select(., c(-"X"))
sensor_df <- sensor_df %>%
mutate(
time1 = as.POSIXct(time, origin = '1970-01-01', tz = 'GMT')
) %>%
#quitamos la columna inicial
select(.,c (
-"time"
)) %>%
#renombramos aquellacolumna recientemente creada
setnames(.,
c("enter", "exit", "time"))
head(sensor_df)
## enter exit time
## 1 1 0 2020-01-01 07:00:00
## 2 2 0 2020-01-01 07:20:00
## 3 3 -1 2020-01-01 07:40:00
## 4 5 0 2020-01-01 08:00:00
## 5 3 0 2020-01-01 08:20:00
## 6 4 -1 2020-01-01 08:40:00
The dplyr and ggplot2 packages have been pre-loaded for you.
Complete the time variable by using the seq() function to create a sequence between the min and max values with an interval of “20 min”. Fill NA values of enter and exit with 0L.
sensor_df %>%
# Complete the time column with a 20 minute interval
complete(time = seq(min(time), max(time), by = "20 min"),
fill = list(enter = 0L, exit = 0L))
## # A tibble: 36 x 3
## time enter exit
## <dttm> <int> <int>
## 1 2020-01-01 07:00:00 1 0
## 2 2020-01-01 07:20:00 2 0
## 3 2020-01-01 07:40:00 3 -1
## 4 2020-01-01 08:00:00 5 0
## 5 2020-01-01 08:20:00 3 0
## 6 2020-01-01 08:40:00 4 -1
## 7 2020-01-01 09:00:00 4 0
## 8 2020-01-01 09:20:00 1 0
## 9 2020-01-01 09:40:00 1 -1
## 10 2020-01-01 10:00:00 0 0
## # … with 26 more rows
Calculate the total_inside variable by taking the cumulative sum of enter plus exit using the cumsum() function.
sensor_df %>%
# Complete the time column with a 20 minute interval
complete(time = seq(min(time), max(time), by = "20 min"),
fill = list(enter = 0L, exit = 0L)) %>%
# Calculate the total number of people inside
mutate(total_inside = cumsum(enter + exit))
## # A tibble: 36 x 4
## time enter exit total_inside
## <dttm> <int> <int> <int>
## 1 2020-01-01 07:00:00 1 0 1
## 2 2020-01-01 07:20:00 2 0 3
## 3 2020-01-01 07:40:00 3 -1 5
## 4 2020-01-01 08:00:00 5 0 10
## 5 2020-01-01 08:20:00 3 0 13
## 6 2020-01-01 08:40:00 4 -1 16
## 7 2020-01-01 09:00:00 4 0 20
## 8 2020-01-01 09:20:00 1 0 21
## 9 2020-01-01 09:40:00 1 -1 21
## 10 2020-01-01 10:00:00 0 0 21
## # … with 26 more rows
Pivot the enter and exit columns to long format. The column names should go in the direction variable, the values in n_people.
sensor_df %>%
# Complete the time column with a 20 minute interval
complete(time = seq(min(time), max(time), by = "20 min"),
fill = list(enter = 0L, exit = 0L)) %>%
# Calculate the total number of people inside
mutate(total_inside = cumsum(enter + exit)) %>%
# Pivot the enter and exit columns to long format
pivot_longer(
c("enter", "exit"),
names_to = "direction",
values_to = "n_people"
)
## # A tibble: 72 x 4
## time total_inside direction n_people
## <dttm> <int> <chr> <int>
## 1 2020-01-01 07:00:00 1 enter 1
## 2 2020-01-01 07:00:00 1 exit 0
## 3 2020-01-01 07:20:00 3 enter 2
## 4 2020-01-01 07:20:00 3 exit 0
## 5 2020-01-01 07:40:00 5 enter 3
## 6 2020-01-01 07:40:00 5 exit -1
## 7 2020-01-01 08:00:00 10 enter 5
## 8 2020-01-01 08:00:00 10 exit 0
## 9 2020-01-01 08:20:00 13 enter 3
## 10 2020-01-01 08:20:00 13 exit 0
## # … with 62 more rows
Use ggplot() to visualize the n_people in the building over time. Use the fill argument to color the area plot by direction.
sensor_df %>%
# Complete the time column with a 20 minute interval
complete(time = seq(min(time), max(time), by = "20 min"),
fill = list(enter = 0L, exit = 0L)) %>%
# Calculate the total number of people inside
mutate(total_inside = cumsum(enter + exit)) %>%
# Pivot the enter and exit columns to long format
pivot_longer(enter:exit, names_to = "direction", values_to = "n_people") %>%
# Plot the number of people over time, fill by direction
ggplot(., aes(time, n_people, fill = direction)) +
geom_area() +
geom_line(aes(y = total_inside))