This document covers the first two activities for the Becoming a Data Scientist Learning Club (http://www.becomingadatascientist.com/learningclub/).
The first activity is to find, import and explore a dataset. The second activity is to create visuals for exploratory data analysis. I will combine both activities here.
For my project, I decided to explore the top-grossing movies released in 2015 according to boxofficemojo.com.
I scraped and combined data from boxofficemojo.com and omdbapi.com in order to get the information I wanted.
This data was captured at the end of December.
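Combining two sources like these usually comes down to joining records on a shared key. Below is a minimal sketch of that idea in Python, assuming both sources expose a `Title` field and that titles only differ in case and punctuation. The helper names (`normalize_title`, `merge_by_title`) and the sample rows are hypothetical; the numbers are copied from the tables in this post.

```python
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation/whitespace into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def merge_by_title(box_office_rows, omdb_rows):
    """Left-join OMDb fields onto box office rows using the normalized title."""
    omdb_by_title = {normalize_title(r["Title"]): r for r in omdb_rows}
    merged = []
    for row in box_office_rows:
        match = omdb_by_title.get(normalize_title(row["Title"]), {})
        combined = dict(match)
        combined.update(row)  # box office values win when both sources share a key
        merged.append(combined)
    return merged


# Hypothetical sample rows (assumption: each source is a list of dicts).
box_office = [
    {"Title": "Jurassic World", "TotalGross": 652270625},
    {"Title": "Inside Out", "TotalGross": 356461711},
]
omdb = [
    {"Title": "jurassic world", "Rated": "PG-13", "imdbRating": 7.1},
]

movies = merge_by_title(box_office, omdb)
```

Rows without a match keep only their box office fields, which is one way missing values can creep into a combined dataset.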
My movies dataset contains 422 movies, and I ended up with the following structure:
## [1] "imdb_id" "moviedb_id" "IsAdult"
## [4] "Title" "Rated" "ReleaseDate"
## [7] "Status" "Budget" "Runtime"
## [10] "OriginalLanguage" "Director" "Writer"
## [13] "Genres" "Countries" "Type"
## [16] "imdbRating" "imdbVotes" "Metascore"
## [19] "TotalGross" "TotalGrossTheaters" "Opening"
## [22] "OpeningTheaters" "OpenDate" "Revenue"
## [25] "CloseDate" "BubbleSize"
The 422 movies in the dataset are from 80 different countries and span 19 different categories. The movie with the biggest budget was Spectre, at $300,000,000. Surprisingly, Jurassic World, which had the highest total gross and the second-highest opening weekend, had a budget of $150,000,000, half of Spectre’s and among the lowest of the big-budget movies listed below.
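Counts like the number of countries and categories can be derived from multi-valued text columns. Here is a minimal sketch, assuming that `Genres` and `Countries` hold comma-separated strings and that the "categories" correspond to the `Genres` column. The `distinct_values` helper and the sample rows are hypothetical, not the actual data.

```python
def distinct_values(rows, column, sep=","):
    """Collect the distinct entries of a delimited text column across all rows."""
    seen = set()
    for row in rows:
        # Treat a missing or empty field as having no entries.
        for part in (row.get(column) or "").split(sep):
            part = part.strip()
            if part:
                seen.add(part)
    return seen


# Hypothetical rows; the real dataset has 422 movies.
sample = [
    {"Title": "A", "Genres": "Action, Adventure", "Countries": "USA"},
    {"Title": "B", "Genres": "Adventure, Comedy", "Countries": "USA, UK"},
    {"Title": "C", "Genres": None, "Countries": ""},
]

genres = distinct_values(sample, "Genres")
countries = distinct_values(sample, "Countries")
print(len(genres), len(countries))
```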

Top 10 movies by total gross:

| Title | Rated | TotalGross | imdbRating |
|---|---|---|---|
| Jurassic World | PG-13 | 652,270,625 | 7.1 |
| Avengers: Age of Ultron | PG-13 | 459,005,868 | 7.6 |
| Inside Out | PG | 356,461,711 | 8.4 |
| Furious 7 | PG-13 | 353,007,020 | 7.3 |
| Minions | PG | 336,045,770 | 6.5 |
| Star Wars: The Force Awakens | PG-13 | 288,076,417 | 8.9 |
| The Hunger Games: Mockingjay - Part 2 | PG-13 | 255,685,045 | 7.1 |
| The Martian | PG-13 | 224,003,532 | 8.2 |
| Cinderella | PG | 201,151,353 | 7.1 |
| Mission: Impossible – Rogue Nation | PG-13 | 195,042,377 | 7.5 |

Top 10 movies by opening weekend:

| Title | Rated | Opening | imdbRating |
|---|---|---|---|
| Star Wars: The Force Awakens | PG-13 | 247,966,675 | 8.9 |
| Jurassic World | PG-13 | 208,806,270 | 7.1 |
| Avengers: Age of Ultron | PG-13 | 191,271,109 | 7.6 |
| Furious 7 | PG-13 | 147,187,040 | 7.3 |
| Minions | PG | 115,718,405 | 6.5 |
| The Hunger Games: Mockingjay - Part 2 | PG-13 | 102,665,981 | 7.1 |
| Inside Out | PG | 90,440,272 | 8.4 |
| Fifty Shades of Grey | R | 85,171,450 | 4.1 |
| Spectre | PG-13 | 70,403,148 | 7.0 |
| Pitch Perfect 2 | PG-13 | 69,216,890 | 6.6 |

Movies with the biggest budgets:

| Title | Rated | Budget | Opening | imdbRating |
|---|---|---|---|---|
| Spectre | PG-13 | 300,000,000 | 70,403,148 | 7.0 |
| Avengers: Age of Ultron | PG-13 | 250,000,000 | 191,271,109 | 7.6 |
| Star Wars: The Force Awakens | PG-13 | 200,000,000 | 247,966,675 | 8.9 |
| Tomorrowland | PG | 190,000,000 | 33,028,165 | 6.5 |
| Furious 7 | PG-13 | 190,000,000 | 147,187,040 | 7.3 |
| Jupiter Ascending | PG-13 | 176,000,003 | 18,372,372 | 5.4 |
| Inside Out | PG | 175,000,000 | 90,440,272 | 8.4 |
| Jurassic World | PG-13 | 150,000,000 | 208,806,270 | 7.1 |
| Mad Max: Fury Road | R | 150,000,000 | 45,428,128 | 8.2 |
| Mission: Impossible – Rogue Nation | PG-13 | 150,000,000 | 55,520,089 | 7.5 |
| Pan | PG | 150,000,000 | 15,315,435 | 6.0 |
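
The budget table above has eleven rows because four movies share the $150,000,000 budget. One plausible way to get that behavior is a top-N selection that keeps ties with the last ranked row. The sketch below shows this in Python; `parse_amount` and `top_n_with_ties` are hypothetical helpers, and the budgets are copied from the table above.

```python
def parse_amount(text):
    """Turn a formatted amount such as '150,000,000' into an int."""
    return int(text.replace(",", ""))


def top_n_with_ties(rows, column, n):
    """Return the n largest rows by column, plus any rows tied with the n-th."""
    ranked = sorted(rows, key=lambda r: r[column], reverse=True)
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1][column]
    return [r for r in ranked if r[column] >= cutoff]


# Budgets copied from the table above.
budgets = [
    ("Spectre", "300,000,000"),
    ("Avengers: Age of Ultron", "250,000,000"),
    ("Star Wars: The Force Awakens", "200,000,000"),
    ("Tomorrowland", "190,000,000"),
    ("Furious 7", "190,000,000"),
    ("Jupiter Ascending", "176,000,003"),
    ("Inside Out", "175,000,000"),
    ("Jurassic World", "150,000,000"),
    ("Mad Max: Fury Road", "150,000,000"),
    ("Mission: Impossible – Rogue Nation", "150,000,000"),
    ("Pan", "150,000,000"),
]
rows = [{"Title": title, "Budget": parse_amount(amount)} for title, amount in budgets]

top = top_n_with_ties(rows, "Budget", 10)
print(len(top))  # 11: the four movies at 150,000,000 are all kept
```
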
Finally, let’s take a look at all the movies in the dataset by their budget and IMDb rating.
**Data on the internet is messy.** It took me way longer than expected to get the data I needed. I thought that since I was getting the data from an API, it would be nice and clean. Don’t get me wrong, it wasn’t all that bad, but I ended up with much more missing data than I thought I would.
**Time is money, friend!** Wow, for a side project, this took a lot of time. I’m not complaining: I’m doing something I like, on a topic I chose. But the truth is I didn’t think it would take this much time when I first started. With limited free time, this became a big issue.
**Can’t be too ambitious.** I wanted a dataset that could answer so many questions that I think I made this more complicated than it had to be. For a first assignment and with limited time, I should have picked something a bit less ambitious.
**I need to be more organized.** I wrote a few Python scripts and a few R scripts to get the data I wanted. Some things I saved in a script, others I did directly in the console; the result is that I didn’t end up with a reproducible script. I need to be more organized with scripts so I can end up with a fully reproducible analysis that I can share with others.
**plot.ly.** I tried plot.ly for the first time with this and I have to say I’m impressed with the features. I still need to figure out how to adjust how the plots look (margins, padding, etc.), but I like what I’ve seen so far.
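
To make the missing-data lesson above more concrete, a quick per-column count of unusable values is a cheap first check on any scraped dataset. This is a minimal sketch, assuming missing values show up as `None`, an empty string, or the string `"N/A"`. That set of markers, the `count_missing` helper, and the sample rows are all assumptions for illustration.

```python
from collections import Counter

# Assumption: these markers stand for "missing" in the scraped data.
MISSING_MARKERS = {None, "", "N/A"}


def count_missing(rows, columns):
    """Count, per column, how many rows have no usable value."""
    missing = Counter({column: 0 for column in columns})
    for row in rows:
        for column in columns:
            # A key that is absent from the row counts as missing too.
            if row.get(column) in MISSING_MARKERS:
                missing[column] += 1
    return missing


# Hypothetical rows, not taken from the real dataset.
sample = [
    {"Title": "Movie A", "Budget": 150000000, "Metascore": 59},
    {"Title": "Movie B", "Budget": "N/A", "Metascore": ""},
    {"Title": "Movie C", "Metascore": None},
]

report = count_missing(sample, ["Title", "Budget", "Metascore"])
for column, count in report.items():
    print(f"{column}: {count} of {len(sample)} missing")
```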