This document covers the first two activities for the Becoming a Data Scientist Learning Club (http://www.becomingadatascientist.com/learningclub/).

The first activity is to find, import and explore a dataset. The second activity is to create visuals for exploratory data analysis. I will combine both activities here.

For my project, I decided to explore the top grossing movies released in 2015 according to boxofficemojo.com.

I scraped and combined data from boxofficemojo.com and omdbapi.com in order to get the information I wanted.

This data was captured at the end of December.
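The combining step described above can be sketched roughly as follows. This is a minimal illustration, not the scripts I actually ran: the `combine` helper and the choice of `Title` as the join key are assumptions made for the example, and the sample values are copied from the tables later in this post.

```python
# Rows as they might look after pulling from each source. The field names
# mirror the dataset columns; the values come from the tables in this post.
box_office = [
    {"Title": "Jurassic World", "TotalGross": 652270625, "Opening": 208806270},
    {"Title": "Inside Out", "TotalGross": 356461711, "Opening": 90440272},
    {"Title": "Minions", "TotalGross": 336045770, "Opening": 115718405},
]
omdb = [
    {"Title": "Jurassic World", "Rated": "PG-13", "imdbRating": 7.1},
    {"Title": "Inside Out", "Rated": "PG", "imdbRating": 8.4},
]


def combine(box_rows, omdb_rows, key="Title"):
    """Left-join omdb_rows onto box_rows using `key` (an assumed join field)."""
    lookup = {row[key]: row for row in omdb_rows}
    combined = []
    for row in box_rows:
        merged = {**row, **lookup.get(row[key], {})}
        # Keep movies with no match, but mark the missing fields explicitly.
        for field in ("Rated", "imdbRating"):
            merged.setdefault(field, None)
        combined.append(merged)
    return combined


movies = combine(box_office, omdb)
for movie in movies:
    print(movie["Title"], movie["Rated"], movie["imdbRating"])
```

Keeping unmatched rows with explicit `None` values, rather than dropping them, makes it easier to see later how much data is missing.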

The Data

My movies dataset contains 422 movies and 26 columns, with the following structure:

##  [1] "imdb_id"            "moviedb_id"         "IsAdult"           
##  [4] "Title"              "Rated"              "ReleaseDate"       
##  [7] "Status"             "Budget"             "Runtime"           
## [10] "OriginalLanguage"   "Director"           "Writer"            
## [13] "Genres"             "Countries"          "Type"              
## [16] "imdbRating"         "imdbVotes"          "Metascore"         
## [19] "TotalGross"         "TotalGrossTheaters" "Opening"           
## [22] "OpeningTheaters"    "OpenDate"           "Revenue"           
## [25] "CloseDate"          "BubbleSize"

My Findings

The 422 movies in the dataset are from 80 different countries and span 19 different categories. The movie with the biggest budget was Spectre, at $300,000,000. Surprisingly, Jurassic World, the top-grossing movie in the dataset and the one with the second-best opening weekend, had one of the lowest budgets among the big-budget movies listed below, at $150,000,000.

Top Grossing Movies of 2015

| Title | Rated | TotalGross | imdbRating |
|---|---|---|---|
| Jurassic World | PG-13 | 652,270,625 | 7.1 |
| Avengers: Age of Ultron | PG-13 | 459,005,868 | 7.6 |
| Inside Out | PG | 356,461,711 | 8.4 |
| Furious 7 | PG-13 | 353,007,020 | 7.3 |
| Minions | PG | 336,045,770 | 6.5 |
| Star Wars: The Force Awakens | PG-13 | 288,076,417 | 8.9 |
| The Hunger Games: Mockingjay - Part 2 | PG-13 | 255,685,045 | 7.1 |
| The Martian | PG-13 | 224,003,532 | 8.2 |
| Cinderella | PG | 201,151,353 | 7.1 |
| Mission: Impossible – Rogue Nation | PG-13 | 195,042,377 | 7.5 |

Movies with Best Opening Weekends of 2015

| Title | Rated | Opening | imdbRating |
|---|---|---|---|
| Star Wars: The Force Awakens | PG-13 | 247,966,675 | 8.9 |
| Jurassic World | PG-13 | 208,806,270 | 7.1 |
| Avengers: Age of Ultron | PG-13 | 191,271,109 | 7.6 |
| Furious 7 | PG-13 | 147,187,040 | 7.3 |
| Minions | PG | 115,718,405 | 6.5 |
| The Hunger Games: Mockingjay - Part 2 | PG-13 | 102,665,981 | 7.1 |
| Inside Out | PG | 90,440,272 | 8.4 |
| Fifty Shades of Grey | R | 85,171,450 | 4.1 |
| Spectre | PG-13 | 70,403,148 | 7.0 |
| Pitch Perfect 2 | PG-13 | 69,216,890 | 6.6 |
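Both rankings above are the same operation applied to different columns: sort by a numeric column in descending order and keep the first few rows. Here is a small sketch of that idea. The `top_n` helper is a name made up for this example, and the four rows are a subset copied from the tables above.

```python
def top_n(rows, column, n=10):
    """Return the n rows with the largest value in `column`."""
    return sorted(rows, key=lambda row: row[column], reverse=True)[:n]


# A few rows copied from the tables above (amounts in US dollars).
rows = [
    {"Title": "Inside Out", "TotalGross": 356461711, "Opening": 90440272},
    {"Title": "Star Wars: The Force Awakens", "TotalGross": 288076417, "Opening": 247966675},
    {"Title": "Jurassic World", "TotalGross": 652270625, "Opening": 208806270},
    {"Title": "Furious 7", "TotalGross": 353007020, "Opening": 147187040},
]

by_gross = [row["Title"] for row in top_n(rows, "TotalGross", n=3)]
by_opening = [row["Title"] for row in top_n(rows, "Opening", n=3)]
print(by_gross)
print(by_opening)
```

Note how Star Wars: The Force Awakens leads on opening weekend but not on total gross, which fits with the data having been captured shortly after its release.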

Big Budget Movies of 2015

| Title | Rated | Budget | Opening | imdbRating |
|---|---|---|---|---|
| Spectre | PG-13 | 300,000,000 | 70,403,148 | 7.0 |
| Avengers: Age of Ultron | PG-13 | 250,000,000 | 191,271,109 | 7.6 |
| Star Wars: The Force Awakens | PG-13 | 200,000,000 | 247,966,675 | 8.9 |
| Tomorrowland | PG | 190,000,000 | 33,028,165 | 6.5 |
| Furious 7 | PG-13 | 190,000,000 | 147,187,040 | 7.3 |
| Jupiter Ascending | PG-13 | 176,000,003 | 18,372,372 | 5.4 |
| Inside Out | PG | 175,000,000 | 90,440,272 | 8.4 |
| Jurassic World | PG-13 | 150,000,000 | 208,806,270 | 7.1 |
| Mad Max: Fury Road | R | 150,000,000 | 45,428,128 | 8.2 |
| Mission: Impossible – Rogue Nation | PG-13 | 150,000,000 | 55,520,089 | 7.5 |
| Pan | PG | 150,000,000 | 15,315,435 | 6.0 |
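One way to compare these movies is to look at how much of the budget the opening weekend alone covered. The ratio below is just an illustration built from the Budget and Opening columns above. It is not a column in my dataset, and `opening_share` is a name made up for this sketch.

```python
# (Title, Budget, Opening) copied from the table above, in US dollars.
big_budget = [
    ("Spectre", 300000000, 70403148),
    ("Star Wars: The Force Awakens", 200000000, 247966675),
    ("Jurassic World", 150000000, 208806270),
    ("Pan", 150000000, 15315435),
]


def opening_share(budget, opening):
    """Opening weekend gross as a fraction of the budget, rounded to 2 places."""
    return round(opening / budget, 2)


shares = {title: opening_share(budget, opening) for title, budget, opening in big_budget}
for title, share in shares.items():
    print(f"{title}: {share:.2f}")
```

By this measure, Jurassic World's opening weekend came to about 1.39 times its budget, while Spectre's came to about 0.23 times its budget.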

Finally, let’s take a look at all the movies from the dataset by their budget and IMDB rating.

Lessons learned

Data on the internet is messy. It took me way longer than expected to get the data I needed. I thought that since I would be getting the data from an API, it would be nice and clean. Don’t get me wrong, it wasn’t all that bad, but I ended up with much more missing data than I thought I would.
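A quick count of missing values per column is one way to see the size of this problem early. The sketch below is only an illustration: the three rows are made up (they are not from my dataset), and treating `None`, an empty string and "N/A" as the missing markers is an assumption.

```python
from collections import Counter

# Values treated as "missing" in this sketch (an assumption).
MISSING = (None, "", "N/A")


def missing_counts(rows):
    """Count, per column, how many rows have a missing value."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value in MISSING:
                counts[column] += 1
    return counts


# Made-up rows, purely for illustration.
sample = [
    {"Title": "Movie A", "Budget": None, "Metascore": "N/A"},
    {"Title": "Movie B", "Budget": 1000000, "Metascore": ""},
    {"Title": "Movie C", "Budget": None, "Metascore": 70},
]

counts = missing_counts(sample)
for column, count in counts.items():
    print(column, count)
```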

Time is money, friend! Wow, for a side project, this took a lot of time. I’m not complaining: I’m doing something that I like, on a topic I chose, but the truth is I didn’t think it would take me this much time when I first started. With limited free time, this became a big issue.

Can’t be too ambitious. I wanted a dataset that could answer so many questions that I think I made this a bit more complicated than it had to be. For a first assignment and with limited time, I should have picked something a bit less ambitious.

I need to be more organized. I wrote a few Python scripts and a few R scripts to get the data that I wanted. Some things I saved in a script and others I did directly in the console, so I didn’t end up with a reproducible script. I need to be more organized with my scripts so I end up with a fully reproducible analysis that I can share with others.

plot.ly: I tried plot.ly for the first time with this project, and I have to say I’m impressed with the features. I still need to figure out how to make the plots look the way I want (margins, padding, etc.), but I like what I’ve seen so far.