I have always had an interest in movies, and recently I have been enjoying watching movies that are in the top 500 on IMDb, one of the most credible sites for movie information. I have also found an interest in looking at certain directors' movies because I have realized how good they are at their craft. On the other side, I have loved looking at actors' performances and how well they get into character. One of my favorite movies is The Social Network, and I think the cast of that movie plays their characters perfectly. Throughout this document, I will be scraping the IMDb website for information, using data from the top 500 movies with at least 25,000 votes.
How much of an impact do directors, actors, and plot have on a movie being in the top 500 on IMDb?
I used rvest and xml2 to read the HTML files, tidyverse (which includes dplyr) for data manipulation, ggplot2 for visualizations, knitr for adjusting the R Markdown output, and tidytext to evaluate the words in the summaries.
I obtained the original link from the IMDb website, where I was able to set the search to the parameters that I wanted. I decided that the best way to filter the data was to keep only movies with at least 25,000 user votes, to show that each rating has credibility from users. I think that user votes matter the most because the users are ordinary people who have seen the movie, rather than only a movie critic's opinion. I then had to read the URL as HTML to convert it into data that I can use.
I inspected the page for the values that I wanted to take: the movie name, release year, IMDb rating, movie summary, director, lead actor, and the number of votes the movie has. After getting the values, I put them into a tibble to make the data easier to see and use.
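The read-and-extract step can be sketched roughly as follows. This is only an illustration: the HTML snippet, the class names (`.title`, `.year`, `.rating`, `.votes`), and the movie values are made-up placeholders, not IMDb's real markup or data. On the real page, the selectors would be whatever the browser inspector shows.

```r
library(rvest)
library(tibble)

# Made-up markup standing in for one IMDb results page (assumption:
# the real page uses different tags and class names).
page_html <- '
<ul>
  <li class="movie">
    <span class="title">Example Movie A</span>
    <span class="year">1999</span>
    <span class="rating">8.4</span>
    <span class="votes">30,000</span>
  </li>
  <li class="movie">
    <span class="title">Example Movie B</span>
    <span class="year">2010</span>
    <span class="rating">7.8</span>
    <span class="votes">125,000</span>
  </li>
</ul>'

# read_html() accepts a URL as well as literal HTML like this string.
page <- read_html(page_html)

# Small helper (hypothetical name) that returns the text of every match.
grab <- function(css) page %>% html_elements(css) %>% html_text2()

movies <- tibble(
  name   = grab(".title"),
  year   = as.integer(grab(".year")),
  rating = as.numeric(grab(".rating")),
  votes  = as.numeric(gsub(",", "", grab(".votes")))  # drop thousands separators
)

print(movies)
```

The same pattern extends to the summary, director, and lead actor columns by adding one selector per field.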
I had to create a function to properly use the second page of the data. I used the same values that I inspected but ran the function with the URL from the second page to get more movies to work with. At the end of this step, I put the results from both URLs into one data frame to have it all in one location.
After the setup process, I have a data frame that I can use to answer my original question.
To start answering the question, I wanted to see which directors had at least 5 movies in my top 500 list.
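A minimal sketch of that idea is below. The function name `scrape_page`, the class names, and the two tiny HTML strings are all assumptions made for illustration. In practice the two inputs would be the URLs of the first and second results pages, which `read_html()` also accepts.

```r
library(rvest)
library(tibble)
library(dplyr)

# Hypothetical helper: parse one results page (a URL or literal HTML)
# and return its movies as a tibble.
scrape_page <- function(source) {
  page <- read_html(source)
  tibble(
    name   = page %>% html_elements(".title") %>% html_text2(),
    rating = page %>% html_elements(".rating") %>% html_text2() %>% as.numeric()
  )
}

# Made-up stand-ins for the two results pages.
page_one <- '<div><span class="title">Example Movie A</span><span class="rating">8.4</span></div>'
page_two <- '<div><span class="title">Example Movie B</span><span class="rating">7.8</span></div>'

# Run the same extraction on both pages and stack the results.
all_movies <- bind_rows(scrape_page(page_one), scrape_page(page_two))

print(all_movies)
```

Wrapping the extraction in a function keeps the selectors in one place, so a change to the page layout only has to be fixed once.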
| director | n |
|---|---|
| Akira Kurosawa | 10 |
| Ingmar Bergman | 8 |
| Billy Wilder | 7 |
| Christopher Nolan | 7 |
| Martin Scorsese | 7 |
| Stanley Kubrick | 7 |
| Steven Spielberg | 7 |
| Alfred Hitchcock | 6 |
| Andrei Tarkovsky | 6 |
| Charles Chaplin | 6 |
I wanted to look at directors because they play a big part in how well movies are made and have the most input on the production. This table lists the directors with the most movies, and we will use this data to build a visualization of their average IMDb ratings. We can see that Akira Kurosawa has the most movies in the top 500, with 10.
In this visualization, we can now see that Christopher Nolan has the highest average IMDb rating. All 7 of his movies have a rating higher than 7.5, which suggests that he is one of the best directors ever.
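Counting movies per director can be sketched with dplyr as below. The data frame here is a made-up placeholder with generic director names, and the column names `title` and `director` are assumptions about how the scraped data is laid out.

```r
library(dplyr)

# Placeholder data: one row per movie.
movies <- tibble(
  title    = paste("Movie", 1:8),
  director = c(rep("Director A", 5), rep("Director B", 2), "Director C")
)

# Count movies per director, keep those with at least 5,
# and sort from most to fewest.
director_counts <- movies %>%
  count(director, sort = TRUE) %>%
  filter(n >= 5)

print(director_counts)
```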
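One way to compute and plot the average rating per director is sketched below. The ratings are invented placeholder values and the column names are assumptions. The bar chart is only one possible way to draw the comparison.

```r
library(dplyr)
library(ggplot2)

# Placeholder data: ratings are made up for illustration.
movies <- tibble(
  director = c("Director A", "Director A", "Director B", "Director B"),
  rating   = c(8.0, 9.0, 7.5, 8.5)
)

# Average rating and number of movies for each director.
avg_ratings <- movies %>%
  group_by(director) %>%
  summarise(avg_rating = mean(rating), n_movies = n()) %>%
  arrange(desc(avg_rating))

# Horizontal bar chart, highest average at the top.
rating_plot <- ggplot(avg_ratings,
                      aes(x = reorder(director, avg_rating), y = avg_rating)) +
  geom_col() +
  coord_flip() +
  labs(x = "Director", y = "Average IMDb rating")

print(avg_ratings)
```

Keeping `n_movies` next to the average makes it easier to tell whether a high average comes from many movies or only a few.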
| lead_actor | n |
|---|---|
| Aamir Khan | 8 |
| Leonardo DiCaprio | 8 |
| Toshirô Mifune | 7 |
| Charles Chaplin | 6 |
| James Stewart | 6 |
After looking at the directors, I wanted to look at the actors in the top 500. Actors have the main roles in the movies and determine how well the story is portrayed.
I found that Aamir Khan and Leonardo DiCaprio are each the lead actor in 8 of the top 500 movies, which suggests that they are among the best actors and that the lead actor affects the ratings. It was interesting to see how many actors are in the top 500 movies.

## Significant words

I wanted to see if the summaries of the movies had any common words in them that might lead to them being in the top 500.
This graph shows the most common words used in the summaries. The movies in the top 500 most often deal with life, war, and the world, which suggests that movies about these topics tend to reach the top 500. One thing that I found interesting is that all of these common words are about the characters and setting of the movie. The one that stands out to me is war. I personally have not seen many war movies, and it hadn't crossed my mind that the category would be that big.
I thought it would be interesting to see if more plots had positive words, because I initially thought that positive movies would have a stronger plot to build on. The results show that "love" is the most commonly used word, and it is positive, which points to many movies in the top 500 being about romance or a love story.
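The word counting behind a graph like this can be sketched with tidytext as follows. The two summaries are made-up sentences, and removing tidytext's built-in `stop_words` list is an assumption about how filler words were handled.

```r
library(dplyr)
library(tidytext)

# Made-up summaries standing in for the scraped ones.
summaries <- tibble(
  movie   = c("Example Movie A", "Example Movie B"),
  summary = c("A soldier returns from war to find love.",
              "Two friends survive the war and rebuild their life.")
)

word_counts <- summaries %>%
  unnest_tokens(word, summary) %>%        # one lowercase word per row
  anti_join(stop_words, by = "word") %>%  # drop filler words like "the"
  count(word, sort = TRUE)                # most frequent words first

print(word_counts)
```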
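Labeling words as positive or negative can be sketched by joining the summary words against a sentiment lexicon. The example below assumes the Bing lexicon from `get_sentiments("bing")`, since the lexicon actually used is not stated here, and the summaries are made up.

```r
library(dplyr)
library(tidytext)

# Made-up summaries standing in for the scraped ones.
summaries <- tibble(
  movie   = c("Example Movie A", "Example Movie B"),
  summary = c("A story of love and loss.",
              "Their love survives a terrible betrayal.")
)

sentiment_counts <- summaries %>%
  unnest_tokens(word, summary) %>%
  # Keep only words found in the lexicon, with their sentiment label.
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)

print(sentiment_counts)
```

Words that are not in the lexicon are dropped by the join, so the counts only describe the subset of words the lexicon knows about.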
Finally, we need to save this data as a CSV so that we have a copy of the data that does not change over time. First we will set our working directory to the source file location. Then we can write a command to save the file in our folder.
After looking at many factors in this data, I can see that actors and directors play a big role in how well a movie performs. After looking at the summaries of the movies, I found some key words that seem to help a plot be popular. Looking at the positivity and negativity of the words was also interesting because it gives another sense of what the movies are about and how well they will be rated.
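Saving the data can be sketched as below. The file name `imdb_top_500.csv` is a hypothetical choice, the data frame is a placeholder, and `tempdir()` stands in for the project folder so that the example can run anywhere.

```r
# Placeholder data standing in for the scraped data frame.
movies <- data.frame(
  name   = c("Example Movie A", "Example Movie B"),
  rating = c(8.4, 7.8)
)

# In the project this would be a path inside the working directory;
# tempdir() is used here only to keep the sketch self-contained.
out_path <- file.path(tempdir(), "imdb_top_500.csv")

# row.names = FALSE avoids writing an extra column of row numbers.
write.csv(movies, out_path, row.names = FALSE)

# Reading the file back confirms that the snapshot was saved.
saved <- read.csv(out_path)
print(saved)
```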