Connecting R to imdb SQL database

The pupose of this project is to answer a series of questions and run some analysis of imdb.com’s database using SQL code.A schema of the database is presented below. All the information presented is as of November 2015.

To do - Questions 7

Installing the required packages for this project and database. Now we can get started!

Flowchart of the Database

Flowchart of the Database

Questions

How many actors are there in the database? How many movies?

numactors = as.integer(dbGetQuery(db, 'SELECT COUNT(idactors) FROM actors;'))
nummovies = as.integer(dbGetQuery(db, 'SELECT COUNT(idmovies) FROM movies;'))
numactors
## [1] 3500167
nummovies
## [1] 1298737

What time period does the database cover?

years = dbGetQuery(db, 'SELECT DISTINCT year FROM movies;')
sort(years$year)
##   [1]    1    2    3    4    5    6    7    8    9   10   11   12   13   14
##  [15]   15   16   17   18   19   20   21   23   24   25   27   30   32   35
##  [29]   39   40   43   45   48   53   60   70  100  101  206  216  500  718
##  [43] 1585 1640 1725 1776 1809 1811 1816 1827 1830 1865 1867 1870 1873 1874
##  [57] 1878 1881 1883 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897
##  [71] 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911
##  [85] 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925
##  [99] 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939
## [113] 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953
## [127] 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967
## [141] 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981
## [155] 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995
## [169] 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
## [183] 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023
## [197] 2024 2025

Since the first film cameras were invented in 1891, we can safely conclude that anything listed as having a release date before 1891 as inaccurate. From this, we can find the range of movies that the database covers that dont include obviously incorrect years

dbGetQuery(db, 'SELECT MAX(year) FROM movies;')
##   MAX(year)
## 1      2025

The range of movies presented lie between 1891 and 2025 (unreleased titles)

What proportion of the actors are female? male?

femaleactors = dbGetQuery(db, 'SELECT * FROM actors WHERE gender IS NULL;')
nrow(femaleactors)/numactors
## [1] 0.3537034
maleactors = dbGetQuery(db, 'SELECT * FROM actors WHERE gender = "1";')
nrow(maleactors)/ numactors
## [1] 0.6462966

From this we can see that the ratio of female to male actors in the database is close to 2:1

What proportion of the entries in the movies table are actual movies and what proportion are television series, etc.?

#type 1 is adult entertainment
#type 2 is video games
#type 3 is movies
#type 'NULL' is series

dbGetQuery(db, 'SELECT DISTINCT type FROM movies;')
##   type
## 1    3
## 2   NA
## 3    1
## 4    2
dbGetQuery(db, 'SELECT * FROM movies WHERE type IS NULL LIMIT 25')
##    idmovies                                      title year type number
## 1        12                   This American Life Live! 2012   NA     NA
## 2        33                            Tweeduizendseks 2010   NA     NA
## 3        35                    Il barbiere di Siviglia 1992   NA     NA
## 4        54                      Meus Prmios Nick 2012 2012   NA     NA
## 5        76                      2nd Annual BET Awards 2002   NA     NA
## 6        78                      3rd Annual BET Awards 2003   NA     NA
## 7        79                      4th Annual BET Awards 2004   NA     NA
## 8        80                            BET Awards 2006 2006   NA     NA
## 9        81       Nickelodeon Kids' Choice Awards 2000 2000   NA     NA
## 10       82    The 16th Annual Soul Train Music Awards 2002   NA     NA
## 11       83    The 17th Annual Soul Train Music Awards 2003   NA     NA
## 12       84                The Teen Choice Awards 2001 2001   NA     NA
## 13      135        50 Most Awesomely Bad Songs... Ever 2004   NA     NA
## 14      165                 2006 American Music Awards 2006   NA     NA
## 15      166     Walt Disney World Christmas Day Parade 2005   NA     NA
## 16      178                       Happy Birthday 2 You 2000   NA     NA
## 17      179                       Murder on the Blade? 2003   NA     NA
## 18      180                        TV2 Big Comedy Gala 2000   NA     NA
## 19      181            What Now! 20th Birthday Special 2001   NA     NA
## 20      191                              Disappearance 2002   NA     NA
## 21      193               Super Glue and Angel's Wings 2001   NA     NA
## 22      210                              Sterrencircus 1991   NA     NA
## 23      218                             De todo corazn 2000   NA     NA
## 24      219 Especial Nochevieja 1996: Somos la primera 1997   NA     NA
## 25      220                              Gala FAO 2000 2000   NA     NA
##    location language
## 1      <NA>     <NA>
## 2      <NA>     <NA>
## 3      <NA>     <NA>
## 4      <NA>     <NA>
## 5      <NA>     <NA>
## 6      <NA>     <NA>
## 7      <NA>     <NA>
## 8      <NA>     <NA>
## 9      <NA>     <NA>
## 10     <NA>     <NA>
## 11     <NA>     <NA>
## 12     <NA>     <NA>
## 13     <NA>     <NA>
## 14     <NA>     <NA>
## 15     <NA>     <NA>
## 16     <NA>     <NA>
## 17     <NA>     <NA>
## 18     <NA>     <NA>
## 19     <NA>     <NA>
## 20     <NA>     <NA>
## 21     <NA>     <NA>
## 22     <NA>     <NA>
## 23     <NA>     <NA>
## 24     <NA>     <NA>
## 25     <NA>     <NA>
dbGetQuery(db, 'SELECT * FROM movies WHERE type = "3" LIMIT 25')
##    idmovies                                          title year type
## 1         1                            Night of the Demons 2009    3
## 2         2 The Bad Lieutenant: Port of Call - New Orleans 2009    3
## 3         3                                 Please Like Me 2013    3
## 4         4                         A Woman of Distinction 1950    3
## 5         5                               Around the World 1943    3
## 6         6                                Chain Lightning 1950    3
## 7         7                                O-Kay for Sound 1937    3
## 8         8                                        Cabaret 1936    3
## 9         9                                Four Star Revue 1950    3
## 10       10                                     Music-Hall 1949    3
## 11       11                                        Variety 1936    3
## 12       13                                         Mmesis 2012    3
## 13       14                                   Supernatural 2005    3
## 14       15                                  El hormiguero 2006    3
## 15       16                        Welcome to My Darkside! 2009    3
## 16       17                                     The Shrink 2015    3
## 17       18                                   Los tarantos 1963    3
## 18       19                               La granja tolima 2004    3
## 19       22                                 Aqu hay tomate 2003    3
## 20       23                                       StudZmen 2013    3
## 21       24                                  Between Women 2011    3
## 22       25                           Welcome to Slab City 2012    3
## 23       26         The Tonight Show Starring Jimmy Fallon 2014    3
## 24       27                                    Criss Cross 2010    3
## 25       28                                     Exorsister 2012    3
##    number location language
## 1      NA     <NA>     <NA>
## 2      NA     <NA>     <NA>
## 3      NA     <NA>     <NA>
## 4      NA     <NA>     <NA>
## 5      NA     <NA>     <NA>
## 6      NA     <NA>     <NA>
## 7      NA     <NA>     <NA>
## 8      NA     <NA>     <NA>
## 9      NA     <NA>     <NA>
## 10     NA     <NA>     <NA>
## 11     NA     <NA>     <NA>
## 12     NA     <NA>     <NA>
## 13     NA     <NA>     <NA>
## 14     NA     <NA>     <NA>
## 15     NA     <NA>     <NA>
## 16     NA     <NA>     <NA>
## 17     NA     <NA>     <NA>
## 18     NA     <NA>     <NA>
## 19     NA     <NA>     <NA>
## 20     NA     <NA>     <NA>
## 21     NA     <NA>     <NA>
## 22     NA     <NA>     <NA>
## 23     NA     <NA>     <NA>
## 24     NA     <NA>     <NA>
## 25     NA     <NA>     <NA>
as.integer(dbGetQuery(db, 'SELECT COUNT(*) FROM movies WHERE type = "1" '))/nummovies
## [1] 0.113488
as.integer(dbGetQuery(db, 'SELECT COUNT(*) FROM movies WHERE type = "2" '))/nummovies
## [1] 0.01184535
as.integer(dbGetQuery(db, 'SELECT COUNT(*) FROM movies WHERE type = "3" '))/nummovies
## [1] 0.7813322
as.integer(dbGetQuery(db, 'SELECT COUNT(*) FROM movies WHERE type IS NULL '))/nummovies
## [1] 0.09333452

How many genres are there? What are their names/descriptions?

dbGetQuery(db, 'SELECT DISTINCT COUNT(genre) FROM genres;')
##   COUNT(genre)
## 1           32
genres = dbGetQuery(db, 'SELECT DISTINCT genre FROM genres;')
genres
##           genre
## 1   Documentary
## 2       Reality
## 3        Horror
## 4         Drama
## 5        Comedy
## 6       Musical
## 7          Talk
## 8       Mystery
## 9          News
## 10        Sport
## 11          Sci
## 12      Romance
## 13       Family
## 14        Short
## 15    Biography
## 16        Music
## 17         Game
## 18    Adventure
## 19        Crime
## 20          War
## 21      Fantasy
## 22     Thriller
## 23    Animation
## 24       Action
## 25      History
## 26        Adult
## 27      Western
## 28    Lifestyle
## 29         Film
## 30 Experimental
## 31   Commercial
## 32      Erotica

List the 10 most common genres of movies, showing the number of movies in each of these genres.

moviegenres = dbGetQuery(db, "SELECT idmovies_genres, idgenres, idmovies FROM movies_genres")
#creates table showing movie ids and genre ids

genres = dbGetQuery(db, "SELECT * FROM genres INNER JOIN movies_genres ON genres.idgenres=movies_genres.idgenres")
#this joins genres and movies_genres showing the genre name next to the id

#sorts genres but number of entries
sort(table(genres$genre), decreasing = TRUE)[1:10]
## 
##      Comedy       Drama Documentary     Reality      Family        Talk 
##       28152       20149       14934       10360        8915        7949 
##   Animation       Music     Romance        Game 
##        6797        5222        4679        4367

Has the number of movies in each genre changed over time? Plot the overall number of movies in each year over time, and for each genre.

yeargenre = dbGetQuery(db, 'SELECT COUNT(movies.idmovies) AS count, year, genre
           FROM genres, movies_genres, movies
           WHERE genres.idgenres = movies_genres.idgenres
           AND movies_genres.idmovies = movies.idmovies GROUP BY genre, year')
dbGetQuery(db, 'SELECT genre, year, COUNT(movies.idmovies)
           FROM genres, movies_genres, movies
           WHERE genres.idgenres = movies_genres.idgenres
           AND movies_genres.idmovies = movies.idmovies GROUP BY genre, year LIMIT 10')
##     genre year COUNT(movies.idmovies)
## 1  Action   NA                     48
## 2  Action 1949                      2
## 3  Action 1950                      3
## 4  Action 1951                      2
## 5  Action 1952                      4
## 6  Action 1953                      3
## 7  Action 1954                      2
## 8  Action 1955                      5
## 9  Action 1956                      1
## 10 Action 1957                      5
ordermovies = order(yeargenre$year)
yeargenre = yeargenre[ordermovies,]
yeargenre = na.omit(yeargenre) #remove NA observations
library(ggplot2)
ggplot(yeargenre, aes(year, count, colour = genre)) +geom_line() +xlab("Year") + ylab("Count") + ggtitle("Genres of Movies Per Year")

Who are the actors that have been in the most movies? List the top 20.

counts = dbGetQuery(db, 'SELECT fname, lname, count FROM counts')
actorcounts = order(counts$count, decreasing = TRUE)
counts = counts[actorcounts,]
head(counts, n = 20)
##                      fname      lname count
## 3284300               Alex     Trebek  7259
## 1963707             Johnny    Gilbert  7233
## 1363104                Bob     Barker  6898
## 3007713                Pat      Sajak  6278
## 1191283              Vanna      White  6219
## 1164868              Carol  Vorderman  5740
## 863139              Janice Pennington  5540
## 2401654                Jay       Leno  5352
## 2745293             Johnny      Olson  4971
## 2407134              David  Letterman  4740
## 3413195            Richard   Whiteley  4605
## 2727246 O'Donnell, Charlie       <NA>  4389
## 3400314              Frank     Welker  4272
## 3079123               Paul    Shaffer  4215
## 2724636     O'Brien, Conan       <NA>  4040
## 2955611                Rod      Roddy  4013
## 509347              Helena     Isabel  3994
## 1832364                Lus Esparteiro  3914
## 1569248             Manuel     Cavaco  3803
## 614277     Katherine Kelly       Lang  3759

Who are the actors that have had the most number of movies with “top billing”, i.e., billed as 1, 2 or 3? For each actor, also show the years these movies spanned?

Who are the 10 actors that performed in the most movies within any given year? What are their names and the year they starred in these movies?

dbGetQuery(db, 'CREATE TEMP TABLE num11 AS
           SELECT fname, lname, COUNT(acted_in.idactors) AS count, year 
           FROM acted_in, actors, movies
           WHERE acted_in.idactors = actors.idactors
           AND movies.idmovies = acted_in.idmovies
           AND movies.type = 1
           GROUP BY year, acted_in.idactors')
## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries
## data frame with 0 columns and 0 rows
allactorsbyyear = dbGetQuery(db, 'SELECT DISTINCT * FROM num11 
                 WHERE year > "1878"
                 ORDER BY count DESC')

top10byyear = dbGetQuery(db, 'SELECT DISTINCT * FROM num11 
                 WHERE year > "1878" 
                 ORDER BY count DESC 
                 LIMIT 10')
top10byyear
##    fname    lname count year
## 1  Marco Banderas   229 2007
## 2    Lee    Stone   214 2003
## 3    Lee    Stone   208 2005
## 4    Lee    Stone   201 2004
## 5   Alex  Sanders   192 1995
## 6   Alex  Sanders   186 1994
## 7  Talon     <NA>   184 2005
## 8  James     Deen   178 2008
## 9  James     Deen   178 2009
## 10 Marco Banderas   175 2006

Who are the 10 actors that have the most aliases? (i.e., see the aka_names table)

#SQL query lists out actors and number of aliases in order
dbGetQuery(db, 'SELECT fname, lname, count(aka_names.idactors) AS count 
           FROM aka_names, actors 
           WHERE aka_names.idactors = actors.idactors
           GROUP BY aka_names.idactors ORDER BY count DESC LIMIT 10')
##           fname    lname count
## 1          Jess   Franco    76
## 2  D'Amato, Joe     <NA>    69
## 3         Uschi   Digard    60
## 4      Herschel   Savage    52
## 5       Godfrey       Ho    49
## 6          Joey  Silvera    41
## 7        Zuzana  Presova    37
## 8     Christoph    Clark    37
## 9     Nathanael      Len    37
## 10       Sandra Kalerman    36