The pupose of this project is to answer a series of questions and run some analysis of imdb.com’s database using SQL code.A schema of the database is presented below. All the information presented is as of November 2015.
Installing the required packages for this project and database. Now we can get started!
Flowchart of the Database
How many actors are there in the database? How many movies?
numactors = as.integer(dbGetQuery(db, 'SELECT COUNT(idactors) FROM actors;'))
nummovies = as.integer(dbGetQuery(db, 'SELECT COUNT(idmovies) FROM movies;'))
numactors
## [1] 3500167
nummovies
## [1] 1298737
What time period does the database cover?
years = dbGetQuery(db, 'SELECT DISTINCT year FROM movies;')
sort(years$year)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## [15] 15 16 17 18 19 20 21 23 24 25 27 30 32 35
## [29] 39 40 43 45 48 53 60 70 100 101 206 216 500 718
## [43] 1585 1640 1725 1776 1809 1811 1816 1827 1830 1865 1867 1870 1873 1874
## [57] 1878 1881 1883 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897
## [71] 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911
## [85] 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925
## [99] 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939
## [113] 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953
## [127] 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967
## [141] 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981
## [155] 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995
## [169] 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
## [183] 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023
## [197] 2024 2025
Since the first film cameras were invented in 1891, we can safely conclude that anything listed as having a release date before 1891 as inaccurate. From this, we can find the range of movies that the database covers that dont include obviously incorrect years
dbGetQuery(db, 'SELECT MAX(year) FROM movies;')
## MAX(year)
## 1 2025
The range of movies presented lie between 1891 and 2025 (unreleased titles)
What proportion of the actors are female? male?
femaleactors = dbGetQuery(db, 'SELECT * FROM actors WHERE gender IS NULL;')
nrow(femaleactors)/numactors
## [1] 0.3537034
maleactors = dbGetQuery(db, 'SELECT * FROM actors WHERE gender = "1";')
nrow(maleactors)/ numactors
## [1] 0.6462966
From this we can see that the ratio of female to male actors in the database is close to 2:1
What proportion of the entries in the movies table are actual movies and what proportion are television series, etc.?
#type 1 is adult entertainment
#type 2 is video games
#type 3 is movies
#type 'NULL' is series
dbGetQuery(db, 'SELECT DISTINCT type FROM movies;')
## type
## 1 3
## 2 NA
## 3 1
## 4 2
dbGetQuery(db, 'SELECT * FROM movies WHERE type IS NULL LIMIT 25')
## idmovies title year type number
## 1 12 This American Life Live! 2012 NA NA
## 2 33 Tweeduizendseks 2010 NA NA
## 3 35 Il barbiere di Siviglia 1992 NA NA
## 4 54 Meus Prmios Nick 2012 2012 NA NA
## 5 76 2nd Annual BET Awards 2002 NA NA
## 6 78 3rd Annual BET Awards 2003 NA NA
## 7 79 4th Annual BET Awards 2004 NA NA
## 8 80 BET Awards 2006 2006 NA NA
## 9 81 Nickelodeon Kids' Choice Awards 2000 2000 NA NA
## 10 82 The 16th Annual Soul Train Music Awards 2002 NA NA
## 11 83 The 17th Annual Soul Train Music Awards 2003 NA NA
## 12 84 The Teen Choice Awards 2001 2001 NA NA
## 13 135 50 Most Awesomely Bad Songs... Ever 2004 NA NA
## 14 165 2006 American Music Awards 2006 NA NA
## 15 166 Walt Disney World Christmas Day Parade 2005 NA NA
## 16 178 Happy Birthday 2 You 2000 NA NA
## 17 179 Murder on the Blade? 2003 NA NA
## 18 180 TV2 Big Comedy Gala 2000 NA NA
## 19 181 What Now! 20th Birthday Special 2001 NA NA
## 20 191 Disappearance 2002 NA NA
## 21 193 Super Glue and Angel's Wings 2001 NA NA
## 22 210 Sterrencircus 1991 NA NA
## 23 218 De todo corazn 2000 NA NA
## 24 219 Especial Nochevieja 1996: Somos la primera 1997 NA NA
## 25 220 Gala FAO 2000 2000 NA NA
## location language
## 1 <NA> <NA>
## 2 <NA> <NA>
## 3 <NA> <NA>
## 4 <NA> <NA>
## 5 <NA> <NA>
## 6 <NA> <NA>
## 7 <NA> <NA>
## 8 <NA> <NA>
## 9 <NA> <NA>
## 10 <NA> <NA>
## 11 <NA> <NA>
## 12 <NA> <NA>
## 13 <NA> <NA>
## 14 <NA> <NA>
## 15 <NA> <NA>
## 16 <NA> <NA>
## 17 <NA> <NA>
## 18 <NA> <NA>
## 19 <NA> <NA>
## 20 <NA> <NA>
## 21 <NA> <NA>
## 22 <NA> <NA>
## 23 <NA> <NA>
## 24 <NA> <NA>
## 25 <NA> <NA>
dbGetQuery(db, 'SELECT * FROM movies WHERE type = "3" LIMIT 25')
## idmovies title year type
## 1 1 Night of the Demons 2009 3
## 2 2 The Bad Lieutenant: Port of Call - New Orleans 2009 3
## 3 3 Please Like Me 2013 3
## 4 4 A Woman of Distinction 1950 3
## 5 5 Around the World 1943 3
## 6 6 Chain Lightning 1950 3
## 7 7 O-Kay for Sound 1937 3
## 8 8 Cabaret 1936 3
## 9 9 Four Star Revue 1950 3
## 10 10 Music-Hall 1949 3
## 11 11 Variety 1936 3
## 12 13 Mmesis 2012 3
## 13 14 Supernatural 2005 3
## 14 15 El hormiguero 2006 3
## 15 16 Welcome to My Darkside! 2009 3
## 16 17 The Shrink 2015 3
## 17 18 Los tarantos 1963 3
## 18 19 La granja tolima 2004 3
## 19 22 Aqu hay tomate 2003 3
## 20 23 StudZmen 2013 3
## 21 24 Between Women 2011 3
## 22 25 Welcome to Slab City 2012 3
## 23 26 The Tonight Show Starring Jimmy Fallon 2014 3
## 24 27 Criss Cross 2010 3
## 25 28 Exorsister 2012 3
## number location language
## 1 NA <NA> <NA>
## 2 NA <NA> <NA>
## 3 NA <NA> <NA>
## 4 NA <NA> <NA>
## 5 NA <NA> <NA>
## 6 NA <NA> <NA>
## 7 NA <NA> <NA>
## 8 NA <NA> <NA>
## 9 NA <NA> <NA>
## 10 NA <NA> <NA>
## 11 NA <NA> <NA>
## 12 NA <NA> <NA>
## 13 NA <NA> <NA>
## 14 NA <NA> <NA>
## 15 NA <NA> <NA>
## 16 NA <NA> <NA>
## 17 NA <NA> <NA>
## 18 NA <NA> <NA>
## 19 NA <NA> <NA>
## 20 NA <NA> <NA>
## 21 NA <NA> <NA>
## 22 NA <NA> <NA>
## 23 NA <NA> <NA>
## 24 NA <NA> <NA>
## 25 NA <NA> <NA>
as.integer(dbGetQuery(db, 'SELECT COUNT(*) FROM movies WHERE type = "1" '))/nummovies
## [1] 0.113488
as.integer(dbGetQuery(db, 'SELECT COUNT(*) FROM movies WHERE type = "2" '))/nummovies
## [1] 0.01184535
as.integer(dbGetQuery(db, 'SELECT COUNT(*) FROM movies WHERE type = "3" '))/nummovies
## [1] 0.7813322
as.integer(dbGetQuery(db, 'SELECT COUNT(*) FROM movies WHERE type IS NULL '))/nummovies
## [1] 0.09333452
How many genres are there? What are their names/descriptions?
dbGetQuery(db, 'SELECT DISTINCT COUNT(genre) FROM genres;')
## COUNT(genre)
## 1 32
genres = dbGetQuery(db, 'SELECT DISTINCT genre FROM genres;')
genres
## genre
## 1 Documentary
## 2 Reality
## 3 Horror
## 4 Drama
## 5 Comedy
## 6 Musical
## 7 Talk
## 8 Mystery
## 9 News
## 10 Sport
## 11 Sci
## 12 Romance
## 13 Family
## 14 Short
## 15 Biography
## 16 Music
## 17 Game
## 18 Adventure
## 19 Crime
## 20 War
## 21 Fantasy
## 22 Thriller
## 23 Animation
## 24 Action
## 25 History
## 26 Adult
## 27 Western
## 28 Lifestyle
## 29 Film
## 30 Experimental
## 31 Commercial
## 32 Erotica
List the 10 most common genres of movies, showing the number of movies in each of these genres.
moviegenres = dbGetQuery(db, "SELECT idmovies_genres, idgenres, idmovies FROM movies_genres")
#creates table showing movie ids and genre ids
genres = dbGetQuery(db, "SELECT * FROM genres INNER JOIN movies_genres ON genres.idgenres=movies_genres.idgenres")
#this joins genres and movies_genres showing the genre name next to the id
#sorts genres but number of entries
sort(table(genres$genre), decreasing = TRUE)[1:10]
##
## Comedy Drama Documentary Reality Family Talk
## 28152 20149 14934 10360 8915 7949
## Animation Music Romance Game
## 6797 5222 4679 4367
Has the number of movies in each genre changed over time? Plot the overall number of movies in each year over time, and for each genre.
yeargenre = dbGetQuery(db, 'SELECT COUNT(movies.idmovies) AS count, year, genre
FROM genres, movies_genres, movies
WHERE genres.idgenres = movies_genres.idgenres
AND movies_genres.idmovies = movies.idmovies GROUP BY genre, year')
dbGetQuery(db, 'SELECT genre, year, COUNT(movies.idmovies)
FROM genres, movies_genres, movies
WHERE genres.idgenres = movies_genres.idgenres
AND movies_genres.idmovies = movies.idmovies GROUP BY genre, year LIMIT 10')
## genre year COUNT(movies.idmovies)
## 1 Action NA 48
## 2 Action 1949 2
## 3 Action 1950 3
## 4 Action 1951 2
## 5 Action 1952 4
## 6 Action 1953 3
## 7 Action 1954 2
## 8 Action 1955 5
## 9 Action 1956 1
## 10 Action 1957 5
ordermovies = order(yeargenre$year)
yeargenre = yeargenre[ordermovies,]
yeargenre = na.omit(yeargenre) #remove NA observations
library(ggplot2)
ggplot(yeargenre, aes(year, count, colour = genre)) +geom_line() +xlab("Year") + ylab("Count") + ggtitle("Genres of Movies Per Year")
Who are the actors that have been in the most movies? List the top 20.
counts = dbGetQuery(db, 'SELECT fname, lname, count FROM counts')
actorcounts = order(counts$count, decreasing = TRUE)
counts = counts[actorcounts,]
head(counts, n = 20)
## fname lname count
## 3284300 Alex Trebek 7259
## 1963707 Johnny Gilbert 7233
## 1363104 Bob Barker 6898
## 3007713 Pat Sajak 6278
## 1191283 Vanna White 6219
## 1164868 Carol Vorderman 5740
## 863139 Janice Pennington 5540
## 2401654 Jay Leno 5352
## 2745293 Johnny Olson 4971
## 2407134 David Letterman 4740
## 3413195 Richard Whiteley 4605
## 2727246 O'Donnell, Charlie <NA> 4389
## 3400314 Frank Welker 4272
## 3079123 Paul Shaffer 4215
## 2724636 O'Brien, Conan <NA> 4040
## 2955611 Rod Roddy 4013
## 509347 Helena Isabel 3994
## 1832364 Lus Esparteiro 3914
## 1569248 Manuel Cavaco 3803
## 614277 Katherine Kelly Lang 3759
Who are the actors that have had the most number of movies with “top billing”, i.e., billed as 1, 2 or 3? For each actor, also show the years these movies spanned?
Who are the 10 actors that performed in the most movies within any given year? What are their names and the year they starred in these movies?
dbGetQuery(db, 'CREATE TEMP TABLE num11 AS
SELECT fname, lname, COUNT(acted_in.idactors) AS count, year
FROM acted_in, actors, movies
WHERE acted_in.idactors = actors.idactors
AND movies.idmovies = acted_in.idmovies
AND movies.type = 1
GROUP BY year, acted_in.idactors')
## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries
## data frame with 0 columns and 0 rows
allactorsbyyear = dbGetQuery(db, 'SELECT DISTINCT * FROM num11
WHERE year > "1878"
ORDER BY count DESC')
top10byyear = dbGetQuery(db, 'SELECT DISTINCT * FROM num11
WHERE year > "1878"
ORDER BY count DESC
LIMIT 10')
top10byyear
## fname lname count year
## 1 Marco Banderas 229 2007
## 2 Lee Stone 214 2003
## 3 Lee Stone 208 2005
## 4 Lee Stone 201 2004
## 5 Alex Sanders 192 1995
## 6 Alex Sanders 186 1994
## 7 Talon <NA> 184 2005
## 8 James Deen 178 2008
## 9 James Deen 178 2009
## 10 Marco Banderas 175 2006
Who are the 10 actors that have the most aliases? (i.e., see the aka_names table)
#SQL query lists out actors and number of aliases in order
dbGetQuery(db, 'SELECT fname, lname, count(aka_names.idactors) AS count
FROM aka_names, actors
WHERE aka_names.idactors = actors.idactors
GROUP BY aka_names.idactors ORDER BY count DESC LIMIT 10')
## fname lname count
## 1 Jess Franco 76
## 2 D'Amato, Joe <NA> 69
## 3 Uschi Digard 60
## 4 Herschel Savage 52
## 5 Godfrey Ho 49
## 6 Joey Silvera 41
## 7 Zuzana Presova 37
## 8 Christoph Clark 37
## 9 Nathanael Len 37
## 10 Sandra Kalerman 36