Question 0
Download the data and save it as a dataframe called movies_errors.
movies_errors <- read.delim("~/Dropbox/RSeminar/movies_errors.txt")
Question 1
The column names in the dataframe are not great. Some contain random numbers or letters, and others are too long. Change the column names of the dataframe so they make sense. I recommend making each name a single word with no capital letters. But it’s up to you.
names(movies_errors)
## [1] "movie7653.name" "total.boxoffice.earnings"
## [3] "dvd.earnings.in.us.639c" "total.movie.budget"
## [5] "rating.GPGPG13RNC17" "genreX8423"
## [7] "TIME" "year.of.release"
## [9] "sequel"
names(movies_errors)[1] <- "name"
names(movies_errors)[2] <- "earnings"
names(movies_errors)[3] <- "dvd"
names(movies_errors)[4] <- "budget"
names(movies_errors)[5] <- "rating"
names(movies_errors)[6] <- "genre"
names(movies_errors)[7] <- "time"
names(movies_errors)[8] <- "year"
names(movies_errors)
## [1] "name" "earnings" "dvd" "budget" "rating" "genre"
## [7] "time" "year" "sequel"
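The single assignments above can also be collapsed into one step by assigning a whole vector of names. A minimal sketch, using a small made-up data frame (`toy`, not part of the exercise data) in place of movies_errors:

```r
# Hypothetical two-row stand-in for the first three columns of movies_errors
toy <- data.frame(movie7653.name = c("A", "B"),
                  total.boxoffice.earnings = c(1e6, 2e6),
                  dvd.earnings.in.us.639c = c(5e5, NA))

# Assign all new names at once; the order must match the existing columns
names(toy) <- c("name", "earnings", "dvd")
names(toy)
```

This is shorter, but it relies on the column order being known, so printing names() before and after remains a useful check.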
Question 2
Check ALL the columns (except for the first “name” column) for errors! If you find any errors in a column, correct them!
Keep the following tips in mind:
To get a quick look at the values in a numeric column with many possible values (e.g., over 100), use summary() or hist().
To get a quick look at the values in a string column (or a numeric column with only a few possible values), use table().
In numeric columns, check for values that don’t make any sense (that is, those that are too large or too small).
In character columns, check for misspelled values and correct any you find.
If you want to convert a character column to numeric, make sure all the values look like numbers before using as.numeric(). For example, if a numeric column has a value of “one hundred”, you’ll need to convert this to 100.
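The last tip can be sketched as follows. The vector `x` and the regular expression are illustrative assumptions, not part of the exercise data; the pattern accepts plain integers and decimals with an optional minus sign.

```r
# Hypothetical character column with two non-numeric entries
x <- c("120", "95", "one hundred", "not sure", "88")

# Flag values that do not look like (possibly negative, possibly decimal) numbers
looks.numeric <- grepl("^-?[0-9]+(\\.[0-9]+)?$", x)
x[!looks.numeric]

# Fix what can be fixed, set the rest to NA, then convert
x[x == "one hundred"] <- "100"
x[x == "not sure"] <- NA
x.num <- as.numeric(x)
x.num
```

Checking first means as.numeric() never has to silently turn an unexpected string into NA.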
str(movies_errors)
## 'data.frame': 5000 obs. of 9 variables:
## $ name : Factor w/ 4922 levels "__ (Jik zin)",..: 364 4518 2014 3632 1439 3633 1634 1431 1906 2431 ...
## $ earnings: num 2.78e+09 2.21e+09 1.67e+09 1.52e+09 1.52e+09 ...
## $ dvd : int 230915507 NA NA 109515497 14947559 7312791 100789867 204294533 25338875 NA ...
## $ budget : num 4.25e+08 2.00e+08 2.15e+08 2.25e+08 1.90e+08 2.50e+08 1.25e+08 1.50e+08 2.00e+08 7.40e+07 ...
## $ rating : Factor w/ 12 levels "13","g","G","General",..: 1 10 1 1 10 10 9 8 1 8 ...
## $ genre : Factor w/ 21 levels "action","Action",..: 2 20 2 3 2 2 3 3 2 7 ...
## $ time : Factor w/ 236 levels "-1","-10","-11",..: 134 154 92 113 106 111 8 68 99 226 ...
## $ year : int 2009 1997 2015 2012 2014 2015 2011 2013 2012 2014 ...
## $ sequel : Factor w/ 6 levels "0","1","n","no",..: 1 1 2 1 2 2 2 1 2 1 ...
# names
movies_errors$name <- as.character(movies_errors$name)
# earnings
summary(movies_errors$earnings)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.251e+07 2.256e+07 4.222e+07 9.821e+07 1.023e+08 2.784e+09
hist(movies_errors$earnings)
# dvd
movies_errors$dvd <- as.numeric(movies_errors$dvd)
summary(movies_errors$dvd)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6339 7563000 15800000 27980000 30570000 540400000 3566
hist(movies_errors$dvd)
# budget
summary(movies_errors$budget)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 0.000e+00 1.200e+07 1.882e+21 3.925e+07 9.770e+23
hist(movies_errors$budget)
movies_errors$budget[movies_errors$budget > 3e+08] <- NA
# rating
table(movies_errors$rating)
##
##        13         g         G   General        GP     NC-17 Not Rated
##       452        58        46        54         1         5       196
##        PG     PG-13      PG13         R         X
##       699       457       462      1489         3
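# Recode values in a vector: each old.values[i] becomes new.values[i].
# If others is supplied, every value not listed in old.values becomes others.
recode.v <- function(orig.vector,
                     old.values,
                     new.values,
                     others = NULL) {
  if (is.null(others)) {
    new.vector <- orig.vector
  } else {
    new.vector <- rep(others, length(orig.vector))
  }
  for (i in seq_along(old.values)) {
    # Compare against the original values so earlier replacements
    # (or the others fill) cannot affect later matches
    change.log <- orig.vector == old.values[i]
    new.vector[change.log] <- new.values[i]
  }
  return(new.vector)
}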
movies_errors$rating <- as.character(movies_errors$rating)
movies_errors$rating <- recode.v(orig.vector = movies_errors$rating,
                                 old.values = c("13", "g", "General", "GP", "PG13", "X"),
                                 new.values = c("PG-13", "G", "G", "PG", "PG-13", "Not Rated"))
table(movies_errors$rating)
##
##         G     NC-17 Not Rated        PG     PG-13         R
##       158         5       199       700      1371      1489
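The same kind of recoding can be done without a helper function by indexing into a named lookup vector. This is a sketch of an alternative, not the approach used above; `rating` and `lookup` below are made-up illustrations.

```r
# Hypothetical ratings vector containing the same kinds of inconsistencies
rating <- c("13", "g", "PG", "General", "R", "PG13", NA)

# Names are the old values, elements are the replacements
lookup <- c("13" = "PG-13", "g" = "G", "General" = "G", "PG13" = "PG-13")

# Replace only the entries that appear in the lookup; leave the rest alone
hit <- rating %in% names(lookup)
rating[hit] <- lookup[rating[hit]]
rating
```

Because unmatched values and NAs are never touched, the lookup only has to list the values that actually need fixing.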
# genre
table(movies_errors$genre)
##
##              action              Action           Adventure
##                   1                 691                 485
##        Black Comedy               Comdy              comedy
##                  33                   1                   2
##              Comedy Concert/Performance         Documentary
##                1208                  14                  63
##               drama               Drama              Horror
##                   4                1083                 299
##     Multiple Genres             musical             Musical
##                   2                   2                  77
##             Reality             REALITY     Romantic Comedy
##                   2                   2                 248
##     ROMANTIC COMEDY   Thriller/Suspense             Western
##                   3                 427                  38
movies_errors$genre <- as.character(movies_errors$genre)
movies_errors$genre <- recode.v(orig.vector = movies_errors$genre,
                                old.values = c("action", "Comdy", "comedy", "drama", "musical", "REALITY", "ROMANTIC COMEDY"),
                                new.values = c("Action", "Comedy", "Comedy", "Drama", "Musical", "Reality", "Romantic Comedy"))
table(movies_errors$genre)
##
##              Action           Adventure        Black Comedy
##                 692                 485                  33
##              Comedy Concert/Performance         Documentary
##                1211                  14                  63
##               Drama              Horror     Multiple Genres
##                1087                 299                   2
##             Musical             Reality     Romantic Comedy
##                  79                   4                 251
##   Thriller/Suspense             Western
##                 427                  38
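Most of the genre corrections differ from the correct label only in capitalization, so they can be matched against a list of canonical spellings without typing each variant. A hedged sketch with made-up values (`genre`, `canon`); true misspellings still need an explicit fix:

```r
# Hypothetical genre values with inconsistent capitalization and one typo
genre <- c("action", "Action", "REALITY", "comedy", "Comdy", "Drama")

# Canonical spellings we want to end up with
canon <- c("Action", "Comedy", "Drama", "Reality")

# Match on lowercased text; values with no match come back as NA
fixed <- canon[match(tolower(genre), tolower(canon))]

# Keep the original wherever no canonical match was found (e.g. typos)
fixed[is.na(fixed)] <- genre[is.na(fixed)]
fixed
```

Running table() on the result then shows only the leftovers that need manual attention.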
# time
summary(movies_errors$time)
## 0 105 100 110 97 120 95 98 106
## 108 59 58 55 55 53 52 52 50
## 118 107 109 91 103 104 101 115 102
## 48 47 47 46 44 44 43 43 42
## 96 108 116 92 111 121 123 113 114
## 42 39 38 37 36 36 36 33 33
## 130 93 99 127 90 112 119 122 124
## 33 33 33 32 31 30 30 30 29
## 125 126 117 129 89 94 88 128 87
## 29 29 28 28 28 28 27 24 22
## 131 132 86 133 135 138 139 85 134
## 20 19 19 18 18 17 17 17 15
## 136 140 82 81 137 143 146 141 83
## 15 15 15 14 13 12 12 11 11
## 144 142 152 84 150 155 149 165 -54
## 9 8 8 8 7 7 6 6 5
## 145 154 158 160 164 40 -8 -87 147
## 5 5 5 5 5 5 4 4 4
## 75 80 -12 -41 -45 -51 -6 -66 -68
## 4 4 3 3 3 3 3 3 3
## -94 -96 151 153 157 161 169 189 (Other)
## 3 3 3 3 3 3 3 3 163
## NA's
## 2600
movies_errors$time[movies_errors$time == "not sure"] <- NA
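# Caution: time is a factor, so as.numeric() returns its level codes (1 to 236),
# not the running times. as.numeric(as.character(...)) would keep the minutes.
movies_errors$time <- as.numeric(movies_errors$time)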
movies_errors$time[movies_errors$time < 60] <- NA
summary(movies_errors$time)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 60.0 75.0 93.0 124.7 212.0 235.0 2694
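One thing to watch in this step: str() reported time as a factor with 236 levels, and calling as.numeric() directly on a factor returns the internal level codes rather than the printed values. A maximum of 235 in the summary is consistent with that. The sketch below uses a small made-up factor to show the difference and one way to convert safely; treating times of 0 or below as missing is an assumption about what counts as an error.

```r
# Hypothetical factor of running times, as read.delim() may create when a
# numeric column also contains text such as "not sure"
time <- factor(c("105", "-8", "not sure", "120", "0"))

# as.numeric() on a factor returns the internal level codes, not the labels
as.numeric(time)

# Converting through character first recovers the actual numbers
time.chr <- as.character(time)
time.chr[time.chr == "not sure"] <- NA
time.num <- as.numeric(time.chr)

# Running times of 0 or below cannot be real, so treat them as missing
time.num[which(time.num <= 0)] <- NA
time.num
```

With the real minutes in hand, summary() and hist() can then be used to pick a sensible lower cutoff.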
# year
summary(movies_errors$year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17 1991 2002 2015 2009 3997
movies_errors$year[movies_errors$year < 1925] <- NA
movies_errors$year[movies_errors$year > 2016] <- NA
summary(movies_errors$year)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1925 1991 2002 1999 2009 2016 70
# sequel
table(movies_errors$sequel)
##
## 0 1 n no y yes
## 4358 579 11 12 10 12
movies_errors$sequel <- as.character(movies_errors$sequel)
movies_errors$sequel <- recode.v(orig.vector = movies_errors$sequel,
                                 old.values = c("n", "no", "y", "yes"),
                                 new.values = c("0", "0", "1", "1"))
movies_errors$sequel <- as.integer(movies_errors$sequel)
table(movies_errors$sequel)
##
## 0 1
## 4381 601
Question 3
Create a new column called decade which shows the decade that a movie was made. For example, movies between 1950 and 1959 should be in one category, those between 1960 and 1969 should be in another category (etc.).
Create a table showing the number of movies in each decade
movies_errors$decade <- cut(movies_errors$year, seq(1920,2020,10))
movies_errors$decade <- as.numeric(movies_errors$decade)
head(movies_errors[c("year", "decade")])
## year decade
## 1 2009 9
## 2 1997 8
## 3 2015 10
## 4 2012 10
## 5 2014 10
## 6 2015 10
table(movies_errors$decade)
##
## 1 2 3 4 5 6 7 8 9 10
## 1 5 20 40 135 288 671 1061 1844 865
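By default cut() builds right-closed intervals such as (1950, 1960], so a movie from 1960 is grouped with 1951 to 1959 and one from 1950 falls into the previous group. If the decades should run 1950 to 1959, 1960 to 1969 and so on, integer division is one simple alternative. A sketch with made-up years:

```r
# Hypothetical release years, including two decade boundaries
year <- c(1950, 1959, 1960, 2009, 2010, NA)

# Integer-divide by 10 and multiply back to get the decade's first year
decade <- (year %/% 10) * 10
decade

table(decade)
```

This also gives readable labels (1950, 1960, ...) instead of the codes 1 to 10.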
Question 4
Create a new column called time.30 that groups the time variable in blocks of 30 minutes. For example, movie times between 0 and 29 should be in one category, those between 30 and 59 minutes should be in a second category (etc.).
Create a table showing the number of movies in each group of 30 minutes.
summary(movies_errors$time)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 60.0 75.0 93.0 124.7 212.0 235.0 2694
movies_errors$time.30 <- cut(movies_errors$time, seq(60,260,30))
movies_errors$time.30 <- as.numeric(movies_errors$time.30)
head(movies_errors[c("time", "time.30")])
## time time.30
## 1 134 3
## 2 154 4
## 3 92 2
## 4 113 2
## 5 106 2
## 6 111 2
table(movies_errors$time.30)
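The question describes blocks starting at 0 (0 to 29, 30 to 59, and so on), which corresponds to left-closed intervals. A sketch of how cut() could produce those blocks with readable labels, using made-up times; the upper limit of 150 is chosen only to cover this toy data:

```r
# Hypothetical running times in minutes
time <- c(0, 29, 30, 59, 60, 95, 120, NA)

# Left-closed 30-minute blocks: [0,30), [30,60), [60,90), ...
breaks <- seq(0, 150, 30)
labels <- paste(head(breaks, -1), head(breaks, -1) + 29, sep = "-")
time.30 <- cut(time, breaks = breaks, right = FALSE, labels = labels)

table(time.30)
```

With right = FALSE a time of exactly 30 or 60 opens a new block instead of closing the previous one.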
Question 5
Create a new column called age that has one of two values: child or adult. Movies with ratings of G, PG, or PG-13 are ok for children. Movies with ratings of R, NC-17, or X are for adults.
What percentage of movies are only for adults?
movies_errors$age[movies_errors$rating %in% c("G", "PG", "PG-13")] <- "child"
movies_errors$age[!movies_errors$rating %in% c("G", "PG", "PG-13")] <- "adult"
table(movies_errors$age)
##
## adult child
## 2771 2229
mean(movies_errors$age == "adult")
## [1] 0.5542
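Note that the code above labels every movie that is not G, PG, or PG-13 as adult, which includes "Not Rated" movies and those with a missing rating. If only explicitly adult ratings should count, one option is to start from NA and fill in both groups. A sketch with made-up ratings; whether unrated movies belong in either group is a judgment call:

```r
# Hypothetical ratings, including values that fit neither group
rating <- c("G", "PG-13", "R", "NC-17", "Not Rated", NA)

# Start from NA so that unrated or missing ratings stay unclassified
age <- rep(NA_character_, length(rating))
age[rating %in% c("G", "PG", "PG-13")] <- "child"
age[rating %in% c("R", "NC-17", "X")] <- "adult"

# Share of adult-only movies among those that could be classified
mean(age == "adult", na.rm = TRUE)
```

Multiplying the result by 100 gives the percentage the question asks for.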
Now, let’s add some more information to our movies dataset. The dataframe year.lookup tells us, for each year, how well the world economy was doing, plus whether or not there was a major international conflict in that year. You can get the dataframe from the following link. Like before, the data are tab-delimited and have a header row:
Save the data as a new dataframe called year_index.
year_index <- read.delim("~/Dropbox/RSeminar/year_index.txt")
Using merge(), add the year_index data to the movies dataframe.
movies_errors <- merge(movies_errors, year_index, by = "year")
head(movies_errors)
## year name earnings dvd budget rating
## 1 1925 The Big Parade 22000000 NA 245000 <NA>
## 2 1937 Snow White and the Seven Dwarfs 184925485 NA 1488000 G
## 3 1939 Gone with the Wind 390525192 NA 3900000 G
## 4 1939 The Wizard of Oz 33711566 NA 2777000 PG-13
## 5 1940 Pinocchio 84300000 NA 0 G
## 6 1940 Fantasia 83320000 NA 2280000 G
## genre time sequel decade time.30 age economy
## 1 Drama NA 0 1 NA adult good
## 2 Musical 217 0 2 6 child ok
## 3 Drama 162 0 2 4 child ok
## 4 Musical 67 1 2 1 child ok
## 5 Adventure NA 0 2 NA child poor
## 6 Musical NA 0 2 NA child poor
## international.conflict
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
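By default merge() keeps only rows whose key appears in both data frames, so movies whose year was set to NA, or whose year is absent from the index, drop out of the result. The sketch below uses two small made-up data frames (`movies`, `index`) to show the difference that all.x = TRUE makes:

```r
# Hypothetical stand-ins for the movies and year-index data frames
movies <- data.frame(name = c("A", "B", "C"),
                     year = c(1999, NA, 2005))
index <- data.frame(year = c(1999, 2005),
                    economy = c("good", "poor"))

# The default merge() keeps only rows whose year appears in both data frames
nrow(merge(movies, index, by = "year"))

# all.x = TRUE keeps every movie and fills unmatched columns with NA
merged <- merge(movies, index, by = "year", all.x = TRUE)
merged
```

Comparing nrow() before and after the merge is a quick way to see whether any movies were lost.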
What were the median box office earnings of movies in good, ok, and poor economic years?
aggregate(earnings ~ economy, data = movies_errors, median)
## economy earnings
## 1 good 43092117
## 2 ok 42598498
## 3 poor 41252428
Create a boxplot (or beanplot or pirateplot) showing the distribution of movie budgets for those movies released during international conflict years compared to those released during non-conflict years.
boxplot(budget ~ international.conflict,
        data = movies_errors[movies_errors$budget != 0, ],
        xlab = "Conflict",
        ylab = "Budget",
        main = "Distribution of Movie Budgets")