In this WPA, you’ll download a dataset with many errors. You’ll then clean it so it’s ready for analysis!
You can get the dataset from the following link. The dataset is tab-delimited and has a header row:
http://nathanieldphillips.com/wp-content/uploads/2016/01/movies_errors.txt
Here is some important information about the columns:
movie7653.name: The name of the movie
total.boxoffice.earnings: The total boxoffice revenue in USD
dvd.earnings.in.us.639c: Total DVD revenue in USD
total.movie.budget: Budget in USD
genreX8423: The genre of the movie. There are many possible values ranging from Action to Western.
TIME: Length of the movie in minutes
year.of.release: Release year of movie
sequel: Is the movie a sequel? 0 means no, 1 means yes
Question 0
movies.errors <- read.table("http://nathanieldphillips.com/wp-content/uploads/2016/01/movies_errors.txt",
sep = "\t",
header = T,
stringsAsFactors = F
)
Question 1
names(movies.errors) <- c("name", "revenue", "dvd", "budget", "rating", "genre", "time", "year", "sequel")
Question 2
Check ALL the columns (except for the first “name” column) for errors! If you find any errors in a column, correct them!
Keep the following tips in mind:
recode.v <- function(original.vector,
old.values,
new.values,
others = NULL) {
if(is.null(others)) {
new.vector <- original.vector
}
if(is.null(others) == F) {
new.vector <- rep(others,
length(original.vector))
}
for (i in 1:length(old.values)) {
change.log <- new.vector == old.values[i] &
is.na(new.vector) == F
new.vector[change.log] <- new.values[i]
}
return(new.vector)
}
#### revenue
class(movies.errors$revenue)
## [1] "numeric"
hist(movies.errors$revenue)
summary(movies.errors$revenue)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.251e+07 2.256e+07 4.222e+07 9.821e+07 1.023e+08 2.784e+09
# revenue looks ok
# dvd
class(movies.errors$dvd)
## [1] "integer"
hist(movies.errors$dvd)
summary(movies.errors$dvd)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6339 7563000 15800000 27980000 30570000 540400000 3566
# dvd looks ok
#### budget
class(movies.errors$budget)
## [1] "numeric"
hist(movies.errors$budget,
main = "One value is way too high"
)
summary(movies.errors$budget)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 0.000e+00 1.200e+07 1.882e+21 3.925e+07 9.770e+23
# ok, one value is way, way too high
# I think a reasonable upper limit is 1 billion
movies.errors$budget[movies.errors$budget > 1000000000] <- NA
hist(movies.errors$budget,
main = "Looks good")
#### rating
table(movies.errors$rating)
##
## 13 g G General GP NC-17 Not Rated
## 452 58 46 54 1 5 196
## PG PG-13 PG13 R X
## 699 457 462 1489 3
# Some bad values, let's change them
movies.errors$rating <- recode.v(original.vector = movies.errors$rating,
old.values = c("PG13", "13", "g", "General", "GP", "X"),
new.values = c("PG-13", "PG-13", "G", "G", "G", "NC-17")
)
table(movies.errors$rating)
##
## G NC-17 Not Rated PG PG-13 R
## 159 8 196 699 1371 1489
# much better!
#### genre
table(movies.errors$genre)
##
## action Action Adventure
## 1 691 485
## Black Comedy Comdy comedy
## 33 1 2
## Comedy Concert/Performance Documentary
## 1208 14 63
## drama Drama Horror
## 4 1083 299
## Multiple Genres musical Musical
## 2 2 77
## Reality REALITY Romantic Comedy
## 2 2 248
## ROMANTIC COMEDY Thriller/Suspense Western
## 3 427 38
# some more bad values, let's change them:
movies.errors$genre <- recode.v(original.vector = movies.errors$genre,
old.values = c("action", "Comdy", "comedy", "drama", "musical", "REALITY", "ROMANTIC COMEDY"),
new.values = c("Action", "Comedy", "Comedy", "Drama", "Musical", "Reality", "Romantic Comedy")
)
table(movies.errors$genre)
##
## Action Adventure Black Comedy
## 692 485 33
## Comedy Concert/Performance Documentary
## 1211 14 63
## Drama Horror Multiple Genres
## 1087 299 2
## Musical Reality Romantic Comedy
## 79 4 251
## Thriller/Suspense Western
## 427 38
# looks good!
#### time
class(movies.errors$time)
## [1] "character"
# class is character, must be a bad value in there...
table(movies.errors$time)
##
## -1 -10 -11 -12 -15 -19 -2 -20
## 1 1 2 3 2 1 1 1
## -21 -24 -25 -27 -28 -29 -3 -30
## 1 1 1 1 1 1 1 1
## -33 -35 -36 -37 -38 -39 -4 -40
## 2 2 1 1 1 1 1 1
## -41 -42 -43 -45 -46 -48 -49 -5
## 3 2 1 3 1 1 1 1
## -50 -51 -54 -56 -57 -58 -59 -6
## 1 3 5 1 1 2 1 3
## -65 -66 -68 -70 -71 -78 -8 -80
## 1 3 3 1 2 1 4 1
## -81 -82 -83 -85 -87 -88 -89 -9
## 1 1 1 1 4 1 1 1
## -90 -91 -93 -94 -95 -96 -98 -99
## 1 1 2 3 1 3 1 1
## 0 100 101 102 103 104 105 106
## 108 58 43 42 44 44 59 50
## 107 108 109 110 1108 111 1117 112
## 47 39 47 55 1 36 1 30
## 113 114 115 116 117 118 119 120
## 33 33 43 38 28 48 30 53
## 121 122 123 124 125 126 127 128
## 36 30 36 29 29 29 32 24
## 1284 129 130 131 132 133 134 135
## 1 28 33 20 19 18 15 18
## 136 137 1373 138 139 140 141 142
## 15 13 1 17 17 15 11 8
## 143 144 145 1454 146 147 148 149
## 12 9 5 1 12 4 2 6
## 150 151 152 153 154 155 156 157
## 7 3 8 3 5 7 1 3
## 158 159 1591 160 161 162 164 165
## 5 2 1 5 3 2 5 6
## 167 1689 169 1694 170 172 174 175
## 2 1 3 1 2 1 1 2
## 1765 177 179 181 182 187 189 191
## 1 1 2 2 1 1 3 1
## 1920 194 195 1957 197 200 201 2019
## 1 1 1 1 1 1 2 1
## 2129 220 2212 2258 240 2705 2736 2755
## 1 1 1 1 1 1 1 1
## 2791 2874 3053 3145 3340 3420 3457 3578
## 1 1 1 1 1 1 1 1
## 3751 3838 39 40 4001 4048 4098 4371
## 1 1 1 5 1 1 1 1
## 44 4443 4662 47 4749 4826 4963 4983
## 1 1 1 1 1 1 1 1
## 506 515 52 62 643 65 660 69
## 1 2 1 1 1 1 1 2
## 70 71 720 74 75 76 77 78
## 1 1 1 1 4 1 2 3
## 783 784 79 80 81 82 824 825
## 1 1 2 4 14 15 1 1
## 83 84 845 85 86 87 88 89
## 11 8 1 17 19 22 27 28
## 90 91 92 93 94 95 959 96
## 31 46 37 33 28 52 1 42
## 97 98 99 not sure
## 55 52 33 3
# ah ha! there is a "not sure" value there, let's convert it to NA
movies.errors$time[movies.errors$time == "not sure"] <- NA
# now convert to numeric
movies.errors$time <- as.numeric(movies.errors$time)
class(movies.errors$time)
## [1] "numeric"
# Better, now we'll look for bad values:
hist(movies.errors$time,
main = "Some values are too low (less than 0), and some are too high"
)
# Let's convert values less than 30 and greater than 300 to NA
# These are just values that make sense to me...
movies.errors$time[movies.errors$time < 30] <- NA
movies.errors$time[movies.errors$time > 300] <- NA
hist(movies.errors$time,
main = "Much better"
)
#### year
class(movies.errors$year)
## [1] "integer"
hist(movies.errors$year,
main = "Some bad years here... I don't think many movies were made\nbefore 1900 or have been made after 2050")
# let's remove years less than 1900 and greater than 2016
movies.errors$year[movies.errors$year < 1900] <- NA
movies.errors$year[movies.errors$year > 2016] <- NA
hist(movies.errors$year,
main = "These look much better!"
)
#### sequel
class(movies.errors$sequel)
## [1] "character"
table(movies.errors$sequel)
##
## 0 1 n no y yes
## 4358 579 11 12 10 12
# some bad values here (e.g.; n, no, y, yes)
# let's convert them to 0 and 1
movies.errors$sequel <- recode.v(original.vector = movies.errors$sequel,
old.values = c("n", "no", "y", "yes"),
new.values = c(0, 0, 1, 1)
)
table(movies.errors$sequel)
##
## 0 1
## 4381 601
# Now we'll convert the column to numeric
movies.errors$sequel <- as.numeric(movies.errors$sequel)
Question 3
movies.errors$decade <- cut(movies.errors$year, breaks = seq(1910, 2020, 10))
table(movies.errors$decade)
##
## (1.91e+03,1.92e+03] (1.92e+03,1.93e+03] (1.93e+03,1.94e+03]
## 0 1 5
## (1.94e+03,1.95e+03] (1.95e+03,1.96e+03] (1.96e+03,1.97e+03]
## 20 40 135
## (1.97e+03,1.98e+03] (1.98e+03,1.99e+03] (1.99e+03,2e+03]
## 288 671 1061
## (2e+03,2.01e+03] (2.01e+03,2.02e+03]
## 1844 865
Question 4
movies.errors$time.30 <- cut(movies.errors$time, breaks = seq(0, 300, 30))
table(movies.errors$time.30)
##
## (0,30] (30,60] (60,90] (90,120] (120,150] (150,180] (180,210]
## 0 9 215 1278 549 72 14
## (210,240] (240,270] (270,300]
## 2 0 0
Question 5
movies.errors$age <- recode.v(original.vector = movies.errors$rating,
old.values = c("G", "PG", "PG-13", "R", "NC-17"),
new.values = c("child", "child", "child", "adult", "adult")
)
mean(movies.errors$age == "adult", na.rm = T)
## [1] 0.381693
Question 6
Now, let’s add some more information to our movies dataset. The dataframe year.lookup is a dataframe that tells us, for each year, how well the world economomy was doing in that year, plus whether or not there was a major international conflict in taht year. You can get the dataframe from the following link. Like before, the data are tab-delimited and have a header row:
http://nathanieldphillips.com/wp-content/uploads/2016/01/year_index.txt
Save the data as a new dataframe called year.index.
Using merge() add the year.index data to the movies dataframe.
movies.errors <- merge(movies.errors, year.index, by = "year")
aggregate(revenue ~ economy,
FUN = median,
data = movies.errors)
## economy revenue
## 1 good 43092117
## 2 ok 42598498
## 3 poor 41252428
library(yarrr)
# Need to specify that budget is finite (i.e.; not NA) to get the plot to work
pirateplot(dv.name = "budget",
iv.name = "international.conflict",
data = subset(movies.errors, is.finite(budget)),
main = "Movie budgets by International Conflict"
)