In this WPA, you’ll download a dataset with many errors. You’ll then clean it so it’s ready for analysis!

You can get the dataset from the following link. The dataset is tab-delimited and has a header row:

http://nathanieldphillips.com/wp-content/uploads/2016/01/movies_errors.txt

Here is some important information about the columns:

Question 0

  1. Download the data and save it as a dataframe called movies.errors
movies.errors <- read.table("http://nathanieldphillips.com/wp-content/uploads/2016/01/movies_errors.txt", 
                            sep = "\t", 
                            header = T,
                            stringsAsFactors = F
                            )

Question 1

  1. The column names in the dataframe are not great. Some contain some random numbers/letters, and others are too long. Change the column names of the dataframe so they make sense. I recommend making each name a single word with no capital letters. But it’s up to you.
names(movies.errors) <- c("name", "revenue", "dvd", "budget", "rating", "genre", "time", "year", "sequel")

Question 2

Check ALL the columns (except for the first “name” column) for errors! If you find any errors in a column, correct them!

Keep the following tips in mind:

recode.v <- function(original.vector, 
                   old.values, 
                   new.values, 
                   others = NULL) {
  
if(is.null(others)) {
  
  new.vector <- original.vector
  
}

if(is.null(others) == F) {
  
 new.vector <- rep(others, 
                   length(original.vector))
 
}

for (i in 1:length(old.values)) {
  
change.log <- new.vector == old.values[i] & 
              is.na(new.vector) == F

new.vector[change.log] <- new.values[i] 

}

return(new.vector)
  
}
#### revenue
class(movies.errors$revenue)
## [1] "numeric"
hist(movies.errors$revenue)

summary(movies.errors$revenue)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 1.251e+07 2.256e+07 4.222e+07 9.821e+07 1.023e+08 2.784e+09
# revenue looks ok

# dvd

class(movies.errors$dvd)
## [1] "integer"
hist(movies.errors$dvd)

summary(movies.errors$dvd)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##      6339   7563000  15800000  27980000  30570000 540400000      3566
# dvd looks ok

#### budget

class(movies.errors$budget)
## [1] "numeric"
hist(movies.errors$budget,
     main = "One value is way too high"
     )

summary(movies.errors$budget)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 0.000e+00 1.200e+07 1.882e+21 3.925e+07 9.770e+23
# ok, one value is way, way too high
# I think a reasonable upper limit is 1 billion

movies.errors$budget[movies.errors$budget > 1000000000] <- NA

hist(movies.errors$budget, 
     main = "Looks good")

#### rating

table(movies.errors$rating)
## 
##        13         g         G   General        GP     NC-17 Not Rated 
##       452        58        46        54         1         5       196 
##        PG     PG-13      PG13         R         X 
##       699       457       462      1489         3
# Some bad values, let's change them

movies.errors$rating <- recode.v(original.vector = movies.errors$rating,
                                 old.values = c("PG13", "13", "g", "General", "GP", "X"),
                                 new.values = c("PG-13", "PG-13", "G", "G", "G", "NC-17")
                                 )

table(movies.errors$rating)
## 
##         G     NC-17 Not Rated        PG     PG-13         R 
##       159         8       196       699      1371      1489
# much better!

#### genre

table(movies.errors$genre)
## 
##              action              Action           Adventure 
##                   1                 691                 485 
##        Black Comedy               Comdy              comedy 
##                  33                   1                   2 
##              Comedy Concert/Performance         Documentary 
##                1208                  14                  63 
##               drama               Drama              Horror 
##                   4                1083                 299 
##     Multiple Genres             musical             Musical 
##                   2                   2                  77 
##             Reality             REALITY     Romantic Comedy 
##                   2                   2                 248 
##     ROMANTIC COMEDY   Thriller/Suspense             Western 
##                   3                 427                  38
# some more bad values, let's change them:


movies.errors$genre <- recode.v(original.vector = movies.errors$genre,
                                 old.values = c("action", "Comdy", "comedy", "drama", "musical", "REALITY", "ROMANTIC COMEDY"),
                                 new.values = c("Action", "Comedy", "Comedy", "Drama", "Musical", "Reality", "Romantic Comedy")
                                 )

table(movies.errors$genre)
## 
##              Action           Adventure        Black Comedy 
##                 692                 485                  33 
##              Comedy Concert/Performance         Documentary 
##                1211                  14                  63 
##               Drama              Horror     Multiple Genres 
##                1087                 299                   2 
##             Musical             Reality     Romantic Comedy 
##                  79                   4                 251 
##   Thriller/Suspense             Western 
##                 427                  38
# looks good!

#### time

class(movies.errors$time)
## [1] "character"
# class is character, must be a bad value in there...

table(movies.errors$time)
## 
##       -1      -10      -11      -12      -15      -19       -2      -20 
##        1        1        2        3        2        1        1        1 
##      -21      -24      -25      -27      -28      -29       -3      -30 
##        1        1        1        1        1        1        1        1 
##      -33      -35      -36      -37      -38      -39       -4      -40 
##        2        2        1        1        1        1        1        1 
##      -41      -42      -43      -45      -46      -48      -49       -5 
##        3        2        1        3        1        1        1        1 
##      -50      -51      -54      -56      -57      -58      -59       -6 
##        1        3        5        1        1        2        1        3 
##      -65      -66      -68      -70      -71      -78       -8      -80 
##        1        3        3        1        2        1        4        1 
##      -81      -82      -83      -85      -87      -88      -89       -9 
##        1        1        1        1        4        1        1        1 
##      -90      -91      -93      -94      -95      -96      -98      -99 
##        1        1        2        3        1        3        1        1 
##        0      100      101      102      103      104      105      106 
##      108       58       43       42       44       44       59       50 
##      107      108      109      110     1108      111     1117      112 
##       47       39       47       55        1       36        1       30 
##      113      114      115      116      117      118      119      120 
##       33       33       43       38       28       48       30       53 
##      121      122      123      124      125      126      127      128 
##       36       30       36       29       29       29       32       24 
##     1284      129      130      131      132      133      134      135 
##        1       28       33       20       19       18       15       18 
##      136      137     1373      138      139      140      141      142 
##       15       13        1       17       17       15       11        8 
##      143      144      145     1454      146      147      148      149 
##       12        9        5        1       12        4        2        6 
##      150      151      152      153      154      155      156      157 
##        7        3        8        3        5        7        1        3 
##      158      159     1591      160      161      162      164      165 
##        5        2        1        5        3        2        5        6 
##      167     1689      169     1694      170      172      174      175 
##        2        1        3        1        2        1        1        2 
##     1765      177      179      181      182      187      189      191 
##        1        1        2        2        1        1        3        1 
##     1920      194      195     1957      197      200      201     2019 
##        1        1        1        1        1        1        2        1 
##     2129      220     2212     2258      240     2705     2736     2755 
##        1        1        1        1        1        1        1        1 
##     2791     2874     3053     3145     3340     3420     3457     3578 
##        1        1        1        1        1        1        1        1 
##     3751     3838       39       40     4001     4048     4098     4371 
##        1        1        1        5        1        1        1        1 
##       44     4443     4662       47     4749     4826     4963     4983 
##        1        1        1        1        1        1        1        1 
##      506      515       52       62      643       65      660       69 
##        1        2        1        1        1        1        1        2 
##       70       71      720       74       75       76       77       78 
##        1        1        1        1        4        1        2        3 
##      783      784       79       80       81       82      824      825 
##        1        1        2        4       14       15        1        1 
##       83       84      845       85       86       87       88       89 
##       11        8        1       17       19       22       27       28 
##       90       91       92       93       94       95      959       96 
##       31       46       37       33       28       52        1       42 
##       97       98       99 not sure 
##       55       52       33        3
# ah ha! there is a "not sure" value there, let's convert it to NA

movies.errors$time[movies.errors$time == "not sure"] <- NA

# now convert to numeric

movies.errors$time <- as.numeric(movies.errors$time)

class(movies.errors$time)
## [1] "numeric"
# Better, now we'll look for bad values:

hist(movies.errors$time,
     main = "Some values are too low (less than 0), and some are too high"
     )

# Let's convert values less than 30 and greater than 300 to NA
# These are just values that make sense to me...

movies.errors$time[movies.errors$time < 30] <- NA
movies.errors$time[movies.errors$time > 300] <- NA

hist(movies.errors$time,
     main = "Much better"
     )

#### year

class(movies.errors$year)
## [1] "integer"
hist(movies.errors$year, 
     main = "Some bad years here... I don't think many movies were made\nbefore 1900 or have been made after 2050")

# let's remove years less than 1900 and greater than 2016

movies.errors$year[movies.errors$year < 1900] <- NA
movies.errors$year[movies.errors$year > 2016] <- NA

hist(movies.errors$year,
     main = "These look much better!"
     )

#### sequel

class(movies.errors$sequel)
## [1] "character"
table(movies.errors$sequel)
## 
##    0    1    n   no    y  yes 
## 4358  579   11   12   10   12
# some bad values here (e.g.; n, no, y, yes)
# let's convert them to 0 and 1

movies.errors$sequel <- recode.v(original.vector = movies.errors$sequel,
                                 old.values = c("n", "no", "y", "yes"),
                                 new.values = c(0, 0, 1, 1)
)

table(movies.errors$sequel)
## 
##    0    1 
## 4381  601
# Now we'll convert the column to numeric
movies.errors$sequel <- as.numeric(movies.errors$sequel)

Question 3

  1. Create a new column called decade which shows the decade that a movie was made. For example, movies between 1950 and 1959 should be in one category, those between 1960 and 1969 should be in another category (etc.).
movies.errors$decade <- cut(movies.errors$year, breaks = seq(1910, 2020, 10))
  1. Create a table showing the number of movies in each decade
table(movies.errors$decade)
## 
## (1.91e+03,1.92e+03] (1.92e+03,1.93e+03] (1.93e+03,1.94e+03] 
##                   0                   1                   5 
## (1.94e+03,1.95e+03] (1.95e+03,1.96e+03] (1.96e+03,1.97e+03] 
##                  20                  40                 135 
## (1.97e+03,1.98e+03] (1.98e+03,1.99e+03]    (1.99e+03,2e+03] 
##                 288                 671                1061 
##    (2e+03,2.01e+03] (2.01e+03,2.02e+03] 
##                1844                 865

Question 4

  1. Create a new column called time.30 that groups the time variable in blocks of 30 minutes. For example, movie times between 0 and 29 should be in one category, those between 30 and 59 minutes should be in a second category (etc.).
movies.errors$time.30 <- cut(movies.errors$time, breaks = seq(0, 300, 30))
  1. Create a table showing the number of movies in each group of 30 minutes.
table(movies.errors$time.30)
## 
##    (0,30]   (30,60]   (60,90]  (90,120] (120,150] (150,180] (180,210] 
##         0         9       215      1278       549        72        14 
## (210,240] (240,270] (270,300] 
##         2         0         0

Question 5

  1. Create a new column called age that has one of two values: child or adult. Movies with ratings of G, PG, or PG-13 are ok for children. Movies with ratings of R, NC-17, or X are for adults.
movies.errors$age <- recode.v(original.vector = movies.errors$rating,
                              old.values = c("G", "PG", "PG-13", "R", "NC-17"),
                              new.values = c("child", "child", "child", "adult", "adult")
                              )
  1. What percentage of movies are only for adults?
mean(movies.errors$age == "adult", na.rm = T)
## [1] 0.381693

Question 6

Now, let’s add some more information to our movies dataset. The dataframe year.lookup is a dataframe that tells us, for each year, how well the world economomy was doing in that year, plus whether or not there was a major international conflict in taht year. You can get the dataframe from the following link. Like before, the data are tab-delimited and have a header row:

http://nathanieldphillips.com/wp-content/uploads/2016/01/year_index.txt

  1. Save the data as a new dataframe called year.index.

  2. Using merge() add the year.index data to the movies dataframe.

movies.errors <- merge(movies.errors, year.index, by = "year")
  1. What was the median boxoffice review of movies in good, ok, and poor economic years?
aggregate(revenue ~ economy, 
          FUN = median, 
          data = movies.errors)
##   economy  revenue
## 1    good 43092117
## 2      ok 42598498
## 3    poor 41252428
  1. Create a boxplot (or beanplot or pireateplot) showing the distribution of movie budgets for those movies released during international conflict years compared to those released during non-conflict years.
library(yarrr)

# Need to specify that budget is finite (i.e.; not NA) to get the plot to work
pirateplot(dv.name = "budget",
           iv.name = "international.conflict",
          data = subset(movies.errors, is.finite(budget)),
          main = "Movie budgets by International Conflict"
           )