WPA #10: Data Cleaning

In this WPA, you’ll download a dataset with many errors. You’ll then clean it so it’s ready for analysis!

You can get the dataset from the following link. The dataset is tab-delimited and has a header row:

http://nathanieldphillips.com/wp-content/uploads/2016/01/movies_errors.txt

Here is some important information about the columns:

movie7653.name: The name of the movie
total.boxoffice.earnings: The total boxoffice revenue in USD
dvd.earnings.in.us.639c: Total DVD revenue in USD
total.movie.budget: Budget in USD
rating.GPGPG13RNC17: The MPAA (Motion Picture Association of America) rating.
- G = General Audiences
- PG = Parental Guidance suggested for children
- PG-13 = Parental guidance suggested for children under 13
- R = Parental guidance REQUIRED for children under 18.
- NC-17 (also known as X) = No children under 18 allowed.
genreX8423: The genre of the movie. There are many possible values ranging from Action to Western.
TIME: Length of the movie in minutes
year.of.release: Release year of movie
sequel: Is the movie a sequel? 0 means no, 1 means yes

Question 0

Download the data and save it as a dataframe called movies.errors

movies.errors <- read.table("http://nathanieldphillips.com/wp-content/uploads/2016/01/movies_errors.txt", 
                            sep = "\t", 
                            header = T,
                            stringsAsFactors = F
                            )

Question 1

The column names in the dataframe are not great. Some contain some random numbers/letters, and others are too long. Change the column names of the dataframe so they make sense. I recommend making each name a single word with no capital letters. But it’s up to you.

names(movies.errors) <- c("name", "revenue", "dvd", "budget", "rating", "genre", "time", "year", "sequel")

Question 2

Check ALL the columns (except for the first “name” column) for errors! If you find any errors in a column, correct them!

Keep the following tips in mind:

To get a quick look at the values in a numeric column with many (e.g; over 100 possible values), use summary() or hist()
To get a quick look at the values in a string column (or a numeric column with only a few possible values), use table()
In numeric columns, check for values that don’t make any sense (that is, those that are too large or too small).
In character columns, check for misspelled values. If you find values that are misspelled, correct them.
If you want to convert a character column to numeric, make sure all the values look like numbers before using as.numeric(). For sample, if a numeric column has a value of “one hundred”, you’ll need to convert this to 100.

recode.v <- function(original.vector, 
                   old.values, 
                   new.values, 
                   others = NULL) {
  
if(is.null(others)) {
  
  new.vector <- original.vector
  
}

if(is.null(others) == F) {
  
 new.vector <- rep(others, 
                   length(original.vector))
 
}

for (i in 1:length(old.values)) {
  
change.log <- new.vector == old.values[i] & 
              is.na(new.vector) == F

new.vector[change.log] <- new.values[i] 

}

return(new.vector)
  
}

#### revenue
class(movies.errors$revenue)

## [1] "numeric"

hist(movies.errors$revenue)

summary(movies.errors$revenue)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 1.251e+07 2.256e+07 4.222e+07 9.821e+07 1.023e+08 2.784e+09

# revenue looks ok

# dvd

class(movies.errors$dvd)

## [1] "integer"

hist(movies.errors$dvd)

summary(movies.errors$dvd)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##      6339   7563000  15800000  27980000  30570000 540400000      3566

# dvd looks ok

#### budget

class(movies.errors$budget)

## [1] "numeric"

hist(movies.errors$budget,
     main = "One value is way too high"
     )

summary(movies.errors$budget)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 0.000e+00 1.200e+07 1.882e+21 3.925e+07 9.770e+23

# ok, one value is way, way too high
# I think a reasonable upper limit is 1 billion

movies.errors$budget[movies.errors$budget > 1000000000] <- NA

hist(movies.errors$budget, 
     main = "Looks good")

#### rating

table(movies.errors$rating)

## 
##        13         g         G   General        GP     NC-17 Not Rated 
##       452        58        46        54         1         5       196 
##        PG     PG-13      PG13         R         X 
##       699       457       462      1489         3

# Some bad values, let's change them

movies.errors$rating <- recode.v(original.vector = movies.errors$rating,
                                 old.values = c("PG13", "13", "g", "General", "GP", "X"),
                                 new.values = c("PG-13", "PG-13", "G", "G", "G", "NC-17")
                                 )

table(movies.errors$rating)

## 
##         G     NC-17 Not Rated        PG     PG-13         R 
##       159         8       196       699      1371      1489

# much better!

#### genre

table(movies.errors$genre)

## 
##              action              Action           Adventure 
##                   1                 691                 485 
##        Black Comedy               Comdy              comedy 
##                  33                   1                   2 
##              Comedy Concert/Performance         Documentary 
##                1208                  14                  63 
##               drama               Drama              Horror 
##                   4                1083                 299 
##     Multiple Genres             musical             Musical 
##                   2                   2                  77 
##             Reality             REALITY     Romantic Comedy 
##                   2                   2                 248 
##     ROMANTIC COMEDY   Thriller/Suspense             Western 
##                   3                 427                  38

# some more bad values, let's change them:


movies.errors$genre <- recode.v(original.vector = movies.errors$genre,
                                 old.values = c("action", "Comdy", "comedy", "drama", "musical", "REALITY", "ROMANTIC COMEDY"),
                                 new.values = c("Action", "Comedy", "Comedy", "Drama", "Musical", "Reality", "Romantic Comedy")
                                 )

table(movies.errors$genre)

## 
##              Action           Adventure        Black Comedy 
##                 692                 485                  33 
##              Comedy Concert/Performance         Documentary 
##                1211                  14                  63 
##               Drama              Horror     Multiple Genres 
##                1087                 299                   2 
##             Musical             Reality     Romantic Comedy 
##                  79                   4                 251 
##   Thriller/Suspense             Western 
##                 427                  38

# looks good!

#### time

class(movies.errors$time)

## [1] "character"

# class is character, must be a bad value in there...

table(movies.errors$time)

## 
##       -1      -10      -11      -12      -15      -19       -2      -20 
##        1        1        2        3        2        1        1        1 
##      -21      -24      -25      -27      -28      -29       -3      -30 
##        1        1        1        1        1        1        1        1 
##      -33      -35      -36      -37      -38      -39       -4      -40 
##        2        2        1        1        1        1        1        1 
##      -41      -42      -43      -45      -46      -48      -49       -5 
##        3        2        1        3        1        1        1        1 
##      -50      -51      -54      -56      -57      -58      -59       -6 
##        1        3        5        1        1        2        1        3 
##      -65      -66      -68      -70      -71      -78       -8      -80 
##        1        3        3        1        2        1        4        1 
##      -81      -82      -83      -85      -87      -88      -89       -9 
##        1        1        1        1        4        1        1        1 
##      -90      -91      -93      -94      -95      -96      -98      -99 
##        1        1        2        3        1        3        1        1 
##        0      100      101      102      103      104      105      106 
##      108       58       43       42       44       44       59       50 
##      107      108      109      110     1108      111     1117      112 
##       47       39       47       55        1       36        1       30 
##      113      114      115      116      117      118      119      120 
##       33       33       43       38       28       48       30       53 
##      121      122      123      124      125      126      127      128 
##       36       30       36       29       29       29       32       24 
##     1284      129      130      131      132      133      134      135 
##        1       28       33       20       19       18       15       18 
##      136      137     1373      138      139      140      141      142 
##       15       13        1       17       17       15       11        8 
##      143      144      145     1454      146      147      148      149 
##       12        9        5        1       12        4        2        6 
##      150      151      152      153      154      155      156      157 
##        7        3        8        3        5        7        1        3 
##      158      159     1591      160      161      162      164      165 
##        5        2        1        5        3        2        5        6 
##      167     1689      169     1694      170      172      174      175 
##        2        1        3        1        2        1        1        2 
##     1765      177      179      181      182      187      189      191 
##        1        1        2        2        1        1        3        1 
##     1920      194      195     1957      197      200      201     2019 
##        1        1        1        1        1        1        2        1 
##     2129      220     2212     2258      240     2705     2736     2755 
##        1        1        1        1        1        1        1        1 
##     2791     2874     3053     3145     3340     3420     3457     3578 
##        1        1        1        1        1        1        1        1 
##     3751     3838       39       40     4001     4048     4098     4371 
##        1        1        1        5        1        1        1        1 
##       44     4443     4662       47     4749     4826     4963     4983 
##        1        1        1        1        1        1        1        1 
##      506      515       52       62      643       65      660       69 
##        1        2        1        1        1        1        1        2 
##       70       71      720       74       75       76       77       78 
##        1        1        1        1        4        1        2        3 
##      783      784       79       80       81       82      824      825 
##        1        1        2        4       14       15        1        1 
##       83       84      845       85       86       87       88       89 
##       11        8        1       17       19       22       27       28 
##       90       91       92       93       94       95      959       96 
##       31       46       37       33       28       52        1       42 
##       97       98       99 not sure 
##       55       52       33        3

# ah ha! there is a "not sure" value there, let's convert it to NA

movies.errors$time[movies.errors$time == "not sure"] <- NA

# now convert to numeric

movies.errors$time <- as.numeric(movies.errors$time)

class(movies.errors$time)

## [1] "numeric"

# Better, now we'll look for bad values:

hist(movies.errors$time,
     main = "Some values are too low (less than 0), and some are too high"
     )

# Let's convert values less than 30 and greater than 300 to NA
# These are just values that make sense to me...

movies.errors$time[movies.errors$time < 30] <- NA
movies.errors$time[movies.errors$time > 300] <- NA

hist(movies.errors$time,
     main = "Much better"
     )

#### year

class(movies.errors$year)

## [1] "integer"

hist(movies.errors$year, 
     main = "Some bad years here... I don't think many movies were made\nbefore 1900 or have been made after 2050")

# let's remove years less than 1900 and greater than 2016

movies.errors$year[movies.errors$year < 1900] <- NA
movies.errors$year[movies.errors$year > 2016] <- NA

hist(movies.errors$year,
     main = "These look much better!"
     )

#### sequel

class(movies.errors$sequel)

## [1] "character"

table(movies.errors$sequel)

## 
##    0    1    n   no    y  yes 
## 4358  579   11   12   10   12

# some bad values here (e.g.; n, no, y, yes)
# let's convert them to 0 and 1

movies.errors$sequel <- recode.v(original.vector = movies.errors$sequel,
                                 old.values = c("n", "no", "y", "yes"),
                                 new.values = c(0, 0, 1, 1)
)

table(movies.errors$sequel)

## 
##    0    1 
## 4381  601

# Now we'll convert the column to numeric
movies.errors$sequel <- as.numeric(movies.errors$sequel)

Question 3

Create a new column called decade which shows the decade that a movie was made. For example, movies between 1950 and 1959 should be in one category, those between 1960 and 1969 should be in another category (etc.).

movies.errors$decade <- cut(movies.errors$year, breaks = seq(1910, 2020, 10))

Create a table showing the number of movies in each decade

table(movies.errors$decade)

## 
## (1.91e+03,1.92e+03] (1.92e+03,1.93e+03] (1.93e+03,1.94e+03] 
##                   0                   1                   5 
## (1.94e+03,1.95e+03] (1.95e+03,1.96e+03] (1.96e+03,1.97e+03] 
##                  20                  40                 135 
## (1.97e+03,1.98e+03] (1.98e+03,1.99e+03]    (1.99e+03,2e+03] 
##                 288                 671                1061 
##    (2e+03,2.01e+03] (2.01e+03,2.02e+03] 
##                1844                 865

Question 4

Create a new column called time.30 that groups the time variable in blocks of 30 minutes. For example, movie times between 0 and 29 should be in one category, those between 30 and 59 minutes should be in a second category (etc.).

movies.errors$time.30 <- cut(movies.errors$time, breaks = seq(0, 300, 30))

Create a table showing the number of movies in each group of 30 minutes.

table(movies.errors$time.30)

## 
##    (0,30]   (30,60]   (60,90]  (90,120] (120,150] (150,180] (180,210] 
##         0         9       215      1278       549        72        14 
## (210,240] (240,270] (270,300] 
##         2         0         0

Question 5

Create a new column called age that has one of two values: child or adult. Movies with ratings of G, PG, or PG-13 are ok for children. Movies with ratings of R, NC-17, or X are for adults.

movies.errors$age <- recode.v(original.vector = movies.errors$rating,
                              old.values = c("G", "PG", "PG-13", "R", "NC-17"),
                              new.values = c("child", "child", "child", "adult", "adult")
                              )

What percentage of movies are only for adults?

mean(movies.errors$age == "adult", na.rm = T)

## [1] 0.381693

Question 6

Now, let’s add some more information to our movies dataset. The dataframe year.lookup is a dataframe that tells us, for each year, how well the world economomy was doing in that year, plus whether or not there was a major international conflict in taht year. You can get the dataframe from the following link. Like before, the data are tab-delimited and have a header row:

http://nathanieldphillips.com/wp-content/uploads/2016/01/year_index.txt

Save the data as a new dataframe called year.index.
Using merge() add the year.index data to the movies dataframe.

movies.errors <- merge(movies.errors, year.index, by = "year")

What was the median boxoffice review of movies in good, ok, and poor economic years?

aggregate(revenue ~ economy, 
          FUN = median, 
          data = movies.errors)

##   economy  revenue
## 1    good 43092117
## 2      ok 42598498
## 3    poor 41252428

Create a boxplot (or beanplot or pireateplot) showing the distribution of movie budgets for those movies released during international conflict years compared to those released during non-conflict years.

library(yarrr)

# Need to specify that budget is finite (i.e.; not NA) to get the plot to work
pirateplot(dv.name = "budget",
           iv.name = "international.conflict",
          data = subset(movies.errors, is.finite(budget)),
          main = "Movie budgets by International Conflict"
           )