Question 0

Download the data and save it as a dataframe called movies.errors

movies_errors <- read.delim("~/Dropbox/RSeminar/movies_errors.txt")

Question 1

The column names in the dataframe are not great. Some contain random numbers or letters, and others are too long. Change the column names of the dataframe so they make sense. I recommend making each name a single word with no capital letters, but it’s up to you.

names(movies_errors) 
## [1] "movie7653.name"           "total.boxoffice.earnings"
## [3] "dvd.earnings.in.us.639c"  "total.movie.budget"      
## [5] "rating.GPGPG13RNC17"      "genreX8423"              
## [7] "TIME"                     "year.of.release"         
## [9] "sequel"
names(movies_errors)[1] <- "name"
names(movies_errors)[2] <- "earnings"
names(movies_errors)[3] <- "dvd"
names(movies_errors)[4] <- "budget"
names(movies_errors)[5] <- "rating"
names(movies_errors)[6] <- "genre"
names(movies_errors)[7] <- "time"
names(movies_errors)[8] <- "year"
names(movies_errors)
## [1] "name"     "earnings" "dvd"      "budget"   "rating"   "genre"   
## [7] "time"     "year"     "sequel"
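
As an aside, all names can also be replaced in a single assignment rather than one index at a time. The sketch below uses a small made-up dataframe (`df` and its values are hypothetical, not the movies data) and assumes the new names are listed in the same order as the existing columns.

```r
# Toy data frame with awkward names (hypothetical values)
df <- data.frame(movie7653.name = "A", TIME = 100, year.of.release = 2001)

# Replace every column name at once; the order must match the columns
names(df) <- c("name", "time", "year")
names(df)
```

Assigning by position, as in the solution above, is safer when only some of the names need to change.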

Question 2

Check ALL the columns (except for the first “name” column) for errors! If you find any errors in a column, correct them!

Keep the following tips in mind:

- To get a quick look at the values in a numeric column with many possible values (e.g., over 100), use summary() or hist().
- To get a quick look at the values in a string column (or a numeric column with only a few possible values), use table().
- In numeric columns, check for values that don’t make sense (that is, those that are too large or too small).
- In character columns, check for misspelled values and correct any you find.
- If you want to convert a character column to numeric, make sure all the values look like numbers before using as.numeric(). For example, if a numeric column has a value of “one hundred”, you’ll need to convert this to 100.

str(movies_errors)
## 'data.frame':    5000 obs. of  9 variables:
##  $ name    : Factor w/ 4922 levels "__ (Jik zin)",..: 364 4518 2014 3632 1439 3633 1634 1431 1906 2431 ...
##  $ earnings: num  2.78e+09 2.21e+09 1.67e+09 1.52e+09 1.52e+09 ...
##  $ dvd     : int  230915507 NA NA 109515497 14947559 7312791 100789867 204294533 25338875 NA ...
##  $ budget  : num  4.25e+08 2.00e+08 2.15e+08 2.25e+08 1.90e+08 2.50e+08 1.25e+08 1.50e+08 2.00e+08 7.40e+07 ...
##  $ rating  : Factor w/ 12 levels "13","g","G","General",..: 1 10 1 1 10 10 9 8 1 8 ...
##  $ genre   : Factor w/ 21 levels "action","Action",..: 2 20 2 3 2 2 3 3 2 7 ...
##  $ time    : Factor w/ 236 levels "-1","-10","-11",..: 134 154 92 113 106 111 8 68 99 226 ...
##  $ year    : int  2009 1997 2015 2012 2014 2015 2011 2013 2012 2014 ...
##  $ sequel  : Factor w/ 6 levels "0","1","n","no",..: 1 1 2 1 2 2 2 1 2 1 ...
# names
movies_errors$name <- as.character(movies_errors$name)

# earnings
summary(movies_errors$earnings)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 1.251e+07 2.256e+07 4.222e+07 9.821e+07 1.023e+08 2.784e+09
hist(movies_errors$earnings)

# dvd
movies_errors$dvd <- as.numeric(movies_errors$dvd)
summary(movies_errors$dvd)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##      6339   7563000  15800000  27980000  30570000 540400000      3566
hist(movies_errors$dvd)

# budget
summary(movies_errors$budget)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 0.000e+00 1.200e+07 1.882e+21 3.925e+07 9.770e+23
hist(movies_errors$budget)

movies_errors$budget[movies_errors$budget > 3e+08] <- NA


# rating
table(movies_errors$rating)
## 
##        13         g         G   General        GP     NC-17 Not Rated 
##       452        58        46        54         1         5       196 
##        PG     PG-13      PG13         R         X 
##       699       457       462      1489         3
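recode.v <- function(orig.vector,
                     old.values,
                     new.values,
                     others = NULL) {
  # Start from the original values, or from `others` if a default is given
  if (is.null(others)) {
    new.vector <- orig.vector
  } else {
    new.vector <- rep(others, length(orig.vector))
  }

  # Match against the original vector so that earlier replacements
  # (or the `others` default) cannot affect later matches.
  # %in% returns FALSE for NA, so missing values are left untouched.
  for (i in seq_along(old.values)) {
    change.log <- orig.vector %in% old.values[i]
    new.vector[change.log] <- new.values[i]
  }
  return(new.vector)
}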

movies_errors$rating <- as.character(movies_errors$rating)
movies_errors$rating <- recode.v(orig.vector = movies_errors$rating, 
         old.values = c("13", "g", "General", "GP", "PG13", "X"),
         new.values = c("PG-13", "G", "G", "PG", "PG-13", "Not Rated"))

table(movies_errors$rating)
## 
##         G     NC-17 Not Rated        PG     PG-13         R 
##       158         5       199       700      1371      1489
# genre 
table(movies_errors$genre)
## 
##              action              Action           Adventure 
##                   1                 691                 485 
##        Black Comedy               Comdy              comedy 
##                  33                   1                   2 
##              Comedy Concert/Performance         Documentary 
##                1208                  14                  63 
##               drama               Drama              Horror 
##                   4                1083                 299 
##     Multiple Genres             musical             Musical 
##                   2                   2                  77 
##             Reality             REALITY     Romantic Comedy 
##                   2                   2                 248 
##     ROMANTIC COMEDY   Thriller/Suspense             Western 
##                   3                 427                  38
movies_errors$genre <- as.character(movies_errors$genre)
# Matching is case-sensitive: the misspelled value in the data is "Comdy"
movies_errors$genre <- recode.v(orig.vector = movies_errors$genre, 
         old.values = c("action", "Comdy", "comedy", "drama", "musical", "REALITY", "ROMANTIC COMEDY"),
         new.values = c("Action", "Comedy", "Comedy", "Drama", "Musical", "Reality", "Romantic Comedy"))

table(movies_errors$genre)
## 
##              Action           Adventure        Black Comedy 
##                 692                 485                  33 
##              Comedy Concert/Performance         Documentary 
##                1211                  14                  63 
##               Drama              Horror     Multiple Genres 
##                1087                 299                   2 
##             Musical             Reality     Romantic Comedy 
##                  79                   4                 251 
##   Thriller/Suspense             Western 
##                 427                  38
# time
summary(movies_errors$time)
##       0     105     100     110      97     120      95      98     106 
##     108      59      58      55      55      53      52      52      50 
##     118     107     109      91     103     104     101     115     102 
##      48      47      47      46      44      44      43      43      42 
##      96     108     116      92     111     121     123     113     114 
##      42      39      38      37      36      36      36      33      33 
##     130      93      99     127      90     112     119     122     124 
##      33      33      33      32      31      30      30      30      29 
##     125     126     117     129      89      94      88     128      87 
##      29      29      28      28      28      28      27      24      22 
##     131     132      86     133     135     138     139      85     134 
##      20      19      19      18      18      17      17      17      15 
##     136     140      82      81     137     143     146     141      83 
##      15      15      15      14      13      12      12      11      11 
##     144     142     152      84     150     155     149     165     -54 
##       9       8       8       8       7       7       6       6       5 
##     145     154     158     160     164      40      -8     -87     147 
##       5       5       5       5       5       5       4       4       4 
##      75      80     -12     -41     -45     -51      -6     -66     -68 
##       4       4       3       3       3       3       3       3       3 
##     -94     -96     151     153     157     161     169     189 (Other) 
##       3       3       3       3       3       3       3       3     163 
##    NA's 
##    2600
movies_errors$time[movies_errors$time == "not sure"] <- NA
# time is a factor, so convert via character to get the recorded minutes
# (as.numeric() on a factor returns the integer level codes instead)
movies_errors$time <- as.numeric(as.character(movies_errors$time))
movies_errors$time[movies_errors$time < 60] <- NA
summary(movies_errors$time)
# year 
summary(movies_errors$year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      17    1991    2002    2015    2009    3997
movies_errors$year[movies_errors$year < 1925] <- NA
movies_errors$year[movies_errors$year > 2016] <- NA
summary(movies_errors$year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1925    1991    2002    1999    2009    2016      70
# sequel
table(movies_errors$sequel)
## 
##    0    1    n   no    y  yes 
## 4358  579   11   12   10   12
movies_errors$sequel <- as.character(movies_errors$sequel)
movies_errors$sequel <- recode.v(orig.vector = movies_errors$sequel, 
         old.values = c("n", "no", "y", "yes"),
         new.values = c("0", "0", "1", "1"))
movies_errors$sequel <- as.integer(movies_errors$sequel)
table(movies_errors$sequel)
## 
##    0    1 
## 4381  601
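
A note on the time column: the str() output shows it was read in as a factor, and calling as.numeric() directly on a factor returns the internal level codes rather than the values printed in the labels. The sketch below shows the difference on a small made-up factor (the values are hypothetical, not taken from the movies data).

```r
# Toy factor of run times (hypothetical values)
time <- factor(c("120", "95", "-8", "not sure", "105"))

# as.numeric() on a factor returns the integer level codes,
# which depend on how the levels happen to be sorted
as.numeric(time)

# Converting to character first recovers the recorded values;
# entries that do not look like numbers become NA (with a warning,
# suppressed here to keep the output clean)
minutes <- suppressWarnings(as.numeric(as.character(time)))
minutes
```

The same caution applies to any column that str() reports as a factor but that is meant to hold numbers.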

Question 3

Create a new column called decade which shows the decade that a movie was made. For example, movies between 1950 and 1959 should be in one category, those between 1960 and 1969 should be in another category (etc.).

Create a table showing the number of movies in each decade

movies_errors$decade <- cut(movies_errors$year, seq(1920,2020,10))
movies_errors$decade <- as.numeric(movies_errors$decade)
head(movies_errors[c("year", "decade")])
##   year decade
## 1 2009      9
## 2 1997      8
## 3 2015     10
## 4 2012     10
## 5 2014     10
## 6 2015     10
table(movies_errors$decade)
## 
##    1    2    3    4    5    6    7    8    9   10 
##    1    5   20   40  135  288  671 1061 1844  865
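
One detail to keep in mind: cut() builds right-closed intervals by default, so with breaks at 1940, 1950, 1960 a movie from 1950 lands in (1940, 1950] rather than with the rest of the 1950s. A minimal sketch of a left-closed version on made-up years follows; the "1950s"-style labels are an assumption chosen for readability, not part of the assignment.

```r
# Toy release years (hypothetical values)
year <- c(1949, 1950, 1959, 1960, 2015)

# right = FALSE closes each interval on the left: [1950, 1960)
breaks <- seq(1920, 2020, 10)
decade <- cut(year, breaks = breaks, right = FALSE,
              labels = paste0(head(breaks, -1), "s"))

table(decade)
```

With this setting 1950 and 1959 fall in the same category, which matches the grouping described in the question.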

Question 4

Create a new column called time.30 that groups the time variable in blocks of 30 minutes. For example, movie times between 0 and 29 should be in one category, those between 30 and 59 minutes should be in a second category (etc.).

Create a table showing the number of movies in each group of 30 minutes.

summary(movies_errors$time)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    60.0    75.0    93.0   124.7   212.0   235.0    2694
# right = FALSE gives left-closed blocks such as [60, 90), so a time of 60 is kept
movies_errors$time.30 <- cut(movies_errors$time, seq(60,260,30), right = FALSE)
movies_errors$time.30 <- as.numeric(movies_errors$time.30)
head(movies_errors[c("time", "time.30")])
##   time time.30
## 1  134       3
## 2  154       4
## 3   92       2
## 4  113       2
## 5  106       2
## 6  111       2
table(movies_errors$time.30)

Question 5

Create a new column called age that has one of two values: child or adult. Movies with ratings of G, PG, or PG-13 are ok for children. Movies with ratings of R, NC-17, or X are for adults.

What percentage of movies are only for adults?

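# Note: the second line labels every rating outside G/PG/PG-13 as "adult",
# which also includes "Not Rated" and missing ratings
movies_errors$age[movies_errors$rating %in% c("G", "PG", "PG-13")] <- "child"
movies_errors$age[!movies_errors$rating %in% c("G", "PG", "PG-13")] <- "adult"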
table(movies_errors$age)
## 
## adult child 
##  2771  2229
mean(movies_errors$age == "adult")
## [1] 0.5542
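
The figure above counts everything that is not G, PG, or PG-13 as adult. If unrated or missing movies should be left out instead, one possible approach is to assign "adult" only to the explicitly adult ratings and leave the rest as NA. The sketch below illustrates this on made-up ratings (hypothetical values, not the movies data).

```r
# Toy ratings (hypothetical values)
rating <- c("G", "PG-13", "R", "NC-17", "Not Rated", NA)

# Label only the explicitly adult ratings as "adult";
# anything unrated or missing stays NA
age <- ifelse(rating %in% c("G", "PG", "PG-13"), "child",
       ifelse(rating %in% c("R", "NC-17", "X"), "adult", NA))

table(age, useNA = "ifany")

# Share of adult-only movies among those with a usable rating
mean(age == "adult", na.rm = TRUE)
```

Which definition is appropriate depends on how unrated movies should be treated; the question itself does not say.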

Now, let’s add some more information to our movies dataset. The dataframe year.index tells us, for each year, how well the world economy was doing in that year, plus whether or not there was a major international conflict in that year. You can get the dataframe from the following link. Like before, the data are tab-delimited and have a header row:

Save the data as a new dataframe called year.index.

year_index <- read.delim("~/Dropbox/RSeminar/year_index.txt")

Using merge() add the year.index data to the movies dataframe.

movies_errors <- merge(movies_errors, year_index, by = "year")
head(movies_errors)
##   year                            name  earnings dvd  budget rating
## 1 1925                  The Big Parade  22000000  NA  245000   <NA>
## 2 1937 Snow White and the Seven Dwarfs 184925485  NA 1488000      G
## 3 1939              Gone with the Wind 390525192  NA 3900000      G
## 4 1939                The Wizard of Oz  33711566  NA 2777000  PG-13
## 5 1940                       Pinocchio  84300000  NA       0      G
## 6 1940                        Fantasia  83320000  NA 2280000      G
##       genre time sequel decade time.30   age economy
## 1     Drama   NA      0      1      NA adult    good
## 2   Musical  217      0      2       6 child      ok
## 3     Drama  162      0      2       4 child      ok
## 4   Musical   67      1      2       1 child      ok
## 5 Adventure   NA      0      2      NA child    poor
## 6   Musical   NA      0      2      NA child    poor
##   international.conflict
## 1                      0
## 2                      0
## 3                      0
## 4                      0
## 5                      0
## 6                      0
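
By default merge() keeps only rows whose key appears in both dataframes, so movies whose year was set to NA earlier would be dropped if the lookup table has no matching row. A hedged sketch on made-up data (`movies` and `lookup` are hypothetical stand-ins) shows how all.x = TRUE keeps every row of the first dataframe.

```r
# Toy data (hypothetical values)
movies <- data.frame(name = c("A", "B", "C"),
                     year = c(2001, NA, 2003))
lookup <- data.frame(year = c(2001, 2003),
                     economy = c("good", "poor"))

# Default merge: only movies with a matching year are kept
inner <- merge(movies, lookup, by = "year")
nrow(inner)

# all.x = TRUE keeps every movie; unmatched lookup columns become NA
merged <- merge(movies, lookup, by = "year", all.x = TRUE)
merged
```

Whether to keep unmatched movies is a judgment call; for the summaries that follow, rows without an economy value would be ignored by aggregate() either way.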

What were the median box-office earnings of movies in good, ok, and poor economic years?

aggregate(earnings ~ economy, data = movies_errors, median)
##   economy earnings
## 1    good 43092117
## 2      ok 42598498
## 3    poor 41252428

Create a boxplot (or beanplot or pirateplot) showing the distribution of movie budgets for those movies released during international conflict years compared to those released during non-conflict years.

boxplot(budget ~ international.conflict,
        data = subset(movies_errors, budget != 0),
        xlab = "Conflict",
        ylab = "Budget",
        main = "Distribution of Movie Budgets")