WPA #10: Data Cleaning

In this WPA, you’ll download a dataset with many errors. You’ll then clean it so it’s ready for analysis!

You can get the dataset from the following link. The dataset is tab-delimited and has a header row:

http://nathanieldphillips.com/wp-content/uploads/2016/01/movies_errors.txt

Here is some important information about the columns:

movie7653.name: The name of the movie
total.boxoffice.earnings: The total boxoffice revenue in USD
dvd.earnings.in.us.639c: Total DVD revenue in USD
total.movie.budget: Budget in USD
rating.GPGPG13RNC17: The MPAA (Motion Picture Association of America) rating.
- G = General Audiences
- PG = Parental Guidance suggested for children
- PG-13 = Parents strongly cautioned; some material may be inappropriate for children under 13
- R = Restricted; children under 17 require an accompanying parent or adult guardian.
- NC-17 (also known as X) = No children under 18 allowed.
genreX8423: The genre of the movie. There are many possible values ranging from Action to Western.
TIME: Length of the movie in minutes
year.of.release: Release year of movie
sequel: Is the movie a sequel? 0 means no, 1 means yes

Question 0

Download the data and save it as a dataframe called movies.errors
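
As a rough sketch of how a tab-delimited file with a header row can be read, the example below uses a small temporary file in place of the real download (the toy columns and values are made up for illustration). Passing the URL above as the first argument to read.table() should work the same way, assuming the link is still reachable.

```r
# Write a tiny tab-delimited file with a header row (toy data, not the real dataset)
toy.file <- tempfile(fileext = ".txt")
writeLines(c("name\tbudget", "Movie A\t100", "Movie B\t250"), toy.file)

# sep = "\t" marks tabs as the delimiter; header = TRUE uses the first row as column names
toy <- read.table(toy.file, sep = "\t", header = TRUE, stringsAsFactors = FALSE)

# Check the structure of what was read
str(toy)
```

If the real file contains apostrophes in movie names, read.table() may treat them as quote characters; adding quote = "" is one possible workaround.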

Question 1

The column names in the dataframe are not great. Some contain random numbers and letters, and others are too long. Change the column names so they make sense. I recommend making each name a single word with no capital letters, but it’s up to you.
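
One way to rename columns is with names(). The sketch below uses a toy dataframe with two of the messy names listed above; the replacement names are only suggestions.

```r
# Toy dataframe with messy names similar to those described above
toy <- data.frame(movie7653.name = c("Movie A", "Movie B"),
                  TIME = c(95, 120),
                  stringsAsFactors = FALSE)

# Replace all names at once (the order must match the column order)
names(toy) <- c("name", "time")

# Or rename a single column by matching its current name
names(toy)[names(toy) == "time"] <- "minutes"

names(toy)
```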

Question 2

Check ALL the columns (except for the first “name” column) for errors! If you find any errors in a column, correct them!

Keep the following tips in mind:

To get a quick look at the values in a numeric column with many possible values (e.g., over 100), use summary() or hist().
To get a quick look at the values in a string column (or a numeric column with only a few possible values), use table().
In numeric columns, check for values that don’t make any sense (that is, those that are too large or too small).
In character columns, check for misspelled values. If you find values that are misspelled, correct them.
If you want to convert a character column to numeric, make sure all the values look like numbers before using as.numeric(). For example, if a numeric column has a value of “one hundred”, you’ll need to convert this to 100.
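
The tips above can be sketched on a toy dataframe. The values below are invented for illustration; the errors in the real data may look different, and whether an impossible value should be corrected or set to NA is a judgment call.

```r
# Toy data with typical problems (made up for illustration)
toy <- data.frame(rating = c("PG", "pg", "R", "PG-13", "r"),
                  time = c("95", "one hundred", "120", "-5", "88"),
                  stringsAsFactors = FALSE)

# table() reveals inconsistent spellings in a character column
table(toy$rating)

# Correct misspelled values with logical indexing
toy$rating[toy$rating == "pg"] <- "PG"
toy$rating[toy$rating == "r"] <- "R"

# Fix text values first, then convert the column to numeric
toy$time[toy$time == "one hundred"] <- "100"
toy$time <- as.numeric(toy$time)

# Values that make no sense (here, a negative length) can be set to NA
toy$time[toy$time < 0] <- NA

summary(toy$time)
```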

Question 3

Create a new column called decade that shows the decade in which a movie was made. For example, movies between 1950 and 1959 should be in one category, those between 1960 and 1969 should be in another category (etc.).
Create a table showing the number of movies in each decade.
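
One possible approach, shown here on a toy vector of years, is to divide by 10, round down, and multiply by 10 again, so that every year maps to the first year of its decade.

```r
# Toy release years (made up for illustration)
year <- c(1950, 1959, 1960, 1987, 2003)

# floor(1959 / 10) is 195, so 1959 maps to 1950
decade <- floor(year / 10) * 10

# Count the number of values in each decade
table(decade)
```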

Question 4

Create a new column called time.30 that groups the time variable in blocks of 30 minutes. For example, movie times between 0 and 29 should be in one category, those between 30 and 59 minutes should be in a second category (etc.).
Create a table showing the number of movies in each group of 30 minutes.
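
A sketch of one way to form 30-minute blocks, using cut() on a toy vector of movie lengths. The upper limit of 150 is an assumption chosen for this toy data; the real data may need a larger one.

```r
# Toy movie lengths in minutes (made up for illustration)
time <- c(0, 29, 30, 59, 95, 120)

# Breaks every 30 minutes; right = FALSE makes each block include its
# lower bound and exclude its upper bound, e.g. 0-29, 30-59, ...
time.30 <- cut(time, breaks = seq(0, 150, by = 30), right = FALSE)

# Count the number of values in each block
table(time.30)
```

An alternative that avoids choosing an upper limit is floor(time / 30) * 30, which labels each block by its starting minute.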

Question 5

Create a new column called age that has one of two values: child or adult. Movies with ratings of G, PG, or PG-13 are ok for children. Movies with ratings of R, NC-17, or X are for adults.
What percentage of movies are only for adults?
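
As an illustration on a toy vector of ratings, ifelse() with %in% can assign the two categories, and the mean of a logical vector gives a proportion.

```r
# Toy ratings (made up for illustration)
rating <- c("G", "PG", "PG-13", "R", "NC-17", "X", "R", "PG")

# Ratings in the child-friendly set become "child", everything else "adult".
# Note: a missing (NA) rating would also fall into "adult" here, so
# check for NAs before relying on this.
age <- ifelse(rating %in% c("G", "PG", "PG-13"), "child", "adult")

# mean() of a logical vector is the proportion of TRUE values
pct.adult <- mean(age == "adult") * 100
pct.adult
```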

Question 6

Now, let’s add some more information to our movies dataset. The dataframe year.index tells us, for each year, how well the world economy was doing, plus whether or not there was a major international conflict in that year. You can get the data from the following link. Like before, the data are tab-delimited and have a header row:

http://nathanieldphillips.com/wp-content/uploads/2016/01/year_index.txt

Save the data as a new dataframe called year.index.
Using merge(), add the year.index data to the movies dataframe.
What was the median boxoffice revenue of movies in good, ok, and poor economic years?
Create a boxplot (or beanplot or pirateplot) showing the distribution of movie budgets for those movies released during international conflict years compared to those released during non-conflict years.
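
The merge-then-summarise pattern might look like the sketch below. Both dataframes are toy stand-ins, and the column names year, economy, and conflict are assumptions; the real year_index.txt file may use different names and codings.

```r
# Toy stand-ins for the two dataframes (all names and values are hypothetical)
movies <- data.frame(name = c("A", "B", "C", "D"),
                     year = c(1990, 1991, 1990, 1992),
                     boxoffice = c(10, 20, 30, 40),
                     budget = c(5, 8, 12, 20),
                     stringsAsFactors = FALSE)

year.index <- data.frame(year = c(1990, 1991, 1992),
                         economy = c("good", "poor", "ok"),
                         conflict = c(0, 1, 0),
                         stringsAsFactors = FALSE)

# Add the year-level columns to every movie released in that year
movies <- merge(movies, year.index, by = "year")

# Median boxoffice revenue within each economy level
med <- aggregate(boxoffice ~ economy, data = movies, FUN = median)
med

# Distribution of budgets for conflict vs. non-conflict years
boxplot(budget ~ conflict, data = movies,
        xlab = "conflict", ylab = "budget")
```

If the year column has different names in the two dataframes, the by.x and by.y arguments of merge() can be used in place of by.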