In this WPA, you’ll download a dataset with many errors. You’ll then clean it so it’s ready for analysis!

You can get the dataset from the following link. The dataset is tab-delimited and has a header row:

http://nathanieldphillips.com/wp-content/uploads/2016/01/movies_errors.txt

Here is some important information about the columns:

Question 0

  1. Download the data and save it as a dataframe called movies.errors.
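
As a rough sketch of reading a tab-delimited file with a header row, the example below writes a tiny made-up file to a temporary location and reads it back with read.table(). The file contents, the column names, and the object names toy.path and toy.movies are illustrative assumptions, not the real movies data. For the assignment, the same call should work with the URL given above in place of toy.path.

```r
# Write a tiny tab-delimited file so the example runs offline.
# These columns are made up; they are NOT the real movies columns.
toy.path <- tempfile(fileext = ".txt")
writeLines(c("name\tyear\ttime",
             "Movie A\t1994\t142",
             "Movie B\t2001\t98"),
           con = toy.path)

# sep = "\t" says the file is tab-delimited,
# header = TRUE says the first row holds the column names.
toy.movies <- read.table(file = toy.path,
                         sep = "\t",
                         header = TRUE,
                         stringsAsFactors = FALSE)

# Check the structure of what was read in
str(toy.movies)
```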

Question 1

  1. The column names in the dataframe are not great. Some contain random numbers or letters, and others are too long. Change the column names of the dataframe so they make sense. I recommend making each name a single word with no capital letters, but it’s up to you.
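
One possible way to rename columns is with names(). The sketch below uses a toy dataframe whose awkward column names are invented for illustration; the real column names in movies.errors will differ.

```r
# Toy dataframe with awkward names (hypothetical, not the real columns)
toy <- data.frame(x32.Name = c("Movie A", "Movie B"),
                  Running.Time.In.Minutes = c(142, 98),
                  stringsAsFactors = FALSE)

# Replace all names at once (the order must match the column order)
names(toy) <- c("name", "time")

# Or change a single name by matching on the old name
names(toy)[names(toy) == "time"] <- "minutes"

names(toy)
```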

Question 2

Check ALL the columns (except for the first “name” column) for errors! If you find any errors in a column, correct them!
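
A minimal sketch of one checking workflow, on invented toy data: table() lists every distinct value in a categorical column, summary() shows the range of a numeric column, and logical indexing replaces bad values. The toy values, the set of valid ratings, and the plausible time range (1 to 600 minutes) are assumptions made for this example only.

```r
# Toy data containing a few deliberate errors (values are made up)
toy <- data.frame(rating = c("PG", "R", "pg", "G", "??"),
                  time = c(95, -10, 120, 9999, 101),
                  stringsAsFactors = FALSE)

# Categorical column: table() shows every distinct value and its count
table(toy$rating)

# Fix a value that is clearly a misspelling of a valid one
toy$rating[toy$rating == "pg"] <- "PG"

# Set anything still outside the valid set to NA
valid.ratings <- c("G", "PG", "PG-13", "R", "NC-17", "X")
toy$rating[!(toy$rating %in% valid.ratings)] <- NA

# Numeric column: summary() exposes impossible minimums and maximums
summary(toy$time)

# Treat values outside a plausible range as missing
# (the 1 to 600 minute range is an assumption for this sketch)
toy$time[toy$time <= 0 | toy$time > 600] <- NA
```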

Keep the following tips in mind:

Question 3

  1. Create a new column called decade that shows the decade in which each movie was made. For example, movies made between 1950 and 1959 should be in one category, those made between 1960 and 1969 should be in another category (etc.).

  2. Create a table showing the number of movies in each decade.
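
One way to get a decade from a year is to divide by 10, round down, and multiply by 10 again. The sketch below assumes a numeric column called year, which is a hypothetical name here, and uses made-up values.

```r
# Toy data; the column name year is an assumption
toy <- data.frame(year = c(1955, 1959, 1960, 1994, 2001))

# 1955 / 10 = 195.5, floor gives 195, times 10 gives 1950
toy$decade <- floor(toy$year / 10) * 10

# Count the movies in each decade
table(toy$decade)
```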

Question 4

  1. Create a new column called time.30 that groups the time variable into blocks of 30 minutes. For example, movie times between 0 and 29 minutes should be in one category, those between 30 and 59 minutes should be in a second category (etc.).

  2. Create a table showing the number of movies in each group of 30 minutes.
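
The 30-minute grouping could be sketched with cut(), which assigns each value to an interval. Setting right = FALSE makes each interval include its lower bound and exclude its upper bound, so 29 falls in the first block and 30 in the second. The toy times below are made up.

```r
# Toy running times in minutes (made-up values)
toy <- data.frame(time = c(12, 29, 30, 59, 95, 142))

# Break points at 0, 30, 60, ... up to just past the longest movie
breaks.30 <- seq(0, max(toy$time, na.rm = TRUE) + 30, by = 30)

# right = FALSE gives intervals like [0,30), [30,60), ...
toy$time.30 <- cut(toy$time, breaks = breaks.30, right = FALSE)

# Count the movies in each 30-minute block (empty blocks show as 0)
table(toy$time.30)
```

An alternative with the same grouping is floor(time / 30) * 30, which stores the lower bound of each block as a number instead of an interval label.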

Question 5

  1. Create a new column called age that has one of two values: child or adult. Movies with ratings of G, PG, or PG-13 are ok for children. Movies with ratings of R, NC-17, or X are for adults.

  2. What percentage of movies are only for adults?
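
A minimal sketch of the two-category recode using ifelse() and %in%, on made-up ratings. The column name rating is an assumption. Note that a missing rating would be labelled child by ifelse() alone, so the sketch sets those cases back to NA.

```r
# Toy ratings (made-up values); the column name rating is an assumption
toy <- data.frame(rating = c("G", "R", "PG-13", "NC-17", "PG"),
                  stringsAsFactors = FALSE)

# Movies rated R, NC-17, or X are adult; everything else is child
toy$age <- ifelse(toy$rating %in% c("R", "NC-17", "X"), "adult", "child")

# NA %in% ... is FALSE, so keep missing ratings missing
toy$age[is.na(toy$rating)] <- NA

# The mean of a logical vector is the proportion of TRUE values
pct.adult <- mean(toy$age == "adult", na.rm = TRUE) * 100
pct.adult
```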

Question 6

Now, let’s add some more information to our movies dataset. The year.index dataframe tells us, for each year, how well the world economy was doing, plus whether or not there was a major international conflict in that year. You can get the dataframe from the following link. Like before, the data are tab-delimited and have a header row:

http://nathanieldphillips.com/wp-content/uploads/2016/01/year_index.txt

  1. Save the data as a new dataframe called year.index.
  2. Using merge(), add the year.index data to the movies dataframe.
  3. What was the median boxoffice review of movies in good, ok, and poor economic years?
  4. Create a boxplot (or beanplot or pirateplot) showing the distribution of movie budgets for those movies released during international conflict years compared to those released during non-conflict years.
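
The merge, summary, and plot steps above can be sketched on two small invented dataframes. All column names here (year, boxoffice, budget, economy, conflict) and all values are assumptions for illustration; the real year.index columns may be named differently.

```r
# Toy movies data (made-up values and column names)
toy.movies <- data.frame(name = c("A", "B", "C", "D"),
                         year = c(1990, 1990, 1991, 1992),
                         boxoffice = c(10, 30, 50, 70),
                         budget = c(5, 8, 20, 40),
                         stringsAsFactors = FALSE)

# Toy lookup table with one row per year (made-up values)
toy.index <- data.frame(year = c(1990, 1991, 1992),
                        economy = c("good", "poor", "good"),
                        conflict = c("yes", "no", "no"),
                        stringsAsFactors = FALSE)

# by = "year" matches rows on the shared year column;
# all.x = TRUE keeps every movie even if its year has no match
toy.merged <- merge(x = toy.movies, y = toy.index,
                    by = "year", all.x = TRUE)

# Median boxoffice within each level of economy
med.by.economy <- aggregate(boxoffice ~ economy,
                            data = toy.merged, FUN = median)
med.by.economy

# Budget distributions for conflict versus non-conflict years
boxplot(budget ~ conflict, data = toy.merged,
        xlab = "International conflict", ylab = "Budget")
```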