In this WPA, you’ll download a dataset with many errors. You’ll then clean it so it’s ready for analysis!
You can get the dataset from the following link. The dataset is tab-delimited and has a header row:
http://nathanieldphillips.com/wp-content/uploads/2016/01/movies_errors.txt
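As a rough illustration of what "tab-delimited with a header row" means in practice, here is a minimal sketch in Python using only the standard library. The helper names `download_text` and `read_tab_delimited` are made up for this example, the UTF-8 encoding is an assumption, and the sample text is invented rather than taken from the real file.

```python
import csv
import io
import urllib.request


def download_text(url, encoding="utf-8"):
    """Fetch a text file from a URL (the encoding is an assumption)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode(encoding)


def read_tab_delimited(text):
    """Parse tab-delimited text whose first line is a header row.

    Returns a list of dicts, one per data row, keyed by column name.
    All values are strings; convert numeric columns yourself.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


# Invented sample standing in for the real file.
sample = "name\tyear\nMovie A\t1999\nMovie B\t2005\n"
rows = read_tab_delimited(sample)
print(rows[0]["name"], rows[0]["year"])
```

With a network connection, `read_tab_delimited(download_text(url))` would apply the same parsing to the linked file.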
Here is some important information about the columns:
movie7653.name: The name of the movie
total.boxoffice.earnings: The total box office revenue in USD
dvd.earnings.in.us.639c: Total DVD revenue in USD
total.movie.budget: Budget in USD
genreX8423: The genre of the movie. There are many possible values ranging from Action to Western.
TIME: Length of the movie in minutes
year.of.release: Release year of movie
sequel: Is the movie a sequel? 0 means no, 1 means yes
Question 0
Question 1
Question 2
Check ALL the columns (except for the first “name” column) for errors! If you find any errors in a column, correct them!
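One possible way to check a column is to state what a valid value looks like and list every row that breaks the rule. The sketch below does this in Python for the sequel column, which should contain only 0 or 1. The function names and the sample rows are invented for illustration, and replacing bad values with a missing marker is just one option; where the intended value is obvious, correcting it directly may be better.

```python
def find_invalid(rows, column, is_valid):
    """Return (row_index, value) pairs where the column's value fails is_valid."""
    return [
        (i, row[column])
        for i, row in enumerate(rows)
        if not is_valid(row[column])
    ]


def is_binary_flag(value):
    """A sequel flag read from a text file should be the string '0' or '1'."""
    return value in ("0", "1")


# Invented rows; the last two contain deliberate errors.
rows = [{"sequel": "0"}, {"sequel": "1"}, {"sequel": "yes"}, {"sequel": "11"}]

bad = find_invalid(rows, "sequel", is_binary_flag)
print(bad)

# Mark values that cannot be trusted as missing instead of guessing.
for i, _ in bad:
    rows[i]["sequel"] = None
```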
Keep the following tips in mind:
Question 3
Create a new column called decade that shows the decade in which each movie was made. For example, movies made between 1950 and 1959 should be in one category, those made between 1960 and 1969 should be in another category (etc.).
Create a table showing the number of movies in each decade
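The decade can be computed with integer division: dropping the last digit of the year and multiplying by 10 gives the first year of the decade. Here is a minimal sketch in Python, with invented release years and a hypothetical helper `decade_of`.

```python
from collections import Counter


def decade_of(year):
    """Map a year to the first year of its decade, e.g. 1959 -> 1950."""
    return (year // 10) * 10


# Invented release years for illustration.
years = [1950, 1959, 1960, 1969, 1987, 2004]

decades = [decade_of(y) for y in years]
decade_counts = Counter(decades)

# Print the table in chronological order.
for decade in sorted(decade_counts):
    print(decade, decade_counts[decade])
```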
Question 4
Create a new column called time.30 that groups the time variable in blocks of 30 minutes. For example, movie times between 0 and 29 should be in one category, those between 30 and 59 minutes should be in a second category (etc.).
Create a table showing the number of movies in each group of 30 minutes.
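The same integer-division idea works for 30-minute blocks. The sketch below, again in Python with invented running times, keeps the numeric start of each block for sorting and builds a readable label such as 30-59 only when printing. The helper names are assumptions made for this example.

```python
from collections import Counter


def block_start(minutes, width=30):
    """Return the lower bound of the block containing `minutes`."""
    return (int(minutes) // width) * width


def block_label(start, width=30):
    """Format a block as 'start-end', e.g. 30 -> '30-59'."""
    return f"{start}-{start + width - 1}"


# Invented running times in minutes.
times = [12, 29, 30, 59, 95, 118, 121]

starts = [block_start(t) for t in times]
block_counts = Counter(starts)

# Sorting by the numeric start avoids '120-149' sorting before '30-59'.
for start in sorted(block_counts):
    print(block_label(start), block_counts[start])
```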
Question 5
Create a new column called age that has one of two values: child or adult. Movies with ratings of G, PG, or PG-13 are suitable for children. Movies with ratings of R, NC-17, or X are for adults.
What percentage of movies are only for adults?
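The rating-to-age mapping can be sketched as a lookup against two sets of ratings. This Python example uses invented ratings, and it assumes that unrecognized ratings should be treated as missing and left out of the percentage; that choice is not specified in the assignment.

```python
CHILD_RATINGS = {"G", "PG", "PG-13"}
ADULT_RATINGS = {"R", "NC-17", "X"}


def age_group(rating):
    """Return 'child' or 'adult', or None for an unrecognized rating."""
    if rating in CHILD_RATINGS:
        return "child"
    if rating in ADULT_RATINGS:
        return "adult"
    return None


# Invented ratings for illustration.
ratings = ["G", "R", "PG-13", "NC-17", "PG", "R", "X", "PG"]

ages = [age_group(r) for r in ratings]

# Assumption: movies with unknown ratings are excluded from the denominator.
known = [a for a in ages if a is not None]
percent_adult = 100 * known.count("adult") / len(known)
print(percent_adult)
```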
Question 6
Now, let’s add some more information to our movies dataset. The dataframe year.lookup tells us, for each year, how well the world economy was doing and whether or not there was a major international conflict in that year. You can get the dataframe from the following link. As before, the data are tab-delimited and have a header row:
http://nathanieldphillips.com/wp-content/uploads/2016/01/year_index.txt
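Adding the lookup information amounts to matching each movie's release year against the year in the lookup table. The sketch below shows one way to do that in Python. The lookup column names `year`, `economy`, and `conflict` are hypothetical (check the header row of the real file), and all rows are invented. Movies whose year is absent from the lookup are kept unchanged.

```python
def add_year_info(movies, year_lookup, movie_year_col, lookup_year_col):
    """Attach lookup columns to each movie by matching on year.

    Every movie is kept; movies whose year is not in the lookup
    simply gain no extra columns.
    """
    by_year = {row[lookup_year_col]: row for row in year_lookup}
    merged = []
    for movie in movies:
        combined = dict(movie)  # copy so the input rows are not modified
        extra = by_year.get(movie[movie_year_col], {})
        for key, value in extra.items():
            if key != lookup_year_col:
                combined[key] = value
        merged.append(combined)
    return merged


# Invented rows; the lookup column names are assumptions.
movies = [
    {"name": "Movie A", "year.of.release": "1999"},
    {"name": "Movie B", "year.of.release": "2005"},
    {"name": "Movie C", "year.of.release": "1800"},
]
year_lookup = [
    {"year": "1999", "economy": "good", "conflict": "no"},
    {"year": "2005", "economy": "bad", "conflict": "yes"},
]

merged = add_year_info(movies, year_lookup, "year.of.release", "year")
for row in merged:
    print(row)
```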