Please Indicate

  • Who you collaborated with: N/A
  • Approximately how much time did you spend on this problem set: 1 hour
  • What, if anything, gave you the most trouble: Minor difficulty knitting HTML because of a small error in my code.

Tips

  • Look at and explore your data with the View() command
  • To get a quick idea of what kind of variables you have in a data frame, use the glimpse() function from the dplyr package
  • Help files are your friend. Many (but not all) R functions and datasets have help files. For example, you can access the help file for the movies data set below by typing ?movies
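
As a quick illustration of these tips, here is a minimal sketch, assuming the dplyr and ggplot2movies packages are installed. It reads the data through ggplot2movies::movies so that nothing in the workspace is overwritten; the interactive commands are left commented out because they open a viewer or help pane.

```r
library(dplyr)

# Compact overview: one line per variable, with its type and first few values
glimpse(ggplot2movies::movies)

# Interactive helpers, best run from the console (e.g. in RStudio):
# View(ggplot2movies::movies)   # spreadsheet-style viewer
# ?ggplot2movies::movies        # help file for the data set
```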

Question 1: Movie Ratings

The movies data set in the ggplot2movies package has information and ratings on 28,819 movies. This many data points is a bit unwieldy, so let’s take a random sample of 1000 of these movies. Furthermore, let’s take the variable Comedy and convert it to a yes vs no (binary) categorical variable. Note: you don’t need to understand this code for now; we’ll see it when we study data manipulation.

# Do not edit this section
data(movies)
movies <- movies %>% 
  sample_n(1000) %>% 
  mutate(Comedy=ifelse(Comedy==1, "yes", "no"))

a)

You want to know for these 1000 randomly chosen movies: What is the relationship between the year the movie was made and the IMDB rating? Furthermore, you want to distinguish between comedies and non-comedies. In the code block below, write the code that generates a graphic that will answer this for you:

# Write your code here:

ggplot(data=movies, aes(x = year, y = rating, color = Comedy)) + geom_point(alpha = 0.6)

b)

As best you can, answer this question: Within these 1000 movies, do comedies get rated higher than non-comedies?

The spread of ratings for comedies and non-comedies appears indistinguishable, and the two groups do not show dissimilar trends over time.

So, no: within these 1000 movies, comedies are not consistently rated higher than non-comedies.
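
One way to check this visual impression numerically is to compare summary statistics of the ratings in the two groups. The following is a minimal sketch, assuming the dplyr and ggplot2movies packages are installed; the names movies_sample and rating_summary are illustrative only, and because the sample is random, the exact numbers will differ from run to run and from the sample plotted above.

```r
library(dplyr)

# Draw a fresh sample of 1000 movies, recoding Comedy as in the setup code
movies_sample <- ggplot2movies::movies %>%
  sample_n(1000) %>%
  mutate(Comedy = ifelse(Comedy == 1, "yes", "no"))

# Center and spread of the IMDB rating for comedies vs non-comedies
rating_summary <- movies_sample %>%
  group_by(Comedy) %>%
  summarise(
    n_movies      = n(),
    mean_rating   = mean(rating),
    median_rating = median(rating),
    sd_rating     = sd(rating)
  )

rating_summary
```

If the means and medians of the two groups are close relative to their standard deviations, that is consistent with the conclusion drawn from the scatterplot.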

Question 2: Babynames

Considering the babynames data set in the babynames package again, we will limit consideration to only the name “Casey”.

# Do not edit this section
data(babynames)
babynames <- babynames %>% 
  filter(name=="Casey")

a)

I want to know about popularity trends of the name “Casey” as a male name and as a female name over the years. In the code block below, write the code that generates a graphic that will answer this for you:

# Write your code here:

ggplot(data=babynames, aes(x=year, y=n, color=sex)) + geom_line()

b)

Given this graphic, what can you say about the name “Casey”? Don’t merely describe what is already apparent on the graphic, but make a broader statement.

The name Casey grew dramatically in popularity from the ’60s to the mid-2000s as both a male and a female name. It is slightly more common as a male name, but its popularity trend has not historically differed by sex.
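
Because the total number of births changes from year to year, raw counts can overstate or understate a trend. As a possible cross-check, the sketch below plots the prop column of babynames (the share of births of that sex in a given year) instead of n. It assumes the dplyr, ggplot2, and babynames packages are installed; the names casey and casey_plot are illustrative only.

```r
library(dplyr)
library(ggplot2)

# Keep only the rows for the name "Casey", without overwriting babynames
casey <- babynames::babynames %>%
  filter(name == "Casey")

# Same graphic as above, but with the yearly proportion on the y-axis
casey_plot <- ggplot(data = casey, aes(x = year, y = prop, color = sex)) +
  geom_line() +
  labs(y = "Proportion of births of that sex")

casey_plot
```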