Please Indicate

  • Who you collaborated with: Sophia Konanc helped me
  • Approximately how much time did you spend on this problem set: 30 mins
  • What, if anything, gave you the most trouble: I had trouble figuring out where to start and how to load the narrowed-down data

Tips

  • Look and explore your data with the View() command
  • To get a quick idea of what kind of variables you have in a data frame, use the glimpse() function from the dplyr package
  • Help files are your friend. Many (but not all) R functions and datasets have help files. For example, you can access the help file for the movies data set below by typing ?movies
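
The tips above can be tried out with a few lines of R. This is only a minimal sketch, and it assumes the dplyr and ggplot2movies packages are already installed; the View() call is wrapped in interactive() because it needs a session such as RStudio.

```r
library(dplyr)
library(ggplot2movies)

# Load the movies data set that ships with ggplot2movies
data(movies)

# Compact overview: one line per variable with its type and first few values
glimpse(movies)

# Open the help file for the data set (same as typing ?movies at the console)
help("movies")

# View() opens a spreadsheet-style viewer, so only call it in an interactive session
if (interactive()) View(movies)
```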

Question 1: Movie Ratings

The movies data set in the ggplot2movies package has information and ratings on 28,819 movies. This many data points is a bit unwieldy, so let’s take a random sample of 1000 of these movies. Furthermore, let’s take the variable Comedy and convert it to a yes vs no (binary) categorical variable. Note: you don’t need to understand this code for now, we’ll see this when we study data manipulation.

# Do not edit this section
data(movies)
movies <- movies %>% 
  sample_n(1000) %>% 
  mutate(Comedy=ifelse(Comedy==1, "yes", "no"))

a)

You want to know for these 1000 randomly chosen movies: What is the relationship between the year the movie was made and the IMDB rating? Furthermore, you want to distinguish between comedies and non-comedies. In the code block below, write the code that generates a graphic that will answer this for you:

# Write your code here:
ggplot(data=movies, aes(x = year, y = rating, color=Comedy)) + 
   geom_point(alpha=.8)

b)

As best you can, answer this question: Within these 1000 movies, do comedies get rated higher than non-comedies?

Comedies seem to have higher ratings than non-comedies. There are more non-comedies with lower ratings. A majority of comedies have received ratings of 5.0 and higher.
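
A visual impression like this can be checked with a numeric summary. The sketch below is one possible way to do that, not part of the assignment: it redraws a sample so that it runs on its own, and the names movies_sample and rating_summary as well as the seed are hypothetical choices. Because the sample is random, its numbers will not match the graphic above exactly.

```r
library(dplyr)
library(ggplot2movies)

data(movies)

# Assumption: an arbitrary seed so the sketch is reproducible;
# the original sample was drawn without one
set.seed(76)

# Same preparation as in the "Do not edit" section, stored under a new name
movies_sample <- movies %>%
  sample_n(1000) %>%
  mutate(Comedy = ifelse(Comedy == 1, "yes", "no"))

# Per group: number of movies, median rating, and share rated 5.0 or higher
rating_summary <- movies_sample %>%
  group_by(Comedy) %>%
  summarise(
    n = n(),
    median_rating = median(rating),
    prop_5_or_higher = mean(rating >= 5)
  )

rating_summary
```

Comparing median_rating and prop_5_or_higher across the two rows gives a direct answer to whether comedies are rated higher in the sample.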

Question 2: Babynames

Considering the babynames data set in the babynames package again, we will limit consideration to only the name “Casey”.

# Do not edit this section
data(babynames)
babynames <- babynames %>% 
  filter(name=="Casey")

a)

I want to know about popularity trends of the name “Casey” as a male name and as a female name over the years. In the code block below, write the code that generates a graphic that will answer this for you:

# Write your code here:
ggplot(data=babynames, aes(x=year, y=prop, color=sex)) + 
  geom_line()

b)

Given this graphic, what can you say about the name “Casey”? Don’t merely describe what is already apparent on the graphic, but make a broader statement.

It seems that a name's popularity often declines drastically after it reaches its peak. The name Casey peaked in 1990, but it really grew as a female name around the middle of the 1980s. In the 1990s the name declined in general but became more popular among females, which is interesting because it had previously been more popular with males.
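
The peak years read off the graphic can be confirmed from the data. This is a minimal sketch that assumes the babynames package and dplyr 1.0.0 or later (for slice_max()) are installed; the names casey and casey_peaks are hypothetical.

```r
library(dplyr)
library(babynames)

# Keep only the rows for the name "Casey", as in the "Do not edit" section
casey <- babynames::babynames %>%
  filter(name == "Casey")

# For each sex, keep the single year in which the proportion was highest
casey_peaks <- casey %>%
  group_by(sex) %>%
  slice_max(prop, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(sex, year, prop)

casey_peaks
```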