library(fivethirtyeight) #NEW!!
library(moderndive) #NEW!!
library(broom) #NEW!!
library(NHANES) #NEW!!
library(tidyverse)

Bechdel Test

Before we jump into the statistics, I would like you to take the time to read this fivethirtyeight article. We are going to be using more or less the same data they used in that article.

Throughout the analysis, you should use the bechdel_1990up dataset that removes missing values and only includes movies that were made in 1990 or later.

bechdel_1990up <-
  bechdel %>% 
  filter(year>1989, 
         !is.na(budget_2013), 
         !is.na(intgross_2013)) 

Exploratory Work

(1@) The code below summarizes the percentage of films that pass for each of the 24 years that we have data. It also counts the number of films by year. Use this code and pipe into a ggplot to create a graph of pass_pct by year. Represent it as both a point and a line and size the points by n. (HINT: remember the size aesthetic). Describe the trend you observe.

#Scatterplot Graph
bechdel_1990up %>% 
  group_by(year) %>% 
  summarize(pass_pct = mean(binary=="PASS"),
            n = n()) %>%

ggplot()+
  geom_point(aes(x=year,y=pass_pct, size = n)) + 
  geom_line(aes(x=year, y=pass_pct))

" The more years that go by the larger the points become, which shows that more movies have passed the bechel test over time. There is a sharp dip in the number of movies that passed the test around 1992 and higher peaks from 2001 to 2006."
## [1] " The more years that go by the larger the points become, which shows that more movies have passed the bechel test over time. There is a sharp dip in the number of movies that passed the test around 1992 and higher peaks from 2001 to 2006."
  1. Create a graph that shows the distribution of intgross_2013, the film’s international gross in 2013 inflation adjusted US dollars. Describe the distribution.
bechdel_1990up %>% 
ggplot()+
geom_histogram(aes(x=intgross_2013), bins = 50)

bechdel_1990up %>% 
ggplot()+
geom_histogram(aes(x=intgross_2013, fill = binary), bins = 50)

"The distribution is like a ramp. There is a peak towards the beginning of the X-axis and then it quickly tapers of as you move to the right of the X-axis, so it is skewed right. Movies that failed the test are stacked on top of those that passed it."
## [1] "The distribution is like a ramp. There is a peak towards the beginning of the X-axis and then it quickly tapers of as you move to the right of the X-axis, so it is skewed right. Movies that failed the test are stacked on top of those that passed it."
  1. Now compare the distribution of intgross_2013 by binary (PASS if the film passed the test and FAIL if it did not). Describe the differences, if any.
bechdel_1990up %>%
  ggplot()+
  geom_boxplot(aes(x=binary,y=intgross_2013))

bechdel_1990up %>% 
ggplot()+
geom_histogram(aes(x=intgross_2013), bins = 50)

"The movies that failed appear to have a higher international gross income than the movies that passed. However, the medians don't appear to differ too strongly from each other. The data in this graph appears to follow a similar trend to the previous one. "
## [1] "The movies that failed appear to have a higher international gross income than the movies that passed. However, the medians don't appear to differ too strongly from each other. The data in this graph appears to follow a similar trend to the previous one. "
  1. Also create a graph that compares the distribution of budget_2013, the film’s budget in 2013 inflation adjusted US dollars, by binary. Describe the differences, if any.
ggplot(data=bechdel_1990up)+
  geom_boxplot(aes(x=binary,y=budget_2013))

"The median budget for movies that failed the test is higher than those that passed the test. Therefore, the difference in international gross income from the previous graph could be influenced by the higher budget that the movies that failed had in comparison to the movies that passed."
## [1] "The median budget for movies that failed the test is higher than those that passed the test. Therefore, the difference in international gross income from the previous graph could be influenced by the higher budget that the movies that failed had in comparison to the movies that passed."
  1. Create a plot that examines the relationship between budget_2013 and intgross_2013 (use intgross_2013 as the response variable). Color the points by binary. Describe what you observe.
bechdel_1990up %>% 
  ggplot()+
  geom_point(aes(x=budget_2013,y=intgross_2013, color=binary))

"The two color of dots(pass and fail) follow relatively similar trends by clumping at the beginning of the x-axis and then spreading out further to the right of the graph. The relationship appears linear, so the lower the budget the lower the gross international profit tends to be for the movies."
## [1] "The two color of dots(pass and fail) follow relatively similar trends by clumping at the beginning of the x-axis and then spreading out further to the right of the graph. The relationship appears linear, so the lower the budget the lower the gross international profit tends to be for the movies."

Fitting and Interpreting Models

According to the fivethirtyeight article, there is a perception in Hollywood that “films featuring women do worse at the box office”. We aim to investigate this using the bechdel_1990up data. We have already started this investigation in the previous section with some exploratory work. Now we will fit some models to help solidify our observations. Note that we will continue this analysis later in class or another lab as it’s a bit too long for one class period.

(6@) Fit a model that uses just the binary bechdel test results to explain intgross_2013. Interpret both coefficients in the context of the data.

#Model Fit w/binary
lm(intgross_2013 ~ binary,
   bechdel_1990up)
## 
## Call:
## lm(formula = intgross_2013 ~ binary, data = bechdel_1990up)
## 
## Coefficients:
## (Intercept)   binaryPASS  
##   205584255    -41141817
"The average international gross income of the movies sampled is positive while the coefficent of bianryPass is negative. A negative intercept in the y=mx+b format assuming that m is positive would mean that the budget for a binary pass could be high enough that at a point the predicted y value would be positive. "
## [1] "The average international gross income of the movies sampled is positive while the coefficent of bianryPass is negative. A negative intercept in the y=mx+b format assuming that m is positive would mean that the budget for a binary pass could be high enough that at a point the predicted y value would be positive. "

(7@) Use the model to find the fitted value and the residual for the movie “Cloudy with a Chance of Meatballs”.

 lm_BL <- lm(bechdel_1990up)
get_regression_points(lm_BL)
257131292 - 205584255    
## [1] 51547037
  1. Fit a model that uses budget_2013 to explain intgross_2013. Interpret both coefficients in the context of the data.
#Model with Budget
lm(intgross_2013 ~ budget_2013,
   bechdel_1990up)
## 
## Call:
## lm(formula = intgross_2013 ~ budget_2013, data = bechdel_1990up)
## 
## Coefficients:
## (Intercept)  budget_2013  
##   3167802.4          3.2
"Because the two variables are discrete the slope can be calculated and it is 3.2. The slope is positive, so as the budget goes up the international gross income also increases. The intercept is the mean international gross income when budget is looked at."
## [1] "Because the two variables are discrete the slope can be calculated and it is 3.2. The slope is positive, so as the budget goes up the international gross income also increases. The intercept is the mean international gross income when budget is looked at."
  1. Use the model to find the fitted value and the residual for the movie “Cloudy with a Chance of Meatballs”.
"residual"
## [1] "residual"
257131292- 3.23*108573160 + 3167802.4 
## [1] -90392212
"fitted"
## [1] "fitted"
3.23*108573160 + 3167802.4 
## [1] 353859109

(10@) Fit another model that uses both budget_2013 AND binary to predict intgross_2013. How do you interpret each of the coefficients (be careful!) and how do these interpretations differ from the interpretations above?

#Model with Budget and Binary
lm(intgross_2013 ~ binary + budget_2013,
   bechdel_1990up)
## 
## Call:
## lm(formula = intgross_2013 ~ binary + budget_2013, data = bechdel_1990up)
## 
## Coefficients:
## (Intercept)   binaryPASS  budget_2013  
##  -7.847e+06    1.997e+07    3.230e+00

NHANES data (more practice with graphing)

The National Health and Nutrition Examination Survey (NHANES) is a series of studies that collects health and demographic data from individuals of all ages in the United States. NHANES collects information via surveys, physical examination, and laboratory testing.

In this assignment, you will be looking at a dataset from the 2009-2010 and 2011-2012 data collection periods to explore relationships between demographic and health variables.

Some data cleaning. Filter out cases for the survey year 2009-2010. (NHANES did not collect data on the expanded race (Race3) category in 2009-2010.) Also filter out cases that have missing information for age, diabetes diagnosis, and depressed days. The last mutate fixes the AgeDecade variable. Use the nhanes object for the remainder of the assignment.

nhanes <- NHANES %>%
    filter(SurveyYr != "2009_10", 
           !is.na(Age), 
           !is.na(Diabetes), 
           !is.na(Depressed)) %>%
    mutate(AgeDecade = cut(Age, seq(10,80,10), 
                           labels = c("11-20", 
                                      "21-30", 
                                      "31-40", 
                                      "41-50", 
                                      "51-60", 
                                      "61-70", 
                                      "71-80")))

You can learn more about the variables in the dataset by searching for NHANES in Help.

Questions:

  1. How many individuals have and have not been diagnosed with diabetes?
nhanes %>%
count(Diabetes)
  1. Make a visualization that shows the distribution of poor mental health days in the past month (DaysMentHlthBad) in this study population and describe the distribution (shape, center, spread) using words and statistics.
nhanes %>%
ggplot()+
  geom_bar(aes(x=DaysMentHlthBad))
## Warning: Removed 2 rows containing non-finite values (stat_count).

nhanes %>%
  summarize(
    mean_DaysMentHlthBad=mean(DaysMentHlthBad,na.rm=TRUE)
  )
nhanes %>%
  summarize(
    sd_DaysMentHlthBad=sd(DaysMentHlthBad,na.rm=TRUE)
  )
nhanes %>%
  summarize(
    median(DaysPhysHlthBad,na.rm=TRUE)
  )
"There is one peak in the data at 0 bad mental health days, so it is unimodal. The mean is 3.9 which is relatively close to some of the higher peaks. Then it tapers off very quickly with values being below 250. It appears to be right skewed."
## [1] "There is one peak in the data at 0 bad mental health days, so it is unimodal. The mean is 3.9 which is relatively close to some of the higher peaks. Then it tapers off very quickly with values being below 250. It appears to be right skewed."
  1. Create a graph that shows how diabetes diagnosis varies by age category (AgeDecade). Describe what you see.
nhanes %>%
ggplot()+
  geom_bar(aes(x=AgeDecade))

"The number of people with diabetes has several peaks with the highest one being at 41 - 50 years of age. It is bimodal and relatively symmetrical with the exception of the low peak at 11 - 20 years old. People 21 -30 and 41 - 60 seem be the most represented in those with diabetes."
## [1] "The number of people with diabetes has several peaks with the highest one being at 41 - 50 years of age. It is bimodal and relatively symmetrical with the exception of the low peak at 11 - 20 years old. People 21 -30 and 41 - 60 seem be the most represented in those with diabetes."
  1. Make and discuss a visualization that shows how body mass index (BMI) differs between those who have and have not been diagnosed with diabetes.
nhanes %>%
ggplot()+
  geom_boxplot(aes(x=Diabetes, y=BMI))
## Warning: Removed 23 rows containing non-finite values (stat_boxplot).

"The median value for BMI is higher in those that have been diagnosed with diabetes than those that have not. This means that there is a higher represenation of those with a higher BMI being diagnosed with diabetes. So there could be a positive relationship between BMI and diabetes. However, there is also alot of variation in both categories."
## [1] "The median value for BMI is higher in those that have been diagnosed with diabetes than those that have not. This means that there is a higher represenation of those with a higher BMI being diagnosed with diabetes. So there could be a positive relationship between BMI and diabetes. However, there is also alot of variation in both categories."
  1. CHALLENGE: Make and discuss a visualization that shows how the BMI-diabetes relationship differs by current smoking status (SmokeNow). (Don’t worry about the NA category.)
nhanes %>%
ggplot()+
  geom_jitter(aes(x=SmokeNow,y=BMI, color =Diabetes))
## Warning: Removed 23 rows containing missing values (geom_point).

"There is a higher representation of those not diagnosed with diabetes, with a lower BMI having not smoked before. However, both the categories of that having smoked and having not smoked have representation of people with and without diabetes. As the BMI goes up we do see that there is a higher representation of people with diabetes that have not smoked. Therefore, the relationship isnt strong between smoking, diabetes, and BMI."
## [1] "There is a higher representation of those not diagnosed with diabetes, with a lower BMI having not smoked before. However, both the categories of that having smoked and having not smoked have representation of people with and without diabetes. As the BMI goes up we do see that there is a higher representation of people with diabetes that have not smoked. Therefore, the relationship isnt strong between smoking, diabetes, and BMI."
  1. Create and discuss a graph to show the relationship between bad mental health days (DaysMentHlthBad) and number of physically active days (PhysActiveDays).
nhanes %>%
  ggplot()+
  geom_jitter(aes(x=DaysMentHlthBad,y=PhysActiveDays))
## Warning: Removed 1739 rows containing missing values (geom_point).

"There is no observable correlation between bad Mental health days and the number of physically active days. Many of the data points overlap at 0 bad mental health days and each of the values for days of physcial activity are representted in this overlap. This shows that on average someone who is physically active for 6 days versus someone active for 1 day don't have a large difference in their chance of having a bad mental health day. "
## [1] "There is no observable correlation between bad Mental health days and the number of physically active days. Many of the data points overlap at 0 bad mental health days and each of the values for days of physcial activity are representted in this overlap. This shows that on average someone who is physically active for 6 days versus someone active for 1 day don't have a large difference in their chance of having a bad mental health day. "