Choose two numeric variables, and pair each one with a column you built (i.e., calculated based on others)
#loading libraries and data into the file
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(stringr)
bechdel_data_movies <- read_csv("C:/Users/Lauren/Documents/Stats Data/movies.csv")
## Rows: 1794 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): imdb, title, test, clean_test, binary, domgross, intgross, code, d...
## dbl (7): year, budget, budget_2013, period_code, decade_code, metascore, im...
## num (1): imdb_votes
## lgl (2): response, error
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
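# Bin the 2013-adjusted budget into four ordered cost levels (1 = lowest, 4 = highest).
# case_when() checks conditions in order, so anything between 15M and 100M falls through to 2.
bechdel_data_movies <- bechdel_data_movies |>
  mutate(costly = case_when(
    budget_2013 > 200000000 ~ 4,
    budget_2013 >= 100000000 ~ 3,
    budget_2013 <= 15000000 ~ 1,
    TRUE ~ 2
  ))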
print(bechdel_data_movies)
## # A tibble: 1,794 × 35
## year imdb title test clean_test binary budget domgross intgross code
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 2013 tt1711425 21 &a… nota… notalk FAIL 1.3 e7 25682380 42195766 2013…
## 2 2012 tt1343727 Dredd… ok-d… ok PASS 4.50e7 13414714 40868994 2012…
## 3 2013 tt2024544 12 Ye… nota… notalk FAIL 2 e7 53107035 1586070… 2013…
## 4 2013 tt1272878 2 Guns nota… notalk FAIL 6.1 e7 75612460 1324930… 2013…
## 5 2013 tt0453562 42 men men FAIL 4 e7 95020213 95020213 2013…
## 6 2013 tt1335975 47 Ro… men men FAIL 2.25e8 38362475 1458038… 2013…
## 7 2013 tt1606378 A Goo… nota… notalk FAIL 9.2 e7 67349198 3042491… 2013…
## 8 2013 tt2194499 About… ok-d… ok PASS 1.20e7 15323921 87324746 2013…
## 9 2013 tt1814621 Admis… ok ok PASS 1.3 e7 18007317 18007317 2013…
## 10 2013 tt1815862 After… nota… notalk FAIL 1.3 e8 60522097 2443731… 2013…
## # ℹ 1,784 more rows
## # ℹ 25 more variables: budget_2013 <dbl>, domgross_2013 <chr>,
## # intgross_2013 <chr>, period_code <dbl>, decade_code <dbl>, imdb_id <chr>,
## # plot <chr>, rated <chr>, response <lgl>, language <chr>, country <chr>,
## # writer <chr>, metascore <dbl>, imdb_rating <dbl>, director <chr>,
## # released <chr>, actors <chr>, genre <chr>, awards <chr>, runtime <chr>,
## # type <chr>, poster <chr>, imdb_votes <dbl>, error <lgl>, costly <dbl>
# Code each movie's era: 3 = after 2000, 2 = 1994 to 2000, 1 = before 1994.
bechdel_data_movies <- bechdel_data_movies |>
  mutate(old_movie = case_when(
    year > 2000 ~ 3,
    year >= 1994 ~ 2,
    TRUE ~ 1
  ))
print(bechdel_data_movies)
## # A tibble: 1,794 × 36
## year imdb title test clean_test binary budget domgross intgross code
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 2013 tt1711425 21 &a… nota… notalk FAIL 1.3 e7 25682380 42195766 2013…
## 2 2012 tt1343727 Dredd… ok-d… ok PASS 4.50e7 13414714 40868994 2012…
## 3 2013 tt2024544 12 Ye… nota… notalk FAIL 2 e7 53107035 1586070… 2013…
## 4 2013 tt1272878 2 Guns nota… notalk FAIL 6.1 e7 75612460 1324930… 2013…
## 5 2013 tt0453562 42 men men FAIL 4 e7 95020213 95020213 2013…
## 6 2013 tt1335975 47 Ro… men men FAIL 2.25e8 38362475 1458038… 2013…
## 7 2013 tt1606378 A Goo… nota… notalk FAIL 9.2 e7 67349198 3042491… 2013…
## 8 2013 tt2194499 About… ok-d… ok PASS 1.20e7 15323921 87324746 2013…
## 9 2013 tt1814621 Admis… ok ok PASS 1.3 e7 18007317 18007317 2013…
## 10 2013 tt1815862 After… nota… notalk FAIL 1.3 e8 60522097 2443731… 2013…
## # ℹ 1,784 more rows
## # ℹ 26 more variables: budget_2013 <dbl>, domgross_2013 <chr>,
## # intgross_2013 <chr>, period_code <dbl>, decade_code <dbl>, imdb_id <chr>,
## # plot <chr>, rated <chr>, response <lgl>, language <chr>, country <chr>,
## # writer <chr>, metascore <dbl>, imdb_rating <dbl>, director <chr>,
## # released <chr>, actors <chr>, genre <chr>, awards <chr>, runtime <chr>,
## # type <chr>, poster <chr>, imdb_votes <dbl>, error <lgl>, costly <dbl>, …
At least one pair should be a response variable and an explanatory variable
Plot a visualization for each relationship, and draw some conclusions based on the plot
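# Strip the " min" suffix from runtime and convert it to a number.
# NOTE: na.rm = TRUE is not a mutate() option; here it just adds a logical column
# named na.rm, which is why the tibble below has 38 columns instead of 37.
bechdel_data_movies <- bechdel_data_movies |>
  mutate(
    runtime_num = str_remove_all(runtime, " min"),
    runtime_num = as.numeric(runtime_num),
    na.rm = TRUE
  )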
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `runtime_num = as.numeric(runtime_num)`.
## Caused by warning:
## ! NAs introduced by coercion
bechdel_data_movies
## # A tibble: 1,794 × 38
## year imdb title test clean_test binary budget domgross intgross code
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 2013 tt1711425 21 &a… nota… notalk FAIL 1.3 e7 25682380 42195766 2013…
## 2 2012 tt1343727 Dredd… ok-d… ok PASS 4.50e7 13414714 40868994 2012…
## 3 2013 tt2024544 12 Ye… nota… notalk FAIL 2 e7 53107035 1586070… 2013…
## 4 2013 tt1272878 2 Guns nota… notalk FAIL 6.1 e7 75612460 1324930… 2013…
## 5 2013 tt0453562 42 men men FAIL 4 e7 95020213 95020213 2013…
## 6 2013 tt1335975 47 Ro… men men FAIL 2.25e8 38362475 1458038… 2013…
## 7 2013 tt1606378 A Goo… nota… notalk FAIL 9.2 e7 67349198 3042491… 2013…
## 8 2013 tt2194499 About… ok-d… ok PASS 1.20e7 15323921 87324746 2013…
## 9 2013 tt1814621 Admis… ok ok PASS 1.3 e7 18007317 18007317 2013…
## 10 2013 tt1815862 After… nota… notalk FAIL 1.3 e8 60522097 2443731… 2013…
## # ℹ 1,784 more rows
## # ℹ 28 more variables: budget_2013 <dbl>, domgross_2013 <chr>,
## # intgross_2013 <chr>, period_code <dbl>, decade_code <dbl>, imdb_id <chr>,
## # plot <chr>, rated <chr>, response <lgl>, language <chr>, country <chr>,
## # writer <chr>, metascore <dbl>, imdb_rating <dbl>, director <chr>,
## # released <chr>, actors <chr>, genre <chr>, awards <chr>, runtime <chr>,
## # type <chr>, poster <chr>, imdb_votes <dbl>, error <lgl>, costly <dbl>, …
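The coercion warning above means that some runtime values were not numbers once " min" was removed, so as.numeric() turned them into NA. A minimal sketch of one way to handle this explicitly is below. The vector runtime_raw and the "N/A" placeholder are assumptions made for illustration; the actual non-numeric values in movies.csv were not inspected here.

```r
# Hypothetical raw values; "N/A" is an assumed placeholder for a missing runtime
runtime_raw <- c("142 min", "95 min", "N/A", "101 min")

# Drop the " min" suffix, then mark the placeholder as missing before converting,
# so as.numeric() has nothing left to warn about
runtime_clean <- sub(" min$", "", runtime_raw)
runtime_clean[runtime_clean == "N/A"] <- NA
runtime_num <- as.numeric(runtime_clean)

# Count how many runtimes are missing after conversion
n_missing <- sum(is.na(runtime_num))
```

Counting the missing values right after the conversion makes it easier to interpret later warnings, such as rows being dropped from a plot.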
costly_plot <- ggplot(data = bechdel_data_movies, aes(x = runtime_num, y = costly)) +
  geom_count() +
  labs(title = "Costliness in Movies", x = "Movie Runtime in Minutes", y = "Costly Rating")
costly_plot
## Warning: Removed 203 rows containing non-finite outside the scale range
## (`stat_sum()`).
old_plot <- ggplot(data = bechdel_data_movies, aes(x = budget_2013, y = old_movie)) +
  geom_jitter() +
  labs(title = "2013 Adjusted Budget by Age Group of Movie", x = "Budget", y = "Movie Era")
print(old_plot)
Draw some conclusions based on the plot. Use what we’ve covered so far in class to scrutinize the plot (e.g., are there any outliers?). For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.
For the costliness/runtime graph, it looks like higher levels of costliness are associated with longer runtimes. This makes sense: the more you need to film, the more a film will cost in materials, sets, acting, staff, etc.
If we had to select outliers for this graph, they would be the movies that are longer than 225 minutes. There are a few of these movies, and strangely enough they are in the lower two levels of costliness. I’m not sure what the significance of this is. Perhaps lower-budget films are freer to run longer because they need less action, set design, acting, or staff labor, whether due to genre or another confounding variable.
I’d like to know if there is a relationship between runtime and genre. I’d also love to know if there is a relationship between costliness and domestic gross income. While there is a small relationship with runtime, I imagine there is a stronger relationship between the investment of making a movie and the outcome of profit.
For the age/budget graph, it looks like movies are getting more expensive over time, even though budget_2013 is already adjusted to 2013 dollars. If we were to pick outliers, I would choose the highest-budget movies in the most recent era. They stand out on the graph and likely would numerically as well. I’m not especially curious about this relationship, but if I had to ask a question of this data, I’d like to know more about the ratio of investment (budget_2013) to gross domestic and international income for the lower-cost modern movies. Perhaps there is an opportunity there.
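The outliers discussed above were picked by eye. One common numeric check is the 1.5 × IQR rule, sketched below on a small made-up vector. The values in runtime are hypothetical and are not taken from movies.csv, and the 1.5 multiplier is a convention rather than a requirement.

```r
# Toy runtimes in minutes (hypothetical), including one very long film and one NA
runtime <- c(88, 95, 101, 104, 110, 112, 118, 125, 131, 240, NA)

# First and third quartiles, ignoring missing values
q   <- quantile(runtime, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[[2]] - q[[1]]

# Fences at 1.5 * IQR below Q1 and above Q3
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

# Flag non-missing values that fall outside the fences
is_outlier <- !is.na(runtime) & (runtime < lower | runtime > upper)
outliers <- runtime[is_outlier]
```

The same steps could be applied to runtime_num or budget_2013 to see whether the points that stand out on the plots are also flagged numerically.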
Calculate the appropriate correlation coefficient for each of these combinations
old_corr_coef <- cor(bechdel_data_movies$budget_2013, bechdel_data_movies$old_movie)
old_corr_coef
## [1] 0.07202
runtime_corr_coef <- cor(bechdel_data_movies$budget_2013, bechdel_data_movies$runtime_num)
runtime_corr_coef
## [1] NA
runtime_cor_coef <- cor(bechdel_data_movies$runtime_num, bechdel_data_movies$costly)
runtime_cor_coef
## [1] NA
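I’m honestly surprised by these results. I expected to see a modest positive correlation between runtime and budget, as the graph suggests one. However, the two NA results are not really coefficients: cor() returns NA by default whenever either input contains missing values, and runtime_num has missing entries (the plot warning above reports 203 rows removed). Those two correlations still need to be computed on complete cases only. The small, negligibly positive correlation (about 0.07) between budget and era of the movie makes sense. The date cut-offs were chosen arbitrarily for the purposes of this assignment.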
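A minimal sketch of how the NA results could be avoided is below, using made-up data rather than the movie file. The vectors budget, runtime, and era are hypothetical stand-ins for budget_2013, runtime_num, and old_movie. The use argument of cor() controls how missing values are handled, and method = "spearman" is one reasonable choice when a variable is an ordinal code.

```r
# Toy data (hypothetical values, not taken from movies.csv)
budget  <- c(10, 25, 40, 80, 150, 220)
runtime <- c(90, 95, NA, 110, 125, 140)   # one missing runtime
era     <- c(1, 1, 2, 2, 3, 3)            # ordinal code, like old_movie

# Default behaviour: any NA in the inputs makes the result NA
cor_default <- cor(budget, runtime)

# Pearson correlation using only rows where both values are present
cor_complete <- cor(budget, runtime, use = "complete.obs")

# Spearman rank correlation, often preferred for ordinal codes
cor_spearman <- cor(budget, era, method = "spearman")
```

Applied to the real data, the complete-case call would look like cor(bechdel_data_movies$budget_2013, bechdel_data_movies$runtime_num, use = "complete.obs").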
Build a confidence interval for each of the response variable(s). Provide a detailed conclusion of the response variable (i.e., the population) based on your confidence interval.
conf_int_runtime <- t.test(bechdel_data_movies$runtime_num)$conf.int
conf_int_runtime
## [1] 109.9603 112.0171
## attr(,"conf.level")
## [1] 0.95
conf_int_budget <- t.test(bechdel_data_movies$budget_2013)$conf.int
conf_int_budget
## [1] 52921588 58007629
## attr(,"conf.level")
## [1] 0.95
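Each confidence interval gives a range of plausible values for a population mean. We can be 95% confident that the mean runtime of all movies like those in this data set is between about 110.0 and 112.0 minutes, and that their mean 2013-adjusted budget is between about $52.9 million and $58.0 million. The 95% describes the method rather than any one interval: if we drew many samples and built an interval from each, about 95% of those intervals would contain the true mean.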
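To show where these intervals come from, here is a small sketch that builds a 95% t-interval by hand and compares it with t.test(). The runtime values are made up for illustration. The formula is the sample mean plus or minus the t critical value times the standard error.

```r
# Toy sample (hypothetical runtimes in minutes, with one missing value)
runtime <- c(95, 102, 110, 118, 125, NA, 131, 99, 108)

x  <- runtime[!is.na(runtime)]     # t.test() also drops NAs before computing
n  <- length(x)
se <- sd(x) / sqrt(n)              # standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # critical value for a 95% interval

# Interval by hand: mean plus or minus critical value times standard error
manual_ci <- mean(x) + c(-1, 1) * t_crit * se

# Interval from t.test() for comparison
ttest_ci <- as.numeric(t.test(runtime)$conf.int)
```

The two intervals agree. A larger sample shrinks the standard error, which is why the interval for runtime in the full data set is only about two minutes wide.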