Your RMarkdown notebook for this data dive should contain the following:
A list of at least 2 columns (or values) in your data which are unclear until you read the documentation.
E.g., this could be a column name, or just some value inside a cell of your data
Why do you think they chose to encode the data the way they did? What could have happened if you didn’t read the documentation?
For binary and response, I imagine that those who encoded the data chose a two-value coding because it makes calculations that rely on binary categories easier.
For code, I think they wanted to combine the year and the binary outcome in a single column so both could be used together in calculations.
At least one element of your data that is unclear even after reading the documentation
You may need to do some digging, but is there anything about the data that your documentation does not explain?
Build at least two visualizations which use a column of data that is affected by the issue you brought up in bullet #2, above. In these visualizations, find a way to highlight the issue from different perspectives, explain what is unclear, and why it might be concerning.
You can use color or a text label, but also make sure to explain your thoughts using Markdown.
#loading libraries and data into the file
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
bechdel_data_movies <- read_csv("C:/Users/Lauren/Documents/Stats Data/movies.csv")
## Rows: 1794 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): imdb, title, test, clean_test, binary, domgross, intgross, code, d...
## dbl (7): year, budget, budget_2013, period_code, decade_code, metascore, im...
## num (1): imdb_votes
## lgl (2): response, error
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
unique_error <- unique(bechdel_data_movies$error)
Since error as a column appears to be empty, we’re going to pick another column!
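One way to confirm that a column really is empty, rather than judging from `unique()` alone, is to count its non-missing values. The sketch below is only an illustration: `toy_movies` is a made-up stand-in for the real data, and its values are invented.

```r
library(dplyr)

# Hypothetical stand-in for the movies data (invented values, not the real file)
toy_movies <- data.frame(
  title = c("A", "B", "C"),
  error = c(NA, NA, NA)
)

# TRUE only when every value in the column is NA
error_is_empty <- all(is.na(toy_movies$error))
error_is_empty

# Number of non-missing values in every column, computed in one pass
non_missing_counts <- toy_movies %>%
  summarise(across(everything(), ~ sum(!is.na(.x))))
non_missing_counts
```

Running the same two checks on `bechdel_data_movies$error` would show whether the column holds any usable values at all.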
#year vs period code
unique_period_codes <- unique(bechdel_data_movies$period_code)
unique_period_codes
## [1] 1 2 3 4 5 NA
year_vs_period <- ggplot(bechdel_data_movies, aes(x = period_code, y = year)) + geom_count()
year_vs_period
## Warning: Removed 179 rows containing non-finite outside the scale range
## (`stat_sum()`).
period_code_bar <- ggplot(bechdel_data_movies, aes(x = period_code)) + geom_bar()
period_code_bar
## Warning: Removed 179 rows containing non-finite outside the scale range
## (`stat_count()`).
It looks like the period code only loosely tracks the year the movie
came out; it is not a one-to-one mapping. On top of that, 179 rows
have no period code at all (the NA in the output above, and the rows
dropped from both plots).
If I wanted to check Bechdel test results against the period of time in
which the movie was released or depicted, I would not get
straightforward, reliable results.
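A table of the earliest and latest year within each period code can make the relationship easier to judge than the plot alone. This is a minimal sketch on invented data: `toy_movies` and its year and code values are hypothetical and do not reflect the real mapping in the data set.

```r
library(dplyr)

# Hypothetical stand-in for the movies data (invented years and codes)
toy_movies <- data.frame(
  year        = c(2012, 2011, 2007, 2006, 1985, 1979),
  period_code = c(1, 1, 2, 2, NA, NA)
)

# Year range and row count for each period code; NA is kept as its own group
period_ranges <- toy_movies %>%
  group_by(period_code) %>%
  summarise(
    first_year = min(year),
    last_year  = max(year),
    n_movies   = n()
  )
period_ranges
```

Applied to `bechdel_data_movies`, overlapping or gapped year ranges between codes, and the year range of the NA group, would show where the period code stops being a reliable proxy for release year.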
Do you notice any significant risks? If so, what could you do to reduce negative consequences?
For at least two categorical columns, check for examples of the following, and describe what you find:
Are there explicitly missing rows?
Are there implicitly missing rows?
Are there empty groups?
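The three checks above can be sketched in a few lines. Everything here is an assumption made for illustration: `toy_movies` is invented data, and `"UNKNOWN"` is a hypothetical category that is not claimed to exist in the real columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for two categorical columns (invented values)
toy_movies <- data.frame(
  binary      = c("PASS", "FAIL", "PASS", NA),
  decade_code = c(1, 1, 2, 2)
)

# Explicitly missing: an NA is stored in the cell
explicit_missing <- sum(is.na(toy_movies$binary))

# Implicitly missing: a combination of categories that never appears as a row
combos <- toy_movies %>%
  filter(!is.na(binary)) %>%
  count(binary, decade_code) %>%
  complete(binary, decade_code, fill = list(n = 0L))
implicit_missing <- combos %>% filter(n == 0)

# Empty groups: a category that is allowed but has no observations
# ("UNKNOWN" is a hypothetical level added only for this example)
test_levels  <- factor(toy_movies$binary, levels = c("PASS", "FAIL", "UNKNOWN"))
level_counts <- table(test_levels)
empty_groups <- names(level_counts)[as.vector(level_counts) == 0]

explicit_missing
implicit_missing
empty_groups
```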
For at least one continuous column, what would you define as an outlier, and why?
For the budget, I’d define anything outside the range $0 to $200,000,000 (adjusted to 2013 dollars) as an outlier. I don’t know much about movie budgets myself, so I did some research, and the general consensus is that anything with a budget over $100 million is considered “big-budget”. Anything more than twice that would be really unusual, especially for a data set that ends in 2013, since movie costs and budgets have gone up over time. This gives us 51 movies out of 1,794, or about 3% of the data set, and I am satisfied with that number. On the low end, movies can be made with volunteers, friends, and borrowed equipment for a total cost of $0. I wouldn’t consider those to be outliers, just underrepresented in this data set.
count_high_budget <- sum(bechdel_data_movies$budget_2013 > 200000000, na.rm = TRUE)
count_high_budget
## [1] 51
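The "about 3%" figure can be computed directly instead of by hand. The sketch below uses an invented `budget_2013` vector (hypothetical values) to show the pattern, then repeats the arithmetic with the counts reported above.

```r
# Hypothetical budgets in 2013 dollars (invented values, one missing)
budget_2013 <- c(50e6, 250e6, NA, 120e6, 300e6)
threshold   <- 200e6

# Flag values above the cutoff; NA stays NA
is_outlier <- budget_2013 > threshold

# Count of outliers, ignoring missing budgets
count_high <- sum(is_outlier, na.rm = TRUE)

# Share of all rows versus share of rows with a known budget
share_all_rows <- count_high / length(budget_2013)
share_known    <- mean(is_outlier, na.rm = TRUE)

# The same arithmetic with the numbers from this notebook: 51 of 1794 rows
pct_reported <- round(100 * 51 / 1794, 1)
pct_reported
```

The two shares differ only if some budgets are missing, so it is worth stating which denominator a reported percentage uses.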