Homework Assignment

Your RMarkdown notebook for this data dive should contain the following:

  1. A list of at least 2 columns (or values) in your data which are unclear until you read the documentation.

    • E.g., this could be a column name, or just some value inside a cell of your data

      • binary
      • response
      • code
    • Why do you think they chose to encode the data the way they did? What could have happened if you didn’t read the documentation?

      • For binary and response, I imagine that whoever built the data set encoded the results this way to make calculations that rely on two categories easier.

        • If I hadn’t read the documentation, I wouldn’t have known where the test results were stored or how to calculate anything from the Bechdel test results.
      • For code, I think whoever built the data set wanted to combine the year and the binary outcome in a single column for calculations.

        • If I hadn’t read the documentation, I wouldn’t understand the “code” column. I might have even assumed it was the only results column.
  2. At least one element of your data that is unclear even after reading the documentation

    • You may need to do some digging, but is there anything about the data that your documentation does not explain?

      • The period_code, decade_code, and error columns
  3. Build at least two visualizations which use a column of data that is affected by the issue you brought up in bullet #2, above. In these visualizations, find a way to highlight the issue from different perspectives, explain what is unclear, and why it might be concerning.

    • You can use color or a text label, but also make sure to explain your thoughts using Markdown.

    #loading libraries and data into the file
    library(readr)
    library(dplyr)
    ## 
    ## Attaching package: 'dplyr'
    ## The following objects are masked from 'package:stats':
    ## 
    ##     filter, lag
    ## The following objects are masked from 'package:base':
    ## 
    ##     intersect, setdiff, setequal, union
    library(ggplot2)
    library(tidyr)
    
    bechdel_data_movies <- read_csv("C:/Users/Lauren/Documents/Stats Data/movies.csv")
    ## Rows: 1794 Columns: 34
    ## ── Column specification ────────────────────────────────────────────────────────
    ## Delimiter: ","
    ## chr (24): imdb, title, test, clean_test, binary, domgross, intgross, code, d...
    ## dbl  (7): year, budget, budget_2013, period_code, decade_code, metascore, im...
    ## num  (1): imdb_votes
    ## lgl  (2): response, error
    ## 
    ## ℹ Use `spec()` to retrieve the full column specification for this data.
    ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    unique_error <- unique(bechdel_data_movies$error)

    Since the error column appears to be empty, we’re going to pick another column!
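
    One way to back up the claim that a column is empty is to count its non-missing values instead of reading the output of `unique()`. The following is a minimal sketch on a small made-up data frame; `toy` and all of its values are illustrative stand-ins, not rows from movies.csv.

    ```r
    # Toy stand-in for the movies data; every value here is invented.
    toy <- data.frame(
      title = c("A", "B", "C"),
      error = c(NA, NA, NA),
      period_code = c(1, NA, 3)
    )

    # Number of non-missing entries per column; 0 means the column is entirely NA.
    non_missing <- colSums(!is.na(toy))
    non_missing

    # Names of the columns that carry no information at all.
    empty_cols <- names(non_missing)[non_missing == 0]
    empty_cols
    ```

    Running the same two steps on `bechdel_data_movies` would presumably confirm whether `error` really has no values.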

    #year vs period code
    unique_period_codes <- unique(bechdel_data_movies$period_code)
    unique_period_codes
    ## [1]  1  2  3  4  5 NA
    year_vs_period <- ggplot(bechdel_data_movies, aes(x=period_code,y=year)) + geom_count()
    year_vs_period
    ## Warning: Removed 179 rows containing non-finite outside the scale range
    ## (`stat_sum()`).

    period_code_bar <- ggplot(bechdel_data_movies, aes(x=period_code)) + geom_bar()
    period_code_bar
    ## Warning: Removed 179 rows containing non-finite outside the scale range
    ## (`stat_count()`).

    It looks like the period code only loosely tracks the year the movie came out; it is not a one-to-one mapping. On top of that, 179 rows have no period code at all and were dropped from both plots, as the warnings show.
    If I wanted to compare Bechdel test results against the period in which a movie was released or set, I would not get straightforward, reliable results.
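
    A per-code summary of release years is one way to make the relationship between period code and year explicit. The sketch below runs on an invented `toy` data frame, so the ranges it produces say nothing about the real movies.csv; pointing it at `bechdel_data_movies` is the assumed next step.

    ```r
    # Invented years and codes, for illustration only.
    toy <- data.frame(
      year = c(2012, 2011, 2008, 2006, 2001, 1999, 1990, 1985, 1975),
      period_code = c(1, 1, 2, 2, 3, 4, NA, 5, NA)
    )

    # Earliest and latest release year for each period_code.
    # aggregate() with a formula drops rows whose period_code is NA.
    year_range <- aggregate(year ~ period_code, data = toy,
                            FUN = function(y) c(min = min(y), max = max(y)))
    year_range

    # Rows with no period_code at all, and the years they cover.
    missing_code <- toy[is.na(toy$period_code), ]
    nrow(missing_code)
    range(missing_code$year)
    ```

    If the year ranges of two codes overlap, or if the missing codes cluster in particular years, that would be worth calling out next to the plots.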

    • Do you notice any significant risks? If so, what could you do to reduce negative consequences?

      • If I misunderstood period_code as the period the movie depicts, I might draw erroneous conclusions about the types of movies that pass or fail the Bechdel test. I could reduce that risk by sampling 10-20 rows and checking each movie’s plot to see whether the period code really refers to the period the movie portrays.
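
    The spot check described above could be drawn with a reproducible random sample. This is a sketch under assumptions: `toy` is an invented data frame, and `n_check` is a hypothetical name for the sample size.

    ```r
    # Invented stand-in for the movies data.
    toy <- data.frame(
      title = paste("Movie", 1:50),
      year = rep(1964:2013, length.out = 50),
      period_code = rep(c(1, 2, 3, 4, 5), each = 10)
    )

    set.seed(42)   # fixed seed so the same rows come back on every run
    n_check <- 15  # anywhere in the 10-20 range mentioned above

    # Draw n_check distinct rows and keep only the columns needed for the manual check.
    spot_check <- toy[sample(nrow(toy), n_check), c("title", "year", "period_code")]
    spot_check
    ```

    Each sampled title would then be looked up by hand to compare its setting and release year against its period code.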
  4. For at least two categorical columns, check for examples of the following, and describe what you find:

    • Are there explicitly missing rows?

      • Yes, there are movies with missing values in language and rated! I’m not sure what significance or insight this offers, but while viewing the data I saw Klingon listed as a language, and now I would ask which of the listed languages are not spoken (at all, or anymore) as a native or first language on Earth.
    • Are there implicitly missing rows?

      • Yes, the number of movies is not the totality of all movies ever, so some are missing. This is expected and not significant, but the insight it promotes is that there should be more movies and frankly, stage plays included in this data set.
    • Are there empty groups?

      • I’m honestly not sure! I can’t tell from reviewing my data, but there certainly could be! This is significant because empty groups are important in the data set, and if you can’t find them, you cannot account for them. I would love to know what an empty group for this data set would even look like.
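
    The three checks above can be made concrete with a cross-tabulation: explicitly missing values are `NA` cells, while an empty group is a combination of categories that could exist but has zero rows. The sketch below uses an invented `toy` data frame; the column names `rated` and `binary` follow the data set, but every value is made up.

    ```r
    # Invented ratings and test outcomes, for illustration only.
    toy <- data.frame(
      rated = c("PG", "R", NA, "PG-13", "R", "G"),
      binary = c("PASS", "FAIL", "FAIL", "PASS", "FAIL", "FAIL")
    )

    # Explicitly missing: cells that are NA in a categorical column.
    n_missing_rated <- sum(is.na(toy$rated))
    n_missing_rated

    # Cross-tabulate the two columns; table() leaves NA out by default.
    combos <- as.data.frame(table(rated = toy$rated, binary = toy$binary))

    # Empty groups: rating/outcome combinations with no movies at all.
    empty_groups <- combos[combos$Freq == 0, ]
    empty_groups
    ```

    In this toy example an empty group is something like "G-rated movies that pass"; whether the real data has such gaps would only show up after running the same table on `bechdel_data_movies`.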
  5. For at least one continuous column, what would you define as an outlier, and why?

    For the budget, I’d define anything outside the range $0 to $200,000,000 (adjusted to 2013 dollars) as an outlier. I don’t personally know enough about movies and their budgets, so I did some research, and the general consensus is that anything with a budget over $100 million is considered “big-budget”. Anything more than twice that would be really unusual, especially for a data set that ends in 2013, since movie costs and budgets have gone up over time. That gives 51 of the 1,794 movies, or about 3% of the data set, and I am satisfied with this number. On the low end, movies can be made with volunteers, friends, and borrowed equipment for a total cost of $0. I wouldn’t consider those to be outliers, just underrepresented in this data set.

count_high_budget <- sum(bechdel_data_movies$budget_2013 > 200000000, na.rm = TRUE)
count_high_budget
## [1] 51
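
The share of movies above the cutoff can be computed alongside the count. The sketch below uses an invented `budget_2013` vector (not values from movies.csv) and a hypothetical `threshold` variable, then reproduces the arithmetic behind the "about 3%" figure from the two reported numbers.

```r
# Invented budgets in 2013 dollars; one value is missing on purpose.
budget_2013 <- c(5e6, 4.2e7, 1.5e8, 2.5e8, NA, 9e7, 3.1e8, 2e8)
threshold <- 2e8

# Count and share of budgets strictly above the threshold.
# With na.rm = TRUE the share is taken over non-missing budgets only.
n_high <- sum(budget_2013 > threshold, na.rm = TRUE)
share_high <- mean(budget_2013 > threshold, na.rm = TRUE)
n_high
share_high

# Percentage implied by the reported figures: 51 movies out of 1794 rows.
reported_pct <- round(100 * 51 / 1794, 1)
reported_pct
```

Note that dividing by all 1794 rows and dividing by only the rows with a known budget give different percentages if any budgets are missing, so it is worth stating which denominator is meant.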