Week 5 Analysis: TMDB Movie Data

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)

Data-set Overview

The data set we’re working with contains information about 10,178 movies in 12 columns: names, original titles, original languages, genres, release dates, scores, budgets, revenue, status, country, crew, and an overview (description) of each movie.

data <- read_csv("C:/Users/chitt/OneDrive/Desktop/TMDB_R_project/TMDB.csv")

## Rows: 10178 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): Names, Orig_title, Orig_lang, Genre, Status, Country, Crew, Overview
## dbl  (3): Score, Budget, Revenue
## date (1): Release_Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
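
The message above suggests declaring the column types explicitly. A minimal sketch of that, assuming a small inline sample instead of the real TMDB.csv (the rows below are made up for illustration, and only five of the twelve columns are shown):

```r
library(readr)

# Hypothetical inline sample standing in for TMDB.csv; the rows are made up
sample_csv <- I(
  "Names,Release_Date,Score,Budget,Revenue\nMovie A,2023-03-02,73,75,272\nMovie B,2023-01-05,70,12.3,34.2\n"
)

# Declaring the types up front silences the message and fails loudly
# if a column does not parse as expected
movies_sample <- read_csv(
  sample_csv,
  col_types = cols(
    Names        = col_character(),
    Release_Date = col_date(),
    Score        = col_double(),
    Budget       = col_double(),
    Revenue      = col_double()
  )
)
```

For the real file, the same `cols()` specification would need entries for the remaining character columns as well.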

# View the first few rows of the data to inspect its structure
head(data)

## # A tibble: 6 × 12
##   Names      Orig_title Orig_lang Genre Release_Date Score Budget Revenue Status
##   <chr>      <chr>      <chr>     <chr> <date>       <dbl>  <dbl>   <dbl> <chr> 
## 1 Creed III  Creed III  English   Dram… 2023-03-02      73   75     272.  Relea…
## 2 Avatar: T… Avatar: T… English   Scie… 2022-12-15      78  460    2317.  Relea…
## 3 The Super… The Super… English   Anim… 2023-04-05      76  100     724.  Relea…
## 4 Mummies    Momias     Spanish,… Anim… 2023-01-05      70   12.3    34.2 Relea…
## 5 Supercell  Supercell  English   Acti… 2023-03-17      61   77     341.  Relea…
## 6 Cocaine B… Cocaine B… English   Thri… 2023-02-23      66   35      80   Relea…
## # ℹ 3 more variables: Country <chr>, Crew <chr>, Overview <chr>

summary_data <- data |> 
  select(1:10) |> 
  summary()
summary_data

##     Names            Orig_title         Orig_lang            Genre          
##  Length:10178       Length:10178       Length:10178       Length:10178      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   Release_Date            Score           Budget          Revenue       
##  Min.   :1903-05-15   Min.   :  0.0   Min.   :  0.00   Min.   :   0.00  
##  1st Qu.:2001-12-25   1st Qu.: 59.0   1st Qu.: 15.00   1st Qu.:  28.59  
##  Median :2013-05-09   Median : 65.0   Median : 50.00   Median : 152.93  
##  Mean   :2008-06-15   Mean   : 63.5   Mean   : 64.88   Mean   : 253.14  
##  3rd Qu.:2019-10-17   3rd Qu.: 71.0   3rd Qu.:105.00   3rd Qu.: 417.80  
##  Max.   :2023-12-31   Max.   :100.0   Max.   :460.00   Max.   :2923.71  
##     Status            Country         
##  Length:10178       Length:10178      
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##

Task 1:

Orig_lang: The original language in which the movie was produced, or the main language spoken in the movie.
Budget and Revenue: The movie’s budget and revenue. The currency (USD, EUR, AUD, INR, etc.) and the scale (thousands, millions, etc.) are not stated, so both need to be confirmed before comparing values.
Country: The column name is imprecise. A name such as Orig_Country would make clear whether it means the country of origin or the country of production, and spelling out the full country name would also help.
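
One way to probe these columns without documentation is to look at their distinct values and ranges. The sketch below uses a made-up sample (`movies_sample` and all of its values, including the two-letter country codes, are assumptions, not rows from the real file):

```r
library(dplyr)

# Hypothetical rows standing in for the TMDB data; all values are made up
movies_sample <- tibble(
  Orig_lang = c("English", "Spanish, Castilian", "English", "Korean"),
  Country   = c("AU", "US", "US", "KR"),
  Budget    = c(75, 12.3, 100, 40),
  Revenue   = c(272, 34.2, 724, 10)
)

# Distinct values show how languages and countries are encoded
lang_counts    <- count(movies_sample, Orig_lang, sort = TRUE)
country_counts <- count(movies_sample, Country, sort = TRUE)

# The range hints at the scale (for example millions) but not the currency
money_range <- summarise(
  movies_sample,
  across(c(Budget, Revenue), list(min = min, max = max))
)
```

The ranges can only suggest a plausible scale. The currency still has to come from the data source’s documentation.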

Task 2:

Even after understanding our columns, the Score column remains unclear.

The scale is not documented. The observed values run from 0 to 100, which is consistent with a score out of 100, but the data alone does not confirm it.
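
A quick check of the observed range and granularity can narrow down the possible scales. This is a sketch on a made-up vector (`scores` is hypothetical), not on the real column:

```r
# Hypothetical scores; the values are made up for illustration
scores <- c(0, 59, 65, 71, 100, 73)

# Observed minimum and maximum
score_range <- range(scores, na.rm = TRUE)

# TRUE if every score is a whole number, which would suggest an integer scale
all_whole <- all(scores == round(scores))
```

A range of 0 to 100 with only whole numbers would fit a percentage-style score, although it would also fit a 0-10 rating multiplied by 10.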

Task 3: Visualizing the Ambiguity

To better understand the Score column, let’s create a histogram. A histogram will show us how scores are distributed among movies.

ggplot(data, aes(x = Score)) +
  geom_histogram(fill = "skyblue", binwidth = 1) +
  labs(title = "Distribution of Movie Scores", x = "Score", y = "Number of Movies") +
  theme_minimal()

This histogram shows the distribution of movie scores.

These are some risks I noticed while analyzing the Score column:

The source of the score is unknown. Is it from critics, audience ratings, or a combination of both?
We don’t know exactly how the score was calculated.
Are there other potential biases in the scoring system?
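
One possible bias worth checking is whether a score of 0 is a real rating or a placeholder for movies that have not been rated yet. The sketch below assumes that interpretation and uses made-up rows (the Status values and scores are hypothetical):

```r
library(dplyr)

# Hypothetical rows; Status values and scores are made up for illustration
movies_sample <- tibble(
  Status = c("Released", "Released", "Post Production", "Released", "Post Production"),
  Score  = c(73, 0, 0, 66, 0)
)

# Count and share of zero scores within each status
zero_by_status <- movies_sample |>
  group_by(Status) |>
  summarise(
    n_movies   = n(),
    n_zero     = sum(Score == 0),
    share_zero = mean(Score == 0)
  )
```

If zeros cluster in unreleased movies in the real data, they are more likely missing values than genuine ratings, and including them would pull the mean score down.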