library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
The data set we’re working with contains information about roughly 10,000 movies (10,178 rows and 12 columns): their names, original titles, original languages, genres, release dates, scores, budgets, revenue, status, country, crew, and an overview describing each movie.
data <- read_csv("C:/Users/chitt/OneDrive/Desktop/TMDB_R_project/TMDB.csv")
## Rows: 10178 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Names, Orig_title, Orig_lang, Genre, Status, Country, Crew, Overview
## dbl (3): Score, Budget, Revenue
## date (1): Release_Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
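The message above suggests declaring the column types explicitly. A minimal sketch of how that could look, assuming the column names and types reported in the specification. The two-row inline sample and its values are hypothetical and only stand in for the real file:

```r
library(readr)

# Hypothetical two-row sample standing in for TMDB.csv; only the column
# names and types come from the column specification printed above.
sample_csv <- I(
"Names,Orig_title,Orig_lang,Genre,Release_Date,Score,Budget,Revenue,Status,Country,Crew,Overview
Movie A,Movie A,English,Drama,2023-03-02,73,75,272,Released,AU,Someone,Some text
Movie B,Pelicula B,Spanish,Animation,2023-01-05,70,12.3,34.2,Released,ES,Someone,Some text
")

# Declare the three numeric columns and the date column; everything else
# is read as character, matching the guessed specification.
movie_types <- cols(
  .default     = col_character(),
  Release_Date = col_date(),
  Score        = col_double(),
  Budget       = col_double(),
  Revenue      = col_double()
)

movies <- read_csv(sample_csv, col_types = movie_types)
movies
```

Passing `col_types` also silences the column specification message, and it makes the import fail loudly if a column stops matching the expected type.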
# View the first few rows of the data to inspect its structure
head(data)
## # A tibble: 6 × 12
## Names Orig_title Orig_lang Genre Release_Date Score Budget Revenue Status
## <chr> <chr> <chr> <chr> <date> <dbl> <dbl> <dbl> <chr>
## 1 Creed III Creed III English Dram… 2023-03-02 73 75 272. Relea…
## 2 Avatar: T… Avatar: T… English Scie… 2022-12-15 78 460 2317. Relea…
## 3 The Super… The Super… English Anim… 2023-04-05 76 100 724. Relea…
## 4 Mummies Momias Spanish,… Anim… 2023-01-05 70 12.3 34.2 Relea…
## 5 Supercell Supercell English Acti… 2023-03-17 61 77 341. Relea…
## 6 Cocaine B… Cocaine B… English Thri… 2023-02-23 66 35 80 Relea…
## # ℹ 3 more variables: Country <chr>, Crew <chr>, Overview <chr>
# Summarise the first ten columns (everything except Crew and Overview)
summary_data <- data |>
  select(1:10) |>
  summary()
summary_data
## Names Orig_title Orig_lang Genre
## Length:10178 Length:10178 Length:10178 Length:10178
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Release_Date Score Budget Revenue
## Min. :1903-05-15 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.:2001-12-25 1st Qu.: 59.0 1st Qu.: 15.00 1st Qu.: 28.59
## Median :2013-05-09 Median : 65.0 Median : 50.00 Median : 152.93
## Mean :2008-06-15 Mean : 63.5 Mean : 64.88 Mean : 253.14
## 3rd Qu.:2019-10-17 3rd Qu.: 71.0 3rd Qu.:105.00 3rd Qu.: 417.80
## Max. :2023-12-31 Max. :100.0 Max. :460.00 Max. :2923.71
## Status Country
## Length:10178 Length:10178
## Class :character Class :character
## Mode :character Mode :character
##
##
##
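For character columns, `summary()` reports only length, class, and mode, so a more informative overview may be to summarise the numeric columns on their own and count distinct values in the character columns. A small sketch, assuming the same column names. The three-row `movies` tibble is a hypothetical stand-in for the real data:

```r
library(dplyr)

# Hypothetical three-row sample; only the column names mirror the data set.
movies <- tibble(
  Names     = c("Movie A", "Movie B", "Movie C"),
  Orig_lang = c("English", "English", "Spanish"),
  Country   = c("AU", "US", "ES"),
  Score     = c(73, 70, 61),
  Budget    = c(75, 12.3, 77),
  Revenue   = c(272, 34.2, 341)
)

# Min, quartiles, mean, and max for the numeric columns only.
numeric_summary <- movies |>
  select(where(is.numeric)) |>
  summary()

# Number of distinct values in each character column.
distinct_counts <- movies |>
  summarise(across(where(is.character), n_distinct))

numeric_summary
distinct_counts
```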
Orig_lang: The original language in which the movie was produced, or the main language spoken in the movie.
Budget and Revenue: These represent the movie’s budget and revenue, but the data does not state the currency (USD, EUR, AUD, INR, etc.) or the scale (thousands, millions, etc.), and we need both to interpret the values.
Country: The column name should be more precise, for example Orig_Country, or it should state that this is where the movie was produced. Full country names would also be easier to read.
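One way to probe these open questions is to tabulate the values in Country and to list the highest-revenue titles, whose figures can then be compared by hand against publicly reported numbers to guess the currency and scale. The sketch below assumes the same column names. The `movies` tibble, its titles, and its country codes are hypothetical:

```r
library(dplyr)

# Hypothetical sample; titles and country codes are placeholders.
movies <- tibble(
  Names   = c("Movie A", "Movie B", "Movie C", "Movie D"),
  Country = c("AU", "US", "US", "ES"),
  Budget  = c(75, 460, 12.3, 77),
  Revenue = c(272, 2317, 34.2, 341)
)

# How often each Country value occurs, most frequent first.
country_counts <- movies |>
  count(Country, sort = TRUE)

# The two highest-revenue movies, for a manual cross-check of the units.
top_revenue <- movies |>
  slice_max(Revenue, n = 2) |>
  select(Names, Budget, Revenue)

country_counts
top_revenue
```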
Even with the other columns described, the meaning of the Score column is still unclear.
To better understand the Score column, let’s create a histogram. A histogram will show us how scores are distributed among movies.
# One bar per score point (binwidth = 1)
ggplot(data, aes(x = Score)) +
  geom_histogram(fill = "skyblue", binwidth = 1) +
  labs(title = "Distribution of Movie Scores", x = "Score", y = "Number of Movies") +
  theme_minimal()
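This histogram shows the distribution of movie scores. According to the summary above, scores range from 0 to 100, with a median of 65 and a mean of 63.5.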
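Since the summary reports a minimum Score of 0, it may be worth checking how many movies sit at exactly 0 and when they were released. A score of 0 could mean that no rating was recorded rather than a genuinely poor rating, but that is an assumption the data does not confirm. A minimal sketch on a hypothetical sample with the same column names:

```r
library(dplyr)

# Hypothetical sample; only the column names mirror the data set.
movies <- tibble(
  Names        = c("Movie A", "Movie B", "Movie C", "Movie D", "Movie E"),
  Release_Date = as.Date(c("2023-03-02", "2023-12-31", "2013-05-09",
                           "2023-11-15", "2001-12-25")),
  Score        = c(73, 0, 65, 0, 100)
)

# Movies whose score is exactly 0, with their release dates.
zero_scores <- movies |>
  filter(Score == 0) |>
  select(Names, Release_Date, Score)

# Frequency of each score value, most common first.
score_counts <- movies |>
  count(Score, sort = TRUE)

zero_scores
score_counts
```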
These are some risks I noticed while analyzing the Score column.