TMDB

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r}
library(tidyverse)
```

For this project I have chosen to investigate the TMDb movie data set from Kaggle: https://www.kaggle.com/datasets/nagrajdesai/latest-10000-movies-dataset-from-tmdb/code.

This data set contains information about roughly 10,000 movies (10,178 rows in the version summarized in this document) collected from The Movie Database (TMDb), including user ratings and revenue.

Project Goals:

- Learn R.
- Learn how to investigate problems in a data set and wrangle the data into a format that can be used.
- Draw insights from the movie data set.

Importing the Data Set:

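```{r}
# IMDB_Movies is assumed to be loaded in the session already
# (for example via RStudio's Import Dataset); printing it shows the first rows.
IMDB_Movies
```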
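The chunk above only prints `IMDB_Movies`; the import step itself is not shown. A minimal sketch of one way to load it, assuming the Kaggle CSV was saved next to the notebook (the file name `IMDB_Movies.csv` is hypothetical and not taken from the source):

```r
library(readr)

# Hypothetical file name; replace it with the path of the CSV downloaded from Kaggle.
csv_path <- "IMDB_Movies.csv"

if (file.exists(csv_path)) {
  # Read the CSV into a tibble without printing the column specification
  IMDB_Movies <- read_csv(csv_path, show_col_types = FALSE)
  print(dim(IMDB_Movies))  # rows and columns actually read
} else {
  message("File not found: ", csv_path)
}
```

The `file.exists()` guard is only there so the sketch does not stop with an error when the path is wrong.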



Summary of Data Set:

```{r}
summary(IMDB_Movies)
```

```
Names              Date_x             Score           Genre
Length:10178       Length:10178       Min.   :  0.0   Length:10178
Class :character   Class :character   1st Qu.: 59.0   Class :character
Mode  :character   Mode  :character   Median : 65.0   Mode  :character
                                      Mean   : 63.5
                                      3rd Qu.: 71.0
                                      Max.   :100.0

Overview           Crew               Orig_title         Status
Length:10178       Length:10178       Length:10178       Length:10178
Class :character   Class :character   Class :character   Class :character
Mode  :character   Mode  :character   Mode  :character   Mode  :character

Orig_lang          Budget_x            Revenue             Country
Length:10178       Min.   :        1   Min.   :0.000e+00   Length:10178
Class :character   1st Qu.: 15000000   1st Qu.:2.859e+07   Class :character
Mode  :character   Median : 50000000   Median :1.529e+08   Mode  :character
                   Mean   : 64882379   Mean   :2.531e+08
                   3rd Qu.:105000000   3rd Qu.:4.178e+08
                   Max.   :460000000   Max.   :2.924e+09
```

Data documentation for the variables:

Names:
- Length: 10178
- Class: Character
- Description: This variable contains the names of movies. It is of character data type, and there are 10,178 observations in this variable.
Date_x:
- Length: 10178
- Class: Character
- Description: This variable contains dates related to the movies. It is of character data type, and there are 10,178 observations in this variable.
Score:
- Minimum Value: 0.0
- 1st Quartile: 59.0
- Median: 65.0
- Mean: 63.5
- 3rd Quartile: 71.0
- Maximum Value: 100.0
- Description: This variable holds the score of each movie. It is numeric and ranges from 0.0 to 100.0, with a median of 65.0 and a mean of 63.5. Half of the movies score between 59.0 and 71.0.
Genre:
- Length: 10178
- Class: Character
- Description: This variable contains the genres of the movies. It is of character data type, and there are 10,178 observations in this variable.
Overview:
- Length: 10178
- Class: Character
- Mode: Character
- Description: This variable contains overviews or descriptions of the movies. It is of character data type, and there are 10,178 observations in this variable. "Mode" here is R's storage mode, not the statistical mode: every value is stored as text.
Crew:
- Length: 10178
- Class: Character
- Mode: Character
- Description: This variable contains information about the crew involved in making the movies. It is of character data type, and there are 10,178 observations in this variable.
Orig_title:
- Length: 10178
- Class: Character
- Mode: Character
- Description: This variable contains the original titles of the movies. It is of character data type, and there are 10,178 observations in this variable.
Status:
- Length: 10178
- Class: Character
- Mode: Character
- Description: This variable contains the status or state of the movies. It is of character data type, and there are 10,178 observations in this variable.
Orig_lang:
- Length: 10178
- Class: Character
- Description: This variable represents the original languages of the movies. It is of character data type, and there are 10,178 observations in this variable.
Budget_x:
- Minimum Value: 1
- 1st Quartile: 15,000,000
- Median: 50,000,000
- Mean: 64,882,379
- 3rd Quartile: 105,000,000
- Maximum Value: 460,000,000
- Description: This variable contains budget information for the movies. It is numeric. The minimum of 1 is implausible as a real budget and probably marks a placeholder or a data-entry error.
Revenue:
- Minimum Value: 0.0
- 1st Quartile: 2.859e+07 (approximately 28,590,000)
- Median: 1.529e+08 (approximately 152,900,000)
- Mean: 2.531e+08 (approximately 253,100,000)
- 3rd Quartile: 4.178e+08 (approximately 417,800,000)
- Maximum Value: 2.924e+09 (approximately 2,924,000,000)
- Description: This variable represents the revenue generated by the movies. It is numeric, with a minimum of 0.0 and a median of approximately 152,900,000.
Country:
- Length: 10178
- Class: Character
- Description: This variable contains information about the countries associated with the movies. It is of character data type, and there are 10,178 observations in this variable.

These descriptions summarize the data type, length, and summary statistics of each variable in the data set.
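The same facts can be checked in code rather than read off the printed summary. The sketch below uses a made-up four-row stand-in for `IMDB_Movies` (the rows are illustrative, not real data) to list each column's class and to count values that look like placeholders, such as a score or revenue of 0 or a budget of 1:

```r
library(dplyr)

# Toy stand-in for IMDB_Movies; the values are invented for illustration only.
toy_movies <- tibble(
  Names    = c("A", "B", "C", "D"),
  Score    = c(0, 59, 65, 100),
  Budget_x = c(1, 15000000, 50000000, 460000000),
  Revenue  = c(0, 28590000, 152900000, 2924000000)
)

# Class of every column, the same information as the "Class" rows of summary()
col_classes <- sapply(toy_movies, class)

# Count rows whose values look like placeholders rather than real measurements
suspicious <- toy_movies |>
  summarise(
    zero_score   = sum(Score == 0),
    zero_revenue = sum(Revenue == 0),
    tiny_budget  = sum(Budget_x <= 1)
  )
```

Running the same two steps on the real data frame would show how many rows might need cleaning before any plots are interpreted.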

Plotting Budget_x vs Score

```{r}
ggplot(data = IMDB_Movies) +
  geom_point(mapping = aes(x = Budget_x, y = Score))
```
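Because Budget_x runs from 1 to 460,000,000, most points in a plain scatter plot are squeezed against one side. A possible variation, sketched here on a made-up five-row stand-in for `IMDB_Movies`, puts the budget on a log scale and makes the points semi-transparent. Both choices are suggestions, not part of the original analysis:

```r
library(ggplot2)

# Toy stand-in for IMDB_Movies; the values are invented for illustration only.
toy_movies <- data.frame(
  Budget_x = c(1, 15e6, 50e6, 105e6, 460e6),
  Score    = c(0, 59, 65, 71, 100)
)

budget_score_plot <- ggplot(toy_movies, aes(x = Budget_x, y = Score)) +
  geom_point(alpha = 0.3) +  # transparency reduces overplotting with many points
  scale_x_log10() +          # log scale spreads out the wide range of budgets
  labs(x = "Budget (log scale)", y = "Score")

budget_score_plot
```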




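Plotting the Top Ten Movies by Budget_x for Each Orig_lang

```{r}
# Sort by budget (highest first), then keep the first ten rows within each language.
top_movies <- IMDB_Movies |>
  arrange(desc(Budget_x)) |>
  group_by(Orig_lang) |>
  slice(1:10)
```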

```{r}
# Each bar stacks the budgets of the selected movies for one language.
ggplot(data = top_movies, aes(x = Orig_lang, y = Budget_x)) +
  geom_bar(stat = "identity") +
  labs(title = "Top Ten Movies by Original Language vs. Budget", x = "Original Language", y = "Budget") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
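The per-language selection can also be written with `slice_max()`, which does the sorting and slicing in one step. The sketch below uses a made-up five-row data frame and keeps two movies per language instead of ten, so the result is easy to check by eye:

```r
library(dplyr)

# Toy stand-in for IMDB_Movies; the values are invented for illustration only.
toy_movies <- tibble(
  Names     = c("A", "B", "C", "D", "E"),
  Orig_lang = c("English", "English", "English", "French", "French"),
  Budget_x  = c(100, 300, 200, 50, 80)
)

# Keep the two highest-budget movies within each language
top_toy <- toy_movies |>
  group_by(Orig_lang) |>
  slice_max(Budget_x, n = 2, with_ties = FALSE) |>
  ungroup()

top_toy$Names
# "B" "C" "E" "D"
```

Setting `with_ties = FALSE` caps each group at exactly `n` rows, which matches what `slice(1:10)` does after sorting.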




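```{r}
# The data set has no Rating or Year column: Score holds the user rating, and
# Year is derived here on the assumption that Date_x contains a four-digit year.
movies <- IMDB_Movies |>
  mutate(Year = as.integer(str_extract(Date_x, "\\d{4}")))

# Average score per genre, highest first
movies |>
  group_by(Genre) |>
  summarise(Avg_Rating = mean(Score, na.rm = TRUE)) |>
  arrange(desc(Avg_Rating))

# Distribution of scores (Score runs from 0 to 100, so bins are 5 points wide)
p <- ggplot(movies, aes(x = Score)) +
  geom_histogram(binwidth = 5, fill = "blue", alpha = 0.7) +
  labs(title = "Distribution of Ratings")
p

# Score distribution within each genre
ggplot(movies, aes(x = Genre, y = Score)) +
  geom_boxplot(fill = "purple", alpha = 0.7) +
  labs(title = "Rating Distribution by Genre")

# Score against year, colored by genre
ggplot(movies, aes(x = Year, y = Score, color = Genre)) +
  geom_point() +
  labs(title = "Rating Trends Over the Years by Genre")
```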
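The last plot needs a year on the x axis, while the summary lists only Date_x, stored as character. A minimal sketch of deriving a year, assuming each date string contains a four-digit year (the actual Date_x layout is not shown in this document, so the strings below are made up):

```r
library(stringr)

# Made-up date strings in two common layouts, plus one value with no year.
toy_dates <- c("03/02/2023", "2019-11-15", "unknown")

# Take the first run of four digits and convert it to an integer year;
# strings without such a run become NA.
toy_years <- as.integer(str_extract(toy_dates, "\\d{4}"))

toy_years
# 2023 2019 NA
```

If the real column has one consistent layout, parsing it with a date function would be stricter than this pattern match.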

2023-09-04