Load necessary libraries

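```{r}
# tidyverse already attaches ggplot2, dplyr, and tidyr; they are listed
# explicitly to show which packages the analysis relies on
library(tidyverse)
library(ggplot2)
library(dplyr)
library(tidyr)
library(knitr)
```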


I have chosen the TMDb movie data set from Kaggle for the investigation in this project: <https://www.kaggle.com/datasets/nagrajdesai/latest-10000-movies-dataset-from-tmdb/code>.

This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.

## Project Goals:

1.  Learn R.

2.  Learn how to investigate problems in a data set and wrangle the data into a usable format.

3.  Draw insights from the movie data set.

## Importing the Data Set:

```{r}
# Load the dataset
IMDB_Movies <- read.csv("IMDB_Movies.csv")

# Check the structure of the dataset
str(IMDB_Movies)

# Summary statistics
summary(IMDB_Movies)
```

## Data documentation for the variables:

Names: character, length 10,178. The names of the movies.

Date_x: character, length 10,178. Dates related to the movies, stored as text rather than as a date type.

Score: numeric, minimum 0.0, 1st quartile 59.0. The scores associated with the movies.

Genre: character, length 10,178. The genres of the movies.

Overview: character, length 10,178. Overviews or descriptions of the movies.

Crew: character, length 10,178. Information about the crew involved in making the movies.

Orig_title: character, length 10,178. The original titles of the movies.

Status: character, length 10,178. The status or state of the movies.

Orig_lang: character, length 10,178. The original languages of the movies.

Budget_x: numeric, minimum 1, 1st quartile 15,000,000. Budget information for the movies.

Revenue: numeric, minimum 0.0, 1st quartile 2.859e+07 (approximately 28,590,000). The revenue generated by the movies.

Country: character, length 10,178. The countries associated with the movies.

For the character variables, `summary()` reports a mode of character, which means every value in the column is stored as text. Each character variable holds 10,178 observations. Together, these descriptions give an overview of the data types, lengths, and summary statistics of each variable in the data set.
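The per-variable notes above can also be produced programmatically instead of being copied by hand from the `str()` and `summary()` output. The following is a minimal sketch in base R. The helper name `describe_columns` and the toy data frame `toy_movies` are illustrative assumptions and are not part of the original analysis.

```r
# Hypothetical helper: one row per column with its class, its length, and
# (for numeric columns only) the minimum and the first quartile.
describe_columns <- function(df) {
  numeric_stat <- function(col, stat) {
    if (is.numeric(col)) as.numeric(stat(col)) else NA_real_
  }
  data.frame(
    variable = names(df),
    class    = vapply(df, function(col) class(col)[1], character(1)),
    length   = vapply(df, length, integer(1)),
    minimum  = vapply(df, numeric_stat, numeric(1),
                      stat = function(x) min(x, na.rm = TRUE)),
    q1       = vapply(df, numeric_stat, numeric(1),
                      stat = function(x) quantile(x, 0.25, na.rm = TRUE)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy stand-in for IMDB_Movies (made-up values, for illustration only)
toy_movies <- data.frame(
  Names = c("A", "B", "C", "D"),
  Score = c(0, 59, 70, 80),
  stringsAsFactors = FALSE
)

overview <- describe_columns(toy_movies)
print(overview)
```

Calling `describe_columns(IMDB_Movies)` would give the same kind of table for the real data, and `knitr::kable()` could render it in the knitted report.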

## Data Exploration and Visualization

```{r}
# Create a bar plot to visualize the distribution of genres
genre_counts <- IMDB_Movies %>%
  separate_rows(Genre, sep = "\\|") %>%
  group_by(Genre) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count))

# Plot the genre distribution
ggplot(genre_counts, aes(x = reorder(Genre, -Count), y = Count)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Genre", y = "Count", title = "Genre Distribution")
```
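The genre counts depend on splitting multi-genre movies into one row per genre before counting, so a movie with two genres is counted once under each. The sketch below shows that step on made-up data. The tibble `toy_movies` is hypothetical, and the `|` separator is assumed to match the real file, which may use a different delimiter.

```r
library(dplyr)
library(tidyr)

# Toy data (made up): two movies, the first with two genres
toy_movies <- tibble(
  Names = c("Movie A", "Movie B"),
  Genre = c("Action|Drama", "Drama")
)

# "\\|" escapes the pipe, which is otherwise the regex alternation operator
toy_counts <- toy_movies %>%
  separate_rows(Genre, sep = "\\|") %>%
  count(Genre, name = "Count", sort = TRUE)

print(toy_counts)
```

Because of the split, the counts add up to the number of movie-genre pairs (three here), not the number of movies (two).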


## Budget vs. Revenue Scatterplot

```{r}
# Create a scatterplot of Budget vs. Revenue
ggplot(IMDB_Movies, aes(x = Budget_x, y = Revenue)) +
  geom_point() +
  labs(x = "Budget", y = "Revenue", title = "Budget vs. Revenue Scatterplot")
```
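The summary statistics show budgets ranging from 1 up to tens of millions, so on linear axes most points are likely to crowd near the origin. One option is to draw the same scatterplot on log scales, as in the hedged sketch below. The data frame `toy_movies` is made up, and dropping non-positive values before taking logs is an assumption, not a step from the original analysis.

```r
library(ggplot2)

# Toy stand-in for IMDB_Movies (made-up values, for illustration only)
toy_movies <- data.frame(
  Budget_x = c(1, 5e4, 2e6, 1.5e7, 2e8),
  Revenue  = c(0, 1e5, 8e6, 2.859e7, 9e8)
)

# Keep strictly positive values only: log10(0) is -Inf and cannot be drawn
plot_data <- subset(toy_movies, Budget_x > 0 & Revenue > 0)

p <- ggplot(plot_data, aes(x = Budget_x, y = Revenue)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Budget (log scale)", y = "Revenue (log scale)",
       title = "Budget vs. Revenue on Log Scales")

print(p)
```

Filtering changes which movies appear in the plot, so the number of dropped rows should be reported alongside it.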

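```{r}
# movie_data_genre is only created in a later chunk, so build the
# one-row-per-genre data here to keep this chunk runnable on its own
movie_data_genre <- IMDB_Movies %>%
  separate_rows(Genre, sep = "\\|")

# Box plot to visualize revenue distribution by genre
ggplot(movie_data_genre, aes(x = Genre, y = Revenue)) +
  geom_boxplot() +
  labs(x = "Genre", y = "Revenue", title = "Revenue Distribution by Genre") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```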


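```{r}
# Summarise per country first: with one row per movie, stat = "identity"
# would stack one bar segment per movie instead of showing the average
country_budget <- IMDB_Movies %>%
  group_by(Country) %>%
  summarise(Average_Budget = mean(Budget_x, na.rm = TRUE))

# Bar plot to visualize average budget by country
ggplot(country_budget, aes(x = reorder(Country, -Average_Budget), y = Average_Budget)) +
  geom_bar(stat = "identity") +
  labs(x = "Country", y = "Average Budget", title = "Average Budget by Country") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```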

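```{r}
# movie_data_language is only created in a later chunk, so attach the
# per-language average score to every movie here
movie_data_language <- IMDB_Movies %>%
  group_by(Orig_lang) %>%
  mutate(Average_Score = mean(Score, na.rm = TRUE)) %>%
  ungroup()

# Scatter plot to visualize the relationship between score and average score by original language
ggplot(movie_data_language, aes(x = Score, y = Average_Score)) +
  geom_point() +
  labs(x = "Score", y = "Average Score", title = "Score vs. Average Score by Language")
```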


```{r}
# Average revenue per genre (a movie with several genres counts in each)
genre_summary <- IMDB_Movies %>%
  separate_rows(Genre, sep = "\\|") %>%
  group_by(Genre) %>%
  summarise(Average_Revenue = mean(Revenue, na.rm = TRUE))

# Calculate expected probability (each genre's share of the summed averages)
# and identify anomalies. A test of the form x < min(x) can never be TRUE, so
# the cutoff used here is the lower Tukey fence (Q1 - 1.5 * IQR); this cutoff
# is an assumption.
genre_summary <- genre_summary %>%
  mutate(Expected_Probability = Average_Revenue / sum(Average_Revenue, na.rm = TRUE),
         Anomaly = ifelse(Expected_Probability <
                            quantile(Expected_Probability, 0.25, na.rm = TRUE) -
                            1.5 * IQR(Expected_Probability, na.rm = TRUE),
                          "Anomaly", "Normal"))

# Merge the anomaly information back to the original data
movie_data_genre <- IMDB_Movies %>%
  separate_rows(Genre, sep = "\\|") %>%
  left_join(genre_summary, by = "Genre")

# Visualize the results (e.g., bar plot of anomalies)
ggplot(movie_data_genre, aes(x = Anomaly)) +
  geom_bar() +
  labs(x = "Anomaly Status", y = "Count", title = "Genre Anomaly Analysis")
```
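An anomaly rule based on group shares needs a cutoff that some values can actually fall below. A comparison of the form `x < min(x)` can never be `TRUE`, so it labels every group "Normal". The sketch below illustrates this on made-up numbers and shows one possible cutoff, the lower Tukey fence (Q1 - 1.5 * IQR). The fence is an assumption, not a rule taken from the data set, and the genre labels `A` to `E` are hypothetical.

```r
# Made-up average revenues for five hypothetical genres
avg_revenue <- c(A = 400, B = 420, C = 450, D = 430, E = 20)

# Share of the summed averages
share <- avg_revenue / sum(avg_revenue)

# Nothing can be smaller than the minimum, so this flags no genre at all
none_flagged <- !any(share < min(share))
print(none_flagged)

# Assumed alternative: flag shares below the lower Tukey fence
lower_fence <- unname(quantile(share, 0.25)) - 1.5 * IQR(share)
flag <- ifelse(share < lower_fence, "Anomaly", "Normal")
print(flag)
```

With these numbers only genre `E` falls below the fence. Other cutoffs, such as a fixed percentile, would be equally defensible, and the choice should be stated in the report.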

```{r}
# Average budget per country
country_summary <- IMDB_Movies %>%
  group_by(Country) %>%
  summarise(Average_Budget = mean(Budget_x, na.rm = TRUE))

# Calculate expected probability (each country's share of the summed averages)
# and identify anomalies. A test of the form x < min(x) can never be TRUE, so
# the cutoff used here is the lower Tukey fence (Q1 - 1.5 * IQR); this cutoff
# is an assumption.
country_summary <- country_summary %>%
  mutate(Expected_Probability = Average_Budget / sum(Average_Budget, na.rm = TRUE),
         Anomaly = ifelse(Expected_Probability <
                            quantile(Expected_Probability, 0.25, na.rm = TRUE) -
                            1.5 * IQR(Expected_Probability, na.rm = TRUE),
                          "Anomaly", "Normal"))

# Merge the anomaly information back to the original data
movie_data_country <- IMDB_Movies %>%
  left_join(country_summary, by = "Country")

# Visualize the results (e.g., bar plot of anomalies)
ggplot(movie_data_country, aes(x = Anomaly)) +
  geom_bar() +
  labs(x = "Anomaly Status", y = "Count", title = "Country Anomaly Analysis")
```


```{r}
# Average score per original language
language_summary <- IMDB_Movies %>%
  group_by(Orig_lang) %>%
  summarise(Average_Score = mean(Score, na.rm = TRUE))

# Calculate expected probability (each language's share of the summed averages)
# and identify anomalies. A test of the form x < min(x) can never be TRUE, so
# the cutoff used here is the lower Tukey fence (Q1 - 1.5 * IQR); this cutoff
# is an assumption.
language_summary <- language_summary %>%
  mutate(Expected_Probability = Average_Score / sum(Average_Score, na.rm = TRUE),
         Anomaly = ifelse(Expected_Probability <
                            quantile(Expected_Probability, 0.25, na.rm = TRUE) -
                            1.5 * IQR(Expected_Probability, na.rm = TRUE),
                          "Anomaly", "Normal"))

# Merge the anomaly information back to the original data
movie_data_language <- IMDB_Movies %>%
  left_join(language_summary, by = "Orig_lang")

# Visualize the results (e.g., bar plot of anomalies)
ggplot(movie_data_language, aes(x = Anomaly)) +
  geom_bar() +
  labs(x = "Anomaly Status", y = "Count", title = "Language Anomaly Analysis")
```
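The genre, country, and language analyses repeat the same three steps: average a value per group, convert the averages to shares of their sum, and flag low shares. One way to avoid the repetition is a small helper, sketched below with dplyr. The function name `summarise_share`, the output columns `Average` and `Share`, the lower Tukey fence cutoff, and the toy data are all assumptions made for illustration.

```r
library(dplyr)

# Hypothetical helper: average `value` per `group`, convert the averages to
# shares of their sum, and flag shares below the lower Tukey fence.
summarise_share <- function(data, group, value) {
  data %>%
    group_by({{ group }}) %>%
    summarise(Average = mean({{ value }}, na.rm = TRUE), .groups = "drop") %>%
    mutate(
      Share   = Average / sum(Average, na.rm = TRUE),
      Anomaly = ifelse(
        Share < quantile(Share, 0.25, na.rm = TRUE) -
          1.5 * IQR(Share, na.rm = TRUE),
        "Anomaly", "Normal"
      )
    )
}

# Toy stand-in for IMDB_Movies (made-up values, for illustration only)
toy_movies <- tibble(
  Orig_lang = c("en", "en", "fr", "fr", "ja", "ja", "es", "es", "xx"),
  Score     = c(70, 72, 68, 70, 71, 73, 69, 71, 5)
)

language_summary <- summarise_share(toy_movies, Orig_lang, Score)
print(language_summary)
```

The same call pattern would cover the other two analyses, for example `summarise_share(IMDB_Movies, Country, Budget_x)`. The genre case still needs `separate_rows()` applied first.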