---
title: "LohitDataDive"
output: html_document
---

# Group by

grouped_genre <- data %>%
  group_by(genre) %>%
  summarize(mean_score = mean(score, na.rm = TRUE))
print(grouped_genre)

## # A tibble: 2,304 × 2
##    genre                                                           mean_score
##    <chr>                                                                <dbl>
##  1 ""                                                                    27.9
##  2 "Action"                                                              59.8
##  3 "Action, Adventure"                                                   62.2
##  4 "Action, Adventure, Animation"                                        62.3
##  5 "Action, Adventure, Animation, Comedy"                                69  
##  6 "Action, Adventure, Animation, Comedy, Family"                        68  
##  7 "Action, Adventure, Animation, Comedy, Family, Science Fiction"       68  
##  8 "Action, Adventure, Animation, Comedy, Romance, Family"               58  
##  9 "Action, Adventure, Animation, Crime, Mystery"                        65  
## 10 "Action, Adventure, Animation, Drama"                                 78.5
## # ℹ 2,294 more rows

grouped_status <- data %>%
  group_by(status) %>%
  summarize(mean_revenue = mean(revenue, na.rm = TRUE))
print(grouped_status)

## # A tibble: 3 × 2
##   status             mean_revenue
##   <chr>                     <dbl>
## 1 " In Production"     154868097.
## 2 " Post Production"   119669447.
## 3 " Released"          253703704.

grouped_orig_lang <- data %>%
  group_by(orig_lang) %>%
  summarize(mean_budget = mean(budget_x, na.rm = TRUE))
print(grouped_orig_lang)

## # A tibble: 54 × 2
##    orig_lang                              mean_budget
##    <chr>                                        <dbl>
##  1 " Arabic"                                55571750 
##  2 " Basque"                               115600000 
##  3 " Bengali"                              180740000 
##  4 " Bokmål, Norwegian, Norwegian Bokmål"    3500000 
##  5 " Cantonese"                             71818132.
##  6 " Catalan, Valencian"                    87500000 
##  7 " Central Khmer"                        148200000 
##  8 " Chinese"                               71309369.
##  9 " Czech"                                   855355 
## 10 " Danish"                                49008752.
## # ℹ 44 more rows

(This code summarizes the data by grouping on three categorical variables (“genre,” “status,” and “orig_lang”) and calculating the mean of one numerical column within each group: “score” by genre, “revenue” by status, and “budget_x” by original language.)

Function to calculate probabilities and assign “anomaly” tag

calculate_probabilities <- function(grouped_data, column_name) {
  grouped_data <- grouped_data %>%
    mutate(probability = n() / sum(n()))
  
  min_probability <- min(grouped_data$probability)
  
  grouped_data <- grouped_data %>%
    mutate(anomaly_tag = ifelse(probability == min_probability, "Anomaly", "Normal"))
  
  return(grouped_data)
}


grouped_genre <- calculate_probabilities(grouped_genre, "genre")
grouped_status <- calculate_probabilities(grouped_status, "status")
grouped_orig_lang <- calculate_probabilities(grouped_orig_lang, "orig_lang")


print(grouped_genre)

## # A tibble: 2,304 × 4
##    genre                                      mean_score probability anomaly_tag
##    <chr>                                           <dbl>       <dbl> <chr>      
##  1 ""                                               27.9           1 Anomaly    
##  2 "Action"                                         59.8           1 Anomaly    
##  3 "Action, Adventure"                              62.2           1 Anomaly    
##  4 "Action, Adventure, Animation"                   62.3           1 Anomaly    
##  5 "Action, Adventure, Animation, Comedy"           69             1 Anomaly    
##  6 "Action, Adventure, Animation, Comedy, Fa…       68             1 Anomaly    
##  7 "Action, Adventure, Animation, Comedy, Fa…       68             1 Anomaly    
##  8 "Action, Adventure, Animation, Comedy, Ro…       58             1 Anomaly    
##  9 "Action, Adventure, Animation, Crime, Mys…       65             1 Anomaly    
## 10 "Action, Adventure, Animation, Drama"            78.5           1 Anomaly    
## # ℹ 2,294 more rows

print(grouped_status)

## # A tibble: 3 × 4
##   status             mean_revenue probability anomaly_tag
##   <chr>                     <dbl>       <dbl> <chr>      
## 1 " In Production"     154868097.           1 Anomaly    
## 2 " Post Production"   119669447.           1 Anomaly    
## 3 " Released"          253703704.           1 Anomaly

print(grouped_orig_lang)

## # A tibble: 54 × 4
##    orig_lang                              mean_budget probability anomaly_tag
##    <chr>                                        <dbl>       <dbl> <chr>      
##  1 " Arabic"                                55571750            1 Anomaly    
##  2 " Basque"                               115600000            1 Anomaly    
##  3 " Bengali"                              180740000            1 Anomaly    
##  4 " Bokmål, Norwegian, Norwegian Bokmål"    3500000            1 Anomaly    
##  5 " Cantonese"                             71818132.           1 Anomaly    
##  6 " Catalan, Valencian"                    87500000            1 Anomaly    
##  7 " Central Khmer"                        148200000            1 Anomaly    
##  8 " Chinese"                               71309369.           1 Anomaly    
##  9 " Czech"                                   855355            1 Anomaly    
## 10 " Danish"                                49008752.           1 Anomaly    
## # ℹ 44 more rows

(This code is meant to calculate a probability for each category and assign anomaly tags within the grouped data frames. It does this for three categorical variables: “genre,” “status,” and “orig_lang.” A category with the minimum probability is tagged “Anomaly” and every other category is tagged “Normal.” However, each grouped data frame already holds one row per category, so n() / sum(n()) evaluates to 1 for every row, as the probability column in the output shows. Every category therefore has the minimum probability and is tagged “Anomaly.” The column_name argument is not used inside the function.)
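A frequency-based probability per category can be sketched by counting rows in the raw data before any summarizing. This is only a minimal illustration, not the method used in this report: the `toy` data frame, the `n_rows` column name, and the `genre_freq` result are made up for the example.

```r
library(dplyr)

# Hypothetical toy data standing in for the movie data set
toy <- tibble(
  genre = c("Action", "Action", "Action", "Drama", "Drama", "Comedy"),
  score = c(60, 70, 65, 80, 75, 50)
)

# Count rows per category on the raw data, then turn counts into proportions
genre_freq <- toy %>%
  count(genre, name = "n_rows") %>%
  mutate(
    probability = n_rows / sum(n_rows),
    # Tag the category (or categories) with the smallest proportion
    anomaly_tag = ifelse(probability == min(probability), "Anomaly", "Normal")
  )

print(genre_freq)
```

Here the proportions differ between categories, so only the rarest genre in the toy data receives the “Anomaly” tag.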

Printing the “anomaly_tag” columns in grouped data frames

print(grouped_genre)

## # A tibble: 2,304 × 4
##    genre                                      mean_score probability anomaly_tag
##    <chr>                                           <dbl>       <dbl> <chr>      
##  1 ""                                               27.9           1 Anomaly    
##  2 "Action"                                         59.8           1 Anomaly    
##  3 "Action, Adventure"                              62.2           1 Anomaly    
##  4 "Action, Adventure, Animation"                   62.3           1 Anomaly    
##  5 "Action, Adventure, Animation, Comedy"           69             1 Anomaly    
##  6 "Action, Adventure, Animation, Comedy, Fa…       68             1 Anomaly    
##  7 "Action, Adventure, Animation, Comedy, Fa…       68             1 Anomaly    
##  8 "Action, Adventure, Animation, Comedy, Ro…       58             1 Anomaly    
##  9 "Action, Adventure, Animation, Crime, Mys…       65             1 Anomaly    
## 10 "Action, Adventure, Animation, Drama"            78.5           1 Anomaly    
## # ℹ 2,294 more rows

print(grouped_status)

## # A tibble: 3 × 4
##   status             mean_revenue probability anomaly_tag
##   <chr>                     <dbl>       <dbl> <chr>      
## 1 " In Production"     154868097.           1 Anomaly    
## 2 " Post Production"   119669447.           1 Anomaly    
## 3 " Released"          253703704.           1 Anomaly

print(grouped_orig_lang)

## # A tibble: 54 × 4
##    orig_lang                              mean_budget probability anomaly_tag
##    <chr>                                        <dbl>       <dbl> <chr>      
##  1 " Arabic"                                55571750            1 Anomaly    
##  2 " Basque"                               115600000            1 Anomaly    
##  3 " Bengali"                              180740000            1 Anomaly    
##  4 " Bokmål, Norwegian, Norwegian Bokmål"    3500000            1 Anomaly    
##  5 " Cantonese"                             71818132.           1 Anomaly    
##  6 " Catalan, Valencian"                    87500000            1 Anomaly    
##  7 " Central Khmer"                        148200000            1 Anomaly    
##  8 " Chinese"                               71309369.           1 Anomaly    
##  9 " Czech"                                   855355            1 Anomaly    
## 10 " Danish"                                49008752.           1 Anomaly    
## # ℹ 44 more rows

(The code above prints the grouped data frames grouped_genre, grouped_status, and grouped_orig_lang again, now with the “probability” and “anomaly_tag” columns.)

Creating grouped data frames with unique “anomaly_tag” column names and left-joining the “anomaly_tag” information to the original data frame

grouped_genre <- grouped_genre %>%
  mutate(genre_anomaly = ifelse(probability == min(probability), "Anomaly", "Normal"))

grouped_status <- grouped_status %>%
  mutate(status_anomaly = ifelse(probability == min(probability), "Anomaly", "Normal"))

grouped_orig_lang <- grouped_orig_lang %>%
  mutate(orig_lang_anomaly = ifelse(probability == min(probability), "Anomaly", "Normal"))

# Left join "anomaly_tag" information to the original data frame
data <- data %>%
  left_join(select(grouped_genre, genre, genre_anomaly), by = c("genre" = "genre")) %>%
  left_join(select(grouped_status, status, status_anomaly), by = c("status" = "status")) %>%
  left_join(select(grouped_orig_lang, orig_lang, orig_lang_anomaly), by = c("orig_lang" = "orig_lang"))

head(data)

##                         names      date_x score
## 1                   Creed III 03/02/2023     73
## 2    Avatar: The Way of Water 12/15/2022     78
## 3 The Super Mario Bros. Movie 04/05/2023     76
## 4                     Mummies 01/05/2023     70
## 5                   Supercell 03/17/2023     61
## 6                Cocaine Bear 02/23/2023     66
##                                           genre
## 1                                 Drama, Action
## 2            Science Fiction, Adventure, Action
## 3 Animation, Adventure, Family, Fantasy, Comedy
## 4 Animation, Comedy, Family, Adventure, Fantasy
## 5                                        Action
## 6                       Thriller, Comedy, Crime
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                     overview
## 1 After dominating the boxing world, Adonis Creed has been thriving in both his career and family life. When a childhood friend and former boxing prodigy, Damien Anderson, resurfaces after serving a long sentence in prison, he is eager to prove that he deserves his shot in the ring. The face-off between former friends is more than just a fight. To settle the score, Adonis must put his future on the line to battle Damien — a fighter who has nothing to lose.
## 2                                                                                                                                                                                           Set more than a decade after the events of the first film, learn the story of the Sully family (Jake, Neytiri, and their kids), the trouble that follows them, the lengths they go to keep each other safe, the battles they fight to stay alive, and the tragedies they endure.
## 3                                                                                                                                                                                                               While working underground to fix a water main, Brooklyn plumbers—and brothers—Mario and Luigi are transported down a mysterious pipe and wander into a magical new world. But when the brothers are separated, Mario embarks on an epic quest to find Luigi.
## 4                                                                                                                                                                                                                                     Through a series of unfortunate events, three mummies end up in present-day London and embark on a wacky and hilarious journey in search of an old ring belonging to the Royal Family, stolen by ambitious archaeologist Lord Carnaby.
## 5                                                        Good-hearted teenager William always lived in hope of following in his late father’s footsteps and becoming a storm chaser. His father’s legacy has now been turned into a storm-chasing tourist business, managed by the greedy and reckless Zane Rogers, who is now using William as the main attraction to lead a group of unsuspecting adventurers deep into the eye of the most dangerous supercell ever seen.
## 6                                                                                                                                                                                                                                                           Inspired by a true story, an oddball group of cops, criminals, tourists and teens converge in a Georgia forest where a 500-pound black bear goes on a murderous rampage after unintentionally ingesting cocaine.
##                                                                                                                                                                                                                                                                                                              crew
## 1          Michael B. Jordan, Adonis Creed, Tessa Thompson, Bianca Taylor, Jonathan Majors, Damien Anderson, Wood Harris, Tony 'Little Duke' Evers, Phylicia Rashād, Mary Anne Creed, Mila Davis-Kent, Amara Creed, Florian Munteanu, Viktor Drago, José Benavidez Jr., Felix Chavez, Selenis Leyva, Laura Chavez
## 2                                    Sam Worthington, Jake Sully, Zoe Saldaña, Neytiri, Sigourney Weaver, Kiri / Dr. Grace Augustine, Stephen Lang, Colonel Miles Quaritch, Kate Winslet, Ronal, Cliff Curtis, Tonowari, Joel David Moore, Norm Spellman, CCH Pounder, Mo'at, Edie Falco, General Frances Ardmore
## 3 Chris Pratt, Mario (voice), Anya Taylor-Joy, Princess Peach (voice), Charlie Day, Luigi (voice), Jack Black, Bowser (voice), Keegan-Michael Key, Toad (voice), Seth Rogen, Donkey Kong (voice), Fred Armisen, Cranky Kong (voice), Kevin Michael Richardson, Kamek (voice), Sebastian Maniscalco, Spike (voice)
## 4    Óscar Barberán, Thut (voice), Ana Esther Alborg, Nefer (voice), Luis Pérez Reina, Carnaby (voice), María Luisa Solá, Madre (voice), Jaume Solà, Sekhem (voice), José Luis Mediavilla, Ed (voice), José Javier Serrano Rodríguez, Danny (voice), Aleix Estadella, Dennis (voice), María Moscardó, Usi (voice)
## 5                                                                Skeet Ulrich, Roy Cameron, Anne Heche, Dr Quinn Brody, Daniel Diemer, William Brody, Jordan Kristine Seamón, Harper Hunter, Alec Baldwin, Zane Rogers, Richard Gunn, Bill Brody, Praya Lundberg, Amy, Johnny Wactor, Martin, Anjul Nigam, Ramesh
## 6                                                                      Keri Russell, Sari, Alden Ehrenreich, Eddie, O'Shea Jackson Jr., Daveed, Ray Liotta, Syd, Kristofer Hivju, Olaf (Kristoffer), Margo Martindale, Ranger Liz, Christian Convery, Henry, Isiah Whitlock Jr., Bob, Jesse Tyler Ferguson, Peter
##                    orig_title    status           orig_lang budget_x    revenue
## 1                   Creed III  Released             English 7.50e+07  271616668
## 2    Avatar: The Way of Water  Released             English 4.60e+08 2316794914
## 3 The Super Mario Bros. Movie  Released             English 1.00e+08  724459031
## 4                      Momias  Released  Spanish, Castilian 1.23e+07   34200000
## 5                   Supercell  Released             English 7.70e+07  340941959
## 6                Cocaine Bear  Released             English 3.50e+07   80000000
##   country genre_anomaly status_anomaly orig_lang_anomaly
## 1      AU       Anomaly        Anomaly           Anomaly
## 2      AU       Anomaly        Anomaly           Anomaly
## 3      AU       Anomaly        Anomaly           Anomaly
## 4      AU       Anomaly        Anomaly           Anomaly
## 5      US       Anomaly        Anomaly           Anomaly
## 6      AU       Anomaly        Anomaly           Anomaly

(The code above augments the original data with the anomaly tags from the grouped data frames, one new column per categorical variable. These tags can then be used to analyze and visualize anomalies within the different categories. In the rows shown, every tag is “Anomaly.”)
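After a join like this, a quick way to check the result is to count the values in the new tag column. The sketch below uses made-up data (`movies`, `genre_tags`, and their contents are hypothetical) to show that left_join keeps every row of the left table and fills unmatched keys with NA.

```r
library(dplyr)

# Hypothetical toy data; column names mirror the ones used in this report
movies <- tibble(
  names = c("A", "B", "C", "D"),
  genre = c("Action", "Action", "Drama", "Comedy")
)
genre_tags <- tibble(
  genre = c("Action", "Drama"),
  genre_anomaly = c("Normal", "Anomaly")
)

# left_join keeps every row of `movies`; genres without a match get NA
tagged <- movies %>%
  left_join(genre_tags, by = "genre")

# Tabulate the tag values (NA included) to check the join result
tag_counts <- tagged %>% count(genre_anomaly)
print(tag_counts)
```

If the counts show only one tag value, that points back to how the tags were assigned rather than to the join itself.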

Create a bar plot to visualize anomaly tags by genre

library(ggplot2)


ggplot(data, aes(x = genre, fill = genre_anomaly)) +
  geom_bar() +
  labs(title = "Distribution of Anomaly Tags by Genre",
       x = "Genre",
       y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

(The code above uses the ggplot2 library to draw a bar plot of the anomaly tags by genre. The plot shows how the tags are distributed across the genres in my dataset, so we can quickly see which genres have more “Anomaly” or “Normal” records.)
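With 2,304 distinct genre values, the x-axis of this bar plot gets very crowded. One possible variation, shown here only as a sketch on made-up data, restricts the plot to the most frequent genres. The `toy` data frame and the `top_n_genres` cutoff are assumptions chosen for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for `data`
toy <- tibble(
  genre = c(rep("Action", 5), rep("Drama", 4), rep("Comedy", 3),
            "Horror", "Western"),
  genre_anomaly = c(rep("Normal", 12), "Anomaly", "Anomaly")
)

top_n_genres <- 3  # assumed cutoff, chosen only for illustration

# Keep the most frequent genres so the x-axis stays readable
top_genres <- toy %>%
  count(genre, sort = TRUE) %>%
  slice_head(n = top_n_genres) %>%
  pull(genre)

plot_data <- toy %>% filter(genre %in% top_genres)

p <- ggplot(plot_data, aes(x = genre, fill = genre_anomaly)) +
  geom_bar() +
  labs(title = "Distribution of Anomaly Tags by Genre (top genres only)",
       x = "Genre",
       y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
```

The trade-off is that rare genres, which are the ones most likely to be tagged as anomalies, are dropped from the picture, so a plot like this complements the full one rather than replacing it.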

Creating a box plot to compare “score” for normal and anomaly groups

ggplot(data, aes(x = genre_anomaly, y = score)) +
  geom_boxplot() +
  labs(title = "Comparison of Scores for Normal and Anomaly Groups",
       x = "Anomaly Tag",
       y = "Score") +
  theme_minimal()

(The code chunk above generates a boxplot using the ggplot2 package. It compares scores for the “Normal” and “Anomaly” groups based on the “genre_anomaly” column.

ggplot(data, aes(x = genre_anomaly, y = score)): This line sets up the base plot using my dataset (data). It specifies that I want “genre_anomaly” on the x-axis and “score” on the y-axis.

geom_boxplot(): This adds a boxplot layer to my plot, which will display the distribution of scores for each “genre_anomaly” group.

labs: This sets the title and axis labels for my plot.

theme_minimal(): This applies a minimal theme to my plot, adjusting the visual appearance.)
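The numbers behind a boxplot can also be inspected directly. The sketch below uses base R's boxplot.stats() on made-up data (`toy` and its `tag` column are hypothetical stand-ins for `data` and “genre_anomaly”). Note that ggplot2 computes its hinges as quartiles, which can differ slightly from the base R hinges.

```r
# Hypothetical toy data; `tag` plays the role of genre_anomaly
toy <- data.frame(
  tag = c("Normal", "Normal", "Normal", "Normal", "Normal",
          "Anomaly", "Anomaly", "Anomaly"),
  score = c(60, 65, 70, 75, 80, 20, 50, 90)
)

# boxplot.stats()$stats returns, per group: lower whisker, lower hinge,
# median, upper hinge, upper whisker
box_summary <- tapply(toy$score, toy$tag, function(x) boxplot.stats(x)$stats)
print(box_summary)
```

Comparing the middle value of each vector gives the group medians that the boxplot draws as the centre lines.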

Combinations that never show up and most/least common combinations

# Build the grid in the same sorted order that table() uses for its dimensions,
# and keep the values as character strings so they print as labels
combinations <- expand.grid(
  Genre = sort(unique(data$genre)),
  Status = sort(unique(data$status)),
  Orig_Lang = sort(unique(data$orig_lang)),
  stringsAsFactors = FALSE
)

combination_counts <- table(data$genre, data$status, data$orig_lang)

missing_combinations <- which(combination_counts == 0)
most_common_combinations <- which(combination_counts == max(combination_counts))
# Least common among the combinations that occur at least once
least_common_combinations <- which(
  combination_counts == min(combination_counts[combination_counts > 0])
)

cat("Missing Combinations:\n")
for (row_index in missing_combinations) {
  combination <- combinations[row_index, ]
  cat(paste(names(combination), combination, sep = " = "), "\n")
}

cat("Most Common Combinations:\n")
for (row_index in most_common_combinations) {
  combination <- combinations[row_index, ]
  cat(paste(names(combination), combination, sep = " = "), "\n")
}

cat("Least Common Combinations:\n")
for (row_index in least_common_combinations) {
  combination <- combinations[row_index, ]
  cat(paste(names(combination), combination, sep = " = "), "\n")
}

(When I run this code I get a result, but when I try to knit this part it does not knit, so I wrote the code directly as text rather than in a chunk.

The code above generates the combinations of values from the “Genre,” “Status,” and “Orig_Lang” columns in my data and then analyzes the presence and frequency of these combinations. It prints the missing combinations, the most common combinations, and the least common combinations.)
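The same kind of analysis can be sketched by converting the contingency table to a data frame, which keeps each combination's labels next to its count and avoids matching indices between two objects. This is an alternative illustration on made-up data, with two variables instead of three. The `toy` data frame and the result names are hypothetical.

```r
# Hypothetical toy data standing in for `data`
toy <- data.frame(
  genre = c("Action", "Action", "Drama"),
  status = c("Released", "Released", "Post Production")
)

# One row per combination, with its count in the Freq column
combo_counts <- as.data.frame(
  table(Genre = toy$genre, Status = toy$status),
  stringsAsFactors = FALSE
)

# Combinations that never occur
missing_combos <- combo_counts[combo_counts$Freq == 0, ]

# Most and least common among the combinations that do occur
observed <- combo_counts[combo_counts$Freq > 0, ]
most_common <- observed[observed$Freq == max(observed$Freq), ]
least_common <- observed[observed$Freq == min(observed$Freq), ]

print(missing_combos)
print(most_common)
print(least_common)
```

Printing data frames instead of looping with cat() also keeps the output compact when the number of combinations is large.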