1) Analyzing key patterns in the provided data: data exploration and initial visualization

The initial dataset consists of 3635 films (rows) and 28 variables (columns): 27 exogenous and 1 endogenous.

The initial dataset contains 0 NAs, so there are no missing values and it can be used as-is in further analysis.
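The size and missing-value checks above can be reproduced with a few lines of pandas. This is only a minimal sketch: the helper name `summarize_missing`, the column names and the toy data frame are illustrative assumptions, not the actual film dataset or the code used for this report.

```python
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> dict:
    """Return the row count, column count and total number of missing cells."""
    n_rows, n_cols = df.shape
    return {
        "rows": n_rows,
        "columns": n_cols,
        # isna() marks every missing cell; summing twice counts them all
        "missing_total": int(df.isna().sum().sum()),
    }


# Toy stand-in for the film data (hypothetical columns and values)
films = pd.DataFrame({"duration": [95, 130, 88], "is_comedy": [1, 0, 1]})
summary = summarize_missing(films)
print(summary)
```

On the real data, the same call would be expected to report 3635 rows, 28 columns and 0 missing cells if the counts stated above hold.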

Duration seems to be quite an important factor when deciding whether a given film is a comedy or not. Rule-based methods (or dummy variables) can be used later in the analysis to exploit this knowledge:

  • no comedy has a screen time of more than 240 minutes;
  • at the same time, 95% of comedies have a duration of 125 minutes or less, while 95% of non-comedies have a duration of more than 159 minutes.
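The duration observations above could be computed, and turned into a dummy variable, roughly as follows. This is a sketch under stated assumptions: the column names `duration` and `is_comedy`, the helper names `duration_profile` and `add_long_film_dummy`, and the toy data are all hypothetical; only the 240-minute threshold comes from the text.

```python
import pandas as pd


def duration_profile(df, duration_col="duration", label_col="is_comedy", q=0.95):
    """Maximum and q-th quantile of duration for each class."""
    grouped = df.groupby(label_col)[duration_col]
    return {
        "max_by_class": grouped.max().to_dict(),
        "quantile_by_class": grouped.quantile(q).to_dict(),
    }


def add_long_film_dummy(df, threshold, duration_col="duration"):
    """Return a copy with a 0/1 column flagging films longer than `threshold`."""
    out = df.copy()
    out["long_film"] = (out[duration_col] > threshold).astype(int)
    return out


# Toy stand-in for the film data (hypothetical values, durations in minutes)
films = pd.DataFrame(
    {
        "duration": [90, 100, 110, 150, 200, 250],
        "is_comedy": [1, 1, 1, 0, 0, 0],
    }
)

profile = duration_profile(films)
flagged = add_long_film_dummy(films, threshold=240)
print(profile["max_by_class"])
print(flagged[["duration", "is_comedy", "long_film"]])
```

A dummy such as `long_film` encodes the rule "longer than 240 minutes implies not a comedy" as a feature, so a downstream model can use it directly instead of having to learn the cut-off from the raw duration.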