DataVisualisationReport

Exercise 1.

import numpy as np import pandas as pd import matplotlib.pyplot as plt

data = pd.read_csv(“credit_card.csv”) # Calculate average monthly credit card expenditure data[‘avg_expenditure’] = data[‘expenditure’] / data[‘months’] data.head()

data[‘avg_expenditure’].plot(figsize=(10, 6), color=‘skyblue’, alpha=0.7) plt.title(‘Bar Chart of Average Monthly Expenditure’) plt.xlabel(‘Index’) plt.ylabel(‘Average Monthly Expenditure’) plt.grid(True) plt.show()

from tabulate import tabulate

Frequency distribution

frequency_table = data[‘avg_expenditure’].value_counts().reset_index() frequency_table.columns = [‘Average Monthly Expenditure’, ‘Frequency’]

Print frequency table, first 1000 characters

print(tabulate(frequency_table, headers=‘keys’, tablefmt=‘pipe’)[:1000])

Jupiter file

There is a Jupiter file with this python code, plots and tables in repository https://github.com/WiktoriaKop/descriptiveStat.git in file visualisationReport.

Exercise 2.

Answer with the most appropriate data visualization for the following questions:

What is the distribution of Imdb scores for Polish movies and movie-series?

Imdb_scores_distribution <- mydata %>%
  filter(grepl("Polish", Tags)) %>%
  group_by(Series.or.Movie)

ggplot(Imdb_scores_distribution, aes(x = IMDb.Score, fill = after_stat(count))) +
  geom_histogram(binwidth = 0.1) +
  scale_fill_gradient(high = "lightblue", low = "darkblue") +
  scale_y_continuous(breaks = seq(0, 8, by = 1)) +
  scale_x_continuous(breaks = seq(3, 9, by = 0.5)) +
  facet_wrap(vars(Series.or.Movie)) +
  scale_fill_distiller(direction = -1, palette = 4) +
  labs(title = "Distribution of IMDb Scores", y = "frequency", x = "IMDb Score (0 - 10)") +
  theme_dark()

## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.

What is the density function of Imdb scores for Polish movies and movie-series?

ggplot(Imdb_scores_distribution, aes(x = IMDb.Score)) +
  geom_density(color = "navyblue",size = 0.7) +
    scale_x_continuous(breaks = seq(3, 9, by = 0.5)) + 
    scale_y_continuous(breaks = seq(0, 1, by = 0.1)) +
  labs(title = "Density of IMDb sores for Polish movies and series together", x = "IMDb scores")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

ggplot(Imdb_scores_distribution, aes(x = IMDb.Score)) +
  geom_density(color = "navyblue",size = 0.7) +
  scale_x_continuous(breaks = seq(3, 9, by = 0.5)) + 
  scale_y_continuous(breaks = seq(0, 10, by = 0.1)) +
  facet_wrap(~Series.or.Movie) +
  labs(title = "Density of IMDb sores for Polish movies and series", x = "IMDb scores")

What are the most popular languages available on Netflix?

language_separated <- mydata %>%
  separate_rows(Languages, sep = ", ") %>%
  mutate(Languages = fct_rev(fct_infreq(Languages))) %>%
  filter(Languages != "") %>%
  mutate(language_lump = fct_lump(Languages, n = 30)) 
  

ggplot(language_separated, aes(y = language_lump)) +
  geom_bar() +
  scale_x_continuous(breaks = seq(0, 6500, by = 300)) +
  labs(title = "Most popular languages on Netflix", y = "Language")

language_separated %>%
  group_by(Languages) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(10) %>% 
  kbl(align = c(rep("c", 7), rep("r", 5)), caption = "TOP 10 MOST POPULAR LANGUAGES ON NETFLIX") %>%
  kable_styling(bootstrap_options = "striped")

## Selecting by count

TOP 10 MOST POPULAR LANGUAGES ON NETFLIX
Languages	count
English	6170
Japanese	1177
Spanish	837
French	801
Korean	562
German	489
Hindi	349
Mandarin	335
Italian	312
Russian	278

For extra credits:

Extra challenge 1.: Create a chart showing actors starring in the most popular productions.

actors_production <- data2 %>%
  select(Title, IMDb.Votes, Actor) %>%
  arrange(desc(IMDb.Votes)) %>%
  unique()


top_productions <- actors_production %>%
  select(Title) %>%
  unique %>%
  head(100)


actors <- actors_production %>%
  filter(Title %in% top_productions$Title)

actors_top <- actors %>%
  group_by(Actor) %>%
  summarize(count=n()) %>%
  arrange(desc(count)) %>%
  head(10)

Extra challenge 2.: For movies and series, create rating charts from the various portals (Hidden Gem, IMDb, Rotten Tomatoes, Metacritic). Hint: it’s a good idea to reshape the data to long format.

data_films <- data2 %>% 
    arrange(desc(IMDb.Votes)) %>%
      select(Title, Hidden.Gem.Score,IMDb.Score, 
             Rotten.Tomatoes.Score, Metacritic.Score) %>%
               unique() %>%
                 head(10)

data_films_long <- data_films %>%
  gather(Score, Portals, -Title)

ggplot(data_films_long, aes(y = Title, x = Portals)) +
  geom_col(position="dodge", fill = "orange4") +
  facet_wrap(~ Score, scales="free") +
   labs(title = "Top 10 Productions Scores by Website",
       x = "",
       y = "Title")

Extra challenge 3.: Which film studios produce the most and how has this changed over the years?

studios <- data2 %>%
  select(Title, Production.House, Release.Date) %>%
  filter(Production.House != "" & Release.Date != "") %>%
  unique() 
 

studios_top <- studios %>%
  group_by(Production.House) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) %>%
  head(4)
  


studio_new <- studios %>%
  filter(Production.House %in% studios_top$Production.House) %>%
  group_by(Production.House, Release.Date) %>%
  summarize(count = n())