Exercise 1
Using the data on credit card application status, present a frequency table in a clean kable format for the average monthly credit card expenditures of applicants.

| Expenditure (USD) | Frequency | Percentage (%) | Cumulative frequency | Cumulative percentage (%) |
|---|---|---|---|---|
| [0, 500] | 1208 | 91.58 | 1208 | 91.58 |
| (500, 1000] | 90 | 6.82 | 1298 | 98.41 |
| (1000, 1500] | 8 | 0.61 | 1306 | 99.01 |
| (1500, 2000] | 10 | 0.76 | 1316 | 99.77 |
| (2000, 2500] | 2 | 0.15 | 1318 | 99.92 |
| (2500, 3000] | 0 | 0.00 | 1318 | 99.92 |
| (3000, 3500] | 1 | 0.08 | 1319 | 100.00 |
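We created a minimal, clean kable table using this R code:
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), row_spec()
freq_table %>%
  kable() %>%  # turn the data frame into a kable before styling it
  kable_styling(bootstrap_options = "bordered") %>%
  row_spec(seq(1, nrow(freq_table), by = 2), background = "lightgrey") %>%
  row_spec(seq(2, nrow(freq_table), by = 2), background = "white")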
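The styling code assumes that `freq_table` already exists. As a rough sketch of how such a table could be built in base R: the `expenditure` vector below holds made-up values, and the 500-wide bins are an assumption read off the table above, not necessarily the code used for this report.

```r
# Toy expenditures (hypothetical values, not the real data set)
expenditure <- c(12, 480, 510, 999, 1200, 3400, 0, 250)

# 500-wide bins; include.lowest = TRUE closes the first bin at 0,
# dig.lab = 4 prints 1000 instead of 1e+03 in the interval labels
breaks <- seq(0, 3500, by = 500)
bins <- cut(expenditure, breaks = breaks, include.lowest = TRUE, dig.lab = 4)

freq <- as.vector(table(bins))
freq_table <- data.frame(
  Interval = levels(bins),
  Frequency = freq,
  Percentage = round(100 * freq / sum(freq), 2),
  Cumulative.frequency = cumsum(freq),
  Cumulative.percentage = round(100 * cumsum(freq) / sum(freq), 2)
)
print(freq_table)
```

Multiplying by 100 before rounding keeps the percentage columns on a 0 to 100 scale, which matches a header labelled in percent.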
Exercise 2
The data comes from https://flixgem.com/ (dataset version as of March 12, 2021). It contains information on 9425 movies and series available on Netflix. Let’s analyze this data!
Exercise 2.1
What is the distribution of IMDb scores for Polish movies and series?
The plot shows the distribution of IMDb scores for Polish movies and series. We used grep to select every observation whose [Tags] column contains the word “Polish”. Here is the R code:
library(ggplot2)
ggplot(pol_films) +
  geom_histogram(aes(x = IMDb.Score, fill = after_stat(count)), binwidth = 0.5)
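The filtering step that produces `pol_films` is described but not shown. A minimal sketch of one way to do it, using a small hypothetical stand-in for `mydata` (the rows and the tag strings are invented for illustration):

```r
# Toy stand-in for the Netflix data (hypothetical rows)
mydata <- data.frame(
  Title = c("A", "B", "C", "D"),
  Tags = c("Drama,Polish Movies", "Comedy", "Polish TV Shows,Crime", "Thriller"),
  IMDb.Score = c("6.5", "7.1", "5.9", "8.0"),
  stringsAsFactors = FALSE
)

# Keep rows whose Tags contain the literal word "Polish"
pol_films <- mydata[grepl("Polish", mydata$Tags, fixed = TRUE), ]

# Scores read in as text must be numeric before plotting a histogram
pol_films$IMDb.Score <- as.numeric(pol_films$IMDb.Score)
print(pol_films)
```

`grepl()` returns a logical vector, so it can index rows directly; `fixed = TRUE` treats the pattern as plain text rather than a regular expression.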
Exercise 2.2
What is the density function of IMDb scores for Polish movies and series?
These are the same observations, this time plotted as a kernel density estimate. R code:
# Make sure the scores are numeric before estimating the density
pol_films$IMDb.Score <- as.numeric(pol_films$IMDb.Score)
imdb_scores <- pol_films$IMDb.Score
density_imdb <- density(imdb_scores, na.rm = TRUE)  # skip scores that failed to parse
ggplot(data.frame(x = density_imdb$x, y = density_imdb$y), aes(x, y)) +
  geom_line() +
  labs(x = "IMDb Score", y = "Density") +
  ggtitle("Density Plot of IMDb Scores for Polish Movies and Series") +
  theme_minimal()
Exercise 2.3
What are the most popular languages available on Netflix?
English dominates the top 10 most common languages on Netflix. R code:
library(dplyr)
library(tidyr)
library(ggplot2)
language_pop <- mydata %>%
  separate_rows(Languages, sep = ",\\s*") %>%  # split on commas and drop any space after them
  count(Languages, name = "lC") %>%
  slice_max(order_by = lC, n = 10, with_ties = FALSE)  # keep exactly 10 rows
ggplot(language_pop, aes(x = reorder(Languages, lC), y = lC)) +
  geom_col(fill = "skyblue", color = "black") +
  coord_flip() +
  labs(x = "Languages", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Extra challenge
Extra challenge 1
Create a chart showing actors starring in the most popular productions.
Here is the top 10 actors chart. No actor stands out markedly in terms of the number of productions they appear in.
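No code is given for this chart. A possible sketch in base R, assuming a comma-separated `Actors` column (the column name and the rows below are hypothetical) and measuring an actor by the number of productions they appear in:

```r
# Toy stand-in for the Netflix data (hypothetical rows and column name)
mydata <- data.frame(
  Title = c("T1", "T2", "T3"),
  Actors = c("Ann Lee, Bob Kay", "Ann Lee", "Cy Moe, Ann Lee, Bob Kay"),
  stringsAsFactors = FALSE
)

# One entry per (production, actor) pair, with surrounding spaces removed
actors <- trimws(unlist(strsplit(mydata$Actors, ",")))

# Count appearances and keep the ten most frequent actors
counts <- table(actors)
actor_counts <- sort(setNames(as.vector(counts), names(counts)), decreasing = TRUE)
top_actors <- head(actor_counts, 10)

# Horizontal bars, largest on top
barplot(rev(top_actors), horiz = TRUE, las = 1, xlab = "Number of productions")
```

Trimming whitespace matters here: without `trimws()`, "Ann Lee" and " Ann Lee" would be counted as two different actors.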
Extra challenge 2
For movies and series, create rating charts from the various portals (Hidden Gem, IMDb, Rotten Tomatoes, Metacritic).
We can also draw boxplots to view the distribution of ratings from a different angle.
Using faceting, we can show a separate chart of the ratings from each portal. These visualizations support some conclusions. All ratings were rescaled to the range [0, 10]. Let’s also look at the range, IQR and standard deviation.
| Website | Range | IQR | Standard deviation |
|---|---|---|---|
| Hidden.Gem.Score | 9.2 | 4.7 | 2.4474622 |
| IMDb.Score | 8.1 | 1.0 | 0.8996811 |
| Metacritic.Score | 9.4 | 2.5 | 1.7143187 |
| Rotten.Tomatoes.Score | 10.0 | 3.6 | 2.5269466 |
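A table like this can be computed along the following lines. This is only a sketch: the scores are invented, and it assumes that Metacritic and Rotten Tomatoes are originally on a 0 to 100 scale and are divided by 10, while the other two portals are already on a 0 to 10 scale.

```r
# Toy ratings (hypothetical values); NA marks a missing score
ratings <- data.frame(
  Hidden.Gem.Score = c(2.5, 8.0, 4.1, NA),
  IMDb.Score = c(6.5, 7.2, 7.0, 5.9),
  Metacritic.Score = c(55, 72, NA, 40),
  Rotten.Tomatoes.Score = c(90, 45, 100, 0)
)

# Assumed rescaling from 0-100 to 0-10
ratings$Metacritic.Score <- ratings$Metacritic.Score / 10
ratings$Rotten.Tomatoes.Score <- ratings$Rotten.Tomatoes.Score / 10

# One row per portal: range width, interquartile range, standard deviation
spread <- data.frame(
  Website = names(ratings),
  Range = vapply(ratings, function(x) diff(range(x, na.rm = TRUE)),
                 numeric(1), USE.NAMES = FALSE),
  IQR = vapply(ratings, IQR, numeric(1), na.rm = TRUE, USE.NAMES = FALSE),
  SD = vapply(ratings, sd, numeric(1), na.rm = TRUE, USE.NAMES = FALSE)
)
print(spread)
```

Putting all portals on one scale first is what makes the three spread measures comparable across rows.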
On the Hidden Gem chart, the number of ratings rises from roughly zero towards the middle of the scale. From a rating of 5 onward, the count starts again from a low level and rises once more.
IMDb ratings are concentrated between about 6.5 and 7.5, with the highest count at a rating of 7. Dispersion on this site is therefore small, which agrees with it having the lowest IQR and standard deviation in the table.
On Metacritic, the ratings are spread very evenly around the arithmetic mean, where the largest number of ratings is found.
On Rotten Tomatoes, the number of ratings seems to grow with the rating value. In short, the higher the rating, the more often it is assigned.
Extra challenge 3
Which film studios produce the most and how has this changed over the years?
Since we are analyzing films that are available on the Netflix platform, we must keep in mind that a change in the number of films from a studio may reflect a change in the studio’s “productivity” or, more likely in this case, a change in the availability of its films on Netflix.
Based on the above chart, we can draw the following conclusions. First, film production among the top 5 film producers increases over the years. Second, in recent years Netflix has dominated, not only in film production but probably also by limiting access on its platform to films produced by other studios.
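The counts behind a chart of this kind could be prepared roughly as follows. The column names `Production.House` and `Release.Year`, the rows, and the choice of `top_n` are all hypothetical; the real data may need a release date to be converted to a year first.

```r
# Toy stand-in for the Netflix data (hypothetical rows and column names)
mydata <- data.frame(
  Production.House = c("Netflix", "Netflix", "Warner Bros.",
                       "Netflix", "Warner Bros.", "Universal"),
  Release.Year = c(2018, 2019, 2018, 2019, 2019, 2019),
  stringsAsFactors = FALSE
)

# Studios with the most titles overall (top 2 here, top 5 in the report)
top_n <- 2
totals <- sort(table(mydata$Production.House), decreasing = TRUE)
top_studios <- names(totals)[seq_len(top_n)]

# Titles per studio and year, restricted to the top studios
top_data <- mydata[mydata$Production.House %in% top_studios, ]
per_year <- as.data.frame(
  table(Studio = top_data$Production.House, Year = top_data$Release.Year),
  stringsAsFactors = FALSE
)
print(per_year)
```

The long format of `per_year` (one row per studio and year) can be passed directly to a line or bar chart with one series per studio.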