Kable package

In previous weeks, we saw R Markdown in action, where code, commentary, and output can all be created in one place.

In this chapter we will explore a package that facilitates the creation of presentation-worthy tables: “kableExtra”.

Let’s work with the cross-sectional data on the credit history for a sample of applicants for a type of credit card.

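library(AER)         # provides the CreditCard data set
library(kableExtra)  # provides kbl() and kable_styling(), and re-exports %>%
data(CreditCard)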
cardHead <- head(CreditCard)
cardHead
##   card reports      age income       share expenditure owner selfemp dependents
## 1  yes       0 37.66667 4.5200 0.033269910  124.983300   yes      no          3
## 2  yes       0 33.25000 2.4200 0.005216942    9.854167    no      no          3
## 3  yes       0 33.66667 4.5000 0.004155556   15.000000   yes      no          4
## 4  yes       0 30.50000 2.5400 0.065213780  137.869200    no      no          0
## 5  yes       0 32.16667 9.7867 0.067050590  546.503300   yes      no          2
## 6  yes       0 23.25000 2.5000 0.044438400   91.996670    no      no          0
##   months majorcards active
## 1     54          1     12
## 2     34          1     13
## 3     58          1      5
## 4     25          1      7
## 5     64          1      5
## 6     54          1      1


cardHead %>%
  kbl()
card reports age income share expenditure owner selfemp dependents months majorcards active
yes 0 37.66667 4.5200 0.0332699 124.983300 yes no 3 54 1 12
yes 0 33.25000 2.4200 0.0052169 9.854167 no no 3 34 1 13
yes 0 33.66667 4.5000 0.0041556 15.000000 yes no 4 58 1 5
yes 0 30.50000 2.5400 0.0652138 137.869200 no no 0 25 1 7
yes 0 32.16667 9.7867 0.0670506 546.503300 yes no 2 64 1 5
yes 0 23.25000 2.5000 0.0444384 91.996670 no no 0 54 1 1


Let’s tweak the appearance of this with the “align” and the “caption” arguments.

The align argument takes a character vector of the letters “l”, “c”, or “r” (left, center, or right), one per column, specifying how each column should be aligned.

The caption argument gives a caption to the table.


base <- cardHead %>%
  kbl(align = c(rep("c", 7), rep("r", 5)), caption = "kable example with card data")
base
kable example with card data
card reports age income share expenditure owner selfemp dependents months majorcards active
yes 0 37.66667 4.5200 0.0332699 124.983300 yes no 3 54 1 12
yes 0 33.25000 2.4200 0.0052169 9.854167 no no 3 34 1 13
yes 0 33.66667 4.5000 0.0041556 15.000000 yes no 4 58 1 5
yes 0 30.50000 2.5400 0.0652138 137.869200 no no 0 25 1 7
yes 0 32.16667 9.7867 0.0670506 546.503300 yes no 2 64 1 5
yes 0 23.25000 2.5000 0.0444384 91.996670 no no 0 54 1 1


A key function, which handles much of the table’s configuration, is kable_styling().

It offers the options “bootstrap_options” (for HTML output) and “latex_options” (for PDF output); the latter requires a LaTeX installation, for example one set up through the “tinytex” package.

Possible values for “bootstrap_options” include ‘basic’, ‘striped’, ‘bordered’, ‘hover’, ‘condensed’, ‘responsive’, and ‘none’.

Possible values for “latex_options” include ‘basic’, ‘striped’, ‘hold_position’, ‘HOLD_position’, ‘scale_down’, and ‘repeat_header’.


base %>%
  kable_styling(bootstrap_options = "striped")
kable example with card data
card reports age income share expenditure owner selfemp dependents months majorcards active
yes 0 37.66667 4.5200 0.0332699 124.983300 yes no 3 54 1 12
yes 0 33.25000 2.4200 0.0052169 9.854167 no no 3 34 1 13
yes 0 33.66667 4.5000 0.0041556 15.000000 yes no 4 58 1 5
yes 0 30.50000 2.5400 0.0652138 137.869200 no no 0 25 1 7
yes 0 32.16667 9.7867 0.0670506 546.503300 yes no 2 64 1 5
yes 0 23.25000 2.5000 0.0444384 91.996670 no no 0 54 1 1
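
Several Bootstrap options can be combined by passing a character vector. The sketch below is only an illustration and not part of the original example: it uses the built-in mtcars data so that it runs on its own, and sets full_width = FALSE so the table does not stretch across the page.

```r
library(kableExtra)  # provides kbl(), kable_styling() and re-exports %>%

# Built-in data, used here only to keep the example self-contained
styled <- head(mtcars) %>%
  kbl(caption = "Combined Bootstrap options") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE  # do not stretch the table to the page width
  )

styled
```

When knitted to HTML this should give a compact, striped table whose rows are highlighted on hover.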


Next, we can customize the look and feel of particular rows and columns.

Let’s see an example here, where we make columns 8 to 12 bold and give the last three rows italic gold text on a blue background.

base %>%
  kable_styling(bootstrap_options = "bordered") %>%
  column_spec(8:12, bold = TRUE) %>%
  row_spec(4:6, italic = TRUE, color = "gold", background = "blue")
kable example with card data
card reports age income share expenditure owner selfemp dependents months majorcards active
yes 0 37.66667 4.5200 0.0332699 124.983300 yes no 3 54 1 12
yes 0 33.25000 2.4200 0.0052169 9.854167 no no 3 34 1 13
yes 0 33.66667 4.5000 0.0041556 15.000000 yes no 4 58 1 5
yes 0 30.50000 2.5400 0.0652138 137.869200 no no 0 25 1 7
yes 0 32.16667 9.7867 0.0670506 546.503300 yes no 2 64 1 5
yes 0 23.25000 2.5000 0.0444384 91.996670 no no 0 54 1 1
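
The header row can be styled in the same way, because row_spec() treats row 0 as the header. The following is a minimal sketch on the built-in iris data (chosen only so the example is self-contained); the colors and the column width are arbitrary choices.

```r
library(kableExtra)

header_styled <- head(iris) %>%
  kbl() %>%
  kable_styling(bootstrap_options = "bordered", full_width = FALSE) %>%
  row_spec(0, bold = TRUE, background = "lightgray") %>%  # row 0 is the header row
  column_spec(1, width = "8em")                           # fix the width of column 1

header_styled
```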


We can also create groups for our columns.


base %>%
  kable_styling(bootstrap_options = "bordered") %>%
  add_header_above(c("Group 1" = 4, "Group 2" = 2, "Group 3" = 6))
kable example with card data
Group 1 (card to income) | Group 2 (share, expenditure) | Group 3 (owner to active)
card reports age income share expenditure owner selfemp dependents months majorcards active
yes 0 37.66667 4.5200 0.0332699 124.983300 yes no 3 54 1 12
yes 0 33.25000 2.4200 0.0052169 9.854167 no no 3 34 1 13
yes 0 33.66667 4.5000 0.0041556 15.000000 yes no 4 58 1 5
yes 0 30.50000 2.5400 0.0652138 137.869200 no no 0 25 1 7
yes 0 32.16667 9.7867 0.0670506 546.503300 yes no 2 64 1 5
yes 0 23.25000 2.5000 0.0444384 91.996670 no no 0 54 1 1


Data Aggregation

In the first stage of our analysis we are going to group our data into a simple frequency table.

First, let’s take a look at the distribution of income in our sample and verify the tabular accuracy using the TAI measure:

options(scipen=999)

limits<- cut(CreditCard$income,seq(0,14,by=2))
tabelka <- freq(limits,type="html")
tabelka
## $`x:`
##                x label Freq Percent Valid Percent Cumulative Percent
##    Valid   (0,2]        236    17.9          17.9               17.9
##            (2,4]        783    59.4          59.4               77.3
##            (4,6]        205    15.5          15.5               92.8
##            (6,8]         63     4.8           4.8               97.6
##           (8,10]         23     1.7           1.7               99.3
##          (10,12]          7     0.5           0.5               99.8
##          (12,14]          2     0.2           0.2              100.0
##            Total       1319   100.0         100.0                   
##  Missing <blank>          0     0.0                                 
##             <NA>          0     0.0                                 
##            Total       1319   100.0

Without ‘kable’ styling it’s quite ugly, right? ;-)

Tabular accuracy

The tabular accuracy index (TAI), described by Jenks and Caspall in 1971, is used to optimize the class division used in cartograms, frequency tables, etc.

The TAI indicator takes values in the range (0;1). It equals one minus a ratio: the numerator of the ratio is the sum of the absolute deviations of the values from the means of their classes, and the denominator is the sum of the absolute deviations of the entire classified set from its overall mean.

The better the class division reflects the nature of the data, the larger the indicator will be. As the number of classes increases, the indicator will take on larger values.
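
To make the definition concrete, here is a minimal sketch in base R. It assumes the usual form of the index: one minus the ratio of the two sums of absolute deviations, with deviations taken from the class means in the numerator and from the overall mean in the denominator. The function name tai() and the toy data are illustrative only.

```r
# Tabular accuracy index under the assumption stated above:
# TAI = 1 - (sum of |x - class mean| within classes) / (sum of |x - overall mean|)
tai <- function(x, classes) {
  sum_abs_dev <- function(v) sum(abs(v - mean(v)))
  within_classes <- sum(tapply(x, classes, sum_abs_dev))
  overall <- sum_abs_dev(x)
  1 - within_classes / overall
}

# Toy data (hypothetical): two well-separated clusters
x <- c(1, 2, 3, 10, 11, 12)
classes <- cut(x, breaks = c(0, 5, 15))  # two classes: (0,5] and (5,15]

tai(x, classes)  # 1 - 4/27, roughly 0.85
```

Because the two classes match the two clusters, little variation is left inside the classes and the index is close to 1.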

Let’s calculate the TAI index to check the properties of the tabulated data:

library(classInt)  # provides classIntervals() and jenks.tests()
tabelka2 <- classIntervals(CreditCard$income, n = 7, style = "fixed", fixedBreaks = seq(0, 14, by = 2))
jenks.tests(tabelka2)
##        # classes  Goodness of fit Tabular accuracy 
##        7.0000000        0.9085328        0.6568085

As we can see - TAI index…

We can use different recipes… (styles):

tabelka3<-classIntervals(CreditCard$income, n=10, style="sd")
plot(tabelka3,pal=c(1:10))

jenks.tests(tabelka3)
##        # classes  Goodness of fit Tabular accuracy 
##        8.0000000        0.9274792        0.6909392

Still, the TAI indicator is not satisfactory. What should we change in the final frequency table design?

hist(CreditCard$income)

Continuous variables

We can calculate the absolute and relative frequencies of a vector x with the function ‘Freq’ from the DescTools package. Continuous (numeric) variables will be cut using the same logic as used by the function hist. Categorical variables will be aggregated by table. The result will contain single and cumulative frequencies for both absolute values and percentages.

tabela4 <- Freq(CreditCard$income, breaks = seq(0, 14, by = 2), useNA = "ifany")

tabela4 %>%
  kable(col.names = c("Income (10,000 USD)", "Frequency", "Percentage %", "Cumulative frequency", "Cumulative percentage %")) %>%
  kable_classic(full_width = FALSE, html_font = "Cambria")
Income (10,000 USD) Frequency Percentage % Cumulative frequency Cumulative percentage %
[0,2] 236 0.1789234 236 0.1789234
(2,4] 783 0.5936315 1019 0.7725550
(4,6] 205 0.1554208 1224 0.9279757
(6,8] 63 0.0477635 1287 0.9757392
(8,10] 23 0.0174375 1310 0.9931766
(10,12] 7 0.0053071 1317 0.9984837
(12,14] 2 0.0015163 1319 1.0000000
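
Note that the percentage columns above are stored as proportions (0.18 rather than 18%). As a rough illustration of what such a table contains, the sketch below rebuilds the same five columns by hand in base R and scales the shares to percentages. The income values are hypothetical, and the column names simply mirror those printed by Freq().

```r
# Hypothetical income values (same units as the breaks below)
income <- c(1.5, 2.4, 2.5, 3.1, 3.9, 4.5, 5.2, 7.8)

bins <- cut(income, breaks = seq(0, 8, by = 2), include.lowest = TRUE)
freq <- as.vector(table(bins))

freq_table <- data.frame(
  level   = levels(bins),
  freq    = freq,
  perc    = round(100 * freq / sum(freq), 1),          # percentages, not proportions
  cumfreq = cumsum(freq),
  cumperc = round(100 * cumsum(freq) / sum(freq), 1)
)

freq_table
```

A data frame like this can be passed to kable() in exactly the same way as the output of Freq().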

BTW: what about TAI of that table?…

Categorical variables

Now, let’s take a look at the categorical data and make some tabulations. The xtabs function works like table except it can produce tables from frequencies using the formula interface.
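
The phrase “from frequencies” means that xtabs() also accepts data that have already been aggregated: the left-hand side of the formula names the column that holds the counts. A small sketch, using a hypothetical pre-aggregated data frame (the column name n is an arbitrary choice):

```r
# Hypothetical pre-aggregated counts: one row per combination, with a count column
agg <- data.frame(
  card  = c("no", "no", "yes", "yes"),
  owner = c("no", "yes", "no", "yes"),
  n     = c(206, 90, 532, 491)
)

# The left-hand side of the formula names the column holding the counts
tab <- xtabs(n ~ card + owner, data = agg)
tab
```

With an empty left-hand side, as in the examples that follow, xtabs() counts rows instead.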

Let’s say we want to see a table of how many card applications were accepted and how many were not:

xtabs(~ card, data = CreditCard)
## card
##   no  yes 
##  296 1023

We may easily produce cross-tabs as well, here application status against home ownership (does the individual own their home?):

crosstab<-xtabs(~ card + owner, data=CreditCard)
crosstab
##      owner
## card   no yes
##   no  206  90
##   yes 532 491

and transform it into a pretty HTML table with the kbl() function:

crosstab %>%
  kbl() %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE, border_right = TRUE) %>%
  column_spec(2, background = "yellow")
      no   yes
no    206   90
yes   532  491
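
Raw counts are often easier to compare once they are turned into shares. The sketch below is an illustration only: it re-enters the counts from the cross-tab above as a matrix so that it runs without the data set, and uses base R’s prop.table() to compute row shares.

```r
# Counts copied from the cross-tab above (rows: card, columns: owner)
crosstab <- matrix(
  c(206, 532, 90, 491), nrow = 2,
  dimnames = list(card = c("no", "yes"), owner = c("no", "yes"))
)

# Share of non-owners and owners within each application outcome (rows sum to 1)
row_shares <- prop.table(crosstab, margin = 1)
round(row_shares, 3)
```

Using margin = 2 instead would give shares within each ownership group.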

Data Visualization

We will explore the “ggplot2” package of the tidyverse for data visualization purposes. A “ggplot2” plot involves the following three mandatory components:

  1. Data
  2. An aesthetic mapping
  3. Geoms (aka objects)

The following components can also optionally be added:

  1. Stats (aka transformations)
  2. Scales
  3. Facets
  4. Coordinate systems
  5. Position adjustments
  6. Themes

Please note that code in this tutorial was adapted from Chapter 3 of the book “R for Data Science” by Hadley Wickham and Garrett Grolemund.

The full book can be found at: https://r4ds.had.co.nz/

A good cheat sheet for ggplot2 functions can be found at: https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

Scatterplots

Let’s create an extremely simple scatterplot.

We will use the function ggplot() to do this.

Every ggplot graph starts with this function, followed by another function, joined with +, that adds objects.

The objects on a graph in the case of a scatterplot are points. The function we add for them is geom_point().

These functions rely on a nested function called aes(), which defines the aesthetic mapping.

The data and aesthetic mapping components can be added to either the ggplot() or geom functions.

ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy))


This is one of the most basic graphs that one can make using the ggplot2 framework.
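
As noted, the data and the aesthetic mapping can go either in ggplot() or in the geom. The sketch below draws the same scatterplot both ways; the object names p_geom and p_global are arbitrary.

```r
library(ggplot2)

# Mapping supplied in the geom
p_geom <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy))

# Same data and mapping supplied once in ggplot(); every layer inherits them
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

p_global
```

Putting the mapping in ggplot() is convenient when several layers share it, while a mapping inside a geom applies to that layer only.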

Next, let’s add color.

geom_point() understands the following aesthetics: x, y, alpha, color, fill, group, shape, size, and stroke (see help documentation).

Let’s map the color argument to the variable “class” from mpg.

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = class))

This is not the only way to color objects.

Placing the color argument inside the aes() function maps colors to the values of a chosen variable.

However, we can also specify a color manually by placing color outside of the aes() function. We will also illustrate the “size” argument; note that ggplot2 warns that using size for a discrete variable such as “class” is not advised.

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, size = class), color = "blue")
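
A common slip is to put a fixed color inside aes(). The sketch below contrasts the two placements; the object names are arbitrary.

```r
library(ggplot2)

# Outside aes(): a fixed color applied to every point
p_fixed <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), color = "blue")

# Inside aes(): "blue" is treated as a one-level variable, so ggplot2 picks
# a color from its default palette and adds a legend
p_mapped <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = "blue"))

p_mapped
```

In the second plot the points are not blue, because the string is treated as data rather than as a color name.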

Barplots

Next, let’s examine other objects that we can plot using ggplot(). We will create a bar chart using the function geom_bar().

ggplot(mpg) +
  geom_bar(aes(x = class))


With the geom_bar() function, we have a great use-case for a stat transformation.

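The following code can be used to convert these counts to proportions. Setting group = 1 makes the proportions relative to the whole data set rather than to each bar; after_stat() replaces the older stat(), which is deprecated in recent versions of ggplot2:

ggplot(mpg) +
  geom_bar(aes(x = class, y = after_stat(prop), group = 1))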
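
An alternative way to get proportions, sketched here under the assumption that ggplot2 3.3.0 or later is installed (after_stat() was introduced in that version), is to divide the computed counts by their total. The table of proportions at the end is just a cross-check against base R.

```r
library(ggplot2)

# Divide each bar's count by the total count computed by the stat
p_prop <- ggplot(mpg) +
  geom_bar(aes(x = class, y = after_stat(count / sum(count)))) +
  labs(y = "proportion")

p_prop

# Cross-check against a plain table of proportions
prop.table(table(mpg$class))
```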

Histograms

Next, let’s create a histogram with the geom_histogram() function.

ggplot(mpg) +
  geom_histogram(aes(x = hwy))

The geom_histogram() function accepts the argument “binwidth”, and has two key arguments for color: fill (this controls the overall color), and color (this controls the border).

Let’s fill all these in:

ggplot(mpg) +
  geom_histogram(aes(x = hwy), binwidth = 5, fill = "navy", color = "gold")


geom_histogram() provides a good opportunity to modify the scales.

Notice in this example that the axis is automatically broken up by units of 10, and does not begin at 0.

We can modify this with the function scale_x_continuous(), as well as the y-axis with the function scale_y_continuous().

There are three key arguments we will feed these functions: “breaks”, “limits”, and “expand”.

“breaks” will define the breaks on the axis.

“limits” will define the beginning and end of the axis, and “expand = c(0,0)” removes the default padding around those limits, so that the axes start exactly at 0.

ggplot(mpg) +
  geom_histogram(aes(x = hwy), binwidth = 5, fill = "navy", color = "gold") +
  scale_x_continuous(breaks = seq(0, 45, 5), limits = c(0, 50), expand = c(0,0)) +
  scale_y_continuous(breaks = seq(0, 90, 10), limits = c(0, 90), expand = c(0,0))

Boxplots

Next, we will create boxplots.

p <- ggplot(mpg) +
  geom_boxplot(aes(x = class, y = cty, fill = class))
p

Facets

Faceting generates small multiples each showing a different subset of the data. Small multiples are a powerful tool for exploratory data analysis: you can rapidly compare patterns in different parts of the data and see whether they are the same or different.

Read more about facets here.

Notice the use of the fig.height and fig.width chunk options in this document.

Key arguments to facet_wrap() are “facets”, “nrow”, and “ncol”.

ggplot(mpg) +
  geom_boxplot(aes(x = class, y = cty, fill = class)) +
  facet_wrap(facets = ~cyl, nrow = 2, ncol = 2)

Coordinates

Other coordinate systems can be applied to graphs created from ggplot2.

One example is coord_polar(), which uses polar coordinates. Most of these are quite rare. Probably the most common one is coord_flip(), which will flip the X and Y axes. Let’s also illustrate the labs() function, which can be used to change labels.

ggplot(mpg) +
  geom_bar(aes(x = class, fill = factor(cyl))) +
  labs(title = "Cylinders by Class", fill = "cylinders") +
  coord_flip()


These bars are stacked on top of each other, because the “cyl” variable is mapped to the “fill” argument. There are various position adjustments that can be used. Again, most of these are not very common, but a common one is the argument “position = ‘dodge’”, which will put items side-by-side.

See this example:

ggplot(mpg) +
  geom_bar(aes(x = class, fill = factor(cyl)), position = "dodge") +
  labs(title = "Cylinders by Class", fill = "cylinders") + 
  coord_flip()

Themes

Lastly, we can alter the “theme”, or the overall appearance of our plot.

I recommend using the ggThemeAssist package, because it makes this incredibly easy, with an interface that automatically generates reproducible code.

This can be used by highlighting a ggplot2 object, and navigating to Addins > ggplot Theme Assistant.

We’ll make the following changes: eliminating the panel grid lines, eliminating axis ticks, adding a title called “Boxplot Example”, making it bigger and putting it in bold, and adjusting it to the center.

p +
  theme(
    axis.ticks = element_line(linetype = "blank"),
    panel.grid.major = element_line(linetype = "blank"),
    panel.grid.minor = element_line(linetype = "blank"),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
  ) +
  labs(title = "Boxplot Example")
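
The generated code hides elements by giving them a blank line type. A more common hand-written form, shown here as a sketch rather than as the output of ggThemeAssist, uses element_blank() for the same purpose. The boxplot is rebuilt first so that the example runs on its own.

```r
library(ggplot2)

p <- ggplot(mpg) +
  geom_boxplot(aes(x = class, y = cty, fill = class))

p_clean <- p +
  theme(
    axis.ticks = element_blank(),        # remove axis ticks
    panel.grid.major = element_blank(),  # remove major grid lines
    panel.grid.minor = element_blank(),  # remove minor grid lines
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
  ) +
  labs(title = "Boxplot Example")

p_clean
```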


There are many more examples of things that can be done with ggplot2.

It is an amazingly powerful and flexible package, and it is worth getting acquainted with the cheat sheet.

Exercise 1.

“Using data on credit card applications’ status please present the frequency table with the nice, kable format for average monthly credit card expenditures of applicants.”

This piece of R code was written to analyze users’ average monthly spending by performing some operations on a credit card spending dataset.

# Drop records with months == 0 to avoid dividing by zero, compute the
# monthly average, and sort from highest to lowest
Avg_Exp <- CreditCard %>%
  filter(months != 0) %>%
  mutate(monthly_avg_exp = expenditure / months) %>%
  arrange(desc(monthly_avg_exp))

# Histogram of the monthly averages
ggplot(Avg_Exp) +
  geom_histogram(aes(x = monthly_avg_exp))

# Frequency table with automatically chosen bins
frequency_table <- Freq(Avg_Exp$monthly_avg_exp, useNA = "ifany")

# Styled HTML version of the frequency table
kable_table_2 <- frequency_table %>%
  kable(col.names = c("Avg exp in kUSD", "Frequency", "Percentage %", "Cumulative frequency", "Cumulative percentage %")) %>%
  kable_styling(bootstrap_options = "bordered") %>%
  kable_classic(full_width = FALSE, html_font = "Arial")

frequency_table
##         level   freq   perc  cumfreq  cumperc
## 1      [0,50]  1'231  93.5%    1'231    93.5%
## 2    (50,100]     44   3.3%    1'275    96.9%
## 3   (100,150]     14   1.1%    1'289    97.9%
## 4   (150,200]     12   0.9%    1'301    98.9%
## 5   (200,250]      8   0.6%    1'309    99.5%
## 6   (250,300]      1   0.1%    1'310    99.5%
## 7   (300,350]      0   0.0%    1'310    99.5%
## 8   (350,400]      1   0.1%    1'311    99.6%
## 9   (400,450]      1   0.1%    1'312    99.7%
## 10  (450,500]      4   0.3%    1'316   100.0%
kable_table_2
Avg exp in kUSD Frequency Percentage % Cumulative frequency Cumulative percentage %
[0,50] 1231 0.9354103 1231 0.9354103
(50,100] 44 0.0334347 1275 0.9688450
(100,150] 14 0.0106383 1289 0.9794833
(150,200] 12 0.0091185 1301 0.9886018
(200,250] 8 0.0060790 1309 0.9946809
(250,300] 1 0.0007599 1310 0.9954407
(300,350] 0 0.0000000 1310 0.9954407
(350,400] 1 0.0007599 1311 0.9962006
(400,450] 1 0.0007599 1312 0.9969605
(450,500] 4 0.0030395 1316 1.0000000

Records whose months value is 0 are removed from the CreditCard data set, so that the divisor is never zero when calculating monthly expenses. A new column (monthly_avg_exp) is added with the mutate() function. This column shows the average monthly expenditure, calculated by dividing expenditure by the number of months. With arrange(desc(monthly_avg_exp)) the data set is sorted in descending order by average monthly spending. Using the ggplot() function, a histogram is drawn for the monthly_avg_exp variable. This chart shows the distribution of users’ average monthly spending.

A frequency table for monthly_avg_exp is created with the Freq() function. This table shows how many values fall into each spending interval, the share of the total that each interval represents, and the corresponding cumulative figures. The frequency table is then formatted with the kable() and kable_styling() functions. Table headers are specified and a classic HTML table view is provided with kable_classic(). The table is presented in a user-friendly format with designated headings (“Avg exp in kUSD”, “Frequency” etc.).

This code is a typical example of financial data analysis. Calculating key metrics such as average monthly spend allows for a better understanding of customer spending behavior. Visualizations and presentations such as histograms and frequency tables provide insight into such data and can help in making strategic decisions.

Exercise 2.

“The data comes from https://flixgem.com/ (dataset version as of March 12, 2021). The data contains information on 9425 movies and series available on Netflix.”

This piece of R code downloads and reads a CSV file from the internet and then loads the dataset into the R environment.

This code is a typical data loading process used as the initial stage in data analysis projects. Successfully loading the dataset is the first step in the analysis process.

“Answer with the most appropriate data visualization for the following questions:”

“1. What is the distribution of Imdb scores for Polish movies and movie-series?”

The provided R code aims to visualize the distribution of IMDb scores for Polish movies and movie-series that are available in Poland, using two different types of plots: a bar chart and a histogram.

polish_movies_series <- mydata %>% 
  filter(Languages == "Polish", Country.Availability == "Poland")

ggplot(polish_movies_series) +
  geom_bar(aes(x = IMDb.Score, fill = factor(IMDb.Score))) +
  labs(title = "Polish movie ratings", fill = "ratings") +
  coord_flip()

ggplot(polish_movies_series) +
  geom_histogram(aes(x = IMDb.Score), binwidth = 0.1, fill = "black", color = "white")

In the first part of the code, we filtered the dataset to only include entries where the language is Polish and the movie or series is available in Poland.

filter(): This function is used to subset the data based on specified conditions. In this case, it ensures that only Polish movies and series available in Poland are included.

In the second part, we created a bar chart of IMDb scores to show the frequency of each score.

ggplot(): The main function that starts a plot.

geom_bar(): This geom creates a bar chart. The aes() function inside maps IMDb scores to the x-axis and colors the bars by score. Using factor(IMDb.Score) gives each distinct score its own color.

labs(): Used to add labels to the chart, here a title and a legend title.

coord_flip(): Flips the x and y axes to make the graph horizontal; this can improve readability when there are many categories.

In the third part of the code, we created a histogram to display the distribution of IMDb scores among Polish movies and TV series.

geom_histogram(): This geom creates a histogram, with binwidth = 0.1 specifying the width of each bin. The bin width determines the level of detail in the histogram; a smaller bin width can provide a more detailed view of the distribution. aes(x = IMDb.Score): Specifies that IMDb scores should be plotted on the x-axis. fill = "black", color = "white": Increases visual contrast and clarity by setting the color of the bars to black and their borders to white.

Both visualizations serve different purposes:

Bar Chart: Useful for seeing the exact frequency of each IMDb score and comparing frequencies across different scores.

Histogram: Provides a more general view of the distribution of scores, helping to understand the shape of the distribution (e.g., whether it is normal, skewed, has outliers, etc.).

Together, these plots offer a comprehensive view of how IMDb scores are distributed among Polish movies and series available in Poland, which can be insightful for understanding audience reception and the quality distribution of the local film and series industry.
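
One detail of the filter step is worth a second look. Comparing with == keeps only titles whose Languages field is exactly "Polish", so a title listed as “Polish, English” is left out, and the same applies to titles available in several countries. Whether such titles should count is a judgment call. The sketch below shows the difference on a toy data frame; the rows are hypothetical and only the column names follow the exercise.

```r
# Toy stand-in for mydata (hypothetical rows; column names follow the exercise)
toy <- data.frame(
  Title                = c("A", "B", "C", "D"),
  Languages            = c("Polish", "Polish, English", "English", "Polish"),
  Country.Availability = c("Poland", "Poland,Germany", "Poland", "Japan"),
  stringsAsFactors     = FALSE
)

# Exact match: keeps only row A
exact <- subset(toy, Languages == "Polish" & Country.Availability == "Poland")

# Pattern match: keeps any row that mentions Polish and is available in Poland
partial <- subset(
  toy,
  grepl("Polish", Languages, fixed = TRUE) &
    grepl("Poland", Country.Availability, fixed = TRUE)
)

exact$Title    # "A"
partial$Title  # "A" "B"
```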

“2. What is the density function of Imdb scores for Polish movies and movie-series?”

The provided R code snippet is used to create a density plot for IMDb scores of Polish movies and series, using the ggplot2 package. This type of visualization is useful for examining the distribution’s shape, central tendency, and variability.

ggplot(polish_movies_series, aes(x = IMDb.Score)) +
  geom_density(fill = "skyblue", color = "black") +
  labs(title = "Density of IMDb scores of Polish movies and series",
       x = "IMDb Score", y = "Density")

aes(x = IMDb.Score): Defines the aesthetic mapping for the graph, where IMDb scores are mapped to the x-axis. This means that the density will be estimated from the IMDb scores.

geom_density(): Adds a smoothed density estimate to the chart. Density plots are useful for visualizing the distribution of a variable and are particularly useful when the exact shape of the distribution is of interest.

fill = "skyblue": Specifies the color used to fill the area under the density curve, which increases visual appeal and readability. color = "black": Defines the color of the density curve outline, providing contrast against the fill for better clarity.

labs(): Adds labels to the plot, including the main title and axis labels.

The density plot created by this code provides a visual interpretation of the distribution of IMDb scores within the filtered subset of Polish movies and series. Unlike histograms, density plots provide a continuous curve representing the distribution, which can be more informative for identifying modes, symmetry, and skewness in the data.

This visualization is particularly useful in identifying the concentration of data points and any potential outliers. By using a density plot, analysts and viewers can quickly grasp the range and commonality of IMDb scores, potentially guiding further statistical analysis or decision-making related to content quality and viewer preferences in the Polish media context.
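
The y-axis of a density plot shows density rather than counts, which means that the area under the curve is approximately 1. The sketch below checks this with base R’s density() function, a kernel density estimator of the kind geom_density() relies on. The scores are randomly generated stand-ins, not the Netflix data.

```r
set.seed(1)
scores <- round(runif(200, min = 4, max = 9), 1)  # hypothetical ratings

d <- density(scores)  # Gaussian kernel density estimate (base R)

# Trapezoidal approximation of the area under the estimated curve
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area
```

The result should be very close to 1, which is why densities of groups with different sizes can be compared on the same axis.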

“3. What are the most popular languages available on Netflix?”

The provided R code snippet analyzes a dataset to find the frequency of different languages used in content available on Netflix, focusing particularly on the top five languages.

library(dplyr)
library(tidyr)
library(ggplot2)

# Assuming mydata is already read and available
language_count <- mydata %>%
  tidyr::separate_rows(Languages, sep=", ") %>%
  dplyr::group_by(Languages) %>%
  dplyr::summarise(count = dplyr::n()) %>%
  dplyr::arrange(desc(count))

# Convert Languages to factor with desired order
language_count$Languages <- factor(language_count$Languages, levels = language_count$Languages[order(-language_count$count)])

# Plotting the top 5 languages
ggplot(language_count[1:5, ], aes(x = Languages, y = count)) +
  geom_bar(stat = "identity", fill = "black") +
  labs(title = "Top 5 Most Popular Languages on Netflix",
       x = "Languages")

In the first part of the code, separate_rows() splits multiple languages listed in a single row into multiple rows to make counting easier. For instance, if a movie is listed as “English, Spanish”, it creates two separate entries. The code then groups the dataset by individual language, calculates how often each language occurs across the dataset, and sorts the languages in descending order of frequency, which helps identify the most popular languages.

In the second part, the Languages column is converted into a factor with levels ordered by frequency. This ensures that the plot displays the languages in the order of their frequency.

In the last part, a bar chart displays the five most popular languages on Netflix based on their frequency of appearance in the dataset. Restricting the plot to the top five languages keeps it concise; the code also defines the aesthetic mappings for the plot and adds labels to the graph.

The combined effect of these components is to produce a clear and informative visualization that highlights the languages most frequently used in Netflix content available in the dataset.
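
To see what the split-and-count step does, here is the same idea in base R on a toy vector. The values are hypothetical, and strsplit() with table() stands in for separate_rows() and summarise().

```r
# Toy Languages column (hypothetical values)
languages <- c("English, Spanish", "English", "Polish, English", "Spanish")

# Split each entry on ", " and flatten: one element per title-language pair
each_language <- unlist(strsplit(languages, ", ", fixed = TRUE))

# Count and sort in descending order
language_count <- sort(table(each_language), decreasing = TRUE)
language_count
```

English appears three times, Spanish twice, and Polish once, so English comes first.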

For extra credit:

“Extra challenge 1.: Create a chart showing actors starring in the most popular productions.”

This R code snippet processes a dataset to identify the top ten movies with the highest box office earnings and then presents the results in both a detailed view and a stylized HTML table focusing on movie titles and actors.

# As a first step towards the most popular movies, take the titles with the highest box office values.
# Note: gsub() returns character strings, so arrange() below sorts them as text;
# converting to numeric (after also stripping any currency symbol) would sort by value.
mydata$Boxoffice <- gsub(",", "", mydata$Boxoffice)
popular_movies <- mydata %>%
  arrange(desc(Boxoffice))
popular_movies <- head(popular_movies, 10)
# Then put them in a kable table where only the title and actors are shown
popular_movie_actors <- popular_movies[, c("Title", "Actors")]

popular_movie_actors%>%
  kable("html") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Title Actors
Little Italy Hayden Christensen, Danny Aiello, Andrea Martin, Emma Roberts
Olympus Has Fallen Dylan McDermott, Gerard Butler, Aaron Eckhart, Finley Jacobsen
In the Name of Mateusz Kosciukiewicz, Andrzej Chyra, Maja Ostaszewska, Lukasz Simlat
The Green Hornet Seth Rogen, Jay Chou, Tom Wilkinson, Cameron Diaz
Date Night Steve Carell, Taraji P. Henson, Tina Fey, Mark Wahlberg
The Color Purple Whoopi Goldberg, Danny Glover, Margaret Avery, Oprah Winfrey
Fida Fardeen Khan, Kareena Kapoor, Kim Sharma, Shahid Kapoor
The Swan Princess Liz Callaway, Michelle Nicastro, Jack Palance, Howard McGillin
Yes Man John Michael Higgins, Jim Carrey, Bradley Cooper, Zooey Deschanel
Sausage Party Michael Cera, Iris Apatow, Alistair Abell, Sugar Lyn Beard
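
When ranking by earnings it is worth checking whether the column is numeric, because text is sorted character by character. The sketch below illustrates the difference on three made-up values. The "$1,234" format is an assumption about how the real Boxoffice column is stored.

```r
# Hypothetical box office strings (the real column's exact format is an assumption)
boxoffice <- c("$9,500", "$120,000,000", "$85,000,000")

# Character sort: compares text, so "$9500" ranks first
as_text <- gsub(",", "", boxoffice)
sort(as_text, decreasing = TRUE)[1]

# Numeric sort: strip "$" and "," and convert before ordering
as_number <- as.numeric(gsub("[$,]", "", boxoffice))
boxoffice[order(as_number, decreasing = TRUE)][1]
```

The first result is the smallest amount and the second is the largest, so the two approaches can produce very different top-ten lists.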
print(popular_movies)
##                 Title                                                Genre
## 1        Little Italy                                      Comedy, Romance
## 2  Olympus Has Fallen                                     Action, Thriller
## 3      In the Name of                                      Drama, Thriller
## 4    The Green Hornet                                Action, Comedy, Crime
## 5          Date Night                     Comedy, Crime, Romance, Thriller
## 6    The Color Purple                                                Drama
## 7                Fida              Action, Crime, Drama, Romance, Thriller
## 8   The Swan Princess Animation, Comedy, Family, Fantasy, Musical, Romance
## 9             Yes Man                                      Comedy, Romance
## 10      Sausage Party                Animation, Adventure, Comedy, Fantasy
##                                                                                                                                                                                                                                                                       Tags
## 1                                                                                                                                                                                                            Romantic Movies,Romantic Comedies,Comedies,Romantic Favorites
## 2                                                                                                                                                                                                                                      Action Thrillers,Action & Adventure
## 3                                                                                                                                           Independent Dramas,Dramas,Independent Films,International Dramas,Gay & Lesbian Films,International Movies,Gay & Lesbian Dramas
## 4  Gangster Films,Comic Book and Superhero Films,Action Comedies,Comedies,Martial Arts Films,Action & Adventure,Crime Action & Adventure,Dark Comedies,Gangster Action & Adventure,US Movies,Crime Comedies,Crime Films,Blockbuster Action & Adventure,Action,Crime Action
## 5                                                           Police Action & Adventure,Crime Action & Adventure,Romantic Comedies,Police Movies,Action Comedies,Comedies,Romantic Movies,Action & Adventure,Quirky Romance,Romantic Favorites,Comedy Blockbusters,US Movies
## 6                      Dramas based on a book,Dramas,Social Issue Dramas,Dramas based on contemporary literature,Dramas,Social Issue Dramas,Dramas based on contemporary literature,Dramas based on a book,Tearjerkers,Movies Based on Books,Classic Dramas,Classic Movies
## 7                                                                                                                                                        Bollywood Movies,Crime Movies,Romantic Movies,Thriller Movies,Indian Movies,Crime Thrillers,Hindi-Language Movies
## 8                                                                                                                       Family Sci-Fi & Fantasy,Children & Family Films,Films for ages 8 to 10,Family Animation,Films for ages 5 to 7,Musicals,Kids Music,Music & Musicals
## 9                                                                                                                                                          Romantic Comedies,Comedies,Romantic Films,Films Based on Books,Slapstick Comedies,Romantic Favourites,US Movies
## 10                                                                                                                                                  Dark Comedies,Adult Animation,Comedies,Action & Adventure,Late Night Comedies,Adventures,Comedy Blockbusters,US Movies
##                    Languages Series.or.Movie Hidden.Gem.Score
## 1    English, Italian, Latin           Movie              1.7
## 2            English, Korean           Movie              2.6
## 3            Polish, English           Movie              6.8
## 4          English, Mandarin           Movie              2.3
## 5            English, Hebrew           Movie              3.1
## 6                    English           Movie              4.0
## 7                      Hindi           Movie              6.7
## 8                    English           Movie              2.9
## 9  English, Korean, Estonian           Movie              2.7
## 10                   English           Movie              3.5
##                                                                                                                                                                                                                                                                                                     Country.Availability
## 1                                                                                                                                                                                                                                                                 Czech Republic,Sweden,Argentina,Brazil,Mexico,Colombia
## 2                                                                                                                                                                                                       Japan,Romania,Sweden,United States,South Korea,Hungary,Czech Republic,Belgium,Canada,Greece,Slovakia,Netherlands
## 3                                                                                                                                                                                                                                                                                                                 Poland
## 4                                                                                                                                                                               Turkey,Hong Kong,Japan,Canada,Singapore,Switzerland,United States,Greece,Sweden,Thailand,Malaysia,Netherlands,Italy,Israel,India,Germany
## 5                                                                                                                                                                                                                                                                                                            South Korea
## 6                                                                                                                                                                                                                                                                                                              Australia
## 7  Lithuania,Mexico,India,Czech Republic,Russia,United Kingdom,Germany,United States,Australia,Poland,Hong Kong,Japan,France,Canada,Spain,Singapore,Argentina,Greece,Switzerland,Slovakia,Sweden,Thailand,Belgium,Turkey,Malaysia,Hungary,Brazil,Netherlands,Italy,South Africa,Iceland,Portugal,Israel,Colombia,Romania
## 8                                                                                                                                                                                                                                Iceland,Australia,Hong Kong,Thailand,Singapore,Malaysia,Sweden,France,Spain,Switzerland
## 9                                                                                                                                                                                                                     Switzerland,Canada,Sweden,France,Belgium,Turkey,United Kingdom,Portugal,Germany,Japan,Italy,Russia
## 10                                                                                                                                                                                            Canada,Thailand,Hong Kong,Switzerland,United Kingdom,France,South Korea,Spain,Singapore,Greece,Malaysia,Turkey,Netherlands
##     Runtime                    Director
## 1  1-2 hour               Donald Petrie
## 2  1-2 hour               Antoine Fuqua
## 3  1-2 hour        Malgorzata Szumowska
## 4  1-2 hour               Michel Gondry
## 5  1-2 hour                  Shawn Levy
## 6   > 2 hrs            Steven Spielberg
## 7  1-2 hour                   Ken Ghosh
## 8  1-2 hour                Richard Rich
## 9  1-2 hour                 Peyton Reed
## 10 1-2 hour Conrad Vernon, Greg Tiernan
##                                                               Writer
## 1                         Brent Cote, Steve Galluccio, Vinay Virmani
## 2                            Katrin Benedikt, Creighton Rothenberger
## 3            Szczepan Twardoch, Michal Englert, Malgorzata Szumowska
## 4                       Seth Rogen, George W. Trendle, Evan Goldberg
## 5                                                      Josh Klausner
## 6                                         Alice Walker, Menno Meyjes
## 7                        Kiran Kotrial, Lalit Mahajan, Sunny Mahajan
## 8                                         Richard Rich, Brian Nissen
## 9         Jarrad Paul, Nicholas Stoller, Andrew Mogel, Danny Wallace
## 10 Seth Rogen, Jonah Hill, Evan Goldberg, Ariel Shaffir, Kyle Hunter
##                                                                   Actors
## 1          Hayden Christensen, Danny Aiello, Andrea Martin, Emma Roberts
## 2         Dylan McDermott, Gerard Butler, Aaron Eckhart, Finley Jacobsen
## 3  Mateusz Kosciukiewicz, Andrzej Chyra, Maja Ostaszewska, Lukasz Simlat
## 4                      Seth Rogen, Jay Chou, Tom Wilkinson, Cameron Diaz
## 5                Steve Carell, Taraji P. Henson, Tina Fey, Mark Wahlberg
## 6           Whoopi Goldberg, Danny Glover, Margaret Avery, Oprah Winfrey
## 7                Fardeen Khan, Kareena Kapoor, Kim Sharma, Shahid Kapoor
## 8         Liz Callaway, Michelle Nicastro, Jack Palance, Howard McGillin
## 9      John Michael Higgins, Jim Carrey, Bradley Cooper, Zooey Deschanel
## 10            Michael Cera, Iris Apatow, Alistair Abell, Sugar Lyn Beard
##    View.Rating IMDb.Score Rotten.Tomatoes.Score Metacritic.Score
## 1            R        5.7                    14               28
## 2            R        6.5                    49               41
## 3    Not Rated        6.6                    77               52
## 4        PG-13        5.8                    44               39
## 5        PG-13        6.3                    66               56
## 6        PG-13        7.8                    81               78
## 7                     5.5                    60               NA
## 8            G        6.5                    50               NA
## 9        PG-13        6.8                    46               46
## 10           R        6.1                    82               66
##    Awards.Received Awards.Nominated.For  Boxoffice Release.Date
## 1               NA                   NA   $990230     9/21/2018
## 2                1                    5 $98925640     3/22/2013
## 3                7                    8     $9883     9/20/2013
## 4                4                    7 $98780042     1/14/2011
## 5                4                    8 $98711404      4/9/2010
## 6               14                   23 $98467863      2/7/1986
## 7               NA                   NA    $98297     8/20/2004
## 8                1                    8  $9771658    11/18/1994
## 9                3                    9 $97690976    12/19/2008
## 10               1                   24 $97685686     8/12/2016
##    Netflix.Release.Date
## 1             6/15/2019
## 2              8/5/2015
## 3             4/14/2015
## 4             4/14/2015
## 5              4/1/2016
## 6              6/1/2016
## 7            10/14/2020
## 8             4/14/2015
## 9             4/14/2015
## 10            2/23/2017
##                                                 Production.House
## 1                Lionsgate, Les Films Séville, Entertainment One
## 2                                               Millennium Films
## 3                                                               
## 4                                                  Original Film
## 5                                          21 Laps Entertainment
## 6    Warner Brothers, Guber-Peters Company, Amblin Entertainment
## 7                                                               
## 8                                         Rich Animation Studios
## 9                                   Heyday Films, Zanuck Company
## 10 Sony Music, Point Grey, Annapurna Pictures, Columbia Pictures
##                              Netflix.Link                            IMDb.Link
## 1  https://www.netflix.com/watch/81020104 https://www.imdb.com/title/tt6957966
## 2  https://www.netflix.com/watch/70259801 https://www.imdb.com/title/tt2302755
## 3  https://www.netflix.com/watch/70270778 https://www.imdb.com/title/tt2650642
## 4  https://www.netflix.com/watch/70117699 https://www.imdb.com/title/tt0990407
## 5  https://www.netflix.com/watch/70121501 https://www.imdb.com/title/tt1279935
## 6  https://www.netflix.com/watch/60026621 https://www.imdb.com/title/tt0088939
## 7  https://www.netflix.com/watch/70018449 https://www.imdb.com/title/tt0422236
## 8  https://www.netflix.com/watch/60034386 https://www.imdb.com/title/tt0111333
## 9  https://www.netflix.com/watch/70100379 https://www.imdb.com/title/tt1068680
## 10 https://www.netflix.com/watch/80098100 https://www.imdb.com/title/tt1700841
##                                                                                                                                                Summary
## 1                                          Two young lovers from warring pizza places try to hide their burgeoning affair from their feuding families.
## 2           A disgraced Secret Service agent must come to the rescue when Korean terrorists descend on the White House and take the president hostage.
## 3       Running toward God but away from his sexuality, Adam became a priest at age 21. Now the head of a rural parish, hes still tormented by desire.
## 4      A hard-partying heir dons a disguise to fight crime after hours. But with no talents or skills, he relies on his friend, a martial-arts genius.
## 5                     Who knew simple dinner reservations under a different name could turn one New Jersey couples date night so terribly upside-down?
## 6  A Southern womans correspondence with her sister in Africa and friendship with a singer help her escape an abusive husband and a hardscrabble life.
## 7            An all-around nice guy finds himself in a dangerous situation after he makes the ultimate sacrifice for the woman he loves in this drama.
## 8        Based on the tale of Swan Lake, this animated feature tells the story of Odette, a sweet girl whos turned into a graceful swan by a sorcerer.
## 9  After a bitter divorce, a bank drone falls under the sway of a self-help guru who urges him to say yes to everything that comes his way for a year.
## 10    After making a gruesome discovery about life beyond the supermarket, an affable sausage strives to save his fellow foods in this raunchy comedy.
##    IMDb.Votes
## 1       11636
## 2      257808
## 3        2819
## 4      155265
## 5      152742
## 6       78284
## 7        2423
## 8       22817
## 9      336720
## 10     176325
##                                                                                                                                                                                   Image
## 1   http://occ-0-1490-1489.1.nflxso.net/dnm/api/v6/evlCitJPPCVCry0BZlEFb5-QjKc/AAAABSB31_nbMkvKQkVYRpVzUhM4fSWTZuctZouH26CrnH6OKWiZzLjrYbAm1_VcmjfUZ44tQQ-fpzCdJqdYTf1wI2n97A.jpg?r=762
## 2  https://occ-0-2773-2774.1.nflxso.net/dnm/api/v6/evlCitJPPCVCry0BZlEFb5-QjKc/AAAABa8W2Oz8uTwZWGEGSEXhkc0v9JWRid9j3aqzjl7PZ_eO-CtyEqmxej5kax65kROSQCNaOjAoEEY_ktTKyBOh75a0kg.jpg?r=a18
## 3  https://occ-0-2506-1432.1.nflxso.net/dnm/api/v6/evlCitJPPCVCry0BZlEFb5-QjKc/AAAABdEzJ11sJJqr0JCfp8lCaPluekSNN2SerMn90_azUFuUlqvQ6IN9z5O9xKphqwlqVmTGRgolNb2AyAprGqG2pObIXw.jpg?r=7ad
## 4    https://occ-0-2851-38.1.nflxso.net/dnm/api/v6/evlCitJPPCVCry0BZlEFb5-QjKc/AAAABamCqGYXomwmrDUFlqkmo8uroCE03E9ssA25dFPkTmME4o6pepKD1H-Aj1GGuEu4i_4f3F0anQS5jUA1gOUbp3PhyQ.jpg?r=e4f
## 5     https://occ-0-64-325.1.nflxso.net/dnm/api/v6/evlCitJPPCVCry0BZlEFb5-QjKc/AAAABWJ3QnFafBG6vrfXpHQNxyqHhRWsDhaulMuoXYnaUZ6KfihhUDS2gqt7xiK1S9D5ZuaUlLdxaZ4mlb_Ddx-fAmCmTg.jpg?r=d46
## 6  https://occ-0-3466-2774.1.nflxso.net/dnm/api/v6/evlCitJPPCVCry0BZlEFb5-QjKc/AAAABQKgDRrfutZAuuotjed8qWZW8fViSskKx-fkNw08GcNXaz11xFxeSsv-yOcM7t3-JJKlDFLmWyyrWiB2384rs-lrvg.jpg?r=a6f
## 7  https://occ-0-2851-1432.1.nflxso.net/dnm/api/v6/evlCitJPPCVCry0BZlEFb5-QjKc/AAAABfsTONpq0wDxtNWaEQC_Z0rp_aYrvgpYBhxNsNknIwUm24syxdKj0X_F6treVzZLs1AvDm7cLyCg1WeJviTCaRxVmg.jpg?r=3ba
## 8    https://occ-0-2851-38.1.nflxso.net/dnm/api/v6/evlCitJPPCVCry0BZlEFb5-QjKc/AAAABSMBlN-XZaVXSEHRvT-Pra2tIiPr6tFUmL7tnCEzHivLXTYL4RQikENQaqEl37ynQDLjbEPjqoYK9FJRVfmQqfybCg.jpg?r=5a4
## 9    https://occ-0-2851-38.1.nflxso.net/dnm/api/v6/evlCitJPPCVCry0BZlEFb5-QjKc/AAAABRTzEOQA8YocVzGOzGquWjeYdTbXrXhQhcqHR0cmiiTGEDc0h7Ko3xi1ZRb1omJuHSgNQu2lIfE_rNy6DA3GAHDarQ.jpg?r=826
## 10    https://occ-0-138-38.1.nflxso.net/dnm/api/v6/evlCitJPPCVCry0BZlEFb5-QjKc/AAAABS_CbFdl2xQ5xUYUNCJiZrLY_kG9lc5ZVmlB31kJLI2nIbQ65w0gqINSPkQa_Od6JQWjv0GdSDaAgk_oJG2gjevcsA.jpg?r=ad1
##                                                                                                                                                             Poster
## 1                                                               https://m.media-amazon.com/images/M/MV5BMjM3MDc2NDc2N15BMl5BanBnXkFtZTgwNzg2NjExNjM@._V1_SX300.jpg
## 2                  https://images-na.ssl-images-amazon.com/images/M/MV5BNTU0NmY4MWYtNzRlMS00MDkxLWJkODYtOTM3NGI2ZDc1NTJhXkEyXkFqcGdeQXVyNjU0OTQ0OTY@._V1_SX300.jpg
## 3                                                                 http://ia.media-imdb.com/images/M/MV5BMjI5NDQxMjQ5Nl5BMl5BanBnXkFtZTgwNjU0MzAwMDE@._V1_SX300.jpg
## 4                                                  https://images-na.ssl-images-amazon.com/images/M/MV5BMTcwOTMwMDYyMl5BMl5BanBnXkFtZTcwMzAxMjMyNA@@._V1_SX300.jpg
## 5                                                  https://images-na.ssl-images-amazon.com/images/M/MV5BODgwMjM2ODE4M15BMl5BanBnXkFtZTcwMTU2MDcyMw@@._V1_SX300.jpg
## 6                                                                 http://ia.media-imdb.com/images/M/MV5BMTUzOTkxNjY4M15BMl5BanBnXkFtZTgwNjE5MDgxMTE@._V1_SX300.jpg
## 7                               https://m.media-amazon.com/images/M/MV5BMWIwOWZkZjYtMmZkMy00OTk1LTg4ZWMtNmE3YTJiYzA2NjU5XkEyXkFqcGdeQXVyODE5NzE3OTE@._V1_SX300.jpg
## 8  https://images-na.ssl-images-amazon.com/images/M/MV5BNDM2OGM1MjAtYjA3Zi00NzEzLWFiOWMtYjg4MDdiMzYzMWVkL2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyNDA5Mjg5MjA@._V1_SX300.jpg
## 9                  https://images-na.ssl-images-amazon.com/images/M/MV5BMzBmZTMzYmItNzhhMC00M2FkLWIxMGEtMjIxMjAwNmQ2ZmM4XkEyXkFqcGdeQXVyNTIzOTk5ODM@._V1_SX300.jpg
## 10                                                 https://images-na.ssl-images-amazon.com/images/M/MV5BMjkxOTk1MzY4MF5BMl5BanBnXkFtZTgwODQzOTU5ODE@._V1_SX300.jpg
##                                   TMDb.Trailer Trailer.Site
## 1  https://www.youtube.com/watch?v=ZH6kK9oiy4E      YouTube
## 2  https://www.youtube.com/watch?v=ar-IaAx7s8k      YouTube
## 3  https://www.youtube.com/watch?v=hmgD42ZHiP8      YouTube
## 4  https://www.youtube.com/watch?v=PMA-taGtfXs      YouTube
## 5  https://www.youtube.com/watch?v=aspBKFz2dBI      YouTube
## 6  https://www.youtube.com/watch?v=d83NnlL83mc      YouTube
## 7  https://www.youtube.com/watch?v=AVtvjfoXNXc      YouTube
## 8  https://www.youtube.com/watch?v=5wfMVdyDa_g      YouTube
## 9  https://www.youtube.com/watch?v=rvpsiIe2vBE      YouTube
## 10 https://www.youtube.com/watch?v=c7fP9q_LyDc      YouTube
print(popular_movie_actors)
##                 Title
## 1        Little Italy
## 2  Olympus Has Fallen
## 3      In the Name of
## 4    The Green Hornet
## 5          Date Night
## 6    The Color Purple
## 7                Fida
## 8   The Swan Princess
## 9             Yes Man
## 10      Sausage Party
##                                                                   Actors
## 1          Hayden Christensen, Danny Aiello, Andrea Martin, Emma Roberts
## 2         Dylan McDermott, Gerard Butler, Aaron Eckhart, Finley Jacobsen
## 3  Mateusz Kosciukiewicz, Andrzej Chyra, Maja Ostaszewska, Lukasz Simlat
## 4                      Seth Rogen, Jay Chou, Tom Wilkinson, Cameron Diaz
## 5                Steve Carell, Taraji P. Henson, Tina Fey, Mark Wahlberg
## 6           Whoopi Goldberg, Danny Glover, Margaret Avery, Oprah Winfrey
## 7                Fardeen Khan, Kareena Kapoor, Kim Sharma, Shahid Kapoor
## 8         Liz Callaway, Michelle Nicastro, Jack Palance, Howard McGillin
## 9      John Michael Higgins, Jim Carrey, Bradley Cooper, Zooey Deschanel
## 10            Michael Cera, Iris Apatow, Alistair Abell, Sugar Lyn Beard

In the first part of the code, gsub() removes the commas from the Boxoffice column so that the values can be treated as numbers. The rows are then sorted in descending order with arrange() from the dplyr package, and head() keeps the top ten. The aim is to rank the movies by revenue and identify the most popular ones. Note, however, that the printed Boxoffice values still begin with “$” and are ordered as text ($990230 appears before $98925640), so the column must be stripped of the “$” and converted to numeric before the ranking reflects actual revenue.
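The printed Boxoffice values above still carry a “$” prefix, which suggests the column was sorted as character data. A minimal sketch of a numeric ranking, assuming the column is a character vector such as "$98925640"; the toy data frame and the helper column Boxoffice_num are illustrative and not part of the original code:

```r
library(dplyr)

# Toy data standing in for the Netflix data set (values copied from the output above).
movies <- data.frame(
  Title = c("Little Italy", "Olympus Has Fallen", "In the Name of"),
  Boxoffice = c("$990230", "$98925640", "$9883"),
  stringsAsFactors = FALSE
)

top_movies <- movies %>%
  # Drop every character that is not a digit or a decimal point, then convert.
  mutate(Boxoffice_num = as.numeric(gsub("[^0-9.]", "", Boxoffice))) %>%
  # Sort on the numeric column, highest revenue first.
  arrange(desc(Boxoffice_num)) %>%
  head(10)

print(top_movies)
```

Sorting on the numeric helper column puts Olympus Has Fallen first, whereas a text sort places Little Italy ahead of it.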

The second part subsets the popular_movies data frame to the Title and Actors columns only, keeping the information most relevant for a summary presentation.

The last part makes the data more accessible and visually appealing: we applied Bootstrap styles such as “striped” and “hover” to improve readability and interactivity.
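As a rough illustration of the styling step, the sketch below applies the “striped” and “hover” options with kable_styling() from kableExtra. The two-row data frame is copied from the printed output, the caption is made up, and format = "html" is set explicitly only so that the snippet also runs outside a knitted document:

```r
library(kableExtra)

# Two rows copied from the printed popular_movie_actors output.
example_actors <- data.frame(
  Title = c("Little Italy", "Olympus Has Fallen"),
  Actors = c(
    "Hayden Christensen, Danny Aiello, Andrea Martin, Emma Roberts",
    "Dylan McDermott, Gerard Butler, Aaron Eckhart, Finley Jacobsen"
  ),
  stringsAsFactors = FALSE
)

styled_table <- example_actors %>%
  # Build an HTML table (the caption text is illustrative).
  kbl(format = "html", caption = "Example: titles and actors") %>%
  # "striped" shades alternate rows; "hover" highlights the row under the cursor.
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

styled_table
```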

The code is structured to identify and present the ten highest-grossing movies. Using the Boxoffice figures targets a direct measure of movie popularity. The final presentation, which shows only the movie titles and actors in a styled HTML table, gives a summary suitable for our reports.

Extra challenge 2.: For movies and series, create rating charts from the various portals (Hidden Gem, IMDb, Rotten Tomatoes, Metacritic). Hint: it’s a good idea to reshape the data to long format.

This R code downloads the Netflix dataset from an online source, reads it, and then visualizes how the ratings from each source are distributed for movies and for series.

library(tidyr)
library(ggplot2)

# Download the Netflix data set and read it into a data frame.
download.file("https://raw.githubusercontent.com/kflisikowski/ds/master/netflix-dataset.csv?raw=true", destfile = "dane.csv", mode = "wb")
mydata <- read.csv(file = "dane.csv", encoding = "UTF-8", header = TRUE, sep = ",")

# Reshape the four score columns to long format.
# values_drop_na = TRUE drops only missing ratings; a bare drop_na() would also
# discard rows with NA in unrelated columns such as Awards.Received.
ratings_long <- mydata %>%
  pivot_longer(
    cols = c(Hidden.Gem.Score, IMDb.Score, Rotten.Tomatoes.Score, Metacritic.Score),
    names_to = "Rating_Source",
    values_to = "Rating",
    values_drop_na = TRUE
  )

# One boxplot panel per rating source, each with its own y-axis.
ggplot(ratings_long, aes(x = Series.or.Movie, y = Rating, fill = Rating_Source)) +
  geom_boxplot() +
  labs(title = "Rating Distribution on Different Platforms by Genre", x = "Genre (Movie or Series)", y = "Rating") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  facet_wrap(~Rating_Source, scales = "free_y", ncol = 2)

Using the tidyr and ggplot2 libraries, the code gathers the rating scores (Hidden.Gem.Score, IMDb.Score, Rotten.Tomatoes.Score, Metacritic.Score) into a single column, discards rows with missing values, and draws a separate boxplot panel for each rating source. Each panel has its own y-axis (scales = "free_y") because the sources use different scales: in the output above, IMDb scores such as 5.7 sit alongside Rotten Tomatoes scores such as 14. Within each panel, the boxplots are grouped by content type (movie or series). This makes it possible to compare visually how ratings differ between movies and series on each platform.
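To make the reshaping step concrete, here is a small sketch of pivot_longer() on two rows taken from the printed output (Little Italy and Fida, using only the IMDb and Metacritic columns). The object names scores and long_scores are illustrative:

```r
library(tidyr)

# Two rows copied from the printed output; Fida has no Metacritic score.
scores <- data.frame(
  Title = c("Little Italy", "Fida"),
  IMDb.Score = c(5.7, 5.5),
  Metacritic.Score = c(28, NA),
  stringsAsFactors = FALSE
)

# Wide to long: one row per (title, rating source) pair, missing ratings dropped.
long_scores <- pivot_longer(
  scores,
  cols = c(IMDb.Score, Metacritic.Score),
  names_to = "Rating_Source",
  values_to = "Rating",
  values_drop_na = TRUE
)

print(long_scores)
```

The two wide rows become three long rows, because the missing Metacritic score for Fida is dropped. This one-row-per-rating layout is what lets ggplot2 map Rating_Source to facets.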

Extra challenge 3.: Which film studios produce the most and how has this changed over the years?

