Probability and Statistics I. Hands-on 1

Multivariate Analysis
Univariate Analysis

Before we start, we make sure that the working directory is the correct one, as this is where we have to place our data file.

getwd()

## [1] "/Users/raul/ownCloud/Trabajo/Docencia/2013 PyE1/Practicas/Practica 1/Notebook"

We load the data file and change the name to a more convenient one:

load("pelis1997.RData")

# We can also read an online file
#load(url("https://github.com/rgcmme/PyE-Practicas/raw/master/Practica1/data/pelis1997.RData"))

films = pelis1997

We check both the dimensions of the data and its structure:

dim(films)

## [1] 100   4

names(films)

## [1] "title"  "length" "budget" "rating"

We confirm that the data contain information on 100 films (100 rows) released in 1997. There are four variables (columns): title, length, budget and rating.

We visualise the first rows of the imported data:

head(films)

To access the variables of our dataset (a data frame) we use matrix notation. For example, to access all movie titles we can use films[,1] (first column, all rows), or access the variable directly by name with the $ symbol: films$title.

films[,1]

##   [1] "'Til There Was You"                         
##   [2] "100 Proof"                                  
##   [3] "Absolute Power"                             
##   [4] "Acts of Betrayal"                           
##   [5] "Affliction"                                 
##   [6] "Air Bud"                                    
##   [7] "Air Force One"                              
##   [8] "Alan Smithee Film: Burn Hollywood Burn, An" 
##   [9] "Alien Escape"                               
##  [10] "Alien: Resurrection"                        
##  [11] "American Werewolf in Paris, An"             
##  [12] "Amistad"                                    
##  [13] "Anaconda"                                   
##  [14] "Anastasia"                                  
##  [15] "Apostle, The"                               
##  [16] "As Good as It Gets"                         
##  [17] "Austin Powers: International Man of Mystery"
##  [18] "Bacheha-Ye aseman"                          
##  [19] "Batman & Robin"                             
##  [20] "Bean"                                       
##  [21] "Better Place, A"                            
##  [22] "Bloodletting"                               
##  [23] "Boogie Nights"                              
##  [24] "Borrowers, The"                             
##  [25] "Bossu, Le"                                  
##  [26] "Breakdown"                                  
##  [27] "Budbringeren"                               
##  [28] "Cats Don't Dance"                           
##  [29] "Changing Habits"                            
##  [30] "Chasing Amy"                                
##  [31] "Childhood's End"                            
##  [32] "City of Industry"                           
##  [33] "Commandments"                               
##  [34] "Con Air"                                    
##  [35] "Conspiracy Theory"                          
##  [36] "Contact"                                    
##  [37] "Cop Land"                                   
##  [38] "Critical Care"                              
##  [39] "Dante's Peak"                               
##  [40] "David Searching"                            
##  [41] "Deconstructing Harry"                       
##  [42] "Detail, The"                                
##  [43] "Devil's Advocate, The"                      
##  [44] "Devil's Own, The"                           
##  [45] "Dilemma"                                    
##  [46] "Dong er shi er tiao"                        
##  [47] "Donnie Brasco"                              
##  [48] "Double Tap"                                 
##  [49] "Dream with the Fishes"                      
##  [50] "Drive"                                      
##  [51] "Ed Mort"                                    
##  [52] "End of Violence, The"                       
##  [53] "Err On the Side of Caution"                 
##  [54] "Eve's Bayou"                                
##  [55] "Event Horizon"                              
##  [56] "Face/Off"                                   
##  [57] "Fall"                                       
##  [58] "Falling Words"                              
##  [59] "Fathers' Day"                               
##  [60] "Fifth Element, The"                         
##  [61] "First Love, Last Rites"                     
##  [62] "Full Monty, The"                            
##  [63] "G.I. Jane"                                  
##  [64] "Game, The"                                  
##  [65] "Gattaca"                                    
##  [66] "George of the Jungle"                       
##  [67] "Gone Fishin'"                               
##  [68] "Good Book, The"                             
##  [69] "Good Burger"                                
##  [70] "Good Will Hunting"                          
##  [71] "Gravesend"                                  
##  [72] "Grosse Pointe Blank"                        
##  [73] "Guerra de Canudos"                          
##  [74] "Habit"                                      
##  [75] "Harpist, The"                               
##  [76] "Hav Plenty"                                 
##  [77] "Hercules"                                   
##  [78] "House of Yes, The"                          
##  [79] "Hurricane"                                  
##  [80] "I Know What You Did Last Summer"            
##  [81] "I Married a Strange Person!"                
##  [82] "Ice Storm, The"                             
##  [83] "In & Out"                                   
##  [84] "In the Company of Men"                      
##  [85] "Jackal, The"                                
##  [86] "Jackie Brown"                               
##  [87] "Karakter"                                   
##  [88] "Kicked in the Head"                         
##  [89] "Killers"                                    
##  [90] "Kiss or Kill"                               
##  [91] "Kiss the Girls"                             
##  [92] "Kundun"                                     
##  [93] "L.A. Confidential"                          
##  [94] "Last Resort, The"                           
##  [95] "Last Time I Committed Suicide, The"         
##  [96] "Lawn Dogs"                                  
##  [97] "Leather Jacket Love Story"                  
##  [98] "Legend of the Mummy"                        
##  [99] "Liar Liar"                                  
## [100] "Life Less Ordinary, A"
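As a small illustration of the indexing described above, the following sketch uses a toy data frame with made-up values (the object `toy` and its contents are hypothetical, not part of the course data) to show that positional and name-based access return the same column:

```r
# Toy data frame with hypothetical values (not the course data)
toy <- data.frame(
  title  = c("Film A", "Film B", "Film C"),
  length = c(95, 120, 88),
  rating = c(6.1, 7.4, 5.0)
)

by_position <- toy[, 1]        # first column, all rows
by_dollar   <- toy$title       # column selected by name with $
by_name     <- toy[, "title"]  # column selected by name inside [ , ]

identical(by_position, by_dollar)  # the two forms return the same vector
identical(by_dollar, by_name)

toy[2, ]           # second row, all columns
toy[1:2, c(2, 3)]  # rows 1 and 2, columns 2 and 3
```

Selecting by name is usually more robust than selecting by position, because it keeps working if the column order changes.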

We are also going to load the data analysis libraries. The ones we are going to use are the following:

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: carData
## 
## 
## Attaching package: 'car'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## 
## The following object is masked from 'package:purrr':
## 
##     some
## 
## 
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
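The loading commands themselves are not shown in this document. Judging only by the startup messages above, the packages involved appear to be tidyverse, car and GGally; the sketch below is written under that assumption and loads each one only if it is installed. The functions skewness() and kurtosis() used later are not provided by any of these three packages, so they presumably come from an additional package that does not print a startup message.

```r
# Packages inferred from the startup messages above (an assumption:
# the original loading chunk is not shown in this document)
pkgs <- c("tidyverse", "car", "GGally")

# Check which of them are installed, without attaching anything yet
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Attach the installed ones, hiding their startup messages
for (p in pkgs[installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}

# Report any package that still needs install.packages()
if (any(!installed)) {
  message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
}
```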

Multivariate Analysis

We begin by presenting the information on the first 30 films. Let’s do it in two ways; in the hands-on, you can use just one or both.

First, we create a new data.frame that contains only the data of the first 30 movies. We call it films_30.

films_30 <- films[1:30,]

Multivariate bar plot

Since the three variables are measured on very different scales, we are going to standardize their values, that is, transform them using the following formula:

\[ \text{new value}=\frac{\text{value}-\text{mean of the variable}}{\text{standard deviation of the variable}} \]

With this transformation, the new data will have a mean of 0 and a standard deviation of 1. The range of the transformed data will depend on the standard deviation of each variable and the outliers.
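To make the formula concrete, here is a minimal sketch with made-up values (the vector `x` is hypothetical) that standardizes by hand and compares the result with base R's scale(), which applies the same centring and scaling by default:

```r
# Hypothetical film lengths in minutes (not the course data)
x <- c(90, 100, 110, 120, 130)

# Standardize by hand: subtract the mean, divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix, so we convert it to a plain vector
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)  # both approaches agree
mean(z_manual)                # 0, up to rounding error
sd(z_manual)                  # 1
```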

First, we create three new standardized variables (length_z, budget_z, rating_z) and an index variable (film), which we add to the films_30 data frame.

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

# scale() returns a one-column matrix; as.numeric() stores it as a plain vector
films_30$length_z <- as.numeric(scale(films_30$length))
films_30$budget_z <- as.numeric(scale(films_30$budget))
films_30$rating_z <- as.numeric(scale(films_30$rating))
films_30$film <- 1:30 # variable that indicates the film

We define a new data.frame with the transformed (standardized) variables of the first 30 movies. We call it films_wide because it is in wide format.

# Wide format
films_wide <- data.frame(film=films_30$film, length=films_30$length_z, budget=films_30$budget_z, rating=films_30$rating_z)

We change the data format (from wide to long, as we need it to draw the bar chart) and draw a bar chart for each movie:

# Long format (the column of variable names is called "name" by default)
films_long <- pivot_longer(films_wide, cols=2:4, values_to="value")

# Chart: one panel per film, one horizontal bar per standardized variable
films_long %>%
  ggplot(aes(x=name, y=value, fill=name)) +
  geom_bar(stat="identity", width=0.4) +
  scale_y_continuous(breaks=c(-2,0,2,3)) +
  scale_fill_manual(values=c("wheat4",
                             "#69b3a2",
                             "darkblue")) +
  geom_hline(yintercept=0, color="grey", linewidth=0.5) +
  coord_flip() +
  facet_wrap(~film, ncol=5)

Star plot

For the first 30 films (rows 1:30), we plot the three quantitative variables (columns 2, 3 and 4) as a star plot:

stars(films[1:30,c(2,3,4)],key.loc=c(3.5,15.5), draw.segments=TRUE, full=FALSE)

Comment for the multivariate analysis

In the graphs we quickly find examples of films with bad reviews and a big budget, in particular row 19, which has the largest budget of the first 30. To find out its title, we type:

films[19,]

There are others with good reviews (above average), low budgets, and varying durations (15, 17, 18, 27, and 30). Number 28 stands out for its good reviews and short duration. Number 8 stands out for being very poorly reviewed, with duration and budget below average.

Specifically, analysing ONLY the first 30:

Among the movies with the highest budgets, the notable ones are numbers 3 (“Absolute Power”), 7 (“Air Force One”), 10 (“Alien: Resurrection”), and 14 (“Anastasia”), which received good reviews; numbers 12 (“Amistad”) and 16 (“As Good as It Gets”) also had large budgets and good reviews; number 19 spent the most money and is one of the longest, but was very poorly reviewed. We observe that the movies with the highest budgets are American productions.
Movies 2, 8, 13, and 19 stand out for their poor reviews. Additionally, 13 and 19 had large budgets. Movies 2 and 8 have very poor reviews, with durations and budgets below average.

Overall, we observe that:

There are movies with all variables below their averages, therefore, with poor reviews: 2, 4, 6, 8, and 29.
Movies 3, 7, 10, 12, 16, and 25 have all their characteristics above average. Notably, 25 has good reviews and is long despite its budget being close to the average.
Movies 5, 15, 23, and 30 have reviews and durations above average and budgets below average. We see that a high budget is not necessary to make movies that are good (from the critics’ point of view) and, arguably, entertaining given their length.

Scatterplot matrix

We plot the three quantitative variables in a scatterplot matrix to see if there are linear relationships between them:

#spm(films[,2:4], diagonal=list(method ="histogram", breaks="FD"), smooth=FALSE, col=3)
ggpairs(films[,2:4], diag = list(continuous = "barDiag"))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

None of the patterns we observe clearly resembles a straight line; they are rather amorphous, indicating that the pairwise linear relationships between these variables are weak. To quantify these linear relationships, we calculate Pearson’s correlation coefficients (also shown in the previous graph):

round(cor(films[,2:4]), 4)

##        length budget rating
## length 1.0000 0.3835 0.3641
## budget 0.3835 1.0000 0.0849
## rating 0.3641 0.0849 1.0000

As we can see, the pair of variables with the strongest linear relationship is length and budget, and even this relationship is weak. The association is positive ($0.38$): films with higher budgets tend to be longer, but the linear dependence is weak. Length and rating show a similar weak positive correlation ($0.36$). The variables budget and rating do not seem to be linearly related, since their correlation coefficient is almost zero ($0.085$). We cannot say that the more money a film has, the better it will be rated.
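For reference, Pearson's coefficient is the sum of the products of the deviations of both variables from their means, divided by the square root of the product of their sums of squared deviations. The sketch below checks this against cor() on a small made-up sample (the vectors `x` and `y` are hypothetical):

```r
# Hypothetical paired observations (not the course data)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson's r computed from its definition
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Built-in equivalent
r_builtin <- cor(x, y)

all.equal(r_manual, r_builtin)  # both give 0.8 for this sample

# Squaring a coefficient gives the share of variance explained by a
# straight-line fit; for the length-budget value reported above:
0.3835^2  # about 0.147, i.e. roughly 15%
```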

Optional: if you want to draw a nicer picture representing the correlation matrix, install and load the corrplot library:

library("corrplot")

## corrplot 0.95 loaded

corrplot(cor(films[,2:4]), method="number")

Univariate Analysis

We choose only ONE variable for this analysis, rating, and display its first values.

var = films$rating

head(var)

## [1] 4.8 3.3 6.4 4.6 6.9 4.7

Frequency table

To construct a frequency table we have to process the variable a bit. First, we find out its range in order to group the values into intervals.

summary(var)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.400   5.300   6.200   6.019   7.100   9.200

The minimum is $2.4$ and the maximum is $9.2$. To divide the values into intervals we use the cut function. By default, we just enter the number of intervals we want and R calculates them. If we put $5$ intervals, the limits that R chooses don’t look very natural. We see this with the table function.

cut1=cut(var,breaks=5)
table(cut1)

## cut1
## (2.39,3.76] (3.76,5.12] (5.12,6.48] (6.48,7.84] (7.84,9.21] 
##           7          17          35          39           2

Let’s tell R that we want the following intervals, which could be equivalent to Bad, Regular, Good and Very Good movies according to the review. It doesn’t matter that the end of the last interval is open since we know we don’t have any $10$.

\[ [2,4), [4,6), [6,8), [8,10)\]

brks=c(2,4,6,8,10)  # named brks rather than c, to avoid confusion with the function c()
cut2=cut(var,breaks=brks, right=FALSE)
table(cut2)

## cut2
##  [2,4)  [4,6)  [6,8) [8,10) 
##      9     35     54      2
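The effect of right=FALSE is easiest to see with values that fall exactly on a class limit. The following sketch uses made-up scores (the vector `scores` is hypothetical) and also shows the optional labels argument of cut(), which could be used to attach the category names mentioned above:

```r
# Hypothetical ratings, several of them exactly on a class limit
scores <- c(2.4, 4.0, 5.9, 6.0, 7.9, 8.0, 9.2)
brks <- c(2, 4, 6, 8, 10)

# Left-closed intervals [a,b): a score of 4.0 falls in [4,6)
left_closed <- cut(scores, breaks = brks, right = FALSE)
table(left_closed)

# Default right-closed intervals (a,b]: a score of 4.0 falls in (2,4]
right_closed <- cut(scores, breaks = brks)
table(right_closed)

# Same left-closed classes, with descriptive labels instead of interval names
labelled <- cut(scores, breaks = brks, right = FALSE,
                labels = c("Bad", "Regular", "Good", "Very Good"))
table(labelled)
```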

Better this way. We construct the frequency table:

myTable = data.frame(table(cut2))
colnames(myTable) = c("Values","n_i")
myTable = mutate(myTable,N_i=cumsum(myTable$n_i),f_i=prop.table(myTable$n_i), F_i = cumsum(prop.table(myTable$n_i)))
myTable

The table has no Total row; it would have to be added by hand, which we do not do here. The interval with the most films is $[6,8)$ (Good), which is therefore the modal interval. The absolute frequencies increase up to the third interval and drop sharply in the fourth: only $2$ films qualify as Very Good. Both the median and $Q_3$ lie in the interval $[6,8)$. In fact, $98\%$ of the films have a score lower than $8$.
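If a Total row were wanted, one possible approach is to bind an extra row to the table. The sketch below rebuilds the table from the four absolute frequencies reported above (9, 35, 54, 2) so that it runs on its own; the object names `freq`, `total` and `freq_total` are illustrative only, and the cumulative columns are left as NA in the Total row because a total has no meaning there.

```r
# Absolute frequencies taken from the table(cut2) output above
n_i <- c(9, 35, 54, 2)

freq <- data.frame(
  Values = c("[2,4)", "[4,6)", "[6,8)", "[8,10)"),
  n_i    = n_i,
  N_i    = cumsum(n_i),            # cumulative absolute frequency
  f_i    = n_i / sum(n_i),         # relative frequency
  F_i    = cumsum(n_i) / sum(n_i)  # cumulative relative frequency
)

# Total row: sums for n_i and f_i, NA for the cumulative columns
total <- data.frame(Values = "Total", n_i = sum(n_i), N_i = NA,
                    f_i = 1, F_i = NA)

freq_total <- rbind(freq, total)
freq_total
```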

Statistical summary

summary(var)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.400   5.300   6.200   6.019   7.100   9.200

# Standard deviation
sd(var)

## [1] 1.327007

# Coefficient of variation
sd(var)/mean(var)

## [1] 0.2204697

# Skewness
skewness(var)

## [1] -0.4826905

# Kurtosis
kurtosis(var)

## [1] -0.1706457

The standard deviation is $1.3$, so the observations typically lie about that far from the mean ($6.019$). $75\%$ of the data are below $7.1$ ($Q_3$) and $25\%$ are below $5.3$ ($Q_1$), so not many movies fail in this data set. There is a very slight asymmetry to the left (negative skewness): the mean and median are practically equal, with the mean slightly lower. The kurtosis is negative, indicating a slightly platykurtic distribution, flatter than the reference (normal) distribution. The coefficient of variation ($0.22$) is well below $1$, indicating that the data are fairly homogeneous.
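The document does not show which package supplies skewness() and kurtosis(). As a rough guide to what these measures compute, the sketch below implements the simple moment-based definitions in base R; the helper names `skew_moment` and `excess_kurt_moment` are hypothetical, and packaged estimators may apply small-sample corrections, so their results can differ slightly from these. The kurtosis is written as excess kurtosis (minus 3), which is the convention under which the normal distribution scores 0 and a negative value means platykurtic.

```r
# Moment-based skewness: third central moment over the 3/2 power of the second
skew_moment <- function(x) {
  m  <- mean(x)
  m2 <- mean((x - m)^2)
  m3 <- mean((x - m)^3)
  m3 / m2^(3 / 2)
}

# Moment-based excess kurtosis: fourth central moment over the squared
# second moment, minus 3 so that the normal distribution gives 0
excess_kurt_moment <- function(x) {
  m  <- mean(x)
  m2 <- mean((x - m)^2)
  m4 <- mean((x - m)^4)
  m4 / m2^2 - 3
}

# Hypothetical samples (not the course data)
sym       <- c(1, 2, 3, 4, 5)           # perfectly symmetric
left_tail <- c(1, 6, 7, 7, 8, 8, 8, 9)  # one low value stretches the left tail

skew_moment(sym)         # 0: no asymmetry
skew_moment(left_tail)   # negative: asymmetry to the left
excess_kurt_moment(sym)  # negative: flatter than the normal distribution
```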

Histogram

hist(var, freq=T, main="Histogram", xlab="Rating", col=3)

From 2 to 8 the histogram draws a nice bell shape, but this breaks down in the last two classes, which produces a slight asymmetry to the left (negative skewness).

Cumulative relative frequency curve

brks = c(2,3,4,5,6,7,8,9,10)  # class limits, one class per unit of rating
cut2 = cut(var,breaks=brks, right=FALSE)
d = as.data.frame(table(x = cut2))
d = rbind(data.frame(x = 2, Freq = 0), d)  # the curve starts at 0 at the lower limit
d$x = brks
plot(
  cumsum(prop.table(d$Freq)),
  ylim=c(0,1),
  xlab="Rating",
  xaxt="n",
  ylab="Cumulative relative frequency")
axis(side=1, at=seq(1,length(d$x)), labels=d$x)
xspline(1:length(d$x), cumsum(prop.table(d$Freq)), col="blue", lwd=1)

We can also draw the cumulative relative frequency curve, which is useful for reading off, at a glance, information related to percentiles, quartiles, etc. For example, we can get an approximate value for the median ($6.2$) or for the interquartile range ($Q_3-Q_1$), which would be $1.8$. It also confirms, as we saw before, that few films have a score higher than $8$.
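Reading a percentile off the curve amounts to linear interpolation between class limits. As a sketch, the code below does this for the median using the cumulative relative frequencies of the four width-2 classes from the frequency table (0.09, 0.44, 0.98, 1), since those are the values printed in this document; the plotted curve uses narrower classes, so a value read from it may differ slightly.

```r
# Upper class limits and the cumulative relative frequency reached at each
limits <- c(2, 4, 6, 8, 10)
F_cum  <- c(0, 0.09, 0.44, 0.98, 1)

# Interpolate the rating at which the cumulative frequency reaches 0.5
median_grouped <- approx(x = F_cum, y = limits, xout = 0.5)$y
median_grouped  # about 6.22, close to the exact median of 6.2
```

With classes this wide, the same interpolation is much rougher for the quartiles, which is one reason to draw the curve with narrower classes.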

Stem and leaf plot

stem(var)

## 
##   The decimal point is at the |
## 
##   2 | 48
##   3 | 2336789
##   4 | 0124677888
##   5 | 0000133334445556667777889
##   6 | 00112223344444455556677889
##   7 | 0001111122233345555666666778
##   8 | 4
##   9 | 2

We observe the shape described above: the absolute frequencies increase and then fall rapidly in the last two intervals. The best-reviewed films have scores of $8.4$ and $9.2$ and, in this case, the modal class is the stem 7, that is, ratings from $7.0$ to $7.9$.

Another view of the same diagram, with more classes, can be obtained by doing:

stem(var, scale=2)

## 
##   The decimal point is at the |
## 
##   2 | 4
##   2 | 8
##   3 | 233
##   3 | 6789
##   4 | 0124
##   4 | 677888
##   5 | 000013333444
##   5 | 5556667777889
##   6 | 001122233444444
##   6 | 55556677889
##   7 | 000111112223334
##   7 | 5555666666778
##   8 | 4
##   8 | 
##   9 | 2

Boxplot

boxplot(var, horizontal=TRUE, xlab="Rating")

The box plot flags one outlier under the default rule (points more than 1.5 times the interquartile range beyond the box): the minimum rating ($2.4$). That movie stands out as Very Bad in this data set. The right whisker is shorter than the left one, indicating asymmetry to the left (hump shifted to the right). Were it not for that observation, the skewness of the distribution would be almost imperceptible.
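The outlier rule can be checked by hand with the quartiles from summary(var). This is a sketch under one assumption: boxplot() actually works with hinges, which can differ slightly from the quartiles that summary() reports, so the fences below are approximate.

```r
# Quartiles as reported by summary(var) above
q1 <- 5.3
q3 <- 7.1
iqr <- q3 - q1  # 1.8

# Default fences: 1.5 times the interquartile range beyond each quartile
lower_fence <- q1 - 1.5 * iqr  # about 2.6
upper_fence <- q3 + 1.5 * iqr  # about 9.8

# The two lowest ratings (2.4 and 2.8) and the highest (9.2),
# as read from the stem and leaf plot
c(flag_2.4 = 2.4 < lower_fence,
  flag_2.8 = 2.8 < lower_fence,
  flag_9.2 = 9.2 > upper_fence)
```

Only 2.4 falls outside the fences, which matches the single point drawn beyond the left whisker.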