Luka Pascal Cavazza, 24 September 2022

IMB bootcamp: Statistics take-home exam

TASK 1

IMDB movie dataset

The basic idea behind my project is to determine how a movie's runtime (length) affects its rating. I will probably also look at Genre, Gross revenue, etc.

Description of the variables:

  • Poster_Link: link of the poster that IMDB is using
  • Series_Title: name of the movie
  • Released_Year: Year of movie release
  • Certificate: certificate earned by the movie
  • Runtime: total runtime of the movie
  • Genre: genre of the movie
  • IMDB_Rating: movie rating on IMDB (0-10)
  • Overview: short summary
  • Meta_score: score given by movie critics (0-100)
  • Director: name of the movie director
  • Star1, Star2, Star3, Star4: names of the movie's top-billed stars
  • No_of_Votes: total number of votes
  • Gross: money/revenue earned by that movie

I believe this data is from 2020, and it covers the top 1000 movies on the IMDB website.

imdb_top_1000 <- read_csv("~/Desktop/FAKS/IMB/Statistics/bootcamp r/R Take Home Exam/imdb_top_1000.csv")
## Rows: 1000 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (12): Poster_Link, Series_Title, Released_Year, Certificate, Runtime, Genre, Overview, Director, Star1, Star2, S...
## dbl  (3): IMDB_Rating, Meta_score, No_of_Votes
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(imdb_top_1000) #structure of data frame: I use this function to see what type of data I'm dealing with 
## spec_tbl_df [1,000 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Poster_Link  : chr [1:1000] "https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk"| __truncated__ "https://m.media-amazon.com/images/M/MV5BM2MyNjYxNmUtYTAwNi00MTYxLWJmNWYtYzZlODY3ZTk3OTFlXkEyXkFqcGdeQXVyNzkwMjQ"| __truncated__ "https://m.media-amazon.com/images/M/MV5BMTMxNTMwODM0NF5BMl5BanBnXkFtZTcwODAyMTk2Mw@@._V1_UX67_CR0,0,67,98_AL_.jpg" "https://m.media-amazon.com/images/M/MV5BMWMwMGQzZTItY2JlNC00OWZiLWIyMDctNDk2ZDQ2YjRjMWQ0XkEyXkFqcGdeQXVyNzkwMjQ"| __truncated__ ...
##  $ Series_Title : chr [1:1000] "The Shawshank Redemption" "The Godfather" "The Dark Knight" "The Godfather: Part II" ...
##  $ Released_Year: chr [1:1000] "1994" "1972" "2008" "1974" ...
##  $ Certificate  : chr [1:1000] "A" "A" "UA" "A" ...
##  $ Runtime      : chr [1:1000] "142 min" "175 min" "152 min" "202 min" ...
##  $ Genre        : chr [1:1000] "Drama" "Crime, Drama" "Action, Crime, Drama" "Crime, Drama" ...
##  $ IMDB_Rating  : num [1:1000] 9.3 9.2 9 9 9 8.9 8.9 8.9 8.8 8.8 ...
##  $ Overview     : chr [1:1000] "Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency." "An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son." "When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of th"| __truncated__ "The early life and career of Vito Corleone in 1920s New York City is portrayed, while his son, Michael, expands"| __truncated__ ...
##  $ Meta_score   : num [1:1000] 80 100 84 90 96 94 94 94 74 66 ...
##  $ Director     : chr [1:1000] "Frank Darabont" "Francis Ford Coppola" "Christopher Nolan" "Francis Ford Coppola" ...
##  $ Star1        : chr [1:1000] "Tim Robbins" "Marlon Brando" "Christian Bale" "Al Pacino" ...
##  $ Star2        : chr [1:1000] "Morgan Freeman" "Al Pacino" "Heath Ledger" "Robert De Niro" ...
##  $ Star3        : chr [1:1000] "Bob Gunton" "James Caan" "Aaron Eckhart" "Robert Duvall" ...
##  $ Star4        : chr [1:1000] "William Sadler" "Diane Keaton" "Michael Caine" "Diane Keaton" ...
##  $ No_of_Votes  : num [1:1000] 2343110 1620367 2303232 1129952 689845 ...
##  $ Gross        : num [1:1000] 2.83e+07 1.35e+08 5.35e+08 5.73e+07 4.36e+06 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Poster_Link = col_character(),
##   ..   Series_Title = col_character(),
##   ..   Released_Year = col_character(),
##   ..   Certificate = col_character(),
##   ..   Runtime = col_character(),
##   ..   Genre = col_character(),
##   ..   IMDB_Rating = col_double(),
##   ..   Overview = col_character(),
##   ..   Meta_score = col_double(),
##   ..   Director = col_character(),
##   ..   Star1 = col_character(),
##   ..   Star2 = col_character(),
##   ..   Star3 = col_character(),
##   ..   Star4 = col_character(),
##   ..   No_of_Votes = col_double(),
##   ..   Gross = col_number()
##   .. )
##  - attr(*, "problems")=<externalptr>

As we can see, there are two types of variables in this dataset: character and numeric.

Now I have to remove the variables that aren’t important for me; I would argue those are Poster_Link, Overview, Director and Star1-4 for now.

Note: after some thought I decided to also remove the Genre variable, because it holds more than one value in most cases. Maybe I'll come back to it some other time.

IMDB_important <- imdb_top_1000[ , c(-1, -6, -8, -10, -11, -12, -13, -14)] #there is probably a more elegant way to remove some of these
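A more elegant, name-based alternative would be dplyr's select(), which avoids the fragile positional indices (a sketch, assuming dplyr is installed):

IMDB_important <- dplyr::select(imdb_top_1000, -Poster_Link, -Genre, -Overview,
                                -Director, -Star1, -Star2, -Star3, -Star4) #drop columns by name instead of position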

head(IMDB_important) #let's see what we are dealing with
## # A tibble: 6 × 8
##   Series_Title                                  Released_Year Certificate Runtime IMDB_Rating Meta_score No_of_…¹  Gross
##   <chr>                                         <chr>         <chr>       <chr>         <dbl>      <dbl>    <dbl>  <dbl>
## 1 The Shawshank Redemption                      1994          A           142 min         9.3         80  2343110 2.83e7
## 2 The Godfather                                 1972          A           175 min         9.2        100  1620367 1.35e8
## 3 The Dark Knight                               2008          UA          152 min         9           84  2303232 5.35e8
## 4 The Godfather: Part II                        1974          A           202 min         9           90  1129952 5.73e7
## 5 12 Angry Men                                  1957          U           96 min          9           96   689845 4.36e6
## 6 The Lord of the Rings: The Return of the King 2003          U           201 min         8.9         94  1642758 3.78e8
## # … with abbreviated variable name ¹​No_of_Votes

Now it would probably make sense to explain the movie certificates. A quick Google search gives the following:

Certificate description:

  • U: “unrestricted public exhibition”
  • A: “restricted to adult audiences”
  • UA: “unrestricted public exhibition subject to parental guidance for children below the age of twelve”
  • PG: “parental guidance suggested”
  • PG-13: “parental guidance suggested for children under 13”
  • R: “under 17 not admitted without parent or guardian”
  • G: “general; for all audiences”

As we can see, some of these are almost the same. Why? Because different countries use different names for their certificates. This presents a small problem; I will try to merge equivalent certificates if I have time. (ref.: https://help.imdb.com/article/contribution/titles/certificates/GU757M8ZJ9ZPXB39?ref_=helpart_nav_27#)

The next thing I have to do is remove “min” from Runtime so it can be treated as a number:

IMDB_important$Runtime <- as.integer(gsub("[a-zA-Z ]", "", IMDB_important$Runtime)) #gsub replaces every match of the pattern (unlike sub, which only replaces the first), so this strips 'min' and converts Runtime to integer
IMDB_important$Gross <- as.integer(gsub(",", "", IMDB_important$Gross)) #the same idea for Gross: drop the thousands separators, then convert
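As an aside, readr's parse_number() would do both steps at once instead of the gsub calls above: it extracts the leading number from a string (and handles grouping commas), returning a double rather than an integer. A sketch of the equivalent:

IMDB_important$Runtime <- readr::parse_number(IMDB_important$Runtime) #"142 min" becomes 142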

head(IMDB_important)
## # A tibble: 6 × 8
##   Series_Title                                  Released_Year Certificate Runtime IMDB_Rating Meta_score No_of_…¹  Gross
##   <chr>                                         <chr>         <chr>         <int>       <dbl>      <dbl>    <dbl>  <int>
## 1 The Shawshank Redemption                      1994          A               142         9.3         80  2343110 2.83e7
## 2 The Godfather                                 1972          A               175         9.2        100  1620367 1.35e8
## 3 The Dark Knight                               2008          UA              152         9           84  2303232 5.35e8
## 4 The Godfather: Part II                        1974          A               202         9           90  1129952 5.73e7
## 5 12 Angry Men                                  1957          U                96         9           96   689845 4.36e6
## 6 The Lord of the Rings: The Return of the King 2003          U               201         8.9         94  1642758 3.78e8
## # … with abbreviated variable name ¹​No_of_Votes

Much better.

summary(IMDB_important) #midway check
##  Series_Title       Released_Year      Certificate           Runtime       IMDB_Rating      Meta_score    
##  Length:1000        Length:1000        Length:1000        Min.   : 45.0   Min.   :7.600   Min.   : 28.00  
##  Class :character   Class :character   Class :character   1st Qu.:103.0   1st Qu.:7.700   1st Qu.: 70.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :119.0   Median :7.900   Median : 79.00  
##                                                           Mean   :122.9   Mean   :7.949   Mean   : 77.97  
##                                                           3rd Qu.:137.0   3rd Qu.:8.100   3rd Qu.: 87.00  
##                                                           Max.   :321.0   Max.   :9.300   Max.   :100.00  
##                                                                                           NA's   :157     
##   No_of_Votes          Gross          
##  Min.   :  25088   Min.   :     1305  
##  1st Qu.:  55526   1st Qu.:  3253559  
##  Median : 138548   Median : 23530892  
##  Mean   : 273693   Mean   : 68034751  
##  3rd Qu.: 374161   3rd Qu.: 80750894  
##  Max.   :2343110   Max.   :936662225  
##                    NA's   :169

For some strange reason, Released_Year is also stored as character and I didn’t notice it before… so I have to change that as well.

IMDB_important$Released_y <- lapply(IMDB_important$Released_Year, as.character) #back up the original values (this creates a list column; a plain as.character() would be simpler)
IMDB_important$Released_Year <- as.integer(IMDB_important$Released_y) #convert to integer; anything non-numeric becomes NA
## Warning: NAs introduced by coercion

I get a warning that an NA was introduced, so let's track it down:

which(is.na(IMDB_important$Released_Year)) #finding the NA value
## [1] 967
IMDB_important$Series_Title[967] #what is the movie title, Google the release date
## [1] "Apollo 13"
IMDB_important$Released_Year[967] <- 1995 #replace NA with the right release date

head(IMDB_important)
## # A tibble: 6 × 9
##   Series_Title                                  Released_Year Certificate Runtime IMDB_…¹ Meta_…² No_of…³  Gross Relea…⁴
##   <chr>                                                 <dbl> <chr>         <int>   <dbl>   <dbl>   <dbl>  <int> <list> 
## 1 The Shawshank Redemption                               1994 A               142     9.3      80 2343110 2.83e7 <chr>  
## 2 The Godfather                                          1972 A               175     9.2     100 1620367 1.35e8 <chr>  
## 3 The Dark Knight                                        2008 UA              152     9        84 2303232 5.35e8 <chr>  
## 4 The Godfather: Part II                                 1974 A               202     9        90 1129952 5.73e7 <chr>  
## 5 12 Angry Men                                           1957 U                96     9        96  689845 4.36e6 <chr>  
## 6 The Lord of the Rings: The Return of the King          2003 U               201     8.9      94 1642758 3.78e8 <chr>  
## # … with abbreviated variable names ¹​IMDB_Rating, ²​Meta_score, ³​No_of_Votes, ⁴​Released_y

A late-night Google search provided me with a solution for the Genre problem. Let’s see if it works:

library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2) #needed for the plot below

IMDB_final <- imdb_top_1000[ , c(-1, -8, -10, -11, -12, -13, -14)] #we now have to get Genre back

IMDB_final %>%
  select(Genre) %>%
  mutate(Genre = str_split(Genre, ',')) %>%
  unnest(Genre) %>%
  group_by(Genre) %>%
  count() %>%
  arrange(desc(n)) %>%
  ggplot() + geom_col(aes(y = reorder(Genre, n), x = n)) +
  labs(title = 'Genre of Titles',
       x = 'No. of titles',
       y = 'Genre') +
  theme_minimal()

Works great… except that there are two Drama bars. The reason: splitting on ',' leaves a leading space on every genre after the first, so " Drama" and "Drama" are counted as different values.
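A sketch of the fix, trimming the whitespace with stringr's str_trim() before counting:

IMDB_final %>%
  select(Genre) %>%
  mutate(Genre = str_split(Genre, ',')) %>%
  unnest(Genre) %>%
  mutate(Genre = str_trim(Genre)) %>% #drop the leading space so duplicates merge
  count(Genre, sort = TRUE)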

I think I have now done everything regarding data cleaning, let’s get to some descriptive statistics.

I want to know the average IMDB score, Meta score and movie runtime.

summary(IMDB_final$IMDB_Rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   7.600   7.700   7.900   7.949   8.100   9.300

The average rating of the top 1000 movies on IMDB is 7.949.

summary(IMDB_final$Meta_score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   28.00   70.00   79.00   77.97   87.00  100.00     157

The average Meta score is 77.97 (computed over the 843 movies that have one; 157 are missing).

IMDB_final$Runtime <- as.numeric(IMDB_important$Runtime) #IMDB_final was rebuilt from the raw data, so I copy the cleaned numeric Runtime over from IMDB_important
head(IMDB_final) #let's see
## # A tibble: 6 × 9
##   Series_Title                                  Released_Year Certificate Runtime Genre   IMDB_…¹ Meta_…² No_of…³  Gross
##   <chr>                                         <chr>         <chr>         <dbl> <chr>     <dbl>   <dbl>   <dbl>  <dbl>
## 1 The Shawshank Redemption                      1994          A               142 Drama       9.3      80 2343110 2.83e7
## 2 The Godfather                                 1972          A               175 Crime,…     9.2     100 1620367 1.35e8
## 3 The Dark Knight                               2008          UA              152 Action…     9        84 2303232 5.35e8
## 4 The Godfather: Part II                        1974          A               202 Crime,…     9        90 1129952 5.73e7
## 5 12 Angry Men                                  1957          U                96 Crime,…     9        96  689845 4.36e6
## 6 The Lord of the Rings: The Return of the King 2003          U               201 Action…     8.9      94 1642758 3.78e8
## # … with abbreviated variable names ¹​IMDB_Rating, ²​Meta_score, ³​No_of_Votes

Great.

summary(IMDB_final$Runtime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    45.0   103.0   119.0   122.9   137.0   321.0

The average length of top 1000 movies on IMDB is 122.9 minutes.

Note: when I first tried to summarise Runtime it wouldn’t work, even though the midway check already showed these values. The reason is that the midway check summarised IMDB_important, where Runtime had already been cleaned, while IMDB_final was later rebuilt straight from the raw imdb_top_1000, so its Runtime reverted to the character "142 min" form and had to be converted again.

Let’s see them all together:

library(pastecs)

stat.desc(IMDB_final[c(-1,-2,-3,-5)]) #removing variables that are not important to us now
##                   Runtime  IMDB_Rating   Meta_score  No_of_Votes        Gross
## nbr.val      1.000000e+03 1.000000e+03 8.430000e+02 1.000000e+03 8.310000e+02
## nbr.null     0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## nbr.na       0.000000e+00 0.000000e+00 1.570000e+02 0.000000e+00 1.690000e+02
## min          4.500000e+01 7.600000e+00 2.800000e+01 2.508800e+04 1.305000e+03
## max          3.210000e+02 9.300000e+00 1.000000e+02 2.343110e+06 9.366622e+08
## range        2.760000e+02 1.700000e+00 7.200000e+01 2.318022e+06 9.366609e+08
## sum          1.228910e+05 7.949300e+03 6.573000e+04 2.736929e+08 5.653688e+10
## median       1.190000e+02 7.900000e+00 7.900000e+01 1.385485e+05 2.353089e+07
## mean         1.228910e+02 7.949300e+00 7.797153e+01 2.736929e+05 6.803475e+07
## SE.mean      8.883999e-01 8.711797e-03 4.262555e-01 1.035243e+04 3.807187e+06
## CI.mean.0.95 1.743344e+00 1.709552e-02 8.366481e-01 2.031501e+04 7.472846e+06
## var          7.892544e+02 7.589541e-02 1.531678e+02 1.071729e+11 1.204507e+16
## std.dev      2.809367e+01 2.754912e-01 1.237610e+01 3.273727e+05 1.097500e+08
## coef.var     2.286064e-01 3.465603e-02 1.587259e-01 1.196131e+00 1.613147e+00
round(stat.desc(IMDB_final[c(-1,-2,-3,-5)]), 2) #rounding the results to 2 decimal places
##                Runtime IMDB_Rating Meta_score  No_of_Votes        Gross
## nbr.val        1000.00     1000.00     843.00 1.000000e+03 8.310000e+02
## nbr.null          0.00        0.00       0.00 0.000000e+00 0.000000e+00
## nbr.na            0.00        0.00     157.00 0.000000e+00 1.690000e+02
## min              45.00        7.60      28.00 2.508800e+04 1.305000e+03
## max             321.00        9.30     100.00 2.343110e+06 9.366622e+08
## range           276.00        1.70      72.00 2.318022e+06 9.366609e+08
## sum          122891.00     7949.30   65730.00 2.736929e+08 5.653688e+10
## median          119.00        7.90      79.00 1.385485e+05 2.353089e+07
## mean            122.89        7.95      77.97 2.736929e+05 6.803475e+07
## SE.mean           0.89        0.01       0.43 1.035243e+04 3.807187e+06
## CI.mean.0.95      1.74        0.02       0.84 2.031501e+04 7.472846e+06
## var             789.25        0.08     153.17 1.071729e+11 1.204507e+16
## std.dev          28.09        0.28      12.38 3.273727e+05 1.097500e+08
## coef.var          0.23        0.03       0.16 1.200000e+00 1.610000e+00

  • Runtime: the shortest movie was 45 minutes, the longest 321 minutes; the mean was 122.89 minutes.
  • Rating: the “worst” rating (remember, these are the top 1000 movies) was 7.6, the highest 9.3; the mean was 7.95.
  • Meta score: the lowest was 28, the highest 100; the mean was 77.97.

The gap between the most-voted and the least-voted movie is about 2.32 million votes (the range row above). The average number of votes was roughly 273,700.
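The exact figure can be computed directly rather than by hand:

diff(range(IMDB_final$No_of_Votes)) #max minus min = 2318022, matching the range row above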

The average revenue of the top 1000 movies was 68 million dollars, while the top earner made 936 million (I find that hard to believe, I’ll have to check).
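One way to check is to look up which title that is (a sketch; which.max() ignores the missing Gross values):

IMDB_final$Series_Title[which.max(IMDB_final$Gross)] #title of the highest-grossing movie in the list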

library(car)

scatterplotMatrix(IMDB_final[ , c(-1,-2,-3,-5)], #we exclude non-numerical variables
                  smooth = FALSE) 

library(car)

scatterplot(y = IMDB_final$IMDB_Rating, #a scatter plot shows the relationship between two variables
            x = IMDB_final$Runtime,
            boxplot = FALSE,
            ylab = "IMDB Rating", 
            xlab = "Movie runtime (in minutes)", 
            smooth = TRUE) #the smoother helps show the trend across runtimes

As we can see in the graph above, shorter movies tend to have a lower rating than longer ones. The lowest average rating is at around the 90-minute mark.
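To put a single number on this relationship, the correlation can be computed as well (a sketch; use = "complete.obs" drops any missing pairs):

cor(IMDB_final$Runtime, IMDB_final$IMDB_Rating, use = "complete.obs") #linear association between runtime and rating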

imdb_final1 <- IMDB_final

library(ggplot2)
ggplot(imdb_final1, aes( x = Runtime, y = IMDB_Rating)) +
         geom_point()

str(imdb_final1)
## tibble [1,000 × 9] (S3: tbl_df/tbl/data.frame)
##  $ Series_Title : chr [1:1000] "The Shawshank Redemption" "The Godfather" "The Dark Knight" "The Godfather: Part II" ...
##  $ Released_Year: chr [1:1000] "1994" "1972" "2008" "1974" ...
##  $ Certificate  : chr [1:1000] "A" "A" "UA" "A" ...
##  $ Runtime      : num [1:1000] 142 175 152 202 96 201 154 195 148 139 ...
##  $ Genre        : chr [1:1000] "Drama" "Crime, Drama" "Action, Crime, Drama" "Crime, Drama" ...
##  $ IMDB_Rating  : num [1:1000] 9.3 9.2 9 9 9 8.9 8.9 8.9 8.8 8.8 ...
##  $ Meta_score   : num [1:1000] 80 100 84 90 96 94 94 94 74 66 ...
##  $ No_of_Votes  : num [1:1000] 2343110 1620367 2303232 1129952 689845 ...
##  $ Gross        : num [1:1000] 2.83e+07 1.35e+08 5.35e+08 5.73e+07 4.36e+06 ...

Unfortunately I ran out of time to finish this… I hope I can come back to it some other time, as I am genuinely interested in the topic.

TASK 2

knitr::opts_chunk$set(echo = TRUE)

#install.packages("readr")
library(readr)
#install.packages("ggplot2")
library(ggplot2)
#install.packages("dplyr")
library(dplyr)
#install.packages("tidyr")
library(tidyr)
#install.packages("stringr")
library(stringr)
#install.packages("pastecs")
library(pastecs)
#install.packages("car")
library(car)
#install.packages("graphics")
library(graphics)
Body_mass <- read_csv("~/Desktop/FAKS/IMB/Statistics/bootcamp r/R Take Home Exam/Task 2/Body mass.csv")
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 50 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ID;Mass
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(Body_mass) #what are we dealing with?
## # A tibble: 6 × 1
##   `ID;Mass`
##   <chr>    
## 1 1;62,1   
## 2 2;64,5   
## 3 3;56,5   
## 4 4;53,4   
## 5 5;61,3   
## 6 6;62,2

Looks like we’ll have to split our data… the file is actually semicolon-delimited with decimal commas, so read_csv lumped everything into one column. read.table with sep and dec set correctly handles it:

Body_mass <- read.table("~/Desktop/FAKS/IMB/Statistics/bootcamp r/R Take Home Exam/Task 2/Body mass.csv",
                      header= TRUE,
                      sep = ";",
                      dec = ",")

head(Body_mass) #let's see how our data looks
##   ID Mass
## 1  1 62.1
## 2  2 64.5
## 3  3 56.5
## 4  4 53.4
## 5  5 61.3
## 6  6 62.2

Much better.
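For reference, readr can read this European CSV flavour directly: read_csv2() assumes ';' as the field separator and ',' as the decimal mark, so the following should be equivalent (a sketch):

Body_mass <- read_csv2("~/Desktop/FAKS/IMB/Statistics/bootcamp r/R Take Home Exam/Task 2/Body mass.csv")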

library(pastecs)

stat.desc(Body_mass) #we want to see our descriptive statistics
##                       ID         Mass
## nbr.val        50.000000 5.000000e+01
## nbr.null        0.000000 0.000000e+00
## nbr.na          0.000000 0.000000e+00
## min             1.000000 4.970000e+01
## max            50.000000 8.320000e+01
## range          49.000000 3.350000e+01
## sum          1275.000000 3.143800e+03
## median         25.500000 6.280000e+01
## mean           25.500000 6.287600e+01
## SE.mean         2.061553 8.501407e-01
## CI.mean.0.95    4.142845 1.708422e+00
## var           212.500000 3.613696e+01
## std.dev        14.577380 6.011403e+00
## coef.var        0.571662 9.560727e-02
round(stat.desc(Body_mass), 2) #rounding the values
##                   ID    Mass
## nbr.val        50.00   50.00
## nbr.null        0.00    0.00
## nbr.na          0.00    0.00
## min             1.00   49.70
## max            50.00   83.20
## range          49.00   33.50
## sum          1275.00 3143.80
## median         25.50   62.80
## mean           25.50   62.88
## SE.mean         2.06    0.85
## CI.mean.0.95    4.14    1.71
## var           212.50   36.14
## std.dev        14.58    6.01
## coef.var        0.57    0.10

The lowest recorded weight was 49.7 kg, the highest 83.2 kg, and the average 62.88 kg. Considering the average body weight in the school year 2018/2019 was 59.9 kg, we see an increase of 2.98 kg, which might be attributable to less exercise while staying at home during the Covid-19 pandemic.

library(graphics)
                          
hist(Body_mass[ , -1], #showing the data on a histogram
     main = "Distribution of body mass",
     xlab = "Body weight in kilograms",
     ylab = "Number of students",
     breaks = seq(from = 40, to = 90, by = 1), #1-kg bins
     col = "navyblue") #colour ("navyblue" is the valid R colour name)

As we can see, there are a couple of high outliers, which could partly explain the rise in the average weight.

Now let’s start the hypothesis testing, where:

H0: the average weight stayed the same (µ = 59.9 kg)
H1: the average weight changed (µ ≠ 59.9 kg)

mean(Body_mass$Mass) #mean of body weight
## [1] 62.876
sd(Body_mass$Mass) #standard deviation
## [1] 6.011403
t.test(Body_mass$Mass, #perform a one-sample t-test
       mu = 59.9, #hypothesized mean under H0
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  Body_mass$Mass
## t = 3.5006, df = 49, p-value = 0.0009995
## alternative hypothesis: true mean is not equal to 59.9
## 95 percent confidence interval:
##  61.16758 64.58442
## sample estimates:
## mean of x 
##    62.876

Let’s say alpha stands at 5%; the p-value (0.0009995) is far below it, so we reject H0 and accept H1: the average weight has changed.

If we’d like to calculate the effect size, we can convert the t-statistic into a correlation coefficient using:

r = √(t² / (t² + df)) = √(3.5006² / (3.5006² + 49)) ≈ 0.447

from which we can say the effect is medium.
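The same number can be reproduced in R by pulling t and df straight out of the test object (a sketch):

tt <- t.test(Body_mass$Mass, mu = 59.9, alternative = "two.sided")
unname(sqrt(tt$statistic^2 / (tt$statistic^2 + tt$parameter))) #effect size r, about 0.447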

TASK 3

knitr::opts_chunk$set(echo = TRUE)
options(width=120)
#install.packages("car")
library(car)
#install.packages("ggplot2")
library(ggplot2)
#install.packages("reshape2")
library(reshape2)

Import the dataset Apartments.xlsx

library(readxl)
Apartments <- read_excel("~/Desktop/FAKS/IMB/Statistics/bootcamp r/R Take Home Exam/Task 3/Apartments.xlsx")

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes
head(Apartments) #I want to see which variables are what
## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Parking and Balcony are the categorical variables, since they are binary.

Change categorical variables into factors

Apartments$ParkingFactor <- factor(Apartments$Parking, 
                             levels = c(0, 1), 
                             labels = c("No", "Yes"))

Apartments$BalconyFactor <- factor(Apartments$Balcony, 
                             levels = c(0, 1), 
                             labels = c("No", "Yes"))

str(Apartments)
## tibble [85 × 7] (S3: tbl_df/tbl/data.frame)
##  $ Age          : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
##  $ Distance     : num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
##  $ Price        : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
##  $ Parking      : num [1:85] 0 1 0 0 1 0 0 1 1 1 ...
##  $ Balcony      : num [1:85] 1 0 0 1 1 1 1 1 0 0 ...
##  $ ParkingFactor: Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
##  $ BalconyFactor: Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...

Works as hoped.

H0: µ_Price = 1900
H1: µ_Price ≠ 1900

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

mean(Apartments$Price) #just for later, to double check
## [1] 2018.941
t.test(Apartments$Price, #perform a one-sample t-test
       mu = 1900, #hypothesized mean under H0
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

The p-value (0.004731) is less than alpha = 5%, so we reject H0: the mean price differs from 1900 EUR (sample mean 2018.94).

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Age ~ Price, #note: as written this estimates Age = f(Price); the task's Price = f(Age) would be lm(Price ~ Age)
           data = Apartments)

summary(fit1) #to get the coeff.
## 
## Call:
## lm(formula = Age ~ Price, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.184  -6.673  -2.555   7.048  25.803 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.479211   5.627511   5.416 5.82e-07 ***
## Price       -0.005907   0.002740  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.49 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

The coefficient of determination is R² = 0.053: about 5.3% of the variation in the response is explained by the regressor. (Because fit1 was specified the other way round, keep in mind that a proper Price = f(Age) fit would report the same R², since in simple regression R² equals the squared correlation in either direction.) The regression coefficient for Price, −0.0059, says that apartments priced 1 EUR/m² higher are on average about 0.006 years younger (p = 0.034).

cor(Apartments$Age,Apartments$Price)
## [1] -0.230255
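The correlation of −0.23 points to a weak negative linear relationship. As a quick sanity check, its square should reproduce the R-squared above:

cor(Apartments$Age, Apartments$Price)^2 #(-0.230255)^2 ≈ 0.0530, matching Multiple R-squared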

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity.

scatterplotMatrix(Apartments[ , c(-4,-5,-6,-7)], #keep only the numeric Price, Age and Distance columns
                  smooth = FALSE) 

Based on the scatterplot matrix above, I wouldn’t say we will encounter a problem with multicollinearity; Age and Distance don’t look strongly related.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data = Apartments)

Check the multicollinearity with VIF statistics. Explain the findings.

vif(fit2) #multicollinearity
##      Age Distance 
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845

The values are below 5 (a common rule of thumb), so we shouldn’t expect problems with multicollinearity.
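For intuition: the VIF of a predictor is 1/(1 − R²) from regressing that predictor on all the others. A sketch reproducing the value for Age:

r2_age <- summary(lm(Age ~ Distance, data = Apartments))$r.squared #R² from regressing Age on the other predictor
1 / (1 - r2_age) #should match vif(fit2) for Age, about 1.0018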

Calculate standardized residuals and Cook’s distances for model fit2. Remove any potentially problematic case (outlier or unit with big influence).

Apartments$StdResid <- round(rstandard(fit2), 4) #calculating standardized residuals from fit2 rounded to 4 decimals
Apartments$CooksD <- round(cooks.distance(fit2), 4) #calculating Cook's distance from fit2 rounded to 4 decimals

Let’s check for any potential outliers with the help of a histogram.

hist(Apartments$StdResid, 
     xlab = "Stand. Resid.", 
     ylab = "Frequency", 
     main = "Outliers check",
     breaks = seq(from= -3,to= 3, by=0.5), #visuals
     col = "navy blue")

I wouldn’t say there is a real problem with outliers, but I will still consider removing the most extreme case.

But first, let’s check Cook’s distances:

hist(Apartments$CooksD, 
     xlab = "Cook's distance", 
     ylab = "Frequency", 
     main = "Influence check",
     breaks = seq(from = 0, to = 0.35, by = 0.01), #Cook's distances are non-negative and small, so a fine grid near zero is needed to see anything
     col = "navyblue")

Yeah, one value clearly stands out from the rest.

head(Apartments[order(-Apartments$CooksD),], 6) 
## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>         <fct>            <dbl>  <dbl>
## 1     5       45  2180       1       1 Yes           Yes               2.58 0.320 
## 2    43       37  1740       0       0 No            No                1.44 0.104 
## 3     2       11  2790       1       0 Yes           No                2.05 0.0691
## 4     7        2  1760       0       1 No            Yes              -2.15 0.0663
## 5    37        3  2540       1       1 Yes           Yes               1.58 0.0609
## 6    40        2  2400       0       1 No            Yes               1.09 0.0375

I will remove the case with Cook’s distance = 0.3197.

Apartments_new <- Apartments[c(-53),] #removing the 53rd row (the high-influence case) and saving the rest into a new object
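To avoid hard-coding row 53, the index can also be found programmatically; a sketch, assuming the single largest Cook's distance is the case to drop:

Apartments_new <- Apartments[-which.max(Apartments$CooksD), ] #same result without the magic number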

Now I have to estimate a new model, fit3, on the cleaned data.

fit3 <- lm(Price ~ Age + Distance,
           data = Apartments_new)

Now let’s see if it looks better…

Apartments_new$StdResid <- round(rstandard(fit3), 4) 
Apartments_new$CooksD <- round(cooks.distance(fit3), 4)

hist(Apartments_new$StdResid, 
     xlab = "Stand. Resid.", 
     ylab = "Frequency", 
     main = "Outliers check",
     breaks = seq(from= -3,to= 3, by=0.5), #visuals
     col = "navy blue")

Better.

Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings.

library(car)

Apartments_new$StdfittedValues <- scale(fit3$fitted.values) #standardize the fitted values (note the new dataset has one row fewer than before)

scatterplot(y = Apartments_new$StdResid, #we use scatter plot when we want to show a relationship between 2 variables
            x = Apartments_new$StdfittedValues,
            boxplot = FALSE, #we don't need this
            ylab = "Std. residuals", 
            xlab = "Std. fitted values", 
            smooth = FALSE)

Looking at the plot, there doesn’t seem to be a problem with heteroskedasticity: the residuals are spread fairly evenly across the fitted values, with no obvious funnel shape (at least as far as I can tell).
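The visual impression can be backed up formally; car's ncvTest() performs a score test for non-constant error variance on the fitted model (a sketch):

ncvTest(fit3) #H0: constant error variance; a large p-value is consistent with homoskedasticity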

Are the standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

hist(Apartments_new$StdResid, 
     xlab = "Stand. Resid.", 
     ylab = "Frequency", 
     main = "Normality check",
     breaks = seq(from = -3, to = 3, by = 0.5), #visuals
     col = "navyblue")

shapiro.test(Apartments_new$StdResid) #formal test
## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments_new$StdResid
## W = 0.93899, p-value = 0.0006087

Hypothesis testing:

H0: the standardized residuals are normally distributed
H1: the standardized residuals are not normally distributed

With p-value = 0.0006087 and alpha = 5%, we reject the null hypothesis and accept H1.

All in all, the standardized residuals are not normally distributed.
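A Q-Q plot would show the same conclusion graphically (a sketch with base graphics):

qqnorm(Apartments_new$StdResid, main = "Normal Q-Q plot of std. residuals")
qqline(Apartments_new$StdResid) #points drifting away from the line indicate departures from normality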

Estimate fit2 again without the excluded cases and show the summary of the model. Explain all coefficients.

fit3 <- lm(Price ~ Age + Distance, #I use fit3, which was estimated without the excluded case
           data = Apartments_new)
vif(fit3)
##      Age Distance 
##  1.00078  1.00078
summary(fit3) #summary of the model
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -438.92 -220.46  -81.11  195.96  690.93 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2493.556     76.421  32.629  < 2e-16 ***
## Age           -8.821      3.178  -2.776  0.00684 ** 
## Distance     -21.342      2.703  -7.895 1.21e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 279.8 on 81 degrees of freedom
## Multiple R-squared:  0.4682, Adjusted R-squared:  0.4551 
## F-statistic: 35.66 on 2 and 81 DF,  p-value: 7.818e-12

beta1 equals −8.821: with every additional year of age, the price per m² decreases on average by 8.82 EUR (p = 0.007), holding everything else constant.

beta2 equals −21.342: with every additional kilometer from the city center, the price per m² decreases on average by 21.34 EUR (p < 0.001), holding everything else constant.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit4 <- lm(formula =Price ~ Age + Distance + ParkingFactor + BalconyFactor, #I already have fit3 so I will name it fit4
           data = Apartments_new)

With function anova check if model fit3 fits data better than model fit2.

anova(fit3, fit4)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     81 6341478                              
## 2     79 5760725  2    580753 3.9821 0.02251 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because the p-value (0.02251) is below 5%, we can say that fit4 fits the data better than fit3.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit4)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor, 
##     data = Apartments_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -413.40 -207.51  -55.73  260.39  567.03 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2337.595     95.195  24.556  < 2e-16 ***
## Age                -7.579      3.100  -2.445  0.01671 *  
## Distance          -18.837      2.758  -6.831 1.57e-09 ***
## ParkingFactorYes  177.285     62.939   2.817  0.00612 ** 
## BalconyFactorYes   14.012     59.608   0.235  0.81476    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 270 on 79 degrees of freedom
## Multiple R-squared:  0.5169, Adjusted R-squared:  0.4924 
## F-statistic: 21.13 on 4 and 79 DF,  p-value: 7.086e-12

Parking: holding Age, Distance and Balcony constant, apartments with parking are on average about 177 EUR/m² more expensive (p = 0.006, below alpha = 5%).

Balcony: since p = 0.815 (alpha = 5%), we cannot claim that having a balcony affects the apartment price.

Hypothesis (F-stat.):

H0: ρ² = 0
H1: ρ² > 0

The p-value (7.1e-12, at the bottom of the output) is far below 5%, so we reject H0 and conclude there is a linear relationship between price (the dependent variable) and at least one of the explanatory variables.

Save fitted values and calculate the residual for apartment ID2.

Apartments_new <- Apartments_new %>% mutate(Fittedvalues = fitted.values(fit4))
myresiduals <- residuals(fit4)
print(myresiduals)
##           1           2           3           4           5           6           7           8           9          10 
## -131.115201  440.382830  -97.102940  256.884827 -413.399618 -143.344986  -18.757257 -309.444192   53.721875  282.274205 
##          11          12          13          14          15          16          17          18          19          20 
##  370.255884 -208.690975 -346.937922  -49.770397  190.817087 -190.504851 -265.321819   71.369962 -171.967426 -288.368758 
##          21          22          23          24          25          26          27          28          29          30 
## -197.372180  348.049265  -21.612597 -207.118288  321.504869  112.581789  375.224451 -400.550309 -208.676070   -7.127462 
##          31          32          33          34          35          36          37          38          39          40 
##  295.034666   60.255884  497.486276  -86.883017 -284.105448 -408.793842 -353.106026  536.671208  389.234811  -71.807867 
##          41          42          43          44          45          46          47          48          49          50 
## -382.409644 -108.154294   -4.801557  129.447489  -61.699542  312.155219 -112.115752 -208.716043  289.161557 -268.074593 
##          51          52          53          54          55          56          57          58          59          60 
##  153.475671 -173.471932 -229.989106  425.281313   -9.681522  567.034338  403.427983  -69.776428  189.314511  426.370569 
##          61          62          63          64          65          66          67          68          69          70 
##  -97.102940  270.897088 -413.399618 -129.332724   -4.744995 -309.444192   53.721875  282.274205  356.243623 -208.690975 
##          71          72          73          74          75          76          77          78          79          80 
## -332.925661  -35.758136  190.817087 -190.504851 -251.309558  -47.687281  298.142958 -112.115752 -194.703781  289.161557 
##          81          82          83          84 
## -268.074593  153.475671 -173.471932 -112.115752

The residual for apartment ID2 is equal to 440.38.
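Instead of scanning the whole printed vector, the single value can be pulled out directly, for example:

residuals(fit4)[2] #residual for row 2 (ID2): 440.3828
Apartments_new$Price[2] - Apartments_new$Fittedvalues[2] #same number, by the definition of a residual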