The basic idea behind my project is to determine how a movie's runtime (length) affects its rating. I will probably also look at Genre, Gross revenue, etc.
Description of the variables:
I believe this data is from 2020; it covers the top 1000 movies on the IMDB website.
imdb_top_1000 <- read_csv("~/Desktop/FAKS/IMB/Statistics/bootcamp r/R Take Home Exam/imdb_top_1000.csv")
## Rows: 1000 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (12): Poster_Link, Series_Title, Released_Year, Certificate, Runtime, Genre, Overview, Director, Star1, Star2, S...
## dbl (3): IMDB_Rating, Meta_score, No_of_Votes
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(imdb_top_1000) #structure of data frame: I use this function to see what type of data I'm dealing with
## spec_tbl_df [1,000 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Poster_Link : chr [1:1000] "https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk"| __truncated__ "https://m.media-amazon.com/images/M/MV5BM2MyNjYxNmUtYTAwNi00MTYxLWJmNWYtYzZlODY3ZTk3OTFlXkEyXkFqcGdeQXVyNzkwMjQ"| __truncated__ "https://m.media-amazon.com/images/M/MV5BMTMxNTMwODM0NF5BMl5BanBnXkFtZTcwODAyMTk2Mw@@._V1_UX67_CR0,0,67,98_AL_.jpg" "https://m.media-amazon.com/images/M/MV5BMWMwMGQzZTItY2JlNC00OWZiLWIyMDctNDk2ZDQ2YjRjMWQ0XkEyXkFqcGdeQXVyNzkwMjQ"| __truncated__ ...
## $ Series_Title : chr [1:1000] "The Shawshank Redemption" "The Godfather" "The Dark Knight" "The Godfather: Part II" ...
## $ Released_Year: chr [1:1000] "1994" "1972" "2008" "1974" ...
## $ Certificate : chr [1:1000] "A" "A" "UA" "A" ...
## $ Runtime : chr [1:1000] "142 min" "175 min" "152 min" "202 min" ...
## $ Genre : chr [1:1000] "Drama" "Crime, Drama" "Action, Crime, Drama" "Crime, Drama" ...
## $ IMDB_Rating : num [1:1000] 9.3 9.2 9 9 9 8.9 8.9 8.9 8.8 8.8 ...
## $ Overview : chr [1:1000] "Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency." "An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son." "When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of th"| __truncated__ "The early life and career of Vito Corleone in 1920s New York City is portrayed, while his son, Michael, expands"| __truncated__ ...
## $ Meta_score : num [1:1000] 80 100 84 90 96 94 94 94 74 66 ...
## $ Director : chr [1:1000] "Frank Darabont" "Francis Ford Coppola" "Christopher Nolan" "Francis Ford Coppola" ...
## $ Star1 : chr [1:1000] "Tim Robbins" "Marlon Brando" "Christian Bale" "Al Pacino" ...
## $ Star2 : chr [1:1000] "Morgan Freeman" "Al Pacino" "Heath Ledger" "Robert De Niro" ...
## $ Star3 : chr [1:1000] "Bob Gunton" "James Caan" "Aaron Eckhart" "Robert Duvall" ...
## $ Star4 : chr [1:1000] "William Sadler" "Diane Keaton" "Michael Caine" "Diane Keaton" ...
## $ No_of_Votes : num [1:1000] 2343110 1620367 2303232 1129952 689845 ...
## $ Gross : num [1:1000] 2.83e+07 1.35e+08 5.35e+08 5.73e+07 4.36e+06 ...
## - attr(*, "spec")=
## .. cols(
## .. Poster_Link = col_character(),
## .. Series_Title = col_character(),
## .. Released_Year = col_character(),
## .. Certificate = col_character(),
## .. Runtime = col_character(),
## .. Genre = col_character(),
## .. IMDB_Rating = col_double(),
## .. Overview = col_character(),
## .. Meta_score = col_double(),
## .. Director = col_character(),
## .. Star1 = col_character(),
## .. Star2 = col_character(),
## .. Star3 = col_character(),
## .. Star4 = col_character(),
## .. No_of_Votes = col_double(),
## .. Gross = col_number()
## .. )
## - attr(*, "problems")=<externalptr>
As we can see, there are two types of variables in this dataset: character and numeric.
Now I have to remove the variables that aren't important for me; for now I would argue those are Poster_Link, Overview, Director, and Star1-4.
Note: after some thought I decided to also remove the Genre variable, because in most cases it holds more than one value. Maybe I'll come back to it later.
IMDB_important <- imdb_top_1000[ , c(-1, -6, -8, -10, -11, -12, -13, -14)] #dropping columns by position; a more elegant, name-based alternative is sketched after the preview below
head(IMDB_important) #let's see what we are dealing with
## # A tibble: 6 × 8
## Series_Title Released_Year Certificate Runtime IMDB_Rating Meta_score No_of_…¹ Gross
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 The Shawshank Redemption 1994 A 142 min 9.3 80 2343110 2.83e7
## 2 The Godfather 1972 A 175 min 9.2 100 1620367 1.35e8
## 3 The Dark Knight 2008 UA 152 min 9 84 2303232 5.35e8
## 4 The Godfather: Part II 1974 A 202 min 9 90 1129952 5.73e7
## 5 12 Angry Men 1957 U 96 min 9 96 689845 4.36e6
## 6 The Lord of the Rings: The Return of the King 2003 U 201 min 8.9 94 1642758 3.78e8
## # … with abbreviated variable name ¹No_of_Votes
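For the record, the same columns can also be dropped by name with dplyr, which is probably the "more elegant way" I was looking for. This is just a sketch of the alternative, not the code used above:
library(dplyr)
#drop the same eight columns by name instead of by position; name-based
#selection keeps working even if the column order ever changes
IMDB_important <- imdb_top_1000 %>%
  select(-Poster_Link, -Genre, -Overview, -Director, -starts_with("Star"))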
Now it would probably make sense to explain the movie certificates. A quick Google search gives us the following answer:
Certificate description:
As we can see, some of these are almost the same. Why? Because different countries use different names for their certificates. This presents a small problem; if I have time, I will try to merge equivalent certificates (a possible approach is sketched below). (ref.: https://help.imdb.com/article/contribution/titles/certificates/GU757M8ZJ9ZPXB39?ref_=helpart_nav_27#)
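If I do get to it, one possible approach is a hand-made lookup that maps country-specific labels onto a common one. The pairing below is purely illustrative (I have not verified which certificates are actually equivalent), and Certificate_merged is a hypothetical new column:
library(dplyr)
#recode() rewrites the listed values and leaves all others untouched;
#the "UA" -> "U/A" mapping here is an illustrative guess, not a verified fact
IMDB_important$Certificate_merged <- recode(IMDB_important$Certificate,
                                            "UA" = "U/A")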
The next thing I have to do is remove the "min" text in Runtime. As the structure output above showed, Runtime is stored as character (e.g. "142 min"), so I strip the letters and convert it:
IMDB_important$Runtime <- as.integer(gsub("[a-zA-Z ]", "", IMDB_important$Runtime)) #removing 'min' and converting Runtime to integer; gsub() replaces every match of the pattern (sub() would replace only the first)
IMDB_important$Gross <- as.integer(gsub(",","",IMDB_important$Gross)) #the same idea for Gross: strip thousands separators, then convert
head(IMDB_important)
## # A tibble: 6 × 8
## Series_Title Released_Year Certificate Runtime IMDB_Rating Meta_score No_of_…¹ Gross
## <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <int>
## 1 The Shawshank Redemption 1994 A 142 9.3 80 2343110 2.83e7
## 2 The Godfather 1972 A 175 9.2 100 1620367 1.35e8
## 3 The Dark Knight 2008 UA 152 9 84 2303232 5.35e8
## 4 The Godfather: Part II 1974 A 202 9 90 1129952 5.73e7
## 5 12 Angry Men 1957 U 96 9 96 689845 4.36e6
## 6 The Lord of the Rings: The Return of the King 2003 U 201 8.9 94 1642758 3.78e8
## # … with abbreviated variable name ¹No_of_Votes
Much better.
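As an aside, readr's parse_number() does the same job in one step for both columns; a small sketch of the equivalent alternative:
library(readr)
#parse_number() keeps the first number it finds and ignores surrounding text
#and grouping marks, so it handles both runtime and gross formats in one call
parse_number("142 min")     # 142
parse_number("28,341,469")  # 28341469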
summary(IMDB_important) #midway check
## Series_Title Released_Year Certificate Runtime IMDB_Rating Meta_score
## Length:1000 Length:1000 Length:1000 Min. : 45.0 Min. :7.600 Min. : 28.00
## Class :character Class :character Class :character 1st Qu.:103.0 1st Qu.:7.700 1st Qu.: 70.00
## Mode :character Mode :character Mode :character Median :119.0 Median :7.900 Median : 79.00
## Mean :122.9 Mean :7.949 Mean : 77.97
## 3rd Qu.:137.0 3rd Qu.:8.100 3rd Qu.: 87.00
## Max. :321.0 Max. :9.300 Max. :100.00
## NA's :157
## No_of_Votes Gross
## Min. : 25088 Min. : 1305
## 1st Qu.: 55526 1st Qu.: 3253559
## Median : 138548 Median : 23530892
## Mean : 273693 Mean : 68034751
## 3rd Qu.: 374161 3rd Qu.: 80750894
## Max. :2343110 Max. :936662225
## NA's :169
For some strange reason, Released_Year is also stored as character and I didn't notice it before, so I have to change that as well.
IMDB_important$Released_y <- lapply(IMDB_important$Released_Year, as.character) #note: lapply() returns a list, so this creates a list column (visible as <list> in the preview below)
IMDB_important$Released_Year <- as.integer(IMDB_important$Released_y) #coerce to integer; any non-numeric year becomes NA
## Warning: NAs introduced by coercion
The warning tells me that at least one value could not be converted and became NA.
which(is.na(IMDB_important$Released_Year)) #finding the NA value
## [1] 967
IMDB_important$Series_Title[967] #what is the movie title, Google the release date
## [1] "Apollo 13"
IMDB_important$Released_Year[967] <- 1995 #replace NA with the right release date
head(IMDB_important)
## # A tibble: 6 × 9
## Series_Title Released_Year Certificate Runtime IMDB_…¹ Meta_…² No_of…³ Gross Relea…⁴
## <chr> <dbl> <chr> <int> <dbl> <dbl> <dbl> <int> <list>
## 1 The Shawshank Redemption 1994 A 142 9.3 80 2343110 2.83e7 <chr>
## 2 The Godfather 1972 A 175 9.2 100 1620367 1.35e8 <chr>
## 3 The Dark Knight 2008 UA 152 9 84 2303232 5.35e8 <chr>
## 4 The Godfather: Part II 1974 A 202 9 90 1129952 5.73e7 <chr>
## 5 12 Angry Men 1957 U 96 9 96 689845 4.36e6 <chr>
## 6 The Lord of the Rings: The Return of the King 2003 U 201 8.9 94 1642758 3.78e8 <chr>
## # … with abbreviated variable names ¹IMDB_Rating, ²Meta_score, ³No_of_Votes, ⁴Released_y
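In hindsight, the lapply() detour was unnecessary; it is what created the Released_y list column visible above. A direct coercion would have done the same job, and the helper column can be dropped (a sketch of the simpler route):
#Released_Year is already character, so a single as.integer() call suffices;
#the same "NAs introduced by coercion" warning would still flag the bad value
IMDB_important$Released_Year <- as.integer(IMDB_important$Released_Year)
IMDB_important$Released_y <- NULL #the helper list column is no longer needed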
A late-night Google search provided me with a solution to the Genre problem. Let's see if it works:
library(dplyr)
library(tidyr)
library(stringr)
IMDB_final <- imdb_top_1000[ , c(-1, -8, -10, -11, -12, -13, -14)] #we now have to get Genre back
IMDB_final %>%
  select(Genre) %>%
  mutate(Genre = str_split(Genre, ',')) %>%
  unnest(Genre) %>%
  group_by(Genre) %>%
  count() %>%
  arrange(desc(n)) %>%
  ggplot() + geom_col(aes(y = reorder(Genre, n), x = n)) +
  labs(title = 'Genre of Titles',
       x = 'No. of titles',
       y = 'Genre') +
  theme_minimal()
It works… and the two Drama values have a simple explanation: whitespace. str_split(Genre, ',') turns "Crime, Drama" into "Crime" and " Drama" (note the leading space), and " Drama" counts as a different value from "Drama". A fix is sketched below.
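A sketch of the fix: trim the whitespace after splitting, and the two Drama groups collapse into one.
IMDB_final %>%
  select(Genre) %>%
  mutate(Genre = str_split(Genre, ',')) %>%
  unnest(Genre) %>%
  mutate(Genre = str_trim(Genre)) %>% #removes the leading space from " Drama"
  count(Genre, sort = TRUE)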
I think the data cleaning is now done; let's move on to some descriptive statistics.
I want to know the average IMDB rating, Meta score, and movie runtime.
summary(IMDB_final$IMDB_Rating)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.600 7.700 7.900 7.949 8.100 9.300
The average rating of the top 1000 movies on IMDB is 7.949.
summary(IMDB_final$Meta_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 28.00 70.00 79.00 77.97 87.00 100.00 157
The average Meta score of the top 1000 movies on IMDB is 77.97 (note the 157 missing values).
IMDB_final$Runtime <- as.numeric(IMDB_important$Runtime) #IMDB_final was rebuilt from the raw data, so its Runtime is still text like "142 min"; here I copy over the cleaned numeric column from IMDB_important
head(IMDB_final) #let's see
## # A tibble: 6 × 9
## Series_Title Released_Year Certificate Runtime Genre IMDB_…¹ Meta_…² No_of…³ Gross
## <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 The Shawshank Redemption 1994 A 142 Drama 9.3 80 2343110 2.83e7
## 2 The Godfather 1972 A 175 Crime,… 9.2 100 1620367 1.35e8
## 3 The Dark Knight 2008 UA 152 Action… 9 84 2303232 5.35e8
## 4 The Godfather: Part II 1974 A 202 Crime,… 9 90 1129952 5.73e7
## 5 12 Angry Men 1957 U 96 Crime,… 9 96 689845 4.36e6
## 6 The Lord of the Rings: The Return of the King 2003 U 201 Action… 8.9 94 1642758 3.78e8
## # … with abbreviated variable names ¹IMDB_Rating, ²Meta_score, ³No_of_Votes
Great.
summary(IMDB_final$Runtime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 45.0 103.0 119.0 122.9 137.0 321.0
The average length of top 1000 movies on IMDB is 122.9 minutes.
Note: when I first tried to summarise Runtime in IMDB_final it didn't work, even though the midway check above already showed numeric summaries for it. In hindsight the reason is simple: the midway check ran on IMDB_important, where Runtime had already been cleaned, while IMDB_final was rebuilt directly from the raw imdb_top_1000, so its Runtime column was back to character values like "142 min" until the conversion above.
Let’s see them all together:
library(pastecs)
stat.desc(IMDB_final[c(-1,-2,-3,-5)]) #removing variables that are not important to us now
## Runtime IMDB_Rating Meta_score No_of_Votes Gross
## nbr.val 1.000000e+03 1.000000e+03 8.430000e+02 1.000000e+03 8.310000e+02
## nbr.null 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## nbr.na 0.000000e+00 0.000000e+00 1.570000e+02 0.000000e+00 1.690000e+02
## min 4.500000e+01 7.600000e+00 2.800000e+01 2.508800e+04 1.305000e+03
## max 3.210000e+02 9.300000e+00 1.000000e+02 2.343110e+06 9.366622e+08
## range 2.760000e+02 1.700000e+00 7.200000e+01 2.318022e+06 9.366609e+08
## sum 1.228910e+05 7.949300e+03 6.573000e+04 2.736929e+08 5.653688e+10
## median 1.190000e+02 7.900000e+00 7.900000e+01 1.385485e+05 2.353089e+07
## mean 1.228910e+02 7.949300e+00 7.797153e+01 2.736929e+05 6.803475e+07
## SE.mean 8.883999e-01 8.711797e-03 4.262555e-01 1.035243e+04 3.807187e+06
## CI.mean.0.95 1.743344e+00 1.709552e-02 8.366481e-01 2.031501e+04 7.472846e+06
## var 7.892544e+02 7.589541e-02 1.531678e+02 1.071729e+11 1.204507e+16
## std.dev 2.809367e+01 2.754912e-01 1.237610e+01 3.273727e+05 1.097500e+08
## coef.var 2.286064e-01 3.465603e-02 1.587259e-01 1.196131e+00 1.613147e+00
round(stat.desc(IMDB_final[c(-1,-2,-3,-5)]), 2) #rounding the results to two decimal places
## Runtime IMDB_Rating Meta_score No_of_Votes Gross
## nbr.val 1000.00 1000.00 843.00 1.000000e+03 8.310000e+02
## nbr.null 0.00 0.00 0.00 0.000000e+00 0.000000e+00
## nbr.na 0.00 0.00 157.00 0.000000e+00 1.690000e+02
## min 45.00 7.60 28.00 2.508800e+04 1.305000e+03
## max 321.00 9.30 100.00 2.343110e+06 9.366622e+08
## range 276.00 1.70 72.00 2.318022e+06 9.366609e+08
## sum 122891.00 7949.30 65730.00 2.736929e+08 5.653688e+10
## median 119.00 7.90 79.00 1.385485e+05 2.353089e+07
## mean 122.89 7.95 77.97 2.736929e+05 6.803475e+07
## SE.mean 0.89 0.01 0.43 1.035243e+04 3.807187e+06
## CI.mean.0.95 1.74 0.02 0.84 2.031501e+04 7.472846e+06
## var 789.25 0.08 153.17 1.071729e+11 1.204507e+16
## std.dev 28.09 0.28 12.38 3.273727e+05 1.097500e+08
## coef.var 0.23 0.03 0.16 1.200000e+00 1.610000e+00
Runtime: the shortest movie runs 45 minutes, the longest 321 minutes, and the mean is 122.89 minutes. Rating: the "worst" rating (remember, these are the top 1000 movies) is 7.6, the highest 9.3, and the mean 7.95. Meta score: the lowest is 28, the highest 100, and the mean 77.97.
The difference between the most-voted and the least-voted movie is about 2.32 million votes (2,343,110 − 25,088 = 2,318,022; my first rough estimate of 990k was off). The average number of votes was roughly 273,700; a quick check of the range follows below.
The average revenue the top 1000 movies made was 68 million dollars, while the top earner made 936 million (I find that hard to believe, I'll have to check).
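A quick sanity check of that vote range straight from the data:
#difference between the most- and least-voted movie; this should match the
#"range" row of the stat.desc() output above (2,318,022)
max(IMDB_final$No_of_Votes) - min(IMDB_final$No_of_Votes)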
library(car)
scatterplotMatrix(IMDB_final[ , c(-1,-2,-3,-5)], #we exclude non-numerical variables
smooth = FALSE)
library(car)
scatterplot(y = IMDB_final$IMDB_Rating, #a scatter plot shows the relationship between two variables
            x = IMDB_final$Runtime,
            boxplot = FALSE,
            ylab = "IMDB Rating",
            xlab = "Movie runtime (in minutes)",
            smooth = TRUE) #smooth = TRUE adds a smoother, which helps reveal the trend
As we can see in the graph above, shorter movies tend to have lower ratings than longer ones. The lowest average rating occurs at around the 90-minute mark.
imdb_final1 <- IMDB_final
library(ggplot2)
ggplot(imdb_final1, aes( x = Runtime, y = IMDB_Rating)) +
geom_point()
str(imdb_final1)
## tibble [1,000 × 9] (S3: tbl_df/tbl/data.frame)
## $ Series_Title : chr [1:1000] "The Shawshank Redemption" "The Godfather" "The Dark Knight" "The Godfather: Part II" ...
## $ Released_Year: chr [1:1000] "1994" "1972" "2008" "1974" ...
## $ Certificate : chr [1:1000] "A" "A" "UA" "A" ...
## $ Runtime : num [1:1000] 142 175 152 202 96 201 154 195 148 139 ...
## $ Genre : chr [1:1000] "Drama" "Crime, Drama" "Action, Crime, Drama" "Crime, Drama" ...
## $ IMDB_Rating : num [1:1000] 9.3 9.2 9 9 9 8.9 8.9 8.9 8.8 8.8 ...
## $ Meta_score : num [1:1000] 80 100 84 90 96 94 94 94 74 66 ...
## $ No_of_Votes : num [1:1000] 2343110 1620367 2303232 1129952 689845 ...
## $ Gross : num [1:1000] 2.83e+07 1.35e+08 5.35e+08 5.73e+07 4.36e+06 ...
knitr::opts_chunk$set(echo = TRUE)
#install.packages("readr")
library(readr)
#install.packages("ggplot2")
library(ggplot2)
#install.packages("dplyr")
library(dplyr)
#install.packages("tidyr")
library(tidyr)
#install.packages("stringr")
library(stringr)
#install.packages("pastecs")
library(pastecs)
#install.packages("car")
library(car)
#install.packages("graphics")
library(graphics)
library(readr)
Body_mass <- read_csv("~/Desktop/FAKS/IMB/Statistics/bootcamp r/R Take Home Exam/Task 2/Body mass.csv")
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 50 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ID;Mass
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(Body_mass) #what are we dealing with?
## # A tibble: 6 × 1
## `ID;Mass`
## <chr>
## 1 1;62,1
## 2 2;64,5
## 3 3;56,5
## 4 4;53,4
## 5 5;61,3
## 6 6;62,2
Looks like we'll have to split our data: the file actually uses ";" as the delimiter and "," as the decimal mark, so read_csv() crammed everything into a single column.
Body_mass <- read.table("~/Desktop/FAKS/IMB/Statistics/bootcamp r/R Take Home Exam/Task 2/Body mass.csv",
header= TRUE,
sep = ";",
dec = ",")
head(Body_mass) #let's see how our data looks
## ID Mass
## 1 1 62.1
## 2 2 64.5
## 3 3 56.5
## 4 4 53.4
## 5 5 61.3
## 6 6 62.2
Much better.
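For what it's worth, readr also has a one-liner for exactly this file format. read_csv2() assumes ";" as the delimiter and "," as the decimal mark, so this sketch should load the file equivalently:
library(readr)
#read_csv2() is the semicolon/decimal-comma counterpart of read_csv()
Body_mass <- read_csv2("~/Desktop/FAKS/IMB/Statistics/bootcamp r/R Take Home Exam/Task 2/Body mass.csv")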
library(pastecs)
stat.desc(Body_mass) #we want to see our descriptive statistics
## ID Mass
## nbr.val 50.000000 5.000000e+01
## nbr.null 0.000000 0.000000e+00
## nbr.na 0.000000 0.000000e+00
## min 1.000000 4.970000e+01
## max 50.000000 8.320000e+01
## range 49.000000 3.350000e+01
## sum 1275.000000 3.143800e+03
## median 25.500000 6.280000e+01
## mean 25.500000 6.287600e+01
## SE.mean 2.061553 8.501407e-01
## CI.mean.0.95 4.142845 1.708422e+00
## var 212.500000 3.613696e+01
## std.dev 14.577380 6.011403e+00
## coef.var 0.571662 9.560727e-02
round(stat.desc(Body_mass), 2) #rounding the values
## ID Mass
## nbr.val 50.00 50.00
## nbr.null 0.00 0.00
## nbr.na 0.00 0.00
## min 1.00 49.70
## max 50.00 83.20
## range 49.00 33.50
## sum 1275.00 3143.80
## median 25.50 62.80
## mean 25.50 62.88
## SE.mean 2.06 0.85
## CI.mean.0.95 4.14 1.71
## var 212.50 36.14
## std.dev 14.58 6.01
## coef.var 0.57 0.10
The lowest recorded weight was 49.7 kg, the highest 83.2 kg, and the average weight was 62.88 kg. Considering the average body weight in the school year 2018/2019 was 59.9 kg, we see an increase of 2.98 kilograms, which might be attributed to less exercise while staying at home during the Covid-19 pandemic.
library(graphics)
hist(Body_mass[ , -1], #showing the data on a histogram
     main = "Distribution of body mass",
     xlab = "Body weight in kilograms",
     ylab = "Number of students",
     breaks = seq(from = 40, to = 90, by = 1), #visuals
     col = "navyblue") #colour; note that "navy blue" with a space is not a valid R colour name
As we can see, there are a couple of outliers on the heavy end, which could explain the rise in the overall average weight.
Now let's start the hypothesis testing, where:
H0: µ = 59.9 kg (the average weight stayed the same). H1: µ ≠ 59.9 kg (the average weight changed).
mean(Body_mass$Mass) #mean of body weight
## [1] 62.876
sd(Body_mass$Mass) #standard deviation
## [1] 6.011403
t.test(Body_mass$Mass, #perform a one-sample t-test
       mu = 59.9, #the hypothesised mean under H0
       alternative = "two.sided")
##
## One Sample t-test
##
## data: Body_mass$Mass
## t = 3.5006, df = 49, p-value = 0.0009995
## alternative hypothesis: true mean is not equal to 59.9
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
With alpha at 5%, the p-value (0.0009995) is far below the threshold, so we reject H0 and accept H1.
If we'd like to calculate the effect size, we can convert the t statistic into a correlation coefficient with the following equation:
|r| = sqrt(t^2 / (t^2 + df)) = sqrt(3.5006^2 / (3.5006^2 + 49)) ≈ 0.447
from which we can say the effect is medium.
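The same calculation can be done in R from the t.test() object, so the 0.447 doesn't have to be computed by hand (a small sketch; tt and r_abs are just illustrative names):
tt <- t.test(Body_mass$Mass, mu = 59.9, alternative = "two.sided")
#|r| = sqrt(t^2 / (t^2 + df)); $statistic and $parameter hold t and df
r_abs <- sqrt(tt$statistic^2 / (tt$statistic^2 + tt$parameter))
unname(r_abs) #~0.447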
knitr::opts_chunk$set(echo = TRUE)
options(width=120)
#install.packages("car")
library(car)
#install.packages("ggplot2")
library(ggplot2)
#install.packages("reshape2")
library(reshape2)
library(readxl)
Apartments <- read_excel("~/Desktop/FAKS/IMB/Statistics/bootcamp r/R Take Home Exam/Task 3/Apartments.xlsx")
Description:
head(Apartments) #I want to see which variables are what
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Parking and Balcony are categorical, since they're binary (0/1); I convert them to factors:
Apartments$ParkingFactor <- factor(Apartments$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
Apartments$BalconyFactor <- factor(Apartments$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
str(Apartments)
## tibble [85 × 7] (S3: tbl_df/tbl/data.frame)
## $ Age : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
## $ Distance : num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
## $ Price : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
## $ Parking : num [1:85] 0 1 0 0 1 0 0 1 1 1 ...
## $ Balcony : num [1:85] 1 0 0 1 1 1 1 1 0 0 ...
## $ ParkingFactor: Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
## $ BalconyFactor: Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...
Works as hoped.
H0: µ_Price = 1900; H1: µ_Price ≠ 1900.
mean(Apartments$Price) #just for later, to double check
## [1] 2018.941
t.test(Apartments$Price, #perform a one-sample t-test
       mu = 1900, #the hypothesised mean under H0
       alternative = "two.sided")
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
The p-value (0.004731) is less than alpha = 5%, so we can reject H0.
fit1 <- lm(Age ~ Price,
data = Apartments)
summary(fit1) #to get the coefficients and R-squared
##
## Call:
## lm(formula = Age ~ Price, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.184 -6.673 -2.555 7.048 25.803
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.479211 5.627511 5.416 5.82e-07 ***
## Price -0.005907 0.002740 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.49 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
The coefficient of determination is the R-squared, here R² = 0.053, from which we can say that the independent variable explains approximately 5.3% of the variation in the dependent variable. (As a check, the correlation computed next is −0.2303, and (−0.2303)² ≈ 0.053.)
cor(Apartments$Age,Apartments$Price)
## [1] -0.230255
scatterplotMatrix(Apartments[ , c(-4,-5,-6,-7)], #we exclude others
smooth = FALSE)
Based on the scatterplot matrix above, I wouldn't say we are likely to encounter a problem with multicollinearity.
fit2 <- lm(Price ~ Age + Distance,
data = Apartments)
vif(fit2) #multicollinearity
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
The VIF values are below 5, which means we shouldn't expect multicollinearity.
Apartments$StdResid <- round(rstandard(fit2), 4) #calculating standardized residuals from fit2 rounded to 4 decimals
Apartments$CooksD <- round(cooks.distance(fit2), 4) #calculating Cook's distance from fit2 rounded to 4 decimals
Let's check for any potential outliers with the help of a histogram.
hist(Apartments$StdResid,
     xlab = "Stand. Resid.",
     ylab = "Frequency",
     main = "Outliers check",
     breaks = seq(from = -3, to = 3, by = 0.5), #visuals
     col = "navyblue")
I wouldn't say there is a serious problem with outliers, but I will still consider removing the observation near −2.
Before that, let's check Cook's distance:
hist(Apartments$CooksD,
     xlab = "Cook's distance",
     ylab = "Frequency",
     main = "Outliers check",
     breaks = seq(from = -3, to = 3, by = 0.5), #visuals; Cook's distance is non-negative, so only the right half is used
     col = "navyblue")
Indeed, one value clearly stands out.
head(Apartments[order(-Apartments$CooksD),], 6)
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 1 1 Yes Yes 2.58 0.320
## 2 43 37 1740 0 0 No No 1.44 0.104
## 3 2 11 2790 1 0 Yes No 2.05 0.0691
## 4 7 2 1760 0 1 No Yes -2.15 0.0663
## 5 37 3 2540 1 1 Yes Yes 1.58 0.0609
## 6 40 2 2400 0 1 No Yes 1.09 0.0375
I will remove the observation with Cook's distance = 0.3197; the sketch below confirms its row index.
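To find which row that is, the index of the largest Cook's distance can be asked for directly (a one-line sketch):
which.max(Apartments$CooksD) #row number of the largest Cook's distance; 53 here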
Apartments_new <- Apartments[c(-53),] #removing the 53rd row and saving the remaining data into a new object
Now I have to fit a new model, fit3, on the cleaned data.
fit3 <- lm(Price ~ Age + Distance,
data = Apartments_new)
Now let’s see if it looks better…
Apartments_new$StdResid <- round(rstandard(fit3), 4)
Apartments_new$CooksD <- round(cooks.distance(fit3), 4)
hist(Apartments_new$StdResid,
     xlab = "Stand. Resid.",
     ylab = "Frequency",
     main = "Outliers check",
     breaks = seq(from = -3, to = 3, by = 0.5), #visuals
     col = "navyblue")
Better.
library(car)
Apartments_new$StdfittedValues <- scale(fit3$fitted.values) #standardized fitted values; the environment now shows one observation fewer than before
scatterplot(y = Apartments_new$StdResid, #plotting residuals against fitted values to check for heteroskedasticity
            x = Apartments_new$StdfittedValues,
            boxplot = FALSE, #no marginal boxplots needed
            ylab = "Std. residuals",
            xlab = "Std. fitted values",
            smooth = FALSE)
The residuals appear fairly evenly scattered around zero across the range of fitted values, so heteroskedasticity does not seem to be a problem (at least as far as I can tell).
hist(Apartments_new$StdResid, #visual check before the formal normality test
     xlab = "Stand. Resid.",
     ylab = "Frequency",
     main = "Normality check",
     breaks = seq(from = -3, to = 3, by = 0.5), #visuals
     col = "navyblue")
shapiro.test(Apartments_new$StdResid) #formal test
##
## Shapiro-Wilk normality test
##
## data: Apartments_new$StdResid
## W = 0.93899, p-value = 0.0006087
Hypothesis testing:
H0: the standardized residuals are normally distributed. H1: the standardized residuals are not normally distributed.
With p-value = 0.0006087 and alpha = 5%, we reject the null hypothesis and accept H1.
All in all, the standardized residuals are not normally distributed.
fit3 <- lm(Price ~ Age + Distance, #the same fit3 as above: the model on the data without the removed observation
data = Apartments_new)
vif(fit3)
## Age Distance
## 1.00078 1.00078
summary(fit3) #summary of the model
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -438.92 -220.46 -81.11 195.96 690.93
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2493.556 76.421 32.629 < 2e-16 ***
## Age -8.821 3.178 -2.776 0.00684 **
## Distance -21.342 2.703 -7.895 1.21e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 279.8 on 81 degrees of freedom
## Multiple R-squared: 0.4682, Adjusted R-squared: 0.4551
## F-statistic: 35.66 on 2 and 81 DF, p-value: 7.818e-12
beta1 equals −8.821, which means that with every additional year of age the apartment's price per m² decreases on average by about 8.82 EUR (p-value = 0.007), holding everything else constant.
beta2 equals −21.342, which means that with every additional kilometre from the city centre the price per m² decreases on average by about 21.34 EUR (p-value < 0.001), holding everything else constant. A worked prediction follows below.
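To make the coefficients concrete, here is a hypothetical prediction from fit3: the expected price per m² of a 10-year-old apartment 5 km from the centre (the inputs are made up for illustration):
#by hand: 2493.556 - 8.821*10 - 21.342*5 = approx. 2298.6
predict(fit3, newdata = data.frame(Age = 10, Distance = 5))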
fit4 <- lm(formula =Price ~ Age + Distance + ParkingFactor + BalconyFactor, #I already have fit3 so I will name it fit4
data = Apartments_new)
anova(fit3, fit4)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 6341478
## 2 79 5760725 2 580753 3.9821 0.02251 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Because the p-value = 0.0225 is below alpha = 5%, we can say that fit4 is better than fit3.
summary(fit4)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor,
## data = Apartments_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -413.40 -207.51 -55.73 260.39 567.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2337.595 95.195 24.556 < 2e-16 ***
## Age -7.579 3.100 -2.445 0.01671 *
## Distance -18.837 2.758 -6.831 1.57e-09 ***
## ParkingFactorYes 177.285 62.939 2.817 0.00612 **
## BalconyFactorYes 14.012 59.608 0.235 0.81476
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 270 on 79 degrees of freedom
## Multiple R-squared: 0.5169, Adjusted R-squared: 0.4924
## F-statistic: 21.13 on 4 and 79 DF, p-value: 7.086e-12
Parking: controlling for Age, Distance, and Balcony, we can conclude that apartments with parking are on average about 177 EUR more expensive (p-value = 0.006, which is less than alpha = 5%). Balcony: because p-value = 0.815 (alpha = 5%), we cannot say that a balcony affects the apartment price.
Hypothesis (F-statistic):
H0: ρ² = 0; H1: ρ² > 0.
The p-value of the F-test (7.09e-12) is far below 5%, so we reject H0 and conclude there is a linear relationship between price (the dependent variable) and at least one of the independent variables.
Apartments_new <- Apartments_new %>% mutate(Fittedvalues = fitted.values(fit4))
myresiduals <- residuals(fit4)
print(myresiduals)
## 1 2 3 4 5 6 7 8 9 10
## -131.115201 440.382830 -97.102940 256.884827 -413.399618 -143.344986 -18.757257 -309.444192 53.721875 282.274205
## 11 12 13 14 15 16 17 18 19 20
## 370.255884 -208.690975 -346.937922 -49.770397 190.817087 -190.504851 -265.321819 71.369962 -171.967426 -288.368758
## 21 22 23 24 25 26 27 28 29 30
## -197.372180 348.049265 -21.612597 -207.118288 321.504869 112.581789 375.224451 -400.550309 -208.676070 -7.127462
## 31 32 33 34 35 36 37 38 39 40
## 295.034666 60.255884 497.486276 -86.883017 -284.105448 -408.793842 -353.106026 536.671208 389.234811 -71.807867
## 41 42 43 44 45 46 47 48 49 50
## -382.409644 -108.154294 -4.801557 129.447489 -61.699542 312.155219 -112.115752 -208.716043 289.161557 -268.074593
## 51 52 53 54 55 56 57 58 59 60
## 153.475671 -173.471932 -229.989106 425.281313 -9.681522 567.034338 403.427983 -69.776428 189.314511 426.370569
## 61 62 63 64 65 66 67 68 69 70
## -97.102940 270.897088 -413.399618 -129.332724 -4.744995 -309.444192 53.721875 282.274205 356.243623 -208.690975
## 71 72 73 74 75 76 77 78 79 80
## -332.925661 -35.758136 190.817087 -190.504851 -251.309558 -47.687281 298.142958 -112.115752 -194.703781 289.161557
## 81 82 83 84
## -268.074593 153.475671 -173.471932 -112.115752
The residual for apartment ID 2 is equal to 440.38.
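That value can also be pulled out directly by position, instead of scanning the printout (a one-line sketch):
round(residuals(fit4)[2], 2) #residual of the 2nd observation: 440.38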