IMB Bootcamp: R-Take Home Exam - Netflix shows IMDB ratings

Task 1: Import & analyse (Kaggle) dataset on movies & TV shows accessible on Netflix

I chose this data-set I found on Kaggle as it seemed like an interesting idea to delve into and analyse data & ratings on movies and TV shows available to watch on Netflix, as the winter season is coming up and staying in-doors and watching movies will become a more frequent occurence. Let’s begin!

First, I imported the data set into R Studio and activated the most common libraries we used throughout the Bootcamp (Output is hidden, the libraries are: readr, car, ggplot2, reshape2, onewaytests, dplyr, pastecs, psych, combinat, knitr, readxl, effectsize).The libraries are useful both for functions used in analysis, ease of use and aesthetics.

Netflix_IMDB_Data <- read_csv("~/R-Take home exam/Netflix TV Shows and Movies.csv", show_col_types = FALSE)

After importing, we can see that the set has 5283 observations and 11 variables.
Using the function head we can quickly look at the taken data:

head(Netflix_IMDB_Data)

## # A tibble: 6 × 11
##   index id       title            type  description release_year age_certification runtime imdb_id imdb_score imdb_votes
##   <dbl> <chr>    <chr>            <chr> <chr>              <dbl> <chr>               <dbl> <chr>        <dbl>      <dbl>
## 1     0 tm84618  Taxi Driver      MOVIE "A mentall…         1976 R                     113 tt0075…        8.3     795222
## 2     1 tm127384 Monty Python an… MOVIE "King Arth…         1975 PG                     91 tt0071…        8.2     530877
## 3     2 tm70993  Life of Brian    MOVIE "Brian Coh…         1979 R                      94 tt0079…        8       392419
## 4     3 tm190788 The Exorcist     MOVIE "12-year-o…         1973 R                     133 tt0070…        8.1     391942
## 5     4 ts22164  Monty Python's … SHOW  "A British…         1969 TV-14                  30 tt0063…        8.8      72895
## 6     5 tm14873  Dirty Harry      MOVIE "When a ma…         1971 R                     102 tt0066…        7.7     153463

As we can see the output is in a messy tibble, we will have to do some data restructuring in the following steps to clean our data and output.

The following data is specified in the imported data-set as follows:

Title: The name of a TV show or movie. (Text)
Type: Indicates whether an entry is a TV show or a movie. (Text)
Description: A brief summary or description of a TV show or movie. (Text)
Release_year: The year in which a TV show or movie was released. (Numeric)
Age_certification: The age certification rating for a TV show or movie. (Text)
Runtime: The length of an episode for TV shows or duration of a movie. (Text)
IMBD_score: The score given by users on IMDB for a particular title. (Numeric)
IMBD_votes: The number of votes received by each title on IMDB. (Numeric)

Altering the data

Renaming variables (Data columns)

Before beginning the analysis I decided to use the rename function in the library dplyr to rename the column headers in the data to make them a bit prettier by capitalizing some letters.

Netflix_IMDB_Data <- Netflix_IMDB_Data %>%
  rename(
    Title = title,
    Type = type,
    Description = description,
    Release_year = release_year,
    Age_certification = age_certification,
    Runtime = runtime,
    IMDB_score = imdb_score,
    IMDB_votes = imdb_votes
  )

Removing variables

As some columns are redundant for my analysis, I decided to remove them and make a new dataframe without the index, id and imdb id. As the age certification is not stated (NA) / certified for a lot of the movies, I removed it as well.

Netflix_IMDB_Data_Clean <- Netflix_IMDB_Data[ , c(-1, -2, -7, -9)]

Data classification & restructuring

In order to classify whether it’s a movie or TV show, we’ll add an additional column with 0 or 1 values, where 0 = movie, 1 = TV Show.

Netflix_IMDB_Data_Clean$ContentType <- factor(Netflix_IMDB_Data_Clean$Type, 
                             levels = c("MOVIE", "SHOW"), 
                             labels = c(0, 1))

Next we’ll use the summary function to get an overview of the clean data without the names, type (either movie or TV show) and description. Due to problems with the R Knitting function, I had to adjust the summary function for it to print the output correctly.

summary_output <- summary(Netflix_IMDB_Data_Clean[ , c(-1, -2, -3)])
print(summary_output)

##   Release_year     Runtime        IMDB_score      IMDB_votes      ContentType
##  Min.   :1953   Min.   :  0.0   Min.   :1.500   Min.   :      5   0:3407     
##  1st Qu.:2015   1st Qu.: 45.0   1st Qu.:5.800   1st Qu.:    521   1:1876     
##  Median :2018   Median : 87.0   Median :6.600   Median :   2279              
##  Mean   :2016   Mean   : 79.2   Mean   :6.533   Mean   :  23407              
##  3rd Qu.:2020   3rd Qu.:106.0   3rd Qu.:7.400   3rd Qu.:  10144              
##  Max.   :2022   Max.   :235.0   Max.   :9.600   Max.   :2268288              
##                                                 NA's   :16

We can see there are 16 missing values for the number of IMDB_votes column, we’ll isolate them using the dplyr filter function and remove them so we have even cleaner data. Afterwards we run and print the summary again:

Netflix_IMDB_Data_Clean <- Netflix_IMDB_Data_Clean %>%
  filter(!is.na(IMDB_votes))

Initial data analysis - Both movies & TV shows

After cleaning the data, we can quickly overview the data to see our progress:

print(head(Netflix_IMDB_Data_Clean[ , c(-2, -3)]))

## # A tibble: 6 × 6
##   Title                           Release_year Runtime IMDB_score IMDB_votes ContentType
##   <chr>                                  <dbl>   <dbl>      <dbl>      <dbl> <fct>      
## 1 Taxi Driver                             1976     113        8.3     795222 0          
## 2 Monty Python and the Holy Grail         1975      91        8.2     530877 0          
## 3 Life of Brian                           1979      94        8       392419 0          
## 4 The Exorcist                            1973     133        8.1     391942 0          
## 5 Monty Python's Flying Circus            1969      30        8.8      72895 1          
## 6 Dirty Harry                             1971     102        7.7     153463 0

summary_output <- summary(Netflix_IMDB_Data_Clean[ , c(-1, -2, -3)])
print(summary_output)

##   Release_year     Runtime         IMDB_score      IMDB_votes      ContentType
##  Min.   :1953   Min.   :  0.00   Min.   :1.500   Min.   :      5   0:3391     
##  1st Qu.:2015   1st Qu.: 46.00   1st Qu.:5.800   1st Qu.:    521   1:1876     
##  Median :2018   Median : 87.00   Median :6.600   Median :   2279              
##  Mean   :2016   Mean   : 79.31   Mean   :6.533   Mean   :  23407              
##  3rd Qu.:2020   3rd Qu.:106.00   3rd Qu.:7.400   3rd Qu.:  10144              
##  Max.   :2022   Max.   :235.00   Max.   :9.600   Max.   :2268288

Content type and release year.

We can see there are 3391 movies and 1876 TV shows in the clean data set, which shows a greater representation of movies available on Netflix.
From the summary we can see the earliest movie in the set was released in 1953 and the latest one in 2022, meaning the set only includes movies available to watch up to 2022. - We must also take into account the regional view settings that apply to Netflix, meaning a lot of the movies will not be available for us to watch in Slovenia, but we will still use the full set for the analysis.
The median release year is 2018, indicating that half of the shows and movies were released before 2018 and half of the shows and movies after were released 2018, which is logical as Netflix has recent movies and shows in the catalog.

Runtime

The data shows us the longest movie is 235 minutes long - No Longer Kids (1979), which is 3,9 hours long. We can not confidently say which movie or TV show is the shortest as the Minimum shown is 0, implying either imperfect data in the set or other anomalies in the content runtime.
The mean runtime is 79.31 minutes, indicating the content is somewhat skewed towards shorter movies or TV show episodes.

IMDB Score & votes

The IMDB scores range from 1.5 (reflecting very poorly rated titles) to 9.6 (highly acclaimed titles).
The median score of 6.6 suggests that half of the titles have a rating above this, while half have a rating below.
The number of votes ranges from as few as 5 votes to over 2.26 million votes. This large range shows a huge disparity in the popularity of content in the dataset.
The median number of votes is 2279, indicating that half of the titles received fewer than this amount of votes, suggesting a significant portion of the dataset consists of less popular or less widely viewed content.

Seperate analysis for movies & TV shows.

As movies and TV shows are two differing types of media, for my further analysis I will differentiate the two and analyse the data. First, we’ll see how the descriptive statistics & insights change as compared to the previous combined data.

I will now use the same type of filter as before to create two new datasets which include only movies and only TV shows.

Netflix_IMDB_Data_Clean_Movies <- Netflix_IMDB_Data_Clean %>%
  filter(ContentType == 0)

Netflix_IMDB_Data_Clean_TVShows <- Netflix_IMDB_Data_Clean %>%
  filter(ContentType == 1)

We can now run the same summary as before for each data set:

summary_output_Movies <- summary(Netflix_IMDB_Data_Clean_Movies[ , c(-1, -2, -3)])
print(summary_output_Movies)

##   Release_year     Runtime        IMDB_score      IMDB_votes      ContentType
##  Min.   :1953   Min.   :  8.0   Min.   :1.500   Min.   :      5   0:3391     
##  1st Qu.:2014   1st Qu.: 88.0   1st Qu.:5.600   1st Qu.:    573   1:   0     
##  Median :2018   Median :100.0   Median :6.400   Median :   2632              
##  Mean   :2015   Mean   :101.8   Mean   :6.265   Mean   :  26683              
##  3rd Qu.:2020   3rd Qu.:116.0   3rd Qu.:7.100   3rd Qu.:  12281              
##  Max.   :2022   Max.   :235.0   Max.   :9.000   Max.   :2268288

summary_output_TVShows <- summary(Netflix_IMDB_Data_Clean_TVShows[ , c(-1, -2, -3)])
print(summary_output_TVShows)

##   Release_year     Runtime         IMDB_score      IMDB_votes        ContentType
##  Min.   :1969   Min.   :  0.00   Min.   :1.600   Min.   :      5.0   0:   0     
##  1st Qu.:2016   1st Qu.: 24.00   1st Qu.:6.400   1st Qu.:    444.5   1:1876     
##  Median :2018   Median : 41.00   Median :7.200   Median :   1782.0              
##  Mean   :2017   Mean   : 38.63   Mean   :7.017   Mean   :  17485.6              
##  3rd Qu.:2020   3rd Qu.: 49.00   3rd Qu.:7.800   3rd Qu.:   7216.8              
##  Max.   :2022   Max.   :178.00   Max.   :9.600   Max.   :1727694.0

Differences Between Movies and TV Shows:

Looking at the two outputs we can pinpoint some differences

Runtime, IMDB Score & votes

The mean runtime for movies is 101.8 minutes, while TV Shows average at 38.63 minutes. We see that TV shows are much shorter on average compared to movies.
Movies have an average score of 6.27 and TV shows average at 7.02, which implies TV shows generally receive better ratings on average than movies.
Movies have an average of 26683 votes and TV shows average 17486, which shows movies typically attract more votes than TV shows on average.

Graphical analysis of movie and TV show data

Now we’ll attempt to discover some new insights by visualizing our data.
I’ll start by using a simple ggplot function to create a histogram

ggplot_MoviesTV <- ggplot(Netflix_IMDB_Data_Clean, aes(x = Type, fill = Type)) +
  geom_bar() +
  labs(title = "Distribution of Movies and TV Shows in data",
       x = "Type",
       y = "Count") +
  theme_minimal()
print(ggplot_MoviesTV)

As we saw in the descriptive analysis before, there are more movies than TV shows in the dataset (3391 vs. 1876).

Release date line graph

ggplot_MoviesTV_ReleaseYear <- Netflix_IMDB_Data_Clean %>%
  group_by(Release_year, Type) %>%
  summarise(Count = n(), .groups = 'drop') %>%
  ggplot(aes(x = Release_year, y = Count, color = Type)) +
  geom_line(linewidth = 1.5) +
  labs(title = "Distribution of Movies and TV Shows Over Release Year",
       x = "Release Year",
       y = "Count",
       color = "Type") +
  theme_minimal()
print(ggplot_MoviesTV_ReleaseYear)

Here we can see the distribution of movies and TV shows over the years.Both movies and TV shows show a significant increase in their production starting in the early 2000s, peaking around 2015. This growth aligns with the rise of streaming platforms, including Netflix, which has invested heavily in original content.

Task 2: MBA students data set analysis

First we have to import our data and use head to give us a quick overview.

MBA_Data_Excel <- read_excel("~/R-Take home exam/Business School.xlsx")
head(MBA_Data_Excel)

## # A tibble: 6 × 9
##   `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade` `Work Experience` `Employability (Before)`
##          <dbl> <chr>                          <dbl>       <dbl> <chr>                                <dbl>
## 1            1 Business                        68.4        90.2 No                                     252
## 2            2 Computer Science                70.2        68.7 Yes                                    101
## 3            3 Finance                         76.4        83.3 No                                     401
## 4            4 Business                        82.6        88.7 No                                     287
## 5            5 Finance                         76.9        75.4 No                                     275
## 6            6 Computer Science                83.3        82.1 No                                     254
## # ℹ 3 more variables: `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>

We can now format the data as a data frame.

MBA_Data <- as.data.frame(MBA_Data_Excel)

We can see from the data that there are 5 types of undergraduate degrees: Art, Business, Computer Science, Engineering and Finance, which we will have to factor for further analysis.

Before I do that I will rename the column to a one word item so it does not cause problems in functions.

colnames(MBA_Data)[2] <- "Degree"

MBA_Data$Degree <- factor(MBA_Data$Degree,
                                levels = c("Art", "Business", "Computer Science", "Engineering", "Finance"),
                                labels = c("Art", "Business", "Computer Science", "Engineering", "Finance"))

Distribution of undergraduate degrees among students:

ggplot_Undergraduate_Degrees <- ggplot(MBA_Data, aes(x = Degree)) +
  geom_bar(fill = "chocolate1") +
  labs(title = "Undergraduate degrees", 
       x = "Type of degree",
       y = "Frequency")
print(ggplot_Undergraduate_Degrees)

We can see the general distribution of degrees in the graph above, now let’s look at some statistical data:

For the purpose of easier analysis I removed the id, work experience and status columns.

MBA_Data_summary_output <- summary(MBA_Data[ , c(-1, -5, -8)])
print(MBA_Data_summary_output)

##               Degree   Undergrad Grade    MBA Grade     Employability (Before) Employability (After) Annual Salary   
##  Art             : 6   Min.   : 61.20   Min.   :58.14   Min.   :101.0          Min.   :119.0         Min.   : 20000  
##  Business        :35   1st Qu.: 71.47   1st Qu.:71.14   1st Qu.:245.8          1st Qu.:312.0         1st Qu.: 87125  
##  Computer Science:25   Median : 76.65   Median :76.38   Median :256.8          Median :435.6         Median :103500  
##  Engineering     : 9   Mean   : 76.90   Mean   :76.04   Mean   :257.9          Mean   :422.7         Mean   :109058  
##  Finance         :25   3rd Qu.: 81.70   3rd Qu.:82.15   3rd Qu.:261.0          3rd Qu.:529.0         3rd Qu.:124000  
##                        Max.   :100.00   Max.   :95.00   Max.   :421.0          Max.   :631.0         Max.   :340000

We can see the most common undergraduate degree was business with 35 students.

Annual salary:

Now lets look at the annual salary.

MBA_stat.desc.data <- round(stat.desc(MBA_Data$`Annual Salary`))
print(MBA_stat.desc.data)

##      nbr.val     nbr.null       nbr.na          min          max        range          sum       median         mean 
##          100            0            0        20000       340000       320000     10905800       103500       109058 
##      SE.mean CI.mean.0.95          var      std.dev     coef.var 
##         4150         8235   1722373475        41501            0

options(scipen = 10)
ggplot_Salary <- ggplot(MBA_Data, aes(x=`Annual Salary`)) + 
  geom_histogram(binwidth = 20000, colour="aliceblue", fill = "chocolate3") + 
  theme_dark() + 
  ylab("Frequency")
print(ggplot_Salary)

We can see the distribution is asymmetrical to the right, which is usually the case with salary data. There are also outliers with one person earning 340.000 anually.

Hypothesis testing: H0:u MBAGrade = 74.

We will be testing the following hypothesis:

H0: mean MBA grade is equal to 74 H1: mean MBA Grade is not equal to 74

MBA_t.test <- t.test(MBA_Data$`MBA Grade`,
       mu = 74,
       alternative = "two.sided")
print(MBA_t.test)

## 
##  One Sample t-test
## 
## data:  MBA_Data$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

We see that p=0.00915, and basing on that we can reject the null hypothesis H0 and accept the alternate hypothesis. We can predict (with 95% certainty) that the true mean is between 74.51764 and 77.56346.

Effect size

effectsize::cohens_d(MBA_Data$`MBA Grade`, mu=74)

## Cohen's d |       95% CI
## ------------------------
## 0.27      | [0.07, 0.46]
## 
## - Deviation from a difference of 74.

We see the Cohen’s deviation of 0.27, which is a small effect based on the Sawilowsky 2009 rate. - Such a small effect implies further analysis might not yield interesting results.

Task 3: Apartments

First, we import the dataset:

Apartment_Data <- read_excel("~/R-Take home exam/Apartments.xlsx")

head(Apartment_Data)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

We can describe the data as follows: - Age: How old an apartment is - Distance: The distance from the city centre (km) - Price: Price per square meter - Parking: 0 = no, 1 = yes - Balcony: 0 = no, 1 = yes

First we format the data as a data frame.

Apartments <- as.data.frame(Apartment_Data)
head(Apartments)

##   Age Distance Price Parking Balcony
## 1   7       28  1640       0       1
## 2  18        1  2800       1       0
## 3   7       28  1660       0       0
## 4  28       29  1850       0       1
## 5  18       18  1640       1       1
## 6  28       12  1770       0       1

Change categorical variables into factors.

Apartments$Parking_type <- factor(Apartments$Parking,
                              levels = c(0,1),
                              labels = c("No", "Yes"))

Apartments$Balcony_type <- factor(Apartments$Balcony,
                              levels = c(0,1),
                              labels = c("No", "Yes"))

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

Now we can test the hypothesis H0 = Mu_Price = 1900EUR using a t.test

Apartments_t.test <- t.test(Apartments$Price,
       mu = 1900,
       alternative = "two.sided")
print(Apartments_t.test)

## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

We see that p=0.004731, and basing on that we can reject the null hypothesis H0.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age,
           data = Apartments)

print(summary(fit1))

## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

The estimate of regression coefficient is -8.975. If the age of the apartment increases by one year (at p=0,034), the price of the apartment decreases by 8.975EUR on average, considering all other factors remain the same.

Show the scaterrplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

Apartments_scatterplot <- scatterplotMatrix(Apartments[,c(-4,-5,-6,-7)],
                  smooth = FALSE)

print(Apartments_scatterplot)

## NULL

Based on the matrix we can not spot multicolinearity, as the regression function is not steep. Although from just the matrix we can not conclude a clear definite observation.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data = Apartments)
print(summary(fit2))

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value        Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10         < 2e-16 ***
## Age           -7.934      3.225   -2.46           0.016 *  
## Distance     -20.667      2.748   -7.52 0.0000000000618 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 0.00000000004896

Chech the multicolinearity with VIF statistics. Explain the findings.

print(vif(fit2))

##      Age Distance 
## 1.001845 1.001845

print(mean(vif(fit2)))

## [1] 1.001845

The VIF statistic shows there should not be a problem with multicolinearity as both statistics are very close to 1 (and very far away from 5).

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

Apartments$StdResid <- round(rstandard(fit2), 3)
Apartments$CooksDistance <- round(cooks.distance(fit2), 3)

Apartments_Resid_Distance <- hist(Apartments$StdResid,
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Distribution of Standardized residuals")

Apartments_CooksDistance <- hist(Apartments$CooksDistance,
     main = "Distribution of Cooks distances",
     xlab = "Cooks distances",
     ylab = "Frequency")

print(Apartments_CooksDistance)

## $breaks
## [1] 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35
## 
## $counts
## [1] 80  3  1  0  0  0  1
## 
## $density
## [1] 18.8235294  0.7058824  0.2352941  0.0000000  0.0000000  0.0000000  0.2352941
## 
## $mids
## [1] 0.025 0.075 0.125 0.175 0.225 0.275 0.325
## 
## $xname
## [1] "Apartments$CooksDistance"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

We can not spot any outliers in the graph as there is no standardized residual greater than +3 or smaller than -3. So we did not have to remove any problematic units.

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

fit2 <- lm(Price ~ Age + Distance,
           data = Apartments)

Apartments$StdFitValues <- scale(fit2$fitted.values)

Heteroskedasticity_Apartments <- ggplot(Apartments,
       aes(x=StdFitValues, y=StdResid)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
print(Heteroskedasticity_Apartments)

## `geom_smooth()` using formula = 'y ~ x'

Based on the observations in the scatterplot, the values are heteroskedastic.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

ggplot_Apartments <- ggplot(Apartments,
       aes(x=StdResid)) +
  geom_histogram(color = "gray") +
  labs(title = "Histogram of standardized residuals",
       x = "Standardized residuals",
       y = "Frequency")
print(ggplot_Apartments)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

print(shapiro.test(Apartments$StdResid))

## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdResid
## W = 0.95303, p-value = 0.003645

Due to the graph having a seemingly random distribution we have to conduct a Shapiro-Wilk normality test. The p value is 0.003645 so we can reject H0 and we can assume that the distribution of errors is not normal.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

fit2 <- lm(Price ~ Age + Distance,
           data = Apartments)

print(summary(fit2))

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value        Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10         < 2e-16 ***
## Age           -7.934      3.225   -2.46           0.016 *  
## Distance     -20.667      2.748   -7.52 0.0000000000618 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 0.00000000004896

Age (-7.934): If the age increases by 1 year, the price decreases on average by 7,934 EUR per square meter.
Distance (-20.677): If the distance from the city centre increases by 1 kilometer, the price decreases on average by 20,667 EUR per square meter.
R-Squared (0.4396 = 43,96%) Relationship between price, age affect the variability of apartment price by 43,96%

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
           data = Apartments)

With function anova check if model fit3 fits data better than model fit2.

print(anova (fit2, fit3))

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can test the following: H0 = First model is better than second H1 = First model is not better than second

We reject H0 (p=0,01007) and conclude the second model is statistically better.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

print(summary(fit3))

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value      Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415       < 2e-16 ***
## Age           -6.799      3.110  -2.186       0.03172 *  
## Distance     -18.045      2.758  -6.543 0.00000000528 ***
## Parking      196.168     62.868   3.120       0.00251 ** 
## Balcony        1.935     60.014   0.032       0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 0.00000000001849

Looking at the explanitory values we see the R-Squared has increased, suggesting the model is better. Age and Distance both negatively influence price, with coefficients of -6.799 and -18.045. We can also see that the apartment having a balcony does not have a statistically significant impact on the price (p = 0.97436).

Save fitted values and calculate the residual for apartment ID2.

Apartments$Fitted <- fitted(fit3) 
Apartments$Residuals <- residuals(fit3)

print(head(Apartments[colnames(Apartments) %in% c("StdFitValues", "Residuals")]))

##   StdFitValues  Residuals
## 1   -0.7706550 -110.74150
## 2    1.1084679  442.58893
## 3   -0.7706550  -88.80674
## 4   -1.5182915  260.07897
## 5   -0.2940493 -412.57579
## 6   -0.1157744 -126.69107

Residual for 2nd row (ID 2 apartment) is 442,58893.