Jakob Pirs

Task 1

Importing data

data(package = .packages(all.available = TRUE))
library(MASS)
mydata <- force(painters)

head(mydata)
##               Composition Drawing Colour Expression School
## Da Udine               10       8     16          3      A
## Da Vinci               15      16      4         14      A
## Del Piombo              8      13     16          7      A
## Del Sarto              12      16      9          8      A
## Fr. Penni               0      15      8          0      A
## Guilio Romano          15      16      4         14      A

Description:

  • Population: 54 classical painters

  • Unit: 1 classical painter

  • Composition: Composition score (1-20 scale)

  • Drawing: Drawing score (1-20 scale)

  • Colour: Colour score (1-20 scale)

  • Expression: Expression score (1-20 scale)

  • School: The school to which a painter belongs: “A”: Renaissance; “B”: Mannerist; “C”: Seicento; “D”: Venetian; “E”: Lombard; “F”: Sixteenth Century; “G”: Seventeenth Century; “H”: French.

Rename the variable

mydata$School <- factor(mydata$School,
                         levels = c("A", "B","C", "D", "E", "F", "G", "H"),
                         labels = c("Renaissance", "Mannerist", "Seicento", "Venetian", "Lombard", "Sixteenth Century", "Seventeenth Century", "French"))

head(mydata)
##               Composition Drawing Colour Expression      School
## Da Udine               10       8     16          3 Renaissance
## Da Vinci               15      16      4         14 Renaissance
## Del Piombo              8      13     16          7 Renaissance
## Del Sarto              12      16      9          8 Renaissance
## Fr. Penni               0      15      8          0 Renaissance
## Guilio Romano          15      16      4         14 Renaissance
summary(mydata[c(-5)])
##   Composition       Drawing          Colour        Expression    
##  Min.   : 0.00   Min.   : 6.00   Min.   : 0.00   Min.   : 0.000  
##  1st Qu.: 8.25   1st Qu.:10.00   1st Qu.: 7.25   1st Qu.: 4.000  
##  Median :12.50   Median :13.50   Median :10.00   Median : 6.000  
##  Mean   :11.56   Mean   :12.46   Mean   :10.94   Mean   : 7.667  
##  3rd Qu.:15.00   3rd Qu.:15.00   3rd Qu.:16.00   3rd Qu.:11.500  
##  Max.   :18.00   Max.   :18.00   Max.   :18.00   Max.   :18.000

The maximum score for composition is 18.00.

The minimum score for drawing is 6.

25 % of clssical painters scored Colour above 16.00.

25 % of painters had an Expression score of 4 or lower.

50 % of painters had a Composition score same or above the 12.5.

The range for a Drawing score is 12.

Adding new variable with if function

mydata$FrenchSchool <- ifelse(mydata$School == "French", 1, 0)

head(mydata)
##               Composition Drawing Colour Expression      School FrenchSchool
## Da Udine               10       8     16          3 Renaissance            0
## Da Vinci               15      16      4         14 Renaissance            0
## Del Piombo              8      13     16          7 Renaissance            0
## Del Sarto              12      16      9          8 Renaissance            0
## Fr. Penni               0      15      8          0 Renaissance            0
## Guilio Romano          15      16      4         14 Renaissance            0

Decsription:

  • 1 = painter belongs to the French school

  • 0 = painter does not belong to French school

Removing painters who belong to French school

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata <- mydata %>%
  filter(!School == "French")

tail(mydata)
##             Composition Drawing Colour Expression              School
## J. Jordaens          10       8     16          6 Seventeenth Century
## Otho Venius          13      14     10         10 Seventeenth Century
## Rembrandt            15       6     17         12 Seventeenth Century
## Rubens               18      13     17         17 Seventeenth Century
## Teniers              15      12     13          6 Seventeenth Century
## Van Dyck             15      10     17         13 Seventeenth Century
##             FrenchSchool
## J. Jordaens            0
## Otho Venius            0
## Rembrandt              0
## Rubens                 0
## Teniers                0
## Van Dyck               0
summary(mydata[c(-5,-6)])
##   Composition       Drawing          Colour       Expression   
##  Min.   : 0.00   Min.   : 6.00   Min.   : 0.0   Min.   : 0.00  
##  1st Qu.: 8.00   1st Qu.:10.00   1st Qu.: 8.0   1st Qu.: 4.00  
##  Median :12.00   Median :13.00   Median :11.0   Median : 6.00  
##  Mean   :11.36   Mean   :12.34   Mean   :11.3   Mean   : 7.28  
##  3rd Qu.:15.00   3rd Qu.:15.00   3rd Qu.:16.0   3rd Qu.:10.00  
##  Max.   :18.00   Max.   :18.00   Max.   :18.0   Max.   :18.00

We can see that some data has changed for example Median for Composition and 1st Qu for Colour.

Now 50 % of painters have a Composition score same or above the 12.00.

25 % of painters had an Colour score of 8 or lower.

Adding new variable with factor

mydata$Numeric_School <- factor(mydata$School,
                         levels = c("Renaissance", "Mannerist", "Seicento", "Venetian", "Lombard", "Sixteenth Century", "Seventeenth Century"),
                         labels = c(1, 2, 3, 4, 5, 6, 7))

Makig boxplots

library(ggplot2)
ggplot(mydata, aes(x=School, y=Composition)) +
  geom_boxplot() +
  xlab("Type of School")

Explanation:

Outliers indicate painters who were exceptionally high or low compared to their peers in the same school. We can see that we have on both sides outliers in Lombard school and on one side in Sixteenth Century school.

Renaissance school is the only one, which has minimum at 0.

Task 2

library(readxl)
task2_data <- read_xlsx("./Business School.xlsx")

task2_data <- as.data.frame(task2_data)

head(task2_data)
##   Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience
## 1          1         Business            68.4      90.2              No
## 2          2 Computer Science            70.2      68.7             Yes
## 3          3          Finance            76.4      83.3              No
## 4          4         Business            82.6      88.7              No
## 5          5          Finance            76.9      75.4              No
## 6          6 Computer Science            83.3      82.1              No
##   Employability (Before) Employability (After) Status Annual Salary
## 1                    252                   276 Placed        111000
## 2                    101                   119 Placed        107000
## 3                    401                   462 Placed        109000
## 4                    287                   342 Placed        148000
## 5                    275                   347 Placed        255500
## 6                    254                   313 Placed        103500
ggplot(task2_data, aes(x = `Undergrad Degree`)) +
  geom_bar(fill = "steelblue") +
  theme_minimal() +
  labs(title = "Distribution of Undergraduate Degrees",
       x = "Undergraduate Degree",
       y = "Count") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

Answer:

Most common degree is Business.

summary(task2_data$`Annual Salary`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20000   87125  103500  109058  124000  340000
sd(task2_data$`Annual Salary`, na.rm = TRUE)
## [1] 41501.49
ggplot(task2_data, aes(x = `Annual Salary`)) +
  geom_histogram(binwidth = 5000, fill = "steelblue", color = "black") +
  theme_minimal() +
  labs(title = "Distribution of Annual Salary",
       x = "Annual Salary",
       y = "Frequency")

Description:

We have a distribution, which is asymetrical to the right. On histogram we can see two outliers on the right side.

t_test_result <- t.test(task2_data$`MBA Grade`, mu = 74)

t_test_result
## 
##  One Sample t-test
## 
## data:  task2_data$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

H0 = The population mean MBA Grade = 74

H1 = The population mean MBA Grade ≠ 74

We can reject H0, because p<0,05, so the popolation meand MBA Grade is not equal to 74.

Task 3

Import the dataset Apartments.xlsx

library(readxl)
task3_data <- read_xlsx("./Apartments.xlsx")

task3_data <- as.data.frame(task3_data)

head(task3_data)
##   Age Distance Price Parking Balcony
## 1   7       28  1640       0       1
## 2  18        1  2800       1       0
## 3   7       28  1660       0       0
## 4  28       29  1850       0       1
## 5  18       18  1640       1       1
## 6  28       12  1770       0       1

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

task3_data$Parking <- factor(task3_data$Parking,
                         levels = c(0, 1),
                         labels = c("No", "Yes"))

task3_data$Balcony <- factor(task3_data$Balcony,
                         levels = c(0, 1),
                         labels = c("No", "Yes"))

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

t.test(task3_data$Price, mu = 1900)
## 
##  One Sample t-test
## 
## data:  task3_data$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

We can reject H0, because p<0,05, so the Mu_Price ≠ 1900 EUR.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(task3_data$Price ~ task3_data$Age, data = task3_data)

summary(fit1)
## 
## Call:
## lm(formula = task3_data$Price ~ task3_data$Age, data = task3_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2185.455     87.043  25.108   <2e-16 ***
## task3_data$Age   -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
cor(task3_data$Price, task3_data$Age)
## [1] -0.230255

If the apartment is older by one year, in average the price per m2 in average decrease by 8,98 EUR.

5.3% of the variation in Price is explained by linear affect of Age.

There is a weak negative correlation between Price and Age. As Age increases, Price tends to decrease slightly.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
scatterplotMatrix(task3_data[ , c(-4, -5)],
                  smooth = FALSE)

Based on the matrix, we do not have problem with multicolinearity.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(task3_data$Price ~ task3_data$Age + task3_data$Distance, data = task3_data)

summary(fit2)
## 
## Call:
## lm(formula = task3_data$Price ~ task3_data$Age + task3_data$Distance, 
##     data = task3_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2460.101     76.632   32.10  < 2e-16 ***
## task3_data$Age        -7.934      3.225   -2.46    0.016 *  
## task3_data$Distance  -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Chech the multicolinearity with VIF statistics. Explain the findings.

library(car)
vif(fit2)
##      task3_data$Age task3_data$Distance 
##            1.001845            1.001845
mean(vif(fit2))
## [1] 1.001845

Because the mean VIF is bellow the 5, we do not have a problem with multicolinearity.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

task3_data$StdResid <- round(rstandard(fit2),3)
task3_data$CooksD <- round(cooks.distance(fit2), 3)

hist(task3_data$StdResid,
     xlab = "Standardized resiudals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

shapiro.test(task3_data$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  task3_data$StdResid
## W = 0.95303, p-value = 0.003645
hist(task3_data$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

task3_data$NewID <- paste0("ID", 1:nrow(task3_data))
library(dplyr)
task3_data <- task3_data %>%
  filter(!NewID == "ID38")
fit2 <- lm(task3_data$Price ~ task3_data$Age + task3_data$Distance, data = task3_data)

task3_data$StdResid <- round(rstandard(fit2),3)
task3_data$CooksD <- round(cooks.distance(fit2), 3)

hist(task3_data$StdResid,
     xlab = "Standardized resiudals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

shapiro.test(task3_data$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  task3_data$StdResid
## W = 0.95649, p-value = 0.006355
hist(task3_data$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

fit2$StdFitted <- scale(fit2$fitted.values)

library(car)
scatterplot(y = fit2$StdResid, x = fit2$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

According to scatterplot, we can say that we have potential homoskedasticity.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

task3_data$StdResid <- round(rstandard(fit2), 3) 
task3_data$CooksD <- round(cooks.distance(fit2), 3) 

hist(task3_data$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(task3_data$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  task3_data$StdResid
## W = 0.95649, p-value = 0.006355

H₀: residuals are normally distributed

H₁: residuals are not normally distributed

According to histogram, we have one outlier. But we can reject the H0, so residuals are not normally distributed (p<0,05).

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

fit2 <- lm(task3_data$Price ~ task3_data$Age + task3_data$Distance, data = task3_data)

summary(fit2)
## 
## Call:
## lm(formula = task3_data$Price ~ task3_data$Age + task3_data$Distance, 
##     data = task3_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2456.076     73.931  33.221  < 2e-16 ***
## task3_data$Age        -6.464      3.159  -2.046    0.044 *  
## task3_data$Distance  -22.955      2.786  -8.240 2.52e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 2.339e-12

For each additional year of age of the apartment, the expected Price decreases by about 6.46 EUR, keeping Distance constant.

p-value = 0.044 → significant at the 5% level (so Age has a statistically meaningful effect, but weaker than Distance).

Extremely significant (p < 0.001), so distance is a strong predictor of Price.

For each additional unit of Distance, the Price decreases by about 22.96 EUR, holding Age constant.

On average, the observed prices deviate from the regression line by about 276 EUR.

About 48.4% of the variation in Price is explained by Age and Distance.

Adjusted for the number of predictors, about 47.1% of the variation is explained.

H₀: All slope coefficients = 0 (Age and Distance have no joint effect)

H₁: At least one slope coefficient ≠ 0 (Age and Distance have joint effect)

p< 05 and we reject H0, so age and Distance jointly explain a significant part of the Price variation.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = task3_data)

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = task3_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2329.724     93.066  25.033  < 2e-16 ***
## Age           -5.821      3.074  -1.894  0.06190 .  
## Distance     -20.279      2.886  -7.026 6.66e-10 ***
## ParkingYes   167.531     62.864   2.665  0.00933 ** 
## BalconyYes   -15.207     59.201  -0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)
## Warning in anova.lmlist(object, ...): models with response '"Price"' removed
## because response differs from model 1
## Analysis of Variance Table
## 
## Response: task3_data$Price
##                     Df  Sum Sq Mean Sq F value    Pr(>F)    
## task3_data$Age       1  611157  611157  8.0145  0.005851 ** 
## task3_data$Distance  1 5178032 5178032 67.9029 2.523e-12 ***
## Residuals           81 6176767   76256                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = task3_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2329.724     93.066  25.033  < 2e-16 ***
## Age           -5.821      3.074  -1.894  0.06190 .  
## Distance     -20.279      2.886  -7.026 6.66e-10 ***
## ParkingYes   167.531     62.864   2.665  0.00933 ** 
## BalconyYes   -15.207     59.201  -0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12

Given the age and distance, apartments with parking have a higher price by 147,83 EUR per m2 on overage than apartments without it.

Given the age and distance, apartments with balcony have a higher price by 6,4 EUR per m2 on overage than apartments without it.

H0: p = 0

H1: p ≠ 0

We find that the population coefficient of determination is greater than zero, which means that we have found a linear relationship between price of apartment, age, distance, having parking and having balcony.

Save fitted values and claculate the residual for apartment ID2.

task3_data$fitted  <- fitted(fit3)
task3_data$residual <- resid(fit3)

task3_data[task3_data$NewID == "ID2", ]
##   Age Distance Price Parking Balcony StdResid CooksD NewID   fitted residual
## 2  18        1  2800     Yes      No    1.775  0.031   ID2 2372.197 427.8029