data(package = .packages(all.available = TRUE))
library(MASS)
mydata <- force(painters)
head(mydata)
## Composition Drawing Colour Expression School
## Da Udine 10 8 16 3 A
## Da Vinci 15 16 4 14 A
## Del Piombo 8 13 16 7 A
## Del Sarto 12 16 9 8 A
## Fr. Penni 0 15 8 0 A
## Guilio Romano 15 16 4 14 A
Description:
Population: 54 classical painters
Unit: 1 classical painter
Composition: Composition score (1-20 scale)
Drawing: Drawing score (1-20 scale)
Colour: Colour score (1-20 scale)
Expression: Expression score (1-20 scale)
School: The school to which a painter belongs: “A”: Renaissance; “B”: Mannerist; “C”: Seicento; “D”: Venetian; “E”: Lombard; “F”: Sixteenth Century; “G”: Seventeenth Century; “H”: French.
mydata$School <- factor(mydata$School,
levels = c("A", "B","C", "D", "E", "F", "G", "H"),
labels = c("Renaissance", "Mannerist", "Seicento", "Venetian", "Lombard", "Sixteenth Century", "Seventeenth Century", "French"))
head(mydata)
## Composition Drawing Colour Expression School
## Da Udine 10 8 16 3 Renaissance
## Da Vinci 15 16 4 14 Renaissance
## Del Piombo 8 13 16 7 Renaissance
## Del Sarto 12 16 9 8 Renaissance
## Fr. Penni 0 15 8 0 Renaissance
## Guilio Romano 15 16 4 14 Renaissance
summary(mydata[c(-5)])
## Composition Drawing Colour Expression
## Min. : 0.00 Min. : 6.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 8.25 1st Qu.:10.00 1st Qu.: 7.25 1st Qu.: 4.000
## Median :12.50 Median :13.50 Median :10.00 Median : 6.000
## Mean :11.56 Mean :12.46 Mean :10.94 Mean : 7.667
## 3rd Qu.:15.00 3rd Qu.:15.00 3rd Qu.:16.00 3rd Qu.:11.500
## Max. :18.00 Max. :18.00 Max. :18.00 Max. :18.000
The maximum score for composition is 18.00.
The minimum score for drawing is 6.
25 % of clssical painters scored Colour above 16.00.
25 % of painters had an Expression score of 4 or lower.
50 % of painters had a Composition score same or above the 12.5.
The range for a Drawing score is 12.
mydata$FrenchSchool <- ifelse(mydata$School == "French", 1, 0)
head(mydata)
## Composition Drawing Colour Expression School FrenchSchool
## Da Udine 10 8 16 3 Renaissance 0
## Da Vinci 15 16 4 14 Renaissance 0
## Del Piombo 8 13 16 7 Renaissance 0
## Del Sarto 12 16 9 8 Renaissance 0
## Fr. Penni 0 15 8 0 Renaissance 0
## Guilio Romano 15 16 4 14 Renaissance 0
Decsription:
1 = painter belongs to the French school
0 = painter does not belong to French school
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
filter(!School == "French")
tail(mydata)
## Composition Drawing Colour Expression School
## J. Jordaens 10 8 16 6 Seventeenth Century
## Otho Venius 13 14 10 10 Seventeenth Century
## Rembrandt 15 6 17 12 Seventeenth Century
## Rubens 18 13 17 17 Seventeenth Century
## Teniers 15 12 13 6 Seventeenth Century
## Van Dyck 15 10 17 13 Seventeenth Century
## FrenchSchool
## J. Jordaens 0
## Otho Venius 0
## Rembrandt 0
## Rubens 0
## Teniers 0
## Van Dyck 0
summary(mydata[c(-5,-6)])
## Composition Drawing Colour Expression
## Min. : 0.00 Min. : 6.00 Min. : 0.0 Min. : 0.00
## 1st Qu.: 8.00 1st Qu.:10.00 1st Qu.: 8.0 1st Qu.: 4.00
## Median :12.00 Median :13.00 Median :11.0 Median : 6.00
## Mean :11.36 Mean :12.34 Mean :11.3 Mean : 7.28
## 3rd Qu.:15.00 3rd Qu.:15.00 3rd Qu.:16.0 3rd Qu.:10.00
## Max. :18.00 Max. :18.00 Max. :18.0 Max. :18.00
We can see that some data has changed for example Median for Composition and 1st Qu for Colour.
Now 50 % of painters have a Composition score same or above the 12.00.
25 % of painters had an Colour score of 8 or lower.
mydata$Numeric_School <- factor(mydata$School,
levels = c("Renaissance", "Mannerist", "Seicento", "Venetian", "Lombard", "Sixteenth Century", "Seventeenth Century"),
labels = c(1, 2, 3, 4, 5, 6, 7))
library(ggplot2)
ggplot(mydata, aes(x=School, y=Composition)) +
geom_boxplot() +
xlab("Type of School")
Explanation:
Outliers indicate painters who were exceptionally high or low compared to their peers in the same school. We can see that we have on both sides outliers in Lombard school and on one side in Sixteenth Century school.
Renaissance school is the only one, which has minimum at 0.
library(readxl)
task2_data <- read_xlsx("./Business School.xlsx")
task2_data <- as.data.frame(task2_data)
head(task2_data)
## Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience
## 1 1 Business 68.4 90.2 No
## 2 2 Computer Science 70.2 68.7 Yes
## 3 3 Finance 76.4 83.3 No
## 4 4 Business 82.6 88.7 No
## 5 5 Finance 76.9 75.4 No
## 6 6 Computer Science 83.3 82.1 No
## Employability (Before) Employability (After) Status Annual Salary
## 1 252 276 Placed 111000
## 2 101 119 Placed 107000
## 3 401 462 Placed 109000
## 4 287 342 Placed 148000
## 5 275 347 Placed 255500
## 6 254 313 Placed 103500
ggplot(task2_data, aes(x = `Undergrad Degree`)) +
geom_bar(fill = "steelblue") +
theme_minimal() +
labs(title = "Distribution of Undergraduate Degrees",
x = "Undergraduate Degree",
y = "Count") +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
Answer:
Most common degree is Business.
summary(task2_data$`Annual Salary`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20000 87125 103500 109058 124000 340000
sd(task2_data$`Annual Salary`, na.rm = TRUE)
## [1] 41501.49
ggplot(task2_data, aes(x = `Annual Salary`)) +
geom_histogram(binwidth = 5000, fill = "steelblue", color = "black") +
theme_minimal() +
labs(title = "Distribution of Annual Salary",
x = "Annual Salary",
y = "Frequency")
Description:
We have a distribution, which is asymetrical to the right. On histogram we can see two outliers on the right side.
t_test_result <- t.test(task2_data$`MBA Grade`, mu = 74)
t_test_result
##
## One Sample t-test
##
## data: task2_data$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
H0 = The population mean MBA Grade = 74
H1 = The population mean MBA Grade ≠ 74
We can reject H0, because p<0,05, so the popolation meand MBA Grade is not equal to 74.
library(readxl)
task3_data <- read_xlsx("./Apartments.xlsx")
task3_data <- as.data.frame(task3_data)
head(task3_data)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
task3_data$Parking <- factor(task3_data$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
task3_data$Balcony <- factor(task3_data$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
t.test(task3_data$Price, mu = 1900)
##
## One Sample t-test
##
## data: task3_data$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
We can reject H0, because p<0,05, so the Mu_Price ≠ 1900 EUR.
fit1 <- lm(task3_data$Price ~ task3_data$Age, data = task3_data)
summary(fit1)
##
## Call:
## lm(formula = task3_data$Price ~ task3_data$Age, data = task3_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## task3_data$Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor(task3_data$Price, task3_data$Age)
## [1] -0.230255
If the apartment is older by one year, in average the price per m2 in average decrease by 8,98 EUR.
5.3% of the variation in Price is explained by linear affect of Age.
There is a weak negative correlation between Price and Age. As Age increases, Price tends to decrease slightly.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
scatterplotMatrix(task3_data[ , c(-4, -5)],
smooth = FALSE)
Based on the matrix, we do not have problem with multicolinearity.
fit2 <- lm(task3_data$Price ~ task3_data$Age + task3_data$Distance, data = task3_data)
summary(fit2)
##
## Call:
## lm(formula = task3_data$Price ~ task3_data$Age + task3_data$Distance,
## data = task3_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## task3_data$Age -7.934 3.225 -2.46 0.016 *
## task3_data$Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
library(car)
vif(fit2)
## task3_data$Age task3_data$Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
Because the mean VIF is bellow the 5, we do not have a problem with multicolinearity.
task3_data$StdResid <- round(rstandard(fit2),3)
task3_data$CooksD <- round(cooks.distance(fit2), 3)
hist(task3_data$StdResid,
xlab = "Standardized resiudals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(task3_data$StdResid)
##
## Shapiro-Wilk normality test
##
## data: task3_data$StdResid
## W = 0.95303, p-value = 0.003645
hist(task3_data$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
task3_data$NewID <- paste0("ID", 1:nrow(task3_data))
library(dplyr)
task3_data <- task3_data %>%
filter(!NewID == "ID38")
fit2 <- lm(task3_data$Price ~ task3_data$Age + task3_data$Distance, data = task3_data)
task3_data$StdResid <- round(rstandard(fit2),3)
task3_data$CooksD <- round(cooks.distance(fit2), 3)
hist(task3_data$StdResid,
xlab = "Standardized resiudals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(task3_data$StdResid)
##
## Shapiro-Wilk normality test
##
## data: task3_data$StdResid
## W = 0.95649, p-value = 0.006355
hist(task3_data$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
fit2$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(y = fit2$StdResid, x = fit2$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
According to scatterplot, we can say that we have potential homoskedasticity.
task3_data$StdResid <- round(rstandard(fit2), 3)
task3_data$CooksD <- round(cooks.distance(fit2), 3)
hist(task3_data$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(task3_data$StdResid)
##
## Shapiro-Wilk normality test
##
## data: task3_data$StdResid
## W = 0.95649, p-value = 0.006355
H₀: residuals are normally distributed
H₁: residuals are not normally distributed
According to histogram, we have one outlier. But we can reject the H0, so residuals are not normally distributed (p<0,05).
fit2 <- lm(task3_data$Price ~ task3_data$Age + task3_data$Distance, data = task3_data)
summary(fit2)
##
## Call:
## lm(formula = task3_data$Price ~ task3_data$Age + task3_data$Distance,
## data = task3_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 2e-16 ***
## task3_data$Age -6.464 3.159 -2.046 0.044 *
## task3_data$Distance -22.955 2.786 -8.240 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 2.339e-12
For each additional year of age of the apartment, the expected Price decreases by about 6.46 EUR, keeping Distance constant.
p-value = 0.044 → significant at the 5% level (so Age has a statistically meaningful effect, but weaker than Distance).
Extremely significant (p < 0.001), so distance is a strong predictor of Price.
For each additional unit of Distance, the Price decreases by about 22.96 EUR, holding Age constant.
On average, the observed prices deviate from the regression line by about 276 EUR.
About 48.4% of the variation in Price is explained by Age and Distance.
Adjusted for the number of predictors, about 47.1% of the variation is explained.
H₀: All slope coefficients = 0 (Age and Distance have no joint effect)
H₁: At least one slope coefficient ≠ 0 (Age and Distance have joint effect)
p< 05 and we reject H0, so age and Distance jointly explain a significant part of the Price variation.
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = task3_data)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = task3_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2329.724 93.066 25.033 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## ParkingYes 167.531 62.864 2.665 0.00933 **
## BalconyYes -15.207 59.201 -0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
anova(fit2, fit3)
## Warning in anova.lmlist(object, ...): models with response '"Price"' removed
## because response differs from model 1
## Analysis of Variance Table
##
## Response: task3_data$Price
## Df Sum Sq Mean Sq F value Pr(>F)
## task3_data$Age 1 611157 611157 8.0145 0.005851 **
## task3_data$Distance 1 5178032 5178032 67.9029 2.523e-12 ***
## Residuals 81 6176767 76256
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = task3_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2329.724 93.066 25.033 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## ParkingYes 167.531 62.864 2.665 0.00933 **
## BalconyYes -15.207 59.201 -0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
Given the age and distance, apartments with parking have a higher price by 147,83 EUR per m2 on overage than apartments without it.
Given the age and distance, apartments with balcony have a higher price by 6,4 EUR per m2 on overage than apartments without it.
H0: p = 0
H1: p ≠ 0
We find that the population coefficient of determination is greater than zero, which means that we have found a linear relationship between price of apartment, age, distance, having parking and having balcony.
task3_data$fitted <- fitted(fit3)
task3_data$residual <- resid(fit3)
task3_data[task3_data$NewID == "ID2", ]
## Age Distance Price Parking Balcony StdResid CooksD NewID fitted residual
## 2 18 1 2800 Yes No 1.775 0.031 ID2 2372.197 427.8029