mydata <- force(airquality)
head(mydata)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
This data set measures the daily air quality in New York from May to September 1973.
Explanation of variables:
summary(mydata)
## Ozone Solar.R Wind Temp Month Day
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00 Min. :5.000 Min. : 1.0
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00 1st Qu.:6.000 1st Qu.: 8.0
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00 Median :7.000 Median :16.0
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88 Mean :6.993 Mean :15.8
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00 Max. :9.000 Max. :31.0
## NA's :37 NA's :7
mydata$Temp <- (mydata$Temp - 32)/(9/5)
Temp_round <- round(mydata$Temp)
mydata$Temp <- Temp_round
mydata$Wind <- (mydata$Wind * 1.6)
Wind_round <- round(mydata$Wind,1)
mydata$Wind <- Wind_round
head(mydata)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 11.8 19 5 1
## 2 36 118 12.8 22 5 2
## 3 12 149 20.2 23 5 3
## 4 18 313 18.4 17 5 4
## 5 NA NA 22.9 13 5 5
## 6 28 NA 23.8 19 5 6
We changed the unit of Temperature to degrees Celsius and Wind to kph for easier understanding.
library(tidyr)
mydata_clean <- drop_na(mydata)
head(mydata_clean)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 11.8 19 5 1
## 2 36 118 12.8 22 5 2
## 3 12 149 20.2 23 5 3
## 4 18 313 18.4 17 5 4
## 5 23 299 13.8 18 5 7
## 6 19 99 22.1 15 5 8
We removed NA from the sample.
mydata_clean$WindChill <- round((10 * sqrt(mydata_clean$Wind) - mydata_clean$Wind + 10.5) * (33 - mydata_clean$Temp), 2)
We makd a new variable based on sampled data.
mydata_clean$MonthFactor <- factor(mydata_clean$Month,
levels = c(5, 6, 7, 8, 9),
labels = c("May", "June", "July", "August", "September"))
We can change the Month variables into factorial.
mydata2 <- mydata_clean[,c(1,2,3,4, 7, 6, 5, 8)]
head(mydata2)
## Ozone Solar.R Wind Temp WindChill Day Month MonthFactor
## 1 41 190 11.8 19 462.72 1 5 May
## 2 36 118 12.8 22 368.25 2 5 May
## 3 12 149 20.2 23 352.44 3 5 May
## 4 18 313 18.4 17 559.92 4 5 May
## 5 23 299 13.8 18 507.73 7 5 May
## 6 19 99 22.1 15 637.39 8 5 May
library(pastecs)
##
## Attaching package: 'pastecs'
## The following object is masked from 'package:tidyr':
##
## extract
round(stat.desc(mydata2[ , c(-6, -7, -8)]), 2)
## Ozone Solar.R Wind Temp WindChill
## nbr.val 111.00 111.00 111.00 111.00 111.00
## nbr.null 0.00 0.00 0.00 0.00 4.00
## nbr.na 0.00 0.00 0.00 0.00 0.00
## min 1.00 7.00 3.70 14.00 -103.11
## max 168.00 334.00 33.10 36.00 671.36
## range 167.00 327.00 29.40 22.00 774.47
## sum 4673.00 20513.00 1764.90 2821.00 28989.97
## median 31.00 207.00 15.50 26.00 225.26
## mean 42.10 184.80 15.90 25.41 261.17
## SE.mean 3.16 8.65 0.54 0.50 17.55
## CI.mean.0.95 6.26 17.15 1.07 1.00 34.78
## var 1107.29 8308.74 32.41 28.04 34187.75
## std.dev 33.28 91.15 5.69 5.30 184.90
## coef.var 0.79 0.49 0.36 0.21 0.71
The mean of the ozone level is between 35.84 and 48.36 with a 95% significance level.
The temperature when the sample was taken fluctuated between 14°C and 36°C.
The mean temperature is 25.41°C and the median temperature is 26°C. Since they are not much different we can presume normal distribution of samples.
For the 50% of the time Wind Chill was below 225.26.
The maximum difference between two levels of Wind Chill is 774.47.
Ozone level had the biggest variability of the tested variables.
library(ggplot2)
ggplot(mydata, aes(y=Temp, x=seq(from = 1, to = 153))) +
geom_line() +
ylab("Temperature") +
xlab("Day of study")
The graph shows Temperature fluctuations during the days of study.
library(ggplot2)
ggplot(mydata2, aes(x=Ozone)) +
geom_histogram(binwidth = 5, color="grey", fill="orange") +
ylab("Frequency")
We can see that Ozone level is asymmetrical to the right. The higher Ozone level is less frequent. We can spot potential outliers.
library(car)
## Loading required package: carData
scatterplot(y = mydata2$Temp,
x = mydata2$Wind,
ylab = "Temperature [°C]",
xlab = "Wind speed [kph]",
smooth = FALSE)
We can see a negative correlation between Wind speed and Temperature.
The median Temperature is 26°C. The third quartile is °29°C.
We have some potential outliers for Wind speed on the boxplot. If we look at them in the scatterplot they aren’t problematic.
library(readxl)
mydata <- read_xlsx("./Take home exam/Task 2/Business School.xlsx")
mydata <- as.data.frame(mydata)
head(mydata)
## Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience Employability (Before) Employability (After)
## 1 1 Business 68.4 90.2 No 252 276
## 2 2 Computer Science 70.2 68.7 Yes 101 119
## 3 3 Finance 76.4 83.3 No 401 462
## 4 4 Business 82.6 88.7 No 287 342
## 5 5 Finance 76.9 75.4 No 275 347
## 6 6 Computer Science 83.3 82.1 No 254 313
## Status Annual Salary
## 1 Placed 111000
## 2 Placed 107000
## 3 Placed 109000
## 4 Placed 148000
## 5 Placed 255500
## 6 Placed 103500
library(ggplot2)
ggplot(mydata, aes(x=`Undergrad Degree`)) +
geom_bar(fill = "chocolate") +
ylab("Frequency")
The most common undergrad degree is Business.
library(pastecs)
round(stat.desc(mydata$`Annual Salary`), 0)
## nbr.val nbr.null nbr.na min max range sum median mean
## 100 0 0 20000 340000 320000 10905800 103500 109058
## SE.mean CI.mean.0.95 var std.dev coef.var
## 4150 8235 1722373475 41501 0
library(ggplot2)
ggplot(mydata, aes(x=`Annual Salary`)) +
geom_histogram(binwidth = 10000, color="grey", fill="forestgreen") +
ylab("Frequency")
In the graph we can see the distribution of Annual Salary. It is slightly asymmetrical to the right, so it has positive skewness. We can spot some outliers.
t.test(mydata$`MBA Grade`,
mu = 74,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
We can reject the null hypothesis (p = 0,009). The average grade is between 74.52 and 77.56 with a 95% confidence interval.
library(effectsize)
effectsize::cohens_d(mydata$`MBA Grade`,
mu = 74)
## Cohen's d | 95% CI
## ------------------------
## 0.27 | [0.07, 0.46]
##
## - Deviation from a difference of 74.
effectsize::interpret_cohens_d(0.27,
rules = "sawilowsky2009")
## [1] "small"
## (Rules: sawilowsky2009)
With the usage of the Sawilowsky (2009) rule we can state that the difference in the grade is small.
library(readxl)
mydata <- read_xlsx("./Take home exam/Task 3/Apartments.xlsx")
mydata <- as.data.frame(mydata)
head(mydata)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
mydata$ParkingF <- factor(mydata$Parking,
levels = c (0, 1),
labels = c ("No", "Yes"))
mydata$BalconyF <- factor(mydata$Balcony,
levels = c (0, 1),
labels = c ("No", "Yes"))
head(mydata)
## Age Distance Price Parking Balcony ParkingF BalconyF
## 1 7 28 1640 0 1 No Yes
## 2 18 1 2800 1 0 Yes No
## 3 7 28 1660 0 0 No No
## 4 28 29 1850 0 1 No Yes
## 5 18 18 1640 1 1 Yes Yes
## 6 28 12 1770 0 1 No Yes
t.test(mydata$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
We can reject the null hypothesis (p = 0,005). The average apartment price is between 1937€ and 2101€ with a 95% confidence interval.
fit1 <- lm(Price ~ Age,
data = mydata)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
sqrt(summary(fit1)$r.squared)
## [1] 0.230255
If the age of an apartment increases by 1 year, on average the price decreases by 8.975€ per m2 (p = 0,034).
5.3% of variability in price of apartments is affected by linear effect of age.
Linear relationship between price and age is weak.
library(car)
scatterplotMatrix(mydata[ , c(3, 1, 2)],
smooth = FALSE)
There is not a potential problem with multicolinearity.
fit2 <- lm(Price ~ Age + Distance,
data = mydata)
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
With VIF we check the multicolinearity, how the explanatory variables are connected. VIF ranges from 1 to 5 as the cut-off value. All values in this case are close to 1 so we can continue.
mydata$StdResid <- round(rstandard(fit2), 3)
mydata$CooksD <- round(cooks.distance(fit2), 3)
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
We don’t have any outliers. All standardised residuals are between -3 and 3.
We have some units with high influence. Their value is substantially greater than others. We need to remove them.
head(mydata[order(-mydata$CooksD), ])
## Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD
## 38 5 45 2180 1 1 Yes Yes 2.577 0.320
## 55 43 37 1740 0 0 No No 1.445 0.104
## 33 2 11 2790 1 0 Yes No 2.051 0.069
## 53 7 2 1760 0 1 No Yes -2.152 0.066
## 22 37 3 2540 1 1 Yes Yes 1.576 0.061
## 39 40 2 2400 0 1 No Yes 1.091 0.038
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
filter(!CooksD > 0.07)
fit2 <- lm(Price ~ Age + Distance,
data = mydata)
mydata$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
It seems that there can be signs of heteroskedacity. The variance isn’t constant.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 3.775135
## Prob > Chi2 = 0.05201969
We cannot reject the null hypothesis (p = 0,052). The variance is constant so there is no heteroskedaticity.
mydata$StdResid <- round(rstandard(fit2), 3)
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.95952, p-value = 0.01044
We reject the null hypothesis (p = 0,01). The distribution of standard residuals is not normal. This doesn’t affect our test because we have a sample bigger than 30 (n = 83), because of the standard limit theorem.
fit2 <- lm(Price ~ Age + Distance,
data = mydata)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.27 -212.96 -46.23 205.05 578.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2490.112 76.189 32.684 < 2e-16 ***
## Age -7.850 3.244 -2.420 0.0178 *
## Distance -23.945 2.826 -8.473 9.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared: 0.4968, Adjusted R-squared: 0.4842
## F-statistic: 39.49 on 2 and 80 DF, p-value: 1.173e-12
If the age of the apartments increases by 1 year, on average the price of the apartment decreases by 7.85€ per m2 (p = 0.018) assuming constant distance from city center.
If the distance from city center increases by 1 km, on average the price of the apartment decreases by 23.945€ per m2 (p < 0.001) assuming age is constant.
49.68% of the variability of price of apartments is affected by linear effect of age and distance from city center.
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF,
data = mydata)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 5982100
## 2 78 5458696 2 523404 3.7395 0.02813 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can reject the null hypothesis (p = 0.028). The model fit3 fits the data better.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -499.06 -194.33 -32.04 219.03 544.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2358.900 93.664 25.185 < 2e-16 ***
## Age -7.197 3.148 -2.286 0.02499 *
## Distance -21.241 2.911 -7.296 2.14e-10 ***
## ParkingFYes 168.921 62.166 2.717 0.00811 **
## BalconyFYes -6.985 58.745 -0.119 0.90566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared: 0.5408, Adjusted R-squared: 0.5173
## F-statistic: 22.97 on 4 and 78 DF, p-value: 1.449e-12
Given the age and distance from the city center of apartments, apartments with parking have on average 168.921€ higher price per m2 (p = 0.008) than apartments without parking.
We cannot say that balcony would effect the expected price per m2.
F-statistics is the test of Significance of regression. The hypothesis is:
mydata$Fitted <- fitted.values(fit3)
mydata$Residuals <- residuals(fit3)
head(mydata$Residuals[2])
## [1] 422.9572
The residual for apartment ID2 is 422.96€.