I will first choose the data set.
mydata <- force(swiss)
head(mydata)
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
Description of the chosen data set:
Fertility: common standardized fertility measure
Agriculture: percentage (%) of males involved in agriculture as occupation
Examination: percentage (%) of draftees receiving highest mark on army examination
Education: percentage (%) of draftees educated beyond primary school
Catholic: percentage (%) of those who are catholic (as opposed to protestant)
Infant.Morality: live births who live less than 1 year
summary(mydata)
## Fertility Agriculture Examination Education
## Min. :35.00 Min. : 1.20 Min. : 3.00 Min. : 1.00
## 1st Qu.:64.70 1st Qu.:35.90 1st Qu.:12.00 1st Qu.: 6.00
## Median :70.40 Median :54.10 Median :16.00 Median : 8.00
## Mean :70.14 Mean :50.66 Mean :16.49 Mean :10.98
## 3rd Qu.:78.45 3rd Qu.:67.65 3rd Qu.:22.00 3rd Qu.:12.00
## Max. :92.50 Max. :89.70 Max. :37.00 Max. :53.00
## Catholic Infant.Mortality
## Min. : 2.150 Min. :10.80
## 1st Qu.: 5.195 1st Qu.:18.15
## Median : 15.140 Median :20.00
## Mean : 41.144 Mean :19.94
## 3rd Qu.: 93.125 3rd Qu.:21.70
## Max. :100.000 Max. :26.60
I will now round my data to two decimal points.
library(pastecs)
round(stat.desc(mydata), 2)
## Fertility Agriculture Examination Education Catholic
## nbr.val 47.00 47.00 47.00 47.00 47.00
## nbr.null 0.00 0.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00 0.00
## min 35.00 1.20 3.00 1.00 2.15
## max 92.50 89.70 37.00 53.00 100.00
## range 57.50 88.50 34.00 52.00 97.85
## sum 3296.70 2381.00 775.00 516.00 1933.76
## median 70.40 54.10 16.00 8.00 15.14
## mean 70.14 50.66 16.49 10.98 41.14
## SE.mean 1.82 3.31 1.16 1.40 6.08
## CI.mean.0.95 3.67 6.67 2.34 2.82 12.25
## var 156.04 515.80 63.65 92.46 1739.29
## std.dev 12.49 22.71 7.98 9.62 41.70
## coef.var 0.18 0.45 0.48 0.88 1.01
## Infant.Mortality
## nbr.val 47.00
## nbr.null 0.00
## nbr.na 0.00
## min 10.80
## max 26.60
## range 15.80
## sum 937.30
## median 20.00
## mean 19.94
## SE.mean 0.42
## CI.mean.0.95 0.86
## var 8.48
## std.dev 2.91
## coef.var 0.15
I will now rename the variable “Agriculture” to “Farming”.
colnames(mydata) [2] <- "Farming"
head(mydata)
## Fertility Farming Examination Education Catholic Infant.Mortality
## Courtelary 80.2 17.0 15 12 9.96 22.2
## Delemont 83.1 45.1 6 9 84.84 22.2
## Franches-Mnt 92.5 39.7 5 5 93.40 20.2
## Moutier 85.8 36.5 12 7 33.77 20.3
## Neuveville 76.9 43.5 17 15 5.16 20.6
## Porrentruy 76.1 35.3 9 7 90.57 26.6
I will now create new data.frame, which includes only those cantons where infant mortality is between 18 and 20.
mydata2 <- mydata[mydata$Infant.Mortality >= 18 & mydata$Infant.Mortality <= 20 , ]
I will now calculate the arithmetic mean for Fertility. As we can see below, it is 70.14.
mean(mydata$Fertility)
## [1] 70.14255
I will now calculate the standard deviation for Infant.Mortality. As we can see below, it is 2.91.
sd(mydata$Infant.Mortality)
## [1] 2.912697
Let’s calculate the median for Infant.Mortality. As we can see below, it is 20.
median(mydata$Infant.Mortality)
## [1] 20
library(ggplot2)
ggplot(mydata, aes(y = Examination, x = Education)) +
geom_point()
I have chosen to observe the correlation between Education (percentage of draftees educated beyond primary school), which is on x-axis and is my independent variable, and Examination (percentage of draftees receiving highest mark on army examination), which is on y-axis and is my dependent variable. Observing the graph, there seems to be a positive correlation between the two variables.
I have renamed the data set “Business-School” to “mydata3” so that it is easier to work with.
library(readxl)
Business_School <- read_excel("C:/Users/blapu/OneDrive/Desktop/IMB 2024 - R Program/Bootcamp/R data/Business School.xlsx")
mydata3 <- Business_School
head(mydata3)
## # A tibble: 6 × 9
## `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
## <dbl> <chr> <dbl> <dbl>
## 1 1 Business 68.4 90.2
## 2 2 Computer Science 70.2 68.7
## 3 3 Finance 76.4 83.3
## 4 4 Business 82.6 88.7
## 5 5 Finance 76.9 75.4
## 6 6 Computer Science 83.3 82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## # `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>
library(ggplot2)
ggplot(mydata3, aes(x = `Undergrad Degree`)) +
geom_bar(fill = "orange", color = "yellow")
Now I will show the descriptive statistics of the Annual Salary.
summary(mydata3$`Annual Salary`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20000 87125 103500 109058 124000 340000
We can see that the average annual salary is 109,058.
library(ggplot2)
ggplot(mydata3, aes(x = `Annual Salary`)) +
geom_histogram(binwidth = 5000, colour = "orange") +
ylab("Frequency")
options(scipen = 999)
The distribution is skewed to the right. Most graduates have an average annual salary of 100,000 euros.
t.test(mydata3$`MBA Grade`,
mu = 74,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata3$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
Our H0 (null hypothesis) is that the average MBA grade is 74, and our H1 (alternative hypothesis) is that the average MBA grade is not 74.
p-value = 0.01. Because our p-value is smaller than 0.05 (our significance value), we can reject the null hypothesis. Reading the “mean of x”, we can see that the average MBA grade is higher than 74 - it is around 76.
I will now test the effect size.
library(effectsize)
effectsize::cohens_d(mydata3$`MBA Grade`,
mu = 74)
## Cohen's d | 95% CI
## ------------------------
## 0.27 | [0.07, 0.46]
##
## - Deviation from a difference of 74.
effectsize::interpret_cohens_d(0.27, rules = "sawilowsky2009")
## [1] "small"
## (Rules: sawilowsky2009)
The effect size is small.
I have renamed “Apartments” to data4 so that it is easier to work with. I have displayed the first six columns.
library(readxl)
Apartments <- read_excel("C:/Users/blapu/OneDrive/Desktop/IMB 2024 - R Program/Bootcamp/R data/Task 3/Apartments.xlsx")
mydata4 <- Apartments
head(mydata4)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
mydata4$Parking <- factor (mydata4$Parking,
levels = c(0,1),
labels = c("No", "Yes"))
mydata4$Balcony <- factor (mydata4$Balcony,
levels = c(0,1),
labels = c("No", "Yes"))
t.test(mydata4$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata4$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
If H0: Mu_Price = 1900 eur, we can say that H1: Mu-Price =/= 1900 eur. We can see that the p-value = 0.005, which is less than 0.05 (our significance level). Thus, we can reject H0. By “mean of x”, we can see that the average price of an apartment is around 2019 eur.
fit <- lm(Price ~ Age,
data = mydata4)
summary(fit)
##
## Call:
## lm(formula = Price ~ Age, data = mydata4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <0.0000000000000002 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
The regression coefficient for age is -8.975. If the age of the apartment increases by one year, the price of the apartment will be on average 8.98 euros lower, assuming other explanatory variables are constant.
The coefficient of determination is 5.3%. This means that 5.3% of variability in price is explained by linear effect of age, distance, parking, and balcony.
sqrt(summary(fit)$r.squared)
## [1] 0.230255
The coefficient of correlation is 0.230. This indicates a weak positive correlation between price and age.
library(car)
## Loading required package: carData
scatterplotMatrix(mydata4[ , c(3, 1, 2)],
smooth = FALSE)
There is no problem with potential multicolinearity, meaning that there is no correlation of explanatory variables, which are in our case age and distance. The graph (third row, second column) is not that steep, so I think there is no problem of multicolinearity.
fit2 <- lm(Price ~ Age + Distance,
data = mydata4)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 0.0000000000000002 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 0.0000000000618 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 0.00000000004896
vif(fit2)
## Age Distance
## 1.001845 1.001845
As said in the lectures, the higher the VIF statistic, the more strongly the explanatory variable is linked to other explanatory variables. VIF is not larger than 5 for either of the two values.Thus, there is no strong multicolinearity.
mydata4$StdResid <- round(rstandard(fit2), 2)
mydata4$CooksD <- round (cooks.distance(fit2), 2)
hist(mydata4$StdResid,
xlab = "Standardised residuals",
ylab = "Frequency",
main = "Histogram of standardised residuals")
hist(mydata4$CooksD,
xlab = "Cooks Distance",
ylab = "Frequency",
main = "Histogram of Cooks Distances")
There are no outliers as all units are within the interval [-3, +3]. There are, however, some units with high influence, which we can see by the large gap between 0.10 and 0.35. Let’s remove those units.
head(mydata4[order(-mydata4$CooksD), ], 6)
## # A tibble: 6 × 7
## Age Distance Price Parking Balcony StdResid CooksD
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 Yes Yes 2.58 0.32
## 2 43 37 1740 No No 1.44 0.1
## 3 2 11 2790 Yes No 2.05 0.07
## 4 7 2 1760 No Yes -2.15 0.07
## 5 37 3 2540 Yes Yes 1.58 0.06
## 6 40 2 2400 No Yes 1.09 0.04
Going to mydata4 and clicking the arrow at CooksD, I have realized that the unit with high influence is apartment number 38. I will now remove it.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata4 <- mydata4 %>%
filter (!Price %in% 2180)
hist(mydata4$CooksD,
xlab = "Cooks Distance",
ylab = "Frequency",
main = "Histogram of Cooks Distances")
fit2 <- lm(Price ~ Age + Distance,
data = mydata4)
mydata4$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot (y = mydata4$StdResid, x = mydata4$StdFitted,
ylab = "Standardised residuals",
xlab = "Standardised fitted values",
boxplots = FALSE,
regline = FALSE,
smooth = FALSE)
## Warning in plot.window(...): "regline" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "regline" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "regline" is not a
## graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "regline" is not a
## graphical parameter
## Warning in box(...): "regline" is not a graphical parameter
## Warning in title(...): "regline" is not a graphical parameter
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 2.927455
## Prob > Chi2 = 0.08708469
Let’s check heteroskedasticity with Breusch-Pagan heteroskedasticity test. We can say: H0 = the variance is constant, and H1 = the variance is not constant.
We can see that the p-value = 0.33. Thus, we cannot reject H0. p-value is larger than 0.05.
hist(mydata4$StdResid,
xlab = "Standardised residuals",
ylab = "Frequency",
main = "Histogram of standardised residuals")
As said above, there are no units outside of the [-3, +3] interval. Let’s formally test this assumption. We can say that H0 = errors are distributed normally, and H1 = errors are not distributed normally.
shapiro.test(mydata4$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata4$StdResid
## W = 0.9488, p-value = 0.00219
We can see that the p-value = 0.002, which is less than 0.05 (our chosen significance level). This means that we can reject H0.
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 0.0000000000000002 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 0.00000000000252 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 0.000000000002339
If age increases by one year, the price of the apartment per m2 will on average be about 7.93 euros lower (p-value = 0.02), assuming other explanatory variables are constant.
If distance increases by 1 km from the city center, the price of the apartment per m2 will on average be 20.67 euros lower (p-value < 0.01), assuming other explanatory variables are constant.
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
data = mydata4)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2329.724 93.066 25.033 < 0.0000000000000002 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 0.000000000666 ***
## ParkingYes 167.531 62.864 2.665 0.00933 **
## BalconyYes -15.207 59.201 -0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 0.000000000003018
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 6176767
## 2 79 5654480 2 522287 3.6485 0.03051 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s write our hypotheses: H0: both models are equally good, and H1: one model is better than the other.
We can see that p = 0.03, meaning that it is smaller than 0.05 (our significance level). Thus, we can reject H0.
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
data = mydata4)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2329.724 93.066 25.033 < 0.0000000000000002 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 0.000000000666 ***
## ParkingYes 167.531 62.864 2.665 0.00933 **
## BalconyYes -15.207 59.201 -0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 0.000000000003018
Explanation for Parking: Given the age, distance, and balcony, apartments with parking are on average 167.531 euros more expensive per m2 compared to those without parking.
Explanation for Balcony: Given the age, distance, and parking, apartments with a balcony are on average 15.207 euros cheaper per m2 compared to those without a balcony.
Our hypotheses for F-statistic are: H0: our model does not explain any variability of the dependent variable, and H1: our model explains a proportion of the variability of the dependent variable.
We check the p-value and see that it is smaller than 0.01. Thus, we can reject H0.
mydata4$Fitted <- fitted.values(fit3)
mydata4$Residuals <- residuals(fit3)
head(mydata4[colnames(mydata4) %in% c("Price", "Fitted", "Residuals")])
## # A tibble: 6 × 3
## Price Fitted Residuals
## <dbl> <dbl> <dbl>
## 1 1640 1706. -66.0
## 2 2800 2372. 428.
## 3 1660 1721. -61.2
## 4 1850 1563. 287.
## 5 1640 2012. -372.
## 6 1770 1908. -138.
According to the estimated regression function, the expected price for the second apartment would be around 2372 euros per m2. The actual price of the second apartment is, however, 2800 euros. The difference between the two prices (428 euros) is a residual.