library(carData)
mydata <- Salaries
rownames(mydata) <- NULL
head(mydata)
## rank discipline yrs.since.phd yrs.service sex salary
## 1 Prof B 19 18 Male 139750
## 2 Prof B 20 16 Male 173200
## 3 AsstProf B 4 3 Male 79750
## 4 Prof B 45 39 Male 115000
## 5 Prof B 40 41 Male 141500
## 6 AssocProf B 6 6 Male 97000
Rank: a factor with levels AssocProf, AsstProf, Prof.
Discipline: a factor with levels A ("theoretical" departments) or B ("applied" departments).
Yrs.since.phd: years since PhD.
Yrs.service: years of service.
Sex: a factor with levels Female, Male.
Salary: nine-month salary, in dollars.
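As a quick sanity check (a minimal sketch), str() confirms which columns are factors and which are numeric before the columns are renamed:
# Inspect variable types and factor levels of the Salaries data
str(mydata)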
colnames(mydata) <- c("Academic_title", "Discipline", "Years since PhD", "Years of service", "Gender", "Salary")
head(mydata)
## Academic_title Discipline Years since PhD Years of service Gender Salary
## 1 Prof B 19 18 Male 139750
## 2 Prof B 20 16 Male 173200
## 3 AsstProf B 4 3 Male 79750
## 4 Prof B 45 39 Male 115000
## 5 Prof B 40 41 Male 141500
## 6 AssocProf B 6 6 Male 97000
# stat.desc() comes from the pastecs package
library(pastecs)
round(stat.desc(mydata[ , -c(1, 2, 5)]), 2)
## Years since PhD Years of service Salary
## nbr.val 397.00 397.00 397.00
## nbr.null 0.00 11.00 0.00
## nbr.na 0.00 0.00 0.00
## min 1.00 0.00 57800.00
## max 56.00 60.00 231545.00
## range 55.00 60.00 173745.00
## sum 8859.00 6993.00 45141464.00
## median 21.00 16.00 107300.00
## mean 22.31 17.61 113706.46
## SE.mean 0.65 0.65 1520.16
## CI.mean.0.95 1.27 1.28 2988.60
## var 166.07 169.16 917425865.05
## std.dev 12.89 13.01 30289.04
## coef.var 0.58 0.74 0.27
Mean: From the output, we can see that the average number of years since PhD is 22.31, the average number of years of service is 17.61, and the average nine-month salary is 113706.46 dollars.
Median: The median (middle value) of years since PhD is 21, the median of years of service is 16, and the median nine-month salary is 107300.00 dollars.
Range: The difference between the highest and lowest number of years since PhD is 55, and the difference between the highest and lowest number of years of service is 60. The difference between the highest and lowest nine-month salary is 173745.00 dollars.
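As a sketch, a few of these values can be cross-checked with base R functions (the numbers in the comments are the ones reported by stat.desc() above):
# Cross-check selected descriptives from stat.desc()
mean(mydata$Salary)              # ~113706.5
median(mydata$Salary)            # 107300
diff(range(mydata$Salary))       # 173745
mean(mydata$`Years of service`)  # ~17.61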
hist(mydata$Salary,
main = "Distribution of Salaries for Professors",
xlab = "Nine-month salary, in dollars",
ylab = "Frequency",
breaks = seq(from = 55000, to = 250000, by = 5000))
We can see that the distribution of the nine-month salary is skewed to the right, although not by much; the distribution is fairly close to normal.
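To back up this reading with a number, stat.desc() can also report skewness when norm = TRUE (a sketch; a small positive skewness would confirm the mild right skew):
# Descriptives including skewness and kurtosis for Salary
round(stat.desc(mydata$Salary, norm = TRUE), 2)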
hist(mydata$"Years of service",
main = "Distribution of Years of Service",
xlab = "Years of service",
ylab = "Frequency",
breaks = seq(from = 0, to = 60, by = 2))
We can see that the distribution of years of service is slightly skewed to the right.
# ggplot() comes from the ggplot2 package
library(ggplot2)
ggplot(mydata, aes(x = Academic_title, y = Salary)) +
geom_boxplot() +
theme_bw()
As expected, salary increases with academic title.
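A simple numeric check of this boxplot reading is the median salary per academic title (a sketch using base R):
# Median nine-month salary by academic title
aggregate(Salary ~ Academic_title, data = mydata, FUN = median)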
ggplot(mydata, aes(x=Gender, y= Salary))+
geom_boxplot()+
theme_bw()
There are a few outliers with unusually high salaries. I therefore removed the three most extreme cases.
head(mydata[order(-mydata$Salary),])
## Academic_title Discipline Years since PhD Years of service Gender Salary
## 44 Prof B 38 38 Male 231545
## 365 Prof A 43 43 Male 205500
## 250 Prof A 29 7 Male 204000
## 272 Prof A 42 18 Male 194800
## 78 Prof B 26 19 Male 193000
## 331 Prof B 49 60 Male 192253
newdata<-mydata[c(-44, -365, -250),]
ggplot(newdata, aes(x=Gender, y= Salary))+
geom_boxplot()+
theme_bw()
The scatterplot matrix below shows the relationships among years since PhD, years of service, and salary. Years since PhD and years of service are quite strongly correlated with each other, while neither shows a strong correlation with salary.
# scatterplotMatrix() comes from the car package
library(car)
scatterplotMatrix(mydata[ , c(-1, -2, -5)],
smooth = FALSE)
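To quantify what the scatterplot matrix shows, the correlation matrix of the three numeric variables can be computed (a sketch; the exact values are not reported above):
# Pairwise correlations among the numeric variables
round(cor(mydata[ , c("Years since PhD", "Years of service", "Salary")]), 2)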
mydata2 <- read.table("~/Downloads/R Take Home Exam/Task 2/Body mass.csv", header=TRUE, sep=";", dec=",")
head (mydata2)
## ID Mass
## 1 1 62.1
## 2 2 64.5
## 3 3 56.5
## 4 4 53.4
## 5 5 61.3
## 6 6 62.2
Description:
#install.packages("pastecs")
library(pastecs)
round(stat.desc(mydata2[-1]), 2)
## Mass
## nbr.val 50.00
## nbr.null 0.00
## nbr.na 0.00
## min 49.70
## max 83.20
## range 33.50
## sum 3143.80
## median 62.80
## mean 62.88
## SE.mean 0.85
## CI.mean.0.95 1.71
## var 36.14
## std.dev 6.01
## coef.var 0.10
hist(mydata2$Mass,
main = "Distribution of Mass",
xlab = "Kilograms",
ylab = "Frequency",
breaks = seq(from = 40, to = 90, by = 1))
#install.packages("psych")
library(ggplot2)
t.test(mydata2$Mass, mu=59.5, alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata2$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
We can reject the hypothesis that the average body mass is 59.5 kg at p-value 0.000234: the sample mean is 62.876 and the 95% confidence interval (61.168 to 64.584) does not contain 59.5.
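As a sketch, the t statistic can be reproduced by hand from the descriptives above (mean 62.876, standard deviation 6.01, n = 50):
# Manual one-sample t statistic: (xbar - mu0) / (s / sqrt(n))
xbar <- 62.876; mu0 <- 59.5; s <- 6.01; n <- 50
(xbar - mu0) / (s / sqrt(n))  # ~3.97, matching the t.test() output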
# Effect size r from the t statistic: r = sqrt(t^2 / (t^2 + df)), with df = 49
t <- 3.9711
r <- sqrt(t^2 / (t^2 + 49))
r
## [1] 0.4934295
r = 0.493, so the effect is of medium size (bordering on large by Cohen's guidelines, where 0.5 counts as large).
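An alternative effect-size measure is Cohen's d, computed here as a sketch from the sample mean and standard deviation reported above:
# Cohen's d for a one-sample test: (xbar - mu0) / s
(62.876 - 59.5) / 6.01  # ~0.56, also a medium effect by Cohen's conventions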
#install.packages("readxl")
library("readxl")
mydata3 <- read_excel("~/Downloads/R Take Home Exam/Task 3/Apartments.xlsx")
head(mydata3)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
mydata3$Parking <- factor(mydata3$Parking, levels = c(0, 1), labels = c("No", "Yes"))
mydata3$Balcony <- factor(mydata3$Balcony, levels = c(0, 1), labels = c("No", "Yes"))
head(mydata3)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <fct> <fct>
## 1 7 28 1640 No Yes
## 2 18 1 2800 Yes No
## 3 7 28 1660 No No
## 4 28 29 1850 No Yes
## 5 18 18 1640 Yes Yes
## 6 28 12 1770 No Yes
We can reject the hypothesis that the average price is 1900 at p-value 0.004731: the sample mean is 2018.941 and the 95% confidence interval (1937.443 to 2100.440) does not contain 1900.
#install.packages("psych")
library(ggplot2)
t.test(mydata3$Price, mu=1900, alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata3$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
fit1 <- lm(Price~Age, data = mydata3)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor(mydata3$Age, mydata3$Price)
## [1] -0.230255
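As a sketch, for a simple regression the Multiple R-squared equals the squared correlation between the two variables, which matches the fit1 summary:
# R-squared of the simple regression as the squared correlation
cor(mydata3$Age, mydata3$Price)^2  # ~0.053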
In the scatterplot matrix below, the explanatory variables (Age and Distance) scatter widely around any fitted line against each other, so we can assume that multicollinearity is not present. To know for certain, we calculate the VIF.
library(car)
scatterplotMatrix(mydata3[, c(3, 1, 2)], smooth = FALSE)
fit2 <- lm(Price ~ Age + Distance, data = mydata3)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
Both VIF values are well below 5, so there is no indication of multicollinearity.
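As a sketch, with only two explanatory variables each VIF equals 1 / (1 - r^2), where r is the correlation between Age and Distance, so the value above can be reproduced directly:
# VIF from the pairwise correlation of the two predictors
r_ad <- cor(mydata3$Age, mydata3$Distance)
1 / (1 - r_ad^2)  # ~1.0018, matching vif(fit2)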
None of the cases is a severely problematic outlier or an overly influential observation; nevertheless, I removed the one case that stood out the most.
library (tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%() masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ plyr::arrange() masks dplyr::arrange()
## ✖ purrr::compact() masks plyr::compact()
## ✖ plyr::count() masks dplyr::count()
## ✖ pastecs::extract() masks tidyr::extract()
## ✖ plyr::failwith() masks dplyr::failwith()
## ✖ dplyr::filter() masks stats::filter()
## ✖ pastecs::first() masks dplyr::first()
## ✖ plyr::id() masks dplyr::id()
## ✖ dplyr::lag() masks stats::lag()
## ✖ pastecs::last() masks dplyr::last()
## ✖ plyr::mutate() masks dplyr::mutate()
## ✖ car::recode() masks dplyr::recode()
## ✖ plyr::rename() masks dplyr::rename()
## ✖ purrr::some() masks car::some()
## ✖ Hmisc::src() masks dplyr::src()
## ✖ plyr::summarise() masks dplyr::summarise()
## ✖ Hmisc::summarize() masks plyr::summarize(), dplyr::summarize()
mydata3$StdResid <- round(rstandard(fit2), 3)
mydata3$CooksD <- round(cooks.distance(fit2), 3)
view(mydata3)
hist(mydata3$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
hist(mydata3$CooksD,
xlab = "Cook's distance",
ylab = "Frequency",
main = "Histogram of Cook's distances")
head(mydata3[order(mydata3$StdResid),], 3)
## # A tibble: 3 × 7
## Age Distance Price Parking Balcony StdResid CooksD
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 7 2 1760 No Yes -2.15 0.066
## 2 12 14 1650 No Yes -1.50 0.013
## 3 12 14 1650 No No -1.50 0.013
head(mydata3[order(-mydata3$CooksD),], 6)
## # A tibble: 6 × 7
## Age Distance Price Parking Balcony StdResid CooksD
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 Yes Yes 2.58 0.32
## 2 43 37 1740 No No 1.44 0.104
## 3 2 11 2790 Yes No 2.05 0.069
## 4 7 2 1760 No Yes -2.15 0.066
## 5 37 3 2540 Yes Yes 1.58 0.061
## 6 40 2 2400 No Yes 1.09 0.038
newdata1<-mydata3[c(-53),]
fit2 <- lm(Price ~ Age + Distance, data = newdata1)
newdata1$StdResid <- round(rstandard(fit2), 3)
newdata1$CooksD <- round(cooks.distance(fit2), 3)
hist(newdata1$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
hist(newdata1$CooksD,
xlab = "Cook's distance",
ylab = "Frequency",
main = "Histogram of Cook's distances")
We check this assumption with a scatter plot of the standardized residuals against the (standardized) fitted values. The variance of the residuals does not grow with the fitted values, so the assumption of homoskedasticity does not appear to be violated. The Breusch-Pagan test below agrees: its p-value (0.671) is greater than 0.05, so we cannot reject the hypothesis of constant variance.
newdata1$StdFittedValues <- scale(fit2$fitted.values)
library(car)
scatterplot(y = newdata1$StdResid, x = newdata1$StdFittedValues,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
library(olsrr)
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 0.1801242
## Prob > Chi2 = 0.6712665
Assumption of normality:
H0: the standardized residuals are normally distributed. H1: the standardized residuals are not normally distributed.
The Shapiro-Wilk test below gives p-value = 0.0006104, which is smaller than alpha = 0.05, so we reject H0 and conclude that the standardized residuals are not normally distributed.
The assumption of normally distributed errors is therefore violated, meaning the t-based inference may not be entirely reliable.
#install.packages("olsrr")
library(olsrr)
shapiro.test(newdata1$StdResid)
##
## Shapiro-Wilk normality test
##
## data: newdata1$StdResid
## W = 0.93901, p-value = 0.0006104
hist(newdata1$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
fit2 <- lm(Price ~ Age + Distance, data = mydata3)
vif(fit2)
## Age Distance
## 1.001845 1.001845
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = mydata3)
The ANOVA comparison below shows that fit3 uses 2 more degrees of freedom than fit2 and that the improvement is significant (p-value = 0.010), so we can say that including the additional variables fits the data better than fit2.
anova (fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
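As a sketch, the partial F statistic can be reproduced from the residual sums of squares in the table above:
# Partial F test by hand: ((RSS_reduced - RSS_full) / extra df) / (RSS_full / residual df)
rss_reduced <- 6720983; rss_full <- 5991088
((rss_reduced - rss_full) / 2) / (rss_full / 80)  # ~4.87, matching the ANOVA output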
Parking: given Age, Distance and Balcony, apartments with a parking place have on average a price higher by about 196 euros, with p = 0.0025 (p < α = 0.05).
Balcony: p = 0.974 > α = 0.05, so we cannot claim that the presence of a balcony affects the price (we cannot reject the null hypothesis).
Hypotheses for the F-statistic: H0: ρ² = 0; H1: ρ² > 0.
The p-value is below 0.001, so we reject H0 at that level: there is a linear relationship between the dependent variable (Price) and at least one explanatory variable, i.e. the model is useful (ρ² > 0).
summary (fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## ParkingYes 196.168 62.868 3.120 0.00251 **
## BalconyYes 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
mydata3$FittedValues <- fitted.values(fit3)
mydata3$Residuals <- residuals(fit3)
head(mydata3)
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony StdResid CooksD FittedValues Residuals
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 No Yes -0.665 0.007 1751. -111.
## 2 18 1 2800 Yes No 1.78 0.03 2357. 443.
## 3 7 28 1660 No No -0.594 0.006 1749. -88.8
## 4 28 29 1850 No Yes 0.754 0.008 1590. 260.
## 5 18 18 1640 Yes Yes -1.07 0.005 2053. -413.
## 6 28 12 1770 No Yes -0.778 0.005 1897. -127.
For the second apartment, for example, the residual is 442.588 (the observed price of 2800 minus the fitted value of 2357.412).
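As a sketch, that fitted value and residual can be reconstructed from the fit3 coefficients for the second apartment (Age = 18, Distance = 1, Parking = Yes, Balcony = No):
# Rebuild the fitted value of apartment 2 from the estimated coefficients
b <- coef(fit3)
fitted2 <- unname(b["(Intercept)"] + b["Age"] * 18 + b["Distance"] * 1 + b["ParkingYes"] * 1)
fitted2         # ~2357.41
2800 - fitted2  # residual ~442.59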