Source: I selected a data set that is built into R: the "Salaries" data set from the "carData" package. The package documentation gives no further information about the source of the data.
#data()
#data(package = .packages(all.available = TRUE))
#install.packages("carData")
library(carData)
mydata <- force(Salaries) # Copy the lazy-loaded built-in Salaries data set into mydata
head(mydata)
## rank discipline yrs.since.phd yrs.service sex salary
## 1 Prof B 19 18 Male 139750
## 2 Prof B 20 16 Male 173200
## 3 AsstProf B 4 3 Male 79750
## 4 Prof B 45 39 Male 115000
## 5 Prof B 40 41 Male 141500
## 6 AssocProf B 6 6 Male 97000
Explanation of the data set:
The data contains the 2008-09 nine-month academic salaries of Assistant Professors, Associate Professors and Professors at a college in the U.S. The data set contains 397 observations and 6 variables.
Variables:
str(mydata[ , -1]) # Compact and informative description of the structure (the first column, rank, is excluded)
## 'data.frame': 397 obs. of 5 variables:
## $ discipline : Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2 ...
## $ yrs.since.phd: int 19 20 4 45 40 6 30 45 21 18 ...
## $ yrs.service : int 18 16 3 39 41 6 23 45 20 18 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...
## $ salary : int 139750 173200 79750 115000 141500 97000 175000 147765 119250 129000 ...
mydata2 <- mydata[ , -5] # Deleting the 5th column (sex)
head(mydata2)
## rank discipline yrs.since.phd yrs.service salary
## 1 Prof B 19 18 139750
## 2 Prof B 20 16 173200
## 3 AsstProf B 4 3 79750
## 4 Prof B 45 39 115000
## 5 Prof B 40 41 141500
## 6 AssocProf B 6 6 97000
In this step I removed the variable "sex", because I assume, for the purposes of this analysis, that this variable is not needed and does not affect salaries.
colnames(mydata2) <- c("Level", "Department", "Years.since.PhD", "Service", "Earnings") # Renaming the variables
head(mydata2)
## Level Department Years.since.PhD Service Earnings
## 1 Prof B 19 18 139750
## 2 Prof B 20 16 173200
## 3 AsstProf B 4 3 79750
## 4 Prof B 45 39 115000
## 5 Prof B 40 41 141500
## 6 AssocProf B 6 6 97000
mydata3 <- subset(mydata2, Service > 0) # Filtering the data frame
I filtered the data so that only professors with at least 1 year of service remain in the data set.
mydata3 <- mydata3[c(1, 5, 2, 3, 4)] # Reordering the variables
tail(mydata3, 3)
## Level Earnings Department Years.since.PhD Service
## 395 Prof 101738 A 42 25
## 396 Prof 95329 A 25 15
## 397 AsstProf 81035 A 8 4
In this step I reordered the columns and moved "Earnings" to the second column, next to the variable "Level". I also extracted the last 3 rows of the data frame.
mydata4 <- mydata3
mydata4$Earnings.in.thousands <- mydata3$Earnings * 0.001 # Earnings in thousands of dollars
head(mydata4)
## Level Earnings Department Years.since.PhD Service Earnings.in.thousands
## 1 Prof 139750 B 19 18 139.75
## 2 Prof 173200 B 20 16 173.20
## 3 AsstProf 79750 B 4 3 79.75
## 4 Prof 115000 B 45 39 115.00
## 5 Prof 141500 B 40 41 141.50
## 6 AssocProf 97000 B 6 6 97.00
In this data manipulation I multiplied all earnings by 0.001 (i.e. divided them by 1,000) to get a new variable that shows earnings in thousands of dollars. I did this only as an exercise and will not include this variable in my further analysis.
I will now present some descriptive statistics.
round(mean(mydata3$Earnings))
## [1] 114560
The arithmetic mean of "Earnings" is 114560, which means that the average nine-month academic salary of the professors included in the data set is 114,560 dollars.
library(psych)
describe(mydata3)
## vars n mean sd median trimmed mad min
## Level* 1 386 2.54 0.74 3.0 2.67 0.00 1
## Earnings 2 386 114560.02 30247.47 108150.0 112380.21 29133.09 57800
## Department* 3 386 1.54 0.50 2.0 1.55 0.00 1
## Years.since.PhD 4 386 22.82 12.69 21.5 22.38 14.08 1
## Service 5 386 18.12 12.84 17.0 17.01 14.83 1
## max range skew kurtosis se
## Level* 3 2 -1.23 -0.06 0.04
## Earnings 231545 173745 0.68 0.17 1539.56
## Department* 2 1 -0.17 -1.98 0.03
## Years.since.PhD 56 55 0.29 -0.80 0.65
## Service 60 59 0.65 -0.33 0.65
In R there are many different functions that we can use for descriptive statistics. Above I used two of them. The first is a single-value function that returns the result for one particular statistic only. The second function, "describe" from the psych package, gives a broader overview of many different statistics at once.
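For completeness, a few more single-value functions could be used in the same way (a minimal sketch on the same data frame, not part of the original analysis):
median(mydata3$Earnings)   # median salary
sd(mydata3$Earnings)       # standard deviation
quantile(mydata3$Earnings) # quartiles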
Below I interpreted three results:
summary(mydata3[c(1, 3)])
## Level Department
## AsstProf : 57 A:177
## AssocProf: 64 B:209
## Prof :265
Because I also have two factor variables, I used this function to present the number of observations in each category of the variables "Level" and "Department" (level A denotes "theoretical" departments and level B "applied" departments). A:177 means that 177 professors work in a theoretical department.
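An alternative (a sketch, not used in the original analysis) is the base-R table() function, which also allows a cross-tabulation of the two factors:
table(mydata3$Level)                     # counts per rank
table(mydata3$Level, mydata3$Department) # rank by department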
hist(mydata3$Earnings,
main = "Salaries",
xlab = "Salary",
ylab = "Frequency",
col = "antiquewhite",
breaks = seq(from = 0, to = 250000, by = 10000))
The histogram above shows that salary is positively (right) skewed. The most frequent salaries lie between 100,000 and 110,000 dollars, and only a few professors earn more than 200,000 dollars.
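To support the visual impression with a number, the skew() function from the already-loaded psych package could be used (a sketch; the describe() output above already reported a skewness of 0.68 for Earnings):
skew(mydata3$Earnings) # psych::skew; a positive value confirms the right skew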
#install.packages("car")
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplot(y = mydata3$Earnings,
x = mydata3$Service,
ylab = "Salary",
xlab = "Years of service",
boxplots = FALSE,
smooth = FALSE)
I drew a scatter plot with the function "scatterplot" from the "car" package. On the x-axis I plotted years of service and on the y-axis salaries. We can see a positive correlation between years of service and salary: professors with more years of service have, on average, higher salaries.
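The strength of this relationship can be quantified with the Pearson correlation coefficient (a one-line sketch):
cor(mydata3$Service, mydata3$Earnings) # correlation between years of service and salary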
scatterplot(Earnings ~ Service | Department,
ylab = "Salary",
xlab = "Years of service",
smooth = FALSE,
data = mydata3)
In this scatter plot I also included the factor variable "Department". Level A denotes "theoretical" departments and level B "applied" departments. The slope of the line for "applied" departments is slightly steeper than the slope for "theoretical" departments, which means that in "applied" departments salaries increase faster with years of service.
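One way to test formally whether the two slopes differ (a sketch, not part of the original analysis; fit_dep is a hypothetical model name) would be an interaction model, where the Service:Department coefficient estimates the difference in slopes:
fit_dep <- lm(Earnings ~ Service * Department, data = mydata3) # interaction model
summary(fit_dep)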
In Task 2 I will analyse the effects of online schooling on body weight in ninth graders.
mydata_1 <- read.table("C:/LUKA/IMB/BootCamp/R Take Home Exam/R Take Home Exam/Task 2/Body mass.csv",
header = TRUE,
sep = ";",
dec = ",")
head(mydata_1)
## ID Mass
## 1 1 62.1
## 2 2 64.5
## 3 3 56.5
## 4 4 53.4
## 5 5 61.3
## 6 6 62.2
#install.packages("pastecs")
library(pastecs)
round(stat.desc(mydata_1), 2) #Descriptive statistics, rounded to two decimal places
## ID Mass
## nbr.val 50.00 50.00
## nbr.null 0.00 0.00
## nbr.na 0.00 0.00
## min 1.00 49.70
## max 50.00 83.20
## range 49.00 33.50
## sum 1275.00 3143.80
## median 25.50 62.80
## mean 25.50 62.88
## SE.mean 2.06 0.85
## CI.mean.0.95 4.14 1.71
## var 212.50 36.14
## std.dev 14.58 6.01
## coef.var 0.57 0.10
#install.packages("pastecs")
library(pastecs)
round(stat.desc(mydata_1[-1]), 2) #Excluding ID variable
## Mass
## nbr.val 50.00
## nbr.null 0.00
## nbr.na 0.00
## min 49.70
## max 83.20
## range 33.50
## sum 3143.80
## median 62.80
## mean 62.88
## SE.mean 0.85
## CI.mean.0.95 1.71
## var 36.14
## std.dev 6.01
## coef.var 0.10
hist(mydata_1$Mass,
main = "Distribution of 50 nine-graders weight",
ylab = "Frequency",
xlab = "Weight in kg",
breaks = seq(from = 0, to = 100, by = 5),
col = "darkgreen")
Given data:
* The average weight of ninth-graders in 2018/2019 was 59.5 kg.
Our hypotheses:
* H0: Mu = 59.5 kg
* H1: Mu is not equal to 59.5 kg
t.test(mydata_1$Mass,
mu = 59.5,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata_1$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
For the hypothesis testing I used the "t.test" function.
From the t-test above we can reject H0 (p < 0.001), accept H1 and conclude that the average weight of ninth-graders measured in 2020/2021 is not equal to the average weight of ninth-graders measured in 2018/2019. The average weight of ninth-graders in 2020/2021 increased to 62.88 kg. The 95 percent confidence interval lies between 61.17 and 64.58, so we can say with 95% confidence that the average weight in 2020/2021 lies between 61.17 and 64.58 kg.
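The t statistic can also be reproduced by hand from the formula t = (xbar - mu0) / (s / sqrt(n)) (a sketch; xbar, s and n are hypothetical helper names):
xbar <- mean(mydata_1$Mass) # sample mean
s <- sd(mydata_1$Mass)      # sample standard deviation
n <- length(mydata_1$Mass)  # sample size
(xbar - 59.5) / (s / sqrt(n)) # matches t = 3.9711 reported above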
#install.packages("effectsize")
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
cohens_d(mydata_1$Mass, mu = 59.5)
## Cohen's d | 95% CI
## ------------------------
## 0.56 | [0.26, 0.86]
##
## - Deviation from a difference of 59.5.
interpret_cohens_d(0.56, rules = "sawilowsky2009")
## [1] "medium"
## (Rules: sawilowsky2009)
Above we can see that Cohen's d is 0.56. Using the interpretation rules by Sawilowsky (2009), we can say that the effect size is medium.
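Cohen's d for a one-sample test can also be computed directly as the mean deviation from mu divided by the standard deviation (a one-line sketch):
(mean(mydata_1$Mass) - 59.5) / sd(mydata_1$Mass) # approx. 0.56, as above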
#install.packages("readxl")
library(readxl)
mydata_2 <- read_xlsx("C:/LUKA/IMB/BootCamp/R Take Home Exam/R Take Home Exam/Task 3/Apartments.xlsx")
mydata_2 <- as.data.frame(mydata_2)
head(mydata_2)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description: The data set contains 85 apartments from the Ljubljana region. The variables are Age (in years), Distance (from the city centre, in km), Price (in EUR per m2), and the dummy variables Parking and Balcony.
mydata_2$ParkingFactor <- factor(mydata_2$Parking,
levels = c(0, 1),
labels = c("No", "Yes"))
mydata_2$BalconyFactor <- factor(mydata_2$Balcony,
levels = c(0, 1),
labels = c("No", "Yes"))
head(mydata_2)
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor
## 1 7 28 1640 0 1 No Yes
## 2 18 1 2800 1 0 Yes No
## 3 7 28 1660 0 0 No No
## 4 28 29 1850 0 1 No Yes
## 5 18 18 1640 1 1 Yes Yes
## 6 28 12 1770 0 1 No Yes
H0: Mu price = 1900 EUR
H1: Mu price is not equal to 1900 EUR
t.test(mydata_2$Price,
mu = 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydata_2$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
We can reject H0 (p < 0.05), accept H1 and conclude that the average price per m2 of apartments in the Ljubljana region is not equal to 1900 EUR. We can also say with 95% confidence that the average price per m2 lies between 1937.44 and 2100.44 EUR.
fit1 <- lm(Price ~ Age,
data = mydata_2)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
cor(mydata_2$Price, mydata_2$Age)
## [1] -0.230255
Regression function: Price = 2185.455 - 8.975 * Age
Regression coefficient: -8.975. This coefficient tells us that if the age of an apartment increases by 1 year, the price per m2 decreases on average by 8.975 EUR.
Correlation coefficient: -0.230255. This means there is a weak negative correlation between price per m2 and age.
Coefficient of determination: 0.05302. The Multiple R-squared tells us that 5.3% of the variability in the price is explained by age.
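As a sketch, the fitted regression function can be used to predict the price per m2 of a hypothetical 10-year-old apartment:
predict(fit1, newdata = data.frame(Age = 10)) # = 2185.455 - 8.975 * 10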
#install.packages("car")
library(car)
scatterplotMatrix(mydata_2[ , c(-4,-5,-6,-7)],
smooth = FALSE)
When we want to check for a potential problem with multicollinearity, we need to examine the relationships between the explanatory variables; in our case the correlation between Age and Distance. The scatter plot above shows only a slightly positive relationship between Age and Distance, i.e. no strong correlation between them, so we can conclude that in our case there is no potential problem with multicollinearity.
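The visual impression can be backed with the correlation coefficient between the two explanatory variables (a one-line sketch; a value near 0 supports the conclusion):
cor(mydata_2$Age, mydata_2$Distance) # near 0 => no multicollinearity concern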
fit2 <- lm(Price ~ Age + Distance,
data = mydata_2)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
vif(fit2)
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
With the VIF statistic we can easily check for multicollinearity. A common rule is that the VIF values of the explanatory variables should be less than 5 and that the mean VIF should be as close to 1 as possible. In our case both values are well below 5 and the mean VIF is very close to 1, so we can again conclude that there is no problem with multicollinearity.
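The VIF value can also be reconstructed by hand from the formula VIF = 1 / (1 - R^2), where R^2 comes from regressing one explanatory variable on the other (a sketch; r2_age is a hypothetical helper name):
r2_age <- summary(lm(Age ~ Distance, data = mydata_2))$r.squared # R^2 of Age on Distance
1 / (1 - r2_age) # should match vif(fit2) for Age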
mydata_2$StdResid <- round(rstandard(fit2), 3)
mydata_2$CooksD <- round(cooks.distance(fit2), 3)
head(mydata_2)
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid
## 1 7 28 1640 0 1 No Yes -0.665
## 2 18 1 2800 1 0 Yes No 1.783
## 3 7 28 1660 0 0 No No -0.594
## 4 28 29 1850 0 1 No Yes 0.754
## 5 18 18 1640 1 1 Yes Yes -1.073
## 6 28 12 1770 0 1 No Yes -0.778
## CooksD
## 1 0.007
## 2 0.030
## 3 0.006
## 4 0.008
## 5 0.005
## 6 0.005
hist(mydata_2$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
breaks = seq( from = -3, to = 3, by = 1))
hist(mydata_2$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
From the histogram of standardized residuals we can see that no residual is greater than 3 or smaller than -3, which means that at first sight we do not have any outliers. However, the histogram of Cook's distances shows one value with a noticeably higher Cook's distance than the others, which suggests that this observation has a larger influence than the rest. I check this observation below.
head(mydata_2[order(-mydata_2$CooksD), ], 10) # Ten units with the highest Cook's distance
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid
## 38 5 45 2180 1 1 Yes Yes 2.577
## 55 43 37 1740 0 0 No No 1.445
## 33 2 11 2790 1 0 Yes No 2.051
## 53 7 2 1760 0 1 No Yes -2.152
## 22 37 3 2540 1 1 Yes Yes 1.576
## 39 40 2 2400 0 1 No Yes 1.091
## 58 8 2 2820 1 0 Yes No 1.655
## 25 8 26 2300 1 1 Yes Yes 1.571
## 57 10 1 2810 0 0 No No 1.601
## 2 18 1 2800 1 0 Yes No 1.783
## CooksD
## 38 0.320
## 55 0.104
## 33 0.069
## 53 0.066
## 22 0.061
## 39 0.038
## 58 0.037
## 25 0.034
## 57 0.032
## 2 0.030
The output above shows that there is one observation that stands out not only by its Cook's distance but also by a standardized residual close to 3 (2.577). This confirms that there is one observation with high influence. (I removed the unit, but then I had a problem with the next exercise (an error was reported that I was not able to solve), so I initially decided not to remove the unit, so that I could do the next exercise.)
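A more systematic way to flag influential units (a sketch using the common 4/n rule of thumb for Cook's distance, not part of the original analysis; cutoff is a hypothetical helper name) would be:
cutoff <- 4 / nrow(mydata_2) # common rule of thumb for Cook's distance
mydata_2[mydata_2$CooksD > cutoff, c("Price", "StdResid", "CooksD")]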
mydata_2$StdFitted <- scale(fit2$fitted.values)
scatterplot(y = mydata_2$StdResid, x = mydata_2$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = TRUE,
smooth = FALSE)
The scatter plot shows that the variance of the standardized residuals is roughly constant, so there is no sign of heteroskedasticity.
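A formal check (a sketch; not run in the original analysis) could use the ncvTest() function from the already-loaded car package, which tests H0: constant error variance:
ncvTest(fit2) # non-constant variance score test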
hist(mydata_2$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals",
breaks = seq(from = -3, to = 3, by = 1))
From the histogram we can say that the standardized residuals are probably not normally distributed, but we should test this formally with the Shapiro-Wilk normality test.
H0: Standardized residuals are normally distributed
H1: Standardized residuals are not normally distributed
shapiro.test(mydata_2$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata_2$StdResid
## W = 0.95303, p-value = 0.003645
We can reject H0 (p < 0.05) and conclude that the standardized residuals are not normally distributed.
mydata_2 <- mydata_2[!(mydata_2$CooksD == 0.320), ] # Removing the influential unit (Cook's distance = 0.320)
head(mydata_2[order(-mydata_2$CooksD), ])
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid
## 55 43 37 1740 0 0 No No 1.445
## 33 2 11 2790 1 0 Yes No 2.051
## 53 7 2 1760 0 1 No Yes -2.152
## 22 37 3 2540 1 1 Yes Yes 1.576
## 39 40 2 2400 0 1 No Yes 1.091
## 58 8 2 2820 1 0 Yes No 1.655
## CooksD StdFitted
## 55 0.104 -2.6533964
## 33 0.069 0.7902277
## 53 0.066 1.3743712
## 22 0.061 0.3416766
## 39 0.038 0.3291583
## 58 0.037 1.3426980
fit2 <- lm(Price ~ Age + Distance,
data = mydata_2)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -604.92 -229.63 -56.49 192.97 599.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2456.076 73.931 33.221 < 2e-16 ***
## Age -6.464 3.159 -2.046 0.044 *
## Distance -22.955 2.786 -8.240 2.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared: 0.4838, Adjusted R-squared: 0.4711
## F-statistic: 37.96 on 2 and 81 DF, p-value: 2.339e-12
First I removed the unit with the Cook's distance of 0.320. After removing it, I re-estimated the model and used the summary function to present the new values.
Explanation of all the coefficients:
The regression function: Price = 2456.076 - 6.464 * Age - 22.955 * Distance
If age increases by 1 year, the price of an apartment will on average decrease by 6.464 EUR per m2, holding everything else constant.
If the distance from the city centre increases by 1 km, the price of an apartment will on average decrease by 22.955 EUR per m2, holding everything else constant.
The Multiple R-squared tells us that 48.38% of the variability in the price is explained by Age and Distance.
Linear regression function:
fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor,
data = mydata_2)
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 6176767
## 2 79 5654480 2 522287 3.6485 0.03051 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the analysis of variance table we can see that p < 0.05, so we can conclude that model fit3 fits the data better than model fit2.
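The F statistic can be reconstructed by hand from the table as F = ((RSS1 - RSS2) / Df) / (RSS2 / Res.Df2) (a one-line sketch using the values reported above):
((6176767 - 5654480) / 2) / (5654480 / 79) # = 3.6485, as reported above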
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor,
## data = mydata_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473.21 -192.37 -28.89 204.17 558.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2329.724 93.066 25.033 < 2e-16 ***
## Age -5.821 3.074 -1.894 0.06190 .
## Distance -20.279 2.886 -7.026 6.66e-10 ***
## ParkingFactorYes 167.531 62.864 2.665 0.00933 **
## BalconyFactorYes -15.207 59.201 -0.257 0.79795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared: 0.5275, Adjusted R-squared: 0.5035
## F-statistic: 22.04 on 4 and 79 DF, p-value: 3.018e-12
Explanation of the categorical factors:
Apartments with parking have, on average, a price per m2 that is 167.531 EUR higher, assuming all other variables are the same.
Apartments with a balcony have, on average, a price per m2 that is 15.207 EUR lower, assuming all other variables are the same; note, however, that this effect is not statistically significant (p = 0.798).
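As a sketch, the parking effect can be illustrated by predicting the prices of two hypothetical apartments that differ only in parking (the data frame newapts and its values are made up for illustration):
newapts <- data.frame(Age = 15, Distance = 10, # two identical apartments except for parking
ParkingFactor = factor(c("No", "Yes"), levels = c("No", "Yes")),
BalconyFactor = factor("No", levels = c("No", "Yes")))
diff(predict(fit3, newdata = newapts)) # difference = 167.531 EUR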
mydata_2$Fitted <- fitted.values(fit3)
mydata_2$Residuals <- residuals(fit3)
head(mydata_2[colnames(mydata_2) %in% c("Price", "Fitted", "Residuals")])
## Price Fitted Residuals
## 1 1640 1705.952 -65.95206
## 2 2800 2372.197 427.80292
## 3 1660 1721.159 -61.15894
## 4 1850 1563.431 286.56890
## 5 1640 2012.244 -372.24396
## 6 1770 1908.177 -138.17733
The residual is the actual value minus the fitted value. For apartment 2, for example, the residual, i.e. the difference between the actual price (2800) and the fitted price (2372.20), is 427.80 EUR.
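This figure can be verified by hand for apartment 2 (a one-line sketch; the row names were preserved through the earlier filtering):
mydata_2["2", "Price"] - mydata_2["2", "Fitted"] # 2800 - 2372.197 = 427.80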