library(tidyverse)
library(openintro)
library(car)
employee <- read.csv("/Users/snawaz/Downloads/r_projects/Small-R-projects/Project 13/employee.csv")
class(employee)
## [1] "data.frame"
nrow(employee)
## [1] 71
The number of rows above shows that the sample size is 71.
Answer: The probability is computed with pt(q = T, df). Here df = 69, as given above, so the one-sided probability is calculated as:
pt(0,df=69,lower.tail=F)
## [1] 0.5
The result is 0.5: for T = 0 the one-sided p-value is 0.5, because the t distribution is symmetric about 0 and half of its probability lies on the positive side of the curve.
Answer:
pt(0.8,df=69,lower.tail = T)
## [1] 0.7867717
1-pt(0.69,69,lower.tail = T)
## [1] 0.2462542
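As a cross-check on the tail probabilities above, the sketch below recomputes them and confirms that the lower and upper tails at the same point sum to 1. The variable names are illustrative; df = 69 is taken from the text (n - 2 with n = 71).

```r
# Degrees of freedom used in the text (71 observations, 2 estimated coefficients)
df <- 69

# Upper-tail probability at T = 0: half of a symmetric t distribution
upper_at_zero <- pt(0, df = df, lower.tail = FALSE)

# Lower- and upper-tail probabilities at T = 0.8
lower_08 <- pt(0.8, df = df, lower.tail = TRUE)
upper_08 <- pt(0.8, df = df, lower.tail = FALSE)

# Two equivalent ways of writing the upper tail at T = 0.69
upper_069_a <- 1 - pt(0.69, df = df)
upper_069_b <- pt(0.69, df = df, lower.tail = FALSE)

print(c(upper_at_zero = upper_at_zero, lower_08 = lower_08, upper_08 = upper_08))
```

Using lower.tail = FALSE avoids the subtraction from 1, which matters numerically only for very small tail probabilities.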
Consider experience as the explanatory variable and salary as the response (explained) variable.
So x = experience (explanatory variable) and y = salary (response variable).
Answer: The mean of the experience variable is 5.746479 and its standard deviation is 3.241333.
mean(employee$experience)
## [1] 5.746479
sd(employee$experience)
## [1] 3.241333
The mean of experience is 5.75 and its standard deviation is 3.24.
mean(employee$salary)
## [1] 45141.51
sd(employee$salary)
## [1] 10805.85
The mean of the salary variable is 45141.51 and its standard deviation is 10805.85.
library(ggplot2)
ggplot(employee) +
aes(x = experience, y = salary) +
geom_point(shape = "circle open", size = 2.6, colour = "#461124") +
labs(subtitle = "Relationship between salary and experience") +
ggthemes::theme_base()
According to the plot, salary generally increases with experience, but the relationship between the two variables is not strictly linear: in some cases the salary is high even with little experience. The employees with 10 years of experience have the highest salaries in the dataset.
cor(employee$experience, employee$salary, method = "pearson")
## [1] 0.5515126
The correlation coefficient between the two variables is 0.55, which indicates a moderate positive correlation. This agrees with the scatter plot, where salary tends to increase with experience. The scatter plot also shows some employees whose salary does not increase with experience, which may explain why the correlation coefficient is not higher.
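A note on the method argument of cor(): when the full vector c("pearson", "kendall", "spearman") is supplied, only the first entry is used, so the value reported above is the Pearson coefficient. The sketch below illustrates this on a small made-up data set (x and y are hypothetical values, not the employee data) and adds the rank-based Spearman coefficient for comparison.

```r
# Hypothetical values for illustration only (not the employee data)
x <- c(1, 2, 3, 5, 8, 10, 12)
y <- c(30, 34, 33, 41, 45, 60, 52)

# Passing all three method names selects the first one, "pearson"
r_default  <- cor(x, y, method = c("pearson", "kendall", "spearman"))
r_pearson  <- cor(x, y, method = "pearson")
r_spearman <- cor(x, y, method = "spearman")

# Pearson's r written out from its definition
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

print(c(default = r_default, pearson = r_pearson, spearman = r_spearman))
```

Spearman's coefficient depends only on ranks, so it can be a useful companion when the scatter plot suggests a monotone but not strictly linear relationship.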
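# Note: the formula a ~ b puts experience (a) on the left-hand side, so this model
# treats experience as the response and salary (b) as the predictor.
a <- employee$experience
b <- employee$salary
model <- lm(a~b,data=employee)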
model
##
## Call:
## lm(formula = a ~ b, data = employee)
##
## Coefficients:
## (Intercept) b
## -1.7213787 0.0001654
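Answer:
experience = -1.7213787 + 0.0001654*salary
Because the call is lm(a ~ b) with a = experience and b = salary, the fitted line has experience as the response, which is the reverse of the roles stated above (x = experience, y = salary); fitting salary on experience would require lm(salary ~ experience, data = employee). The intercept is negative, which means that the model predicts a negative experience (about -1.72 years) for a salary of 0, a value with no practical meaning.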
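For a regression of salary on experience, the least-squares slope and intercept follow directly from the summary statistics already reported in this document, via b1 = r * s_y / s_x and b0 = mean(y) - b1 * mean(x). The sketch below applies these standard formulas to the rounded values printed above, so the results are approximate; the variable names are illustrative.

```r
# Summary statistics reported earlier in this document
r      <- 0.5515126    # Pearson correlation between experience and salary
mean_x <- 5.746479     # mean of experience
sd_x   <- 3.241333     # standard deviation of experience
mean_y <- 45141.51     # mean of salary
sd_y   <- 10805.85     # standard deviation of salary

# Least-squares slope and intercept for salary ~ experience
b1 <- r * sd_y / sd_x
b0 <- mean_y - b1 * mean_x

# Slope of the reverse regression (experience ~ salary), for comparison
b1_reverse <- r * sd_x / sd_y

print(c(intercept = b0, slope = b1, reverse_slope = b1_reverse))
```

The reverse slope agrees with the coefficient 0.0001654 printed for b, while the slope for salary on experience comes out at roughly 1839 per year of experience with an intercept of roughly 34576.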
ggplot(data = employee, aes(x = experience, y = salary)) + geom_point() + stat_smooth(method = "lm", se = TRUE) +
ggthemes::theme_base()
## `geom_smooth()` using formula = 'y ~ x'
sqrt(deviance(model)/df.residual(model))
## [1] 2.723334
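The value 2.72 on 69 degrees of freedom is the residual standard error of the model, not the standard error of the slope. The standard error \(S_{b_1}\) of the slope \(b_1\) is 3.012e-05, the Std. Error entry for b in summary(model).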
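# The coefficient in `model` is named "b", so parm = 'employee$salary' matches nothing
# and the call returns NA; confint(model, 'b', level = 0.95) would select the slope.
confint(model,'employee$salary',level=0.95)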
## 2.5 % 97.5 %
## employee$salary NA NA
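A 95% confidence interval for the slope can also be rebuilt by hand from the slope estimate and standard error printed by summary(model) (1.654e-04 and 3.012e-05, on 69 degrees of freedom). A minimal sketch, using those rounded values and illustrative variable names:

```r
estimate <- 1.654e-04   # slope estimate for b from summary(model)
std_err  <- 3.012e-05   # standard error of that slope
df       <- 69          # residual degrees of freedom

# Critical value of the t distribution for a two-sided 95% interval
t_crit <- qt(0.975, df = df)

ci <- c(lower = estimate - t_crit * std_err,
        upper = estimate + t_crit * std_err)
print(ci)
```

Because the inputs are rounded, the interval is approximate (about 1.05e-04 to 2.25e-04); it excludes 0, in line with the small p-value reported for the slope.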
For this purpose we define two hypotheses, tested at the 5% significance level. Null hypothesis H0: the slope is equal to 0. Alternative hypothesis HA: the slope is different from 0.
The test for the slope is the t-test on the coefficient b reported by summary(model) below (t = 5.492 on 69 degrees of freedom, p-value = 6.21e-07). The p-value is less than 0.05 (the significance level), so we reject the null hypothesis. The two-sample t.test call that follows compares the means of experience and salary and does not test the slope.
Since we rejected the null hypothesis, we have sufficient evidence to say that the true slope is not zero, i.e. that there is a linear association between experience and salary.
t.test(employee$experience,employee$salary,data=employee)
##
## Welch Two Sample t-test
##
## data: employee$experience and employee$salary
## t = -35.196, df = 70, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -47693.46 -42578.06
## sample estimates:
## mean of x mean of y
## 5.746479 45141.507042
summary(model)
##
## Call:
## lm(formula = a ~ b, data = employee)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1165 -2.0753 0.0411 2.5985 4.3389
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.721e+00 1.398e+00 -1.232 0.222
## b 1.654e-04 3.012e-05 5.492 6.21e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.723 on 69 degrees of freedom
## Multiple R-squared: 0.3042, Adjusted R-squared: 0.2941
## F-statistic: 30.16 on 1 and 69 DF, p-value: 6.206e-07
The summary of our regression model shows that the multiple R2 is 0.3042 (adjusted R2: 0.2941). About 30% of the variability in the response is therefore explained by the linear relationship with the predictor; R2 measures the share of variance explained, not the share of the time a prediction is correct.
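The key figures in the summary can be reproduced by hand: the t value is the estimate divided by its standard error, the p-value is the two-sided tail area of a t distribution with 69 degrees of freedom, and in simple linear regression the multiple R-squared equals the squared Pearson correlation. A short sketch using the rounded values printed in this document (variable names are illustrative):

```r
estimate <- 1.654e-04   # slope estimate for b
std_err  <- 3.012e-05   # its standard error
df       <- 69          # residual degrees of freedom

# t statistic and two-sided p-value for H0: slope = 0
t_value <- estimate / std_err
p_value <- 2 * pt(abs(t_value), df = df, lower.tail = FALSE)

# R-squared as the squared correlation between experience and salary
r <- 0.5515126
r_squared <- r^2

print(c(t = t_value, p = p_value, r_squared = r_squared))
```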
We can deduce that from our linear equation of the model.
predict(model,data.frame(b=8),interval = 'confidence',level=0.95)*10000
## fit lwr upr
## 1 -17200.55 -45078.49 10677.38
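This value cannot be read as an expected salary. In model, b is salary and the response is experience, so data.frame(b=8) returns the predicted experience for a salary of 8 (about -1.72), the multiplication by 10000 is arbitrary, and the fitted value is negative (-17200.55). Predicting salary for an employee with 8 years of experience requires a model with salary as the response.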
predict(model,data.frame(b=3),interval = 'confidence',level=0.95)*10000
## fit lwr upr
## 1 -17208.82 -45089.68 10672.03
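This value is not an expected salary: in model, b is salary, so data.frame(b=3) gives the predicted experience for a salary of 3, which is then multiplied by 10000 (-17208.82).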
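To predict salary from experience, the model needs salary on the left-hand side of the formula, and the newdata column must carry the predictor's name. The sketch below shows the pattern on a small made-up data frame (toy is hypothetical, not the employee data). For the real data, the line implied by the reported summary statistics (slope r * s_y / s_x, intercept mean(y) - slope * mean(x)) is roughly salary = 34576 + 1838.6 * experience, which gives about 49,300 for 8 years and about 40,100 for 3 years.

```r
# Hypothetical data for illustration only (not the employee data)
toy <- data.frame(experience = 1:10)
toy$salary <- 34000 + 1800 * toy$experience + rep(c(1500, -1500), times = 5)

# Salary is the response, experience the predictor
fit <- lm(salary ~ experience, data = toy)

# newdata must use the predictor's column name from the formula
new_obs <- data.frame(experience = c(8, 3))
pred <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)
print(pred)
```

Using column names in the formula, rather than vectors such as a and b, is what lets predict() match the newdata columns to the model terms.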
Extracting the residuals
head(resid(model))
## 1 2 3 4 5 6
## -0.6333298 2.9246884 2.9994637 -6.1165043 -1.6498201 3.6097430
The residuals appear to follow a trend. With time-series data (traditionally ordered chronologically), this could indicate autocorrelation of the errors, which violates the independence assumption, and therefore a factor not taken into account (e.g. age, training, etc.).
res<-resid(model)
plot(res, main = "Residuals")
abline(h = 0, col = "blue")
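A visual impression of a trend in the residuals can be backed by a number. One common choice is the Durbin-Watson statistic, computed below by hand together with the lag-1 correlation of consecutive residuals. This is a sketch on made-up data (toy is hypothetical, not the employee data), and it is only meaningful when the rows have a natural order.

```r
# Hypothetical data for illustration only (not the employee data)
toy <- data.frame(experience = 1:10)
toy$salary <- 34000 + 1800 * toy$experience + rep(c(1500, -1500), times = 5)
fit <- lm(salary ~ experience, data = toy)
res <- resid(fit)

# Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation,
# values toward 0 suggest positive and values toward 4 negative autocorrelation
dw <- sum(diff(res)^2) / sum(res^2)

# Correlation between each residual and the one before it
lag1 <- cor(res[-1], res[-length(res)])

print(c(durbin_watson = dw, lag1_correlation = lag1))
```

The toy residuals alternate in sign by construction, so the statistic lands above 2 and the lag-1 correlation is negative.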
# drop the first (id) column
p <- employee[,-1]
# data frame for female employees (gender == 0)
employee_f <- p %>% filter(gender==0)
# data frame for male employees (gender == 1)
employee_m <- p %>% filter(gender==1)
# fit the regression model on female employees
model_f <- lm(employee_f$experience~employee_f$salary,data=employee_f)
# fit the regression model on male employees
model_m <- lm(employee_m$experience~employee_m$salary,data=employee_m)
# slope of regression for female subgroup
model_f$coef[2]
## employee_f$salary
## 0.0001175642
# slope of regression for male subgroup
model_m$coef[2]
## employee_m$salary
## 0.0002276517
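The slope of the regression for the female subgroup is \(1.175642 \times 10^{-4}\) and the slope for the male subgroup is \(2.276517 \times 10^{-4}\). Both models are fitted with experience as the response and salary as the predictor, so these slopes are in years of experience per unit of salary.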
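If the goal is to compare how salary grows with experience in each subgroup, the models can be fitted with salary as the response, and a single model with an interaction term gives the difference between the two slopes directly. The sketch below uses a made-up data frame (toy is hypothetical, not the employee data) and keeps the document's coding of gender (0 = female, 1 = male).

```r
# Hypothetical data for illustration only (not the employee data);
# gender follows the document's coding: 0 = female, 1 = male
toy <- data.frame(
  gender     = rep(c(0, 1), each = 6),
  experience = rep(c(1, 3, 5, 7, 9, 11), times = 2),
  salary     = c(36000, 38500, 42000, 44500, 49000, 50500,
                 37000, 42500, 47000, 53500, 57000, 64500)
)

# One model per subgroup, with salary as the response
fit_f <- lm(salary ~ experience, data = subset(toy, gender == 0))
fit_m <- lm(salary ~ experience, data = subset(toy, gender == 1))
slope_f <- unname(coef(fit_f)["experience"])
slope_m <- unname(coef(fit_m)["experience"])

# A single model with an interaction term reproduces both slopes:
# "experience" is the slope for gender 0, "experience:gender" the difference
fit_int <- lm(salary ~ experience * gender, data = toy)
slope_diff <- unname(coef(fit_int)["experience:gender"])

print(c(female = slope_f, male = slope_m, difference = slope_diff))
```

The interaction model also provides a standard error and t-test for the slope difference through summary(fit_int), which the two separate fits do not.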