projectdata<-read.csv(file="C://Users//TOSHIBA//Desktop//Statistics Course//StatisticsDataFiles//Project3Data.csv", header=TRUE)
PRIN19<- subset(projectdata[projectdata$Job.Title =="Principal",], select=Annual.Salary19)
ASSTPRIN19<-subset(projectdata[projectdata$Job.Title =="Assistant Principal",], select=Annual.Salary19)
Checking data for normality:
boxplot(PRIN19$Annual.Salary19, ASSTPRIN19$Annual.Salary19, main = "Annual Salaries of Principals and Asst. Principals",names = c("Principal", "Assistant Principal"), col = c("orange","red"))
Checking distribution of Principals’ Annual Salary 19
qqnorm(PRIN19$Annual.Salary19)
qqline(PRIN19$Annual.Salary19)
library(psych)
describe(PRIN19$Annual.Salary19)
## vars n mean sd median trimmed mad min max range
## X1 1 377 147641.2 8947.85 149329 147738 10704.37 128750 167417 38667
## skew kurtosis se
## X1 -0.13 -0.62 460.84
hist(PRIN19$Annual.Salary19)
Checking for outliers in PRIN19 Annual Salary 19
boxplot(PRIN19$Annual.Salary19)$out
## numeric(0)
There are no outliers in PRIN19 Annual Salary 19.
Checking the distribution of ASSTPRIN19 Annual Salary 19:
qqnorm(ASSTPRIN19$Annual.Salary19)
qqline(ASSTPRIN19$Annual.Salary19)
describe(ASSTPRIN19$Annual.Salary19)
## vars n mean sd median trimmed mad min max range
## X1 1 413 116027 8310.19 117098 115925.2 8819.99 62139 137409 75270
## skew kurtosis se
## X1 -1.14 6.95 408.92
hist(ASSTPRIN19$Annual.Salary19)
There are outliers in ASSTPRIN19 Annual Salary 19. Cleaning the outliers:
ASSTPRIN19$Level<-2
boxplot(ASSTPRIN19$Annual.Salary19)$out
## [1] 62139 62139
outliers_ASSTPRIN19 <- boxplot(ASSTPRIN19$Annual.Salary19)$out
ASSTPRIN19.trimmed <- ASSTPRIN19[!(ASSTPRIN19$Annual.Salary19 %in% outliers_ASSTPRIN19),]
boxplot(ASSTPRIN19.trimmed$Annual.Salary19)
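A note on the trimming idiom: a negative `which()` index returns zero rows when no outliers are flagged, whereas logical indexing keeps every row in that case. The sketch below illustrates this on a small made-up data frame (`toy` and its values are hypothetical, not taken from Project3Data.csv); `boxplot.stats()` applies the same 1.5 * IQR rule as `boxplot()` without drawing a plot.

```r
# Toy data with no boxplot outliers (hypothetical values, not from the project file)
toy <- data.frame(Annual.Salary19 = c(100, 101, 102, 103, 104))

# Same 1.5 * IQR rule that boxplot() uses, but without plotting
toy_outliers <- boxplot.stats(toy$Annual.Salary19)$out

# Negative which(): with nothing flagged, the empty index drops every row
dropped_all <- toy[-which(toy$Annual.Salary19 %in% toy_outliers), , drop = FALSE]

# Logical indexing: with nothing flagged, every row is kept
kept_all <- toy[!(toy$Annual.Salary19 %in% toy_outliers), , drop = FALSE]

nrow(dropped_all)
nrow(kept_all)
```

With groups such as PRIN19, where `boxplot(...)$out` is `numeric(0)`, the logical form is the safer of the two.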
Normality treatment for PRIN19 and ASSTPRIN19.trimmed by applying a log transformation:
ASSTPRIN19.trimmed$Annual.Salary19<-log(ASSTPRIN19.trimmed$Annual.Salary19)
hist(ASSTPRIN19.trimmed$Annual.Salary19)
PRIN19$Annual.Salary19<-log(PRIN19$Annual.Salary19)
hist(PRIN19$Annual.Salary19)
Creating a data frame (q1) with two variables:
PRIN19$Level<-1
ASSTPRIN19.trimmed$Level<-2
q1<-rbind(PRIN19,ASSTPRIN19.trimmed)
Running the t-test:
library(lessR)
##
## lessR 3.8.9 feedback: gerbing@pdx.edu web: lessRstats.com/new
## ---------------------------------------------------------------------
## 1. d <- Read("") Read text, Excel, SPSS, SAS or R data file
## d: default data frame, no need for data=
## 2. l <- Read("", var_labels=TRUE) Read variable labels into l,
## required name for data frame of labels
## 3. Help() Get help, and, e.g., Help(Read)
## 4. hs(), bc(), or ca() All histograms, all bar charts, or both
## 5. Plot(X) or Plot(X,Y) For continuous and categorical variables
## 6. by1= , by2= Trellis graphics, a plot for each by1, by2
## 7. reg(Y ~ X, Rmd="eg") Regression with full interpretative output
## 8. style("gray") Grayscale theme, + many others available
## style(show=TRUE) all color/style options and current values
## 9. getColors() create many styles of color palettes
##
## lessR parameter names now use _'s. Names with a period are deprecated.
## Ex: bin_width instead of bin.width
##
## Attaching package: 'lessR'
## The following objects are masked from 'package:psych':
##
## reflect, scree
ttest(Annual.Salary19 ~ Level, data = q1, Ynm = Annual.Salary19, Xnm = Level, X1nm = "1", X2nm = "2")
##
## Compare Annual.Salary19 across Level levels 1 and 2
## --------------------------------------------------------------
##
##
## ------ Description ------
##
## Annual.Salary19 for Level 1: n.miss = 0, n = 377, mean = 11.900691, sd = 0.061063
## Annual.Salary19 for Level 2: n.miss = 0, n = 411, mean = 11.661805, sd = 0.063778
##
## Sample Mean Difference of Annual.Salary19: 0.238885
##
## Within-group Standard Deviation: 0.062494
##
##
## ------ Assumptions ------
##
## Note: These hypothesis tests can perform poorly, and the
## t-test is typically robust to violations of assumptions.
## Use as heuristic guides instead of interpreting literally.
##
## Null hypothesis, for each group, is a normal distribution of Annual.Salary19.
## Group 1: Sample mean assumed normal because n>30, so no test needed.
## Group 2: Sample mean assumed normal because n>30, so no test needed.
##
## Null hypothesis is equal variances of Annual.Salary19, i.e., homogeneous.
## Variance Ratio test: F = 0.004068/0.003729 = 1.090896, df = 410;376, p-value = 0.390
## Levene's test, Brown-Forsythe: t = -1.146, df = 786, p-value = 0.252
##
##
## ------ Inference ------
##
## --- Assume equal population variances of Annual.Salary19 for each Level
##
## t-cutoff: tcut = 1.963
## Standard Error of Mean Difference: SE = 0.004457
##
## Hypothesis Test of 0 Mean Diff: t = 53.602, df = 786, p-value = 0.000
##
## Margin of Error for 95% Confidence Level: 0.008748
## 95% Confidence Interval for Mean Difference: 0.230137 to 0.247634
##
##
## --- Do not assume equal population variances of Annual.Salary19 for each Level
##
## t-cutoff: tcut = 1.963
## Standard Error of Mean Difference: SE = 0.004448
##
## Hypothesis Test of 0 Mean Diff: t = 53.702, df = 784.551, p-value = 0.000
##
## Margin of Error for 95% Confidence Level: 0.008732
## 95% Confidence Interval for Mean Difference: 0.230153 to 0.247617
##
##
## ------ Effect Size ------
##
## --- Assume equal population variances of Annual.Salary19 for each Level
##
## Standardized Mean Difference of Annual.Salary19, Cohen's d: 3.822525
##
##
## ------ Practical Importance ------
##
## Minimum Mean Difference of practical importance: mmd
## Minimum Standardized Mean Difference of practical importance: msmd
## Neither value specified, so no analysis
##
##
## ------ Graphics Smoothing Parameter ------
##
## Density bandwidth for Level 1: 0.016134
## Density bandwidth for Level 2: 0.017919
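Levene’s test gives p = 0.252, so the variances satisfy the assumption of homogeneity of variance. The equal-variance t-test on the log-transformed salaries gives t(786) = 53.602, p < 0.001. There is a significant difference: the mean log salary of principals (11.90) is higher than that of assistant principals (11.66).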
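Because the test was run on log-transformed salaries, the mean difference is on the log scale. As a rough sketch (the variable names are illustrative, and the values are copied from the lessR output above), exponentiating the difference and its confidence limits gives a ratio of geometric means, which comes out at roughly 1.27:

```r
# Values copied from the lessR output above (natural-log scale)
log_diff <- 0.238885                 # sample mean difference of log salaries
ci_log   <- c(0.230137, 0.247634)    # 95% CI, equal-variance case

# Back-transform: exp(difference of log means) = ratio of geometric means
ratio    <- exp(log_diff)
ci_ratio <- exp(ci_log)

round(c(ratio = ratio, lower = ci_ratio[1], upper = ci_ratio[2]), 3)
```

Read this way, the typical principal salary is about 27% higher than the typical assistant principal salary, under the assumption that the geometric mean is an acceptable summary of these groups.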
//////// QUESTION 2 ////////
REGT<- subset(projectdata[projectdata$Job.Title =="Regular Teacher",], select=Annual.Salary19)
NREGT<-subset(projectdata[projectdata$Job.Title !="Regular Teacher",], select=Annual.Salary19)
Checking data for normality:
boxplot(REGT$Annual.Salary19, NREGT$Annual.Salary19, main = "Annual Salaries of Regular Teachers and Others",names = c("Regular Teacher", "Others"), col = c("blue","green"))
hist(REGT$Annual.Salary19)
hist(NREGT$Annual.Salary19)
describe(REGT$Annual.Salary19)
## vars n mean sd median trimmed mad min max
## X1 1 10032 80068.26 14752.45 85613 81001.69 13595.44 11533 149329
## range skew kurtosis se
## X1 137796 -0.65 -0.14 147.29
describe(NREGT$Annual.Salary19)
## vars n mean sd median trimmed mad min max range
## X1 1 18153 59758.91 31232.7 52505 57672.23 29029.31 9809 260000 250191
## skew kurtosis se
## X1 0.75 0.25 231.81
Removing outliers from REGT Annual.Salary19
REGT$Level<-1
boxplot(REGT$Annual.Salary19)$out
## [1] 149329 19849 20232 12386 12753 12883 19465 18441 19323 17733
## [11] 18061 12883 12386 12883 12883 14316 19162 17750 11533 11773
## [21] 17652
outliers_REGT <- boxplot(REGT$Annual.Salary19)$out
REGT.trimmed <- REGT[!(REGT$Annual.Salary19 %in% outliers_REGT),]
boxplot(REGT.trimmed$Annual.Salary19)
Removing outliers from NREGT Annual.Salary19
NREGT$Level<-2
boxplot(NREGT$Annual.Salary19)$out
## [1] 167417 167417 167417 167417 167417 175000 180000 175000 180000 170000
## [11] 225000 190000 180000 168000 175000 198000 260000 210000 170000
outliers_NREGT <- boxplot(NREGT$Annual.Salary19)$out
NREGT.trimmed <- NREGT[!(NREGT$Annual.Salary19 %in% outliers_NREGT),]
boxplot(NREGT.trimmed$Annual.Salary19)
Checking the distributions again:
hist(REGT.trimmed$Annual.Salary19)
hist(NREGT.trimmed$Annual.Salary19)
Applying a Z transformation:
library(QuantPsyc)
## Loading required package: boot
##
## Attaching package: 'boot'
## The following object is masked from 'package:psych':
##
## logit
## Loading required package: MASS
##
## Attaching package: 'QuantPsyc'
## The following object is masked from 'package:base':
##
## norm
Z.NREGT<-Make.Z(NREGT.trimmed$Annual.Salary19)
Z.REGT<-Make.Z(REGT.trimmed$Annual.Salary19)
hist(Z.NREGT)
hist(Z.REGT)
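The Z transformation did not normalize the data. This is expected: standardizing is a linear rescaling, so it changes the mean and the spread but not the shape of the distribution.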
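A small simulation makes the point about the Z transformation concrete. This is a sketch on simulated right-skewed data (not the salary data), with a hand-written `skewness()` helper that is an assumption of this example rather than a function from psych or QuantPsyc:

```r
set.seed(42)
x <- rexp(1000, rate = 1)          # simulated right-skewed values
z <- (x - mean(x)) / sd(x)         # standardize: mean 0, sd 1

# Moment-based sample skewness (illustrative helper)
skewness <- function(v) {
  centered <- v - mean(v)
  mean(centered^3) / (mean(centered^2))^1.5
}

# Skewness is unchanged by a linear rescaling
c(raw = skewness(x), standardized = skewness(z))
```

The two values agree, which is why the histograms of the standardized salaries have the same shape as the originals.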
Applying a log transformation:
REGT.trimmed$Annual.Salary19<-log(REGT.trimmed$Annual.Salary19)
NREGT.trimmed$Annual.Salary19<-log(NREGT.trimmed$Annual.Salary19)
hist(REGT.trimmed$Annual.Salary19)
hist(NREGT.trimmed$Annual.Salary19)
Binding the data frames:
q2<-rbind(REGT.trimmed,NREGT.trimmed)
Running the t test:
ttest(Annual.Salary19 ~ Level, data = q2, Ynm = Annual.Salary19, Xnm = Level, X1nm = "1", X2nm = "2", show_title = TRUE)
##
## Compare Annual.Salary19 across Level levels 1 and 2
## --------------------------------------------------------------
##
##
## ------ Description ------
##
## Annual.Salary19 for Level 1: n.miss = 0, n = 10011, mean = 11.273959, sd = 0.196844
## Annual.Salary19 for Level 2: n.miss = 0, n = 18134, mean = 10.850785, sd = 0.560155
##
## Sample Mean Difference of Annual.Salary19: 0.423174
##
## Within-group Standard Deviation: 0.464706
##
##
## ------ Assumptions ------
##
## Note: These hypothesis tests can perform poorly, and the
## t-test is typically robust to violations of assumptions.
## Use as heuristic guides instead of interpreting literally.
##
## Null hypothesis, for each group, is a normal distribution of Annual.Salary19.
## Group 1: Sample mean assumed normal because n>30, so no test needed.
## Group 2: Sample mean assumed normal because n>30, so no test needed.
##
## Null hypothesis is equal variances of Annual.Salary19, i.e., homogeneous.
## Variance Ratio test: F = 0.313774/0.038748 = 8.097862, df = 18133;10010, p-value = 0.000
## Levene's test, Brown-Forsythe: t = -98.650, df = 28143, p-value = 0.000
##
##
## ------ Inference ------
##
## --- Assume equal population variances of Annual.Salary19 for each Level
##
## t-cutoff: tcut = 1.960
## Standard Error of Mean Difference: SE = 0.005786
##
## Hypothesis Test of 0 Mean Diff: t = 73.135, df = 28143, p-value = 0.000
##
## Margin of Error for 95% Confidence Level: 0.011341
## 95% Confidence Interval for Mean Difference: 0.411832 to 0.434515
##
##
## --- Do not assume equal population variances of Annual.Salary19 for each Level
##
## t-cutoff: tcut = 1.960
## Standard Error of Mean Difference: SE = 0.004601
##
## Hypothesis Test of 0 Mean Diff: t = 91.965, df = 24896.034, p-value = 0.000
##
## Margin of Error for 95% Confidence Level: 0.009019
## 95% Confidence Interval for Mean Difference: 0.414154 to 0.432193
##
##
## ------ Effect Size ------
##
## --- Assume equal population variances of Annual.Salary19 for each Level
##
## Standardized Mean Difference of Annual.Salary19, Cohen's d: 0.910626
##
##
## ------ Practical Importance ------
##
## Minimum Mean Difference of practical importance: mmd
## Minimum Standardized Mean Difference of practical importance: msmd
## Neither value specified, so no analysis
##
##
## ------ Graphics Smoothing Parameter ------
##
## Density bandwidth for Level 1: 0.022024
## Density bandwidth for Level 2: 0.028473
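Levene’s test gives p < 0.001, so the data do not satisfy the homogeneity of variance assumption. Reading the unequal-variance result instead, t(24896.034) = 91.965, p < 0.001: the mean log salary of regular teachers (11.27) is significantly higher than that of the other employees (10.85).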
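The unequal-variance (Welch) line of the output can be reproduced from the group summaries alone. The following is a sketch using the n, mean and sd values printed in the Description block above; the variable names are illustrative, and small discrepancies come from the rounding of the printed values.

```r
# Group summaries copied from the lessR Description block (log scale)
n1 <- 10011; m1 <- 11.273959; s1 <- 0.196844   # Level 1: regular teachers
n2 <- 18134; m2 <- 10.850785; s2 <- 0.560155   # Level 2: others

# Welch standard error: variances are not pooled
v1 <- s1^2 / n1
v2 <- s2^2 / n2
se_welch <- sqrt(v1 + v2)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_welch  <- (m1 - m2) / se_welch
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

round(c(SE = se_welch, t = t_welch, df = df_welch), 3)
```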
//////// QUESTION 3 ////////
SECAI<- subset(projectdata[projectdata$Job.Title =="Special Ed Classroom Assist",], select=Annual.Salary19)
SECAII<-subset(projectdata[projectdata$Job.Title =="Special Ed Classroom Assist II",], select=Annual.Salary19)
Checking data for normality:
boxplot(SECAI$Annual.Salary19, SECAII$Annual.Salary19, main = "Annual Salaries of Special Ed Classroom Assist",names = c("SECA I", "SECA II"), col = c("pink","orange"))
hist(SECAI$Annual.Salary19)
hist(SECAII$Annual.Salary19)
describe(SECAI$Annual.Salary19)
## vars n mean sd median trimmed mad min max range
## X1 1 1260 37220.45 3784.2 36891 37349.48 2584.17 16619 50781 34162
## skew kurtosis se
## X1 -1.39 6.33 106.61
describe(SECAII$Annual.Salary19)
## vars n mean sd median trimmed mad min max range
## X1 1 1396 37284.75 3988.16 36446 37449.35 2584.17 17268 45678 28410
## skew kurtosis se
## X1 -2.22 9.38 106.74
Outlier treatment for SECAI:
SECAI$Level<-1
boxplot(SECAI$Annual.Salary19)$out
## [1] 50781 19317 18446 20245 16619 21877 18446 21228 17410 21228 18446
## [12] 18446 20245 17410 16619 21228 21228 18446 19317 21877 18446
outliers_SECAI <- boxplot(SECAI$Annual.Salary19)$out
SECAI.trimmed <- SECAI[!(SECAI$Annual.Salary19 %in% outliers_SECAI),]
boxplot(SECAI.trimmed$Annual.Salary19)
hist(SECAI.trimmed$Annual.Salary19)
describe(SECAI.trimmed$Annual.Salary19)
## vars n mean sd median trimmed mad min max range
## X1 1 1239 37498.36 3019.8 36891 37430.13 2584.17 31720 42456 10736
## skew kurtosis se
## X1 0.32 -1.05 85.79
Outlier treatment for SECAII:
SECAII$Level<-2
boxplot(SECAII$Annual.Salary19)$out
## [1] 45678 18058 18058 21877 21877 18058 18058 19095 21877 19095 19095
## [12] 19095 21877 21877 19095 21877 18223 19095 21877 21877 20894 20894
## [23] 19095 19095 18223 18223 17268 18223 18223 18223 21877 17268 18223
## [34] 18223 18223 21877 21877 18223 21877 18223 18223
outliers_SECAII <- boxplot(SECAII$Annual.Salary19)$out
SECAII.trimmed <- SECAII[!(SECAII$Annual.Salary19 %in% outliers_SECAII),]
boxplot(SECAII.trimmed$Annual.Salary19)
hist(SECAII.trimmed$Annual.Salary19)
describe(SECAII.trimmed$Annual.Salary19)
## vars n mean sd median trimmed mad min max range
## X1 1 1355 37801.71 2589.93 36446 37557.98 2584.17 33018 43754 10736
## skew kurtosis se
## X1 0.72 -0.22 70.36
Binding the data frames:
q3<-rbind(SECAI.trimmed,SECAII.trimmed)
Running the t test:
library(lessR)
ttest(Annual.Salary19 ~ Level, data = q3, Ynm = Annual.Salary19, Xnm = Level, X1nm = "1", X2nm = "2", show_title = TRUE)
##
## Compare Annual.Salary19 across Level levels 2 and 1
## --------------------------------------------------------------
##
##
## ------ Description ------
##
## Annual.Salary19 for Level 2: n.miss = 0, n = 1355, mean = 37801.714, sd = 2589.926
## Annual.Salary19 for Level 1: n.miss = 0, n = 1239, mean = 37498.359, sd = 3019.801
##
## Sample Mean Difference of Annual.Salary19: 303.355
##
## Within-group Standard Deviation: 2803.479
##
##
## ------ Assumptions ------
##
## Note: These hypothesis tests can perform poorly, and the
## t-test is typically robust to violations of assumptions.
## Use as heuristic guides instead of interpreting literally.
##
## Null hypothesis, for each group, is a normal distribution of Annual.Salary19.
## Group 2: Sample mean assumed normal because n>30, so no test needed.
## Group 1: Sample mean assumed normal because n>30, so no test needed.
##
## Null hypothesis is equal variances of Annual.Salary19, i.e., homogeneous.
## Variance Ratio test: F = 9119195.450/6707716.417 = 1.360, df = 1238;1354, p-value = 0.000
## Levene's test, Brown-Forsythe: t = -5.130, df = 2592, p-value = 0.000
##
##
## ------ Inference ------
##
## --- Assume equal population variances of Annual.Salary19 for each Level
##
## t-cutoff: tcut = 1.961
## Standard Error of Mean Difference: SE = 110.199
##
## Hypothesis Test of 0 Mean Diff: t = 2.753, df = 2592, p-value = 0.006
##
## Margin of Error for 95% Confidence Level: 216.087
## 95% Confidence Interval for Mean Difference: 87.269 to 519.442
##
##
## --- Do not assume equal population variances of Annual.Salary19 for each Level
##
## t-cutoff: tcut = 1.961
## Standard Error of Mean Difference: SE = 110.953
##
## Hypothesis Test of 0 Mean Diff: t = 2.734, df = 2450.003, p-value = 0.006
##
## Margin of Error for 95% Confidence Level: 217.570
## 95% Confidence Interval for Mean Difference: 85.785 to 520.926
##
##
## ------ Effect Size ------
##
## --- Assume equal population variances of Annual.Salary19 for each Level
##
## Standardized Mean Difference of Annual.Salary19, Cohen's d: 0.108
##
##
## ------ Practical Importance ------
##
## Minimum Mean Difference of practical importance: mmd
## Minimum Standardized Mean Difference of practical importance: msmd
## Neither value specified, so no analysis
##
##
## ------ Graphics Smoothing Parameter ------
##
## Density bandwidth for Level 2: 697.576
## Density bandwidth for Level 1: 828.049
Levene’s test gives p < 0.001, so the samples do not satisfy the assumption of homogeneity of variance.
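For two groups, a Brown-Forsythe style Levene test amounts to a t-test on the absolute deviations from each group's median. The sketch below shows that idea on simulated data; it assumes, without having checked the lessR source, that this is essentially what the "Levene's test, Brown-Forsythe" line reports. The groups `g1` and `g2` are made up for illustration.

```r
set.seed(1)
g1 <- rnorm(500, mean = 0, sd = 1)   # simulated group with small spread
g2 <- rnorm(500, mean = 0, sd = 3)   # simulated group with large spread

# Absolute deviations from each group's own median
dev1 <- abs(g1 - median(g1))
dev2 <- abs(g2 - median(g2))

# Equal-variance t-test on the deviations: a small p-value suggests unequal spread
bf <- t.test(dev1, dev2, var.equal = TRUE)
bf$statistic
bf$parameter
bf$p.value
```

A negative t statistic here means the first group has the smaller typical deviation, which matches the sign convention seen in the outputs of this report.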
Trying to run the test with a log transformation:
SECAI.trimmed$Annual.Salary19<-log(SECAI.trimmed$Annual.Salary19)
SECAII.trimmed$Annual.Salary19<-log(SECAII.trimmed$Annual.Salary19)
hist(SECAI.trimmed$Annual.Salary19)
hist(SECAII.trimmed$Annual.Salary19)
Merging the log transformed data:
q3.1<-rbind(SECAI.trimmed,SECAII.trimmed)
Running the t-test again:
ttest(Annual.Salary19 ~ Level, data = q3.1, Ynm = Annual.Salary19, Xnm = Level, X1nm = "1", X2nm = "2", show_title = TRUE)
##
## Compare Annual.Salary19 across Level levels 2 and 1
## --------------------------------------------------------------
##
##
## ------ Description ------
##
## Annual.Salary19 for Level 2: n.miss = 0, n = 1355, mean = 10.537828, sd = 0.067145
## Annual.Salary19 for Level 1: n.miss = 0, n = 1239, mean = 10.528848, sd = 0.079890
##
## Sample Mean Difference of Annual.Salary19: 0.008980
##
## Within-group Standard Deviation: 0.073508
##
##
## ------ Assumptions ------
##
## Note: These hypothesis tests can perform poorly, and the
## t-test is typically robust to violations of assumptions.
## Use as heuristic guides instead of interpreting literally.
##
## Null hypothesis, for each group, is a normal distribution of Annual.Salary19.
## Group 2: Sample mean assumed normal because n>30, so no test needed.
## Group 1: Sample mean assumed normal because n>30, so no test needed.
##
## Null hypothesis is equal variances of Annual.Salary19, i.e., homogeneous.
## Variance Ratio test: F = 0.006382/0.004508 = 1.415677, df = 1238;1354, p-value = 0.000
## Levene's test, Brown-Forsythe: t = -5.699, df = 2592, p-value = 0.000
##
##
## ------ Inference ------
##
## --- Assume equal population variances of Annual.Salary19 for each Level
##
## t-cutoff: tcut = 1.961
## Standard Error of Mean Difference: SE = 0.002889
##
## Hypothesis Test of 0 Mean Diff: t = 3.108, df = 2592, p-value = 0.002
##
## Margin of Error for 95% Confidence Level: 0.005666
## 95% Confidence Interval for Mean Difference: 0.003314 to 0.014646
##
##
## --- Do not assume equal population variances of Annual.Salary19 for each Level
##
## t-cutoff: tcut = 1.961
## Standard Error of Mean Difference: SE = 0.002912
##
## Hypothesis Test of 0 Mean Diff: t = 3.084, df = 2427.696, p-value = 0.002
##
## Margin of Error for 95% Confidence Level: 0.005710
## 95% Confidence Interval for Mean Difference: 0.003270 to 0.014690
##
##
## ------ Effect Size ------
##
## --- Assume equal population variances of Annual.Salary19 for each Level
##
## Standardized Mean Difference of Annual.Salary19, Cohen's d: 0.122162
##
##
## ------ Practical Importance ------
##
## Minimum Mean Difference of practical importance: mmd
## Minimum Standardized Mean Difference of practical importance: msmd
## Neither value specified, so no analysis
##
##
## ------ Graphics Smoothing Parameter ------
##
## Density bandwidth for Level 2: 0.018085
## Density bandwidth for Level 1: 0.021906
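The log transformation does not improve homogeneity of variance either (Levene’s test, p < 0.001). The unequal-variance result on the log scale is t(2427.696) = 3.084, p = 0.002.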
//////// QUESTION 4 ////////
LATT<- subset(projectdata[projectdata$Job.Title =="Lunchroom Attendant",], select=Annual.Salary19)
CUSTW<-subset(projectdata[projectdata$Job.Title =="Custodial Worker",], select=Annual.Salary19)
Checking data for normality:
boxplot(LATT$Annual.Salary19, CUSTW$Annual.Salary19, main = "Annual Salaries of Lunchroom Attendants and Custodial Workers",names = c("Lunchroom Attendant", "Custodial Worker"), col = c("orange","red"))
hist(LATT$Annual.Salary19)
hist(CUSTW$Annual.Salary19)
describe(LATT$Annual.Salary19)
## vars n mean sd median trimmed mad min max range
## X1 1 690 18189.78 2977.94 17293 18324.36 4272.85 10993 23057 12064
## skew kurtosis se
## X1 -0.44 -0.06 113.37
describe(CUSTW$Annual.Salary19)
## vars n mean sd median trimmed mad min max range skew
## X1 1 547 35362.15 4336.2 35537 34845.72 0 28323 50509 22186 0.91
## kurtosis se
## X1 0.91 185.4
Outlier treatment for LATT Annual Salary 19:
LATT$Level<-1
boxplot(LATT$Annual.Salary19)$out
## [1] 11529 11529 11529 11529 11529 11529 11529 11529 11529 11529 11529
## [12] 11529 11529 11529 11529 11529 11529 11529 11529 11529 11529 10993
## [23] 11529 11529 11529 11529 11529 11529 11529 11529 11529 11529 11529
## [34] 10993 11529 11529 11529 11529 11529 11529 11529 11529 11529 11529
## [45] 11529 11529 11529 11529 11529
outliers_LATT <- boxplot(LATT$Annual.Salary19)$out
LATT.trimmed <- LATT[!(LATT$Annual.Salary19 %in% outliers_LATT),]
boxplot(LATT.trimmed$Annual.Salary19)
hist(LATT.trimmed$Annual.Salary19)
describe(LATT.trimmed$Annual.Salary19)
## vars n mean sd median trimmed mad min max range
## X1 1 641 18700.62 2421.94 18734 18690.72 2136.43 12970 23057 10087
## skew kurtosis se
## X1 0.08 -0.54 95.66
library(rcompanion)
## Registered S3 method overwritten by 'DescTools':
## method from
## print.palette wesanderson
##
## Attaching package: 'rcompanion'
## The following object is masked from 'package:psych':
##
## phi
plotNormalHistogram(LATT.trimmed$Annual.Salary19)
Outlier treatment for CUSTW Annual Salary 19:
CUSTW$Level<-2
boxplot(CUSTW$Annual.Salary19)$out
## [1] 45625 44963 45625 45625 45625 45625 45625 44331 45625 45625 44963
## [12] 45625 45625 45625 45625 45625 45625 45625 43048 45625 45625 45625
## [23] 45625 45625 45625 44963 44331 45625 45625 45625 45625 45625 45625
## [34] 45625 45625 45625 44331 45625 45625 44963 45625 45625 45625 43048
## [45] 45625 50509 45625 45625 44963 45625 45625 44331 44319 43048 45625
## [56] 45625 45625 44331 45625
outliers_CUSTW <- boxplot(CUSTW$Annual.Salary19)$out
CUSTW.trimmed <- CUSTW[!(CUSTW$Annual.Salary19 %in% outliers_CUSTW),]
boxplot(CUSTW.trimmed$Annual.Salary19)
hist(CUSTW.trimmed$Annual.Salary19)
describe(CUSTW.trimmed$Annual.Salary19)
## vars n mean sd median trimmed mad min max range skew
## X1 1 488 34149.9 2705.54 35537 34335.45 0 28323 38658 10335 -0.69
## kurtosis se
## X1 -0.77 122.47
plotNormalHistogram(CUSTW.trimmed$Annual.Salary19)
Binding the dataframes:
q4<-rbind(LATT.trimmed,CUSTW.trimmed)
Running the t test:
ttest(Annual.Salary19 ~ Level, data = q4, Ynm = Annual.Salary19, Xnm = Level, X1nm = "1", X2nm = "2", show_title = TRUE)
##
## Compare Annual.Salary19 across Level levels 2 and 1
## --------------------------------------------------------------
##
##
## ------ Description ------
##
## Annual.Salary19 for Level 2: n.miss = 0, n = 488, mean = 34149.902, sd = 2705.538
## Annual.Salary19 for Level 1: n.miss = 0, n = 641, mean = 18700.621, sd = 2421.935
##
## Sample Mean Difference of Annual.Salary19: 15449.281
##
## Within-group Standard Deviation: 2548.361
##
##
## ------ Assumptions ------
##
## Note: These hypothesis tests can perform poorly, and the
## t-test is typically robust to violations of assumptions.
## Use as heuristic guides instead of interpreting literally.
##
## Null hypothesis, for each group, is a normal distribution of Annual.Salary19.
## Group 2: Sample mean assumed normal because n>30, so no test needed.
## Group 1: Sample mean assumed normal because n>30, so no test needed.
##
## Null hypothesis is equal variances of Annual.Salary19, i.e., homogeneous.
## Variance Ratio test: F = 7319933.514/5865770.339 = 1.248, df = 487;640, p-value = 0.009
## Levene's test, Brown-Forsythe: t = -2.060, df = 1127, p-value = 0.040
##
##
## ------ Inference ------
##
## --- Assume equal population variances of Annual.Salary19 for each Level
##
## t-cutoff: tcut = 1.962
## Standard Error of Mean Difference: SE = 153.098
##
## Hypothesis Test of 0 Mean Diff: t = 100.911, df = 1127, p-value = 0.000
##
## Margin of Error for 95% Confidence Level: 300.389
## 95% Confidence Interval for Mean Difference: 15148.892 to 15749.670
##
##
## --- Do not assume equal population variances of Annual.Salary19 for each Level
##
## t-cutoff: tcut = 1.962
## Standard Error of Mean Difference: SE = 155.405
##
## Hypothesis Test of 0 Mean Diff: t = 99.413, df = 983.832, p-value = 0.000
##
## Margin of Error for 95% Confidence Level: 304.964
## 95% Confidence Interval for Mean Difference: 15144.317 to 15754.245
##
##
## ------ Effect Size ------
##
## --- Assume equal population variances of Annual.Salary19 for each Level
##
## Standardized Mean Difference of Annual.Salary19, Cohen's d: 6.062
##
##
## ------ Practical Importance ------
##
## Minimum Mean Difference of practical importance: mmd
## Minimum Standardized Mean Difference of practical importance: msmd
## Neither value specified, so no analysis
##
##
## ------ Graphics Smoothing Parameter ------
##
## Density bandwidth for Level 2: 891.789
## Density bandwidth for Level 1: 757.639
Levene’s test gives p = 0.040, so the samples do not satisfy the assumption of homogeneity of variance. However, the minimum salary in CUSTW (28323) is higher than the maximum salary in LATT (23057), so the two groups do not overlap and a t-test is not meaningful.
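The size of the separation between the two groups shows up in the effect size. As a sketch, the Cohen's d of 6.062 reported above can be recovered from the printed group summaries using the pooled (within-group) standard deviation; the variable names are illustrative.

```r
# Group summaries copied from the lessR Description block
n_cust <- 488; sd_cust <- 2705.538    # Level 2: custodial workers
n_latt <- 641; sd_latt <- 2421.935    # Level 1: lunchroom attendants
mean_diff <- 15449.281                # sample mean difference

# Pooled within-group standard deviation
sp <- sqrt(((n_cust - 1) * sd_cust^2 + (n_latt - 1) * sd_latt^2) /
             (n_cust + n_latt - 2))

# Cohen's d: mean difference in pooled-sd units
d <- mean_diff / sp

round(c(within_sd = sp, cohens_d = d), 3)
```

A difference of about six pooled standard deviations is consistent with the observation that the two salary ranges do not overlap.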
//////// QUESTION 5 ////////
Subsetting the annual salaries of Regular Teachers in 2018 and 2019:
REGT19<-subset(projectdata[projectdata$Job.Title =="Regular Teacher",], select=Annual.Salary19)
REGT18<-subset(projectdata[projectdata$Job.Title =="Regular Teacher",], select=Annual.Salary18)
Checking the distributions:
boxplot(REGT18$Annual.Salary18, REGT19$Annual.Salary19, main = "Annual Salaries of Regular Teachers in 2018-2019",names = c("Annual Salary 18", "Annual Salary 19"), col = c("yellow","orange"))
library(psych)
describe(REGT18$Annual.Salary18)
## vars n mean sd median trimmed mad min max
## X1 1 10032 78828.25 15398.88 84804 79642.99 15205.55 11333 147850
## range skew kurtosis se
## X1 136517 -0.55 -0.47 153.74
describe(REGT19$Annual.Salary19)
## vars n mean sd median trimmed mad min max
## X1 1 10032 80068.26 14752.45 85613 81001.69 13595.44 11533 149329
## range skew kurtosis se
## X1 137796 -0.65 -0.14 147.29
hist(REGT18$Annual.Salary18)
hist(REGT19$Annual.Salary19)
library(PairedData)
## Loading required package: gld
## Loading required package: mvtnorm
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
##
## Attaching package: 'PairedData'
## The following object is masked from 'package:base':
##
## summary
pd <- paired(REGT18, REGT19)
plot(pd, type = "profile") + theme_bw()
Histogram of differences:
D.REGT.18.19 = REGT18$Annual.Salary18 - REGT19$Annual.Salary19
describe(D.REGT.18.19)
## vars n mean sd median trimmed mad min max range
## X1 1 10032 -1240.01 1649.95 -1215 -1118.69 1801.36 -34814 48346 83160
## skew kurtosis se
## X1 4.77 173.17 16.47
hist(D.REGT.18.19,
col="gray",
main="Histogram of differences between 2018-2019 Annual Salaries of Regular Teachers",
xlab="Difference")
The differences are not normally distributed (skew = 4.77, kurtosis = 173.17), but the sample size is large (n = 10032, well above 30).
Running the paired t-test:
t.test(REGT18$Annual.Salary18, REGT19$Annual.Salary19, paired = TRUE, alternative = "two.sided")
##
## Paired t-test
##
## data: REGT18$Annual.Salary18 and REGT19$Annual.Salary19
## t = -75.275, df = 10031, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1272.304 -1207.723
## sample estimates:
## mean of the differences
## -1240.013
The annual salary of regular teachers in 2019 (M = 80068.26, SD = 14752.45) is significantly different from their annual salary in 2018 (M = 78828.25, SD = 15398.88), t(10031) = -75.275, p < 2.2e-16. On average, the 2019 salary is 1240.01 higher.
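A paired t-test is a one-sample t-test on the differences, so the statistic above can be checked from the summary of `D.REGT.18.19`. A minimal sketch, using the printed n, mean and sd of the differences (variable names are illustrative):

```r
# Summary of the 2018 minus 2019 differences, copied from the output above
n      <- 10032
mean_d <- -1240.013
sd_d   <- 1649.95

# Standard error of the mean difference, then the t statistic
se_d   <- sd_d / sqrt(n)
t_stat <- mean_d / se_d
df     <- n - 1

# Two-sided p-value (underflows to 0 at this magnitude)
p_val  <- 2 * pt(-abs(t_stat), df)

round(c(se = se_d, t = t_stat, df = df), 3)
```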
//////// QUESTION 6 ////////
PRIN19<-subset(projectdata[projectdata$Job.Title =="Principal",], select=Annual.Salary19)
PRIN18<-subset(projectdata[projectdata$Job.Title =="Principal",], select=Annual.Salary18)
Comparing the distributions:
boxplot(PRIN18$Annual.Salary18, PRIN19$Annual.Salary19, main = "Annual Salaries of Principals in 2018-2019",names = c("Annual Salary 18", "Annual Salary 19"), col = c("orange","red"))
describe(PRIN18$Annual.Salary18)
## vars n mean sd median trimmed mad min max range
## X1 1 377 145224.8 9822.51 146657 145356.2 10873.39 125000 167417 42417
## skew kurtosis se
## X1 -0.14 -0.7 505.88
describe(PRIN19$Annual.Salary19)
## vars n mean sd median trimmed mad min max range
## X1 1 377 147641.2 8947.85 149329 147738 10704.37 128750 167417 38667
## skew kurtosis se
## X1 -0.13 -0.62 460.84
hist(PRIN18$Annual.Salary18)
hist(PRIN19$Annual.Salary19)
Histogram of Differences:
D.PRIN.18.19 =PRIN18$Annual.Salary18 - PRIN19$Annual.Salary19
describe(D.PRIN.18.19)
## vars n mean sd median trimmed mad min max range skew
## X1 1 377 -2416.34 1163.87 -2732 -2428.19 1742.05 -7499 0 7499 -0.2
## kurtosis se
## X1 0.49 59.94
hist(D.PRIN.18.19,
col="gray",
main="Histogram of differences between 2018-2019 Annual Salaries of Principals",
xlab="Difference")
The differences are approximately normally distributed (skew = -0.2, kurtosis = 0.49). Running the t-test:
t.test(PRIN18$Annual.Salary18, PRIN19$Annual.Salary19, paired = TRUE, alternative = "two.sided")
##
## Paired t-test
##
## data: PRIN18$Annual.Salary18 and PRIN19$Annual.Salary19
## t = -40.311, df = 376, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2534.201 -2298.473
## sample estimates:
## mean of the differences
## -2416.337
The annual salary of principals in 2019 (M = 147641.2, SD = 8947.85) is significantly different from their annual salary in 2018 (M = 145224.8, SD = 9822.51), t(376) = -40.311, p < 2.2e-16. On average, the 2019 salary is 2416.34 higher.
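The 95 percent confidence interval in the paired output can be rebuilt from the summary of `D.PRIN.18.19`. This is a sketch using the printed n, mean and sd of the differences, so the result agrees with the output only up to rounding; the variable names are illustrative.

```r
# Summary of the 2018 minus 2019 differences, copied from the output above
n      <- 377
mean_d <- -2416.337
sd_d   <- 1163.87

# Standard error and the two-sided 95% critical value on n - 1 df
se_d   <- sd_d / sqrt(n)
t_crit <- qt(0.975, df = n - 1)

# Confidence interval: mean difference plus or minus the margin of error
ci <- mean_d + c(-1, 1) * t_crit * se_d

round(c(se = se_d, lower = ci[1], upper = ci[2]), 3)
```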
//////// QUESTION 7 ////////
ASSTPRIN19<-subset(projectdata[projectdata$Job.Title =="Assistant Principal",], select=Annual.Salary19)
ASSTPRIN18<-subset(projectdata[projectdata$Job.Title =="Assistant Principal",], select=Annual.Salary18)
Comparing the distributions:
boxplot(ASSTPRIN18$Annual.Salary18, ASSTPRIN19$Annual.Salary19, main = "Annual Salaries of Assistant Principals in 2018-2019",names = c("Annual Salary 18", "Annual Salary 19"), col = c("green","blue"))
describe(ASSTPRIN18$Annual.Salary18)
## vars n mean sd median trimmed mad min max range
## X1 1 413 113710.9 8932.19 114802 113451.2 10418.23 61524 137409 75885
## skew kurtosis se
## X1 -0.7 4.23 439.52
describe(ASSTPRIN19$Annual.Salary19)
## vars n mean sd median trimmed mad min max range
## X1 1 413 116027 8310.19 117098 115925.2 8819.99 62139 137409 75270
## skew kurtosis se
## X1 -1.14 6.95 408.92
hist(ASSTPRIN18$Annual.Salary18)
hist(ASSTPRIN19$Annual.Salary19)
Histogram of Differences:
D.ASSTPRIN.18.19 =ASSTPRIN18$Annual.Salary18 - ASSTPRIN19$Annual.Salary19
describe(D.ASSTPRIN.18.19)
## vars n mean sd median trimmed mad min max range
## X1 1 413 -2316.09 1240.1 -2342 -2420.84 1108.98 -15739 8712 24451
## skew kurtosis se
## X1 -1.06 46.49 61.02
hist(D.ASSTPRIN.18.19,
col="gray",
main="Histogram of differences between 2018-2019 Annual Salaries of Assistant Principals",
xlab="Difference")
The differences do not follow a normal distribution (skew = -1.06, kurtosis = 46.49), but the sample size is large (n = 413). Running the t-test:
t.test(ASSTPRIN18$Annual.Salary18, ASSTPRIN19$Annual.Salary19, paired = TRUE, alternative = "two.sided")
##
## Paired t-test
##
## data: ASSTPRIN18$Annual.Salary18 and ASSTPRIN19$Annual.Salary19
## t = -37.955, df = 412, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2436.039 -2196.136
## sample estimates:
## mean of the differences
## -2316.087
The annual salary of assistant principals in 2019 (M = 116027, SD = 8310.19) is significantly different from their annual salary in 2018 (M = 113710.9, SD = 8932.19), t(412) = -37.955, p < 2.2e-16. On average, the 2019 salary is 2316.09 higher.
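The large-sample argument used when the differences are not normal can be probed with a simulation. The sketch below is an illustration on simulated heavy-tailed values (a t distribution with 3 degrees of freedom, chosen as an assumption, not fitted to the salary data): it draws many samples of size 413 and checks how often the usual t interval covers the true mean of 0.

```r
set.seed(1)
n_per_sample <- 413     # same size as the assistant principal group
n_reps       <- 2000    # number of simulated samples

covered <- replicate(n_reps, {
  d  <- rt(n_per_sample, df = 3)   # heavy-tailed differences, true mean 0
  ci <- t.test(d)$conf.int         # ordinary one-sample t interval
  ci[1] <= 0 && 0 <= ci[2]         # does the interval contain the true mean?
})

# Proportion of intervals that cover the true mean (nominal level is 0.95)
coverage <- mean(covered)
coverage
```

If the coverage is close to 0.95, the t procedure is behaving as advertised at this sample size despite the heavy tails. This does not prove the same holds for the actual salary differences.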
//////// QUESTION 8 ////////
Grouping the variables:
PRIN.19<-subset(projectdata[projectdata$Job.Title =="Principal",], select=Annual.Salary19)
ASSTPRIN.19<-subset(projectdata[projectdata$Job.Title =="Assistant Principal",], select=Annual.Salary19)
REGT.19<-subset(projectdata[projectdata$Job.Title =="Regular Teacher",], select=Annual.Salary19)
SET.19<-subset(projectdata[projectdata$Job.Title =="Special Education Teacher",], select=Annual.Salary19)
CUSTW.19<-subset(projectdata[projectdata$Job.Title =="Custodial Worker",], select=Annual.Salary19)
Assigning Levels to groups:
PRIN.19$Level<-"1"
ASSTPRIN.19$Level<-"2"
REGT.19$Level<-"3"
SET.19$Level<-"4"
CUSTW.19$Level<-"5"
Merging the groups:
q8<-rbind(PRIN.19,ASSTPRIN.19,REGT.19,SET.19,CUSTW.19)
Checking the data:
levels(q8$Level)
## NULL
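levels() returns NULL because Level was assigned as a character value, so the column is a character vector rather than a factor.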
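Converting the column with `factor()` makes the levels visible. A minimal sketch on a made-up three-row data frame (`toy` and its salary values are hypothetical); the labels are only a suggestion for making later plots and Tukey output easier to read:

```r
# Hypothetical miniature version of q8 with character group codes
toy <- data.frame(
  Annual.Salary19 = c(150000, 115000, 80000),
  Level = c("1", "2", "3"),
  stringsAsFactors = FALSE
)

# A character column has no levels
levels_before <- levels(toy$Level)
levels_before

# Convert to a factor, optionally attaching readable labels
toy$Level <- factor(toy$Level,
                    levels = c("1", "2", "3"),
                    labels = c("Principal", "Assistant Principal", "Regular Teacher"))
levels(toy$Level)
```

`aov()` converts a character predictor to a factor internally, so the ANOVA itself does not depend on this step.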
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
dplyr::sample_n(q8, 10)
## Annual.Salary19 Level
## 1 58705 4
## 2 66237 3
## 3 91894 3
## 4 88390 3
## 5 79878 4
## 6 67432 3
## 7 88580 3
## 8 82630 3
## 9 96039 3
## 10 90307 4
Summary statistics by group:
library(dplyr)
group_by(q8, Level) %>%
summarise(
count = n(),
mean = mean(Annual.Salary19, na.rm = TRUE),
sd = sd(Annual.Salary19, na.rm = TRUE)
)
## # A tibble: 5 x 4
## Level count mean sd
## <chr> <int> <dbl> <dbl>
## 1 1 377 147641. 8948.
## 2 2 413 116027. 8310.
## 3 3 10032 80068. 14752.
## 4 4 2756 79523. 14704.
## 5 5 547 35362. 4336.
Visualizing the data:
library(ggpubr)
## Loading required package: magrittr
ggboxplot(q8, x = "Level", y = "Annual.Salary19",
color = "Level", palette = c("#00AFBB", "#E7B800", "#FC4E07", "#009999", "#4DB3E6"),
order = c("1", "2", "3", "4", "5"),
ylab = "Annual Salary 19", xlab = "Job Title")
Mean plot:
ggline(q8, x = "Level", y = "Annual.Salary19",
add = c("mean_se", "jitter"),
order = c("1", "2", "3","4","5"),
ylab = "Annual Salary 19", xlab = "Job Title")
Running the one-way ANOVA:
res.aov <- aov(Annual.Salary19 ~ Level, data = q8)
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Level 4 3334900909600 833725227400 4134 <0.0000000000000002
## Residuals 14120 2847547864601 201667696
Since p < 2e-16, there are significant differences among the groups, F(4, 14120) = 4134.
Running Tukey’s HSD:
TukeyHSD(res.aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Annual.Salary19 ~ Level, data = q8)
##
## $Level
## diff lwr upr p adj
## 2-1 -31614.1852 -34373.814 -28854.5567 0.0000000
## 3-1 -67572.9166 -69605.381 -65540.4527 0.0000000
## 4-1 -68118.5683 -70245.986 -65991.1509 0.0000000
## 5-1 -112279.0341 -114872.344 -109685.7242 0.0000000
## 3-2 -35958.7314 -37903.949 -34013.5137 0.0000000
## 4-2 -36504.3830 -38548.611 -34460.1554 0.0000000
## 5-2 -80664.8489 -83190.362 -78139.3354 0.0000000
## 4-3 -545.6516 -1378.854 287.5512 0.3813911
## 5-3 -44706.1175 -46407.170 -43005.0652 0.0000000
## 5-4 -44160.4659 -45973.908 -42347.0234 0.0000000
All pairwise comparisons are significant except 4-3 (p adj = 0.381), the comparison between Special Education Teachers (4) and Regular Teachers (3).
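The F value in the ANOVA table above follows directly from the sums of squares and degrees of freedom. The sketch below recomputes it from the printed table; it also derives eta squared as a rough effect size, which is an addition of this example and not part of the original output.

```r
# Values copied from summary(res.aov)
ss_between <- 3334900909600;  df_between <- 4
ss_within  <- 2847547864601;  df_within  <- 14120

# Mean squares and the F ratio
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within
f_value    <- ms_between / ms_within

# Upper-tail p-value of the F distribution (underflows to 0 here)
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Eta squared: share of total variation explained by group membership
eta_sq <- ss_between / (ss_between + ss_within)

round(c(F = f_value, eta_squared = eta_sq), 3)
```

An eta squared of roughly 0.54 suggests that job title accounts for a little over half of the variation in 2019 salaries across these five groups.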