Lab Project 3.2

R Markdown

projectdata<-read.csv(file="C://Users//TOSHIBA//Desktop//Statistics Course//StatisticsDataFiles//Project3Data.csv", header=TRUE)

PRIN19<- subset(projectdata[projectdata$Job.Title =="Principal",], select=Annual.Salary19)
ASSTPRIN19<-subset(projectdata[projectdata$Job.Title =="Assistant Principal",], select=Annual.Salary19)

Checking data for normality:

boxplot(PRIN19$Annual.Salary19, ASSTPRIN19$Annual.Salary19, main = "Annual Salaries of Principals and Asst. Principals",names = c("Principal", "Assistant Principal"), col = c("orange","red"))

Checking distribution of Principals’ Annual Salary 19

qqnorm(PRIN19$Annual.Salary19)
qqline(PRIN19$Annual.Salary19)

library(psych)
describe(PRIN19$Annual.Salary19)

##    vars   n     mean      sd median trimmed      mad    min    max range
## X1    1 377 147641.2 8947.85 149329  147738 10704.37 128750 167417 38667
##     skew kurtosis     se
## X1 -0.13    -0.62 460.84

hist(PRIN19$Annual.Salary19)

Checking for outliers in PRIN19 Annual Salary 19

boxplot(PRIN19$Annual.Salary19)$out

## numeric(0)
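
For reference, `boxplot(x)$out` flags values lying more than 1.5 times the hinge spread beyond the lower or upper hinge. The sketch below reproduces that rule by hand on a small made-up vector (illustrative numbers, not project data) and compares it with `boxplot.stats()`, which applies the same rule without drawing a plot.

```r
# Illustrative values only; not taken from Project3Data.csv
x <- c(10, 12, 13, 14, 15, 16, 18, 60)

# fivenum() returns min, lower hinge, median, upper hinge, max
five <- fivenum(x)
spread <- five[4] - five[2]          # hinge spread (close to the IQR)

lower_fence <- five[2] - 1.5 * spread
upper_fence <- five[4] + 1.5 * spread

# Values outside the fences are the "outliers"
manual_out <- x[x < lower_fence | x > upper_fence]

# boxplot.stats() applies the same rule without plotting
auto_out <- boxplot.stats(x)$out

print(manual_out)
print(auto_out)
```

Using `boxplot.stats()` instead of `boxplot()` avoids producing an extra plot each time the outlier values are needed.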

There are no outliers in PRIN19 Annual Salary 19.

Checking the distribution of ASSTPRIN19 Annual Salary 19:

qqnorm(ASSTPRIN19$Annual.Salary19)
qqline(ASSTPRIN19$Annual.Salary19)

describe(ASSTPRIN19$Annual.Salary19)

##    vars   n   mean      sd median  trimmed     mad   min    max range
## X1    1 413 116027 8310.19 117098 115925.2 8819.99 62139 137409 75270
##     skew kurtosis     se
## X1 -1.14     6.95 408.92

hist(ASSTPRIN19$Annual.Salary19)

There are outliers in ASSTPRIN19 Annual Salary 19. Cleaning the outliers:

ASSTPRIN19$Level<-2
boxplot(ASSTPRIN19$Annual.Salary19)$out

## [1] 62139 62139

outliers_ASSTPRIN19 <- boxplot(ASSTPRIN19$Annual.Salary19)$out

ASSTPRIN19.trimmed <- ASSTPRIN19[-which(ASSTPRIN19$Annual.Salary19 %in% outliers_ASSTPRIN19),]
boxplot(ASSTPRIN19.trimmed$Annual.Salary19)
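
One caveat about the `-which(...)` pattern used for trimming: it works here because outliers were found, but when nothing is flagged (as with PRIN19) the same expression silently drops every row. The following sketch, on a hypothetical data frame named `df`, shows the problem and a logical-index alternative that behaves the same way when outliers exist.

```r
# Hypothetical salary column with no boxplot outliers
df <- data.frame(salary = c(100, 105, 110, 115, 120))
out <- boxplot.stats(df$salary)$out   # numeric(0): nothing flagged

# Negative indexing with an empty which() selects zero rows
dropped_all <- df[-which(df$salary %in% out), , drop = FALSE]

# Logical indexing keeps every row when there is nothing to remove
kept_all <- df[!(df$salary %in% out), , drop = FALSE]

nrow(dropped_all)
nrow(kept_all)
```

The logical form `df[!(x %in% out), ]` is therefore the safer default when the same trimming code is reused across groups.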

Normality treatment for PRIN19 and ASSTPRIN19.trimmed, applying a log transformation:

ASSTPRIN19.trimmed$Annual.Salary19<-log(ASSTPRIN19.trimmed$Annual.Salary19)
hist(ASSTPRIN19.trimmed$Annual.Salary19)

PRIN19$Annual.Salary19<-log(PRIN19$Annual.Salary19)
hist(PRIN19$Annual.Salary19)

Creating a data frame (q1) with two variables:

PRIN19$Level<-1
ASSTPRIN19.trimmed$Level<-2
q1<-rbind(PRIN19,ASSTPRIN19.trimmed)

Running the t-test:

library(lessR)

## 
## lessR 3.8.9     feedback: gerbing@pdx.edu     web: lessRstats.com/new
## ---------------------------------------------------------------------
## 1. d <- Read("")           Read text, Excel, SPSS, SAS or R data file
##                            d: default data frame, no need for data=
## 2. l <- Read("", var_labels=TRUE)   Read variable labels into l,
##                            required name for data frame of labels
## 3. Help()                  Get help, and, e.g., Help(Read)
## 4. hs(), bc(), or ca()     All histograms, all bar charts, or both
## 5. Plot(X) or Plot(X,Y)    For continuous and categorical variables
## 6. by1= , by2=             Trellis graphics, a plot for each by1, by2
## 7. reg(Y ~ X, Rmd="eg")    Regression with full interpretative output
## 8. style("gray")           Grayscale theme, + many others available
##    style(show=TRUE)        all color/style options and current values
## 9. getColors()             create many styles of color palettes
## 
## lessR parameter names now use _'s. Names with a period are deprecated.
## Ex:  bin_width  instead of  bin.width

## 
## Attaching package: 'lessR'

## The following objects are masked from 'package:psych':
## 
##     reflect, scree

ttest(Annual.Salary19 ~ Level, data = q1, Ynm = Annual.Salary19, Xnm = Level, X1nm = "1", X2nm = "2")

## 
## Compare Annual.Salary19 across Level levels 1 and 2 
## --------------------------------------------------------------
## 
## 
## ------ Description ------
## 
## Annual.Salary19 for Level 1:  n.miss = 0,  n = 377,  mean = 11.900691,  sd = 0.061063
## Annual.Salary19 for Level 2:  n.miss = 0,  n = 411,  mean = 11.661805,  sd = 0.063778
## 
## Sample Mean Difference of Annual.Salary19:  0.238885
## 
## Within-group Standard Deviation:   0.062494 
## 
## 
## ------ Assumptions ------
## 
## Note: These hypothesis tests can perform poorly, and the 
##       t-test is typically robust to violations of assumptions. 
##       Use as heuristic guides instead of interpreting literally. 
## 
## Null hypothesis, for each group, is a normal distribution of Annual.Salary19.
## Group 1: Sample mean assumed normal because n>30, so no test needed.
## Group 2: Sample mean assumed normal because n>30, so no test needed.
## 
## Null hypothesis is equal variances of Annual.Salary19, i.e., homogeneous.
## Variance Ratio test:  F = 0.004068/0.003729 = 1.090896,  df = 410;376,  p-value = 0.390
## Levene's test, Brown-Forsythe:  t = -1.146,  df = 786,  p-value = 0.252
## 
## 
## ------ Inference ------
## 
## --- Assume equal population variances of Annual.Salary19 for each Level 
## 
## t-cutoff: tcut =  1.963 
## Standard Error of Mean Difference: SE =  0.004457 
## 
## Hypothesis Test of 0 Mean Diff:  t = 53.602,  df = 786,  p-value = 0.000
## 
## Margin of Error for 95% Confidence Level:  0.008748
## 95% Confidence Interval for Mean Difference:  0.230137 to 0.247634
## 
## 
## --- Do not assume equal population variances of Annual.Salary19 for each Level 
## 
## t-cutoff: tcut =  1.963 
## Standard Error of Mean Difference: SE =  0.004448 
## 
## Hypothesis Test of 0 Mean Diff:  t = 53.702,  df = 784.551, p-value = 0.000
## 
## Margin of Error for 95% Confidence Level:  0.008732
## 95% Confidence Interval for Mean Difference:  0.230153 to 0.247617
## 
## 
## ------ Effect Size ------
## 
## --- Assume equal population variances of Annual.Salary19 for each Level 
## 
## Standardized Mean Difference of Annual.Salary19, Cohen's d:  3.822525
## 
## 
## ------ Practical Importance ------
## 
## Minimum Mean Difference of practical importance: mmd
## Minimum Standardized Mean Difference of practical importance: msmd
## Neither value specified, so no analysis
## 
## 
## ------ Graphics Smoothing Parameter ------
## 
## Density bandwidth for Level 1: 0.016134
## Density bandwidth for Level 2: 0.017919

Levene’s test gives p = 0.252, so the assumption of homogeneity of variance is satisfied. The equal-variance t-test on the log-transformed salaries gives t(786) = 53.602, p < 0.001. The mean log salary of principals (11.90) is significantly higher than that of assistant principals (11.66).
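
Because this test was run on log-transformed salaries, the reported mean difference is on the log scale. A minimal sketch of how it can be read back in salary terms, using only the numbers printed in the output above: exponentiating a difference of mean logs gives a ratio of geometric means.

```r
# Values copied from the lessR ttest() output above (log scale)
diff_log <- 0.238885
ci_log   <- c(0.230137, 0.247634)

# exp() turns a difference of mean logs into a ratio of geometric means
ratio    <- exp(diff_log)
ratio_ci <- exp(ci_log)

round(ratio, 2)
round(ratio_ci, 2)
```

A ratio of about 1.27 means the geometric-mean salary of principals is roughly 27% higher than that of assistant principals (after outlier trimming).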

//QUESTION 2

REGT<- subset(projectdata[projectdata$Job.Title =="Regular Teacher",], select=Annual.Salary19)
NREGT<-subset(projectdata[projectdata$Job.Title !="Regular Teacher",], select=Annual.Salary19)

Checking data for normality:

boxplot(REGT$Annual.Salary19, NREGT$Annual.Salary19, main = "Annual Salaries of Regular Teachers and Others",names = c("Regular Teacher", "Others"), col = c("blue","green"))

hist(REGT$Annual.Salary19)

hist(NREGT$Annual.Salary19)

describe(REGT$Annual.Salary19)

##    vars     n     mean       sd median  trimmed      mad   min    max
## X1    1 10032 80068.26 14752.45  85613 81001.69 13595.44 11533 149329
##     range  skew kurtosis     se
## X1 137796 -0.65    -0.14 147.29

describe(NREGT$Annual.Salary19)

##    vars     n     mean      sd median  trimmed      mad  min    max  range
## X1    1 18153 59758.91 31232.7  52505 57672.23 29029.31 9809 260000 250191
##    skew kurtosis     se
## X1 0.75     0.25 231.81

Removing outliers from REGT Annual.Salary19

REGT$Level<-1
boxplot(REGT$Annual.Salary19)$out

##  [1] 149329  19849  20232  12386  12753  12883  19465  18441  19323  17733
## [11]  18061  12883  12386  12883  12883  14316  19162  17750  11533  11773
## [21]  17652

outliers_REGT <- boxplot(REGT$Annual.Salary19)$out

REGT.trimmed <- REGT[-which(REGT$Annual.Salary19 %in% outliers_REGT),]
boxplot(REGT.trimmed$Annual.Salary19)

Removing outliers from NREGT Annual.Salary19

NREGT$Level<-2
boxplot(NREGT$Annual.Salary19)$out

##  [1] 167417 167417 167417 167417 167417 175000 180000 175000 180000 170000
## [11] 225000 190000 180000 168000 175000 198000 260000 210000 170000

outliers_NREGT <- boxplot(NREGT$Annual.Salary19)$out

NREGT.trimmed <- NREGT[-which(NREGT$Annual.Salary19 %in% outliers_NREGT),]
boxplot(NREGT.trimmed$Annual.Salary19)

Checking the distributions again:

hist(REGT.trimmed$Annual.Salary19)

hist(NREGT.trimmed$Annual.Salary19)

Applying a Z transformation:

library(QuantPsyc)

## Loading required package: boot

## 
## Attaching package: 'boot'

## The following object is masked from 'package:psych':
## 
##     logit

## Loading required package: MASS

## 
## Attaching package: 'QuantPsyc'

## The following object is masked from 'package:base':
## 
##     norm

Z.NREGT<-Make.Z(NREGT.trimmed$Annual.Salary19)
Z.REGT<-Make.Z(REGT.trimmed$Annual.Salary19)
hist(Z.NREGT)

hist(Z.REGT)

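The Z transformation did not help. This is expected: standardizing is a linear rescaling, so it changes the mean and standard deviation but not the shape of the distribution.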
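
To illustrate why standardizing cannot fix skewness, here is a small sketch on simulated data (not project data). The `skew()` helper is a hypothetical moment-based function written out here so that no extra package is needed.

```r
set.seed(1)
x <- rexp(1000)                       # simulated right-skewed data

# Moment-based skewness, written out to avoid extra packages
skew <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

z <- (x - mean(x)) / sd(x)            # Z transformation (standardizing)

skew(x)   # skewness before
skew(z)   # skewness after: the same value
```

Only a nonlinear transformation (such as a log or square root) can change the shape of a distribution.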

Applying a log transformation:

REGT.trimmed$Annual.Salary19<-log(REGT.trimmed$Annual.Salary19)
NREGT.trimmed$Annual.Salary19<-log(NREGT.trimmed$Annual.Salary19)
hist(REGT.trimmed$Annual.Salary19)

hist(NREGT.trimmed$Annual.Salary19)

Binding the data frames:

q2<-rbind(REGT.trimmed,NREGT.trimmed)

Running the t test:

ttest(Annual.Salary19 ~ Level, data = q2, Ynm = Annual.Salary19, Xnm = Level, X1nm = "1", X2nm = "2", show_title = TRUE)

## 
## Compare Annual.Salary19 across Level levels 1 and 2 
## --------------------------------------------------------------
## 
## 
## ------ Description ------
## 
## Annual.Salary19 for Level 1:  n.miss = 0,  n = 10011,  mean = 11.273959,  sd = 0.196844
## Annual.Salary19 for Level 2:  n.miss = 0,  n = 18134,  mean = 10.850785,  sd = 0.560155
## 
## Sample Mean Difference of Annual.Salary19:  0.423174
## 
## Within-group Standard Deviation:   0.464706 
## 
## 
## ------ Assumptions ------
## 
## Note: These hypothesis tests can perform poorly, and the 
##       t-test is typically robust to violations of assumptions. 
##       Use as heuristic guides instead of interpreting literally. 
## 
## Null hypothesis, for each group, is a normal distribution of Annual.Salary19.
## Group 1: Sample mean assumed normal because n>30, so no test needed.
## Group 2: Sample mean assumed normal because n>30, so no test needed.
## 
## Null hypothesis is equal variances of Annual.Salary19, i.e., homogeneous.
## Variance Ratio test:  F = 0.313774/0.038748 = 8.097862,  df = 18133;10010,  p-value = 0.000
## Levene's test, Brown-Forsythe:  t = -98.650,  df = 28143,  p-value = 0.000
## 
## 
## ------ Inference ------
## 
## --- Assume equal population variances of Annual.Salary19 for each Level 
## 
## t-cutoff: tcut =  1.960 
## Standard Error of Mean Difference: SE =  0.005786 
## 
## Hypothesis Test of 0 Mean Diff:  t = 73.135,  df = 28143,  p-value = 0.000
## 
## Margin of Error for 95% Confidence Level:  0.011341
## 95% Confidence Interval for Mean Difference:  0.411832 to 0.434515
## 
## 
## --- Do not assume equal population variances of Annual.Salary19 for each Level 
## 
## t-cutoff: tcut =  1.960 
## Standard Error of Mean Difference: SE =  0.004601 
## 
## Hypothesis Test of 0 Mean Diff:  t = 91.965,  df = 24896.034, p-value = 0.000
## 
## Margin of Error for 95% Confidence Level:  0.009019
## 95% Confidence Interval for Mean Difference:  0.414154 to 0.432193
## 
## 
## ------ Effect Size ------
## 
## --- Assume equal population variances of Annual.Salary19 for each Level 
## 
## Standardized Mean Difference of Annual.Salary19, Cohen's d:  0.910626
## 
## 
## ------ Practical Importance ------
## 
## Minimum Mean Difference of practical importance: mmd
## Minimum Standardized Mean Difference of practical importance: msmd
## Neither value specified, so no analysis
## 
## 
## ------ Graphics Smoothing Parameter ------
## 
## Density bandwidth for Level 1: 0.022024
## Density bandwidth for Level 2: 0.028473

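Levene’s test gives p < 0.001, so the data do not satisfy the homogeneity of variance assumption. The result to read is therefore the one that does not assume equal variances: t(24896.03) = 91.965, p < 0.001.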
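
When the equal-variance assumption fails, the unequal-variance (Welch) version of the t-test is the usual choice; lessR prints it under "Do not assume equal population variances". As a rough sketch, the same test can be requested in base R with `t.test()`. The data below are simulated and the group names `g1` and `g2` are placeholders.

```r
set.seed(42)
# Simulated groups with clearly unequal spreads (not project data)
g1 <- rnorm(200, mean = 11.3, sd = 0.2)
g2 <- rnorm(300, mean = 10.9, sd = 0.6)

# var.equal = FALSE (the default in base R) requests Welch's t-test
welch <- t.test(g1, g2, var.equal = FALSE)

welch$method
welch$parameter   # fractional degrees of freedom
```

The fractional degrees of freedom are the visible sign that the Welch correction was applied.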

//QUESTION 3

SECAI<- subset(projectdata[projectdata$Job.Title =="Special Ed Classroom Assist",], select=Annual.Salary19)
SECAII<-subset(projectdata[projectdata$Job.Title =="Special Ed Classroom Assist II",], select=Annual.Salary19)

Checking data for normality:

boxplot(SECAI$Annual.Salary19, SECAII$Annual.Salary19, main = "Annual Salaries of Special Ed Classroom Assist",names = c("SECA I", "SECA II"), col = c("pink","orange"))

hist(SECAI$Annual.Salary19)

hist(SECAII$Annual.Salary19)

describe(SECAI$Annual.Salary19)

##    vars    n     mean     sd median  trimmed     mad   min   max range
## X1    1 1260 37220.45 3784.2  36891 37349.48 2584.17 16619 50781 34162
##     skew kurtosis     se
## X1 -1.39     6.33 106.61

describe(SECAII$Annual.Salary19)

##    vars    n     mean      sd median  trimmed     mad   min   max range
## X1    1 1396 37284.75 3988.16  36446 37449.35 2584.17 17268 45678 28410
##     skew kurtosis     se
## X1 -2.22     9.38 106.74

Outlier treatment for SECAI:

SECAI$Level<-1
boxplot(SECAI$Annual.Salary19)$out

##  [1] 50781 19317 18446 20245 16619 21877 18446 21228 17410 21228 18446
## [12] 18446 20245 17410 16619 21228 21228 18446 19317 21877 18446

outliers_SECAI <- boxplot(SECAI$Annual.Salary19)$out

SECAI.trimmed <- SECAI[-which(SECAI$Annual.Salary19 %in% outliers_SECAI),]
boxplot(SECAI.trimmed$Annual.Salary19)

hist(SECAI.trimmed$Annual.Salary19)

describe(SECAI.trimmed$Annual.Salary19)

##    vars    n     mean     sd median  trimmed     mad   min   max range
## X1    1 1239 37498.36 3019.8  36891 37430.13 2584.17 31720 42456 10736
##    skew kurtosis    se
## X1 0.32    -1.05 85.79

Outlier treatment for SECAII:

SECAII$Level<-2
boxplot(SECAII$Annual.Salary19)$out

##  [1] 45678 18058 18058 21877 21877 18058 18058 19095 21877 19095 19095
## [12] 19095 21877 21877 19095 21877 18223 19095 21877 21877 20894 20894
## [23] 19095 19095 18223 18223 17268 18223 18223 18223 21877 17268 18223
## [34] 18223 18223 21877 21877 18223 21877 18223 18223

outliers_SECAII <- boxplot(SECAII$Annual.Salary19)$out

SECAII.trimmed <- SECAII[-which(SECAII$Annual.Salary19 %in% outliers_SECAII),]
boxplot(SECAII.trimmed$Annual.Salary19)

hist(SECAII.trimmed$Annual.Salary19)

describe(SECAII.trimmed$Annual.Salary19)

##    vars    n     mean      sd median  trimmed     mad   min   max range
## X1    1 1355 37801.71 2589.93  36446 37557.98 2584.17 33018 43754 10736
##    skew kurtosis    se
## X1 0.72    -0.22 70.36

Binding the data frames:

q3<-rbind(SECAI.trimmed,SECAII.trimmed)

Running the t test:

library(lessR)
ttest(Annual.Salary19 ~ Level, data = q3, Ynm = Annual.Salary19, Xnm = Level, X1nm = "1", X2nm = "2", show_title = TRUE)

## 
## Compare Annual.Salary19 across Level levels 2 and 1 
## --------------------------------------------------------------
## 
## 
## ------ Description ------
## 
## Annual.Salary19 for Level 2:  n.miss = 0,  n = 1355,  mean = 37801.714,  sd = 2589.926
## Annual.Salary19 for Level 1:  n.miss = 0,  n = 1239,  mean = 37498.359,  sd = 3019.801
## 
## Sample Mean Difference of Annual.Salary19:  303.355
## 
## Within-group Standard Deviation:   2803.479 
## 
## 
## ------ Assumptions ------
## 
## Note: These hypothesis tests can perform poorly, and the 
##       t-test is typically robust to violations of assumptions. 
##       Use as heuristic guides instead of interpreting literally. 
## 
## Null hypothesis, for each group, is a normal distribution of Annual.Salary19.
## Group 2: Sample mean assumed normal because n>30, so no test needed.
## Group 1: Sample mean assumed normal because n>30, so no test needed.
## 
## Null hypothesis is equal variances of Annual.Salary19, i.e., homogeneous.
## Variance Ratio test:  F = 9119195.450/6707716.417 = 1.360,  df = 1238;1354,  p-value = 0.000
## Levene's test, Brown-Forsythe:  t = -5.130,  df = 2592,  p-value = 0.000
## 
## 
## ------ Inference ------
## 
## --- Assume equal population variances of Annual.Salary19 for each Level 
## 
## t-cutoff: tcut =  1.961 
## Standard Error of Mean Difference: SE =  110.199 
## 
## Hypothesis Test of 0 Mean Diff:  t = 2.753,  df = 2592,  p-value = 0.006
## 
## Margin of Error for 95% Confidence Level:  216.087
## 95% Confidence Interval for Mean Difference:  87.269 to 519.442
## 
## 
## --- Do not assume equal population variances of Annual.Salary19 for each Level 
## 
## t-cutoff: tcut =  1.961 
## Standard Error of Mean Difference: SE =  110.953 
## 
## Hypothesis Test of 0 Mean Diff:  t = 2.734,  df = 2450.003, p-value = 0.006
## 
## Margin of Error for 95% Confidence Level:  217.570
## 95% Confidence Interval for Mean Difference:  85.785 to 520.926
## 
## 
## ------ Effect Size ------
## 
## --- Assume equal population variances of Annual.Salary19 for each Level 
## 
## Standardized Mean Difference of Annual.Salary19, Cohen's d:  0.108
## 
## 
## ------ Practical Importance ------
## 
## Minimum Mean Difference of practical importance: mmd
## Minimum Standardized Mean Difference of practical importance: msmd
## Neither value specified, so no analysis
## 
## 
## ------ Graphics Smoothing Parameter ------
## 
## Density bandwidth for Level 2: 697.576
## Density bandwidth for Level 1: 828.049

Levene’s test gives p < 0.001, so the samples do not satisfy the assumption of homogeneity of variance.
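
The difference here is statistically significant but small in practical terms. As a worked check, the within-group standard deviation and Cohen's d in the output can be reproduced from the printed group summaries alone (the variable names below are placeholders chosen for this sketch):

```r
# Summary statistics copied from the ttest() output above
n2 <- 1355; m2 <- 37801.714; s2 <- 2589.926   # Level 2 (SECA II)
n1 <- 1239; m1 <- 37498.359; s1 <- 3019.801   # Level 1 (SECA I)

# Pooled (within-group) standard deviation
sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))

# Cohen's d: mean difference in units of the pooled SD
d <- (m2 - m1) / sp

round(sp, 3)
round(d, 3)
```

A d of about 0.11 is below Cohen's conventional benchmark of 0.2 for a small effect, so the large sample size is what drives the low p-value.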

Trying to run the test with a log transformation:

SECAI.trimmed$Annual.Salary19<-log(SECAI.trimmed$Annual.Salary19)
SECAII.trimmed$Annual.Salary19<-log(SECAII.trimmed$Annual.Salary19)
hist(SECAI.trimmed$Annual.Salary19)

hist(SECAII.trimmed$Annual.Salary19)

Merging the log transformed data:

q3.1<-rbind(SECAI.trimmed,SECAII.trimmed)

Running the t-test again:

ttest(Annual.Salary19 ~ Level, data = q3.1, Ynm = Annual.Salary19, Xnm = Level, X1nm = "1", X2nm = "2", show_title = TRUE)

## 
## Compare Annual.Salary19 across Level levels 2 and 1 
## --------------------------------------------------------------
## 
## 
## ------ Description ------
## 
## Annual.Salary19 for Level 2:  n.miss = 0,  n = 1355,  mean = 10.537828,  sd = 0.067145
## Annual.Salary19 for Level 1:  n.miss = 0,  n = 1239,  mean = 10.528848,  sd = 0.079890
## 
## Sample Mean Difference of Annual.Salary19:  0.008980
## 
## Within-group Standard Deviation:   0.073508 
## 
## 
## ------ Assumptions ------
## 
## Note: These hypothesis tests can perform poorly, and the 
##       t-test is typically robust to violations of assumptions. 
##       Use as heuristic guides instead of interpreting literally. 
## 
## Null hypothesis, for each group, is a normal distribution of Annual.Salary19.
## Group 2: Sample mean assumed normal because n>30, so no test needed.
## Group 1: Sample mean assumed normal because n>30, so no test needed.
## 
## Null hypothesis is equal variances of Annual.Salary19, i.e., homogeneous.
## Variance Ratio test:  F = 0.006382/0.004508 = 1.415677,  df = 1238;1354,  p-value = 0.000
## Levene's test, Brown-Forsythe:  t = -5.699,  df = 2592,  p-value = 0.000
## 
## 
## ------ Inference ------
## 
## --- Assume equal population variances of Annual.Salary19 for each Level 
## 
## t-cutoff: tcut =  1.961 
## Standard Error of Mean Difference: SE =  0.002889 
## 
## Hypothesis Test of 0 Mean Diff:  t = 3.108,  df = 2592,  p-value = 0.002
## 
## Margin of Error for 95% Confidence Level:  0.005666
## 95% Confidence Interval for Mean Difference:  0.003314 to 0.014646
## 
## 
## --- Do not assume equal population variances of Annual.Salary19 for each Level 
## 
## t-cutoff: tcut =  1.961 
## Standard Error of Mean Difference: SE =  0.002912 
## 
## Hypothesis Test of 0 Mean Diff:  t = 3.084,  df = 2427.696, p-value = 0.002
## 
## Margin of Error for 95% Confidence Level:  0.005710
## 95% Confidence Interval for Mean Difference:  0.003270 to 0.014690
## 
## 
## ------ Effect Size ------
## 
## --- Assume equal population variances of Annual.Salary19 for each Level 
## 
## Standardized Mean Difference of Annual.Salary19, Cohen's d:  0.122162
## 
## 
## ------ Practical Importance ------
## 
## Minimum Mean Difference of practical importance: mmd
## Minimum Standardized Mean Difference of practical importance: msmd
## Neither value specified, so no analysis
## 
## 
## ------ Graphics Smoothing Parameter ------
## 
## Density bandwidth for Level 2: 0.018085
## Density bandwidth for Level 1: 0.021906

The log transformation does not improve homogeneity of variance either (Levene’s test still gives p < 0.001).

//QUESTION 4

LATT<- subset(projectdata[projectdata$Job.Title =="Lunchroom Attendant",], select=Annual.Salary19)
CUSTW<-subset(projectdata[projectdata$Job.Title =="Custodial Worker",], select=Annual.Salary19)

Checking data for normality:

boxplot(LATT$Annual.Salary19, CUSTW$Annual.Salary19, main = "Annual Salaries of Lunchroom Attendants and Custodial Workers",names = c("Lunchroom Attendant", "Custodial Worker"), col = c("orange","red"))

hist(LATT$Annual.Salary19)

hist(CUSTW$Annual.Salary19)

describe(LATT$Annual.Salary19)

##    vars   n     mean      sd median  trimmed     mad   min   max range
## X1    1 690 18189.78 2977.94  17293 18324.36 4272.85 10993 23057 12064
##     skew kurtosis     se
## X1 -0.44    -0.06 113.37

describe(CUSTW$Annual.Salary19)

##    vars   n     mean     sd median  trimmed mad   min   max range skew
## X1    1 547 35362.15 4336.2  35537 34845.72   0 28323 50509 22186 0.91
##    kurtosis    se
## X1     0.91 185.4

Outlier treatment for LATT Annual Salary 19:

LATT$Level<-1
boxplot(LATT$Annual.Salary19)$out

##  [1] 11529 11529 11529 11529 11529 11529 11529 11529 11529 11529 11529
## [12] 11529 11529 11529 11529 11529 11529 11529 11529 11529 11529 10993
## [23] 11529 11529 11529 11529 11529 11529 11529 11529 11529 11529 11529
## [34] 10993 11529 11529 11529 11529 11529 11529 11529 11529 11529 11529
## [45] 11529 11529 11529 11529 11529

outliers_LATT <- boxplot(LATT$Annual.Salary19)$out

LATT.trimmed <- LATT[-which(LATT$Annual.Salary19 %in% outliers_LATT),]
boxplot(LATT.trimmed$Annual.Salary19)

hist(LATT.trimmed$Annual.Salary19)

describe(LATT.trimmed$Annual.Salary19)

##    vars   n     mean      sd median  trimmed     mad   min   max range
## X1    1 641 18700.62 2421.94  18734 18690.72 2136.43 12970 23057 10087
##    skew kurtosis    se
## X1 0.08    -0.54 95.66

library(rcompanion)

## Registered S3 method overwritten by 'DescTools':
##   method        from       
##   print.palette wesanderson

## 
## Attaching package: 'rcompanion'

## The following object is masked from 'package:psych':
## 
##     phi

plotNormalHistogram(LATT.trimmed$Annual.Salary19)

Outlier treatment for CUSTW Annual Salary 19:

CUSTW$Level<-2
boxplot(CUSTW$Annual.Salary19)$out

##  [1] 45625 44963 45625 45625 45625 45625 45625 44331 45625 45625 44963
## [12] 45625 45625 45625 45625 45625 45625 45625 43048 45625 45625 45625
## [23] 45625 45625 45625 44963 44331 45625 45625 45625 45625 45625 45625
## [34] 45625 45625 45625 44331 45625 45625 44963 45625 45625 45625 43048
## [45] 45625 50509 45625 45625 44963 45625 45625 44331 44319 43048 45625
## [56] 45625 45625 44331 45625

outliers_CUSTW <- boxplot(CUSTW$Annual.Salary19)$out

CUSTW.trimmed <- CUSTW[-which(CUSTW$Annual.Salary19 %in% outliers_CUSTW),]
boxplot(CUSTW.trimmed$Annual.Salary19)

hist(CUSTW.trimmed$Annual.Salary19)

describe(CUSTW.trimmed$Annual.Salary19)

##    vars   n    mean      sd median  trimmed mad   min   max range  skew
## X1    1 488 34149.9 2705.54  35537 34335.45   0 28323 38658 10335 -0.69
##    kurtosis     se
## X1    -0.77 122.47

plotNormalHistogram(CUSTW.trimmed$Annual.Salary19)

Binding the dataframes:

q4<-rbind(LATT.trimmed,CUSTW.trimmed)

Running the t test:

ttest(Annual.Salary19 ~ Level, data = q4, Ynm = Annual.Salary19, Xnm = Level, X1nm = "1", X2nm = "2", show_title = TRUE)

## 
## Compare Annual.Salary19 across Level levels 2 and 1 
## --------------------------------------------------------------
## 
## 
## ------ Description ------
## 
## Annual.Salary19 for Level 2:  n.miss = 0,  n = 488,  mean = 34149.902,  sd = 2705.538
## Annual.Salary19 for Level 1:  n.miss = 0,  n = 641,  mean = 18700.621,  sd = 2421.935
## 
## Sample Mean Difference of Annual.Salary19:  15449.281
## 
## Within-group Standard Deviation:   2548.361 
## 
## 
## ------ Assumptions ------
## 
## Note: These hypothesis tests can perform poorly, and the 
##       t-test is typically robust to violations of assumptions. 
##       Use as heuristic guides instead of interpreting literally. 
## 
## Null hypothesis, for each group, is a normal distribution of Annual.Salary19.
## Group 2: Sample mean assumed normal because n>30, so no test needed.
## Group 1: Sample mean assumed normal because n>30, so no test needed.
## 
## Null hypothesis is equal variances of Annual.Salary19, i.e., homogeneous.
## Variance Ratio test:  F = 7319933.514/5865770.339 = 1.248,  df = 487;640,  p-value = 0.009
## Levene's test, Brown-Forsythe:  t = -2.060,  df = 1127,  p-value = 0.040
## 
## 
## ------ Inference ------
## 
## --- Assume equal population variances of Annual.Salary19 for each Level 
## 
## t-cutoff: tcut =  1.962 
## Standard Error of Mean Difference: SE =  153.098 
## 
## Hypothesis Test of 0 Mean Diff:  t = 100.911,  df = 1127,  p-value = 0.000
## 
## Margin of Error for 95% Confidence Level:  300.389
## 95% Confidence Interval for Mean Difference:  15148.892 to 15749.670
## 
## 
## --- Do not assume equal population variances of Annual.Salary19 for each Level 
## 
## t-cutoff: tcut =  1.962 
## Standard Error of Mean Difference: SE =  155.405 
## 
## Hypothesis Test of 0 Mean Diff:  t = 99.413,  df = 983.832, p-value = 0.000
## 
## Margin of Error for 95% Confidence Level:  304.964
## 95% Confidence Interval for Mean Difference:  15144.317 to 15754.245
## 
## 
## ------ Effect Size ------
## 
## --- Assume equal population variances of Annual.Salary19 for each Level 
## 
## Standardized Mean Difference of Annual.Salary19, Cohen's d:  6.062
## 
## 
## ------ Practical Importance ------
## 
## Minimum Mean Difference of practical importance: mmd
## Minimum Standardized Mean Difference of practical importance: msmd
## Neither value specified, so no analysis
## 
## 
## ------ Graphics Smoothing Parameter ------
## 
## Density bandwidth for Level 2: 891.789
## Density bandwidth for Level 1: 757.639

Levene’s test gives p = 0.040, so the samples do not satisfy the assumption of homogeneity of variance. However, the minimum salary in CUSTW (28323) is higher than the maximum salary in LATT (23057), so the two groups do not overlap at all and a t-test is not meaningful here.
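
The non-overlap claim can be checked directly from the ranges reported by `describe()` for the trimmed groups. A minimal sketch, with the values typed in by hand rather than read from the data frames:

```r
# Ranges copied from the describe() output for the trimmed groups
latt_range  <- c(min = 12970, max = 23057)   # Lunchroom Attendant
custw_range <- c(min = 28323, max = 38658)   # Custodial Worker

# TRUE when every custodial salary exceeds every lunchroom salary
no_overlap <- custw_range[["min"]] > latt_range[["max"]]

# Gap between the two groups, in salary units
gap <- custw_range[["min"]] - latt_range[["max"]]

no_overlap
gap
```

With the actual data frames, the same check would be `min(CUSTW.trimmed$Annual.Salary19) > max(LATT.trimmed$Annual.Salary19)`.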

//QUESTION 5

Subsetting the annual salaries of Regular Teachers in years 18 and 19:

REGT19<-subset(projectdata[projectdata$Job.Title =="Regular Teacher",], select=Annual.Salary19)
REGT18<-subset(projectdata[projectdata$Job.Title =="Regular Teacher",], select=Annual.Salary18)

Checking the distributions:

boxplot(REGT18$Annual.Salary18, REGT19$Annual.Salary19, main = "Annual Salaries of Regular Teachers in 2018-2019",names = c("Annual Salary 18", "Annual Salary 19"), col = c("yellow","orange"))

library(psych)
describe(REGT18$Annual.Salary18)

##    vars     n     mean       sd median  trimmed      mad   min    max
## X1    1 10032 78828.25 15398.88  84804 79642.99 15205.55 11333 147850
##     range  skew kurtosis     se
## X1 136517 -0.55    -0.47 153.74

describe(REGT19$Annual.Salary19)

##    vars     n     mean       sd median  trimmed      mad   min    max
## X1    1 10032 80068.26 14752.45  85613 81001.69 13595.44 11533 149329
##     range  skew kurtosis     se
## X1 137796 -0.65    -0.14 147.29

hist(REGT18$Annual.Salary18)

hist(REGT19$Annual.Salary19)

library(PairedData)

## Loading required package: gld

## Loading required package: mvtnorm

## Loading required package: lattice

## 
## Attaching package: 'lattice'

## The following object is masked from 'package:boot':
## 
##     melanoma

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

## 
## Attaching package: 'PairedData'

## The following object is masked from 'package:base':
## 
##     summary

pd <- paired(REGT18, REGT19)
plot(pd, type = "profile") + theme_bw()

Histogram of differences:

D.REGT.18.19 = REGT18$Annual.Salary18 - REGT19$Annual.Salary19
describe(D.REGT.18.19)

##    vars     n     mean      sd median  trimmed     mad    min   max range
## X1    1 10032 -1240.01 1649.95  -1215 -1118.69 1801.36 -34814 48346 83160
##    skew kurtosis    se
## X1 4.77   173.17 16.47

hist(D.REGT.18.19,   
     col="gray", 
     main="Histogram of differences between 2018-2019 Annual Salaries of Regular Teachers",
     xlab="Difference")

The differences do not have a normal distribution (skew = 4.77, kurtosis = 173.17), but the sample size is large (n = 10032, far above 30).

Running the paired t-test

t.test(REGT18$Annual.Salary18, REGT19$Annual.Salary19, paired = TRUE, alternative = "two.sided")

## 
##  Paired t-test
## 
## data:  REGT18$Annual.Salary18 and REGT19$Annual.Salary19
## t = -75.275, df = 10031, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1272.304 -1207.723
## sample estimates:
## mean of the differences 
##               -1240.013

The annual salary of regular teachers in 2019 (M = 80068.26, SD = 14752.45) is significantly different from their annual salary in 2018 (M = 78828.25, SD = 15398.88), t(10031) = -75.275, p < 2.2e-16.
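
The reason the normality check is done on the differences rather than on each year's salaries is that a paired t-test is the same thing as a one-sample t-test on the differences. A short sketch on simulated data (the `before` and `after` vectors are made up for illustration):

```r
set.seed(7)
# Simulated before/after salaries (not project data)
before <- rnorm(50, mean = 78000, sd = 15000)
after  <- before + rnorm(50, mean = 1200, sd = 1600)

paired_res <- t.test(before, after, paired = TRUE)
diff_res   <- t.test(before - after, mu = 0)

# Both calls give the same t statistic and degrees of freedom
unname(paired_res$statistic)
unname(diff_res$statistic)
```

This also explains why the degrees of freedom are the number of pairs minus one, not the total number of observations.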

//QUESTION 6

PRIN19<-subset(projectdata[projectdata$Job.Title =="Principal",], select=Annual.Salary19)
PRIN18<-subset(projectdata[projectdata$Job.Title =="Principal",], select=Annual.Salary18)

Comparing the distributions:

boxplot(PRIN18$Annual.Salary18, PRIN19$Annual.Salary19, main = "Annual Salaries of Principals in 2018-2019",names = c("Annual Salary 18", "Annual Salary 19"), col = c("orange","red"))

describe(PRIN18$Annual.Salary18)

##    vars   n     mean      sd median  trimmed      mad    min    max range
## X1    1 377 145224.8 9822.51 146657 145356.2 10873.39 125000 167417 42417
##     skew kurtosis     se
## X1 -0.14     -0.7 505.88

describe(PRIN19$Annual.Salary19)

##    vars   n     mean      sd median trimmed      mad    min    max range
## X1    1 377 147641.2 8947.85 149329  147738 10704.37 128750 167417 38667
##     skew kurtosis     se
## X1 -0.13    -0.62 460.84

hist(PRIN18$Annual.Salary18)

hist(PRIN19$Annual.Salary19)

Histogram of Differences:

D.PRIN.18.19 =PRIN18$Annual.Salary18 - PRIN19$Annual.Salary19
describe(D.PRIN.18.19)

##    vars   n     mean      sd median  trimmed     mad   min max range skew
## X1    1 377 -2416.34 1163.87  -2732 -2428.19 1742.05 -7499   0  7499 -0.2
##    kurtosis    se
## X1     0.49 59.94

hist(D.PRIN.18.19,   
     col="gray", 
     main="Histogram of differences between 2018-2019 Annual Salaries of Principals",
     xlab="Difference")

The differences are approximately normally distributed (skew = -0.2, kurtosis = 0.49). Running the paired t-test:

t.test(PRIN18$Annual.Salary18, PRIN19$Annual.Salary19, paired = TRUE, alternative = "two.sided")

## 
##  Paired t-test
## 
## data:  PRIN18$Annual.Salary18 and PRIN19$Annual.Salary19
## t = -40.311, df = 376, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2534.201 -2298.473
## sample estimates:
## mean of the differences 
##               -2416.337

The annual salary of principals in 2019 (M = 147641.2, SD = 8947.85) is significantly different from their annual salary in 2018 (M = 145224.8, SD = 9822.51), t(376) = -40.311, p < 2.2e-16.
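
A note on the p-value notation: R prints very small p-values as `< 2.2e-16`, which is scientific notation for 2.2 times 10 to the power of -16, not 2.2 raised to the power of -16. The sketch below shows how different the two numbers are, and where the threshold comes from.

```r
reported <- 2.2e-16        # what R prints: 2.2 times 10 to the power -16
power_of <- 2.2^-16        # a different, much larger number

# R's printing threshold is the machine epsilon for doubles
eps <- .Machine$double.eps

format(reported, scientific = TRUE)
format(power_of, scientific = TRUE)
format(eps, scientific = TRUE)
```

In a write-up, reporting such a result simply as p < 0.001 is a common and safe alternative.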

//QUESTION 7

ASSTPRIN19<-subset(projectdata[projectdata$Job.Title =="Assistant Principal",], select=Annual.Salary19)
ASSTPRIN18<-subset(projectdata[projectdata$Job.Title =="Assistant Principal",], select=Annual.Salary18)

Comparing the distributions:

boxplot(ASSTPRIN18$Annual.Salary18, ASSTPRIN19$Annual.Salary19, main = "Annual Salaries of Assistant Principals in 2018-2019",names = c("Annual Salary 18", "Annual Salary 19"), col = c("green","blue"))

describe(ASSTPRIN18$Annual.Salary18)

##    vars   n     mean      sd median  trimmed      mad   min    max range
## X1    1 413 113710.9 8932.19 114802 113451.2 10418.23 61524 137409 75885
##    skew kurtosis     se
## X1 -0.7     4.23 439.52

describe(ASSTPRIN19$Annual.Salary19)

##    vars   n   mean      sd median  trimmed     mad   min    max range
## X1    1 413 116027 8310.19 117098 115925.2 8819.99 62139 137409 75270
##     skew kurtosis     se
## X1 -1.14     6.95 408.92

hist(ASSTPRIN18$Annual.Salary18)

hist(ASSTPRIN19$Annual.Salary19)

Histogram of Differences:

D.ASSTPRIN.18.19 <- ASSTPRIN18$Annual.Salary18 - ASSTPRIN19$Annual.Salary19
describe(D.ASSTPRIN.18.19)

##    vars   n     mean     sd median  trimmed     mad    min  max range
## X1    1 413 -2316.09 1240.1  -2342 -2420.84 1108.98 -15739 8712 24451
##     skew kurtosis    se
## X1 -1.06    46.49 61.02

hist(D.ASSTPRIN.18.19,   
     col="gray", 
     main="Histogram of differences between 2018-2019 Annual Salaries of Assistant Principals",
     xlab="Difference")

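The differences do not follow a normal distribution (skew = -1.06, kurtosis = 46.49). Because the sample is large (n = 413), the paired t-test is still reasonably robust to this departure, so it is run anyway: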

t.test(ASSTPRIN18$Annual.Salary18, ASSTPRIN19$Annual.Salary19, paired = TRUE, alternative = "two.sided")

## 
##  Paired t-test
## 
## data:  ASSTPRIN18$Annual.Salary18 and ASSTPRIN19$Annual.Salary19
## t = -37.955, df = 412, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2436.039 -2196.136
## sample estimates:
## mean of the differences 
##               -2316.087

The annual salary of assistant principals in 2019 (M = 116027, SD = 8310.19) is significantly higher than their annual salary in 2018 (M = 113710.9, SD = 8932.19), t(412) = -37.955, p < 2.2e-16.
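Since the differences for assistant principals are heavy-tailed, a non-parametric check such as the Wilcoxon signed-rank test could be run alongside the t-test. The sketch below is not part of the original analysis. It uses simulated salaries (the project CSV is not available here) whose means loosely mimic the values reported above, and the names salary18, salary19 and d_sim are hypothetical.

```r
set.seed(42)

# Simulated data, NOT the project data: 413 paired salaries with
# heavy-tailed differences centred near the reported mean difference
n        <- 413
salary18 <- rnorm(n, mean = 113711, sd = 8932)
d_sim    <- -2316 + 700 * rt(n, df = 3)   # differences, 2018 minus 2019
salary19 <- salary18 - d_sim

# Wilcoxon signed-rank test on the paired values; conf.int = TRUE also
# returns the (pseudo)median of the differences with a confidence interval
res <- wilcox.test(salary18, salary19,
                   paired = TRUE,
                   alternative = "two.sided",
                   conf.int = TRUE)

res$p.value    # two-sided p-value
res$estimate   # estimated (pseudo)median difference
res$conf.int   # 95% confidence interval for that estimate
```

With the real data, the same call would use ASSTPRIN18$Annual.Salary18 and ASSTPRIN19$Annual.Salary19. If both tests agree, the conclusion does not depend on the normality assumption.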

/////////QUESTION 8//////

Grouping the variables:

PRIN.19<-subset(projectdata[projectdata$Job.Title =="Principal",], select=Annual.Salary19)
ASSTPRIN.19<-subset(projectdata[projectdata$Job.Title =="Assistant Principal",], select=Annual.Salary19)
REGT.19<-subset(projectdata[projectdata$Job.Title =="Regular Teacher",], select=Annual.Salary19)
SET.19<-subset(projectdata[projectdata$Job.Title =="Special Education Teacher",], select=Annual.Salary19)
CUSTW.19<-subset(projectdata[projectdata$Job.Title =="Custodial Worker",], select=Annual.Salary19)

Assigning Levels to groups:

PRIN.19$Level<-"1"
ASSTPRIN.19$Level<-"2"
REGT.19$Level<-"3"
SET.19$Level<-"4"
CUSTW.19$Level<-"5"

Merging the groups:

q8<-rbind(PRIN.19,ASSTPRIN.19,REGT.19,SET.19,CUSTW.19)

Checking the data:

levels(q8$Level)

## NULL

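levels() returns NULL because Level was assigned as a character string, so the column is a character vector rather than a factor.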
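levels() only reports levels for factors, and Level was assigned as quoted strings. A minimal sketch of one way to get labelled levels is to convert the column with factor(). The small data frame toy below is made up for illustration; the labels follow the job titles used when the groups were built.

```r
# Toy stand-in for q8 (illustrative values only)
toy <- data.frame(
  Annual.Salary19 = c(150000, 116000, 80000, 79000, 35000),
  Level = c("1", "2", "3", "4", "5"),
  stringsAsFactors = FALSE
)

before <- levels(toy$Level)   # NULL: a character column has no levels

# Convert to a factor, mapping each code to its job title
toy$Level <- factor(
  toy$Level,
  levels = c("1", "2", "3", "4", "5"),
  labels = c("Principal", "Assistant Principal", "Regular Teacher",
             "Special Education Teacher", "Custodial Worker")
)

after <- levels(toy$Level)    # the five job titles, in the order given
after
```

The later steps still work on the character column because aov() converts character predictors to factors when it builds the model frame. An explicit factor mainly makes plots and the Tukey output easier to read.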

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:MASS':
## 
##     select

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

dplyr::sample_n(q8, 10)

##    Annual.Salary19 Level
## 1            58705     4
## 2            66237     3
## 3            91894     3
## 4            88390     3
## 5            79878     4
## 6            67432     3
## 7            88580     3
## 8            82630     3
## 9            96039     3
## 10           90307     4

Summary statistics by group:

library(dplyr)
group_by(q8, Level) %>%
  summarise(
    count = n(),
    mean = mean(Annual.Salary19, na.rm = TRUE),
    sd = sd(Annual.Salary19, na.rm = TRUE)
  )

## # A tibble: 5 x 4
##   Level count    mean     sd
##   <chr> <int>   <dbl>  <dbl>
## 1 1       377 147641.  8948.
## 2 2       413 116027.  8310.
## 3 3     10032  80068. 14752.
## 4 4      2756  79523. 14704.
## 5 5       547  35362.  4336.

Visualizing the data:

library(ggpubr)

## Loading required package: magrittr

ggboxplot(q8, x = "Level", y = "Annual.Salary19", 
          color = "Level", palette = c("#00AFBB", "#E7B800", "#FC4E07", "#009999", "#4DB3E6"),
          order = c("1", "2", "3", "4", "5"),
          ylab = "Annual Salary 19", xlab = "Job Title")

Mean plot:

ggline(q8, x = "Level", y = "Annual.Salary19", 
       add = c("mean_se", "jitter"), 
       order = c("1", "2", "3","4","5"),
       ylab = "Annual Salary 19", xlab = "Job Title")

Running the one-way ANOVA:

res.aov <- aov(Annual.Salary19 ~ Level, data = q8)
summary(res.aov)

##                Df        Sum Sq      Mean Sq F value              Pr(>F)
## Level           4 3334900909600 833725227400    4134 <0.0000000000000002
## Residuals   14120 2847547864601    201667696
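The F value in this table is the ratio of the two mean squares, and the sums of squares also give an effect size (eta squared, the share of total variance explained by Level). The sketch below recomputes both from the printed table. The variable names are illustrative, and eta squared is an addition that the original analysis does not report.

```r
# Values copied from the summary(res.aov) table above
ss_level <- 3334900909600   # between-group sum of squares
ss_resid <- 2847547864601   # residual (within-group) sum of squares
df_level <- 4
df_resid <- 14120

ms_level <- ss_level / df_level   # mean square for Level
ms_resid <- ss_resid / df_resid   # residual mean square
f_value  <- ms_level / ms_resid   # F statistic

# Eta squared: proportion of total variance attributable to Level
eta_sq <- ss_level / (ss_level + ss_resid)

round(c(F = f_value, eta_squared = eta_sq), 3)
```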

Since p < 2e-16, F(4, 14120) = 4134, at least one group mean differs significantly from the others. Running Tukey’s HSD to see which pairs differ:

TukeyHSD(res.aov)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Annual.Salary19 ~ Level, data = q8)
## 
## $Level
##             diff         lwr          upr     p adj
## 2-1  -31614.1852  -34373.814  -28854.5567 0.0000000
## 3-1  -67572.9166  -69605.381  -65540.4527 0.0000000
## 4-1  -68118.5683  -70245.986  -65991.1509 0.0000000
## 5-1 -112279.0341 -114872.344 -109685.7242 0.0000000
## 3-2  -35958.7314  -37903.949  -34013.5137 0.0000000
## 4-2  -36504.3830  -38548.611  -34460.1554 0.0000000
## 5-2  -80664.8489  -83190.362  -78139.3354 0.0000000
## 4-3    -545.6516   -1378.854     287.5512 0.3813911
## 5-3  -44706.1175  -46407.170  -43005.0652 0.0000000
## 5-4  -44160.4659  -45973.908  -42347.0234 0.0000000

All the pairwise comparisons are significant except 4-3 (adjusted p = 0.38), which compares the Special Education Teachers (4) with the Regular Teachers (3).
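The group standard deviations in the summary table range from about 4,336 to 14,752, so the equal-variance assumption behind the classical one-way ANOVA may not hold. One possible sensitivity check is Welch's ANOVA, which does not assume equal variances. The sketch below is not part of the original analysis. It runs on simulated data built from the rounded group counts, means and standard deviations reported above, and the names sim and welch are hypothetical.

```r
set.seed(1)

# Rounded group sizes, means and SDs taken from the summary table
group_n    <- c(377, 413, 10032, 2756, 547)
group_mean <- c(147641, 116027, 80068, 79523, 35362)
group_sd   <- c(8948, 8310, 14752, 14704, 4336)

# Simulated stand-in for q8 (NOT the project data)
sim <- data.frame(
  Level = factor(rep(as.character(1:5), times = group_n)),
  Annual.Salary19 = rnorm(sum(group_n),
                          mean = rep(group_mean, times = group_n),
                          sd   = rep(group_sd,   times = group_n))
)

# Compare the spread across groups
tapply(sim$Annual.Salary19, sim$Level, sd)

# Welch's one-way ANOVA (var.equal = FALSE drops the equal-variance assumption)
welch <- oneway.test(Annual.Salary19 ~ Level, data = sim, var.equal = FALSE)
welch
```

On the real data, the call would be oneway.test(Annual.Salary19 ~ Level, data = q8, var.equal = FALSE). If it agrees with aov(), the overall conclusion does not depend on the equal-variance assumption.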

Lab Project 3.2

azra

11/9/2019

R Markdown