Load Libraries:
library(dplyr) # for manipulating data
library(ggplot2) # for making graphs
library(knitr) # for nicer table formatting
library(summarytools) # for frequency distribution tables
Set your working directory, where the folder “Datasets” is located:
setwd("C:/Users/chenk/OneDrive/Documents/Spring 2020/PMAP 4041/Computer Assignments")
We are going to work with a new data set: a random sample of 1,000 federal personnel records for March 1994. Unlike the previous data set, these are not responses to questionnaires. Instead, they include the sort of information the government keeps in its personnel files: grade, salary, occupation, supervisory status, education, age, years of federal experience, sex, race, etc.
load("C:/Users/chenk/OneDrive/Documents/Spring 2020/PMAP 4041/Datasets/Class3set/OPM94.RData")
See a full listing of the variables using the names(dataset_name) command:
names(opm94)
## [1] "x" "sal" "grade" "patco" "major" "age"
## [7] "male" "vet" "handvet" "hand" "yos" "edyrs"
## [13] "promo" "exit" "supmgr" "race" "minority" "grade4"
## [19] "promo01" "supmgr01" "male01" "exit01"
1.1 First, let's work temporarily with American Indian females only. This step subsets the data set using the conditions race == "American Indian" and male == "female":
opm94AIF <- opm94 %>% filter(race == "American Indian", male == "female") # subset data
opm94AIF %>% pander::pander(split.table = Inf) # print the resulting dataset nicely formatted
| x | sal | grade | patco | major | age | male | vet | handvet | hand | yos | edyrs | promo | exit | supmgr | race | minority | grade4 | promo01 | supmgr01 | male01 | exit01 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 256 | 49401 | 13 | Administrative | | 42 | female | no | no | no | 14 | 13 | no | no | yes | American Indian | 1 | grades 13 to 16 | 0 | 1 | 0 | 0 |
| 257 | 25672 | 5 | Technical | | 31 | female | no | no | no | 6 | 13 | no | no | no | American Indian | 1 | grades 5 to 8 | 0 | 0 | 0 | 0 |
| 258 | 23316 | 5 | Clerical | | 46 | female | no | no | no | 16 | 12 | no | no | no | American Indian | 1 | grades 5 to 8 | 0 | 0 | 0 | 0 |
| 259 | 45697 | 12 | Administrative | | 53 | female | no | no | yes | 23 | 15 | no | no | yes | American Indian | 1 | grades 9 to 12 | 0 | 1 | 0 | 0 |
| 260 | 45383 | 9 | Professional | | 57 | female | no | no | no | 36 | 12 | no | no | no | American Indian | 1 | grades 9 to 12 | 0 | 0 | 0 | 0 |
| 261 | 24576 | 5 | Technical | | 62 | female | no | no | no | 38 | 10 | no | no | no | American Indian | 1 | grades 5 to 8 | 0 | 0 | 0 | 0 |
| 262 | 20166 | 5 | Clerical | | 33 | female | no | no | no | 6 | 13 | no | no | no | American Indian | 1 | grades 5 to 8 | 0 | 0 | 0 | 0 |
| 263 | 42751 | 11 | Professional | PUBAF | 43 | female | no | no | no | 16 | 18 | no | no | no | American Indian | 1 | grades 9 to 12 | 0 | 0 | 0 | 0 |
| 264 | 24585 | 6 | Administrative | | 53 | female | no | no | no | 18 | 15 | yes | no | no | American Indian | 1 | grades 5 to 8 | 1 | 0 | 0 | 0 |
| 265 | 20796 | 5 | Technical | | 32 | female | no | no | no | 10 | 13 | no | no | no | American Indian | 1 | grades 5 to 8 | 0 | 0 | 0 | 0 |
Let’s print out the individual values of the variables age, edyrs, grade, promo01, and supmgr01 in the subset, so that you can calculate the statistics manually. We’ll also have the computer calculate the same statistics so that you can check your answers.
Individual values:
opm94AIF <- opm94AIF %>% select("age", "edyrs", "grade", "promo01", "supmgr01")
opm94AIF
## age edyrs grade promo01 supmgr01
## 1 42 13 13 0 1
## 2 31 13 5 0 0
## 3 46 12 5 0 0
## 4 53 15 12 0 1
## 5 57 12 9 0 0
## 6 62 10 5 0 0
## 7 33 13 5 0 0
## 8 43 18 11 0 0
## 9 53 15 6 1 0
## 10 32 13 5 0 0
Descriptive statistics for the same variables (three different commands/packages to choose from):
Using summary() from base package:
opm94AIF %>% select("age", "edyrs", "grade", "promo01", "supmgr01") %>% summary()
## age edyrs grade promo01 supmgr01
## Min. :31.00 Min. :10.00 Min. : 5.0 Min. :0.0 Min. :0.0
## 1st Qu.:35.25 1st Qu.:12.25 1st Qu.: 5.0 1st Qu.:0.0 1st Qu.:0.0
## Median :44.50 Median :13.00 Median : 5.5 Median :0.0 Median :0.0
## Mean :45.20 Mean :13.40 Mean : 7.6 Mean :0.1 Mean :0.2
## 3rd Qu.:53.00 3rd Qu.:14.50 3rd Qu.:10.5 3rd Qu.:0.0 3rd Qu.:0.0
## Max. :62.00 Max. :18.00 Max. :13.0 Max. :1.0 Max. :1.0
Using descr() from descr package:
descr::descr(opm94AIF)
##
## age
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 31.00 35.25 44.50 45.20 53.00 62.00
##
## edyrs
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.00 12.25 13.00 13.40 14.50 18.00
##
## grade
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.0 5.0 5.5 7.6 10.5 13.0
##
## promo01
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 0.1 0.0 1.0
##
## supmgr01
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 0.2 0.0 1.0
Using descr() from summarytools package:
summarytools::descr(opm94AIF)
## Descriptive Statistics
## opm94AIF
## N: 10
##
## age edyrs grade promo01 supmgr01
## ----------------- -------- -------- -------- --------- ----------
## Mean 45.20 13.40 7.60 0.10 0.20
## Std.Dev 10.97 2.17 3.31 0.32 0.42
## Min 31.00 10.00 5.00 0.00 0.00
## Q1 33.00 12.00 5.00 0.00 0.00
## Median 44.50 13.00 5.50 0.00 0.00
## Q3 53.00 15.00 11.00 0.00 0.00
## Max 62.00 18.00 13.00 1.00 1.00
## MAD 14.83 1.48 0.74 0.00 0.00
## IQR 17.75 2.25 5.50 0.00 0.00
## CV 0.24 0.16 0.44 3.16 2.11
## Skewness 0.02 0.59 0.53 2.28 1.28
## SE.Skewness 0.69 0.69 0.69 0.69 0.69
## Kurtosis -1.62 -0.29 -1.66 3.57 -0.37
## N.Valid 10.00 10.00 10.00 10.00 10.00
## Pct.Valid 100.00 100.00 100.00 100.00 100.00
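The quartile rows differ between the outputs (for age, the 1st quartile is 35.25 in the first two tables but 33.00 in the third) even though the data are identical. A likely explanation, stated here as an assumption about package internals rather than something the output confirms, is that the functions use different quantile definitions. The sketch below shows that base R's default definition (type = 7) reproduces the first value and that type = 2 reproduces the second:

```r
# Ages of the ten American Indian women, sorted for readability
age <- c(31, 32, 33, 42, 43, 46, 53, 53, 57, 62)

# type = 7 interpolates linearly between neighbouring order statistics
q1_type7 <- unname(quantile(age, probs = 0.25, type = 7))

# type = 2 does not interpolate; it only averages at discontinuities
q1_type2 <- unname(quantile(age, probs = 0.25, type = 2))

q1_type7   # 35.25, the value shown by summary() and descr::descr()
q1_type2   # 33, the value in the Q1 row of summarytools::descr()
IQR(age)   # 17.75, computed with the default type 7
```

The IQR row of the third table (17.75) matches the type 7 quartiles, so its Q1/Q3 rows and its IQR row appear to follow different conventions. This is worth keeping in mind when checking hand calculations against the output.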
QUESTION 1.1: Which of the three outputs for descriptive statistics do you find the most useful? Explain
The third output is the most readable, and it is efficient in terms of how little code it takes to call. It is easier to read than the other two because the variables are displayed as columns with one row per descriptive statistic, whereas the first output repeats the statistic labels for each variable. The third output also includes skewness, which is not present in the other outputs.
1.2 Using the raw data above, let’s compute (as appropriate) the mode, median, mean, range, variance, and standard deviation for variable age (opm94AIF$age: 42 31 46 53 57 62 33 43 53 32 ) listed for American Indian females:
age <- opm94AIF$age # save the values in a new variable with the name `age` for less typing
table(c(42, 31, 46, 53, 57, 62, 33, 43, 53, 32)) # figure out the mode from the table or use which.max()
##
## 31 32 33 42 43 46 53 57 62
## 1 1 1 1 1 1 2 1 1
which.max(table(c(42, 31, 46, 53, 57, 62, 33, 43, 53, 32)))
## 53
## 7
sort(c(42, 31, 46, 53, 57, 62, 33, 43, 53, 32)) # find the median from the ordered vector or use R function median()
## [1] 31 32 33 42 43 46 53 53 57 62
median(opm94AIF$age)
## [1] 44.5
(42+31+46+53+57+62+33+43+53+32)/10 # or
## [1] 45.2
sum(opm94AIF$age)/length(opm94AIF$age)
## [1] 45.2
mean(opm94AIF$age)
## [1] 45.2
sort(c(42, 31, 46, 53, 57, 62, 33, 43, 53, 32)) # or
## [1] 31 32 33 42 43 46 53 53 57 62
range(opm94AIF$age)
## [1] 31 62
age
## [1] 42 31 46 53 57 62 33 43 53 32
age - mean(age)
## [1] -3.2 -14.2 0.8 7.8 11.8 16.8 -12.2 -2.2 7.8 -13.2
(age - mean(age))^2
## [1] 10.24 201.64 0.64 60.84 139.24 282.24 148.84 4.84 60.84 174.24
sum((age - mean(age))^2)/(10-1)
## [1] 120.4
var(age)
## [1] 120.4
sqrt(sum((age - mean(age))^2)/(10-1) )
## [1] 10.97269
sd(age)
## [1] 10.97269
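The steps above can be gathered into one short script. This is only a sketch: `all_modes` is a hypothetical helper (not part of base R or of the assignment) that returns every value tied for the highest frequency, which `which.max()` alone would not do when there is a tie.

```r
# Ages of the ten American Indian women, copied from the listing above
age <- c(42, 31, 46, 53, 57, 62, 33, 43, 53, 32)

# Hypothetical helper: every value tied for the highest frequency
all_modes <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

n <- length(age)
deviations <- age - mean(age)               # distance of each value from the mean
sample_var <- sum(deviations^2) / (n - 1)   # sample variance divides by n - 1
sample_sd  <- sqrt(sample_var)

all_modes(age)      # 53
median(age)         # 44.5
mean(age)           # 45.2
diff(range(age))    # 31, the width of the range 31 to 62
sample_var          # 120.4
sample_sd           # same value as sd(age)
```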
QUESTION 1.2: Do the manually calculated results match the descriptive statistics in the tables above in section 1.1?
Yes. The manually calculated mean (45.2), median (44.5), range (31 to 62), and standard deviation (10.97) match the values listed in the section 1.1 output.
QUESTION 1.3: Similarly, compute (as appropriate) the mode, median, mean, range, variance, and standard deviation for variables edyrs and supmgr01 (opm94AIF$edyrs: 13 13 12 15 12 10 13 18 15 13, opm94AIF$supmgr01: 1 0 0 1 0 0 0 0 0 0 ) listed for American Indian females. Check your results against the output in 1.1.
#variables
edyrs <- opm94AIF$edyrs
supmgr01 <- opm94AIF$supmgr01
## edyrs DS##
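#Mode: tabulate first; which.max(edyrs) alone would return the position of the largest value, not the mode
which.max(table(edyrs))
## 13
## 3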
#Mean
mean(edyrs)
## [1] 13.4
#Range
range(edyrs)
## [1] 10 18
#Variance
var(edyrs)
## [1] 4.711111
#SD
sd(edyrs)
## [1] 2.170509
## supmgr01 DS##
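#Mode: tabulate first; which.max(supmgr01) alone would return the position of the first 1, not the mode
which.max(table(supmgr01))
## 0
## 1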
#Mean
mean(supmgr01)
## [1] 0.2
#Range
range(supmgr01)
## [1] 0 1
#Variance
var(supmgr01)
## [1] 0.1777778
#SD
sd(supmgr01)
## [1] 0.421637
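The question also asks for the median, which the block above does not compute. A minimal sketch using the values listed in the question is shown below. It also checks the variance of the dummy variable against the formula p(1 - p) * n / (n - 1); that formula is a standard identity for 0/1 data added here for illustration, not something taken from the assignment.

```r
# Values copied from the question text
edyrs    <- c(13, 13, 12, 15, 12, 10, 13, 18, 15, 13)
supmgr01 <- c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0)

median(edyrs)      # 13
median(supmgr01)   # 0

# For a 0/1 variable the sample variance depends only on the share of 1's
p <- mean(supmgr01)               # share of supervisors, 0.2
n <- length(supmgr01)
dummy_var <- p * (1 - p) * n / (n - 1)
dummy_var                         # agrees with var(supmgr01)
```

Both medians agree with the Median row of the summary() output in section 1.1.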
Let’s generate grouped data (frequency table) that you will use for calculating statistics (mode, median, mean) for variable edyrs from the full dataset opm94:
summarytools::freq(opm94$edyrs) # grouped data
## Frequencies
## opm94$edyrs
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## 10 12 1.20 1.20 1.20 1.20
## 12 330 33.00 34.20 33.00 34.20
## 13 101 10.10 44.30 10.10 44.30
## 14 98 9.80 54.10 9.80 54.10
## 15 39 3.90 58.00 3.90 58.00
## 16 290 29.00 87.00 29.00 87.00
## 18 112 11.20 98.20 11.20 98.20
## 20 18 1.80 100.00 1.80 100.00
## <NA> 0 0.00 100.00
## Total 1000 100.00 100.00 100.00 100.00
summary(opm94$edyrs) # summary statistics by R
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.00 12.00 14.00 14.37 16.00 20.00
Finding mode, median, mean for edyrs using the grouped data:
Mode - the most frequent value, can be seen in the frequency table (=12)
Median - the value in the middle, can be seen in the frequency table from the % Valid Cum. column (=14)
Mean: SUM(Xi*fi)/n:
(10*12 + 12*330 + 13*101 + 14*98 + 15*39 + 16*290 + 18*112 + 20*18)/1000
## [1] 14.366
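One way to check grouped-data answers is to expand the frequency table back into raw observations and apply the usual functions. The sketch below does this for edyrs with the frequencies copied from the table above; the object names are illustrative.

```r
# Distinct values and their frequencies, copied from the freq() output
values <- c(10, 12, 13, 14, 15, 16, 18, 20)
freqs  <- c(12, 330, 101, 98, 39, 290, 112, 18)

# Rebuild the 1,000 individual observations from the grouped data
edyrs_expanded <- rep(values, times = freqs)

values[which.max(freqs)]           # mode: 12
median(edyrs_expanded)             # median: 14
mean(edyrs_expanded)               # mean: 14.366
sum(values * freqs) / sum(freqs)   # same mean via SUM(Xi*fi)/n
```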
QUESTION 2.1: Similarly to the example above, find the mode, median, mean for variables yos and supmgr01 using the grouped data:
#variables
yos <- opm94$yos
supmgr001 <- opm94$supmgr01
#summarytools (yos)
summarytools::freq(yos)
## Frequencies
## yos
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## 1 16 1.60 1.60 1.60 1.60
## 2 21 2.10 3.70 2.10 3.70
## 3 45 4.50 8.20 4.50 8.20
## 4 30 3.00 11.20 3.00 11.20
## 5 53 5.30 16.50 5.30 16.50
## 6 49 4.90 21.40 4.90 21.40
## 7 58 5.80 27.20 5.80 27.20
## 8 31 3.10 30.30 3.10 30.30
## 9 43 4.30 34.60 4.30 34.60
## 10 45 4.50 39.10 4.50 39.10
## 11 33 3.30 42.40 3.30 42.40
## 12 35 3.50 45.90 3.50 45.90
## 13 31 3.10 49.00 3.10 49.00
## 14 34 3.40 52.40 3.40 52.40
## 15 43 4.30 56.70 4.30 56.70
## 16 36 3.60 60.30 3.60 60.30
## 17 31 3.10 63.40 3.10 63.40
## 18 29 2.90 66.30 2.90 66.30
## 19 28 2.80 69.10 2.80 69.10
## 20 34 3.40 72.50 3.40 72.50
## 21 30 3.00 75.50 3.00 75.50
## 22 26 2.60 78.10 2.60 78.10
## 23 29 2.90 81.00 2.90 81.00
## 24 26 2.60 83.60 2.60 83.60
## 25 17 1.70 85.30 1.70 85.30
## 26 23 2.30 87.60 2.30 87.60
## 27 18 1.80 89.40 1.80 89.40
## 28 26 2.60 92.00 2.60 92.00
## 29 25 2.50 94.50 2.50 94.50
## 30 6 0.60 95.10 0.60 95.10
## 31 8 0.80 95.90 0.80 95.90
## 32 9 0.90 96.80 0.90 96.80
## 33 6 0.60 97.40 0.60 97.40
## 34 8 0.80 98.20 0.80 98.20
## 35 6 0.60 98.80 0.60 98.80
## 36 4 0.40 99.20 0.40 99.20
## 37 1 0.10 99.30 0.10 99.30
## 38 4 0.40 99.70 0.40 99.70
## 39 2 0.20 99.90 0.20 99.90
## 41 1 0.10 100.00 0.10 100.00
## <NA> 0 0.00 100.00
## Total 1000 100.00 100.00 100.00 100.00
summary(yos)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 14.81 21.00 41.00
#summarytools (supmgr001)
summarytools::freq(supmgr001)
## Frequencies
## supmgr001
## Type: Numeric
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## 0 821 82.10 82.10 82.10 82.10
## 1 179 17.90 100.00 17.90 100.00
## <NA> 0 0.00 100.00
## Total 1000 100.00 100.00 100.00 100.00
summary(supmgr001)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.179 0.000 1.000
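The frequency tables above contain everything needed to read off the three statistics. As a sketch, the hypothetical helper `grouped_stats` below (not a function from any package) takes the distinct values and their frequencies, finds the mode from the largest frequency, the median from the cumulative frequencies, and the mean from SUM(Xi*fi)/n. It assumes the values are sorted in increasing order, as they are in the freq() output.

```r
grouped_stats <- function(values, freqs) {
  n <- sum(freqs)
  cum_freq <- cumsum(freqs)
  # Middle position(s) of the sorted data: two of them when n is even
  middle <- if (n %% 2 == 0) c(n / 2, n / 2 + 1) else (n + 1) / 2
  # Each middle position belongs to the first value whose cumulative count reaches it
  middle_values <- values[sapply(middle, function(p) which(cum_freq >= p)[1])]
  c(mode   = values[which.max(freqs)],
    median = mean(middle_values),
    mean   = sum(values * freqs) / n)
}

# yos: values 1 to 39 and 41, frequencies copied from the table above
yos_values <- c(1:39, 41)
yos_freqs  <- c(16, 21, 45, 30, 53, 49, 58, 31, 43, 45,
                33, 35, 31, 34, 43, 36, 31, 29, 28, 34,
                30, 26, 29, 26, 17, 23, 18, 26, 25, 6,
                8, 9, 6, 8, 6, 4, 1, 4, 2, 1)
yos_stats <- grouped_stats(yos_values, yos_freqs)
yos_stats   # mode 7, median 14, mean 14.81

# supmgr01: 821 zeros and 179 ones
sup_stats <- grouped_stats(c(0, 1), c(821, 179))
sup_stats   # mode 0, median 0, mean 0.179
```

The medians and means agree with the summary() output shown for each variable.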
QUESTION 3: Male01 and exit01 are dummy variables. (They only have two possible values, 0 and 1.) For each, compare its mean to the percentage of cases with the value 1. How are these two measures related?
Percentage of cases with the values 0 and 1 for male01 and exit01:
table(opm94$male01) %>% prop.table()*100
##
## 0 1
## 48.8 51.2
table(opm94$exit01) %>% prop.table()*100
##
## 0 1
## 91.4 8.6
Mean values of male01 and exit01:
mean(opm94$male01)
## [1] 0.512
mean(opm94$exit01)
## [1] 0.086
If I am understanding this correctly, the mean of male01 is .512, which is also the proportion of 1's in the variable. A 0 adds nothing to the sum, so the sum of a dummy variable is simply the number of 1's, and dividing that count by the total number of cases gives the share of 1's. The mean of a dummy variable is therefore equal to the percentage of 1's divided by 100.
(male01) 51.2% 1's = .512 mean
(exit01) 8.6% 1's = .086 mean
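This relationship can be checked with a short sketch. The vectors below are rebuilt from the percentages above (488 zeros and 512 ones for male01, 914 zeros and 86 ones for exit01, given n = 1,000). They stand in for the real columns, which could be used directly when opm94 is loaded.

```r
# Stand-ins for opm94$male01 and opm94$exit01, rebuilt from the percentages
male01 <- rep(c(0, 1), times = c(488, 512))
exit01 <- rep(c(0, 1), times = c(914, 86))

# The sum of a 0/1 variable counts the 1's, so sum / n is the share of 1's
sum(male01) / length(male01)   # 0.512
mean(male01)                   # 0.512, the same quantity
mean(exit01) * 100             # 8.6 percent of cases are exits
```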
QUESTION 4: Using the Frequencies output for the entire data set (and the grouped data formulas for intervals), calculate the mean grade, using GRADE4 instead of grade. Calculate means using the midpoint of each interval of grade4.
summarytools::freq(opm94$grade4)
## Frequencies
## opm94$grade4
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## --------------------- ------ --------- -------------- --------- --------------
## grades 1 to 4 70 7.00 7.00 7.00 7.00
## grades 13 to 16 223 22.30 29.30 22.30 29.30
## grades 5 to 8 299 29.90 59.20 29.90 59.20
## grades 9 to 12 408 40.80 100.00 40.80 100.00
## <NA> 0 0.00 100.00
## Total 1000 100.00 100.00 100.00 100.00
summary(opm94$grade4)
## grades 1 to 4 grades 13 to 16 grades 5 to 8 grades 9 to 12
## 70 223 299 408
Mean:
#midpoint = (lower + upper) / 2
mid1<- (1+4)/2
mid2<- (5+8)/2
mid3<- (9+12)/2
mid4<- (13+16)/2
#Mean: SUM(Xi*fi)/n:
((70 * mid1) + (223 * mid4) + (299 * mid2) + (408 * mid3)) / 1000
## [1] 9.636
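Because freq() lists the factor levels alphabetically ("grades 13 to 16" comes before "grades 5 to 8"), it is easy to pair a frequency with the wrong midpoint. One way to guard against that, sketched here with illustrative column names, is to keep each interval's bounds and frequency together in one row:

```r
# Interval bounds and frequencies for grade4, copied from the freq() output
grade4_table <- data.frame(
  lower = c(1, 5, 9, 13),
  upper = c(4, 8, 12, 16),
  freq  = c(70, 299, 408, 223)
)

# Midpoint of each interval, then the frequency-weighted mean
grade4_table$midpoint <- (grade4_table$lower + grade4_table$upper) / 2
mean_grade4 <- sum(grade4_table$midpoint * grade4_table$freq) / sum(grade4_table$freq)
mean_grade4   # 9.636
```

The same result comes from `weighted.mean(grade4_table$midpoint, grade4_table$freq)`.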
Let’s calculate the means of a variety of variables for black and white workers so that you can describe differences between the two groups of workers:
opm94$race %>% table()
## .
## American Indian Asian Black Hispanic White
## 17 31 175 49 728
opm94 %>% filter(race == "White") %>% select(sal) %>% summarise(mean_sal_White = mean(sal, na.rm = T))
## mean_sal_White
## 1 43294.39
opm94 %>% filter(race == "Black") %>% select(sal) %>% summarise(mean_sal_black = mean(sal, na.rm = T))
## mean_sal_black
## 1 32712.78
opm94 %>% filter(race == "White") %>% select(edyrs) %>% summarise(mean_edyrs_white = mean(edyrs, na.rm = T))
## mean_edyrs_white
## 1 14.57692
opm94 %>% filter(race == "Black") %>% select(edyrs) %>% summarise(mean_edyrs_black = mean(edyrs, na.rm = T))
## mean_edyrs_black
## 1 13.6
Or, alternatively, use the following commands:
opm94 %>% select(race, sal) %>% group_by(race) %>% summarise(mean_sal = mean(sal, na.rm = T))
## # A tibble: 5 x 2
## race mean_sal
## <fct> <dbl>
## 1 American Indian 32846.
## 2 Asian 38440.
## 3 Black 32713.
## 4 Hispanic 36500.
## 5 White 43294.
opm94 %>% select(race, edyrs) %>% group_by(race) %>% summarise(mean_edyrs = mean(edyrs, na.rm = T))
## # A tibble: 5 x 2
## race mean_edyrs
## <fct> <dbl>
## 1 American Indian 13.5
## 2 Asian 14.7
## 3 Black 13.6
## 4 Hispanic 14.1
## 5 White 14.6
opm94 %>% select(race, grade) %>% group_by(race) %>% summarise(mean_grade = mean(grade, na.rm = T))
## # A tibble: 5 x 2
## race mean_grade
## <fct> <dbl>
## 1 American Indian 8.24
## 2 Asian 9.65
## 3 Black 7.91
## 4 Hispanic 8.94
## 5 White 10.1
opm94 %>% select(race, promo01) %>% group_by(race) %>% summarise(mean_promo01 = mean(promo01, na.rm = T))
## # A tibble: 5 x 2
## race mean_promo01
## <fct> <dbl>
## 1 American Indian 0.25
## 2 Asian 0.172
## 3 Black 0.126
## 4 Hispanic 0.213
## 5 White 0.123
opm94 %>% select(race, supmgr01) %>% group_by(race) %>% summarise(mean_supmgr01 = mean(supmgr01, na.rm = T))
## # A tibble: 5 x 2
## race mean_supmgr01
## <fct> <dbl>
## 1 American Indian 0.118
## 2 Asian 0.161
## 3 Black 0.109
## 4 Hispanic 0.143
## 5 White 0.201
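The five calls above can be combined into a single summary, and the same pattern can include yos (years of federal service), which the question below asks about but which is not summarised above. The sketch assumes dplyr 1.0.0 or later for `across()`. `group_means` is a hypothetical helper, and the `toy` data frame contains made-up rows for illustration only, not values from opm94.

```r
library(dplyr)

# Made-up rows for illustration only; these are NOT values from opm94
toy <- data.frame(
  race     = c("Black", "Black", "White", "White"),
  sal      = c(30000, 34000, 40000, 46000),
  yos      = c(10, 14, 12, 18),
  supmgr01 = c(0, 0, 0, 1)
)

# Hypothetical helper: group size and the mean of each listed column, by race
group_means <- function(data) {
  data %>%
    group_by(race) %>%
    summarise(
      n = n(),
      across(c(sal, yos, supmgr01), ~ mean(.x, na.rm = TRUE)),
      .groups = "drop"
    )
}

toy_means <- group_means(toy)
toy_means
```

With the real data loaded, `group_means(opm94)` would apply the same summary, and further columns such as grade, edyrs, and promo01 could be added inside `across()`.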
Question 5: Do whites receive higher rewards (e.g., salaries, grades, supervisory status, promotions) than minorities? Do differences in education and federal experience seem to be partly responsible for these patterns? Write a paragraph discussing differences between the groups (be specific about which groups you compare).
According to the data, whites on average receive greater rewards than blacks:

| Measure | Whites | Blacks | Difference |
|---|---|---|---|
| Salary | $43294.39 | $32712.78 | $10581.61 |
| Education years | 14.58 | 13.60 | .98 |
| Grade | 10.08 | 7.91 | 2.17 |
| Supervisory status | .20 | .11 | .09 |

The only advantage blacks have over whites is in promotions (.126 > .123).
These data paint a picture in which blacks on average have fewer years of education and hold lower grades, which could explain their lower salaries and their lower rate of supervisory status, since supervisory roles tend to pay more. Whites also tend to have higher salaries than the other minority groups. Asians, for instance, have slightly more years of education (14.68) than whites (14.58) but lower grades (9.65) than whites (10.08). In addition, Asians are promoted at a higher rate (.17) than whites (.12). If promotions were correlated with higher salaries, one would expect the gap between white salaries ($43294.39) and Asian salaries ($38439.60), a difference of $4854.79, to be smaller. This raises the question of whether grade is the biggest factor in determining salary, since the groups with higher average grades also receive higher average salaries.
***
Knit the document into an html file and upload to RPubs or save as a pdf file and submit on iCollege.