The structure of the data

  1. There are 102 observations of 6 variables.
  2. There are three types of occupation: “bc” means “blue collar”, “prof” means “professional”, and “wc” means “white collar”. We will discuss the prestige scores and the relationship between education and income within each type of occupation later.
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
                    education income women prestige census type
gov.administrators      13.11  12351 11.16     68.8   1113 prof
general.managers        12.26  25879  4.02     69.1   1130 prof
accountants             12.77   9271 15.70     63.4   1171 prof
purchasing.officers     11.42   8865  9.11     56.8   1175 prof
chemists                14.62   8403 11.68     73.5   2111 prof
physicists              15.64  11030  5.13     77.6   2113 prof
## Classes 'tbl_df', 'tbl' and 'data.frame':    102 obs. of  6 variables:
##  $ education: num  13.1 12.3 12.8 11.4 14.6 ...
##  $ income   : int  12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
##  $ women    : num  11.16 4.02 15.7 9.11 11.68 ...
##  $ prestige : num  68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
##  $ census   : int  1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
##  $ type     : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...

Comments on the code chunk

  1. Group the data by type of occupation.
  2. Summarize the median prestige score within each type of occupation, omitting missing prestige values.
  3. The “prof” (professional) type has the highest median prestige score (68.4) of the three occupation types. Occupations with no recorded type appear as a separate <NA> group.
dta %>%
  group_by(type) %>%
  summarize(prestige_median = median(prestige, na.rm = TRUE))
## Warning: Factor `type` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 4 x 2
##   type  prestige_median
##   <fct>           <dbl>
## 1 bc               35.9
## 2 prof             68.4
## 3 wc               41.5
## 4 <NA>             35
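
The warning and the fourth row above arise because some occupations have no recorded type. A minimal sketch of one way to leave them out before grouping is shown here; the `toy` data frame and its values are hypothetical stand-ins for `dta`, and dropping the rows is only one option (keeping them as an explicit group is another).

```r
library(dplyr)

# Toy data (hypothetical values); the last occupation has no recorded type
toy <- tibble(
  type = factor(c("bc", "bc", "prof", "prof", "wc", NA)),
  prestige = c(30, 40, 60, 80, 45, 35)
)

# Drop rows with a missing type, then summarize as before
type_medians <- toy %>%
  filter(!is.na(type)) %>%
  group_by(type) %>%
  summarize(prestige_median = median(prestige, na.rm = TRUE))

type_medians
```

With the missing-type rows removed, the summary has one row per observed occupation type and no warning about implicit NA is expected.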

Comments on the code chunk

  1. Select the rows of type “bc”.
  2. Split prestige at the median score within that type of occupation to define two levels of prestige: Low and High.
  3. Name the new variable “prestige_f”.
  4. Draw a scatter plot of income against education, with a regression line for each prestige level.

Conclusion: For the “blue collar” category, at the high level of prestige there is a moderate positive relationship between education and income (r = 0.45), while at the low level of prestige the linear relationship is very weak (r = 0.04).

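# cut() leaves the lowest break open unless include.lowest = TRUE, so the minimum prestige value gets NA
dta1 <- dta %>%
  filter(type == "bc") %>%
  mutate(prestige_f = cut(prestige, breaks = quantile(prestige, probs = c(0, .50, 1)),
                          labels = c("Low", "High"), ordered_result = TRUE))
# xyplot() comes from the lattice package
dta1 %>%
  xyplot(income ~ education, groups = prestige_f, type = c("p", "g", "r"), data = .,
         xlab = "education", ylab = "income", auto.key = list(columns = 2))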

dta1 %>%
  group_by(prestige_f) %>%
  dplyr::summarize(r=cor(education, income)) 
## Warning: Factor `prestige_f` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 3 x 2
##   prestige_f       r
##   <ord>        <dbl>
## 1 Low         0.0403
## 2 High        0.450 
## 3 <NA>       NA
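
The `<NA>` row in the output above most likely comes from how `cut()` treats its lowest break: by default the intervals are open on the left, so the observation equal to the minimum falls outside the first interval. The sketch below illustrates this with base R only; the `prestige` vector holds hypothetical values, not the real data.

```r
# Toy prestige scores (hypothetical values, not taken from the data set)
prestige <- c(20, 30, 40, 50, 60, 70)
breaks <- quantile(prestige, probs = c(0, .50, 1))

# Default: intervals are (min, median] and (median, max],
# so the minimum itself is not assigned to any level
default_cut <- cut(prestige, breaks = breaks, labels = c("Low", "High"),
                   ordered_result = TRUE)

# include.lowest = TRUE closes the first interval on the left: [min, median]
closed_cut <- cut(prestige, breaks = breaks, labels = c("Low", "High"),
                  include.lowest = TRUE, ordered_result = TRUE)

table(default_cut, useNA = "ifany")
table(closed_cut, useNA = "ifany")
```

If the same option were added to the code above, the occupation with the lowest prestige would join the Low group, so the Low correlation could differ slightly from the value reported here.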

Conclusion: For the “professional” category, at the high level of prestige there is essentially no linear relationship between education and income (r ≈ 0), while at the low level of prestige there is a mild positive relationship (r = 0.38). Interestingly, this is the opposite of the pattern in the blue-collar category.

dta2 <- dta %>%
  filter(type == "prof") %>%
  mutate(prestige_m = cut(prestige, breaks = quantile(prestige, probs = c(0, .50, 1)),
                          labels = c("Low", "High"), ordered_result = TRUE))
dta2 %>%
  xyplot(income ~ education, groups = prestige_m, type = c("p", "g", "r"), data = .,
         xlab = "education", ylab = "income", auto.key = list(columns = 2))

dta2 %>%
  group_by(prestige_m) %>%
  dplyr::summarize(r=cor(education, income))
## Warning: Factor `prestige_m` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 3 x 2
##   prestige_m           r
##   <ord>            <dbl>
## 1 Low         0.375     
## 2 High        0.00000166
## 3 <NA>       NA

Conclusion: For the “white collar” category, at the high level of prestige there is a weak positive relationship between education and income (r = 0.28), while at the low level of prestige there is a weak negative relationship (r = -0.16).

dta3 <- dta %>%
  filter(type == "wc") %>%
  mutate(prestige_f = cut(prestige, breaks = quantile(prestige, probs = c(0, .50, 1)),
                          labels = c("Low", "High"), ordered_result = TRUE))
dta3 %>%
  xyplot(income ~ education, groups = prestige_f, type = c("p", "g", "r"), data = .,
         xlab = "education", ylab = "income", auto.key = list(columns = 2))

dta3 %>%
  group_by(prestige_f) %>%
  dplyr::summarize(r=cor(education, income)) 
## Warning: Factor `prestige_f` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 3 x 2
##   prestige_f      r
##   <ord>       <dbl>
## 1 Low        -0.155
## 2 High        0.277
## 3 <NA>       NA
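
The three analyses above repeat the same steps for each occupation type. As a rough sketch, the correlations could also be computed in one pipeline by grouping on type before the median split. The `toy` data below is simulated (hypothetical values standing in for `dta`), the `.groups` argument assumes dplyr 1.0 or later, and `include.lowest = TRUE` is an added choice, so results on the real data would not match the outputs above exactly.

```r
library(dplyr)

# Simulated stand-in for dta (hypothetical values), 20 rows per type
set.seed(1)
toy <- tibble(
  type = factor(rep(c("bc", "prof", "wc"), each = 20)),
  education = runif(60, 6, 16),
  income = 1000 * education + rnorm(60, sd = 3000),
  prestige = runif(60, 15, 90)
)

cor_by_group <- toy %>%
  group_by(type) %>%
  # The median split is computed separately within each occupation type
  mutate(prestige_f = cut(prestige,
                          breaks = quantile(prestige, probs = c(0, .50, 1)),
                          labels = c("Low", "High"),
                          include.lowest = TRUE, ordered_result = TRUE)) %>%
  group_by(type, prestige_f) %>%
  # One correlation and one group size per type and prestige level
  summarize(r = cor(education, income), n = n(), .groups = "drop")

cor_by_group
```

Reporting `n` next to `r` makes it easier to judge how much weight each correlation deserves, since every subgroup here is small.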