Data wrangling: Homework 3

2020-Spring [Data Management] Instructor: SHEU, Ching-Fan

CHIU, Ming-Tzu

2020-04-13

Use the Prestige{car} data set for this problem.

  1. Find the median prestige score for each of the three types of occupation, respectively.

  2. Use the median score in each type of occupation to define two levels of prestige: High and low, for each occupation, respectively.

  3. Summarize the relationship between income and education for each category generated from crossing the factor prestige with the type of occupation.

Read the data and check its structure

library(car)
#> Loading required package: carData
library(carData)
str(Prestige)
#> 'data.frame':    102 obs. of  6 variables:
#>  $ education: num  13.1 12.3 12.8 11.4 14.6 ...
#>  $ income   : int  12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
#>  $ women    : num  11.16 4.02 15.7 9.11 11.68 ...
#>  $ prestige : num  68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
#>  $ census   : int  1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
#>  $ type     : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...

Median prestige for each of the three types

First split the rows of Prestige by type, then compute descriptive statistics of prestige for each group.

dta <- Prestige
dta1 <- split(Prestige, Prestige$type)
lapply(dta1, function(x) median(x$prestige))
#> $bc
#> [1] 35.9
#> 
#> $prof
#> [1] 68.4
#> 
#> $wc
#> [1] 41.5

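This gives the median prestige score for each type.

Use the three type medians to split prestige into two levels, High and Low

First look at the range of prestige values within each type.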
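The per-type medians can also be computed without splitting the data frame first. The following is a minimal sketch, assuming a small made-up data frame `toy` (hypothetical values, not the Prestige data) so that it runs without extra packages:

```r
# Hypothetical toy data standing in for Prestige (values are made up)
toy <- data.frame(
  type     = factor(c("bc", "bc", "bc", "prof", "prof", "prof", "wc", "wc", "wc")),
  prestige = c(20, 30, 40, 60, 70, 80, 35, 45, 55)
)

# tapply() applies median() to prestige within each level of type
med <- tapply(toy$prestige, toy$type, median)
med
```

Applied to the real data, the analogous call would be `tapply(Prestige$prestige, Prestige$type, median)`; rows whose type is missing are left out of the groups, as they are with `split()`.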

lapply(dta1, function(x) summary(x$prestige))
#> $bc
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   17.30   27.10   35.90   35.53   42.60   54.90 
#> 
#> $prof
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   53.80   61.00   68.40   67.85   72.95   87.20 
#> 
#> $wc
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   26.50   35.90   41.50   42.24   47.50   67.50

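Across the three types, prestige scores range from 17.3 (the minimum for bc) to 87.2 (the maximum for prof), so all values fall between 17 and 88.

Classify the occupations in each type by that type's own median.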

# cut() intervals are right-closed by default, so a score equal to the median is labelled "Low"
dta1$bc$level <- with(dta1$bc, cut(prestige, breaks = c(0, 35.9, 100), labels = c("Low", "High"), ordered_result = TRUE))
dta1$prof$level <- with(dta1$prof, cut(prestige, breaks = c(0, 68.4, 100), labels = c("Low", "High"), ordered_result = TRUE))
dta1$wc$level <- with(dta1$wc, cut(prestige, breaks = c(0, 41.5, 100), labels = c("Low", "High"), ordered_result = TRUE))
lapply(dta1, function(y) table(y$level))
#> $bc
#> 
#>  Low High 
#>   24   20 
#> 
#> $prof
#> 
#>  Low High 
#>   16   15 
#> 
#> $wc
#> 
#>  Low High 
#>   12   11
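The classification above types each median in by hand. As an alternative sketch (not the method used in this report), `ave()` can attach the group median to every row so that no break point has to be hard-coded. The data frame `toy` and the column `med` are hypothetical names with made-up values:

```r
# Hypothetical toy data standing in for Prestige (values are made up)
toy <- data.frame(
  type     = factor(c("bc", "bc", "bc", "bc", "prof", "prof", "prof", "prof")),
  prestige = c(20, 30, 40, 50, 60, 70, 80, 90)
)

# ave() returns, for every row, the median prestige of that row's type
toy$med <- ave(toy$prestige, toy$type, FUN = median)

# Scores at or below the group median are "Low", the rest "High";
# using <= mirrors the right-closed intervals of cut()
toy$level <- factor(ifelse(toy$prestige <= toy$med, "Low", "High"),
                    levels = c("Low", "High"), ordered = TRUE)

table(toy$type, toy$level)
```

With the real data, rows with a missing type would probably need to be dropped first (for example with `subset(Prestige, !is.na(type))`), since they belong to none of the three groups.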

Crossing the type of occupation with the prestige level divides the data into 6 groups

How are education and income related in each of these 6 groups?

res_bc <- aggregate(cbind(education, income) ~ level, data = dta1$bc, FUN = mean)
res_bc
#>   level education   income
#> 1   Low  7.870417 4087.125
#> 2  High  8.946000 6918.550
res_prof <- aggregate(cbind(education, income) ~ level, data = dta1$prof, FUN = mean)
res_prof
#>   level education    income
#> 1   Low  13.49375  8762.062
#> 2  High  14.71400 12476.667
res_wc <- aggregate(cbind(education, income) ~ level, data = dta1$wc, FUN = mean)
res_wc
#>   level education   income
#> 1   Low  10.56250 4751.667
#> 2  High  11.52273 5380.273
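The three separate `aggregate()` calls could also be collapsed into one by putting both grouping factors on the right-hand side of the formula. This is a minimal sketch, assuming a single data frame that already holds both `type` and `level`; the data frame `toy` below is hypothetical, with made-up values:

```r
# Hypothetical toy data: two types, two prestige levels, made-up values
toy <- data.frame(
  type      = rep(c("bc", "prof"), each = 4),
  level     = rep(c("Low", "Low", "High", "High"), times = 2),
  education = c(6, 8, 9, 11, 12, 14, 15, 17),
  income    = c(3000, 5000, 6000, 8000, 7000, 9000, 11000, 13000)
)

# One call returns the mean education and income for every type x level cell
res <- aggregate(cbind(education, income) ~ type + level, data = toy, FUN = mean)
res
```

For the data in this report, the pieces in `dta1` could be stacked back together with `do.call(rbind, dta1)` before making the same call, which would give one six-row table instead of three two-row tables.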

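Visualize the results.

The axis limits of the plots for the three types are set to the same range, chosen to cover the minimum and maximum of education and income in the full Prestige data, so that the panels can be compared directly.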

summary(Prestige$education)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   6.380   8.445  10.540  10.738  12.648  15.970
summary(Prestige$income)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>     611    4106    5930    6798    8187   25879
library(lattice)
xyplot(education ~ income | level, data=dta1$bc, type=c("g","p","r"), xlab="Income", xlim= c(600, 26000), ylab="Education", ylim= c(6, 16), main="Relationship between Income & Education in Type BC")


xyplot(education ~ income | level, data=dta1$prof, type=c("g","p","r"), xlab="Income", xlim= c(600, 26000), ylab="Education", ylim= c(6, 16), main="Relationship between Income & Education in Type PROF")


xyplot(education ~ income | level, data=dta1$wc, type=c("g","p","r"), xlab="Income", xlim= c(600, 26000), ylab="Education", ylim= c(6, 16), main="Relationship between Income & Education in Type WC")
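All six groups can also be drawn in a single lattice display by conditioning on both factors at once. This is a sketch under the assumption that the data sit in one data frame with `type` and `level` columns; `toy` is a hypothetical data frame with made-up values:

```r
library(lattice)

# Hypothetical toy data: 3 points in each of the 3 x 2 = 6 cells (made-up values)
toy <- expand.grid(rep   = 1:3,
                   level = c("Low", "High"),
                   type  = c("bc", "prof", "wc"))
toy$income    <- 3000 + 1000 * seq_len(nrow(toy))
toy$education <- 6 + 0.5 * seq_len(nrow(toy))

# level * type produces one panel per combination; lattice uses
# common axis limits across panels by default
p <- xyplot(education ~ income | level * type, data = toy,
            type = c("g", "p", "r"),
            xlab = "Income", ylab = "Education",
            layout = c(2, 3))
print(p)
```

Because the panels share their axes by default, the explicit `xlim` and `ylim` settings used in the three separate plots would not be needed in this layout.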