Data wrangling: Homework 5

2020-Spring [Data Management] Instructor: SHEU, Ching-Fan

CHIU, Ming-Tzu

2020-04-12

Augment the data object in the ‘SAT’ lecture note with state.division{datasets}. For each of the 9 divisions, find the slope estimate for regressing average SAT scores onto average teacher’s salary. How many of them are of negative signs?

讀取資料

dta <- read.table('http://www.amstat.org/publications/jse/datasets/sat.dat.txt', row.names=1)
head(dta)
#>               V2   V3     V4 V5  V6  V7   V8
#> Alabama    4.405 17.2 31.144  8 491 538 1029
#> Alaska     8.963 17.6 47.951 47 445 489  934
#> Arizona    4.778 19.3 32.175 27 448 496  944
#> Arkansas   4.459 17.1 28.934  6 482 523 1005
#> California 4.992 24.0 41.078 45 417 485  902
#> Colorado   5.443 18.4 34.571 29 462 518  980

命名變量

names(dta) <- c("Spending", "PTR", "Salary", "PE", "Verbal", "Math", "SAT")
str(dta)
#> 'data.frame':    50 obs. of  7 variables:
#>  $ Spending: num  4.41 8.96 4.78 4.46 4.99 ...
#>  $ PTR     : num  17.2 17.6 19.3 17.1 24 18.4 14.4 16.6 19.1 16.3 ...
#>  $ Salary  : num  31.1 48 32.2 28.9 41.1 ...
#>  $ PE      : int  8 47 27 6 45 29 81 68 48 65 ...
#>  $ Verbal  : int  491 445 448 482 417 462 431 429 420 406 ...
#>  $ Math    : int  538 489 496 523 485 518 477 468 469 448 ...
#>  $ SAT     : int  1029 934 944 1005 902 980 908 897 889 854 ...

增加欄位

dta$Division <- state.division
str(dta)
#> 'data.frame':    50 obs. of  8 variables:
#>  $ Spending: num  4.41 8.96 4.78 4.46 4.99 ...
#>  $ PTR     : num  17.2 17.6 19.3 17.1 24 18.4 14.4 16.6 19.1 16.3 ...
#>  $ Salary  : num  31.1 48 32.2 28.9 41.1 ...
#>  $ PE      : int  8 47 27 6 45 29 81 68 48 65 ...
#>  $ Verbal  : int  491 445 448 482 417 462 431 429 420 406 ...
#>  $ Math    : int  538 489 496 523 485 518 477 468 469 448 ...
#>  $ SAT     : int  1029 934 944 1005 902 980 908 897 889 854 ...
#>  $ Division: Factor w/ 9 levels "New England",..: 4 9 8 5 9 8 1 3 3 3 ...

畫圖看看資料長什麼樣子

library(lattice)
xyplot(SAT ~ Salary, groups=Division, data=dta, type=c("g","p","r"), auto.key=list(columns=3))

尋找 slope

slope <- sapply(split(dta, dta$Division),  function(x) coef(lm(x$SAT ~ x$Salary))[2])
as.data.frame(slope)
#>                                   slope
#> New England.x$Salary         -0.2067041
#> Middle Atlantic.x$Salary      3.9113380
#> South Atlantic.x$Salary       3.7559011
#> East South Central.x$Salary  -2.6215161
#> West South Central.x$Salary -28.1929942
#> East North Central.x$Salary  18.4312440
#> West North Central.x$Salary  -1.3174294
#> Mountain.x$Salary           -13.7999482
#> Pacific.x$Salary              0.3558940

根據回歸線斜率計算結果,我們可以發現,9 個 division 皆不同,以 West South Central、Mountain、West North Central 地區的教師薪水與學生 SAT 成績呈現負相關,程度由大到小排列;而 East North Cental 為主要正相關之地區,其次為 Middle and South Atlantic; New England 和 Pacific 二個地區幾乎沒有相關性。