hw1:example of regression with sampling weights

0.1 set some options

options(digits=3, show.signif.stars=FALSE)

digits=3設定數值呈現到小數第三位 show.signif.stars=FALSE 顯著值不用打星星

0.2 packagage management

# load packages
pacman::p_load(alr4, tidyverse)

load alr4 tidyverse packages

0.3 load data

data(UN11, package="alr4")

load UN11 data from package=alr4

0.4 random sample

#seed the random number generator to get the same sample
set.seed(6102)

設定抽樣隨機起始值:6102

0.5 pick 81 countries from three regions

# arrange the rows by alphabetical order
dta <- UN11 %>%
       filter(region %in% c("Africa", "Asia", "Europe")) %>%
       sample_n(81) %>%
       arrange(region)

從非洲、亞洲、歐洲三區抽81個國家，並按照字母進行順序的排列

# first 6 lines of data frame
head(dta)

##              region  group fertility ppgdp lifeExpF pctUrban
## Ghana        Africa africa      3.99  1333     65.8       52
## Seychelles   Africa africa      2.34 11451     78.0       56
## Gabon        Africa africa      3.19 12469     64.3       86
## Libya        Africa africa      2.41 11321     77.9       78
## Benin        Africa africa      5.08   741     58.7       42
## Burkina Faso Africa africa      5.75   520     57.0       27

顯示資料前6行

# data dimensions - rows and columns
dim(dta)

## [1] 81  6

顯示資料的行數與列數

# how many countries in each of the three regions
R3 <- table(dta$region)
head(R3)

## 
##        Africa          Asia     Caribbean        Europe    Latin Amer 
##            32            27             0            22             0 
## North America 
##             0

在這三區域中各有多少個國家

# percentage of countries from each of the three regions selected
w <- R3/table(UN11$region)
head(w)

## 
##        Africa          Asia     Caribbean        Europe    Latin Amer 
##         0.604         0.540         0.000         0.564         0.000 
## North America 
##         0.000

計算歐亞非三區抽出的國家數所占三區真實國家數的比例

# add the sampling weights variable to data
# skip over countries in regions not selected
dta$wt <- rep(1/w[w != 0], R3[R3 != 0])

針對三區國家抽出之比例進行倒數加權

# simple regression
summary(m0 <- lm(fertility ~ log(ppgdp), data=dta))

## 
## Call:
## lm(formula = fertility ~ log(ppgdp), data = dta)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2686 -0.7716  0.0497  0.6811  2.6292 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    8.313      0.575   14.46  < 2e-16
## log(ppgdp)    -0.652      0.068   -9.58  7.2e-15
## 
## Residual standard error: 1.07 on 79 degrees of freedom
## Multiple R-squared:  0.537,  Adjusted R-squared:  0.532 
## F-statistic: 91.8 on 1 and 79 DF,  p-value: 7.15e-15

進行log(ppgdp)對生育率的線性回歸結果顯示log後的ppgdp對生育率具顯著差異(P<0.01)，beta=-0.625

# weighted regression
summary(m1 <- update(m0, weights=wt))

## 
## Call:
## lm(formula = fertility ~ log(ppgdp), data = dta, weights = wt)
## 
## Weighted Residuals:
##    Min     1Q Median     3Q    Max 
## -3.031 -1.001  0.063  0.921  3.425 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    8.210      0.577   14.22  < 2e-16
## log(ppgdp)    -0.642      0.068   -9.44  1.3e-14
## 
## Residual standard error: 1.42 on 79 degrees of freedom
## Multiple R-squared:  0.53,   Adjusted R-squared:  0.524 
## F-statistic: 89.1 on 1 and 79 DF,  p-value: 1.35e-14

針對抽出國家比例進行加權後，結果顯示，ppgdp對生育率具顯著差異(P<0.01)，beta=-0.642

# plot
ggplot(dta, 
       aes(log(ppgdp), fertility, label=region)) +
 stat_smooth(method="lm", formula=y ~ x, se=F, col="peru", lwd=rel(.5)) +
 stat_smooth(aes(weight=wt), method="lm", formula=y ~ x, se=F, lwd=rel(.5), col="gray")+
 geom_text(check_overlap=TRUE, size=rel(2.3), aes(color=region))+
 labs(x="GDP (US$ in log unit)", 
      y="Number of children per woman") +
 theme_minimal() +
 theme(legend.position="NONE")

畫一條沒有加權調整的回歸線和有經加權的回歸線，兩條回歸線皆顯示，GDP與生育率具負向關係。高GDP低生育率。

0.6 Question

1.設定抽樣隨機起始值的用意? 2.加權這行的語法細節不太懂?[w != 0]與R3 != 0的用意為何? dta$wt <- rep(1/w[w != 0], R3[R3 != 0])