UN11.utf8

##title: “Weighted regression” ##author: “Szu-Yu Chen” ##date: 24 September 2020 —

Example of Regression with Sampling Weights



```r
## Set some options(設定output顯著值設定到小數三位?)
options(digits=3, show.signif.stars=FALSE)


## packagage management(安裝pacman)
# install.packages(pacman)

# load packages
pacman::p_load(alr4, tidyverse)

## load data (載入欲分析檔案)
data(UN11, package="alr4")

## seed the random number generator to get the same sample (設定抽樣隨機起始值為6102)
set.seed(6102)

## pick 81 countries from three regions (選定自三個區域抽81個國家,並以字母排序)
# arrange the rows by alphabetical order
dta <- UN11 %>%
       filter(region %in% c("Africa", "Asia", "Europe")) %>%
       sample_n(81) %>%
       arrange(region)

# first 6 lines of data frame(顯示資料前6行)
head(dta)

##              region  group fertility ppgdp lifeExpF pctUrban
## Ghana        Africa africa      3.99  1333     65.8       52
## Seychelles   Africa africa      2.34 11451     78.0       56
## Gabon        Africa africa      3.19 12469     64.3       86
## Libya        Africa africa      2.41 11321     77.9       78
## Benin        Africa africa      5.08   741     58.7       42
## Burkina Faso Africa africa      5.75   520     57.0       27

# data dimensions - rows and columns(顯示資料有幾行幾列)
dim(dta)

## [1] 81  6

## how many countries in each of the three regions(以表格呈現每個區域有幾個國家)
R3 <- table(dta$region)

## percentage of countries from each of the three regions selected(將已選定的國家除以原三區域所包含的國家數得到每區域選定國家數的百分比)
w <- R3/table(UN11$region)

## add the sampling weights variable to data(資料加權,1/w assign給百分比不為0的國家???)
# skip over countries in regions not selected
dta$wt <- rep(1/w[w != 0], R3[R3 != 0])

## simple regression(跑一個IV為log(ppgdp),DV為fertility的迴歸模型m0)
summary(m0 <- lm(fertility ~ log(ppgdp), data=dta))

## 
## Call:
## lm(formula = fertility ~ log(ppgdp), data = dta)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2686 -0.7716  0.0497  0.6811  2.6292 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    8.313      0.575   14.46  < 2e-16
## log(ppgdp)    -0.652      0.068   -9.58  7.2e-15
## 
## Residual standard error: 1.07 on 79 degrees of freedom
## Multiple R-squared:  0.537,  Adjusted R-squared:  0.532 
## F-statistic: 91.8 on 1 and 79 DF,  p-value: 7.15e-15

## weighted regression(重跑一個有加權的m0模型,命名m1)
summary(m1 <- update(m0, weights=wt))

## 
## Call:
## lm(formula = fertility ~ log(ppgdp), data = dta, weights = wt)
## 
## Weighted Residuals:
##    Min     1Q Median     3Q    Max 
## -3.031 -1.001  0.063  0.921  3.425 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    8.210      0.577   14.22  < 2e-16
## log(ppgdp)    -0.642      0.068   -9.44  1.3e-14
## 
## Residual standard error: 1.42 on 79 degrees of freedom
## Multiple R-squared:  0.53,   Adjusted R-squared:  0.524 
## F-statistic: 89.1 on 1 and 79 DF,  p-value: 1.35e-14

## plot(繪製以GDP預測生育率的回歸線,x=log(ppgdp),標籤為GDP。y=fertility標籤為Number of children per woman,並將線性模式(祕魯色)與加權模式(灰色)資料分布走向標出。根據圖表，GDP高的國家傾向低生育率，反之亦然。)
ggplot(dta, 
       aes(log(ppgdp), fertility, label=region)) +
 stat_smooth(method="lm", formula=y ~ x, se=F, col="peru", lwd=rel(.5)) +
 stat_smooth(aes(weight=wt), method="lm", formula=y ~ x, se=F, lwd=rel(.5), col="gray")+
 geom_text(check_overlap=TRUE, size=rel(2.3), aes(color=region))+
 labs(x="GDP (US$ in log unit)", 
      y="Number of children per woman") +
 theme_minimal() +
 theme(legend.position="NONE")

# The end