example of regression with sampling weights

set some options

options(digits=3, show.signif.stars=FALSE)

設定環境

packagage management

#install.packages("pacman")

load packages

pacman::p_load(alr4, tidyverse)

load data

data(UN11, package="alr4")

seed the random number generator to get the same sample

set.seed(6102)

隨機取樣

pick 81 countries from three regions

arrange the rows by alphabetical order

dta <- UN11 %>%
       filter(region %in% c("Africa", "Asia", "Europe")) %>%
       sample_n(81) %>%
       arrange(region)

first 6 lines of data frame

head(dta)

##              region  group fertility ppgdp lifeExpF pctUrban
## Ghana        Africa africa      3.99  1333     65.8       52
## Seychelles   Africa africa      2.34 11451     78.0       56
## Gabon        Africa africa      3.19 12469     64.3       86
## Libya        Africa africa      2.41 11321     77.9       78
## Benin        Africa africa      5.08   741     58.7       42
## Burkina Faso Africa africa      5.75   520     57.0       27

data dimensions - rows and columns

dim(dta)

## [1] 81  6

資料維度：81列，6行

how many countries in each of the three regions

R3 <- table(dta$region)

從三個區域選出81個國家，有32個國家來自非洲，27個國家來自亞洲，有22個國家來自歐洲。

percentage of countries from each of the three regions selected

w <- R3/table(UN11$region)
print(w)

## 
##        Africa          Asia     Caribbean        Europe    Latin Amer 
##         0.604         0.540         0.000         0.564         0.000 
## North America NorthAtlantic       Oceania 
##         0.000         0.000         0.000

所選出來的81國中，有60.4%來自非洲，54%來自亞洲，56.4%來自歐洲。

add the sampling weights variable to data

skip over countries in regions not selected

dta$wt <- rep(1/w[w != 0], R3[R3 != 0])

將權重變數新增到資料中

simple regression

summary(m0 <- lm(fertility ~ log(ppgdp), data=dta))

## 
## Call:
## lm(formula = fertility ~ log(ppgdp), data = dta)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2686 -0.7716  0.0497  0.6811  2.6292 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    8.313      0.575   14.46  < 2e-16
## log(ppgdp)    -0.652      0.068   -9.58  7.2e-15
## 
## Residual standard error: 1.07 on 79 degrees of freedom
## Multiple R-squared:  0.537,  Adjusted R-squared:  0.532 
## F-statistic: 91.8 on 1 and 79 DF,  p-value: 7.15e-15

log(ppgdp)平均每人國民所得成長率增加1%，出生率減少0.65%。

weighted regression

summary(m1 <- update(m0, weights=wt))

## 
## Call:
## lm(formula = fertility ~ log(ppgdp), data = dta, weights = wt)
## 
## Weighted Residuals:
##    Min     1Q Median     3Q    Max 
## -3.031 -1.001  0.063  0.921  3.425 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    8.210      0.577   14.22  < 2e-16
## log(ppgdp)    -0.642      0.068   -9.44  1.3e-14
## 
## Residual standard error: 1.42 on 79 degrees of freedom
## Multiple R-squared:  0.53,   Adjusted R-squared:  0.524 
## F-statistic: 89.1 on 1 and 79 DF,  p-value: 1.35e-14

加權回歸結果：平均每人國民所得成長率增加1%，出生率減少0.642%

plot

ggplot(dta, 
       aes(log(ppgdp), fertility, label=region)) +
 stat_smooth(method="lm", formula=y ~ x, se=F, col="peru", lwd=rel(.5)) +
 stat_smooth(aes(weight=wt), method="lm", formula=y ~ x, se=F, lwd=rel(.5), col="gray")+
 geom_text(check_overlap=TRUE, size=rel(2.3), aes(color=region))+
 labs(x="GDP (US$ in log unit)", 
      y="Number of children per woman") +
 theme_minimal() +
 theme(legend.position="NONE")

HW1：Weighted regression

吳靜芳

27 September 2020

example of regression with sampling weights

set some options

packagage management

load packages

load data

seed the random number generator to get the same sample

pick 81 countries from three regions

arrange the rows by alphabetical order

first 6 lines of data frame

data dimensions - rows and columns

how many countries in each of the three regions

percentage of countries from each of the three regions selected

add the sampling weights variable to data

skip over countries in regions not selected

simple regression

weighted regression

plot

The end