1 Goal

  • This course starts from descriptive analysis and visualization, and then shows how to run ordinary least squares (OLS) regression and logistic regression. For example, we can visualize the relationship between two variables like this:
  • library(ggplot2)
    ggplot(mtcars, aes(x=wt, y=mpg)) +
        geom_point() +
        geom_smooth(method='lm', se=T)


    2 Descriptive Statistics

  • We can describe our data in terms of central tendency and dispersion.
  • 2.1 Mean and Standard Deviation

  • Mean and standard deviation represent the central tendency and dispersion of a random variable, respectively.
  • We can use stargazer to report the mean, standard deviation, minimum, and maximum values.
  • library(stargazer)
    dss <- as.data.frame(diamonds)
    dss <- dss[, c(1,5,7)]
    stargazer(dss, type="html")
    Statistic   N       Mean       St. Dev.   Min      Max
    carat       53,940  0.798      0.474      0.200    5.010
    depth       53,940  61.700     1.430      43.000   79.000
    price       53,940  3,933.000  3,989.000  326      18,823
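  • When knitting to HTML is not needed, stargazer can print the same table as plain text in the console:
  • stargazer(dss, type="text")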

    2.2 apply()

  • apply() applies a function over the rows or columns of a matrix or data frame (the second argument, MARGIN, selects which), so we can report exactly the statistics we specify.
  • options(digits=4)
    dss <- as.data.frame(diamonds)
    dss <- dss[, c(1,5,7)]
    Statistics<-c("N", "Mean", "S.D.", "Min", "Pct25","Pct75", "Max")
    A<-apply(dss,2,function(x) c(length(x), mean(x), sd(x),
                               min(x), quantile(x,c(0.25, 0.75)),max(x)))
    rownames(A)<-Statistics
    A
    ##           carat     depth price
    ## N     5.394e+04 53940.000 53940
    ## Mean  7.979e-01    61.749  3933
    ## S.D.  4.740e-01     1.433  3989
    ## Min   2.000e-01    43.000   326
    ## Pct25 4.000e-01    61.000   950
    ## Pct75 1.040e+00    62.500  5324
    ## Max   5.010e+00    79.000 18823
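  • The mixed scientific notation above comes from printing columns with very different scales; rounding the matrix usually tidies the display:
  • round(A, 2)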
  • What if we want to describe data by groups? tapply can return a statistic by group, but only one statistic at a time (a workaround is sketched after the output below).
  • table(sleep$group)
    ## 
    ##  1  2 
    ## 10 10
    tapply(sleep$extra, sleep$group, mean)
    ##    1    2 
    ## 0.75 2.33
    tapply(sleep$extra, sleep$group, median)
    ##    1    2 
    ## 0.35 1.75
    tapply(sleep$extra, sleep$group, sd)
    ##     1     2 
    ## 1.789 2.002
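  • One workaround is to pass tapply a function that returns several named statistics at once; this returns a list of vectors, one per group:
  • tapply(sleep$extra, sleep$group, function(x) c(n=length(x), mean=mean(x), sd=sd(x)))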

    2.3 summarize

  • Base R provides summary statistics by running summary() on a single variable. dplyr can also summarize a variable, with more options and flexibility.
  • summary(mtcars$mpg)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##    10.4    15.4    19.2    20.1    22.8    33.9
    library(dplyr)
    summarize(mtcars, n(), mean(mpg), mean(hp), sd(mpg), 
              quantile(mpg, c(0.25)), quantile(mpg, c(0.50)), quantile(mpg, c(0.75)))
    ##   n() mean(mpg) mean(hp) sd(mpg) quantile(mpg, c(0.25))
    ## 1  32     20.09    146.7   6.027                  15.43
    ##   quantile(mpg, c(0.5)) quantile(mpg, c(0.75))
    ## 1                  19.2                   22.8
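  • With dplyr 1.0 or later, across() expresses the same idea more compactly and labels the output columns; a sketch:
  • mtcars %>%
      summarize(n = n(),
                across(c(mpg, hp), list(mean = mean, sd = sd)),
                q25 = quantile(mpg, 0.25))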

    2.3.1 group_by

  • dplyr also allows us to summarize a variable by group using group_by().
  • mtcars %>% group_by(cyl)  %>% summarize(mean(mpg))
    ## # A tibble: 3 x 2
    ##     cyl `mean(mpg)`
    ##   <dbl>       <dbl>
    ## 1     4        26.7
    ## 2     6        19.7
    ## 3     8        15.1

    2.4 Scatter Plot

  • It is necessary to check a scatter plot when we look at the relationship between two variables, whether they are continuous or discrete.
  • For example, we can use ggplot2 to show how the two variables correspond to each other. Moreover, the regression line and its standard error band can be added to the graph.
  • ggplot(anscombe, aes(x=x1, y=y1)) +
      geom_point(size=3)  +
      geom_smooth(method='lm', se=T)
  • The standard error band reflects the accuracy of the estimated relationship. However, a scatter plot alone cannot tell us how well the model fits, so we estimate an OLS model as follows.
  • fit1 <- lm(data=anscombe, y1 ~ x1)
    stargazer(fit1, type='html')
                        Dependent variable: y1
    x1                  0.500*** (0.118)
    Constant            3.000** (1.120)
    Observations        11
    R2                  0.667
    Adjusted R2         0.629
    Residual Std. Error 1.240 (df = 9)
    F Statistic         18.000*** (df = 1; 9)
    Note: *p<0.1; **p<0.05; ***p<0.01
  • If there is one more variable of interest, we can also include it in the scatter plot.
  • ggplot(Orange, aes(x=age, y=circumference, group=Tree)) +
       geom_point(aes(col=Tree), size=3) +
       geom_smooth(method="loess", se=F) +
       theme_bw()

    2.5 Marginal Effect

  • When we have more than one independent variable and an interaction term between two of them, we need to estimate the marginal effect implied by the interaction. Yiqing Xu (UCSD) provides an R package, interflex, to visualize the linear and non-linear marginal effects of a treatment.
  • For example, we assume that the effect of mother’s IQ on children’s scores is conditional on whether or not the mother holds a high school degree. We can specify the model as follows.
  • \[ Y_{i}=\beta_{0}+\beta_{1}X_{i}+\beta_{2}D_{i}+\beta_{3}X_{i}D_{i} \]
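  • Taking the partial derivative with respect to \(X\) shows why the marginal effect depends on \(D\): \[ \frac{\partial Y}{\partial X} = \beta_{1}+\beta_{3}D \] so the slope of mother’s IQ is \(\beta_{1}\) when \(D=0\) and \(\beta_{1}+\beta_{3}\) when \(D=1\).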

    library(foreign)
    library(interflex)
    kid <- read.dta("kidiq.dta")
    inter.raw(Y="kid_score", X="mom_iq", D="mom_hs",
              Xlabel="Mom's IQ", Ylabel="Kid's Score",
              Dlabel="Mom's High School", data=kid)
    ## $graph
  • The first panel shows a steeper slope than the second, which implies that mother’s IQ has a stronger effect on scores among children whose mothers do not hold a high school degree.
  • We can estimate the model with OLS.
  • M1<-lm(kid_score ~ mom_iq+mom_hs+mom_iq:mom_hs, data=kid)
    stargazer(M1, type='html')
                        Dependent variable: kid_score
    mom_iq              0.969*** (0.148)
    mom_hs              51.300*** (15.300)
    mom_iq:mom_hs       -0.484*** (0.162)
    Constant            -11.500 (13.800)
    Observations        434
    R2                  0.230
    Adjusted R2         0.225
    Residual Std. Error 18.000 (df = 430)
    F Statistic         42.800*** (df = 3; 430)
    Note: *p<0.1; **p<0.05; ***p<0.01
  • Again, the negative coefficient on the interaction term shows that a mother’s high school degree weakens the marginal effect of her IQ: the slope is 0.969 for mothers without a degree versus 0.969 − 0.484 = 0.485 for those with one.

    3 Logarithmic Transformations of Variables

  • Variable transformation can “straighten” the relationship between two variables. If the relationship is monotone and simple (always positive or always negative), we can use OLS as the functional form. Benoit discussed the log transformation of both dependent and independent variables in OLS regression.
  • library(car)
    ggplot(data=Leinhardt, aes(x=income, y=infant)) +
        geom_point() +
        geom_smooth(method="loess", col="black")

  • We can take the logarithm of both variables because most observations are located in the lower-left corner; both variables are right-skewed.
  • par(mfrow=c(1,2))
    hist(Leinhardt$infant, 100, probability = T, col="gray90", main="infant", xlab='')
    hist(Leinhardt$income, 100, probability = T, col="gray90", main="income", xlab='')
  • We can plot the logarithms of both variables; their distributions now look much closer to normal.
  • par(mfrow=c(1,2))
    Leinhardt$l.infant = log(Leinhardt$infant)
    Leinhardt$l.income = log(Leinhardt$income)
    hist(Leinhardt$l.infant, 100, probability = T, col="gray90", main="Log infant", xlab='')
    hist(Leinhardt$l.income, 100, probability = T, col="gray90", main="Log income", xlab='')

  • The scatter plot of the two new variables is different from the previous one. We can estimate the log-log model:
  • \[ \textrm{log}Y_{i}=\beta_{0}+\beta_{1}\textrm{log}X_{i}+\epsilon_{i} \]
  • If we exponentiate both sides and add 1 to \(\textrm{log}X\), we obtain the following result.
  • \[ \begin{eqnarray} Y &=& e^{\beta_{0}+\beta_{1}(\textrm{log}X+1)} \\ &=& e^{\beta_{0}}\,e^{\beta_{1}\textrm{log}X}\,e^{\beta_{1}} \end{eqnarray} \]
  • Because \(\textrm{log}X+1 = \textrm{log}X+\textrm{log}(e)=\textrm{log}(eX)\), where \(e=\sum_{n=0}^\infty\frac{1}{n!}\approx 2.72\), adding 1 to \(\textrm{log}X\) means \(X\) increases by \(100\times (2.72-1)= 172\%\).
  • So when \(X\) increases by \(172\%\) (that is, \(\textrm{log}X\) increases by one unit), \(Y\) is multiplied by \(e^{\beta_{1}}\).
  • A \(1\%\) increase in \(X\) multiplies \(Y\) by \(e^{\beta_{1}\times \textrm{log}(1.01)}\approx 1+0.01\beta_{1}\); in practice, a \(1\%\) change in \(X\) changes \(Y\) by about \(\beta_{1}\) percent.
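  • A quick numerical check of this approximation, using the log-log slope estimated below (−0.512):
  • b1 <- -0.512
    exp(b1 * log(1.01))   # exact multiplier for a 1% increase in X: about 0.9949
    1 + 0.01 * b1         # linear approximation: about 0.9949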
  • We take the log of both the independent and the response variable.
  • ggplot(data=Leinhardt, aes(x=l.income, y=l.infant)) +
        geom_point() +
        geom_smooth(method="lm", col="black")

    fit<-lm(data=Leinhardt, l.infant ~ l.income)
    stargazer(fit, type='html')
                        Dependent variable: l.infant
    l.income            -0.512*** (0.051)
    Constant            7.146*** (0.317)
    Observations        101
    R2                  0.502
    Adjusted R2         0.497
    Residual Std. Error 0.687 (df = 99)
    F Statistic         99.845*** (df = 1; 99)
    Note: *p<0.1; **p<0.05; ***p<0.01
  • A \(1\%\) increase in income reduces the infant mortality rate per 1,000 live births by about \(0.5\%\).

    4 Logistic Regression

  • When \(Y\) is a dichotomous variable, taking only the values 0 and 1, we can consider transforming the binary response variable.
  • One of the reasons that OLS is not appropriate for a dichotomous outcome is that its predicted values can fall below 0 or above 1.
  • library(ISLR)
    Default2 <- Default[, c(1,3)]
    Default2$default <- as.numeric(Default2$default)-1   # recode factor "No"/"Yes" to 0/1
    fit1 <- lm(data=Default2, default ~ balance)         # linear probability model
    Default2$predict <- predict(fit1)
    library(reshape2)
    DT <-melt(Default2, id.var="balance", variable.name = "Y")
    ggplot(DT, aes(x=balance, y=value, col=Y)) + 
      geom_point() +
      labs(y="Y", x="Balance") +
      scale_y_continuous(breaks=c(0,1)) +
      scale_colour_manual(values=c("navyblue", "firebrick"))
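  • reshape2 is no longer actively developed; the same reshaping can also be done with tidyr, for example:
  • library(tidyr)
    DT <- pivot_longer(Default2, cols=c(default, predict),
                       names_to="Y", values_to="value")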
  • The logit model, or logistic regression model, uses the following function:
  • \[ \begin{eqnarray} \text{logit}(P(Y=1|X)) & = & \text{log} \frac{p(X)}{1-p(X)}\\ &=&\beta_{0}+\beta_{1}X \end{eqnarray} \] Because \(\text{exp}(\text{log}(a))=a\), \[ \frac{p(X)}{1-p(X)} = e^{\eta} \] or, \[ p(X)=\frac{e^{\eta}}{1+e^{\eta}} \] where \(\eta \equiv \sum_{k} \beta_{k}X_{ik}\).
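  • Base R implements both of these transformations, which gives a quick way to check the algebra:
  • eta <- 0.5
    plogis(eta)           # p = e^eta / (1 + e^eta) = 0.622
    qlogis(plogis(eta))   # log odds of p; recovers 0.5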
  • We can compare the different results of OLS and logistic regression models.
  • Logit <- glm(data=Default2, default ~ balance,  family=binomial(logit))
    stargazer(fit1, Logit, type="html")
                        Dependent variable: default
                            OLS            logistic
                            (1)            (2)
    balance             0.0001***          0.005***
                        (0.00000)          (0.0002)
    Constant            -0.075***          -10.700***
                        (0.003)            (0.361)
    Observations        10,000             10,000
    R2                  0.123
    Adjusted R2         0.122
    Log Likelihood                         -798.000
    Akaike Inf. Crit.                      1,600.000
    Residual Std. Error 0.168 (df = 9998)
    F Statistic         1,397.000*** (df = 1; 9998)
    Note: *p<0.1; **p<0.05; ***p<0.01
  • We can plot the predicted probabilities based on the predicted values from the logistic regression model.
  • Default3 <- Default[, c(1,3)]
    Default3$default <- as.numeric(Default3$default)-1
    Default3$predict_logit <- -10.7 + 0.005*Default3$balance   # rounded coefficients from the table above
    Default3$p <- exp(Default3$predict_logit)/(1 + exp(Default3$predict_logit))
    Default3 <- Default3[, c(1, 2, 4)]
    df <-melt(Default3, id.var="balance", variable.name = "response", value.name = "V")
    
    g1 <-ggplot(df, aes(x=balance, y=V, col=response)) + 
      geom_point() +
      labs(y="Y", x="Balance") 
    
    g1 +  scale_y_continuous(breaks=c(0,1)) +
      scale_colour_manual(values=c("navyblue", "firebrick"))
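  • Equivalently, and without hard-coding rounded coefficients, the fitted probabilities can be taken directly from the model object:
  • Default3$p <- predict(Logit, type="response")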

    4.1 Odds Ratio

  • Odds can be expressed as \(\frac{p(x)}{1-p(x)}\).
  • Remember that
  • \[ \frac{p(X)}{1-p(X)} = e^{\beta_{0}+\beta_{1}X} \] If we take the log of both sides,
    \[ \begin{eqnarray} \textrm{log}\frac{p(X)}{1-p(X)} & = & \textrm{log}\left(e^{\beta_{0}+\beta_{1}X}\right) \\ & = & \beta_{0}+\beta_{1}X \end{eqnarray} \]
  • Increasing \(X\) by one unit changes the log odds by \(\beta_{1}\). In this study, a one-unit increase in balance is associated with an increase in the log odds of default of about 0.0055.
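  • Exponentiating the coefficient turns the change in log odds into an odds ratio; using the Logit model fitted above:
  • exp(coef(Logit)["balance"])   # odds ratio for a one-dollar increase in balance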

    5 Assignments

      1. Please calculate the average temperature and Solar.R by month in airquality. You may want to remove the NAs by adding na.rm=T when computing on variables that contain NA.
      2. Please examine the effect of Expend on Grad.Rate in College of the ISLR package on a scatter plot and estimate it with R.
      3. Please estimate the effect of Ag on Surv in the following dataset.
    library(kableExtra)
    DT <-data.frame(Surv=c(rep(1,11),rep(0,22)),
                    Ag=c(rep(1,9), rep(0,2),rep(1,8),rep(0,14)))
    DT %>% 
      kable('html') %>% 
      kable_styling()
    Surv Ag
    1 1
    1 1
    1 1
    1 1
    1 1
    1 1
    1 1
    1 1
    1 1
    1 0
    1 0
    0 1
    0 1
    0 1
    0 1
    0 1
    0 1
    0 1
    0 1
    0 0
    0 0
    0 0
    0 0
    0 0
    0 0
    0 0
    0 0
    0 0
    0 0
    0 0
    0 0
    0 0
    0 0