Descriptive Statistics
We can describe our data in terms of central tendency and dispersion.
Mean and Standard Deviation
The mean and standard deviation represent the central tendency and dispersion of a random variable, respectively.
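As a quick check in base R (a minimal sketch, assuming ggplot2, which provides the diamonds data, is loaded), mean() and sd() return these two quantities directly:
library(ggplot2)       # diamonds ships with ggplot2
mean(diamonds$price)   # central tendency of price
sd(diamonds$price)     # dispersion of price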
We can use stargazer
to report the mean, standard deviation, minimum, and maximum values.
library(stargazer)
dss <- as.data.frame(diamonds)
dss <- dss[, c(1,5,7)]   # keep carat, depth, and price
stargazer(dss, type="html")
| Statistic | N | Mean | St. Dev. | Min | Max |
|---|---|---|---|---|---|
| carat | 53,940 | 0.798 | 0.474 | 0.200 | 5.010 |
| depth | 53,940 | 61.700 | 1.430 | 43.000 | 79.000 |
| price | 53,940 | 3,933.000 | 3,989.000 | 326 | 18,823 |
apply()
apply() can return whatever statistics we specify in the function we pass to it.
options(digits=4)
dss <- as.data.frame(diamonds)
dss <- dss[, c(1,5,7)]
Statistics<-c("N", "Mean", "S.D.", "Min", "Pct25","Pct75", "Max")
A <- apply(dss, 2, function(x) c(length(x), mean(x), sd(x),   # 2 = apply over columns
                                 min(x), quantile(x, c(0.25, 0.75)), max(x)))
rownames(A)<-Statistics
A
## carat depth price
## N 5.394e+04 53940.000 53940
## Mean 7.979e-01 61.749 3933
## S.D. 4.740e-01 1.433 3989
## Min 2.000e-01 43.000 326
## Pct25 4.000e-01 61.000 950
## Pct75 1.040e+00 62.500 5324
## Max 5.010e+00 79.000 18823
What if we want to describe the data by groups? tapply() can return important statistics by group, but it computes only one statistic at a time. Using the built-in sleep data, we can first check the group sizes:
table(sleep$group)
##
## 1 2
## 10 10
tapply(sleep$extra, sleep$group, mean)
## 1 2
## 0.75 2.33
tapply(sleep$extra, sleep$group, median)
## 1 2
## 0.35 1.75
tapply(sleep$extra, sleep$group, sd)
## 1 2
## 1.789 2.002
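If we want several statistics per group from a single call, one workaround (a sketch with the same sleep data) is to pass tapply() a function that returns a named vector:
tapply(sleep$extra, sleep$group, function(x) c(Mean = mean(x), SD = sd(x)))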
summarize
Base R can provide summary statistics by running summary() on one variable.
summary(mtcars$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.4 15.4 19.2 20.1 22.8 33.9
dplyr can also summarize a variable with more options and flexibility.
library(dplyr)
summarize(mtcars, n(), mean(mpg), mean(hp), sd(mpg),
quantile(mpg, c(0.25)), quantile(mpg, c(0.50)), quantile(mpg, c(0.75)))
## n() mean(mpg) mean(hp) sd(mpg) quantile(mpg, c(0.25))
## 1 32 20.09 146.7 6.027 15.43
## quantile(mpg, c(0.5)) quantile(mpg, c(0.75))
## 1 19.2 22.8
group_by
dplyr also allows us to summarize a variable by group using group_by().
mtcars %>% group_by(cyl) %>% summarize(mean(mpg))
## # A tibble: 3 x 2
## cyl `mean(mpg)`
## <dbl> <dbl>
## 1 4 26.7
## 2 6 19.7
## 3 8 15.1
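The same pattern extends to several statistics per group at once; a minimal sketch with the same mtcars data (the column names n, mean_mpg, and sd_mpg are just illustrative labels):
mtcars %>%
  group_by(cyl) %>%
  summarize(n = n(), mean_mpg = mean(mpg), sd_mpg = sd(mpg))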
Scatter Plot
It is necessary to check a scatter plot when we look at the relationship between two variables, whether they are continuous or discrete.
For example, we can use ggplot2 to show how the two variables correspond to each other. Moreover, the regression line and its standard error can be added to the graph.
ggplot(anscombe, aes(x=x1, y=y1)) +
geom_point(size=3) +
geom_smooth(method='lm', se=T)
The standard error indicates the accuracy of the estimate. However, a scatter plot alone cannot show how accurate the model is. We can run an OLS model as follows.
fit1 <- lm(data=anscombe, y1 ~ x1)
stargazer(fit1, type='html')
|  | Dependent variable: |
|---|---|
|  | y1 |
| x1 | 0.500*** |
|  | (0.118) |
| Constant | 3.000** |
|  | (1.120) |
| Observations | 11 |
| R2 | 0.667 |
| Adjusted R2 | 0.629 |
| Residual Std. Error | 1.240 (df = 9) |
| F Statistic | 18.000*** (df = 1; 9) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
If there is one more variable of interest, we can also include it in the scatter plot.
ggplot(Orange, aes(x=age, y=circumference, group=Tree)) +
geom_point(aes(col=Tree), size=3) +
geom_smooth(method="loess", se=F) +
theme_bw()

Marginal Effect
When we have more than one independent variable and an interaction term between two of them, we need to estimate the effect of the interaction. Yiqing Xu (UCSD) provides an R package, interflex, to visualize the linear and non-linear marginal effects of a treatment.
For example, we assume that the effect of mother's IQ on children's scores is conditional on whether or not the mother holds a high school degree. We can specify the model as follows.
\[
Y_{i}=\beta_{0}+\beta_{1}X_{i}+\beta_{2}D_{i}+\beta_{3}X_{i}D_{i}
\]
library(foreign)
library(interflex)
kid <- read.dta("kidiq.dta")
inter.raw(Y = "kid_score", X = "mom_iq", D = "mom_hs",
          Xlabel = "Mom's IQ", Ylabel = "Kid's Score", Dlabel = "Mom's High School",
          data = kid)
## $graph
The first panel shows a steeper slope than the second, which implies that mother's IQ is more strongly associated with children's scores when the mother does not hold a high school degree.
We can estimate the model with OLS.
M1<-lm(kid_score ~ mom_iq+mom_hs+mom_iq:mom_hs, data=kid)
stargazer(M1, type='html')
|  | Dependent variable: |
|---|---|
|  | kid_score |
| mom_iq | 0.969*** |
|  | (0.148) |
| mom_hs | 51.300*** |
|  | (15.300) |
| mom_iq:mom_hs | -0.484*** |
|  | (0.162) |
| Constant | -11.500 |
|  | (13.800) |
| Observations | 434 |
| R2 | 0.230 |
| Adjusted R2 | 0.225 |
| Residual Std. Error | 18.000 (df = 430) |
| F Statistic | 42.800*** (df = 3; 430) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
Again, the negative coefficient of the interaction term shows that a mother's high school degree weakens the effect of mother's IQ on children's scores.
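To see the implied marginal effect of mom_iq at each level of mom_hs, we can combine the estimated coefficients from M1 (a minimal sketch; b is just a temporary name for the coefficient vector):
b <- coef(M1)
b["mom_iq"]                        # slope of mom_iq when mom_hs = 0 (about 0.969)
b["mom_iq"] + b["mom_iq:mom_hs"]   # slope of mom_iq when mom_hs = 1 (about 0.485)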
Logistic Regression
When \(Y\) is a dichotomous variable, which takes the value 0 or 1, we can consider transforming the binary response variable.
One of the reasons that OLS is not appropriate for a dichotomous dependent variable is that the predicted values can fall below 0 or exceed 1.
library(ISLR)
Default2 <- Default[, c(1,3)]                         # keep default and balance
Default2$default <- as.numeric(Default2$default) - 1  # recode the factor "No"/"Yes" to 0/1
fit1<-lm(data=Default2, default ~ balance)
Default2$predict <- predict(fit1)
library(reshape2)
DT <-melt(Default2, id.var="balance", variable.name = "Y")
ggplot(DT, aes(x=balance, y=value, col=Y)) +
geom_point() +
labs(y="Y", x="Balance") +
scale_y_continuous(breaks=c(0,1)) +
scale_colour_manual(values=c("navyblue", "firebrick"))
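A quick check (a sketch using the fitted values stored above) confirms that the OLS predictions leave the unit interval:
range(Default2$predict)   # the minimum fitted value is negative, an impossible probability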
The logit model, or logistic regression model, uses the following function:
\[
\begin{eqnarray}
\text{logit}(P(Y=1 \mid X)) & = & \log \frac{P_{i}}{1-P_{i}}\\
& = & \log\frac{p(X)}{1-p(X)}\\
& = & \beta_{0}+X\beta_{1}
\end{eqnarray}
\]
Because \(\exp(\log(a))=a\),
\[
\frac{p(X)}{1-p(X)} = e^{\eta}
\]
or,
\[
p(X)=\frac{e^{\eta}}{1+e^{\eta}}
\] where,
\[ \eta \equiv \sum \beta_{k}X_{ik}\]
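To illustrate this transformation numerically (a minimal sketch with arbitrary values of the linear predictor), the formula above is exactly what base R's plogis() computes:
eta <- c(-2, 0, 2)          # hypothetical values of the linear predictor
exp(eta) / (1 + exp(eta))   # p(X) from the formula
plogis(eta)                 # the built-in logistic CDF returns the same values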
We can compare the different results of OLS and logistic regression models.
Logit <- glm(data=Default2, default ~ balance, family=binomial(logit))
stargazer(fit1, Logit, type="html")
|  | Dependent variable: |  |
|---|---|---|
|  | default | |
|  | OLS | logistic |
|  | (1) | (2) |
| balance | 0.0001*** | 0.005*** |
|  | (0.00000) | (0.0002) |
| Constant | -0.075*** | -10.700*** |
|  | (0.003) | (0.361) |
| Observations | 10,000 | 10,000 |
| R2 | 0.123 | |
| Adjusted R2 | 0.122 | |
| Log Likelihood | | -798.000 |
| Akaike Inf. Crit. | | 1,600.000 |
| Residual Std. Error | 0.168 (df = 9998) | |
| F Statistic | 1,397.000*** (df = 1; 9998) | |
| Note: | | *p<0.1; **p<0.05; ***p<0.01 |
We can plot the predicted probabilities based on the predicted values from the logistic regression model.
Default3 <- Default[, c(1,3)]
Default3$default <- as.numeric(Default3$default)-1
Default3$predict_logit <- -10.7 + 0.005*Default3$balance   # linear predictor using the rounded logit coefficients
Default3$p <- exp(Default3$predict_logit)/(1 + exp(Default3$predict_logit))   # predicted probability
Default3 <- Default3[, c(1, 2, 4)]
df <-melt(Default3, id.var="balance", variable.name = "response", value.name = "V")
g1 <-ggplot(df, aes(x=balance, y=V, col=response)) +
geom_point() +
labs(y="Y", x="Balance")
g1 + scale_y_continuous(breaks=c(0,1)) +
scale_colour_manual(values=c("navyblue", "firebrick"))
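Instead of hardcoding the rounded coefficients, we could obtain the probabilities directly from the fitted model; a minimal sketch using the Logit object estimated above (p_hat is just an illustrative name):
p_hat <- predict(Logit, type = "response")      # fitted probabilities from glm
head(cbind(manual = Default3$p, glm = p_hat))   # the two versions are nearly identical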

Odds Ratio
Odds can be expressed as \(\frac{p(x)}{1-p(x)}\).
Remember that
\[
\frac{p(X)}{1-p(X)} = e^{\beta_{0}+X\beta_{1}}
\]
If we take the log of both sides,
\[
\begin{eqnarray}
\log\frac{p(X)}{1-p(X)} & = & \log(e^{\beta_{0}+X\beta_{1}}) \\
& = & \beta_{0}+X\beta_{1}
\end{eqnarray}
\]
Increasing \(X\) by one unit changes the log odds by \(\beta_{1}\). In this study, a one-unit increase in balance is associated with an increase in the log odds of default of 0.0055.
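Equivalently, exponentiating the coefficient gives the odds ratio; a minimal sketch using the Logit model estimated earlier:
exp(coef(Logit)["balance"])   # about 1.0055: each extra dollar of balance multiplies the odds of default by roughly 1.0055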
Assignments
- Please calculate the average temperature (Temp) and Solar.R by Month in the airquality data. You may want to remove the NAs by adding na.rm=TRUE when computing the mean for variables that contain NAs.
- Please examine the effect of Expend on Grad.Rate in the College data from the ISLR package with a scatter plot, and estimate it in R.
- Please estimate the effect of Ag on Surv in the following dataset.
library(knitr)        # kable()
library(kableExtra)   # kable_styling()
DT <-data.frame(Surv=c(rep(1,11),rep(0,22)),
Ag=c(rep(1,9), rep(0,2),rep(1,8),rep(0,14)))
DT %>%
kable('html') %>%
kable_styling()
| Surv | Ag |
|---|---|
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 0 |
| 1 | 0 |
| 0 | 1 |
| 0 | 1 |
| 0 | 1 |
| 0 | 1 |
| 0 | 1 |
| 0 | 1 |
| 0 | 1 |
| 0 | 1 |
| 0 | 0 |
| 0 | 0 |
| 0 | 0 |
| 0 | 0 |
| 0 | 0 |
| 0 | 0 |
| 0 | 0 |
| 0 | 0 |
| 0 | 0 |
| 0 | 0 |
| 0 | 0 |
| 0 | 0 |
| 0 | 0 |
| 0 | 0 |