1 Goal

  • This course starts from descriptive analysis and visualization, and then shows how to run ordinary least squares (OLS) regression and logistic regression. For example, we can visualize the relationship between two variables like this:
  • library(ggplot2)
    ggplot(mtcars, aes(x=wt, y=mpg)) +
        geom_point() +
        geom_smooth(method='lm', se=T)


    2 Descriptive Statistics

  • We can describe our data in terms of central tendency and dispersion.
  • 2.1 Mean and Standard Deviation

  • Mean and standard deviation represent the central tendency and dispersion of a random variable, respectively.
  • We can use stargazer to report the mean, standard deviation, minimum, and maximum values.
  • library(stargazer)
    dss <- as.data.frame(diamonds)
    dss <- dss[, c(1,5,7)]
    stargazer(dss, type="html")
    Statistic   N       Mean       St. Dev.   Min      Max
    carat       53,940  0.798      0.474      0.200    5.010
    depth       53,940  61.700     1.430      43.000   79.000
    price       53,940  3,933.000  3,989.000  326      18,823
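  • When knitting to HTML is not needed, stargazer can print the same table as plain text in the console:
  • stargazer(dss, type="text")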

    2.2 apply()

  • apply() applies a function over the rows or columns of a matrix or data frame (the second argument, MARGIN, selects which), so we can report exactly the statistics we specify.
  • options(digits=4)
    dss <- as.data.frame(diamonds)
    dss <- dss[, c(1,5,7)]
    Statistics<-c("N", "Mean", "S.D.", "Min", "Pct25","Pct75", "Max")
    A<-apply(dss,2,function(x) c(length(x), mean(x), sd(x),
                               min(x), quantile(x,c(0.25, 0.75)),max(x)))
    rownames(A)<-Statistics
    A
    ##           carat     depth price
    ## N     5.394e+04 53940.000 53940
    ## Mean  7.979e-01    61.749  3933
    ## S.D.  4.740e-01     1.433  3989
    ## Min   2.000e-01    43.000   326
    ## Pct25 4.000e-01    61.000   950
    ## Pct75 1.040e+00    62.500  5324
    ## Max   5.010e+00    79.000 18823
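  • The mixed scientific notation above comes from printing columns with very different scales; rounding the matrix usually tidies the display:
  • round(A, 2)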
  • What if we want to describe data by groups? tapply can return a statistic by group, but only one statistic at a time (a workaround is sketched after the output below).
  • table(sleep$group)
    ## 
    ##  1  2 
    ## 10 10
    tapply(sleep$extra, sleep$group, mean)
    ##    1    2 
    ## 0.75 2.33
    tapply(sleep$extra, sleep$group, median)
    ##    1    2 
    ## 0.35 1.75
    tapply(sleep$extra, sleep$group, sd)
    ##     1     2 
    ## 1.789 2.002
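  • One workaround is to pass tapply a function that returns several named statistics at once; this returns a list of vectors, one per group:
  • tapply(sleep$extra, sleep$group, function(x) c(n=length(x), mean=mean(x), sd=sd(x)))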

    2.3 summarize

  • Base R provides summary statistics by running summary() on a single variable. dplyr can also summarize a variable, with more options and flexibility.
  • summary(mtcars$mpg)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##    10.4    15.4    19.2    20.1    22.8    33.9
    library(dplyr)
    summarize(mtcars, n(), mean(mpg), mean(hp), sd(mpg), 
              quantile(mpg, c(0.25)), quantile(mpg, c(0.50)), quantile(mpg, c(0.75)))
    ##   n() mean(mpg) mean(hp) sd(mpg) quantile(mpg, c(0.25))
    ## 1  32     20.09    146.7   6.027                  15.43
    ##   quantile(mpg, c(0.5)) quantile(mpg, c(0.75))
    ## 1                  19.2                   22.8
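  • With dplyr 1.0 or later, across() expresses the same idea more compactly and labels the output columns; a sketch:
  • mtcars %>%
      summarize(n = n(),
                across(c(mpg, hp), list(mean = mean, sd = sd)),
                q25 = quantile(mpg, 0.25))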

    2.3.1 group_by

  • dplyr also allows us to summarize a variable by group using group_by().
  • mtcars %>% group_by(cyl)  %>% summarize(mean(mpg))
    ## # A tibble: 3 x 2
    ##     cyl `mean(mpg)`
    ##   <dbl>       <dbl>
    ## 1     4        26.7
    ## 2     6        19.7
    ## 3     8        15.1

    2.4 Scatter Plot

  • It is necessary to check a scatter plot when we look at the relationship between two variables, whether they are continuous or discrete.
  • For example, we can use ggplot2 to show how the two variables correspond to each other. Moreover, the regression line and its standard error band can be added to the graph.
  • ggplot(anscombe, aes(x=x1, y=y1)) +
      geom_point(size=3)  +
      geom_smooth(method='lm', se=T)
  • The standard error band reflects the accuracy of the estimated relationship. However, a scatter plot alone cannot tell us how well the model fits, so we estimate an OLS model as follows.
  • fit1 <- lm(data=anscombe, y1 ~ x1)
    stargazer(fit1, type='html')
                        Dependent variable: y1
    x1                  0.500*** (0.118)
    Constant            3.000** (1.120)
    Observations        11
    R2                  0.667
    Adjusted R2         0.629
    Residual Std. Error 1.240 (df = 9)
    F Statistic         18.000*** (df = 1; 9)
    Note: *p<0.1; **p<0.05; ***p<0.01
  • If there is one more variable of interest, we can also include it in the scatter plot.
  • ggplot(Orange, aes(x=age, y=circumference, group=Tree)) +
       geom_point(aes(col=Tree), size=3) +
       geom_smooth(method="loess", se=F) +
       theme_bw()

    2.5 Marginal Effect

  • When we have more than one independent variable and an interaction term between two of them, we need to estimate the marginal effect implied by the interaction. Yiqing Xu (UCSD) provides an R package, interflex, to visualize the linear and non-linear marginal effects of a treatment.
  • For example, we assume that the effect of mother’s IQ on children’s scores is conditional on whether or not the mother holds a high school degree. We can specify the model as follows.
  • \[ Y_{i}=\beta_{0}+\beta_{1}X_{i}+\beta_{2}D_{i}+\beta_{3}X_{i}D_{i} \]
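  • Taking the partial derivative with respect to \(X\) shows why the marginal effect depends on \(D\): \[ \frac{\partial Y}{\partial X} = \beta_{1}+\beta_{3}D \] so the slope of mother’s IQ is \(\beta_{1}\) when \(D=0\) and \(\beta_{1}+\beta_{3}\) when \(D=1\).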

    library(foreign)
    library(interflex)
    kid <- read.dta("kidiq.dta")
    inter.raw(Y="kid_score", X="mom_iq", D="mom_hs",
              Xlabel="Mom's IQ", Ylabel="Kid's Score",
              Dlabel="Mom's High School", data=kid)
    ## $graph
  • The first panel shows a steeper slope than the second, which implies that mother’s IQ has a stronger effect on scores among children whose mothers do not hold a high school degree.
  • We can estimate the model with OLS.
  • M1<-lm(kid_score ~ mom_iq+mom_hs+mom_iq:mom_hs, data=kid)
    stargazer(M1, type='html')
                        Dependent variable: kid_score
    mom_iq              0.969*** (0.148)
    mom_hs              51.300*** (15.300)
    mom_iq:mom_hs       -0.484*** (0.162)
    Constant            -11.500 (13.800)
    Observations        434
    R2                  0.230
    Adjusted R2         0.225
    Residual Std. Error 18.000 (df = 430)
    F Statistic         42.800*** (df = 3; 430)
    Note: *p<0.1; **p<0.05; ***p<0.01
  • Again, the negative coefficient on the interaction term shows that a mother’s high school degree weakens the marginal effect of her IQ: the slope is 0.969 for mothers without a degree versus 0.969 − 0.484 = 0.485 for those with one.

    3 Logarithmic Transformations of Variables

  • Variable transformation can “straighten” the relationship between two variables. If the relationship is monotone and simple (always positive or always negative), we can use OLS as the functional form. Benoit discussed the log transformation of both dependent and independent variables in OLS regression.
  • library(car)
    ggplot(data=Leinhardt, aes(x=income, y=infant)) +
        geom_point() +
        geom_smooth(method="loess", col="black")

  • We can take the logarithm of both variables because most observations are located in the lower-left corner; both variables are right-skewed.
  • par(mfrow=c(1,2))
    hist(Leinhardt$infant, 100, probability = T, col="gray90", main="infant", xlab='')
    hist(Leinhardt$income, 100, probability = T, col="gray90", main="income", xlab='')
  • We can plot the logarithms of both variables; their distributions now look much closer to normal.
  • par(mfrow=c(1,2))
    Leinhardt$l.infant = log(Leinhardt$infant)
    Leinhardt$l.income = log(Leinhardt$income)
    hist(Leinhardt$l.infant, 100, probability = T, col="gray90", main="Log infant", xlab='')
    hist(Leinhardt$l.income, 100, probability = T, col="gray90", main="Log income", xlab='')

  • The scatter plot of the two new variables is different from the previous one. We can estimate the log-log model:
  • \[ \textrm{log}Y_{i}=\beta_{0}+\beta_{1}\textrm{log}X_{i}+\epsilon_{i} \]
  • If we exponentiate both sides and add 1 to \(\textrm{log}X\), we obtain the following result.
  • \[ \begin{eqnarray} Y &=& e^{\beta_{0}+\beta_{1}(\textrm{log}X+1)} \\ &=& e^{\beta_{0}}\,e^{\beta_{1}\textrm{log}X}\,e^{\beta_{1}} \end{eqnarray} \]
  • Because \(\textrm{log}X+1 = \textrm{log}X+\textrm{log}(e)=\textrm{log}(eX)\), where \(e=\sum_{n=0}^\infty\frac{1}{n!}\approx 2.72\), adding 1 to \(\textrm{log}X\) means \(X\) increases by \(100\times (2.72-1)= 172\%\).
  • So when \(X\) increases by \(172\%\) (that is, \(\textrm{log}X\) increases by one unit), \(Y\) is multiplied by \(e^{\beta_{1}}\).
  • A \(1\%\) increase in \(X\) multiplies \(Y\) by \(e^{\beta_{1}\times \textrm{log}(1.01)}\approx 1+0.01\beta_{1}\); in practice, a \(1\%\) change in \(X\) changes \(Y\) by about \(\beta_{1}\) percent.
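  • A quick numerical check of this approximation, using the log-log slope estimated below (−0.512):
  • b1 <- -0.512
    exp(b1 * log(1.01))   # exact multiplier for a 1% increase in X: about 0.9949
    1 + 0.01 * b1         # linear approximation: about 0.9949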
  • We take the log of both the independent and the response variable.
  • ggplot(data=Leinhardt, aes(x=l.income, y=l.infant)) +
        geom_point() +
        geom_smooth(method="lm", col="black")

    fit<-lm(data=Leinhardt, l.infant ~ l.income)
    stargazer(fit, type='html')
                        Dependent variable: l.infant
    l.income            -0.512*** (0.051)
    Constant            7.146*** (0.317)
    Observations        101
    R2                  0.502
    Adjusted R2         0.497
    Residual Std. Error 0.687 (df = 99)
    F Statistic         99.845*** (df = 1; 99)
    Note: *p<0.1; **p<0.05; ***p<0.01
  • A \(1\%\) increase in income reduces the infant mortality rate per 1,000 live births by about \(0.5\%\).

    4 Logistic Regression

  • When \(Y\) is a dichotomous variable, taking only the values 0 and 1, we can consider transforming the binary response variable.
  • One of the reasons that OLS is not appropriate for a dichotomous outcome is that its predicted values can fall below 0 or above 1.
  • library(ISLR)
    Default2 <- Default[, c(1,3)]
    Default2$default <- as.numeric(Default2$default)-1   # recode factor "No"/"Yes" to 0/1
    fit1 <- lm(data=Default2, default ~ balance)         # linear probability model
    Default2$predict <- predict(fit1)
    library(reshape2)
    DT <-melt(Default2, id.var="balance", variable.name = "Y")
    ggplot(DT, aes(x=balance, y=value, col=Y)) + 
      geom_point() +
      labs(y="Y", x="Balance") +
      scale_y_continuous(breaks=c(0,1)) +
      scale_colour_manual(values=c("navyblue", "firebrick"))
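  • reshape2 is no longer actively developed; the same reshaping can also be done with tidyr, for example:
  • library(tidyr)
    DT <- pivot_longer(Default2, cols=c(default, predict),
                       names_to="Y", values_to="value")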
  • The logit model, or logistic regression model, uses the following function:
  • \[ \begin{eqnarray} \text{logit}(P(Y=1|X)) & = & \text{log} \frac{p(X)}{1-p(X)}\\ &=&\beta_{0}+\beta_{1}X \end{eqnarray} \] Because \(\text{exp}(\text{log}(a))=a\), \[ \frac{p(X)}{1-p(X)} = e^{\eta} \] or, \[ p(X)=\frac{e^{\eta}}{1+e^{\eta}} \] where \(\eta \equiv \sum_{k} \beta_{k}X_{ik}\).
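  • Base R implements both of these transformations, which gives a quick way to check the algebra:
  • eta <- 0.5
    plogis(eta)           # p = e^eta / (1 + e^eta) = 0.622
    qlogis(plogis(eta))   # log odds of p; recovers 0.5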
  • We can compare the different results of OLS and logistic regression models.
  • Logit <- glm(data=Default2, default ~ balance,  family=binomial(logit))
    stargazer(fit1, Logit, type="html")
                        Dependent variable: default
                            OLS            logistic
                            (1)            (2)
    balance             0.0001***          0.005***
                        (0.00000)          (0.0002)
    Constant            -0.075***          -10.700***
                        (0.003)            (0.361)
    Observations        10,000             10,000
    R2                  0.123
    Adjusted R2         0.122
    Log Likelihood                         -798.000
    Akaike Inf. Crit.                      1,600.000
    Residual Std. Error 0.168 (df = 9998)
    F Statistic         1,397.000*** (df = 1; 9998)
    Note: *p<0.1; **p<0.05; ***p<0.01
  • We can plot the predicted probabilities based on the predicted values from the logistic regression model.
  • Default3 <- Default[, c(1,3)]
    Default3$default <- as.numeric(Default3$default)-1
    Default3$predict_logit <- -10.7 + 0.005*Default3$balance   # rounded coefficients from the table above
    Default3$p <- exp(Default3$predict_logit)/(1 + exp(Default3$predict_logit))
    Default3 <- Default3[, c(1, 2, 4)]
    df <-melt(Default3, id.var="balance", variable.name = "response", value.name = "V")
    
    g1 <-ggplot(df, aes(x=balance, y=V, col=response)) + 
      geom_point() +
      labs(y="Y", x="Balance") 
    
    g1 +  scale_y_continuous(breaks=c(0,1)) +
      scale_colour_manual(values=c("navyblue", "firebrick"))
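  • Equivalently, and without hard-coding rounded coefficients, the fitted probabilities can be taken directly from the model object:
  • Default3$p <- predict(Logit, type="response")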

    4.1 Odds Ratio

  • Odds can be expressed as \(\frac{p(x)}{1-p(x)}\).
  • Remember that
  • \[ \frac{p(X)}{1-p(X)} = e^{\beta_{0}+\beta_{1}X} \] If we take the log of both sides,
    \[ \begin{eqnarray} \textrm{log}\frac{p(X)}{1-p(X)} & = & \textrm{log}\left(e^{\beta_{0}+\beta_{1}X}\right) \\ & = & \beta_{0}+\beta_{1}X \end{eqnarray} \]
  • Increasing \(X\) by one unit changes the log odds by \(\beta_{1}\). In this study, a one-unit increase in balance is associated with an increase in the log odds of default of about 0.0055.
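  • Exponentiating the coefficient turns the change in log odds into an odds ratio; using the Logit model fitted above:
  • exp(coef(Logit)["balance"])   # odds ratio for a one-dollar increase in balance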

    5 Assignments

      1. Please calculate the average temperature and Solar.R by month in airquality. You may want to remove the NAs by adding na.rm=T when computing on variables that contain NA.
      2. Please examine the effect of Expend on Grad.Rate in College of the ISLR package on a scatter plot and estimate it with R.
      3. Please estimate the effect of Ag on Surv in the following dataset.
    library(kableExtra)
    DT <-data.frame(Surv=c(rep(1,11),rep(0,22)),
                    Ag=c(rep(1,9), rep(0,2),rep(1,8),rep(0,14)))
    DT %>% 
      kable('html') %>% 
      kable_styling()
    Surv Ag
    1 1
    1 1
    1 1
    1 1
    1 1
    1 1
    1 1
    1 1
    1 1
    1 0
    1 0
    0 1
    0 1
    0 1
    0 1
    0 1
    0 1
    0 1
    0 1
    0 0
    0 0
    0 0
    0 0
    0 0
    0 0
    0 0
    0 0
    0 0
    0 0
    0 0
    0 0
    0 0
    0 0