Regression modelling



Linear model

In classic linear (least squares) regression the usual way of representing the data \(y_i\), for \(i=1, 2, \ldots, n\), as a function of the \(k\) covariates \(x_{i1}, \ldots, x_{ik}\) is: \[\begin{equation} y_i=\beta _0+\beta _1x_{i1}+\ldots+\beta _{k}x_{ik}+\epsilon _i \quad i=1, \ldots, n.\end{equation}\]

In matrix form,

\[\begin{equation} \boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}, \end{equation}\]

where

\[ \boldsymbol{y}=\left(\begin{array}{c} y_1\\ y_2\\ \vdots\\ y_n \end{array}\right), \boldsymbol{X}=\left( \begin{array}{cccc} 1 & x_{11} & \ldots & x_{1k} \\ 1 & x_{21} & \ldots & x_{2k} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n1} & \ldots & x_{nk} \end{array}\right), \boldsymbol{\beta}=\left(\begin{array}{c} \beta_0\\ \beta_1\\ \vdots\\ \beta_k \end{array}\right), \boldsymbol{\epsilon}=\left(\begin{array}{c} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n \end{array}\right) \]

and \(\beta_i\) are the unknown parameters that need to be estimated and \(\boldsymbol{\epsilon}\) is the random part of the model. One of the assumptions of classic regression is the independence of the errors with each other and with the covariates. In addition, the errors have zero mean and constant variance (homoscedasticity). Applying the zero mean assumption of the errors we have that, \[\begin{equation}\mathbb{E}(y_i|\boldsymbol{x}_i)=\beta _0+\beta _1x_{i1}+\ldots+\beta _{k}x_{ik}\end{equation}\]

where \(\boldsymbol{x}_i=(1, x_{i1},x_{i2}, \ldots, x_{ik})^\intercal\). Least squares regression describes the behaviour of the location of the conditional distribution using the mean of the distribution to represent its central tendency.

The residuals \(\hat{\epsilon_i}\) are defined as the differences between the observed and the estimated values. Minimising the sum of the squared residuals \[\begin{equation}\sum_{i=1}^{n} r(y_i-\boldsymbol{x}_i^\intercal\boldsymbol{\hat{\beta}})=\sum_{i=1}^{n}(y_i-\boldsymbol{x}_i^\intercal\boldsymbol{\hat{\beta}})^2\end{equation}\] where \(r(u)=u^2\) is the quadratic loss function, gives the least squares estimator \(\boldsymbol{\hat{\beta}}\) by \[\begin{equation}\boldsymbol{\hat{\beta}}=(\boldsymbol{X}^\intercal \boldsymbol{X})^{-1}\boldsymbol{X}^\intercal \boldsymbol{y}.\end{equation}\] The additional assumption that the errors \(\boldsymbol{\epsilon}\) follow a Gaussian distribution, \[\begin{equation}\boldsymbol{\epsilon}\sim N(\boldsymbol{0},\sigma^2 \boldsymbol{I}_n)\end{equation}\] where \(\boldsymbol{I}_n\) is the \(n\times n\) identity matrix, provides a framework for testing the significance of the coefficients found. Under this assumption the least-squares estimator is also the maximum-likelihood estimator. Taking expectations, with respect to \(\boldsymbol{\epsilon}\), and noting that a linear function of a normally distributed random variable is normally distributed itself we can rewrite the model as \[\begin{equation} \boldsymbol{y} \sim N(\boldsymbol{\mu},\sigma^2\boldsymbol{I}_n), \ \text{where} \ \boldsymbol{\mu}=\boldsymbol{X}\boldsymbol{\beta}.\label{newmodel} \end{equation}\]

Thus, we model the relationship between the mean of \(y_i\), for \(i=1, 2, \ldots, n\), and the covariates linearly.

Generalised linear models



Elements of generalised linear models

On the page Regression modelling we focused on data \(\boldsymbol{y}\) which, conditioned on the covariates, are normally distributed . The aforementioned procedure can be generalised to any distribution belonging to the exponential family. These models are known as generalised linear models (GLM) and consist of three elements:

  1. A probability distribution that belongs to the exponential family of distributions (random component).

  2. A linear predictor \(\eta_i\) (systematic component) such that \(\eta_i=\beta _0+\beta _1x_{i1}+\ldots+\beta _{k}x_{ik}=\boldsymbol{x}_i^\intercal\boldsymbol{\beta}.\)

  3. A link function \(g\) such that \(\mathbb{E}[Y_i]=\mu_i=g^{-1}(\eta_i).\)

Note: It is called the link function because it links the linear predictor \(\eta\) to the mean of the distribution \(\mu\).

Generalised linear models

A GLM can be used for data that are not normally distributed and for situations where the relationship between the mean of the response variable and the covariates is not linear. The GLM includes many important distributions such as the Gaussian, Poisson, Bernoulli, gamma and inverse Gaussian distributions. The link and mean functions of some common distributions can be seen in the table below.

Distribution Probability density (or mass) function Linear predictor Link function
Normal distribution \(f(y;\mu, \sigma^2)=\frac{1}{\sigma \sqrt{2\pi}}\exp{\left\{- \frac{(y-\mu)^2}{2\sigma^2} \right\}}\) \(\boldsymbol{x}_i^\intercal\boldsymbol{\beta}=\mu_i\) \(\mu_i=\boldsymbol{x}_i^\intercal\boldsymbol{\beta}\)
Exponential distribution \(f(y;\lambda)=\lambda \exp{\{-\lambda y\}}\) \(\boldsymbol{x}_i^\intercal\boldsymbol{\beta}=\mu_i^{-1}\) \(\mu_i=(\boldsymbol{x}_i^\intercal\boldsymbol{\beta})^{-1}\)
Gamma distribution \(f(y;a, b)=\frac{b^a}{\Gamma (a)} y^{a-1}\exp{\{-by \}}\) \(\boldsymbol{x}_i^\intercal\boldsymbol{\beta}=\mu_i^{-1}\) \(\mu_i=(\boldsymbol{x}_i^\intercal\boldsymbol{\beta})^{-1}\)
Poisson distribution \(P(Y=y;\mu)=\exp{\{-\mu\}}\frac{\mu^y}{y!}\) \(\boldsymbol{x}_i^\intercal\boldsymbol{\beta}=\log{\{\mu_i\}}\) \(\mu_i=\exp{\{\boldsymbol{x}_i^\intercal\boldsymbol{\beta}\}}\)
Inverse Gaussian distribution \(f(y;\lambda, \mu)=\sqrt{\frac{\lambda}{2\pi y^3}}\exp{\left\{- \frac{\lambda (y-\mu)^2}{2\mu^2 y} \right\}}\) \(\boldsymbol{x}_i^\intercal\boldsymbol{\beta}=\mu_i^{-2}\) \(\mu_i=(\boldsymbol{x}_i^\intercal\boldsymbol{\beta})^{-\frac{1}{2}}\)
Bernoulli distribution \(P(0)=1-\mu, P(1)= \mu\) \(\boldsymbol{x}_i^\intercal\boldsymbol{\beta}= \log{\left\{\frac{\mu_i}{1-\mu_i}\right\}}\) \(\mu_i=\frac{ \exp{\{ \boldsymbol{x}_i^\intercal \boldsymbol{\beta} \} } }{ 1+\exp{\{ \boldsymbol{x}_i^\intercal \boldsymbol{\beta} \}} }\)

Generalised linear models are really flexible due to the fact that:

  • there is no need to transform the data into having a Gaussian distribution since GLM models allow us to model non-Gaussian data,
  • the choice of the link function is separate from the choice of the distribution. Thus, we can choose different link functions for the same distribution, and
  • they incorporate a lot of the common distributions for modelling the response variable.

Like all other models, a GLM has limitations. This is due to the fact that:

  • the models assume that the values of the response variable are independent, and
  • that the systematic component is a linear function of the separate covariates (i.e. the predictor cannot be non-linear).

Further reading



The pages Regression modelling and Generalised linear models provided a short summary of regression modelling and generalised linear models. Unfortunately, you will have to delve deeper into GLM’s by yourselves before you come to class. Fortunately though, you have a plethora of different options.

Reference

  1. You probably know that you can find almost anything online. Some things to look out for are:
    • course notes on GLM’s (including your notes from last year’s course).
    • videos on GLM’s.
    • posts explaining GLM’s (even better if they include R code)
    • Brian Caffo has some great videos on his YouTube page. You can have a look at the Linear Models V2 playlist.

Median house prices in Boston



Location of data set

We will first focus on the Boston data set which is located in the MASS library. You should install the package with install.packages("MASS"), load it with library(MASS), and finally load the dataset with data(Boston). The dataset records information regarding \(506\) neighbourhoods around Boston in \(1978\) and the median house prices of each neighbourhood, amongst other variables. A selection of variables, for the first \(10\) neighbourhoods, from the Boston data set can be seen in the table below. The median house prices are shown in thousands of dollars.

Note: The original names of these variables are medv, crim, lstat, rm, rad, and chas.

median.house.value crime.rate perc.low.socio.status aver.number.rooms index.access.radial.highways river.bounds
24.0 0.00632 4.98 6.575 1 0
21.6 0.02731 9.14 6.421 2 0
34.7 0.02729 4.03 7.185 2 0
33.4 0.03237 2.94 6.998 3 0
36.2 0.06905 5.33 7.147 3 0
28.7 0.02985 5.21 6.430 3 0
22.9 0.08829 12.43 6.012 5 0
27.1 0.14455 19.15 6.172 5 0
16.5 0.21124 29.93 5.631 5 0
18.9 0.17004 17.10 6.004 5 0

Exploratory data analysis

We can find out more information about the structure of the data set by using the command str(Boston), and produce result summaries of each variable, using the command summary(Boston).

str(Boston)
'data.frame':   506 obs. of  6 variables:
 $ median.house.value          : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
 $ crime.rate                  : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ perc.low.socio.status       : num  4.98 9.14 4.03 2.94 5.33 ...
 $ aver.number.rooms           : num  6.58 6.42 7.18 7 7.15 ...
 $ index.access.radial.highways: int  1 2 2 3 3 3 5 5 5 5 ...
 $ river.bounds                : int  0 0 0 0 0 0 0 0 0 0 ...
summary(Boston)
 median.house.value   crime.rate       perc.low.socio.status aver.number.rooms
 Min.   : 5.00      Min.   : 0.00632   Min.   : 1.73         Min.   :3.561    
 1st Qu.:17.02      1st Qu.: 0.08204   1st Qu.: 6.95         1st Qu.:5.886    
 Median :21.20      Median : 0.25651   Median :11.36         Median :6.208    
 Mean   :22.53      Mean   : 3.61352   Mean   :12.65         Mean   :6.285    
 3rd Qu.:25.00      3rd Qu.: 3.67708   3rd Qu.:16.95         3rd Qu.:6.623    
 Max.   :50.00      Max.   :88.97620   Max.   :37.97         Max.   :8.780    
 index.access.radial.highways  river.bounds    
 Min.   : 1.000               Min.   :0.00000  
 1st Qu.: 4.000               1st Qu.:0.00000  
 Median : 5.000               Median :0.00000  
 Mean   : 9.549               Mean   :0.06917  
 3rd Qu.:24.000               3rd Qu.:0.00000  
 Max.   :24.000               Max.   :1.00000  

We can also use the as.tbl function, located in the dplyr package.

Note: This is a good time to pause the video, check the code below, and think about the benefits of using this function.

Boston <- as.tbl(Boston) 
Boston
# A tibble: 506 x 6
   median.house.value crime.rate perc.low.socio.s… aver.number.roo… index.access.radi… river.bounds
 *              <dbl>      <dbl>             <dbl>            <dbl>              <int>        <int>
 1               24      0.00632              4.98             6.58                  1            0
 2               21.6    0.0273               9.14             6.42                  2            0
 3               34.7    0.0273               4.03             7.18                  2            0
 4               33.4    0.0324               2.94             7.00                  3            0
 5               36.2    0.0690               5.33             7.15                  3            0
 6               28.7    0.0298               5.21             6.43                  3            0
 7               22.9    0.0883              12.4              6.01                  5            0
 8               27.1    0.145               19.2              6.17                  5            0
 9               16.5    0.211               29.9              5.63                  5            0
10               18.9    0.170               17.1              6.00                  5            0
# ... with 496 more rows

Basic plots

Let’s start with some basic plots
plot(Boston) # A bit ugly, but gives us some info

ggplot(melt(Boston), aes(variable, value)) + geom_boxplot() # Alternative to boxplot(Boston)

ggplot(Boston)+geom_histogram(aes(x=median.house.value),binwidth =1) #try it out without the binwidth argument

ggplot(Boston)+geom_histogram(aes(x=median.house.value,fill=factor(index.access.radial.highways)),binwidth =1) # A bit messy

ggplot(Boston,aes(x=median.house.value,y=crime.rate))+geom_point()+facet_wrap(~index.access.radial.highways) #Less messy

#ggplot(Boston,aes(x=median.house.value,y=crime.rate))+geom_point()+facet_grid(index.access.radial.highways~river.bounds) #Feel free to try this 

Interactive graph (pairwise comparison)

Interactive graph (median house value vs percent. of low socioec. status)

1st group project



Fiction

  1. Imagine you just started working as a statistician in the US government and you are asked by your boss, let’s call him Gary, to make sense out of the Boston data set. His actual quote was “Can you come up with a good model regarding the per capita crime rate for each neighbourhood in Boston? The President’s best idea is to build a wall around the neighbourhoods with the highest crime rates.”
    • Let’s forget about the President’s comment for the time being. How would you start working on this project?
    • Would you have any data-related questions for Gary? (By the way, Gary is not a statistician. He wouldn’t have hired you if he was.)
    • Let’s go back to the President’s comment for a second, and make it even more outrageous by imagining there are walls around each neighbourhood. What does this imply about the data? What does it imply about our modelling approaches so far?
    • If you had the chance to collect the data yourself, what additional information would you try to estimate for each neighbourhood in order for the regression model you use later on to be more realistic?

There is a boston data set within the spdep package. In this one, the main difference is the addition of the coordinates of each neighbourhood.

Non-fiction

  1. Do you notice anything strange about the output of the following command tail(table(Boston$medv))? Any possible explanations of why this is happening?

  46 46.7 48.3 48.5 48.8   50 
   1    1    1    1    1   16 
  1. I believe that the following model (glm(medv~tax+rm,data=Boston)) makes no sense at all. If you had to use the same covariates and dependent variable in the model, what can you do to make it more realistic/sensible?

    Hint: Think about how you would interpret the coefficients in the following model.


Call:
glm(formula = medv ~ tax + rm, data = Boston)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-16.495   -3.123   -0.548    2.384   42.057  

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -21.233093   2.834371  -7.491 3.09e-13 ***
tax          -0.015837   0.001686  -9.391  < 2e-16 ***
rm            7.992681   0.404534  19.758  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 37.31825)

    Null deviance: 42716  on 505  degrees of freedom
Residual deviance: 18771  on 503  degrees of freedom
AIC: 3272.4

Number of Fisher Scoring iterations: 2
  1. Can you find variables in the Boston dataset that you feel are qualified to be used as dependent variables for a Poisson regression model? What about a logistic regression model?
  1. Can you think of any data sets where using a GLM might not be appropriate?

Graphic novels (not included in the project but you can try them at your own time)

  1. Plotly is a package in R that allows for interactive graphs. Try to create an interactive boxplot using variables from the Boston data set (or any other data set).

  2. The Plotly package can be used alongside the ggplot2 package to create interactive, online ggplot2 graphs. For more information on how this is done click here. Have a look at the ggpairs and ggplotly functions and try and create an interactive graph for the Boston data set using the aforementioned functions. If you want to take it one step further, you can choose a data set which contains a factor variable (e.g. race) and use the argument mapping=aes(color=race) inside the ggpairs function.

Number of publications by PhD students



Location of data set

We will now have a look on the Articles data set which is located in the Rchoice library. The dataset records information regarding the number of publications of \(915\) PhD biochemistry students during the \(1950\)’s and \(1960\)’s. The first \(7\) observations can be seen in the table below. The Article data set is already sorted with respect to the number of articles published, hence the \(0\)’s at the first column.

# A tibble: 915 x 6
     art   fem   mar  kid5   phd  ment
 * <int> <int> <int> <int> <dbl> <int>
 1     0     0     1     0  2.52     7
 2     0     1     0     0  2.05     6
 3     0     1     0     0  3.75     6
 4     0     0     1     1  1.18     3
 5     0     1     0     0  3.75    26
 6     0     1     1     2  3.59     2
 7     0     1     0     0  3.19     3
 8     0     0     1     2  2.96     4
 9     0     0     0     0  4.62     6
10     0     1     1     0  1.25     0
# ... with 905 more rows

Note: The original names of these variables are art, fem, mar, kid5, phd, and ment. Two variables (i.e. fem, mar) have been renamed and transformed to factors.

articles gender marital.status kids phd.prestige mentor
0 male married 0 2.52 7
0 female single 0 2.05 6
0 female single 0 3.75 6
0 male married 1 1.18 3
0 female single 0 3.75 26
0 female married 2 3.59 2
0 female single 0 3.19 3

Exploratory data analysis

We can find out more information about the structure of the data set by using the command str(Articles), and produce result summaries of each variable, using the command summary(Articles).

str(Articles)
Classes 'tbl_df', 'tbl' and 'data.frame':   915 obs. of  6 variables:
 $ articles      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ gender        : Factor w/ 2 levels "male","female": 1 2 2 1 2 2 2 1 1 2 ...
 $ marital.status: Factor w/ 2 levels "single","married": 2 1 1 2 1 2 1 2 1 2 ...
 $ kids          : int  0 0 0 1 0 2 0 2 0 0 ...
 $ phd.prestige  : num  2.52 2.05 3.75 1.18 3.75 ...
 $ mentor        : int  7 6 6 3 26 2 3 4 6 0 ...
 - attr(*, "datalabel")= chr "Academic Biochemists / S Long"
 - attr(*, "time.stamp")= chr "30 Jan 2001 10:49"
 - attr(*, "formats")= chr  "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
 - attr(*, "types")= int  98 98 98 98 102 98
 - attr(*, "val.labels")= chr  "" "sexlbl" "marlbl" "" ...
 - attr(*, "var.labels")= chr  "Articles in last 3 yrs of PhD" "Gender: 1=female 0=male" "Married: 1=yes 0=no" "Number of children < 6" ...
 - attr(*, "expansion.fields")=List of 1
  ..$ : chr  "_dta" " 8 data, cleaned 11/20/00, revise 11/20/00, revised 1/15/01, 1/30/01.\"'!=\"\"" "d 1/15/01, 1/30/01."
 - attr(*, "version")= int 6
 - attr(*, "label.table")=List of 2
  ..$ marlbl: Named int  0 1
  .. ..- attr(*, "names")= chr  "Single" "Married"
  ..$ sexlbl: Named int  0 1
  .. ..- attr(*, "names")= chr  "Men" "Women"
summary(Articles)
    articles         gender    marital.status      kids         phd.prestige       mentor      
 Min.   : 0.000   male  :494   single :309    Min.   :0.0000   Min.   :0.755   Min.   : 0.000  
 1st Qu.: 0.000   female:421   married:606    1st Qu.:0.0000   1st Qu.:2.260   1st Qu.: 3.000  
 Median : 1.000                               Median :0.0000   Median :3.150   Median : 6.000  
 Mean   : 1.693                               Mean   :0.4951   Mean   :3.103   Mean   : 8.767  
 3rd Qu.: 2.000                               3rd Qu.:1.0000   3rd Qu.:3.920   3rd Qu.:12.000  
 Max.   :19.000                               Max.   :3.0000   Max.   :4.620   Max.   :77.000  

The mean and the variance of the number of articles is 1.69 and 3.71 respectively.

We could also be interested in reporting summary statistics by gender and marital status. In this case, we can use the function describeBy in the psych package.

library(psych)
describeBy(Articles,list(Articles$gender,Articles$marital.status))

 Descriptive statistics by group 
: male
: single
                vars   n  mean   sd median trimmed  mad  min   max range  skew kurtosis   se
articles           1 113  1.95 2.01   1.00    1.62 1.48 0.00  7.00   7.0  1.13     0.45 0.19
gender*            2 113  1.00 0.00   1.00    1.00 0.00 1.00  1.00   0.0   NaN      NaN 0.00
marital.status*    3 113  1.00 0.00   1.00    1.00 0.00 1.00  1.00   0.0   NaN      NaN 0.00
kids               4 113  0.00 0.00   0.00    0.00 0.00 0.00  0.00   0.0   NaN      NaN 0.00
phd.prestige       5 113  3.50 0.89   3.59    3.55 1.04 1.52  4.62   3.1 -0.45    -1.13 0.08
mentor             6 113 10.21 9.81   8.00    8.57 5.93 0.00 55.00  55.0  2.03     5.08 0.92
--------------------------------------------------------------------------- 
: female
: single
                vars   n mean   sd median trimmed  mad  min   max range  skew kurtosis   se
articles           1 196 1.39 1.51   1.00    1.14 1.48 0.00  7.00  7.00  1.31     1.46 0.11
gender*            2 196 2.00 0.00   2.00    2.00 0.00 2.00  2.00  0.00   NaN      NaN 0.00
marital.status*    3 196 1.00 0.00   1.00    1.00 0.00 1.00  1.00  0.00   NaN      NaN 0.00
kids               4 196 0.00 0.00   0.00    0.00 0.00 0.00  0.00  0.00   NaN      NaN 0.00
phd.prestige       5 196 3.06 0.97   3.19    3.08 1.08 0.75  4.62  3.86 -0.18    -0.85 0.07
mentor             6 196 7.86 7.56   6.00    6.66 5.93 0.00 39.00 39.00  1.73     3.46 0.54
--------------------------------------------------------------------------- 
: male
: married
                vars   n mean    sd median trimmed  mad  min   max range skew kurtosis   se
articles           1 381 1.86  2.23   1.00    1.50 1.48 0.00 19.00  19.0 2.99    14.94 0.11
gender*            2 381 1.00  0.00   1.00    1.00 0.00 1.00  1.00   0.0  NaN      NaN 0.00
marital.status*    3 381 2.00  0.00   2.00    2.00 0.00 2.00  2.00   0.0  NaN      NaN 0.00
kids               4 381 0.91  0.86   1.00    0.84 1.48 0.00  3.00   3.0 0.53    -0.65 0.04
phd.prestige       5 381 3.02  1.00   2.96    3.02 1.28 0.92  4.62   3.7 0.07    -1.25 0.05
mentor             6 381 9.33 10.75   6.00    7.34 5.93 0.00 77.00  77.0 2.49     8.43 0.55
--------------------------------------------------------------------------- 
: female
: married
                vars   n mean   sd median trimmed  mad  min   max range  skew kurtosis   se
articles           1 225 1.54 1.59   1.00    1.32 1.48 0.00 10.00  10.0  1.40     3.08 0.11
gender*            2 225 2.00 0.00   2.00    2.00 0.00 2.00  2.00   0.0   NaN      NaN 0.00
marital.status*    3 225 2.00 0.00   2.00    2.00 0.00 2.00  2.00   0.0   NaN      NaN 0.00
kids               4 225 0.47 0.69   0.00    0.33 0.00 0.00  3.00   3.0  1.23     0.44 0.05
phd.prestige       5 225 3.09 0.97   3.21    3.12 0.98 0.92  4.62   3.7 -0.21    -0.88 0.06
mentor             6 225 7.88 8.37   5.00    6.47 4.45 0.00 49.00  49.0  2.26     6.77 0.56

Basic plots

We can first just plot the frequency of the number of articles published by PhD students.

sjp.frq(Articles$articles,title="Number of articles")

We can also plot the contingency table of the number of articles vs the gender of the PhD student and the number of articles vs the number of kids.

Regression analysis

The most common model for count data is the Poisson regression model. So we can start with that model and build up later on (or stay with it if it’s an appropriate model). We can run the model using the glm function with the argument family=poisson.

model.pois <- glm(articles~gender+marital.status+kids+phd.prestige+mentor, family=poisson, data=Articles)
summary(model.pois)

Call:
glm(formula = articles ~ gender + marital.status + kids + phd.prestige + 
    mentor, family = poisson, data = Articles)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.5672  -1.5398  -0.3660   0.5722   5.4467  

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)            0.304617   0.102981   2.958   0.0031 ** 
genderfemale          -0.224594   0.054613  -4.112 3.92e-05 ***
marital.statusmarried  0.155243   0.061374   2.529   0.0114 *  
kids                  -0.184883   0.040127  -4.607 4.08e-06 ***
phd.prestige           0.012823   0.026397   0.486   0.6271    
mentor                 0.025543   0.002006  12.733  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1817.4  on 914  degrees of freedom
Residual deviance: 1634.4  on 909  degrees of freedom
AIC: 3314.1

Number of Fisher Scoring iterations: 5

We can also check the goodness-of-fit of the model with a chi-square test based on the residual deviance and degrees of freedom, where the null hypothesis is that the Poisson model is correctly specified.

pchisq(model.pois$deviance, df=model.pois$df.residual, lower.tail=FALSE) # Strong evidence that our model fits badly.
[1] 4.525613e-44

We can now fit a negative binomial regression model. This model unlike the Poisson model, can handle overdispersion (i.e. the variance is greater than the mean) due to the fact that the probability mass function of a negative binomial distribution is \[P(Y=y|\mu, k)=\frac{\Gamma (\frac{1}{k} +y)}{\Gamma (\frac{1}{k})y!}\left(\frac{k\mu}{1+k\mu}\right)^{y}\left(\frac{1}{1+k\mu}\right)^{\frac{1}{k}}\] with mean \(\mathbb{E}[Y]=\mu\) and variance \(\mathbb{V}[Y]=\mu+k\mu^2\).

The first parameter of this formulation is the mean of the distribution whereas the second is referred to as the dispersion parameter. Large values of \(k\) are a sign of overdispersion, while when \(k\to 0\) the variance of the distribution is equal to the mean and we have the Poisson model as a special case.

Note: The dispersion parameter \(\theta\) in R is equivalent to \(\frac{1}{k}\) on the previous notation. The majority of other statistical languages (e.g. SAS, Stata, Matlab) use the previous notation with \(\mu\) and \(k\).


Call:
glm.nb(formula = articles ~ gender + marital.status + kids + 
    phd.prestige + mentor, data = Articles, init.theta = 2.264387695, 
    link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1678  -1.3617  -0.2806   0.4476   3.4524  

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)            0.256144   0.137348   1.865 0.062191 .  
genderfemale          -0.216418   0.072636  -2.979 0.002887 ** 
marital.statusmarried  0.150489   0.082097   1.833 0.066791 .  
kids                  -0.176415   0.052813  -3.340 0.000837 ***
phd.prestige           0.015271   0.035873   0.426 0.670326    
mentor                 0.029082   0.003214   9.048  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(2.2644) family taken to be 1)

    Null deviance: 1109.0  on 914  degrees of freedom
Residual deviance: 1004.3  on 909  degrees of freedom
AIC: 3135.9

Number of Fisher Scoring iterations: 1


              Theta:  2.264 
          Std. Err.:  0.271 

 2 x log-likelihood:  -3121.917 

We should again check the goodness-of-fit of this model with a chi-square test, where the null hypothesis is that the negative binomial model is correctly specified.

pchisq(model.negb$deviance, df=model.negb$df.residual, lower.tail=FALSE) # Evidence that we can still do better.
[1] 0.01475755
We can use a likelihood ratio test to compare the two models and test if the negative binomial model is more appropriate.
diff_ll <- 2*(logLik(model.negb)-logLik(model.pois))#log likelihood difference = likelihood ratio
pchisq(diff_ll,df=1,lower.tail=FALSE)#Strong evidence that we should reject the Poisson model in favour of the negative binomial model
'log Lik.' 4.391728e-41 (df=7)

In both models the regression coefficients \(\beta_i\) represent the expected change in the log of the mean of the response variable; for a one unit increase in the covariate \(x_i\) while all other covariates remain the same. In other words, if \(\beta_i\) is positive, increasing \(x_i\) by one unit is associated with an increase of \(\exp\{\beta_i\}\) in the mean of the response; while for a negative \(\beta_i\), increasing \(x_i\) by one unit is associated with a decrease of \(1-\exp\{\beta_i\}\) in the mean of the response.

We can plot the exponent of the regression coefficients using the function sjp.glm from the sjPlot library.


Finally, we can plot the predicted values for the average number of papers published, with regards to the number of papers the mentor has published and the number of kids a PhD student has. This can be done using the argument type="pred" in the sjp.glm function.

2nd group project



Fiction

  1. Let’s assume I am a biochemist in the \(1960\)’s and I just discovered this data set. Let’s also assume that I am interested in hiring a PhD student to work together in a project for the next \(3\) years. Finally (I promise this is the last assumption), you are a friend of mine and way better than me in statistics.
    • Would you have any questions for me before you start any statistical analysis?
    • If you had the chance to collect the data yourself, what additional information would you try to include in order for the regression model to be more realistic?
    • Would you try any different types of models besides the Poisson and negative binomial ones?
    • What would you advise me to look for in a PhD student? (you can take into account any of the graphs from the previous tab and any additional information from the regression models you used)
    Note: The last assumption was referring to the fact that we are friends, not that you are better than me in statistics.

Non-Fiction

  1. Modify the Articles dataset so when you type str(Articles) you get the results seen in the previous tab.

  2. I previously used the function describeBy to get summary statistics by group. Now it’s your turn to use the function… tapply and find the mean and the variance of the number of articles by gender.

Note: I am currently in a good mood so you can have a look here first

Graphic novels (not included in the project but you can try them at your own time)

  1. sjPlot is a package in R which contains plotting and table-output functions for data visualisation. Your job is simple; replicate all graphs from the previous tab using functions from the aforementioned package.

Median house prices in Boston



Location of data set

library(MASS)
data(Boston)
Boston <- Boston[,c("medv","crim","lstat","rm","rad","chas")]
colnames(Boston) <- c("median.house.value","crime.rate","perc.low.socio.status","aver.number.rooms","index.access.radial.highways","river.bounds")

Exploratory data analysis

library(knitr)
kable(Boston[c(1:10),])
median.house.value crime.rate perc.low.socio.status aver.number.rooms index.access.radial.highways river.bounds
24.0 0.00632 4.98 6.575 1 0
21.6 0.02731 9.14 6.421 2 0
34.7 0.02729 4.03 7.185 2 0
33.4 0.03237 2.94 6.998 3 0
36.2 0.06905 5.33 7.147 3 0
28.7 0.02985 5.21 6.430 3 0
22.9 0.08829 12.43 6.012 5 0
27.1 0.14455 19.15 6.172 5 0
16.5 0.21124 29.93 5.631 5 0
18.9 0.17004 17.10 6.004 5 0
str(Boston)
'data.frame':   506 obs. of  6 variables:
 $ median.house.value          : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
 $ crime.rate                  : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ perc.low.socio.status       : num  4.98 9.14 4.03 2.94 5.33 ...
 $ aver.number.rooms           : num  6.58 6.42 7.18 7 7.15 ...
 $ index.access.radial.highways: int  1 2 2 3 3 3 5 5 5 5 ...
 $ river.bounds                : int  0 0 0 0 0 0 0 0 0 0 ...
summary(Boston)
 median.house.value   crime.rate       perc.low.socio.status aver.number.rooms
 Min.   : 5.00      Min.   : 0.00632   Min.   : 1.73         Min.   :3.561    
 1st Qu.:17.02      1st Qu.: 0.08204   1st Qu.: 6.95         1st Qu.:5.886    
 Median :21.20      Median : 0.25651   Median :11.36         Median :6.208    
 Mean   :22.53      Mean   : 3.61352   Mean   :12.65         Mean   :6.285    
 3rd Qu.:25.00      3rd Qu.: 3.67708   3rd Qu.:16.95         3rd Qu.:6.623    
 Max.   :50.00      Max.   :88.97620   Max.   :37.97         Max.   :8.780    
 index.access.radial.highways  river.bounds    
 Min.   : 1.000               Min.   :0.00000  
 1st Qu.: 4.000               1st Qu.:0.00000  
 Median : 5.000               Median :0.00000  
 Mean   : 9.549               Mean   :0.06917  
 3rd Qu.:24.000               3rd Qu.:0.00000  
 Max.   :24.000               Max.   :1.00000  
library(dplyr)
Boston <- as.tbl(Boston)
Boston
# A tibble: 506 x 6
   median.house.value crime.rate perc.low.socio.s… aver.number.roo… index.access.radi… river.bounds
 *              <dbl>      <dbl>             <dbl>            <dbl>              <int>        <int>
 1               24      0.00632              4.98             6.58                  1            0
 2               21.6    0.0273               9.14             6.42                  2            0
 3               34.7    0.0273               4.03             7.18                  2            0
 4               33.4    0.0324               2.94             7.00                  3            0
 5               36.2    0.0690               5.33             7.15                  3            0
 6               28.7    0.0298               5.21             6.43                  3            0
 7               22.9    0.0883              12.4              6.01                  5            0
 8               27.1    0.145               19.2              6.17                  5            0
 9               16.5    0.211               29.9              5.63                  5            0
10               18.9    0.170               17.1              6.00                  5            0
# ... with 496 more rows

Basic plots

plot(Boston) # A bit ugly, but gives us some info

library(ggplot2)
ggplot(melt(Boston), aes(variable, value)) + geom_boxplot() # Alternative to boxplot(Boston)

ggplot(Boston)+geom_histogram(aes(x=median.house.value),binwidth =1) #try it out without the binwidth argument

ggplot(Boston)+geom_histogram(aes(x=median.house.value,fill=factor(index.access.radial.highways)),binwidth =1) # A bit messy

ggplot(Boston,aes(x=median.house.value,y=crime.rate))+geom_point()+facet_wrap(~index.access.radial.highways) #Less messy

#ggplot(Boston,aes(x=median.house.value,y=crime.rate))+geom_point()+facet_grid(index.access.radial.highways~river.bounds) #Feel free to try this

Interactive graph (pairwise comparison)

library(GGally)
library(plotly)
#Function to return points and geom_smooth
my_fn <- function(data, mapping, method="loess", ...){
      ppp <- ggplot(data = data, mapping = mapping) +
      geom_point() +
      geom_smooth(method=method, ...)
      ppp
    }
#Remove last 2 columns from Boston data set
Boston <- Boston[,-c(5,6)]
p <- ggpairs(Boston,lower=list(continuous=wrap(my_fn, method="loess")),columnLabels = c("Median house value","Crime rate","Percentage of low socioec. status", "Average number of rooms"))
pp <- ggplotly(p )
#ggplotly has an issue with axis labels. Sometimes they dissapear.
x1 <- list(title="Median house value")
y1 <- x1
x2 <- list(title="Crime rate")
y2 <- x2
x3 <- list(title="Perc. of low socioec. status")
y3 <- x3
x4 <- list(title="Aver. number of rooms")
y4 <- x4
m = list(  l = 50,  r = 50,  b = 100,  t = 70)
pp %>% layout(width=1200,height=1000,margin=m,xaxis=x1, yaxis=y4,xaxis2=x2,yaxis2=y3,xaxis3=x3, yaxis3=y2,xaxis4=x4,yaxis4=y1)

Interactive graph (median house value vs percent. of low socioec. status)

x <- list(title="Median house value")
y <- list(title="Perc. of low socioec. status")


plot_ly(Boston,y=Boston$perc.low.socio.status,x=Boston$median.house.value,mode="markers",size=Boston$crime.rate,color=Boston$aver.number.rooms)%>% layout(width=1200,height=1000,xaxis = x, yaxis = y)

Number of publications by PhD students



Location of data set

library(knitr) #for the kable function
library(Rchoice) #for the data set
library(plyr) #for the revalue function
data("Articles")
Articles <- as.tbl(Articles)
Articles$fem <- factor(Articles$fem)
Articles$fem <-revalue(Articles$fem, c("1"="female", "0"="male"))
Articles$mar <- factor(Articles$mar)
Articles$mar <-revalue(Articles$mar, c("1"="married", "0"="single"))
colnames(Articles)[] <- c("articles","gender","marital.status","kids","phd.prestige","mentor")

kable(Articles[c(1:7),])
articles gender marital.status kids phd.prestige mentor
0 male married 0 2.52 7
0 female single 0 2.05 6
0 female single 0 3.75 6
0 male married 1 1.18 3
0 female single 0 3.75 26
0 female married 2 3.59 2
0 female single 0 3.19 3

Basic plots

library(sjPlot)
library(sjmisc)
sjp.frq(Articles$articles,title="Number of articles")

sjp.xtab(Articles$articles,Articles$gender,geom.colors=c("blue","pink"),show.values = FALSE,title = "Number of articles by gender",show.total = FALSE)

sjp.xtab(Articles$articles,Articles$kids,geom.colors=c("red","azure4","green","purple"),show.values = FALSE,title = "Number of articles by number of kids",show.n = FALSE,show.total = FALSE)

Regression analysis

sjp.glm(model.pois,digits=3,axis.title = "Exponent of regression coefficients",sort.est=TRUE,title = "Poisson model")

sjp.glm(model.negb,digits=3,axis.title = "Exponent of regression coefficients",sort.est = TRUE,title="Negative binomial model")


sjp.glm(model.pois,type="pred",vars=c("mentor","kids"),facet.grid = FALSE,geom.colors=c("red","azure4","green","purple"),title = "Poisson model",axis.lim =c(0,15))

sjp.glm(model.negb,type="pred",vars=c("mentor","kids"),geom.colors=c("red","azure4","green","purple"),facet.grid = FALSE,title = "Negative binomial model",axis.lim =c(0,15))