In matrix form,
\[\begin{equation} \boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}, \end{equation}\]where
\[ \boldsymbol{y}=\left(\begin{array}{c} y_1\\ y_2\\ \vdots\\ y_n \end{array}\right), \boldsymbol{X}=\left( \begin{array}{cccc} 1 & x_{11} & \ldots & x_{1k} \\ 1 & x_{21} & \ldots & x_{2k} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n1} & \ldots & x_{nk} \end{array}\right), \boldsymbol{\beta}=\left(\begin{array}{c} \beta_0\\ \beta_1\\ \vdots\\ \beta_k \end{array}\right), \boldsymbol{\epsilon}=\left(\begin{array}{c} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n \end{array}\right) \]
and \(\beta_i\) are the unknown parameters that need to be estimated and \(\boldsymbol{\epsilon}\) is the random part of the model. One of the assumptions of classic regression is the independence of the errors with each other and with the covariates. In addition, the errors have zero mean and constant variance (homoscedasticity). Applying the zero mean assumption of the errors we have that, \[\begin{equation}\mathbb{E}(y_i|\boldsymbol{x}_i)=\beta _0+\beta _1x_{i1}+\ldots+\beta _{k}x_{ik}\end{equation}\]where \(\boldsymbol{x}_i=(1, x_{i1},x_{i2}, \ldots, x_{ik})^\intercal\). Least squares regression describes the behaviour of the location of the conditional distribution using the mean of the distribution to represent its central tendency.
The residuals \(\hat{\epsilon_i}\) are defined as the differences between the observed and the estimated values. Minimising the sum of the squared residuals \[\begin{equation}\sum_{i=1}^{n} r(y_i-\boldsymbol{x}_i^\intercal\boldsymbol{\hat{\beta}})=\sum_{i=1}^{n}(y_i-\boldsymbol{x}_i^\intercal\boldsymbol{\hat{\beta}})^2\end{equation}\] where \(r(u)=u^2\) is the quadratic loss function, gives the least squares estimator \(\boldsymbol{\hat{\beta}}\) by \[\begin{equation}\boldsymbol{\hat{\beta}}=(\boldsymbol{X}^\intercal \boldsymbol{X})^{-1}\boldsymbol{X}^\intercal \boldsymbol{y}.\end{equation}\] The additional assumption that the errors \(\boldsymbol{\epsilon}\) follow a Gaussian distribution, \[\begin{equation}\boldsymbol{\epsilon}\sim N(\boldsymbol{0},\sigma^2 \boldsymbol{I}_n)\end{equation}\] where \(\boldsymbol{I}_n\) is the \(n\times n\) identity matrix, provides a framework for testing the significance of the coefficients found. Under this assumption the least-squares estimator is also the maximum-likelihood estimator. Taking expectations, with respect to \(\boldsymbol{\epsilon}\), and noting that a linear function of a normally distributed random variable is normally distributed itself we can rewrite the model as \[\begin{equation} \boldsymbol{y} \sim N(\boldsymbol{\mu},\sigma^2\boldsymbol{I}_n), \ \text{where} \ \boldsymbol{\mu}=\boldsymbol{X}\boldsymbol{\beta}.\label{newmodel} \end{equation}\]Thus, we model the relationship between the mean of \(y_i\), for \(i=1, 2, \ldots, n\), and the covariates linearly.
On the page Regression modelling we focused on data \(\boldsymbol{y}\) which, conditioned on the covariates, are normally distributed . The aforementioned procedure can be generalised to any distribution belonging to the exponential family. These models are known as generalised linear models (GLM) and consist of three elements:
A probability distribution that belongs to the exponential family of distributions (random component).
A linear predictor \(\eta_i\) (systematic component) such that \(\eta_i=\beta _0+\beta _1x_{i1}+\ldots+\beta _{k}x_{ik}=\boldsymbol{x}_i^\intercal\boldsymbol{\beta}.\)
A link function \(g\) such that \(\mathbb{E}[Y_i]=\mu_i=g^{-1}(\eta_i).\)
Note: It is called the link function because it links the linear predictor \(\eta\) to the mean of the distribution \(\mu\).
A GLM can be used for data that are not normally distributed and for situations where the relationship between the mean of the response variable and the covariates is not linear. The GLM includes many important distributions such as the Gaussian, Poisson, Bernoulli, gamma and inverse Gaussian distributions. The link and mean functions of some common distributions can be seen in the table below.
| Distribution | Probability density (or mass) function | Linear predictor | Link function |
|---|---|---|---|
| Normal distribution | \(f(y;\mu, \sigma^2)=\frac{1}{\sigma \sqrt{2\pi}}\exp{\left\{- \frac{(y-\mu)^2}{2\sigma^2} \right\}}\) | \(\boldsymbol{x}_i^\intercal\boldsymbol{\beta}=\mu_i\) | \(\mu_i=\boldsymbol{x}_i^\intercal\boldsymbol{\beta}\) |
| Exponential distribution | \(f(y;\lambda)=\lambda \exp{\{-\lambda y\}}\) | \(\boldsymbol{x}_i^\intercal\boldsymbol{\beta}=\mu_i^{-1}\) | \(\mu_i=(\boldsymbol{x}_i^\intercal\boldsymbol{\beta})^{-1}\) |
| Gamma distribution | \(f(y;a, b)=\frac{b^a}{\Gamma (a)} y^{a-1}\exp{\{-by \}}\) | \(\boldsymbol{x}_i^\intercal\boldsymbol{\beta}=\mu_i^{-1}\) | \(\mu_i=(\boldsymbol{x}_i^\intercal\boldsymbol{\beta})^{-1}\) |
| Poisson distribution | \(P(Y=y;\mu)=\exp{\{-\mu\}}\frac{\mu^y}{y!}\) | \(\boldsymbol{x}_i^\intercal\boldsymbol{\beta}=\log{\{\mu_i\}}\) | \(\mu_i=\exp{\{\boldsymbol{x}_i^\intercal\boldsymbol{\beta}\}}\) |
| Inverse Gaussian distribution | \(f(y;\lambda, \mu)=\sqrt{\frac{\lambda}{2\pi y^3}}\exp{\left\{- \frac{\lambda (y-\mu)^2}{2\mu^2 y} \right\}}\) | \(\boldsymbol{x}_i^\intercal\boldsymbol{\beta}=\mu_i^{-2}\) | \(\mu_i=(\boldsymbol{x}_i^\intercal\boldsymbol{\beta})^{-\frac{1}{2}}\) |
| Bernoulli distribution | \(P(0)=1-\mu, P(1)= \mu\) | \(\boldsymbol{x}_i^\intercal\boldsymbol{\beta}= \log{\left\{\frac{\mu_i}{1-\mu_i}\right\}}\) | \(\mu_i=\frac{ \exp{\{ \boldsymbol{x}_i^\intercal \boldsymbol{\beta} \} } }{ 1+\exp{\{ \boldsymbol{x}_i^\intercal \boldsymbol{\beta} \}} }\) |
Generalised linear models are really flexible due to the fact that:
Like all other models, a GLM has limitations. This is due to the fact that:
The pages Regression modelling and Generalised linear models provided a short summary of regression modelling and generalised linear models. Unfortunately, you will have to delve deeper into GLM’s by yourselves before you come to class. Fortunately though, you have a plethora of different options.
We will first focus on the Boston data set which is located in the MASS library. You should install the package with install.packages("MASS"), load it with library(MASS), and finally load the dataset with data(Boston). The dataset records information regarding \(506\) neighbourhoods around Boston in \(1978\) and the median house prices of each neighbourhood, amongst other variables. A selection of variables, for the first \(10\) neighbourhoods, from the Boston data set can be seen in the table below. The median house prices are shown in thousands of dollars.
Note: The original names of these variables are medv, crim, lstat, rm, rad, and chas.
| median.house.value | crime.rate | perc.low.socio.status | aver.number.rooms | index.access.radial.highways | river.bounds |
|---|---|---|---|---|---|
| 24.0 | 0.00632 | 4.98 | 6.575 | 1 | 0 |
| 21.6 | 0.02731 | 9.14 | 6.421 | 2 | 0 |
| 34.7 | 0.02729 | 4.03 | 7.185 | 2 | 0 |
| 33.4 | 0.03237 | 2.94 | 6.998 | 3 | 0 |
| 36.2 | 0.06905 | 5.33 | 7.147 | 3 | 0 |
| 28.7 | 0.02985 | 5.21 | 6.430 | 3 | 0 |
| 22.9 | 0.08829 | 12.43 | 6.012 | 5 | 0 |
| 27.1 | 0.14455 | 19.15 | 6.172 | 5 | 0 |
| 16.5 | 0.21124 | 29.93 | 5.631 | 5 | 0 |
| 18.9 | 0.17004 | 17.10 | 6.004 | 5 | 0 |
We can find out more information about the structure of the data set by using the command str(Boston), and produce result summaries of each variable, using the command summary(Boston).
str(Boston)'data.frame': 506 obs. of 6 variables:
$ median.house.value : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
$ crime.rate : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ perc.low.socio.status : num 4.98 9.14 4.03 2.94 5.33 ...
$ aver.number.rooms : num 6.58 6.42 7.18 7 7.15 ...
$ index.access.radial.highways: int 1 2 2 3 3 3 5 5 5 5 ...
$ river.bounds : int 0 0 0 0 0 0 0 0 0 0 ...
summary(Boston) median.house.value crime.rate perc.low.socio.status aver.number.rooms
Min. : 5.00 Min. : 0.00632 Min. : 1.73 Min. :3.561
1st Qu.:17.02 1st Qu.: 0.08204 1st Qu.: 6.95 1st Qu.:5.886
Median :21.20 Median : 0.25651 Median :11.36 Median :6.208
Mean :22.53 Mean : 3.61352 Mean :12.65 Mean :6.285
3rd Qu.:25.00 3rd Qu.: 3.67708 3rd Qu.:16.95 3rd Qu.:6.623
Max. :50.00 Max. :88.97620 Max. :37.97 Max. :8.780
index.access.radial.highways river.bounds
Min. : 1.000 Min. :0.00000
1st Qu.: 4.000 1st Qu.:0.00000
Median : 5.000 Median :0.00000
Mean : 9.549 Mean :0.06917
3rd Qu.:24.000 3rd Qu.:0.00000
Max. :24.000 Max. :1.00000
We can also use the as.tbl function, located in the dplyr package.
Note: This is a good time to pause the video, check the code below, and think about the benefits of using this function.
Boston <- as.tbl(Boston)
Boston# A tibble: 506 x 6
median.house.value crime.rate perc.low.socio.s… aver.number.roo… index.access.radi… river.bounds
* <dbl> <dbl> <dbl> <dbl> <int> <int>
1 24 0.00632 4.98 6.58 1 0
2 21.6 0.0273 9.14 6.42 2 0
3 34.7 0.0273 4.03 7.18 2 0
4 33.4 0.0324 2.94 7.00 3 0
5 36.2 0.0690 5.33 7.15 3 0
6 28.7 0.0298 5.21 6.43 3 0
7 22.9 0.0883 12.4 6.01 5 0
8 27.1 0.145 19.2 6.17 5 0
9 16.5 0.211 29.9 5.63 5 0
10 18.9 0.170 17.1 6.00 5 0
# ... with 496 more rows
plot(Boston) # A bit ugly, but gives us some infoggplot(melt(Boston), aes(variable, value)) + geom_boxplot() # Alternative to boxplot(Boston)ggplot(Boston)+geom_histogram(aes(x=median.house.value),binwidth =1) #try it out without the binwidth argumentggplot(Boston)+geom_histogram(aes(x=median.house.value,fill=factor(index.access.radial.highways)),binwidth =1) # A bit messyggplot(Boston,aes(x=median.house.value,y=crime.rate))+geom_point()+facet_wrap(~index.access.radial.highways) #Less messy#ggplot(Boston,aes(x=median.house.value,y=crime.rate))+geom_point()+facet_grid(index.access.radial.highways~river.bounds) #Feel free to try this Boston data set. His actual quote was “Can you come up with a good model regarding the per capita crime rate for each neighbourhood in Boston? The President’s best idea is to build a wall around the neighbourhoods with the highest crime rates.”
There is a boston data set within the spdep package. In this one, the main difference is the addition of the coordinates of each neighbourhood.
tail(table(Boston$medv))? Any possible explanations of why this is happening?
46 46.7 48.3 48.5 48.8 50
1 1 1 1 1 16
I believe that the following model (glm(medv~tax+rm,data=Boston)) makes no sense at all. If you had to use the same covariates and dependent variable in the model, what can you do to make it more realistic/sensible?
Hint: Think about how you would interpret the coefficients in the following model.
Call:
glm(formula = medv ~ tax + rm, data = Boston)
Deviance Residuals:
Min 1Q Median 3Q Max
-16.495 -3.123 -0.548 2.384 42.057
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -21.233093 2.834371 -7.491 3.09e-13 ***
tax -0.015837 0.001686 -9.391 < 2e-16 ***
rm 7.992681 0.404534 19.758 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 37.31825)
Null deviance: 42716 on 505 degrees of freedom
Residual deviance: 18771 on 503 degrees of freedom
AIC: 3272.4
Number of Fisher Scoring iterations: 2
Boston dataset that you feel are qualified to be used as dependent variables for a Poisson regression model? What about a logistic regression model?Plotly is a package in R that allows for interactive graphs. Try to create an interactive boxplot using variables from the Boston data set (or any other data set).
The Plotly package can be used alongside the ggplot2 package to create interactive, online ggplot2 graphs. For more information on how this is done click here. Have a look at the ggpairs and ggplotly functions and try and create an interactive graph for the Boston data set using the aforementioned functions. If you want to take it one step further, you can choose a data set which contains a factor variable (e.g. race) and use the argument mapping=aes(color=race) inside the ggpairs function.
We will now have a look on the Articles data set which is located in the Rchoice library. The dataset records information regarding the number of publications of \(915\) PhD biochemistry students during the \(1950\)’s and \(1960\)’s. The first \(7\) observations can be seen in the table below. The Article data set is already sorted with respect to the number of articles published, hence the \(0\)’s at the first column.
# A tibble: 915 x 6
art fem mar kid5 phd ment
* <int> <int> <int> <int> <dbl> <int>
1 0 0 1 0 2.52 7
2 0 1 0 0 2.05 6
3 0 1 0 0 3.75 6
4 0 0 1 1 1.18 3
5 0 1 0 0 3.75 26
6 0 1 1 2 3.59 2
7 0 1 0 0 3.19 3
8 0 0 1 2 2.96 4
9 0 0 0 0 4.62 6
10 0 1 1 0 1.25 0
# ... with 905 more rows
Note: The original names of these variables are art, fem, mar, kid5, phd, and ment. Two variables (i.e. fem, mar) have been renamed and transformed to factors.
| articles | gender | marital.status | kids | phd.prestige | mentor |
|---|---|---|---|---|---|
| 0 | male | married | 0 | 2.52 | 7 |
| 0 | female | single | 0 | 2.05 | 6 |
| 0 | female | single | 0 | 3.75 | 6 |
| 0 | male | married | 1 | 1.18 | 3 |
| 0 | female | single | 0 | 3.75 | 26 |
| 0 | female | married | 2 | 3.59 | 2 |
| 0 | female | single | 0 | 3.19 | 3 |
We can find out more information about the structure of the data set by using the command str(Articles), and produce result summaries of each variable, using the command summary(Articles).
str(Articles)Classes 'tbl_df', 'tbl' and 'data.frame': 915 obs. of 6 variables:
$ articles : int 0 0 0 0 0 0 0 0 0 0 ...
$ gender : Factor w/ 2 levels "male","female": 1 2 2 1 2 2 2 1 1 2 ...
$ marital.status: Factor w/ 2 levels "single","married": 2 1 1 2 1 2 1 2 1 2 ...
$ kids : int 0 0 0 1 0 2 0 2 0 0 ...
$ phd.prestige : num 2.52 2.05 3.75 1.18 3.75 ...
$ mentor : int 7 6 6 3 26 2 3 4 6 0 ...
- attr(*, "datalabel")= chr "Academic Biochemists / S Long"
- attr(*, "time.stamp")= chr "30 Jan 2001 10:49"
- attr(*, "formats")= chr "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
- attr(*, "types")= int 98 98 98 98 102 98
- attr(*, "val.labels")= chr "" "sexlbl" "marlbl" "" ...
- attr(*, "var.labels")= chr "Articles in last 3 yrs of PhD" "Gender: 1=female 0=male" "Married: 1=yes 0=no" "Number of children < 6" ...
- attr(*, "expansion.fields")=List of 1
..$ : chr "_dta" " 8 data, cleaned 11/20/00, revise 11/20/00, revised 1/15/01, 1/30/01.\"'!=\"\"" "d 1/15/01, 1/30/01."
- attr(*, "version")= int 6
- attr(*, "label.table")=List of 2
..$ marlbl: Named int 0 1
.. ..- attr(*, "names")= chr "Single" "Married"
..$ sexlbl: Named int 0 1
.. ..- attr(*, "names")= chr "Men" "Women"
summary(Articles) articles gender marital.status kids phd.prestige mentor
Min. : 0.000 male :494 single :309 Min. :0.0000 Min. :0.755 Min. : 0.000
1st Qu.: 0.000 female:421 married:606 1st Qu.:0.0000 1st Qu.:2.260 1st Qu.: 3.000
Median : 1.000 Median :0.0000 Median :3.150 Median : 6.000
Mean : 1.693 Mean :0.4951 Mean :3.103 Mean : 8.767
3rd Qu.: 2.000 3rd Qu.:1.0000 3rd Qu.:3.920 3rd Qu.:12.000
Max. :19.000 Max. :3.0000 Max. :4.620 Max. :77.000
The mean and the variance of the number of articles is 1.69 and 3.71 respectively.
We could also be interested in reporting summary statistics by gender and marital status. In this case, we can use the function describeBy in the psych package.
library(psych)
describeBy(Articles,list(Articles$gender,Articles$marital.status))
Descriptive statistics by group
: male
: single
vars n mean sd median trimmed mad min max range skew kurtosis se
articles 1 113 1.95 2.01 1.00 1.62 1.48 0.00 7.00 7.0 1.13 0.45 0.19
gender* 2 113 1.00 0.00 1.00 1.00 0.00 1.00 1.00 0.0 NaN NaN 0.00
marital.status* 3 113 1.00 0.00 1.00 1.00 0.00 1.00 1.00 0.0 NaN NaN 0.00
kids 4 113 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 NaN NaN 0.00
phd.prestige 5 113 3.50 0.89 3.59 3.55 1.04 1.52 4.62 3.1 -0.45 -1.13 0.08
mentor 6 113 10.21 9.81 8.00 8.57 5.93 0.00 55.00 55.0 2.03 5.08 0.92
---------------------------------------------------------------------------
: female
: single
vars n mean sd median trimmed mad min max range skew kurtosis se
articles 1 196 1.39 1.51 1.00 1.14 1.48 0.00 7.00 7.00 1.31 1.46 0.11
gender* 2 196 2.00 0.00 2.00 2.00 0.00 2.00 2.00 0.00 NaN NaN 0.00
marital.status* 3 196 1.00 0.00 1.00 1.00 0.00 1.00 1.00 0.00 NaN NaN 0.00
kids 4 196 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NaN NaN 0.00
phd.prestige 5 196 3.06 0.97 3.19 3.08 1.08 0.75 4.62 3.86 -0.18 -0.85 0.07
mentor 6 196 7.86 7.56 6.00 6.66 5.93 0.00 39.00 39.00 1.73 3.46 0.54
---------------------------------------------------------------------------
: male
: married
vars n mean sd median trimmed mad min max range skew kurtosis se
articles 1 381 1.86 2.23 1.00 1.50 1.48 0.00 19.00 19.0 2.99 14.94 0.11
gender* 2 381 1.00 0.00 1.00 1.00 0.00 1.00 1.00 0.0 NaN NaN 0.00
marital.status* 3 381 2.00 0.00 2.00 2.00 0.00 2.00 2.00 0.0 NaN NaN 0.00
kids 4 381 0.91 0.86 1.00 0.84 1.48 0.00 3.00 3.0 0.53 -0.65 0.04
phd.prestige 5 381 3.02 1.00 2.96 3.02 1.28 0.92 4.62 3.7 0.07 -1.25 0.05
mentor 6 381 9.33 10.75 6.00 7.34 5.93 0.00 77.00 77.0 2.49 8.43 0.55
---------------------------------------------------------------------------
: female
: married
vars n mean sd median trimmed mad min max range skew kurtosis se
articles 1 225 1.54 1.59 1.00 1.32 1.48 0.00 10.00 10.0 1.40 3.08 0.11
gender* 2 225 2.00 0.00 2.00 2.00 0.00 2.00 2.00 0.0 NaN NaN 0.00
marital.status* 3 225 2.00 0.00 2.00 2.00 0.00 2.00 2.00 0.0 NaN NaN 0.00
kids 4 225 0.47 0.69 0.00 0.33 0.00 0.00 3.00 3.0 1.23 0.44 0.05
phd.prestige 5 225 3.09 0.97 3.21 3.12 0.98 0.92 4.62 3.7 -0.21 -0.88 0.06
mentor 6 225 7.88 8.37 5.00 6.47 4.45 0.00 49.00 49.0 2.26 6.77 0.56
We can first just plot the frequency of the number of articles published by PhD students.
sjp.frq(Articles$articles,title="Number of articles")We can also plot the contingency table of the number of articles vs the gender of the PhD student and the number of articles vs the number of kids.
The most common model for count data is the Poisson regression model. So we can start with that model and build up later on (or stay with it if it’s an appropriate model). We can run the model using the glm function with the argument family=poisson.
model.pois <- glm(articles~gender+marital.status+kids+phd.prestige+mentor, family=poisson, data=Articles)
summary(model.pois)
Call:
glm(formula = articles ~ gender + marital.status + kids + phd.prestige +
mentor, family = poisson, data = Articles)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.5672 -1.5398 -0.3660 0.5722 5.4467
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.304617 0.102981 2.958 0.0031 **
genderfemale -0.224594 0.054613 -4.112 3.92e-05 ***
marital.statusmarried 0.155243 0.061374 2.529 0.0114 *
kids -0.184883 0.040127 -4.607 4.08e-06 ***
phd.prestige 0.012823 0.026397 0.486 0.6271
mentor 0.025543 0.002006 12.733 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1817.4 on 914 degrees of freedom
Residual deviance: 1634.4 on 909 degrees of freedom
AIC: 3314.1
Number of Fisher Scoring iterations: 5
We can also check the goodness-of-fit of the model with a chi-square test based on the residual deviance and degrees of freedom, where the null hypothesis is that the Poisson model is correctly specified.
pchisq(model.pois$deviance, df=model.pois$df.residual, lower.tail=FALSE) # Strong evidence that our model fits badly.[1] 4.525613e-44
We can now fit a negative binomial regression model. This model unlike the Poisson model, can handle overdispersion (i.e. the variance is greater than the mean) due to the fact that the probability mass function of a negative binomial distribution is \[P(Y=y|\mu, k)=\frac{\Gamma (\frac{1}{k} +y)}{\Gamma (\frac{1}{k})y!}\left(\frac{k\mu}{1+k\mu}\right)^{y}\left(\frac{1}{1+k\mu}\right)^{\frac{1}{k}}\] with mean \(\mathbb{E}[Y]=\mu\) and variance \(\mathbb{V}[Y]=\mu+k\mu^2\).
The first parameter of this formulation is the mean of the distribution whereas the second is referred to as the dispersion parameter. Large values of \(k\) are a sign of overdispersion, while when \(k\to 0\) the variance of the distribution is equal to the mean and we have the Poisson model as a special case.
Note: The dispersion parameter \(\theta\) in R is equivalent to \(\frac{1}{k}\) on the previous notation. The majority of other statistical languages (e.g. SAS, Stata, Matlab) use the previous notation with \(\mu\) and \(k\).
Call:
glm.nb(formula = articles ~ gender + marital.status + kids +
phd.prestige + mentor, data = Articles, init.theta = 2.264387695,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1678 -1.3617 -0.2806 0.4476 3.4524
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.256144 0.137348 1.865 0.062191 .
genderfemale -0.216418 0.072636 -2.979 0.002887 **
marital.statusmarried 0.150489 0.082097 1.833 0.066791 .
kids -0.176415 0.052813 -3.340 0.000837 ***
phd.prestige 0.015271 0.035873 0.426 0.670326
mentor 0.029082 0.003214 9.048 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(2.2644) family taken to be 1)
Null deviance: 1109.0 on 914 degrees of freedom
Residual deviance: 1004.3 on 909 degrees of freedom
AIC: 3135.9
Number of Fisher Scoring iterations: 1
Theta: 2.264
Std. Err.: 0.271
2 x log-likelihood: -3121.917
We should again check the goodness-of-fit of this model with a chi-square test, where the null hypothesis is that the negative binomial model is correctly specified.
pchisq(model.negb$deviance, df=model.negb$df.residual, lower.tail=FALSE) # Evidence that we can still do better.[1] 0.01475755
We can use a likelihood ratio test to compare the two models and test if the negative binomial model is more appropriate.
diff_ll <- 2*(logLik(model.negb)-logLik(model.pois))#log likelihood difference = likelihood ratio
pchisq(diff_ll,df=1,lower.tail=FALSE)#Strong evidence that we should reject the Poisson model in favour of the negative binomial model'log Lik.' 4.391728e-41 (df=7)
In both models the regression coefficients \(\beta_i\) represent the expected change in the log of the mean of the response variable; for a one unit increase in the covariate \(x_i\) while all other covariates remain the same. In other words, if \(\beta_i\) is positive, increasing \(x_i\) by one unit is associated with an increase of \(\exp\{\beta_i\}\) in the mean of the response; while for a negative \(\beta_i\), increasing \(x_i\) by one unit is associated with a decrease of \(1-\exp\{\beta_i\}\) in the mean of the response.
We can plot the exponent of the regression coefficients using the function sjp.glm from the sjPlot library.
Finally, we can plot the predicted values for the average number of papers published, with regards to the number of papers the mentor has published and the number of kids a PhD student has. This can be done using the argument type="pred" in the sjp.glm function.
Modify the Articles dataset so when you type str(Articles) you get the results seen in the previous tab.
I previously used the function describeBy to get summary statistics by group. Now it’s your turn to use the function… tapply and find the mean and the variance of the number of articles by gender.
Note: I am currently in a good mood so you can have a look here first
library(MASS)
data(Boston)
Boston <- Boston[,c("medv","crim","lstat","rm","rad","chas")]
colnames(Boston) <- c("median.house.value","crime.rate","perc.low.socio.status","aver.number.rooms","index.access.radial.highways","river.bounds")library(knitr)
kable(Boston[c(1:10),])| median.house.value | crime.rate | perc.low.socio.status | aver.number.rooms | index.access.radial.highways | river.bounds |
|---|---|---|---|---|---|
| 24.0 | 0.00632 | 4.98 | 6.575 | 1 | 0 |
| 21.6 | 0.02731 | 9.14 | 6.421 | 2 | 0 |
| 34.7 | 0.02729 | 4.03 | 7.185 | 2 | 0 |
| 33.4 | 0.03237 | 2.94 | 6.998 | 3 | 0 |
| 36.2 | 0.06905 | 5.33 | 7.147 | 3 | 0 |
| 28.7 | 0.02985 | 5.21 | 6.430 | 3 | 0 |
| 22.9 | 0.08829 | 12.43 | 6.012 | 5 | 0 |
| 27.1 | 0.14455 | 19.15 | 6.172 | 5 | 0 |
| 16.5 | 0.21124 | 29.93 | 5.631 | 5 | 0 |
| 18.9 | 0.17004 | 17.10 | 6.004 | 5 | 0 |
str(Boston)'data.frame': 506 obs. of 6 variables:
$ median.house.value : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
$ crime.rate : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ perc.low.socio.status : num 4.98 9.14 4.03 2.94 5.33 ...
$ aver.number.rooms : num 6.58 6.42 7.18 7 7.15 ...
$ index.access.radial.highways: int 1 2 2 3 3 3 5 5 5 5 ...
$ river.bounds : int 0 0 0 0 0 0 0 0 0 0 ...
summary(Boston) median.house.value crime.rate perc.low.socio.status aver.number.rooms
Min. : 5.00 Min. : 0.00632 Min. : 1.73 Min. :3.561
1st Qu.:17.02 1st Qu.: 0.08204 1st Qu.: 6.95 1st Qu.:5.886
Median :21.20 Median : 0.25651 Median :11.36 Median :6.208
Mean :22.53 Mean : 3.61352 Mean :12.65 Mean :6.285
3rd Qu.:25.00 3rd Qu.: 3.67708 3rd Qu.:16.95 3rd Qu.:6.623
Max. :50.00 Max. :88.97620 Max. :37.97 Max. :8.780
index.access.radial.highways river.bounds
Min. : 1.000 Min. :0.00000
1st Qu.: 4.000 1st Qu.:0.00000
Median : 5.000 Median :0.00000
Mean : 9.549 Mean :0.06917
3rd Qu.:24.000 3rd Qu.:0.00000
Max. :24.000 Max. :1.00000
library(dplyr)
Boston <- as.tbl(Boston)
Boston# A tibble: 506 x 6
median.house.value crime.rate perc.low.socio.s… aver.number.roo… index.access.radi… river.bounds
* <dbl> <dbl> <dbl> <dbl> <int> <int>
1 24 0.00632 4.98 6.58 1 0
2 21.6 0.0273 9.14 6.42 2 0
3 34.7 0.0273 4.03 7.18 2 0
4 33.4 0.0324 2.94 7.00 3 0
5 36.2 0.0690 5.33 7.15 3 0
6 28.7 0.0298 5.21 6.43 3 0
7 22.9 0.0883 12.4 6.01 5 0
8 27.1 0.145 19.2 6.17 5 0
9 16.5 0.211 29.9 5.63 5 0
10 18.9 0.170 17.1 6.00 5 0
# ... with 496 more rows
plot(Boston) # A bit ugly, but gives us some infolibrary(ggplot2)
ggplot(melt(Boston), aes(variable, value)) + geom_boxplot() # Alternative to boxplot(Boston)ggplot(Boston)+geom_histogram(aes(x=median.house.value),binwidth =1) #try it out without the binwidth argumentggplot(Boston)+geom_histogram(aes(x=median.house.value,fill=factor(index.access.radial.highways)),binwidth =1) # A bit messyggplot(Boston,aes(x=median.house.value,y=crime.rate))+geom_point()+facet_wrap(~index.access.radial.highways) #Less messy#ggplot(Boston,aes(x=median.house.value,y=crime.rate))+geom_point()+facet_grid(index.access.radial.highways~river.bounds) #Feel free to try thislibrary(GGally)
library(plotly)
#Function to return points and geom_smooth
my_fn <- function(data, mapping, method="loess", ...){
ppp <- ggplot(data = data, mapping = mapping) +
geom_point() +
geom_smooth(method=method, ...)
ppp
}
#Remove last 2 columns from Boston data set
Boston <- Boston[,-c(5,6)]
p <- ggpairs(Boston,lower=list(continuous=wrap(my_fn, method="loess")),columnLabels = c("Median house value","Crime rate","Percentage of low socioec. status", "Average number of rooms"))
pp <- ggplotly(p )
#ggplotly has an issue with axis labels. Sometimes they dissapear.
x1 <- list(title="Median house value")
y1 <- x1
x2 <- list(title="Crime rate")
y2 <- x2
x3 <- list(title="Perc. of low socioec. status")
y3 <- x3
x4 <- list(title="Aver. number of rooms")
y4 <- x4
m = list( l = 50, r = 50, b = 100, t = 70)
pp %>% layout(width=1200,height=1000,margin=m,xaxis=x1, yaxis=y4,xaxis2=x2,yaxis2=y3,xaxis3=x3, yaxis3=y2,xaxis4=x4,yaxis4=y1)x <- list(title="Median house value")
y <- list(title="Perc. of low socioec. status")
plot_ly(Boston,y=Boston$perc.low.socio.status,x=Boston$median.house.value,mode="markers",size=Boston$crime.rate,color=Boston$aver.number.rooms)%>% layout(width=1200,height=1000,xaxis = x, yaxis = y)library(knitr) #for the kable function
library(Rchoice) #for the data set
library(plyr) #for the revalue function
data("Articles")
Articles <- as.tbl(Articles)
Articles$fem <- factor(Articles$fem)
Articles$fem <-revalue(Articles$fem, c("1"="female", "0"="male"))
Articles$mar <- factor(Articles$mar)
Articles$mar <-revalue(Articles$mar, c("1"="married", "0"="single"))
colnames(Articles)[] <- c("articles","gender","marital.status","kids","phd.prestige","mentor")
kable(Articles[c(1:7),])| articles | gender | marital.status | kids | phd.prestige | mentor |
|---|---|---|---|---|---|
| 0 | male | married | 0 | 2.52 | 7 |
| 0 | female | single | 0 | 2.05 | 6 |
| 0 | female | single | 0 | 3.75 | 6 |
| 0 | male | married | 1 | 1.18 | 3 |
| 0 | female | single | 0 | 3.75 | 26 |
| 0 | female | married | 2 | 3.59 | 2 |
| 0 | female | single | 0 | 3.19 | 3 |
library(sjPlot)
library(sjmisc)
sjp.frq(Articles$articles,title="Number of articles")sjp.xtab(Articles$articles,Articles$gender,geom.colors=c("blue","pink"),show.values = FALSE,title = "Number of articles by gender",show.total = FALSE)sjp.xtab(Articles$articles,Articles$kids,geom.colors=c("red","azure4","green","purple"),show.values = FALSE,title = "Number of articles by number of kids",show.n = FALSE,show.total = FALSE)sjp.glm(model.pois,digits=3,axis.title = "Exponent of regression coefficients",sort.est=TRUE,title = "Poisson model")sjp.glm(model.negb,digits=3,axis.title = "Exponent of regression coefficients",sort.est = TRUE,title="Negative binomial model")sjp.glm(model.pois,type="pred",vars=c("mentor","kids"),facet.grid = FALSE,geom.colors=c("red","azure4","green","purple"),title = "Poisson model",axis.lim =c(0,15))sjp.glm(model.negb,type="pred",vars=c("mentor","kids"),geom.colors=c("red","azure4","green","purple"),facet.grid = FALSE,title = "Negative binomial model",axis.lim =c(0,15))