Lino AA Notarantonio (lino@tec.mx)
14/10/2019
RStudio is an IDE (Integrated Development Environment), that is, a front end for interacting more conveniently with the program R. To work properly, R must be installed on your system.
The standard installation of R(Studio) can also handle the most common commercial formats, such as SPSS, Stata, and SAS, out of the box.
Minitab files can be imported with the read.mtp() function from the foreign library.
It is also possible to use a URL, if the dataset comes from a website.
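For instance, a Minitab Portable Worksheet or a CSV file hosted on a website can be read as follows (a sketch only; the file name and URL are hypothetical placeholders):
library(foreign)                        # read.mtp() lives in the foreign package
minitab.data <- read.mtp("mydata.mtp")  # Minitab Portable Worksheet (placeholder name)
web.data <- read.csv("https://example.com/mydata.csv")  # dataset read directly from a URL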
Data come typically in the form of a collection of columns (for example, from a spreadsheet), each of which represents a variable.
A variable in a dataset can be classified either as categorical or numerical.
Categorical variables can be nominal or ordinal; dummy (or binary) variables are a special case.
Statistical data analysis comprises, among other things, descriptive statistics, hypothesis testing, regression modeling, and classification; all of these are touched on in this tutorial.
When importing a dataset, the preview window (top-left part of RStudio window) will also show the type of each variable (integer, double precision, etc.).
We can detect format problems of the variables in a dataset at this stage.
To display all datasets included in the standard distribution of R, input
data()
The dataset mtcars includes fuel consumption and 10 other aspects of car design and performance for 32 cars (1973-74 models).
We use the names() function to display the variables in the dataset:
names(mtcars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
[11] "carb"
The explanation of each variable can be found in the Help section (type the following command in the Console).
help(mtcars)
To extract the variable mpg from this dataset, and call it \( x \), we use the “$” symbol as follows
x <- mtcars$mpg
The default assignment operator is “<-” (less than sign followed by the hyphen symbol).
The symbol “<-” can also be typed with the keyboard shortcut Alt + - (Alt and the minus sign) in the RStudio console.
It is also possible to use the “=” sign, but “=” has lower precedence than “<-”.
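A quick illustration of the precedence difference (a sketch, not in the original):
x <- y <- 5       # works: both x and y receive the value 5
# x <- y = 5      # fails: parsed as (x <- y) = 5, because "=" binds less tightly than "<-"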
The main statistics can be obtained by using the summary() function:
summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.40 15.43 19.20 20.09 22.80 33.90
The function summary() can also be applied to a dataset
mtcars.select <- subset(mtcars, select = c(1:3,6))
summary(mtcars.select)
mpg cyl disp wt
Min. :10.40 Min. :4.000 Min. : 71.1 Min. :1.513
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.:2.581
Median :19.20 Median :6.000 Median :196.3 Median :3.325
Mean :20.09 Mean :6.188 Mean :230.7 Mean :3.217
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:3.610
Max. :33.90 Max. :8.000 Max. :472.0 Max. :5.424
The variance and the standard deviation must be computed separately:
var(x)
[1] 36.3241
sd(x)
[1] 6.026948
We can compute the standard deviation of every variable in the dataset at once with sapply():
sapply(mtcars, sd)
mpg cyl disp hp drat wt
6.0269481 1.7859216 123.9386938 68.5628685 0.5346787 0.9784574
qsec vs am gear carb
1.7869432 0.5040161 0.4989909 0.7378041 1.6152000
A contingency table of two categorical variables (here, the number of cylinders and of carburetors) can be produced with xtabs():
xtabs(~ cyl + carb, data = mtcars)
carb
cyl 1 2 3 4 6 8
4 5 6 0 0 0 0
6 2 0 0 4 1 0
8 0 4 3 6 0 1
The following code displays the relative frequencies of the variable \( x \):
x.freq <- table(x)
x.relfreq <- x.freq/length(x)
Comments on the code
The function table() allows us to generate a frequency table of each observation.
The function length() computes the number of observations of the variable \( x \) (length of a vector).
Vectorization, then, permits us to generate the vector, x.relfreq, of relative frequencies of the variable \( x \).
# x: mpg
x.relfreq
x
10.4 13.3 14.3 14.7 15 15.2 15.5 15.8 16.4
0.06250 0.03125 0.03125 0.03125 0.03125 0.06250 0.03125 0.03125 0.03125
17.3 17.8 18.1 18.7 19.2 19.7 21 21.4 21.5
0.03125 0.03125 0.03125 0.03125 0.06250 0.03125 0.06250 0.06250 0.03125
22.8 24.4 26 27.3 30.4 32.4 33.9
0.06250 0.03125 0.03125 0.03125 0.06250 0.03125 0.03125
By default, hist() plots the frequencies (counts) of the observations:
hist(x)
The number of bins can be suggested with breaks = n, where \( n \) is an integer (R treats this value as a suggestion and may adjust it):
par(mfrow = c(1,2))
hist(x)
hist(x, breaks = 36)
par(mfrow = c(1,1))
The number of bins, by default, is given by Sturges' formula, although Scott's rule or the Freedman-Diaconis rule can be used as well.
hist(x, breaks = "Scott")
hist(x, breaks = "FD")
Changing the number of breaks may make the histogram look very different, so caution should be taken when interpreting the shape of a histogram with non-default settings.
More details can be found in Hyndman (1995).
To display the relative frequencies (density scale), we pass the optional argument probability = TRUE:
hist(x,probability = TRUE)
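A kernel density estimate can be overlaid on this histogram (a small addition, not in the original):
lines(density(x), lwd = 2)   # adds a smooth density curve on top of the probability histogram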
Boxplots are another tool to visualize the descriptive robust statistics:
boxplot(x)
The “box” in a boxplot spans the interquartile range (IQR): the lower edge corresponds to the first quartile, \( Q_1 \), while the upper edge corresponds to the third quartile, \( Q_3 \).
The thick line is the median, and the “whiskers” correspond to \[ Q_3 + 1.5\cdot IQR \qquad \text{(upper whisker)} \] and \[ Q_1 - 1.5\cdot IQR \qquad \text{(lower whisker)}. \]
A “notch” can be added to a boxplot; it can be interpreted as a confidence interval of the median, according to the formula \[ M \pm 1.57 \times IQR/\sqrt{n} \] (Chambers et al, 1983)
If two notches do not overlap, this is strong evidence that a similar difference would be observed in other sets of data collected under similar circumstances.
In the following slide we plot the boxplot of the mileage, as a function of the number of cylinders.
boxplot(mtcars$mpg ~ mtcars$cyl, notch=T)
Values which are either greater than the upper whisker or lower than the lower whisker can be considered outliers and are graphically represented by small circles.
Sometimes, it may be convenient to consider what may be called extreme outliers, which are observed values which are greater than \( Q_3 + 3\cdot IQR \) or lower than \( Q_1 - 3\cdot IQR \).
Extreme outliers may be the result of mistyping values observed in the field, or some other sort of errors.
In other situations, outliers can be defined as values lying two, or perhaps three, standard deviations from the mean.
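A minimal sketch (not in the original) of flagging outliers with the 1.5·IQR rule defined above:
q <- quantile(x, c(.25, .75))          # first and third quartiles of mpg
fence.low  <- q[1] - 1.5 * IQR(x)      # lower whisker limit
fence.high <- q[2] + 1.5 * IQR(x)      # upper whisker limit
x[x < fence.low | x > fence.high]      # observations flagged as outliers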
We test the hypothesis that the mean of the variable mpg (dataset mtcars) is greater than 18 miles per gallon:
We test \[ H_0:\ \mu = 18 \] against the alternative \[ H_1:\ \mu > 18 \] at \( \alpha = .05 \). In the next slide we show the result.
x <- mtcars$mpg
t.test(x,mu=18, alternative = "greater")
One Sample t-test
data: x
t = 1.9622, df = 31, p-value = 0.02938
alternative hypothesis: true mean is greater than 18
95 percent confidence interval:
18.28418 Inf
sample estimates:
mean of x
20.09062
Since the p-value (0.029) is below \( \alpha = .05 \), we reject the null hypothesis and conclude that the mean \( \mu > 18 \).
Having rejected the null hypothesis in the previous slide, we now want to determine the power of the test, under the alternative hypothesis \[ H_a: \mu = 20 \] (\( \alpha=.05 \))
power.t.test(length(x), delta = 2, sd =sd(x), sig.level = .05, type = "one.sample", alternative = "one.sided")
One-sample t test power calculation
n = 32
delta = 2
sd = 6.026948
sig.level = 0.05
power = 0.5757901
alternative = one.sided
Hence, the probability of rejecting the null hypothesis when it is false (true mean equal to 20) is about \( .576 \).
The ethical aspect of a power analysis cannot be overlooked (Hunt 2015, p. 8, last-but-one paragraph).
We simulate two samples, of equal size \( N=100 \), drawn from a standard normal distribution and a uniform distribution between \( 0 \) and \( 1 \), respectively:
set.seed(123)
y1 <- rnorm(100)
y2 <- runif(100)
We test
\[ H_0:\ \mu_1 - \mu_2 = 0 \] against the alternative \[ H_1:\ \mu_1 - \mu_2 \neq 0 \] (\( \alpha = .05 \))
t.test(y1, y2, mu = 0)
Welch Two Sample t-test
data: y1 and y2
t = -4.1252, df = 119.37, p-value = 6.883e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.5855503 -0.2057413
sample estimates:
mean of x mean of y
0.09040591 0.48605169
By default, t.test() assumes unequal variances (Welch's test).
To test the difference of two means with equal variances:
t.test(y1,y2, mu=0, var.equal = T)
To check whether \( \sigma_1^2 = \sigma_2^2 \):
var.test(y1,y2, ratio = 1)
F test to compare two variances
data: y1 and y2
F = 9.6181, num df = 99, denom df = 99, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
6.47148 14.29479
sample estimates:
ratio of variances
9.618132
Since the p-value is far below .05, we reject the hypothesis of equal variances, as expected given the two different distributions. We now compute the power of the two-sample test:
power.t.test(length(y1), delta = .5, sd = sqrt(var(y1)+var(y2)), sig.level = .05, type = "two.sample", alternative = "two.sided")
Two-sample t test power calculation
n = 100
delta = 0.5
sd = 0.9590956
sig.level = 0.05
power = 0.9562207
alternative = two.sided
NOTE: n is number in *each* group
Hence, the probability of rejecting the null hypothesis when it is false (true difference in means equal to .5) is about \( .956 \).
power.t.test(delta = 1, sd = 2, power = .99, sig.level = .05)
Two-sample t test power calculation
n = 147.9478
delta = 1
sd = 2
sig.level = 0.05
power = 0.99
alternative = two.sided
NOTE: n is number in *each* group
We need at least 148 observations per group to achieve a power of \( .99 \) for a two-tailed test, with a significance level of \( \alpha=.05 \) and a true difference in means equal to 1.
power.t.test(delta = 1, sd = 2, power = .99, sig.level = .05, alternative = "one.sided")
Two-sample t test power calculation
n = 126.8465
delta = 1
sd = 2
sig.level = 0.05
power = 0.99
alternative = one.sided
NOTE: n is number in *each* group
We need at least 127 observations per group to achieve a power of \( .99 \) for a one-tailed test, with a significance level of \( \alpha=.05 \) and a true difference in means equal to 1.
Principal component analysis (PCA) is a non-parametric tool designed to study high-dimensional data; specifically, to extract the variables with the highest signal-to-noise ratio (Shlens, 2005).
It is based on the following assumptions.
Linearity: the relation between the variables is linear
The interesting variables have the largest variance
Mean and variance are sufficient statistics; if the data exhibit substantial skewness or kurtosis, PCA is inadequate.
The principal components are orthogonal
PCA is based on the Singular Value Decomposition of a certain rectangular matrix (inputs and outputs; cf., Shlens (2005), Section VI).
Performing PCA is straightforward.
Organize the observations in a data frame
Center the variables (columns of the data frame).
In some cases, it may also be necessary to scale the variables (divide each centered variable by its standard deviation, i.e., take the \( z \)-scores)
Apply the necessary linear algebra, or use a statistical (or numerical) package (a sketch using the SVD is shown below)
Consider the iris dataset, in the standard distribution of R.
head(iris, 3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
[5] "Species"
We will use the first four numerical variables in the dataset to determine how many components are necessary to describe the qualitative variable Species.
Compute the variance of the numerical variables
iris.var <- subset(iris, select = c(1:4))
names(iris.var)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
apply(iris.var, 2, var)
Sepal.Length Sepal.Width Petal.Length Petal.Width
0.6856935 0.1899794 3.1162779 0.5810063
Since the variances differ by more than an order of magnitude, scaling the variables is a good idea.
Decomposition into components:
iris.pca <- prcomp(iris.var, center = T, scale = T)
iris.pca
Standard deviations (1, .., p=4):
[1] 1.7083611 0.9560494 0.3830886 0.1439265
Rotation (n x k) = (4 x 4):
PC1 PC2 PC3 PC4
Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863
Sepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971
\[ PC_1 = .5210659\,\text{Sepal.Length} - .2693474\,\text{Sepal.Width} \] \[ \qquad \quad + .5804131\,\text{Petal.Length} + .5648565\,\text{Petal.Width} \]
Remarks
The other components are written in a similar way.
Notice that there is a sign ambiguity in the coefficients, as the loading vectors are eigenvectors of a certain matrix, and eigenvectors are determined only up to sign.
Each loading vector has norm equal to 1.
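As a cross-check (a sketch, not part of the original code), the same loadings and standard deviations can be recovered from the SVD of the centered and scaled data; signs may differ, as noted above.
X <- scale(iris.var, center = TRUE, scale = TRUE)  # centered and scaled variables
sv <- svd(X)
sv$v                       # right singular vectors: the loadings above, up to sign
sv$d / sqrt(nrow(X) - 1)   # standard deviations of the components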
Summary of the decomposition
summary(iris.pca)
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
The first two components explain most of the variability in the data.
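The proportions of variance can also be recovered directly from the component standard deviations (a small check, not in the original):
iris.pca$sdev^2 / sum(iris.pca$sdev^2)   # proportion of variance explained by each component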
Scree plot
plot(iris.pca, type = "l")
Conclusion
Only two of the four components capture most of the signal (variance) in the data.
The composition of first two components is
iris.pca$rotation[,1:2]
PC1 PC2
Sepal.Length 0.5210659 -0.37741762
Sepal.Width -0.2693474 -0.92329566
Petal.Length 0.5804131 -0.02449161
Petal.Width 0.5648565 -0.06694199
To visualize the first two components we can use the ggbiplot package, installed from GitHub:
library(devtools)
install_github("vqv/ggbiplot")
library(ggbiplot)
iris.species <- iris[,5]
g <- ggbiplot(iris.pca, obs.scale = 1, var.scale = 1,
groups = iris.species, ellipse = TRUE,
circle = TRUE)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal',
legend.position = 'top')
print(g)
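If installing a package from GitHub is not an option, base R's biplot() produces a simpler version of the same picture (an alternative, not used in the original):
biplot(iris.pca, scale = 0)   # observations and variable loadings in the PC1-PC2 plane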
The KNN (k-nearest neighbours) algorithm classifies data points based on certain characteristics (numerical or categorical variables).
It works on the assumption that data points of similar classes are closer to each other.
Closeness is measured by a distance, typically the Euclidean distance in the data space.
The KNN algorithm needs a training set.
To avoid ties, \( k \) is usually chosen to be odd.
The KNN algorithm is an example of supervised learning.
The variables are normalized \[ normalized.x = \frac{x - \min(x)}{\max(x) - \min(x)}. \] before splitting the data into training and testing sets.
The algorithm works best with numerical variables.
To normalize the variables we can use the following function
normal.fct <- function(x){
(x - min(x))/(max(x) - min(x))
}
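A quick sanity check of the function (not in the original):
normal.fct(c(2, 4, 6))   # returns 0.0 0.5 1.0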
We implement the algorithm on the iris dataset.
Normalize the variables
normal.iris <- as.data.frame(apply(iris.var, 2, normal.fct))
summary(normal.iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.:0.2222 1st Qu.:0.3333 1st Qu.:0.1017 1st Qu.:0.08333
Median :0.4167 Median :0.4167 Median :0.5678 Median :0.50000
Mean :0.4287 Mean :0.4406 Mean :0.4675 Mean :0.45806
3rd Qu.:0.5833 3rd Qu.:0.5417 3rd Qu.:0.6949 3rd Qu.:0.70833
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
Training set
set.seed(123)
train.row <- sample(1:nrow(iris), 0.75 * nrow(iris))
train.iris <- normal.iris[train.row,]
Testing set
test.iris <- normal.iris[-train.row,]
Classification variable in training set
target.class.iris <- iris[train.row,5]
Classification variable in testing set
test.class.iris <- iris[-train.row,5]
library(class)
iris.knn <- knn(train.iris, test.iris, cl = target.class.iris, k = 5)
tab <- table(iris.knn, test.class.iris)
tab
test.class.iris
iris.knn setosa versicolor virginica
setosa 12 0 0
versicolor 0 16 1
virginica 0 1 8
accuracy <- function(x){ sum(diag(x)) / sum(x) }
accuracy(tab)
[1] 0.9473684
The accuracy is about 95%.
It can be improved upon by, for example, trying other (odd) values of \( k \) or a different training/testing split, as sketched below.
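A sketch (not in the original) comparing the accuracy for several odd values of \( k \) on the same split:
for (k in c(1, 3, 5, 7, 9, 11)) {
  pred <- knn(train.iris, test.iris, cl = target.class.iris, k = k)
  cat("k =", k, "accuracy =", accuracy(table(pred, test.class.iris)), "\n")
}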
We want to estimate the model that explains the consumption of gasoline \[ mpg = \beta_0 + \beta_1 wt + \beta_2 cyl + u \] where \( mpg \) is the consumption of gasoline (miles/gallon); \( wt \), weight of the automobile (1000 lbs) and \( cyl \) the number of cylinders.
m1 <- lm(mpg ~ wt + cyl, data = mtcars)
summary(m1)
Call:
lm(formula = mpg ~ wt + cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.2893 -1.5512 -0.4684 1.5743 6.1004
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.6863 1.7150 23.141 < 2e-16 ***
wt -3.1910 0.7569 -4.216 0.000222 ***
cyl -1.5078 0.4147 -3.636 0.001064 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.568 on 29 degrees of freedom
Multiple R-squared: 0.8302, Adjusted R-squared: 0.8185
F-statistic: 70.91 on 2 and 29 DF, p-value: 6.809e-12
m1.summary <- summary(m1)
m1.summary$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.686261 1.7149840 23.140893 3.043182e-20
wt -3.190972 0.7569065 -4.215808 2.220200e-04
cyl -1.507795 0.4146883 -3.635972 1.064282e-03
m1.summary$coefficients[,1]
(Intercept) wt cyl
39.686261 -3.190972 -1.507795
m1.summary$coefficients[2,]
Estimate Std. Error t value Pr(>|t|)
-3.19097214 0.75690649 -4.21580760 0.00022202
m1.summary$r.squared
[1] 0.8302274
m1.summary$adj.r.squared
[1] 0.8185189
ifelse(pf(m1.summary$fstatistic[1], m1.summary$fstatistic[2],
m1.summary$fstatistic[3], lower.tail = F) < .05,
"The model is significant", "The model is not significant")
value
"The model is significant"
Estimate the mileage of a car with \( wt=2 \) (i.e., 2000 lbs) and \( cyl=6 \).
sum(coef(m1)*c(1,2,6))
[1] 24.25755
Compute the 99% confidence interval for the value estimated in the previous slide.
predict(m1, newdata = data.frame(wt=2,cyl=6),
interval = "confidence", level = .99)
fit lwr upr
1 24.25755 21.57263 26.94246
Compute the 99% prediction interval for the value estimated two slides above.
predict(m1, newdata = data.frame(wt=2,cyl=6),
interval = "prediction", level = .99)
fit lwr upr
1 24.25755 16.68829 31.8268
To determine, say, the 99% confidence intervals of the estimated coefficients we use the function confint():
confint(m1, level = .99)
0.5 % 99.5 %
(Intercept) 34.959104 44.413419
wt -5.277299 -1.104646
cyl -2.650836 -0.364754
To generate a table with the estimated slopes and related 98% confidence intervals
cbind("Est slopes" = coef(m1), confint(m1, level = .98))
Est slopes 1 % 99 %
(Intercept) 39.686261 35.463934 43.9085887
wt -3.190972 -5.054492 -1.3274522
cyl -1.507795 -2.528766 -0.4868235
Centering the regressors at their sample means changes only the intercept (which becomes the mean of mpg); the slopes are unchanged:
wt.mean = mtcars$wt-mean(mtcars$wt)
cyl.mean = mtcars$cyl-mean(mtcars$cyl)
m2 <- lm(mpg ~ wt.mean + cyl.mean , data = mtcars)
summary(m2)
Call:
lm(formula = mpg ~ wt.mean + cyl.mean, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.2893 -1.5512 -0.4684 1.5743 6.1004
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.0906 0.4539 44.264 < 2e-16 ***
wt.mean -3.1910 0.7569 -4.216 0.000222 ***
cyl.mean -1.5078 0.4147 -3.636 0.001064 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.568 on 29 degrees of freedom
Multiple R-squared: 0.8302, Adjusted R-squared: 0.8185
F-statistic: 70.91 on 2 and 29 DF, p-value: 6.809e-12
We can also estimate a log-log specification: \[ \log(mpg) = \beta_0 + \beta_1 \log(wt) + \beta_2 \log(cyl) + u. \]
m.log <- lm(log(mpg) ~ log(wt) + log(cyl), data = mtcars)
summary(m.log)
Call:
lm(formula = log(mpg) ~ log(wt) + log(cyl), data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-0.20640 -0.07434 -0.02087 0.10690 0.23144
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.2229 0.1292 32.672 < 2e-16 ***
log(wt) -0.5627 0.1119 -5.031 2.33e-05 ***
log(cyl) -0.3566 0.1149 -3.103 0.00424 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1176 on 29 degrees of freedom
Multiple R-squared: 0.8541, Adjusted R-squared: 0.844
F-statistic: 84.88 on 2 and 29 DF, p-value: 7.568e-13
Finally, we add quadratic terms in both regressors: \[ mpg = \beta_0 + \beta_1 wt + \beta_2 wt^2 + \beta_3 cyl + \beta_4 cyl^2 + u \]
m3 <- lm(mpg ~ wt + I(wt^2) + cyl + I(cyl^2), data = mtcars)
summary(m3)
Call:
lm(formula = mpg ~ wt + I(wt^2) + cyl + I(cyl^2), data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.9949 -1.5565 -0.6359 1.8183 5.4899
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.1094 8.8249 5.791 3.67e-06 ***
wt -8.8677 2.9264 -3.030 0.00533 **
I(wt^2) 0.7561 0.3789 1.996 0.05616 .
cyl -2.5445 3.4215 -0.744 0.46348
I(cyl^2) 0.1143 0.2773 0.412 0.68355
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.431 on 27 degrees of freedom
Multiple R-squared: 0.8583, Adjusted R-squared: 0.8373
F-statistic: 40.9 on 4 and 27 DF, p-value: 4.388e-11
We can now perform diagnostics on the model m1, starting with the residuals-versus-fitted plot:
plot(m1, 1)
It suggests that some nonlinearities may be present in the model.
library(lmtest)
resettest(m1, type = "fitted") # default
RESET test
data: m1
RESET = 3.6267, df1 = 2, df2 = 27, p-value = 0.04026
Since the p-value (0.040) is below .05, the RESET test also points to possible misspecification. Next, the QQ-plot of the residuals:
plot(m1,2)
shapiro.test(m1$residuals)
Shapiro-Wilk normality test
data: m1$residuals
W = 0.93745, p-value = 0.06341
The residuals are borderline normal: with p = 0.063 we do not reject normality at the 5% level. The plot also highlights some potentially troublesome observations.
plot(m1,3)
bptest(m1)
studentized Breusch-Pagan test
data: m1
BP = 3.2548, df = 2, p-value = 0.1964
We do not reject homoscedasticity (p = 0.196): the model is homoscedastic.
plot(m1,4)
dwtest(m1)
Durbin-Watson test
data: m1
DW = 1.6711, p-value = 0.1335
alternative hypothesis: true autocorrelation is greater than 0
There is no evidence of autocorrelation (p = 0.134).
The logit model is based on the logistic equation \[ y = \frac{1}{1+\exp[-(\beta_0 + \beta_1 x)]} \]
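The logistic function maps any real number into the interval \( (0, 1) \); a quick plot of the curve (a sketch, not in the original):
curve(1 / (1 + exp(-x)), from = -6, to = 6, ylab = "y")   # the logistic curve with beta0 = 0, beta1 = 1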
Download the data from the following URL:
mydata <-read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
head(mydata)
admit gre gpa rank
1 0 380 3.61 3
2 1 660 3.67 3
3 1 800 4.00 1
4 1 640 3.19 4
5 0 520 2.93 4
6 1 760 3.00 2
The dataset consists of four variables:
The variables \( gre \), \( gpa \) will be considered continuous; \( admit \) is binary and \( rank \) is a categorical (or ordinal) variable.
mydata$rank <- factor(mydata$rank)
logit.model <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(logit.model)
Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial",
data = mydata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6268 -0.8662 -0.6388 1.1490 2.0790
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979 1.139951 -3.500 0.000465 ***
gre 0.002264 0.001094 2.070 0.038465 *
gpa 0.804038 0.331819 2.423 0.015388 *
rank2 -0.675443 0.316490 -2.134 0.032829 *
rank3 -1.340204 0.345306 -3.881 0.000104 ***
rank4 -1.551464 0.417832 -3.713 0.000205 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 458.52 on 394 degrees of freedom
AIC: 470.52
Number of Fisher Scoring iterations: 4
The logistic regression coefficients give the change in the log-odds of the dependent variable when the control (independent) variable changes by one unit.
For example, having graduated from the lowest-ranked group of universities (rank 4) instead of a top-ranked one (rank 1) decreases the log-odds of admission to a post-graduate program by \( 1.551464 \).
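Exponentiating converts a change in log-odds into an odds ratio (a small check; the same value reappears in the exp(coef()) output below):
exp(-1.551464)   # about 0.212: the odds of admission are multiplied by roughly 0.21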
The estimated logit fit is compared to a model with only the intercept (null model). \[ H_0: \ \text{The null model is a better fit}\ \ \ \ \] \[ H_1: \ \text{The null model is not a better fit} \] The test statistic has a \( \chi^2_k \) distribution, with \( k \) equal to the number of predictors in the model (the difference in degrees of freedom between the two models).
with(logit.model, pchisq(null.deviance - deviance,
df.null - df.residual, lower.tail = FALSE))
[1] 7.578194e-08
We reject \( H_0 \) and conclude that logit.model is a better fit than the null model.
Exponentiating the coefficients gives the odds ratios. First, only the point estimates:
exp(coef(logit.model))
(Intercept) gre gpa rank2 rank3 rank4
0.0185001 1.0022670 2.2345448 0.5089310 0.2617923 0.2119375
Together with their 98% confidence intervals (for illustration purposes):
exp(cbind(OR = coef(logit.model), confint(logit.model, level = .98)))
OR 1 % 99 %
(Intercept) 0.0185001 0.001218675 0.2494547
gre 1.0022670 0.999742603 1.0048609
gpa 2.2345448 1.041853560 4.9042225
rank2 0.5089310 0.241856450 1.0607886
rank3 0.2617923 0.115436498 0.5791447
rank4 0.2119375 0.076797072 0.5444075
Predict the probability of admission to a post-graduate program for each value of rank, with \( gpa \) and \( gre \) fixed at their mean values.
mydata1 <- with(mydata, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
mydata1$rankProb <- predict(logit.model, newdata = mydata1, type = "response")
mydata1
gre gpa rank rankProb
1 587.7 3.3899 1 0.5166016
2 587.7 3.3899 2 0.3522846
3 587.7 3.3899 3 0.2186120
4 587.7 3.3899 4 0.1846684
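As a check (a sketch, not in the original), the rank = 1 probability can be reproduced by hand from the estimated coefficients and the logistic formula:
eta <- sum(coef(logit.model) * c(1, mean(mydata$gre), mean(mydata$gpa), 0, 0, 0))  # linear predictor, rank = 1
1 / (1 + exp(-eta))   # matches the rank = 1 probability above (about 0.517)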
Chambers, John M., William S. Cleveland, Beat Kleiner, and Paul A. Tukey (1983) “Comparing Data Distributions.” In Graphical Methods for Data Analysis, 62. Belmont, California: Wadsworth International Group
Hunt, A (2015) “A Researcher’s Guide to Power Analysis”, https://research.usu.edu//irb/wp-content/uploads/sites/12/2015/08/A_Researchers_Guide_to_Power_Analysis_USU.pdf Retrieved on October 13, 2019
Hyndman, R.J. (1995) “The problem with Sturges’ rule for constructing histograms”, https://robjhyndman.com/papers/sturges.pdf Retrieved on October 13, 2019
Shlens, J. (2005) “A Tutorial on Principal Component Analysis”. Copy retrieved [04-09-2008] from http://www.cs.cmu.edu/~elaw/papers/pca.pdf