Introduction to R(Studio)

Lino AA Notarantonio (lino@tec.mx)
14/10/2019

Table of Contents

  1. Install R and RStudio (Before the workshop)
  2. Import data (Before the workshop)
  3. Extract variables from a dataset
  4. Descriptive statistics
  5. Statistical inference
  6. PCA Algorithm
  7. KNN Algorithm
  8. Linear regression
  9. Logit regression
  10. References

Install R and RStudio

Windows: right-click to open this installation tutorial online

Mac OS X: right-click to open this installation tutorial online

RStudio is an IDE (Integrated Development Environment), that is, a front end for interacting more conveniently with R. To work properly, R must be installed on your system.

Import data

Right-click to open this tutorial on importing data.

The standard install of R(Studio) can also handle the most common commercial formats, such as SPSS, Stata, and SAS, out of the box.

Minitab files can also be imported with the read.mtp() function from the foreign library.

It is also possible to use a URL, if the dataset comes from a website.
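
For example, a minimal sketch (the CSV file below is also used later in this workshop; the Minitab file name is purely illustrative):

# read a CSV file directly from a URL
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
head(mydata)

# read a Minitab portable worksheet with the foreign package
# library(foreign)
# my.mtp <- read.mtp("example.mtp")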

Data

Data come typically in the form of a collection of columns (for example, from a spreadsheet), each of which represents a variable.

A variable in a dataset can be classified either as categorical or numerical.

Among categorical variables, we find dummy (or binary) variables.
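
As a small illustration (a toy data frame, not used elsewhere in this workshop):

str(data.frame(height = c(1.70, 1.82, 1.65),      # numerical
               group  = factor(c("A", "B", "A")), # categorical
               smoker = c(TRUE, FALSE, TRUE)))    # dummy (binary)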

Statistical Data Analysis

Statistical data analysis comprises

  • Descriptive statistics
  • Exploratory Data Analysis (EDA)
  • Statistical inference
  • Supervised and unsupervised learning
  • Regression analysis
  • Logit regressions

among others.

Handling Data

When importing a dataset, the preview window (top-left part of RStudio window) will also show the type of each variable (integer, double precision, etc.).

We can detect format problems of the variables in a dataset at this stage.
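
The variable types can also be checked from the console; a quick sketch using the built-in mtcars dataset:

str(mtcars)            # compact display of the type of each variable
sapply(mtcars, class)  # classes of all variables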

To display all datasets included in the standard distribution of R, input

data()

Extract Variables from a Dataset

The dataset mtcars includes fuel consumption and 10 other aspects of car design and performance for 32 cars (1973-74 models).

We use the names() function to display the variables in the dataset:

names(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"

Extract Variables from a Dataset

The explanation of each variable can be found in the Help section (type the following command in the Console).

help(mtcars)

To extract the variable mpg from this dataset, and call it \( x \), we use the “$” symbol as follows

x <- mtcars$mpg

The default assignment operator is “<-” (less than sign followed by the hyphen symbol).

The symbol “<-” can also be typed with the keyboard shortcut “Alt” + “-” in RStudio.

It is also possible to use the “=” sign, but “=” has lower precedence than “<-”.
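
A minimal sketch of both assignment forms; the commented line shows the kind of expression where the difference in precedence matters:

x <- 5        # preferred assignment operator
y = 5         # also works at the top level
# x <- y = 5  # not parsed as intended, because "=" binds less tightly than "<-"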

Descriptive Statistics

The main statistics can be obtained by using the summary() function:

summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.43   19.20   20.09   22.80   33.90 

Descriptive Statistics

The function summary() can also be applied to a dataset

mtcars.select <- subset(mtcars, select = c(1:3,6))
summary(mtcars.select)
      mpg             cyl             disp             wt       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   :1.513  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.:2.581  
 Median :19.20   Median :6.000   Median :196.3   Median :3.325  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :3.217  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:3.610  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :5.424  

Descriptive Statistics

The variance and the standard deviation must be computed separately:

var(x)
[1] 36.3241
sd(x)
[1] 6.026948

Descriptive Statistics

We can use vectorization to compute the standard deviation of the full dataset:

sapply(mtcars, sd)
        mpg         cyl        disp          hp        drat          wt 
  6.0269481   1.7859216 123.9386938  68.5628685   0.5346787   0.9784574 
       qsec          vs          am        gear        carb 
  1.7869432   0.5040161   0.4989909   0.7378041   1.6152000 

Contingency Table

xtabs(~ cyl + carb, data = mtcars)
   carb
cyl 1 2 3 4 6 8
  4 5 6 0 0 0 0
  6 2 0 0 4 1 0
  8 0 4 3 6 0 1

Relative Frequencies

The following code displays the relative frequencies of the variable \( x \):

x.freq <- table(x)
x.relfreq <- x.freq/length(x)

Comments on the code

The function table() allows us to generate a frequency table of each observation.

The function length() computes the number of observations of the variable \( x \) (length of a vector).

Vectorization then permits us to generate the vector x.relfreq of relative frequencies of the variable \( x \).

Relative Frequencies

# x: mpg
x.relfreq
x
   10.4    13.3    14.3    14.7      15    15.2    15.5    15.8    16.4 
0.06250 0.03125 0.03125 0.03125 0.03125 0.06250 0.03125 0.03125 0.03125 
   17.3    17.8    18.1    18.7    19.2    19.7      21    21.4    21.5 
0.03125 0.03125 0.03125 0.03125 0.06250 0.03125 0.06250 0.06250 0.03125 
   22.8    24.4      26    27.3    30.4    32.4    33.9 
0.06250 0.03125 0.03125 0.03125 0.06250 0.03125 0.03125 

Histograms

By default, histograms show the frequency (count) of the observations:

hist(x)

[Figure: frequency histogram of x]

Histograms

The number of bins can be specified with breaks = n, where \( n \) is an integer:

par(mfrow = c(1,2))
hist(x)
hist(x, breaks = 36)
par(mfrow = c(1,1))

Histograms

[Figure: histograms of x with the default breaks and with breaks = 36, side by side]

Histograms

The number of bins is, by default, given by Sturges's formula, though it is possible to use Scott's rule or the Freedman-Diaconis rule as well.

hist(x, breaks = "Scott")
hist(x, breaks = "FD")

Changing the number of breaks may make the histogram look very different, so caution should be taken when interpreting the shape of a histogram with non-default settings.

See Hyndman (1995) for more on this.

Histograms

To display the relative frequencies (on the density scale), we pass the optional argument probability = TRUE:

hist(x,probability = TRUE)

[Figure: density histogram of x]

Boxplots

Boxplots are another tool to visualize robust descriptive statistics:

boxplot(x)

[Figure: boxplot of x]

Boxplots

The “box” in a boxplot spans the interquartile range (IQR): the lower edge corresponds to the first quartile, \( Q_1 \), while the upper edge corresponds to the third quartile, \( Q_3 \).

The thick line is the median, and the “whiskers” correspond to \[ Q_3 + 1.5\cdot IQR \qquad \text{(upper whisker)} \] and \[ Q_1 - 1.5\cdot IQR \qquad \text{(lower whisker).} \]
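
As a quick check, the quartiles, the IQR and the whisker fences of \( x \) can be computed directly:

q <- quantile(x, c(.25, .75))  # first and third quartiles
iqr <- IQR(x)                  # interquartile range, q[2] - q[1]
q[1] - 1.5 * iqr               # lower whisker fence
q[2] + 1.5 * iqr               # upper whisker fence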

Boxplots: Notch

A “notch” can be added to a boxplot; it can be interpreted as a confidence interval for the median, according to the formula \[ M \pm 1.57 \times IQR/\sqrt{n} \] (Chambers et al., 1983)

If the notches of two boxplots do not overlap, this is strong evidence that the two medians differ; a similar difference could be expected in other sets of data collected under similar circumstances.

In the following slide we plot the boxplot of the mileage, as a function of the number of cylinders.

Boxplots: Notch

boxplot(mtcars$mpg ~ mtcars$cyl, notch=T)

[Figure: notched boxplots of mpg by number of cylinders]

Outliers

Values which are either greater than the upper whisker or lower than the lower whisker can be considered outliers and are graphically represented by small circles.

Sometimes it is convenient to consider so-called extreme outliers: observed values greater than \( Q_3 + 3\cdot IQR \) or lower than \( Q_1 - 3\cdot IQR \).

Extreme outliers may be the result of mistyping values observed in the field, or some other sort of errors.

In other situations, outliers can be defined as values lying two, or perhaps three, standard deviations from the mean.
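
A sketch of both rules applied to \( x \):

boxplot.stats(x)$out              # observations beyond the whiskers (may be empty)
x[abs(x - mean(x)) > 3 * sd(x)]   # more than 3 standard deviations from the mean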

Statistical Inference

We test the hypothesis that the mean of the variable mpg (dataset mtcars) is greater than 18 (miles per gallon):

We test \[ H_0:\ \mu = 18 \] against the alternative \[ H_1:\ \mu > 18 \] at \( \alpha = .05 \). In the next slide we show the result.

Statistical Inference

x <- mtcars$mpg
t.test(x,mu=18, alternative = "greater")

    One Sample t-test

data:  x
t = 1.9622, df = 31, p-value = 0.02938
alternative hypothesis: true mean is greater than 18
95 percent confidence interval:
 18.28418      Inf
sample estimates:
mean of x 
 20.09062 

We reject the null hypothesis, so we conclude that the mean \( \mu > 18 \).

Power of the Test

Having rejected the null hypothesis in the previous slide, we now want to determine the power of the test, under the alternative hypothesis \[ H_a: \mu = 20 \] (\( \alpha=.05 \))

Power of the Test

power.t.test(length(x), delta = 2, sd =sd(x), sig.level = .05, type = "one.sample", alternative = "one.sided")

     One-sample t test power calculation 

              n = 32
          delta = 2
             sd = 6.026948
      sig.level = 0.05
          power = 0.5757901
    alternative = one.sided

Hence, we reject the null hypothesis when it is false with a probability of .576.

The ethical aspect of a power analysis cannot be overlooked (Hunt 2015, p. 8, last-but-one paragraph).

Statistical Inference (Two Populations)

We simulate two samples, of equal size \( N=100 \), drawn from a standard normal distribution and a uniform distribution between \( 0 \) and \( 1 \), respectively:

set.seed(123)
y1 <- rnorm(100)
y2 <- runif(100)

Statistical Inference (Two Populations)

We test

\[ H_0:\ \mu_1 - \mu_2 = 0 \] against the alternative \[ H_1:\ \mu_1 - \mu_2 \neq 0 \] (\( \alpha = .05 \))

Statistical Inference (Two Populations)

t.test(y1, y2, mu = 0)

    Welch Two Sample t-test

data:  y1 and y2
t = -4.1252, df = 119.37, p-value = 6.883e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.5855503 -0.2057413
sample estimates:
 mean of x  mean of y 
0.09040591 0.48605169 

Default: unequal variances.

Statistical Inference (Two Populations)

To test the difference of two means with equal variances:

t.test(y1,y2, mu=0, var.equal = T)

Statistical Inference (Two Populations)

To check whether \( \sigma_1^2 = \sigma_2^2 \):

var.test(y1,y2, ratio = 1)

    F test to compare two variances

data:  y1 and y2
F = 9.6181, num df = 99, denom df = 99, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
  6.47148 14.29479
sample estimates:
ratio of variances 
          9.618132 

Power of the Test

power.t.test(length(y1), delta = .5, sd =sqrt(var(y1)+var(y2)), sig.level = .05, type = "two.sample", alternative = "two.sided")

     Two-sample t test power calculation 

              n = 100
          delta = 0.5
             sd = 0.9590956
      sig.level = 0.05
          power = 0.9562207
    alternative = two.sided

NOTE: n is number in *each* group

Hence, we reject the null hypothesis when it is false with a probability equal to \( .9562207 \).

Power of the Test: Application to experimental design

power.t.test(delta = 1, sd = 2, power = .99, sig.level = .05)

     Two-sample t test power calculation 

              n = 147.9478
          delta = 1
             sd = 2
      sig.level = 0.05
          power = 0.99
    alternative = two.sided

NOTE: n is number in *each* group

We need at least 148 observations per group to achieve a power of \( .99 \) in a two-tailed test, with a significance level of \( \alpha=.05 \) and a true difference in means equal to 1.

Power of the Test: Application to experimental design

power.t.test(delta = 1, sd = 2, power = .99, sig.level = .05, alternative = "one.sided")

     Two-sample t test power calculation 

              n = 126.8465
          delta = 1
             sd = 2
      sig.level = 0.05
          power = 0.99
    alternative = one.sided

NOTE: n is number in *each* group

We need at least 127 observations per group to achieve a power of \( .99 \) in a one-tailed test, with a significance level of \( \alpha=.05 \) and a true difference in means equal to 1.

Principal Component Analysis (PCA)

Principal component analysis is a non-parametric tool designed to study high-dimensional data; specifically, to extract the components with the highest signal-to-noise ratio (Shlens, 2005).

It is based on the following assumptions.

  1. Linearity: the relation between the variables is linear

  2. The interesting variables have the largest variance

  3. Mean and variance are sufficient statistics; if the data exhibit marked skewness or kurtosis, PCA may be inadequate.

  4. The principal components are orthogonal

PCA is based on the Singular Value Decomposition of a certain rectangular matrix (inputs and outputs; cf., Shlens (2005), Section VI).
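
The connection with the SVD can be illustrated directly; a minimal sketch on a few numerical variables of mtcars (any numeric data matrix would do):

X <- scale(mtcars[, c("mpg", "disp", "hp", "wt")])  # centered and scaled data matrix
X.svd <- svd(X)
X.svd$d^2 / (nrow(X) - 1)   # variances of the principal components
head(X %*% X.svd$v)         # scores (the principal components themselves)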

Principal Component Analysis (PCA)

Performing PCA is straightforward.

  1. Organize the observations in a dataframe

  2. Center the variables (columns of the dataframe).

  3. In some cases, it may be necessary to scale the variables (divide each centered variable by its standard deviation, i.e., take its \( z \)-scores)

  4. Apply the necessary linear algebra, or use a statistical (or numerical) package (a sketch by hand follows this list)
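
A sketch of these steps carried out by hand with base R (the following slides use prcomp() instead):

df <- iris[, 1:4]                            # step 1: numerical observations
Z  <- scale(df, center = TRUE, scale = TRUE) # steps 2-3: center and scale
Z.eig <- eigen(cov(Z))                       # step 4: eigendecomposition of the correlation matrix
Z.eig$values                                 # variances of the components
Z.eig$vectors                                # loadings (defined up to sign)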

Applying PCA

Consider the iris dataset, in the standard distribution of R.

head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
[5] "Species"     

We will use the first four numerical variables in the dataset to determine how many components are necessary to describe the qualitative variable Species.

Applying PCA

Compute the variance of the numerical variables

iris.var <- subset(iris, select = c(1:4))
names(iris.var)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
apply(iris.var, 2, var)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
   0.6856935    0.1899794    3.1162779    0.5810063 

Scaling the variables is probably a good idea.

Applying PCA

Decomposition into components:

iris.pca <- prcomp(iris.var, center = T, scale = T)
iris.pca
Standard deviations (1, .., p=4):
[1] 1.7083611 0.9560494 0.3830886 0.1439265

Rotation (n x k) = (4 x 4):
                    PC1         PC2        PC3        PC4
Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971

\[ PC_1 = .5210659\, Sepal.Length - .2693474\, Sepal.Width + .5804131\, Petal.Length + .5648565\, Petal.Width \]

Applying PCA

Remarks

  • The other components are written in a similar way.

  • Notice that there is a sign ambiguity in the coefficients: the loadings are eigenvectors of (the correlation matrix of) the data, and eigenvectors are determined only up to sign.

  • The coefficients of each component form a vector of (Euclidean) norm equal to 1 (verified below).
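
A quick check of the last remark:

colSums(iris.pca$rotation^2)   # each column of loadings has unit norm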

Applying PCA

Summary of the decomposition

summary(iris.pca)
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion  0.7296 0.9581 0.99482 1.00000

The first two components explain most of the variability of the data.

Applying PCA

Scree plot

plot(iris.pca, type = "l")

[Figure: scree plot of iris.pca]

Applying PCA

Conclusion

  • Only two of the four components are needed to capture most of the signal

  • The composition of first two components is

iris.pca$rotation[,1:2]
                    PC1         PC2
Sepal.Length  0.5210659 -0.37741762
Sepal.Width  -0.2693474 -0.92329566
Petal.Length  0.5804131 -0.02449161
Petal.Width   0.5648565 -0.06694199

Applying PCA

Conclusion

The biplot below uses the ggbiplot package, which can be installed from GitHub:

library(devtools)
install_github("vqv/ggbiplot")

Applying PCA

Conclusion

library(ggbiplot)
iris.species <- iris[, 5]
g <- ggbiplot(iris.pca, obs.scale = 1, var.scale = 1, 
              groups = iris.species, ellipse = TRUE, 
              circle = TRUE)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal', 
               legend.position = 'top')
print(g)

Applying PCA

Conclusion

[Figure: PCA biplot of the iris data, grouped by Species]

KNN (K-nearest Neighbors) Algorithm

  • The KNN algorithm groups data depending on certain characteristics (numerical, or categorical, variables).

  • It works on the assumption that data points of similar classes are closer to each other.

  • Closeness is measured by a distance, typically the Euclidean distance in the data space.

  • The KNN algorithm needs a training set.

    • For a given observation, the algorithm assigns probability 1/k to each of its k closest training points and 0 to all other points.
    • Then the number of points of each class among those k points is counted, and the observation is assigned the class with the largest count.
  • To avoid ties, \( k \) is usually chosen to be odd.

The KNN algorithm is an example of supervised learning.

KNN (K-nearest Neighbors) Algorithm: Implementation

  1. The variables are normalized \[ normalized.x = \frac{x - \min(x)}{\max(x) - \min(x)}. \] before splitting the data into training and testing sets.

  2. The algorithm works best with numerical variables.

To normalize the variables we can use the following function

normal.fct <- function(x){
  (x - min(x))/(max(x) - min(x))
}

We implement the algorithm on the iris dataset.

KNN (K-nearest Neighbors) Algorithm: Implementation

Normalize the variables

normal.iris <- as.data.frame(apply(iris.var, 2, normal.fct))
summary(normal.iris)
  Sepal.Length     Sepal.Width      Petal.Length     Petal.Width     
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.2222   1st Qu.:0.3333   1st Qu.:0.1017   1st Qu.:0.08333  
 Median :0.4167   Median :0.4167   Median :0.5678   Median :0.50000  
 Mean   :0.4287   Mean   :0.4406   Mean   :0.4675   Mean   :0.45806  
 3rd Qu.:0.5833   3rd Qu.:0.5417   3rd Qu.:0.6949   3rd Qu.:0.70833  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  

KNN (K-nearest Neighbors) Algorithm: Implementation

Training set

set.seed(123)
train.row <- sample(1:nrow(iris), 0.75 * nrow(iris))
train.iris <- normal.iris[train.row,]

Testing set

test.iris <- normal.iris[-train.row,]

Classification variable in training set

target.class.iris <- iris[train.row,5]

Classification variable in testing set

test.class.iris <- iris[-train.row,5]

KNN (K-nearest Neighbors) Algorithm: Implementation

library(class)
iris.knn <- knn(train.iris, test.iris, cl = target.class.iris, k = 5)
tab <- table(iris.knn, test.class.iris)
tab
            test.class.iris
iris.knn     setosa versicolor virginica
  setosa         12          0         0
  versicolor      0         16         1
  virginica       0          1         8

KNN (K-nearest Neighbors) Algorithm: Conclusion

accuracy <- function(x){sum(diag(x)/(sum(rowSums(x))))}
accuracy(tab)
[1] 0.9473684
  • The accuracy is about 95%.

  • It can be improved upon by

    • either increasing the size of the training set
    • or increasing the value of \( k \) (or both), as sketched below.
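
For example, a sketch with a larger (odd) value of \( k \); the resulting accuracy depends on the random split:

iris.knn7 <- knn(train.iris, test.iris, cl = target.class.iris, k = 7)
accuracy(table(iris.knn7, test.class.iris))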

Linear regression

We want to estimate the model that explains fuel consumption \[ mpg = \beta_0 + \beta_1 wt + \beta_2 cyl + u \] where \( mpg \) is the fuel consumption (miles/gallon); \( wt \), the weight of the automobile (1000 lbs); and \( cyl \), the number of cylinders.

Linear regression: Summary of the result

m1 <- lm(mpg ~ wt + cyl, data = mtcars)
summary(m1)

Linear regression: Summary of the result


Call:
lm(formula = mpg ~ wt + cyl, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2893 -1.5512 -0.4684  1.5743  6.1004 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
wt           -3.1910     0.7569  -4.216 0.000222 ***
cyl          -1.5078     0.4147  -3.636 0.001064 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.568 on 29 degrees of freedom
Multiple R-squared:  0.8302,    Adjusted R-squared:  0.8185 
F-statistic: 70.91 on 2 and 29 DF,  p-value: 6.809e-12

Linear regression: Break down of the result

m1.summary <- summary(m1)
m1.summary$coefficients
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 39.686261  1.7149840 23.140893 3.043182e-20
wt          -3.190972  0.7569065 -4.215808 2.220200e-04
cyl         -1.507795  0.4146883 -3.635972 1.064282e-03
m1.summary$coefficients[,1]
(Intercept)          wt         cyl 
  39.686261   -3.190972   -1.507795 

Linear regression: Break down of the result

m1.summary$coefficients[2,]
   Estimate  Std. Error     t value    Pr(>|t|) 
-3.19097214  0.75690649 -4.21580760  0.00022202 
m1.summary$r.squared
[1] 0.8302274
m1.summary$adj.r.squared
[1] 0.8185189

Linear regression: Break down of the result

ifelse(pf(m1.summary$fstatistic[1], m1.summary$fstatistic[2], 
          m1.summary$fstatistic[3], lower.tail = F) < .05, 
       "The model is significant", "The model is not significant")
                     value 
"The model is significant" 

Linear regression: Estimation

Estimate the mileage of a car with \( wt=2 \) (that is, 2000 lbs) and \( cyl=6 \).

sum(coef(m1)*c(1,2,6))
[1] 24.25755

Linear regression: Prediction

Compute the 99% confidence interval for the value estimated in the previous slide.

predict(m1, newdata = data.frame(wt=2,cyl=6),
        interval = "confidence", level = .99)
       fit      lwr      upr
1 24.25755 21.57263 26.94246

Linear regression: Prediction

Compute the 99% prediction interval for the value estimated two slides above.

predict(m1, newdata = data.frame(wt=2,cyl=6),
        interval = "prediction", level = .99)
       fit      lwr     upr
1 24.25755 16.68829 31.8268

Linear regression: Confidence intervals

To determine, say, the 99% confidence intervals of the estimated coefficients, we use the function confint():

confint(m1, level = .99)
                0.5 %    99.5 %
(Intercept) 34.959104 44.413419
wt          -5.277299 -1.104646
cyl         -2.650836 -0.364754

Linear regression: Confidence intervals

To generate a table with the estimated slopes and related 98% confidence intervals

cbind("Est slopes" = coef(m1), confint(m1, level = .98))
            Est slopes       1 %       99 %
(Intercept)  39.686261 35.463934 43.9085887
wt           -3.190972 -5.054492 -1.3274522
cyl          -1.507795 -2.528766 -0.4868235

Linear regression: Centered controls

wt.mean = mtcars$wt-mean(mtcars$wt)
cyl.mean = mtcars$cyl-mean(mtcars$cyl)

Linear regression: Centered controls

m2 <- lm(mpg ~ wt.mean + cyl.mean , data = mtcars)
summary(m2)

Call:
lm(formula = mpg ~ wt.mean + cyl.mean, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2893 -1.5512 -0.4684  1.5743  6.1004 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  20.0906     0.4539  44.264  < 2e-16 ***
wt.mean      -3.1910     0.7569  -4.216 0.000222 ***
cyl.mean     -1.5078     0.4147  -3.636 0.001064 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.568 on 29 degrees of freedom
Multiple R-squared:  0.8302,    Adjusted R-squared:  0.8185 
F-statistic: 70.91 on 2 and 29 DF,  p-value: 6.809e-12

Linear regression: Constant elasticity models

\[ log(mpg) = \beta_0 + \beta_1 \log(wt) + \beta_2 \log(cyl) + u. \]

m.log <- lm(log(mpg) ~ log(wt) + log(cyl), data = mtcars)
summary(m.log)

Linear regression: Constant elasticity models


Call:
lm(formula = log(mpg) ~ log(wt) + log(cyl), data = mtcars)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.20640 -0.07434 -0.02087  0.10690  0.23144 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.2229     0.1292  32.672  < 2e-16 ***
log(wt)      -0.5627     0.1119  -5.031 2.33e-05 ***
log(cyl)     -0.3566     0.1149  -3.103  0.00424 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1176 on 29 degrees of freedom
Multiple R-squared:  0.8541,    Adjusted R-squared:  0.844 
F-statistic: 84.88 on 2 and 29 DF,  p-value: 7.568e-13

Linear regression: Quadratic models

\[ mpg = \beta_0 + \beta_1 wt + \beta_2 wt^2 + \beta_3 cyl + \beta_4 cyl^2 + u \]

m3 <- lm(mpg ~ wt + I(wt^2) + cyl + I(cyl^2), data = mtcars)
summary(m3)

Linear regression: Quadratic models


Call:
lm(formula = mpg ~ wt + I(wt^2) + cyl + I(cyl^2), data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9949 -1.5565 -0.6359  1.8183  5.4899 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  51.1094     8.8249   5.791 3.67e-06 ***
wt           -8.8677     2.9264  -3.030  0.00533 ** 
I(wt^2)       0.7561     0.3789   1.996  0.05616 .  
cyl          -2.5445     3.4215  -0.744  0.46348    
I(cyl^2)      0.1143     0.2773   0.412  0.68355    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.431 on 27 degrees of freedom
Multiple R-squared:  0.8583,    Adjusted R-squared:  0.8373 
F-statistic:  40.9 on 4 and 27 DF,  p-value: 4.388e-11

Linear regression: Diagnostics

  1. Model adequacy: RESET test
  2. Heteroskedasticity
  3. Robust estimations
  4. Normality of errors
  5. Correlation of errors

We can perform the diagnostics

  • by plotting the residuals
  • by using the library lmtest

Diagnostics: Nonlinearities

plot(m1, 1)

[Figure: residuals vs. fitted values for m1]

The plot suggests that some nonlinearities may be present in the model.

Diagnostics: Nonlinearities

library(lmtest)
resettest(m1, type = "fitted") # default

    RESET test

data:  m1
RESET = 3.6267, df1 = 2, df2 = 27, p-value = 0.04026

Diagnostics: Normality of errors

QQ-plot

plot(m1,2)

[Figure: normal Q-Q plot of the residuals of m1]

Diagnostics: Normality of errors

shapiro.test(m1$residuals)

    Shapiro-Wilk normality test

data:  m1$residuals
W = 0.93745, p-value = 0.06341

The residuals are just about normal (the p-value is slightly above .05). The Q-Q plot also highlights some possibly troublesome observations.

Diagnostics: Homoscedasticity

plot(m1,3)

[Figure: scale-location plot for m1]

Diagnostics: Homoscedasticity

bptest(m1)

    studentized Breusch-Pagan test

data:  m1
BP = 3.2548, df = 2, p-value = 0.1964

There is no evidence of heteroskedasticity: we fail to reject the hypothesis of homoscedasticity.

Diagnostics: Influential observations

plot(m1,4)

[Figure: Cook's distance plot for m1]

Diagnostics: Autocorrelation of errors

dwtest(m1)

    Durbin-Watson test

data:  m1
DW = 1.6711, p-value = 0.1335
alternative hypothesis: true autocorrelation is greater than 0

There is no evidence of autocorrelation in the errors.

Nonlinear regression: Logit model

The logit model is based on the logistic equation \[ y = \frac{1}{1+\exp[-(\beta_0 + \beta_1 x)]} \]

  1. Logistic models are estimated by maximizing the log-likelihood function.
  2. The estimated coefficients measure changes in the log-odds of the response variable; the logistic curve itself is sketched below.
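
A quick plot of the logistic curve (with \( \beta_0 = 0 \) and \( \beta_1 = 1 \)) for illustration:

curve(1 / (1 + exp(-x)), from = -6, to = 6,
      xlab = "linear predictor", ylab = "probability")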

Logit model: Estimation

Download the data from the following URL:

mydata <-read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
head(mydata)
  admit gre  gpa rank
1     0 380 3.61    3
2     1 660 3.67    3
3     1 800 4.00    1
4     1 640 3.19    4
5     0 520 2.93    4
6     1 760 3.00    2

Logit model: Estimation

The dataset consists of four variables:

  • admit: equal to 1 if the applicant was admitted to a post-graduate program, and 0 otherwise;
  • gre: GRE score;
  • gpa: GPA score;
  • rank: ranking from 1 to 4; \( rank=1 \), top university; \( rank=2 \), prestigious university; \( rank=3 \), medium-ranked university; and \( rank=4 \), low-ranked university.

The variables \( gre \) and \( gpa \) will be considered continuous; \( admit \) is binary and \( rank \) is a categorical (ordinal) variable.

Logit model: Estimation

mydata$rank <- factor(mydata$rank)
logit.model <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(logit.model)

Logit model: Estimation


Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
    data = mydata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6268  -0.8662  -0.6388   1.1490   2.0790  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.989979   1.139951  -3.500 0.000465 ***
gre          0.002264   0.001094   2.070 0.038465 *  
gpa          0.804038   0.331819   2.423 0.015388 *  
rank2       -0.675443   0.316490  -2.134 0.032829 *  
rank3       -1.340204   0.345306  -3.881 0.000104 ***
rank4       -1.551464   0.417832  -3.713 0.000205 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 458.52  on 394  degrees of freedom
AIC: 470.52

Number of Fisher Scoring iterations: 4

Logit model: Estimation

The logistic regression coefficients give the change in the log-odds of the dependent variable when the control (independent) variable changes by one unit.

For example, having graduated from a low-ranked university instead of a top university decreases the log-odds of admission to a post-graduate program by \( 1.551464 \).
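
Equivalently, on the odds scale (this matches the table of odds ratios shown later):

exp(coef(logit.model)["rank4"])   # roughly 0.21: the odds of admission are multiplied by about 0.21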

Logit model: Model fit

The estimated logit fit is compared to a model with only the intercept (the null model). \[ H_0: \ \text{The null model is a better fit} \] \[ H_1: \ \text{The null model is not a better fit} \] The test statistic has a \( \chi^2_k \) distribution, with \( k \) equal to the number of predictors in the model.

Logit model: Model fit

with(logit.model, pchisq(null.deviance - deviance, 
                         df.null - df.residual, lower.tail = FALSE))
[1] 7.578194e-08

We reject \( H_0 \) and conclude that logit.model is a better fit than the null model.

Logit model: Odds ratios

Only the estimated coefficients

exp(coef(logit.model))
(Intercept)         gre         gpa       rank2       rank3       rank4 
  0.0185001   1.0022670   2.2345448   0.5089310   0.2617923   0.2119375 

Logit model: Odds ratios (OR)

Together with the 98% confidence intervals (for illustration purposes):

exp(cbind(OR = coef(logit.model), confint(logit.model, level = .98)))
                   OR         1 %      99 %
(Intercept) 0.0185001 0.001218675 0.2494547
gre         1.0022670 0.999742603 1.0048609
gpa         2.2345448 1.041853560 4.9042225
rank2       0.5089310 0.241856450 1.0607886
rank3       0.2617923 0.115436498 0.5791447
rank4       0.2119375 0.076797072 0.5444075

Logit model: Probability prediction

Predict the probability of being admitted to a post-graduate program, when \( gpa \) and \( gre \) are equal to their mean values.

mydata1 <- with(mydata, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
mydata1$rankProb <- predict(logit.model, newdata = mydata1, type = "response")
mydata1
    gre    gpa rank  rankProb
1 587.7 3.3899    1 0.5166016
2 587.7 3.3899    2 0.3522846
3 587.7 3.3899    3 0.2186120
4 587.7 3.3899    4 0.1846684

References

Chambers, John M., William S. Cleveland, Beat Kleiner, and Paul A. Tukey (1983) “Comparing Data Distributions.” In Graphical Methods for Data Analysis, 62. Belmont, California: Wadsworth International Group

Hunt, A (2015) “A Researcher’s Guide to Power Analysis”, https://research.usu.edu//irb/wp-content/uploads/sites/12/2015/08/A_Researchers_Guide_to_Power_Analysis_USU.pdf Retrieved on October 13, 2019

Hyndman, R.J. (1995) “The problem with Sturges’ rule for constructing histograms”, https://robjhyndman.com/papers/sturges.pdf Retrieved on October 13, 2019

Shlens, J. (2005) “A Tutorial on Principal Component Analysis”. Copy retrieved [04-09-2008] from http://www.cs.cmu.edu/~elaw/papers/pca.pdf