Extended Example: Regression Analysis of Infections

The content of this webpage belongs to Lecture 2, activity 3.

First, download the txt file from Canvas and import it to this code.

infections <- read.table("PopInf.txt",header=FALSE)

Does our dataset have columns headers?

#View(infections)

We just noticed that our dataset do have headers. Compared to the example using the Examsquiz.txt, the PopInf data file has headers this time.

infections <- read.table("PopInf.txt",header=TRUE)
#View(infections)

Now, we stated that there is a header column in the data frame which is exactly what we want.

For the next step, we want to return the see if the class of our file is actually data frame.

class(infections)
[1] "data.frame"

This is the case! Further, we inspect the data frame infections using the head() function. head() takes the considered data frame (mandatory) as well as an optional number as inputs. If the number is omitted, head(df) returns the first 6 rows of the data frame together with the header.

head(infections)

If we add the additional number argument, the corresponding number of first rows is being returned. For example, let`s set the parameter to 10. The first 10 rows will be returned.

head(infections,10)

Let us consider a linear regresion model using the infection data. Define a variable lm1 as a linear regression model explaning how the population may or may not affect the number of infections.

The first argument in the lm() function is always the dependent variable, i.e. the variable that the model wants to predict. It is followed by a tilde sign (~) and the independent variables that are assumed to predict the dependent variable. For a simple linear regression model (SLR), the number of independent variables is 1.

In this case, we investigate if the dependent variable infections can be predicted using data of the parameter pop:

lm1<-lm(infections$infections~infections$pop)
lm1

Call:
lm(formula = infections$infections ~ infections$pop)

Coefficients:
   (Intercept)  infections$pop  
     6.275e+02       3.601e-03  

Hence, the corresponding linear model approach looks as follows: infections = 6.275e+02 + 3.601e-03 * pop

To get a deeper analysis, let us run a chunck of code returning a summary of the model.

summary(lm1)

Call:
lm(formula = infections$infections ~ infections$pop)

Residuals:
    Min      1Q  Median      3Q     Max 
-1242.9  -635.5  -537.3  -367.5  6085.0 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    6.275e+02  3.126e+02   2.007 0.054826 .  
infections$pop 3.601e-03  9.668e-04   3.725 0.000912 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1528 on 27 degrees of freedom
Multiple R-squared:  0.3394,    Adjusted R-squared:  0.315 
F-statistic: 13.87 on 1 and 27 DF,  p-value: 0.0009122

Important is the p-value of pop in the last column of the coefficients table. Assuming the null-hypothesis that there is no correlation between infection and pop, the p-value returns the probability that the coefficient of pop is 0, i.e. the correlation of infections and pop is 0. Since the p-value in this case (0.000912) is less than the significance level 0.05, the null hypothesis can be rejected.

Next, run a function returning the different attributes available for our linear regression model.

attributes(lm1)
$names
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"        

$class
[1] "lm"

Again, we print the values of the coefficients. This time by calling lm1$coefficients:

lm1$coefficients
   (Intercept) infections$pop 
  6.275345e+02   3.601114e-03 

The Intercept coefficient predicts the number of infections when the population amount is 0. In this case, that does not make sense, since in a there can’t be a positive number of infections in a population with 0 people. In a case-by-case analysis, it has to be analyzed individually if a setting with all independent variables being set to 0 makes sense.

The coefficient for pop indicates that for each population increase by 1 unit, the number of infections gets increased by 3.601114e-03.

Above, we just used the R language to analyze a simple linear regression problem(SLR). Let us know apply our basic knowledge of R to a multiple linear regression (MLR) scenario.

Let us define the function moreinfections by using the dataset available in Canvas, infections1.

moreinfections <- read.table("infections1.txt",header=TRUE)

Let us inspect the dataframe moreinfections.Let us inspect the dataframe moreinfections.

head(moreinfections)

Define a new variable lm2 that stores the linear regression model describing the number of infections as a function of ufo and pop and display the summary.

lm2<-lm(moreinfections$infections ~ moreinfections$ufo2010 + moreinfections$pop)
summary(lm2)

Call:
lm(formula = moreinfections$infections ~ moreinfections$ufo2010 + 
    moreinfections$pop)

Residuals:
    Min      1Q  Median      3Q     Max 
-1210.1  -595.9  -510.3  -192.3  6100.0 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)  
(Intercept)            6.187e+02  3.129e+02   1.977   0.0587 .
moreinfections$ufo2010 2.235e+01  2.267e+01   0.986   0.3332  
moreinfections$pop     9.281e-04  2.878e-03   0.322   0.7497  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1528 on 26 degrees of freedom
Multiple R-squared:  0.3632,    Adjusted R-squared:  0.3143 
F-statistic: 7.416 on 2 and 26 DF,  p-value: 0.002829

Again, investigating the p-values for the coefficients, we see that neither for ufo2010 (p-value 0.3332), nor for pop (p-value 0.7497) it can be implied that the coefficients are not 0! Hence, we cannot say with a significant certainty that there is a correlation between infections and both parameters. We fail to reject the null hypothesis that the coefficients for both parameters are 0.

LS0tCnRpdGxlOiAiSW4tY2xhc3MgYWN0aXZpdHkgMyIKb3V0cHV0OiBodG1sX25vdGVib29rCi0tLQoKKioqRXh0ZW5kZWQgRXhhbXBsZTogUmVncmVzc2lvbiBBbmFseXNpcyBvZiBJbmZlY3Rpb25zKioqIAoKVGhlIGNvbnRlbnQgb2YgdGhpcyB3ZWJwYWdlIGJlbG9uZ3MgdG8gTGVjdHVyZSAyLCBhY3Rpdml0eSAzLgoKCkZpcnN0LCBkb3dubG9hZCB0aGUgdHh0IGZpbGUgZnJvbSBDYW52YXMgYW5kIGltcG9ydCBpdCB0byB0aGlzIGNvZGUuCmBgYHtyfQppbmZlY3Rpb25zIDwtIHJlYWQudGFibGUoIlBvcEluZi50eHQiLGhlYWRlcj1GQUxTRSkKYGBgCgpEb2VzIG91ciBkYXRhc2V0IGhhdmUgY29sdW1ucyBoZWFkZXJzPwoKYGBge3J9CiNWaWV3KGluZmVjdGlvbnMpCmBgYAoKV2UganVzdCBub3RpY2VkIHRoYXQgb3VyIGRhdGFzZXQgZG8gaGF2ZSBoZWFkZXJzLiBDb21wYXJlZCB0byB0aGUgZXhhbXBsZSB1c2luZyB0aGUgRXhhbXNxdWl6LnR4dCwgdGhlIFBvcEluZiBkYXRhIGZpbGUgaGFzIGhlYWRlcnMgdGhpcyB0aW1lLgpgYGB7cn0KaW5mZWN0aW9ucyA8LSByZWFkLnRhYmxlKCJQb3BJbmYudHh0IixoZWFkZXI9VFJVRSkKI1ZpZXcoaW5mZWN0aW9ucykKYGBgCk5vdywgd2Ugc3RhdGVkIHRoYXQgdGhlcmUgaXMgYSBoZWFkZXIgY29sdW1uIGluIHRoZSBkYXRhIGZyYW1lIHdoaWNoIGlzIGV4YWN0bHkgd2hhdCB3ZSB3YW50LgoKRm9yIHRoZSBuZXh0IHN0ZXAsIHdlIHdhbnQgdG8gcmV0dXJuIHRoZSBzZWUgaWYgdGhlIGNsYXNzIG9mIG91ciBmaWxlIGlzIGFjdHVhbGx5IGRhdGEgZnJhbWUuCmBgYHtyfQpjbGFzcyhpbmZlY3Rpb25zKQpgYGAKVGhpcyBpcyB0aGUgY2FzZSEgRnVydGhlciwgd2UgaW5zcGVjdCB0aGUgZGF0YSBmcmFtZSBpbmZlY3Rpb25zIHVzaW5nIHRoZSBoZWFkKCkgZnVuY3Rpb24uIGhlYWQoKSB0YWtlcyB0aGUgY29uc2lkZXJlZCBkYXRhIGZyYW1lIChtYW5kYXRvcnkpIGFzIHdlbGwgYXMgYW4gb3B0aW9uYWwgbnVtYmVyIGFzIGlucHV0cy4gSWYgdGhlIG51bWJlciBpcyBvbWl0dGVkLCBoZWFkKGRmKSByZXR1cm5zIHRoZSBmaXJzdCA2IHJvd3Mgb2YgdGhlIGRhdGEgZnJhbWUgdG9nZXRoZXIgd2l0aCB0aGUgaGVhZGVyLgpgYGB7cn0KaGVhZChpbmZlY3Rpb25zKQpgYGAKSWYgd2UgYWRkIHRoZSBhZGRpdGlvbmFsIG51bWJlciBhcmd1bWVudCwgdGhlIGNvcnJlc3BvbmRpbmcgbnVtYmVyIG9mIGZpcnN0IHJvd3MgaXMgYmVpbmcgcmV0dXJuZWQuCkZvciBleGFtcGxlLCBsZXRgcyBzZXQgdGhlIHBhcmFtZXRlciB0byAxMC4gVGhlIGZpcnN0IDEwIHJvd3Mgd2lsbCBiZSByZXR1cm5lZC4KYGBge3J9CmhlYWQoaW5mZWN0aW9ucywxMCkKYGBgCgpMZXQgdXMgY29uc2lkZXIgYSBsaW5lYXIgcmVncmVzaW9uIG1vZGVsIHVzaW5nIHRoZSBpbmZlY3Rpb24gZGF0YS4gRGVmaW5lIGEgdmFyaWFibGUgbG0xIGFzIGEgbGluZWFyIHJlZ3Jlc3Npb24gbW9kZWwgZXhwbGFuaW5nIGhvdyB0aGUgcG9wdWxhdGlvbiBtYXkgb3IgbWF5IG5vdCBhZmZlY3QgdGhlIG51bWJlciBvZiBpbmZlY3Rpb25zLiAKClRoZSBmaXJzdCBhcmd1bWVudCBpbiB0aGUgbG0oKSBmdW5jdGlvbiBpcyBhbHdheXMgdGhlIGRlcGVuZGVudCB2YXJpYWJsZSwgaS5lLiB0aGUgdmFyaWFibGUgdGhhdCB0aGUgbW9kZWwgd2FudHMgdG8gcHJlZGljdC4gSXQgaXMgZm9sbG93ZWQgYnkgYSB0aWxkZSBzaWduICh+KSBhbmQgdGhlIGluZGVwZW5kZW50IHZhcmlhYmxlcyB0aGF0IGFyZSBhc3N1bWVkIHRvIHByZWRpY3QgdGhlIGRlcGVuZGVudCB2YXJpYWJsZS4gRm9yIGEgc2ltcGxlIGxpbmVhciByZWdyZXNzaW9uIG1vZGVsIChTTFIpLCB0aGUgbnVtYmVyIG9mIGluZGVwZW5kZW50IHZhcmlhYmxlcyBpcyAxLiAKCkluIHRoaXMgY2FzZSwgd2UgaW52ZXN0aWdhdGUgaWYgdGhlIGRlcGVuZGVudCB2YXJpYWJsZSBpbmZlY3Rpb25zIGNhbiBiZSBwcmVkaWN0ZWQgdXNpbmcgZGF0YSBvZiB0aGUgcGFyYW1ldGVyIHBvcDoKYGBge3J9CmxtMTwtbG0oaW5mZWN0aW9ucyRpbmZlY3Rpb25zfmluZmVjdGlvbnMkcG9wKQpsbTEKYGBgCkhlbmNlLCB0aGUgY29ycmVzcG9uZGluZyBsaW5lYXIgbW9kZWwgYXBwcm9hY2ggbG9va3MgYXMgZm9sbG93czogCmluZmVjdGlvbnMgPSA2LjI3NWUrMDIgKyAzLjYwMWUtMDMgKiBwb3AKCgpUbyBnZXQgYSBkZWVwZXIgYW5hbHlzaXMsIGxldCB1cyBydW4gYSBjaHVuY2sgb2YgY29kZSByZXR1cm5pbmcgYSBzdW1tYXJ5IG9mIHRoZSBtb2RlbC4KYGBge3J9CnN1bW1hcnkobG0xKQpgYGAKSW1wb3J0YW50IGlzIHRoZSBwLXZhbHVlIG9mIHBvcCBpbiB0aGUgbGFzdCBjb2x1bW4gb2YgdGhlIGNvZWZmaWNpZW50cyB0YWJsZS4gQXNzdW1pbmcgdGhlIG51bGwtaHlwb3RoZXNpcyB0aGF0IHRoZXJlIGlzIG5vIGNvcnJlbGF0aW9uIGJldHdlZW4gaW5mZWN0aW9uIGFuZCBwb3AsIHRoZSBwLXZhbHVlIHJldHVybnMgdGhlIHByb2JhYmlsaXR5IHRoYXQgdGhlIGNvZWZmaWNpZW50IG9mIHBvcCBpcyAwLCBpLmUuIHRoZSBjb3JyZWxhdGlvbiBvZiBpbmZlY3Rpb25zIGFuZCBwb3AgaXMgMC4gU2luY2UgdGhlIHAtdmFsdWUgaW4gdGhpcyBjYXNlICgwLjAwMDkxMikgaXMgbGVzcyB0aGFuIHRoZSBzaWduaWZpY2FuY2UgbGV2ZWwgMC4wNSwgdGhlIG51bGwgaHlwb3RoZXNpcyBjYW4gYmUgcmVqZWN0ZWQuCgpOZXh0LCBydW4gYSBmdW5jdGlvbiByZXR1cm5pbmcgdGhlIGRpZmZlcmVudCBhdHRyaWJ1dGVzIGF2YWlsYWJsZSBmb3Igb3VyIGxpbmVhciByZWdyZXNzaW9uIG1vZGVsLgpgYGB7cn0KYXR0cmlidXRlcyhsbTEpCmBgYAoKQWdhaW4sIHdlIHByaW50IHRoZSB2YWx1ZXMgb2YgdGhlIGNvZWZmaWNpZW50cy4gVGhpcyB0aW1lIGJ5IGNhbGxpbmcgbG0xJGNvZWZmaWNpZW50czoKYGBge3J9CmxtMSRjb2VmZmljaWVudHMKYGBgClRoZSBJbnRlcmNlcHQgY29lZmZpY2llbnQgcHJlZGljdHMgdGhlIG51bWJlciBvZiBpbmZlY3Rpb25zIHdoZW4gdGhlIHBvcHVsYXRpb24gYW1vdW50IGlzIDAuIEluIHRoaXMgY2FzZSwgdGhhdCBkb2VzIG5vdCBtYWtlIHNlbnNlLCBzaW5jZSBpbiBhIHRoZXJlIGNhbid0IGJlIGEgcG9zaXRpdmUgbnVtYmVyIG9mIGluZmVjdGlvbnMgaW4gYSBwb3B1bGF0aW9uIHdpdGggMCBwZW9wbGUuIEluIGEgY2FzZS1ieS1jYXNlIGFuYWx5c2lzLCBpdCBoYXMgdG8gYmUgYW5hbHl6ZWQgaW5kaXZpZHVhbGx5IGlmIGEgc2V0dGluZyB3aXRoIGFsbCBpbmRlcGVuZGVudCB2YXJpYWJsZXMgYmVpbmcgc2V0IHRvIDAgbWFrZXMgc2Vuc2UuIAoKVGhlIGNvZWZmaWNpZW50IGZvciBwb3AgaW5kaWNhdGVzIHRoYXQgZm9yIGVhY2ggcG9wdWxhdGlvbiBpbmNyZWFzZSBieSAxIHVuaXQsIHRoZSBudW1iZXIgb2YgaW5mZWN0aW9ucyBnZXRzIGluY3JlYXNlZCBieSAzLjYwMTExNGUtMDMuIAoKCgpBYm92ZSwgd2UganVzdCB1c2VkIHRoZSBSIGxhbmd1YWdlIHRvIGFuYWx5emUgYSBzaW1wbGUgbGluZWFyIHJlZ3Jlc3Npb24gcHJvYmxlbShTTFIpLiBMZXQgdXMga25vdyBhcHBseSBvdXIgYmFzaWMga25vd2xlZGdlIG9mIFIgdG8gYSBtdWx0aXBsZSBsaW5lYXIgcmVncmVzc2lvbiAoTUxSKSBzY2VuYXJpby4KCkxldCB1cyBkZWZpbmUgdGhlIGZ1bmN0aW9uIG1vcmVpbmZlY3Rpb25zIGJ5IHVzaW5nIHRoZSBkYXRhc2V0IGF2YWlsYWJsZSBpbiBDYW52YXMsIGluZmVjdGlvbnMxLgpgYGB7cn0KbW9yZWluZmVjdGlvbnMgPC0gcmVhZC50YWJsZSgiaW5mZWN0aW9uczEudHh0IixoZWFkZXI9VFJVRSkKYGBgCgpMZXQgdXMgaW5zcGVjdCB0aGUgZGF0YWZyYW1lIG1vcmVpbmZlY3Rpb25zLkxldCB1cyBpbnNwZWN0IHRoZSBkYXRhZnJhbWUgbW9yZWluZmVjdGlvbnMuCmBgYHtyfQpoZWFkKG1vcmVpbmZlY3Rpb25zKQpgYGAKCkRlZmluZSBhIG5ldyB2YXJpYWJsZSBsbTIgdGhhdCBzdG9yZXMgdGhlIGxpbmVhciByZWdyZXNzaW9uIG1vZGVsIGRlc2NyaWJpbmcgdGhlIG51bWJlciBvZiBpbmZlY3Rpb25zIGFzIGEgZnVuY3Rpb24gb2YgdWZvIGFuZCBwb3AgYW5kIGRpc3BsYXkgdGhlIHN1bW1hcnkuIApgYGB7cn0KbG0yPC1sbShtb3JlaW5mZWN0aW9ucyRpbmZlY3Rpb25zIH4gbW9yZWluZmVjdGlvbnMkdWZvMjAxMCArIG1vcmVpbmZlY3Rpb25zJHBvcCkKc3VtbWFyeShsbTIpCmBgYApBZ2FpbiwgaW52ZXN0aWdhdGluZyB0aGUgcC12YWx1ZXMgZm9yIHRoZSBjb2VmZmljaWVudHMsIHdlIHNlZSB0aGF0IG5laXRoZXIgZm9yIHVmbzIwMTAgKHAtdmFsdWUgMC4zMzMyKSwgbm9yIGZvciBwb3AgKHAtdmFsdWUgMC43NDk3KSBpdCBjYW4gYmUgaW1wbGllZCB0aGF0IHRoZSBjb2VmZmljaWVudHMgYXJlIG5vdCAwISBIZW5jZSwgd2UgY2Fubm90IHNheSB3aXRoIGEgc2lnbmlmaWNhbnQgY2VydGFpbnR5IHRoYXQgdGhlcmUgaXMgYSBjb3JyZWxhdGlvbiBiZXR3ZWVuIGluZmVjdGlvbnMgYW5kIGJvdGggcGFyYW1ldGVycy4gV2UgZmFpbCB0byByZWplY3QgdGhlIG51bGwgaHlwb3RoZXNpcyB0aGF0IHRoZSBjb2VmZmljaWVudHMgZm9yIGJvdGggcGFyYW1ldGVycyBhcmUgMC4g