Extended Example: Regression Analysis of Infections
Let us create a dataframe by using one of the files available in Canvas, PopInf.
getwd()
[1] "/cloud/project"
I created a dataframe named infections from the PopInf.txt file on Canvas, first reading it without a header row.
infections <- read.table("PopInf.txt",header=FALSE)
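Does our dataset have column headers? Why? Yes. With header=FALSE, R assigns the default column names V1 and V2, and the file's own column names (pop and infections) show up as the first row of data, so the file does contain a header row.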
View(infections)
We just noticed that our dataset does have headers, so we read the file again with header=TRUE.
infections <- read.table("PopInf.txt",header=TRUE)
View(infections)
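The effect of the header argument can be reproduced on a small stand-in file. The sketch below is a minimal illustration, assuming a whitespace-separated file whose first line holds the column names; the file contents are made-up values, not the real PopInf data.

```r
# Write a tiny stand-in file (hypothetical values, not the PopInf data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("pop infections", "1000 12", "2500 30"), tmp)

# header = FALSE: R invents the names V1 and V2 and treats the
# real column names as the first row of data.
no_header <- read.table(tmp, header = FALSE)
names(no_header)
nrow(no_header)

# header = TRUE: the first line becomes the column names.
with_header <- read.table(tmp, header = TRUE)
names(with_header)
nrow(with_header)
```

When the header line is read as data, the row count is one too high and the columns are no longer numeric, which is another sign that the file needs header=TRUE.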
What is the class of the dataframe infections? The class of infections is data.frame.
class(infections)
[1] "data.frame"
Inspect the dataframe infections using the head function. Describe the syntax and the output. The call head(infections) takes the dataframe as its argument and returns its first six rows.
head(infections)
Define a variable lm1 as a linear regression model explaining how the population may or may not affect the number of infections. Describe the syntax and the output. I created lm1, a linear regression model with population (pop) as the predictor variable and infections as the response variable. In the formula, the response goes to the left of ~ and the predictor to the right. The output shows the estimated intercept and the estimated coefficient (slope) for pop.
lm1<-lm(infections$infections~infections$pop)
lm1
Call:
lm(formula = infections$infections ~ infections$pop)
Coefficients:
   (Intercept)  infections$pop
     6.275e+02       3.601e-03
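An equivalent and more common way to write this model passes the dataframe through the data argument, so the formula can use bare column names. A minimal sketch, using a made-up stand-in dataframe named toy (hypothetical values) rather than the PopInf data:

```r
# Stand-in data (hypothetical values, not the PopInf data).
toy <- data.frame(
  pop        = c(1000, 2500, 4000, 5500, 7000),
  infections = c(12, 30, 41, 60, 75)
)

# Style used in this example: columns spelled out with $.
fit_dollar <- lm(toy$infections ~ toy$pop)

# Same model written with the data argument.
fit_data <- lm(infections ~ pop, data = toy)

# Both fits give the same estimates; only the coefficient labels differ.
coef(fit_dollar)
coef(fit_data)

# The data-argument form also lets predict() evaluate the model
# at a new pop value.
predict(fit_data, newdata = data.frame(pop = 3000))
```

The two styles estimate the same line. The data-argument form gives shorter labels in the output (pop instead of toy$pop) and works cleanly with predict().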
Let us run a chunk of code returning a summary of the model.
summary(lm1)
Call:
lm(formula = infections$infections ~ infections$pop)
Residuals:
    Min      1Q  Median      3Q     Max
-1242.9  -635.5  -537.3  -367.5  6085.0
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    6.275e+02  3.126e+02   2.007 0.054826 .
infections$pop 3.601e-03  9.668e-04   3.725 0.000912 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1528 on 27 degrees of freedom
Multiple R-squared: 0.3394, Adjusted R-squared: 0.315
F-statistic: 13.87 on 1 and 27 DF, p-value: 0.0009122
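What was the null hypothesis? Did we reject or fail to reject the null? The null hypothesis is that the coefficient for pop is zero, that is, population has no linear effect on the number of infections. The p-value for pop (0.000912) is smaller than 0.05, so we reject the null hypothesis.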
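The test for the pop coefficient can be checked by hand from the printed summary. The sketch below recomputes the t statistic and its two-sided p-value from the rounded estimate, standard error, and residual degrees of freedom shown in the output, so small rounding differences from R's own figures are expected.

```r
# Values copied from the summary(lm1) output.
estimate  <- 3.601e-03
std_error <- 9.668e-04
df_resid  <- 27

# t statistic: the estimate divided by its standard error.
t_value <- estimate / std_error

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_value), df = df_resid)

round(t_value, 3)
p_value
p_value < 0.05   # TRUE means we reject the null at the 5% level
```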
Run a function returning the different attributes available for our linear regression model.
attributes(lm1)
$names
[1] "coefficients" "residuals" "effects"
[4] "rank" "fitted.values" "assign"
[7] "qr" "df.residual" "xlevels"
[10] "call" "terms" "model"
$class
[1] "lm"
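Each name listed under $names is a component that can be pulled out of the model object, either with $ or with a dedicated accessor function. A minimal sketch, using R's built-in cars dataset as a stand-in for the infections data:

```r
# Stand-in model fitted on the built-in cars dataset.
fit <- lm(dist ~ speed, data = cars)

names(fit)            # component names, the same list as attributes(fit)$names
fit$coefficients      # direct access with $
coef(fit)             # accessor function, same values

head(residuals(fit))  # observed values minus fitted values
head(fitted(fit))     # model predictions for the observed rows
```

The accessor functions coef(), residuals(), and fitted() return the same values as the matching $ components and are usually the safer choice, since they do not depend on how the object is stored internally.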
What are the values of the coefficients? Please interpret the coefficient for the population and the meaning for our model (problem). The intercept is 6.275345e+02 and the coefficient for pop is 3.601114e-03. The coefficient means that each one-unit increase in pop is associated with about 0.0036 more predicted infections, or about 3.6 more infections for every additional 1,000 units of pop.
lm1$coefficients
   (Intercept) infections$pop
  6.275345e+02   3.601114e-03
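The slope is easier to read as a change in the prediction. The sketch below plugs the printed coefficients into the fitted line; the helper predict_infections and the pop values 10000 and 11000 are hypothetical, chosen only to show a 1,000-unit step.

```r
# Coefficients copied from the lm1$coefficients output.
intercept <- 6.275345e+02
slope     <- 3.601114e-03

# Predicted infections for a given pop value (hypothetical helper).
predict_infections <- function(pop) intercept + slope * pop

# Change in the prediction when pop grows by 1,000 units.
step_change <- predict_infections(11000) - predict_infections(10000)
step_change
```

The difference equals 1000 times the slope, about 3.6 infections, whatever starting value of pop is used, because the model is a straight line.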
Above, we just used the R language to analyze a simple linear regression (SLR) problem. Let us now apply our basic knowledge of R to a multiple linear regression (MLR) scenario.
Let us define the dataframe moreinfections by using the dataset available in Canvas, infections1.
moreinfections <- read.table("infections1.txt",header=TRUE)
Let us inspect the dataframe moreinfections. I used the head function; the output is the first six rows of the dataframe.
head(moreinfections)
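Define a new variable that stores the linear regression model describing the number of infections as a function of ufo and pop. Display the summary. Interpret the output. For each predictor, the null hypothesis is that its coefficient is zero once the other predictor is in the model. The p-values for ufo2010 (0.3332) and pop (0.7497) are both greater than 0.05, so we cannot reject the null for either variable individually. The overall F-test is still significant (p-value 0.002829), so the two predictors together explain some of the variation in infections; a pattern like this can occur when the predictors are correlated with each other.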
lm2<-lm(moreinfections$infections~moreinfections$ufo2010+moreinfections$pop)
summary(lm2)
Call:
lm(formula = moreinfections$infections ~ moreinfections$ufo2010 +
moreinfections$pop)
Residuals:
    Min      1Q  Median      3Q     Max
-1210.1  -595.9  -510.3  -192.3  6100.0
Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)            6.187e+02  3.129e+02   1.977   0.0587 .
moreinfections$ufo2010 2.235e+01  2.267e+01   0.986   0.3332
moreinfections$pop     9.281e-04  2.878e-03   0.322   0.7497
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1528 on 26 degrees of freedom
Multiple R-squared: 0.3632, Adjusted R-squared: 0.3143
F-statistic: 7.416 on 2 and 26 DF, p-value: 0.002829
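The last line of the summary tests both predictors together, which is a different question from the per-coefficient t-tests. The sketch below recomputes the overall p-value from the printed F statistic and compares the adjusted R-squared of the two models, assuming both were fitted to the same 29 observations (the matching degrees of freedom suggest this, but the source does not state it).

```r
# Values copied from the summary(lm2) output.
f_stat <- 7.416
df_num <- 2
df_den <- 26

# Overall F-test p-value: upper tail of the F distribution.
p_overall <- pf(f_stat, df_num, df_den, lower.tail = FALSE)
p_overall

# Adjusted R-squared: SLR (pop only) versus MLR (ufo2010 + pop).
adj_r2_slr <- 0.315
adj_r2_mlr <- 0.3143
adj_r2_mlr > adj_r2_slr   # FALSE: adding ufo2010 did not raise adjusted R-squared
```

Under that assumption, the slightly lower adjusted R-squared suggests that ufo2010 adds little beyond what pop already explains.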