Extended example: Regression Analysis of Infections
Let us create a data frame from one of the files available in Canvas, PopInf, and name it infections.
infections1 <- read.table("C:/Users/r0955087/Desktop/documents2/PopInf.txt",header=FALSE)
Does our data set have column headers? Why?
Yes, because the first row of the file already contains the names of the variables.
#View(infections1)
Since the file already contains the variable names (as titles for the columns), it is necessary to set header=TRUE.
Otherwise the variable names are read as an ordinary data row.
infections <- read.table("C:/Users/r0955087/Desktop/documents2/PopInf.txt",header=TRUE)
Display the data set.
#View(infections)
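The effect of the header argument can be checked without the Canvas file. The sketch below is only an illustration: the three lines of text stand in for a file and are made-up values, not the contents of PopInf.txt.

```r
# Hypothetical file contents, passed inline through the `text` argument
raw <- "pop infections
1000 5
2000 9"

no_header   <- read.table(text = raw, header = FALSE)
with_header <- read.table(text = raw, header = TRUE)

names(no_header)      # default names V1 and V2; the labels become row 1
names(with_header)    # "pop" "infections"
nrow(no_header)       # 3: the label row is counted as data
nrow(with_header)     # 2

is.numeric(no_header$V1)     # FALSE: the text label prevents numeric parsing
is.numeric(with_header$pop)  # TRUE
```

With header = FALSE the columns are no longer numeric, so a later call to lm() on them would not behave as intended.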
What is the class of the dataframe infections?
Returns the class of infections (in this case, a data frame).
class(infections)
## [1] "data.frame"
Inspect the dataframe infections using the head function. Describe the syntax and the output.
The head() function returns the first 6 rows of a data frame by default (it does not plot anything). Here the output is the first 6 rows of the predictor pop (population) and the response variable infections.
head(infections)
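As a small illustration of the head() syntax, the sketch below uses a made-up data frame called toy (not the Canvas data) to show the default of 6 rows and the optional n argument.

```r
# Hypothetical data frame with ten rows
toy <- data.frame(pop = seq(1000, 10000, by = 1000),
                  infections = 1:10)

first_default <- head(toy)         # default is n = 6
first_three   <- head(toy, n = 3)  # ask for a different number of rows
last_two      <- tail(toy, n = 2)  # tail() works the same way from the end

nrow(first_default)  # 6
nrow(first_three)    # 3
str(toy)             # compact overview of column names and types
```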
Define a variable lml as a linear regression model explaining how the population may or may not affect the number of infections. Describe the syntax and the output.
We use lm() to fit a linear regression model with predictor variable pop (population) and response variable infections. The formula infections$infections ~ infections$pop reads "infections explained by pop". The output shows the estimated coefficients of the fitted line: the intercept and the slope for pop.
lml <- lm(infections$infections ~ infections$pop)
lml
##
## Call:
## lm(formula = infections$infections ~ infections$pop)
##
## Coefficients:
##    (Intercept)  infections$pop
##      6.275e+02       3.601e-03
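The same kind of model can also be written with the data argument of lm(), which gives shorter coefficient names and lets predict() accept new values. The sketch below uses simulated data in a hypothetical data frame toy; the numbers are invented and are not the Canvas data.

```r
set.seed(1)

# Simulated stand-in data: 30 regions with made-up populations
toy <- data.frame(pop = runif(30, min = 1e4, max = 1e6))
toy$infections <- 600 + 0.004 * toy$pop + rnorm(30, sd = 1500)

# Same model, two ways of writing the formula
fit_dollar <- lm(toy$infections ~ toy$pop)
fit_data   <- lm(infections ~ pop, data = toy)

coef(fit_dollar)  # names: (Intercept), toy$pop
coef(fit_data)    # names: (Intercept), pop

# Prediction for new populations works directly with the data= form
new_regions <- data.frame(pop = c(50000, 250000))
predicted   <- predict(fit_data, newdata = new_regions)
predicted
```

Both calls estimate identical coefficients; only the labels and the convenience of later calls differ.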
Let us run a chunk of code returning a summary of the model.
summary(lml)
##
## Call:
## lm(formula = infections$infections ~ infections$pop)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1242.9  -635.5  -537.3  -367.5  6085.0
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    6.275e+02  3.126e+02   2.007 0.054826 .
## infections$pop 3.601e-03  9.668e-04   3.725 0.000912 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1528 on 27 degrees of freedom
## Multiple R-squared: 0.3394, Adjusted R-squared: 0.315
## F-statistic: 13.87 on 1 and 27 DF, p-value: 0.0009122
What was the null hypothesis? Did we reject or fail to reject the null? Run a function returning the different attributes available for our linear regression model.
The null hypothesis is that the slope coefficient for pop is zero, that is, there is no linear relationship between population and infections. Since the p-value (0.000912) is lower than 0.05, there is enough evidence to reject the null hypothesis and conclude that a linear relationship between population and infections exists.
attributes(lml)
## $names
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
##
## $class
## [1] "lm"
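The test on the slope can be reproduced by hand from the values reported in the summary above. This is only a sketch: the estimate, standard error and degrees of freedom are typed in from the printed output (rounded), so the results match the summary only up to rounding.

```r
# Values copied from the printed summary of the simple regression
estimate  <- 3.601e-03  # slope for pop
std_error <- 9.668e-04  # its standard error
df_resid  <- 27         # residual degrees of freedom

# t statistic for H0: slope = 0
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# Equivalent decision rule with the 5% critical value
t_crit      <- qt(0.975, df = df_resid)
reject_null <- abs(t_value) > t_crit

t_value
p_value
reject_null
```

For a fitted model object, the same columns are available in summary(lml)$coefficients.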
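What are the values of the coefficients? Please interpret the coefficient for the population and its meaning for our model (problem).
The coefficient for pop is positive, so the fitted line shows a positive linear association. Its value means that each additional unit of pop (one thousand people, taking pop to be recorded in thousands) is associated with 3.601114e-03 more infections on average. The intercept, 6.275345e+02, is the predicted number of infections when pop is zero.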
lml$coef
##    (Intercept) infections$pop
##   6.275345e+02   3.601114e-03
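The interpretation of the slope can be checked with a little arithmetic. The sketch below copies the two coefficients from the output above into a hypothetical helper function predict_infections; the pop values 1000 and 2000 are arbitrary and only serve to show that the predicted change depends on the slope alone.

```r
# Coefficients copied from the output above
intercept <- 6.275345e+02
slope     <- 3.601114e-03

# Hypothetical helper: fitted value of the simple regression line
predict_infections <- function(pop) intercept + slope * pop

# Change in the prediction when pop increases by 1000 units
change_per_1000 <- predict_infections(2000) - predict_infections(1000)
change_per_1000  # equals 1000 * slope, about 3.6
```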
Above, we just used the R language to analyze a simple linear regression (SLR) problem. Let us now apply our basic knowledge of R to a multiple linear regression (MLR) scenario.
Let us define the data frame moreinfections by using the data set available in Canvas, infections1.
moreinfections <- read.table("C:/Users/r0955087/Desktop/documents2/infections1.txt",header=TRUE)
Let us inspect the dataframe moreinfections.
The head() function returns the first 6 rows of the data frame moreinfections. The output is the first 6 rows of the predictors pop and ufo2010 and of the response variable infections.
head(moreinfections)
Define a new variable that stores the linear regression model describing the number of infections as a function of ufo (unidentified flying object) and pop (population). Display the summary. Interpret the output.
The function lm() is used again, this time to fit a linear regression model with the predictor variables pop and ufo2010 and the response variable infections.
The coefficient for ufo2010 is positive. Its value means that, holding pop constant, each additional UFO (unidentified flying object) is associated with 2.235e+01 more infections. So more UFOs observed would go with more infections, which is surprising; I would not have expected it.
The coefficient for pop is also positive. Its value means that, holding ufo2010 constant, each additional unit of pop (one thousand people, taking pop to be recorded in thousands) is associated with 9.281e-04 more infections.
lm2 <- lm(moreinfections$infections ~ moreinfections$ufo2010 + moreinfections$pop)
lm2
##
## Call:
## lm(formula = moreinfections$infections ~ moreinfections$ufo2010 +
## moreinfections$pop)
##
## Coefficients:
##            (Intercept)  moreinfections$ufo2010      moreinfections$pop
##              6.187e+02               2.235e+01               9.281e-04
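In a multiple regression each coefficient describes the change in the prediction when that predictor moves by one unit while the others stay fixed. The sketch below illustrates this with the coefficients printed above; the helper predict_mlr and the input values (10 and 11 UFOs, a pop of 5000) are hypothetical and chosen only for the arithmetic.

```r
# Coefficients copied from the printed output of lm2
b0     <- 6.187e+02  # intercept
b_ufo  <- 2.235e+01  # coefficient for ufo2010
b_pop  <- 9.281e-04  # coefficient for pop

# Hypothetical helper: fitted value of the two-predictor model
predict_mlr <- function(ufo, pop) b0 + b_ufo * ufo + b_pop * pop

# One more UFO, same population: the prediction moves by b_ufo
diff_ufo <- predict_mlr(ufo = 11, pop = 5000) - predict_mlr(ufo = 10, pop = 5000)

# One more unit of pop, same UFO count: the prediction moves by b_pop
diff_pop <- predict_mlr(ufo = 10, pop = 5001) - predict_mlr(ufo = 10, pop = 5000)

diff_ufo
diff_pop
```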
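Since the p-values for both predictor variables (0.3332 for ufo2010 and 0.7497 for pop) are bigger than 0.05, neither null hypothesis (that the coefficient is zero given the other predictor in the model) can be rejected. That means there is not enough evidence that either predictor has a linear relationship with the response variable once the other predictor is accounted for, so the interpretations I have given for the individual coefficient values are not statistically supported. Note, however, that the overall F-test is still significant (p-value 0.002829) and that pop was significant on its own in the simple regression; this pattern may indicate that ufo2010 and pop are correlated with each other.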
summary(lm2)
##
## Call:
## lm(formula = moreinfections$infections ~ moreinfections$ufo2010 +
## moreinfections$pop)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1210.1  -595.9  -510.3  -192.3  6100.0
##
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)            6.187e+02  3.129e+02   1.977   0.0587 .
## moreinfections$ufo2010 2.235e+01  2.267e+01   0.986   0.3332
## moreinfections$pop     9.281e-04  2.878e-03   0.322   0.7497
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1528 on 26 degrees of freedom
## Multiple R-squared: 0.3632, Adjusted R-squared: 0.3143
## F-statistic: 7.416 on 2 and 26 DF, p-value: 0.002829
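In the summary above the overall F-test is significant (p-value 0.002829) while neither individual coefficient is, and the standard error for pop (2.878e-03) is about three times the one in the simple regression (9.668e-04). One possible explanation is that the two predictors are correlated with each other. Whether that is the case for ufo2010 and pop cannot be read from the output, so the sketch below only illustrates the effect with simulated data: x1, x2 and y are made-up variables, and the variance inflation factor is computed by hand rather than with an add-on package.

```r
set.seed(42)

# Simulated data: x2 is almost a copy of x1, and y depends only on x1
n  <- 29
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)
y  <- 1 + 2 * x1 + rnorm(n)
toy <- data.frame(y, x1, x2)

# Correlation between the two predictors
predictor_cor <- cor(toy$x1, toy$x2)

# Variance inflation factor for x1: 1 / (1 - R^2) of x1 regressed on x2
r2_aux <- summary(lm(x1 ~ x2, data = toy))$r.squared
vif_x1 <- 1 / (1 - r2_aux)

# Standard error of the x1 coefficient, alone and with x2 in the model
se_alone <- summary(lm(y ~ x1, data = toy))$coefficients["x1", "Std. Error"]
se_joint <- summary(lm(y ~ x1 + x2, data = toy))$coefficients["x1", "Std. Error"]

predictor_cor
vif_x1
c(alone = se_alone, joint = se_joint)
```

For the real data, cor(moreinfections$ufo2010, moreinfections$pop) would be the first thing to check.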