Extended Example: Regression Analysis of Infections
Let us create a dataframe by using one of the files available in Canvas, PopInf.
infections<- read.table("Popinf.txt",header=FALSE)
#E- read.table() reads the data file named in the call into a data frame.
Does our dataset have column headers? Why?
View(infections)
#E- Yes, the dataset has column titles ("infections", "pop"): the number of infections and the population.
#E- View() opens the whole imported dataset in a spreadsheet-style viewer.
We just noticed that our dataset does have headers, so we read the file again with header=TRUE.
infections<- read.table("Popinf.txt",header=TRUE)
#E- header must be TRUE because the first line of the file holds the column names, not data.
#View(infections)
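To see what the header argument changes, here is a minimal sketch. It does not use the Canvas file: it writes a small hypothetical file with made-up numbers to a temporary location and reads it both ways.

```r
# Hypothetical stand-in for Popinf.txt: a whitespace-delimited file
# with a header row. The numbers are made up for illustration.
tmp <- tempfile(fileext = ".txt")
writeLines(c("infections pop",
             "120 10000",
             "340 52000",
             "95 8000"), tmp)

no_header   <- read.table(tmp, header = FALSE)
with_header <- read.table(tmp, header = TRUE)

names(no_header)            # default names V1, V2
nrow(no_header)             # 4 rows: the header line is counted as data
sapply(no_header, class)    # columns are not numeric, because of the text row

names(with_header)          # "infections" "pop"
nrow(with_header)           # 3 rows
sapply(with_header, class)  # numeric columns, ready for lm()
```

Reading with header=FALSE does not fail, which is why the problem is easy to miss: the column names are silently turned into a first row of data.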
What is the class of the dataframe infections?
class(infections)
## [1] "data.frame"
?class
## starting httpd help server ... done
#E- class() returns the value of the class attribute of an R object.
Inspect the dataframe infections using the head function. Describe the syntax and the output.
head(infections)
#E- head() shows the first 6 rows of a dataset by default (7 printed lines including the header row).
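The number of rows returned by head() can be changed with its n argument. A small sketch on a toy data frame (the object df and its values are made up here, standing in for infections):

```r
# Toy data frame standing in for `infections` (values are made up).
df <- data.frame(infections = 1:10,
                 pop = seq(1000, 10000, by = 1000))

head(df)          # first 6 rows (the default, n = 6)
head(df, n = 3)   # first 3 rows
tail(df, n = 2)   # last 2 rows
str(df)           # compact overview: column names, types, first values
```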
Define a variable lm1 as a linear regression model explaining how the population may or may not affect the number of infections. Describe the syntax and the output.
#E- What does lm() do? It fits a linear model to the data.
#E- Syntax: the code being used, lm(response ~ predictor).
#E- Output: the fitted coefficients and how to interpret them.
#E- Syntax: lm() estimates the intercept and slope of the line relating infections to population (a regression, not just a correlation).
#E- Output: for every additional 1000 people, the model predicts about 3.6 more infections on average; because the per-person slope is small, the effect only becomes noticeable across large differences in population.
lm1 <- lm(infections$infections ~ infections$pop)
lm1
##
## Call:
## lm(formula = infections$infections ~ infections$pop)
##
## Coefficients:
## (Intercept) infections$pop
## 6.275e+02 3.601e-03
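The same model can also be written with the data argument of lm(), which gives shorter coefficient names and lets predict() match new data by column name. The sketch below uses simulated data, not the Canvas file; the data frame toy and the numbers used to generate it are assumptions for illustration.

```r
set.seed(1)  # reproducible made-up data; not the Canvas file
toy <- data.frame(pop = runif(29, 1e4, 1e6))
toy$infections <- 600 + 0.0036 * toy$pop + rnorm(29, sd = 1500)

fit_dollar <- lm(toy$infections ~ toy$pop)      # style used in this example
fit_data   <- lm(infections ~ pop, data = toy)  # same model, data= form

coef(fit_dollar)
coef(fit_data)   # same estimates; the slope is labelled "pop"

# newdata is matched by column name, which needs the data= form
predict(fit_data, newdata = data.frame(pop = c(50000, 51000)))
```

The two predicted values differ by 1000 times the slope, which is the "per 1000 people" reading of the coefficient.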
Let us run a chunk of code returning a summary of the model. What was the null hypothesis? Did we reject or fail to reject the null?
summary(lm1)
##
## Call:
## lm(formula = infections$infections ~ infections$pop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1242.9 -635.5 -537.3 -367.5 6085.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.275e+02 3.126e+02 2.007 0.054826 .
## infections$pop 3.601e-03 9.668e-04 3.725 0.000912 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1528 on 27 degrees of freedom
## Multiple R-squared: 0.3394, Adjusted R-squared: 0.315
## F-statistic: 13.87 on 1 and 27 DF, p-value: 0.0009122
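#E- Null hypothesis: the slope for pop is zero, i.e. population does not affect infections.
#E- Type 1 error: rejecting the null when the null is actually true.
#E- The p-value for pop (0.000912) is below 0.05, so we reject the null and conclude that population is associated with infections.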
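The t value and p-value in a coefficient table can be reproduced by hand: the t value is the estimate divided by its standard error, and the two-sided p-value comes from the t distribution with the residual degrees of freedom. A sketch on simulated data (the data frame toy is made up, not the Canvas file):

```r
set.seed(2)  # made-up data for illustration only
toy <- data.frame(pop = runif(29, 1e4, 1e6))
toy$infections <- 600 + 0.0036 * toy$pop + rnorm(29, sd = 1500)
fit <- lm(infections ~ pop, data = toy)

tab <- coef(summary(fit))  # matrix: Estimate, Std. Error, t value, Pr(>|t|)
tab

# Recompute the test for the slope by hand
t_pop <- tab["pop", "Estimate"] / tab["pop", "Std. Error"]
p_pop <- 2 * pt(-abs(t_pop), df = df.residual(fit))

all.equal(t_pop, tab["pop", "t value"])
all.equal(p_pop, tab["pop", "Pr(>|t|)"])

p_pop < 0.05  # TRUE means: reject H0 (slope = 0) at the 5% level
```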
Run a function returning the different attributes available for our linear regression model.
attributes(lm1)
## $names
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
##
## $class
## [1] "lm"
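Most of these components also have accessor functions, which are often preferred over reaching into the object with $. A short sketch using the built-in cars dataset as a stand-in (it is not related to the infections data):

```r
fit <- lm(dist ~ speed, data = cars)  # built-in dataset, stand-in only

names(fit)               # same component names as attributes(fit)$names
coef(fit)                # same values as fit$coefficients
head(residuals(fit), 3)  # same values as fit$residuals
head(fitted(fit), 3)     # same values as fit$fitted.values

# Each residual is the observed value minus the fitted value
all.equal(unname(residuals(fit)), cars$dist - unname(fitted(fit)))
```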
What are the values of the coefficients? Please interpret the coefficient for the population and its meaning for our model (problem).
lm1$coefficients
## (Intercept) infections$pop
## 6.275345e+02 3.601114e-03
#E- For every 1000 additional people in a region there will be, on average, about 3.6 more infection reports.
#E- Intercept: 627.5345
#E- Slope for pop: 0.003601114
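The "per 1000 people" reading follows directly from the fitted line. Written as a worked equation with the coefficients printed by lm1$coefficients (the symbols x for population and \hat{y} for predicted infections are notation introduced here, not part of the R output):

```latex
\begin{align*}
\hat{y} &= 627.5345 + 0.003601114\,x \\
\Delta\hat{y} &= 0.003601114 \times 1000 = 3.601114 \approx 3.6
\end{align*}
```

The intercept, 627.5345, is the predicted number of infections when x = 0. It lies outside any realistic population, so it is better read as the height of the line than as a meaningful prediction.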
Above, we just used the R language to analyze a simple linear regression (SLR) problem. Let us now apply our basic knowledge of R to a multiple linear regression (MLR) scenario.
Let us define the dataframe moreinfections by using the dataset available in Canvas, infections1.
moreinfections<- read.table("infections1.txt",header=TRUE)
#E- new data frame read from the 'infections1.txt' file
Let us inspect the dataframe moreinfections.
head(moreinfections)
Define a new variable that stores the linear regression model describing the number of infections as a function of ufo2010 and pop. Display the summary. Interpret the output.
#E- What are ufo2010 and pop? They are the two predictors used to explain infections.
lm2 <- lm(moreinfections$infections ~ moreinfections$ufo2010 + moreinfections$pop)
lm2
##
## Call:
## lm(formula = moreinfections$infections ~ moreinfections$ufo2010 +
## moreinfections$pop)
##
## Coefficients:
## (Intercept) moreinfections$ufo2010 moreinfections$pop
## 6.187e+02 2.235e+01 9.281e-04
summary(lm2)
##
## Call:
## lm(formula = moreinfections$infections ~ moreinfections$ufo2010 +
## moreinfections$pop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1210.1 -595.9 -510.3 -192.3 6100.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.187e+02 3.129e+02 1.977 0.0587 .
## moreinfections$ufo2010 2.235e+01 2.267e+01 0.986 0.3332
## moreinfections$pop 9.281e-04 2.878e-03 0.322 0.7497
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1528 on 26 degrees of freedom
## Multiple R-squared: 0.3632, Adjusted R-squared: 0.3143
## F-statistic: 7.416 on 2 and 26 DF, p-value: 0.002829
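#E- Neither ufo2010 (p = 0.3332) nor pop (p = 0.7497) is statistically significant at the 0.05 level, so we fail to reject the null for either coefficient. That is not the same as showing they have no effect on infections.
#E- The overall F-test is still significant (p = 0.002829), and the standard error for pop roughly tripled compared with lm1 (9.668e-04 to 2.878e-03). This pattern can occur when the two predictors are correlated with each other; with only 29 observations there is not enough information to separate their effects.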
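One way to follow up on a summary like this is to check how strongly the predictors are correlated and to compare the models with and without the extra predictor. The sketch below uses simulated data in which ufo2010 is deliberately constructed to track pop; it is an assumption for illustration and says nothing about the real columns in infections1.txt.

```r
set.seed(3)  # made-up data; the real columns are in infections1.txt
n <- 29
pop        <- runif(n, 1e4, 1e6)
ufo2010    <- pop / 8000 + rnorm(n, sd = 5)  # built to track pop closely
infections <- 600 + 0.0036 * pop + rnorm(n, sd = 1500)
toy <- data.frame(infections, ufo2010, pop)

cor(toy$ufo2010, toy$pop)  # close to 1 means the predictors overlap

fit_pop  <- lm(infections ~ pop, data = toy)
fit_both <- lm(infections ~ ufo2010 + pop, data = toy)

coef(summary(fit_pop))     # standard error of pop on its own
coef(summary(fit_both))    # standard errors once ufo2010 is added
anova(fit_pop, fit_both)   # F-test: does adding ufo2010 improve the fit?
```

On the real data, the same three calls (cor(), coef(summary()), anova()) would show whether ufo2010 carries information beyond what pop already provides.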