Extended Example: Regression Analysis of Infections

Let us create a dataframe by using one of the files available in Canvas, PopInf.

infections <- read.table("Popinf.txt", header = FALSE)
#E- read.table() reads the data file (here Popinf.txt) into a data frame.

Does our dataset have column headers? Why?

View(infections)
#E- Yes, the dataset has column titles ("infections", "pop"), giving the number of infections for each population.
#E- View() displays the whole dataset that you imported.

We just noticed that our dataset does have headers.

infections <- read.table("Popinf.txt", header = TRUE)
#E- header must be TRUE because the first row of the file contains the column names.
#View(infections)
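
The effect of the header argument can be seen on a small stand-in file. The sketch below writes a hypothetical two-column file to a temporary location (the values are invented and are not the contents of Popinf.txt) and reads it both ways.

```r
# Hypothetical stand-in for Popinf.txt; the values are invented for illustration.
tmp <- tempfile(fileext = ".txt")
writeLines(c("infections pop",
             "120 50000",
             "340 120000",
             "95 30000"), tmp)

# header = FALSE: the title row is read as data and columns get default names
no_header <- read.table(tmp, header = FALSE)
names(no_header)   # "V1" "V2"
nrow(no_header)    # 4, because the title row counts as a data row

# header = TRUE: the first row supplies the column names
with_header <- read.table(tmp, header = TRUE)
names(with_header) # "infections" "pop"
nrow(with_header)  # 3
```

With header = FALSE the column titles end up mixed in with the numbers, so the columns are no longer read as numeric, which is why the second call is the one to keep.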

What is the class of the dataframe infections?

class(infections)
## [1] "data.frame"
?class
## starting httpd help server ... done
#E- class() returns the value of the class attribute of an R object.

Inspect the dataframe infections using the head function. Describe the syntax and the output.

head(infections)
#E- head() shows the first 6 rows of a dataset by default, printed under the column headers.
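
As a small illustration of the syntax, head() takes the object and an optional n giving the number of rows to return. The data frame df below is a synthetic stand-in, not the Canvas data.

```r
# Synthetic data frame, used only to illustrate head(); not the Canvas data.
df <- data.frame(id = 1:10, value = (1:10)^2)

head(df)         # default: first 6 rows
head(df, n = 3)  # first 3 rows only

nrow(head(df))   # 6
```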

Define a variable lm1 as a linear regression model explaining how the population may or may not affect the number of infections. Describe the syntax and the output.

#E- What does lm() do? It fits a linear model to the data in a data frame.
#E- Syntax: the code being used, here lm(response ~ predictor).
#E- Output: the analysis of the results.
#E- Syntax: lm() estimates the intercept and slope of the line relating population to infections.
#E- Output: for every additional 1,000 people (population), about 3.6 more infections occur on average; the effect is noticeable for large populations, not much for smaller ones.
lm1 <- lm(infections$infections ~ infections$pop)
lm1
## 
## Call:
## lm(formula = infections$infections ~ infections$pop)
## 
## Coefficients:
##    (Intercept)  infections$pop  
##      6.275e+02       3.601e-03
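
An equivalent way to write this kind of model passes the data frame through the data argument, so the slope is labeled pop rather than infections$pop. The sketch below uses a synthetic data frame named toy (an assumption for illustration) built with an exact linear relation, so the recovered coefficients are known in advance.

```r
# Synthetic data with an exact linear relation: infections = 10 + 0.002 * pop
toy <- data.frame(pop = c(10000, 20000, 30000, 40000, 50000))
toy$infections <- 10 + 0.002 * toy$pop

# Column names are looked up in `toy`, so the $ prefix is not needed
fit <- lm(infections ~ pop, data = toy)
coef(fit)

# Slope per person converted to a slope per 1,000 people
per_thousand <- unname(coef(fit)["pop"]) * 1000
per_thousand
```

The per-1,000 conversion is the same step used to read 3.601e-03 as roughly 3.6 infections per 1,000 people.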

Let us run a chunk of code returning a summary of the model. What was the null hypothesis? Did we reject or fail to reject the null?

summary(lm1)
## 
## Call:
## lm(formula = infections$infections ~ infections$pop)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1242.9  -635.5  -537.3  -367.5  6085.0 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.275e+02  3.126e+02   2.007 0.054826 .  
## infections$pop 3.601e-03  9.668e-04   3.725 0.000912 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1528 on 27 degrees of freedom
## Multiple R-squared:  0.3394, Adjusted R-squared:  0.315 
## F-statistic: 13.87 on 1 and 27 DF,  p-value: 0.0009122
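#E- Null hypothesis: population does not affect infections (the slope for pop is 0).
#E- Type 1 error: rejecting the null when the null is actually true.
#E- Because the p-value for pop (0.000912) is below 0.05, we reject the null and conclude that population is associated with infections.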
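
The reject or fail-to-reject decision can also be read programmatically from the coefficient table. The following is a minimal sketch on synthetic data; the data frame toy, the seed, and the 0.05 threshold alpha are assumptions for illustration, not values taken from Popinf.txt.

```r
set.seed(1)  # reproducible synthetic data, not the Popinf.txt values
toy <- data.frame(pop = seq(10000, 300000, length.out = 30))
toy$infections <- 500 + 0.004 * toy$pop + rnorm(30, sd = 300)

fit <- lm(infections ~ pop, data = toy)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(fit))
p_slope <- coef_table["pop", "Pr(>|t|)"]

# Decision rule: reject the null (slope = 0) when the p-value is below alpha
alpha <- 0.05
reject_null <- p_slope < alpha
reject_null
```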

Run a function returning the different attributes available for our linear regression model.

attributes(lm1)
## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"        
## 
## $class
## [1] "lm"
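
Each name listed under $names can be pulled out of the model object with the $ operator. A short sketch on a synthetic fit follows (the data frame toy and its values are invented for illustration).

```r
# Synthetic data for illustration only; not the Popinf.txt values
toy <- data.frame(pop = c(10000, 25000, 40000, 60000, 90000),
                  infections = c(60, 95, 170, 260, 330))
fit <- lm(infections ~ pop, data = toy)

names(fit)         # component names, as listed under $names
fit$df.residual    # residual degrees of freedom: 5 rows - 2 coefficients = 3
fit$residuals      # observed minus fitted values
fit$fitted.values  # values predicted by the model for each row
```

Adding the residuals to the fitted values gives back the observed infections column.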

What are the values of the coefficients? Please interpret the coefficient for the population and its meaning for our model (problem).

lm1$coefficients
##    (Intercept) infections$pop 
##   6.275345e+02   3.601114e-03
#E- For every 1,000 additional people in a region, there will be on average about 3.6 more infection reports.
#E- Coefficient (Intercept): 627.5345
#E- Coefficient (pop): 0.003601114
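
The interpretation can be checked with a little arithmetic on the reported coefficients. In the sketch below the coefficient values are copied from the lm1 output, while the region of 100,000 people is a hypothetical input chosen only for illustration.

```r
# Coefficients copied from the lm1 output
intercept <- 627.5345
slope     <- 0.003601114

# Expected change in infections for 1,000 additional people
per_thousand <- slope * 1000
per_thousand   # 3.601114

# Predicted infections for a hypothetical region of 100,000 people
predicted <- intercept + slope * 100000
predicted      # 987.6459
```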

Above, we just used the R language to analyze a simple linear regression (SLR) problem. Let us now apply our basic knowledge of R to a multiple linear regression (MLR) scenario.

Let us define the dataframe moreinfections by using the dataset available in Canvas, infections1.

moreinfections <- read.table("infections1.txt", header = TRUE)
#E- New data frame created from the infections1.txt file.

Let us inspect the dataframe moreinfections.

head(moreinfections)

Define a new variable that stores the linear regression model describing the number of infections as a function of ufo2010 and pop. Display the summary. Interpret the output.

#E- What are ufo2010 and pop? They are the predictors used to explain infections.
lm2 <- lm(moreinfections$infections ~ moreinfections$ufo2010 + moreinfections$pop)
lm2
## 
## Call:
## lm(formula = moreinfections$infections ~ moreinfections$ufo2010 + 
##     moreinfections$pop)
## 
## Coefficients:
##            (Intercept)  moreinfections$ufo2010      moreinfections$pop  
##              6.187e+02               2.235e+01               9.281e-04
summary(lm2)
## 
## Call:
## lm(formula = moreinfections$infections ~ moreinfections$ufo2010 + 
##     moreinfections$pop)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1210.1  -595.9  -510.3  -192.3  6100.0 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)  
## (Intercept)            6.187e+02  3.129e+02   1.977   0.0587 .
## moreinfections$ufo2010 2.235e+01  2.267e+01   0.986   0.3332  
## moreinfections$pop     9.281e-04  2.878e-03   0.322   0.7497  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1528 on 26 degrees of freedom
## Multiple R-squared:  0.3632, Adjusted R-squared:  0.3143 
## F-statistic: 7.416 on 2 and 26 DF,  p-value: 0.002829
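#E- Neither ufo2010 nor pop is individually statistically significant (p-values 0.3332 and 0.7497, both above 0.05), so we fail to reject the null for either coefficient; this does not prove they have no effect on infections.
#E- The overall F-test (p-value 0.002829) is significant, so the model as a whole explains some variation; the conclusion rests on the current dataset size (29 observations).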
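
One possible explanation for a significant overall F-test alongside non-significant individual coefficients is that the two predictors are strongly correlated with each other. That cannot be confirmed from the output alone. The sketch below shows how such a check might look on synthetic data; the data frame toy is invented, and its ufo2010 column is deliberately built to track pop.

```r
set.seed(42)  # synthetic data; not the infections1.txt values
n <- 29
pop <- runif(n, min = 50000, max = 1000000)
ufo2010 <- pop / 10000 + rnorm(n, sd = 3)  # constructed to follow pop closely
toy <- data.frame(pop = pop, ufo2010 = ufo2010)

# A correlation close to 1 means the predictors carry overlapping information,
# which inflates the standard errors of their individual coefficients
predictor_cor <- cor(toy$ufo2010, toy$pop)
predictor_cor
```

Running cor(moreinfections$ufo2010, moreinfections$pop) on the real data would show whether this applies here.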