Regression Analysis

Intro

My R guide will be focused on simple linear regression. There will be two quantitative variables, a predictor and a response. The data set I will be working with today is called “terrorism”. I will be using the number of wounded victims from terrorism to predict the number of victims killed from terrorism. Therefore, my predictor variable is the number of wounded victims, nwound, and my response variable is the number of people killed(including terrorists), nkill.

First, I will use the library Ecdat and retrieve my data “terrorism”. Next, I did ?terrorism to learn more about my data set and then I attached the data for later use. After attaching the data, I will review the beginning portion by using the head code to get a feel for the data set I will be working with. Finally, I will check the dimensions and look at the variable names, this is how I determined my nwound predictor and my nkill response.

Later, I will look at the data in a scatterplot to see if I can determine an association. Then I will confirm the association, if there is one, by checking the correlation. After that, I will fit a regression line to the data. This will be followed by a histogram, qq-plot, and residuals. Next, I will use my data set in prediction intervals and confidence intervals where I will actually predict values based on my predictor variable. Lastly, I will touch on R^2 and the f-test for this data.

library(Ecdat)

## Warning: package 'Ecdat' was built under R version 3.4.3

## Loading required package: Ecfun

## Warning: package 'Ecfun' was built under R version 3.4.3

## 
## Attaching package: 'Ecfun'

## The following object is masked from 'package:base':
## 
##     sign

## 
## Attaching package: 'Ecdat'

## The following object is masked from 'package:datasets':
## 
##     Orange

data("terrorism")
?terrorism

## starting httpd help server ...

##  done

attach(terrorism)
head(terrorism)

##   year methodology method incidents incidents.us suicide suicide.us nkill
## 1 1970        PGIS      p       651          468       0          0   171
## 2 1971        PGIS      p       470          247       0          0   173
## 3 1972        PGIS      p       494           64       0          0   566
## 4 1973        PGIS      p       473           58       0          0   370
## 5 1974        PGIS      p       580           94       0          0   542
## 6 1975        PGIS      p       740          149       0          0   617
##   nkill.us nwound nwound.us pNA.nkill pNA.nkill.us pNA.nwound
## 1       28    192       132  0.064516     0.379416   0.079877
## 2       15     82        48  0.136170     0.682979   0.308511
## 3       12    222        23  0.093117     0.880567   0.668016
## 4       73    495        87  0.103594     0.911205   0.484144
## 5       17    754        63  0.100000     0.841379   0.394828
## 6       22    617        18  0.206757     0.905405   0.459459
##   pNA.nwound.us worldPopulation USpopulation worldDeathRate USdeathRate
## 1      0.380952         3682488     209485.8         11.966         9.5
## 2      0.682979         3757735     211357.9         11.731         9.3
## 3      0.886640         3833595     213219.5         11.635         9.4
## 4      0.911205         3909722     215092.9         11.392         9.3
## 5      0.856897         3985734     217001.9         11.285         9.1
## 6      0.914865         4061399     218963.6         11.141         8.8
##   worldDeaths USdeaths   kill.pmp kill.pmp.us        pkill     pkill.us
## 1    44064648  1921031 0.04643600  0.13366061 3.880662e-06 1.457551e-05
## 2    44081985  1927542 0.04603838  0.07096966 3.924506e-06 7.781932e-06
## 3    44603877  1963944 0.14764210  0.05628003 1.268948e-05 6.110154e-06
## 4    44539554  1973003 0.09463588  0.33938824 8.307223e-06 3.699944e-05
## 5    44979006  1934388 0.13598500  0.07834034 1.205007e-05 8.788309e-06
## 6    45248049  1892879 0.15191809  0.10047334 1.363595e-05 1.162251e-05

dim(terrorism)

## [1] 46 25

names(terrorism)

##  [1] "year"            "methodology"     "method"         
##  [4] "incidents"       "incidents.us"    "suicide"        
##  [7] "suicide.us"      "nkill"           "nkill.us"       
## [10] "nwound"          "nwound.us"       "pNA.nkill"      
## [13] "pNA.nkill.us"    "pNA.nwound"      "pNA.nwound.us"  
## [16] "worldPopulation" "USpopulation"    "worldDeathRate" 
## [19] "USdeathRate"     "worldDeaths"     "USdeaths"       
## [22] "kill.pmp"        "kill.pmp.us"     "pkill"          
## [25] "pkill.us"

Scatterplot data

First, to begin my regression analysis I plotted my explanatory variable against my response variable. In this case I plotted how many were wounded from terrorism against how many were killed on a scatterplot to see if there was any correlation using the plot function. Also, for in case anyone else looks at this graph I labeled the axes will common names instead of the variable names, ylab for the y axis and xlab for the x axis.

Explanatory={nwound}
Response={nkill}
plot(Explanatory, Response, ylab = "Killed",
     xlab = "Wounded", 
     main = "Terrorism Deaths per Wounded")

As you can see they have a positive association and the data follows the trend fairly closely, so it probably has a linear relationship. We should be able to predict how many were killed from terrorism with a fair amount of accuracy.

Correlation

Next, I checked using R code to see exactly what the correlation was between the number of wounded people and the number of killed people from terrorism. TO do this you use the correlation code, cor(prdictor, response).

cor(nwound, nkill)

## [1] 0.9058301

.9058 is a strong correlation, therefore we can most likely use how many are wounded to predict how many were killed and we can see it is positive so as the amount of people wounded increase so does the amount killed.

Regression Line

Now, I want to find the line that best describes the linear association we saw above. To do this I first had to use the simple linear regression equation. \[\hat{nkill}= \hat{\beta_0}+\hat{\beta_1} nwound\] I found the line that best describes the linear association. We use the above mymod equation to find which intercept and slope we will need for the line of best fit. We then plug these numbers into the abline equation, abline(intercept, slope, col=“color of line you want”), that we put after the plot to add in the regression line.

mymod = lm(nkill ~ nwound)
mymod

## 
## Call:
## lm(formula = nkill ~ nwound)
## 
## Coefficients:
## (Intercept)       nwound  
##    524.3156       0.7395

plot(nwound, nkill, ylab = "Killed",
     xlab = "Wounded", 
     main = "Terrorism deaths per Wounded")
abline(524.3156,.7395, col="blue")

As you can see we get an intercept of 524.3156 and a slope of .7395.

Predicted values

Using the regression equation, I can predict the number of deaths from terrorism by plugging in a value for nwound, in this case I will use 40.

524.3156+(.7395*40)

## [1] 553.8956

In this example we predict that with 40 wounded from terrorism we can expect approximately 554 people killed.

Residuals

Histogram

Next I checked to see if the residuals (the difference between the observed values and the predicted values) are normally distributed by looking at a histogram of them.

myresids = mymod$residuals
hist(myresids)

As you can see there is some skewness but overall the histogram looks like it will follow a normal distribution.

qq-plot

Now, I will use a qq-plot to test that my assumption from my histogram is correct. A qq-plot is a plot of the quantiles of the nwound data against the quantiles of the nkill data. We use the qq-plot to check that both distributions actually are normally distributed, which they are as they form a close line.

qqnorm(myresids)
qqline(myresids)

Homoscedasticity

Now I will check that all the error terms will be the same across the data. To do this I will plot my residuals against my nwound data (plot(residuals~predictor)). I added in a couple lines to make it easier to see. The two red lines contain the majority of the data, the black line is set at zero to show the spread of residuals, and the blue line shows the trend below the line.

plot(mymod$residuals ~ nwound)
abline(5000,0, col="red")
abline(0,0)
abline(-5000,0, col="red")
abline(1000,-1/3, col="blue")

There is slight evidence of heteroscedastic as the residuals below the zero line are trending down (see blue line). If the residuals had an even spread from the line at zero I would have been certain of homoscedastic. That being said, it does not seem to be showing heteroscedastic as not all the points are following this trend. Therefore, the nkill data’s variability is mostly equal across the nwound data.

Prediction and Confidence intervals

To make a prediction/ confidence interval we must first create a new data frame indicating the value our predictor should be initialized to. Then we use our newdata and our mymod we already created to produce a prediction interval with interval=“Predict”. Then we can do the same thing for a confidence interval and change interval to equal “confidence”.

Prediction Interval

newdata = data.frame(nwound=40)
(predy = predict(mymod, newdata, interval="predict") )

##        fit       lwr     upr
## 1 553.8972 -6840.545 7948.34

This first one, the 95% prediction interval returns the lower bound of -6840.545 and the upper bound of 7948.34. The lower bound does not seem logical as we cannot have a negative amount of deaths. In this prediction interval we are 95% confident that a year that has 40 wounded people from terrorism will have between -6840.54 and 7948.34 people killed from terrorism.

Confidence Interval

(confy = predict(mymod, newdata, interval="confidence") )

##        fit       lwr      upr
## 1 553.8972 -929.9087 2037.703

The second one, the 95% confidence interval tells as that we are 95% confident that the mean number of deaths, when the number of wounded from terrorism is 40, is between -929.91 and 2037.70. Again, the negative lower bound does not make logical since here because there cannot be negative deaths.

Interval Widths

Now I determined the width of my confidence and prediction intervals by using the below equations.

confy %*% c(0, -1, 1)  #confidence interval width

##       [,1]
## 1 2967.612

predy %*% c(0, -1, 1)  #prediction interval width

##       [,1]
## 1 14788.88

The prediction interval should always be wider as it accounts for the uncertainty of knowing the value of the population mean as well as data scatter. Which as you can see above the prediction interval much wider at 14788.88, whereas the confidence interval is only 2967.612.

Centered

Here I checked to make sure that my confidence and prediction intervals are centered at the same location as they should both be centered on the mean.

confy[1] == predy[1]

## [1] TRUE

Summary

Using the summary function in R gives us a lot of useful information.

summary(mymod)

## 
## Call:
## lm(formula = nkill ~ nwound)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6136.9 -2351.6  -387.1  1874.2 12712.6 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 524.31559  737.69512   0.711    0.481    
## nwound        0.73954    0.05214  14.183   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3594 on 44 degrees of freedom
## Multiple R-squared:  0.8205, Adjusted R-squared:  0.8164 
## F-statistic: 201.2 on 1 and 44 DF,  p-value: < 2.2e-16

As you can see under Estimate is our intercept as well as our slope from earlier. This is followed by the standared error and t value. We can also use the summary function to find R^2, MSE, and our t-test and f-test pvalues.

R^2

R^2 is the explained variation divided by the total variation. This is important because the closer R^2 is to 1 the more total variation that is explained by the model and that means the model is more accurate in predicting nkill. Therefore, my R^2 value of .8205 is a fairly good predictor when it comes to using the amount of people wounded in terrorism to predict the amount of people killed.

MSE

Our mean square error can be easily found next to residual standard error. MSE is the point estimate of the variance of the error term populations. In this case MSE is equal to 3594. This is quite large as it is measuring the distance between the estimator and what is being estimated.

T-test

T-test can also be done with the summary function. The test stat of this particular data is 14.183 found under t value across from nwound. This test stat has a p-value of less than 2e-16 so this means that we can reject the null hypothesis that our nwound has no relationship (beta1=0) and say that our nwound does have a relationship with nkill(beta1 > 0 or beta1 < 0).

F-test

On the bottom of this summary you can find your f-test. This f-test has a very small p-value of less than 2.2e-16 which also means that we can reject the null hypothesis that our nwound has no relationship (beta1=0) and say that our nwound does have a relationship with nkill(beta1 > 0 or beta1 < 0), just like with the t-test.

Chapter 3 R Guide

Kaitlyn Massman

February 28, 2018