My R guide will be focused on simple linear regression. There will be two quantitative variables, a predictor and a response. The data set I will be working with today is called “terrorism”. I will be using the number of wounded victims from terrorism to predict the number of victims killed from terrorism. Therefore, my predictor variable is the number of wounded victims, nwound, and my response variable is the number of people killed(including terrorists), nkill.
First, I will use the library Ecdat and retrieve my data “terrorism”. Next, I did ?terrorism to learn more about my data set and then I attached the data for later use. After attaching the data, I will review the beginning portion by using the head code to get a feel for the data set I will be working with. Finally, I will check the dimensions and look at the variable names, this is how I determined my nwound predictor and my nkill response.
Later, I will look at the data in a scatterplot to see if I can determine an association. Then I will confirm the association, if there is one, by checking the correlation. After that, I will fit a regression line to the data. This will be followed by a histogram, qq-plot, and residuals. Next, I will use my data set in prediction intervals and confidence intervals where I will actually predict values based on my predictor variable. Lastly, I will touch on R^2 and the f-test for this data.
library(Ecdat)
## Warning: package 'Ecdat' was built under R version 3.4.3
## Loading required package: Ecfun
## Warning: package 'Ecfun' was built under R version 3.4.3
##
## Attaching package: 'Ecfun'
## The following object is masked from 'package:base':
##
## sign
##
## Attaching package: 'Ecdat'
## The following object is masked from 'package:datasets':
##
## Orange
data("terrorism")
?terrorism
## starting httpd help server ...
## done
attach(terrorism)
head(terrorism)
## year methodology method incidents incidents.us suicide suicide.us nkill
## 1 1970 PGIS p 651 468 0 0 171
## 2 1971 PGIS p 470 247 0 0 173
## 3 1972 PGIS p 494 64 0 0 566
## 4 1973 PGIS p 473 58 0 0 370
## 5 1974 PGIS p 580 94 0 0 542
## 6 1975 PGIS p 740 149 0 0 617
## nkill.us nwound nwound.us pNA.nkill pNA.nkill.us pNA.nwound
## 1 28 192 132 0.064516 0.379416 0.079877
## 2 15 82 48 0.136170 0.682979 0.308511
## 3 12 222 23 0.093117 0.880567 0.668016
## 4 73 495 87 0.103594 0.911205 0.484144
## 5 17 754 63 0.100000 0.841379 0.394828
## 6 22 617 18 0.206757 0.905405 0.459459
## pNA.nwound.us worldPopulation USpopulation worldDeathRate USdeathRate
## 1 0.380952 3682488 209485.8 11.966 9.5
## 2 0.682979 3757735 211357.9 11.731 9.3
## 3 0.886640 3833595 213219.5 11.635 9.4
## 4 0.911205 3909722 215092.9 11.392 9.3
## 5 0.856897 3985734 217001.9 11.285 9.1
## 6 0.914865 4061399 218963.6 11.141 8.8
## worldDeaths USdeaths kill.pmp kill.pmp.us pkill pkill.us
## 1 44064648 1921031 0.04643600 0.13366061 3.880662e-06 1.457551e-05
## 2 44081985 1927542 0.04603838 0.07096966 3.924506e-06 7.781932e-06
## 3 44603877 1963944 0.14764210 0.05628003 1.268948e-05 6.110154e-06
## 4 44539554 1973003 0.09463588 0.33938824 8.307223e-06 3.699944e-05
## 5 44979006 1934388 0.13598500 0.07834034 1.205007e-05 8.788309e-06
## 6 45248049 1892879 0.15191809 0.10047334 1.363595e-05 1.162251e-05
dim(terrorism)
## [1] 46 25
names(terrorism)
## [1] "year" "methodology" "method"
## [4] "incidents" "incidents.us" "suicide"
## [7] "suicide.us" "nkill" "nkill.us"
## [10] "nwound" "nwound.us" "pNA.nkill"
## [13] "pNA.nkill.us" "pNA.nwound" "pNA.nwound.us"
## [16] "worldPopulation" "USpopulation" "worldDeathRate"
## [19] "USdeathRate" "worldDeaths" "USdeaths"
## [22] "kill.pmp" "kill.pmp.us" "pkill"
## [25] "pkill.us"
First, to begin my regression analysis I plotted my explanatory variable against my response variable. In this case I plotted how many were wounded from terrorism against how many were killed on a scatterplot to see if there was any correlation using the plot function. Also, for in case anyone else looks at this graph I labeled the axes will common names instead of the variable names, ylab for the y axis and xlab for the x axis.
Explanatory={nwound}
Response={nkill}
plot(Explanatory, Response, ylab = "Killed",
xlab = "Wounded",
main = "Terrorism Deaths per Wounded")
As you can see they have a positive association and the data follows the trend fairly closely, so it probably has a linear relationship. We should be able to predict how many were killed from terrorism with a fair amount of accuracy.
Next, I checked using R code to see exactly what the correlation was between the number of wounded people and the number of killed people from terrorism. TO do this you use the correlation code, cor(prdictor, response).
cor(nwound, nkill)
## [1] 0.9058301
.9058 is a strong correlation, therefore we can most likely use how many are wounded to predict how many were killed and we can see it is positive so as the amount of people wounded increase so does the amount killed.
Now, I want to find the line that best describes the linear association we saw above. To do this I first had to use the simple linear regression equation. \[\hat{nkill}= \hat{\beta_0}+\hat{\beta_1} nwound\] I found the line that best describes the linear association. We use the above mymod equation to find which intercept and slope we will need for the line of best fit. We then plug these numbers into the abline equation, abline(intercept, slope, col=“color of line you want”), that we put after the plot to add in the regression line.
mymod = lm(nkill ~ nwound)
mymod
##
## Call:
## lm(formula = nkill ~ nwound)
##
## Coefficients:
## (Intercept) nwound
## 524.3156 0.7395
plot(nwound, nkill, ylab = "Killed",
xlab = "Wounded",
main = "Terrorism deaths per Wounded")
abline(524.3156,.7395, col="blue")
As you can see we get an intercept of 524.3156 and a slope of .7395.
Using the regression equation, I can predict the number of deaths from terrorism by plugging in a value for nwound, in this case I will use 40.
524.3156+(.7395*40)
## [1] 553.8956
In this example we predict that with 40 wounded from terrorism we can expect approximately 554 people killed.
Next I checked to see if the residuals (the difference between the observed values and the predicted values) are normally distributed by looking at a histogram of them.
myresids = mymod$residuals
hist(myresids)
As you can see there is some skewness but overall the histogram looks like it will follow a normal distribution.
Now, I will use a qq-plot to test that my assumption from my histogram is correct. A qq-plot is a plot of the quantiles of the nwound data against the quantiles of the nkill data. We use the qq-plot to check that both distributions actually are normally distributed, which they are as they form a close line.
qqnorm(myresids)
qqline(myresids)
Now I will check that all the error terms will be the same across the data. To do this I will plot my residuals against my nwound data (plot(residuals~predictor)). I added in a couple lines to make it easier to see. The two red lines contain the majority of the data, the black line is set at zero to show the spread of residuals, and the blue line shows the trend below the line.
plot(mymod$residuals ~ nwound)
abline(5000,0, col="red")
abline(0,0)
abline(-5000,0, col="red")
abline(1000,-1/3, col="blue")
There is slight evidence of heteroscedastic as the residuals below the zero line are trending down (see blue line). If the residuals had an even spread from the line at zero I would have been certain of homoscedastic. That being said, it does not seem to be showing heteroscedastic as not all the points are following this trend. Therefore, the nkill data’s variability is mostly equal across the nwound data.
To make a prediction/ confidence interval we must first create a new data frame indicating the value our predictor should be initialized to. Then we use our newdata and our mymod we already created to produce a prediction interval with interval=“Predict”. Then we can do the same thing for a confidence interval and change interval to equal “confidence”.
newdata = data.frame(nwound=40)
(predy = predict(mymod, newdata, interval="predict") )
## fit lwr upr
## 1 553.8972 -6840.545 7948.34
This first one, the 95% prediction interval returns the lower bound of -6840.545 and the upper bound of 7948.34. The lower bound does not seem logical as we cannot have a negative amount of deaths. In this prediction interval we are 95% confident that a year that has 40 wounded people from terrorism will have between -6840.54 and 7948.34 people killed from terrorism.
(confy = predict(mymod, newdata, interval="confidence") )
## fit lwr upr
## 1 553.8972 -929.9087 2037.703
The second one, the 95% confidence interval tells as that we are 95% confident that the mean number of deaths, when the number of wounded from terrorism is 40, is between -929.91 and 2037.70. Again, the negative lower bound does not make logical since here because there cannot be negative deaths.
Now I determined the width of my confidence and prediction intervals by using the below equations.
confy %*% c(0, -1, 1) #confidence interval width
## [,1]
## 1 2967.612
predy %*% c(0, -1, 1) #prediction interval width
## [,1]
## 1 14788.88
The prediction interval should always be wider as it accounts for the uncertainty of knowing the value of the population mean as well as data scatter. Which as you can see above the prediction interval much wider at 14788.88, whereas the confidence interval is only 2967.612.
Here I checked to make sure that my confidence and prediction intervals are centered at the same location as they should both be centered on the mean.
confy[1] == predy[1]
## [1] TRUE
Using the summary function in R gives us a lot of useful information.
summary(mymod)
##
## Call:
## lm(formula = nkill ~ nwound)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6136.9 -2351.6 -387.1 1874.2 12712.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 524.31559 737.69512 0.711 0.481
## nwound 0.73954 0.05214 14.183 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3594 on 44 degrees of freedom
## Multiple R-squared: 0.8205, Adjusted R-squared: 0.8164
## F-statistic: 201.2 on 1 and 44 DF, p-value: < 2.2e-16
As you can see under Estimate is our intercept as well as our slope from earlier. This is followed by the standared error and t value. We can also use the summary function to find R^2, MSE, and our t-test and f-test pvalues.
R^2 is the explained variation divided by the total variation. This is important because the closer R^2 is to 1 the more total variation that is explained by the model and that means the model is more accurate in predicting nkill. Therefore, my R^2 value of .8205 is a fairly good predictor when it comes to using the amount of people wounded in terrorism to predict the amount of people killed.
Our mean square error can be easily found next to residual standard error. MSE is the point estimate of the variance of the error term populations. In this case MSE is equal to 3594. This is quite large as it is measuring the distance between the estimator and what is being estimated.
T-test can also be done with the summary function. The test stat of this particular data is 14.183 found under t value across from nwound. This test stat has a p-value of less than 2e-16 so this means that we can reject the null hypothesis that our nwound has no relationship (beta1=0) and say that our nwound does have a relationship with nkill(beta1 > 0 or beta1 < 0).
On the bottom of this summary you can find your f-test. This f-test has a very small p-value of less than 2.2e-16 which also means that we can reject the null hypothesis that our nwound has no relationship (beta1=0) and say that our nwound does have a relationship with nkill(beta1 > 0 or beta1 < 0), just like with the t-test.