Man-portable air-defense systems (MANPADS or MPADS) are shoulder-launched surface-to-air missiles (SAMs). They are typically guided weapons and are a threat to low-flying aircraft, especially helicopters. MANPADS were developed in the 1940s to provide military ground forces with protection from enemy aircraft. They have received a great deal of attention, partly because armed groups have used them against commercial airliners. These missiles, affordable and widely available through a variety of sources, have been used successfully over the past three decades both in military conflicts and by terrorist organizations. See https://en.wikipedia.org/wiki/Man-portable_air-defense_system.
Here is an interesting dataset about MPADS. Let's explore it.
library(FactoMineR)  # PCA
library(car)         # scatterplotMatrix and regression diagnostics
library(ggplot2)     # qplot
pzrk <- read.csv("~/my-data/my-data/pzrk/pzrk.txt")  # "pzrk" (ПЗРК) is the Russian abbreviation for MANPADS
str(pzrk)
## 'data.frame': 13 obs. of 13 variables:
## $ Year : int 1967 1970 1974 1981 1983 2004 1968 1981 1983 1989 ...
## $ Country : Factor w/ 5 levels "China","France",..: 3 3 3 3 3 3 5 5 5 5 ...
## $ Name : Factor w/ 13 levels "Blowpipe","FN6",..: 9 10 5 6 7 8 4 11 12 13 ...
## $ Len.m : num 1.42 1.44 1.47 1.67 1.67 ...
## $ Diam.mm : int 72 72 72 72 72 72 70 70 70 70 ...
## $ Mass.mis.kg : num 9.15 9.8 10.3 10.8 10.6 11.7 8.3 10.1 10.1 10.1 ...
## $ Mass.net.kg : num 14.5 15 16 17 17 19 12.7 15.7 15.7 15.7 ...
## $ Yield.kg : num 1.15 1.15 1.17 1.3 1.3 2.5 1.06 3 3 3 ...
## $ Range.m : int 3600 4200 4100 5200 5200 6000 3400 4000 4800 4800 ...
## $ Height.m : int 1500 2300 3000 2500 3500 3500 2500 3500 3800 3800 ...
## $ Speed.m.s : int 430 430 400 600 600 600 566 865 730 730 ...
## $ Speed.target.m.s: int 220 260 310 360 360 400 225 NA NA NA ...
## $ Probability : Factor w/ 10 levels "0.25","0.33",..: 1 1 2 4 5 7 8 9 9 10 ...
cor(pzrk[, -c(1:3, 8, 12:13)])  # drop Year, Country, Name, Yield.kg, Speed.target.m.s (contains NAs), Probability
## Len.m Diam.mm Mass.mis.kg Mass.net.kg Range.m
## Len.m 1.0000000 0.5812430 0.7786677 0.68501021 0.8865754
## Diam.mm 0.5812430 1.0000000 0.9281907 0.97349190 0.4101732
## Mass.mis.kg 0.7786677 0.9281907 1.0000000 0.98405977 0.6734219
## Mass.net.kg 0.6850102 0.9734919 0.9840598 1.00000000 0.5548379
## Range.m 0.8865754 0.4101732 0.6734219 0.55483792 1.0000000
## Height.m 0.4116695 -0.1579241 0.1417319 0.02759101 0.5762850
## Speed.m.s 0.4191271 0.3977185 0.5244555 0.52757076 0.3328453
## Height.m Speed.m.s
## Len.m 0.41166948 0.4191271
## Diam.mm -0.15792406 0.3977185
## Mass.mis.kg 0.14173192 0.5244555
## Mass.net.kg 0.02759101 0.5275708
## Range.m 0.57628499 0.3328453
## Height.m 1.00000000 0.4572448
## Speed.m.s 0.45724480 1.0000000
attach(pzrk)  # make columns (e.g. Name) directly accessible by name
scatterplotMatrix(~ Len.m + Diam.mm + Mass.mis.kg + Mass.net.kg + Range.m + Height.m + Speed.m.s, data = pzrk, smoother = FALSE)
The scatterplot matrix and the correlation table show that most variables are correlated, often strongly; the main exception is \(Height.m\) versus \(Diam.mm\) (r = -0.16).
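A compact way to scan the same table is base R's symnum(), which replaces each correlation with a symbol coding its strength (a small optional sketch):
# Symbolic correlation matrix: ' ' |r| < 0.3, '.' < 0.6, ',' < 0.8, '+' < 0.9, '*' < 0.95, 'B' above
symnum(cor(pzrk[, -c(1:3, 8, 12:13)]))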
Our goal is to find out how well the effective range \(Range.m\) can be predicted from the other parameters of an MPADS.
One of the most challenging aspects of multivariate data is the sheer complexity of the information. Principal components analysis (PCA) is a data-reduction technique that transforms a larger number of correlated variables into a much smaller set of uncorrelated variables called principal components.
pzrk2 <- pzrk[-c(1:2, 8, 12:13)]  # keep Name plus the seven numeric variables used above
pc3 <- PCA(pzrk2[, -1])           # PCA on the numeric columns, excluding Name
summary(pc3)
##
## Call:
## PCA(X = pzrk2[, -1])
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 4.390 1.548 0.782 0.190 0.076 0.014
## % of var. 62.716 22.113 11.169 2.711 1.090 0.195
## Cumulative % of var. 62.716 84.828 95.997 98.708 99.798 99.993
## Dim.7
## Variance 0.001
## % of var. 0.007
## Cumulative % of var. 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2
## 1 | 2.926 | -2.151 8.109 0.541 | -1.807 16.227 0.381 |
## 2 | 1.894 | -1.465 3.760 0.598 | -0.834 3.456 0.194 |
## 3 | 1.747 | -1.145 2.296 0.430 | -0.282 0.394 0.026 |
## 4 | 1.377 | 0.386 0.261 0.079 | 0.079 0.031 0.003 |
## 5 | 1.449 | 0.551 0.531 0.144 | 1.082 5.814 0.558 |
## 6 | 2.043 | 1.221 2.613 0.357 | 1.168 6.777 0.327 |
## 7 | 3.005 | -2.773 13.470 0.851 | -0.623 1.932 0.043 |
## 8 | 2.105 | -0.219 0.084 0.011 | 1.047 5.450 0.247 |
## 9 | 1.607 | -0.071 0.009 0.002 | 1.488 10.996 0.857 |
## 10 | 1.607 | -0.071 0.009 0.002 | 1.488 10.996 0.857 |
## Dim.3 ctr cos2
## 1 -0.643 4.062 0.048 |
## 2 -0.854 7.172 0.203 |
## 3 -0.901 7.979 0.266 |
## 4 -0.880 7.616 0.408 |
## 5 -0.693 4.732 0.229 |
## 6 -1.012 10.081 0.245 |
## 7 0.806 6.395 0.072 |
## 8 1.713 28.880 0.662 |
## 9 0.595 3.485 0.137 |
## 10 0.595 3.485 0.137 |
##
## Variables
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## Len.m | 0.878 17.540 0.770 | 0.226 3.297 0.051 | -0.303 11.710
## Diam.mm | 0.835 15.878 0.697 | -0.530 18.138 0.281 | 0.038 0.180
## Mass.mis.kg | 0.967 21.304 0.935 | -0.224 3.232 0.050 | -0.013 0.021
## Mass.net.kg | 0.927 19.582 0.860 | -0.352 8.022 0.124 | 0.073 0.674
## Range.m | 0.792 14.302 0.628 | 0.426 11.712 0.181 | -0.389 19.355
## Height.m | 0.334 2.547 0.112 | 0.896 51.846 0.803 | 0.115 1.690
## Speed.m.s | 0.623 8.848 0.388 | 0.241 3.754 0.058 | 0.720 66.371
## cos2
## Len.m 0.092 |
## Diam.mm 0.001 |
## Mass.mis.kg 0.000 |
## Mass.net.kg 0.005 |
## Range.m 0.151 |
## Height.m 0.013 |
## Speed.m.s 0.519 |
The first two components account for about 85% of the total variance. Both \(Range.m\), the effective range of an MPADS in meters, and \(Len.m\), its length in meters, load heavily on the first component, which suggests trying a simple linear regression of range on length.
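As a cross-check, the same decomposition can be reproduced with base R's prcomp() (a sketch; FactoMineR's PCA standardizes the variables by default, which scale. = TRUE mimics):
pr <- prcomp(pzrk2[, -1], scale. = TRUE)  # base-R PCA on the same seven standardized variables
summary(pr)                               # proportions of variance should match the eigenvalue table above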
See: https://en.wikipedia.org/wiki/Linear_regression.
qplot(Len.m, Range.m, data = pzrk) + geom_smooth(method = lm) + geom_point(aes(colour = Name))
fit.rl <- lm(Range.m ~ Len.m, data = pzrk)
summary(fit.rl)
##
## Call:
## lm(formula = Range.m ~ Len.m, data = pzrk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -587.7 -246.1 -112.5 212.3 856.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2757.0 1163.4 -2.370 0.0372 *
## Len.m 4832.1 760.2 6.357 5.4e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 425.3 on 11 degrees of freedom
## Multiple R-squared: 0.786, Adjusted R-squared: 0.7666
## F-statistic: 40.41 on 1 and 11 DF, p-value: 5.396e-05
confint(fit.rl)
## 2.5 % 97.5 %
## (Intercept) -5317.602 -196.4091
## Len.m 3158.940 6505.1911
The F-statistic and its p-value (5.4e-05) indicate a highly significant linear relationship, and the model explains about 79% of the variance in \(Range.m\); the model looks adequate.
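The fitted model can now predict the range of a missile from its length; the 1.6 m value below is made up purely for illustration:
# Predicted effective range, with a 95% prediction interval, for a hypothetical 1.6 m missile
predict(fit.rl, newdata = data.frame(Len.m = 1.6), interval = "prediction")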
See: http://www.statmethods.net/stats/rdiagnostics.html
par(mfrow = c(1, 3))  # room for the three residual plots below
shapiro.test(fit.rl$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit.rl$residuals
## W = 0.95716, p-value = 0.7094
hist(fit.rl$residuals)
qqPlot(fit.rl,main="Q-Q Plot")
## Robert I. Kabacoff's R code
residplot <- function(fit, nbreaks = 10) {
  z <- rstudent(fit)
  hist(z, breaks = nbreaks, freq = FALSE,
       xlab = "Studentized Residual",
       main = "Distribution of Errors")
  rug(jitter(z), col = "brown")
  curve(dnorm(x, mean = mean(z), sd = sd(z)),
        add = TRUE, col = "blue", lwd = 2)
  lines(density(z)$x, density(z)$y,
        col = "red", lwd = 2, lty = 2)
  legend("topright",
         legend = c("Normal Curve", "Kernel Density Curve"),
         lty = 1:2, col = c("blue", "red"), cex = .7)
}
residplot(fit.rl)
A very reassuring result from the normality-of-errors checks: the Shapiro-Wilk p-value of 0.71 gives no evidence against normality of the residuals.
durbinWatsonTest(fit.rl)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.09484196 1.918983 0.914
## Alternative hypothesis: rho != 0
The nonsignificant p-value (p = 0.91) suggests a lack of autocorrelation and, conversely, independence of the errors. A very nice result for our model.
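Note that durbinWatsonTest() bootstraps its p-value, so the exact number varies slightly between runs. The sample autocorrelation function gives a deterministic complementary look (a sketch):
acf(resid(fit.rl), main = "ACF of residuals")  # residuals in row order (country, then year)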
crPlots(fit.rl)
The component plus residual plots confirm that we have met the linearity assumption. The form of the linear model seems to be appropriate for this dataset.
ncvTest(fit.rl)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.07677836 Df = 1 p = 0.7817117
spreadLevelPlot(fit.rl)
##
## Suggested power transformation: 1.324579
The score test is nonsignificant (p = 0.78), suggesting that the constant-variance assumption is met. In the spread-level plot, the points form a random horizontal band around a horizontal line of best fit.
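If the lmtest package is installed (an assumption; it is not loaded above), the Breusch-Pagan test offers a second opinion on homoscedasticity:
library(lmtest)
bptest(fit.rl)  # H0: constant error variance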
library(gvlma)
gvmodel <- gvlma(fit.rl)
summary(gvmodel)
##
## Call:
## lm(formula = Range.m ~ Len.m, data = pzrk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -587.7 -246.1 -112.5 212.3 856.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2757.0 1163.4 -2.370 0.0372 *
## Len.m 4832.1 760.2 6.357 5.4e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 425.3 on 11 degrees of freedom
## Multiple R-squared: 0.786, Adjusted R-squared: 0.7666
## F-statistic: 40.41 on 1 and 11 DF, p-value: 5.396e-05
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = fit.rl)
##
## Value p-value Decision
## Global Stat 0.779724 0.9411 Assumptions acceptable.
## Skewness 0.679624 0.4097 Assumptions acceptable.
## Kurtosis 0.047260 0.8279 Assumptions acceptable.
## Link Function 0.044552 0.8328 Assumptions acceptable.
## Heteroscedasticity 0.008289 0.9275 Assumptions acceptable.
par(mfrow = c(2, 3))
plot(fit.rl, which = 1:6, labels.id = Name)  # six base diagnostic plots, points labeled by missile name
You can see from the printout (the Global Stat line) that the data meet all the statistical assumptions of the OLS regression model (p = 0.9411). Great! But the Cook's distance plot flags some influential observations that aren't predicted well by the model. Such things happen with a dataset of only 13 missiles.
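To put names to those observations, here is a quick numeric follow-up (a sketch; the 4/n cutoff is only one common rule of thumb):
cd <- cooks.distance(fit.rl)
cutoff <- 4 / nrow(pzrk)  # one common rule-of-thumb threshold
pzrk$Name[cd > cutoff]    # missiles with unusually large influence on the fit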