Case background

Man-portable air-defense systems (MANPADS or MPADS) are shoulder-launched surface-to-air missiles (SAMs). They are typically guided weapons and are a threat to low-flying aircraft, especially helicopters. MANPADS were developed in the 1950s to provide military ground forces with protection from enemy aircraft. They have received a great deal of attention, partly because armed groups have used them against commercial airliners. These missiles, affordable and widely available through a variety of sources, have been used successfully over the past three decades both in military conflicts and in attacks by non-state armed groups. See https://en.wikipedia.org/wiki/Man-portable_air-defense_system.

Web pages and books used

  1. http://www.statmethods.net/
  2. http://kek.ksu.ru/EOS/DataMining/1379968983.pdf

The problem

Here is some interesting data about MPADS. Let’s explore the data.

library(FactoMineR)
library(car)
library(ggplot2)
pzrk <- read.csv("~/my-data/my-data/pzrk/pzrk.txt")
str(pzrk)
## 'data.frame':    13 obs. of  13 variables:
##  $ Year            : int  1967 1970 1974 1981 1983 2004 1968 1981 1983 1989 ...
##  $ Country         : Factor w/ 5 levels "China","France",..: 3 3 3 3 3 3 5 5 5 5 ...
##  $ Name            : Factor w/ 13 levels "Blowpipe","FN6",..: 9 10 5 6 7 8 4 11 12 13 ...
##  $ Len.m           : num  1.42 1.44 1.47 1.67 1.67 ...
##  $ Diam.mm         : int  72 72 72 72 72 72 70 70 70 70 ...
##  $ Mass.mis.kg     : num  9.15 9.8 10.3 10.8 10.6 11.7 8.3 10.1 10.1 10.1 ...
##  $ Mass.net.kg     : num  14.5 15 16 17 17 19 12.7 15.7 15.7 15.7 ...
##  $ Yield.kg        : num  1.15 1.15 1.17 1.3 1.3 2.5 1.06 3 3 3 ...
##  $ Range.m         : int  3600 4200 4100 5200 5200 6000 3400 4000 4800 4800 ...
##  $ Height.m        : int  1500 2300 3000 2500 3500 3500 2500 3500 3800 3800 ...
##  $ Speed.m.s       : int  430 430 400 600 600 600 566 865 730 730 ...
##  $ Speed.target.m.s: int  220 260 310 360 360 400 225 NA NA NA ...
##  $ Probability     : Factor w/ 10 levels "0.25","0.33",..: 1 1 2 4 5 7 8 9 9 10 ...
cor(pzrk[,-c(1:3,8,12:13)])
##                 Len.m    Diam.mm Mass.mis.kg Mass.net.kg   Range.m
## Len.m       1.0000000  0.5812430   0.7786677  0.68501021 0.8865754
## Diam.mm     0.5812430  1.0000000   0.9281907  0.97349190 0.4101732
## Mass.mis.kg 0.7786677  0.9281907   1.0000000  0.98405977 0.6734219
## Mass.net.kg 0.6850102  0.9734919   0.9840598  1.00000000 0.5548379
## Range.m     0.8865754  0.4101732   0.6734219  0.55483792 1.0000000
## Height.m    0.4116695 -0.1579241   0.1417319  0.02759101 0.5762850
## Speed.m.s   0.4191271  0.3977185   0.5244555  0.52757076 0.3328453
##                Height.m Speed.m.s
## Len.m        0.41166948 0.4191271
## Diam.mm     -0.15792406 0.3977185
## Mass.mis.kg  0.14173192 0.5244555
## Mass.net.kg  0.02759101 0.5275708
## Range.m      0.57628499 0.3328453
## Height.m     1.00000000 0.4572448
## Speed.m.s    0.45724480 1.0000000
attach(pzrk)
scatterplotMatrix(~Len.m+Diam.mm+Mass.mis.kg+Mass.net.kg+Range.m+Height.m+Speed.m.s, data=pzrk,smoother=F)

The correlation matrix and the scatterplot matrix show that most of the variables are strongly correlated: the mass and diameter variables move together almost perfectly, and Range.m is strongly related to Len.m. Height.m is the exception, being only weakly related to diameter and mass.
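As a complementary view (not part of the original analysis), the correlation structure can be drawn as a heat map. The sketch below assumes the corrplot package is installed; it is not used elsewhere in this report.

library(corrplot)                            # assumption: corrplot is available
corrplot(cor(pzrk[, -c(1:3, 8, 12:13)]),     # same numeric columns as above
         method = "ellipse", order = "hclust")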

Research question

Our goal is to find out how well the effective range (\(Range.m\)) can be predicted from the other MPADS parameters.

Principal components analysis

One of the most challenging aspects of multivariate data is the sheer complexity of the information. Principal components analysis (PCA) is a data reduction technique that transforms a larger number of correlated variables into a much smaller set of uncorrelated variables called principal components.
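As a quick cross-check (a sketch only, not part of the original analysis), the same decomposition can be obtained with base R's prcomp() on the standardized numeric columns; the proportion of variance per component should agree with the FactoMineR eigenvalue table below.

pc.base <- prcomp(pzrk[, -c(1:3, 8, 12:13)], scale. = TRUE)  # standardize, then rotate
summary(pc.base)                                             # proportion of variance per component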

pzrk2 <- pzrk[-c(1:2, 8, 12:13)]  # drop Year, Country, Yield.kg, Speed.target.m.s, Probability

pc3 <- PCA(pzrk2[, -1])           # PCA on the numeric columns (Name excluded)

summary(pc3)
## 
## Call:
## PCA(X = pzrk2[, -1]) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               4.390   1.548   0.782   0.190   0.076   0.014
## % of var.             62.716  22.113  11.169   2.711   1.090   0.195
## Cumulative % of var.  62.716  84.828  95.997  98.708  99.798  99.993
##                        Dim.7
## Variance               0.001
## % of var.              0.007
## Cumulative % of var. 100.000
## 
## Individuals (the 10 first)
##                 Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 1           |  2.926 | -2.151  8.109  0.541 | -1.807 16.227  0.381 |
## 2           |  1.894 | -1.465  3.760  0.598 | -0.834  3.456  0.194 |
## 3           |  1.747 | -1.145  2.296  0.430 | -0.282  0.394  0.026 |
## 4           |  1.377 |  0.386  0.261  0.079 |  0.079  0.031  0.003 |
## 5           |  1.449 |  0.551  0.531  0.144 |  1.082  5.814  0.558 |
## 6           |  2.043 |  1.221  2.613  0.357 |  1.168  6.777  0.327 |
## 7           |  3.005 | -2.773 13.470  0.851 | -0.623  1.932  0.043 |
## 8           |  2.105 | -0.219  0.084  0.011 |  1.047  5.450  0.247 |
## 9           |  1.607 | -0.071  0.009  0.002 |  1.488 10.996  0.857 |
## 10          |  1.607 | -0.071  0.009  0.002 |  1.488 10.996  0.857 |
##              Dim.3    ctr   cos2  
## 1           -0.643  4.062  0.048 |
## 2           -0.854  7.172  0.203 |
## 3           -0.901  7.979  0.266 |
## 4           -0.880  7.616  0.408 |
## 5           -0.693  4.732  0.229 |
## 6           -1.012 10.081  0.245 |
## 7            0.806  6.395  0.072 |
## 8            1.713 28.880  0.662 |
## 9            0.595  3.485  0.137 |
## 10           0.595  3.485  0.137 |
## 
## Variables
##                Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## Len.m       |  0.878 17.540  0.770 |  0.226  3.297  0.051 | -0.303 11.710
## Diam.mm     |  0.835 15.878  0.697 | -0.530 18.138  0.281 |  0.038  0.180
## Mass.mis.kg |  0.967 21.304  0.935 | -0.224  3.232  0.050 | -0.013  0.021
## Mass.net.kg |  0.927 19.582  0.860 | -0.352  8.022  0.124 |  0.073  0.674
## Range.m     |  0.792 14.302  0.628 |  0.426 11.712  0.181 | -0.389 19.355
## Height.m    |  0.334  2.547  0.112 |  0.896 51.846  0.803 |  0.115  1.690
## Speed.m.s   |  0.623  8.848  0.388 |  0.241  3.754  0.058 |  0.720 66.371
##               cos2  
## Len.m        0.092 |
## Diam.mm      0.001 |
## Mass.mis.kg  0.000 |
## Mass.net.kg  0.005 |
## Range.m      0.151 |
## Height.m     0.013 |
## Speed.m.s    0.519 |

The first two principal components account for about 85% of the total variance. \(Range.m\) (effective range of the MPADS in metres) and \(Len.m\) (length of the MPADS in metres) both load heavily on the first component, which suggests examining their relationship directly.
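To make this interpretation concrete, the eigenvalues and variable loadings can be read directly from the FactoMineR object (a sketch; pc3$eig and pc3$var$coord are standard components of the PCA result).

pc3$eig                           # variance explained by each dimension
round(pc3$var$coord[, 1:2], 2)    # correlations of the variables with Dim.1 and Dim.2
barplot(pc3$eig[, 2], ylab = "% of variance", main = "Scree plot")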

Linear regression model

See: https://en.wikipedia.org/wiki/Linear_regression.

qplot(Len.m, Range.m, data = pzrk) + geom_smooth(method = lm) + geom_point(aes(colour=Name))

fit.rl<-lm(Range.m~Len.m,data=pzrk)
summary(fit.rl)
## 
## Call:
## lm(formula = Range.m ~ Len.m, data = pzrk)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -587.7 -246.1 -112.5  212.3  856.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2757.0     1163.4  -2.370   0.0372 *  
## Len.m         4832.1      760.2   6.357  5.4e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 425.3 on 11 degrees of freedom
## Multiple R-squared:  0.786,  Adjusted R-squared:  0.7666 
## F-statistic: 40.41 on 1 and 11 DF,  p-value: 5.396e-05
confint(fit.rl)
##                 2.5 %    97.5 %
## (Intercept) -5317.602 -196.4091
## Len.m        3158.940 6505.1911

The F-statistic (40.41 on 1 and 11 degrees of freedom) and its p-value (5.4e-05) indicate a highly significant linear relationship between \(Range.m\) and \(Len.m\), so the model is worth examining further.
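Before turning to diagnostics, the fitted model can already be used for prediction. The sketch below (not part of the original report) predicts the effective range of a hypothetical 1.6 m long missile together with a 95% prediction interval.

predict(fit.rl, newdata = data.frame(Len.m = 1.6),  # hypothetical 1.6 m missile
        interval = "prediction")                    # 95% prediction interval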

Regression diagnostics

See: http://www.statmethods.net/stats/rdiagnostics.html

Normality of errors

par(mfrow=c(1,3))
shapiro.test(fit.rl$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit.rl$residuals
## W = 0.95716, p-value = 0.7094
hist(fit.rl$residuals)
qqPlot(fit.rl,main="Q-Q Plot")

## Robert I. Kabacoff R code
residplot <- function(fit, nbreaks=10) {
  z <- rstudent(fit)
  hist(z, breaks=nbreaks, freq=FALSE,
       xlab="Studentized Residual",
       main="Distribution of Errors")
  rug(jitter(z), col="brown")
  curve(dnorm(x, mean=mean(z), sd=sd(z)),
        add=TRUE, col="blue", lwd=2)
  lines(density(z)$x, density(z)$y,
        col="red", lwd=2, lty=2)
  legend("topright",
         legend = c("Normal Curve", "Kernel Density Curve"),
         lty=1:2, col=c("blue","red"), cex=.7)
}

residplot(fit.rl)

The Shapiro-Wilk test (p = 0.71), the histogram, and the Q-Q plot all indicate that the residuals are consistent with the normality-of-errors assumption.

Independence of Errors

durbinWatsonTest(fit.rl)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.09484196      1.918983   0.914
##  Alternative hypothesis: rho != 0

The nonsignificant p-value (p = 0.91) suggests a lack of autocorrelation and, consequently, independence of the errors. A very good result for our model.

Linearity

crPlots(fit.rl)

The component-plus-residual plot confirms that the linearity assumption is met. The form of the linear model appears appropriate for this dataset.

Homoscedasticity

ncvTest(fit.rl)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.07677836    Df = 1     p = 0.7817117
spreadLevelPlot(fit.rl)

## 
## Suggested power transformation:  1.324579

The score test is nonsignificant (p = 0.78), suggesting that the constant variance assumption is met. In the spread-level plot, the points form a roughly random horizontal band around the fitted line.
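Since the assumption is met, the suggested power transformation (about 1.32) is not actually needed here. For completeness, applying it to the response would look like the sketch below (illustration only, not part of the original analysis).

fit.tr <- lm(I(Range.m^1.32) ~ Len.m, data = pzrk)  # response raised to the suggested power
summary(fit.tr)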

Global validation of linear model assumption

library(gvlma)
gvmodel <- gvlma(fit.rl)
summary(gvmodel)
## 
## Call:
## lm(formula = Range.m ~ Len.m, data = pzrk)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -587.7 -246.1 -112.5  212.3  856.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2757.0     1163.4  -2.370   0.0372 *  
## Len.m         4832.1      760.2   6.357  5.4e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 425.3 on 11 degrees of freedom
## Multiple R-squared:  0.786,  Adjusted R-squared:  0.7666 
## F-statistic: 40.41 on 1 and 11 DF,  p-value: 5.396e-05
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = fit.rl) 
## 
##                       Value p-value                Decision
## Global Stat        0.779724  0.9411 Assumptions acceptable.
## Skewness           0.679624  0.4097 Assumptions acceptable.
## Kurtosis           0.047260  0.8279 Assumptions acceptable.
## Link Function      0.044552  0.8328 Assumptions acceptable.
## Heteroscedasticity 0.008289  0.9275 Assumptions acceptable.
par(mfrow=c(2,3))
plot(fit.rl,which = 1:6,labels.id = Name)

You can see from the printout (the Global Stat line) that the data meet all the statistical assumptions of the OLS regression model (p = 0.9411). Great! The Cook's distance plot does flag a few influential observations, however: these points are not predicted well by the model. Such things happen with only 13 observations.
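To see exactly which observations are flagged, Cook's distances can be listed directly (a sketch; the 4/(n - k - 1) cutoff is a common rule of thumb, not something used in the original report).

cd <- cooks.distance(fit.rl)
cutoff <- 4 / (nrow(pzrk) - length(coef(fit.rl)))   # rule-of-thumb cutoff 4/(n - k - 1)
pzrk$Name[cd > cutoff]                              # names of the influential missiles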

Conclusion

  1. Our linear model proved to be adequate and robust according to the diagnostic tests.
  2. Effective range (\(Range.m\)) can be estimated from the length (\(Len.m\)) of a MPADS as \[Range.m = -2757.0 + 4832.1*Len.m\] (a worked example follows this list).
  3. The multiple R-squared (0.786) indicates that our model accounts for about 79% of the variance in the \(Range.m\) variable.
  4. Many thanks to Robert I. Kabacoff for his book R in Action: Data Analysis and Graphics with R.
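As a quick illustration of the formula in point 2 (a hypothetical length, not a missile from the dataset), the predicted effective range of a 1.5 m long MPADS would be:

-2757.0 + 4832.1 * 1.5   # about 4491 m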