Directions

Analyze the Nasa Rocket data and the Teacher Training in the regression review data folder as outlined in the description document.

NASA ROCKET DATA

NASARocketData - different types of nozzles and propellent mixtures on rocket thrust efficiency. Generate a predictive model and the impact of each predictive factor. Determine the best nozzle/propellent combination to achieve optimal thrust efficiency.

library(readxl)
## Warning: package 'readxl' was built under R version 3.5.3
NASARocketData <- read_excel("C:/USC Marshall/3.academics/Harrisburg/Qtr 3/ANLY 510 Analytics II/Labs/Lab 5/NASARocketData.xlsx")
View(NASARocketData)

We run a multiple regression with nozzle size and propellent mix as the predictors and efficiency of thrust as the dependent. We want to build a model to find the optimal combination of nozzle and mix. Let’s start with scatter plots to see how our predictors relate to our dependent, and also see if they are related to each-other which could indicate multicollinearity is going to be an issue:

Scatter-plot

library(car)
## Warning: package 'car' was built under R version 3.5.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.5.2
scatterplot(NASARocketData$NozzleSize, NASARocketData$Efficiency)

scatterplot(NASARocketData$PropellentMix, NASARocketData$Efficiency)

scatterplot(NASARocketData$NozzleSize, NASARocketData$PropellentMix)

No super obvious outliers and we see a weak negative relationship with nozzle size and efficiency, a strong positive relationship between formulation and efficiency, and it also appears our thrust and formula mixtures are going to be highly correlated. Lets see how we are with skew:

library(moments)
## Warning: package 'moments' was built under R version 3.5.2
agostino.test(NASARocketData$NozzleSize)
## 
##  D'Agostino skewness test
## 
## data:  NASARocketData$NozzleSize
## skew = 0.013663, z = 0.025830, p-value = 0.9794
## alternative hypothesis: data have a skewness
agostino.test(NASARocketData$PropellentMix)
## 
##  D'Agostino skewness test
## 
## data:  NASARocketData$PropellentMix
## skew = -0.077928, z = -0.147270, p-value = 0.8829
## alternative hypothesis: data have a skewness
agostino.test(NASARocketData$Efficiency)
## 
##  D'Agostino skewness test
## 
## data:  NASARocketData$Efficiency
## skew = 0.016073, z = 0.030385, p-value = 0.9758
## alternative hypothesis: data have a skewness

No clear issues with skew. Lets scale our predictors (this does the mean centering but will need to keep this in mind for interpretation) and run our model:

NASARocketData$nsm <- as.numeric(scale(NASARocketData$NozzleSize))
NASARocketData$pmm <- as.numeric(scale(NASARocketData$PropellentMix))
NASAmodel <- lm(Efficiency ~ nsm*pmm, data = NASARocketData)
summary(NASAmodel)
## 
## Call:
## lm(formula = Efficiency ~ nsm * pmm, data = NASARocketData)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.185193 -0.050863  0.005704  0.046910  0.185922 
## 
## Coefficients:
##             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 95.57948    0.04418 2163.629  < 2e-16 ***
## nsm         -1.91201    0.05094  -37.535 2.79e-10 ***
## pmm          2.36589    0.05104   46.357 5.18e-11 ***
## nsm:pmm     -0.16679    0.04356   -3.829  0.00502 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.114 on 8 degrees of freedom
## Multiple R-squared:  0.9964, Adjusted R-squared:  0.9951 
## F-statistic: 738.3 on 3 and 8 DF,  p-value: 4.122e-10
vif(NASAmodel)
##      nsm      pmm  nsm:pmm 
## 2.194956 2.203260 1.007813
leveragePlots(NASAmodel)

cooks.distance(NASAmodel)
##            1            2            3            4            5 
## 0.0229252993 0.3871770950 0.0270217653 0.0001608687 0.0386278273 
##            6            7            8            9           10 
## 0.0468191688 0.0010212015 0.0057817238 0.1808685698 0.3256096994 
##           11           12 
## 0.2285134733 0.0930341746
outlierTest(NASAmodel)
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##    rstudent unadjusted p-value Bonferonni p
## 10 2.402681            0.04728      0.56735
shapiro.test(NASAmodel$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  NASAmodel$residuals
## W = 0.98836, p-value = 0.9993

No real outliers, no high leverage observations, no major multicollinearity, and residuals are very normal according to shapiro. From the model we see that each SD increase from the mean in nozzle size leads to a 1.91 decrease in efficiency, each SD increase in propellent mix leads to a 2.37 increase in efficiency, and their interaction indicates that the effect of nozzle size is decreased further when the propellent mix increases as indicated in the following figure which shows the effect of nozzle size (x axis) on the propellent mix coefficient (y axis):

library(interplot)
## Warning: package 'interplot' was built under R version 3.5.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.5.3
## Loading required package: abind
## Warning: package 'abind' was built under R version 3.5.2
## Loading required package: arm
## Warning: package 'arm' was built under R version 3.5.3
## Loading required package: MASS
## Loading required package: Matrix
## Loading required package: lme4
## Warning: package 'lme4' was built under R version 3.5.3
## 
## arm (Version 1.10-1, built: 2018-4-12)
## Working directory is C:/USC Marshall/3.academics/Harrisburg/Qtr 3/ANLY 510 Analytics II/Labs/Lab 5
## 
## Attaching package: 'arm'
## The following object is masked from 'package:car':
## 
##     logit
interplot(m = NASAmodel, var1 = "nsm", var2  = "pmm", esize = 1.5)

Ok so we have a model, now we need to use it to find the optimal mix of nozzle size and propellent mix. What we need to do is create a new data frame that has all possible mixtures of nozzle size and propellent to pick from. We see we go from about -1.3 to 1.3 for nozzle size in .01 steps and -1.6 to 1.6 in .01 steps for propellent mix.

newdata <- expand.grid(nsm = seq(-1.3, 1.3, .01), pmm = seq(-1.6, 1.6, .01))
newdata$predicted <- predict(NASAmodel, newdata = newdata)

We see the best combinations have nozzle size at its smallest and propellent mix at its highest. Let’s convert that back to real units to see what this means and write up our summary:

mean(NASARocketData$NozzleSize)-1.3*sd(NASARocketData$NozzleSize)
## [1] 3.78706
mean(NASARocketData$PropellentMix)+1.6*sd(NASARocketData$PropellentMix)
## [1] 2.028404

We generated a predictive model regressing thrust efficiency on nozzle size, propellent mixture, and their interaction (predictors scaled). We find that larger nozzle sizes were predictive of decreases in thrust efficiency, ?? = -1.91, t(8) = -37.54, p < .001. A greater propellent mixture was predictive of increased thrust efficiency, ?? = 2.37, t(8) = 46.36, p < .001. Their interaction was also significant indicating that the negative effect of nozzle size is further increased when the propellent mix is high, ?? = -.17, t(8) = -3.83, p = .04. To find the optimal combination of nozzle size and propellent mix we constructed a new data set with nozzle size going from ~ 3.79 to 5.99 in .009 steps and propellent mix going from 1.82 to 2.03 in .0006 steps and used the regression model to predict efficiency. We find the greatest efficiency is achieved when nozzle size is minimized (3.79) and propellent mix is maximized (2.03).

TEACHER TRAINING DATA

TeacherTrainingData - Ability scores and training methods and resulting teacher performance. Determine the training method that performs best while controlling for ability.

library(readxl)
TeacherTrainingData <- read_excel("C:/USC Marshall/3.academics/Harrisburg/Qtr 3/ANLY 510 Analytics II/Labs/Lab 5/TeacherTrainingData.xlsx")
View(TeacherTrainingData)

Scatter-plot

scatterplot(TeacherTrainingData$Ability,TeacherTrainingData$LearningScore)

# Correlation Test

cor.test(TeacherTrainingData$Ability,TeacherTrainingData$LearningScore)
## 
##  Pearson's product-moment correlation
## 
## data:  TeacherTrainingData$Ability and TeacherTrainingData$LearningScore
## t = 1.7579, df = 66, p-value = 0.0834
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.02836654  0.42831646
## sample estimates:
##       cor 
## 0.2114889

There seems to be a moderate correlation amongst ability and learning score

Now, let’s do a Regression

summary(lm(LearningScore~Ability*Treatment,data = TeacherTrainingData))
## 
## Call:
## lm(formula = LearningScore ~ Ability * Treatment, data = TeacherTrainingData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.0246  -7.4198   0.0964   7.3837  22.1946 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        16.0895     7.6817   2.095   0.0402 *
## Ability             0.7585     0.3106   2.442   0.0174 *
## Treatment          25.8308    11.0997   2.327   0.0231 *
## Ability:Treatment  -0.7786     0.4403  -1.768   0.0818 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.49 on 64 degrees of freedom
## Multiple R-squared:  0.1633, Adjusted R-squared:  0.1241 
## F-statistic: 4.165 on 3 and 64 DF,  p-value: 0.009307

Now we can do Agostino test

agostino.test(TeacherTrainingData$LearningScore)
## 
##  D'Agostino skewness test
## 
## data:  TeacherTrainingData$LearningScore
## skew = -0.029632, z = -0.108660, p-value = 0.9135
## alternative hypothesis: data have a skewness

and anscombe test

anscombe.test(TeacherTrainingData$LearningScore)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  TeacherTrainingData$LearningScore
## kurt = 2.1215, z = -2.2181, p-value = 0.02655
## alternative hypothesis: kurtosis is not equal to 3

The data doesn’t have kurtosis

Checking for normality

plot(density(TeacherTrainingData$LearningScore))

anscombe.test(log(TeacherTrainingData$LearningScore))
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  log(TeacherTrainingData$LearningScore)
## kurt = 3.05170, z = 0.46915, p-value = 0.639
## alternative hypothesis: kurtosis is not equal to 3
agostino.test(log(TeacherTrainingData$LearningScore+8))
## 
##  D'Agostino skewness test
## 
## data:  log(TeacherTrainingData$LearningScore + 8)
## skew = -0.55321, z = -1.92870, p-value = 0.05377
## alternative hypothesis: data have a skewness
anscombe.test(log(TeacherTrainingData$LearningScore+8))
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  log(TeacherTrainingData$LearningScore + 8)
## kurt = 2.65570, z = -0.37135, p-value = 0.7104
## alternative hypothesis: kurtosis is not equal to 3
TeacherTrainingData$ls2<-log(TeacherTrainingData$LearningScore+8)
bartlett.test(TeacherTrainingData$ls2,TeacherTrainingData$Treatment)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  TeacherTrainingData$ls2 and TeacherTrainingData$Treatment
## Bartlett's K-squared = 0.066648, df = 1, p-value = 0.7963
anscombe.test(log(TeacherTrainingData$ls2))
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  log(TeacherTrainingData$ls2)
## kurt = 3.00060, z = 0.37534, p-value = 0.7074
## alternative hypothesis: kurtosis is not equal to 3
summary(lm(ls2~Ability*Treatment,data = TeacherTrainingData))
## 
## Call:
## lm(formula = ls2 ~ Ability * Treatment, data = TeacherTrainingData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.80077 -0.16590  0.03439  0.19423  0.49159 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.269446   0.182666  17.898   <2e-16 ***
## Ability            0.018293   0.007386   2.477   0.0159 *  
## Treatment          0.618849   0.263946   2.345   0.0222 *  
## Ability:Treatment -0.019271   0.010470  -1.841   0.0703 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2733 on 64 degrees of freedom
## Multiple R-squared:  0.154,  Adjusted R-squared:  0.1144 
## F-statistic: 3.884 on 3 and 64 DF,  p-value: 0.01293
table(TeacherTrainingData$Treatment)
## 
##  0  1 
## 37 31
model<-aov(ls2~Ability+Treatment,data = TeacherTrainingData)
Anova(model,Type=2)
## Anova Table (Type II tests)
## 
## Response: ls2
##           Sum Sq Df F value  Pr(>F)  
## Ability   0.2064  1  2.6648 0.10743  
## Treatment 0.3718  1  4.8009 0.03203 *
## Residuals 5.0340 65                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model2<-aov(ls2~Treatment,data=TeacherTrainingData)
summary(model2)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## Treatment    1  0.411  0.4110   5.177 0.0261 *
## Residuals   66  5.240  0.0794                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model, model2)
## Analysis of Variance Table
## 
## Model 1: ls2 ~ Ability + Treatment
## Model 2: ls2 ~ Treatment
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     65 5.0340                           
## 2     66 5.2403 -1  -0.20638 2.6648 0.1074
qqnorm(model2$residuals)
qqline(model2$residuals)

shapiro.test(model$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model$residuals
## W = 0.96154, p-value = 0.03441
shapiro.test(model2$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model2$residuals
## W = 0.96551, p-value = 0.05671

Summary: Analysis was conducted to find correlation between ability scores and training methods and resulting teacher performance.The correlation (r = 0.21, t(66) = 1.7579) between ability and learning score was found to be statistically insginificant as p-value > 0.05. We find no outliers, no high leverage observations, not much multicollinearity. Also, the residuals are normal according to shapiro test. From the model equation we can clearly say that each SD increase from mean in ability leads to 0.76 increase in learning scores. Each SD increase in treatment leads to a 25.83 increase in learning scores. The interaction between ability and treatment indicates that the effect of ability decreases when the treatment increases which suggests that training method 1 performs best while controlling for ability.