Analyze and summarize the Strength and Distance Data and Sugar Chewy Data sets in the regression review section folder.
Predictor variable: Weight lifted
Dependent variable: Distance thrown
library(readr)
## Warning: package 'readr' was built under R version 3.5.3
StrengthandDistanceData <- read_csv("C:/USC Marshall/3.academics/Harrisburg/Qtr 3/ANLY 510 Analytics II/Labs/Lab 4/StrengthandDistanceData.csv")
## Parsed with column specification:
## cols(
## weightlifted = col_double(),
## distancethrown = col_double()
## )
View(StrengthandDistanceData)
cor.test(StrengthandDistanceData$weightlifted,StrengthandDistanceData$distancethrown,method = "pearson", conf.level = 0.99)
##
## Pearson's product-moment correlation
##
## data: StrengthandDistanceData$weightlifted and StrengthandDistanceData$distancethrown
## t = 10.117, df = 26, p-value = 1.663e-10
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.7265302 0.9604491
## sample estimates:
## cor
## 0.8929919
library(car)
## Warning: package 'car' was built under R version 3.5.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.5.2
scatterplot(StrengthandDistanceData$weightlifted,StrengthandDistanceData$distancethrown)
The scatterplot shows a linearly increasing relationship between ‘weight lifted’ and ‘distance thrown’.
Summary: Results show a significant positive relationship (r = 0.89, t(26) = 10.12, p < 0.001) between weight lifted and distance thrown.
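As a quick arithmetic check on the output above, the t statistic reported by cor.test can be recovered from the correlation and its degrees of freedom using t = r * sqrt(df) / sqrt(1 - r^2). The sketch below is illustrative only: the variable names are made up here, and the inputs are copied from the printed output rather than recomputed from the data.

```r
# Values copied from the cor.test output above (not recomputed from the CSV).
r  <- 0.8929919   # sample correlation
df <- 26          # degrees of freedom (n - 2)

# t statistic for testing H0: true correlation = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

print(round(t_stat, 3))     # should agree with the reported t = 10.117
print(signif(p_value, 4))   # should be close to the reported 1.663e-10
```

The value 1.663e-10 in the output is the p-value, while the t statistic is 10.117; the two are easy to mix up when writing the result in t(df) form.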
Assumption 1 - Holds true, as our predictor variable (weight lifted) is continuous.
Assumption 2 - There should be variance in the predictor variable (weight lifted) and the dependent variable (distance thrown).
plot(density(StrengthandDistanceData$weightlifted))
plot(density(StrengthandDistanceData$distancethrown))
Assumption 4 - Predictors should be uncorrelated with external variables. This holds true since there are no external variables involved in this scenario.
Assumption 5 - Homoscedasticity: residual variances should be equal along the regression line. Heteroscedasticity could be an issue, so this is tested after the model is fitted.
Assumption 6 - Not applicable.
Assumption 8 - The scatterplot above supports a linear relationship between the predictor and the dependent variable.
SDD<-lm(distancethrown~weightlifted,StrengthandDistanceData)
summary(SDD)
##
## Call:
## lm(formula = distancethrown ~ weightlifted, data = StrengthandDistanceData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2475 -1.1798 0.3635 0.9516 2.3010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.959629 0.958835 6.215 1.42e-06 ***
## weightlifted 0.098344 0.009721 10.117 1.66e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.265 on 26 degrees of freedom
## Multiple R-squared: 0.7974, Adjusted R-squared: 0.7896
## F-statistic: 102.4 on 1 and 26 DF, p-value: 1.663e-10
Prediction model: Distance thrown = 5.96 + 0.098*weightlifted
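To make the prediction model concrete, here is a minimal sketch that applies the fitted equation to a couple of inputs. The helper name predict_distance and the example weights are hypothetical; the coefficients are copied from the summary(SDD) output above.

```r
# Coefficients copied from the summary(SDD) output above.
intercept <- 5.959629
slope     <- 0.098344

# Hypothetical helper: predicted distance thrown for a given weight lifted.
predict_distance <- function(weightlifted) {
  intercept + slope * weightlifted
}

# Illustrative inputs only; they are not taken from the data set, and
# predictions outside the observed range of weightlifted are extrapolations.
example_weights <- c(50, 100)
print(predict_distance(example_weights))

# Each additional unit of weight lifted adds `slope` units of distance.
print(predict_distance(101) - predict_distance(100))
```

With the fitted model object available, predict(SDD, newdata = data.frame(weightlifted = c(50, 100))) should give the same predictions without copying coefficients by hand.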
Next, we’ll verify the assumption of equal residual variances (homoscedasticity):
par(mfrow=c(2,2))
plot(SDD)
ncvTest(SDD)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 1.822118, Df = 1, p = 0.17706
There is not a significant violation of the equal variance assumption (Chisquare = 1.82, Df = 1, p = 0.177).
Next, check the normality of residuals as follows:
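The p-value printed by ncvTest is the upper tail of a chi-square distribution with 1 degree of freedom, evaluated at the score statistic. A small sketch, assuming only the statistic copied from the output above (the variable names are illustrative):

```r
# Score statistic copied from the ncvTest(SDD) output above.
chisq_stat <- 1.822118

# Upper-tail probability of a chi-square distribution with 1 df
p_value <- pchisq(chisq_stat, df = 1, lower.tail = FALSE)

print(round(p_value, 5))   # should agree with the reported p = 0.17706
```

Because this p-value is above 0.05, the test gives no evidence against constant residual variance for this model.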
qqnorm(SDD$residuals)
qqline(SDD$residuals)
The plot above shows that this assumption holds true, i.e. the residuals are approximately normal.
We will also confirm this using the Shapiro-Wilk test as follows:
shapiro.test(SDD$residuals)
##
## Shapiro-Wilk normality test
##
## data: SDD$residuals
## W = 0.95813, p-value = 0.3146
W is close to 1 and the test is not significant (p = 0.31). Therefore, there is no evidence that the residuals depart from normality.
Summary: We ran an analysis to find the correlation between the predictor variable (weight lifted) and the dependent variable (distance thrown). Results show a significant positive relationship (r = 0.89, t(26) = 10.12, p < 0.001) between weight lifted and distance thrown. We then performed a regression analysis. The assumption checks (equal variances and normality of residuals) showed no significant violations. The overall model was significant, F(1, 26) = 102.4, p < .001. The prediction model is: Distance thrown = 5.96 + 0.098*weightlifted. The high F statistic indicates that weight lifted is a strong predictor of distance thrown. The slope shows that each one-unit increase in weight lifted is associated with a 0.098-unit increase in predicted distance thrown (a change in units, not in percent). The model explains about 80% of the variance in distance thrown (R-squared = 0.797, adjusted R-squared = 0.790).
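For a regression with a single predictor, the correlation and regression outputs are tied together: R-squared is the square of Pearson's r, and the overall F statistic is the square of the slope's t statistic. Using the values printed above as a consistency check:

```latex
\[
R^2 = r^2 = (0.8929919)^2 \approx 0.7974,
\qquad
F = t^2 = (10.117)^2 \approx 102.4
\]
```

Both results match the Multiple R-squared and F-statistic lines in the summary(SDD) output.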
library(readxl)
## Warning: package 'readxl' was built under R version 3.5.3
SugarChewyData <- read_excel("C:/USC Marshall/3.academics/Harrisburg/Qtr 3/ANLY 510 Analytics II/Labs/Lab 4/SugarChewyData.xlsx")
View(SugarChewyData)
cor.test(SugarChewyData$sugar, SugarChewyData$chewiness, method = "pearson", conf.level = 0.99)
##
## Pearson's product-moment correlation
##
## data: SugarChewyData$sugar and SugarChewyData$chewiness
## t = -6.6025, df = 88, p-value = 2.951e-09
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## -0.7315074 -0.3624002
## sample estimates:
## cor
## -0.5755643
scatterplot(SugarChewyData$sugar, SugarChewyData$chewiness)
Summary: We found a moderate negative correlation between sugar and chewiness (r = -0.58, t(88) = -6.60, p < 0.001).
Assumption 1 - Holds true, as our predictor variable (sugar) is continuous.
Assumption 2 - There should be variance in the predictor variable (sugar) and the dependent variable (chewiness).
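The 99 percent confidence interval reported by cor.test is based on Fisher's z transformation of r. The sketch below reproduces it from the printed values; the variable names are illustrative, and n is inferred from the degrees of freedom (n = df + 2).

```r
# Values copied from the cor.test output above.
r          <- -0.5755643
df         <- 88
n          <- df + 2      # sample size implied by df = n - 2
conf_level <- 0.99

# Fisher z transform, its standard error, and the normal critical value
z    <- atanh(r)
se   <- 1 / sqrt(n - 3)
crit <- qnorm(1 - (1 - conf_level) / 2)

# Build the interval on the z scale, then transform back to the r scale
ci <- tanh(z + c(-1, 1) * crit * se)

print(round(ci, 4))   # should be close to the reported -0.7315 and -0.3624
```

The whole interval lies below zero, which is consistent with the significant negative correlation.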
plot(density(SugarChewyData$sugar))
plot(density(SugarChewyData$chewiness))
Assumption 4 - Predictors should be uncorrelated with external variables. This holds true since there are no external variables involved in this scenario.
Assumption 5 - Homoscedasticity (equal residual variances): we will check this later in the analysis.
Assumption 6 - Not applicable.
Assumption 8 - The scatterplot above supports a linear relationship between the predictor and the dependent variable. There are some outliers and potentially some leverage points.
Linearmod <- lm(chewiness~sugar, SugarChewyData)
summary(Linearmod)
##
## Call:
## lm(formula = chewiness ~ sugar, data = SugarChewyData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4557 -0.5604 0.1045 0.5249 1.9559
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.662878 0.756610 10.128 < 2e-16 ***
## sugar -0.022797 0.003453 -6.603 2.95e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9178 on 88 degrees of freedom
## Multiple R-squared: 0.3313, Adjusted R-squared: 0.3237
## F-statistic: 43.59 on 1 and 88 DF, p-value: 2.951e-09
Model Prediction equation: Chewiness = 7.66 - 0.023*Sugar
For each one-unit increase in sugar, predicted chewiness decreases by 0.023 units (a change in units, not in percent). The model explains 33% of the variance in chewiness.
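To show how precisely the slope is estimated, a 95 percent confidence interval can be built from the printed estimate and standard error. This is a sketch under the usual linear model assumptions, including equal residual variances, which this report checks further on. The variable names are illustrative and the inputs are copied from the summary(Linearmod) output.

```r
# Values copied from the summary(Linearmod) output above.
slope    <- -0.022797
se_slope <- 0.003453
df_resid <- 88

# 95% confidence interval: estimate +/- t critical value * standard error
t_crit   <- qt(0.975, df = df_resid)
slope_ci <- slope + c(-1, 1) * t_crit * se_slope

print(round(slope_ci, 4))
```

With the model object available, confint(Linearmod, "sugar", level = 0.95) should return essentially the same interval.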
Next, we’ll verify the assumption of equal residual variances (homoscedasticity).
Check Equal Variances:
par(mfrow=c(2,2))
plot(Linearmod)
The smoothed line in the Scale-Location plot is not quite flat. Also, the residuals don’t look evenly scattered; they form a horizontal pattern.
Therefore, this looks like a violation of the assumption, i.e. the residual variances are not equal.
ncvTest(Linearmod)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 5.403637, Df = 1, p = 0.020095
The ncvTest result confirms a significant violation of the equal variance assumption (Chisquare = 5.40, Df = 1, p = 0.020).
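One common response to a violation like this (not something the original analysis did) is to report heteroscedasticity-consistent standard errors alongside the classical ones. The sketch below computes the HC3 "sandwich" estimator in base R. It runs on synthetic data generated purely for illustration; the sample size, ranges, and coefficients are assumptions, not the SugarChewyData values.

```r
# Synthetic data, generated only to illustrate the technique; these are NOT
# the SugarChewyData values.
set.seed(1)
n <- 90
x <- runif(n, min = 150, max = 280)
# The error spread grows with x, which produces non-constant variance.
y <- 7.7 - 0.023 * x + rnorm(n, sd = 0.004 * x)
fit <- lm(y ~ x)

# HC3 covariance: (X'X)^-1 X' diag(e_i^2 / (1 - h_i)^2) X (X'X)^-1
X <- model.matrix(fit)
e <- residuals(fit)
h <- hatvalues(fit)
bread <- solve(crossprod(X))
meat  <- crossprod(X * (e / (1 - h)))   # each row of X scaled by e_i / (1 - h_i)
vcov_hc3 <- bread %*% meat %*% bread

# Compare the classical and robust standard errors side by side.
se_table <- cbind(classical = sqrt(diag(vcov(fit))),
                  hc3       = sqrt(diag(vcov_hc3)))
print(se_table)
```

Since the car package is already loaded in this report, hccm(Linearmod, type = "hc3") should give the corresponding covariance matrix for the actual model.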
Validating the Normality of Residuals:
qqnorm(Linearmod$residuals)
qqline(Linearmod$residuals)
The Q-Q plot above shows the residuals scattered closely around the line, so they appear approximately normal.
We can validate this using the Shapiro-Wilk test:
shapiro.test(Linearmod$residuals)
##
## Shapiro-Wilk normality test
##
## data: Linearmod$residuals
## W = 0.98668, p-value = 0.4935
W is close to 1 and p > 0.05, so there is no significant departure from normality.
Summary: Analysis was conducted to find the correlation between sugar content and the chewiness of different fruits. A moderate negative correlation (r = -0.58, t(88) = -6.60, p < 0.001) was found between sugar and chewiness. A regression analysis was then conducted and produced the following prediction model: Chewiness = 7.66 - 0.023*Sugar, F(1, 88) = 43.59, p < .001. For each one-unit increase in sugar, predicted chewiness decreases by 0.023 units. The model explains 33% of the variance in chewiness. The residuals were approximately normal, but the ncvTest indicated unequal residual variances (p = 0.020), so the standard errors and p-values of this model should be interpreted with caution.