Exercise 1
Can we predict participants’ money-saving motivation based on their annual income?
library(readxl)
## Warning: package 'readxl' was built under R version 3.6.3
setwd("E:/mikhilesh/HU Sem VI ANLY 510 and 506/ANLY 510 Kao Principals and Applications/Lecture and other materials")
data <- read_xlsx("lecture 6 simplerelationships.xlsx")
names(data)
## [1] "Income" "PI" "MS" "Age"
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 99 obs. of 4 variables:
## $ Income: num 81119 100291 74035 19289 103307 ...
## $ PI : num 0 0 -0.01 0 -0.01 0.01 -0.02 -0.02 -0.01 0.01 ...
## $ MS : num 8 9 8 10 3 5 9 8 10 7 ...
## $ Age : num 34.1 35 33.5 31 35 ...
summary(data)
## Income PI MS Age
## Min. : 12470 Min. :-0.030000 Min. : 0.000 Min. :30.62
## 1st Qu.: 50551 1st Qu.: 0.000000 1st Qu.: 3.000 1st Qu.:32.63
## Median : 95875 Median : 0.010000 Median : 5.000 Median :34.99
## Mean : 91077 Mean : 0.001616 Mean : 5.212 Mean :34.59
## 3rd Qu.:124876 3rd Qu.: 0.010000 3rd Qu.: 8.000 3rd Qu.:36.38
## Max. :162697 Max. : 0.020000 Max. :10.000 Max. :38.30
Check Assumptions
# Type of variable - all variables are continuous, so we can continue
plot(density(data$Income))
plot(density(data$PI))
plot(density(data$MS))
# Non-zero variance - each density curve is spread over a range of values rather than collapsing to a single value, so every variable has non-zero variance
# Multicollinearity - we have only one predictor here, so multicollinearity is not a concern. We also assume the predictor is not correlated with extraneous variables.
library(car)
## Warning: package 'car' was built under R version 3.6.2
## Loading required package: carData
# Homoscedasticity (scatterplot) - the variance around the regression line should be roughly equal across values of x
scatterplot(data$Income, data$PI)
scatterplot(data$Income, data$MS)
scatterplot(data$Income, data$Age)
#Independence of Error - we assume sample is random and no repeated measures
library(moments)
qqnorm(data$PI)
qqnorm(data$Income)
Having met all the assumptions, we can now fit and test the regression model.
model <- lm(PI ~ Income, data = data) # lm(DV ~ IV, data = ...)
summary(model)
##
## Call:
## lm(formula = PI ~ Income, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.026076 -0.004835 0.004298 0.006859 0.012624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.378e-03 2.570e-03 -3.260 0.00154 **
## Income 1.097e-07 2.560e-08 4.286 4.3e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01075 on 97 degrees of freedom
## Multiple R-squared: 0.1592, Adjusted R-squared: 0.1505
## F-statistic: 18.37 on 1 and 97 DF, p-value: 4.296e-05
# F-statistic: 18.37 on 1 and 97 DF, p-value: 4.296e-05 - the overall model is significant
# Next, check that the individual parameters (intercept and slope) are significant too; both are (p = .002 and p < .001)
# R2 - the proportion of variance in the DV that the model explains
# F(1, 97) = 18.37, p < .001, R2 = .16
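As a quick consistency check on the figures reported above, R2 in a simple regression can be recovered from the F statistic and its degrees of freedom as R2 = (F * df1) / (F * df1 + df2). The following is a minimal sketch that uses only the values printed in the summary output; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the summary(model) output above
f_stat <- 18.37
df1 <- 1    # numerator degrees of freedom (one predictor)
df2 <- 97   # residual degrees of freedom

# R2 implied by the F statistic and its degrees of freedom
r2_from_f <- (f_stat * df1) / (f_stat * df1 + df2)
round(r2_from_f, 4)

# Upper-tail p-value of the overall F test
p_from_f <- pf(f_stat, df1, df2, lower.tail = FALSE)
p_from_f
```

If the recovered R2 and p-value agree with the printed "Multiple R-squared" and "p-value" lines, the reported statistics are internally consistent.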
Based on these results, we can now write out the regression model. The general form is: Y = b0 + b1X, where b0 is the intercept and b1 is the slope. Substituting our estimates, with Y = PI and X = Income, gives: Y = (-0.0084) + (0.0000001097)X
A simple regression was conducted to predict participants’ money-saving motivation from their annual income. All the regression assumptions were met, and no further adjustments were made. A significant regression equation was found (F(1, 97) = 18.37, p < .001), with an R2 of .16. Both the intercept (p = .002) and the predictor (p < .001) were statistically significant. The results suggest that income is a significant predictor: for each one-dollar increase in income, the predicted value of the outcome (PI) increases by 0.0000001097.
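Because the slope per single dollar is very small, it can be easier to read on a larger income scale. The sketch below plugs the reported coefficients into the fitted equation; `predict_pi` is a hypothetical helper introduced here for illustration, and the income value used is the median reported by `summary(data)`.

```r
# Coefficients copied from the summary(model) output above
b0 <- -8.378e-03   # intercept
b1 <- 1.097e-07    # slope: change in PI per one dollar of income

# Hypothetical helper: predicted PI for a given income
predict_pi <- function(income) b0 + b1 * income

# The same slope expressed per $100,000 of income
slope_per_100k <- b1 * 1e5
slope_per_100k

# Predicted PI at the median income from summary(data)
pi_at_median <- predict_pi(95875)
pi_at_median
```

With the fitted `model` object available, `predict(model, newdata = data.frame(Income = 95875))` should give essentially the same value, up to rounding of the printed coefficients.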
Exercise 2
Can we use either height or weight to predict heart rate?
H <- c(175, 170, 180, 178, 168, 181, 190, 185, 177, 162)
W <- c(60, 70, 75, 80, 69, 78, 82, 84, 72, 53)
HR <- c(60, 70, 75, 73, 71, 73, 76, 80, 68, 64)
data5lec <- data.frame(H, W, HR)
cor(data5lec) # check correlations first: a regression is only worthwhile when the DV and IV are significantly related
## H W HR
## H 1.0000000 0.8441373 0.6702788
## W 0.8441373 1.0000000 0.8953390
## HR 0.6702788 0.8953390 1.0000000
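# We have strong correlations between W and HR (0.90) and between H and W (0.84); the correlation between H and HR is moderate (0.67)
# Squaring the correlations gives R2, the proportion of shared variance. To test significance, rcorr() from Hmisc can be run on a data matrix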
datamatrix <- as.matrix(data5lec[, c("H", "W", "HR")])
#rcorr(datamatrix)
cor(data5lec)^2
## H W HR
## H 1.0000000 0.7125678 0.4492736
## W 0.7125678 1.0000000 0.8016320
## HR 0.4492736 0.8016320 1.0000000
cor(data5lec)^2*100
## H W HR
## H 100.00000 71.25678 44.92736
## W 71.25678 100.00000 80.16320
## HR 44.92736 80.16320 100.00000
# W shares about 80% of its variance with HR, and H about 45%; H and W share about 71%
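Squared correlations describe shared variance but do not by themselves test significance. One way to obtain p-values without extra packages is base R's `cor.test()`. The sketch below applies it to the exercise data; the object names `ct_h` and `ct_w` are illustrative.

```r
# Data from this exercise
H  <- c(175, 170, 180, 178, 168, 181, 190, 185, 177, 162)
W  <- c(60, 70, 75, 80, 69, 78, 82, 84, 72, 53)
HR <- c(60, 70, 75, 73, 71, 73, 76, 80, 68, 64)

# Pearson correlation tests of each candidate predictor against heart rate
ct_h <- cor.test(H, HR)
ct_w <- cor.test(W, HR)

# Correlation estimate and p-value for each test
c(r = unname(ct_h$estimate), p = ct_h$p.value)
c(r = unname(ct_w$estimate), p = ct_w$p.value)
```

In a simple regression with one predictor, the p-value of the correlation test is the same as the p-value of the slope, so these results should line up with the regression summaries for the same pairs of variables.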
We know the variables are continuous and the sample was randomly selected.
library(car)
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 3.6.2
## Loading required package: lattice
## Loading required package: survival
## Warning: package 'survival' was built under R version 3.6.2
## Loading required package: Formula
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.6.2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
plot(density(data5lec$H))
scatterplot(data5lec$H, data5lec$HR)
scatterplot(data5lec$W, data5lec$HR)
qqnorm(data5lec$H)
qqnorm(data5lec$W)
qqnorm(data5lec$HR)
#Let's first use Height to predict HeartRate
model1 <- lm(HR ~ H, data = data5lec)
summary(model1)
##
## Call:
## lm(formula = HR ~ H, data = data5lec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.2395 -1.0500 0.6372 2.3222 5.0071
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.9452 32.8921 -0.394 0.7042
## H 0.4753 0.1861 2.555 0.0339 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.62 on 8 degrees of freedom
## Multiple R-squared: 0.4493, Adjusted R-squared: 0.3804
## F-statistic: 6.526 on 1 and 8 DF, p-value: 0.03393
# F-statistic: 6.526 on 1 and 8 DF, p-value: 0.03393 - the overall model and the Height slope are significant, but the intercept is not (p = .704), so by this criterion the Height model is NOT treated as a good regression model
# Next, let's use Weight to predict HeartRate
model <- lm(HR ~ W, data = data5lec)
summary(model)
##
## Call:
## lm(formula = HR ~ W, data = data5lec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4266 -1.8478 0.0226 2.3587 3.3143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.36134 6.85189 4.723 0.001496 **
## W 0.53442 0.09399 5.686 0.000462 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.773 on 8 degrees of freedom
## Multiple R-squared: 0.8016, Adjusted R-squared: 0.7768
## F-statistic: 32.33 on 1 and 8 DF, p-value: 0.0004619
# F-statistic: 32.33 on 1 and 8 DF, p-value: 0.0004619 - the overall model is significant. The Weight slope is significant and the intercept is also significant, so this model CAN be used as a good regression model
# Our model is: HR = 32.36134 + 0.53442 * W
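To see the fitted equation in use, the sketch below refits the Weight model on the exercise data and predicts heart rate for a new observation. The weight value of 70 and the names `model_w` and `new_person` are hypothetical choices made for illustration only.

```r
# Data from this exercise
W  <- c(60, 70, 75, 80, 69, 78, 82, 84, 72, 53)
HR <- c(60, 70, 75, 73, 71, 73, 76, 80, 68, 64)
data5lec <- data.frame(W, HR)

# Refit the Weight model
model_w <- lm(HR ~ W, data = data5lec)

# Hypothetical new observation with a weight of 70 (same units as W)
new_person <- data.frame(W = 70)

# Point prediction with a 95% prediction interval (columns: fit, lwr, upr)
pred <- predict(model_w, newdata = new_person, interval = "prediction")
pred
```

The `fit` column should match plugging W = 70 into the equation by hand, and the interval gives a sense of how uncertain a single prediction is with only 10 observations.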
#Finally we have to check if we have normally distributed residuals
# Shapiro-Wilk normality test on the residuals of model (Weight) and model1 (Height)
shapiro.test(model$residuals)
##
## Shapiro-Wilk normality test
##
## data: model$residuals
## W = 0.94971, p-value = 0.665
shapiro.test(model1$residuals)
##
## Shapiro-Wilk normality test
##
## data: model1$residuals
## W = 0.88265, p-value = 0.1399
# Both p-values > .05, so we fail to reject the null hypothesis of normality; the normality assumption is not violated
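To answer the exercise question side by side, the two models can be summarised in one small table. This is a minimal sketch that refits both models on the exercise data; the names `model_h`, `model_w`, `alpha`, and `comparison` are illustrative, and the 0.05 cutoff is the conventional threshold assumed here.

```r
# Data from this exercise
H  <- c(175, 170, 180, 178, 168, 181, 190, 185, 177, 162)
W  <- c(60, 70, 75, 80, 69, 78, 82, 84, 72, 53)
HR <- c(60, 70, 75, 73, 71, 73, 76, 80, 68, 64)

# Refit both simple regressions
model_h <- lm(HR ~ H)
model_w <- lm(HR ~ W)

alpha <- 0.05  # assumed significance threshold

# One row per model: variance explained and residual normality check
comparison <- data.frame(
  predictor = c("H", "W"),
  r_squared = c(summary(model_h)$r.squared, summary(model_w)$r.squared),
  shapiro_p = c(shapiro.test(residuals(model_h))$p.value,
                shapiro.test(residuals(model_w))$p.value)
)
comparison$residuals_normal <- comparison$shapiro_p > alpha
comparison
```

Reading across the rows shows how much more variance the Weight model explains, while the last column confirms whether each model's residuals pass the normality check.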