Simple Linear Regression

Exercise 1

Can we predict participants’ money-saving motivation based on their annual income?

library(readxl)
setwd("E:/mikhilesh/HU Sem VI ANLY 510 and 506/ANLY 510 Kao Principals and Applications/Lecture and other materials")
data <- read_xlsx("lecture 6 simplerelationships.xlsx")
names(data)
## [1] "Income" "PI"     "MS"     "Age"
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    99 obs. of  4 variables:
##  $ Income: num  81119 100291 74035 19289 103307 ...
##  $ PI    : num  0 0 -0.01 0 -0.01 0.01 -0.02 -0.02 -0.01 0.01 ...
##  $ MS    : num  8 9 8 10 3 5 9 8 10 7 ...
##  $ Age   : num  34.1 35 33.5 31 35 ...
summary(data)
##      Income             PI                  MS              Age       
##  Min.   : 12470   Min.   :-0.030000   Min.   : 0.000   Min.   :30.62  
##  1st Qu.: 50551   1st Qu.: 0.000000   1st Qu.: 3.000   1st Qu.:32.63  
##  Median : 95875   Median : 0.010000   Median : 5.000   Median :34.99  
##  Mean   : 91077   Mean   : 0.001616   Mean   : 5.212   Mean   :34.59  
##  3rd Qu.:124876   3rd Qu.: 0.010000   3rd Qu.: 8.000   3rd Qu.:36.38  
##  Max.   :162697   Max.   : 0.020000   Max.   :10.000   Max.   :38.30

Check Assumptions

#Type of variable - both variables are continuous, so we are good to continue

plot(density(data$Income))

plot(density(data$PI))

plot(density(data$MS))

#The density plots are not flat lines, i.e., each variable has non-zero variance

#multicollinearity - we have only one predictor here, so there is no multicollinearity to worry about. We also assume the predictor is not correlated with extraneous variables.
library(car)
## Loading required package: carData
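
For reference, if the model had more than one predictor, car's vif() function could quantify multicollinearity. A minimal sketch, using a hypothetical two-predictor model built from the same data (for illustration only):

#hypothetical two-predictor model, for illustration only - VIF values near 1 indicate little multicollinearity
vif(lm(MS ~ Income + Age, data = data))
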
#scatterplot - The variance around the regression line should be equal across values of x.
scatterplot(data$Income, data$PI) 

scatterplot(data$Income, data$MS)

scatterplot(data$Income, data$Age)

#Independence of Error - we assume the sample is random and there are no repeated measures
library(moments)
qqnorm(data$PI)

qqnorm(data$Income)
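
A reference line on the Q-Q plot, plus the numeric skewness and kurtosis from the moments package loaded above, can make these normality checks easier to read. A minimal sketch for PI:

qqnorm(data$PI)
qqline(data$PI) #reference line - points close to it suggest normality
skewness(data$PI) #values near 0 suggest symmetry
kurtosis(data$PI) #values near 3 suggest normal tails
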

Having checked these assumptions, we can now move on to fitting the regression model.

model <- lm(PI ~ Income, data = data) # lm(DV ~ IV, data)
summary(model)
## 
## Call:
## lm(formula = PI ~ Income, data = data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.026076 -0.004835  0.004298  0.006859  0.012624 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.378e-03  2.570e-03  -3.260  0.00154 ** 
## Income       1.097e-07  2.560e-08   4.286  4.3e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01075 on 97 degrees of freedom
## Multiple R-squared:  0.1592, Adjusted R-squared:  0.1505 
## F-statistic: 18.37 on 1 and 97 DF,  p-value: 4.296e-05
#F-statistic: 18.37 on 1 and 97 DF,  p-value: 4.296e-05 - the overall model is significant
#Next we want to make sure the parameters (intercept and slope) are significant too
# R2 - the proportion of variance in PI explained by Income
# F(1, 97) = 18.37, p < .001, R2 = 0.16
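
Confidence intervals for the coefficients give another view of the same estimates. A minimal sketch using base R's confint():

confint(model, level = 0.95) #95% CIs for the intercept and slope
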

Based on these results, we can write out the regression equation. The general form is Y = b0 + b1X. Substituting our estimated intercept (b0) and slope (b1), our model is: Y = (-0.0083) + (0.0000001097)X
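
As a quick sanity check of the equation, predict() can evaluate the fitted model at a hypothetical income (the $100,000 below is an arbitrary illustration):

predict(model, newdata = data.frame(Income = 100000)) #predicted PI at a hypothetical income
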

Summary Write Up

A simple regression model was conducted to predict participants’ money-saving motivation based on their annual income. All the regression assumptions were met, and no further adjustments were made. A significant regression equation was found (F(1, 97) = 18.37, p < .001), with an R2 of .16. Both the intercept (p = .002) and the predictor (p < .001) were statistically significant. The results suggest that income predicts savings: for each one-dollar increase in annual income, predicted savings (PI) increases by 0.0000001097.

Exercise 2

Can we use either height or weight to predict heart rate?

H <- c(175, 170, 180, 178, 168, 181, 190, 185, 177, 162)
W <- c(60, 70, 75, 80, 69, 78, 82, 84, 72, 53)
HR <- c(60, 70, 75, 73, 71, 73, 76, 80, 68, 64)

data5lec <- data.frame(H, W, HR)
cor(data5lec) #check the correlations first - a regression is only meaningful when the DV and IV are related
##            H         W        HR
## H  1.0000000 0.8441373 0.6702788
## W  0.8441373 1.0000000 0.8953390
## HR 0.6702788 0.8953390 1.0000000
# We have strong correlations among the variables, strongest between W and HR (r = .90) and between H and W (r = .84)
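
To attach a p-value to an individual correlation, base R's cor.test() can be used. A minimal sketch for W and HR:

cor.test(data5lec$W, data5lec$HR) #tests the null hypothesis that the correlation is zero
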

#To check significance, Hmisc::rcorr needs a data matrix; squaring the correlations gives R2 (shared variance)
datamatrix <- as.matrix(data5lec[, c("H", "W", "HR")])
#rcorr(datamatrix) #reports p-values for each correlation (Hmisc is loaded below)
cor(data5lec)^2
##            H         W        HR
## H  1.0000000 0.7125678 0.4492736
## W  0.7125678 1.0000000 0.8016320
## HR 0.4492736 0.8016320 1.0000000
cor(data5lec)^2*100 #shared variance expressed as percentages
##            H         W        HR
## H  100.00000  71.25678  44.92736
## W   71.25678 100.00000  80.16320
## HR  44.92736  80.16320 100.00000
# This confirms the strong relationships, with W and HR sharing the most variance (about 80%)

We know the variables are continuous, and we assume the sample was randomly selected.

library(car)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
plot(density(data5lec$H))

scatterplot(data5lec$H, data5lec$HR)

scatterplot(data5lec$W, data5lec$HR)

qqnorm(data5lec$H)

qqnorm(data5lec$W)

qqnorm(data5lec$HR)

#Let's first use Height to predict HeartRate
model1 <- lm(HR ~ H, data = data5lec)
summary(model1)
## 
## Call:
## lm(formula = HR ~ H, data = data5lec)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.2395  -1.0500   0.6372   2.3222   5.0071 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -12.9452    32.8921  -0.394   0.7042  
## H             0.4753     0.1861   2.555   0.0339 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.62 on 8 degrees of freedom
## Multiple R-squared:  0.4493, Adjusted R-squared:  0.3804 
## F-statistic: 6.526 on 1 and 8 DF,  p-value: 0.03393
#F-statistic: 6.526 on 1 and 8 DF,  p-value: 0.03393 - the overall model is significant and the slope for Height is significant, but the intercept is not, so this CANNOT be used as a good regression model

#Next, let's use Weight to predict HeartRate
model <- lm(HR ~ W, data = data5lec)
summary(model)
## 
## Call:
## lm(formula = HR ~ W, data = data5lec)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4266 -1.8478  0.0226  2.3587  3.3143 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 32.36134    6.85189   4.723 0.001496 ** 
## W            0.53442    0.09399   5.686 0.000462 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.773 on 8 degrees of freedom
## Multiple R-squared:  0.8016, Adjusted R-squared:  0.7768 
## F-statistic: 32.33 on 1 and 8 DF,  p-value: 0.0004619
#F-statistic: 32.33 on 1 and 8 DF,  p-value: 0.0004619 - overall significant model. The slope for Weight is significant and the intercept is also significant, so this CAN be used as a good regression model
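
One way to compare the two candidate models side by side is AIC (lower is better). A minimal sketch:

AIC(model1, model) #compares the Height-only and Weight-only models
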

# Our model: HR = 32.36134 + 0.53442 * W (Weight)
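
#For illustration, the fitted equation can be evaluated at a hypothetical weight (the 70 kg below is arbitrary):
predict(model, newdata = data.frame(W = 70)) #predicted HR at a hypothetical weight
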
#Finally, we have to check that the residuals are normally distributed

#Shapiro-Wilk normality test on the residuals of both models
shapiro.test(model$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model$residuals
## W = 0.94971, p-value = 0.665
shapiro.test(model1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model1$residuals
## W = 0.88265, p-value = 0.1399
# Both p-values > .05 - so we fail to reject the null hypothesis of normality - the residuals don't violate the assumption
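
As a final visual check of homoscedasticity, the residuals can be plotted against the fitted values. A minimal sketch for the Weight model:

plot(model$fitted.values, model$residuals,
     xlab = "Fitted HR", ylab = "Residuals") #look for a roughly even scatter around zero
abline(h = 0, lty = 2) #reference line at zero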