Simple Linear Regression Assignment

1 - Calories_consumed -> Predict weight gained using calories consumed

Reading the data from the data file and saving it into a variable

Calories_consumed <- read.csv("C:/Users/Pawan Srivastav/Desktop/Data Science/Data Sets/Data Sets/Simple Linear Regression/calories_consumed.csv") 

Getting a summary of the imported data

summary(Calories_consumed)
##  Weight.gained..grams. Calories.Consumed
##  Min.   :  62.0        Min.   :1400     
##  1st Qu.: 114.5        1st Qu.:1728     
##  Median : 200.0        Median :2250     
##  Mean   : 357.7        Mean   :2341     
##  3rd Qu.: 537.5        3rd Qu.:2775     
##  Max.   :1100.0        Max.   :3900
# Variance and Standard deviation of Calories.Consumed column
var(Calories_consumed$Calories.Consumed)
## [1] 565668.7
sd(Calories_consumed$Calories.Consumed)
## [1] 752.1095
# Variance and Standard deviation of Weight.gained..grams. column
var(Calories_consumed$Weight.gained..grams.)
## [1] 111350.7
sd(Calories_consumed$Weight.gained..grams.)
## [1] 333.6925

Creating Linear Model for weight gain

WeightGainModel <- lm(Weight.gained..grams. ~ Calories.Consumed, data = Calories_consumed)
summary(WeightGainModel)
## 
## Call:
## lm(formula = Weight.gained..grams. ~ Calories.Consumed, data = Calories_consumed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -158.67 -107.56   36.70   81.68  165.53 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -625.75236  100.82293  -6.206 4.54e-05 ***
## Calories.Consumed    0.42016    0.04115  10.211 2.86e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 111.6 on 12 degrees of freedom
## Multiple R-squared:  0.8968, Adjusted R-squared:  0.8882 
## F-statistic: 104.3 on 1 and 12 DF,  p-value: 2.856e-07
plot(Calories_consumed)

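The p-value for Calories.Consumed (2.86e-07) is less than 0.05, so the X variable is statistically significant. The Multiple R-squared value is 0.8968, which means the model explains about 89.68% of the variance in weight gained. R-squared measures explained variance, not the share of predictions that are correct.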
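As a consistency check on the R-squared value: in simple linear regression the correlation between the two variables can be recovered from the slope and the two standard deviations (r = slope * sd(x) / sd(y)), and R-squared is the square of that correlation. The sketch below only reuses numbers printed in the output above; the variable names are illustrative.

```r
# Values copied from the output above (rounded as printed)
slope <- 0.42016      # coefficient of Calories.Consumed
sd_x  <- 752.1095     # sd(Calories_consumed$Calories.Consumed)
sd_y  <- 333.6925     # sd(Calories_consumed$Weight.gained..grams.)

# In simple linear regression: r = slope * sd(x) / sd(y)
r <- slope * sd_x / sd_y
r_squared <- r^2

round(r, 3)          # correlation between calories consumed and weight gained
round(r_squared, 4)  # should agree with the reported Multiple R-squared (0.8968)
```

Because the inputs are rounded, the result can differ from the printed summary in the last digit.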

2 - Delivery_time -> Predict delivery time using sorting time

Reading the data from the data file and saving it into a variable

delivery_time <- read.csv("C:/Users/Pawan Srivastav/Desktop/Data Science/Data Sets/Data Sets/Simple Linear Regression/delivery_time.csv") 
summary(delivery_time)
##  Delivery.Time    Sorting.Time  
##  Min.   : 8.00   Min.   : 2.00  
##  1st Qu.:13.50   1st Qu.: 4.00  
##  Median :17.83   Median : 6.00  
##  Mean   :16.79   Mean   : 6.19  
##  3rd Qu.:19.75   3rd Qu.: 8.00  
##  Max.   :29.00   Max.   :10.00
# Variance and Standard deviation of Delivery.Time column
var(delivery_time$Delivery.Time)
## [1] 25.75462
sd(delivery_time$Delivery.Time)
## [1] 5.074901
# Variance and Standard deviation of Sorting.Time column
var(delivery_time$Sorting.Time)
## [1] 6.461905
sd(delivery_time$Sorting.Time)
## [1] 2.542028

Creating Linear Model for delivery time

deliverTimeModel <- lm(Delivery.Time ~ Sorting.Time, data = delivery_time)
summary(deliverTimeModel)
## 
## Call:
## lm(formula = Delivery.Time ~ Sorting.Time, data = delivery_time)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1729 -2.0298 -0.0298  0.8741  6.6722 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.5827     1.7217   3.823  0.00115 ** 
## Sorting.Time   1.6490     0.2582   6.387 3.98e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.935 on 19 degrees of freedom
## Multiple R-squared:  0.6823, Adjusted R-squared:  0.6655 
## F-statistic:  40.8 on 1 and 19 DF,  p-value: 3.983e-06
plot(deliverTimeModel)

The p-value for Sorting.Time (3.98e-06) is less than 0.05, so the X variable is statistically significant. The Multiple R-squared value is 0.6823, which means the model explains about 68.23% of the variance in delivery time.
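The Adjusted R-squared in the summary can be reproduced from the Multiple R-squared, the number of observations, and the number of predictors. The sketch below uses only the printed values; the number of observations is inferred from the 19 residual degrees of freedom plus the 2 estimated coefficients.

```r
# Values copied from the summary output above
r_squared <- 0.6823   # Multiple R-squared (rounded as printed)
n <- 21               # observations: 19 residual df + 2 coefficients
p <- 1                # number of predictors (Sorting.Time)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

adj_r_squared  # close to the reported 0.6655 (inputs are rounded)
```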

Increasing the R-squared value

Using influenceIndexPlot (from the car package, which is loaded along with mvinfluence) on the linear model to find the points that are causing problems

library(mvinfluence)
## Loading required package: car
## Loading required package: carData
## Loading required package: heplots
influenceIndexPlot(deliverTimeModel)

deliverTimeModel <- lm(Delivery.Time ~ Sorting.Time, data = delivery_time[c(-5,-9,-21),])
summary(deliverTimeModel)
## 
## Call:
## lm(formula = Delivery.Time ~ Sorting.Time, data = delivery_time[c(-5, 
##     -9, -21), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3407 -1.5027  0.2275  0.9328  3.6815 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.0240     1.1751   5.126 0.000102 ***
## Sorting.Time   1.6741     0.1872   8.941 1.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.839 on 16 degrees of freedom
## Multiple R-squared:  0.8332, Adjusted R-squared:  0.8228 
## F-statistic: 79.94 on 1 and 16 DF,  p-value: 1.273e-07
plot(deliverTimeModel)

After removing the 3 points (rows 5, 9 and 21), the Multiple R-squared value increases to 0.8332. That means the refitted model explains about 83.32% of the variance in delivery time for the remaining 18 observations.
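Influential points can also be listed numerically instead of being read off a plot. The sketch below is an illustration only: it uses R's built-in `cars` dataset as a stand-in (not the delivery_time data), and the 4/n cutoff for Cook's distance is a common rule of thumb, not something taken from this assignment.

```r
# Stand-in data: built-in `cars` dataset (speed vs. stopping distance)
fit <- lm(dist ~ speed, data = cars)

# Cook's distance: how much the fit changes when each row is left out
cd <- cooks.distance(fit)

# Rule-of-thumb cutoff (an assumption): flag rows with distance above 4 / n
threshold <- 4 / nrow(cars)
flagged <- which(cd > threshold)
flagged

# Refit without the flagged rows, if any, and compare R-squared values
fit_trimmed <- if (length(flagged) > 0) {
  lm(dist ~ speed, data = cars[-flagged, ])
} else {
  fit
}
c(full = summary(fit)$r.squared, trimmed = summary(fit_trimmed)$r.squared)
```

A higher R-squared after dropping rows is expected almost by construction, so it is worth checking why each flagged row is unusual before excluding it.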

3 - Emp_data -> Build a prediction model for Churn_out_rate

Reading the data from the data file and saving it into a variable

Emp_data <- read.csv("C:/Users/Pawan Srivastav/Desktop/Data Science/Data Sets/Data Sets/Simple Linear Regression/emp_data.csv") 

Getting a summary of the imported data

summary(Emp_data)
##   Salary_hike   Churn_out_rate 
##  Min.   :1580   Min.   :60.00  
##  1st Qu.:1618   1st Qu.:65.75  
##  Median :1675   Median :71.00  
##  Mean   :1689   Mean   :72.90  
##  3rd Qu.:1724   3rd Qu.:78.75  
##  Max.   :1870   Max.   :92.00
# Variance and Standard deviation of Salary_hike column
var(Emp_data$Salary_hike)
## [1] 8481.822
sd(Emp_data$Salary_hike)
## [1] 92.09681
# Variance and Standard deviation of Churn_out_rate column
var(Emp_data$Churn_out_rate)
## [1] 105.2111
sd(Emp_data$Churn_out_rate)
## [1] 10.25725

Creating Linear Model for Churn_out_rate

Churn_out_rate_Model <- lm(Churn_out_rate ~ Salary_hike, data = Emp_data)
summary(Churn_out_rate_Model)
## 
## Call:
## lm(formula = Churn_out_rate ~ Salary_hike, data = Emp_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.804 -3.059 -1.819  2.430  8.072 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 244.36491   27.35194   8.934 1.96e-05 ***
## Salary_hike  -0.10154    0.01618  -6.277 0.000239 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.469 on 8 degrees of freedom
## Multiple R-squared:  0.8312, Adjusted R-squared:  0.8101 
## F-statistic:  39.4 on 1 and 8 DF,  p-value: 0.0002386
plot(Churn_out_rate_Model)

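The p-value for Salary_hike (0.000239) is less than 0.05, so the X variable is statistically significant. The Multiple R-squared value is 0.8312, which means the model explains about 83.12% of the variance in Churn_out_rate. The slope is negative, so a higher salary hike is associated with a lower churn-out rate.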
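The uncertainty in the slope can be expressed as a 95% confidence interval built from the estimate, its standard error, and the residual degrees of freedom. The sketch below reuses the printed values; on the fitted model itself, `confint(Churn_out_rate_Model)` should give the same interval up to rounding.

```r
# Values copied from the coefficient table above
estimate  <- -0.10154   # slope for Salary_hike
std_error <- 0.01618    # its standard error
df_resid  <- 8          # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error

round(ci, 3)  # roughly -0.139 to -0.064; the interval excludes zero
```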

4 - Salary_hike -> Build a prediction model for Salary_hike

Reading the data from the data file and saving it into a variable

Salary_hike <- read.csv("C:/Users/Pawan Srivastav/Desktop/Data Science/Data Sets/Data Sets/Simple Linear Regression/Salary_Data.csv") 

Getting a summary of the imported data

summary(Salary_hike)
##  YearsExperience      Salary      
##  Min.   : 1.100   Min.   : 37731  
##  1st Qu.: 3.200   1st Qu.: 56721  
##  Median : 4.700   Median : 65237  
##  Mean   : 5.313   Mean   : 76003  
##  3rd Qu.: 7.700   3rd Qu.:100545  
##  Max.   :10.500   Max.   :122391
# Variance and Standard deviation of YearsExperience column
var(Salary_hike$YearsExperience)
## [1] 8.053609
sd(Salary_hike$YearsExperience)
## [1] 2.837888
# Variance and Standard deviation of Salary column
var(Salary_hike$Salary)
## [1] 751550960
sd(Salary_hike$Salary)
## [1] 27414.43

Creating Linear Model for Salary

Salary_hike_Model <- lm(Salary ~ YearsExperience, data = Salary_hike)
summary(Salary_hike_Model)
## 
## Call:
## lm(formula = Salary ~ YearsExperience, data = Salary_hike)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7958.0 -4088.5  -459.9  3372.6 11448.0 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      25792.2     2273.1   11.35 5.51e-12 ***
## YearsExperience   9450.0      378.8   24.95  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5788 on 28 degrees of freedom
## Multiple R-squared:  0.957,  Adjusted R-squared:  0.9554 
## F-statistic: 622.5 on 1 and 28 DF,  p-value: < 2.2e-16
plot(Salary_hike_Model)

The p-value for YearsExperience (< 2e-16) is less than 0.05, so the X variable is statistically significant. The Multiple R-squared value is 0.957, which means the model explains about 95.7% of the variance in Salary.
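The fitted coefficients define the prediction equation Salary = 25792.2 + 9450.0 * YearsExperience. The sketch below applies it by hand with the printed values; `predict_salary` is a hypothetical helper name. On the fitted model, `predict(Salary_hike_Model, newdata = data.frame(YearsExperience = 5))` would do the same with unrounded coefficients.

```r
# Coefficients copied from the summary output above (rounded as printed)
intercept <- 25792.2
slope     <- 9450.0   # change in Salary per additional year of experience

# Hypothetical helper: apply the fitted line to new experience values
predict_salary <- function(years) {
  intercept + slope * years
}

predict_salary(5)  # 25792.2 + 9450.0 * 5 = 73042.2
```

The data cover 1.1 to 10.5 years of experience, so predictions outside that range are extrapolations.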