PERFORMING LINEAR AND MULTIPLE REGRESSION ON AUTO+MPG DATASET

Aim/Objective

The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.

About the dataset

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition. This dataset is a slightly modified version of the dataset provided in the StatLib library. In line with the use by Ross Quinlan (1993) in predicting the attribute “mpg”, 8 of the original instances were removed because they had unknown values for the “mpg” attribute. The original dataset is available in the file “auto-mpg.data-original”.

Attribute Information:

mpg: continuous
cylinders: multi-valued discrete
displacement: continuous
horsepower: continuous
weight: continuous
acceleration: continuous
model year: multi-valued discrete
origin: multi-valued discrete
car name: string (unique for each instance)
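
The multi-valued discrete attributes in the list above can be stored as factors, so that R treats them as categories rather than magnitudes. The following is a minimal sketch on a hypothetical three-row data frame; `cars_demo` and its values are made up for illustration and are not rows of the dataset.

```r
# Hypothetical miniature data frame with a few of the attribute types listed above
cars_demo <- data.frame(
  mpg       = c(18, 24, 31),   # continuous
  cylinders = c(8, 4, 4),      # multi-valued discrete
  origin    = c(1, 3, 2)       # multi-valued discrete
)

# Store the multi-valued discrete attributes as factors
cars_demo$cylinders <- factor(cars_demo$cylinders)
cars_demo$origin    <- factor(cars_demo$origin)

str(cars_demo)                 # shows which columns are numeric and which are factors
nlevels(cars_demo$origin)      # number of distinct categories
```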

Using Linear Regression

#Setting the working directory
setwd("C:/Users/kmoha/Downloads")

#Importing dataset
data = read.csv("auto-mpg.csv")

#View opens the data in a separate viewer window
View(data) 

#Removing unnecessary features or attributes which are not useful for the prediction
data = data[1:8]

#confirming the data type of the dataset 
class(data)

## [1] "data.frame"

#displaying dimension of data
dim(data)

## [1] 398   8

#Loading dplyr for glimpse(), which gives a compact, organised view of the structure of the dataset
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

glimpse(data)

## Rows: 398
## Columns: 8
## $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
## $ cylinders    <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
## $ horsepower   <chr> "130", "165", "150", "150", "140", "198", "220", "215", "…
## $ weight       <int> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
## $ model_year   <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
## $ origin       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, …

#Understanding the data with the help of summary
summary(data)

##       mpg          cylinders      displacement    horsepower       
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Length:398        
##  1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.2   Class :character  
##  Median :23.00   Median :4.000   Median :148.5   Mode  :character  
##  Mean   :23.51   Mean   :5.455   Mean   :193.4                     
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0                     
##  Max.   :46.60   Max.   :8.000   Max.   :455.0                     
##      weight      acceleration     model_year        origin     
##  Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000  
##  1st Qu.:2224   1st Qu.:13.82   1st Qu.:73.00   1st Qu.:1.000  
##  Median :2804   Median :15.50   Median :76.00   Median :1.000  
##  Mean   :2970   Mean   :15.57   Mean   :76.01   Mean   :1.573  
##  3rd Qu.:3608   3rd Qu.:17.18   3rd Qu.:79.00   3rd Qu.:2.000  
##  Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000

Data Visualization before cleaning the dataset

For data distributions, you may require more information than central tendency values (median, mean, mode). To analyze data variability, you need to know how dispersed the data are.
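
As a rough illustration of what "dispersion" means in practice, the sketch below computes a few common spread measures. It uses only the first 14 mpg values visible in the glimpse() output above, so the numbers describe that small sample and not the full dataset; `mpg_head` and `spread` are names chosen for this example.

```r
# First 14 mpg values as printed by glimpse() earlier in this report
mpg_head <- c(18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14)

spread <- c(
  range = diff(range(mpg_head)),  # maximum minus minimum
  iqr   = IQR(mpg_head),          # width of the middle 50% of the values
  var   = var(mpg_head),          # sample variance
  sd    = sd(mpg_head)            # sample standard deviation
)
round(spread, 3)
```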

#Visualizing initial data

# ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics and which graphical primitives to use, and it takes care of the details.
library(ggplot2)

#Box plot
#A box plot is a graph that illustrates the distribution of values in data in a standard way by presenting five summary values: the minimum, Q1 (first quartile), median, Q3 (third quartile), and maximum. These values also show which points are outliers and how extreme they are.

ggplot(data, aes(x=factor(cylinders), y=mpg)) + geom_boxplot(aes(fill = factor(cylinders)))+labs(title = "20MID0012")
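
The five summary values behind a box plot can also be computed directly. The sketch below uses base R's fivenum() on the first 14 mpg values shown in the glimpse() output, so it describes only that small sample. Note that fivenum() returns hinges, which approximate Q1 and Q3. The 1.5 x IQR fence is the usual box plot convention for flagging outliers.

```r
# First 14 mpg values as printed by glimpse() earlier in this report
mpg_head <- c(18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14)

five <- fivenum(mpg_head)
names(five) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
five

# Points further than 1.5 * (hinge spread) from the hinges are drawn as outliers
fence <- 1.5 * (five[["upper_hinge"]] - five[["lower_hinge"]])
outliers <- mpg_head[mpg_head < five[["lower_hinge"]] - fence |
                     mpg_head > five[["upper_hinge"]] + fence]
outliers   # empty for this small sample
```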

#Histogram
#A histogram is an approximate representation of the distribution of numerical data. In a histogram, each bar groups numbers into ranges. Taller bars show that more data falls in that range. A histogram displays the shape and spread of continuous sample data. Histograms roughly give us an idea about the probability distribution of a given variable by depicting the frequencies of observations occurring in certain ranges of values.

ggplot(data,aes(x=mpg)) + geom_histogram(bins=30) +facet_wrap(~cylinders)+labs(title = "20MID0012")
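
To see how a histogram groups numbers into ranges, the counts per bin can be obtained without drawing anything. This is a small sketch with base R's hist(), again on the first 14 mpg values from the glimpse() output; the bin width of 1 is an arbitrary choice for the example.

```r
# First 14 mpg values as printed by glimpse() earlier in this report
mpg_head <- c(18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14)

# Bins of width 1 from 13 to 19; plot = FALSE returns the bin counts only
h <- hist(mpg_head, breaks = seq(13, 19, by = 1), plot = FALSE)

# One row per bar: the range it covers and how many values fall in it
bins <- data.frame(lower = head(h$breaks, -1),
                   upper = h$breaks[-1],
                   count = h$counts)
bins
```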

#Scatterplot using ggplot2
#A scatterplot displays the values of two variables along two axes. It shows the relationship between them and can reveal a correlation. Here the relationship between the weight and acceleration of the cars is shown, with each point coloured by mpg.

ggplot(data, aes(x=weight, y=acceleration, col=mpg))+geom_point()+labs(title = "20MID0012")
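
A scatterplot suggests a correlation visually; cor() puts a number on it. The sketch below uses the `mtcars` data frame that ships with R purely as a stand-in, so its values say nothing about the auto-mpg data. With the cleaned dataset, the same call could be applied to any two numeric columns.

```r
# mtcars is bundled with R; it is used here only as a stand-in dataset
r_pearson  <- cor(mtcars$wt, mtcars$mpg)                       # linear association
r_spearman <- cor(mtcars$wt, mtcars$mpg, method = "spearman")  # rank-based association

# Values near -1 or 1 indicate a strong relationship, values near 0 a weak one
c(pearson = r_pearson, spearman = r_spearman)
```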

#Heat Map
#A heatmap depicts the relationship between two attributes of a data frame as a grid of colour-coded tiles or bins. Here the colour of each tile shows how many observations fall into that combination of mpg and cylinders.

ggplot(data, aes(mpg, cylinders)) + geom_bin2d(bins = 30)+scale_fill_distiller(palette = "Spectral",direction=1)+labs(title = "20MID0012")

#Bar plot
#A bar graph (or bar chart) displays data using bars of different heights. Bar graphs are useful for comparing values across categories. Here the number of cars per cylinder count is shown for each model year.

ggplot(data, aes(x = cylinders, fill = factor(model_year))) +geom_bar()+facet_wrap(~model_year)+labs(title = "20MID0012")

#Line plot
#A line chart or line graph displays the evolution of one or several numeric variables. Data points are usually connected by straight line segments.

ggplot(data=data, aes(x=cylinders, y=model_year, group=1)) + geom_line(color="blue")+geom_point()+labs(title = "20MID0012")

Cleaning the data

#Converting the required chr and int attributes to numeric; horsepower entries that are not valid numbers become NA, which triggers the warning below
data$horsepower = as.numeric(data$horsepower)

## Warning: NAs introduced by coercion

data$cylinders = as.numeric(data$cylinders)
data$weight = as.numeric(data$weight)
data$model_year = as.numeric(data$model_year)

#Replacing missing values with mean
summary(data$horsepower)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    46.0    75.0    93.5   104.5   126.0   230.0       6

data$horsepower[is.na(data$horsepower)] = mean(data$horsepower,na.rm = T)
data$horsepower

##   [1] 130.0000 165.0000 150.0000 150.0000 140.0000 198.0000 220.0000 215.0000
##   [9] 225.0000 190.0000 170.0000 160.0000 150.0000 225.0000  95.0000  95.0000
##  [17]  97.0000  85.0000  88.0000  46.0000  87.0000  90.0000  95.0000 113.0000
##  [25]  90.0000 215.0000 200.0000 210.0000 193.0000  88.0000  90.0000  95.0000
##  [33] 104.4694 100.0000 105.0000 100.0000  88.0000 100.0000 165.0000 175.0000
##  [41] 153.0000 150.0000 180.0000 170.0000 175.0000 110.0000  72.0000 100.0000
##  [49]  88.0000  86.0000  90.0000  70.0000  76.0000  65.0000  69.0000  60.0000
##  [57]  70.0000  95.0000  80.0000  54.0000  90.0000  86.0000 165.0000 175.0000
##  [65] 150.0000 153.0000 150.0000 208.0000 155.0000 160.0000 190.0000  97.0000
##  [73] 150.0000 130.0000 140.0000 150.0000 112.0000  76.0000  87.0000  69.0000
##  [81]  86.0000  92.0000  97.0000  80.0000  88.0000 175.0000 150.0000 145.0000
##  [89] 137.0000 150.0000 198.0000 150.0000 158.0000 150.0000 215.0000 225.0000
##  [97] 175.0000 105.0000 100.0000 100.0000  88.0000  95.0000  46.0000 150.0000
## [105] 167.0000 170.0000 180.0000 100.0000  88.0000  72.0000  94.0000  90.0000
## [113]  85.0000 107.0000  90.0000 145.0000 230.0000  49.0000  75.0000  91.0000
## [121] 112.0000 150.0000 110.0000 122.0000 180.0000  95.0000 104.4694 100.0000
## [129] 100.0000  67.0000  80.0000  65.0000  75.0000 100.0000 110.0000 105.0000
## [137] 140.0000 150.0000 150.0000 140.0000 150.0000  83.0000  67.0000  78.0000
## [145]  52.0000  61.0000  75.0000  75.0000  75.0000  97.0000  93.0000  67.0000
## [153]  95.0000 105.0000  72.0000  72.0000 170.0000 145.0000 150.0000 148.0000
## [161] 110.0000 105.0000 110.0000  95.0000 110.0000 110.0000 129.0000  75.0000
## [169]  83.0000 100.0000  78.0000  96.0000  71.0000  97.0000  97.0000  70.0000
## [177]  90.0000  95.0000  88.0000  98.0000 115.0000  53.0000  86.0000  81.0000
## [185]  92.0000  79.0000  83.0000 140.0000 150.0000 120.0000 152.0000 100.0000
## [193] 105.0000  81.0000  90.0000  52.0000  60.0000  70.0000  53.0000 100.0000
## [201]  78.0000 110.0000  95.0000  71.0000  70.0000  75.0000  72.0000 102.0000
## [209] 150.0000  88.0000 108.0000 120.0000 180.0000 145.0000 130.0000 150.0000
## [217]  68.0000  80.0000  58.0000  96.0000  70.0000 145.0000 110.0000 145.0000
## [225] 130.0000 110.0000 105.0000 100.0000  98.0000 180.0000 170.0000 190.0000
## [233] 149.0000  78.0000  88.0000  75.0000  89.0000  63.0000  83.0000  67.0000
## [241]  78.0000  97.0000 110.0000 110.0000  48.0000  66.0000  52.0000  70.0000
## [249]  60.0000 110.0000 140.0000 139.0000 105.0000  95.0000  85.0000  88.0000
## [257] 100.0000  90.0000 105.0000  85.0000 110.0000 120.0000 145.0000 165.0000
## [265] 139.0000 140.0000  68.0000  95.0000  97.0000  75.0000  95.0000 105.0000
## [273]  85.0000  97.0000 103.0000 125.0000 115.0000 133.0000  71.0000  68.0000
## [281] 115.0000  85.0000  88.0000  90.0000 110.0000 130.0000 129.0000 138.0000
## [289] 135.0000 155.0000 142.0000 125.0000 150.0000  71.0000  65.0000  80.0000
## [297]  80.0000  77.0000 125.0000  71.0000  90.0000  70.0000  70.0000  65.0000
## [305]  69.0000  90.0000 115.0000 115.0000  90.0000  76.0000  60.0000  70.0000
## [313]  65.0000  90.0000  88.0000  90.0000  90.0000  78.0000  90.0000  75.0000
## [321]  92.0000  75.0000  65.0000 105.0000  65.0000  48.0000  48.0000  67.0000
## [329]  67.0000  67.0000 104.4694  67.0000  62.0000 132.0000 100.0000  88.0000
## [337] 104.4694  72.0000  84.0000  84.0000  92.0000 110.0000  84.0000  58.0000
## [345]  64.0000  60.0000  67.0000  65.0000  62.0000  68.0000  63.0000  65.0000
## [353]  65.0000  74.0000 104.4694  75.0000  75.0000 100.0000  74.0000  80.0000
## [361]  76.0000 116.0000 120.0000 110.0000 105.0000  88.0000  85.0000  88.0000
## [369]  88.0000  88.0000  85.0000  84.0000  90.0000  92.0000 104.4694  74.0000
## [377]  68.0000  68.0000  63.0000  70.0000  88.0000  75.0000  70.0000  67.0000
## [385]  67.0000  67.0000 110.0000  85.0000  92.0000 112.0000  96.0000  84.0000
## [393]  90.0000  86.0000  52.0000  84.0000  79.0000  82.0000
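
The cleaning step above can be illustrated on a tiny example. The sketch below assumes that missing entries are stored as a non-numeric placeholder such as "?"; that placeholder and the vector `hp_raw` are assumptions made for illustration, not values read from the file.

```r
# Hypothetical raw column: numbers stored as text, with "?" marking a missing value
hp_raw <- c("130", "165", "?", "150")

# as.numeric() turns anything that is not a number into NA (with a warning)
hp <- suppressWarnings(as.numeric(hp_raw))
n_missing <- sum(is.na(hp))

# Mean imputation: replace each NA with the mean of the observed values
hp[is.na(hp)] <- mean(hp, na.rm = TRUE)

n_missing
hp
```

Mean imputation keeps every row but slightly reduces the variance of the column; with only a handful of missing values, as here, the effect is small.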

Splitting the data into training and test sets

#Loading the caTools library (note that the split below uses base R's sample(), so caTools is not strictly required)
library(caTools)

# set.seed(value) fixes the starting point of the random number generator, so the same random rows are selected for the training and test sets on every run
set.seed(18) 

#Splitting the data into training and testing sets in a 70:30 ratio (279 of the 398 rows go to training)
split= sample(398,279) 

#Train set
# storing the training set values
training_set = data[split,] 
# Checking the dimensions of the training set to confirm that it holds about 70% of the rows
dim(training_set)

## [1] 279   8

# displaying training set
View(training_set) 

#Test set
# storing the testing set values
test_set = data[-split,]
# Checking the dimensions of the test set to confirm that it holds the remaining rows (about 30%)
dim(test_set)

## [1] 119   8

# displaying test set
View(test_set)
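
The split logic can be checked independently of the data. The sketch below derives the training size from the 70:30 ratio instead of hard-coding it, and confirms that no row index lands in both sets. The names `n_train`, `train_idx` and `test_idx` are chosen for this example.

```r
n <- 398                       # number of rows reported by dim(data)
n_train <- round(0.7 * n)      # 70% of the rows, rounded to a whole number

set.seed(18)                   # same seed as used in this report
train_idx <- sample(n, n_train)              # row numbers for the training set
test_idx  <- setdiff(seq_len(n), train_idx)  # every remaining row number

c(train = length(train_idx), test = length(test_idx))
length(intersect(train_idx, test_idx))       # 0 means the sets do not overlap
```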

Fitting a linear regression model

#Fitting a linear model
regressor = lm(formula=mpg~weight,data = training_set)

# summary of the created model
summary(regressor)

## 
## Call:
## lm(formula = mpg ~ weight, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.6283 -2.7341 -0.3075  2.0045 14.9658 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 46.5682157  0.8980358   51.86   <2e-16 ***
## weight      -0.0077662  0.0002889  -26.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.21 on 277 degrees of freedom
## Multiple R-squared:  0.7229, Adjusted R-squared:  0.7219 
## F-statistic: 722.7 on 1 and 277 DF,  p-value: < 2.2e-16
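
The fitted line can be written out by hand from the coefficients in the summary above: predicted mpg = 46.5682157 - 0.0077662 x weight. The sketch below wraps this in a hypothetical helper, `predict_mpg`, to show how the prediction falls as weight rises. The example weights are arbitrary.

```r
# Coefficients copied from the summary() output above
intercept <- 46.5682157
slope     <- -0.0077662     # change in predicted mpg per one unit of weight

# Hypothetical helper that applies the fitted line to any weight
predict_mpg <- function(weight) intercept + slope * weight

predict_mpg(c(2000, 3000, 4000))   # heavier cars get lower predicted mpg
```

In practice predict() does the same calculation with the unrounded coefficients, so its results can differ from this hand calculation in the last decimal places.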

# diagnostic plots of the fitted model (residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage)
plot(regressor)

Prediction

#Prediction
predi = predict(object = regressor, newdata=test_set)
# displaying prediction
predi

##         3         5        14        16        19        20        25        31 
## 19.883634 19.782673 22.601795 24.566638 30.026260 32.317282 26.003381 28.985592 
##        34        35        44        45        46        53        59        64 
## 26.112107 19.860335  9.709942  6.650069 23.564801 30.531061 30.057325 12.513532 
##        65        66        72        80        86        89        90        96 
## 14.455076 14.501673 28.473025 29.568056 14.726892 15.177331 17.235367  8.117876 
##        98       101       105       108       111       112       114       115 
## 22.329979 23.106597  8.467354 24.908350 28.092482 30.072857 27.370228 28.977826 
##       116       120       121       122       126       128       131       134 
## 14.866684 26.515948 24.294822 20.170982 22.477536 24.038538 27.533317 17.204303 
##       147       149       150       163       165       168       171       178 
## 30.065091 29.125383 27.238203 17.600378 22.966806 29.707847 26.438286 25.646136 
##       186       188       190       192       200       201       206       208 
## 29.055488 13.833782 15.798625 21.460167 18.213906 18.811901 29.832106 22.104760 
##       210       211       213       215       217       218       232       233 
## 21.172819 23.813319 12.552363 16.513113 30.686385 29.832106 12.979503 12.901841 
##       238       241       243       248       250       254       255       264 
## 30.639788 29.560289 26.376157 30.492231 20.435032 22.065929 23.541503 19.813738 
##       265       267       270       272       276       284       285       290 
## 21.677620 29.832106 29.249642 25.250061 22.182422 21.211650 20.473863 12.707687 
##       293       295       296       298       299       300       303       305 
## 15.969481 31.230017 31.695988 19.153613 16.280128 21.794113 29.870936 30.026260 
##       307       308       309       312       313       314       317       319 
## 26.414988 25.599539 26.717869 30.103922 30.888305 25.770395 20.310773 25.514111 
##       323       331       333       339       343       355       357       358 
## 30.181583 32.317282 32.239620 27.230436 28.045885 28.550686 28.317701 26.259664 
##       359       361       362       363       367       370       375       383 
## 26.104341 22.027098 24.046304 23.813319 19.658414 27.968223 22.997870 29.133150 
##       384       385       388       390       394       396       398 
## 31.307679 31.307679 23.153194 24.551106 24.900583 28.744841 25.444216

#scatter plot of actual vs predicted mpg on the test set
plot(test_set$mpg,predi)

Performance Evaluation

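#Calculating RMSE
# Note: the expression below squares the sum of the errors, not each error, so it returns the absolute mean error (bias) rather than the RMSE
val=sqrt(sum(predi-test_set$mpg)^2)/length(test_set$mpg)
val

## [1] 0.05083303

# RMSE proper: square each error, average the squares, then take the square root
rmse=sqrt(mean((predi-test_set$mpg)^2))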

#R-squared value of the fitted model (the adjusted value is available as summary(regressor)$adj.r.squared)
R2=summary(regressor)$r.squared
R2

## [1] 0.7229297
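
RMSE is the square root of the mean of the squared errors. Squaring the sum of the errors instead lets positive and negative errors cancel, which yields the mean error (bias) and not the RMSE. The sketch below shows the difference on two toy values. The helper functions and the numbers are illustrative only.

```r
# Error metrics as small helper functions
rmse       <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae        <- function(actual, predicted) mean(abs(actual - predicted))
mean_error <- function(actual, predicted) mean(predicted - actual)

# Toy values: one prediction is 2 too high, the other 2 too low
actual    <- c(20, 30)
predicted <- c(22, 28)

rmse(actual, predicted)        # 2: typical size of an error
mae(actual, predicted)         # 2: average absolute error
mean_error(actual, predicted)  # 0: the two errors cancel each other out
```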

Using Multiple Regression

#Setting the working directory
setwd("C:/Users/kmoha/Downloads")

#Importing dataset
data = read.csv("auto-mpg.csv")

#View opens the data in a separate viewer window
View(data) 

#Removing unnecessary features or attributes which are not useful for the prediction
data = data[1:8]

#confirming the data type of the dataset 
class(data)

## [1] "data.frame"

#displaying dimension of data
dim(data)

## [1] 398   8

#Loading dplyr for glimpse(), which gives a compact, organised view of the structure of the dataset
library(dplyr)
glimpse(data)

## Rows: 398
## Columns: 8
## $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
## $ cylinders    <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
## $ horsepower   <chr> "130", "165", "150", "150", "140", "198", "220", "215", "…
## $ weight       <int> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
## $ model_year   <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
## $ origin       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, …

#Understanding the data with the help of summary
summary(data)

##       mpg          cylinders      displacement    horsepower       
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Length:398        
##  1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.2   Class :character  
##  Median :23.00   Median :4.000   Median :148.5   Mode  :character  
##  Mean   :23.51   Mean   :5.455   Mean   :193.4                     
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0                     
##  Max.   :46.60   Max.   :8.000   Max.   :455.0                     
##      weight      acceleration     model_year        origin     
##  Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000  
##  1st Qu.:2224   1st Qu.:13.82   1st Qu.:73.00   1st Qu.:1.000  
##  Median :2804   Median :15.50   Median :76.00   Median :1.000  
##  Mean   :2970   Mean   :15.57   Mean   :76.01   Mean   :1.573  
##  3rd Qu.:3608   3rd Qu.:17.18   3rd Qu.:79.00   3rd Qu.:2.000  
##  Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000

Cleaning the data

#Converting the required chr and int attributes to numeric; horsepower entries that are not valid numbers become NA, which triggers the warning below
data$horsepower = as.numeric(data$horsepower)

## Warning: NAs introduced by coercion

data$cylinders = as.numeric(data$cylinders)
data$weight = as.numeric(data$weight)
data$model_year = as.numeric(data$model_year)

#Replacing missing values with mean
summary(data$horsepower)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    46.0    75.0    93.5   104.5   126.0   230.0       6

data$horsepower[is.na(data$horsepower)] = mean(data$horsepower,na.rm = T)
data$horsepower

##   [1] 130.0000 165.0000 150.0000 150.0000 140.0000 198.0000 220.0000 215.0000
##   [9] 225.0000 190.0000 170.0000 160.0000 150.0000 225.0000  95.0000  95.0000
##  [17]  97.0000  85.0000  88.0000  46.0000  87.0000  90.0000  95.0000 113.0000
##  [25]  90.0000 215.0000 200.0000 210.0000 193.0000  88.0000  90.0000  95.0000
##  [33] 104.4694 100.0000 105.0000 100.0000  88.0000 100.0000 165.0000 175.0000
##  [41] 153.0000 150.0000 180.0000 170.0000 175.0000 110.0000  72.0000 100.0000
##  [49]  88.0000  86.0000  90.0000  70.0000  76.0000  65.0000  69.0000  60.0000
##  [57]  70.0000  95.0000  80.0000  54.0000  90.0000  86.0000 165.0000 175.0000
##  [65] 150.0000 153.0000 150.0000 208.0000 155.0000 160.0000 190.0000  97.0000
##  [73] 150.0000 130.0000 140.0000 150.0000 112.0000  76.0000  87.0000  69.0000
##  [81]  86.0000  92.0000  97.0000  80.0000  88.0000 175.0000 150.0000 145.0000
##  [89] 137.0000 150.0000 198.0000 150.0000 158.0000 150.0000 215.0000 225.0000
##  [97] 175.0000 105.0000 100.0000 100.0000  88.0000  95.0000  46.0000 150.0000
## [105] 167.0000 170.0000 180.0000 100.0000  88.0000  72.0000  94.0000  90.0000
## [113]  85.0000 107.0000  90.0000 145.0000 230.0000  49.0000  75.0000  91.0000
## [121] 112.0000 150.0000 110.0000 122.0000 180.0000  95.0000 104.4694 100.0000
## [129] 100.0000  67.0000  80.0000  65.0000  75.0000 100.0000 110.0000 105.0000
## [137] 140.0000 150.0000 150.0000 140.0000 150.0000  83.0000  67.0000  78.0000
## [145]  52.0000  61.0000  75.0000  75.0000  75.0000  97.0000  93.0000  67.0000
## [153]  95.0000 105.0000  72.0000  72.0000 170.0000 145.0000 150.0000 148.0000
## [161] 110.0000 105.0000 110.0000  95.0000 110.0000 110.0000 129.0000  75.0000
## [169]  83.0000 100.0000  78.0000  96.0000  71.0000  97.0000  97.0000  70.0000
## [177]  90.0000  95.0000  88.0000  98.0000 115.0000  53.0000  86.0000  81.0000
## [185]  92.0000  79.0000  83.0000 140.0000 150.0000 120.0000 152.0000 100.0000
## [193] 105.0000  81.0000  90.0000  52.0000  60.0000  70.0000  53.0000 100.0000
## [201]  78.0000 110.0000  95.0000  71.0000  70.0000  75.0000  72.0000 102.0000
## [209] 150.0000  88.0000 108.0000 120.0000 180.0000 145.0000 130.0000 150.0000
## [217]  68.0000  80.0000  58.0000  96.0000  70.0000 145.0000 110.0000 145.0000
## [225] 130.0000 110.0000 105.0000 100.0000  98.0000 180.0000 170.0000 190.0000
## [233] 149.0000  78.0000  88.0000  75.0000  89.0000  63.0000  83.0000  67.0000
## [241]  78.0000  97.0000 110.0000 110.0000  48.0000  66.0000  52.0000  70.0000
## [249]  60.0000 110.0000 140.0000 139.0000 105.0000  95.0000  85.0000  88.0000
## [257] 100.0000  90.0000 105.0000  85.0000 110.0000 120.0000 145.0000 165.0000
## [265] 139.0000 140.0000  68.0000  95.0000  97.0000  75.0000  95.0000 105.0000
## [273]  85.0000  97.0000 103.0000 125.0000 115.0000 133.0000  71.0000  68.0000
## [281] 115.0000  85.0000  88.0000  90.0000 110.0000 130.0000 129.0000 138.0000
## [289] 135.0000 155.0000 142.0000 125.0000 150.0000  71.0000  65.0000  80.0000
## [297]  80.0000  77.0000 125.0000  71.0000  90.0000  70.0000  70.0000  65.0000
## [305]  69.0000  90.0000 115.0000 115.0000  90.0000  76.0000  60.0000  70.0000
## [313]  65.0000  90.0000  88.0000  90.0000  90.0000  78.0000  90.0000  75.0000
## [321]  92.0000  75.0000  65.0000 105.0000  65.0000  48.0000  48.0000  67.0000
## [329]  67.0000  67.0000 104.4694  67.0000  62.0000 132.0000 100.0000  88.0000
## [337] 104.4694  72.0000  84.0000  84.0000  92.0000 110.0000  84.0000  58.0000
## [345]  64.0000  60.0000  67.0000  65.0000  62.0000  68.0000  63.0000  65.0000
## [353]  65.0000  74.0000 104.4694  75.0000  75.0000 100.0000  74.0000  80.0000
## [361]  76.0000 116.0000 120.0000 110.0000 105.0000  88.0000  85.0000  88.0000
## [369]  88.0000  88.0000  85.0000  84.0000  90.0000  92.0000 104.4694  74.0000
## [377]  68.0000  68.0000  63.0000  70.0000  88.0000  75.0000  70.0000  67.0000
## [385]  67.0000  67.0000 110.0000  85.0000  92.0000 112.0000  96.0000  84.0000
## [393]  90.0000  86.0000  52.0000  84.0000  79.0000  82.0000

Splitting the data into training and test sets

#Loading the caTools library (note that the split below uses base R's sample(), so caTools is not strictly required)
library(caTools)

# set.seed(value) fixes the starting point of the random number generator, so the same random rows are selected for the training and test sets on every run
set.seed(18) 

#Splitting the data into training and testing sets in a 70:30 ratio (279 of the 398 rows go to training)
split= sample(398,279) 

#Train set
# storing the training set values
training_set = data[split,] 
# Checking the dimensions of the training set to confirm that it holds about 70% of the rows
dim(training_set)

## [1] 279   8

# displaying training set
View(training_set) 

#Test set
# storing the testing set values
test_set = data[-split,]
# Checking the dimensions of the test set to confirm that it holds the remaining rows (about 30%)
dim(test_set)

## [1] 119   8

# displaying test set
View(test_set)

Fitting a multiple regression model

#Fitting a multiple linear model with all remaining attributes as predictors
regressor = lm(formula=mpg~.,data = training_set)

# summary of the created model
summary(regressor)

## 
## Call:
## lm(formula = mpg ~ ., data = training_set)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.681 -1.956 -0.065  1.655 11.805 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.874e+01  5.518e+00  -3.396 0.000787 ***
## cylinders    -3.153e-01  3.797e-01  -0.830 0.407110    
## displacement  1.694e-02  8.894e-03   1.905 0.057887 .  
## horsepower   -2.416e-03  1.560e-02  -0.155 0.877023    
## weight       -6.866e-03  7.937e-04  -8.651 4.60e-16 ***
## acceleration  1.466e-01  1.119e-01   1.309 0.191544    
## model_year    7.447e-01  5.989e-02  12.435  < 2e-16 ***
## origin        1.584e+00  3.295e-01   4.807 2.54e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.246 on 271 degrees of freedom
## Multiple R-squared:  0.8388, Adjusted R-squared:  0.8346 
## F-statistic: 201.4 on 7 and 271 DF,  p-value: < 2.2e-16
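
The coefficient table printed by summary() can also be pulled out as a data frame, which makes it easy to list only the predictors whose p-value falls below a chosen threshold. The sketch below uses the `mtcars` data bundled with R as a stand-in model, so its coefficients are unrelated to the fit above. The 0.05 cut-off and the column names are choices made for this example.

```r
# Stand-in model on mtcars (bundled with R), not the auto-mpg fit above
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# coef(summary(fit)) is the matrix shown under "Coefficients:" in the printout
coefs <- as.data.frame(coef(summary(fit)))
names(coefs) <- c("estimate", "std_error", "t_value", "p_value")

# Keep only the rows that are significant at the 5% level
coefs[coefs$p_value < 0.05, ]
```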

# diagnostic plots of the fitted model (residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage)
plot(regressor)

Prediction

#Prediction
predi = predict(object = regressor, newdata=test_set)
# displaying prediction
predi

##         3         5        14        16        19        20        25        31 
## 15.497748 15.088326 19.893973 19.028239 25.813119 27.235526 20.254242 23.339972 
##        34        35        44        45        46        53        59        64 
## 21.336835 16.045208  8.734933  6.017533 19.574247 25.296721 24.556306 11.946319 
##        65        66        72        80        86        89        90        96 
## 12.554063 13.073746 25.618897 25.855416 13.947515 13.844284 15.610367  9.469071 
##        98       101       105       108       111       112       114       115 
## 19.864710 21.015898  9.206310 22.055137 26.802562 27.794993 22.763924 25.695059 
##       116       120       121       122       126       128       131       134 
## 14.143575 23.567222 21.891153 17.985988 20.306680 22.177400 24.155961 16.586532 
##       147       149       150       163       165       168       171       178 
## 25.571260 26.691556 26.768188 18.085925 21.786925 29.506488 24.535408 24.441492 
##       186       188       190       192       200       201       206       208 
## 26.762947 14.714333 16.614782 21.180732 18.647707 20.136697 30.419702 22.394968 
##       210       211       213       215       217       218       232       233 
## 22.344105 25.255641 14.115156 17.349663 32.261348 27.987083 16.075020 15.612435 
##       238       241       243       248       250       254       255       264 
## 28.844458 28.995810 26.319355 32.624028 21.716724 23.204714 24.181722 20.866042 
##       265       267       270       272       276       284       285       290 
## 22.826542 28.789734 28.083310 25.661059 23.518260 23.748300 22.694616 16.957426 
##       293       295       296       298       299       300       303       305 
## 19.744283 33.551769 30.845612 23.307768 20.554791 25.949419 29.448045 30.905283 
##       307       308       309       312       313       314       317       319 
## 26.277622 25.791156 27.142093 30.368121 34.170252 27.532781 23.651240 30.039588 
##       323       331       333       339       343       355       357       358 
## 33.765259 33.869310 33.677873 29.194584 29.505177 31.318140 33.049379 31.062638 
##       359       361       362       363       367       370       375       383 
## 31.518017 26.307928 28.944203 28.531749 23.523501 30.529423 27.122259 34.541803 
##       384       385       388       390       394       396       398 
## 35.905174 36.007766 28.043460 28.368893 27.944635 30.677345 28.636131

#scatter plot of actual vs predicted mpg on the test set
plot(test_set$mpg,predi)

Performance Evaluation

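#Calculating RMSE
# Note: the expression below squares the sum of the errors, not each error, so it returns the absolute mean error (bias) rather than the RMSE
val=sqrt(sum(predi-test_set$mpg)^2)/length(test_set$mpg)
val

## [1] 0.1164861

# RMSE proper: square each error, average the squares, then take the square root
rmse=sqrt(mean((predi-test_set$mpg)^2))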

#R-squared value of the fitted model (the adjusted value is available as summary(regressor)$adj.r.squared)
R2=summary(regressor)$r.squared
R2

## [1] 0.8387803
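
The adjusted R-squared printed by summary() can be reproduced from the plain R-squared, the number of training rows n and the number of predictors p, using 1 - (1 - R2)(n - 1)/(n - p - 1). The sketch below applies this to the two R-squared values reported in this document (n = 279; p = 1 for the weight-only model, p = 7 for the full model). `adj_r2` is a helper defined for this example.

```r
# Adjusted R-squared penalises R-squared for the number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# R-squared values and sizes taken from the two model summaries in this report
adj_r2(0.7229297, n = 279, p = 1)   # weight-only model
adj_r2(0.8387803, n = 279, p = 7)   # model with all seven predictors
```

Both results agree, to four decimal places, with the "Adjusted R-squared" figures in the respective summary() outputs (0.7219 and 0.8346).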

Comparative Statements of Algorithm 1 and Algorithm 2

Multiple Regression (Algorithm 2) reaches an adjusted R-squared of 0.8346, which satisfies the minimum requirement of 0.75, whereas Linear Regression (Algorithm 1) reaches only 0.7219. The error value computed in the evaluation step is lower for Algorithm 1 (0.0508 versus 0.1165), but it is small in both cases, so Multiple Regression can be considered the better technique for this data.
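
A comparison of this kind can be laid out in a single table by fitting both models on the same training rows and scoring them on the same test rows. The sketch below does so on the `mtcars` data bundled with R, used only as a stand-in. The seed, the split size and the chosen predictors are arbitrary assumptions for the example, so its numbers do not describe the auto-mpg results.

```r
# Stand-in data: mtcars ships with R and is unrelated to the auto-mpg file
set.seed(1)
idx   <- sample(nrow(mtcars), 22)   # roughly 70% of the 32 rows for training
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# One simple and one multiple regression model, fitted on the same rows
models <- list(
  simple   = lm(mpg ~ wt, data = train),
  multiple = lm(mpg ~ wt + hp + qsec, data = train)
)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# One row per model: fit quality on training data and error on unseen data
comparison <- data.frame(
  model     = names(models),
  adj_r2    = sapply(models, function(m) summary(m)$adj.r.squared),
  test_rmse = sapply(models, function(m) rmse(test$mpg, predict(m, newdata = test)))
)
comparison
```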

Result:

Multiple Regression is therefore chosen for this data, since it explains more of the variation in mpg.

MohanDatta 20MID0012

2022-11-11
