The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition. This dataset is a slightly modified version of the dataset provided in the StatLib library. In line with the use by Ross Quinlan (1993) in predicting the attribute “mpg”, 8 of the original instances were removed because they had unknown values for the “mpg” attribute. The original dataset is available in the file “auto-mpg.data-original”.
#Setting the working directory
setwd("C:/Users/kmoha/Downloads")
#Importing dataset
data = read.csv("auto-mpg.csv")
#The View command displays the data in a separate window
View(data)
#Keeping the first 8 columns and dropping the attributes that are not useful for the prediction
data = data[1:8]
#confirming the data type of the dataset
class(data)
## [1] "data.frame"
#displaying dimension of data
dim(data)
## [1] 398 8
#Loading the dplyr library so that glimpse can give an organised view of the structure of the dataset
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
glimpse(data)
## Rows: 398
## Columns: 8
## $ mpg <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
## $ cylinders <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
## $ horsepower <chr> "130", "165", "150", "150", "140", "198", "220", "215", "…
## $ weight <int> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
## $ model_year <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
## $ origin <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, …
#Understanding the data with the help of summary
summary(data)
## mpg cylinders displacement horsepower
## Min. : 9.00 Min. :3.000 Min. : 68.0 Length:398
## 1st Qu.:17.50 1st Qu.:4.000 1st Qu.:104.2 Class :character
## Median :23.00 Median :4.000 Median :148.5 Mode :character
## Mean :23.51 Mean :5.455 Mean :193.4
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:262.0
## Max. :46.60 Max. :8.000 Max. :455.0
## weight acceleration model_year origin
## Min. :1613 Min. : 8.00 Min. :70.00 Min. :1.000
## 1st Qu.:2224 1st Qu.:13.82 1st Qu.:73.00 1st Qu.:1.000
## Median :2804 Median :15.50 Median :76.00 Median :1.000
## Mean :2970 Mean :15.57 Mean :76.01 Mean :1.573
## 3rd Qu.:3608 3rd Qu.:17.18 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :5140 Max. :24.80 Max. :82.00 Max. :3.000
For data distributions, you may require more information than central tendency values (median, mean, mode). To analyze data variability, you need to know how dispersed the data are.
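As a rough illustration of the point above, dispersion can be summarised next to the mean with base R. This is only a sketch: the helper name `spread_summary` is hypothetical, and R's built-in `mtcars` data stands in for the auto-mpg file so that the snippet runs on its own.

```r
# Hypothetical helper: central tendency plus common dispersion measures
spread_summary <- function(x) {
  x <- x[!is.na(x)]            # ignore missing values
  c(mean  = mean(x),
    sd    = sd(x),             # standard deviation
    IQR   = IQR(x),            # interquartile range (Q3 - Q1)
    range = max(x) - min(x))   # distance between the extremes
}

# mtcars is a stand-in dataset (assumption), not the auto-mpg data used here
mpg_spread <- spread_summary(mtcars$mpg)
print(round(mpg_spread, 2))
```

The same helper could be applied to `data$mpg` once the auto-mpg file is loaded.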
#Visualizing initial data
# ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics and which graphical primitives to use, and it takes care of the details.
library(ggplot2)
#Box plot
#A box plot is a graph that illustrates the distribution of values in data. Box plots show the distribution in a standard way by presenting five summary values: the minimum, Q1 (first quartile), median, Q3 (third quartile), and maximum. These values also give us information about outliers and their values.
ggplot(data, aes(x=factor(cylinders), y=mpg)) + geom_boxplot(aes(fill = cylinders))+labs(title = "20MID0012")
#Histogram
#A histogram is an approximate representation of the distribution of numerical data. In a histogram, each bar groups numbers into ranges. Taller bars show that more data falls in that range. A histogram displays the shape and spread of continuous sample data. Histograms roughly give us an idea about the probability distribution of a given variable by depicting the frequencies of observations occurring in certain ranges of values.
ggplot(data,aes(x=mpg)) + geom_histogram(bins=30) +facet_wrap(~cylinders)+labs(title = "20MID0012")
#Scatterplot using ggplot2
#A scatterplot displays the values of two variables along two axes. It shows the relationship between them and can reveal a correlation. Here the relationship between the weight and acceleration of the cars is shown, with the points coloured by mpg.
ggplot(data, aes(x=weight, y=acceleration, col=mpg))+geom_point()+labs(title = "20MID0012")
#Heat Map
#A heatmap depicts the relationship between two attributes of a dataframe as a color-coded tile or bin. A heatmap produces a grid with multiple attributes of the dataframe, representing the relationship between the two attributes taken at a time.
ggplot(data, aes(mpg, cylinders)) + geom_bin2d(bins = 30)+scale_fill_distiller(palette = "Spectral",direction=1)+labs(title = "20MID0012")
#Bar plot
#A bar graph (or bar chart) is a graphical display of data using bars of different heights. It is a good choice if you want to visualize data from different categories that are being compared with each other.
ggplot(data, aes(x = cylinders, fill = factor(model_year))) + geom_bar() + facet_wrap(~model_year) + labs(title = "20MID0012")
#Line plot
#A line chart or line graph displays the evolution of one or several numeric variables. Data points are usually connected by straight line segments.
ggplot(data=data, aes(x=cylinders, y=model_year, group=1)) + geom_line(color="blue")+geom_point()+labs(title = "20MID0012")
#Converting all int and chr attributes which are required into numeric
data$horsepower = as.numeric(data$horsepower)
## Warning: NAs introduced by coercion
data$cylinders = as.numeric(data$cylinders)
data$weight = as.numeric(data$weight)
data$model_year = as.numeric(data$model_year)
#Replacing missing values with mean
summary(data$horsepower)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 46.0 75.0 93.5 104.5 126.0 230.0 6
data$horsepower[is.na(data$horsepower)] = mean(data$horsepower, na.rm = TRUE)
data$horsepower
## [1] 130.0000 165.0000 150.0000 150.0000 140.0000 198.0000 220.0000 215.0000
## [9] 225.0000 190.0000 170.0000 160.0000 150.0000 225.0000 95.0000 95.0000
## [17] 97.0000 85.0000 88.0000 46.0000 87.0000 90.0000 95.0000 113.0000
## [25] 90.0000 215.0000 200.0000 210.0000 193.0000 88.0000 90.0000 95.0000
## [33] 104.4694 100.0000 105.0000 100.0000 88.0000 100.0000 165.0000 175.0000
## [41] 153.0000 150.0000 180.0000 170.0000 175.0000 110.0000 72.0000 100.0000
## [49] 88.0000 86.0000 90.0000 70.0000 76.0000 65.0000 69.0000 60.0000
## [57] 70.0000 95.0000 80.0000 54.0000 90.0000 86.0000 165.0000 175.0000
## [65] 150.0000 153.0000 150.0000 208.0000 155.0000 160.0000 190.0000 97.0000
## [73] 150.0000 130.0000 140.0000 150.0000 112.0000 76.0000 87.0000 69.0000
## [81] 86.0000 92.0000 97.0000 80.0000 88.0000 175.0000 150.0000 145.0000
## [89] 137.0000 150.0000 198.0000 150.0000 158.0000 150.0000 215.0000 225.0000
## [97] 175.0000 105.0000 100.0000 100.0000 88.0000 95.0000 46.0000 150.0000
## [105] 167.0000 170.0000 180.0000 100.0000 88.0000 72.0000 94.0000 90.0000
## [113] 85.0000 107.0000 90.0000 145.0000 230.0000 49.0000 75.0000 91.0000
## [121] 112.0000 150.0000 110.0000 122.0000 180.0000 95.0000 104.4694 100.0000
## [129] 100.0000 67.0000 80.0000 65.0000 75.0000 100.0000 110.0000 105.0000
## [137] 140.0000 150.0000 150.0000 140.0000 150.0000 83.0000 67.0000 78.0000
## [145] 52.0000 61.0000 75.0000 75.0000 75.0000 97.0000 93.0000 67.0000
## [153] 95.0000 105.0000 72.0000 72.0000 170.0000 145.0000 150.0000 148.0000
## [161] 110.0000 105.0000 110.0000 95.0000 110.0000 110.0000 129.0000 75.0000
## [169] 83.0000 100.0000 78.0000 96.0000 71.0000 97.0000 97.0000 70.0000
## [177] 90.0000 95.0000 88.0000 98.0000 115.0000 53.0000 86.0000 81.0000
## [185] 92.0000 79.0000 83.0000 140.0000 150.0000 120.0000 152.0000 100.0000
## [193] 105.0000 81.0000 90.0000 52.0000 60.0000 70.0000 53.0000 100.0000
## [201] 78.0000 110.0000 95.0000 71.0000 70.0000 75.0000 72.0000 102.0000
## [209] 150.0000 88.0000 108.0000 120.0000 180.0000 145.0000 130.0000 150.0000
## [217] 68.0000 80.0000 58.0000 96.0000 70.0000 145.0000 110.0000 145.0000
## [225] 130.0000 110.0000 105.0000 100.0000 98.0000 180.0000 170.0000 190.0000
## [233] 149.0000 78.0000 88.0000 75.0000 89.0000 63.0000 83.0000 67.0000
## [241] 78.0000 97.0000 110.0000 110.0000 48.0000 66.0000 52.0000 70.0000
## [249] 60.0000 110.0000 140.0000 139.0000 105.0000 95.0000 85.0000 88.0000
## [257] 100.0000 90.0000 105.0000 85.0000 110.0000 120.0000 145.0000 165.0000
## [265] 139.0000 140.0000 68.0000 95.0000 97.0000 75.0000 95.0000 105.0000
## [273] 85.0000 97.0000 103.0000 125.0000 115.0000 133.0000 71.0000 68.0000
## [281] 115.0000 85.0000 88.0000 90.0000 110.0000 130.0000 129.0000 138.0000
## [289] 135.0000 155.0000 142.0000 125.0000 150.0000 71.0000 65.0000 80.0000
## [297] 80.0000 77.0000 125.0000 71.0000 90.0000 70.0000 70.0000 65.0000
## [305] 69.0000 90.0000 115.0000 115.0000 90.0000 76.0000 60.0000 70.0000
## [313] 65.0000 90.0000 88.0000 90.0000 90.0000 78.0000 90.0000 75.0000
## [321] 92.0000 75.0000 65.0000 105.0000 65.0000 48.0000 48.0000 67.0000
## [329] 67.0000 67.0000 104.4694 67.0000 62.0000 132.0000 100.0000 88.0000
## [337] 104.4694 72.0000 84.0000 84.0000 92.0000 110.0000 84.0000 58.0000
## [345] 64.0000 60.0000 67.0000 65.0000 62.0000 68.0000 63.0000 65.0000
## [353] 65.0000 74.0000 104.4694 75.0000 75.0000 100.0000 74.0000 80.0000
## [361] 76.0000 116.0000 120.0000 110.0000 105.0000 88.0000 85.0000 88.0000
## [369] 88.0000 88.0000 85.0000 84.0000 90.0000 92.0000 104.4694 74.0000
## [377] 68.0000 68.0000 63.0000 70.0000 88.0000 75.0000 70.0000 67.0000
## [385] 67.0000 67.0000 110.0000 85.0000 92.0000 112.0000 96.0000 84.0000
## [393] 90.0000 86.0000 52.0000 84.0000 79.0000 82.0000
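The mean-replacement step above can be wrapped in a small reusable function. The sketch below is illustrative only: `impute_mean` is a hypothetical helper, and the toy vector (including the "?" marker for a missing reading) is made up rather than taken from the dataset.

```r
# Hypothetical helper: replace NA entries with the mean of the observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Toy character vector (made up); "?" cannot be parsed as a number
hp_raw <- c("130", "165", "?", "95")
hp <- suppressWarnings(as.numeric(hp_raw))  # "?" becomes NA
hp_filled <- impute_mean(hp)
print(hp_filled)
```

In this toy case the missing entry is filled with 130, the mean of 130, 165 and 95. Mean imputation keeps every row, but it shrinks the variance of the column slightly, which is worth bearing in mind when many values are missing.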
#Splitting the data into training and testing sets (caTools is loaded here, but the split below uses base R's sample())
library(caTools)
# set.seed(value) fixes the initial value of the random number generator, so the same random rows are selected for the train and test sets on every run
set.seed(18)
#Splitting the data into training and testing sets in a ratio of 70:30 (279 of the 398 rows go to training)
split= sample(398,279)
#Train set
# storing the training set values
training_set = data[split,]
# Checking the dimensions of the training set to confirm the size of the split
dim(training_set)
## [1] 279 8
# displaying training set
View(training_set)
#Test set
# storing the testing set values
test_set = data[-split,]
# Checking the dimensions of the testing set to confirm the size of the split
dim(test_set)
## [1] 119 8
# displaying test set
View(test_set)
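The split above hard-codes the row counts. A more general version might look like the following sketch, which assumes a hypothetical helper `split_train_test` and uses the built-in `mtcars` data as a stand-in so that it runs without the CSV file.

```r
# Hypothetical helper: reproducible train/test split by fraction
split_train_test <- function(df, train_frac = 0.7, seed = 18) {
  set.seed(seed)                                    # same rows on every run
  n_train <- round(train_frac * nrow(df))           # e.g. 279 when nrow is 398
  idx <- sample(seq_len(nrow(df)), size = n_train)  # training row indices
  list(train = df[idx, , drop = FALSE],
       test  = df[-idx, , drop = FALSE])
}

# mtcars is a stand-in dataset (assumption)
parts <- split_train_test(mtcars)
print(c(train = nrow(parts$train), test = nrow(parts$test)))
```

Deriving the counts from `nrow(df)` means the code keeps working if rows are added or removed before the split.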
#Fitting a simple linear regression model of mpg on weight
regressor = lm(formula=mpg~weight,data = training_set)
# summary of the created model
summary(regressor)
##
## Call:
## lm(formula = mpg ~ weight, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6283 -2.7341 -0.3075 2.0045 14.9658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.5682157 0.8980358 51.86 <2e-16 ***
## weight -0.0077662 0.0002889 -26.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.21 on 277 degrees of freedom
## Multiple R-squared: 0.7229, Adjusted R-squared: 0.7219
## F-statistic: 722.7 on 1 and 277 DF, p-value: < 2.2e-16
# Diagnostic plots of the fitted model (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage)
plot(regressor)
#Prediction
predi = predict(object = regressor, newdata=test_set)
# displaying prediction
predi
## 3 5 14 16 19 20 25 31
## 19.883634 19.782673 22.601795 24.566638 30.026260 32.317282 26.003381 28.985592
## 34 35 44 45 46 53 59 64
## 26.112107 19.860335 9.709942 6.650069 23.564801 30.531061 30.057325 12.513532
## 65 66 72 80 86 89 90 96
## 14.455076 14.501673 28.473025 29.568056 14.726892 15.177331 17.235367 8.117876
## 98 101 105 108 111 112 114 115
## 22.329979 23.106597 8.467354 24.908350 28.092482 30.072857 27.370228 28.977826
## 116 120 121 122 126 128 131 134
## 14.866684 26.515948 24.294822 20.170982 22.477536 24.038538 27.533317 17.204303
## 147 149 150 163 165 168 171 178
## 30.065091 29.125383 27.238203 17.600378 22.966806 29.707847 26.438286 25.646136
## 186 188 190 192 200 201 206 208
## 29.055488 13.833782 15.798625 21.460167 18.213906 18.811901 29.832106 22.104760
## 210 211 213 215 217 218 232 233
## 21.172819 23.813319 12.552363 16.513113 30.686385 29.832106 12.979503 12.901841
## 238 241 243 248 250 254 255 264
## 30.639788 29.560289 26.376157 30.492231 20.435032 22.065929 23.541503 19.813738
## 265 267 270 272 276 284 285 290
## 21.677620 29.832106 29.249642 25.250061 22.182422 21.211650 20.473863 12.707687
## 293 295 296 298 299 300 303 305
## 15.969481 31.230017 31.695988 19.153613 16.280128 21.794113 29.870936 30.026260
## 307 308 309 312 313 314 317 319
## 26.414988 25.599539 26.717869 30.103922 30.888305 25.770395 20.310773 25.514111
## 323 331 333 339 343 355 357 358
## 30.181583 32.317282 32.239620 27.230436 28.045885 28.550686 28.317701 26.259664
## 359 361 362 363 367 370 375 383
## 26.104341 22.027098 24.046304 23.813319 19.658414 27.968223 22.997870 29.133150
## 384 385 388 390 394 396 398
## 31.307679 31.307679 23.153194 24.551106 24.900583 28.744841 25.444216
#Scatter plot of actual mpg against predicted mpg
plot(test_set$mpg,predi)
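#Calculating the error value (labelled RMSE)
# Note: the expression below squares the sum of the residuals, so it evaluates to the absolute mean residual, |sum(predi - mpg)| / n, rather than the root mean squared error, which would be sqrt(mean((predi - test_set$mpg)^2))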
val=sqrt(sum(predi-test_set$mpg)^2)/length(test_set$mpg)
val
## [1] 0.05083303
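The expression used above squares the sum of the residuals instead of summing the squared residuals, so positive and negative errors can cancel. The sketch below contrasts the two on a toy example; the helper name `rmse` and the numbers are made up for illustration.

```r
# Root mean squared error: square each residual, average, then take the root
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy values (made up for illustration)
actual    <- c(20, 25, 30)
predicted <- c(22, 23, 30)

# Squaring the *sum* of residuals lets +2 and -2 cancel out
bias_like <- sqrt(sum(predicted - actual)^2) / length(actual)
print(bias_like)                # 0
print(rmse(actual, predicted))  # sqrt(8/3), about 1.633
```

Two of the three predictions are off by 2 mpg, yet the first quantity reports zero error; the RMSE reflects the typical size of the misses.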
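#R-squared value (summary(regressor)$r.squared is the multiple R-squared; the adjusted value is summary(regressor)$adj.r.squared)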
R2=summary(regressor)$r.squared
R2
## [1] 0.7229297
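The value printed above is the multiple R-squared; `summary()` on an `lm` fit also stores the adjusted value. The following sketch shows both, along with the textbook adjustment formula, using the built-in `mtcars` data as a stand-in (an assumption, with `wt` playing the role of weight).

```r
# mtcars is a stand-in dataset (assumption); wt plays the role of weight
fit <- lm(mpg ~ wt, data = mtcars)
fit_summary <- summary(fit)

r2     <- fit_summary$r.squared      # multiple R-squared
adj_r2 <- fit_summary$adj.r.squared  # adjusted R-squared

# Adjusted R-squared by hand: 1 - (1 - R2) * (n - 1) / (n - p - 1)
n <- nrow(mtcars)  # number of observations
p <- 1             # number of predictors
adj_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(r.squared = r2, adj.r.squared = adj_r2, manual = adj_manual))
```

Applying the same formula to the model summary above (R-squared 0.7229297, n = 279, p = 1) gives roughly 0.7219, which agrees with the adjusted value in the summary output.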
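#Fitting a multiple linear regression model of mpg on all remaining attributes (same cleaned data and 70:30 split as above)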
regressor = lm(formula=mpg~.,data = training_set)
# summary of the created model
summary(regressor)
##
## Call:
## lm(formula = mpg ~ ., data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.681 -1.956 -0.065 1.655 11.805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.874e+01 5.518e+00 -3.396 0.000787 ***
## cylinders -3.153e-01 3.797e-01 -0.830 0.407110
## displacement 1.694e-02 8.894e-03 1.905 0.057887 .
## horsepower -2.416e-03 1.560e-02 -0.155 0.877023
## weight -6.866e-03 7.937e-04 -8.651 4.60e-16 ***
## acceleration 1.466e-01 1.119e-01 1.309 0.191544
## model_year 7.447e-01 5.989e-02 12.435 < 2e-16 ***
## origin 1.584e+00 3.295e-01 4.807 2.54e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.246 on 271 degrees of freedom
## Multiple R-squared: 0.8388, Adjusted R-squared: 0.8346
## F-statistic: 201.4 on 7 and 271 DF, p-value: < 2.2e-16
# Diagnostic plots of the fitted model (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage)
plot(regressor)
#Prediction
predi = predict(object = regressor, newdata=test_set)
# displaying prediction
predi
## 3 5 14 16 19 20 25 31
## 15.497748 15.088326 19.893973 19.028239 25.813119 27.235526 20.254242 23.339972
## 34 35 44 45 46 53 59 64
## 21.336835 16.045208 8.734933 6.017533 19.574247 25.296721 24.556306 11.946319
## 65 66 72 80 86 89 90 96
## 12.554063 13.073746 25.618897 25.855416 13.947515 13.844284 15.610367 9.469071
## 98 101 105 108 111 112 114 115
## 19.864710 21.015898 9.206310 22.055137 26.802562 27.794993 22.763924 25.695059
## 116 120 121 122 126 128 131 134
## 14.143575 23.567222 21.891153 17.985988 20.306680 22.177400 24.155961 16.586532
## 147 149 150 163 165 168 171 178
## 25.571260 26.691556 26.768188 18.085925 21.786925 29.506488 24.535408 24.441492
## 186 188 190 192 200 201 206 208
## 26.762947 14.714333 16.614782 21.180732 18.647707 20.136697 30.419702 22.394968
## 210 211 213 215 217 218 232 233
## 22.344105 25.255641 14.115156 17.349663 32.261348 27.987083 16.075020 15.612435
## 238 241 243 248 250 254 255 264
## 28.844458 28.995810 26.319355 32.624028 21.716724 23.204714 24.181722 20.866042
## 265 267 270 272 276 284 285 290
## 22.826542 28.789734 28.083310 25.661059 23.518260 23.748300 22.694616 16.957426
## 293 295 296 298 299 300 303 305
## 19.744283 33.551769 30.845612 23.307768 20.554791 25.949419 29.448045 30.905283
## 307 308 309 312 313 314 317 319
## 26.277622 25.791156 27.142093 30.368121 34.170252 27.532781 23.651240 30.039588
## 323 331 333 339 343 355 357 358
## 33.765259 33.869310 33.677873 29.194584 29.505177 31.318140 33.049379 31.062638
## 359 361 362 363 367 370 375 383
## 31.518017 26.307928 28.944203 28.531749 23.523501 30.529423 27.122259 34.541803
## 384 385 388 390 394 396 398
## 35.905174 36.007766 28.043460 28.368893 27.944635 30.677345 28.636131
#Scatter plot of actual mpg against predicted mpg
plot(test_set$mpg,predi)
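#Calculating the error value (labelled RMSE)
# Note: the expression below squares the sum of the residuals, so it evaluates to the absolute mean residual, |sum(predi - mpg)| / n, rather than the root mean squared error, which would be sqrt(mean((predi - test_set$mpg)^2))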
val=sqrt(sum(predi-test_set$mpg)^2)/length(test_set$mpg)
val
## [1] 0.1164861
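#R-squared value (summary(regressor)$r.squared is the multiple R-squared; the adjusted value is summary(regressor)$adj.r.squared)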
R2=summary(regressor)$r.squared
R2
## [1] 0.8387803
Compared with the simple linear regression of Algorithm-1 (R-squared 0.7229, adjusted 0.7219), the multiple regression of Algorithm-2 (R-squared 0.8388, adjusted 0.8346) satisfies the minimum requirement of an adjusted R-squared >= 0.75. The computed error value (labelled RMSE) is lower for Algorithm-1 (0.0508 against 0.1165), but since it is small in both cases we can consider multiple regression a good technique for this data.
Multiple regression is therefore chosen for this data in order to get efficient predictions.
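The comparison described above can be laid out side by side in one small table. This is a sketch under stated assumptions, not the analysis itself: the built-in `mtcars` data stands in for auto-mpg, the predictor set `wt + hp + cyl` is chosen only for illustration, and `rmse` is a hypothetical helper.

```r
# Root mean squared error on held-out data (hypothetical helper)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# mtcars is a stand-in dataset (assumption); 70:30 split with a fixed seed
set.seed(18)
train_idx <- sample(nrow(mtcars), round(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

simple_fit   <- lm(mpg ~ wt, data = train)             # one predictor
multiple_fit <- lm(mpg ~ wt + hp + cyl, data = train)  # several predictors

# One row per model: fit quality on training data, error on test data
comparison <- data.frame(
  model     = c("simple", "multiple"),
  adj_r2    = c(summary(simple_fit)$adj.r.squared,
                summary(multiple_fit)$adj.r.squared),
  test_rmse = c(rmse(test$mpg, predict(simple_fit, newdata = test)),
                rmse(test$mpg, predict(multiple_fit, newdata = test)))
)
print(comparison)
```

Reporting the adjusted R-squared together with a held-out error keeps the comparison fair when the models differ in the number of predictors.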