Data Preparation

# load data
cars <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data", header = FALSE)
colnames(cars) <- c("mpg","cylinders","displacement","horsepower","weight","acceleration","model year","origin","car name")
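
The str(cars) output in this report shows horsepower read as a factor with a "?" level, which suggests the file marks unknown horsepower values with "?". A minimal sketch of one way to handle this at load time is shown here, assuming that "?" is the only missing-value marker. The inline text, the name sample_cars, and the underscore column names are illustrative assumptions, not part of the original analysis.

```r
# Illustrative lines in the same whitespace-separated layout as the file.
# The first two rows match head(cars); the third is a made-up row
# whose only purpose is to show a "?" in the horsepower field.
raw <- '18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
25.0 4 98.0 ? 2046. 19.0 71 1 "example car"'

sample_cars <- read.table(
  text = raw,
  header = FALSE,
  na.strings = "?",          # treat "?" as NA so horsepower parses as numeric
  stringsAsFactors = FALSE   # keep car names as character strings
)

# Syntactic column names (no spaces) are easier to use in formulas.
colnames(sample_cars) <- c("mpg", "cylinders", "displacement", "horsepower",
                           "weight", "acceleration", "model_year", "origin",
                           "car_name")

str(sample_cars$horsepower)   # numeric vector, NA where the file had "?"
colSums(is.na(sample_cars))   # missing-value count per column
```

The same na.strings argument could be passed to the read.table call on the real URL; the column would then arrive as numeric, and summary() would report the NA count directly.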

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Answer:

This project aims to determine a car's mileage from its other attributes. The central question is: can we predict a car's mileage (mpg) from its other attributes?

Cases

In this project, each case is one observation of a car's information.

nrow(cars)
## [1] 398

There are 398 cases.

Data collection

The data were collected from the UCI Machine Learning Repository. Data set name: Auto MPG Data Set.

Type of study

This is an observational study.

Data Source

Data link: https://archive.ics.uci.edu/ml/datasets/Auto+MPG

Dependent Variable

Miles per gallon (mpg) is the dependent variable. It is continuous.

Independent Variable

The independent variables are the following:

colnames(cars)[2:8]
## [1] "cylinders"    "displacement" "horsepower"   "weight"      
## [5] "acceleration" "model year"   "origin"

Relevant summary statistics

First and last few rows of the data

head(cars)
##   mpg cylinders displacement horsepower weight acceleration model year
## 1  18         8          307      130.0   3504         12.0         70
## 2  15         8          350      165.0   3693         11.5         70
## 3  18         8          318      150.0   3436         11.0         70
## 4  16         8          304      150.0   3433         12.0         70
## 5  17         8          302      140.0   3449         10.5         70
## 6  15         8          429      198.0   4341         10.0         70
##   origin                  car name
## 1      1 chevrolet chevelle malibu
## 2      1         buick skylark 320
## 3      1        plymouth satellite
## 4      1             amc rebel sst
## 5      1               ford torino
## 6      1          ford galaxie 500
tail(cars)
##     mpg cylinders displacement horsepower weight acceleration model year
## 393  27         4          151      90.00   2950         17.3         82
## 394  27         4          140      86.00   2790         15.6         82
## 395  44         4           97      52.00   2130         24.6         82
## 396  32         4          135      84.00   2295         11.6         82
## 397  28         4          120      79.00   2625         18.6         82
## 398  31         4          119      82.00   2720         19.4         82
##     origin         car name
## 393      1 chevrolet camaro
## 394      1  ford mustang gl
## 395      2        vw pickup
## 396      1    dodge rampage
## 397      1      ford ranger
## 398      1       chevy s-10

Structure of dataset

str(cars)
## 'data.frame':    398 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : Factor w/ 94 levels "?","100.0","102.0",..: 17 35 29 29 24 42 47 46 48 40 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ model year  : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ car name    : Factor w/ 305 levels "amc ambassador brougham",..: 50 37 232 15 162 142 55 224 242 2 ...
summary(cars)
##       mpg          cylinders      displacement     horsepower 
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   150.0  : 22  
##  1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.2   90.00  : 20  
##  Median :23.00   Median :4.000   Median :148.5   88.00  : 19  
##  Mean   :23.51   Mean   :5.455   Mean   :193.4   110.0  : 18  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0   100.0  : 17  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   75.00  : 14  
##                                                  (Other):288  
##      weight      acceleration     model year        origin     
##  Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000  
##  1st Qu.:2224   1st Qu.:13.82   1st Qu.:73.00   1st Qu.:1.000  
##  Median :2804   Median :15.50   Median :76.00   Median :1.000  
##  Mean   :2970   Mean   :15.57   Mean   :76.01   Mean   :1.573  
##  3rd Qu.:3608   3rd Qu.:17.18   3rd Qu.:79.00   3rd Qu.:2.000  
##  Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000  
##                                                                
##            car name  
##  ford pinto    :  6  
##  amc matador   :  5  
##  ford maverick :  5  
##  toyota corolla:  5  
##  amc gremlin   :  4  
##  amc hornet    :  4  
##  (Other)       :369
boxplot(cars$mpg ~ cars$cylinders, main = "Mileage by cylinders")

boxplot(cars$mpg ~ cars$origin, main = "Mileage by origin")

# weight on the x-axis, mpg (the response) on the y-axis
plot(cars$weight, cars$mpg, main = "Mileage by weight")

hist(cars$mpg, main = "Distribution of mileage")
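
A histogram gives only a visual impression of shape, so a numeric check can help before describing a variable as normally distributed. The sketch below computes sample skewness and runs a Shapiro-Wilk test. It uses the mpg column of R's built-in mtcars data purely as a stand-in so the example runs without a download; substituting cars$mpg is left to the reader, and the results for mtcars say nothing about the Auto MPG data.

```r
# mtcars ships with base R; its mpg column is only a stand-in here.
x <- mtcars$mpg

# Sample skewness: values above zero indicate a longer right tail,
# values near zero indicate rough symmetry.
centered <- x - mean(x)
skew <- mean(centered^3) / mean(centered^2)^1.5

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(x)

skew
sw$p.value
```

A normal Q-Q plot, drawn with qqnorm(x) followed by qqline(x), is a common visual companion to these two numbers.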

cor(cars[,c(1,2,3,5,6,7)])
##                     mpg  cylinders displacement     weight acceleration
## mpg           1.0000000 -0.7753963   -0.8042028 -0.8317409    0.4202889
## cylinders    -0.7753963  1.0000000    0.9507214  0.8960168   -0.5054195
## displacement -0.8042028  0.9507214    1.0000000  0.9328241   -0.5436841
## weight       -0.8317409  0.8960168    0.9328241  1.0000000   -0.4174573
## acceleration  0.4202889 -0.5054195   -0.5436841 -0.4174573    1.0000000
## model year    0.5792671 -0.3487458   -0.3701642 -0.3065643    0.2881370
##              model year
## mpg           0.5792671
## cylinders    -0.3487458
## displacement -0.3701642
## weight       -0.3065643
## acceleration  0.2881370
## model year    1.0000000
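
To make the matrix easier to read, the correlations with mpg can be ranked by absolute size. The sketch below copies the mpg row from the output above into a named vector (the names r and ranked are illustrative) and also shows the squared correlation, which is the share of variance in mpg that a simple linear fit on that single variable would account for.

```r
# Correlations with mpg, copied from the mpg row of the matrix above.
r <- c(cylinders    = -0.7753963,
       displacement = -0.8042028,
       weight       = -0.8317409,
       acceleration =  0.4202889,
       "model year" =  0.5792671)

# Rank by strength of association, ignoring sign.
ranked <- r[order(abs(r), decreasing = TRUE)]

# r_squared is the proportion of variance explained by each variable alone.
round(data.frame(r = ranked, r_squared = ranked^2), 3)
```

Note that horsepower and origin are absent because the original cor() call selects only columns 1, 2, 3, 5, 6 and 7.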

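The summary output shows no NA values, but horsepower was read as a factor that includes a "?" level, so unknown horsepower values are coded as "?" rather than NA and the column needs cleaning before it can be used as a numeric predictor. The mean of mpg (23.51) is slightly above its median (23.00), which suggests a mild right skew rather than an exactly normal distribution. From the correlation matrix, mpg is strongly negatively correlated with weight (-0.83), displacement (-0.80), and cylinders (-0.78), moderately positively correlated with model year (0.58), and more weakly correlated with acceleration (0.42). All five variables therefore look useful for predicting mpg, although cylinders, displacement, and weight are highly correlated with one another (0.90 to 0.95) and carry overlapping information.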