# load data
cars <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data", header = FALSE)
colnames(cars) <- c("mpg","cylinders","displacement","horsepower","weight","acceleration","model year","origin","car name")
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
Answer:
This project examines how a car's fuel mileage relates to its other attributes. The central question is: can we predict a car's mileage (mpg) from its other attributes?
In this project, each case is one car's recorded information.
nrow(cars)
## [1] 398
There are 398 cases.
The data were obtained from the UCI Machine Learning Repository website. Data set name: Auto MPG Data Set.
This is an observational study.
Miles per gallon (mpg) is the dependent variable. It’s continuous.
The independent variables are the following:
colnames(cars)[2:8]
## [1] "cylinders" "displacement" "horsepower" "weight"
## [5] "acceleration" "model year" "origin"
First and last few rows of the data
head(cars)
## mpg cylinders displacement horsepower weight acceleration model year
## 1 18 8 307 130.0 3504 12.0 70
## 2 15 8 350 165.0 3693 11.5 70
## 3 18 8 318 150.0 3436 11.0 70
## 4 16 8 304 150.0 3433 12.0 70
## 5 17 8 302 140.0 3449 10.5 70
## 6 15 8 429 198.0 4341 10.0 70
## origin car name
## 1 1 chevrolet chevelle malibu
## 2 1 buick skylark 320
## 3 1 plymouth satellite
## 4 1 amc rebel sst
## 5 1 ford torino
## 6 1 ford galaxie 500
tail(cars)
## mpg cylinders displacement horsepower weight acceleration model year
## 393 27 4 151 90.00 2950 17.3 82
## 394 27 4 140 86.00 2790 15.6 82
## 395 44 4 97 52.00 2130 24.6 82
## 396 32 4 135 84.00 2295 11.6 82
## 397 28 4 120 79.00 2625 18.6 82
## 398 31 4 119 82.00 2720 19.4 82
## origin car name
## 393 1 chevrolet camaro
## 394 1 ford mustang gl
## 395 2 vw pickup
## 396 1 dodge rampage
## 397 1 ford ranger
## 398 1 chevy s-10
Structure of dataset
str(cars)
## 'data.frame': 398 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : Factor w/ 94 levels "?","100.0","102.0",..: 17 35 29 29 24 42 47 46 48 40 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ model year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ car name : Factor w/ 305 levels "amc ambassador brougham",..: 50 37 232 15 162 142 55 224 242 2 ...
summary(cars)
## mpg cylinders displacement horsepower
## Min. : 9.00 Min. :3.000 Min. : 68.0 150.0 : 22
## 1st Qu.:17.50 1st Qu.:4.000 1st Qu.:104.2 90.00 : 20
## Median :23.00 Median :4.000 Median :148.5 88.00 : 19
## Mean :23.51 Mean :5.455 Mean :193.4 110.0 : 18
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:262.0 100.0 : 17
## Max. :46.60 Max. :8.000 Max. :455.0 75.00 : 14
## (Other):288
## weight acceleration model year origin
## Min. :1613 Min. : 8.00 Min. :70.00 Min. :1.000
## 1st Qu.:2224 1st Qu.:13.82 1st Qu.:73.00 1st Qu.:1.000
## Median :2804 Median :15.50 Median :76.00 Median :1.000
## Mean :2970 Mean :15.57 Mean :76.01 Mean :1.573
## 3rd Qu.:3608 3rd Qu.:17.18 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :5140 Max. :24.80 Max. :82.00 Max. :3.000
##
## car name
## ford pinto : 6
## amc matador : 5
## ford maverick : 5
## toyota corolla: 5
## amc gremlin : 4
## amc hornet : 4
## (Other) :369
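In the str() and summary() output above, horsepower appears as a factor with a "?" level rather than as a numeric column. The block below is a minimal sketch of one way to repair this after loading, shown on a small stand-in factor. The helper name `factor_to_numeric` is hypothetical; on the real data the same call would be applied to `cars$horsepower`.

```r
# Hypothetical helper: convert a factor of numeric strings to numbers,
# treating a placeholder such as "?" as missing (NA).
factor_to_numeric <- function(f, missing_code = "?") {
  x <- as.character(f)          # use the labels, not the integer level codes
  x[x == missing_code] <- NA    # mark placeholders as missing
  as.numeric(x)
}

# Stand-in factor built from a few horsepower values plus one "?" entry
hp <- factor(c("130.0", "165.0", "150.0", "?"))

as.numeric(hp)                  # pitfall: returns level codes, not horsepower
hp_num <- factor_to_numeric(hp)
hp_num
sum(is.na(hp_num))              # number of missing horsepower values

# On the full data set (assumption: `cars` is loaded as above):
# cars$horsepower <- factor_to_numeric(cars$horsepower)
```

Alternatively, passing `na.strings = "?"` to read.table() should make horsepower numeric at load time, so no conversion is needed afterwards.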
# mpg grouped by number of cylinders and by origin
boxplot(cars$mpg~cars$cylinders,main = "Mileage by Cylinders")
boxplot(cars$mpg~cars$origin,main = "Mileage by origin")
# weight on the x-axis, mpg on the y-axis, to match the title
plot(cars$weight,cars$mpg,main = "Mileage by weight")
hist(cars$mpg,main="Distribution of Mileage")
cor(cars[,c(1,2,3,5,6,7)])
## mpg cylinders displacement weight acceleration
## mpg 1.0000000 -0.7753963 -0.8042028 -0.8317409 0.4202889
## cylinders -0.7753963 1.0000000 0.9507214 0.8960168 -0.5054195
## displacement -0.8042028 0.9507214 1.0000000 0.9328241 -0.5436841
## weight -0.8317409 0.8960168 0.9328241 1.0000000 -0.4174573
## acceleration 0.4202889 -0.5054195 -0.5436841 -0.4174573 1.0000000
## model year 0.5792671 -0.3487458 -0.3701642 -0.3065643 0.2881370
## model year
## mpg 0.5792671
## cylinders -0.3487458
## displacement -0.3701642
## weight -0.3065643
## acceleration 0.2881370
## model year 1.0000000
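To make the mpg column of the correlation matrix above easier to read, the sketch below ranks the predictors by absolute correlation with mpg. The values are copied from the output above; the 0.7 cutoff for "strong" is a rule-of-thumb assumption, not something the data set prescribes.

```r
# Correlations with mpg, copied from the matrix above
r_mpg <- c(cylinders    = -0.7753963,
           displacement = -0.8042028,
           weight       = -0.8317409,
           acceleration =  0.4202889,
           "model year" =  0.5792671)

# Order predictors from strongest to weakest linear association
ranked <- r_mpg[order(abs(r_mpg), decreasing = TRUE)]

# Label each one; the 0.7 threshold is an assumed rule of thumb
strength <- ifelse(abs(ranked) >= 0.7, "strong", "moderate")
data.frame(r = ranked, strength = strength)

# With the full data set loaded, the same vector could be taken from
# cor(cars[, c(1, 2, 3, 5, 6, 7)])["mpg", -1]
```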
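The summary statistics show no NA values in the numeric columns. However, horsepower was read as a factor because the raw file contains "?" entries; these are missing values and must be handled before horsepower can be used as a numeric predictor. The mpg response variable is only approximately normal: its mean (23.51) is slightly above its median (23.00), and the maximum (46.60) lies farther from the median than the minimum (9.00), which points to a mild right skew. From the correlation matrix, mpg is strongly negatively correlated with weight (-0.83), displacement (-0.80) and cylinders (-0.78), and moderately positively correlated with model year (0.58) and acceleration (0.42), so all of these variables are likely to be useful in predicting mpg.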
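Whether mpg is close enough to normal can be checked more directly than by eye. The sketch below is one possible check, assuming a numeric vector as input. The helper name `describe_shape` is hypothetical, and the simulated right-skewed sample only stands in for `cars$mpg` so that the block runs on its own.

```r
# Hypothetical helper: summarise how far a numeric vector departs from normality
describe_shape <- function(x) {
  x <- x[!is.na(x)]
  list(
    mean      = mean(x),
    median    = median(x),
    # moment-based sample skewness; > 0 means a longer right tail
    skewness  = mean((x - mean(x))^3) / (mean((x - mean(x))^2))^1.5,
    # Shapiro-Wilk test; a small p-value is evidence against normality
    shapiro_p = shapiro.test(x)$p.value
  )
}

set.seed(1)
demo <- rgamma(398, shape = 6, scale = 4)   # simulated stand-in, not the real mpg values
shape <- describe_shape(demo)
str(shape)

# On the full data set (assumption: `cars` is loaded as above):
# describe_shape(cars$mpg)
# qqnorm(cars$mpg); qqline(cars$mpg)   # points near the line suggest normality
```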