Data Preparation Step:
The data we will be using comes from the UCI machine learning repository: https://archive.ics.uci.edu/ml/datasets/auto+mpg
The raw data can be found here: https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
auto <- read.table(url("https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"), header = FALSE)
head(auto)
## V1 V2 V3 V4 V5 V6 V7 V8 V9
## 1 18 8 307 130.0 3504 12.0 70 1 chevrolet chevelle malibu
## 2 15 8 350 165.0 3693 11.5 70 1 buick skylark 320
## 3 18 8 318 150.0 3436 11.0 70 1 plymouth satellite
## 4 16 8 304 150.0 3433 12.0 70 1 amc rebel sst
## 5 17 8 302 140.0 3449 10.5 70 1 ford torino
## 6 15 8 429 198.0 4341 10.0 70 1 ford galaxie 500
The raw file has no header row, so R assigns the placeholder names V1 to V9. Using the provided documentation, we can give each variable its correct name. There are 9 variables, hence 9 columns to rename.
names(auto) <- c("mpg", "cylinders",
"displacement",
" horsepower",
"weight",
"acceleration",
"model year",
"origin",
"car name")
Let's check that the names were assigned correctly.
names(auto)
## [1] "mpg" "cylinders" "displacement" " horsepower"
## [5] "weight" "acceleration" "model year" "origin"
## [9] "car name"
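As an alternative to renaming after the fact, the names can be supplied when the file is read. The following is a minimal sketch, not the approach used in this report: it parses three rows copied from the head() output above through the text argument instead of the URL, and the underscore names (model_year, car_name) are our own choice rather than names from the dataset documentation.

```r
# Rows 1-3 are copied from the head() output above; the layout
# (whitespace-separated fields, quoted car name) is an assumption
# about how the raw file is formatted.
sample_text <- '18 8 307 130.0 3504 12.0 70 1 "chevrolet chevelle malibu"
15 8 350 165.0 3693 11.5 70 1 "buick skylark 320"
18 8 318 150.0 3436 11.0 70 1 "plymouth satellite"'

# Syntactically valid names: no spaces, so nothing is altered later on.
col_names <- c("mpg", "cylinders", "displacement", "horsepower", "weight",
               "acceleration", "model_year", "origin", "car_name")

auto_sample <- read.table(text = sample_text, header = FALSE,
                          col.names = col_names, stringsAsFactors = FALSE)
names(auto_sample)
```

With the real file, the same col.names argument could be passed to the read.table() call that takes the URL.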
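read.table() already returns a data frame, so the next step mainly tidies the column names for our downstream analysis. Passing the object through data.frame() makes every name syntactically valid: spaces become dots, and " horsepower", which was typed with a leading space, becomes "X.horsepower".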
auto.df<-data.frame(auto)
head(auto.df)
## mpg cylinders displacement X.horsepower weight acceleration model.year
## 1 18 8 307 130.0 3504 12.0 70
## 2 15 8 350 165.0 3693 11.5 70
## 3 18 8 318 150.0 3436 11.0 70
## 4 16 8 304 150.0 3433 12.0 70
## 5 17 8 302 140.0 3449 10.5 70
## 6 15 8 429 198.0 4341 10.0 70
## origin car.name
## 1 1 chevrolet chevelle malibu
## 2 1 buick skylark 320
## 3 1 plymouth satellite
## 4 1 amc rebel sst
## 5 1 ford torino
## 6 1 ford galaxie 500
names(auto.df)
## [1] "mpg" "cylinders" "displacement" "X.horsepower"
## [5] "weight" "acceleration" "model.year" "origin"
## [9] "car.name"
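The renaming seen above can be reproduced in isolation. As a small sketch: data.frame() checks names by default (check.names = TRUE), which passes them through make.names(). The second half shows one possible way to clean the names by hand instead; the underscore convention is our own assumption, not something the dataset prescribes.

```r
# A subset of the names exactly as they were assigned, including the
# leading space in " horsepower".
raw_names <- c("mpg", " horsepower", "model year", "car name")

# make.names() prepends "X" when a name starts with an invalid
# character and replaces other invalid characters with ".".
mangled <- make.names(raw_names)
mangled

# Hand-cleaned alternative: trim surrounding whitespace, then replace
# the remaining inner spaces with underscores.
cleaned <- gsub(" ", "_", trimws(raw_names))
cleaned
```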
We should check our data frame for missing values
colSums(is.na(auto.df)|auto.df == '')
## mpg cylinders displacement X.horsepower weight
## 0 0 0 0 0
## acceleration model.year origin car.name
## 0 0 0 0
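The check finds no NA or empty-string values. That does not rule out missing data, though: the dataset documentation reports 6 missing horsepower values, which are coded as "?" in the raw file. This is also why the summary further down shows X.horsepower as a categorical column rather than a numeric one.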
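A check for NA and empty strings cannot see a placeholder such as "?". The sketch below uses a small made-up data frame (the values are hypothetical, not taken from the dataset) to show how such a placeholder could be counted and then converted to a real NA.

```r
# Hypothetical values; "?" stands in for a missing horsepower reading.
df <- data.frame(mpg = c(18, 25, 15),
                 horsepower = c("130.0", "?", "165.0"),
                 stringsAsFactors = TRUE)

# Count placeholder cells in each column.
placeholder_counts <- sapply(df, function(col) sum(as.character(col) == "?"))
placeholder_counts

# Convert through character first: as.numeric() applied directly to a
# factor returns the internal level codes, not the printed values.
hp <- as.character(df$horsepower)
hp[hp == "?"] <- NA
df$horsepower <- as.numeric(hp)

colSums(is.na(df))
```

When reading the file, passing na.strings = "?" to read.table() is another way to get the same result in one step.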
Research Question: Can we predict the miles per gallon of a car from attributes such as cylinders, horsepower, and weight?
Cases: Each case represents the attributes of one vehicle. There are 398 observations in our dataset, with 9 variables.
Data Collection: The data was collected from the StatLib library, which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition. "The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes." (Quinlan, 1993) I do not see any indication that this data is the same as mtcars, as there is no mention of Motor Trend (the source of mtcars).
Type of Study: This is an observational study.
Data Source: The dataset description is at https://archive.ics.uci.edu/ml/datasets/auto+mpg and the raw data is at https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
Response variable: The response variable is mpg, which is a continuous value.
Explanatory: cylinders (discrete), displacement (continuous), horsepower (continuous), weight (continuous), acceleration (continuous), model.year (discrete), origin (discrete), car.name (categorical)
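The variable types listed above can be made explicit in R. The following is only a sketch on hypothetical values: it stores two of the discrete variables as factors and leaves a continuous one numeric. Whether a discrete variable should be modeled as a factor or as a number is a judgment call that this report does not settle.

```r
# Hypothetical rows, for illustration only.
df <- data.frame(cylinders = c(8, 4, 6, 4),
                 origin    = c(1, 3, 2, 1),
                 weight    = c(3504, 2130, 2833, 2372))

# Store the discrete variables as factors; weight stays numeric.
discrete_cols <- c("cylinders", "origin")
df[discrete_cols] <- lapply(df[discrete_cols], factor)

# One class per column.
col_classes <- sapply(df, class)
col_classes
```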
Relevant summary statistics:
General Summary
summary(auto.df)
## mpg cylinders displacement X.horsepower
## Min. : 9.00 Min. :3.000 Min. : 68.0 150.0 : 22
## 1st Qu.:17.50 1st Qu.:4.000 1st Qu.:104.2 90.00 : 20
## Median :23.00 Median :4.000 Median :148.5 88.00 : 19
## Mean :23.51 Mean :5.455 Mean :193.4 110.0 : 18
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:262.0 100.0 : 17
## Max. :46.60 Max. :8.000 Max. :455.0 75.00 : 14
## (Other):288
## weight acceleration model.year origin
## Min. :1613 Min. : 8.00 Min. :70.00 Min. :1.000
## 1st Qu.:2224 1st Qu.:13.82 1st Qu.:73.00 1st Qu.:1.000
## Median :2804 Median :15.50 Median :76.00 Median :1.000
## Mean :2970 Mean :15.57 Mean :76.01 Mean :1.573
## 3rd Qu.:3608 3rd Qu.:17.18 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :5140 Max. :24.80 Max. :82.00 Max. :3.000
##
## car.name
## ford pinto : 6
## amc matador : 5
## ford maverick : 5
## toyota corolla: 5
## amc gremlin : 4
## amc hornet : 4
## (Other) :369
Let's examine how many cars belong to each model year.
table(auto.df$model.year)
##
## 70 71 72 73 74 75 76 77 78 79 80 81 82
## 29 28 28 40 27 30 34 28 36 29 29 29 31
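Let's look at the distribution of our response variable.
x <- auto.df$mpg
# Histogram of mpg; keep the returned object so the bin width can be reused
h <- hist(x, breaks = 10, col = "red", xlab = "Miles Per Gallon",
          main = "Histogram with Normal Curve")
# Normal density evaluated over the observed range of mpg
xfit <- seq(min(x), max(x), length.out = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
# Rescale the density to the count scale: density * bin width * n
yfit <- yfit * diff(h$mids[1:2]) * length(x)
lines(xfit, yfit, col = "blue", lwd = 2)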
The histogram shows a right skew in mpg. Let's dig deeper with a density plot.
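The skew can also be put into a number. The helper below, sample_skewness(), is a hypothetical function written for this sketch and not part of base R. It computes the moment-based skewness, the third central moment divided by the second central moment raised to the power 3/2. Positive values indicate a longer right tail.

```r
# Moment-based sample skewness (hypothetical helper, not a base R function).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]             # drop missing values
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# First six mpg values from the head() output, for illustration only;
# six values say little about the full distribution.
mpg_head <- c(18, 15, 18, 16, 17, 15)
sample_skewness(mpg_head)
```

On the full data the call would be sample_skewness(auto.df$mpg).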
library("ggpubr")
## Warning: package 'ggpubr' was built under R version 3.4.4
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.4.4
## Loading required package: magrittr
ggdensity(auto.df$mpg,
main = "Density plot of MPG Distribution",
xlab = "MPG")
Let's compare the distribution of our response with a normal distribution using a Q-Q plot.
ggqqplot(auto.df$mpg)
Most points seem to fall along the reference line, with the exception of some outliers.
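A formal test can complement the visual check. As a sketch, the Shapiro-Wilk test from base R (shapiro.test() in the stats package) is run here on simulated data, not on the auto-mpg values; the seed and the sample sizes are arbitrary choices made for this example.

```r
set.seed(1)                  # reproducible simulated data
skewed    <- rexp(200)       # right-skewed sample
normalish <- rnorm(200)      # sample drawn from a normal distribution

# The null hypothesis is that the sample comes from a normal
# distribution, so a small p-value is evidence against normality.
p_skewed    <- shapiro.test(skewed)$p.value
p_normalish <- shapiro.test(normalish)$p.value
c(skewed = p_skewed, normalish = p_normalish)
```

For the data in this report the call would be shapiro.test(auto.df$mpg); the test accepts between 3 and 5000 observations, so 398 is within range.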