DATA 606 Project Proposal

Data Preparation Step:

The data we will be using comes from the UCI machine learning repository: https://archive.ics.uci.edu/ml/datasets/auto+mpg

The raw data can be found here: https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data

auto <- read.table(url("https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"), header = FALSE)

head(auto)

##   V1 V2  V3    V4   V5   V6 V7 V8                        V9
## 1 18  8 307 130.0 3504 12.0 70  1 chevrolet chevelle malibu
## 2 15  8 350 165.0 3693 11.5 70  1         buick skylark 320
## 3 18  8 318 150.0 3436 11.0 70  1        plymouth satellite
## 4 16  8 304 150.0 3433 12.0 70  1             amc rebel sst
## 5 17  8 302 140.0 3449 10.5 70  1               ford torino
## 6 15  8 429 198.0 4341 10.0 70  1          ford galaxie 500

The file has no header row, so we need to assign column names ourselves. Using the provided documentation, we can give each of the 9 columns its correct variable name.

names(auto) <- c("mpg", "cylinders", 
                 "displacement", 
                 " horsepower", 
                 "weight", 
                 "acceleration", 
                 "model year",
                 "origin", 
                 "car name")

Let's check that the names changed correctly.

names(auto)

## [1] "mpg"          "cylinders"    "displacement" " horsepower" 
## [5] "weight"       "acceleration" "model year"   "origin"      
## [9] "car name"

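The next step is to pass this through data.frame(). read.table() already returns a data frame, but this call converts the column names into syntactically valid R names that are easier to use in our downstream analysis.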

auto.df <- data.frame(auto)
head(auto.df)

##   mpg cylinders displacement X.horsepower weight acceleration model.year
## 1  18         8          307        130.0   3504         12.0         70
## 2  15         8          350        165.0   3693         11.5         70
## 3  18         8          318        150.0   3436         11.0         70
## 4  16         8          304        150.0   3433         12.0         70
## 5  17         8          302        140.0   3449         10.5         70
## 6  15         8          429        198.0   4341         10.0         70
##   origin                  car.name
## 1      1 chevrolet chevelle malibu
## 2      1         buick skylark 320
## 3      1        plymouth satellite
## 4      1             amc rebel sst
## 5      1               ford torino
## 6      1          ford galaxie 500

names(auto.df)

## [1] "mpg"          "cylinders"    "displacement" "X.horsepower"
## [5] "weight"       "acceleration" "model.year"   "origin"      
## [9] "car.name"
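The name X.horsepower comes from the leading space in " horsepower": data.frame() makes names syntactically valid, so a name starting with a space gets an X prefix and the space becomes a dot. A minimal sketch of one way to avoid this, using a hypothetical one-row stand-in (demo) rather than the real data:

```r
# Hypothetical stand-in with the same problematic names as above
demo <- data.frame(1, 2, 3)
names(demo) <- c(" horsepower", "model year", "car name")

# Trim surrounding whitespace first, then build syntactically valid names
clean_names <- make.names(trimws(names(demo)))
names(demo) <- clean_names
names(demo)
```

Trimming before make.names() keeps the plain name horsepower, while the internal spaces in the other two names still become dots.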

We should check our data frame for missing values.

colSums(is.na(auto.df)|auto.df == '')

##          mpg    cylinders displacement X.horsepower       weight 
##            0            0            0            0            0 
## acceleration   model.year       origin     car.name 
##            0            0            0            0

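The check finds no NA values or empty strings. This does not rule out missing data, however: the summary further down shows horsepower stored as a factor rather than a number, and the UCI documentation lists 6 missing horsepower values, which are coded as "?" in the raw file and are not caught by this check.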
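A check for NA and empty strings misses other placeholder codes. The following is a minimal sketch of how such a placeholder could be detected and converted to NA on import. It assumes a "?" placeholder in the fourth column and uses three illustrative rows written in the file's format (sample_lines is a stand-in, not the full download):

```r
# Illustrative rows in the raw file's layout; "?" marks a missing horsepower
sample_lines <- c(
  '18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"',
  '25.0 4 98.00 ?     2046. 19.0 71 1 "ford pinto"',
  '21.0 6 200.0 ?     2875. 17.0 74 1 "ford maverick"'
)

# Read as-is: the "?" entries force column 4 to be read as text
raw <- read.table(text = sample_lines, header = FALSE,
                  stringsAsFactors = FALSE)
class(raw$V4)
sum(raw$V4 == "?")   # count placeholder entries

# Read again, declaring "?" as a missing-value code
clean <- read.table(text = sample_lines, header = FALSE,
                    na.strings = "?", stringsAsFactors = FALSE)
colSums(is.na(clean))
```

With na.strings = "?" the column is read as numeric and the placeholders show up in the is.na() count.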

Research Question: Can we predict a car's miles per gallon from attributes such as cylinders, horsepower, and weight?

Cases: Each case represents the attributes of one vehicle model. There are 398 observations in our dataset, with 9 variables.

Data Collection: The data was collected from the StatLib library, which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition. "The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes." (Quinlan, 1993) I do not see any indication that this data is the same as mtcars, as there is no mention of Motor Trend (the source of mtcars).

Type of Study: This is an observational study.

Data Source:
Description: https://archive.ics.uci.edu/ml/datasets/auto+mpg
Raw data: https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data

Response variable: The response variable is mpg, which is a continuous value.

Explanatory variables:
cylinders - discrete
displacement - continuous
horsepower - continuous
weight - continuous
acceleration - continuous
model.year - discrete
origin - discrete
car.name - categorical
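After import, the discrete variables are stored as plain numbers. A minimal sketch, assuming one wants R to treat origin as a categorical variable in later analysis; the three-row data frame demo is a hypothetical stand-in for auto.df:

```r
# Hypothetical three-row stand-in for auto.df
demo <- data.frame(mpg = c(18, 26, 31),
                   cylinders = c(8L, 4L, 4L),
                   origin = c(1L, 2L, 3L))

# Treat origin as categorical: its codes are labels, not quantities
demo$origin <- factor(demo$origin)

str(demo)
levels(demo$origin)
```

Whether cylinders and model.year should also become factors depends on how they are used later, so they are left numeric here.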

Relevant summary statistics:

General Summary

summary(auto.df)

##       mpg          cylinders      displacement    X.horsepower
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   150.0  : 22  
##  1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.2   90.00  : 20  
##  Median :23.00   Median :4.000   Median :148.5   88.00  : 19  
##  Mean   :23.51   Mean   :5.455   Mean   :193.4   110.0  : 18  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0   100.0  : 17  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   75.00  : 14  
##                                                  (Other):288  
##      weight      acceleration     model.year        origin     
##  Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000  
##  1st Qu.:2224   1st Qu.:13.82   1st Qu.:73.00   1st Qu.:1.000  
##  Median :2804   Median :15.50   Median :76.00   Median :1.000  
##  Mean   :2970   Mean   :15.57   Mean   :76.01   Mean   :1.573  
##  3rd Qu.:3608   3rd Qu.:17.18   3rd Qu.:79.00   3rd Qu.:2.000  
##  Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000  
##                                                                
##            car.name  
##  ford pinto    :  6  
##  amc matador   :  5  
##  ford maverick :  5  
##  toyota corolla:  5  
##  amc gremlin   :  4  
##  amc hornet    :  4  
##  (Other)       :369

Let's examine how many cars belong to each model year.

table(auto.df$model.year)

## 
## 70 71 72 73 74 75 76 77 78 79 80 81 82 
## 29 28 28 40 27 30 34 28 36 29 29 29 31

Let's look at the distribution of our response variable.

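x <- auto.df$mpg
# Histogram of mpg; keep the returned object to read off the bin midpoints
h <- hist(x, breaks = 10, col = "red", xlab = "Miles Per Gallon",
          main = "Histogram with Normal Curve")
# Normal density with the sample mean and sd, evaluated over the range of x
xfit <- seq(min(x), max(x), length.out = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
# Rescale the density to counts: density * bin width * number of observations
yfit <- yfit * diff(h$mids[1:2]) * length(x)
lines(xfit, yfit, col = "blue", lwd = 2)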

There is a right skew in our data. Let's dig deeper with further visualization.
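The skew read off the histogram can also be put into a number. A minimal sketch of a moment-based sample skewness, where positive values indicate a right skew; the helper skewness() is defined here for illustration (it is not a base R function), and mtcars$mpg is only a built-in stand-in for auto.df$mpg:

```r
# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 1.5
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Stand-in data; in the proposal this would be auto.df$mpg
skewness(mtcars$mpg)
```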

library("ggpubr")

## Warning: package 'ggpubr' was built under R version 3.4.4

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.4.4

## Loading required package: magrittr

ggdensity(auto.df$mpg, 
          main = "Density plot of MPG Distribution",
          xlab = "MPG")

Let's compare our response with a normal distribution using a Q-Q plot.

ggqqplot(auto.df$mpg)

Most points fall along the reference line, with the exception of some outliers.
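A Q-Q plot is a visual check, and a formal test can complement it. A minimal sketch using the Shapiro-Wilk test from base R's stats package; this test is not part of the proposal itself, and x is a deterministic right-skewed stand-in for auto.df$mpg:

```r
# Deterministic right-skewed stand-in sample (exponential quantiles);
# in the proposal this would be auto.df$mpg
x <- qexp(ppoints(200))

# Shapiro-Wilk test: the null hypothesis is that x is normally distributed
result <- shapiro.test(x)
result$statistic
result$p.value
```

A small p-value is evidence against normality, which for the stand-in sample is expected given how it was constructed.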

Vinicio Haro

3/28/2018