INFO659A1

Lara Lechtenberg

1. Business problem & objectives:

Business problem: It’s the fuel crisis of 1974, and car sales have dropped tremendously. A federal stimulus package for the auto industry will target fuel-efficient cars rated at 30+ mpg combined, offsetting their cost with a grant for eligible US customers.

Objectives: A major nationwide consumer advocacy group wants to create a report for American auto-buyers, describing the fuel-efficient vehicles that would qualify for the stimulus and the auto features that correlate with fuel efficiency.

2. Understand Data:

Concept of learning:
We want our analytic model to determine which features are correlated with increased fuel efficiency in the current options for autos in the US.

Data Attributes

  1. mpg: Miles/(US) gallon (numeric)

  2. cyl: Number of cylinders (numeric)

  3. wt: Weight (1000 lbs) (numeric)

  4. am: Transmission (0 = automatic, 1 = manual) (numeric, binary indicator)

The attributes useful for the concept of learning are mpg (the dependent variable) and, as candidate predictors, the number of cylinders (cyl), horsepower (hp), weight (wt), transmission type (am), and displacement (disp).

Data Instances
This dataset includes 32 instances of 11 variables, as shown by the Global Environment in R.
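
The instance and variable counts can also be confirmed from the console rather than from the Global Environment pane. A minimal sketch, assuming the built-in mtcars data set that ships with base R:

```r
# Load the built-in data set (part of base R's datasets package)
data("mtcars")

# Rows = instances (cars), columns = variables
dims <- dim(mtcars)
print(dims)

# Type and first few values of the attributes listed above
str(mtcars[, c("mpg", "cyl", "wt", "am")])
```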

3. Data in Action: Data preparation, visualization and exploration

3A: Data preparation and handling:
Insert an R statement to load the data into a variable with read.table or read.csv:

data("mtcars")

Double-check that the data has been properly loaded into the variable by showing the first 5 rows with the head() function:
head(mtcars, 5)

3.B. Data distribution and anomalies
For each of the numeric variables (at least three) relevant to the concept, produce its histogram (distribution) using the hist() function.

Histogram of mpg:
hist(mtcars$mpg, breaks=35, xlab="Miles per Gallon (mpg)", main="Miles per Gallon")
[Figure: MPG]

The range makes sense: there are a couple of gas guzzlers at the low end, around 10 mpg, up through four cars that exceed 30 mpg.
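
Since the stimulus targets cars at 30+ mpg, the qualifying cars can be listed directly. This is a sketch only: it assumes the mpg column in mtcars can stand in for the "combined" rating, and the variable names threshold and qualifying are introduced here for illustration.

```r
data("mtcars")

# Assumption: mtcars$mpg stands in for the "combined" mpg rating
threshold <- 30

# Cars at or above the threshold, sorted from most to least efficient
qualifying <- mtcars[mtcars$mpg >= threshold, c("mpg", "cyl", "wt", "am")]
qualifying <- qualifying[order(-qualifying$mpg), ]
print(qualifying)

# Overall range of mpg in the data
print(range(mtcars$mpg))
```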

Histogram of cylinders:
hist(mtcars$cyl, breaks=8, xlab="Cylinders", main="Number of Cylinders")

[Figure: Cylinders]

The cylinder counts of 4, 6, and 8 make sense, as these are the numbers of cylinders most commonly found in cars.
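
Because cyl takes only three distinct values, a plain frequency table may be easier to read than a histogram. A minimal sketch using base R's table(); the name cyl_counts is illustrative:

```r
data("mtcars")

# Count how many cars fall into each cylinder category
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)
```

Passing the result to barplot(cyl_counts) would give one bar per category, which avoids the empty bins a histogram leaves between 4, 6, and 8.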

Histogram of weight:
hist(mtcars$wt, breaks=10, xlab="Weight x 1000lbs", main="Weight")

[Figure: Weight]

The range of weights displayed here makes sense: from a low of ~1,500 lbs to a high of ~5,500 lbs.
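
The ranges read off the histograms can be cross-checked numerically. A small sketch, assuming the same three columns; the names ranges and wt_lbs are illustrative:

```r
data("mtcars")

# Minimum and maximum of each plotted variable
ranges <- sapply(mtcars[, c("mpg", "cyl", "wt")], range)
rownames(ranges) <- c("min", "max")
print(ranges)

# Weight is stored in units of 1000 lbs; convert the range to pounds
wt_lbs <- range(mtcars$wt) * 1000
print(wt_lbs)
```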

3.C. Data distribution with log transformation
Of the three variables, MPG appears to have the most skewed distribution.
hist(log10(mtcars$mpg), breaks=5, xlab="log10(mpg)", main="log MPG")
[Figure: log MPG]
After the log transformation, the distribution of MPG looks more normal and less skewed.
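
The visual impression can be backed up with a number. Base R has no built-in skewness function, so the sketch below defines a hypothetical helper, skewness(), using the moment-based definition (values near 0 indicate a symmetric distribution). It is one of several common skewness estimators, not necessarily the one a given package would use.

```r
data("mtcars")

# Hypothetical helper: moment-based sample skewness (0 = symmetric)
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Compare skewness before and after the log10 transformation
skew_raw <- skewness(mtcars$mpg)
skew_log <- skewness(log10(mtcars$mpg))
print(c(raw = skew_raw, log10 = skew_log))
```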

3.D. Examining multiple variables and regression
  • Predictor variable (x): wt
  • Variable to be predicted (y): mpg
  • Scatter plot:
    plot(mtcars$wt, mtcars$mpg, xlab="Weight x1000lbs", ylab="MPG")
    [Figure: Scatter Plot]

  • Linear regression:
    myline <- lm(mtcars$mpg ~ mtcars$wt)

  • Add regression to scatter plot:
    points(mtcars$wt, myline$coefficients[1] + myline$coefficients[2] * mtcars$wt, type="l", col="red")
    [Figure: Scatter Plot with Regression Line]

Yes, this linear regression does appear to capture the relationship between the two variables. As weight increases to the right of the x-axis, fuel efficiency decreases and falls towards the bottom of the y-axis. There certainly appears to be a strong negative correlation between vehicle weight and mpg.
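
The strength of the relationship can also be quantified rather than judged from the plot alone. A minimal sketch that refits the same model with the data= argument and reports the slope, the Pearson correlation, and R-squared; the names fit, r, and r_squared are illustrative:

```r
data("mtcars")

# Same model as above, written with the data= argument
fit <- lm(mpg ~ wt, data = mtcars)

# Intercept and slope (change in mpg per additional 1000 lbs)
print(coef(fit))

# Pearson correlation and share of variance explained by weight
r <- cor(mtcars$wt, mtcars$mpg)
r_squared <- summary(fit)$r.squared
print(c(r = r, r_squared = r_squared))
```

With the scatter plot still open, abline(fit, col = "red") draws the same regression line without computing the fitted values by hand.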

4. Discussion, understanding, and planning

I think the model would need a larger data set to learn which characteristics are most strongly associated with fuel-efficient cars. Thirty-two cars are not enough; a sample covering every car manufactured for sale in the US would help. We would want to use not just weight, but also number of cylinders, displacement, horsepower, and other variables to see what is correlated with increased fuel efficiency. Though the data set is small, the data seemed clean, and the ranges of the variables used made sense. This exploratory analysis makes it clear that weight and mpg, at least, have a strong association, and other variables should be added to the analysis for further exploration.
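
As a possible next step, the other variables named above could be screened in the same way. The sketch below is exploratory only: it assumes the four predictors wt, cyl, disp, and hp, and with 32 cars and strongly related predictors the multiple regression should be read with caution.

```r
data("mtcars")

# Candidate predictors named in the discussion
predictors <- c("wt", "cyl", "disp", "hp")

# Pearson correlation of each predictor with mpg
cors <- sapply(mtcars[, predictors], function(x) cor(x, mtcars$mpg))
print(round(cors, 3))

# Sketch of a multiple regression using all four predictors
multi_fit <- lm(mpg ~ wt + cyl + disp + hp, data = mtcars)
print(summary(multi_fit)$r.squared)
```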