Dataset Exploration

The mpg dataset provides fuel economy data from 1999 and 2008 for 38 popular models of cars.

library(ggplot2) # provides the mpg dataset
library(dplyr)   # provides glimpse()
cat('number of rows:', nrow(mpg), #234
    '\nnumber of columns:', ncol(mpg)) #11
## number of rows: 234 
## number of columns: 11
glimpse(mpg) # shows the row count, column count, and a preview of every column
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…
summary(mpg) # summary statistics for every column
##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00

Description of columns

Here is a description of the dataset from RPubs:

| Column Name | Variable Type | Variable Description |
|---|---|---|
| manufacturer | categorical nominal | manufacturer name |
| model | categorical nominal | model name |
| displ | numeric continuous | engine displacement in liters |
| year | categorical ordinal | year of manufacture |
| cyl | categorical ordinal | number of cylinders |
| trans | categorical nominal | type of transmission |
| drv | categorical nominal | drive type (f = front-wheel, r = rear-wheel, 4 = four-wheel) |
| cty | numeric continuous | city mileage (miles per gallon) |
| hwy | numeric continuous | highway mileage (miles per gallon) |
| fl | categorical nominal | fuel type |
| class | categorical nominal | vehicle class |
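
The variable types listed here can be checked against how R actually stores each column. The following is a minimal sketch, assuming the ggplot2 package is installed (it ships the mpg data); the names col_types and cyl_ordered are illustrative and not part of the original analysis.

```r
library(ggplot2)  # assumption: ggplot2 is installed; it provides mpg

# Storage class of every column (character, numeric, or integer)
col_types <- sapply(mpg, class)
print(col_types)

# cyl is stored as an integer; an ordered factor makes its
# "categorical ordinal" reading explicit
cyl_ordered <- factor(mpg$cyl, ordered = TRUE)
print(levels(cyl_ordered))
```

Note that the storage class and the statistical type are separate things: a column stored as an integer can still be treated as a category in a plot.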

Question 1 (relationship between cylinders and highway fuel efficiency)

Investigate the relationship between the number of cylinders (cyl) and highway fuel efficiency. Look at the variables, and decide which type of plot (scatterplot, line plot, boxplot, or bar chart) best summarizes their relationship. Comment on that relationship. HINT: you may need to use the as.factor(cyl) syntax in the graph (as we did for year above).

Answer

A boxplot is the best graph for showing the relationship between the number of cylinders (cyl) and highway fuel efficiency (hwy), because cyl is a categorical ordinal (or, equivalently, numeric discrete) x-variable while hwy is a continuous y-variable.

ggplot(data=mpg)+
  geom_boxplot(mapping=aes(x=as.factor(cyl),y=hwy))+coord_flip()

The boxplot shows that cars with fewer cylinders have higher highway fuel efficiency: the median for cars with 4 and 5 cylinders is approximately 28 miles per gallon, well above the medians for cars with 6 and 8 cylinders, which are approximately 23 and 17 miles per gallon respectively.
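
The medians read off the boxplot can also be computed directly, which avoids estimating them by eye. This is a small sketch, assuming ggplot2 is installed so that mpg is available; hwy_medians is an illustrative name.

```r
library(ggplot2)  # assumption: ggplot2 is installed; it provides mpg

# Median highway mpg within each cylinder group
hwy_medians <- aggregate(hwy ~ cyl, data = mpg, FUN = median)
print(hwy_medians)
```

The result has one row per cylinder count, so it can be compared line by line with the boxes in the plot.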

I determined that cyl is a categorical ordinal variable (or a numeric discrete variable, depending on how you look at it) because the number of cylinders can only be a whole number. Also, when I ran mpg$cyl in R, the output contained only four distinct values, repeated many times: 4, 5, 6, and 8. Because every car falls into one of these four groups, I have treated cyl as a categorical/discrete variable.

I determined that hwy is a numeric continuous variable because highway miles per gallon can in principle take any decimal value, so there are infinitely many possible values, even though the dataset records it as whole numbers. I also ran mpg$hwy in R to check this, and hwy takes many more distinct values than cyl, which is limited to four. These observations led me to treat hwy as a continuous variable.

Additional Work
mpg$cyl #Displayed all values of cyl and hwy to understand data better
##   [1] 4 4 4 4 6 6 6 4 4 4 4 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 4 4 6 6 6
##  [38] 4 6 6 6 6 6 6 6 6 6 6 6 6 6 6 8 8 8 8 8 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
##  [75] 8 8 8 6 6 6 6 8 8 6 6 8 8 8 8 8 6 6 6 6 8 8 8 8 8 4 4 4 4 4 4 4 4 4 4 4 4
## [112] 4 6 6 6 4 4 4 4 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 6 6 8 8 4 4 4 4 6 6 6
## [149] 6 6 6 6 6 8 6 6 6 6 8 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 6 6 6 8 4 4 4 4 6 6
## [186] 6 4 4 4 4 6 6 6 4 4 4 4 4 8 8 4 4 4 6 6 6 6 4 4 4 4 6 4 4 4 4 4 5 5 6 6 4
## [223] 4 4 4 5 5 4 4 4 4 6 6 6
mpg$hwy
##   [1] 29 29 31 30 26 26 27 26 25 28 27 25 25 25 25 24 25 23 20 15 20 17 17 26 23
##  [26] 26 25 24 19 14 15 17 27 30 26 29 26 24 24 22 22 24 24 17 22 21 23 23 19 18
##  [51] 17 17 19 19 12 17 15 17 17 12 17 16 18 15 16 12 17 17 16 12 15 16 17 15 17
##  [76] 17 18 17 19 17 19 19 17 17 17 16 16 17 15 17 26 25 26 24 21 22 23 22 20 33
## [101] 32 32 29 32 34 36 36 29 26 27 30 31 26 26 28 26 29 28 27 24 24 24 22 19 20
## [126] 17 12 19 18 14 15 18 18 15 17 16 18 17 19 19 17 29 27 31 32 27 26 26 25 25
## [151] 17 17 20 18 26 26 27 28 25 25 24 27 25 26 23 26 26 26 26 25 27 25 27 20 20
## [176] 19 17 20 17 29 27 31 31 26 26 28 27 29 31 31 26 26 27 30 33 35 37 35 15 18
## [201] 20 20 22 17 19 18 20 29 26 29 29 24 44 29 26 29 29 29 29 23 24 44 41 29 26
## [226] 28 29 29 29 28 29 26 26 26
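
A more compact alternative to printing every value is to count the distinct values in each column. This is a sketch only, assuming ggplot2 is installed; cyl_counts, n_cyl, and n_hwy are illustrative names.

```r
library(ggplot2)  # assumption: ggplot2 is installed; it provides mpg

# How many cars fall into each cylinder group
cyl_counts <- table(mpg$cyl)
print(cyl_counts)

# Number of distinct values in each variable
n_cyl <- length(unique(mpg$cyl))
n_hwy <- length(unique(mpg$hwy))
cat("distinct cyl values:", n_cyl, "\ndistinct hwy values:", n_hwy, "\n")
```

A small number of distinct values supports treating a variable as categorical, while a larger spread is more consistent with a continuous measurement.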

Question 2 (coord_fixed() and geom_abline())

In section 3.9.1 of the textbook, solve problem #4 (on the relationship between city and highway fuel efficiency). What substantive conclusions can you draw about the relationship between these variables?

What does the scatter plot below tell you about the relationship between city and highway mpg?

The plot shows that the two variables have a positive, roughly linear correlation: as city mpg increases, highway mpg increases as well. Highway mpg is greater than city mpg because there are fewer stops on a highway, whereas in a city fuel is spent on frequent stops at traffic lights and in rush-hour traffic.

Why is coord_fixed() important?

coord_fixed() is important because it fixes the aspect ratio of the scatterplot in problem #4. With the default ratio of 1, one unit on the x-axis is drawn at the same length as one unit on the y-axis, no matter the size of the output window. Because cty and hwy are both measured in miles per gallon, this places the reference line at a 45-degree angle, which makes it easy to compare each point against the line and draw conclusions.
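
The effect of coord_fixed() is easiest to see by drawing the same scatterplot with and without it. This is a minimal sketch, assuming ggplot2 is installed; base_plot, p_free, and p_fixed are illustrative names, and ratio = 1 is simply the default written out.

```r
library(ggplot2)  # assumption: ggplot2 is installed; it provides mpg

base_plot <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

# Default coordinates: the panel stretches to fill the output window
p_free <- base_plot

# Fixed coordinates: ratio = 1 draws one mpg on the y-axis at the same
# length as one mpg on the x-axis
p_fixed <- base_plot + coord_fixed(ratio = 1)

print(p_free)
print(p_fixed)
```

Resizing the plotting window changes the shape of the first plot but not the proportions of the second.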

What does geom_abline() do?

geom_abline() draws a straight line from a slope and an intercept; with no arguments it uses a slope of 1 and an intercept of 0. It is different from the regression line, which is used to show whether two variables are correlated. Because this line has a slope of 1 and passes through the origin, it represents the scenario where city mpg and highway mpg are equal: for example, if city mpg is 20, then highway mpg on the line is 20 as well.
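
To make the equality line explicit, the slope and intercept can be written out instead of relying on the defaults. This is a sketch, assuming ggplot2 is installed; p_equal is an illustrative name, and the dashed line type is a styling choice that is not in the original plot.

```r
library(ggplot2)  # assumption: ggplot2 is installed; it provides mpg

p_equal <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  # Same line as geom_abline() with no arguments, spelled out
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  coord_fixed()

print(p_equal)
```

Any point above the dashed line is a car whose highway mpg is higher than its city mpg.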

The smoothed trend line drawn by geom_smooth() lies consistently above the equality line, showing that highway mpg is greater than city mpg across the observed range. On average, cars get about 5–10 mpg more on highways than in cities (the means in the summary output above, 23.44 for hwy and 16.86 for cty, differ by about 6.6 mpg). This makes sense, as cars generally achieve better fuel efficiency at steady highway speeds than in stop-and-go city driving.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed() +
  geom_smooth()   # added to draw a smoothed trend line (loess by default here) and show how the two variables move together
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
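
The size of the highway-versus-city gap can also be checked numerically rather than read from the plot. This is a small sketch, assuming ggplot2 is installed; gap, mean_gap, and share_above are illustrative names.

```r
library(ggplot2)  # assumption: ggplot2 is installed; it provides mpg

# Per-car difference between highway and city mpg
gap <- mpg$hwy - mpg$cty

mean_gap <- mean(gap)
cat("mean highway-minus-city gap:", round(mean_gap, 2), "mpg\n")

# Share of cars that sit strictly above the equality line
share_above <- mean(gap > 0)
cat("share of cars with hwy > cty:", share_above, "\n")
```

The mean gap summarizes the typical difference, while the share above the line shows how consistently highway mpg exceeds city mpg.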

Question 3 (car displacement and highway mpg)

Look at how the type of drivetrain influences fuel economy (drv). For a given engine size (displ), in general, do four-wheel drive, front wheel drive, or rear wheel drive engines have the highest fuel economy?

Answer

Front-wheel-drive cars have the highest fuel economy: the scatterplot in figure 5 shows that the majority of front-wheel-drive cars (green points) are in the upper left-hand corner. Front-wheel-drive cars also tend to have small engines, which shows that cars with small engines have greater fuel efficiency.

Some four-wheel-drive cars (red points) do mix with the green points, but the majority of the red points are in the lower right-hand corner of the graph. This means that a few four-wheel-drive cars have high fuel economy, but most do not.

ggplot(data=mpg, aes(x=displ,y=hwy,color=drv))+
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
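
Because the question asks about fuel economy for a given engine size, one way to check the visual impression is to compare drivetrains within bins of similar displacement. This is a rough sketch, assuming ggplot2 is installed; the bin boundaries are an arbitrary choice made for illustration, and cars, displ_bin, and hwy_by_bin are hypothetical names.

```r
library(ggplot2)  # assumption: ggplot2 is installed; it provides mpg

cars <- mpg  # work on a copy so mpg itself is left untouched

# Group engine sizes into bins (cut points are an arbitrary choice)
cars$displ_bin <- cut(cars$displ, breaks = c(1, 2, 3, 4, 5, 7))

# Median highway mpg for each drivetrain within each engine-size bin
hwy_by_bin <- aggregate(hwy ~ displ_bin + drv, data = cars, FUN = median)
print(hwy_by_bin[order(hwy_by_bin$displ_bin, hwy_by_bin$drv), ])
```

Comparing rows that share a displ_bin separates the effect of the drivetrain from the effect of engine size. Bins containing only a few cars should be read with caution.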