The mpg dataset provides fuel economy data from 1999 and
2008 for 38 popular models of cars.
cat('number of rows:',nrow(mpg), #234
'\nnumber of columns:',ncol(mpg)) #11
## number of rows: 234
## number of columns: 11
glimpse(mpg) # shows rows, columns, and a summary of all columns
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
summary(mpg) # summary statistics for each column
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
Here is a description of the dataset from RPubs:
| Column Name | Variable Type | Variable Description |
|---|---|---|
| manufacturer | categorical nominal | manufacturer name |
| model | categorical nominal | model name |
| displ | numeric continuous | engine displacement in liters |
| year | categorical ordinal | year of manufacturing |
| cyl | categorical ordinal | number of cylinders |
| trans | categorical nominal | type of transmission |
| drv | categorical nominal | drive type |
| cty | numeric continuous | city fuel economy in miles per gallon |
| hwy | numeric continuous | highway fuel economy in miles per gallon |
| fl | categorical nominal | fuel type |
| class | categorical nominal | vehicle class |
Investigate the relationship between the number of cylinders
(cyl) and highway fuel efficiency. Look at the variables,
and decide which type of plot (scatterplot, line plot, boxplot, or bar
chart) best summarizes their relationship. Comment on that relationship.
HINT: you may need to use the as.factor(cyl) syntax in the
graph (as we did for year above).
A boxplot is the best graph for displaying the relationship between the number of cylinders (cyl) and highway fuel efficiency (hwy), because cyl is a categorical ordinal x-variable (it can also be seen as discrete numeric) while hwy is a continuous y-variable.
ggplot(data=mpg)+
geom_boxplot(mapping=aes(x=as.factor(cyl),y=hwy))+coord_flip()
From the boxplot, you can see that cars with fewer cylinders have higher highway fuel efficiency: the median for cars with 4 and 5 cylinders is approximately 28 miles per gallon, well above the median mileage for cars with 6 and 8 cylinders (approximately 23 and 17 miles per gallon, respectively).
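The medians read off the boxplot can also be checked numerically. The sketch below assumes ggplot2 is installed (it supplies the mpg data) and uses base R's aggregate(); the name medians_by_cyl is introduced here only for illustration.

```r
library(ggplot2)  # provides the mpg dataset

# Median highway mpg for each cylinder count (one row per value of cyl)
medians_by_cyl <- aggregate(hwy ~ cyl, data = mpg, FUN = median)
print(medians_by_cyl)
```

Exact medians may differ slightly from values estimated by eye from the plot.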
I classified cyl as categorical ordinal (or discrete numeric, depending on how you look at it) because the number of cylinders can only be a whole number and cannot include decimals. Also, when I entered the command mpg$cyl in R, the output contained only four distinct values, each repeated many times: 4, 5, 6, and 8. Because the cars fall into just four groups, I have said that cyl is a categorical/discrete variable.
I classified hwy as numeric continuous because highway miles per gallon can, in principle, take any decimal value. I also typed the command mpg$hwy in R to check my initial observations: although the values are stored as whole numbers, hwy takes many different values rather than the handful of repeated values seen for cyl. These observations led me to say that hwy is a continuous variable.
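Printing every value works, but counting distinct values makes the same point more compactly. This is a sketch using base R's table() and unique() on the ggplot2 mpg data; cyl_counts, n_cyl_values, and n_hwy_values are illustrative names.

```r
library(ggplot2)  # provides the mpg dataset

# How many cars have each cylinder count?
cyl_counts <- table(mpg$cyl)
print(cyl_counts)

# How many distinct values do cyl and hwy take?
n_cyl_values <- length(unique(mpg$cyl))
n_hwy_values <- length(unique(mpg$hwy))
cat("distinct cyl values:", n_cyl_values,
    "\ndistinct hwy values:", n_hwy_values, "\n")
```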
Additional Work
mpg$cyl # displayed all values of cyl and hwy to understand the data better
## [1] 4 4 4 4 6 6 6 4 4 4 4 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 4 4 6 6 6
## [38] 4 6 6 6 6 6 6 6 6 6 6 6 6 6 6 8 8 8 8 8 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [75] 8 8 8 6 6 6 6 8 8 6 6 8 8 8 8 8 6 6 6 6 8 8 8 8 8 4 4 4 4 4 4 4 4 4 4 4 4
## [112] 4 6 6 6 4 4 4 4 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 6 6 8 8 4 4 4 4 6 6 6
## [149] 6 6 6 6 6 8 6 6 6 6 8 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 6 6 6 8 4 4 4 4 6 6
## [186] 6 4 4 4 4 6 6 6 4 4 4 4 4 8 8 4 4 4 6 6 6 6 4 4 4 4 6 4 4 4 4 4 5 5 6 6 4
## [223] 4 4 4 5 5 4 4 4 4 6 6 6
mpg$hwy
## [1] 29 29 31 30 26 26 27 26 25 28 27 25 25 25 25 24 25 23 20 15 20 17 17 26 23
## [26] 26 25 24 19 14 15 17 27 30 26 29 26 24 24 22 22 24 24 17 22 21 23 23 19 18
## [51] 17 17 19 19 12 17 15 17 17 12 17 16 18 15 16 12 17 17 16 12 15 16 17 15 17
## [76] 17 18 17 19 17 19 19 17 17 17 16 16 17 15 17 26 25 26 24 21 22 23 22 20 33
## [101] 32 32 29 32 34 36 36 29 26 27 30 31 26 26 28 26 29 28 27 24 24 24 22 19 20
## [126] 17 12 19 18 14 15 18 18 15 17 16 18 17 19 19 17 29 27 31 32 27 26 26 25 25
## [151] 17 17 20 18 26 26 27 28 25 25 24 27 25 26 23 26 26 26 26 25 27 25 27 20 20
## [176] 19 17 20 17 29 27 31 31 26 26 28 27 29 31 31 26 26 27 30 33 35 37 35 15 18
## [201] 20 20 22 17 19 18 20 29 26 29 29 24 44 29 26 29 29 29 29 23 24 44 41 29 26
## [226] 28 29 29 29 28 29 26 26 26
coord_fixed() and geom_abline()
In section 3.9.1 of the textbook, solve problem #4 (on the relationship between city and highway fuel efficiency). What substantive conclusions can you draw about the relationship between these variables?
What does the scatter plot below tell you about the relationship between city and highway mpg?
The plot tells us that the two variables have a positive linear correlation: as city mpg increases, highway mpg increases as well. Highway mpg is greater than city mpg because there are fewer stops on a highway, whereas in a city a car burns fuel on frequent stops at traffic lights and in rush-hour traffic.
Why is coord_fixed() important?
coord_fixed() is important because it fixes the aspect ratio of the
scatterplot shown in problem #4. With the default ratio of 1, one unit
on the x-axis is drawn the same length as one unit on the y-axis, no
matter the size of the output window. In the example problem, this
places the reference line at a 45-degree angle, so it is easy to see
how far each point lies from the line and to draw conclusions.
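To see the effect, the same plot can be built with and without coord_fixed(). This is a minimal sketch; base_plot, p_free, and p_fixed are names made up for the comparison, and ratio = 1 is spelled out even though it is the default.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline()

# Default coordinates: the panel stretches to fill the plotting window,
# so the slope-1 line is usually not drawn at 45 degrees.
p_free <- base_plot

# Fixed coordinates: one unit of cty is drawn as long as one unit of hwy,
# so the slope-1 line sits at 45 degrees.
p_fixed <- base_plot + coord_fixed(ratio = 1)

print(p_free)
print(p_fixed)
```

Resizing the plot window changes the angle of the line in p_free but not in p_fixed.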
What does geom_abline() do?
geom_abline() draws a straight line with a given slope and intercept;
called with no arguments, as here, it uses a slope of 1 and an intercept
of 0. It is different from the regression line, which is used to show
whether two variables are correlated. Because the line has a slope of 1
and passes through the origin, it represents the scenario where city mpg
and highway mpg are the same: if city mpg is 1, then highway mpg is 1 as
well.
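Writing the defaults out explicitly makes it clearer what the line represents. The sketch below passes slope = 1 and intercept = 0 by hand, which draws the same line as a bare geom_abline() call; the dashed linetype and the name p_equal are additions made here for illustration.

```r
library(ggplot2)

# geom_abline(slope = 1, intercept = 0) is the line hwy = cty:
# points above it have higher highway mpg than city mpg.
p_equal <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  coord_fixed()

print(p_equal)
```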
The smoothed trend line lies consistently above the equality line, showing that highway mpg exceeds city mpg across the observed range. On average, cars get about 6.6 mpg more on highways than in cities (a mean hwy of 23.44 against a mean cty of 16.86 in the summary output above). This makes sense, as cars generally achieve better fuel efficiency at steady highway speeds than in stop-and-go city driving.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed() +
geom_smooth() # I added geom_smooth() to plot a smoothed trend line (loess by default here) and see how the two variables are related
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
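The size of the gap between highway and city mpg can also be computed directly rather than read off the plot. This is a sketch; gap is a helper variable introduced here.

```r
library(ggplot2)  # provides the mpg dataset

# Highway minus city mpg for every car
gap <- mpg$hwy - mpg$cty

print(mean(gap))      # average advantage of highway over city driving
print(range(gap))     # smallest and largest difference in the data
print(mean(gap > 0))  # share of cars with higher highway than city mpg
```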
Look at how the type of drivetrain influences fuel economy
(drv). For a given engine size (displ), in
general, do four-wheel drive, front wheel drive, or rear wheel drive
engines have the highest fuel economy?
Front-wheel-drive cars have the highest fuel economy: the scatterplot in figure 5 shows that the majority of the front-wheel-drive cars (green points in graph) are in the upper left-hand corner. Front-wheel-drive cars also tend to have small engines, which reflects the fact that cars with a small engine size have greater fuel efficiency.
Some four-wheel-drive cars (red points in graph) do mix with the green points, but the majority of the red points are in the lower right-hand corner of the graph. This means that some four-wheel-drive cars have high fuel economy, but most do not.
ggplot(data=mpg, aes(x=displ,y=hwy,color=drv))+
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
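Because drivetrain and engine size are related, one way to compare drivetrains "for a given engine size" is to hold displ constant in a simple linear model. This is only a sketch, under the assumption that a straight-line effect of displ is adequate; fit is an illustrative name, and the drv coefficients are differences relative to four-wheel drive, the reference level.

```r
library(ggplot2)  # provides the mpg dataset

# hwy as a function of engine size plus a separate offset per drivetrain.
# lm() treats the character column drv as a factor; its first level, "4",
# becomes the baseline.
fit <- lm(hwy ~ displ + drv, data = mpg)

# drvf and drvr estimate how much higher (or lower) hwy is for front- and
# rear-wheel drive than for four-wheel drive at the same displ.
print(coef(fit))
```

Comparing the drvf and drvr coefficients gives a rough ranking of the three drivetrains at equal engine size, which can then be checked against the coloured smooths in the plot.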