2.2.1 exercise
1.List five functions that you could use to get more information about the mpg dataset.
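Five functions that you could use to get more information about the mpg dataset are head(), tail(), View(), str(), and summary(). R is case-sensitive, so the names must be typed exactly like this (only View() is capitalized).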
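As a rough sketch of how these inspection functions are called (assuming ggplot2 is installed; dim() and names() are extra base R helpers that the answer above does not list):

```r
library(ggplot2)  # provides the mpg dataset

head(mpg, 3)      # first three rows
tail(mpg, 3)      # last three rows
str(mpg)          # compact structure: column types and first values
summary(mpg)      # per-column summaries
names(mpg)        # column names

# View(mpg) opens a spreadsheet-style viewer, so it is only useful interactively.
mpg_dims <- dim(mpg)  # number of rows and columns
mpg_dims
```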
2.How can you find out what other datasets are included with ggplot2?
To find out what datasets are included with ggplot2, you would run data(package = "ggplot2"). Running data() with no arguments lists the datasets from every loaded package, not just ggplot2.
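A small sketch of pulling the dataset names out of that listing, assuming ggplot2 is installed. The object names ggplot2_data and dataset_names are only illustrative:

```r
library(ggplot2)

# data(package = ...) returns a listing object; its $results matrix has
# one row per dataset, with the dataset name in the "Item" column.
ggplot2_data <- data(package = "ggplot2")
dataset_names <- ggplot2_data$results[, "Item"]
dataset_names
```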
3.Apart from the US, most countries use fuel consumption (fuel consumed over fixed distance) rather than fuel economy (distance travelled with fixed amount of fuel). How could you convert cty and hwy into the European standard of l/100km?
This is the code you would use:
mpg$cty_l100km = 235.2 / mpg$cty
mpg$hwy_l100km = 235.2 / mpg$hwy
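The constant 235.2 comes from the unit definitions (1 mile = 1.609344 km, 1 US gallon = 3.785411784 litres). A minimal sketch of the derivation follows; the helper name mpg_to_l100km is hypothetical and not part of any package:

```r
# Exact unit definitions
KM_PER_MILE <- 1.609344
LITRES_PER_US_GALLON <- 3.785411784

# Convert miles per US gallon to litres per 100 km.
mpg_to_l100km <- function(miles_per_gallon) {
  100 * LITRES_PER_US_GALLON / (KM_PER_MILE * miles_per_gallon)
}

# A car doing 1 mpg uses this many litres per 100 km; rounding gives 235.2.
conversion_constant <- mpg_to_l100km(1)
conversion_constant
```

Because the function is vectorised, it could be applied to a whole column, for example mpg_to_l100km(mpg$cty).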
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
head(mpg)
## # A tibble: 6 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
4.Which manufacturer has the most models in this dataset? Which model has the most variations? Does your answer change if you remove the redundant specification of drive train (e.g. “pathfinder 4wd”, “a4 quattro”) from the model name?
library(dplyr)
mpg %>%
count(model, sort= TRUE)
## # A tibble: 38 × 2
## model n
## <chr> <int>
## 1 caravan 2wd 11
## 2 ram 1500 pickup 4wd 10
## 3 civic 9
## 4 dakota pickup 4wd 9
## 5 jetta 9
## 6 mustang 9
## 7 a4 quattro 8
## 8 grand cherokee 4wd 8
## 9 impreza awd 8
## 10 a4 7
## # ℹ 28 more rows
library(stringr)
mpg %>%
mutate(model_clean = str_remove_all(model, "4wd| quattro| 2wd"))%>%
count(model_clean, sort = TRUE)
## # A tibble: 37 × 2
## model_clean n
## <chr> <int>
## 1 "a4" 15
## 2 "caravan" 11
## 3 "ram 1500 pickup " 10
## 4 "civic" 9
## 5 "dakota pickup " 9
## 6 "jetta" 9
## 7 "mustang" 9
## 8 "grand cherokee " 8
## 9 "impreza awd" 8
## 10 "camry" 7
## # ℹ 27 more rows
Using library(dplyr) and the code mpg %>% count(model, sort = TRUE), I counted the rows for each model. The caravan 2wd model has the most variations, at 11. Note that this code counts models, not manufacturers: Dodge has the most rows in the dataset (37), but Toyota has the most distinct models (6). Using library(stringr) and the code mpg %>% mutate(model_clean = str_remove_all(model, "4wd| quattro| 2wd")) %>% count(model_clean, sort = TRUE), I could see how the result changes once the drive train is removed. The answer does change: a4 now has the most variations, at 15. The pattern leaves a trailing space on some names (for example "ram 1500 pickup ") and does not remove "awd".
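One possible way to separate "most rows" from "most distinct models", and to strip the drive train without leaving trailing spaces, is sketched below. The anchored regular expression is my own choice, not the pattern used in the code above:

```r
library(ggplot2)
library(dplyr)
library(stringr)

# Distinct models per manufacturer (not rows per manufacturer).
models_per_manufacturer <- mpg %>%
  distinct(manufacturer, model) %>%
  count(manufacturer, sort = TRUE)

# Remove a trailing drive train label together with the space before it.
cleaned_counts <- mpg %>%
  mutate(model_clean = str_remove(model, " (4wd|2wd|awd|quattro)$")) %>%
  count(model_clean, sort = TRUE)

models_per_manufacturer
cleaned_counts
```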
2.3.1 exercise
1.How would you describe the relationship between cty and hwy? Do you have any concerns about drawing conclusions from that plot?
The relationship between cty and hwy is strong, positive, and linear: cars with higher city miles per gallon also have higher highway miles per gallon. The concern I would have about drawing conclusions is that the two variables measure nearly the same thing (fuel economy), so the strong relationship is expected and tells us little.
ggplot(mpg,aes(displ, hwy))+geom_point()
ggplot(mpg, aes(model, manufacturer))+ geom_point()
ggplot(mpg, aes(cty, hwy)) + geom_point()
ggplot(diamonds, aes(carat, price)) + geom_point()
ggplot(economics, aes(date, unemploy)) + geom_line()
ggplot(mpg, aes(cty)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
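These comments refer to ggplot(mpg, aes(model, manufacturer)) + geom_point(). The x-axis is illegible because the model names are all crammed together. No, this plot is not useful, because you cannot read it. Both the y-axis and the x-axis are categorical, so each point only shows that a model and manufacturer combination exists. You could modify the data so that one of the axes is numerical, for example the number of models per manufacturer.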
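As one hedged illustration of making an axis numerical, the sketch below reduces the data to one row per manufacturer and model pair and then draws the number of models per manufacturer. The choice of a flipped bar chart is an assumption, not something the exercise prescribes:

```r
library(ggplot2)
library(dplyr)

# One row per manufacturer/model combination.
models_by_manufacturer <- mpg %>%
  distinct(manufacturer, model)

# geom_bar() counts the rows, so bar length = number of distinct models.
p_models <- ggplot(models_by_manufacturer, aes(manufacturer)) +
  geom_bar() +
  coord_flip()  # horizontal bars keep the manufacturer names readable

p_models
```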
3.Describe the data, aesthetic mappings and layers used for each of the following plots. You’ll need to guess a little because you haven’t seen all the datasets and functions yet, but use your common sense! See if you can predict what the plot will look like before running the code.
1.ggplot(mpg, aes(cty, hwy)) + geom_point(): data = mpg; aesthetic mapping: x = cty, y = hwy; layer = points (scatterplot)
I would predict that the graph would show a strong linear relationship. Cars with higher city mpg will also have higher highway mpg.
2.ggplot(diamonds, aes(carat, price)) + geom_point(): data = diamonds; aesthetic mapping: x = carat, y = price; layer = points (scatterplot)
I would predict that the graph would show a strong positive relationship. When the carat size increases, so will the price.
3.ggplot(economics, aes(date, unemploy)) + geom_line(): data = economics; aesthetic mapping: x = date, y = unemploy; layer = line (line graph)
I predict that I will see a time series plot with date on the x-axis and unemployment on the y-axis.
4.ggplot(mpg, aes(cty)) + geom_histogram(): data = mpg; aesthetic mapping: x = cty; layer = histogram
I predict that I will see the distribution of city mpg. The cars will be clustered closely together.
ggplot(mpg, aes(displ,hwy,colour = displ))+geom_point()
ggplot(mpg, aes(displ, hwy, colour = class))+ geom_point()
ggplot(mpg, aes(displ, hwy, shape = drv))+ geom_point()
ggplot(mpg, aes(drv, hwy))+ geom_boxplot()
2.4.1 exercise
1.Experiment with the colour, shape and size aesthetics. What happens when you map them to continuous values? What about categorical values? What happens when you use more than one aesthetic in a plot?
When you map colour to a continuous variable, you get a continuous colour gradient, and size scales the points smoothly; shape cannot take a continuous variable. When you use categorical values, each class gets its own colour and each drive train gets its own shape/symbol. When you use more than one aesthetic in a plot, the plot shows several variables at once, with a legend for each mapped variable.
2.What happens if you map a continuous variable to shape? Why? What happens if you map trans to shape? Why?
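I kept getting an error when I tried to map a continuous variable to shape. A continuous variable cannot be mapped to shape because shape is a discrete aesthetic: there is only a limited number of distinct point shapes available, and they have no natural order. Mapping trans to shape runs, but with a warning, because trans has 10 levels and the default shape palette only provides 6, so the remaining levels are not drawn.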
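A brief sketch of checking how many shapes trans would need, and one possible workaround using scale_shape_manual() with explicit shape codes. Picking codes 1 to 10 is an arbitrary assumption made for illustration:

```r
library(ggplot2)

# Number of distinct transmission types that would each need a shape.
n_trans <- length(unique(mpg$trans))
n_trans

# Supplying one shape code per level avoids running out of default shapes.
p_trans <- ggplot(mpg, aes(displ, hwy, shape = trans)) +
  geom_point() +
  scale_shape_manual(values = seq_len(n_trans))

p_trans
```

Even with enough shapes, ten symbols are hard to tell apart, so colour or faceting may read better.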
3.How is drive train related to fuel economy? How is drive train related to engine size and class?
The boxplot shows that front-wheel drive cars have the highest fuel economy, because they tend to have smaller engines and are generally lighter. Drive train is related to fuel economy because it affects the weight and complexity of the car. Four-wheel drive cars have lower fuel economy than front-wheel drive and rear-wheel drive cars.
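The comparison can also be checked numerically. This is a sketch only; the summary columns median_hwy and mean_displ are names I chose:

```r
library(ggplot2)
library(dplyr)

# Fuel economy and engine size summarised per drive train
# (f = front-wheel, r = rear-wheel, 4 = four-wheel drive).
drv_summary <- mpg %>%
  group_by(drv) %>%
  summarise(
    median_hwy = median(hwy),
    mean_displ = mean(displ),
    n = n()
  )

drv_summary
```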
2.6.1 exercise
There are no questions here, so this is just practice with the code.
ggplot(mpg, aes(displ, hwy))+ geom_point()+ geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(mpg, aes(displ, hwy))+ geom_point()+ geom_smooth(span = 0.2)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(mpg, aes(displ, hwy))+ geom_point()+ geom_smooth(span = 1)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(mpg, aes(displ, hwy))+ geom_point()+ geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
2.6.6 exercise
ggplot(mpg, aes(cty, hwy))+ geom_point()
ggplot(mpg, aes(cty, hwy))+ geom_jitter()
1.What’s the problem with the plot created by ggplot(mpg, aes(cty, hwy)) + geom_point()? Which of the geoms described above is most effective at remedying the problem?
The problem is that cty and hwy are both integer variables, so many cars share exactly the same pair of values. On the graph I coded, the points were hard to read because they were plotted on top of each other (overplotting), which hides how many cars there are at each location. geom_jitter() is the most effective at remedying the problem.
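To get a feel for how much overplotting there is, one could compare the number of cars with the number of distinct (cty, hwy) positions. The sketch below also shows two possible remedies; the jitter width, height, and alpha values are arbitrary choices:

```r
library(ggplot2)
library(dplyr)

n_cars <- nrow(mpg)                            # points drawn
n_positions <- nrow(distinct(mpg, cty, hwy))   # distinct locations on the plot
c(cars = n_cars, positions = n_positions)

# Remedy 1: add a little random noise and transparency.
p_jitter <- ggplot(mpg, aes(cty, hwy)) +
  geom_jitter(width = 0.3, height = 0.3, alpha = 0.5)

# Remedy 2: size each point by how many cars share that location.
p_count <- ggplot(mpg, aes(cty, hwy)) +
  geom_count()
```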
2.One challenge with ggplot(mpg, aes(class, hwy)) + geom_boxplot() is that the ordering of class is alphabetical, which is not terribly useful. How could you change the factor levels to be more informative? Rather than reordering the factor by hand, you can do it automatically based on the data: ggplot(mpg, aes(reorder(class, hwy), hwy)) + geom_boxplot(). What does reorder() do? Read the documentation.
ggplot(mpg, aes(class,hwy))+ geom_boxplot()
ggplot(mpg, aes(reorder(class,hwy), hwy))+ geom_boxplot()
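The reorder() function takes the x factor (class) and reorders its levels based on the values of y (hwy); by default it uses the mean of hwy within each class. It makes the plot more informative because the boxes are sorted from lowest to highest hwy. It makes it easier to see the relation between class and hwy mpg.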
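A small sketch of what reorder() does to the factor levels, outside of any plot. Passing FUN = median is an optional variation that matches the line drawn in the middle of each box:

```r
library(ggplot2)

# Mean highway mpg per class, for comparison with the reordered levels.
mean_hwy <- tapply(mpg$hwy, mpg$class, mean)

# Default: levels sorted by the mean of hwy within each class.
levels_by_mean <- levels(reorder(mpg$class, mpg$hwy))

# Variation: sort by the median instead.
levels_by_median <- levels(reorder(mpg$class, mpg$hwy, FUN = median))

levels_by_mean
levels_by_median
```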
3.Explore the distribution of the carat variable in the diamonds dataset. What binwidth reveals the most interesting patterns?
In this dataset, the 0.01 binwidth gives the best representation because it shows the clear spikes in carat at round values such as 1.0.
ggplot(diamonds, aes(carat))+ geom_freqpoly(binwidth = 0.5)
ggplot(diamonds, aes(carat))+ geom_freqpoly(binwidth = 0.01)
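The spikes can also be checked by counting diamonds at individual carat values. This is only a sketch; converting to whole hundredths of a carat is my own way of avoiding floating-point comparisons:

```r
library(ggplot2)

# Carat expressed in whole hundredths (1.00 carat -> 100).
carat_hundredths <- as.integer(round(diamonds$carat * 100))

# Most common carat values, largest counts first.
top_values <- sort(table(carat_hundredths), decreasing = TRUE)
head(top_values)

# Compare a round value with the value just below it.
n_at_100 <- sum(carat_hundredths == 100L)
n_at_099 <- sum(carat_hundredths == 99L)
c(carat_1.00 = n_at_100, carat_0.99 = n_at_099)
```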
4.Explore the distribution of the price variable in the diamonds data. How does the distribution vary by cut?
head(diamonds)
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
ggplot(diamonds, aes(reorder(cut,price), price)) + geom_boxplot()
When looking at the boxplot, it seems that the Fair cut diamonds have the highest median price. The Ideal cut diamonds surprisingly do not have the highest prices. This suggests that price is driven mainly by carat size rather than by cut.
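A quick numerical check of that reading, summarising price and carat for each cut. The column names median_price and median_carat are just illustrative:

```r
library(ggplot2)
library(dplyr)

# Median price and median carat per cut quality.
cut_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(
    median_price = median(price),
    median_carat = median(carat)
  )

cut_summary
```

If the lower-quality cuts also have the larger median carat, that would be consistent with size, not cut, explaining the price pattern.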
5.You now know (at least) three ways to compare the distributions of subgroups: geom_violin(), geom_freqpoly() and the colour aesthetic, or geom_histogram() and faceting. What are the strengths and weaknesses of each approach? What other approaches could you try?
Jitter plots show every point in the data set, but they only work well with small data sets. Boxplots, however, summarize the distribution with only five numbers. Violin plots give the full shape of the distribution, but they rely on a density estimate. Faceting lets you see the distribution of each group, but it makes comparisons between groups harder.
6.Read the documentation for geom_bar(). What does the weight aesthetic do?
The weight aesthetic controls how much each observation contributes to the height of the bar.
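As a sketch of the weight aesthetic in action: weighting by displ makes each bar show the total engine displacement per manufacturer instead of the number of cars. The choice of displ as the weight is only an example:

```r
library(ggplot2)

# Each car contributes its displ value to the bar, not a count of 1.
p_weighted <- ggplot(mpg, aes(manufacturer)) +
  geom_bar(aes(weight = displ))

# Bar heights as computed by ggplot2, compared with a manual sum.
bar_heights <- layer_data(p_weighted)$count
manual_sums <- as.numeric(tapply(mpg$displ, mpg$manufacturer, sum))
```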
7.Using the techniques already discussed in this chapter, come up with three ways to visualise a 2d categorical distribution. Try them out by visualising the distribution of model and manufacturer, trans and class, and cyl and trans.
model and manufacturer
ggplot(mpg, aes(manufacturer, fill = model))+ geom_bar()
trans and class
ggplot(mpg, aes(class, trans)) + geom_count()
cyl and trans
ggplot(mpg, aes(cyl, fill = trans)) + geom_bar(position = "dodge")
3.1.1 exercise
1.What geoms would you use to draw each of the following named plots?
Scatterplot: geom_point()
Line chart: geom_line()
Histogram: geom_histogram()
Bar chart: geom_bar(stat = "identity")
Pie chart: geom_bar(stat = "identity") + coord_polar(theta = "y")
2.What’s the difference between geom_path() and geom_polygon()? What’s the difference between geom_path() and geom_line()?
geom_path() does not close the shape automatically; it connects the observations in the order that they appear in the data set. geom_polygon() also connects the points in data order, but it closes the shape automatically and can fill it.
geom_path() connects the points in the order they appear in the dataset, ignoring the order of the x values. geom_line() connects the points sorted by x value.
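The difference in ordering can be seen on a tiny made-up data frame whose x values are deliberately out of order. The data below is invented for illustration:

```r
library(ggplot2)

# x is not sorted: the path will zig-zag, the line will not.
df <- data.frame(x = c(1, 3, 2, 4), y = c(1, 2, 3, 4))

p_path <- ggplot(df, aes(x, y)) + geom_path()
p_line <- ggplot(df, aes(x, y)) + geom_line()

# Order in which each geom connects the points.
path_order <- layer_data(p_path)$x
line_order <- layer_data(p_line)$x
path_order
line_order
```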
3.What low-level geoms are used to draw geom_smooth()? What about geom_boxplot() and geom_violin()?
geom_smooth() is drawn with geom_line() and geom_ribbon(). For geom_boxplot(), the low-level geoms would be geom_rect(), geom_segment() and geom_point(). For geom_violin(), the low-level geom would be geom_polygon(), with geom_path() for any quantile lines.
mpg$cty_l100km = 235.2 / mpg$cty
mpg$hwy_l100km = 235.2 / mpg$hwy