# Loading libraries
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Cars <- read.csv("C:/Users/echan/Downloads/Cars.csv")
Run the following code to ensure the factors are defined with the correct labels.
Cars$Sports <- Cars$Sports %>% factor(levels=c(0,1),
labels=c('No','Yes'), ordered=TRUE)
Cars$Sport_utility <- Cars$Sport_utility %>% factor(levels=c(0,1),
labels=c('No','Yes'), ordered=TRUE)
Cars$Wagon <- Cars$Wagon %>% factor(levels=c(0,1),
labels=c('No','Yes'), ordered=TRUE)
Cars$Minivan <- Cars$Minivan %>% factor(levels=c(0,1),
labels=c('No','Yes'), ordered=TRUE)
Cars$Pickup <- Cars$Pickup %>% factor(levels=c(0,1),
labels=c('No','Yes'), ordered=TRUE)
Cars$All_wheel_drive <- Cars$All_wheel_drive %>% factor(levels=c(0,1),
labels=c('No','Yes'), ordered=TRUE)
Cars$Rear_wheel_drive <- Cars$Rear_wheel_drive %>% factor(levels=c(0,1),
labels=c('No','Yes'), ordered=TRUE)
Cars$Cylinders <- Cars$Cylinders %>% as.factor()
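The repeated recoding above could also be written once with dplyr's across(). The following is only a sketch under stated assumptions: the toy data frame and its values are made up to stand in for Cars.csv, and binary_cols is a hypothetical helper name, not something from the original code.

```r
library(dplyr)

# Toy stand-in for Cars.csv (hypothetical values, for illustration only)
toy <- data.frame(
  Sports    = c(0, 1, 0),
  Wagon     = c(1, 0, 0),
  Cylinders = c(4, 6, 8)
)

# Hypothetical helper: names of the 0/1 indicator columns to recode
binary_cols <- c("Sports", "Wagon")

toy <- toy %>%
  mutate(
    # Apply the same No/Yes recoding to every indicator column
    across(all_of(binary_cols),
           ~ factor(.x, levels = c(0, 1), labels = c("No", "Yes"), ordered = TRUE)),
    # Cylinders is treated as a plain (unordered) factor, as above
    Cylinders = as.factor(Cylinders)
  )

str(toy)
```

For the real data, binary_cols would list all seven indicator columns recoded above.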
The qplot() function in ggplot2 is a quick method to develop basic data visualisations. Many of the aesthetics, scales and other plot features are set by default, meaning you can get under way sooner, but without the fine-tuned control of using a layered approach. Let's jump right in and produce a bar chart showing the counts of cars by number of cylinders. (Note: -1 = rotary engine)
qplot(x = Cylinders, data = Cars, geom = "bar")
The first part of the code 'maps' each variable to a particular plot element. For example, x = Cylinders places the Cylinders factor along the x-axis, and data = Cars tells qplot() to look for Cylinders in the Cars data object. Next, we select the bar geom using geom = "bar". By default, this counts the number of observations in each category of Cylinders and presents the frequency as a bar, so the height of each bar is mapped to the y-axis.
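For comparison, the same kind of bar chart can be built with the layered ggplot() interface that qplot() wraps. This sketch uses the built-in mtcars dataset as a stand-in, because Cars.csv is not available here; the column cyl and the object names p_bar and bar_counts belong to this example, not to the Cars data.

```r
library(ggplot2)

# mtcars ships with R; cyl is numeric, so convert it to a factor first
p_bar <- ggplot(data = mtcars, mapping = aes(x = factor(cyl))) +
  geom_bar() +                      # the default stat counts rows per category
  labs(x = "Cylinders", y = "Count")

print(p_bar)

# Bar heights computed by the count stat, one per cylinder category
bar_counts <- layer_data(p_bar)$count
bar_counts
```

The mapping, data and geom are the same three ingredients qplot() took as arguments; the layered form simply spells them out as separate pieces.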
Now let's try a box plot comparing kilowatts (a measure of a car's power) across cylinders.
qplot(x = Cylinders, y = Kilowatts, data = Cars,
geom = "boxplot")
This time we mapped Kilowatts directly to the y-axis and changed the geom to a box plot using geom = "boxplot". We can also add layers at this point, so let's overlay the mean as a red point. We set the colour of the points to distinguish the means from the medians, and select the 'point' geom to draw the means.
qplot(x = Cylinders, y = Kilowatts, data = Cars,
geom = "boxplot") +
stat_summary(fun = mean, colour = "red", geom = "point")
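To check what the red points and the box-plot centre lines represent, the group means and medians can be computed directly with dplyr. This is a sketch on the built-in mtcars dataset, used as a stand-in for Cars; hp (horsepower) plays the role of Kilowatts and cyl the role of Cylinders.

```r
library(dplyr)

# Mean and median power per cylinder group (mtcars stands in for Cars)
power_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    mean_hp   = mean(hp),    # what stat_summary(fun = mean) would plot
    median_hp = median(hp)   # the centre line of each box
  )

print(power_summary)
```

Where the mean sits well above the median, the distribution within that group is skewed towards higher values.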
Now let's try a scatter plot showing the relationship between a car's weight and its estimated city fuel economy (km/L).
qplot(x = Weight,y = Economy_city, data = Cars,
geom = "point")
## Warning: Removed 16 rows containing missing values (geom_point).
R reports that 16 rows containing missing values were removed, which is handy to note. Also note how we mapped each variable to the x- and y-axes. We changed the geom to 'point' for a scatter plot by using geom = "point".
Now let's try something a little trickier. The relationship between weight and economy looks curvilinear. Often we can apply log transformations to variables to straighten the relationship. Linear relationships are generally easier to model and understand. We can use the log = "xy" option to apply a logarithmic scale to both axes of the plot.
qplot(x = Weight,y = Economy_city, data = Cars,
geom = "point", log = "xy")
## Warning: Removed 16 rows containing missing values (geom_point).
The relationship looks more linear, but the axes are poorly scaled. In fact, log-transformed scales are prone to misinterpretation. Instead, we can transform the variables directly.
qplot(x = log(Weight),y = log(Economy_city), data = Cars,
geom = "point")
## Warning: Removed 16 rows containing missing values (geom_point).
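One reason a log-log transformation can straighten a curve is that a power-law relationship y = a * x^b becomes linear after taking logs: log(y) = log(a) + b * log(x). The sketch below illustrates this with simulated, noise-free data. The constants 3 and -0.5 are arbitrary choices for the example, not estimates from the Cars data, and whether weight and economy actually follow a power law is an assumption.

```r
# Simulated power-law data: y = 3 * x^(-0.5), with no noise (made-up values)
x <- seq(1, 100, by = 1)
y <- 3 * x^(-0.5)

# A straight-line fit on the log-log scale recovers the power-law parameters
fit <- lm(log(y) ~ log(x))
coef(fit)  # intercept estimates log(a), slope estimates b
```

On the log-log scale the slope is read as an elasticity: the approximate percentage change in y for a one percent change in x.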
We can also map other aesthetics to reflect additional variables in the plot. Suppose we want to compare the relationship between weight and city fuel economy for cars with different numbers of cylinders.
qplot(x = log(Weight),y = log(Economy_city), data = Cars,
geom = "point",colour = Cylinders)
## Warning: Removed 16 rows containing missing values (geom_point).
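We could also use shapes by setting shape = Cylinders. However, the colour aesthetic works better than shapes given the number of categories: the default shape palette handles at most six discrete values, so cars in the remaining cylinder categories are dropped from the plot.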
qplot(x = log(Weight),y = log(Economy_city), data = Cars,
geom = "point",shape = Cylinders)
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 8. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 18 rows containing missing values (geom_point).
We can add a linear regression trend line using stat_smooth(method = "lm").
qplot(x = log(Weight),y = log(Economy_city), data = Cars,
geom = "point") +
stat_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Warning: Removed 16 rows containing missing values (geom_point).
We can also add a non-parametric smoother. When no method is specified, stat_smooth() defaults to loess here.
qplot(x = log(Weight),y = log(Economy_city), data = Cars,geom = "point") +
stat_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Warning: Removed 16 rows containing missing values (geom_point).
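The smoothness of a loess curve is controlled by its span, which can be passed through the smoothing layer. This sketch uses the layered interface on the built-in mtcars dataset as a stand-in for Cars (wt for Weight, mpg for Economy_city); the value span = 0.9 is an arbitrary choice for illustration, not a recommendation.

```r
library(ggplot2)

# Scatter plot with an explicit loess smoother; larger span gives a smoother curve
p_smooth <- ggplot(mtcars, aes(x = log(wt), y = log(mpg))) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, span = 0.9)

print(p_smooth)

# Fitted curve and confidence band computed by the smoothing layer (layer 2)
smooth_data <- layer_data(p_smooth, 2)
head(smooth_data[, c("x", "y", "ymin", "ymax")])
```

Specifying method and formula explicitly also silences the informational message printed when the defaults are used.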
Suppose we want to compare the relationship between a car's power (measured in kilowatts) and its retail price, broken down by the number of cylinders. We will filter the data to consider only four-, six- and eight-cylinder cars to ensure we have sufficient data in each group.
Cars_filter <- Cars %>% filter(Cylinders %in% c("4","6","8"))
qplot(x = Kilowatts,y = Retail_price, data = Cars_filter,
geom = "point",colour = Cylinders) +
stat_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The scatter plot is a little crowded because the relationships overlap. We can use the facets argument to split the data by Cylinders and display the relationship as small multiples.
qplot(x = Kilowatts,y = Retail_price, data = Cars_filter,
geom = "point", facets = Cylinders ~.) +
stat_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Faceting can come in very handy when exploring multidimensional datasets. Note how we specified the facets using a formula, facets = Cylinders ~ .
In Cylinders ~ ., the left-hand side of the tilde refers to rows and the right-hand side to columns. The . signifies that no facets will appear on that dimension. In other words, Cylinders is split across rows. If we wanted to split across columns instead, we would use . ~ Cylinders or just ~ Cylinders. We could also just specify this as a vector using facets = "Cylinders", but we then lose the ability to choose rows or columns.
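The row and column choice can be made explicit in the layered interface with facet_grid(). This is a sketch on the built-in mtcars dataset, used as a stand-in for Cars_filter (hp for Kilowatts, mpg in place of Retail_price, cyl for Cylinders); the object names are specific to this example.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point()

# One panel per cylinder group, stacked in rows (like facets = cyl ~ .)
p_rows <- base_plot + facet_grid(rows = vars(cyl))

# One panel per cylinder group, side by side in columns (like facets = . ~ cyl)
p_cols <- base_plot + facet_grid(cols = vars(cyl))

print(p_rows)
print(p_cols)

# Number of panels produced by the row layout
n_panels <- length(unique(layer_data(p_rows)$PANEL))
```

Rows tend to suit comparisons along the x-axis, since the panels share it, while columns suit comparisons along the y-axis.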
Create a bar chart showing the number of cars in each cylinder category.
qplot(x = Cylinders, data = Cars, geom = "bar")
Create a histogram of the Weight variable using qplot.
qplot(x = Weight, data = Cars, geom = "histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).
The default number of bins is 30. Change the number of bins in the histogram to 40.
qplot(x = Weight, data = Cars, geom = "histogram", bins = 40)
## Warning: Removed 2 rows containing non-finite values (stat_bin).
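As the stat_bin() message suggests, the bins can also be controlled by width rather than by count. This sketch uses the layered interface on the built-in mtcars dataset as a stand-in for Cars (wt for Weight); the bin width of 0.25 is an arbitrary value chosen for the example.

```r
library(ggplot2)

# Fix the number of bins, as in the qplot() call above
p_bins <- ggplot(mtcars, aes(x = wt)) +
  geom_histogram(bins = 40)

# Fix the width of each bin instead (in the units of the x variable)
p_width <- ggplot(mtcars, aes(x = wt)) +
  geom_histogram(binwidth = 0.25)

print(p_bins)
print(p_width)

# Either way, every non-missing observation lands in exactly one bin
n_binned <- sum(layer_data(p_width)$count)
```

A fixed width is often easier to interpret, because each bar then covers a known range of the variable.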