For this weekly challenge I used the built-in mtcars dataset (assigned to mt). The following output gives a short overview of the dataset’s structure and basic statistics.
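The chunks in this report refer to the dataset as mt and call ggplot2 functions, but the setup chunk itself is not shown. A minimal sketch of what it presumably contained is given here; this is an assumption based on the prompts further down, and the session information suggests the full tidyverse was attached, although ggplot2 alone is enough for the code shown.

```r
# Assumed setup chunk (not shown in the original report)
library(ggplot2)  # plotting package used for both figures
mt <- mtcars      # short alias for the built-in dataset
```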
dim(mt) # show dimension of mtcars
## [1] 32 11
names(mt) # show variable names
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
head(mt, 10) # show first 10 rows
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
str(mt) # show structure (data type) of variables
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
summary(mt) # general overview
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
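The output of str() shows that all variables are stored as numbers, although some of them (e.g. cyl and am) describe groups rather than measurements. For such variables a count per category can be more informative than a mean. The sketch below uses base R only; the name am_label is hypothetical and does not appear in the original code.

```r
mt <- mtcars  # same alias as in the report

# Count cars per category for variables that are stored as numbers
# but really describe groups.
table(cyl = mt$cyl)  # number of cylinders
table(am = mt$am)    # transmission: 0 = automatic, 1 = manual

# Hypothetical labelled copy of the transmission column
am_label <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
table(am_label)
```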
To visualise the mtcars dataset I used a scatter plot and a box plot.
In the scatter plot I want to visualise the correlation between the weight of the cars and their fuel efficiency (miles per gallon, MPG).
# Scatter plot: weight vs. miles per gallon (MPG) (fuel efficiency)
ggplot(mt, aes(x = wt, y = mpg)) + # define data and axes
  geom_point(color = "black", size = 2) + # plot values with specific color and size
  geom_smooth(method = "lm", se = TRUE, color = "red") + # linear regression: linear model with confidence interval with specific color
  labs(
    title = "Correlation between car weight and fuel efficiency", # define title
    x = "Weight [1000 lbs]", # define x axis
    y = "Miles per gallon [MPG]" # define y axis
  ) +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.4), # define font, style and position of title
    axis.title = element_text(size = 14), # define axes font size
    axis.line = element_line(linewidth = 0.8), # define axis thickness
    panel.background = element_blank() # remove grid and grey background
  ) +
  scale_x_continuous(limits = c(0, 6), expand = c(0, 0)) + # define x axis range, starting at 0
  scale_y_continuous(limits = c(0, 40), expand = c(0, 0)) # define y range, starting at 0
## `geom_smooth()` using formula = 'y ~ x'
Figure 1. Correlation between car weight (in 1000 lbs) and
fuel efficiency (miles per gallon, MPG). The red line shows a linear
regression with its confidence interval.
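The line in Figure 1 is drawn by geom_smooth(method = "lm"), which only displays the fit. To read off the numbers behind it, the same model can be fitted directly with lm(). The following is a sketch using base R; the object name fit is chosen here for illustration.

```r
mt <- mtcars  # same alias as in the report

# Ordinary least squares fit of fuel efficiency on weight,
# the same model that geom_smooth(method = "lm") draws with formula y ~ x.
fit <- lm(mpg ~ wt, data = mt)

coef(fit)               # intercept and slope (MPG per 1000 lbs)
summary(fit)$r.squared  # share of the variance in mpg explained by wt
cor(mt$wt, mt$mpg)      # Pearson correlation coefficient
confint(fit)            # 95% confidence intervals for both coefficients
```

For mtcars this gives a slope of about -5.3 MPG per 1000 lbs and an R² of about 0.75.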
In the box plot I want to visualise the relationship between the transmission type (0 = automatic cars, 1 = manual cars) and the fuel efficiency (miles per gallon, MPG) of the cars.
# Box plot: transmission vs. miles per gallon (MPG) (fuel efficiency)
ggplot(mt, aes(
  x = factor(am, labels = c("Automatic", "Manual")), # define values for x axis, define 0 = automatic cars, 1 = manual cars
  y = mpg, # define values y axis
  fill = factor(am))) + # use colors in plot
  geom_boxplot() + # plot values
  geom_jitter(width = 0.35, size = 1.5, color = "black") + # define positioning, size and color of values
  labs(
    title = "Correlation between transmission type and fuel efficiency", # define title
    x = "Transmission type", # define x axis
    y = "Miles per gallon [MPG]" # define y axis
  ) +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 1.0), # define size, style and position of title
    axis.title = element_text(size = 14), # define axes font size
    axis.text.x = element_text(size = 12, color = "black"), # increase font size of labels, change color
    axis.line = element_line(linewidth = 0.8), # define axes thickness
    panel.background = element_blank(), # remove grey background
    legend.position = "none" # remove legend
  ) +
  scale_y_continuous(limits = c(0, 40), expand = c(0, 0)) + # define y axis range, starting at 0
  scale_fill_manual(values = c("lightblue", "orange")) # define color of box plot
Figure 2. Fuel efficiency (miles per gallon, MPG) of cars by
transmission type (automatic and manual). Points show the individual cars.
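The boxes in Figure 2 summarise each group by its median and quartiles. These values can also be computed directly, as in the sketch below. It uses base R, and the names am_label and med_mpg are chosen here for illustration.

```r
mt <- mtcars  # same alias as in the report
am_label <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Median MPG per transmission type (the thick line inside each box)
med_mpg <- tapply(mt$mpg, am_label, median)
med_mpg

# Quartiles per group (the box edges are the 25% and 75% values)
tapply(mt$mpg, am_label, quantile, probs = c(0.25, 0.5, 0.75))
```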
Regression: Fuel efficiency tends to decrease as the weight of the car increases. Lighter cars can drive further on the same amount of fuel than heavier cars.
Box plot: Manual cars tend to have a higher fuel efficiency than automatic cars. Manual cars can drive further on the same amount of fuel compared to automatic cars.
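One way to check whether the difference between the two transmission types is larger than expected by chance is a two-sample t-test. The sketch below uses the Welch test that t.test() runs by default. It is an illustration and not part of the original analysis. It only compares the two groups and does not adjust for other differences between the cars, such as weight.

```r
mt <- mtcars  # same alias as in the report

# Welch two-sample t-test: does mean MPG differ between transmission types?
tt <- t.test(mpg ~ am, data = mt)

tt$estimate  # group means (am = 0 automatic, am = 1 manual)
tt$p.value   # probability of a difference this large under "no difference"
```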
I used AI (ChatGPT) to generate a first draft of the code for the linear regression and the box plot. After running the code in R and knitting the document, I developed ideas for possible changes and modified the code accordingly, e.g. the color, font or size of objects. I also used AI to add specific elements (e.g. the colors of the box plot). In general, AI helped me a lot with the coding, but I made the final modifications myself. I could not identify any mistakes made by the AI, and all suggestions were helpful. The code was clear and, because of my explicit request (“show me the easiest and cleanest way to do it”), straightforward without any fancy coding. However, I often changed the plot title or the axis titles to fit a more scientific context.
Prompt 1:
I want to create a scatter plot with linear regression in R with ggplot2. I will use the data set mtcars (defined as mt). I want to use the weight against the miles per gallon. Show me the basic code, which is well structured and straight forward, without fancy coding. Explain the prompts with # … .
#Response (excerpt):
#Load ggplot2
library(ggplot2)
#Use mtcars data set and assign it to mt
mt <- mtcars
#Create scatter plot with regression line
ggplot(mt, aes(x = wt, y = mpg)) +
geom_point(color = "blue") + # Draw scatter points
geom_smooth(method = "lm", # Add linear regression line
color = "red", # Line color
se = FALSE) + # Do not show confidence interval
labs(title = "MPG vs Car Weight", # Plot title
x = "Weight (1000 lbs)", # X-axis label
y = "Miles per Gallon") # Y-axis label
Adjustment:
I then changed some of the lines (e.g. loading ggplot2 was unnecessary because I had already loaded it at the beginning), modified the code (confidence interval shown, different color, point size) and made the axis titles and the plot title more specific.
Prompt 2:
I want to modify the theme. Show me the promts for: adjusting the plot title in size and position; adjust the size of axes titles; increase the linewidth of the x and y axes; and remove the grey background. Use the easiest way to do
# Theme adjustments
theme(
plot.title = element_text(size = 18, hjust = 0.5), # Increase title size & center it
axis.title = element_text(size = 14), # Increase axes title sizes
axis.line = element_line(linewidth = 1), # Thicken x and y axis lines
panel.background = element_blank() # Remove grey background
)
Adjustment:
I modified the comments and changed the color of the labels. I then adjusted the code until I liked the appearance of the plot (I added face = “bold”).
Prompt 3:
I want that the x axis is starting at 0 to 6 and y axis from 0 to 40 and that both axes are located at 0 and not at -0.5.
# Set axis limits and remove padding
scale_x_continuous(limits = c(0, 6), expand = c(0, 0)) + # X axis from 0 to 6
scale_y_continuous(limits = c(0, 40), expand = c(0, 0)) + # Y axis from 0 to 40
Adjustment:
I just changed the explanation.
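A note on this approach: limits set inside scale_x_continuous() or scale_y_continuous() remove observations outside the range before statistics such as the regression line are computed. In this report all mtcars values lie inside the chosen ranges (wt up to 5.424, mpg up to 33.9), so nothing is lost. A sketch of an alternative that only zooms the view is coord_cartesian(); the object name p_zoom is chosen here for illustration.

```r
library(ggplot2)
mt <- mtcars  # same alias as in the report

# Same axis ranges as above, but set through the coordinate system:
# coord_cartesian() only zooms, so no observations are removed
# before geom_smooth() fits the regression.
p_zoom <- ggplot(mt, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  coord_cartesian(xlim = c(0, 6), ylim = c(0, 40), expand = FALSE)
p_zoom
```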
Prompt 4:
I want to do the same but with a box plot and the correlation between automatic and manual cars with the MPG. Show me the simpliest code for this.
#Box plot with clear labels
ggplot(mt, aes(x = factor(am, labels = c("Automatic", "Manual")),
y = mpg)) +
geom_boxplot() +
labs(title = "MPG by Transmission Type",
x = "Transmission",
y = "Miles per Gallon") +
theme_minimal()
Adjustment:
Clearer explanation and adjustment of axes titles.
Prompt 5:
I want to modify the theme: define title; size of axes font, add labels to x axis; remove legend
#Theme adjustments
theme(
plot.title = element_text(size = 18, hjust = 0.5), # Increase title size & center it
axis.title = element_text(size = 14), # Increase axes title sizes
axis.text.x = element_text(size = 12), # Increase label size
)
Adjustment:
I changed the explanations. I then added some modifications from the previous plot (remove grey background, increase thickness of axes, define y axis and range).
Prompt 6:
I want to use colors in the box plot. What is the easiest way to do it?
#include in aes
fill = factor(am))) + # map fill to transmission
scale_fill_manual(values = c("blue", "orange")) + # 0 = blue, 1 = orange
Adjustment:
Change of blue to lightblue.
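With an unnamed vector, scale_fill_manual() assigns the colors in the order of the factor levels. A sketch of a slightly more explicit variant uses a named vector, so that each color is tied to a level of factor(am) regardless of order. The object name p_box is chosen here for illustration.

```r
library(ggplot2)
mt <- mtcars  # same alias as in the report

p_box <- ggplot(mt, aes(x = factor(am, labels = c("Automatic", "Manual")),
                        y = mpg,
                        fill = factor(am))) +
  geom_boxplot() +
  # Names refer to the levels of factor(am): "0" = automatic, "1" = manual
  scale_fill_manual(values = c("0" = "lightblue", "1" = "orange"))
p_box
```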
Prompt 7:
I want to plot the actual values. Easiest way?
geom_boxplot(alpha = 0.6) + # Box plot with semi-transparent fill
geom_jitter(width = 0.15, size = 2, color = "black") + # Add points with slight horizontal jitter
Adjustment:
I removed the alpha and changed the value for the width.
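Because geom_jitter() places the points randomly, their horizontal positions change every time the document is knitted. If a reproducible figure is wanted, one option is to pass a seed through position_jitter(). The sketch below is an illustration and not the code used for Figure 2; the seed value and the object name p_jit are arbitrary.

```r
library(ggplot2)
mt <- mtcars  # same alias as in the report

p_jit <- ggplot(mt, aes(x = factor(am, labels = c("Automatic", "Manual")),
                        y = mpg)) +
  geom_boxplot() +
  # seed fixes the random offsets; height = 0 keeps the MPG values exact
  geom_jitter(position = position_jitter(width = 0.35, height = 0, seed = 1),
              size = 1.5, color = "black")
p_jit
```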
sessionInfo()
## R version 4.3.3 (2024-02-29 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=German_Germany.utf8 LC_CTYPE=German_Germany.utf8
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.utf8
##
## time zone: Europe/Berlin
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
## [5] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
## [9] ggplot2_4.0.0 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3 stringi_1.8.3
## [5] lattice_0.22-5 hms_1.1.3 digest_0.6.35 magrittr_2.0.3
## [9] evaluate_0.23 grid_4.3.3 timechange_0.3.0 RColorBrewer_1.1-3
## [13] fastmap_1.1.1 jsonlite_1.8.8 Matrix_1.6-5 mgcv_1.9-1
## [17] fansi_1.0.6 scales_1.4.0 jquerylib_0.1.4 cli_3.6.2
## [21] rlang_1.1.3 splines_4.3.3 withr_3.0.0 cachem_1.0.8
## [25] yaml_2.3.8 tools_4.3.3 tzdb_0.4.0 vctrs_0.6.5
## [29] R6_2.5.1 lifecycle_1.0.4 pkgconfig_2.0.3 pillar_1.9.0
## [33] bslib_0.7.0 gtable_0.3.6 glue_1.7.0 xfun_0.43
## [37] tidyselect_1.2.1 highr_0.10 rstudioapi_0.16.0 knitr_1.46
## [41] farver_2.1.1 htmltools_0.5.8.1 nlme_3.1-164 rmarkdown_2.26
## [45] labeling_0.4.3 compiler_4.3.3 S7_0.2.0