1. Dataset overview

For this weekly challenge I used the built-in mtcars dataset, assigned to the object mt. The output below gives a short overview of its structure and basic summary statistics.

dim(mt) # show dimension of mtcars
## [1] 32 11
names(mt) # show variable names
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
head(mt, 10) # show first 10 rows
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
str(mt) # show structure (data type) of each variable
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
summary(mt) # general overview
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
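
The object mt is not defined in the output above. A minimal sketch of the setup, assuming mt is simply a copy of the built-in mtcars data frame (as in the AI response quoted in section 4), could look like this. It also counts the cars per transmission type, because am is stored as a number although it is really a category.

```r
# Assumption: mt is a plain copy of the built-in mtcars data frame
mt <- mtcars

# am is coded 0 = automatic, 1 = manual; count the cars in each group
am_counts <- table(factor(mt$am, levels = c(0, 1),
                          labels = c("Automatic", "Manual")))
print(am_counts)
```

The counts agree with the summary above: the mean of am (0.4062) corresponds to 13 manual cars out of 32.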

2. Two visualizations

To visualize the mtcars dataset I used a scatter plot and a box plot:

2.1 Scatter plot

With the scatter plot I want to visualize the relationship between the weight of the cars and their fuel efficiency (in miles per gallon, MPG).

# Scatter plot: weight vs. fuel efficiency in miles per gallon (MPG)

ggplot(mt, aes(x = wt, y = mpg)) + # define data and axes
  geom_point(color = "black", size = 2) + # plot values with specific color and size
  geom_smooth(method = "lm", se = TRUE, color = "red") + # linear regression line (red) with confidence interval

  labs(
    title = "Correlation between car weight and fuel efficiency", # define title
    x = "Weight [1000 lbs]", # define x axis title
    y = "Miles per gallon [MPG]" # define y axis title
  ) +

  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.4), # define size, style and position of title
    axis.title = element_text(size = 14), # define font size of axis titles
    axis.line = element_line(linewidth = 0.8), # define axis line thickness
    panel.background = element_blank() # remove grey panel background
  ) +

  scale_x_continuous(limits = c(0, 6), expand = c(0, 0)) + # define x axis range, starting at 0
  scale_y_continuous(limits = c(0, 40), expand = c(0, 0)) # define y axis range, starting at 0
## `geom_smooth()` using formula = 'y ~ x'

Figure 1. Relationship between car weight (in 1000 lbs) and fuel efficiency (in miles per gallon, MPG). The red line shows the linear regression fit; the shaded band is its 95% confidence interval.
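
The strength of the relationship in Figure 1 can also be expressed in numbers. The following sketch, which assumes mt is a copy of mtcars, computes the Pearson correlation and fits the same kind of linear model (mpg ~ wt) that geom_smooth(method = "lm") draws.

```r
# Assumption: mt is a plain copy of the built-in mtcars data frame
mt <- mtcars

# Pearson correlation between weight and fuel efficiency
r <- cor(mt$wt, mt$mpg)

# Linear model: fuel efficiency explained by weight
fit <- lm(mpg ~ wt, data = mt)

print(round(r, 2))          # about -0.87
print(round(coef(fit), 2))  # intercept about 37.29, slope about -5.34
```

In this model, each additional 1000 lbs of weight is associated with roughly 5.3 fewer miles per gallon.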

2.2 Box plot

With the box plot I want to visualize the relationship between the transmission type (0 = automatic, 1 = manual) and the fuel efficiency (in MPG) of the cars.

# Box plot: transmission type vs. fuel efficiency in miles per gallon (MPG)

ggplot(mt, aes(
  x = factor(am, labels = c("Automatic", "Manual")), # define values for x axis: 0 = automatic cars, 1 = manual cars
  y = mpg, # define values for y axis
  fill = factor(am) # fill boxes by transmission type
)) +

  geom_boxplot() + # draw box plots
  geom_jitter(width = 0.35, size = 1.5, color = "black") + # overlay individual values: horizontal spread, size and color of points

  labs(
    title = "Correlation between transmission type and fuel efficiency", # define title
    x = "Transmission type", # define x axis title
    y = "Miles per gallon [MPG]" # define y axis title
  ) +

  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 1.0), # define size, style and position of title
    axis.title = element_text(size = 14), # define font size of axis titles
    axis.text.x = element_text(size = 12, color = "black"), # increase font size of labels, change color
    axis.line = element_line(linewidth = 0.8), # define axis line thickness
    panel.background = element_blank(), # remove grey background
    legend.position = "none" # remove legend
  ) +

  scale_y_continuous(limits = c(0, 40), expand = c(0, 0)) + # define y axis range, starting at 0
  scale_fill_manual(values = c("lightblue", "orange")) # define fill colors of the boxes

Figure 2. Fuel efficiency (in miles per gallon, MPG) of cars by transmission type (automatic and manual). Points show the individual cars.
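
The group difference in Figure 2 can be summarized numerically. A minimal sketch, assuming mt is a copy of mtcars; the column name transmission is introduced here only for illustration.

```r
# Assumption: mt is a plain copy of the built-in mtcars data frame
mt <- mtcars

# Hypothetical helper column: am recoded as a labelled factor
mt$transmission <- factor(mt$am, levels = c(0, 1),
                          labels = c("Automatic", "Manual"))

# Mean and median fuel efficiency per transmission type
group_means <- tapply(mt$mpg, mt$transmission, mean)
group_medians <- tapply(mt$mpg, mt$transmission, median)

print(round(group_means, 2))  # Automatic about 17.15, Manual about 24.39
print(group_medians)          # Automatic 17.3, Manual 22.8
```

The medians correspond to the thick horizontal lines inside the two boxes.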

3. First hypothesis

Regression: Fuel efficiency tends to decrease as the weight of the car increases. Lighter cars can drive further on the same amount of fuel than heavier cars.

Box plot: Manual cars tend to have a higher fuel efficiency than automatic cars. Manual cars can drive further on the same amount of fuel than automatic cars.
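
Both hypotheses could be checked with standard tests from base R. This is only a sketch of one possible approach, not part of the analysis above: it assumes mt is a copy of mtcars, uses a Pearson correlation test for the first hypothesis and a Welch two-sample t-test for the second.

```r
# Assumption: mt is a plain copy of the built-in mtcars data frame
mt <- mtcars

# Hypothesis 1: weight and fuel efficiency are (negatively) correlated
wt_test <- cor.test(mt$wt, mt$mpg)

# Hypothesis 2: mean fuel efficiency differs between transmission types
# (t.test() uses the Welch variant by default, i.e. var.equal = FALSE)
am_test <- t.test(mpg ~ am, data = mt)

print(wt_test$estimate)  # correlation coefficient
print(wt_test$p.value)
print(am_test$estimate)  # group means (0 = automatic, 1 = manual)
print(am_test$p.value)
```

Both tests describe associations in these 32 cars only; they do not account for other variables that may differ between the groups.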

4. Reflection on AI assistance

I used AI (ChatGPT) to generate a first draft of the code for the linear regression and the box plot. After running the code in R and knitting the document, I developed ideas for possible changes and then modified the code, e.g. the color, font or size of objects. I also used AI to add specific items (e.g. the colors of the box plots). In general, AI helped me a lot with the coding, but I made the final modifications myself. I could not identify any mistakes made by the AI, and all suggestions were helpful. Because I explicitly asked it to “show me the easiest and cleanest way to do it”, the code was clear and straightforward, without any fancy coding. However, I often changed the plot title or axis titles to fit a more scientific context.

Prompt 1:

I want to create a scatter plot with linear regression in R with ggplot2. I will use the data set mtcars (defined as mt). I want to use the weight against the miles per gallon. Show me the basic code, which is well structured and straight forward, without fancy coding. Explain the prompts with # … .

#Response (excerpt):

#Load ggplot2
library(ggplot2)

#Use mtcars data set and assign it to mt
mt <- mtcars

#Create scatter plot with regression line
ggplot(mt, aes(x = wt, y = mpg)) +

  geom_point(color = "blue") +        # Draw scatter points

  geom_smooth(method = "lm",          # Add linear regression line
              color = "red",          # Line color
              se = FALSE) +           # Do not show confidence interval

  labs(title = "MPG vs Car Weight",   # Plot title
       x = "Weight (1000 lbs)",       # X-axis label
       y = "Miles per Gallon")        # Y-axis label

Adjustment:

I then changed some lines (e.g. loading ggplot2 was unnecessary because I had already loaded it at the beginning), modified the code (added the confidence interval, changed the point color and size) and specified the axis titles and the plot title.

Prompt 2:

I want to modify the theme. Show me the promts for: adjusting the plot title in size and position; adjust the size of axes titles; increase the linewidth of the x and y axes; and remove the grey background. Use the easiest way to do

# Theme adjustments
  theme(
   plot.title = element_text(size = 18, hjust = 0.5),  # Increase title size & center it
   axis.title = element_text(size = 14), # Increase axes title sizes
   axis.line = element_line(linewidth = 1),            # Thicken x and y axis lines
   panel.background = element_blank()                  # Remove grey background
   )

Adjustment:

I modified the comments and changed the color of the labels. I then adjusted the arguments until I liked the appearance of the plot (I added face = “bold”).

Prompt 3:

I want that the x axis is starting at 0 to 6 and y axis from 0 to 40 and that both axes are located at 0 and not at -0.5.

# Set axis limits and remove padding
 scale_x_continuous(limits = c(0, 6), expand = c(0, 0)) +   # X axis from 0 to 6
 scale_y_continuous(limits = c(0, 40), expand = c(0, 0)) +   # Y axis from 0 to 40
  

Adjustment:

I just changed the explanation.
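
One point to keep in mind with this approach: limits set in scale_x_continuous() and scale_y_continuous() remove observations outside the range before the regression is computed, whereas coord_cartesian() only zooms the view. In mtcars all values of wt and mpg lie inside the chosen ranges, so no cars are dropped here. As a hedged alternative sketch (assuming ggplot2 is installed and mt is a copy of mtcars), the same axis ranges could be set like this:

```r
library(ggplot2)

# Assumption: mt is a plain copy of the built-in mtcars data frame
mt <- mtcars

p <- ggplot(mt, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  # Zoom to the desired ranges without discarding any data;
  # expand = FALSE removes the padding so both axes meet at 0
  coord_cartesian(xlim = c(0, 6), ylim = c(0, 40), expand = FALSE)

print(p)
```

With other data, where some points fall outside the limits, the two approaches can give different regression lines.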

Prompt 4:

I want to do the same but with a box plot and the correlation between automatic and manual cars with the MPG. Show me the simpliest code for this.


#Box plot with clear labels
ggplot(mt, aes(x = factor(am, labels = c("Automatic", "Manual")),
               y = mpg)) +
  geom_boxplot() +
  labs(title = "MPG by Transmission Type",
       x = "Transmission",
       y = "Miles per Gallon") +
  theme_minimal()
  

Adjustment:

Clearer explanation and adjustment of axes titles.

Prompt 5:

I want to modify the theme: define title; size of axes font, add labels to x axis; remove legend

#Theme adjustments
theme(
  plot.title = element_text(size = 18, hjust = 0.5),  # Increase title size & center it
  axis.title = element_text(size = 14), # Increase axes title sizes
  axis.text.x = element_text(size = 12), # Increase label size
  )

Adjustment:

I changed the explanations. I then added some modifications from the previous plot (remove grey background, increase thickness of axes, define y axis and range).

Prompt 6:

I want to use colors in the box plot. What is the easiest way to do it?

#include in aes
fill = factor(am))) +     # map fill to transmission

scale_fill_manual(values = c("blue", "orange")) +   # 0 = blue, 1 = orange
  

Adjustment:

Change of blue to lightblue.

Prompt 7:

I want to plot the actual values. Easiest way?

geom_boxplot(alpha = 0.6) +                  # Box plot with semi-transparent fill
geom_jitter(width = 0.15, size = 2, color = "black") +  # Add points with slight horizontal jitter
  

Adjustment:

I removed the alpha and changed the value for the width.
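
Two details of geom_jitter() may matter when the goal is to show the actual values: by default it also adds a small amount of vertical noise, and the random positions change with every knit. A minimal sketch of how both could be controlled, assuming ggplot2 is installed and mt is a copy of mtcars; the seed value 1 is arbitrary.

```r
library(ggplot2)

# Assumption: mt is a plain copy of the built-in mtcars data frame
mt <- mtcars

p <- ggplot(mt, aes(x = factor(am, labels = c("Automatic", "Manual")),
                    y = mpg)) +
  # Hide the box plot's own outlier markers so no car would be drawn twice
  geom_boxplot(outlier.shape = NA) +
  # height = 0 keeps the MPG values exact; seed makes the spread reproducible
  geom_jitter(
    position = position_jitter(width = 0.35, height = 0, seed = 1),
    size = 1.5, color = "black"
  ) +
  labs(x = "Transmission type", y = "Miles per gallon [MPG]")

print(p)
```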

5. Session info

sessionInfo()
## R version 4.3.3 (2024-02-29 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8   
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.utf8    
## 
## time zone: Europe/Berlin
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
##  [5] purrr_1.0.2     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
##  [9] ggplot2_4.0.0   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9         utf8_1.2.4         generics_0.1.3     stringi_1.8.3     
##  [5] lattice_0.22-5     hms_1.1.3          digest_0.6.35      magrittr_2.0.3    
##  [9] evaluate_0.23      grid_4.3.3         timechange_0.3.0   RColorBrewer_1.1-3
## [13] fastmap_1.1.1      jsonlite_1.8.8     Matrix_1.6-5       mgcv_1.9-1        
## [17] fansi_1.0.6        scales_1.4.0       jquerylib_0.1.4    cli_3.6.2         
## [21] rlang_1.1.3        splines_4.3.3      withr_3.0.0        cachem_1.0.8      
## [25] yaml_2.3.8         tools_4.3.3        tzdb_0.4.0         vctrs_0.6.5       
## [29] R6_2.5.1           lifecycle_1.0.4    pkgconfig_2.0.3    pillar_1.9.0      
## [33] bslib_0.7.0        gtable_0.3.6       glue_1.7.0         xfun_0.43         
## [37] tidyselect_1.2.1   highr_0.10         rstudioapi_0.16.0  knitr_1.46        
## [41] farver_2.1.1       htmltools_0.5.8.1  nlme_3.1-164       rmarkdown_2.26    
## [45] labeling_0.4.3     compiler_4.3.3     S7_0.2.0