Digitizing a scatterplot

This document describes how to digitize a scatterplot from a found PNG image by pointing and clicking. The method is only feasible when the sample size is not too large and few points overlap, since overlapping points can be hard to tell apart on the scatterplot.

Load the png package for the readPNG function, which reads an image from a PNG file into a raster array.

library(png)

Load the image. This is an image that came up when googling “scatterplot”, chosen because it has few data points and includes axes. Note that the image is saved as a PNG file and cropped at the axes. The original image can be found at https://www.mathsisfun.com/data/scatter-xy-plots.html. The data are ice cream sales (in $) and temperature (in °C).

ice_cream_temp <- readPNG("scatter-ice-cream.png")
x_lab <- "Temperature (C)"
y_lab <- "Ice cream sales ($)"
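
To see what a raster array looks like, the following sketch writes a tiny synthetic PNG to a temporary file and reads it back. The synthetic image is only a stand-in for the downloaded plot, and its 4 by 6 pixel size is made up for illustration.

```r
library(png)

# Synthetic stand-in for the real plot: 4 rows (height) x 6 columns (width)
# x 3 channels (RGB), all white. Not the ice cream image.
fake_img <- array(1, dim = c(4, 6, 3))
tmp <- tempfile(fileext = ".png")
writePNG(fake_img, target = tmp)

img <- readPNG(tmp)
dim(img)    # height, width, channels
range(img)  # intensities are scaled to lie between 0 and 1
```

The first two dimensions are the image height and width in pixels, which is worth remembering when the image is later stretched over the plotting region.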

Set up the plotting window, with axes lining up with the axes of the plot we want to digitize, x = c(10, 26) and y = c(0, 700). Then, use the rasterImage function to draw the raster image in the plotting window. It might be useful to add gridlines to make sure they match up with the gridlines on the raster image.

par(mar = c(4, 4, 0, 0), las = 1)
plot(x = c(10, 26), y = c(0, 700), type = "n", 
     xlab = x_lab, ylab = y_lab, axes = FALSE)
axis(1, seq(10, 26, 2))
axis(2, seq(0, 700, 100))
rasterImage(ice_cream_temp, 10, 0, 26, 700)
abline(v = seq(10, 26, 2), lty = 2, col  = "gray")
abline(h = seq(100, 700, 100), lty = 2, col  = "gray")
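
Because rasterImage stretches the picture over the rectangle from (10, 0) to (26, 700), every pixel position corresponds linearly to a position in data units, which is why clicks on the image can be read off directly as data values. A minimal sketch of that mapping, assuming the image is cropped exactly at the axes; the helper name pixel_to_data and the 400 by 350 pixel size are made up for illustration.

```r
# Hypothetical helper: convert a position in the image (measured in pixels
# from the left and top edges) to data coordinates.
pixel_to_data <- function(col, row, width, height,
                          xlim = c(10, 26), ylim = c(0, 700)) {
  x <- xlim[1] + (col / width) * diff(xlim)
  # Image rows are counted from the top, while the y-axis increases upwards.
  y <- ylim[2] - (row / height) * diff(ylim)
  list(x = x, y = y)
}

# The centre of a 400 x 350 pixel image maps to the centre of both axes.
centre <- pixel_to_data(col = 200, row = 175, width = 400, height = 350)
centre$x  # 18
centre$y  # 350
```

If the image were not cropped at the axes, the corners passed to rasterImage would no longer coincide with the axis limits and the digitized values would be biased.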

The following function is interactive (hence commented out here). Run it, and then click on the points in the image. Try to be as precise as possible.

#pts <- locator(n = 12)
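
Since the clicks cannot be replayed when the document is rebuilt, one option is to store them the first time and reload them afterwards. This is only a sketch under that assumption; digitize_cached and the file name pts.rds are hypothetical and not part of the original workflow.

```r
# Hypothetical wrapper around locator() that caches the clicked points.
digitize_cached <- function(n, cache_file) {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))  # reuse the clicks from an earlier session
  }
  if (!interactive()) {
    stop("No cached points found and the session is not interactive.")
  }
  pts <- locator(n = n)          # needs the plot above to be on screen
  saveRDS(pts, cache_file)
  pts
}

# Interactive use, with the image drawn as above:
# pts <- digitize_cached(n = 12, cache_file = "pts.rds")
```

locator returns a list with components x and y, so the cached object can be used exactly like the result of a fresh call.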

We can compare these to the data given at https://www.mathsisfun.com/data/scatter-xy-plots.html.

round(pts$x, 1)

##  [1] 11.9 14.2 15.2 16.4 17.2 18.1 18.5 19.4 22.1 22.6 23.4 25.1

round(pts$y)

##  [1] 184 210 332 324 409 422 406 415 522 445 544 615

Close enough for science :)
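
To put a number on “close enough”, the sketch below pairs the digitized values with the published ones and computes the differences. All values are typed in from the rounded output printed in this document (the published values are the ones printed from the CSV further down). It assumes the points were clicked from left to right, so that sorting the published data by temperature lines the two sets up.

```r
# Digitized values, copied from the rounded output above.
dig_x <- c(11.9, 14.2, 15.2, 16.4, 17.2, 18.1, 18.5, 19.4, 22.1, 22.6, 23.4, 25.1)
dig_y <- c(184, 210, 332, 324, 409, 422, 406, 415, 522, 445, 544, 615)

# Published values, in the order they appear in the data file.
temp  <- c(14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2)
sales <- c(215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408)

# Assumption: clicks were made left to right, so order by temperature.
ord   <- order(temp)
err_x <- dig_x - temp[ord]
err_y <- dig_y - sales[ord]

max(abs(err_x))  # largest temperature discrepancy
max(abs(err_y))  # largest sales discrepancy
```

With these rounded values the temperatures agree to one decimal place, and the largest sales discrepancy is $5 on an axis that spans $700.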

Chances are you’re doing this because you don’t have the raw data. Then, you might compare the scatterplots visually,

par(mar = c(4, 4, 0, 0), las = 1)
plot(x = pts$x, y = pts$y, 
     pch = 23, cex = 1.5, col = "black", bg = "yellow",
     xlab = x_lab, ylab = y_lab, axes = FALSE,
     xlim = c(10, 26), ylim = c(100, 700))
axis(1, seq(10, 26, 2))
axis(2, seq(0, 700, 100))
abline(v = seq(10, 26, 2), lty = 2, col  = "gray")
abline(h = seq(100, 700, 100), lty = 2, col  = "gray")

or compare the results of a similar analysis to the one presented in the publication. Unfortunately, the model presented at the data source is based on approximate data values, so let’s first see what they would have gotten had they published a more accurate model.

To do this let’s first bring in the data from the website:

ice_cream_temp_data <- read.csv("ice_crem_temp.csv")
ice_cream_temp_data$sales

##  [1] 215 325 185 332 406 522 412 614 544 421 445 408

ice_cream_temp_data$temp

##  [1] 14.2 16.4 11.9 15.2 18.5 22.1 19.4 25.1 23.4 18.1 22.6 17.2

then run the regression:

their_model <- lm(sales ~ temp, data = ice_cream_temp_data)
their_model$coefficients

## (Intercept)        temp 
##  -159.47415    30.08786

and finally compare this to the result from the digitized data:

my_model <- lm(y ~ x, data = pts)
my_model$coefficients

## (Intercept)           x 
##  -165.74851    30.40259

Pretty close…
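
Coefficients can be hard to compare on their own, since a change in slope is partly offset by a change in intercept. One way to judge the agreement is to compare the predictions of the two fitted lines over the observed temperature range. The sketch below uses the coefficients copied from the two outputs above; the grid of temperatures from 12 to 25 is an arbitrary choice that roughly covers the data.

```r
# Coefficients copied from the two lm() outputs above.
their_coef <- c(intercept = -159.47415, slope = 30.08786)
my_coef    <- c(intercept = -165.74851, slope = 30.40259)

# Arbitrary grid roughly covering the observed temperatures.
temps <- seq(12, 25, by = 1)

their_pred <- unname(their_coef["intercept"] + their_coef["slope"] * temps)
my_pred    <- unname(my_coef["intercept"]    + my_coef["slope"]    * temps)

# Largest gap between the two fitted lines, in dollars of predicted sales.
max_gap <- max(abs(my_pred - their_pred))
max_gap
```

Across this range the two lines differ by less than $3 in predicted sales, which is small compared with sales that run from roughly $200 to $600.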


Mine Cetinkaya-Rundel

July 28, 2015