---
title: "Spelling Error Frequency by Word Category"
output:
  html_notebook:
    code_folding: hide
---

The dataset I am exploring is the set of English-language keystroke pairs in the [Microsoft Research Spelling-Correction Data collected through MTurk](https://www.microsoft.com/en-us/download/details.aspx?id=52418). Each keystroke pair consists of a user's spelling error and their correction. There are 44,104 keystroke pairs in the dataset, so I decided to look at smaller categories within the data to make it more manageable.
```{r}
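library(tidyverse)

# read.delim() expects a tab-separated file with a header row; quote = "" turns
# off quote handling so apostrophes and quotation marks are read literally
errors <- read.delim("C:/Users/delee/OneDrive/Desktop/all_keystroke_pairs.txt", quote = "")
View(errors)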
```
## Error Frequency by Number

First, I filtered the data for errors that were corrected to number terms from one to ten.  Notably, this excludes keystroke pairs in which both the error and the correction are misspelled number terms.  Many items in this dataset are pairs in which the typer still did not spell a word correctly on their second attempt (who among us), but we don't know what they were trying to type in these cases.

I had to do some finagling to order the data correctly.  Even when I managed to get them in numerical order in the dataset by introducing a column of the corresponding digits, they appeared in alphabetical order in the visualization.  I went through a lot of grief trying to correct this, and eventually Leyla helped me out by recommending a line of code involving factors and levels.
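
To see the ordering problem in isolation, here is a minimal sketch on a toy vector. The values in `toy_terms` are made up for illustration and are not taken from the dataset. A character vector sorts alphabetically, while a factor sorts (and is drawn on a plot axis) in the order of its levels:

```{r}
# Toy data, invented for illustration only
toy_terms <- c("three", "one", "two", "one")

# Character vectors sort alphabetically: "three" lands before "two"
sort(toy_terms)

# A factor sorts in the order of its levels, so the numerical order is kept
toy_factor <- factor(toy_terms, levels = c("one", "two", "three"))
sort(toy_factor)
levels(toy_factor)
```

This is the same idea as the factor line in the chunk below, just with the levels written out by hand.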
```{r}
# Number terms in numerical order; this vector drives both the filter and the sort key
number_errors_vector <- c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten")

number_errors <- errors %>%
  filter(correction %in% number_errors_vector) %>%
  # digit = position of the term in the vector (1 for "one", ..., 10 for "ten")
  mutate(digit = match(correction, number_errors_vector)) %>%
  arrange(digit)

# ggplot draws a character axis alphabetically, so store the terms as an ordered
# factor whose levels follow the (already sorted) order of appearance
number_errors$correction <- factor(number_errors$correction, levels = unique(number_errors$correction), ordered = TRUE)
View(number_errors)

ggplot(number_errors) +
  geom_bar(aes(x = correction, fill = correction)) +
  labs(title = "Error Frequency by Number",
       x = "Numbers",
       y = "Frequency") +
  theme(legend.position = "none", panel.background = element_rect(fill = "pink"))
```

It looks like "two" is the most-misspelled number term by a long shot, with over 150 misspellings in the dataset.  There is also a major drop-off in misspellings of numbers after three, which I assume is because people favor digits over spelled-out number terms for higher numbers.  It seems more plausible that larger number terms are simply typed less often than that they are typed more accurately.

## Error Frequency by Color

Then I turned to color terms.  I opted to stick to the widely- (though I'm sure not universally-) agreed-upon eleven basic color terms in English, partially because many non-basic color terms in English derive from nouns with the same spelling, meaning it would be hard to tell whether or not the typer was using the word as a color term.  The only basic color term where that issue arises is orange, but I decided I could live with that.

Here, I added two columns to the dataset: one with digits so I could get the data in the desired order, and one with shortened keys for each color name so that they could all fit comfortably and legibly as labels on the bar graph.
```{r}
# Basic color terms in display order; "grey" and "gray" share one bar
color_errors_vector <- c("pink", "red", "orange", "yellow", "green", "blue", "purple", "brown", "black", "grey", "gray", "white")
# Short axis labels and sort keys, index-aligned with color_errors_vector
color_labs <- c("Pi", "R", "O", "Y", "Green", "Blue", "Pu", "Br", "Bla", "Grey", "Grey", "W")
color_ids <- c(1:10, 10, 11)

color_errors <- errors %>%
  filter(correction %in% color_errors_vector) %>%
  mutate(color_lab = color_labs[match(correction, color_errors_vector)],
         color_id = color_ids[match(correction, color_errors_vector)]) %>%
  arrange(color_id)

# Ordered factor so the bars follow color_id order instead of alphabetical order
color_errors$color_lab <- factor(color_errors$color_lab, levels = unique(color_errors$color_lab), ordered = TRUE)
View(color_errors)

ggplot(color_errors) +
  geom_bar(aes(x = color_lab, fill = correction)) +
  labs(title = "Error Frequency by Color",
       x = "Colors",
       y = "Frequency") +
  theme(legend.position = "none", panel.background = element_rect(fill = "#1D2257")) +
  # Fill each bar with (an approximation of) the color it names
  scale_fill_manual(name = "legend",
                    values = c("pink" = "#FB7189", "red" = "#F03621", "orange" = "#F68f1d",
                               "yellow" = "#FBD236", "green" = "#539232", "blue" = "#3D8FC6",
                               "purple" = "#772B7F", "brown" = "#7D451E", "black" = "black",
                               "white" = "white", "gray" = "gray", "grey" = "grey"))
```

Here we can see that the most-misspelled basic color term is "white", followed closely by "black", a duo that artists, scientists, and other pedants often consider not to be colors at all.  Interesting!  I would guess that here, too, it's an issue of frequency.  "Black" and "white" can be used loosely to describe things that are dark and light, whereas "purple" is mostly just used to describe things that are purple.  These words also have applications beyond their color meanings, such as race descriptors.

"Grey" (and "gray"; I've combined the two into one column) has the fewest misspellings.  I wonder if the dual spellings make us think extra carefully when we type the word.  Two of the "errors" corrected to "gray" were simply the with-an-E variant.  Those typers may have been wavering between the two spellings just as easily as they may have typed one by accident while aiming for the other, which I consider to be the prototypical typo narrative.

## Discussion

I think there are plenty of other worthwhile inquiries to be made about this data.  I spoke on Monday with Kaung Zan, Michael, and Eliana about various ways of measuring the "distance" between the error and the correction, like the number of letters that are different between the two, or the distance between two keys on a keyboard in the case of letter substitution errors.
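
As a rough sketch of those two distance ideas, the chunk below defines two helper functions. The names `edit_distance`, `key_position`, and `key_distance` are hypothetical, and the example words are made up rather than pulled from the dataset. The first counts letter-level edits with base R's `adist()`; the second treats a QWERTY keyboard as a simple grid, which is an assumption that ignores the slight stagger between rows on a physical keyboard.

```{r}
# Levenshtein distance: the number of single-character insertions, deletions,
# and substitutions needed to turn each error into its correction
edit_distance <- function(error, correction) {
  unname(mapply(function(e, k) adist(e, k)[1, 1], error, correction))
}

# Letter rows of a QWERTY keyboard, treated as an unstaggered grid (assumption)
qwerty_rows <- c("qwertyuiop", "asdfghjkl", "zxcvbnm")

# Row and column of a single letter key, or NA if the key is not a letter
key_position <- function(key) {
  for (row in seq_along(qwerty_rows)) {
    col <- regexpr(key, qwerty_rows[row], fixed = TRUE)
    if (col > 0) return(c(row = row, col = as.integer(col)))
  }
  c(row = NA, col = NA)
}

# Straight-line distance between two keys, measured in key widths
key_distance <- function(a, b) {
  pa <- key_position(tolower(a))
  pb <- key_position(tolower(b))
  sqrt(sum((pa - pb)^2))
}

edit_distance(c("tw", "whute"), c("two", "white"))
key_distance("u", "i")
```

Note that plain Levenshtein distance counts two swapped letters as two edits, so a swap-heavy dataset like this one might call for a variant that treats a transposition as a single edit.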

I know there are also different factors that drive us to make different kinds of errors (which I've been noticing much more while working on this project)--typing fast can cause us to miss a key or switch two letters around, mentally narrating as we type can cause us to substitute phonologically similar letters, and thinking about something else or hearing someone speak while we type can cause us to output something totally wrong.  Trying to quantify these motivations would take a lot of guesswork, but it's possible to measure the ways they manifest, at least in the case of more frequent and straightforward errors (e.g. elision, substitution, swapping).
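
Here is one way the straightforward error types could be labeled, as a minimal sketch. The function names `classify_error` and `one_removed` are hypothetical, the example pairs are invented, and anything that is not a single elision, insertion, substitution, or adjacent swap falls into a catch-all "other" category.

```{r}
# TRUE if deleting exactly one character from `longer` yields `shorter`
one_removed <- function(longer, shorter) {
  any(vapply(seq_along(longer),
             function(i) identical(longer[-i], shorter),
             logical(1)))
}

# Label a single error/correction pair by the kind of slip it looks like
classify_error <- function(error, correction) {
  err <- strsplit(error, "")[[1]]
  cor <- strsplit(correction, "")[[1]]
  if (identical(err, cor)) return("none")

  if (length(err) == length(cor)) {
    d <- which(err != cor)  # positions where the two strings differ
    if (length(d) == 1) return("substitution")
    if (length(d) == 2 && d[2] == d[1] + 1 &&
        err[d[1]] == cor[d[2]] && err[d[2]] == cor[d[1]]) {
      return("swap")
    }
  } else if (length(err) == length(cor) - 1 && one_removed(cor, err)) {
    return("elision")    # the error is missing one letter
  } else if (length(err) == length(cor) + 1 && one_removed(err, cor)) {
    return("insertion")  # the error has one extra letter
  }
  "other"
}

classify_error("tw", "two")
classify_error("teh", "the")
classify_error("whute", "white")
```

Applied across the whole dataset (for example with `mapply()`), a table of these labels would show how the error types are distributed without requiring any guesses about why each one happened.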

Another metric to measure here could be the proportion of errors that were corrected to real, correctly-spelled words.  However, drawing conclusions about this set would run into complications similar to those above, and we would not be able to say for sure that they qualify as "true" corrections (as opposed to further errors).  For example, since the data records only the correction immediately following the error, who's to say what the final corrected term was?  What if the typer intended to misspell the word?  What if the immediate correction is a real, correctly-spelled word, but still the wrong one?  What is a "real word"?  What about terms in the data that include digits or punctuation?  It could still be an interesting inquiry, and my hunch from scrolling through the data is that correctly-spelled corrections would outnumber incorrectly-spelled ones.
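
The proportion itself would be simple to compute once a word list is chosen. The sketch below uses stand-ins throughout: `word_list` is a tiny hypothetical dictionary and `toy_pairs` holds made-up pairs, so the result says nothing about the real dataset. Which word list to use is exactly the "what is a real word" question above.

```{r}
# Hypothetical stand-ins: a tiny word list and a few invented keystroke pairs
word_list <- c("two", "white", "the", "black")
toy_pairs <- data.frame(
  error = c("tw", "whtie", "teh", "blakc"),
  correction = c("two", "white", "the", "blak"),
  stringsAsFactors = FALSE
)

# Flag corrections that appear in the word list, ignoring capitalization
toy_pairs$is_real_word <- tolower(toy_pairs$correction) %in% word_list

# Share of pairs whose correction is a listed word
mean(toy_pairs$is_real_word)
```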

People type differently from one another for a host of reasons including motor skills, what kind of keyboard they use, and how they learned to type.  These differences may influence the kinds of errors we produce when we type.  It was interesting looking at the full set of typing errors and seeing the wide range of ways we make typos, which we often quickly delete and move on from without a second thought.

I would be remiss if I didn't mention that I think Amazon's Mechanical Turk (the source of this dataset) and other similar digital micro-job platforms are highly exploitative operations.

Thnak yuou fr readig y ntebook :)

(^all real errors from the dataset)

Data:
*Microsoft Research Spelling-Correction Data*. (2024, July 15). Microsoft Download Center. [https://www.microsoft.com/en-us/download/details.aspx?id=52418](https://www.microsoft.com/en-us/download/details.aspx?id=52418)