The “Oxford Comma” is the subject of endless debate among grammarians. It is the comma that precedes the final conjunction in a list of three or more items. For example, the sentence
“For breakfast I like to eat cereal, toast, and juice”
contains the Oxford Comma, while
“I am planning to invite Tim, Dick and Harry”
does not.
In cases like the above, there is really no difference in interpretation, but there are notable examples where the lack of an Oxford Comma can produce humorous or ambiguous readings.
On a more serious note, contract ambiguities caused by the presence or absence of an Oxford Comma have resulted in costly legal disputes, such as the Maine case described at https://www.nytimes.com/2018/02/09/us/oxford-comma-maine.html
# Load libraries
library(tidyr)
library(dplyr)
library(kableExtra)
library(ggplot2)
library(psych)
library(forcats) ## for re-leveling of factors

# Load the data from FiveThirtyEight
commadata <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/comma-survey/comma-survey.csv",
                      na = "", stringsAsFactors = TRUE)
summary(commadata)
## RespondentID
## Min. :3.288e+09
## 1st Qu.:3.289e+09
## Median :3.290e+09
## Mean :3.290e+09
## 3rd Qu.:3.291e+09
## Max. :3.293e+09
##
## In.your.opinion..which.sentence.is.more.gramatically.correct.
## It's important for a person to be honest, kind and loyal. :488
## It's important for a person to be honest, kind, and loyal.:641
##
##
##
##
##
## Prior.to.reading.about.it.above..had.you.heard.of.the.serial..or.Oxford..comma.
## No :444
## Yes :655
## NA's: 30
##
##
##
##
## How.much..if.at.all..do.you.care.about.the.use..or.lack.thereof..of.the.serial..or.Oxford..comma.in.grammar.
## A lot :291
## Not at all:126
## Not much :268
## Some :414
## NA's : 30
##
##
## How.would.you.write.the.following.sentence.
## Some experts say it's important to drink milk, but the data are inconclusive.:228
## Some experts say it's important to drink milk, but the data is inconclusive. :865
## NA's : 36
##
##
##
##
## When.faced.with.using.the.word..data...have.you.ever.spent.time.considering.if.the.word.was.a.singular.or.plural.noun.
## No :547
## Yes :544
## NA's: 38
##
##
##
##
## How.much..if.at.all..do.you.care.about.the.debate.over.the.use.of.the.word..data..as.a.singluar.or.plural.noun.
## A lot :133
## Not at all:203
## Not much :403
## Some :352
## NA's : 38
##
##
## In.your.opinion..how.important.or.unimportant.is.proper.use.of.grammar.
## Neither important nor unimportant (neutral): 26
## Somewhat important :333
## Somewhat unimportant : 7
## Very important :688
## Very unimportant : 5
## NA's : 70
##
## Gender Age Household.Income
## Female:548 > 60 :272 $0 - $24,999 :121
## Male :489 18-29:221 $100,000 - $149,999:164
## NA's : 92 30-44:254 $150,000+ :103
## 45-60:290 $25,000 - $49,999 :158
## NA's : 92 $50,000 - $99,999 :290
## NA's :293
##
## Education Location..Census.Region.
## Bachelor degree :344 Pacific :180
## Graduate degree :276 East North Central:170
## High school degree :100 South Atlantic :164
## Less than high school degree : 11 Middle Atlantic :140
## Some college or Associate degree:295 West South Central: 88
## NA's :103 (Other) :285
## NA's :102
initial_comma_headers <- names(commadata)
print(initial_comma_headers)
## [1] "RespondentID"
## [2] "In.your.opinion..which.sentence.is.more.gramatically.correct."
## [3] "Prior.to.reading.about.it.above..had.you.heard.of.the.serial..or.Oxford..comma."
## [4] "How.much..if.at.all..do.you.care.about.the.use..or.lack.thereof..of.the.serial..or.Oxford..comma.in.grammar."
## [5] "How.would.you.write.the.following.sentence."
## [6] "When.faced.with.using.the.word..data...have.you.ever.spent.time.considering.if.the.word.was.a.singular.or.plural.noun."
## [7] "How.much..if.at.all..do.you.care.about.the.debate.over.the.use.of.the.word..data..as.a.singluar.or.plural.noun."
## [8] "In.your.opinion..how.important.or.unimportant.is.proper.use.of.grammar."
## [9] "Gender"
## [10] "Age"
## [11] "Household.Income"
## [12] "Education"
## [13] "Location..Census.Region."
###### Clean up the names
newnames <- c("RespondentID","USES_Oxford","HEARD_Oxford","CARE_Oxford","DATA_Sentence","DATA_Plural","DATA_Care","Grammar_Important","Gender","Age","Income","Education","Location")
names(commadata) <- newnames
print("Newnames:")## [1] "Newnames:"
print(t(t(names(commadata))))## [,1]
## [1,] "RespondentID"
## [2,] "USES_Oxford"
## [3,] "HEARD_Oxford"
## [4,] "CARE_Oxford"
## [5,] "DATA_Sentence"
## [6,] "DATA_Plural"
## [7,] "DATA_Care"
## [8,] "Grammar_Important"
## [9,] "Gender"
## [10,] "Age"
## [11,] "Income"
## [12,] "Education"
## [13,] "Location"
RespondentID is just an identifier and should not affect the results, so drop it.
# Dimension before dropping
print(dim(commadata))
## [1] 1129 13
# drop the column
commadata$RespondentID <- NULL
# dimension after dropping
print(dim(commadata))
## [1] 1129 12
True or False
The second question posed to participants was whether they preferred a sentence written with the Oxford Comma (n=641) or without it (n=488).
t(t(table(commadata$USES_Oxford)))
##
## [,1]
## It's important for a person to be honest, kind and loyal. 488
## It's important for a person to be honest, kind, and loyal. 641
# It's important for a person to be honest, kind and loyal. 488
# It's important for a person to be honest, kind, and loyal. 641
#### Replace the response with False / True:
levels(commadata$USES_Oxford) <- c(FALSE, TRUE)
print("Does the respondent prefer to use the Oxford Comma?")
## [1] "Does the respondent prefer to use the Oxford Comma?"
t(t(summary(commadata$USES_Oxford)))
## [,1]
## FALSE 488
## TRUE 641
#FALSE TRUE
#   488   641
# First ten observations:
t(t(head(commadata$CARE_Oxford,10)))
## [,1]
## [1,] Some
## [2,] Not much
## [3,] Some
## [4,] Some
## [5,] Not much
## [6,] A lot
## [7,] A lot
## [8,] A lot
## [9,] A lot
## [10,] Not at all
## Levels: A lot Not at all Not much Some
# Summary table (sequence is not ordinal):
t(t(table(commadata$CARE_Oxford,useNA = "always")))##
## [,1]
## A lot 291
## Not at all 126
## Not much 268
## Some 414
## <NA> 30
# What are the original levels on the CARE_Oxford variable?
t(t(levels(commadata$CARE_Oxford)))## [,1]
## [1,] "A lot"
## [2,] "Not at all"
## [3,] "Not much"
## [4,] "Some"
### use fct_relevel from library `forcats` to sort the CARE_Oxford levels ordinally
commadata$CARE_Oxford = fct_relevel(commadata$CARE_Oxford, levels(commadata$CARE_Oxford)[c(2,3,4,1)])
t(t(levels(commadata$CARE_Oxford)))## [,1]
## [1,] "Not at all"
## [2,] "Not much"
## [3,] "Some"
## [4,] "A lot"
t(t(summary(commadata$CARE_Oxford))) # resequenced to reflect ordering from "Not at all" to "A lot"## [,1]
## Not at all 126
## Not much 268
## Some 414
## A lot 291
## NA's 30
### Make sure the results are still same
print("CARE_oxford levels:")## [1] "CARE_oxford levels:"
t(t(table(commadata$CARE_Oxford,useNA = "always")))##
## [,1]
## Not at all 126
## Not much 268
## Some 414
## A lot 291
## <NA> 30
t(t(head(commadata$CARE_Oxford,10)))## [,1]
## [1,] Some
## [2,] Not much
## [3,] Some
## [4,] Some
## [5,] Not much
## [6,] A lot
## [7,] A lot
## [8,] A lot
## [9,] A lot
## [10,] Not at all
## Levels: Not at all Not much Some A lot
### [5] Another question asked users whether they think the word "data" should be considered singular or plural.
t(t(table(commadata$DATA_Sentence,useNA = "always")))##
## [,1]
## Some experts say it's important to drink milk, but the data are inconclusive. 228
## Some experts say it's important to drink milk, but the data is inconclusive. 865
## <NA> 36
# Some experts say it's important to drink milk, but the data are inconclusive. 228
# Some experts say it's important to drink milk, but the data is inconclusive. 865
### Replace the above sentences with the word "PLURAL" or "SINGULAR" to reflect user preference
levels(commadata$DATA_Sentence) <- c("PLURAL","SINGULAR")
t(t(summary(commadata$DATA_Sentence)))## [,1]
## PLURAL 228
## SINGULAR 865
## NA's 36
# PLURAL 228
# SINGULAR 865
# NA's      36
# First ten observations:
t(t(head(commadata$DATA_Care,10)))## [,1]
## [1,] Not much
## [2,] Not much
## [3,] Not at all
## [4,] Some
## [5,] Not much
## [6,] Some
## [7,] Some
## [8,] A lot
## [9,] Not much
## [10,] Some
## Levels: A lot Not at all Not much Some
# Summary table (sequence is not ordinal):
t(t(table(commadata$DATA_Care,useNA = "always")))##
## [,1]
## A lot 133
## Not at all 203
## Not much 403
## Some 352
## <NA> 38
# What are the original levels on the DATA_Care variable?
t(t(levels(commadata$DATA_Care)))## [,1]
## [1,] "A lot"
## [2,] "Not at all"
## [3,] "Not much"
## [4,] "Some"
### use fct_relevel from library `forcats` to sort the DATA_Care levels ordinally
commadata$DATA_Care = fct_relevel(commadata$DATA_Care, levels(commadata$DATA_Care)[c(2,3,4,1)])
t(t(levels(commadata$DATA_Care)))## [,1]
## [1,] "Not at all"
## [2,] "Not much"
## [3,] "Some"
## [4,] "A lot"
t(t(summary(commadata$DATA_Care))) # resequenced to reflect ordering from "Not at all" to "A lot"## [,1]
## Not at all 203
## Not much 403
## Some 352
## A lot 133
## NA's 38
### Make sure the results are still same
print("DATA_Care levels:")## [1] "DATA_Care levels:"
t(t(table(commadata$DATA_Care,useNA = "always")))##
## [,1]
## Not at all 203
## Not much 403
## Some 352
## A lot 133
## <NA> 38
t(t(head(commadata$DATA_Care,10)))## [,1]
## [1,] Not much
## [2,] Not much
## [3,] Not at all
## [4,] Some
## [5,] Not much
## [6,] Some
## [7,] Some
## [8,] A lot
## [9,] Not much
## [10,] Some
## Levels: Not at all Not much Some A lot
# Incoming table of Grammar_Important responses:
t(t(table(commadata$Grammar_Important,useNA = "always")))##
## [,1]
## Neither important nor unimportant (neutral) 26
## Somewhat important 333
## Somewhat unimportant 7
## Very important 688
## Very unimportant 5
## <NA> 70
# First 10 responses:
t(t(head(commadata$Grammar_Important,10)))## [,1]
## [1,] Somewhat important
## [2,] Somewhat unimportant
## [3,] Very important
## [4,] Somewhat important
## [5,] <NA>
## [6,] Very important
## [7,] Very important
## [8,] Very important
## [9,] Very important
## [10,] Very important
## 5 Levels: Neither important nor unimportant (neutral) ...
# What are the original levels on the Grammar_Important variable?
t(t(levels(commadata$Grammar_Important)))## [,1]
## [1,] "Neither important nor unimportant (neutral)"
## [2,] "Somewhat important"
## [3,] "Somewhat unimportant"
## [4,] "Very important"
## [5,] "Very unimportant"
### use fct_relevel from library `forcats` to sort the Grammar_Important levels ordinally
commadata$Grammar_Important = fct_relevel(commadata$Grammar_Important, levels(commadata$Grammar_Important)[c(5,3,1,2,4)])
# Resequenced levels:
t(t(levels(commadata$Grammar_Important)))## [,1]
## [1,] "Very unimportant"
## [2,] "Somewhat unimportant"
## [3,] "Neither important nor unimportant (neutral)"
## [4,] "Somewhat important"
## [5,] "Very important"
# Summary table:
t(t(summary(commadata$Grammar_Important))) # resequenced to reflect ordering from "very unimportant" to "very important"## [,1]
## Very unimportant 5
## Somewhat unimportant 7
## Neither important nor unimportant (neutral) 26
## Somewhat important 333
## Very important 688
## NA's 70
### Make sure the results are still same
# Grammar_Important levels:
t(t(table(commadata$Grammar_Important,useNA = "always")))##
## [,1]
## Very unimportant 5
## Somewhat unimportant 7
## Neither important nor unimportant (neutral) 26
## Somewhat important 333
## Very important 688
## <NA> 70
# First ten entries:
t(t(head(commadata$Grammar_Important,10)))## [,1]
## [1,] Somewhat important
## [2,] Somewhat unimportant
## [3,] Very important
## [4,] Somewhat important
## [5,] <NA>
## [6,] Very important
## [7,] Very important
## [8,] Very important
## [9,] Very important
## [10,] Very important
## 5 Levels: Very unimportant ... Very important
Age variable: resequence the levels and create a numeric equivalent based upon the band midpoints.
# First ten entries from the Age variable:
t(t(head(commadata$Age,10)))
## [,1]
## [1,] 30-44
## [2,] 30-44
## [3,] 30-44
## [4,] 18-29
## [5,] <NA>
## [6,] 18-29
## [7,] 18-29
## [8,] 18-29
## [9,] 30-44
## [10,] 30-44
## Levels: > 60 18-29 30-44 45-60
# Save the Age bands
commadata$AgeBands <- commadata$Age
# What are the original levels on the Age variable?
t(t(levels(commadata$Age)))## [,1]
## [1,] "> 60"
## [2,] "18-29"
## [3,] "30-44"
## [4,] "45-60"
### use fct_relevel from library `forcats` to sort the age levels ordinally
commadata$AgeBands = fct_relevel(commadata$AgeBands, levels(commadata$Age)[c(2,3,4,1)])
# What are the resequenced levels on the Age variable?
t(t(levels(commadata$AgeBands)))## [,1]
## [1,] "18-29"
## [2,] "30-44"
## [3,] "45-60"
## [4,] "> 60"
### Make sure the results are still same
# Age bands:
t(t(table(commadata$AgeBands,useNA = "always")))##
## [,1]
## 18-29 221
## 30-44 254
## 45-60 290
## > 60 272
## <NA> 92
# First ten entries:
t(t(head(commadata$AgeBands,10)))## [,1]
## [1,] 30-44
## [2,] 30-44
## [3,] 30-44
## [4,] 18-29
## [5,] <NA>
## [6,] 18-29
## [7,] 18-29
## [8,] 18-29
## [9,] 30-44
## [10,] 30-44
## Levels: 18-29 30-44 45-60 > 60
#### Let's replace the above Age ranges with their midpoints, so we can treat Age as a numeric variable rather than as categorical (for "> 60" , we'll arbitrarily use 65 for the right-censored data)
commadata$AgeNumeric <- commadata$AgeBands
levels(commadata$AgeNumeric) <- c(23.5,37,52.5,65) ## 18-29 ; 30-44 ; 45-60 ; Over 60
print("Age table (B):")## [1] "Age table (B):"
t(t(table(commadata$AgeNumeric,useNA = "always")))##
## [,1]
## 23.5 221
## 37 254
## 52.5 290
## 65 272
## <NA> 92
str(commadata$AgeNumeric)## Factor w/ 4 levels "23.5","37","52.5",..: 2 2 2 1 NA 1 1 1 2 2 ...
# First ten entries:
t(t(head(commadata$AgeNumeric,10)))## [,1]
## [1,] 37
## [2,] 37
## [3,] 37
## [4,] 23.5
## [5,] <NA>
## [6,] 23.5
## [7,] 23.5
## [8,] 23.5
## [9,] 37
## [10,] 37
## Levels: 23.5 37 52.5 65
# Replace the above string values with their numeric equivalents
commadata$AgeNumeric <- as.numeric(levels(commadata$AgeNumeric))[commadata$AgeNumeric]
print("Age table (C):")## [1] "Age table (C):"
t(t(table(commadata$AgeNumeric,useNA = "always")))##
## [,1]
## 23.5 221
## 37 254
## 52.5 290
## 65 272
## <NA> 92
str(commadata$AgeNumeric)## num [1:1129] 37 37 37 23.5 NA 23.5 23.5 23.5 37 37 ...
# First ten entries:
t(t(head(commadata$AgeNumeric,10)))## [,1]
## [1,] 37.0
## [2,] 37.0
## [3,] 37.0
## [4,] 23.5
## [5,] NA
## [6,] 23.5
## [7,] 23.5
## [8,] 23.5
## [9,] 37.0
## [10,] 37.0
Income variable: resequence the levels and create a numeric equivalent based upon the band midpoints.
t(t(head(commadata$Income,10)))
## [,1]
## [1,] $50,000 - $99,999
## [2,] $50,000 - $99,999
## [3,] <NA>
## [4,] <NA>
## [5,] <NA>
## [6,] $25,000 - $49,999
## [7,] $0 - $24,999
## [8,] $25,000 - $49,999
## [9,] $50,000 - $99,999
## [10,] $150,000+
## 5 Levels: $0 - $24,999 $100,000 - $149,999 ... $50,000 - $99,999
# Save the income bands (for later)
commadata$IncomeBands <- commadata$Income
# What are the original levels on the income variable?
t(t(levels(commadata$Income)))## [,1]
## [1,] "$0 - $24,999"
## [2,] "$100,000 - $149,999"
## [3,] "$150,000+"
## [4,] "$25,000 - $49,999"
## [5,] "$50,000 - $99,999"
### use fct_relevel from library forcats to sort the income levels ordinally
commadata$IncomeBands = fct_relevel(commadata$IncomeBands, levels(commadata$IncomeBands)[c(1,4,5,2,3)])
t(t(levels(commadata$IncomeBands)))## [,1]
## [1,] "$0 - $24,999"
## [2,] "$25,000 - $49,999"
## [3,] "$50,000 - $99,999"
## [4,] "$100,000 - $149,999"
## [5,] "$150,000+"
### Make sure the results are still same
print("Income bands:")## [1] "Income bands:"
t(t(table(commadata$IncomeBands,useNA = "always")))##
## [,1]
## $0 - $24,999 121
## $25,000 - $49,999 158
## $50,000 - $99,999 290
## $100,000 - $149,999 164
## $150,000+ 103
## <NA> 293
t(t(head(commadata$IncomeBands,10)))## [,1]
## [1,] $50,000 - $99,999
## [2,] $50,000 - $99,999
## [3,] <NA>
## [4,] <NA>
## [5,] <NA>
## [6,] $25,000 - $49,999
## [7,] $0 - $24,999
## [8,] $25,000 - $49,999
## [9,] $50,000 - $99,999
## [10,] $150,000+
## 5 Levels: $0 - $24,999 $25,000 - $49,999 ... $150,000+
#### Let's replace the above income ranges with their midpoints, so we can treat income as a numeric variable rather than as categorical (for $150,000+ , we'll arbitrarily use $160,000 for the right-censored data)
commadata$IncomeNumeric <- commadata$IncomeBands
levels(commadata$IncomeNumeric) <- c(12500,37500,75000,125000,160000)
print("Income table (B):")## [1] "Income table (B):"
t(t(table(commadata$IncomeNumeric,useNA = "always")))##
## [,1]
## 12500 121
## 37500 158
## 75000 290
## 125000 164
## 160000 103
## <NA> 293
str(commadata$IncomeNumeric)## Factor w/ 5 levels "12500","37500",..: 3 3 NA NA NA 2 1 2 3 5 ...
t(t(head(commadata$IncomeNumeric,10)))## [,1]
## [1,] 75000
## [2,] 75000
## [3,] <NA>
## [4,] <NA>
## [5,] <NA>
## [6,] 37500
## [7,] 12500
## [8,] 37500
## [9,] 75000
## [10,] 160000
## Levels: 12500 37500 75000 125000 160000
# Replace the values with their numeric equivalents
commadata$IncomeNumeric <- as.numeric(levels(commadata$IncomeNumeric))[commadata$IncomeNumeric]
print("Income table (C):")## [1] "Income table (C):"
t(t(table(commadata$IncomeNumeric,useNA = "always")))##
## [,1]
## 12500 121
## 37500 158
## 75000 290
## 125000 164
## 160000 103
## <NA> 293
str(commadata$IncomeNumeric)## num [1:1129] 75000 75000 NA NA NA 37500 12500 37500 75000 160000 ...
t(t(head(commadata$IncomeNumeric,10)))## [,1]
## [1,] 75000
## [2,] 75000
## [3,] NA
## [4,] NA
## [5,] NA
## [6,] 37500
## [7,] 12500
## [8,] 37500
## [9,] 75000
## [10,] 160000
print("Incoming table of Education responses:")## [1] "Incoming table of Education responses:"
t(t(table(commadata$Education,useNA = "always")))##
## [,1]
## Bachelor degree 344
## Graduate degree 276
## High school degree 100
## Less than high school degree 11
## Some college or Associate degree 295
## <NA> 103
print("First 10 responses:")## [1] "First 10 responses:"
t(t(head(commadata$Education,10)))## [,1]
## [1,] Bachelor degree
## [2,] Graduate degree
## [3,] <NA>
## [4,] Less than high school degree
## [5,] <NA>
## [6,] Some college or Associate degree
## [7,] Some college or Associate degree
## [8,] Some college or Associate degree
## [9,] Graduate degree
## [10,] Bachelor degree
## 5 Levels: Bachelor degree Graduate degree ... Some college or Associate degree
# What are the original levels on the Education variable?
t(t(levels(commadata$Education)))## [,1]
## [1,] "Bachelor degree"
## [2,] "Graduate degree"
## [3,] "High school degree"
## [4,] "Less than high school degree"
## [5,] "Some college or Associate degree"
#[1,] "Bachelor degree"
#[2,] "Graduate degree"
#[3,] "High school degree"
#[4,] "Less than high school degree"
#[5,] "Some college or Associate degree"
### use fct_relevel from library forcats to sort the Education levels ordinally
commadata$Education = fct_relevel(commadata$Education, levels(commadata$Education)[c(4,3,5,1,2)])
t(t(levels(commadata$Education)))## [,1]
## [1,] "Less than high school degree"
## [2,] "High school degree"
## [3,] "Some college or Associate degree"
## [4,] "Bachelor degree"
## [5,] "Graduate degree"
t(t(summary(commadata$Education))) # resequenced to reflect ordering from "Less than high school degree" to "Graduate degree"## [,1]
## Less than high school degree 11
## High school degree 100
## Some college or Associate degree 295
## Bachelor degree 344
## Graduate degree 276
## NA's 103
### Make sure the results are still same
print("Education levels:")## [1] "Education levels:"
t(t(table(commadata$Education,useNA = "always")))##
## [,1]
## Less than high school degree 11
## High school degree 100
## Some college or Associate degree 295
## Bachelor degree 344
## Graduate degree 276
## <NA> 103
t(t(head(commadata$Education,10)))## [,1]
## [1,] Bachelor degree
## [2,] Graduate degree
## [3,] <NA>
## [4,] Less than high school degree
## [5,] <NA>
## [6,] Some college or Associate degree
## [7,] Some college or Associate degree
## [8,] Some college or Associate degree
## [9,] Graduate degree
## [10,] Bachelor degree
## 5 Levels: Less than high school degree ... Graduate degree
print("Incoming table of Location responses:")## [1] "Incoming table of Location responses:"
t(t(table(commadata$Location,useNA = "always")))##
## [,1]
## East North Central 170
## East South Central 43
## Middle Atlantic 140
## Mountain 87
## New England 73
## Pacific 180
## South Atlantic 164
## West North Central 82
## West South Central 88
## <NA> 102
print("First 10 responses:")## [1] "First 10 responses:"
t(t(head(commadata$Location,10)))## [,1]
## [1,] South Atlantic
## [2,] Mountain
## [3,] East North Central
## [4,] Middle Atlantic
## [5,] <NA>
## [6,] New England
## [7,] Pacific
## [8,] East North Central
## [9,] Mountain
## [10,] Pacific
## 9 Levels: East North Central East South Central ... West South Central
# What are the original levels on the Location variable?
t(t(levels(commadata$Location)))## [,1]
## [1,] "East North Central"
## [2,] "East South Central"
## [3,] "Middle Atlantic"
## [4,] "Mountain"
## [5,] "New England"
## [6,] "Pacific"
## [7,] "South Atlantic"
## [8,] "West North Central"
## [9,] "West South Central"
# [1,] "East North Central"
# [2,] "East South Central"
# [3,] "Middle Atlantic"
# [4,] "Mountain"
# [5,] "New England"
# [6,] "Pacific"
# [7,] "South Atlantic"
# [8,] "West North Central"
# [9,] "West South Central"
### use fct_relevel from library forcats to sort the Location levels so they reflect geography from East to West
commadata$Location = fct_relevel(commadata$Location, levels(commadata$Location)[c(5,3,7,1,2,8,9,4,6)])
t(t(levels(commadata$Location)))## [,1]
## [1,] "New England"
## [2,] "Middle Atlantic"
## [3,] "South Atlantic"
## [4,] "East North Central"
## [5,] "East South Central"
## [6,] "West North Central"
## [7,] "West South Central"
## [8,] "Mountain"
## [9,] "Pacific"
t(t(summary(commadata$Location))) # resequenced to reflect ordering from east coast to west coast
## [,1]
## New England 73
## Middle Atlantic 140
## South Atlantic 164
## East North Central 170
## East South Central 43
## West North Central 82
## West South Central 88
## Mountain 87
## Pacific 180
## NA's 102
### Make sure the results are still same
print("Location levels:")## [1] "Location levels:"
t(t(table(commadata$Location,useNA = "always")))##
## [,1]
## New England 73
## Middle Atlantic 140
## South Atlantic 164
## East North Central 170
## East South Central 43
## West North Central 82
## West South Central 88
## Mountain 87
## Pacific 180
## <NA> 102
t(t(head(commadata$Location,10)))## [,1]
## [1,] South Atlantic
## [2,] Mountain
## [3,] East North Central
## [4,] Middle Atlantic
## [5,] <NA>
## [6,] New England
## [7,] Pacific
## [8,] East North Central
## [9,] Mountain
## [10,] Pacific
## 9 Levels: New England Middle Atlantic ... Pacific
commadata$DATA_Sentence <- NULL
commadata$DATA_Plural <- NULL
commadata$DATA_Care <- NULL
# Dimension after dropping the above variables
dim(commadata)
## [1] 1129 13
summary(commadata)
## USES_Oxford HEARD_Oxford CARE_Oxford
## FALSE:488 No :444 Not at all:126
## TRUE :641 Yes :655 Not much :268
## NA's: 30 Some :414
## A lot :291
## NA's : 30
##
##
## Grammar_Important Gender
## Very unimportant : 5 Female:548
## Somewhat unimportant : 7 Male :489
## Neither important nor unimportant (neutral): 26 NA's : 92
## Somewhat important :333
## Very important :688
## NA's : 70
##
## Age Income
## > 60 :272 $0 - $24,999 :121
## 18-29:221 $100,000 - $149,999:164
## 30-44:254 $150,000+ :103
## 45-60:290 $25,000 - $49,999 :158
## NA's : 92 $50,000 - $99,999 :290
## NA's :293
##
## Education Location
## Less than high school degree : 11 Pacific :180
## High school degree :100 East North Central:170
## Some college or Associate degree:295 South Atlantic :164
## Bachelor degree :344 Middle Atlantic :140
## Graduate degree :276 West South Central: 88
## NA's :103 (Other) :285
## NA's :102
## AgeBands AgeNumeric IncomeBands IncomeNumeric
## 18-29:221 Min. :23.5 $0 - $24,999 :121 Min. : 12500
## 30-44:254 1st Qu.:37.0 $25,000 - $49,999 :158 1st Qu.: 37500
## 45-60:290 Median :52.5 $50,000 - $99,999 :290 Median : 75000
## > 60 :272 Mean :45.8 $100,000 - $149,999:164 Mean : 79148
## NA's : 92 3rd Qu.:65.0 $150,000+ :103 3rd Qu.:125000
## Max. :65.0 NA's :293 Max. :160000
## NA's :92 NA's :293
# Size of dataframe BEFORE dropping rows containing NAs:
dim(commadata)
## [1] 1129 13
## drop rows which contain any NA values
comma2 <- commadata[complete.cases(commadata),]
summary(comma2)
## USES_Oxford HEARD_Oxford CARE_Oxford
## FALSE:358 No :329 Not at all: 86
## TRUE :470 Yes:499 Not much :208
## Some :311
## A lot :223
##
##
##
## Grammar_Important Gender
## Very unimportant : 2 Female:437
## Somewhat unimportant : 5 Male :391
## Neither important nor unimportant (neutral): 21
## Somewhat important :267
## Very important :533
##
##
## Age Income
## > 60 :201 $0 - $24,999 :120
## 18-29:174 $100,000 - $149,999:163
## 30-44:205 $150,000+ :102
## 45-60:248 $25,000 - $49,999 :157
## $50,000 - $99,999 :286
##
##
## Education Location
## Less than high school degree : 8 Pacific :152
## High school degree : 82 South Atlantic :132
## Some college or Associate degree:228 East North Central:131
## Bachelor degree :281 Middle Atlantic :108
## Graduate degree :229 West South Central: 73
## West North Central: 68
## (Other) :164
## AgeBands AgeNumeric IncomeBands IncomeNumeric
## 18-29:174 Min. :23.5 $0 - $24,999 :120 Min. : 12500
## 30-44:205 1st Qu.:37.0 $25,000 - $49,999 :157 1st Qu.: 37500
## 45-60:248 Median :52.5 $50,000 - $99,999 :286 Median : 75000
## > 60 :201 Mean :45.6 $100,000 - $149,999:163 Mean : 79146
## 3rd Qu.:52.5 $150,000+ :102 3rd Qu.:125000
## Max. :65.0 Max. :160000
##
# Size of dataframe "comma2" AFTER dropping rows containing NAs:
dim(comma2)
## [1] 828 13
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
Is there any association between gender, age, income, educational attainment, or geographic region and whether an individual chooses to use the Oxford Comma? Which variables are the strongest predictors of whether an individual will prefer to use the Oxford Comma?
What are the cases, and how many are there?
The dataset contains 1129 cases, each representing a response to an online poll conducted in June 2014, in which participants were asked various questions, including whether they prefer the Oxford Comma, whether they had heard of it, how much they care about its use, whether they treat the word “data” as singular or plural, and how important they consider proper grammar.
Additionally, participants were asked questions regarding their gender, age, income, educational attainment, and geographic region.
Describe the method of data collection.
Data was collected using an online poll on the SurveyMonkey.com platform. The survey was open for three days (June 3-5, 2014). There were 1129 participants; however, not everyone answered every question. A total of 825 respondents answered all questions, and excluding the questions pertaining to whether “data” is singular or plural, 828 respondents answered all remaining questions.
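As a quick check of those counts, here is a sketch that re-reads the raw CSV so that the three “data” columns, which are dropped later in the cleanup, are still available; columns 5-7 are the three “data” questions per the header list shown earlier.
raw <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/comma-survey/comma-survey.csv",
                na = "", stringsAsFactors = TRUE)
sum(complete.cases(raw))             # respondents answering every question; should match the 825 quoted above
sum(complete.cases(raw[, -(5:7)]))   # excluding the three "data" questions; should match the 828 quoted above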
What type of study is this (observational/experiment)?
This is an observational study.
If you collected the data, state self-collected. If not, provide a citation/link.
In June of 2014, FiveThirtyEight.com ran an online poll on SurveyMonkey.com asking Americans whether they preferred the serial comma (also known as the Oxford Comma).
Additional questions were posed regarding the respondents’ educational level, income level, age, and what part of the country each was from.
Additional grammar questions in the same poll concerned usage of the word “data”: respondents were asked whether they consider “data” to be singular or plural.
Information on the study is here: https://fivethirtyeight.com/features/elitist-superfluous-or-popular-we-polled-americans-on-the-oxford-comma/
The data can be sourced from here: https://github.com/fivethirtyeight/data/tree/master/comma-survey
What is the response variable? Is it quantitative or qualitative?
The response variable is USES_Oxford, a qualitative (True/False) variable indicating whether a respondent prefers the use of the Oxford Comma.
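For reference, a small sketch of the baseline preference rate in the cleaned data (using the comma2 data frame prepared above):
prop.table(table(comma2$USES_Oxford))  # proportion not preferring (FALSE) vs. preferring (TRUE) the Oxford Comma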
You should have two independent variables, one quantitative and one qualitative.
I am going to implement stepwise logistic regression on all of the variables to determine which are most predictive of whether an individual prefers the Oxford Comma. (I will omit the variables pertaining to whether “data” is singular or plural, as those represent a side question.) A sketch of the planned model appears at the end of this section.
Most importantly, I’ll explore whether there is any association between preference for the Oxford Comma and participant Age, Income, Gender, Education, and geographic Location.
The variables Age and Income are quantitative, but not continuous, as the responses are given in bands rather than exact values.
Gender and Location are qualitative, non-ordinal variables.
Education is a qualitative variable, but it is ordinal as the various levels of education can be ranked.
My conjecture is that users of the Oxford Comma will tend to be younger, higher in income, and higher in educational attainment than non-users.
I do not have any prior expectation that Gender would impact the results.
It would be interesting to see whether Location has any impact on the results, as regional dialects can influence English usage in different parts of the country.
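As a minimal sketch of the planned approach (assuming the comma2 data frame prepared above; AIC-based selection via step() is one reasonable stepwise criterion, not the only option):
# Fit a full logistic regression model, then let step() add or drop terms by AIC
full_model <- glm(USES_Oxford ~ Gender + AgeNumeric + IncomeNumeric + Education + Location,
                  data = comma2, family = binomial)
step_model <- step(full_model, direction = "both", trace = 0)
summary(step_model)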
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g., scatter plots, boxplots, etc.). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
t(t(summary(comma2$USES_Oxford)))## [,1]
## FALSE 358
## TRUE 470
barplot(table(comma2$USES_Oxford),col=c("red","green"),
xlab = "Do you prefer to use the Oxford Comma?", ylab = "Frequency",
main = "Oxford Comma usage among respondents to FiveThirtyEight.com poll")t(t(summary(comma2$HEARD_Oxford)))## [,1]
## No 329
## Yes 499
barplot(table(comma2$HEARD_Oxford),col=c("red","green"),
xlab = "Have you previously heard about the Oxford Comma?", ylab = "Frequency",
main = "Oxford Comma awareness among respondents to FiveThirtyEight.com poll")t(t(summary(comma2$CARE_Oxford))) # already resequenced above to reflect ordering from "A lot" to "Not at all"## [,1]
## Not at all 86
## Not much 208
## Some 311
## A lot 223
### Barplot of degree to which respondent cares about the Oxford Comma
barplot(table(comma2$CARE_Oxford),col=rainbow(4),
xlab = "How much do you care?", ylab = "Frequency",
main = "How much do respondents care about the 'Oxford Comma' ?")# Have you previously heard about the Oxford Comma vs. how much do you care about it?
table(comma2$CARE_Oxford,comma2$HEARD_Oxford)##
## No Yes
## Not at all 60 26
## Not much 98 110
## Some 127 184
## A lot 44 179
mosaicplot(comma2$HEARD_Oxford ~ comma2$CARE_Oxford,col=rainbow(4),las=1,
xlab="Have you (previously) heard of the Oxford Comma?",
ylab="Do you care?",
main="'Have you heard' vs. 'Do you care?'")Respondents who have previously heard of the Oxford Comma are more likely to care about it, while those who have not previously heard about it are less likely.
t(t(summary(comma2$Grammar_Important))) # already resequenced above to reflect ordering## [,1]
## Very unimportant 2
## Somewhat unimportant 5
## Neither important nor unimportant (neutral) 21
## Somewhat important 267
## Very important 533
### Barplot of importance of correct grammar
barplot_names = levels(comma2$Grammar_Important)
barplot_names[3]="Neutral" # otherwise this entry is too long
barplot(table(comma2$Grammar_Important),col=rainbow(5),names.arg = barplot_names,
xlab = "Opinion", ylab = "Frequency",
main = "Do you believe that correct grammar is important?")t(t(summary(comma2$Gender))) ## [,1]
## Female 437
## Male 391
# Counts of Oxford Comma preference by gender
table(comma2$USES_Oxford, comma2$Gender)
##
## Female Male
## FALSE 187 171
## TRUE 250 220
# Proportion within each row (preference group), by gender
table(comma2$USES_Oxford, comma2$Gender)/rowSums(table(comma2$USES_Oxford, comma2$Gender))
##
## Female Male
## FALSE 0.5223464 0.4776536
## TRUE 0.5319149 0.4680851
# Proportion within each column (gender) preferring the Oxford Comma
t(t(table(comma2$USES_Oxford, comma2$Gender))/colSums(table(comma2$USES_Oxford, comma2$Gender)))
##
## Female Male
## FALSE 0.4279176 0.4373402
## TRUE 0.5720824 0.5626598
mosaicplot(comma2$Gender~ comma2$USES_Oxford, col=c("Red","Green"),
xlab="Gender", ylab="Uses Oxford Comma?",
main="Use of Oxford Comma by Gender")The percentage of each gender using the Oxford Comma is about the same.
t(t(table(comma2$AgeBands,useNA = "ifany"))) # already resequenced to reflect ordering of age bands
##
## [,1]
## 18-29 174
## 30-44 205
## 45-60 248
## > 60 201
### Barplot of age bands
barplot(table(comma2$AgeBands),col=rainbow(4),space = 1,
xlab = "Age Bands", ylab = "Frequency",
main = "Age range of respondents to 'Oxford Comma' survey")t(t(table(comma2$IncomeBands,useNA = "ifany"))) # resequenced to reflect ordering of income bands##
## [,1]
## $0 - $24,999 120
## $25,000 - $49,999 157
## $50,000 - $99,999 286
## $100,000 - $149,999 163
## $150,000+ 102
### Barplot of income bands
barplot(table(comma2$IncomeBands),col=rainbow(5),space = 1,
xlab = "Income Ranges", ylab = "Frequency",
main = "Income range of respondents to 'Oxford Comma' survey")The above results appear to resemble a normal distribution.
t(t(summary(comma2$Education))) # already resequenced to reflect ordering of educational attainment## [,1]
## Less than high school degree 8
## High school degree 82
## Some college or Associate degree 228
## Bachelor degree 281
## Graduate degree 229
### Barplot of educational attainment
barplot(table(comma2$Education),col=rainbow(5),space = 1,
xlab = "Educational Attainment", ylab = "Frequency",
main = "Educational Attainment of respondents to 'Oxford Comma' survey")The data is left-skewed, with the median respondent having a bachelor’s degree.
# Where is the respondent located?
t(t(summary(comma2$Location))) # already resequenced to reflect geography (east to west)
## [,1]
## New England 59
## Middle Atlantic 108
## South Atlantic 132
## East North Central 131
## East South Central 38
## West North Central 68
## West South Central 73
## Mountain 67
## Pacific 152
EastCoast = summary(comma2$Location)[c(1,2,3)]
MiddleUSA = summary(comma2$Location)[c(4,5,6,7)]
WesternUSA = summary(comma2$Location)[c(8,9)]
The respondents are geographically well-distributed across the USA, with 299 from the East Coast, 310 from the central portion of the country, and 219 from the West.
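Those regional totals can be confirmed directly from the three groupings defined above (a small sketch):
sum(EastCoast)   # New England + Middle Atlantic + South Atlantic
sum(MiddleUSA)   # East/West North Central + East/West South Central
sum(WesternUSA)  # Mountain + Pacific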
# Table of Oxford Comma Usage by region
table(comma2$Location, comma2$USES_Oxford)
##
## FALSE TRUE
## New England 26 33
## Middle Atlantic 62 46
## South Atlantic 47 85
## East North Central 44 87
## East South Central 15 23
## West North Central 31 37
## West South Central 30 43
## Mountain 27 40
## Pacific 76 76
# Mosaic Plot
mosaicplot(data=comma2, USES_Oxford ~ Location, col=rainbow(9), las=1, main = "Oxford Comma Usage by region")
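As a follow-up to the mosaic plot, here is a sketch of one possible formal test: a chi-squared test of independence between region and Oxford Comma preference (using comma2 as prepared above).
chisq.test(table(comma2$Location, comma2$USES_Oxford))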