suppressPackageStartupMessages(library("tidyverse"))
tvhours. Is the mean a good summary?
summary(gss_cat[["tvhours"]])
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 1.000 2.000 2.981 4.000 24.000 10146
gss_cat %>%
  filter(!is.na(tvhours)) %>%
  ggplot(aes(x = tvhours)) +
  geom_histogram(binwidth = 1)
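Whether the mean is a good summary depends on what you are using it for :-), i.e. your objective. The distribution is right-skewed (the mean of 2.98 sits above the median of 2, and the maximum is 24 hours), so the median is probably what most people would prefer. The hours of TV watched don’t look that surprising to me.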
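To put numbers on the mean-versus-median comparison, here is a minimal sketch that computes both with missing values dropped. It assumes the forcats package (which provides gss_cat) is installed; the names tv, tv_mean, and tv_median are only illustrative.

```r
library(forcats)  # provides the gss_cat data set

tv <- gss_cat[["tvhours"]]

# Both summaries need na.rm = TRUE because tvhours contains missing values.
tv_mean <- mean(tv, na.rm = TRUE)
tv_median <- median(tv, na.rm = TRUE)

# A mean above the median is consistent with a right-skewed distribution.
c(mean = tv_mean, median = tv_median)
```

A few respondents reporting very high values pull the mean upward, while the median is unaffected by how extreme the tail is.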
gss_cat identify whether the order of the levels is arbitrary or principled.
keep(gss_cat, is.factor) %>% names()
[1] "marital" "race" "rincome" "partyid" "relig" "denom"
There are six categorical variables: marital, race, rincome, partyid, relig, and denom.
The ordering of marital is “somewhat principled”. There is some logic to it: the levels are grouped as “Never married”, then married at some point (“Separated”, “Divorced”, “Widowed”), and finally “Married”; though “Never married”, “Divorced”, “Widowed”, “Separated”, “Married” might seem more natural. I find that whether a factor counts as ordered often depends on the level of aggregation of its categories, and that there are more “partially ordered” factors than one would expect.
levels(gss_cat[["marital"]])
[1] "No answer" "Never married" "Separated" "Divorced" "Widowed"
[6] "Married"
gss_cat %>%
  ggplot(aes(x = marital)) +
  geom_bar()
The ordering of race is principled in that the categories are ordered by increasing count of observations in the data; “Not applicable”, which has no observations, is placed last.
levels(gss_cat$race)
[1] "Other" "Black" "White" "Not applicable"
gss_cat %>%
  ggplot(aes(race)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)
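The claim about race being ordered by count can be checked without a plot. The sketch below assumes forcats is installed and uses fct_count(), which tallies every level, including empty ones; race_counts is an illustrative name.

```r
library(forcats)

# fct_count() returns one row per level (columns f and n),
# including levels with no observations.
race_counts <- fct_count(gss_cat$race)
race_counts

# FALSE here means the counts never decrease when read in level order,
# ignoring the trailing "Not applicable" level.
is.unsorted(head(race_counts$n, -1))
```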
The levels of rincome are ordered by decreasing income; however, the placement of “No answer”, “Don’t know”, and “Refused” before the income levels and “Not applicable” after them is arbitrary. It would be better to place all the missing-income categories together, either before or after all the known values.
levels(gss_cat$rincome)
[1] "No answer" "Don't know" "Refused" "$25000 or more" "$20000 - 24999"
[6] "$15000 - 19999" "$10000 - 14999" "$8000 to 9999" "$7000 to 7999" "$6000 to 6999"
[11] "$5000 to 5999" "$4000 to 4999" "$3000 to 3999" "$1000 to 2999" "Lt $1000"
[16] "Not applicable"
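One way to group the non-response levels of rincome is to move “Not applicable” to the front with fct_relevel(). This is only a sketch of one option, assuming forcats is installed; rincome_grouped is an illustrative name.

```r
library(forcats)

# Move "Not applicable" to the front so that it sits with the other
# non-response levels, ahead of the income brackets.
rincome_grouped <- fct_relevel(gss_cat$rincome, "Not applicable")

# The first four levels are now the four non-response categories.
head(levels(rincome_grouped), 4)
```

The income brackets keep their original relative order; only the one level is moved.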
The order of the levels of relig is arbitrary: there is no natural ordering, and the levels don’t appear to be ordered by any statistic within the dataset.
levels(gss_cat$relig)
[1] "No answer" "Don't know" "Inter-nondenominational"
[4] "Native american" "Christian" "Orthodox-christian"
[7] "Moslem/islam" "Other eastern" "Hinduism"
[10] "Buddhism" "Other" "None"
[13] "Jewish" "Catholic" "Protestant"
[16] "Not applicable"
gss_cat %>%
  ggplot(aes(relig)) +
  geom_bar() +
  coord_flip()
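A rough check that the relig levels are not ordered by frequency is to test whether the per-level counts are monotone in level order. This sketch assumes forcats is installed and only tests ordering by count, not by any other statistic; relig_counts is an illustrative name.

```r
library(forcats)

relig_counts <- fct_count(gss_cat$relig)

# If the levels were ordered by frequency, the counts would be monotone
# in level order. is.unsorted() returns TRUE when a vector is not
# increasing; checking the reversed vector covers the decreasing case.
is.unsorted(relig_counts$n) && is.unsorted(rev(relig_counts$n))
```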
The same goes for denom.
levels(gss_cat$denom)
[1] "No answer" "Don't know" "No denomination" "Other"
[5] "Episcopal" "Presbyterian-dk wh" "Presbyterian, merged" "Other presbyterian"
[9] "United pres ch in us" "Presbyterian c in us" "Lutheran-dk which" "Evangelical luth"
[13] "Other lutheran" "Wi evan luth synod" "Lutheran-mo synod" "Luth ch in america"
[17] "Am lutheran" "Methodist-dk which" "Other methodist" "United methodist"
[21] "Afr meth ep zion" "Afr meth episcopal" "Baptist-dk which" "Other baptists"
[25] "Southern baptist" "Nat bapt conv usa" "Nat bapt conv of am" "Am bapt ch in usa"
[29] "Am baptist asso" "Not applicable"
Ignoring “No answer”, “Don’t know”, and “Other party”, the levels of partyid are ordered from “Strong Republican” to “Strong Democrat”.
levels(gss_cat$partyid)
[1] "No answer" "Don't know" "Other party" "Strong republican"
[5] "Not str republican" "Ind,near rep" "Independent" "Ind,near dem"
[9] "Not str democrat" "Strong democrat"
Because that gives the level “Not applicable” an integer value of 1.
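The integer value can be inspected directly. The sketch below assumes the level was moved to the front with forcats::fct_relevel() on rincome (an assumption, since the releveling code is not shown here); before and moved are illustrative names.

```r
library(forcats)

before <- gss_cat$rincome
moved <- fct_relevel(before, "Not applicable")

# Position of "Not applicable" among the levels before and after the move.
match("Not applicable", levels(before))
match("Not applicable", levels(moved))

# Integer codes stored for the "Not applicable" observations after the move.
unique(as.integer(moved[which(before == "Not applicable")]))
```

A discrete axis places levels by these integer positions, so a level with value 1 lands at the origin end of the axis.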