gss_cat <- read.csv("/Users/eunseokim/Desktop/gss_cat.csv", stringsAsFactors = TRUE)
Explanation
Converting character strings to factors is useful here because many
variables in this dataset are categorical, such as marital,
race, and rincome.
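As a small illustration of this point, the sketch below reads a made-up three-row CSV (not the GSS data) through the `text` argument of `read.csv()` and compares the column classes with and without `stringsAsFactors = TRUE`. The rows and the names `csv_text`, `as_chr`, and `as_fct` are invented for the example.

```r
# Toy CSV supplied inline; the rows are invented for illustration only.
csv_text <- "marital,age\nMarried,34\nDivorced,51\nMarried,28"

as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$marital)   # text column stays character
class(as_fct$marital)   # text column becomes a factor
levels(as_fct$marital)  # levels are sorted alphabetically
class(as_fct$age)       # numeric columns are not converted to factors
```

With factors, functions such as `table()` and `levels()` work directly on the categories, which is what the frequency tables below rely on.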
table(gss_cat$marital)
##
##      Divorced       Married Never married     No answer     Separated
##          3383         10117          5416            17           743
##       Widowed
##          1807
table(gss_cat$race)
##
## Black Other White
##  3129  1959 16395
missing_counts <- sapply(gss_cat, function(x) sum(is.na(x)))
print(missing_counts)
##       X    year marital     age    race rincome partyid   relig   denom tvhours
##       0       0       0      76       0       0       0       0       0   10146
names(which(missing_counts > 0))
## [1] "age" "tvhours"
gss_cat$tvhours <- ifelse(is.na(gss_cat$tvhours),
                          mean(gss_cat$tvhours, na.rm = TRUE),
                          gss_cat$tvhours)
Explanation
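Replacing missing values with the mean is a reasonable choice for
tvhours because it is a numeric variable measured in hours, and the
imputation leaves the variable's mean unchanged. Note, however, that
10,146 of the 21,483 rows (about 47%) are missing tvhours, so the
imputed column will show much less spread than the observed values.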
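To see what mean imputation does to a variable, here is a minimal sketch on a made-up vector (the values are not from the GSS data). It uses the same `ifelse()` pattern as above and also keeps a flag for which entries were filled in. The names `hours`, `was_missing`, and `hours_imputed` are hypothetical.

```r
# Invented values standing in for a numeric column with missing entries.
hours <- c(2, 4, NA, 0, NA, 6)

# Record which entries were missing before they are overwritten.
was_missing <- is.na(hours)

# Fill NA with the mean of the observed values.
hours_imputed <- ifelse(was_missing, mean(hours, na.rm = TRUE), hours)

mean(hours, na.rm = TRUE)  # mean of the observed values
mean(hours_imputed)        # unchanged by mean imputation
sd(hours, na.rm = TRUE)    # spread of the observed values
sd(hours_imputed)          # smaller, because filled values sit at the mean
```

Keeping `was_missing` alongside the imputed column makes it possible to tell real observations from filled-in ones later in the analysis.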
5. Discussion about replacing missing values for all variables
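For numeric variables such as tvhours and age, the only two columns
with missing values, replacing missing values with the mean is feasible.
For categorical variables such as marital and race, a mean is
undefined, so replacing with the mode (the most frequent level) would
be more appropriate. In this dataset the categorical columns contain
no NA values; non-response appears instead as explicit levels such as
"No answer" in marital.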
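Base R has no built-in function for the statistical mode, so the following is only a minimal sketch. The helper `impute_mode()` and the vector `marital_toy` are hypothetical, and the data are invented rather than taken from the GSS columns.

```r
# Hypothetical helper: replace NA in a factor with its most frequent level.
impute_mode <- function(x) {
  counts <- table(x)                              # table() drops NA by default
  mode_level <- names(counts)[which.max(counts)]  # first level with the top count
  x[is.na(x)] <- mode_level                       # mode_level is an existing level
  x
}

# Invented example data with two missing entries.
marital_toy <- factor(c("Married", "Divorced", NA, "Married", NA))
table(impute_mode(marital_toy))
```

If two levels are tied for the highest count, `which.max()` picks the first one in level order, which is an arbitrary choice worth keeping in mind.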