DSCI-D590 Applied Data Science: Homework 1

Step 1: Data

Data Origin:

Dataset: Adult aka Census Income
Release Date: 4/30/1996
Authors: Barry Becker and Ronny Kohavi
Data Location: UCI Data Repository
URL: https://archive.ics.uci.edu/dataset/2/adult
Summary: Census data on the demographics and incomes of adults in the U.S. based on the 1994 U.S. Census Database.

Import Data

colHeaders <- c("age",
              "workclass",
              "fnlwgt",
              "education",
              "education_num",
              "marital_status",
              "occupation",
              "relationship",
              "race",
              "sex",
              "capital_gain",
              "capital_loss",
              "hours_per_week",
              "native_country",
              "salary")

colClass <- c("numeric", #age
           "factor", #workclass
           "numeric", #fnlwg
           "factor", #education
           "numeric", #education-num
           "factor", #marital-status
           "factor", #occupation
           "factor", #relationship
           "factor", #race
           "factor", #sex
           "numeric", #capital-gain
           "numeric", #capital-loss
           "numeric", #hours-per-week
           "factor", #native-country
           "factor" # Salary (> or <= 50k)
)

# All missing values in the dataset are denoted with " ?" in the original data. These will be replaced with NA. 
data <- read.table("data/adult/adult.data", sep=",", col.names = colHeaders, na.strings=c(" ?"), colClasses = colClass)

Display First 5 Rows

head(data, 5)

##   age         workclass fnlwgt  education education_num      marital_status
## 1  39         State-gov  77516  Bachelors            13       Never-married
## 2  50  Self-emp-not-inc  83311  Bachelors            13  Married-civ-spouse
## 3  38           Private 215646    HS-grad             9            Divorced
## 4  53           Private 234721       11th             7  Married-civ-spouse
## 5  28           Private 338409  Bachelors            13  Married-civ-spouse
##           occupation   relationship   race     sex capital_gain capital_loss
## 1       Adm-clerical  Not-in-family  White    Male         2174            0
## 2    Exec-managerial        Husband  White    Male            0            0
## 3  Handlers-cleaners  Not-in-family  White    Male            0            0
## 4  Handlers-cleaners        Husband  Black    Male            0            0
## 5     Prof-specialty           Wife  Black  Female            0            0
##   hours_per_week native_country salary
## 1             40  United-States  <=50K
## 2             13  United-States  <=50K
## 3             40  United-States  <=50K
## 4             40  United-States  <=50K
## 5             40           Cuba  <=50K

Step 2: Data Quality

Data Size: 32561 rows x 15 columns

dim(data)

## [1] 32561    15

Data Types

# Check data types
str(data)

## 'data.frame':    32561 obs. of  15 variables:
##  $ age           : num  39 50 38 53 28 37 49 52 31 42 ...
##  $ workclass     : Factor w/ 8 levels " Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
##  $ fnlwgt        : num  77516 83311 215646 234721 338409 ...
##  $ education     : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
##  $ education_num : num  13 13 9 7 13 14 5 9 14 13 ...
##  $ marital_status: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
##  $ occupation    : Factor w/ 14 levels " Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
##  $ relationship  : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
##  $ race          : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex           : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capital_gain  : num  2174 0 0 0 0 ...
##  $ capital_loss  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours_per_week: num  40 13 40 40 40 40 16 45 50 40 ...
##  $ native_country: Factor w/ 41 levels " Cambodia"," Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
##  $ salary        : Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 1 1 2 2 2 ...

Overall Descriptive Values

for (col in names(data)){
  cat("\nSummary of", col, ":\n")
  print(summary(data[[col]]))
}

## 
## Summary of age :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   28.00   37.00   38.58   48.00   90.00 
## 
## Summary of workclass :
##       Federal-gov         Local-gov      Never-worked           Private 
##               960              2093                 7             22696 
##      Self-emp-inc  Self-emp-not-inc         State-gov       Without-pay 
##              1116              2541              1298                14 
##              NA's 
##              1836 
## 
## Summary of fnlwgt :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12285  117827  178356  189778  237051 1484705 
## 
## Summary of education :
##          10th          11th          12th       1st-4th       5th-6th 
##           933          1175           433           168           333 
##       7th-8th           9th    Assoc-acdm     Assoc-voc     Bachelors 
##           646           514          1067          1382          5355 
##     Doctorate       HS-grad       Masters     Preschool   Prof-school 
##           413         10501          1723            51           576 
##  Some-college 
##          7291 
## 
## Summary of education_num :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    9.00   10.00   10.08   12.00   16.00 
## 
## Summary of marital_status :
##               Divorced      Married-AF-spouse     Married-civ-spouse 
##                   4443                     23                  14976 
##  Married-spouse-absent          Never-married              Separated 
##                    418                  10683                   1025 
##                Widowed 
##                    993 
## 
## Summary of occupation :
##       Adm-clerical       Armed-Forces       Craft-repair    Exec-managerial 
##               3770                  9               4099               4066 
##    Farming-fishing  Handlers-cleaners  Machine-op-inspct      Other-service 
##                994               1370               2002               3295 
##    Priv-house-serv     Prof-specialty    Protective-serv              Sales 
##                149               4140                649               3650 
##       Tech-support   Transport-moving               NA's 
##                928               1597               1843 
## 
## Summary of relationship :
##         Husband   Not-in-family  Other-relative       Own-child       Unmarried 
##           13193            8305             981            5068            3446 
##            Wife 
##            1568 
## 
## Summary of race :
##  Amer-Indian-Eskimo  Asian-Pac-Islander               Black               Other 
##                 311                1039                3124                 271 
##               White 
##               27816 
## 
## Summary of sex :
##  Female    Male 
##   10771   21790 
## 
## Summary of capital_gain :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0    1078       0   99999 
## 
## Summary of capital_loss :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0    87.3     0.0  4356.0 
## 
## Summary of hours_per_week :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   40.00   40.00   40.44   45.00   99.00 
## 
## Summary of native_country :
##                    Cambodia                      Canada 
##                          19                         121 
##                       China                    Columbia 
##                          75                          59 
##                        Cuba          Dominican-Republic 
##                          95                          70 
##                     Ecuador                 El-Salvador 
##                          28                         106 
##                     England                      France 
##                          90                          29 
##                     Germany                      Greece 
##                         137                          29 
##                   Guatemala                       Haiti 
##                          64                          44 
##          Holand-Netherlands                    Honduras 
##                           1                          13 
##                        Hong                     Hungary 
##                          20                          13 
##                       India                        Iran 
##                         100                          43 
##                     Ireland                       Italy 
##                          24                          73 
##                     Jamaica                       Japan 
##                          81                          62 
##                        Laos                      Mexico 
##                          18                         643 
##                   Nicaragua  Outlying-US(Guam-USVI-etc) 
##                          34                          14 
##                        Peru                 Philippines 
##                          31                         198 
##                      Poland                    Portugal 
##                          60                          37 
##                 Puerto-Rico                    Scotland 
##                         114                          12 
##                       South                      Taiwan 
##                          80                          51 
##                    Thailand             Trinadad&Tobago 
##                          18                          19 
##               United-States                     Vietnam 
##                       29170                          67 
##                  Yugoslavia                        NA's 
##                          16                         583 
## 
## Summary of salary :
##  <=50K   >50K 
##  24720   7841

Assess Missing Data

The missing values in the original dataset were denoted with ” ?” (a space + ?). These were converted to NA values in the initial data import.

Quantitative Assessment of Missing Values

Count Total Number of NAs = 4262

sum(is.na(data)==TRUE)

## [1] 4262

Number of rows that have complete data and number of rows that do not (e.g., contain NA values).

table(complete.cases(data))

## 
## FALSE  TRUE 
##  2399 30162

Percentage of complete rows and % of incomplete rows (e.g., contain NA values)

prop.table(table(complete.cases(data))) * 100

## 
##    FALSE     TRUE 
##  7.36771 92.63229

Table of the number of missing values in a case:

miss_case_table(data)

## # A tibble: 4 × 3
##   n_miss_in_case n_cases pct_cases
##            <int>   <int>     <dbl>
## 1              0   30162   92.6   
## 2              1     563    1.73  
## 3              2    1809    5.56  
## 4              3      27    0.0829

Counts and percentages of missing values in each variable:

miss_var_summary(data)

## # A tibble: 15 × 3
##    variable       n_miss pct_miss
##    <chr>           <int>    <num>
##  1 occupation       1843     5.66
##  2 workclass        1836     5.64
##  3 native_country    583     1.79
##  4 age                 0     0   
##  5 fnlwgt              0     0   
##  6 education           0     0   
##  7 education_num       0     0   
##  8 marital_status      0     0   
##  9 relationship        0     0   
## 10 race                0     0   
## 11 sex                 0     0   
## 12 capital_gain        0     0   
## 13 capital_loss        0     0   
## 14 hours_per_week      0     0   
## 15 salary              0     0

Visual Assessment of Missing Data

Visualization of the whole dataset’s classes and missing data:

vis_dat(data)

More in-depth visualization of missing data:

vis_miss(data)

gg_miss_var(data) + labs(y="Count of Missing Values")

Summary of Missing Data

Out of the 32,651 total cases, there are 30,162 complete cases, about 93% of the data, and 2399 incomplete cases, around 7% of the data. Overall, there are a total of 4262 missing values in the data set, meaning many of the cases with missing information are missing more than one variable. As seen in the table created from the miss_case_table() function, most of the incomplete cases are missing 2 variables. Of the 15 variables, three account for all of the missing values in the data: native_country, workclass, and occupation. Given that workclass and occupation both account for the majority of missing values, these are the two variables most likely to be missing in the incomplete cases.

Step 3: Data Visualization

One important thing to take into account before tackling Data Visualization is to ensure that the classes of each variable are correct. In the initial data importation, if the variables were left as is, the variables would all imported as “integer” or “character”. By converting necessary groups to a factor class, we are able to work with some variables as categorical variables in the visualization step.

Salary: The Dependent Variable

The purpose of the Adult data set is to predict whether a person’s income will exceed $50K/year making salary is the dependent variable. The barplot below shows the count of those in the data set who make less than or equal to $50K and those that make more. As we can see, the number of those who make less than or equal to $50K far exceeds the number making more than $50K.

p <- ggplot(data, aes(x = salary)) +
  geom_bar(fill="salmon") +
  ggtitle("Barplot of Salary") +
  xlab("Salary")

p

Age

A boxplot of the age data gives us a quick summary about the distribution of ages in the data set. The youngest person in the data set is just under 20 years old, while the oldest is over 80. The median age is 37. Half of the people in the data set, represented by the interquartile range, are between 28 and 48 years old. While there are no outliers representing young people (according to metadata, the data extracted from the census excluded those under the age of 16), there are several outliers who are over 80 years old.

summary(data$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   28.00   37.00   38.58   48.00   90.00

boxplot(data$age, col="salmon",
        main = "Boxplot of Age")

The histogram below gives us more detail on the distribution of ages in the dataset. Here, we have grouped age by salary to give us even more information. We can surmise several things from the plot:

As seen in the barplot above showing the tabulation of salary, those making <=50K outweights those making >50K.
The distribution is positively skewed to the right. The majority of people are less than 50 years old.
There are a good amount of outliers in the data who are older than 75.

p <- ggplot(data) +
  geom_histogram(aes(x=age, group =salary, fill=salary), 
                 binwidth=1, color="black") +    #(Grouped, stacked and percent stacked barplot in ggplot2, 2018) 
  ggtitle("Distribution of Age, Grouped by Salary") +
  xlab("Age") 
p

Looking at a side-by-side boxplot of Age grouped by salary, we can see that those who make over $50K a year tended to be older than those who did not. The median age for those making over $50K was older than the mean age of 39, while the median age for those making less than or equal to $50K, was younger. Both groups had outliers older than the Social Security Retirement Age.

boxplot(age ~ salary, data=data, col="salmon",
        main = "Boxplot of Age by Salary",
        xlab = "Salary",
        ylab = "Age")
abline(h=mean(data$age), col="blue",lwd=3)
text(x=2.33, y = 32, 'Mean Age (38.5)',col="blue")
abline(h=67, col="blue", lwd=3)
text(x=2.08, y = 62, 'Social Security Retirement Age (67)',col="blue")

Final Weight

The Final Weight estimate refers to population totals derived from CPS by creating “weighted tallies” of any specified socio-economic characteristics of the population. People with similar demographic characteristics should have similar weights.

The 3 sets of control are:

A single cell estimate of the population 16+ for each state.
Controls for Hispanic Origin by age and sex.
Controls by Race, age and sex.

Below are a side-by-side boxplot and a histogram showing the distribution of Final Weight. There are several extreme outliers with very high final weight values. The distributions of final weight between those that make less than or equal to 50K and those that make more are comparable to each other, though outliers for those making less than or equal to $50K are more extreme than those making over.

boxplot(fnlwgt ~ salary, data=data, col="salmon",
        main = "Boxplot of Final Weight by Salary",
        xlab = "Salary",
        ylab = "Final Weight")

p <- ggplot(data) +
  geom_histogram(aes(x=fnlwgt, group =salary, fill=salary), 
                 bins=5000, ) +    
  ggtitle("Distribution of Final Weight, Grouped by Salary") +
  xlab("Final Weight") 
p

Numeric Education Level

Numeric education level runs from 1 to 16, and gives a numeric value to how much education a respondent in the data set has received. In the U.S., grades 9-12 represent high school while 13-16 represent college and post-graduate studies.

The boxplot below shows a fairly symmetrical distribution of Numeric education level but there are very low outliers skewing the data a bit; this is due to the respondents who reported education of preschool and elementary school levels. The median education level is around the 10th grade and 50% of the dataset reported education levels between 9 and 12, most received at least some high school level education.

summary(data$education)

##          10th          11th          12th       1st-4th       5th-6th 
##           933          1175           433           168           333 
##       7th-8th           9th    Assoc-acdm     Assoc-voc     Bachelors 
##           646           514          1067          1382          5355 
##     Doctorate       HS-grad       Masters     Preschool   Prof-school 
##           413         10501          1723            51           576 
##  Some-college 
##          7291

boxplot(data$education_num , data=data, col="salmon",
        main = "Boxplot of Numeric Education Level",
        ylab = "Numeric Education Level")

The histogram below shows a more detailed distribution of the Numeric Education Levels in the data set. As shown by the turquoise portion of the bars, almost all of those that make $50K or over have at least some high school education, if not some amount of further education. The blue and green dashed lines show the mean and the median of the numeric educational levels respectively. They are extremely close together, indicating a fairly symmetrical distribution, though (as in the boxplot above) we can see that there are a number of very low outliers.

p <- ggplot(data) +
  geom_histogram(aes(x=education_num, group =salary, fill=salary), 
                 binwidth=1, color="black") +    
  ggtitle("Distribution of Numeric Education Level, Grouped by Salary") +
  xlab("Numeric Education Level") + 
  geom_vline(aes(xintercept=mean(education_num)), 
               color="blue", 
               linewidth=1,
               linetype="dashed") +
    geom_vline(aes(xintercept=median(education_num)), 
               color="green", 
               linewidth=1,
               linetype="dashed") +
  annotate("text", x=12.8, y = 10400,        #(How to add label to geom_vline in ggplot2, 2022)
           label = "Mean Education Level", 
           color="blue") +
    annotate("text", x=7, y = 10800,        #(How to add label to geom_vline in ggplot2, 2022)
           label = "Median Education Level", 
           color="green") 

p

The side-by-side boxplots below show that in general, those that make more than $50K a year have higher educational levels than those who make less than or equal to $50K. The blue line shows that the mean numeric education level is around 10. The middle 50% (represented by the IQR) of those that make less than or equal to $50K lies below this mean while those making over $50K lies above. Also, while those who make less than or equal to $50K a year have outliers both low and high outliers, those that make over $50K only have outliers below the lower whiskers.

boxplot(education_num ~ salary, data=data, col="salmon",
        main = "Boxplot of Numeric Education Level by Salary",
        xlab = "Salary",
        ylab = "Numeric Education Level")
abline(h=mean(data$education_num), 
       col="blue",
       lwd=3,
       lty=2
       )
text(x=2.2, y = 9, 'Mean Education Level (10)',col="blue")

Education Level

Education Level translates the numeric education levels seen above to categorical data. In the barplot shown below, we can see that a good majority of the respondents in the data set have at least a high school diploma. Very few people reported a doctorate or prof-school education but these are the only two education levels where the majority of people reported earning over $50K a year. A small portion of respondents reported Preschool, 1st-4th, and 5th-6th grade level education, accounting for the low outliers we saw in the boxplots above. These groups mostly reported making less than or equal to $50K, though a very small number of those with a 5th-6th grade level education made more.

p <- ggplot(data, 
       aes(x = fct_infreq(education), fill=salary)) +   #(Reordering geom_bar and geom_col by Count or Value, 2022)
  geom_bar(position="stack") +
  ggtitle("Barplot of Education, Grouped by Salary") +
  xlab("Education Level") +
  theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))    
p

Marital Status

Marital Status was another categorical variable in the data set. The following 3 bar plots represent the marital statuses for:

The entire dataset, grouped by salary
Those that made less than or equal to $50K only
Those that made more than $50K only

In the plot representing the whole dataset, most respondents reported being married, and this segment was split fairly evenly between those who made over $50K a year and those who did not. Those who were never married or divorced were the next two most common segments made up primarily by those who made less than or equal to $50K.

The top segment for those who made up less than or equal to $50K was “Never-Married”, closely followed by “Married” while the clear top category for those who made over $50K a year was “Married”.

ggplot(data, 
     aes(x = fct_infreq(marital_status), fill=salary, )) +
geom_bar(position="stack") +
ggtitle("Barplot of Marital Status, Grouped by Salary") +
xlab("Marital Status") +
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))

data %>% 
  filter(salary == " <=50K") %>%
  ggplot(aes(x=fct_infreq(marital_status))) +
  geom_bar(fill="salmon") +
  xlab("Marital Status") +
  ylim(0,15000)+
  ggtitle("Barplot of Marital Status for those making <=50K") +
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))

data %>% 
  filter(salary != " <=50K") %>%
  ggplot(aes(x=fct_infreq(marital_status))) +
  geom_bar(fill="turquoise") +
  xlab("Marital Status") +
  ylim(0,15000)+
  ggtitle("Barplot of Marital Status for those making >50K") +
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))

Occupation

Occupation was a categorical variable that, along with the workclass variable, had the most number of missing values in the data set. The following 3 bar plots represent the occupations for:

The entire dataset, grouped by salary
Those that made less than or equal to $50K only
Those that made more than $50K only

The top 3 occupations for the whole data set were:

Prof-specialty
Craft-repair
Exec-managerial

The top 3 for those who made less than or equal to $50K a year:

Adm-clerical
Craft-repair
Other-service

The top 3 for those who made more than $50K a year:

Exec-managerial
Prof-specialty
Sales

ggplot(data, 
   aes(x = fct_infreq(occupation), fill=salary, )) +
geom_bar(position="stack") +
ggtitle("Barplot of Occupations, Grouped by Salary") +
xlab("Occupation") +
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))

data %>% 
  filter(salary == " <=50K") %>%
  ggplot(aes(x=fct_infreq(occupation))) +
  geom_bar(fill="salmon") +
  xlab("Occupation") +
  ylim(0,4000)+
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))

data %>% 
  filter(salary != " <=50K") %>%
  ggplot(aes(x=fct_infreq(occupation))) +
  geom_bar(fill="turquoise") +
  xlab("Occupation") +
  ylim(0,4000)+
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))

Relationships

The Relationship categorical variable is tied in with the marital status variable to a certain extent. The marital status variable plots showed us that marriage was the most common relationship type in the data set. Here, those who reported being husbands were the most typical response. This is probably due to the fact there were many more men in the data set than women, and that in the U.S. in 1994, men were the primary breadwinners of families. What is interesting is that those who reported being wives were the only group where over half of the respondant reported making over $50K a year.

p <- ggplot(data, 
       aes(x = fct_infreq(relationship), fill=salary, )) +
  geom_bar(position="stack") +
  ggtitle("Barplot of Relationships, Grouped by Salary") +
  xlab("Relationship") +
  theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))    

p

Race

As we can see in the Barplot of Race, Grouped by Salary, the vast majority of respondents in the data set reported being White, followed by Black, Asian-Pac-Islander, Amer-Indian-Eskimo, and lastly Other. In the Percent Stacked Bar Plot of Race, Grouped by Salary, we can see that racial groups White and Asian-Pac-Islander were most likely to earn over $50K a year, with about 25% of each group doing so. Out of those who reported being Black, Amer-Indian-Eskimo, or Other, about 12-13% reported making over $50K a year.

ggplot(data, 
     aes(x = fct_infreq(race), fill=salary, )) +
geom_bar(position="stack") +
ggtitle("Barplot of Race, Grouped by Salary") +
xlab("Race") +
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))

#Percent Stacked Barplot (Grouped, stacked and percent stacked barplot in ggplot2, 2021)
ggplot(data, aes(fill=salary, x=fct_infreq(race))) + 
  geom_bar(position="fill") +
  ggtitle("Percent Stacked Bar Plot of Race, Grouped by Salary") +
  xlab("Race") +
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))

Sex

Another key demographic is the gender of those in the data set. As we can see in the bar plot below, there are about twice as many male respondents in the data set as women. There are a higher percentage of males that make over $50K than females.

ggplot(data, 
     aes(x = sex, fill=salary, )) +
geom_bar(position="stack") +
ggtitle("Barplot of Sex, Grouped by Salary") +
xlab("Sex")

When separated by sex, we can see in the side-by-side box plots below that female and male respondents in the data set share similar distributions of age, though men tended to be just a bit older than females. The median age for both groups seems to be in the late 30s with comparable minimum and maximum ages. There are older outliers in both groups as well.

boxplot(age ~ sex, data=data, col="coral1",
        main = "Boxplot of Age by Sex",
        xlab = "Sex",
        ylab = "Age")

Capital Gains

It is difficult to evaluate the distribution of Capital Gains because the high number of Capital Gains equaling zero heavily skews the model.

ggplot(data) +
geom_histogram(aes(x=capital_gain, group =salary, fill=salary), bins=50, color="black") +    
ggtitle("Distribution of Capital Gain, Grouped by Salary") +
xlab("Capital Gain")

The histogram below shows the Capital Gain excluding those with a Capital Gain of 0, which does not have any impact on income. The histogram shows that people with more capital gains, tended to make more than $50K annually. The histogram was still skewed to the right with high outliers with capital gains of 100,000. Clearly, having capital gain to report had positive impacts on income.

data %>% 
  filter(capital_gain != 0) %>%
  ggplot(aes(x=capital_gain, group=salary, fill=salary)) +
  geom_histogram(bins=50, color="black") +
  xlab("Capital Gain")

Capital Losses

It is difficult to evaluate the distribution of Capital Losses because the high number of Capital Losses equaling zero heavily skews the model to the right. To deal with this, we will examine the data excluding the zeros (which have no impact on income).

ggplot(data) +
geom_histogram(aes(x=capital_loss, group =salary, fill=salary), 
               bins=50, color="black") +    
ggtitle("Distribution of Capital Loss, Grouped by Salary") +
xlab("Capital Loss")

The histogram below shows the Capital Losses excluding those with a Loss of 0. The 0s were heavily skewing the the distribution to the right and made no impact on income. The histogram shows a fairly symmetrical distribution with both salary groups reporting capital losses. There is a sizeable spike of those reporting losses at around $2000, especially those who made more than $50K. This could be worth investigating further.

data %>% 
  filter(capital_loss != 0) %>%
  ggplot(aes(x=capital_loss, group=salary, fill=salary)) +
  geom_histogram(bins=50, color="black") +
  xlab("Capital Loss")

Hours per Week

The hours per week variable shows how many hours a week people reported working. The distribution is symmetrical and there is a huge spike at around 40 hours a week, which makes sense as this is the standard work week for many Americans. Those who made over $50K in the year generally worked at least 40 hours a week, if not more. This makes sense as we would assume that the more one works, the more they would make.

ggplot(data) +
geom_histogram(aes(x=hours_per_week, group =salary, fill=salary), 
               bins = 30, 
               color="black") +    
ggtitle("Distribution of Hours Worked per Week, Grouped by Salary") +
xlab("Hours Worked Per Week")

In the side-by-side scatter plot comparing age and hours worked per week for the two salary groups, we can see that younger people were more likely to make less than or equal to $50K a year than those who were older. Across the board, most people worked around the standard 40 hour work week. Additionally, those 80 and older were not only less likely to work, they were less likely to work more than the 40 hour standard.

## Lattice
library(lattice)
xyplot(hours_per_week ~ age | salary, data = data)

Native Country

As shown in the bar plot below, the primary respondants in the data set reported the United States as their native country, followed by Mexico as a distant second.

ggplot(data, 
     aes(x = fct_infreq(native_country), fill=salary, )) +
geom_bar(position="stack") +
ggtitle("Barplot of Native Country, Grouped by Salary") +
xlab("Country") +
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))

Citations

Becker, B. & Kohavi, R. (1996). Adult [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20.
R Core Team (2018). Grouped, stacked and percent stacked barplot in ggplot2. R Graph Gallery. Retrieved September 17, 2024 from https://r-graph-gallery.com/48-grouped-barplot-with-ggplot2.html#
Bobbitt, Z. (2021). How to Rotate Axis Labels in ggplot2 (With Examples). Statology. Retrieved September 17, 2024 from https://www.statology.org/rotate-axis-labels-ggplot2/
Kuo, A. (2022). Reordering geom_bar and geom_col by Count or Value. Albert’s Blog. Retrieved September 18, 2024 from https://blog.albertkuo.me/post/2022-01-04-reordering-geom-col-and-geom-bar-by-count-or-value/
Bobbitt, Z. (2022). How to Add Label to geom_vline in ggplot2. Statology. Retrieved September 19, 2024 from https://www.statology.org/geom_vline-label/
Geeks for Geeks (2021). Grouped, stacked and percent stacked barplot in ggplot2. Geeks for Geeks. Retrieved September 20, 20224 from https://www.geeksforgeeks.org/grouped-stacked-and-percent-stacked-barplot-in-ggplot2/

Homework1_LeonaChase

Leona Chase

2024-09-11