Outcomes

Created by Nathan Garrett, updated 1/29/24

Outcomes:

For help, see:

Guided Task 0: Setup markdown options

This block should run, but show neither its code nor its output.

For plots, we usually want to show the results, but not our code.

Sometimes we want to show both the code and the output.

# Show a table of over50k values
table(t$over50k)
## 
## <=50K  >50K 
## 24720  7841
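The three cases above correspond to knitr chunk options. A minimal sketch, assuming the document is knitted with knitr; the chunk labels (setup, my_plot, my_table) are hypothetical:

```r
# Chunk headers as they would appear in the .Rmd source (shown as comments):
#   {r setup, include=FALSE}   runs the chunk, shows neither code nor output
#   {r my_plot, echo=FALSE}    shows the output (e.g. a plot), hides the code
#   {r my_table, echo=TRUE}    shows both the code and the output

# The same default can be set once for every later chunk (requires knitr):
knitr::opts_chunk$set(echo = FALSE)
```

Options given in an individual chunk header override the document-wide default, so a single chunk can still use echo=TRUE.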

Guided Task 1: Data cleanup

Go back and make a cleaner version of our dataset.
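One piece of this cleanup is a numeric 0/1 version of over50k, which the later tasks refer to as over50k01. A minimal sketch on made-up data, assuming over50k holds exactly the labels "<=50K" and ">50K" as in the table output earlier; t_demo is a hypothetical stand-in for t, not the course data:

```r
library(dplyr)

# Toy stand-in for the course dataset `t`; these values are made up.
t_demo <- tibble(
  age     = c(39, 50, 38, 53, 28),
  over50k = c("<=50K", ">50K", "<=50K", ">50K", "<=50K")
)

# Add a numeric flag: 1 when the row earns over 50k, otherwise 0.
t_demo <- t_demo %>%
  mutate(over50k01 = if_else(over50k == ">50K", 1, 0))

print(t_demo)
```

If the raw labels carry stray spaces, trimming them first with trimws() is one option.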

Now, print the number of rows, the number of rows earning over 50k, and the proportion of rows earning over 50k.

Show your code and the result.

# Create a new tibble t1. Hint: summarise
t1 <- t %>% 
  summarise(n = n(),
            count_over_50k = sum(over50k01),
            mean_over_50k = mean(over50k01))

# Show the tibble
print(t1)
## # A tibble: 1 × 3
##       n count_over_50k mean_over_50k
##   <int>          <dbl>         <dbl>
## 1 32561           7841         0.241

Guided Task 2: Correlations

Use the cor function to show the correlation between different values.

Create a new tibble t2 with over50k01, age, and education_num (the numeric education column). Then, use cor(tibble_name) to find the correlations between those columns.

##                      age education_num over50k01
## age           1.00000000    0.03652719 0.2340371
## education_num 0.03652719    1.00000000 0.3351540
## over50k01     0.23403710    0.33515395 1.0000000
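A correlation matrix of this shape can be produced along the following lines. This is a sketch on made-up data, so the numbers will not match the output above; t_demo and t2_demo are hypothetical names:

```r
library(dplyr)

# Toy stand-in for the course data; these values are made up.
t_demo <- tibble(
  age           = c(25, 32, 47, 51, 38, 29),
  education_num = c(9, 10, 13, 14, 9, 13),
  over50k01     = c(0, 0, 1, 1, 0, 1),
  sex           = c("Male", "Female", "Male", "Male", "Female", "Female")
)

# Keep only the numeric columns of interest; cor() fails on text columns.
t2_demo <- t_demo %>%
  select(age, education_num, over50k01)

# Pairwise Pearson correlations between every pair of columns.
cor_demo <- cor(t2_demo)
print(round(cor_demo, 2))
```

The matrix is symmetric with 1s on the diagonal, so the over50k01 row (or column) is the one to read when comparing predictors.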

Graph

Create a graph showing the variable most strongly correlated with over50k01 in the prior task. Pick a useful graphic. Show the graphic, but not the code.
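One possible graphic is a bar chart of the share earning over 50k at each level of the chosen variable. A sketch using ggplot2 on made-up data; the choice of education_num, the names t_demo and share_demo, and the bar chart itself are assumptions, not the required answer:

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the course data; these values are made up.
t_demo <- tibble(
  education_num = c(9, 9, 10, 10, 13, 13, 14, 14),
  over50k01     = c(0, 0, 0, 1, 1, 0, 1, 1)
)

# Share of rows earning over 50k within each education level.
share_demo <- t_demo %>%
  group_by(education_num) %>%
  summarise(share_over_50k = mean(over50k01))

# Bar chart of that share by education level.
p <- ggplot(share_demo, aes(x = education_num, y = share_over_50k)) +
  geom_col() +
  labs(x = "Education (years)", y = "Share earning over 50k")

print(p)
```

In the .Rmd file, setting echo=FALSE on the chunk shows the plot without the code.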

Guided Task 3: Tables

Use the table function to show the % of people making over 50k by text field(s).

Start with table(text_column, over50k) to find the relationship between those values. Then, convert into proportions by wrapping the result of table with prop.table. Show the output only.

##         
##          <=50K  >50K
##   Female  9592  1179
##   Male   15128  6662
##         
##               <=50K       >50K
##   Female 0.29458555 0.03620896
##   Male   0.46460490 0.20460060
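The proportions in the output above sum to 1 across the whole table, which is what prop.table does by default. A sketch on made-up data showing that default next to the row-wise variant (margin = 1), which gives the share earning over 50k within each group; the vectors here are hypothetical stand-ins for the dataset columns:

```r
# Toy vectors standing in for two dataset columns; the values are made up.
sex     <- c("Female", "Female", "Female", "Male", "Male", "Male", "Male", "Male")
over50k <- c("<=50K", "<=50K", ">50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K")

# Cross-tabulated counts: one row per sex, one column per income label.
counts <- table(sex, over50k)

# Default: each cell divided by the grand total (all cells sum to 1).
props_all <- prop.table(counts)

# margin = 1: each cell divided by its row total (each row sums to 1).
props_row <- prop.table(counts, margin = 1)

print(counts)
print(props_all)
print(props_row)
```

Which version to report depends on the question: the default answers "what share of everyone falls in this cell", while margin = 1 answers "what share of this group earns over 50k".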

Guided Task 4: Convert text into 0/1 variables

We often want to group variables, or turn text values into numbers for easier analysis.

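Create a new tibble called t_numbers. Add two 0/1 columns, is_male and is_white (the column names that appear in the output below).

Then, print the correlations between those two new columns and over50k01. Are any of these more strongly correlated with over50k01 than your other numeric columns?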

##             is_male  is_white over50k01
## is_male   1.0000000 0.1034862 0.2159802
## is_white  0.1034862 1.0000000 0.0852245
## over50k01 0.2159802 0.0852245 1.0000000
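Columns like is_male and is_white can be built with mutate and if_else. A sketch on made-up data; the source column names sex and race and the label "White" are assumptions about the dataset, and t_demo and t_numbers_demo are hypothetical names:

```r
library(dplyr)

# Toy stand-in for the course data; these values are made up.
t_demo <- tibble(
  sex       = c("Male", "Female", "Male", "Male", "Female", "Male"),
  race      = c("White", "White", "Black", "White", "Other", "White"),
  over50k01 = c(1, 0, 0, 1, 0, 1)
)

# Turn each text column into a 0/1 flag, then keep only numeric columns.
t_numbers_demo <- t_demo %>%
  mutate(
    is_male  = if_else(sex == "Male", 1, 0),
    is_white = if_else(race == "White", 1, 0)
  ) %>%
  select(is_male, is_white, over50k01)

# Correlations between the new flags and the income flag.
cor_numbers <- cor(t_numbers_demo)
print(round(cor_numbers, 2))
```

The same pattern extends to grouped flags, for example `if_else(x %in% c("A", "B"), 1, 0)` to mark several text values at once.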

On your own!

Your goal is to find or create a variable with the highest correlation to over50k01. Try different text values, and then turn them into 0/1 fields. Print out the correlations, as well as a visualization of each of the variables.

Start by viewing a visualization of each key variable. Then, pull out variables as needed.

## Warning: Use of bare predicate functions was deprecated in tidyselect 1.1.0.
## ℹ Please use wrap predicates in `where()` instead.
##   # Was:
##   data %>% select(is.numeric)
## 
##   # Now:
##   data %>% select(where(is.numeric))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
##                        age education_num hours_per_week over50k01 is_married
## age             1.00000000    0.03652719     0.06875571 0.2340371 0.31567870
## education_num   0.03652719    1.00000000     0.14812273 0.3351540 0.08607804
## hours_per_week  0.06875571    0.14812273     1.00000000 0.2296891 0.21281652
## over50k01       0.23403710    0.33515395     0.22968907 1.0000000 0.44469616
## is_married      0.31567870    0.08607804     0.21281652 0.4446962 1.00000000
## is_exec_or_prof 0.11721028    0.47448031     0.15222364 0.3062068 0.11294069
## is_highed       0.08615323    0.78951108     0.13457910 0.3156778 0.09984679
##                 is_exec_or_prof  is_highed
## age                   0.1172103 0.08615323
## education_num         0.4744803 0.78951108
## hours_per_week        0.1522236 0.13457910
## over50k01             0.3062068 0.31567783
## is_married            0.1129407 0.09984679
## is_exec_or_prof       1.0000000 0.48338742
## is_highed             0.4833874 1.00000000
## Warning: Removed 16 rows containing missing values (`geom_bar()`).
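The tidyselect warning in the output above comes from passing a bare predicate such as is.numeric to select(). A sketch of the where() form that the warning recommends, on made-up data; t_demo and numeric_demo are hypothetical names:

```r
library(dplyr)

# Toy stand-in for the course data; these values are made up.
t_demo <- tibble(
  age           = c(25, 32, 47, 51, 38),
  education_num = c(9, 10, 13, 14, 9),
  workclass     = c("Private", "Private", "State-gov", "Private", "Local-gov"),
  over50k01     = c(0, 0, 1, 1, 0)
)

# Deprecated form that triggers the warning: select(is.numeric)
# Current form: wrap the predicate in where().
numeric_demo <- t_demo %>%
  select(where(is.numeric))

# Text columns are dropped, so cor() can run on the result.
print(round(cor(numeric_demo), 2))
```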

Visualize the variable with the highest correlation