Outcomes

Created by Connor Lewis, updated 1/29/24

Outcomes:

For help, see:

Guided Task 0: Set up markdown options

This block should run, but show neither its code nor its output.

For plots, we usually want to show the results, but not our code.

Sometimes we will want to show both the code and the output.

# Show a table of over50k values
table(t$over50k)
## 
## <=50K  >50K 
## 24720  7841
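The three display behaviours described above map onto knitr chunk options. The following is a minimal sketch, assuming the report is knitted from an R Markdown file with knitr. The global default and the chunk labels (`setup`, `plot-chunk`, `table-chunk`) are hypothetical and may differ from what the original setup block used.

```r
# Assumed body of a setup chunk whose header is {r setup, include=FALSE}.
# include=FALSE runs the code but keeps both the code and its output out of the report.
knitr::opts_chunk$set(echo = TRUE)  # assumed global default: show code and output

# Individual chunk headers can then override the default where needed:
#   {r plot-chunk, echo=FALSE}  -> show the result (for example a plot), hide the code
#   {r table-chunk, echo=TRUE}  -> show both the code and its output
```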

Guided Task 1: Data cleanup

Go back and make a cleaner version of our dataset.
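One possible cleanup step is sketched below on a tiny made-up tibble, since the real data is not reproduced here. It trims stray whitespace from the text label and derives the 0/1 column `over50k01` that the later tasks rely on. The sample rows and the names `t_raw` and `t_clean` are hypothetical, and coding ">50K" as 1 is an assumption based on the labels in the table output above.

```r
library(dplyr)

# Hypothetical sample rows standing in for the real dataset `t`
t_raw <- tibble(
  age     = c(39, 50, 38, 53),
  over50k = c(" <=50K", " >50K", " <=50K", " >50K")
)

# Trim whitespace from the label, then code ">50K" as 1 and everything else as 0
t_clean <- t_raw %>%
  mutate(
    over50k   = trimws(over50k),
    over50k01 = if_else(over50k == ">50K", 1, 0)
  )

t_clean
```

A numeric 0/1 column like this is convenient because its sum counts the rows over 50k and its mean gives the proportion over 50k.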

Now, print the number of rows, the number over 50k, and the % of the rows earning over 50k.

Show your code and the result.

# Create a new tibble t1. Hint: summarise
t1 <- t %>%
  summarize(n = n(),
            number_over_50 = sum(over50k01),
            percent_over = 100 * mean(over50k01))  # proportion scaled to a percentage
# Show the tibble
t1

Guided Task 2: Correlations

Use the cor function to show the correlation between different values.

Create a new tibble t2 with over50k01, age, education_num, and hours_per_week. Then, use cor(tibble_name) to find the correlations between those columns.

# Create a smaller tibble containing only numeric columns.
t2 <- t %>% select(over50k01, age, education_num, hours_per_week)
# Use the function cor to print the correlation between those variables to the terminal.
cor(t2)
##                over50k01        age education_num hours_per_week
## over50k01      1.0000000 0.23403710    0.33515395     0.22968907
## age            0.2340371 1.00000000    0.03652719     0.06875571
## education_num  0.3351540 0.03652719    1.00000000     0.14812273
## hours_per_week 0.2296891 0.06875571    0.14812273     1.00000000

Graph

Create a graph showing the variable with the best correlation from the prior task. Pick a useful graphic. Show the graphic, but not the code.
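In the correlation output above, education_num has the largest correlation with over50k01 (about 0.34). One reasonable graphic, sketched here under assumptions, is a bar chart of the share earning over 50k at each education level. The sample rows and the names `t_demo`, `by_edu`, and `p` are hypothetical, and the sketch assumes the dplyr and ggplot2 packages are installed.

```r
library(dplyr)
library(ggplot2)

# Hypothetical sample rows standing in for the real dataset `t`
t_demo <- tibble(
  education_num = c(9, 9, 10, 10, 13, 13, 14, 14),
  over50k01     = c(0, 0, 0, 1, 1, 0, 1, 1)
)

# Share of rows earning over 50k at each education level
by_edu <- t_demo %>%
  group_by(education_num) %>%
  summarize(share_over50k = mean(over50k01))

# Bar chart: one bar per education level
p <- ggplot(by_edu, aes(x = education_num, y = share_over50k)) +
  geom_col() +
  labs(x = "Education level (education_num)",
       y = "Share earning over 50k")

p
```

In the knitted report, a chunk with `echo=FALSE` would show this graphic without its code.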

Guided Task 3: Tables

Use the table function to show the % of people making over 50k by text field(s).

Start with table(text_column, over50k01) to find the relationship between those values. Then, convert the counts into proportions by wrapping the call to table in prop.table. Show the output only.

##                    
##                        0    1
##   ?                 1652  191
##   Adm-clerical      3263  507
##   Armed-Forces         8    1
##   Craft-repair      3170  929
##   Exec-managerial   2098 1968
##   Farming-fishing    879  115
##   Handlers-cleaners 1284   86
##   Machine-op-inspct 1752  250
##   Other-service     3158  137
##   Priv-house-serv    148    1
##   Prof-specialty    2281 1859
##   Protective-serv    438  211
##   Sales             2667  983
##   Tech-support       645  283
##   Transport-moving  1277  320
##                    
##                                0            1
##   ?                 5.073554e-02 5.865913e-03
##   Adm-clerical      1.002119e-01 1.557077e-02
##   Armed-Forces      2.456927e-04 3.071159e-05
##   Craft-repair      9.735573e-02 2.853106e-02
##   Exec-managerial   6.443291e-02 6.044040e-02
##   Farming-fishing   2.699549e-02 3.531833e-03
##   Handlers-cleaners 3.943368e-02 2.641197e-03
##   Machine-op-inspct 5.380670e-02 7.677897e-03
##   Other-service     9.698719e-02 4.207487e-03
##   Priv-house-serv   4.545315e-03 3.071159e-05
##   Prof-specialty    7.005313e-02 5.709284e-02
##   Protective-serv   1.345168e-02 6.480145e-03
##   Sales             8.190780e-02 3.018949e-02
##   Tech-support      1.980897e-02 8.691379e-03
##   Transport-moving  3.921870e-02 9.827708e-03
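The second table above divides every count by the grand total, so each cell is a share of all rows. To read off the share of each group earning over 50k, `prop.table` accepts a `margin` argument, and `margin = 1` makes each row sum to 1. The sketch below uses three rows of counts copied from the first table. The dimension name `occupation` and the object names `counts` and `row_props` are assumptions for illustration.

```r
# Counts copied from three rows of the table above
# (columns: 0 = not over 50k, 1 = over 50k)
counts <- as.table(matrix(
  c(3263,  507,
    2098, 1968,
    3158,  137),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    occupation = c("Adm-clerical", "Exec-managerial", "Other-service"),
    over50k01  = c("0", "1")
  )
))

# margin = 1 converts each row into proportions within that row
row_props <- prop.table(counts, margin = 1)
round(row_props, 3)
```

With row proportions, the column labelled 1 can be compared directly across groups, regardless of how many people are in each group.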

Guided Task 4: Convert text into 0/1 variables

We often want to group variables, or turn text values into numbers for easier analysis.

Create a new tibble called t_numbers. Add:

Then, print a correlation test with those two new columns and over50k01. Do any of them correlate more strongly than your other numeric columns?
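As a sketch of the general pattern, the block below builds two 0/1 indicator columns from a text column and compares them with the 0/1 income column using a correlation matrix. The indicators chosen here (`is_exec`, `is_service`), the column name `occupation`, and the sample rows are purely illustrative assumptions, not the columns this task asks for.

```r
library(dplyr)

# Hypothetical sample rows standing in for the real dataset `t`
t_demo <- tibble(
  occupation = c("Exec-managerial", "Sales", "Other-service", "Prof-specialty",
                 "Exec-managerial", "Other-service", "Sales", "Prof-specialty"),
  over50k01  = c(1, 0, 0, 1, 1, 0, 1, 0)
)

# Illustrative 0/1 indicators; which groupings to use is an assumption
t_numbers <- t_demo %>%
  mutate(
    is_exec    = if_else(occupation == "Exec-managerial", 1, 0),
    is_service = if_else(occupation == "Other-service", 1, 0)
  )

# Correlation matrix of the new indicators with the 0/1 income column
cor_numbers <- t_numbers %>%
  select(over50k01, is_exec, is_service) %>%
  cor()

cor_numbers
```

For a formal test on a single pair of columns, base R also provides `cor.test(x, y)`.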

On your own!

Your goal is to find or create a variable with the highest correlation to over50k01. Try different text values, and then turn them into 0/1 fields. Print out the correlations, as well as a visualization of each of the variables.

Start by viewing a visualization of each key variable. Then, pull out variables as needed.
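One way to search systematically, sketched here in base R under assumptions, is to loop over every distinct value of a text column, turn each into a 0/1 indicator, and rank the indicators by the size of their correlation with over50k01. The sample rows, the column name `occupation`, and the object names `demo`, `values`, and `cors` are hypothetical.

```r
# Hypothetical sample rows standing in for the real dataset `t`
demo <- data.frame(
  occupation = c("Exec-managerial", "Sales", "Other-service", "Prof-specialty",
                 "Exec-managerial", "Other-service", "Sales", "Prof-specialty"),
  over50k01  = c(1, 0, 0, 1, 1, 0, 1, 0)
)

# For each distinct text value, build a 0/1 indicator and correlate it with over50k01
values <- sort(unique(demo$occupation))
cors <- sapply(values, function(v) {
  indicator <- as.numeric(demo$occupation == v)
  cor(indicator, demo$over50k01)
})

# Strongest relationships (by absolute value) first
cors[order(abs(cors), decreasing = TRUE)]
```

Ranking by absolute value keeps strongly negative indicators in view, since they can be as informative as strongly positive ones.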