Outcomes

Created by Nathan Garrett, updated 1/29/24

Outcomes:

Create RMarkdown documents
Understand syntax
- Titles with hashes (1-3)
- Links with parens
- Code rendering / output options
- Code rendering defaults
Publish your files online
R
- cor
- table and prop.table

For help, see:

Guided Task 0: Setup markdown options

This block should run, but show neither its code nor its output.

Add a new column called over50k01 that is numeric (1 or 0).
Use rename to change names from hyphens to underscores (i.e., from education-num to education_num).
Remove the fields capital_loss, education, fnlwgt, workclass, and capital_gain.

For plots, we usually want to show the results, but not our code.

Sometimes we want to show both the code and the output.

# Show a table of over50k values
table(t$over50k)

## 
## <=50K  >50K 
## 24720  7841
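The rendering choices in this task correspond to knitr chunk options. Below is a minimal sketch of what a setup chunk might contain, assuming the knitr package is installed. The specific defaults chosen here are illustrative assumptions, not the settings used to build this document.

```r
# Typically placed in a first chunk whose header is {r setup, include=FALSE},
# so the chunk runs but neither its code nor its output is rendered.
library(knitr)

# Document-wide defaults (assumed values): show code and output,
# but hide package start-up messages and warnings.
opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

# Individual chunks override the defaults in their own header, for example:
#   {r my_plot, echo=FALSE}   -> show the plot, hide the code
#   {r my_table, echo=TRUE}   -> show both the code and the output
```

Setting defaults once keeps the individual chunk headers short; only the chunks that differ from the default need an explicit option.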

Guided Task 1: Data cleanup

Go back and make a cleaner version of our dataset.

Add a new column called over50k01 that is numeric (1 or 0).
Use rename to change names from hyphens to underscores (i.e., from education-num to education_num).
Remove the fields capital_loss, education, fnlwgt, workclass, and capital_gain.
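One possible shape for these three steps, sketched with dplyr on a tiny made-up tibble. The tibble `toy`, its values, and the assumption that the income labels are exactly "<=50K" and ">50K" (no extra whitespace) are all illustrative; check the labels in the real data before reusing the comparison.

```r
library(dplyr)

# Made-up stand-in for the census data (hypothetical values).
toy <- tibble(
  age = c(39, 50, 38, 53),
  `education-num` = c(13, 13, 9, 7),
  education = c("Bachelors", "Bachelors", "HS-grad", "11th"),
  fnlwgt = c(100, 200, 300, 400),
  over50k = c("<=50K", ">50K", "<=50K", ">50K")
)

toy_clean <- toy %>%
  # Names containing hyphens must be wrapped in backticks.
  rename(education_num = `education-num`) %>%
  # 1 when the label is ">50K", otherwise 0.
  mutate(over50k01 = if_else(over50k == ">50K", 1, 0)) %>%
  # A leading minus drops a column.
  select(-education, -fnlwgt)

print(toy_clean)
```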

Now, print the number of rows, the number over 50k, and the % of the rows earning over 50k.

Show your code and the result.

# Create a new tibble t1. Hint: summarise
t1 <- t %>% 
  summarise(n = n(),
            number_over_50k = sum(over50k01),
            percent_over_50k = mean(over50k01))

# Show the tibble
print(t1)

## # A tibble: 1 × 3
##       n number_over_50k percent_over_50k
##   <int>           <dbl>            <dbl>
## 1 32561            7841            0.241
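Because over50k01 holds only 0s and 1s, its sum is a count and its mean is a proportion. A quick check of the figures above, using only the counts printed by table(t$over50k) earlier in this document:

```r
# Counts taken from the table(t$over50k) output shown earlier.
n_under <- 24720
n_over <- 7841

# Total rows and the share of rows earning over 50k.
n_total <- n_under + n_over
share_over <- n_over / n_total

n_total               # 32561
round(share_over, 3)  # 0.241
```

Note that the result is a proportion between 0 and 1; multiply by 100 to report it as a percentage.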

Guided Task 2: Correlations

Use the cor function to show the correlation between different values.

Create a new tibble t2 with age, education_num, hours_per_week, and over50k01. Then, use cor(tibble_name) to find the correlation between those values.

# Create a smaller tibble containing only numeric columns.
t2 <- t %>% 
  select(age, education_num, hours_per_week, over50k01)

# Use the function cor to print the correlation between those variables
# to the terminal.
cor(t2)

##                       age education_num hours_per_week over50k01
## age            1.00000000    0.03652719     0.06875571 0.2340371
## education_num  0.03652719    1.00000000     0.14812273 0.3351540
## hours_per_week 0.06875571    0.14812273     1.00000000 0.2296891
## over50k01      0.23403710    0.33515395     0.22968907 1.0000000
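To read the matrix, look along the over50k01 row (or column) and ignore the 1.0 on the diagonal. As a small sketch, the values from that row can be ranked directly; the vector name `cors` is just a label chosen for this example.

```r
# Correlations with over50k01, copied from the matrix above.
cors <- c(age = 0.23403710,
          education_num = 0.33515395,
          hours_per_week = 0.22968907)

# Rank by absolute value so strong negative correlations would count too.
sort(abs(cors), decreasing = TRUE)

# Name of the variable most strongly correlated with over50k01.
names(which.max(abs(cors)))  # "education_num"
```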

Graph

Create a graph of the variable that had the strongest correlation with over50k01 in the prior task. Pick a useful graphic. Show the graphic, but not the code.

Guided Task 3: Tables

Use the table function to show the % of people making over 50k by text field(s).

Start with table(text_column, over50k) to find the relationship between those values. Then, convert into proportions by wrapping the result of table with prop.table. Show the output only.

##     
##      <=50K >50K
##   1     51    0
##   2    162    6
##   3    317   16
##   4    606   40
##   5    487   27
##   6    871   62
##   7   1115   60
##   8    400   33
##   9   8826 1675
##   10  5904 1387
##   11  1021  361
##   12   802  265
##   13  3134 2221
##   14   764  959
##   15   153  423
##   16   107  306

##     
##             <=50K         >50K
##   1  0.0015662910 0.0000000000
##   2  0.0049752772 0.0001842695
##   3  0.0097355732 0.0004913854
##   4  0.0186112220 0.0012284635
##   5  0.0149565431 0.0008292129
##   6  0.0267497927 0.0019041184
##   7  0.0342434200 0.0018426952
##   8  0.0122846350 0.0010134824
##   9  0.2710604711 0.0514419090
##   10 0.1813212125 0.0425969718
##   11 0.0313565308 0.0110868831
##   12 0.0246306932 0.0081385707
##   13 0.0962501152 0.0682104358
##   14 0.0234636528 0.0294524124
##   15 0.0046988729 0.0129910015
##   16 0.0032861399 0.0093977458
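By default, prop.table divides every cell by the grand total, which is what the second table above shows. To get the share earning over 50k within each row, pass margin = 1. The sketch below rebuilds the count table from the numbers printed above rather than from the original data; the object names are chosen for this example.

```r
# Counts copied from the first table above (rows 1 to 16).
under <- c(51, 162, 317, 606, 487, 871, 1115, 400,
           8826, 5904, 1021, 802, 3134, 764, 153, 107)
over <- c(0, 6, 16, 40, 27, 62, 60, 33,
          1675, 1387, 361, 265, 2221, 959, 423, 306)

tab <- as.table(cbind("<=50K" = under, ">50K" = over))
rownames(tab) <- 1:16

# Each cell divided by the grand total; all cells together sum to 1.
overall <- prop.table(tab)

# Each cell divided by its row total; every row sums to 1.
by_row <- prop.table(tab, margin = 1)

round(by_row, 3)
```

The row-wise version answers the question "of the people in this group, what share earns over 50k?", which is usually the more useful comparison across groups of very different sizes.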

Guided Task 4: Convert text into 0/1 variables

We often want to group variables, or turn text values into numbers for easier analysis.

Create a new tibble called t_numbers. Add:

is_male: 0/1 column
is_white: 0/1 column

Then, print the correlations between those two new columns and over50k01. Are any of these better than your other numeric columns?
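A minimal sketch of the conversion on made-up data. The source column names `sex` and `race`, their labels "Male" and "White", and the tibble `toy` are assumptions for illustration; the 0/1 income column is over50k01 from the earlier tasks.

```r
library(dplyr)

# Made-up stand-in for the census data (hypothetical values).
toy <- tibble(
  sex = c("Male", "Female", "Male", "Female", "Male", "Male"),
  race = c("White", "White", "Black", "White", "Other", "White"),
  over50k01 = c(1, 0, 0, 0, 1, 1)
)

t_numbers <- toy %>%
  mutate(
    # 1 when the text matches, otherwise 0.
    is_male = if_else(sex == "Male", 1, 0),
    is_white = if_else(race == "White", 1, 0)
  ) %>%
  # cor() needs numeric columns only, so drop the text columns.
  select(is_male, is_white, over50k01)

cor(t_numbers)
```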

On your own!

Your goal is to find or create a variable with the highest correlation to over50k01. Try different text values, and then turn them into 0/1 fields. Print out the correlations, as well as a visualization of each of the variables.

Start by viewing a visualization of each key variable. Then, pull out variables as needed.
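One way to try many text values at once is to build a 0/1 indicator for each distinct value of a column and correlate each indicator with over50k01. This is only a sketch on made-up data; the column name `occupation` and its values are hypothetical.

```r
# Made-up stand-in for the census data (hypothetical values).
toy <- data.frame(
  occupation = c("Sales", "Tech", "Sales", "Exec", "Exec", "Tech", "Sales", "Exec"),
  over50k01 = c(0, 0, 0, 1, 1, 1, 0, 1)
)

# For each distinct text value, build a 0/1 indicator and correlate it
# with over50k01. sapply names the results after the text values.
level_cors <- sapply(unique(toy$occupation), function(level) {
  cor(as.numeric(toy$occupation == level), toy$over50k01)
})

# Strongest positive correlations first.
sort(level_cors, decreasing = TRUE)
```

Values that occur in only a handful of rows can produce misleading correlations, so check the counts with table() before trusting a high number.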

US Census