Created by Nathan Garrett, updated 1/29/24
Outcomes:
For help, see:
This block should run, but show neither its code nor its output.
For plots, we usually want to show the results, but not our code.
We sometimes will want to show the code and the output.
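If this handout is knitted from R Markdown (an assumption; the source does not say so, although the `##` prefixes on the output suggest it), the three cases above usually map to the knitr chunk options `include` and `echo`. The sketch below lists the chunk headers as comments, since they belong in the .Rmd file rather than in R code. The chunk labels are made up for illustration.

```r
# Chunk headers for the three cases (chunk labels are hypothetical):
#
#   {r setup, include=FALSE}    runs the code, shows neither code nor output
#   {r age_plot, echo=FALSE}    shows the output (e.g. a plot), hides the code
#   {r over50k_table}           default echo=TRUE: shows both code and output
#
# A document-wide default can also be set once, assuming knitr is installed:
# knitr::opts_chunk$set(echo = TRUE)
```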
# Show a table of over50k values
table(t$over50k)
##
## <=50K >50K
## 24720 7841
Go back and make a cleaner version of our dataset.
Use rename to change column names from hyphens to underscores (i.e., from education-num to education_num).
Now, print the number of rows, the number over 50k, and the % of the rows earning over 50k.
Show your code and the result.
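For the cleanup step, here is a minimal sketch of one way to do it, assuming dplyr is loaded. The toy tibble `raw`, its values, and the rule used to build `over50k01` are illustrative assumptions, not the handout's actual data or solution. `rename()` handles one column at a time (for example ``rename(education_num = `education-num`)``), while `rename_with()` applies the same fix to every column.

```r
library(dplyr)

# Toy stand-in for the raw data (values are made up for illustration).
raw <- tibble(
  age = c(39, 50, 38, 53),
  `education-num` = c(13, 13, 9, 7),
  `hours-per-week` = c(40, 13, 40, 40),
  over50k = c("<=50K", ">50K", "<=50K", ">50K")
)

clean <- raw %>%
  # Replace every hyphen in a column name with an underscore.
  rename_with(~ gsub("-", "_", .x, fixed = TRUE)) %>%
  # Add a 0/1 version of over50k so it can be summed and averaged.
  mutate(over50k01 = if_else(over50k == ">50K", 1, 0))

names(clean)
```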
# Create a new tibble t1. Hint: summarise
t1 <- t %>%
  summarise(n = n(),
            number_over_50k = sum(over50k01),
            # The mean of a 0/1 column is a proportion; multiply by 100 for a percent.
            percent_over_50k = 100 * mean(over50k01))
# Show the tibble
t1
Use the cor function to show the correlation between different values.
Create a new tibble t2 with over50k01, age, education_num, and hours_per_week. Then, use cor(tibble_name) to find the correlation between those values.
# Create a smaller tibble containing only numeric columns.
t2 <- t %>%
  select(age, education_num, hours_per_week, over50k01)
# Use the function cor to print the correlation between those variables
# to the terminal.
cor(t2)
Create a graph showing the variable with the best correlation from the prior task. Pick a useful graphic. Show the graphic, but not the code.
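One possible graphic is sketched below, assuming ggplot2 and dplyr are available and that education_num turned out to be the best-correlated variable (swap in whichever column your own correlation table favors). The toy data is made up. A bar chart of the share earning over 50k at each education level tends to read better than a scatter plot when the outcome is 0/1.

```r
library(dplyr)
library(ggplot2)

# Toy data (made up for illustration); replace with the cleaned dataset.
toy <- tibble(
  education_num = c(9, 9, 9, 10, 10, 13, 13, 13, 14, 14),
  over50k01     = c(0, 0, 1, 0, 0, 1, 0, 1, 1, 1)
)

# Share of rows earning over 50k at each education level.
share_by_edu <- toy %>%
  group_by(education_num) %>%
  summarise(share_over_50k = mean(over50k01))

p <- ggplot(share_by_edu, aes(x = factor(education_num), y = share_over_50k)) +
  geom_col() +
  labs(x = "education_num", y = "Share earning over 50k")

print(p)
```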
Use the table function to show the % of people making over 50k by text field(s).
Start with table(text_column, over50k) to find the relationship between those values. Then, convert into proportions by wrapping the result of table with prop.table.
Show the output only.
##
## <=50K >50K
## 1 51 0
## 2 162 6
## 3 317 16
## 4 606 40
## 5 487 27
## 6 871 62
## 7 1115 60
## 8 400 33
## 9 8826 1675
## 10 5904 1387
## 11 1021 361
## 12 802 265
## 13 3134 2221
## 14 764 959
## 15 153 423
## 16 107 306
##
## <=50K >50K
## 1 0.0015662910 0.0000000000
## 2 0.0049752772 0.0001842695
## 3 0.0097355732 0.0004913854
## 4 0.0186112220 0.0012284635
## 5 0.0149565431 0.0008292129
## 6 0.0267497927 0.0019041184
## 7 0.0342434200 0.0018426952
## 8 0.0122846350 0.0010134824
## 9 0.2710604711 0.0514419090
## 10 0.1813212125 0.0425969718
## 11 0.0313565308 0.0110868831
## 12 0.0246306932 0.0081385707
## 13 0.0962501152 0.0682104358
## 14 0.0234636528 0.0294524124
## 15 0.0046988729 0.0129910015
## 16 0.0032861399 0.0093977458
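The second table above adds up to 1 across all cells, which is what prop.table returns by default. To read off the share of each group earning over 50k, the proportions need to be taken within each row instead. A small sketch on made-up data showing both forms:

```r
# Toy data (made up for illustration).
toy <- data.frame(
  education_num = c(9, 9, 9, 9, 13, 13, 13, 13, 13, 14),
  over50k = c("<=50K", "<=50K", "<=50K", ">50K", "<=50K",
              "<=50K", ">50K", ">50K", ">50K", ">50K")
)

# Counts of each combination.
counts <- table(toy$education_num, toy$over50k)

# Default: each cell as a share of the whole table (all cells sum to 1).
overall <- prop.table(counts)

# margin = 1: each cell as a share of its row (each row sums to 1),
# i.e. the share of each education level that earns over 50k.
by_row <- prop.table(counts, margin = 1)

print(overall)
print(by_row)
```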
We often want to group variables, or turn text values into numbers for easier analysis.
Create a new tibble called t_numbers. Add:
is_male: 0/1 column
is_white: 0/1 column
Then, print a correlation test with those two new columns and over50k01. Are any of these better than your other numeric columns?
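A minimal sketch of the 0/1 conversion, assuming dplyr is loaded. The source column names `sex` and `race` and the values "Male" and "White" are assumptions about the dataset, and the toy rows are made up.

```r
library(dplyr)

# Toy data (made up); the column names sex and race are assumed.
toy <- tibble(
  sex  = c("Male", "Female", "Male", "Male", "Female", "Male", "Female", "Male"),
  race = c("White", "White", "Black", "White", "Other", "White", "White", "Black"),
  over50k01 = c(1, 0, 0, 1, 0, 1, 0, 0)
)

t_numbers <- toy %>%
  # Turn each text column into a 0/1 indicator.
  mutate(is_male  = if_else(sex == "Male", 1, 0),
         is_white = if_else(race == "White", 1, 0)) %>%
  # Keep only numeric columns so cor() accepts the tibble.
  select(is_male, is_white, over50k01)

# Correlation matrix of the two indicators and the outcome.
cor(t_numbers)
```

For a significance test on a single pair, base R's `cor.test(t_numbers$is_male, t_numbers$over50k01)` is an option.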
Your goal is to find or create a variable with the highest correlation to over50k01. Try different text values, and then turn them into 0/1 fields. Print out the correlations, as well as a visualization of each of the variables.
Start by viewing a visualization of each key variable. Then, pull out variables as needed.
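One way to try many text values at once is to loop over the distinct values of a column, build a 0/1 indicator for each, and compare the correlations. The sketch below uses base R on made-up data; the column name `marital_status` and its values are hypothetical stand-ins for whichever text field you explore.

```r
# Toy data (made up); marital_status is a hypothetical text column.
toy <- data.frame(
  marital_status = c("Married", "Never-married", "Married", "Divorced",
                     "Never-married", "Married", "Divorced", "Married"),
  over50k01 = c(1, 0, 1, 0, 0, 1, 0, 0)
)

values_to_try <- unique(toy$marital_status)

# For each value, correlate "row has this value" (0/1) with over50k01.
cors <- sapply(values_to_try, function(v) {
  cor(as.numeric(toy$marital_status == v), toy$over50k01)
})

# Strongest positive correlation first.
sort(cors, decreasing = TRUE)
```

Values with a strongly negative correlation are also informative, so sorting by `abs(cors)` may be worth a look as well.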