Created by Nathan Garrett, updated 1/29/24
Outcomes:
For help, see:
This block should run, but show neither its code nor its output.
rename to change names from hyphens to underscores
(i.e., from education-num to education_num)
For plots, we usually want to show the results, but not our code.
We sometimes will want to show the code and the output.
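Assuming this document is knit with R Markdown and knitr, the three cases above (nothing shown, output only, code and output) map onto chunk options. The sketch below lists the relevant options as comments and sets a document-wide default; it is one common way to do this, not necessarily how this handout was built.

```r
# Per-chunk options go in the chunk header of an R Markdown file:
#   {r, include = FALSE}  -> chunk runs, but neither code nor output is shown
#   {r, echo = FALSE}     -> output (e.g. a plot) is shown, code is hidden
#   {r, echo = TRUE}      -> both code and output are shown (knitr's default)

# A document-wide default can be set once, usually in a setup chunk.
# Individual chunk headers still override it.
knitr::opts_chunk$set(echo = TRUE)
```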
# Show a table of over50k values
table(t$over50k)
##
## <=50K >50K
## 24720 7841
Go back and make a cleaner version of our dataset.
rename to change names from hyphens to underscores
(i.e., from education-num to education_num)
Now, print the number of rows, the number over 50k, and the % of the rows earning over 50k.
Show your code and the result.
# Create a new tibble t1. Hint: summarise
t1 <- t %>%
  summarise(n = n(),
            number_over_50k = sum(over50k01),
            percent_over_50k = mean(over50k01))
# Show the tibble
print(t1)
## # A tibble: 1 × 3
## n number_over_50k percent_over_50k
## <int> <dbl> <dbl>
## 1 32561 7841 0.241
Use the cor function to show the correlation between different values.
Create a new tibble t2 containing only numeric columns, such as over50k01, age, and education_num. Then, use
cor(tibble_name) to find the correlation between those
values.
# Create a smaller tibble containing only numeric columns.
t2 <- t %>%
  select(age, education_num, hours_per_week, over50k01)
# Use the function cor to print the correlation between those variables
# to the terminal.
cor(t2)
## age education_num hours_per_week over50k01
## age 1.00000000 0.03652719 0.06875571 0.2340371
## education_num 0.03652719 1.00000000 0.14812273 0.3351540
## hours_per_week 0.06875571 0.14812273 1.00000000 0.2296891
## over50k01 0.23403710 0.33515395 0.22968907 1.0000000
Create a graph showing the variable with the best correlation from the prior task. Pick a useful graphic. Show the graphic, but not the code.
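One possible approach to the graph task, shown as a minimal sketch: summarise the share of rows over 50k at each level of the chosen variable, then draw a bar chart. The tibble `toy` and its values are made up so the example runs on its own; with the real data you would start from `t` instead. In the knitted document, the chunk would carry `echo = FALSE` so that only the graphic appears.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the course tibble `t`; the values are made up.
toy <- tibble(
  education_num = c(9, 9, 9, 10, 10, 13, 13, 13, 14, 16),
  over50k01     = c(0, 0, 1, 0, 0, 1, 0, 1, 1, 1)
)

# Share of rows earning over 50k at each education level.
by_edu <- toy %>%
  group_by(education_num) %>%
  summarise(share_over_50k = mean(over50k01), n = n())

# Bar chart: one bar per education level, height = share over 50k.
p <- ggplot(by_edu, aes(x = education_num, y = share_over_50k)) +
  geom_col() +
  labs(x = "Education level (education_num)",
       y = "Share earning over 50k")
print(p)
```

A bar chart of group shares tends to read better than a scatter plot here, because over50k01 only takes the values 0 and 1.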
Use the table function to show the % of people making over 50k by text field(s).
Start with table(text_column, over50k) to find the
relationship between those values. Then, convert into proportions by
wrapping the result of table with prop.table.
Show the output only.
##
## <=50K >50K
## 1 51 0
## 2 162 6
## 3 317 16
## 4 606 40
## 5 487 27
## 6 871 62
## 7 1115 60
## 8 400 33
## 9 8826 1675
## 10 5904 1387
## 11 1021 361
## 12 802 265
## 13 3134 2221
## 14 764 959
## 15 153 423
## 16 107 306
##
## <=50K >50K
## 1 0.0015662910 0.0000000000
## 2 0.0049752772 0.0001842695
## 3 0.0097355732 0.0004913854
## 4 0.0186112220 0.0012284635
## 5 0.0149565431 0.0008292129
## 6 0.0267497927 0.0019041184
## 7 0.0342434200 0.0018426952
## 8 0.0122846350 0.0010134824
## 9 0.2710604711 0.0514419090
## 10 0.1813212125 0.0425969718
## 11 0.0313565308 0.0110868831
## 12 0.0246306932 0.0081385707
## 13 0.0962501152 0.0682104358
## 14 0.0234636528 0.0294524124
## 15 0.0046988729 0.0129910015
## 16 0.0032861399 0.0093977458
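The table and prop.table steps described above can be sketched as follows. The data frame `toy` and its `sex` column are made up for illustration; substitute `t` and whichever text column you are exploring. Called without a `margin`, `prop.table` divides every cell by the grand total, which appears to be what the output above shows. Passing `margin = 1` instead gives the share within each row, which is usually closer to "% over 50k by group".

```r
# Hypothetical stand-in for the course tibble `t`; the values are made up.
toy <- data.frame(
  sex     = c("Male", "Male", "Male", "Female",
              "Female", "Female", "Male", "Female"),
  over50k = c(">50K", "<=50K", ">50K", "<=50K",
              "<=50K", ">50K", "<=50K", "<=50K")
)

# Counts of each combination of sex and income bracket.
counts <- table(toy$sex, toy$over50k)

# Each cell divided by the grand total (all cells sum to 1).
overall <- prop.table(counts)

# Each cell divided by its row total (each row sums to 1).
by_row <- prop.table(counts, margin = 1)

print(counts)
print(by_row)
```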
We often want to group variables, or turn text values into numbers for easier analysis.
Create a new tibble called t_numbers. Add:
is_male: 0/1 column
is_white: 0/1 column
Then, print a correlation test with those two new columns and is_over50k. Are any of these better than your other numeric columns?
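A minimal sketch of building the 0/1 columns with `mutate` and `if_else`. The column names `sex` and `race`, their text values, and the tibble `toy` are assumptions made for illustration; check the actual names and values in `t` before adapting this.

```r
library(dplyr)

# Hypothetical stand-in for `t`; column names and values are assumptions.
toy <- tibble(
  sex       = c("Male", "Female", "Male", "Male", "Female", "Female"),
  race      = c("White", "White", "Black", "White", "Asian-Pac-Islander", "White"),
  over50k01 = c(1, 0, 0, 1, 0, 1)
)

# Turn each text column into a 0/1 indicator, then keep only numeric columns.
t_numbers <- toy %>%
  mutate(is_male  = if_else(sex == "Male", 1, 0),
         is_white = if_else(race == "White", 1, 0)) %>%
  select(is_male, is_white, over50k01)

# Correlation matrix of the indicators and the outcome.
cors <- cor(t_numbers)
print(cors)
```

The last column of the matrix shows how strongly each indicator moves with over50k01, which can be compared directly with the numeric columns from the earlier task.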
Your goal is to find or create a variable with the highest correlation to over50k01. Try different text values, and then turn them into 0/1 fields. Print out the correlations, as well as a visualization of each of the variables.
Start by viewing a visualization of each key variable. Then, pull out variables as needed.
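One way to try many text values quickly, sketched under assumptions: for each distinct value of a text column, build a temporary 0/1 indicator and compute its correlation with over50k01. The data frame `toy` and the column `marital_status` are made up for illustration.

```r
# Hypothetical stand-in for `t`; column name and values are made up.
toy <- data.frame(
  marital_status = c("Married", "Never-married", "Married", "Divorced",
                     "Married", "Never-married", "Divorced", "Married"),
  over50k01      = c(1, 0, 1, 0, 0, 0, 0, 1)
)

# Every distinct text value becomes a candidate 0/1 field.
values <- unique(toy$marital_status)

# Correlation between each candidate indicator and the outcome.
cors <- sapply(values, function(v) {
  cor(as.numeric(toy$marital_status == v), toy$over50k01)
})

# Strongest positive correlation first.
print(sort(cors, decreasing = TRUE))
```

Sorting by `abs(cors)` instead would also surface strongly negative values, which are just as informative as positive ones.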