The Rise of Data Science (Ethics)

Why Care?

Right and Wrong

Data Science

Data Science Ethics Equilibrium

The FAT Flow Framework for Data Science Ethics

Step-by-Step Examples

The sample data comes from the Current Population Survey (CPS) for the 2010 to 2022 survey years. These surveys gather information on education, labor force status, demographics, and other aspects of the U.S. population, which we will use to investigate social and economic trends in the U.S. The sample has the following characteristics:

  • The sample contains 210,451 person records.

  • The sample data is composed entirely of individual person records from the survey.

  • This is a weighted sample, meaning that the individual records do not all represent the same number of persons in the population.

  • The predominant race of the population sample is White, with 88% of the population being White and the remaining 12% being Black.

  • The population is approximately 50% male and 50% female.
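To make the weighting point concrete, the sketch below compares an unweighted and a weighted mean on a tiny made-up data frame. The column names `hourwage` and `weight` and all values are hypothetical; they are not taken from the CPS extract.

```r
# Hypothetical toy data: three respondents with different survey weights
toy <- data.frame(
  hourwage = c(10, 20, 30),
  weight   = c(3000, 1000, 1000)  # persons each record represents (made up)
)

# The unweighted mean treats every record equally
unweighted <- mean(toy$hourwage)

# The weighted mean counts each record in proportion to its weight
weighted <- weighted.mean(toy$hourwage, w = toy$weight)

print(c(unweighted = unweighted, weighted = weighted))
```

Because the lowest-paid record carries the largest weight, the weighted mean falls below the unweighted one. This is the kind of gap that can appear when weights are ignored.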

Variables

Variable    Explanation
year        This variable is the census year and constitutes a unique identifier for each person record in the IPUMS
sex         This variable is the sex of the person, either male or female
age         This variable is the age of the person
race        This variable is the race of the person, either white or black
education   This variable is the educational attainment, as measured by the highest year of school or degree completed
hourwage    This variable reports how much the respondent earned per hour in the current job, in contemporary dollars, for those workers paid an hourly wage
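The variable list can also serve as a lightweight schema check before analysis. The following is a minimal sketch, assuming a hypothetical helper `missing_columns()` and a made-up one-row data frame; neither is part of the original workflow.

```r
# Column names taken from the variable table
expected <- c("year", "sex", "age", "race", "education", "hourwage")

# Hypothetical helper: return the expected columns that are absent from a data frame
missing_columns <- function(data, expected) {
  setdiff(expected, names(data))
}

# Made-up example that lacks the hourwage column
toy <- data.frame(year = 2010, sex = "Female", age = 30,
                  race = "White", education = "Grade 11")

print(missing_columns(toy, expected))
```

An empty result would indicate that every listed variable is present.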

First, as part of data preparation, we clean the raw dataset to remove irrelevant records and variables. Cleaning the data ahead of time yields accurate, defensible data, which in turn produces more reliable models and visualizations.

Data Cleaning

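# Load the packages used below
library(ipumsr)      # read_ipums_ddi(), read_ipums_micro(), lbl_clean(), as_factor()
library(dplyr)       # mutate(), %>%
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

# Read the data
ddi <- read_ipums_ddi("cps_00004.xml")
df <- read_ipums_micro(ddi)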
## Use of data from IPUMS CPS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
zip_data <- read.csv("VA_Zip_Codes_-4237492270829032891.csv")
#str(data)

# Convert the labelled sex, education, and race variables to factors
df <- df %>%
  mutate(
    SEX.factor  = as_factor(lbl_clean(SEX)),
    EDUC.factor = as_factor(lbl_clean(EDUC)),
    RACE.factor = as_factor(lbl_clean(RACE))
  )

# Remove the original SEX, EDUC, and RACE variables, now replaced by the factor versions
df1 <- subset(df, select = -c(SEX, EDUC, RACE))

# Change column names (assigned by position)
names(df1)[1:7] <- c("year", "age", "hourweight", "hourwage",
                     "sex", "education", "race")

# Not needed for this analysis
df1$hourweight <- NULL
df1$year <- NULL

# Add a sequential record identifier (ID0001, ID0002, ...) as the first column
formatted_id <- sprintf("ID%04d", seq_len(nrow(df1)))
df1 <- cbind(id = formatted_id, df1)

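# Remove records with no hourly wage, which are coded as 999.99 rather than NA,
# then drop factor levels that no longer occur
df1 <- subset(df1, hourwage != 999.99) %>% droplevels()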
# Preview the first 10 rows of df2 (df1 with a Zip_Code column added; that step is not shown here)
head(df2, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
id      Zip_Code  sex     race   age  education                                            hourwage
ID0008  23221     Female  White  30   Associate’s degree, occupational/vocational program  11.49
ID0009  23460     Male    White  33   Associate’s degree, occupational/vocational program  14.50
ID0077  22153     Male    White  60   High school diploma or equivalent                    9.00
ID0080  22303     Female  White  41   High school diploma or equivalent                    7.50
ID0081  22181     Female  White  17   Grade 11                                             7.25
ID0083  24015     Female  White  31   Associate’s degree, academic program                 10.42
ID0084  23832     Male    White  31   High school diploma or equivalent                    13.00
ID0088  23603     Male    White  53   Grade 11                                             12.00
ID0089  23323     Male    White  45   Some college but no degree                           28.75
ID0090  22303     Female  White  38   Professional school degree                           13.00
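The ID and filtering steps can be tried without the IPUMS extract. The following sketch applies the same `sprintf()` ID format and the 999.99 filter to a small made-up data frame; the data frame `toy` and its values are hypothetical stand-ins for `df1`.

```r
# Hypothetical stand-in for df1 (all values are made up)
toy <- data.frame(
  age       = c(30, 33, 60, 41),
  hourwage  = c(11.49, 999.99, 9.00, 999.99),
  education = factor(c("Grade 11", "Doctorate degree",
                       "Grade 11", "Doctorate degree"))
)

# Sequential IDs, zero-padded to four digits, assigned before filtering
toy <- cbind(id = sprintf("ID%04d", seq_len(nrow(toy))), toy)

# Drop the 999.99 placeholder rows, then drop factor levels that no longer occur
toy_clean <- droplevels(subset(toy, hourwage != 999.99))

print(toy_clean)
print(levels(toy_clean$education))
```

Because the IDs are assigned before the filter runs, the surviving rows keep their original IDs. This is consistent with the gaps in the `id` column of the preview above.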

Summary