September 27, 2024
Load the Current Population Survey CSV file:
CPS <- read.csv("CPSData.csv", stringsAsFactors = TRUE)
Summary of the dataset:
summary(CPS)
PeopleInHousehold  Region           State               Age            Married
Min.   : 1.000     Midwest  :26337  California  :11362  Min.   : 0.00  Divorced     :11350
1st Qu.: 2.000     Northeast:21438  Texas       : 7562  1st Qu.:19.00  Married      :55839
Median : 3.000     South    :48488  Florida     : 5772  Median :39.00  Never Married:31723
Mean   : 3.288     West     :36436  New York    : 5278  Mean   :39.27  Separated    : 2018
3rd Qu.: 4.000                      Illinois    : 3870  3rd Qu.:58.00  Widowed      : 6616
Max.   :15.000                      Pennsylvania: 3646  Max.   :85.00  NA's         :25153
                                    (Other)     :95209
Sex           Education                      Race                     Hispanic
Female:68051  High school            :30885  American Indian :  2003  Min.   :0.0000
Male  :64648  Bachelor's degree      :20283  Asian           :  6963  1st Qu.:0.0000
              Some college, no degree:18861  Black           : 14564  Median :0.0000
              No high school diploma :15878  Multiracial     :  2915  Mean   :0.1458
              Associate degree       : 9983  Pacific Islander:   760  3rd Qu.:0.0000
              (Other)                :11656  White           :105494  Max.   :1.0000
              NA's                   :25153
Citizenship                  EmploymentStatus          Industry
Citizen, Native     :117628  Disabled          : 6150  Educational and health services   :15053
Citizen, Naturalized:  7138  Employed          :62518  Trade                             : 8769
Non-Citizen         :  7933  Not in Labor Force:15518  Professional and business services: 7749
                             Retired           :19928  Manufacturing                     : 6523
                             Unemployed        : 2989  Leisure and hospitality           : 6301
                             NA's              :25596  (Other)                           :21663
                                                       NA's                              :66641
Number of interviewees in the dataset:
num_interviewees <- nrow(CPS)
cat("Number of interviewees:", num_interviewees, "\n")
Number of interviewees: 132699
Number of numeric or integer variables in this dataset:
# Get numeric and integer columns
numeric_columns <- sapply(CPS, is.numeric)
# Count the number of numeric or integer variables
num_numeric_vars <- sum(numeric_columns)
cat("Number of numeric or integer variables:", num_numeric_vars, "\n")
Number of numeric or integer variables: 3
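As a cross-check on a count like the one above, the class of every column can be tabulated in one step. The sketch below uses a small made-up data frame (`toy` is a hypothetical stand-in, not the CPS file), so its names and values are assumptions for illustration only.

```r
# Hypothetical toy data frame standing in for CPS (values are made up)
toy <- data.frame(
  Age = c(25L, 40L, 67L),
  PeopleInHousehold = c(1L, 4L, 2L),
  Region = factor(c("West", "South", "West")),
  Sex = factor(c("Female", "Male", "Female"))
)

# Class of each column, then a tally of how many columns share each class
column_classes <- sapply(toy, class)
class_counts <- table(column_classes)
print(class_counts)

# is.numeric() is TRUE for integer and double columns, FALSE for factors
num_numeric <- sum(sapply(toy, is.numeric))
cat("Numeric or integer columns:", num_numeric, "\n")
```

Tabulating the classes shows at a glance which columns were read as factors and which as numbers.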
Factor variable with largest number of unique values:
factor_columns <- sapply(CPS, is.factor)
# Count the defined levels of each factor column (nlevels() does not count NA as a value)
unique_values <- sapply(CPS[, factor_columns, drop = FALSE], nlevels)
# Find the factor variable with the largest number of unique values
max_unique <- which.max(unique_values)
cat("Factor variable with the largest number of unique values:", names(unique_values)[max_unique], "\n")
Factor variable with the largest number of unique values: State
# Create a data frame to display factor variable names and their unique value counts
factor_unique_values <- data.frame(Variable = names(unique_values), Unique_Values = unique_values)
cat("Unique values for each factor variable:\n")
Unique values for each factor variable:
print(factor_unique_values)
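When counting the distinct values of a factor, it matters whether missing values are included: `unique()` returns `NA` as a value of its own, while `nlevels()` counts only the defined levels. A minimal sketch on a made-up factor (the vector `status` is hypothetical, not taken from the CPS file):

```r
# Hypothetical factor with one missing value (not taken from the CPS file)
status <- factor(c("Married", "Divorced", NA, "Married", "Widowed"))

n_unique <- length(unique(status))  # NA is counted as a distinct value
n_levels <- nlevels(status)         # only the defined levels are counted

cat("length(unique(x)):", n_unique, "\n")
cat("nlevels(x):", n_levels, "\n")
```

For columns with missing values, such as Married or Industry in this dataset, the two counts would differ by one.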
Maximum, Median, and Mean values of the Age variable:
max_age <- max(CPS$Age, na.rm = TRUE)
median_age <- median(CPS$Age, na.rm = TRUE)
mean_age <- mean(CPS$Age, na.rm = TRUE)
cat("Maximum Age:", max_age, "\n")
Maximum Age: 85
cat("Median Age:", median_age, "\n")
Median Age: 39
cat("Mean Age:", mean_age, "\n")
Mean Age: 39.26807
Most common industry among interviewees:
industry_data <- na.omit(CPS$Industry)
industry_counts <- table(industry_data)
most_common_industry <- names(industry_counts)[which.max(industry_counts)]
most_common_count <- max(industry_counts)
cat("Most common industry:", most_common_industry, "\n")
Most common industry: Educational and health services
cat("Number of interviewees in this industry:", most_common_count, "\n")
Number of interviewees in this industry: 15053
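A related point about the counts above: `table()` drops missing values by default, so the non-missing counts are the same with or without `na.omit()`. The sketch below shows this on a made-up vector (`industry` is hypothetical toy data) and also shows how `useNA = "ifany"` makes the missing values visible.

```r
# Hypothetical industry values with two missing entries (made up)
industry <- factor(c("Trade", "Manufacturing", NA, "Trade",
                     "Financial", NA, "Trade"))

# table() ignores NA by default; sort so the largest count comes first
industry_counts <- sort(table(industry), decreasing = TRUE)
top_industry <- names(industry_counts)[1]
cat("Most common industry:", top_industry, "\n")

# Ask table() to report missing values as their own category
counts_with_na <- table(industry, useNA = "ifany")
print(counts_with_na)
```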
Proportion of interviewees who are citizens of the US:
total_interviewees <- nrow(CPS)
us_citizens <- sum(CPS$Citizenship %in% c("Citizen, Native", "Citizen, Naturalized"), na.rm = TRUE)
proportion_us_citizens <- us_citizens / total_interviewees
cat("Total number of interviewees:", total_interviewees, "\n")
Total number of interviewees: 132699
cat("Number of US citizens:", us_citizens, "\n")
Number of US citizens: 124766
cat("Proportion of interviewees who are US citizens:", round(proportion_us_citizens, 4), "\n")
Proportion of interviewees who are US citizens: 0.9402
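The same kind of proportion can be computed in one step, because the mean of a logical vector is the share of `TRUE` values. A minimal sketch on made-up data (the vector `citizenship` is hypothetical, not the CPS column):

```r
# Hypothetical citizenship values (made up, not the CPS data)
citizenship <- factor(c("Citizen, Native", "Non-Citizen",
                        "Citizen, Naturalized", "Citizen, Native"))

# TRUE where the interviewee is a native or naturalized citizen
is_citizen <- citizenship %in% c("Citizen, Native", "Citizen, Naturalized")

# The mean of a logical vector is the proportion of TRUE values
proportion_citizens <- mean(is_citizen)
cat("Proportion of citizens:", proportion_citizens, "\n")
```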
State with the fewest interviewees:
# Create a table of counts of interviewees by state
state_counts <- table(CPS$State)
# Sort the state counts in ascending order
sorted_state_counts <- sort(state_counts)
# Extract the state with the fewest interviewees and the corresponding count
state_with_fewest <- names(sorted_state_counts)[1]
count_with_fewest <- sorted_state_counts[1]
cat("State with the fewest interviewees:", state_with_fewest, "\n")
State with the fewest interviewees: Maine
cat("Number of interviewees in this state:", count_with_fewest, "\n")
Number of interviewees in this state: 1190
Races with at least 250 interviewees of Hispanic ethnicity:
# Hispanic is a numeric 0/1 indicator, so compare against the number 1
hispanic_data <- CPS[CPS$Hispanic == 1, ]
race_counts <- table(hispanic_data$Race)
races_with_at_least_250 <- race_counts[race_counts >= 250]
cat("Races with at least 250 interviewees of Hispanic ethnicity:\n")
Races with at least 250 interviewees of Hispanic ethnicity:
print(races_with_at_least_250)
American Indian           Black     Multiracial           White
            375             659             439           17617
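An alternative to subsetting first is a two-way table of race against the Hispanic indicator, which shows both groups side by side. The sketch below uses made-up data (`toy` and the threshold of 2 are assumptions chosen to fit the tiny example, not the CPS values).

```r
# Hypothetical toy data: race and a 0/1 Hispanic indicator (made up)
toy <- data.frame(
  Race = factor(c("White", "White", "Black", "Asian", "White", "Black")),
  Hispanic = c(1, 0, 1, 0, 1, 0)
)

# Rows are races, columns are the indicator values "0" and "1"
race_by_hispanic <- table(toy$Race, toy$Hispanic)
print(race_by_hispanic)

# Column "1" holds the Hispanic counts; keep races at or above the threshold
hispanic_counts <- race_by_hispanic[, "1"]
selected <- hispanic_counts[hispanic_counts >= 2]
print(selected)
```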
Histogram of the number of people in the interviewee’s household:
hist(CPS$PeopleInHousehold,
main = "Histogram of Household Size",
xlab = "Number of People in Household",
ylab = "Frequency",
col = "lightblue")
cat("The most frequent household sizes are the smallest ones. As the household size grows, the frequency of occurrence drops.\n")
The most frequent household sizes are the smallest ones. As the household size grows, the frequency of occurrence drops.
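One caution when reading a histogram of integer counts: if the breaks fall on the integers, the first bin is closed on both sides, so household sizes 1 and 2 land in the same bar. The sketch below illustrates this on made-up data (`household_size` is a hypothetical vector, not the CPS column); exact counts from `table()` or breaks on the half-integers give one bar per size.

```r
# Hypothetical household sizes (made up): size 2 is the most common
household_size <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 4)

# Breaks on the integers: the first bin is [1, 2], so sizes 1 and 2 are merged
merged <- hist(household_size, breaks = 1:4, plot = FALSE)
print(merged$counts)

# Breaks on the half-integers: one bin per household size
separate <- hist(household_size, breaks = seq(0.5, 4.5, by = 1), plot = FALSE)
print(separate$counts)

# Exact counts per size, independent of any binning
size_counts <- table(household_size)
most_common_size <- names(size_counts)[which.max(size_counts)]
cat("Most common household size:", most_common_size, "\n")
```

In this toy example the merged first bar is the tallest even though size 1 is not the most common value, which is why checking the exact counts is worthwhile.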
Boxplot of age by marital status:
boxplot(Age ~ Married,
data = CPS,
main = "Boxplot of Age by Marital Status",
xlab = "Marital Status",
ylab = "Age",
col = "lightblue",
border = "black",
las = 2)
cat("The youngest group has never been married, and the oldest group is the widowed. This makes intuitive sense.\n")
The youngest group has never been married, and the oldest group is the widowed. This makes intuitive sense.
Boxplot of age by employment status:
boxplot(Age ~ EmploymentStatus,
data = CPS,
main = "Boxplot of Age by Employment Status",
xlab = "Employment Status",
ylab = "Age",
col = "lightblue",
border = "black",
las = 2)
cat("The youngest groups are those not in the labor force and the unemployed. The oldest group is the retired. This makes intuitive sense.\n")
The youngest groups are those not in the labor force and the unemployed. The oldest group is the retired. This makes intuitive sense.
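A reading taken from a boxplot can be backed up with the group medians themselves. The sketch below computes them with `tapply()` on made-up data (`toy` is a hypothetical data frame, not the CPS file).

```r
# Hypothetical ages by employment status (values are made up)
toy <- data.frame(
  Age = c(22, 30, 41, 68, 72, 75, 19, 25),
  EmploymentStatus = factor(c("Employed", "Employed", "Employed",
                              "Retired", "Retired", "Retired",
                              "Unemployed", "Unemployed"))
)

# Median age within each employment status
median_age_by_status <- tapply(toy$Age, toy$EmploymentStatus,
                               median, na.rm = TRUE)
print(median_age_by_status)

# Names of the youngest and oldest groups by median age
youngest_group <- names(which.min(median_age_by_status))
oldest_group <- names(which.max(median_age_by_status))
cat("Youngest:", youngest_group, " Oldest:", oldest_group, "\n")
```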
# Create a subset of interviewees from California
california_interviewees <- subset(CPS, State == "California")
# Cross-tabulate education level against employment status (table() drops NA values by default)
education_employment_table <- table(california_interviewees$Education, california_interviewees$EmploymentStatus)
Probability that a California interviewee is unemployed, given an Associate's degree:
# Count the total number of interviewees with an Associate's degree
total_associates_degree <- sum(california_interviewees$Education == "Associate degree", na.rm = TRUE)
cat("Total interviewees with an Associate's degree:", total_associates_degree, "\n")
Total interviewees with an Associate's degree: 714
associate_unemployed <- education_employment_table["Associate degree", "Unemployed"] # Number of people with Associate's degree and unemployed
probability_associate_unemployed <- associate_unemployed / total_associates_degree
cat("Probability that someone with an Associate's degree is unemployed:", round(probability_associate_unemployed, 4), "\n")
Probability that someone with an Associate's degree is unemployed: 0.0322
Probability that a California interviewee is unemployed, given a Master's degree:
# Count the total number of interviewees with a Master's degree
total_masters_degree <- sum(california_interviewees$Education == "Master's degree", na.rm = TRUE)
cat("Total interviewees with a Master's degree:", total_masters_degree, "\n")
Total interviewees with a Master's degree: 716
masters_unemployed <- education_employment_table["Master's degree", "Unemployed"] # Number of people with Master's degree and unemployed
probability_masters_unemployed <- masters_unemployed / total_masters_degree
cat("Probability that someone with a Master's degree is unemployed:", round(probability_masters_unemployed, 4), "\n")
Probability that someone with a Master's degree is unemployed: 0.0196
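Conditional probabilities of this kind can also be read from row-wise proportions of the cross-table. One detail to watch is the denominator: `table()` drops rows with a missing employment status, whereas counting every row with a given education level keeps them. The sketch below shows both on made-up data (`toy` is hypothetical); which denominator is appropriate depends on the question being asked.

```r
# Hypothetical toy data with one missing employment status (made up)
toy <- data.frame(
  Education = factor(c(rep("Associate degree", 4), rep("Master's degree", 4))),
  EmploymentStatus = factor(c("Employed", "Unemployed", "Employed", NA,
                              "Employed", "Employed", "Employed", "Unemployed"))
)

counts <- table(toy$Education, toy$EmploymentStatus)

# Row-wise proportions: each row sums to 1, giving P(status | education)
conditional <- prop.table(counts, margin = 1)
p_known_status <- conditional["Associate degree", "Unemployed"]  # 1 of 3

# Dividing by every row with that education level keeps the NA row
p_all_rows <- counts["Associate degree", "Unemployed"] /
  sum(toy$Education == "Associate degree")                       # 1 of 4

cat("Denominator = known status:", round(p_known_status, 4), "\n")
cat("Denominator = all rows:", round(p_all_rows, 4), "\n")
```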
Average age of California residents who work in the financial industry:
financial_industry_california <- subset(CPS, State == "California" & Industry == "Financial")
average_age_financial_industry <- mean(financial_industry_california$Age, na.rm = TRUE)
cat("Average age of California residents in the financial industry:", round(average_age_financial_industry, 2), "\n")
Average age of California residents in the financial industry: 45.91
Average age of California residents who work in the leisure and hospitality industry:
leisure_hospitality_california <- subset(CPS, State == "California" & Industry == "Leisure and hospitality")
average_age_leisure_hospitality <- mean(leisure_hospitality_california$Age, na.rm = TRUE)
cat("Average age of California residents in the leisure and hospitality industry:", round(average_age_leisure_hospitality, 2), "\n")
Average age of California residents in the leisure and hospitality industry: 35.53
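The two averages above could also be obtained in one pass by grouping the California subset by industry. A minimal sketch on made-up data (`toy` is a hypothetical data frame, and its ages are not CPS values):

```r
# Hypothetical toy data (made up): state, industry, and age
toy <- data.frame(
  State = c("California", "California", "California", "Texas", "California"),
  Industry = c("Financial", "Financial", "Leisure and hospitality",
               "Financial", "Leisure and hospitality"),
  Age = c(50, 40, 30, 60, 36),
  stringsAsFactors = FALSE
)

# Keep California rows, then average age within each industry
california <- subset(toy, State == "California")
mean_age_by_industry <- tapply(california$Age, california$Industry,
                               mean, na.rm = TRUE)
print(round(mean_age_by_industry, 2))
```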
Marital status and mean number of people in household (California interviewees):
# Calculate the mean number of people in household for each marital status
mean_people_in_household_by_marriage <- aggregate(PeopleInHousehold ~ Married, data = california_interviewees, FUN = mean, na.rm = TRUE)
# View the results
cat("Mean number of people in household by marital status:\n")
Mean number of people in household by marital status:
print(mean_people_in_household_by_marriage)
# Find the marital status with the largest and smallest mean value
largest_mean <- mean_people_in_household_by_marriage[which.max(mean_people_in_household_by_marriage$PeopleInHousehold), ]
smallest_mean <- mean_people_in_household_by_marriage[which.min(mean_people_in_household_by_marriage$PeopleInHousehold), ]
# Print the results
cat("Marital status with the largest mean household size:\n")
Marital status with the largest mean household size:
print(largest_mean)
cat("Marital status with the smallest mean household size:\n")
Marital status with the smallest mean household size:
print(smallest_mean)