HW06_Borunda

Setting Up the Data

# Import the TransPop data set
data <- read.csv("/Users/nicoleborunda/Downloads/ICPSR_37938 3/DS0005/TransPopDS0005.csv")

# What's the structure of the TransPop data set? Limit to first 10 variables.

str(data, list.len = 10)

## 'data.frame':    1436 obs. of  612 variables:
##  $ STUDYID                  : int  151768927 152357242 152444055 152525272 152894493 152925625 153003265 153036828 153162357 153318257 ...
##  $ WEIGHT_CISGENDER_TRANSPOP: num  0.02204 0.00849 0.01576 0.03566 0.0418 ...
##  $ WEIGHT_CISGENDER         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ WEIGHT_TRANSPOP          : num  0.986 0.38 0.705 1.595 1.87 ...
##  $ GMETHOD_TYPE             : chr  " " " " " " " " ...
##  $ SURVEYCOMPLETED          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GRESPONDENT_DATE         : chr  "26-APR-2016" "07-APR-2016" "01-MAY-2016" "20-APR-2016" ...
##  $ GCENREG                  : int  1 3 3 4 2 1 3 2 3 4 ...
##  $ RACE                     : int  6 6 6 6 8 8 6 8 2 6 ...
##  $ RACE_RECODE              : int  1 1 1 1 3 6 1 6 2 1 ...
##   [list output truncated]

One of the challenges of working with this data set is the sheer number of variables. For this assignment, I want to compare Life Satisfaction scores for cisgender and transgender respondents. That means I need to locate the few variables out of the 612 that are relevant to this analysis and make sure I can use them the way I think I can. To start, I will confirm that the variables I think are in my data set are actually there.

column_names <- names(data)
# Confirm "LIFESAT" is among the column names
if ("LIFESAT" %in% column_names) {
  print("LIFESAT variable is in the data frame.")
} else {
  print("LIFESAT variable is not found in the data frame.")
}

## [1] "LIFESAT variable is in the data frame."
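The same kind of check can cover several variables in one pass. This is a minimal sketch, assuming a small stand-in data frame (`toy`) instead of the real TransPop file; `needed` and `missing_cols` are illustrative names, not part of the original analysis.

```r
# Stand-in data frame; the real analysis would use `data` from read.csv()
toy <- data.frame(LIFESAT = c(4.2, 5.0), SURVEYCOMPLETED = c(0, 2))

# Columns the analysis relies on (AGE is deliberately absent from `toy`)
needed <- c("LIFESAT", "SURVEYCOMPLETED", "AGE")

# setdiff() returns the names in `needed` that are not columns of `toy`
missing_cols <- setdiff(needed, names(toy))

if (length(missing_cols) == 0) {
  print("All needed variables are in the data frame.")
} else {
  print(paste("Not found:", paste(missing_cols, collapse = ", ")))
}
```

With the real data frame, replacing `toy` with `data` would report any of the three variables that are missing in a single message.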

Next, I want to confirm that LIFESAT is a numeric variable so I am certain it can be used for my analysis.

# Confirm LIFESAT variable type 
variable_type <- class(data$LIFESAT)

# Print the data type of the "LIFESAT" variable
print(variable_type)

## [1] "numeric"

Next, I want to confirm that the SURVEYCOMPLETED variable is in the data set. This variable indicates whether the participant took the cisgender survey or one of the transgender surveys. There were two transgender surveys at two different time points, coded as 0 and 1. I want to combine those into one group. The cisgender survey is coded as 2.

column_names <- names(data)
# Confirm "SURVEYCOMPLETED" is among the column names
if ("SURVEYCOMPLETED" %in% column_names) {
  print("SURVEYCOMPLETED variable is in the data frame.")
} else {
  print("SURVEYCOMPLETED variable is not found in the data frame.")
}

## [1] "SURVEYCOMPLETED variable is in the data frame."

Next, I want to make sure that the values in the variable are as I expect: 0, 1, and 2. Note: ChatGPT helped me figure out the unique_values code.

# Check for unique values for SURVEYCOMPLETED
unique_values <- unique(data$SURVEYCOMPLETED)

# Print unique values
print(unique_values)

## [1] 0 1 2

Next I want to collapse the 0 and 1 coded surveys, which are both for transgender participants, into one group. This will allow me to compare transgender participants to cisgender participants. I will create a new variable called TRANSORCIS and use the ifelse() function to recategorize 0 and 1 as trans and 2 as cis. I tried to do this by keeping the dummy coding but I could not get it to work by recoding from 0, 1, and 2 to just 0 and 1. After an hour or so of failing to make it work, I just gave in and used the non-numeric categorizations, “trans” and “cis”.

# Create a new variable TRANSORCIS based on SURVEYCOMPLETED
data$TRANSORCIS <- ifelse(data$SURVEYCOMPLETED %in% c(0, 1), "trans", "cis")

# Check for unique values for TRANSORCIS
unique_values <- unique(data$TRANSORCIS)

# Print unique values
print(unique_values)

## [1] "trans" "cis"
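For reference, one possible way to keep a numeric 0/1 coding is sketched below. It assumes a short made-up vector (`survey_completed`) standing in for `data$SURVEYCOMPLETED`; the names `cis_dummy` and `trans_or_cis` are hypothetical.

```r
# Stand-in for data$SURVEYCOMPLETED (0 and 1 = trans surveys, 2 = cis survey)
survey_completed <- c(0, 1, 2, 2, 0, NA)

# A logical comparison converted to integer gives a 0/1 dummy:
# TRUE (code 2, cis) becomes 1, FALSE (codes 0 or 1, trans) becomes 0, NA stays NA
cis_dummy <- as.integer(survey_completed == 2)

# A factor keeps readable labels and is still treated as categorical by lm()
trans_or_cis <- factor(cis_dummy, levels = c(0, 1), labels = c("trans", "cis"))

print(cis_dummy)
print(trans_or_cis)
```

One difference worth knowing: `==` leaves a missing survey code as `NA`, whereas `ifelse(x %in% c(0, 1), "trans", "cis")` would label a missing code as "cis". The order given in `levels` also decides which group a later model treats as the reference.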

Now that I have confirmed my variables are in my data frame and are formatted for the analysis I want to do, I can do the comparison and plot the data.

Analyze and Plot the Data

# Subset only the variables I want from my data frame
subset_data <- data[c("LIFESAT", "TRANSORCIS")]

# Create the box plot
boxplot(LIFESAT ~ TRANSORCIS, data = subset_data,
        xlab = "Group", ylab = "Life Satisfaction",
        main = "Comparison of Life Satisfaction\nfor Trans and Cis People")

I initially created a scatter plot as a reflex because that is what we have primarily been working on in class. I immediately realized that a scatter plot doesn’t make any sense for displaying these data. To compare two groups on a single variable, I am much more interested in visualizing simple differences in how the scores are distributed. This box plot tells me that overall, trans participants have lower life satisfaction scores than cis participants. The median life satisfaction score for trans participants is about the same as the cis group’s lower quartile, whereas the cis group’s median is about the same as the trans group’s upper quartile. The minimum and maximum are the same for both groups because life satisfaction is on a 7-point scale.
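The medians and quartiles read off the box plot can also be computed directly. The sketch below uses invented scores in a stand-in data frame (`toy`), not TransPop values, so the numbers it prints say nothing about the real groups.

```r
# Made-up scores on a 1-7 scale; not TransPop values
toy <- data.frame(
  LIFESAT    = c(2, 3, 4, 5, 6, 3, 4, 5, 6, 7),
  TRANSORCIS = rep(c("trans", "cis"), each = 5)
)

# Min, lower quartile, median, upper quartile, and max for each group
group_summary <- tapply(toy$LIFESAT, toy$TRANSORCIS, quantile, na.rm = TRUE)
print(group_summary)

# Medians only
group_medians <- tapply(toy$LIFESAT, toy$TRANSORCIS, median, na.rm = TRUE)
print(group_medians)
```

Note that `boxplot()` draws its box from hinges, which can differ slightly from the quartiles that `quantile()` reports, so small mismatches between the plot and the table are expected.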

While this is not a deep analysis, it is enough information to conclude that I may be onto something here and I should keep digging to figure out what other variables may be contributing to the differences in life satisfaction among these two groups.

Get A Bit More Complex (and fun!)

I’m curious what variability I might find if I take into account age. The variable AGE is continuous. I’m going to first confirm that this variable is in my data frame, and then I’m going to create a scatter plot showing age by life satisfaction for trans and cis participants.

column_names <- names(data)
# Confirm "AGE" is a column name
if ("AGE" %in% column_names) {
  print("AGE variable is in the data frame.")
} else {
  print("AGE variable is not found in the data frame.")
}

## [1] "AGE variable is in the data frame."

# Plot Age and Life Satisfaction for Trans and Cis People
library(ggplot2)

# Create a scatter plot
ggplot(data, aes(x = AGE, y = LIFESAT, color = factor(TRANSORCIS))) +
  geom_point() +
  labs(x = "Age", y = "Life Satisfaction", color = "Trans or Cis") +
  scale_color_manual(values = c("lightblue", "pink")) +
  theme_minimal()

## Warning: Removed 61 rows containing missing values (`geom_point()`).

This scatter plot shows me three main things: 1) There are more cis participants in my data than trans participants, 2) the trans participants seem to skew younger and less satisfied with life, whereas 3) the cis participants seem to be the opposite.
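The impression that one group is larger can be checked with a count rather than by eye. This is a small sketch on an invented stand-in data frame (`toy`); the counts are made up and are not the TransPop group sizes.

```r
# Stand-in data with a few missing values; all numbers are invented
toy <- data.frame(
  AGE        = c(22, 35, NA, 61, 48, 29),
  LIFESAT    = c(3.2, NA, 4.0, 6.1, 5.4, 2.8),
  TRANSORCIS = c("trans", "trans", "cis", "cis", "cis", "cis")
)

# Number of respondents in each group
group_sizes <- table(toy$TRANSORCIS)
print(group_sizes)

# Rows with both AGE and LIFESAT present (the rows a scatter plot can draw)
plottable <- complete.cases(toy[c("AGE", "LIFESAT")])
plottable_sizes <- table(toy$TRANSORCIS[plottable])
print(plottable_sizes)
```

The second table is the relevant one for a scatter plot, since rows missing either AGE or LIFESAT are dropped before plotting.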

I’m not satisfied enough to draw any conclusions yet. Given how uneven the representation is between cis and trans participants, I’m going to filter the data and pull a random sample from each group so that I can re-plot this with equal representation of trans and cis people. Downsampling has risks and benefits. Here I’m really just curious how the visualization will change when I balance the groups.

# Filter data for trans and cis participants
trans_data <- data[data$TRANSORCIS == "trans", ]
cis_data <- data[data$TRANSORCIS == "cis", ]

# Randomly sample cis data to match the number of trans observations
sampled_cis_data <- cis_data[sample(nrow(cis_data), nrow(trans_data)), ]

# Combine trans and sampled cis data frames
balanced_data <- rbind(trans_data, sampled_cis_data)

ggplot(balanced_data, aes(x = AGE, y = LIFESAT, color = TRANSORCIS)) +
  geom_point() +
  scale_color_manual(values = c("lightblue", "pink")) +
  labs(x = "Age", y = "Life Satisfaction", color = "Participant Type") +
  theme_bw()

## Warning: Removed 16 rows containing missing values (`geom_point()`).
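Because `sample()` draws different rows on each run, the balanced plot will change slightly every time the document is knit. A sketch of one way to make the draw repeatable is below; it assumes invented stand-in groups (`trans_toy`, `cis_toy`) and an arbitrary seed value.

```r
# Stand-in groups; row counts are invented, not the TransPop counts
trans_toy <- data.frame(id = 1:4, TRANSORCIS = "trans")
cis_toy   <- data.frame(id = 5:14, TRANSORCIS = "cis")

# Fixing the seed makes the same rows come back on every run (123 is arbitrary)
set.seed(123)

# Size of the smaller group, so neither draw asks for more rows than exist
n_keep <- min(nrow(trans_toy), nrow(cis_toy))

# sample() draws without replacement by default
sampled_cis   <- cis_toy[sample(nrow(cis_toy), n_keep), ]
sampled_trans <- trans_toy[sample(nrow(trans_toy), n_keep), ]

balanced_toy <- rbind(sampled_trans, sampled_cis)
print(table(balanced_toy$TRANSORCIS))
```

Taking `min()` of the two group sizes is a guard: without it, the draw would fail with an error if the group being sampled happened to be the smaller one.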

Now that there is equal representation of trans and cis people, the patterns remain. Cis participants are older and have higher life satisfaction, and trans participants are younger with lower satisfaction. The question I’m left with is, how much does age account for life satisfaction?

# Run a linear regression to analyze the relationship between age and life satisfaction for cis and trans people
model <- lm(LIFESAT ~ AGE + TRANSORCIS, data = data)
summary(model)

## 
## Call:
## lm(formula = LIFESAT ~ AGE + TRANSORCIS, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3250 -1.0942  0.3317  1.2560  3.4341 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.963709   0.161874  24.486  < 2e-16 ***
## AGE              0.018907   0.002741   6.898 8.01e-12 ***
## TRANSORCIStrans -0.775971   0.116534  -6.659 3.98e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.566 on 1372 degrees of freedom
##   (61 observations deleted due to missingness)
## Multiple R-squared:  0.1019, Adjusted R-squared:  0.1006 
## F-statistic: 77.86 on 2 and 1372 DF,  p-value: < 2.2e-16

The model confirms that age is associated with life satisfaction; each additional year of age corresponds to a slightly higher LIFESAT score (about 0.019 points). It also indicates that trans people tend to score about 0.78 points lower on LIFESAT than cis people of the same age. Together, though, age and trans or cis status explain only about 10.2% of the variance in life satisfaction (R-squared = 0.1019). This indicates I should be considering other variables to account for the differences in life satisfaction scores. The next time I pick up this data, I’ll begin by adding urbanicity into the model.
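To make the coefficients concrete, the fitted equation can be evaluated by hand. The sketch below copies the three estimates from the summary(model) output above; the age of 30 is a hypothetical value chosen only for illustration.

```r
# Coefficients copied from the summary(model) output
b_intercept <- 3.963709
b_age       <- 0.018907
b_trans     <- -0.775971

# Hypothetical respondent age, chosen only for illustration
age <- 30

# Predicted LIFESAT = intercept + slope * age, plus the trans offset if trans
pred_cis   <- b_intercept + b_age * age
pred_trans <- pred_cis + b_trans

print(round(c(cis = pred_cis, trans = pred_trans), 2))
##  cis trans 
## 4.53  3.75
```

Because the model has no interaction term, the gap between the two predictions is the same 0.78 points at every age.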

2023-06-14
