Title: Homework 8

Author: Brandon Flores

Date: Nov. 3rd, 2021

Comparing the ages of transmen and transwomen, we can say with 95% confidence that transmen are, on average, about 5 to 16 years younger than transwomen. Because this interval excludes zero, we can reject the null hypothesis of no difference in mean age. The difference is statistically significant at the .001 level (p-value = 0.0002697).

Comparing the outputs of Excel and R, the results were very similar: Excel gave a t statistic of 3.734642 and a p-value of 0.000137. Although these two numbers differ slightly from the RStudio output (more so for the p-value), the interpretation is the same.

The same holds for the confidence intervals: both programs give an interval of about 5 to 16 years at 95% confidence.
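
As a rough check on how closely the two programs should agree, the Welch statistic, its degrees of freedom, and the 95% interval can be rebuilt from the group summaries alone. The sketch below assumes the group means and sizes printed in the R output further down and uses the standard deviations as rounded in that output (16.6 and 16.0), so its results should match `t.test` only to about two decimal places. The variable names are illustrative.

```r
# Welch two-sample t-test rebuilt from summary statistics.
# Assumption: the sds are the rounded values from the knitted output,
# so the results agree with t.test() only approximately.
m1 <- 33.58228; s1 <- 16.6; n1 <- 79   # transman
m2 <- 44.03333; s2 <- 16.0; n2 <- 60   # transwoman

v1 <- s1^2 / n1                        # squared standard error, group 1
v2 <- s2^2 / n2                        # squared standard error, group 2
se <- sqrt(v1 + v2)                    # standard error of the difference

t_stat <- (m1 - m2) / se
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
p_two_sided <- 2 * pt(-abs(t_stat), df_welch)
ci <- (m1 - m2) + c(-1, 1) * qt(0.975, df_welch) * se

print(c(t = t_stat, df = df_welch, p = p_two_sided))
print(ci)
```

Small differences between a hand or Excel calculation and R are expected whenever the inputs are rounded or a different df formula is used.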

The biggest difference between the two programs is the sign of the output: in R every value is negative (apart from the p-value and df), while in Excel they are positive. When I tried to reproduce R's negative values in Excel, the p-value changed drastically to about 0.999, which is the wrong output.
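
One possible explanation for the 0.999 value, offered as an assumption since the Excel settings are not shown here, is that the sign of t only matters for a one-tailed p-value. The sketch below takes the t statistic and df reported by R and compares the two-sided p-value with the two one-tailed versions.

```r
# How the sign of t interacts with one- and two-tailed p-values.
# t and df are copied from the R Welch output for age by trans.
t_stat <- -3.7457
df <- 129.47

p_two_sided <- 2 * pt(-abs(t_stat), df)          # ignores the sign of t
p_lower <- pt(t_stat, df)                        # P(T <= t)
p_upper <- pt(t_stat, df, lower.tail = FALSE)    # P(T >= t)

print(c(two_sided = p_two_sided, lower = p_lower, upper = p_upper))
```

The sign only reflects which group mean is subtracted from which, so flipping it leaves the two-sided p-value unchanged, whereas a one-tailed test in the wrong direction gives a p-value near 1. The Excel p-value of 0.000137 is also close to half of R's 0.0002697, which would be consistent with a one-tailed value, though that cannot be confirmed from this write-up.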

Finally, the degrees of freedom also differ: my df calculated by hand was 138, compared with about 129 in the R model when var.equal is set to FALSE (the Welch test).
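
The gap in degrees of freedom follows from the two formulas involved. As a worked sketch, using the group sizes from the tables below (79 and 60) and the standard deviations as rounded in the R output (16.6 and 16.0), so the last digits are approximate:

```latex
\[
df_{\text{pooled}} = n_1 + n_2 - 2 = 79 + 60 - 2 = 137
\]
\[
df_{\text{Welch}}
= \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}
       {\frac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
\approx \frac{(3.488 + 4.267)^2}{\frac{3.488^2}{78} + \frac{4.267^2}{59}}
\approx \frac{60.14}{0.4645}
\approx 129.5
\]
```

The equal-variance (pooled) formula lands near the hand-calculated value, while the Welch-Satterthwaite formula that R applies when equal variances are not assumed shrinks it to about 129.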

Even with these known differences, when calculated properly the outputs of the two programs are strikingly similar and lead to the same interpretation.

According to the data, cismen are older than ciswomen by about 1.94 years on average. The t-test shows that this difference is statistically significant at the .001 level. However, because the sample is extremely large (about 55,000 respondents, with a df of about 47,000), this statistically significant difference is not practically meaningful in the real world. With a sample this large, even a small difference between the two groups is very likely to be statistically significant.
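
To put a number on practical importance, one common option (not part of the original analysis) is a standardized effect size such as Cohen's d. The sketch below assumes the group means and sizes from the R output and the standard deviations as rounded there (16.3 and 15.5). The variable names are illustrative.

```r
# Cohen's d for cisman vs ciswoman from summary statistics.
# Assumption: the sds are the rounded values from the knitted output.
m1 <- 55.04441; s1 <- 16.3; n1 <- 22652   # cisman
m2 <- 53.10070; s2 <- 15.5; n2 <- 32522   # ciswoman

# Pooled standard deviation across the two groups
sd_pooled <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))

# Standardized mean difference
cohens_d <- (m1 - m2) / sd_pooled

print(round(cohens_d, 2))
```

Under Cohen's conventional benchmarks, where 0.2 is labeled "small", a value of roughly 0.12 supports reading the difference as statistically detectable but small.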

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)
library(rvest)

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:readr':
## 
##     guess_encoding

library(httr)
library(purrr)
library(stringr)
library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(plyr)

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following object is masked from 'package:purrr':
## 
##     compact

library(ggplot2)
library(Rmisc)

## Loading required package: lattice

pulse39 <- read.csv("C:\\Users\\BTP\\Downloads\\pulse2021_puf_39.csv")

# Combine the sex-at-birth and gender-identity codes into one string per respondent
pulse39$subgroup <- paste(pulse39$EGENID_BIRTH,
                          pulse39$GENID_DESCRIBE, sep = "")
pulse39 %>%
  tabyl(subgroup)

##  subgroup     n     percent
##      1-99   417 0.007307584
##        11 22652 0.396957802
##        12    66 0.001156596
##        13    60 0.001051451
##        14   263 0.004608860
##      2-99   544 0.009533156
##        21    73 0.001279265
##        22 32522 0.569921492
##        23    79 0.001384410
##        24   388 0.006799383

pulse39$subgroupcat <- car::Recode(pulse39$subgroup,
  recodes = " '11' = 'cisman'; '22' = 'ciswoman'; '23' = 'transman'; '13' = 'transwoman'; else=NA",
  as.factor = TRUE)
pulse39 %>%
  tabyl(subgroupcat)

##  subgroupcat     n     percent valid_percent
##       cisman 22652 0.396957802   0.409523982
##     ciswoman 32522 0.569921492   0.587963047
##     transman    79 0.001384410   0.001428236
##   transwoman    60 0.001051451   0.001084736
##         <NA>  1751 0.030684845            NA

# Approximate age in 2021 from the reported birth year
pulse39 <- transform(pulse39, age = 2021 - TBIRTH_YEAR)

pulse39$trans <- car::Recode(pulse39$subgroup,
  recodes = " '23' = 'transman'; '13' = 'transwoman'; else=NA",
  as.factor = TRUE)
pulse39 %>%
  tabyl(trans)

##       trans     n     percent valid_percent
##    transman    79 0.001384410     0.5683453
##  transwoman    60 0.001051451     0.4316547
##        <NA> 56925 0.997564139            NA

pulse39$cis <- car::Recode(pulse39$subgroup,
  recodes = " '11' = 'cisman'; '22' = 'ciswoman'; else=NA",
  as.factor = TRUE)
pulse39 %>%
  tabyl(cis)

##       cis     n    percent valid_percent
##    cisman 22652 0.39695780     0.4105557
##  ciswoman 32522 0.56992149     0.5894443
##      <NA>  1890 0.03312071            NA

pulse39 %>%
  group_by(trans) %>%
  summarise_at(vars(age), list(name = mean))

## # A tibble: 3 x 2
##   trans       name
##   <fct>      <dbl>
## 1 transman    33.6
## 2 transwoman  44.0
## 3 <NA>        53.9

pulse39 %>%
  group_by(trans) %>%
  summarise_at(vars(age), list(name = sd))

## # A tibble: 3 x 2
##   trans       name
##   <fct>      <dbl>
## 1 transman    16.6
## 2 transwoman  16.0
## 3 <NA>        15.9

pulse39 %>%
  group_by(cis) %>%
  summarise_at(vars(age), list(name = mean))

## # A tibble: 3 x 2
##   cis       name
##   <fct>    <dbl>
## 1 cisman    55.0
## 2 ciswoman  53.1
## 3 <NA>      52.9

pulse39 %>%
  group_by(cis) %>%
  summarise_at(vars(age), list(name = sd))

## # A tibble: 3 x 2
##   cis       name
##   <fct>    <dbl>
## 1 cisman    16.3
## 2 ciswoman  15.5
## 3 <NA>      18.2

t.test(age ~ trans, var.equal = FALSE, data = pulse39)

## 
##  Welch Two Sample t-test
## 
## data:  age by trans
## t = -3.7457, df = 129.47, p-value = 0.0002697
## alternative hypothesis: true difference in means between group transman and group transwoman is not equal to 0
## 95 percent confidence interval:
##  -15.971303  -4.930807
## sample estimates:
##   mean in group transman mean in group transwoman 
##                 33.58228                 44.03333

t.test(age ~ cis, var.equal = FALSE, data = pulse39)

## 
##  Welch Two Sample t-test
## 
## data:  age by cis
## t = 14.056, df = 47056, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group cisman and group ciswoman is not equal to 0
## 95 percent confidence interval:
##  1.672666 2.214754
## sample estimates:
##   mean in group cisman mean in group ciswoman 
##               55.04441               53.10070