Study experiments and Data analysis

L5: Descriptive analysis

Read dataset into R

income = read.csv("C:\\VN trips\\VN trip 2 (Sept 2022)\\Can Tho 2022\\Income and PhDs.csv", header = TRUE)
head(income)

##   id TimeSincePhD NPubs Sex Citations Salary
## 1  1            3    18   1        50  51876
## 2  2            6     3   1        26  54511
## 3  3            3     2   1        50  53425
## 4  4            8    17   0        34  61683
## 5  5            9    11   1        41  52926
## 6  6            6     6   0        37  47034
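A quick structural check after loading can catch mistyped columns or missing values before any summary table is built. The sketch below rebuilds only the six rows printed by head() above into a data frame called demo (a hypothetical name, used so the block runs without the CSV file); on the real data the same calls would be applied to income.

```r
# Six rows copied from the head(income) output above; 'demo' is a
# stand-in name so the example runs without the original CSV file.
demo <- data.frame(
  id           = 1:6,
  TimeSincePhD = c(3, 6, 3, 8, 9, 6),
  NPubs        = c(18, 3, 2, 17, 11, 6),
  Sex          = c(1, 1, 1, 0, 1, 0),
  Citations    = c(50, 26, 50, 34, 41, 37),
  Salary       = c(51876, 54511, 53425, 61683, 52926, 47034)
)

str(demo)                          # column types and dimensions
n_missing <- colSums(is.na(demo))  # number of missing values per column
n_missing
summary(demo$Salary)               # range check for implausible values
```

A range check of this kind is also a convenient place to spot unusual values, such as a minimum far below the rest of the distribution.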

Characteristics by sex

library(table1)

## 
## Attaching package: 'table1'

## The following objects are masked from 'package:base':
## 
##     units, units<-

table1(~ TimeSincePhD + NPubs + Citations + Salary | as.factor(Sex), data = income)

                     0 (N=35)               1 (N=27)               Overall (N=62)
TimeSincePhD
  Mean (SD)          7.57 (4.99)            5.78 (2.91)            6.79 (4.28)
  Median [Min, Max]  6.00 [1.00, 21.0]      5.00 [2.00, 16.0]      6.00 [1.00, 21.0]
NPubs
  Mean (SD)          20.1 (15.4)            15.7 (11.8)            18.2 (14.0)
  Median [Min, Max]  17.0 [1.00, 69.0]      12.0 [2.00, 50.0]      13.0 [1.00, 69.0]
Citations
  Mean (SD)          42.5 (18.9)            37.3 (14.5)            40.2 (17.2)
  Median [Min, Max]  40.0 [1.00, 90.0]      34.0 [14.0, 83.0]      35.0 [1.00, 90.0]
Salary
  Mean (SD)          56500 (10800)          50600 (11700)          53900 (11500)
  Median [Min, Max]  55600 [37900, 83500]   52500 [6330, 74300]    53500 [6330, 83500]

library("tidyverse")

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

income_n = income %>%
  mutate(gender = case_when(Sex == 0 ~ "Male",
                            Sex == 1 ~ "Female"))

table1(~ TimeSincePhD + NPubs + Citations + Salary | gender, data = income_n)

                     Female (N=27)          Male (N=35)            Overall (N=62)
TimeSincePhD
  Mean (SD)          5.78 (2.91)            7.57 (4.99)            6.79 (4.28)
  Median [Min, Max]  5.00 [2.00, 16.0]      6.00 [1.00, 21.0]      6.00 [1.00, 21.0]
NPubs
  Mean (SD)          15.7 (11.8)            20.1 (15.4)            18.2 (14.0)
  Median [Min, Max]  12.0 [2.00, 50.0]      17.0 [1.00, 69.0]      13.0 [1.00, 69.0]
Citations
  Mean (SD)          37.3 (14.5)            42.5 (18.9)            40.2 (17.2)
  Median [Min, Max]  34.0 [14.0, 83.0]      40.0 [1.00, 90.0]      35.0 [1.00, 90.0]
Salary
  Mean (SD)          50600 (11700)          56500 (10800)          53900 (11500)
  Median [Min, Max]  52500 [6330, 74300]    55600 [37900, 83500]   53500 [6330, 83500]
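The recoded gender column is a character vector, and its groups appear in alphabetical order in the table above (Female before Male). If the original 0, 1 column order is preferred, one option is a factor with explicit levels and labels. The sketch below assumes the same coding as the recoding step (0 = Male, 1 = Female); sex_codes is a hypothetical vector holding the Sex values from the six rows printed by head().

```r
# Sex values from the six rows shown by head(income); 'sex_codes' is a
# stand-in for income$Sex so the example runs without the CSV file.
sex_codes <- c(1, 1, 1, 0, 1, 0)

# Explicit levels fix the display order: Male (0) first, Female (1) second.
gender <- factor(sex_codes, levels = c(0, 1), labels = c("Male", "Female"))

levels(gender)
counts <- table(gender)
counts
```

On the full data the same factor() call could be used inside mutate() in place of case_when().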

Check whether the differences in baseline characteristics are statistically significant

library(compareGroups)
createTable(compareGroups(gender ~ TimeSincePhD + NPubs + Citations + Salary, data = income_n))

## 
## --------Summary descriptives table by 'gender'---------
## 
## __________________________________________________ 
##                 Female         Male      p.overall 
##                  N=27          N=35                
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## TimeSincePhD  5.78 (2.91)   7.57 (4.99)    0.082   
## NPubs         15.7 (11.8)   20.1 (15.4)    0.203   
## Citations     37.3 (14.5)   42.5 (18.9)    0.232   
## Salary       50613 (11666) 56510 (10777)   0.046   
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
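For a grouping variable with two levels, a p-value of this kind can be cross-checked with a two-sample t-test in base R. The sketch below is an assumption about what a comparable test looks like, not a statement of how compareGroups computes p.overall; the package documentation is the place to confirm that. It uses only the six rows printed by head(), so its p-value will not match the table above.

```r
# Salary and gender for the six rows shown by head(income); 'demo' is a
# stand-in name so the example runs without the original CSV file.
demo <- data.frame(
  gender = c("Female", "Female", "Female", "Male", "Female", "Male"),
  Salary = c(51876, 54511, 53425, 61683, 52926, 47034)
)

# Welch two-sample t-test: t.test() does not assume equal variances by default.
res <- t.test(Salary ~ gender, data = demo)

res$estimate  # group means, Female then Male
res$p.value   # two-sided p-value
```

On the real data the call would be t.test(Salary ~ gender, data = income_n).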

Overall graphical assessment of the variables

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggpairs(data = income_n, mapping = aes(color = gender))

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
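The repeated cor() warnings above most likely arise because income_n still contains the numeric Sex column: once the panels are split by gender, Sex is constant within each group, so its standard deviation is zero and the correlation is undefined. A minimal base R illustration, using the Sex and Salary values of the Female rows printed by head() (the vector names are hypothetical):

```r
# Within the Female group every Sex value is 1, so there is no variation.
sex_female    <- c(1, 1, 1, 1)
salary_female <- c(51876, 54511, 53425, 52926)

sd_sex <- sd(sex_female)  # 0: a constant vector has no spread
sd_sex

# cor() returns NA and warns that the standard deviation is zero.
r <- suppressWarnings(cor(sex_female, salary_female))
r
```

Restricting the plot to the analysis variables, as in the next call, avoids these panels altogether.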

vars = income_n[, c("TimeSincePhD", "NPubs", "Citations", "Salary", "gender")]
ggpairs(data = vars, mapping = aes(color = gender))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Data analysis course UTS

Thach Tran

2022-10-22
