Basic Statistics

Load Libraries

# if you haven't used a given package before, you'll need to download it first
# after download is finished, insert a "#" before the install function so that the file will Knit later
# then run the library function calling that package

#install.packages("psych")
#install.packages("expss")

library(psych) # for the describe() command
library(expss) # for the cross_cases() command
## Loading required package: maditr
## 
## To aggregate data: take(mtcars, mean_mpg = mean(mpg), by = am)

Import & Examine Data

# Import the data file (here, "projectdata.csv" from the Data folder)

d2 <- read.csv("Data/projectdata.csv")

str(d2)
## 'data.frame':    697 obs. of  7 variables:
##  $ X          : int  520 2814 3146 3295 717 6056 4753 5365 2044 1965 ...
##  $ mhealth    : chr  "none or NA" "none or NA" "none or NA" "none or NA" ...
##  $ sleep_hours: chr  "2 5-6 hours" "3 7-8 hours" "2 5-6 hours" "4 8-10 hours" ...
##  $ big5_neu   : num  5.33 2.67 1 3.67 4.33 ...
##  $ big5_con   : num  3 4 6 4 3.33 ...
##  $ pswq       : num  2.71 1.43 1.86 1.79 2.36 ...
##  $ covid_pos  : int  0 0 0 0 0 0 0 0 0 0 ...
# Note: for the HW, you will import "projectdata.csv" that you created and exported in the Data Prep Lab

Univariate Plots: Histograms & Tables

Tables are used to visualize individual categorical variables. Histograms are used to visualize individual continuous variables.

# use tables to visualize categorical data (2 variables)
table(d2$mhealth)
## 
##              anxiety disorder                       bipolar 
##                            78                             3 
##                    depression              eating disorders 
##                            12                            20 
##                    none or NA obsessive compulsive disorder 
##                           539                            15 
##                         other                          ptsd 
##                            17                            13
table(d2$sleep_hours)
## 
##  1 < 5 hours  2 5-6 hours  3 7-8 hours 4 8-10 hours 5 > 10 hours 
##           56          183          234          190           34
# use histograms to visualize continuous data (4 variables)
hist(d2$big5_neu)

hist(d2$big5_con)

hist(d2$pswq)

hist(d2$covid_pos)
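
Raw counts can be hard to compare when one category dominates, as "none or NA" does in the mhealth table. A minimal sketch of converting a frequency table to proportions with prop.table(), using a small made-up vector (toy_sleep is hypothetical and not part of the lab data):

```r
# Hypothetical toy data; not part of projectdata.csv
toy_sleep <- c("5-6 hours", "7-8 hours", "7-8 hours", "8-10 hours")

counts <- table(toy_sleep)    # frequency of each category
props  <- prop.table(counts)  # same table expressed as proportions (sums to 1)

print(counts)
print(round(props, 2))
```

The same two calls should work on a column of your own data frame in place of toy_sleep.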

Univariate Normality for Continuous Variables

# subset of the variables of interest (note: describe() below is run on the full d2)
d2_mini <- data.frame(d2$big5_neu, d2$mhealth, d2$sleep_hours, d2$covid_pos, d2$big5_con, d2$pswq)

describe(d2)
##              vars   n    mean      sd  median trimmed     mad min  max range
## X               1 697 5171.35 2597.37 5763.00 5317.06 3049.71  20 8858  8838
## mhealth*        2 697    4.60    1.43    5.00    4.84    0.00   1    8     7
## sleep_hours*    3 697    2.95    1.02    3.00    2.97    1.48   1    5     4
## big5_neu        4 697    4.63    1.45    5.00    4.73    1.48   1    7     6
## big5_con        5 697    4.51    1.14    4.33    4.52    0.99   1    7     6
## pswq            6 697    2.66    0.76    2.71    2.67    0.95   1    4     3
## covid_pos       7 697    2.48    3.61    0.00    1.80    0.00   0   15    15
##               skew kurtosis    se
## X            -0.41    -1.13 98.38
## mhealth*     -1.43     2.32  0.05
## sleep_hours* -0.07    -0.67  0.04
## big5_neu     -0.55    -0.38  0.06
## big5_con     -0.08    -0.22  0.04
## pswq         -0.15    -0.98  0.03
## covid_pos     1.30     0.60  0.14

Write-up of Normality

We examined the skew and kurtosis of our four continuous variables (big5_neu, big5_con, pswq, and covid_pos). All values fell within the accepted range of -2 to +2.
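
One way to make the -2/+2 check explicit is to compute it rather than read it off the describe() output. The sketch below uses base R and the simple moment-based definitions of skew and excess kurtosis. describe() in psych may apply a different small-sample adjustment, so its values can differ slightly. The helper names (skew_simple, kurt_simple, within_range) and the toy data frame are assumptions for illustration only.

```r
# Moment-based skew: third central moment divided by variance^1.5
skew_simple <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Moment-based excess kurtosis: 0 for a normal distribution
kurt_simple <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

# TRUE when both statistics fall inside [-cutoff, +cutoff]
within_range <- function(x, cutoff = 2) {
  abs(skew_simple(x)) <= cutoff && abs(kurt_simple(x)) <= cutoff
}

# Hypothetical toy data: 'a' is symmetric, 'b' has one extreme value
toy <- data.frame(a = 1:10,
                  b = c(rep(0, 9), 100))

res <- sapply(toy, within_range)
print(res)  # a is TRUE, b is FALSE
```

With real data, the same sapply() call could be pointed at only the continuous columns, since skew and kurtosis are not meaningful for categorical variables.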

Bivariate Plots

Crosstabs

Crosstabs are used to visualize combinations of two categorical variables.

cross_cases(d2, mhealth, sleep_hours)
 sleep_hours 
 1 < 5 hours   2 5-6 hours   3 7-8 hours   4 8-10 hours   5 > 10 hours 
 mhealth 
   anxiety disorder  10 26 25 12 5
   bipolar  2 1
   depression  1 4 4 2 1
   eating disorders  2 6 9 3
   none or NA  31 134 179 169 26
   obsessive compulsive disorder  2 5 6 2
   other  2 4 10 1
   ptsd  6 4 1 2
   #Total cases  56 183 234 190 34
## Some students may have trouble getting this function to work. If that happens, try these two options:
## Option 1: Install the "maditr" package and load it with library(maditr).
## Option 2: If Option 1 doesn't work, use xtabs() instead. Fill in the code below and remove the "#" to run it, then comment out the cross_cases() line.

# xtabs(~ + , data=)
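
For reference, xtabs() takes a one-sided formula with the two categorical variables joined by +, followed by the data frame. A minimal sketch on a made-up data frame (toy, group, and level are hypothetical names; substitute your own data frame and variables):

```r
# Hypothetical toy data; replace with your own data frame and variables
toy <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  level = c("low", "high", "low", "low", "high")
)

# Rows come from the first variable, columns from the second
tab <- xtabs(~ group + level, data = toy)
print(tab)
```

Unlike cross_cases(), xtabs() is part of base R's stats package, so no extra library call should be needed.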

Scatterplots

Scatterplots are used to visualize combinations of two continuous variables.

plot(d2$big5_neu, d2$pswq, 
     main="Scatterplot of Neuroticism and Worry",
     xlab = "Neuroticism",
     ylab = "Worry")

plot(d2$big5_con, d2$pswq,
     main="Scatterplot of Conscientiousness and Worry",
     xlab = "Conscientiousness",
     ylab = "Worry")

Boxplots

Boxplots are used to visualize combinations of one categorical and one continuous variable.

# ORDER MATTERS HERE: 'continuous variable' ~ 'categorical variable' 

boxplot(data=d2, pswq~mhealth,
        main="Boxplot of Penn State Worry Questionnaire by Mental Health Disorders",
        xlab = "Mental Health Disorders",
        ylab = "Penn State Worry Questionnaire")

boxplot(data=d2, big5_neu~mhealth,
        main="Boxplot of Neuroticism by Mental Health Disorders",
        xlab = "Mental Health Disorders",
        ylab = "Neuroticism")
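
To see what the formula order does, the sketch below puts the continuous variable on the left of ~ and the grouping variable on the right, then inspects what boxplot() computed. The data frame toy and its columns are hypothetical. plot = FALSE is used here only so the summary can be examined without drawing a figure.

```r
# Hypothetical toy data: a continuous score and a two-level grouping variable
toy <- data.frame(
  score = c(1, 2, 3, 4, 5, 2, 4, 6, 8, 10),
  group = rep(c("a", "b"), each = 5)
)

# continuous ~ categorical: one box is computed per level of 'group'
b <- boxplot(score ~ group, data = toy, plot = FALSE)

print(b$names)       # the group labels, one per box
print(b$stats[3, ])  # row 3 of $stats holds the median of each box
```

Dropping plot = FALSE draws the boxes as usual, with one box per group.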

That’s it!!