Background

Read Chapter 2 (Types of Data Psychologists Collect) and answer the following:

Describe the key differences between nominal, ordinal, interval, and ratio data. Provide one example of each from psychological research.

Nominal are categories without any order, an example is types of flavors: vanilla, chocolate, orange. Ordinal is categories with order, but unequal spacing, an example is” strongly agree to strongly disagree. Interval is numbers with spacinf but not zero, an example is SAT scores. Ratio is values with equal spacing and zero, example is time in milliseconds.

For each of the following variables, identify the appropriate level of measurement (nominal, ordinal, interval, or ratio) and explain your reasoning:
- Scores on a depression inventory (0-63)
- Response time in milliseconds
- Likert scale ratings of agreement (1-7)
- Diagnostic categories (e.g., ADHD, anxiety disorder, no diagnosis)
- Age in years

Scores on a depression inventory: Interval. Response time in milliseconds: Ratio. Likert scale of agreement: Ordinal. Diagnostic categories: Nominal. Age in years: Ratio.

Introduction

Referring to Chapter 3 (Measurement Errors in Psychological Research):

Explain the difference between random and systematic error, providing an example of each in the context of a memory experiment.

Random error is unpredictable changes in measurement, example is a participant gets distracted during a memory experiment. Systematic error is a bias in measurement is a broken timer may underestimate reaction time in a memory test.

How might measurement error affect the validity of a study examining the relationship between stress and academic performance? What steps could researchers take to minimize these errors?

The measurement error affects the validity of a study of the relationship between stress and academic performance because your results will be messed up, therefore leading to it being inaccurate. To minimize these errors researchers could use reliable tools, increase sample size, and repeat experiments to validate results.

Analysis

Using the psych package, create a correlation plot for the simulated dataset created in Part 2. Include the following steps:

Select the numeric variables from the dataset (reaction_time, accuracy, anxiety_pre, anxiety_post, and anxiety_change if you created it).
Use the psych package’s corPlot() function to create a correlation plot.
Interpret the resulting plot by addressing:
- Which variables appear to be strongly correlated?
- Are there any surprising relationships?
- How might these correlations inform further research in psychology?

1. Import Packages

2. Import Insurance Data

# Import Data
data <- read_csv(here('Data', 'Raw Data', 'insurance.csv'), show_col_types = FALSE)

3. Data Wrangling

data %>% 
  set_value_labels(
    sex = c(
      'Male' = 'male',
      'Female' = 'female'
    ),
    smoker = c(
      'Yes' = 'yes',
      'No' = 'no'
    ),
    region = c(
      'North East' = 'northeast',
      'North West' = 'northwest',
      'South East' = 'southeast',
      'South West' = 'southwest'
    )
  ) %>% 
  mutate(
    bmi = as.numeric(bmi) %>% round(., 2),
    charges = as.numeric(charges) %>% round(., 2)
  ) %>% 
  set_variable_labels(
    age = 'Patient Age',
    sex = 'Patient Gender',
    bmi = 'Body Mass Index',
    children = 'Number of Children',
    smoker = 'Smoking Status',
    region = 'Region',
    charges = 'Insurance charge'
  ) %>% 
  mutate_if(is.labelled, as_factor) %>%
  reactable::reactable()

4. MySQL

mysql <- dbConnect(
  drv = RMySQL::MySQL(),
  username = 'georgedibo',
  password = 'bossman',
  dbname = 'world',
  host = 'localhost',
  port = 3306
)

SELECT
  Name, 
  Region, 
  IndepYear,
  Population,
  LifeExpectancy,
  GovernmentForm
FROM country c
WHERE Continent = 'Africa'
        AND
      LifeExpectancy >= 28;

SELECT *
FROM country c;

world <- world %>% 
  clean_names()

world %>% 
  tabyl(continent) %>% 
  adorn_pct_formatting() %>% 
  arrange(desc(n)) %>% 
  ggplot(aes(y=reorder(continent, n), x = n, fill = -n)) +
  geom_col() +
  geom_text(aes(label = percent), hjust = -0.1) +
  theme_minimal() +
  labs(
    x = NULL,
    y = NULL,
    fill = NULL,
    title = 'Africa has the highest number of counties'
  ) +
  theme(
    plot.title = element_text(hjust = 0.5, size = 18, face = 'bold'),
    axis.text.x = element_blank(),
    axis.text.y = element_text(size = 14, color = 'black'),
    legend.position = 'none',
    panel.grid = element_blank(),
  )

world %>% 
  ggboxplot(
    x = 'continent', 
    y = 'life_expectancy',
    color ='continent',
    add = 'median_q1q3',
    palette = 'npg',
    bxp.errorbar = TRUE,
  ) +
  theme(
    axis.text = element_text(size=14),
    legend.position = 'right',
    legend.text = element_text(size = 14)
  )

5. Inferential Stats

data %>% 
  ttest(age ~ sex, data = .)


Compare age across sex with levels female and male 
Grouping Variable:  sex
Response Variable:  age


------ Describe ------

age for sex female:  n.miss = 0,  n = 662,  mean = 39.503,  sd = 14.054
age for sex male:  n.miss = 0,  n = 676,  mean = 38.917,  sd = 14.050

Mean Difference of age:  0.586

Weighted Average Standard Deviation:   14.052 


------ Assumptions ------

Note: These hypothesis tests can perform poorly, and the 
      t-test is typically robust to violations of assumptions. 
      Use as heuristic guides instead of interpreting literally. 

Null hypothesis, for each group, is a normal distribution of age.
Group female: Sample mean assumed normal because n > 30, so no test needed.
Group male: Sample mean assumed normal because n > 30, so no test needed.

Null hypothesis is equal variances of age, homogeneous.
Variance Ratio test:  F = 197.521/197.406 = 1.001,  df = 661;675,  p-value = 0.994
Levene's test, Brown-Forsythe:  t = -0.030,  df = 1336,  p-value = 0.976


------ Infer ------

--- Assume equal population variances of age for each sex 

t-cutoff for 95% range of variation: tcut =  1.962 
Standard Error of Mean Difference: SE =  0.768 

Hypothesis Test of 0 Mean Diff:  t-value = 0.762,  df = 1336,  p-value = 0.446

Margin of Error for 95% Confidence Level:  1.507
95% Confidence Interval for Mean Difference:  -0.921 to 2.093


--- Do not assume equal population variances of age for each sex 

t-cutoff: tcut =  1.962 
Standard Error of Mean Difference: SE =  0.768 

Hypothesis Test of 0 Mean Diff:  t = 0.762,  df = 1335.398, p-value = 0.446

Margin of Error for 95% Confidence Level:  1.507
95% Confidence Interval for Mean Difference:  -0.921 to 2.093


------ Effect Size ------

--- Assume equal population variances of age for each sex 

Standardized Mean Difference of age, Cohen's d:  0.042


------ Practical Importance ------

Minimum Mean Difference of practical importance: mmd
Minimum Standardized Mean Difference of practical importance: msmd
Neither value specified, so no analysis


------ Graphics Smoothing Parameter ------

Density bandwidth for sex female: 4.275
Density bandwidth for sex male: 4.232

data %>% 
  ttest(charges ~ sex, data = .)


Compare charges across sex with levels male and female 
Grouping Variable:  sex
Response Variable:  charges


------ Describe ------

charges for sex male:  n.miss = 0,  n = 676,  mean = 13956.7512,  sd = 12971.0259
charges for sex female:  n.miss = 0,  n = 662,  mean = 12569.5788,  sd = 11128.7038

Mean Difference of charges:  1387.1723

Weighted Average Standard Deviation:   12094.6440 


------ Assumptions ------

Note: These hypothesis tests can perform poorly, and the 
      t-test is typically robust to violations of assumptions. 
      Use as heuristic guides instead of interpreting literally. 

Null hypothesis, for each group, is a normal distribution of charges.
Group male: Sample mean assumed normal because n > 30, so no test needed.
Group female: Sample mean assumed normal because n > 30, so no test needed.

Null hypothesis is equal variances of charges, homogeneous.
Variance Ratio test:  F = 168247513.2882/123848048.2885 = 1.3585,  df = 675;661,  p-value = 0.000
Levene's test, Brown-Forsythe:  t = 3.148,  df = 1336,  p-value = 0.002


------ Infer ------

--- Assume equal population variances of charges for each sex 

t-cutoff for 95% range of variation: tcut =  1.962 
Standard Error of Mean Difference: SE =  661.3309 

Hypothesis Test of 0 Mean Diff:  t-value = 2.098,  df = 1336,  p-value = 0.036

Margin of Error for 95% Confidence Level:  1297.3600
95% Confidence Interval for Mean Difference:  89.8123 to 2684.5324


--- Do not assume equal population variances of charges for each sex 

t-cutoff: tcut =  1.962 
Standard Error of Mean Difference: SE =  660.2791 

Hypothesis Test of 0 Mean Diff:  t = 2.101,  df = 1313.360, p-value = 0.036

Margin of Error for 95% Confidence Level:  1295.3170
95% Confidence Interval for Mean Difference:  91.8553 to 2682.4893


------ Effect Size ------

--- Assume equal population variances of charges for each sex 

Standardized Mean Difference of charges, Cohen's d:  0.1147


------ Practical Importance ------

Minimum Mean Difference of practical importance: mmd
Minimum Standardized Mean Difference of practical importance: msmd
Neither value specified, so no analysis


------ Graphics Smoothing Parameter ------

Density bandwidth for sex male: 1750.1981
Density bandwidth for sex female: 1207.9571

data %>% 
  shapiro_test(charges)

# A tibble: 1 × 3
  variable statistic        p
  <chr>        <dbl>    <dbl>
1 charges      0.815 1.15e-36

data %>% 
  mutate(
    log_charge = log(charges) + 1
  ) %>%
  shapiro_test(log_charge)

# A tibble: 1 × 3
  variable   statistic        p
  <chr>          <dbl>    <dbl>
1 log_charge     0.983 2.29e-11

data %>% 
  levene_test(charges ~ sex, data = .)

# A tibble: 1 × 4
    df1   df2 statistic       p
  <int> <int>     <dbl>   <dbl>
1     1  1336      9.91 0.00168

data %>% 
  mutate(
    log_charge = log(charges) + 1
  ) %>% 
  levene_test(log_charge ~ sex, data = .)

# A tibble: 1 × 4
    df1   df2 statistic        p
  <int> <int>     <dbl>    <dbl>
1     1  1336      15.0 0.000115

data %>% 
  wilcox.test(charges ~ sex, conf.int = TRUE, exact = TRUE, data = .)


    Wilcoxon rank sum test with continuity correction

data:  charges by sex
W = 221304, p-value = 0.7287
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 -929.0673  566.9229
sample estimates:
difference in location 
             -114.8947

data %>% 
  group_by(sex) %>% 
  summarize(
    average = median(charges, na.rm = TRUE)
  )

# A tibble: 2 × 2
  sex    average
  <chr>    <dbl>
1 female   9413.
2 male     9370.

data %>% 
  wilcox_effsize(charges ~ sex, data = .)

# A tibble: 1 × 7
  .y.     group1 group2 effsize    n1    n2 magnitude
* <chr>   <chr>  <chr>    <dbl> <int> <int> <ord>    
1 charges female male   0.00949   662   676 small

data %>% 
  kruskal.test(charges ~ sex, data = .)


    Kruskal-Wallis rank sum test

data:  charges by sex
Kruskal-Wallis chi-squared = 0.1204, df = 1, p-value = 0.7286

data %>% 
  kruskal_effsize(charges ~ sex, data = .)

# A tibble: 1 × 5
  .y.         n   effsize method  magnitude
* <chr>   <int>     <dbl> <chr>   <ord>    
1 charges  1338 -0.000658 eta2[H] small

dunn.test::dunn.test(data$charges, data$region, method = 'holm')

  Kruskal-Wallis rank sum test

data: x and group
Kruskal-Wallis chi-squared = 4.7342, df = 3, p-value = 0.19

                           Comparison of x by group                            
                                    (Holm)                                     
Col Mean-|
Row Mean |   northeas   northwes   southeas
---------+---------------------------------
northwes |   1.524949
         |     0.3182
         |
southeas |   0.697727  -0.870427
         |     0.4853     0.5761
         |
southwes |   1.990731   0.466141   1.349579
         |     0.1395     0.3206     0.3543

alpha = 0.05
Reject Ho if p <= alpha/2

data %>% 
  gtsummary::tbl_cross(
    row = region,
    col = sex,
    percent = 'row'
  ) %>% 
  gtsummary::add_p()

	sex		Total	p-value¹
	female	male	Total	p-value¹
region				>0.9
northeast	161 (50%)	163 (50%)	324 (100%)
northwest	164 (50%)	161 (50%)	325 (100%)
southeast	175 (48%)	189 (52%)	364 (100%)
southwest	162 (50%)	163 (50%)	325 (100%)
Total	662 (49%)	676 (51%)	1,338 (100%)
¹ Pearson’s Chi-squared test

cramer_v(table(data$sex, data$region))

[1] 0.0180337

chisq.test(table(data$sex, data$region))


    Pearson's Chi-squared test

data:  table(data$sex, data$region)
X-squared = 0.43514, df = 3, p-value = 0.9329