# Import Data
data <- read_csv(here('Data', 'Raw Data', 'insurance.csv'), show_col_types = FALSE)Background
Read Chapter 2 (Types of Data Psychologists Collect) and answer the following:
- Describe the key differences between nominal, ordinal, interval, and ratio data. Provide one example of each from psychological research.
Nominal are categories without any order, an example is types of flavors: vanilla, chocolate, orange. Ordinal is categories with order, but unequal spacing, an example is” strongly agree to strongly disagree. Interval is numbers with spacinf but not zero, an example is SAT scores. Ratio is values with equal spacing and zero, example is time in milliseconds.
For each of the following variables, identify the appropriate level of measurement (nominal, ordinal, interval, or ratio) and explain your reasoning:
Scores on a depression inventory (0-63)
Response time in milliseconds
Likert scale ratings of agreement (1-7)
Diagnostic categories (e.g., ADHD, anxiety disorder, no diagnosis)
Age in years
Scores on a depression inventory: Interval. Response time in milliseconds: Ratio. Likert scale of agreement: Ordinal. Diagnostic categories: Nominal. Age in years: Ratio.
Introduction
Referring to Chapter 3 (Measurement Errors in Psychological Research):
- Explain the difference between random and systematic error, providing an example of each in the context of a memory experiment.
Random error is unpredictable changes in measurement, example is a participant gets distracted during a memory experiment. Systematic error is a bias in measurement is a broken timer may underestimate reaction time in a memory test.
- How might measurement error affect the validity of a study examining the relationship between stress and academic performance? What steps could researchers take to minimize these errors?
The measurement error affects the validity of a study of the relationship between stress and academic performance because your results will be messed up, therefore leading to it being inaccurate. To minimize these errors researchers could use reliable tools, increase sample size, and repeat experiments to validate results.
Analysis
Using the psych package, create a correlation plot for the simulated dataset created in Part 2. Include the following steps:
Select the numeric variables from the dataset (reaction_time, accuracy, anxiety_pre, anxiety_post, and anxiety_change if you created it).
Use the psych package’s
corPlot()function to create a correlation plot.Interpret the resulting plot by addressing:
Which variables appear to be strongly correlated?
Are there any surprising relationships?
How might these correlations inform further research in psychology?
1. Import Packages
2. Import Insurance Data
3. Data Wrangling
data %>%
set_value_labels(
sex = c(
'Male' = 'male',
'Female' = 'female'
),
smoker = c(
'Yes' = 'yes',
'No' = 'no'
),
region = c(
'North East' = 'northeast',
'North West' = 'northwest',
'South East' = 'southeast',
'South West' = 'southwest'
)
) %>%
mutate(
bmi = as.numeric(bmi) %>% round(., 2),
charges = as.numeric(charges) %>% round(., 2)
) %>%
set_variable_labels(
age = 'Patient Age',
sex = 'Patient Gender',
bmi = 'Body Mass Index',
children = 'Number of Children',
smoker = 'Smoking Status',
region = 'Region',
charges = 'Insurance charge'
) %>%
mutate_if(is.labelled, as_factor) %>%
reactable::reactable()4. MySQL
mysql <- dbConnect(
drv = RMySQL::MySQL(),
username = 'georgedibo',
password = 'bossman',
dbname = 'world',
host = 'localhost',
port = 3306
)SELECT
Name,
Region,
IndepYear,
Population,
LifeExpectancy,
GovernmentForm
FROM country c
WHERE Continent = 'Africa'
AND
LifeExpectancy >= 28;SELECT *
FROM country c;world <- world %>%
clean_names()world %>%
tabyl(continent) %>%
adorn_pct_formatting() %>%
arrange(desc(n)) %>%
ggplot(aes(y=reorder(continent, n), x = n, fill = -n)) +
geom_col() +
geom_text(aes(label = percent), hjust = -0.1) +
theme_minimal() +
labs(
x = NULL,
y = NULL,
fill = NULL,
title = 'Africa has the highest number of counties'
) +
theme(
plot.title = element_text(hjust = 0.5, size = 18, face = 'bold'),
axis.text.x = element_blank(),
axis.text.y = element_text(size = 14, color = 'black'),
legend.position = 'none',
panel.grid = element_blank(),
)world %>%
ggboxplot(
x = 'continent',
y = 'life_expectancy',
color ='continent',
add = 'median_q1q3',
palette = 'npg',
bxp.errorbar = TRUE,
) +
theme(
axis.text = element_text(size=14),
legend.position = 'right',
legend.text = element_text(size = 14)
)5. Inferential Stats
data %>%
ttest(age ~ sex, data = .)
Compare age across sex with levels female and male
Grouping Variable: sex
Response Variable: age
------ Describe ------
age for sex female: n.miss = 0, n = 662, mean = 39.503, sd = 14.054
age for sex male: n.miss = 0, n = 676, mean = 38.917, sd = 14.050
Mean Difference of age: 0.586
Weighted Average Standard Deviation: 14.052
------ Assumptions ------
Note: These hypothesis tests can perform poorly, and the
t-test is typically robust to violations of assumptions.
Use as heuristic guides instead of interpreting literally.
Null hypothesis, for each group, is a normal distribution of age.
Group female: Sample mean assumed normal because n > 30, so no test needed.
Group male: Sample mean assumed normal because n > 30, so no test needed.
Null hypothesis is equal variances of age, homogeneous.
Variance Ratio test: F = 197.521/197.406 = 1.001, df = 661;675, p-value = 0.994
Levene's test, Brown-Forsythe: t = -0.030, df = 1336, p-value = 0.976
------ Infer ------
--- Assume equal population variances of age for each sex
t-cutoff for 95% range of variation: tcut = 1.962
Standard Error of Mean Difference: SE = 0.768
Hypothesis Test of 0 Mean Diff: t-value = 0.762, df = 1336, p-value = 0.446
Margin of Error for 95% Confidence Level: 1.507
95% Confidence Interval for Mean Difference: -0.921 to 2.093
--- Do not assume equal population variances of age for each sex
t-cutoff: tcut = 1.962
Standard Error of Mean Difference: SE = 0.768
Hypothesis Test of 0 Mean Diff: t = 0.762, df = 1335.398, p-value = 0.446
Margin of Error for 95% Confidence Level: 1.507
95% Confidence Interval for Mean Difference: -0.921 to 2.093
------ Effect Size ------
--- Assume equal population variances of age for each sex
Standardized Mean Difference of age, Cohen's d: 0.042
------ Practical Importance ------
Minimum Mean Difference of practical importance: mmd
Minimum Standardized Mean Difference of practical importance: msmd
Neither value specified, so no analysis
------ Graphics Smoothing Parameter ------
Density bandwidth for sex female: 4.275
Density bandwidth for sex male: 4.232
data %>%
ttest(charges ~ sex, data = .)
Compare charges across sex with levels male and female
Grouping Variable: sex
Response Variable: charges
------ Describe ------
charges for sex male: n.miss = 0, n = 676, mean = 13956.7512, sd = 12971.0259
charges for sex female: n.miss = 0, n = 662, mean = 12569.5788, sd = 11128.7038
Mean Difference of charges: 1387.1723
Weighted Average Standard Deviation: 12094.6440
------ Assumptions ------
Note: These hypothesis tests can perform poorly, and the
t-test is typically robust to violations of assumptions.
Use as heuristic guides instead of interpreting literally.
Null hypothesis, for each group, is a normal distribution of charges.
Group male: Sample mean assumed normal because n > 30, so no test needed.
Group female: Sample mean assumed normal because n > 30, so no test needed.
Null hypothesis is equal variances of charges, homogeneous.
Variance Ratio test: F = 168247513.2882/123848048.2885 = 1.3585, df = 675;661, p-value = 0.000
Levene's test, Brown-Forsythe: t = 3.148, df = 1336, p-value = 0.002
------ Infer ------
--- Assume equal population variances of charges for each sex
t-cutoff for 95% range of variation: tcut = 1.962
Standard Error of Mean Difference: SE = 661.3309
Hypothesis Test of 0 Mean Diff: t-value = 2.098, df = 1336, p-value = 0.036
Margin of Error for 95% Confidence Level: 1297.3600
95% Confidence Interval for Mean Difference: 89.8123 to 2684.5324
--- Do not assume equal population variances of charges for each sex
t-cutoff: tcut = 1.962
Standard Error of Mean Difference: SE = 660.2791
Hypothesis Test of 0 Mean Diff: t = 2.101, df = 1313.360, p-value = 0.036
Margin of Error for 95% Confidence Level: 1295.3170
95% Confidence Interval for Mean Difference: 91.8553 to 2682.4893
------ Effect Size ------
--- Assume equal population variances of charges for each sex
Standardized Mean Difference of charges, Cohen's d: 0.1147
------ Practical Importance ------
Minimum Mean Difference of practical importance: mmd
Minimum Standardized Mean Difference of practical importance: msmd
Neither value specified, so no analysis
------ Graphics Smoothing Parameter ------
Density bandwidth for sex male: 1750.1981
Density bandwidth for sex female: 1207.9571
data %>%
shapiro_test(charges)# A tibble: 1 × 3
variable statistic p
<chr> <dbl> <dbl>
1 charges 0.815 1.15e-36
data %>%
mutate(
log_charge = log(charges) + 1
) %>%
shapiro_test(log_charge)# A tibble: 1 × 3
variable statistic p
<chr> <dbl> <dbl>
1 log_charge 0.983 2.29e-11
data %>%
levene_test(charges ~ sex, data = .)# A tibble: 1 × 4
df1 df2 statistic p
<int> <int> <dbl> <dbl>
1 1 1336 9.91 0.00168
data %>%
mutate(
log_charge = log(charges) + 1
) %>%
levene_test(log_charge ~ sex, data = .)# A tibble: 1 × 4
df1 df2 statistic p
<int> <int> <dbl> <dbl>
1 1 1336 15.0 0.000115
data %>%
wilcox.test(charges ~ sex, conf.int = TRUE, exact = TRUE, data = .)
Wilcoxon rank sum test with continuity correction
data: charges by sex
W = 221304, p-value = 0.7287
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
-929.0673 566.9229
sample estimates:
difference in location
-114.8947
data %>%
group_by(sex) %>%
summarize(
average = median(charges, na.rm = TRUE)
)# A tibble: 2 × 2
sex average
<chr> <dbl>
1 female 9413.
2 male 9370.
data %>%
wilcox_effsize(charges ~ sex, data = .)# A tibble: 1 × 7
.y. group1 group2 effsize n1 n2 magnitude
* <chr> <chr> <chr> <dbl> <int> <int> <ord>
1 charges female male 0.00949 662 676 small
data %>%
kruskal.test(charges ~ sex, data = .)
Kruskal-Wallis rank sum test
data: charges by sex
Kruskal-Wallis chi-squared = 0.1204, df = 1, p-value = 0.7286
data %>%
kruskal_effsize(charges ~ sex, data = .)# A tibble: 1 × 5
.y. n effsize method magnitude
* <chr> <int> <dbl> <chr> <ord>
1 charges 1338 -0.000658 eta2[H] small
dunn.test::dunn.test(data$charges, data$region, method = 'holm') Kruskal-Wallis rank sum test
data: x and group
Kruskal-Wallis chi-squared = 4.7342, df = 3, p-value = 0.19
Comparison of x by group
(Holm)
Col Mean-|
Row Mean | northeas northwes southeas
---------+---------------------------------
northwes | 1.524949
| 0.3182
|
southeas | 0.697727 -0.870427
| 0.4853 0.5761
|
southwes | 1.990731 0.466141 1.349579
| 0.1395 0.3206 0.3543
alpha = 0.05
Reject Ho if p <= alpha/2
data %>%
gtsummary::tbl_cross(
row = region,
col = sex,
percent = 'row'
) %>%
gtsummary::add_p()
sex
|
Total | p-value1 | ||
|---|---|---|---|---|
| female | male | |||
| region | >0.9 | |||
| northeast | 161 (50%) | 163 (50%) | 324 (100%) | |
| northwest | 164 (50%) | 161 (50%) | 325 (100%) | |
| southeast | 175 (48%) | 189 (52%) | 364 (100%) | |
| southwest | 162 (50%) | 163 (50%) | 325 (100%) | |
| Total | 662 (49%) | 676 (51%) | 1,338 (100%) | |
| 1 Pearson’s Chi-squared test | ||||
cramer_v(table(data$sex, data$region))[1] 0.0180337
chisq.test(table(data$sex, data$region))
Pearson's Chi-squared test
data: table(data$sex, data$region)
X-squared = 0.43514, df = 3, p-value = 0.9329