This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown, see http://rmarkdown.rstudio.com.
When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:
library(readr)
data <- read_csv("~/Desktop/Qian Thesis 13 features.csv")
## Rows: 1419 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): story
## dbl (13): avg_sentence_length, coherence, consistency, distinct_2, lm_ppl, t...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
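The message above suggests declaring column types explicitly. The following is a minimal sketch of that idea, assuming a small temporary CSV with made-up rows (a hypothetical stand-in for the thesis file, which is not reproduced here) that has the same layout: one character column named `story` and numeric features.

```r
library(readr)

# Hypothetical toy file standing in for the real CSV; only the column
# layout (one character column, the rest numeric) mirrors the source.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "story,avg_sentence_length,coherence",
  "\"Pat the cat sat on a mat.\",8.62,0.625",
  "\"Pam has a pet cat.\",6.93,0.1429"
), tmp)

# Declaring the types up front silences the column-specification message
# and surfaces any non-numeric value in a feature column as a parsing
# warning instead of a silent type change.
toy <- read_csv(
  tmp,
  col_types = cols(story = col_character(), .default = col_double())
)
str(toy)
```

Passing `show_col_types = FALSE` instead would only hide the message and leave type guessing in place.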
str(data)
## spc_tbl_ [1,419 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ story : chr [1:1419] "Pam had a jam in a pan. The jam was pink. Tim had a cat. The cat had a bat. Tim and Pam sat. They sat with Matt"| __truncated__ "Pat the cat sat on a mat. A bat flew and sat on Pat. Pat the cat did not like the bat. Pat swat at the bat. The"| __truncated__ "Pam has a pet cat. The cat's name is Pat. Pat is a fat cat. Pam and Pat play in the sun. Pat can nap on a mat. "| __truncated__ "Pam the cat has a hat. Pam's hat is tan and fits just right. She naps with her hat on a mat. Matt the rat is Pa"| __truncated__ ...
## $ avg_sentence_length : num [1:1419] 6.33 8.62 6.93 8.23 8.94 ...
## $ coherence : num [1:1419] 0.0909 0.625 0.1429 0.1538 0 ...
## $ consistency : num [1:1419] 0.371 0.417 0.371 0.421 0.216 ...
## $ distinct_2 : num [1:1419] 0.84 0.809 0.864 0.887 0.869 ...
## $ lm_ppl : num [1:1419] 25.7 25.1 20.8 38.7 28.5 ...
## $ text_novelty : num [1:1419] 0.0267 0.1176 0.1068 0.2453 0.2188 ...
## $ flesch : num [1:1419] 135 122 129 123 130 ...
## $ spache : num [1:1419] 1.53 1.91 1.72 1.92 2.05 ...
## $ toxicity : num [1:1419] 0.0561 0.0122 0.0272 0.0339 0.014 ...
## $ syntactic_complexity: num [1:1419] 2.05 2.44 2.25 2.04 3.43 ...
## $ self_bleu : num [1:1419] 0.041182 0.062214 0.000638 0.0005 0.000951 ...
## $ factual_hint : num [1:1419] 0.615 0.658 0.701 0.673 0.797 ...
## $ semantic_drift : num [1:1419] 0.66 0.554 0.68 0.622 0.733 ...
## - attr(*, "spec")=
## .. cols(
## .. story = col_character(),
## .. avg_sentence_length = col_double(),
## .. coherence = col_double(),
## .. consistency = col_double(),
## .. distinct_2 = col_double(),
## .. lm_ppl = col_double(),
## .. text_novelty = col_double(),
## .. flesch = col_double(),
## .. spache = col_double(),
## .. toxicity = col_double(),
## .. syntactic_complexity = col_double(),
## .. self_bleu = col_double(),
## .. factual_hint = col_double(),
## .. semantic_drift = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
scaled_data <- data %>%
  select(-story) %>%
  scale()
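To make the effect of `scale()` concrete, here is a small base R sketch on a toy data frame. The values and the name `toy` are hypothetical and are not taken from the thesis data. It shows that the result is a numeric matrix whose columns have mean 0 and standard deviation 1.

```r
# Toy data frame (hypothetical values, not the thesis data).
toy <- data.frame(
  story = c("a", "b", "c", "d"),
  avg_sentence_length = c(6.3, 8.6, 6.9, 8.2),
  lm_ppl = c(25.7, 25.1, 20.8, 38.7)
)

# Drop the text column, then z-score each remaining column.
scaled_toy <- scale(toy[, setdiff(names(toy), "story")])

# scale() returns a numeric matrix, not a data frame.
is.matrix(scaled_toy)

# Each column now has mean 0 and standard deviation 1 (up to rounding).
round(colMeans(scaled_toy), 10)
apply(scaled_toy, 2, sd)

# The original means and SDs are kept as attributes, so the
# transformation can be undone if needed.
attr(scaled_toy, "scaled:center")
attr(scaled_toy, "scaled:scale")
```

Because the factor analysis in this document is based on the correlation matrix, standardizing beforehand should not change the solution; it mainly puts all features on a comparable scale for inspection.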
library(psych)
## Warning: package 'psych' was built under R version 4.3.3
fa.parallel(scaled_data, fa = "fa", n.iter = 100, show.legend = FALSE)
## Parallel analysis suggests that the number of factors = 5 and the number of components = NA
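The idea behind parallel analysis can be sketched in base R: compare the eigenvalues of the observed correlation matrix with those obtained from random data of the same shape, and keep the leading dimensions that beat chance. This is a simplified illustration on simulated data, not psych's implementation. With `fa = "fa"`, `fa.parallel()` works with factor-analysis eigenvalues rather than the principal-component eigenvalues used here, and all names below (`x`, `n_iter`, `n_retain`) are hypothetical.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in data: 300 rows, 6 variables driven by 2 latent factors.
n <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)
x <- cbind(
  f1 + rnorm(n, sd = 0.6), f1 + rnorm(n, sd = 0.6), f1 + rnorm(n, sd = 0.6),
  f2 + rnorm(n, sd = 0.6), f2 + rnorm(n, sd = 0.6), f2 + rnorm(n, sd = 0.6)
)

# Eigenvalues of the observed correlation matrix.
obs_eig <- eigen(cor(x), only.values = TRUE)$values

# Eigenvalues from random normal data of the same shape, averaged
# over n_iter replications.
n_iter <- 100
rand_eig <- replicate(n_iter, {
  r <- matrix(rnorm(n * ncol(x)), nrow = n)
  eigen(cor(r), only.values = TRUE)$values
})
mean_rand_eig <- rowMeans(rand_eig)

# Retain the leading dimensions whose observed eigenvalue beats the random one.
exceeds <- obs_eig > mean_rand_eig
n_retain <- if (all(exceeds)) length(exceeds) else which(!exceeds)[1] - 1
n_retain
```

On this simulated data the comparison should recover the two factors that were built in. On the real data, the output above reports five factors.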
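# Extract 3 factors with principal-axis factoring (fm = "pa") and an oblique
# oblimin rotation; note that parallel analysis above suggested 5 factors.
efa_result <- fa(scaled_data, nfactors = 3, rotate = "oblimin", fm = "pa")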
## Loading required namespace: GPArotation
print(efa_result$loadings)
##
## Loadings:
##                         PA1    PA2    PA3
## avg_sentence_length   0.867 -0.103
## coherence                   -0.346 -0.120
## consistency                 -0.967
## distinct_2            0.536  0.438  0.148
## lm_ppl                       0.193  0.447
## text_novelty          0.171         0.665
## flesch               -0.740        -0.292
## spache                0.776         0.264
## toxicity             -0.306 -0.163
## syntactic_complexity  0.802        -0.261
## self_bleu            -0.218 -0.289  0.118
## factual_hint                 0.927
## semantic_drift               0.887
##
##                  PA1   PA2   PA3
## SS loadings    3.020 3.060 0.929
## Proportion Var 0.232 0.235 0.071
## Cumulative Var 0.232 0.468 0.539
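The loadings table leaves out some entries because the print method for `loadings` objects suppresses values smaller in magnitude than its `cutoff` argument (0.1 by default). The sketch below rebuilds three rows from the two-decimal pattern matrix that `print(efa_result)` reports and prints them with different cutoffs. The object `L` is hypothetical and exists only for this illustration.

```r
# Three rows copied (to two decimals) from the pattern matrix that
# print(efa_result) reports for this analysis.
L <- matrix(
  c(0.87, -0.10, -0.06,
    0.54,  0.44,  0.15,
    0.01,  0.93, -0.03),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("avg_sentence_length", "distinct_2", "factual_hint"),
    c("PA1", "PA2", "PA3")
  )
)
class(L) <- "loadings"

# Default cutoff of 0.1: smaller loadings are printed as blanks.
# The SS loadings lines under the table refer to these three rows only.
print(L)

# cutoff = 0 shows every loading; a higher cutoff (0.3 is a common
# but arbitrary choice) keeps only the salient ones.
print(L, cutoff = 0)
print(L, cutoff = 0.3)
```

The same arguments can be passed when printing `efa_result$loadings`, for example to hide the cross-loadings of `distinct_2` or to reveal the suppressed small values.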
print(efa_result)
## Factor Analysis using method = pa
## Call: fa(r = scaled_data, nfactors = 3, rotate = "oblimin", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA2 PA3 h2 u2 com
## avg_sentence_length 0.87 -0.10 -0.06 0.72 0.277 1.0
## coherence 0.05 -0.35 -0.12 0.13 0.867 1.3
## consistency -0.02 -0.97 0.00 0.94 0.064 1.0
## distinct_2 0.54 0.44 0.15 0.56 0.436 2.1
## lm_ppl -0.08 0.19 0.45 0.22 0.784 1.4
## text_novelty 0.17 -0.08 0.66 0.57 0.428 1.2
## flesch -0.74 0.03 -0.29 0.81 0.186 1.3
## spache 0.78 -0.01 0.26 0.84 0.158 1.2
## toxicity -0.31 -0.16 0.09 0.10 0.896 1.7
## syntactic_complexity 0.80 0.02 -0.26 0.54 0.464 1.2
## self_bleu -0.22 -0.29 0.12 0.12 0.879 2.2
## factual_hint 0.01 0.93 -0.03 0.86 0.141 1.0
## semantic_drift -0.10 0.89 0.00 0.80 0.201 1.0
##
## PA1 PA2 PA3
## SS loadings 3.12 3.06 1.04
## Proportion Var 0.24 0.24 0.08
## Cumulative Var 0.24 0.48 0.56
## Proportion Explained 0.43 0.42 0.14
## Cumulative Proportion 0.43 0.86 1.00
##
## With factor correlations of
## PA1 PA2 PA3
## PA1 1.00 -0.01 0.42
## PA2 -0.01 1.00 0.01
## PA3 0.42 0.01 1.00
##
## Mean item complexity = 1.4
## Test of the hypothesis that 3 factors are sufficient.
##
## df null model = 78 with the objective function = 8.58 with Chi Square = 12115.08
## df of the model are 42 and the objective function was 1.14
##
## The root mean square of the residuals (RMSR) is 0.05
## The df corrected root mean square of the residuals is 0.06
##
## The harmonic n.obs is 1419 with the empirical chi square 483.65 with prob < 2e-76
## The total n.obs was 1419 with Likelihood Chi Square = 1608.08 with prob < 3.5e-310
##
## Tucker Lewis Index of factoring reliability = 0.758
## RMSEA index = 0.162 and the 90 % confidence intervals are 0.155 0.169
## BIC = 1303.26
## Fit based upon off diagonal values = 0.98
## Measures of factor score adequacy
## PA1 PA2 PA3
## Correlation of (regression) scores with factors 0.96 0.98 0.84
## Multiple R square of scores with factors 0.92 0.97 0.70
## Minimum correlation of possible factor scores 0.84 0.94 0.40
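Several summary numbers in this printout can be checked by hand from other values it reports. The sketch below uses standard textbook formulas with the printed, rounded inputs, so results should agree with the printout only to within rounding. The RMSEA expression in particular is the common textbook form and is an assumption about, not a copy of, psych's internal computation.

```r
# Values copied from the print(efa_result) output.
chisq_model <- 1608.08   # likelihood chi-square of the 3-factor model
df_model    <- 42
chisq_null  <- 12115.08  # chi-square of the null (no-factor) model
df_null     <- 78
n_obs       <- 1419

# Tucker-Lewis Index: improvement in chi-square/df over the null model.
tli <- (chisq_null / df_null - chisq_model / df_model) /
  (chisq_null / df_null - 1)

# BIC: chi-square penalised by df * log(N); this expression
# reproduces the reported value.
bic <- chisq_model - df_model * log(n_obs)

# RMSEA, textbook form (see the caveat in the lead-in).
rmsea <- sqrt(max((chisq_model - df_model) / (df_model * (n_obs - 1)), 0))

round(c(TLI = tli, BIC = bic, RMSEA = rmsea), 3)

# Communality (h2) of avg_sentence_length under an oblique rotation:
# h2 = l' Phi l, with l the item's pattern loadings and Phi the
# factor correlation matrix (both taken from the printout, 2 decimals).
l <- c(0.87, -0.10, -0.06)
Phi <- matrix(c( 1.00, -0.01, 0.42,
                -0.01,  1.00, 0.01,
                 0.42,  0.01, 1.00), nrow = 3, byrow = TRUE)
h2 <- drop(t(l) %*% Phi %*% l)

# Item complexity (com) of distinct_2: (sum a^2)^2 / sum a^4.
a <- c(0.54, 0.44, 0.15)
com <- sum(a^2)^2 / sum(a^4)

round(c(h2 = h2, com = com), 2)
```

By common rules of thumb (RMSEA below roughly 0.06 to 0.08, TLI above roughly 0.90), the reported RMSEA of 0.162 and TLI of 0.758 would usually be read as a sign that three factors do not fit these data well.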
corPlot(efa_result$loadings, numbers = TRUE, main = "Factor Loadings")
You can also embed plots, for example:
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.