In this project, I analyze a raw dataset, ‘hnd1_aswords_stats’.
The dataset contains Hindi words along with details such as syllable
length, lexical form, and word length.
Let’s explore the data!
These are the header descriptions of the dataset:
“item”: Sentence number
“roi”: Region of interest (equivalent to words in the sentence)
“syll_len”: Syllable length
“word_lex”: Lexical form
“word_complex”: Word/graphemic complexity
“word_freq”: Token frequency
“type_freq”: Type frequency
“word_bifreq”: Token bigram frequency
“type_bifreq”: Type bigram frequency
“word_len”: Word length (based on letter count)
“PB”: Phrase boundary
“IC”: Integration cost (based on DLT)
“SC”: Storage cost (based on DLT)
Now I am going to do a few basic analyses.
First, I will subset this data frame to include only the following columns:
“item”: Sentence number
“roi”: Region of interest (equivalent to words in the sentence)
“syll_len”: Syllable length
“word_lex”: Lexical form
“word_complex”: Word/graphemic complexity
“word_freq”: Token frequency
“word_bifreq”: Token bigram frequency
“word_len”: Word length (based on letter count)
dat <- read.table("C:\\Users\\RVM\\Downloads\\drive-download-20231119T124854Z-001\\hnd1_aswords_stats", header = TRUE)
dat_subset <- dat[, c("item", "roi", "syll_len", "word_lex", "word_complex", "word_freq", "word_bifreq", "word_len")]
Here is the subsetted data:
summary(dat_subset)
## item roi syll_len word_lex
## Min. : 1.00 Min. : 1.00 Min. :1.000 Length:1273
## 1st Qu.:18.00 1st Qu.: 5.00 1st Qu.:1.000 Class :character
## Median :36.00 Median : 9.00 Median :2.000 Mode :character
## Mean :36.72 Mean :10.34 Mean :2.214
## 3rd Qu.:55.00 3rd Qu.:15.00 3rd Qu.:3.000
## Max. :74.00 Max. :30.00 Max. :7.000
##
## word_complex word_freq word_bifreq word_len
## Min. :0.0000 Min. : 1 Min. : 1.0 Min. : 1.000
## 1st Qu.:0.0000 1st Qu.: 45 1st Qu.: 1.0 1st Qu.: 2.000
## Median :0.5000 Median : 495 Median : 4.0 Median : 3.000
## Mean :0.5004 Mean : 3788 Mean : 266.8 Mean : 3.628
## 3rd Qu.:1.0000 3rd Qu.: 5500 3rd Qu.: 31.0 3rd Qu.: 5.000
## Max. :5.5000 Max. :19415 Max. :6561.0 Max. :14.000
## NA's :60 NA's :238
head(dat_subset)
## item roi syll_len word_lex word_complex word_freq word_bifreq word_len
## 1 1 1 2 इस 0.0 3071 NA 2
## 2 1 2 3 फिल्म 1.5 107 11 5
## 3 1 3 1 में 1.0 12692 11 3
## 4 1 4 3 उनके 0.5 565 26 4
## 5 1 5 2 हीरो 0.0 6 1 4
## 6 1 6 3 अजय 0.0 19 1 3
sl_more5 <- dat_subset[dat_subset$syll_len > 5, ]
print(sl_more5)
## item roi syll_len word_lex word_complex word_freq word_bifreq word_len
## 428 24 8 6 इंसपेक्टर 1.5 1 1 9
## 431 24 11 7 सेक्टर-14 1.0 NA NA 9
## 613 35 15 6 नोएडावासियों 1.5 1 1 12
## 716 41 12 6 मलयेशियन 1.5 NA NA 8
## 1139 67 4 7 मक्का-मदीना 0.5 NA NA 11
Here we have the whole rows. Now let us extract only the words whose syllable length is greater than 5.
#The words with syllable length more than 5 are:
sl_more5_words <- dat_subset$word_lex[dat_subset$syll_len > 5]
print(sl_more5_words)
## [1] "इंसपेक्टर" "सेक्टर-14" "नोएडावासियों" "मलयेशियन" "मक्का-मदीना"
sl_more5_and_freq100 <- dat_subset[dat_subset$syll_len > 5 & dat_subset$word_freq > 100, ]
print(sl_more5_and_freq100)
## item roi syll_len word_lex word_complex word_freq word_bifreq word_len
## NA NA NA NA <NA> NA NA NA NA
## NA.1 NA NA NA <NA> NA NA NA NA
## NA.2 NA NA NA <NA> NA NA NA NA
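These are whole rows, but every value in them is NA. Of the five words with syllable length greater than 5, three have a missing word_freq, so the comparison word_freq > 100 returns NA for them and R returns an all-NA row for each; the other two have a frequency of 1. No word in the dataset actually satisfies both conditions. Extracting only the words gives the same three NAs.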
sl_more5_and_freq100_words <- dat_subset$word_lex[dat_subset$syll_len > 5 & dat_subset$word_freq > 100]
print(sl_more5_and_freq100_words)
## [1] NA NA NA
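The NA rows above come from indexing with a logical vector that contains NA. A minimal sketch of two NA-safe alternatives, using a small toy data frame (the values and the names `toy`, `cond`, `toy_safe`, and `toy_subset` are hypothetical, not taken from the dataset):

```r
# Toy data (hypothetical values) with the same column names as dat_subset
toy <- data.frame(
  word_lex  = c("a", "b", "c", "d"),
  syll_len  = c(6, 7, 6, 2),
  word_freq = c(1, NA, 150, 500),
  stringsAsFactors = FALSE
)

# The condition is NA wherever word_freq is NA and syll_len > 5
cond <- toy$syll_len > 5 & toy$word_freq > 100

# Indexing with a logical vector that contains NA produces all-NA rows
toy[cond, ]

# which() returns only the positions where the condition is TRUE
toy_safe <- toy[which(cond), ]
print(toy_safe)

# subset() likewise drops rows where the condition evaluates to NA
toy_subset <- subset(toy, syll_len > 5 & word_freq > 100)
print(toy_subset)
```

Applied to dat_subset, either form should return zero rows instead of NA rows, since no word meets both conditions.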
The unique() function returns the distinct values of a vector. It is commonly used for data cleaning, extracting unique identifiers or categorical values, and filtering out duplicates from datasets or lists.
unique_words <- unique(dat_subset$word_lex)
length(unique_words)
## [1] 519
In this dataset there are 519 unique entries in word_lex. Note that this count includes punctuation such as ‘।’, as the listing below shows.
head(unique_words,15)
## [1] "इस" "फिल्म" "में" "उनके" "हीरो" "अजय" "देवगन"
## [8] "बने" "है" "।" "चूंकि" "अजनबी" "का" "प्रीमियर"
## [15] "यहां"
c0_bifreq20_words <- dat_subset$word_lex[dat_subset$word_complex == 0 & dat_subset$word_bifreq > 20 ]
print(c0_bifreq20_words)
## [1] NA "।" NA NA NA "काफी" "।" NA
## [9] NA "।" NA NA "और" "वह" "भी" "की"
## [17] NA "बदलाव" "अपनी" "।" NA "इस" "मनमोहन" "को"
## [25] "।" "बातचीत" "दौरान" "वह" "अपनी" "इस" NA "को"
## [33] "यह" "।" "बताया" "उनकी" NA "को" "।" NA
## [41] "को" "भी" "।" NA NA "समय" "साथ" NA
## [49] "।" NA "।" NA "ही" NA NA "और"
## [57] "बाद" "।" "इस" "।" "की" "की" "बात" "कही"
## [65] "।" "साथ" "की" "गई" "।" "भी" "।" "बाद"
## [73] "यह" "पर" "।" NA NA "भी" NA NA
## [81] "।" NA "बताया" "नाम" NA NA "नाम" "कहा"
## [89] "वह" "।" NA "और" "साथ" "।" "बताया" "बाद"
## [97] NA "।" NA "आज" "तक" "।" "कहा" "तो"
## [105] "वह" "।" NA "ही" "अलग" "।" NA "को"
## [113] "बताया" "।" NA "और" "।" NA "वाली" NA
## [121] NA NA "की" NA NA "।" NA "की"
## [129] "।" NA "।" "उजाला" "कहा" "वजह" "ही" "तक"
## [137] "।" NA "बाद" "बताया" "।" "रही" "थी" "यह"
## [145] "।" NA "की" "था" "।" NA NA "बताया"
## [153] NA "तौर" "पर" NA NA NA "।" NA
## [161] NA NA "एक" "।" NA "।" NA NA
## [169] "।" "एक" NA "था" "और" "था" "।" NA
## [177] NA NA "रात" "।" "रहा" NA NA NA
## [185] NA NA "तरह" "की" "।" NA "।" NA
## [193] NA NA "पर" NA "इस" "बार" "रहा" "।"
## [201] NA "ली" "।" "और" "इस" "बार" "।" NA
## [209] NA "।" "कहा" "।" "।" "कहा" "उनकी" "।"
## [217] NA NA "गया" "।" "इस" "बात" "को" "इस"
## [225] "तरह" "।" "।" NA NA "रही" "और" "हो"
## [233] "सकता" "।" NA NA NA NA "।" "कहना"
## [241] "।" "बार" "।" "।" "इस" "बात" "की" "रात"
## [249] "।" "अगर" "तो" "।" NA "बात" "का" "कहना"
## [257] "हम" "।" NA NA "तो" "काम" "।"
unique(c0_bifreq20_words)
## [1] NA "।" "काफी" "और" "वह" "भी" "की" "बदलाव"
## [9] "अपनी" "इस" "मनमोहन" "को" "बातचीत" "दौरान" "यह" "बताया"
## [17] "उनकी" "समय" "साथ" "ही" "बाद" "बात" "कही" "गई"
## [25] "पर" "नाम" "कहा" "आज" "तक" "तो" "अलग" "वाली"
## [33] "उजाला" "वजह" "रही" "थी" "था" "तौर" "एक" "रात"
## [41] "रहा" "तरह" "बार" "ली" "गया" "हो" "सकता" "कहना"
## [49] "अगर" "का" "हम" "काम"
I only need the unique words with word complexity equal to 0 and word bigram frequency greater than 20. The filter returns 263 elements, including repetitions and NAs (from rows where a compared value is missing), so I extract only the unique elements. The result still contains the Hindi full stop ‘।’ (danda), which is not a word. Let me remove that as well.
c0_bifreq20_words <- gsub("\\।", "", c0_bifreq20_words) # Remove '।'
unique(c0_bifreq20_words)
## [1] NA "" "काफी" "और" "वह" "भी" "की" "बदलाव"
## [9] "अपनी" "इस" "मनमोहन" "को" "बातचीत" "दौरान" "यह" "बताया"
## [17] "उनकी" "समय" "साथ" "ही" "बाद" "बात" "कही" "गई"
## [25] "पर" "नाम" "कहा" "आज" "तक" "तो" "अलग" "वाली"
## [33] "उजाला" "वजह" "रही" "थी" "था" "तौर" "एक" "रात"
## [41] "रहा" "तरह" "बार" "ली" "गया" "हो" "सकता" "कहना"
## [49] "अगर" "का" "हम" "काम"
Here gsub() replaced ‘।’ with an empty string rather than dropping the element, so a “” remains in its place, alongside the NA. Both still need to be filtered out.
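One way to drop these entries is to filter the vector rather than rewrite its elements. The following is a minimal sketch on a toy vector (the contents of `words` and the names `keep` and `clean_words` are hypothetical):

```r
# Toy vector (hypothetical) containing NA, the danda, and repeated words
words <- c(NA, "।", "काफी", "और", "।", "काफी", NA)

# Keep only elements that are neither NA nor the danda;
# !is.na() is FALSE for NA, so the combined condition is never NA
keep <- !is.na(words) & words != "।"

# Filter first, then remove duplicates
clean_words <- unique(words[keep])
print(clean_words)
```

The same pattern should carry over to c0_bifreq20_words; if the danda has already been replaced by gsub(), compare against "" instead of "।".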
Correlation analysis (removing missing values)
subset_complete <- dat_subset[complete.cases(dat_subset$word_freq, dat_subset$syll_len), ]
cor(subset_complete$word_freq, subset_complete$syll_len)
## [1] -0.6188939
The correlation coefficient ranges from -1 to 1. Here the value of approximately -0.62 indicates a moderate negative linear relationship between word frequency and syllable length. That is, as word frequency (word_freq) increases, syllable length (syll_len) tends to decrease, and vice versa.
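The same missing-value handling can also be done inside cor() itself. This is a minimal sketch on toy data (all values and the names `toy`, `r_complete`, `r_manual`, and `r_spearman` are hypothetical); the Spearman variant is an optional alternative that I am suggesting because frequency counts are often strongly skewed, not something the analysis above used:

```r
# Toy data (hypothetical values) with missing entries in both columns
toy <- data.frame(
  word_freq = c(3071, 107, NA, 565, 6, 19),
  syll_len  = c(2, 3, 1, 3, NA, 3)
)

# Let cor() drop incomplete pairs itself
r_complete <- cor(toy$word_freq, toy$syll_len, use = "complete.obs")

# Equivalent manual approach with complete.cases(), as in the analysis above
ok <- complete.cases(toy$word_freq, toy$syll_len)
r_manual <- cor(toy$word_freq[ok], toy$syll_len[ok])

# Rank-based (Spearman) correlation, less sensitive to skewed counts
r_spearman <- cor(toy$word_freq, toy$syll_len,
                  method = "spearman", use = "complete.obs")

print(c(pearson = r_complete, manual = r_manual, spearman = r_spearman))
```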
library(ggplot2)
Histogram
With frequency on the y axis and syllable length on the x axis, this histogram shows the distribution of syllable length in the dataset. The distribution is skewed to the right (positively skewed), with its tail on the right side and its peak at 2.
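A histogram like the one described can be drawn with ggplot2 along these lines. This is a minimal sketch on toy data; the values, the bin width, the colours, and the names `toy` and `p_hist` are assumptions, not the settings used for the original plot:

```r
library(ggplot2)

# Toy syllable lengths (hypothetical), right-skewed with a peak at 2
toy <- data.frame(syll_len = c(1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 7))

# One bar per syllable length (binwidth = 1)
p_hist <- ggplot(toy, aes(x = syll_len)) +
  geom_histogram(binwidth = 1, colour = "black", fill = "grey70") +
  labs(x = "Syllable length", y = "Frequency")

print(p_hist)

# The most frequent syllable length in the toy data
peak <- names(which.max(table(toy$syll_len)))
print(peak)
```

For the real data, `toy` would be replaced by dat_subset.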
Scatter plot:
A scatter plot is a useful visualization when exploring the relationship between two continuous variables.
Here we can see that shorter words are used more frequently than longer ones. Words longer than 10 letters are hardly used more than once, while words shorter than 5 letters are used very frequently.
To check the trend between two variables, let’s use a scatter plot with a trend line (or line of best fit). The relationship between variables can be described in many ways: positive or negative, strong or weak, linear or non-linear.
This plot shows a relationship between word bigram frequency (y) and word frequency (x): as word frequency increases, word bigram frequency also increases, suggesting a positive correlation.
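A scatter plot with a fitted line can be sketched as follows. The data are toy values, and the use of a straight-line fit (method = "lm") is an assumption, since the text does not say which smoother produced the trend line; `toy`, `p_trend`, `fit`, and `slope` are hypothetical names:

```r
library(ggplot2)

# Toy data (hypothetical values) for word and bigram frequency
toy <- data.frame(
  word_freq   = c(6, 19, 107, 565, 3071, 12692),
  word_bifreq = c(1, 1, 11, 26, 40, 300)
)

# Drop rows with missing values before plotting and fitting
toy <- toy[complete.cases(toy), ]

# Points plus a least-squares line of best fit
p_trend <- ggplot(toy, aes(x = word_freq, y = word_bifreq)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Word frequency", y = "Word bigram frequency")

print(p_trend)

# The sign of the fitted slope gives the direction of the trend
fit <- lm(word_bifreq ~ word_freq, data = toy)
slope <- coef(fit)[["word_freq"]]
print(slope)
```

A positive slope corresponds to the upward trend described above.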
Boxplot
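A boxplot of word complexity across sentence items could be drawn along these lines. This is a minimal sketch on toy data; the values and the names `toy` and `p_box` are hypothetical, and treating item as a categorical variable on the x axis is an assumption about how the plot was built:

```r
library(ggplot2)

# Toy data (hypothetical values): three sentence items, four words each
toy <- data.frame(
  item         = rep(1:3, each = 4),
  word_complex = c(0, 1.5, 1, 0.5, 0, 0, 0.5, 1, 1, 2, 0.5, 0)
)

# factor(item) gives one box per sentence item
p_box <- ggplot(toy, aes(x = factor(item), y = word_complex)) +
  geom_boxplot() +
  labs(x = "Sentence item", y = "Word complexity")

print(p_box)
```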
This was my attempt to analyze a raw linguistic dataset containing
syll_len, word_lex, word_complex, word_freq, word_len, etc. First I
subsetted the data to include only the columns I wanted. Then, on
this subsetted dataset, I extracted the data I needed to
answer some questions.
After that, I plotted a few graphs: a
histogram for the distribution of syllable lengths, a scatter plot for
understanding the relation between word frequency and word length, a
scatter plot with a trend line for understanding the relation between word
frequency and word bigram frequency, and a boxplot to understand word
complexity across sentence items.
That was it! Thank you!