Analyzing Raw Linguistics Data

In this project, I analyze a raw dataset, ‘hnd1_aswords_stats’. The dataset contains Hindi words along with details such as syllable length, lexical form, and word length.

Let’s explore the data!

These are the header descriptions of the dataset:

“item”: Sentence number
“roi”: Region of interest (equivalent to words in the sentence)
“syll_len”: Syllable length
“word_lex”: Lexical form
“word_complex”: Word/graphemic complexity
“word_freq”: Token frequency
“type_freq”: Type frequency
“word_bifreq”: Token bigram frequency
“type_bifreq”: Type bigram frequency
“word_len”: Word length (based on letter count)
“PB”: Phrase boundary
“IC”: Integration cost (based on DLT)
“SC”: Storage cost (based on DLT)

Now I am going to do a few basic analyses.

Subsetting the data

First, I will subset this data frame to include only the columns needed for the questions below:

dat <- read.table("C:\\Users\\RVM\\Downloads\\drive-download-20231119T124854Z-001\\hnd1_aswords_stats", header = TRUE)
dat_subset <- dat[, c("item", "roi", "syll_len", "word_lex", "word_complex", "word_freq", "word_bifreq", "word_len")]

Here is the subsetted data:

summary(dat_subset)

##       item            roi           syll_len       word_lex        
##  Min.   : 1.00   Min.   : 1.00   Min.   :1.000   Length:1273       
##  1st Qu.:18.00   1st Qu.: 5.00   1st Qu.:1.000   Class :character  
##  Median :36.00   Median : 9.00   Median :2.000   Mode  :character  
##  Mean   :36.72   Mean   :10.34   Mean   :2.214                     
##  3rd Qu.:55.00   3rd Qu.:15.00   3rd Qu.:3.000                     
##  Max.   :74.00   Max.   :30.00   Max.   :7.000                     
##                                                                    
##   word_complex      word_freq      word_bifreq        word_len     
##  Min.   :0.0000   Min.   :    1   Min.   :   1.0   Min.   : 1.000  
##  1st Qu.:0.0000   1st Qu.:   45   1st Qu.:   1.0   1st Qu.: 2.000  
##  Median :0.5000   Median :  495   Median :   4.0   Median : 3.000  
##  Mean   :0.5004   Mean   : 3788   Mean   : 266.8   Mean   : 3.628  
##  3rd Qu.:1.0000   3rd Qu.: 5500   3rd Qu.:  31.0   3rd Qu.: 5.000  
##  Max.   :5.5000   Max.   :19415   Max.   :6561.0   Max.   :14.000  
##                   NA's   :60      NA's   :238

head(dat_subset)

##   item roi syll_len word_lex word_complex word_freq word_bifreq word_len
## 1    1   1        2       इस          0.0      3071          NA        2
## 2    1   2        3     फिल्म          1.5       107          11        5
## 3    1   3        1        में          1.0     12692          11        3
## 4    1   4        3      उनके          0.5       565          26        4
## 5    1   5        2     हीरो          0.0         6           1        4
## 6    1   6        3      अजय          0.0        19           1        3

With the reduced dataset, extract all the words that have a syllable length of more than 5

sl_more5 <- dat_subset[dat_subset$syll_len > 5, ]
print(sl_more5)

##      item roi syll_len    word_lex word_complex word_freq word_bifreq word_len
## 428    24   8        6      इंसपेक्टर          1.5         1           1        9
## 431    24  11        7     सेक्टर-14          1.0        NA          NA        9
## 613    35  15        6 नोएडावासियों          1.5         1           1       12
## 716    41  12        6     मलयेशियन          1.5        NA          NA        8
## 1139   67   4        7  मक्का-मदीना          0.5        NA          NA       11

This gives the whole rows. Now let us extract only the words that have a syllable length of more than 5.

#The words with syllable length more than 5 are:
sl_more5_words <- dat_subset$word_lex[dat_subset$syll_len > 5]
print(sl_more5_words)

## [1] "इंसपेक्टर"      "सेक्टर-14"     "नोएडावासियों" "मलयेशियन"     "मक्का-मदीना"

Extract all the words that have a syllable length of more than 5 and have a frequency of more than 100

sl_more5_and_freq100 <- dat_subset[dat_subset$syll_len > 5 & dat_subset$word_freq > 100, ]
print(sl_more5_and_freq100)

##      item roi syll_len word_lex word_complex word_freq word_bifreq word_len
## NA     NA  NA       NA     <NA>           NA        NA          NA       NA
## NA.1   NA  NA       NA     <NA>           NA        NA          NA       NA
## NA.2   NA  NA       NA     <NA>           NA        NA          NA       NA

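The three all-NA rows are not real matches. Of the five words with a syllable length of more than 5, three have a missing word_freq, so the comparison word_freq > 100 evaluates to NA for them, and R returns an all-NA row for every NA in a logical index. The other two words have a frequency of 1, so no word in the dataset actually satisfies both conditions. Extracting only the words shows the same thing.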

sl_more5_and_freq100_words <- dat_subset$word_lex[dat_subset$syll_len > 5 & dat_subset$word_freq > 100]
print(sl_more5_and_freq100_words)

## [1] NA NA NA
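
One way to avoid the all-NA rows is to wrap the condition in which(), which keeps only the positions where the condition is TRUE. This is a minimal sketch: the data frame `toy` is a stand-in I built from the five rows with syll_len > 5 printed earlier, not the full dataset.

```r
# Toy stand-in for dat_subset: the five long words printed earlier,
# with their syll_len and word_freq values (three frequencies are missing).
toy <- data.frame(
  word_lex  = c("इंसपेक्टर", "सेक्टर-14", "नोएडावासियों", "मलयेशियन", "मक्का-मदीना"),
  syll_len  = c(6, 7, 6, 6, 7),
  word_freq = c(1, NA, 1, NA, NA)
)

# The condition is NA wherever word_freq is missing.
cond <- toy$syll_len > 5 & toy$word_freq > 100

# Indexing with a logical vector that contains NA yields one all-NA row per NA.
naive <- toy[cond, ]

# which() returns only the positions where cond is TRUE, so NA and FALSE are dropped.
safe <- toy[which(cond), ]

nrow(naive)  # 3 all-NA rows
nrow(safe)   # 0 rows: no long word has a frequency above 100
```

Applied to the real data, the same pattern would be dat_subset[which(dat_subset$syll_len > 5 & dat_subset$word_freq > 100), ].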

How many unique words exist in the dataset?

The unique() function is commonly used for data cleaning, extracting unique identifiers or categorical values, and filtering out duplicates from datasets or lists.

unique_words <- unique(dat_subset$word_lex)
length(unique_words)

## [1] 519

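In this dataset there are 519 unique word forms. Note that this count includes punctuation tokens such as ‘।’, which appears among the first 15 entries below.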

head(unique_words,15)

##  [1] "इस"      "फिल्म"    "में"       "उनके"     "हीरो"    "अजय"     "देवगन"   
##  [8] "बने"      "है"       "।"       "चूंकि"     "अजनबी"   "का"      "प्रीमियर"
## [15] "यहां"

Extract all words with word complexity equal to 0 and word bigram frequency of more than 20

c0_bifreq20_words <- dat_subset$word_lex[dat_subset$word_complex == 0 & dat_subset$word_bifreq > 20 ]
print(c0_bifreq20_words)

##   [1] NA       "।"      NA       NA       NA       "काफी"   "।"      NA      
##   [9] NA       "।"      NA       NA       "और"     "वह"     "भी"     "की"    
##  [17] NA       "बदलाव"  "अपनी"   "।"      NA       "इस"     "मनमोहन" "को"    
##  [25] "।"      "बातचीत" "दौरान"  "वह"     "अपनी"   "इस"     NA       "को"    
##  [33] "यह"     "।"      "बताया"  "उनकी"   NA       "को"     "।"      NA      
##  [41] "को"     "भी"     "।"      NA       NA       "समय"    "साथ"    NA      
##  [49] "।"      NA       "।"      NA       "ही"     NA       NA       "और"    
##  [57] "बाद"    "।"      "इस"     "।"      "की"     "की"     "बात"    "कही"   
##  [65] "।"      "साथ"    "की"     "गई"     "।"      "भी"     "।"      "बाद"   
##  [73] "यह"     "पर"     "।"      NA       NA       "भी"     NA       NA      
##  [81] "।"      NA       "बताया"  "नाम"    NA       NA       "नाम"    "कहा"   
##  [89] "वह"     "।"      NA       "और"     "साथ"    "।"      "बताया"  "बाद"   
##  [97] NA       "।"      NA       "आज"     "तक"     "।"      "कहा"    "तो"    
## [105] "वह"     "।"      NA       "ही"     "अलग"    "।"      NA       "को"    
## [113] "बताया"  "।"      NA       "और"     "।"      NA       "वाली"   NA      
## [121] NA       NA       "की"     NA       NA       "।"      NA       "की"    
## [129] "।"      NA       "।"      "उजाला"  "कहा"    "वजह"    "ही"     "तक"    
## [137] "।"      NA       "बाद"    "बताया"  "।"      "रही"    "थी"     "यह"    
## [145] "।"      NA       "की"     "था"     "।"      NA       NA       "बताया" 
## [153] NA       "तौर"    "पर"     NA       NA       NA       "।"      NA      
## [161] NA       NA       "एक"     "।"      NA       "।"      NA       NA      
## [169] "।"      "एक"     NA       "था"     "और"     "था"     "।"      NA      
## [177] NA       NA       "रात"    "।"      "रहा"    NA       NA       NA      
## [185] NA       NA       "तरह"    "की"     "।"      NA       "।"      NA      
## [193] NA       NA       "पर"     NA       "इस"     "बार"    "रहा"    "।"     
## [201] NA       "ली"     "।"      "और"     "इस"     "बार"    "।"      NA      
## [209] NA       "।"      "कहा"    "।"      "।"      "कहा"    "उनकी"   "।"     
## [217] NA       NA       "गया"    "।"      "इस"     "बात"    "को"     "इस"    
## [225] "तरह"    "।"      "।"      NA       NA       "रही"    "और"     "हो"    
## [233] "सकता"   "।"      NA       NA       NA       NA       "।"      "कहना"  
## [241] "।"      "बार"    "।"      "।"      "इस"     "बात"    "की"     "रात"   
## [249] "।"      "अगर"    "तो"     "।"      NA       "बात"    "का"     "कहना"  
## [257] "हम"     "।"      NA       NA       "तो"     "काम"    "।"

unique(c0_bifreq20_words)

##  [1] NA       "।"      "काफी"   "और"     "वह"     "भी"     "की"     "बदलाव" 
##  [9] "अपनी"   "इस"     "मनमोहन" "को"     "बातचीत" "दौरान"  "यह"     "बताया" 
## [17] "उनकी"   "समय"    "साथ"    "ही"     "बाद"    "बात"    "कही"    "गई"    
## [25] "पर"     "नाम"    "कहा"    "आज"     "तक"     "तो"     "अलग"    "वाली"  
## [33] "उजाला"  "वजह"    "रही"    "थी"     "था"     "तौर"    "एक"     "रात"   
## [41] "रहा"    "तरह"    "बार"    "ली"     "गया"    "हो"     "सकता"   "कहना"  
## [49] "अगर"    "का"     "हम"     "काम"

We only need the distinct words with word complexity equal to 0 and word bigram frequency of more than 20. The filter returns 263 elements, with repetitions and with an NA wherever word_bifreq is missing, so I take only the unique elements. The result still contains the Hindi sentence-final mark ‘।’ (the danda), which is not a word. Let me remove that as well.

c0_bifreq20_words <- gsub("\\।", "", c0_bifreq20_words)  # Remove '।'
unique(c0_bifreq20_words)

##  [1] NA       ""       "काफी"   "और"     "वह"     "भी"     "की"     "बदलाव" 
##  [9] "अपनी"   "इस"     "मनमोहन" "को"     "बातचीत" "दौरान"  "यह"     "बताया" 
## [17] "उनकी"   "समय"    "साथ"    "ही"     "बाद"    "बात"    "कही"    "गई"    
## [25] "पर"     "नाम"    "कहा"    "आज"     "तक"     "तो"     "अलग"    "वाली"  
## [33] "उजाला"  "वजह"    "रही"    "थी"     "था"     "तौर"    "एक"     "रात"   
## [41] "रहा"    "तरह"    "बार"    "ली"     "गया"    "हो"     "सकता"   "कहना"  
## [49] "अगर"    "का"     "हम"     "काम"

Here I removed ‘।’, but an empty string “” is left in its place and the NA is still present, so both still need to be removed.
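
One possible way to get rid of both is to filter the danda and the missing values out instead of replacing the danda with an empty string. This is a minimal sketch: the vector `words` is a short stand-in I made from a few values in the output above, and `is_word` and `clean_words` are names introduced here for illustration.

```r
# Toy stand-in for c0_bifreq20_words, using a few values from the output above.
words <- c(NA, "।", NA, "काफी", "।", "और", "वह", "और")

# Keep an element only if it is not missing and is not the danda itself.
is_word <- !is.na(words) & words != "।"

# Drop the unwanted elements first, then remove repetitions.
clean_words <- unique(words[is_word])

print(clean_words)
```

Because the danda is dropped rather than replaced, no empty string is created in the first place.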

Is there a correlation between word frequency and syllable length?

Correlation analysis (removing missing values)

subset_complete <- dat_subset[complete.cases(dat_subset$word_freq, dat_subset$syll_len), ]
cor(subset_complete$word_freq, subset_complete$syll_len)

## [1] -0.6188939

The correlation coefficient (Pearson’s r, the default of cor()) ranges from -1 to 1. Here the score of approximately -0.62 indicates a moderate negative linear relationship between word frequency and syllable length. That is, as word frequency (word_freq) increases, syllable length (syll_len) tends to decrease, and vice versa.
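
As a side note, cor() can drop incomplete pairs itself through its use argument, which gives the same result as filtering with complete.cases() first. This is a minimal sketch on a toy data frame assembled from a few rows printed earlier, so the value it produces is not the -0.62 reported for the full dataset.

```r
# Toy stand-in for dat_subset: a few (syll_len, word_freq) pairs printed earlier,
# including one row with a missing frequency.
toy <- data.frame(
  syll_len  = c(2, 3, 1, 3, 2, 3, 6, 7),
  word_freq = c(3071, 107, 12692, 565, 6, 19, 1, NA)
)

# Approach used above: keep only complete rows, then correlate.
complete_rows <- toy[complete.cases(toy$word_freq, toy$syll_len), ]
r_filtered <- cor(complete_rows$word_freq, complete_rows$syll_len)

# Equivalent shortcut: let cor() discard pairs that contain a missing value.
r_builtin <- cor(toy$word_freq, toy$syll_len, use = "complete.obs")

all.equal(r_filtered, r_builtin)
```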

Let’s do some plotting!

library(ggplot2)

What is the distribution of the syllable lengths?

Histogram

With frequency on the y-axis and syllable length on the x-axis, this histogram shows the distribution of syllable length in our dataset. It is skewed to the right (positively skewed), with its tail on the right side of the distribution and its peak at 2.
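
A minimal sketch of how such a histogram could be drawn with ggplot2. The toy data frame reuses a handful of syll_len values printed earlier and stands in for dat_subset. The bin width of 1 and the colours are my assumptions, not necessarily the settings behind the original figure.

```r
library(ggplot2)

# Toy stand-in for dat_subset: syllable lengths taken from rows printed earlier.
toy <- data.frame(syll_len = c(2, 3, 1, 3, 2, 3, 6, 7, 6, 6, 7))

# One bar per syllable length (binwidth = 1), counts on the y-axis.
p_hist <- ggplot(toy, aes(x = syll_len)) +
  geom_histogram(binwidth = 1, fill = "steelblue", colour = "white") +
  labs(x = "Syllable length", y = "Frequency")

print(p_hist)
```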

How do the word frequency and word length relate to each other?

Scatter plot:
A scatter plot is a useful visualization for exploring the relationship between two continuous variables.

Here we can see that shorter words are used more frequently than longer ones. Words with a length of more than 10 rarely occur more than once, while words with a length of less than 5 are used very frequently.
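
A minimal sketch of such a scatter plot in ggplot2. The toy data frame is built from a few rows printed earlier and stands in for dat_subset. The axis assignment (word length on x, word frequency on y) and the point transparency are my assumptions.

```r
library(ggplot2)

# Toy stand-in for dat_subset: (word_len, word_freq) pairs from rows printed earlier.
toy <- data.frame(
  word_len  = c(2, 5, 3, 4, 4, 3, 9, 12),
  word_freq = c(3071, 107, 12692, 565, 6, 19, 1, 1)
)

# One point per word; alpha makes overlapping points easier to see.
p_scatter <- ggplot(toy, aes(x = word_len, y = word_freq)) +
  geom_point(alpha = 0.5) +
  labs(x = "Word length (letters)", y = "Word frequency")

print(p_scatter)
```

On the full dataset, ggplot2 drops rows with a missing word_freq and issues a warning about it.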

Are there any trends between word frequency and word bigram frequency?

To check the trend between two variables, let’s use a scatter plot with a trend line (or line of best fit). The relationship between variables can be described in several ways: positive or negative, strong or weak, linear or non-linear.

This plot shows that there is a relationship between word bigram frequency (y) and word frequency (x). As word frequency increases, word bigram frequency also increases, suggesting a positive correlation.
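
A minimal sketch of a scatter plot with a fitted trend line in ggplot2. The toy data frame reuses the six rows shown by head(dat_subset) and stands in for the full data. Using a straight-line fit (method = "lm") without a confidence band is my assumption about the kind of trend line.

```r
library(ggplot2)

# Toy stand-in for dat_subset: the six rows shown by head(dat_subset).
toy <- data.frame(
  word_freq   = c(3071, 107, 12692, 565, 6, 19),
  word_bifreq = c(NA, 11, 11, 26, 1, 1)
)

# Points plus a least-squares line; rows with a missing value are dropped with a warning.
p_trend <- ggplot(toy, aes(x = word_freq, y = word_bifreq)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Word frequency", y = "Word bigram frequency")

print(p_trend)
```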

How does word complexity vary across different sentence items?

Boxplot
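
A minimal sketch of a boxplot of word complexity per sentence item in ggplot2. The toy data frame is built from rows printed earlier and stands in for dat_subset. Treating item as a categorical variable with factor() is my assumption about how the items were grouped.

```r
library(ggplot2)

# Toy stand-in for dat_subset: (item, word_complex) pairs from rows printed earlier.
toy <- data.frame(
  item         = c(1, 1, 1, 1, 1, 1, 24, 24, 35, 41, 67),
  word_complex = c(0, 1.5, 1, 0.5, 0, 0, 1.5, 1, 1.5, 1.5, 0.5)
)

# factor(item) makes each sentence its own group, giving one box per sentence.
p_box <- ggplot(toy, aes(x = factor(item), y = word_complex)) +
  geom_boxplot() +
  labs(x = "Sentence item", y = "Word complexity")

print(p_box)
```

With 74 sentence items in the full dataset, the x-axis labels may overlap, so rotating them or plotting a subset of items could help readability.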

This was my attempt to analyze a raw linguistic dataset containing syll_len, word_lex, word_complex, word_freq, word_len, etc. First I subsetted the data to include only the columns I needed. Then, on this subsetted dataset, I extracted the data needed to answer a few questions.
After that, I plotted a few graphs: a histogram for the distribution of syllable lengths, a scatter plot for the relation between word frequency and word length, a scatter plot with a trend line for the relation between word frequency and word bigram frequency, and a boxplot for word complexity across sentence items.

That was it! Thank you!