The reliability of automated processing is greatly improved by knowing where paragraphs end. However, PDF files do not contain end-of-paragraph markers. Most programs for converting PDF files to text files produce a stream of lines with hard-coded line breaks.
This requires manual adjustment to make the text more readable. Captioning at the paragraph or sentence level can facilitate understanding of a corresponding audio file. In addition, translation software works best when the translation context is set at the sentence or paragraph level.
This report describes an attempt to implement and use a fuzzy logic approach to determine the End of Paragraph (EOP), i.e. the last line of a paragraph.
The following analysis was conducted on the content of Lecture 2.
dat = read.csv("/home/rbatz/GEN_TLResources.csv",header=TRUE)
dat$length = as.numeric(as.character(dat$length))  # make sure the length column is numeric
hist(dat$length,breaks=50,col="blue", xlab="line length (chars)", main="Lengths of lines in the lecture PDF")
library(stringr)
dtxt = read.csv("/home/rbatz/gen.csv")
l = str_length(as.character(dtxt[,1]))  # length of every word in the first column
hist(l,breaks=20,col="blue", main="Size of words used",xlab ="Word Length")
Words in this text varied between 1 and 17 characters in length and followed a Pareto distribution.
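| Line length | Freq | EOP | ? | . | ! | " | , | a-z |
|---|---|---|---|---|---|---|---|---|
| <80 | 40 | 40 | 7 | 30 | 2 | | | 1 |
| 80 - 84 | 2 | 2 | | 2 | | | | |
| 85 - 89 | 2 | | | 1 | | | | 1 |
| 90 - 94 | 12 | | | 2 | | | | 10 |
| 95 - 99 | 53 | | | 5 | 1 | 3 | 1 | 43 |
| 100-104 | 104 | | | 11 | 1 | 7 | | 85 |
| 105-109 | 32 | | | 3 | | 4 | | 25 |
| >109 | 3 | | | | 1 | | | 2 |
| Total | 248 | 42 | 7 | 49 | 5 | 14 | 1 | 130 |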
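A tally of this kind could be produced along the following lines. This is only a sketch, not the procedure used for the table above: the names `length_bin`, `last_char_class` and `tally_lines` are hypothetical, the bins follow the table, and any final character outside the table's columns is counted as "other".

```ruby
# Final characters that have their own column in the table.
PUNCT_CLASSES = ["?", ".", "!", "\"", ","].freeze

# Map a line length to one of the table's bins (<80, 80-84, ..., >109).
def length_bin(len)
  return "<80" if len < 80
  return ">109" if len > 109
  low = (len / 5) * 5
  "#{low}-#{low + 4}"
end

# Classify a line by its last visible character.
def last_char_class(line)
  last = line.rstrip[-1]
  return "empty" if last.nil?
  return last if PUNCT_CLASSES.include?(last)
  last.match?(/[a-z]/i) ? "a-z" : "other"
end

# Count lines per length bin and final-character class.
def tally_lines(lines)
  table = Hash.new { |hash, bin| hash[bin] = Hash.new(0) }
  lines.each do |line|
    text = line.chomp
    next if text.strip.empty?   # blank lines are skipped (an assumption)
    table[length_bin(text.length)][last_char_class(text)] += 1
  end
  table
end

# Made-up sample lines, only to show the call.
sample = ["A short line.", "x" * 99 + ",", "y" * 102]
tally_lines(sample).each { |bin, counts| puts "#{bin}: #{counts}" }
```

Keeping the binning and the character classification in separate functions makes it easy to try other bin widths or character sets on a different PDF.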
The Ruby program developed was tested and found to be 100% accurate in extracting the English text, and it identified end-of-paragraph breaks with at least 95% accuracy. Ruby is a free programming language, and both it and the software can be used without cost. It will work on Windows, Linux and Mac OS X.
gem install pdf-reader
ruby pdf2txtpar.rb > newversion.txt
Enter these commands in the command console/terminal; the extracted text is written to newversion.txt.
Please direct all inquiries and questions about this software to the author.
The statistical analysis revealed the following points.
* Lines shorter than 80 characters end a paragraph.
\[LineLength < 80 \ \therefore\ EOP\]
* It is accurate to assume that longer lines ending in characters that do not end sentences share the same paragraph as the next line.
\[LineLength > 80\ \land\ Chr_{last} \notin \{!?."'\}\ \therefore\ \neg EOP\]
* For longer lines ending in a sentence-ending character, the length of the first word of the next line is also taken into account.
\[LineLength > 80\ \land\ Chr_{last} \in \{!?."'\}\ \land\ LineLength + NxtWordLength > 110 \implies EOP\]
* The probability of error in identifying EOP decreases as the difference between MaxLineLength and LineLength decreases.
\[P(EOP_{err}) = F(LineLength_{max}-LineLength)\]

# Appendix
The following is the source code of the software developed. I have called it PDF2TXTpar.rb
# coding: utf-8
=begin
------------------------------------------------------
Program: PDF2TxTpar.rb
Author: Robert Batzinger
Distribution: Open source, MIT Licence
Version: 1.0, 12 Sept 2020
Documentation: RMarkdown notebook explains development, testing and use.
Copyright © 2020 Robert Batzinger
------------------------------------------------------
Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the “Software”), to deal in the Software without restriction,
including without limitation the rights to use, copy, modify, merge,
publish, distribute, sublicense, and/or sell copies of the Software,
and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.
------------------------------------------------------
=end
require 'pdf/reader'

# Check for end of paragraph
def is_EOP?(line)
  if line.length < 80
    return true
  elsif line.length < 85 &&
        line[-1] =~ /[\?\!\.\"]/
    return true
  else
    return false
  end
end

reader = PDF::Reader.new("GEN_TL_RESOURCES_01_06_2020.pdf")
paragraph = ""
paragraphs = []
reader.pages.each do |page|
  lines = page.text.scan(/^.+/)
  lines.each do |line|
    paragraph += " #{line}"
    if is_EOP?(line)
      paragraphs << paragraph
      paragraph = ""
    end
  end
end
if paragraph != ""
  paragraphs << paragraph
end

# Double space between paragraphs
puts paragraphs.join("\n\n ")
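The grouping logic of the listing can also be tried on plain text lines without opening a PDF. The sketch below is an illustrative variant, not part of the distributed program: `end_of_paragraph?` and `group_paragraphs` are hypothetical names, and the keyword arguments `short_limit` and `punct_limit` are assumptions that default to the 80 and 85 character cut-offs used in `is_EOP?`.

```ruby
# Same test as is_EOP? in the listing, with the cut-offs made adjustable.
def end_of_paragraph?(line, short_limit: 80, punct_limit: 85)
  return true if line.length < short_limit
  line.length < punct_limit && line.match?(/[?!."]\z/)
end

# Join consecutive lines into paragraphs, closing a paragraph at each EOP line.
def group_paragraphs(lines, short_limit: 80, punct_limit: 85)
  paragraphs = []
  current = []
  lines.each do |line|
    current << line.strip
    next unless end_of_paragraph?(line, short_limit: short_limit,
                                        punct_limit: punct_limit)
    paragraphs << current.join(" ")
    current = []
  end
  paragraphs << current.join(" ") unless current.empty?
  paragraphs
end

# Made-up sample: one full-width line followed by two short lines.
sample = [
  ("lorem " * 15).strip,   # 89 characters, so the paragraph continues
  "ends here.",            # short line, closes the first paragraph
  "Second paragraph."      # short line, a paragraph of its own
]
puts group_paragraphs(sample).join("\n\n")
```

Unlike the listing, this variant strips each line before joining, so the paragraphs carry no leading space. Passing different limits makes it possible to test the thresholds against documents set at another line width.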
The following text was obtained by cut and paste, using Foxit PDF Reader to create a text file in MS Word.
Lesson 1 ‑ GENESIS 1:1‑25 ‑ Lea Compton
Of all of God's creation, humanity is unique in its ability to express itself with language. But
sometimes conversational words fail to communicate the depth of what we're thinking and feeling.
We're emotive and our communication can move beyond facts to feelings, intentions, hopes and dreams.
Not only can we express these things for ourselves, but we can do so in a way that evokes these
emotions and feelings in others.
Many art forms evoke emotion, but words are special and when they are given structure, they become
art with meaning that transcends time and culture, connecting heart to heart and life to life, even
connecting humanity to the divine. If you don't believe me or if you don't get into poetry, just imagine
songs that were popular when you were coming of age. They were songs that spoke your deepest
emotions. There were songs that spoke your greatest aspirations and your most profound despair. This
is who we are as humans. This is who God made us to be and how He intended for us to communicate.
Consequently, we shouldn't be surprised that the opening words of the Bible, a book about God and His
steps to be in relationship with us, are artful language, structured differently than conversational speech.
We should not be surprised that the Holy Spirit inspired Moses to write God's introduction with rhythm
and repetition. The ancient poetry of Genesis 1 offers answers, but you might be surprised which
questions are answered.
In today's passage, we find answers about God. The Bible is about Him, after all. This passage clearly
teaches that God is the creator and sustainer of all things. God is the creator and sustainer of all things,
and we're going to explore these 25 verses in two divisions. Our first division is Genesis 1, verses 1
through 5 where we see God is Creation's Originator; and our second division is Genesis 1, verses 6
through 25, where we see God's as Creation's Sustainer.
The following text was extracted using the software developed. The resulting text can be flowed and rendered in any width as proper paragraphs. Extra white space was included between paragraphs to enhance readability.
Lesson 1 ‑ GENESIS 1:1‑25 ‑ Lea Compton
Of all of God’s creation, humanity is unique in its ability to express itself with language. But sometimes conversational words fail to communicate the depth of what we’re thinking and feeling. We’re emotive and our communication can move beyond facts to feelings, intentions, hopes and dreams. Not only can we express these things for ourselves, but we can do so in a way that evokes these emotions and feelings in others.
Many art forms evoke emotion, but words are special and when they are given structure, they become art with meaning that transcends time and culture, connecting heart to heart and life to life, even connecting humanity to the divine. If you don’t believe me or if you don’t get into poetry, just imagine songs that were popular when you were coming of age. They were songs that spoke your deepest emotions. There were songs that spoke your greatest aspirations and your most profound despair. This is who we are as humans. This is who God made us to be and how He intended for us to communicate. Consequently, we shouldn’t be surprised that the opening words of the Bible, a book about God and His steps to be in relationship with us, are artful language, structured differently than conversational speech. We should not be surprised that the Holy Spirit inspired Moses to write God’s introduction with rhythm and repetition. The ancient poetry of Genesis 1 offers answers, but you might be surprised which questions are answered.
In today’s passage, we find answers about God. The Bible is about Him, after all. This passage clearly teaches that God is the creator and sustainer of all things. God is the creator and sustainer of all things, and we’re going to explore these 25 verses in two divisions. Our first division is Genesis 1, verses 1 through 5 where we see God is Creation’s Originator; and our second division is Genesis 1, verses 6 through 25, where we see God’s as Creation’s Sustainer.
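Because each paragraph is now a single line, re-rendering at another width is a simple word wrap. The following is a minimal sketch of that idea, assuming paragraphs are separated by blank lines as in the program's output; `wrap_paragraph` and `reflow` are hypothetical helper names and are not part of the software.

```ruby
# Break one paragraph into lines of at most `width` characters.
# A single word longer than `width` is kept whole on its own line.
def wrap_paragraph(paragraph, width)
  lines = []
  current = ""
  paragraph.split.each do |word|
    if current.empty?
      current = word
    elsif current.length + 1 + word.length <= width
      current += " #{word}"
    else
      lines << current
      current = word
    end
  end
  lines << current unless current.empty?
  lines
end

# Re-wrap every blank-line-separated paragraph of a text.
def reflow(text, width)
  text.split(/\n{2,}/)
      .map { |par| wrap_paragraph(par, width).join("\n") }
      .join("\n\n")
end

# Made-up sample text, only to show the call.
text = "The resulting text can be flowed and rendered in any width.\n\n" \
       "Extra white space separates the paragraphs."
puts reflow(text, 30)
```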