Brutally fast 4-gram phrase ripper
Hi to all English-language-explorers,
I am an amateurish C program-mess-er who is interested mainly in English language console utilities with only one goal in mind: to give statistical info about word/phrase/sentence usage.
First impression: a nice forum.
My wish is to share here my attempts/console-tools for English sidekick-ing.
Second impression: an unnecessary limitation: 5 posts to be able to share a link, grmbl!
Leprechaun_r13_7pluses_quadrupleton_r1_EXEs.zip
Leprechaun_r13_7pluses_quadrupleton_r1_AT_A_GLANCE.pdf
For example:
Code:
D:\_KA45F~1\_4>dir
12/12/2010 01:37 PM 1,111,609,996 googlebooks-eng-us-all-4gram-20090715-0.csv
01/26/2011 06:46 PM 315 googlebooks-eng-us-all-4gram-20090715-0.csv.EXCERPT
01/26/2011 06:56 PM 362 Gulliver's-Travels.pdf.txt.EXCERPT
01/26/2011 06:47 PM 4,108 Leprechaun.LOG
01/26/2011 05:13 AM 514,048 Leprechaun_quadrupleton_Intel_IA-32_11.1.exe
01/26/2011 06:47 PM 53 test.lst
01/26/2011 06:47 PM 14 test.wrd
D:\_KA45F~1\_4>dir Gulliver*.excerpt/b>test2.lst
D:\_KA45F~1\_4>type "Gulliver's-Travels.pdf.txt.EXCERPT"
...
And so unmeasureable is the ambition of princes, that he
seemed to think of nothing less than reducing the whole
empire of Blefuscu into a province, and governing it, by
a viceroy; of destroying the Big-endian exiles, and compelling
that people to break the smaller end of their eggs,
by which he would remain the sole monarch of the whole
world.
...
D:\_KA45F~1\_4>Leprechaun_quadrupleton_Intel_IA-32_11.1.exe test2.lst test2.wrd
Leprechaun(Fast Greedy Word-Ripper), rev. 13_7pluses quadrupleton_r1, written by Svalqyatchx.
Leprechaun: 'Oh, well, didn't you hear? Bigger is good, but jumbo is dear.'
Kaze: Let's see what a 3-way hash + 6,602,752 Binary-Search-Trees can give us,
also the performance of a 3-way hash + 6,602,752 B-Trees of order 3.
Size of input file with files for Leprechauning: 36
Allocating memory 424MB ... OK
Size of Input TEXTual file: 362
|; Word count: 62 of them 41 distinct; Done: 64/64
Bytes per second performance: 362B/s
Words per second performance: 62W/s
Flushing unsorted words ...
Time for making unsorted wordlist: 1 second(s)
Deallocated memory in MB: 424
Allocated memory for words in MB: 1
Allocated memory for pointers-to-words in MB: 1
Sorting(with 'MultiKeyQuickSortX26Sort' by J. Bentley and R. Sedgewick) ...
Sort pass 26/26 ...
Flushing sorted words ...
Time for sorting unsorted wordlist: 1 second(s)
Leprechaun: Done.
D:\_KA45F~1\_4>type test2.wrd
and_compelling_that_people
and_so_unmeasureable_is
blefuscu_into_a_province
break_the_smaller_end
by_which_he_would
compelling_that_people_to
destroying_the_big_endian
empire_of_blefuscu_into
end_of_their_eggs
he_seemed_to_think
he_would_remain_the
is_the_ambition_of
less_than_reducing_the
monarch_of_the_whole
nothing_less_than_reducing
of_blefuscu_into_a
of_destroying_the_big
of_nothing_less_than
of_the_whole_world
people_to_break_the
reducing_the_whole_empire
remain_the_sole_monarch
seemed_to_think_of
smaller_end_of_their
so_unmeasureable_is_the
sole_monarch_of_the
than_reducing_the_whole
that_he_seemed_to
that_people_to_break
the_ambition_of_princes
the_big_endian_exiles
the_smaller_end_of
the_sole_monarch_of
the_whole_empire_of
think_of_nothing_less
to_break_the_smaller
to_think_of_nothing
unmeasureable_is_the_ambition
which_he_would_remain
whole_empire_of_blefuscu
would_remain_the_sole
D:\_KA45F~1\_4>
Enjoy!
Re: Brutally fast 4-gram phrase ripper
Here is "the final" release suitable and decent enough both for Linux & Windows users:
http://www.sanmayce.com/Downloads/Le..._ELFs_EXEs.zip
P.S.
The README.txt from the package above:
This is a short description of 'Leprechaun_[quadrupleton]_r13_7pluses_ELFs_EXEs' package (27 files; 9,882,301 bytes):
Code:
02/19/2011 04:32 AM 994,119 Leprechaun.png
02/19/2011 04:32 AM 181,165 Leprechaun.c
02/19/2011 04:32 AM 4,122 Leprechaun.LOG
02/19/2011 04:32 AM 27 Leprechaun.lst
02/19/2011 04:32 AM 2,615 Leprechaun.wrd
02/19/2011 04:32 AM 83 Leprechaun_COMPILE_Intel.bat
02/19/2011 04:32 AM 73 Leprechaun_COMPILE_Microsoft.bat
02/19/2011 04:32 AM 1,977,802 Leprechaun_Intel.cod
02/19/2011 04:32 AM 2,886,589 Leprechaun_Logo-diz.pdf
02/19/2011 04:32 AM 438,359 Leprechaun_Microsoft.cod
02/19/2011 04:32 AM 183,655 Leprechaun_quadrupleton.c
02/19/2011 04:32 AM 4,132 Leprechaun_quadrupleton.LOG
02/19/2011 04:32 AM 12,544 Leprechaun_quadrupleton.wrd
02/19/2011 04:32 AM 151,049 Leprechaun_quadrupleton_AT_A_GLANCE_cover.txt.pdf
02/19/2011 04:32 AM 109 Leprechaun_quadrupleton_COMPILE_Intel.bat
02/19/2011 04:32 AM 99 Leprechaun_quadrupleton_COMPILE_Microsoft.bat
02/19/2011 04:32 AM 644,409 Leprechaun_quadrupleton_r13_7pluses_generic_32bits.elf
02/19/2011 04:32 AM 514,048 Leprechaun_quadrupleton_r13_7pluses_Intel_IA-32_11.1.exe
02/19/2011 04:32 AM 96,256 Leprechaun_quadrupleton_r13_7pluses_Microsoft_32-bit_16.00.30319.01.exe
02/19/2011 04:32 AM 523,183 Leprechaun_r13_7pluses.pdf
02/19/2011 04:32 AM 642,613 Leprechaun_r13_7pluses_generic_32bits.elf
02/19/2011 04:32 AM 514,048 Leprechaun_r13_7pluses_Intel_IA-32_11.1.exe
02/19/2011 04:32 AM 95,232 Leprechaun_r13_7pluses_Microsoft_32-bit_16.00.30319.01.exe
02/19/2011 04:32 AM 117 Linux_Leprechaun_Complile_Line.script
02/19/2011 04:32 AM 143 Linux_Leprechaun_quadrupleton_Complile_Line.script
02/19/2011 04:32 AM 5,203 The_Little_Match_Girl.txt
02/19/2011 04:32 AM 10,507 _Caution_64bit_is-not-possible-yet_must_be_rewritten.txt
The package contains 32bit (console) executables compiled for Windows & Linux.
It is 100% free open-source copyleft software.
It creates English wordlists (1-gram or 4-gram) for a given filelist (each line is a filename).
Run one of the executables without any parameters to see how to use it.
Leprechaun_[quadrupleton]_r13_7pluses is powered by Jesteress, the fastest string hash function so far (on new architectures, like Core i3).
Examples (in Linux prompt):
./Leprechaun_r13_7pluses_generic_32bits.elf Leprechaun.lst Leprechaun.wrd 7000
./Leprechaun_quadrupleton_r13_7pluses_generic_32bits.elf Leprechaun.lst Leprechaun_quadrupleton.wrd 6000
Then open the text files Leprechaun.LOG and Leprechaun_quadrupleton.LOG to see the logs of the first and second command respectively.
Pluses:
+ written in C;
+ extremely fast;
+ a useful etude for wordlisting.
Minuses:
- the developer being an amateur;
- dirty style, so dirty that 64bit compilation surely fails;
- greedy: allocates far more memory than it actually uses.
Sanmayce
Enjoy!
Re: Brutally fast 4-gram phrase ripper
I see it works. I think you should make your program recognize the string n't as a word. They do it this way in the BYU corpora:
is n't
we 're
Now, your program gives:
isn 't
we 're
I'm not sure if I understand what the program actually does. What does
Quote:
At left side of the word - '[' means no left successor
At left side of the word - ']' means left successor exists
At right side of the word - ']' means no right successor
At right side of the word - '[' means right successor exists
mean? I don't see any ]'s or ['s anywhere... All I can find is the output file with the words/4-grams in it and the log file. Is there anything else to look at?
edit1: Oh, I forgot to add. I liked how the program said, "Can't open file. I've already explained," or something like this.
You could correct those brackets throughout your files. It's already difficult to read computer geek speech and the lack of spaces doesn't help. ;-)
Re: Brutally fast 4-gram phrase ripper
Quote:
Originally Posted by
Sanmayce
Second impression: an unnecessary limitation: 5 posts to be able to share a link, grmbl!
Well, you'd probably see things differently if you had to clear out dozens of spammers every day. ;-)
Re: Brutally fast 4-gram phrase ripper
Quote:
Originally Posted by
Sanmayce
Hi to all English-language-explorers,
I am an amateurish C program-mess-er who is interested mainly in English language console utilities with only one goal in mind: to give statistical info about word/phrase/sentence usage.
...
I haven't played with your tool yet, so no comments, but thanks. Have you heard of SNOBOL (see the Wikipedia article)? (A bit of a dinosaur, but interesting...)
b
Re: Brutally fast 4-gram phrase ripper
Quote:
Originally Posted by
birdeen's call
Now, your program gives:
isn 't
we 're
Yes, it limits functionality but since I couldn't find a way to deal with apostrophes in all cases I decided to remove them altogether. The ripper's action is simple: it parses the incoming files (given via the filelist, the first filename on the command line) and extracts all Latin-letter words 1 to 31 characters long. The forming rule is simple too: a word is a string containing only alphabetic characters, i.e. 'a' to 'z' or 'A' to 'Z'.
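A minimal sketch of the word rule described above, written for illustration only and not taken from the Leprechaun source. It rests on several assumptions: judging by the Gulliver sample output earlier in the thread, a 4-gram never spans a comma, semicolon or full stop, while hyphens and line breaks merely separate words; the sketch extends that to ':', '!' and '?' as a guess. Letter runs longer than 31 characters are simply skipped here, which is also a guess. The names `rip_4grams`, `is_letter` and `is_phrase_break` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORD 31   /* longest word kept, per the rule above */
#define NGRAM    4

/* A word character is 'a'..'z' or 'A'..'Z' and nothing else. */
static int is_letter(int c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* Assumption: these characters end a phrase, so no 4-gram spans them. */
static int is_phrase_break(int c)
{
    return c != '\0' && strchr(",;.:!?", c) != NULL;
}

/* Writes every 4-gram found in text to out, one per line, lowercased and
   joined with underscores. Returns the number of 4-grams written.
   Unlike the real tool, this sketch neither removes duplicates nor sorts. */
static int rip_4grams(const char *text, FILE *out)
{
    char window[NGRAM][MAX_WORD + 1];  /* ring buffer of the last 4 words */
    int in_phrase = 0;                 /* words seen since the last break */
    int written = 0;
    const char *p = text;

    assert(text != NULL && out != NULL);
    while (*p != '\0') {
        if (is_phrase_break((unsigned char)*p)) {
            in_phrase = 0;             /* a new phrase starts here */
            p++;
        } else if (!is_letter((unsigned char)*p)) {
            p++;                       /* space, hyphen, newline, digit ... */
        } else {
            size_t len = 0;
            while (is_letter((unsigned char)p[len]))
                len++;
            if (len <= MAX_WORD) {     /* assumption: longer runs are skipped */
                char *slot = window[in_phrase % NGRAM];
                for (size_t i = 0; i < len; i++)
                    slot[i] = (char)(p[i] | 0x20);   /* ASCII lowercase */
                slot[len] = '\0';
                in_phrase++;
                if (in_phrase >= NGRAM) {
                    /* oldest word of the window sits at in_phrase % NGRAM */
                    for (int i = 0; i < NGRAM; i++)
                        fprintf(out, "%s%c", window[(in_phrase + i) % NGRAM],
                                i < NGRAM - 1 ? '_' : '\n');
                    written++;
                }
            }
            p += len;
        }
    }
    return written;
}
```

Calling `rip_4grams("the Big-endian exiles, and compelling that people to", stdout)` prints `the_big_endian_exiles`, `and_compelling_that_people` and `compelling_that_people_to`, one per line, which agrees with three of the lines in the test2.wrd listing.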
Quote:
Originally Posted by
birdeen's call
I don't see any ]'s or ['s anywhere... All I can find is the output file with the words/4-grams in it and the log file. Is there anything else to look at?
The square brackets show the binary-search-tree succession status: ']' means a child exists for the left node and '[' similarly for the right node. I draw the tallest binary tree just for informative purposes. You don't need Leprechaun.log except for tracking the activity of the executable; you need only the second file from the command line, namely the one holding the extracted distinct words/grams.
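The log earlier in the thread reports "a 3-way hash + 6,602,752 Binary-Search-Trees". One plausible reading, not confirmed anywhere in the thread, is a hash table in which every slot is the root of its own small binary search tree, so that each distinct word or 4-gram is stored once. The sketch below illustrates only that general idea. FNV-1a is used as a stand-in hash (it is not Jesteress), the slot count is tiny, and every name (`wordset`, `wordset_add`, `wordset_free`) is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SLOTS   1024   /* hypothetical; the log mentions 6,602,752 trees */
#define MAX_KEY 127    /* four 31-letter words plus three underscores */

struct node {
    struct node *left, *right;
    char key[MAX_KEY + 1];
};

struct wordset {
    struct node *root[SLOTS];   /* one binary search tree per hash slot */
    size_t distinct;            /* number of distinct keys stored */
};

/* FNV-1a 32-bit, a well-known stand-in; NOT the hash Leprechaun uses. */
static uint32_t hash_fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s != '\0') {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Adds key to the set. Returns 1 if it was new, 0 if already present,
   -1 if memory ran out. Iterative, so no recursion stack is needed. */
static int wordset_add(struct wordset *ws, const char *key)
{
    struct node **link;

    assert(ws != NULL && key != NULL && strlen(key) <= MAX_KEY);
    link = &ws->root[hash_fnv1a(key) % SLOTS];
    while (*link != NULL) {
        int cmp = strcmp(key, (*link)->key);
        if (cmp == 0)
            return 0;                          /* duplicate: nothing to do */
        link = cmp < 0 ? &(*link)->left : &(*link)->right;
    }
    *link = calloc(1, sizeof **link);
    if (*link == NULL)
        return -1;
    strcpy((*link)->key, key);
    ws->distinct++;
    return 1;
}

static void free_tree(struct node *n)
{
    if (n == NULL)
        return;
    free_tree(n->left);
    free_tree(n->right);
    free(n);
}

/* Releases every node; the wordset itself stays usable (and empty). */
static void wordset_free(struct wordset *ws)
{
    for (size_t i = 0; i < SLOTS; i++) {
        free_tree(ws->root[i]);
        ws->root[i] = NULL;
    }
    ws->distinct = 0;
}
```

The hash spreads keys over many short trees, so each lookup compares against only a handful of strings; the `distinct` counter corresponds to the "of them 41 distinct" figure in the log.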
Quote:
Originally Posted by
birdeen's call
You could correct those brackets throughout your files. It's already difficult to read computer geek speech and the lack of spaces doesn't help. ;-)
It was/is not intentional. I don't want to impose my buggy ways; at least I try to make my explanations as useful as possible.
Re: Brutally fast 4-gram phrase ripper
Quote:
Originally Posted by
BobK
Thank you for the link, I have not heard of it, but after reading the article I can say: it has nothing to do with real world tasks, nowayears the well-applied algorithms could smash all old approaches. There are languages like Python designed for such tasks, but I am kind of orthodoxal amateur my language is and will be C, I see myself migrating to native 64bit Linux and 64bit C code after a couple of years.
Regards
Re: Brutally fast 4-gram phrase ripper
Quote:
Originally Posted by
Sanmayce
Yes, it limits functionality but since I couldn't find a way to deal with apostrophes in all cases I decided to remove them altogether.
A very simple and bad (not tragic though) solution is to add a couple of lines that would change every
...n
't
to
...
n't
in the output file after it's ready, but it will work only in the 1-gram program for obvious reasons. It's still O(n), which won't change the general complexity.
(I'm not a programmer. Please forgive me if I'm talking nonsense.)
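The post-processing pass suggested above could look roughly like the following sketch. It assumes something the thread does not guarantee: that the tokens are available in text order and that the apostrophe is still attached to the 't token. The function name `reattach_nt` and the fixed `TOKEN_SIZE` buffers are hypothetical. It makes a single pass over the tokens, hence O(n).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOKEN_SIZE 32   /* hypothetical fixed buffer per token */

/* Turns each adjacent pair "...n", "'t" into "...", "n't" in place,
   e.g. "isn", "'t" becomes "is", "n't". Returns the number of pairs fixed. */
static size_t reattach_nt(char tokens[][TOKEN_SIZE], size_t count)
{
    size_t fixed = 0;

    assert(tokens != NULL || count == 0);
    for (size_t i = 0; i + 1 < count; i++) {
        size_t len = strlen(tokens[i]);
        /* len >= 2 keeps a lone "n" from becoming an empty token */
        if (len >= 2 && tokens[i][len - 1] == 'n' &&
            strcmp(tokens[i + 1], "'t") == 0) {
            tokens[i][len - 1] = '\0';       /* drop the trailing 'n' */
            strcpy(tokens[i + 1], "n't");    /* and move it to the clitic */
            fixed++;
        }
    }
    return fixed;
}
```

On a sorted list of distinct words the two halves are no longer neighbours, so a pass like this would have to run before deduplication and sorting.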
Re: Brutally fast 4-gram phrase ripper
Quote:
Originally Posted by
Sanmayce
Thank you for the link, I have not heard of it, but after reading the article I can say: it has nothing to do with real world tasks, nowayears the well-applied algorithms could smash all old approaches. There are languages like Python designed for such tasks, but I am kind of orthodoxal amateur my language is and will be C, I see myself migrating to native 64bit Linux and 64bit C code after a couple of years.
Regards
:up: ;-)
(Was 'nowayears' a joke? If so, it works. But 'nowadays' doesn't normally have a *365 analogue. 'Nowadays', like 'these days', means 'at/in this time/era....'. The 'day' in 'present-day', in the same way, doesn't mean '24 hours'.)
b
Re: Brutally fast 4-gram phrase ripper
Quote:
Originally Posted by
birdeen's call
A very simple and bad (not tragic though) solution is to add a couple of lines that would change every
...n
't
to
...
n't
in the output file after it's ready, but it will work only in the 1-gram program for obvious reasons.
Speaking of what a small utility must do (and what it must not), after much trial and error I realized the need for well-defined actions: nothing must be left half-done, and simplicity along with speed must be among the highest priorities.
So in its simplicity I reckon that Leprechaun should not be altered: it does exactly what it is expected to do as a first parsing pass, and from that point on another tool must take over. Otherwise the need arises to handle not only apostrophes and hyphens but additional alphabets too ... and clarity vanishes. In other words, the skeleton (the superfast hash reinforced by unrolled b-trees with a simulated stack) is the thrilling thing (an important base/etude, not for further development but rather for tuning and rewriting as 64bit code); the rest interests me not.
In case you have not sensed what my obsession is, here is my diagnosis: a maniacal fondness for high-speed text processing. As I have said elsewhere, speed is beauty.
One reason to abandon extracting words containing apostrophes was the existence of shortened forms like 'cause, 'twas, 'tis, 'tween, 'twere for 'because', 'it was', 'it is', 'between', 'it were' respectively.
Many years ago I wrote two very slow 16-bit console utilities which might be useful to you; they work as a pair and rip distinct English words (with hyphens and apostrophes) from a given file:
Example:
Code:
D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>dir
Volume in drive D is H320_Vol5
Volume Serial Number is 0CB3-C881
Directory of D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1
02/22/2011 06:05 AM .
02/22/2011 06:05 AM ..
02/22/2011 03:37 AM 6 masakari.ss
02/22/2011 03:23 AM 2,659 RIP_EWRD.BAS
02/22/2011 03:23 AM 44,232 RIP_EWRD.EXE
09/19/1997 12:00 AM 55,972 SAKURA.EXE
10/23/2007 05:26 PM 146,248 SAKURA8.ZIP
02/12/2011 09:11 PM 1,385,282 TSZ.txt
6 File(s) 1,634,399 bytes
2 Dir(s) 2,650,181,632 bytes free
D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>RIP_EWRD.EXE TSZ.txt
RIP_EWRD.EXE
NumberOfWords&: 90958
SAKURA.EXE, revision 008, written by Svalqyatchx 'Kaze'.
Revision note: Virtual_Memory_Simulated_Stack, if overflow_error then HALT.
Caution: Very(pivot is chosen from first 20 elements) slow version.
Searching for MASAKARI.SS and MASAKARI.SD ...
Creating MASAKARI.SWP ...
Allocated HDD memory: 320MB.
Room for 33,554432 elements; Maximum(2GB-1) for 214,748364 elements.
Input file: TSZ.UW
Output file: TSZ.SW
Making SAKURA.QSS(10bytes per element(6-entry,4-stack)) at HDD memory ...
Current sort options: CASE_UNSENSITIVE /START= 1 /LENGTH= 26
Sorting in two passes 90958 elements(longest 26), needed 889KB ...
Bubble-sorting possible pivots ...
Sorting pass#1(splitting) countdown(Right&), StackPtr: 000000001, 000000034 ...
Sorting pass#2 countdown(Quantity&-Right&), StackPtr: 000000000, 000000000 ...
Stack_Nested_Levels i.e. StackPtrMAX& / 2 = 21
TotalReadData# = 12,011397 bytes.
Looking in SAKURA.QSS and writing TSZ.SW ...
Time: 06:18:12 / 06:18:53 i.e. 40.69922seconds.
SAKURA: Done(performance: 288KB/s).
Creating TSZ.WRD ...
D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>dir
Volume in drive D is H320_Vol5
Volume Serial Number is 0CB3-C881
Directory of D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1
02/22/2011 06:18 AM .
02/22/2011 06:18 AM ..
02/22/2011 03:37 AM 6 masakari.ss
02/22/2011 06:18 AM 335,544,320 MASAKARI.SWP
02/22/2011 03:23 AM 2,659 RIP_EWRD.BAS
02/22/2011 03:23 AM 44,232 RIP_EWRD.EXE
09/19/1997 12:00 AM 55,972 SAKURA.EXE
10/23/2007 05:26 PM 146,248 SAKURA8.ZIP
02/22/2011 06:18 AM 578,855 TSZ.SW
02/12/2011 09:11 PM 1,385,282 TSZ.txt
02/22/2011 06:18 AM 578,855 TSZ.UW
02/22/2011 06:18 AM 76,752 TSZ.WRD
10 File(s) 338,413,181 bytes
2 Dir(s) 2,313,396,224 bytes free
D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>type TSZ.WRD
a
a-sniffing
a-weary
abandoned
abash
abashed
abet
abide
ability
ability-to-stand
...
all-too-gentle
all-too-great
all-too-human
all-too-patient
all-too-poor
all-too-similar
all-too-small
...
bid'th
...
can't
...
day's-work
day-journeys
...
doubt
doubt'th
...
e'er
each
...
i'm
i've
...
it's
...
look
look'st
looked
lookedst
looketh
looking
looking-back
looks
...
lov'th
lovable
love
love's
love-glances
...
mean
mean'th
meaneth
meaning
means
meant
...
naysayer
ne'er
ne'er-do-ills
ne'er-do-wells
near
nearer
...
o'er
o'erflowing
o'erhangeth
o'erhearst
o'erhung
o'erleap
o'ershadowed
o'erspan
o'erswelled
o'erthrowers
o'erthrowing
o'erthrown
...
they've
...
world's
world-blessing
world-loving
...
would
would'st
...
y-e-a
yawn
...
zarathustra
zarathustra's
zarathustra-kingdom
D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>
The Dummy_DOS_ripper.zip (877KB) is here.