Brutally fast 4-gram phrase ripper


Sanmayce

Junior Member
Joined
Jan 25, 2011
Member Type
Student or Learner
Native Language
Bulgarian
Home Country
Bulgaria
Current Location
Bulgaria
Hi to all English-language-explorers,

I am an amateurish C program-mess-er who is interested mainly in English-language console utilities, with only one goal in mind: to give statistical info about word/phrase/sentence usage.

First impression: a nice forum.
My wish is to share here my attempts/console-tools for English sidekick-ing.

Second impression: an unnecessary limitation: 5 posts to be able to share a link, grmbl!

Leprechaun_r13_7pluses_quadrupleton_r1_EXEs.zip

Leprechaun_r13_7pluses_quadrupleton_r1_AT_A_GLANCE.pdf

For example:
Code:
D:\_KA45F~1\_4>dir

12/12/2010  01:37 PM     1,111,609,996 googlebooks-eng-us-all-4gram-20090715-0.csv
01/26/2011  06:46 PM               315 googlebooks-eng-us-all-4gram-20090715-0.csv.EXCERPT
01/26/2011  06:56 PM               362 Gulliver's-Travels.pdf.txt.EXCERPT
01/26/2011  06:47 PM             4,108 Leprechaun.LOG
01/26/2011  05:13 AM           514,048 Leprechaun_quadrupleton_Intel_IA-32_11.1.exe
01/26/2011  06:47 PM                53 test.lst
01/26/2011  06:47 PM                14 test.wrd

D:\_KA45F~1\_4>dir Gulliver*.excerpt/b>test2.lst

D:\_KA45F~1\_4>type "Gulliver's-Travels.pdf.txt.EXCERPT"
...
And so unmeasureable is the ambition of princes, that he
seemed to think of nothing less than reducing the whole
empire of Blefuscu into a province, and governing it, by
a viceroy; of destroying the Big-endian exiles, and compelling
that people to break the smaller end of their eggs,
by which he would remain the sole monarch of the whole
world.
...
D:\_KA45F~1\_4>Leprechaun_quadrupleton_Intel_IA-32_11.1.exe test2.lst test2.wrd
Leprechaun(Fast Greedy Word-Ripper), rev. 13_7pluses quadrupleton_r1, written by Svalqyatchx.
Leprechaun: 'Oh, well, didn't you hear? Bigger is good, but jumbo is dear.'
Kaze: Let's see what a 3-way hash + 6,602,752 Binary-Search-Trees can give us,
      also the performance of a 3-way hash + 6,602,752 B-Trees of order 3.
Size of input file with files for Leprechauning: 36
Allocating memory 424MB ... OK
Size of Input TEXTual file: 362
|; Word count: 62 of them 41 distinct; Done: 64/64
Bytes per second performance: 362B/s
Words per second performance: 62W/s
Flushing unsorted words ...
Time for making unsorted wordlist: 1 second(s)
Deallocated memory in MB: 424
Allocated memory for words in MB: 1
Allocated memory for pointers-to-words in MB: 1
Sorting(with 'MultiKeyQuickSortX26Sort' by J. Bentley and R. Sedgewick) ...
Sort pass 26/26 ...
Flushing sorted words ...
Time for sorting unsorted wordlist: 1 second(s)
Leprechaun: Done.

D:\_KA45F~1\_4>type test2.wrd
and_compelling_that_people
and_so_unmeasureable_is
blefuscu_into_a_province
break_the_smaller_end
by_which_he_would
compelling_that_people_to
destroying_the_big_endian
empire_of_blefuscu_into
end_of_their_eggs
he_seemed_to_think
he_would_remain_the
is_the_ambition_of
less_than_reducing_the
monarch_of_the_whole
nothing_less_than_reducing
of_blefuscu_into_a
of_destroying_the_big
of_nothing_less_than
of_the_whole_world
people_to_break_the
reducing_the_whole_empire
remain_the_sole_monarch
seemed_to_think_of
smaller_end_of_their
so_unmeasureable_is_the
sole_monarch_of_the
than_reducing_the_whole
that_he_seemed_to
that_people_to_break
the_ambition_of_princes
the_big_endian_exiles
the_smaller_end_of
the_sole_monarch_of
the_whole_empire_of
think_of_nothing_less
to_break_the_smaller
to_think_of_nothing
unmeasureable_is_the_ambition
which_he_would_remain
whole_empire_of_blefuscu
would_remain_the_sole

D:\_KA45F~1\_4>
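For readers curious about the mechanics, below is a minimal sketch of how such underscore-joined 4grams can be formed: keep a sliding window of the last four lowercased alphabetic words and print it each time a new word arrives. This is only an illustration under stated assumptions, not Leprechaun's actual code (which hashes into millions of binary search trees and also deduplicates and sorts the result; both steps are omitted here).
Code:
/* Illustrative 4gram sketch, NOT Leprechaun's code: reads a text file,
   lowercases alphabetic words, and prints every run of four consecutive
   words joined with underscores. Deduplication and sorting are omitted;
   words longer than 31 characters are simply truncated here. */
#include <stdio.h>
#include <ctype.h>
#include <string.h>

int main(int argc, char **argv)
{
    FILE *f;
    char w[4][32];                  /* sliding window of four words */
    char cur[32];
    int len = 0, filled = 0, c;

    if (argc < 2 || !(f = fopen(argv[1], "rb"))) return 1;
    for (;;) {
        c = fgetc(f);
        if (c != EOF && isalpha(c)) {
            if (len < 31) cur[len++] = (char)tolower(c);
        } else if (len > 0) {       /* a word just ended */
            cur[len] = '\0';
            len = 0;
            memmove(w[0], w[1], sizeof w[0] * 3);   /* shift window left */
            strcpy(w[3], cur);
            if (filled < 4) filled++;
            if (filled == 4)
                printf("%s_%s_%s_%s\n", w[0], w[1], w[2], w[3]);
        }
        if (c == EOF) break;
    }
    fclose(f);
    return 0;
}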
Enjoy!
 

Sanmayce

Here is "the final" release suitable and decent enough both for Linux & Windows users:

http://www.sanmayce.com/Downloads/Leprechaun_[quadrupleton]_r13_7pluses_ELFs_EXEs.zip

P.S.
The README.txt from the package above:
This is a short description of 'Leprechaun_[quadrupleton]_r13_7pluses_ELFs_EXEs' package (27 files; 9,882,301 bytes):

Code:
02/19/2011  04:32 AM           994,119 Leprechaun.png
02/19/2011  04:32 AM           181,165 Leprechaun.c
02/19/2011  04:32 AM             4,122 Leprechaun.LOG
02/19/2011  04:32 AM                27 Leprechaun.lst
02/19/2011  04:32 AM             2,615 Leprechaun.wrd
02/19/2011  04:32 AM                83 Leprechaun_COMPILE_Intel.bat
02/19/2011  04:32 AM                73 Leprechaun_COMPILE_Microsoft.bat
02/19/2011  04:32 AM         1,977,802 Leprechaun_Intel.cod
02/19/2011  04:32 AM         2,886,589 Leprechaun_Logo-diz.pdf
02/19/2011  04:32 AM           438,359 Leprechaun_Microsoft.cod
02/19/2011  04:32 AM           183,655 Leprechaun_quadrupleton.c
02/19/2011  04:32 AM             4,132 Leprechaun_quadrupleton.LOG
02/19/2011  04:32 AM            12,544 Leprechaun_quadrupleton.wrd
02/19/2011  04:32 AM           151,049 Leprechaun_quadrupleton_AT_A_GLANCE_cover.txt.pdf
02/19/2011  04:32 AM               109 Leprechaun_quadrupleton_COMPILE_Intel.bat
02/19/2011  04:32 AM                99 Leprechaun_quadrupleton_COMPILE_Microsoft.bat
02/19/2011  04:32 AM           644,409 Leprechaun_quadrupleton_r13_7pluses_generic_32bits.elf
02/19/2011  04:32 AM           514,048 Leprechaun_quadrupleton_r13_7pluses_Intel_IA-32_11.1.exe
02/19/2011  04:32 AM            96,256 Leprechaun_quadrupleton_r13_7pluses_Microsoft_32-bit_16.00.30319.01.exe
02/19/2011  04:32 AM           523,183 Leprechaun_r13_7pluses.pdf
02/19/2011  04:32 AM           642,613 Leprechaun_r13_7pluses_generic_32bits.elf
02/19/2011  04:32 AM           514,048 Leprechaun_r13_7pluses_Intel_IA-32_11.1.exe
02/19/2011  04:32 AM            95,232 Leprechaun_r13_7pluses_Microsoft_32-bit_16.00.30319.01.exe
02/19/2011  04:32 AM               117 Linux_Leprechaun_Complile_Line.script
02/19/2011  04:32 AM               143 Linux_Leprechaun_quadrupleton_Complile_Line.script
02/19/2011  04:32 AM             5,203 The_Little_Match_Girl.txt
02/19/2011  04:32 AM            10,507 _Caution_64bit_is-not-possible-yet_must_be_rewritten.txt
The package contains 32bit (console) executables compiled for Windows & Linux.
It is 100% free open-source copyleft software.
It creates English wordlists (1gram or 4gram) for a given filelist (each line is a filename).
Run one of the executables without any parameters to see how to use it.
Leprechaun_[quadrupleton]_r13_7pluses is powered by Jesteress, so far the fastest string hash function on new architectures such as Core i3.

Examples (in Linux prompt):
./Leprechaun_r13_7pluses_generic_32bits.elf Leprechaun.lst Leprechaun.wrd 7000
./Leprechaun_quadrupleton_r13_7pluses_generic_32bits.elf Leprechaun.lst Leprechaun_quadrupleton.wrd 6000

Open the text files Leprechaun.LOG and Leprechaun_quadrupleton.LOG for the output of the first and second command lines, respectively.

Pluses:
+ written in C;
+ extremely fast;
+ a useful etude for wordlisting.
Minuses:
- the developer being an amateur;
- dirty style, so dirty that 64bit compilation surely fails;
- greedy: low memory utilization.
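A side note on the hash mentioned above: as far as I know, Jesteress is an FNV-1a-style function by the same author. The snippet below is only the textbook FNV-1a loop, shown to illustrate the kind of lightweight string hashing involved; it is not Jesteress itself, whose internals and constants differ.
Code:
/* Plain FNV-1a (32-bit), shown only to illustrate the family of hashes
   Jesteress belongs to; this is NOT the Jesteress function. */
#include <stdint.h>
#include <stddef.h>

uint32_t fnv1a_32(const char *key, size_t len)
{
    uint32_t hash = 2166136261u;          /* FNV offset basis */
    size_t i;
    for (i = 0; i < len; i++) {
        hash ^= (unsigned char)key[i];
        hash *= 16777619u;                /* FNV prime */
    }
    return hash;
}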

Sanmayce
Enjoy!
 

birdeen's call

VIP Member
Joined
Jul 15, 2010
Member Type
Student or Learner
Native Language
Polish
Home Country
Poland
Current Location
Poland
I see it works. I think you should make your program recognize the string n't as a word. They do it this way in the BYU corpora:

is n't
we 're

Now, your program gives:

isn 't
we 're

I'm not sure if I understand what the program actually does. What does
At left side of the word - '[' means no left successor
At left side of the word - ']' means left successor exists
At right side of the word - ']' means no right successor
At right side of the word - '[' means right successor exists
mean? I don't see any ]'s or ['s anywhere... All I can find is the output file with the words/4-grams in it and the log file. Is there anything else to look at?

edit1: Oh, I forgot to add. I liked how the program said, "Can't open file. I've already explained," or something like this.

You could correct those brackets throughout your files. It's already difficult to read computer geek speech and the lack of spaces doesn't help. ;-)
 

Tdol

No Longer With Us (RIP)
Staff member
Joined
Nov 13, 2002
Native Language
British English
Home Country
UK
Current Location
Japan
Second impression: an unnecessary limitation: 5 posts to be able to share a link, grmbl!

Well, you'd probably see things differently if you had to clear out dozens of spammers every day. ;-)
 

BobK

Moderator
Staff member
Joined
Jul 29, 2006
Location
Spencers Wood, near Reading, UK
Member Type
Retired English Teacher
Native Language
English
Home Country
UK
Current Location
UK
Hi to all English-language-explorers,

I am an amateurish C program-mess-er who is interested mainly in English-language console utilities, with only one goal in mind: to give statistical info about word/phrase/sentence usage.
...

I haven't played with your tool yet, so no comments - but thanks. Have you heard of SNOBOL - Wikipedia, the free encyclopedia? (A bit of a dinosaur, but interesting...)

b
 

Sanmayce

Now, your program gives:

isn 't
we 're
Yes, it limits functionality, but since I couldn't find a way to deal with apostrophes in all cases, I decided to remove them altogether. The ripper's action is simple: it parses the incoming files (given via the filelist - the first filename on the command line) and extracts all Latin-letter words of 1 to 31 characters in length. The forming rule is simple too: a word is a string containing only alphabetic characters, i.e. 'a' to 'z' or 'A' to 'Z'.

I don't see any ]'s or ['s anywhere... All I can find is the output file with the words/4-grams in it and the log file. Is there anything else to look at?
The square brackets stand for binary-search-tree leaf succession status; that is, ']' means a child exists for the left node and '[' likewise for the right node. I draw the highest binary tree just for informative purposes. You don't need Leprechaun.LOG except for tracking the activity of the executable; you need only the second file from the command line, namely the extracted distinct words/grams.

You could correct those brackets throughout your files. It's already difficult to read computer geek speech and the lack of spaces doesn't help. ;-)
It was not intentional; I don't want to impose my buggy ways, and at least I try to make my explanations as useful as possible.
 

Sanmayce

Have you heard of SNOBOL - Wikipedia, the free encyclopedia ? (A bit of a dinosaur, but interesting...)
Thank you for the link. I had not heard of it, but after reading the article I can say: it has nothing to do with real-world tasks; nowayears the well-applied algorithms could smash all old approaches. There are languages like Python designed for such tasks, but I am kind of an orthodox amateur: my language is and will be C. I see myself migrating to native 64bit Linux and 64bit C code in a couple of years.

Regards
 

birdeen's call

Yes, it limits functionality but since I couldn't find a way to deal with apostrophes in all cases I decided to remove them altogether.
A very simple and bad (not tragic though) solution is to add a couple of lines that would change every

...n
't


to

...
n't

in the output file after it's ready, but it will work only in the 1-gram program for obvious reasons. It's still O(n), which won't change the general complexity.

(I'm not a programmer. Please forgive me if I'm talking nonsense.)
 

BobK

Thank you for the link. I had not heard of it, but after reading the article I can say: it has nothing to do with real-world tasks; nowayears the well-applied algorithms could smash all old approaches. There are languages like Python designed for such tasks, but I am kind of an orthodox amateur: my language is and will be C. I see myself migrating to native 64bit Linux and 64bit C code in a couple of years.

Regards

:up: ;-)

(Was 'nowayears' a joke? If so, it works. But 'nowadays' doesn't normally have a *365 analogue. 'Nowadays', like 'these days', means 'at/in this time/era....'. The 'day' in 'present-day', in the same way, doesn't mean '24 hours'.)

b
 

Sanmayce

A very simple and bad (not tragic though) solution is to add a couple of lines that would change every

...n
't


to

...
n't

in the output file after it's ready, but it will work only in the 1-gram program for obvious reasons.

Speaking of what a small utility must do (and what it must not do), after many trials and errors I realized the need for well-defined actions; that is, nothing must be left half-done, and simplicity along with speed must be among the highest priorities.
So, in its simplicity, I reckon that Leprechaun should not be altered: it does exactly what it is expected to do as a first pass of parsing; from that point on another tool must take over. There arises the need not only for apostrophes and hyphens but for additional alphabets ... and clarity vanishes. In other words, the skeleton (the superfast hash reinforced by unrolled B-trees with a simulated stack) is the thrilling thing (an important base/etude not for further development but rather for tuning and rewriting as 64bit code); the rest interests me not.
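To make that skeleton a little more concrete, here is a rough sketch of the general shape the program banner hints at ('3-way hash + ... Binary-Search-Trees'): hash a word into one of many buckets, each bucket rooting a small binary search tree of distinct words. This is an assumption-laden illustration only; Leprechaun itself uses unrolled B-trees with a simulated stack and its own hash function.
Code:
/* Illustration only, not Leprechaun's source: distinct-word accumulation
   via a hash table of small binary search trees. Assumes words are at
   most 31 characters, per the forming rule described earlier. */
#include <stdlib.h>
#include <string.h>

#define NBUCKETS (1u << 20)

typedef struct Node {
    struct Node *left, *right;
    char word[32];
} Node;

static Node *bucket[NBUCKETS];

static unsigned hash_word(const char *w)
{
    unsigned h = 2166136261u;               /* FNV-1a style, for the sketch */
    while (*w) { h ^= (unsigned char)*w++; h *= 16777619u; }
    return h & (NBUCKETS - 1);
}

/* Insert a word if not already present; returns 1 if it was new, 0 if seen. */
int insert_distinct(const char *w)
{
    Node **p = &bucket[hash_word(w)];
    while (*p) {
        int cmp = strcmp(w, (*p)->word);
        if (cmp == 0) return 0;              /* duplicate word */
        p = (cmp < 0) ? &(*p)->left : &(*p)->right;
    }
    *p = calloc(1, sizeof(Node));
    if (!*p) return 0;                       /* out of memory */
    strncpy((*p)->word, w, 31);
    return 1;
}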

In case you cannot sense what my obsession is, here is my diagnosis: a maniacal fondness for high-speed text processing. As I have said elsewhere, speed is beauty.
One reason to abandon extracting words containing apostrophes was the existence of shortened forms like 'cause, 'twas, 'tis, 'tween, 'twere for 'because', 'it was', 'it is', 'between', 'it were' respectively.
Many years ago I wrote 2 very slow 16bit console utilities which might be useful to you; they work as a duo and rip distinct English words (with hyphens and apostrophes) from a given file:

Example:
Code:
D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>dir
 Volume in drive D is H320_Vol5
 Volume Serial Number is 0CB3-C881

 Directory of D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1

02/22/2011  06:05 AM    <DIR>          .
02/22/2011  06:05 AM    <DIR>          ..
02/22/2011  03:37 AM                 6 masakari.ss
02/22/2011  03:23 AM             2,659 RIP_EWRD.BAS
02/22/2011  03:23 AM            44,232 RIP_EWRD.EXE
09/19/1997  12:00 AM            55,972 SAKURA.EXE
10/23/2007  05:26 PM           146,248 SAKURA8.ZIP
02/12/2011  09:11 PM         1,385,282 TSZ.txt
               6 File(s)      1,634,399 bytes
               2 Dir(s)   2,650,181,632 bytes free

D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>RIP_EWRD.EXE TSZ.txt
RIP_EWRD.EXE
NumberOfWords&: 90958
SAKURA.EXE, revision 008, written by Svalqyatchx 'Kaze'.
Revision note: Virtual_Memory_Simulated_Stack, if overflow_error then HALT.
Caution: Very(pivot is chosen from first 20 elements) slow version.
Searching for MASAKARI.SS and MASAKARI.SD ...
Creating MASAKARI.SWP ...
Allocated HDD memory: 320MB.
Room for 33,554432 elements; Maximum(2GB-1) for 214,748364 elements.
Input file: TSZ.UW
Output file: TSZ.SW
Making SAKURA.QSS(10bytes per element(6-entry,4-stack)) at HDD memory ...
Current sort options: CASE_UNSENSITIVE /START= 1 /LENGTH= 26
Sorting in two passes 90958 elements(longest 26), needed 889KB ...
Bubble-sorting possible pivots ...
Sorting pass#1(splitting) countdown(Right&), StackPtr: 000000001, 000000034 ...
Sorting pass#2 countdown(Quantity&-Right&), StackPtr: 000000000, 000000000 ...
Stack_Nested_Levels i.e. StackPtrMAX& / 2 = 21
TotalReadData# = 12,011397 bytes.
Looking in SAKURA.QSS and writing TSZ.SW ...
Time: 06:18:12 / 06:18:53 i.e. 40.69922seconds.
SAKURA: Done(performance: 288KB/s).
Creating TSZ.WRD ...

D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>dir
 Volume in drive D is H320_Vol5
 Volume Serial Number is 0CB3-C881

 Directory of D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1

02/22/2011  06:18 AM    <DIR>          .
02/22/2011  06:18 AM    <DIR>          ..
02/22/2011  03:37 AM                 6 masakari.ss
02/22/2011  06:18 AM       335,544,320 MASAKARI.SWP
02/22/2011  03:23 AM             2,659 RIP_EWRD.BAS
02/22/2011  03:23 AM            44,232 RIP_EWRD.EXE
09/19/1997  12:00 AM            55,972 SAKURA.EXE
10/23/2007  05:26 PM           146,248 SAKURA8.ZIP
02/22/2011  06:18 AM           578,855 TSZ.SW
02/12/2011  09:11 PM         1,385,282 TSZ.txt
02/22/2011  06:18 AM           578,855 TSZ.UW
02/22/2011  06:18 AM            76,752 TSZ.WRD
              10 File(s)    338,413,181 bytes
               2 Dir(s)   2,313,396,224 bytes free

D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>type TSZ.WRD
a
a-sniffing
a-weary
abandoned
abash
abashed
abet
abide
ability
ability-to-stand
...
all-too-gentle
all-too-great
all-too-human
all-too-patient
all-too-poor
all-too-similar
all-too-small
...
bid'th
...
can't
...
day's-work
day-journeys
...
doubt
doubt'th
...
e'er
each
...
i'm
i've
...
it's
...
look
look'st
looked
lookedst
looketh
looking
looking-back
looks
...
lov'th
lovable
love
love's
love-glances
...
mean
mean'th
meaneth
meaning
means
meant
...
naysayer
ne'er
ne'er-do-ills
ne'er-do-wells
near
nearer
...
o'er
o'erflowing
o'erhangeth
o'erhearst
o'erhung
o'erleap
o'ershadowed
o'erspan
o'erswelled
o'erthrowers
o'erthrowing
o'erthrown
...
they've
...
world's
world-blessing
world-loving
...
would
would'st
...
y-e-a
yawn
...
zarathustra
zarathustra's
zarathustra-kingdom

D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>
The Dummy_DOS_ripper.zip 877KB is here.
 

Sanmayce

Was 'nowayears' a joke?

He-he, it is all about playing with words, as you have already noticed. The charming part of broken English is exactly the lack of any bias and, why not, a disregard of RULES.

Thanks for pointing it out.
 

Sanmayce

Can you list the common vocabulary (1grams/words as well as 4grams/phrases) of the Hercule Poirot and Sherlock Holmes stories?

The awfully slow revision of 'Compare_Two_Wordlists_r1+.exe' is now replaced by 'Overlapper-Blender_r1.exe', which is fast enough.
This allows (in the next Graphein r.1++) large texts to be checked without those disgusting delays.
I have an old (even better) idea to make a variant of Leprechaun which would solve all these mini-pushups once and for all; for now, Overlapper-Blender is not a bad utility.
Overlapper-Blender works with any type of strings, in particular 1grams and 4grams.
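Judging from the log lines further down ('Sorting N Pointers ... Deduplicating duplicates'), the underlying idea appears to be sort-and-scan. The sketch below is a hypothetical re-implementation of that idea, not the actual source: load the lines of both wordlists, sort pointers to them, then any line equal to its predecessor is 'overlapped' while the deduplicated union is 'blended'. It assumes each input wordlist already contains distinct lines, as the .wrd files do.
Code:
/* Hypothetical sketch of the sort-and-scan idea, not Overlapper-Blender's
   source. lines[] holds pointers to the n lines of file 1 followed by the
   lines of file 2; each file is assumed to contain distinct lines. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int cmp(const void *a, const void *b)
{
    return strcmp(*(char *const *)a, *(char *const *)b);
}

void overlap_blend(char **lines, size_t n, FILE *blended, FILE *overlapped)
{
    size_t i;
    qsort(lines, n, sizeof *lines, cmp);
    for (i = 0; i < n; i++) {
        if (i == 0 || strcmp(lines[i], lines[i - 1]) != 0)
            fprintf(blended, "%s\n", lines[i]);      /* deduplicated union    */
        else
            fprintf(overlapped, "%s\n", lines[i]);   /* present in both files */
    }
}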

One important aspect of making cross-references is to explore the usage of the most frequent phrases (4grams) across several corpora of such 4grams.
Here I will show how to generate a single text file (each line is a 4gram) containing the common 4grams of four given corpora (the Agatha Christie collection, the Sherlock Holmes collection, the Sunnah and Hadith and Qur'an collection, and four versions of The Holy Bible).
In short:
The log below contains the steps needed to create the 'Agatha+Sherlock+Islam+Bible_Overlapped.txt' file.
Some stats:
Clashing the 'Agatha Christie' corpus (2,615,513 4grams) into the 'Sherlock Holmes' corpus (1,233,227 4grams) gives 102,201 overlapped/common 4grams.
Clashing the 'Sunnah and Hadith and Qur'an' corpus (936,195 4grams) into 'The Holy Bible' corpus (795,822 4grams) gives 29,038 overlapped/common 4grams.
And finally, the clash between those 102,201 and 29,038 results in only 5,940 4grams.
All these four-word phrases constitute an important part of the common phraseology of the four major sources used here.
An excerpt (with all the to_be's) from this list of 5,940 4grams:
...
to_be_a_great
to_be_cut_off
to_be_deprived_of
to_be_done_by
to_be_done_with
to_be_found_in
to_be_free_from
to_be_given_to
to_be_his_wife
to_be_in_the
to_be_kept_in
to_be_left_behind
to_be_married_to
to_be_more_than
to_be_of_service
to_be_on_the
to_be_one_of
to_be_regarded_as
to_be_seen_by
to_be_seen_of
to_be_taken_from
to_be_taken_to
to_be_the_most
to_be_under_the
to_be_used_for
to_be_used_in
to_be_with_the
to_be_with_you
...


This exercise reveals how many (102,201) four-word phrases are common to Agatha Christie's style and Conan Doyle's style.
It also shows the recipe for creating your own high-quality corpus of 1/4 grams.
That is how I created my HQ wordlist of 1grams - as far as I remember, from 13 low-quality and unknown-quality spell-checker wordlists.
Soon I will create my own 4gram wordlist by clashing various not-so-small corpora; the way of mixing them reminds me of Barry White's superhit 'Put Me in Your Mix'.

Code:
D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>dir

03/01/2011  07:25 AM                30 1.txt
03/01/2011  07:25 AM                14 2.txt
03/01/2011  07:25 AM            41,768 Overlapper-Blender_r1.c
03/01/2011  07:25 AM            64,000 Overlapper-Blender_r1.exe
03/01/2011  07:25 AM        57,389,250 _Agatha Christie_Texts.txt
03/01/2011  07:25 AM        27,024,497 _Sherlock Holmes_Texts.txt
03/01/2011  07:25 AM        20,326,151 _Sunnah and Hadith and Qur'an.txt
03/01/2011  07:25 AM        17,183,313 _The_Holy_Bible_4-versions.txt

D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>type 1.txt
a_bad_day_of
a_bad_day_when

D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>type 2.txt
a_bad_day_of

D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>Overlapper-Blender_r1.exe 1.txt 2.txt
Overlapper-Blender r.1, mix of Compare_Two_Wordlists, revision 1+ and Building-Blocks_DUMPER rev.1, written by Kaze.
Usage: Overlapper-Blender wordlistfile1 wordlistfile2
Note1: wordlistfile1's lines encountered in wordlistfile2's lines go to 'Overlapped.txt' file.
Note2: wordlistfile1's lines blended (no repetitions allowed) with wordlistfile2's lines go to 'Blended.txt' file.
Size of 1st input file: 30
Size of 2nd input file: 14
Allocating 512MB ...
Lines in 1st input file: 2
Lines in 2nd input file: 1
Allocated memory for pointers-to-words in MB: 1
Sorting 3 Pointers ...
Deduplicating duplicates and dumping all into 'Blended.txt' ...
Dumping deduplicated duplicates into 'Overlapped.txt' ...
Blended lines: 2
Overlapped lines: 1

D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>dir

03/01/2011  07:25 AM                30 1.txt
03/01/2011  07:25 AM                14 2.txt
03/01/2011  07:26 AM                30 Blended.txt
03/01/2011  07:26 AM                14 Overlapped.txt
03/01/2011  07:25 AM            41,768 Overlapper-Blender_r1.c
03/01/2011  07:25 AM            64,000 Overlapper-Blender_r1.exe
03/01/2011  07:25 AM        57,389,250 _Agatha Christie_Texts.txt
03/01/2011  07:25 AM        27,024,497 _Sherlock Holmes_Texts.txt
03/01/2011  07:25 AM        20,326,151 _Sunnah and Hadith and Qur'an.txt
03/01/2011  07:25 AM        17,183,313 _The_Holy_Bible_4-versions.txt

D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>type Blended.txt
a_bad_day_of
a_bad_day_when

D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>type Overlapped.txt
a_bad_day_of

D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>Overlapper-Blender_r1.exe "_Agatha Christie_Texts.txt" "_Sherlock Holmes_Texts.txt"
Overlapper-Blender r.1, mix of Compare_Two_Wordlists, revision 1+ and Building-Blocks_DUMPER rev.1, written by Kaze.
Usage: Overlapper-Blender wordlistfile1 wordlistfile2
Note1: wordlistfile1's lines encountered in wordlistfile2's lines go to 'Overlapped.txt' file.
Note2: wordlistfile1's lines blended (no repetitions allowed) with wordlistfile2's lines go to 'Blended.txt' file.
Size of 1st input file: 57389250
Size of 2nd input file: 27024497
Allocating 512MB ...
Lines in 1st input file: 2615513
Lines in 2nd input file: 1233227
Allocated memory for pointers-to-words in MB: 15
Sorting 3848740 Pointers ...
Deduplicating duplicates and dumping all into 'Blended.txt' ...
Dumping deduplicated duplicates into 'Overlapped.txt' ...
Blended lines: 3746539
Overlapped lines: 102201

D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>ren Blended.txt Agatha+Sherlock_Blended.txt

D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>ren Overlapped.txt Agatha+Sherlock_Overlapped.txt

D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>Overlapper-Blender_r1.exe "_Sunnah and Hadith and Qur'an.txt" _The_Holy_Bible_4-versions.txt
Overlapper-Blender r.1, mix of Compare_Two_Wordlists, revision 1+ and Building-Blocks_DUMPER rev.1, written by Kaze.
Usage: Overlapper-Blender wordlistfile1 wordlistfile2
Note1: wordlistfile1's lines encountered in wordlistfile2's lines go to 'Overlapped.txt' file.
Note2: wordlistfile1's lines blended (no repetitions allowed) with wordlistfile2's lines go to 'Blended.txt' file.
Size of 1st input file: 20326151
Size of 2nd input file: 17183313
Allocating 512MB ...
Lines in 1st input file: 936195
Lines in 2nd input file: 795822
Allocated memory for pointers-to-words in MB: 7
Sorting 1732017 Pointers ...
Deduplicating duplicates and dumping all into 'Blended.txt' ...
Dumping deduplicated duplicates into 'Overlapped.txt' ...
Blended lines: 1702979
Overlapped lines: 29038

D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>ren Blended.txt Islam+Bible_Blended.txt

D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>ren Overlapped.txt Islam+Bible_Overlapped.txt

D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>dir

03/01/2011  07:25 AM                30 1.txt
03/01/2011  07:25 AM                14 2.txt
03/01/2011  07:26 AM        82,470,333 Agatha+Sherlock_Blended.txt
03/01/2011  07:26 AM         1,943,414 Agatha+Sherlock_Overlapped.txt
03/01/2011  07:28 AM        36,965,004 Islam+Bible_Blended.txt
03/01/2011  07:28 AM           544,460 Islam+Bible_Overlapped.txt
03/01/2011  07:25 AM            41,768 Overlapper-Blender_r1.c
03/01/2011  07:25 AM            64,000 Overlapper-Blender_r1.exe
03/01/2011  07:25 AM        57,389,250 _Agatha Christie_Texts.txt
03/01/2011  07:25 AM        27,024,497 _Sherlock Holmes_Texts.txt
03/01/2011  07:25 AM        20,326,151 _Sunnah and Hadith and Qur'an.txt
03/01/2011  07:25 AM        17,183,313 _The_Holy_Bible_4-versions.txt

D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>Overlapper-Blender_r1.exe "Agatha+Sherlock_Overlapped.txt" "Islam+Bible_Overlapped.txt"
Overlapper-Blender r.1, mix of Compare_Two_Wordlists, revision 1+ and Building-Blocks_DUMPER rev.1, written by Kaze.
Usage: Overlapper-Blender wordlistfile1 wordlistfile2
Note1: wordlistfile1's lines encountered in wordlistfile2's lines go to 'Overlapped.txt' file.
Note2: wordlistfile1's lines blended (no repetitions allowed) with wordlistfile2's lines go to 'Blended.txt' file.
Size of 1st input file: 1943414
Size of 2nd input file: 544460
Allocating 512MB ...
Lines in 1st input file: 102201
Lines in 2nd input file: 29038
Allocated memory for pointers-to-words in MB: 1
Sorting 131239 Pointers ...
Deduplicating duplicates and dumping all into 'Blended.txt' ...
Dumping deduplicated duplicates into 'Overlapped.txt' ...
Blended lines: 125299
Overlapped lines: 5940

D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>type Overlapped.txt|more
a_breath_of_the
a_change_in_the
^C
D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>ren Overlapped.txt "Agatha+Sherlock+Islam+Bible_Overlapped.txt"

D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>dir "Agatha+Sherlock+Islam+Bible_Overlapped.txt"

03/01/2011  07:31 AM           104,584 Agatha+Sherlock+Islam+Bible_Overlapped.txt
All (the whole example) files are available here: one 30MB zip archive.
Enjoy!
 

Sanmayce

Refinement continues...
Here comes Overlapper-Blender_r1+.
Overlapper-Blender revision 1 had one shortcoming (it did not give the unfamiliar words; now exterminated), so a few very useful features were added: creation of 'Unfamiliar.txt' and more stats.
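The idea behind 'Unfamiliar.txt' is a simple set difference; a hypothetical re-implementation (not the shipped code) could binary-search every word of the text's wordlist in the sorted dictionary wordlist and report the misses:
Code:
/* Hypothetical sketch of the 'Unfamiliar' idea, not the shipped code:
   report every word from words[] that bsearch cannot find in the sorted
   dictionary dict[] (the .wrd wordlists are already sorted). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int cmp(const void *a, const void *b)
{
    return strcmp(*(char *const *)a, *(char *const *)b);
}

void report_unfamiliar(char **words, size_t nwords, char **dict, size_t ndict)
{
    size_t i;
    for (i = 0; i < nwords; i++)
        if (!bsearch(&words[i], dict, ndict, sizeof *dict, cmp))
            printf("%s\n", words[i]);   /* candidate misspelling or new word */
}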

One short DIZ/description TXT file, here: 16.1KB.
One short DIZ/description PDF file, here: 74.2KB.
Overlapper-Blender_r1+.zip file (contains Windows console executable, C source, four 4gram wordlists), here: 30.3MB.

By adding this (I updated the Dummy-Check package to r.2), I will now show how to spot misspelled/new words in a bunch of incoming TXT files.
The wordlist in use contains 351,116 words.

Under quick-and-dummy spell-checking is the text 'The history of the Oxford English Dictionary', taken from the OED on CD-ROM HTML help files:

Code:
D:\_KAZE_new-stuff\Dummy_Check_package_r2>dir/s
 Volume in drive D is H320_Vol5
 Volume Serial Number is 0CB3-C881

 Directory of D:\_KAZE_new-stuff\Dummy_Check_package_r2

03/03/2011  12:03 AM    <DIR>          .
03/03/2011  12:03 AM    <DIR>          ..
03/03/2011  12:03 AM               259 Dummy_Check.bat
03/03/2011  12:03 AM         4,024,155 english.dic_351116_wordlist
03/03/2011  12:03 AM            94,208 Leprechaun_r13++++++_Microsoft_16.00.30319.01.exe
03/03/2011  12:03 AM            66,048 Overlapper-Blender_r1+.exe
03/02/2011  11:48 PM    <DIR>          TREE_of_TXT_files_to_be_processed
03/03/2011  12:03 AM            34,606 Yoshi_r6.exe
               5 File(s)      4,219,276 bytes

 Directory of D:\_KAZE_new-stuff\Dummy_Check_package_r2\TREE_of_TXT_files_to_be_processed

03/02/2011  11:48 PM    <DIR>          .
03/02/2011  11:48 PM    <DIR>          ..
03/03/2011  12:03 AM            15,816 oed2_hist.txt
03/03/2011  12:03 AM            17,128 oed2_hist10.txt
03/03/2011  12:03 AM             8,475 oed2_hist11.txt
03/03/2011  12:03 AM            13,394 oed2_hist12.txt
03/03/2011  12:03 AM            11,942 oed2_hist13.txt
03/03/2011  12:03 AM            12,366 oed2_hist2.txt
03/03/2011  12:03 AM            11,197 oed2_hist3.txt
03/03/2011  12:03 AM             9,752 oed2_hist4.txt
03/03/2011  12:03 AM            12,589 oed2_hist5.txt
03/03/2011  12:03 AM            11,206 oed2_hist6.txt
03/03/2011  12:03 AM            15,374 oed2_hist7.txt
03/03/2011  12:03 AM            15,962 oed2_hist8.txt
03/03/2011  12:03 AM            12,009 oed2_hist9.txt
              13 File(s)        167,210 bytes

     Total Files Listed:
              18 File(s)      4,386,486 bytes
               5 Dir(s)   1,004,392,448 bytes free

D:\_KAZE_new-stuff\Dummy_Check_package_r2>type Dummy_Check.bat
cd TREE_of_TXT_files_to_be_processed
..\Yoshi_r6.exe -f -o..\Dummy_Check.lst *.txt
cd..
Leprechaun_r13++++++_Microsoft_16.00.30319.01.exe Dummy_Check.lst Dummy_Check.lst.wrd 3000
Overlapper-Blender_r1+.exe Dummy_Check.lst.wrd english.dic_351116_wordlist

D:\_KAZE_new-stuff\Dummy_Check_package_r2>Dummy_Check.bat

D:\_KAZE_new-stuff\Dummy_Check_package_r2>cd TREE_of_TXT_files_to_be_processed

D:\_KAZE_new-stuff\Dummy_Check_package_r2\TREE_of_TXT_files_to_be_processed>..\Yoshi_r6.exe -f -o..\Dummy_Check.lst *.txt
Yoshi(Filelist Creator), revision 06, written by Svalqyatchx,
in fact based on SWEEP.C from 'Open Watcom Project', thanks-thanks.

Note1: So far, it works for current directory only.
Note2: Default method is depth-first traversal;
       may use pipe 'Yoshi|sort' for breadth-first_like traversal results.
Note3: Make notice that '*.*'(extensionfull only) is not equal to '*'(all);
       one disadvantage is an inability to list only extensionless filenames.
Note4: Search is case-insensitive as-must.
Note5: This revision allows multiple '*', and meaning of masks is:
       '?' - any character AND NOT EMPTY(default, for OR EMPTY see option -e);
       '*' - any character(s) or empty.
Note6: What is a .LBL(LineByLine) file?
       it is a bunch of GRAMMATICAL lines not mere LF or CRLF lines;
       it contains not symbols under 32(except CR and LF) and above 127;
       it contains not space symbol sequences.
Usage:
      Yoshi [option(s)] [filename(s)]
      option(s):
         -v           i.e. verbose mode; output goes to console;
         -f           i.e. fullpath mode for output;
         -e           i.e. treat '?' as any character OR EMPTY;
         -t           i.e. touch all encountered files;
         -2           i.e. convert all encountered .TXT files to .LBL files;
         -o<filename> i.e. output goes to file(in append mode).
      filename(s):
         Wildcards '*' and wildcards '?' are allowed i.e. "str*.c??";
         default filename is '*'; DO NOT FORGET TO PUT
         filename(s) WITH WILDCARD(S) INTO QUOTE MARKS!
Examples:
      Yoshi -v -f -oCaterpillar_NON.lst "*.lbl" "*.txt" "*.htm" "*.html"
      Yoshi -f -oMyEbooks.txt "*wiley*essential*.pdf" "*russian*.*htm"

Yoshi: Total size of files: 00,000,000,167,210 bytes.
Yoshi: Total files: 000,000,000,013.
Yoshi: Total folders: 0,000,000,000.

D:\_KAZE_new-stuff\Dummy_Check_package_r2\TREE_of_TXT_files_to_be_processed>cd..

D:\_KAZE_new-stuff\Dummy_Check_package_r2>Leprechaun_r13++++++_Microsoft_16.00.30319.01.exe Dummy_Check.lst Dummy_Check.lst.wrd 3000
Leprechaun(Fast Greedy Word-Ripper), revision 13++++++, written by Svalqyatchx.
Leprechaun: 'Oh, well, didn't you hear? Bigger is good, but jumbo is dear.'
Kaze: Let's see what a 3-way hash + 6,602,752 Binary-Search-Trees can give us,
      also the performance of a 3-way hash + 6,602,752 B-Trees of order 3.
Size of input file with files for Leprechauning: 1550
Allocating memory 1170MB ... OK
Size of Input TEXTual file: 15,816
|; Word count: 2,572 of them 790 distinct; Done: 64/64
Size of Input TEXTual file: 17,128
/; Word count: 5,290 of them 1,359 distinct; Done: 64/64
Size of Input TEXTual file: 8,475
-; Word count: 6,618 of them 1,570 distinct; Done: 64/64
Size of Input TEXTual file: 13,394
\; Word count: 8,930 of them 2,014 distinct; Done: 64/64
Size of Input TEXTual file: 11,942
|; Word count: 11,035 of them 2,493 distinct; Done: 64/64
Size of Input TEXTual file: 12,366
/; Word count: 13,117 of them 2,714 distinct; Done: 64/64
Size of Input TEXTual file: 11,197
-; Word count: 14,968 of them 2,914 distinct; Done: 64/64
Size of Input TEXTual file: 9,752
\; Word count: 16,604 of them 3,078 distinct; Done: 64/64
Size of Input TEXTual file: 12,589
|; Word count: 18,726 of them 3,237 distinct; Done: 64/64
Size of Input TEXTual file: 11,206
/; Word count: 20,545 of them 3,388 distinct; Done: 64/64
Size of Input TEXTual file: 15,374
-; Word count: 22,972 of them 3,601 distinct; Done: 64/64
Size of Input TEXTual file: 15,962
\; Word count: 25,447 of them 3,815 distinct; Done: 64/64
Size of Input TEXTual file: 12,009
|; Word count: 27,328 of them 3,974 distinct; Done: 64/64
Bytes per second performance: 167,210B/s
Words per second performance: 27,328W/s
Flushing unsorted words ...
Time for making unsorted wordlist: 1 second(s)
Deallocated memory in MB: 1170
Allocated memory for words in MB: 1
Allocated memory for pointers-to-words in MB: 1
Sorting(with 'MultiKeyQuickSortX26Sort' by J. Bentley and R. Sedgewick) ...
Sort pass 26/26 ...
Flushing sorted words ...
Time for sorting unsorted wordlist: 1 second(s)
Leprechaun: Done.

D:\_KAZE_new-stuff\Dummy_Check_package_r2>Overlapper-Blender_r1+.exe Dummy_Check.lst.wrd english.dic_351116_wordlist
Overlapper-Blender r.1+, written by Kaze.
Size of 1st input file: 36609
Size of 2nd input file: 4024155
Allocating 1024MB ...
Lines in 1st input file: 3974
Lines in 2nd input file: 351116
Allocated memory for pointers-to-words in MB: 2
Allocated memory for pointers-to-words in MB: 1
Sorting 355090 Pointers ...
Deduplicating duplicates and dumping all into 'Blended.txt' ...
Dumping deduplicated duplicates into 'Overlapped.txt' ...
Dumping all-from-first-file except deduplicated duplicates into 'Unfamiliar.txt' ...
Blended lines, i.e. combined lines from both files: 351623
Overlapped lines, i.e. lines common for both files: 3467
Unfamiliar lines, i.e. lines from 1st file not encountered in 2nd file: 507

D:\_KAZE_new-stuff\Dummy_Check_package_r2>type Unfamiliar.txt
abrm
ada
addenbrooke
addlestone
...
wyllie
wyndham
yockney
yonge
yvonne
zorc

D:\_KAZE_new-stuff\Dummy_Check_package_r2>

Actually, I could not find any mistakes in those 507 words from 'Unfamiliar.txt', but I found something more ominous than a typo: no (formal) RECOGNITION whatsoever of Samuel Johnson's contribution, caramba! If the OED staff is aware of this ... it is worse than UNGRATEFULNESS! A whole train full of contributors listed and no SEAT for my man; I had a different notion of the famous English politeness, as being not a superficial courtesy but plain gratefulness.
 

Sanmayce

The 10,000-character limitation forced me to split my post in two:

Only some "Johnson" is mentioned, as if some irrelevant meddler were babbling something:

"The example of Johnson and Richardson had shown clearly that the citation of authority
for a word was one of the essentials for establishing its meaning and tracing its
history. It was therefore obvious that the first step towards the building up of a new
dictionary must be the assembling of such authority, in the form of quotations from
English writings throughout the various periods of the language.
Johnson and Richardson
had been selective in the material they assembled, and obviously some kind of selection
would be imposed by practical limits, however wide the actual range might be."
/An excerpt from 'The history of the Oxford English Dictionary' OED on CD-ROM/

"... The next stage is marked by
Johnson's systematic use of quotations to illustrate and justify the definitions, the
many omissions still existing in the vocabulary being partly filled by later
supplementary works on the same lines. When to all this was superadded the principle of
historical illustration, introduced by Richardson, it became inevitable that any
adequate dictionary of English must be one of the larger books of the world."
/An excerpt from 'The history of the Oxford English Dictionary' OED on CD-ROM/

"It is remarkable that Richardson's dictionary, perhaps through certain defects in his
method, did not at once attract the attention it deserved. From the appearance of the
first instalment in the Encyclopaedia Metropolitana in 1819 to the full acceptance of
the historical principle by the Philological Society almost forty years had passed, and
the separate publication of his dictionary in 1836-7 did not affect to any appreciable
extent the work of those lexicographers who followed in the wake of Johnson or Webster.
Even his wealth of quotations remained unutilized, although they formed a natural
storehouse for any who cared to search in it and bring forth 'treasures new and old' to
add to those already available in the works of Johnson and his successors."
/An excerpt from 'The history of the Oxford English Dictionary' OED on CD-ROM/

And what knocked me down completely was the lack of the word 'SAMUEL' in the OED entry list; as for seeing the etymology of this name/word, forget about it (the Heritage dictionary explains it, though, tracing its Semitic roots)!
And if the above is not a wake-up call for the OED staff...

Dummy_Check_package_r2.zip file, here: 1.1MB.

Enjoy!

Add-on:
Google didn't bother to supply stats about their CSV files, so here is some info about the US English 4grams from July 15, 2009.
Here I want to give the exact number of pure (no year/pages/...) distinct 4grams derived from all 400 'googlebooks-eng-us-all-4gram-20090715' CSV files:

googlebooks-eng-us-all-4gram-20090715-graffith_A_distinct: 437,808,652 bytes, 17,981,107 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_B_distinct: 159,141,163 bytes, 6,571,872 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_C_distinct: 160,011,167 bytes, 6,212,540 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_D_distinct: 97,107,487 bytes, 3,856,617 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_E_distinct: 88,831,581 bytes, 3,424,994 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_F_distinct: 129,873,927 bytes, 5,282,784 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_G_distinct: 51,318,288 bytes, 2,116,401 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_H_distinct: 164,940,851 bytes, 6,760,278 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_I_distinct: 234,234,813 bytes, 9,449,270 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_J_distinct: 10,856,482 bytes, 444,251 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_K_distinct: 13,466,244 bytes, 569,361 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_L_distinct: 74,101,010 bytes, 3,123,807 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_M_distinct: 125,532,372 bytes, 5,180,952 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_N_distinct: 73,979,970 bytes, 3,075,105 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_O_distinct: 257,378,814 bytes, 10,718,140 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_P_distinct: 134,588,800 bytes, 5,222,828 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_Q_distinct: 6,573,966 bytes, 257,343 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_R_distinct: 90,619,671 bytes, 3,565,405 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_S_distinct: 219,649,789 bytes, 8,736,465 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_T_distinct: 638,879,823 bytes, 24,309,233 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_U_distinct: 39,351,963 bytes, 1,640,327 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_V_distinct: 23,544,104 bytes, 957,759 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_W_distinct: 236,365,992 bytes, 9,738,971 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_X_distinct: 157,465 bytes, 6,593 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_Y_distinct: 24,202,157 bytes, 1,000,248 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_Z_distinct: 463,569 bytes, 19,684 distinct lines

Total size/number of distinct 4grams: 3,233,748,341 bytes / 140,222,335 lines.


Because many 4grams here are meaningless, and because a rich collection is needed, it is obvious that only several times as many will do a serious job.
To see what converts what, you may read this log.
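As an aside on how the 'pure' 4grams can be obtained: assuming the 2009 files use the usual tab-separated layout (4gram, year, match count, page count, volume count), keeping only the text before the first tab is enough; getting distinct 4grams then needs a separate deduplication pass, since the same 4gram appears once per year. A minimal sketch under that assumption:
Code:
/* Hedged sketch: print only the phrase part of each Google Books 4gram
   line, assuming the tab-separated layout
   "<4gram>\t<year>\t<match_count>\t<page_count>\t<volume_count>".
   Deduplicating across years (for distinct 4grams) is a separate step. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[1024];
    while (fgets(line, sizeof line, stdin)) {
        char *tab = strchr(line, '\t');
        if (tab) *tab = '\0';                        /* drop year and counts */
        else line[strcspn(line, "\r\n")] = '\0';     /* tolerate odd lines   */
        puts(line);
    }
    return 0;
}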
 

Sanmayce

Stomp stomp I have arrived...

Another brute-force approach was taken in order to make the awfully slow Graphein r.1 more bearable, resulting in Graphein r.1++ (now only 403MB, an 8:1 compression, thanks to the most advanced text compressor BSC (GRAFFITH_r2++_Graphein.exe), written by Ilya Grebnov).
Here, instead of waiting 20 minutes just to get the Found & Unfamiliar 4grams for a small incoming file, the latency is now under 4 minutes; what is more important, for large incoming texts (like _Sherlock Holmes_Texts.quadrupleton.txt) the total time is sub-linear.
Needed space on HDD/SSD (or better yet on a flash card): 3.45GB; the batch file automatically decompresses the archives (.bsc files) when it is started for the first time.
The examples below were executed on a Toshiba laptop with an Intel Merom 2.16GHz CPU running Windows XP.

Here again the tested text is The_Little_Match_Girl.txt, 5,203 bytes.
After quadrupletoning it, the result is a text file with 580 4grams: The_Little_Match_Girl.quadrupleton.txt, 12,544 bytes.

By starting a single batch file you can get 2 text files (211 seconds needed) with the 4grams found in / unfamiliar to the 140,222,335 4grams of the googlebooks-eng-us-all-4gram-20090715 corpus:
Code:
D:\Package 'Graphein' a 4-gram-Phrase-Checker, revision 1++>26Clash_Intel.BAT The_Little_Match_Girl.quadrupleton.txt
...
The_Little_Match_Girl.quadrupleton.txt_overlapped_all_distinct 7,590 bytes
The_Little_Match_Girl.quadrupleton.txt_unfamiliar_all_distinct 4,954 bytes
The first file contains 370 4grams familiar to the googlebooks-eng-us-all-4gram-20090715 corpus; some of them:
a_box_of_them
a_boy_had_run
...
with_such_a_glow
you_will_vanish_like


The second file contains 210 4grams unfamiliar to the googlebooks-eng-us-all-4gram-20090715 corpus; some of them:
a_cradle_some_day
a_little_pathetic_figure
...
you_will_disappear_when
youngster_stretched_out_her


After quadrupletoning the full collection of Sherlock Holmes stories, the result is a text file with 1,233,227 4grams: _Sherlock Holmes_Texts.quadrupleton.txt, 27,024,497 bytes.

By starting a single batch file you can get 2 text files (318 seconds needed) with the 4grams found in / unfamiliar to the 140,222,335 4grams of the googlebooks-eng-us-all-4gram-20090715 corpus:
Code:
D:\Package 'Graphein' a 4-gram-Phrase-Checker, revision 1++>26Clash_Intel.BAT "_Sherlock Holmes_Texts.quadrupleton.txt"
...
_Sherlock Holmes_Texts.quadrupleton.txt_overlapped_all_distinct 12,532,297 bytes
_Sherlock Holmes_Texts.quadrupleton.txt_unfamiliar_all_distinct 14,492,200 bytes
The first file contains 612,319 4grams familiar to the googlebooks-eng-us-all-4gram-20090715 corpus; some of them:
a_and_b_cleared
a_and_b_companies
...
zone_of_light_and
zoo_and_see_the


The second file contains 620,908 4grams unfamiliar to the googlebooks-eng-us-all-4gram-20090715 corpus; some of them:
a_alane_is_waur
a_appy_day_with
...
zuurfontein_by_as_many
zuurfontein_were_both_made


When you need (for instance) a post or an e-mail (or even whole e-books) to be checked for broken four-word phrases against the richest 4gram corpus so far, Graphein r.1++ is here: one 419MB ZIP file.
Also, the second (semi-auto) mode of operation is intact but faster; one screenshot here.
To shrink these 200+ seconds down to less than a second (without relying on CPU power and many GBs of available system RAM), a lot of bread I must eat...

In fact, I know exactly how to create the skeleton (without sacrificing speed a bit: with no heavy system RAM and CPU loads) despite the 32bit coding limitations: by creating a single 10x3++GB file, a dump/mirror of the 140,000,000++ phrases already inserted into millions of B-trees. Spanning over such a huge pool is well suited to flash memories (SSDs, SD cards) because of their low latency/seek time - roughly 50 nanoseconds/microseconds/milliseconds respectively for system RAM/flash/hard disk. What this greedy approach (only 10% memory utilization) needs most is low latency, not high bandwidth.
Nevertheless, I would appreciate any how-to-do-it hint.
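One possible direction, offered only as a sketch under my own assumptions rather than as the design described above: keep the 140-million-plus phrases as sorted, fixed-width records in a single file and binary-search it with seeks; a membership test then costs about log2(N) seeks (roughly 27 for 140 million records), which suits low-latency flash and needs almost no RAM. The 64-byte record width is hypothetical, and a file of this size needs a 64-bit seek (fseeko/_fseeki64) in practice; plain fseek appears below only to keep the sketch short.
Code:
/* Sketch only (an assumption, not the plan above): binary search over a
   single file of sorted, fixed-width, '\0'-padded records. */
#include <stdio.h>
#include <string.h>

#define RECLEN 64                   /* hypothetical fixed record width */

/* Returns 1 if phrase is present in the sorted fixed-width record file. */
int lookup(FILE *f, long nrecords, const char *phrase)
{
    char rec[RECLEN];
    long lo = 0, hi = nrecords - 1;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        if (fseek(f, mid * (long)RECLEN, SEEK_SET) != 0) break;
        if (fread(rec, 1, RECLEN, f) != (size_t)RECLEN) break;
        int cmp = strncmp(phrase, rec, RECLEN);
        if (cmp == 0) return 1;
        if (cmp < 0) hi = mid - 1;
        else lo = mid + 1;
    }
    return 0;
}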

Enjoy!
 