He-he, it is all about playing with words, as you have already noticed. The charming part of broken English is exactly the lack of any bias and, why not, the disregard of RULES.
Thanks for pointing out
Can you list the common vocabulary (1grams/words as well as 4grams/phrases) of the Hercule Poirot and Sherlock Holmes stories?
The awfully slow revision of 'Compare_Two_Wordlists_r1+.exe' is now replaced by 'Overlapper-Blender_r1.exe', which is fast enough.
This allows (in the next Graphein r.1++) large texts to be checked without those disgusting delays.
I have an old (and even better) idea: to make a variant of Leprechaun that solves all these mini-pushups once and for all; for now, Overlapper-Blender is not a bad utility.
Overlapper-Blender works with any type of strings, in particular 1grams and 4grams.
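The steps the tool prints in its log (sort all lines together, deduplicate, dump) can be sketched in a few lines of Python. This is only an illustration under assumptions, not the C source from the archive: the function name `overlap_and_blend` is made up, and repeated lines inside one input are dropped first so that "seen twice" really means "present in both files".

```python
from itertools import groupby

def overlap_and_blend(lines1, lines2):
    """Return (blended, overlapped) for two lists of lines.

    Blended: every distinct line from both inputs, no repetitions.
    Overlapped: the lines of the first input that also occur in the second.
    """
    # Drop repetitions inside each input, then sort everything together.
    merged = sorted(list(set(lines1)) + list(set(lines2)))
    blended, overlapped = [], []
    for line, group in groupby(merged):
        blended.append(line)                  # one copy of each line
        if sum(1 for _ in group) > 1:         # occurred in both inputs
            overlapped.append(line)
    return blended, overlapped

blended, overlapped = overlap_and_blend(
    ["a_bad_day_of", "a_bad_day_when"],       # same content as 1.txt in the log
    ["a_bad_day_of"],                         # same content as 2.txt in the log
)
print(len(blended), len(overlapped))
```

For the tiny 1.txt/2.txt pair this gives 2 blended lines and 1 overlapped line, the same counts the tool reports.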
One important aspect of making cross-references is to explore the usage of the most frequent phrases (4grams) throughout several corpora of such 4grams.
Here I will show how to generate a single text file (one 4gram per line) containing the 4grams common to four given corpora: the Agatha Christie collection, the Sherlock Holmes collection, the Sunnah and Hadith and Qur'an collection, and four versions of The Holy Bible.
In short:
The log below contains the steps needed to create 'Agatha+Sherlock+Islam+Bible_Overlapped.txt' file.
Some stats:
By clashing 'Agatha Christie' corpus (2,615,513 4grams) into 'Sherlock Holmes' corpus (1,233,227 4grams) the outcome is: 102,201 overlapped/common 4grams.
By clashing 'Sunnah and Hadith and Qur'an' corpus (936,195 4grams) into 'The Holy Bible' corpus (795,822 4grams) the outcome is: 29,038 overlapped/common 4grams.
And finally the clash between 102,201 and 29,038 results in only 5,940 4grams.
All these 4-word phrases constitute one important part of the common phraseology (here: of four major sources).
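The counts in the log can be cross-checked with simple arithmetic. If neither input file repeats a line, every common 4gram is counted once in each file, so the two input line counts must add up to blended + overlapped (which is also the number of sorted pointers the tool reports). A small Python check using only figures from the log; the helper name `consistent` is made up:

```python
def consistent(lines1, lines2, blended, overlapped):
    """True when the counts fit: each common line occurs once in each input."""
    return lines1 + lines2 == blended + overlapped

# (lines in 1st file, lines in 2nd file, blended lines, overlapped lines)
runs = {
    "Agatha vs Sherlock": (2615513, 1233227, 3746539, 102201),
    "Islam vs Bible": (936195, 795822, 1702979, 29038),
    "Overlap vs overlap": (102201, 29038, 125299, 5940),
}
for name, counts in runs.items():
    print(name, consistent(*counts))
```

All three runs pass this check.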
An excerpt (with all to_be's) from this list of 5,940 4grams:
...
to_be_a_great
to_be_cut_off
to_be_deprived_of
to_be_done_by
to_be_done_with
to_be_found_in
to_be_free_from
to_be_given_to
to_be_his_wife
to_be_in_the
to_be_kept_in
to_be_left_behind
to_be_married_to
to_be_more_than
to_be_of_service
to_be_on_the
to_be_one_of
to_be_regarded_as
to_be_seen_by
to_be_seen_of
to_be_taken_from
to_be_taken_to
to_be_the_most
to_be_under_the
to_be_used_for
to_be_used_in
to_be_with_the
to_be_with_you
...
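An excerpt like this can be pulled out of the overlapped list with a plain prefix filter. A minimal Python sketch; the function name `with_prefix` is made up, and the sample holds just a few 4grams quoted in this post instead of the whole 5,940-line file:

```python
def with_prefix(fourgrams, prefix):
    """Keep the 4grams that start with the given prefix, in original order."""
    return [g for g in fourgrams if g.startswith(prefix)]

sample = ["a_breath_of_the", "a_change_in_the",
          "to_be_a_great", "to_be_cut_off", "to_be_with_you"]
print(with_prefix(sample, "to_be_"))
```

To run it on the real file, read 'Agatha+Sherlock+Islam+Bible_Overlapped.txt' line by line, strip the line endings, and pass the resulting list instead of `sample`.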
This exercise reveals how many (102,201) 4-word phrases are common to Agatha Christie's style and Conan Doyle's style.
It also shows the recipe for creating your own High-Quality corpus of 1grams or 4grams.
That is how I created my HQ wordlist of 1grams: as far as I remember, from 13 Low-Quality and Unknown-Quality spell-checker wordlists.
Soon I will create my own 4gram wordlist by clashing various not-so-small corpora; the way of mixing them reminds me of Barry White's superhit Put Me In Your Mix.
All files (the whole example) are available here: one 30MB zip archive.
Code:
D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>dir
03/01/2011 07:25 AM 30 1.txt
03/01/2011 07:25 AM 14 2.txt
03/01/2011 07:25 AM 41,768 Overlapper-Blender_r1.c
03/01/2011 07:25 AM 64,000 Overlapper-Blender_r1.exe
03/01/2011 07:25 AM 57,389,250 _Agatha Christie_Texts.txt
03/01/2011 07:25 AM 27,024,497 _Sherlock Holmes_Texts.txt
03/01/2011 07:25 AM 20,326,151 _Sunnah and Hadith and Qur'an.txt
03/01/2011 07:25 AM 17,183,313 _The_Holy_Bible_4-versions.txt
D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>type 1.txt
a_bad_day_of
a_bad_day_when
D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>type 2.txt
a_bad_day_of
D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>Overlapper-Blender_r1.exe 1.txt 2.txt
Overlapper-Blender r.1, mix of Compare_Two_Wordlists, revision 1+ and Building-Blocks_DUMPER rev.1, written by Kaze.
Usage: Overlapper-Blender wordlistfile1 wordlistfile2
Note1: wordlistfile1's lines encountered in wordlistfile2's lines go to 'Overlapped.txt' file.
Note2: wordlistfile1's lines blended (no repetitions allowed) with wordlistfile2's lines go to 'Blended.txt' file.
Size of 1st input file: 30
Size of 2nd input file: 14
Allocating 512MB ...
Lines in 1st input file: 2
Lines in 2nd input file: 1
Allocated memory for pointers-to-words in MB: 1
Sorting 3 Pointers ...
Deduplicating duplicates and dumping all into 'Blended.txt' ...
Dumping deduplicated duplicates into 'Overlapped.txt' ...
Blended lines: 2
Overlapped lines: 1
D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>dir
03/01/2011 07:25 AM 30 1.txt
03/01/2011 07:25 AM 14 2.txt
03/01/2011 07:26 AM 30 Blended.txt
03/01/2011 07:26 AM 14 Overlapped.txt
03/01/2011 07:25 AM 41,768 Overlapper-Blender_r1.c
03/01/2011 07:25 AM 64,000 Overlapper-Blender_r1.exe
03/01/2011 07:25 AM 57,389,250 _Agatha Christie_Texts.txt
03/01/2011 07:25 AM 27,024,497 _Sherlock Holmes_Texts.txt
03/01/2011 07:25 AM 20,326,151 _Sunnah and Hadith and Qur'an.txt
03/01/2011 07:25 AM 17,183,313 _The_Holy_Bible_4-versions.txt
D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>type Blended.txt
a_bad_day_of
a_bad_day_when
D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>type Overlapped.txt
a_bad_day_of
D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>Overlapper-Blender_r1.exe "_Agatha Christie_Texts.txt" "_Sherlock Holmes_Texts.txt"
Overlapper-Blender r.1, mix of Compare_Two_Wordlists, revision 1+ and Building-Blocks_DUMPER rev.1, written by Kaze.
Usage: Overlapper-Blender wordlistfile1 wordlistfile2
Note1: wordlistfile1's lines encountered in wordlistfile2's lines go to 'Overlapped.txt' file.
Note2: wordlistfile1's lines blended (no repetitions allowed) with wordlistfile2's lines go to 'Blended.txt' file.
Size of 1st input file: 57389250
Size of 2nd input file: 27024497
Allocating 512MB ...
Lines in 1st input file: 2615513
Lines in 2nd input file: 1233227
Allocated memory for pointers-to-words in MB: 15
Sorting 3848740 Pointers ...
Deduplicating duplicates and dumping all into 'Blended.txt' ...
Dumping deduplicated duplicates into 'Overlapped.txt' ...
Blended lines: 3746539
Overlapped lines: 102201
D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>ren Blended.txt Agatha+Sherlock_Blended.txt
D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>ren Overlapped.txt Agatha+Sherlock_Overlapped.txt
D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>Overlapper-Blender_r1.exe "_Sunnah and Hadith and Qur'an.txt" _The_Holy_Bible_4-versions.txt
Overlapper-Blender r.1, mix of Compare_Two_Wordlists, revision 1+ and Building-Blocks_DUMPER rev.1, written by Kaze.
Usage: Overlapper-Blender wordlistfile1 wordlistfile2
Note1: wordlistfile1's lines encountered in wordlistfile2's lines go to 'Overlapped.txt' file.
Note2: wordlistfile1's lines blended (no repetitions allowed) with wordlistfile2's lines go to 'Blended.txt' file.
Size of 1st input file: 20326151
Size of 2nd input file: 17183313
Allocating 512MB ...
Lines in 1st input file: 936195
Lines in 2nd input file: 795822
Allocated memory for pointers-to-words in MB: 7
Sorting 1732017 Pointers ...
Deduplicating duplicates and dumping all into 'Blended.txt' ...
Dumping deduplicated duplicates into 'Overlapped.txt' ...
Blended lines: 1702979
Overlapped lines: 29038
D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>ren Blended.txt Islam+Bible_Blended.txt
D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>ren Overlapped.txt Islam+Bible_Overlapped.txt
D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>dir
03/01/2011 07:25 AM 30 1.txt
03/01/2011 07:25 AM 14 2.txt
03/01/2011 07:26 AM 82,470,333 Agatha+Sherlock_Blended.txt
03/01/2011 07:26 AM 1,943,414 Agatha+Sherlock_Overlapped.txt
03/01/2011 07:28 AM 36,965,004 Islam+Bible_Blended.txt
03/01/2011 07:28 AM 544,460 Islam+Bible_Overlapped.txt
03/01/2011 07:25 AM 41,768 Overlapper-Blender_r1.c
03/01/2011 07:25 AM 64,000 Overlapper-Blender_r1.exe
03/01/2011 07:25 AM 57,389,250 _Agatha Christie_Texts.txt
03/01/2011 07:25 AM 27,024,497 _Sherlock Holmes_Texts.txt
03/01/2011 07:25 AM 20,326,151 _Sunnah and Hadith and Qur'an.txt
03/01/2011 07:25 AM 17,183,313 _The_Holy_Bible_4-versions.txt
D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>Overlapper-Blender_r1.exe "Agatha+Sherlock_Overlapped.txt" "Islam+Bible_Overlapped.txt"
Overlapper-Blender r.1, mix of Compare_Two_Wordlists, revision 1+ and Building-Blocks_DUMPER rev.1, written by Kaze.
Usage: Overlapper-Blender wordlistfile1 wordlistfile2
Note1: wordlistfile1's lines encountered in wordlistfile2's lines go to 'Overlapped.txt' file.
Note2: wordlistfile1's lines blended (no repetitions allowed) with wordlistfile2's lines go to 'Blended.txt' file.
Size of 1st input file: 1943414
Size of 2nd input file: 544460
Allocating 512MB ...
Lines in 1st input file: 102201
Lines in 2nd input file: 29038
Allocated memory for pointers-to-words in MB: 1
Sorting 131239 Pointers ...
Deduplicating duplicates and dumping all into 'Blended.txt' ...
Dumping deduplicated duplicates into 'Overlapped.txt' ...
Blended lines: 125299
Overlapped lines: 5940
D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>type Overlapped.txt|more
a_breath_of_the
a_change_in_the
^C
D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>ren Overlapped.txt "Agatha+Sherlock+Islam+Bible_Overlapped.txt"
D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>dir "Agatha+Sherlock+Islam+Bible_Overlapped.txt"
03/01/2011 07:31 AM 104,584 Agatha+Sherlock+Islam+Bible_Overlapped.txt
Enjoy!
Refinement continues...
Here comes Overlapper-Blender_r1+.
Overlapper-Blender revision 1 had one shortcoming, now exterminated: it did not give the unfamiliar words. So a few very useful features were added: creation of 'Unfamiliar.txt' and more stats.
One short DIZ/description TXT file, here: 16.1KB.
One short DIZ/description PDF file, here: 74.2KB.
Overlapper-Blender_r1+.zip file (contains Windows console executable, C source, four 4gram wordlists), here: 30.3MB.
With this added (I updated Dummy-Check-package to r.2), I will now show how to spot misspelled/new words in a bunch of incoming TXT files.
The wordlist in use contains 351,116 words.
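The idea behind this quick-and-dummy check is a plain set difference: rip the distinct words out of the incoming text and keep those that are not in the wordlist. A minimal Python sketch under assumptions: the function names are made up, the toy wordlist stands in for the 351,116-word one, and splitting on runs of a-z after lowercasing is a guess, since Leprechaun's real word-ripping rules are not described here.

```python
import re

def distinct_words(text):
    """Lowercase the text and return its distinct words (runs of a-z)."""
    return set(re.findall(r"[a-z]+", text.lower()))

def unfamiliar(text, wordlist):
    """Words of the text that are missing from the wordlist, sorted."""
    return sorted(distinct_words(text) - set(wordlist))

# Toy wordlist and toy input, for illustration only.
wordlist = ["the", "history", "of", "dictionary", "english", "oxford"]
print(unfamiliar("The history of the Oxford English Dictionary zorc", wordlist))
```

Whatever comes out is either a typo or a word the wordlist simply does not know (proper names, for example), so the result still needs a human eye.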
The text under quick-and-dummy spell-checking is 'The history of the Oxford English Dictionary', taken from the HTML help files of OED on CD-ROM:
Actually, I could not find any mistakes (among those 507 words from 'Unfamiliar.txt'), but I did find something more ominous than a typo: no (formal) RECOGNITION whatsoever of Samuel Johnson's contribution, caramba! If the OED staff is aware of this ... it is worse than UNGRATEFULNESS! A whole trainload of contributors is listed and there is no SEAT for my man. I had a different notion of the famous English politeness: not a superficial courtesy but plain gratefulness.
Code:
D:\_KAZE_new-stuff\Dummy_Check_package_r2>dir/s
Volume in drive D is H320_Vol5
Volume Serial Number is 0CB3-C881
Directory of D:\_KAZE_new-stuff\Dummy_Check_package_r2
03/03/2011 12:03 AM <DIR> .
03/03/2011 12:03 AM <DIR> ..
03/03/2011 12:03 AM 259 Dummy_Check.bat
03/03/2011 12:03 AM 4,024,155 english.dic_351116_wordlist
03/03/2011 12:03 AM 94,208 Leprechaun_r13++++++_Microsoft_16.00.30319.01.exe
03/03/2011 12:03 AM 66,048 Overlapper-Blender_r1+.exe
03/02/2011 11:48 PM <DIR> TREE_of_TXT_files_to_be_processed
03/03/2011 12:03 AM 34,606 Yoshi_r6.exe
5 File(s) 4,219,276 bytes
Directory of D:\_KAZE_new-stuff\Dummy_Check_package_r2\TREE_of_TXT_files_to_be_processed
03/02/2011 11:48 PM <DIR> .
03/02/2011 11:48 PM <DIR> ..
03/03/2011 12:03 AM 15,816 oed2_hist.txt
03/03/2011 12:03 AM 17,128 oed2_hist10.txt
03/03/2011 12:03 AM 8,475 oed2_hist11.txt
03/03/2011 12:03 AM 13,394 oed2_hist12.txt
03/03/2011 12:03 AM 11,942 oed2_hist13.txt
03/03/2011 12:03 AM 12,366 oed2_hist2.txt
03/03/2011 12:03 AM 11,197 oed2_hist3.txt
03/03/2011 12:03 AM 9,752 oed2_hist4.txt
03/03/2011 12:03 AM 12,589 oed2_hist5.txt
03/03/2011 12:03 AM 11,206 oed2_hist6.txt
03/03/2011 12:03 AM 15,374 oed2_hist7.txt
03/03/2011 12:03 AM 15,962 oed2_hist8.txt
03/03/2011 12:03 AM 12,009 oed2_hist9.txt
13 File(s) 167,210 bytes
Total Files Listed:
18 File(s) 4,386,486 bytes
5 Dir(s) 1,004,392,448 bytes free
D:\_KAZE_new-stuff\Dummy_Check_package_r2>type Dummy_Check.bat
cd TREE_of_TXT_files_to_be_processed
..\Yoshi_r6.exe -f -o..\Dummy_Check.lst *.txt
cd..
Leprechaun_r13++++++_Microsoft_16.00.30319.01.exe Dummy_Check.lst Dummy_Check.lst.wrd 3000
Overlapper-Blender_r1+.exe Dummy_Check.lst.wrd english.dic_351116_wordlist
D:\_KAZE_new-stuff\Dummy_Check_package_r2>Dummy_Check.bat
D:\_KAZE_new-stuff\Dummy_Check_package_r2>cd TREE_of_TXT_files_to_be_processed
D:\_KAZE_new-stuff\Dummy_Check_package_r2\TREE_of_TXT_files_to_be_processed>..\Yoshi_r6.exe -f -o..\Dummy_Check.lst *.txt
Yoshi(Filelist Creator), revision 06, written by Svalqyatchx,
in fact based on SWEEP.C from 'Open Watcom Project', thanks-thanks.
Note1: So far, it works for current directory only.
Note2: Default method is depth-first traversal;
may use pipe 'Yoshi|sort' for breadth-first_like traversal results.
Note3: Make notice that '*.*'(extensionfull only) is not equal to '*'(all);
one disadvantage is an inability to list only extensionless filenames.
Note4: Search is case-insensitive as-must.
Note5: This revision allows multiple '*', and meaning of masks is:
'?' - any character AND NOT EMPTY(default, for OR EMPTY see option -e);
'*' - any character(s) or empty.
Note6: What is a .LBL(LineByLine) file?
it is a bunch of GRAMMATICAL lines not mere LF or CRLF lines;
it contains not symbols under 32(except CR and LF) and above 127;
it contains not space symbol sequences.
Usage:
Yoshi [option(s)] [filename(s)]
option(s):
-v i.e. verbose mode; output goes to console;
-f i.e. fullpath mode for output;
-e i.e. treat '?' as any character OR EMPTY;
-t i.e. touch all encountered files;
-2 i.e. convert all encountered .TXT files to .LBL files;
-oi.e. output goes to file(in append mode).
filename(s):
Wildcards '*' and wildcards '?' are allowed i.e. "str*.c??";
default filename is '*'; DO NOT FORGET TO PUT
filename(s) WITH WILDCARD(S) INTO QUOTE MARKS!
Examples:
Yoshi -v -f -oCaterpillar_NON.lst "*.lbl" "*.txt" "*.htm" "*.html"
Yoshi -f -oMyEbooks.txt "*wiley*essential*.pdf" "*russian*.*htm"
Yoshi: Total size of files: 00,000,000,167,210 bytes.
Yoshi: Total files: 000,000,000,013.
Yoshi: Total folders: 0,000,000,000.
D:\_KAZE_new-stuff\Dummy_Check_package_r2\TREE_of_TXT_files_to_be_processed>cd..
D:\_KAZE_new-stuff\Dummy_Check_package_r2>Leprechaun_r13++++++_Microsoft_16.00.30319.01.exe Dummy_Check.lst Dummy_Check.lst.wrd 3000
Leprechaun(Fast Greedy Word-Ripper), revision 13++++++, written by Svalqyatchx.
Leprechaun: 'Oh, well, didn't you hear? Bigger is good, but jumbo is dear.'
Kaze: Let's see what a 3-way hash + 6,602,752 Binary-Search-Trees can give us,
also the performance of a 3-way hash + 6,602,752 B-Trees of order 3.
Size of input file with files for Leprechauning: 1550
Allocating memory 1170MB ... OK
Size of Input TEXTual file: 15,816
|; Word count: 2,572 of them 790 distinct; Done: 64/64
Size of Input TEXTual file: 17,128
/; Word count: 5,290 of them 1,359 distinct; Done: 64/64
Size of Input TEXTual file: 8,475
-; Word count: 6,618 of them 1,570 distinct; Done: 64/64
Size of Input TEXTual file: 13,394
\; Word count: 8,930 of them 2,014 distinct; Done: 64/64
Size of Input TEXTual file: 11,942
|; Word count: 11,035 of them 2,493 distinct; Done: 64/64
Size of Input TEXTual file: 12,366
/; Word count: 13,117 of them 2,714 distinct; Done: 64/64
Size of Input TEXTual file: 11,197
-; Word count: 14,968 of them 2,914 distinct; Done: 64/64
Size of Input TEXTual file: 9,752
\; Word count: 16,604 of them 3,078 distinct; Done: 64/64
Size of Input TEXTual file: 12,589
|; Word count: 18,726 of them 3,237 distinct; Done: 64/64
Size of Input TEXTual file: 11,206
/; Word count: 20,545 of them 3,388 distinct; Done: 64/64
Size of Input TEXTual file: 15,374
-; Word count: 22,972 of them 3,601 distinct; Done: 64/64
Size of Input TEXTual file: 15,962
\; Word count: 25,447 of them 3,815 distinct; Done: 64/64
Size of Input TEXTual file: 12,009
|; Word count: 27,328 of them 3,974 distinct; Done: 64/64
Bytes per second performance: 167,210B/s
Words per second performance: 27,328W/s
Flushing unsorted words ...
Time for making unsorted wordlist: 1 second(s)
Deallocated memory in MB: 1170
Allocated memory for words in MB: 1
Allocated memory for pointers-to-words in MB: 1
Sorting(with 'MultiKeyQuickSortX26Sort' by J. Bentley and R. Sedgewick) ...
Sort pass 26/26 ...
Flushing sorted words ...
Time for sorting unsorted wordlist: 1 second(s)
Leprechaun: Done.
D:\_KAZE_new-stuff\Dummy_Check_package_r2>Overlapper-Blender_r1+.exe Dummy_Check.lst.wrd english.dic_351116_wordlist
Overlapper-Blender r.1+, written by Kaze.
Size of 1st input file: 36609
Size of 2nd input file: 4024155
Allocating 1024MB ...
Lines in 1st input file: 3974
Lines in 2nd input file: 351116
Allocated memory for pointers-to-words in MB: 2
Allocated memory for pointers-to-words in MB: 1
Sorting 355090 Pointers ...
Deduplicating duplicates and dumping all into 'Blended.txt' ...
Dumping deduplicated duplicates into 'Overlapped.txt' ...
Dumping all-from-first-file except deduplicated duplicates into 'Unfamiliar.txt' ...
Blended lines, i.e. combined lines from both files: 351623
Overlapped lines, i.e. lines common for both files: 3467
Unfamiliar lines, i.e. lines from 1st file not encountered in 2nd file: 507
D:\_KAZE_new-stuff\Dummy_Check_package_r2>type Unfamiliar.txt
abrm
ada
addenbrooke
addlestone
...
wyllie
wyndham
yockney
yonge
yvonne
zorc
D:\_KAZE_new-stuff\Dummy_Check_package_r2>
The 10000 chars limitation forced me to split my post in two:
Only some "Johnson" was mentioned, as if some irrelevant meddler were babbling something:
"The example of Johnson and Richardson had shown clearly that the citation of authority
for a word was one of the essentials for establishing its meaning and tracing its
history. It was therefore obvious that the first step towards the building up of a new
dictionary must be the assembling of such authority, in the form of quotations from
English writings throughout the various periods of the language. Johnson and Richardson
had been selective in the material they assembled, and obviously some kind of selection
would be imposed by practical limits, however wide the actual range might be."
/An excerpt from 'The history of the Oxford English Dictionary' OED on CD-ROM/
"... The next stage is marked by
Johnson's systematic use of quotations to illustrate and justify the definitions, the
many omissions still existing in the vocabulary being partly filled by later
supplementary works on the same lines. When to all this was superadded the principle of
historical illustration, introduced by Richardson, it became inevitable that any
adequate dictionary of English must be one of the larger books of the world."
/An excerpt from 'The history of the Oxford English Dictionary' OED on CD-ROM/
"It is remarkable that Richardson's dictionary, perhaps through certain defects in his
method, did not at once attract the attention it deserved. From the appearance of the
first instalment in the Encyclopaedia Metropolitana in 1819 to the full acceptance of
the historical principle by the Philological Society almost forty years had passed, and
the separate publication of his dictionary in 1836-7 did not affect to any appreciable
extent the work of those lexicographers who followed in the wake of Johnson or Webster.
Even his wealth of quotations remained unutilized, although they formed a natural
storehouse for any who cared to search in it and bring forth 'treasures new and old' to
add to those already available in the works of Johnson and his successors."
/An excerpt from 'The history of the Oxford English Dictionary' OED on CD-ROM/
And what knocked me down completely was the lack of the word 'SAMUEL' in the OED entry list; as for seeing the etymology of this name/word, forget about it! (The Heritage dictionary explains it, though, reaching back to its Semitic roots.)
And if the above-said is not a wake-up call for the OED staff...
Dummy_Check_package_r2.zip file, here: 1.1MB.
Enjoy!
Add-on:
Google didn't bother to supply stats about their CSV files, so here is some info about US English 4grams from 2009 July 15.
Here I want to give the exact number of pure (no year/pages/...) distinct 4grams derived from all 400 'googlebooks-eng-us-all-4gram-20090715' CSV files:
googlebooks-eng-us-all-4gram-20090715-graffith_A_distinct: 437,808,652 bytes, 17,981,107 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_B_distinct: 159,141,163 bytes, 6,571,872 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_C_distinct: 160,011,167 bytes, 6,212,540 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_D_distinct: 97,107,487 bytes, 3,856,617 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_E_distinct: 88,831,581 bytes, 3,424,994 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_F_distinct: 129,873,927 bytes, 5,282,784 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_G_distinct: 51,318,288 bytes, 2,116,401 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_H_distinct: 164,940,851 bytes, 6,760,278 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_I_distinct: 234,234,813 bytes, 9,449,270 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_J_distinct: 10,856,482 bytes, 444,251 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_K_distinct: 13,466,244 bytes, 569,361 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_L_distinct: 74,101,010 bytes, 3,123,807 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_M_distinct: 125,532,372 bytes, 5,180,952 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_N_distinct: 73,979,970 bytes, 3,075,105 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_O_distinct: 257,378,814 bytes, 10,718,140 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_P_distinct: 134,588,800 bytes, 5,222,828 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_Q_distinct: 6,573,966 bytes, 257,343 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_R_distinct: 90,619,671 bytes, 3,565,405 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_S_distinct: 219,649,789 bytes, 8,736,465 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_T_distinct: 638,879,823 bytes, 24,309,233 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_U_distinct: 39,351,963 bytes, 1,640,327 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_V_distinct: 23,544,104 bytes, 957,759 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_W_distinct: 236,365,992 bytes, 9,738,971 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_X_distinct: 157,465 bytes, 6,593 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_Y_distinct: 24,202,157 bytes, 1,000,248 distinct lines
googlebooks-eng-us-all-4gram-20090715-graffith_Z_distinct: 463,569 bytes, 19,684 distinct lines
Total size/number of 4grams is: 3,233,748,341/140,222,335.
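The total number of 4grams can be re-added from the per-letter list above. A short Python check; the counts are copied from the list, only the dictionary layout is mine, and the byte sizes are left out of this check:

```python
# Distinct 4gram lines per starting letter, copied from the list above.
counts = {
    "A": 17981107, "B": 6571872, "C": 6212540, "D": 3856617,
    "E": 3424994, "F": 5282784, "G": 2116401, "H": 6760278,
    "I": 9449270, "J": 444251, "K": 569361, "L": 3123807,
    "M": 5180952, "N": 3075105, "O": 10718140, "P": 5222828,
    "Q": 257343, "R": 3565405, "S": 8736465, "T": 24309233,
    "U": 1640327, "V": 957759, "W": 9738971, "X": 6593,
    "Y": 1000248, "Z": 19684,
}

total = sum(counts.values())
biggest = max(counts, key=counts.get)          # letter with the most 4grams
print(f"{total:,} distinct 4grams, biggest letter: {biggest}")
```

The sum agrees with the 140,222,335 stated above, and 'T' is by far the biggest bucket.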
Because many of these 4grams are meaningless, and because a rich collection is needed, it is obvious that only a collection several times as large will do a serious job.
To see what-converts-what you may read this log.
Stomp stomp I have arrived...
Another brute-force approach was taken to make the awfully slow Graphein r.1 more bearable, resulting in Graphein r.1++ (now only 403MB, an 8:1 compression, thanks to the most advanced text compressor BSC (GRAFFITH_r2++_Graphein.exe), written by Ilya Grebnov).
Here, instead of waiting 20 minutes (just to get the Found & Unfamiliar 4grams for a small incoming file), the latency is now under 4 minutes and, more importantly, for large incoming texts (like _Sherlock Holmes_Texts.quadrupleton.txt) the total time is sub-linear.
Needed space on HDD/SSD (or better yet on a flash card): 3.45GB; the batch file automatically decompresses the archives (.bsc files) when it is started for the first time.
The examples below were executed on a Toshiba laptop with an Intel Merom 2.16GHz CPU and Windows XP as the OS.
Here again the tested text is: The_Little_Match_Girl.txt 5,203 bytes.
After quadrupletoning it the result is a text file with 580 4grams: The_Little_Match_Girl.quadrupleton.txt 12,544 bytes.
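For those wondering what "quadrupletoning" amounts to: sliding a window of four words over the text and writing each window as one underscore-joined line. A minimal Python sketch under assumptions: the function name is made up, and the tokenization (lowercase runs of a-z), the windows crossing sentence boundaries, and the sorted distinct output are my guesses, not necessarily what the real tool does.

```python
import re

def quadrupletons(text):
    """Return the distinct 4grams of a text as sorted 'w1_w2_w3_w4' strings."""
    words = re.findall(r"[a-z]+", text.lower())
    # One window per starting position; a set drops the repeated windows.
    grams = {"_".join(words[i:i + 4]) for i in range(len(words) - 3)}
    return sorted(grams)

for gram in quadrupletons("To be or not to be or not"):
    print(gram)
```

A text with fewer than four words simply yields no 4grams.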
Starting a single batch file gives you 2 text files (211 seconds needed) with the 4grams Found in/Unfamiliar to the 140,222,335 4grams of the googlebooks-eng-us-all-4gram-20090715 corpus:
Code:
D:\Package 'Graphein' a 4-gram-Phrase-Checker, revision 1++>26Clash_Intel.BAT The_Little_Match_Girl.quadrupleton.txt
...
The_Little_Match_Girl.quadrupleton.txt_overlapped_all_distinct 7,590 bytes
The_Little_Match_Girl.quadrupleton.txt_unfamiliar_all_distinct 4,954 bytes
First file contains 370 4grams familiar to googlebooks-eng-us-all-4gram-20090715 corpus, some of them:
a_box_of_them
a_boy_had_run
...
with_such_a_glow
you_will_vanish_like
Second file contains 210 4grams unfamiliar to googlebooks-eng-us-all-4gram-20090715 corpus, some of them:
a_cradle_some_day
a_little_pathetic_figure
...
you_will_disappear_when
youngster_stretched_out_her
After quadrupletoning (the full collection of Sherlock Holmes stories) the result is a text file with 1,233,227 4grams: _Sherlock Holmes_Texts.quadrupleton.txt 27,024,497 bytes.
Starting a single batch file gives you 2 text files (318 seconds needed) with the 4grams Found in/Unfamiliar to the 140,222,335 4grams of the googlebooks-eng-us-all-4gram-20090715 corpus:
Code:
D:\Package 'Graphein' a 4-gram-Phrase-Checker, revision 1++>26Clash_Intel.BAT "_Sherlock Holmes_Texts.quadrupleton.txt"
...
_Sherlock Holmes_Texts.quadrupleton.txt_overlapped_all_distinct 12,532,297 bytes
_Sherlock Holmes_Texts.quadrupleton.txt_unfamiliar_all_distinct 14,492,200 bytes
First file contains 612,319 4grams familiar to googlebooks-eng-us-all-4gram-20090715 corpus, some of them:
a_and_b_cleared
a_and_b_companies
...
zone_of_light_and
zoo_and_see_the
Second file contains 620,908 4grams unfamiliar to googlebooks-eng-us-all-4gram-20090715 corpus, some of them:
a_alane_is_waur
a_appy_day_with
...
zuurfontein_by_as_many
zuurfontein_were_both_made
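The two timings above show how little the total time depends on the size of the incoming text. A tiny Python calculation using only the figures quoted in this post (the variable names are mine):

```python
small = {"fourgrams": 580, "seconds": 211}        # The_Little_Match_Girl
large = {"fourgrams": 1233227, "seconds": 318}    # Sherlock Holmes collection

size_ratio = large["fourgrams"] / small["fourgrams"]
time_ratio = large["seconds"] / small["seconds"]
print(f"input grew {size_ratio:.0f}x, time grew {time_ratio:.2f}x")
```

The input grows more than two thousand times while the time grows about one and a half times, which suggests (my reading, not a measurement) that most of the 200+ seconds goes into passing over the 140,222,335-line corpus itself.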
When you need a post or e-mail (or a whole e-book) checked for broken four-word phrases against the richest 4gram corpus so far, Graphein r.1++ is here: one 419MB ZIP file.
Also, the second (semi-auto) mode of operation is intact but faster, one screenshot here.
To shrink these 200+ seconds down to less than a second (not relying on CPU power and many GBs of available system RAM) a lot of bread I must eat...
In fact, I know exactly how to create the skeleton (not sacrificing speed a bit, with NO heavy load on system RAM or CPU) despite the 32bit coding limitations: create a single 10x3++GB file, a dump/mirror of the 140,000,000++ phrases already inserted into millions of b-trees. Spanning such a huge pool is well suited to flash memory (SSDs, SD cards) because of its fairly low latency (seek time), roughly 50 nanoseconds/microseconds/milliseconds for system RAM/flash/hard disk respectively. What this greedy approach (only 10% memory utilization) needs most is low latency, not high bandwidth.
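To get a feel for such a latency-bound lookup, here is a much simpler stand-in than b-trees: a binary search straight over one big sorted text file, touching the disk only through seeks and holding nothing in RAM. It is a Python sketch under assumptions, not the planned design: the function name `contains` is made up, and the file is assumed to be sorted byte-wise, one phrase per line, with LF line endings.

```python
import os
import tempfile

def contains(path, target):
    """Binary search for `target` among the sorted lines of the file at `path`."""
    key = target.encode("ascii")
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)      # window of candidate line starts
        while lo < hi:
            mid = (lo + hi) // 2
            if mid == 0:
                f.seek(0)
            else:
                f.seek(mid - 1)
                f.readline()                   # move to the first line start >= mid
            start = f.tell()
            if start >= hi:                    # no line starts in [mid, hi)
                hi = mid
                continue
            line = f.readline().rstrip(b"\n")
            if line == key:
                return True
            if line < key:
                lo = start + len(line) + 1     # continue after this line
            else:
                hi = mid                       # target can only start before mid
        return False

# Demo on a tiny sorted file (LF line endings, byte-wise order).
phrases = sorted(["to_be_in_the", "a_bad_day_of", "a_bad_day_when"])
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(("\n".join(phrases) + "\n").encode("ascii"))
found = contains(tmp.name, "a_bad_day_when")
missing = contains(tmp.name, "a_good_day_of")
os.remove(tmp.name)
print(found, missing)
```

Each lookup halves the byte window, so a file of about 3GB needs around 32 seeks; with the rough latencies quoted above that is a couple of milliseconds per phrase on flash, versus seconds on a hard disk.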
Nevertheless I would appreciate any how-to-do-it hint.
Enjoy!