  1. #11
    Sanmayce

    Re: Brutally fast 4-gram phrase ripper

    Quote Originally Posted by BobK View Post
    Was 'nowayears' a joke?
    He-he, it is all about playing with words, as you have already noticed. The charming part of broken English is exactly the lack of any bias and, why not, a disregard of RULES.

    Thanks for pointing it out

  2. #12
    Sanmayce

    Re: Brutally fast 4-gram phrase ripper

    Can you list the common vocabulary (1-grams/words as well as 4-grams/phrases) of the Hercule Poirot and Sherlock Holmes stories?

    The awfully slow revision 'Compare_Two_Wordlists_r1+.exe' is now replaced by 'Overlapper-Blender_r1.exe', which is fast enough.
    This allows (in the next Graphein r.1++) large texts to be checked without those disgusting delays.
    I have an old (even better) idea: to make one variant of Leprechaun which solves all these mini push-ups radically; for now, Overlapper-Blender is not a bad utility.
    Overlapper-Blender works with any type of strings, in particular 1-grams and 4-grams.
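
    For the curious, here is a toy sketch of the core idea (NOT the real Overlapper-Blender_r1.c, which does much more): once the two wordlists are sorted and deduplicated, a single merge pass produces both the blended list (the union, no repetitions) and the overlapped list (the lines common to both files). The tiny 1.txt/2.txt example from the log below is hard-coded in it.
    Code:
    /* overlap_blend_sketch.c -- a toy illustration, NOT the real Overlapper-Blender_r1.c.
       Given two sorted, deduplicated wordlists, one merge pass yields both the
       blended list (union, no repetitions) and the overlapped list (common lines).
       Compile: cc overlap_blend_sketch.c -o overlap_blend_sketch */
    #include <stdio.h>
    #include <string.h>

    static void overlap_blend(const char **a, size_t na,
                              const char **b, size_t nb,
                              FILE *blended, FILE *overlapped)
    {
        size_t i = 0, j = 0;
        while (i < na && j < nb) {
            int cmp = strcmp(a[i], b[j]);
            if (cmp < 0)                          /* only in the 1st list */
                fprintf(blended, "%s\n", a[i++]);
            else if (cmp > 0)                     /* only in the 2nd list */
                fprintf(blended, "%s\n", b[j++]);
            else {                                /* common line */
                fprintf(blended, "%s\n", a[i]);
                fprintf(overlapped, "%s\n", a[i]);
                i++; j++;
            }
        }
        while (i < na) fprintf(blended, "%s\n", a[i++]);   /* leftovers */
        while (j < nb) fprintf(blended, "%s\n", b[j++]);
    }

    int main(void)
    {
        /* the tiny 1.txt / 2.txt example from the log below */
        const char *list1[] = { "a_bad_day_of", "a_bad_day_when" };
        const char *list2[] = { "a_bad_day_of" };

        FILE *blended = fopen("Blended.txt", "w");
        FILE *overlapped = fopen("Overlapped.txt", "w");
        if (!blended || !overlapped) { perror("fopen"); return 1; }

        overlap_blend(list1, 2, list2, 1, blended, overlapped);

        fclose(blended);                /* Blended.txt: 2 lines    */
        fclose(overlapped);             /* Overlapped.txt: 1 line  */
        return 0;
    }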

    One important aspect of making cross-references is exploring the usage of the most frequent phrases (4-grams) across several corpora of such 4-grams.
    Here I will show how to generate a single text file (one 4-gram per line) containing the 4-grams common to four given corpora: the Agatha Christie collection, the Sherlock Holmes collection, the Sunnah, Hadith and Qur'an collection, and four versions of The Holy Bible.
    In short:
    The log below contains the steps needed to create the 'Agatha+Sherlock+Islam+Bible_Overlapped.txt' file.
    Some stats:
    By clashing the 'Agatha Christie' corpus (2,615,513 4-grams) against the 'Sherlock Holmes' corpus (1,233,227 4-grams), the outcome is 102,201 overlapped/common 4-grams.
    By clashing the 'Sunnah and Hadith and Qur'an' corpus (936,195 4-grams) against 'The Holy Bible' corpus (795,822 4-grams), the outcome is 29,038 overlapped/common 4-grams.
    And finally, the clash between those 102,201 and 29,038 results in only 5,940 4-grams.
    All these 4-word phrases constitute one important part of the phraseology common to these four major sources.
    An excerpt (all the to_be's) from this list of 5,940 4-grams:
    ...
    to_be_a_great
    to_be_cut_off
    to_be_deprived_of
    to_be_done_by
    to_be_done_with
    to_be_found_in
    to_be_free_from
    to_be_given_to
    to_be_his_wife
    to_be_in_the
    to_be_kept_in
    to_be_left_behind
    to_be_married_to
    to_be_more_than
    to_be_of_service
    to_be_on_the
    to_be_one_of
    to_be_regarded_as
    to_be_seen_by
    to_be_seen_of
    to_be_taken_from
    to_be_taken_to
    to_be_the_most
    to_be_under_the
    to_be_used_for
    to_be_used_in
    to_be_with_the
    to_be_with_you
    ...


    This exercise reveals how many (102,201) 4-word phrases are common to Agatha Christie's style and Conan Doyle's style.
    It also shows the recipe for creating your own high-quality corpus of 1-grams/4-grams.
    That is how I created my HQ wordlist of 1-grams: as far as I remember, from 13 low-quality and unknown-quality spell-checker wordlists.
    Soon I will create my own 4-gram wordlist by clashing various not-so-small corpora; the way of mixing them reminds me of Barry White's 'Put Me In Your Mix' super-hit.

    Code:
    D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>dir
    
    03/01/2011  07:25 AM                30 1.txt
    03/01/2011  07:25 AM                14 2.txt
    03/01/2011  07:25 AM            41,768 Overlapper-Blender_r1.c
    03/01/2011  07:25 AM            64,000 Overlapper-Blender_r1.exe
    03/01/2011  07:25 AM        57,389,250 _Agatha Christie_Texts.txt
    03/01/2011  07:25 AM        27,024,497 _Sherlock Holmes_Texts.txt
    03/01/2011  07:25 AM        20,326,151 _Sunnah and Hadith and Qur'an.txt
    03/01/2011  07:25 AM        17,183,313 _The_Holy_Bible_4-versions.txt
    
    D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>type 1.txt
    a_bad_day_of
    a_bad_day_when
    
    D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>type 2.txt
    a_bad_day_of
    
    D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>Overlapper-Blender_r1.exe 1.txt 2.txt
    Overlapper-Blender r.1, mix of Compare_Two_Wordlists, revision 1+ and Building-Blocks_DUMPER rev.1, written by Kaze.
    Usage: Overlapper-Blender wordlistfile1 wordlistfile2
    Note1: wordlistfile1's lines encountered in wordlistfile2's lines go to 'Overlapped.txt' file.
    Note2: wordlistfile1's lines blended (no repetitions allowed) with wordlistfile2's lines go to 'Blended.txt' file.
    Size of 1st input file: 30
    Size of 2nd input file: 14
    Allocating 512MB ...
    Lines in 1st input file: 2
    Lines in 2nd input file: 1
    Allocated memory for pointers-to-words in MB: 1
    Sorting 3 Pointers ...
    Deduplicating duplicates and dumping all into 'Blended.txt' ...
    Dumping deduplicated duplicates into 'Overlapped.txt' ...
    Blended lines: 2
    Overlapped lines: 1
    
    D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>dir
    
    03/01/2011  07:25 AM                30 1.txt
    03/01/2011  07:25 AM                14 2.txt
    03/01/2011  07:26 AM                30 Blended.txt
    03/01/2011  07:26 AM                14 Overlapped.txt
    03/01/2011  07:25 AM            41,768 Overlapper-Blender_r1.c
    03/01/2011  07:25 AM            64,000 Overlapper-Blender_r1.exe
    03/01/2011  07:25 AM        57,389,250 _Agatha Christie_Texts.txt
    03/01/2011  07:25 AM        27,024,497 _Sherlock Holmes_Texts.txt
    03/01/2011  07:25 AM        20,326,151 _Sunnah and Hadith and Qur'an.txt
    03/01/2011  07:25 AM        17,183,313 _The_Holy_Bible_4-versions.txt
    
    D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>type Blended.txt
    a_bad_day_of
    a_bad_day_when
    
    D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>type Overlapped.txt
    a_bad_day_of
    
    D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>Overlapper-Blender_r1.exe "_Agatha Christie_Texts.txt" "_Sherlock Holmes_Texts.txt"
    Overlapper-Blender r.1, mix of Compare_Two_Wordlists, revision 1+ and Building-Blocks_DUMPER rev.1, written by Kaze.
    Usage: Overlapper-Blender wordlistfile1 wordlistfile2
    Note1: wordlistfile1's lines encountered in wordlistfile2's lines go to 'Overlapped.txt' file.
    Note2: wordlistfile1's lines blended (no repetitions allowed) with wordlistfile2's lines go to 'Blended.txt' file.
    Size of 1st input file: 57389250
    Size of 2nd input file: 27024497
    Allocating 512MB ...
    Lines in 1st input file: 2615513
    Lines in 2nd input file: 1233227
    Allocated memory for pointers-to-words in MB: 15
    Sorting 3848740 Pointers ...
    Deduplicating duplicates and dumping all into 'Blended.txt' ...
    Dumping deduplicated duplicates into 'Overlapped.txt' ...
    Blended lines: 3746539
    Overlapped lines: 102201
    
    D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>ren Blended.txt Agatha+Sherlock_Blended.txt
    
    D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>ren Overlapped.txt Agatha+Sherlock_Overlapped.txt
    
    D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>Overlapper-Blender_r1.exe "_Sunnah and Hadith and Qur'an.txt" _The_Holy_Bible_4-versions.txt
    Overlapper-Blender r.1, mix of Compare_Two_Wordlists, revision 1+ and Building-Blocks_DUMPER rev.1, written by Kaze.
    Usage: Overlapper-Blender wordlistfile1 wordlistfile2
    Note1: wordlistfile1's lines encountered in wordlistfile2's lines go to 'Overlapped.txt' file.
    Note2: wordlistfile1's lines blended (no repetitions allowed) with wordlistfile2's lines go to 'Blended.txt' file.
    Size of 1st input file: 20326151
    Size of 2nd input file: 17183313
    Allocating 512MB ...
    Lines in 1st input file: 936195
    Lines in 2nd input file: 795822
    Allocated memory for pointers-to-words in MB: 7
    Sorting 1732017 Pointers ...
    Deduplicating duplicates and dumping all into 'Blended.txt' ...
    Dumping deduplicated duplicates into 'Overlapped.txt' ...
    Blended lines: 1702979
    Overlapped lines: 29038
    
    D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>ren Blended.txt Islam+Bible_Blended.txt
    
    D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>ren Overlapped.txt Islam+Bible_Overlapped.txt
    
    D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>dir
    
    03/01/2011  07:25 AM                30 1.txt
    03/01/2011  07:25 AM                14 2.txt
    03/01/2011  07:26 AM        82,470,333 Agatha+Sherlock_Blended.txt
    03/01/2011  07:26 AM         1,943,414 Agatha+Sherlock_Overlapped.txt
    03/01/2011  07:28 AM        36,965,004 Islam+Bible_Blended.txt
    03/01/2011  07:28 AM           544,460 Islam+Bible_Overlapped.txt
    03/01/2011  07:25 AM            41,768 Overlapper-Blender_r1.c
    03/01/2011  07:25 AM            64,000 Overlapper-Blender_r1.exe
    03/01/2011  07:25 AM        57,389,250 _Agatha Christie_Texts.txt
    03/01/2011  07:25 AM        27,024,497 _Sherlock Holmes_Texts.txt
    03/01/2011  07:25 AM        20,326,151 _Sunnah and Hadith and Qur'an.txt
    03/01/2011  07:25 AM        17,183,313 _The_Holy_Bible_4-versions.txt
    
    D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>Overlapper-Blender_r1.exe "Agatha+Sherlock_Overlapped.txt" "Islam+Bible_Overlapped.txt"
    Overlapper-Blender r.1, mix of Compare_Two_Wordlists, revision 1+ and Building-Blocks_DUMPER rev.1, written by Kaze.
    Usage: Overlapper-Blender wordlistfile1 wordlistfile2
    Note1: wordlistfile1's lines encountered in wordlistfile2's lines go to 'Overlapped.txt' file.
    Note2: wordlistfile1's lines blended (no repetitions allowed) with wordlistfile2's lines go to 'Blended.txt' file.
    Size of 1st input file: 1943414
    Size of 2nd input file: 544460
    Allocating 512MB ...
    Lines in 1st input file: 102201
    Lines in 2nd input file: 29038
    Allocated memory for pointers-to-words in MB: 1
    Sorting 131239 Pointers ...
    Deduplicating duplicates and dumping all into 'Blended.txt' ...
    Dumping deduplicated duplicates into 'Overlapped.txt' ...
    Blended lines: 125299
    Overlapped lines: 5940
    
    D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>type Overlapped.txt|more
    a_breath_of_the
    a_change_in_the
    ^C
    D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>ren Overlapped.txt "Agatha+Sherlock+Islam+Bible_Overlapped.txt"
    
    D:\_KAZE_new-stuff_2011-Feb-26\Overlapper-Blender_r1>dir "Agatha+Sherlock+Islam+Bible_Overlapped.txt"
    
    03/01/2011  07:31 AM           104,584 Agatha+Sherlock+Islam+Bible_Overlapped.txt
    
    All the files of the whole example are available here: one 30MB ZIP archive.
    Enjoy!

  3. #13
    Sanmayce

    Re: Brutally fast 4-gram phrase ripper

    Refinement continues...
    Here comes Overlapper-Blender_r1+.
    Overlapper-Blender revision 1 had one shortcoming (it did not report the unfamiliar words; now exterminated), so a few very useful features were added: creation of 'Unfamiliar.txt' and more stats.

    One short DIZ/description TXT file, here: 16.1KB.
    One short DIZ/description PDF file, here: 74.2KB.
    Overlapper-Blender_r1+.zip file (contains Windows console executable, C source, four 4gram wordlists), here: 30.3MB.

    Having added this (I updated the Dummy-Check package to r.2), I will now show how to spot misspelled/new words in a bunch of incoming TXT files.
    The wordlist in use contains 351,116 words.
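
    The new 'Unfamiliar.txt' output is essentially a set difference: every ripped word that cannot be found in the dictionary wordlist is a candidate typo or new word. A toy sketch of that idea (with a tiny hard-coded dictionary, NOT the real r1+ source) using qsort() and bsearch():
    Code:
    /* unfamiliar_sketch.c -- a toy illustration, NOT the real Overlapper-Blender_r1+.
       Words from the ripped list that are not found in the sorted dictionary
       wordlist are printed as "unfamiliar" (possible typos or new words). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static int cmp_str(const void *a, const void *b)
    {
        return strcmp(*(const char * const *)a, *(const char * const *)b);
    }

    int main(void)
    {
        /* a toy dictionary; the real run used english.dic_351116_wordlist */
        const char *dict[] = { "dictionary", "english", "history", "oxford", "the" };
        size_t ndict = sizeof dict / sizeof dict[0];
        qsort(dict, ndict, sizeof dict[0], cmp_str);   /* make sure it is sorted */

        /* a few words as if ripped from the incoming TXT files */
        const char *ripped[] = { "oxford", "addlestone", "the", "yockney" };
        size_t nripped = sizeof ripped / sizeof ripped[0];

        for (size_t i = 0; i < nripped; i++) {
            const char *key = ripped[i];
            if (!bsearch(&key, dict, ndict, sizeof dict[0], cmp_str))
                printf("%s\n", key);               /* would go to Unfamiliar.txt */
        }
        return 0;   /* prints: addlestone, yockney */
    }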

    Under quick-and-dummy spell-checking is the text 'The history of the Oxford English Dictionary', taken from the OED on CD-ROM HTML help files:

    Code:
    D:\_KAZE_new-stuff\Dummy_Check_package_r2>dir/s
     Volume in drive D is H320_Vol5
     Volume Serial Number is 0CB3-C881
    
     Directory of D:\_KAZE_new-stuff\Dummy_Check_package_r2
    
    03/03/2011  12:03 AM    <DIR>          .
    03/03/2011  12:03 AM    <DIR>          ..
    03/03/2011  12:03 AM               259 Dummy_Check.bat
    03/03/2011  12:03 AM         4,024,155 english.dic_351116_wordlist
    03/03/2011  12:03 AM            94,208 Leprechaun_r13++++++_Microsoft_16.00.30319.01.exe
    03/03/2011  12:03 AM            66,048 Overlapper-Blender_r1+.exe
    03/02/2011  11:48 PM    <DIR>          TREE_of_TXT_files_to_be_processed
    03/03/2011  12:03 AM            34,606 Yoshi_r6.exe
                   5 File(s)      4,219,276 bytes
    
     Directory of D:\_KAZE_new-stuff\Dummy_Check_package_r2\TREE_of_TXT_files_to_be_processed
    
    03/02/2011  11:48 PM    <DIR>          .
    03/02/2011  11:48 PM    <DIR>          ..
    03/03/2011  12:03 AM            15,816 oed2_hist.txt
    03/03/2011  12:03 AM            17,128 oed2_hist10.txt
    03/03/2011  12:03 AM             8,475 oed2_hist11.txt
    03/03/2011  12:03 AM            13,394 oed2_hist12.txt
    03/03/2011  12:03 AM            11,942 oed2_hist13.txt
    03/03/2011  12:03 AM            12,366 oed2_hist2.txt
    03/03/2011  12:03 AM            11,197 oed2_hist3.txt
    03/03/2011  12:03 AM             9,752 oed2_hist4.txt
    03/03/2011  12:03 AM            12,589 oed2_hist5.txt
    03/03/2011  12:03 AM            11,206 oed2_hist6.txt
    03/03/2011  12:03 AM            15,374 oed2_hist7.txt
    03/03/2011  12:03 AM            15,962 oed2_hist8.txt
    03/03/2011  12:03 AM            12,009 oed2_hist9.txt
                  13 File(s)        167,210 bytes
    
         Total Files Listed:
                  18 File(s)      4,386,486 bytes
                   5 Dir(s)   1,004,392,448 bytes free
    
    D:\_KAZE_new-stuff\Dummy_Check_package_r2>type Dummy_Check.bat
    cd TREE_of_TXT_files_to_be_processed
    ..\Yoshi_r6.exe -f -o..\Dummy_Check.lst *.txt
    cd..
    Leprechaun_r13++++++_Microsoft_16.00.30319.01.exe Dummy_Check.lst Dummy_Check.lst.wrd 3000
    Overlapper-Blender_r1+.exe Dummy_Check.lst.wrd english.dic_351116_wordlist
    
    D:\_KAZE_new-stuff\Dummy_Check_package_r2>Dummy_Check.bat
    
    D:\_KAZE_new-stuff\Dummy_Check_package_r2>cd TREE_of_TXT_files_to_be_processed
    
    D:\_KAZE_new-stuff\Dummy_Check_package_r2\TREE_of_TXT_files_to_be_processed>..\Yoshi_r6.exe -f -o..\Dummy_Check.lst *.txt
    Yoshi(Filelist Creator), revision 06, written by Svalqyatchx,
    in fact based on SWEEP.C from 'Open Watcom Project', thanks-thanks.
    
    Note1: So far, it works for current directory only.
    Note2: Default method is depth-first traversal;
           may use pipe 'Yoshi|sort' for breadth-first_like traversal results.
    Note3: Make notice that '*.*'(extensionfull only) is not equal to '*'(all);
           one disadvantage is an inability to list only extensionless filenames.
    Note4: Search is case-insensitive as-must.
    Note5: This revision allows multiple '*', and meaning of masks is:
           '?' - any character AND NOT EMPTY(default, for OR EMPTY see option -e);
           '*' - any character(s) or empty.
    Note6: What is a .LBL(LineByLine) file?
           it is a bunch of GRAMMATICAL lines not mere LF or CRLF lines;
           it contains not symbols under 32(except CR and LF) and above 127;
           it contains not space symbol sequences.
    Usage:
          Yoshi [option(s)] [filename(s)]
          option(s):
             -v           i.e. verbose mode; output goes to console;
             -f           i.e. fullpath mode for output;
             -e           i.e. treat '?' as any character OR EMPTY;
             -t           i.e. touch all encountered files;
             -2           i.e. convert all encountered .TXT files to .LBL files;
             -o<filename> i.e. output goes to file(in append mode).
          filename(s):
             Wildcards '*' and wildcards '?' are allowed i.e. "str*.c??";
             default filename is '*'; DO NOT FORGET TO PUT
             filename(s) WITH WILDCARD(S) INTO QUOTE MARKS!
    Examples:
          Yoshi -v -f -oCaterpillar_NON.lst "*.lbl" "*.txt" "*.htm" "*.html"
          Yoshi -f -oMyEbooks.txt "*wiley*essential*.pdf" "*russian*.*htm"
    
    Yoshi: Total size of files: 00,000,000,167,210 bytes.
    Yoshi: Total files: 000,000,000,013.
    Yoshi: Total folders: 0,000,000,000.
    
    D:\_KAZE_new-stuff\Dummy_Check_package_r2\TREE_of_TXT_files_to_be_processed>cd..
    
    D:\_KAZE_new-stuff\Dummy_Check_package_r2>Leprechaun_r13++++++_Microsoft_16.00.30319.01.exe Dummy_Check.lst Dummy_Check.lst.wrd 3000
    Leprechaun(Fast Greedy Word-Ripper), revision 13++++++, written by Svalqyatchx.
    Leprechaun: 'Oh, well, didn't you hear? Bigger is good, but jumbo is dear.'
    Kaze: Let's see what a 3-way hash + 6,602,752 Binary-Search-Trees can give us,
          also the performance of a 3-way hash + 6,602,752 B-Trees of order 3.
    Size of input file with files for Leprechauning: 1550
    Allocating memory 1170MB ... OK
    Size of Input TEXTual file: 15,816
    |; Word count: 2,572 of them 790 distinct; Done: 64/64
    Size of Input TEXTual file: 17,128
    /; Word count: 5,290 of them 1,359 distinct; Done: 64/64
    Size of Input TEXTual file: 8,475
    -; Word count: 6,618 of them 1,570 distinct; Done: 64/64
    Size of Input TEXTual file: 13,394
    \; Word count: 8,930 of them 2,014 distinct; Done: 64/64
    Size of Input TEXTual file: 11,942
    |; Word count: 11,035 of them 2,493 distinct; Done: 64/64
    Size of Input TEXTual file: 12,366
    /; Word count: 13,117 of them 2,714 distinct; Done: 64/64
    Size of Input TEXTual file: 11,197
    -; Word count: 14,968 of them 2,914 distinct; Done: 64/64
    Size of Input TEXTual file: 9,752
    \; Word count: 16,604 of them 3,078 distinct; Done: 64/64
    Size of Input TEXTual file: 12,589
    |; Word count: 18,726 of them 3,237 distinct; Done: 64/64
    Size of Input TEXTual file: 11,206
    /; Word count: 20,545 of them 3,388 distinct; Done: 64/64
    Size of Input TEXTual file: 15,374
    -; Word count: 22,972 of them 3,601 distinct; Done: 64/64
    Size of Input TEXTual file: 15,962
    \; Word count: 25,447 of them 3,815 distinct; Done: 64/64
    Size of Input TEXTual file: 12,009
    |; Word count: 27,328 of them 3,974 distinct; Done: 64/64
    Bytes per second performance: 167,210B/s
    Words per second performance: 27,328W/s
    Flushing unsorted words ...
    Time for making unsorted wordlist: 1 second(s)
    Deallocated memory in MB: 1170
    Allocated memory for words in MB: 1
    Allocated memory for pointers-to-words in MB: 1
    Sorting(with 'MultiKeyQuickSortX26Sort' by J. Bentley and R. Sedgewick) ...
    Sort pass 26/26 ...
    Flushing sorted words ...
    Time for sorting unsorted wordlist: 1 second(s)
    Leprechaun: Done.
    
    D:\_KAZE_new-stuff\Dummy_Check_package_r2>Overlapper-Blender_r1+.exe Dummy_Check.lst.wrd english.dic_351116_wordlist
    Overlapper-Blender r.1+, written by Kaze.
    Size of 1st input file: 36609
    Size of 2nd input file: 4024155
    Allocating 1024MB ...
    Lines in 1st input file: 3974
    Lines in 2nd input file: 351116
    Allocated memory for pointers-to-words in MB: 2
    Allocated memory for pointers-to-words in MB: 1
    Sorting 355090 Pointers ...
    Deduplicating duplicates and dumping all into 'Blended.txt' ...
    Dumping deduplicated duplicates into 'Overlapped.txt' ...
    Dumping all-from-first-file except deduplicated duplicates into 'Unfamiliar.txt' ...
    Blended lines, i.e. combined lines from both files: 351623
    Overlapped lines, i.e. lines common for both files: 3467
    Unfamiliar lines, i.e. lines from 1st file not encountered in 2nd file: 507
    
    D:\_KAZE_new-stuff\Dummy_Check_package_r2>type Unfamiliar.txt
    abrm
    ada
    addenbrooke
    addlestone
    ...
    wyllie
    wyndham
    yockney
    yonge
    yvonne
    zorc
    
    D:\_KAZE_new-stuff\Dummy_Check_package_r2>

    Actually I could not find any mistakes among those 507 words from 'Unfamiliar.txt', but I did find something more ominous than a typo: no (formal) RECOGNITION whatsoever of Samuel Johnson's contribution, caramba! If the OED staff is aware of this ... it is worse than UNGRATEFULNESS! A whole train full of contributors is enlisted, and no SEAT for my man. I had a different notion of the famous English politeness: not a superficial courtesy but plain gratefulness.

  4. #14
    Sanmayce

    Re: Brutally fast 4-gram phrase ripper

    The 10,000-character limit forced me to split my post in two:

    Only some "Johnson" was mentioned, as if some irrelevant meddler were babbling something:

    "The example of Johnson and Richardson had shown clearly that the citation of authority
    for a word was one of the essentials for establishing its meaning and tracing its
    history. It was therefore obvious that the first step towards the building up of a new
    dictionary must be the assembling of such authority, in the form of quotations from
    English writings throughout the various periods of the language.
    Johnson and Richardson
    had been selective in the material they assembled, and obviously some kind of selection
    would be imposed by practical limits, however wide the actual range might be."
    /An excerpt from 'The history of the Oxford English Dictionary' OED on CD-ROM/

    "... The next stage is marked by
    Johnson's systematic use of quotations to illustrate and justify the definitions, the
    many omissions still existing in the vocabulary being partly filled by later
    supplementary works on the same lines. When to all this was superadded the principle of
    historical illustration, introduced by Richardson, it became inevitable that any
    adequate dictionary of English must be one of the larger books of the world."
    /An excerpt from 'The history of the Oxford English Dictionary' OED on CD-ROM/

    "It is remarkable that Richardson's dictionary, perhaps through certain defects in his
    method, did not at once attract the attention it deserved. From the appearance of the
    first instalment in the Encyclopaedia Metropolitana in 1819 to the full acceptance of
    the historical principle by the Philological Society almost forty years had passed, and
    the separate publication of his dictionary in 1836-7 did not affect to any appreciable
    extent the work of those lexicographers who followed in the wake of Johnson or Webster.
    Even his wealth of quotations remained unutilized, although they formed a natural
    storehouse for any who cared to search in it and bring forth 'treasures new and old' to
    add to those already available in the works of Johnson and his successors."
    /An excerpt from 'The history of the Oxford English Dictionary' OED on CD-ROM/

    And what knocked me completely down was the absence of the word 'SAMUEL' from the OED entry list; and as for seeing the etymology of this name/word, forget about it (the Heritage dictionary explains it, though, tracing its Semitic roots)!
    And if the above is not a wake-up call for the OED staff...

    Dummy_Check_package_r2.zip file, here: 1.1MB.

    Enjoy!

    Add-on:
    Google did not bother to supply stats about their CSV files, so here is some info about the US English 4-grams from 2009 July 15.
    Here I give the exact number of pure (no year/pages/... columns) distinct 4-grams derived from all 400 'googlebooks-eng-us-all-4gram-20090715' CSV files (a rough sketch of the stripping step follows the totals below):

    googlebooks-eng-us-all-4gram-20090715-graffith_A_distinct: 437,808,652 bytes, 17,981,107 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_B_distinct: 159,141,163 bytes, 6,571,872 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_C_distinct: 160,011,167 bytes, 6,212,540 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_D_distinct: 97,107,487 bytes, 3,856,617 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_E_distinct: 88,831,581 bytes, 3,424,994 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_F_distinct: 129,873,927 bytes, 5,282,784 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_G_distinct: 51,318,288 bytes, 2,116,401 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_H_distinct: 164,940,851 bytes, 6,760,278 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_I_distinct: 234,234,813 bytes, 9,449,270 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_J_distinct: 10,856,482 bytes, 444,251 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_K_distinct: 13,466,244 bytes, 569,361 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_L_distinct: 74,101,010 bytes, 3,123,807 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_M_distinct: 125,532,372 bytes, 5,180,952 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_N_distinct: 73,979,970 bytes, 3,075,105 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_O_distinct: 257,378,814 bytes, 10,718,140 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_P_distinct: 134,588,800 bytes, 5,222,828 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_Q_distinct: 6,573,966 bytes, 257,343 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_R_distinct: 90,619,671 bytes, 3,565,405 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_S_distinct: 219,649,789 bytes, 8,736,465 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_T_distinct: 638,879,823 bytes, 24,309,233 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_U_distinct: 39,351,963 bytes, 1,640,327 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_V_distinct: 23,544,104 bytes, 957,759 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_W_distinct: 236,365,992 bytes, 9,738,971 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_X_distinct: 157,465 bytes, 6,593 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_Y_distinct: 24,202,157 bytes, 1,000,248 distinct lines
    googlebooks-eng-us-all-4gram-20090715-graffith_Z_distinct: 463,569 bytes, 19,684 distinct lines

    Total size/number of 4-grams: 3,233,748,341 bytes / 140,222,335 distinct lines.
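
    For anyone who wants to repeat the stripping step, a rough sketch (NOT the tool actually used above): assuming the usual tab-separated layout of these V1 files (ngram, year, match_count, page_count, volume_count) and that the lines for one ngram (one per year) sit next to each other, keep only the ngram column, replace the spaces with underscores to match the format above, and drop adjacent repeats:
    Code:
    /* distinct4grams_sketch.c -- a rough sketch only, NOT the tool actually used.
       Assumed input layout (tab-separated, one line per ngram per year):
           ngram<TAB>year<TAB>match_count<TAB>page_count<TAB>volume_count
       Keeps only the ngram column, replaces spaces with underscores, and drops
       adjacent repeats (assuming lines for the same ngram sit next to each other).
       Usage: distinct4grams_sketch < one-4gram-csv-file > distinct.txt */
    #include <stdio.h>
    #include <string.h>

    #define MAXLINE 4096

    int main(void)
    {
        char line[MAXLINE], prev[MAXLINE] = "";

        while (fgets(line, sizeof line, stdin)) {
            char *tab = strchr(line, '\t');
            if (!tab) continue;                 /* malformed line, skip it */
            *tab = '\0';                        /* keep only the ngram column */

            for (char *p = line; *p; p++)       /* "a bad day of" -> "a_bad_day_of" */
                if (*p == ' ') *p = '_';

            if (strcmp(line, prev) != 0) {      /* adjacent duplicates only */
                puts(line);
                strcpy(prev, line);
            }
        }
        return 0;
    }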


    Because many of the 4-grams here are meaningless, and because a rich collection is needed, it is obvious that only several times as many will do a serious job.
    To see what converts into what, you may read this log.

  5. #15
    Sanmayce

    Re: Brutally fast 4-gram phrase ripper

    Stomp stomp I have arrived...

    Another brute-force approach was taken in order to make the awfully slow Graphein r.1 more bearable, resulting in Graphein r.1++ (now only 403MB, an 8:1 compression, thanks to the most advanced text compressor BSC, written by Ilya Grebnov, used in GRAFFITH_r2++_Graphein.exe).
    Here, instead of waiting 20 minutes just to get the Found and Unfamiliar 4-grams for a small incoming file, the latency is now under 4 minutes, and what is more important: for large incoming texts (like _Sherlock Holmes_Texts.quadrupleton.txt) the total time grows sub-linearly.
    Needed space on HDD/SSD (or, better yet, on a flash card): 3.45GB; the batch file automatically decompresses the archives (.bsc files) when it is started for the first time.
    The examples below were executed on a Toshiba laptop with an Intel Merom 2.16GHz CPU running Windows XP.

    Here again the tested text is The_Little_Match_Girl.txt, 5,203 bytes.
    After quadrupletoning it, the result is a text file with 580 4-grams: The_Little_Match_Girl.quadrupleton.txt, 12,544 bytes.
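
    For readers wondering what 'quadrupletoning' amounts to, here is a toy sketch of the idea (the real quadrupletoner does more cleaning, plus deduplication and sorting, so treat this only as an illustration): slide a window of four consecutive words over the text, lowercase them and join them with underscores, one 4-gram per line.
    Code:
    /* quadrupleton_sketch.c -- an illustration of the 4-gram ("quadrupleton") idea,
       NOT the actual ripper from the package: slide a 4-word window over the input
       text, lowercase the words and join them with underscores, one 4-gram per line.
       Deduplication and sorting are left to the other tools (or to a pipe | sort).
       Usage: quadrupleton_sketch < The_Little_Match_Girl.txt */
    #include <stdio.h>
    #include <ctype.h>
    #include <string.h>

    #define MAXWORD 64

    int main(void)
    {
        char win[4][MAXWORD] = {{0}};   /* the last four words seen */
        char word[MAXWORD];
        int have = 0, len = 0, c;

        while ((c = getchar()) != EOF) {
            if (isalpha(c) || c == '\'') {              /* letters (and ') build a word */
                if (len < MAXWORD - 1)
                    word[len++] = (char)tolower(c);
            } else if (len > 0) {                       /* a word just ended */
                word[len] = '\0';
                len = 0;
                memmove(win[0], win[1], sizeof win - sizeof win[0]);  /* shift window */
                strcpy(win[3], word);
                if (have < 4) have++;
                if (have == 4)
                    printf("%s_%s_%s_%s\n", win[0], win[1], win[2], win[3]);
            }
        }
        return 0;
    }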

    By starting a single batch file you can get 2 text files (211 seconds needed) with the Found/Unfamiliar 4-grams relative to the 140,222,335 4-grams of the googlebooks-eng-us-all-4gram-20090715 corpus:
    Code:
    D:\Package 'Graphein' a 4-gram-Phrase-Checker, revision 1++>26Clash_Intel.BAT The_Little_Match_Girl.quadrupleton.txt
    ...
    The_Little_Match_Girl.quadrupleton.txt_overlapped_all_distinct 7,590 bytes
    The_Little_Match_Girl.quadrupleton.txt_unfamiliar_all_distinct 4,954 bytes
    The first file contains 370 4-grams familiar to the googlebooks-eng-us-all-4gram-20090715 corpus, some of them:
    a_box_of_them
    a_boy_had_run
    ...
    with_such_a_glow
    you_will_vanish_like


    The second file contains 210 4-grams unfamiliar to the googlebooks-eng-us-all-4gram-20090715 corpus, some of them:
    a_cradle_some_day
    a_little_pathetic_figure
    ...
    you_will_disappear_when
    youngster_stretched_out_her


    After quadrupletoning the full collection of Sherlock Holmes stories, the result is a text file with 1,233,227 4-grams: _Sherlock Holmes_Texts.quadrupleton.txt, 27,024,497 bytes.

    By starting a single batch file you can get 2 text files (318 seconds needed) with the Found/Unfamiliar 4-grams relative to the 140,222,335 4-grams of the googlebooks-eng-us-all-4gram-20090715 corpus:
    Code:
    D:\Package 'Graphein' a 4-gram-Phrase-Checker, revision 1++>26Clash_Intel.BAT "_Sherlock Holmes_Texts.quadrupleton.txt"
    ...
    _Sherlock Holmes_Texts.quadrupleton.txt_overlapped_all_distinct 12,532,297 bytes
    _Sherlock Holmes_Texts.quadrupleton.txt_unfamiliar_all_distinct 14,492,200 bytes
    The first file contains 612,319 4-grams familiar to the googlebooks-eng-us-all-4gram-20090715 corpus, some of them:
    a_and_b_cleared
    a_and_b_companies
    ...
    zone_of_light_and
    zoo_and_see_the


    The second file contains 620,908 4-grams unfamiliar to the googlebooks-eng-us-all-4gram-20090715 corpus, some of them:
    a_alane_is_waur
    a_appy_day_with
    ...
    zuurfontein_by_as_many
    zuurfontein_were_both_made


    When you need (for instance) a post or an e-mail, as well as whole e-books, to be checked for broken four-word phrases against the richest 4-gram corpus so far, Graphein r.1++ is here: one 419MB ZIP file.
    Also, the second (semi-automatic) mode of operation is intact but faster; one screenshot here.
    To shrink these 200+ seconds down to less than a second (without relying on CPU power and many GBs of available system RAM), a lot of bread I must eat...

    In fact, I know exactly how to create the skeleton (without sacrificing speed a bit, and with no heavy load on system RAM or the CPU) despite the 32-bit coding limitations: create a single 10x3++GB file, a dump/mirror of the 140,000,000++ phrases already inserted into millions of B-trees. Spanning over such a huge pool is well suited to flash memories (SSDs, SD cards) because of their low latency/seek time, roughly 50 nanoseconds/microseconds/milliseconds respectively for system RAM/flash/hard disk. What this greedy approach (only about 10% memory utilization) needs most is low latency, not high bandwidth.
    Nevertheless, I would appreciate any how-to-do-it hint; one rough possibility is sketched below.
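
    One rough possibility, sketched only (it assumes the dump is written once as sorted, zero-padded, fixed-width records, which is my simplification): a plain binary search with fseek() over such a file needs only about log2(140,222,335), i.e. 28, reads per lookup, and on flash/SSD each read costs roughly the microsecond-scale latency mentioned above, so a single lookup stays far below a second without loading anything big into system RAM.
    Code:
    /* ondisk_lookup_sketch.c -- only a rough hint, not part of the Graphein package.
       Assumption (mine): the 140M++ phrases are dumped once into a single file of
       sorted, zero-padded, fixed-width records of RECLEN bytes each. A lookup then
       needs only about log2(140,222,335) = 28 seeks, each a single flash read.
       Note: for dumps above 2GB a 64-bit seek (_fseeki64 / fseeko) is needed. */
    #include <stdio.h>
    #include <string.h>

    #define RECLEN 64                   /* assumed fixed record width */

    /* returns 1 if the phrase is in the sorted dump, 0 if not, -1 on I/O error */
    static int lookup(FILE *dump, long nrecords, const char *phrase)
    {
        char rec[RECLEN];
        long lo = 0, hi = nrecords - 1;

        while (lo <= hi) {
            long mid = lo + (hi - lo) / 2;
            if (fseek(dump, mid * (long)RECLEN, SEEK_SET) != 0) return -1;
            if (fread(rec, 1, RECLEN, dump) != RECLEN) return -1;
            int cmp = strncmp(phrase, rec, RECLEN);
            if (cmp == 0) return 1;
            if (cmp < 0) hi = mid - 1; else lo = mid + 1;
        }
        return 0;
    }

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "Usage: %s sorted_dump.bin some_4gram\n", argv[0]);
            return 1;
        }
        FILE *dump = fopen(argv[1], "rb");
        if (!dump) { perror("fopen"); return 1; }

        fseek(dump, 0, SEEK_END);
        long nrecords = ftell(dump) / RECLEN;

        int found = lookup(dump, nrecords, argv[2]);
        printf("%s: %s\n", argv[2], found == 1 ? "familiar" : "unfamiliar");

        fclose(dump);
        return 0;
    }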

    Enjoy!


