  1. #1
    Sanmayce is offline Junior Member
    • Member Info
      • Member Type:
      • Student or Learner
      • Native Language:
      • Bulgarian
      • Home Country:
      • Bulgaria
      • Current Location:
      • Bulgaria
    Join Date
    Jan 2011
    Posts
    36

    Default Brutally fast 4-gram phrase ripper

    Hi to all English-language-explorers,

    I am an amateur C program-mess-er interested mainly in English-language console utilities with only one goal in mind: to give statistical info about the usage of words, phrases and sentences.

    First impression: a nice forum.
    I would like to share here my attempts at console tools for English sidekick-ing.

    Second impression: an unnecessary limitation: 5 posts to be able to share a link, grmbl!

    Leprechaun_r13_7pluses_quadrupleton_r1_EXEs.zip

    Leprechaun_r13_7pluses_quadrupleton_r1_AT_A_GLANCE.pdf

    For example:
    Code:
    D:\_KA45F~1\_4>dir
    
    12/12/2010  01:37 PM     1,111,609,996 googlebooks-eng-us-all-4gram-20090715-0.csv
    01/26/2011  06:46 PM               315 googlebooks-eng-us-all-4gram-20090715-0.csv.EXCERPT
    01/26/2011  06:56 PM               362 Gulliver's-Travels.pdf.txt.EXCERPT
    01/26/2011  06:47 PM             4,108 Leprechaun.LOG
    01/26/2011  05:13 AM           514,048 Leprechaun_quadrupleton_Intel_IA-32_11.1.exe
    01/26/2011  06:47 PM                53 test.lst
    01/26/2011  06:47 PM                14 test.wrd
    
    D:\_KA45F~1\_4>dir Gulliver*.excerpt/b>test2.lst
    
    D:\_KA45F~1\_4>type "Gulliver's-Travels.pdf.txt.EXCERPT"
    ...
    And so unmeasureable is the ambition of princes, that he
    seemed to think of nothing less than reducing the whole
    empire of Blefuscu into a province, and governing it, by
    a viceroy; of destroying the Big-endian exiles, and compelling
    that people to break the smaller end of their eggs,
    by which he would remain the sole monarch of the whole
    world.
    ...
    D:\_KA45F~1\_4>Leprechaun_quadrupleton_Intel_IA-32_11.1.exe test2.lst test2.wrd
    Leprechaun(Fast Greedy Word-Ripper), rev. 13_7pluses quadrupleton_r1, written by Svalqyatchx.
    Leprechaun: 'Oh, well, didn't you hear? Bigger is good, but jumbo is dear.'
    Kaze: Let's see what a 3-way hash + 6,602,752 Binary-Search-Trees can give us,
          also the performance of a 3-way hash + 6,602,752 B-Trees of order 3.
    Size of input file with files for Leprechauning: 36
    Allocating memory 424MB ... OK
    Size of Input TEXTual file: 362
    |; Word count: 62 of them 41 distinct; Done: 64/64
    Bytes per second performance: 362B/s
    Words per second performance: 62W/s
    Flushing unsorted words ...
    Time for making unsorted wordlist: 1 second(s)
    Deallocated memory in MB: 424
    Allocated memory for words in MB: 1
    Allocated memory for pointers-to-words in MB: 1
    Sorting(with 'MultiKeyQuickSortX26Sort' by J. Bentley and R. Sedgewick) ...
    Sort pass 26/26 ...
    Flushing sorted words ...
    Time for sorting unsorted wordlist: 1 second(s)
    Leprechaun: Done.
    
    D:\_KA45F~1\_4>type test2.wrd
    and_compelling_that_people
    and_so_unmeasureable_is
    blefuscu_into_a_province
    break_the_smaller_end
    by_which_he_would
    compelling_that_people_to
    destroying_the_big_endian
    empire_of_blefuscu_into
    end_of_their_eggs
    he_seemed_to_think
    he_would_remain_the
    is_the_ambition_of
    less_than_reducing_the
    monarch_of_the_whole
    nothing_less_than_reducing
    of_blefuscu_into_a
    of_destroying_the_big
    of_nothing_less_than
    of_the_whole_world
    people_to_break_the
    reducing_the_whole_empire
    remain_the_sole_monarch
    seemed_to_think_of
    smaller_end_of_their
    so_unmeasureable_is_the
    sole_monarch_of_the
    than_reducing_the_whole
    that_he_seemed_to
    that_people_to_break
    the_ambition_of_princes
    the_big_endian_exiles
    the_smaller_end_of
    the_sole_monarch_of
    the_whole_empire_of
    think_of_nothing_less
    to_break_the_smaller
    to_think_of_nothing
    unmeasureable_is_the_ambition
    which_he_would_remain
    whole_empire_of_blefuscu
    would_remain_the_sole
    
    D:\_KA45F~1\_4>
    Enjoy!
    Last edited by Sanmayce; 06-Feb-2011 at 16:01. Reason: URLs added
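The sample run above hints at how the 4-grams are formed: letters are lowercased, spaces, line breaks and the hyphen in "Big-endian" only separate words, and no phrase crosses a comma, semicolon or period. Below is a minimal C sketch that is consistent with that sample. It is not taken from Leprechaun's source: the helper name `rip_4grams`, the buffer size `MAXW`, and the rule that every character other than a letter, whitespace or hyphen resets the phrase are all assumptions inferred from the output shown.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define N    4    /* words per phrase (4-gram) */
#define MAXW 32   /* assumed word buffer size; longer words are truncated */

/* Hypothetical helper: writes every 4-gram found in `text` to `out`,
 * one per line, words joined by '_'. Returns the number of phrases
 * written (duplicates included; no sorting or deduplication here). */
static int rip_4grams(const char *text, FILE *out)
{
    char win[N][MAXW];   /* the most recent words of the current run */
    char cur[MAXW];      /* word being collected */
    int have = 0;        /* how many words of the run are in win[] */
    int len = 0;         /* length of cur */
    int count = 0;
    const char *p = text;

    for (;;) {
        unsigned char c = (unsigned char)*p;
        if (isalpha(c)) {
            if (len < MAXW - 1)
                cur[len++] = (char)tolower(c);
        } else {
            if (len > 0) {                       /* a word just ended */
                cur[len] = '\0';
                if (have == N) {                 /* drop the oldest word */
                    memmove(win, win + 1, (N - 1) * sizeof win[0]);
                    have = N - 1;
                }
                strcpy(win[have++], cur);
                len = 0;
                if (have == N) {
                    fprintf(out, "%s_%s_%s_%s\n",
                            win[0], win[1], win[2], win[3]);
                    count++;
                }
            }
            if (c == '\0')
                break;
            /* Assumption: whitespace and '-' keep the phrase going,
             * any other character (',', ';', '.', ...) starts a new one. */
            if (!(isspace(c) || c == '-'))
                have = 0;
        }
        p++;
    }
    return count;
}
```

Calling `rip_4grams("the Big-endian exiles, and compelling that people", stdout)` should print `the_big_endian_exiles` and `and_compelling_that_people`. The real tool additionally sorts the phrases and keeps only the distinct ones.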

  2. #2
    Sanmayce is offline Junior Member

    Default Re: Brutally fast 4-gram phrase ripper

    Here is "the final" release, decent enough for both Linux and Windows users:

    http://www.sanmayce.com/Downloads/Le..._ELFs_EXEs.zip

    P.S.
    The README.txt from the package above:
    This is a short description of 'Leprechaun_[quadrupleton]_r13_7pluses_ELFs_EXEs' package (27 files; 9,882,301 bytes):

    Code:
    02/19/2011  04:32 AM           994,119 Leprechaun.png
    02/19/2011  04:32 AM           181,165 Leprechaun.c
    02/19/2011  04:32 AM             4,122 Leprechaun.LOG
    02/19/2011  04:32 AM                27 Leprechaun.lst
    02/19/2011  04:32 AM             2,615 Leprechaun.wrd
    02/19/2011  04:32 AM                83 Leprechaun_COMPILE_Intel.bat
    02/19/2011  04:32 AM                73 Leprechaun_COMPILE_Microsoft.bat
    02/19/2011  04:32 AM         1,977,802 Leprechaun_Intel.cod
    02/19/2011  04:32 AM         2,886,589 Leprechaun_Logo-diz.pdf
    02/19/2011  04:32 AM           438,359 Leprechaun_Microsoft.cod
    02/19/2011  04:32 AM           183,655 Leprechaun_quadrupleton.c
    02/19/2011  04:32 AM             4,132 Leprechaun_quadrupleton.LOG
    02/19/2011  04:32 AM            12,544 Leprechaun_quadrupleton.wrd
    02/19/2011  04:32 AM           151,049 Leprechaun_quadrupleton_AT_A_GLANCE_cover.txt.pdf
    02/19/2011  04:32 AM               109 Leprechaun_quadrupleton_COMPILE_Intel.bat
    02/19/2011  04:32 AM                99 Leprechaun_quadrupleton_COMPILE_Microsoft.bat
    02/19/2011  04:32 AM           644,409 Leprechaun_quadrupleton_r13_7pluses_generic_32bits.elf
    02/19/2011  04:32 AM           514,048 Leprechaun_quadrupleton_r13_7pluses_Intel_IA-32_11.1.exe
    02/19/2011  04:32 AM            96,256 Leprechaun_quadrupleton_r13_7pluses_Microsoft_32-bit_16.00.30319.01.exe
    02/19/2011  04:32 AM           523,183 Leprechaun_r13_7pluses.pdf
    02/19/2011  04:32 AM           642,613 Leprechaun_r13_7pluses_generic_32bits.elf
    02/19/2011  04:32 AM           514,048 Leprechaun_r13_7pluses_Intel_IA-32_11.1.exe
    02/19/2011  04:32 AM            95,232 Leprechaun_r13_7pluses_Microsoft_32-bit_16.00.30319.01.exe
    02/19/2011  04:32 AM               117 Linux_Leprechaun_Complile_Line.script
    02/19/2011  04:32 AM               143 Linux_Leprechaun_quadrupleton_Complile_Line.script
    02/19/2011  04:32 AM             5,203 The_Little_Match_Girl.txt
    02/19/2011  04:32 AM            10,507 _Caution_64bit_is-not-possible-yet_must_be_rewritten.txt
    The package contains 32bit (console) executables compiled for Windows & Linux.
    It is 100% free open-source copyleft software.
    It creates English wordlists (1-gram or 4-gram) from a given filelist (each line is a filename).
    Run one of the executables without any parameters to see how to use it.
    Leprechaun_[quadrupleton]_r13_7pluses is powered by Jesteress, so far the fastest string hash function (on new architectures, like Core i3).

    Examples (in Linux prompt):
    ./Leprechaun_r13_7pluses_generic_32bits.elf Leprechaun.lst Leprechaun.wrd 7000
    ./Leprechaun_quadrupleton_r13_7pluses_generic_32bits.elf Leprechaun.lst Leprechaun_quadrupleton.wrd 6000

    To see the logs of the first and second command, open the text files Leprechaun.LOG and Leprechaun_quadrupleton.LOG respectively.

    Pluses:
    + written in C;
    + extremely fast;
    + a useful etude for wordlisting.
    Minuses:
    - the developer being an amateur;
    - dirty style, so dirty that 64bit compilation surely fails;
    - greedy: uses memory inefficiently.

    Sanmayce
    Enjoy!

  3. #3
    birdeen's call is offline VIP Member
    • Member Info
      • Member Type:
      • Student or Learner
      • Native Language:
      • Polish
      • Home Country:
      • Poland
      • Current Location:
      • Poland
    Join Date
    Jul 2010
    Posts
    5,099

    Default Re: Brutally fast 4-gram phrase ripper

    I see it works. I think you should make your program recognize the string n't as a word. They do it this way in the BYU corpora:

    is n't
    we 're

    Now, your program gives:

    isn 't
    we 're

    I'm not sure if I understand what the program actually does. What does
    At left side of the word - '[' means no left successor
    At left side of the word - ']' means left successor exists
    At right side of the word - ']' means no right successor
    At right side of the word - '[' means right successor exists
    mean? I don't see any ]'s or ['s anywhere... All I can find is the output file with the words/4-grams in it and the log file. Is there anything else to look at?

    edit1: Oh, I forgot to add. I liked how the program said, "Can't open file. I've already explained," or something like this.

    You could correct those brackets throughout your files. It's already difficult to read computer geek speech and the lack of spaces doesn't help.
    Last edited by birdeen's call; 20-Feb-2011 at 00:01.

  4. #4
    Tdol is online now Editor, UsingEnglish.com
    • Member Info
      • Member Type:
      • English Teacher
      • Native Language:
      • British English
      • Home Country:
      • UK
      • Current Location:
      • Philippines
    Join Date
    Nov 2002
    Posts
    43,129

    Default Re: Brutally fast 4-gram phrase ripper

    Quote Originally Posted by Sanmayce View Post
    Second impression: an unnecessary limitation: 5 posts to be able to share a link, grmbl!
    Well, you'd probably see things differently if you had to clear out dozens of spammers every day.

  5. #5
    BobK is offline Harmless drudge
    • Member Info
      • Member Type:
      • English Teacher
      • Native Language:
      • English
      • Home Country:
      • UK
      • Current Location:
      • UK
    Join Date
    Jul 2006
    Posts
    15,590

    Default Re: Brutally fast 4-gram phrase ripper

    Quote Originally Posted by Sanmayce View Post
    Hi to all English-language-explorers,

    I am an amateur C program-mess-er interested mainly in English-language console utilities with only one goal in mind: to give statistical info about the usage of words, phrases and sentences.
    ...
    I haven't played with your tool yet, so no comments - but thanks. Have you heard of SNOBOL (Wikipedia)? (A bit of a dinosaur, but interesting...)

    b

  6. #6
    Sanmayce is offline Junior Member

    Default Re: Brutally fast 4-gram phrase ripper

    Quote Originally Posted by birdeen's call View Post
    Now, your program gives:

    isn 't
    we 're
    Yes, it limits functionality but since I couldn't find a way to deal with apostrophes in all cases I decided to remove them altogether. The ripper's action is simple: it parses the incoming files (given via the filelist, the first filename on the command line) and extracts all Latin-letter words with lengths of 1 to 31 characters. The forming rule is simple too: a word is a string containing only alphabetic characters, i.e. 'a' to 'z' or 'A' to 'Z'.
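The word-forming rule just described (only 'a' to 'z' and 'A' to 'Z', lengths 1 to 31) can be sketched in C as follows. This is an illustration, not Leprechaun's code: the helper name `next_word` is hypothetical, lowercasing is inferred from the sample output earlier in the thread, and skipping words longer than 31 letters is an assumption, since the thread does not say what happens to them.

```c
#include <stddef.h>

#define MAX_WORD 31   /* longest word the ripper extracts, per the post */

static int is_ascii_alpha(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* Hypothetical helper: copies the next word from *pp into buf
 * (buf must hold at least MAX_WORD + 1 bytes), lowercased, and
 * advances *pp past it. Returns the word length, or 0 at end of text. */
static size_t next_word(const char **pp, char *buf)
{
    const char *p = *pp;

    for (;;) {
        size_t len = 0;

        while (*p && !is_ascii_alpha(*p))     /* skip separators */
            p++;
        if (!*p) {
            *pp = p;
            return 0;
        }
        while (is_ascii_alpha(*p)) {
            if (len < MAX_WORD)
                buf[len] = (char)((*p >= 'A' && *p <= 'Z') ? *p + 32 : *p);
            len++;
            p++;
        }
        if (len <= MAX_WORD) {
            buf[len] = '\0';
            *pp = p;
            return len;
        }
        /* Assumption: a run longer than MAX_WORD letters is skipped. */
    }
}
```

Because the apostrophe is not a letter, the input `isn't` yields the two words `isn` and `t`, which is the behaviour noticed above.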

    Quote Originally Posted by birdeen's call View Post
    I don't see any ]'s or ['s anywhere... All I can find is the output file with the words/4-grams in it and the log file. Is there anything else to look at?
    The square brackets show whether a binary-search-tree node has children: ']' on the left side of a word means a left child exists, and '[' on the right side means a right child exists. I draw the tallest binary tree purely for information. You don't need Leprechaun.LOG except to track what the executable did; you need only the second file named on the command line, which holds the extracted distinct words/grams.
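To make the bracket legend quoted earlier in the thread concrete, here is a small C sketch of how one tree node could be printed. The struct `bst_node` and the helper `format_node` are hypothetical names chosen for illustration; only the meaning of the brackets comes from the thread.

```c
#include <stdio.h>

/* Hypothetical node layout, for illustration only. */
struct bst_node {
    const char *word;
    struct bst_node *left, *right;
};

/* Left side:  '[' = no left successor,  ']' = left successor exists.
 * Right side: ']' = no right successor, '[' = right successor exists. */
static int format_node(const struct bst_node *n, char *out, size_t outsz)
{
    return snprintf(out, outsz, "%c%s%c",
                    n->left  ? ']' : '[',
                    n->word,
                    n->right ? '[' : ']');
}
```

Under this legend a leaf prints as `[word]` and a node with both children prints as `]word[`.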

    Quote Originally Posted by birdeen's call View Post
    You could correct those brackets throughout your files. It's already difficult to read computer geek speech and the lack of spaces doesn't help.
    It was not intentional; I don't want to impose my buggy ways, and I at least try to make my explanations as useful as possible.

  7. #7
    Sanmayce is offline Junior Member

    Default Re: Brutally fast 4-gram phrase ripper

    Quote Originally Posted by BobK View Post
    Have you heard of SNOBOL (Wikipedia)? (A bit of a dinosaur, but interesting...)
    Thank you for the link. I had not heard of it, but after reading the article I can say that it has nothing to do with real-world tasks; nowayears well-applied algorithms could smash all the old approaches. There are languages like Python designed for such tasks, but I am a kind of orthodox amateur: my language is and will be C. I see myself migrating to native 64-bit Linux and 64-bit C code in a couple of years.

    Regards

  8. #8
    birdeen's call is offline VIP Member

    Default Re: Brutally fast 4-gram phrase ripper

    Quote Originally Posted by Sanmayce View Post
    Yes, it limits functionality but since I couldn't find a way to deal with apostrophes in all cases I decided to remove them altogether.
    A very simple and bad (not tragic though) solution is to add a couple of lines that would change every

    ...n
    't


    to

    ...
    n't

    in the output file after it's ready, but it will work only in the 1-gram program for obvious reasons. It's still O(n), which won't change the general complexity.

    (I'm not a programmer. Please forgive me if I'm talking nonsense.)

  9. #9
    BobK is offline Harmless drudge

    Default Re: Brutally fast 4-gram phrase ripper

    Quote Originally Posted by Sanmayce View Post
    Thank you for the link. I had not heard of it, but after reading the article I can say that it has nothing to do with real-world tasks; nowayears well-applied algorithms could smash all the old approaches. There are languages like Python designed for such tasks, but I am a kind of orthodox amateur: my language is and will be C. I see myself migrating to native 64-bit Linux and 64-bit C code in a couple of years.

    Regards


    (Was 'nowayears' a joke? If so, it works. But 'nowadays' doesn't normally have a *365 analogue. 'Nowadays', like 'these days', means 'at/in this time/era....'. The 'day' in 'present-day', in the same way, doesn't mean '24 hours'.)

    b

  10. #10
    Sanmayce is offline Junior Member

    Default Re: Brutally fast 4-gram phrase ripper

    Quote Originally Posted by birdeen's call View Post
    A very simple and bad (not tragic though) solution is to add a couple of lines that would change every

    ...n
    't


    to

    ...
    n't

    in the output file after it's ready, but it will work only in the 1-gram program for obvious reasons.
    Speaking of what a small utility must (and must not) do: after much trial and error I realized the need for well-defined actions. That is, nothing must be left half-done, and simplicity along with speed must be among the highest priorities.
    So I reckon that Leprechaun, in its simplicity, should not be altered: it does exactly what is expected of a first parsing pass, and from that point on another tool must take over. Otherwise the need arises to handle not only apostrophes and hyphens but also additional alphabets ... and clarity vanishes. In other words, the skeleton (the superfast hash reinforced by unrolled B-trees with a simulated stack) is the thrilling thing (an important base/etude not for further development but rather for tuning and rewriting as 64-bit code); the rest does not interest me.
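For readers unfamiliar with the "hash plus trees" skeleton mentioned here and in the log of the first post, the general idea can be sketched in C as a table of hash slots, each holding the root of a small binary search tree of distinct words. This is only a rough sketch under stated assumptions, not Leprechaun's implementation: FNV-1a stands in for Jesteress, the slot count `SLOTS` is far smaller than the 6,602,752 trees in the log, plain binary search trees replace the B-trees, and the names `node`, `fnv1a` and `insert_word` are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SLOTS 1024u   /* hypothetical; the log reports 6,602,752 trees */

struct node {
    struct node *left, *right;
    char word[];          /* C99 flexible array member */
};

/* FNV-1a, used here only as a stand-in for the Jesteress hash. */
static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Inserts w into table (an array of SLOTS tree roots, initially NULL).
 * Returns 1 if the word is new, 0 if already present, -1 if out of memory.
 * The descent is iterative, so no recursion is needed. */
static int insert_word(struct node **table, const char *w)
{
    struct node **link = &table[fnv1a(w) % SLOTS];
    struct node *n;
    size_t len;

    while (*link) {
        int c = strcmp(w, (*link)->word);
        if (c == 0)
            return 0;
        link = (c < 0) ? &(*link)->left : &(*link)->right;
    }
    len = strlen(w);
    n = malloc(sizeof *n + len + 1);
    if (!n)
        return -1;
    n->left = n->right = NULL;
    memcpy(n->word, w, len + 1);
    *link = n;
    return 1;
}
```

The hash spreads the words over many small trees, so each lookup compares against only a few strings. Sorting the distinct words afterwards is a separate pass, as the log shows.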

    In case my obsession is not obvious, here is my diagnosis: a maniacal fondness for high-speed text processing. As I have said elsewhere, speed is beauty.
    One reason I abandoned extracting words containing apostrophes was the existence of shortened forms like 'cause, 'twas, 'tis, 'tween, 'twere for 'because', 'it was', 'it is', 'between', 'it were' respectively.
    Many years ago I wrote two very slow 16-bit console utilities which might be useful to you. They work as a pair and rip the distinct English words (including those with hyphens and apostrophes) from a given file:

    Example:
    Code:
    D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>dir
     Volume in drive D is H320_Vol5
     Volume Serial Number is 0CB3-C881
    
     Directory of D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1
    
    02/22/2011  06:05 AM    <DIR>          .
    02/22/2011  06:05 AM    <DIR>          ..
    02/22/2011  03:37 AM                 6 masakari.ss
    02/22/2011  03:23 AM             2,659 RIP_EWRD.BAS
    02/22/2011  03:23 AM            44,232 RIP_EWRD.EXE
    09/19/1997  12:00 AM            55,972 SAKURA.EXE
    10/23/2007  05:26 PM           146,248 SAKURA8.ZIP
    02/12/2011  09:11 PM         1,385,282 TSZ.txt
                   6 File(s)      1,634,399 bytes
                   2 Dir(s)   2,650,181,632 bytes free
    
    D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>RIP_EWRD.EXE TSZ.txt
    RIP_EWRD.EXE
    NumberOfWords&: 90958
    SAKURA.EXE, revision 008, written by Svalqyatchx 'Kaze'.
    Revision note: Virtual_Memory_Simulated_Stack, if overflow_error then HALT.
    Caution: Very(pivot is chosen from first 20 elements) slow version.
    Searching for MASAKARI.SS and MASAKARI.SD ...
    Creating MASAKARI.SWP ...
    Allocated HDD memory: 320MB.
    Room for 33,554432 elements; Maximum(2GB-1) for 214,748364 elements.
    Input file: TSZ.UW
    Output file: TSZ.SW
    Making SAKURA.QSS(10bytes per element(6-entry,4-stack)) at HDD memory ...
    Current sort options: CASE_UNSENSITIVE /START= 1 /LENGTH= 26
    Sorting in two passes 90958 elements(longest 26), needed 889KB ...
    Bubble-sorting possible pivots ...
    Sorting pass#1(splitting) countdown(Right&), StackPtr: 000000001, 000000034 ...
    Sorting pass#2 countdown(Quantity&-Right&), StackPtr: 000000000, 000000000 ...
    Stack_Nested_Levels i.e. StackPtrMAX& / 2 = 21
    TotalReadData# = 12,011397 bytes.
    Looking in SAKURA.QSS and writing TSZ.SW ...
    Time: 06:18:12 / 06:18:53 i.e. 40.69922seconds.
    SAKURA: Done(performance: 288KB/s).
    Creating TSZ.WRD ...
    
    D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>dir
     Volume in drive D is H320_Vol5
     Volume Serial Number is 0CB3-C881
    
     Directory of D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1
    
    02/22/2011  06:18 AM    <DIR>          .
    02/22/2011  06:18 AM    <DIR>          ..
    02/22/2011  03:37 AM                 6 masakari.ss
    02/22/2011  06:18 AM       335,544,320 MASAKARI.SWP
    02/22/2011  03:23 AM             2,659 RIP_EWRD.BAS
    02/22/2011  03:23 AM            44,232 RIP_EWRD.EXE
    09/19/1997  12:00 AM            55,972 SAKURA.EXE
    10/23/2007  05:26 PM           146,248 SAKURA8.ZIP
    02/22/2011  06:18 AM           578,855 TSZ.SW
    02/12/2011  09:11 PM         1,385,282 TSZ.txt
    02/22/2011  06:18 AM           578,855 TSZ.UW
    02/22/2011  06:18 AM            76,752 TSZ.WRD
                  10 File(s)    338,413,181 bytes
                   2 Dir(s)   2,313,396,224 bytes free
    
    D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>type TSZ.WRD
    a
    a-sniffing
    a-weary
    abandoned
    abash
    abashed
    abet
    abide
    ability
    ability-to-stand
    ...
    all-too-gentle
    all-too-great
    all-too-human
    all-too-patient
    all-too-poor
    all-too-similar
    all-too-small
    ...
    bid'th
    ...
    can't
    ...
    day's-work
    day-journeys
    ...
    doubt
    doubt'th
    ...
    e'er
    each
    ...
    i'm
    i've
    ...
    it's
    ...
    look
    look'st
    looked
    lookedst
    looketh
    looking
    looking-back
    looks
    ...
    lov'th
    lovable
    love
    love's
    love-glances
    ...
    mean
    mean'th
    meaneth
    meaning
    means
    meant
    ...
    naysayer
    ne'er
    ne'er-do-ills
    ne'er-do-wells
    near
    nearer
    ...
    o'er
    o'erflowing
    o'erhangeth
    o'erhearst
    o'erhung
    o'erleap
    o'ershadowed
    o'erspan
    o'erswelled
    o'erthrowers
    o'erthrowing
    o'erthrown
    ...
    they've
    ...
    world's
    world-blessing
    world-loving
    ...
    would
    would'st
    ...
    y-e-a
    yawn
    ...
    zarathustra
    zarathustra's
    zarathustra-kingdom
    
    D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>
    The Dummy_DOS_ripper.zip 877KB is here.
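The listing above contains words such as o'er, day's-work and all-too-human. One way to get such words is to accept an apostrophe or hyphen only when it sits between two letters. The C sketch below shows that rule. It is a guess at the behaviour, not the logic of RIP_EWRD, and the helper name `next_rich_word` is hypothetical.

```c
#include <stddef.h>

static int is_letter(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* Hypothetical helper: copies the next word from *pp into buf
 * (bufsz >= 1), lowercased, keeping ' and - only when they stand
 * between two letters. Advances *pp. Returns the stored length,
 * or 0 when no word is left. Overlong words are truncated. */
static size_t next_rich_word(const char **pp, char *buf, size_t bufsz)
{
    const char *p = *pp;
    size_t len = 0;

    while (*p && !is_letter(*p))          /* skip to the first letter */
        p++;
    while (*p) {
        int joiner = (*p == '\'' || *p == '-') && is_letter(p[1]);
        if (!is_letter(*p) && !joiner)
            break;
        if (len + 1 < bufsz)
            buf[len++] = (char)((*p >= 'A' && *p <= 'Z') ? *p + 32 : *p);
        p++;
    }
    buf[len] = '\0';
    *pp = p;
    return len;
}
```

With this rule `day's-work` and `o'er` survive intact, but a leading apostrophe is dropped, so `'tis` comes out as `tis`. That is the ambiguity with shortened forms described in the post.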
