Hi to all English-language-explorers,
I am an amateurish C program-mess-er who is interested mainly in English-language console utilities, with only one goal in mind: to give statistical info about word/phrase/sentence usage.
First impression: a nice forum.
My wish is to share here my attempts/console-tools for English sidekick-ing.
Second impression: an unnecessary limitation: 5 posts to be able to share a link, grmbl!
Leprechaun_r13_7pluses_quadrupleton_r1_EXEs.zip
Leprechaun_r13_7pluses_quadrupleton_r1_AT_A_GLANCE.pdf
For example:
Code:
D:\_KA45F~1\_4>dir
12/12/2010 01:37 PM 1,111,609,996 googlebooks-eng-us-all-4gram-20090715-0.csv
01/26/2011 06:46 PM 315 googlebooks-eng-us-all-4gram-20090715-0.csv.EXCERPT
01/26/2011 06:56 PM 362 Gulliver's-Travels.pdf.txt.EXCERPT
01/26/2011 06:47 PM 4,108 Leprechaun.LOG
01/26/2011 05:13 AM 514,048 Leprechaun_quadrupleton_Intel_IA-32_11.1.exe
01/26/2011 06:47 PM 53 test.lst
01/26/2011 06:47 PM 14 test.wrd

D:\_KA45F~1\_4>dir Gulliver*.excerpt/b>test2.lst

D:\_KA45F~1\_4>type "Gulliver's-Travels.pdf.txt.EXCERPT"
...
And so unmeasureable is the ambition of princes, that he seemed to think of nothing less than reducing the whole empire of Blefuscu into a province, and governing it, by a viceroy; of destroying the Big-endian exiles, and compelling that people to break the smaller end of their eggs, by which he would remain the sole monarch of the whole world.
...

D:\_KA45F~1\_4>Leprechaun_quadrupleton_Intel_IA-32_11.1.exe test2.lst test2.wrd
Leprechaun(Fast Greedy Word-Ripper), rev. 13_7pluses quadrupleton_r1, written by Svalqyatchx.
Leprechaun: 'Oh, well, didn't you hear? Bigger is good, but jumbo is dear.'
Kaze: Let's see what a 3-way hash + 6,602,752 Binary-Search-Trees can give us,
also the performance of a 3-way hash + 6,602,752 B-Trees of order 3.
Size of input file with files for Leprechauning: 36
Allocating memory 424MB ... OK
Size of Input TEXTual file: 362
|; Word count: 62 of them 41 distinct; Done: 64/64
Bytes per second performance: 362B/s
Words per second performance: 62W/s
Flushing unsorted words ...
Time for making unsorted wordlist: 1 second(s)
Deallocated memory in MB: 424
Allocated memory for words in MB: 1
Allocated memory for pointers-to-words in MB: 1
Sorting(with 'MultiKeyQuickSortX26Sort' by J. Bentley and R. Sedgewick) ...
Sort pass 26/26 ...
Flushing sorted words ...
Time for sorting unsorted wordlist: 1 second(s)
Leprechaun: Done.

D:\_KA45F~1\_4>type test2.wrd
and_compelling_that_people
and_so_unmeasureable_is
blefuscu_into_a_province
break_the_smaller_end
by_which_he_would
compelling_that_people_to
destroying_the_big_endian
empire_of_blefuscu_into
end_of_their_eggs
he_seemed_to_think
he_would_remain_the
is_the_ambition_of
less_than_reducing_the
monarch_of_the_whole
nothing_less_than_reducing
of_blefuscu_into_a
of_destroying_the_big
of_nothing_less_than
of_the_whole_world
people_to_break_the
reducing_the_whole_empire
remain_the_sole_monarch
seemed_to_think_of
smaller_end_of_their
so_unmeasureable_is_the
sole_monarch_of_the
than_reducing_the_whole
that_he_seemed_to
that_people_to_break
the_ambition_of_princes
the_big_endian_exiles
the_smaller_end_of
the_sole_monarch_of
the_whole_empire_of
think_of_nothing_less
to_break_the_smaller
to_think_of_nothing
unmeasureable_is_the_ambition
which_he_would_remain
whole_empire_of_blefuscu
would_remain_the_sole

D:\_KA45F~1\_4>

Enjoy!
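For readers wondering how the 4-grams in test2.wrd relate to the excerpt, here is a minimal C sketch of one rule that is consistent with the sample: slide a four-word window over the text, lowercase the words, join them with underscores, and restart the window at punctuation (spaces and hyphens do not restart it). This rule is only inferred from the listing above, not taken from Leprechaun.c. The names `rip_4grams` and `append_gram`, the buffer-based interface, and the handling of over-long words are all hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define GRAM 4           /* words per phrase */
#define MAX_WORD_LEN 31  /* assumed per-word length limit */

/* Append gram plus '\n' to out if it still fits (out stays NUL-terminated). */
static void append_gram(char *out, size_t cap, size_t *used, const char *gram)
{
    size_t need = strlen(gram) + 1;           /* gram + newline */
    if (*used + need < cap) {
        memcpy(out + *used, gram, need - 1);
        out[*used + need - 1] = '\n';
        *used += need;
        out[*used] = '\0';
    }
}

/*
 * Write every 4-gram of text into out (capacity cap >= 1), one per line,
 * in text order. Returns the number of 4-grams found, including any that
 * did not fit into out. No deduplication or sorting is done here.
 */
size_t rip_4grams(const char *text, char *out, size_t cap)
{
    char window[GRAM][MAX_WORD_LEN + 1];      /* ring buffer of recent words */
    char word[MAX_WORD_LEN + 1];
    size_t wlen = 0, run = 0, found = 0, used = 0;
    int too_long = 0;

    out[0] = '\0';
    for (const char *p = text; ; p++) {
        unsigned char c = (unsigned char)*p;
        int upper = (c >= 'A' && c <= 'Z');
        if (upper || (c >= 'a' && c <= 'z')) {
            if (wlen < MAX_WORD_LEN)
                word[wlen++] = (char)(upper ? c - 'A' + 'a' : c);
            else
                too_long = 1;
            continue;
        }
        if (wlen > 0) {                       /* a word has just ended */
            if (too_long) {
                run = 0;                      /* assumption: over-long word breaks the chain */
            } else {
                word[wlen] = '\0';
                strcpy(window[run % GRAM], word);
                run++;
                if (run >= GRAM) {            /* window is full: emit one 4-gram */
                    char gram[GRAM * (MAX_WORD_LEN + 1)];
                    gram[0] = '\0';
                    for (size_t k = 0; k < GRAM; k++) {
                        if (k > 0) strcat(gram, "_");
                        strcat(gram, window[(run - GRAM + k) % GRAM]);
                    }
                    append_gram(out, cap, &used, gram);
                    found++;
                }
            }
            wlen = 0;
            too_long = 0;
        }
        if (c == '\0') break;
        if (c != ' ' && c != '-' && c != '\t' && c != '\r' && c != '\n')
            run = 0;                          /* punctuation restarts the window */
    }
    return found;
}
```

Calling `rip_4grams("the ambition of princes, that he seemed to think", buf, sizeof buf)` leaves three lines in `buf`: `the_ambition_of_princes`, `that_he_seemed_to`, `he_seemed_to_think`. The real tool also removes duplicates and sorts the result, which this sketch leaves out.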
Last edited by Sanmayce; 06-Feb-2011 at 15:01. Reason: URLs added
Here is "the final" release, decent enough for both Linux and Windows users:
http://www.sanmayce.com/Downloads/Le..._ELFs_EXEs.zip
P.S.
The README.txt from the package above:
This is a short description of 'Leprechaun_[quadrupleton]_r13_7pluses_ELFs_EXEs' package (27 files; 9,882,301 bytes):
Code:
02/19/2011 04:32 AM 994,119 Leprechaun.png
02/19/2011 04:32 AM 181,165 Leprechaun.c
02/19/2011 04:32 AM 4,122 Leprechaun.LOG
02/19/2011 04:32 AM 27 Leprechaun.lst
02/19/2011 04:32 AM 2,615 Leprechaun.wrd
02/19/2011 04:32 AM 83 Leprechaun_COMPILE_Intel.bat
02/19/2011 04:32 AM 73 Leprechaun_COMPILE_Microsoft.bat
02/19/2011 04:32 AM 1,977,802 Leprechaun_Intel.cod
02/19/2011 04:32 AM 2,886,589 Leprechaun_Logo-diz.pdf
02/19/2011 04:32 AM 438,359 Leprechaun_Microsoft.cod
02/19/2011 04:32 AM 183,655 Leprechaun_quadrupleton.c
02/19/2011 04:32 AM 4,132 Leprechaun_quadrupleton.LOG
02/19/2011 04:32 AM 12,544 Leprechaun_quadrupleton.wrd
02/19/2011 04:32 AM 151,049 Leprechaun_quadrupleton_AT_A_GLANCE_cover.txt.pdf
02/19/2011 04:32 AM 109 Leprechaun_quadrupleton_COMPILE_Intel.bat
02/19/2011 04:32 AM 99 Leprechaun_quadrupleton_COMPILE_Microsoft.bat
02/19/2011 04:32 AM 644,409 Leprechaun_quadrupleton_r13_7pluses_generic_32bits.elf
02/19/2011 04:32 AM 514,048 Leprechaun_quadrupleton_r13_7pluses_Intel_IA-32_11.1.exe
02/19/2011 04:32 AM 96,256 Leprechaun_quadrupleton_r13_7pluses_Microsoft_32-bit_16.00.30319.01.exe
02/19/2011 04:32 AM 523,183 Leprechaun_r13_7pluses.pdf
02/19/2011 04:32 AM 642,613 Leprechaun_r13_7pluses_generic_32bits.elf
02/19/2011 04:32 AM 514,048 Leprechaun_r13_7pluses_Intel_IA-32_11.1.exe
02/19/2011 04:32 AM 95,232 Leprechaun_r13_7pluses_Microsoft_32-bit_16.00.30319.01.exe
02/19/2011 04:32 AM 117 Linux_Leprechaun_Complile_Line.script
02/19/2011 04:32 AM 143 Linux_Leprechaun_quadrupleton_Complile_Line.script
02/19/2011 04:32 AM 5,203 The_Little_Match_Girl.txt
02/19/2011 04:32 AM 10,507 _Caution_64bit_is-not-possible-yet_must_be_rewritten.txt

The package contains 32bit (console) executables compiled for Windows & Linux.
It is 100% free open-source copyleft software.
It creates English wordlists (1-gram or 4-gram) for a given filelist (each line is a filename).
Run one of the executables without any parameters to see how to use it.
Leprechaun_[quadrupleton]_r13_7pluses is powered by Jesteress, so far the fastest string hash function on new architectures such as Core i3.
Examples (in Linux prompt):
./Leprechaun_r13_7pluses_generic_32bits.elf Leprechaun.lst Leprechaun.wrd 7000
./Leprechaun_quadrupleton_r13_7pluses_generic_32bits.elf Leprechaun.lst Leprechaun_quadrupleton.wrd 6000
Then open the text files Leprechaun.LOG and Leprechaun_quadrupleton.LOG to see the logs of the first and second command, respectively.
Pluses:
+ written in C;
+ extremely fast;
+ a useful etude for wordlisting.
Minuses:
- the developer being an amateur;
- dirty style, so dirty that 64bit compilation surely fails;
- greedy: low memory utilization.
Sanmayce
Enjoy!
I see it works. I think you should make your program recognize the string n't as a word. They do it this way in the BYU corpora:
is n't
we 're
Now, your program gives:
isn 't
we 're
I'm not sure if I understand what the program actually does. What does
At left side of the word - '[' means no left successor
At left side of the word - ']' means left successor exists
At right side of the word - ']' means no right successor
At right side of the word - '[' means right successor exists
mean? I don't see any ]'s or ['s anywhere... All I can find is the output file with the words/4-grams in it and the log file. Is there anything else to look at?
edit1: Oh, I forgot to add. I liked how the program said, "Can't open file. I've already explained," or something like this.
You could correct those brackets throughout your files. It's already difficult to read computer geek speech and the lack of spaces doesn't help.
Last edited by birdeen's call; 19-Feb-2011 at 23:01.
I haven't played with your tool yet, so no comments - but thanks. Have you heard of SNOBOL (see the Wikipedia article)? (A bit of a dinosaur, but interesting...)
b
Yes, it limits functionality, but since I couldn't find a way to deal with apostrophes in all cases I decided to remove them altogether. The ripper's action is simple: it parses the incoming files (given via the filelist, the first filename on the command line) and extracts all Latin-letter words 1 to 31 characters long. The forming rule is simple too: a word is a string containing only alpha characters, i.e. 'a' to 'z' or 'A' to 'Z'.
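The forming rule above can be sketched in a few lines of C. This is an illustration under stated assumptions, not code from Leprechaun.c: the function name `next_word` is hypothetical, the lowercasing is inferred from the sample .wrd output, and skipping runs longer than 31 letters (rather than truncating them) is a guess.

```c
#include <stddef.h>

#define MAX_WORD_LEN 31  /* upper length limit stated in the post */

/* ASCII-only letter test, so the result does not depend on the C locale. */
int is_ascii_alpha(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/*
 * Copy the next word found at or after *pos in text[0..len) into out,
 * lowercased and NUL-terminated, and advance *pos past it.
 * Returns 1 if a word was found, 0 when the text is exhausted.
 * Assumption: letter runs longer than MAX_WORD_LEN are skipped entirely.
 */
int next_word(const char *text, size_t len, size_t *pos,
              char out[MAX_WORD_LEN + 1])
{
    size_t i = *pos;
    while (i < len) {
        /* skip everything that is not a letter (apostrophes, digits, ...) */
        while (i < len && !is_ascii_alpha((unsigned char)text[i])) i++;
        size_t start = i;
        while (i < len && is_ascii_alpha((unsigned char)text[i])) i++;
        size_t n = i - start;
        if (n >= 1 && n <= MAX_WORD_LEN) {
            for (size_t k = 0; k < n; k++) {
                unsigned char c = (unsigned char)text[start + k];
                out[k] = (char)((c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c);
            }
            out[n] = '\0';
            *pos = i;
            return 1;
        }
        /* n == 0: end of text reached; n > MAX_WORD_LEN: keep scanning */
    }
    *pos = len;
    return 0;
}
```

Under this rule "isn't" comes out as the two words `isn` and `t`, which is why apostrophe forms are lost in the 1-gram wordlist.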
The square brackets stand for binary-search-tree leaf succession status: on the left side of a word ']' means a left child exists, and on the right side '[' means a right child exists. I draw the tallest binary tree just for informative purposes. You don't need Leprechaun.log except for tracking the activity of the executable; you need only the second file from the command line, namely the extracted distinct words/grams.
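As a small illustration of the bracket notation, the sketch below renders one tree node the way the legend describes it. The `struct bst_node` layout and the function `format_node` are hypothetical and only mirror the legend quoted in the question; the real log code in Leprechaun.c may look quite different.

```c
#include <stdio.h>

/* Hypothetical node layout, for illustration only. */
struct bst_node {
    const char *word;
    struct bst_node *left, *right;
};

/*
 * Render one node with its bracket markers:
 *   left side:  '[' = no left child,  ']' = left child exists
 *   right side: ']' = no right child, '[' = right child exists
 * Returns what snprintf returns (the length the text needs).
 */
int format_node(const struct bst_node *n, char *buf, size_t cap)
{
    char left_mark  = n->left  ? ']' : '[';
    char right_mark = n->right ? '[' : ']';
    return snprintf(buf, cap, "%c%s%c", left_mark, n->word, right_mark);
}
```

So a leaf reads `[word]`, closed on both sides, while a node with both children reads `]word[`.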
It was/is not intentional; I don't want to impose my buggy ways, and I at least try to make my explanations as useful as possible.
Thank you for the link. I had not heard of it, but after reading the article I can say: it has nothing to do with real-world tasks; nowadays well-applied algorithms could smash all the old approaches. There are languages like Python designed for such tasks, but I am a kind of orthodox amateur: my language is and will be C. I see myself migrating to native 64bit Linux and 64bit C code after a couple of years.
Regards
A very simple and bad (not tragic though) solution is to add a couple of lines that would change every
...n
't
to
...
n't
in the output file after it's ready, but it will work only in the 1-gram program for obvious reasons. It's still O(n), which won't change the general complexity.
(I'm not a programmer. Please forgive me if I'm talking nonsense.)
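The single-pass fix-up suggested above could look roughly like the C sketch below. It assumes the tokens are still in text order and that `'t` survives as its own token; neither holds for Leprechaun's sorted, apostrophe-free .wrd output, so this only illustrates the idea. The name `resplit_nt` and the fixed `TOKEN_MAX` size are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define TOKEN_MAX 32  /* assumed room per token, including the NUL */

/*
 * One O(n) pass over tokens in text order: whenever a token ending in 'n'
 * is followed by the token "'t", move the 'n' across, so that
 * "isn" + "'t" becomes "is" + "n't" (the BYU-style split).
 * Returns the number of pairs that were changed.
 */
size_t resplit_nt(char tokens[][TOKEN_MAX], size_t count)
{
    size_t fixed = 0;
    for (size_t i = 0; i + 1 < count; i++) {
        size_t len = strlen(tokens[i]);
        if (len >= 2 && tokens[i][len - 1] == 'n' &&
            strcmp(tokens[i + 1], "'t") == 0) {
            tokens[i][len - 1] = '\0';      /* drop the trailing 'n' */
            strcpy(tokens[i + 1], "n't");   /* and prepend it to 't */
            fixed++;
        }
    }
    return fixed;
}
```

Each token is looked at once, so the pass stays linear in the number of tokens, as the post says.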
Speaking of what a small utility must do (and what it must not do): after much trial and error I realized the need for well-defined actions, that is, nothing must be left half-done, and simplicity along with speed must be among the highest priorities.
So, given its simplicity, I reckon that Leprechaun should not be altered: it does exactly what is expected of a first parsing pass, and from that point on another tool must take over. Otherwise there arises the need not only for apostrophes and hyphens but for additional alphabets ... and clarity vanishes. In other words, the skeleton (the superfast hash reinforced by unrolled b-trees with a simulated stack) is the thrilling thing (an important base/etude, not for further development but rather for tuning and rewriting as 64bit code); the rest interests me not.
In case my obsession is not obvious, here is my diagnosis: a maniacal fondness for high-speed text processing. As I have said elsewhere, speed is beauty.
One reason to abandon extracting words containing apostrophes was the existence of shortened forms like 'cause, 'twas, 'tis, 'tween, 'twere for 'because', 'it was', 'it is', 'between', 'it were' respectively.
Many years ago I wrote two very slow 16bit console utilities which might be useful to you. They work as a duo and rip the distinct English words (with hyphens and apostrophes) from a given file:
Example:
Code:
D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>dir
Volume in drive D is H320_Vol5
Volume Serial Number is 0CB3-C881

Directory of D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1

02/22/2011 06:05 AM <DIR> .
02/22/2011 06:05 AM <DIR> ..
02/22/2011 03:37 AM 6 masakari.ss
02/22/2011 03:23 AM 2,659 RIP_EWRD.BAS
02/22/2011 03:23 AM 44,232 RIP_EWRD.EXE
09/19/1997 12:00 AM 55,972 SAKURA.EXE
10/23/2007 05:26 PM 146,248 SAKURA8.ZIP
02/12/2011 09:11 PM 1,385,282 TSZ.txt
6 File(s) 1,634,399 bytes
2 Dir(s) 2,650,181,632 bytes free

D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>RIP_EWRD.EXE TSZ.txt
RIP_EWRD.EXE NumberOfWords&: 90958
SAKURA.EXE, revision 008, written by Svalqyatchx 'Kaze'.
Revision note: Virtual_Memory_Simulated_Stack, if overflow_error then HALT.
Caution: Very(pivot is chosen from first 20 elements) slow version.
Searching for MASAKARI.SS and MASAKARI.SD ...
Creating MASAKARI.SWP ...
Allocated HDD memory: 320MB.
Room for 33,554432 elements; Maximum(2GB-1) for 214,748364 elements.
Input file: TSZ.UW
Output file: TSZ.SW
Making SAKURA.QSS(10bytes per element(6-entry,4-stack)) at HDD memory ...
Current sort options: CASE_UNSENSITIVE /START= 1 /LENGTH= 26
Sorting in two passes 90958 elements(longest 26), needed 889KB ...
Bubble-sorting possible pivots ...
Sorting pass#1(splitting) countdown(Right&), StackPtr: 000000001, 000000034 ...
Sorting pass#2 countdown(Quantity&-Right&), StackPtr: 000000000, 000000000 ...
Stack_Nested_Levels i.e. StackPtrMAX& / 2 = 21
TotalReadData# = 12,011397 bytes.
Looking in SAKURA.QSS and writing TSZ.SW ...
Time: 06:18:12 / 06:18:53 i.e. 40.69922seconds.
SAKURA: Done(performance: 288KB/s).
Creating TSZ.WRD ...

D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>dir
Volume in drive D is H320_Vol5
Volume Serial Number is 0CB3-C881

Directory of D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1

02/22/2011 06:18 AM <DIR> .
02/22/2011 06:18 AM <DIR> ..
02/22/2011 03:37 AM 6 masakari.ss
02/22/2011 06:18 AM 335,544,320 MASAKARI.SWP
02/22/2011 03:23 AM 2,659 RIP_EWRD.BAS
02/22/2011 03:23 AM 44,232 RIP_EWRD.EXE
09/19/1997 12:00 AM 55,972 SAKURA.EXE
10/23/2007 05:26 PM 146,248 SAKURA8.ZIP
02/22/2011 06:18 AM 578,855 TSZ.SW
02/12/2011 09:11 PM 1,385,282 TSZ.txt
02/22/2011 06:18 AM 578,855 TSZ.UW
02/22/2011 06:18 AM 76,752 TSZ.WRD
10 File(s) 338,413,181 bytes
2 Dir(s) 2,313,396,224 bytes free

D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>type TSZ.WRD
a
a-sniffing
a-weary
abandoned
abash
abashed
abet
abide
ability
ability-to-stand
...
all-too-gentle
all-too-great
all-too-human
all-too-patient
all-too-poor
all-too-similar
all-too-small
...
bid'th
...
can't
...
day's-work
day-journeys
...
doubt
doubt'th
...
e'er
each
...
i'm
i've
...
it's
...
look
look'st
looked
lookedst
looketh
looking
looking-back
looks
...
lov'th
lovable
love
love's
love-glances
...
mean
mean'th
meaneth
meaning
means
meant
...
naysayer
ne'er
ne'er-do-ills
ne'er-do-wells
near
nearer
...
o'er
o'erflowing
o'erhangeth
o'erhearst
o'erhung
o'erleap
o'ershadowed
o'erspan
o'erswelled
o'erthrowers
o'erthrowing
o'erthrown
...
they've
...
world's
world-blessing
world-loving
...
would
would'st
...
y-e-a
yawn
...
zarathustra
zarathustra's
zarathustra-kingdom

D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>

The Dummy_DOS_ripper.zip 877KB is here.