
Originally Posted by birdeen's call
A very simple and bad (not tragic though) solution is to add a couple of lines that would change every ...n 't to ...n't in the output file after it's ready, but it will work only in the 1-gram program for obvious reasons.
Speaking of what a small utility must do (and what it must not), after much trial and error I realized the need for well-defined actions: nothing must be left half-done, and simplicity along with speed must be among the highest priorities.
So, given its simplicity, I reckon that Leprechaun does not need to be altered: it does exactly what is expected of it as a first pass of parsing, and from that point on another tool must take over. Otherwise the need arises to handle not only apostrophes and hyphens but additional alphabets too ... and clarity vanishes. In other words, the skeleton (the superfast hash reinforced by unrolled B-trees with a simulated stack) is the thrilling thing (an important base/etude, not for further development but rather for tuning and rewriting as 64-bit code); the rest does not interest me.
In case it is not clear what my obsession is, here is my diagnosis: a maniacal fondness for high-speed text processing. As I have said elsewhere, speed is beauty.
One reason I abandoned extracting words containing apostrophes was the existence of shortened forms like 'cause, 'twas, 'tis, 'tween, 'twere for 'because', 'it was', 'it is', 'between', 'it were' respectively.
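To illustrate why such forms are awkward, here is a minimal Python sketch. It assumes the difficulty is that a leading apostrophe cannot be told apart from an opening single quote; that reading is an assumption, and both patterns below are hypothetical, not taken from Leprechaun or any other tool mentioned here.

```python
import re

# Hypothetical patterns for illustration only (not from Leprechaun or RIP_EWRD).
# Variant 1: apostrophes are accepted only between letters.
INTERNAL_ONLY = re.compile(r"[a-z]+(?:'[a-z]+)*")
# Variant 2: one leading apostrophe is accepted as well.
LEADING_ALLOWED = re.compile(r"'?[a-z]+(?:'[a-z]+)*")

def tokens(pattern, text):
    """Return every match of pattern in the lowercased text."""
    return pattern.findall(text.lower())

sample = "'Twas late, and he said 'go home' because it wasn't safe."

internal = tokens(INTERNAL_ONLY, sample)   # keeps "wasn't", but "'twas" loses its apostrophe
leading = tokens(LEADING_ALLOWED, sample)  # keeps "'twas", but also yields the bogus "'go"

print(internal)
print(leading)
```

Neither variant is right for both cases: the first strips the apostrophe from 'twas, while the second glues an opening quotation mark onto an ordinary word.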
Many years ago I wrote two very slow 16-bit console utilities which might be useful to you; they work as a pair and rip the distinct English words (including those with hyphens and apostrophes) from a given file:
Example:
Code:
D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>dir
Volume in drive D is H320_Vol5
Volume Serial Number is 0CB3-C881
Directory of D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1
02/22/2011 06:05 AM <DIR> .
02/22/2011 06:05 AM <DIR> ..
02/22/2011 03:37 AM 6 masakari.ss
02/22/2011 03:23 AM 2,659 RIP_EWRD.BAS
02/22/2011 03:23 AM 44,232 RIP_EWRD.EXE
09/19/1997 12:00 AM 55,972 SAKURA.EXE
10/23/2007 05:26 PM 146,248 SAKURA8.ZIP
02/12/2011 09:11 PM 1,385,282 TSZ.txt
6 File(s) 1,634,399 bytes
2 Dir(s) 2,650,181,632 bytes free
D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>RIP_EWRD.EXE TSZ.txt
RIP_EWRD.EXE
NumberOfWords&: 90958
SAKURA.EXE, revision 008, written by Svalqyatchx 'Kaze'.
Revision note: Virtual_Memory_Simulated_Stack, if overflow_error then HALT.
Caution: Very(pivot is chosen from first 20 elements) slow version.
Searching for MASAKARI.SS and MASAKARI.SD ...
Creating MASAKARI.SWP ...
Allocated HDD memory: 320MB.
Room for 33,554432 elements; Maximum(2GB-1) for 214,748364 elements.
Input file: TSZ.UW
Output file: TSZ.SW
Making SAKURA.QSS(10bytes per element(6-entry,4-stack)) at HDD memory ...
Current sort options: CASE_UNSENSITIVE /START= 1 /LENGTH= 26
Sorting in two passes 90958 elements(longest 26), needed 889KB ...
Bubble-sorting possible pivots ...
Sorting pass#1(splitting) countdown(Right&), StackPtr: 000000001, 000000034 ...
Sorting pass#2 countdown(Quantity&-Right&), StackPtr: 000000000, 000000000 ...
Stack_Nested_Levels i.e. StackPtrMAX& / 2 = 21
TotalReadData# = 12,011397 bytes.
Looking in SAKURA.QSS and writing TSZ.SW ...
Time: 06:18:12 / 06:18:53 i.e. 40.69922seconds.
SAKURA: Done(performance: 288KB/s).
Creating TSZ.WRD ...
D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>dir
Volume in drive D is H320_Vol5
Volume Serial Number is 0CB3-C881
Directory of D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1
02/22/2011 06:18 AM <DIR> .
02/22/2011 06:18 AM <DIR> ..
02/22/2011 03:37 AM 6 masakari.ss
02/22/2011 06:18 AM 335,544,320 MASAKARI.SWP
02/22/2011 03:23 AM 2,659 RIP_EWRD.BAS
02/22/2011 03:23 AM 44,232 RIP_EWRD.EXE
09/19/1997 12:00 AM 55,972 SAKURA.EXE
10/23/2007 05:26 PM 146,248 SAKURA8.ZIP
02/22/2011 06:18 AM 578,855 TSZ.SW
02/12/2011 09:11 PM 1,385,282 TSZ.txt
02/22/2011 06:18 AM 578,855 TSZ.UW
02/22/2011 06:18 AM 76,752 TSZ.WRD
10 File(s) 338,413,181 bytes
2 Dir(s) 2,313,396,224 bytes free
D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>type TSZ.WRD
a
a-sniffing
a-weary
abandoned
abash
abashed
abet
abide
ability
ability-to-stand
...
all-too-gentle
all-too-great
all-too-human
all-too-patient
all-too-poor
all-too-similar
all-too-small
...
bid'th
...
can't
...
day's-work
day-journeys
...
doubt
doubt'th
...
e'er
each
...
i'm
i've
...
it's
...
look
look'st
looked
lookedst
looketh
looking
looking-back
looks
...
lov'th
lovable
love
love's
love-glances
...
mean
mean'th
meaneth
meaning
means
meant
...
naysayer
ne'er
ne'er-do-ills
ne'er-do-wells
near
nearer
...
o'er
o'erflowing
o'erhangeth
o'erhearst
o'erhung
o'erleap
o'ershadowed
o'erspan
o'erswelled
o'erthrowers
o'erthrowing
o'erthrown
...
they've
...
world's
world-blessing
world-loving
...
would
would'st
...
y-e-a
yawn
...
zarathustra
zarathustra's
zarathustra-kingdom
D:\_KA45F~1\_KAZE_~1\QB_PART\DUMMY_~1>
The Dummy_DOS_ripper.zip 877KB is here.
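For readers without a 16-bit DOS environment, the overall effect of the listing (extract words, lowercase them, sort, keep the distinct ones) can be approximated with a short Python sketch. This is not how RIP_EWRD.EXE and SAKURA.EXE work internally; the word pattern and the function name `rip_distinct_words` are assumptions made for illustration.

```python
import re

# Assumed word shape: runs of letters joined by single internal
# hyphens or apostrophes (e.g. ne'er-do-wells, all-too-human).
WORD = re.compile(r"[a-z]+(?:['-][a-z]+)*")

def rip_distinct_words(text):
    """Return the distinct lowercase words of text in sorted order."""
    return sorted(set(WORD.findall(text.lower())))

sample = ("Ne'er-do-wells o'erleap the all-too-human; "
          "Zarathustra's world, the World's love.")
words = rip_distinct_words(sample)
print("\n".join(words))
```

To process a file, pass its contents instead of `sample`, for example `rip_distinct_words(open("TSZ.txt", encoding="latin-1").read())`; the encoding here is a guess. Python's default string sort compares code points, so the apostrophe sorts before the hyphen and both sort before letters, which agrees with the ordering visible in the listing (day's-work before day-journeys, lov'th before lovable).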