I'm trying to understand how the cluster count works but can't. I don't see it in the code... Is the improved version much different from what you have already posted?
Here are the consonant clusters in the Bible (KJV), with frequencies.
Printout from the improved ConClusters.py; computation time – 2 mins.
It reads from a text file and prints to screen or file.
-------------------------------------------------
List of Consonant Clusters
6-letter clusters:
5-letter clusters:
rstfr:34, ffspr:13, tchcr:7, ngstr:7, llspr:2,
4-letter clusters:
ngth:409, nsgr:179, nstr:144, rstb:118, ghts:104, ghtn:72, lchr:70, rthr:63, nths:59, thst:57, ckcl:47, ghtw:42, ghth:41, phth:35, stfr:34, rthw:34, rstf:34, tchm:31, stpl:31, nksg:31, ghtl:28, thdr:26, lfth:23, rstl:20, rthq:19, ngpl:19, ngfl:19, pths:17, cksl:17, ngst:14, ngbl:14, fspr:13, ffsp:13, llst:11, ftsm:9, tchc:7, gstr:7, chcr:7, rthl:6, ldsm:6, rscr:5, rdsh:5, bscr:5, rstr:4, rspr:4, tstr:3, thph:3, tcht:3, rthd:3, ndbr:3, nchr:3, chsh:3, tchf:2, stch:2, shch:2, rnfl:2, rlds:2, rldl:2, rchs:2, ngtr:2, ndsh:2, lspr:2, llsp:2, llpl:2, ffsc:2, chth:2, wncl:1, wkth:1, thch:1, rchm:1, phph:1, nthl:1, ntbr:1, nscr:1, nggl:1, ndwr:1, ndst:1, nctl:1, lmsd:1, lkst:1, ckcr:1, chst:1,
3-letter clusters:
ght:6034, rth:2605, ngs:2403, str:2398, thr:2236, nst:1955, ldr:1838, nts:1585, sts:1090, rst:1028, rds:931, nds:882, ndr:714, nsw:690, nth:666, ttl:645, dst:620, tch:498, ndm:469, ngr:464, rch:454, dgm:431, ngt:413, gth:413, ngd:400, ntr:371, rnt:368, mpl:343, mbl:331, ncl:309, rld:301, ths:294, nct:291, rsh:289, spr:279, cks:271, rks:256, phr:255, rts:247, scr:238, stl:217, ndl:206, nch:192, lls:183, ffl:181, sgr:180, nsg:179, mpt:178, ngl:172, ppr:166, rkn:165, ntl:160, ncr:160, ghb:159, rbs:143, chr:142, thf:132, rns:130, lds:130, shm:126, nks:125, stb:118, lch:118, lth:108, hts:104, pht:97, dth:96, mbs:95, ckl:95, ppl:92, nsl:92, mbr:91, ndn:89, fth:88, sch:86, wls:85, rpr:84, hth:78, sth:77, fts:76, cts:73, htn:72, nkl:69, stw:67, ddl:64, ckn:64, rdl:63, shb:62, ctr:60, bst:60, thw:59, sph:59, rpl:59, hst:58, wns:55, ntm:54, gns:54, chm:49, xth:48, ckc:48, kcl:47, sht:46, rtl:46, rms:45, mps:44, htw:42, rsc:41, nsc:41, nsp:38, rfl:37, spl:36, tpl:35, tfr:34, stf:34, lst:33, lms:33, rkm:32, stp:31, ksg:31, chs:31, thh:30, pth:30, ngk:30, thd:29, rlw:29, htl:28, shd:26, ngf:26, ndf:26, hdr:26, ghw:26, ghs:26, ghl:26, rps:25, mph:25, thl:24, pts:24, nsm:24, bbl:24, tst:23, ssh:23, shn:23, rtr:23, lft:23, wsh:22, rrh:22, ldl:22, ngp:21, ldn:20, thq:19, gpl:19, gfl:19, ckw:19, wbr:18, wdn:17, shl:17, lph:17, ksl:17, ffs:17, bsh:17, ptr:16, ngb:15, mst:15, lfs:15, xpr:14, thn:14, pph:14, nfl:14, lps:14, gst:14, gbl:14, tth:13, rph:13, fsp:13, rdn:12, nsf:12, ndw:12, ltl:12, hsh:12, ftl:12, shk:11, nkn:11, ngh:11, lts:11, lsh:11, ffr:11, ctl:11, cht:11, tsm:10, rsq:10, npr:10, msh:10, bsc:10, ssl:9, rwh:9, rmw:9, rls:9, rdm:9, mpr:9, ffn:9, ckb:9, chb:9, shr:8, shh:8, shc:8, phn:8, nsh:8, dsh:8, xch:7, nkf:7, hcr:7, ghn:7, ggs:7, chc:7, btl:7, wnw:6, skm:6, shp:6, rsp:6, rnm:6, rnf:6, phl:6, nfr:6, llb:6, lks:6, lgr:6, dsm:6, dch:6, chp:6, tsh:5, rtn:5, rbl:5, chl:5, cch:5, bts:5, xtr:4, thm:4, stm:4, ssw:4, sps:4, rml:4, phs:4, pbr:4, ksh:4, hph:4, chz:4, chf:4, 
zzl:3, wpr:3, thp:3, thc:3, rtf:3, rdr:3, psh:3, phk:3, nkh:3, ndb:3, nbl:3, llf:3, lfw:3, kth:3, hch:3, ghm:3, ftt:3, ffd:3, dbr:3, ckk:3, btf:3, xsc:2, xpl:2, xcl:2, wth:2, ttr:2, tfl:2, stn:2, stc:2, sks:2, scl:2, rsm:2, rpn:2, rnn:2, rmh:2, rkl:2, rct:2, pwr:2, ptl:2, phz:2, phm:2, ntn:2, nsk:2, npl:2, ngn:2, msp:2, mns:2, mnl:2, lsp:2, lpl:2, llp:2, ldh:2, gtr:2, ggl:2, fsc:2, ckt:2, ccl:2, zth:1, wnh:1, wnc:1, wkt:1, wdl:1, vsh:1, tsk:1, thj:1, thb:1, tbr:1, ssf:1, sms:1, shv:1, shg:1, rtg:1, rsw:1, rkf:1, rfr:1, rdh:1, rdb:1, rcl:1, ptn:1, psk:1, php:1, ntb:1, nsn:1, nkw:1, ngw:1, ngg:1, ndk:1, msd:1, ltp:1, ltn:1, llm:1, lfr:1, lfc:1, ldb:1, kst:1, kcr:1, ffh:1, dwr:1, dds:1, ctn:1, chn:1, btr:1,
2-letter clusters:
th:155741, nd:64964, ll:26006, nt:23126, ng:20898, st:19747, sh:19449, ch:12629, wh:11902, ld:8722, ns:8119, gh:7941, rs:7668, rd:7649, ss:7098, rt:6810, ht:6189, pr:6003, tr:5641, br:5059, pl:5005, rn:4748, ts:4679, fr:4596, nc:4261, dr:3895, lt:3823, sp:3662, ls:3281, hr:3272, ds:3257, gr:3119, bl:3002, ff:2770, tt:2716, sr:2619, gs:2512, ft:2506, wn:2341, ph:2300, pt:2295, ck:2286, cr:1941, kn:1917, rv:1895, cl:1877, tw:1872, mm:1636, rk:1631, sw:1620, fl:1575, mb:1562, ms:1547, sc:1519, mp:1448, rr:1430, lv:1422, ct:1348, rc:1339, lf:1339, tl:1297, rm:1217, dg:1179, nn:1175, cc:1161, pp:1116, sl:1115, gl:984, rl:961, wr:949, nk:949, sm:790, ks:768, dw:750, gn:715, rg:681, lk:632, nh:614, ws:606, dl:566, nl:540, dn:538, bs:531, dm:529, tc:520, sk:483, sn:481, rp:472, ps:445, gm:440, gd:422, gt:417, wl:410, dd:394, bb:371, hs:370, tn:360, nf:351, rb:350, lm:347, rf:317, mn:293, xc:288, sd:245, lp:235, hb:235, zz:229, sg:229, nj:226, xt:213, hm:197, hn:195, kl:187, sb:178, rw:171, nv:157, dv:147, mc:146, mf:144, lc:142, hf:136, zr:135, bt:125, lw:124, tb:123, tf:122, nq:119, gg:109, bn:109, hl:108, dt:98, mr:89, wb:88, ln:87, tm:86, hw:85, xp:81, sf:79, nw:79, lg:72, df:72, pw:65, wf:63, hd:60, nr:58, bd:56, rh:52, zp:51, fs:51, bh:50, kc:48, nb:43, kk:41, bj:41, rj:40, km:39, lr:38, hh:38, xh:36, tp:36, rz:33, wd:32, gk:30, lb:29, kr:28, nm:27, ml:26, gf:26, sq:25, np:24, lh:24, hp:24, gp:22, bm:21, kw:20, hq:19, hc:18, gb:18, cq:18, hz:17, kb:16, rx:15, lz:14, hk:14, zm:12, kt:12, zh:11, mw:10, fn:10, tg:9, mh:9, db:9, cs:9, bz:9, zn:8, kf:8, wg:7, pm:7, pb:7, dj:7, zl:6, zb:6, wk:6, kh:6, gv:6, dh:6, dc:6, wm:5, bv:5, zk:4, zg:4, pn:4, pf:4, mz:4, wp:3, pk:3, pc:3, mt:3, mg:3, kv:3, fw:3, fd:3, dk:3, xs:2, xl:2, wt:2, nz:2, hg:2, zt:1, zj:1, ww:1, wc:1, vs:1, tk:1, qw:1, md:1, kd:1, hv:1, hj:1, gw:1, fh:1, fc:1,
>>>
Last edited by Raymott; 28-Oct-2010 at 12:47.
I'm trying to understand how the cluster count works but can't. I don't see it in the code... Is the improved version much different from what you have already posted?
That's just the printout. I'll post the code in the next post below.
Yes, there are two main changes I made.
The first was to count the instances. The first time the program encounters a new cluster, it goes over the whole text, counts the instances, and saves a pair [cluster, count] instead of just the cluster.
The second change, which improved the efficiency to a realistic level, was having each successive search build only on the results of the previous one. So the search for 3-letter clusters doesn't test every possible 3-letter combination against the whole text; it only tests candidates built on a root that already occurred in the 2-letter search (which is rather obvious, but I didn't think of it straight away!).
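For readers who want to see the two changes in isolation, here is a minimal sketch in Python 3. It is not the posted program: the function name `find_clusters`, the `max_len` parameter and the sample sentence are made up for illustration. It assumes the same counting rule as the description above, namely `str.count`, which counts non-overlapping occurrences.

```python
CONSONANTS = "bcdfghjklmnpqrstvwxz"

def find_clusters(text, max_len=6):
    """Return {length: [(cluster, count), ...]}, most frequent first.

    Hypothetical helper: candidates of length n are built only from
    clusters of length n-1 that were actually found in the text.
    """
    results = {}
    previous = list(CONSONANTS)  # single consonants seed the 2-letter search
    for length in range(2, max_len + 1):
        found = []
        for first in CONSONANTS:
            for tail in previous:
                cluster = first + tail
                count = text.count(cluster)  # non-overlapping occurrences
                if count:
                    found.append((cluster, count))
        if not found:
            break  # no longer clusters can exist
        results[length] = sorted(found, key=lambda pair: pair[1], reverse=True)
        previous = [cluster for cluster, _ in found]
    return results

sample = "the strength of the first fruits"
clusters = find_clusters(sample)
for length in sorted(clusters, reverse=True):
    print(length, clusters[length])
```

Because a cluster of length n can only occur if its last n-1 letters also occur, pruning the candidates this way loses nothing while cutting the number of searches sharply.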
If I stay interested in Text Processing coding, I might put a page up somewhere with some useful tools on it.
Hmm, to see the tabs, download the .txt file
It won't run without the tabs, and the '.txt' extension has to be replaced by '.py'.
################################################################
# Searches a file for consonant clusters
# Prints clusters with frequency,
# Sorted by cluster length, then frequency (descending), then reverse alphabetical
# Raymott 28/10/2010
################################################################
import operator  # for sort
conList, clist, c1list, dlist, d1list = [], [], [], [], []
elist, e1list, flist, f1list, glist, g1list = [], [], [], [], [], []  # start with empty lists of clusters
cluster = ""  # empty cluster
letters = "bcdfghjklmnpqrstvwxz"  # consonants
# Open file, read it into a string, close file
# Note: Change the file name in the line below.
dictionary = open('hamlet.txt', 'r')  # File to be read
wordList = dictionary.read()
dictionary.close()
########Get 2-letter clusters
for first in letters:
    for second in letters:
        cluster = first + second
        if cluster in wordList:
            counter = wordList.count(cluster)
            if cluster not in c1list:
                clist.append([cluster, counter])
                c1list.append(cluster)  # hold clusters for the 3-letter search
# sort the list by frequency, descending
clist.sort(key=operator.itemgetter(1))
clist.reverse()
########Get 3-letter clusters
for first in letters:
    for second in c1list:  # search only with the 2-letter clusters
        cluster = first + second
        if cluster in wordList:
            counter = wordList.count(cluster)
            if cluster not in d1list:
                dlist.append([cluster, counter])
                d1list.append(cluster)  # hold for the 4-letter search
# sort the list by frequency, descending
dlist.sort(key=operator.itemgetter(1))
dlist.reverse()
########Get 4-letter clusters
for first in letters:
    for second in d1list:  # search only with the 3-letter clusters
        cluster = first + second
        if cluster in wordList:
            counter = wordList.count(cluster)
            if cluster not in e1list:
                elist.append([cluster, counter])
                e1list.append(cluster)  # hold for the 5-letter search
# sort the list by frequency, descending
elist.sort(key=operator.itemgetter(1))
elist.reverse()
########Get 5-letter clusters
for first in letters:
    for second in e1list:  # search only with the 4-letter clusters
        cluster = first + second
        if cluster in wordList:
            counter = wordList.count(cluster)
            if cluster not in f1list:
                flist.append([cluster, counter])
                f1list.append(cluster)  # hold for the 6-letter search
# sort the list by frequency, descending
flist.sort(key=operator.itemgetter(1))
flist.reverse()
########Get 6-letter clusters
for first in letters:
    for second in f1list:  # search only with the 5-letter clusters
        cluster = first + second
        if cluster in wordList:
            counter = wordList.count(cluster)
            if cluster not in g1list:
                glist.append([cluster, counter])
                g1list.append(cluster)
# sort the list by frequency, descending
glist.sort(key=operator.itemgetter(1))
glist.reverse()
# Print the lists (Python 2 print statements; a trailing comma keeps output on one line)
print "List of Consonant Clusters\n"
print "6-letter clusters:\n"
for item in glist:
    print item[0] + ":" + str(item[1]) + ", ",
print '\n'
print "5-letter clusters:\n"
for item in flist:
    print item[0] + ":" + str(item[1]) + ", ",
print '\n'
print "4-letter clusters:\n"
for item in elist:
    print item[0] + ":" + str(item[1]) + ", ",
print '\n'
print "3-letter clusters:\n"
for item in dlist:
    print item[0] + ":" + str(item[1]) + ", ",
print '\n'
print "2-letter clusters:\n"
for item in clist:
    print item[0] + ":" + str(item[1]) + ", ",
#########
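A note on the ordering: sorting by frequency and then reversing the list puts ties in reverse alphabetical order, which is why the printout shows tchcr:7 before ngstr:7. If alphabetical order among ties is wanted, a single sort key can do both at once. A small sketch in Python 3, using four entries copied from the 5-letter printout (the variable names are made up for illustration):

```python
# Four [cluster, count] pairs taken from the 5-letter printout,
# in the alphabetical order in which the program generates them.
pairs = [["ffspr", 13], ["ngstr", 7], ["rstfr", 34], ["tchcr", 7]]

# Sort ascending by count, then reverse: ties end up reverse alphabetical.
by_reverse = sorted(pairs, key=lambda item: item[1])
by_reverse.reverse()

# One key: count descending (negated), then cluster ascending.
by_key = sorted(pairs, key=lambda item: (-item[1], item[0]))

print([p[0] for p in by_reverse])
print([p[0] for p in by_key])
```

The first line printed lists the two 7-count clusters as tchcr then ngstr, the second as ngstr then tchcr; the frequency order is the same in both.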
Hi Raymott, can you program in C#, Java or C++?
I only know that Python is a server scripting language like PHP. It would be great if you knew C#, because I'm learning C# and need someone to discuss it with and to help me sort out some problems.
I don't understand why we have to classify consonants into clusters. How does it help English learners?
Please tell me how useful it is to understand it.
Thank you so much!
One doesn't have to classify consonant clusters. But the OP asked about it. It gave me an idea, and I coded it. It's interesting, but not important.
Well, it could help English learners to know about the apparent frequency of clusters; to know if being able to pronounce them is necessary. For example, we don't have a cluster 'kxsh', but 'str' is very common.
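As a rough illustration of that kind of check, here is a short Python 3 sketch that looks up how often particular clusters occur in a text. The function name `cluster_frequency` and the sample sentence are invented for the example, and lower-casing the text first is an assumption that the posted program does not make.

```python
def cluster_frequency(text, clusters):
    """Count non-overlapping occurrences of each cluster (hypothetical helper)."""
    lowered = text.lower()  # assumption: ignore case
    return {cluster: lowered.count(cluster) for cluster in clusters}

sample = "The strong stream struck the street."
print(cluster_frequency(sample, ["str", "kxsh"]))
```

On this sample sentence 'str' is found four times and 'kxsh' not at all.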
Thank you, I'll try Python after mastering most of the C# language. Why don't you try learning C#? It's great and powerful, and if you are good at Python, you'll need only a month or less to reach a basic level of C# programming.
Programming is the main reason I have to learn English: I need to understand English IT e-books, because there are not many Vietnamese IT e-books for me.
Thank you for sharing!