[Idiom] consonant clusters

Status
Not open for further replies.

5jj

Moderator
Staff member
Joined
Oct 14, 2010
Member Type
English Teacher
Native Language
British English
Home Country
Czech Republic
Current Location
Czech Republic
I am not sure exactly what you mean by 'classifying' them.

Perhaps you could try grouping them into clusters of three consonants: /glImps/ and four: /glImpst/

Or by voiced-unvoiced-unvoiced: /glɑːnst/ and unvoiced-unvoiced-unvoiced: /nekst/

Or by how consonant clusters may be put together in English, as in: CVC (bed), CCVC (bled), CCVCC (blend), etc.
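
The third grouping is easy to automate. A minimal sketch, assuming a purely spelling-based treatment in which a, e, i, o and u count as vowels and every other letter counts as a consonant (so 'y' is treated as a consonant, and spelling stands in for sound). The function name `cv_pattern` is made up for illustration:

```python
VOWELS = set("aeiou")  # assumption: spelling-based, 'y' counted as a consonant

def cv_pattern(word):
    """Return the consonant/vowel skeleton of a word, e.g. 'bled' -> 'CCVC'."""
    return "".join(
        "V" if ch in VOWELS else "C"
        for ch in word.lower()
        if ch.isalpha()  # skip hyphens, apostrophes and other non-letters
    )

for word in ["bed", "bled", "blend"]:
    print(word, cv_pattern(word))
# bed CVC
# bled CCVC
# blend CCVCC
```

Words could then be grouped by their pattern, for example by collecting them in a dictionary keyed on the skeleton.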
 

Raymott

VIP Member
Joined
Jun 29, 2008
Member Type
Academic
Native Language
English
Home Country
Australia
Current Location
Australia
Here's a little Python program. It runs as written, and Python is readable enough that it also serves as a description of the algorithm. It finds every two-letter consonant cluster in your word list and prints them in alphabetical order.
If you want to classify them some other way (apart from alphabetically), you'd need to add a few lines.
If you want to classify consonant sounds, you need a different program with 'phl' represented as 'fl', for example, and a phonetic dictionary stripped of extraneous marks. (Maybe that will be my next project.)

---------------------------------------------
# findClusters.py
# Raymott 24/10/2010
# Finds 2-letter consonant custers in a dictionary and sorts them. Extendable to 3, 4 +

clist = [] #start with empty list of clusters
cluster = "" # empty cluster
letters = "bcdfghjklmnpqrstvwxz" #consonants
dictionary = "a temporary dictionary - replace" #Replace this string with a very big word list
# Get 2-letter clusters
for first in letters:
    for second in letters:
        cluster = first + second
        if cluster in dictionary:
            if cluster not in clist:
                clist.append(cluster)
clist.sort() # Can sort list in a few ways
print(clist) # prints ['ct', 'mp', 'pl'] with above string as dictionary.
 

5jj

Thanks. I'll try to use that some time. Trouble is, I'm nearly computer illiterate, and if something can go wrong, it will.

By the way, is 'custers' in the third line of your program a typo in the program, which should therefore be retained, or a typo in this message, which should therefore be corrected? Or is it something else?
 

Raymott

5jj said: "Thanks. I'll try to use that some time. Trouble is, I'm nearly computer illiterate, and if something can go wrong, it will.

By the way, is custers in the third line of your program a typo in the program which should therefore be retained, or a typo in this message, which should therefore be corrected. Or is it something else?"
Sorry, it's a typo, but it's in a 'comment', which Python ignores. In Python, anything following # on a line is a comment and is discarded before the program runs, so it isn't important.
I've tried the program with 3-, 4- and 5-letter clusters, but you need to go off and get coffee while it's working on the 5-letter ones.

I'm working on a version which, instead of ignoring duplicates, actually counts the instances. One could then classify (or list) them by frequency of occurrence.
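
The counting idea might look something like the following. This is only a sketch, not the version described above: it assumes Python 3, uses `collections.Counter` from the standard library, and counts every adjacent pair of consonant letters. The function name `count_pairs` and the `sample` string are made up for illustration.

```python
from collections import Counter

LETTERS = set("bcdfghjklmnpqrstvwxz")  # same consonant set as the program above

def count_pairs(text):
    """Count every adjacent pair of consonant letters in the text."""
    counts = Counter()
    text = text.lower()
    for a, b in zip(text, text[1:]):  # walk over neighbouring characters
        if a in LETTERS and b in LETTERS:
            counts[a + b] += 1
    return counts

sample = "a temporary dictionary - replace"  # hypothetical stand-in for a word list
counts = count_pairs(sample)
# List the clusters by frequency, most common first
for cluster, n in counts.most_common():
    print(cluster, n)
```

Spaces and punctuation are not in `LETTERS`, so pairs never straddle two words.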
 

Tdol

No Longer With Us (RIP)
Staff member
Joined
Nov 13, 2002
Native Language
British English
Home Country
UK
Current Location
Japan
How do you do 3/4/5 letters? Clist + letters?
 

Raymott

Tdol said: "How do you do 3/4/5 letters? Clist + letters?"

The simplest way, in keeping with my poor level of programming, would be to add loops one after the other, have them run consecutively, then join the lists.

-----------------------
# Get 2-letter clusters
for first in letters:
    for second in letters:
        cluster = first + second
...
clist.sort()
...

# Get 3-letter clusters
dlist = []
for first in letters:
    for second in letters:
        for third in letters:
            cluster = first + second + third
...
dlist.sort()
clist.extend(dlist) # extend, not append, so the 3-letter clusters join the same flat list

# Get 4-letter clusters
elist = []
...
clist.extend(elist)
---------------

The more involved way would be to write one function that increments the number of consonants to look for. There would be a number of ways of doing this, none of which I've worked out yet.

for clusterSize in range(2, 7): # sizes 2 to 6
...
---------------------------------------
# Third method - looks for bc, then bcd, then bcdf ... before looking for bd
for first in letters:
    for second in letters:
        cluster = first + second
...
clist.sort()
... # Get 3-letter
        for cluster in letters:
            for third in letters:
                cluster = cluster + third
    dlist.sort() # sort() works in place and returns None, so sort first
    clist.extend(dlist) # then add the sorted clusters
... # Get 4-letter
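
One possible shape for the single-function version, offered as a sketch rather than a worked-out answer. It assumes Python 3 and uses `itertools.product` to generate every candidate of a given length, which is simple but gets slow quickly as the size grows. The name `find_clusters` and the sample text are invented for illustration.

```python
from itertools import product

LETTERS = "bcdfghjklmnpqrstvwxz"  # consonants, as in findClusters.py

def find_clusters(text, size):
    """Return a sorted list of the size-letter consonant strings that occur in text."""
    found = []
    for combo in product(LETTERS, repeat=size):  # every possible letter sequence
        cluster = "".join(combo)
        if cluster in text:
            found.append(cluster)
    return sorted(found)

text = "a temporary dictionary - replace the strings"  # hypothetical sample text
clist = []
for cluster_size in range(2, 4):  # raise the upper bound for longer clusters
    clist.extend(find_clusters(text, cluster_size))
print(clist)
```

Each extra letter multiplies the number of candidates by 20, which is why the 5-letter search takes so long with this brute-force approach.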

 

Tdol

I know nothing about python, but if it searched for vowels and excluded rather than appending, might that not give it less to search through?
 

Raymott

Tdol said: "I know nothing about python, but if it searched for vowels and excluded rather than appending, might that not give it less to search through?"
But you're looking for consonant clusters.
If the text was "banana", and you excluded the vowels, you'd have 'bnn', but that's not a cluster in 'banana', which doesn't have a consonant cluster. You'd also have to exclude consonants that were surrounded by vowels.

I guess you could swap all vowels with a space on the first pass, then swap all single consonants with a space on the second, and you'd be left with consonant clusters.
Something like this:

# str.replace returns a new string, so the original dictionary is left untouched
text = dictionary.lower()
# Change vowels to spaces
for v in "aeiou":
    text = text.replace(v, " ")
# Then drop isolated consonants, which are now surrounded by spaces
pieces = [p for p in text.split() if len(p) > 1]
# Then gather up the clusters
clusters = sorted(set(pieces))

One possible drawback is that the work needed to swap letters for spaces is greater than that needed for merely searching and passing over them. Of course, there are always other possible ways to do things. Last night I just dashed off the most obvious way I could think of, regardless of good software engineering principles.
Yours is a good idea though. I'll try it sometime.

PS: You'd also have to strip off all punctuation marks, perhaps newline and tab marks, etc. Also, this method is destructive, so you'd need to do it on a copy of the dictionary! There's a lot to consider!
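
The same result can be reached without swapping any characters. The sketch below is an alternative, not the method described above: it assumes Python 3 and uses the standard `re` module to pull out each maximal run of two or more consonant letters, which sidesteps punctuation and never changes the text. Note that it reports whole runs only, so 'ghts' is not also counted as 'ght'. The name `cluster_counts` is invented for illustration.

```python
import re
from collections import Counter

# Two or more consecutive consonant letters (same set as the earlier program)
CLUSTER = re.compile(r"[bcdfghjklmnpqrstvwxz]{2,}")

def cluster_counts(text):
    """Count each maximal run of consonants in text, ignoring case."""
    return Counter(CLUSTER.findall(text.lower()))

print(cluster_counts("banana"))  # an empty Counter: 'banana' has no clusters
print(cluster_counts("The lights, the nights!").most_common())
```

Because the pattern matches only consonant letters, spaces, newlines and punctuation end a run automatically.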
 

Raymott

Here are the consonant clusters in the Bible (KJV) with frequencies.
Printout from the improved ConClusters.py; computation time – 2 mins.
It reads from a text file and prints to screen or file.
-------------------------------------------------
List of Consonant Clusters

6-letter custers:

5-letter custers:

rstfr:34, ffspr:13, tchcr:7, ngstr:7, llspr:2,

4-letter custers:

ngth:409, nsgr:179, nstr:144, rstb:118, ghts:104, ghtn:72, lchr:70, rthr:63, nths:59, thst:57, ckcl:47, ghtw:42, ghth:41, phth:35, stfr:34, rthw:34, rstf:34, tchm:31, stpl:31, nksg:31, ghtl:28, thdr:26, lfth:23, rstl:20, rthq:19, ngpl:19, ngfl:19, pths:17, cksl:17, ngst:14, ngbl:14, fspr:13, ffsp:13, llst:11, ftsm:9, tchc:7, gstr:7, chcr:7, rthl:6, ldsm:6, rscr:5, rdsh:5, bscr:5, rstr:4, rspr:4, tstr:3, thph:3, tcht:3, rthd:3, ndbr:3, nchr:3, chsh:3, tchf:2, stch:2, shch:2, rnfl:2, rlds:2, rldl:2, rchs:2, ngtr:2, ndsh:2, lspr:2, llsp:2, llpl:2, ffsc:2, chth:2, wncl:1, wkth:1, thch:1, rchm:1, phph:1, nthl:1, ntbr:1, nscr:1, nggl:1, ndwr:1, ndst:1, nctl:1, lmsd:1, lkst:1, ckcr:1, chst:1,

3-letter custers:

ght:6034, rth:2605, ngs:2403, str:2398, thr:2236, nst:1955, ldr:1838, nts:1585, sts:1090, rst:1028, rds:931, nds:882, ndr:714, nsw:690, nth:666, ttl:645, dst:620, tch:498, ndm:469, ngr:464, rch:454, dgm:431, ngt:413, gth:413, ngd:400, ntr:371, rnt:368, mpl:343, mbl:331, ncl:309, rld:301, ths:294, nct:291, rsh:289, spr:279, cks:271, rks:256, phr:255, rts:247, scr:238, stl:217, ndl:206, nch:192, lls:183, ffl:181, sgr:180, nsg:179, mpt:178, ngl:172, ppr:166, rkn:165, ntl:160, ncr:160, ghb:159, rbs:143, chr:142, thf:132, rns:130, lds:130, shm:126, nks:125, stb:118, lch:118, lth:108, hts:104, pht:97, dth:96, mbs:95, ckl:95, ppl:92, nsl:92, mbr:91, ndn:89, fth:88, sch:86, wls:85, rpr:84, hth:78, sth:77, fts:76, cts:73, htn:72, nkl:69, stw:67, ddl:64, ckn:64, rdl:63, shb:62, ctr:60, bst:60, thw:59, sph:59, rpl:59, hst:58, wns:55, ntm:54, gns:54, chm:49, xth:48, ckc:48, kcl:47, sht:46, rtl:46, rms:45, mps:44, htw:42, rsc:41, nsc:41, nsp:38, rfl:37, spl:36, tpl:35, tfr:34, stf:34, lst:33, lms:33, rkm:32, stp:31, ksg:31, chs:31, thh:30, pth:30, ngk:30, thd:29, rlw:29, htl:28, shd:26, ngf:26, ndf:26, hdr:26, ghw:26, ghs:26, ghl:26, rps:25, mph:25, thl:24, pts:24, nsm:24, bbl:24, tst:23, ssh:23, shn:23, rtr:23, lft:23, wsh:22, rrh:22, ldl:22, ngp:21, ldn:20, thq:19, gpl:19, gfl:19, ckw:19, wbr:18, wdn:17, shl:17, lph:17, ksl:17, ffs:17, bsh:17, ptr:16, ngb:15, mst:15, lfs:15, xpr:14, thn:14, pph:14, nfl:14, lps:14, gst:14, gbl:14, tth:13, rph:13, fsp:13, rdn:12, nsf:12, ndw:12, ltl:12, hsh:12, ftl:12, shk:11, nkn:11, ngh:11, lts:11, lsh:11, ffr:11, ctl:11, cht:11, tsm:10, rsq:10, npr:10, msh:10, bsc:10, ssl:9, rwh:9, rmw:9, rls:9, rdm:9, mpr:9, ffn:9, ckb:9, chb:9, shr:8, shh:8, shc:8, phn:8, nsh:8, dsh:8, xch:7, nkf:7, hcr:7, ghn:7, ggs:7, chc:7, btl:7, wnw:6, skm:6, shp:6, rsp:6, rnm:6, rnf:6, phl:6, nfr:6, llb:6, lks:6, lgr:6, dsm:6, dch:6, chp:6, tsh:5, rtn:5, rbl:5, chl:5, cch:5, bts:5, xtr:4, thm:4, stm:4, ssw:4, sps:4, rml:4, phs:4, pbr:4, ksh:4, hph:4, chz:4, chf:4, 
zzl:3, wpr:3, thp:3, thc:3, rtf:3, rdr:3, psh:3, phk:3, nkh:3, ndb:3, nbl:3, llf:3, lfw:3, kth:3, hch:3, ghm:3, ftt:3, ffd:3, dbr:3, ckk:3, btf:3, xsc:2, xpl:2, xcl:2, wth:2, ttr:2, tfl:2, stn:2, stc:2, sks:2, scl:2, rsm:2, rpn:2, rnn:2, rmh:2, rkl:2, rct:2, pwr:2, ptl:2, phz:2, phm:2, ntn:2, nsk:2, npl:2, ngn:2, msp:2, mns:2, mnl:2, lsp:2, lpl:2, llp:2, ldh:2, gtr:2, ggl:2, fsc:2, ckt:2, ccl:2, zth:1, wnh:1, wnc:1, wkt:1, wdl:1, vsh:1, tsk:1, thj:1, thb:1, tbr:1, ssf:1, sms:1, shv:1, shg:1, rtg:1, rsw:1, rkf:1, rfr:1, rdh:1, rdb:1, rcl:1, ptn:1, psk:1, php:1, ntb:1, nsn:1, nkw:1, ngw:1, ngg:1, ndk:1, msd:1, ltp:1, ltn:1, llm:1, lfr:1, lfc:1, ldb:1, kst:1, kcr:1, ffh:1, dwr:1, dds:1, ctn:1, chn:1, btr:1,

2-letter custers:

th:155741, nd:64964, ll:26006, nt:23126, ng:20898, st:19747, sh:19449, ch:12629, wh:11902, ld:8722, ns:8119, gh:7941, rs:7668, rd:7649, ss:7098, rt:6810, ht:6189, pr:6003, tr:5641, br:5059, pl:5005, rn:4748, ts:4679, fr:4596, nc:4261, dr:3895, lt:3823, sp:3662, ls:3281, hr:3272, ds:3257, gr:3119, bl:3002, ff:2770, tt:2716, sr:2619, gs:2512, ft:2506, wn:2341, ph:2300, pt:2295, ck:2286, cr:1941, kn:1917, rv:1895, cl:1877, tw:1872, mm:1636, rk:1631, sw:1620, fl:1575, mb:1562, ms:1547, sc:1519, mp:1448, rr:1430, lv:1422, ct:1348, rc:1339, lf:1339, tl:1297, rm:1217, dg:1179, nn:1175, cc:1161, pp:1116, sl:1115, gl:984, rl:961, wr:949, nk:949, sm:790, ks:768, dw:750, gn:715, rg:681, lk:632, nh:614, ws:606, dl:566, nl:540, dn:538, bs:531, dm:529, tc:520, sk:483, sn:481, rp:472, ps:445, gm:440, gd:422, gt:417, wl:410, dd:394, bb:371, hs:370, tn:360, nf:351, rb:350, lm:347, rf:317, mn:293, xc:288, sd:245, lp:235, hb:235, zz:229, sg:229, nj:226, xt:213, hm:197, hn:195, kl:187, sb:178, rw:171, nv:157, dv:147, mc:146, mf:144, lc:142, hf:136, zr:135, bt:125, lw:124, tb:123, tf:122, nq:119, gg:109, bn:109, hl:108, dt:98, mr:89, wb:88, ln:87, tm:86, hw:85, xp:81, sf:79, nw:79, lg:72, df:72, pw:65, wf:63, hd:60, nr:58, bd:56, rh:52, zp:51, fs:51, bh:50, kc:48, nb:43, kk:41, bj:41, rj:40, km:39, lr:38, hh:38, xh:36, tp:36, rz:33, wd:32, gk:30, lb:29, kr:28, nm:27, ml:26, gf:26, sq:25, np:24, lh:24, hp:24, gp:22, bm:21, kw:20, hq:19, hc:18, gb:18, cq:18, hz:17, kb:16, rx:15, lz:14, hk:14, zm:12, kt:12, zh:11, mw:10, fn:10, tg:9, mh:9, db:9, cs:9, bz:9, zn:8, kf:8, wg:7, pm:7, pb:7, dj:7, zl:6, zb:6, wk:6, kh:6, gv:6, dh:6, dc:6, wm:5, bv:5, zk:4, zg:4, pn:4, pf:4, mz:4, wp:3, pk:3, pc:3, mt:3, mg:3, kv:3, fw:3, fd:3, dk:3, xs:2, xl:2, wt:2, nz:2, hg:2, zt:1, zj:1, ww:1, wc:1, vs:1, tk:1, qw:1, md:1, kd:1, hv:1, hj:1, gw:1, fh:1, fc:1,

>>>
 

birdeen's call

VIP Member
Joined
Jul 15, 2010
Member Type
Student or Learner
Native Language
Polish
Home Country
Poland
Current Location
Poland
I'm trying to understand how the cluster count works but can't. I don't see it in the code... Is the improved version much different from what you have already posted?
 

Raymott

VIP Member
Joined
Jun 29, 2008
Member Type
Academic
Native Language
English
Home Country
Australia
Current Location
Australia
birdeen's call said: "I'm trying to understand how the cluster count works but can't. I don't see it in the code... Is the improved version much different from what you have already posted?"
That's just the printout. I'll post the code in the next post below.
Yes, there are two main changes I made.
The first was to count the instances. When the program finds a new cluster, it goes over the whole text, counts the instances, and saves a pair (cluster, count) instead of just the cluster.
The second change, which improved the efficiency to a realistic level, was having each successive search build only on the results of the previous one. So the search for 3-letter clusters doesn't start again with every possible combination; it only tries clusters that extend one already found in the 2-letter search (which is rather obvious, but I didn't think of it straight away!).
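
Those two changes can be sketched in a few lines. This is an illustration under assumptions, not the posted program: it assumes Python 3, and the names `grow_clusters` and `max_size` are invented here. Each pass builds candidates only from clusters found in the previous pass and stores a (cluster, count) pair for each hit.

```python
LETTERS = "bcdfghjklmnpqrstvwxz"  # consonants

def grow_clusters(text, max_size=6):
    """Map each cluster length to a list of (cluster, count), most frequent first."""
    results = {}
    previous = list(LETTERS)  # single letters act as the roots for the first pass
    for size in range(2, max_size + 1):
        hits = []
        for first in LETTERS:
            for root in previous:  # only extend clusters that were already found
                cluster = first + root
                if cluster in text:
                    hits.append((cluster, text.count(cluster)))
        if not hits:
            break  # no cluster of this size, so none can be longer
        # Sort by descending count, then alphabetically
        hits.sort(key=lambda pair: (-pair[1], pair[0]))
        results[size] = hits
        previous = [cluster for cluster, _ in hits]
    return results

sample = "strength and lengths"  # hypothetical sample text
for size, hits in sorted(grow_clusters(sample).items(), reverse=True):
    print(size, hits)
```

Extending only known clusters is safe because any cluster of n+1 letters contains a cluster of n letters as its tail, so nothing is missed.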

If I stay interested in Text Processing coding, I might put a page up somewhere with some useful tools on it.
 

Raymott

Hmm, to see the tabs, download the .txt file.
It won't run without the tabs, and the '.txt' has to be replaced by '.py'.
################################################################
# Searches a file for consonant clusters
# Prints clusters with frequency,
# Sorted by cluster length, then frequency, then reverse alphabetical
# Raymott 28/10/2010
################################################################
import operator # for sort
clist,c1list,dlist,d1list = [],[],[],[]
elist,e1list,flist,f1list,glist,g1list = [],[],[],[],[],[] # start with empty lists of clusters
cluster = "" # empty cluster
letters = "bcdfghjklmnpqrstvwxz" # consonants
# Open file, read it into a string, close file
# Note: Change the file name in the line below.
dictionary = open('hamlet.txt', 'r') # File to be read (read-only)
wordList = dictionary.read()
dictionary.close()
######## Get 2-letter clusters
for first in letters:
    for second in letters:
        cluster = first + second
        if cluster in wordList:
            counter = wordList.count(cluster)
            if cluster not in c1list:
                clist.append([cluster, counter])
                c1list.append(cluster) # hold clusters for the 3-letter search

# sort the list by frequency, descending
clist.sort(key=operator.itemgetter(1))
clist.reverse()
######## Get 3-letter clusters
for first in letters:
    for second in c1list: # search only with the 2-letter clusters
        cluster = first + second
        if cluster in wordList:
            counter = wordList.count(cluster)
            if cluster not in d1list:
                dlist.append([cluster, counter])
                d1list.append(cluster) # hold for the 4-letter search
# sort the list by frequency, descending
dlist.sort(key=operator.itemgetter(1))
dlist.reverse()
######## Get 4-letter clusters
for first in letters:
    for second in d1list: # search only with the 3-letter clusters
        cluster = first + second
        if cluster in wordList:
            counter = wordList.count(cluster)
            if cluster not in e1list:
                elist.append([cluster, counter])
                e1list.append(cluster)

# sort the list by frequency, descending
elist.sort(key=operator.itemgetter(1))
elist.reverse()
######## Get 5-letter clusters
for first in letters:
    for second in e1list: # search only with the 4-letter clusters
        cluster = first + second
        if cluster in wordList:
            counter = wordList.count(cluster)
            if cluster not in f1list:
                flist.append([cluster, counter])
                f1list.append(cluster)
# sort the list by frequency, descending
flist.sort(key=operator.itemgetter(1))
flist.reverse()

######## Get 6-letter clusters
for first in letters:
    for second in f1list: # search only with the 5-letter clusters
        cluster = first + second
        if cluster in wordList:
            counter = wordList.count(cluster)
            if cluster not in g1list:
                glist.append([cluster, counter])
                g1list.append(cluster)

# sort the list by frequency, descending
glist.sort(key=operator.itemgetter(1))
glist.reverse()
# Print the lists (each line is built as one string, then printed)
print("List of Consonant Clusters\n")
print("6-letter custers:\n")
print("".join(item[0] + ":" + str(item[1]) + ", " for item in glist))
print('\n')
print("5-letter custers:\n")
print("".join(item[0] + ":" + str(item[1]) + ", " for item in flist))
print('\n')
print("4-letter custers:\n")
print("".join(item[0] + ":" + str(item[1]) + ", " for item in elist))
print('\n')
print("3-letter custers:\n")
print("".join(item[0] + ":" + str(item[1]) + ", " for item in dlist))
print('\n')
print("2-letter custers:\n")
print("".join(item[0] + ":" + str(item[1]) + ", " for item in clist))
#########
 

Attachments

  • clusterCount.txt
    3.7 KB · Views: 3

crazYgeeK

Member
Joined
Jun 9, 2010
Member Type
Student or Learner
Native Language
Vietnamese
Home Country
Vietnam
Current Location
Vietnam
Hi Raymott, can you program with C# or Java or C++ ?
I only know Python is a server scripting language like PHP. It's so good if you know C# because I'm learning C# and needing someone to discuss with and help me sorting out some problems.
I don't understand why we have to classify consonants into clusters ? How does it help English learners ?
Please tell me how useful it is to understand it.
Thank you so much !
 

Raymott

crazYgeeK said: "Hi Raymott, can you program with C# or Java or C++ ?
I only know Python as [?] a server scripting language like PHP."
Python is a fully featured language. Yes, it's often used as a scripting language. I learnt Java and C for a few semesters, but I don't work in the industry, so I don't need them. I looked at a lot of languages before deciding on one to learn properly, and I decided on Python. Java is excellent if it's important that your code doesn't make airplanes fall out of the sky: it's safe. But that safety comes at a huge cost in difficulty and pure tedium.
crazYgeeK said: "It's so good if you know C# because I'm learning C# and needing someone to discuss with and help me sorting out some problems."
I've never tried C#, sorry. Maybe you should try a C# forum? I'm sure they exist.
crazYgeeK said: "I don't understand why we have to classify consonants into clusters ? How does it help English learners ?
Please tell me how useful it is to understand it.
Thank you so much !"
One doesn't have to classify consonant clusters. But the OP asked about it. It gave me an idea, and I coded it. It's interesting, but not important.
Well, it could help English learners to know about the apparent frequency of clusters, and so to know whether being able to pronounce them is necessary. For example, we don't have a cluster 'kxsh', but 'str' is very common.
 

crazYgeeK

Thank you, I'll try Python after mastering almost of C# language. I think why don't you try learning C#, it's so great, powerful, if you are good at Python, you'll need only a month or less than to get at basic level of C# programming.
Programming is the main reason I have to learn English to understand English IT e-books, there are not many Vietnamese IT e-books for me.
Thank you for sharing!
 