Hi. I am researching formulaic language among English learners, and as a baseline I'm using the mutual information (MI) and frequency scores that the Corpus of Contemporary American English (COCA) returns for given collocations. I hoped someone here could help me with a couple of questions about frequency and MI counts.
Question 1: Previous research has mostly examined 2-word collocations, not formulas with variable slots and variable sequencing. Those studies are therefore more likely to find higher MI, aren’t they? I’m finding that adjacent collocations have much higher MI than those with variable form. Is this to be expected? If so, why?
Question 2: When I allow slots between two co-occurring words, some of the retrieved results may not match the construction in my learner samples. For example, students used “encourage [NP] to”, but when I search for “encourage” followed by “to”, allowing up to 4 slots in between, the concordances may include something like “encourage innovative approaches to”. How do I deal with these cases? I don’t want to count them in the total frequency, but I can’t manually check every concordance line.
Thanks for any help!
Can't you clean the data? The basic text-to-table and table-to-text tools in a word processor would make some of that possible.
If you look through the posts of this guy, he has some command line tools that might help:
His homepage is here: http://www.sanmayce.com/
While I didn't agree with his conclusions about some of the searches he was doing, his searches were very good and discovered all sorts of things.
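If a script is an option, here is a rough sketch of the kind of cleaning I mean, not a tested recipe. It assumes you can export the concordance lines with part-of-speech tags attached in a word_TAG form (that export format is my assumption), and that the tags follow the CLAWS7 convention, where infinitival "to" is TO and prepositional "to" is II. The names PATTERN, is_target and filter_lines are made up for the example.

```python
import re

# Assumption: each concordance line is a string of word_TAG tokens with
# CLAWS7-style tags (TO = infinitive marker "to", II = preposition).
PATTERN = re.compile(
    r"\bencourag\w*_V\w+"   # any form of "encourage" tagged as a verb
    r"(?:\s+\S+){1,4}?"     # the NP slot: 1 to 4 intervening tokens
    r"\s+to_TO\b",          # "to" tagged as the infinitive marker
    re.IGNORECASE,
)


def is_target(line):
    """Return True if the line contains 'encourage [NP] to' + infinitive."""
    return PATTERN.search(line) is not None


def filter_lines(lines):
    """Keep only the concordance lines that match the target construction."""
    return [line for line in lines if is_target(line)]


if __name__ == "__main__":
    sample = [
        "they_PPHS2 encourage_VV0 students_NN2 to_TO read_VVI more_RRR",
        "we_PPIS2 encourage_VV0 innovative_JJ approaches_NN2 to_II teaching_NN1",
    ]
    kept = filter_lines(sample)
    print(len(kept), "of", len(sample), "lines kept")
```

The frequency you report would then be the number of kept lines rather than the raw hit count. Tagging errors will let a few wrong lines through, so it is still worth spot-checking a random sample by hand.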