Hi. I am researching formulaic language among English learners. As a baseline, I'm using the mutual information (MI) and frequency scores I get when searching for given collocations in the Corpus of Contemporary American English (COCA). I'm hoping someone here can help me with a couple of questions about frequency and MI scores.

Question 1: Previous research has mostly examined adjacent 2-word collocations, not formulas with variable slots and variable sequencing. Those studies are therefore more likely to report higher MI, aren’t they? I’m finding that adjacent collocations have much higher MI than those with variability of form. Is this to be expected? If so, why?
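
To make Question 1 concrete, here is a minimal sketch of how I understand a span-adjusted MI score to be computed. The formula is my assumption about what corpus interfaces typically do (observed co-occurrences divided by the count expected by chance, with the expected count scaled by the search span). The function name and the counts are made up for illustration, not real COCA figures.

```python
import math

def mi_score(pair_freq, freq_a, freq_b, corpus_size, span=1):
    """Span-adjusted mutual information (assumed formula, not confirmed for COCA).

    pair_freq   : how often the two words co-occur within the span
    freq_a/b    : individual frequencies of each word
    corpus_size : total number of tokens in the corpus
    span        : number of positions in which the second word may appear
    """
    # Co-occurrences expected by chance grow with the size of the span.
    expected = freq_a * freq_b * span / corpus_size
    return math.log2(pair_freq / expected)

# Hypothetical counts: same observed frequency, different spans.
adjacent = mi_score(100, 1000, 1000, 1_000_000, span=1)
with_slots = mi_score(100, 1000, 1000, 1_000_000, span=4)
print(round(adjacent - with_slots, 2))
```

If this is roughly right, widening the span raises the expected count, which lowers MI unless the observed count grows by the same factor. I'd appreciate confirmation that this is what is going on.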

Question 2: When I allow slots between two co-occurring words, some of the results retrieved are not actually instances of the pattern in my samples. For example, students used “encourage [NP] to”, but when I search for “encourage” followed by “to”, allowing up to 4 slots between them, the concordances might include something like “encourage innovative approaches to”. How do I deal with these cases? I don’t want to count them in the total frequency, but I can’t check every concordance line manually.
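
One rough idea for Question 2, sketched only as an illustration: filter the concordance lines automatically by checking what follows “to”. This assumes the lines are available as lists of (word, tag) pairs with Penn Treebank-style tags (`VB` for a base-form verb), which is an assumption about the export format. The function name, the word list, and the hand-tagged examples are hypothetical.

```python
# Forms of the verb to look for (illustrative list).
ENCOURAGE_FORMS = {"encourage", "encourages", "encouraged", "encouraging"}

def is_encourage_np_to_inf(tagged, max_gap=4):
    """Return True if the line contains encourage + 1..max_gap words + to + base verb."""
    for i, (word, _tag) in enumerate(tagged):
        if word.lower() not in ENCOURAGE_FORMS:
            continue
        # "to" must come after at least one intervening word (the NP slot),
        # and a following token must exist so its tag can be inspected.
        for j in range(i + 2, min(i + max_gap + 2, len(tagged) - 1)):
            if tagged[j][0].lower() == "to" and tagged[j + 1][1] == "VB":
                return True
    return False

# Hand-tagged hypothetical concordance lines.
wanted = [("We", "PRP"), ("encourage", "VBP"), ("students", "NNS"),
          ("to", "TO"), ("participate", "VB")]
unwanted = [("encourage", "VB"), ("innovative", "JJ"), ("approaches", "NNS"),
            ("to", "TO"), ("teaching", "NN")]

print(is_encourage_np_to_inf(wanted), is_encourage_np_to_inf(unwanted))
```

This would only be a heuristic: it depends on tagging accuracy and would still miss or wrongly keep some lines, so I'm not sure whether it is an acceptable substitute for manual checking.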

Thanks for any help!