Stats for Researching Collocation

Status
Not open for further replies.

lvilleky

Member
Joined
Jul 30, 2011
Member Type
Academic
Native Language
English
Home Country
United States
Current Location
United States
Hi. I am researching formulaic language among English learners. As a baseline, I'm using mutual information (MI) and frequency scores for given collocations from the Corpus of Contemporary American English (COCA). I hope someone here can help me with a couple of questions about frequency and MI counts.

Question 1: Previous research has mostly examined adjacent two-word collocations, not formulas with variable slots and variable sequencing. Those studies are therefore more likely to report higher MI, aren’t they? I’m finding that adjacent collocations have much higher MI than those with variable form. Is this to be expected? If so, why?
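
For reference on Question 1, here is a rough sketch of how a span-adjusted MI score can be computed. It assumes the commonly cited formulation MI = log2((AB × corpus size) / (A × B × span)); whether a given interface applies exactly this span adjustment is an assumption to check against its documentation. The function name, parameter names, and the counts are made up for illustration.

```python
import math

def mutual_information(pair_freq, freq_a, freq_b, corpus_size, span=1):
    """Span-adjusted MI (assumed formulation, not a confirmed implementation).

    pair_freq   -- how often the two words co-occur within the window
    freq_a/b    -- corpus frequency of each word on its own
    corpus_size -- total number of tokens in the corpus
    span        -- window width in words (1 = strictly adjacent)
    """
    expected = freq_a * freq_b * span / corpus_size  # co-occurrences expected by chance
    return math.log2(pair_freq / expected)

# Hypothetical counts: the same observed pair frequency, two window widths.
adjacent = mutual_information(100, 1000, 1000, 1_000_000, span=1)
windowed = mutual_information(100, 1000, 1000, 1_000_000, span=4)
print(round(adjacent, 2), round(windowed, 2))
```

Under this formulation, widening the window raises the number of co-occurrences expected by chance. Unless the observed count grows in proportion, the score drops, which would be consistent with adjacent pairs showing higher MI than variable-slot patterns.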

Question 2: When I allow slots between two co-occurring words, some of the retrieved results may differ from the construction in my samples. For example, students used “encourage [NP] to”, but when I search for encourage followed by to, allowing up to 4 intervening words, the concordances might include something like “encourage innovative approaches to”. How do I deal with these cases? I don’t want to count them in the total frequency, but I can’t manually check every concordance line.

Thanks for any help!
 

Tdol

No Longer With Us (RIP)
Staff member
Joined
Nov 13, 2002
Native Language
British English
Home Country
UK
Current Location
Japan
Can't you clean the data? The basic text-to-table and table-to-text tools in a word processor would make some of that possible.
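
If the concordance lines can be exported with part-of-speech tags, the cleaning could also be scripted. The sketch below is only an illustration under assumptions: each hit is a list of (word, tag) pairs, the tags are Penn Treebank style (where "VB" is a base-form verb), and the function name and parameters are hypothetical. It keeps a hit only when "to" is followed by a base-form verb, so "encourage students to read" passes and "encourage innovative approaches to teaching" does not. It does not check that the intervening words actually form an NP.

```python
def is_encourage_np_to_verb(tagged, max_gap=4, verb_tags=frozenset({"VB"}),
                            head_forms=frozenset({"encourage", "encourages",
                                                  "encouraged", "encouraging"})):
    """Return True if the hit looks like 'encourage ... to VERB'.

    tagged   -- list of (word, tag) pairs for one concordance line
    max_gap  -- maximum number of words allowed between the head and 'to'
    verb_tags -- tags counted as a base-form verb (assumed Penn-style; adjust
                 to whatever tagset your export uses)
    """
    words = [w.lower() for w, _ in tagged]
    for i, word in enumerate(words):
        if word not in head_forms:
            continue
        # 'to' may sit 1..max_gap words after the head (at least one slot word).
        for j in range(i + 2, min(i + max_gap + 2, len(words))):
            if words[j] == "to" and j + 1 < len(tagged) and tagged[j + 1][1] in verb_tags:
                return True
    return False

# Two made-up tagged hits: the first should be kept, the second dropped.
keep = [("encourage", "VB"), ("students", "NNS"), ("to", "TO"), ("read", "VB")]
drop = [("encourage", "VB"), ("innovative", "JJ"), ("approaches", "NNS"),
        ("to", "TO"), ("teaching", "NN")]
print(is_encourage_np_to_verb(keep), is_encourage_np_to_verb(drop))
```

The frequency you report would then be the number of hits for which the function returns True. Spot-checking a small random sample by hand would still be sensible, since tagger errors will let some wrong lines through.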

If you look through the posts of this guy, he has some command line tools that might help:
https://www.usingenglish.com/forum/members/358574.html

His homepage is here: http://www.sanmayce.com/

While I didn't agree with his conclusions about some of the searches he was doing, his searches were very good and discovered all sorts of things.
 