|
#1
| |||
| |||
| I need to know the ratio of meaningful words to non-meaningful words in a sentence. I don't know if I explained that correctly, so here's an example: "hello. this is a test which you'll all find very interesting and will study for many hours when you get home." These are the "meaningful" words in that sentence: hello, test, you'll, find, very, interesting, study, hours, you, home. All the rest are "non-meaningful" (i.e. they have no impact on the sentence other than to structure it), thus giving a ratio of 10:21 (10 meaningful words out of 21 in total). Is there an average ratio like this that holds for all English documents (on average, of course)? If that doesn't exist, is there a maximum "keyword" average? For example, if I'm talking about my pen and my desk: "This is my pen. I normally keep my pen on my desk at all times. I work on my desk, and use my pen for writing with". Here, the ratio of keywords (pen, desk) to all words is 5:27. Is there a maximum ratio like this, where the keyword should not be said more than X number of times in a sentence? Many thanks, and sorry for the unusual question! |
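As a rough illustration of the count described in the post above, here is a minimal Python sketch. The `STOP_WORDS` set is hypothetical: it simply lists the words the poster treated as non-meaningful in the example sentence, and a real list would need to be far larger. The names `tokenize` and `meaningful_ratio` are made up for this sketch.

```python
import re

# Hypothetical stop-word list: only the words treated as "non-meaningful"
# in the example sentence. A real list would be much larger.
STOP_WORDS = {"this", "is", "a", "which", "all", "and",
              "will", "for", "many", "when", "get"}

def tokenize(text):
    """Lower-case the text and split it into words, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def meaningful_ratio(text, stop_words=STOP_WORDS):
    """Return (meaningful word count, total word count) for the text."""
    words = tokenize(text)
    meaningful = [w for w in words if w not in stop_words]
    return len(meaningful), len(words)

sentence = ("hello. this is a test which you'll all find very interesting "
            "and will study for many hours when you get home.")
print(meaningful_ratio(sentence))  # (10, 21)
```

The result depends entirely on which words go into the stop-word list, so any "average ratio" measured this way is only as good as that list.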
|
#2
| ||||
| ||||
| Quote:
http://www.usingenglish.com/members/text-statistics.php It does some of what you ask, and if you have any ideas for improving it I'd be happy to discuss options with you (I'm about to start modifying it anyway). Hope that helps. Last edited by Red5; 26-Nov-2004 at 14:07. |
|
#3
| |||
| |||
| Hi Thanks for that, but it's not exactly what I was looking for... I'm actually trying to make some artificial intelligence for a search engine, and it needs to be able to distinguish between keywords and trivial (non-meaningful) words. I do that by counting the word frequency. The more times a word appears, the more likely it is to be trivial. However, if a word appears frequently because the article focuses heavily on it, then obviously I don't want it to be considered trivial at all. My solution is: if I can find out the average ratio of non-meaningful words to keywords, I'll be able to guess whether a word is non-meaningful or very meaningful. It'll also use loads of other tests at different levels, etc etc. Thanks Last edited by Red5; 26-Nov-2004 at 14:08. |
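One common way around the problem raised above (a frequent word may be either trivial or the article's main topic) is to check how many different documents a word appears in, rather than how often it appears in one. This is a different technique from the ratio the poster proposes, shown here only as a hedged sketch. The sample documents, the function names and the `max_df_share` threshold of 0.5 are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into words, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def document_frequency(docs):
    """Count how many documents each word appears in at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df

def likely_keywords(doc, docs, max_df_share=0.5):
    """Return {word: count in doc} for words found in at most
    max_df_share of all documents (an assumed, tunable threshold)."""
    df = document_frequency(docs)
    counts = Counter(tokenize(doc))
    return {w: c for w, c in counts.items()
            if df[w] / len(docs) <= max_df_share}

# Made-up sample documents, for illustration only.
docs = [
    "This is my pen. I normally keep my pen on my desk at all times.",
    "This is the car I drive to work on most days.",
    "I keep the keys to my car in my coat at all times.",
]
print(likely_keywords(docs[0], docs))
```

With these three documents, "my" and "at" are dropped because they turn up in most of them, while "pen" survives even though it is repeated, because it only appears in one.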
|
#4
| |||
| |||
| That's a brilliant idea, although I see a major problem in it. To be able to determine a ratio like the one you're describing you would have to first teach a computer program to analyze what a proper sentence should look like under ALL circumstances. Basically you'd have to teach a heuristic algorithm to go beyond itself and analyze all the things that cannot be measured. Namely intent, mood and tone. As we've all seen with so-called "grammar checks" and "translation software", technology has a loooooooooooooong way to go before this is a reality. -Nah- |
|
#5
| |||
| |||
| Hehe, yeah. That would be the ultimate goal, but I could never be bothered to do that :P That's only one "layer" of the analysing... The other "layers" look at where the text is. For example, text which is in bold is considered to be important, but I need a way of making sure any trivial bold text (the, etc) doesn't also get considered as being important. Once finished, it'll probably try to learn from these layers: (+) - makes word more important; (-) - makes word less important; (+) <b>,<h>,<a>,<i>,<u>, etc; (+) frequency ratio UP TO (perhaps...) 1:5. A lower ratio results in (-). (Few more, working on them.) Also, I'd like it to learn from previous searches. *Most* people don't include words such as "the" in their search queries (but some do - I'd need a way of checking for that...), so a word which had once been in a search term would have its importance increased. Loads of other ideas in the back of my head, but can't quite put my finger on them yet... It should be quite cool once finished though (I hope - otherwise I've wasted several weeks' work!). Cheers |
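The layers listed above could be combined into a single score along the following lines. This is only a sketch under stated assumptions: the one-point weights are invented, `score_word` and its parameters are hypothetical names, and the "1:5" limit is read here as "a word may make up at most one fifth of the text before it is penalised", which is one possible interpretation of the post.

```python
EMPHASIS_TAGS = {"b", "h", "a", "i", "u"}  # tags named in the post
MAX_SHARE = 1 / 5  # assumed reading of the "1:5" frequency limit

def score_word(word, count, total_words, tags=(), past_queries=()):
    """Combine the layers into one score (weights are assumptions)."""
    score = 0
    # (+) one point for each emphasis tag the word was found inside
    score += sum(1 for t in tags if t in EMPHASIS_TAGS)
    # (+) if the word's share of the text is within the limit, (-) if not
    share = count / total_words
    score += 1 if share <= MAX_SHARE else -1
    # (+) if the word was used in an earlier search query
    if any(word in q.lower().split() for q in past_queries):
        score += 1
    return score

print(score_word("pen", 3, 27, tags=["b"], past_queries=["cheap pen"]))  # 3
print(score_word("the", 8, 27, tags=["b"]))  # 0
```

In the second call the bold tag earns a point but the high frequency share takes it away again, which is the kind of check on trivial bold text the post asks for.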
|
#6
| |||
| |||
| It depends very much on length: lexical density falls as a text grows, so I think it would be hard to find an absolute ratio, but it might be possible to find a ratio for different lengths. |
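To see how the ratio behaves at different lengths for a given text, one option is to measure it over growing prefixes. The sketch below assumes a tiny hypothetical stop-word list and a made-up function name, `ratio_by_length`; it shows how such a measurement could be taken, not what the figures should be.

```python
import re

# Hypothetical stop-word list, just large enough for the sample text.
STOP_WORDS = {"this", "is", "my", "i", "on", "at", "all",
              "and", "for", "with"}

def ratio_by_length(text, step, stop_words=STOP_WORDS):
    """Return [(prefix length, share of non-stop words)] for prefixes
    of the text that grow by `step` words at a time."""
    words = re.findall(r"[a-z']+", text.lower())
    results = []
    for n in range(step, len(words) + 1, step):
        prefix = words[:n]
        content = sum(1 for w in prefix if w not in stop_words)
        results.append((n, content / n))
    return results

text = ("This is my pen. I normally keep my pen on my desk at all times. "
        "I work on my desk, and use my pen for writing with")
for length, share in ratio_by_length(text, step=9):
    print(length, round(share, 3))
```

A sample this short says nothing about the trend; running the same measurement over long documents of different kinds would be needed to find a usable ratio per length.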
|
#7
| ||||
| ||||
| Quote:
If you're talking about search engines and their algos, I would recommend you visit SearchEngineWatch.com and ask in their Search Technology & Relevancy forum. Their moderator, Orion, has an encyclopedic knowledge of all things algorithmic regarding search engines. There's one condition, that you come back and let us know how you get on! Kind regards, Red5 |
|
#8
| ||||
| ||||
| Quote:
Anyway, I wish you luck with your research. Edited to add: I'd be interested in starting a new forum area here specifically to do with analysing language. Would anyone (other than me) be interested? Last edited by Red5; 19-Nov-2004 at 13:22. |
|
#9
| |||
| |||
| Hi Thanks for the advice, I'll head over to searchenginewatch.com shortly... Once (if ever!) I get this finished, I'd be happy to give you the source code (or document the workings, etc), if you're interested. That is, if it works of course. I have to admit, it wasn't desperately sensible for me to dive into this project though - my main strengths are encryption and database driven apps, not evaluating the English language (and I've got the satisfaction of failing my English exams when I went to school to prove that fact). Cheers |
|
#10
| |||
| |||
| I would be interested. |