Google Ngram Viewer

Status
Not open for further replies.

Tdol

No Longer With Us (RIP)
Staff member
Joined
Nov 13, 2002
Native Language
British English
Home Country
UK
Current Location
Japan

You can compare word usage across some of Google's book databases. :up:
 

5jj

Moderator
Staff member
Joined
Oct 14, 2010
Member Type
English Teacher
Native Language
British English
Home Country
Czech Republic
Current Location
Czech Republic
Fascinating, thanks. That could become addictive.
 

Sanmayce

Junior Member
Joined
Jan 25, 2011
Member Type
Student or Learner
Native Language
Bulgarian
Home Country
Bulgaria
Current Location
Bulgaria
Hi,
Google Ngram Viewer is indeed a very good initiative; it is a strong starting point/foundation for future word/phrase comparisons/analysis.

I am not a Google fan, but I admit that the ngram datasets offered for free download speak well of the people behind this project.

My console tool Leprechaun_quadrupleton utilizes(in particular) these sets(I downloaded and began to use the 4-grams, which come as 400 chunks/files of about 1GB each, i.e. 400GB in total).

After running Leprechaun_quadrupleton the result is 400 files of about 8MB each, i.e. 3.2GB of pure unique 4-grams. The input lines and the resulting 4-grams look like this:
Code:
D:\_KA45F~1\_4>dir
12/12/2010 01:37 PM 1,111,609,996 googlebooks-eng-us-all-4gram-20090715-0.csv
01/26/2011 06:46 PM 315 googlebooks-eng-us-all-4gram-20090715-0.csv.EXCERPT
01/26/2011 05:13 AM 514,048 Leprechaun_quadrupleton_Intel_IA-32_11.1.exe
D:\_KA45F~1\_4>type googlebooks-eng-us-all-4gram-20090715-0.csv.EXCERPT
...
It cut me to 2002 4 4 4
It cut me to 2004 4 4 4
It cut me to 2005 6 6 6
It cut me to 2006 2 2 2
It cut me to 2007 1 1 1
It cut me to 2008 1 1 1
It declares that ' 1816 1 1 1
It declares that ' 1832 2 2 2
It declares that ' 1833 1 1 1
It declares that ' 1834 1 1 1
It declares that ' 1838 1 1 1
...
D:\_KA45F~1\_4>dir *.excerpt/b>test.lst
D:\_KA45F~1\_4>Leprechaun_quadrupleton_Intel_IA-32_11.1.exe test.lst test.wrd
Leprechaun(Fast Greedy Word-Ripper), rev. 13_7pluses quadrupleton_r1, written by Svalqyatchx.
Leprechaun: 'Oh, well, didn't you hear? Bigger is good, but jumbo is dear.'
Kaze: Let's see what a 3-way hash + 6,602,752 Binary-Search-Trees can give us,
also the performance of a 3-way hash + 6,602,752 B-Trees of order 3.
Size of input file with files for Leprechauning: 53
Allocating memory 424MB ... OK
Size of Input TEXTual file: 315
|; Word count: 39 of them 1 distinct; Done: 64/64
Bytes per second performance: 315B/s
Words per second performance: 39W/s
Flushing unsorted words ...
Time for making unsorted wordlist: 1 second(s)
Deallocated memory in MB: 424
Allocated memory for words in MB: 1
Allocated memory for pointers-to-words in MB: 1
Sorting(with 'MultiKeyQuickSortX26Sort' by J. Bentley and R. Sedgewick) ...
Sort pass 26/26 ...
Flushing sorted words ...
Time for sorting unsorted wordlist: 1 second(s)
Leprechaun: Done.
D:\_KA45F~1\_4>type test.wrd
it_cut_me_to

There are a lot of ways to go from here, that is, to use 4-gram phrases. Currently I am contemplating an automatic reporter: 4-grams(taken from incoming text) compared to 4-grams(taken from googlebooks-eng-us-all-4gram). In a few words: a kind of phrase-checker.
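
A rough Python sketch of the reduction step shown in the console session, for readers who want the idea without the tool itself. It is not the Leprechaun source (which is a C console program): the name `unique_4grams`, the tab-separated field layout and the rule that a word is a run of ASCII letters are all assumptions read off the excerpt.

```python
import re

# Assumption: a "word" is a run of ASCII letters; the real tool may differ.
WORD = re.compile(r"[A-Za-z]+")

def unique_4grams(rows):
    """Collapse per-year rows into a sorted list of distinct 4-grams."""
    seen = set()
    for row in rows:
        # The first tab-separated field is the phrase; the rest are year and counts.
        phrase = row.split("\t", 1)[0]
        words = [w.lower() for w in WORD.findall(phrase)]
        if len(words) == 4:  # drops rows such as "It declares that '"
            seen.add("_".join(words))
    return sorted(seen)

rows = [
    "It cut me to\t2002\t4\t4\t4",
    "It cut me to\t2004\t4\t4\t4",
    "It declares that '\t1816\t1\t1\t1",
]
print(unique_4grams(rows))  # ['it_cut_me_to']
```

Keeping only the phrase and throwing away the year and count columns is what shrinks each chunk so drastically.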

Regards
 

5jj

Sanmayce's post made me feel so old and out of touch with modern life. :-( Still, I didn't do too badly I suppose. I understood the first two words.

Oh, and the last one.
 

birdeen's call

VIP Member
Joined
Jul 15, 2010
Member Type
Student or Learner
Native Language
Polish
Home Country
Poland
Current Location
Poland
5jj said:
Sanmayce's post made me feel so old and out of touch with modern life. :-( Still, I didn't do too badly I suppose. I understood the first two words.

Oh, and the last one.

This might help you understand more (it did in my case).
 

Sanmayce

Don't feel that way, fivejedjon; the human touch/vision is far, far superior to ANY machine, at least I believe this 100%. Computers already beat/humiliate humans in terms of info processing(just ask who/what is world chess champion), BUT here enters soul... and everything turns into mystery, i.e. non-defined-yet.

Consider this text fragment(an excerpt from a movie's subtitles):

Code:
D:\_KA45F~1\_4>type "[2003] When the Last Sword Is Drawn 7.7@imdb CD2.srt.EXCERPT"
...
497
01:02:27,956 --> 01:02:35,089
Morioka, in Nanbu.
It's pretty as a picture!
498
01:02:35,196 --> 01:02:38,723
There's nowhere like it in all Japan!
499
01:02:39,834 --> 01:02:43,827
The Morioka cherry blossom
splits through rock to bloom.
500
01:02:44,506 --> 01:02:48,875
The Morioka magnolia blooms
even facing north.
501
01:02:49,911 --> 01:02:54,848
So I want you to run ahead
of the times.
502
01:02:55,950 --> 01:03:00,046
Go wild. Bloom.

The idea is to get(with the help of some software) all 4-grams(a 4-gram is a sequence of 4 consecutive words, a.k.a. a collocation) for the given text:

Code:
D:\_KA45F~1\_4>type test3.wrd
ahead_of_the_times
blooms_even_facing_north
blossom_splits_through_rock
cherry_blossom_splits_through
i_want_you_to
it_in_all_japan
it_s_pretty_as
like_it_in_all
magnolia_blooms_even_facing
morioka_cherry_blossom_splits
morioka_magnolia_blooms_even
nowhere_like_it_in
pretty_as_a_picture
run_ahead_of_the
s_nowhere_like_it
s_pretty_as_a
so_i_want_you
splits_through_rock_to
the_morioka_cherry_blossom
the_morioka_magnolia_blooms
there_s_nowhere_like
through_rock_to_bloom
to_run_ahead_of
want_you_to_run
you_to_run_ahead
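
How such a list can be produced is sketched below in Python. The splitting rules are an assumption inferred from the output above (phrases do not run across punctuation or the subtitle numbers, an apostrophe splits a word in two, everything is lowercased); the real Leprechaun rules may differ, and `four_grams` is a hypothetical name.

```python
import re

# Assumption: anything other than letters, apostrophes and whitespace
# (punctuation, digits, arrows) ends a phrase.
BREAK = re.compile(r"[^A-Za-z'\s]+")
WORD = re.compile(r"[A-Za-z]+")

def four_grams(text, n=4):
    """Return the sorted, distinct n-grams found inside each phrase of text."""
    grams = set()
    for segment in BREAK.split(text):
        words = [w.lower() for w in WORD.findall(segment)]
        # Slide a window of n words over the segment.
        for i in range(len(words) - n + 1):
            grams.add("_".join(words[i:i + n]))
    return sorted(grams)

text = """497
01:02:27,956 --> 01:02:35,089
Morioka, in Nanbu.
It's pretty as a picture!
"""
print("\n".join(four_grams(text)))
```

For this one subtitle the sketch prints it_s_pretty_as, pretty_as_a_picture and s_pretty_as_a, which are the three entries the list above has for that sentence.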

Computers(in particular tablets being the future HANDY personal assistants) will remain only assistants and nothing more even when AI(artificial intelligence) enters(hopefully) our lives; I mean the old school is not dying, just being enhanced.
 

birdeen's call

Sanmayce said:
(just ask who/what is world chess champion),

As far as I know there are separate titles for humans and for computers and there is no joint title. (We're too scared to give them a chance! ;-))
 

Sanmayce

birdeen's call said:
(We're too scared to give them a chance! ;-))

Ha-ha you are right!
I have been watching this humiliation since Garry's first battles with IBM's Deep Blue, and then with Deeper Blue and other super-chess-computers.

I have a very high opinion of Kasparov, but he told us(in 1997-) that a machine cannot "see" the game, a statement which I knew back then was WRONG. The computer can be taught to develop tactics(mini-strategy) which, by scaling up, turn into deep-deep strategy; this has nothing to do with the power of humans, namely soul or creativity, as in his case/interview.
 

Sanmayce

'Graphein' a 4-gram-Phrase-Checker, revision 1-

GOALS:
- To offer 100% free open-source copyleft software(32bit Windows console tools written in C);
- To enrich(beautify as kids would say) the ability to make phrase reports/analyses of user-chosen English texts in order to estimate the appropriateness of 4-gram phrases/collocations;
- Targeted users are mostly people(this includes kids, learners and native English users as well) wanting to explore English collocations by immersing themselves in 100+ million google-4-grams;
- To allow an in-depth phrase-search independently from third-parties(and eventually second-parties free too).

First drawback: the package must/will be as simple as possible with regard to usage. To be done. The whole process of making reports must take two steps:
- Copying all needed text files(folders also) into our working directory;
- Running a single batch file.
Second drawback: still not downloadable.
Third drawback: 'Graphein' developer being an amateur.
Fourth drawback: currently 'Graphein' is awfully slow(analyzing 'The Little Match Girl' took 02:27:00 hours or about 400x(23seconds per file)).
Fifth drawback: something is rotten there(with googlebooks-eng-us-all-4gram-20090715 files)! I am disappointed with the unexpectedly high number of unknown(Unfamiliar!) 4-grams:
- 'The Little Match Girl', analyzed with 'Graphein' r.1, gives Total/Found/Unfamiliar: 580/370/210 phrases.
- I cannot figure it out! How can phrases like:
wonderful_roast_goose_and Unfamiliar!
wonderful_smell_of_roast Unfamiliar!
wonderfully_the_fire_burned Unfamiliar!
would_surely_beat_her Unfamiliar!

not be part of US English Google books? Does anybody know what causes this frustrating misery?

My wish here is to present(in a hurry-mode) some aspects of a not-yet-completed free software package which is being designed for making user-phrases vs google-books-phrases reports.
Below I give a short step-by-step help/guide on how to use these 100% free 32bit console programs.

~ The whole process looks like:
[incoming text file(s)] -> phrase-checker-package -> [text file containing all phrases(described whether they have been encountered in google-books-phrases or not)]

~ Or as in the following example:
[...] -> phrase-checker-package -> [...]
[lille_pige_med_svovlstikkerne] -> phrase-checker-package -> [lille_pige_med_svovlstikkerne Unfamiliar!]
[med_svovlstikkerne_by_jean] -> phrase-checker-package -> [med_svovlstikkerne_by_jean Unfamiliar!]
[more_beautiful_than_the] -> phrase-checker-package -> [more_beautiful_than_the Found!]
[rattled_by_terribly_fast] -> phrase-checker-package -> [rattled_by_terribly_fast Found!]
[reached_both_her_hands] -> phrase-checker-package -> [reached_both_her_hands Found!]
[really_seemed_to_the] -> phrase-checker-package -> [really_seemed_to_the Found!]
[those_in_the_printshops] -> phrase-checker-package -> [those_in_the_printshops Unfamiliar!]
[...] -> phrase-checker-package -> [...]

~ I intend the final report(tabulated) to look like this:
...
lille_pige_med_svovlstikkerne \t Unfamiliar!
...
med_svovlstikkerne_by_jean \t Unfamiliar! 3rd-bigram-OK!
more_beautiful_than_the \t Found!
...
rattled_by_terribly_fast \t Found!
reached_both_her_hands \t Found!
really_seemed_to_the \t Found!
...
those_in_the_printshops \t Unfamiliar! 1st-bigram-OK! 2nd-bigram-OK!
...
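
One possible way to produce such report lines, sketched in Python. This is an illustration only, not Graphein's code: `report_line` and `bigrams_of` are hypothetical names, the reference sets are toy data, and harvesting the "known" bigrams from the known 4-grams is an assumption about what "bigram-OK!" should mean.

```python
def bigrams_of(phrase):
    """Split 'a_b_c_d' into its overlapping bigrams: a_b, b_c, c_d."""
    words = phrase.split("_")
    return ["_".join(words[i:i + 2]) for i in range(len(words) - 1)]

def report_line(phrase, known_4grams, known_bigrams):
    """Build one tab-separated report line for a 4-gram."""
    if phrase in known_4grams:
        return phrase + "\tFound!"
    status = ["Unfamiliar!"]
    # For an unfamiliar phrase, say which of its three bigrams are known.
    for label, bigram in zip(("1st", "2nd", "3rd"), bigrams_of(phrase)):
        if bigram in known_bigrams:
            status.append(label + "-bigram-OK!")
    return phrase + "\t" + " ".join(status)

# Toy reference data; "those_in_the_world" is a made-up entry for illustration.
known_4grams = {"more_beautiful_than_the", "those_in_the_world"}
# Assumption: known bigrams are simply harvested from the known 4-grams.
known_bigrams = {bg for gram in known_4grams for bg in bigrams_of(gram)}

for phrase in ("more_beautiful_than_the", "those_in_the_printshops"):
    print(report_line(phrase, known_4grams, known_bigrams))
```

With this toy data the second phrase comes out as Unfamiliar! with the 1st and 2nd bigram marked OK, matching the intended layout.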

~ Discussion note:
I would appreciate any suggestion(s) regarding simplifying the usage of the whole package.
After all I develop(amateurishly) this package especially for kids having PCs.
I want to write a PDF file with simple step-by-step instructions but I need some feedback(where difficulties are pointed out) in order to simplify the package enough, thus making it usable even by computer dummies/beginners.
Making the package downloadable remains somewhat of a problem, since package revision 1 is about 756MB whereas my site's poor bandwidth is already heavily loaded.

Reference #1:
PDF: 'The_Little_Match_Girl'_analyzed_by_'Graphein'
Reference #2:
PDF: Getting_started_using_'Graphein'_phrase-package

Enjoy!
 

birdeen's call

Some thoughts after skimming your PDFs.

1) We put spaces before opening brackets:

develop (amateurishly) - correct
develop(amateurishly) - incorrect

2) If you want your program to be used by "computer dummies", you should probably make it run in a separate window. Many computer users have never seen a TUI.

3) If you want your program to be widely used, you certainly need bandwidth. You could try to find other enthusiasts who would be willing to cooperate.

4) You should probably tell people what 4-grams are and why they want to find them.
 

Sanmayce

Thanks birdeen's call,

Regarding 1):
My style of writing is both brutal and nearly-incorrect due to the simple fact that I have never learned English, I have had only self-approaches, that is I learn it on the fly.
It is useful for me when my mistakes are shown, thanks.

Regarding 2):
Yes, it is true, I will make some effort to make the package usable from the desktop and not force users to go to the command prompt by themselves.

Regarding 3):
Openness is one of my strong qualities, but I intend to use, as always, my own resources. Currently the package is not worth uploading; it is 1 day old - I started it yesterday.

Regarding 4):
That is right, I did feel the gap; it is to be explained for sure, but I wanted to give some overview - it is hard for me to explain the idea (moreover the specifications) while the need is unclear/not-explained. I firmly believe that the ability to make proper word arrangements is the hard core of English, especially for guys like me who don't want to read study-books but prefer learning by reading texts on a daily basis.

Regards

 

birdeen's call

Sanmayce said:
My style of writing is both brutal and nearly-incorrect due to the simple fact that I have never learned English, I have had only self-approaches, that is I learn it on the fly.

Your English is very good!
 

Sanmayce

Enter the 'Graphein' package revision 1

Mini-guide in HTML format, 18 screenshots: here
Mini-guide in PDF format, 4MB: here
The package itself in ZIP format, 757MB: here

Two regimes/modes of operation are available:
- Fully-automatic mode: by running 'Graphein_TXT.bat' you can compare your text files (with .TXT extension, placed in the _TXT-TREE folder) against the US English Google books 4grams from 2009-07-15;
- Semi-automatic mode (keyboard input is needed): by running 'GRAPHEIN_keyboard.bat' you can search for your patterns (4grams) in the US English Google books 4grams from 2009-07-15.

Pluses:
- Desktop launching;
- In this initial revision the console tools can be executed either from command prompt or desktop via 2 icons;
- Open-source.
Minuses:
- In fully-automatic mode, files with more than a few thousand words are processed awfully slowly;
- Unfortunately the US English Google books 4grams from 2009-07-15 made me lose my momentum: the corpus is not (yet) as rich as I expected;
- This revision gives the outline (it is mainly illustrative); it has no future unless some serious approach (mixing Graffith & Leprechaun is a nifty one) is applied;
- It took me 3 days to realize the [f]utility of the current (100% brute-force) way of comparing. Still, processing chunks is THE one/main way to go because of 32bit address limitations - I dream of 5x100+ million phrases inserted in a million b-trees, which demands 64bit code; sadly I am not ready to walk this way yet.
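
The chunk-by-chunk idea from the last point can be sketched as follows. This is a hedged Python illustration, not how Graphein works internally: it swaps in a plain hash set per chunk (rather than the b-trees mentioned above), and `check_in_chunks` plus the tiny chunk files are made up for the demo. Only one chunk is held in memory at a time, which is the property that matters under a 32bit address limit.

```python
import os
import tempfile

def check_in_chunks(phrases, chunk_paths):
    """Look phrases up one chunk file at a time; returns (found, pending)."""
    pending = set(phrases)
    found = set()
    for path in chunk_paths:
        if not pending:  # everything already located, skip remaining chunks
            break
        with open(path, encoding="utf-8") as handle:
            chunk = {line.strip() for line in handle}  # hash set of this chunk
        hits = pending & chunk
        found |= hits
        pending -= hits
    return found, pending

# Toy demo: two tiny "chunks" written to a temporary folder.
folder = tempfile.mkdtemp()
paths = []
for i, grams in enumerate([["more_beautiful_than_the"], ["reached_both_her_hands"]]):
    path = os.path.join(folder, f"chunk{i}.wrd")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(grams) + "\n")
    paths.append(path)

found, pending = check_in_chunks(
    ["more_beautiful_than_the", "those_in_the_printshops"], paths)
print(sorted(found), sorted(pending))
```

Whatever is still in `pending` after the last chunk is what the report would label Unfamiliar!.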

Anyway enjoy!

P.S.
If anyone finds it useful please feel free to make mirrors at once; I cannot guarantee hosting the .ZIP file for more than a month or so.
 

Sanmayce

Maybe the easiest way for an idea to be understood is to state its ultimate goal, in an 'I have a dream ...' style.

In my case: I dream of an instant English-phrase sidekick (suggester/hinter) tool.
The goal is for everybody to have the ability to write, in real time, as correctly as possible.
Nowadays, when text is typed into a search-engine field, some pale text/suggestions appear; that is what I am talking about.

Imagine this: A kid playing with words and wanting to construct a sentence by using two previously chosen words but not knowing any grammar.
The situation is similar to forcing an inexperienced person to ride a bicycle without riding a tricycle first.

The point is that even when one is extensively trained, it is never enough - the tricycle remains forever as a reminder-of-everlasting-ignorance.
Or as one of our renowned translators of Jack London has said: 'The difference between the better translator and the good one is in using a dictionary.' - contrary to what one would expect, the former uses it whereas the latter is too 'versed'/(vanity sick) and needs it not. Well said.

Of course for a corporation or skillful programmers it is not a big deal to achieve such functionality.
The problem lies not so much in applied algorithms or used programming language but in the scarcity of data which feeds the suggester.

Currently only Google (I am not a fan of theirs) has shared its ngram datasets, made, as far as I know, from 4% of all printed books, or 5+ million books.
In my view these huge numbers weigh little, because practice shows a (still) poor corpus, far from what is needed; to talk of comprehensiveness here is nonsense (look at the 210 unfamiliar 4grams encountered in 'The Little Match Girl').
So my dream needs a NEARLY-UNABRIDGED corpus of English (British, American, Australian ...) 4grams (as a minimum), preferably 9grams (enabling whole sentences to appear under your fingers).
I know, I know, it looks/sounds like a long shot, but such greediness fits well here; after all, I speak of billions of phrases: no room for profanity, floppy thinking or half-done encoding.
In my vocabulary 'greedy' and 'uncompromising' are synonyms and I use them interchangeably.

In the next paragraphs I try to visualize (poorly) one static suggestion for the word 'getting' by giving all/PROPER three-word collocations on its left and right sides.
To get the suggestions I ran Graphein in search of:
*_getting|
and also of:
getting_*
The first search yields 17,425 hits, while the second yields 18,773 hits.

The results for '*_getting|':
a_baby_is_getting
a_baby_without_getting
a_bad_time_getting
a_balance_between_getting
...
a_habit_of_getting
a_half_in_getting
a_hand_at_getting
a_hand_in_getting
a_handicap_in_getting
a_hard_job_getting
a_hard_time_getting
a_harder_time_getting
...
as_things_were_getting
as_time_was_getting
as_to_avoid_getting
as_to_ensure_getting
as_to_his_getting
as_to_my_getting
as_to_my_getting
as_to_our_getting
as_to_the_getting
...
assist_clients_in_getting
assist_her_in_getting
assist_him_in_getting
...
be_accustomed_to_getting
be_achieved_by_getting
be_active_in_getting
be_adopted_for_getting
be_afraid_of_getting
be_aided_in_getting
be_aimed_at_getting
...
expressed_concern_about_getting
expressed_interest_in_getting
extraordinary_power_of_getting
extreme_difficulty_in_getting
extreme_difficulty_of_getting
extreme_passion_for_getting
extremely_desirous_of_getting
extremely_effective_in_getting
extremely_fortunate_in_getting
extremely_helpful_in_getting
extremely_important_in_getting
extremely_interested_in_getting
extremely_useful_in_getting
...
within_despair_of_getting
within_minutes_of_getting
...
your_work_is_getting
yourself_began_by_getting
yourself_for_not_getting
zero_chance_of_getting


The results for 'getting_*':
...
getting_a_better_chance
getting_a_better_class
getting_a_better_deal
getting_a_better_education
getting_a_better_feel
getting_a_better_grade
getting_a_better_grasp
getting_a_better_grip
...
getting_a_hand_on
getting_a_handful_of
getting_a_handle_on
getting_a_handle_on
getting_a_handle_on
getting_a_handle_on
getting_a_hard_on
getting_a_hard_time
...
getting_along_very_fast
getting_along_very_nicely
getting_along_very_slowly
getting_along_very_well
...
getting_as_much_fun
getting_as_much_good
getting_as_much_help
getting_as_much_in
getting_as_much_information
...
getting_started_on_an
getting_started_on_her
getting_started_on_his
getting_started_on_it
getting_started_on_its
getting_started_on_my
getting_started_on_our
...
getting_yourself_ready_for
getting_yourself_ready_to
getting_yourself_talked_about
getting_yourself_to_the
getting_yourself_very_wet
getting_yourself_worked_up
getting_youth_by_years
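
The two searches, `*_getting|` and `getting_*`, amount to suffix and prefix matching on the underscore-joined phrases. A minimal Python sketch, assuming that `*` stands for "any words" and that the trailing `|` marks the end of the phrase (both are guesses about Graphein's pattern syntax); `search` is a hypothetical helper, and the sample phrases are taken from the lists above:

```python
def search(pattern, grams):
    """Match a pattern with a leading or trailing '*' against 4-grams."""
    if pattern.startswith("*_"):
        tail = pattern[1:].rstrip("|")   # "*_getting|" -> "_getting"
        return [g for g in grams if g.endswith(tail)]
    if pattern.endswith("_*"):
        head = pattern[:-1]              # "getting_*" -> "getting_"
        return [g for g in grams if g.startswith(head)]
    return [g for g in grams if g == pattern]

grams = [
    "a_hard_time_getting",
    "getting_a_handle_on",
    "getting_started_on_it",
    "zero_chance_of_getting",
]
print(search("*_getting|", grams))
print(search("getting_*", grams))
```

Keeping the underscore in the matched fragment is what stops a word such as "forgetting" from counting as a hit for "getting".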


Having all these suggestions is not enough; here comes the last but not least part: visualization.
Well, I cannot show anything, but the intended outcome is something like:
Step 1: you write 'getting'
two flowing pop-up windows appear, each filled with the above suggestions.
Step 2: you write 'in getting'
the useful part here is that the entered 2gram is to be searched for not in 4grams but in, say, 5grams (3 words from both sides, again) and so on until a suitable sentence (or 9gram) appears.
For example one suggestion is: 'helpful in getting information'.
Step 3: you have obtained a stable/proper 4gram containing the needed words from left and right, and you finish the sentence by yourself.
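
The lookup behind these steps could look roughly like this in Python. It is only a sketch of the idea, with no claim about how a real suggester would be built: `suggest` is a hypothetical function that keeps the phrases in which the typed words appear with at least one word of context on both sides.

```python
def suggest(typed, grams):
    """Return phrases that contain the typed words with context on both sides."""
    # Underscores on both ends force a neighbour word on the left and the right.
    key = "_" + "_".join(typed.lower().split()) + "_"
    return [g.replace("_", " ") for g in grams if key in g]

grams = [
    "extremely_helpful_in_getting",    # context on the left only
    "getting_a_handle_on",             # context on the right only
    "helpful_in_getting_information",  # context on both sides
]
print(suggest("in getting", grams))  # ['helpful in getting information']
```

With longer n-grams (5grams, 9grams) as input the same test would return longer stretches of ready-made sentence around the typed words.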

The magical part is that (even if you are not sure what you are looking for) the sheer abundance of phrases will guide/supply/form your thought dynamically, i.e. writing and thinking will happen (hopefully) simultaneously.
I recall a funny scene from the movie 'Captain Apache' in which Lee Van Cleef is asked 'What are you searching for?' and his answer is something like: 'Nothing. It is amazing how many things pop up when you just search.'

Let there be greediness ...
The chosen style would be try-once-give-me-more, i.e. addictive/indispensable, as in the next example:
getting_a_handle_on [Google books]
getting_a_handle_on [Australian magazines & newspapers]
getting_a_handle_on [Agatha Christie - anthology]
getting_a_handle_on [Arthur Conan Doyle - 'Sherlock Holmes' collection]
getting_a_handle_on [Irish legends]
getting_a_handle_on [Fairy tales translated into English]

Obviously relying on one corpus alone is not remotely as useful as using hundreds-why-not-thousands of corpora.
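
Tagging each hit with the corpus it came from needs little more than one set of phrases per corpus. A tiny Python sketch with toy data: the corpus names are borrowed from the example above, but which phrases each set contains is invented for illustration, and `sources` is a hypothetical name.

```python
def sources(phrase, corpora):
    """List the phrase once per corpus that contains it, tagged with the corpus name."""
    return [f"{phrase} [{name}]" for name, grams in corpora.items() if phrase in grams]

# Toy corpora: the contents are made up, only the labels come from the post.
corpora = {
    "Google books": {"getting_a_handle_on", "getting_a_better_deal"},
    "Irish legends": {"getting_along_very_well"},
    "Fairy tales translated into English": {"getting_a_handle_on"},
}
for line in sources("getting_a_handle_on", corpora):
    print(line)
```

Adding another corpus then means adding another entry to the dictionary, with no change to the lookup itself.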

It must feature smooth scrolling and an animation-like appearance, but this is up to graphical-interface designers and gadget manufacturers.
Yes, the futuristic way of writing is neither left-to-right nor right-to-left anymore but from middle-to-edges, he-he.

I believe/KNOW the time (for such an assistant) is nearer than most of us think.
 