NLTK provides convenient ways to access several of these corpora, and has data packages containing corpora and corpus samples, freely downloadable for use in teaching and research. For information about downloading them, see the NLTK website.

Cumulative Word Length Distributions: Six translations of the Universal Declaration of Human Rights are processed; the graph shows that words of five or fewer letters account for about 80% of Ibibio text, 60% of German text, and 25% of Inuktitut text.
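The quantity plotted in a cumulative word length distribution can be sketched in a few lines of plain Python. This is a minimal illustration, not the code behind the figure: the function name `cumulative_length_share` and the toy sentence are assumptions made for this example. With NLTK and its corpus data installed, a real word list (for instance one returned by `nltk.corpus.udhr.words(fileid)`) could be passed in instead.

```python
from collections import Counter


def cumulative_length_share(words, max_len):
    """Return the fraction of words whose length is at most max_len."""
    # Count how many words there are of each length.
    lengths = Counter(len(w) for w in words)
    total = sum(lengths.values())
    if total == 0:
        return 0.0
    # Add up the counts for every length up to and including max_len.
    short = sum(count for length, count in lengths.items() if length <= max_len)
    return short / total


# Hypothetical toy sample, not taken from any NLTK corpus.
sample = "all human beings are born free and equal in dignity and rights".split()
share = cumulative_length_share(sample, 5)
print(f"{share:.2f}")  # 9 of the 12 words have 5 or fewer letters: 0.75
```

Evaluating the function for each `max_len` from 1 upward gives one curve of the kind shown in the graph; repeating this for each language gives one curve per translation.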
Unfortunately, for many languages, substantial corpora are not yet available.
We will wait until later before exploring each Python construct systematically.
Don't worry if you see an example that contains something unfamiliar; simply try it out and see what it does, and — if you're game — modify it by substituting some part of the code with a different text or word.
These are presented systematically in 2, where we also unpick the following code line by line.
For the moment, you can ignore the details and just concentrate on the output.
NLTK's small collection of web text includes content from a Firefox discussion forum, conversations overheard in New York, and a movie script. There is also a corpus of instant messaging chat sessions, originally collected by the Naval Postgraduate School for research on automatic detection of Internet predators. The corpus contains over 10,000 posts, anonymized by replacing usernames with generic names of the form "User NNN", and manually edited to remove any other identifying information. Each filename encodes the date, chatroom, and number of posts.

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University.

In 1, we looked at the Inaugural Address Corpus, but treated it as a single text. The graph in fig-inaugural used "word offset" as one of the axes; this is the numerical index of the word in the corpus, counting from the first word of the first address.
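The idea of a word offset can be made concrete with a short sketch. It assumes each address is already available as a list of words and that offsets are counted from zero. The helper name `word_offsets` and the two tiny "addresses" are hypothetical, invented for illustration rather than drawn from the corpus.

```python
def word_offsets(texts, target):
    """Yield the offsets of target in the concatenation of texts.

    Offsets are zero-based and run continuously across all texts, so the
    first word of the second text follows the last word of the first.
    """
    offset = 0
    for words in texts:
        for word in words:
            # Compare case-insensitively so "Citizens" matches "citizens".
            if word.lower() == target:
                yield offset
            offset += 1


# Hypothetical stand-ins for two tokenized addresses.
addresses = [
    ["Fellow", "citizens", "of", "the", "Senate"],
    ["The", "citizens", "of", "America"],
]
offsets = list(word_offsets(addresses, "citizens"))
print(offsets)  # [1, 6]
```

Treating the corpus as a single text amounts to this concatenation; keeping the per-address lists separate instead makes it possible to count occurrences address by address.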