My development logbook

NLTK (2)


Chapter 2

  • Accessing Text Corpora, such as Brown and Reuters
  • Conditional Frequency Distributions (nltk.ConditionalFreqDist)
  • Generating Random Text with Bigrams
  • Lexical Resources


  • A text corpus is a large, structured collection of texts. NLTK comes with many corpora, e.g., the Brown Corpus, nltk.corpus.brown.
  • Some text corpora are categorized, e.g., by genre or topic; sometimes the categories of a corpus overlap each other.
  • A conditional frequency distribution is a collection of frequency distributions, each one for a different condition. They can be used for counting word frequencies, given a context or a genre.
  • Python programs more than a few lines long should be entered using a text editor, saved to a file with a .py extension, and accessed using an import statement.
  • Python functions permit you to associate a name with a particular block of code, and re-use that code as often as necessary.
  • Some functions, known as “methods”, are associated with an object and we give the object name followed by a period followed by the function, like this: x.funct(y), e.g., word.isalpha().
  • WordNet is a semantically-oriented dictionary of English, consisting of synonym sets — or synsets — and organized into a network.
  • Some functions are not available by default, but must be accessed using Python’s import statement.
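
To make the idea of a conditional frequency distribution concrete, here is a minimal plain-Python sketch of what such a structure does: one frequency distribution per condition. It is not NLTK's implementation of nltk.ConditionalFreqDist; the helper names (conditional_freq_dist, generate) and the toy data are made up for illustration. The same structure, fed with bigrams, gives a simple way to generate text by repeatedly picking the most frequent follower of the current word.

```python
from collections import Counter, defaultdict


def conditional_freq_dist(pairs):
    """Map each condition to a Counter of the samples seen under it.

    Hypothetical helper: a stand-in for the idea behind
    nltk.ConditionalFreqDist, not its actual implementation.
    """
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd


# Toy (genre, word) pairs, invented for this example.
tagged_words = [
    ("news", "the"), ("news", "market"), ("news", "the"),
    ("romance", "the"), ("romance", "heart"), ("romance", "heart"),
]
genre_cfd = conditional_freq_dist(tagged_words)

# Bigrams: the condition is a word, the sample is the word that follows it.
words = "the cat sat on the mat and the cat slept".split()
bigram_cfd = conditional_freq_dist(zip(words, words[1:]))


def generate(cfd, start, length=5):
    """Follow the most frequent next word, starting from `start`."""
    output = [start]
    word = start
    for _ in range(length - 1):
        followers = cfd.get(word)  # .get() avoids creating empty entries
        if not followers:
            break  # no word ever followed this one in the data
        word = followers.most_common(1)[0][0]
        output.append(word)
    return output


print(genre_cfd["news"]["the"])
print(generate(bigram_cfd, "on", 3))
```

Always choosing the most frequent follower makes the output deterministic and prone to loops on real text; sampling a follower at random, weighted by its count, would be one way to get more varied output.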