PyconAU 2012: Natural language processing


Notes from a talk at Pycon AU 2012. Posted by Thomas Sutton on August 18, 2012

“Human as a Second Language: Succeeding (and failing) with the Natural Language Toolkit”

Natural Language Toolkit

General ho-hum generalisations about overly logical, Spock-ish stereotypes of “programmers”: abstraction, gender, a disincentive to creative natural language, etc.

NLTK is, like most toolkits, a bunch of tools and resources; it bridges the gap between science and art (linguistics, presumably).

Language Features 101

  • Stopwords

    The common but semantically unimportant words. Generally remove stopwords when doing statistical tasks.

  • Parts of speech

    High-school grammar: nouns, adjectives/adverbs, verbs. N, ADJ, ADV, V.

    Also: a bunch more.

  • Stemming

    Reduce words to their stem, so you can unify various forms; generally for statistical techniques.

  • Lemmatization

Similar to stemming, but results in a real word. (All four of these features are sketched in code after this list.)
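A minimal sketch of the four features above using NLTK. The example sentence is mine, not from the talk, and it assumes the relevant data packages (stopwords, punkt, the tagger, and wordnet) have been fetched via `nltk.download`:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = nltk.word_tokenize("The cats were running quickly through the gardens")

# Stopwords: drop common but semantically unimportant words.
stops = set(stopwords.words('english'))
content = [t for t in tokens if t.lower() not in stops]

# Parts of speech: tag each token (NNS = plural noun, VBG = gerund, ...).
tagged = nltk.pos_tag(tokens)

# Stemming: crude suffix-stripping; the result need not be a real word.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in content]   # 'running' -> 'run', 'quickly' -> 'quickli'

# Lemmatization: like stemming, but always yields a dictionary word.
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in content]   # 'cats' -> 'cat'
```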

NLP Concepts

  • Training data

Corpora for English stopwords, boys’ names, girls’ names, and part-of-speech tagging.

WordNet, a linked dictionary.

  • Tokenisation

Split a document into individual parts; the particular kind of “part” varies with the task (words, sentences, etc.).

    Many different tokenisation algorithms exist for different situations (see the sketch after this list).
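Something like the following shows both ideas. It assumes the names, wordnet, and punkt data packages are installed; the example text is mine:

```python
import nltk
from nltk.corpus import names, wordnet

# Bundled training data: lists of boys' and girls' names.
boys = names.words('male.txt')
girls = names.words('female.txt')

# WordNet: a linked dictionary; each synset groups related word senses.
print(wordnet.synsets('bank')[:3])

# Tokenisation: different splitters for different kinds of "part".
text = "Dr. Smith arrived. He was late."
print(nltk.sent_tokenize(text))   # two sentences, despite the '.' in 'Dr.'
print(nltk.word_tokenize(text))   # individual words plus punctuation
```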

Applications

Sentiment analysis and opinion mining. Targeted advertising.

Establish patterns in the language someone uses to make guesses about the speaker: gender, age, etc.
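The talk didn’t show code here, but the classic NLTK illustration of this idea is the name-gender classifier from the NLTK book, built on the names corpus mentioned above. A sketch, with the book’s deliberately crude last-letter feature and an arbitrary train/test split:

```python
import random
import nltk
from nltk.corpus import names

# Label every name in the bundled corpora with a gender, then shuffle.
labeled = ([(n, 'male') for n in names.words('male.txt')] +
           [(n, 'female') for n in names.words('female.txt')])
random.shuffle(labeled)

# A deliberately simple feature: the final letter of the name.
def gender_features(name):
    return {'last_letter': name[-1].lower()}

train = [(gender_features(n), g) for n, g in labeled[500:]]
test = [(gender_features(n), g) for n, g in labeled[:500]]

classifier = nltk.NaiveBayesClassifier.train(train)
print(nltk.classify.accuracy(classifier, test))   # typically around 0.75
print(classifier.classify(gender_features('Trinity')))
```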

Integration with BeautifulSoup, presumably to strip markup from HTML pages before processing the text. Not sure why you’d bother.
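If I had to guess, the integration looks something like this: use BeautifulSoup to throw away the markup, then hand the plain text to NLTK (the HTML here is made up):

```python
import nltk
from bs4 import BeautifulSoup

html = "<html><body><h1>News</h1><p>Shares <b>rose</b> sharply today.</p></body></html>"
text = BeautifulSoup(html, 'html.parser').get_text(separator=' ')
print(nltk.word_tokenize(text))   # markup gone, just the words
```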

Chatbots: @PatrickAndElly use a Twitter interface (Python Twitter Tools):

  1. Tokenise words.
  2. PoS tag.
  3. Simplify the tags (because too much grammar is too much); see the sketch after this list.
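A sketch of that three-step pipeline, with NLTK’s coarse “universal” tagset standing in for whatever simplification the talk actually used (it needs the punkt, tagger, and universal_tagset data packages):

```python
import nltk
from nltk.tag import map_tag

def tag_tweet(tweet):
    tokens = nltk.word_tokenize(tweet)                    # 1. tokenise words
    tagged = nltk.pos_tag(tokens)                         # 2. PoS tag (Penn Treebank tags)
    return [(word, map_tag('en-ptb', 'universal', tag))   # 3. collapse to simple tags
            for word, tag in tagged]

print(tag_tweet("Tell me a story about robots"))
# e.g. [('Tell', 'VERB'), ('me', 'PRON'), ('a', 'DET'), ('story', 'NOUN'), ...]
```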
