PyconAU 2012: Natural language processing

“Human as a Second Language: Succeeding (and failing) with the Natural Language Toolkit”

General ho-hum generalisations of overly logical, Spock-ish stereotypes of “programmers”. Abstraction, gender, disincentive to creative natural langauge, etc.

NLTK is, like most toolkits, a bunch of tools and resources; bridges the gap between science and art (linguistic, presumably).

Language Features 101

Stopwords

The common but semantically unimportant words. Generally remove stopwords when doing statistical tasks.
Parts of speech

High-school grammar: nouns, adjectives/adverbs, verbs. N, ADJ, ADV, V.

Also: a bunch more.
Stemming

Reduce words to their stem, so you can unify various forms; generally for statistical techniques.
Lemmatization

Similar to stemming, but results in a real word.

NLP Concepts

Training data

Copora for English language words (stopwords), Boys’ names, Girls’ names, tagging part of speech.

Wordnet linked dictionary.
Tokenisation

Split a document into individual parts. The particular type of “part” will vary depending on the task (words, sentences, etc.)

Many different tokenisation algorithms for different situations.

Applications

Sentiment analysis and opinion mining. Targeting advertising.

Establish patterns in language used to make guesses about the person talking: gender, age, etc.

Integration with BeautifulSoup for something to do with HTML? Not sure why you’d bother.

Chatbots: @PatrickAndElly use Twitter interface (Python Twitter Tools):

Tokenise words.
PoS tag.
Simply tagging (because too much grammar is too much).

This post was published on August 18, 2012 and last modified on January 26, 2024. It is tagged with: event, PyconAU 2012, natural language, nltk, python.