“Human as a Second Language: Succeeding (and failing) with the Natural Language Toolkit”
Opened with ho-hum generalisations about overly logical, Spock-ish stereotypes of “programmers”: abstraction, gender, disincentive to creative natural language, etc.
NLTK is, like most toolkits, a bunch of tools and resources; it bridges the gap between science and art (linguistic, presumably).
Language Features 101
Stopwords
The common but semantically unimportant words. Generally remove stopwords when doing statistical tasks.
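A minimal sketch of stopword removal before a word count. The stopword set here is a tiny hand-picked assumption for illustration (NLTK ships a proper stopword corpus), and `content_words` is a hypothetical helper name.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one is much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "it", "on"}

def content_words(text):
    """Lowercase the text, pull out the words, and drop the stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]

# Count only the words that carry meaning.
counts = Counter(content_words("The cat sat on the mat and the dog sat in the hall."))
print(counts.most_common(2))
```

Without the filter, "the" would dominate the counts and say nothing about the text.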
Parts of speech
High-school grammar: nouns, adjectives/adverbs, verbs. N, ADJ, ADV, V.
Also: a bunch more.
Stemming
Reduce words to their stem, so you can unify various forms; generally for statistical techniques.
Lemmatization
Similar to stemming, but results in a real word.
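To make the stemming/lemmatization difference concrete, here is a toy sketch. Neither function is NLTK's implementation: `toy_stem` is a crude suffix stripper and `toy_lemmatize` uses a hand-written lookup table, both assumptions made up for illustration. Real stemmers (e.g. Porter) use more careful rules, and real lemmatizers consult a dictionary such as WordNet.

```python
# Suffixes to strip, checked in order. A crude stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "ies", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical hand-written lemma table; a real lemmatizer uses a full dictionary.
LEMMAS = {"ponies": "pony", "running": "run", "better": "good", "was": "be"}

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return LEMMAS.get(word, word)

for w in ("ponies", "running", "better"):
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer turns "ponies" into "pon", which unifies forms well enough for counting but is not a word; the lemmatizer returns "pony".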
NLP Concepts
Training data
Corpora for English language words (stopwords), boys’ names, girls’ names, and part-of-speech tagging.
WordNet linked dictionary.
Tokenisation
Split a document into individual parts. The particular type of “part” will vary depending on the task (words, sentences, etc.).
Many different tokenisation algorithms for different situations.
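A rough sketch of two tokenisers for two kinds of “part”, using only regular expressions. These are naive assumptions for illustration, not NLTK's algorithms; `word_tokens` and `sentence_tokens` are hypothetical names.

```python
import re

def word_tokens(text):
    """Split into words (keeping contractions whole) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def sentence_tokens(text):
    """Naive split on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "Don't panic. Is this a test? Yes!"
print(word_tokens(text))
print(sentence_tokens(text))
```

The sentence splitter will wrongly break after abbreviations such as "Dr.", which is one reason different situations call for different tokenisation algorithms.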
Applications
Sentiment analysis and opinion mining. Targeting advertising.
Establish patterns in language used to make guesses about the person talking: gender, age, etc.
Integration with BeautifulSoup for something to do with HTML? Not sure why you’d bother.
Chatbots: @PatrickAndElly use a Twitter interface (Python Twitter Tools):
- Tokenise words.
- PoS tag.
- Simplify tagging (because too much grammar is too much).
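The three steps above could be sketched roughly as follows. This is not the bot's actual code: the mini-lexicon, the Penn Treebank-style tags, and the function names are all assumptions for illustration, and the Twitter side is left out entirely. A real tagger would be trained on a tagged corpus rather than reading a lookup table.

```python
import re

# Hypothetical mini-lexicon mapping words to fine-grained tags.
LEXICON = {
    "the": "DT", "dog": "NN", "dogs": "NNS", "cat": "NN",
    "chases": "VBZ", "chased": "VBD", "quickly": "RB", "lazy": "JJ",
}

# Collapse fine-grained tags to coarse classes by prefix.
SIMPLE = (("NN", "N"), ("VB", "V"), ("JJ", "ADJ"), ("RB", "ADV"))

def tokenise(text):
    """Step 1: lowercase and split into words."""
    return re.findall(r"[a-z']+", text.lower())

def tag(tokens):
    """Step 2: look up each token; unknown words default to noun."""
    return [(t, LEXICON.get(t, "NN")) for t in tokens]

def simplify(tagged):
    """Step 3: reduce each fine tag to N, V, ADJ, ADV or OTHER."""
    out = []
    for token, fine in tagged:
        coarse = next((c for prefix, c in SIMPLE if fine.startswith(prefix)), "OTHER")
        out.append((token, coarse))
    return out

print(simplify(tag(tokenise("The lazy dog quickly chased the cat"))))
```

With only a handful of coarse classes, the reply logic can match on patterns like ADJ followed by N without caring about tense or number.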