<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Passing Curiosity: Posts tagged natural language</title>
    <link href="https://passingcuriosity.com/tags/natural-language/natural-language.xml" rel="self" />
    <link href="https://passingcuriosity.com" />
    <id>https://passingcuriosity.com/tags/natural-language/natural-language.xml</id>
    <author>
        <name>Thomas Sutton</name>
        
        <email>me@thomas-sutton.id.au</email>
        
    </author>
    <updated>2012-08-18T00:00:00Z</updated>
    <entry>
    <title>PyconAU 2012: Natural language processing</title>
    <link href="https://passingcuriosity.com/2012/nltk/" />
    <id>https://passingcuriosity.com/2012/nltk/</id>
    <published>2012-08-18T00:00:00Z</published>
    <updated>2012-08-18T00:00:00Z</updated>
    <summary type="html"><![CDATA[<p>“Human as a Second Language: Succeeding (and failing) with the Natural
Language Toolkit”</p>
<p><a href="http://nltk.org/">Natural Language Toolkit</a></p>
<p>General ho-hum generalisations of overly logical, Spock-ish stereotypes of
“programmers”. Abstraction, gender, disincentive to creative natural language,
etc.</p>
<p>NLTK is, like most toolkits, a bunch of tools and resources; bridges the gap
between science and art (linguistic, presumably).</p>
<h3 id="language-features-101">Language Features 101</h3>
<ul>
<li><p>Stopwords</p>
<p>The common but semantically unimportant words. Generally remove stopwords
when doing statistical tasks.</p></li>
<li><p>Parts of speech</p>
<p>High-school grammar: nouns, adjectives/adverbs, verbs. N, ADJ, ADV, V.</p>
<p>Also: a bunch more.</p></li>
<li><p>Stemming</p>
<p>Reduce words to their stem, so you can unify various forms; generally for
statistical techniques.</p></li>
<li><p>Lemmatization</p>
<p>Similar to stemming, but results in a real word.</p></li>
</ul>
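<p>A minimal sketch of stopword removal and stemming, in plain Python rather
than NLTK itself. The stopword list, the suffix list and the names
remove_stopwords and naive_stem are all made up for illustration; real stemmers
(such as the Porter stemmer that NLTK ships) use far more careful rules.</p>

```python
# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "on"}

# Suffixes stripped by this toy stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def remove_stopwords(words):
    """Drop common words that carry little meaning."""
    return [w for w in words if w.lower() not in STOPWORDS]


def naive_stem(word):
    """Strip one common suffix, keeping at least three letters of stem."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[: -len(suffix)]
    return lower


words = "The cats are walking in the gardens".split()
content_words = remove_stopwords(words)   # stopwords dropped
stems = [naive_stem(w) for w in content_words]
print(content_words)
print(stems)
```

<p>Note that naive_stem("running") gives "runn", which is not a real word. That
is the difference from lemmatization, which would need a dictionary to map
"running" back to "run".</p>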
<h3 id="nlp-concepts">NLP Concepts</h3>
<ul>
<li><p>Training data</p>
<p>Corpora for English language words (stopwords), boys’ names, girls’ names,
part-of-speech tagging.</p>
<p>WordNet, a linked dictionary.</p></li>
<li><p>Tokenisation</p>
<p>Split a document into individual parts. The particular type of “part” will
vary depending on the task (words, sentences, etc.).</p>
<p>Many different tokenisation algorithms for different situations.</p></li>
</ul>
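<p>A rough sketch of tokenisation at two granularities, using only regular
expressions from the standard library. The function names sentence_tokenise and
word_tokenise are hypothetical, and the rules are deliberately crude: they
assume sentences end in ".", "!" or "?" and will split wrongly on abbreviations
such as "etc.", which is one reason different tokenisers exist for different
situations.</p>

```python
import re


def sentence_tokenise(text):
    """Split text into sentences ending in '.', '!' or '?'."""
    pieces = re.findall(r"[^.!?]+[.!?]*", text)
    return [p.strip() for p in pieces if p.strip()]


def word_tokenise(sentence):
    """Return runs of letters, digits and apostrophes, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9']+", sentence)


text = "NLTK is a toolkit. It has corpora! Does it tokenise?"
sentences = sentence_tokenise(text)
tokens = [word_tokenise(s) for s in sentences]
print(sentences)
print(tokens)
```

<p>The same document yields three sentence tokens or ten word tokens, depending
on which kind of “part” the task calls for.</p>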
<h3 id="applications">Applications</h3>
<p>Sentiment analysis and opinion mining. Targeting advertising.</p>
<p>Establish patterns in language used to make guesses about the person talking:
gender, age, etc.</p>
<p>Integration with BeautifulSoup for something to do with HTML? Not sure why
you’d bother.</p>
<p>Chatbots: <span class="citation" data-cites="PatrickAndElly">@PatrickAndElly</span> use a Twitter interface (Python Twitter Tools):</p>
<ol type="1">
<li>Tokenise words.</li>
<li>PoS tag.</li>
<li>Simplify tagging (because too much grammar is too much).</li>
<li></li>
</ol>]]></summary>
</entry>

</feed>
