the_information_nexus/tech_docs/python/NLTK.md
2024-05-01 12:28:44 -06:00

NLTK (Natural Language Toolkit) is a widely used Python library for natural language processing (NLP). It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. Here's a concise reference guide for common use cases with NLTK:

NLTK Reference Guide

Installation

pip install nltk

After installing, you may need to download specific data packages used in your project:

import nltk
nltk.download('popular')  # Downloads popular packages

Basic NLP Tasks

Tokenization

from nltk.tokenize import word_tokenize, sent_tokenize

text = "Hello there, how are you? Weather is great, and Python is awesome."
words = word_tokenize(text)
sentences = sent_tokenize(text)
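
word_tokenize keeps punctuation marks such as commas and question marks as separate tokens. If only word tokens are wanted, one option is NLTK's RegexpTokenizer. The following is a minimal sketch, assuming the pattern \w+ is an acceptable definition of a word for your text (it would split a contraction like "don't" at the apostrophe). The name word_tokens is illustrative.

```python
from nltk.tokenize import RegexpTokenizer

text = "Hello there, how are you? Weather is great, and Python is awesome."

# Keep only runs of letters, digits, and underscores; punctuation is dropped.
tokenizer = RegexpTokenizer(r"\w+")
word_tokens = tokenizer.tokenize(text)
print(word_tokens)
# ['Hello', 'there', 'how', 'are', 'you', 'Weather', 'is', 'great',
#  'and', 'Python', 'is', 'awesome']
```

Unlike word_tokenize, this tokenizer is purely pattern-based and does not rely on downloaded data packages.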

Removing Stopwords

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
# The stopword list is lowercase, so compare tokens case-insensitively
filtered_words = [word for word in words if word.lower() not in stop_words]

Stemming

from nltk.stem import PorterStemmer

ps = PorterStemmer()
stemmed_words = [ps.stem(word) for word in filtered_words]
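
A stem is not necessarily a dictionary word: the Porter algorithm strips suffixes by rule, so related forms collapse to a shared stem. A small sketch of that behavior, using an illustrative word list that is not part of the guide above:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Illustrative input words, chosen to show suffix stripping.
sample_words = ["running", "cats", "studies", "study"]
stems = {word: ps.stem(word) for word in sample_words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")

# Related forms end up with the same stem, which is the point of stemming.
print(stems["studies"] == stems["study"])  # True
```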

Part-of-Speech Tagging

from nltk import pos_tag

tagged_words = pos_tag(words)
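
pos_tag returns a list of (token, tag) tuples, with Penn Treebank tags by default. The sketch below shows one way to filter such a list for nouns. The sample list is hand-written in the same shape for illustration and is not actual tagger output.

```python
# Hand-written sample in the (token, tag) shape that pos_tag returns;
# illustrative only, not real tagger output.
sample_tagged = [
    ("Python", "NNP"),
    ("is", "VBZ"),
    ("awesome", "JJ"),
    ("weather", "NN"),
]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [token for token, tag in sample_tagged if tag.startswith("NN")]
print(nouns)  # ['Python', 'weather']
```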

Named Entity Recognition (NER)

from nltk import ne_chunk

ner_tree = ne_chunk(tagged_words)
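
ne_chunk returns an nltk.Tree in which each recognized entity is a labeled subtree (for example PERSON or GPE) and every other token stays a plain (token, tag) tuple. A minimal sketch of pulling the entities out of such a tree follows. The tree here is built by hand to mimic that shape, so the sentence and labels are assumptions rather than real chunker output.

```python
from nltk.tree import Tree

# Hand-built tree in the shape ne_chunk returns (illustrative only).
sample_tree = Tree("S", [
    Tree("PERSON", [("Alice", "NNP")]),
    ("visited", "VBD"),
    Tree("GPE", [("New", "NNP"), ("York", "NNP")]),
])

entities = []
for node in sample_tree:
    # Entity chunks are subtrees; ordinary tokens are plain tuples.
    if isinstance(node, Tree):
        name = " ".join(token for token, _tag in node.leaves())
        entities.append((node.label(), name))

print(entities)  # [('PERSON', 'Alice'), ('GPE', 'New York')]
```

The same loop should work on ner_tree from the snippet above once the required chunker data is installed.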

Working with WordNet

from nltk.corpus import wordnet

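# Look up the synsets (groups of synonymous senses) for a word,
# then collect their lemma names to get the actual synonyms
program_synsets = wordnet.synsets("program")
synonyms = {lemma.name() for syn in program_synsets for lemma in syn.lemmas()}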

# Example of usage
word = "ship"
synsets = wordnet.synsets(word)
for syn in synsets:
    print("Lemma: ", syn.lemmas()[0].name())
    print("Definition: ", syn.definition())

Advanced NLP Tasks

Parsing Sentence Structure

from nltk import CFG

grammar = CFG.fromstring("""
    S -> NP VP
    VP -> V NP
    NP -> 'the' N
    N -> 'cat'
    V -> 'sat'
    """)

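A grammar on its own does not parse anything; it has to be handed to a parser. The following sketch uses NLTK's ChartParser with the same toy grammar. The input sentence is an assumption chosen to fit the grammar, since the rule VP -> V NP requires an object after the verb.

```python
from nltk import CFG, ChartParser

grammar = CFG.fromstring("""
    S -> NP VP
    VP -> V NP
    NP -> 'the' N
    N -> 'cat'
    V -> 'sat'
    """)

parser = ChartParser(grammar)

# Illustrative sentence that the toy grammar can derive.
tokens = "the cat sat the cat".split()

# parse() yields every tree the grammar licenses for these tokens.
trees = list(parser.parse(tokens))
for tree in trees:
    print(tree)
```

Tokens that the grammar does not cover (for example "dog") cause parse() to raise a ValueError, so real grammars need a lexicon that matches the input.
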
Frequency Distribution

from nltk.probability import FreqDist

fdist = FreqDist(words)
most_common_words = fdist.most_common(2)
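
FreqDist behaves like a counter with a few extra helpers. A short sketch on a hand-written token list (illustrative, not taken from the text above) shows the most useful ones:

```python
from nltk.probability import FreqDist

# Illustrative tokens; in practice these would come from a tokenizer.
tokens = "the cat sat on the mat the end".split()
fdist = FreqDist(tokens)

top = fdist.most_common(1)     # most frequent (token, count) pairs
total = fdist.N()              # total number of tokens counted
the_share = fdist.freq("the")  # relative frequency of "the"

print(top)        # [('the', 3)]
print(total)      # 8
print(the_share)  # 0.375
```

Because punctuation and stopwords tend to dominate raw counts, it usually makes sense to build the distribution from filtered tokens.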

Sentiment Analysis

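NLTK includes some sentiment tooling of its own (for example, the VADER analyzer in nltk.sentiment), but its main contribution is foundational tooling such as tokenization and feature extraction. For more complex sentiment analysis, it is common to combine NLTK with machine learning libraries like scikit-learn.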
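
As one example of those foundational tools, NLTK's built-in NaiveBayesClassifier can be trained on simple bag-of-words features. This is a minimal sketch, not a recommended production setup: the four training sentences are made up for illustration, and bag_of_words is a hypothetical helper rather than an NLTK function.

```python
from nltk import NaiveBayesClassifier


def bag_of_words(sentence):
    """Turn a sentence into the {feature: value} dict NLTK classifiers expect."""
    return {word: True for word in sentence.lower().split()}


# Tiny hand-labeled training set, purely illustrative.
training_data = [
    ("great awesome movie", "pos"),
    ("loved it great", "pos"),
    ("terrible boring movie", "neg"),
    ("hated it awful", "neg"),
]
train_set = [(bag_of_words(text), label) for text, label in training_data]

classifier = NaiveBayesClassifier.train(train_set)

prediction = classifier.classify(bag_of_words("great awesome"))
print(prediction)  # pos
```

A realistic classifier needs far more labeled data, and typically the tokenization and stopword filtering steps shown earlier in place of a bare split().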

Saving and Loading Models

NLTK itself doesn't focus on machine learning models in the way libraries like scikit-learn or TensorFlow do. However, it's often used to preprocess text data for machine learning tasks, after which the trained models can be saved and loaded using those libraries' own mechanisms.
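
For objects that come from NLTK itself, such as its built-in classifiers, Python's standard pickle module is one option, since they are ordinary Python objects. The sketch below round-trips a toy classifier through bytes in memory; the training data is made up for illustration.

```python
import pickle

from nltk import NaiveBayesClassifier

# Toy training data, purely illustrative.
train_set = [
    ({"great": True}, "pos"),
    ({"awful": True}, "neg"),
]
classifier = NaiveBayesClassifier.train(train_set)

# Serialize to bytes and restore; pickle.dump / pickle.load do the
# same thing against a file opened in binary mode.
blob = pickle.dumps(classifier)
restored = pickle.loads(blob)

restored_label = restored.classify({"great": True})
print(restored_label)  # pos
```

Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.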

NLTK is a comprehensive library for building Python programs that work with human language data, offering functionality that ranges from simple tokenization to complex parsing and semantic reasoning. This guide covers only the basics; NLTK's documentation and tutorials go deeper into each of these NLP tasks.