NLP is a branch of data science concerned with systematically analyzing, understanding, and deriving information from text in an efficient manner. Using NLP and its components, one can organize large volumes of text, automate many tasks, and solve a wide range of problems such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.
NLTK (Natural Language Toolkit) is a leading platform for building Python programs that work with human language data. It provides easy-to-use interfaces to lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, plus wrappers for industrial-strength NLP libraries.
NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”
Downloading and installing NLTK
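NLTK is installed from PyPI, and its corpora and models are downloaded separately. A minimal setup sketch (the resources listed are the ones the examples below rely on):

# Install NLTK from PyPI (run in a shell):
#   pip install nltk

import nltk

# Download the data used by the examples in this article:
# 'punkt' backs word_tokenize, 'stopwords' holds the stop-word
# lists, and 'wordnet' backs the WordNet lemmatizer.
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')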
The main issue with text data is that it is all in raw text format. Machine learning algorithms, however, need numerical feature vectors to perform their task, so before starting any NLP project we need to pre-process the text to make it suitable for work. Basic text pre-processing includes tokenization, stop-word removal, stemming, and lemmatization.
Tokenization splits a text into words and punctuation symbols:

import nltk
from nltk.tokenize import word_tokenize

text = "God is Great! I won a lottery."
print(word_tokenize(text))
Output: ['God', 'is', 'Great', '!', 'I', 'won', 'a', 'lottery', '.']
Stop words such as "the" and "is" carry little information and are usually removed. NLTK ships a ready-made list per language:

import nltk
from nltk.corpus import stopwords

# show the English stop-word list
print(set(stopwords.words('english')))
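The list is typically used to filter the tokens produced by word_tokenize; a minimal sketch (the variable names are illustrative):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "God is Great! I won a lottery."
stop_words = set(stopwords.words('english'))

# keep only the tokens that are not stop words
filtered = [w for w in word_tokenize(text) if w.lower() not in stop_words]
print(filtered)
# ['God', 'Great', '!', 'won', 'lottery', '.']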
Stemming reduces a word to its root form by chopping off suffixes. The Porter stemmer is the most common choice:

# import these modules
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# choose some words to be stemmed
words = ["Connect", "Connects", "Connected", "Connecting", "Connection", "Connections"]
for w in words:
    print(w, " : ", ps.stem(w))
Lemmatization also reduces words to a base form, but unlike stemming it returns a valid dictionary word (the lemma), optionally guided by a part-of-speech tag:

# import these modules
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))

# 'a' denotes adjective in "pos"
print("better :", lemmatizer.lemmatize("better", pos="a"))

Output:
rocks : rock
corpora : corpus
better : good
Now we need to transform the text into a meaningful vector of numbers. The simplest such representation describes the occurrence of words within a document. For example, if our dictionary contains the words {Learning, is, the, not, great} and we want to vectorize the text "Learning is great", we get the vector (1, 1, 0, 0, 1). One problem with this representation is that very frequent words start to dominate the document (they get larger scores) even though they may carry little informational content. Another is that it gives more weight to longer documents than to shorter ones.
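This occurrence-count representation corresponds to scikit-learn's CountVectorizer; a minimal sketch with an illustrative toy corpus:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(["Learning is great", "Learning is not great"])

# vocabulary is sorted alphabetically
print(vectorizer.get_feature_names_out())
# ['great' 'is' 'learning' 'not']

print(X.toarray())
# [[1 1 1 0]
#  [1 1 1 1]]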
One approach is to rescale the frequency of words so that very frequent words are penalized; this weighting is called Term Frequency-Inverse Document Frequency (TF-IDF).
TF = (Number of times term t appears in a document) / (Total number of terms in the document)
IDF = 1 + log(N/n), where N is the total number of documents and n is the number of documents in which term t appears.
The TF-IDF weight is the product of the two (TF × IDF) and is often used in information retrieval and text mining.
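As a quick sanity check of the formulas above, here is a minimal sketch that computes the weight by hand (the toy corpus and the helper function tf_idf are illustrative; note that scikit-learn, used next, applies a slightly different smoothed IDF and L2 normalization by default):

import math

corpus = [
    "learning is great",
    "learning is fun",
    "python is great",
]

def tf_idf(term, document, corpus):
    words = document.split()
    # TF = (occurrences of the term in the document) / (terms in the document)
    tf = words.count(term) / len(words)
    # IDF = 1 + log(N/n), with N documents in total
    # and n documents containing the term
    n = sum(1 for doc in corpus if term in doc.split())
    idf = 1 + math.log(len(corpus) / n)
    return tf * idf

print(tf_idf("learning", corpus[0], corpus))  # rarer term, higher weight
print(tf_idf("is", corpus[0], corpus))        # appears everywhere, lower weight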
TF-IDF can be implemented in scikit-learn as:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.shape)

Output:
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
(4, 9)
Once documents are represented as vectors, their similarity can be measured as the cosine of the angle between them:
Cosine Similarity(d1, d2) = Dot product(d1, d2) / (||d1|| * ||d2||)
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# vectors
a = np.array([1, 2, 3])
b = np.array([1, 1, 4])

# manually compute cosine similarity
dot = np.dot(a, b)
norma = np.linalg.norm(a)
normb = np.linalg.norm(b)
cos = dot / (norma * normb)
print(cos)  # 0.9449...

# the same result via scikit-learn (which expects 2-D arrays)
print(cosine_similarity(a.reshape(1, -1), b.reshape(1, -1)))
Once the cosine similarity matrix is computed, we can build algorithms on top of it for document similarity calculation, sentiment analysis, topic segmentation, and so on.
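For instance, document similarity can be computed directly from the TF-IDF matrix of the earlier example; a minimal sketch reusing that corpus:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

X = TfidfVectorizer().fit_transform(corpus)

# 4x4 matrix: entry (i, j) is the cosine similarity
# between document i and document j
sim = cosine_similarity(X)
print(sim.round(2))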
I have done my best to keep this article simple and practical; I hope you found it useful and interesting.