Monday, July 25, 2016

How Does the Bag of Words Model Work in Text Mining with Python?

The Bag of Words model, BOW, is a popular representation which is commonly used in natural language processing applications. In this model, a text document including several sentences and paragraphs is represented as the bag of the words included in that document, without considering syntax and the order of words. The main idea is to quantize each extracted key point into one of categorical words, and then represent each document by a histogram of the categorical words. The BOW feature design creates a vocabulary out of all the words in the training set and, for each example, creates a vector of word counts pertaining to that instance. Since the vector holds a place for each word in the vocabulary, the resulting BOW matrix is sparse (mostly zeros) because most words in the vocabulary do not appear in a given example.
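
To make this concrete, here is a minimal sketch (the toy corpus below is invented for illustration) that builds a vocabulary from a small training set and maps each document to its vector of word counts:

from collections import Counter

# a tiny invented training corpus
docs = ['the bag of words model',
        'words in the bag',
        'a sparse vector of word counts']

# the vocabulary is every distinct word in the training set
vocabulary = sorted(set(word for doc in docs for word in doc.split()))

def bow_vector(doc, vocabulary):
    # map one document to its vector of word counts over the vocabulary
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

for doc in docs:
    print(bow_vector(doc, vocabulary))

Most entries in each vector are zero, which is exactly why the resulting BOW matrix is sparse.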
One of its most common applications is document classification. In this case, the frequency of occurrence of each word from a dictionary is used as an attribute for learning a classifier [1]. Salton and McGill proposed a keyword classification method that sorts documents into different categories, using a synonym dictionary for document indexing and query retrieval [2]. Furthermore, Zhang et al. presented a statistical framework for generalizing the BOW representation to image processing, inspired by text mining applications [3]. Voloshynovskiy et al. also surveyed the performance of the BOW method to give a better understanding of its robustness and accuracy in decision making, and reported successful experiments on real image data [4].
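
As a small, hedged illustration of the document-classification use case (the training documents and labels below are made up, and scikit-learn is assumed to be available), the word counts can feed directly into a classifier:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# invented labeled training documents
train_docs = ['cheap pills buy now',
              'meeting agenda attached',
              'win money fast',
              'project status report']
train_labels = ['spam', 'ham', 'spam', 'ham']

# the frequency of each dictionary word becomes an attribute for learning
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

clf = MultinomialNB()
clf.fit(X_train, train_labels)

# expected to predict 'spam' on this toy data
print(clf.predict(vectorizer.transform(['buy cheap pills now'])))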
Once we have gathered our documents, we simply concatenate them into a single text file. We then replace punctuation such as $, %, & and @, as well as digits and specific terms such as WWW and com, with blank spaces. After cleaning the text file, we count the frequency of each word and rank the words by frequency. Finally, we choose the top N words from the text file, where N is a user-defined parameter.
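
A rough sketch of this cleaning and ranking step might look like the following (the regular expression and the sample string are assumptions for illustration, not a fixed recipe):

import re
from collections import Counter

text = 'Save 50% at WWW.example.com & save more at example.com!'

# replace punctuation, digits, and terms such as WWW and com with spaces
cleaned = re.sub(r'[$%&@!.,]|\d+|\bWWW\b|\bcom\b', ' ', text, flags=re.IGNORECASE)

# count word frequencies and keep the top N
N = 3
counts = Counter(cleaned.lower().split())
print(counts.most_common(N))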

Here is a demo of how BOW works on text data. For the first paragraph of this blog post, we have:


from __future__ import print_function
from collections import Counter


def get_user_terms_stops(user_words):
    '''Extract the token list for each user from the document.'''
    import re
    import string
    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer

    print('extracting the token set for each user ...')
    # build the stop list once, outside the loops; note that it is lowercase,
    # so capitalized words such as 'The' survive (tokens are not lowercased)
    stop = stopwords.words('english') + list(string.punctuation)
    user_terms_stops = []

    for words in user_words:
        terms_stops = []
        for chunk in ' '.join(words).split():  # only the non-empty elements
            # tokenizing:
            tokens = re.findall(r'\w+', chunk)
            # remove stop words:
            terms_stops += [word for word in tokens if word not in stop]
        # optional stemming step (disabled here so that the demo output
        # below keeps the original word forms):
        # p_stemmer = PorterStemmer()
        # terms_stops = [p_stemmer.stem(k) for k in terms_stops]
        user_terms_stops.append(terms_stops)
    print('accomplished!')
    # flatten the per-user token lists into a single list
    return [item for sublist in user_terms_stops for item in sublist]


if __name__ == '__main__':
       
    X = [['The Bag of Words model',
          'BOW, is a popular representation which is commonly used in natural language processing applications.',
          'In this model',
          'a text document including several sentences and paragraphs is represented as the bag of the words',
          'included in that document, without considering syntax and the order of words.'],
         ['The main idea is to quantize each extracted key point into one of categorical words',
          'and then represent each document by a histogram of the categorical words.'],
         ['The BOW feature design creates a vocabulary out of all the words in the training set and',
          'for each example, creates a vector of word counts pertaining to that instance.',
          'Since the vector holds a place for each word in the vocabulary',
          'the resulting BOW matrix is sparse (mostly zeros) because most words in the vocabulary do not appear in a given example.']
        ]
   
   
    user_terms_stops = get_user_terms_stops(X)
    user_terms_stops

['The',
 'Bag',
 'Words',
 'model',
 'BOW',
 'popular',
 'representation',
 'commonly',
 'used',
 'natural',
 'language',
 'processing',
 'applications',
 'In',
 'model',
 'text',
 'document',
 'including',
 'several',
 'sentences',
 'paragraphs',
 'represented',
 'bag',
 'words',
 'included',
 'document',
 'without',
 'considering',
 'syntax',
 'order',
 'words',
 'The',
 'main',
 'idea',
 'quantize',
 'extracted',
 'key',
 'point',
 'one',
 'categorical',
 'words',
 'represent',
 'document',
 'histogram',
 'categorical',
 'words',
 'The',
 'BOW',
 'feature',
 'design',
 'creates',
 'vocabulary',
 'words',
 'training',
 'set',
 'example',
 'creates',
 'vector',
 'word',
 'counts',
 'pertaining',
 'instance',
 'Since',
 'vector',
 'holds',
 'place',
 'word',
 'vocabulary',
 'resulting',
 'BOW',
 'matrix',
 'sparse',
 'mostly',
 'zeros',
 'words',
 'vocabulary',
 'appear',
 'given',
 'example']


c = Counter(user_terms_stops)
c




The Counter maps each token that survived stop-word removal to its frequency:
Counter({'BOW': 3,
         'Bag': 1,
         'In': 1,
         'Since': 1,
         'The': 3,
         'Words': 1,
         'appear': 1,
         'applications': 1,
         'bag': 1,
         'categorical': 2,
         'commonly': 1,
         'considering': 1,
         'counts': 1,
         'creates': 2,
         'design': 1,
         'document': 3,
         'example': 2,
         'extracted': 1,
         'feature': 1,
         'given': 1,
         'histogram': 1,
         'holds': 1,
         'idea': 1,
         'included': 1,
         'including': 1,
         'instance': 1,
         'key': 1,
         'language': 1,
         'main': 1,
         'matrix': 1,
         'model': 2,
         'mostly': 1,
         'natural': 1,
         'one': 1,
         'order': 1,
         'paragraphs': 1,
         'pertaining': 1,
         'place': 1,
         'point': 1,
         'popular': 1,
         'processing': 1,
         'quantize': 1,
         'represent': 1,
         'representation': 1,
         'represented': 1,
         'resulting': 1,
         'sentences': 1,
         'set': 1,
         'several': 1,
         'sparse': 1,
         'syntax': 1,
         'text': 1,
         'training': 1,
         'used': 1,
         'vector': 2,
         'vocabulary': 3,
         'without': 1,
         'word': 2,
         'words': 6,
         'zeros': 1})

The following code returns the N most frequent words in our documents, which can be used in many applications such as text tagging and text categorization:

top_N = 6
top_words = c.most_common(top_N)
top_words


[('words', 6),
 ('BOW', 3),
 ('vocabulary', 3),
 ('The', 3),
 ('document', 3),
 ('creates', 2)]
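
Finally, once the top N words are fixed as the vocabulary, each original document can be mapped back to its histogram over those words, which is exactly the per-document BOW vector described at the start of this post. A minimal sketch, reusing top_words and X from above:

import re

# use the top-N words as the vocabulary
vocab = [w for w, _ in top_words]

# one count vector (histogram) per paragraph in X
for doc in X:
    tokens = re.findall(r'\w+', ' '.join(doc))
    print([tokens.count(w) for w in vocab])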





References
[1] Zellig Harris. "Distributional Structure". Word, 10(2-3):146-162, 1954.
[2] Gerard Salton and Michael J. McGill. "Introduction to Modern Information Retrieval". McGraw-Hill Book Company, New York, 1983.
[3] Yin Zhang, Rong Jin, and Zhi-Hua Zhou. "Understanding bag-of-words model: a statistical framework". International Journal of Machine Learning and Cybernetics, 1(1-4):43-52, 2010.
[4] Sviatoslav Voloshynovskiy, Maurits Diephuis, Dimche Kostadinov, Farzad Farhadzadeh and Taras Holotyak. "On accuracy, robustness and security of bag-of-word search systems".

