Python - Text Classification

  • Overview

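    We often need to sort text into categories according to predefined criteria. nltk provides such categorization as part of several of its corpora. In the example below, we look at the movie reviews corpus and check which categories are available.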
    
    # Let's see how the movie reviews are categorized.
    # Requires the corpus data: run nltk.download("movie_reviews") once.
    from nltk.corpus import movie_reviews
    all_cats = [c.lower() for c in movie_reviews.categories()]
    print(all_cats)
    
    When we run the above program, we get the following output:
    
    ['neg', 'pos']
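
    These labels mirror the corpus layout: each file id, such as "pos/cv944_13521.txt", begins with its category as a directory prefix. As a rough sketch of that mapping in plain Python (no nltk needed), the helper below groups file ids by prefix. The function name group_by_category and the two "example" file ids are made up for illustration; with the real corpus, movie_reviews.fileids(categories="pos") returns the positive file ids directly.

```python
from collections import defaultdict


def group_by_category(fileids):
    """Group corpus file ids by their leading directory name."""
    groups = defaultdict(list)
    for fileid in fileids:
        # "pos/cv944_13521.txt" -> category "pos"
        category = fileid.split("/", 1)[0]
        groups[category].append(fileid)
    return dict(groups)


# Illustrative file ids; only the first one appears in this tutorial.
sample_fileids = [
    "pos/cv944_13521.txt",
    "pos/example_0001.txt",  # hypothetical
    "neg/example_0002.txt",  # hypothetical
]

grouped = group_by_category(sample_fileids)
for category in sorted(grouped):
    print(category, len(grouped[category]))
```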
    
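    Now let's look at the content of one of the files that holds a positive review. We split its raw text into sentences and print the first four as a sample.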
    
    from nltk.corpus import movie_reviews
    from nltk.tokenize import sent_tokenize
    # Read the raw text of one positive review
    sample = movie_reviews.raw("pos/cv944_13521.txt")
    # Split the text into sentences (needs NLTK's Punkt tokenizer data)
    sentences = sent_tokenize(sample)
    for sentence in sentences[:4]:
        print(sentence)
    
    When we run the above program, we get the following output:
    
    meteor threat set to blow away all volcanoes & twisters !
    summer is here again !
    this season could probably be the most ambitious = season this decade with hollywood churning out films 
    like deep impact , = godzilla , the x-files , armageddon , the truman show , 
    all of which has but = one main aim , to rock the box office .
    leading the pack this summer is = deep impact , one of the first few film 
    releases from the = spielberg-katzenberg-geffen's dreamworks production company .
    
    Next, we take the words from every file in the corpus and find the most common ones using the FreqDist class from nltk.
    
    import nltk
    from nltk.corpus import movie_reviews
    # Lowercase every token in the corpus so counts are case-insensitive
    all_words = [w.lower() for w in movie_reviews.words()]
    # FreqDist maps each token to the number of times it occurs
    freq = nltk.FreqDist(all_words)
    print(freq.most_common(10))
    
    When we run the above program, we get the following output:
    
    [(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), 
    ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822)]
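
    The top ten entries are dominated by punctuation and function words, which say little about whether a review is positive or negative. A common next step is to drop such tokens before counting. The sketch below illustrates the idea with collections.Counter, which FreqDist subclasses in NLTK 3, so most_common behaves the same way. The token list and the small stop-word set are invented for the example and are not taken from the corpus; the helper name content_word_counts is likewise hypothetical.

```python
from collections import Counter

# Tiny illustrative stop-word set (an assumption for this sketch);
# NLTK ships a fuller list in nltk.corpus.stopwords.
STOP_WORDS = {"the", "a", "and", "of", "to", "is", "in"}


def content_word_counts(tokens, stop_words=STOP_WORDS):
    """Count lowercased alphabetic tokens that are not stop words."""
    kept = (
        t.lower()
        for t in tokens
        if t.isalpha() and t.lower() not in stop_words
    )
    return Counter(kept)


# Made-up tokens standing in for movie_reviews.words().
tokens = ["The", "plot", "is", "thin", ",", "the", "acting", "is", "great", ".",
          "Great", "cast", "and", "a", "great", "score", "!"]

counts = content_word_counts(tokens)
print(counts.most_common(1))  # [('great', 3)]
```

    With the real corpus, the same filter could be applied to movie_reviews.words() before building the FreqDist.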