Python - Data Science: Stemming and Lemmatization

  • Overview

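    In natural language processing, we often encounter two or more words that share a common root. For example, agreed, agreeing, and agreeable all share the root agree. A search involving any one of these words should treat them as the same word, namely the root word, so linking each word to its root is essential. A stemmer approximates this by stripping suffixes, which means the stem it returns is not always a dictionary word. The NLTK library provides methods that perform this linking and output the root word.
    The program below uses the Porter stemming algorithm for stemming.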
    
    import nltk
    from nltk.stem.porter import PorterStemmer
    # word_tokenize needs the Punkt tokenizer data; a one-time
    # nltk.download('punkt') may be required ('punkt_tab' on newer NLTK releases)
    porter_stemmer = PorterStemmer()
    word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
    # First, split the sentence into word tokens
    nltk_tokens = nltk.word_tokenize(word_data)
    # Next, find the stem of each token
    for w in nltk_tokens:
        print("Actual: %s  Stem: %s" % (w, porter_stemmer.stem(w)))
    
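    Running the code above produces the following output. Recent NLTK releases lowercase stems by default, so the first line may show it instead of It.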
    
    Actual: It  Stem: It
    Actual: originated  Stem: origin
    Actual: from  Stem: from
    Actual: the  Stem: the
    Actual: idea  Stem: idea
    Actual: that  Stem: that
    Actual: there  Stem: there
    Actual: are  Stem: are
    Actual: readers  Stem: reader
    Actual: who  Stem: who
    Actual: prefer  Stem: prefer
    Actual: learning  Stem: learn
    Actual: new  Stem: new
    Actual: skills  Stem: skill
    Actual: from  Stem: from
    Actual: the  Stem: the
    Actual: comforts  Stem: comfort
    Actual: of  Stem: of
    Actual: their  Stem: their
    Actual: drawing  Stem: draw
    Actual: rooms  Stem: room
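
To make the idea of linking words to a shared root concrete, here is a minimal sketch of suffix stripping followed by grouping words by their stem. It is not the Porter algorithm: the rule list `SUFFIX_RULES`, the functions `naive_stem` and `group_by_stem`, and the `min_stem_len` parameter are all hypothetical names invented for this illustration, and the rules are deliberately naive.

```python
from collections import defaultdict

# Hypothetical, deliberately naive rules: suffixes to strip, tried in order.
# The real Porter algorithm uses several ordered phases with extra conditions.
SUFFIX_RULES = ["ing", "ed", "s"]


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)]
    return word


def group_by_stem(words):
    """Map each stem to the list of original words that reduce to it."""
    groups = defaultdict(list)
    for w in words:
        groups[naive_stem(w)].append(w)
    return dict(groups)


groups = group_by_stem(["reader", "readers", "learn", "learning", "learned", "rooms"])
for stem, members in groups.items():
    print("%s: %s" % (stem, ", ".join(members)))
```

A search index keyed by stem would then return every member of a group for any query word in that group. The naive rules also show why a real stemmer needs more care: `naive_stem("originated")` gives originat, whereas the Porter output above gives origin.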
    
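    Lemmatization is similar to stemming, but it brings context to the words: it uses a vocabulary and the word's part of speech to return a valid dictionary form (the lemma) rather than a truncated stem. For example, it maps cars to car and, when told the word is an adjective, maps better to good. The program below uses the WordNet lexical database for lemmatization.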
    
    import nltk
    from nltk.stem import WordNetLemmatizer
    # The lemmatizer reads WordNet data; a one-time
    # nltk.download('wordnet') may be required
    wordnet_lemmatizer = WordNetLemmatizer()
    word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
    nltk_tokens = nltk.word_tokenize(word_data)
    # lemmatize() treats each token as a noun unless a pos argument is given
    for w in nltk_tokens:
        print("Actual: %s  Lemma: %s" % (w, wordnet_lemmatizer.lemmatize(w)))
    
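    Running the code above produces the following output. Because lemmatize treats every word as a noun by default, plural nouns are reduced to their singular form, while verb forms such as originated, learning, and drawing are returned unchanged.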
    
    Actual: It  Lemma: It
    Actual: originated  Lemma: originated
    Actual: from  Lemma: from
    Actual: the  Lemma: the
    Actual: idea  Lemma: idea
    Actual: that  Lemma: that
    Actual: there  Lemma: there
    Actual: are  Lemma: are
    Actual: readers  Lemma: reader
    Actual: who  Lemma: who
    Actual: prefer  Lemma: prefer
    Actual: learning  Lemma: learning
    Actual: new  Lemma: new
    Actual: skills  Lemma: skill
    Actual: from  Lemma: from
    Actual: the  Lemma: the
    Actual: comforts  Lemma: comfort
    Actual: of  Lemma: of
    Actual: their  Lemma: their
    Actual: drawing  Lemma: drawing
    Actual: rooms  Lemma: room