Find Unique Named Entities Using Python And NLTK

I'm exploring a concept that requires natural language analysis to programmatically find the subjects of a given piece of text which, in my case, are just the unique named entities.

The Python NLTK library is just what I needed, but it has some non-trivial API changes since version 3.0 which broke many of the recent NLTK examples on the Internet.

Here's an example of what I ended up doing in light of those changes to extract the unique named entities:

import nltk

def parts_of_speech(corpus):
    "returns named entity chunks in a given text"
    sentences = nltk.sent_tokenize(corpus)
    tokenized = [nltk.word_tokenize(sentence) for sentence in sentences]
    pos_tags  = [nltk.pos_tag(sentence) for sentence in tokenized]
    chunked_sents = nltk.ne_chunk_sents(pos_tags, binary=True)
    return chunked_sents
    

def find_entities(chunks):
    "given list of tagged parts of speech, returns unique named entities"

    def traverse(tree):
        "recursively traverses an nltk.tree.Tree to find named entities"
          
        entity_names = []
    
        if hasattr(tree, 'label') and tree.label:
            if tree.label() == 'NE':
                entity_names.append(' '.join([child[0] for child in tree]))
            else:
                for child in tree:
                    entity_names.extend(traverse(child))
    
        return entity_names
    
    named_entities = []
    
    for chunk in chunks:
        
        entities = sorted(list(set([word for tree in chunk
                            for word in traverse(tree)])))
        for e in entities:
            if e not in named_entities:
                named_entities.append(e)
    return named_entities

entity_chunks  = parts_of_speech("George Bush is the 41st President of the United States.")
named_entities = find_entities(entity_chunks)

print named_entities

This was adapted for NLTK 3.x and was based this example.