Is It Possible To Use Spacy With Already Tokenized Input?
I have a sentence that has already been tokenized into words. I want to get the part-of-speech tag for each word in the sentence. When I checked the spaCy documentation, I realized that it expects raw text as input, so I am not sure how to feed it a list of tokens.
Solution 1:
You can do this by replacing spaCy's default tokenizer with your own:
nlp.tokenizer = custom_tokenizer
where custom_tokenizer is a function that takes raw text as input and returns a Doc object.
You did not specify how you got the list of tokens. If you already have a function that takes raw text and returns a list of tokens, just make a small change to it:
from spacy.tokens import Doc

def custom_tokenizer(text):
    tokens = []
    # your existing code to fill the list with tokens
    # replace this line:  return tokens
    # with this:
    return Doc(nlp.vocab, words=tokens)
See the documentation on Doc.
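As an aside, the Doc constructor also accepts a spaces argument (one boolean per token, indicating whether that token is followed by whitespace); passing it lets doc.text reproduce the original string exactly. A minimal sketch:

from spacy.tokens import Doc

words = ['Hello', ',', 'world', '.']
spaces = [False, True, False, False]  # whitespace following each token
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)  # 'Hello, world.'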
If for some reason you cannot do this (maybe you don't have access to the tokenization function), you can use a dictionary:
tokens_dict = {'Hello, world.': ['Hello', ',', 'world', '.']}
def custom_tokenizer(text):
    if text in tokens_dict:
        return Doc(nlp.vocab, words=tokens_dict[text])
    else:
        raise ValueError('No tokenization available for input.')
Either way, you can then use the pipeline as in your first example:
doc = nlp('Hello, world.')
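Putting the pieces together for the original part-of-speech question, here is a minimal runnable sketch of the dictionary approach (assuming the en_core_web_sm model is installed; any other model works the same way):

import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')  # assumes this model is installed

tokens_dict = {'Hello, world.': ['Hello', ',', 'world', '.']}

def custom_tokenizer(text):
    if text in tokens_dict:
        return Doc(nlp.vocab, words=tokens_dict[text])
    raise ValueError('No tokenization available for input.')

nlp.tokenizer = custom_tokenizer

doc = nlp('Hello, world.')
for token in doc:
    print(token.text, token.pos_, token.tag_)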
Solution 2:
If the tokenized text varies from call to call, another option is to skip tokenization entirely: build the Doc yourself and run the remaining pipeline components on it.
spacy_doc = Doc(nlp.vocab, words=tokenized_text)
for name, pipe in nlp.pipeline:  # nlp.pipeline is a list of (name, component) pairs
    spacy_doc = pipe(spacy_doc)
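For completeness, a self-contained sketch of this approach, again assuming en_core_web_sm is installed and using an example token list:

import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')  # assumes this model is installed
tokenized_text = ['Hello', ',', 'world', '.']  # example pre-tokenized input

spacy_doc = Doc(nlp.vocab, words=tokenized_text)
for name, pipe in nlp.pipeline:
    spacy_doc = pipe(spacy_doc)

for token in spacy_doc:
    print(token.text, token.pos_)

If you are on spaCy v3 or later, nlp() also accepts a Doc directly, so doc = nlp(Doc(nlp.vocab, words=tokenized_text)) achieves the same result without the explicit loop over the pipeline.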