decthings

String lookup

Transforms strings to numbers using a vocabulary.

Takes two inputs: input and vocabulary. Both inputs must have the data type string. The input can have any shape, but the vocabulary must have exactly one dimension.

Click the node to configure the following parameters in the right panel:

  • Output mode: The method for conversion. One of:
    • Integer: Each string in the input is converted to an integer. The first string in the vocabulary is converted to the integer 1, the second to 2, and so on. The number 0 is used if the string is not in the vocabulary. The output will have data type Int64 and the same shape as the input.
    • One-hot: Each string in the input is converted to a vector of size V + 1, where V is the vocabulary length. The first string in the vocabulary is converted to the vector [0, 1, 0, 0, ...], the second to [0, 0, 1, 0, ...], and so on. If the string is not in the vocabulary, then the string is converted to the vector [1, 0, 0, 0, ...]. The output data type will be Int64, with the shape (*, V + 1), where * is the input shape and V is the vocabulary length.
    • Multi-hot: Similar to one-hot, but the final dimension of the input is reduced to a single vector of size V + 1, where V is the vocabulary length. This vector has a 1 at each position whose corresponding string occurs at least once in the final dimension, and a 0 everywhere else. For example, given the input [["a", "b", "a"], ["c", "z", "x"]], and vocabulary ["a", "b", "c"], the output would be [[0, 1, 1, 0], [1, 0, 0, 1]]. In the first list, the second and third positions are filled because they correspond to "a" and "b" in the vocabulary, and these strings were present in the input. In the second list, the first position is filled because there was at least one out-of-vocabulary string in the input, and the fourth position is filled because there was a "c" in the input, which is the third element in the vocabulary.
    • Count: Similar to multi-hot, but instead of always outputting ones, outputs the number of times each string occurs in the input. For example, given the input [["a", "b", "a"], ["c", "z", "x"]], and vocabulary ["a", "b", "c"], the output would be [[0, 2, 1, 0], [2, 0, 0, 1]]. In the first list, the second position is filled with a 2 because there were two occurrences of "a" in the input. In the second list, the first position is filled with a 2 because there were two out-of-vocabulary strings in the input.
    • TF-IDF: Similar to count, but in this mode the node takes an additional input, tf-idf-weights, and multiplies each output vector in the final dimension element-wise by these weights. The supported data types for tf-idf-weights are Float16, Float32 and Float64, and its shape must be (V), where V is the vocabulary length. There is no weight in tf-idf-weights for the first element (the out-of-vocabulary element). Instead, that element is multiplied by the weight 1.
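
The output modes above can be sketched in plain Python. This is only an illustration of the described behavior, not the node's actual implementation: nested lists stand in for tensors, the input is assumed to have at least one dimension, and all function names (map_last_axis, lookup_integer, and so on) are hypothetical. In count mode the sketch follows the description, so two out-of-vocabulary strings give a 2 in the first position.

```python
from collections import Counter


def map_last_axis(data, fn):
    """Apply fn to every innermost list of a nested list."""
    if data and isinstance(data[0], list):
        return [map_last_axis(item, fn) for item in data]
    return fn(data)


def build_index(vocabulary):
    # Index 0 is reserved for out-of-vocabulary strings, so the first
    # vocabulary entry maps to 1, the second to 2, and so on.
    return {token: i + 1 for i, token in enumerate(vocabulary)}


def lookup_integer(data, vocabulary):
    index = build_index(vocabulary)
    return map_last_axis(data, lambda row: [index.get(s, 0) for s in row])


def lookup_one_hot(data, vocabulary):
    index, size = build_index(vocabulary), len(vocabulary) + 1

    def encode(row):
        # One vector of size V + 1 per string, containing a single 1.
        return [[int(i == index.get(s, 0)) for i in range(size)] for s in row]

    return map_last_axis(data, encode)


def lookup_count(data, vocabulary):
    index, size = build_index(vocabulary), len(vocabulary) + 1

    def encode(row):
        # The whole final dimension collapses into one vector of counts.
        counts = Counter(index.get(s, 0) for s in row)
        return [counts[i] for i in range(size)]

    return map_last_axis(data, encode)


def lookup_multi_hot(data, vocabulary):
    # Same as count, but every non-zero count is clamped to 1.
    counts = lookup_count(data, vocabulary)
    return map_last_axis(counts, lambda row: [min(c, 1) for c in row])


def lookup_tf_idf(data, vocabulary, weights):
    # The out-of-vocabulary slot has no weight of its own, so it uses 1.
    full = [1.0] + list(weights)
    counts = lookup_count(data, vocabulary)
    return map_last_axis(counts, lambda row: [c * w for c, w in zip(row, full)])


vocab = ["a", "b", "c"]
batch = [["a", "b", "a"], ["c", "z", "x"]]
print(lookup_integer(batch, vocab))    # [[1, 2, 1], [3, 0, 0]]
print(lookup_multi_hot(batch, vocab))  # [[0, 1, 1, 0], [1, 0, 0, 1]]
print(lookup_count(batch, vocab))      # [[0, 2, 1, 0], [2, 0, 0, 1]]
```

Every mode shares the same index, with position 0 reserved for out-of-vocabulary strings. Multi-hot and TF-IDF are both derived from the count vectors, which matches how the modes are described relative to each other.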
