Hi,
Can anyone advise on the most efficient way to explode a huge free-text field (or fields) into its individual characters and retain a single instance of each? I am essentially trying to complete a pre-flight check to find out whether there are any ‘odd’ or ‘unexpected’ characters in an ever-expanding data set over which I have no control.
I have created a process below which completes the task; however, it is very inefficient and as the number of records increases it will become too slow.
1. Derive the string length of the free text field.
2. Clone by the number derived in 1 (the clone number is created in the process).
3. Substring extract using the clone number to obtain the character at that position.
4. Duplicate remover to create my list.
5. Expose the character code.
6. Output the list.
Thanks in advance,
Rob
1. You can expose the list name "_char{}" with the Attributes to Expose parameter in the PythonCaller parameters dialog.
2. This script collects the characters from all the input features, then outputs a single feature carrying the list once every feature has been read.
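# PythonCaller Script Example 2
import fmeobjects

class FeatureProcessor(object):
    def __init__(self):
        # Unique characters seen across all input features
        self.chars = set()

    def input(self, feature):
        # 'or ""' guards against features where _text is missing (None)
        self.chars |= set(feature.getAttribute('_text') or '')

    def close(self):
        # Emit one feature holding every unique character as a list
        feature = fmeobjects.FMEFeature()
        feature.setAttribute('_char{}', list(self.chars))
        self.pyoutput(feature)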
In addition, if you ultimately need to explode the feature on the list, the close method can be modified like this instead of using a ListExploder afterward.
    def close(self):
        for i, c in enumerate(self.chars):
            feature = fmeobjects.FMEFeature()
            feature.setAttribute('_char', c)
            feature.setAttribute('_element_index', i)
            self.pyoutput(feature)
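The original question also asked for the character code of each character (step 5) so that 'odd' ones can be spotted. Below is a minimal plain-Python sketch of that idea, runnable outside FME. The function name describe_chars, the attribute names _code, _category and _unexpected, and the rule that "unexpected" means anything outside printable ASCII (codes 32 to 126) are all assumptions for illustration, not part of the scripts above.

```python
import unicodedata

def describe_chars(texts):
    """Collect the distinct characters across texts and describe each one."""
    chars = set()
    for text in texts:
        chars |= set(text or '')  # treat missing (None) values as empty
    rows = []
    for c in sorted(chars):  # sort by code point for a stable order
        code = ord(c)
        rows.append({
            '_char': c,
            '_code': code,                            # Unicode code point
            '_category': unicodedata.category(c),     # e.g. 'Ll', 'Cc'
            '_unexpected': code < 32 or code > 126,   # assumed rule: not printable ASCII
        })
    return rows

# Hypothetical sample values standing in for the _text attribute
for row in describe_chars(['abc', 'a\tb', None, 'caf\u00e9']):
    print(row)
```

In a PythonCaller, each row could presumably be written to its own output feature with setAttribute inside the close method, in the same way as '_char' and '_element_index' above.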
Hi @rob14, I think a Python script could be more efficient. Assuming that an attribute called "_text" stores the text string, a PythonCaller with this script creates a list containing the unique characters.
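# PythonCaller Script Example
def processFeature(feature):
    # Iterating a string yields its characters; set() keeps one of each.
    # 'or ""' guards against features where _text is missing (None).
    s = set(feature.getAttribute('_text') or '')
    feature.setAttribute('_char{}', list(s))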
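To see why this works: iterating a Python string yields its characters one at a time, so passing the string to set() keeps a single instance of each. A minimal sketch outside FME follows; the helper name unique_chars and the sample value are made up for illustration.

```python
def unique_chars(text):
    """Return the distinct characters of text, sorted by code point."""
    # A missing value (None) is treated as an empty string.
    return sorted(set(text or ''))

sample = 'Hello, World!'  # hypothetical free-text value
print(unique_chars(sample))
# prints [' ', '!', ',', 'H', 'W', 'd', 'e', 'l', 'o', 'r']
```

A set has no guaranteed order, so the sketch sorts the result; without sorting, the order of the list elements may differ between runs.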
You could create your list initially by using a StringSearcher with the regular expression . (which matches any single character) and creating a list name for all matches, then using a ListDuplicateRemover to get a list of unique characters.
I have no idea how that would compare performance-wise, though.
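For a rough feel of the two ideas, here is a plain-Python analogue: a regex that matches every character followed by duplicate removal, next to the set() approach. This is only a sketch under assumptions. It does not measure the StringSearcher or ListDuplicateRemover themselves, the function names and sample text are hypothetical, and the timings will vary by machine.

```python
import re
import timeit

def unique_chars_regex(text):
    """Match every character with '.', then drop duplicates in first-seen order."""
    # DOTALL makes '.' match newlines too; without it they would be skipped.
    matches = re.findall(r'.', text, flags=re.DOTALL)
    return list(dict.fromkeys(matches))

def unique_chars_set(text):
    """Let set() do the de-duplication directly (order not preserved)."""
    return set(text)

# Hypothetical sample text, repeated to give the timer something to do
sample = 'The quick brown fox jumps over the lazy dog.\n' * 100

t_regex = timeit.timeit(lambda: unique_chars_regex(sample), number=200)
t_set = timeit.timeit(lambda: unique_chars_set(sample), number=200)
print('regex + dedupe: %.4f s' % t_regex)
print('set():          %.4f s' % t_set)
```

Both functions find the same characters; the regex version has to build a full list of matches before the duplicates are removed, which is the extra work the set() version avoids.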
© 2019 Safe Software Inc