Solved

Remove multiple underscores that separate words


I'd like my regex to replace the middle underscores with a space. I cannot seem to escape the underscore character to use with curly brackets for repetitions.

I have strings with multiple underscores and spaces as word separators. I'd like my result to look like below:

word1_worda wordb wordc wordd_word3

Sample strings look like:

sample1: Autobooster 1_Autobooster Unit_Unit_Autobooster Unit

sample2: Anchor_FIELD_VERIFIED_DATE_NonDisplay

With my format, these samples should result in:

sample1: Autobooster 1_Autobooster Unit Unit_Autobooster Unit

sample2: Anchor_FIELD VERIFIED DATE_NonDisplay

I used lookahead and lookbehind, but what I get is the middle string with the underscore between word 1 and word 2 before it and the underscore between word 2 and word 3 after it. My regex is below:

\\w(?<=_).*_?\\w(?<=_)
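
For readers who want to see what this pattern actually captures, here is a minimal Python sketch. It assumes the doubled backslashes are escaping residue, so the intended tokens are \w, and that Python's re module treats these constructs the same way FME's regex engine does, which may not hold exactly. The helper name matched_span is hypothetical. Because \w also matches an underscore, \w(?<=_) can only succeed on an underscore itself, so the match runs from the first underscore to the last one, inclusive.

```python
import re

# Pattern from the post, assuming single backslashes were intended.
PATTERN = re.compile(r"\w(?<=_).*_?\w(?<=_)")


def matched_span(text):
    """Return the text the pattern matches, or None when nothing matches."""
    match = PATTERN.search(text)
    return match.group(0) if match else None


print(matched_span("Anchor_FIELD_VERIFIED_DATE_NonDisplay"))
# _FIELD_VERIFIED_DATE_
```

That is the middle text with its surrounding underscores still attached, which fits the behaviour described above.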

Test string looks like below

To finish my translation this way, I would have to use a StringReplacer and a StringConcatenator, and then merge the parts back together again.

Best answer by courtney_m 18 August 2017, 17:01

10 replies

So you want all but the first and last underscores replaced with spaces?

Yes, that's the ideal.

 

I was able to do this with 2 StringSearchers and an AttributeManager:

The first StringSearcher finds the first word and underscore by searching the text for the RegEx ^[^_]*_ and saving it as _first_word.

The second StringSearcher finds the last word and underscore by searching the text for the RegEx _[^_]*$ and saving it as _last_word.

Then, in the AttributeManager, I created the _middle_text attribute by trimming _first_word off the left of the text, trimming _last_word off the right of the text, then replacing _ with a space. I used the following notation:

@ReplaceString(@TrimRight(@TrimLeft(@Value(text_line_data),@Value(_first_word)),@Value(_last_word)),"_"," ")

Then, I created the final_text attribute by concatenating the attributes _first_word, _middle_text, and _last_word. Finally, I removed the un-needed attributes.
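
The same split, replace, and rejoin idea can be sketched outside FME. The Python below is only an illustration and not the attached workspace: the function name replace_middle_underscores is hypothetical, and it slices the middle part out by position instead of calling @TrimLeft and @TrimRight.

```python
import re


def replace_middle_underscores(text):
    """Keep the first and last underscores; turn the ones between into spaces."""
    first = re.search(r"^[^_]*_", text)   # first word plus its underscore
    last = re.search(r"_[^_]*$", text)    # last underscore plus the last word
    # With fewer than two underscores the two matches are missing or overlap,
    # so there is no middle part to rewrite.
    if first is None or last is None or first.end() > last.start():
        return text
    middle = text[first.end():last.start()]
    return first.group(0) + middle.replace("_", " ") + last.group(0)


print(replace_middle_underscores("Anchor_FIELD_VERIFIED_DATE_NonDisplay"))
# Anchor_FIELD VERIFIED DATE_NonDisplay
```

Slicing by the match positions keeps the first and last words exactly as they were found, whatever characters they contain.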

In the Inspector, you can see the value of final_text.

I have also attached the workspace, if you want it. I hope this helps!

 

-Courtney

Thanks @courtney_m for explaining and providing your workspace. I appreciate that.

 

That is a powerful article. Great! @courtney_m

 


Thank you, @danilo_inovacao!

You're very welcome, @salvaleonrp. I'm glad I could help.

 

The accepted answer is good enough but I wonder if there's a single regex string to remove the extra underscores. Any takers?

@salvaleonrp, I accept your challenge: "single regex string to remove the extra underscores".

[2017-08-20: Update] Simplified the regex.

Use a StringReplacer with these parameters.

  • Mode: Replace Regular Expression
  • Text To Replace: (?<=_)(.*?)_(?=.*_)
  • Replacement Text: \1<space>

Setting the following string expression as a transformer parameter works as well. Assume a feature attribute called "text" contains the source text string.

@ReplaceRegEx(@Value(text),(?<=_)(.*?)_(?=.*_),\1 )
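
As a rough cross-check of the lookaround pattern, here is a minimal Python sketch using re.sub. It assumes Python's re module handles the lookbehind, the lazy group, and the lookahead the same way FME's engine does for these samples; the function name replace_inner_underscores is hypothetical. The lookbehind (?<=_) only lets a match start right after an underscore, and the lookahead (?=.*_) requires that another underscore still follows, so neither the first nor the last underscore is ever replaced.

```python
import re

# Same pattern as in the StringReplacer parameters above.
INNER_UNDERSCORE = re.compile(r"(?<=_)(.*?)_(?=.*_)")


def replace_inner_underscores(text):
    """Replace every underscore except the first and the last with a space."""
    # \1 puts back the text captured before the underscore, followed by a space.
    return INNER_UNDERSCORE.sub(r"\1 ", text)


print(replace_inner_underscores("Autobooster 1_Autobooster Unit_Unit_Autobooster Unit"))
# Autobooster 1_Autobooster Unit Unit_Autobooster Unit
```

Strings with two or fewer underscores come back unchanged, because no underscore has both an earlier and a later neighbour.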

Another thought:

1. StringSearcher: Split the source text into 3 parts.

  • Contains Regular Expression: ^(.*?_)(.*_.*)(_.*)$
  • Subexpression Matches List Name: _sub

2. StringReplacer: Replace every underscore in the middle part with space.

  • Attributes: _sub{1}.part
  • Mode: Replace Text
  • Text To Replace: _
  • Replacement Text: <space>

3. StringConcatenator etc.: Simply concatenate the three elements of "_sub{}.part" list.

@Value(_sub{0}.part)@Value(_sub{1}.part)@Value(_sub{2}.part)

The replacement and concatenation can also be performed with a single string expression.

@Value(_sub{0}.part)@ReplaceString(@Value(_sub{1}.part),_," ")@Value(_sub{2}.part)
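
The three-part split can also be sketched in Python, again only as an illustration under the assumption that Python's re module matches this pattern the way FME does. The function name split_and_replace and the variable names head, middle, and tail are hypothetical; they stand in for the three elements of the _sub{}.part list.

```python
import re

# Same pattern as in step 1: first word + underscore, middle part, last underscore + word.
THREE_PARTS = re.compile(r"^(.*?_)(.*_.*)(_.*)$")


def split_and_replace(text):
    """Split into three parts, replace underscores in the middle, and rejoin."""
    match = THREE_PARTS.match(text)
    if match is None:
        # The pattern needs at least three underscores; otherwise nothing changes.
        return text
    head, middle, tail = match.groups()
    return head + middle.replace("_", " ") + tail


print(split_and_replace("Anchor_FIELD_VERIFIED_DATE_NonDisplay"))
# Anchor_FIELD VERIFIED DATE_NonDisplay
```

The middle group must itself contain an underscore, so inputs with fewer than three underscores do not match and are returned as they are.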

Awesome! I learned something new today that will be valuable in the future. I used an AttributeManager with ReplaceRegEx to create a new attribute. Thanks @takashi. I have a better understanding of lookahead and lookbehind now.

 
