Extract string between two tokens inside a text in bash shell scripting




The other day, as a total newbye, I was writing a script in bash to process an HTML page, look for a URL inside it, and then download the content at that URL, recursively for a series of pages; I had it already done in AutoHotKey language, but since I don’t keep my Windows workstation always on, while my linux homeserver is, I decided I was better off learning a little of bash language to make the procedure more efficient.

In this script I need to extract a segment of string between two known “tokens”, in order to get the needed URL off the HTML page; in AutoHotKey I saved the string position inside the text for both of the tokens, and then did a precalculated substring operation between those two offsets, and was obviously looking to do the same in bash, as I was translating step-by-step from AutoHotKey.

With my big surprise, this is not possible, as the only operation to search for a string inside another string, actually searches only for the first occurrence of the first caracter of the key string; in other words, you can search for “my” inside “The mellow fur of my cat is brown”, but the result would be 4, and not the expected 18, because as said expr index "$string" $substring only matches the first character of substring, hence the first position of “m” inside the string.

Now, I didn’t really care to find the position of those tokens, I only cared about the string in the middle, so any other operation that did the trick was fine; alas, I realized it only after about a hour of wasting time, but in the end I came up with the solution, and I am publishing it here to save time to other wanderers.

It basically consists in stripping from the string everything that is before the first token (including first token), and then stripping, from the result of this first operation, everything that is after the second token (including the second token):

#variables declaration
text="longtexttoprocess"
firsttoken="something"
secondtoken="somethingelse"

#actual processing
middlestring=${text#*$firsttoken}
middlestring=${middlestring%%$secondtoken*}

This returns $middlestring as the string contained between the first occurrence of $firsttoken (single # symbol), and the first occurrence of $secondtoken (double %% symbol, indicating the farthest occurrence from the end of string).

  This article has been Digiproved




Leave a Reply

Your email address will not be published. Required fields are marked *