Using regular expression to merge database content into Rich Text format (RTF) template documents

Because Rich Text Format (RTF) documents are really text-based, using them as a base for merge documents in database applications sounds like a simple task. Tags inserted into such a document can be replaced with the contents of database fields to produce merged documents ready to be opened in Microsoft Word.

In practice, however, such a solution is not that straightforward. Why is that? Microsoft Word sometimes injects formatting codes in between the characters of a tag, to support word-splitting (hyphenation) suggestions or for other reasons. Those formatting codes cannot be seen in Microsoft Word, but a plain text editor shows why substituting the tag sometimes fails. For example, a plain merge tag like this:

Becomes like this:

A simple string replace that swaps the tag for the content of a database field will then fail. The solution is to use regular expression matching to find the tags and replace them with the database content.
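To make the failure concrete, here is a minimal sketch. The tag name `firstname` and both RTF fragments are made up for illustration; real Word output carries many more control words.

```php
<?php
// Hypothetical tag name and RTF fragments, for illustration only.
$plain = 'Dear <<firstname>>,';
$split = '{\f0 Dear <<}{\f0\b first}{\f0 name>>,}';  // formatting codes injected inside the tag

$plainResult = str_replace('<<firstname>>', 'Ole', $plain);
$splitResult = str_replace('<<firstname>>', 'Ole', $split);

echo $plainResult, PHP_EOL;  // the tag has been replaced
echo $splitResult, PHP_EOL;  // unchanged: the literal search string never occurs
```

The second call finds nothing to replace, because the characters of the tag are no longer adjacent in the file.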

Most development environments support regular expression matching, either natively or through third-party plugins. My PHP example below should be easy to implement in almost any language.

The tag construct is shown here:

Our regular expression pattern will have seven capturing groups. Capturing groups are bits of information we extract from a bigger match. Those groups are:

1. The formatting codes between the start of tag mark and the tag-word.
2. The tag-word itself.
3. The formatting codes between the tag-word and the end-of-tag mark, or formatting inserted into the tag.
4. The second part of the tag-word, if it is split by injected formatting.
5. The formatting codes before the default-value colon mark.
6. The default value to be applied.
7. The end of formatting, if split.

Regex patterns for each individual group

In the text file, most RTF formatting codes start with a backslash. Since the backslash also has a special meaning in regular expressions, we can improve the readability of the regex by substituting “\” with something that does not occur elsewhere in the file. We assume that a double euro sign “€€” is OK. Otherwise, we would have to escape the escape characters.
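The substitution and its reversal can be sketched as follows. The RTF fragment is a made-up stand-in for the template file, and the sketch assumes the file itself never contains “€€”.

```php
<?php
// Hypothetical RTF fragment; a real script would read it from the template file.
$file = '{\rtf1\ansi Dear <<}{\b first}{name>>,}';

// Swap every backslash for the placeholder before matching ...
$mergetext = str_replace('\\', '€€', $file);
echo $mergetext, PHP_EOL;

// ... and swap it back once the merge is done.
$restored = str_replace('€€', '\\', $mergetext);
var_dump($restored === $file);
```

For comparison, matching a literal backslash without the placeholder takes four backslashes in a single-quoted PHP pattern, for example `'/\\\\b/'` to match `\b`.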

First is the formatting codes pattern.

CAPTURE GROUP 1

The tag always starts with “<<”. Then comes a capture group with a non-capture group inside (the “(?:” signifies a non-capture group). The non-capture group has alternatives (separated by the | character): any sequence that starts with €€ followed by lowercase characters a-z or digits 0-9; an opening or closing curly brace; or a white space character. Those alternatives can be repeated zero or more times. The whole match goes into capture group 1. If there are no formatting codes, this group, which can also match zero times, will be empty.
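Written out from that description, the first part of the pattern might look like the sketch below. It is a reconstruction, not the original listing; the `u` modifier and the test strings are assumptions.

```php
<?php
// Group 1 as described above: "<<" followed by any run of formatting codes.
// The u modifier (treat pattern and subject as UTF-8) is an assumption of this sketch.
$regex1 = '/<<((?:€€[a-z0-9]*|\}|\{|\s)*)/u';

// Hypothetical test strings, with backslashes already replaced by €€.
preg_match($regex1, '<<}{€€b€€fs24 first}{name>>', $withCodes);
preg_match($regex1, '<<firstname>>', $withoutCodes);

var_dump($withCodes[1]);     // the injected codes, including the trailing space
var_dump($withoutCodes[1]);  // empty string: nothing between "<<" and the tag-word
```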

To test the Reg-Ex patterns, I used www.regexpal.com. Hovering over the reg-ex shows a tooltip text with an explanation of each individual character in the pattern.

With a test string like this:

CAPTURE GROUP 2

Next part is the TAG Word:

This construct is simpler. It is a capture group that contains any number of characters from a-z; digits 0-9; a period; the - and + signs; an underscore; the Norwegian letters æøåÆØÅ; and uppercase A-Z. We avoid spaces here so that Microsoft Word is not tempted to insert RTF codes between the words. Sometimes it does so anyway, but this keeps it to a minimum.
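A possible spelling of that character class is sketched here. The tag names are hypothetical, and the `u` modifier is my assumption; it requires the script and the subject to be UTF-8.

```php
<?php
// The tag-word class as described: letters, digits, period, minus, plus,
// underscore and the Norwegian letters. The u modifier is an assumption, used
// so that æøåÆØÅ are treated as characters rather than as separate bytes.
$word = '([a-z0-9.\-+_æøåÆØÅA-Z]*)';

preg_match('/<<' . $word . '/u', '<<customer_no.2>>', $m1);
preg_match('/<<' . $word . '/u', '<<først navn>>', $m2);

var_dump($m1[1]);  // "customer_no.2"
var_dump($m2[1]);  // "først": the match stops at the space
```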

CAPTURE GROUP 3

Next come more RTF formatting codes, which can be either between the word parts of the tag or at the end of the tag.

CAPTURE GROUP 4

And then a potential second part of the TAG Word:

CAPTURE GROUP 5

And potentially more formatting codes before our default value part.

CAPTURE GROUP 6

And then a potential default value starting with a colon (:).

This is a non-capture group with a capture group inside. It starts with any amount of white space, then a colon, then a capture group with any number of non-line-break characters (the dot). The question mark after the quantifier makes the match as short as possible, so we don’t eat the rest of the expression; as soon as what follows matches the rest of our regex, matching goes on with that. After that comes any number of white space characters. The whole group can occur zero or one time (the ? mark at the end). We don’t let this pattern repeat more than once, since the default value could contain a colon too, and we don’t want to miss that.
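The behaviour of the lazy dot and the optional group can be tried in isolation. In this sketch the tag-word is simplified to `[a-zA-Z]*`, and the tags and values are made up.

```php
<?php
// Default-value part as described, anchored between a simplified tag-word
// ([a-zA-Z]*, a stand-in for group 2) and the closing ">>".
$regex = '/<<([a-zA-Z]*)(?:\s*:(.*?)\s*)?>>/';

// Hypothetical tags for illustration.
preg_match($regex, '<<title:unknown >>', $a);
preg_match($regex, '<<time:10:30>>', $b);
preg_match($regex, '<<title>>', $c);

var_dump($a[2]);         // "unknown": the trailing space is left out of the capture
var_dump($b[2]);         // "10:30": a colon inside the default value survives
var_dump(isset($c[2]));  // false: the optional part did not take part in the match
```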

CAPTURE GROUP 7

The last part is more RTF formatting codes, and the tag closing (>>).

The Whole reg-ex shown here (From the PHP example):
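Assembled from the parts described above, a pattern with the same seven groups could look like the following. This is a reconstruction for illustration, not necessarily character for character what the GitHub source contains; the variable names `$fmt`, `$word` and `$def` and the `u` modifier are my own.

```php
<?php
// Reconstructed from the group descriptions; not the author's original listing.
$fmt  = '((?:€€[a-z0-9]*|\}|\{|\s)*)';   // groups 1, 3, 5 and 7
$word = '([a-z0-9.\-+_æøåÆØÅA-Z]*)';      // groups 2 and 4
$def  = '(?:\s*:(.*?)\s*)?';              // group 6 sits inside this part

$regex2 = '/<<' . $fmt . $word . $fmt . $word . $fmt . $def . $fmt . '>>/u';

// Hypothetical split tag with a default value, in placeholder notation.
preg_match($regex2, '<<}{€€b first}{name:unknown>>}', $m);
print_r($m);
```

Indexes 2 and 4 together give the tag-word `firstname`, and index 6 holds the default value.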

PHP Example

I have put the source on GitHub, so it can be downloaded for experimenting or implementing.

Head over to the following GitHub page to browse the code, or download it and try it out:
PHP_RTF_RegExMerge on GitHub

I explain each part of the code below. The full listing is after the parts. 

First, we just get the content of the RTF file into a variable named $file. Then we replace the backslash with the double euro sign as described above. Then we put our complete regex into a variable called $regex2.

Then we run our regex against the file content in $mergetext.

The result will be available as a multi-dimensional array in the variable $out.
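The steps so far can be sketched as below. The template text is inlined so the snippet runs on its own; the pattern is a reconstruction from the group descriptions, and `PREG_SET_ORDER` is my choice for this sketch, not necessarily what the original script uses.

```php
<?php
// Hypothetical template content; a real script would use
// $file = file_get_contents('template.rtf'); with its own file name.
$file = '{\f0 Dear <<}{\f0\b first}{\f0 name>>, you are our <<title:unknown>>.}';
$mergetext = str_replace('\\', '€€', $file);

// Pattern reconstructed from the group descriptions (an assumption of this sketch).
$fmt    = '((?:€€[a-z0-9]*|\}|\{|\s)*)';
$word   = '([a-z0-9.\-+_æøåÆØÅA-Z]*)';
$regex2 = '/<<' . $fmt . $word . $fmt . $word . $fmt . '(?:\s*:(.*?)\s*)?' . $fmt . '>>/u';

// With PREG_SET_ORDER, $out holds one sub-array per tag found:
// $out[0][0] is the whole match of the first tag, $out[0][1..7] are its groups.
$count = preg_match_all($regex2, $mergetext, $out, PREG_SET_ORDER);

echo $count, ' tags found', PHP_EOL;
foreach ($out as $match) {
    echo $match[0], ' => ', $match[2] . $match[4], PHP_EOL;
}
```

Without the flag, `preg_match_all` defaults to `PREG_PATTERN_ORDER`, where `$out[0]` is the list of all whole matches and `$out[2]` the list of all first tag-word parts.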

Then we loop through all the matches in the document:

The whole match is in index 0 of the array; we keep that to replace later. We also keep the start group and, of course, the tag itself.

We check whether there is a second tag part or a default value; if so, the end formatting is in group 5. If group 5 is empty, it can be in group 7 instead. If there is neither a second part nor a default value, the end formatting is in group 3.

Our second tag part, if any, will be in group 4. If it is non-empty, we just append it to our tag from group 2. The default value will always be in group 6. It can contain some formatting codes in addition to the value if the user, for example, changes the text color of the default text. This formatting is an important part of the default value.
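The group handling described above can be sketched as a small function. The function name `resolveTag`, the returned keys and the hand-written matches are hypothetical.

```php
<?php
// Sketch of the group handling described above. The function name and the
// returned keys are hypothetical; $m is one match with indexes 0 to 7.
function resolveTag(array $m): array
{
    $m       = array_pad($m, 8, '');   // trailing groups may be missing
    $tag     = $m[2] . $m[4];          // join the two halves of a split tag-word
    $default = $m[6];

    if ($m[4] !== '' || $default !== '') {
        // Second part or default value present: end formatting is in group 5,
        // or in group 7 when group 5 is empty.
        $end = $m[5] !== '' ? $m[5] : $m[7];
    } else {
        $end = $m[3];
    }

    return ['whole' => $m[0], 'start' => $m[1], 'tag' => $tag,
            'default' => $default, 'end' => $end];
}

// Hand-written matches for three hypothetical tags, in placeholder notation.
$split   = resolveTag(['<<}{€€b first}{name>>', '}{€€b ', 'first', '}{', 'name', '', '', '']);
$italic  = resolveTag(['<<}{€€i title}{>>', '}{€€i ', 'title', '}{', '', '', '', '']);
$withDef = resolveTag(['<<title:}{€€cf6 unknown}{>>', '', 'title', '', '', '', '}{€€cf6 unknown', '}{']);

print_r($split);
print_r($italic);
print_r($withDef);
```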

Normally you will pick values from the database, but we keep it simpler in this example and use a switch statement with the three tags we support. The last one is there to test the default value. The text we want to merge will then be in the variable $txt.

If it’s empty, we set the default value we got from capture group 6. It might be empty too.
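A minimal sketch of the lookup and the fallback, with a switch statement standing in for the database. The function name `lookupValue`, the tag names and the values are hypothetical.

```php
<?php
// Stand-in for a database lookup. Tag names and values are hypothetical.
function lookupValue(string $tag): string
{
    switch ($tag) {
        case 'firstname':
            return 'Ole';
        case 'address':
            return "Main Street 1\n0123 Oslo";
        case 'title':
            return '';    // deliberately empty, to exercise the default value
        default:
            return '';    // unknown tags are treated like empty fields
    }
}

$default = 'unknown';     // what capture group 6 would hold for <<title:unknown>>

$txt = lookupValue('title');
if ($txt === '') {
    $txt = $default;      // may itself be empty when the tag has no default
}
echo $txt, PHP_EOL;       // unknown
```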

Replace line breaks in multiline text with RTF-style line breaks. I assume here that the multiline text contains line feed characters (character 10, Unix style); if it comes from a Mac application, it will be character 13 instead.
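One way to cover both cases at once is sketched below. Handling the Windows pair of characters 13 and 10 as well is my addition, and the field value is made up.

```php
<?php
// Hypothetical multi-line field value with mixed line endings.
$txt = "Main Street 1\r\nSecond floor\n0123 Oslo\rNorway";

// Handle CRLF first, so a Windows line ending does not become two breaks.
// "€€line " is the RTF \line control word in the placeholder notation;
// use '\\line ' instead if the backslashes have not been substituted.
$txt = str_replace(["\r\n", "\n", "\r"], '€€line ', $txt);

echo $txt, PHP_EOL;
```

`str_replace` works through the search array in order, which is why the two-character sequence has to come first.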

Replace the whole match with the start formatting, the text, and the end formatting. That is the end of the loop; we continue with the next match.

We put the backslash characters back and save to a new file.
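The last steps can be sketched like this. The values stand for the state at the end of one pass through the loop and are made up; the temporary file is a stand-in for the real output path.

```php
<?php
// Hypothetical state at the end of one loop pass, in placeholder notation.
$mergetext = '{€€f0 Dear <<}{€€f0€€b first}{€€f0 name>>,}';
$whole = '<<}{€€f0€€b first}{€€f0 name>>';  // index 0 of the match
$start = '}{€€f0€€b ';                      // group 1
$end   = '';                                // end formatting chosen earlier
$txt   = 'Ole';

// Swap the whole tag for start formatting + text + end formatting.
$mergetext = str_replace($whole, $start . $txt . $end, $mergetext);

// After the loop: restore the backslashes and write a new file.
$result  = str_replace('€€', '\\', $mergetext);
$outFile = tempnam(sys_get_temp_dir(), 'rtf');  // stand-in for the real output path
file_put_contents($outFile, $result);

echo $result, PHP_EOL;
```

Because `str_replace` replaces every occurrence, an identical tag with identical injected formatting elsewhere in the document is filled in by the same call.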

That’s ALL 🙂

The complete code:

Let us try it out

This example document has three tags. The first is split (I have made half of it bold to force this). The second has formatting inserted between the start mark and the end mark of the tag (forced by putting the word itself in italics); this can also happen without doing anything of the kind. The third tag has a default value in red text. Here we go…

The source document:

 

We run the merge. You can see from the output how the example tags fall into the seven capturing groups described above.

 

And finally, the result:

If we change our script so that the tag has a value:

And run it again:

We see that the word “Developer” has taken the place of the default value.

Good luck

Ole Kristian Ek Hornnes
System Developer

2 thoughts on “Using regular expression to merge database content into Rich Text format (RTF) template documents”

  1. Daniel

    Nearly every practical programmer ought at some point to make a close and thorough study of regular expressions. Other programmers have given us tools for exploiting this powerful language of string-specification, so that the cost of learning is more than repaid in time saved not structuring complex conditional processes from the ground upward.
