My blog-notes about the translator's profession and, more generally, about the French language: its etymology, its literature, its translation, its expressions and plenty of other surprises.

Friday, 1 July 2011

How to get rid of plain text tags in your translation memory? (Trados Studio 2009 trick)

I am not used to writing in English, especially not on this blog. But I thought I should explain this trick in the most universal way possible for other Trados Studio 2009 users, because some of you may face the same problem. It sometimes happens, when you are using a translation memory (TM) provided by a customer, or even your own TM, that some translation units (TUs) display strange code like the following:
<cf size="12" complexscriptssize="12" bold="on" underlinestyle="single">Marketing & Sales</cf>
You don’t need to be an IT engineer to understand that the <cf> thing is the plain text display of the tags that CAT tools use to carry text formatting information. But why are those tags displayed as plain text instead of being recognised, as usual, as normal tags?
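To make the symptom concrete, here is a minimal sketch in Python of how such plain text tags could be spotted in a segment. The helper name and the pattern are my own illustration, not something Trados provides.

```python
import re

# Hypothetical helper: the name and the pattern are illustrative only.
# The pattern matches a literal opening tag (<cf ...>) or closing tag (</cf>).
PLAIN_TEXT_TAG = re.compile(r"</?cf\b[^>]*>")

def has_plain_text_tags(segment: str) -> bool:
    """Return True if the segment contains a literal <cf ...> or </cf> tag."""
    return PLAIN_TEXT_TAG.search(segment) is not None

sample = ('<cf size="12" complexscriptssize="12" bold="on" '
          'underlinestyle="single">Marketing & Sales</cf>')

print(has_plain_text_tags(sample))               # True
print(has_plain_text_tags("Marketing & Sales"))  # False
```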

The origin of the problem

This problem occurs when you import into your TM aligned txt files that weren’t generated with the right options. As you probably know, file alignment requires the WinAlign tool, which is part of the Trados 2007 software package (for the moment still shipped alongside the Trados Studio 2009 package). The txt file it generates can be imported into a Trados 2007 TM (tmw file extension). To import the alignment into a Trados Studio 2009 TM (sdltm file extension), on the other hand, you need to export it to a tmx file, since Studio amazingly doesn’t provide an alignment tool of its own and, even more amazingly, doesn’t support the import of bilingual txt files either (so one may wonder why the heck a so-called major upgrade of Trados has more limited import abilities than the previous version, if not to make Trados 2007 obsolete faster).
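If you want to know whether a tmx export is affected before importing it, a quick scan can help. The sketch below is only an illustration under two assumptions of mine: that the plain text tags sit in the <seg> elements as escaped text, and that the file follows the usual tmx layout of <tu>, <tuv> and <seg>. The sample content and the function name are invented.

```python
import re
import xml.etree.ElementTree as ET

# Invented two-unit sample; a real export would be read from a .tmx file instead.
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en-US" srclang="en-US" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en-US"><seg>&lt;cf bold="on"&gt;Marketing &amp; Sales&lt;/cf&gt;</seg></tuv>
      <tuv xml:lang="fr-FR"><seg>&lt;cf bold="on"&gt;Marketing et ventes&lt;/cf&gt;</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en-US"><seg>Contact us</seg></tuv>
      <tuv xml:lang="fr-FR"><seg>Contactez-nous</seg></tuv>
    </tu>
  </body>
</tmx>"""

# Literal opening or closing cf tag, as it appears once the XML is unescaped.
PLAIN_TEXT_TAG = re.compile(r"</?cf\b[^>]*>")

def count_tagged_units(tmx_text: str) -> int:
    """Count translation units in which any segment contains a plain text cf tag."""
    root = ET.fromstring(tmx_text)
    tagged = 0
    for tu in root.iter("tu"):
        segments = ["".join(seg.itertext()) for seg in tu.iter("seg")]
        if any(PLAIN_TEXT_TAG.search(text) for text in segments):
            tagged += 1
    return tagged

print(count_tagged_units(TMX_SAMPLE))  # 1
```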

The workaround

Anyhow, once in a while, you come across those plain text tags in your TUs. They are very annoying, since they prevent the detection of full matches and may complicate the concordance search used to look up words in the TUs (the well-known and essential F3 function). So how do you get rid of them? The process is rather tedious, but automated and quite efficient. It lets you delete all of them with a batch script rather than looking up and editing every single TU manually, which is humanly impossible with huge TMs. I tried it on a TM of 320,000 TUs and, believe me, it worked seamlessly.

Step 1: Select the TM

First of all, in Trados Studio 2009, go to the “Translation Memories” tab in the left pane. It is the lowest tab, located under the editor view. You don’t even need to open the TM you want to clean; just select it with a click. It goes without saying that you should first create a backup of your TM in case something goes wrong: simply copy the sdltm file into another folder using Windows Explorer.
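If you prefer to script the backup rather than copy the file by hand, something along these lines would do. It is a minimal sketch: the function name, the folder names and the timestamped file name are my own choices, and the demo runs on a throwaway placeholder file rather than a real TM.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path

def backup_tm(tm_path: Path, backup_dir: Path) -> Path:
    """Copy the TM file into backup_dir under a timestamped name and return the copy's path."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = backup_dir / f"{tm_path.stem}_{stamp}{tm_path.suffix}"
    shutil.copy2(tm_path, target)  # copy2 also preserves the file's timestamps
    return target

# Demo on a placeholder file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    tm = Path(tmp) / "customer.sdltm"
    tm.write_bytes(b"placeholder, not a real TM")
    copy = backup_tm(tm, Path(tmp) / "backup")
    print(copy.name)
```

Close Studio, or at least make sure the TM is not in use, before copying the file.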

Step 2: Open the “Batch Edit” window

In the left pane, right-click the TM you want to process. In the context menu, select the “Batch Edit…” option (NOT “Batch Delete…”).

Step 3: “Find and Replace Text”

The “Batch Edit” window appears. Click the “Add” button and select the “Find and Replace Text” option from the drop-down menu.


The “Find and Replace Text” window appears.

That’s where the complicated stuff begins. It needs some explanation so that you understand exactly what you are doing.
The point is to replace all tags with nothing, not even a space. This means you first have to find all the tags. There are two types of tags: opening and closing tags. To indicate that a text should be formatted in a particular way, it is preceded by an opening tag, which says which kind of formatting is needed (for example bold, italic, superscript, or a combination of several), and followed by a simple closing tag, which indicates where the special formatting should end.
The problem is that there are numerous different opening tags. Their only common point is that they start with “<cf” and end with the “>” sign. What comes in between varies and may be short or very long, depending on the formatting specifications. Therefore, a simple search and replace like the one you would do in MS Word is impossible. That’s why we need regular expressions. Regular expressions (often abbreviated regex or regexp) are a compact pattern language for describing the text you are searching for, supported by most programming languages and many text editors. They are very handy since they greatly extend the search possibilities, but they are also tricky to use. Long story short, here is what you should enter in the first field: <cf.*?>
The characters “<cf” are matched literally. The “<” sign has no special meaning in a regular expression, so it needs no escaping.
The dot matches any single character (except the line break character).
The asterisk tells the engine to attempt to match the preceding token zero or more times.
The question mark placed after the asterisk makes the asterisk “lazy”: it matches as few characters as possible instead of as many as possible.

In short, the combination of tokens “.*?” stands for “anything up to the next delimiter”. And the next delimiter, in this case, is the “>” sign. This last delimiter is very important. If it is missing, the pattern no longer covers the whole tag: the lazy “.*?” then matches nothing at all, while a plain “.*” would swallow everything up to the end of the line!
Do not wrap “<cf” or “>” in square brackets. Square brackets define a character class, i.e. “any one of the characters listed”, so [<cf] matches a single “<”, “c” or “f”, and a pattern such as [<cf].*?[>] can also delete ordinary text that precedes a tag.

Step 4: Insert the regex to delete opening tags in the source text

Follow the steps below:
  1. In the “Find what” field, type <cf.*?>
  2. Leave the “Replace with” field blank, since we don’t want to replace the tag with anything, just delete it.
  3. In the “Search in” option, select the “Source” radio button.
  4. Make sure to tick the “Use regular expression” box.
  5. Press the “OK” button.
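Before running a pattern on a real TM, I find it worth trying it on a sample segment first. The sketch below does that with Python’s re module, which is only a stand-in: I am assuming that Studio’s regex engine treats these basic constructs (literal characters, character classes, lazy quantifiers) the same way. It compares a plain <cf.*?> pattern with a variant that wraps the delimiters in square brackets. The sample segment is made up.

```python
import re

# Made-up segment: ordinary text followed by a plain text formatting tag.
segment = 'Office <cf bold="on">Marketing & Sales</cf>'

# Literal "<cf", then as few characters as possible, then ">".
literal = re.compile(r"<cf.*?>")

# Square brackets form character classes: ONE of "<", "c" or "f",
# then as few characters as possible, then ">".
bracketed = re.compile(r"[<cf].*?[>]")

print(literal.sub("", segment))    # Office Marketing & Sales</cf>
print(bracketed.sub("", segment))  # OMarketing & Sales
```

The bracketed variant starts matching at the first “f” of “Office”, so part of the real text disappears along with the tag.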

A new line now appears in the “Batch Edit” box.


Step 5: Insert the regex to delete opening tags in the target text

Follow the same procedure as above, but in step 3, select the “Target” radio button instead of “Source”. A second line will appear in the “Batch Edit” box.

Step 6: Insert the regex to delete closing tags in the source text

Fortunately, the closing tag is always the same, no matter what the opening tag is: </cf>.
There are several ways to delete it, since we do not strictly need a regular expression to do so. You could even use the “Find and Replace” option in the “Translation Memories” view of Trados Studio 2009. But since we are very clever people (aren’t we?) and think ahead to the next time we will have to perform this task, we will use the “Batch Edit” feature to delete both opening and closing tags, in both source and target text, all at once.

  1. In the “Batch Edit” box, click the “Add” button again and select “Find and Replace Text” once more.
  2. In the “Find what” field, type </cf> exactly as it appears, leave the “Replace with” field blank and select the “Source” radio button.
  3. Tick the “Use regular expression” box here too, even though it isn’t strictly necessary. But hey, aren’t we regex specialists by now? Do not put square brackets around the tag: they would turn it into a character class that matches single characters rather than the whole tag.
  4. Press the “OK” button.

Step 7: Insert the regex to delete closing tags in the target text

Perform the same steps as above, but select the “Target” radio button instead.
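Taken together, the four rules (opening and closing tags, source and target) amount to the following, sketched here in Python on a made-up translation unit. The function names are mine, and the sketch only imitates what the batch edit does; it does not touch an sdltm file.

```python
import re

OPENING_TAG = re.compile(r"<cf.*?>")  # "<cf", anything, up to the next ">"
CLOSING_TAG = re.compile(r"</cf>")    # the closing tag is always the same

def strip_cf_tags(text: str) -> str:
    """Delete opening tags, then closing tags, replacing them with nothing."""
    return CLOSING_TAG.sub("", OPENING_TAG.sub("", text))

def clean_unit(source: str, target: str):
    """Apply the same clean-up to both sides of a translation unit."""
    return strip_cf_tags(source), strip_cf_tags(target)

print(clean_unit('<cf bold="on">Marketing & Sales</cf>',
                 '<cf bold="on">Marketing et ventes</cf>'))
# ('Marketing & Sales', 'Marketing et ventes')
```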

Step 8: Save your batch edit script

You will probably have to perform this trick again in the future, for the same TM or a different one. Saving the script for later use costs very little disk space and time.
  1. In the “Batch Edit” window, press the “Save” button.
  2. In the “Save TM Batch Edit script” window, choose a folder for your script or keep the default folder shown; just remember where the script is saved.
  3. Type a name for your script, for example tags_deletion.
  4. Click the “Save” button. 
Next time, you will only need to press the “Load” button in the “Batch Edit” window to open your saved script.

Step 9: Run the batch edit script

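We’re almost there! Your “Batch Edit” window should now list four “Find and Replace Text” lines: opening tags in the source, opening tags in the target, closing tags in the source and closing tags in the target.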


Press the “Finish” button. A new window with a progress bar should appear. When the task is complete (it might take several minutes depending on the size of your TM), simply press the “Close” button and voilà!
Your TM is now cleared of all those nasty plain text tags.


2 comments:

  1. Thanks a lot! It was really useful and explanations were very clear.

  2. Thank you so much, with your great instruction I was able to solve my big problem.