Using EditPad Pro to convert a TXT subtitle file to XML

The original file I received looked like this. It is a typical subtitle file of the kind used by Adobe Encore.

1
00:00:03,030 --> 00:00:07,989
It brings me great pleasure to welcome to Stanford Jack Dorsey,

2
00:00:07,990 --> 00:00:11,419
who is as you know, the co-founder and chairman of the

3
00:00:11,420 --> 00:00:15,909
board of Twitter and the co-founder and CEO of Square.

 

To convert the file to a structured XML file for translation in SDL Studio, I ran the following regular expression (regex) search-and-replace steps in EditPad Pro.

1) Match the timecode and place it between <time> tags

2) Find the ID number (a line containing only digits)
^(\d+)$ replace with <seg>$1</seg>

3) Find the text line (any remaining line that starts with a word character)
^(\w.*)$ replace with <trans>$1</trans>

4) Place <xml> at the start of the file
\A replace with <xml>

5) Place </xml> at the end of the file
\z replace with </xml>
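
For readers who would rather script the conversion, the steps above can be sketched in Python. This is only a rough equivalent and not what EditPad Pro does internally. The `TIMECODE` pattern is an assumption (a standard `HH:MM:SS,mmm --> HH:MM:SS,mmm` line), the function name `srt_to_xml` is hypothetical, and the `escape` call is an extra safeguard that is not part of the steps above.

```python
import re
from xml.sax.saxutils import escape

# Assumed timecode pattern: "HH:MM:SS,mmm --> HH:MM:SS,mmm" on its own line.
TIMECODE = re.compile(
    r"^(\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3})$", re.M
)

def srt_to_xml(text: str) -> str:
    """Hypothetical helper mirroring the five search-and-replace steps."""
    # 1) Wrap each timecode line in <time> tags.
    text = TIMECODE.sub(r"<time>\1</time>", text)
    # 2) Wrap each digits-only line (the ID number) in <seg> tags.
    text = re.sub(r"^(\d+)$", r"<seg>\1</seg>", text, flags=re.M)
    # 3) Wrap each remaining line that starts with a word character in
    #    <trans> tags. Escaping &, < and > is an addition to the steps above.
    text = re.sub(
        r"^(\w.*)$",
        lambda m: "<trans>" + escape(m.group(1)) + "</trans>",
        text,
        flags=re.M,
    )
    # 4) and 5) Add the root element at the start and end of the file.
    return "<xml>\n" + text + "\n</xml>"

SAMPLE = """1
00:00:03,030 --> 00:00:07,989
It brings me great pleasure to welcome to Stanford Jack Dorsey,

2
00:00:07,990 --> 00:00:11,419
who is as you know, the co-founder and chairman of the
"""

if __name__ == "__main__":
    print(srt_to_xml(SAMPLE))
```

The order matters in the same way as in the editor: the text-line step relies on the timecode and ID lines already starting with `<`, so it has to run last. Like the regex in step 3, this sketch skips subtitle lines that begin with punctuation such as a quotation mark or a dash.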

The resulting file then looks like this:

<xml>

<seg>1</seg>
<time>00:00:03,030 --> 00:00:07,989</time>
<trans>It brings me great pleasure to welcome to Stanford Jack Dorsey,</trans>

<seg>2</seg>
<time>00:00:07,990 --> 00:00:11,419</time>
<trans>who is as you know, the co-founder and chairman of the</trans>

<seg>3</seg>
<time>00:00:11,420 --> 00:00:15,909</time>
<trans>board of Twitter and the co-founder and CEO of Square.</trans>

</xml>

A new XML file type can now be created in SDL Studio to extract the translatable text between the <trans> tags.
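
Before setting up the file type, it can help to confirm that the converted file is well-formed and that the <trans> elements contain what you expect. A minimal sketch using Python's standard `xml.etree.ElementTree` module follows. The function name `translatable_text` and the inline sample are illustrative only, and this check is separate from anything SDL Studio itself does.

```python
import xml.etree.ElementTree as ET

XML_SAMPLE = """<xml>

<seg>1</seg>
<time>00:00:03,030 --> 00:00:07,989</time>
<trans>It brings me great pleasure to welcome to Stanford Jack Dorsey,</trans>

<seg>2</seg>
<time>00:00:07,990 --> 00:00:11,419</time>
<trans>who is as you know, the co-founder and chairman of the</trans>

</xml>"""

def translatable_text(xml_text: str) -> list[str]:
    """Return the text of every <trans> element, in document order."""
    # ET.fromstring raises xml.etree.ElementTree.ParseError if the
    # input is not well-formed (for example, an unescaped & in a subtitle).
    root = ET.fromstring(xml_text)
    return [element.text or "" for element in root.iter("trans")]

if __name__ == "__main__":
    for line in translatable_text(XML_SAMPLE):
        print(line)
```

If parsing fails, the usual culprit is a subtitle line containing `&` or `<`, which has to be escaped before the file is imported.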
