Wednesday, 20 April 2011

Converting RTF to Plain text in SSIS


RichText formatted data can easily be converted to plain text using a regular expression in the script component in SSIS.

The regular expression to strip out rtf formatting is as follows:
({\\)(.+?)(})|(\\)(.+?)(\b)

The following steps outline the creation of an SSIS package to extract records from a database table to a delimited text file where the data in one of the columns is formatted as RTF and the output file requires plain text.

  1. Create a new SSIS data flow

  2. Add an OLE DB Source to access the records. The column containing the rtf data is assumed to be called body.

  3. Add a script component to the data flow and attach it to the output of the OLE DB Source.

  4. In the script component properties ensure the body field is selected as an input column

  5. In the script component properties add an output column body_converted, give it a datatype of DT_STR and a suitable length.

  6. Click the Edit Script button and then use the following code in the ProcessInputRow method.


    Regex regExObj = new Regex("({\\\\)(.+?)(})|(\\\\)(.+?)(\\b)");
    System.Text.UTF8Encoding enc = new System.Text.UTF8Encoding();
    String strBody = enc.GetString(Row.body.GetBlobData(0,int)Row.body.Length));
    Row.bodyconverted = regExObj.Replace(strBody, "").Replace("}", "");

  7. Add aflat file destination and map the columns from the script component to it ensuring you use the body_converted column and not the original body column.

  8. Run the SSIS pacakge.

You should now have a file which contains records with plain text instead of rtf.

The regular expression works on all scenarios I have tested so far, but if you come across tags where it misses then it should be a simple case of extending the expression to catch them.

4 comments:

  1. The above code in the ProcessInputRow method is giving error so could you please check whether you missed something in copying this code.
    Or do I need to declare any parameters in Visual C# editor.

    ReplyDelete
  2. Hi Phil

    Thanks for this piece of work. It has helped.

    Here are a few addition things I found when implementing it.

    There is a slight syntax error above:

    enc.GetString(Row.body.GetBlobData(0,int)Row.body.Length));

    should be

    enc.GetString(Row.body.GetBlobData(0,(int)Row.body.Length));

    You need to add using System.Text.RegularExpressions to the code in the Namespaces section.

    When writing to a Csv destination, set the text qualifier to ".

    This doesn't work with an input column of type DT_NTEXT, but does with DT_TEXT.

    In my example there were leading and trailing carriage returns in every output string. When you view the CSV file through Excel, you won't see a result. You need to look at CSV file in notepad.

    There are more examples of regex expressions that may work at http://stackoverflow.com/questions/188545/regular-expression-for-extracting-text-from-an-rtf-string

    Thanks - Adam Gilmore - Dimodelo Solutions

    ReplyDelete
  3. thanks phil it worked but it wasnt able to convert all. so i modified your code in vb.net.

    Imports System
    Imports System.Data
    Imports System.Math
    Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
    Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
    Imports System.Windows.Forms

    Public Class ScriptMain
    Inherits UserComponent

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    '
    ' Add your code here
    '
    Try

    Dim enc As New System.Text.UTF8Encoding()
    Dim RTFConvert As RichTextBox = New RichTextBox
    RTFConvert.Rtf = enc.GetString(Row.Body.GetBlobData(0, Convert.ToInt32(Row.Body.Length)))

    Row.converted.ResetBlobData()
    Row.converted.AddBlobData(System.Text.Encoding.UTF8.GetBytes(RTFConvert.Text))
    Catch E As Exception

    End Try
    End Sub


    End Class

    ReplyDelete
    Replies
    1. Hi phil and palzor,I have used this vb code and was able to convert RTF data,
      but I was able to convert only some 9000 rows of data,script component does not seem to work after that limit.Do we have any limitations to RichTextbox

      Thanks

      Delete