RichText formatted data can easily be converted to plain text using a regular expression in the script component in SSIS.
The regular expression to strip out rtf formatting is as follows:
({\\)(.+?)(})|(\\)(.+?)(\b)
The following steps outline the creation of an SSIS package to extract records from a database table to a delimited text file where the data in one of the columns is formatted as RTF and the output file requires plain text.
- Create a new SSIS data flow
- Add an OLE DB Source to access the records. The column containing the rtf data is assumed to be called body.
- Add a script component to the data flow and attach it to the output of the OLE DB Source.
- In the script component properties ensure the body field is selected as an input column
- In the script component properties add an output column body_converted, give it a datatype of DT_STR and a suitable length.
- Click the Edit Script button and then use the following code in the ProcessInputRow method.Regex regExObj = new Regex("({\\\\)(.+?)(})|(\\\\)(.+?)(\\b)");
System.Text.UTF8Encoding enc = new System.Text.UTF8Encoding();
String strBody = enc.GetString(Row.body.GetBlobData(0,int)Row.body.Length));
Row.bodyconverted = regExObj.Replace(strBody, "").Replace("}", ""); - Add aflat file destination and map the columns from the script component to it ensuring you use the body_converted column and not the original body column.
- Run the SSIS pacakge.
You should now have a file which contains records with plain text instead of rtf.
The regular expression works on all scenarios I have tested so far, but if you come across tags where it misses then it should be a simple case of extending the expression to catch them.
The above code in the ProcessInputRow method is giving error so could you please check whether you missed something in copying this code.
ReplyDeleteOr do I need to declare any parameters in Visual C# editor.
Hi Phil
ReplyDeleteThanks for this piece of work. It has helped.
Here are a few addition things I found when implementing it.
There is a slight syntax error above:
enc.GetString(Row.body.GetBlobData(0,int)Row.body.Length));
should be
enc.GetString(Row.body.GetBlobData(0,(int)Row.body.Length));
You need to add using System.Text.RegularExpressions to the code in the Namespaces section.
When writing to a Csv destination, set the text qualifier to ".
This doesn't work with an input column of type DT_NTEXT, but does with DT_TEXT.
In my example there were leading and trailing carriage returns in every output string. When you view the CSV file through Excel, you won't see a result. You need to look at CSV file in notepad.
There are more examples of regex expressions that may work at http://stackoverflow.com/questions/188545/regular-expression-for-extracting-text-from-an-rtf-string
Thanks - Adam Gilmore - Dimodelo Solutions
thanks phil it worked but it wasnt able to convert all. so i modified your code in vb.net.
ReplyDeleteImports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
Imports System.Windows.Forms
Public Class ScriptMain
Inherits UserComponent
Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
'
' Add your code here
'
Try
Dim enc As New System.Text.UTF8Encoding()
Dim RTFConvert As RichTextBox = New RichTextBox
RTFConvert.Rtf = enc.GetString(Row.Body.GetBlobData(0, Convert.ToInt32(Row.Body.Length)))
Row.converted.ResetBlobData()
Row.converted.AddBlobData(System.Text.Encoding.UTF8.GetBytes(RTFConvert.Text))
Catch E As Exception
End Try
End Sub
End Class
Hi phil and palzor,I have used this vb code and was able to convert RTF data,
Deletebut I was able to convert only some 9000 rows of data,script component does not seem to work after that limit.Do we have any limitations to RichTextbox
Thanks