Thursday, 28 April 2011

Converting RTF to Plain Text Using Regular Expressions in MS SQL

RTF tags can be stripped from a text column in a MS SQL database by using regular expressions thorugh SQL CLR.

Using Visual Studio create a new library project with the following code.


using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;
using System.Data.SqlTypes;

public class RegularExpressions
{
[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]

public static SqlBoolean RegExMatch(SqlString input, SqlString pattern)
{
if (input.IsNull || pattern.IsNull)
return SqlBoolean.False;

return Regex.IsMatch(input.Value, pattern.Value, RegexOptions.IgnoreCase);
}

[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlString RegExReplace(SqlString input, SqlString pattern, SqlString replacement)
{
if (input.IsNull || pattern.IsNull || replacement.IsNull)
return SqlString.Null;

return new SqlString(Regex.Replace(input.Value, pattern.Value, replacement.Value, RegexOptions.IgnoreCase));
}

}
This example shows two functions, one for pattern matchign and one for replacement. Only the replacement function will be used in the RTF conversion.

Create the assembly and register the CLR functions

Enable CLR

sp_configure 'clr enabled', 1
RECONFIGURE WITH OVERRIDE

Create the Assembly

CREATE ASSEMBLY TextFunctions FROM 'c:\dba\tools\lib\TextFunctions.dll'

Create SQL Functions to Reference the CLR Functions

CREATE FUNCTION RegExMatch( @Input NVARCHAR(512)
,@Pattern NVARCHAR(127)
)
RETURNS BIT

EXTERNAL NAME TextFunctions.RegularExpressions.RegExMatch
GO

CREATE FUNCTION RegExReplace( @Input NVARCHAR(512)
,@Pattern NVARCHAR(127)
,@Replacement NVARCHAR(512)
)
RETURNS NVARCHAR(512)

EXTERNAL NAME TextFunctions.RegularExpressions.RegExReplace
GO

Using the Functions to Convert RTF to Plain Text

The regular expression specified leaves a few unwanteds characters so these are stripped using the SQL replace function. Obviously the regular expression could be extended to do this.

SELECT
   ltrim(replace(replace(replace(
       dbo.RegExReplace(note_text,'({\\)(.+?)(})|(\\)(.+?)(\b)','')
      ,'}','')
      ,char(13),'')
      ,char(10),'')
      )
FROM
   dbo.tbl_note

2 comments:

  1. I made use of this when porting data from act to salesforce. Although I did need to strip an extra couple of characters.

    ReplyDelete
  2. This solution is A #1 .. Thanks

    ReplyDelete