Thursday, 28 April 2011

Converting RTF to Plain Text Using Regular Expressions in MS SQL

RTF tags can be stripped from a text column in a MS SQL database by using regular expressions thorugh SQL CLR.

Using Visual Studio create a new library project with the following code.


using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;
using System.Data.SqlTypes;

public class RegularExpressions
{
[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]

public static SqlBoolean RegExMatch(SqlString input, SqlString pattern)
{
if (input.IsNull || pattern.IsNull)
return SqlBoolean.False;

return Regex.IsMatch(input.Value, pattern.Value, RegexOptions.IgnoreCase);
}

[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlString RegExReplace(SqlString input, SqlString pattern, SqlString replacement)
{
if (input.IsNull || pattern.IsNull || replacement.IsNull)
return SqlString.Null;

return new SqlString(Regex.Replace(input.Value, pattern.Value, replacement.Value, RegexOptions.IgnoreCase));
}

}
This example shows two functions, one for pattern matchign and one for replacement. Only the replacement function will be used in the RTF conversion.

Create the assembly and register the CLR functions

Enable CLR

sp_configure 'clr enabled', 1
RECONFIGURE WITH OVERRIDE

Create the Assembly

CREATE ASSEMBLY TextFunctions FROM 'c:\dba\tools\lib\TextFunctions.dll'

Create SQL Functions to Reference the CLR Functions

CREATE FUNCTION RegExMatch( @Input NVARCHAR(512)
,@Pattern NVARCHAR(127)
)
RETURNS BIT

EXTERNAL NAME TextFunctions.RegularExpressions.RegExMatch
GO

CREATE FUNCTION RegExReplace( @Input NVARCHAR(512)
,@Pattern NVARCHAR(127)
,@Replacement NVARCHAR(512)
)
RETURNS NVARCHAR(512)

EXTERNAL NAME TextFunctions.RegularExpressions.RegExReplace
GO

Using the Functions to Convert RTF to Plain Text

The regular expression specified leaves a few unwanteds characters so these are stripped using the SQL replace function. Obviously the regular expression could be extended to do this.

SELECT
   ltrim(replace(replace(replace(
       dbo.RegExReplace(note_text,'({\\)(.+?)(})|(\\)(.+?)(\b)','')
      ,'}','')
      ,char(13),'')
      ,char(10),'')
      )
FROM
   dbo.tbl_note

Wednesday, 20 April 2011

Converting RTF to Plain text in SSIS


RichText formatted data can easily be converted to plain text using a regular expression in the script component in SSIS.

The regular expression to strip out rtf formatting is as follows:
({\\)(.+?)(})|(\\)(.+?)(\b)

The following steps outline the creation of an SSIS package to extract records from a database table to a delimited text file where the data in one of the columns is formatted as RTF and the output file requires plain text.

  1. Create a new SSIS data flow

  2. Add an OLE DB Source to access the records. The column containing the rtf data is assumed to be called body.

  3. Add a script component to the data flow and attach it to the output of the OLE DB Source.

  4. In the script component properties ensure the body field is selected as an input column

  5. In the script component properties add an output column body_converted, give it a datatype of DT_STR and a suitable length.

  6. Click the Edit Script button and then use the following code in the ProcessInputRow method.


    Regex regExObj = new Regex("({\\\\)(.+?)(})|(\\\\)(.+?)(\\b)");
    System.Text.UTF8Encoding enc = new System.Text.UTF8Encoding();
    String strBody = enc.GetString(Row.body.GetBlobData(0,int)Row.body.Length));
    Row.bodyconverted = regExObj.Replace(strBody, "").Replace("}", "");

  7. Add aflat file destination and map the columns from the script component to it ensuring you use the body_converted column and not the original body column.

  8. Run the SSIS pacakge.

You should now have a file which contains records with plain text instead of rtf.

The regular expression works on all scenarios I have tested so far, but if you come across tags where it misses then it should be a simple case of extending the expression to catch them.

Tuesday, 19 April 2011

Data Matching Algorithms

Researching data matching algorithms for use in an MDM project.

Thursday, 7 April 2011

Levenshtein Distance in SQL

For a description of Levenshtein distance see wikipedia article

CREATE Function lev_distance ( @source VARCHAR(100) ,@target VARCHAR(100) ) RETURNS int AS BEGIN DECLARE @distance varchar(30) ,@len_source int ,@len_target int ,@counter int DECLARE @matrix TABLE(x int, y int, value int) SET @len_source = (SELECT len(@source)) SET @len_target = (SELECT len(@target)) -- initialize the first row 0..len_source SET @counter = 0 WHILE @counter <= @len_source + 1 BEGIN INSERT INTO @matrix VALUES (@counter, 0, @counter) SET @counter = @counter + 1 END -- initialize the first column 0..len_target SET @counter = 1 WHILE @counter <= @len_target + 1 BEGIN INSERT INTO @matrix VALUES (0, @counter, @counter) SET @counter = @counter + 1 END -- evaluate character distance DECLARE @cost int ,@up int ,@left int ,@diag int ,@source_index int ,@target_index int SET @source_index = 1 SET @target_index = 1 WHILE @source_index <= @len_source + 1 BEGIN WHILE @target_index <= @len_target + 1 BEGIN SET @cost = (CASE WHEN SUBSTRING(@source, @source_index, 1) = SUBSTRING(@target, @target_index, 1) THEN 0 ELSE 1 END) SET @up = (SELECT [value] FROM @matrix WHERE [x] = @source_index AND [y] = @target_index - 1) + 1 SET @left = (SELECT [value] FROM @matrix WHERE [x] = @source_index - 1 AND [y] = @target_index) + 1 SET @diag = (SELECT [value] FROM @matrix WHERE [x] = @source_index - 1 AND [y] = @target_index - 1) + @cost SET @cost = (CASE WHEN (@up <= @left) AND (@up <= @diag) THEN @up WHEN (@left <= @up) AND (@left <= @diag) THEN @left WHEN (@diag <= @up) AND (@diag <= @left) THEN @diag END) INSERT INTO @matrix VALUES (@source_index, @target_index, @cost) SET @target_index = @target_index + 1 END SET @target_index = 1 SET @source_index = @source_index + 1 END -- get result SELECT @distance = value FROM @matrix WHERE x = len(@source) AND y = len(@target) RETURN @distance END GO

Data Management Related Standards


ISO/IEC 38500

Corporate governance of information technology standard, provides a framework for effective governance of IT to assist those at the highest level of organizations to understand and fulfill their legal, regulatory, and ethical obligations in respect of their organizations’ use of IT.


ISO8000 - Data quality.



AS4590 - Interchange of Client Information

Powershell SMTP

function Send-SMTPmail($to, $from, $subject, $body, $attachment, $cc, $bcc, $port, $timeout, $smtpserver, [switch] $html, [switch] $alert) {
if ($smtpserver -eq $null) {$smtpserver = "smtp.myserver.com"}
$mailer = new-object Net.Mail.SMTPclient($smtpserver)
if ($port -ne $null) {$mailer.port = $port}
if ($timeout -ne $null) {$mailer.timeout = $timeout}
$msg = new-object Net.Mail.MailMessage($from,$to,$subject,$body)
if ($html) {$msg.IsBodyHTML = $true}
if ($cc -ne $null) {$msg.cc.add($cc)}
if ($bcc -ne $null) {$msg.bcc.add($bcc)}
if ($alert) {$msg.Headers.Add("message-id", "<3bd50098e401463aa228377848493927-1>")}
if ($attachment -ne $null) {
$attachment = new-object Net.Mail.Attachment($attachment)
$msg.attachments.add($attachment)
}
$mailer.send($msg)
}

Database Deployment scripts in Powershell

To enable all database changes to be deployed to differenet environments;

  • firstly ensure each change is scripted in a seperate file.
  • create a build file and record the name of each change file within it.
  • use the following powershell script to read the build file and submit the change files one at a time.

    $server_name = read-host -Prompt "Servername"
    $db_name = read-host -Prompt "Database"

    # connect to server
    $server = new-object ("Microsoft.SqlServer.Management.Smo.Server") $server_name
    if ($windows_login -eq $false)
    {
    # prompt for connection details
    $server.ConnectionContext.LoginSecure=$false;
    $credential = Get-Credential
    $username = $credential.UserName -replace("\\","")
    $server.ConnectionContext.set_Login($username)
    $server.ConnectionContext.set_SecurePassword($credential.Password)
    }
    $db = $server.Databases[$db_name]

    # read the build list
    $build_list = (get-content C:\Projects\build-list.txt)

    foreach ($filename in $build_list)
    {
    write-host "running " $filename
    $stream = New-Object System.IO.StreamReader($filename)
    $script = $stream.ReadToEnd()

    $db.ExecuteNonQuery( $script, 1 )
    }
    write-host "build complete"
  • Documenting SQL Server Jobs Using Powershell

    The following script can be used to create wiki style documentation for all SQL server jobs.


    # Load SMO libraries
    [Reflection.Assembly]::Load("Microsoft.SqlServer.Smo, Version=10.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91")
    [Reflection.Assembly]::Load("Microsoft.SqlServer.SqlEnum, Version=10.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91")
    [Reflection.Assembly]::Load("Microsoft.SqlServer.SmoEnum, Version=9.0.242.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91")
    [Reflection.Assembly]::Load("Microsoft.SqlServer.ConnectionInfo, Version=10.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91")

    # define connection properties
    [string] $server_name = '(local)'
    [string] $project_name = $server_name
    [string] $doc_root = "C:\dba\db_docs\jobs\"
    [bool] $windows_login = $true


    ##############################################################
    # Main
    ##############################################################

    # connect to server
    $server = new-object ("Microsoft.SqlServer.Management.Smo.Server") $server_name

    if ($windows_login -eq $false)
    {
    # prompt for connection details
    $server.ConnectionContext.LoginSecure=$false;
    $credential = Get-Credential
    $username = $credential.UserName -replace("\\","")
    $server.ConnectionContext.set_Login($username)
    $server.ConnectionContext.set_SecurePassword($credential.Password)
    }

    # create database directories, $doc-root
    new-item -path $doc_root -name $project_name -type directory -force > $null
    $file_name = $doc_root + "\" + $project_name + "\" + $project_name + " jobs" + ".wiki"
    new-item $file_name -type file -force

    add-content $file_name " || Job || Description || Created || Updated || Enabled || Schedules || "

    foreach ($job in $server.jobserver.jobs)
    {
    $content = " | " + $job.name + " | " `
    + $job.description + " | " `
    + $job.datecreated + " | " `
    + $job.DateLastModified + " | " `
    + $job.IsEnabled+ " | "


    foreach($schedule in $job.JobSchedules)
    {
    $content = $content + $schedule.FrequencyTypes

    if($schedule.FrequencySubDayTypes -ne "once")
    {
    $content = $content + " every " `
    + $schedule.FrequencySubDayInterval + " " `
    + $schedule.FrequencySubDayTypes
    }
    }
    $content = $content + " | "
    add-content $file_name $content

    }

    Creating Database Documentation with Powershell

    Creating database documentation can be time consuming and difficult to maintain. Have you ever tried to persuade developers or DBAs of the benefits of documentation. They are quick to complain when there is no documentation on a system they are forced to maintain.

    The following powershell script provides a simple method of creating documentation for database objects that can then be uploaded to your wiki.

    1. Load the required SMO libraries
    #Load SMO libraries
    [Reflection.Assembly]::Load("Microsoft.SqlServer.Smo, Version=10.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91")
    [Reflection.Assembly]::Load("Microsoft.SqlServer.SqlEnum, Version=10.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91")
    [Reflection.Assembly]::Load("Microsoft.SqlServer.SmoEnum, Version=9.0.242.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91")
    [Reflection.Assembly]::Load("Microsoft.SqlServer.ConnectionInfo, Version=10.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91")

    2. Define database connection properties
    # define connection properties
    [string] $server_name = '(local)'
    [string] $db_name = 'master'
    [string] $project_name = "MyProject"
    [string] $doc_root = "C:\dba\db_docs\"
    [bool] $windows_login = $false

    3. I use winscp to publish the completed documentation to the wiki. The following function specifies the command line for this.
    function ftp-files
    {

    . "C:\Program Files\WinSCP3\winscp3.exe" /console /command "option batch on" "option confirm off" "open myProfile" "put C:\dba\db_docs\$project_name\*.* /wiki/db_doc/$project_name/" "close" "exit"

    }


    4. This function creates the top level database page. It can be extended to add any required database or server details.
    ##############################################################
    # doc-database
    ##############################################################
    function doc-database
    {
    param($db_dir)

    $file_name = $db_dir + "\" + $project_name + " Database " + ".wiki"
    new-item $file_name -type file -force

    add-content $file_name "h1. Overview"
    add-content $file_name "{excerpt}...{excerpt}"

    add-content $file_name "h1. Environment"
    add-content $file_name "|| Designation || Server || Database || OS || SQL ||"

    $string = "| Development | | " + $db_name + " | | |"
    add-content $file_name $string
    add-content $file_name "| Beta | | | | |"
    add-content $file_name "| Live | | | | |"

    add-content $file_name "h1. Database Objects"

    $string = "h2. [Tables|" + $project_name + " Tables]"
    add-content $file_name $string
    $string = "{children:depth=2|excerpt=true|page=" + $project_name + " Tables}"
    add-content $file_name $string

    $string = "h2. [Views|" + $project_name + " Views]"
    add-content $file_name $string
    $string = "{children:depth=2|excerpt=true|page=" + $project_name + " Views}"
    add-content $file_name $string

    $string = "h2. [Stored Procedures|" + $project_name + " Stored Procedures]"
    add-content $file_name $string
    $string = "{children:depth=2|excerpt=true|page=" + $project_name + " Stored Procedures}"
    add-content $file_name $string

    $string = "h2. [Functions|" + $project_name + " Functions]"
    add-content $file_name $string
    $string = "{children:depth=2|excerpt=true|page=" + $project_name + " Functions}"
    add-content $file_name $string

    $string = "h1. [Update Process|" + $project_name + " Update Process]"
    add-content $file_name $string

    add-content $file_name "h1. Attachments"
    add-content $file_name "{attachments}"
    }

    5. fucntion to extract the extended properties from stored procedure parameters.

    ##############################################################
    # doc-params
    ##############################################################
    function doc-params
    {
    param($file_name, $parameters)

    add-content $file_name "|| Parameter || Data Type || Default || Description ||"
    foreach ($param in $parameters)
    {
    $string = "| " + $param.name + " | " + $param.datatype
    if ($param.datatype.name -eq 'varchar')
    {
    $string = $string + "(" + $param.datatype.maximumlength + ")"
    }
    $string = $string + " | " + $param.default
    $string = $string + " | "
    $string = $string + $param.ExtendedProperties["MS_Description"].value
    $string = $string + " | "

    add-content $file_name $string
    }
    }

    6. Fucntion to document all indexes.
    ##############################################################
    # doc-indexes
    ##############################################################
    function doc-indexes
    {
    param($file_name, $table)

    add-content $file_name 'h2. Indexes'
    add-content $file_name '|| Key Name || Columns || Included || Clustered ||'

    foreach($index in $table.indexes)
    {
    $col_list = ""
    $inc_list = ""

    foreach($col in $index.indexedcolumns)
    {
    if($col.isincluded -eq "true")
    {
    $inc_list = $inc_list + $col.name + " "
    }
    else
    {
    $col_list = $col_list + $col.name + " "
    }
    }
    $string = "| "+ $index.name + "| " + $col_list +"| " + $inc_list + "| " + $index.isclustered + "|"
    add-content $file_name $string
    }
    }

    7. Function to document a table.
    ##############################################################
    # doc-table
    ##############################################################
    function doc-table
    {
    param($table_dir, $table)

    $file_name = $table_dir+ "\" + $project_name + " Table - " + $table.name + ".wiki"
    new-item $file_name -type file -force

    add-content $file_name "h1. $($table.name) "

    add-content $file_name "{excerpt}"
    add-content $file_name $table.ExtendedProperties["MS_Description"].value
    add-content $file_name "{excerpt}"

    add-content $file_name '|| Field || Type || Nullable || Key || Default || Description ||'

    foreach($column in $table.columns)
    {
    $column_text = "|" + $column.name + "|" + $column.datatype

    if ($column.datatype.name -eq 'varchar')
    {
    $column_text = $column_text + "(" + $column.datatype.maximumlength + ")"
    }

    $column_text = $column_text + " | " + $column.nullable + "|" + $column.inprimarykey `
    + "|" + $column.default + " | " `
    + $column.ExtendedProperties["MS_Description"].value `
    + " |"

    add-content $file_name $column_text
    }
    doc-indexes $file_name $table
    }


    8. Fucntion to document a view.
    ##############################################################
    # doc-view
    ##############################################################
    function doc-view
    {
    param($view_dir, $view)

    $file_name = $view_dir+ "\" + $project_name + " View - " + $view.name + ".wiki"
    new-item $file_name -type file -force

    add-content $file_name "h1. $($view.name) "
    add-content $file_name "{excerpt}"
    add-content $file_name $view.ExtendedProperties["MS_Description"].value
    add-content $file_name "{excerpt}"

    add-content $file_name '|| Field || Type || Description ||'

    foreach($column in $view.columns)
    {
    $column_text = "|" + $column.name + "|" + $column.datatype

    if ($column.datatype.name -eq 'varchar')
    {
    $column_text = $column_text + "(" + $column.datatype.maximumlength + ")"
    }

    $column_text = $column_text + " | " `
    + $column.ExtendedProperties["MS_Description"].value `
    + " |"
    add-content $file_name $column_text
    }

    add-content $file_name "{code}"
    add-content $file_name $view.TextHeader
    add-content $file_name $view.TextBody
    add-content $file_name "{code}"
    }

    9. Fucntion to document a stored procedure. When creating stored procedures I use a standard template and add extended properties. Both of which are used when constructing the documentation page.

    ##############################################################
    # doc-procedure
    ##############################################################
    function doc-procedure
    {
    param($proc_dir, $proc)

    $file_name = $proc_dir + "\" + $project_name + " Stored Procedure - " + $proc.name + ".wiki"
    new-item $file_name -type file -force

    # add description as excerpt
    add-content $file_name "h1. $($proc.name) "
    add-content $file_name "{excerpt}"
    add-content $file_name $proc.ExtendedProperties["MS_Description"].value
    add-content $file_name "{excerpt}"

    # add parameters
    add-content $file_name "h1. Parameters"
    doc-params $file_name $proc.parameters

    # return value
    add-content $file_name "h1. Return Value"
    add-content $file_name $proc.ExtendedProperties["RETURN_VALUE"].value

    # result set
    add-content $file_name "h1. Result Set"
    add-content $file_name $proc.ExtendedProperties["RESULT_SET"].value

    # add the code
    add-content $file_name "h1. Syntax"
    add-content $file_name "{code}"
    if($proc.IsEncrypted -eq $false)
    {
    if ($proc.TextHeader.IndexOf("Description:") -gt -1 `
    -and $proc.TextHeader.IndexOf("Parameters:") -gt -1 `
    -and $proc.TextHeader.IndexOf("Parameters:", $proc.TextHeader.IndexOf("Description:")) -gt -1 `
    )
    {
    add-content $file_name $proc.TextHeader.substring($proc.TextHeader.ToUpper().IndexOf("CREATE PROCEDURE") `
    ,$proc.TextHeader.length - $proc.TextHeader.ToUpper().IndexOf("CREATE PROCEDURE")).trim().trimend("AS").trimend("as")
    }
    else
    {
    add-content $file_name $proc.TextHeader.trim().trimend("AS").trimend("as")
    }
    }
    else
    {
    add-content $file_name "Procedure Encrypted"
    }
    add-content $file_name "{code}"

    }


    10. Fucntion to document a function.
    ##############################################################
    # doc-function
    ##############################################################
    function doc-function
    {
    param($func_dir, $func)

    $file_name = $func_dir + "\" + $project_name + " Function - " + $func.name + ".wiki"
    new-item $file_name -type file -force

    # add description as excerpt
    add-content $file_name "h1. $($func.name)"
    add-content $file_name "{excerpt}"
    add-content $file_name $proc.ExtendedProperties["MS_Description"].value
    add-content $file_name "{excerpt}"

    # add parameters
    add-content $file_name "h1. Parameters"
    doc-params $file_name $func.parameters

    # return value
    add-content $file_name "h1. Return Value"
    add-content $file_name $proc.ExtendedProperties["RETURN_VALUE"].value

    # add the code
    add-content $file_name "h1. Syntax"
    add-content $file_name "{code}"
    if($func.IsEncrypted -eq $false)
    {
    if ($func.TextHeader.IndexOf("Description:") -gt -1 `
    -and $func.TextHeader.IndexOf("Parameters:") -gt -1 `
    -and $func.TextHeader.IndexOf("Parameters:", $func.TextHeader.IndexOf("Description:")) -gt -1 `
    )
    {
    add-content $file_name $func.TextHeader.substring($func.TextHeader.ToUpper().IndexOf("CREATE FUNCTION") `
    ,$func.TextHeader.length - $func.TextHeader.ToUpper().IndexOf("CREATE FUNCTION")).trim().trimend("AS").trimend("as")
    }
    else
    {
    add-content $file_name $func.TextHeader.trim().trimend("AS").trimend("as")
    }
    }
    else
    {
    add-content $file_name "Function Encrypted"
    }
    add-content $file_name "{code}"

    }

    11. and this tiee it all together.
    ##############################################################
    # Main
    ##############################################################

    # connect to server
    $server = new-object ("Microsoft.SqlServer.Management.Smo.Server") $server_name

    if ($windows_login -eq $false)
    {
    # prompt for connection details
    $server.ConnectionContext.LoginSecure=$false;
    $credential = Get-Credential
    $username = $credential.UserName -replace("\\","")
    $server.ConnectionContext.set_Login($username)
    $server.ConnectionContext.set_SecurePassword($credential.Password)
    }

    $db = $server.Databases[$db_name]

    # create database directories, $doc-root
    new-item -path $doc_root -name $project_name -type directory -force

    $db_dir = $doc_root + $project_name
    $table_dir = $db_dir + "\Tables"
    $view_dir = $db_dir + "\Views"
    $proc_dir = $db_dir + "\Stored Procedures"
    $func_dir = $db_dir + "\Functions"

    new-item $table_dir -type directory -force
    new-item $proc_dir -type directory -force
    new-item $func_dir -type directory -force

    doc-database $db_dir

    ####################################
    # document all tables
    ####################################
    $file_name = $table_dir + "\" + $project_name + " Tables" + ".wiki"
    new-item $file_name -type file -force
    add-content $file_name "{children:excerpt=true|depth=3}"
    foreach ($table in $db.Tables)
    {
    doc-table $table_dir $table
    }

    ####################################
    # document all views
    ####################################
    $file_name = $view_dir + "\" + $project_name + " Views" + ".wiki"
    new-item $file_name -type file -force
    add-content $file_name "{children:excerpt=true|depth=3}"
    foreach ($view in $db.Views)
    {
    if($view.IsSystemObject -ne "true")
    {
    doc-view $view_dir $view
    }
    }


    ####################################
    # document all stored procedures
    ####################################
    $file_name = $proc_dir + "\" + $project_name + " Stored Procedures" + ".wiki"
    new-item $file_name -type file -force
    add-content $file_name "{children:excerpt=true|depth=3}"
    foreach ($proc in $db.StoredProcedures)
    {
    if($proc.IsSystemObject -ne "true")
    {
    doc-procedure $proc_dir $proc
    }
    }

    ####################################
    # document all functions
    ####################################
    $file_name = $func_dir + "\" + $project_name + " Functions" + ".wiki"
    new-item $file_name -type file -force
    add-content $file_name "{children:excerpt=true|depth=3}"
    foreach ($func in $db.UserDefinedFunctions)
    {
    if($func.IsSystemObject -ne "true")
    {
    doc-function $func_dir $func
    }
    }

    12. If you want to make use of winscp to copy files to the wiki then make a cal lto the ftp-files function.
    ftp-files