Description of TLTextDiff (Delphi version)

Component for text/string difference for Borland Delphi.

Dr. Andreas Lindae, January 2002.

 

 

Copyright Notice

Version 1.0

copyright by Dr. Andreas Lindae 2001/2002 www.lindae-software.com

all algorithms are invented and designed by Dr. Andreas Lindae

 

designed using Borland Delphi 5 Prof., using only standard VCL and pure Pascal code

the Kylix version has been designed using Borland Kylix 1.0.

 

All rights reserved.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER/S "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

 

General description

As input for the component you first have to assign a pair of Stringlists or a pair of strings as the two texts to compare. Considering for a little while the topic of building the difference between texts you will be very fast aware of the fact that mostly there will be thousands of possible results. You ask: Which part of the first text has been deleted? Which part of the second text has been added? One of the most trivial results will be to define the first text completely as deleted and the second text completely as added. Of course we don’t want such a kind of “result”. In this sense the component presents two completely different ways to get the needed answers and it provides also the possibility to mix them via recursion.

In this sense - for the evaluation of the difference there are the main options: The used algorithms and the recursion type explained here.

 

There are 2 algorithms: “sequential standard” and an mathematical optimization method based on the (in the “mathematical society”) wellknown "branch and bound" method.

 

Hint: You do not have to know or learn the theory behind the algorithms. Only take the used names to distinguish them. If you want to get a concept of what is going on with them it is sufficient to read this introduction.

 

Another hint: The component can compare lines or words or letters. In the following text we will most use the word “string” to describe one of these objects since in the pascal language the string datatype can be all of them (or be converted).

 

The sequential standard algorithm finds subsequent strings which are equal. Here the property "Seq_EqualMatches" sets the number of equal strings to appear before the program accepts them as equal. It is called “sequential” since the texts were analyzed from “begin to end” in a sequential manner.

 

The branch and bound is an optimization method with a better and mostly excellent result, but much slower than the other method. It optimizes the number of differences to a minimum or –by setting of the property BBWordLengthWeights to true- it can also set the length (instead of the number) of the differences to a minimum. It has been implemented using much memory in this version - therefore being a little bit faster than a memory optimized version. It is limited to 32000 strings in one of the texts to compare. I.e. it will be replaced by the sequential algorithm if one of the inputs texts has more than 32000 strings. By the way: You can force the program to replace it much earlier by setting the property bblimit to a non zero value. Please note that the default value of this property is 1000. This has been done since the branch and bound method will be dramatically slower for higher string counts.

 

The difference between the algorithms in short:

The sequential standard is the “cheap” and fast algorithm. It analyzes the texts from begin to end, taking only the text parts into account which have been read at the moment. The branch and bound is much slower and will be dramatically slower for bigger texts. But it optimizes the whole difference, taking all of the text into account at each time so providing a very much better result in most cases, i.e. “showing the real difference”.

To get rid of the disadvantages of both methods at least partly, the component has the built in capability to invoke them recursively as described below.

 

Recursion:

There is the possibility to invoke the algorithms recursively, e.g. you can take some Stringlists (lines) as input, evaluating the difference using the lines as the compared objects for the moment. After finishing this step the program can look for subsequent sections of deleted and added data. For each of these sections there can be computed the difference taking now the words as objects to compare. After finishing this step the program can do the same with the result and invoke an algorithm for each new deleted/added section using now the letters as objects to compare.

For each level you can select the used algorithm. E.g. you can take the sequential standard at the lines-level and the branch and bound at the words- and letters- levels.

Also you can omit the letters level, i.e. going down only to the words-level or begin at the words- or letters level or using only one level without recursion.

 

 

General Usage:

1.)      Set the properties in the object inspector
setting the input texts:
You can choose here between both as String or both as StringLists.
Use the property InputMode to select the input.
Chose the used algorithms and the Recursion type.

2.)      Call the Execute method

3.)      The Result is always the data in the TStringList property "ResultList"

 

Format of the Resultlist

The result list (read only property “ResultList”) has the following simple format:

Each String has one of the following characters as the first character:

"+" for a String which has been added

"-" for a String which has been deleted

"0" for a String occuring in both input texts

 

HTML

For a HTML representation use the method makeHTMLfromResultList.

 

 

 

Description of the public and published properties, variables and methods

The declaration code of the types and the public and published parts of the component looks like this:

 

type

TTextDiffProgressEvent=procedure(Sender: TObject; Progresspercent: integer) of object;

TAlgorithmtype=(Seq_SequentialStandard,BB_BranchAndBound);

TInputMode=(takeText, takeLines);

TRecursionType=(no_Recursion, downto_Words, downto_Letters);

TNoRecursionType=(by_Lines_Or_Words, by_Words_only, by_Letters_only);

 

type

TLTextDiff = class(TComponent)

private

public

bbreplacedbyCount, bbreplacedbyMemory: boolean;

constructor Create(AOwner: TComponent); override;

destructor Destroy; override;

procedure Execute;

procedure setbreak;

procedure clearResultList;

function DifferencesOccured: boolean;

function breaked: boolean;

procedure makeHTMLfromResultList(thefile: String; bshowTheFile: boolean);

procedure testFirstInHTML(thefile: String; bshowTheFile: boolean);

procedure testSecondInHTML(thefile: String; bshowTheFile: boolean);

property IsRunning: boolean (readonly);

property ResultList: TStringList (readonly);

 

published

property inputFirstText: String;

property inputSecondText: String;

property inputFirstLines: TStringList;

property inputSecondLines: TStringList;

property inputMode: TInputMode;

 

property algorithmforLines: TAlgorithmtype;

property algorithmforWords: TAlgorithmtype;

property algorithmforLetters: TAlgorithmtype;

property omitEmptyLines: boolean;

property leadingBlanksOut: boolean;

property BBWordLengthWeights: boolean;

property RecursionType: TRecursionType;

property NoRecursionType: TNoRecursionType;

 

property Seq_EqualMatches: integer default 2;

 

property HTMLdeletedcolor: TColor default clred;

property HTMLinsertedcolor: TColor default clgreen;

property HTMLdeletedcolorAsHTML: String;

property HTMLinsertedcolorAsHTML: String;

property HTMLdeletedstrikeout: boolean default true;

property HTMLinsertedunderline: boolean default true;

 

property inputAutoSwitched: default true;

property AntiFreeze: boolean default true;

property bblimit: integer default 1000;

property OnProgress: TTextDiffProgressEvent;

end;

 

 

In the following sections the items will be explained.

 

Public variables bbreplacedbyCount, bbreplacedbyMemory: boolean;

These variables are set to true by the execute method if the branch and bound method has been replaced one or more times during the difference evaluation by the standard sequential algorithm. This may be by memory problems or by exceeding the count of compared strings in the following cases:

 

Memory low: bbreplacedbyMemory=true

The number of strings in one of the texts exceeds 32000: bbreplacedbyCount=true

The number of strings in one of the texts exceeds the property bblimit: bbreplacedbyCount=true

 

Due to technical reasons the branch and bound will also be replaced if one of the texts has only one string. In this case the variable bbreplacedbyCount is not set to true.

 

 

Public procedure Execute;

Executes the difference evaluation with the properties you have set.

 

 

Public procedure setbreak;

aborts the execution. Then the function breaked will return true until you start the next execution.

Set the property AntiFreeze to true if you want to use this procedure.

Hint: setbreak does not clear the ResultList.

 

 

Public procedure clearResultList;

You can clear the internal ResultList if you do not need the result any more and if you want to save the memory.

 

 

Public function DifferencesOccured: boolean;

Returns true if the ResultList shows differences. The Resultlist will be analyzed at the moment you call the function.

 

 

Public function breaked: boolean;

Returns true if an external break occurred e.g. by using the setbreak procedure.

 

 

Public procedure makeHTMLfromResultList(thefile: String; bshowTheFile: boolean);

Produces a HTML representation and stores it into the file “thefile”. If bshowTheFile=true it will be opened by the shellapi function Shellexecute (Windows).

You can influence the HTML representation using different colors and/or strikeout/underline unsing the properties

HTMLdeletedcolor

HTMLdeletedcolorAsHTML

HTMLinsertedcolor

HTMLinsertedcolorAsHTML

HTMLdeletedstrikeout

HTMLinsertedunderline

 

 

Public procedure testFirstInHTML(thefile: String; bshowTheFile: boolean);

Produces a HTML representation of the first text on the basis of the ResultList for test purposes.

The file will be stored as “thefile” and shown via Shellexecute if bshowTheFile is set to true (Windows).

 

 

Public procedure testSecondInHTML(thefile: String; bshowTheFile: boolean);

Produces a HTML representation of the second text on the basis of the ResultList for test purposes.

The file will be stored as “thefile” and shown via Shellexecute if bshowTheFile is set to true (Windows).

 

 

Public property IsRunning: boolean (readonly);

Shows true if the Execute method is running.

 

 

Public property ResultList: TStringList (readonly);

Represents the computed difference after Execute has been running successfully.

It has the following simple format:

Each String has one of the following characters as the first character:

"+" for a String which has been added

"-" for a String which has been deleted

"0" for a String occuring in both input texts

 

 

Published property inputFirstText: String;

The first text as a String.

This String will be taken for difference computing together with the String inputSecondText if the property inputmode is set to “takeText”. Please note also the description of the property RecursionType.

 

 

 

Published property inputSecondText: String;

The second text as a String.

This String will be taken for difference computing together with the String inputFirstText if the property inputmode is set to “takeText”. Please note also the description of the property RecursionType.

 

 

 

Published property inputFirstLines: TStringList;

The first text as a StringList.

This StringList will be taken for difference computing together with the StringList inputSecondLines if the property inputmode is set to “takeLines”. Please note also the description of the property RecursionType.

 

 

 

Published property inputSecondLines: TStringList;

The second text as a StringList.

This StringList will be taken for difference computing together with the StringList inputFirstLines if the property inputmode is set to “takeLines”. Please note also the description of the property RecursionType.

 

 

 

Published property inputMode: TInputMode;

Is set to “takeText” or “takeLines”.

TakeText means that the strings inputFirstText and inputSecondText will be taken as input.

TakeLines means that the stringlists inputFirstLines and inputSecondlines will be taken as input.

Please note also the description of the property RecursionType.

See also:

property inputAutoSwitched

 

 

Published property algorithmforLines: TAlgorithmtype;

Is set to “Seq_SequentialStandard” or “BB_BranchAndBound” for one of the two algorithms.

Specifies the used algorithm for computing the difference on the lines-level.

 

 

 

Published property algorithmforWords: TAlgorithmtype;

Is set to “Seq_SequentialStandard” or “BB_BranchAndBound” for one of the two algorithms.

Specifies the used algorithm for computing the difference on the words-level.

 

 

 

Published property algorithmforLetters: TAlgorithmtype;

Is set to “Seq_SequentialStandard” or “BB_BranchAndBound” for one of the two algorithms.

Specifies the used algorithm for computing the difference on the letters-level.

 

 

 

Published property omitEmptyLines: boolean;

If the difference on the lines level has to be computed empty lines can be ignored using this property. If set to true two texts which differ only by some deleted or inserted empty lines are considered as equal.

 

 

 

Published property leadingBlanksOut: boolean;

If set to true and the difference on the lines level has to be computed leading blanks (blanks at the beginning of the lines) are deleted internally before the difference process starts.

 

 

 

Published property BBWordLengthWeights: boolean;

If the branch and bound algorithm is used with this property set to true it will optimize the length of the difference to minimum instead of simply the number of differences.

 

 

 

Published property RecursionType: TRecursionType;

Is set to one of the values no_Recursion, downto_Words, downto_Letters.

no_Recursion means that no recursion is invoked. In this case the property NoRecursionType determines the mode of the process.

downto_Words means that there is a recursion from lines à words if lines have been selected as input. If the string input has been selected then there is no recursion.

downto_Letters means that there is a recursion from lines à words à letters if lines have been selected as input. If the string input has been selected then the process goes from words à letters.

 

 

 

Published property NoRecursionType: TNoRecursionType;

If no_Recursion has been selected for the property RecursionType you can select here the mode of the process:

by_Lines_Or_Words: Depending on the inputmode (lines or text) the program uses the lines as string to compare or words as strings to compare

by_Words_only: The program uses words as strings to compare in each case

by_Letters_only: The program uses letters as strings to compare in each case

 

 

 

Published property Seq_EqualMatches: integer default 2;

Determines how many string have to be equal until the sequential standard algorithm accepts the current section as equal.

 

 

 

Published property HTMLdeletedcolor: TColor default clred;

Used in the procedure makeHTMLfromResultList as the color for the deleted parts of the texts (i.e. strings found in the first text, but not found in the second text).

 

See also:

HTMLdeletedcolorAsHTML

HTMLinsertedcolor

HTMLinsertedcolorAsHTML

HTMLdeletedstrikeout

HTMLinsertedunderline

 

 

 

Published property HTMLinsertedcolor: TColor default clgreen;

Used in the procedure makeHTMLfromResultList as the color for the inserted parts of the texts (i.e. strings found in the second text but not found in the first text).

 

See also:

HTMLdeletedcolor

HTMLdeletedcolorAsHTML

HTMLinsertedcolorAsHTML

HTMLdeletedstrikeout

HTMLinsertedunderline

 

 

 

Published property HTMLdeletedcolorAsHTML: String;

Used in the procedure makeHTMLfromResultList.

With this property you can set the color for the deleted parts of the texts in HTML format (HTML color constants or another valid HTML color string).

 

See also:

HTMLdeletedcolor

HTMLinsertedcolor

HTMLinsertedcolorAsHTML

HTMLdeletedstrikeout

HTMLinsertedunderline

 

 

 

Published property HTMLinsertedcolorAsHTML: String;

Used in the procedure makeHTMLfromResultList.

With this property you can set the color for the inserted parts of the texts in HTML format (HTML color constants or another valid HTML color string).

 

See also:

HTMLdeletedcolor

HTMLdeletedcolorAsHTML

HTMLinsertedcolor

HTMLdeletedstrikeout

HTMLinsertedunderline

 

 

 

Published property HTMLdeletedstrikeout: boolean default true;

Used in the procedure makeHTMLfromResultList.

With this property set to true the deleted parts of the text appear striked out.

 

See also:

HTMLdeletedcolor

HTMLdeletedcolorAsHTML

HTMLinsertedcolor

HTMLinsertedcolorAsHTML

HTMLinsertedunderline

 

 

 

Published property HTMLinsertedunderline: boolean default true;

Used in the procedure makeHTMLfromResultList.

With this property set to true the inserted parts of the text appear underlined.

 

See also:

HTMLdeletedcolor

HTMLdeletedcolorAsHTML

HTMLinsertedcolor

HTMLinsertedcolorAsHTML

HTMLdeletedstrikeout

 

 

 

Published property inputAutoSwitched: default true;

If set to true the inputMode property is set automatically if you set one of the input properties

InputFirstText, inputSecondText, inputFirstLines, inputSecondLines.

 

See also

Property inputMode

 

 

 

Published property AntiFreeze: boolean default true;

If set to true at some point of the algorithms the application.processmessages command is invoked. So and external break may be done, e.g. by setbreak.

 

 

Published property bblimit: integer default 1000;

Sets a limit for the use of the branch and bound. It is related to the number of strings in one of the input texts/lines. If this value or the value 32000 is exceeded by one of the input texts/lines the sequential standard algorithm will be invoked instead.

See also the public variable bbreplacedbyCount.

 

 

Published property OnProgress: TTextDiffProgressEvent;

TTextDiffProgressEvent=procedure(Sender: TObject; Progresspercent: integer) of object;

This event property enables you e.g. to control a progressbar. Progresspercent means the Progress in percent. It does not measure exactly the progress since this is impossible in recursive processes. For these processes the progress is divided into 2 parts representing the first pass and the recursion pass.

 

--------------------