Component
for text/string difference for Borland Delphi.
Dr.
Andreas Lindae, January 2002.
Version
1.0
copyright
by Dr. Andreas Lindae 2001/2002 www.lindae-software.com
all
algorithms are invented and designed by Dr. Andreas Lindae
designed
using Borland Delphi 5 Prof., using only standard VCL and pure Pascal code
the
Kylix version has been designed using Borland Kylix 1.0.
All
rights reserved.
THIS
SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER/S "AS IS" AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
As
input for the component you first have to assign a pair of Stringlists or a
pair of strings as the two texts to compare. Considering for a little while the
topic of building the difference between texts you will be very fast aware of
the fact that mostly there will be thousands of possible results. You ask:
Which part of the first text has been deleted? Which part of the second text
has been added? One of the most trivial results will be to define the first
text completely as deleted and the second text completely as added. Of course
we don’t want such a kind of “result”. In this sense the component presents two
completely different ways to get the needed answers and it provides also the
possibility to mix them via recursion.
In
this sense - for the evaluation of the difference there are the main options:
The used algorithms and the recursion type explained here.
There
are 2 algorithms: “sequential standard”
and an mathematical optimization method based on the (in the “mathematical
society”) wellknown "branch and bound"
method.
Hint: You do not have to know or learn the theory behind
the algorithms. Only take the used names to distinguish them. If you want to
get a concept of what is going on with them it is sufficient to read this
introduction.
Another
hint: The component can compare
lines or words or letters. In the following text we will most use the word
“string” to describe one of these objects since in the pascal language the
string datatype can be all of them (or be converted).
The sequential standard algorithm finds
subsequent strings which are equal. Here the property "Seq_EqualMatches" sets the number of equal strings to appear
before the program accepts them as equal. It is called “sequential” since the
texts were analyzed from “begin to end” in a sequential manner.
The branch and bound is an optimization method
with a better and mostly excellent result, but much slower than the other
method. It optimizes the number of differences to a minimum or –by setting of
the property BBWordLengthWeights to true- it can also set the length (instead of the number) of the
differences to a minimum. It has been implemented using much memory in this
version - therefore being a little bit faster than a memory optimized version.
It is limited to 32000 strings in one of the texts to compare. I.e. it will be
replaced by the sequential algorithm if one of the inputs texts has more than
32000 strings. By the way: You can force the program to replace it much earlier
by setting the property bblimit to
a non zero value. Please note that the default value of this property is 1000.
This has been done since the branch and bound method will be dramatically
slower for higher string counts.
The difference between the algorithms in short:
The
sequential standard is the “cheap” and fast algorithm. It analyzes the texts
from begin to end, taking only the text parts into account which have been read
at the moment. The branch and bound is much slower and will be dramatically
slower for bigger texts. But it optimizes the whole difference, taking all of
the text into account at each time so providing a very much better result in
most cases, i.e. “showing the real difference”.
To
get rid of the disadvantages of both methods at least partly, the component has
the built in capability to invoke them recursively as described below.
Recursion:
There
is the possibility to invoke the algorithms recursively,
e.g. you can take some Stringlists (lines) as input, evaluating the difference
using the lines as the compared
objects for the moment. After finishing this step the program can look for
subsequent sections of deleted and added data. For each of these sections there
can be computed the difference taking now the words
as objects to compare. After finishing this step the program can do the same
with the result and invoke an algorithm for each new deleted/added section
using now the letters as objects
to compare.
For each level you can select the used algorithm. E.g. you can take the sequential standard at the lines-level and the branch and bound at the words- and letters- levels.
Also
you can omit the letters level, i.e. going down only to the words-level or
begin at the words- or letters level or using only one level without recursion.
General Usage:
1.)
Set
the properties in the object inspector
setting the input texts:
You can choose here between both as String or both as StringLists.
Use the property InputMode to select the input.
Chose the used algorithms and the Recursion type.
2.)
Call
the Execute method
3.)
The
Result is always the data in the TStringList property "ResultList"
The
result list (read only property “ResultList”) has the following simple format:
Each
String has one of the following characters as the first character:
"+"
for a String which has been added
"-"
for a String which has been deleted
"0"
for a String occuring in both input texts
For
a HTML representation use the method makeHTMLfromResultList.
The
declaration code of the types and the public and published parts of the
component looks like this:
type
TTextDiffProgressEvent=procedure(Sender:
TObject; Progresspercent: integer) of object;
TAlgorithmtype=(Seq_SequentialStandard,BB_BranchAndBound);
TInputMode=(takeText, takeLines);
TRecursionType=(no_Recursion, downto_Words,
downto_Letters);
TNoRecursionType=(by_Lines_Or_Words, by_Words_only,
by_Letters_only);
type
TLTextDiff = class(TComponent)
private
…
public
bbreplacedbyCount, bbreplacedbyMemory: boolean;
constructor Create(AOwner:
TComponent); override;
destructor Destroy; override;
procedure Execute;
procedure setbreak;
procedure clearResultList;
function DifferencesOccured:
boolean;
function breaked: boolean;
procedure
makeHTMLfromResultList(thefile: String; bshowTheFile: boolean);
procedure
testFirstInHTML(thefile: String; bshowTheFile: boolean);
procedure
testSecondInHTML(thefile: String; bshowTheFile: boolean);
property IsRunning: boolean
(readonly);
property ResultList:
TStringList (readonly);
published
property inputFirstText:
String;
property inputSecondText:
String;
property inputFirstLines:
TStringList;
property inputSecondLines:
TStringList;
property inputMode:
TInputMode;
property algorithmforLines:
TAlgorithmtype;
property algorithmforWords:
TAlgorithmtype;
property algorithmforLetters:
TAlgorithmtype;
property omitEmptyLines:
boolean;
property leadingBlanksOut:
boolean;
property BBWordLengthWeights:
boolean;
property RecursionType:
TRecursionType;
property NoRecursionType:
TNoRecursionType;
property Seq_EqualMatches:
integer default 2;
property HTMLdeletedcolor:
TColor default clred;
property HTMLinsertedcolor:
TColor default clgreen;
property
HTMLdeletedcolorAsHTML: String;
property
HTMLinsertedcolorAsHTML: String;
property
HTMLdeletedstrikeout: boolean default true;
property
HTMLinsertedunderline: boolean default true;
property inputAutoSwitched:
default true;
property AntiFreeze: boolean
default true;
property bblimit: integer
default 1000;
property OnProgress:
TTextDiffProgressEvent;
end;
In the
following sections the items will be explained.
These variables are set to true by the execute method if the branch and bound method has been replaced one or more times during the difference evaluation by the standard sequential algorithm. This may be by memory problems or by exceeding the count of compared strings in the following cases:
Memory
low: bbreplacedbyMemory=true
The
number of strings in one of the texts exceeds 32000: bbreplacedbyCount=true
The
number of strings in one of the texts exceeds the property bblimit: bbreplacedbyCount=true
Due
to technical reasons the branch and bound will also be replaced if one of the
texts has only one string. In this case the variable bbreplacedbyCount is not set to true.
Executes
the difference evaluation with the properties you have set.
aborts
the execution. Then the function breaked
will return true until you start the next execution.
Set
the property AntiFreeze to
true if you want to use this procedure.
Hint:
setbreak
does not clear the ResultList.
You
can clear the internal ResultList if you do not need the result any more and if
you want to save the memory.
Returns
true if the ResultList shows differences. The Resultlist will be analyzed at
the moment you call the function.
Returns
true if an external break occurred e.g. by using the setbreak procedure.
Produces
a HTML representation and stores it into the file “thefile”. If
bshowTheFile=true it will be opened by the shellapi function Shellexecute
(Windows).
You can influence the HTML representation using different colors and/or strikeout/underline unsing the properties
HTMLdeletedcolor
HTMLdeletedcolorAsHTML
HTMLinsertedcolor
HTMLinsertedcolorAsHTML
HTMLdeletedstrikeout
HTMLinsertedunderline
Produces
a HTML representation of the first text on the basis of the ResultList for test
purposes.
The
file will be stored as “thefile” and shown via Shellexecute if bshowTheFile is
set to true (Windows).
Produces
a HTML representation of the second text on the basis of the ResultList for
test purposes.
The
file will be stored as “thefile” and shown via Shellexecute if bshowTheFile is
set to true (Windows).
Shows
true if the Execute
method is running.
Represents
the computed difference after Execute has
been running successfully.
It
has the following simple format:
Each
String has one of the following characters as the first character:
"+"
for a String which has been added
"-"
for a String which has been deleted
"0"
for a String occuring in both input texts
The
first text as a String.
This
String will be taken for difference computing together with the String inputSecondText if the property inputmode is set to “takeText”. Please note also the
description of the property RecursionType.
The second
text as a String.
This
String will be taken for difference computing together with the String inputFirstText if the property inputmode is set to “takeText”. Please note also the
description of the property RecursionType.
The
first text as a StringList.
This
StringList will be taken for difference computing together with the StringList inputSecondLines if the property inputmode is set to “takeLines”. Please note also the
description of the property RecursionType.
The
second text as a StringList.
This
StringList will be taken for difference computing together with the StringList inputFirstLines if the property inputmode is set to “takeLines”. Please note also the
description of the property RecursionType.
Is
set to “takeText” or “takeLines”.
TakeText
means that the strings inputFirstText and inputSecondText will be taken as input.
TakeLines
means that the stringlists inputFirstLines and inputSecondlines will be taken as input.
Please
note also the description of the property RecursionType.
See
also:
property
inputAutoSwitched
Is
set to “Seq_SequentialStandard” or “BB_BranchAndBound” for one of the two algorithms.
Specifies
the used algorithm for computing the difference on the lines-level.
Is
set to “Seq_SequentialStandard” or “BB_BranchAndBound” for one of the two algorithms.
Specifies
the used algorithm for computing the difference on the words-level.
Is
set to “Seq_SequentialStandard” or “BB_BranchAndBound” for one of the two algorithms.
Specifies
the used algorithm for computing the difference on the letters-level.
If
the difference on the lines level has to be computed empty lines can be ignored
using this property. If set to true two texts which differ only by some deleted
or inserted empty lines are considered as equal.
If
set to true and the difference on the lines level has to be computed leading
blanks (blanks at the beginning of the lines) are deleted internally before the
difference process starts.
If
the branch and bound algorithm is used with this property set to true it will
optimize the length of the difference to minimum instead of simply the number
of differences.
Is
set to one of the values no_Recursion, downto_Words, downto_Letters.
no_Recursion means that no recursion is invoked. In this case the property NoRecursionType determines the mode of the process.
downto_Words means that there is a recursion from lines
à words if
lines have been selected as input. If the string input has been selected then
there is no recursion.
downto_Letters means that there is a recursion from lines
à words à letters if
lines have been selected as input. If the string input has been selected then
the process goes from words à letters.
If no_Recursion has been selected for the property RecursionType you can select here the mode of the process:
by_Lines_Or_Words: Depending on the inputmode (lines or text) the program uses the lines
as string to compare or words as strings to compare
by_Words_only: The program uses words as strings to compare in each case
by_Letters_only: The program uses letters as strings to compare in each case
Determines
how many string have to be equal until the sequential standard algorithm
accepts the current section as equal.
Used
in the procedure makeHTMLfromResultList as the color for the deleted parts of the texts (i.e. strings found in
the first text, but not found in the second text).
See also:
HTMLdeletedcolorAsHTML
HTMLinsertedcolor
HTMLinsertedcolorAsHTML
HTMLdeletedstrikeout
HTMLinsertedunderline
Used
in the procedure makeHTMLfromResultList as the color for the inserted parts of the texts (i.e. strings found in
the second text but not found in the first text).
See also:
HTMLdeletedcolor
HTMLdeletedcolorAsHTML
HTMLinsertedcolorAsHTML
HTMLdeletedstrikeout
HTMLinsertedunderline
Used
in the procedure makeHTMLfromResultList.
With
this property you can set the color for the deleted parts of the texts in HTML
format (HTML color constants or another valid HTML color string).
See also:
HTMLdeletedcolor
HTMLinsertedcolor
HTMLinsertedcolorAsHTML
HTMLdeletedstrikeout
HTMLinsertedunderline
Used
in the procedure makeHTMLfromResultList.
With
this property you can set the color for the inserted parts of the texts in HTML
format (HTML color constants or another valid HTML color string).
See also:
HTMLdeletedcolor
HTMLdeletedcolorAsHTML
HTMLdeletedstrikeout
HTMLinsertedunderline
Used
in the procedure makeHTMLfromResultList.
With
this property set to true the deleted parts of the text appear striked out.
See
also:
HTMLdeletedcolor
HTMLdeletedcolorAsHTML
HTMLinsertedcolor
HTMLinsertedcolorAsHTML
HTMLinsertedunderline
Used
in the procedure makeHTMLfromResultList.
With
this property set to true the inserted parts of the text appear underlined.
See
also:
HTMLdeletedcolor
HTMLdeletedcolorAsHTML
HTMLinsertedcolor
HTMLinsertedcolorAsHTML
HTMLdeletedstrikeout
If
set to true the inputMode property is set automatically if you set one of the
input properties
InputFirstText, inputSecondText, inputFirstLines,
inputSecondLines.
See
also
Property inputMode
If
set to true at some point of the algorithms the application.processmessages
command is invoked. So and external break may be done, e.g. by setbreak.
Sets
a limit for the use of the branch and bound. It is related to the number of
strings in one of the input texts/lines. If this value or the value 32000 is
exceeded by one of the input texts/lines the sequential standard algorithm will
be invoked instead.
See
also the public variable bbreplacedbyCount.
TTextDiffProgressEvent=procedure(Sender:
TObject; Progresspercent: integer) of object;
This
event property enables you e.g. to control a progressbar. Progresspercent means
the Progress in percent. It does not measure exactly the progress since this is
impossible in recursive processes. For these processes the progress is divided
into 2 parts representing the first pass and the recursion pass.
--------------------