"difflib" --- Helpers for computing deltas
******************************************

**Source code:** Lib/difflib.py

======================================================================

This module provides classes and functions for comparing sequences. It
can be used, for example, to compare files, and can produce
information about file differences in various formats, including HTML
and context and unified diffs. For comparing directories and files,
see also the "filecmp" module.

class difflib.SequenceMatcher

   This is a flexible class for comparing pairs of sequences of any
   type, so long as the sequence elements are *hashable*. The basic
   algorithm predates, and is a little fancier than, an algorithm
   published in the late 1980s by Ratcliff and Obershelp under the
   hyperbolic name *gestalt pattern matching*. The idea is to find the
   longest contiguous matching subsequence that contains no "junk"
   elements; these "junk" elements are ones that are uninteresting in
   some sense, such as blank lines or whitespace. (Handling junk is an
   extension to the Ratcliff and Obershelp algorithm.) The same idea
   is then applied recursively to the pieces of the sequences to the
   left and to the right of the matching subsequence. This does not
   yield minimal edit sequences, but does tend to yield matches that
   "look right" to people.

   **Timing:** The basic Ratcliff-Obershelp algorithm is cubic time in
   the worst case and quadratic time in the expected case.
   "SequenceMatcher" is quadratic time for the worst case and has
   expected-case behavior dependent in a complicated way on how many
   elements the sequences have in common; best case time is linear.

   **Automatic junk heuristic:** "SequenceMatcher" supports a
   heuristic that automatically treats certain sequence items as junk.
   The heuristic counts how many times each individual item appears in
   the sequence. If an item's duplicates (after the first one) account
   for more than 1% of the sequence and the sequence is at least 200
   items long, this item is marked as "popular" and is treated as junk
   for the purpose of sequence matching. This heuristic can be turned
   off by setting the *autojunk* argument to "False" when creating the
   "SequenceMatcher".

   New in version 3.2: The *autojunk* parameter.
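
   As a rough illustration of the heuristic, the sketch below compares
   two made-up sequences whose only shared element, "x", is popular.
   The data is hypothetical; the sequences deliberately start with
   different elements so that no match can be anchored anywhere else.

```python
from difflib import SequenceMatcher

# Hypothetical data: 251 items each. "x" accounts for far more than 1%
# of b, and b has at least 200 items, so "x" is marked as popular.
a = ["a"] + ["x"] * 250
b = ["b"] + ["x"] * 250

with_heuristic = SequenceMatcher(None, a, b)  # autojunk defaults to True
without_heuristic = SequenceMatcher(None, a, b, autojunk=False)

print(with_heuristic.bpopular)               # {'x'}
print(with_heuristic.ratio())                # 0.0
print(round(without_heuristic.ratio(), 3))   # 0.996
```

   With the heuristic active, the long run of "x" is never used as a
   synchronization point, so the two sequences look unrelated; passing
   "autojunk=False" restores the intuitive result at some cost in
   speed.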

class difflib.Differ

   This is a class for comparing sequences of lines of text, and
   producing human-readable differences or deltas. "Differ" uses
   "SequenceMatcher" both to compare sequences of lines, and to
   compare sequences of characters within similar (near-matching)
   lines.

   Each line of a "Differ" delta begins with a two-letter code:

   +------------+---------------------------------------------+
   | Code       | Meaning                                     |
   |============|=============================================|
   | "'- '"     | line unique to sequence 1                   |
   +------------+---------------------------------------------+
   | "'+ '"     | line unique to sequence 2                   |
   +------------+---------------------------------------------+
   | "'  '"     | line common to both sequences               |
   +------------+---------------------------------------------+
   | "'? '"     | line not present in either input sequence   |
   +------------+---------------------------------------------+

   Lines beginning with "'?'" attempt to guide the eye to intraline
   differences, and were not present in either input sequence. These
   lines can be confusing if the sequences contain tab characters.
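
   To make the two-letter codes concrete, the following sketch tallies
   how often each code occurs in a small delta. The input lists and
   the variable names are made up for the illustration.

```python
from collections import Counter
from difflib import Differ

# Hypothetical inputs: no line is shared between the two lists.
before = ["one\n", "two\n", "three\n"]
after = ["ore\n", "tree\n", "emu\n"]

delta = list(Differ().compare(before, after))

# The first two characters of every delta line are its code.
codes = Counter(line[:2] for line in delta)
for code, count in sorted(codes.items()):
    print(repr(code), count)
```

   Because the inputs share no line, every input line shows up exactly
   once with a "'- '" or "'+ '" code; any "'? '" lines are extra hints
   that belong to neither input.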

class difflib.HtmlDiff

   This class can be used to create an HTML table (or a complete HTML
   file containing the table) showing a side by side, line by line
   comparison of text with inter-line and intra-line change
   highlights. The table can be generated in either full or contextual
   difference mode.

   The constructor for this class is:

   __init__(tabsize=8, wrapcolumn=None, linejunk=None, charjunk=IS_CHARACTER_JUNK)

      Initializes an instance of "HtmlDiff".

      *tabsize* is an optional keyword argument to specify tab stop
      spacing and defaults to "8".

      *wrapcolumn* is an optional keyword argument to specify the
      column number where lines are broken and wrapped; it defaults to
      "None", in which case lines are not wrapped.

      *linejunk* and *charjunk* are optional keyword arguments passed
      into "ndiff()" (used by "HtmlDiff" to generate the side by side
      HTML differences). See the "ndiff()" documentation for argument
      default values and descriptions.

   The following methods are public:

   make_file(fromlines, tolines, fromdesc='', todesc='', context=False, numlines=5, *, charset='utf-8')

      Compares *fromlines* and *tolines* (lists of strings) and
      returns a string which is a complete HTML file containing a
      table showing line by line differences with inter-line and
      intra-line changes highlighted.

      *fromdesc* and *todesc* are optional keyword arguments to
      specify from/to file column header strings (both default to an
      empty string).

      *context* and *numlines* are both optional keyword arguments.
      Set *context* to "True" when contextual differences are to be
      shown, else the default is "False" to show the full files.
      *numlines* defaults to "5". When *context* is "True", *numlines*
      controls the number of context lines which surround the
      difference highlights. When *context* is "False", *numlines*
      controls the number of lines which are shown before a difference
      highlight when using the "next" hyperlinks (setting it to zero
      would cause the "next" hyperlinks to place the next difference
      highlight at the top of the browser without any leading
      context).

      Note:

        *fromdesc* and *todesc* are interpreted as unescaped HTML and
        should be properly escaped while receiving input from
        untrusted sources.

      Changed in version 3.5: The *charset* keyword-only argument was
      added. The default charset of the HTML document changed from
      "'ISO-8859-1'" to "'utf-8'".

   make_table(fromlines, tolines, fromdesc='', todesc='', context=False, numlines=5)

      Compares *fromlines* and *tolines* (lists of strings) and
      returns a string which is a complete HTML table showing line by
      line differences with inter-line and intra-line changes
      highlighted.

      The arguments for this method are the same as those for the
      "make_file()" method.

   "Tools/scripts/diff.py" is a command-line front-end to this class
   and contains a good example of its use.
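
   The sketch below shows one way to build a contextual table while
   escaping a column header that comes from an untrusted source. The
   input lines and the file names are hypothetical, and using
   "html.escape()" is only one possible way to do the escaping.

```python
import html
from difflib import HtmlDiff

# Hypothetical inputs for the illustration.
fromlines = ["one\n", "two\n", "three\n"]
tolines = ["one\n", "too\n", "three\n"]

# A hostile "file name" that must not reach the page unescaped.
untrusted_name = "<script>alert(1)</script>.txt"

differ = HtmlDiff(tabsize=4, wrapcolumn=60)
table = differ.make_table(
    fromlines, tolines,
    fromdesc=html.escape(untrusted_name),  # escape before handing over
    todesc="after.txt",
    context=True,   # show only the changed region ...
    numlines=1)     # ... with one line of context around it

print(table[:200])
```

   "make_file()" accepts the same arguments and wraps the table in a
   complete HTML document.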

difflib.context_diff(a, b, fromfile='', tofile='', fromfiledate='', tofiledate='', n=3, lineterm='\n')

   Compare *a* and *b* (lists of strings); return a delta (a
   *generator* generating the delta lines) in context diff format.

   Context diffs are a compact way of showing just the lines that have
   changed plus a few lines of context.  The changes are shown in a
   before/after style.  The number of context lines is set by *n*
   which defaults to three.

   By default, the diff control lines (those with "***" or "---") are
   created with a trailing newline. This is helpful so that inputs
   created from "io.IOBase.readlines()" result in diffs that are
   suitable for use with "io.IOBase.writelines()" since both the
   inputs and outputs have trailing newlines.

   For inputs that do not have trailing newlines, set the *lineterm*
   argument to """" so that the output will be uniformly newline free.

   The context diff format normally has a header for filenames and
   modification times. Any or all of these may be specified using
   strings for *fromfile*, *tofile*, *fromfiledate*, and *tofiledate*.
   The modification times are normally expressed in the ISO 8601
   format. If not specified, the strings default to blanks.

   >>> import sys
   >>> from difflib import context_diff
   >>> s1 = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n']
   >>> s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']
   >>> sys.stdout.writelines(context_diff(s1, s2, fromfile='before.py', tofile='after.py'))
   *** before.py
   --- after.py
   ***************
   *** 1,4 ****
   ! bacon
   ! eggs
   ! ham
     guido
   --- 1,4 ----
   ! python
   ! eggy
   ! hamster
     guido

   See A command-line interface to difflib for a more detailed
   example.
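
   For inputs without trailing newlines, a minimal sketch of the
   *lineterm* setting could look like this. The word lists and file
   names are made up for the illustration.

```python
from difflib import context_diff

# Hypothetical inputs: none of the lines ends with a newline.
before = ["bacon", "eggs", "ham"]
after = ["bacon", "spam", "ham"]

# lineterm="" keeps the control lines newline-free as well.
delta = list(context_diff(before, after,
                          fromfile="before.txt", tofile="after.txt",
                          lineterm=""))

# Every line is newline-free, so the caller adds the separators.
print("\n".join(delta))
```

   With the default *lineterm*, the control lines would end in a
   newline while the content lines would not, which makes the joined
   output uneven.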

difflib.get_close_matches(word, possibilities, n=3, cutoff=0.6)

   Return a list of the best "good enough" matches.  *word* is a
   sequence for which close matches are desired (typically a string),
   and *possibilities* is a list of sequences against which to match
   *word* (typically a list of strings).

   Optional argument *n* (default "3") is the maximum number of close
   matches to return; *n* must be greater than "0".

   Optional argument *cutoff* (default "0.6") is a float in the range
   [0, 1]. Possibilities that don't score at least that similar to
   *word* are ignored.

   The best (no more than *n*) matches among the possibilities are
   returned in a list, sorted by similarity score, most similar first.

   >>> from difflib import get_close_matches
   >>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
   ['apple', 'ape']
   >>> import keyword
   >>> get_close_matches('wheel', keyword.kwlist)
   ['while']
   >>> get_close_matches('pineapple', keyword.kwlist)
   []
   >>> get_close_matches('accept', keyword.kwlist)
   ['except']
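
   The effect of *n* and *cutoff* can be sketched with a hypothetical
   list of sub-command names, as a "did you mean" helper might use:

```python
from difflib import get_close_matches

# Hypothetical sub-command names for the illustration.
commands = ["status", "commit", "checkout", "cherry-pick", "stash"]

# n=1 keeps only the single best candidate.
suggestion = get_close_matches("comit", commands, n=1)
print(suggestion)   # ['commit']

# A stricter cutoff rejects even this near miss.
strict = get_close_matches("comit", commands, n=1, cutoff=0.95)
print(strict)       # []
```

   "'comit'" and "'commit'" score about 0.91, which clears the default
   cutoff of 0.6 but not the stricter 0.95.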

difflib.ndiff(a, b, linejunk=None, charjunk=IS_CHARACTER_JUNK)

   Compare *a* and *b* (lists of strings); return a "Differ"-style
   delta (a *generator* generating the delta lines).

   Optional keyword parameters *linejunk* and *charjunk* are filtering
   functions (or "None"):

   *linejunk*: A function that accepts a single string argument, and
   returns true if the string is junk, or false if not. The default is
   "None". There is also a module-level function "IS_LINE_JUNK()",
   which filters out lines without visible characters, except for at
   most one pound character ("'#'") -- however the underlying
   "SequenceMatcher" class does a dynamic analysis of which lines are
   so frequent as to constitute noise, and this usually works better
   than using this function.

   *charjunk*: A function that accepts a character (a string of length
   1), and returns true if the character is junk, or false if not. The
   default is module-level function "IS_CHARACTER_JUNK()", which
   filters out whitespace characters (a blank or tab; it's a bad idea
   to include newline in this!).

   "Tools/scripts/ndiff.py" is a command-line front-end to this
   function.

   >>> from difflib import ndiff
   >>> diff = ndiff('one\ntwo\nthree\n'.splitlines(keepends=True),
   ...              'ore\ntree\nemu\n'.splitlines(keepends=True))
   >>> print(''.join(diff), end="")
   - one
   ?  ^
   + ore
   ?  ^
   - two
   - three
   ?  -
   + tree
   + emu

difflib.restore(sequence, which)

   Return one of the two sequences that generated a delta.

   Given a *sequence* produced by "Differ.compare()" or "ndiff()",
   extract lines originating from file 1 or 2 (parameter *which*),
   stripping off line prefixes.

   Example:

   >>> from difflib import ndiff, restore
   >>> diff = ndiff('one\ntwo\nthree\n'.splitlines(keepends=True),
   ...              'ore\ntree\nemu\n'.splitlines(keepends=True))
   >>> diff = list(diff) # materialize the generated delta into a list
   >>> print(''.join(restore(diff, 1)), end="")
   one
   two
   three
   >>> print(''.join(restore(diff, 2)), end="")
   ore
   tree
   emu

difflib.unified_diff(a, b, fromfile='', tofile='', fromfiledate='', tofiledate='', n=3, lineterm='\n')

   Compare *a* and *b* (lists of strings); return a delta (a
   *generator* generating the delta lines) in unified diff format.

   Unified diffs are a compact way of showing just the lines that have
   changed plus a few lines of context.  The changes are shown in an
   inline style (instead of separate before/after blocks).  The number
   of context lines is set by *n* which defaults to three.

   By default, the diff control lines (those with "---", "+++", or
   "@@") are created with a trailing newline.  This is helpful so that
   inputs created from "io.IOBase.readlines()" result in diffs that
   are suitable for use with "io.IOBase.writelines()" since both the
   inputs and outputs have trailing newlines.

   For inputs that do not have trailing newlines, set the *lineterm*
   argument to """" so that the output will be uniformly newline free.

   The unified diff format normally has a header for filenames and
   modification times. Any or all of these may be specified using
   strings for *fromfile*, *tofile*, *fromfiledate*, and *tofiledate*.
   The modification times are normally expressed in the ISO 8601
   format. If not specified, the strings default to blanks.

   >>> import sys
   >>> from difflib import unified_diff
   >>> s1 = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n']
   >>> s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']
   >>> sys.stdout.writelines(unified_diff(s1, s2, fromfile='before.py', tofile='after.py'))
   --- before.py
   +++ after.py
   @@ -1,4 +1,4 @@
   -bacon
   -eggs
   -ham
   +python
   +eggy
   +hamster
    guido

   See A command-line interface to difflib for a more detailed
   example.
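
   The header fields can be sketched as follows. The file names and
   the ISO 8601 timestamps are invented for the illustration; real
   code might derive them from the files' modification times.

```python
from difflib import unified_diff

# Hypothetical inputs without trailing newlines.
before = ["bacon", "eggs", "ham"]
after = ["bacon", "spam", "ham"]

delta = list(unified_diff(
    before, after,
    fromfile="before.txt", tofile="after.txt",
    fromfiledate="2024-01-01T09:00:00+00:00",   # made-up timestamps
    tofiledate="2024-01-02T09:00:00+00:00",
    n=1,            # one line of context around each change
    lineterm=""))   # inputs have no trailing newlines

print("\n".join(delta))
```

   Each date is appended to its file name after a tab character, which
   is the layout that "diff"-style tools conventionally use.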

difflib.diff_bytes(dfunc, a, b, fromfile=b'', tofile=b'', fromfiledate=b'', tofiledate=b'', n=3, lineterm=b'\n')

   Compare *a* and *b* (lists of bytes objects) using *dfunc*; yield a
   sequence of delta lines (also bytes) in the format returned by
   *dfunc*. *dfunc* must be a callable, typically either
   "unified_diff()" or "context_diff()".

   Allows you to compare data with unknown or inconsistent encoding.
   All inputs except *n* must be bytes objects, not str. Works by
   losslessly converting all inputs (except *n*) to str, and calling
   "dfunc(a, b, fromfile, tofile, fromfiledate, tofiledate, n,
   lineterm)". The output of *dfunc* is then converted back to bytes,
   so the delta lines that you receive have the same
   unknown/inconsistent encodings as *a* and *b*.

   New in version 3.5.
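
   As a sketch of the use case, the example below diffs the same word
   stored once as Latin-1 and once as UTF-8. The byte strings and the
   file names are made up for the illustration.

```python
from difflib import diff_bytes, unified_diff

# Hypothetical inputs with inconsistent encodings.
a = [b"caf\xe9\n", b"tea\n"]        # "café" encoded as Latin-1
b = [b"caf\xc3\xa9\n", b"tea\n"]    # "café" encoded as UTF-8

delta = list(diff_bytes(unified_diff, a, b,
                        fromfile=b"latin1.txt", tofile=b"utf8.txt"))

for line in delta:
    print(line)
```

   The delta lines come back as bytes with the original byte values
   intact, so nothing has to be decoded with a guessed encoding.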

difflib.IS_LINE_JUNK(line)

   Return "True" for ignorable lines.  The line *line* is ignorable if
   *line* is blank or contains a single "'#'", otherwise it is not
   ignorable.  Used as a default for parameter *linejunk* in "ndiff()"
   in older versions.

difflib.IS_CHARACTER_JUNK(ch)

   Return "True" for ignorable characters.  The character *ch* is
   ignorable if *ch* is a space or tab, otherwise it is not ignorable.
   Used as a default for parameter *charjunk* in "ndiff()".
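
   A quick way to see what the two predicates accept is to filter a
   few sample values. The sample lines and characters below are made
   up for the illustration.

```python
from difflib import IS_CHARACTER_JUNK, IS_LINE_JUNK

# Hypothetical sample lines: blank, whitespace only, a lone '#',
# a real comment, and ordinary code.
lines = ["\n", "   \n", "  #  \n", "# comment\n", "x = 1\n"]
junk_lines = [line for line in lines if IS_LINE_JUNK(line)]
print(junk_lines)   # ['\n', '   \n', '  #  \n']

# Only the blank and the tab count as junk characters.
junk_chars = [ch for ch in " \t\nx" if IS_CHARACTER_JUNK(ch)]
print(junk_chars)   # [' ', '\t']
```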

See also:

  Pattern Matching: The Gestalt Approach
     Discussion of a similar algorithm by John W. Ratcliff and D. E.
     Metzener. This was published in Dr. Dobb's Journal in July, 1988.


SequenceMatcher Objects
=======================

The "SequenceMatcher" class has this constructor:

class difflib.SequenceMatcher(isjunk=None, a='', b='', autojunk=True)

   Optional argument *isjunk* must be "None" (the default) or a one-
   argument function that takes a sequence element and returns true if
   and only if the element is "junk" and should be ignored. Passing
   "None" for *isjunk* is equivalent to passing "lambda x: False"; in
   other words, no elements are ignored. For example, pass:

      lambda x: x in " \t"

   if you're comparing lines as sequences of characters, and don't
   want to synch up on blanks or hard tabs.

   The optional arguments *a* and *b* are sequences to be compared;
   both default to empty strings.  The elements of both sequences must
   be *hashable*.

   The optional argument *autojunk* can be used to disable the
   automatic junk heuristic.

   New in version 3.2: The *autojunk* parameter.

   SequenceMatcher objects get three data attributes: *bjunk* is the
   set of elements of *b* for which *isjunk* is "True"; *bpopular* is
   the set of non-junk elements considered popular by the heuristic
   (if it is not disabled); *b2j* is a dict mapping the remaining
   elements of *b* to a list of positions where they occur. All three
   are reset whenever *b* is reset with "set_seqs()" or "set_seq2()".

   New in version 3.2: The *bjunk* and *bpopular* attributes.

   "SequenceMatcher" objects have the following methods:

   set_seqs(a, b)

      Set the two sequences to be compared.

   "SequenceMatcher" computes and caches detailed information about
   the second sequence, so if you want to compare one sequence against
   many sequences, use "set_seq2()" to set the commonly used sequence
   once and call "set_seq1()" repeatedly, once for each of the other
   sequences.

   set_seq1(a)

      Set the first sequence to be compared.  The second sequence to
      be compared is not changed.

   set_seq2(b)

      Set the second sequence to be compared.  The first sequence to
      be compared is not changed.
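
      The caching advice above can be sketched as follows: the shared
      sequence is set once with "set_seq2()" and each candidate is
      swapped in with "set_seq1()". The words are made up for the
      illustration.

```python
from difflib import SequenceMatcher

# Hypothetical data: one target compared against several candidates.
target = "apple"
candidates = ["ape", "apple", "peach"]

matcher = SequenceMatcher()
matcher.set_seq2(target)          # analysed once and cached

scores = {}
for candidate in candidates:
    matcher.set_seq1(candidate)   # cheap: seq2's cache is kept
    scores[candidate] = matcher.ratio()

best = max(scores, key=scores.get)
print(best, scores[best])         # apple 1.0
```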

   find_longest_match(alo, ahi, blo, bhi)

      Find longest matching block in "a[alo:ahi]" and "b[blo:bhi]".

      If *isjunk* was omitted or "None", "find_longest_match()"
      returns "(i, j, k)" such that "a[i:i+k]" is equal to "b[j:j+k]",
      where "alo <= i <= i+k <= ahi" and "blo <= j <= j+k <= bhi". For
      all "(i', j', k')" meeting those conditions, the additional
      conditions "k >= k'", "i <= i'", and if "i == i'", "j <= j'" are
      also met. In other words, of all maximal matching blocks, return
      one that starts earliest in *a*, and of all those maximal
      matching blocks that start earliest in *a*, return the one that
      starts earliest in *b*.

      >>> s = SequenceMatcher(None, " abcd", "abcd abcd")
      >>> s.find_longest_match(0, 5, 0, 9)
      Match(a=0, b=4, size=5)

      If *isjunk* was provided, first the longest matching block is
      determined as above, but with the additional restriction that no
      junk element appears in the block.  Then that block is extended
      as far as possible by matching (only) junk elements on both
      sides. So the resulting block never matches on junk except as
      identical junk happens to be adjacent to an interesting match.

      Here's the same example as before, but considering blanks to be
      junk. That prevents "' abcd'" from matching the "' abcd'" at the
      tail end of the second sequence directly.  Instead only the
      "'abcd'" can match, and matches the leftmost "'abcd'" in the
      second sequence:

      >>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd")
      >>> s.find_longest_match(0, 5, 0, 9)
      Match(a=1, b=0, size=4)

      If no blocks match, this returns "(alo, blo, 0)".

      This method returns a *named tuple* "Match(a, b, size)".

   get_matching_blocks()

      Return list of triples describing non-overlapping matching
      subsequences. Each triple is of the form "(i, j, n)", and means
      that "a[i:i+n] == b[j:j+n]".  The triples are monotonically
      increasing in *i* and *j*.

      The last triple is a dummy, and has the value "(len(a), len(b),
      0)".  It is the only triple with "n == 0".  If "(i, j, n)" and
      "(i', j', n')" are adjacent triples in the list, and the second
      is not the last triple in the list, then "i+n < i'" or "j+n <
      j'"; in other words, adjacent triples always describe non-
      adjacent equal blocks.

         >>> s = SequenceMatcher(None, "abxcd", "abcd")
         >>> s.get_matching_blocks()
         [Match(a=0, b=0, size=2), Match(a=3, b=2, size=2), Match(a=5, b=4, size=0)]

   get_opcodes()

      Return list of 5-tuples describing how to turn *a* into *b*.
      Each tuple is of the form "(tag, i1, i2, j1, j2)".  The first
      tuple has "i1 == j1 == 0", and remaining tuples have *i1* equal
      to the *i2* from the preceding tuple, and, likewise, *j1* equal
      to the previous *j2*.

      The *tag* values are strings, with these meanings:

      +-----------------+-----------------------------------------------+
      | Value           | Meaning                                       |
      |=================|===============================================|
      | "'replace'"     | "a[i1:i2]" should be replaced by "b[j1:j2]".  |
      +-----------------+-----------------------------------------------+
      | "'delete'"      | "a[i1:i2]" should be deleted.  Note that "j1  |
      |                 | == j2" in this case.                          |
      +-----------------+-----------------------------------------------+
      | "'insert'"      | "b[j1:j2]" should be inserted at "a[i1:i1]".  |
      |                 | Note that "i1 == i2" in this case.            |
      +-----------------+-----------------------------------------------+
      | "'equal'"       | "a[i1:i2] == b[j1:j2]" (the sub-sequences are |
      |                 | equal).                                       |
      +-----------------+-----------------------------------------------+

      For example:

         >>> a = "qabxcd"
         >>> b = "abycdf"
         >>> s = SequenceMatcher(None, a, b)
         >>> for tag, i1, i2, j1, j2 in s.get_opcodes():
         ...     print('{:7}   a[{}:{}] --> b[{}:{}] {!r:>8} --> {!r}'.format(
         ...         tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]))
         delete    a[0:1] --> b[0:0]      'q' --> ''
         equal     a[1:3] --> b[0:2]     'ab' --> 'ab'
         replace   a[3:4] --> b[2:3]      'x' --> 'y'
         equal     a[4:6] --> b[3:5]     'cd' --> 'cd'
         insert    a[6:6] --> b[5:6]       '' --> 'f'
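
      To show that the opcodes really describe how to turn *a* into
      *b*, the sketch below replays them on two strings. The helper
      "apply_opcodes" is a hypothetical name, not part of difflib.

```python
from difflib import SequenceMatcher


def apply_opcodes(a, b):
    """Rebuild b from a by replaying the opcodes (illustrative helper)."""
    pieces = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag == "equal":
            pieces.append(a[i1:i2])      # keep the shared part of a
        elif tag in ("replace", "insert"):
            pieces.append(b[j1:j2])      # take the new content from b
        # "delete": a[i1:i2] is dropped, so nothing is appended
    return "".join(pieces)


print(apply_opcodes("qabxcd", "abycdf"))   # abycdf
```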

   get_grouped_opcodes(n=3)

      Return a *generator* of groups with up to *n* lines of context.

      Starting with the groups returned by "get_opcodes()", this
      method splits out smaller change clusters and eliminates
      intervening ranges which have no changes.

      The groups are returned in the same format as "get_opcodes()".
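
      As a small sketch, a single changed item in a hypothetical
      20-item list yields one group whose "equal" ranges are trimmed
      to *n* items of context on either side:

```python
from difflib import SequenceMatcher

# Hypothetical data: twenty distinct items with one replacement.
a = [str(i) for i in range(20)]
b = a.copy()
b[10] = "ten"

groups = list(SequenceMatcher(None, a, b).get_grouped_opcodes(n=2))
for group in groups:
    for opcode in group:
        print(opcode)
# ('equal', 8, 10, 8, 10)
# ('replace', 10, 11, 10, 11)
# ('equal', 11, 13, 11, 13)
```

      "get_opcodes()" on the same input would report the full ranges
      "a[0:10]" and "a[11:20]" as equal.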

   ratio()

      Return a measure of the sequences' similarity as a float in the
      range [0, 1].

      Where T is the total number of elements in both sequences, and M
      is the number of matches, this is 2.0*M / T. Note that this is
      "1.0" if the sequences are identical, and "0.0" if they have
      nothing in common.

      This is expensive to compute if "get_matching_blocks()" or
      "get_opcodes()" hasn't already been called, in which case you
      may want to try "quick_ratio()" or "real_quick_ratio()" first to
      get an upper bound.
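
      The formula can be checked by hand from the matching blocks. In
      this sketch the strings are arbitrary; M is the summed size of
      the blocks and T is the combined length of both sequences.

```python
from difflib import SequenceMatcher

a, b = "abcd", "bcde"
s = SequenceMatcher(None, a, b)

# M: total number of matched elements ("bcd" here, so 3).
matches = sum(block.size for block in s.get_matching_blocks())
# T: total number of elements in both sequences (4 + 4).
total = len(a) + len(b)

by_hand = 2.0 * matches / total
print(by_hand, s.ratio())   # 0.75 0.75
```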

      Note:

        Caution: The result of a "ratio()" call may depend on the
        order of the arguments. For instance:

           >>> SequenceMatcher(None, 'tide', 'diet').ratio()
           0.25
           >>> SequenceMatcher(None, 'diet', 'tide').ratio()
           0.5

   quick_ratio()

      Return an upper bound on "ratio()" relatively quickly.

   real_quick_ratio()

      Return an upper bound on "ratio()" very quickly.

The three methods that return the ratio of matching to total
characters can give different results due to differing levels of
approximation, although "quick_ratio()" and "real_quick_ratio()" are
always at least as large as "ratio()":

>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
>>> s.quick_ratio()
0.75
>>> s.real_quick_ratio()
1.0


SequenceMatcher Examples
========================

This example compares two strings, considering blanks to be "junk":

>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(lambda x: x == " ",
...                     "private Thread currentThread;",
...                     "private volatile Thread currentThread;")

"ratio()" returns a float in [0, 1], measuring the similarity of the
sequences.  As a rule of thumb, a "ratio()" value over 0.6 means the
sequences are close matches:

>>> print(round(s.ratio(), 3))
0.866

If you're only interested in where the sequences match,
"get_matching_blocks()" is handy:

>>> for block in s.get_matching_blocks():
...     print("a[%d] and b[%d] match for %d elements" % block)
a[0] and b[0] match for 8 elements
a[8] and b[17] match for 21 elements
a[29] and b[38] match for 0 elements

Note that the last tuple returned by "get_matching_blocks()" is always
a dummy, "(len(a), len(b), 0)", and this is the only case in which the
last tuple element (number of elements matched) is "0".

If you want to know how to change the first sequence into the second,
use "get_opcodes()":

>>> for opcode in s.get_opcodes():
...     print("%6s a[%d:%d] b[%d:%d]" % opcode)
 equal a[0:8] b[0:8]
insert a[8:8] b[8:17]
 equal a[8:29] b[17:38]

See also:

  * The "get_close_matches()" function in this module which shows how
    simple code building on "SequenceMatcher" can be used to do useful
    work.

  * Simple version control recipe for a small application built with
    "SequenceMatcher".


Differ Objects
==============

Note that "Differ"-generated deltas make no claim to be **minimal**
diffs. To the contrary, minimal diffs are often counter-intuitive,
because they synch up anywhere possible, sometimes accidental matches
100 pages apart. Restricting synch points to contiguous matches
preserves some notion of locality, at the occasional cost of producing
a longer diff.

The "Differ" class has this constructor:

class difflib.Differ(linejunk=None, charjunk=None)

   Optional keyword parameters *linejunk* and *charjunk* are for
   filter functions (or "None"):

   *linejunk*: A function that accepts a single string argument, and
   returns true if the string is junk.  The default is "None", meaning
   that no line is considered junk.

   *charjunk*: A function that accepts a single character argument (a
   string of length 1), and returns true if the character is junk. The
   default is "None", meaning that no character is considered junk.

   These junk-filtering functions speed up matching to find
   differences and do not cause any differing lines or characters to
   be ignored.  Read the description of the "find_longest_match()"
   method's *isjunk* parameter for an explanation.

   "Differ" objects are used (deltas generated) via a single method:

   compare(a, b)

      Compare two sequences of lines, and generate the delta (a
      sequence of lines).

      Each sequence must contain individual single-line strings ending
      with newlines.  Such sequences can be obtained from the
      "readlines()" method of file-like objects.  The delta generated
      also consists of newline-terminated strings, ready to be printed
      as-is via the "writelines()" method of a file-like object.


Differ Example
==============

This example compares two texts. First we set up the texts, sequences
of individual single-line strings ending with newlines (such sequences
can also be obtained from the "readlines()" method of file-like
objects):

>>> text1 = '''  1. Beautiful is better than ugly.
...   2. Explicit is better than implicit.
...   3. Simple is better than complex.
...   4. Complex is better than complicated.
... '''.splitlines(keepends=True)
>>> len(text1)
4
>>> text1[0][-1]
'\n'
>>> text2 = '''  1. Beautiful is better than ugly.
...   3.   Simple is better than complex.
...   4. Complicated is better than complex.
...   5. Flat is better than nested.
... '''.splitlines(keepends=True)

Next we instantiate a Differ object:

>>> from difflib import Differ
>>> d = Differ()

Note that when instantiating a "Differ" object we may pass functions
to filter out line and character "junk."  See the "Differ()"
constructor for details.

Finally, we compare the two:

>>> result = list(d.compare(text1, text2))

"result" is a list of strings, so let's pretty-print it:

>>> from pprint import pprint
>>> pprint(result)
['    1. Beautiful is better than ugly.\n',
 '-   2. Explicit is better than implicit.\n',
 '-   3. Simple is better than complex.\n',
 '+   3.   Simple is better than complex.\n',
 '?     ++\n',
 '-   4. Complex is better than complicated.\n',
 '?            ^                     ---- ^\n',
 '+   4. Complicated is better than complex.\n',
 '?           ++++ ^                      ^\n',
 '+   5. Flat is better than nested.\n']

As a single multi-line string it looks like this:

>>> import sys
>>> sys.stdout.writelines(result)
    1. Beautiful is better than ugly.
-   2. Explicit is better than implicit.
-   3. Simple is better than complex.
+   3.   Simple is better than complex.
?     ++
-   4. Complex is better than complicated.
?            ^                     ---- ^
+   4. Complicated is better than complex.
?           ++++ ^                      ^
+   5. Flat is better than nested.


A command-line interface to difflib
===================================

This example shows how to use difflib to create a "diff"-like utility.
It is also contained in the Python source distribution, as
"Tools/scripts/diff.py".

   #!/usr/bin/env python3
   """ Command line interface to difflib.py providing diffs in four formats:

   * ndiff:    lists every line and highlights interline changes.
   * context:  highlights clusters of changes in a before/after format.
   * unified:  highlights clusters of changes in an inline format.
   * html:     generates side by side comparison with change highlights.

   """

   import sys, os, difflib, argparse
   from datetime import datetime, timezone

   def file_mtime(path):
       t = datetime.fromtimestamp(os.stat(path).st_mtime,
                                  timezone.utc)
       return t.astimezone().isoformat()

   def main():

       parser = argparse.ArgumentParser()
       parser.add_argument('-c', action='store_true', default=False,
                           help='Produce a context format diff (default)')
       parser.add_argument('-u', action='store_true', default=False,
                           help='Produce a unified format diff')
       parser.add_argument('-m', action='store_true', default=False,
                           help='Produce HTML side by side diff '
                                '(can use -c and -l in conjunction)')
       parser.add_argument('-n', action='store_true', default=False,
                           help='Produce a ndiff format diff')
       parser.add_argument('-l', '--lines', type=int, default=3,
                           help='Set number of context lines (default 3)')
       parser.add_argument('fromfile')
       parser.add_argument('tofile')
       options = parser.parse_args()

       n = options.lines
       fromfile = options.fromfile
       tofile = options.tofile

       fromdate = file_mtime(fromfile)
       todate = file_mtime(tofile)
       with open(fromfile) as ff:
           fromlines = ff.readlines()
       with open(tofile) as tf:
           tolines = tf.readlines()

       if options.u:
           diff = difflib.unified_diff(fromlines, tolines, fromfile, tofile, fromdate, todate, n=n)
       elif options.n:
           diff = difflib.ndiff(fromlines, tolines)
       elif options.m:
           diff = difflib.HtmlDiff().make_file(fromlines,tolines,fromfile,tofile,context=options.c,numlines=n)
       else:
           diff = difflib.context_diff(fromlines, tolines, fromfile, tofile, fromdate, todate, n=n)

       sys.stdout.writelines(diff)

   if __name__ == '__main__':
       main()
