Checking text similarity between two documents

Apr 16 2018 pub thesis latex

To start the series “Things I did instead of writing my thesis to help me write my thesis”, here is a small Python script that compares two text documents and outputs their similar parts. I wrote it to avoid self-plagiarizing my manuscripts’ introductions in the main thesis introduction.

It’s a very naive approach, but it sped up the checking process (maybe enough to be worth the time spent on it). It first looks for short exact matches between the two documents, then extends these matches and uses the difflib module to keep only the text pairs with a minimum similarity score (default 80%).
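The two passes described above can be sketched roughly as follows. This is a minimal illustration of the seed-and-extend idea, not the actual simText.py code: the function name find_similar, the right-only extension, and the way reported regions are skipped are all assumptions made for the example. Only the defaults (20, 70, 0.8) come from the script's help page.

```python
from difflib import SequenceMatcher


def find_similar(text1, text2, k=20, ext=70, min_sim=0.8):
    """Return (score, chunk1, chunk2) tuples for similar regions.

    Hypothetical sketch: exact k-character seeds, extended to the right
    by `ext` characters, then scored with difflib.
    """
    # First pass: index every k-character substring of text2
    # by the position of its first occurrence.
    seeds = {}
    for j in range(len(text2) - k + 1):
        seeds.setdefault(text2[j:j + k], j)

    matches = []
    i = 0
    while i <= len(text1) - k:
        j = seeds.get(text1[i:i + k])
        if j is None:
            i += 1
            continue
        # Second pass: extend the exact seed and score the longer chunks.
        chunk1 = text1[i:i + k + ext]
        chunk2 = text2[j:j + k + ext]
        score = SequenceMatcher(None, chunk1, chunk2).ratio()
        if score >= min_sim:
            matches.append((round(score, 2), chunk1, chunk2))
            i += k + ext  # jump past the region just reported
        else:
            i += 1
    return matches
```

Indexing the seeds in a dictionary keeps the first pass cheap; the slower difflib comparison only runs on the few positions where an exact seed was found.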

I put the simText.py Python script on GitHub here.

Usage

Basic command:

python simText.py -1 text1.txt -2 text2.txt

The help page:

> python simText.py -h
usage: simText.py [-h] -1 D1 -2 D2 [-k K] [-e EXT] [-s MINSIM] [-tex]

Find similar text between two documents.

optional arguments:
  -h, --help  show this help message and exit
  -1 D1       Text document 1
  -2 D2       Text document 2
  -k K        The number of char for 1st pass. Default 20
  -e EXT      The number of additional char. Default 70
  -s MINSIM   The minimum similarity to define a match. Default 0.8
  -tex        Skip LaTeX header and lines starting with %
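
For reference, an interface like the one in this help page could be declared with argparse along these lines. This is only a guess at how the options might be wired up (the function name build_parser and the dest names are assumptions); the flags, metavars and defaults are copied from the help text above.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the help page shown above."""
    parser = argparse.ArgumentParser(
        prog="simText.py",
        description="Find similar text between two documents.")
    parser.add_argument("-1", dest="d1", metavar="D1", required=True,
                        help="Text document 1")
    parser.add_argument("-2", dest="d2", metavar="D2", required=True,
                        help="Text document 2")
    parser.add_argument("-k", dest="k", metavar="K", type=int, default=20,
                        help="The number of char for 1st pass. Default 20")
    parser.add_argument("-e", dest="ext", metavar="EXT", type=int, default=70,
                        help="The number of additional char. Default 70")
    parser.add_argument("-s", dest="minsim", metavar="MINSIM", type=float,
                        default=0.8,
                        help="The minimum similarity to define a match. "
                             "Default 0.8")
    parser.add_argument("-tex", action="store_true",
                        help="Skip LaTeX header and lines starting with %%")
    return parser


# Parse an explicit argument list instead of sys.argv for the demo.
args = build_parser().parse_args(["-1", "text1.txt", "-2", "text2.txt"])
print(args.k, args.ext, args.minsim)  # 20 70 0.8
```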

LaTeX documents

The -tex option skips the header in LaTeX documents and lines starting with a %:

python simText.py -1 text1.tex -2 text2.tex -tex

I implemented this because the header and commented lines were annoying me in the output. More would be needed to have a good LaTeX mode but I submitted my thesis already so it will be for another time.
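
A filter of this kind could look like the sketch below. It assumes that "the header" means everything up to and including the \begin{document} line, and it also ignores leading whitespace before a %; both are assumptions for illustration, and the function name strip_latex is made up.

```python
def strip_latex(lines):
    """Drop the LaTeX preamble and comment lines from a list of lines.

    Assumption: the header is everything up to and including the
    line containing \\begin{document}.
    """
    has_begin = any(r"\begin{document}" in line for line in lines)
    in_body = not has_begin  # no preamble marker found: keep everything
    kept = []
    for line in lines:
        if not in_body:
            if r"\begin{document}" in line:
                in_body = True
            continue  # still in the header (or on the marker line itself)
        if line.lstrip().startswith("%"):
            continue  # commented line
        kept.append(line)
    return kept


doc = [
    r"\documentclass{article}",
    r"\begin{document}",
    "% TODO rewrite this paragraph",
    r"About 50\% of the text is new.",
    r"\end{document}",
]
print(strip_latex(doc))
```

Note that a % in the middle of a line (such as an escaped \%) is left alone, which matches the "lines starting with %" behaviour described above but would miss trailing comments.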

Playing with the stringency

By default, the script outputs pairs of text segments that are at least 80% similar (change this with the -s argument). To run more or less stringent checks, I play with -e, which controls how long the 80%-similar match must be.

Output

The output contains one paragraph per match. Each paragraph has three lines: the similarity score, the text in the first document, and the text in the second document. For example (with -e 50):

S:0.87
T1: tions of a genomic region, which affect DNA copy number, are collectively known as copy number varia
T2: eletions and duplications, which affect DNA copy number, are collectively known as copy number varia
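
To get a feel for the score, the two example strings above can be fed to difflib directly. This sketch assumes the score is SequenceMatcher's ratio() rounded to two decimals, which is a guess about the script's internals; format_match is a made-up helper that mimics the three-line layout. If that assumption holds, the printed score should land close to the 0.87 shown above.

```python
from difflib import SequenceMatcher

t1 = ("tions of a genomic region, which affect DNA copy number, "
      "are collectively known as copy number varia")
t2 = ("eletions and duplications, which affect DNA copy number, "
      "are collectively known as copy number varia")


def format_match(text1, text2):
    """Build one three-line output paragraph for a pair of chunks."""
    # ratio() is 2 * (matching characters) / (total characters in both).
    score = SequenceMatcher(None, text1, text2).ratio()
    return "S:{:.2f}\nT1: {}\nT2: {}".format(score, text1, text2)


print(format_match(t1, t2))
```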
