Checking text similarity between two documents
Apr 16 2018 pub thesis latex
To start the series of “Things I did instead of writing my thesis to help me write my thesis”, here is a small Python script that compares two text documents and outputs the similar parts. I wrote it to avoid self-plagiarizing my manuscripts’ introductions in the main thesis introduction.
It’s a very naive approach, but it sped up the checking process (maybe worth the time). It first looks for short exact matches between the two documents, then extends these exact matches and uses the difflib module to keep only the text with a minimum similarity score (default 80%).
I put the simText.py Python script on GitHub here.
Usage
Basic command:
python simText.py -1 text1.txt -2 text2.txt
The help page:
> python simText.py -h
usage: simText.py [-h] -1 D1 -2 D2 [-k K] [-e EXT] [-s MINSIM] [-tex]

Find similar text between two documents.

optional arguments:
  -h, --help  show this help message and exit
  -1 D1       Text document 1
  -2 D2       Text document 2
  -k K        The number of char for 1st pass. Default 20
  -e EXT      The number of additional char. Default 70
  -s MINSIM   The minimum similarity to define a match. Default 0.8
  -tex        Skip LaTeX header and lines starting with %
LaTeX documents
The -tex option skips the header in LaTeX documents and lines starting with a %:
python simText.py -1 text1.tex -2 text2.tex -tex
I implemented this because the header and commented lines were annoying me in the output. A good LaTeX mode would need more work, but I have already submitted my thesis, so that will be for another time.
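For illustration, the filtering could look something like the sketch below. It is not the script's actual code: the helper name `strip_latex` is hypothetical, and treating everything before `\begin{document}` as the "header" is my assumption about what the option skips.

```python
def strip_latex(lines):
    """Drop the preamble and comment lines from a list of LaTeX lines.

    Hypothetical helper; assumes the header ends at \\begin{document}.
    """
    marker = r"\begin{document}"
    # If the marker is absent (e.g. a chapter file), treat every line as body.
    in_body = not any(line.lstrip().startswith(marker) for line in lines)
    kept = []
    for line in lines:
        if not in_body:
            # Still in the preamble: skip lines up to and including the marker.
            if line.lstrip().startswith(marker):
                in_body = True
            continue
        # Skip lines that start with %, but keep escaped \% inside text.
        if line.lstrip().startswith("%"):
            continue
        kept.append(line)
    return kept
```

Only whole-line comments are removed here. Trailing comments after real text are left alone, which matches the "lines starting with %" wording of the help page.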
Playing with the stringency
By default, the script outputs text that is at least 80% similar (change this with the -s argument). To run more or less stringent checks, I play with -e, which controls how long the 80% match must be.
Output
The output contains a paragraph for each match. Each paragraph has three lines: the similarity score, the text in the first document, and the text in the second document. For example (with -e 50):
S:0.87
T1: tions of a genomic region, which affect DNA copy number, are collectively known as copy number varia
T2: eletions and duplications, which affect DNA copy number, are collectively known as copy number varia
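To get a feel for what such a score means, the two strings above can be compared directly with difflib. This is only a sketch: I am assuming that `SequenceMatcher.ratio()` is the relevant measure, and the script may compute or round its score differently.

```python
from difflib import SequenceMatcher

# The two matched passages from the example output above.
t1 = "tions of a genomic region, which affect DNA copy number, are collectively known as copy number varia"
t2 = "eletions and duplications, which affect DNA copy number, are collectively known as copy number varia"

# ratio() returns 2*M/T, where M is the number of matching characters
# and T is the total length of both strings.
score = SequenceMatcher(None, t1, t2).ratio()
print("S:{:.2f}".format(score))
print("T1: " + t1)
print("T2: " + t2)
```

Because the ratio counts matching characters over the total length, a long shared stretch keeps the score high even when the beginnings of the two passages differ.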