PersianPlagDet 2016

Task Description

Task1

(Plagiarism Detection)

Given a set of suspicious and source documents written in Persian, the task is to find all plagiarized sections in the suspicious documents and, if available, the corresponding source sections. This external plagiarism detection task provides a standard setting for evaluating Persian plagiarism detection systems.

The general setup of the plagiarism detection (PD) system is as follows:

Input: A collection of suspicious and source documents written in Persian, and a text file that lists the document pairs to be compared.
Goal: Find similar fragments and their exact offsets in both the suspicious and the source files.

Output: An annotated XML file for each suspicious/source pair that describes the plagiarized fragments in detail.

<document reference="suspicious-documentXYZ.txt">
<feature
  name="detected-plagiarism"
  this_offset="5"
  this_length="1000"
  source_reference="source-documentABC.txt"
  source_offset="100"
  source_length="1000"
/>
<feature ... />
...
</document>
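
As an illustration, the output format above could be produced with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch, not an official tool of the task: the function name `build_detection_xml` and the tuple layout of `detections` are assumptions made for the example, while the element and attribute names are taken from the sample above.

```python
import xml.etree.ElementTree as ET


def build_detection_xml(suspicious_name, source_name, detections):
    """Return the annotation XML for one suspicious/source pair as a string.

    `detections` is a list of (this_offset, this_length,
    source_offset, source_length) tuples (a hypothetical layout).
    """
    root = ET.Element("document", {"reference": suspicious_name})
    for this_offset, this_length, source_offset, source_length in detections:
        # One <feature> element per detected fragment, as in the sample.
        ET.SubElement(root, "feature", {
            "name": "detected-plagiarism",
            "this_offset": str(this_offset),
            "this_length": str(this_length),
            "source_reference": source_name,
            "source_offset": str(source_offset),
            "source_length": str(source_length),
        })
    return ET.tostring(root, encoding="unicode")


xml_text = build_detection_xml(
    "suspicious-documentXYZ.txt",
    "source-documentABC.txt",
    [(5, 1000, 100, 1000)],
)
print(xml_text)
```

Building the tree with `ElementTree` rather than string concatenation leaves attribute escaping to the library, which matters if file names contain characters such as `&`.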

Use Cases:

Evaluating plagiarism detection systems
Experimenting with text similarity detection in Persian
Measuring both lexical and semantic similarity methods in Persian

Task2

(Text Alignment Corpus Construction)

This sub-task covers the construction of text alignment corpora for plagiarism detection. A corpus may be mono-lingual Persian or bi-lingual, combining Persian with any other language. Participants compile Persian PD corpora, and the submitted corpora are evaluated and ranked by quality. The PD systems proposed for Task1 will also be run on the corpora submitted to this sub-task, which helps validate Persian plagiarism detection corpora. The intellectual property of a submitted corpus belongs to its owner and will not be transferred to PersianPlagDet 2016.

The submitted corpora should follow the standard PAN text alignment annotation structure. Each corpus shall contain a source directory, a suspicious directory, and an XML directory, which hold the source documents, the suspicious documents, and the annotated XML documents, respectively. In addition, a text file named pairs should list all pairs of suspicious and source documents to be compared. You can find a sample corpus here
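
A quick consistency check of such a corpus might look like the sketch below. It rests on assumptions that the description above does not spell out: that the pairs file holds one pair per line, with the suspicious file name first and the source file name second, separated by whitespace, and that the directories are called `susp` and `src`. The function names are hypothetical as well, so adjust everything to the sample corpus.

```python
from pathlib import Path


def parse_pairs(text):
    """Parse the contents of the pairs file into (suspicious, source) tuples."""
    pairs = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 2:
            raise ValueError(f"expected two file names per line, got: {line!r}")
        pairs.append((parts[0], parts[1]))
    return pairs


def missing_files(corpus_root, pairs, susp_dir="susp", src_dir="src"):
    """List documents named in the pairs file that are absent on disk.

    The default directory names are assumptions, not part of the task text.
    """
    root = Path(corpus_root)
    missing = []
    for susp, src in pairs:
        for path in (root / susp_dir / susp, root / src_dir / src):
            if not path.is_file() and path not in missing:
                missing.append(path)
    return missing


sample = "suspicious-document00001.txt source-document00042.txt\n"
pairs = parse_pairs(sample)
print(pairs)
```

Running `missing_files` on a corpus before submission would flag pairs that point at documents that were never copied into the corpus.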

Datasets

Download the Dataset

Training Dataset

To be available on July 15, 2016

The training corpus will be available to competitors for developing their methods and tuning their parameters. The corpus consists of suspicious files, source files, and XML files. In addition, a text file specifies the pairs of suspicious and source documents to be checked for plagiarism. For each pair, the associated XML file gives the exact offsets of the plagiarized fragments shared by the two documents. The structure and content of a sample XML file are shown below.
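
As an illustration of how such annotations might be consumed, the sketch below parses an annotation document and slices the annotated fragment out of a text. It assumes that the training annotations use the same attribute names as the Task1 output example (`this_offset`, `this_length`, `source_reference`, `source_offset`, `source_length`) and that offsets count characters rather than bytes. The inline XML string and the helper names `read_fragments` and `extract` are hypothetical.

```python
import xml.etree.ElementTree as ET


def read_fragments(xml_text):
    """Return one dict per annotated fragment in an annotation document."""
    root = ET.fromstring(xml_text)
    fragments = []
    for feature in root.iter("feature"):
        # Skip features that carry no offset information.
        if feature.get("this_offset") is None:
            continue
        fragments.append({
            "source": feature.get("source_reference"),
            "this_offset": int(feature.get("this_offset")),
            "this_length": int(feature.get("this_length")),
            "source_offset": int(feature.get("source_offset")),
            "source_length": int(feature.get("source_length")),
        })
    return fragments


def extract(text, offset, length):
    """Slice a fragment out of a document by character offset and length."""
    return text[offset:offset + length]


# Hypothetical annotation, modeled on the Task1 output example.
sample_xml = """<document reference="suspicious-documentXYZ.txt">
<feature name="detected-plagiarism" this_offset="5" this_length="1000"
  source_reference="source-documentABC.txt"
  source_offset="100" source_length="1000" />
</document>"""

fragments = read_fragments(sample_xml)
print(fragments)
```

When reading the corpus files themselves, opening them with an explicit encoding (for example `open(path, encoding="utf-8")`) keeps character offsets aligned with Python string indices, provided the offsets were computed on decoded text.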

Test Dataset

To be available on August 15, 2016

The test corpus will be available for evaluating the competitors' systems. Its structure is similar to that of the training corpus, except that it contains no XML files. Participants should generate XML files in the same format as those in the training corpus, one for each pair of suspicious and source documents.

Important Dates

Date	Event
~~15th July 2016~~ (Released)	Training Data Release
~~22nd August 2016~~ (Released)	Test Data Release
~~1st September 2016~~ (Finished)	Run Submission Deadline
~~15th September 2016~~ 21st September 2016	Results Declared
15th October 2016	Working Notes Due
8-10 Dec 2016	Conference

Evaluation

Task1

(Plagiarism Detection)

Performance will be measured using macro-averaged precision and recall, granularity, and the plagdet score.

Precision and recall will be computed at the character level. In addition, the granularity measure quantifies whether contiguous plagiarized passages are detected as a whole rather than in several pieces. The plagdet score combines precision, recall, and granularity.
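
The way the three measures are combined can be sketched as follows, assuming the standard PAN definition of plagdet: the harmonic mean (F1) of precision and recall, divided by log2(1 + granularity). The function below is an illustration only and does not replace the official measure script. It takes already computed precision, recall, and granularity values as input.

```python
import math


def plagdet(precision, recall, granularity):
    """Combine precision, recall and granularity into a plagdet score.

    Assumes the standard PAN definition: F1 / log2(1 + granularity).
    """
    if granularity < 1:
        raise ValueError("granularity is at least 1 by definition")
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    # A granularity of 1 leaves F1 unchanged, since log2(2) == 1.
    return f1 / math.log2(1 + granularity)


# Precision, recall and granularity of the second-ranked run in the results.
score = plagdet(0.95927, 0.85820, 1.0)
print(round(score, 4))
```

With a granularity of exactly 1, the score reduces to plain F1. Larger granularity values, meaning that one plagiarized passage is reported as several detections, lower the score, which is why a run with high precision and recall can still rank low.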

For more information read the related article:

Learn more

The following Python code, provided by Martin Potthast at PAN@CLEF, will be used to compute these measures:

Download Measure

Task2

(Corpus Construction)

Performance will be measured by assessing the validity of submitted corpora in different ways:

Peer-review: Your corpus will be made available to the other participants of this task and be subject to peer-review.

Detection: The submitted corpora will be fed into the text alignment prototypes from Task1. The performance of each text alignment prototype in detecting the plagiarism in your corpus will be measured.

Results (Sub-task #1)

The winner is NLP Research Lab of Shahid Beheshti University.

Rank	Team (affiliation)	Plagdet	Granularity	Precision	Recall
1	Fatemeh Mashhadi, Mehrnoush Shamsfard (Shahid Beheshti University, NLP Research Lab)	0.92204	1.00146	0.92688	0.91919
2	Hadi Veisi, Kayvan Bijari, Kiarash Zahirnia, Erfaneh Gharavi (University of Tehran, Data & Signal processing Lab)	0.90593	1	0.95927	0.85820
3	Mozhgan Momtaz, Kayvan Bijari, Davood Heidarpour (University of Tehran, COIN Lab)	0.87103	1	0.89258	0.85049
4	Behrouz Minaei, Mahdi Niknam (University of Qom)	0.83015	1.03968	0.92034	0.79602
5	Faezeh Esteki, Faramarz Safi Esfahani (Najafabad Branch, Islamic Azad University)	0.80083	1.0	0.93337	0.70124
6	Alireza Talebpour, Mohammad Shirzadi, Zahra Aminolroaya, Mohammad Adibi, Ahmad Mahmoudi-Aznaveh (Shahid Beheshti University, Content lab / cyberspace research institute)	0.77496	1.22759	0.96383	0.83615
7	Nava Ehsan, Azadeh Shakeri (University of Tehran)	0.72662	1	0.74962	0.70499
8	Lee Gillam, Anna Vartapetiance (University of Surrey)	0.39968	1.52803	0.75484	0.41407
9	Muharram Mansoorizadeh, Taher Rahpooy (Bu-Ali Sina University)	0.38994	3.53698	0.90002	0.80659