PersianPlagDet 2016


A PAN Fire Shared Task On Plagiarism Detection


Task Description


Task1

(Plagiarism Detection)


Given a set of suspicious and source documents written in Persian, the task is to find all plagiarized sections in the suspicious documents and, if available, the corresponding source sections. This external plagiarism detection task provides a standard setting for evaluating Persian plagiarism detection systems.

The general setup of the plagiarism detection (PD) system is as follows:

  • Input: A collection of suspicious and source documents written in Persian, and a text file that lists the document pairs to be compared.
  • Goal: Find similar fragments and their exact offsets in both the suspicious and the source files.
  • Output: An annotated XML file for each pair of suspicious and source documents, describing each plagiarized fragment in detail:
    <document reference="suspicious-documentXYZ.txt">
    <feature
      name="detected-plagiarism"
      this_offset="5"
      this_length="1000"
      source_reference="source-documentABC.txt"
      source_offset="100"
      source_length="1000"
    />
    <feature ... />
    ...
    </document>
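
The output format above can be produced with a few lines of Python. The following is a minimal sketch, not the organizers' tooling: the function name `build_detection_xml` and the tuple layout of `detections` are assumptions made for illustration, while the element and attribute names are taken from the sample above.

```python
import xml.etree.ElementTree as ET


def build_detection_xml(suspicious_name, source_name, detections):
    """Build the XML report for one suspicious/source pair.

    `detections` is a hypothetical input layout: an iterable of
    (this_offset, this_length, source_offset, source_length) tuples.
    """
    root = ET.Element("document", reference=suspicious_name)
    for this_offset, this_length, source_offset, source_length in detections:
        # One <feature> element per detected fragment, as in the sample above.
        ET.SubElement(root, "feature", {
            "name": "detected-plagiarism",
            "this_offset": str(this_offset),
            "this_length": str(this_length),
            "source_reference": source_name,
            "source_offset": str(source_offset),
            "source_length": str(source_length),
        })
    return ET.tostring(root, encoding="unicode")


xml_text = build_detection_xml(
    "suspicious-documentXYZ.txt",
    "source-documentABC.txt",
    [(5, 1000, 100, 1000)],
)
print(xml_text)
```

Building the file through an XML library rather than string concatenation keeps attribute values correctly escaped, which matters if document names contain special characters.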

Use Cases:

  • Evaluating plagiarism detection systems
  • Experimenting with text similarity detection in Persian
  • Evaluating both lexical and semantic similarity methods for Persian

Task2

(Text Alignment Corpus Construction)

This sub-task covers the construction of text alignment corpora for plagiarism detection. A corpus may be monolingual Persian or bilingual, combining Persian with any other language. Participants compile Persian PD corpora, and the goal is to evaluate the submitted corpora and rank them by quality, which also serves to validate Persian plagiarism detection corpora. In addition, the proposed PD systems will be run on the corpora submitted to this sub-task. The intellectual property of a submitted corpus remains with its owner and will not be transferred to PersianPlagDet 2016.

The submitted corpora should follow the standard PAN text alignment annotation structure. Each corpus shall contain a source directory, a suspicious directory, and an XML directory, which hold the source documents, the suspicious documents, and the annotated XML documents, respectively. In addition, a text file named pairs should list all pairs of suspicious and source documents to be compared. You can find a sample corpus here
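
The corpus layout described above could be traversed roughly as follows. This is a sketch under stated assumptions: the pairs file is assumed to hold one whitespace-separated "suspicious source" pair per line, and the directory names `suspicious` and `source` are placeholders, since the exact names are not given here. The helper names `parse_pairs` and `locate_pair` are hypothetical.

```python
from pathlib import Path


def parse_pairs(pairs_text):
    """Parse the contents of the pairs file into (suspicious, source) tuples.

    Assumes one whitespace-separated pair per line; lines that do not
    contain exactly two fields (such as blank lines) are skipped.
    """
    pairs = []
    for line in pairs_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            pairs.append((parts[0], parts[1]))
    return pairs


def locate_pair(corpus_root, suspicious_name, source_name,
                suspicious_dir="suspicious", source_dir="source"):
    """Return the paths of both documents; directory names are assumptions."""
    root = Path(corpus_root)
    return root / suspicious_dir / suspicious_name, root / source_dir / source_name


sample_pairs = "suspicious-documentXYZ.txt source-documentABC.txt\n"
for suspicious_name, source_name in parse_pairs(sample_pairs):
    print(locate_pair("corpus", suspicious_name, source_name))
```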

Training Dataset

To be available on July 15, 2016


The training corpus will be available to competitors for developing their methods and tuning the related parameters. The corpus consists of suspicious files, source files, and XML files. In addition, a text file lists the pairs of suspicious and source documents to be checked for plagiarism. For each pair of suspicious and source documents, the associated XML file gives the exact offsets of the plagiarized fragments shared by the two documents. The structure and content of a sample XML file are shown below.
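
As an illustration of how such an annotation file might be read, here is a minimal sketch. The embedded `SAMPLE` string is not the organizers' sample file: it is a hypothetical stand-in modeled on the output format given in the Task 1 description, and the real training annotations may carry different or additional attributes. The sketch also assumes that offsets and lengths count characters of the decoded text, not bytes.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation, modeled on the Task 1 output format.
SAMPLE = """<document reference="suspicious-documentXYZ.txt">
<feature name="detected-plagiarism" this_offset="5" this_length="10"
         source_reference="source-documentABC.txt"
         source_offset="0" source_length="10" />
</document>"""


def read_features(xml_text):
    """Return one dict per <feature>, with offsets and lengths as integers."""
    root = ET.fromstring(xml_text)
    features = []
    for node in root.iter("feature"):
        features.append({
            "source_reference": node.get("source_reference"),
            "this_offset": int(node.get("this_offset")),
            "this_length": int(node.get("this_length")),
            "source_offset": int(node.get("source_offset")),
            "source_length": int(node.get("source_length")),
        })
    return features


def fragment(text, offset, length):
    """Slice a fragment out of a document by character offset and length."""
    return text[offset:offset + length]


features = read_features(SAMPLE)
suspicious_text = "0123456789abcdefghij"  # stand-in for a decoded Persian document
first = features[0]
print(fragment(suspicious_text, first["this_offset"], first["this_length"]))
```

When reading the actual documents, decoding them as text first (for example as UTF-8) and slicing the resulting string keeps the offsets aligned with characters, which matters for Persian because most letters occupy more than one byte.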

Test Dataset

To be available on August 15, 2016

The test corpus will be made available for evaluating the competitors' systems. Its structure is the same as that of the training corpus, except that it contains no XML files. For each pair of suspicious and source documents, participants should generate XML files in the same format as those in the training corpus.

Important Dates


  • 15th July 2016: Training Data Release (released)
  • 22nd August 2016: Test Data Release (released)
  • 1st September 2016: Run Submission Deadline (finished)
  • 21st September 2016 (originally 15th September 2016): Results Declared
  • 15th October 2016: Working Notes Due
  • 8-10 December 2016: Conference

Evaluation


Task1

(Plagiarism Detection)

Performance will be measured using macro-averaged precision and recall, granularity, and the plagdet score.

Precision and recall will be computed at the character level. In addition, the granularity measure quantifies whether the contiguity of plagiarized text passages is properly recognized, that is, whether each plagiarized passage is reported as a single detection rather than as several fragments. The plagdet score combines precision, recall, and granularity.
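
For orientation, the way the three measures are combined in the PAN evaluation framework can be sketched as follows: plagdet is the harmonic mean (F1) of precision and recall, divided by log2(1 + granularity). This is an illustrative re-implementation of that formula only, not the official measure script, and the function names are chosen here for the example. The input values are taken from rows of the Sub-task #1 results on this page.

```python
import math


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def plagdet(precision, recall, granularity):
    """Combine the three measures: F1 / log2(1 + granularity).

    Granularity is at least 1, so the denominator is at least 1 and a
    fragmented detection (granularity > 1) lowers the score.
    """
    return f1_score(precision, recall) / math.log2(1 + granularity)


# Granularity 1: the score reduces to plain F1.
print(round(plagdet(0.95927, 0.85820, 1.0), 5))
# High granularity: good precision and recall are heavily penalized.
print(round(plagdet(0.90002, 0.80659, 3.53698), 5))
```

The second call shows why a system with strong precision and recall can still rank low: reporting each plagiarized passage as several fragments raises granularity and shrinks the final score.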

For more information read the related article:


Learn more

The following Python code will be used to compute these measures. The code was provided by Martin Potthast at PAN@CLEF:


Download Measure

Task2

(Corpus Construction)

Performance will be measured by assessing the validity of submitted corpora in different ways:

Peer-review: Your corpus will be made available to the other participants of this task and be subject to peer-review.

Detection: The submitted corpora will be fed into the text alignment prototypes from Task 1. The performance of each text alignment prototype in detecting the plagiarism in your corpus will be measured.

Results (Sub-task #1)

The winner is the NLP Research Lab of Shahid Beheshti University.

| Rank | Team | Plagdet | Granularity | Precision | Recall |
|---|---|---|---|---|---|
| 1 | Fatemeh Mashhadi, Mehrnoush Shamsfard (Shahid Beheshti University, NLP Research Lab) | 0.92204 | 1.00146 | 0.92688 | 0.91919 |
| 2 | Hadi Veisi, Kayvan Bijari, Kiarash Zahirnia, Erfaneh Gharavi (University of Tehran, Data & Signal processing Lab) | 0.90593 | 1 | 0.95927 | 0.85820 |
| 3 | Mozhgan Momtaz, Kayvan Bijari, Davood Heidarpour (University of Tehran, COIN Lab) | 0.87103 | 1 | 0.89258 | 0.85049 |
| 4 | Behrouz Minaei, Mahdi Niknam (University of Qom) | 0.83015 | 1.03968 | 0.92034 | 0.79602 |
| 5 | Faezeh Esteki, Faramarz Safi Esfahani (Najafabad Branch, Islamic Azad University) | 0.80083 | 1.0 | 0.93337 | 0.70124 |
| 6 | Alireza Talebpour, Mohammad Shirzadi, Zahra Aminolroaya, Mohammad Adibi, Ahmad Mahmoudi-Aznaveh (Shahid Beheshti University, Content lab / cyberspace research institute) | 0.77496 | 1.22759 | 0.96383 | 0.83615 |
| 7 | Nava Ehsan, Azadeh Shakeri (University of Tehran) | 0.72662 | 1 | 0.74962 | 0.70499 |
| 8 | Lee Gillam, Anna Vartapetiance (University of Surrey) | 0.39968 | 1.52803 | 0.75484 | 0.41407 |
| 9 | Muharram Mansoorizadeh, Taher Rahpooy (Bu-Ali Sina University) | 0.38994 | 3.53698 | 0.90002 | 0.80659 |

Previous Events


  • PAN @ FIRE'15: December 4-6, Gandhinagar, India
  • PAN @ FIRE'14: December 5-7, Bangalore, India
  • PAN @ FIRE'13: December 4-6, New Delhi, India
  • PAN @ FIRE'12: December 17-19, Kolkata, India
  • PAN @ FIRE'11: December 2-4, Bombay, India

Organizers


Steering Committee

Vahid Zarrabi, ICT Research Institute, ACECR, Iran

Mehrnoosh Shamsfard, Shahid Beheshti University, Iran

Omid Fatemi, University of Tehran, Iran

Hesham Faili, University of Tehran, Iran

Salar Mohtaj, ICT Research Institute, ACECR, Iran

Behrouz Minaei, University of Science & Technology, Iran

Habibollah Asghari, ICT Research Institute of ACECR, Iran

Paolo Rosso, Universitat Politècnica de València, Spain