ICFHR 2016 Competition on Recognition of Handwritten Mathematical Expressions (ICFHR 2016 CROHME)


CROHME 2016 Training and Test datasets

The training data will consist mainly of the data from CROHME 2014 and earlier competitions, available online from IAPR TC-11: tc11.cvc.uab.es/datasets/CROHME-2014_2. The test data will be new samples for the full-expression tasks and the CROHME 2014 test sets for the subtasks. Because part of the ground truth must be made available to carry out the CROHME2016-Symbols and CROHME2016-Structure tasks, we cannot provide this ground truth for the main task.

More details are given in the table below:

Task                 | Training                       | Validation                    | Test
CROHME2016-Formulas  | CROHME 2014 Train set          | CROHME 2014 Test set          | New CROHME 2016 Test set
CROHME2016-Symbols   | CROHME 2014 Train set          | CROHME 2013 Test set          | CROHME 2014 Test set
CROHME2016-Structure | CROHME 2014 Train set          | CROHME 2013 Test set          | CROHME 2014 Test set
CROHME2016-Matrices  | CROHME 2014 Matrices Train set | CROHME 2014 Matrices Test set | New CROHME 2016 Matrices Test set

Note that all previous CROHME datasets are available in the IAPR TC-11 package. The validation sets for the Symbols and Structure tasks were generated using the tools from CROHMELib.

For the first time, the CROHME competition makes available to participants a corpus of expressions that can be used to train language models. The source of this corpus is the Math Information Retrieval competition NTCIR-12 MathIR. We provide three corpora, from the most general to the most specific to the CROHME tasks:

  • The full corpus Wiki_formulas_v0.1: all math expressions from Wikipedia (592 000 expressions). It includes MathML and LaTeX for each formula (as provided in the NTCIR-12 collection) in HTML files. Available here (.bz2 file of 16 MB, 408 MB uncompressed)
  • The filtered corpus CROHME_Tex_formulas_v0.2: the same LaTeX expressions, filtered by the following criteria (95782 expressions):
    • no duplicate expression;
    • only ANSI characters;
    • only symbols accepted by the Grammar IV of the competition.
    However, about half of these expressions do not respect the grammar. Available here
  • The filtered corpus CROHME_Gram4_Wiki_formulas_v0.2: the LaTeX expressions of the previous corpus, further filtered by Grammar IV: 68883 expressions remain. Available here
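The three filtering criteria listed for CROHME_Tex_formulas_v0.2 can be illustrated with a minimal sketch. This is not the script used by the organizers: the function name filter_formulas, the tiny ALLOWED_COMMANDS set, and the use of an ASCII test for the "ANSI characters" criterion are all assumptions made for illustration. The real symbol list is defined by Grammar IV of the competition.

```python
import re

# Hypothetical, tiny subset of symbols; the real list comes from Grammar IV.
ALLOWED_COMMANDS = {"\\frac", "\\sqrt", "\\sum", "\\int", "\\alpha", "\\times"}

# Matches LaTeX control words such as \frac or \alpha.
COMMAND_RE = re.compile(r"\\[A-Za-z]+")


def filter_formulas(formulas, allowed_commands=ALLOWED_COMMANDS):
    """Return formulas that are unique, ASCII-only, and use only allowed commands."""
    seen = set()
    kept = []
    for formula in formulas:
        text = formula.strip()
        # Criterion 1: no duplicate expression.
        if not text or text in seen:
            continue
        # Criterion 2: character set check (assumed here to mean ASCII only).
        if not text.isascii():
            continue
        # Criterion 3: every LaTeX command must be in the accepted symbol set.
        if any(cmd not in allowed_commands for cmd in COMMAND_RE.findall(text)):
            continue
        seen.add(text)
        kept.append(text)
    return kept


if __name__ == "__main__":
    sample = ["\\frac{a}{b}", "\\frac{a}{b}", "\\mathbb{R}", "\\alpha \\times 2"]
    print(filter_formulas(sample))
```

As the text notes, passing a symbol-level filter of this kind does not guarantee that an expression respects the grammar, which is why the third corpus applies Grammar IV itself.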

File Formats and Tools

Data from CROHME 2014 and earlier competitions is available online from IAPR TC-11: tc11.cvc.uab.es/datasets/CROHME-2014_2.

For CROHME 2016, training data will be provided in the same InkML (XML) format used in previous competitions. These InkML files may be visualized using an online tool: saskatoon.cs.rit.edu/inkml_viewer/.
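As a rough illustration of working with the InkML format, the sketch below reads the stroke coordinates from an InkML document using the Python standard library. It assumes that each stroke is stored in a trace element in the W3C InkML namespace, with points separated by commas and coordinates separated by spaces. The SAMPLE document and the function name read_traces are made up for this example, and annotations such as symbol labels are ignored.

```python
import xml.etree.ElementTree as ET

# W3C InkML namespace; assumed to be the one declared in the data files.
INKML_NS = {"ink": "http://www.w3.org/2003/InkML"}

# Made-up two-stroke document, for illustration only.
SAMPLE = """<ink xmlns="http://www.w3.org/2003/InkML">
  <trace id="0">10 20, 11 22, 13 25</trace>
  <trace id="1">30 40, 31 41</trace>
</ink>"""


def read_traces(inkml_text):
    """Map each trace id to its list of (x, y) points."""
    root = ET.fromstring(inkml_text)
    traces = {}
    for trace in root.findall(".//ink:trace", INKML_NS):
        points = []
        for chunk in (trace.text or "").split(","):
            values = chunk.split()
            # Keep only x and y; extra channels (e.g. time) are dropped.
            if len(values) >= 2:
                points.append((float(values[0]), float(values[1])))
        traces[trace.get("id")] = points
    return traces


if __name__ == "__main__":
    for trace_id, points in read_traces(SAMPLE).items():
        print(trace_id, points)
```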

Recognition outputs will be in a Comma-Separated Values (.csv) format representing a labeled graph over handwritten strokes (.lg). Updated libraries for conversion between the InkML and LG formats (CROHMELib), along with tools for evaluation and visualization (LgEval), will be provided on the competition web page. Earlier versions of these tools are available online: www.cs.rit.edu/~dprl/Software.html. The updated evaluation tools (LgEval) can be obtained online using: git clone http://saskatoon.cs.rit.edu:10001/root/lgeval.git

Result file format

The file format is exactly the same as in CROHME 2014. Further details, all consistent with CROHME 2014, are given below:

  • CROHME2016-Symbols task (task 2): the result file is a CSV file listing all isolated samples, one sample per line, giving the sample's UID followed by a ranked list of up to 10 classes (best first). The junk class is included, but results will be reported both with and without the junk class, as in 2014 (cf. the Python script evalSymbIsole.py from CROHMELib):
    • MfrDB3907_85801, a, b, c, d, e, f, g, h, i, j
      MfrDB3907_85802, 1, |, l, COMMA, junk, x, X, \times
  • CROHME2016-Matrices task (task 4): the LG format can handle several labels per stroke and per relation, so the matrix structure is overlaid on the symbol labels using these labels: "*M" (Matrix), "*R" (Row), "*C" (column), "*C" (cell); these are linked by the relations "NcellR", "NcellC", "Nrow", "Ncol", which mean "next cell in row", "next row in matrix", ...
  • CROHME2016-Formulas (task 1) and CROHME2016-Structure (task 3): the CROHME 2014 ground truth contained some inconsistencies in the relations of structures with limits (sum, limit, integral). We are trying to do better this year. The rules are:
    • Limits of an integral or summation (\int or \sum) should be designated as Above/Superscript and Below/Subscript, consistent with the locations of the limits relative to the operator (i.e. the location of the limits matters)
    • the sub part of a limit (\lim symbol) should always be Below the \lim object (munder in MathML)
    • the \prime symbol is always in superscript (as in this LaTeX string $f^\prime$)
For more details, please see the script "crohme2lg.pl", which is used to convert InkML files.
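The result file layout described for the CROHME2016-Symbols task (one UID per line followed by up to 10 ranked classes) can be read with a short sketch such as the one below. This is not evalSymbIsole.py: the function names read_symbol_results and top_k_rate, the ground-truth dictionary, and the way the junk class is excluded are hypothetical choices made for illustration.

```python
import csv
import io

# The two example lines given for the Symbols task.
SAMPLE = r"""MfrDB3907_85801, a, b, c, d, e, f, g, h, i, j
MfrDB3907_85802, 1, |, l, COMMA, junk, x, X, \times
"""


def read_symbol_results(text, max_classes=10):
    """Map each sample UID to its ranked class list (best first)."""
    results = {}
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        fields = [field.strip() for field in row if field.strip()]
        if not fields:
            continue  # skip blank lines
        results[fields[0]] = fields[1:1 + max_classes]
    return results


def top_k_rate(results, truth, k=1, ignore_junk=False):
    """Fraction of samples whose true class is among the first k candidates."""
    uids = [uid for uid, label in truth.items()
            if not (ignore_junk and label == "junk")]
    if not uids:
        return 0.0
    hits = sum(1 for uid in uids if truth[uid] in results.get(uid, [])[:k])
    return hits / len(uids)


if __name__ == "__main__":
    results = read_symbol_results(SAMPLE)
    # Hypothetical ground truth, for illustration only.
    truth = {"MfrDB3907_85801": "a", "MfrDB3907_85802": "l"}
    print(top_k_rate(results, truth, k=1), top_k_rate(results, truth, k=3))
```

Calling top_k_rate twice, once with ignore_junk=False and once with ignore_junk=True, mirrors the idea of reporting results with and without the junk class.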