System requirements

Any recent Linux system can be used. If the package does not run, try installing from source. The evaluation runs faster on a multicore machine; ten cores is a good number.

The start up kit

We provide a basic kit to get things started. It contains a sample dataset (in English), together with the packages needed to run the evaluations. It can be downloaded here:

You can download this dataset and evaluation software without committing to the challenge. To register, send an email to and use this github repository for instructions.

General organisation of the start up and evaluation kits

The start up and evaluation kits are organized in a similar fashion. We give instructions for the start up kit below, but they can be adapted for the evaluation kit in a straightforward fashion.

Once downloaded, the kit will contain the following files:

         samplewav/       # contains sample *.wav
         sampleeval1/     # evaluation software for task 1
             eval1*                  # the executable
             resources/              # the code, libraries, etc. (do not enter)
             HTKposteriors/          # sample posteriorgram
             MFCC/                   # sample mfcc
         sampleeval2/     # evaluation software for task 2
             sample_eval2*           # the executable
             resources/              # the guts of the eval2 software (do not enter)
             sample.classes.example  # example output for the ``sample'' dataset

to run the Track 1 evaluations:

         Usage: eval1 [options] <feature_file> <output-directory>
             options:
                     -h                  # print help message with details
                     -j <int>            # number of CPUs to use (default 1)
                     -kl                 # DTW+KL distance (DTW+cosine is the default)
                     -d <distancemodule> # define your own distance
                                         # (in the format `path/module.function')
                     -csv                # outputs a csv file with detailed results

         <feature_file>             # a text file containing the frame by frame
                                    # values of the speech features to be
                                    # evaluated (see below for the precise format)
         <output-directory>         # name of a directory where the results
                                    # will be stored

example:

         $ cd sampleeval1
         $ ./eval1 MFCC MFCCscore        # by default, the distance is DTW+cosine

         ... lots of progress information ...              # takes about 3-5 minutes
         {'within_talkers': 24.3, 'across_talkers': 32.5}  # ABX discriminability scores
                                                           # for each task, as error rates
                                                           # between 0 and 100 (lower is better)

          $ ./eval1 -kl HTKposteriors HTKscore  # you can specify a distance other than cosine
             
         {'within_talkers': 20.3, 'across_talkers': 23.9}

Note: the posteriorgrams are provided for illustration purposes only; they were obtained through a not particularly optimized HTK pipeline using a monophone model (PER: 42%). The output directory contains a text file with the above results (called 'results.txt') plus, if the -csv option is used, .csv files with the detailed results per minimal pair and talker (called 'DATASET_across.csv' and 'DATASET_within.csv', where DATASET corresponds to one of the datasets provided with the challenge). The directory will also contain a file called 'VERSION_$' indicating the version of the evaluation code that was used. Please make sure to report that number in your report.

to run the Track 2 evaluations on the provided sample dataset:

     usage: sample_eval2 [-h] [-v] [-j N_JOBS] [-V]
                 DISCCLSFILE DESTINATION

     Evaluate spoken term discovery

     positional arguments:
       DISCCLSFILE               discovered classes
       DESTINATION               location for the evaluation results

     optional arguments:
       -h, --help                show this help message and exit
       -v, --verbose             display progress
       -j N_JOBS, --n-jobs N_JOBS
                                 number of cores to use
       -V, --version             show program's version number and exit

for example, to run the evaluation on the provided output (``sample.classes.example'') for the sample dataset:

     $ cd sampleeval2
     $ ./sample_eval2 sample.classes.example outputdir

To evaluate your own system's output on the provided ``english'' dataset and print progress information along the way:

     $ ./english_eval2 my_output.classes outputdir -v

To run the evaluation on multiple cores, use the -j flag. As an indication of runtime, evaluating the ``english'' dataset with a gold output takes about 20 minutes using two 3.2 GHz cores. Peak RAM usage is about 10 GB; note that it will increase with the number of cores used. Evaluation runtime and memory usage also depend strongly on the particulars of the input file. It is not useful to use more than 10 cores (each parallel job handles one of the 10 subsampling folds).

The output directory will contain one file for each of the measures described above, with scores for both across-speaker and within-speaker performance. The directory will also contain a file called ``VERSION_$'' indicating the version of the evaluation code that was used. Please make sure to report that number in your report. The version number can also be obtained by:

   $ ./sample_eval2 -V
   or
   $ ./english_eval2 -V

File formats

Speech Datasets

The recordings have been cut into sentence-sized wav files (16kHz, 16 bits). The speaker identity can be obtained for each file by taking characters 11 to 13 of the file name.
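For example, in python the speaker code can be extracted as follows (a minimal sketch; the 1-based character positions 11 to 13 correspond to the 0-based slice [10:13]):

     # extract the speaker identity from a wav file name (a minimal sketch)
     import os

     def speaker_id(wav_path):
         # characters 11 to 13, counted from 1, are the python slice [10:13]
         return os.path.basename(wav_path)[10:13]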

Track 1 feature file format

Our evaluation system requires that your unsupervised subword modeling system output a vector of feature values for each frame. For each utterance in the set (e.g. aghsu09.wav), an ASCII feature file with the same name as the utterance (e.g. aghsu09.fea) should be generated with the following format:

     <time> <val1>    ... <valN>
     <time> <val1>    ... <valN>

example:

     0.0125 12.3 428.8 -92.3 0.021 43.23         
     0.0225 19.0 392.9 -43.1 10.29 40.02
     ...

Note: the time is in seconds and corresponds to the center of each frame. In this example, there is a frame every 10ms and the first frame spans 25ms starting at the beginning of the file; hence the first frame is centered at 0.0125 seconds and the second 10ms later. The frames are not required to be regularly spaced; the only requirement is that the timestamp of frame n+1 be strictly greater than the timestamp of frame n. The frame timestamps are used by the evaluation software to determine, on the basis of manual phone-level alignments, which features correspond to a particular triphone within the sequence of features for a whole sentence.
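For illustration, such a file can be written from a matrix of feature values as follows (a minimal sketch; the 10ms frame shift and 25ms window are assumptions matching the example above):

     # write a Track 1 feature file: one line per frame, the frame center time
     # followed by the feature values (a minimal sketch; frame_shift and
     # window_size are assumptions matching the example above)
     import numpy as np

     def write_feature_file(features, path, frame_shift=0.010, window_size=0.025):
         # features: 2-D numpy array with one row per frame
         with open(path, 'w') as f:
             for i, frame in enumerate(features):
                 time = window_size / 2 + i * frame_shift  # center of frame i
                 f.write('%.4f %s\n' % (time, ' '.join('%.6f' % v for v in frame)))

     # e.g. 100 frames of 13-dimensional features for utterance aghsu09
     write_feature_file(np.random.randn(100, 13), 'aghsu09.fea')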

Track 2 output format

The spoken word discovery system should output an ASCII file listing the set of fragments that were found, in the following format:

     Class <classnb>
     <filename> <fragment_onset> <fragment_offset>
     <...>
     <filename> <fragment_onset> <fragment_offset>
     <NEWLINE>
     Class <classnb>
     <filename> <fragment_onset> <fragment_offset>

example:

     Class 1
     dsgea01   1.238  1.763 
     dsgea19   3.380  3.821
     reuiz28  18.036 18.537

     Class 2
     zeoqx71   8.389  9.132
     ...etc...

Note: the onset and offset are in seconds. If your system only does matching and not clustering, your classes will only have two elements each. If your system performs not only matching but also clustering and parsing, the fragments found will cover the entirety of the files, and there may be classes containing only one element (the remainder of lexical-based segmentation).
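For illustration, such a file can be produced in python from a dictionary mapping class numbers to lists of fragments (a hypothetical sketch; the helper and its inputs are not part of the kit):

     # write a Track 2 class file from a dict mapping class numbers to lists of
     # (filename, onset, offset) fragments (a hypothetical sketch, not part of the kit)
     def write_class_file(classes, path):
         with open(path, 'w') as f:
             for class_nb in sorted(classes):
                 f.write('Class %d\n' % class_nb)
                 for filename, onset, offset in classes[class_nb]:
                     f.write('%s %.3f %.3f\n' % (filename, onset, offset))
                 f.write('\n')  # blank line between classes

     write_class_file({1: [('dsgea01', 1.238, 1.763), ('dsgea19', 3.380, 3.821)],
                       2: [('zeoqx71', 8.389, 9.132)]}, 'my_output.classes')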

Track 1 details

Processing time

The evaluation can be divided into two components:

Using your own distance

To see how to provide your own distance, let us first show how to obtain the default DTW+cosine distance using the -d option. The distance function used by default (DTW+cosine) is defined in the python script:

     sampleeval1/resources/distance.py

by the function named distance. So calling the eval1 executable from the sampleeval1 folder with the option:

     -d ./resources/distance.distance

will reproduce the default behavior.

Now, to define your own distance function, you can for example copy the file:

     sampleeval1/resources/distance.py

to a directory dir somewhere on your system, modify the distance function definition to suit your needs, and call eval1 with the option:

     -d dir/distance.distance

You will see that the distance.py script begins by importing three other python modules, one for DTW, one for cosine distance and one for Kullback-Leibler divergence. The cosine and Kullback-Leibler modules are located in folder:

     sampleeval1/src/ABXpy/distances/metrics

and implement frame-to-frame distance computations in a fashion similar to the scipy.spatial.distance.cdist function from the scipy python library. The DTW module is also located in the folder:

     sampleeval1/src/ABXpy/distances/metrics

but as a compiled extension module (dtw.so) built from the cython source file:

     sampleeval1/src/ABXpy/distances/metrics/install/dtw.pyx

for efficiency reasons. You can use our optimized DTW implementation with any frame-to-frame distance function that has a synopsis like the scipy.spatial.distance.cdist function, by modifying your copy of distance.py appropriately. You can also replace the whole distance computation by any python or cython module of your own design, as long as it has the same input and output format as the distance function in the distance.py script.
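As an illustration only, a custom distance module could look like the sketch below. It assumes, like the provided distance.py, that the function receives the features of two items as 2-D arrays (one row per frame) and returns a single dissimilarity value; the module name, the naive alignment, and the Euclidean metric are all stand-ins chosen for the example, so take the actual interface and the DTW call from your copy of distance.py.

     # mydistance.py -- a hypothetical custom distance module, to be called as
     # "eval1 -d path/mydistance.distance ...". It replaces DTW+cosine with a
     # Euclidean frame-to-frame metric averaged along a naive linear alignment;
     # this only illustrates the assumed interface (two 2-D arrays in, one
     # scalar out), it is not the kit's implementation.
     import numpy as np
     from scipy.spatial.distance import cdist

     def distance(x, y):
         d = cdist(x, y, metric='euclidean')  # frame-to-frame distances, cdist-style
         n, m = d.shape
         # naive linear alignment as a cheap stand-in for DTW
         path = [(int(round(j * (n - 1) / float(max(m - 1, 1)))), j) for j in range(m)]
         return float(np.mean([d[i, j] for i, j in path]))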

Troubleshooting

Building from source

If the Linux executables do not work directly for you, you might want to try installing from source. This is not the recommended solution; please try using the provided executables first.

It is strongly recommended that you use python anaconda, which is a self-contained scientific python installation containing most of the libraries and dependencies needed for this software. Python anaconda does not require admin privileges and can be installed in any directory on your system. You can use a virtual environment to isolate it completely from the rest of your system.

To install with anaconda, go to the src folder and type:

     pip install h5features
     make install

If PyTables fails to build, try:

     pip install numexpr
     pip install tables

If you really don't want to use anaconda, check out the README.rst and requirements.txt files in the src folder.

Multiprocessing.py

The parallelisation of our program relies on a module from python's standard library called multiprocessing.py, which can be a bit unstable. If you experience problems when running the evaluation, try requesting only one CPU (the default, -j 1) to avoid using this module altogether.