

B.3.2 Sparse Datasets

There are two sparse dataset formats, spardat and June. Both are whitespace delimited, and both require the input attributes to be binary. In the spardat format, the output value is included in the dataset as the first token of each line, and can be any real value. However, the output attribute will be thresholded to a binary value at dataset load time. All remaining tokens on the line are interpreted as indices of nonzero attributes. Attribute numbering begins at zero, and lines beginning with ``#'' are ignored.
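As an illustration only, the spardat rules above can be sketched in Python. The function name parse_spardat, the default threshold of 0.5, the use of ">=" for thresholding, and the skipping of blank lines are all assumptions made for this sketch; the text above does not state the threshold value or direction the loader actually uses.

```python
from typing import List, Tuple


def parse_spardat(text: str, threshold: float = 0.5) -> Tuple[List[int], List[List[int]]]:
    """Parse spardat-style text into binary outputs and nonzero attribute indices.

    The threshold value and the >= comparison are assumptions for this sketch.
    """
    outputs: List[int] = []
    rows: List[List[int]] = []
    for line in text.splitlines():
        stripped = line.strip()
        # Lines beginning with '#' are ignored; skipping blank lines is an assumption.
        if not stripped or stripped.startswith("#"):
            continue
        tokens = stripped.split()
        # First token is the real-valued output, thresholded to a binary value.
        outputs.append(1 if float(tokens[0]) >= threshold else 0)
        # Remaining tokens are zero-based indices of nonzero (binary) attributes.
        rows.append([int(tok) for tok in tokens[1:]])
    return outputs, rows


sample = """# a comment line
1 0 3 7
0.2 1 2
-1.5 4
"""
outputs, rows = parse_spardat(sample)
```

Here each entry of rows lists the attributes that are 1 for that record; every attribute not listed is implicitly 0.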

In the June format the input attributes and the output are in separate files. The rows of the input file consist entirely of attribute indices separated by whitespace. As with the spardat format, these are interpreted as the indices of nonzero attributes, attribute numbering begins at zero, and lines beginning with ``#'' are ignored. The output file is comma-delimited and can contain multiple columns. The first line should be a list of output names, and the second line should be blank. All remaining lines can contain almost anything, but only columns containing exclusively ACT and NOT_ACT tokens can be used as an output column. Note that attribute indices in the output file begin with zero and include columns not suitable for use as outputs; this is important when interpreting certain output files associated with the mroc action. As with csv files, selection of the output attribute is deferred until the dataset is loaded.
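The rule for which columns of a June output file may serve as outputs can be sketched as follows. This is a hypothetical helper, not the loader's own code: the name usable_output_columns and the sample data are invented for illustration, and the sketch assumes the negative token is spelled NOT_ACT and that a column with a missing or empty cell is not usable.

```python
import csv
from typing import Dict

# Assumed spellings of the two tokens an output column may contain.
VALID_TOKENS = {"ACT", "NOT_ACT"}


def usable_output_columns(text: str) -> Dict[int, str]:
    """Map zero-based column index to name for columns holding only ACT/NOT_ACT."""
    lines = text.splitlines()
    # First line: comma-delimited list of output names.
    names = [name.strip() for name in next(csv.reader([lines[0]]))]
    # Second line is expected to be blank; data starts on the third line.
    data = [[cell.strip() for cell in row] for row in csv.reader(lines[2:]) if row]
    usable: Dict[int, str] = {}
    for idx, name in enumerate(names):
        # A missing cell becomes "", which is not a valid token.
        column = [row[idx] if idx < len(row) else "" for row in data]
        if column and all(tok in VALID_TOKENS for tok in column):
            usable[idx] = name
    return usable


sample = "id,activity,note\n\n1,ACT,ok\n2,NOT_ACT,check\n3,ACT,\n"
columns = usable_output_columns(sample)
```

In this sample only the activity column qualifies, and its index is 1 rather than 0 because the unusable id column is still counted when numbering columns.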

One additional dataset format variation is understood. If the attribute indices in a spardat or June formatted file have ``:1'' appended, the ``:1'' is ignored and the data is loaded as if the suffix were absent.
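A reader for either sparse format could tolerate the suffix with a small token-level helper. The sketch below is an assumption about how one might do this, and parse_index_token is a hypothetical name, not part of the software described here.

```python
def parse_index_token(token: str) -> int:
    """Convert an attribute-index token to an int, ignoring a trailing ':1'."""
    if token.endswith(":1"):
        token = token[:-2]  # drop the two-character ':1' suffix
    return int(token)


# A row mixing suffixed and plain indices parses to the same index list.
indices = [parse_index_token(tok) for tok in "0:1 3:1 7 11".split()]
```

Only the exact suffix is stripped, so an index such as 11 is left alone.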


Copyright 2004 Paul Komarek, komarek@cmu.edu