

B.3.3 Dataset naming

All three dataset file formats (csv, spardat, and June) require information beyond the file name when they are specified on the command line. This information is given through additional keywords and, in some cases, extra characters appended to the filename. The dataset filename itself may also affect how the dataset is loaded. If the filename ends in .csv or .fds, the file is assumed to be in csv or fds format, and an error will occur if it is not. Files whose names do not end in .csv or .fds are assumed to be in either the spardat or June format.
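The filename rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not the program's actual loader; the function name guess_format_family, the returned labels, and the case-sensitive suffix match are all assumptions.

```python
def guess_format_family(filename: str) -> str:
    """Classify a dataset filename by its suffix (hypothetical helper)."""
    # Assumption: the suffix comparison is case-sensitive.
    if filename.endswith(".csv"):
        return "csv"
    if filename.endswith(".fds"):
        return "fds"
    # Any other name is taken to be spardat or June; telling those two
    # apart requires looking at the rest of the specification.
    return "spardat-or-june"
```

Under these assumptions, a-or-d.csv is classified as csv, while sp.txt falls into the spardat-or-June group.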

When loading csv or fds files, one must specify the filename and the output attribute. There are two methods for choosing the output attribute. In the first, the keyword csv is written after the dataset name, separated by the usual whitespace between command-line tokens, and the last column of the dataset is used as the output. The inputs are stored internally as real numbers in a dense matrix. The csv keyword is the only way to force our algorithms to use dense computations. Using the csv keyword with files not ending in .csv or .fds is an error. Not all learners can perform dense computations, and an error message will be displayed if such a learner is used with the csv keyword; this message is likely to claim that the specified learner does not exist. In the example below, the csv dataset is loaded, its last attribute is used as the output, and the learner uses dense computations:

./afc roc in a-or-d.csv csv ...

The second method of choosing an output for csv and fds files uses the output keyword after the dataset name. This keyword must be followed by a valid attribute name from the csv or fds dataset. If this method is used, the dataset is assumed to consist entirely of binary input values, written in integer or floating-point notation. Datasets in this form are stored internally in a sparse format, and the learning algorithms will generally employ efficient sparse techniques to speed learning and reduce memory load. If the dataset filename does not end in .csv or .fds, the output keyword will be ignored. The example below loads the same file used in the csv keyword example above, uses attribute "y" as the output, and allows learners to use sparse computations.

./afc roc in a-or-d.csv output y ...

Note that the csv and output keywords apply to all datasets on the command line. For instance, in a train/test experiment, specifying the csv keyword causes the last attribute of both the training and the testing datasets to be used as the output. In that case, both datasets must be dense as well.
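The two ways of choosing the output attribute can be summarized in a short Python sketch. The function choose_output, its arguments, and the "dense" and "sparse" labels are hypothetical and serve only to restate the rules above; they do not reflect the program's internals.

```python
def choose_output(attributes, keyword, name=None):
    """Return (output attribute, storage) for a csv or fds dataset.

    attributes: attribute names in file order (hypothetical input).
    keyword:    "csv" or "output", as written on the command line.
    name:       attribute name following the output keyword.
    """
    if keyword == "csv":
        # csv keyword: last column is the output, dense storage.
        return attributes[-1], "dense"
    if keyword == "output":
        # output keyword: a named attribute is the output, sparse storage.
        if name not in attributes:
            raise ValueError(f"unknown attribute: {name!r}")
        return name, "sparse"
    raise ValueError(f"unknown keyword: {keyword!r}")
```

For a dataset with made-up attributes a, d, y, the call choose_output(["a", "d", "y"], "csv") selects y with dense storage, and choose_output(["a", "d", "y"], "output", "a") selects a with sparse storage.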

To use a June format file, one must specify the filename of the input dataset, the filename of the output dataset, and an attribute name from the output dataset. These are specified in the order just listed, separated by colons, and the output attribute must not end in .csv or .fds. Because of the constraints of the June format, both the inputs and the output are necessarily binary. Datasets in the June format are stored internally in sparse form, and most learners will employ sparse computations for these datasets. If the input dataset filename is factors.ssv, the output dataset filename is activations.ssv, and attribute Act of activations.ssv is to be used as the output, one might type

./afc roc in factors.ssv:activations.ssv:Act ...
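As an illustration, the colon-separated June specification could be split as follows. This is a minimal sketch under the rules just described; parse_june_spec is a hypothetical name, and the exact validation the program performs is not documented here.

```python
def parse_june_spec(spec: str):
    """Split 'input:output:attribute' into its three parts (hypothetical helper)."""
    parts = spec.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected input_file:output_file:attribute")
    input_file, output_file, attribute = parts
    # The attribute name ends the token, so it must not look like a
    # csv or fds filename.
    if attribute.endswith((".csv", ".fds")):
        raise ValueError("output attribute must not end in .csv or .fds")
    return input_file, output_file, attribute
```

Applied to the example above, the specification splits into the input file factors.ssv, the output file activations.ssv, and the attribute Act.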

If a file fits none of the csv, fds, or June formats, it is assumed to be in spardat format. The spardat format includes the output as the first column of each record, stored as a real number. Because the output will be stored internally in a sparse binary structure, the real-valued outputs in the file must be thresholded. The threshold specifier consists of a colon, a threshold value, and a plus or minus sign. A plus sign indicates that output values greater than or equal to the threshold value are considered positive. A minus sign indicates that values less than or equal to the threshold are considered positive. As an example, suppose sp.txt is a spardat format dataset, and we wish to consider all outputs with value greater than or equal to 42.42 as positive outputs. The command line below specifies this using the notation just described.

./afc roc in sp.txt:42.42+ ...
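The threshold notation can be made concrete with a small Python sketch. The helper names parse_spardat_spec and is_positive are assumptions introduced for illustration, and the parsing shown is one plausible reading of the notation rather than the program's own code.

```python
def parse_spardat_spec(spec: str):
    """Split 'filename:THRESHOLD+' or 'filename:THRESHOLD-' (hypothetical helper)."""
    filename, sep, tail = spec.rpartition(":")
    if not sep or not filename or len(tail) < 2 or tail[-1] not in "+-":
        raise ValueError("expected filename:threshold followed by + or -")
    return filename, float(tail[:-1]), tail[-1]


def is_positive(value: float, threshold: float, sign: str) -> bool:
    """Apply the thresholding rule to one real-valued output."""
    if sign == "+":
        return value >= threshold  # plus: at or above the threshold
    return value <= threshold      # minus: at or below the threshold
```

With the specification sp.txt:42.42+, an output of exactly 42.42 counts as positive and an output of 42.0 does not; with a minus sign instead, the two results are reversed except at the threshold itself, which is positive in both cases.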


Copyright 2004 Paul Komarek, komarek@cmu.edu