next up previous contents
Next: 6.1.2 KNN Up: 6.1 Classifiers Previous: 6.1 Classifiers   Contents


6.1.1 SVM

To describe the SVM classifier, we must first make several definitions. Let $ S_-$ be the set of negative rows from $ \mathbf{X}$, and $ S_+$ be the set of positive rows. Suppose that $ S_-$ and $ S_+$ are linearly separable. A separating hyperplane is a hyperplane with unit normal $ \mathbf{beta}$ and distance to the origin $ \beta_0$ such that

$\displaystyle \forall \ensuremath{\mathbf{x}} \in S_+$   $\displaystyle \ensuremath{\mathbf{\beta}}^T \ensuremath{\mathbf{x}} + \beta_0 > 0$ (6.1)
$\displaystyle \forall \ensuremath{\mathbf{x}} \in S_-$   $\displaystyle \ensuremath{\mathbf{\beta}}^T \ensuremath{\mathbf{x}} + \beta_0 < 0$ (6.2)

That is to say all of $ S_+$ is on one side of the separating hyperplane while all of $ S_-$ is on the other side. Define the positive hyperplane as a hyperplane parallel to the separating hyperplane which touches the convex hull of $ S_+$, that is
$\displaystyle \forall \ensuremath{\mathbf{x}} \in S_+$   $\displaystyle \ensuremath{\mathbf{\beta}}^T \ensuremath{\mathbf{x}} + \beta_+ \ge 0$ (6.3)
$\displaystyle \exists \ensuremath{\mathbf{x}} \in S_+$   such that $\displaystyle \ensuremath{\mathbf{\beta}}^T \ensuremath{\mathbf{x}} + \beta_+ = 0$ (6.4)

Define the negative hyperplane in the same manner. The margin is the space bounded by the positive and negative hyperplanes.

For datasets with binary outcomes, we can code each outcome $ y_i$ as 1 and -1 instead of 1 and 0. With this coding we can unify Equations 6.1 and 6.2 as

$\displaystyle y_i (\ensuremath{\mathbf{\beta}}^T \ensuremath{\mathbf{x}} + \beta_+) > 0$ (6.5)

If the sets $ S_+$ and $ S_-$ are not linearly separable then our definition of a separating hyperplane can be adjusted to allow exceptions. For instance we can allow violations by replacing Equation 6.5 with

$\displaystyle y_i (\ensuremath{\mathbf{\beta}}^T \ensuremath{\mathbf{x}} + \beta_+) \ge C$ (6.6)

where, for instance, a small value $ C$ would allow only minor infractions. Observe that when the data is linearly separable, $ C$ can be as large as the half the width of the margin without any violations occurring. Though the margin as defined above does not exist for data which is not linearly separable, it is common to refer to $ C$ as the size of the margin.

SVMs are often described as a maximum margin classifier because they choose a separating hyperplane which maximizes $ C$ under some constraint on the number and size of separation violations. A precise definition of an SVM can be written as the mathematical program

\begin{displaymath}\begin{array}{lrcl} \mbox{Maximize} & \multicolumn{3}{l}{C} \...
...& \sum_{i=1}^R \xi_i & \le & D  & \xi_i & \ge & 0 \end{array}\end{displaymath} (6.7)

where $ D$ is a restriction on the cumulative size of the violations $ \xi_i$ and $ \ensuremath{\mathbf{x_i}}, y_i$ come from the dataset. This formulation of SVMs roughly follows that of Hastie et al. [10], where it is shown that the above mathematical program is equivalent to the quadratic program

\begin{displaymath}\begin{array}{llrcl} QP: & \mbox{Maximize} & \multicolumn{3}{...
...e & \gamma  & \alpha_i, \mu_i, \xi_i & \ge & 0  \end{array}\end{displaymath} (6.8)

The SVM described above is a linear SVM. It may be desirable to apply a nonlinear transform to data which is not linearly separable before applying SVM. The separating hyperplane found for the transformed space becomes a nonlinear boundary in the original space. A common transformation uses Radial Basis Functions, which are simply a basis created from a properly scaled probability density function (pdf). Note that this pdf is not interpreted probabilistically. [34,10]

We did not implement SVM, and instead chose to use the popular SVM$ ^{\mbox{\emph{light}}}$ package, version 5 [18,16]. This software uses a variety of tricks to find solutions to QP, defined in Equation 6.8, more quickly than traditional quadratic programming software. SVM$ ^{\mbox{\emph{light}}}$ can function as a linear or RBF SVM, where the RBF SVM uses the Normal pdf and scale parameter gamma. The default parameters for the linear SVM work well in our experiments. The default value of gamma made RBF SVM experiments too slow, and we adjusted it to gamma=0.001. Smaller values of gamma did not improve speed further. Joachims [16] describes the time complexity of SVM$ ^{\mbox{\emph{light}}}$ in a manner that we cannot easily describe here. While this complexity is bounded above by O$ \left( R^2 M \right)$, much better performance is generally observed.

We have written a wrapper around SVM$ ^{\mbox{\emph{light}}}$ so that it may be timed in the same manner as the other algorithms, as well as report scores in the same manner as the other algorithms. Before calling SVM$ ^{\mbox{\emph{light}}}$, we must first write data to a file in a special format. When the training phase is done, we read data from a file created by SVM$ ^{\mbox{\emph{light}}}$. A similar process happens during classification. This is repeated for every fold. We assume that SVM$ ^{\mbox{\emph{light}}}$ spends the same amount of time reading the files as our SVM wrapper spends writing them, and vice-versa. Let the filetime be the amount of time we spend reading and writing files. We do not want to include the filetime in the published algorithm time. Therefore we report the algorithm time as we normally measure it, less twice the filetime.


next up previous contents
Next: 6.1.2 KNN Up: 6.1 Classifiers Previous: 6.1 Classifiers   Contents
Copyright 2004 Paul Komarek, komarek@cmu.edu