In [2]:
%matplotlib inline
from data_vis import load, run
In [3]:
load()

Automatic Parallel Coordinates Axes Sorting for Binary Classification

by Luke Plewa - June 11th, 2015

Introduction

The visualization community is divided over parallel coordinates. Can we demonstrate an effective use for them, and back it with a metric?

In binary classification, we use a set of features to assign each sample in a dataset to one of two classes.

Parallel coordinates can help visualize this problem. The goal is to identify groups of features that, taken together, separate the two classes especially well. This lets us drop unhelpful features and keep only the helpful ones.

This project uses a dataset of heart rates. Normal sinus rhythm (healthy) heart rates are shown in blue. Heart rates from the onset of sudden cardiac arrest (unhealthy) are shown in red. The dataset contains almost 300 samples, each with 10 derived features, stored in Python pickle files. All features are normalized beforehand.

Binary Classification

  • Machine learning problem

  • Two classes (healthy and unhealthy heart rates)

  • Dataset = set of heart rate variability (HRV) derived features and class labels

  • Assign each sample to one of two classes

Parallel Coordinates

  • Previous work uses clustering to derive the classes; none addresses the case where the labels are already given

  • Visualize binary classification problem

  • Each axis represents a heart rate derived feature

  • Keep each class tightly clustered and the two classes as far apart as possible

  • Visualize helpful and unhelpful features

Examples

  • Red = onset of sudden cardiac arrest (unhealthy) heart rates

  • Blue = normal sinus rhythm (healthy) heart rates

  • Almost 300 samples

  • 10 heart rate variability features

  • Sample features stored in Python pickle files for fast loading

  • Features are normalized beforehand

Contributions

The main complaint about parallel coordinates is that the ordering of the axes strongly affects how the graph reads. In response, this project provides the following contributions:

  • A metric for scoring parallel coordinate plots made for binary classification.

  • Automatic sorting methods, most of which score higher than random sorting under this metric.

  • The use of cubic splines to show the average plot for each class.

The Scoring Algorithm

Background

  • Take the average of each feature within each class

  • Each subplot (the area between two axes) is represented by a trapezoid between these average features

  • Scoring algorithm is modeled after the area of a trapezoid, because we want to maximize the area between these two classes

  • Area of a trapezoid = (height_a + height_b) / 2.0 * base

Implementation

For every subplot:

  • Let height_a be the difference between the two classes' averages for the feature on the left axis.

  • Let height_b be the difference between the two classes' averages for the feature on the right axis.

  • score += abs(height_a * height_b)

For any given graph, we want the highest score possible.
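
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind `run()`, which the notebook does not show. The function name `score_axes` and the averages in the example are made up for demonstration. Only the feature names come from the dataset.

```python
def score_axes(order, avg_a, avg_b):
    """Sum |height_a * height_b| over every pair of adjacent axes.

    order  -- feature names in the order the axes are drawn
    avg_a  -- per-feature averages for one class (dict: name -> value)
    avg_b  -- per-feature averages for the other class
    """
    total = 0.0
    for left, right in zip(order, order[1:]):
        height_a = avg_a[left] - avg_b[left]    # class gap on the left axis
        height_b = avg_a[right] - avg_b[right]  # class gap on the right axis
        total += abs(height_a * height_b)
    return total


# Made-up normalized averages, for illustration only.
healthy = {"mean": 0.6, "sdsd": 0.2, "nn_50": 0.9}
unhealthy = {"mean": 0.4, "sdsd": 0.1, "nn_50": 0.4}

print(score_axes(["mean", "sdsd", "nn_50"], healthy, unhealthy))
print(score_axes(["sdsd", "mean", "nn_50"], healthy, unhealthy))
```

Because each subplot contributes the product of its two gaps, placing features with large gaps next to each other raises the score. In the example, the second ordering scores higher because it places the two largest gaps side by side.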

Results

  • The average score for random sorting of axes (based on 10 iterations) is 0.1550.

  • Three of the four sorting algorithms scored higher than this baseline.

  • SCD increasing order = 0.2225

  • Absolute difference between averages, highest to lowest = 0.1824

  • Normal sinus rhythm increasing order = 0.1750

  • Non-absolute differences between averages increasing order = 0.1046
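
The four orderings can each be expressed as a sort key over the per-class averages. The sketch below is one way to write them, assuming the averages are held in plain dictionaries. The function `sort_axes` and the example values are hypothetical, and the notebook's actual `run()` may differ. Two details are assumptions: the sign of the non-absolute difference (SCD minus normal sinus rhythm here), and that the absolute ordering runs from highest to lowest, following the notebook's own comment on `run("absolute")`.

```python
def sort_axes(method, avg_scd, avg_nsr):
    """Return feature names ordered by one of four hypothetical sort keys."""
    features = list(avg_scd)
    if method == "scd":
        # Increasing average of the SCD (unhealthy) class.
        return sorted(features, key=lambda f: avg_scd[f])
    if method == "norm":
        # Increasing average of the normal sinus rhythm (healthy) class.
        return sorted(features, key=lambda f: avg_nsr[f])
    if method == "absolute":
        # Largest absolute gap between class averages first (assumed direction).
        return sorted(features, key=lambda f: abs(avg_scd[f] - avg_nsr[f]),
                      reverse=True)
    if method == "normal":
        # Increasing signed gap; the sign convention is an assumption.
        return sorted(features, key=lambda f: avg_scd[f] - avg_nsr[f])
    raise ValueError("unknown method: " + method)


# Made-up normalized averages, for illustration only.
avg_scd = {"mean": 0.5, "sdsd": 0.1, "nn_50": 0.3, "sdhr": 0.7}
avg_nsr = {"mean": 0.6, "sdsd": 0.4, "nn_50": 0.9, "sdhr": 0.2}

for method in ("scd", "absolute", "norm", "normal"):
    print(method, sort_axes(method, avg_scd, avg_nsr))
```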

In [6]:
# sorts the parallel coordinates in increasing order based on the average SCD class normalized feature values
run("scd")
['std_dev', 'min', 'mean', 'sdsd', 'median', 'rmsdd', 'max', 'outlier', 'nn_50', 'sdhr']
score: 0.22249022561954082

In [3]:
# Absolute sorts the deltas by absolute value,
# giving us the features in order of highest to lowest difference by average

run("absolute")
['sdhr', 'nn_50', 'outlier', 'max', 'min', 'median', 'sdsd', 'mean', 'rmsdd', 'std_dev']
score: 0.18245909806126992

In [6]:
# sorts the parallel coordinates in increasing order based on the average normal sinus rhythm class normalized feature values
run("norm")
['std_dev', 'min', 'median', 'sdsd', 'mean', 'rmsdd', 'max', 'outlier', 'nn_50', 'sdhr']
score: 0.17508011502406648

In [4]:
# normal gives us the sorting by delta averages,
# starting with red on bottom and ending up with blue on top
run("normal")
['sdhr', 'std_dev', 'rmsdd', 'sdsd', 'mean', 'median', 'min', 'max', 'outlier', 'nn_50']
score: 0.1046859827814546

In [7]:
# Randomly sort the axes
run(spline=True)
run()
run()
run()
['nn_50', 'max', 'sdhr', 'sdsd', 'min', 'outlier', 'median', 'rmsdd', 'std_dev', 'mean']
error: 0.1757012441236637

['std_dev', 'sdsd', 'median', 'nn_50', 'mean', 'rmsdd', 'max', 'sdhr', 'min', 'outlier']
error: 0.20959513422663908

['median', 'mean', 'outlier', 'sdhr', 'rmsdd', 'nn_50', 'std_dev', 'max', 'min', 'sdsd']
error: 0.17808626566417704
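
A random-sorting baseline like the one reported in the results can be estimated by shuffling the axes several times and averaging the scores. The sketch below assumes the product-based score described under Implementation. The names `score_axes` and `random_baseline`, the fixed seed, and the example averages are all illustrative and not taken from the project's code.

```python
import random


def score_axes(order, avg_a, avg_b):
    """Sum |height_a * height_b| over every pair of adjacent axes."""
    total = 0.0
    for left, right in zip(order, order[1:]):
        height_a = avg_a[left] - avg_b[left]
        height_b = avg_a[right] - avg_b[right]
        total += abs(height_a * height_b)
    return total


def random_baseline(avg_a, avg_b, iterations=10, seed=0):
    """Average the score over `iterations` random axis orderings."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    order = list(avg_a)
    scores = []
    for _ in range(iterations):
        rng.shuffle(order)
        scores.append(score_axes(order, avg_a, avg_b))
    return sum(scores) / len(scores)


# Made-up normalized averages, for illustration only.
avg_scd = {"mean": 0.5, "sdsd": 0.1, "nn_50": 0.3, "sdhr": 0.7}
avg_nsr = {"mean": 0.6, "sdsd": 0.4, "nn_50": 0.9, "sdhr": 0.2}

print(random_baseline(avg_scd, avg_nsr))
```

With only 10 iterations the average can vary noticeably between runs, as the individual random scores above suggest, so a larger iteration count gives a steadier baseline.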