In [2]:

```
%matplotlib inline
from data_vis import load, run
```

In [3]:

```
load()
```

The community is torn between love and hate towards parallel coordinates. How can we demonstrate an effective use for parallel coordinates, supported by a metric?

In binary classification, we use a set of features to assign each sample in a dataset to one of two classes.

Parallel coordinates can help visualize this problem. The goal here is to identify features that, grouped together, separate the two classes particularly well. In this way, we can drop unhelpful features and keep only the helpful ones.

This project uses a dataset of heart rates. Normal sinus rhythm (healthy) heart rates are shown in blue. Heart rates from the onset of sudden cardiac arrest (unhealthy) are shown in red. The dataset has almost 300 samples and 10 derived features, stored in Python pickle files. All features are normalized beforehand.

Machine learning problem

Two classes (healthy and unhealthy heart rates)

Dataset = set of heart rate variability (HRV) derived features and class labels

Assign each sample to one of two classes

Previous work used clustering to derive these classes, but no previous work starts from labels that are already given

Visualize binary classification problem

Each axis represents a heart rate derived feature

Group each class together and keep the two classes as far apart as possible

Visualize helpful and unhelpful features

Red = onset of sudden cardiac arrest (unhealthy) heart rates

Blue = normal sinus rhythm (healthy) heart rates

Almost 300 samples

10 heart rate variability features

Sample features are stored in Python pickle files for fast loading

Features are normalized beforehand

The main complaint about parallel coordinates is how strongly the ordering of the axes affects the plot. In response to this complaint, this project makes the following contributions:

A metric for scoring parallel coordinate plots made for binary classification.

Automatic sorting methods which score higher than random sorting under this metric.

The use of cubic splines to show the average plot for each class.
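
As a rough illustration of the spline idea, here is a minimal pure-Python sketch that fits a natural cubic spline through one class's average feature values. It assumes the axes sit at x = 0, 1, 2, ... with unit spacing. The function names, the natural boundary condition, and the toy averages are all assumptions made for this example; the curves drawn by `run(spline=True)` may be built differently.

```python
def second_derivatives(ys):
    """Second derivatives of a natural cubic spline through the points (i, ys[i])."""
    n = len(ys) - 1                    # number of segments between axes
    m = [0.0] * (n + 1)                # natural spline (assumption): m[0] = m[n] = 0
    # Interior knots satisfy m[i-1] + 4*m[i] + m[i+1] = rhs (unit axis spacing).
    rhs = [6.0 * (ys[i - 1] - 2.0 * ys[i] + ys[i + 1]) for i in range(1, n)]
    c_prime, d_prime = [], []
    for k, d in enumerate(rhs):        # forward sweep of the Thomas algorithm
        denom = 4.0 - (c_prime[k - 1] if k else 0.0)
        c_prime.append(1.0 / denom)
        d_prime.append((d - (d_prime[k - 1] if k else 0.0)) / denom)
    for k in range(len(rhs) - 1, -1, -1):   # back substitution
        m[k + 1] = d_prime[k] - c_prime[k] * m[k + 2]
    return m


def spline_value(ys, m, x):
    """Evaluate the spline at x, where 0 <= x <= len(ys) - 1."""
    i = min(int(x), len(ys) - 2)       # segment index, clamped at the last axis
    t = x - i                          # position inside the segment, 0..1
    return (m[i] * (1 - t) ** 3 / 6 + m[i + 1] * t ** 3 / 6
            + (ys[i] - m[i] / 6) * (1 - t) + (ys[i + 1] - m[i + 1] / 6) * t)


healthy_avg = [0.3, 0.5, 0.8, 0.4]     # made-up class averages, one per axis
m = second_derivatives(healthy_avg)
# Sample the curve at ten points per subplot, from the first axis to the last.
curve = [spline_value(healthy_avg, m, step / 10) for step in range(31)]
```

The curve passes through each class average exactly at the axes and bends smoothly in between, which is what makes the two class averages easier to follow than straight line segments.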

Take the average of each feature for each class

Each subplot (the area between two axes) is represented by a trapezoid between these average features

The scoring algorithm is modeled after the area of a trapezoid, because we want to maximize the area between the two class averages. The score itself uses the product of the two heights rather than their mean.

Area of a trapezoid = (height_a + height_b) / 2.0 * base

For every subplot:

Let height_a be the difference between the two class averages on the left axis.

Let height_b be the difference between the two class averages on the right axis.

score += abs(height_a * height_b)

For any given graph, we want the highest score possible.
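
The steps above can be sketched in a few lines of Python. This is only an illustration: `class_means`, `plot_score`, and the toy feature values are hypothetical names and data, not the code inside `data_vis`.

```python
from statistics import mean


def class_means(samples):
    """Average each feature column over a list of equal-length feature rows."""
    return [mean(column) for column in zip(*samples)]


def plot_score(healthy_means, unhealthy_means, order):
    """Sum abs(height_a * height_b) over every pair of adjacent axes in `order`."""
    # Height on each axis = difference between the two class averages.
    heights = [healthy_means[i] - unhealthy_means[i] for i in order]
    return sum(abs(a * b) for a, b in zip(heights, heights[1:]))


# Toy, made-up normalized samples: two per class, three features each.
healthy = [[0.2, 0.5, 0.9], [0.4, 0.5, 0.7]]
unhealthy = [[0.8, 0.5, 0.1], [0.6, 0.7, 0.3]]

h = class_means(healthy)
u = class_means(unhealthy)

# The same features score differently depending on the axis order.
print(plot_score(h, u, [0, 1, 2]))
print(plot_score(h, u, [0, 2, 1]))
```

In this toy data, placing the two features with the largest class gaps next to each other (order `[0, 2, 1]`) gives a higher score than the original order.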

The average score for random sorting of axes (based on 10 iterations) is 0.1550.

Three of the four sorting algorithms scored higher than this; only the non-absolute difference ordering scored lower.

SCD class average, increasing order = 0.2225

Absolute difference between class averages, increasing order = 0.1824

Normal sinus rhythm class average, increasing order = 0.1750

Non-absolute difference between class averages, increasing order = 0.1046
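
To make the comparison concrete, here is a hedged sketch of the four orderings and the random baseline. The method names mirror the strings passed to `run()` in the cells below, but the function bodies are assumptions, not the `data_vis` implementation. In particular, the sign convention for `"normal"` (healthy minus unhealthy) and the seed are choices made for this example. The class averages are made up, so the printed scores will not match the numbers reported above.

```python
import random


def plot_score(healthy_means, unhealthy_means, order):
    """Sum abs(height_a * height_b) over every pair of adjacent axes in `order`."""
    heights = [healthy_means[i] - unhealthy_means[i] for i in order]
    return sum(abs(a * b) for a, b in zip(heights, heights[1:]))


def axis_order(healthy_means, unhealthy_means, method):
    """Return axis indices sorted in increasing order of the chosen key."""
    keys = {
        "scd": lambda i: unhealthy_means[i],                          # SCD class average
        "norm": lambda i: healthy_means[i],                           # normal sinus rhythm average
        "absolute": lambda i: abs(healthy_means[i] - unhealthy_means[i]),
        "normal": lambda i: healthy_means[i] - unhealthy_means[i],    # signed difference (assumed sign)
    }
    return sorted(range(len(healthy_means)), key=keys[method])


def random_baseline(healthy_means, unhealthy_means, iterations=10, seed=0):
    """Average score over `iterations` random axis orders."""
    rng = random.Random(seed)
    axes = list(range(len(healthy_means)))
    total = 0.0
    for _ in range(iterations):
        rng.shuffle(axes)
        total += plot_score(healthy_means, unhealthy_means, axes)
    return total / iterations


# Made-up class averages for four normalized features.
healthy_means = [0.3, 0.5, 0.8, 0.4]
unhealthy_means = [0.7, 0.6, 0.2, 0.45]

for method in ("scd", "absolute", "norm", "normal"):
    order = axis_order(healthy_means, unhealthy_means, method)
    print(method, order, round(plot_score(healthy_means, unhealthy_means, order), 4))
print("random", round(random_baseline(healthy_means, unhealthy_means), 4))
```

Fixing the seed keeps the baseline reproducible, which matters when an average over only 10 iterations is used as the reference point.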

In [6]:

```
# Sort the axes in increasing order of the SCD class's average normalized feature value
run("scd")
```

In [3]:

```
# "absolute" sorts the axes by the absolute value of the difference between class averages,
# giving us the features in order of highest to lowest difference by average
run("absolute")
```

In [6]:

```
# Sort the axes in increasing order of the normal sinus rhythm class's average normalized feature value
run("norm")
```

In [4]:

```
# "normal" sorts the axes by the non-absolute difference between class averages,
# starting with red on bottom and ending up with blue on top
run("normal")
```

In [7]:

```
# Randomly sort the axes
run(spline=True)
run()
run()
run()
```