%matplotlib inline
from data_vis import load, run
The community is torn between love and hate towards parallel coordinates. How can we demonstrate an effective use for parallel coordinates, supported by a metric?
In binary classification, we use a set of features to assign each sample in a dataset to one of two classes.
Parallel coordinates can help visualize this problem. The goal here is to identify multiple features which we can clump together to create a particularly accurate division. In this way, we can remove unhelpful features and only include helpful features.
This project uses a dataset of heart rates. Heart rates from normal sinus rhythm (healthy) are shown in blue. Heart rates from the onset of sudden cardiac arrest (unhealthy) are shown in red. The dataset contains almost 300 samples, each with 10 derived features, stored in Python pickle files. All the features are normalized beforehand.
Machine learning problem
Two sets of classes (healthy and unhealthy heart rates)
Dataset = set of heart rate (HRV) derived features and class labels
Divide each sample into one of two classes
Previous work used clustering to derive these classes, but none addressed the case where the labels are already given
Visualize binary classification problem
Each axis represents a heart rate derived feature
Group each class together and keep the two classes as far apart as possible
Visualize helpful and unhelpful features
Red = onset of sudden cardiac arrest (unhealthy) heart rates
Blue = normal sinus rhythm (healthy) heart rates
10 heart rate variability features
Sample features stored in Python pickle files for fast loading
Features are normalized beforehand
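To make the layout concrete, here is a minimal sketch of a parallel coordinates plot for two classes. It is not the `run` function from `data_vis`; the feature subset, the synthetic values, and the variable names are assumptions chosen only for illustration. Each sample becomes one polyline across the feature axes, colored by class.

```python
import random

from matplotlib.figure import Figure

# Hypothetical subset of feature names and synthetic, already-normalized samples.
features = ["mean", "median", "sdhr", "nn_50"]
rng = random.Random(0)
healthy = [[rng.uniform(0.2, 0.6) for _ in features] for _ in range(5)]
unhealthy = [[rng.uniform(0.4, 0.9) for _ in features] for _ in range(5)]

fig = Figure(figsize=(8, 3))
ax = fig.subplots()
xs = list(range(len(features)))

# One polyline per sample: blue for healthy, red for unhealthy.
for sample in healthy:
    ax.plot(xs, sample, color="blue", alpha=0.5)
for sample in unhealthy:
    ax.plot(xs, sample, color="red", alpha=0.5)

# One vertical axis per feature.
for x in xs:
    ax.axvline(x, color="black", linewidth=0.5)

ax.set_xticks(xs)
ax.set_xticklabels(features)
ax.set_ylim(0.0, 1.0)
```

In a notebook, evaluating `fig` as the last expression of a cell should render the figure; elsewhere, `fig.savefig("plot.png")` writes it to disk.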
The main complaint about parallel coordinates is how strongly the ordering of the axes affects the graph. Addressing this complaint, this project makes the following contributions:
A metric for scoring parallel coordinate plots made for binary classification.
Automatic sorting methods, three of four of which score higher than random sorting under this metric.
The use of cubic splines to show the average plot for each class.
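As a rough illustration of the spline idea, the sketch below fits a cubic spline through the per-class average of each feature, using `scipy.interpolate.CubicSpline`. The averages and the choice of SciPy are assumptions; the project's own spline code is not shown in this notebook.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical per-class averages of four normalized features, in axis order.
axis_positions = np.arange(4)
healthy_means = np.array([0.30, 0.45, 0.40, 0.55])
unhealthy_means = np.array([0.60, 0.50, 0.75, 0.35])

# Evaluate each spline on a dense grid between the first and last axis.
dense_x = np.linspace(0, 3, 61)
healthy_curve = CubicSpline(axis_positions, healthy_means)(dense_x)
unhealthy_curve = CubicSpline(axis_positions, unhealthy_means)(dense_x)
```

Each curve passes exactly through its class averages at the axis positions and is smooth in between, so it can be drawn on top of the individual sample lines.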
Take the average of each feature for both classes
Each subplot (the area between two axes) is represented by a trapezoid between these average features
Scoring algorithm is modeled after the area of a trapezoid, because we want to maximize the area between these two classes
Area of a trapezoid = (height_a + height_b) / 2.0 * base
For every subplot:
Let height_a be the difference of the average features between both classes on the left feature.
Let height_b be the difference of the average features between both classes on the right feature.
score += abs(height_a * height_b)
For any given graph, we want the highest score possible.
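The steps above can be sketched in a few lines of plain Python. The function name `plot_score`, the dictionaries of class averages, and their values are hypothetical. The sketch follows the update rule exactly as listed, multiplying the two heights rather than averaging them as the trapezoid formula would.

```python
def plot_score(order, healthy_means, unhealthy_means):
    """Sum |height_a * height_b| over every pair of adjacent axes."""
    score = 0.0
    for left, right in zip(order, order[1:]):
        # Difference of the class averages on the left and right axes.
        height_a = healthy_means[left] - unhealthy_means[left]
        height_b = healthy_means[right] - unhealthy_means[right]
        score += abs(height_a * height_b)
    return score


# Hypothetical class averages for four normalized features.
healthy_means = {"mean": 0.30, "median": 0.45, "sdhr": 0.40, "nn_50": 0.55}
unhealthy_means = {"mean": 0.60, "median": 0.50, "sdhr": 0.75, "nn_50": 0.35}

score_a = plot_score(["mean", "median", "sdhr", "nn_50"], healthy_means, unhealthy_means)
score_b = plot_score(["median", "mean", "sdhr", "nn_50"], healthy_means, unhealthy_means)
```

With these made-up averages, the second ordering scores higher because it places the two features with the largest class separation (`mean` and `sdhr`) next to each other.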
The average score for random sorting of axes (based on 10 iterations) is 0.1550.
Three of the four sorting algorithms scored higher than this:
SCD increasing order = 0.2225
Absolute difference between averages increasing order = 0.1824
Normal sinus rhythm increasing order = 0.1750
Non-absolute differences between averages increasing order = 0.1046
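One way to read the four orderings is as four sort keys over the per-class feature averages. The sketch below is an interpretation, not the implementation behind `run`: the averages are made up, and both the sign convention of the difference (healthy minus unhealthy) and the increasing direction of every sort are assumptions.

```python
# Hypothetical class averages for four normalized features.
healthy_means = {"mean": 0.30, "median": 0.45, "sdhr": 0.40, "nn_50": 0.55}
unhealthy_means = {"mean": 0.60, "median": 0.50, "sdhr": 0.75, "nn_50": 0.35}
features = list(healthy_means)

# Assumed sign convention: healthy average minus unhealthy (SCD) average.
delta = {f: healthy_means[f] - unhealthy_means[f] for f in features}

orderings = {
    # Increasing average of the SCD (unhealthy) class.
    "scd": sorted(features, key=lambda f: unhealthy_means[f]),
    # Increasing average of the normal sinus rhythm (healthy) class.
    "norm": sorted(features, key=lambda f: healthy_means[f]),
    # Increasing absolute difference between the class averages.
    "absolute": sorted(features, key=lambda f: abs(delta[f])),
    # Increasing signed (non-absolute) difference between the class averages.
    "normal": sorted(features, key=lambda f: delta[f]),
}
```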
# sorts the parallel coordinates in increasing order based on the average SCD class normalized feature values
run("scd")
['std_dev', 'min', 'mean', 'sdsd', 'median', 'rmsdd', 'max', 'outlier', 'nn_50', 'sdhr'] score: 0.22249022561954082
# Absolute sorts the deltas by absolute value,
# giving us the features in order of highest to lowest difference by average
run("absolute")
['sdhr', 'nn_50', 'outlier', 'max', 'min', 'median', 'sdsd', 'mean', 'rmsdd', 'std_dev'] score: 0.18245909806126992
# sorts the parallel coordinates in increasing order based on the average normal sinus rhythm class normalized feature values
run("norm")
['std_dev', 'min', 'median', 'sdsd', 'mean', 'rmsdd', 'max', 'outlier', 'nn_50', 'sdhr'] score: 0.17508011502406648
# normal gives us the sorting by delta averages,
# starting with red on bottom and ending up with blue on top
run("normal")
['sdhr', 'std_dev', 'rmsdd', 'sdsd', 'mean', 'median', 'min', 'max', 'outlier', 'nn_50'] score: 0.1046859827814546
# Randomly sort the axes
run(spline=True)
run()
run()
run()
['nn_50', 'max', 'sdhr', 'sdsd', 'min', 'outlier', 'median', 'rmsdd', 'std_dev', 'mean'] error: 0.1757012441236637
['std_dev', 'sdsd', 'median', 'nn_50', 'mean', 'rmsdd', 'max', 'sdhr', 'min', 'outlier'] error: 0.20959513422663908
['median', 'mean', 'outlier', 'sdhr', 'rmsdd', 'nn_50', 'std_dev', 'max', 'min', 'sdsd'] error: 0.17808626566417704
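The random baseline quoted earlier (an average over 10 random axis orderings) could be estimated along the following lines. This is a sketch under assumptions: `plot_score` and `random_baseline` are hypothetical helpers, the class averages are made up, and the fixed seed is there only to make the example repeatable.

```python
import random


def plot_score(order, healthy_means, unhealthy_means):
    """Sum |height_a * height_b| over every pair of adjacent axes."""
    score = 0.0
    for left, right in zip(order, order[1:]):
        height_a = healthy_means[left] - unhealthy_means[left]
        height_b = healthy_means[right] - unhealthy_means[right]
        score += abs(height_a * height_b)
    return score


def random_baseline(features, healthy_means, unhealthy_means, iterations=10, seed=0):
    """Average the score over `iterations` random axis orderings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(iterations):
        order = list(features)
        rng.shuffle(order)  # random axis order for this iteration
        total += plot_score(order, healthy_means, unhealthy_means)
    return total / iterations


# Hypothetical class averages for four normalized features.
healthy_means = {"mean": 0.30, "median": 0.45, "sdhr": 0.40, "nn_50": 0.55}
unhealthy_means = {"mean": 0.60, "median": 0.50, "sdhr": 0.75, "nn_50": 0.35}

baseline = random_baseline(list(healthy_means), healthy_means, unhealthy_means)
```

A sorting method is then worth keeping only if its score is higher than this baseline.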