{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**CSC 466: Knowledge Discovery in Data**\n", "\n", "**Individual Test**\n", "\n", "**Task 2**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Your Name:** \n", "\n", "**Cal Poly Email:** \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Your Assignment**:\n", "\n", "1. Run 10-fold cross-validation on top of the trainMagicClassifier() classification method\n", "2. For each fold, report prediction accuracy\n", "3. Report overall prediction accuracy\n", "4. Report the overall confusion matrix" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## Imports\n", "\n", "import numpy as np\n", "from matplotlib import pyplot as plt\n", "import seaborn\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Classifier**\n", "\n", "trainMagicClassifier() is a simple implementation of a Support Vector Machine (SVM) with some hardcoded parameters.\n", "It takes as input a dataset \"data\", the array of labels \"labels\" (data point data[i] has class label labels[i]), and two hyperparameters: Rate (in the context of SVMs, this parameter is called the learning rate) and the number of iterations to complete (SVMs are typically trained iteratively until convergence; this classifier replaces the convergence test with a fixed number of iterations).\n", "\n", "You do not need to understand the code in trainMagicClassifier(), nor do you need to make any changes in this part of the notebook.
The trained model consists of three coefficients model[0], model[1], model[2], which combine to form an equation of a line in 2D that separates the two classes:\n", "\n", "$$model[0]\\cdot x + model[1]\\cdot y + model[2] = 0$$\n", "\n", "where $x$ and $y$ are the coordinates of the 2D data point.\n", "\n", "**Note:** like a standard SVM, our classifier works only on binary classes, with class labels +1 and -1.\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def trainMagicClassifier(data, labels, Rate, iterations):\n", "\n", "    C = 2 ## error significance\n", "    thetas = np.ones(len(data))*-1\n", "    w = np.array([-1,0,0]) # starting approximation \n", "    \n", "    # svmData = np.array([data[0],data[1], thetas])\n", "    for j in range(iterations):\n", "        HingeLoss = np.array([0,0,0])\n", "        for i in range(len(data)):\n", "            datum = data[i]\n", "            \n", "            ## compute the hinge loss\n", "            y = labels[i]\n", "            uVector = np.array([datum[0], datum[1], thetas[i]])\n", "            if (w[0]*datum[0]+ w[1]*datum[1]+ w[2]*thetas[i])*y <= 1:\n", "                HingeLoss = HingeLoss - y*uVector\n", "            \n", "        \n", "        w = w - Rate *(w + C * HingeLoss)\n", "        ##print(\"W:\", w)\n", "        ## print(\"Loss:\", HingeLoss)\n", "        ## plotW(slo,paso,w)\n", "    return(w)\n", "    " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Predictor**:\n", "\n", "Using the model trained in trainMagicClassifier(), the predictor computes on which side of the line separating the classes a data point lies, and returns the side (above = +1, below = -1) as the predicted class.\n", "\n", "You do not need to modify this code." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def predict(model, point):\n", "    value = np.sign(model[0]*point[0]+model[1]*point[1]+model[2])\n", "    ## print(value)\n", "    return value" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The generatePredictions() function simply generates the list of predictions for a given collection of data points." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def generatePredictions(model, data):\n", "    predictions = [predict(model, point) for point in data]\n", "    return predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Execution Starts Here**\n", "\n", "First, we read the data and split the input into the data table and the class labels table." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "filename=\"data5.csv\"\n", "\n", "rawData = np.loadtxt(filename, delimiter = \",\")\n", "\n", "## let's keep only the two columns with the data attributes\n", "\n", "data = rawData[:,0:2]\n", "labels = rawData[:,2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Training the Classifier**\n", "\n", "We set the learning rate to 0.01 (established empirically, don't worry about it).\n", "Then we train the classifier on the entire data set." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "rate = 0.01\n", "model = trainMagicClassifier(data, labels, rate, 20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, we display the learned model parameters." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([-7.35152749, 5.94812682, 1.17282658])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... and generate the predictions" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "predictions = generatePredictions(model, data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Plotting**\n", "\n", "To illustrate the accuracy of the predictor, we plot the model (blue line) against the scatter plot of the dataset, colored according to the ground truth labels.\n", "\n", "As seen, our predictor does really well, misclassifying only two red points at the left edge of the scatter plot.\n", "\n", "**However**, we trained and tested on the same data, so this result does not tell us how well the classifier generalizes to unseen points." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "colors = ['red','red', 'green']\n", "prColors = [colors[int(i+1)] for i in labels ]\n", "\n", "plt.scatter(data[:,0], data[:,1], c=prColors)\n", "plt.plot(data[:,0], -(model[0]/model[1])*data[:,0] - model[2]/model[1])\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Your Task**\n", "In the cells below build the functionality for conducting the 10-fold cross-validation of the trainMagicClassifier() classifier.\n", "\n", "Each fold shall contain 10% of randomly selected points from the input dataset.\n", "\n", "For each fold, train on all other folds, then test on it and compute the accuracy.\n", "\n", "Report the accuracy of each fold, overall accuracy, and print out the confusion matrix\n", "\n", "**Notes**: you may want to create a function that splits the dataset into folds. For this task, it is ok to create 10 copies of the data (our dataset is small enough), if this makes your life easier, but you can also construct each training and testing set separately.\n", "\n", "There is no need to visualize any steps using graphs, but you can use the plotting structure above to make things look better for\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fold 0 accuracy: 100.0%\n", "Fold 1 accuracy: 100.0%\n", "Fold 2 accuracy: 100.0%\n", "Fold 3 accuracy: 90.0%\n", "Fold 4 accuracy: 90.0%\n", "Fold 5 accuracy: 80.0%\n", "Fold 6 accuracy: 90.0%\n", "Fold 7 accuracy: 100.0%\n", "Fold 8 accuracy: 100.0%\n", "Fold 9 accuracy: 100.0%\n", "Total accuracy: 95.0%\n", "True Positives: 47 | True Negatives: 48\n", "False Positives: 4 | False Negatives: 1\n" ] } ], "source": [ "# Create a list that represents the randomized indices of both lists\n", "randomized_indices = np.random.permutation(len(data))\n", "\n", "# Take that list of random indices and break it 
into 10 equal sized chunks\n", "equal_length_sets = np.array_split(randomized_indices, 10)\n", "\n", "# Create a variable for tracking total accuracy\n", "total_accuracy = 0\n", "\n", "# Create variables for tracking the confusion matrix: true positive, true negative, false positive, false negative\n", "tp = tn = fp = fn = 0\n", "\n", "# For each partition...\n", "for fold_value, test in enumerate(equal_length_sets):\n", "    # Get your 10% of testing data\n", "    test_set = data[test]\n", "    test_labels = labels[test]\n", "\n", "    # Get your 90% of training data. There is probably a better way to do this but I am panicking\n", "    train_set = []\n", "    train_labels = []\n", "    for train in equal_length_sets:\n", "        if not np.array_equal(train, test):\n", "            train_set.append(data[train])\n", "            train_labels.append(labels[train])\n", "    train_set = [item for sublist in train_set for item in sublist]\n", "    train_labels = [item for sublist in train_labels for item in sublist]\n", "    \n", "    # Convert your train_set and train_labels to numpy arrays\n", "    train_set, train_labels = np.array(train_set), np.array(train_labels)\n", "    \n", "    # Create your model and get your predictions\n", "    rate = 0.01\n", "    model = trainMagicClassifier(train_set, train_labels, rate, 20)\n", "    predictions = generatePredictions(model, test_set)\n", "    \n", "    # Calculate the accuracy and the confusion matrix\n", "    # Once again: This is bad code, but I am panicking\n", "    accuracy = 0\n", "    for i, prediction in enumerate(predictions):\n", "        if prediction == test_labels[i]:\n", "            accuracy += 1\n", "            if prediction == 1.0:\n", "                tp += 1\n", "            else:\n", "                tn += 1\n", "        else:\n", "            if prediction == 1.0:\n", "                fp += 1\n", "            else:\n", "                fn += 1\n", "    \n", "    # Get a percentage value for accuracy\n", "    accuracy /= len(predictions)\n", "    \n", "    # Report the accuracy\n", "    print(\"Fold {} accuracy: {}%\".format(fold_value, accuracy * 100))\n", "    \n", "    # Increment the total accuracy\n", "    total_accuracy += accuracy\n", 
"\n", "# Print out the total accuracy and the confusion matrix \n", "print(\"Total accuracy: {}%\".format(total_accuracy * 10))\n", "print(\"True Positives: {} | True Negatives: {}\".format(tp, tn))\n", "print(\"False Positives: {} | False Negatives: {}\".format(fp, fn))\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Congratulations!** You are done.\n", "\n", "Download the notebook and submit it using the\n", "\n", " handin dekhtyar 466-test \n", " \n", " command." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }