{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**CSC 466: Knowledge Discovery in Data **\n", "\n", "** Individual Test**\n", "\n", "**Task 3**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Your Name :** \n", "\n", "**Cal Poly Email:** " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Your Assignment**:\n", "\n", "1. Complete the code for the Naive Bayes Classifier\n", "2. Complete the training and testing of the Classifier\n", "3. Compute the accuracy of the classifier and output the overall accuracy and the confusion matrix." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## Imports\n", "import numpy as np\n", "from matplotlib import pyplot as plt\n", "import seaborn\n", "import pandas as pd\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**estimateNBModel**: function to produce all parameter estimates for the Naive Bayes model.\n", "\n", "This function returns two structures: an array of class probabilities P(Class = c_i),\n", "and a collection of \n", "$$P(A_j = a_k|Class = c_i)$$\n", "probability estimators for the probability that an observed record from class $c_i$ will take the value $a_k$ for its $j$th attribute $A_j$.\n", "\n", "The latter collection is stored as a list model[0..nClasses-1], where each model[i] is itself a list of length nAttributes of probability distributions over the values of each attribute.\n", "\n", "\n", "Input parameters for estimateNBModel():\n", "\n", "data: training set data points\n", "\n", "labels: labels for the training set data\n", "\n", "nAttributes: number of attributes in the dataset\n", "\n", "attributeRanges: number of unique values each attribute takes\n", "\n", "nClasses: number of classes (class labels)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Your Task** : write the getClassProbability() function that given\n", " \n", " * the list of labels (labels) of the training set\n", "\n", " * the class label (classId),\n", " \n", " * and the total number of classes in the dataset (nClasses)\n", " \n", " returns the probability estimate for $P(Class = classId)$" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def getClassProbability(labels, classId,nClasses):\n", " return sum([1 for i in labels if i == classId]) / len(labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Your Task**: write the getNBEstimate() function that given\n", "\n", "* the training set (data)\n", "\n", "* the class labels for the training set (labels)\n", "\n", "* the class label for which the estimate is being produced (classLabel)\n", "\n", "* the attribute for which the estimate is being produced (attId)\n", "\n", "* the attribute value for which the estimate is being produced (attValue)\n", "\n", "* and the total number of values attribute attId has (nValues)\n", "\n", "produces the probability estimate for $P(A_{attId} = attValue | Class = classLabel)$" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def getNBEstimate(data, labels, classLabel, attId, attValue, nValues):\n", " indices, = np.where(labels == classLabel)\n", " subset = data[indices,attId]\n", " return sum([1 for i in subset if i == attValue]) / len(subset)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def estimateNBModel(data, labels, nAttributes, attributeRanges, nClasses):\n", " ## Naive Bayes Model consists of two types of estimators\n", " \n", " ## First, we estimate the probability of seeing an object from a specific class\n", " \n", " classProbabilities = [getClassProbability(labels,l, nClasses) for l in range(nClasses)]\n", " #print(classProbabilities, nClasses)\n", " \n", " ## now we estimate the probabilities of seeing a specific value of a specific attribute in\n", " ## a data point from a given class\n", " \n", " ## for each class create the appropriate estimates\n", " model = [] # model is the list of estimates for all classes\n", " for i in range(nClasses): # for each class\n", " ## for each attribute\n", " classDistr = [] # classDistr is the collection of estimates for one class\n", " for j in range(nAttributes):\n", " estimates = [] # estimates is a distribution of estimates for a single attribute\n", " for k in range(attributeRanges[j]):\n", " #print(i)\n", " est = getNBEstimate(data, labels, i,j,k, attributeRanges[j])\n", " #print(est)\n", " #break\n", " #break\n", " estimates.append(est)\n", " classDistr.append(estimates)\n", " model.append(classDistr)\n", " \n", " return classProbabilities, model\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Predicting The Class**\n", "\n", "function predictNBLabels() predicts the class labels for all data points in the test set.\n", "\n", "function predictNB() computes the probability estimates for each class and selects the class with the highest estimate for a single data point\n", "\n", "function predictNBClass computes the probability estimate for a specific class.\n", "\n", "** Your Task **: implement predictNBClass()\n", "\n", "its parameters are:\n", "\n", "* point: the data point for which the estimate is given\n", "\n", "* classProb: the class probability P(Class = classID) for the class \n", "\n", "* classModel: the portion of the Naive Bayes model related to predicting this particular class\n", "\n", "(note that class label is not passed, but all proper values are selected in predictNB())" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def predictNBClass(point, classProb, classModel):\n", " ans = 1\n", " for i in range(len(classModel)):\n", " ans *= classModel[i][int(point[i])]\n", " return ans * classProb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "predictNB() and predictNBLabels() parameters\n", "\n", "point: data point\n", "\n", "classProbabilities: the list of $P(Class = c_i)$ estimates\n", "\n", "model: the collection of $P(A_j = a_k |Class = c_i)$ probability estimates\n", "\n", "nClasses: number of class labels in the dataset\n", "\n", "You do not need to touch this code" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def predictNB(point, classProbabilities, model,nClasses):\n", " \n", " predictions= np.array([predictNBClass(point, classProbabilities[i],model[i]) for i in range(nClasses)])\n", " \n", " predictedClass = np.argmax(predictions)\n", " return predictedClass" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def predictNBLabels(data, classProbabilities, model, nClasses):\n", " predicted = [predictNB(point, classProbabilities, model, nClasses) for point in data]\n", " return predicted" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** Load Data **" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "filename=\"data8.csv\"\n", "\n", "rawData = np.loadtxt(filename, delimiter = \",\")\n", "\n", "## let's keep only the two columns with the data attributes\n", "\n", "nAttributes = rawData.shape[1] - 1\n", "\n", "data = rawData[:,0:nAttributes]\n", "labels = rawData[:,nAttributes]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** Train the Model**\n", "\n", "In the cells below the entire dataset is used to train the model. \n", "\n", "This allows us to see the predictions, but this is not a fair way to evaluate the quality of prediction." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "### Let us compute how many unique values each attribute has.\n", "### all attributes have values 0,1,.., k-1 where k is the number of unique values for that attribute.\n", "\n", "attributeRanges= [np.unique(data[:,i]).shape[0] for i in range(nAttributes)]\n", "\n", "## number of classes\n", "\n", "nClasses = np.unique(labels).shape[0]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "d,m = estimateNBModel(data,labels,nAttributes, attributeRanges, nClasses)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [], "source": [ "predicted = predictNBLabels(data,d,m,nClasses)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An easy way to see where we missed:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., -2., 0.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n", " 0., 2., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0., -1., 0., 0., 0.,\n", " -2., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n", " 0., 0., 0., 0., -1., 0., 0., 0., 0., 0., 0., 0., 0.,\n", " 0., 1., -2., 0., 0., 0., 0., 0., 0., 0., 0., 2., 0.,\n", " 0., 0., -2., 1., 0., 0., 1., -1., 0., 0., 0., 0., -1.,\n", " 0., -1., 0., 0., 0., 0., 0., 0., -1., 0., 0., 0., 0.,\n", " 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., -2., 0., 0., 0., 0.,\n", " 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., -2.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,\n", " -2., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -1., 0.,\n", " -1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n", " 0., 0., 0., 0., -1., 0., 0., 0., 0., 0., 0., 0., 0.,\n", " 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 2., 0., 0.,\n", " 0., 0., -1., 0., 0., 2., 0., 0., 0., 0., 0., -1., 0.,\n", " 0., -2., 1., -1., 0., 0., 0., 0., 0., 0., 0., 0., -1.,\n", " 0., 0., -1., 0., 0., 0., 0., 0., 1., 0., 0., -2., 0.,\n", " 0., 0., 0., 0., 0., 0., 0., -2., 0., 0., 0., 0., 1.,\n", " 0., 0., 0., 2., 1., 0., 0., -2., 0., 0., 0., 0., 1.,\n", " 0., 0., 0., 0., 0., 0., 0., 0., 0., -1., 0., 0., 0., 0.])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predicted - labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Your Task**:\n", "\n", "Split your dataset into the training set (2/3 of the data = 200 data points) and the test set (the remaining 1/3 of the data points = 100 points). Select your points at random, but make it reproducible by setting the seed." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [], "source": [ "np.random.seed(0)\n", "indices = list(range(300))\n", "np.random.shuffle(indices)\n", "diff = np.array(list(set(list(range(300))).difference(indices[:200])))\n", "X_train, y_train, X_test, y_test = data[indices,:], labels[indices], data[diff,:], labels[diff]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** Your Task**: train the model on the training set" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "attributeRanges= [np.unique(X_train[:,i]).shape[0] for i in range(nAttributes)]\n", "nClasses = np.unique(y_train).shape[0]\n", "d, m = estimateNBModel(X_train, y_train, nAttributes, attributeRanges, nClasses)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** Your Task**: evaluate the model on the test set. Retrieve the predicted labels for the test set data points." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "predicted_train = predictNBLabels(X_train, d, m, nClasses) \n", "predicted_test = predictNBLabels(X_test, d, m, nClasses)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** Your Task**: compute the predictive accuracy on the training set and report it. \n", "\n", "Compute the predictive accuract on the test set and report it.\n", "\n", "Is there any evidence that the model overfits? (put a note in a markdown cell)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy on train\n", "0.8466666666666667\n", "\n", "Accuracy on test\n", "0.86\n" ] } ], "source": [ "print(\"Accuracy on train\")\n", "print(sum([1 for i in range(len(y_train)) if y_train[i] == predicted_train[i]]) / len(y_train))\n", "print()\n", "print(\"Accuracy on test\")\n", "print(sum([1 for i in range(len(y_test)) if y_test[i] == predicted_test[i]]) / len(y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Answer the overfit question here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There does not seem to be much overfitting on the test set, if at all. Since the accuracy of training set is not drastically higher than the accuracy on the test set (84.6% vs. 86%), which is a good sign of overfitting on the training set, we can conclude Naïve Bayes does not overfit on this dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Your Task**: compute and output the confusion matrix" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [], "source": [ "mat = np.zeros((3,3))\n", "for i in range(len(predicted_test)):\n", " mat[predicted_test[i], int(y_test[i])] += 1\n", "df = pd.DataFrame({\"True value: 0\": mat[:,0], \"True value: 1\": mat[:,1], \"True value: 2\": mat[:,2]})\n", "df.index = [\"Predicted value: {}\".format(i) for i in range(3)]" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
True value: 0True value: 1True value: 2
Predicted value: 027.02.04.0
Predicted value: 11.020.04.0
Predicted value: 21.02.039.0
\n", "
" ], "text/plain": [ " True value: 0 True value: 1 True value: 2\n", "Predicted value: 0 27.0 2.0 4.0\n", "Predicted value: 1 1.0 20.0 4.0\n", "Predicted value: 2 1.0 2.0 39.0" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Congratulations!** Your are done.\n", "\n", "Download the notebook, and submit it using the \n", "\n", " handin dekhtyar 446-test \n", " command." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }