{"cells":[{"cell_type":"markdown","metadata":{"id":"_FO1xbko7UGW"},"source":["# Classification by *k*NN\n","## Introduction\n","The goal of this week is to get a first experience with supervised classification.\n","In particular, we will get familiar with how to set up, run and evaluate experiments.\n","We will also implement the *k*NN algorithm, both in pure Python and with scikit-learn."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"H8BY9AuK7UGY"},"outputs":[],"source":["import numpy as np\n","import matplotlib.pyplot as plt\n","import sklearn"]},{"cell_type":"markdown","metadata":{"id":"E7jAxsg27UGZ"},"source":["## Dataset\n","To do machine learning, we need data.\n","To make it simple, we use scikit-learn to construct a synthetic dataset with\n","- 2 classes\n","- 2 numerical features\n","- 200 items\n","\n","We will use `X`-s for the input values and `t`-s for the target values. Since we will be using pure Python in this exercise set, we transform the data from numpy arrays, like `X_np`, to lists, like `X1`.\n","\n","Don't worry about the magic recipe for how we cook the data for now!"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"9Y1c3-K07UGZ"},"outputs":[],"source":["from sklearn.datasets import make_blobs\n","X_np, t_np = make_blobs(n_samples=200, centers=[[0,0],[1,2]],\n","                  n_features=2, random_state=2019)\n","X1 = [(X_np[i,0], X_np[i,1]) for i in range(X_np.shape[0])]\n","t1 = [t_np[i] for i in range(X_np.shape[0])]"]},{"cell_type":"markdown","metadata":{"id":"WQ4Ue63Z7UGZ"},"source":["This is a general format for representing data that we will use a lot in this course. We store the features in one list and the labels in another list of the same length. 
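A minimal sketch of this parallel-lists format, with made-up toy values (not the blobs data generated above):

```python
# Toy illustration of the parallel-lists format (made-up values)
X = [(0.5, 1.2), (2.0, 0.3), (1.1, 1.9)]  # features: one (x, y) pair per item
t = [0, 1, 1]                             # labels: t[i] labels X[i]

assert len(X) == len(t)   # the two lists are index-aligned
print(X[1], t[1])         # features and label of the second item
```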
For example, `t1[14]` is the label the dataset ascribes to the input `X1[14]`, where `X1[14]` is a pair (two-tuple) of numbers.\n","\n","(Later on we will use numpy arrays and not lists, e.g., the `X_np`, `t_np` above.)\n","\n","We can then take a look at the training set using a scatterplot."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Jc4iVZR27UGZ"},"outputs":[],"source":["plt.scatter(X_np[:, 0], X_np[:, 1], c=t_np)"]},{"cell_type":"markdown","metadata":{"id":"d6KmOpaS7UGa"},"source":["To add a legend (i.e., naming the classes) we sort the data into the two classes before plotting. We may then use the `plot` command."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Jc5ohgnw7UGa"},"outputs":[],"source":["def show(X, y, marker='.'):\n","    labels = set(y)\n","    cl = {lab : [] for lab in labels}\n","    # cl[lab] shall contain the datapoints labeled lab\n","    for (a, b) in zip(X, y):\n","        cl[b].append(a)\n","    for lab in labels:\n","        plt.plot([a[0] for a in cl[lab]], [a[1] for a in cl[lab]],\n","                 marker, label=\"class {}\".format(lab))\n","    plt.legend()"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"f9pnLfhb7UGa"},"outputs":[],"source":["show(X1, t1)"]},{"cell_type":"markdown","metadata":{"id":"WtQdU3QJ7UGb"},"source":["## *k*NN\n","We will now implement the *k*NN algorithm.\n","We first need to calculate the distance between two points."]},{"cell_type":"markdown","source":["### (L2-) distance function:\n","It should work for points in *n*-dimensional space for any integer *n*>0. 
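To get a feel for the target behaviour, here is a numpy cross-check of the distance we are after (a sketch, assuming the `np` import from the top of the notebook):

```python
import numpy as np

# Euclidean (L2) distance between two 3-dimensional points, via numpy
a, b = (3, 4, 0), (0, 0, 12)
d = float(np.linalg.norm(np.array(a) - np.array(b)))
print(d)  # 13.0
```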
Check that dist((3,4,0),(0,0,12)) is 13 (since 3^2 + 4^2 + 12^2 = 169 = 13^2)."],"metadata":{"id":"5LotBuLc8_AR"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"jUJOEajZ7UGb"},"outputs":[],"source":["def distance_L2(a, b):\n","    # Euclidean distance in a procedural way\n","    s = 0\n","    for (x,y) in zip(a,b):\n","        s += (x - y) ** 2\n","    return s ** 0.5"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"p5QOIKWv7UGb"},"outputs":[],"source":["distance_L2((3,4,0),(0,0,12)) #=13"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"AR_Cahpy7UGb"},"outputs":[],"source":["assert distance_L2((3,4,0),(0,0,12)) == 13"]},{"cell_type":"markdown","metadata":{"id":"iNQ3DDdT7UGb"},"source":["### Majority class:\n","The next thing we need is a way to determine the majority class from a set of votes. Implement a procedure which takes a list as its argument and returns the majority class."]},{"cell_type":"markdown","metadata":{"id":"OiAbgSxc7UGc"},"source":["#### Hint: Counter\n","For this we can use the `Counter` class from the `collections` module. If you are not familiar with `Counter`, experiment with it to see how it works."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"PMZ1Zpvg7UGc"},"outputs":[],"source":["from collections import Counter\n","print(\"Example\")\n","s = ['a', 'b', 'c', 'b', 'c']\n","counts = Counter(s)\n","print(s)\n","print(counts)\n","print(counts.most_common())"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"KPZT5TvP7UGc"},"outputs":[],"source":["def majority(a):\n","    # most_common() sorts by count; on ties it keeps first-encountered order (Python 3.7+)\n","    counts = Counter(a)\n","    return counts.most_common()[0][0]"]},{"cell_type":"markdown","metadata":{"id":"M8Dpz_Xm7UGc"},"source":["### Exercise: the *k*NN algorithm\n","We will use a class for implementing the classifier. We have chosen a format that we can later reuse for various other classifier algorithms. The format is inspired by scikit-learn. 
We will have a superclass where we can put methods common to the various classification algorithms.\n","\n","The class will have three methods: an `__init__` where we set the hyperparameters, a `fit` where the training takes place, and a `predict` which predicts the classes of a list of new items after we have trained the classifier.\n","\n","The training will have the form\n","```python\n","cls = PykNNClassifier(k=5) # OR some other number, default 3\n","cls.fit(X_train, t_train)\n","```\n","\n","We can then classify a new item by e.g.\n","```python\n","p = (1,1)\n","cls.predict([p])\n","```\n","\n","Implement the `predict` method."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"GipKdk3X7UGc"},"outputs":[],"source":["# This is a design placeholder (base class).\n","\n","# We plan to add methods shared by the classifiers later (e.g., fit(), predict()).\n","# It serves as a parent class for the other classifiers:\n","\n","class PyClassifier():\n","    \"\"\"Common methods to all python classifiers --- if any\n","\n","    Nothing here yet\"\"\""]},{"cell_type":"code","execution_count":null,"metadata":{"id":"mBT7mxVu7UGc"},"outputs":[],"source":["class PykNNClassifier(PyClassifier):\n","    \"\"\"kNN classifier using pure python representations\"\"\"\n","\n","    def __init__(self, k=3, dist=distance_L2):\n","        self.k = k\n","        self.dist = dist\n","\n","    def fit(self, X_train, t_train):\n","        X_train = [[x for x in xx] for xx in X_train]\n","        t_train = [t for t in t_train]\n","        # The two lines above keep a training set that is already\n","        # in this form unchanged, and also accept training sets\n","        # given as numpy arrays, converting them to python\n","        # (lists of) lists.\n","        self.X_train = X_train\n","        self.t_train = t_train\n","\n","    def predict_one(self, a):\n","        X = self.X_train\n","        y = self.t_train\n","        ########## implement below\n","        distances = ...   # (distance, label) pairs for all training items\n","        predictors = ...  # labels of the k nearest neighbors\n","        prediction = ...  # majority vote among the predictors\n","        ############\n","        return prediction\n","\n","    def predict(self, X):\n","        return [self.predict_one(z) for z in X]"]},{"cell_type":"markdown","source":["## Before implementing, think about the questions below:\n","\n"],"metadata":{"id":"dqXZPCFQBwtd"}},{"cell_type":"markdown","source":["### 1. What is k-nearest neighbors (kNN)? Is it a training-based model, or a memory-based (lazy) model?"],"metadata":{"id":"0b9W5n7gD-Hs"}},{"cell_type":"markdown","source":["answer: kNN is memory-based (lazy): it does not learn any parameters; it stores the training data and compares distances at prediction time."],"metadata":{"id":"F7KcedAiEMqr"}},{"cell_type":"markdown","source":["### 2. What does `fit()` actually do?\n","\n","\n","*   What are `X_train` and `t_train`? Features or labels?\n","*   What is stored in: `self.X_train` and `self.t_train`?\n"],"metadata":{"id":"76tFRWBVECta"}},{"cell_type":"markdown","source":["answer: \"training\" or \"fitting\" here just stores the dataset in a usable format"],"metadata":{"id":"h455XuYGE4is"}},{"cell_type":"markdown","source":["### 3. What is the goal of `predict_one(a)`?\n","*   What is `a`?\n","*   What should the function return?"],"metadata":{"id":"OZPWC_JDEFlv"}},{"cell_type":"markdown","source":["answer: `a` is a single test sample; the function should return the predicted label for `a`"],"metadata":{"id":"xxi89mJnFkWC"}},{"cell_type":"markdown","source":["### 4. How do we compute distances?\n","\n","\n","*   Distance between what and what?\n","*   What is `self.dist`?\n","*   What should distances look like? List of numbers? 
Or (distance, label) pairs?\n","\n"],"metadata":{"id":"pepah573EJHY"}},{"cell_type":"markdown","source":["answer:\n","*   Distance should be between `a` and each training point in `X`\n","*   `self.dist` is a function (e.g., Euclidean distance)\n","*   (distance, label) pairs, because later we also need the labels for prediction"],"metadata":{"id":"I8fHDraEGLI3"}},{"cell_type":"markdown","source":["### 5. How do we select the k nearest neighbors?\n","*   After computing distances, what's next?\n","*   How do we find the k smallest distances?\n","*   What data structure do we sort?"],"metadata":{"id":"4abE8kVrHWFt"}},{"cell_type":"markdown","source":["answer:\n","*  Sort by distance\n","*  Take the first k"],"metadata":{"id":"nY8vJ57XH7ts"}},{"cell_type":"markdown","source":["### 6. How do we turn neighbors into a prediction?\n","*  What labels do the k neighbors have?\n","*  How do we combine them?"],"metadata":{"id":"iBUsfJ9uIVWE"}},{"cell_type":"markdown","source":["answer: use `majority` to choose the most frequent label among the k neighbors"],"metadata":{"id":"obP5JU1tIj4T"}},{"cell_type":"markdown","source":["### 7. How does `predict()` work?\n","* Why reuse `predict_one`?\n","* What is `X` here?"],"metadata":{"id":"a-fAM5a9IuyH"}},{"cell_type":"markdown","source":["answer: `X` is a list of samples; `predict` applies the single-sample logic to each input"],"metadata":{"id":"qkACcHuDIyAV"}},{"cell_type":"markdown","source":["### 8. 
How do all parts connect?"],"metadata":{"id":"Vg3uxoknJQ6x"}},{"cell_type":"markdown","source":["answer:\n","* `fit()`: store the training data\n","* `predict_one(a)`: compute distances; find the k nearest; vote\n","* `predict()`: loop over the inputs"],"metadata":{"id":"swOA8JpbJSt1"}},{"cell_type":"markdown","metadata":{"id":"aoXIH-Ey7UGc"},"source":["### Solution"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"9S4c7o9R7UGc"},"outputs":[],"source":["class PykNNClassifier(PyClassifier):\n","    \"\"\"kNN classifier using pure python representations\"\"\"\n","\n","    def __init__(self, k=3, dist=distance_L2):\n","        self.k = k\n","        self.dist = dist\n","\n","    def fit(self, X_train, t_train):\n","        X_train = [[x for x in xx] for xx in X_train]\n","        t_train = [t for t in t_train]\n","        # The two lines above keep a training set that is already\n","        # in this form unchanged, and also accept training sets\n","        # given as numpy arrays, converting them to python\n","        # (lists of) lists.\n","        self.X_train = X_train\n","        self.t_train = t_train\n","\n","    def predict_one(self, a):\n","        X = self.X_train\n","        y = self.t_train\n","        distances = [(self.dist(a, b), c) for (b, c) in zip(X, y)]\n","        distances.sort(key=lambda dc: dc[0])  # sort by distance only\n","        predictors = [c for (_, c) in distances[:self.k]]\n","        return majority(predictors)\n","\n","    def predict(self, X):\n","        return [self.predict_one(z) for z in X]"]},{"cell_type":"markdown","metadata":{"id":"ZRuuOlZp7UGc"},"source":["#### End of solution"]},{"cell_type":"markdown","metadata":{"id":"qXWw0eX27UGc"},"source":["## Experiments and evaluation\n","To check how good the classifier is, we cannot consider individual datapoints alone.\n","We have to see how the classifier performs on a larger test set.\n","With our synthetic training data, we can make a test set in a similar way.\n","\n","We follow the same recipe as for the training set, 
but observe that we use a different *random_state* to get a set different from the training set."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"dyvI4cbI7UGc"},"outputs":[],"source":["X_np, t_np = make_blobs(n_samples=200, centers=[[0,0],[1,2]],\n","                  n_features=2, random_state=2020)\n","X2 = [(X_np[i,0], X_np[i,1]) for i in range(X_np.shape[0])]\n","t2 = [t_np[i] for i in range(X_np.shape[0])]"]},{"cell_type":"code","execution_count":null,"metadata":{"scrolled":true,"id":"thNhIZt87UGc"},"outputs":[],"source":["show(X2, t2, 'x')"]},{"cell_type":"markdown","metadata":{"id":"KSQENkF77UGc"},"source":["### Metric: Accuracy\n","There are several different evaluation measures that can be used, and we will see a couple of them in the coming weeks. For today, we only consider the simple *accuracy*: the proportion of items classified correctly.\n","\n","Here is a function `accuracy()`. It takes two arguments, each a list of labels, compares the two, and returns a numerical value for the accuracy. We will apply it to measure the accuracy of the classifier `cls` as follows:\n","\n","```python\n","accuracy(cls.predict(X_test), t_test)\n","```\n","\n","Test it on X2, t2 when trained on X1, t1 for various values of *k*.\n","Let *k* be any odd integer below 20. Plot the results.\n","\n","Beware that there is no *k* which is the best for all datasets; it varies with the dataset. To decide on the best *k* for a specific dataset, we should use a separate development test set to determine the best *k*. 
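As a sketch of that protocol (hypothetical split sizes, and sklearn's `KNeighborsClassifier` standing in for our own class), selecting *k* on a development set and testing once at the end might look like:

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data; only the protocol matters here
X, t = make_blobs(n_samples=400, centers=[[0, 0], [1, 2]],
                  n_features=2, random_state=2019)

# One split into train / development / final test
X_train, X_rest, t_train, t_rest = train_test_split(X, t, test_size=0.5, random_state=0)
X_dev, X_test, t_dev, t_test = train_test_split(X_rest, t_rest, test_size=0.5, random_state=0)

# Pick the k that scores best on the development set ...
best_k = max(range(1, 20, 2),
             key=lambda k: KNeighborsClassifier(n_neighbors=k)
                           .fit(X_train, t_train).score(X_dev, t_dev))

# ... then report a single final score on the held-out test set
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, t_train).score(X_test, t_test)
print(best_k, final)
```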
Then we fix this *k* and test on the final test set."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"1pOb6ePe7UGd"},"outputs":[],"source":["def accuracy(y, t):\n","    \"\"\"Calculate the accuracy\"\"\"\n","    equal = len([(p, g) for (p,g) in zip(y, t) if p==g])\n","    return equal / len(t)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"sfm4HimM7UGd"},"outputs":[],"source":["x = range(1, 20, 2)\n","accuracies = []\n","for k in x:\n","    cls = PykNNClassifier(k=k)\n","    cls.fit(X1, t1)\n","    accuracies.append(accuracy(cls.predict(X2), t2))\n","plt.plot(x, accuracies, label=\"Testset X2, t2\")\n","plt.xticks(x)\n","plt.legend()"]},{"cell_type":"markdown","metadata":{"id":"A_K-WFQZ7UGk"},"source":["#### End of solution"]}],"metadata":{"kernelspec":{"display_name":"Python 3 (ipykernel)","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.10.12"},"colab":{"provenance":[]}},"nbformat":4,"nbformat_minor":0}