# Scikit-learn examples
## IN3050/IN4050 Mandatory Assignment 2: Supervised Learning

### Initialization

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn import datasets
import random

# Part 1: Comparing classifiers
## Datasets
We start by making a synthetic dataset of 1600 datapoints and three classes, with 800 individuals in one class and 400 in each of the two other classes. (See https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html#sklearn.datasets.make_blobs regarding how the data are generated.)

When we run experiments in supervised learning and the data are not already split into training and test sets, we should start by splitting the data. Sometimes there is a natural way to split, say training on data from one year and testing on data from a later year. If not, we should shuffle the data randomly before splitting. (That is not strictly necessary with this particular synthetic dataset, since scikit-learn shuffles it by default, but real-world data usually are not shuffled.) The split must keep the alignment between X and t, which may be achieved by shuffling the indices rather than the arrays themselves. We split into 50% for training, 25% for validation, and 25% for final testing. The set for final testing *must not be used* until the end of the assignment, in Part 3.

We fix the seed both for data set generation and for shuffling, so that we work on the same datasets when we rerun the experiments.

In [None]:
from sklearn.datasets import make_blobs
X, t = make_blobs(n_samples=[400,800,400], centers=[[0,0],[1,2],[2,3]], 
                  n_features=2, random_state=2019)

In [None]:
indices = np.arange(X.shape[0])
random.seed(2020)
random.shuffle(indices)
indices[:10]

In [None]:
X_train = X[indices[:800],:]
X_val = X[indices[800:1200],:]
X_test = X[indices[1200:],:]
t_train = t[indices[:800]]
t_val = t[indices[800:1200]]
t_test = t[indices[1200:]]
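As an alternative to shuffling indices by hand, the same 50/25/25 split can be sketched with scikit-learn's `train_test_split`. This is only an illustration: it produces a *different* random split than the cells above, the `_alt` names are hypothetical and chosen so that nothing above is overwritten, and `stratify` (which keeps the class proportions equal across the parts) is an extra that the original split does not use.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Same synthetic data as above.
X, t = make_blobs(n_samples=[400, 800, 400], centers=[[0, 0], [1, 2], [2, 3]],
                  n_features=2, random_state=2019)

# First split: 50% training, 50% held out. X and t are split together,
# so every row stays aligned with its label.
X_train_alt, X_rest, t_train_alt, t_rest = train_test_split(
    X, t, train_size=0.5, random_state=2020, stratify=t)

# Second split: halve the held-out part into validation and test (25% each).
X_val_alt, X_test_alt, t_val_alt, t_test_alt = train_test_split(
    X_rest, t_rest, train_size=0.5, random_state=2020, stratify=t_rest)

print(X_train_alt.shape, X_val_alt.shape, X_test_alt.shape)
```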

Next, we will make a second dataset by merging the two smaller classes in (X, t) into a single class, and call the new set (X, t2). This will be a binary set, where the former class 1 gets label 1 and the merged classes get label 0.

In [None]:
t2_train = t_train == 1
t2_train = t2_train.astype('int')
t2_val = (t_val == 1).astype('int')
t2_test = (t_test == 1).astype('int')
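A quick sanity check of the merge is to count the labels before and after. The sketch below does this on the full, unsplit data; the names `t2_all`, `counts_t` and `counts_t2` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same synthetic data as above.
X, t = make_blobs(n_samples=[400, 800, 400], centers=[[0, 0], [1, 2], [2, 3]],
                  n_features=2, random_state=2019)

# Binary targets: 1 for the large class (label 1), 0 for the two merged classes.
t2_all = (t == 1).astype('int')

counts_t = np.bincount(t)        # number of points per original class
counts_t2 = np.bincount(t2_all)  # number of points per binary class
print(counts_t, counts_t2)       # [400 800 400] [800 800]
```

The binary set is balanced, so a classifier that always predicts the same class has an accuracy of 0.5 on the full data.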

Plot the two training sets.

In [None]:
# Your solution
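One possible way to make the plots, shown only as a sketch and not as the intended solution: draw one scatter plot per training set, with one colour per class. The block repeats the data generation and split from above so that it can be run on its own; the figure layout and the helper name `scatter_by_class` are hypothetical choices.

```python
import random
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Recreate the data and the training part of the split used above.
X, t = make_blobs(n_samples=[400, 800, 400], centers=[[0, 0], [1, 2], [2, 3]],
                  n_features=2, random_state=2019)
indices = np.arange(X.shape[0])
random.seed(2020)
random.shuffle(indices)
X_train = X[indices[:800], :]
t_train = t[indices[:800]]
t2_train = (t_train == 1).astype('int')

def scatter_by_class(ax, X, labels, title):
    """Draw one scatter series per class so each class gets its own colour."""
    for c in np.unique(labels):
        mask = labels == c
        ax.scatter(X[mask, 0], X[mask, 1], s=8, label=f"class {c}")
    ax.set_title(title)
    ax.legend()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
scatter_by_class(axes[0], X_train, t_train, "(X_train, t_train)")
scatter_by_class(axes[1], X_train, t2_train, "(X_train, t2_train)")
# In a notebook the figure is displayed automatically; in a script, call plt.show().
```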

## Binary classifiers

### Linear regression
### Logistic regression
### Perceptron
### *k*NN

In [None]:
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier

In [None]:
lin_reg_cl = RidgeClassifier()
lin_reg_cl.fit(X_train, t2_train)

In [None]:
# RidgeClassifier?

In [None]:
lin_reg_cl.predict(X_val[:10, :])

In [None]:
t2_val[:10]

In [None]:
# Accuracy
lin_reg_cl.score(X_val, t2_val)

In [None]:
log_reg_cl = LogisticRegression()
log_reg_cl.fit(X_train, t2_train)

In [None]:
# LogisticRegression?

In [None]:
log_reg_cl.score(X_val, t2_val)

In [None]:
per_cl = Perceptron()
per_cl.fit(X_train, t2_train)

In [None]:
per_cl.score(X_val, t2_val)

In [None]:
kNN_cl_7 = KNeighborsClassifier(n_neighbors=7)
kNN_cl_7.fit(X_train, t2_train)

In [None]:
kNN_cl_7.score(X_val, t2_val)
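The four classifiers above all share the same `fit` / `score` interface, so the comparison can also be written as a loop. The following is a minimal sketch that repeats the data setup so it runs on its own; the dictionary names `models` and `val_scores` are introduced here for illustration.

```python
import random
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import RidgeClassifier, LogisticRegression, Perceptron
from sklearn.neighbors import KNeighborsClassifier

# Recreate the data and the train/validation split used above.
X, t = make_blobs(n_samples=[400, 800, 400], centers=[[0, 0], [1, 2], [2, 3]],
                  n_features=2, random_state=2019)
indices = np.arange(X.shape[0])
random.seed(2020)
random.shuffle(indices)
X_train, X_val = X[indices[:800], :], X[indices[800:1200], :]
t2_train = (t[indices[:800]] == 1).astype('int')
t2_val = (t[indices[800:1200]] == 1).astype('int')

models = {
    "linear regression (ridge)": RidgeClassifier(),
    "logistic regression": LogisticRegression(),
    "perceptron": Perceptron(),
    "7-NN": KNeighborsClassifier(n_neighbors=7),
}

# Fit each model on the training set and record its validation accuracy.
val_scores = {}
for name, model in models.items():
    model.fit(X_train, t2_train)
    val_scores[name] = model.score(X_val, t2_val)
    print(f"{name:28s} {val_scores[name]:.3f}")
```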

### Scaling example
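We use `StandardScaler`, which standardizes each feature to zero mean and unit variance based on the training data. scikit-learn also provides other scalers, for example `MinMaxScaler`.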

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [None]:
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
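To see what the scaler does, and why the validation set is only transformed and never fitted, the scaling can be reproduced by hand with NumPy. The sketch below repeats the data setup so it runs on its own; `mean`, `std` and `X_val_manual` are names introduced here.

```python
import random
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Recreate the data and the train/validation split used above.
X, t = make_blobs(n_samples=[400, 800, 400], centers=[[0, 0], [1, 2], [2, 3]],
                  n_features=2, random_state=2019)
indices = np.arange(X.shape[0])
random.seed(2020)
random.shuffle(indices)
X_train, X_val = X[indices[:800], :], X[indices[800:1200], :]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from the training set
X_val_scaled = scaler.transform(X_val)          # reuses the training mean and std

# The same transformation by hand: statistics come from the training set only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population standard deviation (ddof=0)
X_val_manual = (X_val - mean) / std

print(np.allclose(X_val_manual, X_val_scaled))
```

Because the statistics come from the training set, the scaled validation set will generally not have exactly zero mean and unit variance.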

In [None]:
lin_sc = RidgeClassifier().fit(X_train_scaled, t2_train)
lin_sc.score(X_val_scaled, t2_val)

In [None]:
log_sc = LogisticRegression().fit(X_train_scaled, t2_train)
log_sc.score(X_val_scaled, t2_val)

## Multi-class classifiers
We now turn to the task of classifying when there are more than two classes, and the task is to ascribe one class to each input. We will now use the set (X, t).

### Logistic regression "one-vs-rest"
We saw in the lecture how a logistic regression classifier can be turned into a multi-class classifier using the one-vs-rest approach. We train one binary classifier for each class and assign the class whose classifier ascribes the highest probability.

Extend the logistic regression classifier to a multi-class classifier. To do this, you must modify the target values from scalars to arrays. Train the resulting classifier on (X_train, t_train), test it on (X_val, t_val), and report the accuracy.
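The one-vs-rest procedure described above can be sketched by hand with scikit-learn's binary `LogisticRegression` as the building block. This is a minimal illustration under our own assumptions, not the implementation the assignment asks for: the targets are turned into one-hot arrays, one binary classifier is trained per column, and the class with the highest predicted probability wins. The names `T_train`, `probs`, `ovr_pred` and `ovr_acc` are introduced here.

```python
import random
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Recreate the data and the train/validation split used above.
X, t = make_blobs(n_samples=[400, 800, 400], centers=[[0, 0], [1, 2], [2, 3]],
                  n_features=2, random_state=2019)
indices = np.arange(X.shape[0])
random.seed(2020)
random.shuffle(indices)
X_train, X_val = X[indices[:800], :], X[indices[800:1200], :]
t_train, t_val = t[indices[:800]], t[indices[800:1200]]

# One-hot targets: column k is 1 where the label equals classes[k], else 0.
classes = np.unique(t_train)
T_train = (t_train[:, None] == classes[None, :]).astype('int')

# One binary classifier per class; keep its probability for "belongs to class k".
probs = np.column_stack([
    LogisticRegression().fit(X_train, T_train[:, k]).predict_proba(X_val)[:, 1]
    for k in range(len(classes))
])

# Assign the class whose classifier is most confident.
ovr_pred = classes[np.argmax(probs, axis=1)]
ovr_acc = np.mean(ovr_pred == t_val)
print(f"one-vs-rest validation accuracy: {ovr_acc:.3f}")
```

Note that the three probabilities in a row of `probs` come from independent classifiers and need not sum to one; only their ranking is used.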

In [None]:
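# Note: the multi_class parameter is deprecated in recent scikit-learn releases (1.5 and later);
# sklearn.multiclass.OneVsRestClassifier(LogisticRegression()) is the suggested replacement there.
log_reg_cl = LogisticRegression(multi_class='ovr')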
log_reg_cl.fit(X_train, t_train)

In [None]:
# Accuracy on the validation set, as the task asks
log_reg_cl.score(X_val, t_val)

In [None]:
softmax_cl = LogisticRegression(multi_class='multinomial')
softmax_cl.fit(X_train, t_train)

In [None]:
# Accuracy on the validation set, for comparison with one-vs-rest
softmax_cl.score(X_val, t_val)
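For the multinomial (softmax) classifier, the class probabilities are the softmax of the linear scores, so unlike one-vs-rest they sum to one by construction. The sketch below checks this by hand. It assumes a scikit-learn version (0.22 or later) where `LogisticRegression()` with its default solver fits a multinomial model when there are more than two classes, and therefore leaves out the `multi_class` argument; the names `scores`, `manual_probs` and `clf` are introduced here.

```python
import random
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Recreate the data and the train/validation split used above.
X, t = make_blobs(n_samples=[400, 800, 400], centers=[[0, 0], [1, 2], [2, 3]],
                  n_features=2, random_state=2019)
indices = np.arange(X.shape[0])
random.seed(2020)
random.shuffle(indices)
X_train, X_val = X[indices[:800], :], X[indices[800:1200], :]
t_train, t_val = t[indices[:800]], t[indices[800:1200]]

# Assumption: with three classes and the default solver this is a multinomial model.
clf = LogisticRegression().fit(X_train, t_train)

# Linear scores, one column per class.
scores = clf.decision_function(X_val)

# Softmax by hand; subtracting the row maximum avoids overflow in exp.
shifted = scores - scores.max(axis=1, keepdims=True)
manual_probs = np.exp(shifted)
manual_probs /= manual_probs.sum(axis=1, keepdims=True)

print(np.allclose(manual_probs, clf.predict_proba(X_val)))
```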

In [None]:
# LogisticRegression?

## Highly recommended
https://scikit-learn.org/stable/getting_started.html