Deep Learning has been gaining a lot of attention in recent times. Before we dive into Deep learning, let’s start with the algorithm that started it all. The Perceptron, also known as the Rosenblatt’s Perceptron. Perceptrons are the most primitive classifiers, akin to the base neurons in a deep-learning system.

## What is a perceptron?

At the very basic level, a perceptron is a bunch of parameters, also known as weights. BTW, that is true of most parametric machine learning models ;). The model can be succinctly represented by just listing these weight values. No other information is necessary. Each dimension in the input has a corresponding weight in the perceptron. There is one additional weight that corresponds to an augmented constant in the input. This weight helps deal with bias in the classifier, as we will see later in this tutorial.

The weights are applied to arrive at a weighted sum of the inputs. If the weighted sum is greater than a threshold, then the input instance class A, else it belongs to class B. For simplicity, let us assume a threshold of 0. A weighted sum above 0 implies positive class (+1) and a weighted sum below 0 implies the negative class (-1).

So, given some observations, how do we arrive at the correct set of weights? You “tune” them to comply with the observations in the training set. That is the training or learning phase of the Perceptron and is described by the Perceptron learning rule. For every input instance that is presented to the Perceptron, three outcomes are possible.

- The prediction of the Perceptron matches the true class label of the input instance. So, no update in the weight vector is needed
- The prediction is lower than the true label. That means, the Perceptron is arriving at a negative weighted sum, when it should really be positive. Consequently, we need to increase the weights such that the score becomes higher for that instance. So, we will increase the weights with respect to that positive instance by merely adding the positive instance into the weight vector we already had. (That is naive, and we will see later why)
- The prediction is higher than the true label. Weighted sum is positive when it should have been negative. So, we decrease the weight vector with respect to that negative instance by subtracting it from the existing weight vector.

This suggests an iterative procedure. You pick an instance, check the prediction of the trained Perceptron for that instance, and then perform one of the three steps mentioned above. You do this several times over the available input instances in the training set.

## How do you train a perceptron?

As you can guess from our earlier discussion, the Perceptron learning algorithm is one of the easiest algorithms to implement. Here’s a tiny implementation in Python. Because we intend to use it repeatedly throughout this series of tutorials, we will implement it as a python class. If you look beyond the boiler plate in the class template, you will notice that the training phase in the **fit** function is merely 3 lines long and the **predict** function is merely 2 lines. Let’s look at the code first:

from sklearn.base import BaseEstimator import numpy as np class Perceptron(BaseEstimator): """A simple Perceptron algorithm""" def __init__(self,dims,max_iters=100): BaseEstimator.__init__(self) self._dims = dims self._max_iters = max_iters self.w = np.zeros(self._dims + 1) def augmented_x(self, x): n, f = x.shape augx = np.ones((n,f+1)) # create augmented X augx[:,:-1] = x return augx def fit(self,x,y): augx = self.augmented_x(x) for i in range(self._max_iters): xi,yi = random.choice(zip(augx,y)) if yi * np.dot(self.w, xi) &amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;lt;= 0: # update only for incorrect examples self.w += yi*xi def predict(self,x): augx = self.augmented_x(x) scores = np.dot(augx, self.w) preds = np.ones(len(scores)) preds[scores &amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;lt;= 0] = -1.0 return preds

Note that we have developed the Perceptron algorithm to work with the scikit-learn (sklearn) ecosystem by adhering to their API for how classifiers or any estimators should be designed.

The class initialization is simple. You specify:

**Number of dimensions**in the input data: these are used for building the weight vector “w” of the appropriate size- Maximum number of
**iterations**till convergence: The perceptron learning algorithm is an iterative update algorithm. Although it converges fast, you do not want to wait forever to arrive at the learnt model.

We have created an “augmented” version of the input data in the “augmented_x” variable. We have just appended a constant, 1, to each input observation. The corresponding entry in the weight vector will be the “bias” parameter, that will enable not-through-the-origin class-boundaries with the perceptron. As we will later see, if you do not utilize the bias parameter, you will always have a through-the-origin solution.

The fit method is straightforward. In each iteration, you randomly choose a training point to work with (line 21). Then, you test the prediction of the current model on this chosen instance. Since the input classes are -1 and 1, the product of the prediction and the true label tell you whether the prediction is accurate. If the product is positive, it is correct (signs match), else it is incorrect. For every incorrect prediction, you update the weight vector to be “better” for the instance on which it failed. You continue to do this till completion. Note that a better strategy is to keep a validation set to evaluate the performance of the algorithm at regular intervals, says after every 100 iterations. Then the stopping criteria could be a stabilized learner, that means, when the model performance has stopped improving.

## What can you do with a perceptron?

**You can use the **perceptron** for classification**. Let’s try a simple classification experiment. We will create a simple training set which corresponds to the “OR” class. This is a bivariate or 2-dimensional dataset for illustration purposes.

x = np.array([ [0,0], [1,0], [0,1], [1,1] ], dtype=float) y = np.array([-1,1,1,1], dtype=float)

In this dataset, an instance belongs to the positive class if either of the input variables is 1, else it belongs to the negative class. Now we will perform the experiment of learning and predicting on this data.

from sklearn.metrics import accuracy_score def experiment(classifier, x, y, config='OR'): classifier.fit(x,y) yhat = classifier.predict(x) print config, "accuracy is", accuracy_score(y,yhat)

Just 3 steps. train (fit), predict the categories, and then compute the accuracy of the prediction. If you did everything right so far, you will notice that the output is:

OR accuracy is 1.0

Voila! Perfect performance. Almost seems magical.

Let’s visualize the classification boundary of the trained classifier, in the context of the training set that we supplied the dataset.

The “OR” data that we concocted is a realizable case for the perceptron algorithm. That means, our classifier is a linear classifier and OR is a linearly separable dataset. Naturally, perfect performance is “realizable”. Theoretically, it can be shown that the perceptron algorithm converges in the realizable setting to an accurate solution. But only in the realizable setting.

Let us make a simple non-linear dataset, such as the XOR or exclusive-OR dataset.

x = np.array([ [0,0], [1,0], [0,1], [1,1] ], dtype=float) y = np.array([-1,1,1,-1], dtype=float)

The x’s remain the same, but the y undergoes a change. If both input variables are 1, then the instance now belongs to the negative class. No matter how long you train it, the accuracy of the perceptron on the XOR dataset will never be perfect. Why? Because it is not possible to draw a line that will perfectly separate the positive and the negative classes in this case. Perceptron, or any other linear classifier, will never be able to perfectly categorize in these situations.

Later, in this series of tutorials, we will look at examples of more complex machine learning classifiers that can classify non-linear datasets perfectly. For now, here is the complete code for the experiments we did for this post.

## What are the benefits of the perceptron?

Firstly, it is easy to implement. Secondly, it lends itself naturally to the online setting. Remember how we iterated through a whole bunch of examples by random choice. In practical applications with big data, when you cannot hold the entire data in memory, you can just iterate through the examples as an input stream, in an online fashion. You do not need to hold the entire data in memory. You utilize one instance for updating the model, discard it, and move on to the next. Minor changes.

Moreover, it is a compact representation. More compact than something like a decision tree or a non-parametric method.

Inspite of being such a simple algorithm, it is used today in many libraries and systems for complex tasks. For example, the part of speech tagger in the nltk library is in fact an implementation of a trained perceptron designed for multiclass classification. As these results show, the performance is indeed comparable and in some cases even better than other well-known implementations.