MultiLayer Perceptron (MLP)
This article covers:
- The simple perceptron.
- Limitations of the perceptron.
- Gradient descent.
- Mono-layer perceptron with gradient descent.
- Multi-layer perceptron for binary classification problems.
- Multi-layer perceptron for multi-class classification problems.
- Model evaluation (confusion matrix).
Perceptron:
The perceptron is an algorithm for supervised learning of binary classifiers. A binary classifier is a function that can decide whether or not an input, represented by a vector of numbers, belongs to some specific class. [1]
The activation function used by the perceptron is the step function, also called the Heaviside function.
Formula of the step function:
H(z) = 1 if z >= 0, else 0
(Graph of the step function.)
The update rule for the perceptron is:
w <- w + learning_rate * (y - y_hat) * x
where y is the true label and y_hat the prediction, so the weights only change when the prediction is wrong.
Implementation :
import numpy as np
import matplotlib.pyplot as plt

class Perceptron:
    def __init__(self, learning_rate=0.1, number_iter=1000):
        self.learning_rate = learning_rate
        self.number_iter = number_iter

    def fit(self, x, y):
        # x is expected to include a bias column (a column of ones)
        self.x = x
        self.y = y

    def initParameters(self, x):
        w = np.random.randn(1, x.shape[1])
        return w

    def heaviside(self, x):
        return 1 if x >= 0 else 0

    def heavisideArray(self, x):
        a = [1 if x1 >= 0 else 0 for x1 in x[0]]
        return a

    def train(self):
        self.w = self.initParameters(self.x)
        for i in range(self.number_iter):
            for x, y in zip(self.x, self.y):
                z = np.dot(self.w, x)
                y_hat = self.heaviside(z)
                # weights change only when the prediction is wrong
                self.w += self.learning_rate * (y - y_hat) * x
        # self.displayModel()

    def predict(self, x):
        # x is expected with samples as columns: shape (n_features, n_samples)
        z = np.dot(self.w, x)
        a = self.heavisideArray(z)
        return a

    def displayModel(self):
        # scatter the data and draw the learned decision boundary
        fig, ax = plt.subplots(figsize=(10, 7))
        ax.scatter(self.x[:, 0], self.x[:, 1], c=self.y, cmap="bwr")
        x1 = np.linspace(-15, 4, 100)
        x2 = (-self.w[0][0] * x1 - self.w[0][2]) / self.w[0][1]
        ax.plot(x1, x2, c='g', lw=8)
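A minimal usage sketch (my own toy example, not from the original article): two Gaussian blobs, with a column of ones appended as the bias feature that the class above expects.

import numpy as np

# Toy data: two blobs plus a bias column of ones.
X = np.vstack((np.random.randn(50, 2) + 3, np.random.randn(50, 2) - 3))
X = np.hstack((X, np.ones((X.shape[0], 1))))
y = np.array([1] * 50 + [0] * 50)

p = Perceptron(learning_rate=0.1, number_iter=100)
p.fit(X, y)
p.train()
print(p.predict(X.T)[:10])   # predict expects samples as columns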
The perceptron can only separate classes linearly. In this example we implement the perceptron to classify three logic problems: AND, OR, and XOR. Each problem has two classes, 0 and 1.
This is the representation of each problem:
As the figure above shows, the AND and OR classes can each be separated by a straight line, but the two classes of XOR cannot.
- AND :
- OR :
- XOR :
A single perceptron cannot implement XOR, because the XOR classes are not linearly separable: no straight line separates the points (0,0), (1,1) from the points (0,1), (1,0).
This is why we need more layers in the network, together with a non-linear activation function, to create the required non-linearity; the sketch below illustrates the idea with hand-picked weights.
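As a side note (not part of the original article's code), here is a minimal sketch showing that XOR becomes representable once a hidden layer is added. The weights for OR, NAND, and AND are hand-picked rather than learned:

import numpy as np

def heaviside(z):
    return (z >= 0).astype(int)

def xor_two_layer(a, b):
    x = np.array([a, b, 1])                          # inputs plus a bias term
    # hidden layer: one unit computes OR(a, b), the other NAND(a, b)
    h = heaviside(np.array([[1, 1, -0.5],            # OR
                            [-1, -1, 1.5]]).dot(x))  # NAND
    # output layer: AND of the two hidden units gives XOR
    return heaviside(np.array([1, 1, -1.5]).dot(np.append(h, 1)))

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_two_layer(a, b))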
Implementation :
import numpy as np
from random import choice

# Each sample is (inputs with a trailing bias term, expected output).
# AND logic circuit
AND_data = [(np.array([1, 1, 1]), 1),
            (np.array([1, 0, 1]), 0),
            (np.array([0, 1, 1]), 0),
            (np.array([0, 0, 1]), 0)]
# OR logic circuit
OR_data = [(np.array([1, 1, 1]), 1),
           (np.array([1, 0, 1]), 1),
           (np.array([0, 1, 1]), 1),
           (np.array([0, 0, 1]), 0)]
# XOR logic circuit
XOR_data = [(np.array([1, 1, 1]), 0),
            (np.array([1, 0, 1]), 1),
            (np.array([0, 1, 1]), 1),
            (np.array([0, 0, 1]), 0)]

def heaviside(x):
    return 1 if x >= 0 else 0

def Perceptron(training_data, circuit):
    w = np.random.rand(3)
    learning_rate = 0.1
    number_iterations = 100
    for i in range(number_iterations):
        # pick a random sample and apply the perceptron update rule
        x, y = choice(training_data)
        z = x.dot(w)
        y_predicted = heaviside(z)
        w += learning_rate * (y - y_predicted) * x
    print(f"Prediction for {circuit} Logic Circuit : ")
    for x, y in training_data:
        z = x.dot(w)
        y_pred = heaviside(z)
        print(f"x = {x[:2]} , z = {z} , y_pred = {y_pred}")

Perceptron(AND_data, "AND")
Perceptron(OR_data, "OR")
Perceptron(XOR_data, "XOR")
Mono-Layer Perceptron:
In this section we build a mono-layer perceptron with log loss as the cost function.
We use the term log loss for binary classification problems, and the more general cross-entropy for multi-class classification.
- Log loss: L = -(1/m) * sum_i [ y_i * log(a_i) + (1 - y_i) * log(1 - a_i) ]
- Cross-entropy: L = -(1/m) * sum_i sum_c [ y_ic * log(a_ic) ]
where m is the number of samples, a the predicted probability, and c ranges over the classes. A small numeric check follows below.
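A quick numeric check of both formulas (the values are made up purely for illustration):

import numpy as np

# Binary case: log loss.
y_true = np.array([1, 0, 1])
a_pred = np.array([0.9, 0.2, 0.7])
log_loss = -np.mean(y_true * np.log(a_pred) + (1 - y_true) * np.log(1 - a_pred))
print(log_loss)        # about 0.228

# Multi-class case: cross-entropy with one-hot targets.
y_onehot = np.array([[0, 1, 0],
                     [1, 0, 0]])
p_pred = np.array([[0.1, 0.8, 0.1],
                   [0.7, 0.2, 0.1]])
cross_entropy = -np.mean(np.sum(y_onehot * np.log(p_pred), axis=1))
print(cross_entropy)   # about 0.290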
Gradient Descent:
In this section we use gradient descent as the algorithm for optimizing our model.
First, let us apply gradient descent to a simple function to understand how it works.
We will use the function f(x) = x^2 - x - 1, whose minimum is at x = 0.5.
The gradient descent update formula is:
x <- x - learning_rate * f'(x)
Implementation :
# Expression of our function
def f(x):
    return x**2 - x - 1

# Derivative of our function
def df(x):
    return 2*x - 1

# Gradient descent: repeatedly step in the direction opposite to the gradient
def gradient(f, df, x0, learning_rate=0.1, nbr_iterations=100, threshold=1e-6):
    gr = df(x0)
    i = 0
    while i < nbr_iterations and abs(gr) > threshold:
        x0 -= learning_rate * gr
        gr = df(x0)
        i += 1
    return x0

# Optimizing f(x) = x^2 - x - 1
print(gradient(f, df, 6))
We get a result very close to 0.5, the true minimum of the function.
The architecture of our model is shown below:
Model parameters: a weight vector w (one weight per input feature) and a bias b.
The gradient descent update rules in this case are:
w <- w - learning_rate * dw, with dw = (1/m) * X^T (a - y)
b <- b - learning_rate * db, with db = (1/m) * sum(a - y)
The activation function used in this example is the sigmoid:
sigmoid(z) = 1 / (1 + e^(-z))
Implementation :
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def cross_entropy(y, a):
    # mean log loss over the m samples
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

def initialize_w_b(x):
    w, b = np.random.randn(x.shape[1], 1), np.random.randn(1)
    return w, b

def forward_propagation(x, w, b):
    z = x.dot(w) + b
    a = sigmoid(z)
    return a

def gradients(x, a, y):
    # dw = (1/m) X^T (a - y), db = (1/m) sum(a - y)
    dw, db = np.dot(x.T, a - y) / x.shape[0], np.sum(a - y) / x.shape[0]
    return dw, db

def backward_propagation(x, w, b, a, y, learning_rate):
    dw, db = gradients(x, a, y)
    w, b = w - learning_rate * dw, b - learning_rate * db
    return w, b

def predict(x, w, b):
    a = forward_propagation(x, w, b)
    return a >= 0.5

def perceptron(x, y, learning_rate=0.2, n_iter=1000):
    w, b = initialize_w_b(x)
    losses = []
    for i in range(n_iter):
        a = forward_propagation(x, w, b)
        loss = cross_entropy(y, a)
        losses.append(loss)
        w, b = backward_propagation(x, w, b, a, y, learning_rate)
    y_predicted = predict(x, w, b)
    return y_predicted, losses, w, b

def displayResult(losses):
    # plot the training loss against the iteration number
    fig, ax = plt.subplots()
    ax.plot(np.arange(len(losses)), np.array(losses))
    plt.show()
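A minimal usage sketch (my own example, assuming a toy dataset from sklearn's make_blobs):

import numpy as np
from sklearn.datasets import make_blobs

# Two linearly separable blobs; y must be a column vector for the code above.
X, y = make_blobs(n_samples=200, centers=2, random_state=0)
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize to keep the sigmoid well-behaved
y = y.reshape(-1, 1)

y_pred, losses, w, b = perceptron(X, y, learning_rate=0.2, n_iter=1000)
print("accuracy:", np.mean(y_pred == y))
displayResult(losses)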
MultiLayer Perceptron for Binary Classification:
In this section we use a multi-layer perceptron with the following architecture:
Implementation :
import numpy as np
import matplotlib.pyplot as plt

class NeuralNetwork:
    def __init__(self, x, y, layers, learning_rate, n_iterations):
        # store samples as columns: x has shape (n_features, n_samples)
        self.x = x.T
        self.y = y.reshape((1, y.shape[0]))
        self.layers = layers
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.params = {}
        for i in range(1, len(self.layers)):
            self.params['w' + str(i)] = np.random.randn(self.layers[i], self.layers[i - 1])
            self.params['b' + str(i)] = np.random.randn(self.layers[i], 1)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def dsigmoid(self, x):
        # derivative of the sigmoid expressed in terms of its output
        return x * (1 - x)

    def forward_propagation(self, x):
        activations = {'A0': x}
        for i in range(1, (len(self.params) // 2) + 1):
            z = np.dot(self.params['w' + str(i)], activations['A' + str(i - 1)]) + self.params['b' + str(i)]
            activations['A' + str(i)] = self.sigmoid(z)
        return activations

    def gradients(self, activations):
        grads = {}
        dz = activations['A' + str(len(self.params) // 2)] - self.y
        for i in reversed(range(1, (len(self.params) // 2) + 1)):
            grads['dw' + str(i)] = 1 / self.y.shape[1] * np.dot(dz, activations['A' + str(i - 1)].T)
            grads['db' + str(i)] = 1 / self.y.shape[1] * np.sum(dz, axis=1, keepdims=True)
            dz = np.dot(self.params['w' + str(i)].T, dz) * self.dsigmoid(activations['A' + str(i - 1)])
        return grads

    def back_propagation(self, activations):
        grads = self.gradients(activations)
        # note the "+ 1": every layer, including the last one, must be updated
        for i in range(1, len(self.params) // 2 + 1):
            self.params['w' + str(i)] -= self.learning_rate * grads['dw' + str(i)]
            self.params['b' + str(i)] -= self.learning_rate * grads['db' + str(i)]

    def train(self):
        for i in range(self.n_iterations):
            activations = self.forward_propagation(self.x)
            self.back_propagation(activations)

    def prediction(self, x):
        # raw output of the last layer (probabilities)
        activations = self.forward_propagation(x)
        return activations['A' + str(len(self.params) // 2)]

    def predict(self, x):
        # threshold the output probabilities at 0.5
        activations = self.forward_propagation(x)
        predictions = activations['A' + str(len(self.params) // 2)]
        return (predictions > 0.5).astype(int)

    def displayResult(self):
        fig, ax = plt.subplots()
        ax.scatter(self.x[0, :], self.x[1, :], c=self.y.ravel(), cmap='bwr', s=50)
        x0_lim = ax.get_xlim()
        x1_lim = ax.get_ylim()
        resolution = 100
        x0 = np.linspace(x0_lim[0], x0_lim[1], resolution)
        x1 = np.linspace(x1_lim[0], x1_lim[1], resolution)
        X0, X1 = np.meshgrid(x0, x1)
        XX = np.vstack((X0.ravel(), X1.ravel()))
        y_pred = self.prediction(XX)
        y_pred = y_pred.reshape(resolution, resolution)
        ax.pcolormesh(X0, X1, y_pred, cmap='bwr', alpha=0.3, zorder=-1, shading='auto')
        ax.contour(X0, X1, y_pred, colors='green')
        plt.show()
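A minimal usage sketch (my own example, assuming sklearn's make_circles as a non-linearly-separable dataset; the layer sizes are an arbitrary choice):

import numpy as np
from sklearn.datasets import make_circles

# Two concentric circles: impossible for a single linear unit, easy for an MLP.
X, y = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=0)

nn = NeuralNetwork(X, y, layers=[2, 4, 1], learning_rate=0.1, n_iterations=10000)
nn.train()
y_hat = nn.predict(nn.x)                 # the class stores inputs as (n_features, n_samples)
print("accuracy:", np.mean(y_hat == nn.y))
nn.displayResult()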
Explanation of the Update Rules:
You can find the PDF here: Update Rules
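For reference, here is a summary (my own notation, not taken from the linked PDF) of the update rules that the gradients and back_propagation methods above implement, with m the number of samples:
dZ_L = A_L - Y                                        (error at the output layer)
dW_i = (1/m) * dZ_i . (A_{i-1})^T
db_i = (1/m) * sum of dZ_i over the samples
dZ_{i-1} = (W_i)^T . dZ_i * A_{i-1} * (1 - A_{i-1})   (element-wise product with the sigmoid derivative)
W_i <- W_i - learning_rate * dW_i
b_i <- b_i - learning_rate * db_i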
MultiLayer Perceptron for Multi-class Classification:
The difference between the MLP binary classifier and the MLP multi-class classifier is in the last layer: the multi-class classifier uses the softmax activation function instead of the sigmoid.
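As a quick reminder (my own standalone example, with a stabilized helper not used in the class below), softmax turns a vector of class scores into a probability distribution that sums to 1:

import numpy as np

def softmax(z):
    # subtracting the max is a standard trick for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))          # about [0.659, 0.242, 0.099]
print(softmax(scores).sum())    # 1.0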
Model Architecture :
Implementation :
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelBinarizer

class NeuralNetwork:
    def __init__(self, x, y, layers, learning_rate, n_iterations):
        # np.random.seed(0)
        self.x = x.T                                  # samples as columns
        self.y_orig = y                               # kept for plotting
        self.y = LabelBinarizer().fit_transform(y)    # one-hot encode the labels
        self.y = self.y.T                             # shape (n_classes, n_samples)
        self.layers = layers
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.params = {}
        for i in range(1, len(self.layers)):
            self.params['w' + str(i)] = np.random.randn(self.layers[i], self.layers[i - 1])
            self.params['b' + str(i)] = np.random.randn(self.layers[i], 1)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def dsigmoid(self, x):
        return x * (1 - x)

    def softMax(self, x):
        return np.exp(x) / np.sum(np.exp(x), axis=0)

    def forward_propagation(self, x):
        activations = {'A0': x}
        L = len(self.params) // 2
        # hidden layers use the sigmoid ...
        for i in range(1, L):
            z = np.dot(self.params['w' + str(i)], activations['A' + str(i - 1)]) + self.params['b' + str(i)]
            activations['A' + str(i)] = self.sigmoid(z)
        # ... and the last layer uses the softmax
        z = np.dot(self.params['w' + str(L)], activations['A' + str(L - 1)]) + self.params['b' + str(L)]
        activations['A' + str(L)] = self.softMax(z)
        return activations

    def gradients(self, activations):
        grads = {}
        m = self.y.shape[1]                           # number of samples
        dz = activations['A' + str(len(self.params) // 2)] - self.y
        for i in reversed(range(1, (len(self.params) // 2) + 1)):
            grads['dw' + str(i)] = 1 / m * np.dot(dz, activations['A' + str(i - 1)].T)
            grads['db' + str(i)] = 1 / m * np.sum(dz, axis=1, keepdims=True)
            dz = np.dot(self.params['w' + str(i)].T, dz) * self.dsigmoid(activations['A' + str(i - 1)])
        return grads

    def back_propagation(self, activations):
        grads = self.gradients(activations)
        # "+ 1" so that the last layer is updated as well
        for i in range(1, len(self.params) // 2 + 1):
            self.params['w' + str(i)] -= self.learning_rate * grads['dw' + str(i)]
            self.params['b' + str(i)] -= self.learning_rate * grads['db' + str(i)]

    def train(self):
        for i in range(self.n_iterations):
            activations = self.forward_propagation(self.x)
            self.back_propagation(activations)

    def prediction(self, x):
        # predicted class = index of the largest softmax output
        activations = self.forward_propagation(x)
        act = activations['A' + str(len(self.params) // 2)]
        return np.argmax(act, axis=0)

    def predict(self, x):
        # for the multi-class case we take the argmax instead of a 0.5 threshold
        return self.prediction(x)

    def displayResult(self):
        fig, ax = plt.subplots()
        ax.scatter(self.x[0, :], self.x[1, :], c=self.y_orig, cmap='bwr', s=50)
        x0_lim = ax.get_xlim()
        x1_lim = ax.get_ylim()
        resolution = 100
        x0 = np.linspace(x0_lim[0], x0_lim[1], resolution)
        x1 = np.linspace(x1_lim[0], x1_lim[1], resolution)
        # meshgrid over the plotting area
        X0, X1 = np.meshgrid(x0, x1)
        # assemble the (100, 100) grids into a (2, 10000) batch of points
        XX = np.vstack((X0.ravel(), X1.ravel()))
        y_pred = self.prediction(XX)
        y_pred = y_pred.reshape(resolution, resolution)
        ax.pcolormesh(X0, X1, y_pred, cmap='bwr', alpha=0.3, zorder=-1, shading='auto')
        ax.contour(X0, X1, y_pred, colors='green')
        plt.show()
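A minimal usage sketch (my own example, assuming a 3-class blob dataset; the last entry of layers must equal the number of classes):

import numpy as np
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

nn = NeuralNetwork(X, y, layers=[2, 8, 3], learning_rate=0.1, n_iterations=5000)
nn.train()
y_hat = nn.prediction(nn.x)      # class indices, one per sample
print("accuracy:", np.mean(y_hat == y))
nn.displayResult()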
Model Evaluation (Confusion Matrix):
A confusion matrix is an N x N matrix used to evaluate the performance of a classification model, where N is the number of target classes.
A good model has high true positive and true negative rates and low false positive and false negative rates.
Confusion Matrix Elements:
- True Positives (TP): the actual value is positive and the predicted value is also positive.
- True Negatives (TN): the actual value is negative and the predicted value is also negative.
- False Positives (FP): the actual value is negative but the predicted value is positive.
- False Negatives (FN): the actual value is positive but the predicted value is negative.
The short example below shows how sklearn's confusion_matrix lays out these four counts.
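A tiny made-up example showing the layout sklearn's confusion_matrix uses (rows are actual labels, columns are predicted labels, so with labels [0, 1] the matrix reads [[TN, FP], [FN, TP]]):

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
print(confusion_matrix(y_true, y_pred))
# [[3 1]    TN=3, FP=1
#  [1 3]]   FN=1, TP=3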
Classification Measures:
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision = TP / (TP + FP)
- Recall (sensitivity) = TP / (TP + FN)
- Specificity = TN / (TN + FP)
- F1 score = 2TP / (2TP + FP + FN)
Calculate the classification measures manually:
import numpy as np
from sklearn.metrics import confusion_matrix

# y_test: true labels, y_hat: predicted labels (computed earlier).
# With labels [0, 1], sklearn lays the matrix out as [[TN, FP], [FN, TP]].
myconfusionMatrix = confusion_matrix(y_test, y_hat)
TN, FP = myconfusionMatrix[0][0], myconfusionMatrix[0][1]
FN, TP = myconfusionMatrix[1][0], myconfusionMatrix[1][1]
accuracy = (TP + TN) / np.sum(myconfusionMatrix)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
specificity = TN / (TN + FP)
f1score = 2 * TP / (2 * TP + FP + FN)
print(f"Accuracy : {accuracy} .")
print(f"precision : {precision} .")
print(f"recall : {recall} .")
print(f"specificity : {specificity} .")
print(f"f1score : {f1score} .")
Calculate the classification measures using sklearn:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, fbeta_score

print(accuracy_score(y_test, y_hat))
print(precision_score(y_test, y_hat))
print(recall_score(y_test, y_hat))
print(f1_score(y_test, y_hat))
print(fbeta_score(y_test, y_hat, beta=10))