Plan

  • Introduce the artificial neural network design for logistic regression

  • Calculate logistic regression in R by using glm

  • Implement all calculations in Python

  • Build the same model by using TensorFlow

  • Build the same model by using PyTorch

Artificial Neural Network design

  • The neural network design will be simple: one input node and one neuron in a single layer.
Figure: Artificial neural network design for logistic regression

  • The training data \(\{(x_i, y_i) \}_{i=1,2,3,4,5,6}\) are
    {(2,0), (3,0), (5,1), (7,0), (11,1), (13,1)}.

Idea for ANN design

  • \(x \mapsto z^1_{1}=w^1_{11}x+b^1_1\).

  • The neuron then applies \(\sigma\): \(z^1_{1} \mapsto \sigma(z^1_{1})\). Here, \(\sigma(x)=logit^{-1}(x)=e^x/(1+e^x)\).

  • We want our single neuron (layer 1) to fire a 1 when the input is 5, 11, or 13, and a 0 when the input is 2, 3, or 7.

  • Our goal is to achieve this by selecting appropriate values for \(w^1_{11}\) and \(b^1_1\). Is this strictly possible? Unfortunately, no. Instead, we will choose \(w^1_{11}\) and \(b^1_1\) so that the model approximates the desired behavior as closely as possible, as the sketch after this list illustrates.
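
To make the design concrete, here is a minimal sketch (in Python, using SciPy's expit as the inverse logit) that pushes each training input through the single neuron for one arbitrary, untrained choice of \(w^1_{11}\) and \(b^1_1\). The outputs are probabilities between 0 and 1, not yet the desired labels; training will move them toward the \(y_i\).

Python

import numpy as np
from scipy.special import expit  # inverse logit: e^z / (1 + e^z)

x = np.array([2, 3, 5, 7, 11, 13])
y = np.array([0, 0, 1, 0, 1, 1])

w, b = 0.5, -3.0       # arbitrary values, for illustration only

z = w * x + b          # z^1_1 = w^1_11 * x + b^1_1 for each input
a = expit(z)           # the neuron's output sigma(z^1_1)
print(np.round(a, 3))  # six probabilities; training should push them toward y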

Logistic regression using R package

glm

Let's get values for \(w^1_{11}\) and \(b^1_1\) using an R function called glm.

R

training_data <- data.frame(x=c(2,3,5,7,11,13), y=c(0,0,1,0,1,1))
log_reg.res <- glm(data = training_data, y ~ x, family = binomial)
w=coef(summary(log_reg.res))[2,1]
b=coef(summary(log_reg.res))[1,1]
print(paste("w =",w))
## [1] "w = 0.551775920706039"
print(paste("b =",b))
## [1] "b = -3.54907067213626"

glm fits generalized linear models; the binomial family performs the logistic regression.
\(w=w^1_{11}=0.5518\) and \(b=b^1_1=-3.5491\) are the values found by glm.
What does this mean? Below is the plot of the data points and \(f(x)=e^{wx+b}/(1+e^{wx+b})\) with the \(w\) and \(b\) values that glm found.

In this context, we assume that the output, \(y\), follows a binomial distribution \(B(n=1,p)\), where \(p\) is given by \[p=e^{wx+b}/(1+e^{wx+b}).\] Here, \(x\) represents the input. For example, when the input is \(x=2\), the output follows \(B(1, 0.07976687)\), meaning the probability of obtaining \(y=1\) is approximately \(0.0797\), a rare event. That explains why the dataset contains the data point \((2,0)\). Conversely, when \(x=13\), the probability of \(y=1\) increases to approximately \(0.974\). Thus, observing \((13,0)\) in the dataset would be unexpected. The parameters \(w\) and \(b\) are expected to adjust the function \(e^{wx+b}/(1+e^{wx+b})\) to approximate a probability distribution that best explains the dataset. The fitted \(w\) and \(b\) describe the dataset well at \(x=2,3,11,13\), but not as well at \(x=5,7\). This limitation arises from the choice of the inverse logit function as the fitting function and the fact that there are only two parameters.
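
The probabilities quoted above can be reproduced directly from the fitted parameters. A minimal check in Python, using SciPy's expit and the \(w\), \(b\) values reported by glm:

Python

import numpy as np
from scipy.special import expit

w, b = 0.551775920706039, -3.54907067213626  # values reported by glm above
x = np.array([2, 3, 5, 7, 11, 13])

p = expit(w * x + b)   # P(y = 1 | x) under the fitted model
print(np.round(p, 4))  # about 0.0798 at x=2 and 0.974 at x=13;
                       # roughly 0.31 at x=5 and 0.58 at x=7, hence the weaker fit there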

Now, consider two additional sets of parameter values. They seem to fit the data well, except for one instance. By what standard are the parameter values produced by glm considered the best?

Maximum likelihood, loss function, cost function.

The values for \(w\) and \(b\) are chosen to maximize the likelihood function, which represents a conditional probability \(P_{w,b}(y|x)\). Assuming independence, this probability can be computed as follows. \[ \begin{aligned} P_{w,b}(y|x) & = P_{w,b}(y=0|x=2)P_{w,b}(y=0|x=3)\cdots P_{w,b}(y=1|x=13) \\ & = \Big(1-\frac{e^{2w+b}}{1+e^{2w+b}}\Big)\Big(1-\frac{e^{3w+b}}{1+e^{3w+b}}\Big)\cdots \Big(\frac{e^{13w+b}}{1+e^{13w+b}}\Big) \\ & = \Big(\frac{1}{1+e^{2w+b}}\Big)\Big(\frac{1}{1+e^{3w+b}}\Big)\cdots \Big(\frac{e^{13w+b}}{1+e^{13w+b}}\Big) \end{aligned} \]

Maximizing this likelihood function is what happens behind the scenes in glm. Since the logarithm is a monotonically increasing function, maximizing the log-likelihood function is equivalent to maximizing the likelihood function. Moreover, the log-likelihood is much easier to work with. Alternatively, minimizing the negative log-likelihood function accomplishes the same task. The negative log-likelihood is often referred to as the loss function or cost function.
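
As a sanity check, the likelihood product and its negative logarithm can be evaluated at the parameters that glm reports. A minimal sketch in Python, again using SciPy's expit; any other choice of \(w\) and \(b\) yields a larger negative log-likelihood.

Python

import numpy as np
from scipy.special import expit

x = np.array([2, 3, 5, 7, 11, 13])
y = np.array([0, 0, 1, 0, 1, 1])

def neg_log_likelihood(w, b):
    p = expit(w * x + b)                              # P(y = 1 | x) for each data point
    likelihood = np.prod(np.where(y == 1, p, 1 - p))  # the product written above
    return -np.log(likelihood)

print(neg_log_likelihood(0.551775920706039, -3.54907067213626))  # glm's estimate
print(neg_log_likelihood(0.4, -3.0))                             # a nearby guess: larger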

____________________________________________________________________________________

Digression - general NN model and language

Figure: General ANN diagram

Forward passing (propagation)

The outputs \(a^{l-1}_k\) from the previous layer \(l-1\) arrive at the \(j\)-th neuron in layer \(l\). First, \(z^l_j\) is calculated by \(z^l_{j} =b^l_j + \sum_k w^l_{jk} a^{l-1}_k\). In vector and matrix form, \[z^l = w^la^{l-1}+b^l,\] where \(z^l=[z^l_1\; z^l_2 \, \dots \, z^l_m]^T\), \(w^l=[w^l_{jk}]_{1\leq j\leq m, 1\leq k\leq n}\), \(a^{l-1}=[\sigma(z^{l-1}_1)\; \sigma(z^{l-1}_2) \, \dots \, \sigma(z^{l-1}_n)]^T\), and \(b^l=[b^l_1\; b^l_2 \, \dots \, b^l_m]^T\).
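
A minimal numpy sketch of this forward step for one layer, with illustrative sizes (three neurons feeding two); the weights, biases, and previous-layer activations are random placeholders.

Python

import numpy as np
from scipy.special import expit as sigma  # the activation function

rng = np.random.default_rng(0)

n, m = 3, 2                  # n neurons in layer l-1, m neurons in layer l
a_prev = rng.random((n, 1))  # a^{l-1}, column vector of previous activations
W = rng.random((m, n))       # w^l, one row per neuron of layer l
b = rng.random((m, 1))       # b^l

z = W @ a_prev + b           # z^l = w^l a^{l-1} + b^l
a = sigma(z)                 # a^l, passed on to layer l+1
print(z.ravel(), a.ravel())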

After the last layer, there is a cost function comparing the model's final output against the desired output. Searching for the \(w\)'s and \(b\)'s that minimize the cost function is called learning. One popular method for this optimization is gradient descent, which requires the gradient of the cost function with respect to the \(w\)'s and \(b\)'s.
The cost function for the logistic regression that we are building is \[ C(w,b) = -\sum_{i=1}^6 \Big[ y_i \ln\big( logit^{-1}(wx_i+b) \big) + (1-y_i) \ln\big(1-logit^{-1}(wx_i+b)\big)\Big].\] It is also called binary cross entropy.

Back propagation

The logistic regression model that we are building as an artificial neural network is simple enough to calculate the gradient just by following the basic gradient formula.
In general, however, it can be challenging when there are more layers and more neurons. For that, there is an algorithmic approach called back propagation. First, define \(\delta^l_j=\frac{\partial C}{\partial z^l_j}\), and let layer \(L\) be the last layer.

  • BP1: \(\delta^L_j=\frac{\partial C}{\partial a^L_j}\sigma'(z^L_j)\)

  • BP2: \(\delta^l=\bigl( (w^{l+1})^T \delta^{l+1}\bigr) \odot \sigma'(z^l) \text{ for } l<L. \;\;\odot: \text{Hadamard product}\)

  • BP3: \(\frac{\partial C}{\partial b^l_j}=\delta^l_j\)

  • BP4: \(\frac{\partial C}{\partial w^l_{jk}}=a^{l-1}_k\delta^l_j\)
    Back propagation is essentially a few layers of chain-rule calculation, which makes it well suited to a step-by-step gradient computation.

  • Let's calculate \(\frac{\partial C}{\partial w^L_{jk}}\) first. Combining BP4 and BP1,
    \(\frac{\partial C}{\partial w^L_{jk}}=a^{L-1}_k\delta^L_j =a^{L-1}_k\frac{\partial C}{\partial a^L_j}\sigma'(z^L_j).\) The calculation of every term on the right-hand side is straightforward, because each one is either a last-layer quantity or a value already evaluated during the forward pass.

  • It is also not challenging to calculate \(\frac{\partial C}{\partial b^L_{j}}\). It’s another last layer calculation.

  • Let's try \(\frac{\partial C}{\partial w^{L-1}_{jk}}\). \(\frac{\partial C}{\partial w^{L-1}_{jk}}=a^{L-2}_k\delta^{L-1}_j\). \(a^{L-2}_k\) has already been calculated in the forward propagation. By BP2, \(\delta^{L-1}=\bigl( (w^{L})^T \delta^{L}\bigr) \odot \sigma'(z^{L-1})\), and everything on the right-hand side is computable from the layer-\(L\) quantities \(w^L\) and \(\delta^L\) together with \(z^{L-1}\) from the forward pass.

  • In this way, the calculation marches backward through the layers to obtain all \(\frac{\partial C}{\partial w^{l}_{jk}}\) and \(\frac{\partial C}{\partial b^{l}_{j}}\). The sketch after this list walks through one full forward and backward pass.
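
The recursion above takes only a few lines of code. The following is a minimal numpy sketch for a small two-layer network with sigmoid activations and a squared-error cost \(C=\tfrac{1}{2}(a^L-t)^2\); the layer sizes and random values are illustrative and are not the logistic-regression model of this post.

Python

import numpy as np
from scipy.special import expit as sigma

def dsigma(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigma(z)
    return s * (1 - s)

rng = np.random.default_rng(1)
x = rng.random((3, 1))   # input, i.e. a^0
t = np.array([[1.0]])    # desired output

W1, b1 = rng.random((4, 3)), rng.random((4, 1))  # layer 1
W2, b2 = rng.random((1, 4)), rng.random((1, 1))  # layer 2 = last layer L

# Forward pass
z1 = W1 @ x + b1;  a1 = sigma(z1)
z2 = W2 @ a1 + b2; a2 = sigma(z2)

# Backward pass; dC/da^L = a2 - t for the squared-error cost
delta2 = (a2 - t) * dsigma(z2)          # BP1
delta1 = (W2.T @ delta2) * dsigma(z1)   # BP2
dC_db2, dC_db1 = delta2, delta1         # BP3
dC_dW2 = delta2 @ a1.T                  # BP4 at the last layer
dC_dW1 = delta1 @ x.T                   # BP4 one layer back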

____________________________________________________________________________________

Back to logistic regression

Figure: Artificial neural network design for logistic regression

In the context of glm and its connection to artificial neural networks (ANN), each input \(x\) passing through the ANN produces an output of \(\sigma(wx+b)=e^{wx+b}/(1+e^{wx+b})\), which is interpreted as a probability distribution \(q\) as seen earlier. Specifically:
\[ q(y=0)=\frac{1}{1+e^{wx+b}}, \;\; q(y=1)=\frac{e^{wx+b}}{1+e^{wx+b}}.\] The observed label \(y\) is also treated as a distribution \(p\): \[ p(y=0)=\begin{cases} 1 & \text{if } y(x)=0\\ 0 & \text{otherwise,} \end{cases} \qquad p(y=1)=1-p(y=0). \]

Given the distributions \(p\) and \(q\), the cross entropy \(H(p,q)\) is \[H(p,q)=-E_p(\log q)=-p(y=0) \log q(y=0)-p(y=1) \log q(y=1).\] This cross entropy serves as the loss function, commonly referred to as binary cross-entropy loss. The total loss function \(\mathcal{L}_{total}\) is \[\mathcal{L}_{total}=\sum_i H(p_i,q_i),\] where \(p_i\) and \(q_i\) are the distributions determined by the pair \((x_i,y_i)\). This corresponds to the negative log-likelihood function of the earlier section Maximum likelihood, loss function, cost function.
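
For a single training pair, the cross entropy can be written out directly. A minimal sketch using the glm estimates of \(w\) and \(b\) from above; summing over the six pairs reproduces the negative log-likelihood computed earlier.

Python

import numpy as np
from scipy.special import expit

w, b = 0.551775920706039, -3.54907067213626  # glm estimates from above

def cross_entropy(x, y):
    q1 = expit(w * x + b)  # q(y=1); q(y=0) = 1 - q1
    p1 = float(y)          # p is the one-hot distribution of the observed label
    return -(1 - p1) * np.log(1 - q1) - p1 * np.log(q1)

pairs = [(2, 0), (3, 0), (5, 1), (7, 0), (11, 1), (13, 1)]
print(sum(cross_entropy(x, y) for x, y in pairs))  # the negative log-likelihood again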

Python with scikit-learn

The same process can be implemented using the Python scikit-learn package.

Python

from sklearn.linear_model import LogisticRegression
import numpy as np

logisticRegr = LogisticRegression(max_iter = 2500, penalty='l2', C=1e+6)  # a very large C makes the L2 penalty negligible

train_data = np.array([2.0, 3, 5, 7, 11, 13]).reshape(-1, 1)
train_label = np.array([0.0,  0.0, 1.0, 0.0, 1.0, 1.0])

logisticRegr.fit(train_data, train_label)
print(logisticRegr.coef_[0,0])
## 0.5518358239998236
print(logisticRegr.intercept_)
## [-3.54940996]

Use the trained model to predict

x=np.array([8.5]).reshape(-1,1)
print(f'When x = {x[0,0]} prediction is {logisticRegr.predict(x)}')
## When x = 8.5 prediction is [1.]
print(f'probability is {logisticRegr.predict_proba(x)[0,1]}')
## probability is 0.757898891570654
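
The predicted probability can be reproduced by hand from the fitted coefficient and intercept, confirming that predict_proba simply evaluates the inverse logit at \(wx+b\):

from scipy.special import expit

w = logisticRegr.coef_[0, 0]    # ~0.5518
b = logisticRegr.intercept_[0]  # ~-3.5494
print(expit(w * 8.5 + b))       # ~0.758, matching predict_proba above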

Two methods for logistic regression have been introduced so far, and their fundamental ideas are connected to the ANN framework. The next implementation spells out the underlying calculations and focuses on finding the values of \(w\) and \(b\) that minimize the negative log-likelihood function.

Bare-hand implementation

import numpy as np
import scipy as sp
from tqdm import trange
import matplotlib.pyplot as plt

from scipy.special import expit

def inv_logit(x):
    # inverse logit (sigmoid): e^x / (1 + e^x)
    return expit(x)

def loss_func(x, y):
    # binary cross entropy for a single predicted probability x and label y
    return -y*np.log(x) - (1-y)*np.log(1-x)

x = np.array([2,3,5,7,11,13])
y = np.array([0,0,1,0,1,1])


w = -0.0; b = -1.0
delta = 0.0001; lr = 0.02; epoch = 2500
loss_record = np.zeros(epoch)
progress_bar = trange(epoch, unit = " epoch")
##   0%|          | 0/2500 [00:00<?, ? epoch/s]

wb_record = np.zeros([epoch,2])
grad_record = np.zeros([epoch,2])

for ii in progress_bar:
    # Central-difference approximations of dC/dw and dC/db:
    # (cost at parameter + delta, minus cost at parameter - delta) / (2 * delta)
    partial_tl_with_w = (np.sum(loss_func(inv_logit((w+delta)*x+b),y)) - 
                         np.sum(loss_func(inv_logit((w-delta)*x+b),y)))/2/delta
    partial_tl_with_b = (np.sum(loss_func(inv_logit(w*x+(b+delta)),y)) - 
                         np.sum(loss_func(inv_logit(w*x+(b-delta)),y)))/2/delta
    wb_record[ii,0] = w
    wb_record[ii,1] = b
    grad_record[ii,0] = (-1)*partial_tl_with_w
    grad_record[ii,1] = (-1)*partial_tl_with_b 
    
    w = w - partial_tl_with_w*lr
    b = b - partial_tl_with_b*lr

    loss_record[ii] = np.sum(loss_func(inv_logit(w*x+b),y))

Essentially, the implementation here uses the gradient descent algorithm to minimize the negative log-likelihood function with respect to \(w\) and \(b\). The gradients of the function with respect to \(w\) and \(b\) are computed, and their values are updated by moving in the direction of the negative gradient. The output below displays the values of \(w\) and \(b\), which are nearly identical to the previous results.

print(f'w is {w} and b is {b}')
## w is 0.5514636707705467 and b is -3.5468396301032925
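
For this single-neuron model, the central-difference approximation is not strictly needed. Differentiating the negative log-likelihood gives \(\partial C/\partial w=\sum_i\big(logit^{-1}(wx_i+b)-y_i\big)x_i\) and \(\partial C/\partial b=\sum_i\big(logit^{-1}(wx_i+b)-y_i\big)\). Below is a minimal sketch of the same loop with these closed-form gradients (new variable names so the run above is untouched); it converges to essentially the same values.

# Same data, learning rate, and epoch count as above, with analytic gradients.
w_a, b_a = 0.0, -1.0
for _ in range(epoch):
    residual = inv_logit(w_a * x + b_a) - y  # sigma(w x_i + b) - y_i for each point
    w_a -= lr * np.sum(residual * x)         # dC/dw
    b_a -= lr * np.sum(residual)             # dC/db
print(f'analytic gradients: w is {w_a} and b is {b_a}')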

The graph below illustrates how the values of the negative log-likelihood function evolve as \(w\) and \(b\) are updated along the negative gradient.

data_size = x.shape[0]

fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(4, 4))

axes.scatter(np.arange(0, epoch, 1, dtype=int), loss_record[0:epoch]/data_size, s = 2, c = 'r')
axes.set_xlabel('epoch');
axes.set_ylabel('Mean loss per data point');
axes.set_title('The last loss = ' + str(loss_record[epoch-1]/data_size))

plt.show()

Bare-hand with a bit of help from PyTorch

Here, torch.nn.Module is not used; \(w\) and \(b\) are plain tensors that the optimizer updates directly.

import numpy as np
import scipy as sp
import torch
import torch.optim as optim

def inv_logit_tensor(x):
    return torch.special.expit(x)
    
def loss_func_tensor(x,y):
    return -y*torch.log(x)-(1-y)*(torch.log(1-x))
  
X = torch.tensor([2.0, 3.0, 5.0, 7.0, 11.0, 13.0], dtype=torch.float32)
y = torch.tensor(np.array([0,0,1,0,1,1]), dtype=torch.float32)

w = torch.tensor(.0, requires_grad=True)
b = torch.tensor(-1.0, requires_grad=True)

optimizer = optim.SGD((w,b), lr = 0.02) # tell torch that w and b are variables

epoch = 5000
loss_record = np.zeros(epoch)

for ii in range(epoch):
    optimizer.zero_grad()
    loss = torch.sum(loss_func_tensor(inv_logit_tensor(w*X+b),y))
    loss.backward() # Gradient calculation
    optimizer.step() # Update w and b
    loss_record[ii] = loss.item() # .item() extracts the Python float from the 0-dim tensor

print(f'w is {w.item()} and b is {b.item()}')
## w is 0.5517694354057312 and b is -3.5490245819091797

PyTorch implementation

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import trange

X = torch.tensor([2.0,3,5,7,11,13]).reshape(-1, 1)
y = torch.tensor([0.0,0,1,0,1,1]).reshape(-1, 1)

class LogisticRegression(torch.nn.Module):
    def __init__(self, input_dim, output_dim):
        super(LogisticRegression, self).__init__()
        self.linear = torch.nn.Linear(input_dim, output_dim)

    def forward(self, x):
        outputs = torch.sigmoid(self.linear(x))
        return outputs

model = LogisticRegression(1,1) # this creates the neural network model: one input, one output neuron

loss_func = torch.nn.BCELoss()  # Binary Cross Entropy loss is chosen
optimizer = optim.SGD(model.parameters(), lr=0.1) # SGD: Stochastic Gradient Descent

n_epochs = 2500
progress_bar = trange(n_epochs, unit = " epoch")
##   0%|          | 0/2500 [00:00<?, ? epoch/s]
for epoch in progress_bar: 
    labels = y
    optimizer.zero_grad() 
    outputs = model(X)
    loss = loss_func(outputs, labels) 
    loss.backward()                    # This calculates the gradient
    optimizer.step()                   # This updates w and b
    
for p in model.parameters():
    print(p) 
## Parameter containing:
## tensor([[0.5508]], requires_grad=True)
## Parameter containing:
## tensor([-3.5420], requires_grad=True)

Use the trained model to predict

test_val = torch.tensor([8.5])
model(test_val)
## tensor([0.7576], grad_fn=<SigmoidBackward0>)

TensorFlow 2.x implementation

Somehow, this doesn’t work with reticulate. Run this in Python.

import tensorflow as tf 
from tensorflow import keras
from tensorflow.keras import layers
print(tf.__version__)
from tqdm import tqdm

class logreg(tf.keras.Model):
  def __init__(self, input_dim):
    super().__init__()                #super(logreg, self).__init__()
    self.input_dim = input_dim
    self.dense1 = tf.keras.Sequential(
        [
            tf.keras.layers.InputLayer(input_shape=(self.input_dim, )),
            tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)
        ]
    )
  @tf.function
  def call(self, inputs):
      return self.dense1(inputs)

model_logreg = logreg(1)

train_data = tf.constant([[2.0], [3.0], [5.0], [7.0], [11.0], [13.0]])  # shape (6, 1): one feature per example
train_label = tf.constant([[0.0], [0.0], [1.0], [0.0], [1.0], [1.0]])
output = model_logreg(train_data)

model_logreg.summary()
model_logreg.get_weights()

loss_fun=keras.losses.BinaryCrossentropy()

optimizer=keras.optimizers.SGD(learning_rate=0.1)
epochs = 2500
progress_bar = tqdm(range(1, epochs + 1), unit = " epoch")
for epoch in progress_bar: #range(1, epochs + 1):
    with tf.GradientTape() as tape:
        loss = loss_fun(train_label, model_logreg(train_data))
    gradients = tape.gradient(loss, model_logreg.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model_logreg.trainable_variables))

model_logreg.get_weights()
test_val = tf.constant([[8.5]])  # shape (1, 1)
model_logreg.predict(test_val)

model_logreg(test_val)

More complex model

As you can imagine, the concepts and mathematics behind ANNs are fairly simple. However, direct implementation can be quite challenging. Machine learning libraries such as PyTorch and TensorFlow significantly simplify the coding process. Implementing the model below will require only a few additional lines to define the architecture, while the rest of the code will remain largely unchanged.
Figure: Autoencoder and a code for it
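
As a taste of how little extra code the architecture requires, here is a minimal PyTorch sketch of an autoencoder module. The layer sizes are illustrative placeholders, and the training loop would look just like the logistic-regression loop above, with a reconstruction loss such as nn.MSELoss in place of nn.BCELoss.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # The encoder compresses the input; the decoder reconstructs it.
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder(input_dim=8, hidden_dim=2)  # placeholder sizes
loss_func = nn.MSELoss()                        # reconstruction loss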