Artificial neural network design for logistic regression
\(x \mapsto z^1_{1}=w^1_{11}x+b^1_1\).
The neuron then outputs \(\sigma(z^1_{1})\). Here, \(\sigma(x)=logit^{-1}(x)=e^x/(1+e^x)\).
We want our single neuron (layer 1) to fire a 1 when the input is 5, 11, or 13, and a 0 when the input is 2, 3, or 7.
Our goal is to achieve this by selecting appropriate values for \(w^1_{11}\) and \(b^1_1\). However, is this strictly possible? Unfortunately, no: \(\sigma(w^1_{11}x+b^1_1)\) is monotone in \(x\), so it cannot be above 0.5 at \(x=5\), below 0.5 at \(x=7\), and above 0.5 again at \(x=11\) and \(x=13\). Instead, we will attempt to choose \(w^1_{11}\) and \(b^1_1\) to make the model approximate the desired behavior as closely as possible.
glm
Let’s get values for \(w^1_{11}\) and \(b^1_1\) using an R function called glm.
R
training_data <- data.frame(x = c(2, 3, 5, 7, 11, 13), y = c(0, 0, 1, 0, 1, 1))
log_reg.res <- glm(data = training_data, y ~ x, family = binomial)  # logistic regression fit
w <- coef(summary(log_reg.res))[2, 1]  # slope estimate
b <- coef(summary(log_reg.res))[1, 1]  # intercept estimate
print(paste("w =",w))
## [1] "w = 0.551775920706039"
print(paste("b =",b))
## [1] "b = -3.54907067213626"
glm can fit generalized linear models; with the family binomial, it performs the logistic regression. The values \(w=w^1_{11}=0.5518\) and \(b=b^1_1=-3.5491\) are found by glm.
What does this mean? Below is the plot of the data points and \(f(x)=e^{wx+b}/(1+e^{wx+b})\) with the \(w\) and \(b\) values that glm found.
In this context, we assume that the output, \(y\), follows a binomial distribution \(B(n=1,p)\), where \(p\) is given by \[p=e^{wx+b}/(1+e^{wx+b}).\] Here, \(x\) represents the input. For example, when the input is \(x=2\), the output follows \(B(1, 0.07976687)\), meaning the probability of obtaining \(y=1\) is approximately \(0.0797\), a rare event. That explains why the dataset contains the data point \((2,0)\). Conversely, when \(x=13\), the probability of \(y=1\) increases to approximately \(0.974\). Thus, observing \((13,0)\) in the dataset would be unexpected. The parameters \(w\) and \(b\) are expected to adjust the function \(e^{wx+b}/(1+e^{wx+b})\) so that it approximates a probability distribution that best explains the dataset. The fitted \(w\) and \(b\) describe the dataset well at \(x=2,3,11,13\), but not as well at \(x=5,7\). This limitation arises from the choice of the inverse logit function as the fitting function and from the fact that there are only two parameters.
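As a quick numerical check, the probabilities quoted above can be reproduced directly from the fitted parameters. The short sketch below simply plugs the \(w\) and \(b\) values reported by glm into \(p=e^{wx+b}/(1+e^{wx+b})\) at \(x=2\) and \(x=13\).
Python
import numpy as np
w, b = 0.551775920706039, -3.54907067213626  # estimates reported by glm above
def p_of_one(x):
    # p = e^(wx+b) / (1 + e^(wx+b)), the probability of y = 1 given x
    z = w * x + b
    return np.exp(z) / (1 + np.exp(z))
print(p_of_one(2))   # approximately 0.0797
print(p_of_one(13))  # approximately 0.974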
Now, consider two additional sets of parameter values. They seem to fit the data well, except for one instance. By what standard are the parameter values produced by glm considered the best?
The values for \(w\) and \(b\) are chosen to maximize the likelihood function, which represents a conditional probability \(P_{w,b}(y|x)\). Assuming independence, this probability can be computed as follows. \[ \begin{aligned} P_{w,b}(y|x) & = P_{w,b}(y=0|x=2)P_{w,b}(y=0|x=3)\cdots P_{w,b}(y=1|x=13) \\ & = \Big(1-\frac{e^{2w+b}}{1+e^{2w+b}}\Big)\Big(1-\frac{e^{3w+b}}{1+e^{3w+b}}\Big)\cdots \Big(\frac{e^{13w+b}}{1+e^{13w+b}}\Big) \\ & = \Big(\frac{1}{1+e^{2w+b}}\Big)\Big(\frac{1}{1+e^{3w+b}}\Big)\cdots \Big(\frac{e^{13w+b}}{1+e^{13w+b}}\Big) \end{aligned} \]
Maximizing this likelihood function is what happens behind the scenes
in glm
. Since the logarithm is a monotonically increasing
function, maximizing the log-likelihood function is equivalent to
maximizing the likelihood function. Moreover, the log-likelihood is much
easier to work with. Alternatively, minimizing the negative
log-likelihood function accomplishes the same task. The negative
log-likelihood is often referred to as the loss
function or cost function.
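To illustrate this criterion numerically, the sketch below evaluates the log-likelihood at the glm estimates and at one hypothetical alternative pair (\(w=1\), \(b=-6\), chosen here only for comparison); since glm returns the maximum-likelihood estimates, they give the larger value.
Python
import numpy as np
from scipy.special import expit  # expit is the inverse logit
x = np.array([2, 3, 5, 7, 11, 13])
y = np.array([0, 0, 1, 0, 1, 1])
def log_likelihood(w, b):
    # log P_{w,b}(y|x) = sum_i log P_{w,b}(y_i|x_i) with p_i = expit(w x_i + b)
    p = expit(w * x + b)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(log_likelihood(0.5518, -3.5491))  # the glm estimates
print(log_likelihood(1.0, -6.0))        # a hypothetical alternative: a smaller value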
____________________________________________________________________________________
General ANN diagram
Outputs \(a^{l-1}_k\) from the previous layer (\(l-1\)) feed into the \(j\)-th neuron in layer \(l\). First, \(z^l_j\) is calculated by \(z^l_{j} =b^l_j + \sum_k w^l_{jk} a^{l-1}_k\). In vector and matrix form, \[z^l = w^la^{l-1}+b^l\] where \(z^l=[z^l_1\; z^l_2 \, \dots \, z^l_m]^T\), \(w^l=[w^l_{jk}]_{1\leq j\leq m, 1\leq k\leq n}\), \(a^{l-1}=[\sigma(z^{l-1}_1)\; \sigma(z^{l-1}_2) \, \dots \, \sigma(z^{l-1}_n)]^T\), and \(b^l=[b^l_1\; b^l_2 \, \dots \, b^l_m]^T\).
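As a small illustration of \(z^l = w^la^{l-1}+b^l\), the sketch below pushes two made-up activations through a layer with three neurons; the weights and biases are arbitrary numbers chosen only for the example.
Python
import numpy as np
def sigma(z):
    return 1 / (1 + np.exp(-z))  # inverse logit, applied elementwise
a_prev = np.array([[0.5], [0.2]])          # a^{l-1}, shape (n, 1) with n = 2
w = np.array([[0.1, -0.3],
              [0.8,  0.5],
              [-0.2, 0.4]])                # w^l, shape (m, n) with m = 3
b = np.array([[0.0], [0.1], [-0.1]])       # b^l, shape (m, 1)
z = w @ a_prev + b                         # z^l = w^l a^{l-1} + b^l
a = sigma(z)                               # a^l, passed on to layer l + 1
print(z.ravel(), a.ravel())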
After the last layer, there is a cost function comparing the last output of the model against the desired output. Searching for the \(w\)’s and \(b\)’s that (typically) minimize the cost function is called learning. One popular method for this optimization is gradient descent. For this, we need to find the gradient of the cost function as a function of the \(w\)’s and \(b\)’s.
The cost function for the logistic regression that we are building is \[ C(w,b) = -\sum_{i=1}^6 \Big( y_i \ln( logit^{-1}(wx_i+b) ) + (1-y_i) \ln(1-logit^{-1}(wx_i+b)) \Big).\] It is also called binary cross entropy.
The logistic regression model that we are building as an artificial
neural network is simple enough to calculate the gradient just by
following the basic gradient formula.
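Indeed, for this model the gradient has a simple closed form: with \(z_i=wx_i+b\), \(\frac{\partial C}{\partial z_i}=\sigma(z_i)-y_i\), so \(\frac{\partial C}{\partial w}=\sum_i\bigl(\sigma(wx_i+b)-y_i\bigr)x_i\) and \(\frac{\partial C}{\partial b}=\sum_i\bigl(\sigma(wx_i+b)-y_i\bigr)\). The sketch below evaluates both at an arbitrary point \(w=0.5\), \(b=-3\).
Python
import numpy as np
from scipy.special import expit
x = np.array([2, 3, 5, 7, 11, 13])
y = np.array([0, 0, 1, 0, 1, 1])
w, b = 0.5, -3.0                      # an arbitrary evaluation point
residual = expit(w * x + b) - y       # dC/dz_i = sigma(z_i) - y_i
print(np.sum(residual * x))           # dC/dw
print(np.sum(residual))               # dC/db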
In general, however, it can be challenging when there are more layers and more neurons. For that, there is an algorithmic approach called back propagation. First, define \(\delta^l_j=\frac{\partial C}{\partial z^l_j}\), and let layer \(L\) be the last layer.
BP1: \(\delta^L_j=\frac{\partial C}{\partial a^L_j}\sigma'(z^L_j)\)
BP2: \(\delta^l=\bigl( (w^{l+1})^T \delta^{l+1}\bigr) \odot \sigma '(z^l) \text{ for } l<L. \;\;\odot: \text{Hadamard product}\)
BP3: \(\frac{\partial C}{\partial b^l_j}=\delta^l_j\)
BP4: \(\frac{\partial C}{\partial w^l_{jk}}=a^{l-1}_k\delta^l_j\)
Back propagation is essentially a few layers of chain-rule calculation. It is well suited for implementing the gradient calculation step by step.
Let’s calculate \(\frac{\partial C}{\partial w^L_{jk}}\) first. Combining BP4 and BP1, \(\frac{\partial C}{\partial w^L_{jk}}=a^{L-1}_k\delta^L_j =a^{L-1}_k\frac{\partial C}{\partial a^L_j}\sigma'(z^L_j).\) The calculation of every term on the right-hand side is straightforward because each one is either a last-layer quantity or has already been evaluated during the forward pass.
It is also not challenging to calculate \(\frac{\partial C}{\partial b^L_{j}}\); it is another last-layer calculation.
Let’s try \(\frac{\partial C}{\partial w^{L-1}_{jk}}\). \(\frac{\partial C}{\partial w^{L-1}_{jk}}=a^{L-2}_k\delta^{L-1}_j\). \(a^{L-2}_k\) has already been calculated in the forward propagation. By BP2, \(\delta^{L-1}=\bigl( (w^{L})^T \delta^{L}\bigr) \odot \sigma '(z^{L-1})\), and everything on the right-hand side is computable from the calculation at layer \(L\).
In this way, the calculation marches backward to calculate all \(\frac{\partial C}{\partial w^{l}_{jk}}\) and \(\frac{\partial C}{\partial b^{l}_{j}}\).
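To make BP1–BP4 concrete, here is a minimal sketch for a made-up two-layer network (one input, two hidden sigmoid neurons, one sigmoid output) with the binary cross-entropy cost; the input, label, and random weights are illustration values only.
Python
import numpy as np
def sigma(z):
    return 1 / (1 + np.exp(-z))       # inverse logit
def dsigma(z):
    return sigma(z) * (1 - sigma(z))  # sigma'(z)
rng = np.random.default_rng(0)
x_in, y_out = 5.0, 1.0                                     # one made-up training example
w1, b1 = rng.normal(size=(2, 1)), rng.normal(size=(2, 1))  # layer 1: 1 input -> 2 neurons
w2, b2 = rng.normal(size=(1, 2)), rng.normal(size=(1, 1))  # layer 2 = L: 2 -> 1 output
# Forward pass: z^l = w^l a^{l-1} + b^l, a^l = sigma(z^l)
a0 = np.array([[x_in]])
z1 = w1 @ a0 + b1
a1 = sigma(z1)
z2 = w2 @ a1 + b2
a2 = sigma(z2)
# BP1: delta^L = (dC/da^L) * sigma'(z^L), with C the binary cross entropy
dC_da2 = -y_out / a2 + (1 - y_out) / (1 - a2)
delta2 = dC_da2 * dsigma(z2)
# BP2: delta^1 = ((w^2)^T delta^2) Hadamard sigma'(z^1)
delta1 = (w2.T @ delta2) * dsigma(z1)
# BP3 and BP4: gradients with respect to the b's and w's
grad_b2, grad_w2 = delta2, delta2 @ a1.T
grad_b1, grad_w1 = delta1, delta1 @ a0.T
print(grad_w1.ravel(), grad_b1.ravel(), grad_w2.ravel(), grad_b2.ravel())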
____________________________________________________________________________________
Artificial neural network design for logistic regression
In the context of glm and its connection to artificial neural networks (ANN), each input \(x\) passing through the ANN produces an output of \(\sigma(wx+b)=e^{wx+b}/(1+e^{wx+b})\), which is interpreted as a probability distribution \(q\), as seen earlier. Specifically:
\[ q(y=0)=\frac{1}{1+e^{wx+b}},
\;\; q(y=1)=\frac{e^{wx+b}}{1+e^{wx+b}}.\] The output \(y\) is also considered a distribution \(p\): \[
p(y=0)=\begin{cases}
1 \text{ if } y(x)=0\\
0 \text{ otherwise.}
\end{cases}
\]
Given the distributions \(p\) and
\(q\), the cross entropy \(H(p,q)\) is \[H(p,q)=-E_p(\log q)=-p(y=0) \log q(y=0)-p(y=1)
\log q(y=1).\] This cross-entropy serves as the loss function,
commonly referred to as binary cross-entropy loss. The total loss
function \(\mathcal{L}_{total}\) is
\[\mathcal{L}_{total}=\sum_i H\big(p(x_i,y_i),q(x_i,y_i)\big).\] This corresponds to the negative log-likelihood function discussed earlier in Maximum likelihood, loss function, cost function.
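For a single data point this is easy to evaluate. The sketch below re-uses the fitted \(w\) and \(b\) from earlier and computes \(H(p,q)\) for the example \((x,y)=(13,1)\); since \(p\) puts all of its mass on the observed label, the cross entropy reduces to \(-\log q(y=1)\).
Python
import numpy as np
from scipy.special import expit
w, b = 0.5518, -3.5491      # the estimates found earlier
x_i, y_i = 13, 1            # one training example
q1 = expit(w * x_i + b)     # q(y=1); q(y=0) = 1 - q1
H = -(y_i * np.log(q1) + (1 - y_i) * np.log(1 - q1))
print(H)                    # the binary cross-entropy contribution of (13, 1)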
The same process can be implemented using the Python scikit-learn package.
Python
from sklearn.linear_model import LogisticRegression
import numpy as np
logisticRegr = LogisticRegression(max_iter = 2500, penalty='l2', C=1e+6)  # very large C: effectively no regularization, to match glm
train_data = np.array([2.0, 3, 5, 7, 11, 13]).reshape(-1, 1)
train_label = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
logisticRegr.fit(train_data, train_label)
## LogisticRegression(C=1000000.0, max_iter=2500)
print(logisticRegr.coef_[0,0])
## 0.5518358239998236
print(logisticRegr.intercept_)
## [-3.54940996]
x=np.array([8.5]).reshape(-1,1)
print(f'When x = {x[0,0]} prediction is {logisticRegr.predict(x)}')
## When x = 8.5 prediction is [1.]
print(f'probability is {logisticRegr.predict_proba(x)[0,1]}')
## probability is 0.757898891570654
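One way to read the fitted model (a small sketch re-using the logisticRegr object above): the predicted class flips where \(wx+b=0\), that is, at \(x=-b/w\), which explains why \(x=8.5\) is predicted as class 1.
w_fit = logisticRegr.coef_[0, 0]
b_fit = logisticRegr.intercept_[0]
print(-b_fit / w_fit)   # roughly 6.4; inputs above this value are predicted as class 1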
Two methods for logistic regression have been introduced so far, and their fundamental ideas are connected to the ANN framework. More details on the underlying calculations are given in the next implementation, which focuses on finding the values of \(w\) and \(b\) that minimize the negative log-likelihood function.
import numpy as np
import scipy as sp
from tqdm import trange
import matplotlib.pyplot as plt
from scipy.special import expit
def inv_logit(x):
    return expit(x)  # inverse logit (sigmoid)
def loss_func(x,y):
    return -y*np.log(x)-(1-y)*(np.log(1-x))  # binary cross entropy for one prediction
x = np.array([2,3,5,7,11,13])
y = np.array([0,0,1,0,1,1])
w = -0.0; b = -1.0
delta = 0.0001; lr = 0.02; epoch = 2500
loss_record = np.zeros(epoch)
progress_bar = trange(epoch, unit = " epoch")
## 0%| | 0/2500 [00:00<?, ? epoch/s]
wb_record = np.zeros([epoch,2])
grad_record = np.zeros([epoch,2])
for ii in progress_bar: #range(epoch):
    # Central-difference numerical approximation of the partial derivatives
    partial_tl_with_w = (np.sum(loss_func(inv_logit((w+delta)*x+b),y)) -
                         np.sum(loss_func(inv_logit((w-delta)*x+b),y)))/2/delta
    partial_tl_with_b = (np.sum(loss_func(inv_logit(w*x+(b+delta)),y)) -
                         np.sum(loss_func(inv_logit(w*x+(b-delta)),y)))/2/delta
    wb_record[ii,0] = w
    wb_record[ii,1] = b
    grad_record[ii,0] = (-1)*partial_tl_with_w
    grad_record[ii,1] = (-1)*partial_tl_with_b
    # Gradient descent update: move against the gradient
    w = w - partial_tl_with_w*lr
    b = b - partial_tl_with_b*lr
    loss_record[ii] = np.sum(loss_func(inv_logit(w*x+b),y))
Essentially, the implementation here uses the gradient descent algorithm to minimize the negative log-likelihood function with respect to \(w\) and \(b\). The gradients of the function with respect to \(w\) and \(b\) are computed, and their values are updated by moving in the direction of the negative gradient. The output below displays the values of \(w\) and \(b\), which are nearly identical to the previous results.
print(f'w is {w} and b is {b}')
## w is 0.5514636707705467 and b is -3.5468396301032925
The graph below illustrates how the values of the negative log-likelihood function evolve as \(w\) and \(b\) are updated along the negative gradient.
data_size = x.shape[0]
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(4, 4))
# ax = subplots(figsize=(5,3))[1]
axes.scatter(np.arange(0, epoch, 1, dtype=int), loss_record[0:epoch]/data_size, s = 2, c = 'r')
axes.set_xlabel('epoch');
axes.set_ylabel('Total loss');
axes.set_title('The last loss = ' + str(loss_record[epoch-1]/data_size))
# plt.tight_layout()
plt.show()
In the following implementation, torch.nn.Module has not been utilized.
import numpy as np
import scipy as sp
import torch
import torch.optim as optim
def inv_logit_tensor(x):
    return torch.special.expit(x)  # inverse logit (sigmoid) for tensors
def loss_func_tensor(x,y):
    return -y*torch.log(x)-(1-y)*(torch.log(1-x))  # binary cross entropy
X = torch.tensor([2.0, 3.0, 5.0, 7.0, 11.0, 13.0], dtype=torch.float32)
y = torch.tensor(np.array([0,0,1,0,1,1]), dtype=torch.float32)
w = torch.tensor(.0, requires_grad=True)
b = torch.tensor(-1.0, requires_grad=True)
optimizer = optim.SGD((w,b), lr = 0.02) # tell torch that w and b are variables
epoch = 5000
loss_record = np.zeros(epoch)
for ii in range(epoch):
    optimizer.zero_grad()
    loss = torch.sum(loss_func_tensor(inv_logit_tensor(w*X+b),y))
    loss.backward() # Gradient calculation
    optimizer.step() # Update w and b
    loss_record[ii] = loss.item()  # store the loss as a plain float
print(f'w is {w} and b is {b}')
## w is 0.5517694354057312 and b is -3.5490245819091797
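As a quick sanity check (a sketch re-using the \(w\), \(b\), and inv_logit_tensor defined just above), the predicted probability at \(x=8.5\) comes out close to the scikit-learn value of about 0.758.
x_new = torch.tensor(8.5)
print(inv_logit_tensor(w * x_new + b))   # roughly 0.758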
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import trange
X = torch.tensor([2.0,3,5,7,11,13]).reshape(-1, 1)
y = torch.tensor([0.0,0,1,0,1,1]).reshape(-1, 1)
class LogisticRegression(torch.nn.Module):
    def __init__(self, input_dim, output_dim):
        super(LogisticRegression, self).__init__()
        self.linear = torch.nn.Linear(input_dim, output_dim)  # computes wx + b
    def forward(self, x):
        outputs = torch.sigmoid(self.linear(x))  # inverse logit of wx + b
        return outputs
model = LogisticRegression(1,1) # this makes the neural network model
loss_func = torch.nn.BCELoss() # Binary Cross Entropy loss is chosen
optimizer = optim.SGD(model.parameters(), lr=0.1) # SGD: Stochastic Gradient Descent
n_epochs = 2500
progress_bar = trange(n_epochs, unit = " epoch")
## 0%| | 0/2500 [00:00<?, ? epoch/s]
for epoch in progress_bar:
    labels = y
    optimizer.zero_grad()
    outputs = model(X)
    loss = loss_func(outputs, labels)
    loss.backward() # This calculates the gradient
    optimizer.step() # This updates w and b
for p in model.parameters():
    print(p)
## Parameter containing:
## tensor([[0.5508]], requires_grad=True)
## Parameter containing:
## tensor([-3.5420], requires_grad=True)
test_val = torch.tensor([8.5])
model(test_val)
## tensor([0.7576], grad_fn=<SigmoidBackward0>)
The TensorFlow version below is not executed here through reticulate. Run this in Python.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
print(tf.__version__)
from tqdm import tqdm
class logreg(tf.keras.Model):
    def __init__(self, input_dim):
        super().__init__() #super(logreg, self).__init__()
        self.input_dim = input_dim
        self.dense1 = tf.keras.Sequential(
            [
                tf.keras.layers.InputLayer(input_shape=(self.input_dim, )),
                tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)
            ]
        )
    @tf.function
    def call(self, inputs):
        return self.dense1(inputs)
model_logreg = logreg(1)
train_data = tf.constant([2.0, 3, 5, 7, 11, 13], shape=(6, 1))  # column vector: one feature per example
train_label = tf.constant([0.0, 0.0, 1.0, 0.0, 1.0, 1.0], shape=(6, 1))
output = model_logreg(train_data)
model_logreg.summary()
model_logreg.get_weights()
loss_fun=keras.losses.BinaryCrossentropy()
optimizer=keras.optimizers.SGD(learning_rate=0.1)
epochs = 2500
progress_bar = tqdm(range(1, epochs + 1), unit = " epoch")
for epoch in progress_bar: #range(1, epochs + 1):
    with tf.GradientTape() as tape:
        loss = loss_fun(train_label, model_logreg(train_data))
    gradients = tape.gradient(loss, model_logreg.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model_logreg.trainable_variables))
model_logreg.get_weights()
test_val = tf.constant([[8.5]])  # shape (1, 1): one example, one feature
model_logreg.predict(test_val)
model_logreg(test_val)
PyTorch and TensorFlow significantly simplify the coding process. Implementing the model below will require only a few additional lines to define the architecture, while the rest of the code will remain largely unchanged.
Autoencoder and a code for it