Explainable AI - Notes and Practice
Linear Regression and Logistic Regression
Regression is a form of supervised learning: the input is an observation $\mathbf{x}$, typically a vector in $\mathbb{R}^d$, and the output is a real number $y \in \mathbb{R}$. Our goal is to predict the output $y$ from the input $\mathbf{x}$, that is, to learn the function:
\[y = f(\mathbf{x}) + \epsilon\]where $\epsilon$ is the noise.
We begin with linear regression.
Linear Regression
Start with a linear function of a single input feature $x$: $f(x) = \omega x + b$, where $\omega$ is the slope and $b$ is the intercept. We observe a noisy output of this function, $y = \omega x + b + \epsilon$, where $\epsilon$ is assumed to be Gaussian noise.
Now extend this to the $d$-dimensional case: the prediction is a linear combination (also called a weighted sum) of the $d$ input variables $x_1, …, x_d$:
\[f(x) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 + ... + \omega_d x_d\]where the $x_j$ are the input features, the $\omega_j$ are the feature weights, and $\omega_0$ is the intercept.
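A sketch of computing this weighted sum for a single instance; the weights and input values are placeholders ($d = 3$ here, chosen only for illustration):

```python
import numpy as np

# Sketch of the d-dimensional prediction f(x) = w0 + w1*x1 + ... + wd*xd.
# Weights and the input instance are illustrative placeholders (d = 3).
w0 = 0.5                               # intercept
w = np.array([1.0, -2.0, 0.3])         # feature weights w1..wd
x = np.array([0.2, 1.5, -0.7])         # one input instance x1..xd

f_x = w0 + w @ x                       # linear combination (weighted sum)
print(f_x)                             # 0.5 + 0.2 - 3.0 - 0.21 = -2.51
```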
The KEY ASSUMPTIONS of Linear Regression:
- Linearity: Prediction is a linear combination of features. (Linear effects are easily interpretable, and additivity separates the effects.)
- Normality: Errors are normally distributed.
- Homoscedasticity: Constant variance of errors across feature space.
- Independence: Each instance is independent. (We do not measure the same thing repeatedly.)
- Fixed Features: Features are treated as constants, free of measurement error.
- No Multicollinearity: Features should not be strongly correlated. (A quick correlation check is sketched after this list.)
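As a rough check of the no-multicollinearity assumption, one can inspect pairwise feature correlations. The sketch below uses a small synthetic data matrix in which one feature is constructed to nearly duplicate another; all values are illustrative.

```python
import numpy as np

# Rough multicollinearity check via the feature correlation matrix.
# Synthetic data: feature 3 is built to nearly duplicate feature 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                           # 200 instances, 3 features
X[:, 2] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=200)    # inject strong correlation

corr = np.corrcoef(X, rowvar=False)   # feature-by-feature correlations
print(corr)                           # off-diagonal entries near +/-1 signal trouble
```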
Ordinary Least Squares (OLS)
Suppose we have a linear function:
\[f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b\]We need to estimate the parameters $(\mathbf{w}, b)$ from the data. One method is to fit the parameters by minimizing the squared prediction error on the training set $\lbrace(\mathbf{x}_i, y_i)\rbrace_{i=1}^{N}$:
\[\min_{\mathbf{w}, b}{\sum^{N}_{i = 1}{(y_i - f(\mathbf{x}_i))^2}} = \min_{\mathbf{w}, b}{\sum^{N}_{i = 1}{(y_i - (\mathbf{w}^T \mathbf{x}_i + b))^2}}\]We have a closed-form solution for this. The bias $b$ can be absorbed into $\mathbf{w}$ by redefining:
\[\mathbf{w} \leftarrow \begin{bmatrix}\mathbf{w}\\b\end{bmatrix}, \mathbf{x} \leftarrow \begin{bmatrix}\mathbf{x}\\1\end{bmatrix}\]then the minimization problem becomes:
\[\min_{\mathbf{w}} ||\mathbf{y} - \mathbf{X}^T \mathbf{w}||^2\]where $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, …, \mathbf{x}_N]$ is the data matrix (with the augmented inputs as columns) and $\mathbf{y} = [y_1, y_2, …, y_N]^T$ is the vector of outputs.
This gives the closed-form solution:
\[\mathbf{w}^* = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{y}\]The term $(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}$ is the (Moore-Penrose) pseudo-inverse of $\mathbf{X}^T$.
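A minimal sketch of this closed-form solution, following the convention above that $\mathbf{X}$ stacks the augmented inputs as columns; the synthetic data, true weights, and noise scale are assumptions made only for the example.

```python
import numpy as np

# Closed-form OLS: w* = (X X^T)^{-1} X y, with inputs stacked as columns of X
# and the bias absorbed by appending a constant-1 row. Data is synthetic.
rng = np.random.default_rng(0)
N, d = 200, 3
X_raw = rng.normal(size=(d, N))                   # d features x N samples
w_true = np.array([1.0, -2.0, 0.3, 0.5])          # last entry acts as the bias b

X = np.vstack([X_raw, np.ones((1, N))])           # append the constant feature
y = X.T @ w_true + rng.normal(0.0, 0.1, size=N)   # noisy targets

# Solve the normal equations (X X^T) w = X y; np.linalg.solve is preferred
# over forming the inverse explicitly for numerical stability.
w_hat = np.linalg.solve(X @ X.T, X @ y)
print(w_hat)                                      # should be close to w_true
```

In practice one would typically call `np.linalg.lstsq` or a library estimator such as scikit-learn's `LinearRegression`, which handle rank-deficient data matrices more robustly than the explicit normal equations.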