Explainable AI - Notes and Practice
Linear Regression and Logistic Regression
Regression is a form of supervised learning: the input is an observation $\mathbf{x}$, typically a vector in $\mathbb{R}^d$, and the output is a real number $y \in \mathbb{R}$. Our goal is to predict the output $y$ from the input $\mathbf{x}$, that is, to learn the function:
\[y = f(\mathbf{x}) + \epsilon\]where $\epsilon$ is the noise.
We begin with linear regression.
Linear Regression
Start with a linear function of a single input feature $x$: $f(x) = \omega x + b$, where $\omega$ is the slope and $b$ is the intercept. We observe a noisy output of this function, $y = \omega x + b + \epsilon$, where $\epsilon$ is assumed to be Gaussian noise.
Now extend this to the $d$-dimensional case: the prediction is a linear combination (also called a weighted sum) of the $d$ input variables $x_1, …, x_d$:
\[f(x) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 + ... + \omega_d x_d\]where the $x_j$ are the input features, the $\omega_j$ are the feature weights, and $\omega_0$ is the intercept.
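A sketch of computing this weighted sum for a single instance; the weights and input values are placeholders ($d = 3$ here, chosen only for illustration):

```python
import numpy as np

# Sketch of the d-dimensional prediction f(x) = w0 + w1*x1 + ... + wd*xd.
# Weights and the input instance are illustrative placeholders (d = 3).
w0 = 0.5                               # intercept
w = np.array([1.0, -2.0, 0.3])         # feature weights w1..wd
x = np.array([0.2, 1.5, -0.7])         # one input instance x1..xd

f_x = w0 + w @ x                       # linear combination (weighted sum)
print(f_x)                             # 0.5 + 0.2 - 3.0 - 0.21 = -2.51
```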
The KEY ASSUMPTIONS of Linear Regression:
- Linearity: Prediction is a linear combination of features. (Linear effects are easily interpretable, and additivity separates the effects.)
- Normality: Errors are normally distributed.
- Homoscedasticity: Constant variance of errors across feature space.
- Independence: Each instance is independent. (We do not measure the same thing repeatedly.)
- Fixed Features: Features are treated as constants, free of measurement error.
- No Multicollinearity: Features should not be strongly correlated. (A quick correlation check is sketched after this list.)
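As a rough check of the no-multicollinearity assumption, one can inspect pairwise feature correlations. The sketch below uses a small synthetic data matrix in which one feature is constructed to nearly duplicate another; all values are illustrative.

```python
import numpy as np

# Rough multicollinearity check via the feature correlation matrix.
# Synthetic data: feature 3 is built to nearly duplicate feature 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                           # 200 instances, 3 features
X[:, 2] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=200)    # inject strong correlation

corr = np.corrcoef(X, rowvar=False)   # feature-by-feature correlations
print(corr)                           # off-diagonal entries near +/-1 signal trouble
```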
Ordinary Least Squares (OLS)
Suppose we have a linear function:
\[f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b\]We need to estimate the parameters $(\mathbf{w}, b)$ from the data. One method is to fit the parameters by minimizing the squared prediction error on the training set $\lbrace(\mathbf{x}_i, y_i)\rbrace_{i=1}^{N}$:
\[\min_{\mathbf{w}, b}{\sum^{N}_{i = 1}{(y_i - f(\mathbf{x}_i))^2}} = \min_{\mathbf{w}, b}{\sum^{N}_{i = 1}{(y_i - (\mathbf{w}^T \mathbf{x}_i + b))^2}}\]We have a closed-form solution for this. The bias $b$ can be absorbed into $\mathbf{w}$ by redefining:
\[\mathbf{w} \leftarrow \begin{bmatrix}\mathbf{w}\\b\end{bmatrix}, \mathbf{x} \leftarrow \begin{bmatrix}\mathbf{x}\\1\end{bmatrix}\]then the minimization problem becomes:
\[\min_{\mathbf{w}} ||\mathbf{y} - \mathbf{X}^T \mathbf{w}||^2\]where $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, …, \mathbf{x}_N]$ is the data matrix (with the augmented inputs as columns) and $\mathbf{y} = [y_1, y_2, …, y_N]^T$ is the vector of outputs.
This gives the closed-form solution:
\[\mathbf{w}^* = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{y}\]The term $(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}$ is the (Moore-Penrose) pseudo-inverse of $\mathbf{X}^T$.
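A minimal sketch of this closed-form solution, following the convention above that $\mathbf{X}$ stacks the augmented inputs as columns; the synthetic data, true weights, and noise scale are assumptions made only for the example.

```python
import numpy as np

# Closed-form OLS: w* = (X X^T)^{-1} X y, with inputs stacked as columns of X
# and the bias absorbed by appending a constant-1 row. Data is synthetic.
rng = np.random.default_rng(0)
N, d = 200, 3
X_raw = rng.normal(size=(d, N))                   # d features x N samples
w_true = np.array([1.0, -2.0, 0.3, 0.5])          # last entry acts as the bias b

X = np.vstack([X_raw, np.ones((1, N))])           # append the constant feature
y = X.T @ w_true + rng.normal(0.0, 0.1, size=N)   # noisy targets

# Solve the normal equations (X X^T) w = X y; np.linalg.solve is preferred
# over forming the inverse explicitly for numerical stability.
w_hat = np.linalg.solve(X @ X.T, X @ y)
print(w_hat)                                      # should be close to w_true
```

In practice one would typically call `np.linalg.lstsq` or a library estimator such as scikit-learn's `LinearRegression`, which handle rank-deficient data matrices more robustly than the explicit normal equations.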