Explainable AI - Notes and Practice

Linear Regression and Logistic Regression

Regression is a form of supervised learning: given an input observation $\mathbf{x}$, typically a vector in $\mathbb{R}^d$, we want to predict a real-valued output $y \in \mathbb{R}$. That is, we want to learn the function:

\[y = f(\mathbf{x}) + \epsilon\]

where $\epsilon$ is the noise.

This brings us to linear regression.

Linear Regression

Consider a linear function of a single input feature $x$: $f(x) = \omega x + b$, where $\omega$ is the slope and $b$ is the intercept. What we observe is a noisy output of this function, $y = \omega x + b + \epsilon$, where $\epsilon$ is assumed to be Gaussian noise.

Extending to the $d$-dimensional case, the prediction becomes a linear combination (also called a weighted sum) of the $d$ input variables $x_1, …, x_d$:

\[f(x) = \omega_0 + \omega_1 x_1 + \omega_2 x_2 + ... + \omega_d x_d\]

where $x_j$ are the input features, $\omega_j$ are the feature weights, and $\omega_0$ is the intercept.
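
As a quick, minimal sketch (with made-up weights and feature values, not taken from any dataset), this weighted sum can be computed directly:

```python
import numpy as np

# Hypothetical example values: 3 features and their weights.
w0 = 0.5                        # intercept omega_0
w = np.array([1.2, -0.7, 0.3])  # feature weights omega_1, ..., omega_d
x = np.array([2.0, 1.0, 4.0])   # input features x_1, ..., x_d

# Prediction is the intercept plus the weighted sum of the features.
f_x = w0 + np.dot(w, x)
print(f_x)  # 0.5 + 1.2*2.0 - 0.7*1.0 + 0.3*4.0 = 3.4
```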

The KEY ASSUMPTIONS of Linear Regression:

  • Linearity: Prediction is a linear combination of features. (Linear effects are easy to interpret, and additivity separates the effects of individual features.)
  • Normality: Errors are normally distributed.
  • Homoscedasticity: Constant variance of errors across feature space.
  • Independence: Each instance is independent. (We do not measure the same thing repeatedly.)
  • Fixed Features: Features are treated as constants, free of measurement error.
  • No Multicollinearity: Features should not be strongly correlated.

Ordinary Least Squares (OLS)

Suppose we have a linear function:

\[f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b\]

We need to estimate the parameters $(\mathbf{w}, b)$ from the data. One method is to fit them by minimizing the squared prediction error on the training set

\[{\lbrace(\mathbf{x}_i, y_i)\rbrace}^{N}_{i = 1}\]

Like this:

\[\min_{\mathbf{w}, b}{\sum^{N}_{i = 1}{(y_i - f(\mathbf{x}_i))^2}} = \min_{\mathbf{w}, b}{\sum^{N}_{i = 1}{(y_i - (\mathbf{w}^T \mathbf{x}_i + b))^2}}\]

We have a closed-form solution for this. The bias $b$ can be absorbed into $\mathbf{w}$ by redefining:

\[\mathbf{w} \leftarrow \begin{bmatrix}\mathbf{w}\\b\end{bmatrix}, \mathbf{x} \leftarrow \begin{bmatrix}\mathbf{x}\\1\end{bmatrix}\]

then the minimization problem becomes:

\[\min_{\mathbf{w}} ||\mathbf{y} - \mathbf{X}^T \mathbf{w}||^2\]

where $\mathbf{X} = [\mathbf{x_1}, \mathbf{x_2}, …, \mathbf{x_N}]$ is the data matrix and $\mathbf{y} = [y_1, y_2, …, y_N]^T$ is the vector of outputs.
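
Setting the gradient of this objective with respect to $\mathbf{w}$ to zero gives the normal equations:

\[\nabla_{\mathbf{w}} ||\mathbf{y} - \mathbf{X}^T \mathbf{w}||^2 = -2\mathbf{X}(\mathbf{y} - \mathbf{X}^T \mathbf{w}) = \mathbf{0} \quad\Longrightarrow\quad \mathbf{X}\mathbf{X}^T \mathbf{w} = \mathbf{X}\mathbf{y}\]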

So we get the closed-form solution.

\[\mathbf{w}^* = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{y}\]

The factor $(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}$ is the (Moore–Penrose) pseudo-inverse of $\mathbf{X}^T$.
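
To make this concrete, here is a minimal numpy sketch of the closed-form solution on a small synthetic dataset (the true weights, bias, and noise level below are assumed for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N samples with d features (all values assumed for illustration).
N, d = 100, 3
true_w = np.array([1.5, -2.0, 0.5])
true_b = 0.7

samples = rng.normal(size=(N, d))                          # rows are observations
y = samples @ true_w + true_b + 0.1 * rng.normal(size=N)   # noisy outputs

# Absorb the bias: append a constant 1 to every input, then stack samples
# as columns to form the data matrix X, as in the text.
X = np.vstack([samples.T, np.ones((1, N))])                # shape (d + 1, N)

# Closed-form OLS solution: w* = (X X^T)^{-1} X y
w_star = np.linalg.solve(X @ X.T, X @ y)

print(w_star)  # approximately [1.5, -2.0, 0.5, 0.7] (weights, then bias)
```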

Model Fit: R-Squared

R-Squared ($R^2$) is a statistical measure of the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model. It indicates how well the regression model fits the observed data.

For predictions $\hat{y}_i = f(\mathbf{x}_i)$, R-Squared (the Coefficient of Determination) is the proportion of variance explained by the model:

\[R^2 = 1 - \frac{SSE}{SST}\]

where $SSE$ is the Sum of Squared Errors, calculated as

\[SSE = \sum^{n}_{i = 1}(y_i - \hat{y_i})^2\]

which measures the total squared difference between actual and predicted values; and $SST$ is the Total Sum of Squares, $SST = \sum^{n}_{i = 1}(y_i - \bar{y})^2$, which measures the total variance in the observed data, where $\bar{y}$ is the mean of the observed values.

Now that we have $R^2$, we can interpret it: $R^2$ close to 1 means the model explains most of the variance, while $R^2$ close to 0 means the model explains little of the variance.

Then we have Adjusted R-Squared:

\[\bar{R}^2 = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}\]

which adjusts $R^2$ for the number of predictors $p$ and the sample size $n$. Use Adjusted R-Squared when comparing models with different numbers of features, since it penalizes unnecessary predictors.
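
A short sketch of computing $R^2$ and the adjusted $R^2$ directly from these formulas (the observed and predicted values below are hypothetical):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Proportion of variance explained: 1 - SSE / SST."""
    sse = np.sum((y_true - y_pred) ** 2)           # Sum of Squared Errors
    sst = np.sum((y_true - np.mean(y_true)) ** 2)  # Total Sum of Squares
    return 1.0 - sse / sst

def adjusted_r_squared(y_true, y_pred, p):
    """Adjust R^2 for the number of predictors p and sample size n."""
    n = len(y_true)
    r2 = r_squared(y_true, y_pred)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical observed values and model predictions.
y_true = np.array([3.0, 4.5, 6.1, 7.9, 10.2])
y_pred = np.array([3.2, 4.4, 6.0, 8.1, 9.9])

print(r_squared(y_true, y_pred))              # close to 1: most variance explained
print(adjusted_r_squared(y_true, y_pred, 2))  # assuming the model used p = 2 predictors
```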

To be continued…