Chapter 7 Linear model
So far in this course we have examined models with no predictors. Usually, however, the modeling situation is that we have the observations \(Y_1, \dots, Y_n\), often called the response variable or output variable, and for each observation \(Y_i\) a vector of predictors \(\mathbf{x}_{i} = (x_{i1}, \dots, x_{ik})\), which we use to predict its value.
We are interested in the values of the response variable given the predictors, so we can treat the values of the predictors as constants, i.e. we do not have to set any prior for them.
Linear models and generalized linear models are among the most important tools of the applied statistician. In principle the inference does not differ from the computations we have done earlier in this course. We have already examined posterior inference for the normal distribution, on which linear models are based. However, a linear model usually has multiple predictors: this means that the posterior for the regression coefficients is a multinormal distribution. This complicates things a little, but the principle stays the same.
We can collect the values of the response variable \(\mathbf{Y} = (Y_1, \dots, Y_n)\) into the \(n\times1\)-matrix \[ \mathbf{Y} = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}, \] and the values of the predictors into the \(n\times k\)-matrix \[ \mathbf{X} = \begin{bmatrix} x_{11} & \dots & x_{1k} \\ \vdots & & \vdots \\ x_{n1} & \dots & x_{nk} \end{bmatrix}, \] so that we can use a convenient matrix notation for the linear model. Usually we also want to add a constant term to the model. This can be incorporated into the matrix notation by setting the first column of the predictor matrix to a vector of ones: \((x_{11}, \dots, x_{n1}) = \mathbf{1}_n\). The regression coefficients can be collected into the \(k\times 1\)-matrix \[ \boldsymbol{\beta} = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}, \] where \(\beta_1\) is the intercept of the model (if the constant term is used).
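As a concrete illustration, the following Python/NumPy sketch builds the response vector and the design matrix for a small made-up data set. The data values and variable names are purely hypothetical; the point is only to show how the constant column of ones is placed as the first column of \(\mathbf{X}\).

```python
import numpy as np

# Hypothetical toy data: n = 5 observations and k - 1 = 2 predictors.
y = np.array([2.3, 1.9, 3.1, 2.8, 3.6])      # response vector Y
x1 = np.array([0.5, 0.2, 1.1, 0.9, 1.4])     # first predictor
x2 = np.array([1.0, 1.3, 0.7, 0.8, 0.5])     # second predictor

n = len(y)

# Design matrix X: the first column is a vector of ones, so that
# beta_1 acts as the intercept (constant term) of the model.
X = np.column_stack([np.ones(n), x1, x2])    # shape (n, k) with k = 3
print(X.shape)                               # (5, 3)
```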
7.1 Classical linear model
In the classical linear model, also known as ordinary least squares regression, it is assumed that the response variables are independent and follow normal distributions given the values of the predictors, that the expected values of these normal distributions are linear combinations of the regression coefficients \(\boldsymbol{\beta}\): \[ E[Y_i \,|\, \boldsymbol{\beta}, \mathbf{x}_i] = \mathbf{x}_i^T\boldsymbol{\beta} = x_{i1}\beta_1 + \dots + x_{ik}\beta_k , \] and that these normal distributions have the same variance \(\sigma^2\). In the Bayesian setting the noninformative prior for the parameter vector is \(p(\boldsymbol{\beta}, \sigma^2) \propto (\sigma^2)^{-1}\). This means that the model can be written as \[ \begin{split} \quad Y_i \,|\, \boldsymbol{\beta}, \sigma^2 &\sim N(\mathbf{x}_i^T \boldsymbol{\beta}, \sigma^2) \quad \text{for all}\,\, i = 1, \dots , n, \\ p(\boldsymbol{\beta}, \sigma^2) &\propto \frac{1}{\sigma^2}, \end{split} \] or more compactly using the matrix notation introduced above as \[ \begin{split} \mathbf{Y} &\sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}) \\ p(\boldsymbol{\beta}, \sigma^2) &\propto \frac{1}{\sigma^2}. \end{split} \]
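To make the sampling model concrete, the sketch below simulates data from it in Python/NumPy, assuming some "true" values for \(\boldsymbol{\beta}\) and \(\sigma^2\) that are chosen here only for illustration: each \(Y_i\) is drawn from a normal distribution with mean \(\mathbf{x}_i^T \boldsymbol{\beta}\) and the common variance \(\sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Illustrative "true" parameter values (assumptions for this simulation only).
beta_true = np.array([1.0, 2.0, -0.5])   # intercept and two slopes
sigma_true = 0.8                         # common standard deviation

n, k = 100, len(beta_true)
# Design matrix with a constant column and k - 1 randomly generated predictors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

# Y_i | beta, sigma^2 ~ N(x_i^T beta, sigma^2), independently for each i.
y = X @ beta_true + rng.normal(scale=sigma_true, size=n)
```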
7.1.1 Posterior for classical linear regression
With derivations similar to the ones done in Section 5.3 we can show that the conditional posterior distribution \(p(\boldsymbol{\beta} \,|\, \sigma^2, \mathbf{y})\) of the regression coefficients given the variance is a \(k\)-dimensional multinormal distribution \[ \boldsymbol{\beta} \,|\,\mathbf{y}, \sigma^2 \sim N(\hat{\boldsymbol{\beta}}, \mathbf{V}_{\boldsymbol{\beta}} \sigma^2), \] where \[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1}\mathbf{X}^T \mathbf{y}, \] and \[ \mathbf{V}_{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1}. \] The marginal posterior distribution of the variance \(\sigma^2\) is a scaled inverse chi-squared distribution with \(n-k\) degrees of freedom and scale \(s^2\): \[ \sigma^2 \,|\, \mathbf{y} \sim \chi^{-2}_{n-k}(s^2), \] where \[ s^2 = \frac{1}{n-k}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}). \] We can observe that when the noninformative prior is used, the results are again quite close to the results of classical statistical inference for the linear model.
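These two results give a simple recipe for simulating draws from the joint posterior \(p(\boldsymbol{\beta}, \sigma^2 \,|\, \mathbf{y})\): first draw \(\sigma^2\) from its marginal posterior, and then draw \(\boldsymbol{\beta}\) from its conditional posterior given that value of \(\sigma^2\). The following Python/NumPy sketch implements this recipe under the noninformative prior; the function name and the number of draws are my own choices, not part of the course material.

```python
import numpy as np

def sample_posterior(X, y, n_draws=4000, rng=None):
    """Draw (beta, sigma^2) from the posterior under the prior p(beta, sigma^2) proportional to 1/sigma^2."""
    rng = np.random.default_rng() if rng is None else rng
    n, k = X.shape

    # beta_hat = (X^T X)^{-1} X^T y and V_beta = (X^T X)^{-1}, as in the text.
    V_beta = np.linalg.inv(X.T @ X)
    beta_hat = V_beta @ X.T @ y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - k)                 # s^2

    # sigma^2 | y follows a scaled inverse chi-squared distribution:
    # draw chi^2_{n-k} variates and invert them, scaling by (n - k) s^2.
    sigma2 = (n - k) * s2 / rng.chisquare(n - k, size=n_draws)

    # beta | sigma^2, y ~ N(beta_hat, V_beta * sigma^2): one draw per sigma^2 draw,
    # generated via the Cholesky factor of V_beta.
    L = np.linalg.cholesky(V_beta)
    z = rng.normal(size=(n_draws, k))
    beta = beta_hat + np.sqrt(sigma2)[:, None] * (z @ L.T)

    return beta, sigma2
```

With the simulated `X` and `y` from the earlier sketch, the column means of the `beta` draws should be close to `beta_true`, and the mean of the `sigma2` draws close to `sigma_true**2`, illustrating the closeness to the classical estimates noted above.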