Bayesian Modelling by Method of Normal Regression: A Case of Modelling Gluten Content in terms of Protein Content in a Variety of Wheat

This article concerns Bayesian modelling of the parameters of a simple linear regression with normal errors. It studies the use of non-informative normal priors for the regression parameters, with an application to modelling gluten content in terms of protein content in a variety of wheat. Exact estimates of credible sets of the regression parameters are obtained from real and simulated data using MCMC. The posterior estimates of gluten content in terms of protein content are better under this regression model with a normal non-informative prior.


INTRODUCTION
Models are designed statements for predicting future events, capturing summarized trends and regularities in the observed data. A statistical model is a collection of probabilistic statements that describes and interprets present behaviour or predicts future performance. Statistical models are widely used to describe real-life problems under uncertainty (Ntzoufras, 2009). A statistical model consists of three components: the response variable Y, the explanatory variables X, and a linking mechanism between the two sets of variables. The response variable Y is the stochastic part of the model because the outcome is uncertain before it is observed. In the modelling procedure we direct our interest to certain outcomes of Y and to predicting a future outcome of Y. Since Y is a stochastic variable, Y | X1, X2, …, Xp ~ D(θ), where D(θ) is a probability distribution with parameter θ. The advantage of models is that they force us to arrange and organize all available information in a logical way, which helps to define precisely the problem under study and facilitates the exchange of knowledge. Models may be used for prediction when verified and validated, which may require data from both observation and experiments. To describe significant dependencies among variables, dependency modelling is used, whereas to describe the causal relations between determinant factors and performance measures, causation models are used (Fayyad et al., 1996).

MODELS AND METHODS

Modelling in Bayesian Paradigm
If the underlying processes are not well understood, models are designed based only on the observed data. Alternatively, models are constructed with existing expertise, by beginning with a flexible model specified by a set of parameters and combining it with the statistical model of the generated data set. The former is the modelling technique of the standard classical approach and the latter is the Bayesian modelling approach (O'Hagan, 1995). Bayesian modelling is the method of parametric modelling of data with prior information. The strength of the Bayesian approach is that it can make use of information that might not pertain exactly to the issue at hand. The information can be weighted according to relevance or quality, and sensitivity analysis can be used to assess the priority to be given to collecting more directly relevant data. Bayesian variants of Monte Carlo integration procedures have been devised to address these objections using Gaussian process models (Rasmussen & Ghahramani, 2003). The distribution of the observed data given the parameters is called the likelihood of the model and contains the available information provided by the observed sample. Models are usually constructed in order to assess or interpret the causal relationship between the response variable Y and various characteristics expressed as variables Xj, j ∈ υ, called explanatory variables; j indicates a model term (or covariate) and υ the set of all terms under consideration. The explanatory variables are linked with the response variable via a deterministic function, and a part of the original parameter vector is substituted by an alternative set of parameters (denoted by β) that usually summarizes the effect of each covariate on the response variable. In Bayesian model selection, we calculate the posterior distribution over a set of models given a priori knowledge and some new observations (data). The knowledge is represented in the form of a prior over model structures P(M) and their parameters P(θ|M), which define the probabilistic dependencies
between the variables in the model (Beal, 2003). By Bayes' rule, the posterior over models M after observing data y is given by:

P(M | y) = P(y | M) P(M) / P(y).

The process of assembling information into a Bayesian model is a multi-stage one, using data and information of many types. It is important to note that, even though these models provide a structure into which the available data can be incorporated and expert opinion can be used where there are no data, this does not mean that the models are a substitute for experimental data. The greatest advantage of Bayesian models is that they can be used to facilitate decision analysis despite inadequate data; this is especially important, as some types of data are not likely to be readily collected at all (Beal, 2003).
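As a minimal numerical sketch of this rule (the marginal likelihood values and the uniform prior below are illustrative assumptions, not quantities from this study):

```python
import numpy as np

# Hypothetical marginal likelihoods P(y|M) for three candidate models,
# and a uniform prior P(M) over the models (illustrative values only).
marginal_lik = np.array([0.02, 0.10, 0.05])
prior = np.array([1 / 3, 1 / 3, 1 / 3])

# Bayes' rule: P(M|y) = P(y|M) P(M) / sum over models of P(y|M') P(M')
unnormalized = marginal_lik * prior
posterior = unnormalized / unnormalized.sum()

print(posterior)  # the model with the largest marginal likelihood dominates
```

With a uniform prior, the posterior model probabilities are simply the normalized marginal likelihoods.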
Bayesian analysis of regression was first presented in the landmark paper by Lindley and Smith (1972).

Normal Regression Model
By regression we mean a statistical method used to model the relationship of one or more dependent (or response, or outcome) variables to one or more independent (or explanatory) variables. Regression analysis is used to analyze correlation, predict the values of the dependent variable(s) given the independent variable(s), infer cause-and-effect relationships, and estimate systematic relationships while filtering out noise.
In normal regression models the response variable Y is considered to be a continuous random variable following the normal distribution with parameters µ (mean) and σ² (variance). The normal regression model is summarized as:

Y | X1, …, Xp ~ N(µ, σ²), where µ = β0 + β1X1 + … + βpXp.

An alternative formulation of the regression model represents the response variable directly as a function of the explanatory variables plus a random normal error with mean 0 and variance σ²:

Y = β0 + β1X1 + … + βpXp + ε, ε ~ N(0, σ²).
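The equivalence of the two formulations can be seen by simulating from the model (all parameter values and the range of x below are illustrative assumptions, not estimates from the wheat data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative parameter values for a single-predictor normal regression
beta0, beta1, sigma = 0.3, 0.4, 0.5
n = 200
x = rng.uniform(8, 14, size=n)  # e.g. hypothetical protein percentages

# Formulation 1: Y ~ N(mu, sigma^2) with mu = beta0 + beta1 * x
# Formulation 2: Y = beta0 + beta1 * x + eps, eps ~ N(0, sigma^2)
mu = beta0 + beta1 * x
y = mu + rng.normal(0.0, sigma, size=n)

print(y.mean(), mu.mean())  # the two sample means agree up to noise
```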

Likelihood Specification in Normal Regression Model
To simplify computational notation, we denote the response variable given the explanatory variables simply by Y. Let x_i1, …, x_ip be the values of the explanatory variables X1, …, Xp for observation i, with sample size n and corresponding response values y1, …, yn; then the model is expressed as

y_i ~ N(µ_i, σ²), µ_i = β0 + β1 x_i1 + … + βp x_ip, i = 1, …, n,

with likelihood

f(y | β, σ²) = ∏ (2πσ²)^(−1/2) exp{ −(y_i − µ_i)² / (2σ²) }, the product taken over i = 1, …, n.
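A small sketch of this likelihood computation in log form (the design matrix and parameter values below are illustrative, and the function name is ours):

```python
import numpy as np

def normal_regression_loglik(y, X, beta, sigma2):
    """Log-likelihood of the normal regression model.

    y: (n,) responses; X: (n, p+1) design matrix with an intercept column;
    beta: (p+1,) coefficients; sigma2: error variance.
    """
    n = len(y)
    resid = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - (resid @ resid) / (2 * sigma2)

# Toy check on simulated data: the true coefficients should score higher
# than an arbitrary wrong guess.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(0, 0.5, size=50)
ll_true = normal_regression_loglik(y, X, beta_true, 0.25)
ll_wrong = normal_regression_loglik(y, X, np.array([0.0, 0.0]), 0.25)
print(ll_true, ll_wrong)
```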

Independent Prior Specification:
To make inferences about the regression coefficients, we obviously need to choose a prior distribution for (β, σ²). The basic way of assigning priors to the parameters in the normal regression model is the use of independent distributions.
The computational software for Bayesian analysis, WinBUGS, prefers to use the precision τ = 1/σ² instead of the variance σ². So the independent prior specification is expressed as

βj ~ N(µ_βj, c²_j) independently for each j, and τ ~ Gamma(a, b).

Conjugate Prior Specification:
For the normal regression model, a normal distribution is assigned as the conjugate prior for β | σ², and an inverse gamma distribution for σ². The joint prior then follows a normal-inverse-gamma distribution. We symbolize it as

β | σ² ~ N(µ_β, c²σ²V), σ² ~ IG(a, b),

where c² is a parameter controlling the overall magnitude of the prior variance (Zellner, 1986); the default choice is c² = n (Kass & Wasserman, 1995).
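A sketch of the resulting conjugate update, using standard normal-inverse-gamma algebra (the function name, prior settings, and data below are illustrative assumptions):

```python
import numpy as np

def nig_posterior(X, y, m0, V0, a0, b0):
    """Conjugate normal-inverse-gamma update for normal regression.

    Prior: beta | sigma^2 ~ N(m0, sigma^2 * V0), sigma^2 ~ IG(a0, b0).
    Returns the posterior (mn, Vn, an, bn) in the same parameterization.
    """
    V0_inv = np.linalg.inv(V0)
    Vn = np.linalg.inv(V0_inv + X.T @ X)
    mn = Vn @ (V0_inv @ m0 + X.T @ y)
    an = a0 + len(y) / 2.0
    bn = b0 + 0.5 * (y @ y + m0 @ V0_inv @ m0 - mn @ np.linalg.inv(Vn) @ mn)
    return mn, Vn, an, bn

# With a very diffuse prior, the posterior mean approaches the OLS estimate.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([0.3, 0.4]) + rng.normal(0, 0.5, size=100)
mn, *_ = nig_posterior(X, y, np.zeros(2), np.eye(2) * 1e6, 0.01, 0.01)
ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(mn, ols)
```

The final comparison illustrates why a non-informative (very diffuse) prior reproduces the classical estimates, as reported later for the wheat data.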

Posterior Updating
The object of statistical inference is the posterior distribution of the parameters β0, …, βk and σ². By Bayes' rule, this is simply

p(β0, …, βk, σ² | y) ∝ f(y | β0, …, βk, σ²) p(β0, …, βk, σ²).

The case of simple linear regression is developed next.

Simple Linear Regression with Normal Errors
Simple linear regression is the statistical method used to model the relationship of one dependent (or response, or outcome) variable to one independent (or explanatory) variable. In simple linear regression, we assume that the mean of the dependent variable Y is linearly related to the independent variable X.
For simplicity this model is written as Y_i = α + βx_i + ε_i, ε_i ~ N(0, σ²), with the common notation α = β0 and β = β1.
For simple linear regression, the data consist of a sample of (x_i, y_i) pairs, and we infer whether or not the distribution of Y depends on X, estimate the coefficients of the relationship between Y and X, find credible intervals for the slope and intercept, evaluate how much of the variability in Y is explained by X, predict a not-yet-observed Y for a given X value, and evaluate the adequacy of the model.

The likelihood function is given by

f(y | α, β, σ²) = (2πσ²)^(−n/2) exp{ −Σ (y_i − α − βx_i)² / (2σ²) }.

The standard estimates for α and β are

b = Sxy / Sxx and a = ȳ − b x̄,

where Sxy = Σ(x_i − x̄)(y_i − ȳ) and Sxx = Σ(x_i − x̄)².
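These standard estimates can be computed directly from the summary quantities (the data values below are illustrative; the result is checked against numpy's least-squares fit):

```python
import numpy as np

def ols_simple(x, y):
    """Standard estimates for the simple linear regression y = a + b*x."""
    xbar, ybar = x.mean(), y.mean()
    Sxx = ((x - xbar) ** 2).sum()
    Sxy = ((x - xbar) * (y - ybar)).sum()
    b = Sxy / Sxx          # slope estimate
    a = ybar - b * xbar    # intercept estimate
    return a, b

# Illustrative check against numpy's degree-1 polynomial (least squares) fit
x = np.array([10.0, 11.0, 12.0, 12.5, 13.0, 14.0])
y = np.array([4.1, 4.4, 4.9, 5.0, 5.3, 5.6])
a, b = ols_simple(x, y)
slope, intercept = np.polyfit(x, y, 1)
print(a, b)
```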

Bayesian Estimation of the Parameters in Simple Linear Regression
Let us re-parameterize the model as Y_i = η + β(x_i − x̄) + ε_i, where η = α + β x̄; β is called the slope of the regression line (simply, the regression coefficient) and η is sometimes called the intercept, although this term is usually used for α.
The likelihood function for the re-parameterized form is

f(y | η, β, φ) ∝ φ^(−n/2) exp{ −[SSE + n(h − η)² + Sxx(b − β)²] / (2φ) },

where φ denotes the error variance. The regression estimates in the re-parameterized model are h = ȳ and b = Sxy / Sxx, with SSE = Σ(y_i − h − b(x_i − x̄))². By assigning the reference prior g(η, β, φ) = 1/φ, the posterior is obtained as prior times likelihood:

g(η, β, φ | y) ∝ φ^(−(n/2 + 1)) exp{ −[SSE + n(h − η)² + Sxx(b − β)²] / (2φ) }.

The first term in the exponential does not involve η and β, and the second part is proportional to a bivariate normal density.
Conditional on the variance φ, the posterior distribution of η and β is bivariate normal. The mean of η is h and its variance is φ/n; the mean of β is b and its variance is φ/Sxx. Conditional on φ, η and β are independent, but the untransformed intercept α is not independent of β.
Turning to the posterior distribution of φ, we find that SSE/φ has a chi-squared distribution with n − 2 degrees of freedom, so φ has a scaled inverse chi-squared posterior. Marginalizing out φ, the marginal posterior distributions of η and β are Student-t with n − 2 degrees of freedom, centred at h and b with scale parameters s²/n and s²/Sxx respectively, where s² = SSE/(n − 2). If we marginalize out φ, η and β are no longer independent.
The theory behind these distributions can be found in Draper and Smith (1981) and Lee (1997).
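Because the posterior factorizes as above, exact posterior draws can be generated without MCMC: draw φ from its scaled inverse chi-squared posterior, then η and β from their conditional normals (a sketch under the reference prior; the function name and data are illustrative):

```python
import numpy as np

def posterior_draws(x, y, ndraws=10000, seed=0):
    """Exact draws from the posterior of (eta, beta, phi) under the
    reference prior g(eta, beta, phi) = 1/phi, phi being the error variance."""
    rng = np.random.default_rng(seed)
    n = len(y)
    xbar = x.mean()
    Sxx = ((x - xbar) ** 2).sum()
    b = ((x - xbar) * (y - y.mean())).sum() / Sxx
    h = y.mean()  # estimate of the centred intercept eta
    sse = ((y - h - b * (x - xbar)) ** 2).sum()

    phi = sse / rng.chisquare(n - 2, size=ndraws)   # SSE/phi ~ chi^2_{n-2}
    eta = rng.normal(h, np.sqrt(phi / n))           # eta  | phi ~ N(h, phi/n)
    beta = rng.normal(b, np.sqrt(phi / Sxx))        # beta | phi ~ N(b, phi/Sxx)
    return eta, beta, phi

# Illustrative data: the Monte Carlo mean of beta matches the estimate b
x = np.array([10.0, 11.0, 12.0, 12.5, 13.0, 14.0, 9.5, 10.5])
y = np.array([4.1, 4.4, 4.9, 5.0, 5.3, 5.6, 3.9, 4.3])
eta, beta, phi = posterior_draws(x, y)
print(beta.mean())  # close to the least-squares slope b
```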

Sample and Data
Independent samples of a variety of wheat were collected to study the relationship between the percentage of protein and gluten content (Khatiwada, 2011). Protein content is vital for the baking quality of wheat flour, gluten being the main structure-forming protein. Sahin and Sumnu (2006) explain that 'proteins are surface active compounds, comparable with low molecular weight emulsifiers (surfactants), result in lowering of interfacial tension of fluid interfaces, emulsify an oil phase in water and stabilize the emulsion'. Protein helps to increase the flavor and shelf life of the product and to make the product soft.
The summary statistics for the percentage of protein content and gluten content obtained from 20 samples are given in Table 1.

Summaries of the OLS Method
The correlation coefficient between the percentage of protein and the gluten content (0.935) is significant (p < 0.001) with SE = 0.337. By the ordinary least squares (OLS) method the regression coefficients are obtained as α = 0.29 and β = 0.38, with SE(α) = 0.458 and SE(β) = 0.034 respectively. The values of R-square and adjusted R-square are 0.875 and 0.868 respectively. The classical simple regression line of the percentage of gluten content on protein content by the OLS method is Y = 0.29 + 0.38X.
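As a quick consistency check on these reported summaries (in simple linear regression, R-square equals the squared correlation coefficient):

```python
# Reported values: r = 0.935 and R-square = 0.875.
# In simple linear regression R^2 = r^2, so the two should agree.
r = 0.935
r_squared_reported = 0.875
print(round(r ** 2, 3))  # 0.874, matching the reported 0.875 up to rounding
```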

Bayesian Regression Summaries from Real Data
The following summaries were obtained by applying the Bayesian method of regression to the real data on the percentage of protein and gluten content.
Posterior distributions of the parameters (α and β), conditional on φ: the posterior means of the slope and intercept are the same as the OLS estimates.
Posterior distributions of the parameters (α and β), unconditional on φ: the posterior of α has a t distribution with 18 df, center 5.36, and the standard error reported in Table 2. The posterior summaries of the variance (φ) are obtained from the inverted chi-square distribution with n − 2 degrees of freedom, from which the 50% HDR for the variance (φ) is computed.

Results from the Simulated Data
For model assessment, posterior estimates were generated using MCMC via WinBUGS. MCMC offers a way of using numerical methods to sum over the uncertainty about the parameters in the model, in order to summarize the marginal distributions even in the absence of an accessible analytic solution. Non-informative normal priors were used for the regression parameters and a gamma prior for the precision parameter to update the normal regression model for the protein and gluten content data. 5000 iterations were performed to examine the convergence of the model, and the results were taken after discarding the initial 500 preliminary iterations. The likelihood of the percentage of gluten content in a wheat variety is then obtained as Yi ~ N(µi, σ²), where µi = α + βxi (xi is the proportion or percentage of protein content). The non-informative priors were taken as α ~ N(0, 1000), β ~ N(0, 1000) and τ ~ Gamma(0.1, 0.1). The summaries of the posterior densities of the parameters [intercept (α), regression coefficient (β), precision (τ) and the predicted values of the means (µ) of the response variable Y], obtained from simulated data using MCMC via WinBUGS, are given in Table 3.
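The model just described can be sketched as a Gibbs sampler in Python (a hand-rolled analogue of the WinBUGS specification, not the original code; the synthetic data and burn-in settings are illustrative assumptions):

```python
import numpy as np

def gibbs_normal_regression(x, y, iters=5000, burn=500, seed=0):
    """Gibbs sampler for y_i ~ N(alpha + beta*x_i, 1/tau) with
    alpha, beta ~ N(0, 1000) and tau ~ Gamma(0.1, 0.1), mirroring
    the non-informative WinBUGS specification."""
    rng = np.random.default_rng(seed)
    n = len(y)
    alpha, beta, tau = 0.0, 0.0, 1.0
    prior_prec = 1.0 / 1000.0  # prior precision of alpha and beta
    draws = []
    for it in range(iters):
        # alpha | beta, tau: normal full conditional
        prec_a = n * tau + prior_prec
        mean_a = tau * np.sum(y - beta * x) / prec_a
        alpha = rng.normal(mean_a, 1.0 / np.sqrt(prec_a))
        # beta | alpha, tau: normal full conditional
        prec_b = tau * np.sum(x ** 2) + prior_prec
        mean_b = tau * np.sum(x * (y - alpha)) / prec_b
        beta = rng.normal(mean_b, 1.0 / np.sqrt(prec_b))
        # tau | alpha, beta: gamma full conditional (numpy uses scale = 1/rate)
        resid = y - alpha - beta * x
        tau = rng.gamma(0.1 + n / 2.0, 1.0 / (0.1 + 0.5 * resid @ resid))
        if it >= burn:
            draws.append((alpha, beta, tau))
    return np.array(draws)

# Synthetic data with a known slope, to check that the sampler recovers it
rng = np.random.default_rng(7)
x = rng.uniform(8, 14, size=100)
y = 0.3 + 0.4 * x + rng.normal(0, 0.5, size=100)
draws = gibbs_normal_regression(x, y)
print(draws[:, 1].mean())  # posterior mean of beta, near the true 0.4
```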
Based on the posterior distributions of the parameter values, the fitted model is obtained. The density plots and trace plots of alpha and beta, drawn using the software WinBUGS, are presented in Figures 1 and 2. The density plots (Figure 1) show that the posterior values of alpha and beta fit the normal distribution well. The trace plots (Figure 2) show that the convergence of the model is satisfactory. Figure 3 depicts the box plots of the average predicted values (µi) of the response variable yi, and the scatter plot with the fitted line for µi is given in Figure 4.
The Bayesian versions of the MSE and R-square were obtained as 0.3419 and 0.903 respectively for the simulated data. The simulated data yield a large R² value and a small mean squared error of the estimates. This model gives predicted values very close to those obtained in classical regression, because of the use of a non-informative normal prior with large variance. However, this model has a heavy-tailed distribution, with large standard deviations of the estimates.

The estimates a and b are often called ordinary least squares (OLS) estimates because they minimize the sum of squared deviations from the regression line; a and b are also the maximum likelihood estimates of α and β if the error term is normally distributed. Assuming the reference prior for (α, β, φ), a and b are the posterior expected values of α and β if n > 2.

Fig. 1. Posterior density plots of alpha and beta from the simulated data using MCMC

Table 1. Summary statistics of the percentage of protein and gluten content

Table 2. Posterior summaries of the HDR and CI of the regression parameters. Columns: Parameter, Mean, SE, 50% HDR (given t(0.75, 18) = 0.688), 95% CI (given t(0.95, 18) = 2.101).
The highest density regions (HDR) of the parameters α and β obtained from the real data, updated using a non-informative prior, are given in Table 2.