DEALING WITH OUTLIERS AND INFLUENTIAL POINTS WHILE FITTING REGRESSION

Dealing with outliers and influential points while fitting a regression means recognizing them, identifying the reasons for their existence in the process, and employing the best alternatives to lessen their effect on the fitted regression model. Because such observations contain important information, this paper, before considering their elimination, discusses why unusual observations (possible outliers) appear in the process and how to analyze them to determine whether they are real outliers. Once observations were detected as outliers or influential points, analytic procedures (leverage values, studentized residuals and Cook's distance) were employed to investigate and eliminate their effect on the fitted model, in order to optimize a multiple regression model for rice production forecasting in Nepal. The model was fitted with 35 years (1961-1995) of time series data collected from the Ministry of Agriculture and Cooperatives, the Food and Agriculture Organization Statistics Database, the International Rice Research Institute and the Department of Hydrology and Meteorology, and in its final form consisted of three predictors: price at harvest, rural population and area harvested.


INTRODUCTION
In statistics, an outlier is an observation that is numerically distant from the rest of the sample in which it occurs. Rahaman et al. (2012), Stevens (1983) and Sweet & Martin (2012) have consistently defined outliers as the points in a data set that are very different from the other points. Also, Jarrell (1994), Rasmussen (1988) and Stevens (1984) (as cited in Osborne & Overbay, 2004) have mentioned that an outlier is generally considered to be a data point that is far outside the norm for a variable. However, a slightly different definition of outliers, given by Dixon (1950) and Waine (1976), is "values that are dubious in the eyes of the researcher."
Regression analysis is sensitive to outliers and influential points. According to Osborne & Overbay (2004), their harmful effects include increasing the error variance, which reduces the power of statistical tests, and contributing to violations of the model assumptions, which ultimately results in biased estimates. To fit an unbiased regression model, it is therefore essential to understand the nature of outliers, how to detect them, and how to deal with them correctly. Despite the severity of outliers in regression model fitting, there is a great deal of debate as to what to do with identified outliers. The above-mentioned authors nevertheless agree that different approaches are in practice for dealing with outliers and that several indicators are used for identifying and analyzing them. According to Osborne & Overbay (2004), a thorough review of the various arguments about outliers and influential points is almost impossible in a single article, and there are situations where researchers must use their training, intuition, reasoned argument, and thoughtful consideration in making decisions about outliers and influential points.
Sweet & Martin (2012) state that outliers are deleted if (a) they are wrong entries or (b) they are special cases isolated from the common phenomenon under analysis. Otherwise, the analysis is run twice, first with and then without the outlier. If the outlier is not exerting an undue influence on the outcomes, both models should reasonably coincide; otherwise, both should be reported. Stevens (1983) reported four diagnostics that are useful in identifying outliers, namely: studentized residuals [standardizing the deleted residuals produces studentized residuals (studentized residuals, 2017)], the hat elements, Cook's distance [a combination of each observation's leverage and residual values; the higher the leverage and residuals, the higher the Cook's distance (Andale, 2016), where leverage is defined as a measure of how far the independent variable values of an observation are from those of the other observations (Leverage statistics, 2016)], and Mahalanobis distance. If there is an outlier in the data, the preference is to remove its effect rather than to omit the observation.
A data point can be an outlier in several ways: it can have an extreme x value, an extreme y value, or extreme x and y values, or it can be distant from the rest of the data even without extreme x and y values. According to Bowerman et al. (2005), an observation may be an outlier with respect to its y value and/or its x value, but an outlier may or may not be influential. The influence of outliers is identified by computing the regression coefficients with and without the outliers. Observations that have a large influence on the estimation results of a regression model are called influential observations. When a data set includes an influential point, one thing to consider is that the point may be bad data, for example a measurement error, so the validity of the data point needs to be checked. Andersen (2012) and Jacoby (2005) have described how outlying observations can cause patterns in plots to be misinterpreted. More importantly, according to these authors, separated points can have a strong influence on statistical models; for example, unusual cases can substantially influence the fit of the Ordinary Least Squares (OLS) model. Therefore, deleting outliers from a regression model can sometimes give completely different results. Cases that are both outliers and high leverage exert influence on both the slopes and the intercept of the model. Outliers may also indicate that the model fails to capture important characteristics of the data. Bowerman et al. (2005) have recommended dealing first with outliers with respect to their y values, explaining that these could affect the overall fit of the model; according to them, if this is done first, other problems become much less important or disappear.
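The three diagnostics named above (leverage, studentized residuals and Cook's distance) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the study itself used Minitab, the data below are synthetic rather than the rice production series, the function name regression_diagnostics is hypothetical, and the externally studentized (deleted) residual is assumed to be the variant meant in the text.

```python
import numpy as np

def regression_diagnostics(X, y):
    """Leverage, externally studentized residuals and Cook's distance for OLS."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])      # design matrix with intercept column
    p = A.shape[1]                            # number of parameters, k + 1
    Q, _ = np.linalg.qr(A)                    # reduced QR; rows of Q give the hat diagonal
    h = np.sum(Q ** 2, axis=1)                # leverage values h_ii
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    e = y - A @ beta                          # ordinary residuals
    sse = e @ e
    mse = sse / (n - p)
    # Residual variance estimated with observation i left out
    s2_deleted = (sse - e ** 2 / (1 - h)) / (n - p - 1)
    t = e / np.sqrt(s2_deleted * (1 - h))     # externally studentized residuals
    cooks = e ** 2 * h / (p * mse * (1 - h) ** 2)
    return h, t, cooks

# Synthetic illustration only (NOT the data of this study): 35 cases, 3 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(35, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=35)
X[30] = [6.0, 6.0, 6.0]                       # plant one case with extreme x values

h, t, cooks = regression_diagnostics(X, y)
```

The leverage values sum to the number of fitted parameters (k + 1), which is a convenient sanity check on any implementation of the hat elements.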

MATERIALS AND METHODS
Data (Appendix A) for the study were collected from MOAC (2010), FAOSTAT (2013), IRRI (2014) and DHM (2014). Osborne & Overbay (2004) claim that if the studentized residual of an observation is greater than 2 in absolute value, the observation is an outlier with respect to its y value. If the leverage value for an observation is greater than 2(k + 1)/n [where k = number of independent variables and n = number of observations], the observation is outlying with respect to its x value. Moreover, a rough rule of thumb for determining whether an observation is influential is to calculate Cook's distance measure, written as D_i. If the Cook's distances for the outliers satisfy D_i > 1, then these outliers are influential points.
Once such cases are identified, and if there is reason to believe that they arise from a process different from that for the rest of the data (for example, the failure of a measuring instrument), the cases should be deleted. If none of these appears to be the case, two analyses, one with the influential cases included and one with these cases deleted, could be reported to emphasize the impact of these few points on the analysis. This is a case where researchers must use their training, intuition, reasoned argument, and thoughtful consideration in making decisions (Osborne & Overbay, 2004). During the analysis in Minitab, the following were observed to be unusual observations (Table 1).
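The screening rules of this section can be written as a small helper. The sketch below assumes the leverage cutoff 2(k + 1)/n, which for k = 3 predictors and n = 35 observations gives about 0.23, the threshold used in the results; the function names leverage_cutoff and classify are hypothetical, and the example values are those reported for observation 31.

```python
def leverage_cutoff(k, n):
    """Rule-of-thumb leverage cutoff 2(k + 1)/n (assumed form of the rule)."""
    return 2 * (k + 1) / n

def classify(t, h, d, k, n, t_cut=2.0, d_cut=1.0):
    """Apply the three rules of thumb to one observation.

    t: studentized residual, h: leverage value, d: Cook's distance.
    """
    return {
        "outlier_in_y": abs(t) > t_cut,            # |studentized residual| > 2
        "outlier_in_x": h > leverage_cutoff(k, n), # leverage above 2(k + 1)/n
        "influential": d > d_cut,                  # Cook's distance > 1
    }

# Model of this study: k = 3 predictors, n = 35 yearly observations
cutoff = leverage_cutoff(k=3, n=35)                # 8/35, about 0.23

# Values reported for observation 31 in the results
obs31 = classify(t=1.68, h=0.517, d=0.76, k=3, n=35)
```

Values that sit on or very near a cutoff, such as a studentized residual of 2.04, are borderline under a strict comparison, so the flags are best treated as prompts for closer inspection and not as verdicts.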

RESULTS AND DISCUSSION
Observations suspected to be outliers or influential points were first checked for possible recording errors. This was not the case; they were found to be correctly recorded. Therefore, other criteria were sought to identify whether they were outliers or influential points. Table 2 shows that observation 21 was not outlying with respect to its y value [studentized residual (2.04) > 2, but not significantly greater], nor was it outlying with respect to its x value [leverage value (0.23), not greater than the threshold value]. For the same observation, Cook's distance (0.31) < 1 showed that it was not influential.
Similarly, the figures in the same table (Table 1) show that observation 31 was not outlying with respect to its y value [studentized residual (1.68) < 2], but clearly was an outlier with respect to its x value [leverage value (0.517) > 0.23]. However, Cook's distance (0.76) < 1 showed that observation 31 was also not an influential point.
From the above we concluded that neither of the suspected observations was an influential outlier. However, keeping in view that the mentioned criteria can at times (as discussed in the theoretical section) give misleading results, models with and without the suspected observations were computed (Table 3) and cross-checked. et al., 1998). Table 3 shows that none of the options, deleting observation 21 or 31 one at a time or deleting both together, gave a better model. For instance, autocorrelation, lack of fit and multicollinearity were rather worse in the newer models, in which the suspected observations were deleted, than in the original model. Hence the model was kept as it was before the suspected values were checked.
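The with-and-without comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the Minitab analysis of this study: the data are synthetic, the function names fit_ols and compare_deletions are hypothetical, and only the coefficients, R-squared and the Durbin-Watson statistic are computed (lack-of-fit and multicollinearity checks are omitted).

```python
import numpy as np

def fit_ols(X, y):
    """OLS fit with intercept; returns coefficients, R-squared and Durbin-Watson."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    # Durbin-Watson treats the remaining rows as consecutive in time
    dw = np.sum(np.diff(resid) ** 2) / (resid @ resid)
    return {"beta": beta, "r2": r2, "durbin_watson": dw}

def compare_deletions(X, y, suspects):
    """Refit with no deletion, each suspect deleted in turn, and all deleted."""
    results = {"none deleted": fit_ols(X, y)}
    drop_sets = [[s] for s in suspects]
    if len(suspects) > 1:
        drop_sets.append(list(suspects))
    for drop in drop_sets:
        keep = np.setdiff1d(np.arange(len(y)), drop)   # row indices retained
        results[f"deleted {drop}"] = fit_ols(X[keep], y[keep])
    return results

# Synthetic illustration only (NOT the data of this study)
rng = np.random.default_rng(1)
X = rng.normal(size=(35, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=35)

# Observations 21 and 31 correspond to zero-based row indices 20 and 30
results = compare_deletions(X, y, suspects=[20, 30])
```

If the coefficients and fit statistics barely move across the four fits, the suspected observations are not driving the model, which is the situation reported for the rice production model.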

CONCLUSION AND RECOMMENDATIONS
Outliers are values that lie very far from the middle of the distribution in either direction; they take extreme values compared with most of the observations in a data set. Outliers and influential points can have a significant impact on the results of any analysis, although outliers are not necessarily influential in affecting the regression coefficients. They are to be dealt with utmost care: if influential points are removed, this may lead to a different model. Outliers can be associated with each other; in the presence of the first outlier, the second might not act as an outlier. The occurrence of outliers may be by chance, and if it is, they are discarded. In this paper, outliers were checked with studentized residuals, leverage values and Cook's distance, together with the deletion approach. We have demonstrated through examples that outliers and influential points can be dealt with carefully. This procedure can be applied when similar situations arise.
Rahaman et al. (2012) claim that different computer-based approaches (distribution-based, distance-based, density-based and deviation-based) have been proposed for detecting outlying data.