Stats Tutorial - Dealing with Outliers:


When you take a set of linear measurements, it is all too common for one or more data points to lie "far away" from the expected value. Such a data point is called an outlier and, if it is due to an erroneous measurement, it can easily skew your regression line. However, an outlier could also reveal information about an incomplete regression, or the need for a more complex regression model.

This is why each calibration point should ideally be obtained at least in triplicate; you can then see if any of the replicates is an outlier by applying the Q-test directly to the replicate data. Alternatively, you can use the mean and standard deviation for each calibration solution in a technique called weighted least-squares analysis. A third approach is to use robust regression methods. Both weighted and robust regression are beyond the scope of this tutorial; those interested should consult one of the chemometrics books suggested in the bibliography, or the article by del Rio et al. in Analyst, 2001, 126, 1113-1117.

It is not always practical or possible to take replicates, usually because time is limited. This leaves the problem of how to identify and address potential outliers within your calibration data.

Statistics provides a few tools for dealing with outliers. Some of these methods are only valid for small sample sizes, and none of them is particularly reliable for regression analysis. The key is to be careful: if you have too many outliers in your data, it may be an indication that you should redo your experiment, or choose an alternative experimental method to collect your data.

The method we cover here is the Q-test. The Q-test is a good first try, but it is not designed for regression analysis, since it requires the y-values to be independent. It is still acceptable to use a Q-test in regression analysis, but be aware that it is not intended for this purpose and care should be taken.

Q-Test

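The Q-statistic is calculated with the following formula:

$$Q_{\text{calc}} = \frac{\left| y_{\text{suspect}} - y_{\text{nearest}} \right|}{y_{\text{largest}} - y_{\text{smallest}}}$$

where ysuspect is the suspected outlier, ynearest is the value closest to it, and ylargest and ysmallest are the largest and smallest values in the data set.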

This formula is not well suited to regression data, since in a regression analysis the y-values change with each consecutive x-value. For this reason, we need to normalize the data by using the regression residual (yi - ŷi), where ŷi is the y-value obtained from the calibration curve for xi. You can then determine the Q-statistic for the residual values and compare the result to a tabulated value; such tables are available in most statistics textbooks for the 95% confidence level. When comparing the calculated value to the tabulated value, you reject the outlier when Qcalc > Qn,95%.
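The residual step can be sketched in a few lines of Python. This is only an illustration, assuming an ordinary (unweighted) least-squares line; the function names `fit_line` and `regression_residuals` and the calibration data are hypothetical and do not come from this tutorial.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def regression_residuals(x, y):
    """Residuals y_i - yhat_i, where yhat_i is the fitted value at x_i."""
    slope, intercept = fit_line(x, y)
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]


# Hypothetical calibration data, made up purely for illustration.
x = [0.0, 5.0, 10.0, 15.0, 20.0, 25.0]
y = [0.1, 5.2, 9.8, 15.1, 24.0, 25.0]

residuals = regression_residuals(x, y)
for xi, ri in zip(x, residuals):
    print(f"x = {xi:5.1f}   residual = {ri:6.2f}")
```

The residuals, rather than the raw y-values, are the numbers you would then pass to the Q-test. Note that a strong outlier also pulls the fitted line towards itself, which is one reason the Q-test should be applied to regression data with care.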

To illustrate how to deal with outliers, consider a calibration graph in which the value at xi = 20 is possibly an outlier that is skewing the regression line. Can we discard it? The residuals for the eight calibration points are listed below.

Residuals: 0.6, -1.1, -0.2, -1.1, -0.9, 5.6, -1.2, -1.7

In this case, the suspect residual is 5.6, which is also the largest value; ynearest is 0.6 and ysmallest is -1.7. Applying the Q-test, we find Qcalc = 0.684. Referring to a table of Q-values for n = 8 and a 95% confidence level, we find Q8,95% = 0.526. Since Qcalc > Q8,95%, we can reject the outlier. Bear in mind that this is for the 95% confidence level, so there is still 1 chance in 20 that the data point is a real value and should not have been rejected.
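The worked example above can be checked with a short Python sketch. The function name `dixon_q` is a hypothetical helper, not something defined in this tutorial; the residuals and the critical value 0.526 are taken from the text. Because the residuals are listed to only one decimal place, the computed value (5.0/7.3, about 0.685) agrees with the quoted Qcalc only to within rounding.

```python
def dixon_q(values):
    """Return (suspect, q_calc) for the most extreme value in `values`.

    The suspect is whichever end of the sorted data lies farther from
    its nearest neighbour; q_calc is that gap divided by the full range.
    """
    ordered = sorted(values)
    spread = ordered[-1] - ordered[0]
    if len(ordered) < 3 or spread == 0:
        raise ValueError("need at least three values that are not all equal")
    gap_low = ordered[1] - ordered[0]      # gap at the low end
    gap_high = ordered[-1] - ordered[-2]   # gap at the high end
    if gap_high >= gap_low:
        return ordered[-1], gap_high / spread
    return ordered[0], gap_low / spread


# Residuals and critical value as given in the text (n = 8, 95%).
residuals = [0.6, -1.1, -0.2, -1.1, -0.9, 5.6, -1.2, -1.7]
Q_CRIT_N8_95 = 0.526

suspect, q_calc = dixon_q(residuals)
reject = q_calc > Q_CRIT_N8_95
print(f"suspect = {suspect}, Qcalc = {q_calc:.3f}, reject = {reject}")
```

The critical value is passed in explicitly rather than looked up, so the same function can be used with whichever Q-table and confidence level you prefer.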


This concludes the section on linear regression and calibration curves. From this lesson, you should have all the statistical tools you need to create linear regression calibration curves and analyze the errors associated with determining unknown sample concentrations from a measured signal.

The next and final section covers more advanced statistics used for comparing sets of data based on mean and variance, as well as a more detailed look at some of the statistical concepts discussed in earlier sections.

© 2006 Dr. David C. Stone & Jon Ellis, Chemistry, University of Toronto
Last updated: September 26th, 2006