# Stats Tutorial - The Correlation Coefficient:

Introduction

Basics of Excel™

Basic Statistics

Linear Regression:

Data Evaluation & Comparison

In the first part of this tutorial, we saw how to use the trendline feature in Excel to fit a straight line through calibration data and obtain both the equation of the best-fit straight line and the correlation coefficient, R (sometimes written as R2).

There are in fact various correlation coefficients, but the one we are interested in here is the Pearson or product-moment correlation coefficient. The Pearson R value provides a measure of the degree to which the values of x and y are linearly correlated. We can assess this visually using a scatter plot (Figure 1), in which we also mark the centroid of the data, .

Figure 1 - XY scatter plot showing the centroid of the data

If x and y were linearly correlated, we would expect all the points to fall on a straight line passing through the centroid. As a result, we would expect all x values to be uniformly distributed either side of ; similarly, all the y values should be uniformly distributed about . The Pearson R is calculated using the formula

If x and y are perfectly correlated in a linear fashion, we would expect the value of R to be either +1 or -1, depending on whether y increases (positive slope) or decreases (negative slope) with x.

To demonstrate how to calculate this formula in Excel, we return to our previous example of fluorescence intensity data. Then,

1. Set up a spreadsheet with the xi and yi values in columns
2. In the adjacent cells, set up expressions for , , their squares, and their product. For instance, the formula for may look like `=B3-AVERAGE(B\$2:B\$8)`, depending on the location of your cells in the spreadsheets.
3. Determine the sums of squares and , and the sum of products in Excel and insert these values in the formula for R.
4. To calculate the square root in the denominator, use the `SQRT` function.

The easiest way to calculate R in Excel is by setting up a table to calculate the required values, as shown below. This yields a correlation coefficient R2=0.9978, so the data are well-correlated and the best-fit line describes the data.

The following equation is an alternate and simpler way to calculate the correlation coefficient. It is left as an excercise for the reader to verify and implement this formula.

A few points to mention regarding the correlation coefficient:

• It is essential to retain a large number of significant figures in the numerator and denominator during the calculation, otherwise a misleading value of R may be obtained.
• Even a high R value of, say, 0.9991 does not necessarily indicate that the data fits to a straight line. The trendline should always be plotted and inspected visually. R2 is more discriminating in this respect, although it no longer indicates the slope of the regression line. This, however, is evident by inspection.
• Any curvature in the data will result in erroneous conclusions about the correlation. R values are only applicable to linear correlations. Nonlinear correlations are possible, but involve a different measure than R, and R values will not necessarily be close to 1.
• The statistical significance of R depends on the number of samples in the data set n.

We have seen how to calculate the correlation coefficient using Excel. On the next page, we discuss the use of linear portions of non-linear curves to estimate a calibration equation.

© 2006 Dr. David C. Stone & Jon Ellis, Chemistry, University of Toronto
Last updated: September 26th, 2006