Trusting Numbers Instead of Your Eyes

Pitfall Information

Cons:
False sense of security
r-square doesn't tell you everything
Graphs convey more information - "A picture is worth a thousand words"

How to identify the problem:
Graphical inspection, along with the ANOVA table


Easily Avoided

This is by far the most easily prevented pitfall in curve fitting, yet many people still run into it. The most common cause is an overemphasis on a high r-square value, which lures people into a false sense of security. Although the r-square is a good place to start when analyzing a curve fit, it is not a good place to stop.

There are many situations where a high r-square can be deceptive, as shown in the discussion of the pitfalls surrounding Polynomial and Rational equations.

The phrase “A picture is worth a thousand words” best describes how to evaluate a curve fit. Although the numerical information about a curve fit is very useful, you cannot rely solely on those results. As we have seen in these examples, a graph of the curve fit, along with its confidence and prediction intervals, will reveal more information than any set of numbers ever could.
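As a concrete illustration of that kind of graphical inspection, here is a minimal Python sketch, assuming numpy, scipy, and matplotlib and a small made-up data set (none of this comes from the original examples). It fits a straight line and plots the fit together with its 95% confidence and prediction intervals.

```python
# Minimal sketch: plot a straight-line fit with 95% confidence and
# prediction intervals.  The data below is invented for illustration only.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.4, 13.9, 16.1, 18.2, 19.7])

n = x.size
slope, intercept = np.polyfit(x, y, 1)        # least-squares straight line
y_hat = intercept + slope * x

# Residual standard error and the quantities that set the interval widths
sse = np.sum((y - y_hat) ** 2)
s = np.sqrt(sse / (n - 2))                    # 2 fitted parameters
sxx = np.sum((x - x.mean()) ** 2)
t_crit = stats.t.ppf(0.975, df=n - 2)         # 95% two-sided critical value

xg = np.linspace(x.min(), x.max(), 200)
yg = intercept + slope * xg
se_mean = s * np.sqrt(1.0 / n + (xg - x.mean()) ** 2 / sxx)        # confidence band
se_pred = s * np.sqrt(1.0 + 1.0 / n + (xg - x.mean()) ** 2 / sxx)  # prediction band

plt.scatter(x, y, label="data")
plt.plot(xg, yg, label="fitted line")
plt.fill_between(xg, yg - t_crit * se_mean, yg + t_crit * se_mean,
                 alpha=0.3, label="95% confidence interval")
plt.plot(xg, yg - t_crit * se_pred, "--", label="95% prediction interval")
plt.plot(xg, yg + t_crit * se_pred, "--")
plt.legend()
plt.show()
```

The prediction band is always wider than the confidence band, because it reflects the scatter of individual observations rather than just the uncertainty in the fitted mean; seeing both on one plot is exactly the kind of information a lone r-square hides.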



Why The Emphasis On The F-Statistic?

There are a couple of reasons why the F-Statistic is a far better guide than the r-square. This single number actually factors in a great deal of information. Let's look back at the ANOVA tables from the Polynomial and Rational equations pitfall:



Notice that the F-Statistic is actually the final result of many numbers. Think of it as the ratio of the variation explained by the fitted equation (MSR) to the variation not explained by the fitted equation (MSE).

The Mean Square Regression (MSR) takes into account the number of parameters used for the curve fit: MSR = SSR / (p - 1), where SSR is the regression sum of squares and p is the number of fitted parameters. The Mean Square Error (MSE) takes into account both the number of data points and the number of parameters: MSE = SSE / (n - p), where SSE is the residual sum of squares and n is the number of data points.

To calculate the r-square for this fit, you just divide the SSR by the total, SSR + SSE. That's it. Notice that this has nothing to do with the number of data points or parameters? It is also why the 7th order polynomial had a higher r-square. Once the number of parameters was factored in, the 3rd order polynomial did a better job of describing the data from a statistical perspective.
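To make that arithmetic concrete, here is a minimal Python sketch, assuming numpy and a synthetic, roughly linear data set (not the data from the Polynomial and Rational equations pitfall). It computes SSR, SSE, MSR, MSE, the F-Statistic, and the r-square for polynomial fits of increasing order.

```python
# Minimal sketch of the ANOVA arithmetic described above, applied to
# polynomial fits of increasing order on a synthetic, roughly linear data set.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 20)
y = 2.0 + 1.5 * x + rng.normal(0.0, 1.0, x.size)

def anova_summary(order):
    coeffs = np.polyfit(x, y, order)
    y_hat = np.polyval(coeffs, x)
    n = x.size
    p = order + 1                            # parameters, including the intercept
    ssr = np.sum((y_hat - y.mean()) ** 2)    # variation explained by the fit
    sse = np.sum((y - y_hat) ** 2)           # variation left unexplained
    msr = ssr / (p - 1)                      # Mean Square Regression
    mse = sse / (n - p)                      # Mean Square Error
    f_stat = msr / mse
    r_square = ssr / (ssr + sse)             # ignores n and p entirely
    return r_square, f_stat

for order in (1, 3, 7):
    r_square, f_stat = anova_summary(order)
    print(f"order {order}: r-square = {r_square:.4f}, F = {f_stat:.1f}")
```

With data like this, the higher-order fits squeeze out a marginally larger r-square, but the F-Statistic punishes the extra parameters, which is the same conclusion reached with the 3rd and 7th order polynomials above.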