Cons:
False sense of security
r-square doesn't tell you everything
Graphs convey more information - "A picture is worth a thousand words"
How to identify the problem: Graphical inspection, along with the ANOVA table
Easily Avoided
This is by far the most easily prevented pitfall in curve fitting, but most people
still run into it. The most common cause is an overemphasis on a high r-square
value, which lures people into a false sense of security. Although the r-square
is a good place to start analyzing your curve fit, it's not a good ending
point.
A high r-square can be very deceptive in many situations, as shown in the
discussion of the pitfalls surrounding Polynomial and Rational
equations.
The phrase "A picture is worth a thousand words" is the best way to
describe how to evaluate a curve fit. Although the numerical information
about a curve fit is very useful, you can't rely solely on those results.
As we have seen in these examples, a graph of the curve fit with its confidence and
prediction intervals displays more information than any amount of numbers could ever
hope to reveal.
Why The Emphasis On The F-Statistic?
There are a couple of reasons why the F-Statistic is a much more informative measure
than the r-square. The main one is that it factors in a lot more information. Let's look
back at the ANOVA tables from the Polynomial and Rational equations
pitfall:
Notice that the F-Statistic is actually the final result of many numbers. Think
of it as the ratio of the amount of variation that's explained by the fitted equation
(MSR) to the amount of variation that's not explained by the fitted equation
(MSE).
The Mean Square Regression (MSR) takes into account the number of parameters used
for the curve fit. The Mean Square Error (MSE) takes into account both the number
of data points and parameters.
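Written out, the relationship looks like the sketch below. The symbols n (number
of data points) and p (number of fitted parameters, counting the constant term) are
introduced here for illustration, and the degrees of freedom follow one common
convention; the ANOVA table in your software may count them slightly differently.

```latex
\begin{aligned}
\mathrm{MSR} &= \frac{\mathrm{SSR}}{p - 1}, \\
\mathrm{MSE} &= \frac{\mathrm{SSE}}{n - p}, \\
F &= \frac{\mathrm{MSR}}{\mathrm{MSE}}
\end{aligned}
```

Under this convention, adding parameters without a matching gain in SSR lowers the
MSR, which in turn pulls the F-Statistic down.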
To calculate the r-square for this fit, you just divide the SSR by the total of
SSR+SSE. That's it. Notice that this has nothing to do with the number of data points
or parameters? That's also why the 7th order polynomial had a higher r-square: in a
least-squares fit, extra terms can only reduce the SSE, so the r-square never goes
down. Once the number of parameters was factored in, the 3rd order polynomial did a
better job of describing the data from a statistical perspective.
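The comparison can be reproduced with a minimal Python sketch. Everything in it is
an assumption made for illustration: the data are synthetic (a cubic trend plus
random noise, not the data set from the earlier examples), the helper name
fit_stats is hypothetical, and the degrees of freedom are taken as p - 1 for the
regression and n - p for the error, with p counting the constant term.

```python
import numpy as np

def fit_stats(x, y, degree):
    """Least-squares polynomial fit; returns (r_square, F)."""
    n_points = len(x)
    n_params = degree + 1                      # coefficients, including the constant term
    y_hat = np.polyval(np.polyfit(x, y, degree), x)
    ssr = np.sum((y_hat - np.mean(y)) ** 2)    # variation explained by the fit
    sse = np.sum((y - y_hat) ** 2)             # variation left unexplained
    r_square = ssr / (ssr + sse)               # ignores n_points and n_params
    msr = ssr / (n_params - 1)                 # assumed regression degrees of freedom
    mse = sse / (n_points - n_params)          # assumed error degrees of freedom
    return r_square, msr / mse

# Synthetic data (made up for illustration): a cubic trend plus random noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 4.0 * x**3 + rng.normal(0.0, 0.3, x.size)

for degree in (3, 7):
    r_square, f_stat = fit_stats(x, y, degree)
    print(f"order {degree}: r-square = {r_square:.4f}, F = {f_stat:.1f}")
```

On data like this, the 7th order fit cannot score a lower r-square than the 3rd
order fit, yet its F-Statistic should come out lower because the four extra
parameters explain almost nothing beyond noise.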