Comparing RMSE between models

I'm evaluating the accuracy of my model's predictions using the RMSE on a new data set. Now, the RMSE by itself doesn't give any indication of whether the model is good, since there's no threshold that says it's "good". My question is: does it make sense to compute the RMSE of a null model that uses the mean as its predictor, and compare it against my model's RMSE? Or should I instead compare the model's RMSE on the "training" data against its RMSE on the "test" data?

The model I'm currently using is the best, by BIC score, among all the available predictor variables, but I'm trying to figure out how well the model actually performs. I also computed the adjusted R-squared, which says my model explains 20.7% of the variance, but I doubt whether that's a good measure of accuracy.

Your suggestion about using a null model is similar to $R^2$. $R^2$ is defined as $1 - MSE/V$, where $MSE$ is the model's mean squared error and $V$ is the variance of the observed output. You can think of the variance as the mean squared error of a null model that always gives the mean as its predicted output.

Even here, the question is: how much better can you do? This is very hard to answer. The reason is that it's hard to know whether the error reflects variation in the output that's fundamentally unpredictable from the input (e.g. 'noise', but it could be something else), or whether additional structure is present that the model has simply failed to capture. Sometimes looking at the residuals can give a hint.

Under some circumstances, it's possible to estimate the 'noise' level. For example, if you have many repeated trials where the inputs are identical, you can measure the variability of the output for equal inputs. This gives a bound on the maximum possible performance. You would typically encounter this situation in the context of controlled experiments. Or, you may be able to do something similar if you have access to a known 'correct model' (e.g. in a theoretical setting, or if you're modeling a well-understood physical system). Otherwise, it's hard to know whether there's a better model out there.
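A minimal sketch of this relationship in NumPy, using made-up numbers purely for illustration: the null model's MSE (the variance of the output) sits in the denominator of the $MSE/V$ ratio, so comparing your RMSE against the null RMSE is essentially computing $R^2$.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Hypothetical test-set targets and model predictions.
y_test = np.array([3.1, 2.4, 5.0, 4.2, 3.8, 2.9])
y_pred = np.array([3.0, 2.7, 4.6, 4.0, 3.5, 3.2])

# Null model: always predict the mean of the observed output.
null_pred = np.full_like(y_test, y_test.mean())

model_rmse = rmse(y_test, y_pred)
null_rmse = rmse(y_test, null_pred)   # equals sqrt(variance of y_test)

# R^2 = 1 - MSE/V: one minus the ratio of the model's MSE
# to the null model's MSE.
r2 = 1 - (model_rmse / null_rmse) ** 2
print(model_rmse, null_rmse, r2)
```

An $R^2$ near 0 means you're doing no better than always guessing the mean; a negative value means you're doing worse.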

Looking at the training vs. test error can give you some idea of the extent to which your model is overfitting (the expected training error would be lower than the expected test error). There can be variability here when using a small number of samples and/or few repetitions. A gap between training and test error isn't a problem per se, but a large gap might signal a problem. Even so, one model that overfits might still have better generalization performance than another model that doesn't.
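The train/test gap can be illustrated with synthetic data and a deliberately over-flexible model; everything below (the data-generating process, the split, the polynomial degree) is made up for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a linear signal plus noise.
x = rng.uniform(0, 10, 60)
y = 2.0 * x + 1.0 + rng.normal(0, 2.0, 60)

# Simple split into training and test halves.
x_train, x_test = x[:30], x[30:]
y_train, y_test = y[:30], y[30:]

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# An over-flexible model (degree-9 polynomial) fitted on training data only.
coefs = np.polyfit(x_train, y_train, deg=9)
train_rmse = rmse(y_train, np.polyval(coefs, x_train))
test_rmse = rmse(y_test, np.polyval(coefs, x_test))

print(train_rmse, test_rmse)  # the gap hints at overfitting
```

Repeating this over many random splits would show the variability the answer mentions: with few samples, a single split can give a misleading gap.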

Instead of asking how good your model is, you can also ask how bad it is. You could use a significance testing approach to see whether your prediction is better than 'chance'. For example, you might compare the test error on real data to the test error on permuted data (where relationships between the input/output have been destroyed, and any apparent performance is due to sampling variability or overfitting).
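A sketch of that permutation approach, using ordinary least squares on synthetic data (all names and numbers here are assumptions for illustration): fit the model on permuted training targets many times to build a null distribution of test errors, then ask how often a permuted model matches the real one.

```python
import numpy as np

rng = np.random.default_rng(42)

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Synthetic data with a genuine input/output relationship.
x = rng.normal(size=(200, 1))
y = 3.0 * x[:, 0] + rng.normal(scale=1.0, size=200)

x_train, x_test = x[:100], x[100:]
y_train, y_test = y[:100], y[100:]

def fit_and_score(xtr, ytr, xte, yte):
    # Ordinary least squares with an intercept term.
    A = np.column_stack([np.ones(len(xtr)), xtr])
    coef, *_ = np.linalg.lstsq(A, ytr, rcond=None)
    pred = np.column_stack([np.ones(len(xte)), xte]) @ coef
    return rmse(yte, pred)

real_rmse = fit_and_score(x_train, y_train, x_test, y_test)

# Null distribution: shuffle the training targets so the input/output
# relationship is destroyed, then refit and score on the intact test set.
perm_rmses = []
for _ in range(200):
    y_perm = rng.permutation(y_train)
    perm_rmses.append(fit_and_score(x_train, y_perm, x_test, y_test))

# One-sided p-value: fraction of permuted models that do at least as well.
p_value = np.mean(np.array(perm_rmses) <= real_rmse)
print(real_rmse, p_value)
```

A small p-value only says the model beats chance, not that it's close to the best achievable, which is the distinction drawn above between "how good" and "how bad".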