What on earth is going on here? This result arises from a combination of several factors that must always be kept in mind when analyzing data by multivariate analysis or pattern recognition.
1. Overfitting
2. Chance correlation
3. Use of an index parameter
"I would never do such a ridiculous data analysis."
Most researchers think that a case like this could never happen in their own research.
However, if we carefully check the data analysis results published on the Web by well-known national research centers and institutions in Europe and the USA, we notice that operations like these do appear there. Moreover, on those websites, high R and R2 values and large numbers of samples are displayed like decorations, as if they were proof of an excellent data analysis. Of course there is no malice in this; the person in charge often does this kind of thing without being aware of it.
Because R and R2 are the best-known and most important indices of the goodness of a regression result, multiple regression analysis is generally carried out with the aim of achieving high R and R2 values. In general, R and R2 computed on the fitting data improve, even if only slightly, whenever a parameter is added, so most researchers generate various parameters and keep adding them.
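The effect of adding parameters can be checked with a small experiment. The following is only a minimal sketch under stated assumptions: the data are synthetic random numbers, the descriptors are unrelated to the target by construction, and the helper name r_squared, the sample size, and the seed are illustrative choices, not anything taken from a real study.

```python
import numpy as np

def r_squared(X, y):
    """R2 of an ordinary least-squares fit (with intercept), computed on the fitting data."""
    A = np.column_stack([np.ones(len(y)), X])       # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)    # least-squares coefficients
    residual = y - A @ coef
    return 1.0 - residual @ residual / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)                # fixed seed (arbitrary, for repeatability)
n_samples = 20                                # hypothetical number of compounds
y = rng.normal(size=n_samples)                # synthetic "activity" values
noise = rng.normal(size=(n_samples, 15))      # 15 descriptors unrelated to y

# Fit with the first k noise descriptors, k = 1 .. 15
r2_by_count = [r_squared(noise[:, :k], y) for k in range(1, 16)]
for k, r2 in enumerate(r2_by_count, start=1):
    print(f"{k:2d} parameters: R2 = {r2:.3f}")
```

Because each model contains the previous one, the printed R2 never goes down as parameters are added, even though none of them carries any information about y.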
In this case, the generated parameters are at least related to the research subject and the goal of the study; they are not unnatural values like a compound ID. Therefore most researchers do not feel that anything is unusual or abnormal. As a result, R and R2 reach high values, yet good prediction results cannot be obtained.
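The gap between a high R2 on the fitting data and poor prediction can be illustrated in the same way. This is a hedged sketch on synthetic data, not a reproduction of any published analysis: only the first descriptor is assumed to carry signal, and the names fit_ols and r_squared, the sample sizes, and the seed are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients (intercept first) for y ~ X."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def r_squared(coef, X, y):
    """R2 of an already fitted model evaluated on the data (X, y)."""
    A = np.column_stack([np.ones(len(y)), X])
    residual = y - A @ coef
    return 1.0 - residual @ residual / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)                 # fixed seed (arbitrary)
n_fit, n_new, n_params = 25, 200, 20           # few samples, many parameters
X = rng.normal(size=(n_fit + n_new, n_params))
# Assumption: only the first descriptor is related to y; the other 19 are noise
y = X[:, 0] + rng.normal(size=n_fit + n_new)

X_fit, y_fit = X[:n_fit], y[:n_fit]            # samples used to build the model
X_new, y_new = X[n_fit:], y[n_fit:]            # samples the model has never seen

coef = fit_ols(X_fit, y_fit)
r2_fit = r_squared(coef, X_fit, y_fit)
r2_new = r_squared(coef, X_new, y_new)
print(f"R2 on fitting data: {r2_fit:.3f}")
print(f"R2 on new data:     {r2_new:.3f}")
```

With 20 parameters fitted to 25 samples, the model has enough freedom to follow the noise in the fitting data, so R2 on the fitting data overstates how well the model predicts samples it has not seen.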
What is happening in this case? Several factors that must always be kept in mind when analyzing data by multivariate analysis and pattern recognition combine to produce such a result.
1. Overfitting
2. Chance correlation
3. The use of an 'index parameter'
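Chance correlation, the second factor in the list, can be sketched as follows. The example assumes a small synthetic data set and a pool of purely random candidate descriptors; the variable names, the pool sizes, and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)          # fixed seed (arbitrary)
n_samples = 10                          # hypothetical small data set
y = rng.normal(size=n_samples)          # synthetic "activity" values

# 1000 candidate descriptors, all random and unrelated to y by construction
candidates = rng.normal(size=(1000, n_samples))
abs_r = np.abs([np.corrcoef(x, y)[0, 1] for x in candidates])

# Best |r| found when only the first k candidates are screened
best_by_pool = {k: float(abs_r[:k].max()) for k in (1, 10, 100, 1000)}
for k, best in best_by_pool.items():
    print(f"{k:4d} candidates screened: best |r| = {best:.3f}")
```

The larger the pool of candidates that is screened, the higher the best correlation that turns up, although every candidate here is pure noise. Selecting parameters by their correlation with the target from a large pool and a small number of samples therefore invites chance correlation.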
Data analysis must be carried out with careful attention to the various conditions and restrictions that stem from the foundations of the analysis methods. Otherwise, the conclusions derived from the analysis are shaped by meaningless results and end up being ridiculed.
Research fields such as QSAR (Quantitative Structure-Activity Relationships) and chemometrics are built on the assumption that the data analysis applied has been carried out correctly.
Therefore, when the basics of multivariate analysis and pattern recognition are not applied correctly in QSAR and chemometrics, the results of the data analysis, as in the case above, lead only to FAKE conclusions. In QSAR and chemometrics research, it is important to understand the basics of multivariate analysis and pattern recognition. Data analysis methods have several other important limitations as well. These limitations will be explained one by one on this blog in future posts.
Incidentally, QSAR and chemometrics research has limitations and limits of applicability of its own. Such research must also be carried out with these limitations understood.
Author: Kohtaro Yuta (湯田 浩太郎), In Silico Data, Ltd. (株式会社 インシリコデータ)