Our suggestion for February is an article published in BMC Bioinformatics that discusses the differences between supervised and unsupervised methods, as well as their advantages, disadvantages, and selection criteria. As starting points for our discussion, the author states:
“PLS-DA can be thought of as a “supervised” version of Principal Component Analysis (PCA) in the sense that it achieves dimensionality reduction but with full awareness of the class labels. Besides its use for dimensionality-reduction, it can be adapted to be used for feature selection [8] as well as for classification [9–11].”
With that in mind, we ask: is there an adequate workflow for chemometric analysis? Should PCA always be performed before PLS-DA? Furthermore, can PCA be used to build the model used in PLS-DA when there is no prior information about the data? How do you carry out this analysis?
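To ground the discussion, here is a minimal sketch of what the two approaches can look like in practice. This is our own illustration, not the article's code: the toy data, variable names, and the scikit-learn-based dummy-coding of the classes are all assumptions.

```python
# Minimal sketch (illustrative only): unsupervised PCA vs. "supervised" PLS-DA.
# The toy data stands in for a spectral / chemometric matrix X with class labels y.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import LabelBinarizer

X, y = make_classification(n_samples=60, n_features=200, n_informative=10,
                           n_classes=3, random_state=0)

# PCA: dimensionality reduction with no knowledge of the class labels
pca_scores = PCA(n_components=2).fit_transform(X)

# PLS-DA: PLS regression against a dummy-coded (one-hot) class matrix, so the
# latent variables are extracted "with full awareness of the class labels"
lb = LabelBinarizer()
Y = lb.fit_transform(y)                       # n_samples x n_classes dummy matrix
pls = PLSRegression(n_components=2).fit(X, Y)
plsda_scores = pls.transform(X)               # X scores (latent variables) for plotting

# Classification: assign each sample to the class with the largest predicted response
y_pred = lb.classes_[np.argmax(pls.predict(X), axis=1)]
```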
“It is important to note that its role in discriminant analysis can be easily misused and misinterpreted [2, 12]. Since it is prone to overfitting, cross-validation (CV) is an important step in using PLS-DA as a feature selector, classifier or even just for visualization [13, 14].”
From this statement, we raise two questions:
- In the literature, there are different types of CV that can be used to validate a chemometric method (venetian blinds, contiguous blocks, random subsets, leave-one-out). Which attributes of the data influence the selection of the best method?
- Besides CV, other validation approaches are commonly used, such as p-value analysis via a t-test. In this test, values lower than 0.05 indicate that the model is significant at the 95% confidence level. However, just as with CV, different algorithms can be applied, such as the Wilcoxon test, the sign test, and the rand t-test. How do we select the best method?
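One common way to obtain such a p-value in practice is a permutation test on a cross-validated score. The sketch below is our own illustration under stated assumptions (scikit-learn, a thin PLS-DA wrapper, 5-fold CV, 200 permutations, accuracy as the metric); it is not the article's procedure.

```python
# Illustrative permutation test of a cross-validated PLS-DA model (our sketch).
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import StratifiedKFold, permutation_test_score
from sklearn.datasets import make_classification

class PLSDAClassifier(BaseEstimator, ClassifierMixin):
    """Thin PLS-DA wrapper: PLS regression on a dummy-coded class matrix."""
    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X, y):
        self.lb_ = LabelBinarizer()
        Y = self.lb_.fit_transform(y)
        self.classes_ = self.lb_.classes_
        self.pls_ = PLSRegression(n_components=self.n_components).fit(X, Y)
        return self

    def predict(self, X):
        Yhat = self.pls_.predict(X)
        if Yhat.ndim == 1 or Yhat.shape[1] == 1:   # binary case: single dummy column
            return self.classes_[(Yhat.ravel() > 0.5).astype(int)]
        return self.classes_[np.argmax(Yhat, axis=1)]

# Toy data standing in for a real chemometric data set
X, y = make_classification(n_samples=60, n_features=200, n_informative=10,
                           n_classes=3, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
score, perm_scores, p_value = permutation_test_score(
    PLSDAClassifier(n_components=2), X, y, cv=cv,
    n_permutations=200, scoring="accuracy", random_state=0)

print(f"CV accuracy = {score:.2f}, permutation p-value = {p_value:.3f}")
# p < 0.05 means the cross-validated accuracy is unlikely to arise with shuffled
# labels, i.e. the model is significant at the 95% confidence level.
```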
*Due to publication rights, we do not make the article available in PDF format.
What an interesting topic!
Here is a link to the Eigenvector website, which answers some of the questions above:
Types of CV (a small sketch of how each scheme's splits can be generated follows the quote below):
"For the following descriptions, n is the total number of objects in the data set, and s is the number of data splits specified for the cross-validation procedure, which must be less than n/2.
Venetian Blinds: Each test set is determined by selecting every sth object in the data set, starting at objects numbered 1 through s. This method is simple and easy to implement, and generally safe to use if there are relatively many objects that are already in random order. For time-series data, it can be useful for estimating errors in the method from non-temporal sources, provided that a sufficient number of data splits are specified. However, for blocked data with replicates, one must choose parameters very carefully to avoid the replicate sample trap. Similarly, for time-series data, a low number of splits can lead to overly optimistic results.
Contiguous Blocks: Each test set is determined by selecting contiguous blocks of n/s objects in the data set, starting at object number 1. Like Venetian Blinds, this method is simple and easy to implement, and generally safe to use in cases where there are relatively many objects in random order. For data that is not randomly ordered, one must choose parameters carefully to avoid overly pessimistic results from the ill-conditioned trap. For time-series data and batch data, this method can be convenient for assessing the temporal stability and batch-to-batch stability of a model built from the data.
Random Subsets: s different test sets are determined through random selection of n/s objects in the data set, such that no single object is in more than one test set. This procedure is repeated r times, where r is the number of iterations. It is the averaged result of the iterations that is used in the report of the cross-validation results. The random subset selection method is rather versatile, in that it can be used effectively in a wide range of situations, especially if one has the time to run multiple iterations of subset selections. As the number of splits is inversely proportional to the number of test samples per sub-validation, a higher number of splits can be used to avoid the ill-conditioned trap, and a lower number of splits can be used to avoid the replicate sample trap. However, a general disadvantage of this method is that the user has no control over the selection of test sample subsets for each sub-validation experiment, thus making it difficult to assess whether the results were adversely affected by either of these traps. Fortunately, their effect can be reduced to some extent by increasing the number of iterations.
Leave-One-Out: Each single object in the data set is used as a test set. This method is generally reserved for small data sets (n not greater than 20). It might also be useful in the case of randomly distributed objects, if one has enough time or n is not too large. However, it is not recommended even for small data sets if the objects are blocked with replicates or generated from a design of experiments (DOE), due to the replicate sample trap and ill-conditioned trap, respectively."
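To make the four schemes concrete, here is a small sketch of how the corresponding test-set indices could be generated. This is a hand-rolled illustration following the descriptions above, not Eigenvector's own code; n, s, and r follow the definitions in the quote and the chosen values are arbitrary.

```python
# Illustrative index generation for the four CV schemes described above
# (not Eigenvector's implementation). n = objects, s = splits, r = iterations.
import numpy as np

n, s, r = 20, 4, 3
idx = np.arange(n)          # objects assumed to be in their original (row) order
rng = np.random.default_rng(0)

# Venetian blinds: every s-th object, starting at objects 1 through s
venetian = [idx[start::s] for start in range(s)]

# Contiguous blocks: s blocks of roughly n/s consecutive objects
contiguous = np.array_split(idx, s)

# Random subsets: s disjoint random subsets of roughly n/s objects, repeated r times
random_subsets = [np.array_split(rng.permutation(n), s) for _ in range(r)]

# Leave-one-out: each single object is its own test set
loo = [np.array([i]) for i in idx]

# In each scheme, one "test set" is held out at a time and the remaining
# objects form the training set, e.g.:
test = venetian[0]
train = np.setdiff1d(idx, test)
```

For comparison, scikit-learn's KFold without shuffling behaves roughly like contiguous blocks, RepeatedKFold (shuffled folds, repeated) like random subsets, and LeaveOneOut like leave-one-out; to our knowledge venetian blinds has no direct built-in equivalent there.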