One might have noticed a pattern when applying machine learning
techniques to cancer microarray datasets: they are scattered
across multiple repositories, usually come from old studies, and are
employed time and again for the same purposes. However,
microarray technology has changed, from the chip technology and
the number of known probes to the available preprocessing
options. Hence, continuing to employ the same examples and old
datasets, already manipulated by older studies, does not reflect
the current state of the field. Today, microarray
datasets contain more genes, come from multiple platforms and need
more rigorous filtering and preprocessing to be ready for machine
learning approaches.
Here we present the Curated Microarray Database (CuMiDa),
a repository containing 78 handpicked cancer microarray datasets,
extensively curated from 30,000 studies from the Gene Expression
Omnibus (GEO), solely for machine learning. The aim of CuMiDa is
to offer homogeneous and state-of-the-art biological preprocessing
of these datasets, together with numerous 3-fold cross-validation
benchmark results, to propel machine learning studies focused on
cancer research. The database makes available various download
options to be used by other programs, as well as PCA and
t-SNE results. CuMiDa differs from existing databases
by offering newer datasets, manually and carefully curated with respect to
sample quality, unwanted probes, background correction and
normalization, to create a more reliable source of data for
computational research.
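
As a rough illustration of what a 3-fold cross-validation benchmark involves, the following minimal Python sketch splits a small expression matrix into three stratified folds and reports the mean accuracy of a simple classifier. It is not the CuMiDa benchmark pipeline: the nearest-centroid classifier, the synthetic data and all function names are assumptions chosen only for illustration.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=3, seed=0):
    """Split sample indices into k folds, keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def centroid(rows):
    """Mean expression vector of a list of samples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def predict(centroids, sample):
    """Return the class whose centroid is closest (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, sample))
    return min(centroids, key=lambda label: dist(centroids[label]))


def cross_validate(X, y, k=3, seed=0):
    """Mean accuracy of a nearest-centroid classifier over k stratified folds."""
    folds = stratified_folds(y, k, seed)
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        grouped = defaultdict(list)
        for i in train_idx:
            grouped[y[i]].append(X[i])
        centroids = {label: centroid(rows) for label, rows in grouped.items()}
        correct = sum(predict(centroids, X[i]) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Synthetic stand-in for an expression matrix: 12 samples x 4 "genes".
rng = random.Random(42)
X = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)] + \
    [[rng.gauss(5.0, 1.0) for _ in range(4)] for _ in range(6)]
y = ["normal"] * 6 + ["tumor"] * 6

mean_accuracy = cross_validate(X, y, k=3)
print(f"3-fold mean accuracy: {mean_accuracy:.2f}")
```

Stratifying the folds keeps the class balance of each held-out set close to that of the full dataset, which matters for microarray studies where the number of samples per class is often small.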

How to Cite
If you use CuMiDa in a scientific publication, we would appreciate citations to the following paper:
- Feltes, B.C.; Chandelier, E.B.; Grisci, B.I.; Dorn, M. CuMiDa: An Extensively Curated Microarray Database for Benchmarking and Testing of Machine Learning Approaches in Cancer Research. Journal of Computational Biology, 2019.
BibTeX
@article{Feltes2019,
  author = {Feltes, Bruno César and Chandelier, Eduardo Bassani and Grisci, Bruno Iochins and Dorn, Márcio},
  title = {CuMiDa: An Extensively Curated Microarray Database for Benchmarking and Testing of Machine Learning Approaches in Cancer Research},
  journal = {Journal of Computational Biology},
  volume = {26},
  number = {4},
  pages = {376-386},
  year = {2019},
  doi = {10.1089/cmb.2018.0238},
  note = {PMID: 30789283},
  URL = {https://doi.org/10.1089/cmb.2018.0238},
  eprint = {https://doi.org/10.1089/cmb.2018.0238}
}
Publications
Several papers already use CuMiDa. Some of them are:
- Grisci, B.I.; Krause, M.J.; Dorn, M. Relevance aggregation for neural networks interpretability and knowledge discovery on tabular data. Information Sciences, v. 559, p. 111-129, 2021.
- Feltes, B.C.; Poloni, J.F.; Nunes, I.J.G.; Faria, S.S.; Dorn, M. Multi-Approach Bioinformatics Analysis of Curated Omics Data Provides a Gene Expression Panorama for Multiple Cancer Types. Frontiers in Genetics, v. 11, p. 1354, 2020.
- Grisci, B.I.; Feltes, B.C.; Dorn, M. Neuroevolution as a Tool for Microarray Gene Expression Pattern Identification in Cancer Research. Journal of Biomedical Informatics, v. 89, p. 122-133, 2019.