CuMiDa: An Extensively Curated Microarray Database


One might have noticed a pattern when applying machine learning techniques to cancer microarray datasets: they are scattered through multiple repositories, usually come from old studies, and are employed time and time again for the same purposes. However, microarray technology has changed, from chip technology and the number of known probes to the available preprocessing options. Hence, continuing to employ the same examples and old datasets, already manipulated by older studies, does not reflect the current state of the field. Today, microarray datasets contain more genes, come from multiple platforms, and need more rigorous filtering and preprocessing to be ready for machine learning approaches.

Here we present the Curated Microarray Database (CuMiDa), a repository containing 78 handpicked cancer microarray datasets, extensively curated from 30,000 studies from the Gene Expression Omnibus (GEO), solely for machine learning. The aim of CuMiDa is to offer homogeneous and state-of-the-art biological preprocessing of these datasets, together with numerous 3-fold cross-validation benchmark results, to propel machine learning studies focused on cancer research. The database makes available various download options to be employed by other programs, as well as PCA and t-SNE results. CuMiDa differs from existing databases by offering newer datasets, manually and carefully curated with respect to sample quality, unwanted probes, background correction, and normalization, to create a more reliable source of data for computational research.
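To make the ZeroR baseline in the benchmark table concrete, the following is a minimal sketch of how a majority-class predictor could be scored under 3-fold cross-validation. It is not CuMiDa's actual benchmarking code: the stratified round-robin split, the pooled accuracy, the function names, and the toy labels are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, k=3):
    """Assign sample indices to k folds, spreading each class round-robin.

    Stratification is an assumption of this sketch, not a documented
    property of the CuMiDa benchmarks.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        for idx in by_class[label]:
            folds[position % k].append(idx)
            position += 1
    return folds


def zero_r_cv_accuracy(labels, k=3):
    """Accuracy of a majority-class (ZeroR) predictor, pooled over k held-out folds."""
    folds = stratified_folds(labels, k)
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train = [labels[i] for i in range(len(labels)) if i not in held]
        # ZeroR ignores the features and always predicts the most common training label.
        majority = Counter(train).most_common(1)[0][0]
        correct += sum(labels[i] == majority for i in held_out)
    return correct / len(labels)


# Hypothetical toy labels (not a CuMiDa dataset): 8 tumor and 4 normal samples.
toy_labels = ["tumor"] * 8 + ["normal"] * 4
print(round(zero_r_cv_accuracy(toy_labels), 2))
```

Because ZeroR never looks at gene expression values, its score mostly reflects class balance, which makes it a useful floor when reading the other classifier columns.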

How to cite: Feltes, B.C.; Chandelier, E.B.; Grisci, B.I.; Dorn, M. CuMiDa: An Extensively Curated Microarray Database for Benchmarking and Testing of Machine Learning Approaches in Cancer Research. Journal of Computational Biology, Ahead of Print, 2019.


TYPE GSE PLATFORM SAMPLES GENES CLASSES ZeroR SVM MLP DT NB RF HC KNN K-MEANS
Pancreatic GSE16515 GPL570 51 54676 2 0.71 0.86 0.78 0.78 0.84 0.82 0.69 0.76 0.76
Breast GSE33447 GPL14550 16 36623 2 0.44 1.00 0.88 0.88 0.88 0.94 0.56 0.88 0.88
Breast GSE59246 GPL13607 101 36623 2 0.55 0.85 0.60 0.77 0.72 0.79 0.56 0.73 0.62
Breast GSE26910 GPL570 12 54676 2 0.50 0.83 0.83 0.25 0.83 0.83 0.58 0.75 0.83
Breast GSE57297 GPL17077 26 42946 2 0.73 0.96 1.00 0.65 0.77 0.85 0.69 1.00 0.69
Breast GSE22820 GPL6480 139 33580 2 0.93 1.00 1.00 0.96 0.97 0.99 0.92 0.99 0.65
Breast GSE42568 GPL570 116 54676 2 0.87 0.99 0.99 0.94 0.99 0.97 0.88 0.98 0.62
Breast GSE26304 GPL6848 115 33638 5 0.36 0.26 0.30 0.39 0.34 0.34 0.36 0.30 0.35
Breast GSE70947 GPL13607 289 35982 2 0.51 0.93 0.70 0.80 0.83 0.86 0.51 0.82 0.78
Breast GSE7904 GPL570 45 54676 3 0.47 0.96 0.93 0.71 0.82 0.91 0.49 0.80 0.62
Breast GSE38959 GPL4133 43 33580 2 0.70 0.95 0.95 0.74 0.91 0.98 0.72 0.88 0.98
Breast GSE45827 GPL570 151 54676 6 0.27 0.94 0.58 0.80 0.93 0.95 0.34 0.80 0.70
Breast GSE10797 GPL571 66 22278 3 0.41 0.82 0.53 0.65 0.67 0.65 0.44 0.55 0.58
Breast GSE89116 GPL6947 38 39427 3 0.45 0.47 0.29 0.45 0.42 0.50 0.37 0.42 0.58
Liver GSE62232 GPL570 91 54676 2 0.89 1.00 0.99 0.89 0.97 0.95 0.88 1.00 0.70
Liver GSE50579 GPL14550 76 36548 2 0.84 0.99 0.92 0.97 0.89 0.87 0.87 0.96 0.79
Liver GSE14520_U133A GPL3921 357 22278 2 0.51 0.97 0.80 0.92 0.96 0.96 0.50 0.93 0.92
Liver GSE46408 GPL4133 12 33468 2 0.50 0.83 0.83 0.25 0.92 0.92 0.58 0.83 0.92
Liver GSE22405 GPL10553 48 22284 2 0.50 0.92 0.92 0.83 0.88 0.83 0.52 0.73 0.62
Liver GSE62043 GPL6480 95 40820 1 NA NA NA NA NA NA NA NA NA
Liver GSE57957 GPL10558 75 47324 2 0.52 0.96 0.96 0.88 0.97 0.97 0.51 0.89 0.93
Liver GSE60502 GPL96 36 22284 2 0.50 0.97 0.97 0.81 0.97 0.94 0.53 0.72 0.69
Liver GSE76427 GPL10558 165 47323 2 0.70 0.96 0.93 0.95 0.96 0.95 0.70 0.88 0.94
Liver GSE14520_U133_2 GPL3921 41 22278 2 0.54 1.00 0.98 0.98 0.95 1.00 0.51 0.88 0.83
Throat GSE42743 GPL570 103 54676 2 0.72 0.87 0.83 0.85 0.86 0.87 0.73 0.89 0.82
Throat GSE59102 GPL6480 42 32704 2 0.69 0.98 1.00 0.93 1.00 0.98 0.67 1.00 1.00
Throat GSE12452 GPL570 40 54676 2 0.78 0.97 0.97 0.85 0.95 0.93 0.80 0.93 0.75
Throat GSE53819 GPL6480 35 32784 2 0.49 1.00 0.97 0.74 0.91 0.97 0.51 0.97 0.51
Leukemia GSE33615 GPL4133 71 33580 2 0.70 1.00 1.00 0.93 0.94 1.00 0.69 0.99 0.99
Leukemia GSE71935 GPL570 46 54676 2 0.80 0.89 0.87 0.74 0.80 0.80 0.78 0.87 0.63
Leukemia GSE22529_U133B GPL97 52 22646 2 0.79 0.96 0.96 0.88 0.90 0.90 0.77 0.90 0.52
Leukemia GSE14317 GPL571 25 22278 2 0.72 1.00 1.00 0.68 0.88 0.88 0.68 0.92 0.80
Leukemia GSE63270 GPL17810 101 54676 2 0.59 1.00 1.00 0.89 1.00 1.00 0.60 0.99 0.79
Leukemia GSE28497 GPL96 281 22284 7 0.26 0.88 0.72 0.73 0.78 0.79 0.27 0.70 0.45
Leukemia GSE71449 GPL19197 45 52201 4 0.44 0.58 0.42 0.71 0.49 0.38 0.42 0.38 0.42
Leukemia GSE22529_U133A GPL97 52 22284 2 0.79 0.98 0.98 0.90 0.90 0.92 0.81 0.94 0.54
Leukemia GSE9476 GPL96 64 22284 5 0.41 0.98 0.94 0.89 0.89 0.98 0.41 0.89 0.67
Prostate GSE8511 GPL1708 12 41055 1 NA NA NA NA NA NA NA NA NA
Prostate GSE6919_U95C GPL8300 115 12647 2 0.51 0.64 0.63 0.65 0.69 0.66 0.50 0.55 0.51
Prostate GSE11682 GPL4133 31 33468 2 0.52 0.52 0.52 0.52 0.48 0.35 0.55 0.39 0.55
Prostate GSE46602 GPL570 49 54676 2 0.71 0.94 0.96 0.82 0.90 0.92 0.69 0.94 0.65
Prostate GSE26910 GPL570 12 54676 2 0.50 0.83 0.67 0.50 0.67 0.83 0.58 0.67 0.67
Prostate GSE38241 GPL4133 39 41016 1 NA NA NA NA NA NA NA NA NA
Prostate GSE6919_U95B GPL8300 124 12621 2 0.52 0.68 0.62 0.60 0.71 0.67 0.51 0.56 0.54
Prostate GSE55945 GPL570 17 54676 2 0.59 0.82 0.94 0.82 0.59 0.65 0.65 0.71 0.53
Prostate GSE6919_U95Av2 GPL8300 124 12626 2 0.49 0.67 0.65 0.45 0.63 0.69 0.51 0.58 0.62
Prostate GSE60329 GPL14550 105 42531 1 NA NA NA NA NA NA NA NA NA
Ovary GSE12470 GPL887 53 18930 3 0.66 0.85 0.85 0.62 0.83 0.79 0.68 0.79 0.70
Ovary GSE16708 GPL6947 24 48804 2 0.62 1.00 1.00 0.92 1.00 1.00 0.58 1.00 1.00
Ovary GSE16570 GPL6947 15 48804 2 0.60 1.00 1.00 0.93 1.00 1.00 1.00 1.00 1.00
Ovary GSE6008 GPL96 98 22284 4 0.42 0.71 0.64 0.65 0.68 0.71 0.42 0.66 0.41
Brain GSE50161 GPL570 130 54676 5 0.35 0.95 0.82 0.85 0.85 0.91 0.38 0.87 0.46
Brain GSE15824 GPL570 37 54676 4 0.32 0.81 0.70 0.41 0.62 0.78 0.51 0.81 0.62
Bladder GSE31189 GPL570 85 54676 2 0.56 0.64 0.58 0.54 0.46 0.55 0.58 0.62 0.55
Bladder GSE40355 GPL13497 24 29045 3 0.25 0.79 0.75 0.67 0.54 0.75 0.38 0.71 0.75
Lung GSE7670 GPL96 51 22284 2 0.53 0.96 0.96 0.96 0.90 0.98 0.55 0.86 0.96
Lung GSE27262 GPL570 48 54676 2 0.50 1.00 1.00 0.94 0.98 1.00 0.52 0.94 1.00
Lung GSE19804 GPL570 114 54676 2 0.49 0.93 0.85 0.91 0.91 0.92 0.52 0.79 0.89
Lung GSE74706 GPL13497 35 29149 2 0.49 1.00 1.00 0.91 1.00 1.00 0.54 0.94 0.94
Lung GSE63459 GPL6883 65 24527 2 0.49 0.68 0.63 0.49 0.72 0.74 0.52 0.58 0.71
Lung GSE18842 GPL570 90 54676 2 0.51 1.00 0.99 0.97 1.00 1.00 0.50 0.94 0.97
Renal GSE66270 GPL570 28 54676 2 0.46 1.00 1.00 0.79 1.00 1.00 1.00 1.00 1.00
Renal GSE6344_U133A GPL97 20 22284 2 0.45 0.80 0.80 0.45 0.90 0.85 0.90 0.85 0.90
Renal GSE53757 GPL570 143 54676 2 0.50 0.83 0.83 0.74 0.84 0.85 0.51 0.79 0.85
Renal GSE6344_U133B GPL97 20 22646 2 0.45 0.85 0.80 0.85 0.90 0.85 0.85 0.80 0.80
Gastric GSE79973 GPL570 20 54676 2 0.45 0.85 0.85 0.65 0.90 0.90 0.55 0.85 0.90
Gastric GSE22804 GPL6480 14 41084 1 NA NA NA NA NA NA NA NA NA
Gastric GSE19826 GPL570 24 54676 2 0.50 0.67 0.67 0.67 0.71 0.67 0.54 0.67 0.79
Colorectal GSE44861 GPL3921 105 22278 2 0.50 0.84 0.64 0.78 0.84 0.82 0.51 0.69 0.60
Colorectal GSE75548 GPL10558 12 48108 2 0.50 0.83 0.83 0.92 0.75 0.83 0.58 0.83 0.67
Colorectal GSE41328 GPL570 18 54676 2 0.56 0.89 0.89 1.00 0.89 0.89 0.67 0.94 0.72
Colorectal GSE77953 GPL96 55 22284 4 0.29 0.95 0.95 0.60 0.85 0.87 0.35 0.76 0.51
Colorectal GSE25070 GPL6883 52 24527 2 0.48 0.96 0.94 0.81 0.94 0.96 0.52 0.88 0.96
Colorectal GSE8671 GPL570 63 54676 2 0.51 1.00 1.00 0.94 1.00 1.00 0.52 0.98 0.98
Colorectal GSE41657 GPL6480 86 33468 4 0.33 0.78 0.70 0.64 0.86 0.79 0.34 0.64 0.58
Colorectal GSE32323 GPL570 33 54676 2 0.52 1.00 1.00 0.82 0.97 0.97 0.52 1.00 1.00
Colorectal GSE21510 GPL570 147 54676 3 0.71 0.99 1.00 0.90 0.97 0.94 0.71 0.97 0.83
Colorectal GSE44076 GPL13667 194 49387 2 0.49 0.99 0.99 0.95 0.98 0.98 0.51 0.98 0.98
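
For readers who want to process the table programmatically, here is a minimal sketch that parses whitespace-separated rows and reports the best-scoring classifier per dataset. It assumes the nine numeric columns follow the order ZeroR, SVM, MLP, DT, NB, RF, HC, KNN, K-MEANS and that the values are accuracies; the helper names `parse_row` and `best_classifier` are hypothetical, and the three sample rows are copied from the table above.

```python
# Column order assumed from the table header; values are taken to be accuracies.
CLASSIFIERS = ["ZeroR", "SVM", "MLP", "DT", "NB", "RF", "HC", "KNN", "K-MEANS"]

# Three rows copied verbatim from the table above.
ROWS = """\
Pancreatic GSE16515 GPL570 51 54676 2 0.71 0.86 0.78 0.78 0.84 0.82 0.69 0.76 0.76
Breast GSE33447 GPL14550 16 36623 2 0.44 1.00 0.88 0.88 0.88 0.94 0.56 0.88 0.88
Liver GSE62043 GPL6480 95 40820 1 NA NA NA NA NA NA NA NA NA
"""


def parse_row(line):
    """Split one whitespace-separated row into a record; 'NA' becomes None."""
    fields = line.split()
    if len(fields) != 6 + len(CLASSIFIERS):
        raise ValueError(f"unexpected field count: {len(fields)}")
    cancer, gse, platform = fields[:3]
    samples, genes, classes = (int(value) for value in fields[3:6])
    scores = {
        name: (None if value == "NA" else float(value))
        for name, value in zip(CLASSIFIERS, fields[6:])
    }
    return {
        "type": cancer,
        "gse": gse,
        "platform": platform,
        "samples": samples,
        "genes": genes,
        "classes": classes,
        "scores": scores,
    }


def best_classifier(record):
    """Return the name of the highest-scoring classifier, or None if all are NA."""
    scored = {name: value for name, value in record["scores"].items() if value is not None}
    if not scored:
        return None
    # On ties, max() keeps the first classifier in column order.
    return max(scored, key=scored.get)


records = [parse_row(line) for line in ROWS.splitlines() if line.strip()]
for record in records:
    print(record["gse"], best_classifier(record))
```

Datasets listed with a single class have no benchmark scores (NA), so the sketch returns `None` for them rather than guessing.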

Financial Support: