Machine learning pipeline to analyze clinical and proteomics data: experiences on a prostate cancer case

Table 3 Example of an input dataset, with one sample per row. Missing values are represented by the NaN (Not a Number) literal. A sample with an excessive number of missing values is eliminated, while the remaining missing values are statistically imputed in the preprocessing phase. Each sample is identified by its Id, followed by other relevant features, e.g. the Age of the patient. The Disease feature holds the class label of each tuple (sample). The features to the right of the Disease column are the protein expression values of each sample.

Id	Age	ProstateGlandSize	TotalPsa	FTratio	PsaFree	Disease	...	sema7a
id100	57	95.00	8.94	24.0	2.14	BPH	...	15300
id19	73	NaN	0.07	71.0	0.05	BPH	...	29200
id7	47	20.0	6.97	8.0	0.59	PCa	...	31800
id30	62	50.0	19.71	10.0	1.97	PCa	...	9230
id144	73	NaN	1.83	16.0	0.29	PCa	...	28300
id142	72	NaN	8.05	21.0	1.71	PCa	...	22800
...	...	...	...	...	...	...	...	...
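
The handling of missing values described in the caption can be illustrated with a minimal sketch. It is not the pipeline's actual implementation: the 50% missing-value threshold, the choice of the column median as imputation statistic, and the helper names (load_samples, drop_sparse, impute_median) are assumptions made for illustration only, since the caption states only that samples with excessive missing values are eliminated and the rest are "statistically imputed". Only the clinical columns shown in the table are used; the elided protein columns are left out.

```python
import csv
import io
import math
from statistics import median

# Clinical columns of the six rows shown in Table 3 (protein columns omitted).
RAW = "\n".join([
    "Id\tAge\tProstateGlandSize\tTotalPsa\tFTratio\tPsaFree\tDisease",
    "id100\t57\t95.00\t8.94\t24.0\t2.14\tBPH",
    "id19\t73\tNaN\t0.07\t71.0\t0.05\tBPH",
    "id7\t47\t20.0\t6.97\t8.0\t0.59\tPCa",
    "id30\t62\t50.0\t19.71\t10.0\t1.97\tPCa",
    "id144\t73\tNaN\t1.83\t16.0\t0.29\tPCa",
    "id142\t72\tNaN\t8.05\t21.0\t1.71\tPCa",
])

NON_NUMERIC = {"Id", "Disease"}   # identifier and class label
MAX_MISSING_FRACTION = 0.5        # assumed threshold, not taken from the paper


def load_samples(text):
    """Parse tab-separated text; float('NaN') yields a NaN for missing cells."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {k: (v if k in NON_NUMERIC else float(v)) for k, v in rec.items()}
        for rec in reader
    ]


def is_missing(value):
    return isinstance(value, float) and math.isnan(value)


def drop_sparse(rows, max_fraction):
    """Keep only samples whose fraction of missing features is <= max_fraction."""
    kept = []
    for row in rows:
        numeric = [v for k, v in row.items() if k not in NON_NUMERIC]
        n_missing = sum(is_missing(v) for v in numeric)
        if n_missing / len(numeric) <= max_fraction:
            kept.append(row)
    return kept


def impute_median(rows):
    """Replace each remaining NaN with the median of its column (in place)."""
    if not rows:
        return rows
    for col in (k for k in rows[0] if k not in NON_NUMERIC):
        observed = [r[col] for r in rows if not is_missing(r[col])]
        if not observed:
            continue  # nothing to estimate the statistic from
        fill = median(observed)
        for r in rows:
            if is_missing(r[col]):
                r[col] = fill
    return rows


samples = impute_median(drop_sparse(load_samples(RAW), MAX_MISSING_FRACTION))
for s in samples:
    print(s["Id"], s["ProstateGlandSize"], s["Disease"])
```

Filtering before imputing keeps heavily incomplete samples from influencing the statistic used to fill the remaining gaps; a different statistic (e.g. mean or a nearest-neighbour estimate) could be substituted in impute_median without changing the rest of the sketch.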

ISSN: 1472-6947