Table 3 Example of the input dataset, with one sample per row. Missing values are represented by the literal NaN (Not a Number). Samples with an excessive number of missing values are eliminated, while the remaining missing values are statistically imputed in the preprocessing phase. Each sample is identified by its Id, followed by other relevant features, e.g. the Age of the patient. The Disease feature holds the class label of each tuple (sample). The columns to the right of the Disease column are protein expression values for each sample

From: Machine learning pipeline to analyze clinical and proteomics data: experiences on a prostate cancer case

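| Id | Age | Prostate GlandSize | TotalPsa | FTratio | PsaFree | Disease | ... | sema7a |
|---|---|---|---|---|---|---|---|---|
| id100 | 57 | 95.00 | 8.94 | 24.0 | 2.14 | BPH | ... | 15300 |
| id19 | 73 | NaN | 0.07 | 71.0 | 0.05 | BPH | ... | 29200 |
| id7 | 47 | 20.0 | 6.97 | 8.0 | 0.59 | PCa | ... | 31800 |
| id30 | 62 | 50.0 | 19.71 | 10.0 | 1.97 | PCa | ... | 9230 |
| id144 | 73 | NaN | 1.83 | 16.0 | 0.29 | PCa | ... | 28300 |
| id142 | 72 | NaN | 8.05 | 21.0 | 1.71 | PCa | ... | 22800 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |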
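
The preprocessing described in the caption of Table 3 (drop samples with too many missing values, then impute what remains) can be illustrated with a minimal sketch. It uses only the rows and columns visible in the table; the elided columns and the Disease label are left out. The caption does not state the elimination threshold or the imputation statistic, so `MAX_MISSING_FRACTION` and the per-column median used here are assumptions made for illustration, not the authors' settings.

```python
import math
from statistics import median

NAN = float("nan")

# Numeric columns visible in Table 3; the elided ("...") columns are omitted.
NUMERIC = ["Age", "Prostate GlandSize", "TotalPsa", "FTratio", "PsaFree", "sema7a"]

# Rows copied from Table 3.
rows = [
    {"Id": "id100", "Age": 57, "Prostate GlandSize": 95.0, "TotalPsa": 8.94,
     "FTratio": 24.0, "PsaFree": 2.14, "sema7a": 15300},
    {"Id": "id19", "Age": 73, "Prostate GlandSize": NAN, "TotalPsa": 0.07,
     "FTratio": 71.0, "PsaFree": 0.05, "sema7a": 29200},
    {"Id": "id7", "Age": 47, "Prostate GlandSize": 20.0, "TotalPsa": 6.97,
     "FTratio": 8.0, "PsaFree": 0.59, "sema7a": 31800},
    {"Id": "id30", "Age": 62, "Prostate GlandSize": 50.0, "TotalPsa": 19.71,
     "FTratio": 10.0, "PsaFree": 1.97, "sema7a": 9230},
    {"Id": "id144", "Age": 73, "Prostate GlandSize": NAN, "TotalPsa": 1.83,
     "FTratio": 16.0, "PsaFree": 0.29, "sema7a": 28300},
    {"Id": "id142", "Age": 72, "Prostate GlandSize": NAN, "TotalPsa": 8.05,
     "FTratio": 21.0, "PsaFree": 1.71, "sema7a": 22800},
]


def is_missing(value):
    """Return True if the value is the NaN placeholder."""
    return isinstance(value, float) and math.isnan(value)


def drop_sparse(samples, max_missing_fraction):
    """Keep samples whose share of missing numeric features is within the limit."""
    kept = []
    for sample in samples:
        n_missing = sum(is_missing(sample[col]) for col in NUMERIC)
        if n_missing / len(NUMERIC) <= max_missing_fraction:
            kept.append(sample)
    return kept


def impute_median(samples):
    """Replace each missing numeric value with the median of its column."""
    medians = {
        col: median(s[col] for s in samples if not is_missing(s[col]))
        for col in NUMERIC
    }
    return [
        {key: medians[key] if key in medians and is_missing(val) else val
         for key, val in sample.items()}
        for sample in samples
    ]


MAX_MISSING_FRACTION = 0.5  # hypothetical threshold, not taken from the paper

kept = drop_sparse(rows, MAX_MISSING_FRACTION)
clean = impute_median(kept)
```

With this threshold all six rows are kept, since each has at most one missing value out of six features, and the three missing Prostate GlandSize entries are filled with the median of the observed values (20.0, 50.0 and 95.0). A stricter threshold of 0.0 would instead discard id19, id144 and id142.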