
Table 2 Feature ablation study on the Random Forest model. Each feature set is removed in turn, and the change in performance relative to the full model is reported in parentheses

From: Deep learning with sentence embeddings pre-trained on biomedical corpora improves the performance of finding similar sentences in electronic medical records

| Model | # features | Validation set | Test set |
| --- | --- | --- | --- |
| Full model | 14 | 0.8832 | 0.8246 |
| - Token-based | 5 | 0.8689 (−1.5%) | 0.8129 (−1.2%) |
| - Character-based | 2 | 0.8655 (−1.8%) | 0.8154 (−0.9%) |
| - Sequence-based | 4 | 0.8697 (−1.4%) | 0.8034 (−2.1%) |
| - Semantic-based | 1 | 0.8704 (−1.3%) | 0.8235 (−0.1%) |
| - Entity-based | 2 | 0.8738 (−0.9%) | 0.8150 (−0.9%) |
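
The differences in parentheses in Table 2 can be approximately recomputed from the listed scores. The following is a minimal sketch, assuming the parenthesized values are absolute changes in percentage points relative to the full model (the table does not state this explicitly). The names `FULL_MODEL`, `ABLATIONS`, `drop_in_points`, and `ablation_deltas` are illustrative and not from the paper. Because the table shows scores rounded to four decimals, recomputed values may differ from the printed ones by about 0.1.

```python
# Scores copied from Table 2: (number of features, validation score, test score).
FULL_MODEL = (14, 0.8832, 0.8246)
ABLATIONS = {
    "Token-based": (5, 0.8689, 0.8129),
    "Character-based": (2, 0.8655, 0.8154),
    "Sequence-based": (4, 0.8697, 0.8034),
    "Semantic-based": (1, 0.8704, 0.8235),
    "Entity-based": (2, 0.8738, 0.8150),
}


def drop_in_points(full_score, ablated_score):
    """Change in score in percentage points (assumed interpretation of the table)."""
    return round((ablated_score - full_score) * 100, 2)


def ablation_deltas(full=FULL_MODEL, ablations=ABLATIONS):
    """Map each removed feature set to its (validation, test) change."""
    _, full_val, full_test = full
    return {
        name: (drop_in_points(full_val, val), drop_in_points(full_test, test))
        for name, (_, val, test) in ablations.items()
    }


deltas = ablation_deltas()

# The most negative test-set change marks the feature set whose removal hurts most.
largest_test_drop = min(deltas, key=lambda name: deltas[name][1])

for name, (d_val, d_test) in deltas.items():
    print(f"- {name}: validation {d_val:+.2f}, test {d_test:+.2f}")
print("Largest test-set drop:", largest_test_drop)
```

Under this reading, removing the sequence-based features produces the largest drop on the test set, which agrees with the −2.1% entry in the table. The feature counts of the five removed sets also sum to the 14 features of the full model.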