ensemble algorithm (26). To assess model generalizability (i.e. its prediction performance on unseen data), we used a stratified 5-fold cross-validation approach. In each cross-validation fold, the Random Forest was trained on 80% of the samples and validated on the remaining, unseen 20%. This was repeated until each fold had served once as the test set. Finally, the 5-fold cross-validation was repeated 50 times to further limit chance findings. Features were scaled using z-score normalization. Model hyperparameters (tree depth, splitting criterion) were optimized within each training set in a nested cross-validation using a randomized search algorithm. All pre-processing and optimization steps were performed within each training fold to prevent leakage of test data into the trained model (Fig. 6.1).

[Figure 6.1 depicts, per training fold: data normalization, dimension reduction, hyperparameter optimization, oversampling, and Random Forest training; the 5-fold cross-validation is repeated 50 times, yielding 250 performance scores reported as mean ± SD.]

Figure 6.1: Schematic overview of the implemented machine learning pipeline. Data pre-processing and model tuning are performed on the training dataset in repeated cross-validation to prevent leakage of information between training and testing data.
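To make the cross-validation structure concrete, the sketch below outlines a comparable pipeline in Python with scikit-learn. It is an illustration under assumed settings (placeholder data, hypothetical hyperparameter ranges, AUC scoring), not the exact implementation used in this chapter, and the dimension-reduction and oversampling steps shown in Fig. 6.1 are omitted for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (RandomizedSearchCV,
                                     RepeatedStratifiedKFold,
                                     cross_val_score)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for the feature matrix (assumption).
X, y = make_classification(n_samples=100, n_features=20, random_state=0)

# Scaler and classifier combined in a Pipeline so that z-score
# normalization is fitted on the training folds only (no test-data leakage).
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(random_state=0)),
])

# Hyperparameters tuned in the inner (nested) loop; ranges are illustrative.
param_distributions = {
    "rf__max_depth": [2, 4, 8, 16, None],
    "rf__criterion": ["gini", "entropy"],
}

# Inner loop: randomized search within the training portion of each outer fold.
inner_cv = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=10,
    cv=5,
    scoring="roc_auc",
    random_state=0,
)

# Outer loop: stratified 5-fold cross-validation repeated 50 times
# (250 scores in total), summarized as mean ± SD.
outer_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=50, random_state=0)
scores = cross_val_score(inner_cv, X, y, cv=outer_cv,
                         scoring="roc_auc", n_jobs=-1)

print(f"AUC: {scores.mean():.3f} ± {scores.std():.3f}")
```

Because scaling and tuning are wrapped inside the estimator passed to the outer cross-validation, both are refitted on each training fold, mirroring the leakage-prevention principle illustrated in Fig. 6.1.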