Page 109 - 18F-FDG PET as biomarker in aggressive lymphoma; technical and clinical validation
Workflow optimization of MTV in DLBCL
intervals (95%CIs) were calculated with a two-way random-effects model for absolute agreement [19]. The 95%CIs of the ICC values were interpreted as poor (< 0.5), moderate (0.5–0.75), good (0.75–0.9), and excellent (> 0.9) [20,21]. CoV was calculated per patient as the standard deviation of the MTVs or TLGs over the three observers divided by their mean. Mean CoVs are presented, i.e., CoVs averaged over all patients. Bland-Altman plots were drawn to visually assess potential bias in the mean differences between the workflows and to estimate the 95% limits of agreement [22]. Normality of the MTV and TLG differences before and after manual modification was checked with the Shapiro-Wilk (SW) test, in which P < 0.05 indicated a non-normal distribution. Statistical analyses were performed using SPSS Statistics (IBM, v.20).
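As an illustration only, the agreement measures described above can be sketched in a few lines of Python. This is not the SPSS procedure used in the study. The function names are hypothetical, and the sketch assumes the sample standard deviation (n - 1 denominator), the single-measurement form of the two-way random-effects, absolute-agreement ICC, and a multiplier of 1.96 for the 95% limits of agreement; none of these details is stated in the text. Each row of `table` holds one patient's MTVs (or TLGs) from the observers.

```python
from statistics import mean, stdev


def cov_percent(values):
    """CoV (%) for one patient: sample SD over observers divided by the mean."""
    return 100.0 * stdev(values) / mean(values)


def mean_cov_percent(table):
    """Mean CoV (%): per-patient CoVs averaged over all patients."""
    return mean(cov_percent(row) for row in table)


def icc_absolute_single(table):
    """Two-way random-effects, absolute-agreement, single-measurement ICC.

    Assumption: the single-measurement form is used; the text does not say
    whether single or average measures were reported.
    """
    n, k = len(table), len(table[0])  # n patients, k observers
    grand = mean(v for row in table for v in row)
    row_means = [mean(row) for row in table]
    col_means = [mean(row[j] for row in table) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in table for v in row)
    ms_rows = ss_rows / (n - 1)                      # between patients
    ms_cols = ss_cols / (k - 1)                      # between observers
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def bland_altman(a, b):
    """Return (bias, lower limit, upper limit) for paired measurements a, b."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)  # assumed multiplier for 95% limits
    return bias, bias - half_width, bias + half_width
```

With three observers, `mean_cov_percent` takes one row of three volumes per patient, and `bland_altman` takes the paired per-patient volumes from two workflows. Confidence intervals for the ICC and the normality check are left to a statistics package.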
Results
Workflow A; Individual Lesion Selection
Lesion Selection
The total number of lesions selected by observers 1, 2, and 3 was 162, 117, and 118, respectively. The difference arose because observer 1 separately selected small lesions close to larger lesions, which observers 2 and 3 ignored. This resulted in larger volumes for the A50%P and the two MV consensus methods for observer 1 (Supplemental Fig. 1). In total, 76 lesions were selected by all observers; of these, 35 showed identical segmentation results, and 18 had a between-observer difference in volume of < 1 ml. The remaining 23 non-identical lesions resulted from observers clicking in different parts of a heterogeneous lesion, which caused the SUVmax or SUVpeak of the lesion to be missed.
Interobserver Reliability
ICC values for semi-automated MTVs were 0.43, 0.86, 0.96, and 0.94 for the 41%MAX, A50%P, SUV≥2.5, and SUV≥4.0 thresholds, respectively. Mean CoVs were 65.5 %, 36.7 %, 13.3 %, and 13.8 %, respectively (Table 1). When considering the 95%CIs of ICCs, only SUV≥2.5 and SUV≥4.0 showed excellent and good to excellent reliability, respectively. For the MV2 and MV3 consensus methods, the mean CoVs were 22.7 % and 33.5 % and ICCs were 0.92 and 0.91, respectively. Overall, fixed SUV threshold methods (SUV≥2.5 and SUV≥4.0) showed least