Test 2: Gene-based test to control gene-length related mutation variability
We fit a monotonically increasing smooth function to estimate the gene length effect: $y \sim \mathrm{gam}(x)$, where $y$ is the mutation frequency for a gene in a particular cancer type and $x$ is the mRNA length. We also tried amino acid length as the predictor, and the results were similar. After successful fitting, we calculated a chi-square value for the gene following the likelihood ratio $\chi^2 = (y - y_p)^2 / y_p$, where $y_p$ is the predicted frequency. The resulting p-value was denoted as $p_{gene}$.
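As a minimal illustration of the last step, the sketch below converts an observed and a predicted mutation frequency into a chi-square value and a p-value. It assumes the Pearson form (y - y_p)^2 / y_p for the statistic and one degree of freedom for the reference distribution; neither is stated explicitly in the text, and the function name `gene_length_pvalue` is hypothetical. The monotone GAM fit itself is not reproduced here.

```python
import math


def gene_length_pvalue(y, y_pred):
    """Return (chi2, p) for one gene.

    Assumptions (not confirmed by the source): Pearson-type statistic
    (y - y_pred)^2 / y_pred and a chi-square reference with 1 degree of freedom.
    """
    if y_pred <= 0:
        raise ValueError("predicted frequency must be positive")
    chi2 = (y - y_pred) ** 2 / y_pred
    # Survival function of a chi-square variable with 1 d.f.:
    # P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p
```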
For each TSG × cancer event, we randomized the labels for the inactivated samples and WT samples. In each randomized set, we conducted a differential expression analysis and calculated $T_{random} = \sqrt{t_{1,random}^2 + t_{2,random}^2}$. With 10,000 randomization trials, we calculated an empirical p-value ($p_{emp}$) for each TSG × cancer event as $p_{emp} = \#(T_{random} > T)/10000$. $p_{emp}$ measures the significance of the combined impacts of both cis- and trans-effects.
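The randomization procedure can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the t-values are taken to be Welch t-statistics, the trans-effect is taken as the mean absolute t-value over the top 1% of genes ranked by |t|, and the data layout (`expr` as a gene-to-sample dictionary) and all function names are hypothetical.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch t-statistic (assumed; the text only says 't-value')."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def impact_score(expr, tsg, mutated, wild_type, top_frac=0.01):
    """T = sqrt(t1^2 + t2^2) for one labeling of the samples.

    expr: dict gene -> dict sample -> expression (hypothetical layout).
    """
    t = {g: welch_t([v[s] for s in mutated], [v[s] for s in wild_type])
         for g, v in expr.items()}
    t1 = t[tsg]  # cis-effect: the TSG itself
    others = sorted((abs(x) for g, x in t.items() if g != tsg), reverse=True)
    k = max(1, math.ceil(top_frac * len(others)))
    # Assumption: trans-effect = mean |t| over the top 1% of other genes.
    t2 = mean(others[:k]) if others else 0.0
    return math.sqrt(t1 ** 2 + t2 ** 2)


def empirical_p(expr, tsg, mutated, wild_type, n_trials=10000, seed=0):
    """p_emp = #(T_random > T) / n_trials under label randomization."""
    rng = random.Random(seed)
    observed = impact_score(expr, tsg, mutated, wild_type)
    samples = list(mutated) + list(wild_type)
    n_mut = len(mutated)
    exceed = 0
    for _ in range(n_trials):
        rng.shuffle(samples)  # reassign inactivated/WT labels at random
        t_random = impact_score(expr, tsg, samples[:n_mut], samples[n_mut:])
        exceed += t_random > observed
    return exceed / n_trials
```

Each group needs at least two samples for the variance to be defined.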
Filtering abnormal outliers at the transcriptomic level
In our manual inspection of mutant samples (i.e., those with L2 or L1 mutations), we observed occasional abnormal outliers in some genes, whose expression profiles appeared dissimilar to those of the other mutant samples. For example, a TSG was reported with inactivation mutations but showed no sign of decreased expression. In Figure S7A, we took RB1 in the BRCA_Basal subtype as an example. There were 19 BRCA_Basal samples with RB1 inactivation mutations, including 15 samples with deep deletion (L2), 3 samples with copy loss accompanied by truncation mutations (L2), and 1 sample with truncation mutations (L1). However, a visual inspection revealed that two samples with deep deletion had quite high expression of RB1, implying that these deep deletion events did not function as expected.
We then developed a strategy to systematically and quantitatively screen for such abnormal outliers. We used the combined impact score $T = \sqrt{t_1^2 + t_2^2}$ to determine whether a mutated sample is an outlier relative to the remaining mutated samples.
Here $t_1$ is the t-value for the TSG itself (cis-effect) and $t_2$ is the average t-value for the top 1% DEGs in the transcriptome excluding the TSG itself (trans-effect). Hence, $T$ measures the overall impact rather than the effect on the TSG alone. Specifically, for each TSG in each cancer type, we started with its L2+L1 samples and calculated a list of new $T'$ values, each corresponding to the exclusion of one mutated sample. If excluding a mutated sample led to an increase of $T$ by more than 5%, i.e., $T' > T(1 + 0.05)$, the corresponding mutated sample was excluded from the L2+L1 sample pool. Notably, to avoid self-serving analyses, such abnormal outlier samples were not regrouped into the wild-type samples but were permanently excluded from all following analyses. The process was repeated iteratively until no remaining mutated sample was associated with an extreme $T'$, i.e., until no exclusion would increase the combined impact score by more than 5%.
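The leave-one-out screen can be sketched as below. This is one possible reading of the procedure, not the original code: it removes a single sample per round (the one with the largest T'), and it stops once the pool reaches a hypothetical minimum size `min_mutated`, a threshold the text does not specify. The `score` callable stands in for the computation of T, and the toy score used in the example (an absolute Welch-type t-value on the TSG alone) is an assumption for illustration only.

```python
def filter_outliers(mutated, score, rel_increase=0.05, min_mutated=3):
    """Iteratively drop mutated samples whose exclusion raises T by > 5%.

    mutated: list of sample IDs (L2+L1 pool).
    score:   callable mapping a list of sample IDs to the impact score T.
    Returns (kept, removed); removed samples are NOT returned to the WT group.
    """
    kept = list(mutated)
    removed = []
    while len(kept) > min_mutated:
        t_all = score(kept)
        # T' for every leave-one-out subset of the current pool.
        t_loo = {s: score([x for x in kept if x != s]) for s in kept}
        worst = max(kept, key=t_loo.get)
        if t_loo[worst] <= t_all * (1 + rel_increase):
            break  # no single exclusion increases T by more than 5%
        kept.remove(worst)  # assumption: one sample removed per round
        removed.append(worst)
    return kept, removed


# Toy example: TSG expression is low in true inactivated samples and
# unexpectedly high in two samples (o1, o2). All values are made up.
wt_expr = [10.0, 10.4, 9.6, 10.2, 9.8, 10.1]
mut_expr = {"m1": 1.0, "m2": 1.1, "m3": 0.9, "m4": 1.05, "m5": 0.95,
            "m6": 1.0, "o1": 12.0, "o2": 11.5}


def _mean(v):
    return sum(v) / len(v)


def _var(v):
    m = _mean(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)


def toy_score(samples):
    a = [mut_expr[s] for s in samples]
    se = (_var(a) / len(a) + _var(wt_expr) / len(wt_expr)) ** 0.5
    return abs(_mean(a) - _mean(wt_expr)) / se


kept, removed = filter_outliers(list(mut_expr), toy_score)
```

Removing one sample per round keeps the comparison baseline T up to date after every exclusion; flagging all samples against the same baseline within a round would be an equally plausible reading of the text.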
With the quantitative assessment of potential outlier samples, the two RB1 samples in BRCA_Basal (Figure S7A) were identified as outliers, i.e., removing each of them led to an increase of the impact score $T$ by > 5%. As shown in the right panel of Figure S7A, we observed a prominent increase in both the cis-effect and the trans-effect after removing these two outlier samples. Notably, the outlier samples were permanently excluded from the inactivated samples. They were not recategorized as WT samples and were not included in the following analyses.
We applied the filtering strategy to all 277 TSG × cancer events. As a result, 208/277 (75.1%) events remained unchanged, 53 (19.1%) were reduced by ≤ 3 samples, and 11 events were excluded due to insufficient samples (Figures S7B and S7C). For all subsequent analyses, we used the 266 events with cleaned inactivated samples.
Pathway enrichment analysis
For each TSG × cancer event, we used the single-sample gene set enrichment analysis (ssGSEA) method implemented in the R package GSVA (Hänzelmann et al., 2013). ssGSEA calculates an enrichment score (ES) for each pathway in each sample, resulting in an ES matrix with rows representing pathways and columns representing TCGA samples. The algorithm for calculating the ES can be found in Hänzelmann et al. (2013). ssGSEA was performed for each cancer type separately. For each TSG × cancer event and each pathway, we then used the Wilcoxon rank-sum test to compare the ES values between the inactivated samples and the WT samples for the corresponding TSG. The resultant p-values were corrected for multiple testing using the stringent Bonferroni method. Significant pathways (Bonferroni p-value < 0.05) in > 16 TSG × cancer events (10% of all 161 events) were used to generate Figure 4E.
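The per-pathway comparison and correction can be sketched as follows, taking the ES matrix as given. This is an illustration rather than the original analysis: the rank-sum p-value uses the tie-corrected normal approximation without continuity correction, which may differ from the exact test for small groups, and the data layout (`es` as a pathway-to-sample dictionary) and function names are hypothetical.

```python
import math


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, assumed)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted((v, i) for i, v in enumerate(list(x) + list(y)))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # mid-rank for tied values
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w = sum(ranks[:n1])  # rank sum of the first group
    mu = n1 * (n + 1) / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var <= 0:
        return 1.0
    z = (w - mu) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2.0))


def bonferroni(pvals):
    """Bonferroni-adjusted p-values, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def significant_pathways(es, inactivated, wild_type, alpha=0.05):
    """es: dict pathway -> dict sample -> enrichment score (hypothetical layout)."""
    names = list(es)
    raw = [rank_sum_p([es[p][s] for s in inactivated],
                      [es[p][s] for s in wild_type]) for p in names]
    adjusted = bonferroni(raw)
    return {p: a for p, a in zip(names, adjusted) if a < alpha}
```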