The gene essentiality data sets developed for the SUM breast cancer cell lines by the Ethier lab, and for 50 other breast cancer cell lines by the Hahn lab at Dana-Farber, are the centerpieces of the functional genomics strategy at the heart of the SLKBase. These data sets resulted from genome-scale shRNA or CRISPR screens carried out in these cell lines. As many of you know by now, the power of these genome-scale essentiality screens is that they identify, on a cell line by cell line basis, the genes that play the most important roles in (are most essential to) the survival and proliferation of each cell line. For a detailed description of how the Ethier lab carried out the shRNA screens, please refer to the PowerPoint presentation or video presentation that can be found on the SLKBase here.
In developing the data mining tools, I wanted people to be able to make sense of the data sets that result from these screens and use them to deepen their understanding of the biology of individual breast cancer cell lines. To me, the real power of these data sets does not lie in having a list of the most essential genes for each cell line, but rather in the ability of the gene essentiality data to functionalize the other genomic data that have been obtained for these cell lines. For example, it is well known that in some types of cancer, oncogenes are activated by gene amplification and overexpression, and studying this phenomenon has led to the identification and deeper understanding of the ERBB2 (HER2) oncogene as well as many others. We have genome-wide data on copy number changes (amplifications and deletions) for every breast cancer cell line, along with gene expression data, and inspection of those data makes it clear that in any given breast cancer specimen or cell line, hundreds of genes are copy number amplified and some of these are also overexpressed. So the question arises: what can be done to filter these long lists of genes and identify the amplification/overexpression events that are most functionally relevant? As you may have guessed by now, the answer is to leverage the gene essentiality data to find the subset of genes that are amplified, overexpressed, and deemed essential because they score as hits in the functional screen. We described this in detail in our recent Oncotarget paper. To give one example from that paper, we analyzed gene amplification, expression and essentiality in our SUM-52 breast cancer cell line and found that this cell line has over 800 copy number amplified genes, many of which are overexpressed. Our shRNA screen identified approximately 800 essential genes in this cell line.
The Venn-diagram analysis of these two data sets resulted in the identification of just five functionally active oncogenes in this cell line! This analysis shows the power of orthogonal analysis of genomic data sets when gene essentiality data are used to functionalize genomic data.
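The filtering described above amounts to intersecting gene lists. Here is a minimal sketch in Python, assuming each data set has already been reduced to a set of gene symbols. The function name is hypothetical and the gene names are made-up placeholders, not the actual SUM-52 results.

```python
def functional_oncogene_candidates(amplified, overexpressed, essential):
    """Return genes that are amplified, overexpressed, AND screen hits."""
    return set(amplified) & set(overexpressed) & set(essential)


# Placeholder gene lists for illustration only (not real SUM-52 data).
amplified = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
overexpressed = {"GENE_A", "GENE_B", "GENE_E"}
essential = {"GENE_B", "GENE_C", "GENE_E", "GENE_F"}

candidates = functional_oncogene_candidates(amplified, overexpressed, essential)
print(sorted(candidates))  # only GENE_B is present in all three sets
```

With real data, each input set would typically hold hundreds of genes, and the intersection is what narrows them down to a short list worth following up.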
In later posts, I will discuss how we use the essentiality screen data in our data mining tools and how these tools can lead to new and exciting discoveries on breast cancer cell lines. In this post, I want to focus on the fact that our Knowledge Base and MySQL database actually contain two different types of essentiality data, and for some lines, both types are available. In the Ethier lab, we used an shRNA library consisting of over 81,000 shRNA vectors targeting over 15,000 well-annotated human genes to perform our screens. The Hahn lab also performed shRNA (RNAi) screens on hundreds of cell lines, some of them breast cancer lines. The Hahn lab has also performed CRISPR-based screens on many cell lines, and there is considerable overlap between the lists of cell lines screened by the two methods. So, what’s the difference between RNAi and CRISPR-based screens, and which method is better?
Whereas the goal and the basic methodology are the same for each type of screen, shRNA and CRISPR vectors operate in fundamentally different ways that must be kept in mind when interpreting the screen/essentiality data. Whenever possible, we use both types of data in our data mining tools to help with those interpretations. shRNA screens work by knocking down the expression of the target gene at the mRNA level, whereas CRISPR screens work by knocking out, at the DNA level, the target of the guide RNAs. Thus, genes that score as hits in the shRNA screens are those for which a reduction in expression has a significant effect on proliferation or survival of the cell line. By contrast, since the CRISPR screens work by knocking out the target gene, this approach identifies all genes that are essential when expression of the target gene is eliminated completely. As a result, the CRISPR approach appears to be more sensitive than the RNAi approach and identifies more essential genes, because more genes are essential when completely eliminated than when simply reduced in expression. One important difference between the data sets obtained with the two screening methods is therefore that CRISPR screens yield a larger number of so-called common essential genes than RNAi-based screens. Common essential genes are defined as those genes that would score in a screen in virtually any cell line tested, because they perform a basic function that is essential to cell proliferation or survival. If one is using screen data to identify vulnerabilities specific to individual cancer cell lines, finding that a hit is a common essential gene is not helpful to the analysis.
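To make the idea of a common essential gene concrete, here is a minimal Python sketch that flags genes scoring as hits in nearly every cell line and then removes them from one line's hit list. The 0.9 cutoff, the function names, and the gene and cell line labels are all illustrative assumptions, not the definition or the data used in the Knowledge Base.

```python
def common_essential_genes(hits_by_cell_line, min_fraction=0.9):
    """Genes that score as hits in at least min_fraction of the cell lines.

    min_fraction is an arbitrary illustrative cutoff.
    """
    n_lines = len(hits_by_cell_line)
    counts = {}
    for hits in hits_by_cell_line.values():
        for gene in set(hits):
            counts[gene] = counts.get(gene, 0) + 1
    return {g for g, c in counts.items() if c / n_lines >= min_fraction}


def line_specific_hits(hits_by_cell_line, cell_line, min_fraction=0.9):
    """Hits for one cell line with the common essential genes removed."""
    common = common_essential_genes(hits_by_cell_line, min_fraction)
    return set(hits_by_cell_line[cell_line]) - common


# Placeholder screen results (not real data).
hits = {
    "LINE_1": {"GENE_X", "GENE_A"},
    "LINE_2": {"GENE_X", "GENE_B"},
    "LINE_3": {"GENE_X", "GENE_A"},
}
print(common_essential_genes(hits))        # GENE_X is a hit in every line
print(line_specific_hits(hits, "LINE_1"))  # GENE_A remains
```

A frequency cutoff like this is only one simple way to approximate "virtually any cell line tested"; the right threshold depends on how many lines were screened and how diverse they are.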
The number of common essential genes identified by the two methods therefore differs significantly, because many genes have little impact on proliferation or survival when their expression is merely reduced, but a big impact when they are knocked out completely. To observe a dramatic example of this effect, use the KEGG Pathway Engine for the 50 breast cancer cell lines, choose the SUM-159 cell line, and then choose the Cell Cycle Pathway. The Hahn lab performed a CRISPR screen on the SUM-159 cells, whereas the Ethier lab analyzed this cell line using an shRNA library. The CRISPR screen pathway analysis shows that the vast majority of genes in the Cell Cycle pathway scored as essential. Next, go to the SUM breast cancer cell line KEGG Pathway Engine and observe how much smaller the number of genes considered essential in SUM-159 cells is for this same pathway. This makes sense given the way genome-scale knock-down or knock-out screens are performed: the cells must undergo several rounds of cell division after infection before the barcodes in the constructs are sequenced to determine the level of gene dropout, so complete loss of a gene needed for cell division will register as dropout. So, analysis of the SUM-159 data shows that, when CRISPR is used to knock out genes, essentially every gene in the Cell Cycle pathway is essential, which is not the case for the RNAi-based screen. As you work with the data mining tools in the Knowledge Base, keep these differences in mind, because you will sometimes see genes that were hits in the CRISPR screens but did not score as hits in the RNAi screen. When this is the case, it’s always important to know whether the gene that scored in the CRISPR screen is a common essential, and we are working to identify these in the tools.
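The comparison described above can be sketched in a few lines of Python. The sketch assumes you have the pathway's gene list and the hit lists from each screen as plain sets of gene symbols. The function name and all gene labels are hypothetical placeholders, and this is not how the KEGG Pathway Engine itself is implemented.

```python
def compare_screens(pathway_genes, crispr_hits, shrna_hits,
                    common_essentials=frozenset()):
    """Split a pathway's genes by which screen(s) called them essential."""
    pathway = set(pathway_genes)
    crispr = pathway & set(crispr_hits)
    shrna = pathway & set(shrna_hits)
    crispr_only = crispr - shrna
    return {
        "both": crispr & shrna,
        "crispr_only": crispr_only,
        "shrna_only": shrna - crispr,
        # CRISPR-only hits that are also common essentials deserve caution.
        "crispr_only_common_essential": crispr_only & set(common_essentials),
    }


# Placeholder inputs (not the real SUM-159 results).
pathway = {"G1", "G2", "G3", "G4", "G5"}
crispr_hits = {"G1", "G2", "G3", "G4", "OTHER"}
shrna_hits = {"G1"}
common = {"G2", "G3"}

result = compare_screens(pathway, crispr_hits, shrna_hits, common)
for category, genes in result.items():
    print(category, sorted(genes))
```

Genes that land in the last category are the ones to treat most cautiously when looking for vulnerabilities specific to a single cell line.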
In a future post, I will discuss how gene essentiality is related to genomic alterations for specific genes important in breast cancer. I will also describe the relationship between gene essentiality and targeted drug sensitivity for these same genes. In discussing these specific breast cancer genes, I will show the correlation coefficients between the screen results obtained with the two methods. As you will see, in some cases the correlations are high, approaching 0.9, while for other genes the correlations are lower, and for still other genes, the correlations are actually quite poor. The upshot of this finding is that for some genes, either screening method works well and yields similar results, while for other genes, one method outperforms the other. The reasons for these similarities or differences will be discussed in detail for each of the genes that we examine, and this too can be illuminating. So, given all of that, which screening method is better? The answer, as you will see, is that it depends.