Background. Hepatocellular carcinoma (HCC) is one of the most common and lethal cancers worldwide. Therefore, it is necessary to develop and validate a novel prognostic model for HCC patients.
Objectives. To establish an innovative and valuable prediction model of long non-coding RNAs (lncRNAs) for HCC.
Materials and methods. Transcriptome and clinical data from The Cancer Genome Atlas (TCGA) were analyzed globally using bioinformatic approaches. We used Cox and least absolute shrinkage and selection operator (LASSO) regression analyses to screen for prognostic lncRNAs, while receiver operating characteristic (ROC) and Kaplan–Meier curve analyses were used to evaluate the effectiveness of the models. Clinical data from our center were used as a validation set.
Results. In the training set, a prediction model was established based on the expression of AP000844.2, LINC00942, SRGAP3-AS2, and AC010280.2 Hepatocellular carcinoma patients were divided into 2 groups (high-risk group and low-risk group) according to their risk score, and differences in survival were compared between the groups. The clinical data from our center served as a validation set to re-evaluate the effectiveness of the predictive model. The model had an excellent performance. The area under the curve (AUC) of 3-year survival was 0.771, while for 5-year survival it was 0.741, and the concordance index (C-index) was 0.756 (standard error (SE) = 0.023, 95% confidence interval (95% CI) = 0.620–0.891).
Conclusions. The 4-lncRNA combination model is critically important in evaluating the prognosis of HCC. It is an effective independent prognostic factor, although prospective, multi-center studies are needed to validate our findings.
Key words: prognosis, survival analysis, hepatocellular carcinoma, least absolute shrinkage and selection operator
Hepatocellular carcinoma (HCC) is one of the most frequent malignancies globally and ranks 3rd for morbidity and mortality rates, respectively.1 Although advances have been made in multidisciplinary treatment and targeted medicine, the 5-year survival rate of HCC patients remains relatively low due to the poor performance of diagnostic techniques and lack of comprehensive treatment.2 Hepatocellular carcinoma diagnosis and treatment still pose a considerable challenge. Patients undergoing surgical treatment miss the opportunity for intervention due to the atypical recurrence of symptoms if there is no active postoperative follow-up.3, 4 Therefore, a reasonable estimation of the patient’s survival time and paying attention to the time window for possible recurrence are vital to improve the prognosis. Thus, it is critical to deepen our understanding of the pathogenesis of HCC development and to seek novel biological markers or therapeutic targets.5
Recently, it was detected that the abnormal expression of long non-coding RNAs (lncRNAs) correlate with the occurrence and development of tumors and other diseases.6, 7 They are valuable in prognostic evaluation, early diagnosis and clinical treatment of many malignant tumors.8 More notably, lncRNAs have a critical impact on abnormal cellular regulation and tumorigenesis. At present, in the post-genomic era, lncRNAs are also a vital research hotspot. They regulate genes and play critical biological functions in cell development and multiple levels, such as transcription, post-transcription, translation, and post-translational modification.9 Therefore, it is crucial to characterize the structure and interaction of lncRNA for understanding its cellular mechanism of action.10 Some studies have found that lncRNAs are closely related to tumors, and their expression and regulation are found to be abnormal in breast cancer, colon cancer and stomach cancer.11 Studies on the functions of lncRNAs have shown that the abnormal expression of lncRNAs may be used for the prognosis and treatment of HCC.12 The lncRNA UPK1A-AS1 could work as a scaffold to reinforce the binding of EZH2 and SUZ12 in order to induce the chemoresistance of HCC cells.13 Furthermore, lncRNA LL22NC03-N14H11.1 promotes mitochondrial fission and induces the epithelial–mesenchymal transition of HCC through the MAPK pathway.14 These data indicate that the aberrant expression of lncRNAs may directly or indirectly adjust the malignant phenotype of HCC and affect its prognosis. In addition, as it is commonly known, genes are highly interactive. Therefore, it is necessary to select appropriate lncRNAs and establish multiple lncRNA prediction models. Such actions could play an essential role in assisting the evaluation of HCC prognosis.
Transcriptome data from the Cancer Genome Atlas (TCGA) and clinical data from ShengJing Hospital of China Medical University were analyzed. We have developed and validated an effective multi-lncRNA combined survival prediction model using Cox regression and least absolute shrinkage and selection operator (LASSO) regression15; this model might be helpful in the prognosis of HCC patients.
This study aims to establish a prediction model for HCC based on differentially expressed lncRNAs (DELs) and evaluate its effectiveness based on clinical data.
Materials and methods
The raw RNA sequencing data were downloaded from TCGA (https://portal.gdc.cancer.gov/) through the RTCGA Toolbox package (TCGA-Liver Hepatocellular Carcinoma (LIHC)) and matched survival data were obtained the same way. The LIHC data were used as the training set to build a predictive model, as described below. The training set consisted of 424 samples, including 374 cancer tissue samples and 50 normal tissue samples. The transcriptome data were captured using the Illumina HiSeq RNA-Seq platform (https://portal.gdc.cancer.gov/projects/TCGA-LIHC).
We used the edgeR package (https://bioconductor.org/packages/release/bioc/html/edgeR.html) to obtain DELs based on the RNA expression profile data. First, we homogenized the expression of each lncRNA in each sample. Then, we compared the expression value of each lncRNA between the cancer tissue group and the normal tissue group, and the multiple of difference was expressed as fold change (FC) value, with p < 0.05 considered statistically different. The value of p < 0.05 and |FC| ≥ 4 (|log2FC| ≥ 2) were determined as the cut-off values. Any lncRNA that met the above 2 conditions was classified as DEL. Then, the “survival” package was used to perform a univariate proportional hazards model (Cox) regression analysis. The meaningful DELs in univariate Cox regression analysis were enrolled to construct the LASSO (“glmnet” package) regression, and the “survminer” package was adopted for visualization. The effectiveness of lncRNA combined prediction model was evaluated in terms of receiver operating characteristic (ROC) and concordance index (C-index). Subjects were divided into the high-expression group and the low-expression group based on the prediction model, and the Kaplan–Meier survival curve described the clinical prognostic significance of the model. Subsequently, clinical data from our center were used as a validation set to evaluate the effectiveness of the model (the workflow process is described in Figure 1A).
Ethical statement and tissue samples
We obtained 100 tumor samples and matched non-tumor tissue samples from HCC patients undergoing surgical resection at the ShengJing Hospital of China Medical University in 2010–2015. All specimens were pathologically confirmed as HCC. The Ethics Committee of ShengJing Hospital of China Medical University approved this study (approval No. 20191215) and all patients signed informed consent prior to surgery. The follow-up deadline was January 31, 2020. These 100 patients were used as a validation set to evaluate the effectiveness of the model.
The human HCC cell lines (Huh7 and Hep3B) and a hepatocellular cell line (THLE-3) were acquired from China Medical University (Shenyang, China). Cell culture was performed using RPMI-1640 medium (Gibco, Carlsbad, USA) containing 10% fetal bovine serum (FBS) at 37°C with 5% CO2.
Reverse transcription polymerase chain reaction (RT-PCR)
The TRIzol extraction kits (Invitrogen, Waltham, USA) were adopted for total RNA extraction, and optical density (OD) values were obtained at 260–280 nm using an ultraviolet spectrophotometer. The RNA was used for subsequent quantitative polymerase chain reaction (qPCR) quantification if its OD260/OD280 ratio was >1.8. Next, the reverse transcription of RNA into cDNA was performed using PrimeScript RT kits, with a system of 10 μL, according to the manufacturer’s instruction. The reaction conditions were 25°C for 30 min, 45°C for 30 min and 85°C for 5 min. Using cDNA as a template, quantitative fluorescence PCR was carried out with 2×TaqMan™ Universal PCR Master Mix (Thermo Fisher, Waltham, USA) under the reaction conditions of 95°C for 3 min, cycling 5 times (94°C for 20 s, 63°C for 30 s, 72°C for 30 s) and cycling 40 times (95°C for 15 s, 60°C for 30 s), with U6 being an internal reference. Three wells and negative controls without a template were set up for all reactions. The quantitative analysis was carried out using the 2−ΔΔCt method. All primers were purchased from Sangon Biotech (Shanghai, China).
The IBM SPSS v. 21.0 statistical software (IBM Corp., Armonk, USA) and GraphPad Prism v. 8.0 (GraphPad, San Diego, USA) were used for data processing. Each experiment was repeated 3 times. Measurement data were expressed as mean ± standard deviation (SD). The Shapiro–Wilk test was used to check data normality. The expression of lncRNAs in cells conformed to a normal distribution, so an independent t-test was used to compare the expression of lncRNAs between cells. The expression of lncRNAs in tissues did not conform to a normal distribution, so the Mann–Whitney U test was performed to compare the expression of lncRNAs between tumor tissues and non-tumor tissues. Table 1 presents the results of the Shapiro–Wilk test. The Kaplan–Meier method and a log-rank test were used to measure overall survival (OS). Multivariate models of prognostic factors were carried out using Cox regression. The LASSO regression analysis was performed to reduce overfitting caused by univariate Cox regression. Getting the corresponding number of variables by the minimum lambda value of p < 0.05 was considered a statistically significant difference.
lncRNAs have differential expression in HCC
The RNA transcriptome data were obtained from TCGA, including 374 HCC tissues and 50 normal tissues (TCGA-LIHC) as the training set. Then, treating the p < 0.05 and |log2FC| ≥ 2 as the cut-off values, the edgeR package was performed to distinguish the DELs. A volcano plot of the distribution of DELs was drawn with the log values of FC and false discovery rate (FDR) as the horizontal and vertical axes, respectively. A total of 1212 upregulated (in red) and 80 downregulated (in green) DELs were recognized (Figure 1B). The DELs are listed in Supplementary Table 1 (available at: https://doi.org/10.5281/zenodo.6794063). Moreover, the top 50 DELs are displayed in a heatmap (Figure 1C).
A prediction model based
on the co-expression of 4-lncRNAs
It was necessary to evaluate the clinical significance of DELs in HCC. First, univariate Cox regression showed that a total of 141 DELs contributed to the survival of HCC (Supplementary Table 2 (available at: https://doi.org/10.5281/zenodo.6794063), p < 0.05). Subsequently, we extracted the expression data of DELs in HCC patients (n = 141) and obtained the corresponding clinical data. The LASSO regression analysis was performed to further evaluate these data; this analysis could reduce overfitting caused by univariate Cox regression. With a continuous lambda increase, the absolute value of the regression coefficient was correspondingly compressed, and some relatively unimportant variables were compressed to 0. This allowed an expression curve between regression coefficients and lambda values to be obtained (Figure 2A). When the lambda value reached a specific size, increasing the number of model-independent variables and reducing the lambda value could not significantly improve the model performance. Therefore, we obtained the smallest lambda value using LASSO regression and got the corresponding number of variables. The analysis showed that 16 out of the 141 DELs may be associated with HCC prognosis (Figure 2B). Furthermore, a multivariate regression analysis was performed on these 16 DELs, and we discovered that only 4 DELs might be independent risk factors for HCC prognosis (Figure 2C and Figure 2D). We adopted these 4 DELs for modeling and assigned scores according to their respective weights in the multi-Cox analysis.
Then, each HCC sample got a risk score based on the expression level of the 4-DEL combination model. A cumulative distribution function (CDF) map was built based on the risk score value of each sample. With the risk score value = 1 as the cut-off value (log2 risk score = 0), we divided the samples into the high-risk (red) group and the low-risk (green) group (Figure 3A). To assess the relationship between the risk score and patient survival, a scatter plot was drawn, with survival time measured in years (Figure 3B; red: deceased, green: alive). With the increased risk score, the number of surviving patients decreased gradually and the number of deceased patients increased. We used a heatmap to demonstrate the score of 4 DELs in each sample (Figure 3C).
Then, we evaluated the effectiveness of the model in predicting HCC prognosis. Traditionally, both ROC and C-index have been important indices for a prediction model. We applied these 2 indicators to evaluate the predictive ability of the 4-DEL prediction model. The area under the curve (AUC) of 3-year and 5-year survival were 0.771 and 0.741, respectively (Figure 3D). In addition, the C-index was 0.756 (standard error (SE) = 0.023, 95% confidence interval (95% CI) = 0.620–0.891). As both indices were greater than 0.7, it suggested that this model was predictive for the prognosis of patients with HCC. The patients were divided into the high-risk group and the low-risk group, according to the risk score. The 5-year survival rate was significantly reduced in the high-risk group (Figure 3E, p = 0.0001). Based on the above analysis, we obtained the 4-DEL combined prediction model for HCC: “Risk score = 1.14 × AP000844.2 + 1.12 × LINC00942 + 1.20 × SRGAP3-AS2 – 0.84 × AC010280.2”, and prepared for further model validation (cut-off value = 0.945).
The expression of AP000844.2, LINC00942, SRGAP3-AS2, and AC010280.2 in HCC
We obtained a predictive model based on the molecular expression of 4 lncRNAs, namely AP000844.2, LINC00942, SRGAP3-AS2, and AC010280.2 using a bioinformatics analysis with a public database as the training set. To further demonstrate the authenticity and validity of the analysis, we confirmed the expression of these 4 molecules in HCC. The human HCC cell lines (Huh7 and Hep3B) and the hepatocellular cell line (THLE-3) were used to check the expression at the cell level using RT-PCR. We found that AP000844.2, LINC00942 and SRGAP3-AS2 were overexpressed in HCC cells, as compared with hepatocellular cells (Figure 4A–C, p < 0.05). On the contrary, AC010280.2 was weakly expressed in HCC cells but it was upregulated in THLE-3 (Figure 4D, p < 0.05). The RT-PCR results were consistent with the data obtained in the previous bioinformatics analysis. Subsequently, we re-evaluated the expression of the 4 lncRNAs in HCC tissues and matched normal controls in our center (n = 100). As expected, AP000844.2, LINC00942 and SRGAP3-AS2 were all overexpressed in HCC tissues when compared to controls. The AC010280.2 was weakly expressed in HCC tissues but overexpressed in normal hepatic tissues (Figure 4E–H, p < 0.05). Consistent results were also obtained for clinical samples and cell samples, which further corroborated the results of our bioinformatics analysis. It is worth emphasizing that LINC00942 showed the most significant difference at both cell and tissue levels.
Verification of the prognostic effect of the 4-lncRNA combination model
The disease-free survival (DFS) ranged from 5 to 90 months, and the OS ranged from 8 to 90 months. We found that 81 out of 100 patients died before the end of the follow-up in the validation set. The expressions of AP000844.2, LINC00942, SRGAP3-AS2, and AC010280.2 were tested in 100 tissue samples (Figure 4E–H). Then, the correlation between the risk score and survival was assessed. The tumor number, tumor stage, vascular invasion, capsule, distant metastasis, and prediction model all contributed to the poor DFS and OS (Table 2, p < 0.05). The high-risk group in the 4-lncRNA combined prediction model demonstrated an unequivocally poor prognosis as evidenced by DFS (25.18 compared to 52.85, p < 0.01, Figure 5A) and OS (31.42 compared to 57.31, p < 0.01, Figure 5B) in the validation set. The tumor stage and prediction model were independent prognostic risk factors for HCC (Table 3, p < 0.05).
Hepatocellular carcinoma has high morbidity and mortality rates.16 Individual treatment and precision medicine are important ways of improving the prognosis. Through the use of genomics, proteomics and transcriptomics, and further development of related technologies, molecular stratification theory has become a powerful tool for the in-depth understanding of tumors. It brings oncology from a discipline that simply describes macro-information, such as size and quantity, to a more in-depth molecular analysis. Furthermore, the discovery of molecular therapies and prognostic biomarkers could bring hope to HCC treatment. Hence, it is necessary to explore novel therapeutic strategies and corresponding molecular targets. The lncRNAs, which have a limited protein-coding ability, play critical roles in cancer progression and metastasis.17 Clinical prediction tools based on lncRNAs have been rapidly developing, including diagnosis, prognostic biomarkers and potential therapeutic targets.8 The lncRNA UPK1A-AS1 promoted HCC development and indicated poor prognosis.13 Furthermore, lncRNA CASC9 is a potential diagnostic and prognostic biomarker for HCC.18 However, these studies only assessed the prognostic value of a single biomarker. Since genes are interactive, a single biomarker may not be enough to accurately predict the prognosis of HCC.
In this study, transcriptome and survival data of patients with HCC were obtained from TCGA and used as a training set. In order to improve the predictive accuracy of the regression model, Cox and LASSO regressions were used to evaluate the correlation between the expression of lncRNAs and survival of HCC patients. Finally, a 4-lncRNA combined prediction model was obtained, using lncRNAs AP000844.2, LINC00942, SRGAP3-AS2, and AC010280.2. Clinical data from patients in our hospital were used as a validation set. The AC010280.2 and LINC00942 are long intergenic non-coding RNAs. It has been previously reported that LINC00942 could act as an oncogene that promoted METTL14-mediated m6A methylation in breast cancer.19 The LINC00942 gene had been recorded to be missing in an autism spectrum disorder patient.20 It had been found that AC010280.2 could participate in establishing the HCC prognosis model.21 The AP000844.2 and SRGAP3-AS2 are antisense lncRNAs, and AP000844.2 might be the component of prostate cancer22 and hepatitis virus-positive HCC23 prognostic models. The SRGAP3-AS2 is also expressed in lung adenocarcinoma24 and could serve as a potential predictive biomarker for that disease.
Both the AUC and C-index demonstrated that our model has good predictive efficacy. The validation set also confirmed the prognostic validity of the model. Although some studies have used LASSO regression to analyze the prognosis of HCC data and obtained a prediction model, the predictive efficacy of this model has not been verified by a validation set.21 For the first time, we established a prediction model of HCC-related lncRNAs based on the LASSO analysis and verified its clinical efficacy. The LASSO analysis is a classic regression analysis method related to statistics and machine learning25, 26 aimed at improving the prediction accuracy and interpretability through variable selection and regularization compared with other regression methods. The evaluation process of LASSO regression includes the relation with ridge regression, the selection of optimal subset, and the relation between LASSO coefficient estimation and soft threshold.27, 28 The validation set confirmed that the 4-lncRNA combined prediction model was an independent risk factor for HCC prognosis.
This study has 3 limitations. First, it is limited to a single center, and it is still necessary to expand the sample size in order to verify our results in multiple centers. Second, our follow-up research goal was to integrate our model with clinical data to form a comprehensive quantitative index. Finally, the molecular mechanism of each lncRNA needs to be further explored in subsequent experiments.
Through the analysis of public databases and verification of clinical data from our center, we have obtained a 4-lncRNA combined prediction model. The model could effectively evaluate the prognosis of HCC patients. Currently, the TNM staging system is still the most important indicator for evaluating tumor malignancy and prognosis. However, the role of the molecular signature of tumors cannot be ignored. For example, Ki-67 indicates proliferation and Her-2 indicates the degree of malignancy. Microsatellite instability (MSI) and tumor mutational burden (TMB) are also beacons for treatment options. We believe that a further improvement of molecular stratification and the application of prognostic markers can provide valuable information for tumor treatment. In addition, even though the expression of molecules still needs to be obtained from tissue samples, if, with the advancement of liquid biopsy technology, we can detect these molecules in the body fluid, their expression changes may indicate tumor recurrence and metastasis.