Survival comparison statistical methodology

Data Selection

Two databases were considered for this study: the National Cancer Institute’s Surveillance, Epidemiology, and End Results (SEER) Program database and the National Cancer Database (NCDB).

The SEER database is an authoritative data set created for use as an epidemiological tool to monitor the incidence and mortality of cancer in the United States. SEER collects patient demographics, tumor characteristics, and survival data from 17 regional registries throughout the United States, representing 28 percent of the U.S. population.

The NCDB compiles cancer registry data from cancer programs in the United States and Puerto Rico, capturing approximately 75 percent of newly diagnosed cancers in these areas. It includes data on patient characteristics, tumor staging, tumor histology, type of first treatment, disease recurrence and survival using standardized coding definitions. It is commonly used to guide quality improvement and pursue investigator-initiated research questions. The NCDB provides insight into analytic cancer diagnoses and primary treatment. The main limitation of the data is that the cohorts are not population-based; they are identified from the hospitals at which the patients presented for diagnosis and/or treatment.

The SEER database was selected to conduct this analysis because of its comprehensive content and access to patient-level data (and because of restrictions imposed on the use of the NCDB database for comparative analysis and external reporting purposes).

The SEER comparison sample was chosen by matching the categories of categorical factors (e.g., cancer stages) with those of the Cancer Treatment Centers of America® (CTCA) cancer cohort and by selecting the overlapping ranges of continuous factors (e.g., age at diagnosis) with the CTCA® cancer cohort. These factors affect survival outcomes. The latest SEER Limited-Use Database (2016) was used to select the SEER comparison sample. The final survival analyses included only patients from the CTCA and SEER databases for whom the following cancer characteristics were available in both databases: SEER Summary Stages, primary tumor sites, cancer histologic types, gender and age at initial diagnosis. For example, if a specific SEER Summary Stage had patients in only one database, none of these patients were used in the analysis. To match the age at initial diagnosis, the range (i.e., minimum and maximum ages) was computed for each sample. Only patients whose age at initial diagnosis fell into the overlap of the two ranges from the CTCA and SEER samples were included in the comparative survival analyses.
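The matching described above can be illustrated with a short Python sketch. It is not the code used for the study. The record layout (Patient and its field names) is hypothetical, and two details the text leaves open are treated as assumptions: each categorical factor is matched separately, and the age ranges are computed after the categorical matching.

```python
from dataclasses import dataclass

# Hypothetical record layout; the field names are assumptions for illustration.
@dataclass(frozen=True)
class Patient:
    stage: str       # SEER Summary Stage
    site: str        # primary tumor site
    histology: str   # histologic type
    gender: str
    age: int         # age at initial diagnosis

CATEGORICAL_FACTORS = ("stage", "site", "histology", "gender")

def match_samples(ctca, seer):
    """Keep patients whose categories occur in both samples and whose
    age lies in the overlap of the two samples' age ranges."""
    # For each categorical factor, the values present in both samples.
    shared = {
        f: {getattr(p, f) for p in ctca} & {getattr(p, f) for p in seer}
        for f in CATEGORICAL_FACTORS
    }

    def in_shared(p):
        return all(getattr(p, f) in shared[f] for f in CATEGORICAL_FACTORS)

    ctca = [p for p in ctca if in_shared(p)]
    seer = [p for p in seer if in_shared(p)]
    if not ctca or not seer:
        return [], []

    # Overlap of the two age ranges (assumption: computed after the
    # categorical matching above).
    low = max(min(p.age for p in ctca), min(p.age for p in seer))
    high = min(max(p.age for p in ctca), max(p.age for p in seer))
    ctca = [p for p in ctca if low <= p.age <= high]
    seer = [p for p in seer if low <= p.age <= high]
    return ctca, seer
```

In this sketch a stage that appears in only one sample removes all patients with that stage, mirroring the example given in the text.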


For both the CTCA and SEER samples, only cancer patients whose initial diagnosis occurred between 2000 and 2015 were analyzed. Cancer cases with missing information on either the date of initial diagnosis or date of last contact were deleted from the CTCA database because the survival time or censoring time for such patients could not be computed. Cancer patients with missing SEER Summary Stages were also excluded from the analyses. For patients with multiple cancers in the SEER and CTCA databases, only the first or primary cancer diagnosed was used for the survival comparisons. Patients with a histologic code (ICD-O-3) between 9590 and 9989 were excluded from the analyses because these histologic types are generally not included by SEER for any nonhematopoietic cancer types. Patients who did not receive treatment from CTCA were also excluded from the analyses.
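The exclusion rules in this paragraph amount to a record-level filter. The following Python sketch shows one way to express them. The dictionary keys (source, diagnosis_date, last_contact_date, summary_stage, is_first_primary, histology_icdo3, treated_at_ctca) are hypothetical names chosen for illustration and do not come from either database.

```python
from datetime import date

def is_eligible(rec):
    """Return True if a record passes the exclusion rules described above.
    All dictionary keys are hypothetical field names."""
    dx = rec.get("diagnosis_date")
    last = rec.get("last_contact_date")
    # Survival or censoring time cannot be computed without both dates.
    if dx is None or last is None:
        return False
    # Initial diagnosis must fall between 2000 and 2015.
    if not 2000 <= dx.year <= 2015:
        return False
    # Missing SEER Summary Stage is excluded.
    if rec.get("summary_stage") is None:
        return False
    # Only the first or primary cancer is used.
    if not rec.get("is_first_primary", False):
        return False
    # ICD-O-3 histology codes 9590 to 9989 are excluded.
    if 9590 <= rec["histology_icdo3"] <= 9989:
        return False
    # CTCA records must have received treatment from CTCA.
    if rec.get("source") == "CTCA" and not rec.get("treated_at_ctca", False):
        return False
    return True

example = {
    "source": "CTCA",
    "diagnosis_date": date(2010, 5, 1),
    "last_contact_date": date(2014, 1, 1),
    "summary_stage": "regional",
    "is_first_primary": True,
    "histology_icdo3": 8140,
    "treated_at_ctca": True,
}
print(is_eligible(example))
```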

The survival outcomes from the SEER database were provided by the SEER Limited-Use Data File as the number of completed months. These numbers were then converted to years by dividing the total number of months by 12. Although the exact dates of initial diagnosis and death were available in the CTCA database, the CTCA survival outcomes were computed using the same methodology as the SEER database: the exact number of days from the initial diagnosis to death (or to last contact, for those who remained alive) was first converted to months using a 365.24-day year (as was done by SEER), then rounded down to the number of completed months, and finally divided by 12. For patients who were still alive or lost to follow-up at the time of entering the databases, survival time was treated as statistically censored at the difference between the date of last contact and the date of initial diagnosis.1
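A small Python sketch of this conversion follows. It rests on one interpretation of the procedure: days are turned into months by dividing by 365.24/12, the result is rounded down to completed months, and the months are divided by 12. The function name and signature are hypothetical.

```python
import math
from datetime import date

DAYS_PER_MONTH = 365.24 / 12  # average month length under a 365.24-day year

def survival_years(diagnosis, end, died):
    """Return (time_in_years, event) for one patient.

    `end` is the date of death, or the date of last contact for patients
    who are alive or lost to follow-up; `died` is True only for deaths,
    so censored patients get event = 0.
    """
    days = (end - diagnosis).days
    completed_months = math.floor(days / DAYS_PER_MONTH)
    return completed_months / 12, int(died)

# A patient diagnosed on 1 January 2012 and last seen alive on 1 January 2013.
print(survival_years(date(2012, 1, 1), date(2013, 1, 1), died=False))
```

Because of the rounding down, a follow-up of 365 days counts as 11 completed months under this interpretation, since 365 is slightly less than 12 average months.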

The survival curve for each cancer type (defined as the probability of a cancer patient’s survival as a function of time from the initial diagnosis) was estimated by the Kaplan-Meier nonparametric product-limit estimator.1 Three statistical tests were then used to compare the survival curves between the CTCA database and the SEER database.
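For readers unfamiliar with the product-limit estimator, here is a minimal pure-Python sketch. It is not the SAS procedure used in the study. The function name is hypothetical, and events are coded as 1 for death and 0 for censoring.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimate.

    times  : survival or censoring times
    events : 1 if the patient died at that time, 0 if censored
    Returns a list of (time, estimated survival probability) pairs,
    one per distinct time at which at least one death occurred.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients who share the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            # Multiply by the conditional probability of surviving past t.
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        # Deaths and censored patients both leave the risk set after t.
        at_risk -= leaving
    return curve

print(kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0]))
```

Patients censored at a given time are counted as at risk at that time and removed afterward, which is the usual convention for tied death and censoring times.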

Two of these tests, the log rank test and the Wilcoxon test, are nonparametric and thus valid for comparing survival curves of any shape.1 These tests differ, however, in their sensitivity (or power) to detect survival differences. The log rank test is generally the most sensitive or powerful when the hazards of death in the CTCA and SEER samples are approximately proportional, whereas the Wilcoxon test tends to be more sensitive when the ratio of the hazards of death is higher at earlier times than at later ones. The third test, the likelihood ratio test, is the most restrictive of the three in the sense that it is appropriate only for special survival curves (called exponential distributions) whose hazards of death are constant across time.2
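The difference between the two nonparametric tests comes down to how each death time is weighted. The Python sketch below computes a two-sample weighted log rank statistic with weight 1 (log rank) or weight equal to the number of patients at risk (Gehan's generalization of the Wilcoxon test, assumed here to be the variant meant). It is an illustration, not the SAS implementation, and the function name is hypothetical. The likelihood ratio test is not covered.

```python
import math

def weighted_logrank(times1, events1, times2, events2, weight="logrank"):
    """Two-sample test comparing survival curves.

    weight="logrank"  : every death time has weight 1
    weight="wilcoxon" : each death time is weighted by the number at risk,
                        which emphasizes early differences
    Returns (chi-square statistic with 1 degree of freedom, p-value).
    """
    pooled = [(t, e, 0) for t, e in zip(times1, events1)]
    pooled += [(t, e, 1) for t, e in zip(times2, events2)]
    death_times = sorted({t for t, e, _ in pooled if e})

    u = 0.0  # weighted sum of observed minus expected deaths in group 1
    v = 0.0  # variance of u under the null hypothesis
    for t in death_times:
        n1 = sum(1 for s, _, g in pooled if s >= t and g == 0)
        n2 = sum(1 for s, _, g in pooled if s >= t and g == 1)
        d1 = sum(1 for s, e, g in pooled if s == t and e and g == 0)
        d2 = sum(1 for s, e, g in pooled if s == t and e and g == 1)
        n, d = n1 + n2, d1 + d2
        w = n if weight == "wilcoxon" else 1
        u += w * (d1 - n1 * d / n)
        if n > 1:
            v += w * w * n1 * n2 * d * (n - d) / (n * n * (n - 1))

    stat = u * u / v if v > 0 else 0.0
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

group_a = ([1, 3], [1, 1])
group_b = ([2, 4], [1, 1])
print(weighted_logrank(*group_a, *group_b, weight="logrank"))
print(weighted_logrank(*group_a, *group_b, weight="wilcoxon"))
```

Because the number at risk shrinks over time, the Wilcoxon weights give early deaths more influence, which is why that test is more sensitive to early differences in hazard.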

Ninety-five percent confidence interval (95 percent CI) estimates for the individual survival rates, as well as the difference in survival rates between the CTCA and SEER samples at specific time points after diagnosis, were based on the estimated survival curves and the relevant asymptotic normal distributions. All these analyses were implemented using the standard SAS package of statistical tests (i.e., SAS/PROC LIFETEST).3 Adjusted analyses were also done (results not shown) using the stratified log rank test and the Wilcoxon test as well as Cox’s proportional hazards models to compare the survival outcomes between the CTCA and SEER samples after adjusting for the effects of age at diagnosis, gender (except for breast and prostate cancers), race, marital status at diagnosis, insurance status at diagnosis and year of initial diagnosis. The technical details of these statistical analyses are available from CTCA.
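The confidence intervals for a difference in survival rates can be sketched in Python as follows. The sketch assumes Greenwood's formula for the variance of the Kaplan-Meier estimate and a plain (untransformed) normal approximation. The text specifies only "asymptotic normal distributions", so these details are assumptions, and the function names are hypothetical.

```python
import math

Z_95 = 1.959964  # 97.5th percentile of the standard normal distribution

def km_at(times, events, t_star):
    """Kaplan-Meier survival estimate at t_star and its Greenwood variance."""
    surv = 1.0
    greenwood_sum = 0.0
    death_times = sorted({t for t, e in zip(times, events) if e and t <= t_star})
    for t in death_times:
        n = sum(1 for s in times if s >= t)                       # at risk
        d = sum(1 for s, e in zip(times, events) if s == t and e)  # deaths
        surv *= 1 - d / n
        if n > d:
            greenwood_sum += d / (n * (n - d))
    return surv, surv * surv * greenwood_sum

def difference_ci(s1, var1, s2, var2, z=Z_95):
    """Normal-approximation confidence interval for s1 - s2,
    treating the two samples as independent."""
    diff = s1 - s2
    half_width = z * math.sqrt(var1 + var2)
    return diff - half_width, diff + half_width

s_a, var_a = km_at([1, 2, 2, 3, 4], [1, 1, 0, 1, 0], t_star=2)
s_b, var_b = km_at([1, 2, 3, 4, 5], [0, 1, 0, 1, 0], t_star=2)
print(difference_ci(s_a, var_a, s_b, var_b))
```

An interval that excludes zero would indicate a difference in survival rates at that time point at the 5 percent level, subject to the limitations discussed in this document.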


Direct statistical comparisons of survival outcomes between groups of cancer patients have limitations because of the possible confounding effects of other factors cited throughout this website. Accordingly, the data should be considered directional, not definitive.

First, although a large sample of patients was available from the SEER Program across many geographic regions in the U.S., both samples, including the sample from CTCA, are convenience samples. This precludes a causal interpretation of the statistical inferences. Second, although some types of matching, as described above, were implemented to select the appropriate SEER and CTCA comparison samples, the distributions of important covariates such as age at initial diagnosis, gender, race, marital status at diagnosis, insurance status at diagnosis and year of initial diagnosis were not exactly the same in the CTCA and SEER samples. Hence, even with the adjusted analyses, confounding by these factors cannot be ruled out. Further, many factors (e.g., household income and mobility) other than those considered in the analyses and available from the databases may have contributed to the actual survival outcomes, so confounding by these unmeasured factors cannot be ruled out either. Finally, the survival analyses were based on statistical comparisons of the rate of death from all possible causes, not solely cancer-specific death, because cause-of-death data are not included in the CTCA data set and are therefore not available for statistical comparison.

Visit our cancer treatment statistics and results page for more information about the methodology used to calculate the CTCA results and read about the analysis limitations.


1 Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. New York: John Wiley, 1980.

2 Lawless JF. Statistical Models and Methods for Lifetime Data. New York: John Wiley & Sons, Inc., 1982.

3 SAS Institute Inc., SAS/STAT User’s Guide, Volume 2, Version 6, 1990. Cary, NC, USA.