The code below aims to mimic some of the outputs of (Henson et al., 2019), namely, to calculate standardised mortality ratios comparing suicide deaths among individuals diagnosed with cancer with the general population.
This code mimics but does not fully replicate the code made available by the NHS England for the study because this code was written without access to the underlying datasets. Instead, this code was written using the Simulacrum, a synthetic dataset which imitates some of the data used in the original study. It is important to note that the Simulacrum is missing several key variables which this code aims to reproducibly create but will not produce valid results. Consequently, results are not intended to be interpreted and may not make sense given prior knowledge. For example, patient ages were drawn from a random distribution as age/date of birth variables are not available in Simulacrum. To understand what the authors in (Henson et al., 2019) did more fully, I would recommend consulting their annotated code.
Calculate suicide mortality rates by age and sex among patients diagnosed with cancer (likely using cancer or disease registry data)
Calculate suicide mortality rates by age and sex among the general population (likely using civil mortality registers)
Calculate standardised mortality ratios and absolute excess risks using these mortality rates
For an introduction to direct and indirect standardisation methods, you can consult (Naing, 2000).
Many thanks to Chloe Bright who provided me with the original Stata code for this analysis and to Alexandra Pitman for supporting this work.
1 Set up
In the first instance, you’ll need access to data on suicide deaths among patients with cancer and suicide deaths among the general population. We will be standardising by sex and age group so we will need to first prepare the data accordingly. In the R code below, we use functions from the Tidyverse but you can just as easily do this in base R or using different functions.
First, we process data on suicide deaths among patients with cancer, including recoding cancer groups and underlying causes of death (presumably using ICD codes). As well, we need to note the diagnosis date (which serves as the start date) and the date of exit from the cohort (i.e. exit due to death, loss to follow-up, or end of follow-up). In the Simulacrum dataset, this is relatively straightforward because we are using the date of diagnosis up to the date of vital status but other datasets will need to note other potential exit causes (e.g. loss to follow-up). Once we have recoded and cleaned the data among cancer patients, we can set it as survival data to be able to derive the number of person-years observed.
Next, we need to process data on suicide deaths in the general population. This allows us to calculate the expected number of suicide deaths based on population rates. You are likely to do this using civil mortality registers or other information provided by national statistics offices about suicide deaths in the general population, standardised by age and sex.
R Code
# Load required librarieslibrary(arrow)library(beepr)library(bibtex)library(blob)library(data.table)library(dplyr)library(dtplyr)library(feather)library(finalfit)library(forcats)library(foreign)library(ggplot2)library(glue)library(haven)library(here)library(Hmisc)library(hms)library(janitor)library(kableExtra)library(lubridate)library(magrittr)library(multidplyr)library(parallel)library(purrr)library(reticulate)library(readr)library(readxl)library(renv)library(stringi)library(stringr)library(survival)library(tibble)library(tidyr)library(xlsx)# Set seed for reproducibility## This will ensure randomly generated variables are reproducible set.seed(1223)# Set clusters## Because data used in these analyses can be quite large, I am making use of a number of packages which help, including packages to parallelize functions where possible and to load data in columnar format ## This function will set clusters to n-1 where n is the number of cores on your machinecluster <-new_cluster(parallel::detectCores() -1)# Load data files in Arrow## Arrow is used here to help load in very large data files and to process them## More detail is available at https://arrow.apache.org/sim_av_patient <-open_csv_dataset(here("simulacrum_v2.1.0", "Data", "sim_av_patient.csv"))sim_av_tumour <-open_csv_dataset(here("simulacrum_v2.1.0", "Data", "sim_av_tumour.csv"))# Data extraction from source## We define the exclusions here and use their patient IDs later to remove them from the cohortexclusions <- sim_av_patient |>select(PATIENTID, VITALSTATUS) |>collect() |> janitor::clean_names() |>distinct() |> dplyr::filter(str_detect(vitalstatus, "A[123]") |str_detect(vitalstatus, "D[145]") |str_detect(vitalstatus, "X1") |str_detect(vitalstatus, "I") ) |>select(patientid)# Here we extract information about cancer patients, including causes of death related to suicide patient <- sim_av_patient |>select( PATIENTID, GENDER, ETHNICITY, VITALSTATUS, VITALSTATUSDATE, DEATHCAUSECODE_1A, DEATHCAUSECODE_1B, DEATHCAUSECODE_1C, DEATHCAUSECODE_2, DEATHCAUSECODE_UNDERLYING ) |>collect() |> janitor::clean_names() |>distinct() |>anti_join(exclusions) |>mutate(suicide_cause_2 =case_when(str_detect(deathcausecode_1a, "Y339") ~NA,str_detect(deathcausecode_1a, "E9888") ~NA,str_detect(deathcausecode_1a, "X6[0123456789]") ~"1A",str_detect(deathcausecode_1a, "X7[0123456789]") ~"1A",str_detect(deathcausecode_1a, "X8[01234]") ~"1A",str_detect(deathcausecode_1a, "Y1[0123456789]") ~"1A",str_detect(deathcausecode_1a, "Y2[0123456789]") ~"1A",str_detect(deathcausecode_1a, "Y3[01234]") ~"1A",str_detect(deathcausecode_1a, "E98") ~"1A",str_detect(deathcausecode_1a, "E95[0123456789]") ~"1A",str_detect(deathcausecode_1a, "E87[02]") ~"1A",str_detect(deathcausecode_1b, "Y339") ~NA,str_detect(deathcausecode_1b, "E9888") ~NA,str_detect(deathcausecode_1b, "X6[0123456789]") ~"1B",str_detect(deathcausecode_1b, "X7[0123456789]") ~"1B",str_detect(deathcausecode_1b, "X8[01234]") ~"1B",str_detect(deathcausecode_1b, "Y1[0123456789]") ~"1B",str_detect(deathcausecode_1b, "Y2[0123456789]") ~"1B",str_detect(deathcausecode_1b, "Y3[01234]") ~"1B",str_detect(deathcausecode_1b, "E98") ~"1B",str_detect(deathcausecode_1b, "E95[0123456789]") ~"1B",str_detect(deathcausecode_1b, "E87[02]") ~"1B",str_detect(deathcausecode_1c, "Y339") ~NA,str_detect(deathcausecode_1c, "E9888") ~NA,str_detect(deathcausecode_1c, "X6[0123456789]") ~"1C",str_detect(deathcausecode_1c, "X7[0123456789]") ~"1C",str_detect(deathcausecode_1c, "X8[01234]") ~"1C",str_detect(deathcausecode_1c, "Y1[0123456789]") ~"1C",str_detect(deathcausecode_1c, "Y2[0123456789]") ~"1C",str_detect(deathcausecode_1c, "Y3[01234]") ~"1C",str_detect(deathcausecode_1c, "E98") ~"1C",str_detect(deathcausecode_1c, "E95[0123456789]") ~"1C",str_detect(deathcausecode_1c, "E87[02]") ~"1C",str_detect(deathcausecode_2, "Y339") ~NA,str_detect(deathcausecode_2, "E9888") ~NA,str_detect(deathcausecode_2, "X6[0123456789]") ~"2",str_detect(deathcausecode_2, "X7[0123456789]") ~"2",str_detect(deathcausecode_2, "X8[01234]") ~"2",str_detect(deathcausecode_2, "Y1[0123456789]") ~"2",str_detect(deathcausecode_2, "Y2[0123456789]") ~"2",str_detect(deathcausecode_2, "Y3[01234]") ~"2",str_detect(deathcausecode_2, "E98") ~"2",str_detect(deathcausecode_2, "E95[0123456789]") ~"2",str_detect(deathcausecode_2, "E87[02]") ~"2",str_detect(deathcausecode_underlying, "Y339") ~NA,str_detect(deathcausecode_underlying, "E9888") ~NA,str_detect(deathcausecode_underlying, "X6[0123456789]") ~"Underlying",str_detect(deathcausecode_underlying, "X7[0123456789]") ~"Underlying",str_detect(deathcausecode_underlying, "X8[01234]") ~"Underlying",str_detect(deathcausecode_underlying, "Y1[0123456789]") ~"Underlying",str_detect(deathcausecode_underlying, "Y2[0123456789]") ~"Underlying",str_detect(deathcausecode_underlying, "Y3[01234]") ~"Underlying",str_detect(deathcausecode_underlying, "E98") ~"Underlying",str_detect(deathcausecode_underlying, "E95[0123456789]") ~"Underlying",str_detect(deathcausecode_underlying, "E87[02]") ~"Underlying",TRUE~NA ) ) |>mutate(suicide_2 =case_when(str_detect(deathcausecode_1a, "Y339") ~0,str_detect(deathcausecode_1a, "E9888") ~0,str_detect(deathcausecode_1a, "X6[0123456789]") ~1,str_detect(deathcausecode_1a, "X7[0123456789]") ~1,str_detect(deathcausecode_1a, "X8[01234]") ~1,str_detect(deathcausecode_1a, "Y1[0123456789]") ~1,str_detect(deathcausecode_1a, "Y2[0123456789]") ~1,str_detect(deathcausecode_1a, "Y3[01234]") ~1,str_detect(deathcausecode_1a, "E98") ~1,str_detect(deathcausecode_1a, "E95[0123456789]") ~1,str_detect(deathcausecode_1a, "E87[02]") ~1,str_detect(deathcausecode_1b, "Y339") ~0,str_detect(deathcausecode_1b, "E9888") ~0,str_detect(deathcausecode_1b, "X6[0123456789]") ~1,str_detect(deathcausecode_1b, "X7[0123456789]") ~1,str_detect(deathcausecode_1b, "X8[01234]") ~1,str_detect(deathcausecode_1b, "Y1[0123456789]") ~1,str_detect(deathcausecode_1b, "Y2[0123456789]") ~1,str_detect(deathcausecode_1b, "Y3[01234]") ~1,str_detect(deathcausecode_1b, "E98") ~1,str_detect(deathcausecode_1b, "E95[0123456789]") ~1,str_detect(deathcausecode_1b, "E87[02]") ~1,str_detect(deathcausecode_1c, "Y339") ~0,str_detect(deathcausecode_1c, "E9888") ~0,str_detect(deathcausecode_1c, "X6[0123456789]") ~1,str_detect(deathcausecode_1c, "X7[0123456789]") ~1,str_detect(deathcausecode_1c, "X8[01234]") ~1,str_detect(deathcausecode_1c, "Y1[0123456789]") ~1,str_detect(deathcausecode_1c, "Y2[0123456789]") ~1,str_detect(deathcausecode_1c, "Y3[01234]") ~1,str_detect(deathcausecode_1c, "E98") ~1,str_detect(deathcausecode_1c, "E95[0123456789]") ~1,str_detect(deathcausecode_1c, "E87[02]") ~1,str_detect(deathcausecode_2, "Y339") ~0,str_detect(deathcausecode_2, "E9888") ~0,str_detect(deathcausecode_2, "X6[0123456789]") ~1,str_detect(deathcausecode_2, "X7[0123456789]") ~1,str_detect(deathcausecode_2, "X8[01234]") ~1,str_detect(deathcausecode_2, "Y1[0123456789]") ~1,str_detect(deathcausecode_2, "Y2[0123456789]") ~1,str_detect(deathcausecode_2, "Y3[01234]") ~1,str_detect(deathcausecode_2, "E98") ~1,str_detect(deathcausecode_2, "E95[0123456789]") ~1,str_detect(deathcausecode_2, "E87[02]") ~1,str_detect(deathcausecode_underlying, "Y339") ~0,str_detect(deathcausecode_underlying, "E9888") ~0,str_detect(deathcausecode_underlying, "X6[0123456789]") ~1,str_detect(deathcausecode_underlying, "X7[0123456789]") ~1,str_detect(deathcausecode_underlying, "X8[01234]") ~1,str_detect(deathcausecode_underlying, "Y1[0123456789]") ~1,str_detect(deathcausecode_underlying, "Y2[0123456789]") ~1,str_detect(deathcausecode_underlying, "Y3[01234]") ~1,str_detect(deathcausecode_underlying, "E98") ~1,str_detect(deathcausecode_underlying, "E95[0123456789]") ~1,str_detect(deathcausecode_underlying, "E87[02]") ~1,TRUE~0 ) ) |>mutate(death_cause_flag =case_when(str_detect(vitalstatus, "A") ~"Alive", (!str_detect(vitalstatus, "A") &is.na(deathcausecode_1a) &is.na(deathcausecode_1b) &is.na(deathcausecode_1c) &is.na(deathcausecode_2) &is.na(deathcausecode_underlying) ) ~"Dead without Cause", (!str_detect(vitalstatus, "A") & (!is.na(deathcausecode_1a) |!is.na(deathcausecode_1b) |!is.na(deathcausecode_1c) |!is.na(deathcausecode_2) |!is.na(deathcausecode_underlying) ) ) ~"Dead with Cause",TRUE~NA ) ) |>as.data.table()# Here we identify the patient's most recently diagnosed cancer tumour <- sim_av_tumour |>select( PATIENTID, TUMOURID, SITE_ICD10_O2_3CHAR, DIAGNOSISDATEBEST, STAGE_BEST, QUINTILE_2019, AGE ) |>collect() |> janitor::clean_names() |>distinct() |>anti_join(exclusions) |>rename(tumour_site = site_icd10_o2_3char,deprivation_quintile = quintile_2019,age_at_diagnosis = age ) |>as.data.table()# Dataset merging## We now merge information from the two source tables and annotate the new dataset## Annotations are intended to match the original Stata code developed for this analysis as closely as possible, but you will note that there are some deviationsdataset <-merge(patient, tumour) |>anti_join(exclusions) |>filter(gender !=9) |>filter(!(gender ==2&str_detect(tumour_site, "C6[0123]"))) |>filter(!(gender ==1&str_detect(tumour_site, "C5[12345678]"))) |>filter(str_detect(tumour_site, "^C") &!str_detect(tumour_site, "C44")) |>group_by(patientid) |>arrange(desc(diagnosisdatebest)) |>slice_head() |>rename(patient_id = patientid,gender_code = gender,ethnicity_code = ethnicity,vital_status_code = vitalstatus,vital_status_date = vitalstatusdate,suicide = suicide_2,suicide_cause = suicide_cause_2,tumour_id = tumourid,diagnosis_date = diagnosisdatebest,stage = stage_best, ) |>ungroup() |>mutate(cancer_group =case_when(str_detect(tumour_site, "C0[0123456789]") ~"Head and neck",str_detect(tumour_site, "C1[01234]") ~"Head and neck",str_detect(tumour_site, "C15") ~"Oesophagus",str_detect(tumour_site, "C16") ~"Stomach",str_detect(tumour_site, "C17") ~"Other malignant neoplasms",str_detect(tumour_site, "C1[89]") ~"Colorectal",str_detect(tumour_site, "C20") ~"Colorectal",str_detect(tumour_site, "C22") ~"Liver",str_detect(tumour_site, "C25") ~"Pancreas",str_detect(tumour_site, "C3[12]") ~"Head and neck",str_detect(tumour_site, "C3[34]") ~"Lung",str_detect(tumour_site, "C4[019]") ~"Sarcoma",str_detect(tumour_site, "C43") ~"Melanoma",str_detect(tumour_site, "C45") ~"Mesothelioma",str_detect(tumour_site, "C50") ~"Breast",str_detect(tumour_site, "C53") ~"Cervix",str_detect(tumour_site, "C5[45]") ~"Uterus",str_detect(tumour_site, "C5[67]") ~"Ovary",str_detect(tumour_site, "C61") ~"Prostate",str_detect(tumour_site, "C62") ~"Testis",str_detect(tumour_site, "C6[4568]") ~"Kidney and unspecified urinary organs",str_detect(tumour_site, "C67") ~"Bladder",str_detect(tumour_site, "C7[012]") ~"Central Nervous System (incl brain)",str_detect(tumour_site, "C73") ~"Head and neck",str_detect(tumour_site, "C7[789]") ~"Cancer of Unknown Primary",str_detect(tumour_site, "C80") ~"Cancer of Unknown Primary",str_detect(tumour_site, "C81") ~"Hodgkin lymphoma",str_detect(tumour_site, "C8[2345]") ~"Non-Hodgkin lymphoma",str_detect(tumour_site, "C88") ~"Multiple myeloma",str_detect(tumour_site, "C90") ~"Multiple myeloma",str_detect(tumour_site, "C9[12345]") ~"Leukaemia",TRUE~"Other malignant neoplasms" ) ) |>mutate(gender =fcase( gender_code ==1,"Male", gender_code ==2,"Female", gender_code ==9,"Indeterminate", gender_code =="X",NA ) ) |>mutate(ethnicity =fcase( ethnicity_code =="0","White", ethnicity_code =="A","White", ethnicity_code =="B","White", ethnicity_code =="C","White", ethnicity_code =="CA","Unknown", ethnicity_code =="CP","White", ethnicity_code =="D","Mixed", ethnicity_code =="E","Mixed", ethnicity_code =="F","Mixed", ethnicity_code =="G","Mixed", ethnicity_code =="H","Asian", ethnicity_code =="J","Asian", ethnicity_code =="K","Asian", ethnicity_code =="L","Asian", ethnicity_code =="M","Black", ethnicity_code =="N","Black", ethnicity_code =="P","Black", ethnicity_code =="R","Other", ethnicity_code =="S","Other", ethnicity_code =="X","Unknown", ethnicity_code =="Z","Not Stated" ) ) |>mutate(vital_status =fcase( vital_status_code =="A","Alive", vital_status_code =="D","Dead", vital_status_code =="D3","Dies before diagnosis", vital_status_code =="X2","Patient may have been returned and successfully traced", vital_status_code =="X4","Patient may have been returned but has not been successfully traced", vital_status_code =="X5","Loss to follow up" ) ) |>mutate(diagnosis_year =factor(year(diagnosis_date))) |>as.data.table()## Generate random ages and derive age-based variablesage_breaks <-c(0, 19, 24, 29, 34, 39, 44, 49, 54, 59, 64, 69, 74, 79, 84, 89, 999)age_labels <-c("15-19","20-24","25-29","30-34","35-39","40-44","45-49","50-54","55-59","60-64","65-69","70-74","75-79","80-84","85-89","90+" )dataset <- dataset |>partition(cluster) |>group_by(patient_id) |>mutate(age_at_diagnosis =floor(rnorm(n =1, mean =65, sd =12.5 ))) |>collect() |>ungroup() |>mutate(age_at_diagnosis =case_when( age_at_diagnosis <18~18, age_at_diagnosis >99~99,TRUE~ age_at_diagnosis ) ) |>mutate(birth_date = diagnosis_date - lubridate::years(age_at_diagnosis) - lubridate::days(floor(runif(n =1, min =1, max =10 )))) |>mutate(age_category_at_diagnosis =cut( age_at_diagnosis,breaks = age_breaks,labels = age_labels,right =TRUE ) ) |>mutate(death_date =case_when(vital_status =="Dead"~ vital_status_date,TRUE~NA)) |>mutate(age_at_death =floor(as.numeric(difftime(death_date, birth_date, unit ="days") /365.25 ))) |>mutate(age_category_at_death =cut( age_at_death,breaks = age_breaks,labels = age_labels,right =TRUE ) ) |>mutate(age_category_at_death =case_when(vital_status !="Dead"~NA,TRUE~ age_category_at_death)) |>mutate(suicide_death_date =case_when(suicide ==1~ vital_status_date,TRUE~NA)) |>mutate(age_at_suicide_death =floor(as.numeric(difftime(suicide_death_date, birth_date, unit ="days") /365.25 ))) |>mutate(age_category_at_suicide_death =cut( age_at_suicide_death,breaks = age_breaks,labels = age_labels,right =TRUE ) ) |>mutate(age_category_at_suicide_death =case_when(suicide ==0~NA,TRUE~ age_category_at_suicide_death) )## Calculate months since diagnosismonths_breaks <-c(-1, 5, 11, 23, 35, 59, 119, 999)months_labels <-c("0-5 mo", "6-11 mo", "12-23 mo", "24-35mo", "3-4y", "5-9y", "10+ y")dataset <- dataset |>mutate(months_from_diagnosis =floor(as.numeric(difftime(vital_status_date, diagnosis_date, units =c("weeks")) ) /4)) |>mutate(months_from_diagnosis =case_when(months_from_diagnosis <0~0,TRUE~ months_from_diagnosis)) |>mutate(months_category_from_diagnosis =cut( months_from_diagnosis,breaks = months_breaks,labels = months_labels,right =TRUE ) )# Reverse deprivation quintiles## Deprivation is sometimes reported from least to most deprived or from most to least deprived. Here, we reverse the quintiles to match what is presented in the JAMA Psychiatry publication. dataset <- dataset |>mutate(deprivation_quintile_reversed =case_when( deprivation_quintile =="1 - most deprived"~"5, Most", deprivation_quintile =="2"~"4", deprivation_quintile =="3"~"3", deprivation_quintile =="4"~"2", deprivation_quintile =="5 - least deprived"~"1, Least", ) )# Survival analysis## Here we shape the data as survival data, following up from the diagnosis date to the date of the vital status update survival <- dataset |>mutate(time =as.numeric(difftime(vital_status_date, diagnosis_date, unit ="days"))) |>filter(time >=0) |>mutate(time =case_when(time ==0~1,TRUE~ time)) |>mutate(time =as.numeric(time)/365.25) |>mutate(suicide = suicide +1) |>mutate(gender =factor(gender, levels =c("Male", "Female")),ethnicity =factor( ethnicity,levels =c("White","Asian" ,"Black","Mixed","Other","Unknown","Not Stated" ) ),cancer_group =as.factor(cancer_group),deprivation_quintile_reversed =as.factor(deprivation_quintile_reversed), )# Variable labels## Here we apply a number of variable labels for tidinesslabel(survival$age_at_death) <-"Attained age at death"label(survival$age_at_diagnosis) <-"Attained age at cancer diagnosis, y"label(survival$age_at_suicide_death) <-"Attained age at suicide death, y"label(survival$age_category_at_death) <-"Age at death (attained age), y"label(survival$age_category_at_diagnosis) <-"Age category at cancer diagnosis"label(survival$age_category_at_suicide_death) <-"Age category at suicide death"label(survival$birth_date) <-"Birth date"label(survival$cancer_group) <-"Last primary cancer"label(survival$deathcausecode_1a) <-"Death cause code - 1A"label(survival$deathcausecode_1b) <-"Death cause code - 1B"label(survival$deathcausecode_1c) <-"Death cause code - 1C"label(survival$deathcausecode_2) <-"Death cause code - 2"label(survival$deathcausecode_underlying) <-"Death cause code - Underlying"label(survival$death_cause_flag) <-"Alive or Dead"label(survival$death_date) <-"Date of death"label(survival$deprivation_quintile) <-"Deprivation"label(survival$deprivation_quintile_reversed) <-"Deprivation"label(survival$diagnosis_date) <-"Date of Diagnosis"label(survival$diagnosis_year) <-"Year of cancer diagnosis"label(survival$ethnicity) <-"Ethnicity"label(survival$ethnicity_code) <-"Ethnicity code"label(survival$gender) <-"Sex"label(survival$gender_code) <-"Gender code"label(survival$months_from_diagnosis) <-"Time from cancer diagnosis"label(survival$months_category_from_diagnosis) <-"Time from cancer diagnosis"label(survival$patient_id) <-"Patient ID"label(survival$stage) <-"Tumour stage"label(survival$suicide) <-"Suicide-related death (Y/N)"label(survival$suicide_cause) <-"Suicide-related death listing on death certificate"label(survival$suicide_death_date) <-"Date of suicide death"label(survival$time) <-"Person-years observed"label(survival$tumour_id) <-"Tumour ID"label(survival$tumour_site) <-"Tumour site"label(survival$vital_status) <-"Vital Status"label(survival$vital_status_code) <-"Vital status code"label(survival$vital_status_date) <-"Vital status date"# Setting variables## Here we define our explanatory and dependent variables ## Note that the dependent function is a survival function explanatory <-c("gender","months_category_from_diagnosis","age_category_at_death","cancer_group","deprivation_quintile_reversed","ethnicity","diagnosis_year")dependent <-"Surv(time, suicide)"# Load in ONS suicide data## This is the data for suicides in the general population which we will use to standardise the suicide rates among people with cancer## Notably, this data was provided as Excel sheets which require a fair amount of cleaning, so some of the code may seem idiosyncratic because we have to deal with some of the strangeness of trying to import and reshape Excel data for our needssuicides2022 <-read_xlsx(here("suicides2022.xlsx"),sheet ="Table_6",range ="A6:CK132",col_names =TRUE) |> janitor::clean_names() %>%rename_with(~str_remove(., "_note.*")) |>select(-contains("x10_14_"))# Clean up observed suicidesobserved_suicides <- suicides2022 |>rename(year = year_of_death_registration) |>select(-area_code, -area_of_usual_residence) |>select(!contains("_ucl") &!contains("_lcl") &!contains("_marker")) %>%rename_with( ~str_replace(., "^x", "")) |>group_by(sex, year) |>select(contains("_number_of_deaths")) |>pivot_longer(cols =ends_with("_number_of_deaths")) |>rename(attained_age = name, observed_suicides = value) |>mutate(attained_age =str_replace(attained_age, "_number_of_deaths", "")) |>mutate(attained_age =str_replace(attained_age, "_", "-"))# Clean up standardised suicide ratesobserved_suicide_rates <- suicides2022 |>rename(year = year_of_death_registration) |>select(-area_code, -area_of_usual_residence) |>select(!contains("_ucl") &!contains("_lcl") &!contains("_marker")) %>%rename_with( ~str_replace(., "^x", "")) |>group_by(sex, year) |>select(contains("_rate_per_100_000")) |>pivot_longer(cols =ends_with("_rate_per_100_000")) |>rename(attained_age = name, suicides_per_100000 = value) |>mutate(attained_age =str_replace(attained_age, "_rate_per_100_000", "")) |>mutate(attained_age =str_replace(attained_age, "_", "-"))reference_suicides <- observed_suicides |>inner_join(observed_suicide_rates) |>mutate(sex =case_when(sex =="Males"~"Male", sex =="Females"~"Female", TRUE~ sex)) |>mutate(attained_age =case_when(attained_age =="90"~"90+" , TRUE~ attained_age)) |>select(year, sex, attained_age, observed_suicides, suicides_per_100000) |>rename(reference_suicides = observed_suicides)# Save and export raw datasetsave(dataset, exclusions, survival, reference_suicides,file =here("dataset.Rdata"))write_csv(dataset, file =here("dataset", "dataset.csv"))write_csv(survival, file =here("dataset", "survival.csv"))write_csv(reference_suicides,file =here("dataset", "reference_suicides.csv"))
Using the standardised suicide rates in the general population, we calculate the expected numbers of suicide among those diagnosed with cancer and compare them to the number of suicides actually observed. This gives us the standardised mortality ratio (SMR).
\[
\text{standardised mortality ratio}=\frac{\text{observed number of suicides}}{\text{expected number of suicides}}
\]
We can also then calculate absolute excess risk (AER) per 10 000 as follows:
\[
\text{absolute excess risk per 10,000}=\frac{\text{observed number of suicides} - \text{expected number of suicides}}{\text{person-years at risk}}
\]
2 Sample characteristics
These tables aims to reproduce the main outputs of Table 1 in (Henson et al., 2019). There are some important differences to note:
Age groups are categorised according to the same ones provided by the ONS standardised mortality rates since we will need to standardise using those same age groups anyway
Simulacrum covers a different time period than the authors use in (Henson et al., 2019)
Ages were randomly generated, so don’t read too much into them
R Code
# Table 1## A quirk of the finalfit package is that survival events need to be coded as "1" and "2" rather than "0" and "1" survival |>mutate(suicide =as.factor(case_when(suicide ==2~"Y", suicide ==1~"N", TRUE~NA))) |>summary_factorlist(dependent ="suicide",explanatory =c("gender","months_category_from_diagnosis","age_category_at_death","cancer_group","deprivation_quintile_reversed","ethnicity","age_category_at_diagnosis","diagnosis_year"),total_col =TRUE,add_dependent_label =TRUE,add_col_totals =TRUE ) |> kableExtra::kable() |>kable_classic(full_width = F) |>footnote(general ="Percentages provided in parentheses.")
Dependent: suicide
N
Y
Total
Total N (%)
1189118 (99.9)
1180 (0.1)
1190298
Sex
Male
613178 (51.6)
723 (61.3)
613901 (51.6)
Female
575940 (48.4)
457 (38.7)
576397 (48.4)
Time from cancer diagnosis
0-5 mo
212013 (17.8)
395 (33.5)
212408 (17.8)
6-11 mo
78731 (6.6)
176 (14.9)
78907 (6.6)
12-23 mo
97424 (8.2)
237 (20.1)
97661 (8.2)
24-35mo
105740 (8.9)
154 (13.1)
105894 (8.9)
3-4y
382207 (32.1)
184 (15.6)
382391 (32.1)
5-9y
313003 (26.3)
34 (2.9)
313037 (26.3)
10+ y
0 (0.0)
0 (0.0)
0 (0.0)
Age at death (attained age), y
15-19
56 (0.0)
0 (0.0)
56 (0.0)
20-24
239 (0.0)
2 (0.2)
241 (0.0)
25-29
853 (0.2)
4 (0.3)
857 (0.2)
30-34
2622 (0.5)
4 (0.3)
2626 (0.5)
35-39
6770 (1.3)
14 (1.2)
6784 (1.3)
40-44
15064 (2.9)
33 (2.8)
15097 (2.9)
45-49
28887 (5.5)
67 (5.7)
28954 (5.5)
50-54
47163 (9.0)
101 (8.6)
47264 (9.0)
55-59
66720 (12.7)
145 (12.3)
66865 (12.7)
60-64
79828 (15.2)
156 (13.2)
79984 (15.2)
65-69
81973 (15.6)
198 (16.8)
82171 (15.6)
70-74
72311 (13.8)
152 (12.9)
72463 (13.8)
75-79
54431 (10.4)
131 (11.1)
54562 (10.4)
80-84
34688 (6.6)
76 (6.4)
34764 (6.6)
85-89
19186 (3.7)
62 (5.3)
19248 (3.7)
90+
14518 (2.8)
34 (2.9)
14552 (2.8)
Last primary cancer
Bladder
31935 (2.7)
35 (3.0)
31970 (2.7)
Breast
178351 (15.0)
97 (8.2)
178448 (15.0)
Cancer of Unknown Primary
8875 (0.7)
9 (0.8)
8884 (0.7)
Central Nervous System (incl brain)
18932 (1.6)
19 (1.6)
18951 (1.6)
Cervix
10734 (0.9)
13 (1.1)
10747 (0.9)
Colorectal
136413 (11.5)
122 (10.3)
136535 (11.5)
Head and neck
49813 (4.2)
108 (9.2)
49921 (4.2)
Hodgkin lymphoma
7120 (0.6)
13 (1.1)
7133 (0.6)
Kidney and unspecified urinary organs
42377 (3.6)
51 (4.3)
42428 (3.6)
Leukaemia
32780 (2.8)
33 (2.8)
32813 (2.8)
Liver
20979 (1.8)
17 (1.4)
20996 (1.8)
Lung
152151 (12.8)
184 (15.6)
152335 (12.8)
Melanoma
55370 (4.7)
36 (3.1)
55406 (4.7)
Mesothelioma
8995 (0.8)
8 (0.7)
9003 (0.8)
Multiple myeloma
21580 (1.8)
16 (1.4)
21596 (1.8)
Non-Hodgkin lymphoma
47014 (4.0)
61 (5.2)
47075 (4.0)
Oesophagus
29461 (2.5)
20 (1.7)
29481 (2.5)
Other malignant neoplasms
38373 (3.2)
33 (2.8)
38406 (3.2)
Ovary
24055 (2.0)
9 (0.8)
24064 (2.0)
Pancreas
35001 (2.9)
22 (1.9)
35023 (2.9)
Prostate
171745 (14.4)
204 (17.3)
171949 (14.4)
Sarcoma
8496 (0.7)
11 (0.9)
8507 (0.7)
Stomach
20590 (1.7)
26 (2.2)
20616 (1.7)
Testis
7940 (0.7)
15 (1.3)
7955 (0.7)
Uterus
30038 (2.5)
18 (1.5)
30056 (2.5)
Deprivation
1, Least
256470 (21.6)
252 (21.4)
256722 (21.6)
2
256429 (21.6)
242 (20.5)
256671 (21.6)
3
245122 (20.6)
247 (20.9)
245369 (20.6)
4
220523 (18.5)
209 (17.7)
220732 (18.5)
5, Most
210574 (17.7)
230 (19.5)
210804 (17.7)
Ethnicity
White
1039481 (88.9)
1029 (88.2)
1040510 (88.9)
Asian
30647 (2.6)
39 (3.3)
30686 (2.6)
Black
23013 (2.0)
26 (2.2)
23039 (2.0)
Mixed
5735 (0.5)
4 (0.3)
5739 (0.5)
Other
18664 (1.6)
14 (1.2)
18678 (1.6)
Unknown
7795 (0.7)
11 (0.9)
7806 (0.7)
Not Stated
44147 (3.8)
44 (3.8)
44191 (3.8)
Age category at cancer diagnosis
15-19
179 (0.0)
0 (0.0)
179 (0.0)
20-24
626 (0.1)
2 (0.2)
628 (0.1)
25-29
2291 (0.2)
4 (0.3)
2295 (0.2)
30-34
6586 (0.6)
8 (0.7)
6594 (0.6)
35-39
17369 (1.5)
12 (1.0)
17381 (1.5)
40-44
38011 (3.2)
37 (3.1)
38048 (3.2)
45-49
72013 (6.1)
81 (6.9)
72094 (6.1)
50-54
115164 (9.7)
104 (8.8)
115268 (9.7)
55-59
157842 (13.3)
151 (12.8)
157993 (13.3)
60-64
184585 (15.5)
174 (14.7)
184759 (15.5)
65-69
184510 (15.5)
192 (16.3)
184702 (15.5)
70-74
157836 (13.3)
141 (11.9)
157977 (13.3)
75-79
115375 (9.7)
121 (10.3)
115496 (9.7)
80-84
71405 (6.0)
78 (6.6)
71483 (6.0)
85-89
38225 (3.2)
44 (3.7)
38269 (3.2)
90+
27101 (2.3)
31 (2.6)
27132 (2.3)
Year of cancer diagnosis
2016
290193 (24.4)
298 (25.3)
290491 (24.4)
2017
290613 (24.4)
301 (25.5)
290914 (24.4)
2018
305343 (25.7)
289 (24.5)
305632 (25.7)
2019
302969 (25.5)
292 (24.7)
303261 (25.5)
Note:
Percentages provided in parentheses.
We can also output an analogous table using Python.
Python Code
# Import necessary packagesimport pandas as pdimport numpy as npimport matplotlib as mplimport tableone as tbl1import tabulatefrom IPython.display import Markdown# Import datasetdf = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/survival.csv")## Redefine suicide == 0 and suicide == 1 df.eval("suicide = suicide - 1", inplace=True)# Annotate datasetcolumns = ['gender', 'months_category_from_diagnosis', 'age_category_at_death', 'cancer_group', 'deprivation_quintile_reversed', 'ethnicity', 'age_category_at_diagnosis', 'diagnosis_year']categorical = ['gender', 'months_category_from_diagnosis', 'age_category_at_death', 'cancer_group', 'deprivation_quintile_reversed', 'ethnicity', 'age_category_at_diagnosis', 'diagnosis_year']order = {'gender': ['Male', 'Female'],'months_category_from_diagnosis': ['0-5 mo', '6-11 mo', '12-23 mo', '24-35mo', '3-4y', '5-9y', '10+ y'],'age_category_at_death': ['15-19', '20-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79', '80-84', '85-89', '90+'],'ethnicity': ['White', 'Asian', 'Black', 'Mixed', 'Other', 'Unknown', 'Not Stated'] }labels = {'gender': 'Sex','suicide': 'Suicide Death','months_category_from_diagnosis': 'Time from cancer diagnosis','age_category_at_death': 'Age at death (attained age), y','cancer_group': 'Last primary cancer','deprivation_quintile_reversed': 'Deprivation','ethnicity': 'Ethnicity','age_category_at_diagnosis': 'Age category at cancer diagnosis','diagnosis_year': 'Year of cancer diagnosis' }groupby = ['suicide']#Produce Table Onemytable = tbl1.TableOne(df, columns = columns, rename = labels, order = order, categorical = categorical, groupby = groupby)Markdown(mytable.tabulate(tablefmt="grid"))
Missing
Overall
0
1
n
1190298
1189118
1180
Sex, n (%)
Male
0
613901 (51.6)
613178 (51.6)
723 (61.3)
Female
576397 (48.4)
575940 (48.4)
457 (38.7)
Time from cancer diagnosis, n (%)
0-5 mo
0
212408 (17.8)
212013 (17.8)
395 (33.5)
6-11 mo
78907 (6.6)
78731 (6.6)
176 (14.9)
12-23 mo
97661 (8.2)
97424 (8.2)
237 (20.1)
24-35mo
105894 (8.9)
105740 (8.9)
154 (13.1)
3-4y
382391 (32.1)
382207 (32.1)
184 (15.6)
5-9y
313037 (26.3)
313003 (26.3)
34 (2.9)
Age at death (attained age), y, n (%)
15-19
663810
56 (0.0)
56 (0.0)
20-24
241 (0.0)
239 (0.0)
2 (0.2)
25-29
857 (0.2)
853 (0.2)
4 (0.3)
30-34
2626 (0.5)
2622 (0.5)
4 (0.3)
35-39
6784 (1.3)
6770 (1.3)
14 (1.2)
40-44
15097 (2.9)
15064 (2.9)
33 (2.8)
45-49
28954 (5.5)
28887 (5.5)
67 (5.7)
50-54
47264 (9.0)
47163 (9.0)
101 (8.6)
55-59
66865 (12.7)
66720 (12.7)
145 (12.3)
60-64
79984 (15.2)
79828 (15.2)
156 (13.2)
65-69
82171 (15.6)
81973 (15.6)
198 (16.8)
70-74
72463 (13.8)
72311 (13.8)
152 (12.9)
75-79
54562 (10.4)
54431 (10.4)
131 (11.1)
80-84
34764 (6.6)
34688 (6.6)
76 (6.4)
85-89
19248 (3.7)
19186 (3.7)
62 (5.3)
90+
14552 (2.8)
14518 (2.8)
34 (2.9)
Last primary cancer, n (%)
Bladder
0
31970 (2.7)
31935 (2.7)
35 (3.0)
Breast
178448 (15.0)
178351 (15.0)
97 (8.2)
Cancer of Unknown Primary
8884 (0.7)
8875 (0.7)
9 (0.8)
Central Nervous System (incl brain)
18951 (1.6)
18932 (1.6)
19 (1.6)
Cervix
10747 (0.9)
10734 (0.9)
13 (1.1)
Colorectal
136535 (11.5)
136413 (11.5)
122 (10.3)
Head and neck
49921 (4.2)
49813 (4.2)
108 (9.2)
Hodgkin lymphoma
7133 (0.6)
7120 (0.6)
13 (1.1)
Kidney and unspecified urinary organs
42428 (3.6)
42377 (3.6)
51 (4.3)
Leukaemia
32813 (2.8)
32780 (2.8)
33 (2.8)
Liver
20996 (1.8)
20979 (1.8)
17 (1.4)
Lung
152335 (12.8)
152151 (12.8)
184 (15.6)
Melanoma
55406 (4.7)
55370 (4.7)
36 (3.1)
Mesothelioma
9003 (0.8)
8995 (0.8)
8 (0.7)
Multiple myeloma
21596 (1.8)
21580 (1.8)
16 (1.4)
Non-Hodgkin lymphoma
47075 (4.0)
47014 (4.0)
61 (5.2)
Oesophagus
29481 (2.5)
29461 (2.5)
20 (1.7)
Other malignant neoplasms
38406 (3.2)
38373 (3.2)
33 (2.8)
Ovary
24064 (2.0)
24055 (2.0)
9 (0.8)
Pancreas
35023 (2.9)
35001 (2.9)
22 (1.9)
Prostate
171949 (14.4)
171745 (14.4)
204 (17.3)
Sarcoma
8507 (0.7)
8496 (0.7)
11 (0.9)
Stomach
20616 (1.7)
20590 (1.7)
26 (2.2)
Testis
7955 (0.7)
7940 (0.7)
15 (1.3)
Uterus
30056 (2.5)
30038 (2.5)
18 (1.5)
Deprivation, n (%)
1, Least
0
256722 (21.6)
256470 (21.6)
252 (21.4)
2
256671 (21.6)
256429 (21.6)
242 (20.5)
3
245369 (20.6)
245122 (20.6)
247 (20.9)
4
220732 (18.5)
220523 (18.5)
209 (17.7)
5, Most
210804 (17.7)
210574 (17.7)
230 (19.5)
Ethnicity, n (%)
White
19649
1040510 (88.9)
1039481 (88.9)
1029 (88.2)
Asian
30686 (2.6)
30647 (2.6)
39 (3.3)
Black
23039 (2.0)
23013 (2.0)
26 (2.2)
Mixed
5739 (0.5)
5735 (0.5)
4 (0.3)
Other
18678 (1.6)
18664 (1.6)
14 (1.2)
Unknown
7806 (0.7)
7795 (0.7)
11 (0.9)
Not Stated
44191 (3.8)
44147 (3.8)
44 (3.8)
Age category at cancer diagnosis, n (%)
15-19
0
179 (0.0)
179 (0.0)
20-24
628 (0.1)
626 (0.1)
2 (0.2)
25-29
2295 (0.2)
2291 (0.2)
4 (0.3)
30-34
6594 (0.6)
6586 (0.6)
8 (0.7)
35-39
17381 (1.5)
17369 (1.5)
12 (1.0)
40-44
38048 (3.2)
38011 (3.2)
37 (3.1)
45-49
72094 (6.1)
72013 (6.1)
81 (6.9)
50-54
115268 (9.7)
115164 (9.7)
104 (8.8)
55-59
157993 (13.3)
157842 (13.3)
151 (12.8)
60-64
184759 (15.5)
184585 (15.5)
174 (14.7)
65-69
184702 (15.5)
184510 (15.5)
192 (16.3)
70-74
157977 (13.3)
157836 (13.3)
141 (11.9)
75-79
115496 (9.7)
115375 (9.7)
121 (10.3)
80-84
71483 (6.0)
71405 (6.0)
78 (6.6)
85-89
38269 (3.2)
38225 (3.2)
44 (3.7)
90+
27132 (2.3)
27101 (2.3)
31 (2.6)
Year of cancer diagnosis, n (%)
2016
0
290491 (24.4)
290193 (24.4)
298 (25.3)
2017
290914 (24.4)
290613 (24.4)
301 (25.5)
2018
305632 (25.7)
305343 (25.7)
289 (24.5)
2019
303261 (25.5)
302969 (25.5)
292 (24.7)
3 Suicide SMRs and AERs per 10,000 person-years at risk according to last primary cancer
These tables aim to reproduce some of the outputs of Table 2 in (Henson et al., 2019).
We can also output an analogous table using Python.
Python Code
# Import necessary packagesimport pandas as pdimport numpy as npfrom tabulate import tabulatefrom IPython.display import Markdown# Import datasetdf_survival = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/survival.csv")df_suicides = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/reference_suicides.csv")# Calculate suicides among cancer patientsdf_survival['attained_age'] = df_survival['age_category_at_death']df_survival['sex'] = df_survival['gender']df_survival['year'] = pd.to_datetime(df_survival['vital_status_date']).dt.yeardf_survival['suicide'] = df_survival['suicide'] -1# Group by and summarizegrouped = df_survival.groupby(['year', 'sex', 'attained_age', 'ethnicity'])cancer_suicides = grouped.agg( person_years=pd.NamedAgg(column='time', aggfunc=lambda x: round(x.sum())), suicide_deaths=pd.NamedAgg(column='suicide', aggfunc='sum')).reset_index()cancer_suicides.dropna(inplace=True)cancer_suicides = cancer_suicides.drop_duplicates()cancer_suicides['sex'] = cancer_suicides['sex'].astype(str)cancer_suicides['attained_age'] = cancer_suicides['attained_age'].astype(str)# Prepare for merging by ensuring alignment of the key columns used for joincancer_suicides = cancer_suicides.merge(df_suicides, on=['year', 'sex', 'attained_age'], how='left')# Handle potential NaN values from the merge before calculationcancer_suicides['suicides_per_100000'] = cancer_suicides['suicides_per_100000']# Calculate expected suicidescancer_suicides['expected_suicides'] = (cancer_suicides['person_years'] /10000) * cancer_suicides['suicides_per_100000']# Aggregate for SMR calculationgrouped_smr = cancer_suicides.groupby('ethnicity')ethnicity_smr = grouped_smr.agg( observed_suicides=pd.NamedAgg(column='suicide_deaths', aggfunc='sum'), expected_suicides=pd.NamedAgg(column='expected_suicides', aggfunc=lambda x: round(x.sum())), person_years=pd.NamedAgg(column='person_years', aggfunc='sum')).reset_index()ethnicity_smr['smr'] = np.round(ethnicity_smr['observed_suicides'] / ethnicity_smr['expected_suicides'], 2)ethnicity_smr['aer'] = np.round((ethnicity_smr['observed_suicides'] - ethnicity_smr['expected_suicides']) / ethnicity_smr['person_years'] *10000, 2)# Display as a table with custom column namesethnicity_smr_table = ethnicity_smr[['ethnicity', 'person_years', 'observed_suicides', 'expected_suicides', 'smr', 'aer']]ethnicity_smr_table.columns = ['Ethnicity', 'Person-years observed', 'Observed suicide deaths', 'Expected suicide deaths', 'Standardised mortality ratio', 'Absolute excess risk per 10 000']# Display the tableMarkdown(tabulate(ethnicity_smr_table, headers='keys'))
Ethnicity
Person-years observed
Observed suicide deaths
Expected suicide deaths
Standardised mortality ratio
Absolute excess risk per 10 000
0
Asian
16455
39
17
2.29
13.37
1
Black
12175
26
13
2
10.68
2
Mixed
2969
4
3
1.33
3.37
3
Not Stated
22998
44
25
1.76
8.26
4
Other
10059
14
10
1.4
3.98
5
Unknown
3952
11
4
2.75
17.71
6
White
572763
1028
608
1.69
7.33
References
Henson, K. E., Brock, R., Charnock, J., Wickramasinghe, B., Will, O., & Pitman, A. (2019). Risk of suicide after cancer diagnosis in england. JAMA Psychiatry, 76(1), 51. https://doi.org/10.1001/jamapsychiatry.2018.3181
Naing, N. N. (2000). Easy way to learn standardization: Direct and indirect methods. The Malaysian Journal of Medical Sciences: MJMS, 7(1), 10.