Risk of Suicide After Cancer Diagnosis in England

Examplar code using R and Python

Author

Affiliations

Justin C Yang

University College London

Camden & Islington NHS Foundation Trust

Published

May 9, 2024

The code below aims to mimic some of the outputs of (Henson et al., 2019), namely, to calculate standardised mortality ratios comparing suicide deaths among individuals diagnosed with cancer with the general population.

This code mimics but does not fully replicate the code made available by the NHS England for the study because this code was written without access to the underlying datasets. Instead, this code was written using the Simulacrum, a synthetic dataset which imitates some of the data used in the original study. It is important to note that the Simulacrum is missing several key variables which this code aims to reproducibly create but will not produce valid results. Consequently, results are not intended to be interpreted and may not make sense given prior knowledge. For example, patient ages were drawn from a random distribution as age/date of birth variables are not available in Simulacrum. To understand what the authors in (Henson et al., 2019) did more fully, I would recommend consulting their annotated code.

Standardised mortality rates for suicide deaths for England were obtained from the Office for National Statistics.

The generalised approach for this analysis is to:

Calculate suicide mortality rates by age and sex among patients diagnosed with cancer (likely using cancer or disease registry data)
Calculate suicide mortality rates by age and sex among the general population (likely using civil mortality registers)
Calculate standardised mortality ratios and absolute excess risks using these mortality rates

For an introduction to direct and indirect standardisation methods, you can consult (Naing, 2000).

Many thanks to Chloe Bright who provided me with the original Stata code for this analysis and to Alexandra Pitman for supporting this work.

1 Set up

In the first instance, you’ll need access to data on suicide deaths among patients with cancer and suicide deaths among the general population. We will be standardising by sex and age group so we will need to first prepare the data accordingly. In the R code below, we use functions from the Tidyverse but you can just as easily do this in base R or using different functions.

First, we process data on suicide deaths among patients with cancer, including recoding cancer groups and underlying causes of death (presumably using ICD codes). As well, we need to note the diagnosis date (which serves as the start date) and the date of exit from the cohort (i.e. exit due to death, loss to follow-up, or end of follow-up). In the Simulacrum dataset, this is relatively straightforward because we are using the date of diagnosis up to the date of vital status but other datasets will need to note other potential exit causes (e.g. loss to follow-up). Once we have recoded and cleaned the data among cancer patients, we can set it as survival data to be able to derive the number of person-years observed.

Next, we need to process data on suicide deaths in the general population. This allows us to calculate the expected number of suicide deaths based on population rates. You are likely to do this using civil mortality registers or other information provided by national statistics offices about suicide deaths in the general population, standardised by age and sex.

R Code

# Load required libraries
library(arrow)
library(beepr)
library(bibtex)
library(blob)
library(data.table)
library(dplyr)
library(dtplyr)
library(feather)
library(finalfit)
library(forcats)
library(foreign)
library(ggplot2)
library(glue)
library(haven)
library(here)
library(Hmisc)
library(hms)
library(janitor)
library(kableExtra)
library(lubridate)
library(magrittr)
library(multidplyr)
library(parallel)
library(purrr)
library(reticulate)
library(readr)
library(readxl)
library(renv)
library(stringi)
library(stringr)
library(survival)
library(tibble)
library(tidyr)
library(xlsx)

# Set seed for reproducibility
## This will ensure randomly generated variables are reproducible 
set.seed(1223)

# Set clusters
## Because data used in these analyses can be quite large, I am making use of a number of packages which help, including packages to parallelize functions where possible and to load data in columnar format 
## This function will set clusters to n-1 where n is the number of cores on your machine
cluster <- new_cluster(parallel::detectCores() - 1)

# Load data files in Arrow
## Arrow is used here to help load in very large data files and to process them
## More detail is available at https://arrow.apache.org/
sim_av_patient <-
  open_csv_dataset(here("simulacrum_v2.1.0", "Data", "sim_av_patient.csv"))

sim_av_tumour <-
  open_csv_dataset(here("simulacrum_v2.1.0", "Data", "sim_av_tumour.csv"))

# Data extraction from source
## We define the exclusions here and use their patient IDs later to remove them from the cohort
exclusions <- sim_av_patient |>
  select(PATIENTID, VITALSTATUS) |>
  collect() |>
  janitor::clean_names() |>
  distinct() |>
  dplyr::filter(
    str_detect(vitalstatus, "A[123]") |
      str_detect(vitalstatus, "D[145]") |
      str_detect(vitalstatus, "X1") |
      str_detect(vitalstatus, "I")
  ) |>
  select(patientid)

# Here we extract information about cancer patients, including causes of death related to suicide 
patient <- sim_av_patient |>
  select(
    PATIENTID,
    GENDER,
    ETHNICITY,
    VITALSTATUS,
    VITALSTATUSDATE,
    DEATHCAUSECODE_1A,
    DEATHCAUSECODE_1B,
    DEATHCAUSECODE_1C,
    DEATHCAUSECODE_2,
    DEATHCAUSECODE_UNDERLYING
  ) |>
  collect() |>
  janitor::clean_names() |>
  distinct() |>
  anti_join(exclusions) |>
  mutate(
    suicide_cause_2 =
      case_when(
        str_detect(deathcausecode_1a, "Y339") ~ NA,
        str_detect(deathcausecode_1a, "E9888") ~ NA,
        str_detect(deathcausecode_1a, "X6[0123456789]") ~ "1A",
        str_detect(deathcausecode_1a, "X7[0123456789]") ~ "1A",
        str_detect(deathcausecode_1a, "X8[01234]") ~ "1A",
        str_detect(deathcausecode_1a, "Y1[0123456789]") ~ "1A",
        str_detect(deathcausecode_1a, "Y2[0123456789]") ~ "1A",
        str_detect(deathcausecode_1a, "Y3[01234]") ~ "1A",
        str_detect(deathcausecode_1a, "E98") ~ "1A",
        str_detect(deathcausecode_1a, "E95[0123456789]") ~ "1A",
        str_detect(deathcausecode_1a, "E87[02]") ~ "1A",
        str_detect(deathcausecode_1b, "Y339") ~ NA,
        str_detect(deathcausecode_1b, "E9888") ~ NA,
        str_detect(deathcausecode_1b, "X6[0123456789]") ~ "1B",
        str_detect(deathcausecode_1b, "X7[0123456789]") ~ "1B",
        str_detect(deathcausecode_1b, "X8[01234]") ~ "1B",
        str_detect(deathcausecode_1b, "Y1[0123456789]") ~ "1B",
        str_detect(deathcausecode_1b, "Y2[0123456789]") ~ "1B",
        str_detect(deathcausecode_1b, "Y3[01234]") ~ "1B",
        str_detect(deathcausecode_1b, "E98") ~ "1B",
        str_detect(deathcausecode_1b, "E95[0123456789]") ~ "1B",
        str_detect(deathcausecode_1b, "E87[02]") ~ "1B",
        str_detect(deathcausecode_1c, "Y339") ~ NA,
        str_detect(deathcausecode_1c, "E9888") ~ NA,
        str_detect(deathcausecode_1c, "X6[0123456789]") ~ "1C",
        str_detect(deathcausecode_1c, "X7[0123456789]") ~ "1C",
        str_detect(deathcausecode_1c, "X8[01234]") ~ "1C",
        str_detect(deathcausecode_1c, "Y1[0123456789]") ~ "1C",
        str_detect(deathcausecode_1c, "Y2[0123456789]") ~ "1C",
        str_detect(deathcausecode_1c, "Y3[01234]") ~ "1C",
        str_detect(deathcausecode_1c, "E98") ~ "1C",
        str_detect(deathcausecode_1c, "E95[0123456789]") ~ "1C",
        str_detect(deathcausecode_1c, "E87[02]") ~ "1C",
        str_detect(deathcausecode_2, "Y339") ~ NA,
        str_detect(deathcausecode_2, "E9888") ~ NA,
        str_detect(deathcausecode_2, "X6[0123456789]") ~ "2",
        str_detect(deathcausecode_2, "X7[0123456789]") ~ "2",
        str_detect(deathcausecode_2, "X8[01234]") ~ "2",
        str_detect(deathcausecode_2, "Y1[0123456789]") ~ "2",
        str_detect(deathcausecode_2, "Y2[0123456789]") ~ "2",
        str_detect(deathcausecode_2, "Y3[01234]") ~ "2",
        str_detect(deathcausecode_2, "E98") ~ "2",
        str_detect(deathcausecode_2, "E95[0123456789]") ~ "2",
        str_detect(deathcausecode_2, "E87[02]") ~ "2",
        str_detect(deathcausecode_underlying, "Y339") ~ NA,
        str_detect(deathcausecode_underlying, "E9888") ~ NA,
        str_detect(deathcausecode_underlying, "X6[0123456789]") ~ "Underlying",
        str_detect(deathcausecode_underlying, "X7[0123456789]") ~ "Underlying",
        str_detect(deathcausecode_underlying, "X8[01234]") ~ "Underlying",
        str_detect(deathcausecode_underlying, "Y1[0123456789]") ~ "Underlying",
        str_detect(deathcausecode_underlying, "Y2[0123456789]") ~ "Underlying",
        str_detect(deathcausecode_underlying, "Y3[01234]") ~ "Underlying",
        str_detect(deathcausecode_underlying, "E98") ~ "Underlying",
        str_detect(deathcausecode_underlying, "E95[0123456789]") ~ "Underlying",
        str_detect(deathcausecode_underlying, "E87[02]") ~ "Underlying",
        TRUE ~ NA
      )
  ) |>
  mutate(
    suicide_2 =
      case_when(
        str_detect(deathcausecode_1a, "Y339") ~ 0,
        str_detect(deathcausecode_1a, "E9888") ~ 0,
        str_detect(deathcausecode_1a, "X6[0123456789]") ~ 1,
        str_detect(deathcausecode_1a, "X7[0123456789]") ~ 1,
        str_detect(deathcausecode_1a, "X8[01234]") ~ 1,
        str_detect(deathcausecode_1a, "Y1[0123456789]") ~ 1,
        str_detect(deathcausecode_1a, "Y2[0123456789]") ~ 1,
        str_detect(deathcausecode_1a, "Y3[01234]") ~ 1,
        str_detect(deathcausecode_1a, "E98") ~ 1,
        str_detect(deathcausecode_1a, "E95[0123456789]") ~ 1,
        str_detect(deathcausecode_1a, "E87[02]") ~ 1,
        str_detect(deathcausecode_1b, "Y339") ~ 0,
        str_detect(deathcausecode_1b, "E9888") ~ 0,
        str_detect(deathcausecode_1b, "X6[0123456789]") ~ 1,
        str_detect(deathcausecode_1b, "X7[0123456789]") ~ 1,
        str_detect(deathcausecode_1b, "X8[01234]") ~ 1,
        str_detect(deathcausecode_1b, "Y1[0123456789]") ~ 1,
        str_detect(deathcausecode_1b, "Y2[0123456789]") ~ 1,
        str_detect(deathcausecode_1b, "Y3[01234]") ~ 1,
        str_detect(deathcausecode_1b, "E98") ~ 1,
        str_detect(deathcausecode_1b, "E95[0123456789]") ~ 1,
        str_detect(deathcausecode_1b, "E87[02]") ~ 1,
        str_detect(deathcausecode_1c, "Y339") ~ 0,
        str_detect(deathcausecode_1c, "E9888") ~ 0,
        str_detect(deathcausecode_1c, "X6[0123456789]") ~ 1,
        str_detect(deathcausecode_1c, "X7[0123456789]") ~ 1,
        str_detect(deathcausecode_1c, "X8[01234]") ~ 1,
        str_detect(deathcausecode_1c, "Y1[0123456789]") ~ 1,
        str_detect(deathcausecode_1c, "Y2[0123456789]") ~ 1,
        str_detect(deathcausecode_1c, "Y3[01234]") ~ 1,
        str_detect(deathcausecode_1c, "E98") ~ 1,
        str_detect(deathcausecode_1c, "E95[0123456789]") ~ 1,
        str_detect(deathcausecode_1c, "E87[02]") ~ 1,
        str_detect(deathcausecode_2, "Y339") ~ 0,
        str_detect(deathcausecode_2, "E9888") ~ 0,
        str_detect(deathcausecode_2, "X6[0123456789]") ~ 1,
        str_detect(deathcausecode_2, "X7[0123456789]") ~ 1,
        str_detect(deathcausecode_2, "X8[01234]") ~ 1,
        str_detect(deathcausecode_2, "Y1[0123456789]") ~ 1,
        str_detect(deathcausecode_2, "Y2[0123456789]") ~ 1,
        str_detect(deathcausecode_2, "Y3[01234]") ~ 1,
        str_detect(deathcausecode_2, "E98") ~ 1,
        str_detect(deathcausecode_2, "E95[0123456789]") ~ 1,
        str_detect(deathcausecode_2, "E87[02]") ~ 1,
        str_detect(deathcausecode_underlying, "Y339") ~ 0,
        str_detect(deathcausecode_underlying, "E9888") ~ 0,
        str_detect(deathcausecode_underlying, "X6[0123456789]") ~ 1,
        str_detect(deathcausecode_underlying, "X7[0123456789]") ~ 1,
        str_detect(deathcausecode_underlying, "X8[01234]") ~ 1,
        str_detect(deathcausecode_underlying, "Y1[0123456789]") ~ 1,
        str_detect(deathcausecode_underlying, "Y2[0123456789]") ~ 1,
        str_detect(deathcausecode_underlying, "Y3[01234]") ~ 1,
        str_detect(deathcausecode_underlying, "E98") ~ 1,
        str_detect(deathcausecode_underlying, "E95[0123456789]") ~ 1,
        str_detect(deathcausecode_underlying, "E87[02]") ~ 1,
        TRUE ~ 0
      )
  ) |>
  mutate(
    death_cause_flag =
      case_when(
        str_detect(vitalstatus, "A") ~ "Alive",
        (
          !str_detect(vitalstatus, "A") &
            is.na(deathcausecode_1a) &
            is.na(deathcausecode_1b) &
            is.na(deathcausecode_1c) &
            is.na(deathcausecode_2) &
            is.na(deathcausecode_underlying)
        ) ~ "Dead without Cause",
        (
          !str_detect(vitalstatus, "A") &
            (
              !is.na(deathcausecode_1a) |
                !is.na(deathcausecode_1b) |
                !is.na(deathcausecode_1c) |
                !is.na(deathcausecode_2) |
                !is.na(deathcausecode_underlying)
            )
        ) ~ "Dead with Cause",
        TRUE ~ NA
      )
  ) |>
  as.data.table()

# Here we identify the patient's most recently diagnosed cancer 
tumour <- sim_av_tumour |>
  select(
    PATIENTID,
    TUMOURID,
    SITE_ICD10_O2_3CHAR,
    DIAGNOSISDATEBEST,
    STAGE_BEST,
    QUINTILE_2019,
    AGE
  ) |>
  collect() |>
  janitor::clean_names() |>
  distinct() |>
  anti_join(exclusions) |>
  rename(
    tumour_site = site_icd10_o2_3char,
    deprivation_quintile = quintile_2019,
    age_at_diagnosis = age
  ) |>
  as.data.table()

# Dataset merging
## We now merge information from the two source tables and annotate the new dataset
## Annotations are intended to match the original Stata code developed for this analysis as closely as possible, but you will note that there are some deviations
dataset <- merge(patient, tumour) |>
  anti_join(exclusions) |>
  filter(gender != 9) |>
  filter(!(gender == 2 & str_detect(tumour_site, "C6[0123]"))) |>
  filter(!(gender == 1 &
             str_detect(tumour_site, "C5[12345678]"))) |>
  filter(str_detect(tumour_site, "^C") &
           !str_detect(tumour_site, "C44")) |>
  group_by(patientid) |>
  arrange(desc(diagnosisdatebest)) |>
  slice_head() |>
  rename(
    patient_id = patientid,
    gender_code = gender,
    ethnicity_code = ethnicity,
    vital_status_code = vitalstatus,
    vital_status_date = vitalstatusdate,
    suicide = suicide_2,
    suicide_cause = suicide_cause_2,
    tumour_id = tumourid,
    diagnosis_date = diagnosisdatebest,
    stage = stage_best,
  ) |>
  ungroup() |>
  mutate(
    cancer_group = case_when(
      str_detect(tumour_site, "C0[0123456789]") ~ "Head and neck",
      str_detect(tumour_site, "C1[01234]") ~ "Head and neck",
      str_detect(tumour_site, "C15") ~ "Oesophagus",
      str_detect(tumour_site, "C16") ~ "Stomach",
      str_detect(tumour_site, "C17") ~ "Other malignant neoplasms",
      str_detect(tumour_site, "C1[89]") ~ "Colorectal",
      str_detect(tumour_site, "C20") ~ "Colorectal",
      str_detect(tumour_site, "C22") ~ "Liver",
      str_detect(tumour_site, "C25") ~ "Pancreas",
      str_detect(tumour_site, "C3[12]") ~ "Head and neck",
      str_detect(tumour_site, "C3[34]") ~ "Lung",
      str_detect(tumour_site, "C4[019]") ~ "Sarcoma",
      str_detect(tumour_site, "C43") ~ "Melanoma",
      str_detect(tumour_site, "C45") ~ "Mesothelioma",
      str_detect(tumour_site, "C50") ~ "Breast",
      str_detect(tumour_site, "C53") ~ "Cervix",
      str_detect(tumour_site, "C5[45]") ~ "Uterus",
      str_detect(tumour_site, "C5[67]") ~ "Ovary",
      str_detect(tumour_site, "C61") ~ "Prostate",
      str_detect(tumour_site, "C62") ~ "Testis",
      str_detect(tumour_site, "C6[4568]") ~ "Kidney and unspecified urinary organs",
      str_detect(tumour_site, "C67") ~ "Bladder",
      str_detect(tumour_site, "C7[012]") ~ "Central Nervous System (incl brain)",
      str_detect(tumour_site, "C73") ~ "Head and neck",
      str_detect(tumour_site, "C7[789]") ~ "Cancer of Unknown Primary",
      str_detect(tumour_site, "C80") ~ "Cancer of Unknown Primary",
      str_detect(tumour_site, "C81") ~ "Hodgkin lymphoma",
      str_detect(tumour_site, "C8[2345]") ~ "Non-Hodgkin lymphoma",
      str_detect(tumour_site, "C88") ~ "Multiple myeloma",
      str_detect(tumour_site, "C90") ~ "Multiple myeloma",
      str_detect(tumour_site, "C9[12345]") ~ "Leukaemia",
      TRUE ~ "Other malignant neoplasms"
    )
  ) |>
  mutate(
    gender = fcase(
      gender_code == 1,
      "Male",
      gender_code == 2,
      "Female",
      gender_code == 9,
      "Indeterminate",
      gender_code == "X",
      NA
    )
  ) |>
  mutate(
    ethnicity = fcase(
      ethnicity_code == "0",
      "White",
      ethnicity_code == "A",
      "White",
      ethnicity_code == "B",
      "White",
      ethnicity_code == "C",
      "White",
      ethnicity_code == "CA",
      "Unknown",
      ethnicity_code == "CP",
      "White",
      ethnicity_code == "D",
      "Mixed",
      ethnicity_code == "E",
      "Mixed",
      ethnicity_code == "F",
      "Mixed",
      ethnicity_code == "G",
      "Mixed",
      ethnicity_code == "H",
      "Asian",
      ethnicity_code == "J",
      "Asian",
      ethnicity_code == "K",
      "Asian",
      ethnicity_code == "L",
      "Asian",
      ethnicity_code == "M",
      "Black",
      ethnicity_code == "N",
      "Black",
      ethnicity_code == "P",
      "Black",
      ethnicity_code == "R",
      "Other",
      ethnicity_code == "S",
      "Other",
      ethnicity_code == "X",
      "Unknown",
      ethnicity_code == "Z",
      "Not Stated"
    )
  ) |>
  mutate(
    vital_status = fcase(
      vital_status_code == "A",
      "Alive",
      vital_status_code == "D",
      "Dead",
      vital_status_code == "D3",
      "Dies before diagnosis",
      vital_status_code == "X2",
      "Patient may have been returned and successfully traced",
      vital_status_code == "X4",
      "Patient may have been returned but has not been successfully traced",
      vital_status_code == "X5",
      "Loss to follow up"
    )
  ) |>
  mutate(diagnosis_year = factor(year(diagnosis_date))) |>
  as.data.table()

## Generate random ages and derive age-based variables
age_breaks <-
  c(0, 19, 24, 29, 34, 39, 44, 49, 54, 59, 64, 69, 74, 79, 84, 89, 999)
age_labels <-
  c(
    "15-19",
    "20-24",
    "25-29",
    "30-34",
    "35-39",
    "40-44",
    "45-49",
    "50-54",
    "55-59",
    "60-64",
    "65-69",
    "70-74",
    "75-79",
    "80-84",
    "85-89",
    "90+"
  )

dataset <- dataset |>
  partition(cluster) |>
  group_by(patient_id) |>
  mutate(age_at_diagnosis = floor(rnorm(
    n = 1, mean = 65, sd = 12.5
  ))) |>
  collect() |>
  ungroup() |>
  mutate(
    age_at_diagnosis = case_when(
      age_at_diagnosis < 18 ~ 18,
      age_at_diagnosis > 99 ~ 99,
      TRUE ~ age_at_diagnosis
    )
  ) |>
  mutate(birth_date = diagnosis_date - lubridate::years(age_at_diagnosis) - lubridate::days(floor(runif(
    n = 1, min = 1, max = 10
  )))) |>
  mutate(
    age_category_at_diagnosis = cut(
      age_at_diagnosis,
      breaks = age_breaks,
      labels = age_labels,
      right = TRUE
    )
  ) |>
  mutate(death_date = case_when(vital_status == "Dead" ~ vital_status_date,
                                TRUE ~ NA)) |>
  mutate(age_at_death = floor(as.numeric(
    difftime(death_date, birth_date, unit = "days") /
      365.25
  ))) |>
  mutate(
    age_category_at_death = cut(
      age_at_death,
      breaks = age_breaks,
      labels = age_labels,
      right = TRUE
    )
  ) |>
  mutate(age_category_at_death = case_when(vital_status != "Dead" ~ NA,
                                           TRUE ~ age_category_at_death)) |>
  mutate(suicide_death_date = case_when(suicide == 1 ~ vital_status_date,
                                        TRUE ~ NA)) |>
  mutate(age_at_suicide_death = floor(as.numeric(
    difftime(suicide_death_date, birth_date, unit = "days") /
      365.25
  ))) |>
  mutate(
    age_category_at_suicide_death = cut(
      age_at_suicide_death,
      breaks = age_breaks,
      labels = age_labels,
      right = TRUE
    )
  ) |>
  mutate(
    age_category_at_suicide_death = case_when(suicide == 0 ~ NA,
                                              TRUE ~ age_category_at_suicide_death)
  )

## Calculate months since diagnosis
months_breaks <- c(-1, 5, 11, 23, 35, 59, 119, 999)
months_labels <-
  c("0-5 mo", "6-11 mo", "12-23 mo", "24-35mo", "3-4y", "5-9y", "10+ y")

dataset <- dataset |>
  mutate(months_from_diagnosis = floor(as.numeric(
    difftime(vital_status_date, diagnosis_date, units = c("weeks"))
  ) / 4)) |>
  mutate(months_from_diagnosis = case_when(months_from_diagnosis < 0 ~ 0,
                                           TRUE ~ months_from_diagnosis)) |>
  mutate(
    months_category_from_diagnosis = cut(
      months_from_diagnosis,
      breaks = months_breaks,
      labels = months_labels,
      right = TRUE
    )
  )

# Reverse deprivation quintiles
## Deprivation is sometimes reported from least to most deprived or from most to least deprived. Here, we reverse the quintiles to match what is presented in the JAMA Psychiatry publication. 
dataset <- dataset |>
  mutate(
    deprivation_quintile_reversed = case_when(
      deprivation_quintile == "1 - most deprived" ~ "5, Most",
      deprivation_quintile == "2" ~ "4",
      deprivation_quintile == "3" ~ "3",
      deprivation_quintile == "4" ~ "2",
      deprivation_quintile == "5 - least deprived" ~ "1, Least",
    )
  )

# Survival analysis
## Here we shape the data as survival data, following up from the diagnosis date to the date of the vital status update 
survival <- dataset |>
  mutate(time = as.numeric(difftime(vital_status_date, diagnosis_date, unit = "days"))) |>
  filter(time >= 0) |>
  mutate(time = case_when(time == 0 ~ 1,
                          TRUE ~ time)) |>
  mutate(time = as.numeric(time)/365.25) |>
  mutate(suicide = suicide + 1) |>
  mutate(
    gender = factor(gender, levels = c("Male", "Female")),
    ethnicity = factor(
      ethnicity,
      levels = c(
        "White",
        "Asian" ,
        "Black",
        "Mixed",
        "Other",
        "Unknown",
        "Not Stated"
      )
    ),
    cancer_group = as.factor(cancer_group),
    deprivation_quintile_reversed = as.factor(deprivation_quintile_reversed),
  )

# Variable labels
## Here we apply a number of variable labels for tidiness
label(survival$age_at_death) <- "Attained age at death"
label(survival$age_at_diagnosis) <-
  "Attained age at cancer diagnosis, y"
label(survival$age_at_suicide_death) <-
  "Attained age at suicide death, y"
label(survival$age_category_at_death) <- "Age at death (attained age), y"
label(survival$age_category_at_diagnosis) <-
  "Age category at cancer diagnosis"
label(survival$age_category_at_suicide_death) <-
  "Age category at suicide death"
label(survival$birth_date) <- "Birth date"
label(survival$cancer_group) <- "Last primary cancer"
label(survival$deathcausecode_1a) <- "Death cause code - 1A"
label(survival$deathcausecode_1b) <- "Death cause code - 1B"
label(survival$deathcausecode_1c) <- "Death cause code - 1C"
label(survival$deathcausecode_2) <- "Death cause code - 2"
label(survival$deathcausecode_underlying) <-
  "Death cause code - Underlying"
label(survival$death_cause_flag) <- "Alive or Dead"
label(survival$death_date) <- "Date of death"
label(survival$deprivation_quintile) <- "Deprivation"
label(survival$deprivation_quintile_reversed) <- "Deprivation"
label(survival$diagnosis_date) <- "Date of Diagnosis"
label(survival$diagnosis_year) <- "Year of cancer diagnosis"
label(survival$ethnicity) <- "Ethnicity"
label(survival$ethnicity_code) <- "Ethnicity code"
label(survival$gender) <- "Sex"
label(survival$gender_code) <- "Gender code"
label(survival$months_from_diagnosis) <- "Time from cancer diagnosis"
label(survival$months_category_from_diagnosis) <- "Time from cancer diagnosis"
label(survival$patient_id) <- "Patient ID"
label(survival$stage) <- "Tumour stage"
label(survival$suicide) <- "Suicide-related death (Y/N)"
label(survival$suicide_cause) <-
  "Suicide-related death listing on death certificate"
label(survival$suicide_death_date) <- "Date of suicide death"
label(survival$time) <- "Person-years observed"
label(survival$tumour_id) <- "Tumour ID"
label(survival$tumour_site) <- "Tumour site"
label(survival$vital_status) <- "Vital Status"
label(survival$vital_status_code) <- "Vital status code"
label(survival$vital_status_date) <- "Vital status date"

# Setting variables
## Here we define our explanatory and dependent variables 
## Note that the dependent function is a survival function 
explanatory <- c("gender",
                 "months_category_from_diagnosis",
                 "age_category_at_death",
                 "cancer_group",
                 "deprivation_quintile_reversed",
                 "ethnicity",
                 "diagnosis_year")

dependent <- "Surv(time, suicide)"

# Load in ONS suicide data
## This is the data for suicides in the general population which we will use to standardise the suicide rates among people with cancer
## Notably, this data was provided as Excel sheets which require a fair amount of cleaning, so some of the code may seem idiosyncratic because we have to deal with some of the strangeness of trying to import and reshape Excel data for our needs

suicides2022 <- read_xlsx(
  here("suicides2022.xlsx"),
  sheet = "Table_6",
  range = "A6:CK132",
  col_names = TRUE
) |>
  janitor::clean_names() %>%
  rename_with(~ str_remove(., "_note.*")) |>
  select(-contains("x10_14_"))

# Clean up observed suicides
observed_suicides <- suicides2022 |>
  rename(year = year_of_death_registration) |>
  select(-area_code, -area_of_usual_residence) |>
  select(!contains("_ucl") &
           !contains("_lcl") & !contains("_marker")) %>%
  rename_with( ~ str_replace(., "^x", "")) |>
  group_by(sex, year) |>
  select(contains("_number_of_deaths")) |>
  pivot_longer(cols = ends_with("_number_of_deaths")) |>
  rename(attained_age = name, observed_suicides = value) |>
  mutate(attained_age = str_replace(attained_age, "_number_of_deaths", "")) |>
  mutate(attained_age = str_replace(attained_age, "_", "-"))

# Clean up standardised suicide rates
observed_suicide_rates <- suicides2022 |>
  rename(year = year_of_death_registration) |>
  select(-area_code, -area_of_usual_residence) |>
  select(!contains("_ucl") &
           !contains("_lcl") & !contains("_marker")) %>%
  rename_with( ~ str_replace(., "^x", "")) |>
  group_by(sex, year) |>
  select(contains("_rate_per_100_000")) |>
  pivot_longer(cols = ends_with("_rate_per_100_000")) |>
  rename(attained_age = name, suicides_per_100000 = value) |>
  mutate(attained_age = str_replace(attained_age, "_rate_per_100_000", "")) |>
  mutate(attained_age = str_replace(attained_age, "_", "-"))

reference_suicides <- observed_suicides |>
  inner_join(observed_suicide_rates) |>
  mutate(sex = case_when(sex == "Males" ~ "Male", sex == "Females" ~ "Female", TRUE ~ sex)) |>
  mutate(attained_age =
           case_when(attained_age == "90" ~ "90+" , TRUE ~ attained_age)) |>
  select(year,
         sex,
         attained_age,
         observed_suicides,
         suicides_per_100000) |>
  rename(reference_suicides = observed_suicides)

# Save and export raw dataset
save(dataset,
     exclusions,
     survival,
     reference_suicides,
     file = here("dataset.Rdata"))

write_csv(dataset, file = here("dataset", "dataset.csv"))

write_csv(survival, file = here("dataset", "survival.csv"))

write_csv(reference_suicides,
          file = here("dataset", "reference_suicides.csv"))

Using the standardised suicide rates in the general population, we calculate the expected numbers of suicide among those diagnosed with cancer and compare them to the number of suicides actually observed. This gives us the standardised mortality ratio (SMR).

\[ \text{standardised mortality ratio}=\frac{\text{observed number of suicides}}{\text{expected number of suicides}} \]

We can also then calculate absolute excess risk (AER) per 10 000 as follows:

\[ \text{absolute excess risk per 10,000}=\frac{\text{observed number of suicides} - \text{expected number of suicides}}{\text{person-years at risk}} \]

2 Sample characteristics

These tables aims to reproduce the main outputs of Table 1 in (Henson et al., 2019). There are some important differences to note:

Age groups are categorised according to the same ones provided by the ONS standardised mortality rates since we will need to standardise using those same age groups anyway
Simulacrum covers a different time period than the authors use in (Henson et al., 2019)
Ages were randomly generated, so don’t read too much into them

R Code

# Table 1
## A quirk of the finalfit package is that survival events need to be coded as "1" and "2" rather than "0" and "1" 
survival |>
  mutate(suicide = as.factor(case_when(suicide == 2 ~ "Y", 
                                       suicide == 1 ~ "N", 
                                       TRUE ~ NA))) |>
  summary_factorlist(
    dependent = "suicide",
    explanatory = c("gender",
                 "months_category_from_diagnosis",
                 "age_category_at_death",
                 "cancer_group",
                 "deprivation_quintile_reversed",
                 "ethnicity",
                 "age_category_at_diagnosis",
                 "diagnosis_year"),
    total_col = TRUE,
    add_dependent_label = TRUE,
    add_col_totals = TRUE
  ) |>
  kableExtra::kable() |> 
  kable_classic(full_width = F) |> 
  footnote(general = "Percentages provided in parentheses.")

Dependent: suicide		N	Y	Total
Total N (%)		1189118 (99.9)	1180 (0.1)	1190298
Sex	Male	613178 (51.6)	723 (61.3)	613901 (51.6)
	Female	575940 (48.4)	457 (38.7)	576397 (48.4)
Time from cancer diagnosis	0-5 mo	212013 (17.8)	395 (33.5)	212408 (17.8)
	6-11 mo	78731 (6.6)	176 (14.9)	78907 (6.6)
	12-23 mo	97424 (8.2)	237 (20.1)	97661 (8.2)
	24-35mo	105740 (8.9)	154 (13.1)	105894 (8.9)
	3-4y	382207 (32.1)	184 (15.6)	382391 (32.1)
	5-9y	313003 (26.3)	34 (2.9)	313037 (26.3)
	10+ y	0 (0.0)	0 (0.0)	0 (0.0)
Age at death (attained age), y	15-19	56 (0.0)	0 (0.0)	56 (0.0)
	20-24	239 (0.0)	2 (0.2)	241 (0.0)
	25-29	853 (0.2)	4 (0.3)	857 (0.2)
	30-34	2622 (0.5)	4 (0.3)	2626 (0.5)
	35-39	6770 (1.3)	14 (1.2)	6784 (1.3)
	40-44	15064 (2.9)	33 (2.8)	15097 (2.9)
	45-49	28887 (5.5)	67 (5.7)	28954 (5.5)
	50-54	47163 (9.0)	101 (8.6)	47264 (9.0)
	55-59	66720 (12.7)	145 (12.3)	66865 (12.7)
	60-64	79828 (15.2)	156 (13.2)	79984 (15.2)
	65-69	81973 (15.6)	198 (16.8)	82171 (15.6)
	70-74	72311 (13.8)	152 (12.9)	72463 (13.8)
	75-79	54431 (10.4)	131 (11.1)	54562 (10.4)
	80-84	34688 (6.6)	76 (6.4)	34764 (6.6)
	85-89	19186 (3.7)	62 (5.3)	19248 (3.7)
	90+	14518 (2.8)	34 (2.9)	14552 (2.8)
Last primary cancer	Bladder	31935 (2.7)	35 (3.0)	31970 (2.7)
	Breast	178351 (15.0)	97 (8.2)	178448 (15.0)
	Cancer of Unknown Primary	8875 (0.7)	9 (0.8)	8884 (0.7)
	Central Nervous System (incl brain)	18932 (1.6)	19 (1.6)	18951 (1.6)
	Cervix	10734 (0.9)	13 (1.1)	10747 (0.9)
	Colorectal	136413 (11.5)	122 (10.3)	136535 (11.5)
	Head and neck	49813 (4.2)	108 (9.2)	49921 (4.2)
	Hodgkin lymphoma	7120 (0.6)	13 (1.1)	7133 (0.6)
	Kidney and unspecified urinary organs	42377 (3.6)	51 (4.3)	42428 (3.6)
	Leukaemia	32780 (2.8)	33 (2.8)	32813 (2.8)
	Liver	20979 (1.8)	17 (1.4)	20996 (1.8)
	Lung	152151 (12.8)	184 (15.6)	152335 (12.8)
	Melanoma	55370 (4.7)	36 (3.1)	55406 (4.7)
	Mesothelioma	8995 (0.8)	8 (0.7)	9003 (0.8)
	Multiple myeloma	21580 (1.8)	16 (1.4)	21596 (1.8)
	Non-Hodgkin lymphoma	47014 (4.0)	61 (5.2)	47075 (4.0)
	Oesophagus	29461 (2.5)	20 (1.7)	29481 (2.5)
	Other malignant neoplasms	38373 (3.2)	33 (2.8)	38406 (3.2)
	Ovary	24055 (2.0)	9 (0.8)	24064 (2.0)
	Pancreas	35001 (2.9)	22 (1.9)	35023 (2.9)
	Prostate	171745 (14.4)	204 (17.3)	171949 (14.4)
	Sarcoma	8496 (0.7)	11 (0.9)	8507 (0.7)
	Stomach	20590 (1.7)	26 (2.2)	20616 (1.7)
	Testis	7940 (0.7)	15 (1.3)	7955 (0.7)
	Uterus	30038 (2.5)	18 (1.5)	30056 (2.5)
Deprivation	1, Least	256470 (21.6)	252 (21.4)	256722 (21.6)
	2	256429 (21.6)	242 (20.5)	256671 (21.6)
	3	245122 (20.6)	247 (20.9)	245369 (20.6)
	4	220523 (18.5)	209 (17.7)	220732 (18.5)
	5, Most	210574 (17.7)	230 (19.5)	210804 (17.7)
Ethnicity	White	1039481 (88.9)	1029 (88.2)	1040510 (88.9)
	Asian	30647 (2.6)	39 (3.3)	30686 (2.6)
	Black	23013 (2.0)	26 (2.2)	23039 (2.0)
	Mixed	5735 (0.5)	4 (0.3)	5739 (0.5)
	Other	18664 (1.6)	14 (1.2)	18678 (1.6)
	Unknown	7795 (0.7)	11 (0.9)	7806 (0.7)
	Not Stated	44147 (3.8)	44 (3.8)	44191 (3.8)
Age category at cancer diagnosis	15-19	179 (0.0)	0 (0.0)	179 (0.0)
	20-24	626 (0.1)	2 (0.2)	628 (0.1)
	25-29	2291 (0.2)	4 (0.3)	2295 (0.2)
	30-34	6586 (0.6)	8 (0.7)	6594 (0.6)
	35-39	17369 (1.5)	12 (1.0)	17381 (1.5)
	40-44	38011 (3.2)	37 (3.1)	38048 (3.2)
	45-49	72013 (6.1)	81 (6.9)	72094 (6.1)
	50-54	115164 (9.7)	104 (8.8)	115268 (9.7)
	55-59	157842 (13.3)	151 (12.8)	157993 (13.3)
	60-64	184585 (15.5)	174 (14.7)	184759 (15.5)
	65-69	184510 (15.5)	192 (16.3)	184702 (15.5)
	70-74	157836 (13.3)	141 (11.9)	157977 (13.3)
	75-79	115375 (9.7)	121 (10.3)	115496 (9.7)
	80-84	71405 (6.0)	78 (6.6)	71483 (6.0)
	85-89	38225 (3.2)	44 (3.7)	38269 (3.2)
	90+	27101 (2.3)	31 (2.6)	27132 (2.3)
Year of cancer diagnosis	2016	290193 (24.4)	298 (25.3)	290491 (24.4)
	2017	290613 (24.4)	301 (25.5)	290914 (24.4)
	2018	305343 (25.7)	289 (24.5)	305632 (25.7)
	2019	302969 (25.5)	292 (24.7)	303261 (25.5)
Note:
Percentages provided in parentheses.

We can also output an analogous table using Python.

Python Code

# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib as mpl
import tableone as tbl1
import tabulate
from IPython.display import Markdown

# Import dataset
df = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/survival.csv")

## Redefine suicide == 0 and suicide == 1 
df.eval("suicide = suicide - 1", inplace=True)

# Annotate dataset
columns = ['gender', 'months_category_from_diagnosis', 'age_category_at_death', 'cancer_group', 'deprivation_quintile_reversed', 'ethnicity', 'age_category_at_diagnosis', 'diagnosis_year']

categorical = ['gender', 'months_category_from_diagnosis', 'age_category_at_death', 'cancer_group', 'deprivation_quintile_reversed', 'ethnicity', 'age_category_at_diagnosis', 'diagnosis_year']

order = {
  'gender': ['Male', 'Female'],
  'months_category_from_diagnosis': ['0-5 mo', '6-11 mo', '12-23 mo', '24-35mo', '3-4y', '5-9y', '10+ y'],
  'age_category_at_death': ['15-19', '20-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79', '80-84', '85-89', '90+'],
  'ethnicity': ['White', 'Asian', 'Black', 'Mixed', 'Other', 'Unknown', 'Not Stated']
  }

labels = {
  'gender': 'Sex',
  'suicide': 'Suicide Death',
  'months_category_from_diagnosis': 'Time from cancer diagnosis',
  'age_category_at_death': 'Age at death (attained age), y',
  'cancer_group': 'Last primary cancer',
  'deprivation_quintile_reversed': 'Deprivation',
  'ethnicity': 'Ethnicity',
  'age_category_at_diagnosis': 'Age category at cancer diagnosis',
  'diagnosis_year': 'Year of cancer diagnosis'
  }

groupby = ['suicide']

#Produce Table One
mytable = tbl1.TableOne(df, columns = columns, rename = labels, order = order, categorical = categorical, groupby = groupby)

Markdown(mytable.tabulate(tablefmt="grid"))

		Missing	Overall	0	1
n			1190298	1189118	1180
Sex, n (%)	Male	0	613901 (51.6)	613178 (51.6)	723 (61.3)
	Female		576397 (48.4)	575940 (48.4)	457 (38.7)
Time from cancer diagnosis, n (%)	0-5 mo	0	212408 (17.8)	212013 (17.8)	395 (33.5)
	6-11 mo		78907 (6.6)	78731 (6.6)	176 (14.9)
	12-23 mo		97661 (8.2)	97424 (8.2)	237 (20.1)
	24-35mo		105894 (8.9)	105740 (8.9)	154 (13.1)
	3-4y		382391 (32.1)	382207 (32.1)	184 (15.6)
	5-9y		313037 (26.3)	313003 (26.3)	34 (2.9)
Age at death (attained age), y, n (%)	15-19	663810	56 (0.0)	56 (0.0)
	20-24		241 (0.0)	239 (0.0)	2 (0.2)
	25-29		857 (0.2)	853 (0.2)	4 (0.3)
	30-34		2626 (0.5)	2622 (0.5)	4 (0.3)
	35-39		6784 (1.3)	6770 (1.3)	14 (1.2)
	40-44		15097 (2.9)	15064 (2.9)	33 (2.8)
	45-49		28954 (5.5)	28887 (5.5)	67 (5.7)
	50-54		47264 (9.0)	47163 (9.0)	101 (8.6)
	55-59		66865 (12.7)	66720 (12.7)	145 (12.3)
	60-64		79984 (15.2)	79828 (15.2)	156 (13.2)
	65-69		82171 (15.6)	81973 (15.6)	198 (16.8)
	70-74		72463 (13.8)	72311 (13.8)	152 (12.9)
	75-79		54562 (10.4)	54431 (10.4)	131 (11.1)
	80-84		34764 (6.6)	34688 (6.6)	76 (6.4)
	85-89		19248 (3.7)	19186 (3.7)	62 (5.3)
	90+		14552 (2.8)	14518 (2.8)	34 (2.9)
Last primary cancer, n (%)	Bladder	0	31970 (2.7)	31935 (2.7)	35 (3.0)
	Breast		178448 (15.0)	178351 (15.0)	97 (8.2)
	Cancer of Unknown Primary		8884 (0.7)	8875 (0.7)	9 (0.8)
	Central Nervous System (incl brain)		18951 (1.6)	18932 (1.6)	19 (1.6)
	Cervix		10747 (0.9)	10734 (0.9)	13 (1.1)
	Colorectal		136535 (11.5)	136413 (11.5)	122 (10.3)
	Head and neck		49921 (4.2)	49813 (4.2)	108 (9.2)
	Hodgkin lymphoma		7133 (0.6)	7120 (0.6)	13 (1.1)
	Kidney and unspecified urinary organs		42428 (3.6)	42377 (3.6)	51 (4.3)
	Leukaemia		32813 (2.8)	32780 (2.8)	33 (2.8)
	Liver		20996 (1.8)	20979 (1.8)	17 (1.4)
	Lung		152335 (12.8)	152151 (12.8)	184 (15.6)
	Melanoma		55406 (4.7)	55370 (4.7)	36 (3.1)
	Mesothelioma		9003 (0.8)	8995 (0.8)	8 (0.7)
	Multiple myeloma		21596 (1.8)	21580 (1.8)	16 (1.4)
	Non-Hodgkin lymphoma		47075 (4.0)	47014 (4.0)	61 (5.2)
	Oesophagus		29481 (2.5)	29461 (2.5)	20 (1.7)
	Other malignant neoplasms		38406 (3.2)	38373 (3.2)	33 (2.8)
	Ovary		24064 (2.0)	24055 (2.0)	9 (0.8)
	Pancreas		35023 (2.9)	35001 (2.9)	22 (1.9)
	Prostate		171949 (14.4)	171745 (14.4)	204 (17.3)
	Sarcoma		8507 (0.7)	8496 (0.7)	11 (0.9)
	Stomach		20616 (1.7)	20590 (1.7)	26 (2.2)
	Testis		7955 (0.7)	7940 (0.7)	15 (1.3)
	Uterus		30056 (2.5)	30038 (2.5)	18 (1.5)
Deprivation, n (%)	1, Least	0	256722 (21.6)	256470 (21.6)	252 (21.4)
	2		256671 (21.6)	256429 (21.6)	242 (20.5)
	3		245369 (20.6)	245122 (20.6)	247 (20.9)
	4		220732 (18.5)	220523 (18.5)	209 (17.7)
	5, Most		210804 (17.7)	210574 (17.7)	230 (19.5)
Ethnicity, n (%)	White	19649	1040510 (88.9)	1039481 (88.9)	1029 (88.2)
	Asian		30686 (2.6)	30647 (2.6)	39 (3.3)
	Black		23039 (2.0)	23013 (2.0)	26 (2.2)
	Mixed		5739 (0.5)	5735 (0.5)	4 (0.3)
	Other		18678 (1.6)	18664 (1.6)	14 (1.2)
	Unknown		7806 (0.7)	7795 (0.7)	11 (0.9)
	Not Stated		44191 (3.8)	44147 (3.8)	44 (3.8)
Age category at cancer diagnosis, n (%)	15-19	0	179 (0.0)	179 (0.0)
	20-24		628 (0.1)	626 (0.1)	2 (0.2)
	25-29		2295 (0.2)	2291 (0.2)	4 (0.3)
	30-34		6594 (0.6)	6586 (0.6)	8 (0.7)
	35-39		17381 (1.5)	17369 (1.5)	12 (1.0)
	40-44		38048 (3.2)	38011 (3.2)	37 (3.1)
	45-49		72094 (6.1)	72013 (6.1)	81 (6.9)
	50-54		115268 (9.7)	115164 (9.7)	104 (8.8)
	55-59		157993 (13.3)	157842 (13.3)	151 (12.8)
	60-64		184759 (15.5)	184585 (15.5)	174 (14.7)
	65-69		184702 (15.5)	184510 (15.5)	192 (16.3)
	70-74		157977 (13.3)	157836 (13.3)	141 (11.9)
	75-79		115496 (9.7)	115375 (9.7)	121 (10.3)
	80-84		71483 (6.0)	71405 (6.0)	78 (6.6)
	85-89		38269 (3.2)	38225 (3.2)	44 (3.7)
	90+		27132 (2.3)	27101 (2.3)	31 (2.6)
Year of cancer diagnosis, n (%)	2016	0	290491 (24.4)	290193 (24.4)	298 (25.3)
	2017		290914 (24.4)	290613 (24.4)	301 (25.5)
	2018		305632 (25.7)	305343 (25.7)	289 (24.5)
	2019		303261 (25.5)	302969 (25.5)	292 (24.7)

3 Suicide SMRs and AERs per 10,000 person-years at risk according to last primary cancer

These tables aim to reproduce some of the outputs of Table 2 in (Henson et al., 2019).

R Code

# Calculate suicides among cancer patients
cancer_suicides <- survival |>
  rename(attained_age = age_category_at_death, sex = gender) |>
  mutate(year = year(vital_status_date)) |>
  mutate(suicide = suicide - 1) |>
  group_by(year, sex, attained_age, cancer_group) |>
  mutate(sex = as.character(sex)) |>
  mutate(attained_age = as.character(attained_age)) |>
  summarise(person_years = round(sum(time)),
            suicide_deaths = sum(suicide)) |>
  drop_na() |>
  distinct() |>
  select(year,
         sex,
         attained_age,
         person_years,
         suicide_deaths,
         cancer_group) |>
  ungroup()

# Calculate SMRs
cancer_smr <- cancer_suicides |>
  left_join(reference_suicides) |>
  mutate(expected_suicides = (person_years / 10000) * suicides_per_100000) |>
  group_by(cancer_group) |>
  summarise (
    observed_suicides = sum(suicide_deaths),
    expected_suicides = round(sum(expected_suicides)),
    person_years = sum(person_years)
  ) |>
  mutate(
    smr = round(observed_suicides / expected_suicides, digits = 2),
    aer = round((observed_suicides - expected_suicides) / person_years * 10000,
                 digits = 2
    )
  ) |>
  select(cancer_group,
         person_years,
         observed_suicides,
         expected_suicides,
         smr,
         aer)

kableExtra::kable(
  cancer_smr,
  col.names = c(
    "Last primary cancer",
    "Person-years observed",
    "Observed suicide deaths",
    "Expected suicide deaths",
    "Standardised mortality ratio",
    "Absolute excess risk per 10 000"
  )
) |>
  kable_classic(full_width = F)

Last primary cancer	Person-years observed	Observed suicide deaths	Expected suicide deaths	Standardised mortality ratio	Absolute excess risk per 10 000
Bladder	23977	35	29	1.21	2.50
Breast	65578	97	33	2.94	9.76
Cancer of Unknown Primary	3536	9	4	2.25	14.14
Central Nervous System (incl brain)	12074	19	13	1.46	4.97
Cervix	3907	13	2	6.50	28.15
Colorectal	89450	122	98	1.24	2.68
Head and neck	25121	107	28	3.82	31.45
Hodgkin lymphoma	1652	13	2	6.50	66.59
Kidney and unspecified urinary organs	23501	51	28	1.82	9.79
Leukaemia	18277	33	20	1.65	7.11
Liver	14182	17	16	1.06	0.71
Lung	108106	184	115	1.60	6.38
Melanoma	19560	36	20	1.80	8.18
Mesothelioma	8057	8	10	0.80	-2.48
Multiple myeloma	16948	16	19	0.84	-1.77
Non-Hodgkin lymphoma	24413	61	27	2.26	13.93
Oesophagus	22827	20	25	0.80	-2.19
Other malignant neoplasms	22732	33	24	1.38	3.96
Ovary	16310	9	8	1.12	0.61
Pancreas	18091	22	20	1.10	1.11
Prostate	76454	204	119	1.71	11.12
Sarcoma	5179	11	6	1.83	9.65
Stomach	14889	26	16	1.62	6.72
Testis	646	15	1	15.00	216.72
Uterus	15434	18	8	2.25	6.48

We can also output an analogous table using Python.

Python Code

# Import necessary packages
import pandas as pd
import numpy as np
from tabulate import tabulate
from IPython.display import Markdown

# Import dataset
df_survival = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/survival.csv")
df_suicides = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/reference_suicides.csv")

# Calculate suicides among cancer patients
df_survival['attained_age'] = df_survival['age_category_at_death']
df_survival['sex'] = df_survival['gender']
df_survival['year'] = pd.to_datetime(df_survival['vital_status_date']).dt.year
df_survival['suicide'] = df_survival['suicide'] - 1

# Group by and summarize
grouped = df_survival.groupby(['year', 'sex', 'attained_age', 'cancer_group'])
cancer_suicides = grouped.agg(
    person_years=pd.NamedAgg(column='time', aggfunc=lambda x: round(x.sum())),
    suicide_deaths=pd.NamedAgg(column='suicide', aggfunc='sum')
).reset_index()
cancer_suicides.dropna(inplace=True)
cancer_suicides = cancer_suicides.drop_duplicates()
cancer_suicides['sex'] = cancer_suicides['sex'].astype(str)
cancer_suicides['attained_age'] = cancer_suicides['attained_age'].astype(str)

# Prepare for merging by ensuring alignment of the key columns used for join
cancer_suicides = cancer_suicides.merge(df_suicides, on=['year', 'sex', 'attained_age'], how='left')

# Handle potential NaN values from the merge before calculation
cancer_suicides['suicides_per_100000'] = cancer_suicides['suicides_per_100000']

# Calculate expected suicides
cancer_suicides['expected_suicides'] = (cancer_suicides['person_years'] / 10000) * cancer_suicides['suicides_per_100000']

# Aggregate for SMR calculation
grouped_smr = cancer_suicides.groupby('cancer_group')
cancer_smr = grouped_smr.agg(
    observed_suicides=pd.NamedAgg(column='suicide_deaths', aggfunc='sum'),
    expected_suicides=pd.NamedAgg(column='expected_suicides', aggfunc=lambda x: round(x.sum())),
    person_years=pd.NamedAgg(column='person_years', aggfunc='sum')
).reset_index()

cancer_smr['smr'] = np.round(cancer_smr['observed_suicides'] / cancer_smr['expected_suicides'], 2)
cancer_smr['aer'] = np.round((cancer_smr['observed_suicides'] - cancer_smr['expected_suicides']) / cancer_smr['person_years'] * 10000, 2)

# Display as a table with custom column names
cancer_smr_table = cancer_smr[['cancer_group', 'person_years', 'observed_suicides', 'expected_suicides', 'smr', 'aer']]
cancer_smr_table.columns = [
    'Last primary cancer', 
    'Person-years observed', 
    'Observed suicide deaths', 
    'Expected suicide deaths', 
    'Standardised mortality ratio', 
    'Absolute excess risk per 10 000'
]

# Display the table
Markdown(tabulate(cancer_smr_table, headers='keys'))

	Last primary cancer	Person-years observed	Observed suicide deaths	Expected suicide deaths	Standardised mortality ratio	Absolute excess risk per 10 000
0	Bladder	23977	35	29	1.21	2.5
1	Breast	65578	97	33	2.94	9.76
2	Cancer of Unknown Primary	3536	9	4	2.25	14.14
3	Central Nervous System (incl brain)	12074	19	13	1.46	4.97
4	Cervix	3907	13	2	6.5	28.15
5	Colorectal	89450	122	98	1.24	2.68
6	Head and neck	25121	107	28	3.82	31.45
7	Hodgkin lymphoma	1652	13	2	6.5	66.59
8	Kidney and unspecified urinary organs	23501	51	28	1.82	9.79
9	Leukaemia	18277	33	20	1.65	7.11
10	Liver	14182	17	16	1.06	0.71
11	Lung	108106	184	115	1.6	6.38
12	Melanoma	19560	36	20	1.8	8.18
13	Mesothelioma	8057	8	10	0.8	-2.48
14	Multiple myeloma	16948	16	19	0.84	-1.77
15	Non-Hodgkin lymphoma	24413	61	27	2.26	13.93
16	Oesophagus	22827	20	25	0.8	-2.19
17	Other malignant neoplasms	22732	33	24	1.38	3.96
18	Ovary	16310	9	8	1.12	0.61
19	Pancreas	18091	22	20	1.1	1.11
20	Prostate	76454	204	119	1.71	11.12
21	Sarcoma	5179	11	6	1.83	9.65
22	Stomach	14889	26	16	1.62	6.72
23	Testis	646	15	1	15	216.72
24	Uterus	15434	18	8	2.25	6.48

4 Suicide SMRs and AERs per 10 000 person-years at risk according to sex for all cancers combined

These tables aim to reproduce some of the outputs of Table 3 in (Henson et al., 2019).

R Code

# Calculate suicides among cancer patients
cancer_suicides <- survival |>
  rename(attained_age = age_category_at_death, sex = gender) |>
  mutate(year = year(vital_status_date)) |>
  mutate(suicide = suicide - 1) |>
  group_by(year,
           sex,
           attained_age) |>
  mutate(sex = as.character(sex)) |>
  mutate(attained_age = as.character(attained_age)) |>
  summarise(person_years = round(sum(time)),
            suicide_deaths = sum(suicide)) |>
  drop_na() |>
  distinct() |>
  select(
    year,
    sex,
    attained_age,
    person_years,
    suicide_deaths
  ) |>
  ungroup()

# Calculate SMRs
sex_smr <- cancer_suicides |>
  left_join(reference_suicides) |>
  mutate(expected_suicides = (person_years / 10000) * suicides_per_100000) |>
  group_by(sex) |>
  summarise (
    observed_suicides = sum(suicide_deaths),
    expected_suicides = round(sum(expected_suicides)),
    person_years = sum(person_years)
  ) |>
  mutate(
    smr = round(observed_suicides / expected_suicides, digits = 2),
    aer = round((observed_suicides - expected_suicides) / person_years * 10000,
                 digits = 2
    )
  ) |>
  select(sex,
         person_years,
         observed_suicides,
         expected_suicides,
         smr,
         aer)

kableExtra::kable(
  sex_smr,
  col.names = c(
    "Sex",
    "Person-years observed",
    "Observed suicide deaths",
    "Expected suicide deaths",
    "Standardised mortality ratio",
    "Absolute excess risk per 10 000"
  )
) |>
  kable_classic(full_width = F)

Sex	Person-years observed	Observed suicide deaths	Expected suicide deaths	Standardised mortality ratio	Absolute excess risk per 10 000
Female	304081	457	149	3.07	10.13
Male	346831	722	541	1.33	5.22

We can also output an analogous table using Python.

Python Code

# Import necessary packages
import pandas as pd
import numpy as np
from tabulate import tabulate
from IPython.display import Markdown

# Import dataset
df_survival = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/survival.csv")
df_suicides = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/reference_suicides.csv")

# Calculate suicides among cancer patients
df_survival['attained_age'] = df_survival['age_category_at_death']
df_survival['sex'] = df_survival['gender']
df_survival['year'] = pd.to_datetime(df_survival['vital_status_date']).dt.year
df_survival['suicide'] = df_survival['suicide'] - 1

# Group by and summarize
grouped = df_survival.groupby(['year', 'sex', 'attained_age', 'cancer_group'])
cancer_suicides = grouped.agg(
    person_years=pd.NamedAgg(column='time', aggfunc=lambda x: round(x.sum())),
    suicide_deaths=pd.NamedAgg(column='suicide', aggfunc='sum')
).reset_index()
cancer_suicides.dropna(inplace=True)
cancer_suicides = cancer_suicides.drop_duplicates()
cancer_suicides['sex'] = cancer_suicides['sex'].astype(str)
cancer_suicides['attained_age'] = cancer_suicides['attained_age'].astype(str)

# Prepare for merging by ensuring alignment of the key columns used for join
cancer_suicides = cancer_suicides.merge(df_suicides, on=['year', 'sex', 'attained_age'], how='left')

# Handle potential NaN values from the merge before calculation
cancer_suicides['suicides_per_100000'] = cancer_suicides['suicides_per_100000']

# Calculate expected suicides
cancer_suicides['expected_suicides'] = (cancer_suicides['person_years'] / 10000) * cancer_suicides['suicides_per_100000']

# Aggregate for SMR calculation
grouped_smr = cancer_suicides.groupby('sex')

sex_smr = grouped_smr.agg(
    observed_suicides=pd.NamedAgg(column='suicide_deaths', aggfunc='sum'),
    expected_suicides=pd.NamedAgg(column='expected_suicides', aggfunc=lambda x: round(x.sum())),
    person_years=pd.NamedAgg(column='person_years', aggfunc='sum')
).reset_index()

sex_smr['smr'] = np.round(sex_smr['observed_suicides'] / sex_smr['expected_suicides'], 2)
sex_smr['aer'] = np.round((sex_smr['observed_suicides'] - sex_smr['expected_suicides']) / sex_smr['person_years'] * 10000, 2)

# Display as a table with custom column names
sex_smr_table = sex_smr[['sex', 'person_years', 'observed_suicides', 'expected_suicides', 'smr', 'aer']]
sex_smr_table.columns = [
    'Sex', 
    'Person-years observed', 
    'Observed suicide deaths', 
    'Expected suicide deaths', 
    'Standardised mortality ratio', 
    'Absolute excess risk per 10 000'
]

# Display the table
Markdown(tabulate(sex_smr_table, headers='keys'))

	Sex	Person-years observed	Observed suicide deaths	Expected suicide deaths	Standardised mortality ratio	Absolute excess risk per 10 000
0	Female	304067	457	149	3.07	10.13
1	Male	346834	722	541	1.33	5.22

5 Suicide SMRs and AERs per 10 000 person-years at risk according to attained age at death for all cancers combined

These tables aim to reproduce some of the outputs of Table 3 in (Henson et al., 2019).

R Code

# Calculate suicides among cancer patients
cancer_suicides <- survival |>
  rename(attained_age = age_category_at_death, sex = gender) |>
  mutate(year = year(vital_status_date)) |>
  mutate(suicide = suicide - 1) |>
  group_by(year,
           sex,
           attained_age) |>
  mutate(sex = as.character(sex)) |>
  mutate(attained_age = as.character(attained_age)) |>
  summarise(person_years = round(sum(time)),
            suicide_deaths = sum(suicide)) |>
  drop_na() |>
  distinct() |>
  select(
    year,
    sex,
    attained_age,
    person_years,
    suicide_deaths
  ) |>
  ungroup()

# Calculate SMRs
age_smr <- cancer_suicides |>
  left_join(reference_suicides) |>
  mutate(expected_suicides = (person_years / 10000) * suicides_per_100000) |>
  group_by(attained_age) |>
  summarise (
    observed_suicides = sum(suicide_deaths),
    expected_suicides = round(sum(expected_suicides)),
    person_years = sum(person_years)
  ) |>
  mutate(
    smr = round(observed_suicides / expected_suicides, digits = 2),
    aer = round((observed_suicides - expected_suicides) / person_years * 10000,
                 digits = 2
    )
  ) |>
  select(attained_age,
         person_years,
         observed_suicides,
         expected_suicides,
         smr,
         aer)

kableExtra::kable(
  age_smr,
  col.names = c(
    "Attained age at death",
    "Person-years observed",
    "Observed suicide deaths",
    "Expected suicide deaths",
    "Standardised mortality ratio",
    "Absolute excess risk per 10 000"
  )
) |>
  kable_classic(full_width = F)

Attained age at death	Person-years observed	Observed suicide deaths	Expected suicide deaths	Standardised mortality ratio	Absolute excess risk per 10 000
15-19	24	0	0	NaN	0.00
20-24	215	2	0	Inf	93.02
25-29	772	4	1	4.00	38.86
30-34	2578	4	3	1.33	3.88
35-39	6575	14	9	1.56	7.60
40-44	15343	33	21	1.57	7.82
45-49	30954	67	49	1.37	5.82
50-54	52204	101	78	1.29	4.41
55-59	76916	145	98	1.48	6.11
60-64	95812	156	107	1.46	5.11
65-69	101858	198	91	2.18	10.50
70-74	94135	152	73	2.08	8.39
75-79	73584	131	57	2.30	10.06
80-84	48877	76	41	1.85	7.16
85-89	28183	62	32	1.94	10.64
90+	22882	34	31	1.10	1.31

We can also output an analogous table using Python.

Python Code

# Import necessary packages
import pandas as pd
import numpy as np
from tabulate import tabulate
from IPython.display import Markdown

# Import dataset
df_survival = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/survival.csv")
df_suicides = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/reference_suicides.csv")

# Calculate suicides among cancer patients
df_survival['attained_age'] = df_survival['age_category_at_death']
df_survival['sex'] = df_survival['gender']
df_survival['year'] = pd.to_datetime(df_survival['vital_status_date']).dt.year
df_survival['suicide'] = df_survival['suicide'] - 1

# Group by and summarize
grouped = df_survival.groupby(['year', 'sex', 'attained_age', 'cancer_group'])
cancer_suicides = grouped.agg(
    person_years=pd.NamedAgg(column='time', aggfunc=lambda x: round(x.sum())),
    suicide_deaths=pd.NamedAgg(column='suicide', aggfunc='sum')
).reset_index()
cancer_suicides.dropna(inplace=True)
cancer_suicides = cancer_suicides.drop_duplicates()
cancer_suicides['sex'] = cancer_suicides['sex'].astype(str)
cancer_suicides['attained_age'] = cancer_suicides['attained_age'].astype(str)

# Prepare for merging by ensuring alignment of the key columns used for join
cancer_suicides = cancer_suicides.merge(df_suicides, on=['year', 'sex', 'attained_age'], how='left')

# Handle potential NaN values from the merge before calculation
cancer_suicides['suicides_per_100000'] = cancer_suicides['suicides_per_100000']

# Calculate expected suicides
cancer_suicides['expected_suicides'] = (cancer_suicides['person_years'] / 10000) * cancer_suicides['suicides_per_100000']

# Aggregate for SMR calculation
grouped_smr = cancer_suicides.groupby('attained_age')

age_smr = grouped_smr.agg(
    observed_suicides=pd.NamedAgg(column='suicide_deaths', aggfunc='sum'),
    expected_suicides=pd.NamedAgg(column='expected_suicides', aggfunc=lambda x: round(x.sum())),
    person_years=pd.NamedAgg(column='person_years', aggfunc='sum')
).reset_index()

age_smr['smr'] = np.round(age_smr['observed_suicides'] / age_smr['expected_suicides'], 2)
age_smr['aer'] = np.round((age_smr['observed_suicides'] - age_smr['expected_suicides']) / age_smr['person_years'] * 10000, 2)

# Display as a table with custom column names
age_smr_table = age_smr[['attained_age', 'person_years', 'observed_suicides', 'expected_suicides', 'smr', 'aer']]
age_smr_table.columns = [
    'Attained age at death', 
    'Person-years observed', 
    'Observed suicide deaths', 
    'Expected suicide deaths', 
    'Standardised mortality ratio', 
    'Absolute excess risk per 10 000'
]

# Display the table
Markdown(tabulate(age_smr_table, headers='keys'))

	Attained age at death	Person-years observed	Observed suicide deaths	Expected suicide deaths	Standardised mortality ratio	Absolute excess risk per 10 000
0	15-19	24	0	0	nan	0
1	20-24	208	2	0	inf	96.15
2	25-29	769	4	1	4	39.01
3	30-34	2585	4	3	1.33	3.87
4	35-39	6574	14	9	1.56	7.61
5	40-44	15351	33	21	1.57	7.82
6	45-49	30951	67	49	1.37	5.82
7	50-54	52192	101	78	1.29	4.41
8	55-59	76919	145	98	1.48	6.11
9	60-64	95809	156	107	1.46	5.11
10	65-69	101856	198	91	2.18	10.51
11	70-74	94132	152	73	2.08	8.39
12	75-79	73598	131	57	2.3	10.05
13	80-84	48869	76	41	1.85	7.16
14	85-89	28181	62	32	1.94	10.65
15	90+	22883	34	31	1.1	1.31

6 Suicide SMRs and AERs per 10 000 person-years at risk according to deprivation at death for all cancers combined

These tables aim to reproduce some of the outputs of Table 3 in (Henson et al., 2019).

R Code

# Calculate suicides among cancer patients
cancer_suicides <- survival |>
  rename(attained_age = age_category_at_death, sex = gender) |>
  mutate(year = year(vital_status_date)) |>
  mutate(suicide = suicide - 1) |>
  group_by(year,
           sex,
           attained_age,
           deprivation_quintile_reversed) |>
  mutate(sex = as.character(sex)) |>
  mutate(attained_age = as.character(attained_age)) |>
  summarise(person_years = round(sum(time)),
            suicide_deaths = sum(suicide)) |>
  drop_na() |>
  distinct() |>
  select(
    year,
    sex,
    attained_age,
    person_years,
    suicide_deaths,
    deprivation_quintile_reversed
  ) |>
  ungroup()

# Calculate SMRs
deprivation_smr <- cancer_suicides |>
  left_join(reference_suicides) |>
  mutate(expected_suicides = (person_years / 10000) * suicides_per_100000) |>
  group_by(deprivation_quintile_reversed) |>
  summarise (
    observed_suicides = sum(suicide_deaths),
    expected_suicides = round(sum(expected_suicides)),
    person_years = sum(person_years)
  ) |>
  mutate(
    smr = round(observed_suicides / expected_suicides, digits = 2),
    aer = round((observed_suicides - expected_suicides) / person_years * 10000,
                 digits = 2
    )
  ) |>
  select(deprivation_quintile_reversed,
         person_years,
         observed_suicides,
         expected_suicides,
         smr,
         aer)

kableExtra::kable(
  deprivation_smr,
  col.names = c(
    "Deprivation",
    "Person-years observed",
    "Observed suicide deaths",
    "Expected suicide deaths",
    "Standardised mortality ratio",
    "Absolute excess risk per 10 000"
  )
) |>
  kable_classic(full_width = F)

Deprivation	Person-years observed	Observed suicide deaths	Expected suicide deaths	Standardised mortality ratio	Absolute excess risk per 10 000
1, Least	136824	252	145	1.74	7.82
2	138537	242	148	1.64	6.79
3	133581	247	142	1.74	7.86
4	122291	208	130	1.60	6.38
5, Most	119670	230	126	1.83	8.69

We can also output an analogous table using Python.

Python Code

# Import necessary packages
import pandas as pd
import numpy as np
from tabulate import tabulate
from IPython.display import Markdown

# Import dataset
df_survival = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/survival.csv")
df_suicides = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/reference_suicides.csv")

# Calculate suicides among cancer patients
df_survival['attained_age'] = df_survival['age_category_at_death']
df_survival['sex'] = df_survival['gender']
df_survival['year'] = pd.to_datetime(df_survival['vital_status_date']).dt.year
df_survival['suicide'] = df_survival['suicide'] - 1

# Group by and summarize
grouped = df_survival.groupby(['year', 'sex', 'attained_age', 'deprivation_quintile_reversed'])

cancer_suicides = grouped.agg(
    person_years=pd.NamedAgg(column='time', aggfunc=lambda x: round(x.sum())),
    suicide_deaths=pd.NamedAgg(column='suicide', aggfunc='sum')
).reset_index()

cancer_suicides.dropna(inplace=True)

cancer_suicides = cancer_suicides.drop_duplicates()

cancer_suicides['sex'] = cancer_suicides['sex'].astype(str)
cancer_suicides['attained_age'] = cancer_suicides['attained_age'].astype(str)

# Prepare for merging by ensuring alignment of the key columns used for join
cancer_suicides = cancer_suicides.merge(df_suicides, on=['year', 'sex', 'attained_age'], how='left')

# Handle potential NaN values from the merge before calculation
cancer_suicides['suicides_per_100000'] = cancer_suicides['suicides_per_100000']

# Calculate expected suicides
cancer_suicides['expected_suicides'] = (cancer_suicides['person_years'] / 10000) * cancer_suicides['suicides_per_100000']

# Aggregate for SMR calculation
grouped_smr = cancer_suicides.groupby('deprivation_quintile_reversed')

deprivation_smr = grouped_smr.agg(
    observed_suicides=pd.NamedAgg(column='suicide_deaths', aggfunc='sum'),
    expected_suicides=pd.NamedAgg(column='expected_suicides', aggfunc=lambda x: round(x.sum())),
    person_years=pd.NamedAgg(column='person_years', aggfunc='sum')
).reset_index()

deprivation_smr['smr'] = np.round(deprivation_smr['observed_suicides'] / deprivation_smr['expected_suicides'], 2)
deprivation_smr['aer'] = np.round((deprivation_smr['observed_suicides'] - deprivation_smr['expected_suicides']) / deprivation_smr['person_years'] * 10000, 2)

# Display as a table with custom column names
deprivation_smr_table = deprivation_smr[['deprivation_quintile_reversed', 'person_years', 'observed_suicides', 'expected_suicides', 'smr', 'aer']]
deprivation_smr_table.columns = [
    'Deprivation', 
    'Person-years observed', 
    'Observed suicide deaths', 
    'Expected suicide deaths', 
    'Standardised mortality ratio', 
    'Absolute excess risk per 10 000'
]

# Display the table
Markdown(tabulate(deprivation_smr_table, headers='keys'))

	Deprivation	Person-years observed	Observed suicide deaths	Expected suicide deaths	Standardised mortality ratio	Absolute excess risk per 10 000
0	1, Least	136824	252	145	1.74	7.82
1	2	138537	242	148	1.64	6.79
2	3	133581	247	142	1.74	7.86
3	4	122291	208	130	1.6	6.38
4	5, Most	119670	230	126	1.83	8.69

7 Suicide SMRs and AERs per 10 000 person-years at risk according to ethnicity at death for all cancers combined

These tables aim to reproduce some of the outputs of Table 3 in (Henson et al., 2019).

R Code

# Calculate suicides among cancer patients
cancer_suicides <- survival |>
  rename(attained_age = age_category_at_death, sex = gender) |>
  mutate(year = year(vital_status_date)) |>
  mutate(suicide = suicide - 1) |>
  group_by(year,
           sex,
           attained_age,
           ethnicity) |>
  mutate(sex = as.character(sex)) |>
  mutate(attained_age = as.character(attained_age)) |>
  summarise(person_years = round(sum(time)),
            suicide_deaths = sum(suicide)) |>
  drop_na() |>
  distinct() |>
  select(
    year,
    sex,
    attained_age,
    person_years,
    suicide_deaths,
    ethnicity
  ) |>
  ungroup()

# Calculate SMRs
ethnicity_smr <- cancer_suicides |>
  left_join(reference_suicides) |>
  mutate(expected_suicides = (person_years / 10000) * suicides_per_100000) |>
  group_by(ethnicity) |>
  summarise (
    observed_suicides = sum(suicide_deaths),
    expected_suicides = round(sum(expected_suicides)),
    person_years = sum(person_years)
  ) |>
  mutate(
    smr = round(observed_suicides / expected_suicides, digits = 2),
    aer = round((observed_suicides - expected_suicides) / person_years * 10000,
                 digits = 2
    )
  ) |>
  select(ethnicity,
         person_years,
         observed_suicides,
         expected_suicides,
         smr,
         aer)

kableExtra::kable(
  ethnicity_smr,
  col.names = c(
    "Ethnicity",
    "Person-years observed",
    "Observed suicide deaths",
    "Expected suicide deaths",
    "Standardised mortality ratio",
    "Absolute excess risk per 10 000"
  )
) |>
  kable_classic(full_width = F)

Ethnicity	Person-years observed	Observed suicide deaths	Expected suicide deaths	Standardised mortality ratio	Absolute excess risk per 10 000
White	572763	1028	608	1.69	7.33
Asian	16455	39	17	2.29	13.37
Black	12175	26	13	2.00	10.68
Mixed	2969	4	3	1.33	3.37
Other	10059	14	10	1.40	3.98
Unknown	3952	11	4	2.75	17.71
Not Stated	22998	44	25	1.76	8.26

We can also output an analogous table using Python.

Python Code

# Import necessary packages
import pandas as pd
import numpy as np
from tabulate import tabulate
from IPython.display import Markdown

# Import dataset
df_survival = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/survival.csv")
df_suicides = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/reference_suicides.csv")

# Calculate suicides among cancer patients
df_survival['attained_age'] = df_survival['age_category_at_death']
df_survival['sex'] = df_survival['gender']
df_survival['year'] = pd.to_datetime(df_survival['vital_status_date']).dt.year
df_survival['suicide'] = df_survival['suicide'] - 1

# Group by and summarize
grouped = df_survival.groupby(['year', 'sex', 'attained_age', 'ethnicity'])

cancer_suicides = grouped.agg(
    person_years=pd.NamedAgg(column='time', aggfunc=lambda x: round(x.sum())),
    suicide_deaths=pd.NamedAgg(column='suicide', aggfunc='sum')
).reset_index()

cancer_suicides.dropna(inplace=True)

cancer_suicides = cancer_suicides.drop_duplicates()

cancer_suicides['sex'] = cancer_suicides['sex'].astype(str)
cancer_suicides['attained_age'] = cancer_suicides['attained_age'].astype(str)

# Prepare for merging by ensuring alignment of the key columns used for join
cancer_suicides = cancer_suicides.merge(df_suicides, on=['year', 'sex', 'attained_age'], how='left')

# Handle potential NaN values from the merge before calculation
cancer_suicides['suicides_per_100000'] = cancer_suicides['suicides_per_100000']

# Calculate expected suicides
cancer_suicides['expected_suicides'] = (cancer_suicides['person_years'] / 10000) * cancer_suicides['suicides_per_100000']

# Aggregate for SMR calculation
grouped_smr = cancer_suicides.groupby('ethnicity')

ethnicity_smr = grouped_smr.agg(
    observed_suicides=pd.NamedAgg(column='suicide_deaths', aggfunc='sum'),
    expected_suicides=pd.NamedAgg(column='expected_suicides', aggfunc=lambda x: round(x.sum())),
    person_years=pd.NamedAgg(column='person_years', aggfunc='sum')
).reset_index()

ethnicity_smr['smr'] = np.round(ethnicity_smr['observed_suicides'] / ethnicity_smr['expected_suicides'], 2)
ethnicity_smr['aer'] = np.round((ethnicity_smr['observed_suicides'] - ethnicity_smr['expected_suicides']) / ethnicity_smr['person_years'] * 10000, 2)

# Display as a table with custom column names
ethnicity_smr_table = ethnicity_smr[['ethnicity', 'person_years', 'observed_suicides', 'expected_suicides', 'smr', 'aer']]

ethnicity_smr_table.columns = [
    'Ethnicity', 
    'Person-years observed', 
    'Observed suicide deaths', 
    'Expected suicide deaths', 
    'Standardised mortality ratio', 
    'Absolute excess risk per 10 000'
]

# Display the table
Markdown(tabulate(ethnicity_smr_table, headers='keys'))

	Ethnicity	Person-years observed	Observed suicide deaths	Expected suicide deaths	Standardised mortality ratio	Absolute excess risk per 10 000
0	Asian	16455	39	17	2.29	13.37
1	Black	12175	26	13	2	10.68
2	Mixed	2969	4	3	1.33	3.37
3	Not Stated	22998	44	25	1.76	8.26
4	Other	10059	14	10	1.4	3.98
5	Unknown	3952	11	4	2.75	17.71
6	White	572763	1028	608	1.69	7.33

References

Henson, K. E., Brock, R., Charnock, J., Wickramasinghe, B., Will, O., & Pitman, A. (2019). Risk of suicide after cancer diagnosis in england. JAMA Psychiatry, 76(1), 51. https://doi.org/10.1001/jamapsychiatry.2018.3181

Naing, N. N. (2000). Easy way to learn standardization: Direct and indirect methods. The Malaysian Journal of Medical Sciences: MJMS, 7(1), 10.

Reuse

CC BY-NC-SA 4.0