Risk of Suicide After Cancer Diagnosis in England

Examplar code using R and Python

Author
Affiliations

Justin C Yang

University College London

Camden & Islington NHS Foundation Trust

Published

May 9, 2024

The code below aims to mimic some of the outputs of (Henson et al., 2019), namely, to calculate standardised mortality ratios comparing suicide deaths among individuals diagnosed with cancer with the general population.

This code mimics but does not fully replicate the code made available by the NHS England for the study because this code was written without access to the underlying datasets. Instead, this code was written using the Simulacrum, a synthetic dataset which imitates some of the data used in the original study. It is important to note that the Simulacrum is missing several key variables which this code aims to reproducibly create but will not produce valid results. Consequently, results are not intended to be interpreted and may not make sense given prior knowledge. For example, patient ages were drawn from a random distribution as age/date of birth variables are not available in Simulacrum. To understand what the authors in (Henson et al., 2019) did more fully, I would recommend consulting their annotated code.

Standardised mortality rates for suicide deaths for England were obtained from the Office for National Statistics.

The generalised approach for this analysis is to:

  1. Calculate suicide mortality rates by age and sex among patients diagnosed with cancer (likely using cancer or disease registry data)

  2. Calculate suicide mortality rates by age and sex among the general population (likely using civil mortality registers)

  3. Calculate standardised mortality ratios and absolute excess risks using these mortality rates

For an introduction to direct and indirect standardisation methods, you can consult (Naing, 2000).

Many thanks to Chloe Bright who provided me with the original Stata code for this analysis and to Alexandra Pitman for supporting this work.

1 Set up

In the first instance, you’ll need access to data on suicide deaths among patients with cancer and suicide deaths among the general population. We will be standardising by sex and age group so we will need to first prepare the data accordingly. In the R code below, we use functions from the Tidyverse but you can just as easily do this in base R or using different functions.

First, we process data on suicide deaths among patients with cancer, including recoding cancer groups and underlying causes of death (presumably using ICD codes). As well, we need to note the diagnosis date (which serves as the start date) and the date of exit from the cohort (i.e. exit due to death, loss to follow-up, or end of follow-up). In the Simulacrum dataset, this is relatively straightforward because we are using the date of diagnosis up to the date of vital status but other datasets will need to note other potential exit causes (e.g. loss to follow-up). Once we have recoded and cleaned the data among cancer patients, we can set it as survival data to be able to derive the number of person-years observed.

Next, we need to process data on suicide deaths in the general population. This allows us to calculate the expected number of suicide deaths based on population rates. You are likely to do this using civil mortality registers or other information provided by national statistics offices about suicide deaths in the general population, standardised by age and sex.

R Code
# Load required libraries
library(arrow)
library(beepr)
library(bibtex)
library(blob)
library(data.table)
library(dplyr)
library(dtplyr)
library(feather)
library(finalfit)
library(forcats)
library(foreign)
library(ggplot2)
library(glue)
library(haven)
library(here)
library(Hmisc)
library(hms)
library(janitor)
library(kableExtra)
library(lubridate)
library(magrittr)
library(multidplyr)
library(parallel)
library(purrr)
library(reticulate)
library(readr)
library(readxl)
library(renv)
library(stringi)
library(stringr)
library(survival)
library(tibble)
library(tidyr)
library(xlsx)

# Set seed for reproducibility
## This will ensure randomly generated variables are reproducible 
set.seed(1223)

# Set clusters
## Because data used in these analyses can be quite large, I am making use of a number of packages which help, including packages to parallelize functions where possible and to load data in columnar format 
## This function will set clusters to n-1 where n is the number of cores on your machine
cluster <- new_cluster(parallel::detectCores() - 1)

# Load data files in Arrow
## Arrow is used here to help load in very large data files and to process them
## More detail is available at https://arrow.apache.org/
sim_av_patient <-
  open_csv_dataset(here("simulacrum_v2.1.0", "Data", "sim_av_patient.csv"))

sim_av_tumour <-
  open_csv_dataset(here("simulacrum_v2.1.0", "Data", "sim_av_tumour.csv"))

# Data extraction from source
## We define the exclusions here and use their patient IDs later to remove them from the cohort
exclusions <- sim_av_patient |>
  select(PATIENTID, VITALSTATUS) |>
  collect() |>
  janitor::clean_names() |>
  distinct() |>
  dplyr::filter(
    str_detect(vitalstatus, "A[123]") |
      str_detect(vitalstatus, "D[145]") |
      str_detect(vitalstatus, "X1") |
      str_detect(vitalstatus, "I")
  ) |>
  select(patientid)

# Here we extract information about cancer patients, including causes of death related to suicide 
patient <- sim_av_patient |>
  select(
    PATIENTID,
    GENDER,
    ETHNICITY,
    VITALSTATUS,
    VITALSTATUSDATE,
    DEATHCAUSECODE_1A,
    DEATHCAUSECODE_1B,
    DEATHCAUSECODE_1C,
    DEATHCAUSECODE_2,
    DEATHCAUSECODE_UNDERLYING
  ) |>
  collect() |>
  janitor::clean_names() |>
  distinct() |>
  anti_join(exclusions) |>
  mutate(
    suicide_cause_2 =
      case_when(
        str_detect(deathcausecode_1a, "Y339") ~ NA,
        str_detect(deathcausecode_1a, "E9888") ~ NA,
        str_detect(deathcausecode_1a, "X6[0123456789]") ~ "1A",
        str_detect(deathcausecode_1a, "X7[0123456789]") ~ "1A",
        str_detect(deathcausecode_1a, "X8[01234]") ~ "1A",
        str_detect(deathcausecode_1a, "Y1[0123456789]") ~ "1A",
        str_detect(deathcausecode_1a, "Y2[0123456789]") ~ "1A",
        str_detect(deathcausecode_1a, "Y3[01234]") ~ "1A",
        str_detect(deathcausecode_1a, "E98") ~ "1A",
        str_detect(deathcausecode_1a, "E95[0123456789]") ~ "1A",
        str_detect(deathcausecode_1a, "E87[02]") ~ "1A",
        str_detect(deathcausecode_1b, "Y339") ~ NA,
        str_detect(deathcausecode_1b, "E9888") ~ NA,
        str_detect(deathcausecode_1b, "X6[0123456789]") ~ "1B",
        str_detect(deathcausecode_1b, "X7[0123456789]") ~ "1B",
        str_detect(deathcausecode_1b, "X8[01234]") ~ "1B",
        str_detect(deathcausecode_1b, "Y1[0123456789]") ~ "1B",
        str_detect(deathcausecode_1b, "Y2[0123456789]") ~ "1B",
        str_detect(deathcausecode_1b, "Y3[01234]") ~ "1B",
        str_detect(deathcausecode_1b, "E98") ~ "1B",
        str_detect(deathcausecode_1b, "E95[0123456789]") ~ "1B",
        str_detect(deathcausecode_1b, "E87[02]") ~ "1B",
        str_detect(deathcausecode_1c, "Y339") ~ NA,
        str_detect(deathcausecode_1c, "E9888") ~ NA,
        str_detect(deathcausecode_1c, "X6[0123456789]") ~ "1C",
        str_detect(deathcausecode_1c, "X7[0123456789]") ~ "1C",
        str_detect(deathcausecode_1c, "X8[01234]") ~ "1C",
        str_detect(deathcausecode_1c, "Y1[0123456789]") ~ "1C",
        str_detect(deathcausecode_1c, "Y2[0123456789]") ~ "1C",
        str_detect(deathcausecode_1c, "Y3[01234]") ~ "1C",
        str_detect(deathcausecode_1c, "E98") ~ "1C",
        str_detect(deathcausecode_1c, "E95[0123456789]") ~ "1C",
        str_detect(deathcausecode_1c, "E87[02]") ~ "1C",
        str_detect(deathcausecode_2, "Y339") ~ NA,
        str_detect(deathcausecode_2, "E9888") ~ NA,
        str_detect(deathcausecode_2, "X6[0123456789]") ~ "2",
        str_detect(deathcausecode_2, "X7[0123456789]") ~ "2",
        str_detect(deathcausecode_2, "X8[01234]") ~ "2",
        str_detect(deathcausecode_2, "Y1[0123456789]") ~ "2",
        str_detect(deathcausecode_2, "Y2[0123456789]") ~ "2",
        str_detect(deathcausecode_2, "Y3[01234]") ~ "2",
        str_detect(deathcausecode_2, "E98") ~ "2",
        str_detect(deathcausecode_2, "E95[0123456789]") ~ "2",
        str_detect(deathcausecode_2, "E87[02]") ~ "2",
        str_detect(deathcausecode_underlying, "Y339") ~ NA,
        str_detect(deathcausecode_underlying, "E9888") ~ NA,
        str_detect(deathcausecode_underlying, "X6[0123456789]") ~ "Underlying",
        str_detect(deathcausecode_underlying, "X7[0123456789]") ~ "Underlying",
        str_detect(deathcausecode_underlying, "X8[01234]") ~ "Underlying",
        str_detect(deathcausecode_underlying, "Y1[0123456789]") ~ "Underlying",
        str_detect(deathcausecode_underlying, "Y2[0123456789]") ~ "Underlying",
        str_detect(deathcausecode_underlying, "Y3[01234]") ~ "Underlying",
        str_detect(deathcausecode_underlying, "E98") ~ "Underlying",
        str_detect(deathcausecode_underlying, "E95[0123456789]") ~ "Underlying",
        str_detect(deathcausecode_underlying, "E87[02]") ~ "Underlying",
        TRUE ~ NA
      )
  ) |>
  mutate(
    suicide_2 =
      case_when(
        str_detect(deathcausecode_1a, "Y339") ~ 0,
        str_detect(deathcausecode_1a, "E9888") ~ 0,
        str_detect(deathcausecode_1a, "X6[0123456789]") ~ 1,
        str_detect(deathcausecode_1a, "X7[0123456789]") ~ 1,
        str_detect(deathcausecode_1a, "X8[01234]") ~ 1,
        str_detect(deathcausecode_1a, "Y1[0123456789]") ~ 1,
        str_detect(deathcausecode_1a, "Y2[0123456789]") ~ 1,
        str_detect(deathcausecode_1a, "Y3[01234]") ~ 1,
        str_detect(deathcausecode_1a, "E98") ~ 1,
        str_detect(deathcausecode_1a, "E95[0123456789]") ~ 1,
        str_detect(deathcausecode_1a, "E87[02]") ~ 1,
        str_detect(deathcausecode_1b, "Y339") ~ 0,
        str_detect(deathcausecode_1b, "E9888") ~ 0,
        str_detect(deathcausecode_1b, "X6[0123456789]") ~ 1,
        str_detect(deathcausecode_1b, "X7[0123456789]") ~ 1,
        str_detect(deathcausecode_1b, "X8[01234]") ~ 1,
        str_detect(deathcausecode_1b, "Y1[0123456789]") ~ 1,
        str_detect(deathcausecode_1b, "Y2[0123456789]") ~ 1,
        str_detect(deathcausecode_1b, "Y3[01234]") ~ 1,
        str_detect(deathcausecode_1b, "E98") ~ 1,
        str_detect(deathcausecode_1b, "E95[0123456789]") ~ 1,
        str_detect(deathcausecode_1b, "E87[02]") ~ 1,
        str_detect(deathcausecode_1c, "Y339") ~ 0,
        str_detect(deathcausecode_1c, "E9888") ~ 0,
        str_detect(deathcausecode_1c, "X6[0123456789]") ~ 1,
        str_detect(deathcausecode_1c, "X7[0123456789]") ~ 1,
        str_detect(deathcausecode_1c, "X8[01234]") ~ 1,
        str_detect(deathcausecode_1c, "Y1[0123456789]") ~ 1,
        str_detect(deathcausecode_1c, "Y2[0123456789]") ~ 1,
        str_detect(deathcausecode_1c, "Y3[01234]") ~ 1,
        str_detect(deathcausecode_1c, "E98") ~ 1,
        str_detect(deathcausecode_1c, "E95[0123456789]") ~ 1,
        str_detect(deathcausecode_1c, "E87[02]") ~ 1,
        str_detect(deathcausecode_2, "Y339") ~ 0,
        str_detect(deathcausecode_2, "E9888") ~ 0,
        str_detect(deathcausecode_2, "X6[0123456789]") ~ 1,
        str_detect(deathcausecode_2, "X7[0123456789]") ~ 1,
        str_detect(deathcausecode_2, "X8[01234]") ~ 1,
        str_detect(deathcausecode_2, "Y1[0123456789]") ~ 1,
        str_detect(deathcausecode_2, "Y2[0123456789]") ~ 1,
        str_detect(deathcausecode_2, "Y3[01234]") ~ 1,
        str_detect(deathcausecode_2, "E98") ~ 1,
        str_detect(deathcausecode_2, "E95[0123456789]") ~ 1,
        str_detect(deathcausecode_2, "E87[02]") ~ 1,
        str_detect(deathcausecode_underlying, "Y339") ~ 0,
        str_detect(deathcausecode_underlying, "E9888") ~ 0,
        str_detect(deathcausecode_underlying, "X6[0123456789]") ~ 1,
        str_detect(deathcausecode_underlying, "X7[0123456789]") ~ 1,
        str_detect(deathcausecode_underlying, "X8[01234]") ~ 1,
        str_detect(deathcausecode_underlying, "Y1[0123456789]") ~ 1,
        str_detect(deathcausecode_underlying, "Y2[0123456789]") ~ 1,
        str_detect(deathcausecode_underlying, "Y3[01234]") ~ 1,
        str_detect(deathcausecode_underlying, "E98") ~ 1,
        str_detect(deathcausecode_underlying, "E95[0123456789]") ~ 1,
        str_detect(deathcausecode_underlying, "E87[02]") ~ 1,
        TRUE ~ 0
      )
  ) |>
  mutate(
    death_cause_flag =
      case_when(
        str_detect(vitalstatus, "A") ~ "Alive",
        (
          !str_detect(vitalstatus, "A") &
            is.na(deathcausecode_1a) &
            is.na(deathcausecode_1b) &
            is.na(deathcausecode_1c) &
            is.na(deathcausecode_2) &
            is.na(deathcausecode_underlying)
        ) ~ "Dead without Cause",
        (
          !str_detect(vitalstatus, "A") &
            (
              !is.na(deathcausecode_1a) |
                !is.na(deathcausecode_1b) |
                !is.na(deathcausecode_1c) |
                !is.na(deathcausecode_2) |
                !is.na(deathcausecode_underlying)
            )
        ) ~ "Dead with Cause",
        TRUE ~ NA
      )
  ) |>
  as.data.table()

# Here we identify the patient's most recently diagnosed cancer 
tumour <- sim_av_tumour |>
  select(
    PATIENTID,
    TUMOURID,
    SITE_ICD10_O2_3CHAR,
    DIAGNOSISDATEBEST,
    STAGE_BEST,
    QUINTILE_2019,
    AGE
  ) |>
  collect() |>
  janitor::clean_names() |>
  distinct() |>
  anti_join(exclusions) |>
  rename(
    tumour_site = site_icd10_o2_3char,
    deprivation_quintile = quintile_2019,
    age_at_diagnosis = age
  ) |>
  as.data.table()

# Dataset merging
## We now merge information from the two source tables and annotate the new dataset
## Annotations are intended to match the original Stata code developed for this analysis as closely as possible, but you will note that there are some deviations
dataset <- merge(patient, tumour) |>
  anti_join(exclusions) |>
  filter(gender != 9) |>
  filter(!(gender == 2 & str_detect(tumour_site, "C6[0123]"))) |>
  filter(!(gender == 1 &
             str_detect(tumour_site, "C5[12345678]"))) |>
  filter(str_detect(tumour_site, "^C") &
           !str_detect(tumour_site, "C44")) |>
  group_by(patientid) |>
  arrange(desc(diagnosisdatebest)) |>
  slice_head() |>
  rename(
    patient_id = patientid,
    gender_code = gender,
    ethnicity_code = ethnicity,
    vital_status_code = vitalstatus,
    vital_status_date = vitalstatusdate,
    suicide = suicide_2,
    suicide_cause = suicide_cause_2,
    tumour_id = tumourid,
    diagnosis_date = diagnosisdatebest,
    stage = stage_best,
  ) |>
  ungroup() |>
  mutate(
    cancer_group = case_when(
      str_detect(tumour_site, "C0[0123456789]") ~ "Head and neck",
      str_detect(tumour_site, "C1[01234]") ~ "Head and neck",
      str_detect(tumour_site, "C15") ~ "Oesophagus",
      str_detect(tumour_site, "C16") ~ "Stomach",
      str_detect(tumour_site, "C17") ~ "Other malignant neoplasms",
      str_detect(tumour_site, "C1[89]") ~ "Colorectal",
      str_detect(tumour_site, "C20") ~ "Colorectal",
      str_detect(tumour_site, "C22") ~ "Liver",
      str_detect(tumour_site, "C25") ~ "Pancreas",
      str_detect(tumour_site, "C3[12]") ~ "Head and neck",
      str_detect(tumour_site, "C3[34]") ~ "Lung",
      str_detect(tumour_site, "C4[019]") ~ "Sarcoma",
      str_detect(tumour_site, "C43") ~ "Melanoma",
      str_detect(tumour_site, "C45") ~ "Mesothelioma",
      str_detect(tumour_site, "C50") ~ "Breast",
      str_detect(tumour_site, "C53") ~ "Cervix",
      str_detect(tumour_site, "C5[45]") ~ "Uterus",
      str_detect(tumour_site, "C5[67]") ~ "Ovary",
      str_detect(tumour_site, "C61") ~ "Prostate",
      str_detect(tumour_site, "C62") ~ "Testis",
      str_detect(tumour_site, "C6[4568]") ~ "Kidney and unspecified urinary organs",
      str_detect(tumour_site, "C67") ~ "Bladder",
      str_detect(tumour_site, "C7[012]") ~ "Central Nervous System (incl brain)",
      str_detect(tumour_site, "C73") ~ "Head and neck",
      str_detect(tumour_site, "C7[789]") ~ "Cancer of Unknown Primary",
      str_detect(tumour_site, "C80") ~ "Cancer of Unknown Primary",
      str_detect(tumour_site, "C81") ~ "Hodgkin lymphoma",
      str_detect(tumour_site, "C8[2345]") ~ "Non-Hodgkin lymphoma",
      str_detect(tumour_site, "C88") ~ "Multiple myeloma",
      str_detect(tumour_site, "C90") ~ "Multiple myeloma",
      str_detect(tumour_site, "C9[12345]") ~ "Leukaemia",
      TRUE ~ "Other malignant neoplasms"
    )
  ) |>
  mutate(
    gender = fcase(
      gender_code == 1,
      "Male",
      gender_code == 2,
      "Female",
      gender_code == 9,
      "Indeterminate",
      gender_code == "X",
      NA
    )
  ) |>
  mutate(
    ethnicity = fcase(
      ethnicity_code == "0",
      "White",
      ethnicity_code == "A",
      "White",
      ethnicity_code == "B",
      "White",
      ethnicity_code == "C",
      "White",
      ethnicity_code == "CA",
      "Unknown",
      ethnicity_code == "CP",
      "White",
      ethnicity_code == "D",
      "Mixed",
      ethnicity_code == "E",
      "Mixed",
      ethnicity_code == "F",
      "Mixed",
      ethnicity_code == "G",
      "Mixed",
      ethnicity_code == "H",
      "Asian",
      ethnicity_code == "J",
      "Asian",
      ethnicity_code == "K",
      "Asian",
      ethnicity_code == "L",
      "Asian",
      ethnicity_code == "M",
      "Black",
      ethnicity_code == "N",
      "Black",
      ethnicity_code == "P",
      "Black",
      ethnicity_code == "R",
      "Other",
      ethnicity_code == "S",
      "Other",
      ethnicity_code == "X",
      "Unknown",
      ethnicity_code == "Z",
      "Not Stated"
    )
  ) |>
  mutate(
    vital_status = fcase(
      vital_status_code == "A",
      "Alive",
      vital_status_code == "D",
      "Dead",
      vital_status_code == "D3",
      "Dies before diagnosis",
      vital_status_code == "X2",
      "Patient may have been returned and successfully traced",
      vital_status_code == "X4",
      "Patient may have been returned but has not been successfully traced",
      vital_status_code == "X5",
      "Loss to follow up"
    )
  ) |>
  mutate(diagnosis_year = factor(year(diagnosis_date))) |>
  as.data.table()

## Generate random ages and derive age-based variables
age_breaks <-
  c(0, 19, 24, 29, 34, 39, 44, 49, 54, 59, 64, 69, 74, 79, 84, 89, 999)
age_labels <-
  c(
    "15-19",
    "20-24",
    "25-29",
    "30-34",
    "35-39",
    "40-44",
    "45-49",
    "50-54",
    "55-59",
    "60-64",
    "65-69",
    "70-74",
    "75-79",
    "80-84",
    "85-89",
    "90+"
  )

dataset <- dataset |>
  partition(cluster) |>
  group_by(patient_id) |>
  mutate(age_at_diagnosis = floor(rnorm(
    n = 1, mean = 65, sd = 12.5
  ))) |>
  collect() |>
  ungroup() |>
  mutate(
    age_at_diagnosis = case_when(
      age_at_diagnosis < 18 ~ 18,
      age_at_diagnosis > 99 ~ 99,
      TRUE ~ age_at_diagnosis
    )
  ) |>
  mutate(birth_date = diagnosis_date - lubridate::years(age_at_diagnosis) - lubridate::days(floor(runif(
    n = 1, min = 1, max = 10
  )))) |>
  mutate(
    age_category_at_diagnosis = cut(
      age_at_diagnosis,
      breaks = age_breaks,
      labels = age_labels,
      right = TRUE
    )
  ) |>
  mutate(death_date = case_when(vital_status == "Dead" ~ vital_status_date,
                                TRUE ~ NA)) |>
  mutate(age_at_death = floor(as.numeric(
    difftime(death_date, birth_date, unit = "days") /
      365.25
  ))) |>
  mutate(
    age_category_at_death = cut(
      age_at_death,
      breaks = age_breaks,
      labels = age_labels,
      right = TRUE
    )
  ) |>
  mutate(age_category_at_death = case_when(vital_status != "Dead" ~ NA,
                                           TRUE ~ age_category_at_death)) |>
  mutate(suicide_death_date = case_when(suicide == 1 ~ vital_status_date,
                                        TRUE ~ NA)) |>
  mutate(age_at_suicide_death = floor(as.numeric(
    difftime(suicide_death_date, birth_date, unit = "days") /
      365.25
  ))) |>
  mutate(
    age_category_at_suicide_death = cut(
      age_at_suicide_death,
      breaks = age_breaks,
      labels = age_labels,
      right = TRUE
    )
  ) |>
  mutate(
    age_category_at_suicide_death = case_when(suicide == 0 ~ NA,
                                              TRUE ~ age_category_at_suicide_death)
  )

## Calculate months since diagnosis
months_breaks <- c(-1, 5, 11, 23, 35, 59, 119, 999)
months_labels <-
  c("0-5 mo", "6-11 mo", "12-23 mo", "24-35mo", "3-4y", "5-9y", "10+ y")

dataset <- dataset |>
  mutate(months_from_diagnosis = floor(as.numeric(
    difftime(vital_status_date, diagnosis_date, units = c("weeks"))
  ) / 4)) |>
  mutate(months_from_diagnosis = case_when(months_from_diagnosis < 0 ~ 0,
                                           TRUE ~ months_from_diagnosis)) |>
  mutate(
    months_category_from_diagnosis = cut(
      months_from_diagnosis,
      breaks = months_breaks,
      labels = months_labels,
      right = TRUE
    )
  )

# Reverse deprivation quintiles
## Deprivation is sometimes reported from least to most deprived or from most to least deprived. Here, we reverse the quintiles to match what is presented in the JAMA Psychiatry publication. 
dataset <- dataset |>
  mutate(
    deprivation_quintile_reversed = case_when(
      deprivation_quintile == "1 - most deprived" ~ "5, Most",
      deprivation_quintile == "2" ~ "4",
      deprivation_quintile == "3" ~ "3",
      deprivation_quintile == "4" ~ "2",
      deprivation_quintile == "5 - least deprived" ~ "1, Least",
    )
  )

# Survival analysis
## Here we shape the data as survival data, following up from the diagnosis date to the date of the vital status update 
survival <- dataset |>
  mutate(time = as.numeric(difftime(vital_status_date, diagnosis_date, unit = "days"))) |>
  filter(time >= 0) |>
  mutate(time = case_when(time == 0 ~ 1,
                          TRUE ~ time)) |>
  mutate(time = as.numeric(time)/365.25) |>
  mutate(suicide = suicide + 1) |>
  mutate(
    gender = factor(gender, levels = c("Male", "Female")),
    ethnicity = factor(
      ethnicity,
      levels = c(
        "White",
        "Asian" ,
        "Black",
        "Mixed",
        "Other",
        "Unknown",
        "Not Stated"
      )
    ),
    cancer_group = as.factor(cancer_group),
    deprivation_quintile_reversed = as.factor(deprivation_quintile_reversed),
  )

# Variable labels
## Here we apply a number of variable labels for tidiness
label(survival$age_at_death) <- "Attained age at death"
label(survival$age_at_diagnosis) <-
  "Attained age at cancer diagnosis, y"
label(survival$age_at_suicide_death) <-
  "Attained age at suicide death, y"
label(survival$age_category_at_death) <- "Age at death (attained age), y"
label(survival$age_category_at_diagnosis) <-
  "Age category at cancer diagnosis"
label(survival$age_category_at_suicide_death) <-
  "Age category at suicide death"
label(survival$birth_date) <- "Birth date"
label(survival$cancer_group) <- "Last primary cancer"
label(survival$deathcausecode_1a) <- "Death cause code - 1A"
label(survival$deathcausecode_1b) <- "Death cause code - 1B"
label(survival$deathcausecode_1c) <- "Death cause code - 1C"
label(survival$deathcausecode_2) <- "Death cause code - 2"
label(survival$deathcausecode_underlying) <-
  "Death cause code - Underlying"
label(survival$death_cause_flag) <- "Alive or Dead"
label(survival$death_date) <- "Date of death"
label(survival$deprivation_quintile) <- "Deprivation"
label(survival$deprivation_quintile_reversed) <- "Deprivation"
label(survival$diagnosis_date) <- "Date of Diagnosis"
label(survival$diagnosis_year) <- "Year of cancer diagnosis"
label(survival$ethnicity) <- "Ethnicity"
label(survival$ethnicity_code) <- "Ethnicity code"
label(survival$gender) <- "Sex"
label(survival$gender_code) <- "Gender code"
label(survival$months_from_diagnosis) <- "Time from cancer diagnosis"
label(survival$months_category_from_diagnosis) <- "Time from cancer diagnosis"
label(survival$patient_id) <- "Patient ID"
label(survival$stage) <- "Tumour stage"
label(survival$suicide) <- "Suicide-related death (Y/N)"
label(survival$suicide_cause) <-
  "Suicide-related death listing on death certificate"
label(survival$suicide_death_date) <- "Date of suicide death"
label(survival$time) <- "Person-years observed"
label(survival$tumour_id) <- "Tumour ID"
label(survival$tumour_site) <- "Tumour site"
label(survival$vital_status) <- "Vital Status"
label(survival$vital_status_code) <- "Vital status code"
label(survival$vital_status_date) <- "Vital status date"

# Setting variables
## Here we define our explanatory and dependent variables 
## Note that the dependent function is a survival function 
explanatory <- c("gender",
                 "months_category_from_diagnosis",
                 "age_category_at_death",
                 "cancer_group",
                 "deprivation_quintile_reversed",
                 "ethnicity",
                 "diagnosis_year")

dependent <- "Surv(time, suicide)"

# Load in ONS suicide data
## This is the data for suicides in the general population which we will use to standardise the suicide rates among people with cancer
## Notably, this data was provided as Excel sheets which require a fair amount of cleaning, so some of the code may seem idiosyncratic because we have to deal with some of the strangeness of trying to import and reshape Excel data for our needs

suicides2022 <- read_xlsx(
  here("suicides2022.xlsx"),
  sheet = "Table_6",
  range = "A6:CK132",
  col_names = TRUE
) |>
  janitor::clean_names() %>%
  rename_with(~ str_remove(., "_note.*")) |>
  select(-contains("x10_14_"))

# Clean up observed suicides
observed_suicides <- suicides2022 |>
  rename(year = year_of_death_registration) |>
  select(-area_code, -area_of_usual_residence) |>
  select(!contains("_ucl") &
           !contains("_lcl") & !contains("_marker")) %>%
  rename_with( ~ str_replace(., "^x", "")) |>
  group_by(sex, year) |>
  select(contains("_number_of_deaths")) |>
  pivot_longer(cols = ends_with("_number_of_deaths")) |>
  rename(attained_age = name, observed_suicides = value) |>
  mutate(attained_age = str_replace(attained_age, "_number_of_deaths", "")) |>
  mutate(attained_age = str_replace(attained_age, "_", "-"))

# Clean up standardised suicide rates
observed_suicide_rates <- suicides2022 |>
  rename(year = year_of_death_registration) |>
  select(-area_code, -area_of_usual_residence) |>
  select(!contains("_ucl") &
           !contains("_lcl") & !contains("_marker")) %>%
  rename_with( ~ str_replace(., "^x", "")) |>
  group_by(sex, year) |>
  select(contains("_rate_per_100_000")) |>
  pivot_longer(cols = ends_with("_rate_per_100_000")) |>
  rename(attained_age = name, suicides_per_100000 = value) |>
  mutate(attained_age = str_replace(attained_age, "_rate_per_100_000", "")) |>
  mutate(attained_age = str_replace(attained_age, "_", "-"))

reference_suicides <- observed_suicides |>
  inner_join(observed_suicide_rates) |>
  mutate(sex = case_when(sex == "Males" ~ "Male", sex == "Females" ~ "Female", TRUE ~ sex)) |>
  mutate(attained_age =
           case_when(attained_age == "90" ~ "90+" , TRUE ~ attained_age)) |>
  select(year,
         sex,
         attained_age,
         observed_suicides,
         suicides_per_100000) |>
  rename(reference_suicides = observed_suicides)

# Save and export raw dataset
save(dataset,
     exclusions,
     survival,
     reference_suicides,
     file = here("dataset.Rdata"))

write_csv(dataset, file = here("dataset", "dataset.csv"))

write_csv(survival, file = here("dataset", "survival.csv"))

write_csv(reference_suicides,
          file = here("dataset", "reference_suicides.csv"))

Using the standardised suicide rates in the general population, we calculate the expected numbers of suicide among those diagnosed with cancer and compare them to the number of suicides actually observed. This gives us the standardised mortality ratio (SMR).

\[ \text{standardised mortality ratio}=\frac{\text{observed number of suicides}}{\text{expected number of suicides}} \]

We can also then calculate absolute excess risk (AER) per 10 000 as follows:

\[ \text{absolute excess risk per 10,000}=\frac{\text{observed number of suicides} - \text{expected number of suicides}}{\text{person-years at risk}} \]

2 Sample characteristics

These tables aims to reproduce the main outputs of Table 1 in (Henson et al., 2019). There are some important differences to note:

  • Age groups are categorised according to the same ones provided by the ONS standardised mortality rates since we will need to standardise using those same age groups anyway

  • Simulacrum covers a different time period than the authors use in (Henson et al., 2019)

  • Ages were randomly generated, so don’t read too much into them

R Code
# Table 1
## A quirk of the finalfit package is that survival events need to be coded as "1" and "2" rather than "0" and "1" 
survival |>
  mutate(suicide = as.factor(case_when(suicide == 2 ~ "Y", 
                                       suicide == 1 ~ "N", 
                                       TRUE ~ NA))) |>
  summary_factorlist(
    dependent = "suicide",
    explanatory = c("gender",
                 "months_category_from_diagnosis",
                 "age_category_at_death",
                 "cancer_group",
                 "deprivation_quintile_reversed",
                 "ethnicity",
                 "age_category_at_diagnosis",
                 "diagnosis_year"),
    total_col = TRUE,
    add_dependent_label = TRUE,
    add_col_totals = TRUE
  ) |>
  kableExtra::kable() |> 
  kable_classic(full_width = F) |> 
  footnote(general = "Percentages provided in parentheses.")
Dependent: suicide N Y Total
Total N (%) 1189118 (99.9) 1180 (0.1) 1190298
Sex Male 613178 (51.6) 723 (61.3) 613901 (51.6)
Female 575940 (48.4) 457 (38.7) 576397 (48.4)
Time from cancer diagnosis 0-5 mo 212013 (17.8) 395 (33.5) 212408 (17.8)
6-11 mo 78731 (6.6) 176 (14.9) 78907 (6.6)
12-23 mo 97424 (8.2) 237 (20.1) 97661 (8.2)
24-35mo 105740 (8.9) 154 (13.1) 105894 (8.9)
3-4y 382207 (32.1) 184 (15.6) 382391 (32.1)
5-9y 313003 (26.3) 34 (2.9) 313037 (26.3)
10+ y 0 (0.0) 0 (0.0) 0 (0.0)
Age at death (attained age), y 15-19 56 (0.0) 0 (0.0) 56 (0.0)
20-24 239 (0.0) 2 (0.2) 241 (0.0)
25-29 853 (0.2) 4 (0.3) 857 (0.2)
30-34 2622 (0.5) 4 (0.3) 2626 (0.5)
35-39 6770 (1.3) 14 (1.2) 6784 (1.3)
40-44 15064 (2.9) 33 (2.8) 15097 (2.9)
45-49 28887 (5.5) 67 (5.7) 28954 (5.5)
50-54 47163 (9.0) 101 (8.6) 47264 (9.0)
55-59 66720 (12.7) 145 (12.3) 66865 (12.7)
60-64 79828 (15.2) 156 (13.2) 79984 (15.2)
65-69 81973 (15.6) 198 (16.8) 82171 (15.6)
70-74 72311 (13.8) 152 (12.9) 72463 (13.8)
75-79 54431 (10.4) 131 (11.1) 54562 (10.4)
80-84 34688 (6.6) 76 (6.4) 34764 (6.6)
85-89 19186 (3.7) 62 (5.3) 19248 (3.7)
90+ 14518 (2.8) 34 (2.9) 14552 (2.8)
Last primary cancer Bladder 31935 (2.7) 35 (3.0) 31970 (2.7)
Breast 178351 (15.0) 97 (8.2) 178448 (15.0)
Cancer of Unknown Primary 8875 (0.7) 9 (0.8) 8884 (0.7)
Central Nervous System (incl brain) 18932 (1.6) 19 (1.6) 18951 (1.6)
Cervix 10734 (0.9) 13 (1.1) 10747 (0.9)
Colorectal 136413 (11.5) 122 (10.3) 136535 (11.5)
Head and neck 49813 (4.2) 108 (9.2) 49921 (4.2)
Hodgkin lymphoma 7120 (0.6) 13 (1.1) 7133 (0.6)
Kidney and unspecified urinary organs 42377 (3.6) 51 (4.3) 42428 (3.6)
Leukaemia 32780 (2.8) 33 (2.8) 32813 (2.8)
Liver 20979 (1.8) 17 (1.4) 20996 (1.8)
Lung 152151 (12.8) 184 (15.6) 152335 (12.8)
Melanoma 55370 (4.7) 36 (3.1) 55406 (4.7)
Mesothelioma 8995 (0.8) 8 (0.7) 9003 (0.8)
Multiple myeloma 21580 (1.8) 16 (1.4) 21596 (1.8)
Non-Hodgkin lymphoma 47014 (4.0) 61 (5.2) 47075 (4.0)
Oesophagus 29461 (2.5) 20 (1.7) 29481 (2.5)
Other malignant neoplasms 38373 (3.2) 33 (2.8) 38406 (3.2)
Ovary 24055 (2.0) 9 (0.8) 24064 (2.0)
Pancreas 35001 (2.9) 22 (1.9) 35023 (2.9)
Prostate 171745 (14.4) 204 (17.3) 171949 (14.4)
Sarcoma 8496 (0.7) 11 (0.9) 8507 (0.7)
Stomach 20590 (1.7) 26 (2.2) 20616 (1.7)
Testis 7940 (0.7) 15 (1.3) 7955 (0.7)
Uterus 30038 (2.5) 18 (1.5) 30056 (2.5)
Deprivation 1, Least 256470 (21.6) 252 (21.4) 256722 (21.6)
2 256429 (21.6) 242 (20.5) 256671 (21.6)
3 245122 (20.6) 247 (20.9) 245369 (20.6)
4 220523 (18.5) 209 (17.7) 220732 (18.5)
5, Most 210574 (17.7) 230 (19.5) 210804 (17.7)
Ethnicity White 1039481 (88.9) 1029 (88.2) 1040510 (88.9)
Asian 30647 (2.6) 39 (3.3) 30686 (2.6)
Black 23013 (2.0) 26 (2.2) 23039 (2.0)
Mixed 5735 (0.5) 4 (0.3) 5739 (0.5)
Other 18664 (1.6) 14 (1.2) 18678 (1.6)
Unknown 7795 (0.7) 11 (0.9) 7806 (0.7)
Not Stated 44147 (3.8) 44 (3.8) 44191 (3.8)
Age category at cancer diagnosis 15-19 179 (0.0) 0 (0.0) 179 (0.0)
20-24 626 (0.1) 2 (0.2) 628 (0.1)
25-29 2291 (0.2) 4 (0.3) 2295 (0.2)
30-34 6586 (0.6) 8 (0.7) 6594 (0.6)
35-39 17369 (1.5) 12 (1.0) 17381 (1.5)
40-44 38011 (3.2) 37 (3.1) 38048 (3.2)
45-49 72013 (6.1) 81 (6.9) 72094 (6.1)
50-54 115164 (9.7) 104 (8.8) 115268 (9.7)
55-59 157842 (13.3) 151 (12.8) 157993 (13.3)
60-64 184585 (15.5) 174 (14.7) 184759 (15.5)
65-69 184510 (15.5) 192 (16.3) 184702 (15.5)
70-74 157836 (13.3) 141 (11.9) 157977 (13.3)
75-79 115375 (9.7) 121 (10.3) 115496 (9.7)
80-84 71405 (6.0) 78 (6.6) 71483 (6.0)
85-89 38225 (3.2) 44 (3.7) 38269 (3.2)
90+ 27101 (2.3) 31 (2.6) 27132 (2.3)
Year of cancer diagnosis 2016 290193 (24.4) 298 (25.3) 290491 (24.4)
2017 290613 (24.4) 301 (25.5) 290914 (24.4)
2018 305343 (25.7) 289 (24.5) 305632 (25.7)
2019 302969 (25.5) 292 (24.7) 303261 (25.5)
Note:
Percentages provided in parentheses.

We can also output an analogous table using Python.

Python Code
# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib as mpl
import tableone as tbl1
import tabulate
from IPython.display import Markdown

# Import dataset
df = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/survival.csv")

## Redefine suicide == 0 and suicide == 1 
df.eval("suicide = suicide - 1", inplace=True)

# Annotate dataset
columns = ['gender', 'months_category_from_diagnosis', 'age_category_at_death', 'cancer_group', 'deprivation_quintile_reversed', 'ethnicity', 'age_category_at_diagnosis', 'diagnosis_year']

categorical = ['gender', 'months_category_from_diagnosis', 'age_category_at_death', 'cancer_group', 'deprivation_quintile_reversed', 'ethnicity', 'age_category_at_diagnosis', 'diagnosis_year']

order = {
  'gender': ['Male', 'Female'],
  'months_category_from_diagnosis': ['0-5 mo', '6-11 mo', '12-23 mo', '24-35mo', '3-4y', '5-9y', '10+ y'],
  'age_category_at_death': ['15-19', '20-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79', '80-84', '85-89', '90+'],
  'ethnicity': ['White', 'Asian', 'Black', 'Mixed', 'Other', 'Unknown', 'Not Stated']
  }

labels = {
  'gender': 'Sex',
  'suicide': 'Suicide Death',
  'months_category_from_diagnosis': 'Time from cancer diagnosis',
  'age_category_at_death': 'Age at death (attained age), y',
  'cancer_group': 'Last primary cancer',
  'deprivation_quintile_reversed': 'Deprivation',
  'ethnicity': 'Ethnicity',
  'age_category_at_diagnosis': 'Age category at cancer diagnosis',
  'diagnosis_year': 'Year of cancer diagnosis'
  }

groupby = ['suicide']

#Produce Table One
mytable = tbl1.TableOne(df, columns = columns, rename = labels, order = order, categorical = categorical, groupby = groupby)

Markdown(mytable.tabulate(tablefmt="grid"))
Missing Overall 0 1
n 1190298 1189118 1180
Sex, n (%) Male 0 613901 (51.6) 613178 (51.6) 723 (61.3)
Female 576397 (48.4) 575940 (48.4) 457 (38.7)
Time from cancer diagnosis, n (%) 0-5 mo 0 212408 (17.8) 212013 (17.8) 395 (33.5)
6-11 mo 78907 (6.6) 78731 (6.6) 176 (14.9)
12-23 mo 97661 (8.2) 97424 (8.2) 237 (20.1)
24-35mo 105894 (8.9) 105740 (8.9) 154 (13.1)
3-4y 382391 (32.1) 382207 (32.1) 184 (15.6)
5-9y 313037 (26.3) 313003 (26.3) 34 (2.9)
Age at death (attained age), y, n (%) 15-19 663810 56 (0.0) 56 (0.0)
20-24 241 (0.0) 239 (0.0) 2 (0.2)
25-29 857 (0.2) 853 (0.2) 4 (0.3)
30-34 2626 (0.5) 2622 (0.5) 4 (0.3)
35-39 6784 (1.3) 6770 (1.3) 14 (1.2)
40-44 15097 (2.9) 15064 (2.9) 33 (2.8)
45-49 28954 (5.5) 28887 (5.5) 67 (5.7)
50-54 47264 (9.0) 47163 (9.0) 101 (8.6)
55-59 66865 (12.7) 66720 (12.7) 145 (12.3)
60-64 79984 (15.2) 79828 (15.2) 156 (13.2)
65-69 82171 (15.6) 81973 (15.6) 198 (16.8)
70-74 72463 (13.8) 72311 (13.8) 152 (12.9)
75-79 54562 (10.4) 54431 (10.4) 131 (11.1)
80-84 34764 (6.6) 34688 (6.6) 76 (6.4)
85-89 19248 (3.7) 19186 (3.7) 62 (5.3)
90+ 14552 (2.8) 14518 (2.8) 34 (2.9)
Last primary cancer, n (%) Bladder 0 31970 (2.7) 31935 (2.7) 35 (3.0)
Breast 178448 (15.0) 178351 (15.0) 97 (8.2)
Cancer of Unknown Primary 8884 (0.7) 8875 (0.7) 9 (0.8)
Central Nervous System (incl brain) 18951 (1.6) 18932 (1.6) 19 (1.6)
Cervix 10747 (0.9) 10734 (0.9) 13 (1.1)
Colorectal 136535 (11.5) 136413 (11.5) 122 (10.3)
Head and neck 49921 (4.2) 49813 (4.2) 108 (9.2)
Hodgkin lymphoma 7133 (0.6) 7120 (0.6) 13 (1.1)
Kidney and unspecified urinary organs 42428 (3.6) 42377 (3.6) 51 (4.3)
Leukaemia 32813 (2.8) 32780 (2.8) 33 (2.8)
Liver 20996 (1.8) 20979 (1.8) 17 (1.4)
Lung 152335 (12.8) 152151 (12.8) 184 (15.6)
Melanoma 55406 (4.7) 55370 (4.7) 36 (3.1)
Mesothelioma 9003 (0.8) 8995 (0.8) 8 (0.7)
Multiple myeloma 21596 (1.8) 21580 (1.8) 16 (1.4)
Non-Hodgkin lymphoma 47075 (4.0) 47014 (4.0) 61 (5.2)
Oesophagus 29481 (2.5) 29461 (2.5) 20 (1.7)
Other malignant neoplasms 38406 (3.2) 38373 (3.2) 33 (2.8)
Ovary 24064 (2.0) 24055 (2.0) 9 (0.8)
Pancreas 35023 (2.9) 35001 (2.9) 22 (1.9)
Prostate 171949 (14.4) 171745 (14.4) 204 (17.3)
Sarcoma 8507 (0.7) 8496 (0.7) 11 (0.9)
Stomach 20616 (1.7) 20590 (1.7) 26 (2.2)
Testis 7955 (0.7) 7940 (0.7) 15 (1.3)
Uterus 30056 (2.5) 30038 (2.5) 18 (1.5)
Deprivation, n (%) 1, Least 0 256722 (21.6) 256470 (21.6) 252 (21.4)
2 256671 (21.6) 256429 (21.6) 242 (20.5)
3 245369 (20.6) 245122 (20.6) 247 (20.9)
4 220732 (18.5) 220523 (18.5) 209 (17.7)
5, Most 210804 (17.7) 210574 (17.7) 230 (19.5)
Ethnicity, n (%) White 19649 1040510 (88.9) 1039481 (88.9) 1029 (88.2)
Asian 30686 (2.6) 30647 (2.6) 39 (3.3)
Black 23039 (2.0) 23013 (2.0) 26 (2.2)
Mixed 5739 (0.5) 5735 (0.5) 4 (0.3)
Other 18678 (1.6) 18664 (1.6) 14 (1.2)
Unknown 7806 (0.7) 7795 (0.7) 11 (0.9)
Not Stated 44191 (3.8) 44147 (3.8) 44 (3.8)
Age category at cancer diagnosis, n (%) 15-19 0 179 (0.0) 179 (0.0)
20-24 628 (0.1) 626 (0.1) 2 (0.2)
25-29 2295 (0.2) 2291 (0.2) 4 (0.3)
30-34 6594 (0.6) 6586 (0.6) 8 (0.7)
35-39 17381 (1.5) 17369 (1.5) 12 (1.0)
40-44 38048 (3.2) 38011 (3.2) 37 (3.1)
45-49 72094 (6.1) 72013 (6.1) 81 (6.9)
50-54 115268 (9.7) 115164 (9.7) 104 (8.8)
55-59 157993 (13.3) 157842 (13.3) 151 (12.8)
60-64 184759 (15.5) 184585 (15.5) 174 (14.7)
65-69 184702 (15.5) 184510 (15.5) 192 (16.3)
70-74 157977 (13.3) 157836 (13.3) 141 (11.9)
75-79 115496 (9.7) 115375 (9.7) 121 (10.3)
80-84 71483 (6.0) 71405 (6.0) 78 (6.6)
85-89 38269 (3.2) 38225 (3.2) 44 (3.7)
90+ 27132 (2.3) 27101 (2.3) 31 (2.6)
Year of cancer diagnosis, n (%) 2016 0 290491 (24.4) 290193 (24.4) 298 (25.3)
2017 290914 (24.4) 290613 (24.4) 301 (25.5)
2018 305632 (25.7) 305343 (25.7) 289 (24.5)
2019 303261 (25.5) 302969 (25.5) 292 (24.7)

3 Suicide SMRs and AERs per 10,000 person-years at risk according to last primary cancer

These tables aim to reproduce some of the outputs of Table 2 in (Henson et al., 2019).

R Code
# Calculate suicides among cancer patients
cancer_suicides <- survival |>
  rename(attained_age = age_category_at_death, sex = gender) |>
  mutate(year = year(vital_status_date)) |>
  mutate(suicide = suicide - 1) |>
  group_by(year, sex, attained_age, cancer_group) |>
  mutate(sex = as.character(sex)) |>
  mutate(attained_age = as.character(attained_age)) |>
  summarise(person_years = round(sum(time)),
            suicide_deaths = sum(suicide)) |>
  drop_na() |>
  distinct() |>
  select(year,
         sex,
         attained_age,
         person_years,
         suicide_deaths,
         cancer_group) |>
  ungroup()

# Calculate SMRs
cancer_smr <- cancer_suicides |>
  left_join(reference_suicides) |>
  mutate(expected_suicides = (person_years / 10000) * suicides_per_100000) |>
  group_by(cancer_group) |>
  summarise (
    observed_suicides = sum(suicide_deaths),
    expected_suicides = round(sum(expected_suicides)),
    person_years = sum(person_years)
  ) |>
  mutate(
    smr = round(observed_suicides / expected_suicides, digits = 2),
    aer = round((observed_suicides - expected_suicides) / person_years * 10000,
                 digits = 2
    )
  ) |>
  select(cancer_group,
         person_years,
         observed_suicides,
         expected_suicides,
         smr,
         aer)

kableExtra::kable(
  cancer_smr,
  col.names = c(
    "Last primary cancer",
    "Person-years observed",
    "Observed suicide deaths",
    "Expected suicide deaths",
    "Standardised mortality ratio",
    "Absolute excess risk per 10 000"
  )
) |>
  kable_classic(full_width = F)
Last primary cancer Person-years observed Observed suicide deaths Expected suicide deaths Standardised mortality ratio Absolute excess risk per 10 000
Bladder 23977 35 29 1.21 2.50
Breast 65578 97 33 2.94 9.76
Cancer of Unknown Primary 3536 9 4 2.25 14.14
Central Nervous System (incl brain) 12074 19 13 1.46 4.97
Cervix 3907 13 2 6.50 28.15
Colorectal 89450 122 98 1.24 2.68
Head and neck 25121 107 28 3.82 31.45
Hodgkin lymphoma 1652 13 2 6.50 66.59
Kidney and unspecified urinary organs 23501 51 28 1.82 9.79
Leukaemia 18277 33 20 1.65 7.11
Liver 14182 17 16 1.06 0.71
Lung 108106 184 115 1.60 6.38
Melanoma 19560 36 20 1.80 8.18
Mesothelioma 8057 8 10 0.80 -2.48
Multiple myeloma 16948 16 19 0.84 -1.77
Non-Hodgkin lymphoma 24413 61 27 2.26 13.93
Oesophagus 22827 20 25 0.80 -2.19
Other malignant neoplasms 22732 33 24 1.38 3.96
Ovary 16310 9 8 1.12 0.61
Pancreas 18091 22 20 1.10 1.11
Prostate 76454 204 119 1.71 11.12
Sarcoma 5179 11 6 1.83 9.65
Stomach 14889 26 16 1.62 6.72
Testis 646 15 1 15.00 216.72
Uterus 15434 18 8 2.25 6.48

We can also output an analogous table using Python.

Python Code
# Import necessary packages
import pandas as pd
import numpy as np
from tabulate import tabulate
from IPython.display import Markdown

# Import dataset
df_survival = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/survival.csv")
df_suicides = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/reference_suicides.csv")

# Calculate suicides among cancer patients
df_survival['attained_age'] = df_survival['age_category_at_death']
df_survival['sex'] = df_survival['gender']
df_survival['year'] = pd.to_datetime(df_survival['vital_status_date']).dt.year
df_survival['suicide'] = df_survival['suicide'] - 1

# Group by and summarize
grouped = df_survival.groupby(['year', 'sex', 'attained_age', 'cancer_group'])
cancer_suicides = grouped.agg(
    person_years=pd.NamedAgg(column='time', aggfunc=lambda x: round(x.sum())),
    suicide_deaths=pd.NamedAgg(column='suicide', aggfunc='sum')
).reset_index()
cancer_suicides.dropna(inplace=True)
cancer_suicides = cancer_suicides.drop_duplicates()
cancer_suicides['sex'] = cancer_suicides['sex'].astype(str)
cancer_suicides['attained_age'] = cancer_suicides['attained_age'].astype(str)

# Prepare for merging by ensuring alignment of the key columns used for join
cancer_suicides = cancer_suicides.merge(df_suicides, on=['year', 'sex', 'attained_age'], how='left')

# Handle potential NaN values from the merge before calculation
cancer_suicides['suicides_per_100000'] = cancer_suicides['suicides_per_100000']

# Calculate expected suicides
cancer_suicides['expected_suicides'] = (cancer_suicides['person_years'] / 10000) * cancer_suicides['suicides_per_100000']

# Aggregate for SMR calculation
grouped_smr = cancer_suicides.groupby('cancer_group')
cancer_smr = grouped_smr.agg(
    observed_suicides=pd.NamedAgg(column='suicide_deaths', aggfunc='sum'),
    expected_suicides=pd.NamedAgg(column='expected_suicides', aggfunc=lambda x: round(x.sum())),
    person_years=pd.NamedAgg(column='person_years', aggfunc='sum')
).reset_index()

cancer_smr['smr'] = np.round(cancer_smr['observed_suicides'] / cancer_smr['expected_suicides'], 2)
cancer_smr['aer'] = np.round((cancer_smr['observed_suicides'] - cancer_smr['expected_suicides']) / cancer_smr['person_years'] * 10000, 2)

# Display as a table with custom column names
cancer_smr_table = cancer_smr[['cancer_group', 'person_years', 'observed_suicides', 'expected_suicides', 'smr', 'aer']]
cancer_smr_table.columns = [
    'Last primary cancer', 
    'Person-years observed', 
    'Observed suicide deaths', 
    'Expected suicide deaths', 
    'Standardised mortality ratio', 
    'Absolute excess risk per 10 000'
]

# Display the table
Markdown(tabulate(cancer_smr_table, headers='keys'))
Last primary cancer Person-years observed Observed suicide deaths Expected suicide deaths Standardised mortality ratio Absolute excess risk per 10 000
0 Bladder 23977 35 29 1.21 2.5
1 Breast 65578 97 33 2.94 9.76
2 Cancer of Unknown Primary 3536 9 4 2.25 14.14
3 Central Nervous System (incl brain) 12074 19 13 1.46 4.97
4 Cervix 3907 13 2 6.5 28.15
5 Colorectal 89450 122 98 1.24 2.68
6 Head and neck 25121 107 28 3.82 31.45
7 Hodgkin lymphoma 1652 13 2 6.5 66.59
8 Kidney and unspecified urinary organs 23501 51 28 1.82 9.79
9 Leukaemia 18277 33 20 1.65 7.11
10 Liver 14182 17 16 1.06 0.71
11 Lung 108106 184 115 1.6 6.38
12 Melanoma 19560 36 20 1.8 8.18
13 Mesothelioma 8057 8 10 0.8 -2.48
14 Multiple myeloma 16948 16 19 0.84 -1.77
15 Non-Hodgkin lymphoma 24413 61 27 2.26 13.93
16 Oesophagus 22827 20 25 0.8 -2.19
17 Other malignant neoplasms 22732 33 24 1.38 3.96
18 Ovary 16310 9 8 1.12 0.61
19 Pancreas 18091 22 20 1.1 1.11
20 Prostate 76454 204 119 1.71 11.12
21 Sarcoma 5179 11 6 1.83 9.65
22 Stomach 14889 26 16 1.62 6.72
23 Testis 646 15 1 15 216.72
24 Uterus 15434 18 8 2.25 6.48

4 Suicide SMRs and AERs per 10 000 person-years at risk according to sex for all cancers combined

These tables aim to reproduce some of the outputs of Table 3 in (Henson et al., 2019).

R Code
# Calculate suicides among cancer patients
cancer_suicides <- survival |>
  rename(attained_age = age_category_at_death, sex = gender) |>
  mutate(year = year(vital_status_date)) |>
  mutate(suicide = suicide - 1) |>
  group_by(year,
           sex,
           attained_age) |>
  mutate(sex = as.character(sex)) |>
  mutate(attained_age = as.character(attained_age)) |>
  summarise(person_years = round(sum(time)),
            suicide_deaths = sum(suicide)) |>
  drop_na() |>
  distinct() |>
  select(
    year,
    sex,
    attained_age,
    person_years,
    suicide_deaths
  ) |>
  ungroup()

# Calculate SMRs
sex_smr <- cancer_suicides |>
  left_join(reference_suicides) |>
  mutate(expected_suicides = (person_years / 10000) * suicides_per_100000) |>
  group_by(sex) |>
  summarise (
    observed_suicides = sum(suicide_deaths),
    expected_suicides = round(sum(expected_suicides)),
    person_years = sum(person_years)
  ) |>
  mutate(
    smr = round(observed_suicides / expected_suicides, digits = 2),
    aer = round((observed_suicides - expected_suicides) / person_years * 10000,
                 digits = 2
    )
  ) |>
  select(sex,
         person_years,
         observed_suicides,
         expected_suicides,
         smr,
         aer)

kableExtra::kable(
  sex_smr,
  col.names = c(
    "Sex",
    "Person-years observed",
    "Observed suicide deaths",
    "Expected suicide deaths",
    "Standardised mortality ratio",
    "Absolute excess risk per 10 000"
  )
) |>
  kable_classic(full_width = F)
Sex Person-years observed Observed suicide deaths Expected suicide deaths Standardised mortality ratio Absolute excess risk per 10 000
Female 304081 457 149 3.07 10.13
Male 346831 722 541 1.33 5.22

We can also output an analogous table using Python.

Python Code
# Import necessary packages
import pandas as pd
import numpy as np
from tabulate import tabulate
from IPython.display import Markdown

# Import dataset
df_survival = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/survival.csv")
df_suicides = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/reference_suicides.csv")

# Calculate suicides among cancer patients
df_survival['attained_age'] = df_survival['age_category_at_death']
df_survival['sex'] = df_survival['gender']
df_survival['year'] = pd.to_datetime(df_survival['vital_status_date']).dt.year
df_survival['suicide'] = df_survival['suicide'] - 1

# Group by and summarize
grouped = df_survival.groupby(['year', 'sex', 'attained_age', 'cancer_group'])
cancer_suicides = grouped.agg(
    person_years=pd.NamedAgg(column='time', aggfunc=lambda x: round(x.sum())),
    suicide_deaths=pd.NamedAgg(column='suicide', aggfunc='sum')
).reset_index()
cancer_suicides.dropna(inplace=True)
cancer_suicides = cancer_suicides.drop_duplicates()
cancer_suicides['sex'] = cancer_suicides['sex'].astype(str)
cancer_suicides['attained_age'] = cancer_suicides['attained_age'].astype(str)

# Prepare for merging by ensuring alignment of the key columns used for join
cancer_suicides = cancer_suicides.merge(df_suicides, on=['year', 'sex', 'attained_age'], how='left')

# Handle potential NaN values from the merge before calculation
cancer_suicides['suicides_per_100000'] = cancer_suicides['suicides_per_100000']

# Calculate expected suicides
cancer_suicides['expected_suicides'] = (cancer_suicides['person_years'] / 10000) * cancer_suicides['suicides_per_100000']

# Aggregate for SMR calculation
grouped_smr = cancer_suicides.groupby('sex')

sex_smr = grouped_smr.agg(
    observed_suicides=pd.NamedAgg(column='suicide_deaths', aggfunc='sum'),
    expected_suicides=pd.NamedAgg(column='expected_suicides', aggfunc=lambda x: round(x.sum())),
    person_years=pd.NamedAgg(column='person_years', aggfunc='sum')
).reset_index()

sex_smr['smr'] = np.round(sex_smr['observed_suicides'] / sex_smr['expected_suicides'], 2)
sex_smr['aer'] = np.round((sex_smr['observed_suicides'] - sex_smr['expected_suicides']) / sex_smr['person_years'] * 10000, 2)

# Display as a table with custom column names
sex_smr_table = sex_smr[['sex', 'person_years', 'observed_suicides', 'expected_suicides', 'smr', 'aer']]
sex_smr_table.columns = [
    'Sex', 
    'Person-years observed', 
    'Observed suicide deaths', 
    'Expected suicide deaths', 
    'Standardised mortality ratio', 
    'Absolute excess risk per 10 000'
]

# Display the table
Markdown(tabulate(sex_smr_table, headers='keys'))
Sex Person-years observed Observed suicide deaths Expected suicide deaths Standardised mortality ratio Absolute excess risk per 10 000
0 Female 304067 457 149 3.07 10.13
1 Male 346834 722 541 1.33 5.22

5 Suicide SMRs and AERs per 10 000 person-years at risk according to attained age at death for all cancers combined

These tables aim to reproduce some of the outputs of Table 3 in (Henson et al., 2019).

R Code
# Calculate suicides among cancer patients
cancer_suicides <- survival |>
  rename(attained_age = age_category_at_death, sex = gender) |>
  mutate(year = year(vital_status_date)) |>
  mutate(suicide = suicide - 1) |>
  group_by(year,
           sex,
           attained_age) |>
  mutate(sex = as.character(sex)) |>
  mutate(attained_age = as.character(attained_age)) |>
  summarise(person_years = round(sum(time)),
            suicide_deaths = sum(suicide)) |>
  drop_na() |>
  distinct() |>
  select(
    year,
    sex,
    attained_age,
    person_years,
    suicide_deaths
  ) |>
  ungroup()

# Calculate SMRs
age_smr <- cancer_suicides |>
  left_join(reference_suicides) |>
  mutate(expected_suicides = (person_years / 10000) * suicides_per_100000) |>
  group_by(attained_age) |>
  summarise (
    observed_suicides = sum(suicide_deaths),
    expected_suicides = round(sum(expected_suicides)),
    person_years = sum(person_years)
  ) |>
  mutate(
    smr = round(observed_suicides / expected_suicides, digits = 2),
    aer = round((observed_suicides - expected_suicides) / person_years * 10000,
                 digits = 2
    )
  ) |>
  select(attained_age,
         person_years,
         observed_suicides,
         expected_suicides,
         smr,
         aer)

kableExtra::kable(
  age_smr,
  col.names = c(
    "Attained age at death",
    "Person-years observed",
    "Observed suicide deaths",
    "Expected suicide deaths",
    "Standardised mortality ratio",
    "Absolute excess risk per 10 000"
  )
) |>
  kable_classic(full_width = F)
Attained age at death Person-years observed Observed suicide deaths Expected suicide deaths Standardised mortality ratio Absolute excess risk per 10 000
15-19 24 0 0 NaN 0.00
20-24 215 2 0 Inf 93.02
25-29 772 4 1 4.00 38.86
30-34 2578 4 3 1.33 3.88
35-39 6575 14 9 1.56 7.60
40-44 15343 33 21 1.57 7.82
45-49 30954 67 49 1.37 5.82
50-54 52204 101 78 1.29 4.41
55-59 76916 145 98 1.48 6.11
60-64 95812 156 107 1.46 5.11
65-69 101858 198 91 2.18 10.50
70-74 94135 152 73 2.08 8.39
75-79 73584 131 57 2.30 10.06
80-84 48877 76 41 1.85 7.16
85-89 28183 62 32 1.94 10.64
90+ 22882 34 31 1.10 1.31

We can also output an analogous table using Python.

Python Code
# Import necessary packages
import pandas as pd
import numpy as np
from tabulate import tabulate
from IPython.display import Markdown

# Import dataset
df_survival = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/survival.csv")
df_suicides = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/reference_suicides.csv")

# Calculate suicides among cancer patients
df_survival['attained_age'] = df_survival['age_category_at_death']
df_survival['sex'] = df_survival['gender']
df_survival['year'] = pd.to_datetime(df_survival['vital_status_date']).dt.year
df_survival['suicide'] = df_survival['suicide'] - 1

# Group by and summarize
grouped = df_survival.groupby(['year', 'sex', 'attained_age', 'cancer_group'])
cancer_suicides = grouped.agg(
    person_years=pd.NamedAgg(column='time', aggfunc=lambda x: round(x.sum())),
    suicide_deaths=pd.NamedAgg(column='suicide', aggfunc='sum')
).reset_index()
cancer_suicides.dropna(inplace=True)
cancer_suicides = cancer_suicides.drop_duplicates()
cancer_suicides['sex'] = cancer_suicides['sex'].astype(str)
cancer_suicides['attained_age'] = cancer_suicides['attained_age'].astype(str)

# Prepare for merging by ensuring alignment of the key columns used for join
cancer_suicides = cancer_suicides.merge(df_suicides, on=['year', 'sex', 'attained_age'], how='left')

# Handle potential NaN values from the merge before calculation
cancer_suicides['suicides_per_100000'] = cancer_suicides['suicides_per_100000']

# Calculate expected suicides
cancer_suicides['expected_suicides'] = (cancer_suicides['person_years'] / 10000) * cancer_suicides['suicides_per_100000']

# Aggregate for SMR calculation
grouped_smr = cancer_suicides.groupby('attained_age')

age_smr = grouped_smr.agg(
    observed_suicides=pd.NamedAgg(column='suicide_deaths', aggfunc='sum'),
    expected_suicides=pd.NamedAgg(column='expected_suicides', aggfunc=lambda x: round(x.sum())),
    person_years=pd.NamedAgg(column='person_years', aggfunc='sum')
).reset_index()

age_smr['smr'] = np.round(age_smr['observed_suicides'] / age_smr['expected_suicides'], 2)
age_smr['aer'] = np.round((age_smr['observed_suicides'] - age_smr['expected_suicides']) / age_smr['person_years'] * 10000, 2)

# Display as a table with custom column names
age_smr_table = age_smr[['attained_age', 'person_years', 'observed_suicides', 'expected_suicides', 'smr', 'aer']]
age_smr_table.columns = [
    'Attained age at death', 
    'Person-years observed', 
    'Observed suicide deaths', 
    'Expected suicide deaths', 
    'Standardised mortality ratio', 
    'Absolute excess risk per 10 000'
]

# Display the table
Markdown(tabulate(age_smr_table, headers='keys'))
Attained age at death Person-years observed Observed suicide deaths Expected suicide deaths Standardised mortality ratio Absolute excess risk per 10 000
0 15-19 24 0 0 nan 0
1 20-24 208 2 0 inf 96.15
2 25-29 769 4 1 4 39.01
3 30-34 2585 4 3 1.33 3.87
4 35-39 6574 14 9 1.56 7.61
5 40-44 15351 33 21 1.57 7.82
6 45-49 30951 67 49 1.37 5.82
7 50-54 52192 101 78 1.29 4.41
8 55-59 76919 145 98 1.48 6.11
9 60-64 95809 156 107 1.46 5.11
10 65-69 101856 198 91 2.18 10.51
11 70-74 94132 152 73 2.08 8.39
12 75-79 73598 131 57 2.3 10.05
13 80-84 48869 76 41 1.85 7.16
14 85-89 28181 62 32 1.94 10.65
15 90+ 22883 34 31 1.1 1.31

6 Suicide SMRs and AERs per 10 000 person-years at risk according to deprivation at death for all cancers combined

These tables aim to reproduce some of the outputs of Table 3 in (Henson et al., 2019).

R Code
# Calculate suicides among cancer patients
cancer_suicides <- survival |>
  rename(attained_age = age_category_at_death, sex = gender) |>
  mutate(year = year(vital_status_date)) |>
  mutate(suicide = suicide - 1) |>
  group_by(year,
           sex,
           attained_age,
           deprivation_quintile_reversed) |>
  mutate(sex = as.character(sex)) |>
  mutate(attained_age = as.character(attained_age)) |>
  summarise(person_years = round(sum(time)),
            suicide_deaths = sum(suicide)) |>
  drop_na() |>
  distinct() |>
  select(
    year,
    sex,
    attained_age,
    person_years,
    suicide_deaths,
    deprivation_quintile_reversed
  ) |>
  ungroup()

# Calculate SMRs
deprivation_smr <- cancer_suicides |>
  left_join(reference_suicides) |>
  mutate(expected_suicides = (person_years / 10000) * suicides_per_100000) |>
  group_by(deprivation_quintile_reversed) |>
  summarise (
    observed_suicides = sum(suicide_deaths),
    expected_suicides = round(sum(expected_suicides)),
    person_years = sum(person_years)
  ) |>
  mutate(
    smr = round(observed_suicides / expected_suicides, digits = 2),
    aer = round((observed_suicides - expected_suicides) / person_years * 10000,
                 digits = 2
    )
  ) |>
  select(deprivation_quintile_reversed,
         person_years,
         observed_suicides,
         expected_suicides,
         smr,
         aer)

kableExtra::kable(
  deprivation_smr,
  col.names = c(
    "Deprivation",
    "Person-years observed",
    "Observed suicide deaths",
    "Expected suicide deaths",
    "Standardised mortality ratio",
    "Absolute excess risk per 10 000"
  )
) |>
  kable_classic(full_width = F)
Deprivation Person-years observed Observed suicide deaths Expected suicide deaths Standardised mortality ratio Absolute excess risk per 10 000
1, Least 136824 252 145 1.74 7.82
2 138537 242 148 1.64 6.79
3 133581 247 142 1.74 7.86
4 122291 208 130 1.60 6.38
5, Most 119670 230 126 1.83 8.69

We can also output an analogous table using Python.

Python Code
# Import necessary packages
import pandas as pd
import numpy as np
from tabulate import tabulate
from IPython.display import Markdown

# Import dataset
df_survival = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/survival.csv")
df_suicides = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/reference_suicides.csv")

# Calculate suicides among cancer patients
df_survival['attained_age'] = df_survival['age_category_at_death']
df_survival['sex'] = df_survival['gender']
df_survival['year'] = pd.to_datetime(df_survival['vital_status_date']).dt.year
df_survival['suicide'] = df_survival['suicide'] - 1

# Group by and summarize
grouped = df_survival.groupby(['year', 'sex', 'attained_age', 'deprivation_quintile_reversed'])

cancer_suicides = grouped.agg(
    person_years=pd.NamedAgg(column='time', aggfunc=lambda x: round(x.sum())),
    suicide_deaths=pd.NamedAgg(column='suicide', aggfunc='sum')
).reset_index()

cancer_suicides.dropna(inplace=True)

cancer_suicides = cancer_suicides.drop_duplicates()

cancer_suicides['sex'] = cancer_suicides['sex'].astype(str)
cancer_suicides['attained_age'] = cancer_suicides['attained_age'].astype(str)

# Prepare for merging by ensuring alignment of the key columns used for join
cancer_suicides = cancer_suicides.merge(df_suicides, on=['year', 'sex', 'attained_age'], how='left')

# Handle potential NaN values from the merge before calculation
cancer_suicides['suicides_per_100000'] = cancer_suicides['suicides_per_100000']

# Calculate expected suicides
cancer_suicides['expected_suicides'] = (cancer_suicides['person_years'] / 10000) * cancer_suicides['suicides_per_100000']

# Aggregate for SMR calculation
grouped_smr = cancer_suicides.groupby('deprivation_quintile_reversed')

deprivation_smr = grouped_smr.agg(
    observed_suicides=pd.NamedAgg(column='suicide_deaths', aggfunc='sum'),
    expected_suicides=pd.NamedAgg(column='expected_suicides', aggfunc=lambda x: round(x.sum())),
    person_years=pd.NamedAgg(column='person_years', aggfunc='sum')
).reset_index()

deprivation_smr['smr'] = np.round(deprivation_smr['observed_suicides'] / deprivation_smr['expected_suicides'], 2)
deprivation_smr['aer'] = np.round((deprivation_smr['observed_suicides'] - deprivation_smr['expected_suicides']) / deprivation_smr['person_years'] * 10000, 2)

# Display as a table with custom column names
deprivation_smr_table = deprivation_smr[['deprivation_quintile_reversed', 'person_years', 'observed_suicides', 'expected_suicides', 'smr', 'aer']]
deprivation_smr_table.columns = [
    'Deprivation', 
    'Person-years observed', 
    'Observed suicide deaths', 
    'Expected suicide deaths', 
    'Standardised mortality ratio', 
    'Absolute excess risk per 10 000'
]

# Display the table
Markdown(tabulate(deprivation_smr_table, headers='keys'))
Deprivation Person-years observed Observed suicide deaths Expected suicide deaths Standardised mortality ratio Absolute excess risk per 10 000
0 1, Least 136824 252 145 1.74 7.82
1 2 138537 242 148 1.64 6.79
2 3 133581 247 142 1.74 7.86
3 4 122291 208 130 1.6 6.38
4 5, Most 119670 230 126 1.83 8.69

7 Suicide SMRs and AERs per 10 000 person-years at risk according to ethnicity at death for all cancers combined

These tables aim to reproduce some of the outputs of Table 3 in (Henson et al., 2019).

R Code
# Calculate suicides among cancer patients
cancer_suicides <- survival |>
  rename(attained_age = age_category_at_death, sex = gender) |>
  mutate(year = year(vital_status_date)) |>
  mutate(suicide = suicide - 1) |>
  group_by(year,
           sex,
           attained_age,
           ethnicity) |>
  mutate(sex = as.character(sex)) |>
  mutate(attained_age = as.character(attained_age)) |>
  summarise(person_years = round(sum(time)),
            suicide_deaths = sum(suicide)) |>
  drop_na() |>
  distinct() |>
  select(
    year,
    sex,
    attained_age,
    person_years,
    suicide_deaths,
    ethnicity
  ) |>
  ungroup()

# Calculate SMRs
ethnicity_smr <- cancer_suicides |>
  left_join(reference_suicides) |>
  mutate(expected_suicides = (person_years / 10000) * suicides_per_100000) |>
  group_by(ethnicity) |>
  summarise (
    observed_suicides = sum(suicide_deaths),
    expected_suicides = round(sum(expected_suicides)),
    person_years = sum(person_years)
  ) |>
  mutate(
    smr = round(observed_suicides / expected_suicides, digits = 2),
    aer = round((observed_suicides - expected_suicides) / person_years * 10000,
                 digits = 2
    )
  ) |>
  select(ethnicity,
         person_years,
         observed_suicides,
         expected_suicides,
         smr,
         aer)

kableExtra::kable(
  ethnicity_smr,
  col.names = c(
    "Ethnicity",
    "Person-years observed",
    "Observed suicide deaths",
    "Expected suicide deaths",
    "Standardised mortality ratio",
    "Absolute excess risk per 10 000"
  )
) |>
  kable_classic(full_width = F)
Ethnicity Person-years observed Observed suicide deaths Expected suicide deaths Standardised mortality ratio Absolute excess risk per 10 000
White 572763 1028 608 1.69 7.33
Asian 16455 39 17 2.29 13.37
Black 12175 26 13 2.00 10.68
Mixed 2969 4 3 1.33 3.37
Other 10059 14 10 1.40 3.98
Unknown 3952 11 4 2.75 17.71
Not Stated 22998 44 25 1.76 8.26

We can also output an analogous table using Python.

Python Code
# Import necessary packages
import pandas as pd
import numpy as np
from tabulate import tabulate
from IPython.display import Markdown

# Import dataset
df_survival = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/survival.csv")
df_suicides = pd.read_csv("C:/Users/yangj/OneDrive - University College London/NCRAS/dataset/reference_suicides.csv")

# Calculate suicides among cancer patients
df_survival['attained_age'] = df_survival['age_category_at_death']
df_survival['sex'] = df_survival['gender']
df_survival['year'] = pd.to_datetime(df_survival['vital_status_date']).dt.year
df_survival['suicide'] = df_survival['suicide'] - 1

# Group by and summarize
grouped = df_survival.groupby(['year', 'sex', 'attained_age', 'ethnicity'])

cancer_suicides = grouped.agg(
    person_years=pd.NamedAgg(column='time', aggfunc=lambda x: round(x.sum())),
    suicide_deaths=pd.NamedAgg(column='suicide', aggfunc='sum')
).reset_index()

cancer_suicides.dropna(inplace=True)

cancer_suicides = cancer_suicides.drop_duplicates()

cancer_suicides['sex'] = cancer_suicides['sex'].astype(str)
cancer_suicides['attained_age'] = cancer_suicides['attained_age'].astype(str)

# Prepare for merging by ensuring alignment of the key columns used for join
cancer_suicides = cancer_suicides.merge(df_suicides, on=['year', 'sex', 'attained_age'], how='left')

# Handle potential NaN values from the merge before calculation
cancer_suicides['suicides_per_100000'] = cancer_suicides['suicides_per_100000']

# Calculate expected suicides
cancer_suicides['expected_suicides'] = (cancer_suicides['person_years'] / 10000) * cancer_suicides['suicides_per_100000']

# Aggregate for SMR calculation
grouped_smr = cancer_suicides.groupby('ethnicity')

ethnicity_smr = grouped_smr.agg(
    observed_suicides=pd.NamedAgg(column='suicide_deaths', aggfunc='sum'),
    expected_suicides=pd.NamedAgg(column='expected_suicides', aggfunc=lambda x: round(x.sum())),
    person_years=pd.NamedAgg(column='person_years', aggfunc='sum')
).reset_index()

ethnicity_smr['smr'] = np.round(ethnicity_smr['observed_suicides'] / ethnicity_smr['expected_suicides'], 2)
ethnicity_smr['aer'] = np.round((ethnicity_smr['observed_suicides'] - ethnicity_smr['expected_suicides']) / ethnicity_smr['person_years'] * 10000, 2)

# Display as a table with custom column names
ethnicity_smr_table = ethnicity_smr[['ethnicity', 'person_years', 'observed_suicides', 'expected_suicides', 'smr', 'aer']]

ethnicity_smr_table.columns = [
    'Ethnicity', 
    'Person-years observed', 
    'Observed suicide deaths', 
    'Expected suicide deaths', 
    'Standardised mortality ratio', 
    'Absolute excess risk per 10 000'
]

# Display the table
Markdown(tabulate(ethnicity_smr_table, headers='keys'))
Ethnicity Person-years observed Observed suicide deaths Expected suicide deaths Standardised mortality ratio Absolute excess risk per 10 000
0 Asian 16455 39 17 2.29 13.37
1 Black 12175 26 13 2 10.68
2 Mixed 2969 4 3 1.33 3.37
3 Not Stated 22998 44 25 1.76 8.26
4 Other 10059 14 10 1.4 3.98
5 Unknown 3952 11 4 2.75 17.71
6 White 572763 1028 608 1.69 7.33

References

Henson, K. E., Brock, R., Charnock, J., Wickramasinghe, B., Will, O., & Pitman, A. (2019). Risk of suicide after cancer diagnosis in england. JAMA Psychiatry, 76(1), 51. https://doi.org/10.1001/jamapsychiatry.2018.3181
Naing, N. N. (2000). Easy way to learn standardization: Direct and indirect methods. The Malaysian Journal of Medical Sciences: MJMS, 7(1), 10.

Reuse