Notes on Stata/R/Python
University College London
Camden & Islington NHS Foundation Trust
2024-05-22
The standardised mortality ratio (SMR) compares the rate of suicide deaths among cancer patients with the rate of suicide deaths in the general population, standardised for age, sex, and time period.
\[ \text{SMR}=\frac{\text{observed number of suicides}}{\text{expected number of suicides}} \]
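The absolute excess risk (AER), or attributable risk, is the difference between two absolute risks over a specific time period. In this case, it is the difference between the observed and expected numbers of suicides among patients with cancer, divided by the person-years at risk. The R and Python code in these notes scales it to a rate per 10,000 person-years.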
\[ \text{AER}=\frac{\text{observed number of suicides} - \text{expected number of suicides}}{\text{person-years at risk}} \]
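As a worked illustration of the two formulas, the following minimal Python sketch computes the expected number of suicides by indirect standardisation and then the SMR and AER. The strata, the function name `smr_and_aer`, and all numbers are hypothetical and chosen only to make the arithmetic easy to follow. Reference rates are assumed to be per 100,000 person-years, and the AER is scaled per 10,000 person-years.

```python
# Hypothetical strata (e.g. age-sex-period cells); illustrative numbers only.
# Each tuple: (person-years, reference suicides per 100,000 person-years, observed suicides)
strata = [
    (12_000, 10.0, 3),
    (8_000, 15.0, 2),
    (5_000, 20.0, 2),
]


def smr_and_aer(strata, aer_per=10_000):
    """Return observed, expected, person-years, SMR and AER summed over strata."""
    observed = sum(obs for _, _, obs in strata)
    # Expected count: stratum person-years multiplied by the reference rate
    expected = sum(py / 100_000 * rate for py, rate, _ in strata)
    person_years = sum(py for py, _, _ in strata)
    smr = observed / expected
    aer = (observed - expected) / person_years * aer_per
    return observed, expected, person_years, smr, aer


observed, expected, person_years, smr, aer = smr_and_aer(strata)
print(f"O = {observed}, E = {expected:.1f}, SMR = {smr:.2f}, AER = {aer:.2f} per 10,000 person-years")
```

Expected counts are summed across strata before dividing, so the SMR is a ratio of totals rather than an average of stratum-specific ratios.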
* Calculate expected numbers of deaths using population rates
gen E_suicide = (_t-_t0) * (poprate / 100000)
* Observed / expected numbers of suicides, formatted for display
gen obstr = string(_dsuicide) + " / " + string(E_suicide , "%9.0f")
* SMR with exact Poisson 95% confidence limits (via the inverse incomplete gamma function)
gen smr = (_dsuicide/E_suicide)
gen smrll = ((invgammap( _dsuicide, (0.05)/2))/E_suicide)
gen smrul = ((invgammap((_dsuicide+ 1), (1.95)/2))/E_suicide)
gen str smrstr = string(smr , "%9.1f") + " (" + string(smrll , "%9.1f") + "," + string(smrul , "%9.1f") + ")"
* AER, truncated at zero, with a normal-approximation 95% confidence interval
gen aer = cond(((_dsuicide- E_suicide)/pyrs)>0 , ((_dsuicide- E_suicide)/pyrs) , 0)
gen aerll = aer - (1.96*(sqrt(_dsuicide)/pyrs))
gen aerul = aer + (1.96*(sqrt(_dsuicide)/pyrs))
gen str aerstr = string(aer , "%9.1f") + " (" + string(aerll , "%9.1f") + "," + string(aerul , "%9.1f") + ")"
sort `v'
decode `v', gen(strdiag)
gen str8 factor=""
replace factor = "`v'"
keep cancergroup2 factor strdiag smrstr* obstr* aerstr*
save "$resultjuly\result-overallsmr-broad-1899-dep-`v'-`i'", replace
restore
}
}
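The Stata code above obtains exact Poisson confidence limits for the observed count with `invgammap` and divides them by the expected count. The same limits can be sketched in plain Python by solving the Poisson tail equations numerically. This is an illustrative sketch rather than a port of the original code: the function names (`poisson_cdf`, `bisect_root`, `exact_poisson_ci`) and the bracketing rule for the search interval are assumptions introduced here.

```python
import math


def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), summed in log space to avoid underflow."""
    if k < 0:
        return 0.0
    if mu == 0:
        return 1.0
    log_mu = math.log(mu)
    total = sum(math.exp(i * log_mu - mu - math.lgamma(i + 1)) for i in range(k + 1))
    return min(total, 1.0)


def bisect_root(f, lo, hi, tol=1e-9):
    """Find a root of f on [lo, hi] by bisection; f(lo) and f(hi) must differ in sign."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_poisson_ci(observed, alpha=0.05):
    """Exact two-sided confidence limits for a Poisson mean given an observed count."""
    # Generous upper bracket for the search (an assumption of this sketch)
    hi = observed + 10 * math.sqrt(observed + 1) + 20
    if observed == 0:
        lower = 0.0
    else:
        # Lower limit: the mean mu at which P(X >= observed) = alpha / 2
        lower = bisect_root(lambda mu: (1 - poisson_cdf(observed - 1, mu)) - alpha / 2, 0.0, hi)
    # Upper limit: the mean mu at which P(X <= observed) = alpha / 2
    upper = bisect_root(lambda mu: poisson_cdf(observed, mu) - alpha / 2, 0.0, hi)
    return lower, upper


# Hypothetical example: 7 observed suicides against 3.4 expected
observed, expected = 7, 3.4
lower, upper = exact_poisson_ci(observed)
print(f"SMR = {observed / expected:.2f} ({lower / expected:.2f}, {upper / expected:.2f})")
```

Dividing both limits by the expected count gives the interval for the SMR, which mirrors how `smrll` and `smrul` are formed in the Stata code.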
# Calculate SMRs
cancer_smr <- cancer_suicides |>
left_join(reference_suicides) |>
mutate(expected_suicides = (person_years / 100000) * suicides_per_100000) |>
group_by(cancer_group) |>
summarise (
observed_suicides = sum(suicide_deaths),
expected_suicides = round(sum(expected_suicides)),
person_years = sum(person_years)
) |>
mutate(
smr = round(observed_suicides / expected_suicides, digits = 2),
aer = round((observed_suicides - expected_suicides) / person_years * 10000,
digits = 2
)
) |>
select(cancer_group,
person_years,
observed_suicides,
expected_suicides,
smr,
aer)
import numpy as np
import pandas as pd

# Calculate suicides among cancer patients
df_survival['attained_age'] = df_survival['age_category_at_death']
df_survival['sex'] = df_survival['gender']
df_survival['year'] = pd.to_datetime(df_survival['vital_status_date']).dt.year
# Shift the suicide indicator down by one so that it is 0/1 (assumes it was coded 1/2)
df_survival['suicide'] = df_survival['suicide'] - 1
# Group by and summarize
grouped = df_survival.groupby(['year', 'sex', 'attained_age', 'cancer_group'])
cancer_suicides = grouped.agg(
person_years=pd.NamedAgg(column='time', aggfunc=lambda x: round(x.sum())),
suicide_deaths=pd.NamedAgg(column='suicide', aggfunc='sum')
).reset_index()
cancer_suicides.dropna(inplace=True)
cancer_suicides = cancer_suicides.drop_duplicates()
cancer_suicides['sex'] = cancer_suicides['sex'].astype(str)
cancer_suicides['attained_age'] = cancer_suicides['attained_age'].astype(str)
# Prepare for merging by ensuring alignment of the key columns used for join
cancer_suicides = cancer_suicides.merge(df_suicides, on=['year', 'sex', 'attained_age'], how='left')
# Handle potential NaN values from the merge before calculation:
# strata with no matching reference rate contribute no expected suicides
cancer_suicides['suicides_per_100000'] = cancer_suicides['suicides_per_100000'].fillna(0)
# Calculate expected suicides
cancer_suicides['expected_suicides'] = (cancer_suicides['person_years'] / 100000) * cancer_suicides['suicides_per_100000']
# Aggregate for SMR calculation
grouped_smr = cancer_suicides.groupby('cancer_group')
cancer_smr = grouped_smr.agg(
observed_suicides=pd.NamedAgg(column='suicide_deaths', aggfunc='sum'),
expected_suicides=pd.NamedAgg(column='expected_suicides', aggfunc=lambda x: round(x.sum())),
person_years=pd.NamedAgg(column='person_years', aggfunc='sum')
).reset_index()
cancer_smr['smr'] = np.round(cancer_smr['observed_suicides'] / cancer_smr['expected_suicides'], 2)
cancer_smr['aer'] = np.round((cancer_smr['observed_suicides'] - cancer_smr['expected_suicides']) / cancer_smr['person_years'] * 10000, 2)
Most analysts become fluent in one or more statistical software packages, but a given analysis is usually conducted in just one. However, several packages allow different statistical software to be called from within the same document:
Dealing with large datasets can be challenging, especially when they are larger than memory. General advice for dealing with them includes: