@ Biostatistics & Hands on Practices on Medical Data Using SPSS, ILBS
SUMAN KUMAR
18 August 2017
|   | gos6 | outcome | gender | age | wfns | s100b  | ndka  |
|---|------|---------|--------|-----|------|--------|-------|
| 1 | 5    | Good    | Female | 42  | 1    | -2.040 | 1.102 |
| 2 | 5    | Good    | Female | 37  | 1    | -1.966 | 2.145 |
| 3 | 5    | Good    | Female | 42  | 1    | -2.303 | 2.091 |
| 4 | 5    | Good    | Female | 27  | 1    | -3.219 | 2.344 |
| 5 | 1    | Poor    | Female | 42  | 3    | -2.040 | 2.856 |
| 6 | 1    | Poor    | Male   | 48  | 2    | -2.303 | 2.546 |
This is Bayes' theorem, the crux of all diagnostic tests:
\[ P(\text{Class} \mid \text{Test}) = \frac{P(\text{Test} \mid \text{Class})\, P(\text{Class})}{P(\text{Test})} \]
In real life, we apply a chain of diagnostic tests to assign a particular class to a patient.
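A minimal numeric sketch of the update Bayes' theorem performs. The sensitivity (0.90), specificity (0.95), and prevalence (1%) here are hypothetical numbers chosen for illustration, not values from the data above:

```python
# Bayes' theorem: post-test probability of disease given a positive test,
# using hypothetical test characteristics.
sens, spec, prev = 0.90, 0.95, 0.01

# Total probability of a positive test: true positives + false positives
p_pos = sens * prev + (1 - spec) * (1 - prev)

# Posterior (post-test) probability of disease given a positive test
p_disease_given_pos = sens * prev / p_pos
print(round(p_disease_given_pos, 3))  # → 0.154
```

Even with a good test, a rare disease yields a low post-test probability, which is why the pretest prevalence matters so much in what follows.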
Diagnostic Test 1: S100 calcium-binding protein B (s100b assay)

|    | s100b  | outcome |
|----|--------|---------|
| 1  | -2.040 | Good    |
| 2  | -1.966 | Good    |
| 3  | -2.303 | Good    |
| 4  | -3.219 | Good    |
| 5  | -2.040 | Poor    |
| 6  | -2.303 | Poor    |
| 7  | -0.755 | Good    |
| 8  | -1.833 | Poor    |
| 9  | -1.715 | Good    |
| 10 | -2.303 | Good    |
Higher values of s100b are associated with the Poor class
The diagnostic test makes mistakes (misclassifications)
| pred_outcome | Good | Poor |
|--------------|------|------|
| Pred_Good    | 63   | 24   |
| Pred_Poor    | 9    | 17   |
The confusion matrix depicts the discriminatory performance of the test
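The counts in the confusion matrix above yield the familiar test metrics. A minimal Python sketch, taking "Poor" as the positive class:

```python
# Counts from the confusion matrix above ("Poor" = positive class)
tp, fn = 17, 24   # Poor patients predicted Poor / predicted Good
tn, fp = 63, 9    # Good patients predicted Good / predicted Poor

sensitivity = tp / (tp + fn)              # 17/41
specificity = tn / (tn + fp)              # 63/72
pos_lr = sensitivity / (1 - specificity)  # positive likelihood ratio
neg_lr = (1 - sensitivity) / specificity  # negative likelihood ratio

print(round(sensitivity, 4), round(specificity, 4),
      round(pos_lr, 4), round(neg_lr, 3))
# → 0.4146 0.875 3.3171 0.669
```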
Post-test odds of being in class 1, given the test is positive for class 1 = pretest odds of being in class 1 × positive LR
Post-test odds of being in class 1, given the test is negative for class 1 = pretest odds of being in class 1 × negative LR
A higher positive LR (e.g., 10) increases the chance that the patient belongs to class 1 when the test is positive for class 1
A lower negative LR (e.g., 0.1) decreases the chance that the patient belongs to class 1 when the test is negative for class 1
We need to know the pretest prevalence of disease to obtain the post-test prevalence of disease (OUR PRIMARY INTEREST)
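The odds-updating rule can be checked numerically with the counts from the confusion matrix above (41 Poor and 72 Good patients) and the positive LR of 3.3171:

```python
# Pretest -> post-test updating via the positive likelihood ratio
poor, good = 41, 72                   # class counts from the confusion matrix
pretest_odds = poor / good            # pretest odds of Poor, 41/72
pos_lr = 3.3171                       # positive LR of s100b for "Poor"

posttest_odds = pretest_odds * pos_lr            # odds of Poor | positive test
posttest_prob = posttest_odds / (1 + posttest_odds)
print(round(posttest_prob, 3))        # → 0.654, i.e. the PPV 17/26
```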
|    | cut_off | sensitivity | specificity |
|----|---------|-------------|-------------|
| 1  | -3.3627 | 0.9756      | 0           |
| 2  | -3.1073 | 0.9756      | 0.0694      |
| 3  | -2.9046 | 0.9756      | 0.1111      |
| 4  | -2.1638 | 0.7561      | 0.5417      |
| 5  | -2.0802 | 0.7317      | 0.5417      |
| 6  | -2.0032 | 0.6829      | 0.5833      |
| 7  | -1.0087 | 0.4146      | 0.875       |
| 8  | -0.9296 | 0.4146      | 0.8889      |
| 9  | -0.8678 | 0.3902      | 0.8889      |
| 10 | -0.0958 | 0.0488      | 1           |
| 11 | 0.3434  | 0.0244      | 1           |
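A table like the one above is built by sweeping candidate cut-offs and recomputing sensitivity and specificity at each one. A sketch of the idea on the 10 observations listed earlier (the table itself was computed on the full dataset, so the numbers differ):

```python
# Sensitivity/specificity at a given cut-off: s100b > cut-off => "Poor"
s100b = [-2.040, -1.966, -2.303, -3.219, -2.040,
         -2.303, -0.755, -1.833, -1.715, -2.303]
outcome = ["Good", "Good", "Good", "Good", "Poor",
           "Poor", "Good", "Poor", "Good", "Good"]

def sens_spec(cut):
    tp = sum(1 for v, o in zip(s100b, outcome) if v > cut and o == "Poor")
    fn = sum(1 for v, o in zip(s100b, outcome) if v <= cut and o == "Poor")
    tn = sum(1 for v, o in zip(s100b, outcome) if v <= cut and o == "Good")
    fp = sum(1 for v, o in zip(s100b, outcome) if v > cut and o == "Good")
    return tp / (tp + fn), tn / (tn + fp)

for cut in (-2.5, -2.0, -1.5):
    sens, spec = sens_spec(cut)
    print(f"cut-off {cut}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

As the cut-off rises, sensitivity falls and specificity rises, exactly the trade-off the table displays.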
|    | outcome | predicted_prob |
|----|---------|----------------|
| 1  | Good    | 0.2847         |
| 2  | Good    | 0.3019         |
| 3  | Good    | 0.2288         |
| 4  | Good    | 0.0961         |
| 5  | Poor    | 0.2847         |
| 6  | Poor    | 0.2288         |
| 7  | Good    | 0.6267         |
| 8  | Poor    | 0.3343         |
| 9  | Good    | 0.3643         |
| 10 | Good    | 0.2288         |
ROC curves should never be used to find cut-offs
Diagnostic Test 2: Nucleoside diphosphate kinase A (NDKA)
Let us say the cut-off is 2.5
|   | var   | sensitivity | specificity | pos_lr | neg_lr |
|---|-------|-------------|-------------|--------|--------|
| 1 | s100b | 0.4146      | 0.875       | 3.3171 | 0.669  |
| 2 | ndka  | 0.6098      | 0.5556      | 1.372  | 0.7024 |
##
## DeLong's test for two correlated ROC curves
##
## data: roc_s100b and roc_ndka
## Z = 1.3908, p-value = 0.1643
## alternative hypothesis: true difference in AUC is not equal to 0
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.7313686 0.6119580
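The AUCs that DeLong's test compares have a useful probabilistic reading: the AUC is the probability that a randomly chosen Poor patient has a higher marker value than a randomly chosen Good patient (ties count one half). A sketch on the 10 s100b observations listed earlier; on this tiny subsample the AUC happens to come out at exactly 0.5, while the full-sample values are the 0.731 and 0.612 reported above:

```python
# AUC = P(marker value of a Poor patient > that of a Good patient),
# counting ties as 1/2 -- the Mann-Whitney interpretation of the AUC.
poor = [-2.040, -2.303, -1.833]                            # Poor patients
good = [-2.040, -1.966, -2.303, -3.219, -0.755, -1.715, -2.303]  # Good

wins = sum(1.0 if p > g else 0.5 if p == g else 0.0
           for p in poor for g in good)
auc = wins / (len(poor) * len(good))   # 10.5 / 21
print(auc)                             # → 0.5
```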
Repeated measurements of a given attribute are made on different entities (subjects)
Measurements can be made by different entities (raters) or can be repeated measurements by the same entity (rater)
Raters can be humans or machines
Raters may be fixed or may be a random sample from a larger population; subjects are always a random sample from a larger population
Single rater, multiple measurements: the rater effect is always fixed
Measurements can be nominal, ordinal, or continuous
Aim: to quantify the magnitude of inter-measurement variation
\[ Val_{obs} = Val_{true} + Error \]
\[ Error = Error_{rater} + Error_{instrument} + Error_{unexplainable} \]
Variance is measure of dispersion of individual measurements from mean
Mean can be of total measurements, measurements within each subject and measurements by each rater
\[ Var_{total} = Var_{between-subject} + Var_{between-rater} + Var_{rest} \]
\[ ICC = Var_{between-subject}/(Var_{between-subject} + Var_{between-rater} + Var_{rest}) \]
ICC: between 0 and 1
ICC < 0.5 indicates poor reliability
Data: Anxiety scores given by three raters on 20 subjects (first 10 subjects shown below)
|    | rater1 | rater2 | rater3 |
|----|--------|--------|--------|
| 1  | 3      | 3      | 2      |
| 2  | 3      | 6      | 1      |
| 3  | 3      | 4      | 4      |
| 4  | 4      | 6      | 4      |
| 5  | 5      | 2      | 3      |
| 6  | 5      | 4      | 2      |
| 7  | 2      | 2      | 1      |
| 8  | 3      | 4      | 6      |
| 9  | 5      | 3      | 1      |
| 10 | 2      | 3      | 1      |
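The ICC formula above is a ratio of variance components. As an illustration of the decomposition, a one-way ANOVA sketch on the 10 subjects shown. (The output below is the two-way agreement ICC(A,1) computed on all 20 subjects, so its value will differ from this simplified one-way version.)

```python
# One-way ICC via ANOVA mean squares, on the 10 subjects shown above
from statistics import mean

ratings = [  # rows: subjects; columns: rater1, rater2, rater3
    [3, 3, 2], [3, 6, 1], [3, 4, 4], [4, 6, 4], [5, 2, 3],
    [5, 4, 2], [2, 2, 1], [3, 4, 6], [5, 3, 1], [2, 3, 1],
]
n, k = len(ratings), len(ratings[0])
grand = mean(v for row in ratings for v in row)

# Between-subject and within-subject sums of squares
ss_between = k * sum((mean(row) - grand) ** 2 for row in ratings)
ss_within = sum((v - mean(row)) ** 2 for row in ratings for v in row)
ms_between = ss_between / (n - 1)
ms_within = ss_within / (n * (k - 1))

# One-way ICC: share of total variance due to between-subject variance
icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
print(round(icc, 3))
```

Like the ICC(A,1) of 0.198 below, this one-way value falls well under 0.5, i.e. poor reliability.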
## Single Score Intraclass Correlation
##
## Model: twoway
## Type : agreement
##
## Subjects = 20
## Raters = 3
## ICC(A,1) = 0.198
##
## F-Test, H0: r0 = 0 ; H1: r0 > 0
## F(19,39.7) = 1.83 , p = 0.0543
##
## 95%-Confidence Interval for ICC Population Values:
## -0.039 < ICC < 0.494
Data: ndka measured by an old and a new method (first 10 of 113 pairs shown)

|    | old    | new    |
|----|--------|--------|
| 1  | 1.1019 | 1.1015 |
| 2  | 2.1448 | 2.2711 |
| 3  | 2.0906 | 2.1748 |
| 4  | 2.3437 | 2.6211 |
| 5  | 2.8565 | 2.9799 |
| 6  | 2.5455 | 2.7093 |
| 7  | 1.7918 | 1.7754 |
| 8  | 2.5802 | 2.8231 |
| 9  | 2.7434 | 2.6784 |
| 10 | 1.7934 | 1.8215 |
##
## Pearson's product-moment correlation
##
## data: ndka_df$old and ndka_df$new
## t = 36.283, df = 111, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9428688 0.9725325
## sample estimates:
## cor
## 0.9603319
Linear correlation does not imply agreement
We have perfect agreement only if the points lie along the line of equality (no difference between the methods), but we have perfect correlation if the points lie along any straight line
A change in the scale of measurement does not affect the correlation, but it does affect the agreement
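The distinction can be demonstrated in a few lines: rescaling a measurement (here a hypothetical second method y = 2x + 1) leaves the correlation at exactly 1 while destroying agreement:

```python
# Perfect correlation without agreement: y is a pure rescaling of x
from statistics import mean

x = [1.0, 2.0, 3.0, 4.0, 5.0]       # "method 1"
y = [2 * v + 1 for v in x]          # "method 2": change of scale and offset

# Pearson correlation, computed directly from its definition
mx, my = mean(x), mean(y)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
sx = sum((a - mx) ** 2 for a in x) ** 0.5
sy = sum((b - my) ** 2 for b in y) ** 0.5
r = cov / (sx * sy)

mean_diff = mean([b - a for a, b in zip(x, y)])  # bias between methods
print(r, mean_diff)                  # r is 1, yet the bias is far from 0
```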
##
## Shapiro-Wilk normality test
##
## data: ba_df$diffs
## W = 0.98909, p-value = 0.5018
The differences are consistent with a normal distribution (Shapiro-Wilk p = 0.50, so normality is not rejected)
The difference between the new and old methods lies between -0.291 and 0.49 in about 95% of cases (the 95% limits of agreement)
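The 95% limits quoted above are Bland-Altman limits of agreement: mean difference ± 1.96 × SD of the differences. A sketch on the 10 pairs shown earlier (the quoted interval of -0.291 to 0.49 was computed on all 113 pairs, so the numbers here differ):

```python
# Bland-Altman 95% limits of agreement on the 10 pairs shown above
from statistics import mean, stdev

old = [1.1019, 2.1448, 2.0906, 2.3437, 2.8565,
       2.5455, 1.7918, 2.5802, 2.7434, 1.7934]
new = [1.1015, 2.2711, 2.1748, 2.6211, 2.9799,
       2.7093, 1.7754, 2.8231, 2.6784, 1.8215]

diffs = [n - o for n, o in zip(new, old)]
bias = mean(diffs)                       # systematic difference (new - old)
sd = stdev(diffs)                        # spread of the differences
lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
print(f"bias={bias:.4f}, LoA=({lower:.4f}, {upper:.4f})")
```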