Essentials of Sample Size Calculation

Suman Kumar Pramanik

09 Aug 2019

Introduction

Sample size calculation is a fundamental requirement before carrying out any inferential study
It requires an understanding of the study design and a rough estimate of the parameters of interest

Population and Sample

Population is usually hypothetical, often not well quantified
- Males in India, Elderly females in India
Our aim is to find out about some attribute of the population
- Prevalence of Heart Attack, Chances of surviving after Lung Cancer at the end of 1 year
- Mean age of males having 1st episode of Acute Coronary Syndrome
We estimate it by studying a sample from the population

Central Dogma of Inferential Statistics

POPULATION \(\rightarrow\) SAMPLE
ESTIMATE POPULATION \(\leftarrow\) SAMPLE
What we get is an Estimate of Actuality from studying a sample

What is a Random Sample?

All the entities in the underlying population have an equal chance of being selected for inclusion in the sample.
Prior inclusion of any entity does not influence the chances of any other entity being selected later.

Simulation: One Population

Scenario

Duration of remission after chemotherapy (exponential distribution)
Mean Duration: 5 months
Our aim is to estimate mean remission duration of the underlying population

Two random samples (size = 100)

Both the random samples are different
Blue Sample (Mean): 4.8525941
Red Sample (Mean): 4.9722859
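The two-sample comparison above can be reproduced with a minimal sketch like the one below. It assumes only the scenario's exponential distribution with a mean of 5 months; the seeds and the helper name `draw_sample` are arbitrary choices for illustration, so the resulting means will differ from the values on the slide.

```python
import random
import statistics

MEAN_REMISSION = 5.0  # months, population mean from the scenario

def draw_sample(size, seed):
    """Draw `size` remission durations from an exponential distribution."""
    rng = random.Random(seed)  # seed is an arbitrary, illustrative choice
    # expovariate takes the rate (1 / mean), not the mean itself
    return [rng.expovariate(1.0 / MEAN_REMISSION) for _ in range(size)]

blue = draw_sample(100, seed=1)
red = draw_sample(100, seed=2)

print("Blue sample mean:", statistics.mean(blue))
print("Red sample mean:", statistics.mean(red))
```

Both samples come from the same population, yet their means differ from each other and from the true value of 5 months.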

Distribution of 1000 sample means (sample size: 10)

Distribution of Sample Means with Sample Size

The larger the sample size

The closer the distribution of sample means is to normal (Central Limit Theorem)
The narrower the spread of the distribution, and the more precise the estimate
The mean of the distribution does not change
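The effect of sample size on the distribution of sample means can be checked with a small simulation. The sketch below assumes the same exponential population (mean 5 months) and 1000 repeated samples per sample size; the function name `sample_means`, the seed, and the chosen sample sizes are illustrative assumptions.

```python
import random
import statistics

def sample_means(sample_size, n_samples=1000, mean=5.0, seed=0):
    """Return the means of `n_samples` random samples of a given size."""
    rng = random.Random(seed)  # arbitrary seed for repeatability
    return [
        statistics.fmean(rng.expovariate(1.0 / mean) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

# For each sample size, store (centre, spread) of the 1000 sample means
summary = {}
for size in (10, 100, 1000):
    means = sample_means(size)
    summary[size] = (statistics.fmean(means), statistics.stdev(means))
    print(f"size={size:5d}  centre={summary[size][0]:.3f}  spread={summary[size][1]:.3f}")
```

The centre stays close to 5 months for every sample size, while the spread shrinks roughly in proportion to one over the square root of the sample size.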

Requirements for sample size calculation for 1 sample

Minimum desired precision (Margin of error) (\(\Delta\))
Underlying standard deviation (\(sd\)) (measure of spread, may require pilot study)
Confidence level of the interval (95% CI, 90% CI, 99% CI)
For 95% CI

\[ n = \left( \frac{sd \times Z_{0.975}}{\Delta} \right)^{2}, \qquad Z_{0.975} = 1.96 \]
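
The formula follows from setting the half-width of the confidence interval for a mean equal to the margin of error and solving for \(n\). This is a short derivation that assumes the sample mean is approximately normally distributed:

\[ \Delta = Z_{0.975} \times \frac{sd}{\sqrt{n}} \;\Rightarrow\; \sqrt{n} = \frac{sd \times Z_{0.975}}{\Delta} \;\Rightarrow\; n = \left( \frac{sd \times Z_{0.975}}{\Delta} \right)^{2} \]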

Example

A survey is carried out to estimate the mean height of a population in a city
A pilot survey of 50 people gave an estimated standard deviation of 50 cm
We want to estimate the mean height with a margin of error of 10 cm
n = ((50 * 1.96) / 10)^2 = 96.04, which rounds up to 97
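
The same calculation can be wrapped in a small helper. This is a sketch using only the Python standard library; the function name `sample_size_one_mean` and its parameter names are illustrative assumptions, and the result is rounded up because a fractional participant is not possible.

```python
import math
from statistics import NormalDist

def sample_size_one_mean(sd, margin, confidence=0.95):
    """Sample size needed to estimate a mean within +/- `margin`.

    Returns the unrounded value and the value rounded up.
    """
    # Two-sided critical value, e.g. about 1.96 for a 95% CI
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    raw = (sd * z / margin) ** 2
    return raw, math.ceil(raw)

raw_n, n = sample_size_one_mean(sd=50, margin=10)
print(f"unrounded n = {raw_n:.2f}, required n = {n}")

# A stricter confidence level needs a larger sample
raw_n99, n99 = sample_size_one_mean(sd=50, margin=10, confidence=0.99)
print(f"99% CI: required n = {n99}")
```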

Simulation for difference between two populations

Scenario

A new chemotherapy drug (Drug 2) is being tested against the standard of care (Drug 1)
The outcome of interest is the duration of remission
Say the duration of remission for the standard of care (Drug 1) is exponentially distributed with a mean of 5 months
The duration of remission for the new drug (Drug 2) is exponentially distributed with a mean of 10 months

Assessing difference between Drug 1 and Drug 2

Difference in means between Drug 2 and Drug 1 (Drug 2 - Drug 1)
Drug 2 is better by 5 months than Drug 1 (A VALUE WHICH IS NOT KNOWN IN REAL LIFE)
We define a clinically significant difference between Drug 1 and Drug 2 as 3 months (POPULATION CHARACTERISTICS)

We will use the difference in mean duration of remission between the two drugs as the measure of their difference in performance.

As stated earlier, we know that Drug 2 is better than Drug 1 by 5 months. We have to remember that we would not know this in real life.

Before proceeding further, we, the investigators, need to decide on the clinically significant difference between Drug 1 and Drug 2 (between the populations, not the samples). That is, we have to decide on the minimum difference between the two drugs that is clinically meaningful. Setting this value is the first requirement for sample size calculation. It is also the most controversial part of sample size determination, because the clinically significant difference varies between clinicians and situations.

But this is a limitation we have to work with.

For the simulation, we set this difference at 3 months.

Simulation

Population under Null hypothesis: difference = 0 (5m, 5m)
Population under the clinically significant difference hypothesis (the difference equals the minimum clinically significant difference): difference = 3 (5m, 8m)

Sample size = 10

Green dashed line: sample mean
Red line: 97.5th percentile of the null population (boundary of the region of rejection)

Sample in the region of rejection: we conclude that the sample does not belong to the null population
Assuming that the null population is the truth, the probability of wrongly concluding that the sample does not belong to the null population is the Type I error (Alpha) (5%)

Upper panel: population with the minimum clinically significant difference
Assuming that the above population is the truth, the probability of correctly inferring that the sample belongs to it is the POWER (80%)

Assuming that the null population is the truth, the probability of obtaining a sample as extreme as or more extreme than the present sample is the p value
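
The quantities defined above can be estimated by simulation. The sketch below is one possible setup, not necessarily the one behind the slides: it assumes the test statistic is the difference in sample means (Drug 2 - Drug 1), 10 patients per arm, 2000 simulated studies, an upper cutoff at the empirical 97.5th percentile of the null differences, and a one-sided p value. The names `mean_diffs`, `N_PER_ARM` and `N_SIM` are illustrative.

```python
import random
import statistics

def mean_diffs(n_per_arm, mean1, mean2, n_sim, rng):
    """Simulate n_sim differences in sample means (Drug 2 - Drug 1)."""
    diffs = []
    for _ in range(n_sim):
        arm1 = statistics.fmean(rng.expovariate(1.0 / mean1) for _ in range(n_per_arm))
        arm2 = statistics.fmean(rng.expovariate(1.0 / mean2) for _ in range(n_per_arm))
        diffs.append(arm2 - arm1)
    return diffs

rng = random.Random(42)  # arbitrary seed
N_PER_ARM = 10           # assumed sample size per arm
N_SIM = 2000             # assumed number of simulated studies

# Null population: difference = 0 (5m, 5m)
null_diffs = sorted(mean_diffs(N_PER_ARM, 5.0, 5.0, N_SIM, rng))
# Clinically significant difference population: difference = 3 (5m, 8m)
alt_diffs = mean_diffs(N_PER_ARM, 5.0, 8.0, N_SIM, rng)

# Boundary of the region of rejection: empirical 97.5th percentile of the null
cutoff = null_diffs[int(0.975 * N_SIM)]

# Power: share of studies from the alternative that land in the rejection region
power = sum(d > cutoff for d in alt_diffs) / N_SIM

# One-sided p value for a single observed difference, judged against the null
observed = alt_diffs[0]
p_value = sum(d >= observed for d in null_diffs) / N_SIM

print(f"cutoff = {cutoff:.2f}, power = {power:.2f}, p value = {p_value:.3f}")
```

With only 10 patients per arm, the estimated power is well below the 80% that is usually targeted.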

Sample size = 100

The p value drops to a so-called statistically significant level just by increasing the sample size
- FALLACY of P VALUE: Some Other Day!!
The power of the study increases as the sample size increases
Estimates of the population parameter become more precise as the sample size increases
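
The growth of power with sample size can be approximated without simulation. The sketch below swaps in a normal approximation for the difference in sample means, which is an assumption and is least accurate for small samples from a skewed distribution. It uses the scenario's means (5 and 8 months) and the fact that the standard deviation of an exponential distribution equals its mean; the function name `approx_power` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(n_per_arm, mean1=5.0, mean2=8.0, alpha=0.05):
    """Normal-approximation power for the difference in means (Drug 2 - Drug 1)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # For an exponential distribution the standard deviation equals the mean
    se_null = sqrt(2 * mean1 ** 2 / n_per_arm)
    se_alt = sqrt((mean1 ** 2 + mean2 ** 2) / n_per_arm)
    cutoff = z_crit * se_null  # upper boundary of the region of rejection
    # Probability that a study from the alternative exceeds the cutoff
    return 1 - std_normal.cdf((cutoff - (mean2 - mean1)) / se_alt)

powers = {n: approx_power(n) for n in (10, 100, 1000)}
for n, p in powers.items():
    print(f"n per arm = {n:4d}: approximate power = {p:.3f}")
```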

Sample size = 1000

PS Power and Sample Size Calculator (Vanderbilt University)

Download …

Downloadable from http://biostat.mc.vanderbilt.edu/wiki/Main/PowerSampleSize as pssetup3.exe file

Cite the package …

Dupont WD, Plummer WD: ‘Power and Sample Size Calculations: A Review and Computer Program’, Controlled Clinical Trials 1990; 11: 116-28

Dupont WD, Plummer WD: ‘Power and Sample Size Calculations for Studies Involving Linear Regression’, Controlled Clinical Trials 1998; 19: 589-601

Slides can be obtained from …

https://sumprain.netlify.com/files/html/sample_ahrr/presentation_ahrr.html