Introduction to R package copulaSim

What is the usage of the R package copulaSim?

This package serves to simulate jointly distributed patient-level data from historical data based on the copula invariance property.

Why do we need to perform virtual patient simulation?

To consistently optimize clinical trial designs and data analysis methods through trial simulation, we need to simulate multivariate mixed-type virtual patient data independent of designs and analysis methods under evaluation. To make the outcome of optimization more realistic, we should utilize relevant empirical patient-level data when it is available.

The challenges faced when simulating small data

When simulating small empirical data, the underlying marginal distributions and their dependence structure cannot be understood or verified thoroughly due to the limited sample size.

Solution: copula invariance property

To resolve this issue, we use the copula invariance property to generate the joint distribution without making a strong parametric assumption. The theoretical background is addressed below.

Sklar’s Theorem

The idea of copula was first introduced by Dr. Abe Sklar in 1959 in the field of statistics. He proposed a theorem, which is later called Sklar’s theorem. This theorem essentially consists of two parts. First, the copula function can be used to describe the relationship between the joint and marginal distributions. This function assigns the value of joint distribution to each ordered pair of values of marginal distributions. That is, the coupla function maps the range of joint distribution from a d-dimensional ball to a unit line. The second part is that the copula function can be uniquely determined for every joint distribution.

Why can the copula function be employed to generate joint distribution?

Each joint density can be viewed as the product of marginal densities multiplied by copula density. The copula density, which is defined as the partial derivative of the copula function, contains all the information about the dependence structure of the joint distribution. As a result, the joint distribution can be flexibly constructed by copula dependency and marginal distributions.

The implementation of copula into a new R package

To share this finding with the community, we have implemented the copula algorithm into a new R package entitled copulaSim. The copulaSim package is designed to perform virtual patient simulation. The idea of the copula simulation algorithm is given in the following. Based on the copula invariance property, the dependence structure of the joint distribution can be well preserved when performing quantile transformation. Because of this feature, the copula simulation algorithm allows for the simulated data to resemble the empirical data.

The demonstration of R Package copulaSim

Generate empirical data:

Assume that the single-arm, 5-dimensional empirical data follows multivariate normal data, and let the sample size of empirical data be 30.

library(mvtnorm)
arm1 <- rmvnorm(n = 30, mean  = rep(10, 5), sigma = diag(5) + 0.5)
test_data <- as.data.frame(cbind(1:30, rep(1, 30), arm1))
colnames(test_data) <- c("id","arm",paste0("time_", 1:5))
knitr::kable((test_data), "simple")
id arm time_1 time_2 time_3 time_4 time_5
1 1 9.847417 8.718636 10.096511 10.074992 9.545107
2 1 10.861423 11.257343 10.432825 10.054927 8.963653
3 1 10.807233 9.533325 8.910878 10.043752 8.231493
4 1 9.944575 7.391258 10.430159 8.020573 9.338640
5 1 12.317052 7.514527 11.017105 9.689028 9.296219
6 1 12.399677 9.376767 9.570212 9.622131 9.609292
7 1 7.690368 9.698433 8.508525 8.392751 9.043848
8 1 9.995485 10.136553 11.199419 11.814768 8.126455
9 1 10.228329 8.913702 8.603700 7.914687 9.970727
10 1 12.377492 9.020758 9.103875 9.501413 10.878960
11 1 10.455926 11.327639 9.999410 10.806695 9.204420
12 1 9.186798 9.475803 8.011086 9.973337 7.826552
13 1 11.119150 10.106095 9.783354 10.090354 10.011098
14 1 8.579672 9.405365 10.202102 8.739225 7.787366
15 1 9.620858 10.399574 10.741399 7.586905 7.421299
16 1 11.116086 10.556649 10.078298 12.036323 10.054633
17 1 9.050256 8.920143 9.326882 9.419048 8.904098
18 1 9.466838 10.338584 11.499958 9.328286 9.353604
19 1 11.496138 7.530536 10.145431 10.769298 10.207179
20 1 8.691839 9.325801 9.670741 9.906605 10.246320
21 1 8.758001 7.138285 8.143479 9.558700 8.210306
22 1 10.201826 11.150467 11.044535 12.623122 9.648970
23 1 11.531771 12.015741 10.798337 9.651560 10.375046
24 1 9.813097 9.588711 10.314587 10.047077 10.971630
25 1 9.232211 7.823685 9.054518 8.603881 10.146230
26 1 12.013102 9.561857 10.337676 11.047072 12.312474
27 1 9.321162 9.206222 11.094687 10.586561 9.650932
28 1 10.615848 10.521042 10.158529 12.258805 13.114965
29 1 9.072691 9.744249 11.053102 12.172192 10.250946
30 1 7.460766 11.229696 10.972728 9.767766 9.688565

Load package copulaSim

library(copulaSim)

Use function copula.sim to generate ONE simulated dataset for the emperical data

Argument Definition Assigned Value
data.input The empirical data test_data[,-c(1,2)]
id.vec ID fo individual patient in the input data test_data$id
arm.vec The column to identify the arm in clinical trial test_data$arm
n.patient The targeted number of patients in each simulated dataset 50
n.simulation The number of simulated datasets 1
seed The random seed to reproduce the simulation study 2022
validation.type Specify hypothesis test to detect the difference between empirical data and simulated data “energy”
verbose Whether to print message for simulation process or not TRUE
  • As a means of avoiding extreme sampling results when performing one simulated dataset, it is advised to use “energy” or “ball” in the argument validation.type. The purpose of doing this is to perform data validation, which ensures the similarity between empirical data and the simulated data based on 2-sample test.

  • Below shows the 2-sample test result, which indicates that the joint distribution of the simulated data is not significantly different from the empirical data.

## Generate 1 simulated dataset
simu_S1 <- copula.sim(data.input = test_data[,-c(1,2)], 
                      id.vec = test_data$id, 
                      arm.vec = test_data$arm,
                      n.patient = 50 , 
                      n.simulation = 1, 
                      seed = 2022, 
                      validation.type = "energy",
                      verbose = TRUE)
## Simulate 1th Dataset
## p.value for energy test: 0.8470
## Compelete simulating 1th Dataset in 0.127 seconds
## Obtain the simulated long-form dataset
simu_S1$data.simul
## # A tibble: 250 × 6
##       id   arm col.num col.name data.sim sim.id
##    <dbl> <dbl>   <dbl> <chr>       <dbl>  <int>
##  1     1     1       1 time_1      10.2       1
##  2     2     1       1 time_1       8.61      1
##  3     3     1       1 time_1      11.1       1
##  4     4     1       1 time_1      11.8       1
##  5     5     1       1 time_1       9.31      1
##  6     6     1       1 time_1       9.93      1
##  7     7     1       1 time_1      12.1       1
##  8     8     1       1 time_1      10.8       1
##  9     9     1       1 time_1       8.61      1
## 10    10     1       1 time_1       9.82      1
## # ℹ 240 more rows
  • Also, from the marginal density plot, we can see that the copula simulated data is close to the empirical data.
library(dplyr)
## Obtain the empirical long-form dataset
empir <- simu_S1$data.input.long %>% mutate(cate = "empirical_n30") %>% rename(data = data.input)

## Produce the marginal density plot
simul <- simu_S1$data.simul %>%  mutate(cate = "copulaSim_n50") %>% 
         rename(data = data.sim) %>% select(-sim.id)  
library(ggplot2)
rbind(empir, simul) %>% filter(grepl('time', col.name)) %>%
  ggplot(aes(x = data, color = cate, fill = cate)) +
  facet_wrap(.~col.name, ncol = 5) +
  geom_density(alpha = 0.001, size = 1)

Use function extract.data.sim to convert simulated data into a list of wide-form matrices

## Converting the long-form simulated dataset to wide-form 
simu.wide <- extract.data.sim(simu_S1)
simu.wide
## $`sim.id=1`
## $`sim.id=1`$`arm=1`
##          time_1    time_2    time_3    time_4    time_5
##  [1,] 10.151302 11.262896 11.179125 12.336607 10.755550
##  [2,]  8.611930 11.315514 11.047152  8.561220  7.506208
##  [3,] 11.126120 11.577182 11.281983 12.611838 10.208543
##  [4,] 11.752735  9.988710 10.556013 10.078319 13.065829
##  [5,]  9.310427  7.443341 10.430644  9.985622  9.999425
##  [6,]  9.928467  7.545930  9.074725  9.999663 10.257611
##  [7,] 12.061901 10.388941 10.011796 10.701000 10.952909
##  [8,] 10.837329  7.520669  8.066801  9.329762  9.596770
##  [9,]  8.607893 10.033156 10.347485 10.462500  8.356089
## [10,]  9.817994 10.042937  8.784328  9.639264 10.249570
## [11,]  9.831326  7.420905 10.871118  8.206978 10.934924
## [12,] 10.904633  9.363302 10.021266  7.767827  7.793121
## [13,]  9.475139 10.077049 11.001252  8.484578  9.309441
## [14,] 12.363950 10.386280 11.160094 12.081534 10.244775
## [15,] 10.813434  9.566561  8.773451  7.975407 10.113657
## [16,]  9.974961  7.466789  8.384725  8.277364  9.975148
## [17,]  8.747835  7.514959  8.552011  7.887507  7.529547
## [18,]  8.859644  7.266393  8.915675  7.742814  7.527039
## [19,] 11.118052  7.519730 11.051283  8.193885  9.045561
## [20,]  8.512230  9.403146  9.703764  7.945986  7.822941
## [21,] 10.818910  9.376858 10.082516 11.843499 10.787450
## [22,]  7.643678  7.916587 10.130554 10.696183  9.084092
## [23,]  9.159416 10.128242 11.049562 10.787312  9.481023
## [24,] 10.807697 11.202243  9.856331  9.814826  9.190226
## [25,]  8.913029 10.024658 11.057402  9.948391  9.672249
## [26,] 10.435270  8.926810  8.063619  8.617271  9.650825
## [27,] 10.312298  9.567646  8.193035  9.308108  9.216468
## [28,] 10.634116  9.316347 10.243001 12.050085 10.461273
## [29,] 11.117572  9.189028 10.431482 11.631533 10.356882
## [30,]  7.854881  9.570265 10.887115 12.013592  9.008375
## [31,] 11.116546 11.076438 11.068651  9.777916  8.080101
## [32,] 10.293340 11.113292  9.020730  9.642551  9.338996
## [33,] 11.483485  7.651354  9.091693  9.571358 10.031431
## [34,]  9.706704 10.131359 10.335925 10.764851 10.965803
## [35,]  9.129314  9.537628 11.426575 10.870881 10.005800
## [36,] 12.330935  7.578337 10.026179  8.736298 10.020030
## [37,] 10.706147  7.515011 10.182795  8.639284  8.696482
## [38,]  9.058985  8.831833 10.134090  9.762597 10.906351
## [39,] 10.765721  9.489975 10.308114  8.594825 10.020310
## [40,]  9.320699  8.915188 11.009409  8.509022  9.679105
## [41,]  9.834014 10.425242 10.769724 10.065575  9.604299
## [42,]  8.714811  9.325008 10.332542  8.938965 10.238226
## [43,]  9.645307 11.255328  9.315267  8.630444  7.805637
## [44,]  9.847089  7.450173 10.678122  9.537576  9.211284
## [45,]  9.302916  7.188245 10.072721  8.559050  9.575030
## [46,]  7.885152  7.571727 10.232449  9.856905  9.451851
## [47,] 12.035023 10.357907 10.347732  8.733867  8.229622
## [48,]  9.605216 10.550296  8.620983  9.578498  9.035317
## [49,]  9.849406  9.572772  8.574798  8.482307  8.961818
## [50,] 10.949633  9.534782 11.050449  9.751648  9.270638

Use function compare.copula.sim to perform the comparison between empirical data and multiple simulated datasets.

  • To increase the data variability, we suggest that the data validation option should be turned off when producing multiple simulated datasets. That is, the argument for validation.type should be set to “none”.
## Generate 100 simulated datasets
simu_S100 <- copula.sim(data.input = test_data[,-c(1,2)], 
                         id.vec = test_data$id, 
                         arm.vec = test_data$arm,
                         n.patient = 50 , 
                         n.simulation = 100, 
                         seed = 2022, 
                         validation.type = "none",
                         verbose = FALSE)

## Compare the marginal mean via the function compare.copula.sim 
compare <- compare.copula.sim(simu_S100)
knitr::kable(compare$mean.comparison, "simple")
marginal.name arm empir.sample simu.sample n.simu empir.mean simu.mean simu.mean.low.lim simu.mean.upp.lim simu.mean.RB simu.mean.SB simu.mean.RMSE empir.sd simu.sd
time_1 1 30 50 100 10.1091 10.1341 9.8429 10.4292 0.0025 0.1029 0.1552 1.3215 1.2334
time_2 1 30 50 100 9.5642 9.5646 9.2725 9.8805 0.0000 0.1146 0.1556 1.2488 1.1450
time_3 1 30 50 100 10.0101 9.9984 9.7649 10.2065 -0.0012 0.0993 0.1142 0.9504 0.8927
time_4 1 30 50 100 10.0034 10.0019 9.6969 10.3098 -0.0002 0.1128 0.1655 1.2961 1.2064
time_5 1 30 50 100 9.6130 9.5913 9.3466 9.8589 -0.0023 0.1063 0.1539 1.2400 1.0976

Employ function new.arm.copula.sim to simulate new multivariate datasets with shifted mean vector from existing empirical data.

  • This function assumes that the dependence structure of the new-arm data and that of the empirical data are the same.
## Generate Empirical Data 
##  Assume that the single-arm, 3-dimensional empirical data follows multivariate normal data
arm1 <- rmvnorm(n = 80, mean  = c(10,10.5,11), sigma = diag(3) + 0.5)
test_data2 <- as.data.frame(cbind(1:80, rep(1,80), arm1))
colnames(test_data2) <- c("id", "arm", paste0("time_", 1:3))

## Generate 1 simulated datasets with one empirical arm and two new-arms
## The mean difference between empirical arm and 
## (i) the 1st new arm is assumed to be 2.5, 2.55, and 2.6 at each time point
## (ii) the 2nd new arm is assumed to be 4.5, 4.55, and 4.6 at each time point

newARM <- new.arm.copula.sim(data.input = test_data2[,-c(1,2)], 
                             id.vec = test_data2$id,
                             arm.vec = test_data2$arm, 
                             n.patient = 100 , 
                             n.simulation = 1, 
                             seed = 2022,
                             shift.vec.list = list(c(2.5,2.55,2.6), c(4.5,4.55,4.6)),
                             verbose = FALSE)

## Obtain the simulated long-form dataset
newARM$data.simul
## # A tibble: 900 × 6
##       id   arm col.num col.name data.sim sim.id
##    <dbl> <dbl>   <dbl> <chr>       <dbl>  <int>
##  1     1     1       1 time_1       9.63      1
##  2     2     1       1 time_1       9.65      1
##  3     3     1       1 time_1       8.24      1
##  4     4     1       1 time_1      12.1       1
##  5     5     1       1 time_1       9.12      1
##  6     6     1       1 time_1      10.0       1
##  7     7     1       1 time_1       8.56      1
##  8     8     1       1 time_1      10.6       1
##  9     9     1       1 time_1      10.5       1
## 10    10     1       1 time_1       8.48      1
## # ℹ 890 more rows
## Verify the mean difference
newARM$data.simul %>%  
  group_by(.data$arm, .data$col.num) %>%
  summarise(N = n(), Mean = mean(.data$data.sim), SD = sd(.data$data.sim))
## # A tibble: 9 × 5
## # Groups:   arm [3]
##     arm col.num     N  Mean    SD
##   <dbl>   <dbl> <int> <dbl> <dbl>
## 1     1       1   100  9.71 1.01 
## 2     1       2   100 10.2  1.26 
## 3     1       3   100 10.8  1.23 
## 4     2       1   100 12.0  0.975
## 5     2       2   100 12.7  1.30 
## 6     2       3   100 13.3  1.14 
## 7     3       1   100 14.0  0.975
## 8     3       2   100 14.7  1.30 
## 9     3       3   100 15.3  1.14
- Authored by Pei-Shan Yen
- CRAN page: https://CRAN.R-project.org/package=copulaSim
- github page: https://github.com/psyen0824/copulaSim

Acknowledgement

This research project and the development of the R package are supported by AbbVie Experiential Internship Program. I am also grateful to Dr. Xuemin Gu, Dr. Jenny Jiao, and Dr. Jane Zhang at the Eyecare Clinical Statistics Team for valuable comments on this work.