Introduction to R package copulaSim

What is the usage of the R package copulaSim?

This package serves to simulate jointly distributed patient-level data from historical data based on the copula invariance property.

Why do we need to perform virtual patient simulation?

To consistently optimize clinical trial designs and data analysis methods through trial simulation, we need to simulate multivariate mixed-type virtual patient data independent of designs and analysis methods under evaluation. To make the outcome of optimization more realistic, we should utilize relevant empirical patient-level data when it is available.

The challenges faced when simulating small data

When simulating small empirical data, the underlying marginal distributions and their dependence structure cannot be understood or verified thoroughly due to the limited sample size.

Solution: copula invariance property

To resolve this issue, we use the copula invariance property to generate the joint distribution without making a strong parametric assumption. The theoretical background is addressed below.

Sklar’s Theorem

The idea of copula was first introduced by Dr. Abe Sklar in 1959 in the field of statistics. He proposed a theorem, which is later called Sklar’s theorem. This theorem essentially consists of two parts. First, the copula function can be used to describe the relationship between the joint and marginal distributions. This function assigns the value of joint distribution to each ordered pair of values of marginal distributions. That is, the coupla function maps the range of joint distribution from a d-dimensional ball to a unit line. The second part is that the copula function can be uniquely determined for every joint distribution.

Why can the copula function be employed to generate joint distribution?

Each joint density can be viewed as the product of marginal densities multiplied by copula density. The copula density, which is defined as the partial derivative of the copula function, contains all the information about the dependence structure of the joint distribution. As a result, the joint distribution can be flexibly constructed by copula dependency and marginal distributions.

The implementation of copula into a new R package

To share this finding with the community, we have implemented the copula algorithm into a new R package entitled copulaSim. The copulaSim package is designed to perform virtual patient simulation. The idea of the copula simulation algorithm is given in the following. Based on the copula invariance property, the dependence structure of the joint distribution can be well preserved when performing quantile transformation. Because of this feature, the copula simulation algorithm allows for the simulated data to resemble the empirical data.

The demonstration of R Package copulaSim

Generate empirical data:

Assume that the single-arm, 5-dimensional empirical data follows multivariate normal data, and let the sample size of empirical data be 30.

library(mvtnorm)
arm1 <- rmvnorm(n = 30, mean  = rep(10, 5), sigma = diag(5) + 0.5)
test_data <- as.data.frame(cbind(1:30, rep(1, 30), arm1))
colnames(test_data) <- c("id","arm",paste0("time_", 1:5))
knitr::kable((test_data), "simple")

id	arm	time_1	time_2	time_3	time_4	time_5
1	1	7.004179	10.858860	10.732748	9.393314	7.641030
2	1	9.069737	8.255120	8.550848	8.754447	8.005901
3	1	10.671620	9.684173	9.647934	9.592275	10.423479
4	1	11.502250	8.949238	9.354255	8.014124	9.544719
5	1	7.971693	8.606493	9.159820	10.513604	10.976158
6	1	7.802940	8.729765	7.287771	9.027088	9.389086
7	1	10.413267	11.509154	10.582584	8.953625	8.848974
8	1	11.160479	10.064774	11.178553	9.642552	9.165252
9	1	10.145421	11.185889	8.999335	9.407854	9.218580
10	1	11.693722	11.119677	9.873623	10.489826	11.099211
11	1	9.600325	9.574549	10.365281	10.710560	8.262470
12	1	8.316114	10.329009	10.364516	9.620803	8.520266
13	1	9.577054	10.192527	10.880188	10.536656	10.300502
14	1	7.815416	9.940175	9.321676	10.759877	8.904190
15	1	10.764114	10.259471	11.303855	10.150500	10.120862
16	1	9.673978	9.622582	9.505462	10.479923	8.190353
17	1	10.755589	9.590768	8.683669	11.625726	9.955030
18	1	12.188390	10.830433	12.433515	10.361161	11.424058
19	1	9.007566	10.059487	11.248720	9.160229	11.780805
20	1	9.148672	8.796984	10.821739	9.394521	9.586339
21	1	10.151003	9.781009	8.802328	9.197591	8.898786
22	1	12.839315	10.061330	10.820462	14.424335	10.992401
23	1	9.783661	11.947356	10.820921	12.360630	11.322764
24	1	8.860439	7.918311	8.462041	8.411107	9.881564
25	1	9.945831	8.506570	12.413710	9.686724	10.908293
26	1	9.491668	11.541358	11.920329	10.045451	9.850341
27	1	8.274242	9.214772	9.126962	11.029209	8.620990
28	1	10.927147	10.389272	11.266542	8.422165	9.958860
29	1	9.532406	9.926696	10.863276	10.790423	9.940318
30	1	6.685534	9.560129	7.911584	7.698771	7.709009

Load package copulaSim

library(copulaSim)

Use function copula.sim to generate ONE simulated dataset for the emperical data

Argument	Definition	Assigned Value
data.input	The empirical data	test_data[,-c(1,2)]
id.vec	ID fo individual patient in the input data	test_data$id
arm.vec	The column to identify the arm in clinical trial	test_data$arm
n.patient	The targeted number of patients in each simulated dataset	50
n.simulation	The number of simulated datasets	1
seed	The random seed to reproduce the simulation study	2022
validation.type	Specify hypothesis test to detect the difference between empirical data and simulated data	“energy”
verbose	Whether to print message for simulation process or not	TRUE

As a means of avoiding extreme sampling results when performing one simulated dataset, it is advised to use “energy” or “ball” in the argument validation.type. The purpose of doing this is to perform data validation, which ensures the similarity between empirical data and the simulated data based on 2-sample test.
Below shows the 2-sample test result, which indicates that the joint distribution of the simulated data is not significantly different from the empirical data.

## Generate 1 simulated dataset
simu_S1 <- copula.sim(data.input = test_data[,-c(1,2)], 
                      id.vec = test_data$id, 
                      arm.vec = test_data$arm,
                      n.patient = 50 , 
                      n.simulation = 1, 
                      seed = 2022, 
                      validation.type = "energy",
                      verbose = TRUE)

## Simulate 1th Dataset

## p.value for energy test: 0.8600
## Compelete simulating 1th Dataset in 0.084 seconds

## Obtain the simulated long-form dataset
simu_S1$data.simul

## # A tibble: 250 × 6
##       id   arm col.num col.name data.sim sim.id
##    <dbl> <dbl>   <dbl> <chr>       <dbl>  <int>
##  1     1     1       1 time_1      10.6       1
##  2     2     1       1 time_1       7.91      1
##  3     3     1       1 time_1      11.5       1
##  4     4     1       1 time_1      11.7       1
##  5     5     1       1 time_1       9.06      1
##  6     6     1       1 time_1       9.53      1
##  7     7     1       1 time_1      11.6       1
##  8     8     1       1 time_1       9.67      1
##  9     9     1       1 time_1       7.81      1
## 10    10     1       1 time_1       9.51      1
## # ℹ 240 more rows

Also, from the marginal density plot, we can see that the copula simulated data is close to the empirical data.

library(dplyr)
## Obtain the empirical long-form dataset
empir <- simu_S1$data.input.long %>% mutate(cate = "empirical_n30") %>% rename(data = data.input)

## Produce the marginal density plot
simul <- simu_S1$data.simul %>%  mutate(cate = "copulaSim_n50") %>% 
         rename(data = data.sim) %>% select(-sim.id)  
library(ggplot2)
rbind(empir, simul) %>% filter(grepl('time', col.name)) %>%
  ggplot(aes(x = data, color = cate, fill = cate)) +
  facet_wrap(.~col.name, ncol = 5) +
  geom_density(alpha = 0.001, size = 1)

Use function extract.data.sim to convert simulated data into a list of wide-form matrices

## Converting the long-form simulated dataset to wide-form 
simu.wide <- extract.data.sim(simu_S1)
simu.wide

## $`sim.id=1`
## $`sim.id=1`$`arm=1`
##          time_1    time_2    time_3    time_4    time_5
##  [1,] 10.646030 11.508598 12.383159 12.344778 11.196392
##  [2,]  7.906921 11.530535 10.868472  8.613763  7.672538
##  [3,] 11.488890 11.683551 12.418972 14.351649 10.933665
##  [4,] 11.674039 10.278188 11.290460 10.209412 11.763459
##  [5,]  9.058024  8.349815 10.766467  9.847295  9.965676
##  [6,]  9.525045  8.589647  9.051937  9.895395 10.591442
##  [7,] 11.565107 10.669053 10.674988 10.516370 11.226897
##  [8,]  9.666971  8.547415  7.711234  9.158491  9.188532
##  [9,]  7.810495 10.060665  9.870426 10.512083  8.407317
## [10,]  9.508795 10.073492  8.874052  9.394347 10.368984
## [11,]  9.623987  8.421433 11.241604  8.411708 11.364956
## [12,] 10.154517  9.761087  9.504706  7.906412  7.971246
## [13,]  9.552278 10.240774 11.196153  8.422030  9.550802
## [14,] 12.408001 10.846784 12.417849 11.153370 11.032656
## [15,] 10.149460 10.060019  9.071214  8.167252 10.117872
## [16,]  9.381862  8.421866  8.488951  8.420846  9.878377
## [17,]  7.577646  8.434464  7.909029  8.125325  7.655523
## [18,]  7.796030  8.074995  8.316483  7.921764  7.659268
## [19,] 10.763164  8.692411 11.290970  8.417913  9.538153
## [20,]  7.491876  9.616625  9.001531  8.141072  7.948432
## [21,] 10.425955  9.618459 10.364864 10.819036 10.999410
## [22,]  6.910126  8.570361  9.150830 10.562771  8.681986
## [23,]  9.087944 10.189729 11.238413 10.568345  9.581249
## [24,] 10.323836 11.172171  9.580513  9.628769  8.903126
## [25,]  8.895780 10.098067 11.267221  9.631732  9.956042
## [26,]  9.579709  9.193831  7.756245  8.839966  9.437998
## [27,]  9.589272  9.939231  8.209913  9.083260  8.811662
## [28,] 10.257380  9.587075 10.786034 11.102677 10.987084
## [29,] 10.763390  9.582981 10.864430 10.784030 10.999231
## [30,]  7.511624  9.732732 10.695382 11.031974  8.783793
## [31,] 10.854295 11.171149 11.296986  9.627058  8.542033
## [32,]  9.790780 11.064767  8.988945  9.402885  9.039851
## [33,] 10.742460  8.734623  9.159147  9.394254  9.958702
## [34,]  9.599377 10.208506 10.821526 10.512270 11.234424
## [35,]  9.499949  9.928644 12.423342 10.638786 10.433021
## [36,] 11.593060  8.776017 10.445761  8.974288 10.273611
## [37,] 10.038496  8.569351 10.364904  8.933078  8.865378
## [38,]  8.388515  8.794167 10.364737  9.589014 11.063180
## [39,] 10.532201  9.938724 10.821350  8.610536 10.281676
## [40,]  9.439596  9.201327 11.209017  8.472734 10.014539
## [41,]  9.642456 10.429374 10.860511 10.375274  9.806008
## [42,]  8.131413  9.581076 10.783445  8.845520 10.812484
## [43,]  9.187614 11.439001  9.011362  8.811182  7.821734
## [44,]  9.545616  8.432328 10.820747  9.393653  9.216290
## [45,]  8.825321  7.990494  9.386574  8.787410  9.577267
## [46,]  7.213790  8.536551  9.590313  9.636644  9.276305
## [47,] 11.474859 10.836771 10.821533  8.965041  8.864153
## [48,]  9.086116 10.832289  8.575551  9.393783  8.612259
## [49,]  9.252498  9.959260  8.546251  8.618660  8.592188
## [50,] 10.760235 10.037144 11.284739  9.617806  9.557708

Use function compare.copula.sim to perform the comparison between empirical data and multiple simulated datasets.

To increase the data variability, we suggest that the data validation option should be turned off when producing multiple simulated datasets. That is, the argument for validation.type should be set to “none”.

## Generate 100 simulated datasets
simu_S100 <- copula.sim(data.input = test_data[,-c(1,2)], 
                         id.vec = test_data$id, 
                         arm.vec = test_data$arm,
                         n.patient = 50 , 
                         n.simulation = 100, 
                         seed = 2022, 
                         validation.type = "none",
                         verbose = FALSE)

## Compare the marginal mean via the function compare.copula.sim 
compare <- compare.copula.sim(simu_S100)
knitr::kable(compare$mean.comparison, "simple")

marginal.name	arm	empir.sample	simu.sample	n.simu	empir.mean	simu.mean	simu.mean.low.lim	simu.mean.upp.lim	simu.mean.RB	simu.mean.SB	simu.mean.RMSE	empir.sd	simu.sd
time_1	1	30	50	100	9.6925	9.7038	9.3982	10.0230	0.0012	0.1000	0.1654	1.4797	1.3539
time_2	1	30	50	100	9.9002	9.9004	9.6645	10.1436	0.0000	0.1119	0.1275	1.0134	0.9315
time_3	1	30	50	100	10.0901	10.0756	9.7570	10.3681	-0.0014	0.1049	0.1579	1.3027	1.2065
time_4	1	30	50	100	9.9552	9.9177	9.6534	10.2080	-0.0038	0.1169	0.1642	1.3454	1.1496
time_5	1	30	50	100	9.6480	9.6452	9.3827	9.8853	-0.0003	0.1006	0.1451	1.1523	1.0804

Employ function new.arm.copula.sim to simulate new multivariate datasets with shifted mean vector from existing empirical data.

This function assumes that the dependence structure of the new-arm data and that of the empirical data are the same.

## Generate Empirical Data 
##  Assume that the single-arm, 3-dimensional empirical data follows multivariate normal data
arm1 <- rmvnorm(n = 80, mean  = c(10,10.5,11), sigma = diag(3) + 0.5)
test_data2 <- as.data.frame(cbind(1:80, rep(1,80), arm1))
colnames(test_data2) <- c("id", "arm", paste0("time_", 1:3))

## Generate 1 simulated datasets with one empirical arm and two new-arms
## The mean difference between empirical arm and 
## (i) the 1st new arm is assumed to be 2.5, 2.55, and 2.6 at each time point
## (ii) the 2nd new arm is assumed to be 4.5, 4.55, and 4.6 at each time point

newARM <- new.arm.copula.sim(data.input = test_data2[,-c(1,2)], 
                             id.vec = test_data2$id,
                             arm.vec = test_data2$arm, 
                             n.patient = 100 , 
                             n.simulation = 1, 
                             seed = 2022,
                             shift.vec.list = list(c(2.5,2.55,2.6), c(4.5,4.55,4.6)),
                             verbose = FALSE)

## Obtain the simulated long-form dataset
newARM$data.simul

## # A tibble: 900 × 6
##       id   arm col.num col.name data.sim sim.id
##    <dbl> <dbl>   <dbl> <chr>       <dbl>  <int>
##  1     1     1       1 time_1       9.63      1
##  2     2     1       1 time_1       9.65      1
##  3     3     1       1 time_1       8.24      1
##  4     4     1       1 time_1      12.1       1
##  5     5     1       1 time_1       9.12      1
##  6     6     1       1 time_1      10.0       1
##  7     7     1       1 time_1       8.56      1
##  8     8     1       1 time_1      10.6       1
##  9     9     1       1 time_1      10.5       1
## 10    10     1       1 time_1       8.48      1
## # ℹ 890 more rows

## Verify the mean difference
newARM$data.simul %>%  
  group_by(.data$arm, .data$col.num) %>%
  summarise(N = n(), Mean = mean(.data$data.sim), SD = sd(.data$data.sim))

## # A tibble: 9 × 5
## # Groups:   arm [3]
##     arm col.num     N  Mean    SD
##   <dbl>   <dbl> <int> <dbl> <dbl>
## 1     1       1   100  9.71 1.01 
## 2     1       2   100 10.2  1.26 
## 3     1       3   100 10.8  1.23 
## 4     2       1   100 12.0  0.975
## 5     2       2   100 12.7  1.30 
## 6     2       3   100 13.3  1.14 
## 7     3       1   100 14.0  0.975
## 8     3       2   100 14.7  1.30 
## 9     3       3   100 15.3  1.14

- Authored by Pei-Shan Yen

- CRAN page: https://CRAN.R-project.org/package=copulaSim

- github page: https://github.com/psyen0824/copulaSim

Acknowledgement

This research project and the development of the R package are supported by AbbVie Experiential Internship Program. I am also grateful to Dr. Xuemin Gu, Dr. Jenny Jiao, and Dr. Jane Zhang at the Eyecare Clinical Statistics Team for valuable comments on this work.