---
title: "Introduction to R package copulaSim"
author: "Pei-Shan Yen"
date: "2022-08-02"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to copulaSim}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

## What is the usage of the R package *copulaSim*?

This package serves to simulate jointly distributed patient-level data from historical data based on the copula invariance property.


## Why do we need to perform virtual patient simulation?

To consistently optimize clinical trial designs and data analysis methods through trial simulation, we need to simulate multivariate mixed-type virtual patient data independent of designs and analysis methods under evaluation. To make the outcome of optimization more realistic, we should utilize relevant empirical patient-level data when it is available. 

## The challenges faced when simulating small data
When simulating small empirical data, the underlying marginal distributions and their dependence structure cannot be understood or verified thoroughly due to the limited sample size. 

## Solution: copula invariance property 
To resolve this issue, we use the copula invariance property to generate the joint distribution without making a strong parametric assumption. The theoretical background is addressed below.

### Sklar's Theorem

The idea of [copula](https://en.wikipedia.org/wiki/Copula_(probability_theory)#:~:text=Sklar's) was first introduced by Dr. Abe Sklar in 1959 in the field of statistics. He proposed a theorem, which is later called Sklar's theorem. This theorem essentially consists of two parts. First, the copula function can be used to describe the relationship between the joint and marginal distributions. This function assigns the value of joint distribution to each ordered pair of values of marginal distributions. That is, the coupla function maps the range of joint distribution from a d-dimensional ball to a unit line. The second part is that the copula function can be uniquely determined for every joint distribution. 

### Why can the copula function be employed to generate joint distribution? 

Each joint density can be viewed as the product of marginal densities multiplied by copula density. The copula density, which is defined as the partial derivative of the copula function, contains all the information about the dependence structure of the joint distribution. As a result, the joint distribution can be flexibly constructed by copula dependency and marginal distributions.

### The implementation of copula into a new R package

To share this finding with the community, we have implemented the copula algorithm into a new R package entitled [*copulaSim*](https://CRAN.R-project.org/package=copulaSim). The copulaSim package is designed to perform virtual patient simulation. The idea of the copula simulation algorithm is given in the following. Based on the copula invariance property, the dependence structure of the joint distribution can be well preserved when performing quantile transformation. Because of this feature, the copula simulation algorithm allows for the simulated data to resemble the empirical data.    
 
![](https://scontent-ord5-1.xx.fbcdn.net/v/t39.30808-6/297369637_10228299537732851_2703531645042273957_n.jpg?_nc_cat=107&ccb=1-7&_nc_sid=730e14&_nc_ohc=6Mmh5OMMX88AX_bLnxn&_nc_ht=scontent-ord5-1.xx&oh=00_AT9LGv5Bjey1TKa3kjnJZV-kN1FG07QBiId-7S61MxstlA&oe=62EEEE1C)

## The demonstration of R Package *copulaSim*
## Generate empirical data: 
### Assume that the single-arm, 5-dimensional empirical data follows multivariate normal data, and let the sample size of empirical data be 30.

```{r, message=FALSE, warning=FALSE}
library(mvtnorm)
arm1 <- rmvnorm(n = 30, mean  = rep(10, 5), sigma = diag(5) + 0.5)
test_data <- as.data.frame(cbind(1:30, rep(1, 30), arm1))
colnames(test_data) <- c("id","arm",paste0("time_", 1:5))
knitr::kable((test_data), "simple")
```


## Load package copulaSim

```{r, message=FALSE, warning=FALSE}
library(copulaSim)
```


## Use function *copula.sim* to generate ONE simulated dataset for the emperical data

+-----------------+-----------------------------------------------------------+--------------------+
| Argument        | Definition                                                | Assigned Value     |
+=================+===========================================================+====================+
| data.input      | The empirical data                                        | test_data[,-c(1,2)]|
+-----------------+-----------------------------------------------------------+--------------------+
| id.vec          | ID fo individual patient in the input data                | test_data$id       |
+-----------------+-----------------------------------------------------------+--------------------+
| arm.vec         | The column to identify the arm in clinical trial          | test_data$arm      |
+-----------------+-----------------------------------------------------------+--------------------+
| n.patient       | The targeted number of patients in each simulated dataset | 50                 |
+-----------------+-----------------------------------------------------------+--------------------+
| n.simulation    | The number of simulated datasets                          | 1                  |
+-----------------+-----------------------------------------------------------+--------------------+
| seed            | The random seed to reproduce the simulation study         | 2022               |
+-----------------+-----------------------------------------------------------+--------------------+
| validation.type | Specify hypothesis test to detect the difference between  | "energy"           |
|                 | empirical data and simulated data                         |                    |
+-----------------+-----------------------------------------------------------+--------------------+
| verbose         | Whether to print message for simulation process or not    | TRUE               |
+-----------------+-----------------------------------------------------------+--------------------+

- As a means of avoiding extreme sampling results when performing one simulated dataset, it is advised to use "energy" or "ball" in the argument *validation.type*. The purpose of doing this is to perform data validation, which ensures the similarity between empirical data and the simulated data based on 2-sample test.

- Below shows the 2-sample test result, which indicates that the joint distribution of the simulated data is not significantly different from the empirical data. 

```{r, message=FALSE, warning=FALSE}
## Generate 1 simulated dataset
simu_S1 <- copula.sim(data.input = test_data[,-c(1,2)], 
                      id.vec = test_data$id, 
                      arm.vec = test_data$arm,
                      n.patient = 50 , 
                      n.simulation = 1, 
                      seed = 2022, 
                      validation.type = "energy",
                      verbose = TRUE)
## Obtain the simulated long-form dataset
simu_S1$data.simul
```

- Also, from the marginal density plot, we can see that the copula simulated data is close to the empirical data. 

```{r, message=FALSE, warning=FALSE, fig.dim = c(8, 3)}
library(dplyr)
## Obtain the empirical long-form dataset
empir <- simu_S1$data.input.long %>% mutate(cate = "empirical_n30") %>% rename(data = data.input)

## Produce the marginal density plot
simul <- simu_S1$data.simul %>%  mutate(cate = "copulaSim_n50") %>% 
         rename(data = data.sim) %>% select(-sim.id)  
library(ggplot2)
rbind(empir, simul) %>% filter(grepl('time', col.name)) %>%
  ggplot(aes(x = data, color = cate, fill = cate)) +
  facet_wrap(.~col.name, ncol = 5) +
  geom_density(alpha = 0.001, size = 1)
```


## Use function *extract.data.sim* to convert simulated data into a list of wide-form matrices

```{r, message=FALSE, warning=FALSE}
## Converting the long-form simulated dataset to wide-form 
simu.wide <- extract.data.sim(simu_S1)
simu.wide
```

## Use function *compare.copula.sim* to perform the comparison between empirical data and multiple simulated datasets.

- To increase the data variability, we suggest that the data validation option should be turned off when producing multiple simulated datasets. That is, the argument for validation.type should be set to "none". 

```{r, message=FALSE, warning=FALSE}
## Generate 100 simulated datasets
simu_S100 <- copula.sim(data.input = test_data[,-c(1,2)], 
                         id.vec = test_data$id, 
                         arm.vec = test_data$arm,
                         n.patient = 50 , 
                         n.simulation = 100, 
                         seed = 2022, 
                         validation.type = "none",
                         verbose = FALSE)

## Compare the marginal mean via the function compare.copula.sim 
compare <- compare.copula.sim(simu_S100)
knitr::kable(compare$mean.comparison, "simple")
```

## Employ function *new.arm.copula.sim* to simulate new multivariate datasets with shifted mean vector from existing empirical data. 

- This function assumes that the dependence structure of the new-arm data and that of the empirical data are the same.

```{r, message=FALSE, warning=FALSE}
## Generate Empirical Data 
##  Assume that the single-arm, 3-dimensional empirical data follows multivariate normal data
arm1 <- rmvnorm(n = 80, mean  = c(10,10.5,11), sigma = diag(3) + 0.5)
test_data2 <- as.data.frame(cbind(1:80, rep(1,80), arm1))
colnames(test_data2) <- c("id", "arm", paste0("time_", 1:3))

## Generate 1 simulated datasets with one empirical arm and two new-arms
## The mean difference between empirical arm and 
## (i) the 1st new arm is assumed to be 2.5, 2.55, and 2.6 at each time point
## (ii) the 2nd new arm is assumed to be 4.5, 4.55, and 4.6 at each time point

newARM <- new.arm.copula.sim(data.input = test_data2[,-c(1,2)], 
                             id.vec = test_data2$id,
                             arm.vec = test_data2$arm, 
                             n.patient = 100 , 
                             n.simulation = 1, 
                             seed = 2022,
                             shift.vec.list = list(c(2.5,2.55,2.6), c(4.5,4.55,4.6)),
                             verbose = FALSE)

## Obtain the simulated long-form dataset
newARM$data.simul

## Verify the mean difference
newARM$data.simul %>%  
  group_by(.data$arm, .data$col.num) %>%
  summarise(N = n(), Mean = mean(.data$data.sim), SD = sd(.data$data.sim))
```


--------------------
- Authored by Pei-Shan Yen
- CRAN page: https://CRAN.R-project.org/package=copulaSim
- github page: https://github.com/psyen0824/copulaSim
--------------------
## Acknowledgement
This research project and the development of the R package are supported by AbbVie Experiential Internship Program. I am also grateful to Dr. Xuemin Gu, Dr. Jenny Jiao, and Dr. Jane Zhang at the Eyecare Clinical Statistics Team for valuable comments on this work.