Skip to contents

In the electronic supplementary material of our paper, there is a simulation study where we discussed the different effects when removing registers in the estimation of our model. The link and the info for the paper follows:

  • Mussino, E., Santos, B., Monti, A. et al. Multiple systems estimation for studying over-coverage and its heterogeneity in population registers. Quality & Quantity (2023). https://doi.org/10.1007/s11135-023-01757-x

Here we discuss the different steps of the simulation study and show how to reproduce the experiment.

Initial Setup

For this simultation, we need to generate new population according to the same rules for every replication of the study. In the same way, we create 3 lists, \(X\), \(Y\) and \(Z\), where the probability of being observed always follows the same function. For this experiment, we create 3 binary variables that are used separately in the probability function in each one of the lists.

The initial step is to generate the population, where we create one continuous and one categorical variable, although they will not be used for the study. The function just needs to create at least one variable of these types.

library(overcoverage)
main_list <- create_population(size = 1e6,
                                 n_bin_var = 3, 
                                 n_cont_var = 1,
                                 n_cat_var = 1,
                                 c(0.5), 
                                 c("bin1", "bin2", "bin3"))

We assume that individuals will leave the country according to a logistic model and using bin1 variable as the covariate to write this probability. Using the create_presences function from our package and considering only 2 years of observation, we have the following:

true_presences <- create_presences(main_list, 
                                     formula_phi = ~ bin1,
                                     coef_values = c(2, -1),
                                     varying_arrival = TRUE,
                                     years = 2)

If \(B_1\) represents bin1 and \(\phi\) is the probability of staying in the country (usually known as surviving, in the capture-recapture literature) each year, we considering the following logistic regression to calculate those probabilities

\[\log \begin{pmatrix}\frac{\phi}{1 - \phi} \end{pmatrix} = 2 - B_1,\]

With this equation, we are creating two probabilities of staying in the country, when \(B_1 = 0\) or when \(B_1 = 1\), which are calculated respectively as

\[\frac{\exp(2)}{1 + \exp(2)} = 0.881 \quad \mbox{ and } \quad \frac{\exp(2 - 1)}{1 + \exp(2 - 2)} = 0.731.\]

For each simulation step we also create 3 lists, \(X\), \(Y\) and \(Z\), using the function create_list_presences.

X <- create_list_presences(main_list, presences = true_presences, 
                           formula_prob = ~ bin1, 
                           coef_values = c(1.5, -0.5))
   
Y <- create_list_presences(main_list, presences = true_presences,
                           formula_prob = ~ bin2, 
                           coef_values = c(-0.5, -0.5))
  
Z <- create_list_presences(main_list, presences = true_presences,
                           formula_prob = ~ bin3, 
                           coef_values = c(-1.5, -0.5))

The argument formula_prob controls which variables are used to calculate the probability of being observed in each list. As we have said previously, we are considering the 3 different binary variables (bin1, bin2 and bin3) separately for each list. The values in the argument coef_values are used in the logistic regression to generate the observation in each list. For instance, for list \(X\) we are calculating the following

\[\log \begin{pmatrix}\frac{P(X = 1|B_1)}{1 - P(X = 1|B = 1)} \end{pmatrix} = 1.5 - 0.5 \cdot B_1,\]

which gives us estimated probabilities of 0.818 and 0.731 for \(B_1 = 0\) and \(B_1 = 1\), respectively.

Simulation setup

Analysis of the results

Final discussion