How to study the effect of removing one list in the estimation of over-coverage
The electronic supplementary material of our paper contains a simulation study in which we discuss the effects of removing registers from the estimation of our model. The reference for the paper is:
- Mussino, E., Santos, B., Monti, A. et al. Multiple systems estimation for studying over-coverage and its heterogeneity in population registers. Quality & Quantity (2023). https://doi.org/10.1007/s11135-023-01757-x
Here we discuss the different steps of the simulation study and show how to reproduce the experiment.
Initial Setup
For this simulation, we generate a new population according to the same rules in every replication of the study. In the same way, we create 3 lists, \(X\), \(Y\) and \(Z\), where the probability of being observed always follows the same function. For this experiment, we create 3 binary variables, each of which is used separately in the probability function of one of the lists.
The initial step is to generate the population. We also create one continuous and one categorical variable; they are not used in this study, but the function needs to create at least one variable of each of these types.
library(overcoverage)
main_list <- create_population(size = 1e6,
                               n_bin_var = 3,
                               n_cont_var = 1,
                               n_cat_var = 1,
                               c(0.5),
                               c("bin1", "bin2", "bin3"))
We assume that individuals leave the country according to a logistic model, with the variable bin1 as the covariate that defines this probability. Using the create_presences function from our package and considering only 2 years of observation, we have the following:
true_presences <- create_presences(main_list,
                                   formula_phi = ~ bin1,
                                   coef_values = c(2, -1),
                                   varying_arrival = TRUE,
                                   years = 2)
If \(B_1\) represents bin1 and \(\phi\) is the probability of staying in the country each year (usually known as surviving in the capture-recapture literature), we consider the following logistic regression to calculate those probabilities:
\[\log \left( \frac{\phi}{1 - \phi} \right) = 2 - B_1.\]
This equation yields two probabilities of staying in the country, one for \(B_1 = 0\) and one for \(B_1 = 1\), which are calculated respectively as
\[\frac{\exp(2)}{1 + \exp(2)} = 0.881 \quad \mbox{ and } \quad \frac{\exp(2 - 1)}{1 + \exp(2 - 1)} = 0.731.\]
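As a quick check of these values, the following base R sketch evaluates the inverse logit with plogis() from the stats package and compares the result with simulated one-year staying indicators. It does not call the overcoverage package and is not how create_presences works internally. The sample size, the seed, and the probability of 0.5 used to draw bin1 are assumptions made only for this illustration.

```r
# Illustrative check in base R; this is not the package's implementation.
set.seed(2023)                            # arbitrary seed (assumption)
n <- 1e5                                  # illustrative sample size (assumption)
bin1 <- rbinom(n, size = 1, prob = 0.5)   # assumed P(bin1 = 1) = 0.5

# Coefficients as passed to create_presences(): intercept 2, slope -1 on bin1
coef_phi <- c(2, -1)

# Theoretical staying probabilities for bin1 = 0 and bin1 = 1
phi_theory <- plogis(coef_phi[1] + coef_phi[2] * c(0, 1))
round(phi_theory, 3)

# Simulate one year of staying (1) or leaving (0) for each individual
phi_i <- plogis(coef_phi[1] + coef_phi[2] * bin1)
stay <- rbinom(n, size = 1, prob = phi_i)

# Empirical staying proportions by value of bin1
phi_empirical <- tapply(stay, bin1, mean)
round(phi_empirical, 3)
```

The first printed vector should match the values 0.881 and 0.731 above, and the simulated proportions should be close to them.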
For each simulation step we also create 3 lists, \(X\), \(Y\) and \(Z\), using the function create_list_presences.
X <- create_list_presences(main_list, presences = true_presences,
                           formula_prob = ~ bin1,
                           coef_values = c(1.5, -0.5))
Y <- create_list_presences(main_list, presences = true_presences,
                           formula_prob = ~ bin2,
                           coef_values = c(-0.5, -0.5))
Z <- create_list_presences(main_list, presences = true_presences,
                           formula_prob = ~ bin3,
                           coef_values = c(-1.5, -0.5))
The argument formula_prob controls which variables are used to calculate the probability of being observed in each list. As stated previously, we use the 3 binary variables (bin1, bin2 and bin3) separately, one for each list. The values in the argument coef_values are the coefficients of the logistic regression that generates the observations in each list. For instance, for list \(X\) we calculate the following
\[\log \left( \frac{P(X = 1|B_1)}{1 - P(X = 1|B_1)} \right) = 1.5 - 0.5 \cdot B_1,\]
which gives observation probabilities of 0.818 and 0.731 for \(B_1 = 0\) and \(B_1 = 1\), respectively.
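The same calculation can be repeated for the three lists. The sketch below, again in base R only, applies plogis() to the coef_values used above. It assumes that each list follows the same logistic form shown for \(X\), with its own binary covariate. The object names list_coefs and obs_prob are introduced here for illustration and are not part of the package.

```r
# Illustrative calculation in base R; not a call to the overcoverage package.
# Intercept and slope for each list, as given in coef_values above
list_coefs <- list(X = c(1.5, -0.5),
                   Y = c(-0.5, -0.5),
                   Z = c(-1.5, -0.5))

# Observation probability when the list's binary covariate is 0 or 1
obs_prob <- t(sapply(list_coefs, function(b) plogis(b[1] + b[2] * c(0, 1))))
colnames(obs_prob) <- c("covariate = 0", "covariate = 1")

round(obs_prob, 3)
```

The row for \(X\) reproduces 0.818 and 0.731, while lists \(Y\) and \(Z\) have markedly lower observation probabilities because of their negative intercepts.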