In this section, we’ll walk through a few simple examples to
demonstrate the typical workflow with this package.
Regular Normal
To begin, we consider a simple example where the data is drawn i.i.d.
from a normal distribution with unknown mean \(\mu\) and unknown variance \(\sigma^2\): \[X_i\sim N(\mu, \sigma^2).\] Suppose the
observed statistics (sample mean and variance) are disclosed; how do we
obtain confidence intervals (or sets) for the true population parameters?
First, let’s load the package:
Specify the sample size and a large enough repro sample size (usually \(200\) is
enough); the significance level will be specified later.
n <- 50 # sample size
repro_size <- 200 # number of repro samples
Generate a matrix of seeds that will be used to generate ‘fake’
statistics. It has repro_size rows, and the number of columns depends on
the specific problem. In this case, it has n columns, one for each of the
i.i.d. data points.
seeds <- matrix(rnorm(repro_size * n), nrow = repro_size, ncol = n)
We’ll also write a function that, given some parameter and the random
seeds defined previously, computes ‘repro_size’ copies of the ‘fake’
statistics and stacks them vertically. In this case, the first component
of ‘theta’ represents the mean, and the second component represents the
variance.
s_sample <- function(seeds, theta) {
# generate the raw data points
data <- theta[1] + sqrt(theta[2]) * seeds[, 1:n]
# compute the regular statistics
s_mean <- apply(data, 1, mean)
s_var <- apply(data, 1, var)
return(cbind(s_mean, s_var))
}
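As a quick check (an illustration added here, not part of the original walkthrough), the generating function should return a repro_size-by-2 matrix of ‘fake’ statistics, one row per repro sample:
dim(s_sample(seeds, c(0, 1))) # one row per repro sample, one column per statistic
#> [1] 200   2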
Suppose the observed statistic s_obs, a vector containing the observed
sample mean and sample variance, is given.
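(The chunk defining s_obs is not shown in this rendering; the numerical results printed below were produced with the original value. Purely as a hypothetical stand-in, so that the remaining chunks can be run end to end, one could compute it from a simulated dataset:)
# Hypothetical stand-in (not the original value): compute the observed
# statistic (sample mean, sample variance) from a simulated dataset.
set.seed(1)
x_obs <- rnorm(n, mean = 1, sd = 1)
s_obs <- c(mean(x_obs), var(x_obs))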
Regular Normal: The ‘p_value’ function
With the above observed statistic, we want to conduct a hypothesis
test for whether the true mean is \(0\):
\[H_0:\mu=0, H_1:\mu\ne 0.\]
What’s the p-value of the test with the observed statistics given
above? The ‘p_value’ function in the package comes in handy here. Set
‘lower_bds’ to \((0, 1e-6)\) and ‘upper_bds’ to \((0, 10)\), fixing the
mean at \(0\) and choosing a large enough range for the variance, and use
the seeds, generating function, and observed statistic from before:
lower_bds <- c(0, 1e-6)
upper_bds <- c(0, 10)
result <- p_value(lower_bds, upper_bds, seeds, s_sample, s_obs)
print(result$p_val) # p_value
#> [1] 0.004975124
If the significance level of the test is \(0.05\), then we reject the null hypothesis
that the population mean is \(0\).
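As a side note (an observation about the output above rather than something stated by the package): the printed p-value equals 1/(repro_size + 1) = 1/201, which suggests it is the smallest value attainable with \(200\) repro samples; resolving smaller p-values would require a larger repro_size.
1 / (repro_size + 1) # matches the printed p-value above
#> [1] 0.004975124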
Alternatively, we can conduct composite hypothesis tests:
\[
H_0:\theta\in \Theta_0, H_1:\theta\notin\Theta_0,
\]
where \(\Theta_0\) is of the form
\([l_1,u_1]\times[l_2,u_2]\times\cdots\). In
this case, the ‘p_value’ function searches through the whole parameter
space \(\Theta_0\) and finds the
parameter corresponding to the largest p-value. For example, if \(\Theta_0=[-1,2]\times[1,5]\), then we can
do the following:
lower_bds <- c(-1, 1) # lower bounds for mean and variance
upper_bds <- c(2, 5) # upper bounds for mean and variance
result <- p_value(lower_bds, upper_bds, seeds, s_sample, s_obs)
print(result$p_val) # the largest p_value found
#> [1] 0.318408
print(result$theta_hat) # the parameter corresponding to the largest p value
#> [1] 1.124757 1.000000
If the significance level is \(0.05\), then we fail to reject the null
hypothesis.
Regular Normal: The ‘get_CI’ function
To go a step further, we can find the confidence intervals for the
true mean or variance using the ‘get_CI’ function. The extra things to
specify here are parameter_index (indicating which parameter to get a
confidence interval for), the significance level alpha, and the tolerance
for the confidence intervals tol. The returned confidence interval will
cover the actual \((1-\alpha)\)-confidence interval with at most a
difference of ‘tol’ in width. First, let’s find the confidence interval
for the mean (which is the first component of s_obs).
lower_bds <- c(-5, 1e-6)
upper_bds <- c(5, 10)
parameter_index <- 1 # an integer specifying the parameter of interest
alpha <- .05 # significance level
tol <- 1e-3 # tolerance
# choosing parameter_index = 1 indicates that we're obtaining a confidence interval for the first component of s_obs; in this case, it's the mean.
mean_CI <- get_CI(alpha, lower_bds, upper_bds, parameter_index, seeds, s_sample, s_obs, tol)
print(mean_CI)
#> [1] 0.8261201 1.4344877
Similarly, we can get a confidence interval for the variance:
parameter_index <- 2
var_CI <- get_CI(alpha, lower_bds, upper_bds, parameter_index, seeds, s_sample, s_obs, tol)
print(var_CI)
#> [1] 0.4538163 1.3640110
The methodology adopted here ensures that the obtained confidence
interval will contain the actual confidence interval, with its width
exceeding the width of the actual confidence interval by at most tol.
For instance, with tol set to \(10^{-3}\), the interval for the mean
reported above contains the exact interval and is at most \(0.001\) wider
than it.
Regular Normal: The ‘confidence_grid’ function
In many situations, the confidence interval for a single parameter is
very loose and not helpful, especially when the number of parameters is
large. The ‘confidence_grid’ function in this package obtains a
confidence set instead. An extra input needed here is resolution, which
specifies the number of grid points along each parameter. Since we
already have a sense of where the confidence region is from the
previously obtained confidence intervals, we can set lower_bds and
upper_bds to be much narrower to obtain a finer representation of the
confidence region (which is valid only when we’re using the same seeds
as before).
lower_bds <- c(0.5, 1e-6)
upper_bds <- c(1.5, 1.5)
resolution <- 13
result <- confidence_grid(alpha, lower_bds, upper_bds, seeds, s_sample, s_obs, tol, resolution)
indicator_array <- result$ind_array
print(result)
#> $ind_array
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
#> [1,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [2,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [3,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [4,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [5,] 0 0 0 0 0 1 1 1 1 1 0 0 0
#> [6,] 0 0 0 0 1 1 1 1 1 1 1 0 0
#> [7,] 0 0 0 0 1 1 1 1 1 1 1 1 0
#> [8,] 0 0 0 1 1 1 1 1 1 1 1 1 0
#> [9,] 0 0 0 1 1 1 1 1 1 1 1 1 0
#> [10,] 0 0 0 1 1 1 1 1 1 1 1 1 0
#> [11,] 0 0 0 0 1 1 1 1 1 1 1 1 0
#> [12,] 0 0 0 0 1 1 1 1 1 1 0 0 0
#> [13,] 0 0 0 0 0 1 1 1 0 0 0 0 0
#>
#> $search_lower_bds
#> [1] 0.8259210 0.4538477
#>
#> $search_upper_bds
#> [1] 1.434823 1.364023
We can also retrieve the simultaneous confidence intervals for both
parameters as follows:
print(c(result$search_lower_bds[1], result$search_upper_bds[1]))
#> [1] 0.825921 1.434823
print(c(result$search_lower_bds[2], result$search_upper_bds[2]))
#> [1] 0.4538477 1.3640228
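For a programmatic check of membership in the confidence region, one can map a candidate parameter value to its grid cell and read off the indicator. The helper below is a hedged sketch, not part of the package; it assumes the grid cells evenly partition the box between lower_bds and upper_bds, with the i-th array axis corresponding to the i-th parameter.
in_region <- function(theta, ind_array, lower_bds, upper_bds) {
  dims <- dim(ind_array) # number of grid cells along each parameter axis
  # map each coordinate to a cell index in 1, ..., dims[i]
  idx <- ceiling((theta - lower_bds) / (upper_bds - lower_bds) * dims)
  idx <- pmin(pmax(idx, 1), dims)
  ind_array[matrix(idx, nrow = 1)] == 1
}
in_region(c(1, 0.9), indicator_array, lower_bds, upper_bds) # is (mean = 1, variance = 0.9) inside?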
Regular Normal: The ‘plot_grid’ function
This package includes a function called ‘plot_grid’ which takes the
indicator array above and visualizes it. Note that the ‘lower_bds’ and
‘upper_bds’ here must agree with the ones used to produce this indicator
array.
parameter_names <- c("mean", "variance")
plot_grid(indicator_array, lower_bds, upper_bds, parameter_names)

Note that the plot looks just like a rotated version of the indicator
array above; this is because coordinates start from the top left of the
printed array but from the bottom left of the plot.
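If you prefer to inspect the region without the package’s plotting helper, a base-R image() call gives the same picture (a hedged alternative, not part of the package); image() places the first array index along the horizontal axis, which is why the plot appears rotated relative to the printed array.
grid_mean <- seq(lower_bds[1], upper_bds[1], length.out = nrow(indicator_array))
grid_var <- seq(lower_bds[2], upper_bds[2], length.out = ncol(indicator_array))
image(grid_mean, grid_var, indicator_array, xlab = "mean", ylab = "variance")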
The ‘grid_projection’ function
Parameter spaces are often not just 2-dimensional but
higher-dimensional. In these cases, the indicator array produced
by the ‘confidence_grid’ function is also a higher-dimensional array. The
indicator array is useful on its own, but if we want to visualize it, we
must project it down to 2 dimensions via the ‘grid_projection’ function.
Take, for instance, a bivariate Gaussian model with \(\mu_1=0,\mu_2=1\), and covariance
matrix \(\Sigma=\sigma^2 I_2\)
(uncorrelated components with a common variance \(\sigma^2=1\)). First, we obtain the
indicator array as usual:
n <- 100 # sample size
repro_size <- 200 # number of repro samples
seeds <- matrix(rnorm(repro_size * (2 * n)), nrow = repro_size, ncol = 2 * n)
s_sample <- function(seeds, theta) {
n_obs <- ncol(seeds) / 2
mu1 <- theta[1]
mu2 <- theta[2]
sigma <- sqrt(theta[3])
seeds1 <- seeds[, 1:n_obs, drop = FALSE] # first n columns for X
seeds2 <- seeds[, (n_obs + 1):(2 * n_obs), drop = FALSE] # next n columns for Y
# generate bivariate data
data1 <- mu1 + sigma * seeds1 # X ~ N(mu_1, sigma^2)
data2 <- mu2 + sigma * seeds2 # Y ~ N(mu_2, sigma^2)
# compute statistics
s_mean1 <- rowMeans(data1) # sample mean of X
s_mean2 <- rowMeans(data2) # sample mean of Y
s_var1 <- apply(data1, 1, var) # sample variance of X
s_var2 <- apply(data2, 1, var) # sample variance of Y
return(cbind(s_mean1, s_mean2, s_var1, s_var2))
}
s_obs <- c(-0.1, 0.1, 1.05, 0.9)
alpha <- 0.05
tol <- 1e-3
lower_bds <- c(-0.5, -0.5, 1e-6)
upper_bds <- c(0.5, 0.5, 2)
resolution <- 10
result <- confidence_grid(alpha, lower_bds, upper_bds, seeds, s_sample, s_obs, tol, resolution)
indicator_array <- result$ind_array
The statistics here are the two sample means and the two sample
variances, four real numbers in total. However, the number of
parameters does not need to match the dimension of the statistic. In
this case, there are only three parameters: two population means and a
common variance. Suppose we’re interested in the confidence set involving only
the two means \(\mu_1, \mu_2\); then we
can choose the index set as follows:
index_set <- c(1,2)
means_array <- grid_projection(indicator_array, index_set)
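As a sanity check (added here as an illustration, and assuming one array axis per parameter), the full indicator array should now be three-dimensional, while the projection onto the two means is an ordinary matrix:
dim(indicator_array) # expected: 10 x 10 x 10 (three parameters, resolution 10)
dim(means_array) # expected: 10 x 10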
To plot the joint confidence region for \(\mu_1,\mu_2\), we then use the same lower
bounds and upper bounds used in the ‘confidence_grid’ function, and take
only the components corresponding to these two parameters. For
example,
means_lower_bds <- lower_bds[index_set]
means_upper_bds <- upper_bds[index_set]
parameter_names <- c("mu1", "mu2")
plot_grid(means_array, means_lower_bds, means_upper_bds, parameter_names)

Alternatively, if we want to plot the confidence regions for \(\mu_1\) and \(\sigma^2\), or for \(\mu_2\) and \(\sigma^2\), we can choose a different index
set and repeat the above procedure:
index_set2 <- c(1,3)
mu1_var_array <- grid_projection(indicator_array, index_set2)
index_set3 <- c(2,3)
mu2_var_array <- grid_projection(indicator_array, index_set3)
mu1_var_lower_bds <- lower_bds[index_set2]
mu1_var_upper_bds <- upper_bds[index_set2]
parameter_names <- c("mu1", "variance")
plot_grid(mu1_var_array, mu1_var_lower_bds, mu1_var_upper_bds, parameter_names)
mu2_var_lower_bds <- lower_bds[index_set3]
mu2_var_upper_bds <- upper_bds[index_set3]
parameter_names <- c("mu2", "variance")
plot_grid(mu2_var_array, mu2_var_lower_bds, mu2_var_upper_bds, parameter_names)


Although these 2D visualizations are useful, note that
the projections by themselves are not enough to reconstruct the original
confidence set. Consider, for example, a cube that is hollow inside: its
projections are identical to those of a solid cube.
DP Normal
Next, we’ll consider the setting that this package was initially
designed for, and where it shines: statistical analysis with differential
privacy. With privacy mechanisms present, it’s extremely hard to analyze
the statistics and give closed-form expressions for confidence
intervals/sets. Consider a basic setting where the data is drawn i.i.d.
from a normal distribution with unknown mean \(\mu\) and variance \(\sigma^2\): \[X_i\sim N(\mu,\sigma^2).\] To protect the
privacy of the data owners, the data center first clamps the data with
lower and upper bounds \(\textit{lower_clamp}\) and \(\textit{upper_clamp}\) (publicly known),
then computes an \((\epsilon,
0)\)-differentially private statistic \(s_{DP}=(\text{private mean}, \text{private
variance})\) before releasing it to the public. How do we conduct
hypothesis tests with privacy constraints in place? How can we obtain
confidence intervals/sets for the population mean and variance? First,
we use the same basic parameters as before:
n <- 50 # sample size
repro_size <- 200 # number of repro samples
Since the privacy mechanism is also published by the data center, the
clamps and the privacy guarantee parameter \(\epsilon\) are publicly known. Note that
these may differ depending on the specific privacy mechanism.
upper_clamp <- 3
lower_clamp <- -3
eps <- 4 # privacy guarantee parameter
Now, instead of only the randomness in the data generation process,
we also need the randomness associated with the privacy mechanism.
Therefore, we need a matrix of seeds that contains both. Below,
regular_seeds is used to generate the Gaussian samples, while the two
columns of dp_seeds are used to generate the privacy noise added to the
sample mean and the sample variance, respectively. Since we’re using the
Laplace mechanism here, we choose each privacy seed to be the product of
a random sign (a shifted Bernoulli\((1/2)\) random variable) and an
Exp\((1)\) random variable; such a product has the same distribution as a
Laplace\((1)\) random variable.
regular_seeds <- matrix(rnorm(repro_size * n), nrow = repro_size, ncol = n)
dp_seeds <- matrix((2 * rbinom(n = repro_size * 2, size = 1, prob = 1/2) - 1) * rexp(repro_size * 2), nrow = repro_size, ncol = 2)
seeds <- cbind(regular_seeds, dp_seeds)
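As an optional sanity check (not needed for the workflow), a random sign times an Exp\((1)\) draw indeed behaves like a standard Laplace random variable, whose mean is \(0\) and variance is \(2\):
set.seed(123)
z <- (2 * rbinom(1e5, size = 1, prob = 1/2) - 1) * rexp(1e5)
c(mean(z), var(z)) # should be close to 0 and 2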
Knowing the privacy mechanism, we can define the generating function
that produces i.i.d. samples of the statistics:
s_sample <- function(seeds, theta) {
# generate the i.i.d. normal r.v.s using the first n columns of the seeds
raw_data <- theta[1] + sqrt(theta[2]) * seeds[, 1:n]
# clamp the raw data using the clamps defined above
clamped <- pmin(pmax(raw_data, lower_clamp), upper_clamp)
# compute the private statistics (adding a properly scaled Laplace noise to the sample mean and sample variance)
s_mean <- apply(clamped, 1, mean) + (upper_clamp - lower_clamp) / (n * eps / 2) * seeds[, n+1]
s_var <- apply(clamped, 1, var) + (upper_clamp - lower_clamp)^2 / (n * eps / 2) * seeds[, n+2]
return(cbind(s_mean, s_var))
}
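A note on the noise scales above (read directly off the code, with the privacy budget split evenly between the two statistics): changing one record moves the clamped sample mean by at most \((\textit{upper_clamp}-\textit{lower_clamp})/n\), so Laplace noise with scale \(\frac{(\textit{upper_clamp}-\textit{lower_clamp})/n}{\epsilon/2}\) makes the private mean \((\epsilon/2, 0)\)-differentially private; the private variance uses the analogous scale with the squared clamp width, and by composition the released pair is \((\epsilon, 0)\)-differentially private.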
We’re now ready to run hypothesis tests and identify confidence
intervals/sets using the provided functions from the package.
DP Normal: The ‘p_value’ function
We begin by encoding the released statistic (private sample mean and
private sample variance) as a vector:
s_DP <- c(1.12, 1.07) # define the observed statistic as a vector
Let’s say we want to test whether the mean lies between \(0\) and \(1\):
\[
H_0:\mu\in[0,1], H_1:\mu\notin[0,1].
\]
Then we choose a large enough search range for the remaining
parameter. In this case, the population variance \(\sigma^2\) is assumed
to be in the range \([10^{-6},10]\).
lower_bds <- c(0, 1e-6)
upper_bds <- c(1, 10)
Again, we apply the p_value function for hypothesis tests:
result <- p_value(lower_bds, upper_bds, seeds, s_sample, s_DP)
print(result$p_val) # the largest p_value found
#> [1] 0.7064677
print(result$theta_hat) # the parameter corresponding to the largest p value
#> [1] 1.000000 1.159879
With a significance level of \(0.05\), we fail to reject the null
hypothesis.
DP Normal: The ‘get_CI’ function
Now, let’s find the confidence intervals for \(\mu\) and \(\sigma^2\).
# search lower bounds and upper bounds for mean and variance
alpha <- .05 # significance level
tol <- 1e-3 # tolerance
lower_bds <- c(-5, 1e-6)
upper_bds <- c(5, 10)
First, let’s find a \(95\%\)
confidence interval for \(\mu\):
# choosing parameter_index = 1 indicates that we're obtaining a confidence interval for the first component of s_DP; in this case, it's the mean.
parameter_index <- 1
mean_CI <- get_CI(alpha, lower_bds, upper_bds, parameter_index, seeds, s_sample, s_DP, tol)
print(mean_CI)
#> [1] 0.5914339 1.7243051
Similarly, we can obtain a \(95\%\) confidence interval for \(\sigma^2\)
by changing parameter_index from \(1\) to \(2\):
parameter_index <- 2
var_CI <- get_CI(alpha, lower_bds, upper_bds, parameter_index, seeds, s_sample, s_DP, tol)
print(var_CI)
#> [1] 0.000001 4.247069
DP Normal: The ‘confidence_grid’ function
We can obtain a joint confidence region for \(\mu\) and \(\sigma^2\). Based on the individual
confidence intervals obtained above, we can choose a tighter search
region to make the visualization look better.
lower_bds <- c(0, 1e-6)
upper_bds <- c(2, 5)
resolution <- 20
result <- confidence_grid(alpha, lower_bds, upper_bds, seeds, s_sample, s_DP, tol, resolution)
indicator_array <- result$ind_array
print(result)
#> $ind_array
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
#> [1,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [2,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [3,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [4,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [5,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [6,] 0 0 0 0 0 1 1 1 1 0 0 0 0
#> [7,] 0 0 0 1 1 1 1 1 1 1 1 1 1
#> [8,] 0 1 1 1 1 1 1 1 1 1 1 1 1
#> [9,] 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [10,] 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [11,] 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [12,] 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [13,] 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [14,] 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [15,] 0 1 1 1 1 1 1 1 1 1 1 1 1
#> [16,] 0 0 1 1 1 1 1 1 1 1 1 1 1
#> [17,] 0 0 0 0 1 1 1 1 1 1 1 1 1
#> [18,] 0 0 0 0 0 0 0 1 1 1 1 1 0
#> [19,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [20,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [,14] [,15] [,16] [,17] [,18] [,19] [,20]
#> [1,] 0 0 0 0 0 0 0
#> [2,] 0 0 0 0 0 0 0
#> [3,] 0 0 0 0 0 0 0
#> [4,] 0 0 0 0 0 0 0
#> [5,] 0 0 0 0 0 0 0
#> [6,] 0 0 0 0 0 0 0
#> [7,] 0 0 0 0 0 0 0
#> [8,] 1 0 0 0 0 0 0
#> [9,] 1 1 0 0 0 0 0
#> [10,] 1 1 1 1 0 0 0
#> [11,] 1 1 1 1 0 0 0
#> [12,] 1 1 1 1 0 0 0
#> [13,] 1 1 1 1 0 0 0
#> [14,] 1 1 1 1 0 0 0
#> [15,] 1 1 1 1 0 0 0
#> [16,] 1 1 1 0 0 0 0
#> [17,] 1 1 0 0 0 0 0
#> [18,] 0 0 0 0 0 0 0
#> [19,] 0 0 0 0 0 0 0
#> [20,] 0 0 0 0 0 0 0
#>
#> $search_lower_bds
#> [1] 0.5919373 0.0000010
#>
#> $search_upper_bds
#> [1] 1.724242 4.246818
DP Normal: The ‘plot_grid’ function
Lastly, we can visualize the confidence region using the plot_grid function.
parameter_names <- c("mean", "variance") # optional input
plot_grid(indicator_array, lower_bds, upper_bds, parameter_names)

The differentially private confidence region looks more dispersed
than in the regular normal case, due to the added privacy noise.