representr
Record linkage (entity resolution or de-duplication) is used to join
multiple databases to remove duplicate entities. While record linkage
removes the duplicate entities from the data, many researchers are
interested in performing inference, prediction, or post-linkage analysis
on the linked data (e.g., regression or capture-recapture), which we
call the downstream task. Depending on the downstream task, one
may wish to find the most representative record before performing the
post-linkage analysis. For example, when the values of features used in
a downstream task differ for linked data, which values should be used?
This is where representr comes in. Before introducing our new package
representr from the paper Kaplan, Betancourt, and Steorts (n.d.), we
first provide an introduction to record linkage.
Throughout this vignette, we will use data that is available in the
representr package, rl_reg1 (rl = record linkage, reg = regression,
1 = amount of noisiness).
# load library
library(representr)
# load data
data("rl_reg1") # data for record linkage and regression
data("identity.rl_reg1") # true identity of each record
fname | lname | bm | bd | by | sex | education | income | bp |
---|---|---|---|---|---|---|---|---|
jasmine | sirotic | 3 | 31 | 1972 | F | High school graduates, no college | 32 | 127 |
hugo | white | 6 | 9 | 1958 | M | Bachelor’s degree only | 72 | 134 |
madeline | burgemeifter | 12 | 9 | 1967 | F | Some college or associate degree | 30 | 130 |
kyle | clarke | 4 | 9 | 1952 | M | Advanced degree | 90 | 125 |
livia | braciak | 11 | 6 | 1950 | F | High school graduates, no college | 27 | 134 |
phoebe | green | 8 | 8 | 1957 | F | High school graduates, no college | 32 | 128 |
This is simulated data, which consists of 500 records with 30% duplication and the following attributes:
- fname: First name
- lname: Last name
- bm: Birth month (numeric)
- bd: Birth day
- by: Birth year
- sex: Sex (“M” or “F”)
- education: Education level (“Less than a high school diploma”, “High school graduates, no college”, “Some college or associate degree”, “Bachelor’s degree only”, or “Advanced degree”)
- income: Yearly income (in $1000s)
- bp: Systolic blood pressure

Before we perform prototyping to get a representative data set using
representr, we must first perform record linkage to remove
duplication in the data set. In the absence of a unique identifier (such
as a social security number), we can use probabilistic methods to
perform record linkage. We recommend clustering records to a
latent entity, known in the literature as graphical entity resolution.
See Binette and Steorts (n.d.) for a review.
For the examples in this vignette, we have fit the model from the recent
work of N. G. Marchant et al. (2020) using dblinkR (N. Marchant 2021).
Please see the associated vignette for details on using dblinkR. We load
the results of running this record linkage model for \(100,000\)
iterations, which have been stored in the package as a data object
called linkage.rl.
After record linkage is complete, one may want to perform analyses of
the linked data. This is what we call the “downstream task”. As
motivation, consider modeling blood pressure (bp) using the following
two features (covariates): income and sex in our example data rl_reg1.
We want to fit this model after performing record linkage using the
following features: first and last name and full date of birth. Here is
an example of four records that represent the same individual (based on
the results from record linkage) using data that is in the representr
package.
| | fname | lname | bm | bd | by | sex | education | income | bp |
---|---|---|---|---|---|---|---|---|---|
370 | elenys | reit | 11 | 11 | 1954 | F | High school graduates, no college | 28 | 109 |
371 | elen7 | reicl | 11 | 22 | 1954 | F | Advanced degree | 52 | 118 |
372 | dleny | rejd | 11 | 11 | 1954 | M | Bachelor’s degree only | 63 | 109 |
373 | eleni | reid | 11 | 11 | 1954 | F | Bachelor’s degree only | 52 | 109 |
Examination of this table raises important questions that need to be addressed before performing a particular downstream task, such as which values of bp, income, and sex should be used as the representative features (or covariates) in a regression model? In this vignette, we will provide multiple solutions to this question using a prototyping approach.
We have four methods to choose or create the representative record
from linked data included in representr. This process is a
function of the data and the linkage structure, and we present both
probabilistic and deterministic functions. The result in all cases is a
representative data set to be passed on to the downstream task. The
prototyping is completed using the represent() function.
Our first proposal to choose a representative record (prototype) for a cluster is the simplest and serves as a baseline or benchmark. One simply chooses the representative record uniformly at random or using a more informed distribution.
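As a rough illustration of uniform random selection, the base R sketch below picks one record per cluster from a toy linkage vector. The names toy_lambda and pick_random are hypothetical and are not part of representr; this is a minimal sketch of the idea, not the package's implementation.

```r
# Toy linkage: toy_lambda[i] is the cluster (latent entity) label of record i.
toy_lambda <- c(1, 1, 2, 3, 3, 3)

# Pick one record index per cluster, uniformly at random.
pick_random <- function(lambda) {
  clusters <- split(seq_along(lambda), lambda)
  # sample.int() on the cluster size avoids the sample(x, 1) pitfall
  # when a cluster contains a single record
  vapply(clusters, function(idx) idx[sample.int(length(idx), 1)], integer(1))
}

set.seed(1)
random_proto <- pick_random(toy_lambda)
```

Each element of random_proto is a row index, so the chosen records could be extracted by row subsetting in the same way as in the package example.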
For demonstration purposes, we can create a representative dataset
using the last iteration of the results from running the record linkage
model using dblinkR. This is accomplished using the represent()
function, passing proto_random as the type of prototyping.
# ids for representative records (random)
random_id <- represent(rl_reg1, linkage.rl[nrow(linkage.rl),], "proto_random", parallel = FALSE)
rep_random <- rl_reg1[random_id,] # representative records (random)
We can have a look at a few records chosen as representative in this way.
| | fname | lname | bm | bd | by | sex | education | income | bp |
---|---|---|---|---|---|---|---|---|---|
240 | callum | lecke | 8 | 5 | 1981 | M | Advanced degree | 86 | 126 |
353 | lachlan | ebert | 5 | 15 | 1954 | M | Some college or associate degree | 42 | 150 |
161 | frankesco | pedito | 6 | 16 | 1972 | M | Bachelor’s degree only | 38 | 151 |
6 | phoebe | green | 8 | 8 | 1957 | F | High school graduates, no college | 32 | 128 |
315 | peter | chittleborough | 7 | 11 | 1980 | M | High school graduates, no college | 36 | 153 |
Our second proposal to choose a representative record is to select the record that “most closely captures” that of the latent entity. Of course, this is quite subjective. We propose selecting the record whose farthest neighbor within the cluster is closest, where closeness is measured by a record distance function, \(d_r(\cdot, \cdot)\). We can write this as the record \(r = (i, j)\) within each cluster \(\Lambda_{j'}\) such that \[ r = \arg\min\limits_{(i, j) \in \Lambda_{j'}} \max\limits_{(i^*, j^*) \in \Lambda_{j'}} d_r((i, j), (i^*, j^*)). \] The result is a set of representative records, one for each latent individual, each of which is closest, in this minimax sense, to the other records in its cluster. When there is a tie within the cluster, we select a record uniformly at random.
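The minimax rule can be sketched for a single cluster as follows, assuming the pairwise record distances have already been computed. The matrix D and the helper minimax_pick are hypothetical names used only for illustration; this is a sketch under those assumptions, not the code inside represent().

```r
# Pairwise distances between the three records of one toy cluster.
D <- matrix(c(0.0, 0.2, 0.6,
              0.2, 0.0, 0.5,
              0.6, 0.5, 0.0), nrow = 3, byrow = TRUE)

# Choose the record whose farthest neighbour is closest.
minimax_pick <- function(D) {
  worst <- apply(D, 1, max)            # distance to the farthest neighbour
  ties <- which(worst == min(worst))   # all records attaining the minimum
  ties[sample.int(length(ties), 1)]    # break ties uniformly at random
}

proto <- minimax_pick(D)
```

Here the farthest-neighbour distances are 0.6, 0.5, and 0.6, so the second record is selected.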
There are many distance functions that can be used for \(d_r(\cdot, \cdot)\). Given two records, \((i, j)\) and \((i^*, j^*)\), we use a weighted average of column-wise distances (based on the column type) to produce the following single distance metric: \[ d_r((i, j), (i^*, j^*)) = \sum\limits_{\ell = 1}^p w_\ell d_{r\ell}((i, j), (i^*, j^*)), \] where \(\sum\limits_{\ell = 1}^p w_\ell = 1\). The column-wise distance functions \(d_{r\ell}(\cdot, \cdot)\) we use are presented below.
Column | \(d_{r\ell}(\cdot, \cdot)\) |
---|---|
String | Any string distance function, e.g., Jaro-Winkler string distance |
Numeric | Absolute distance, \(d_{r\ell}((i, j), (i^*, j^*)) = \mid x_{ij\ell} - x_{i^*j^*\ell} \mid\) |
Categorical | Binary distance, \(d_{r\ell}((i, j), (i^*, j^*)) = \mathbb{I}(x_{ij\ell} \neq x_{i^*j^*\ell})\) |
Ordinal | Absolute distance between levels. Let \(\gamma(x_{ij\ell})\) be the order of the value \(x_{ij\ell}\), then \(d_{r\ell}((i, j), (i^*, j^*)) = \mid \gamma(x_{ij\ell}) - \gamma(x_{i^*j^*\ell}) \mid\) |
The weighting of variable distances is used to place importance on individual features according to prior knowledge of the data set and to scale the feature distances to a common range. In this vignette, we scale all column-wise distances to be values between \(0\) and \(1\).
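The weighted, scaled distance can be sketched in base R as below. Several things here are assumptions made for illustration: the function record_dist and the toy_* objects are hypothetical, the string distance is a normalized Levenshtein distance from base R's adist() standing in for Jaro-Winkler, and numeric distances are scaled by an assumed range. It is not the dist_col_type function shipped with representr.

```r
# Weighted distance between two records a and b (named lists).
# Every column-wise distance is scaled to [0, 1] before weighting.
record_dist <- function(a, b, col_type, weights, orders = list(), ranges = list()) {
  d <- numeric(length(col_type))
  for (l in seq_along(col_type)) {
    nm <- names(col_type)[l]
    x <- a[[nm]]
    y <- b[[nm]]
    d[l] <- switch(col_type[[l]],
      string = {
        # Levenshtein distance (stand-in for Jaro-Winkler), scaled by length
        m <- max(nchar(x), nchar(y))
        if (m == 0) 0 else as.numeric(adist(x, y)) / m
      },
      numeric = abs(x - y) / ranges[[nm]],
      categorical = as.numeric(x != y),
      ordinal = {
        lev <- orders[[nm]]
        abs(match(x, lev) - match(y, lev)) / (length(lev) - 1)
      }
    )
  }
  sum(weights * d)
}

toy_types <- c(fname = "string", by = "numeric", sex = "categorical", education = "ordinal")
toy_weights <- c(0.4, 0.2, 0.2, 0.2)   # sums to 1
toy_orders <- list(education = c("HS", "Some college", "Bachelor", "Advanced"))
toy_ranges <- list(by = 50)            # assumed span of birth years

r1 <- list(fname = "eleni", by = 1954, sex = "F", education = "Bachelor")
r2 <- list(fname = "elen7", by = 1954, sex = "F", education = "Advanced")
d12 <- record_dist(r1, r2, toy_types, toy_weights, toy_orders, toy_ranges)
d11 <- record_dist(r1, r1, toy_types, toy_weights, toy_orders, toy_ranges)
```

In this toy case only the name (one edit out of five characters) and the education level (one step out of three) differ, so d12 equals 0.4 * 0.2 + 0.2 * (1/3).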
Again, we can create a representative dataset using the last
iteration of the results from running the record linkage model using
dblinkR. This time we need to specify some more parameters, such as the
types of the columns. This is accomplished using the represent()
function, passing proto_minimax as the type of prototyping.
# additional parameters for minimax prototyping
# need column types, the order levels for any ordinal variables, and column weights
col_type <- c("string", "string", "numeric", "numeric", "numeric", "categorical", "ordinal", "numeric", "numeric")
orders <- list(education = c("Less than a high school diploma", "High school graduates, no college", "Some college or associate degree", "Bachelor's degree only", "Advanced degree"))
weights <- c(.25, .25, .05, .05, .1, .15, .05, .05, .05)
# ids for representative records (minimax)
minimax_id <- represent(rl_reg1, linkage.rl[nrow(linkage.rl),], "proto_minimax",
                        distance = dist_col_type, col_type = col_type,
                        weights = weights, orders = orders, scale = TRUE, parallel = FALSE)
rep_minimax <- rl_reg1[minimax_id,] # representative records (minimax)
We can have a look at some of the representative records chosen via minimax prototyping.
| | fname | lname | bm | bd | by | sex | education | income | bp |
---|---|---|---|---|---|---|---|---|---|
240 | callum | lecke | 8 | 5 | 1981 | M | Advanced degree | 86 | 126 |
356 | marlee | campbell | 4 | 30 | 1956 | F | Some college or associate degree | 33 | 127 |
350 | bethany | osseweijer | 4 | 18 | 1964 | F | Advanced degree | 59 | 102 |
95 | jayde | melhado | 3 | 24 | 1975 | F | Bachelor’s degree only | 45 | 116 |
179 | isabella | petersen | 11 | 10 | 1967 | F | Some college or associate degree | 34 | 126 |
Our third proposal to choose a representative record is by
aggregating the records (in each cluster) to form a composite record
that includes information from each linked record. The form of
aggregation can depend on the column type, and the aggregation itself
can be weighted by some prior knowledge of the data sources or use the
posterior information from the record linkage model. For quantitative
variables, we use a weighted arithmetic mean to combine linked values,
whereas for categorical variables, a weighted majority vote is used. For
string variables, we use a weighted majority vote for each character,
which allows for noisy strings to differ on a continuum. This is
accomplished using the represent() function, passing composite as the
type of prototyping.
# representative records (composite)
rep_composite <- represent(rl_reg1, linkage.rl[nrow(linkage.rl),], "composite", col_type = col_type, parallel = FALSE)
We can have a look at some of the representative records.
| | fname | lname | bm | bd | by | sex | education | income | bp |
---|---|---|---|---|---|---|---|---|---|
240 | callum | lecke | 8 | 5.0 | 1981 | M | Advanced degree | 86.0 | 126 |
353 | lachlan | ebert | 5 | 15.0 | 1954 | M | Some college or associate degree | 42.0 | 150 |
158 | francesco | petito | 6 | 16.4 | 1972 | M | High school graduates, no college | 52.6 | 150 |
6 | phoebe | green | 8 | 8.0 | 1957 | F | High school graduates, no college | 32.0 | 128 |
315 | peter | chittleborough | 7 | 11.0 | 1980 | M | High school graduates, no college | 36.0 | 153 |
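To make the aggregation rules concrete, the sketch below builds an unweighted composite for the four linked records of the motivating example (first name, sex, and income only). The helpers vote and vote_string are hypothetical and use equal weights, so the result illustrates the rules described above rather than reproducing the output of represent().

```r
# Most frequent value; ties go to the first value in sorted order.
vote <- function(x) names(which.max(table(x)))

# Character-by-character majority vote over strings padded to a common width.
vote_string <- function(s) {
  width <- max(nchar(s))
  padded <- formatC(s, width = width, flag = "-")   # right-pad with spaces
  chars <- do.call(rbind, strsplit(padded, ""))     # one row per string
  trimws(paste(apply(chars, 2, vote), collapse = ""))
}

cluster <- data.frame(
  fname = c("elenys", "elen7", "dleny", "eleni"),
  sex = c("F", "F", "M", "F"),
  income = c(28, 52, 63, 52),
  stringsAsFactors = FALSE
)

composite <- data.frame(
  fname = vote_string(cluster$fname),   # character-wise vote
  sex = vote(cluster$sex),              # majority vote
  income = mean(cluster$income)         # arithmetic mean
)
```

The composite income (48.75) is a value that appears in none of the original records, which is why composite records can affect downstream tasks such as regression.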
Our fourth proposal to choose a representative record utilizes the minimax prototyping method in a fully Bayesian setting. This is desirable as the posterior distribution of the linkage is used to weight the downstream tasks, which allows the error from the record linkage task to be naturally propagated into the downstream task.
We propose two methods for utilizing the posterior prototyping (PP) weights: a weighted downstream task and a thresholded representative data set based on the weights. As already mentioned, PP weights naturally propagate the linkage error into the downstream task, which we now explain. For each MCMC iteration from the Bayesian record linkage model, we obtain the most representative records using minimax prototyping and then compute, for each record, the proportion of MCMC iterations in which it was selected. These PP probabilities can then either be used as weights for each record in the regression or as a thresholded variant where we only include records whose PP weights are above \(0.5\). Note that a record with PP weight above \(0.5\) has a posterior probability greater than \(0.5\) of being chosen as a prototype and should be included in the final data set.
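# Posterior prototyping weights
pp_weights <- pp_weights(rl_reg1, linkage.rl[seq(80000, 100000, by = 100), ],
                         "proto_minimax", distance = dist_col_type,
                         col_type = col_type, weights = weights, orders = orders,
                         scale = TRUE, parallel = FALSE)
# thresholded representative data set (cutoff 0.5);
# assumes pp_weights holds one weight per row of rl_reg1
rep_pp_thresh <- rl_reg1[pp_weights > 0.5, ]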
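The idea behind these weights can be sketched in base R with made-up prototype selections. The object protos_by_iter is hypothetical (record ids chosen as prototypes in each of four MCMC iterations for a five-record data set); the sketch only illustrates how selection frequencies become weights and is not the package's implementation.

```r
n_records <- 5

# Record ids selected as prototypes in each MCMC iteration (hypothetical).
protos_by_iter <- list(c(1, 3, 5), c(1, 4, 5), c(2, 3, 5), c(1, 3, 5))

# PP weight: share of iterations in which each record was selected.
counts <- tabulate(unlist(protos_by_iter), nbins = n_records)
toy_pp <- counts / length(protos_by_iter)

# Thresholded variant: keep records with weight above 0.5.
keep <- which(toy_pp > 0.5)
```

Records 1, 3, and 5 are selected in most iterations and survive the threshold, while records 2 and 4 receive weight 0.25 and are dropped.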
We can look at the minimax PP weights distribution for the true and duplicated records in the data set as an example. Note that the true records consistently have higher PP weights and the proportion of duplicated records with high weights is relatively low.
We can make a representative dataset with these weights by using the cutoff of \(0.5\), and look at some of the records.
fname | lname | bm | bd | by | sex | education | income | bp |
---|---|---|---|---|---|---|---|---|
jasmine | sirotic | 3 | 31 | 1972 | F | High school graduates, no college | 32 | 127 |
hugo | white | 6 | 9 | 1958 | M | Bachelor’s degree only | 72 | 134 |
madeline | burgemeifter | 12 | 9 | 1967 | F | Some college or associate degree | 30 | 130 |
kyle | clarke | 4 | 9 | 1952 | M | Advanced degree | 90 | 125 |
livia | braciak | 11 | 6 | 1950 | F | High school graduates, no college | 27 | 134 |
phoebe | green | 8 | 8 | 1957 | F | High school graduates, no college | 32 | 128 |
These four proposed methods each have potential benefits and drawbacks. The goal of prototyping is to select the correct representations of latent entities as often as possible; however, uniform random selection has no means to achieve this goal. Turning to minimax selection, if a distance function can accurately reflect the distance between pairs of records in the data set, then this method may perform well. Alternatively, composite records necessarily alter the data for all entities with multiple copies in the data, affecting some downstream tasks (like linear regression) heavily. The ability of posterior prototyping to propagate record linkage error to the downstream task is an attractive feature and a great strength of the Bayesian paradigm. In addition, the ability to use the entire posterior distribution of the linkage structure offers the potential for superior downstream performance.
We can evaluate the performance of our methods by assessing the distributional closeness of the representative dataset to the true records. The distributional closeness of the representative datasets to the true records is useful because one of the benefits of using a two-stage approach to record linkage and downstream analyses is the ability to perform multiple analyses with the same data set. As such, downstream performance of representative records may be dependent on the type of downstream task that is being performed. In order to assess the distributional closeness of the representative data sets to the truth, we use an empirical Kullback-Leibler (KL) divergence metric. Let \(\hat{F}_{rep}(\boldsymbol x)\) and \(\hat{F}_{true}(\boldsymbol x)\) be the empirical distribution functions for the representative data set and true data set, respectively (with continuous variables transformed to categorical using a histogram approach with statistically equivalent data-dependent bins). The empirical KL divergence metric we use is then defined as \[ \hat{D}_{KL}(\hat{F}_{rep} || \hat{F}_{true}) = \sum_{\boldsymbol x} \hat{F}_{rep}(\boldsymbol x) \log\left(\frac{\hat{F}_{rep}(\boldsymbol x)}{\hat{F}_{true}(\boldsymbol x)}\right). \]
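For intuition, the divergence can be computed by hand for a single categorical variable. The function emp_kl below is a hypothetical base R sketch; it does not handle joint distributions or the binning of continuous variables described above.

```r
# Empirical KL divergence between two samples of one categorical variable.
emp_kl <- function(rep_x, true_x) {
  lev <- union(rep_x, true_x)
  p <- table(factor(rep_x, levels = lev)) / length(rep_x)    # F_rep
  q <- table(factor(true_x, levels = lev)) / length(true_x)  # F_true
  nonzero <- p > 0   # terms with zero mass under F_rep contribute 0
  sum(p[nonzero] * log(p[nonzero] / q[nonzero]))
}

rep_sex <- c("F", "F", "M", "M")    # toy representative data
true_sex <- c("F", "F", "F", "M")   # toy true records
kl <- emp_kl(rep_sex, true_sex)
kl_same <- emp_kl(true_sex, true_sex)
```

The divergence is zero when the two empirical distributions coincide and grows as the representative data drift from the truth; in this sketch it becomes infinite if the representative data contain a value that never occurs in the true data.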
This metric is accessed in representr using the emp_kl_div() function.
true_dat <- rl_reg1[unique(identity.rl_reg1),] # true records
emp_kl_div(true_dat, rep_random, c("sex"), c("income", "bp"))
#> [1] 0.01532769
emp_kl_div(true_dat, rep_minimax, c("sex"), c("income", "bp"))
#> [1] 0.01084999
emp_kl_div(true_dat, rep_composite, c("sex"), c("income", "bp"))
#> [1] 0.05639306
emp_kl_div(true_dat, rep_pp_thresh, c("sex"), c("income", "bp"))
#> [1] 0.007088876
The representative dataset based on the posterior prototyping weights is the closest to the truth using the three variables we might be interested in using for regression. This might indicate that we should use this representation in a downstream model, like linear regression.
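The two ways of using PP weights in a downstream regression of bp on income and sex can be sketched with a toy data frame. The rows reuse values shown in the tables above, but the weights in toy_w are hypothetical and chosen only to illustrate the mechanics with lm(); no conclusions should be drawn from the fitted values.

```r
toy <- data.frame(
  bp = c(109, 118, 109, 109, 127, 134, 130, 125),
  income = c(28, 52, 63, 52, 32, 72, 30, 90),
  sex = c("F", "F", "M", "F", "F", "M", "F", "M")
)
toy_w <- c(0.1, 0.2, 0.1, 0.9, 1, 1, 1, 1)   # hypothetical PP weights

# Option 1: every record enters the regression, weighted by its PP weight.
fit_weighted <- lm(bp ~ income + sex, data = toy, weights = toy_w)

# Option 2: only records with PP weight above 0.5 are kept.
fit_thresh <- lm(bp ~ income + sex, data = toy[toy_w > 0.5, ])
```

The weighted fit keeps all eight rows but lets unlikely prototypes contribute little, whereas the thresholded fit uses only the five rows that pass the cutoff.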