In this vignette, we discuss the robust Horvitz-Thompson (RHT) estimator of Hulliger (1995, 1999). The vignette is organized as follows.
First, we load the package.
> library(robsurvey, quietly = TRUE)The workplace sample consists of payroll data on \(n=142\) workplaces or business establishments (with paid employees) in the retail sector of a Canadian province.
workplace data are not those collected by Statistics Canada but have been generated by Fuller (2009, Example 3.1.1, Table 6.3).The original weights of WES were about 2200 for the stratum of small workplaces, about 750 for medium-sized, and about 35 for large workspaces. Several strata containing very large workplaces were sampled exhaustively; see Patak et al (1998).
> attach(workplace)The variable of interest is payroll, and the goal is to estimate the population total of payroll in the retail sector.
> head(workplace, 3)
ID weight employment payroll strat fpc
1 1 786 17 260000 2 10718
2 2 32 661 6873000 1 3432
3 3 36 3 366000 1 3432For the survey methods (not bare-bone methods), we must load the survey package (Lumley, 2010, 2021)
> library(survey)and specify a survey.design object
> dn <- svydesign(ids = ~ID, strata = ~strat, fpc = ~fpc, weights = ~weight, data = workplace)which appears (on print) with the following output in the console
Stratified Independent Sampling design
svydesign(ids = ~ID, strata = ~strat, fpc = ~fpc, weights = ~weight,
data = workplace)
To get a first impression of the distribution of payroll, we examine two (design-weighted) boxplots of payroll (on raw and logarithmic scale). The boxplots are obtained using function survey::svyboxplot.
From the boxplot with payroll on raw scale, we recognise that the sample distribution of payroll is very skewed to the right; the boxplot on logarithmic scale demonstrates that log-transform is not sufficient to turn the skewed distribution into a symmetric distribution, and shows some outliers. The outliers need not be errors. Following Chambers (1986), we distinguish representative outliers from non-representative outliers (\(\rightarrow\) see vignette “Basic Robust Estimators” for an introduction to the notion of non-/ representative outliers).
The outliers visible in the boxplot refer to a few but rather large workplaces. Moreover, we assume that these outliers represent other workplaces in the population that are similar in value (i.e., representative outliers).
The following bare-bone estimating methods are available
weighted_mean_huber()weighted_total_huber()weighted_mean_tukey()weighted_total_tukey()The functions with postfix _tukey are \(M\)-estimators with the Tukey biweight \(\psi\)-function. The Huber RHT \(M\)-estimator of the payroll total can be computed with
> weighted_total_huber(payroll, weight, k = 8, type = "rht")
[1] 15587090084Note that we must specify type = "rht" for the RHT [the case type = "rhj" is discussed in the vignette “Basic Robust Estimators”]. Here, we have chosen the robustness tuning constant \(k = 8\).
The following survey method are available
svymean_huber()svytotal_huber()svymean_tukey()svytotal_tukey()The survey method of the RHT (and its standard error) are computed as follows (type = "rht")
> m <- svytotal_huber(~payroll, dn, k = 8, type = "rht")
> m
total SE
payroll 1.559e+10 1.198e+09The summary() method summarizes the most important facts about the estimate.
> summary(m)
Huber M-estimator (type = rht) of the sample total
total SE
payroll 1.559e+10 1.198e+09
Robustness:
Psi-function: with k = 8
mean of robustness weights: 0.9917
Algorithm performance:
converged in 4 iterations
with residual scale (weighted MAD): 89474
Sampling design:
Stratified Independent Sampling design
svydesign(ids = ~ID, strata = ~strat, fpc = ~fpc, weights = ~weight,
data = workplace)The estimated location, variance, and standard error can be extracted from object m with the following commands.
> coef(m)
payroll
15587090084
> vcov(m)
Variance
payroll 1.434857e+18
> SE(m)
[1] 1197855270For \(M\)-estimators, the estimated scale (weighted MAD) can be extracted with the scale() function.
> scale(m)
[1] 89474.01Additional utility functions
residuals() to extract the residualsfitted() to extract the fitted values under the model in userobweights() to extract the robustness weightsIn the following figure, the robustness weights of object m are plotted against the residuals. The Huber RHT \(M\)-estimator downweights observations at both ends of the residuals’ distribution.
> plot(residuals(m), robweights(m))An adaptive \(M\)-estimator of the total (or mean) is defined by letting the data chose the tuning constant \(k\). This approach is available for the RHT estimator \(\rightarrow\) see vignette “Basic Robust Estimators”, Chap. 5.3 on \(M\)-estimators.
Chambers, R. (1986). Outlier Robust Finite Population Estimation. Journal of the American Statistical Association 81, pp. 1063–1069.
Fuller, W.A. (2009). Sampling Statistics, Jown Wiley & Sons, Hoboken (NJ).
Hulliger B (2005). Horvitz-Thompson Estimators, Robustified, in: S. Kotz (ed.), Encyclopedia of Statistical Sciences, volume 5, 2nd edition. John Wiley and Sons, Hoboken (NJ).
Hulliger, B (1999). Simple and robust estimators for sampling, in: Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 54–63. American Statistical Association.
Hulliger. B. (1995). Outlier Robust Horvitz–Thompson Estimators. Survey Methodology 21, pp. 79–87.
Lumley, T (2021). survey: analysis of complex survey samples. R package version 4.0, URL https://CRAN.R-project.org/package=survey.
Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R: A Guide to Analysis Using R. John Wiley and Sons, Hoboken, NJ.