Winsorization change

by Trond Pedersen

The winsorization confidentiality measure has been adjusted: the underlying data in the user’s workspace is no longer affected by it. Regression analyses can therefore be run without being influenced by winsorization.

Until now, numerical data has been winsorized on import into the user’s workspace, at each population delimitation (drop if/keep if), and when descriptive statistics are produced for sub-samples (e.g. summarize income if gender == "1").

This is one of the confidentiality measures, and is intended to prevent users from indirectly identifying people via extreme values. Income is an example of information where this can be a problem.

Winsorization in this context means that the highest 1% of values are censored and set to the lower limit of the last percentile, and the lowest 1% of values are set to the upper limit of the first percentile.
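The rule above can be sketched in a few lines of Python. This is only an illustration: the function name `winsorize`, the nearest-rank percentile rule, and the sample data are assumptions made for the example, not a description of how the service computes its limits.

```python
def winsorize(values, lower_pct=1, upper_pct=99):
    """Clamp values to the 1st and 99th percentile limits (nearest-rank rule, assumed)."""
    ordered = sorted(values)
    n = len(ordered)

    def limit(pct):
        # Nearest rank: ceil(pct * n / 100), done with integers to avoid rounding issues.
        rank = max(1, (pct * n + 99) // 100)
        return ordered[rank - 1]

    low, high = limit(lower_pct), limit(upper_pct)
    # Values below the lower limit are raised to it, values above the upper limit are lowered to it.
    return [min(max(v, low), high) for v in values]


# Hypothetical data: 199 ordinary values and one extreme value.
incomes = list(range(1, 200)) + [10_000]
capped = winsorize(incomes)
print(min(capped), max(capped))
```

The extreme value no longer stands out in `capped`, which is the confidentiality purpose of the measure.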

Impact on means, standard deviations and regressions

An undesirable effect of the way winsorization has worked so far is that numerical variables imported into the user’s workspace have been censored, which has affected all subsequent analyses and data management steps.

Statistical measures such as means and standard deviations will then report values that are somewhat lower than the actual ones.
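A small numerical illustration of this effect, using hypothetical right-skewed income data. The helper `winsorize` and its nearest-rank percentile rule are assumptions made for the sketch, not the service's actual implementation.

```python
import statistics


def winsorize(values, lower_pct=1, upper_pct=99):
    """Clamp values to the 1st and 99th percentile limits (nearest-rank rule, assumed)."""
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[max(1, (lower_pct * n + 99) // 100) - 1]
    high = ordered[max(1, (upper_pct * n + 99) // 100) - 1]
    return [min(max(v, low), high) for v in values]


# Hypothetical incomes: 198 ordinary values plus two very high ones.
incomes = [300_000 + 1_000 * i for i in range(198)] + [5_000_000, 9_000_000]
capped = winsorize(incomes)

raw_mean, capped_mean = statistics.mean(incomes), statistics.mean(capped)
raw_sd, capped_sd = statistics.pstdev(incomes), statistics.pstdev(capped)

print(f"mean:    actual={raw_mean:.0f}  winsorized={capped_mean:.0f}")
print(f"std dev: actual={raw_sd:.0f}  winsorized={capped_sd:.0f}")
```

With a long upper tail like this one, both the mean and the standard deviation of the winsorized values come out lower than the actual ones.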

Until now, regression estimates have also been affected, because the estimation was based on censored values. The degree of influence depends on how long the “tails” of the value distribution are for the relevant variables (i.e. to what extent extreme values occur).

To minimize the disadvantages of winsorization, only the visible and identifiable output of descriptive statistics now undergoes winsorization. The underlying data in the user’s workspace is no longer censored. Regression estimates are therefore no longer distorted by winsorization, as they are based on the actual data.
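The division of labour described above can be sketched as follows. All names here (`winsorize`, `summarize`, `ols_slope`) and the data are hypothetical, and the nearest-rank percentile rule is an assumption. The point is only that the descriptive output is computed on a censored copy, while the regression reads the untouched workspace data.

```python
import statistics


def winsorize(values, lower_pct=1, upper_pct=99):
    """Clamp values to the 1st and 99th percentile limits (nearest-rank rule, assumed)."""
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[max(1, (lower_pct * n + 99) // 100) - 1]
    high = ordered[max(1, (upper_pct * n + 99) // 100) - 1]
    return [min(max(v, low), high) for v in values]


def summarize(values):
    """Descriptive output: computed on a winsorized copy, the input list is not modified."""
    capped = winsorize(values)
    return {"mean": statistics.mean(capped), "sd": statistics.pstdev(capped)}


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Hypothetical workspace data with an exact linear relationship y = 3x.
x = list(range(200))
y = [3 * v for v in x]

summary = summarize(y)                       # visible output: winsorized
slope_actual = ols_slope(x, y)               # new behaviour: regression on actual data
slope_censored = ols_slope(x, winsorize(y))  # old behaviour: regression on censored data

print(summary)
print(slope_actual, slope_censored)
```

In this sketch the slope estimated from the censored values is pulled slightly below the true slope, while the estimate from the actual data recovers it.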

For descriptive statistics, the reported means and standard deviations will still be somewhat lower than the actual values for most numerical variables. This is intentional, and is regarded as necessary to maintain the right balance between confidentiality and sufficient flexibility in defining the analysis population.

Dummy variables and numeric multi-category variables

A common problem has been that imported dummy variables (numerical variables with the values 0 and 1) were also winsorized if one of the categories accounted for less than 1% of the values in your population. Since winsorization uses the neighboring percentile as the censoring value, every value of the dummy was then recoded to the same value, either 0 or 1.

When running regression analyses, this can create a problem in cases where winsorized dummy variables are included, whether imported directly or derived from imported variables, since variables with only one value are not accepted.

For numerical multi-category variables, too, there is a risk that the highest and/or lowest category has been merged with the neighboring category. It then looks as if the highest or lowest category has no observations in your dataset.
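Both effects are easy to reproduce with a small sketch. The helper `winsorize`, its nearest-rank percentile rule, and the two example variables are assumptions made for illustration only.

```python
def winsorize(values, lower_pct=1, upper_pct=99):
    """Clamp values to the 1st and 99th percentile limits (nearest-rank rule, assumed)."""
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[max(1, (lower_pct * n + 99) // 100) - 1]
    high = ordered[max(1, (upper_pct * n + 99) // 100) - 1]
    return [min(max(v, low), high) for v in values]


# Hypothetical dummy variable: only 1 of 200 observations (0.5%) has the value 1.
dummy = [0] * 199 + [1]
capped_dummy = winsorize(dummy)
print(sorted(set(capped_dummy)))

# Hypothetical five-category variable: category 5 holds 1 of 200 observations.
category = [1] * 50 + [2] * 50 + [3] * 50 + [4] * 49 + [5]
capped_category = winsorize(category)
print(sorted(set(capped_category)))
```

After winsorization the dummy is constant, which is why it could not be used as a regressor, and category 5 has been merged into category 4.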

After the change, dummy variables will still appear to be winsorized when you run descriptive statistics. However, this only applies to the visible statistics output. If the same variables are used in regression analyses, non-winsorized data is used as input.

Population delimitations

Until now, numerical data has been winsorized not only on import, but also each time a population delimitation is made. So if you have run many drop if or keep if commands, your data has been winsorized a corresponding number of times.
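A minimal sketch of this compounding effect, assuming a hypothetical `winsorize` helper with a nearest-rank percentile rule and a list comprehension standing in for a keep if condition. It imitates the previous behaviour only in outline.

```python
def winsorize(values, lower_pct=1, upper_pct=99):
    """Clamp values to the 1st and 99th percentile limits (nearest-rank rule, assumed)."""
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[max(1, (lower_pct * n + 99) // 100) - 1]
    high = ordered[max(1, (upper_pct * n + 99) // 100) - 1]
    return [min(max(v, low), high) for v in values]


# Hypothetical variable with the values 1..1000.
values = list(range(1, 1001))

# Previous behaviour, step 1: winsorized on import.
imported = winsorize(values)

# Previous behaviour, step 2: a delimitation (here: keep if 200 < value < 800)
# followed by another round of winsorization on the remaining population.
kept = [v for v in imported if 200 < v < 800]
delimited = winsorize(kept)

print(min(kept), max(kept))
print(min(delimited), max(delimited))
```

The remaining population loses its own lowest and highest values again, even though none of them were extreme in the original data.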

This problem is now eliminated, since winsorization is only carried out when descriptive statistics are generated.