R: Differentially Private Projected Histograms

pph	R Documentation

Differentially Private Projected Histograms

Description

The pph package provides a way to create training data for classifiers in a differentially private manner.

pdata is a differentially private procedure that first projects data onto a subset of dimensions conditioned on a particular target dimension, and subsequently creates data by rounding a noisy truncated histogram of the projected data.

defk is used to estimate the number of covariate dimensions needed.

gamma is used to distribute the privacy budget between the two tasks of projecting and generating a noisy histogram.

kmda is used to compute a size k projection onto predictor columns for a target attribute in a differentially private manner.

to01 is a convenience function that scales numeric columns of a data frame into the unit interval.

Usage

pdata(data, target = ncol(data), eps = 1, A = getOption('pph.A', 0.5),
      k = defk(data, target = target), gamma. = gamma(data, k = k),
      jitter = NULL, histogram.out=FALSE, verbose=FALSE)
defk(data, tau=0.1, target=ncol(data))
gamma(data, p=0.95, tau=0.05, B=getOption('pph.B', 0.5), k = NULL)
kmda(object, ...)
to01(data, warn=FALSE)

## S3 method for class 'numeric'
kmda(object, m, k, epsilon=1, verbose=FALSE)
## S3 method for class 'data.frame'
kmda(object, target=ncol(object), ...)
## S3 method for class 'matrix'
kmda(object, target=ncol(object), ...)
## S3 method for class 'formula'
kmda(object, data=NULL, ...)

Arguments

`data`	A data frame containing data intended for building a classifier for a target attribute. In `kmda.formula` it is the data frame for which we seek a set of size `k` of predictor columns.
`target`	The index of the target attribute in `data` (or `object` in `kmda.data.frame` and `kmda.matrix`).
`eps`	The privacy level.
`epsilon`	The privacy level for `kmda`. If this is set to `NA`, `kmda` will run in non-private mode.
`A`	A tuning parameter used to filter the histogram by removing entries that are less than or equal to `A*log(nrow(data))/eps`.
`k`	The number of predictor (covariate, independent) dimensions. If `NULL` in `gamma` then `defk` will be called to provide a value.
`tau`	`1 - tau` is the fraction of target pairs discerned in the estimation of the needed `k` in `defk`.
`p`	the target probability that an original data point makes it into the truncated histogram.
`B`	a tuning parameter for the importance of projection in the computations in `gamma`.
`gamma.`	Distributes the privacy budget `eps` between computing which `k` dimensions to project onto, and building the noisy histogram. The former gets `(1 - gamma.) * eps`, while the latter gets `gamma. * eps`.
`jitter`	If non-NULL, this adds small numeric noise to numeric values to smooth out the effects of discretization by calling `base::jitter` with a factor argument taking the value of `jitter`. The noise added is uniform in the range [-a, a] where `a = jitter * d/5`, where `d` is the smallest difference between two different discretized values.
`histogram.out`	if `TRUE`, `pdata` returns a `hash` object representing the noisy truncated histogram. If `FALSE`, a data frame created from the histogram is returned.
`verbose`	print information about progress and parameters if `TRUE`.
`object`	in `kmda.numeric` this is the index of the target column in `m`. In `kmda.matrix` and `kmda.data.frame` it is a matrix or data.frame respectively. In `kmda.formula` it is a formula that describes the columns among which predictor columns for the target attribute should be sought.
`m`	a matrix containing the attribute columns, of which `object` is the index of the target column for which we seek `k` predictor columns.
`...`	Parameters to be passed to `kmda.numeric` from the other `kmda` methods.
`warn`	warn if columns were scaled.
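
To make the arithmetic behind `A`, `gamma.` and `jitter` concrete, the following base R sketch evaluates the formulas quoted above for hypothetical values. It does not call the package. The values of `n`, `eps`, `gamma.`, `jitter_factor` and `d` are illustrative assumptions; in particular, the `gamma.` value is made up rather than computed by `gamma`.

```r
# Hypothetical inputs; none of these values are produced by the package.
n      <- 150    # number of rows, i.e. nrow(data)
eps    <- 1      # total privacy budget
A      <- 0.5    # the documented default of getOption('pph.A', 0.5)
gamma. <- 0.8    # assumed split; pdata would normally obtain it from gamma()

# Histogram entries less than or equal to this threshold are removed.
threshold <- A * log(n) / eps

# Budget split as documented for `gamma.`.
eps_projection <- (1 - gamma.) * eps   # spent on choosing the k dimensions
eps_histogram  <- gamma. * eps         # spent on the noisy histogram

# Jitter amplitude a = jitter * d / 5, where d is the smallest difference
# between two distinct discretized values (d = 0.1 is an assumed value).
jitter_factor <- 1
d <- 0.1
a <- jitter_factor * d / 5

print(c(threshold = threshold, eps_projection = eps_projection,
        eps_histogram = eps_histogram, jitter_amplitude = a))
```

Note that the two budget shares always sum to `eps`, and that a larger `eps` lowers the truncation threshold.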

Details

Columns in the data frame data are expected to be either numeric or factors. A numeric column in data must only contain values from the unit interval in order for pdata to be differentially private. For convenience, pdata will scale numeric columns that fall outside the unit interval into the unit interval and issue a warning. Numeric data is discretized into bins of equal width. This width is computed as a function of the size and dimensionality of the data. A minimum width can be set by setting option pph.minbw using options.
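
As a rough illustration of the preprocessing described above, the sketch below rescales a numeric column to the unit interval and discretizes it into equal-width bins using base R only. Min-max scaling and the bin width `bw = 0.25` are assumptions made for illustration; the package derives its width from the size and dimensionality of the data, and that rule is not reproduced here.

```r
x <- iris$Sepal.Length

# Min-max scaling into [0, 1] (assumed to be comparable to what to01 does).
x01 <- (x - min(x)) / (max(x) - min(x))

# Equal-width bins over [0, 1]; bw is a hypothetical width.
bw     <- 0.25
breaks <- seq(0, 1, by = bw)
bins   <- cut(x01, breaks = breaks, include.lowest = TRUE)

table(bins)

# A lower bound on the bin width can be requested through the documented
# option; the value 0.1 is only an example.
old <- options(pph.minbw = 0.1)
getOption("pph.minbw")
options(old)   # restore the previous setting
```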

Value

pdata returns a data frame, or a hash object representing the noisy truncated histogram if histogram.out is TRUE.

kmda.formula returns a formula representation of the projection. The other kmda methods return a list containing two items: S, a vector of column indices into object or m of the computed predictors, and lo, a measure of pairs not discerned.
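
The list returned by the non-formula kmda methods can be used to subset the original data. The sketch below uses a hand-made stand-in for such a result, so the values of `S` and `lo` are invented for illustration and were not produced by `kmda`.

```r
set.seed(1)
m <- data.frame(matrix(sample(0:1, 100, replace = TRUE), ncol = 5))

# Stand-in for the result of kmda(m, target = 5, k = 3); values are made up.
pr <- list(S = c(1, 3, 4), lo = 0.1)

target <- 5
# Keep the selected predictor columns together with the target column.
projected <- m[, c(pr$S, target), drop = FALSE]
str(projected)

# A formula equivalent to the selection, built with base R's reformulate().
f <- reformulate(names(m)[pr$S], response = names(m)[target])
print(f)
```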

Warning

pdata is only differentially private if numeric data is constrained to the unit interval.

In kmda, the lo component is not produced in a differentially private manner.

The code currently applies exhaustive exploration of possible histogram entries as opposed to the more efficient sampling method presented in the reference on which the method is based.
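
For intuition, the exhaustive approach mentioned above can be pictured as enumerating every possible histogram cell, including empty ones, and perturbing each count. The following base R sketch does this for two small factor columns. It is not the package's code: the helper `rlaplace`, the toy data, and the noise scale `2 / eps` (which assumes neighbouring data sets differ by replacing one row) are all assumptions made for illustration, and the projection step is omitted.

```r
set.seed(42)

# Draw n Laplace(0, scale) variates as the difference of two exponentials.
rlaplace <- function(n, scale) rexp(n, 1 / scale) - rexp(n, 1 / scale)

# Toy data with two factor columns (hypothetical).
d <- data.frame(
  a = factor(sample(c("x", "y", "z"), 200, replace = TRUE)),
  b = factor(sample(c("u", "v"), 200, replace = TRUE))
)

eps <- 1
A   <- 0.5

# Exhaustive enumeration: table() on factors yields a count for every
# combination of levels, including combinations that never occur.
cells <- as.data.frame(table(d), responseName = "count")

# Perturb every cell, then drop cells at or below the threshold and round.
cells$noisy <- cells$count + rlaplace(nrow(cells), scale = 2 / eps)
threshold   <- A * log(nrow(d)) / eps
kept        <- cells[cells$noisy > threshold, ]
kept$n      <- round(kept$noisy)

# Expand the truncated, rounded histogram back into rows.
synthetic <- kept[rep(seq_len(nrow(kept)), kept$n), c("a", "b")]
rownames(synthetic) <- NULL
head(synthetic)
```

The number of cells is the product of the number of levels per column, which is why exhaustive enumeration becomes expensive as dimensions are added.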

Note

For now, this package can be installed by issuing install.packages('pph', repos='http://laats.github.io/sw/R').

This implementation was in part supported by NIH NLM grant 7R01LM007273-07 and NIH Roadmap for Medical Research grant U54 HL108460.

Author(s)

Staal A. Vinterbo <sav@ucsd.edu>

References

S. A. Vinterbo. Differentially Private Projected Histograms: Construction and Use for Prediction. Proc. ECML-PKDD 2012, to appear.

Examples

  data(iris)
  # scale numeric covariates into the unit interval
  iris <- to01(iris)

  # Differentially private logistic regression:
  model <- glm(I(Species == 'virginica') ~ ., binomial, pdata(iris))
  summary(model)
  p <- predict(model, iris, type='response')

  ## show results:
  boxplot(p ~ s, data.frame(p=p, s=iris$Species), ylab='P(virginica|x)',
          xlab='Actual Class')

  # Differentially private multinomial logistic regression
  # (not run due to nnet dependency)
  ## Not run: 
    library(nnet) # load multinom
    data(iris)
    iris <- to01(iris)
    model <- multinom(Species ~ ., data=pdata(iris))
    p <- predict(model, iris)
  
## End(Not run)
  # compute a projection
  m <- data.frame(matrix(sample(0:1, 100, replace=TRUE), ncol=5))
  pr <- kmda(m, k=3)
  pr <- kmda(m, target=5, k=3)
  pr <- kmda(X5 ~ ., m, k=3)
  # the above are all equivalent, except that the last one
  # returns a formula instead of a list.