pph | R Documentation |
The pph package provides a way to create training data for
classifiers in a differentially private manner.
pdata is a differentially private procedure that first projects
data onto a subset of dimensions conditioned on a particular target
dimension, and subsequently creates data by rounding a noisy
truncated histogram of the projected data.
defk is used to estimate the number of covariate dimensions
needed.
gamma is used to distribute the privacy budget between the
two tasks of projecting and generating a noisy histogram.
kmda is used to compute a size k projection onto
predictor columns for a target attribute in a differentially private
manner.
to01 is a convenience function that scales numeric columns of a
data frame into the unit interval.
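The description of pdata above mentions rounding a noisy truncated histogram. The following base R sketch illustrates that general idea only. It is not the pph implementation: the helper names rlaplace and noisy_histogram, the use of Laplace noise with scale 1/eps, and the threshold argument are all assumptions made for illustration, and the projection step is left out entirely.

```r
# Illustrative sketch only; not the code used by pph.
# rlaplace() and noisy_histogram() are hypothetical helper names.

# Draw n samples from a Laplace(0, scale) distribution via the inverse CDF.
rlaplace <- function(n, scale) {
  u <- runif(n, min = -0.5, max = 0.5)
  -scale * sign(u) * log(1 - 2 * abs(u))
}

noisy_histogram <- function(df, eps = 1, threshold = 0.5) {
  # Count every combination of factor levels, including empty cells.
  counts <- as.data.frame(table(df), responseName = "count")
  # Add Laplace noise to each cell (the scale 1 / eps is an assumption).
  noisy <- counts$count + rlaplace(nrow(counts), scale = 1 / eps)
  # Truncate: cells at or below the threshold are dropped.
  noisy[noisy <= threshold] <- 0
  # Round to whole, non-negative counts and expand cells back into rows.
  n <- pmax(0, round(noisy))
  out <- counts[rep(seq_len(nrow(counts)), times = n), names(df), drop = FALSE]
  rownames(out) <- NULL
  out
}

set.seed(1)
toy <- data.frame(a = factor(sample(c("x", "y"), 50, replace = TRUE)),
                  b = factor(sample(c("u", "v"), 50, replace = TRUE)))
synthetic <- noisy_histogram(toy, eps = 1)
head(synthetic)
```

The returned data frame has the same columns as the input, but its rows are generated from the noisy counts rather than copied from the original records.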
pdata(data, target = ncol(data), eps = 1, A = getOption('pph.A', 0.5),
      k = defk(data, target = target), gamma. = gamma(data, k = k),
      jitter = NULL, histogram.out=FALSE, verbose=FALSE)

defk(data, tau=0.1, target=ncol(data))

gamma(data, p=0.95, tau=0.05, B=getOption('pph.B', 0.5), k = NULL)

kmda(object, ...)

to01(data, warn=FALSE)

## S3 method for class 'numeric'
kmda(object, m, k, epsilon=1, verbose=FALSE)

## S3 method for class 'data.frame'
kmda(object, target=ncol(object), ...)

## S3 method for class 'matrix'
kmda(object, target=ncol(object), ...)

## S3 method for class 'formula'
kmda(object, data=NULL, ...)
data |
A data frame containing data intended for building a classifier
for a target attribute. In |
target |
The index of the target attribute in |
eps |
The privacy level. |
epsilon |
The privacy level for |
A |
A tuning parameter used to filter the histogram by removing
entries that are equal to or less |
k |
The number of predictor (covariate, independent) dimensions. If
|
tau |
|
p |
The target probability that an original data point makes it into the truncated histogram. |
B |
a tuning parameter for the importance of projection in
the computations in |
gamma. |
Distributes the privacy budget |
jitter |
If non-NULL, this adds small numeric noise to numeric values to
smooth out the effects of discretization by calling
|
histogram.out |
if |
verbose |
print information about progress and parameters if |
object |
in |
m |
a matrix containing the attribute columns out of which |
... |
Parameters to be passed to |
warn |
warn if columns were scaled. |
Columns in the data frame data are expected to be either numeric or
factors. A numeric column in data must only contain values from the
unit interval in order for pdata to be differentially private. For
convenience, pdata will scale numeric columns that fall outside the
unit interval into the unit interval and issue a warning. Numeric data
is discretized into bins of equal width. This width is computed as a
function of the size and dimensionality of the data. A minimum width
can be set by setting the option pph.minbw using options.
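The scaling and discretization described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's code: scale_unit and bin_unit are hypothetical helpers, min-max scaling is assumed as the way to map a column into the unit interval, and the bin width of 0.25 is arbitrary (pph computes its own width from the size and dimensionality of the data). The value passed to options at the end is likewise only an example.

```r
# Hypothetical helpers for illustration; not part of pph.

# Min-max scale the numeric columns of a data frame into [0, 1].
scale_unit <- function(df) {
  for (nm in names(df)) {
    x <- df[[nm]]
    if (is.numeric(x)) {
      r <- range(x, na.rm = TRUE)
      # A constant column has zero range; map it to 0 instead of dividing by 0.
      df[[nm]] <- if (r[1] == r[2]) x - r[1] else (x - r[1]) / (r[2] - r[1])
    }
  }
  df
}

# Discretize values in [0, 1] into bins of the given width, as a factor.
# If width does not divide 1 evenly, the last bin is narrower.
bin_unit <- function(x, width = 0.25) {
  breaks <- seq(0, 1, by = width)
  if (tail(breaks, 1) < 1) breaks <- c(breaks, 1)
  cut(x, breaks = breaks, include.lowest = TRUE)
}

data(iris)
scaled <- scale_unit(iris)
binned <- bin_unit(scaled$Sepal.Length)
table(binned)

# Setting a minimum bin width for pph (0.1 is an arbitrary example value).
options(pph.minbw = 0.1)
```

Factor columns such as Species pass through scale_unit unchanged, which matches the expectation that columns are either numeric or factors.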
pdata returns a data frame.
kmda.formula returns a formula representation of the projection. The
other kmda methods return a list containing two items: S, a vector of
column indices into object or m of the computed predictors, and lo, a
measure of pairs not discerned.
pdata is only differentially private if numeric data is
constrained to the unit interval.
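One way to guard against violating this condition is to verify the data before handing it to pdata. The check below is a sketch in base R; in_unit_interval is a hypothetical helper name, not a function provided by pph.

```r
# Hypothetical helper; not part of pph.
# Returns TRUE if every numeric column of df lies within [0, 1].
# Non-numeric columns (for example factors) are ignored.
in_unit_interval <- function(df) {
  num <- vapply(df, is.numeric, logical(1))
  all(vapply(df[num],
             function(x) all(x >= 0 & x <= 1, na.rm = TRUE),
             logical(1)))
}

data(iris)
raw_ok <- in_unit_interval(iris)  # raw iris measurements exceed 1
toy_ok <- in_unit_interval(data.frame(a = c(0, 0.5, 1),
                                      b = factor(c("u", "v", "u"))))
```

A data frame that fails this check could be passed through to01 first, as described above.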
In kmda, the lo component is not produced in a
differentially private manner.
The code currently applies exhaustive exploration of possible histogram entries as opposed to the more efficient sampling method presented in the reference on which the method is based.
For now, this package can be installed by issuing:
install.packages('pph', repos='http://laats.github.io/sw/R')
This implementation was in part supported by NIH NLM grant 7R01LM007273-07 and NIH Roadmap for Medical Research grant U54 HL108460.
Staal A. Vinterbo <sav@ucsd.edu>
S. A. Vinterbo. Differentially Private Projected Histograms: Construction and Use for Prediction. Proc. ECML-PKDD 2012, to appear.
See also hash, options.
data(iris)
# scale numeric covariates into the unit interval
iris <- to01(iris)

# Differentially private logistic regression:
model <- glm(I(Species == 'virginica') ~ ., binomial, pdata(iris))
summary(model)
p <- predict(model, iris, type='response')
## show results:
boxplot(p ~ s, data.frame(p=p, s=iris$Species),
        ylab='P(virginica|x)', xlab='Actual Class')

# Differentially private multinomial logistic regression
# (not run due to nnet dependency)
## Not run:
library(nnet) # load multinom
data(iris)
iris <- to01(iris)
model <- multinom(Species ~ ., data=pdata(iris))
p <- predict(model, iris)
## End(Not run)

# compute a projection
m <- data.frame(matrix(sample(0:1, 100, replace=TRUE), ncol=5))
pr <- kmda(m, k=3)
pr <- kmda(m, target=5, k=3)
pr <- kmda(X5 ~ ., m, k=3)
# the above are all equivalent, except that the last one
# returns a formula instead of a list.