A Note on Choosing Hyperparameters
Source:vignettes/A-Note-on-Choosing-Hyperparameters.Rmd
A-Note-on-Choosing-Hyperparameters.Rmd
This document outlines guidelines for choosing a spectrum of
hyperparameters. We first summarize the interpretation of the three
hyper-parameters and the impact of their values on the estimation to
motivate the various choices of the hyper-parameter grid we propose. The
scaling parameters \(\alpha\) and \(\beta\) determine how much information the
estimation will draw from the two coordinates, GPS score, and exposure
level. A large scaling parameter suggests that varying the corresponding
coordinate is only associated with a small change in the outcome; that
is, this coordinate does not contribute too much to the variation of the
outcome. The signal-to-noise ratio parameter \(\gamma/\sigma\) (g_sigma
in
the code) encodes how different the observed data is from pure noise. A
large \(\gamma/\sigma\) indicates
strong associations between the outcome and the coordinates of the GP,
while small \(\gamma/\sigma\) suggests
the observed outcome is likely to be drawn from a random process that is
independent of the coordinates.
In the setting of observational studies and under the no unobserved confounding assumption (note that this is the specific setting for which GPCERF is expressly designed), both the exposure level and the GPS score encode important information for the estimation of CERF; as a result, the range of their scaling parameters should be comparable, and the covariate balance will determine which coordinate is more important. The range should also cover both ends of the importance, from extremely important to nearly irrelevant. We choose to achieve this by considering the range on the log-10 scale with equally spaced candidate values. The specification of a range of \(\gamma/\sigma\) also follows the same rationale when the prior belief about the strength of the causal effect of the exposure is weak. Here is an example grid that covers both ends.
params_lst <- list(alpha = 10 ^ seq(-2, 2, length.out = 10),
beta = 10 ^ seq(-2, 2, length.out = 10),
g_sigma = c(0.1, 1, 10),
tune_app = "all")
In some instances, the observed data could be derived from randomized trials in which the exposure is entirely independent of any existing covariates. In this case, the estimated GPS score is, in fact, a noisy version of the exposure level (since the mean of the exposure level is not a function of the covariates), and it does not provide any additional information beyond the exposure itself. In this case, we have a strong prior belief that the importance of the GPS score is low, and thus, a large value of \(\alpha\) can be specified without any additional tuning. The range of \(\beta\) should focus on smaller values to capture the information carried by the exposure. The specification of \(\gamma/\sigma\) should rely on some qualitative characterization, such as simple visualization, of the relationship between the outcome and the exposure. The reason is that under randomized design, the CERF can be estimated by regressing the outcome on the exposure directly with a flexible model. If the scatterplot of the outcome vs. the exposure exhibits a clear trend, we should specify a large \(\gamma/\sigma\) to reflect the prior belief of a strong causal effect of the exposure. In reality, we may not know whether the study is a randomized trial. We can then compute the absolute correlations between the exposure and each of the covariates to detect potential deviation from randomization. When all the absolute correlation is smaller than a pre-specified threshold, for instance, 0.05, we can employ the specification for a hyper-parameter grid as described above for randomized trials.