In the process of estimating non-parametric exposure response
functions, we employ two local polynomial regression methodologies:
locpol
and KernSmooth
. The
KernSmooth
package, however, does not support weighted
data. Consequently, we leverage the counter_weight
column
from the pseudo population to replicate our data.
For the matching
approach, the procedure is quite
straightforward. Given that the counter_weight
column
values are integers, we can directly replicate the data. Conversely,
when dealing with the weighting
approach, certain
approximations become necessary. Our strategy involves rounding the
weights to 2 (or 3) decimal places, multiplying the resultant weights by
100 (or 1000), converting these figures into integer values, and finally
replicating the data.
This document aims to evaluate the degree of difference and sensitivity in the results in relation to the chosen multiplication factor. It’s important to remember, however, that opting for larger multiplication values may lead to significant memory consumption, posing computational challenges.
library(KernSmooth)
#> KernSmooth 2.23 loaded
#> Copyright M. P. Wand 1997-2009
library(locpol)
library(ggplot2)
library(gridExtra)
compare_smooth_methods <- function(multiplication_factor) {
# create some sample data
set.seed(123)
n <- 200
x <- runif(n)
y <- sin(2*pi*x) + rnorm(n, 0, 0.3)
w <- runif(n)
count <- as.integer(floor(w * multiplication_factor))
data <- data.frame(x = x, y = y, weight = w, count = count)
data_rep <- data[rep(1:nrow(data), data$count), , drop = FALSE]
rownames(data_rep) <- NULL
bw_values <- c(0.0005, 0.001, 0.01, 0.1, 1)
plot_list <- list()
for(i in 1:length(bw_values)) {
bw <- bw_values[i]
smoothed_val <- stats::approx(KernSmooth::locpoly(data_rep$x,
data_rep$y,
bandwidth = bw,
gridsize=1000),
xout=data$x,
rule=2)$y
val <- locpol::locpol(formula = y ~ x,
data = data,
bw = bw,
weig = data$weight,
xeval = data$x,
kernel = locpol::gaussK)
kernsmooth_val <- data.frame(x = data$x, kernsmooth = smoothed_val)
locpol_val <- data.frame(x = val$lpFit$x, locpol = val$lpFit$y)
merged_df <- merge(kernsmooth_val, locpol_val, by = "x")
merged_df$difference <- abs(merged_df$kernsmooth - merged_df$locpol)
g1 <- ggplot(data = data) +
geom_point(aes(x = x, y = y)) +
geom_line(data = merged_df, aes(x = x, y = kernsmooth,
color = "KernSmooth"), size = 0.5) +
geom_line(data = merged_df, aes(x = x, y = locpol, color = "Locpol"),
size = 0.5, linetype = "dashed") +
labs(title = paste("locpol vs Kernsmooth, bw =", bw),
color = "Method") +
scale_color_manual(values = c("KernSmooth" = "blue", "Locpol" = "red")) +
theme(legend.position = "none")
g2 <- ggplot(data = merged_df, aes(x = x, y = difference)) +
geom_line(color = "darkgreen", size = 0.5) +
labs(title = paste("Difference, bw =", bw), y = "KernSmooth - Locpol") +
theme(legend.position = "none")
plot_list[[2*i-1]] <- g1
plot_list[[2*i]] <- g2
print(paste("Max residual for bw = ", bw, ": ", max(merged_df$difference)))
}
return(plot_list)
}
In the subsequent visualizations, the dashed red line denotes
locpol
, while the solid blue line stands for
kernsmooth
.
Multiplication factor: 100
plot_list_100 <- compare_smooth_methods(100)
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#> ℹ Please use `linewidth` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> [1] "Max residual for bw = 5e-04 : 0.376724232844119"
#> [1] "Max residual for bw = 0.001 : 0.34102886655536"
#> [1] "Max residual for bw = 0.01 : 0.00885525713034725"
#> [1] "Max residual for bw = 0.1 : 0.0015642091493715"
#> [1] "Max residual for bw = 1 : 0.000887744632188214"
grid.arrange(grobs = plot_list_100, ncol = 2)
Multiplication factor: 1000
plot_list_1000 <- compare_smooth_methods(1000)
#> [1] "Max residual for bw = 5e-04 : 0.37053801414259"
#> [1] "Max residual for bw = 0.001 : 0.340214416795152"
#> [1] "Max residual for bw = 0.01 : 0.001086480021527"
#> [1] "Max residual for bw = 0.1 : 0.00108884634631823"
#> [1] "Max residual for bw = 1 : 9.28881533373005e-05"
grid.arrange(grobs = plot_list_1000, ncol = 2)
The visualizations clearly illustrate that the significance of the multiplication factor diminishes as the bandwidth value escalates. Therefore, within our package, we’ve opted to utilize a multiplication factor of 100. We caution that employing a multiplication factor of 1000 could potentially trigger memory overflow errors when handling medium to large datasets. Should you require control over the multiplication factor, we encourage you to open an issue.