library(tidyverse)
library(quantreg)
The observed times (in hours) to perform variant calling using HPexome on a high performance system managed via Sun Grid Engine (SGE) are shown below. The compute node used for this work has 48 CPUs and 78 GB of RAM.
timings <- read_csv("hpexome_timings.csv") %>%
mutate(elapsed = (end - start) / 3600) %>%
select(scatter_count, count, elapsed)
## Parsed with column specification:
## cols(
## scatter_count = col_double(),
## count = col_double(),
## start = col_double(),
## end = col_double()
## )
timings %>%
pivot_wider(names_from = scatter_count, values_from = elapsed) %>%
select(-count)
1 | 2 | 4 | 8 | 16 | 32 |
---|---|---|---|---|---|
26.77028 | 16.06222 | 10.07861 | 7.153333 | 5.186667 | 5.337222 |
26.72889 | 16.01194 | 10.14500 | 6.978333 | 5.345833 | 5.371667 |
26.78694 | 15.99500 | 10.12056 | 6.845278 | 5.287500 | 5.378889 |
26.57833 | 16.17028 | 10.04528 | 6.986944 | 5.203611 | 5.029444 |
26.75333 | 16.17028 | 10.18667 | 6.837222 | 5.462222 | 5.079444 |
The boxplot below shows the gain in performance obtained through parallelization of the variant calling algorithm. One can easily note the dramatic decrease in time.
ggplot(timings, aes(factor(scatter_count), elapsed)) +
geom_boxplot() +
labs(x='Parallel Processing Units', y='Elapsed Time (hours)') +
theme_bw(base_size = 11)
ggsave("elapsed_time.png", dpi = 600)
## Saving 7 x 5 in image
We observed that the gain in time is not linear on the number of processing units. For this reason, we transform both variables (number of parallel processing units and time) to the logarithmic scale (base 2), as the Figure below shows. This strategy brings the relationship between both variables closer to linearity, allowing the use of advanced statistical methods for assessment of gains in performance.
ggplot(timings, aes(log2(scatter_count), log2(elapsed))) +
geom_point() +
geom_smooth(method='loess', color='black') +
labs(x='log2(Parallel Processing Units)', y='log2(hours)') +
theme_bw(base_size = 11)
ggsave("time_smooth.png", dpi = 600)
## Saving 7 x 5 in image
Below, we perform a quantile regression to estimate the median elapsed time (in the logarithmic scale) as a function of the number of parallel processing units (also in the logarithmic scale).
fit = rq(log2(elapsed)~log2(scatter_count), tau=.5, data=timings)
summary(fit)
##
## Call: rq(formula = log2(elapsed) ~ log2(scatter_count), tau = 0.5,
## data = timings)
##
## tau: [1] 0.5
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 4.53994 4.27083 4.73628
## log2(scatter_count) -0.53434 -0.58918 -0.43281
The table above shows the estimated median time in the logarithmic scale (4.54) for a run using a single processing unit. This model presents the evidences in favor of time reduction through parallel processing: the second coefficient (-0.53) quantifies the reduction in log2(time) for every time we double the number of parallel processing units. By representing the number of parallel processing units by \(n\), we can rewrite this model as: \[log_2(time) = 4.53 - 0.58 \times log_2(n).\]
Because the lower and upper confidence bounds for the \(log_2(n)\) coefficient range bewtween \(-0.73\) and \(-0.63\) (i.e., the confidence interval does not include the zero, which would suggest the lack of association between the variables), we are 95% certain that doubling the number of parallel processing units imply on a significant reduction of processing time. This model suggests that every time we double the number of processors, the required time for execution will be reduced to 65.23% of what was needed before (\(2^{-0.68427} = 0.6223 = 62.23\%\)).
new <- data.frame(scatter_count = unique(timings$scatter_count))
pred <- predict(fit, newdata = new)
data <- cbind(new, data.frame(log2time = pred, time = 2^pred))
write_csv(data, "log2_timings.csv")
data
scatter_count | log2time | time |
---|---|---|
1 | 4.539937 | 23.262553 |
2 | 4.005600 | 16.062222 |
4 | 3.471262 | 11.090570 |
8 | 2.936924 | 7.657767 |
16 | 2.402586 | 5.287500 |
32 | 1.868248 | 3.650889 |
Estimated Median Time to Completion of Process by Number of Parallel Processing Units.
We would like to thank the EMBRAPA Multiuser Bioinformatics Laboratory (http://www.lmb.cnptia.embrapa.br) for providing access to the high-performance computing environment.