Lecture 20
bench
library(dplyr)

d = tibble(
  x = runif(10000),
  y = runif(10000)
)

(b = bench::mark(
  d[d$x > 0.5, ],
  d[which(d$x > 0.5), ],
  subset(d, x > 0.5),
  filter(d, x > 0.5)
))
# A tibble: 4 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 d[d$x > 0.5, ] 49.4µs 57.4µs 16967. 240.14KB 46.5
2 d[which(d$x > 0.5), ] 90µs 102.6µs 9559. 272.03KB 50.8
3 subset(d, x > 0.5) 87.6µs 104.1µs 9350. 298.36KB 49.3
4 filter(d, x > 0.5) 290.9µs 318µs 3000. 1.48MB 48.7
d = tibble(
  x = runif(1e6),
  y = runif(1e6)
)

(b = bench::mark(
  d[d$x > 0.5, ],
  d[which(d$x > 0.5), ],
  subset(d, x > 0.5),
  filter(d, x > 0.5)
))
# A tibble: 4 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 d[d$x > 0.5, ] 3.64ms 4.19ms 240. 13.4MB 160.
2 d[which(d$x > 0.5), ] 8.45ms 8.85ms 113. 24.8MB 142.
3 subset(d, x > 0.5) 9.15ms 9.87ms 102. 24.8MB 117.
4 filter(d, x > 0.5) 4.9ms 5.5ms 181. 24.8MB 228.
bench - relative results
# A tibble: 4 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 d[d$x > 0.5, ] 1 1 2.36 1 1.37
2 d[which(d$x > 0.5), ] 2.32 2.11 1.11 1.86 1.21
3 subset(d, x > 0.5) 2.51 2.36 1 1.86 1
4 filter(d, x > 0.5) 1.35 1.31 1.78 1.86 1.95
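These relative values can be obtained directly from the benchmark object; a minimal sketch, assuming b is the bench::mark result from the previous slide:

# Rescale each numeric column so the smallest value is 1
summary(b, relative = TRUE)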
parallel
Part of the base packages in R
Tools for the forking of R processes (some functions do not work on Windows)
Core functions:
detectCores
pvec
mclapply
mcparallel & mccollect
detectCores
Surprisingly, detects the number of cores of the current system.
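A quick check (the result depends entirely on the machine running it):

library(parallel)
detectCores()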
pvec
Parallelization of a vectorized function call
user system elapsed
0.088 0.011 0.099
user system elapsed
0.164 0.118 0.224
user system elapsed
0.091 0.166 0.165
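The timings above come from parallelizing a vectorized call with pvec; a minimal sketch of that kind of comparison (the function, vector size, and core count are illustrative assumptions):

library(parallel)
x = runif(1e7)

system.time(sqrt(x))                      # serial, vectorized call
system.time(pvec(x, sqrt, mc.cores = 4))  # split x across 4 forked processes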
bench::system_time
cores | 10^6 | 10^7 | 10^8 |
---|---|---|---|
1 | 0.004 | 0.057 | 0.324 |
4 | 0.038 | 0.149 | 1.738 |
6 | 0.031 | 0.143 | 1.336 |
8 | 0.042 | 0.137 | 1.438 |
10 | 0.032 | 0.168 | 1.406 |
mclapply
Parallelized version of lapply
user system elapsed
0.262 0.004 0.265
user system elapsed
0.327 0.092 0.268
user system elapsed
0.335 0.100 0.174
user system elapsed
0.338 0.150 0.163
user system elapsed
0.368 0.157 0.169
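The timings above compare different core counts; a minimal sketch of that kind of mclapply call (the work function and core counts are illustrative assumptions):

library(parallel)

system.time( lapply(1:10, function(i) mean(rnorm(1e6))) )                  # serial
system.time( mclapply(1:10, function(i) mean(rnorm(1e6)), mc.cores = 4) )  # 4 forked workers
system.time( mclapply(1:10, function(i) mean(rnorm(1e6)), mc.cores = 8) )  # 8 forked workers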
mcparallel
Asynchronous evaluation of an R expression in a separate process
mccollect
Checks mcparallel objects for completion
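A minimal sketch of pairing the two (the expressions being evaluated are illustrative assumptions):

library(parallel)

# Start two jobs in separate forked processes; both calls return immediately
m1 = mcparallel(mean(rnorm(1e6)))
m2 = mcparallel(quantile(rnorm(1e6)))

# ... other work can happen here while the jobs run ...

# Block until both jobs finish and gather their results as a list
# (mccollect(jobs, wait = FALSE) instead polls for completed jobs)
mccollect(list(m1, m2))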
doMC & foreach
Packages by Revolution Analytics that provide the foreach function, which is a parallelizable for loop (and then some).
Core functions:
registerDoMC
foreach, %dopar%, %do%
registerDoMC
Primarily used to set the number of cores used by foreach; by default it uses options("cores") or half the number of cores found by detectCores from the parallel package.
foreach
A slightly more powerful version of base for loops (think for with an lapply flavor). Combined with %do% or %dopar% for single or multicore execution.
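A minimal sketch of the %do% form (the loop body is an illustrative assumption):

library(foreach)

# Unlike a base for loop, foreach returns the per-iteration results (a list by default)
foreach(i = 1:3) %do% {
  sqrt(i)
}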
foreach - iterators
foreach can iterate across more than one value, but it doesn’t do length coercion
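For example (values chosen for illustration), iteration stops at the length of the shortest argument rather than recycling:

library(foreach)

# Two iterators of equal length - pairs (1,4), (2,5), (3,6)
foreach(i = 1:3, j = 4:6) %do% {
  i + j
}

# Unequal lengths - only two iterations, j is not recycled
foreach(i = 1:4, j = c(10, 20)) %do% {
  i + j
}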
foreach - combining results
foreach - parallelization
Swapping out %do% for %dopar% will use the parallel backend.
user system elapsed
0.298 0.028 0.110
user system elapsed
0.302 0.039 0.078
user system elapsed
0.336 0.051 0.067
user system elapsed
0.000 0.000 3.011
user system elapsed
0.045 0.006 3.074
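A minimal sketch putting the last two slides together, assuming the doMC backend and an illustrative loop body (.combine controls how the per-iteration results are merged):

library(doMC)
registerDoMC(cores = 4)  # register a parallel backend for %dopar%

# .combine = c collapses the per-iteration results into a vector
foreach(i = 1:8, .combine = c) %dopar% {
  mean(rnorm(1e5, mean = i))
}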
Bootstrapping is a resampling scheme where the original data is repeatedly reconstructed by taking samples of size n (with replacement) from the original data, and using each resample to repeat an analysis procedure of interest. Below is an example of fitting a local regression (loess) to some synthetic data; we will construct a bootstrap prediction interval for this model.
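The original data and code are not reproduced here; the following is a minimal sketch of the idea, assuming synthetic data from a noisy sine curve and 100 bootstrap resamples run across forked processes:

library(parallel)

set.seed(123)
n = 250
d = data.frame(x = runif(n, 0, 10))
d$y = sin(d$x) + rnorm(n, sd = 0.3)

x_grid = seq(0, 10, length.out = 200)

# One bootstrap iteration: resample rows, refit loess, predict on a fixed grid
boot_fit = function(i) {
  d_boot = d[sample(n, n, replace = TRUE), ]
  m = loess(y ~ x, data = d_boot)
  predict(m, newdata = data.frame(x = x_grid))
}

preds = do.call(rbind, mclapply(1:100, boot_fit, mc.cores = 4))

# Pointwise 95% bootstrap interval for the fitted curve
interval = apply(preds, 2, quantile, probs = c(0.025, 0.975), na.rm = TRUE)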
Optimal use of parallelization / multiple cores is hard; there isn’t one best solution
Don’t underestimate the overhead cost
Experimentation is key
Measure it or it didn’t happen
Be aware of the trade-off between developer time and run time
An awful lot of statistics is at its core linear algebra.
For example:
\[ \hat{\beta} = (X^T X)^{-1} X^Ty \]
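A minimal sketch of computing this in R (the simulated data are assumptions); both forms ultimately call into the BLAS/LAPACK routines discussed below:

set.seed(1)
n = 1000; p = 5
X = cbind(1, matrix(rnorm(n * (p - 1)), n))
y = X %*% rnorm(p) + rnorm(n)

# Normal equations: (X'X)^{-1} X'y
beta_hat = solve(crossprod(X), crossprod(X, y))

# QR-based solve - numerically more stable, and what lm() uses
beta_qr = qr.solve(X, y)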
Principal component analysis
Find \(T = XW\) where \(W\) is a matrix whose columns are the eigenvectors of \(X^TX\).
Often solved via SVD - Let \(X = U\Sigma W^T\) then \(T = U\Sigma\).
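A minimal sketch of the SVD route (the data matrix is an assumption; columns are centered first):

X = scale(matrix(rnorm(100 * 4), 100, 4), center = TRUE, scale = FALSE)

s = svd(X)              # X = U Sigma W^T, with W = s$v
T1 = s$u %*% diag(s$d)  # scores T = U Sigma
T2 = X %*% s$v          # equivalently T = X W

all.equal(T1, T2)       # the two give the same scores up to numerical error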
Not unique to statistics - these are the types of problems that come up across all areas of numerical computing.
Numerical linear algebra \(\ne\) mathematical linear algebra
Efficiency and stability of numerical algorithms matter
Don’t reinvent the wheel - use common core linear algebra tools (well-defined API)
Low level algorithms for common linear algebra operations
BLAS - Basic Linear Algebra Subprograms
Copying, scaling, multiplying vectors and matrices
Origins go back to 1979, written in Fortran
LAPACK - Linear Algebra Package
Higher level functionality building on BLAS.
Linear solvers, eigenvalues, and matrix decompositions
Origins go back to 1992, mostly Fortran (expanded on LINPACK, EISPACK)
Most default BLAS and LAPACK implementations (like R’s defaults) are somewhat dated
Written in Fortran and designed for a single CPU core
Certain (potentially non-optimal) hard-coded defaults (e.g. block size).
Multithreaded alternatives:
ATLAS - Automatically Tuned Linear Algebra Software
OpenBLAS - fork of GotoBLAS from TACC at UTexas
Intel MKL - Math Kernel Library, part of Intel’s commercial compiler tools
cuBLAS / Magma - GPU libraries from Nvidia and UTK respectively
Accelerate / vecLib - Apple’s framework for GPU and multicore computing
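One way to see which BLAS / LAPACK a given R build is linked against (the exact output depends on the system and R version):

sessionInfo()  # reports the BLAS and LAPACK libraries in use (R >= 3.4)
La_version()   # version of the LAPACK linked into this build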
n | 1 core | 2 cores | 4 cores | 8 cores | 16 cores |
---|---|---|---|---|---|
100 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
500 | 0.004 | 0.003 | 0.002 | 0.002 | 0.004 |
1000 | 0.028 | 0.016 | 0.010 | 0.007 | 0.009 |
2000 | 0.207 | 0.110 | 0.058 | 0.035 | 0.039 |
3000 | 0.679 | 0.352 | 0.183 | 0.103 | 0.081 |
4000 | 1.587 | 0.816 | 0.418 | 0.227 | 0.145 |
5000 | 3.104 | 1.583 | 0.807 | 0.453 | 0.266 |
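If the timings above are seconds for an n x n matrix product (an assumption about the benchmark), they can be reproduced with something along these lines; how many threads the BLAS uses is typically controlled outside of R, e.g. via an environment variable such as OPENBLAS_NUM_THREADS:

n = 5000
X = matrix(runif(n^2), nrow = n)

# A multithreaded BLAS parallelizes this call automatically
system.time(X %*% X)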
Sta 523 - Fall 2023