Here's the algorithm without any of the kernel flim-flam, straight up:
Model
f(x; α) ≡ Σ_{j=1}^{D} α_j z(x; ω_j)
Training
min_α Σ_{i=1}^{N} ℓ(f(x_i; α), y_i)
You just draw a bunch of random functions, independently of your data set. You weight them, and you tune those weights so that you get a low loss on your training set.
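For concreteness, here is a minimal NumPy sketch of that recipe. The cosine feature map z(x; ω) = cos(ω·x + b), the Gaussian draw for ω, the squared loss, and the ridge penalty `lam` are all assumptions made for this example; the algorithm itself only asks for some family of random functions z(x; ω) and some loss ℓ.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_features(d_in, D, scale=1.0):
    """Draw D random feature parameters (ω_j, b_j) independently of the data."""
    omega = rng.normal(0.0, scale, size=(d_in, D))
    b = rng.uniform(0.0, 2 * np.pi, size=D)
    return omega, b

def z(X, omega, b):
    """Evaluate the random features z(x; ω_j) for every row of X."""
    return np.cos(X @ omega + b)

def fit(X, y, D=200, lam=1e-3):
    """Tune the weights α by minimizing a regularized squared loss on the training set."""
    omega, b = draw_features(X.shape[1], D)
    Z = z(X, omega, b)                                  # N x D feature matrix
    alpha = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)
    return omega, b, alpha

def predict(X, omega, b, alpha):
    """f(x; α) = Σ_j α_j z(x; ω_j)."""
    return z(X, omega, b) @ alpha

# Toy usage on synthetic 1-D data
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=500)
omega, b, alpha = fit(X, y)
print(np.mean((predict(X, omega, b, alpha) - y) ** 2))  # training squared error
```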
In the second paper we showed this. In the same way that Fourier bases provide a dense basis set for an L² ball of L² functions, or in the same way that wide three-layer neural networks can approximate any smooth function arbitrarily well, so too can a bunch of randomly drawn smooth functions approximate any function in a Hilbert space arbitrarily well, with high probability.
So now you don't need to talk about kernels to justify these random features. They don't have to be eigenfunctions of any famous kernel; they're a legitimate basis for learning unto themselves.
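Here is a quick numeric check one could run, using the same assumed cosine features as the sketch above: as the number D of randomly drawn functions grows, a least-squares fit to a smooth target typically keeps improving.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.exp(-X[:, 0] ** 2) * np.sin(3 * X[:, 0])   # a smooth target, chosen arbitrarily

for D in (10, 40, 160, 640):
    omega = rng.normal(size=(1, D))               # random feature parameters, data-independent
    b = rng.uniform(0, 2 * np.pi, size=D)
    Z = np.cos(X @ omega + b)
    alpha, *_ = np.linalg.lstsq(Z, y, rcond=None) # least-squares weights α
    print(D, np.mean((Z @ alpha - y) ** 2))       # squared error typically shrinks with D
```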
In the third paper we finally derive generalization bounds for the algorithm I just showed you.
Theorem: With high probability,
R[f̂] − R[f*] = O((1/√N + 1/√D) ||f*||)
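One standard way to read off the rate (an interpretation, not something spelled out on the slide itself): both error terms decay like inverse square roots, so taking the number of random features D on the order of the number of samples N makes the approximation error comparable to the estimation error.

```latex
R[\hat{f}] - R[f^{\ast}]
  = O\!\left(\left(\frac{1}{\sqrt{N}} + \frac{1}{\sqrt{D}}\right)\lVert f^{\ast}\rVert\right)
\quad\Longrightarrow\quad
R[\hat{f}] - R[f^{\ast}]
  = O\!\left(\frac{\lVert f^{\ast}\rVert}{\sqrt{N}}\right)
  \ \text{when } D = \Theta(N).
```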