Humans excel at extracting structurally determined meaning from speech, despite the inherent physical variability of spoken language (e.g., background noise, speaker variability, accents). One way to achieve such perceptual robustness is for the brain to predict its sensory input and, to some extent, the linguistic content, based on its internal states. However, the combinatorial nature of language, while endowing it with unboundedness and expressive power, also renders prediction over a sequence of words a non-trivial and, at the very least, non-Markovian affair. How the brain's neural infrastructure allows linguistic structures, e.g., the hierarchical organisation of phrases, to be processed jointly with ongoing predictions over incoming input is not yet well understood. To address this, the present study takes a novel perspective on the relationship between structural and statistical knowledge of language in brain dynamics by focusing on phase and amplitude modulation. Syntactic features derived from constituent hierarchies, and surface statistics based on the sequential predictability of words obtained from a pretrained transformer model, were jointly used to reconstruct neural oscillatory dynamics during naturalistic audiobook listening. We modelled the brain response to structural and statistical information via forward encoding models, and found that both types of features improve reconstruction of the neural signal on unseen data. Results indicated substantial overlap in the brain activity associated with the two types of information, suggesting that the classic view that linguistic structures and the statistics defined over them can be separated is a false dichotomy when language is processed in the brain. Syntactic features aided neural signal reconstruction over a longer period of time; the effect of statistical features, in contrast, was comparatively short-lived but tightly bound to the phase of neural dynamics, suggesting a role in the temporal prediction and alignment of the cortical oscillations involved in speech processing. Both types of features are processed jointly, contribute to ongoing neural dynamics during spoken language comprehension, and are locally integrated through cross-frequency coupling mechanisms.
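The abstract does not name the transformer or toolkit used to obtain word predictability; as a minimal sketch, assuming a causal language model such as GPT-2 accessed through the HuggingFace transformers library, per-token surprisal could be computed as below. The model choice, the base-2 units, and the example sentence are illustrative assumptions, not the paper's actual pipeline.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Assumption: GPT-2 small stands in for the unspecified pretrained transformer.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_surprisals(text: str) -> list[float]:
    """Surprisal (-log2 p) of each token given its left context.

    The first token has no left context and is skipped, so the
    returned list has one entry fewer than the tokenised input.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                      # (1, T, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    # The prediction at position t-1 scores the token observed at position t.
    targets = ids[0, 1:].unsqueeze(1)                   # (T-1, 1)
    token_logp = log_probs[0, :-1].gather(1, targets).squeeze(1)
    return (-token_logp / torch.log(torch.tensor(2.0))).tolist()

print(token_surprisals("The old man the boats."))
```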
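The forward encoding models are likewise not specified beyond the name; a common choice in this literature is a time-lagged ridge regression (a temporal response function). The sketch below, with a hypothetical lag window and regularisation strength, shows the general form of such a model rather than the authors' implementation.

```python
import numpy as np

def lagged_design(stim: np.ndarray, lags: range) -> np.ndarray:
    """Stack time-shifted copies of stimulus features into a TRF design matrix."""
    n_times, n_feats = stim.shape
    X = np.zeros((n_times, n_feats * len(lags)))
    for i, lag in enumerate(lags):
        shifted = np.roll(stim, lag, axis=0)
        # Zero out samples that wrapped around the array edges.
        if lag > 0:
            shifted[:lag] = 0
        elif lag < 0:
            shifted[lag:] = 0
        X[:, i * n_feats:(i + 1) * n_feats] = shifted
    return X

def fit_forward_model(stim, response, lags=range(0, 60), alpha=1.0):
    """Ridge solution of response ≈ X @ W; lags and alpha are illustrative."""
    X = lagged_design(stim, lags)
    XtX = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ response)

# Hypothetical usage: 2 stimulus features (syntactic + surprisal), 64 channels.
stim = np.random.randn(10_000, 2)
eeg = np.random.randn(10_000, 64)
W = fit_forward_model(stim, eeg)   # (2 * 60, 64) filter weights
```

In practice the weights would be fitted on training data and evaluated by correlating `X @ W` with held-out recordings, which is what "reconstruction on unseen data" refers to.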
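Finally, the cross-frequency coupling metric is left unspecified; one widely used estimate is the mean-vector-length modulation index, sketched here with SciPy. The theta and low-gamma band limits and the sampling rate are assumptions for illustration only.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def bandpass(x, lo, hi, fs, order=4):
    """Zero-phase Butterworth band-pass filter."""
    b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, x)

def pac_mean_vector_length(x, fs, phase_band=(4, 8), amp_band=(30, 50)):
    """Phase-amplitude coupling between a slow phase and a fast amplitude."""
    phase = np.angle(hilbert(bandpass(x, *phase_band, fs)))
    amp = np.abs(hilbert(bandpass(x, *amp_band, fs)))
    # Length of the mean amplitude-weighted phase vector; near 0 if uncoupled.
    return np.abs(np.mean(amp * np.exp(1j * phase)))

# Hypothetical usage on a 60 s single-channel recording sampled at 250 Hz.
fs = 250
x = np.random.randn(60 * fs)
print(pac_mean_vector_length(x, fs))
```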