Large language models literally do subspace projections on text: each token's embedding is projected into query, key, and value subspaces, and attention over those projections breaks the text into contextual chunks, which the model then memorizes. That's how they're defined.
Source: the paper that defined the transformer architecture and the attention formula behind large language models, which has been cited over 85,000 times in academic sources: https://arxiv.org/abs/1706.03762
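
For reference, here's a minimal numpy sketch of the scaled dot-product attention defined in that paper (its equation 1). The dimensions and random weights below are toy placeholders, not values from any real model:

```python
import numpy as np

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    """Single-head attention as in 'Attention Is All You Need' (eq. 1).

    X:             (seq_len, d_model) token embeddings
    W_q, W_k, W_v: (d_model, d_k) projection matrices mapping each
                   embedding into query / key / value subspaces
                   (learned in a real model; random here for illustration).
    """
    Q = X @ W_q                      # project into query subspace
    K = X @ W_k                      # project into key subspace
    V = X @ W_v                      # project into value subspace
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # pairwise relevance between tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V               # context-weighted mix of value vectors

# Toy usage: 4 tokens, model dim 8, head dim 4
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out = scaled_dot_product_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 4): one context-dependent vector per token
```

Each output row is a weighted mixture of the value projections of the other tokens, weighted by how relevant they are to that position.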
I believe your “They use attention mechanisms to figure out which parts of the text are important” is just a restatement of my “break it into contextual chunks”, no?