PageRank Who? AI Giants Opt for Arbitrary Filters

It is beyond me why folks at OpenAI, AllenAI, and even Google Brain don't use a principled ranking metric (even an old one like PageRank) to select the top-k high-quality documents from Common Crawl when building training datasets, instead of relying on contrived, arbitrary similarity matching and filtering.
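
To make the idea concrete, here is a minimal sketch of PageRank-based top-k selection using plain power iteration. It is only an illustration under stated assumptions: the link graph `toy_links`, the `*.example` page names, and the helpers `pagerank` and `top_k` are all hypothetical, and a real crawl-scale graph would need a distributed or sparse-matrix implementation rather than Python dicts.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over a dict mapping each page to its outlinks."""
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for src in pages:
            targets = links.get(src, [])
            for dst in targets:
                # Each outlink passes on an equal share of the source's rank.
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


def top_k(links: Dict[str, List[str]], k: int) -> List[str]:
    """Return the k highest-ranked pages, breaking ties by name."""
    scores = pagerank(links)
    return sorted(scores, key=lambda p: (-scores[p], p))[:k]


# Hypothetical toy graph: page -> pages it links to.
toy_links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}

print(top_k(toy_links, 2))  # ['c.example', 'a.example']
```

The selection step is just a sort over scores, so the only real design choice is the ranking signal itself; the same `top_k` shape would work with any other link-based or quality-based score swapped in for `pagerank`.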