
PageRank Who? AI Giants Opt for Arbitrary Filters


It is beyond me why folks at OpenAI, AllenAI, and even Google Brain don't use a principled ranking metric (even an old one like PageRank) to select the top-k high-quality documents from Common Crawl when building training datasets, instead of relying on contrived, arbitrary similarity matching and filtering.
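To make the idea concrete, here is a minimal sketch of what ranking by PageRank and keeping the top k could look like. It is not how any of these labs build their datasets, and it assumes a link graph has already been extracted from the crawl. The names `pagerank`, `top_k`, and `toy_links`, the `*.example` hosts, and the parameter values are all hypothetical, chosen only for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Each page passes an equal share of its damped rank to its outlinks.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def top_k(links, k):
    """Return the k pages with the highest PageRank score."""
    rank = pagerank(links)
    return sorted(rank, key=rank.get, reverse=True)[:k]


# Toy link graph (assumed data, not taken from Common Crawl).
toy_links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}
selected = top_k(toy_links, 2)
```

At Common Crawl scale this would have to run on a distributed graph engine rather than in-memory dictionaries, but the selection step stays the same: score every page from the link structure, sort, and keep the top k.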
