Details
Suggestion
Resolution: Unresolved
Not Evaluated
Description
I have a system that would benefit greatly from being able to extract the text of multiple PDF documents in parallel. This would make building full-text search indexes for a library of PDFs much more efficient.
The idea is to allow different instances of QPdfDocument to be processed in parallel by different threads. The current implementation serializes all access to QPdfDocument methods by using a QPdfMutexLocker, even when the calls are made on separate instances. For example, with this change the following code would run fully in parallel, whereas today t1 and t2 have to wait on each other:
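// Each thread owns its own QPdfDocument; nothing is shared between them.
std::thread t1([]() {
    QPdfDocument doc;
    doc.load("Book1.pdf");
    doc.getAllText(0); // extract the text of page 0
});
std::thread t2([]() {
    QPdfDocument doc;
    doc.load("Book2.pdf");
    doc.getAllText(0);
});
t1.join();
t2.join();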
From a high-level viewpoint, I can't find a reason why parsing and rendering PDFs would have to introduce contention between threads at the method-call level.
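To make the difference in locking granularity concrete, here is a minimal sketch that does not use Qt at all. The names Probe, simulateWork, GlobalLockDoc, PerInstanceDoc and threadsOverlap are hypothetical and only stand in for the two locking strategies. This is not how QPdfDocument is implemented, and it says nothing about whether the underlying PDF engine could safely run unlocked.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Shared probe state: how many threads are currently inside extractText(),
// and whether two threads were ever inside at the same time.
struct Probe {
    std::atomic<int> active{0};
    std::atomic<bool> overlapped{false};
};

// Simulated work: stay "inside the document" until a second thread is seen
// inside as well, or until a short deadline passes.
inline void simulateWork(Probe &probe) {
    using namespace std::chrono;
    if (++probe.active == 2)
        probe.overlapped = true;
    const auto deadline = steady_clock::now() + milliseconds(100);
    while (!probe.overlapped && steady_clock::now() < deadline)
        std::this_thread::sleep_for(milliseconds(1));
    --probe.active;
}

// Hypothetical stand-in for the behaviour described above:
// one mutex shared by every instance, so all calls are serialized.
class GlobalLockDoc {
public:
    void extractText(Probe &probe) {
        std::lock_guard<std::mutex> lock(sharedMutex());
        simulateWork(probe);
    }
private:
    static std::mutex &sharedMutex() {
        static std::mutex m;
        return m;
    }
};

// Hypothetical stand-in for the requested behaviour:
// each instance has its own mutex, so separate instances do not contend.
class PerInstanceDoc {
public:
    void extractText(Probe &probe) {
        std::lock_guard<std::mutex> lock(m_mutex);
        simulateWork(probe);
    }
private:
    std::mutex m_mutex;
};

// Runs two threads, each with its own Doc instance, and reports whether
// both were ever inside extractText() at the same time.
template <typename Doc>
bool threadsOverlap() {
    Probe probe;
    auto worker = [&probe]() {
        Doc doc;
        doc.extractText(probe);
    };
    std::thread t1(worker);
    std::thread t2(worker);
    t1.join();
    t2.join();
    return probe.overlapped;
}
```

With the shared mutex, threadsOverlap<GlobalLockDoc>() should report no overlap, because the second thread cannot enter until the first has left. With one mutex per instance, threadsOverlap<PerInstanceDoc>() should report that both threads were inside at once, which is the behaviour this suggestion asks for.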