At CourtListener, we’re developing a new system to convert scanned court documents to text. As part of that work, we’ve analyzed more than 1,000 court opinions to determine which fonts courts are using.
Now that we have this information, our next step is to create training data so that our OCR system specializes in these fonts. For now, we’ve attached a spreadsheet with our findings and a script that others can use to extract font metadata from PDFs.
Unsurprisingly, the top font (drumroll please) is Times New Roman, listed as Times in the table below.
| Font | Regular | Bold | Italic | Bold Italic | Total |
|---|---|---|---|---|---|
| Times | 1454 | 953 | 867 | 47 | 3321 |
| Courier | 369 | 333 | 209 | 131 | 1042 |
| Arial | 364 | 39 | 11 | 41 | 455 |
| Symbol | 212 | 0 | 0 | 0 | 212 |
| Helvetica | 24 | 161 | 2 | 2 | 189 |
| Century Schoolbook | 58 | 54 | 52 | 9 | 173 |
| Garamond | 44 | 42 | 41 | 0 | 127 |
| Palatino Linotype | 36 | 24 | 24 | 1 | 85 |
| Old English | 42 | 0 | 0 | 0 | 42 |
| Lincoln | 27 | 0 | 0 | 0 | 27 |
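For anyone who wants to build a similar tally, here is a rough sketch of one way to do it. This is not the attached script. It assumes Poppler's `pdffonts` command-line tool is installed, and the helper names (`classify`, `parse_pdffonts`, `tally`) and the style heuristics are hypothetical choices made for illustration. It also assumes each family and style should be counted once per document, which may not match how the numbers above were produced.

```python
import re
import subprocess
from collections import Counter

# Embedded font subsets carry a six-letter tag such as "ABCDEF+".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def classify(font_name):
    """Split a raw PDF font name into a (family, style) pair.

    The keyword matching is a heuristic assumed for this sketch.
    """
    name = SUBSET_TAG.sub("", font_name)
    lowered = name.lower()
    bold = "bold" in lowered
    italic = "italic" in lowered or "oblique" in lowered
    if bold and italic:
        style = "Bold Italic"
    elif bold:
        style = "Bold"
    elif italic:
        style = "Italic"
    else:
        style = "Regular"
    # Treat everything before the first "-" or "," as the family name.
    family = re.split(r"[-,]", name, maxsplit=1)[0]
    return family, style


def parse_pdffonts(output):
    """Yield the font name from each data row of `pdffonts` output."""
    # The first two lines are the column header and a dashed rule.
    for line in output.splitlines()[2:]:
        fields = line.split()
        if fields:
            yield fields[0]


def tally(pdf_paths):
    """Count (family, style) pairs, once per document (an assumption)."""
    counts = Counter()
    for path in pdf_paths:
        result = subprocess.run(
            ["pdffonts", path], capture_output=True, text=True, check=True
        )
        counts.update({classify(n) for n in parse_pdffonts(result.stdout)})
    return counts
```

Calling `tally(["opinion1.pdf", "opinion2.pdf"])` (hypothetical file names) returns a `Counter` keyed by pairs such as `("TimesNewRomanPS", "Bold")`. The sketch does not merge variant names like `TimesNewRomanPS` and `Times` into a single family, so a grouping step would still be needed to produce rows like those in the table.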