This README file belongs in a file archive found at http://runeberg.org/words/frekvens-20070122.tgz
The files in this archive document word frequencies by year and language, based on raw or proofread text from Project Runeberg's electronic facsimile editions, as of January 22, 2007.
Project Runeberg is an archive of freely available electronic editions of classic out-of-copyright Scandinavian literature, http://runeberg.org/
Most of its titles consist of scanned images (electronic facsimile) and raw text from optical character recognition (OCR) at varying stages of proofreading. Volunteers are welcome to help proofread the scanned text.
Since the scanned images depict a particular printed edition, the resulting text is tied to a publishing year and to a particular orthography (details in spelling), which is not the case for electronic texts that are not backed by scanned images.
Although Ibsen's drama Peer Gynt was written in 1867 and first performed in 1876, its reprint in the author's collected works in 1898 reflects the state of the Norwegian language in that later year. This is the kind of Norwegian spelling that people were reading in 1898. It might be the author's original spelling from 1867 or a version modernized for the 1898 edition, but it cannot have been modernized beyond the publishing year.
The files herein are plain text, encoded in UTF-8. The file no-1880.top contains word frequencies in Norwegian books printed in the year 1880. Each entry is a count followed by a word, sorted by descending count. In the following excerpt, the first entry means that the word "og" occurred 8161 times.
    8161 og
    5569 i
    3896 at
    3616 af
    3359 den
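As an illustration of the file layout, here is a minimal Python sketch for reading such a list. The function name `parse_top` is made up for this example, and the sketch assumes each line holds a count and a word separated by whitespace, as `uniq -c` prints them.

```python
def parse_top(lines):
    """Yield (count, word) pairs from lines of the form '<count> <word>'."""
    for line in lines:
        parts = line.split(None, 1)   # split on the first run of whitespace
        if len(parts) != 2:
            continue                  # skip blank or malformed lines
        count, word = parts
        yield int(count), word.strip()


# The excerpt from no-1880.top shown above.
sample = """\
   8161 og
   5569 i
   3896 at
   3616 af
   3359 den
"""

freq = {word: count for count, word in parse_top(sample.splitlines())}
print(freq["og"])  # 8161
```

To read a real file, pass an open file object instead, for example `open("no-1880.top", encoding="utf-8")`.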
The words were extracted with hunspell 1.1.4, having the following affix and dictionary files:
---- blank.aff ----
SET UTF-8
WORDCHARS .:-'0123456789
---- blank.dic ----
1
xyzzy
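and the following Unix/Linux command line. Because the dictionary contains only the dummy entry "xyzzy", hunspell -l (which lists misspelled words) reports every word in the input: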
sed 's/<[^>]*>//g' *.txt | hunspell -d blank -l | sort | uniq -c | sort -nrf
Having hyphen, period, apostrophe and digits in WORDCHARS means the output list will contain words such as "etc.", "Dyre-", "General-Vejmester", "3-årig" (3-year-old), "1700-talet" (18th century), "n:o" (numero), "1:20000" (map scale), "12:50" and "23:-" (prices). However, it also means that the period at the end of sentences will be included with some words.
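To make the effect of WORDCHARS concrete, the pipeline can be approximated in Python. This is a rough sketch, not a reimplementation of hunspell: the regular expression only imitates the idea that letters plus the WORDCHARS characters form a word, and hunspell's real tokenizer may differ in edge cases. The names `strip_tags` and `word_frequencies`, the filter that drops tokens made of punctuation only, and the alphabetical tie-breaking are assumptions of this example.

```python
import re
from collections import Counter

WORDCHARS = ".:-'0123456789"  # same character set as in blank.aff

# A token is a run of letters/digits plus the WORDCHARS characters.
# Assumption: this only approximates how hunspell splits text into words.
TOKEN = re.compile(r"(?:[^\W_]|[" + re.escape(WORDCHARS) + r"])+")


def strip_tags(text):
    # Same substitution as sed 's/<[^>]*>//g'. sed works line by line,
    # so a tag is never matched across a newline here either.
    return re.sub(r"<[^>\n]*>", "", text)


def word_frequencies(text):
    tokens = (
        t for t in TOKEN.findall(strip_tags(text))
        if any(c.isalnum() for c in t)   # assumption: drop bare punctuation
    )
    counts = Counter(tokens)
    # Highest count first, like the final sort; ties are broken
    # alphabetically here, which is a choice made for this sketch.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


sample = "<p>Han er 3-årig. Kortet er i 1:20000, prisen 12:50 og 23:- etc. Han er</p>"
for word, count in word_frequencies(sample):
    print(count, word)
```

In this sample, "3-årig." keeps its sentence-final period, which is the side effect described above, while "1:20000", "12:50" and "23:-" survive as single words.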
Text that has not been proofread contains OCR errors, so garbage such as "wwTQft" and "forunderJigere" also appears in the lists. This can only be improved by further proofreading. Using only the fully proofread pages would have reduced the amount of text too much.
The following printed and scanned volumes were used for each file. Prefix each volume name with http://runeberg.org/ to get its address.
file | volumes |
---|---|
no-1880.top | norge80 |
no-1883.top | tekuke/1883 |
no-1884.top | tekuke/1884 tekuke/1884pat |
no-1888.top | tekuke/1888 |
no-1889.top | tekuke/1889 |
no-1890.top | tekuke/1890 |
no-1891.top | tekuke/1891 |
no-1892.top | tekuke/1892 tekuke/1892pat |
no-1893.top | tekuke/1893 |
no-1894.top | tekuke/1894 |
no-1896.top | ilnolihi/1 ilnolihi/2 ilnolihi/3 ilnolihi/4 |
no-1900.top | ibsen/1 ibsen/2 ibsen/3 ibsen/4 ibsen/5 ibsen/6 ibsen/7 ibsen/8 ibsen/9 ibsen/10 |
no-1903.top | brand |
no-1905.top | ilnolih2 |
no-1907.top | bjorfort |
no-1910.top | bjornson/1 bjornson/2 bjornson/3 bjornson/4 bjornson/5 |
no-1916.top | urmakeri |
no-1934.top | bokogbib/1934 |
no-1935.top | bokogbib/1935 |
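
The prefix rule for the table above can be sketched in a few lines of Python. The function name `parse_row` is hypothetical, and the sketch assumes each row has the form "file | volumes |" with volume names separated by spaces.

```python
BASE = "http://runeberg.org/"


def parse_row(row):
    """Return (filename, list of volume addresses) for one table row."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    filename, volumes = cells[0], cells[1]
    return filename, [BASE + name for name in volumes.split()]


filename, urls = parse_row("no-1884.top | tekuke/1884 tekuke/1884pat |")
print(filename)
for url in urls:
    print(url)
```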