Tuesday, August 25, 2015

Concordance

First go:

# Print list of word frequencies
{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}

A better version:
# wordfreq.awk --- print list of word frequencies
# this is the file gawk1.gawk used in the command line below

{
    $0 = tolower($0)    # remove case distinctions
    # remove punctuation
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
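To see what the lower-casing and punctuation stripping buy you, here is a small sketch that runs the same program on one made-up sentence (the sample text and the shell variable names are my own, not from the Gutenberg files). It assumes an awk that understands POSIX character classes such as [:alnum:], which gawk does.

```shell
# Made-up sample input, only for illustration
sample='The cat saw the dog. The dog, startled, ran!'

# Same program as above, passed inline instead of via -f
counts=$(printf '%s\n' "$sample" | awk '
{
    $0 = tolower($0)                         # remove case distinctions
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)   # remove punctuation
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}' | sort)                                   # sort only to get a stable order

printf '%s\n' "$counts"
```

With the clean-up in place, "The" and "the" are counted together (3 times) and "dog." and "dog," both count as "dog" (2 times). The first version would have kept them as separate entries.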


Use it like this:

C:\Users\peterb\Desktop\Gawk Practice>gawk -f gawk1.gawk somePort0.txt|sort -k 2nr > out2
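The sort stage of that pipeline can be tried on its own. This is a minimal sketch assuming a Unix-style sort (GNU or similar) is on the PATH; the word/count pairs are made up and only mimic the tab-separated layout the awk program prints.

```shell
# Made-up word<TAB>count pairs in the same layout as the awk output
sorted=$(printf 'de\t12\na\t30\nque\t7\no\t25\n' | sort -k 2nr)

# -k 2 : use the second field (the count) as the sort key
#    n : compare that key as a number, not as text
#    r : reverse the order, so the largest count comes first
printf '%s\n' "$sorted"
```

The lines come out ordered by count from highest to lowest: a, o, de, que.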



The Portuguese text came from Project Gutenberg books.

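The sort part of the command line above was copied from the awk page I was working from. The "-k 2nr" option tells sort to use the second field (the count) as its key, compare it numerically (n), and reverse the order (r), so the most frequent words come first. The out2 file looks a bit like this.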


