# Print list of word frequencies
{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
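A quick way to see the weakness of this first version is to feed it a line where the same word appears with different case and punctuation. This is only a sketch, assuming a POSIX-style shell with awk and sort available; the sample sentence is made up, plain awk stands in for gawk (the program uses nothing gawk-specific), and the output goes through sort only because awk visits array elements in no particular order.

```shell
# Hypothetical sample input, not taken from the Portuguese text.
sample='The cat saw the cat.'

# Same program as above, written inline instead of in a file.
counts=$(printf '%s\n' "$sample" | awk '
{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}' | sort)

# One line per distinct token, with its count after a tab.
printf '%s\n' "$counts"
```

Every token comes out with a count of 1: "The" and "the" are counted separately, and so are "cat" and "cat.".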
A better version is:
# wordfreq.awk --- print list of word frequencies
# this is the file gawk1.gawk used in the command line below
{
    $0 = tolower($0) # remove case distinctions
    # remove punctuation
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
Use it like this:
C:\Users\peterb\Desktop\Gawk Practice>gawk -f gawk1.gawk somePort0.txt|sort -k 2nr > out2
The Portuguese text came from Gutenberg books.
The sort instruction in the command line above was copied from the awk page I was referring to. The "-k 2nr" option sorts on the second field (the count), numerically and in reverse order, so the most frequent words come first. The out2 file looks a bit like this.
dscd
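The sort step can also be tried on its own with a few lines typed by hand. A minimal sketch, assuming a POSIX-style shell and a sort that understands -k; the three word/count pairs are made up and are not taken from out2.

```shell
# Hypothetical word/count lines in the same tab-separated shape the awk script prints.
sorted=$(printf 'de\t7\nque\t12\nnavio\t3\n' | sort -k 2nr)

# Lines come out ordered by the second column, largest count first.
printf '%s\n' "$sorted"
```

The line with 12 comes first, then 7, then 3, regardless of the order they went in.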