Gawk and sed

First go:

# Print list of word frequencies
{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
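To see why this first attempt needs refining, here is a small sketch that runs the same logic on one made-up line of text (the sample sentence is hypothetical, and plain awk is assumed to behave like gawk for this simple script). Because case and punctuation are left untouched, "The", "the", "cat" and "cat." each end up as a separate entry.

```shell
# Hypothetical one-line input; the awk program is the first version above.
counts=$(printf 'The cat. the cat\n' |
    awk '{ for (i = 1; i <= NF; i++) freq[$i]++ }
         END { for (word in freq) printf "%s\t%d\n", word, freq[word] }' |
    LC_ALL=C sort)

# Four distinct entries are printed, each with a count of 1.
printf '%s\n' "$counts"
```

Lowercasing the line and stripping punctuation before counting would merge these into two entries with a count of 2 each.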

A better version is:

# wordfreq.awk --- print list of word frequencies
# this is the file gawk1.gawk seen below in the command line

{
    $0 = tolower($0)    # remove case distinctions
    # remove punctuation
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}

Use it like this:

C:\Users\peterb\Desktop\Gawk Practice>gawk -f gawk1.gawk somePort0.txt|sort -k 2nr > out2

The Portuguese text came from Project Gutenberg books.

The sort instruction in the command line above was just copied from the awk page I was referring to here. With a POSIX-style sort, "-k 2nr" sorts on the second field (the count), numerically and in reverse, so the most frequent words come first. The out2 file looks a bit like this.

dscd
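A minimal sketch of what the sort step does, assuming a POSIX-style sort (such as GNU sort) is on the path. The three sample lines are made up for illustration and are not taken from out2.

```shell
# Hypothetical tab-separated "word<TAB>count" lines, in the shape
# the awk script prints them.
sample=$(printf 'de\t12\na\t40\nque\t7\n')

# -k 2nr: the sort key starts at field 2; n compares it numerically,
# r reverses the order so the largest counts come first.
sorted=$(printf '%s\n' "$sample" | sort -k 2nr)
printf '%s\n' "$sorted"
```

With this sample the line for "a" (40) comes first, then "de" (12), then "que" (7).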


Tuesday, August 25, 2015

Concordance