digilutionary.com
Patrick Dwyer

Frequency :: Java

in News, Java, Comparative Programming by patrick



This program is part of the Comparative Programming :: Frequency Analysis set of examples.

Our Java example, while longer than most of the other frequency analysis programs, is fairly straightforward in its approach. To keep track of the number of character occurrences in our text, we want to create a map of characters to numbers, which we’ll increment every time we see a valid character. In lines 11 through 19 we create our Hashtable and initialize it with the letters ‘a’ through ‘z’ (stored as one-letter strings), associating with each the value 0.0.

We’ll need to reference the letters ‘a’ through ‘z’ twice in our program; once to create our Hashtable, and again to print the results. For this reason we create a string of letters in alphabetical order on line 12. We can iterate through the string with a simple for loop (lines 17 and 44) to access our character counts.
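As a rough sketch of that idea on its own, the snippet below builds the table and then walks the same string again to read the entries back in alphabetical order. The class name LetterTable and the helper emptyCounts are made up for this illustration and do not appear in freq.java. Walking the string matters because a Hashtable makes no promise about the order in which it hands back its keys.

```java
import java.util.Hashtable;

public class LetterTable {

    // The alphabet string serves as both the list of keys and the print order.
    static final String LETTERS = "abcdefghijklmnopqrstuvwxyz";

    // Build a table mapping each one-letter String to a count of 0.0.
    // (emptyCounts is a hypothetical helper, not part of freq.java.)
    static Hashtable<String, Double> emptyCounts() {
        Hashtable<String, Double> counts = new Hashtable<String, Double>();
        for (int i = 0; i < LETTERS.length(); i++) {
            counts.put(LETTERS.substring(i, i + 1), 0.0);
        }
        return counts;
    }

    public static void main(String[] args) {
        Hashtable<String, Double> counts = emptyCounts();

        // Walk the same string a second time to read the entries back in order.
        for (int i = 0; i < LETTERS.length(); i++) {
            String key = LETTERS.substring(i, i + 1);
            System.out.println(key + " -> " + counts.get(key));
        }
    }
}
```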

With our Hashtable ready, we can open and read our file. On line 22 we create a BufferedReader to access the contents of our text file line by line. Our while loop (line 26) extracts each line from the file, placing its value in the line variable. This while loop will continue until our input file’s readLine method returns null, which signifies the end of the file.
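To see the read loop in isolation, here is a small sketch that runs without a file on disk. It swaps the FileReader for a StringReader, which is purely an assumption made so the example is self-contained; the class name LineReading and the method countLines are likewise invented for the illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class LineReading {

    // Count the lines a Reader yields, using the same readLine() loop
    // as the main program. (countLines is a hypothetical helper.)
    static int countLines(Reader source) throws IOException {
        BufferedReader in = new BufferedReader(source);
        String line;
        int lines = 0;

        // readLine() returns null once the input is exhausted
        while ((line = in.readLine()) != null) {
            System.out.println("read: " + line);
            lines++;
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the FileReader used in freq.java
        int total = countLines(new StringReader("first line\nsecond line\n"));
        System.out.println(total + " lines");
    }
}
```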

Once we have a line of text from our file, we need to break it down into characters, the fundamental unit of text for our analysis. To do this we use a for loop, starting on line 29, that counts from 0 up to the length of our line. Using the index of our for loop, we can sequentially extract the characters from our line of text (line 30) using the String class’ substring method.
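The extraction step can be sketched by itself as follows. The class name CharExtraction and the helper split are hypothetical; the only call taken from the program is substring(i, i + 1), which returns a String of length one rather than a char.

```java
public class CharExtraction {

    // Break a line into one-character Strings, the same way the
    // inner for loop of freq.java does. (split is a hypothetical helper.)
    static String[] split(String line) {
        String[] pieces = new String[line.length()];
        for (int i = 0; i < line.length(); i++) {
            // substring(i, i + 1) covers exactly the character at index i
            pieces[i] = line.substring(i, i + 1);
        }
        return pieces;
    }

    public static void main(String[] args) {
        for (String piece : split("Hi!")) {
            System.out.println("[" + piece + "]");
        }
    }
}
```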

We don’t want to count any random characters we come across; we’re only interested in letters. Line 33 of our program contains a simple Regular Expression to test our extracted character. The pattern [a-zA-Z] matches any single lower or upper case letter from ‘a’ to ‘z’, but nothing else. If our character matches our regular expression, we want to include it in our frequency analysis, so we translate the character to lower case (in case it was an upper case letter), and increment our character frequency in the Hashtable (line 37). We increment this value by first retrieving the old value using our current character

freq.get(c)

adding 1

freq.get(c) + 1.0

and then inserting our incremented value back into the Hashtable

freq.put(c, freq.get(c) + 1.0)

The last step in our for loop keeps track of the total number of valid characters so far encountered.
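Put together, the filter and increment steps might look like the sketch below. The class name CountLetters and the method tally are invented for this example. Unlike freq.java, which fills the table with 0.0 up front, this version treats a missing key as 0.0 so it can start from an empty table; that is a change made only to keep the sketch short.

```java
import java.util.Hashtable;

public class CountLetters {

    // Add the letters of one line to the table and return how many
    // characters were counted. (tally is a hypothetical helper.)
    static int tally(String line, Hashtable<String, Double> freq) {
        int counted = 0;
        for (int i = 0; i < line.length(); i++) {
            String c = line.substring(i, i + 1);

            // keep only the letters a-z and A-Z
            if (c.matches("[a-zA-Z]")) {
                c = c.toLowerCase();

                // assumption: a missing key counts as 0.0 (freq.java pre-fills the table instead)
                Double old = freq.get(c);
                freq.put(c, (old == null ? 0.0 : old) + 1.0);
                counted++;
            }
        }
        return counted;
    }

    public static void main(String[] args) {
        Hashtable<String, Double> freq = new Hashtable<String, Double>();
        int counted = tally("Hello, World!", freq);
        System.out.println(counted + " letters, l seen " + freq.get("l") + " times");
    }
}
```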

Once we’re done with our file, having read each line and extracted each valid character, we need to calculate our character frequencies and print them out (lines 44 to 48).

Using the string of letters from ‘a’ to ‘z’ that we defined earlier, we can get the key and value for each of the characters in our Hashtable. The key is the character itself, a letter from ‘a’ to ‘z’, while the value is how often the character occurred in the text file we analyzed, the number we computed in the course of analyzing the lines and characters of the file.

With the key and value in hand, it is a simple matter to determine the relative frequency of our character in the file (line 46). Printing the frequency of our character takes a little bit more work. We’d like the output of our program to look like:

a: 8.00
b: 1.51
c: 2.42
d: 3.90
e: 12.84
f: 2.20
.
.
.
t: 9.42
u: 3.02
v: 0.94
w: 2.28
x: 0.15
y: 2.16
z: 0.04

A first attempt at printing our results could be:

System.out.println(key + ": " + perc);

This correctly prints a letter, followed by a colon, a space, and our frequency, but the frequency is printed with far more digits than we want:

a: 7.9990618682967325
b: 1.5108759990916503
c: 2.4194118807679277
d: 3.9029257051809445
.
.
.

We want to limit our frequency to two decimal places. Thankfully there is a simple method in the String class for formatting numbers and strings in a controlled manner. We can pass C-style format strings to the format method to create a better output string (line 47). The %2.2f tells the format method to print our floating point value in a field at least two characters wide, with exactly two digits after the decimal point. The width is only a minimum, so a value such as 12.84 is still printed in full.
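A few quick calls show how the format string behaves. The class name FormatDemo and the helper twoPlaces are hypothetical, and the explicit Locale.US argument is an assumption added here so the decimal separator is always a period; freq.java calls String.format without a locale and so follows the machine's default. The value 123.456 is made up to show that the width never truncates.

```java
import java.util.Locale;

public class FormatDemo {

    // Format a percentage with the same pattern as line 47 of freq.java.
    // Locale.US is an assumption, used to keep the output reproducible.
    static String twoPlaces(double value) {
        return String.format(Locale.US, "%2.2f", value);
    }

    public static void main(String[] args) {
        System.out.println("a: " + twoPlaces(7.9990618682967325)); // a: 8.00
        System.out.println("b: " + twoPlaces(1.5108759990916503)); // b: 1.51

        // the width of 2 is a minimum, so longer values are not cut off
        System.out.println(twoPlaces(123.456));                    // 123.46

        // a larger width pads on the left, which lines the numbers up
        System.out.println("[" + String.format(Locale.US, "%6.2f", 0.04) + "]"); // [  0.04]
    }
}
```

If aligned columns matter, a wider field such as %5.2f or %6.2f does more than the 2 in %2.2f, which has no visible effect on these values.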

While our frequency analysis is done, it’s important to note that our entire program is wrapped in a try-catch block, a construct of the Java language intended for exception processing. When working with any type of Input/Output, be it files, networks, or peripherals, there is the possibility that communication will break, or somehow be disrupted. Many Java classes leave it up to the programmer to handle these situations. In our case, opening and reading from a file can cause IOExceptions that we need to account for. In our program we don’t attempt to recover from the error, opting instead to print out a stack trace showing where the error occurred (line 53).
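For comparison, here is a sketch of the same catch-and-report approach written with try-with-resources, a variation that requires Java 7 or later and is not what freq.java does. It closes the reader automatically, which the listing below never does explicitly. The class name SafeRead, the method countLines, and the -1 return value are all choices made for this illustration.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SafeRead {

    // Return the number of lines in the file, or -1 if it could not be read.
    // (countLines and the -1 convention are hypothetical, for illustration.)
    static int countLines(String path) {
        // the reader declared here is closed automatically when the block exits
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            int lines = 0;
            while (in.readLine() != null) {
                lines++;
            }
            return lines;
        } catch (IOException ioe) {
            // as in freq.java, report the problem rather than recover from it
            ioe.printStackTrace();
            return -1;
        }
    }

    public static void main(String[] args) {
        // a missing file raises FileNotFoundException, a subclass of IOException
        System.out.println(countLines("no-such-file.txt"));
    }
}
```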

Java
01 import java.util.*;
02 import java.io.*;
03
04 public class freq {
05
06     public static void main(String[] args) {
07
08         try {
09
10             // create our map of lowercase letters to counts
11             Hashtable<String, Double> freq = new Hashtable<String, Double>();
12             String set = "abcdefghijklmnopqrstuvwxyz";
13
14             int count = 0;
15
16             // initialize our character counts to 0
17             for (int i = 0; i < set.length(); i++) {
18                 freq.put(set.substring(i, i+1), 0.0);
19             }
20
21             // open our file
22             BufferedReader in = new BufferedReader(new FileReader(args[0]));
23             String line;
24
25             // read in each line from the file
26             while ( (line = in.readLine()) != null) {
27
28                 // extract each character from the line
29                 for (int i = 0; i < line.length(); i++) {
30                     String c = line.substring(i, i+1);
31
32                     // try and match our character to a lower or upper case letter
33                     if (c.matches("[a-zA-Z]")) {
34
35                         // increment the count of our character
36                         c = c.toLowerCase();
37                         freq.put(c, freq.get(c) + 1.0);
38                         count++;
39                     }
40                 }
41             }
42
43             // calculate the frequency of each of our characters, printing the result
44             for (int i = 0; i < set.length(); i++) {
45                 String key = set.substring(i, i + 1);
46                 double perc = freq.get(key) / count * 100.0;
47                 System.out.println(key + ": " + String.format("%2.2f", perc));
48             }
49
50         } catch (IOException ioe) {
51
52             // catch any problems we have reading the file from disk
53             ioe.printStackTrace();
54         }
55     }
56 }

Program Source: freq.java
Text Source: republic.txt
This text was acquired from Project Gutenberg, and
is distributed as per the license at the beginning of the text.

Compiling the example

From the command line:

javac freq.java

Running the example

java freq republic.txt