Read a .txt file and return a list of words with their frequency in the file
Ankita Pandey
Posted on    December-30-2014 12:45 AM

I have this so far but it only prints the .txt file to the screen:

import java.io.*;

public class ReadFile {
    public static void main(String[] args) throws IOException {
        String Wordlist;
        int Frequency;

        File file = new File("file1.txt");
        BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
        String line = null;

        while( (line = br.readLine()) != null) {
            String [] tokens = line.split("\\s+");
            System.out.println(line);
        }
    }
}
Can anyone help me change it so that it prints a list of the words and each word's frequency?


Samuel Fernandes
Posted on    December-30-2014 1:03 AM

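Try something like this. I wrote it assuming only commas or periods occur in the file, but note that removePunctuation strips every character that is not an ASCII letter, so digits, apostrophes, and accented letters are removed too; adjust the regex if you need to keep those. I'm using a TreeMap so that the words in the map are stored in their natural alphabetical order.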

  // imports needed at the top of the class file:
  // java.io.BufferedReader, java.io.FileReader, java.io.IOException, java.util.TreeMap

  public static TreeMap<String, Integer> generateFrequencyList()
    throws IOException {
    TreeMap<String, Integer> wordsFrequencyMap = new TreeMap<String, Integer>();
    String file = "/tmp/lorem.txt";
    // try-with-resources closes the reader even if readLine() throws
    try (BufferedReader br = new BufferedReader(new FileReader(file))) {
      String line;
      while ((line = br.readLine()) != null) {
        String[] tokens = line.split("\\s+");
        for (String token : tokens) {
          // normalize once: strip non-letters, then lower-case
          String word = removePunctuation(token).toLowerCase();
          if (word.isEmpty()) {
            // token had no letters (e.g. "2014", "--") or came from leading whitespace
            continue;
          }
          Integer count = wordsFrequencyMap.get(word);
          wordsFrequencyMap.put(word, count == null ? 1 : count + 1);
        }
      }
    }
    return wordsFrequencyMap;
  }

  private static String removePunctuation(String token) {
    // keeps ASCII letters only; everything else is dropped
    return token.replaceAll("[^a-zA-Z]", "");
  }
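To see what the splitting and punctuation stripping above do to a single line, here is a small sketch you can run on its own. The class name TokenDemo and the method normalize are made up for this illustration, and dropping empty results is an assumption of the sketch, not something the question asks for.

```java
import java.util.ArrayList;
import java.util.List;

class TokenDemo {

  // Splits on whitespace, strips everything except ASCII letters, lower-cases.
  // Empty results are dropped (an assumption of this sketch).
  static List<String> normalize(String line) {
    List<String> words = new ArrayList<>();
    for (String token : line.split("\\s+")) {
      String word = token.replaceAll("[^a-zA-Z]", "").toLowerCase();
      if (!word.isEmpty()) {
        words.add(word);
      }
    }
    return words;
  }

  public static void main(String[] args) {
    // Leading whitespace makes split() return an empty first token,
    // and "2014." has no letters left after stripping.
    System.out.println(normalize("  Hello, world! It's 2014."));
    // prints [hello, world, its]
  }
}
```

Notice that "It's" turns into "its" and "2014." disappears completely. Without the isEmpty() check, those leftovers would be counted as an empty-string "word".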
A main method for testing is shown below. To get the percentages, first sum all the values in the map to get the total word count, then make a second pass over the map to compute each word's percentage. By the way, if this is part of a larger project, you could also take a look at the Apache Commons Math library for calculating frequency distributions. If you use its Frequency class, you can keep adding all the words to it and then query the counts and percentages at the end.

  // requires java.io.IOException and java.util.Map in addition to java.util.TreeMap
  public static void main(String[] args) {
    try {
      TreeMap<String, Integer> freqMap = generateFrequencyList();

      // first pass: total number of words
      int totalWords = 0;
      for (int count : freqMap.values()) {
        totalWords += count;
      }

      // second pass: each word's share of the total
      System.out.println("Word\tCount\tPercentage");
      for (Map.Entry<String, Integer> entry : freqMap.entrySet()) {
        double percentage = entry.getValue() * 100.0 / totalWords;
        System.out.println(entry.getKey() + "\t" + entry.getValue() + "\t" + percentage);
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
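If you want to try the counting and the two-pass percentage calculation without a file on disk, the following self-contained sketch may help. It reads from any Reader, so a StringReader works for testing. The class name WordFrequency and the methods countWords and percentages are hypothetical names chosen for this example, and Map.merge assumes Java 8 or later.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;

class WordFrequency {

  // Counts words from any Reader; the TreeMap keeps keys in alphabetical order.
  static TreeMap<String, Integer> countWords(Reader source) throws IOException {
    TreeMap<String, Integer> counts = new TreeMap<>();
    try (BufferedReader br = new BufferedReader(source)) {
      String line;
      while ((line = br.readLine()) != null) {
        for (String token : line.split("\\s+")) {
          String word = token.replaceAll("[^a-zA-Z]", "").toLowerCase();
          if (!word.isEmpty()) {
            counts.merge(word, 1, Integer::sum); // insert 1, or add 1 to the existing count
          }
        }
      }
    }
    return counts;
  }

  // Second pass: converts each count into a percentage of the total.
  static TreeMap<String, Double> percentages(Map<String, Integer> counts) {
    int total = 0;
    for (int c : counts.values()) {
      total += c;
    }
    TreeMap<String, Double> result = new TreeMap<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      result.put(e.getKey(), e.getValue() * 100.0 / total);
    }
    return result;
  }

  public static void main(String[] args) throws IOException {
    TreeMap<String, Integer> counts =
        countWords(new StringReader("The cat. The dog,\nthe end"));
    TreeMap<String, Double> pct = percentages(counts);
    System.out.println("Word\tCount\tPercentage");
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      System.out.println(e.getKey() + "\t" + e.getValue() + "\t" + pct.get(e.getKey()));
    }
  }
}
```

To read a real file instead, pass new FileReader("file1.txt") to countWords. Keeping the percentages in a separate method means the counting step stays reusable if you only need the raw counts.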
