java - Determining whether a string is a proper noun in text
I'm trying to parse a text (http://pastebin.com/raw.php?i=0wd91r2i) and retrieve the words and their number of occurrences. However, I must not include proper nouns in the final output. I'm not quite sure how to accomplish this task.
My attempt at this:
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Scanner;

// Word is my own class (not shown here).
public class TextAnalysis {
    public static void main(String[] args) {
        ArrayList<Word> words = new ArrayList<Word>(); // list of Word objects
        try {
            int lineCount = 0;
            int wordCount = 0;
            int specialWord = 0;
            URL reader = new URL("http://pastebin.com/raw.php?i=0wd91r2i");
            Scanner in = new Scanner(reader.openStream());
            while (in.hasNextLine()) { // while there is text left to parse
                lineCount++;
                // use a regex to replace punctuation with nothing, then split on whitespace
                String[] textInfo = in.nextLine().replaceAll("[^a-zA-Z ]", "").split("\\s+");
                wordCount += textInfo.length;
                for (int i = 0; i < textInfo.length; i++) {
                    // if the word is a special (connecting) word, count it and continue with the next word
                    if (textInfo[i].toLowerCase().matches("the|a|an|and|but|or|by|to|for|of|with|without|chapter|[0-9]+")) {
                        specialWord++;
                        continue;
                    }
                    // also skip tokens that contain no word characters
                    if (!textInfo[i].matches(".*\\w.*")) continue;
                    boolean found = false;
                    for (Word word : words) { // if the word is already in the list, add to its count
                        if (word.getWord().equals(textInfo[i])) {
                            word.addOccurence(1);
                            word.addLine(lineCount);
                            found = true;
                        }
                    }
                    if (!found) { // otherwise add a new entry
                        words.add(new Word(textInfo[i], lineCount, 1));
                    }
                }
            }
            // add the data of each capitalized word to its lowercase form (my attempt at proper nouns)
            for (Word word : words) {
                for (int i = 0; i < words.size(); i++) {
                    if (Character.isUpperCase(word.getWord().charAt(0))
                            && word.getWord().toLowerCase().equals(words.get(i).getWord())) {
                        words.get(i).addOccurence(word.getOccurence());
                        words.get(i).addLine(word.getLine());
                    }
                }
            }
            // compares words by number of occurrences, highest first
            Comparator<Word> occurenceComparator = new Comparator<Word>() {
                public int compare(Word n1, Word n2) {
                    if (n1.getOccurence() < n2.getOccurence()) return 1;
                    else if (n1.getOccurence() == n2.getOccurence()) return 0;
                    else return -1;
                }
            };
            Collections.sort(words);
            // Collections.sort(words, occurenceComparator);
            // ArrayList<Word> top_words = new ArrayList<Word>(words.subList(0,100));
            // Collections.sort(top_words);
            System.out.printf("%-15s%-15s%s\n", "word", "occurences", "word distribution index");
            for (Word word : words) {
                word.setTotalLine(lineCount);
                System.out.println(word);
            }
            System.out.println(wordCount);
            System.out.printf("%s%.3f\n", "the connecting word index ", specialWord * 100.0 / wordCount);
        } catch (IOException ex) {
            System.out.println("web url not found");
        }
    }
}
The formatting is kind of off; I'm not sure how to fix it.
The code determines whether a word is capitalized and, if a lowercase version of the word exists, adds its data to the lowercase word. However, this does not account for words whose lowercase version never appears in the text, such as "Four" or "Now". How might I go about this without cross-referencing a dictionary?
Edit: I have solved the problem myself.
Thank you, however, Wes, for attempting to answer.
It seems your algorithm assumes that a word which appears capitalized and never appears uncapitalized is a proper noun. If that's the case, you can use the following algorithm to find the proper nouns.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Assume you have tokenized the whole file into a collection called allWords.
Set<String> lowercaseWords = new HashSet<>();
Map<String, String> lowerToCap = new HashMap<>();
for (String word : allWords) {
    if (word.isEmpty()) {
        continue; // skip empty tokens so charAt(0) cannot throw
    }
    if (Character.isUpperCase(word.charAt(0))) {
        lowerToCap.put(word.toLowerCase(), word);
    } else {
        lowercaseWords.add(word.toLowerCase());
    }
}
// Drop every capitalized word that also appears uncapitalized; the proper nouns are left.
lowerToCap.keySet().removeAll(lowercaseWords);
for (String properNoun : lowerToCap.values()) {
    System.out.println("proper noun: " + properNoun);
}
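To try the idea end to end, here is a minimal self-contained sketch of the same heuristic. The class name `ProperNounSketch`, the method `properNounCandidates`, and the sample sentence are made up for illustration and are not from the question or the linked text. It assumes the tokens have already been stripped of punctuation, as in the question's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ProperNounSketch {

    /**
     * Returns the words that appear capitalized but never in lowercase form,
     * in the order they were first seen. Hypothetical helper for illustration.
     */
    public static List<String> properNounCandidates(List<String> allWords) {
        Set<String> lowercaseWords = new HashSet<>();
        // LinkedHashMap keeps first-seen order so the output is deterministic
        Map<String, String> lowerToCap = new LinkedHashMap<>();
        for (String word : allWords) {
            if (word.isEmpty()) {
                continue; // split() can yield an empty first token
            }
            if (Character.isUpperCase(word.charAt(0))) {
                lowerToCap.putIfAbsent(word.toLowerCase(), word);
            } else {
                lowercaseWords.add(word.toLowerCase());
            }
        }
        // Remove capitalized words that were also seen in lowercase
        lowerToCap.keySet().removeAll(lowercaseWords);
        return new ArrayList<>(lowerToCap.values());
    }

    public static void main(String[] args) {
        // Made-up sample text, not the text from the question
        String text = "Four days later Alice met Bob. The days were long "
                + "and the nights cold. Now Alice rests.";
        List<String> tokens =
                Arrays.asList(text.replaceAll("[^a-zA-Z ]", "").split("\\s+"));
        System.out.println(properNounCandidates(tokens));
    }
}
```

On this sample it prints [Four, Alice, Bob, Now]. "The" is filtered out because "the" also occurs, but "Four" and "Now" are still reported because they only ever appear at the start of a sentence. That is exactly the limitation described in the question, so the heuristic alone does not resolve it.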