java - Determining whether a string is a proper noun in text


I'm trying to parse a text (http://pastebin.com/raw.php?i=0wd91r2i) and retrieve its words and their number of occurrences. However, I must not include proper nouns in the final output. I'm not quite sure how to accomplish this task.

My attempt at this:

import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Scanner;

// Note: the Word class used below is not shown in the post.
public class TextAnalysis {
    public static void main(String[] args)
    {
        ArrayList<Word> words = new ArrayList<Word>(); // instantiate the list of Word objects
        try
        {
            int lineCount = 0;
            int wordCount = 0;
            int specialWord = 0;
            URL reader = new URL("http://pastebin.com/raw.php?i=0wd91r2i");
            Scanner in = new Scanner(reader.openStream());
            while (in.hasNextLine()) // while there is text left to parse
            {
                lineCount++;
                // use a regex to replace punctuation with an empty string, then split the words on whitespace
                String textInfo[] = in.nextLine().replaceAll("[^a-zA-Z ]", "").split("\\s+");
                wordCount += textInfo.length;

                for (int i = 0; i < textInfo.length; i++)
                {
                    // if the word matches a special word, add to the count of special words and continue with the next word
                    if (textInfo[i].toLowerCase().matches("the|a|an|and|but|or|by|to|for|of|with|without|chapter|[0-9]+"))
                    {
                        specialWord++;
                        continue;
                    }
                    if (!textInfo[i].matches(".*\\w.*")) continue; // also skip tokens that contain no word characters
                    boolean found = false;
                    for (Word word : words) // check whether the word already exists in the list; if so, add to its count
                    {
                        if (word.getWord().equals(textInfo[i]))
                        {
                            word.addOccurence(1);
                            word.addLine(lineCount);
                            found = true;
                        }
                    }
                    if (!found) // otherwise add a new entry
                    {
                        words.add(new Word(textInfo[i], lineCount, 1));
                    }
                }
            }
            // adds the data of a capitalized word to its lowercase word; my attempt at handling proper nouns
            for (Word word : words)
            {
                for (int i = 0; i < words.size(); i++)
                {
                    if (Character.isUpperCase(word.getWord().charAt(0)) && word.getWord().toLowerCase().equals(words.get(i).getWord()))
                    {
                        words.get(i).addOccurence(word.getOccurence());
                        words.get(i).addLine(word.getLine());
                    }
                }
            }

            Comparator<Word> occurenceComparator = new Comparator<Word>() // compares entries by number of occurrences, descending
            {
                public int compare(Word n1, Word n2)
                {
                    if (n1.getOccurence() < n2.getOccurence()) return 1;
                    else if (n1.getOccurence() == n2.getOccurence()) return 0;
                    else return -1;
                }
            };
            Collections.sort(words);
            // Collections.sort(words, occurenceComparator);
            // ArrayList<Word> top_words = new ArrayList<Word>(words.subList(0,100));
            // Collections.sort(top_words);
            System.out.printf("%-15s%-15s%s\n", "word", "occurences", "word distribution index");
            for (Word word : words)
            {
                word.setTotalLine(lineCount);
                System.out.println(word);
            }
            System.out.println(wordCount);
            System.out.printf("%s%.3f\n", "the connecting word index ", specialWord * 100.0 / wordCount);
        }
        catch (IOException ex)
        {
            System.out.println("web url not found");
        }
    }
}

The formatting is kind of off; I'm not sure how to do it correctly.

This determines whether a word is capitalized and, if a lowercase version of the word exists, adds its data to the lowercase word. However, it does not account for words whose lowercase version never appears in the text, such as "Four" or "Now". How might I go about this without cross-referencing a dictionary?
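To make the gap concrete, here is a minimal sketch of a variant of the merge step described above, using a map instead of the `Word` list. The class name `MergeDemo`, the method `mergeCapitalized`, and the sample sentence are made up for illustration and are not part of the original program. Unlike the loop in the question, it drops the capitalized entry once it has been folded into the lowercase one.

```java
import java.util.Map;
import java.util.TreeMap;

public class MergeDemo {
    // Counts tokens, then folds each capitalized entry into its lowercase
    // twin when that twin occurs somewhere in the text.
    static Map<String, Integer> mergeCapitalized(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        // Same punctuation-stripping regex as in the question.
        for (String token : text.replaceAll("[^a-zA-Z ]", "").split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        Map<String, Integer> merged = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String word = e.getKey();
            String lower = word.toLowerCase();
            boolean capitalized = Character.isUpperCase(word.charAt(0));
            // Fold into the lowercase form only if that form occurs in the text.
            String key = (capitalized && counts.containsKey(lower)) ? lower : word;
            merged.merge(key, e.getValue(), Integer::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical sample text, not the pastebin document.
        System.out.println(mergeCapitalized("The cat saw Alice. Now the cat ran."));
    }
}
```

In this sample, "The" is folded into "the", but "Now" stays as its own capitalized entry because "now" never occurs, so capitalization alone cannot tell it apart from "Alice".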

Edit: I have solved the problem myself.

Thank you, however, to Wes for attempting to answer.

It seems your algorithm assumes that a word that appears capitalized and never appears uncapitalized is a proper noun. If that's the case, you can use the following algorithm to find the proper nouns.

// Assume the whole file has been tokenized into a collection called allWords.
HashSet<String> lowercaseWords = new HashSet<>();
HashMap<String, String> lowerToCap = new HashMap<>();
for (String word : allWords) {
    if (word.isEmpty()) continue; // guard: charAt(0) would throw on an empty token
    if (Character.isUpperCase(word.charAt(0))) {
        lowerToCap.put(word.toLowerCase(), word);
    } else {
        lowercaseWords.add(word.toLowerCase());
    }
}

// Remove every capitalized word that also appears in lowercase; only proper nouns are left.
lowerToCap.keySet().removeAll(lowercaseWords);
for (String properNoun : lowerToCap.values()) {
    System.out.println("proper noun: " + properNoun);
}
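Tying this back to the question, the same heuristic can be turned around to count words while leaving the proper nouns out. The following is only a sketch under that assumption (a word is a proper noun if it never appears with a lowercase first letter); the class name `ProperNounFilter`, the method `countCommonWords`, and the sample sentence are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class ProperNounFilter {
    // Returns lowercase word counts, skipping every word that only ever
    // appears capitalized (treated as a proper noun under this heuristic).
    static Map<String, Integer> countCommonWords(String text) {
        // Strip everything except letters and whitespace, then split on whitespace.
        String[] tokens = text.replaceAll("[^a-zA-Z\\s]", "").split("\\s+");

        // First pass: remember every word seen with a lowercase first letter.
        Set<String> seenLowercase = new HashSet<>();
        for (String token : tokens) {
            if (!token.isEmpty() && Character.isLowerCase(token.charAt(0))) {
                seenLowercase.add(token.toLowerCase());
            }
        }

        // Second pass: count only words that have a lowercase occurrence,
        // folding capitalized occurrences into the lowercase form.
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : tokens) {
            String lower = token.toLowerCase();
            if (seenLowercase.contains(lower)) {
                counts.merge(lower, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample text, not the pastebin document.
        System.out.println(countCommonWords("Alice saw the cat. The cat saw Alice."));
    }
}
```

Here "Alice" is dropped and "The" is counted together with "the". The limitation raised in the question still applies: a word that only ever occurs at the start of a sentence, such as "Now", is also treated as a proper noun and dropped.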
