Language-detection heuristic (English, French or German) based on Unigram and Bigram models























Given a string, for example "I hate AI", I need to determine whether the sentence is in English, German or French. The unigram model bases its prediction on how frequently each character occurs in a training text, while the bigram model bases its prediction on which character follows which.
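To make the two models concrete, here is a minimal sketch that counts character unigrams and bigrams in the example string. The class `NGramCountDemo` and its two helper methods are hypothetical names introduced for illustration and are not part of the code in this question; the sketch also assumes that only the letters a-z are counted.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class NGramCountDemo {

    // Counts how often each letter a-z occurs (the unigram statistics).
    static Map<Character, Integer> unigramCounts(String text) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                counts.merge(c, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Counts how often one letter is immediately followed by another (the bigram statistics).
    // Pairs that include a non-letter, such as a space, are skipped.
    static Map<String, Integer> bigramCounts(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        String lower = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i < lower.length() - 1; i++) {
            char first = lower.charAt(i);
            char second = lower.charAt(i + 1);
            if (first >= 'a' && first <= 'z' && second >= 'a' && second <= 'z') {
                counts.merge("" + first + second, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(unigramCounts("I hate AI")); // prints {a=2, e=1, h=1, i=2, t=1}
        System.out.println(bigramCounts("I hate AI"));  // prints {ai=1, at=1, ha=1, te=1}
    }
}
```

A unigram model only sees the first map, so it cannot tell "ai" from "ia"; a bigram model sees the second map and can.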



The following code has two methods: getBigramResult() and getUnigramResult().



Both methods take an ArrayList<Character> as a parameter and return a HashMap<Language, Double> that maps each language (French, English, German) to the log probability of the given character list under that language's model. The two methods are almost the same except for:





  1. The for loop:

    for (int j = 0; j < textCharList.size() - 1; j++) // getBigramResult()

    for (int j = 0; j < textCharList.size(); j++) // getUnigramResult()

  2. The if condition:

    if (textCharList.get(j) != '+' && textCharList.get(j + 1) != '+') // getBigramResult()

    if (textCharList.get(j) != '+') // getUnigramResult()

  3. The probability-calculating method:

    temp.getConditionalProbabilty(textCharList.get(j), textCharList.get(j + 1)) // getBigramResult()

    temp.getProbabilty(textCharList.get(j)) // getUnigramResult()

  4. getBigramResult() works on a class called BiGramV2, and getUnigramResult() works on a class called Unigram.



The code of the methods is as follows:



    // Sums the log10 bigram probabilities of the text for each language model.
    public static HashMap<Language, Double> getBigramResult(ArrayList<Character> textCharList) {
        HashMap<Language, Double> totalProbabilities = new HashMap<Language, Double>();
        for (int j = 0; j < textCharList.size() - 1; j++) {
            // Skip any pair that contains the '+' marker character.
            if (textCharList.get(j) != '+' && textCharList.get(j + 1) != '+') {
                FileHandler.writeSentences("BIGRAM :" + textCharList.get(j) + "" + textCharList.get(j + 1), false);
                for (int k = 0; k < biGramList.size(); k++) {
                    BiGramV2 temp = biGramList.get(k);
                    double conditionalProbability = Math.log10(temp.getConditionalProbabilty(textCharList.get(j),
                            textCharList.get(j + 1)));
                    updateTotalProbabilities(totalProbabilities, temp.getLanguage(), conditionalProbability);
                    FileHandler.writeSentences(temp.getLanguage().toString() + ": p(" + textCharList.get(j + 1) + "|" + textCharList.get(j) + ") =" + conditionalProbability + "==> log prob of sentence so far: " + totalProbabilities.get(temp.getLanguage()), false);
                }
                FileHandler.writeSentences("", false);
            }
        }
        return totalProbabilities;
    }

    // Sums the log10 unigram probabilities of the text for each language model.
    public static HashMap<Language, Double> getUnigramResult(ArrayList<Character> textCharList) {
        HashMap<Language, Double> totalProbabilities = new HashMap<Language, Double>();
        for (int j = 0; j < textCharList.size(); j++) {
            // Skip the '+' marker character.
            if (textCharList.get(j) != '+') {
                FileHandler.writeSentences("UNIGRAM :" + textCharList.get(j), false);
                for (int k = 0; k < uniGramList.size(); k++) {
                    Unigram temp = uniGramList.get(k);
                    double conditionalProbability = Math.log10(temp.getProbabilty(textCharList.get(j)));
                    updateTotalProbabilities(totalProbabilities, temp.getLanguage(), conditionalProbability);
                    FileHandler.writeSentences(temp.getLanguage().toString() + ": p(" + textCharList.get(j) + ") =" + conditionalProbability + "==> log prob of sentence so far: " + totalProbabilities.get(temp.getLanguage()), false);
                }
                FileHandler.writeSentences("", false);
            }
        }
        return totalProbabilities;
    }
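Since each method appears to accumulate a sum of base-10 logarithms per language, the most likely language is the one with the largest (least negative) total. The following is a minimal sketch of that final step. The class `PickLanguageDemo` and the method `mostLikely` are hypothetical, the nested enum is only assumed to mirror the `Language` enum used above, and the scores are made-up illustration values rather than real output.

```java
import java.util.EnumMap;
import java.util.Map;

class PickLanguageDemo {

    // Assumed to mirror the question's enum of ENGLISH, FRENCH and GERMAN.
    enum Language { ENGLISH, FRENCH, GERMAN }

    // Returns the language with the highest total log probability, or null for an empty map.
    static Language mostLikely(Map<Language, Double> totalLogProbabilities) {
        Language best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Language, Double> entry : totalLogProbabilities.entrySet()) {
            if (best == null || entry.getValue() > bestScore) {
                best = entry.getKey();
                bestScore = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up scores for illustration only.
        Map<Language, Double> scores = new EnumMap<>(Language.class);
        scores.put(Language.ENGLISH, -9.8);
        scores.put(Language.FRENCH, -11.2);
        scores.put(Language.GERMAN, -10.5);
        System.out.println(mostLikely(scores)); // prints ENGLISH
    }
}
```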


Both of the above methods, getBigramResult() and getUnigramResult(), are very similar, which feels like poor design, but I have not been able to refactor them because the outer for loop, the if condition, and the probability-calculating method all differ.
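To make the overlap concrete, the sketch below shows one possible shape for a shared loop, in which the only model-specific part is a function from a position in the text to a log probability (or nothing, when that position should be skipped). This is an illustration under assumptions, not the code from this question: the interface `PositionScorer`, the class `SharedLoopDemo`, and the constant-probability toy scorers are all hypothetical, and the logging calls are left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalDouble;

class SharedLoopDemo {

    // Hypothetical abstraction: scores position j of the text, or returns empty to skip it.
    interface PositionScorer {
        OptionalDouble logProbabilityAt(List<Character> text, int j);
    }

    // The loop both methods have in common: visit every position and sum the scores.
    static double totalLogProbability(List<Character> text, PositionScorer scorer) {
        double total = 0.0;
        for (int j = 0; j < text.size(); j++) {
            OptionalDouble score = scorer.logProbabilityAt(text, j);
            if (score.isPresent()) {
                total += score.getAsDouble();
            }
        }
        return total;
    }

    // Toy unigram scorer: every letter gets probability 0.1; '+' is skipped.
    static final PositionScorer UNIGRAM = (text, j) ->
            text.get(j) != '+' ? OptionalDouble.of(Math.log10(0.1)) : OptionalDouble.empty();

    // Toy bigram scorer: every letter pair gets probability 0.01; pairs touching '+'
    // and the last position (which has no successor) are skipped.
    static final PositionScorer BIGRAM = (text, j) ->
            j + 1 < text.size() && text.get(j) != '+' && text.get(j + 1) != '+'
                    ? OptionalDouble.of(Math.log10(0.01))
                    : OptionalDouble.empty();

    public static void main(String[] args) {
        List<Character> text = Arrays.asList('i', '+', 'a', 'm');
        System.out.println(totalLogProbability(text, UNIGRAM)); // three letters are scored
        System.out.println(totalLogProbability(text, BIGRAM));  // only the pair (a, m) is scored
    }
}
```

In this shape, the differing loop bound and if condition both move into the scorer, so the loop itself no longer needs to know which model it is running.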



My BiGramV2 class:



    import java.util.Arrays;
    import java.util.List;

    public class BiGramV2 {
        private double delta; // smoothing constant added to every count
        private Language language;

        public BiGramV2(double delta, Language language) {
            this.delta = delta;
            this.language = language;
        }

        public static List<Character> dictCharacters = Arrays.asList('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k',
                'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z');
        // storage[row][column] counts how often character `column` follows character `row`.
        double[][] storage = new double[dictCharacters.size()][dictCharacters.size()];
        // countOfRows[row] is the number of counted bigrams that start with character `row`.
        private double[] countOfRows = new double[dictCharacters.size()];

        public void fit(List<Character> characters) {
            for (int i = 0; i < characters.size() - 1; i++) {
                if (characters.get(i) != '+' && characters.get(i + 1) != '+') {
                    int rowNo = dictCharacters.indexOf(characters.get(i));
                    int columnNo = dictCharacters.indexOf(characters.get(i + 1));
                    storage[rowNo][columnNo]++;
                    countOfRows[rowNo]++;
                }
            }
        }

        public Language getLanguage() {
            return language; // Enum of GERMAN, FRENCH and ENGLISH
        }

        // Add-delta smoothed estimate of p(second | first).
        public double getConditionalProbabilty(char first, char second) {
            int rowNo = dictCharacters.indexOf(first);
            int columnNo = dictCharacters.indexOf(second);
            double numerator = storage[rowNo][columnNo] + delta;
            double denominator = countOfRows[rowNo] + (delta * dictCharacters.size());
            double conditionalProbability = numerator / denominator;
            return conditionalProbability;
        }
    }
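Written as a formula, `getConditionalProbabilty` computes an add-delta (additive) smoothed estimate. The symbols are notation introduced here for illustration: $C(c_1 c_2)$ is `storage[rowNo][columnNo]`, $C(c_1)$ is `countOfRows[rowNo]`, $\delta$ is `delta`, and $|V| = 26$ is `dictCharacters.size()`.

```latex
P(c_2 \mid c_1) = \frac{C(c_1 c_2) + \delta}{C(c_1) + \delta \, |V|}
```

As a worked example with made-up counts, $C(c_1 c_2) = 3$, $C(c_1) = 10$ and $\delta = 0.5$ give $(3 + 0.5) / (10 + 0.5 \cdot 26) = 3.5 / 23 \approx 0.152$. A pair that was never seen still gets a non-zero probability, so `Math.log10` is never applied to zero as long as $\delta > 0$.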


And my Unigram class:



    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;

    public class Unigram {

        // Maps each character to the number of times it was seen during fit().
        HashMap<Character, Integer> storage = new HashMap<Character, Integer>();
        private double delta; // smoothing constant added to every count
        private Language language;
        private int noOfCharacters = 0;
        public static List<Character> dictCharacters = Arrays.asList('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k',
                'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z');

        public Unigram(double delta, Language language) {
            this.delta = delta;
            this.language = language;
        }

        public Language getLanguage() {
            return language;
        }

        public void fit(List<Character> characters) {
            for (int i = 0; i < characters.size(); i++) {
                if (characters.get(i) != '+') {
                    storage.put(characters.get(i), storage.getOrDefault(characters.get(i), 0) + 1);
                    noOfCharacters++;
                }
            }
        }

        // Add-delta smoothed estimate of p(first).
        public double getProbabilty(char first) {
            // getOrDefault avoids a NullPointerException for a character never seen in training.
            double numerator = storage.getOrDefault(first, 0) + delta;
            double denominator = noOfCharacters + (delta * dictCharacters.size());
            double probability = numerator / denominator;
            return probability;
        }
    }
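Written as a formula, `getProbabilty` computes the add-delta estimate below. The symbols are notation introduced here for illustration: $C(c)$ is the stored count for character $c$, $N$ is `noOfCharacters`, $\delta$ is `delta`, and $|V| = 26$ is `dictCharacters.size()`. Assuming every counted character is one of the 26 dictionary letters, so that the counts add up to $N$, the estimates sum to one:

```latex
P(c) = \frac{C(c) + \delta}{N + \delta \, |V|},
\qquad
\sum_{c \in V} P(c) = \frac{N + \delta \, |V|}{N + \delta \, |V|} = 1
```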



Any suggestions on my code would be appreciated.










  • Welcome to Code Review! What task does this code accomplish? Please tell us, and also make that the title of the question via edit. Maybe you missed the placeholder on the title element: "State the task that your code accomplishes. Make your title distinctive." Also from How to Ask: "State what your code does in your title, not your main concerns about it."
    – Sᴀᴍ Onᴇᴌᴀ
    2 days ago










  • @SᴀᴍOnᴇᴌᴀ Do you think the edits I made are OK?
    – dividedbyzero
    2 days ago










  • Please update the title to express what the code does, not your concerns about the code.
    – bruglesco
    2 days ago










  • @bruglesco Do you think it's OK now?
    – dividedbyzero
    2 days ago










  • The title is better. I think it's also a good question, but it's a bit of a grey area. I can't vote to reopen, however, so it will be up to the rest of the community. Good luck!
    – bruglesco
    2 days ago















java natural-language-processing






edited yesterday





















asked 2 days ago









dividedbyzero
