machine learning - Weird results with the randomForest R package


I have a data frame with 10,000 rows and 2 columns: segment (a factor with 32 values) and target (a factor with 2 values, 'yes' and 'no', 5,000 of each). I am trying to use random forest to classify target using the segment feature.

After training the random forest classifier:

> forest <- randomForest(target ~ segment, data)

the confusion matrix is heavily biased toward 'no':

> print(forest$confusion)
      no yes class.error
no  4872  76  0.01535974
yes 5033  19  0.99623911

Out of 10,000 rows, fewer than 100 were classified as 'yes' (even though the original counts are 50/50). If I switch the names of the labels, I get the opposite result:

> data$target <- as.factor(ifelse(data$target == 'yes', 'no', 'yes'))
> forest <- randomForest(target ~ segment, data = data)
> print(forest$confusion)
      no yes class.error
no  4915 137  0.02711797
yes 4810 138  0.97210994

So it's not a real signal... Furthermore, the original cross-table is relatively balanced:

> table(data$target, data$segment)
          1   10   11   12   13   14   15   16   17   18   19    2   20   21   22   23   24   25   26   27   28   29    3   30   31   32    4    5    6    7    8    9
  no   1074  113  121   86   68  165  210   70  120  127  101  132   90  108  171  122   95   95   76   72  105   71  234   58   83   72  290  162  262  192   64  139
  yes  1114  105  136  120   73  201  209   78  130  124   90  145   81  104  155  128   79   85   83   70   93   78  266   70   93   76  291  160  235  194   49  137

It looks as if randomForest simply takes the first label and assigns almost all points to it. To clarify, this data frame is a subset of a larger table with more features. I found out that this specific feature somehow leads to this result, no matter how many other features are included. I am wondering whether I am missing something basic about random forest classifiers, or whether there is an encoding issue or some other bug that leads to this weird result.

The original dataset is available as an RDS file here:

https://www.dropbox.com/s/rjq6lmvd78d6aot/weird_random_forest.rds?dl=0

Thank you!

Your data frame is balanced in the sense that "yes" and "no" are equally likely overall. However, the value of segment contains no information about the value of target, in the sense that "yes" and "no" are about equally likely within every level of segment, so there is no reason to expect accurate predictions from random forest or any other procedure.
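This lack of signal is easy to check directly from the cross-table. Here is a minimal sketch using base R only; the simulated data frame below is a hypothetical stand-in for the linked RDS (same shape: a 32-level segment and a roughly 50/50 target independent of it), not the actual data:

```r
set.seed(1)
# Hypothetical stand-in for the question's data frame.
data <- data.frame(
  segment = factor(sample(1:32, 10000, replace = TRUE)),
  target  = factor(sample(c("yes", "no"), 10000, replace = TRUE))
)

# Proportion of 'yes' within each level of segment.
# With no signal, every entry hovers around 0.5.
prop_yes <- prop.table(table(data$segment, data$target), margin = 1)[, "yes"]
summary(prop_yes)

# A chi-squared test of independence between segment and target
# makes the same point formally (a large p-value here means
# "no evidence segment predicts target").
chisq.test(table(data$segment, data$target))
```

Running the same two lines on the real data frame would show whether any segment level deviates meaningfully from 50/50.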

If you convert segment to numeric, randomForest predicts "yes" about 65% of the time. About 63% of the data falls in values of segment where "yes" is (slightly) more probable than "no", which may explain the high rate of "yes" predictions when segment is numeric. Whether segment is numeric or a factor, the overall error rate is about the same. I'm not sure why randomForest chooses "no" when segment is a factor.
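The factor-versus-numeric comparison can be reproduced along these lines. This is a sketch, assuming the randomForest package is installed and again using a simulated stand-in for the linked RDS:

```r
library(randomForest)  # assumed to be installed from CRAN

set.seed(1)
# Hypothetical stand-in for the question's data frame.
data <- data.frame(
  segment = factor(sample(1:32, 10000, replace = TRUE)),
  target  = factor(sample(c("yes", "no"), 10000, replace = TRUE))
)

# Fit with segment as a factor (as in the question).
forest_fac <- randomForest(target ~ segment, data = data)
print(forest_fac$confusion)

# Fit with segment coerced to numeric.
data$segment_num <- as.numeric(as.character(data$segment))
forest_num <- randomForest(target ~ segment_num, data = data)
print(forest_num$confusion)
```

In both fits the overall out-of-bag error should be close to 50%, since segment carries no information about target; only the split of that error between the two classes differs.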

