machine learning - Weird results with the randomForest R package
I have a data frame with 10,000 rows and 2 columns: segment (a factor with 32 values) and target (a factor with 2 values, 'yes' and 'no', about 5,000 of each). I am trying to use a random forest to classify target using segment as a feature.
After training a random forest classifier:
> forest <- randomForest(target ~ segment, data)
the confusion matrix is heavily biased toward 'no':
> print(forest$confusion)
      no yes class.error
no  4872  76  0.01535974
yes 5033  19  0.99623911
Out of 10,000 rows, fewer than 100 were classified as 'yes' (even though the original counts are roughly 50/50). If I switch the names of the labels, I get the opposite result:
> data$target <- as.factor(ifelse(data$target == 'yes', 'no', 'yes'))
> forest <- randomForest(target ~ segment, data = data)
> print(forest$confusion)
      no yes class.error
no  4915 137  0.02711797
yes 4810 138  0.97210994
So this is not a real signal. Furthermore, the original cross-table is relatively balanced:
> table(data$target, data$segment)
       1   10   11   12   13   14   15   16   17   18   19    2   20   21   22   23
no  1074  113  121   86   68  165  210   70  120  127  101  132   90  108  171  122
yes 1114  105  136  120   73  201  209   78  130  124   90  145   81  104  155  128
      24   25   26   27   28   29    3   30   31   32    4    5    6    7    8    9
no    95   95   76   72  105   71  234   58   83   72  290  162  262  192   64  139
yes   79   85   83   70   93   78  266   70   93   76  291  160  235  194   49  137
It looks like randomForest takes the first label and assigns almost all points to it. To clarify, this data frame is a subset of a larger table with more features; I found out that this specific feature somehow leads to this result, no matter how many other features are included. I am wondering whether I am missing something basic about the random forest classifier, or whether there is an encoding issue or some other bug that leads to this weird result.
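As a quick sanity check on the two confusion matrices above, the sketch below recomputes the overall error rate and the number of rows predicted as 'yes' in each run. The matrices are typed in by hand from the printed output, and `overall_error` is a hypothetical helper name, not part of the randomForest package.

```r
# Confusion matrices copied from the printed output above
# (rows = actual class, columns = predicted class).
conf_original <- matrix(c(4872, 76,
                          5033, 19),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(actual = c("no", "yes"),
                                        predicted = c("no", "yes")))
conf_swapped <- matrix(c(4915, 137,
                         4810, 138),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(actual = c("no", "yes"),
                                       predicted = c("no", "yes")))

# Hypothetical helper: share of rows that fall off the diagonal.
overall_error <- function(conf) 1 - sum(diag(conf)) / sum(conf)

overall_error(conf_original)   # 0.5109
overall_error(conf_swapped)    # 0.4947

# Rows predicted as "yes" in each run.
colSums(conf_original)["yes"]  # 95
colSums(conf_swapped)["yes"]   # 275
```

Both runs sit close to a 50% error rate, so the model is near chance either way; only the direction of the bias follows the label order.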
The original dataset is available as an RDS file here:
https://www.dropbox.com/s/rjq6lmvd78d6aot/weird_random_forest.rds?dl=0
Thank you!
Your data frame is balanced in the sense that "yes" and "no" are roughly equally likely overall. However, the value of segment contains almost no information about the value of target, in the sense that "yes" and "no" are roughly equally likely for all levels of segment, so there's no reason to expect good predictions from a random forest or any other procedure.
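To put a number on how little segment says about target, here is a minimal sketch in base R that works only from the cross-table printed in the question. The counts are typed in by hand, and `yes_share` and `best_accuracy` are illustrative names. `best_accuracy` is the in-sample accuracy of a rule that always predicts the majority class within each segment level, which is the most any classifier could get from this one feature on these rows.

```r
# Counts copied from the table(data$target, data$segment) output in the question.
segment <- c(1, 10:19, 2, 20:29, 3, 30:32, 4:9)
no  <- c(1074, 113, 121,  86,  68, 165, 210,  70, 120, 127, 101, 132,  90, 108, 171, 122,
           95,  95,  76,  72, 105,  71, 234,  58,  83,  72, 290, 162, 262, 192,  64, 139)
yes <- c(1114, 105, 136, 120,  73, 201, 209,  78, 130, 124,  90, 145,  81, 104, 155, 128,
           79,  85,  83,  70,  93,  78, 266,  70,  93,  76, 291, 160, 235, 194,  49, 137)
names(no) <- names(yes) <- as.character(segment)

# Share of "yes" within each segment level, sorted from lowest to highest.
yes_share <- yes / (no + yes)
round(sort(yes_share), 3)

# In-sample accuracy of always predicting the majority class per level.
best_accuracy <- sum(pmax(no, yes)) / sum(no + yes)
best_accuracy  # 0.519
```

An upper bound of about 52% accuracy is barely better than guessing, which is consistent with the near-50% error rates in the confusion matrices.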
If you convert segment to numeric, then randomForest predicts "yes" about 65% of the time. About 63% of the data are in values of segment for which "yes" is (slightly) more probable than "no", so that may explain the high rate of "yes" predictions when segment is numeric. But whether segment is numeric or a factor, the overall error rate is about the same. I'm not sure why randomForest is almost always choosing "no" when segment is a factor.
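The 63% figure can be reproduced from the cross-table in the question without fitting any model. The following is a sketch in base R with the counts typed in by hand; `counts`, `yes_majority` and `share_in_yes_majority` are illustrative names.

```r
# Counts copied from the table(data$target, data$segment) output in the question;
# columns follow the printed order of segment levels (1, 10, 11, ..., 9).
counts <- rbind(
  no  = c(1074, 113, 121,  86,  68, 165, 210,  70, 120, 127, 101, 132,  90, 108, 171, 122,
            95,  95,  76,  72, 105,  71, 234,  58,  83,  72, 290, 162, 262, 192,  64, 139),
  yes = c(1114, 105, 136, 120,  73, 201, 209,  78, 130, 124,  90, 145,  81, 104, 155, 128,
            79,  85,  83,  70,  93,  78, 266,  70,  93,  76, 291, 160, 235, 194,  49, 137)
)

# Segment levels in which "yes" outnumbers "no".
yes_majority <- counts["yes", ] > counts["no", ]
sum(yes_majority)  # 17 of the 32 levels

# Share of all rows that fall into those levels.
share_in_yes_majority <- sum(counts[, yes_majority]) / sum(counts)
share_in_yes_majority  # 0.631
```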