r - How do you sample groups in a data.table with a caveat


This question is similar to "How do you sample random rows within each group in a data.table?".

The difference is a minor subtlety that I did not have enough reputation to discuss on the question itself.

Let's change Christopher Manning's initial data a little bit:

    > dt = data.table(a=c(1,1,1,1:15,1,1), b=sample(1:1000,20))
    > dt
         a   b
     1:  1 102
     2:  1   5
     3:  1 658
     4:  1 499
     5:  2 632
     6:  3 186
     7:  4 761
     8:  5 150
     9:  6 423
    10:  7 832
    11:  8 883
    12:  9 247
    13: 10 894
    14: 11 141
    15: 12 891
    16: 13 488
    17: 14 101
    18: 15 677
    19:  1 400
    20:  1 467

If I try that question's solution:

    > dt[,.SD[sample(.N,3)],by = a]
    Error in sample.int(x, size, replace, prob) :
      cannot take a sample larger than the population when 'replace = FALSE'

This is because there are values in column a that occur only once. You cannot sample 3 times from values that occur fewer than 3 times without using replacement (which I do not want to do).
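The failure can be reproduced without data.table at all. The following is a minimal base R sketch, where `group_size` is a hypothetical stand-in for `.N` in a one-row group; it shows that asking for 3 draws from 1 row index fails, while capping the request at the group size does not:

```r
# Base R only: sampling without replacement cannot draw more items than exist.
group_size <- 1   # hypothetical stand-in for .N in a single-row group

# Asking for 3 of 1 row indices raises an error; capture its message.
res <- tryCatch(
  sample(group_size, 3),
  error = function(e) conditionMessage(e)
)
print(res)

# Capping the request at the group size always succeeds.
capped <- sample(group_size, min(group_size, 3))
print(capped)
```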

I am struggling to deal with this scenario. I want to sample 3 times when the number of occurrences is >= 3, but pull only the number of occurrences if it is < 3. For example, with our dt above I would want something like:

         a   b
     1:  1 102
     2:  1   5
     3:  1 658
     4:  2 632
     5:  3 186
     6:  4 761
     7:  5 150
     8:  6 423
     9:  7 832
    10:  8 883
    11:  9 247
    12: 10 894
    13: 11 141
    14: 12 891
    15: 13 488
    16: 14 101
    17: 15 677

Maybe a solution could involve sorting the data.table like this, then using the lengths from rle() to find out which n to use in the sample function above:

    > dt <- dt[order(dt$a),]
    > dt
         a   b
     1:  1 102
     2:  1   5
     3:  1 658
     4:  1 499
     5:  1 400
     6:  1 467
     7:  2 632
     8:  3 186
     9:  4 761
    10:  5 150
    11:  6 423
    12:  7 832
    13:  8 883
    14:  9 247
    15: 10 894
    16: 11 141
    17: 12 891
    18: 13 488
    19: 14 101
    20: 15 677
    > ifelse(rle(dt$a)$lengths >= 3, 3,rle(dt$a)$lengths)
     [1] 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1

If I replace "3" with n, this returns how many rows I should sample from a=1, a=2, a=3... but I have yet to find a way to incorporate this into a final solution. Any help would be appreciated!

I might be misunderstanding your question, but are you looking for something like this?

    set.seed(123)
    ##
    dt <- data.table(
      a=c(1,1,1,1:15,1,1),
      b=sample(1:1000,20))
    ##
    R> dt[,.SD[sample(.N,min(.N,3))],by = a]
         a   b
     1:  1 288
     2:  1 881
     3:  1 409
     4:  2 937
     5:  3  46
     6:  4 525
     7:  5 887
     8:  6 548
     9:  7 453
    10:  8 948
    11:  9 449
    12: 10 670
    13: 11 566
    14: 12 102
    15: 13 993
    16: 14 243
    17: 15  42

Here we are drawing 3 samples of b from group a_i if a_i contains 3 or more values; otherwise we draw only n values, where n (n < 3) is the size of group a_i.
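To check that rule, the number of rows drawn per group can be compared with the original group size. This is only a sketch: the names `sampled`, `counts`, `size` and `drawn` are introduced here for illustration, and the values of b you get depend on your R version's random number generator, so no output is shown.

```r
library(data.table)

set.seed(123)
dt <- data.table(a = c(1, 1, 1, 1:15, 1, 1), b = sample(1:1000, 20))

# Sample up to 3 rows per group; groups smaller than 3 are returned whole.
sampled <- dt[, .SD[sample(.N, min(.N, 3))], by = a]

# Put the original group size next to the number of rows actually drawn.
counts <- merge(dt[, .(size = .N), by = a],
                sampled[, .(drawn = .N), by = a],
                by = "a")
print(counts)
```

Every row of `counts` should satisfy `drawn == pmin(size, 3)`.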

Just for demonstration, here are the 6 possible values of b for a=1 that we are sampling from (assuming you use the same random seed as above):

    R> dt[order(a)][1:6,]
       a   b
    1: 1 288
    2: 1 788
    3: 1 409
    4: 1 881
    5: 1 323
    6: 1 996
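For comparison, the rle() idea from the question can also be carried through to the end. The sketch below is one possible way to do it in base R on a plain data.frame, not the method used in the answer above; `df`, `runs`, `n_draw`, `starts`, `keep` and `result` are illustrative names, and `pmin()` stands in for the `ifelse()` call.

```r
set.seed(123)
df <- data.frame(a = c(1, 1, 1, 1:15, 1, 1), b = sample(1:1000, 20))
df <- df[order(df$a), ]              # rle() needs equal values to be adjacent

runs <- rle(df$a)
n_draw <- pmin(runs$lengths, 3)      # 3 for large groups, the group size otherwise

# Row position where each run starts in the sorted data.
starts <- cumsum(c(1, head(runs$lengths, -1)))

# For each run, sample k offsets within the run and shift them to row positions.
keep <- unlist(Map(function(start, len, k) start + sample(len, k) - 1,
                   starts, runs$lengths, n_draw))

result <- df[sort(keep), ]
print(result)
```

The grouped `min(.N, 3)` call does the same job in one line, which is why it is usually the more convenient choice with data.table.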
