r - How to create tidy data with a Dataset where values are duplicated over many rows -


sorry if large of example. looks more real life have hard time thinking of example better explain situation.

what want tidy data.frame can use medical conditions in summaries (avg) , in plots (edited) what need answered trying accomplish correctly. want row huge string values divided commas? need split more columns?

reports our data base vendor (actual data changed). the reports not give unique key. in data.frames person.id unique in , others multi rows of person.id , values.

person.id <- c("1017", "1018", "1018", "1018", "1018", "1018", "1018",                "1018", "1018", "1018", "1018", "1019", "1019", "1020",                "1020") med.condition <- c(na, "allergic rhinitis", "allergic rhinitis",                    "atopic dermatitis", "atopic dermatitis",                    "developmental speech",                    "developmental speech",                    "eye condition", "eye condition", "speech delay",                    "speech delay", "allergic reaction", na, "eczema",                    "obese") cond.type <- c("assessment", "assessment", na, "assessment", na, "assessment",                na, "assessment", na, "assessment", na, "assessment",                "assessment", "assessment", "assessment") df <- data.frame(person.id, med.condition, cond.type) 

which looks like:

  person.id  med.condition                              cond.type 1   1017    na                                          assessment 2   1018    allergic rhinitis                           assessment 3   1018    allergic rhinitis                           na 4   1018    atopic dermatitis                           assessment 5   1018    atopic dermatitis                           na 6   1018    developmental speech                        assessment 7   1018    developmental speech                        na 8   1018    eye condition                               assessment 9   1018    eye condition                               na 10  1018    speech delay                                assessment 11  1018    speech delay                                na 12  1019    allergic reaction                           assessment 13  1019    na                                          assessment 14  1020    eczema                                      assessment 15  1020    obese                                       assessment 

i want rows equal 1 person.id

do want have (only first 5 columns shown): used taplly fails @ tidy

    condition1         condition2        condition3        condition4           condition5 1017 na                 na                na                na                   na 1018 allergic rhinitis atopic dermatitis  allergic reaction developmental speech eye condition 1019 na                 na                na                na                   na 1020 eczema             obese             na                na                   na 

what do make dataset tidy?

     med.condtion 1017 na 1018 "allergic rhinitis", "atopic dermatitis", "developmental speech", "eye condition", "speech delay", "allergic reaction"  1019 na 1020 "eczema" "obese" 

or need think of in new way?

what tired tapply, reshape2

taplly doesn't work on example in program sorry

df2 <- data.frame(person.id, med.condition, cond.type) df2.wide <- tapply(x = df2$medical.condition, index = df2$person.id,                         function(x){                           ux <- unique(x)                           c(ux, rep(x = na, 9 - length (ux)))                         }) df2.wide <- as.data.frame(do.call('rbind', df2.wide), stringsasfactors = false) names(promis.b.temp) <- paste0('condition', 1:9) 

cols <- names(promis.b.temp) df2$med.all <- apply( df2[, cols], 1, paste, collapse = ", ")

reshape2 realized wouldn't work library(reshape2) df3 <- test %>% melt() %>% unique() %>% cast(person.id)

  • am tackling issue correctly?
  • will have problems when reports having filters strings?

i don't understand question. data appear "tidy".

the 2 things notice (1) duplicated values (which may or may not wanted) , (2) lack of unique id per person , medical condition.

if wanted long string separated commas (hard deal later, in opinion), can aggregate unique values in first 2 columns, this:

library(data.table) as.data.table(unique(df[1:2]))[, paste(med.condition, collapse = ","), = person.id] #    person.id                                                                                  v1 # 1:      1017                                                                                  na # 2:      1018 allergic rhinitis,atopic dermatitis,developmental speech,eye condition,speech delay # 3:      1019                                                                allergic reaction,na # 4:      1020                                                                        eczema,obese 

if wanted sequential id each person, can use getanid "splitstackshape" package:

library(splitstackshape) getanid(as.data.table(unique(df[1:2])) 

this let convert wide form, if desired, using dcast.data.table, this:

library(splitstackshape) dcast.data.table(getanid(as.data.table(unique(df[1:2])), "person.id"),                   person.id ~ .id, value.var = "med.condition",                   fun.aggregate = function(x) paste(x, collapse = ",")) #    person.id                 1                 2                    3             4            5 # 1:      1017                na                                                                   # 2:      1018 allergic rhinitis atopic dermatitis developmental speech eye condition speech delay # 3:      1019 allergic reaction                na                                                 # 4:      1020            eczema             obese                                                 

Comments

Popular posts from this blog

python - mat is not a numerical tuple : openCV error -

c# - MSAA finds controls UI Automation doesn't -

wordpress - .htaccess: RewriteRule: bad flag delimiters -