A note on scaling inside a dplyr chain
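It's not really a warning as such, but when analyzing data I sometimes standardize a dataset (a.k.a. normalization, standardization, z-transformation, scaling).
I usually reach for the scale() function, but inside a dplyr chain it leaves extra baggage attached to each column (the "scaled:center" and "scaled:scale" attributes), which feels a bit icky.
I don't know what effect using scale() has further down the line, but it looks like simply writing (x-mean(x))/sd(x) is the better way to go.
Or is the whole idea of using a function for this outdated in the first place?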
> library(dplyr)
> library(pipeR)  # provides %>>% and the (~ ...) side-effect syntax
> cbind(rep(1:2,10), matrix(runif(20*10),ncol=10)) %>%
+   tbl_df() %>%
+   rename(grp=V1) %>%
+   mutate(grp=as.factor(grp)) %>%
+   group_by(grp) %>>%
+   # with scale()
+   (~mutate_at(.,vars(2:11), funs(scale)) %>% ungroup() %>% str(.) %>% print(.)) %>%
+   # with (x-mean(x))/sd(x)
+   mutate_at(.,vars(2:11), funs((.-mean(.))/sd(.))) %>% ungroup() %>% str()
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 20 obs. of 11 variables:
 $ grp: Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2 ...
 $ V2 : num [1:10, 1] 0.38 0.839 -1.034 0.111 0.773 ...
  ..- attr(*, "scaled:center")= num 0.634
  ..- attr(*, "scaled:scale")= num 0.23
 $ V3 : num [1:10, 1] 0.817 -0.259 -1.136 -0.384 0.474 ...
  ..- attr(*, "scaled:center")= num 0.419
  ..- attr(*, "scaled:scale")= num 0.332
 $ V4 : num [1:10, 1] 0.451 -0.611 1.129 -1.104 -1.635 ...
  ..- attr(*, "scaled:center")= num 0.586
  ..- attr(*, "scaled:scale")= num 0.317
 $ V5 : num [1:10, 1] 0.0921 1.1026 1.1408 -0.2213 -1.4277 ...
  ..- attr(*, "scaled:center")= num 0.625
  ..- attr(*, "scaled:scale")= num 0.282
 $ V6 : num [1:10, 1] 0.567 -1.311 0.123 1.352 -0.772 ...
  ..- attr(*, "scaled:center")= num 0.436
  ..- attr(*, "scaled:scale")= num 0.275
 $ V7 : num [1:10, 1] -0.737 -1.362 -0.422 1.455 -0.821 ...
  ..- attr(*, "scaled:center")= num 0.544
  ..- attr(*, "scaled:scale")= num 0.264
 $ V8 : num [1:10, 1] 0.51 -2.427 0.653 -0.282 -1.424 ...
  ..- attr(*, "scaled:center")= num 0.493
  ..- attr(*, "scaled:scale")= num 0.332
 $ V9 : num [1:10, 1] -1.185 0.296 1.019 1.344 -0.895 ...
  ..- attr(*, "scaled:center")= num 0.488
  ..- attr(*, "scaled:scale")= num 0.341
 $ V10: num [1:10, 1] -0.935 0.782 1.396 -0.314 -0.407 ...
  ..- attr(*, "scaled:center")= num 0.343
  ..- attr(*, "scaled:scale")= num 0.314
 $ V11: num [1:10, 1] 1.241 -1.37 0.263 -1.209 -0.956 ...
  ..- attr(*, "scaled:center")= num 0.444
  ..- attr(*, "scaled:scale")= num 0.248
NULL
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 20 obs. of 11 variables:
 $ grp: Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2 ...
 $ V2 : num 0.38 0.839 -1.034 0.111 0.773 ...
 $ V3 : num 0.817 -0.259 -1.136 -0.384 0.474 ...
 $ V4 : num 0.451 -0.611 1.129 -1.104 -1.635 ...
 $ V5 : num 0.0921 1.1026 1.1408 -0.2213 -1.4277 ...
 $ V6 : num 0.567 -1.311 0.123 1.352 -0.772 ...
 $ V7 : num -0.737 -1.362 -0.422 1.455 -0.821 ...
 $ V8 : num 0.51 -2.427 0.653 -0.282 -1.424 ...
 $ V9 : num -1.185 0.296 1.019 1.344 -0.895 ...
 $ V10: num -0.935 0.782 1.396 -0.314 -0.407 ...
 $ V11: num 1.241 -1.37 0.263 -1.209 -0.956 ...
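
For what it's worth, the extra bits come from scale() itself and not from dplyr: it returns a one-column matrix with the centering and scaling values stored as attributes. Below is a minimal base-R sketch of that, on a single made-up vector instead of the tibble above. The variable names (x, z_scale, z_manual, z_clean) and the seed are purely illustrative, and wrapping the result in as.numeric() is just one possible way to strip the attributes, not something the session above did.

```r
# Base R only; no packages needed for this sketch.
set.seed(1)                          # arbitrary seed, only so the sketch is repeatable
x <- runif(10)                       # illustrative data, not the data from the post

z_scale  <- scale(x)                 # one-column matrix with scaled:* attributes
z_manual <- (x - mean(x)) / sd(x)    # plain numeric vector

# scale() keeps the mean and sd it used as attributes on the result
attr(z_scale, "scaled:center")
attr(z_scale, "scaled:scale")
is.matrix(z_scale)

# as.numeric() drops the matrix shape and the scaled:* attributes
z_clean <- as.numeric(z_scale)
is.null(attributes(z_clean))

# Both routes give the same numbers (up to floating-point tolerance)
all.equal(z_clean, z_manual)
```

In the chain above, that would presumably mean using something like funs(as.numeric(scale(.))) in place of funs(scale), though I haven't run that variant here.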