
Things to watch out for when scaling in a dplyr chain

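It's not really a warning as such, but when analyzing data you sometimes standardize the dataset (a.k.a. Z-transformation, standardization, scaling).
The usual tool is the scale function, but when you use it in a dplyr chain the result comes back with extra baggage (the scaled:center and scaled:scale attributes), which feels a bit icky.
I don't know what effect those leftovers from scale actually have downstream, but plain (x-mean(x))/sd(x) gives the same values without them, so that seems to be the better way.
Or rather, is the very idea of reaching for a function here outdated to begin with?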
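To see the difference outside of any chain, here is a minimal base R sketch. The vector x and the names z_scale and z_manual are made up for illustration. It only shows what scale hands back compared with the hand-written formula.

```r
# Toy input, purely illustrative
x <- c(1, 2, 3, 4, 5)

z_scale  <- scale(x)                # one-column matrix carrying extra attributes
z_manual <- (x - mean(x)) / sd(x)   # plain numeric vector

# scale() keeps "dim", "scaled:center" and "scaled:scale"
names(attributes(z_scale))

# the manual version carries no attributes at all
attributes(z_manual)

# once the attributes are dropped, the values agree
all.equal(as.vector(z_scale), z_manual)
```

So the numbers are the same either way; the only difference is the matrix shape and the attributes that scale attaches.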

> library(dplyr)
> library(pipeR)  # provides %>>% and the (~ ...) side-effect syntax
> cbind(rep(1:2,10), matrix(runif(20*10),ncol=10)) %>% 
+     tbl_df() %>% 
+     rename(grp=V1) %>%
+     mutate(grp=as.factor(grp)) %>% 
+     group_by(grp) %>>%
+     # using the scale function
+     (~mutate_at(.,vars(2:11), funs(scale)) %>% ungroup() %>% str(.) %>% print(.)) %>% 
+     # using (x-mean(x))/sd(x)
+     mutate_at(.,vars(2:11), funs((.-mean(.))/sd(.))) %>% ungroup() %>% str()
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	20 obs. of  11 variables:
 $ grp: Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2 ...
 $ V2 : num [1:10, 1] 0.38 0.839 -1.034 0.111 0.773 ...
  ..- attr(*, "scaled:center")= num 0.634
  ..- attr(*, "scaled:scale")= num 0.23
 $ V3 : num [1:10, 1] 0.817 -0.259 -1.136 -0.384 0.474 ...
  ..- attr(*, "scaled:center")= num 0.419
  ..- attr(*, "scaled:scale")= num 0.332
 $ V4 : num [1:10, 1] 0.451 -0.611 1.129 -1.104 -1.635 ...
  ..- attr(*, "scaled:center")= num 0.586
  ..- attr(*, "scaled:scale")= num 0.317
 $ V5 : num [1:10, 1] 0.0921 1.1026 1.1408 -0.2213 -1.4277 ...
  ..- attr(*, "scaled:center")= num 0.625
  ..- attr(*, "scaled:scale")= num 0.282
 $ V6 : num [1:10, 1] 0.567 -1.311 0.123 1.352 -0.772 ...
  ..- attr(*, "scaled:center")= num 0.436
  ..- attr(*, "scaled:scale")= num 0.275
 $ V7 : num [1:10, 1] -0.737 -1.362 -0.422 1.455 -0.821 ...
  ..- attr(*, "scaled:center")= num 0.544
  ..- attr(*, "scaled:scale")= num 0.264
 $ V8 : num [1:10, 1] 0.51 -2.427 0.653 -0.282 -1.424 ...
  ..- attr(*, "scaled:center")= num 0.493
  ..- attr(*, "scaled:scale")= num 0.332
 $ V9 : num [1:10, 1] -1.185 0.296 1.019 1.344 -0.895 ...
  ..- attr(*, "scaled:center")= num 0.488
  ..- attr(*, "scaled:scale")= num 0.341
 $ V10: num [1:10, 1] -0.935 0.782 1.396 -0.314 -0.407 ...
  ..- attr(*, "scaled:center")= num 0.343
  ..- attr(*, "scaled:scale")= num 0.314
 $ V11: num [1:10, 1] 1.241 -1.37 0.263 -1.209 -0.956 ...
  ..- attr(*, "scaled:center")= num 0.444
  ..- attr(*, "scaled:scale")= num 0.248
NULL
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	20 obs. of  11 variables:
 $ grp: Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2 ...
 $ V2 : num  0.38 0.839 -1.034 0.111 0.773 ...
 $ V3 : num  0.817 -0.259 -1.136 -0.384 0.474 ...
 $ V4 : num  0.451 -0.611 1.129 -1.104 -1.635 ...
 $ V5 : num  0.0921 1.1026 1.1408 -0.2213 -1.4277 ...
 $ V6 : num  0.567 -1.311 0.123 1.352 -0.772 ...
 $ V7 : num  -0.737 -1.362 -0.422 1.455 -0.821 ...
 $ V8 : num  0.51 -2.427 0.653 -0.282 -1.424 ...
 $ V9 : num  -1.185 0.296 1.019 1.344 -0.895 ...
 $ V10: num  -0.935 0.782 1.396 -0.314 -0.407 ...
 $ V11: num  1.241 -1.37 0.263 -1.209 -0.956 ...
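If you would still rather keep scale in the chain, one possible workaround is to strip the attributes right away with as.vector. The sketch below is not the code from the transcript above: it assumes dplyr 1.0.0 or later (for across), and the data frame df and its columns a and b are hypothetical stand-ins.

```r
library(dplyr)

set.seed(1)  # only so the toy data is reproducible

# Hypothetical data: two groups, two numeric columns
df <- tibble(
  grp = factor(rep(1:2, 10)),
  a   = runif(20),
  b   = runif(20)
)

# scale() inside the chain, with as.vector() dropping dim and scaled:* attributes
scaled <- df %>%
  group_by(grp) %>%
  mutate(across(c(a, b), ~ as.vector(scale(.x)))) %>%
  ungroup()

# the hand-written formula, for comparison
manual <- df %>%
  group_by(grp) %>%
  mutate(across(c(a, b), ~ (.x - mean(.x)) / sd(.x))) %>%
  ungroup()

str(scaled)                        # columns are plain num, no attr(*, "scaled:...")
all.equal(scaled$a, manual$a)
all.equal(scaled$b, manual$b)
```

Both versions standardize within each grp, so which one to use is mostly a matter of taste once the attributes are gone.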