# R语言：数据汇总/透视/提炼

### 一、行列求和、平均值和频度

rowSums, colSums, rowMeans, colMeans 可以简单理解为按行或列求和或求均值，table把数字或字符当成因子统计频度，都相当简单：

```> a <- array(rep(1:3, each=3), dim=c(3,3))
> a
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 2 3
[3,] 1 2 3
> rowSums(a)
[1] 6 6 6
> colSums(a)
[1] 3 6 9
> table(a)
a
1 2 3
3 3 3```

```colSums (x, na.rm = FALSE, dims = 1)
rowSums (x, na.rm = FALSE, dims = 1)
colMeans(x, na.rm = FALSE, dims = 1)
rowMeans(x, na.rm = FALSE, dims = 1)```

```> b <- array(rep(1:3, each=9), dim=c(3,3,3))
> b
, , 1

[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1
[3,] 1 1 1

, , 2

[,1] [,2] [,3]
[1,] 2 2 2
[2,] 2 2 2
[3,] 2 2 2

, , 3

[,1] [,2] [,3]
[1,] 3 3 3
[2,] 3 3 3
[3,] 3 3 3

> rowSums(b)
[1] 18 18 18
> rowSums(b,dims=1)
[1] 18 18 18
> rowSums(b,dims=2)
[,1] [,2] [,3]
[1,] 6 6 6
[2,] 6 6 6
[3,] 6 6 6
> colSums(b)
[,1] [,2] [,3]
[1,] 3 6 9
[2,] 3 6 9
[3,] 3 6 9
> colSums(b,dims=2)
[1] 9 18 27```

table可以统计数字出现的频率，也可以统计其他可以被看做因子的数据类型：

```> table(b)
b
1 2 3
9 9 9
> c <- sample(letters[1:5], 10, replace=TRUE)
> c
[1] "a" "c" "b" "d" "a" "e" "d" "e" "c" "a"
> table(c)
c
a b c d e
3 1 2 2 2```

```> a <- rep(letters[1:3], each=4)
> b <- sample(LETTERS[1:3],12,replace=T)
> table(a,b)
b
a A B C
a 0 3 1
b 3 0 1
c 1 1 2```

### 二、apply系列函数：

1、apply函数：

X参数为数组或矩阵；MARGIN为要应用计算函数的边/维，MARGIN=1为第一维（行），2为第二维（列），...；FUN为要应用的计算函数，后面可以加FUN的有名参数。比如，要按行或列计算数组a的标准差就可以这样：

```> apply(a, MARGIN=1, FUN=sd)
[1] 1 1 1
> apply(a, MARGIN=2, FUN=sd)
[1] 0 0 0```

MARGIN的长度可以不是1（多维应用），如果长度等于X的维数，应用到FUN函数的数据就只有一个值，结果没什么意义，甚至函数会获得无效值：

```> apply(b, MARGIN=3, FUN=sum)
[1]  9 18 27
> apply(b, MARGIN=1:2, FUN=sum)
[,1] [,2] [,3]
[1,]    6    6    6
[2,]    6    6    6
[3,]    6    6    6
> apply(a, MARGIN=1:2, FUN=sd)
[,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA
[3,]   NA   NA   NA```

```> a
[,1] [,2] [,3]
[1,]    1    2    3
[2,]    1    2    3
[3,]    1    2    3
> apply(a, MARGIN=1, FUN=quantile, probs=seq(0,1, 0.25))
[,1] [,2] [,3]
0%    1.0  1.0  1.0
25%   1.5  1.5  1.5
50%   2.0  2.0  2.0
75%   2.5  2.5  2.5
100%  3.0  3.0  3.0
> apply(a, MARGIN=2, FUN=quantile, probs=seq(0,1, 0.25))
[,1] [,2] [,3]
0%      1    2    3
25%     1    2    3
50%     1    2    3
75%     1    2    3
100%    1    2    3```

2、lapply、sapply和vapply函数：

2.1 lapply返回的结果为列表，长度与X相同

```> scores <- list(YuWen=c(80,88,94,70), ShuXue=c(99,87,100,68,77))
> lapply(scores, mean)
\$YuWen
[1] 83

\$ShuXue
[1] 86.2

> lapply(scores, quantile, probs=c(0.5,0.7,0.9))
\$YuWen
50%  70%  90%
84.0 88.6 92.2

\$ShuXue
50%  70%  90%
87.0 96.6 99.6```

2.2 sapply返回的结果比较“友好”，如果结果很整齐，就会得到向量或矩阵或数组

sapply是simplify了的lapply，所谓的simplify，是指对结果的数据结构进行了simplify，方便后续处理。

```> sapply(scores, mean)
YuWen ShuXue
83.0   86.2
> sapply(scores, quantile, probs=c(0.5,0.7,0.9))
YuWen ShuXue
50%  84.0   87.0
70%  88.6   96.6
90%  92.2   99.6```

2.3 vapply函数：对返回结果（value）进行类型检查的sapply

```> probs <- c(1:3/4)
> rt.value <- c(0,0,0) #设置返回值为3个数字
> vapply(scores, quantile, FUN.VALUE=rt.value, probs=probs)
YuWen ShuXue
25% 77.5 77
50% 84.0 87
75% 89.5 99
> probs <- c(1:4/4)
> vapply(scores, quantile, FUN.VALUE=rt.value, probs=probs)```

```> rt.value <- c(0,0,0,0) #返回值类型为4个数字
> vapply(scores, quantile, FUN.VALUE=rt.value, probs=probs)
YuWen ShuXue
25% 77.5 77
50% 84.0 87
75% 89.5 99
100% 94.0 100
> rt.value <- c(0,0,0,'') #设置返回值为3个数字和1个字符串
> vapply(scores, quantile, FUN.VALUE=rt.value, probs=probs)```

FUN.VALUE为必需参数。

3、 mapply函数：

R的在线文档说mapply是sapply的多变量版本（multivariate sapply），但它的参数顺序和sapply却不一样：

mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)

mapply应用的数据类型为向量或列表，FUN函数对每个数据元素应用FUN函数；如果参数长度为1，得到的结果和sapply是一样的；但如果参数长度不是1，FUN函数将按向量顺序和循环规则（短向量重复）逐个取参数应用到对应数据元素：

```> sapply(X=1:4, FUN=rep, times=4)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 1 2 3 4
[3,] 1 2 3 4
[4,] 1 2 3 4
> mapply(rep, x = 1:4, times=4)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 1 2 3 4
[3,] 1 2 3 4
[4,] 1 2 3 4
> mapply(rep, x = 1:4, times=1:4)
[[1]]
[1] 1

[[2]]
[1] 2 2

[[3]]
[1] 3 3 3

[[4]]
[1] 4 4 4 4

> mapply(rep, x = 1:4, times=1:2)
[[1]]
[1] 1

[[2]]
[1] 2 2

[[3]]
[1] 3

[[4]]
[1] 4 4```

4、tapply 和 by 函数：

tapply函数可以看做是table函数的扩展：table函数按因子组合计算频度，而tapply可以按因子组合应用各种函数。使用格式为：tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)

X为要应用函数的数据，通常为向量；INDEX为因子，和table函数一样，它的长度必需和X相同。

```> (x <- 1:10)
[1] 1 2 3 4 5 6 7 8 9 10
> (f <- gl(2,5, labels=c("CK", "T")))
[1] CK CK CK CK CK T T T T T
Levels: CK T
> tapply(x, f, length) #FUN函数是length，得到的结果和table类似
CK T
5 5
> table(f)
f
CK T
5 5
> tapply(x, f, sum)
CK T
15 40```

by函数是tapply函数针对数据框类型数据的应用，但结果不怎么友好，你可以用下面语句看看情况：
with(mtcars, by(mtcars, cyl, summary))

### 三、aggregate函数

```aggregate(x, by, FUN, ..., simplify = TRUE)
aggregate(formula, data, FUN, ..., subset, na.action = na.omit)
aggregate(x, nfrequency = 1, FUN = sum, ndeltat = 1, ts.eps = getOption("ts.eps"), ...)```

```> str(mtcars)
'data.frame':   32 obs. of  11 variables:
\$ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
\$ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
\$ disp: num  160 160 108 258 360 ...
\$ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
\$ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
\$ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
\$ qsec: num  16.5 17 18.6 19.4 17 ...
\$ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
\$ am  : num  1 1 1 0 0 0 0 0 0 0 ...
\$ gear: num  4 4 4 3 3 3 3 4 4 4 ...
\$ carb: num  4 4 1 1 2 1 4 2 2 4 ...```

```> attach(mtcars)
> aggregate(mtcars, by=list(cyl), FUN=mean)
Group.1 mpg cyl disp hp drat wt qsec vs am gear carb
1 4 26.66364 4 105.1364 82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909 1.545455
2 6 19.74286 6 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143 3.428571
3 8 15.10000 8 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714 3.500000```

by参数也可以包含多个类型的因子，得到的就是每个不同因子组合的统计结果：

```> aggregate(mtcars, by=list(cyl, gear), FUN=mean)

Group.1 Group.2    mpg cyl     disp       hp     drat       wt    qsec  vs   am gear     carb
1       4       3 21.500   4 120.1000  97.0000 3.700000 2.465000 20.0100 1.0 0.00    3 1.000000
2       6       3 19.750   6 241.5000 107.5000 2.920000 3.337500 19.8300 1.0 0.00    3 1.000000
3       8       3 15.050   8 357.6167 194.1667 3.120833 4.104083 17.1425 0.0 0.00    3 3.083333
4       4       4 26.925   4 102.6250  76.0000 4.110000 2.378125 19.6125 1.0 0.75    4 1.500000
5       6       4 19.750   6 163.8000 116.5000 3.910000 3.093750 17.6700 0.5 0.50    4 4.000000
6       4       5 28.200   4 107.7000 102.0000 4.100000 1.826500 16.8000 0.5 1.00    5 2.000000
7       6       5 19.700   6 145.0000 175.0000 3.620000 2.770000 15.5000 0.0 1.00    5 6.000000
8       8       5 15.400   8 326.0000 299.5000 3.880000 3.370000 14.5500 0.0 1.00    5 6.000000```

```> aggregate(cbind(mpg,hp) ~ cyl+gear, FUN=mean)
cyl gear    mpg       hp
1   4    3 21.500  97.0000
2   6    3 19.750 107.5000
3   8    3 15.050 194.1667
4   4    4 26.925  76.0000
5   6    4 19.750 116.5000
6   4    5 28.200 102.0000
7   6    5 19.700 175.0000
8   8    5 15.400 299.5000```

aggregate在时间序列数据上的应用请参考R的函数说明文档。