
Implementing and Using PCA in R

Chen Hao posted on 03 Sep 2016

Implementing PCA in R

In the earlier post 一文缕清PCA we worked through the principles and computation behind PCA; here we get hands-on and implement PCA ourselves in R.

pca <- function(data){
    data <- as.matrix(data)
    ## Center each column so its mean is 0
    data <- apply(data, 2, function(x) {x - mean(x)})
    ## Compute the covariance matrix (divisor n, as princomp does)
    covMatrix <- t(data) %*% data / nrow(data)
    ## Eigendecompose the covariance matrix with eigen() from 'base'
    eigenRes <- eigen(covMatrix)
    eigenValues <- eigenRes$values
    eigenVectors <- eigenRes$vectors
    ## Project the centered data onto the eigenvectors
    dataTransformed <- data %*% eigenVectors
    ## Return the results as a list
    return(list("eigenValues" = eigenValues,
                "eigenVectors" = eigenVectors,
                "dataTransformed" = dataTransformed))
}

R's stats package provides two functions for PCA: prcomp and princomp. Their source code can be inspected with getAnywhere(prcomp.default) and getAnywhere(princomp.default). The only difference between them is the computational method: prcomp uses SVD, while princomp solves for eigenvalues and eigenvectors much as the function above does. From their documentation:

prcomp: The calculation is done by a singular value decomposition of the (centered and possibly scaled) data matrix, not by using eigen on the covariance matrix. This is generally the preferred method for numerical accuracy. The print method for these objects prints the results in a nice format and the plot method produces a scree plot. Unlike princomp, variances are computed with the usual divisor N - 1.

princomp: The calculation is done using eigen on the correlation or covariance matrix, as determined by cor. ‘princomp’ only handles so-called R-mode PCA, that is feature extraction of variables. If a data matrix is supplied (possibly via a formula) it is required that there are at least as many units as variables. For Q-mode PCA use ‘prcomp’.

Principal component analysis is underspecified if you have fewer samples than data points. Every data point will be its own principal component. For PCA to work, the number of instances should be significantly larger than the number of dimensions. Simply speaking you can look at the problem like this: if you have n dimensions, you can encode up to n+1 instances using vectors that are all 0 or that have at most one 1. And this is optimal, so PCA will do this! But it is not very helpful. – quoted from Stack Overflow
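
To make the SVD route concrete, here is a minimal sketch (not prcomp's actual source) of recovering prcomp-style results from svd() applied to the centered data:

X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
s <- svd(X)
## Singular values relate to component standard deviations
## via the divisor n - 1 that prcomp uses
sdev <- s$d / sqrt(nrow(X) - 1)
## Right singular vectors play the role of the rotation matrix;
## the scores are X %*% V, which equals U %*% diag(d)
scores <- s$u %*% diag(s$d)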

Differences between the PCA implementations

## Use the iris data set
data <- iris[ ,1:4]
row.names(data) <- paste0(iris[,5], "_", 1:nrow(data))
head(data)
##          Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa_1          5.1         3.5          1.4         0.2
## setosa_2          4.9         3.0          1.4         0.2
## setosa_3          4.7         3.2          1.3         0.2
## setosa_4          4.6         3.1          1.5         0.2
## setosa_5          5.0         3.6          1.4         0.2
## setosa_6          5.4         3.9          1.7         0.4
pca_res1 <- pca(data)
pca_res2 <- prcomp(data)
pca_res3 <- princomp(data)

Compare the structures of the returned objects:

str(pca_res1)
## List of 3
##  $ eigenValues    : num [1:4] 4.2001 0.2411 0.0777 0.0237
##  $ eigenVectors   : num [1:4, 1:4] 0.3614 -0.0845 0.8567 0.3583 0.6566 ...
##  $ dataTransformed: num [1:150, 1:4] -2.68 -2.71 -2.89 -2.75 -2.73 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:150] "setosa_1" "setosa_2" "setosa_3" "setosa_4" ...
##   .. ..$ : NULL
str(pca_res2)
## List of 5
##  $ sdev    : num [1:4] 2.056 0.493 0.28 0.154
##  $ rotation: num [1:4, 1:4] 0.3614 -0.0845 0.8567 0.3583 -0.6566 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
##  $ center  : Named num [1:4] 5.84 3.06 3.76 1.2
##   ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##  $ scale   : logi FALSE
##  $ x       : num [1:150, 1:4] -2.68 -2.71 -2.89 -2.75 -2.73 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:150] "setosa_1" "setosa_2" "setosa_3" "setosa_4" ...
##   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
##  - attr(*, "class")= chr "prcomp"
str(pca_res3)
## List of 7
##  $ sdev    : Named num [1:4] 2.049 0.491 0.279 0.154
##   ..- attr(*, "names")= chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
##  $ loadings: loadings [1:4, 1:4] 0.3614 -0.0845 0.8567 0.3583 -0.6566 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##   .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
##  $ center  : Named num [1:4] 5.84 3.06 3.76 1.2
##   ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##  $ scale   : Named num [1:4] 1 1 1 1
##   ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##  $ n.obs   : int 150
##  $ scores  : num [1:150, 1:4] -2.68 -2.71 -2.89 -2.75 -2.73 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:150] "setosa_1" "setosa_2" "setosa_3" "setosa_4" ...
##   .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
##  $ call    : language princomp(x = data)
##  - attr(*, "class")= chr "princomp"

Looking at the eigenvalues, you will find that the eigenvalues in pca_res1 are in fact the variances of the columns of the transformed data matrix, so their square roots are the component standard deviations.
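
This can be checked directly (a minimal sketch; note that var() divides by n - 1, while our function divided by n):

n <- nrow(pca_res1$dataTransformed)
## Rescale the column variances from var()'s divisor n - 1 to divisor n
col_vars <- apply(pca_res1$dataTransformed, 2, var) * (n - 1) / n
all.equal(unname(col_vars), pca_res1$eigenValues)  # should be TRUE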

sqrt(pca_res1$eigenValues)
## [1] 2.0494032 0.4909714 0.2787259 0.1538707
pca_res2$sdev
## [1] 2.0562689 0.4926162 0.2796596 0.1543862
pca_res3$sdev
##    Comp.1    Comp.2    Comp.3    Comp.4 
## 2.0494032 0.4909714 0.2787259 0.1538707

The discrepancy between the sdev values of pca_res2 and pca_res3 is caused by the different divisors (degrees of freedom) they use, exactly as the prcomp documentation quoted above states.
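
A quick check of that relationship (a minimal sketch):

n <- nrow(data)
## princomp divides by n, prcomp by n - 1, so the two sdev
## vectors differ by a factor of sqrt(n / (n - 1))
all.equal(unname(pca_res3$sdev) * sqrt(n / (n - 1)),
          pca_res2$sdev)  # should be TRUE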

Next, the eigenvectors:

pca_res1$eigenVectors
##             [,1]        [,2]        [,3]       [,4]
## [1,]  0.36138659  0.65658877 -0.58202985  0.3154872
## [2,] -0.08452251  0.73016143  0.59791083 -0.3197231
## [3,]  0.85667061 -0.17337266  0.07623608 -0.4798390
## [4,]  0.35828920 -0.07548102  0.54583143  0.7536574
pca_res2$rotation
##                      PC1         PC2         PC3        PC4
## Sepal.Length  0.36138659 -0.65658877  0.58202985  0.3154872
## Sepal.Width  -0.08452251 -0.73016143 -0.59791083 -0.3197231
## Petal.Length  0.85667061  0.17337266 -0.07623608 -0.4798390
## Petal.Width   0.35828920  0.07548102 -0.54583143  0.7536574
unclass(pca_res3$loadings)
##                   Comp.1      Comp.2      Comp.3     Comp.4
## Sepal.Length  0.36138659 -0.65658877 -0.58202985  0.3154872
## Sepal.Width  -0.08452251 -0.73016143  0.59791083 -0.3197231
## Petal.Length  0.85667061  0.17337266  0.07623608 -0.4798390
## Petal.Width   0.35828920  0.07548102  0.54583143  0.7536574

These differ only in naming: prcomp calls the matrix of eigenvectors rotation, while princomp calls it loadings. The sign flips do not change anything substantive, because an eigenvector is only determined up to its sign.
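
One way to confirm the two agree up to sign (a minimal sketch):

## Compare absolute values to ignore the arbitrary sign of each column
all.equal(abs(unname(pca_res1$eigenVectors)),
          abs(unname(pca_res2$rotation)))  # should be TRUE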

Finally, the transformed data matrices:

head(pca_res1$dataTransformed)
##               [,1]       [,2]        [,3]         [,4]
## setosa_1 -2.684126  0.3193972 -0.02791483  0.002262437
## setosa_2 -2.714142 -0.1770012 -0.21046427  0.099026550
## setosa_3 -2.888991 -0.1449494  0.01790026  0.019968390
## setosa_4 -2.745343 -0.3182990  0.03155937 -0.075575817
## setosa_5 -2.728717  0.3267545  0.09007924 -0.061258593
## setosa_6 -2.280860  0.7413304  0.16867766 -0.024200858
head(pca_res2$x)
##                PC1        PC2         PC3          PC4
## setosa_1 -2.684126 -0.3193972  0.02791483  0.002262437
## setosa_2 -2.714142  0.1770012  0.21046427  0.099026550
## setosa_3 -2.888991  0.1449494 -0.01790026  0.019968390
## setosa_4 -2.745343  0.3182990 -0.03155937 -0.075575817
## setosa_5 -2.728717 -0.3267545 -0.09007924 -0.061258593
## setosa_6 -2.280860 -0.7413304 -0.16867766 -0.024200858
head(pca_res3$scores)
##             Comp.1     Comp.2      Comp.3       Comp.4
## setosa_1 -2.684126 -0.3193972 -0.02791483  0.002262437
## setosa_2 -2.714142  0.1770012 -0.21046427  0.099026550
## setosa_3 -2.888991  0.1449494  0.01790026  0.019968390
## setosa_4 -2.745343  0.3182990  0.03155937 -0.075575817
## setosa_5 -2.728717 -0.3267545  0.09007924 -0.061258593
## setosa_6 -2.280860 -0.7413304  0.16867766 -0.024200858

Again, apart from sign flips these differ only in naming: prcomp stores the transformed data in x, princomp in scores.

Using the PCA results

Here we look at how to use the output of prcomp, the more commonly used of the two functions.

  1. Importance of the components
summary(pca_res2)
## Importance of components:
##                           PC1     PC2    PC3     PC4
## Standard deviation     2.0563 0.49262 0.2797 0.15439
## Proportion of Variance 0.9246 0.05307 0.0171 0.00521
## Cumulative Proportion  0.9246 0.97769 0.9948 1.00000

This lists each principal component's standard deviation, the proportion of the total variance it explains, and the cumulative proportion of variance.
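
These proportions can be reproduced directly from the standard deviations (a minimal sketch):

vars <- pca_res2$sdev^2
## Proportion and cumulative proportion of variance explained
round(vars / sum(vars), 5)
round(cumsum(vars) / sum(vars), 5)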

  2. Scree plot
plot(pca_res2)

[Figure: scree plot of pca_res2]

The bars here show the magnitudes of the eigenvalues, i.e. the variances; note that they are not the standard deviations returned in sdev.
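
The same plot can be drawn by hand from sdev (a minimal sketch):

## Bar heights are the component variances, i.e. sdev squared
barplot(pca_res2$sdev^2, names.arg = paste0("PC", 1:4),
        ylab = "Variances")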

Automatic selection of the leading PCs by elbow detection:

elbow_detection <- function(scores, if_plot = FALSE){

    num_scores <- length(scores)
    if(num_scores < 2){
        stop("Input scores must be a vector of length greater than 1!")
    }
    scores <- data.frame('id'=seq_len(num_scores), 'value'=scores)
    sorted_scores <- scores[order(scores$value, decreasing = TRUE), ]

    ## use the distance to the line through the first and last sorted
    ## points to determine the elbow point
    xy_coordinates <- cbind('x'=seq_len(num_scores), 'y'=sorted_scores$value)
    start_point <- xy_coordinates[1, ]
    end_point <- xy_coordinates[num_scores, ]
    x1 <- start_point[1]; x2 <- end_point[1]
    y1 <- start_point[2]; y2 <- end_point[2]
    ## coefficients of the line a*x + b*y + c = 0
    a <- y1-y2; b <- x2-x1; c <- x1*y2-x2*y1
    dist_to_line <- abs(a*xy_coordinates[,'x'] + b*xy_coordinates[,'y'] + c) / sqrt(a^2+b^2)
    best_point_id <- which.max(dist_to_line)
    score_cutoff <- xy_coordinates[best_point_id, 'y']
    select_ID <- scores$id[which(scores$value >= score_cutoff)]

    if(if_plot){
        plot(seq(nrow(scores)), sorted_scores$value,
             col=ifelse(sorted_scores$value >= score_cutoff,'red','black'),
             xlab = "ID", ylab = "Score",
             main=paste0("Optimal number = ", length(select_ID),
                         " with cutoff value = ", round(score_cutoff, digits=4)))
    }

    return(select_ID)
}
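
For example, it can be applied to the component variances (a usage sketch; this selects the PCs above the elbow):

pc_vars <- pca_res2$sdev^2
selected_pcs <- elbow_detection(pc_vars, if_plot = TRUE)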
  3. Biplot
biplot(pca_res2)

[Figure: biplot of pca_res2]

The biplot overlays the scores on the first two components with arrows showing each original variable's loading on them.

  4. Computing each variable's contribution to a component from the eigenvectors, as sketched below.
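Each column of rotation is a unit vector, so its squared entries sum to 1 and can be read as the fraction each variable contributes to that component. A minimal sketch, following this common squared-loadings convention:

## Percentage contribution of each variable to each principal component
contrib <- pca_res2$rotation^2 * 100
round(contrib, 2)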

References

http://stats.stackexchange.com/questions/20101/what-is-the-difference-between-r-functions-prcomp-and-princomp
