PCA的R实现和运用
Chen Hao posted on 03 Sep 2016
R中PCA的实现
在一文缕清PCA 中,我们缕清楚了PCA的原理和计算方法,那么这里就动手用R来实践下PCA。
pca <- function ( data ){
data <- as.matrix ( data )
## 将每一列的平均值变为0
data <- apply ( data , 2 , function ( x ) { x - mean ( x )})
## 计算协方差矩阵
corMatrix <- t ( data ) %*% data / nrow ( data )
## 用'base'包中的eigen函数计算协方差矩阵的特征值的特征向量
eigenRes <- eigen ( corMatrix )
eigenValues <- eigenRes $ values
eigenVectors <- eigenRes $ vectors
## 用特征向量构成的矩阵转化原数据
dataTransformed <- data %*% eigenRes $ vectors
## 用列表结构返回数据
return ( list ( "eigenValues" = eigenValues ,
"eigenVectors" = eigenVectors ,
"dataTransformed" = dataTransformed ))
}
在R的stats
包中有两个函数来计算PCA:prcomp
和princomp
。其源代码可以通过getAnywhere(prcomp.default)
和getAnywhere(princomp.default)
获取。他俩的差别只是在计算方法上,prcomp
使用的是SVD,而princomp
使用的是和上面类似的求特征值和特征向量的方法。看看他两的文档:
prcomp : The calculation is done by a singular value decomposition of the (centered and possibly scaled) data matrix, not by using eigen
on the covariance matrix. This is generally the preferred method for numerical accuracy. The print method for these objects prints the results in a nice format and the plot method produces a scree plot. Unlike princomp
, variances are computed with the usual divisor N - 1.
princomp : The calculation is done using eigen
on the correlation or covariance matrix, as determined by cor
. ‘princomp’ only handles so-called R-mode PCA, that is feature
extraction of variables. If a data matrix is supplied (possibly via a formula) it is required that there are at least as many units as variables. For Q-mode PCA use ‘prcomp’.
Principal component analysis is underspecified if you have fewer samples than data point. Every data point will be it’s own principal component. For PCA to work, the number of instances should be significantly larger than the number of dimensions. Simply speaking you can look at the problems like this: If you have n dimensions, you can encode up to n+1 instances using vectors that are all 0 or that have at most one 1. And this is optimal, so PCA will do this! But it is not very helpful. – Explained from stackoverflow
不同方法计算PCA的差别
## 使用数据iris
data <- iris [ , 1 : 4 ]
row.names ( data ) <- paste0 ( iris [, 5 ], "_" , 1 : nrow ( data ))
head ( data )
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa_1 5.1 3.5 1.4 0.2
## setosa_2 4.9 3.0 1.4 0.2
## setosa_3 4.7 3.2 1.3 0.2
## setosa_4 4.6 3.1 1.5 0.2
## setosa_5 5.0 3.6 1.4 0.2
## setosa_6 5.4 3.9 1.7 0.4
pca_res1 <- pca ( data )
pca_res2 <- prcomp ( data )
pca_res3 <- princomp ( data )
看看返回结果的差别:
str ( pca_res1 )
## List of 3
## $ eigenValues : num [1:4] 4.2001 0.2411 0.0777 0.0237
## $ eigenVectors : num [1:4, 1:4] 0.3614 -0.0845 0.8567 0.3583 0.6566 ...
## $ dataTransformed: num [1:150, 1:4] -2.68 -2.71 -2.89 -2.75 -2.73 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:150] "setosa_1" "setosa_2" "setosa_3" "setosa_4" ...
## .. ..$ : NULL
str ( pca_res2 )
## List of 5
## $ sdev : num [1:4] 2.056 0.493 0.28 0.154
## $ rotation: num [1:4, 1:4] 0.3614 -0.0845 0.8567 0.3583 -0.6566 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
## $ center : Named num [1:4] 5.84 3.06 3.76 1.2
## ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## $ scale : logi FALSE
## $ x : num [1:150, 1:4] -2.68 -2.71 -2.89 -2.75 -2.73 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:150] "setosa_1" "setosa_2" "setosa_3" "setosa_4" ...
## .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
## - attr(*, "class")= chr "prcomp"
str ( pca_res3 )
## List of 7
## $ sdev : Named num [1:4] 2.049 0.491 0.279 0.154
## ..- attr(*, "names")= chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
## $ loadings: loadings [1:4, 1:4] 0.3614 -0.0845 0.8567 0.3583 -0.6566 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
## $ center : Named num [1:4] 5.84 3.06 3.76 1.2
## ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## $ scale : Named num [1:4] 1 1 1 1
## ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## $ n.obs : int 150
## $ scores : num [1:150, 1:4] -2.68 -2.71 -2.89 -2.75 -2.73 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:150] "setosa_1" "setosa_2" "setosa_3" "setosa_4" ...
## .. ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
## $ call : language princomp(x = data)
## - attr(*, "class")= chr "princomp"
看看特征值,会发现pca_res1
中对应的特征值其实是变换后矩阵各列向量的方差。
sqrt ( pca_res1 $ eigenValues )
## [1] 2.0494032 0.4909714 0.2787259 0.1538707
pca_res2 $ sdev
## [1] 2.0562689 0.4926162 0.2796596 0.1543862
pca_res3 $ sdev
## Comp.1 Comp.2 Comp.3 Comp.4
## 2.0494032 0.4909714 0.2787259 0.1538707
这里pca_res2
和pca_res3
中sdev
的反差是由自由度使用的不用而造成的,正如上面prcomp
中文档所表述的。
在来看看特征向量:
pca_res1 $ eigenVectors
## [,1] [,2] [,3] [,4]
## [1,] 0.36138659 0.65658877 -0.58202985 0.3154872
## [2,] -0.08452251 0.73016143 0.59791083 -0.3197231
## [3,] 0.85667061 -0.17337266 0.07623608 -0.4798390
## [4,] 0.35828920 -0.07548102 0.54583143 0.7536574
pca_res2 $ rotation
## PC1 PC2 PC3 PC4
## Sepal.Length 0.36138659 -0.65658877 0.58202985 0.3154872
## Sepal.Width -0.08452251 -0.73016143 -0.59791083 -0.3197231
## Petal.Length 0.85667061 0.17337266 -0.07623608 -0.4798390
## Petal.Width 0.35828920 0.07548102 -0.54583143 0.7536574
unclass ( pca_res3 $ loadings )
## Comp.1 Comp.2 Comp.3 Comp.4
## Sepal.Length 0.36138659 -0.65658877 -0.58202985 0.3154872
## Sepal.Width -0.08452251 -0.73016143 0.59791083 -0.3197231
## Petal.Length 0.85667061 0.17337266 0.07623608 -0.4798390
## Petal.Width 0.35828920 0.07548102 0.54583143 0.7536574
这里只是对特征向量的叫法不同,prcomp
中叫rotation
,princomp
中叫做loadings
。其中正负的差别不会影响整体的结构。
再来看看变化后的矩阵:
head ( pca_res1 $ dataTransformed )
## [,1] [,2] [,3] [,4]
## setosa_1 -2.684126 0.3193972 -0.02791483 0.002262437
## setosa_2 -2.714142 -0.1770012 -0.21046427 0.099026550
## setosa_3 -2.888991 -0.1449494 0.01790026 0.019968390
## setosa_4 -2.745343 -0.3182990 0.03155937 -0.075575817
## setosa_5 -2.728717 0.3267545 0.09007924 -0.061258593
## setosa_6 -2.280860 0.7413304 0.16867766 -0.024200858
head ( pca_res2 $ x )
## PC1 PC2 PC3 PC4
## setosa_1 -2.684126 -0.3193972 0.02791483 0.002262437
## setosa_2 -2.714142 0.1770012 0.21046427 0.099026550
## setosa_3 -2.888991 0.1449494 -0.01790026 0.019968390
## setosa_4 -2.745343 0.3182990 -0.03155937 -0.075575817
## setosa_5 -2.728717 -0.3267545 -0.09007924 -0.061258593
## setosa_6 -2.280860 -0.7413304 -0.16867766 -0.024200858
head ( pca_res3 $ scores )
## Comp.1 Comp.2 Comp.3 Comp.4
## setosa_1 -2.684126 -0.3193972 -0.02791483 0.002262437
## setosa_2 -2.714142 0.1770012 -0.21046427 0.099026550
## setosa_3 -2.888991 0.1449494 0.01790026 0.019968390
## setosa_4 -2.745343 0.3182990 0.03155937 -0.075575817
## setosa_5 -2.728717 -0.3267545 0.09007924 -0.061258593
## setosa_6 -2.280860 -0.7413304 0.16867766 -0.024200858
同样,这里除了正负号的差别,只是叫法的不同。
PCA结果的运用
这里我们来看看最常用的prcomp
函数结果的运用。
主成分得分
summary ( pca_res2 )
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 2.0563 0.49262 0.2797 0.15439
## Proportion of Variance 0.9246 0.05307 0.0171 0.00521
## Cumulative Proportion 0.9246 0.97769 0.9948 1.00000
这里列出了各主成分的标准差,方差在所有主成分中的比例,以及累积标准差比例。
碎石图
plot ( pca_res2 )
这里的柱状图表示的特征值的大小,也就是方差,注意不是返回值sdev
标准差。
automatical selection of main PCs by elbow detection
elbow_detection <- function ( scores , if_plot = FALSE ){
num_scores <- length ( scores )
if ( num_scores < 2 ){
stop ( "Input scores must be a vector with length more than 1!" )
}
scores <- data.frame ( 'id' = seq_len ( num_scores ), 'value' = scores )
sorted_scores <- scores [ order ( scores $ value , decreasing = T ), ]
## use distance to diagonal line to determine the elbow point
xy_coordinates <- cbind ( 'x' = seq_len ( num_scores ), 'y' = sorted_scores $ value )
start_point <- xy_coordinates [ 1 , ]
end_point <- xy_coordinates [ num_scores , ]
x1 <- start_point [ 1 ]; x2 <- end_point [ 1 ]
y1 <- start_point [ 2 ]; y2 <- end_point [ 2 ]
a <- y1 - y2 ; b <- x2 - x1 ; c <- x1 * y2 - x2 * y1
dist_to_line <- abs ( a * xy_coordinates [, 'x' ] + b * xy_coordinates [, 'y' ] + c ) / sqrt ( a ^ 2 + b ^ 2 )
best_point_id <- which.max ( dist_to_line )
score_cutoff <- xy_coordinates [ best_point_id , 'y' ]
select_ID <- scores $ id [ which ( scores $ value >= score_cutoff )]
if ( if_plot ){
plot ( seq ( nrow ( scores )), sorted_scores $ value ,
col = ifelse ( sorted_scores $ value >= score_cutoff , 'red' , 'black' ),
xlab = "ID" , ylab = "Score" ,
main = paste0 ( "Optimal number = " , length ( select_ID ),
" with cutoff value = " , round ( score_cutoff , digits = 4 )))
}
return ( select_ID )
}
双坐标图
biplot ( pca_res2 )
根据特征向量计算每个变量对特征值的贡献
http://stats.stackexchange.com/questions/20101/what-is-the-difference-between-r-functions-prcomp-and-princomp
参考