Big Data Analytics - Introduction to R

  • Overview

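    This section introduces the R programming language. R can be downloaded from the CRAN website. For Windows users, it is useful to install Rtools and the RStudio IDE as well.
    The general concept behind R is to serve as an interface to other software developed in compiled languages such as C, C++, and Fortran, and to give the user an interactive tool for analyzing data.
    Navigate to the folder bda/part2/R_introduction of the book zip file and open the R_introduction.Rproj file. This opens an RStudio session. Then open the 01_vectors.R file. Run the script line by line and follow the comments in the code. Another useful option for learning is to type the code yourself rather than copy it, which helps you get used to R syntax. In R, comments are written with the # symbol.
    To display the results of running R code in the book, the output R returns after the code is evaluated is shown as a comment. This way, you can copy and paste the code from the book and try sections of it directly in R.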
    
    # Create a vector of numbers 
    numbers = c(1, 2, 3, 4, 5) 
    print(numbers) 
    # [1] 1 2 3 4 5  
    # Create a vector of letters 
    ltrs = c('a', 'b', 'c', 'd', 'e') 
    print(ltrs) 
    # [1] "a" "b" "c" "d" "e"  
    # Concatenate both  
    mixed_vec = c(numbers, ltrs) 
    print(mixed_vec) 
    # [1] "1" "2" "3" "4" "5" "a" "b" "c" "d" "e"
    
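    Let's analyze what happened in the previous code. We can create vectors of numbers and vectors of letters, and we do not need to tell R beforehand which data type we want. Finally, we were able to combine numbers and letters in one vector. Because every element of a vector must have the same type, mixed_vec coerced the numbers to character; we can see this because the values are printed inside quotes.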
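    The coercion described above follows a fixed order: logical, then integer, then double, then character. The following is a minimal sketch of that behavior; the variable names (lgl_int, int_dbl, dbl_chr, back) are illustrative only and do not come from the book's scripts.

```r
# Sketch: c() coerces its arguments to the most flexible type present
# (logical < integer < double < character).
lgl_int = c(TRUE, 2L)    # logical + integer
int_dbl = c(1L, 2.5)     # integer + double
dbl_chr = c(1.5, "a")    # double + character

class(lgl_int)
# [1] "integer"
class(int_dbl)
# [1] "numeric"
class(dbl_chr)
# [1] "character"

# Converting back is lossy: strings that are not numbers become NA.
# as.numeric() normally emits a warning here; it is suppressed for brevity.
back = suppressWarnings(as.numeric(c("1", "2", "a")))
back
# [1]  1  2 NA
```

    Since coercion happens silently, it is worth checking the class of a vector after combining values from different sources.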
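    The following code shows the data type of different vectors as returned by the function class. The class function is commonly used to "interrogate" an object, asking it what its class is.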
    
    ### Evaluate the data types using class
    ### One dimensional objects 
    # Integer vector 
    num = 1:10 
    class(num) 
    # [1] "integer"  
    # Numeric vector, it has a float, 10.5 
    num = c(1:10, 10.5) 
    class(num) 
    # [1] "numeric"  
    # Character vector 
    ltrs = letters[1:10] 
    class(ltrs) 
    # [1] "character"  
    # Factor vector 
    fac = as.factor(ltrs) 
    class(fac) 
    # [1] "factor"
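
    Besides class, base R offers a few other ways to interrogate an object. The sketch below uses typeof and the is.* predicates; it is meant as a complement to the example above, and the variable names are illustrative.

```r
# Sketch: other base R functions for inspecting an object's type
num = c(1:10, 10.5)

# is.* predicates return a single TRUE or FALSE
is.numeric(num)
# [1] TRUE
is.character(num)
# [1] FALSE

# typeof() reports the internal storage type rather than the class
typeof(1:10)
# [1] "integer"
typeof(num)
# [1] "double"

# A factor is stored as integer codes plus a set of levels
fac = as.factor(letters[1:3])
typeof(fac)
# [1] "integer"
levels(fac)
# [1] "a" "b" "c"
```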
    
    R also supports two-dimensional objects. The following code shows examples of the two most popular data structures used in R: the matrix and the data.frame.
    
    # Matrix
    M = matrix(1:12, ncol = 4) 
    M 
    #      [,1] [,2] [,3] [,4] 
    # [1,]    1    4    7   10 
    # [2,]    2    5    8   11 
    # [3,]    3    6    9   12 
    lM = matrix(letters[1:12], ncol = 4) 
    lM 
    #     [,1] [,2] [,3] [,4] 
    # [1,] "a"  "d"  "g"  "j"  
    # [2,] "b"  "e"  "h"  "k"  
    # [3,] "c"  "f"  "i"  "l"   
    # Coerces the numbers to character 
    # cbind concatenates two matrices (or vectors) in one matrix 
    cbind(M, lM) 
    #     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] 
    # [1,] "1"  "4"  "7"  "10" "a"  "d"  "g"  "j"  
    # [2,] "2"  "5"  "8"  "11" "b"  "e"  "h"  "k"  
    # [3,] "3"  "6"  "9"  "12" "c"  "f"  "i"  "l"   
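    # Since R 4.0.0 a matrix also inherits from "array"; 
    # earlier versions return only "matrix" 
    class(M) 
    # [1] "matrix" "array" 
    class(lM) 
    # [1] "matrix" "array"  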
    # data.frame 
    # One of the main objects of R, handles different data types in the same object.  
    # It is possible to have numeric, character and factor vectors in the same data.frame  
    df = data.frame(n = 1:5, l = letters[1:5]) 
    df 
    #   n l 
    # 1 1 a 
    # 2 2 b 
    # 3 3 c 
    # 4 4 d 
    # 5 5 e 
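
    To make the contrast between the two structures concrete, the sketch below checks the type of each data.frame column and then converts the data.frame to a matrix. It sets stringsAsFactors = FALSE explicitly so the result does not depend on the R version; the names col_classes and as_mat are illustrative.

```r
# Sketch: a data.frame keeps a separate type per column,
# while a matrix holds a single type for all cells.
df = data.frame(n = 1:5, l = letters[1:5], stringsAsFactors = FALSE)

# Class of each column: n is "integer", l is "character"
col_classes = sapply(df, class)
col_classes

# Converting to a matrix forces one common type, here character
as_mat = as.matrix(df)
typeof(as_mat)
# [1] "character"
dim(as_mat)
# [1] 5 2
```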
    
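    As the previous example shows, it is possible to use different data types in the same object. In general, this is how data is presented in databases and APIs: part of the data is text (character vectors) and part is numeric. It is the analyst's job to determine which statistical data type to assign and then use the correct R data type for it. In statistics, we normally consider variables to be of the following types:
    • Numeric
    • Nominal or categorical
    • Ordinal
    In R, a vector can belong to the following classes:
    • Numeric - Integer
    • Factor
    • Ordered factor
    R provides a data type for each statistical type of variable. Ordered factors are rarely used, however; they can be created with the function factor (with ordered = TRUE) or with the function ordered.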
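    As a minimal sketch of the difference between a nominal and an ordinal variable in R, the example below builds both from the same character vector. The data (sizes) and the variable names are made up for illustration.

```r
# Sketch: nominal vs. ordinal variables built from the same values
sizes = c("medium", "small", "large", "small")

# Nominal: factor() sorts the levels alphabetically by default
nominal = factor(sizes)
levels(nominal)
# [1] "large"  "medium" "small"

# Ordinal: state the level order explicitly and set ordered = TRUE
ordinal = factor(sizes, levels = c("small", "medium", "large"),
                 ordered = TRUE)
class(ordinal)
# [1] "ordered" "factor"

# Ordered factors support comparisons between levels
ordinal < "large"
# [1]  TRUE  TRUE FALSE  TRUE

# ordered() is a shortcut for factor(..., ordered = TRUE)
ordinal2 = ordered(sizes, levels = c("small", "medium", "large"))
is.ordered(ordinal2)
# [1] TRUE
```

    Stating the levels explicitly matters: without the levels argument the order would be alphabetical, which rarely matches the intended ranking.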
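    The following section introduces the concept of indexing. This is a very common operation: it deals with selecting parts of an object and transforming them.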
    
    # Let's create a data.frame
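    # stringsAsFactors = TRUE is set explicitly: it was the default before R 4.0.0 
    df = data.frame(numbers = 1:26, letters, stringsAsFactors = TRUE) 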
    head(df) 
    #      numbers  letters 
    # 1       1       a 
    # 2       2       b 
    # 3       3       c 
    # 4       4       d 
    # 5       5       e 
    # 6       6       f 
    # str gives the structure of a data.frame, it's a good summary to inspect an object 
    str(df) 
    #   'data.frame': 26 obs. of  2 variables: 
    #   $ numbers: int  1 2 3 4 5 6 7 8 9 10 ... 
    #   $ letters: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...  
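    # The latter shows the letters character vector was coerced to a factor. 
    # This is the effect of the stringsAsFactors = TRUE argument of data.frame 
    # (the default before R 4.0.0; in later versions the default is FALSE, 
    # so the column stays character unless the argument is set explicitly) 
    # read ?data.frame for more information  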
    class(df) 
    # [1] "data.frame"  
    ### Indexing
    # Get the first row 
    df[1, ] 
    #     numbers  letters 
    # 1       1       a  
    # Used for programming normally - returns the output as a list 
    df[1, , drop = TRUE] 
    # $numbers 
    # [1] 1 
    #  
    # $letters 
    # [1] a 
    # Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z  
    # Get several rows of the data.frame 
    df[5:7, ] 
    #      numbers  letters 
    # 5       5       e 
    # 6       6       f 
    # 7       7       g  
    ### Add one column that mixes the numeric column with the factor column 
    df$mixed = paste(df$numbers, df$letters, sep = '')  
    str(df) 
    # 'data.frame': 26 obs. of  3 variables: 
    # $ numbers: int  1 2 3 4 5 6 7 8 9 10 ...
    # $ letters: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ... 
    # $ mixed  : chr  "1a" "2b" "3c" "4d" ...  
    ### Get columns 
    # Get the first column 
    df[, 1]  
    # It returns a one dimensional vector with that column  
    # Get two columns 
    df2 = df[, 1:2] 
    head(df2)  
    #      numbers  letters 
    # 1       1       a 
    # 2       2       b 
    # 3       3       c 
    # 4       4       d 
    # 5       5       e 
    # 6       6       f  
    # Get the first and third columns 
    df3 = df[, c(1, 3)] 
    df3[1:3, ]  
    #      numbers  mixed 
    # 1       1     1a
    # 2       2     2b 
    # 3       3     3c  
    ### Index columns from their names 
    names(df) 
    # [1] "numbers" "letters" "mixed"   
    # This is the best practice in programming, as indices often change, but 
    # variable names don't 
    # We create a variable with the names we want to subset 
    keep_vars = c("numbers", "mixed") 
    df4 = df[, keep_vars]  
    head(df4) 
    #      numbers  mixed 
    # 1       1     1a 
    # 2       2     2b 
    # 3       3     3c 
    # 4       4     4d 
    # 5       5     5e 
    # 6       6     6f  
    ### subset rows and columns 
    # Keep the first five rows 
    df5 = df[1:5, keep_vars] 
    df5 
    #      numbers  mixed 
    # 1       1     1a 
    # 2       2     2b
    # 3       3     3c 
    # 4       4     4d 
    # 5       5     5e  
    # subset rows using a logical condition 
    df6 = df[df$numbers < 10, keep_vars] 
    df6 
    #      numbers  mixed 
    # 1       1     1a 
    # 2       2     2b 
    # 3       3     3c 
    # 4       4     4d 
    # 5       5     5e 
    # 6       6     6f 
    # 7       7     7g 
    # 8       8     8h 
    # 9       9     9i
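
    The logical condition in the last example is itself a vector of TRUE and FALSE values, one per row. The sketch below makes that explicit and combines conditions; it rebuilds a small data.frame so it can run on its own, and the names is_small, vowels, and middle are illustrative.

```r
# Sketch: logical indexing with explicit and combined conditions
df = data.frame(numbers = 1:26, letters, stringsAsFactors = FALSE)

# A comparison returns one TRUE/FALSE per row
is_small = df$numbers < 10
sum(is_small)    # TRUE counts as 1, so this is the number of matches
# [1] 9

# %in% tests membership in a set of values
vowels = df[df$letters %in% c("a", "e", "i", "o", "u"), ]
nrow(vowels)
# [1] 5

# Combine conditions with & (and) or | (or)
middle = df[df$numbers > 3 & df$numbers <= 6, "letters"]
middle
# [1] "d" "e" "f"

# which() converts a logical vector into row positions
which(df$letters == "z")
# [1] 26
```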