background-image: url(https://www.technotification.com/wp-content/uploads/2018/06/R-prograamming-for-data-science.jpg)
background-position: center
background-size: cover
class: title-slide

.bg-text[
# Introduction to Data Science with R
### week.4

<hr />

October 7, 2019

謝舒凱
]

---

# Course Announcements

.large[
- [Course reference page (in Chinese)](https://rlads2019.github.io/lecture/resources.html)
- Heads-up: there will be a quiz after the National Day holiday (names recorded, not graded)
- The top 20% of students on DataCamp are exempt
- Quiz content: R basics (DataCamp R basics, courses 1-3)
]

---

## Study Advice

- Give me six hours to chop down a tree and I will spend the first four sharpening the axe.
- Get to `\(<g,t>\)` early

???

typeof(), class() vs mode()

newbie stage 》 G point 》 expert/guru stage

---

## Data Structure

- Why do we need data structures?
- `\(<g,t>\)`: **data type** vs **data structure**?
- .large[R provides 6 basic data structures]
<span style="color:green; font-weight:bold">vector, matrix, array, factor, list, and data frame.</span>
- .large[What matters: how to create, check, convert, access, manipulate, and compute with them]
- **create, convert, access, manipulate, calculation**

???

list() vs as.list()

---

## Diagram

![](https://miro.medium.com/proxy/1*JjZYjvyBurwgQa1RBRtzAA.png)

---

## Vector Review

- All vectors are one-dimensional and each element is of the same type.

---

## Matrix

- A collection of elements with a two-dimensional representation (i.e., rows and columns).
- A matrix can contain elements of the *same* data type only (`character`, `numeric`, or `logical`).
- **create, convert, access, manipulate, calculation**

```r
m0 <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
m1 <- matrix(1:25, nrow = 5, ncol = 5)
# check byrow=
# rnames <- c("R1", "R2", "R3", "R4", "R5")
# cnames <- c("C1", "C2", "C3", "C4", "C5")
# m1 <- matrix(1:25, nrow = 5, ncol = 5, dimnames = list(rnames, cnames))
# class(m1); mode(m1)
```

---

## Matrix

```r
# access
m1[3, 4]       # a single element: row 3, column 4
m1[, 3]        # all of column 3
m1[c(1:3), ]   # rows 1 to 3

# convert
v <- as.vector(m1); v
```

---

## Matrix

- Another way to build a matrix is to bind rows or columns with `rbind()` and `cbind()`.
- You can also use the `byrow` argument to specify how the matrix is filled.

```r
# manipulate: merge and delete
(y <- c(1:10))
m2 <- matrix(y, nrow = 5, ncol = 2); m2
# (m2 <- matrix(y, nrow = 5, ncol = 2, byrow = F))
(m3 <- rbind(m2, c(11, 12)))   # merge: append a row
(m4 <- cbind(m3, c(13:18)))    # merge: append a column
(m4 <- m4[-2, ])               # delete: drop row 2
```

---

## Matrix

### Matrix operations

```r
# Transpose the whole matrix
t(m2)
# Matrix multiplication
m2 %*% t(m2)
```

---

## Array

An array generalizes the matrix: a matrix is a 2-dimensional array, while an array can have more than 2 dimensions.

```r
# array(data = NA, dim = length(data), dimnames = NULL)
z <- c(1:30)
dim1 <- c("a1", "a2", "a3")
dim2 <- c("b1", "b2", "b3", "b4", "b5")
dim3 <- c("c1", "c2")
a <- array(z, dim = c(3, 5, 2), dimnames = list(dim1, dim2, dim3))
```

---

## Array

```r
a[2,4,1]
```

```
## [1] 11
```

```r
a['a1','b4','c1']
```

```
## [1] 10
```

```r
dim(a)
```

```
## [1] 3 5 2
```

---

## Data Frame

.large[The data structure you will work with most often]

- A data frame is similar to a matrix, but its columns can hold data elements of different types.
- It is the most commonly used data structure for analysis. The number of columns equals the number of observed variables; the number of rows equals the number of observations.

```r
# create, manipulate, access
# iris
(iris.simple <- data.frame(Sepal.Length = c(5.1, 4.7, 5.0),
                           Sepal.Width = c(3.5, 3.2, 3.6),
                           Petal.Length = c(1.4, 1.3, 1.4)))
```

```
##   Sepal.Length Sepal.Width Petal.Length
## 1          5.1         3.5          1.4
## 2          4.7         3.2          1.3
## 3          5.0         3.6          1.4
```

```r
# str(); dim(); summary()
```

---

## Data Frame

- `[]`, `$`, `subset()`

```r
iris.simple[, 1]                        # first column
iris.simple$Sepal.Width                 # a column by name
iris.simple$Sepal.Width[2]              # second element of that column
subset(iris.simple, Sepal.Length < 5)   # rows that satisfy a condition
```

---

## Data Frame

```r
## cbind(), rbind()
names(iris.simple)
names(iris.simple)[1] <- "sepal.length"   # rename the first column
```

---

## Data Frame

- Basic arithmetic
- Basic statistics: `mean(), median(), sum(), min(), max(), sd(), ...`

```r
# Practice: build a data frame yourself
students <- data.frame(c("Cedric", "Fred", "George", "Cho", "Draco", "Ginny"),
                       c(3, 2, 2, 1, 0, -1),
                       c("H", "G", "G", "R", "S", "G"))
names(students) <- c("name", "year", "house") # name the columns
class(students) # "data.frame"
class(students$year) # "numeric"
class(students[,3]) # "factor" in R < 4.0, "character" in R >= 4.0
# find the dimensions
nrow(students)
ncol(students)
dim(students)
```

---

## In-class Exercise

`mtcars` is a good dataset to practice on. (Post your answers on `NTU cool` to let me know.)

```r
#mtcars # The built-in data frame
#help(mtcars)
dim(mtcars) # The dimensions (rows and columns)
nrow(mtcars) # Number of rows
ncol(mtcars) # Number of columns
names(mtcars) # The column names
rownames(mtcars) # The row names
summary(mtcars) # A summary of each column
```

---

## Factor

- A quick review of how statistics classifies variables

<img style='border: 1px solid;' width=40% src='./img/var.png'></img>

- In R, categorical variables (e.g. male/female) and ordinal variables (e.g. good/fair/poor) are called factors. They show up often in data frames.

Factors are variables that take on a limited number of values, also known as categorical variables. In R, a factor is stored as a vector of integer codes together with the set of character values shown when it is printed (colloquially labels; in R, levels).

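---

## Factor: a look underneath

The description above (integer codes plus levels) can be checked directly. This is a minimal sketch; the `rating` vector is made up for illustration and is not part of the course data.

```r
# Hypothetical survey answers, used only for illustration
rating <- c("good", "poor", "fair", "good", "fair")

rating.fac <- factor(rating)   # levels default to the sorted unique values
levels(rating.fac)             # the labels
as.integer(rating.fac)         # the integer codes stored underneath
typeof(rating.fac)             # storage type of a factor
class(rating.fac)              # "factor"

# Map the codes back to labels by indexing the levels
recovered <- levels(rating.fac)[as.integer(rating.fac)]
identical(recovered, rating)   # TRUE when the mapping is lossless
```

Indexing `levels()` by the integer codes is also the safe way to get the original strings back; calling `as.integer()` alone returns the codes, not the values.
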
---

## Factor

- A factor can be seen as a special kind of vector whose elements are values of a qualitative variable.
Create one with `factor()`, and use `levels()` to get its levels (the values the categorical data can take).

```r
gender <- c("female", "female", "male", "female", "male", "female")
gender.2 <- factor(gender)
levels(gender.2)
```

---

## Factor

```r
# turn it into an ordered factor
honor <- c("cum laude", "summa cum laude", "cum laude",
           "summa cum laude", "magna cum laude", "cum laude")
honor.fac <- factor(honor,
                    levels = c("cum laude", "magna cum laude", "summa cum laude"),
                    ordered = TRUE); honor.fac
```

---

## List

- The grab bag of data structures: its elements can be vectors, matrices, arrays, data frames, or even other lists.

- Elements of a list can also have different lengths.

---

## List

- **create, access, manipulate**

```r
# create
v1 <- c(1:10)
v2 <- c("life", "is", "short")
m1 <- matrix(c(1:9), nrow = 3)
f1 <- factor(c("positive", "negative", "negative", "neutral", "positive"))
name <- c("jessy", "jessica", "jessie")
R <- c(60, 90, 92)
PYTHON <- c(60, 95, 93)
piano <- c("great", "ok", "ok")
df1 <- data.frame(name, R, PYTHON, piano)

mylist <- list(v1, v2, m1, f1, df1)
# naming (mind the syntax!)
mylist <- list(num = v1, char = v2, mat = m1, fac = f1, daframe = df1)
```

- `list()` vs. `as.list()`: create vs coerce

---

## List

```r
## access: three ways: [[index]], [[element.name]], list$element.name
mylist[[1]]
mylist[["num"]]
mylist$num
```

- Use `table()` to build a contingency table; `prop.table()` converts the counts to proportions.

```r
table(mylist$fac)
```

```
##
## negative  neutral positive
##        2        1        2
```

---

## Control Flow

.large[
- Conditionals
- Loops
]

---

## Basic plotting

- `plot()` is the basic plotting function.

```r
#plot(iris)
#plot(iris$Sepal.Length, iris$Petal.Length)
```

- `qplot()` is a basic plotting function from the `ggplot2` package. Its usage is similar, but the output looks nicer?

![](index_files/figure-html/unnamed-chunk-18-1.png)<!-- -->

---

## In-class Exercise

- Combine the data above into a data frame (unordered, categorical variables).
- Use `table()` to build a contingency table; `prop.table()` converts the counts to proportions.
- Plot it.

---

## Preparing/cleaning data

- In many cases, getting our data into the rectangular arrangement of a matrix or data frame is the first step in preparing it for analysis.
- Data scientists spend as much as 60%-80% of their analysis time preparing the data.
- (numerical data) handling missing data and outliers
- (textual data) tokenization/word segmentation

---

## Missing values

> Missing values are values that should have been recorded but were not.

- A missing value is represented by `NA` (Not Available); in factor and character columns of a printed data frame it is displayed as `<NA>`.
- Use `is.na()` to flag which elements are NA; `anyNA()` returns TRUE if the object contains any missing values.

```r
(missing_dat <- data.frame(col.1 = c(1, NA, 0, 1), col.2 = c("M", "F", NA, "M")))
is.na(missing_dat$col.1)
anyNA(missing_dat)
# extract the non-missing values (returned as a single vector)
missing_dat[!is.na(missing_dat)]
```

---

## Missing values

- We can replace the NA with the mean value or we can **remove these NA rows**.

```r
(newdata <- na.omit(missing_dat))
```

- Many functions take an `na.rm` argument. Setting it to TRUE drops every NA before computing; otherwise `NA+[anything]=NA`. But beware: substituting or removing missing values is not always sound methodologically.

```r
sum(c(NA, 1,44,23,NA,99), na.rm = TRUE)
```

```
## [1] 167
```

???

NaN, NULL, Inf: test each with is.na()

---

## Reading big files with `data.table`

The `data.table` package is extremely useful for larger files, and it is much, much faster than `read.table`.

```r
require(data.table)
```

```
## Loading required package: data.table
```

```r
students <- as.data.table(students)
students # note the slightly different print-out
students[name=="Ginny"] # get rows with name == "Ginny"
students[year==2] # get rows with year == 2
```

---

## Basic I/O

Know the defaults

- `read.table(file, header = FALSE, sep = "")`
- `write.table(x, file = "", append = FALSE, sep = " ", row.names = TRUE, col.names = TRUE)`

---

## Data input

- `read.table()` is the most basic data input function. At a minimum, understand these arguments: `file, header, sep, stringsAsFactors`
- **file**: a relative or absolute path, written with `/` or `\\` (e.g., OSX `"~/dsR/data"`, Windows `"C:\\dsR\\data"`).
- **header**: a logical value. If TRUE, the first row is used as the variable names.
- **sep**: the field separator. The default `""` means any whitespace.
- **stringsAsFactors**: whether character columns are converted to factors. This was the default before R 4.0; set it to FALSE to keep strings as strings.
- For data exported from Excel, use `na.strings = c("", "#N/A", "#DIV/0!", "#NUM!")`.
- **fill**: load a data file whose rows have unequal numbers of fields. With `fill = TRUE`, the missing fields are filled in with blanks.

---

## For those not yet comfortable with file paths

```r
data <- read.table(file.choose())   # opens a file picker; works on any platform
data <- read.table(choose.files())  # Windows only
```

---

## Data I/O: output

- `row.names` and `col.names` are both logical values. If TRUE, the row or column names are written along with the data.

```r
write.csv(data, "~/dsR/data.csv", row.names = FALSE, fileEncoding = "UTF-8")
```

---

## In-class Exercise

Practice reading an external file: [Personality](http://personality-project.org/r/#getdata)

```r
personality <- read.table(
  "http://personality-project.org/r/datasets/maps.mixx.epi.bfi.data",
  header = TRUE) # or: header = T
```

---
background-image: url(../img/emo/boredom-small.png)

---

## Review

<img style='border: 1px solid;' width=50% src='./img/data-science.png'></img>

The data science process:

- Define (operationally) a question that data can answer (the type of question determines the type of answer!)
- Collect and clean the data
- Explore and analyze the data (what if the data cannot answer the question?)
- Communicate (transfer your findings to action!!)

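---

## Basic I/O: a round-trip sketch

The `read.table()` and `write.table()` arguments covered earlier can be tried out without worrying about paths. This is a minimal sketch: the `scores` data frame is hypothetical, and `tempfile()` is used so nothing is written to your project folder.

```r
# Hypothetical data frame, used only for illustration
scores <- data.frame(name = c("amy", "ben", "cat"),
                     R = c(60.5, 90, NA),
                     stringsAsFactors = FALSE)

path <- tempfile(fileext = ".txt")   # temporary file, no path setup needed

# Write tab-separated text with a header row and without row names
write.table(scores, file = path, sep = "\t",
            row.names = FALSE, col.names = TRUE)

# Read it back; header and sep must match what was written
scores.back <- read.table(path, header = TRUE, sep = "\t",
                          stringsAsFactors = FALSE)

str(scores.back)
all.equal(scores, scores.back)   # TRUE if the round trip preserved the data
unlink(path)                     # remove the temporary file
```

Leaving out `header = TRUE` when reading would turn the column names into a first data row, which is the most common surprise with the defaults.
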
---

## Group Exercise

<span style="color:green; font-weight:bold">Play with your own data</span>

```r
dsr <- read.csv("data/week3.in.class.csv", header = TRUE, stringsAsFactors = FALSE)
dsr.clean <- na.omit(dsr)                      # drop rows with missing values
dsr.clean$gender <- factor(dsr.clean$gender)   # treat as categorical
dsr.clean$grade <- factor(dsr.clean$grade)
str(dsr.clean)
table(dsr.clean$gender)
```
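
---

## Group Exercise: dropping vs filling NAs

The exercise above removes incomplete rows with `na.omit()`. As a point of comparison, the sketch below contrasts that with mean substitution. The `grades` data frame and its columns are hypothetical stand-ins, not the contents of the class file.

```r
# Hypothetical class data, used only for illustration
grades <- data.frame(gender = c("F", "M", NA, "F", "M"),
                     score = c(80, NA, 75, 90, 85),
                     stringsAsFactors = FALSE)

colSums(is.na(grades))   # number of NAs in each column

# Option 1: drop every row that has at least one NA
grades.drop <- na.omit(grades)
nrow(grades.drop)

# Option 2: keep all rows, replace missing scores with the column mean
grades.fill <- grades
score.mean <- mean(grades$score, na.rm = TRUE)
grades.fill$score[is.na(grades.fill$score)] <- score.mean

# Proportion of each gender among the complete rows
prop.table(table(factor(grades.drop$gender)))
```

Neither option is automatically right: dropping rows shrinks the sample, while mean substitution keeps the rows but understates the variability of `score`.
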