background-image: url(http://fusionanalyticsworld.com/wp-content/uploads/2016/08/Data-Science-with-R1.jpg?a73fae) background-position: center background-size: cover class: title-slide .bg-text[ # Introduction to Data Science with R ### week.7 <hr /> 10月 31, 2019 謝舒凱 ] --- ## Data Manipulation using `dplyr` and `tidyr` - .large[data wrangling (= munging)] using `dplyr` > the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.) - .large[data reshaping] : change the layout of data set using `tidyr` > about changing the way data is organized into rows and columns. --- ## `tidyr` - a package that **reshape** the layout of tables (into **tidy** format), particularly designed to work in combination with `magrittr and dplyr` to build a solid data analysis pipeline. - Four main functions <img src="tidyr.png", width=450 height=380 align=right> --- ## tidy data Recall that a data set is called tidy when: - each column represents a variable - each row represents an observation (or case) Having your data in tidy format is crucial for facilitating the tasks of data analysis including data manipulation, modeling and visualization. --- background-image: url(../img/emo/boredom-small.png) --- ## `tidyr` ```r #install.packages("tidyr") library("tidyr") # use built-in dataset ""USArrest" ``` - Start by subsetting a small data set, ```r small <- USArrests[c(1, 10, 20, 30,40),]; small ``` ``` ## Murder Assault UrbanPop Rape ## Alabama 13.2 236 58 21.2 ## Georgia 17.4 211 60 25.8 ## Maryland 11.3 300 67 27.8 ## New Jersey 7.4 159 89 18.8 ## South Carolina 14.4 279 48 22.5 ``` - Row names are states, use `cbind()` to add a column named 'state' in the data. which will make the data tidy and the analysis easier. ```r my_data <- cbind(state = rownames(small), small); my_data ``` ``` ## state Murder Assault UrbanPop Rape ## Alabama Alabama 13.2 236 58 21.2 ## Georgia Georgia 17.4 211 60 25.8 ## Maryland Maryland 11.3 300 67 27.8 ## New Jersey New Jersey 7.4 159 89 18.8 ## South Carolina South Carolina 14.4 279 48 22.5 ``` --- ## Tidy form Having your data in tidy format is crucial for facilitating the tasks of data analysis including data manipulation, modeling and visualization. <img src="tidy-data.png", width=500 height=380 align=center> --- ## `gather()` collapse columns into rows - collapses multiple columns into key-value pairs. 寬格式轉長格式 It produces a 'long' data format from a 'wide' one. - `gather(dataframe, key, value, ...)` <img src="gather.png", width=150 height=100 align = center> ```r my_data2 <- gather(my_data, key = "arrest_attribute", value = "arrest_estimate", -state) # gather only Murder and Assault #my_data2 <- gather(my_data, # key = "arrest_attribute", # value = "arrest_estimate", # Murder, Assault) ``` ??? Note that, all column names (except state) have been collapsed into a single key column (here “arrest_attribute”). Their values have been put into a value column (here “arrest_estimate”). --- ## `spread()`: spread two columns into multiple columns 相反動作 - does the reverse of `gather()`. It takes two columns (key and value) and spreads into multiple columns. It produces a 'wide' data format from a 'long' one. <img src="spread.png", width=150 height=100 align = center> ```r my_data3 <- spread(my_data2, key = "arrest_attribute", value = "arrest_estimate" ) ``` --- ## `unite()` : Unite multiple columns into one - takes multiple columns and paste them together into one. - `unite(data, col, ..., sep = "_")` <img src="unite.png", width=180 height=100 align = center> ```r my_data4 <- unite(my_data, col = "Murder_Assault", Murder, Assault, sep = "_") ``` --- ## `separate()`: separate one column into multiple 相反動作 - the reverse of `unite()`. It takes values inside a single character column and separates them into multiple columns. <img src="separate.png", width=180 height=100 align = center> ```r # Separate the column “Murder_Assault” [in my_data4] into two columns Murder and Assault separate(my_data4, col = "Murder_Assault", into = c("Murder", "Assault"), sep = "_") ``` ``` ## state Murder Assault UrbanPop Rape ## Alabama Alabama 13.2 236 58 21.2 ## Georgia Georgia 17.4 211 60 25.8 ## Maryland Maryland 11.3 300 67 27.8 ## New Jersey New Jersey 7.4 159 89 18.8 ## South Carolina South Carolina 14.4 279 48 22.5 ``` --- ## Chaining together - recall that `x %>% f` is equivalent to `\(f(x)\)` ```r my_data %>% gather(key = "arrest_attribute", value = "arrest_estimate", Murder:UrbanPop) %>% unite(col = "attribute_estimate", arrest_attribute, arrest_estimate) ``` ``` ## state Rape attribute_estimate ## 1 Alabama 21.2 Murder_13.2 ## 2 Georgia 25.8 Murder_17.4 ## 3 Maryland 27.8 Murder_11.3 ## 4 New Jersey 18.8 Murder_7.4 ## 5 South Carolina 22.5 Murder_14.4 ## 6 Alabama 21.2 Assault_236 ## 7 Georgia 25.8 Assault_211 ## 8 Maryland 27.8 Assault_300 ## 9 New Jersey 18.8 Assault_159 ## 10 South Carolina 22.5 Assault_279 ## 11 Alabama 21.2 UrbanPop_58 ## 12 Georgia 25.8 UrbanPop_60 ## 13 Maryland 27.8 UrbanPop_67 ## 14 New Jersey 18.8 UrbanPop_89 ## 15 South Carolina 22.5 UrbanPop_48 ``` --- # 小抄速記 Cheatsheet .large[ - [可以善用 cheatsheet](https://rstudio.com/resources/cheatsheets/) - 徹底弄懂這個 [Data Wrangling with dplyr and tidyr](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) ] --- ## 期中考加分練習 - 4 人一組。 - 繳交 html 檔到 NTU COOL. ```r # 檔案放在 # "https://github.com/loperntu/dsR/blob/gh-pages/lectures/07/birth2006.csv" # 有 31.1 MB, 也可以直接下載後再讀入 library(tidyverse) birth <- tbl_df(read.csv("birth2006.csv", row.names = 1)) # or read the r data format (which reduces considerably the size of large files by compression) # birth <- readRDS(file = "birth2006.csv") ``` --- ## 資料說明 `birth2006.csv`檔,包含了 427323 筆紀錄,與 13 筆變量。變量的意義如下: - `DOB_MM` and `DOB_WK` : 出生的月份與星期。 - `DBWT` : 北鼻的重量。 - `WTGAIN`:母親妊娠期北鼻的增加重量。 - `SEX` AND `APGAR5`:北鼻的性別和 APGAR5 分數。 - `DPLURAL` :胎數 - `ESTGEST`:預估的懷孕週數。 - `DMETH_REC`:剖腹產(c-section) 或自然產(vaginal) --- ## 探索分析初體驗 - 經過基本審視資料,可以發現 <span style="color:green; font-weight:bold">在週日(標號1)出生的北鼻比週間還少</span>. 請用程式產生數據說明這個事實,並且做出以適當的圖形。 - 對於上述事實有個猜想是就一般的狀況大部分的生產方式都是採「剖腹產」,而通常「剖腹產」的手術較少排在週日。請用程式產生數據說明這個事實。 - 如果我們要用圖形初步探索以下的變量之間是否有關聯,要如何利用`ggplot2`作圖? - *胎數與生產方式與北鼻體重* - *北鼻的體重成長與出生時重量*