Exploratory Data Analysis (1)

Introducing Data Science with R, week 6 2019-10-17

在了解基本的 R語法之後，我們要進入整頓資料 (data wrangling/transformation)（進而理解資料）的學習階段。

首先你會拿到資料，但是不整齊 (non-tidy) 的資料讓妳的人生被浪費。

幸福的人都很類似，不幸的人則各有各的不幸。 Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham

那整齊的資料是什麼意思？

Tidy data

三原則

每個變數都必須有自己的資料欄 Each variable forms a column.
每個觀察都必須有自己的資料列 Each observation forms a row.
每個值都必須有自己的格子 Each value must have its own cell.

看例子 < 附帶一提，你可以開始看這本書了。
有學術興趣的人，可以試著讀 Wickham 大神的原著

Tidy data processing

開發團隊有個整體架構 tidyverse, 希望形成一個資料科學的套件生態圈，共享處理資料的哲學、語法、與資料結構。這個想法甚至引發了 R 語言教育的路線之爭 (R base first vs Tidyverse first)
我們先掌握兩個核心的套件架構即可。
- dplyr: a Grammar of Data Manipulation
- ggplot2: a Grammar of Graphics
在 DataCamp 上規劃的 Introduction to the Tidyverse 的課程，課程時數約四小時，請大家找時間預習與練習。

`dplyr`

Great for data exploration and transformation
Intuitive to write and easy to read, especially when using the chaining syntax
Fast on data frames

Basics

Data manipulation 五個基本動作：filter(), select(), arrange(), mutate(), summarise()。注意：直行（VAR）橫列 (OBS)

篩選 filter(): 按給定的邏輯判斷，篩選出符合要求的 OBS, 類似於 subset()。
選擇 select() : 用 VAR 作參數來選擇 OBS。
排列 arrange(): 按給定的 VAR 依次對 OBS 進行排序。類似於 order()。
增行 mutate(): 對已有 VAR 進行運算並添加為新的 VAR。類似於 transform()。
摘要 summarise(): 對data frame 調用其它函數進行 summarise, 並回傳一維結果。

First argument is a data frame, and subsequent arguments say what to do with data frame.

# load packages
suppressMessages(library(dplyr))
# install.packages(c("hflights", "Lahman"))
library(hflights)

# explore data: flights departing from two Houston airports in 2011
data(hflights)
head(hflights,5)

tbl_df() creates a 「local data frame」, which is simply a wrapper for a data frame that prints nicely.

# convert to local data frame
flights <- tbl_df(hflights)
# alternative: head(data.frame(flights),3)

# printing only shows 10 rows and as many columns as can fit on your screen
flights

# you can specify that you want to see more rows
#print(flights, n=20)

# convert to a normal data frame to see all of the columns
#data.frame(head(flights))

`filter` (keep rows marching criteria) : 篩選觀察

Base R approach to filtering forces you to repeat the data frame’s name, and dplyr approach is simpler to write and read.
filter(df, 回傳符合邏輯條件的 rows)

# base R approach to view all flights on January 1
#flights[flights$Month==1 & flights$DayofMonth==1, ]

# dplyr approach
# note: you can use comma or ampersand to represent AND condition
filter(flights, Month==1 & DayofMonth==1)

# use pipe for OR condition
filter(flights, UniqueCarrier=="AA" | UniqueCarrier=="UA")

# you can also use %in% operator
#filter(flights, UniqueCarrier %in% c("AA", "UA"))

`select`: Pick columns by name 選取變量

Base R approach is awkward to type and to read, dplyr approach uses similar syntax to filter.

# base R approach to select DepTime, ArrTime, and FlightNum columns
#flights[, c("DepTime", "ArrTime", "FlightNum")]

# dplyr approach
select(flights, DepTime, ArrTime, FlightNum)

# use colon to select multiple contiguous columns, and use `contains` to match columns by name
# note: `starts_with`, `ends_with`, and `matches` (for regular expressions) can also be used to match columns by name
# 或者使用 - 來排除某列
select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))

Chaining/Pipelining - Usual way to perform multiple operations in one line is by nesting. - Can write commands in a natural order by using the %>% infix operator (which can be pronounced as 「then」) - Chaining increases readability significantly when there are many commands (Operator is automatically imported from the magrittr package)

# nesting method to select UniqueCarrier and DepDelay columns and filter for delays over 60 minutes
filter(select(flights, UniqueCarrier, DepDelay), DepDelay > 60)

# chaining method
flights %>%
    select(UniqueCarrier, DepDelay) %>%
    filter(DepDelay > 60)

再一個例子

# create two vectors and calculate Euclidian distance between them
x1 <- 1:5; x2 <- 2:6
sqrt(sum((x1-x2)^2))

# chaining method
(x1-x2)^2 %>% 
  sum() %>% 
  sqrt()

rename() 重命名變量 variable names rename(tbl, newname = oldname,...)

`arrange()`: 排序觀察 Reorder rows

arrange(依 variable 或是 variables 來排序), desc() 是倒序。

# base R approach to select UniqueCarrier and DepDelay columns and sort by DepDelay
#flights[order(flights$DepDelay), c("UniqueCarrier", "DepDelay")]

# dplyr approach
flights %>%
    select(UniqueCarrier, DepDelay) %>%
    arrange(DepDelay)

# use `desc` for descending
# flights %>%
#     select(UniqueCarrier, DepDelay) %>%
#     arrange(desc(DepDelay))
#

`mutate()`: Add new variables 擴展新 variable（在後面）

Create new variables that are functions of existing variables, 類似於transform()

# base R approach to create a new variable Speed (in mph)
#flights$Speed <- flights$Distance / flights$AirTime*60
#flights[, c("Distance", "AirTime", "Speed")]

# dplyr approach (prints the new variable but does not store it)
flights %>%
    select(Distance, AirTime) %>%
    mutate(Speed = Distance/AirTime*60)

# store the new variable
flights <- flights %>% 
  mutate(Speed = Distance/AirTime*60)

`summarise()`: Reduce variables to values

Primarily useful with data that has been grouped by one or more variables
summarise() uses the provided aggregation function to summarise each group

Group Data

以上功能加上分組操作group_by()這個概念結合起來時非常強大!

# base R approaches to calculate the average arrival delay to each destination
# head(with(
#   flights, 
#   tapply(ArrDelay, Dest, mean, na.rm=TRUE))
#   )
# head(aggregate(ArrDelay ~ Dest, flights, mean))


# dplyr approach: create a table grouped by Dest, and then summarise each group by taking the mean of ArrDelay
flights %>%
    group_by(Dest) %>%
    summarise(avg_delay = mean(ArrDelay, na.rm=TRUE))

summarise_each() allows you to apply the same summary function to multiple columns at once (Note: mutate_each() is also available)

# for each carrier, calculate the percentage of flights cancelled or diverted
flights %>%
    group_by(UniqueCarrier) %>%
    summarise_each(funs(mean), Cancelled, Diverted)

# for each carrier, calculate the minimum and maximum arrival and departure delays
flights %>%
    group_by(UniqueCarrier) %>%
    summarise_each(funs(min(., na.rm=TRUE), max(., na.rm=TRUE)), matches("Delay"))

Helper function n() counts the number of rows in a group
Helper function n_distinct(vector) counts the number of unique items in that vector

# for each day of the year, count the total number of flights and sort in descending order
flights %>%
    group_by(Month, DayofMonth) %>%
    summarise(flight_count = n()) %>%
    arrange(desc(flight_count))

# for each destination, count the total number of flights and the number of distinct planes that flew there
flights %>%
    group_by(Dest) %>%
    summarise(flight_count = n(), plane_count = n_distinct(TailNum))

Grouping can sometimes be useful without summarising

# for each destination, show the number of cancelled and not cancelled flights
flights %>%
    group_by(Dest) %>%
    select(Cancelled) %>%
    table() %>%
    head()

Summarise function takes n inputs and returns 1 value
Window function takes n inputs and returns n values. Includes ranking and ordering functions (like min_rank()), offset functions (lead() and lag()), and cumulative aggregates (like cummean()).

`ggplot2`

作圖是 EDA 的一把瑞士刀。在應用技術之前，可以先想想
- 什麼樣資料適合用什麼樣的圖形表達？
- 適當的作圖工具套件（library）為何？
- 如何生動、產生互動？
先看看這個 plots to avoid
學習途徑：plot() - qplot() - ggplot() >> interactive plot rCharts, plotly, networkD3, dygraphs… (視你的應用需求而定)。有些套件本身就足夠滿足妳的需求。
- e.g., gather and display Google trend information.

#install.packages('gtrendsR')
library(gtrendsR)
trends <- gtrends(c("Nerds", "Smarties"), geo ="CA")
plot(trends)

make it more interactive

#install.packages('plotly')
library(plotly)
p <-plot(trends)

ggplotly(p)

`ggplot2` 的基本文法

gg 代表 grammar of graphics
- (data, aesthetics) + geometry
  - data: a data frame
  - aesthetics: used to indicate x and y variables, also used to control the color, size, shape of points, heights of bars, etc.
  - geometry: corresponds to the type of graphics (histogram, box plot,…)

library(ggplot2)
gg <- ggplot(diamonds, aes(price, carat)) +
  geom_point(color = "brown4") # scatter plot; size=1.5, shape=18

gg

Build plots steps by steps

gg <- gg + ggtitle("Diamond carat and price")

Rapid Data Exploration with dplyr and ggplot2

diamonds %>%                      # Start with the 'diamonds' dataset
  filter(cut == "Ideal") %>%      # Then, filter down to rows where cut == Ideal
  ggplot(aes(x=color,y=price)) +  # Then, plot using ggplot
    geom_boxplot()                #  with and create a boxplot

再回來玩 gtrendsR

library(gtrendsR)

# https://rdrr.io/cran/gtrendsR/man/gtrends.html

#define the keywords
keywords=c("Paris","New York","Barcelona")
#set the geographic area: DE = Germany
country=c('TW')
#set the time window
time=("2010-01-01 2018-08-27")
#set channels 
channel='web'



trends = gtrends(keywords, gprop =channel,geo=country, time = time )
#select only interst over time 
time_trend=trends$interest_over_time
head(time_trend)


plot<-ggplot(data=time_trend, aes(x=date, y=hits,group=keyword,col=keyword))+
  geom_line()+xlab('Time')+ylab('Relative Interest')+ theme_bw()+
  theme(legend.title = element_blank(),legend.position="bottom",legend.text=element_text(size=12))+ggtitle("Google Search Volume")
plot

中文資料參考

http://molecular-service-science.com/2013/11/27/r-ggplot-tutorial-1/

http://molecular-service-science.com/2014/01/23/r-ggplot-tutorial-2/

Exploratory Data Analysis (1)

Tidy data

Tidy data processing

dplyr

Basics

filter (keep rows marching criteria) : 篩選觀察

select: Pick columns by name 選取變量

arrange(): 排序觀察 Reorder rows

mutate(): Add new variables 擴展新 variable（在後面）

summarise(): Reduce variables to values