
# Week 2: .fancy[Data Wrangling]

### EMSE 6035: Marketing Analytics for Design Decisions
### John Paul Helveston
### September 06, 2023


---

# Required Packages (check `practice.R` file)

Make sure you have these packages installed:

```r
install.packages(c("tidyverse", "here"))
```

**Remember: you only need to install packages once!**

<br>

Once installed, you'll need to _load_ the libraries every time you open RStudio:

```r
library(tidyverse)
library(here)
```

---

# Week 2: .fancy[Data Wrangling]

### 1. Working with data frames
### 2. Data wrangling with the _tidyverse_

### BREAK

### 3. Project proposals

---

# Week 2: .fancy[Data Wrangling]

### 1. .orange[Working with data frames]
### 2. Data wrangling with the _tidyverse_

### BREAK

### 3. Project proposals

---

# The data frame...in Excel

---

# The data frame...in R

```r
beatles <- tibble(
    firstName   = c("John", "Paul", "Ringo", "George"),
    lastName    = c("Lennon", "McCartney", "Starr", "Harrison"),
    instrument  = c("guitar", "bass", "drums", "guitar"),
    yearOfBirth = c(1940, 1942, 1940, 1943),
    deceased    = c(TRUE, FALSE, FALSE, TRUE)
)

beatles
```

```
#> # A tibble: 4 × 5
#>   firstName lastName  instrument yearOfBirth deceased
#>   <chr>     <chr>     <chr>            <dbl> <lgl>   
#> 1 John      Lennon    guitar            1940 TRUE    
#> 2 Paul      McCartney bass              1942 FALSE   
#> 3 Ringo     Starr     drums             1940 FALSE   
#> 4 George    Harrison  guitar            1943 TRUE
```

---

## **Columns**: _Vectors_ of values (must be same data type)

```r
beatles
```

Extract a column using `$`

```r
beatles$firstName
```

```
#> [1] "John"   "Paul"   "Ringo"  "George"
```
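
Because each column is a vector, the usual vector functions apply to it. A minimal sketch, rebuilding a two-column copy of `beatles` so the block runs on its own (`[[ ]]` is shown as an alternative to `$`):

```r
library(tibble)

# Smaller copy of the beatles data frame so this block is self-contained
beatles <- tibble(
    firstName   = c("John", "Paul", "Ringo", "George"),
    yearOfBirth = c(1940, 1942, 1940, 1943)
)

# `$` and `[[ ]]` both return the column as a plain vector
first_names <- beatles$firstName
birth_years <- beatles[["yearOfBirth"]]

class(first_names)   # the type of the vector
length(first_names)  # one element per row
mean(birth_years)    # vector functions work directly on a column
```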

---

## **Rows**: Information about individual observations

Information about _John Lennon_ is in the first row:

```r
beatles[1,]
```

```
#> # A tibble: 1 × 5
#>   firstName lastName instrument yearOfBirth deceased
#>   <chr>     <chr>    <chr>            <dbl> <lgl>   
#> 1 John      Lennon   guitar            1940 TRUE
```

Information about _Paul McCartney_ is in the second row:

```r
beatles[2,]
```

```
#> # A tibble: 1 × 5
#>   firstName lastName  instrument yearOfBirth deceased
#>   <chr>     <chr>     <chr>            <dbl> <lgl>   
#> 1 Paul      McCartney bass              1942 FALSE
```
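
More generally, square brackets take `[rows, columns]`, and leaving one side blank means "all". A short sketch on a three-column copy of `beatles` (the object names `first_two`, `guitarists`, and `ringo_plays` are made up for illustration):

```r
library(tibble)

beatles <- tibble(
    firstName  = c("John", "Paul", "Ringo", "George"),
    lastName   = c("Lennon", "McCartney", "Starr", "Harrison"),
    instrument = c("guitar", "bass", "drums", "guitar")
)

# Rows 1 and 2, all columns
first_two <- beatles[1:2, ]

# Rows that match a condition, all columns
guitarists <- beatles[beatles$instrument == "guitar", ]

# One row and one column: a tibble stays a tibble (1 x 1)
ringo_plays <- beatles[3, "instrument"]

nrow(guitarists)
```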

---

## Take a look at the `beatles` data frame in `practice.R`

---

# Getting data into R

<br>

## 1. Load external packages
## 2. Read in external files (usually a `.csv` file)

<br>

NOTE: csv = "comma-separated values"

---

## Data from an R package

```r
library(ggplot2)
```

See which data frames are available in a package:

```r
data(package = "ggplot2")
```

Find out more about a package data set:

```r
?msleep
```
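
Once the package is loaded, its data frames can be used like any other object. A quick sketch for previewing `msleep`:

```r
library(ggplot2)

# msleep is a data frame that ships with ggplot2
head(msleep)   # first six rows
dim(msleep)    # number of rows, then number of columns
names(msleep)  # column names
```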

---

## Back to `practice.R`

---

# Importing an external data file

<br>

Note the `data.csv` file in your `data` folder.

- **DO NOT** double-click it!
- **DO NOT** open it in Excel!

Excel can **corrupt** your data!


If you **must** open it in Excel:

- Make a copy 
- Open the copy


---

# Steps to importing external data files

## 1. Create a path to the data

```r
library(here)
*path_to_data <- here('data', 'data.csv')
path_to_data
```

```
#> [1] "/Users/jhelvy/gh/teaching/MADD/2023-Fall/class/2-data-wrangling/data/data.csv"
```

## 2. Import the data

```r
library(tidyverse)
*data <- read_csv(path_to_data)
```

---

## Using the **here** package to make file paths

With no arguments, `here()` returns the path to the **root** of your project, your _working directory_ <br>(this is where your `.Rproj` file lives!)

```r
here()
```

```
#> [1] "/Users/jhelvy/gh/teaching/MADD/2023-Fall/class/2-data-wrangling"
```

Given folder and file names, `here()` builds the path to files _inside_ your working directory

```r
path_to_data <- here('data', 'data.csv')
path_to_data
```

```
#> [1] "/Users/jhelvy/gh/teaching/MADD/2023-Fall/class/2-data-wrangling/data/data.csv"
```
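
It can help to confirm that the path points at a real file before reading it. A minimal sketch, assuming the same `data/data.csv` layout as above (the messages are only illustrative):

```r
library(here)

# Build the path from the project root, one folder or file name per argument
path_to_data <- here("data", "data.csv")

# Check that the file exists before trying to read it
if (file.exists(path_to_data)) {
    message("Found: ", path_to_data)
} else {
    message("No file at: ", path_to_data, " (check your project root with here())")
}
```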

---

# Avoid hard-coding file paths!

### (they can break on different computers, or when the working directory changes)

```r
path_to_data <- 'data/data.csv'
path_to_data
```

```
#> [1] "data/data.csv"
```

# 💩💩💩

---

# Back to reading in data

```r
path_to_data <- here('data', 'data.csv')
*data <- read_csv(path_to_data)
```

<br>

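**Important**: Use `read_csv()` (from **readr**, with an underscore) instead of base R's `read.csv()`; `read_csv()` returns a tibble and leaves column names unchanged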
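
One visible difference between the two functions is how they treat column names. A small sketch using made-up CSV text (not the contents of `data.csv`), assuming readr 2.0 or later so that `I()` can mark a string as literal data:

```r
library(readr)  # also loaded by library(tidyverse)

# A tiny made-up CSV with a space in one column name
csv_text <- "airport code,cost\nDCA,100\nIAD,250"

df_base  <- read.csv(text = csv_text)
df_readr <- read_csv(I(csv_text), show_col_types = FALSE)

names(df_base)   # base R rewrites the first name to "airport.code"
names(df_readr)  # readr keeps "airport code" as written
class(df_readr)  # includes "tbl_df", i.e. a tibble
```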

---

## Your turn

1) Use the `here()` and `read_csv()` functions to load the `data.csv` file that is in the `data` folder. Name the data frame object `data`.

2) Use the `data` object to answer the following questions:

- How many rows and columns are in the data frame?
- What type of data is each column? (Just look; you don't need to type out the answer.)
- Preview the different columns - what do you think this data is about? What might one row represent?
- How many unique airports are in the data frame?
- What are the earliest and latest observations in the data frame?
- What are the lowest and highest costs of any one repair in the data frame?

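
The functions below are one way to approach questions like these. This sketch uses a copy of the `beatles` data frame rather than `data.csv`, since the column names in that file are not shown here:

```r
library(dplyr)

beatles <- tibble(
    firstName   = c("John", "Paul", "Ringo", "George"),
    instrument  = c("guitar", "bass", "drums", "guitar"),
    yearOfBirth = c(1940, 1942, 1940, 1943)
)

dim(beatles)                    # number of rows, then number of columns
glimpse(beatles)                # column names, types, and first few values
n_distinct(beatles$instrument)  # number of unique values in a column
range(beatles$yearOfBirth)      # smallest and largest value in a column
```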

---

# Week 2: .fancy[Data Wrangling]

### 1. Working with data frames
### 2. .orange[Data wrangling with the _tidyverse_]

### BREAK

### 3. Project proposals

---

### The tidyverse: `stringr` + `dplyr` + `readr` +  `ggplot2` + ...

<center>
<img src="images/horst_monsters_tidyverse.jpeg" width="950">
</center>

Art by [Allison Horst](https://www.allisonhorst.com/)

---

# 80% of the job is data wrangling

---

## Today: data wrangling with **dplyr**

<center>
<img src="images/horst_monsters_data_wrangling.png" width="600">
</center>

Art by [Allison Horst](https://www.allisonhorst.com/)

---

# .center[The main `dplyr` "verbs"]

<br>

"Verb"        | What it does
--------------|--------------------
`select()`    | Select columns by name
`filter()`    | Keep rows that match criteria
`arrange()`   | Sort rows based on column(s)
`mutate()`    | Create new columns 
`summarize()` | Create summary values
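
A minimal sketch of each verb, applied to a three-column copy of the `beatles` data frame (the result names and the `ageIn1970` and `meanYear` columns are made up for illustration):

```r
library(dplyr)

beatles <- tibble(
    firstName   = c("John", "Paul", "Ringo", "George"),
    instrument  = c("guitar", "bass", "drums", "guitar"),
    yearOfBirth = c(1940, 1942, 1940, 1943)
)

# Each verb takes a data frame first and returns a new data frame
names_only <- select(beatles, firstName)                        # keep one column
guitarists <- filter(beatles, instrument == "guitar")           # keep matching rows
by_birth   <- arrange(beatles, yearOfBirth)                     # sort, earliest first
with_age   <- mutate(beatles, ageIn1970 = 1970 - yearOfBirth)   # add a column
avg_birth  <- summarize(beatles, meanYear = mean(yearOfBirth))  # one summary row
```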

---

# .center[Core `tidyverse` concept:<br>**Chain functions together with "pipes"**]

# .center[`%>%`]

## Think of the words "...and then..."

```r
data %>% 
  do_something() %>% 
  do_something_else()
```

---

# Think of `%>%` as the words "...and then..."

**Without Pipes** (read from inside-out):

```r
leave_house(get_dressed(get_out_of_bed(wake_up(me))))
```

**With Pipes**:

```r
me %>%
    wake_up() %>%
    get_out_of_bed() %>%
    get_dressed() %>%
    leave_house()
```
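
The functions above are imaginary, so here is a small runnable sketch of the same idea using real base R functions. Both versions compute the same value:

```r
library(magrittr)  # provides %>%; also loaded by library(tidyverse)

x <- c(1, 4, 9)

# Without pipes: read from the inside out
nested <- sum(sqrt(x))

# With pipes: take x, and then sqrt(), and then sum()
piped <- x %>%
    sqrt() %>%
    sum()

identical(nested, piped)
```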

---