class: middle, inverse .leftcol30[ <center> <img src="https://madd.seas.gwu.edu/images/logo.png" width=250> </center> ] .rightcol70[ # Week 2: .fancy[Data Wrangling] ###
EMSE 6035: Marketing Analytics for Design Decisions ###
John Paul Helveston ###
September 04, 2024 ] --- # Required Packages (check `practice.R` file) Make sure you have these libraries installed: ``` r install.packages(c("tidyverse", "here")) ``` **Remember: you only need to install packages once!** -- <br> Once installed, you'll need to _load_ the libraries every time you open RStudio: ``` r library(tidyverse) library(here) ``` --- class: inverse, middle # Week 2: .fancy[Data Wrangling] ### 1. Working with data frames ### 2. Data wrangling with the _tidyverse_ ### BREAK ### 3. Project proposals --- class: inverse, middle # Week 2: .fancy[Data Wrangling] ### 1. .orange[Working with data frames] ### 2. Data wrangling with the _tidyverse_ ### BREAK ### 3. Project proposals --- class: center # The data frame...in Excel <center> <img src="images/spreadsheet.png" width=700> </center> --- # The data frame...in R ``` r beatles <- tibble( firstName = c("John", "Paul", "Ringo", "George"), lastName = c("Lennon", "McCartney", "Starr", "Harrison"), instrument = c("guitar", "bass", "drums", "guitar"), yearOfBirth = c(1940, 1942, 1940, 1943), deceased = c(TRUE, FALSE, FALSE, TRUE) ) beatles ``` ``` #> # A tibble: 4 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 John Lennon guitar 1940 TRUE #> 2 Paul McCartney bass 1942 FALSE #> 3 Ringo Starr drums 1940 FALSE #> 4 George Harrison guitar 1943 TRUE ``` --- ## **Columns**: _Vectors_ of values (must be same data type) ``` r beatles ``` ``` #> # A tibble: 4 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 John Lennon guitar 1940 TRUE #> 2 Paul McCartney bass 1942 FALSE #> 3 Ringo Starr drums 1940 FALSE #> 4 George Harrison guitar 1943 TRUE ``` -- Extract a column using `$` ``` r beatles$firstName ``` ``` #> [1] "John" "Paul" "Ringo" "George" ``` --- ## **Rows**: Information about individual observations Information about _John Lennon_ is in the first row: ``` r beatles[1,] ``` ``` #> # A tibble: 1 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 John Lennon guitar 1940 TRUE ``` -- Information about _Paul McCartney_ is in the second row: ``` r beatles[2,] ``` ``` #> # A tibble: 1 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 Paul McCartney bass 1942 FALSE ``` --- class: middle, center, inverse ## Take a look at the `beatles` data frame in `practice.R` --- # Getting data into R <br> ## 1. Load external packages ## 2. Read in external files (usually a `.csv` file) <br> NOTE: csv = "comma-separated values" --- ## Data from an R package ``` r library(ggplot2) ``` -- See which data frames are available in a package: ``` r data(package = "ggplot2") ``` -- Find out more about a package data set: ``` r ?msleep ``` --- class: middle, center, inverse ## Back to `practice.R` --- # Importing an external data file <br> .leftcol60[ Note the `data.csv` file in your `data` folder. - **DO NOT** double-click it! - **DO NOT** open it in Excel! Excel can **corrupt** your data! ] -- .rightcol40[ If you **must** open it in Excel: - Make a copy - Open the copy ] --- # Steps to importing external data files -- ## 1. Create a path to the data ``` r library(here) *path_to_data <- here('data', 'data.csv') path_to_data ``` ``` #> [1] "/Users/jhelvy/gh/teaching/MADD/2024-Fall/class/2-data-wrangling/data/data.csv" ``` -- ## 2. Import the data ``` r library(tidyverse) *data <- read_csv(path_to_data) ``` --- ## Using the **here** package to make file paths The `here()` function builds the path to your **root** to your _working directory_ <br>(this is where your `.Rproj` file lives!) ``` r here() ``` ``` #> [1] "/Users/jhelvy/gh/teaching/MADD/2024-Fall/class/2-data-wrangling" ``` -- The `here()` function builds the path to files _inside_ your working directory ``` r path_to_data <- here('data', 'data.csv') path_to_data ``` ``` #> [1] "/Users/jhelvy/gh/teaching/MADD/2024-Fall/class/2-data-wrangling/data/data.csv" ``` --- # Avoid hard-coding file paths! ### (they can break on different computers) ``` r path_to_data <- 'data/data.csv' path_to_data ``` ``` #> [1] "data/data.csv" ``` # 💩💩💩 --- # Back to reading in data ``` r path_to_data <- here('data', 'data.csv') *data <- read_csv(path_to_data) ``` <br> **Important**: Use `read_csv()` instead of `read.csv()` --- class: inverse
−
+
10
:
00
## Your turn .font90[ 1) Use the `here()` and `read_csv()` functions to load the `data.csv` file that is in the `data` folder. Name the data frame object `data`. 2) Use the `data` object to answer the following questions: - How many rows and columns are in the data frame? - What type of data is each column? (Just look, don't need to type out the answer) - Preview the different columns - what do you think this data is about? What might one row represent? - How many unique airports are in the data frame? - What is the earliest and latest observation in the data frame? - What is the lowest and highest cost of any one repair in the data frame? ] --- class: inverse, middle # Week 2: .fancy[Data Wrangling] ### 1. Working with data frames ### 2. .orange[Data wrangling with the _tidyverse_] ### BREAK ### 3. Project proposals --- class: center ### The tidyverse: `stringr` + `dplyr` + `readr` + `ggplot2` + ... <center> <img src="images/horst_monsters_tidyverse.jpeg" width="950"> </center>Art by [Allison Horst](https://www.allisonhorst.com/) --- class: center # 80% of the job is data wrangling <center> <img src="images/data-wrangle.png" width="1000"> </center> --- class: center ## Today: data wrangling with **dplyr** <center> <img src="images/horst_monsters_data_wrangling.png" width="600"> </center>Art by [Allison Horst](https://www.allisonhorst.com/) --- # .center[The main `dplyr` "verbs"] <br> "Verb" | What it does --------------|-------------------- `select()` | Select columns by name `filter()` | Keep rows that match criteria `arrange()` | Sort rows based on column(s) `mutate()` | Create new columns `summarize()` | Create summary values --- # .center[Core `tidyverse` concept:<br>**Chain functions together with "pipes"**] # .center[`%>%`] -- ## Think of the words "...and then..." ``` r data %>% do_something() %>% do_something_else() ``` --- # Think of `%>%` as the words "...and then..." -- **Without Pipes** (read from inside-out): ``` r leave_house(get_dressed(get_out_of_bed(wake_up(me)))) ``` -- **With Pipes**: ``` r me %>% wake_up %>% get_out_of_bed %>% get_dressed %>% leave_house ``` --- class: center, middle, inverse # Select columns with `select()` <br> <center> <img src="images/rstudio-cheatsheet-select.png" width="900"> </center> --- # Select columns with `select()` ``` r beatles <- tibble( firstName = c("John", "Paul", "Ringo", "George"), lastName = c("Lennon", "McCartney", "Starr", "Harrison"), instrument = c("guitar", "bass", "drums", "guitar"), yearOfBirth = c(1940, 1942, 1940, 1943), deceased = c(TRUE, FALSE, FALSE, TRUE) ) beatles ``` ``` #> # A tibble: 4 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 John Lennon guitar 1940 TRUE #> 2 Paul McCartney bass 1942 FALSE #> 3 Ringo Starr drums 1940 FALSE #> 4 George Harrison guitar 1943 TRUE ``` --- # Select columns with `select()` Select the columns `firstName` & `lastName` ``` r beatles %>% select(firstName, lastName) ``` ``` #> # A tibble: 4 × 2 #> firstName lastName #> <chr> <chr> #> 1 John Lennon #> 2 Paul McCartney #> 3 Ringo Starr #> 4 George Harrison ``` --- # Select columns with `select()` Use the `-` sign to drop columns ``` r beatles %>% select(-firstName, -lastName) ``` ``` #> # A tibble: 4 × 3 #> instrument yearOfBirth deceased #> <chr> <dbl> <lgl> #> 1 guitar 1940 TRUE #> 2 bass 1942 FALSE #> 3 drums 1940 FALSE #> 4 guitar 1943 TRUE ``` --- # Select columns with `select()` Select columns based on name criteria: - `ends_with()` = Select columns that end with a character string - `contains()` = Select columns that contain a character string - `matches()` = Select columns that match a regular expression - `one_of()` = Select column names that are from a group of names --- # Select columns with `select()` Select the columns that end with `"Name"`: ``` r beatles %>% select(ends_with("Name")) ``` ``` #> # A tibble: 4 × 2 #> firstName lastName #> <chr> <chr> #> 1 John Lennon #> 2 Paul McCartney #> 3 Ringo Starr #> 4 George Harrison ``` --- class: center, middle, inverse # Keep specific rows with `filter()` <br> <center> <img src="images/rstudio-cheatsheet-filter.png" width="900"> </center> --- # Keep specific rows with `filter()` Keep only the rows with band members born after 1941 ``` r #> # A tibble: 4 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 John Lennon guitar 1940 TRUE *#> 2 Paul McCartney bass 1942 FALSE #> 3 Ringo Starr drums 1940 FALSE *#> 4 George Harrison guitar 1943 TRUE ``` --- # Keep specific rows with `filter()` Keep only the rows with band members born after 1941 ``` r beatles %>% filter(yearOfBirth > 1941) ``` ``` #> # A tibble: 2 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 Paul McCartney bass 1942 FALSE #> 2 George Harrison guitar 1943 TRUE ``` --- # Keep specific rows with `filter()` Keep only the rows with band members born after 1941 **& are still living** ``` r beatles %>% filter(yearOfBirth > 1941, deceased == FALSE) ``` ``` #> # A tibble: 1 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 Paul McCartney bass 1942 FALSE ``` -- ``` r beatles %>% filter((yearOfBirth > 1941) & (deceased == FALSE)) ``` ``` #> # A tibble: 1 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 Paul McCartney bass 1942 FALSE ``` --- # .center[Logic operators for `filter()`] <br> Description | Example ------------|------------ Values greater than 1 | `value > 1` Values greater than or equal to 1 | `value >= 1` Values less than 1 | `value < 1` Values less than or equal to 1 | `value <= 1` Values equal to 1 | `value == 1` Values not equal to 1 | `value != 1` Values in the set c(1, 4) | `value %in% c(1, 4)` --- # Removing missing values Drop all rows where `variable` is `NA` ``` r data %>% filter(!is.na(variable)) ``` --- # Combine `filter()` and `select()` Get the **first & last name** of members born after 1941 & are still living ``` r beatles %>% filter(yearOfBirth > 1941, deceased == FALSE) %>% select(firstName, lastName) ``` ``` #> # A tibble: 1 × 2 #> firstName lastName #> <chr> <chr> #> 1 Paul McCartney ``` --- class: inverse
−
+
10
:
00
## Your turn .font90[ 1) Use the `here()` and `read_csv()` functions to load the `data.csv` file that is in the `data` folder. Name the data frame object `data`. 2) Use the `data` object and the `select()` and `filter()` functions to answer the following questions: - Create a new data frame, `dc`, that contains only the rows from DC airports. - Create a new data frame, `dc_dawn`, that contains only the rows from DC airports that occurred at dawn. - Create a new data frame, `dc_dawn_birds`, that contains only the rows from DC airports that occurred at dawn and only the columns about the _species_ of bird. - How many unique species of birds have been involved in accidents at DC airports? ] --- class: center, middle, inverse ## Create new variables with `mutate()` <br> <center> <img src="images/rstudio-cheatsheet-mutate.png" width="900"> </center> --- class: center background-color: #fff <center> <img src="images/horst_monsters_mutate.jpg" width="700"> </center>Art by [Allison Horst](https://www.allisonhorst.com/) --- # Create new variables with `mutate()` Use the `yearOfBirth` variable to compute the age of each band member ``` r beatles %>% mutate(age = 2022 - yearOfBirth) ``` ``` #> # A tibble: 4 × 6 #> firstName lastName instrument yearOfBirth deceased age #> <chr> <chr> <chr> <dbl> <lgl> <dbl> #> 1 John Lennon guitar 1940 TRUE 82 #> 2 Paul McCartney bass 1942 FALSE 80 #> 3 Ringo Starr drums 1940 FALSE 82 #> 4 George Harrison guitar 1943 TRUE 79 ``` --- # You can _immediately_ use new variables ``` r beatles %>% mutate( age = 2022 - yearOfBirth, * meanAge = mean(age)) ``` ``` #> # A tibble: 4 × 7 #> firstName lastName instrument yearOfBirth deceased age meanAge #> <chr> <chr> <chr> <dbl> <lgl> <dbl> <dbl> #> 1 John Lennon guitar 1940 TRUE 82 80.8 #> 2 Paul McCartney bass 1942 FALSE 80 80.8 #> 3 Ringo Starr drums 1940 FALSE 82 80.8 #> 4 George Harrison guitar 1943 TRUE 79 80.8 ``` --- # .center[Handling if/else conditions] ### .center[`ifelse(<condition>, <if TRUE>, <else>)`] -- ``` r beatles %>% mutate(playsGuitar = ifelse(instrument == "guitar", TRUE, FALSE)) ``` ``` #> # A tibble: 4 × 6 #> firstName lastName instrument yearOfBirth deceased playsGuitar #> <chr> <chr> <chr> <dbl> <lgl> <lgl> #> 1 John Lennon guitar 1940 TRUE TRUE #> 2 Paul McCartney bass 1942 FALSE FALSE #> 3 Ringo Starr drums 1940 FALSE FALSE #> 4 George Harrison guitar 1943 TRUE TRUE ``` --- # Sort data frame with `arrange()` -- Sort `beatles` data frame by year of birth ``` r beatles %>% arrange(yearOfBirth) ``` ``` #> # A tibble: 4 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 John Lennon guitar 1940 TRUE #> 2 Ringo Starr drums 1940 FALSE #> 3 Paul McCartney bass 1942 FALSE #> 4 George Harrison guitar 1943 TRUE ``` --- # Sort data frame with `arrange()` Use the `desc()` function to sort in descending order ``` r beatles %>% * arrange(desc(yearOfBirth)) ``` ``` #> # A tibble: 4 × 5 #> firstName lastName instrument yearOfBirth deceased #> <chr> <chr> <chr> <dbl> <lgl> #> 1 George Harrison guitar 1943 TRUE #> 2 Paul McCartney bass 1942 FALSE #> 3 John Lennon guitar 1940 TRUE #> 4 Ringo Starr drums 1940 FALSE ``` --- # Sort rows with `arrange()` Compute the band member age, then sort based on the youngest: ``` r beatles %>% mutate(age = 2022 - yearOfBirth) %>% arrange(age) ``` ``` #> # A tibble: 4 × 6 #> firstName lastName instrument yearOfBirth deceased age #> <chr> <chr> <chr> <dbl> <lgl> <dbl> #> 1 George Harrison guitar 1943 TRUE 79 #> 2 Paul McCartney bass 1942 FALSE 80 #> 3 John Lennon guitar 1940 TRUE 82 #> 4 Ringo Starr drums 1940 FALSE 82 ``` --- class: inverse
−
+
10
:
00
## Your turn .font90[ 1) Use the `here()` and `read_csv()` functions to load the `data.csv` file that is in the `data` folder. Name the data frame object `data`. 2) Using the `data` object, create the following new variables: - `height_miles`: The `height` variable converted to miles (Hint: there are 5,280 feet in a mile). - `cost_mil`: Is `TRUE` if the repair costs was greater or equal to $1 million, `FALSE` otherwise. 3) Remove rows that have `NA` for `cost_repairs_infl_adj` and re-arrange the resulting data frame based on the highest height and most expensive cost ] --- class: inverse, center # .fancy[Break]
−
+
05
:
00
--- class: inverse, middle # Week 2: .fancy[Data Wrangling] ### 1. Working with data frames ### 2. Data wrangling with the _tidyverse_ ### BREAK ### 3. .orange[Project proposals] --- class: center, middle # [Project Proposal Guidelines](https://madd.seas.gwu.edu/2024-Fall/project/1-proposal.html) --- # Proposal Items Item | Description ---- | ------------------------------------ **Abstract** | Product / technology in just a few sentences **Introduction** | Description, picture, background **Market Opportunity** | Identify your customer, competitors, and market size **Product Attributes** | 2-4 key variables related to product's design and performance **Research Questions** | 2-4 research questions you hope to answer about your product **Questions** | Major outstanding questions to be resolved --- # Today .cols3[ ### Market Opportunity - Identify customer - Identify competitors - Identify market size ] -- .cols3[ ### Product Attributes Features your _customer_ cares about ] -- .cols3[ ### Research Questions Decisions you are trying to inform ] --- class: middle background-color: #fff # .center[Example: **Folding solar panels**] .leftcol60[ <center> <img src="images/solar.jpg" width=800> </center> ] .rightcol40[ ### Who is your customer? - General public? - Outdoor enthusiasts? - Emergency gear? ### Competitors? - Similar folding panels - Batteries? ] --- class: middle background-color: #fff # .center[Example: **Electric vehicle battery**] .leftcol60[ <center> <img src="images/ev.png" width=100%> </center> ] .rightcol40[ ### Who is your customer? - Car buyers ### Competitors? - Hybrid vehicles? - Efficient gasoline vehicles? ] --- class: center, middle .leftcol[ ## Product Attributes #### Features your _customer_ cares about ] .rightcol[ ## Research Questions #### Decisions you are trying to inform ] --- background-image: url(images/solar1.png) background-size: contain --- class: center # Product Attributes Table ([example](https://docs.google.com/spreadsheets/d/1Hmxfav_l1bubnaPkIiiMW0tZrFA-xblP9_ndN_6TB1I/edit?usp=sharing)) <center> <img src="images/solar2.png" width=1000> </center> --- class: inverse
−
+
15
:
00
## Team Proposals 1. Sit with your team 2. Discuss & identify your customer & potential competitors 3. Discuss & identify key _Product Attributes_ & _Research Questions_ 4. Start building out your model relationships table (copy from [this example](https://docs.google.com/spreadsheets/d/1Hmxfav_l1bubnaPkIiiMW0tZrFA-xblP9_ndN_6TB1I/edit?usp=sharing)) ### Suggestions - You may want to start with simple bullet lists - Start with more items rather than fewer (can always cut back later)