class: middle, inverse

.leftcol30[
<center>
<img src="https://github.com/emse-madd-gwu/emse-madd-gwu.github.io/raw/master/images/madd_hex_sticker.png" width=250>
</center>
]

.rightcol70[

# Week 9: .fancy[Uncertainty]

### EMSE 6035: Marketing Analytics for Design Decisions

### John Paul Helveston

### October 27, 2021

]

---

# Pilot Analysis Report

### Assignment is now [posted](https://madd.seas.gwu.edu/2021-Fall/p4-pilot-analysis.html)

### Due 11/07 (that's 11 days from now)

---

# Quick correction from last week

**Observations** - Height of students (inches):

```
#> [1] 65 69 66 67 68 72 68 69 63 70
```

a) Let's say we know that the height of students, `\(\tilde{x}\)`, in a classroom follows a normal distribution. A professor obtains the above height measurements of students in her classroom. What is the log-likelihood that `\(\tilde{x} \sim \mathcal{N} (68, 4)\)`? In other words, compute `\(\ln \mathcal{L} (\mu = 68, \sigma = 4)\)`.

--

b) Compute the log-likelihood function using the same standard deviation `\((\sigma = 4)\)` but with the following different values for the mean, `\(\mu: 66, 67, 68, 69, 70\)`. How do the results compare? Which value for `\(\mu\)` produces the highest log-likelihood?

---

.leftcol[

## Computing the<br>_likelihood_

Load the data

```r
x <- c(65, 69, 66, 67, 68, 72, 68, 69, 63, 70)
```

Compute the value of f(x) for each x

```r
f_x <- dnorm(x, 68, 4)
```

The likelihood is the product of the values in `f_x`

```r
prod(f_x)
```

```
#> [1] 1.447528e-11
```

]

--

.rightcol[

## Computing the<br>_log-likelihood_

Take the log of the likelihood

```r
log(prod(f_x))
```

```
#> [1] -24.95858
```

The way we typically compute the log-likelihood is by summing the log of the values in `f_x`

```r
sum(log(f_x))
```

```
#> [1] -24.95858
```

]

---

.leftcol[

```r
library(tidyverse)

# Create a vector of values for the mean
means <- c(66, 67, 68, 69, 70)

# Compute the log-likelihood using different values for the mean:
L1 <- sum(log(dnorm(x, means[1], 4)))
L2 <- sum(log(dnorm(x, means[2], 4)))
L3 <- sum(log(dnorm(x, means[3], 4)))
L4 <- sum(log(dnorm(x, means[4], 4)))
L5 <- sum(log(dnorm(x, means[5], 4)))
logLiks <- c(L1, L2, L3, L4, L5)

# Plot the result:
df <- data.frame(means, logLiks)
df %>%
  ggplot(aes(x = means, y = logLiks)) +
  geom_line() +
  geom_point() +
  theme_bw() +
  labs(
    x = "Mean Value",
    y = "Log-likelihood Values"
  )
```

]

.rightcol[

<img src="figs/unnamed-chunk-7-1.png" width="432" />

]

---

class: inverse, middle

# Week 9: .fancy[Uncertainty]

### 1. Computing uncertainty
### 2. Reshaping data
### BREAK
### 3. Cleaning pilot data
### 4. Estimating pilot data models

---

class: inverse, middle

# Week 9: .fancy[Uncertainty]

### 1. .orange[Computing uncertainty]
### 2. Reshaping data
### BREAK
### 3. Cleaning pilot data
### 4. Estimating pilot data models

---

class: center

# Maximum likelihood estimation

<center>
<img src="images/mle1.png" width=100%>
</center>

---

<center>
<img src="images/mle2.png" width=90%>
</center>

---

class: middle, center

### The _curvature_ of the log-likelihood function is related to the hessian

<center>
<img src="images/covariance.png" width=500>
</center>

---

class: middle, center

### The _curvature_ of the log-likelihood function is related to the hessian

<center>
<img src="images/covariance2.png" width=900>
</center>

---

class: middle, center

### We usually report parameter uncertainty ("standard errors") with `\(\sigma\)` values

<center>
<img src="images/uncertainty.png" width=1100>
</center>

---

class: inverse

# Practice Question 1

Suppose we estimate a model and get the following results:

$$
\hat{\beta} = `\begin{bmatrix} -0.4 \\ 0.5 \end{bmatrix}`
$$

$$
\nabla_{\beta}^2 \ln(\mathcal{L}) = `\begin{bmatrix} -6000 & 60 \\ 60 & -700 \end{bmatrix}`
$$

a) Use the hessian to compute the standard errors for `\(\hat{\beta}\)`

b) Use the standard errors to compute a 95% confidence interval around `\(\hat{\beta}\)`

---

# .center[Simulating uncertainty]

We can use the coefficients and hessian from a model to obtain draws that reflect parameter uncertainty

.leftcol[

```r
beta <- c(-0.7, 0.1, -4.0)
hessian <- matrix(c(
    -6000,   50,   60,
       50, -700,   50,
       60,   50, -300),
    ncol = 3, byrow = TRUE)
```

]

.rightcol[

```r
covariance <- -1*solve(hessian)
draws <- MASS::mvrnorm(10^5, beta, covariance)
head(draws)
```

```
#>            [,1]       [,2]      [,3]
#> [1,] -0.7219066 0.12844865 -3.979129
#> [2,] -0.7056323 0.05400818 -3.990519
#> [3,] -0.6969429 0.07709515 -4.027605
#> [4,] -0.7003554 0.07658722 -4.063454
#> [5,] -0.7061336 0.09679411 -4.110877
#> [6,] -0.6947397 0.15654305 -3.869518
```

]

---

# .center[Simulating uncertainty]

We can use the coefficients and hessian from a model to obtain draws that reflect parameter uncertainty

.cols3[

```r
hist(draws[, 1])
```

<img src="figs/unnamed-chunk-12-1.png" width="522.144" />

]

.cols3[

```r
hist(draws[, 2])
```

<img src="figs/unnamed-chunk-13-1.png" width="522.144" />

]

.cols3[

```r
hist(draws[, 3])
```

<img src="figs/unnamed-chunk-14-1.png" width="522.144" />

]

---

class: inverse

# Practice Question 2

.leftcol[

Suppose we estimate the following utility model describing preferences for cars:

$$
u_j = \alpha p_j + \beta_1 x_j^{mpg} + \beta_2 x_j^{elec} + \varepsilon_j
$$

a) Generate 10,000 draws of the model coefficients using the estimated coefficients and hessian. Use the `mvrnorm()` function from the `MASS` library.

b) Use the draws to compute the mean and 95% confidence intervals of each parameter estimate.

]

.rightcol[

The estimated model produces the following results:

Parameter | Coefficient
----------|------------
`\(\alpha\)` | -0.7
`\(\beta_1\)` | 0.1
`\(\beta_2\)` | -0.4

Hessian:

$$
`\begin{bmatrix} -6000 & 50 & 60 \\ 50 & -700 & 50 \\ 60 & 50 & -300 \end{bmatrix}`
$$

]

---

class: center

## Download the [logitr-cars](https://github.com/emse-madd-gwu/logitr-cars) repo from GitHub

<center>
<img src="images/logitr-cars.png" width=900>
</center>

---

# .center[Computing and visualizing uncertainty]

<br>

.rightcol80[

## 1. Open `logitr-cars`

## 2. Open `code/5.1-uncertainty.R`

]

---

class: inverse, middle

# Week 9: .fancy[Uncertainty]

### 1. Computing uncertainty
### 2. .orange[Reshaping data]
### BREAK
### 3. Cleaning pilot data
### 4. Estimating pilot data models

---

# .center[Names, Values, and Observations]

- Variable **Name**: The name of something you can measure
- Variable **Value**: One instance of a measured variable
- **Observation**: A set of associated measurements across multiple variables

--

.code100[

```r
head(fed_spend_long)
```

```
#> # A tibble: 6 × 3
#>   department  year rd_budget_mil
#>   <chr>      <dbl>         <dbl>
#> 1 DOD         1976         35696
#> 2 NASA        1976         12513
#> 3 DOE         1976         10882
#> 4 HHS         1976          9226
#> 5 NIH         1976          8025
#> 6 NSF         1976          2372
```

]

---

# "Long" format data

- Each **variable** has its own **column**
- Each **observation** has its own **row**

<center>
<img src="images/tidy-data.png" width = "1000">
</center>

---

.leftcol[

# "Long" format data

- Each **variable** has its own **column**
- Each **observation** has its own **row**

]

.rightcol[

```
#> # A tibble: 6 × 3
#>   department  year rd_budget_mil
#>   <chr>      <dbl>         <dbl>
#> 1 DOD         1976         35696
#> 2 NASA        1976         12513
#> 3 DOE         1976         10882
#> 4 HHS         1976          9226
#> 5 NIH         1976          8025
#> 6 NSF         1976          2372
```

]

<center>
<img src="images/tidy-data.png" width = "1000">
</center>

---

.leftcol40[.code70[

# "Long" format

```
#> # A tibble: 6 × 3
#>   department  year rd_budget_mil
#>   <chr>      <dbl>         <dbl>
#> 1 DOD         1976         35696
#> 2 NASA        1976         12513
#> 3 DOE         1976         10882
#> 4 HHS         1976          9226
#> 5 NIH         1976          8025
#> 6 NSF         1976          2372
```

]]

.rightcol60[.code70[

# "Wide" format

```
#> # A tibble: 6 × 8
#>    year   DHS   DOC   DOD   DOE   DOT   EPA   HHS
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  1976     0   819 35696 10882  1142   968  9226
#> 2  1977     0   837 37967 13741  1095   966  9507
#> 3  1978     0   871 37022 15663  1156  1175 10533
#> 4  1979     0   952 37174 15612  1004  1102 10127
#> 5  1980     0   945 37005 15226  1048   903 10045
#> 6  1981     0   829 41737 14798   978   901  9644
```

]]

---

.center[.font130[**"Long" format: variable names describe the values below them**]]

.leftcol40[.code70[

## "Long" format

```
#> # A tibble: 6 × 3
#>   department  year rd_budget_mil
#>   <chr>      <dbl>         <dbl>
#> 1 DOD         1976         35696
#> 2 NASA        1976         12513
#> 3 DOE         1976         10882
#> 4 HHS         1976          9226
#> 5 NIH         1976          8025
#> 6 NSF         1976          2372
```

]]

.rightcol60[.code70[

## "Wide" format

```
#> # A tibble: 6 × 8
#>    year   DHS   DOC   DOD   DOE   DOT   EPA   HHS
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  1976     0   819 35696 10882  1142   968  9226
#> 2  1977     0   837 37967 13741  1095   966  9507
#> 3  1978     0   871 37022 15663  1156  1175 10533
#> 4  1979     0   952 37174 15612  1004  1102 10127
#> 5  1980     0   945 37005 15226  1048   903 10045
#> 6  1981     0   829 41737 14798   978   901  9644
```

]]

---

# **Quick practice 1**: "long" or "wide" format?

**Description**: Tuberculosis cases in various countries

.code100[

```
#> # A tibble: 6 × 4
#>   country      year  cases population
#>   <chr>       <dbl>  <dbl>      <dbl>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
```

]

---

# **Quick practice 2**: "long" or "wide" format?

**Description**: Word counts by character type in "Lord of the Rings" trilogy

.code90[

```
#> # A tibble: 9 × 4
#>   Film                       Race   Female  Male
#>   <chr>                      <chr>   <dbl> <dbl>
#> 1 The Fellowship Of The Ring Elf      1229   971
#> 2 The Fellowship Of The Ring Hobbit     14  3644
#> 3 The Fellowship Of The Ring Man         0  1995
#> 4 The Return Of The King     Elf       183   510
#> 5 The Return Of The King     Hobbit      2  2673
#> 6 The Return Of The King     Man       268  2459
#> 7 The Two Towers             Elf       331   513
#> 8 The Two Towers             Hobbit      0  2463
#> 9 The Two Towers             Man       401  3589
```

]

---

# **Quick practice 3**: "long" or "wide" format?
**Description**: Word counts by character type in "Lord of the Rings" trilogy

.code60[

```
#> # A tibble: 18 × 4
#>    Film                       Race   Gender Word_Count
#>    <chr>                      <chr>  <chr>       <dbl>
#>  1 The Fellowship Of The Ring Elf    Female       1229
#>  2 The Fellowship Of The Ring Elf    Male          971
#>  3 The Fellowship Of The Ring Hobbit Female         14
#>  4 The Fellowship Of The Ring Hobbit Male         3644
#>  5 The Fellowship Of The Ring Man    Female          0
#>  6 The Fellowship Of The Ring Man    Male         1995
#>  7 The Return Of The King     Elf    Female        183
#>  8 The Return Of The King     Elf    Male          510
#>  9 The Return Of The King     Hobbit Female          2
#> 10 The Return Of The King     Hobbit Male         2673
#> 11 The Return Of The King     Man    Female        268
#> 12 The Return Of The King     Man    Male         2459
#> 13 The Two Towers             Elf    Female        331
#> 14 The Two Towers             Elf    Male          513
#> 15 The Two Towers             Hobbit Female          0
#> 16 The Two Towers             Hobbit Male         2463
#> 17 The Two Towers             Man    Female        401
#> 18 The Two Towers             Man    Male         3589
```

]

---

class: inverse, center, middle

# Reshaping data with
## `pivot_longer()` and `pivot_wider()`

---

## .center[From "long" to "wide" with `pivot_wider()`]

<center>
<img src="images/tidy-spread.png" width=600>
</center>

---

## .center[From "long" to "wide" with `pivot_wider()`]

.leftcol45[

```r
head(fed_spend_long)
```

```
#> # A tibble: 6 × 3
#>   department  year rd_budget_mil
#>   <chr>      <dbl>         <dbl>
#> 1 DOD         1976         35696
#> 2 NASA        1976         12513
#> 3 DOE         1976         10882
#> 4 HHS         1976          9226
#> 5 NIH         1976          8025
#> 6 NSF         1976          2372
```

]

.rightcol55[

```r
fed_spend_wide <- fed_spend_long %>%
*  pivot_wider(
*    names_from = department,
*    values_from = rd_budget_mil)

head(fed_spend_wide)
```

```
#> # A tibble: 6 × 7
#>    year   DHS   DOC   DOD   DOE   DOT   EPA
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  1976     0   819 35696 10882  1142   968
#> 2  1977     0   837 37967 13741  1095   966
#> 3  1978     0   871 37022 15663  1156  1175
#> 4  1979     0   952 37174 15612  1004  1102
#> 5  1980     0   945 37005 15226  1048   903
#> 6  1981     0   829 41737 14798   978   901
```

]

---

## .center[From "wide" to "long" with `pivot_longer()`]

<center>
<img src="images/tidy-gather.png" width=600>
</center>

---

## .center[From "wide" to "long" with `pivot_longer()`]

.leftcol45[

```r
names(fed_spend_wide)
```

```
#> [1] "year" "DHS" "DOC" "DOD" "DOE" "DOT" "EPA" "HHS" "Interior" "NASA" "NIH" "NSF" "Other" "USDA" "VA"
```

]

.rightcol55[

```r
fed_spend_long <- fed_spend_wide %>%
*  pivot_longer(
*    cols = DHS:VA,
*    names_to = "department",
*    values_to = "rd_budget_mil")

head(fed_spend_long)
```

```
#> # A tibble: 6 × 3
#>    year department rd_budget_mil
#>   <dbl> <chr>              <dbl>
#> 1  1976 DHS                    0
#> 2  1976 DOC                  819
#> 3  1976 DOD                35696
#> 4  1976 DOE                10882
#> 5  1976 DOT                 1142
#> 6  1976 EPA                  968
```

]

---

## Can also set `cols` by selecting which columns _not_ to use

.leftcol45[

```r
names(fed_spend_wide)
```

```
#> [1] "year" "DHS" "DOC" "DOD" "DOE" "DOT" "EPA" "HHS" "Interior" "NASA" "NIH" "NSF" "Other" "USDA" "VA"
```

]

.rightcol55[

```r
fed_spend_long <- fed_spend_wide %>%
  pivot_longer(
*    cols = -year,
    names_to = "department",
    values_to = "rd_budget_mil")

head(fed_spend_long)
```

```
#> # A tibble: 6 × 3
#>    year department rd_budget_mil
#>   <dbl> <chr>              <dbl>
#> 1  1976 DHS                    0
#> 2  1976 DOC                  819
#> 3  1976 DOD                35696
#> 4  1976 DOE                10882
#> 5  1976 DOT                 1142
#> 6  1976 EPA                  968
```

]

---

class: inverse
20:00
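
Before the exercise, here's a minimal round-trip sketch of the two verbs on a toy data frame (the values are invented for illustration, not taken from `pv_cells.csv` or `milk_production.csv`):

```r
library(tidyverse)

# Toy long-format data (hypothetical values, for illustration only)
long <- tibble(
  department    = c("DOD", "NASA", "DOD", "NASA"),
  year          = c(1976, 1976, 1977, 1977),
  rd_budget_mil = c(35696, 12513, 37967, 12890)
)

# Long -> wide: one column per department
wide <- long %>%
  pivot_wider(names_from = department, values_from = rd_budget_mil)

# Wide -> long: back to one row per (year, department) observation
long2 <- wide %>%
  pivot_longer(cols = -year, names_to = "department",
               values_to = "rd_budget_mil")
```

The round trip recovers the same observations (column order aside), which is a quick way to check your `names_to` / `values_to` arguments.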
# Your turn: Long <--> Wide

Open the `practice.Rmd` file. Under "In Class Question 1", write code to read in the following two files:

- `pv_cells.csv`: Data on solar photovoltaic cell production by country
- `milk_production.csv`: Data on milk production by state

Now modify the format of each:

- If the data are in "wide" format, convert them to "long" with `pivot_longer()`
- If the data are in "long" format, convert them to "wide" with `pivot_wider()`

---

class: inverse, center

# .fancy[Break]
05:00
---

class: inverse, middle

# Week 9: .fancy[Uncertainty]

### 1. Computing uncertainty
### 2. Reshaping data
### BREAK
### 3. .orange[Cleaning pilot data]
### 4. Estimating pilot data models

---

class: center

## Download the [formr4conjoint](https://github.com/jhelvy/formr4conjoint) repo from GitHub

<center>
<img src="images/formr4conjoint.png" width=900>
</center>

---

# .center[Cleaning formr survey data]

<br>

.rightcol80[

## 1. Open `formr4conjoint.Rproj`

## 2. Open `code/data_cleaning.R`

]

---

class: inverse
20:00
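
One step you will likely need, sketched here on made-up data (the `respID` and `cbc_q` column names are hypothetical; your formr export will differ): pivoting the one-row-per-respondent survey responses into one row per choice question.

```r
library(tidyverse)

# Hypothetical wide survey export: one row per respondent,
# one column per conjoint question (names invented for this sketch)
survey <- tibble(
  respID = c(1, 2),
  cbc_q1 = c(2, 1),  # alternative chosen in question 1
  cbc_q2 = c(3, 3),
  cbc_q3 = c(1, 2)
)

# One row per respondent-question pair, holding the chosen alternative
choices <- survey %>%
  pivot_longer(
    cols      = starts_with("cbc_q"),
    names_to  = "qID",
    values_to = "choice") %>%
  mutate(qID = parse_number(qID))
```

From here you would join these choices onto your design file to flag the chosen alternative in each choice set.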
# Your Turn

## As a team, pick up where you left off last week and create a `choiceData` data frame in a "long" format

---

class: inverse, middle

# Week 9: .fancy[Uncertainty]

### 1. Computing uncertainty
### 2. Reshaping data
### BREAK
### 3. Cleaning pilot data
### 4. .orange[Estimating pilot data models]

---

# .center[Estimating pilot data models]

<br>

.rightcol80[

## 1. Open `formr4conjoint.Rproj`

## 2. Open `code/modeling.R`

]

---

class: inverse

# Your Turn

## As a team:

1. Use your `choiceData` data frame to estimate preliminary choice models.
2. Interpret your model coefficients with uncertainty.
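
---

# .center[Interpreting coefficients with uncertainty]

A minimal sketch of the "interpret with uncertainty" step, using the approach from the "Simulating uncertainty" slides. The `beta` and `hessian` values here are the illustrative ones from those slides, not estimates from your model:

```r
# Illustrative coefficients and hessian (swap in your model's)
beta <- c(-0.7, 0.1, -4.0)
hessian <- matrix(c(
  -6000,   50,   60,
     50, -700,   50,
     60,   50, -300),
  ncol = 3, byrow = TRUE)

# Covariance matrix is the negative inverse of the hessian;
# standard errors are the square roots of its diagonal
covariance <- -1 * solve(hessian)
se <- sqrt(diag(covariance))

# Draws reflecting parameter uncertainty
draws <- MASS::mvrnorm(10^4, beta, covariance)

# Mean and 95% confidence interval for each coefficient
ci <- apply(draws, 2, quantile, probs = c(0.025, 0.5, 0.975))
```

A coefficient whose 95% interval excludes zero is distinguishable from zero at that confidence level.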