Your Favorite Movies: Analysis 2

An Example for 431 Project A

Author

Thomas E. Love, Ph.D.

Published

2023-10-10

1 R Packages

library(Hmisc)
library(janitor)
library(naniar)

library(xfun) ## or, if you prefer, use library(sessioninfo)

library(googlesheets4)
library(gt)
library(gtExtras)
library(mosaic)

library(broom)
library(patchwork)

library(tidyverse)

url_boost <- "https://raw.githubusercontent.com/THOMASELOVE/431-data/main/data-and-code/Love-boost.R"

source(url_boost) ## to get Love-boost.R functions like bootdif()

theme_set(theme_bw())
knitr::opts_chunk$set(comment = NA)

2 Data Ingest

gs4_deauth()

movies2 <- 
  read_sheet("https://docs.google.com/spreadsheets/d/1qJnQWSjXyOXFZOO8VZixpgZWbraUW66SfP5hE5bKW4k") |>
  select(film_id, film, imdb_stars, dr_love) |>
  mutate(film_id = as.character(film_id),
         dr_love = factor(dr_love))
✔ Reading from "movies_2023-09-14".
✔ Range 'Initial Data'.
dim(movies2)
[1] 201   4

3 CWRU Colors

I decided to show off here and make use of some of the 2023 official CWRU colors.

  • Note that once I’ve named these colors in this way, based on their hexadecimal representations, I do not include quotes around them in ggplot-based plots (see the small example after the color definitions below).
## CWRU colors

cwru_blue <- '#003071'
cwru_darkblue <- '#09143A'
cwru_trueblue <- '#007AB8'
cwru_lightblue <- '#A6D2E6'
cwru_darkgray <- '#999999'
cwru_lightgray <- '#D3D3D3'
cwru_terracottaorange <- '#D63D1F'
cwru_fallyellow <- '#E69E40'
cwru_bluegreen <- '#377E72'
cwru_violetpurple <- '#692C95'
cwru_vividgreen <- '#61A530'
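
As a quick (and purely hypothetical) illustration of that convention, a histogram of imdb_stars using two of these named colors might look like this, with the color objects passed to fill and col without quotes:

ggplot(movies2, aes(x = imdb_stars)) +
  geom_histogram(bins = 20, fill = cwru_lightblue, col = cwru_blue)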

4 An Important Message

I wrote this document to help you get a feel for what we are looking for in Analysis 2 for Project A, and to make the scope of that work a bit clearer.

Use your own words, not mine, in preparing your analytic work for Project A. Thanks.

5 Research Question

For movies in our sample of “favorite movies”, do the movies which Dr. Love has seen have meaningfully different IMDB Stars ratings on average as compared to the movies he has not seen?

6 Analysis

6.1 Variables

Here our key variables include a quantitative outcome and a two-level predictor. We are interested in understanding something about the difference in imdb_stars between movies Dr. Love has and has not seen.

Our outcome is imdb_stars, a proprietary rating available for each film on IMDB, measured on a scale from 1 (lowest) to 10 (highest).

Our predictor is dr_love, which is an indicator of whether Dr. Love has seen the movie. The available levels of the factor are Yes and No.

Neither of these variables displays any missing values within our sample, so we have a complete set of 201 movies.
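
As a quick check (a sketch using the janitor and naniar packages loaded above; output not shown), we could count the films in each dr_love group and confirm that no variable has missing values:

movies2 |> tabyl(dr_love)      ## counts by predictor group
miss_var_summary(movies2)      ## missing value count for each variable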

What is our design?

This Analysis 2 is about comparing two means with independent samples. You should be able to explain, in a clear English sentence or two, why your data were gathered using an independent-samples design. I’ll leave that to you, but it’s very important.

What to do about missing data

In Project A, if I had missing values, I would begin Analysis 2 by filtering to complete cases, and then state that I am assuming the appropriate missing-data mechanism in that instance. I would not impute for Analysis 2.

If you don’t have missing values in the variables used in this analysis, there is no reason to specify a missing data mechanism or do any filtering or imputation.
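
For reference, if either variable did have missing values, a complete-case filter might look like the sketch below. This is purely hypothetical here (movies2 is already complete), and the name movies2_cc is just an illustration.

## Hypothetical sketch: restrict to complete cases on the analysis variables
movies2_cc <- movies2 |>
  drop_na(imdb_stars, dr_love)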

6.2 Summaries

6.2.1 Distribution of the Outcome within Predictor Groups

Here is a box and violin plot of the data, which is probably the most natural starting plot. We see that each of the distributions is somewhat left-skewed, with the mean falling below the median in each group.

ggplot(movies2, aes(x = imdb_stars, y = dr_love)) +
  geom_violin(fill = cwru_lightgray) +
  geom_boxplot(width = 0.3, fill = cwru_lightblue, notch = TRUE,
               outlier.color = cwru_blue, outlier.size = 3) +
  stat_summary(fun = "mean", geom = "point", shape = 23, size = 3, 
               fill = cwru_blue) +
  labs(x = "IMDB Stars", y = "Has Dr. Love seen movie?",
       title = "Favorite Movies Sample for Fall 2023")

The movies Dr. Love has seen appear to have a somewhat higher median and mean IMDB Stars rating than the movies he has not seen, within our sample.

As we saw in Analysis 1, several of these movies are identified here as low outliers in terms of their IMDB Stars. Here’s the set of films with the five lowest IMDB star ratings in our sample.

movies2 |> arrange(imdb_stars) |> head(5)
# A tibble: 5 × 4
  film_id film                  imdb_stars dr_love
  <chr>   <chr>                      <dbl> <fct>  
1 61      The Gingerdead Man           3.4 No     
2 163     The Room                     3.6 Yes    
3 115     Madea Goes To Jail           4.5 No     
4 78      High School Musical 2        5.2 No     
5 83      House Party 2                5.3 No     

We could also draw a pair of Normal Q-Q plots to describe the two groups (movies Dr. Love has seen and those he hasn’t), if that were helpful in some substantial way. I don’t know that it gives me any information I didn’t already have from the previous plot in this case. I still think that the data are substantially left-skewed in each group.

p1 <- ggplot(
  data = movies2 |> filter(dr_love == "Yes"), aes(sample = imdb_stars)) +
  geom_qq() + geom_qq_line(col = cwru_trueblue) +
  theme(aspect.ratio = 1) + 
  labs(title = "Movies Dr. Love has seen", y = "IMDB Stars",
       x = "Expectation for Standard Normal")

p2 <- ggplot(
  data = movies2 |> filter(dr_love == "No"), aes(sample = imdb_stars)) +
  geom_qq() + geom_qq_line(col = cwru_trueblue) +
  theme(aspect.ratio = 1) + 
  labs(title = "Movies Dr. Love hasn't seen", y = "IMDB Stars", 
       x = "Expectation for Standard Normal")

p1 + p2  

Should I consider transforming the outcome in Analysis 2?

Not in Project A, no.

6.2.2 Numerical Summaries

key2 <- 
  favstats(imdb_stars ~ dr_love, data = movies2)

key2 |> gt() |> gt_theme_dark()
dr_love  min   Q1  median   Q3  max      mean         sd    n  missing
No       3.4  7.0     7.6  8.0  8.7  7.420161  0.8360748  124        0
Yes      3.6  7.4     7.9  8.4  9.3  7.792208  0.9050685   77        0

Again, we see that the “Yes” group has a higher mean and median imdb_stars value than the “No” group in our sample.

6.3 Approach

6.3.1 90% CI for difference in group means via t-based procedure

Given the clear skew in our data, we’ll focus on the bootstrap results that follow, but for now we can also construct a 90% CI for the difference in mean imdb_stars between the two dr_love groups, either assuming equal variances (in the population) or not.

Since we have an unbalanced design (as we see from the unequal sample sizes below), we must make a decision here as to whether assuming equal variances is reasonable. Here, the sample variances are:

movies2 |> group_by(dr_love) |>
  summarise(n = n(), var(imdb_stars))
# A tibble: 2 × 3
  dr_love     n `var(imdb_stars)`
  <fct>   <int>             <dbl>
1 No        124             0.699
2 Yes        77             0.819

The variance ratio is about 1.17, with the larger sample variance found in the smaller group (Yes, with 77 movies). I’d worry about this more if the variance were at least 50% larger in one group than the other, but it’s not hard to run this analysis either making the assumption of equal population variances or not making that assumption.
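
If it helps, here is a small sketch calculating that variance ratio directly from the sample:

## ratio of the larger to the smaller sample variance across dr_love groups
movies2 |>
  group_by(dr_love) |>
  summarise(variance = var(imdb_stars)) |>
  summarise(variance_ratio = max(variance) / min(variance))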

Here’s a Welch t test approach, which does not assume equal variances.

t.test(imdb_stars ~ dr_love, data = movies2, 
       conf.int = TRUE, conf.level = 0.90) |>
  tidy() |> 
  gt() |> fmt_number(columns = where(is.numeric), decimals = 3) |> 
  gt_theme_espn()
estimate  estimate1  estimate2  statistic  p.value  parameter  conf.low  conf.high  method                   alternative
  −0.372      7.420      7.792     −2.916    0.004    151.586    −0.583     −0.161  Welch Two Sample t-test  two.sided

Here’s a two-sample t test approach which does assume equal variances.

t.test(imdb_stars ~ dr_love, data = movies2, var.equal = TRUE,
       conf.int = TRUE, conf.level = 0.90) |>
  tidy() |> 
  gt() |> fmt_number(columns = where(is.numeric), decimals = 3) |> 
  gt_theme_excel()
estimate  estimate1  estimate2  statistic  p.value  parameter  conf.low  conf.high  method             alternative
  −0.372      7.420      7.792     −2.971    0.003    199.000    −0.579     −0.165  Two Sample t-test  two.sided

An equivalent way to get the “equal variances” 90% confidence interval result follows. Note that the dr_loveYes coefficient estimates the Yes − No difference in means, so its sign is flipped relative to the t.test output above, which estimates the No − Yes difference.

m3 <- lm(imdb_stars ~ dr_love, data = movies2)

tidy(m3, conf.int = TRUE, conf.level = 0.90) |> 
  gt() |> fmt_number(columns = where(is.numeric), decimals = 3) |> 
  gt_theme_excel()
term         estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)     7.420      0.078     95.736    0.000     7.292      7.548
dr_loveYes      0.372      0.125      2.971    0.003     0.165      0.579

6.3.2 90% CI for difference in group means via bootstrap

In our setting, given the clear skew in the imdb_stars values within each of our groups (movies Dr. Love has and hasn’t seen), we will want to focus on a bootstrap 90% confidence interval for the difference in means. We’ll stick to a percentile bootstrap here.

set.seed(4312023)
bootdif(movies2$imdb_stars, movies2$dr_love, conf.level = 0.90, B.reps = 2000)
Mean Difference            0.05            0.95 
      0.3720465       0.1572827       0.5764050 

Our 90% bootstrap confidence interval for the difference in population means (Yes - No) in imdb_stars is (0.157, 0.576) with a point estimate of 0.372. I suppose the question now is whether or not this would be considered a meaningful difference in average imdb_stars.
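
To make the percentile bootstrap idea concrete, here is a rough sketch of the resampling “by hand” with dplyr. I am assuming stratified resampling within each dr_love group and 2000 resamples; bootdif() handles these details for us, so this sketch’s interval need not match the result above exactly.

## Rough sketch: percentile bootstrap for the Yes - No difference in means
set.seed(4312023)

boot_diffs <- replicate(2000, {
  resample <- movies2 |>
    group_by(dr_love) |>
    slice_sample(prop = 1, replace = TRUE) |>  ## resample within each group
    ungroup()
  mean(resample$imdb_stars[resample$dr_love == "Yes"]) -
    mean(resample$imdb_stars[resample$dr_love == "No"])
})

quantile(boot_diffs, probs = c(0.05, 0.95))    ## 90% percentile interval endpoints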

Note

As the instructions suggest, you should show your work and your reasoning (not just your code), and comment on any analytic decisions you make. Be sure to actively present and justify any assumptions you are making.

6.4 Conclusions

Note

Again, I’ll just reprint the Advice on Analysis 2 Conclusions from the Instructions.

For Analysis 2, you’ll write two paragraphs.

In the first paragraph, you should provide a clear restatement of your research question, followed by a clear and appropriate response to your research question, motivated by your results. Interpret your 90% confidence interval’s endpoints and width in context.

Then, write a paragraph which summarizes the key limitations of your work in Analysis 2.

  • If you see problems with the assumptions behind your choice of interval, that would be a good thing to talk about here, for instance.
  • Another issue that is worth discussing is your target population, and what evidence you can describe that might indicate whether your selected states are a representative sample of the US as a whole, or perhaps some particular part of the United States.
  • You should also provide at least one useful “next step” that you could take to improve this analysis (just saying “get more data” isn’t a sufficient next step).

7 Session Information

session_info()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)


Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

Package version:
  askpass_1.2.0       backports_1.4.1     base64enc_0.1-3    
  bigD_0.2.0          bit_4.0.5           bit64_4.0.5        
  bitops_1.0.7        blob_1.2.4          broom_1.0.5        
  bslib_0.5.1         cachem_1.0.8        callr_3.7.3        
  cellranger_1.1.0    checkmate_2.2.0     cli_3.6.1          
  clipr_0.8.0         cluster_2.1.4       colorspace_2.1-0   
  commonmark_1.9.0    compiler_4.3.1      conflicted_1.2.0   
  cpp11_0.4.6         crayon_1.5.2        curl_5.1.0         
  data.table_1.14.8   DBI_1.1.3           dbplyr_2.3.4       
  digest_0.6.33       dplyr_1.1.3         dtplyr_1.3.1       
  ellipsis_0.3.2      evaluate_0.22       fansi_1.0.5        
  farver_2.1.1        fastmap_1.1.1       fontawesome_0.5.2  
  forcats_1.0.0       foreign_0.8-84      Formula_1.2-5      
  fs_1.6.3            gargle_1.5.2        generics_0.1.3     
  ggforce_0.4.1       ggformula_0.10.4    ggplot2_3.4.4      
  ggridges_0.5.4      ggstance_0.3.6      glue_1.6.2         
  googledrive_2.1.1   googlesheets4_1.1.1 graphics_4.3.1     
  grDevices_4.3.1     grid_4.3.1          gridExtra_2.3      
  gt_0.10.0           gtable_0.3.4        gtExtras_0.5.0     
  haven_2.5.3         highr_0.10          Hmisc_5.1-1        
  hms_1.1.3           htmlTable_2.4.1     htmltools_0.5.6.1  
  htmlwidgets_1.6.2   httr_1.4.7          ids_1.0.1          
  isoband_0.2.7       janitor_2.2.0       jquerylib_0.1.4    
  jsonlite_1.8.7      juicyjuice_0.1.0    knitr_1.44         
  labeling_0.4.3      labelled_2.12.0     lattice_0.21-8     
  lifecycle_1.0.3     lubridate_1.9.3     magrittr_2.0.3     
  markdown_1.10       MASS_7.3-60         Matrix_1.6-1.1     
  memoise_2.0.1       methods_4.3.1       mgcv_1.8.42        
  mime_0.12           modelr_0.1.11       mosaic_1.8.4.2     
  mosaicCore_0.9.2.1  mosaicData_0.20.3   munsell_0.5.0      
  naniar_1.0.0        nlme_3.1.162        nnet_7.3-19        
  norm_1.0.11.1       openssl_2.1.1       paletteer_1.5.0    
  patchwork_1.1.3     pillar_1.9.0        pkgconfig_2.0.3    
  plyr_1.8.9          polyclip_1.10-6     prettyunits_1.2.0  
  prismatic_1.1.1     processx_3.8.2      progress_1.2.2     
  ps_1.7.5            purrr_1.0.2         R6_2.5.1           
  ragg_1.2.6          rappdirs_0.3.3      RColorBrewer_1.1.3 
  Rcpp_1.0.11         RcppEigen_0.3.3.9.3 reactable_0.4.4    
  reactR_0.5.0        readr_2.1.4         readxl_1.4.3       
  rematch_2.0.0       rematch2_2.1.2      reprex_2.0.2       
  rlang_1.1.1         rmarkdown_2.25      rpart_4.1.19       
  rstudioapi_0.15.0   rvest_1.0.3         sass_0.4.7         
  scales_1.2.1        selectr_0.4.2       snakecase_0.11.1   
  splines_4.3.1       stats_4.3.1         stringi_1.7.12     
  stringr_1.5.0       sys_3.4.2           systemfonts_1.0.5  
  textshaping_0.3.7   tibble_3.2.1        tidyr_1.3.0        
  tidyselect_1.2.0    tidyverse_2.0.0     timechange_0.2.0   
  tinytex_0.48        tools_4.3.1         tweenr_2.0.2       
  tzdb_0.4.0          UpSetR_1.4.0        utf8_1.2.3         
  utils_4.3.1         uuid_1.1.1          V8_4.4.0           
  vctrs_0.6.4         viridis_0.6.4       viridisLite_0.4.2  
  visdat_0.6.0        vroom_1.6.4         withr_2.5.1        
  xfun_0.40           xml2_1.3.5          yaml_2.3.7