Appropriate age rating by the US-based rating agency
genre
Film category
Let’s clean up the dataset a little:
Get rid of the blank X1 Variable.
Change release date into an actual date.
Change character variables to factors.
Calculate the return on investment as the worldwide_gross/production_budget.
Calculate domestic revenue as a percentage of the total (worldwide) gross.
Get the year, month, and day out of the release date.
Remove rows where the revenue is $0 (unreleased movies, or data integrity problems), and remove rows missing information about the distributor. Go ahead and remove any data where the rating is unavailable also.
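The cleaning steps above could be sketched roughly as follows. This is only an illustration on a toy data frame: the column names `domestic_gross`, `distributor`, and `movie`, the new name `pct_domestic`, and the month/day/year date format are assumptions, not taken from the original data.

```r
library(dplyr)

# Toy stand-in for the raw data. Column names other than X1, release_date,
# production_budget, worldwide_gross, mpaa_rating, and genre are assumptions.
raw <- tibble(
  X1 = 1:4,
  release_date = c("6/20/1975", "12/18/2009", "5/4/2018", "7/1/1999"),
  movie = c("A", "B", "C", "D"),
  production_budget = c(10, 200, 50, 20),
  domestic_gross = c(30, 300, 0, 10),
  worldwide_gross = c(60, 900, 0, 40),
  distributor = c("Universal", "Fox", "Sony", NA),
  mpaa_rating = c("PG", "PG-13", "R", "R"),
  genre = c("Horror", "Action", "Drama", "Comedy")
)

mov_clean <- raw |>
  select(-X1) |>                                    # drop the blank X1 variable
  mutate(release_date = as.Date(release_date,       # assumed m/d/Y format
                                format = "%m/%d/%Y")) |>
  mutate(across(where(is.character), as.factor)) |> # characters -> factors
  mutate(
    roi          = worldwide_gross / production_budget,
    pct_domestic = domestic_gross / worldwide_gross,
    year         = as.integer(format(release_date, "%Y")),
    month        = as.integer(format(release_date, "%m")),
    day          = as.integer(format(release_date, "%d"))
  ) |>
  filter(
    worldwide_gross > 0,     # drop $0 revenue rows
    !is.na(distributor),     # drop rows with no distributor
    !is.na(mpaa_rating)      # drop rows with no rating
  )

mov_clean
```

The date is converted before the character-to-factor step so that `release_date` does not end up as a factor.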
There doesn’t appear to be much documented before 1975, so let’s restrict (read: filter) the dataset to movies made since 1975. Also, we’re going to be doing some analyses by year, and the data for 2018 is still incomplete, so let’s remove all of 2018. In other words, keep anything produced in 1975 or later (>= 1975) but before 2018.
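A minimal sketch of that filter, assuming a numeric `year` column has already been extracted from the release date (the toy data below is made up):

```r
library(dplyr)

# Hypothetical toy data with one movie on each side of the cutoffs
movies <- tibble(
  title = c("A", "B", "C", "D"),
  year  = c(1970L, 1975L, 2017L, 2018L)
)

# Keep 1975 and later, but drop the incomplete 2018 data
recent <- movies |>
  filter(year >= 1975, year < 2018)

recent
```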
What does the worldwide movie market look like by decade? Let’s first group by year and genre and compute the sum of the worldwide gross revenue. After that, let’s draw a bar plot with year on the x-axis and the summed revenue on the y-axis, passing the genre variable to the fill aesthetic of the bars.
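One way this could look, sketched on a small made-up data frame (the name `revenue` for the summed column is an assumption; `geom_col()` is used because the bar heights are already computed):

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data
toy <- tibble(
  year            = c(2000L, 2000L, 2000L, 2001L),
  genre           = c("Action", "Action", "Drama", "Drama"),
  worldwide_gross = c(100, 50, 30, 70)
)

# Sum worldwide gross within each year/genre combination
by_year_genre <- toy |>
  group_by(year, genre) |>
  summarize(revenue = sum(worldwide_gross), .groups = "drop")

# Stacked bars: year on x, summed revenue on y, genre as the fill
p <- ggplot(by_year_genre, aes(x = year, y = revenue, fill = genre)) +
  geom_col()

p
```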
Generally most of the points lie above the “breakeven” line. This is good: if movies weren’t profitable, studios wouldn’t keep making them. Proportionally, the Horror genre seems to have many more large points, indicating higher ROI.
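The plot being described is not reproduced in the text, so the following is only a guess at its shape: budget on the x-axis, worldwide gross on the y-axis, point size mapped to ROI, one panel per genre, and a dashed breakeven line where gross equals budget. The data is made up.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(
  production_budget = c(10, 20, 50, 5),
  worldwide_gross   = c(30, 15, 200, 60),
  genre             = c("Action", "Drama", "Action", "Horror")
)
toy$roi <- toy$worldwide_gross / toy$production_budget

p <- ggplot(toy, aes(production_budget, worldwide_gross, size = roi)) +
  geom_point(alpha = 0.6) +
  # Breakeven line: worldwide gross equals production budget
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  facet_wrap(~genre)

# Share of movies that sit above the breakeven line
above_share <- mean(toy$worldwide_gross > toy$production_budget)
above_share
```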
R-rated movies have a lower average revenue, but their ROI isn’t substantially lower. We can see that while G-rated movies have the highest mean revenue, relatively few of them were produced, so their total revenue is the lowest. There were more R-rated movies, but PG-13 movies really drove total worldwide revenue.
```r
mov |>
  group_by(mpaa_rating) |>
  summarize(
    meanrev = mean(worldwide_gross),
    totrev  = sum(worldwide_gross),
    roi     = mean(roi),
    number  = n()
  )
#> # A tibble: 4 × 5
#>   mpaa_rating    meanrev       totrev   roi number
#>   <fct>            <dbl>        <dbl> <dbl>  <int>
#> 1 G           189913348   13863674404  4.42     73
#> 2 PG          147227422.  78324988428  4.64    532
#> 3 PG-13       113477939. 120173136920  3.06   1059
#> 4 R            63627931.  92451383780  4.42   1453
```
Are there fewer R-rated movies being produced? Not really. Let’s look at the overall number of movies with any particular rating faceted by genre.
Yes, on average G-rated movies look to perform better. But there aren’t that many of them being produced, and they aren’t bringing in the lion’s share of revenue.
But wait, is there any association between genre and mpaa_rating?
```r
# Create frequency table and save it for reuse
ptable <- mov %>%
  select(mpaa_rating, genre) %>%  # Variables for table
  table() %>%                     # Create 4 x 5 contingency table
  print()                         # Show table
#>            genre
#> mpaa_rating Action Adventure Comedy Drama Horror
#>       G          0        62      4     7      0
#>       PG        23       293     77   133      6
#>       PG-13    215        80    319   388     57
#>       R        268        14    356   621    194
```
CHI-SQUARED TEST
```r
# Get chi-squared test for mpaa_rating and genre
ptable %>% chisq.test()
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  .
#> X-squared = 1343.7, df = 12, p-value < 2.2e-16
```
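Great: the p-value is less than 0.05, so we reject the hypothesis that genre and mpaa_rating are independent. The two variables are associated. Note that the p-value tells us the association is statistically significant, not how strong it is.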
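To get a feel for the strength of the association, the counts from the frequency table can be typed into a matrix and passed to `chisq.test()` directly. The sketch below also computes Cramér's V, an effect-size measure that is not part of the original analysis and is added here only as an illustration.

```r
# Counts copied from the mpaa_rating x genre frequency table
counts <- matrix(
  c(  0,  62,   4,   7,   0,
     23, 293,  77, 133,   6,
    215,  80, 319, 388,  57,
    268,  14, 356, 621, 194),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    mpaa_rating = c("G", "PG", "PG-13", "R"),
    genre       = c("Action", "Adventure", "Comedy", "Drama", "Horror")
  )
)

test <- chisq.test(counts)
test

# Cramér's V (an addition, not in the original analysis):
# sqrt(X^2 / (n * (min(rows, cols) - 1))), ranging from 0 to 1
n <- sum(counts)
k <- min(dim(counts)) - 1
cramers_v <- sqrt(unname(test$statistic) / (n * k))
cramers_v
```

Values of V near 0 indicate a weak association and values near 1 a strong one, regardless of sample size.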
Let’s join to the IMDB reviews dataset to get more insights.
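The join key and the IMDB column names are not stated here, so the sketch below assumes a shared `movie` title column and hypothetical `imdb` and `votes` columns, purely for illustration.

```r
library(dplyr)

# Hypothetical toy tables; `movie`, `imdb`, and `votes` are assumed names
mov_toy  <- tibble(movie = c("A", "B", "C"),
                   mpaa_rating = c("PG", "R", "R"))
imdb_toy <- tibble(movie = c("A", "B", "D"),
                   imdb  = c(7.1, 6.4, 8.0),
                   votes = c(1000L, 250L, 90L))

# inner_join() keeps only movies present in both tables
joined <- inner_join(mov_toy, imdb_toy, by = "movie")
joined
```

An inner join silently drops movies without a match, so it is worth comparing row counts before and after.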
Correlation measures the strength and direction of association between two variables. There are three common correlation tests: the Pearson product-moment correlation (Pearson’s r), Spearman’s rank-order correlation (Spearman’s rho), and Kendall’s rank correlation (Kendall’s tau).
Use Pearson’s r if both variables are quantitative (interval or ratio), normally distributed, and the relationship is linear with homoscedastic residuals.
Spearman’s rho and Kendall’s tau are non-parametric measures, so they are valid for both quantitative and ordinal variables and do not carry the normality and homoscedasticity conditions. However, non-parametric tests have less statistical power than parametric tests, so only use these correlations if Pearson’s r does not apply.
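In base R all three coefficients are available through the `method` argument of `cor()` and `cor.test()`. A small sketch on made-up data, where the relationship is perfectly monotonic but not linear:

```r
x <- 1:10
y <- x^3   # monotonic but clearly nonlinear

pearson  <- cor(x, y, method = "pearson")   # penalized by the curvature
spearman <- cor(x, y, method = "spearman")  # rank-based
kendall  <- cor(x, y, method = "kendall")   # rank-based

c(pearson = pearson, spearman = spearman, kendall = kendall)

# cor.test() takes the same `method` argument and adds a hypothesis test
rho_test <- cor.test(x, y, method = "spearman")
rho_test
```

Because the ranks of `x` and `y` agree exactly, the two rank-based coefficients equal 1 while Pearson's r falls short of 1.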
Visualize the correlation matrix with corrplot() from the corrplot package
```r
library(corrplot)

df %>%
  cor() %>%
  corrplot(
    type   = "upper",     # Matrix: full, upper, or lower
    diag   = FALSE,       # Remove diagonal
    order  = "original",  # Order for labels
    tl.col = "black",     # Font color
    tl.srt = 45           # Label angle
  )
```
Production budget, worldwide gross, and domestic gross all appear to be correlated with one another.
But is the correlation significant?
```r
# SINGLE CORRELATION #######################################

# Use cor.test() to test one pair of variables at a time.
# cor.test() gives r, the hypothesis test, and the
# confidence interval. This command uses the "exposition
# pipe," %$%, from magrittr, which passes the columns from
# the data frame (and not the data frame itself)
df %$% cor.test(production_budget, worldwide_gross)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  production_budget and worldwide_gross
#> t = 47.722, df = 3115, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.6291207 0.6697034
#> sample estimates:
#>      cor 
#> 0.649875
```
Yes: with a p-value below 2.2e-16, the correlation between production budget and worldwide gross is statistically significant.
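As a sanity check, the t statistic reported by `cor.test()` can be recovered from the correlation and the degrees of freedom using the standard relationship t = r * sqrt(df) / sqrt(1 - r^2). The inputs below are copied from the output above; the variable names are only illustrative.

```r
r  <- 0.649875   # sample correlation from the cor.test() output
df <- 3115       # degrees of freedom (n - 2)

# t statistic for testing that the true correlation is zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
t_stat

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)
p_value < 2.2e-16
```

The computed statistic agrees with the reported t = 47.722 up to rounding.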
Separately for each MPAA rating, I will display the mean IMDB rating and the mean number of votes cast.