Your code should be placed into an R Markdown file, and published to rpubs.com. When you submit the assignment, please include the URL for your published file. If you are unable to publish to rpubs.com, please attach your R Markdown file.
You will find this assignment much easier if you go through the Week 6 hands-on lab first. The Week 6 assignment is due at the end of the day on Sunday, March 8th.
Choose and load any R dataset (except for diamonds!) that has at least two numeric variables and at least two categorical variables. Identify which variables in your data set are numeric, and which are categorical (factors).
Generate summary level descriptive statistics: Show the mean, median, 25th and 75th quartiles, min, and max for each of the applicable variables in your data set.
Determine the frequency for one of the categorical variables.
Determine the frequency for one of the categorical variables, by a different categorical variable.
Create a graph for a single numeric variable.
Create a scatterplot of two numeric variables.
Choose and load any R dataset (except for diamonds!) that has at least two numeric variables and at least two categorical variables. Identify which variables in your data set are numeric, and which are categorical (factors).
Use the following commands to explore the available data sets:
require(ggplot2)
## Loading required package: ggplot2
data(package = "ggplot2")
View(movies)
Numeric variables: Length, Budget, Votes
Categorical variables: Title, Year, Length, Rating, r1 to r10, mpaa
Genre indicator variables (0/1): Action, Animation, Comedy, Drama, Documentary, Romance, Short
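The classification above can also be checked programmatically. The following is a minimal sketch, not the method used in this report: `movies_toy` is a small hypothetical data frame that only borrows a few column names from `movies`, so that the example runs without any package. (In recent ggplot2 releases the `movies` data is shipped in the separate ggplot2movies package rather than in ggplot2 itself.)

```r
# Hypothetical toy data; the values are invented for illustration only.
movies_toy <- data.frame(
  title  = c("A", "B", "C", "D", "E", "F"),
  year   = c(1994L, 1994L, 1999L, 2001L, 2001L, 2004L),
  length = c(90L, 121L, 7L, 104L, 95L, 88L),
  budget = c(NA, 25000000, NA, 3000000, NA, 15000000),
  rating = c(6.1, 7.8, 5.4, 6.9, 4.2, 7.0),
  votes  = c(30L, 5120L, 11L, 640L, 18L, 2310L),
  mpaa   = c("", "R", "", "PG-13", "", "PG"),
  Action = c(0L, 1L, 0L, 0L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Storage class of every column
col_class <- sapply(movies_toy, class)
print(col_class)

# Columns R stores as numbers versus as text
numeric_cols   <- names(movies_toy)[sapply(movies_toy, is.numeric)]
character_cols <- names(movies_toy)[sapply(movies_toy, is.character)]

# A text column becomes categorical once it is converted to a factor
movies_toy$mpaa <- factor(movies_toy$mpaa)
print(levels(movies_toy$mpaa))
```

Note that `is.numeric()` only reports how a column is stored. A column such as `year` or a 0/1 genre flag is stored as a number but may still be treated as categorical in the analysis, which is a judgment call.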
Generate summary level descriptive statistics: Show the mean, median, 25th and 75th quartiles, min, and max for each of the applicable variables in your data set.
summary(movies)
## title year length budget
## Length:58788 Min. :1893 Min. : 1.00 Min. : 0
## Class :character 1st Qu.:1958 1st Qu.: 74.00 1st Qu.: 250000
## Mode :character Median :1983 Median : 90.00 Median : 3000000
## Mean :1976 Mean : 82.34 Mean : 13412513
## 3rd Qu.:1997 3rd Qu.: 100.00 3rd Qu.: 15000000
## Max. :2005 Max. :5220.00 Max. :200000000
## NA's :53573
## rating votes r1 r2
## Min. : 1.000 Min. : 5.0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 5.000 1st Qu.: 11.0 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 6.100 Median : 30.0 Median : 4.500 Median : 4.500
## Mean : 5.933 Mean : 632.1 Mean : 7.014 Mean : 4.022
## 3rd Qu.: 7.000 3rd Qu.: 112.0 3rd Qu.: 4.500 3rd Qu.: 4.500
## Max. :10.000 Max. :157608.0 Max. :100.000 Max. :84.500
##
## r3 r4 r5 r6
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 4.500 1st Qu.: 4.50
## Median : 4.500 Median : 4.500 Median : 4.500 Median :14.50
## Mean : 4.721 Mean : 6.375 Mean : 9.797 Mean :13.04
## 3rd Qu.: 4.500 3rd Qu.: 4.500 3rd Qu.: 14.500 3rd Qu.:14.50
## Max. :84.500 Max. :100.000 Max. :100.000 Max. :84.50
##
## r7 r8 r9 r10
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 4.50 1st Qu.: 4.50 1st Qu.: 4.500 1st Qu.: 4.50
## Median : 14.50 Median : 14.50 Median : 4.500 Median : 14.50
## Mean : 15.55 Mean : 13.88 Mean : 8.954 Mean : 16.85
## 3rd Qu.: 24.50 3rd Qu.: 24.50 3rd Qu.: 14.500 3rd Qu.: 24.50
## Max. :100.00 Max. :100.00 Max. :100.000 Max. :100.00
##
## mpaa Action Animation Comedy
## :53864 Min. :0.00000 Min. :0.00000 Min. :0.0000
## NC-17: 16 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## PG : 528 Median :0.00000 Median :0.00000 Median :0.0000
## PG-13: 1003 Mean :0.07974 Mean :0.06277 Mean :0.2938
## R : 3377 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.00000 Max. :1.0000
##
## Drama Documentary Romance Short
## Min. :0.000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.000 Median :0.00000 Median :0.0000 Median :0.0000
## Mean :0.371 Mean :0.05906 Mean :0.0807 Mean :0.1609
## 3rd Qu.:1.000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.000 Max. :1.00000 Max. :1.0000 Max. :1.0000
##
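Because `summary()` reports every column, including the text column `title`, it can help to compute the requested statistics for the numeric columns only. The sketch below is one possible way to do that, shown on a hypothetical toy data frame (`movies_toy`, invented values) rather than on `movies` itself. `na.rm = TRUE` is needed because `budget` contains missing values.

```r
# Hypothetical toy data; the values are invented for illustration only.
movies_toy <- data.frame(
  budget = c(NA, 25000000, NA, 3000000, NA, 15000000),
  rating = c(6.1, 7.8, 5.4, 6.9, 4.2, 7.0),
  votes  = c(30, 5120, 11, 640, 18, 2310),
  mpaa   = c("", "R", "", "PG-13", "", "PG"),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns
num <- movies_toy[sapply(movies_toy, is.numeric)]

# One column of statistics per variable, ignoring missing values
stats <- sapply(num, function(x) c(
  mean   = mean(x, na.rm = TRUE),
  median = median(x, na.rm = TRUE),
  q25    = unname(quantile(x, 0.25, na.rm = TRUE)),
  q75    = unname(quantile(x, 0.75, na.rm = TRUE)),
  min    = min(x, na.rm = TRUE),
  max    = max(x, na.rm = TRUE)
))

print(stats)
```

The result is a matrix with one row per statistic and one column per numeric variable, so a single value can be read with, for example, `stats["median", "budget"]`.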
Determine the frequency for one of the categorical variables.
table(movies$year) # count the number of movies per release year
##
## 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907
## 1 9 3 13 9 5 9 16 28 9 37 42 17 17 12
## 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922
## 24 30 26 22 34 32 54 54 49 37 41 52 43 53 52
## 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937
## 48 50 79 94 83 109 184 288 346 412 421 482 464 484 484
## 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952
## 463 484 503 521 507 446 426 375 432 439 478 486 491 518 513
## 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967
## 539 501 522 497 556 528 495 478 466 513 503 517 530 579 594
## 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982
## 651 625 586 646 637 634 625 619 665 617 609 632 681 661 689
## 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997
## 698 749 792 792 957 944 944 899 888 948 1016 1199 1248 1390 1568
## 1998 1999 2000 2001 2002 2003 2004 2005
## 1705 1927 2048 2121 2168 2158 1945 349
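The same `table()` call works for any categorical column. As an illustration only, the sketch below counts a hypothetical vector of MPAA ratings (invented values, named `mpaa` after the column in `movies`) and converts the counts to proportions with `prop.table()`. Relabeling the empty string as "(blank)" is an assumption made here for readability, not something done in this report.

```r
# Hypothetical ratings; the empty string stands for a missing rating.
mpaa <- factor(c("", "R", "", "PG-13", "", "PG", "R", ""))

# Give the empty level a visible label (an assumption for readability)
levels(mpaa)[levels(mpaa) == ""] <- "(blank)"

# Absolute frequencies
counts <- table(mpaa)
print(counts)

# Relative frequencies (these sum to 1)
shares <- prop.table(counts)
print(round(shares, 3))
```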
Determine the frequency for one of the categorical variables, by a different categorical variable.
table(movies$year, movies$mpaa)
##
## NC-17 PG PG-13 R
## 1893 1 0 0 0 0
## 1894 9 0 0 0 0
## 1895 3 0 0 0 0
## 1896 13 0 0 0 0
## 1897 9 0 0 0 0
## 1898 5 0 0 0 0
## 1899 9 0 0 0 0
## 1900 16 0 0 0 0
## 1901 28 0 0 0 0
## 1902 9 0 0 0 0
## 1903 37 0 0 0 0
## 1904 42 0 0 0 0
## 1905 17 0 0 0 0
## 1906 17 0 0 0 0
## 1907 12 0 0 0 0
## 1908 24 0 0 0 0
## 1909 30 0 0 0 0
## 1910 26 0 0 0 0
## 1911 22 0 0 0 0
## 1912 34 0 0 0 0
## 1913 32 0 0 0 0
## 1914 54 0 0 0 0
## 1915 54 0 0 0 0
## 1916 49 0 0 0 0
## 1917 37 0 0 0 0
## 1918 41 0 0 0 0
## 1919 52 0 0 0 0
## 1920 43 0 0 0 0
## 1921 53 0 0 0 0
## 1922 52 0 0 0 0
## 1923 48 0 0 0 0
## 1924 50 0 0 0 0
## 1925 79 0 0 0 0
## 1926 94 0 0 0 0
## 1927 83 0 0 0 0
## 1928 109 0 0 0 0
## 1929 184 0 0 0 0
## 1930 288 0 0 0 0
## 1931 346 0 0 0 0
## 1932 412 0 0 0 0
## 1933 421 0 0 0 0
## 1934 481 0 1 0 0
## 1935 464 0 0 0 0
## 1936 484 0 0 0 0
## 1937 484 0 0 0 0
## 1938 462 0 1 0 0
## 1939 484 0 0 0 0
## 1940 503 0 0 0 0
## 1941 521 0 0 0 0
## 1942 507 0 0 0 0
## 1943 446 0 0 0 0
## 1944 426 0 0 0 0
## 1945 374 0 0 1 0
## 1946 431 0 1 0 0
## 1947 439 0 0 0 0
## 1948 478 0 0 0 0
## 1949 486 0 0 0 0
## 1950 491 0 0 0 0
## 1951 516 0 2 0 0
## 1952 513 0 0 0 0
## 1953 539 0 0 0 0
## 1954 501 0 0 0 0
## 1955 520 0 1 1 0
## 1956 495 0 2 0 0
## 1957 556 0 0 0 0
## 1958 526 0 2 0 0
## 1959 495 0 0 0 0
## 1960 474 0 1 3 0
## 1961 462 0 3 1 0
## 1962 511 0 1 1 0
## 1963 502 0 1 0 0
## 1964 515 0 2 0 0
## 1965 527 0 0 3 0
## 1966 571 0 4 2 2
## 1967 593 0 1 0 0
## 1968 641 0 1 6 3
## 1969 614 0 1 8 2
## 1970 564 0 6 7 9
## 1971 624 0 0 15 7
## 1972 633 2 1 1 0
## 1973 631 1 0 0 2
## 1974 623 1 0 0 1
## 1975 616 2 0 1 0
## 1976 664 0 0 0 1
## 1977 613 0 1 1 2
## 1978 606 0 1 1 1
## 1979 627 0 1 1 3
## 1980 677 0 1 1 2
## 1981 658 1 0 0 2
## 1982 683 0 2 2 2
## 1983 692 0 2 2 2
## 1984 744 0 1 0 4
## 1985 787 0 1 2 2
## 1986 787 0 0 0 5
## 1987 953 0 0 1 3
## 1988 937 0 1 2 4
## 1989 930 0 0 3 11
## 1990 892 0 0 1 6
## 1991 873 0 1 2 12
## 1992 921 0 1 3 23
## 1993 973 1 6 3 33
## 1994 1105 0 8 15 71
## 1995 842 4 46 63 293
## 1996 970 0 47 62 311
## 1997 1118 2 56 66 326
## 1998 1248 0 48 73 336
## 1999 1424 0 44 92 367
## 2000 1563 0 39 103 343
## 2001 1611 1 42 109 358
## 2002 1667 0 50 107 344
## 2003 1747 0 43 104 264
## 2004 1596 0 42 111 196
## 2005 289 1 12 23 24
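With one row per year the cross-tabulation is long. One option, offered here as a suggestion rather than as part of the original analysis, is to group the years into decades before tabulating. The sketch uses two hypothetical vectors (`year` and `mpaa`, invented values) in place of the `movies` columns.

```r
# Hypothetical data; the values are invented for illustration only.
year <- c(1968L, 1971L, 1989L, 1994L, 1995L, 1995L, 2001L, 2004L)
mpaa <- c("", "R", "", "R", "PG", "R", "PG-13", "R")

# Integer-divide by 10 to drop the last digit, then label the decade
decade <- paste0((year %/% 10) * 10, "s")

# Rows are decades, columns are ratings
xt <- table(decade, mpaa)
print(xt)

# Row and column totals, if wanted
print(addmargins(xt))
```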
Create a graph for a single numeric variable.
ggplot(data=movies,aes(x=rating)) +
geom_histogram(binwidth=0.5) +
labs(title="Movie Ratings", x="Ratings", y="Frequency")
Create a scatterplot of two numeric variables.
a <- ggplot(data = movies, aes(x = budget, y = votes))
a <- a + geom_point(size = 1)
a <- a + xlab("Budget") + ylab("Votes") + ggtitle("Votes by Budget")
a
## Warning: Removed 53573 rows containing missing values (geom_point).
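The warning matches the 53573 missing `budget` values reported by `summary()`: ggplot2 drops rows it cannot place on the plot. A possible way to make this explicit is to filter those rows out before plotting. The sketch below assumes ggplot2 is installed and uses a hypothetical toy data frame (`movies_toy`, invented values) in place of `movies`.

```r
library(ggplot2)

# Hypothetical toy data; the values are invented for illustration only.
movies_toy <- data.frame(
  budget = c(NA, 25000000, NA, 3000000, NA, 15000000),
  votes  = c(30, 5120, 11, 640, 18, 2310)
)

# Count the rows that would otherwise be dropped with a warning
n_missing <- sum(is.na(movies_toy$budget))

# Keep only the rows with a known budget
with_budget <- movies_toy[!is.na(movies_toy$budget), ]

# Same plot as above, built from the filtered data
p <- ggplot(with_budget, aes(x = budget, y = votes)) +
  geom_point(size = 1) +
  labs(title = "Votes by Budget", x = "Budget", y = "Votes")

print(p)
```

Filtering first keeps the plot unchanged while making the number of excluded rows (`n_missing`) available for reporting.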