Assignment:

Your code should be placed into an R Markdown file, and published to rpubs.com. When you submit the assignment, please include the URL for your published file. If you are unable to publish to rpubs.com, please attach your R Markdown file.

You will find this assignment much easier if you go through the Week 6 hands-on lab first. The Week 6 assignment is due at the end of the day on Sunday, March 8th.

  1. Choose and load any R dataset (except for diamonds!) that has at least two numeric variables and at least two categorical variables. Identify which variables in your data set are numeric, and which are categorical (factors).

  2. Generate summary level descriptive statistics: Show the mean, median, 25th and 75th percentiles (first and third quartiles), min, and max for each of the applicable variables in your data set.

  3. Determine the frequency for one of the categorical variables.

  4. Determine the frequency for one of the categorical variables, by a different categorical variable.

  5. Create a graph for a single numeric variable.

  6. Create a scatterplot of two numeric variables.


1.

Choose and load any R dataset (except for diamonds!) that has at least two numeric variables and at least two categorical variables. Identify which variables in your data set are numeric, and which are categorical (factors).

Use the following commands to explore the available data sets:

require(ggplot2)
## Loading required package: ggplot2
data(package = "ggplot2")
View(movies)
Numerical Variables

length, budget, votes, rating, r1 - r10

Categorical Variables

title, year (stored as a number, but treated as a category in questions 3 and 4), mpaa

Binary Variables

Action, Animation, Comedy, Drama, Documentary, Romance, Short
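
The classification above can also be checked in code instead of by reading through the viewer. The following is a minimal sketch, assuming a small made-up data frame `toy` whose column names mirror a few `movies` columns; `classify_columns` is a hypothetical helper written for this illustration, not a function from any package. Note that in ggplot2 2.0.0 and later the `movies` data is no longer bundled with ggplot2 and ships in the separate ggplot2movies package.

```r
# Hypothetical helper: label each column as "binary", "numeric", or "categorical".
classify_columns <- function(df) {
  sapply(df, function(x) {
    if (is.numeric(x) && all(x %in% c(0, 1, NA))) {
      "binary"        # numeric flag holding only 0/1, like the genre columns
    } else if (is.numeric(x)) {
      "numeric"
    } else {
      "categorical"   # character, factor, or logical columns
    }
  })
}

# Toy stand-in for movies (values are invented for illustration only).
# With a recent ggplot2, the real data would be loaded with:
#   data(movies, package = "ggplot2movies")
toy <- data.frame(
  title  = c("A", "B", "C", "D"),
  length = c(90, 100, 110, 120),
  budget = c(100, NA, 300, 200),
  mpaa   = factor(c("", "PG", "R", "R")),
  Action = c(0, 1, 0, 1),
  stringsAsFactors = FALSE
)

types <- classify_columns(toy)
print(types)
```

Running `classify_columns(movies)` on the real data would report `year` as numeric, because it is stored as an integer. Treating it as a category is an analysis choice rather than a property of the column.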

2.

Generate summary level descriptive statistics: Show the mean, median, 25th and 75th percentiles (first and third quartiles), min, and max for each of the applicable variables in your data set.

summary(movies)
##     title                year          length            budget         
##  Length:58788       Min.   :1893   Min.   :   1.00   Min.   :        0  
##  Class :character   1st Qu.:1958   1st Qu.:  74.00   1st Qu.:   250000  
##  Mode  :character   Median :1983   Median :  90.00   Median :  3000000  
##                     Mean   :1976   Mean   :  82.34   Mean   : 13412513  
##                     3rd Qu.:1997   3rd Qu.: 100.00   3rd Qu.: 15000000  
##                     Max.   :2005   Max.   :5220.00   Max.   :200000000  
##                                                      NA's   :53573      
##      rating           votes                r1                r2        
##  Min.   : 1.000   Min.   :     5.0   Min.   :  0.000   Min.   : 0.000  
##  1st Qu.: 5.000   1st Qu.:    11.0   1st Qu.:  0.000   1st Qu.: 0.000  
##  Median : 6.100   Median :    30.0   Median :  4.500   Median : 4.500  
##  Mean   : 5.933   Mean   :   632.1   Mean   :  7.014   Mean   : 4.022  
##  3rd Qu.: 7.000   3rd Qu.:   112.0   3rd Qu.:  4.500   3rd Qu.: 4.500  
##  Max.   :10.000   Max.   :157608.0   Max.   :100.000   Max.   :84.500  
##                                                                        
##        r3               r4                r5                r6       
##  Min.   : 0.000   Min.   :  0.000   Min.   :  0.000   Min.   : 0.00  
##  1st Qu.: 0.000   1st Qu.:  0.000   1st Qu.:  4.500   1st Qu.: 4.50  
##  Median : 4.500   Median :  4.500   Median :  4.500   Median :14.50  
##  Mean   : 4.721   Mean   :  6.375   Mean   :  9.797   Mean   :13.04  
##  3rd Qu.: 4.500   3rd Qu.:  4.500   3rd Qu.: 14.500   3rd Qu.:14.50  
##  Max.   :84.500   Max.   :100.000   Max.   :100.000   Max.   :84.50  
##                                                                      
##        r7               r8               r9               r10        
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.000   Min.   :  0.00  
##  1st Qu.:  4.50   1st Qu.:  4.50   1st Qu.:  4.500   1st Qu.:  4.50  
##  Median : 14.50   Median : 14.50   Median :  4.500   Median : 14.50  
##  Mean   : 15.55   Mean   : 13.88   Mean   :  8.954   Mean   : 16.85  
##  3rd Qu.: 24.50   3rd Qu.: 24.50   3rd Qu.: 14.500   3rd Qu.: 24.50  
##  Max.   :100.00   Max.   :100.00   Max.   :100.000   Max.   :100.00  
##                                                                      
##     mpaa           Action          Animation           Comedy      
##       :53864   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
##  NC-17:   16   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000  
##  PG   :  528   Median :0.00000   Median :0.00000   Median :0.0000  
##  PG-13: 1003   Mean   :0.07974   Mean   :0.06277   Mean   :0.2938  
##  R    : 3377   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:1.0000  
##                Max.   :1.00000   Max.   :1.00000   Max.   :1.0000  
##                                                                    
##      Drama        Documentary         Romance           Short       
##  Min.   :0.000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.000   Median :0.00000   Median :0.0000   Median :0.0000  
##  Mean   :0.371   Mean   :0.05906   Mean   :0.0807   Mean   :0.1609  
##  3rd Qu.:1.000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
## 
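
`summary()` reports every column, including the character `title` column, where a mean or quartile does not apply. The sketch below restricts the statistics to numeric columns and adds a count of missing values, which matters for `budget`. It is an illustration under stated assumptions: `describe_numeric` is a hypothetical helper and `toy` is an invented data frame, not the real `movies` data.

```r
# Hypothetical helper: one row per numeric column, one column per statistic.
describe_numeric <- function(df) {
  num <- df[sapply(df, is.numeric)]          # keep numeric columns only
  t(sapply(num, function(x) c(
    min       = min(x, na.rm = TRUE),
    q25       = unname(quantile(x, 0.25, na.rm = TRUE)),
    median    = median(x, na.rm = TRUE),
    mean      = mean(x, na.rm = TRUE),
    q75       = unname(quantile(x, 0.75, na.rm = TRUE)),
    max       = max(x, na.rm = TRUE),
    n_missing = sum(is.na(x))                # NA values are excluded from the stats above
  )))
}

# Toy stand-in for movies (values are invented for illustration only).
toy <- data.frame(
  title  = c("A", "B", "C", "D"),
  length = c(90, 100, 110, 120),
  budget = c(100, NA, 300, 200),
  stringsAsFactors = FALSE
)

stats <- describe_numeric(toy)
print(stats)
```

On the real data the call would be `describe_numeric(movies)`. The binary genre columns would still appear, since they are stored as numbers; for those, the mean is the share of movies carrying that flag.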
3.

Determine the frequency for one of the categorical variables.

table(movies$year)   # count the number of movies released in each year
## 
## 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 
##    1    9    3   13    9    5    9   16   28    9   37   42   17   17   12 
## 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 
##   24   30   26   22   34   32   54   54   49   37   41   52   43   53   52 
## 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 
##   48   50   79   94   83  109  184  288  346  412  421  482  464  484  484 
## 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 
##  463  484  503  521  507  446  426  375  432  439  478  486  491  518  513 
## 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 
##  539  501  522  497  556  528  495  478  466  513  503  517  530  579  594 
## 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 
##  651  625  586  646  637  634  625  619  665  617  609  632  681  661  689 
## 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 
##  698  749  792  792  957  944  944  899  888  948 1016 1199 1248 1390 1568 
## 1998 1999 2000 2001 2002 2003 2004 2005 
## 1705 1927 2048 2121 2168 2158 1945  349
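
With more than a hundred distinct years, the table above is hard to scan. One option is to collapse years into decades before counting. This is a sketch only: the `year` vector below is invented, and grouping by decade is an assumption about what makes the table readable, not part of the assignment.

```r
# Invented release years standing in for movies$year.
year <- c(1893, 1934, 1938, 1995, 1996, 1999, 2001, 2004)

# Round each year down to the start of its decade, e.g. 1938 -> 1930.
decade <- floor(year / 10) * 10

decade_counts <- table(decade)                          # frequency per decade
decade_share  <- round(prop.table(decade_counts), 3)    # share of all movies

print(decade_counts)
print(decade_share)
```

On the real data, `floor(movies$year / 10) * 10` would replace the invented vector.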
4.

Determine the frequency for one of the categorical variables, by a different categorical variable.

table(movies$year, movies$mpaa)
##       
##             NC-17   PG PG-13    R
##   1893    1     0    0     0    0
##   1894    9     0    0     0    0
##   1895    3     0    0     0    0
##   1896   13     0    0     0    0
##   1897    9     0    0     0    0
##   1898    5     0    0     0    0
##   1899    9     0    0     0    0
##   1900   16     0    0     0    0
##   1901   28     0    0     0    0
##   1902    9     0    0     0    0
##   1903   37     0    0     0    0
##   1904   42     0    0     0    0
##   1905   17     0    0     0    0
##   1906   17     0    0     0    0
##   1907   12     0    0     0    0
##   1908   24     0    0     0    0
##   1909   30     0    0     0    0
##   1910   26     0    0     0    0
##   1911   22     0    0     0    0
##   1912   34     0    0     0    0
##   1913   32     0    0     0    0
##   1914   54     0    0     0    0
##   1915   54     0    0     0    0
##   1916   49     0    0     0    0
##   1917   37     0    0     0    0
##   1918   41     0    0     0    0
##   1919   52     0    0     0    0
##   1920   43     0    0     0    0
##   1921   53     0    0     0    0
##   1922   52     0    0     0    0
##   1923   48     0    0     0    0
##   1924   50     0    0     0    0
##   1925   79     0    0     0    0
##   1926   94     0    0     0    0
##   1927   83     0    0     0    0
##   1928  109     0    0     0    0
##   1929  184     0    0     0    0
##   1930  288     0    0     0    0
##   1931  346     0    0     0    0
##   1932  412     0    0     0    0
##   1933  421     0    0     0    0
##   1934  481     0    1     0    0
##   1935  464     0    0     0    0
##   1936  484     0    0     0    0
##   1937  484     0    0     0    0
##   1938  462     0    1     0    0
##   1939  484     0    0     0    0
##   1940  503     0    0     0    0
##   1941  521     0    0     0    0
##   1942  507     0    0     0    0
##   1943  446     0    0     0    0
##   1944  426     0    0     0    0
##   1945  374     0    0     1    0
##   1946  431     0    1     0    0
##   1947  439     0    0     0    0
##   1948  478     0    0     0    0
##   1949  486     0    0     0    0
##   1950  491     0    0     0    0
##   1951  516     0    2     0    0
##   1952  513     0    0     0    0
##   1953  539     0    0     0    0
##   1954  501     0    0     0    0
##   1955  520     0    1     1    0
##   1956  495     0    2     0    0
##   1957  556     0    0     0    0
##   1958  526     0    2     0    0
##   1959  495     0    0     0    0
##   1960  474     0    1     3    0
##   1961  462     0    3     1    0
##   1962  511     0    1     1    0
##   1963  502     0    1     0    0
##   1964  515     0    2     0    0
##   1965  527     0    0     3    0
##   1966  571     0    4     2    2
##   1967  593     0    1     0    0
##   1968  641     0    1     6    3
##   1969  614     0    1     8    2
##   1970  564     0    6     7    9
##   1971  624     0    0    15    7
##   1972  633     2    1     1    0
##   1973  631     1    0     0    2
##   1974  623     1    0     0    1
##   1975  616     2    0     1    0
##   1976  664     0    0     0    1
##   1977  613     0    1     1    2
##   1978  606     0    1     1    1
##   1979  627     0    1     1    3
##   1980  677     0    1     1    2
##   1981  658     1    0     0    2
##   1982  683     0    2     2    2
##   1983  692     0    2     2    2
##   1984  744     0    1     0    4
##   1985  787     0    1     2    2
##   1986  787     0    0     0    5
##   1987  953     0    0     1    3
##   1988  937     0    1     2    4
##   1989  930     0    0     3   11
##   1990  892     0    0     1    6
##   1991  873     0    1     2   12
##   1992  921     0    1     3   23
##   1993  973     1    6     3   33
##   1994 1105     0    8    15   71
##   1995  842     4   46    63  293
##   1996  970     0   47    62  311
##   1997 1118     2   56    66  326
##   1998 1248     0   48    73  336
##   1999 1424     0   44    92  367
##   2000 1563     0   39   103  343
##   2001 1611     1   42   109  358
##   2002 1667     0   50   107  344
##   2003 1747     0   43   104  264
##   2004 1596     0   42   111  196
##   2005  289     1   12    23   24
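
The unlabeled first column in the table above holds the movies whose `mpaa` value is an empty string. The sketch below gives that group an explicit label and adds row and column totals. The label "(none)" and the decade grouping are assumptions made for readability, and the `toy` data frame is invented.

```r
# Toy stand-in for movies (values are invented for illustration only).
toy <- data.frame(
  year = c(1934, 1995, 1995, 1996, 2001, 2001),
  mpaa = c("", "R", "PG", "R", "", "PG-13"),
  stringsAsFactors = FALSE
)

# Give the empty rating an explicit label so the column header is not blank.
rating_label <- ifelse(toy$mpaa == "", "(none)", toy$mpaa)

# Group years into decades to keep the table short.
decade <- floor(toy$year / 10) * 10

xtab        <- table(decade = decade, mpaa = rating_label)
xtab_totals <- addmargins(xtab)   # appends a "Sum" row and a "Sum" column

print(xtab_totals)
```

If `mpaa` is a factor in the real data, wrapping it in `as.character()` before the `ifelse()` call keeps the labels intact.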
5.

Create a graph for a single numeric variable.

ggplot(data = movies, aes(x = rating)) +
  geom_histogram(binwidth = 0.5) +
  labs(title = "Movie Ratings", x = "Ratings", y = "Frequency")
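
To see the numbers behind a histogram with a bin width of 0.5, the counts can be computed without drawing anything. This sketch uses base R's `hist()` with `plot = FALSE` on an invented `rating` vector. The bin edges chosen here (0, 0.5, 1, ...) are an assumption; ggplot2 may place its bin edges differently, so the counts need not match the plotted bars one for one.

```r
# Invented ratings standing in for movies$rating.
rating <- c(1.0, 4.8, 5.2, 6.1, 6.1, 6.4, 7.0, 7.3, 9.9)

# Bin edges every 0.5 units across the 0 to 10 rating scale.
breaks <- seq(0, 10, by = 0.5)

# plot = FALSE returns the bin counts instead of drawing the chart.
h <- hist(rating, breaks = breaks, plot = FALSE)

bins <- data.frame(
  lower = head(h$breaks, -1),   # left edge of each bin (exclusive, except the first)
  upper = h$breaks[-1],         # right edge of each bin (inclusive)
  count = h$counts
)

print(bins[bins$count > 0, ])   # show only the non-empty bins
```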

6.

Create a scatterplot of two numeric variables.

a <- ggplot(data = movies, aes(x = budget, y = votes))
a <- a + geom_point(size = 1)
a <- a + xlab("Budget") + ylab("Votes") + ggtitle("Votes by Budget")
a
## Warning: Removed 53573 rows containing missing values (geom_point).
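
The 53573 rows in the warning match the number of missing `budget` values reported by `summary(movies)`, so the scatterplot only shows movies with a known budget. The sketch below makes that filtering explicit and counts the dropped rows before plotting. It rests on several assumptions: `toy` is an invented data frame, `plot_votes_by_budget` is a hypothetical helper, and the log scales are a suggestion for skewed data rather than part of the original plot.

```r
# Toy stand-in for movies (values are invented for illustration only).
toy <- data.frame(
  budget = c(100, NA, 300, 0, NA),
  votes  = c(5, 30, 112, 11, 7)
)

# Rows where both plotted variables are present.
has_both  <- !is.na(toy$budget) & !is.na(toy$votes)
n_dropped <- sum(!has_both)     # rows geom_point() would silently remove

# Keep complete rows; also drop zero budgets, which cannot be shown on a log scale.
plot_data <- toy[has_both & toy$budget > 0, ]

# Hypothetical helper: scatterplot on log scales (requires ggplot2 when called).
plot_votes_by_budget <- function(df) {
  ggplot2::ggplot(df, ggplot2::aes(x = budget, y = votes)) +
    ggplot2::geom_point(size = 1, alpha = 0.3) +
    ggplot2::scale_x_log10() +
    ggplot2::scale_y_log10() +
    ggplot2::labs(title = "Votes by Budget",
                  x = "Budget (log scale)", y = "Votes (log scale)")
}

print(n_dropped)
print(plot_data)
```

Calling `plot_votes_by_budget(plot_data)` draws the chart. Filtering first means the plot no longer triggers the missing-values warning, and the number of excluded movies can be stated alongside the figure.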